Abstract chunk embedder

Overview

The AbstractChunkEmbedder defines the interface for embedding systems that operate on document chunks. It transforms a list of Chunk objects with empty vectors into a list of Chunk objects with their vector fields populated, ready for indexing into a vector store.

Design Notes

Interaction Patterns

The AbstractChunkEmbedder supports a single, focused interaction pattern:

Embedding Pattern:
Accept an ordered list of Chunk objects with empty vectors
Populate each Chunk.vector with a dense float embedding
Return chunks in the same order with identity fields preserved

Implementation Requirements

Ordering Consistency: Returned chunks must preserve the original input order: index i in the output must correspond to index i in the input
Identity Preservation: document_id, chunk_index, text, and metadata must be passed through unchanged
Length Invariant: The returned list must always have the same length as the input list
Empty Input: An empty input list must return an empty list without error or side effects
Non-empty Vectors: Every returned Chunk.vector must be a non-empty float sequence
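
The requirements above can be expressed as executable checks. The following is a minimal sketch, not the library's own test suite: `Chunk` is a hypothetical dataclass stand-in that only mirrors the field names listed above, and `toy_embed` and `check_embed_contract` are illustrative names introduced here.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for the library's Chunk model (assumed fields)."""

    document_id: str
    chunk_index: int
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)
    vector: list[float] = field(default_factory=list)


def toy_embed(chunks: list[Chunk]) -> list[Chunk]:
    """Toy embedder: a 2-dimensional vector derived from the text."""
    return [
        replace(c, vector=[float(len(c.text)), float(c.text.count(" "))])
        for c in chunks
    ]


def check_embed_contract(inputs: list[Chunk], outputs: list[Chunk]) -> None:
    """Raise AssertionError if outputs break any requirement listed above."""
    # Length invariant (also covers the empty-input case).
    assert len(outputs) == len(inputs), "length invariant violated"
    for i, (src, out) in enumerate(zip(inputs, outputs)):
        # Ordering and identity: position i must carry the same chunk.
        assert out.document_id == src.document_id, f"document_id changed at {i}"
        assert out.chunk_index == src.chunk_index, f"chunk_index changed at {i}"
        assert out.text == src.text, f"text changed at {i}"
        assert out.metadata == src.metadata, f"metadata changed at {i}"
        # Every vector must be a non-empty sequence of floats.
        assert len(out.vector) > 0, f"empty vector at {i}"
        assert all(isinstance(x, float) for x in out.vector), f"non-float at {i}"
```

A check like this can be run against any implementation's output in a unit test; an embedder that reorders or drops chunks fails on the identity or length assertions.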

Pydantic Integration

AbstractChunkEmbedder inherits from Pydantic's BaseModel rather than from a plain ABC. Every implementation is therefore a Pydantic model, which enables:

Declarative field definitions with validation
PrivateAttr for non-serializable runtime objects (e.g. HTTP clients, loaded models)
model_post_init as the standard hook for post-construction initialization
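
A minimal sketch of how these three features might fit together in one implementation, assuming Pydantic v2. `Chunk` and `AbstractChunkEmbedder` below are simplified stand-ins defined only to keep the example self-contained, and `WeightedLengthEmbedder` with its `dimensions` field and `_weights` attribute is hypothetical, not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Any

from pydantic import BaseModel, Field, PrivateAttr


class Chunk(BaseModel):
    """Hypothetical stand-in for the library's Chunk model."""

    document_id: str
    chunk_index: int
    text: str
    metadata: dict[str, Any] = Field(default_factory=dict)
    vector: list[float] = Field(default_factory=list)


class AbstractChunkEmbedder(BaseModel, ABC):
    """Simplified stand-in for the interface described above."""

    @abstractmethod
    def embed(self, chunks: list[Chunk]) -> list[Chunk]:
        raise NotImplementedError


class WeightedLengthEmbedder(AbstractChunkEmbedder):
    """Hypothetical toy implementation, for illustration only."""

    # Declarative field: validated at construction time.
    dimensions: int = Field(default=8, gt=0)

    # Runtime-only state: not validated and not included in model_dump().
    _weights: list[float] = PrivateAttr(default_factory=list)

    def model_post_init(self, context: Any, /) -> None:
        # Runs after field validation. A real embedder might create an
        # HTTP client or load a model here instead.
        self._weights = [1.0 / (i + 1) for i in range(self.dimensions)]

    def embed(self, chunks: list[Chunk]) -> list[Chunk]:
        # Copy each chunk so identity fields pass through unchanged.
        return [
            c.model_copy(update={"vector": [w * len(c.text) for w in self._weights]})
            for c in chunks
        ]
```

In this sketch, constructing `WeightedLengthEmbedder(dimensions=0)` is rejected by field validation, and instantiating the abstract base directly raises `TypeError` because `embed` is abstract.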

Docstring abstract chunk embedder

abstract_chunk_embedder

Classes:

AbstractChunkEmbedder –

Interface for all chunk embedders.

AbstractChunkEmbedder


Interface for all chunk embedders.

A ChunkEmbedder transforms a list of Chunk objects (with empty vectors) into a list of Chunk objects with their vector fields populated, ready for indexing into a vector store.

Contract

embed() is the single method every implementation must provide.
Returned chunks must preserve the original ordering and identity fields (document_id, chunk_index, text, metadata).
The returned list must have the same length as the input list.
An empty input list must return an empty list without error.
Each returned Chunk.vector must be non-empty.

Implementations

OpenAICompatibleChunkEmbedder – batched embedding via any /v1/embeddings endpoint
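
The source does not show how that implementation batches its requests. As a general illustration of how batching can stay compatible with the ordering and length requirements of the contract, here is a hedged sketch: `embed_in_batches` and its `embed_batch` callable are hypothetical, with `embed_batch` standing in for whatever call sends one batch of texts to an embeddings endpoint and returns one vector per text.

```python
from typing import Callable, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    vectors: list[list[float]] = []
    # Walk the input in order; appending per batch keeps index i aligned.
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        result = embed_batch(batch)
        # Guard the length invariant before results are merged.
        if len(result) != len(batch):
            raise RuntimeError(
                f"backend returned {len(result)} vectors for {len(batch)} texts"
            )
        vectors.extend(result)
    return vectors
```

An empty input never calls `embed_batch` and returns an empty list, which matches the empty-input requirement.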

Methods:

embed –

Populate the vector field of each chunk and return the results.

embed `abstractmethod`

embed(chunks: list[Chunk]) -> list[Chunk]

Populate the vector field of each chunk and return the results.

Parameters

chunks: Ordered list of Chunk objects whose vector fields are expected to be empty. May be empty, in which case an empty list is returned.

Returns

list[Chunk] Flat, ordered list of chunks with vector populated. Index i in the output corresponds to index i in chunks.

Source code in src/database_builder_libs/models/abstract_chunk_embedder.py

@abstractmethod
def embed(self, chunks: list[Chunk]) -> list[Chunk]:
    """
    Populate the ``vector`` field of each chunk and return the results.

    Parameters
    ----------
    chunks:
        Ordered list of ``Chunk`` objects whose ``vector`` fields are
        expected to be empty.  May be empty, in which case an empty list
        is returned.

    Returns
    -------
    list[Chunk]
        Flat, ordered list of chunks with ``vector`` populated.  Index *i*
        in the output corresponds to index *i* in *chunks*.
    """
    raise NotImplementedError