Abstract chunk embedder¶
Overview¶
The AbstractChunkEmbedder defines the interface for embedding systems that operate on document chunks. It transforms a list of Chunk objects with empty vectors into a list of Chunk objects with their vector fields populated, ready for indexing into a vector store.
Design Notes¶
Interaction Patterns¶
The AbstractChunkEmbedder supports a single, focused interaction pattern:
- Embedding Pattern:
    - Accept an ordered list of Chunk objects with empty vectors
    - Populate each Chunk.vector with a dense float embedding
    - Return chunks in the same order with identity fields preserved
Implementation Requirements¶
- Ordering Consistency: Returned chunks must preserve the original input order — index i in the output must correspond to index i in the input
- Identity Preservation: document_id, chunk_index, text, and metadata must be passed through unchanged
- Length Invariant: The returned list must always have the same length as the input list
- Empty Input: An empty input list must return an empty list without error or side effects
- Non-empty Vectors: Every returned Chunk.vector must be a non-empty float sequence
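The requirements above can be checked mechanically, for example in a test suite for a new embedder. The following is a minimal sketch, not part of the library: `DemoChunk` is a hypothetical stand-in whose field names follow the requirements above (the field types are assumptions), and `check_embed_contract` is an illustrative helper name.

```python
from dataclasses import dataclass, field
from typing import Any, Sequence


@dataclass
class DemoChunk:
    """Hypothetical stand-in for the library's Chunk (field types are assumed)."""
    document_id: str
    chunk_index: int
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)
    vector: list[float] = field(default_factory=list)


IDENTITY_FIELDS = ("document_id", "chunk_index", "text", "metadata")


def check_embed_contract(inputs: Sequence[DemoChunk], outputs: Sequence[DemoChunk]) -> None:
    """Raise AssertionError if outputs violate the embedder requirements."""
    # Length invariant; this also covers the empty-input case.
    if len(outputs) != len(inputs):
        raise AssertionError(f"expected {len(inputs)} chunks, got {len(outputs)}")
    for i, (before, after) in enumerate(zip(inputs, outputs)):
        # Ordering and identity: position i must carry the same identity fields.
        for name in IDENTITY_FIELDS:
            if getattr(before, name) != getattr(after, name):
                raise AssertionError(f"chunk {i}: field {name!r} changed")
        # Non-empty float vector.
        if len(after.vector) == 0:
            raise AssertionError(f"chunk {i}: vector is empty")
        if not all(isinstance(v, float) for v in after.vector):
            raise AssertionError(f"chunk {i}: vector contains non-float values")
```

Calling `check_embed_contract(chunks, embedder.embed(chunks))` in a unit test would then flag reordered, dropped, or unpopulated chunks.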
Pydantic Integration¶
AbstractChunkEmbedder inherits from BaseModel rather than plain ABC. This enforces that all implementations are Pydantic models, enabling:
- Declarative field definitions with validation
- PrivateAttr for non-serializable runtime objects (e.g. HTTP clients, loaded models)
- model_post_init as the standard hook for post-construction initialization
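As a rough illustration of these three points, the sketch below assumes Pydantic v2. `ExampleEmbedder`, `FakeClient`, and the `base_url` and `batch_size` fields are hypothetical names chosen for the example; they are not taken from the library.

```python
from typing import Any, Optional

from pydantic import BaseModel, Field, PrivateAttr


class FakeClient:
    """Hypothetical stand-in for a non-serializable runtime object (e.g. an HTTP client)."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url


class ExampleEmbedder(BaseModel):
    # Declarative, validated configuration fields (names are assumptions).
    base_url: str
    batch_size: int = Field(default=32, gt=0)

    # Runtime-only state: not validated and not included in model_dump().
    _client: Optional[FakeClient] = PrivateAttr(default=None)

    def model_post_init(self, context: Any, /) -> None:
        # Runs once after field validation; a natural place to build clients.
        self._client = FakeClient(self.base_url)


embedder = ExampleEmbedder(base_url="http://localhost:8000/v1")
config = embedder.model_dump()  # contains only base_url and batch_size
```

Keeping the client in a private attribute means the model can still be serialized or compared by its configuration alone, while an invalid value such as `batch_size=0` is rejected at construction time.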
Docstring abstract chunk embedder¶
abstract_chunk_embedder¶
Classes:
- AbstractChunkEmbedder – Interface for all chunk embedders.
AbstractChunkEmbedder¶
Interface for all chunk embedders.
A ChunkEmbedder transforms a list of Chunk objects (with empty vectors)
into a list of Chunk objects with their vector fields populated,
ready for indexing into a vector store.
Contract¶
- embed() is the single method every implementation must provide.
- Returned chunks must preserve the original ordering and identity fields (document_id, chunk_index, text, metadata).
- The returned list must have the same length as the input list.
- An empty input list must return an empty list without error.
- Each returned Chunk.vector must be non-empty.
Implementations¶
- OpenAICompatibleChunkEmbedder – batched embedding via any /v1/embeddings endpoint
Methods:
- embed – Populate the vector field of each chunk and return the results.
embed (abstractmethod)¶
Populate the vector field of each chunk and return the results.
Parameters¶
chunks:
Ordered list of Chunk objects whose vector fields are
expected to be empty. May be empty, in which case an empty list
is returned.
Returns¶
list[Chunk]
Flat, ordered list of chunks with vector populated. Index i
in the output corresponds to index i in chunks.
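To make the embed() contract concrete, here is a toy implementation sketch, assuming Pydantic v2. The `Chunk` and `AbstractChunkEmbedder` classes below are local stand-ins written for this example (field types are assumptions), and `HashChunkEmbedder` is a hypothetical embedder that derives a deterministic vector from a SHA-256 digest. It is meant for tests and illustration, not as a semantically meaningful embedding.

```python
import hashlib
from abc import abstractmethod
from typing import Any

from pydantic import BaseModel, Field


class Chunk(BaseModel):
    """Local stand-in for the library's Chunk (field types are assumed)."""
    document_id: str
    chunk_index: int
    text: str
    metadata: dict[str, Any] = Field(default_factory=dict)
    vector: list[float] = Field(default_factory=list)


class AbstractChunkEmbedder(BaseModel):
    """Local stand-in for the interface described above."""

    @abstractmethod
    def embed(self, chunks: list[Chunk]) -> list[Chunk]:
        ...


class HashChunkEmbedder(AbstractChunkEmbedder):
    """Hypothetical toy embedder: one float per digest byte, scaled to [0, 1]."""

    # A SHA-256 digest has 32 bytes, so at most 32 dimensions are available.
    dimensions: int = Field(default=8, gt=0, le=32)

    def embed(self, chunks: list[Chunk]) -> list[Chunk]:
        embedded = []
        for chunk in chunks:  # iterate in input order so output order matches
            digest = hashlib.sha256(chunk.text.encode("utf-8")).digest()
            vector = [byte / 255.0 for byte in digest[: self.dimensions]]
            # Copy the chunk so identity fields pass through unchanged.
            embedded.append(chunk.model_copy(update={"vector": vector}))
        return embedded


chunks = [
    Chunk(document_id="doc-1", chunk_index=i, text=text)
    for i, text in enumerate(["first chunk", "second chunk"])
]
embedded = HashChunkEmbedder().embed(chunks)
```

Because the loop appends one copy per input chunk, the output has the same length and order as the input, and an empty input yields an empty list. Returning copies rather than mutating the inputs is a design choice of this sketch; the contract above does not say which approach implementations take.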
Source code in src/database_builder_libs/models/abstract_chunk_embedder.py