Abstract chunk embedder

Overview

The AbstractChunkEmbedder defines the interface for embedding systems that operate on document chunks. It transforms a list of Chunk objects with empty vectors into a list of Chunk objects with their vector fields populated, ready for indexing into a vector store.

Design Notes

Interaction Patterns

The AbstractChunkEmbedder supports a single, focused interaction pattern:

  Embedding Pattern:

  1. Accept an ordered list of Chunk objects with empty vectors
  2. Populate each Chunk.vector with a dense float embedding
  3. Return chunks in the same order with identity fields preserved

Implementation Requirements

  • Ordering Consistency: Returned chunks must preserve the original input order — index i in the output must correspond to index i in the input
  • Identity Preservation: document_id, chunk_index, text, and metadata must be passed through unchanged
  • Length Invariant: The returned list must always have the same length as the input list
  • Empty Input: An empty input list must return an empty list without error or side effects
  • Non-empty Vectors: Every returned Chunk.vector must be a non-empty float sequence
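The requirements above can be expressed as a small check that a test suite might run against any implementation. This is a minimal sketch using only the standard library: the Chunk dataclass is a hypothetical stand-in whose fields are assumed from the names on this page, and check_embed_contract is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass, field
from typing import Any, Sequence


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's Chunk model (fields assumed)."""
    document_id: str
    chunk_index: int
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)
    vector: list[float] = field(default_factory=list)


def check_embed_contract(inputs: Sequence[Chunk], outputs: Sequence[Chunk]) -> None:
    """Raise ValueError if outputs break any requirement listed above."""
    # Length invariant; this also covers the empty-input case.
    if len(outputs) != len(inputs):
        raise ValueError(f"expected {len(inputs)} chunks, got {len(outputs)}")

    for i, (src, out) in enumerate(zip(inputs, outputs)):
        # Ordering and identity: position i must carry the same identity fields.
        for name in ("document_id", "chunk_index", "text", "metadata"):
            if getattr(out, name) != getattr(src, name):
                raise ValueError(f"{name} changed at index {i}")
        # Non-empty float vector.
        if not out.vector or not all(isinstance(x, float) for x in out.vector):
            raise ValueError(f"vector at index {i} is empty or not all floats")
```

Calling check_embed_contract(chunks, embedder.embed(chunks)) in a unit test would exercise all five requirements in one place.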

Pydantic Integration

AbstractChunkEmbedder inherits from BaseModel rather than plain ABC. This enforces that all implementations are Pydantic models, enabling:

  • Declarative field definitions with validation
  • PrivateAttr for non-serializable runtime objects (e.g. HTTP clients, loaded models)
  • model_post_init as the standard hook for post-construction initialization
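The three points above might look like this in a concrete subclass. This is a sketch under stated assumptions: it targets Pydantic v2, the Chunk and AbstractChunkEmbedder classes are simplified stand-ins for the library's own, and HashingChunkEmbedder is a made-up toy that derives vectors from a SHA-256 digest rather than a real embedding model.

```python
import hashlib
from abc import abstractmethod
from typing import Any, Callable

from pydantic import BaseModel, Field, PrivateAttr


class Chunk(BaseModel):
    """Hypothetical stand-in for the library's Chunk model (fields assumed)."""
    document_id: str
    chunk_index: int
    text: str
    metadata: dict[str, Any] = Field(default_factory=dict)
    vector: list[float] = Field(default_factory=list)


class AbstractChunkEmbedder(BaseModel):
    """Simplified stand-in for the interface described on this page."""

    @abstractmethod
    def embed(self, chunks: list[Chunk]) -> list[Chunk]:
        raise NotImplementedError


class HashingChunkEmbedder(AbstractChunkEmbedder):
    """Toy embedder (hypothetical): deterministic pseudo-vectors from a hash."""

    # Declarative field with validation: must be a positive integer.
    dimensions: int = Field(default=8, gt=0)

    # Runtime-only object; private attributes are excluded from model_dump().
    _encode: Callable[[str], list[float]] = PrivateAttr()

    def model_post_init(self, context: Any, /) -> None:
        # Build the runtime object once the validated fields are available.
        def encode(text: str) -> list[float]:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            return [digest[i % len(digest)] / 255.0 for i in range(self.dimensions)]

        self._encode = encode

    def embed(self, chunks: list[Chunk]) -> list[Chunk]:
        # Copy each chunk so identity fields pass through unchanged.
        return [c.model_copy(update={"vector": self._encode(c.text)}) for c in chunks]
```

In a real implementation the private attribute would more likely hold an HTTP client or a loaded model; the callable here only keeps the sketch free of network and model dependencies.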

Docstring abstract chunk embedder

abstract_chunk_embedder

Classes:

AbstractChunkEmbedder


Interface for all chunk embedders.

A ChunkEmbedder transforms a list of Chunk objects (with empty vectors) into a list of Chunk objects with their vector fields populated, ready for indexing into a vector store.

Contract

  • embed() is the single method every implementation must provide.
  • Returned chunks must preserve the original ordering and identity fields (document_id, chunk_index, text, metadata).
  • The returned list must have the same length as the input list.
  • An empty input list must return an empty list without error.
  • Each returned Chunk.vector must be non-empty.

Implementations

  • OpenAICompatibleChunkEmbedder: batched embedding via any /v1/embeddings endpoint
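Batching interacts with the ordering requirement: if texts are sent in groups, the per-batch results have to be concatenated in request order. The sketch below shows one way that could be done. It does not reproduce how OpenAICompatibleChunkEmbedder works internally; embed_texts_in_batches and the embed_batch callable are hypothetical names, with embed_batch standing in for whatever performs the actual request.

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items, in order."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_texts_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[Vector]],  # hypothetical backend call
    batch_size: int = 2,
) -> list[Vector]:
    """Embed texts batch by batch, keeping output index i aligned with input index i."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    vectors: list[Vector] = []
    for batch in batched(texts, batch_size):
        result = embed_batch(batch)
        # Guard the length invariant per batch so misalignment fails early.
        if len(result) != len(batch):
            raise ValueError("backend returned a different number of vectors")
        vectors.extend(result)
    return vectors
```

An empty input never enters the loop, so the function returns an empty list without calling the backend, which matches the empty-input rule in the contract.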

Methods:

  • embed

    Populate the vector field of each chunk and return the results.

embed abstractmethod

embed(chunks: list[Chunk]) -> list[Chunk]

Populate the vector field of each chunk and return the results.

Parameters

chunks: Ordered list of Chunk objects whose vector fields are expected to be empty. May be empty, in which case an empty list is returned.

Returns

list[Chunk]: Flat, ordered list of chunks with vector populated. Index i in the output corresponds to index i in chunks.

Source code in src/database_builder_libs/models/abstract_chunk_embedder.py
@abstractmethod
def embed(self, chunks: list[Chunk]) -> list[Chunk]:
    """
    Populate the ``vector`` field of each chunk and return the results.

    Parameters
    ----------
    chunks:
        Ordered list of ``Chunk`` objects whose ``vector`` fields are
        expected to be empty.  May be empty, in which case an empty list
        is returned.

    Returns
    -------
    list[Chunk]
        Flat, ordered list of chunks with ``vector`` populated.  Index *i*
        in the output corresponds to index *i* in *chunks*.
    """
    raise NotImplementedError