
Transformer-based chunk embedder

Overview

The TransformersChunkEmbedder embeds document chunks locally using any HuggingFace Transformers model. It runs inference directly on the host machine via transformers and torch, making it suitable for air-gapped environments or deployments where sending data to an external API is not acceptable.

Vectors are produced by mean-pooling the last hidden state over the token dimension, weighted by the attention mask.

Configuration

The embedder is configured via Pydantic fields passed at construction:

embedder = TransformersChunkEmbedder(
    model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
    device="cuda",      # Optional, defaults to "cuda" if available else "cpu"
    max_length=512,     # Optional, default 512 tokens
    batch_size=32,      # Optional, default 32 chunks per forward pass
)

Usage and Practical Examples

Basic Embedding

from database_builder_libs.models.chunk import Chunk
from database_builder_libs.utility.embed_chunk.transformer_based import TransformersChunkEmbedder

embedder = TransformersChunkEmbedder(
    model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
)

chunks = [
    Chunk(document_id="doc-1", chunk_index=0, text="Hello world", vector=[], metadata={}),
    Chunk(document_id="doc-1", chunk_index=1, text="Second chunk", vector=[], metadata={}),
]

embedded = embedder.embed(chunks)
# Each chunk now has its vector field populated
print(embedded[0].vector)  # [0.023, -0.117, ...]

Using a Local Model

embedder = TransformersChunkEmbedder(
    model_name_or_path="/models/bge-m3",
    device="cuda",
)

Using with CPU Only

embedder = TransformersChunkEmbedder(
    model_name_or_path="BAAI/bge-small-en-v1.5",
    device="cpu",
    batch_size=8,  # Reduce batch size on CPU to avoid memory pressure
)

Empty Input

# Empty input is handled gracefully without running any inference
result = embedder.embed([])
assert result == []

Behavior and Edge Cases

Batching

The embedder processes chunks in batches of batch_size, running one forward pass per batch. Reduce batch_size if running into GPU out-of-memory errors on large documents:

# 5 chunks with batch_size=2 produces 3 forward passes
embedder = TransformersChunkEmbedder(
    model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
    device="cuda",
    batch_size=2,
)
embedded = embedder.embed(chunks)  # 3 forward passes
assert len(embedded) == 5
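The batching rule above can be sketched in plain Python (the helper name `batches` is illustrative, not part of the library's API): chunks are processed in consecutive slices of batch_size, one forward pass per slice.

```python
# Sketch of batch slicing: consecutive slices of batch_size,
# one forward pass per slice.
def batches(items, batch_size):
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

chunk_texts = ["a", "b", "c", "d", "e"]
print(len(batches(chunk_texts, 2)))  # 3 forward passes for 5 chunks
```

Halving batch_size roughly halves peak GPU memory per forward pass, at the cost of more passes.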

Truncation

Sequences longer than max_length tokens are silently truncated by the tokenizer. Increase max_length for models that support longer contexts:

embedder = TransformersChunkEmbedder(
    model_name_or_path="jinaai/jina-embeddings-v2-base-en",
    max_length=8192,
)
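To make the silent-truncation behavior concrete, here is a toy sketch (the function `tokenize_truncated` is hypothetical; the real tokenizer does this internally when truncation is enabled): token ids beyond max_length are simply dropped, with no warning or error.

```python
# Toy illustration of silent truncation: token ids beyond max_length
# are dropped without any warning, mirroring the tokenizer's behavior.
def tokenize_truncated(token_ids, max_length):
    return token_ids[:max_length]

long_sequence = list(range(600))
kept = tokenize_truncated(long_sequence, 512)
print(len(kept), len(long_sequence) - len(kept))  # 512 tokens kept, 88 dropped
```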

Device Selection

The device field defaults to "cuda" if a GPU is available, otherwise "cpu". Override explicitly for multi-GPU setups or to force CPU inference:

# Force a specific GPU
embedder = TransformersChunkEmbedder(
    model_name_or_path="BAAI/bge-m3",
    device="cuda:1",
)
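The default-device rule described above amounts to the following (a sketch only; the actual field default presumably checks torch.cuda.is_available() at construction time):

```python
# Sketch of the default-device rule: prefer CUDA when available.
def default_device(cuda_available: bool) -> str:
    return "cuda" if cuda_available else "cpu"

print(default_device(True), default_device(False))  # cuda cpu
```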

Mean Pooling

The embedding for each chunk is computed by mean-pooling the last hidden state across the token dimension, weighted by the attention mask. Padding tokens are excluded from the average:

# Longer and shorter chunks in the same batch are handled correctly —
# padding tokens do not contribute to the embedding
embedded = embedder.embed([short_chunk, long_chunk])
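As an illustration of the pooling rule, here is masked mean pooling over a toy last hidden state (plain Python lists stand in for the batched torch tensors the real implementation operates on; `mean_pool` is a sketch, not the library's function):

```python
# Toy masked mean pooling: hidden_states is (tokens x dims),
# attention_mask marks real tokens (1) vs padding (0).
def mean_pool(hidden_states, attention_mask):
    dims = len(hidden_states[0])
    sums = [0.0] * dims
    count = 0
    for token_vec, mask in zip(hidden_states, attention_mask):
        if mask:  # padding tokens (mask == 0) are skipped entirely
            count += 1
            for d in range(dims):
                sums[d] += token_vec[d]
    return [s / count for s in sums]

# Two real tokens plus one padding token; padding does not shift the mean.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(hidden, mask))  # [2.0, 3.0]
```

Because the sum is divided by the count of unmasked tokens only, short sequences padded out to the batch length produce the same vector they would alone.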

Docstring

transformer_based

Classes:

TransformersChunkEmbedder


              flowchart TD
              database_builder_libs.utility.embed_chunk.transformer_based.TransformersChunkEmbedder[TransformersChunkEmbedder]
              database_builder_libs.models.abstract_chunk_embedder.AbstractChunkEmbedder[AbstractChunkEmbedder]

                              database_builder_libs.models.abstract_chunk_embedder.AbstractChunkEmbedder --> database_builder_libs.utility.embed_chunk.transformer_based.TransformersChunkEmbedder
                


              click database_builder_libs.utility.embed_chunk.transformer_based.TransformersChunkEmbedder href "" "database_builder_libs.utility.embed_chunk.transformer_based.TransformersChunkEmbedder"
              click database_builder_libs.models.abstract_chunk_embedder.AbstractChunkEmbedder href "" "database_builder_libs.models.abstract_chunk_embedder.AbstractChunkEmbedder"
            

Chunk embedder backed by any HuggingFace Transformers model.

Runs inference locally using transformers and torch. The vector for each chunk is produced by mean-pooling the last hidden state over the token dimension, respecting the attention mask.

Attributes

model_name_or_path: HuggingFace model identifier or local path, e.g. "sentence-transformers/all-MiniLM-L6-v2" or "/models/bge-m3".

device: Torch device to run inference on. Defaults to "cuda" if available, otherwise "cpu".

max_length: Maximum token length passed to the tokenizer. Sequences longer than this are truncated.

batch_size: Number of chunks to encode in a single forward pass. Reduce if running into GPU memory errors.