
Abstract vector store

Overview

The AbstractVectorStore defines the interface for semantic retrieval storage systems. It manages document chunks with their embeddings and enables nearest-neighbor similarity search across various vector database backends (FAISS, Qdrant, Pinecone, etc.).

Design Notes

Interaction Patterns

The AbstractVectorStore supports three main interaction patterns:

  1. Indexing Pattern:

    • Store chunks with embeddings
    • Overwrite existing chunks with same identity
    • Validate embedding dimensionality
  2. Semantic Search Pattern:

    • Accept query vector
    • Find k-nearest neighbors by similarity
    • Return chunks ordered by relevance (without embeddings)
  3. Document Management Pattern:

    • Retrieve all chunks for a document
    • Delete all vectors for GDPR compliance
    • Maintain document-level consistency

Implementation Requirements

  • Embedding Consistency: All vectors in an index must use the same model and dimensionality
  • Distance Metric: Implementations must specify and validate the distance metric (cosine, euclidean, etc.)
  • Batch Operations: Support efficient batch insertion for large documents
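The three interaction patterns and the requirements above can be sketched with a toy in-memory store. Names such as `ToyVectorStore`, `store`, and `search` are illustrative only, not part of the real interface:

```python
class ToyVectorStore:
    """Toy in-memory store illustrating the three interaction patterns."""

    def __init__(self, dim: int):
        self.dim = dim
        # (document_id, chunk_index) -> (text, vector)
        self._rows: dict[tuple[str, int], tuple[str, list[float]]] = {}

    # 1. Indexing pattern: upsert with dimensionality validation.
    def store(self, document_id: str, chunk_index: int,
              text: str, vector: list[float]) -> None:
        if len(vector) != self.dim:
            raise ValueError(f"expected dim {self.dim}, got {len(vector)}")
        self._rows[(document_id, chunk_index)] = (text, list(vector))

    # 2. Semantic search pattern: k nearest by squared euclidean distance.
    def search(self, query: list[float], limit: int = 10):
        def dist2(v: list[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(query, v))
        ranked = sorted(self._rows.items(), key=lambda kv: dist2(kv[1][1]))
        return [(key, text) for key, (text, _) in ranked[:limit]]

    # 3. Document management pattern: ordered retrieval and full deletion.
    def document_chunks(self, document_id: str):
        keys = sorted(k for k in self._rows if k[0] == document_id)
        return [(k, self._rows[k][0]) for k in keys]

    def delete_document(self, document_id: str) -> int:
        keys = [k for k in self._rows if k[0] == document_id]
        for k in keys:
            del self._rows[k]
        return len(keys)  # number of deleted chunks
```

A real backend replaces the brute-force scan with an ANN index, but the upsert key, the ordering of results, and the full-document deletion are exactly the contracts this interface demands.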

Docstring abstract chunk

chunk

Classes:

  • Chunk

    Atomic retrievable semantic unit of a document.

Chunk dataclass

Chunk(document_id: DocumentId, chunk_index: int, text: str, vector: Sequence[float], metadata: Mapping[str, Any] | None = None)

Atomic retrievable semantic unit of a document.

A Chunk represents a deterministic fragment of a source document that can be independently embedded, indexed, retrieved and deleted.

Identity

The unique identity of a chunk is defined by the pair: (document_id, chunk_index)

This identity MUST remain stable across re-indexing runs as long as the underlying document content has not changed in that region.

Fields

document_id : DocumentId
Stable identifier of the parent document. Must remain constant across synchronization cycles.

chunk_index : int
Monotonically increasing positional index within the document. Defines ordering and chunk identity. Must not depend on runtime factors (timestamps, randomness, hashing).

text : str
Exact textual content used to generate the embedding. Any change to this text requires overwriting the stored vector.

vector : Sequence[float]
Embedding vector representing the semantic meaning of text.

Requirements:

  • Fixed dimensionality across the entire index
  • Generated by a single embedding model
  • Deterministic for identical input text

Docstring abstract store

abstract_vector_store

Classes:

AbstractVectorStore

AbstractVectorStore()


Semantic retrieval storage for embedding-based search.

The store persists document chunks together with their embeddings and supports nearest-neighbour semantic retrieval.

The interface is designed to be backend-agnostic and compatible with: FAISS, Qdrant, Pinecone, pgvector, Weaviate, Elasticsearch, etc.

Consistency guarantees

Implementations MUST guarantee:

  • Deterministic retrieval for identical index state
  • No duplicate chunks returned
  • Stable chunk identity across writes
  • Full deletion of document vectors (GDPR requirement)

Embedding contract

All stored vectors must:

  • Have identical dimensionality
  • Use the same distance metric
  • Be normalized if required by the backend

Mixing embedding models in one index is forbidden.
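A sketch of enforcing this contract at write time. The `check_vector` helper is hypothetical, not part of the interface:

```python
import math
from typing import Sequence

def check_vector(vector: Sequence[float], expected_dim: int,
                 normalize: bool = False) -> list[float]:
    """Enforce the embedding contract: fixed dimensionality, optional L2 norm."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected dim {expected_dim}, got {len(vector)}")
    if normalize:
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vector]
    return list(vector)
```

An implementation would typically call such a check in `store_chunks()` and in `similarity_search()` so that a mismatched vector fails fast with `ValueError` rather than corrupting the index.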

Methods:

Source code in src/database_builder_libs/models/abstract_vector_store.py
def __init__(self) -> None:
    self._connected: bool = False
    self._connecting: bool = False

connect

connect(config: dict | None = None) -> None

Initialize the vector index and verify accessibility.

This method should:

  • Create index if missing
  • Validate embedding dimensionality
  • Validate distance metric compatibility

Raises

ConnectionError
Backend unreachable.

RuntimeError
Index exists but is incompatible.

Source code in src/database_builder_libs/models/abstract_vector_store.py
def connect(self, config: dict | None = None) -> None:
    """
    Initialize the vector index and verify accessibility.

    This method should:
    - Create index if missing
    - Validate embedding dimensionality
    - Validate distance metric compatibility

    Raises
    ------
    ConnectionError
        Backend unreachable.
    RuntimeError
        Index exists but is incompatible.
    """
    if self._connected:
        return

    self._connecting = True
    try:
        self._connect_impl(config)
        self._connected = True
    finally:
        self._connecting = False

delete_document abstractmethod

delete_document(document_id: DocumentId) -> int

Permanently remove all vectors belonging to a document.

This operation must be irreversible and guarantee that the document cannot appear in future search results.

Parameters

document_id : DocumentId
Identifier of the document to delete.

Returns

int
Number of deleted chunks.

GDPR Requirement

After successful deletion, similarity_search() MUST NOT return any chunk originating from this document.

Raises

RuntimeError
If deletion could not be fully verified.

Source code in src/database_builder_libs/models/abstract_vector_store.py
@abstractmethod
def delete_document(self, document_id: DocumentId) -> int:
    """
    Permanently remove all vectors belonging to a document.

    This operation must be irreversible and guarantee that the
    document cannot appear in future search results.

    Parameters
    ----------
    document_id : DocumentId
        Identifier of the document to delete.

    Returns
    -------
    int
        Number of deleted chunks.

    GDPR Requirement
    ----------------
    After successful deletion, similarity_search() MUST NOT return
    any chunk originating from this document.

    Raises
    ------
    RuntimeError
        If deletion could not be fully verified.
    """
    raise NotImplementedError

get_document_chunks abstractmethod

get_document_chunks(document_id: DocumentId) -> List[Chunk]

Retrieve all chunks belonging to a document.

Returns

List[Chunk]
All chunks for the document, ordered by their position in the original document.

Raises

KeyError
If document does not exist.

RuntimeError
If store not connected.

Source code in src/database_builder_libs/models/abstract_vector_store.py
@abstractmethod
def get_document_chunks(self, document_id: DocumentId) -> List[Chunk]:
    """
    Retrieve all chunks belonging to a document.

    Returns
    -------
    List[Chunk]
        All chunks for the document ordered by original document order.

    Raises
    ------
    KeyError
        If document does not exist.
    RuntimeError
        If store not connected.
    """

    raise NotImplementedError

similarity_search abstractmethod

similarity_search(vector: Sequence[float], limit: int = 10) -> List[Chunk]

Perform nearest-neighbour semantic search.

Parameters

vector : Sequence[float]
Query embedding. Must match index dimensionality.

limit : int
Maximum number of results to return.

Returns

List[Chunk]
Ordered by similarity descending (most relevant first).

Guarantees

  • At most limit results returned
  • No duplicate chunks
  • Ordering must reflect backend similarity score

Raises

ValueError
On vector dimensionality mismatch.

RuntimeError
If store not connected.

Source code in src/database_builder_libs/models/abstract_vector_store.py
@abstractmethod
def similarity_search(
    self,
    vector: Sequence[float],
    limit: int = 10,
) -> List[Chunk]:
    """
    Perform nearest-neighbour semantic search.

    Parameters
    ----------
    vector : Sequence[float]
        Query embedding. Must match index dimensionality.
    limit : int
        Maximum number of results to return.

    Returns
    -------
    List[Chunk]
        Ordered by similarity descending (most relevant first).

    Guarantees
    ----------
    - At most `limit` results returned
    - No duplicate chunks
    - Ordering must reflect backend similarity score

    Raises
    ------
    ValueError
        If vector dimensionality mismatch.
    RuntimeError
        If store not connected.
    """
    raise NotImplementedError
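
For illustration, brute-force cosine ranking reproduces the ordering guarantee; real backends use approximate nearest-neighbour indexes instead of a full scan. `top_k_cosine` is a hypothetical helper, not part of the interface:

```python
import math
from typing import List, Sequence, Tuple

def top_k_cosine(query: Sequence[float],
                 indexed: List[Tuple[str, Sequence[float]]],
                 limit: int = 10) -> List[str]:
    """Return ids of the `limit` most similar vectors, best first."""
    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    # Sort by similarity descending, then truncate to at most `limit` hits.
    ranked = sorted(indexed, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:limit]]

ids = top_k_cosine([1.0, 0.0],
                   [("a", [1.0, 0.1]), ("b", [0.0, 1.0]), ("c", [0.9, 0.0])],
                   limit=2)
print(ids)  # ['c', 'a']
```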

store_chunks abstractmethod

store_chunks(chunks: List[Chunk]) -> None

Insert or update chunks and their embeddings.

Behaviour

  • Operation must be idempotent
  • Existing chunks with the same (document_id, chunk_index) MUST be overwritten
  • Partial document updates are allowed

Parameters

chunks : List[Chunk]
Chunks containing text, metadata, and embedding vector.

Raises

RuntimeError
If called before connect().

ValueError
If embedding dimensionality mismatch occurs.

Source code in src/database_builder_libs/models/abstract_vector_store.py
@abstractmethod
def store_chunks(self, chunks: List[Chunk]) -> None:
    """
    Insert or update chunks and their embeddings.

    Behaviour
    ---------
    - Operation must be idempotent
    - Existing chunks with same (document_id, chunk_index) MUST be overwritten
    - Partial document updates are allowed

    Parameters
    ----------
    chunks : List[Chunk]
        Chunks containing text, metadata, and embedding vector.

    Raises
    ------
    RuntimeError
        If called before connect().
    ValueError
        If embedding dimensionality mismatch occurs.
    """
    raise NotImplementedError
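
The upsert semantics reduce to keying storage on the chunk identity. A minimal sketch, where the dict-based `store_chunks` is illustrative only:

```python
def store_chunks(index: dict, chunks) -> None:
    """Upsert keyed on (document_id, chunk_index): latest write wins."""
    for document_id, chunk_index, text, vector in chunks:
        index[(document_id, chunk_index)] = (text, vector)

index: dict = {}
store_chunks(index, [("doc-1", 0, "old text", [0.1])])
store_chunks(index, [("doc-1", 0, "new text", [0.2])])  # overwrite, no duplicate
print(len(index), index[("doc-1", 0)][0])  # 1 new text
```

Because the key is the stable identity, replaying the same batch is a no-op and re-indexing a changed chunk cleanly replaces its stale vector, which is exactly the idempotency the contract requires.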