
Abstract vector store

Overview

The AbstractVectorStore defines the interface for semantic retrieval storage systems. It manages document chunks with their embeddings and enables nearest-neighbor similarity search across various vector database backends (FAISS, Qdrant, Pinecone, etc.).

Design Notes

Interaction Patterns

The AbstractVectorStore supports three main interaction patterns:

  1. Indexing Pattern:

    • Store chunks with embeddings
    • Overwrite existing chunks with same identity
    • Validate embedding dimensionality
  2. Semantic Search Pattern:

    • Accept query vector
    • Find k-nearest neighbors by similarity
    • Return chunks ordered by relevance (without embeddings)
  3. Document Management Pattern:

    • Retrieve all chunks for a document
    • Delete all vectors for GDPR compliance
    • Maintain document-level consistency

Implementation Requirements

  • Embedding Consistency: All vectors in an index must use the same model and dimensionality
  • Distance Metric: Implementations must specify and validate the distance metric (cosine, euclidean, etc.)
  • Batch Operations: Support efficient batch insertion for large documents
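The three interaction patterns and the requirements above can be sketched with a toy in-memory store. Names such as `ToyVectorStore`, `store`, and `search` are illustrative only, not part of the real interface:

```python
class ToyVectorStore:
    """Toy in-memory store illustrating the three interaction patterns."""

    def __init__(self, dim: int):
        self.dim = dim
        # (document_id, chunk_index) -> (text, vector)
        self._rows: dict[tuple[str, int], tuple[str, list[float]]] = {}

    # 1. Indexing pattern: upsert with dimensionality validation.
    def store(self, document_id: str, chunk_index: int,
              text: str, vector: list[float]) -> None:
        if len(vector) != self.dim:
            raise ValueError(f"expected dim {self.dim}, got {len(vector)}")
        self._rows[(document_id, chunk_index)] = (text, list(vector))

    # 2. Semantic search pattern: k nearest by squared euclidean distance.
    def search(self, query: list[float], limit: int = 10):
        def dist2(v: list[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(query, v))
        ranked = sorted(self._rows.items(), key=lambda kv: dist2(kv[1][1]))
        return [(key, text) for key, (text, _) in ranked[:limit]]

    # 3. Document management pattern: ordered retrieval and full deletion.
    def document_chunks(self, document_id: str):
        keys = sorted(k for k in self._rows if k[0] == document_id)
        return [(k, self._rows[k][0]) for k in keys]

    def delete_document(self, document_id: str) -> int:
        keys = [k for k in self._rows if k[0] == document_id]
        for k in keys:
            del self._rows[k]
        return len(keys)  # number of deleted chunks
```

A real backend replaces the brute-force scan with an ANN index, but the upsert key, the ordering of results, and the full-document deletion are exactly the contracts this interface demands.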

Docstring abstract chunk

chunk

Classes:

  • Chunk

    Atomic retrievable semantic unit of a document.

Chunk dataclass

Chunk(document_id: DocumentId, chunk_index: int, text: str, vector: Sequence[float], metadata: Mapping[str, Any] | None = None)

Atomic retrievable semantic unit of a document.

A Chunk represents a deterministic fragment of a source document that can be independently embedded, indexed, retrieved and deleted.

Identity

The unique identity of a chunk is defined by the pair: (document_id, chunk_index)

This identity MUST remain stable across re-indexing runs as long as the underlying document content has not changed in that region.

Fields

document_id : DocumentId
Stable identifier of the parent document. Must remain constant across synchronization cycles.

chunk_index : int
Monotonically increasing positional index within the document. Defines ordering and chunk identity. Must not depend on runtime factors (timestamps, randomness, hashing).

text : str
Exact textual content used to generate the embedding. Any change to this text requires overwriting the stored vector.

vector : Sequence[float]
Embedding vector representing the semantic meaning of text.

Requirements:

  • Fixed dimensionality across the entire index
  • Generated by a single embedding model
  • Deterministic for identical input text

Docstring abstract store

abstract_vector_store

Classes:

AbstractVectorStore

AbstractVectorStore()


Semantic retrieval storage for embedding-based search.

The store persists document chunks together with their embeddings and supports nearest-neighbour semantic retrieval.

The interface is designed to be backend-agnostic and compatible with: FAISS, Qdrant, Pinecone, pgvector, Weaviate, Elasticsearch, etc.

Consistency guarantees

Implementations MUST guarantee:

  • Deterministic retrieval for identical index state
  • No duplicate chunks returned
  • Stable chunk identity across writes
  • Full deletion of document vectors (GDPR requirement)

Embedding contract

All stored vectors must:

  • Have identical dimensionality
  • Use the same distance metric
  • Be normalized if required by the backend

Mixing embedding models in one index is forbidden.
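A sketch of enforcing this contract at write time. The `check_vector` helper is hypothetical, not part of the interface:

```python
import math
from typing import Sequence

def check_vector(vector: Sequence[float], expected_dim: int,
                 normalize: bool = False) -> list[float]:
    """Enforce the embedding contract: fixed dimensionality, optional L2 norm."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected dim {expected_dim}, got {len(vector)}")
    if normalize:
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vector]
    return list(vector)
```

An implementation would typically call such a check in `store_chunks()` and in `similarity_search()` so that a mismatched vector fails fast with `ValueError` rather than corrupting the index.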

Methods:

Source code in src/database_builder_libs/models/abstract_vector_store.py
def __init__(self) -> None:
    self._connected: bool = False
    self._connecting: bool = False

connect

connect(config: dict | None = None) -> None

Initialize the vector index and verify accessibility.

This method should:

  • Create index if missing
  • Validate embedding dimensionality
  • Validate distance metric compatibility

Raises

ConnectionError
Backend unreachable.

RuntimeError
Index exists but is incompatible.

Source code in src/database_builder_libs/models/abstract_vector_store.py
def connect(self, config: dict | None = None) -> None:
    """
    Initialize the vector index and verify accessibility.

    This method should:
    - Create index if missing
    - Validate embedding dimensionality
    - Validate distance metric compatibility

    Raises
    ------
    ConnectionError
        Backend unreachable.
    RuntimeError
        Index exists but is incompatible.
    """
    if self._connected:
        return

    self._connecting = True
    try:
        self._connect_impl(config)
        self._connected = True
    finally:
        self._connecting = False

delete_document abstractmethod

delete_document(document_id: DocumentId) -> int

Permanently remove all vectors belonging to a document.

This operation must be irreversible and guarantee that the document cannot appear in future search results.

Parameters

document_id : DocumentId
Identifier of the document to delete.

Returns

int
Number of deleted chunks.

GDPR Requirement

After successful deletion, similarity_search() MUST NOT return any chunk originating from this document.

Raises

RuntimeError
If deletion could not be fully verified.

Source code in src/database_builder_libs/models/abstract_vector_store.py
@abstractmethod
def delete_document(self, document_id: DocumentId) -> int:
    """
    Permanently remove all vectors belonging to a document.

    This operation must be irreversible and guarantee that the
    document cannot appear in future search results.

    Parameters
    ----------
    document_id : DocumentId
        Identifier of the document to delete.

    Returns
    -------
    int
        Number of deleted chunks.

    GDPR Requirement
    ----------------
    After successful deletion, similarity_search() MUST NOT return
    any chunk originating from this document.

    Raises
    ------
    RuntimeError
        If deletion could not be fully verified.
    """
    raise NotImplementedError

get_document_chunks abstractmethod

get_document_chunks(document_id: DocumentId) -> List[Chunk]

Retrieve all chunks belonging to a document.

Returns

List[Chunk]
All chunks for the document, ordered by their position in the original document.

Raises

KeyError
If document does not exist.

RuntimeError
If store not connected.

Source code in src/database_builder_libs/models/abstract_vector_store.py
@abstractmethod
def get_document_chunks(self, document_id: DocumentId) -> List[Chunk]:
    """
    Retrieve all chunks belonging to a document.

    Returns
    -------
    List[Chunk]
        All chunks for the document ordered by original document order.

    Raises
    ------
    KeyError
        If document does not exist.
    RuntimeError
        If store not connected.
    """

    raise NotImplementedError

similarity_search abstractmethod

similarity_search(vector: Sequence[float], limit: int = 10) -> List[Chunk]

Perform nearest-neighbour semantic search.

Parameters

vector : Sequence[float]
Query embedding. Must match index dimensionality.

limit : int
Maximum number of results to return.

Returns

List[Chunk]
Ordered by similarity descending (most relevant first).

Guarantees

  • At most limit results returned
  • No duplicate chunks
  • Ordering must reflect backend similarity score

Raises

ValueError
On vector dimensionality mismatch.

RuntimeError
If store not connected.

Source code in src/database_builder_libs/models/abstract_vector_store.py
@abstractmethod
def similarity_search(
    self,
    vector: Sequence[float],
    limit: int = 10,
) -> List[Chunk]:
    """
    Perform nearest-neighbour semantic search.

    Parameters
    ----------
    vector : Sequence[float]
        Query embedding. Must match index dimensionality.
    limit : int
        Maximum number of results to return.

    Returns
    -------
    List[Chunk]
        Ordered by similarity descending (most relevant first).

    Guarantees
    ----------
    - At most `limit` results returned
    - No duplicate chunks
    - Ordering must reflect backend similarity score

    Raises
    ------
    ValueError
        If vector dimensionality mismatch.
    RuntimeError
        If store not connected.
    """
    raise NotImplementedError
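
For illustration, brute-force cosine ranking reproduces the ordering guarantee; real backends use approximate nearest-neighbour indexes instead of a full scan. `top_k_cosine` is a hypothetical helper, not part of the interface:

```python
import math
from typing import List, Sequence, Tuple

def top_k_cosine(query: Sequence[float],
                 indexed: List[Tuple[str, Sequence[float]]],
                 limit: int = 10) -> List[str]:
    """Return ids of the `limit` most similar vectors, best first."""
    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    # Sort by similarity descending, then truncate to at most `limit` hits.
    ranked = sorted(indexed, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:limit]]

ids = top_k_cosine([1.0, 0.0],
                   [("a", [1.0, 0.1]), ("b", [0.0, 1.0]), ("c", [0.9, 0.0])],
                   limit=2)
print(ids)  # ['c', 'a']
```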

store_chunks abstractmethod

store_chunks(chunks: List[Chunk]) -> None

Insert or update chunks and their embeddings.

Behaviour

  • Operation must be idempotent
  • Existing chunks with the same (document_id, chunk_index) MUST be overwritten
  • Partial document updates are allowed

Parameters

chunks : List[Chunk]
Chunks containing text, metadata, and embedding vector.

Raises

RuntimeError
If called before connect().

ValueError
If embedding dimensionality mismatch occurs.

Source code in src/database_builder_libs/models/abstract_vector_store.py
@abstractmethod
def store_chunks(self, chunks: List[Chunk]) -> None:
    """
    Insert or update chunks and their embeddings.

    Behaviour
    ---------
    - Operation must be idempotent
    - Existing chunks with same (document_id, chunk_index) MUST be overwritten
    - Partial document updates are allowed

    Parameters
    ----------
    chunks : List[Chunk]
        Chunks containing text, metadata, and embedding vector.

    Raises
    ------
    RuntimeError
        If called before connect().
    ValueError
        If embedding dimensionality mismatch occurs.
    """
    raise NotImplementedError
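
The upsert semantics reduce to keying storage on the chunk identity. A minimal sketch, where the dict-based `store_chunks` is illustrative only:

```python
def store_chunks(index: dict, chunks) -> None:
    """Upsert keyed on (document_id, chunk_index): latest write wins."""
    for document_id, chunk_index, text, vector in chunks:
        index[(document_id, chunk_index)] = (text, vector)

index: dict = {}
store_chunks(index, [("doc-1", 0, "old text", [0.1])])
store_chunks(index, [("doc-1", 0, "new text", [0.2])])  # overwrite, no duplicate
print(len(index), index[("doc-1", 0)][0])  # 1 new text
```

Because the key is the stable identity, replaying the same batch is a no-op and re-indexing a changed chunk cleanly replaces its stale vector, which is exactly the idempotency the contract requires.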