Abstract vector store¶
Overview¶
The AbstractVectorStore defines the interface for semantic retrieval storage systems. It manages document chunks with their embeddings and enables nearest-neighbor similarity search across various vector database backends (FAISS, Qdrant, Pinecone, etc.).
Design Notes¶
Interaction Patterns¶
The AbstractVectorStore supports three main interaction patterns:
- Indexing Pattern:
  - Store chunks with embeddings
  - Overwrite existing chunks with same identity
  - Validate embedding dimensionality
- Semantic Search Pattern:
  - Accept query vector
  - Find k-nearest neighbors by similarity
  - Return chunks ordered by relevance (without embeddings)
- Document Management Pattern:
  - Retrieve all chunks for a document
  - Delete all vectors for GDPR compliance
  - Maintain document-level consistency
Implementation Requirements¶
- Embedding Consistency: All vectors in an index must use the same model and dimensionality
- Distance Metric: Implementations must specify and validate the distance metric (cosine, euclidean, etc.)
- Batch Operations: Support efficient batch insertion for large documents
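The batch-operations requirement can be approached in several ways. The following is a minimal sketch, assuming the backend accepts one list of chunks per write call; the helper name iter_batches and the batch size are illustrative and not part of the interface.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Slicing preserves the original order, so chunk ordering is unaffected.
        yield items[start:start + batch_size]


# Example: five items written in batches of two.
batches = [list(batch) for batch in iter_batches([0, 1, 2, 3, 4], 2)]
```

Each yielded batch could then be passed to store_chunks, so a large document never has to be sent in a single request.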
Docstring abstract chunk¶
chunk¶
Classes:
- Chunk: Atomic retrievable semantic unit of a document.
Chunk (dataclass)¶
Chunk(document_id: DocumentId, chunk_index: int, text: str, vector: Sequence[float], metadata: Mapping[str, Any] | None = None)
Atomic retrievable semantic unit of a document.
A Chunk represents a deterministic fragment of a source document that can be independently embedded, indexed, retrieved and deleted.
Identity¶
The unique identity of a chunk is defined by the pair: (document_id, chunk_index)
This identity MUST remain stable across re-indexing runs as long as the underlying document content has not changed in that region.
Fields¶
document_id : DocumentId Stable identifier of the parent document. Must remain constant across synchronization cycles.
chunk_index : int Monotonically increasing positional index within the document. Defines ordering and chunk identity. Must not depend on runtime factors (timestamps, randomness, hashing).
text : str Exact textual content used to generate the embedding. Any change to this text requires overwriting the stored vector.
vector : Sequence[float] Embedding vector representing the semantic meaning of text.
Requirements:
- Fixed dimensionality across entire index
- Generated by a single embedding model
- Deterministic for identical input text
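As an illustration of the identity rule, here is a minimal sketch of a stand-in dataclass that mirrors the documented constructor signature. It assumes DocumentId is a plain string, which this page does not confirm; chunk_key and build_chunks are hypothetical helpers, not part of the library.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping, Optional, Sequence, Tuple

DocumentId = str  # assumption: the real DocumentId type is not shown here


@dataclass(frozen=True)
class Chunk:
    """Stand-in mirroring the documented Chunk fields."""
    document_id: DocumentId
    chunk_index: int
    text: str
    vector: Sequence[float]
    metadata: Optional[Mapping[str, Any]] = None


def chunk_key(chunk: Chunk) -> Tuple[DocumentId, int]:
    """Return the identity pair (document_id, chunk_index)."""
    return (chunk.document_id, chunk.chunk_index)


def build_chunks(
    document_id: DocumentId,
    texts: Sequence[str],
    vectors: Sequence[Sequence[float]],
) -> List[Chunk]:
    """Assign chunk_index purely from position, never from time or randomness."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    return [
        Chunk(document_id, index, text, vector)
        for index, (text, vector) in enumerate(zip(texts, vectors))
    ]


chunks = build_chunks("doc-1", ["first part", "second part"], [[0.1, 0.2], [0.3, 0.4]])
keys = [chunk_key(c) for c in chunks]
```

Because the index comes from enumerate, rebuilding the chunks from unchanged input yields the same keys, which is what keeps identity stable across re-indexing runs.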
Docstring abstract store¶
abstract_vector_store¶
Classes:
- AbstractVectorStore: Semantic retrieval storage for embedding-based search.
AbstractVectorStore¶
AbstractVectorStore()
Semantic retrieval storage for embedding-based search.
The store persists document chunks together with their embeddings and supports nearest-neighbour semantic retrieval.
The interface is designed to be backend-agnostic and compatible with: FAISS, Qdrant, Pinecone, pgvector, Weaviate, Elasticsearch, etc.
Consistency guarantees¶
Implementations MUST guarantee:
- Deterministic retrieval for identical index state
- No duplicate chunks returned
- Stable chunk identity across writes
- Full deletion of document vectors (GDPR requirement)
Embedding contract¶
All stored vectors must:
- Have identical dimensionality
- Use the same distance metric
- Be normalized if required by backend
Mixing embedding models in one index is forbidden.
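The embedding contract could be enforced before any write. The sketch below assumes a backend that expects unit-length vectors (common for cosine similarity); check_dimension and l2_normalize are hypothetical helper names, not part of the documented interface.

```python
import math
from typing import List, Sequence


def check_dimension(vector: Sequence[float], expected_dim: int) -> None:
    """Reject vectors whose length differs from the index dimensionality."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


check_dimension([3.0, 4.0], expected_dim=2)
unit = l2_normalize([3.0, 4.0])
```

Whether normalization is needed depends on the backend and metric, so an implementation would apply it only when its index requires it.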
Methods:
- connect: Initialize the vector index and verify accessibility.
- delete_document: Permanently remove all vectors belonging to a document.
- get_document_chunks: Retrieve all chunks belonging to a document.
- similarity_search: Perform nearest-neighbour semantic search.
- store_chunks: Insert or update chunks and their embeddings.
Source code in src/database_builder_libs/models/abstract_vector_store.py
connect¶
connect(config: dict | None = None) -> None
Initialize the vector index and verify accessibility.
This method should:
- Create index if missing
- Validate embedding dimensionality
- Validate distance metric compatibility
Raises¶
ConnectionError Backend unreachable.
RuntimeError Index exists but is incompatible.
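The connect behaviour described above might look like the following in-memory sketch. The class InMemoryIndex, the config keys "dimension" and "metric", their default values, and the set of supported metrics are all assumptions made for illustration; the real configuration schema is not shown on this page.

```python
from typing import Optional

SUPPORTED_METRICS = {"cosine", "euclidean", "dot"}  # hypothetical set


class InMemoryIndex:
    """Toy index that only tracks its own settings (illustrative only)."""

    def __init__(self) -> None:
        self.dimension: Optional[int] = None
        self.metric: Optional[str] = None
        self.connected = False

    def connect(self, config: Optional[dict] = None) -> None:
        config = config or {}
        dimension = config.get("dimension", 3)   # arbitrary default for the sketch
        metric = config.get("metric", "cosine")  # arbitrary default for the sketch
        if metric not in SUPPORTED_METRICS:
            raise ValueError(f"unsupported distance metric: {metric}")
        if self.dimension is None:
            # Index is missing: create it with the requested settings.
            self.dimension, self.metric = dimension, metric
        elif (self.dimension, self.metric) != (dimension, metric):
            # Index exists but was built with different settings.
            raise RuntimeError("index exists but is incompatible")
        self.connected = True


index = InMemoryIndex()
index.connect({"dimension": 4, "metric": "cosine"})
```

A real backend would additionally raise ConnectionError when the service cannot be reached, which an in-memory toy cannot demonstrate.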
delete_document (abstractmethod)¶
delete_document(document_id: DocumentId) -> int
Permanently remove all vectors belonging to a document.
This operation must be irreversible and guarantee that the document cannot appear in future search results.
Parameters¶
document_id : DocumentId Identifier of the document to delete.
Returns¶
int Number of deleted chunks.
GDPR Requirement¶
After successful deletion, similarity_search() MUST NOT return any chunk originating from this document.
Raises¶
RuntimeError If deletion could not be fully verified.
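A minimal sketch of the delete-then-verify behaviour, assuming the index is a plain dict keyed by (document_id, chunk_index) and that DocumentId is a string. A real backend would issue a filtered delete and a follow-up count query instead.

```python
from typing import Dict, List, Tuple

ChunkKey = Tuple[str, int]  # (document_id, chunk_index)


def delete_document(index: Dict[ChunkKey, List[float]], document_id: str) -> int:
    """Remove every vector of a document and return how many were deleted."""
    keys = [key for key in index if key[0] == document_id]
    for key in keys:
        del index[key]
    # Verification step: nothing belonging to the document may remain.
    if any(key[0] == document_id for key in index):
        raise RuntimeError("deletion could not be fully verified")
    return len(keys)


store = {("a", 0): [1.0, 0.0], ("a", 1): [0.0, 1.0], ("b", 0): [1.0, 1.0]}
deleted = delete_document(store, "a")
```

Deleting an unknown document returns 0 in this sketch; the documented contract does not say whether that case should raise instead.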
get_document_chunks (abstractmethod)¶
Retrieve all chunks belonging to a document.
Returns¶
List[Chunk] All chunks for the document ordered by original document order.
Raises¶
KeyError If document does not exist.
RuntimeError If store not connected.
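The ordering and KeyError behaviour can be sketched as follows. Since the method signature is not shown on this page, the function below is a free-standing illustration that assumes a dict keyed by (document_id, chunk_index) holding chunk text, and returns (chunk_index, text) pairs rather than Chunk objects.

```python
from typing import Dict, List, Tuple

ChunkKey = Tuple[str, int]  # (document_id, chunk_index)


def get_document_chunks(index: Dict[ChunkKey, str], document_id: str) -> List[Tuple[int, str]]:
    """Return (chunk_index, text) pairs of one document in document order."""
    found = [(key[1], text) for key, text in index.items() if key[0] == document_id]
    if not found:
        raise KeyError(document_id)
    # chunk_index defines ordering, so sort on it explicitly.
    return sorted(found, key=lambda pair: pair[0])


texts = {("a", 1): "second", ("b", 0): "other", ("a", 0): "first"}
ordered = get_document_chunks(texts, "a")
```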
similarity_search (abstractmethod)¶
Perform nearest-neighbour semantic search.
Parameters¶
vector : Sequence[float] Query embedding. Must match index dimensionality.
limit : int Maximum number of results to return.
Returns¶
List[Chunk] Ordered by similarity descending (most relevant first).
Guarantees¶
- At most limit results returned
- No duplicate chunks
- Ordering must reflect backend similarity score
Raises¶
ValueError If vector dimensionality mismatch.
RuntimeError If store not connected.
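To make the guarantees concrete, here is a brute-force sketch that assumes cosine similarity and a dict keyed by (document_id, chunk_index). Production backends use approximate nearest-neighbour indexes instead; the tie-break on the chunk key is an added assumption to keep results deterministic.

```python
import math
from typing import Dict, List, Sequence, Tuple

ChunkKey = Tuple[str, int]  # (document_id, chunk_index)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_search(
    index: Dict[ChunkKey, Sequence[float]],
    vector: Sequence[float],
    limit: int,
) -> List[ChunkKey]:
    """Return at most `limit` chunk keys, most similar first."""
    if limit <= 0:
        return []
    scored = []
    for key, stored in index.items():
        if len(stored) != len(vector):
            raise ValueError("query dimensionality does not match the index")
        scored.append((cosine_similarity(vector, stored), key))
    # Sort by score descending, then by key so equal scores order deterministically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [key for _, key in scored[:limit]]


vectors = {("a", 0): [1.0, 0.0], ("a", 1): [0.0, 1.0], ("b", 0): [1.0, 1.0]}
top_two = similarity_search(vectors, [1.0, 0.0], limit=2)
```

Because each key appears once in the dict, the result cannot contain duplicate chunks.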
store_chunks (abstractmethod)¶
Insert or update chunks and their embeddings.
Behaviour¶
- Operation must be idempotent
- Existing chunks with same (document_id, chunk_index) MUST be overwritten
- Partial document updates are allowed
Parameters¶
chunks : List[Chunk] Chunks containing text, metadata, and embedding vector.
Raises¶
RuntimeError If called before connect().
ValueError If embedding dimensionality mismatch occurs.
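The idempotent overwrite behaviour amounts to an upsert keyed by chunk identity. The sketch below assumes a dict-backed index and a reduced stand-in Chunk with string document ids; the expected_dim parameter is hypothetical, and the connect() check is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Chunk:
    """Reduced stand-in for the documented Chunk (metadata omitted)."""
    document_id: str
    chunk_index: int
    text: str
    vector: Sequence[float]


def store_chunks(
    index: Dict[Tuple[str, int], Chunk],
    chunks: List[Chunk],
    expected_dim: int,
) -> None:
    """Insert or overwrite chunks keyed by (document_id, chunk_index)."""
    # Validate everything first so a bad vector leaves the index untouched.
    for chunk in chunks:
        if len(chunk.vector) != expected_dim:
            raise ValueError("embedding dimensionality mismatch")
    for chunk in chunks:
        index[(chunk.document_id, chunk.chunk_index)] = chunk


index: Dict[Tuple[str, int], Chunk] = {}
batch = [Chunk("doc", 0, "hello", [0.1, 0.2]), Chunk("doc", 1, "world", [0.3, 0.4])]
store_chunks(index, batch, expected_dim=2)
store_chunks(index, batch, expected_dim=2)  # repeating the call changes nothing
store_chunks(index, [Chunk("doc", 1, "world!", [0.5, 0.6])], expected_dim=2)  # partial update
```

Validating the whole batch before writing is one design choice among several; a backend with transactions could achieve the same effect by rolling back on failure.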