Qdrant store

Overview

QdrantDatastore is a concrete AbstractVectorStore implementation backed by the Qdrant vector database. It stores document Chunk objects, each identified by a deterministic hash‑derived point ID, and provides fast cosine‑similarity search.

Design notes

Configuration Example

config = {
    "url": "http://localhost:6333",
    "collection": "knowledge_base",
    "vector_size": 768  # Must match embedding model output
}
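Before passing a config like the one above to connect(), it can help to fail fast on obvious mistakes. The helper below is an illustrative sketch, not part of the library; the required keys are assumed from the example above:

```python
REQUIRED_KEYS = {"url", "collection", "vector_size"}

def validate_config(config: dict) -> dict:
    """Raise early if required keys are missing or vector_size is invalid."""
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"Missing config keys: {sorted(missing)}")
    if not isinstance(config["vector_size"], int) or config["vector_size"] <= 0:
        raise ValueError("vector_size must be a positive integer")
    return config

config = validate_config({
    "url": "http://localhost:6333",
    "collection": "knowledge_base",
    "vector_size": 768,  # must match the embedding model's output dimension
})
```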

Docstring

qdrant_store

Classes:

QdrantDatastore

QdrantDatastore()

flowchart TD
    database_builder_libs.models.abstract_vector_store.AbstractVectorStore[AbstractVectorStore] --> database_builder_libs.stores.qdrant.qdrant_store.QdrantDatastore[QdrantDatastore]

Qdrant implementation of the semantic vector store.

Stores Chunk embeddings and enables similarity-based retrieval.

Conceptual model

Document → multiple Chunks → embedding vectors → nearest neighbour search

Identity

Each chunk is uniquely identified by: (document_id, chunk_index)

This pair is deterministically mapped to a stable Qdrant point id using hashing. Re-indexing the same document overwrites existing vectors instead of duplicating them.
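One common way to derive such a stable point id is a UUIDv5 hash over the pair; the actual hash scheme used by QdrantDatastore may differ, but any deterministic mapping gives the same overwrite-on-reindex behaviour:

```python
import uuid

# Hypothetical namespace; the real implementation may use a different
# namespace or hash function. Determinism is the property that matters.
_NAMESPACE = uuid.NAMESPACE_URL

def point_id(document_id: str, chunk_index: int) -> str:
    """Deterministically derive a stable point id from the chunk identity."""
    return str(uuid.uuid5(_NAMESPACE, f"{document_id}:{chunk_index}"))

# Re-indexing the same chunk yields the same id, so an upsert overwrites
# the existing vector instead of creating a duplicate.
assert point_id("doc-1", 0) == point_id("doc-1", 0)
assert point_id("doc-1", 0) != point_id("doc-1", 1)
```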

Stored payload

Each vector stores the following payload fields:

  • document_id
  • chunk_index
  • text
  • any additional metadata keys

Retrieval never returns embeddings — only semantic matches.

Consistency guarantees

  • Idempotent writes (upsert)
  • Stable ranking for unchanged index
  • No duplicate chunks returned
  • Full document deletion removes all vectors (GDPR requirement)
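These guarantees follow from keying every write by the deterministic point id. A toy in-memory model of the upsert-and-delete semantics (a stand-in dict, not the real Qdrant client):

```python
# Toy stand-in for the collection: (document_id, chunk_index) -> payload.
store: dict[tuple[str, int], dict] = {}

def upsert(document_id: str, chunk_index: int, payload: dict) -> None:
    # Same identity -> same key -> overwrite, never duplicate.
    store[(document_id, chunk_index)] = payload

def delete_document(document_id: str) -> int:
    """Remove every chunk of a document; returns the number deleted."""
    keys = [k for k in store if k[0] == document_id]
    for k in keys:
        del store[k]
    return len(keys)

upsert("doc-1", 0, {"text": "v1"})
upsert("doc-1", 0, {"text": "v2"})   # idempotent re-index: overwrites v1
assert len(store) == 1 and store[("doc-1", 0)]["text"] == "v2"
assert delete_document("doc-1") == 1  # full removal, as the GDPR bullet requires
assert not store
```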

Embedding requirements

All stored vectors must:

  • Match the configured dimensionality
  • Be generated by the same embedding model
  • Use cosine similarity

Methods:

Source code in src/database_builder_libs/stores/qdrant/qdrant_store.py
def __init__(self) -> None:
    super().__init__()
    self.client: QdrantClient | None = None
    self.collection: str | None = None
    self.vector_size: int | None = None

connect

connect(config: dict | None = None) -> None

Initialize the vector index and verify accessibility.

This method should:

  • Create the index if missing
  • Validate embedding dimensionality
  • Validate distance metric compatibility

Raises

ConnectionError
    Backend unreachable.
RuntimeError
    Index exists but is incompatible.
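Callers can distinguish the two documented failure modes at connect time. A usage sketch (the exception contract is from the docstring above; the wrapper and the stub store are illustrative):

```python
def connect_or_report(store, config: dict) -> bool:
    """Attempt connection, mapping documented failure modes to messages."""
    try:
        store.connect(config)
        return True
    except ConnectionError:
        print("Backend unreachable; check the url in config")
        return False
    except RuntimeError:
        print("Index exists but is incompatible (dimension/metric mismatch)")
        return False

class _UnreachableStore:
    # Stub standing in for QdrantDatastore in this sketch.
    def connect(self, config):
        raise ConnectionError("backend unreachable")

assert connect_or_report(_UnreachableStore(), {"url": "http://localhost:6333"}) is False
```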

Source code in src/database_builder_libs/models/abstract_vector_store.py
def connect(self, config: dict | None = None) -> None:
    """
    Initialize the vector index and verify accessibility.

    This method should:
    - Create index if missing
    - Validate embedding dimensionality
    - Validate distance metric compatibility

    Raises
    ------
    ConnectionError
        Backend unreachable.
    RuntimeError
        Index exists but is incompatible.
    """
    if self._connected:
        return

    self._connecting = True
    try:
        self._connect_impl(config)
        self._connected = True
    finally:
        self._connecting = False

delete_document

delete_document(document_id: DocumentId) -> int

Permanently remove all vectors for a document.

Guarantees

After completion, no chunk from this document will appear in similarity_search() results.

Returns

int
    Number of deleted chunks.

Source code in src/database_builder_libs/stores/qdrant/qdrant_store.py
def delete_document(self, document_id: DocumentId) -> int:
    """
    Permanently remove all vectors for a document.

    Guarantees
    ----------
    After completion, no chunk from this document will appear
    in similarity_search() results.

    Returns
    -------
    int
        Number of deleted chunks.
    """

    self._ensure_connected()

    chunks = self.get_document_chunks(document_id)
    if not chunks:
        return 0

    filt = Filter(
        must=[FieldCondition(key=DOC_ID, match=MatchValue(value=document_id))]
    )

    result = self._client().delete(
        collection_name=self._collection(),
        points_selector=filt,
        wait=True,
    )

    if result.status != "completed":
        raise RuntimeError(f"Qdrant delete failed: {result.status}")

    return len(chunks)

get_document_chunks

get_document_chunks(document_id: DocumentId) -> List[Chunk]

Retrieve all chunks belonging to a document.

Returns chunks ordered by chunk_index to reconstruct document order.

Source code in src/database_builder_libs/stores/qdrant/qdrant_store.py
def get_document_chunks(self, document_id: DocumentId) -> List[Chunk]:
    """
    Retrieve all chunks belonging to a document.

    Returns chunks ordered by chunk_index to reconstruct document order.
    """
    self._ensure_connected()

    filt = Filter(
        must=[FieldCondition(key=DOC_ID, match=MatchValue(value=document_id))]
    )

    chunks: List[Chunk] = []
    offset = None

    while True:
        records, offset = self._client().scroll(
            collection_name=self._collection(),
            scroll_filter=filt,
            with_payload=True,
            with_vectors=False,
            limit=512,
            offset=offset,
        )

        for r in records:
            payload = r.payload or {}

            doc_id = payload.get(DOC_ID)
            idx = payload.get(CHUNK_INDEX)
            if doc_id is None or idx is None:
                continue

            chunks.append(
                Chunk(
                    document_id=doc_id,
                    chunk_index=idx,
                    text=payload.get(TEXT, ""),
                    vector=(),
                    metadata={
                        k: v
                        for k, v in payload.items()
                        if k not in (DOC_ID, CHUNK_INDEX, TEXT)
                    },
                )
            )

        if offset is None:
            break

    return sorted(chunks, key=lambda c: c.chunk_index)
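Because the chunks come back ordered by chunk_index, the document's text can be reassembled by concatenation. A sketch, using a minimal stand-in for the library's Chunk model:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    # Minimal stand-in for the library's Chunk model, for illustration only.
    document_id: str
    chunk_index: int
    text: str
    vector: tuple = ()
    metadata: dict = field(default_factory=dict)

def reassemble(chunks: list[Chunk], sep: str = " ") -> str:
    """Join chunk texts in chunk_index order to approximate document order."""
    ordered = sorted(chunks, key=lambda c: c.chunk_index)
    return sep.join(c.text for c in ordered)

chunks = [
    Chunk("doc-1", 1, "over the lazy dog."),
    Chunk("doc-1", 0, "The quick brown fox jumps"),
]
assert reassemble(chunks) == "The quick brown fox jumps over the lazy dog."
```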

similarity_search

similarity_search(vector: Sequence[float], limit: int = 10) -> List[Chunk]

Perform semantic nearest-neighbour search.

Returns

List[Chunk]
    Ordered by cosine similarity descending.

Notes
  • Returned chunks DO NOT include stored embeddings
  • Metadata and text are preserved
  • Results are deterministic for identical index state
Source code in src/database_builder_libs/stores/qdrant/qdrant_store.py
def similarity_search(
    self,
    vector: Sequence[float],
    limit: int = 10,
) -> List[Chunk]:
    """
    Perform semantic nearest-neighbour search.

    Returns
    -------
    List[Chunk]
        Ordered by cosine similarity descending.

    Notes
    -----
    - Returned chunks DO NOT include stored embeddings
    - Metadata and text are preserved
    - Results are deterministic for identical index state
    """
    self._ensure_connected()
    expected_dim = self._vector_size()
    if len(vector) != expected_dim:
        raise ValueError(
            f"Query vector has wrong dimension: expected {expected_dim}, got {len(vector)}"
        )
    response = self._client().query_points(
        collection_name=self._collection(),
        query=list(vector),
        limit=limit,
        with_payload=True,
        with_vectors=False,
    )

    results: List[Chunk] = []

    for point in response.points:
        payload = point.payload or {}
        doc_id = payload.get(DOC_ID)
        idx = payload.get(CHUNK_INDEX)
        if doc_id is None or idx is None:
            continue

        results.append(
            Chunk(
                document_id=doc_id,
                chunk_index=idx,
                text=payload.get(TEXT, ""),
                vector=(),  # never return query vector
                metadata={
                    k: v
                    for k, v in payload.items()
                    if k not in (DOC_ID, CHUNK_INDEX, TEXT)
                },
            )
        )

    return results
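The "ordered by cosine similarity descending" contract can be illustrated with a pure-Python nearest-neighbour scan. Qdrant answers this with an ANN index rather than a linear scan, but the ranking rule is the same:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query: list[float], points: dict[str, list[float]], limit: int = 10):
    """Return point ids ordered by cosine similarity to the query, descending."""
    scored = sorted(points, key=lambda pid: cosine(query, points[pid]), reverse=True)
    return scored[:limit]

points = {
    "a": [1.0, 0.0],
    "b": [0.7, 0.7],
    "c": [0.0, 1.0],
}
# "a" is nearest in direction to the query, then "b", then "c".
assert rank([1.0, 0.1], points) == ["a", "b", "c"]
```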