Architectural overview¶

flowchart LR

%% ================= EXTERNAL =================
ext[External Systems<br/>Zotero / SharePoint / APIs / Files]

%% ================= USER APP =================
subgraph APP["user-app"]
    direction TB
    mapping[Domain Mapping / Data Science<br/>interpret & normalize data]
    docs[Documents to embed]
    nodes[Nodes & Relationships]

    mapping --> docs
    mapping --> nodes
end

%% ================= DATABASE BUILDER LIBS =================
subgraph DBL["database-builder-libs"]
    direction TB

    src[AbstractSource<br/>retrieve artefacts]

    vec[Vectorizer + Chunking]
    qdrant[(Vector Store<br/>Qdrant)]

    typedb[(Graph Store<br/>TypeDB)]

    vec --> qdrant
end

ext --> src
src --> mapping

docs --> vec
nodes --> typedb

database-builder-libs provides a structured ingestion backbone for building knowledge systems. It does not decide what the data means; it guarantees only that once the application has supplied that meaning, the data can be stored, indexed, and retrieved consistently.

The pipeline is intentionally split into two parts:

- Outside the library: interpretation and domain logic
- Inside the library: synchronization, indexing, and persistence

This separation keeps the library reusable across domains while still supporting complex knowledge models.

Artefact retrieval¶

The library standardizes access to external systems through the AbstractSource interface.

A source implementation is responsible only for:

- connecting to an external system
- listing modified artefacts
- returning normalized content objects

At this stage, the library treats the data as an opaque payload. No semantic processing, enrichment, or classification occurs here.

The purpose of this layer is to make heterogeneous systems behave like a consistent incremental data stream.
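
The three responsibilities above can be sketched as a small abstract interface. This is a minimal illustration only: the actual `AbstractSource` signature is not shown in this document, so `SketchSource`, `Artefact`, `connect`, `list_modified`, `fetch`, and `sync` are all hypothetical names chosen for the example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator


@dataclass(frozen=True)
class Artefact:
    """Hypothetical normalized content object; the payload stays opaque."""
    artefact_id: str
    modified_at: datetime
    payload: bytes


class SketchSource(ABC):
    """Illustrative stand-in for AbstractSource (method names are assumptions)."""

    @abstractmethod
    def connect(self) -> None:
        """Open a connection to the external system."""

    @abstractmethod
    def list_modified(self, since: datetime) -> Iterator[str]:
        """Yield ids of artefacts modified after `since`."""

    @abstractmethod
    def fetch(self, artefact_id: str) -> Artefact:
        """Return one artefact as a normalized content object."""


class InMemorySource(SketchSource):
    """Toy source backed by a dict, standing in for Zotero, SharePoint, etc."""

    def __init__(self, items: dict[str, tuple[datetime, bytes]]) -> None:
        self._items = items
        self._connected = False

    def connect(self) -> None:
        self._connected = True

    def list_modified(self, since: datetime) -> Iterator[str]:
        if not self._connected:
            raise RuntimeError("connect() must be called first")
        for artefact_id, (modified_at, _) in sorted(self._items.items()):
            if modified_at > since:
                yield artefact_id

    def fetch(self, artefact_id: str) -> Artefact:
        modified_at, payload = self._items[artefact_id]
        return Artefact(artefact_id, modified_at, payload)


def sync(source: SketchSource, since: datetime) -> list[Artefact]:
    """Incremental pull: only artefacts changed after `since` are fetched."""
    source.connect()
    return [source.fetch(i) for i in source.list_modified(since)]
```

Because `sync` depends only on the abstract interface, any backend that implements the three methods would look like the same incremental stream to the rest of the pipeline.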

Domain interpretation (outside the library)¶

After retrieval, the data leaves the responsibility of the library.

The application interprets the content and produces two independent outputs:

- Documents: unstructured information suitable for semantic retrieval
- Nodes & relationships: structured facts suitable for logical storage

The library intentionally does not provide tooling for this step because correctness depends entirely on domain knowledge. Embedding or storing uninterpreted data would make the system unreliable.

The library therefore acts as a persistence engine, not a knowledge extraction framework.
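
As an illustration of what this application-side step might look like, the sketch below maps one retrieved payload to both outputs. Everything in it is an assumption made for the example: the JSON payload shape, the `Document`, `Node`, and `Relationship` types, and the `map_publication` function are hypothetical and not part of the library.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Unstructured text destined for the vector index."""
    doc_id: str
    text: str


@dataclass(frozen=True)
class Node:
    """Structured entity destined for the graph store."""
    node_id: str
    kind: str
    attributes: tuple[tuple[str, str], ...]


@dataclass(frozen=True)
class Relationship:
    source_id: str
    relation: str
    target_id: str


def map_publication(
    artefact_id: str, payload: bytes
) -> tuple[list[Document], list[Node], list[Relationship]]:
    """Domain logic: interpret an opaque payload as a publication record."""
    record = json.loads(payload.decode("utf-8"))
    paper_id = f"paper:{artefact_id}"

    # The abstract is free text, so it becomes a document to embed.
    documents = [Document(doc_id=paper_id, text=record["abstract"])]

    # Title and authorship are explicit facts, so they become graph data.
    nodes = [Node(paper_id, "paper", (("title", record["title"]),))]
    relationships: list[Relationship] = []
    for name in record["authors"]:
        author_id = "author:" + name.lower().replace(" ", "-")
        nodes.append(Node(author_id, "author", (("name", name),)))
        relationships.append(Relationship(author_id, "wrote", paper_id))
    return documents, nodes, relationships
```

The choice of which fields are embedded and which become nodes is exactly the kind of decision that depends on domain knowledge, which is why it lives in the application.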

Document indexing¶

Documents re-enter the library through the vectorization pipeline.

The library then:

1. Splits documents into chunks
2. Converts chunks into embeddings
3. Stores vectors in the vector datastore

This produces a semantic index that supports similarity-based retrieval. The vector store answers relevance questions, not factual questions.
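
The chunk, embed, and store steps can be sketched end to end without any external service. This is a toy under stated assumptions: the word-window chunker, the hashed bag-of-words `toy_embed` (a stand-in for a real embedding model), and `InMemoryVectorStore` (a stand-in for Qdrant) are hypothetical and do not reflect the library's actual vectorizer or its Qdrant integration.

```python
from __future__ import annotations

import hashlib
import math


def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (one possible strategy)."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hashed bag-of-words vector, L2-normalized. NOT a real embedding model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Minimal vector store: upsert by id, search by cosine similarity."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float], str]] = []

    def upsert(self, chunk_id: str, vector: list[float], text: str) -> None:
        self._rows = [row for row in self._rows if row[0] != chunk_id]
        self._rows.append((chunk_id, vector, text))

    def search(self, query: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (chunk_id, sum(a * b for a, b in zip(query, vec)))
            for chunk_id, vec, _ in self._rows
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


def index_document(
    store: InMemoryVectorStore, doc_id: str, text: str,
    max_words: int = 50, overlap: int = 10,
) -> int:
    """Chunk, embed, and store one document; returns the number of chunks."""
    chunks = chunk_text(text, max_words, overlap)
    for i, chunk in enumerate(chunks):
        store.upsert(f"{doc_id}#{i}", toy_embed(chunk), chunk)
    return len(chunks)
```

Note that `search` returns a ranked list with scores rather than a yes/no answer, which is the sense in which a vector index answers relevance questions.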

Knowledge storage¶

Structured entities and relationships are stored in the graph datastore.

This layer represents explicit knowledge:

- entities
- attributes
- relations

Unlike the vector index, the graph represents deterministic information. It is intended for correctness and reasoning rather than relevance.
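
To make the contrast with the vector index concrete, here is a minimal in-memory sketch of entities, attributes, and relations with exact lookups. It is illustrative only: `InMemoryGraphStore` and its methods are hypothetical, and it does not use TypeDB or its query language.

```python
from __future__ import annotations


class InMemoryGraphStore:
    """Toy graph store holding entities, their attributes, and relations."""

    def __init__(self) -> None:
        self._entities: dict[str, dict[str, str]] = {}
        self._relations: set[tuple[str, str, str]] = set()

    def put_entity(self, entity_id: str, **attributes: str) -> None:
        """Create the entity if needed and merge in the given attributes."""
        self._entities.setdefault(entity_id, {}).update(attributes)

    def put_relation(self, source_id: str, relation: str, target_id: str) -> None:
        """Add a relation; both endpoints must already exist."""
        for endpoint in (source_id, target_id):
            if endpoint not in self._entities:
                raise KeyError(f"unknown entity: {endpoint}")
        # A set makes repeated writes idempotent.
        self._relations.add((source_id, relation, target_id))

    def attributes(self, entity_id: str) -> dict[str, str]:
        return dict(self._entities[entity_id])

    def targets(self, source_id: str, relation: str) -> list[str]:
        """Exact lookup: every target linked from `source_id` by `relation`."""
        return sorted(
            t for s, r, t in self._relations if s == source_id and r == relation
        )
```

A lookup such as `targets("author:ada", "wrote")` either contains a given paper or it does not; there is no score or ranking involved, and writing the same fact twice leaves the result unchanged.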

System roles¶

The library manages persistence and retrieval mechanics, while the application owns everything that concerns meaning:

| Responsibility | Provided by |
| --- | --- |
| Synchronization | library |
| Chunking & embedding | library |
| Vector storage | library |
| Graph storage | library |
| Domain meaning | application |
| Ontology decisions | application |
| Data interpretation | application |

Architectural intent¶

The library separates knowledge interpretation from knowledge storage.

This allows:

- different applications to share the same storage infrastructure
- consistent indexing guarantees
- deterministic persistence behaviour
- interchangeable domain models