PDF source

Overview

The PDF source allows you to parse, extract metadata from, chunk, and embed PDF documents stored in a local folder. It implements the AbstractSource interface to provide incremental synchronization based on file modification times, and exposes a fully configurable extraction pipeline that can combine embedded PDF metadata, structural heuristics from Docling, and LLM-based extraction for each metadata field independently.

Design Notes

Interaction Patterns

The PDFSource follows these interaction patterns:

  1. Connection Pattern:

    • Validate folder path exists on disk
    • Initialise the Docling document parser
    • Build the LLM client if credentials are provided
    • All field extraction strategies are configured at connect time
  2. Synchronization Pattern:

    • Scan folder recursively for .pdf files
    • Compare file mtime against last_synced cursor
    • Return stable relative paths as artefact identifiers
    • Sorted ascending by mtime for deterministic ordering
  3. Content Retrieval Pattern:

    • Parse each PDF with Docling to produce a structural IR
    • Run the metadata extraction pipeline (see Configuration below)
    • Chunk the parsed sections using the configured strategy
    • Embed chunks if an embedder is configured
    • Pack everything into a Content object
  4. Metadata Extraction Pattern:

    • Each field (title, authors, summary, publishing_institute, acknowledgements) has its own ordered strategy list
    • Strategies are tried in order; extraction stops on first success when stop_on_success=True
    • Expensive operations (LLM call, PDF metadata read) are cached and run at most once per document
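The ordered-strategy loop in pattern 4 can be sketched in plain Python. This is an illustrative sketch, not the library's API: `extract_field` and `runners` are hypothetical names, with each runner standing in for one strategy and returning a value or None on failure.

```python
from typing import Callable, Optional

# Illustrative sketch of the per-field strategy loop described above.
# `runners` maps a strategy key to a zero-argument callable that returns
# the extracted value, or None on failure. Names are not the library's API.
def extract_field(
    order: list[str],
    runners: dict[str, Callable[[], Optional[str]]],
    stop_on_success: bool = True,
) -> Optional[str]:
    value: Optional[str] = None
    for strategy in order:
        result = runners[strategy]()  # expensive calls are cached in the real pipeline
        if result is not None:
            value = result            # a later success overwrites an earlier one
            if stop_on_success:
                break                 # first success wins when stop_on_success=True
    return value
```

With `stop_on_success=False` every strategy still runs, which is how a later strategy in the order can overwrite an earlier result.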

Implementation Details

  • Incremental Sync: Uses filesystem mtime for change detection. Deletions are not reported.
  • Stable Identifiers: Artefact IDs are paths relative to folder_path and remain stable as long as files are not moved.
  • Parser Resilience: Docling conversion failures are caught and logged per file; the pipeline continues and returns a Content with empty chunks and metadata for that file.
  • LLM Caching: The LLM is called at most once per document regardless of how many fields request it. The result is cached and reused across fields.
  • PDF Metadata Robustness: The embedded PDF metadata reader handles both plain dicts and pypdf DocumentInformation objects, and gracefully drops any non-string keys caused by library version mismatches.
  • Chunking: Sections produced by Docling are passed to the configured AbstractChunkingStrategy. Chunking failures are caught and logged without aborting the pipeline.
  • Embedding: If an AbstractChunkEmbedder is configured, it is called on the chunk list after chunking. Embedding failures return the original un-embedded chunks.
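The once-per-document LLM caching amounts to memoising a single call. A minimal sketch, assuming nothing about the real implementation beyond the documented behaviour (the class name here is illustrative; the real pipeline caches on the source instance):

```python
from typing import Any, Callable, Optional

# Minimal sketch of the once-per-document LLM cache described above.
class CachedLLMCall:
    def __init__(self, call_llm: Callable[[str], dict[str, Any]]):
        self._call_llm = call_llm
        self._result: Optional[dict[str, Any]] = None
        self._called = False

    def get(self, document_header: str) -> dict[str, Any]:
        # The LLM runs at most once per document; every field that asks
        # for ExtractionStrategy.LLM reuses the same cached response.
        if not self._called:
            self._result = self._call_llm(document_header)
            self._called = True
        return self._result
```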

Configuring the PDFSource

Configuration is passed as a dict to src.connect({...}) and validated into a PDFDocumentConfig internally. All keys except folder_path are optional.

Minimal setup

Only folder_path is required. With no other config, metadata extraction uses embedded PDF metadata and Docling heuristics only — no LLM calls are made.

src = PDFSource()
src.connect({"folder_path": "/data/papers"})

Enabling LLM extraction

Provide llm_base_url, llm_api_key, and optionally llm_model to enable LLM-based extraction. The LLM receives the first 60 lines of the document and is expected to return JSON with title, authors, publishing_institute, and acknowledgements.

src = PDFSource()
src.connect({
    "folder_path":  "/data/papers",
    "llm_base_url": "http://localhost:11434/v1",
    "llm_api_key":  "ollama",
    "llm_model":    "gemma2:9b",
})

Any OpenAI-compatible endpoint is supported. The default model is gpt-4.1-mini.
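A response of the expected shape might look like the following. Only the four keys are documented above; the value shapes shown here are an assumption for illustration:

```json
{
  "title": "A Randomised Controlled Trial ...",
  "authors": ["Heyman, Bob", "Harrington, Barbara"],
  "publishing_institute": "Housing Studies",
  "acknowledgements": ["NEARG"]
}
```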

Configuring extraction strategies per field

Each metadata field can be configured independently using a FieldExtractionConfig that holds an OrderedStrategyConfig. The three available strategies are:

  • ExtractionStrategy.FILE_METADATA ("file_metadata"): Reads embedded PDF metadata via pypdf. Fast, no ML. Reliable for well-tagged PDFs; often noisy for Word-derived documents.
  • ExtractionStrategy.DOCLING ("docling"): Infers values from the Docling structural IR. Uses section headers for title and abstract section detection for summary.
  • ExtractionStrategy.LLM ("llm"): Sends the document header to the configured LLM. Most flexible, but requires LLM credentials and adds latency.

The OrderedStrategyConfig controls the order strategies are tried and whether to stop on the first success:

from database_builder_libs.sources.pdf_source import (
    FieldExtractionConfig,
    OrderedStrategyConfig,
    ExtractionStrategy,
)

# Try LLM first, fall back to Docling heuristics
title_config = FieldExtractionConfig(
    strategies=OrderedStrategyConfig(
        order=[ExtractionStrategy.LLM, ExtractionStrategy.DOCLING],
        stop_on_success=True,   # default — stops as soon as one strategy succeeds
    )
)

# Run both strategies regardless of success — LLM overwrites FILE_METADATA result
authors_config = FieldExtractionConfig(
    strategies=OrderedStrategyConfig(
        order=[ExtractionStrategy.FILE_METADATA, ExtractionStrategy.LLM],
        stop_on_success=False,  # LLM always runs and overwrites
    )
)

# Disable a field entirely
acks_config = FieldExtractionConfig(enabled=False)

Pass these configs by field name in the connect dict:

src.connect({
    "folder_path":          "/data/papers",
    "llm_base_url":         "http://localhost:11434/v1",
    "llm_api_key":          "ollama",
    "title":                title_config,
    "authors":              authors_config,
    "acknowledgements":     acks_config,
})

Default extraction strategies

When no field config is provided, PDFSource uses these defaults:

  • title: FILE_METADATA → DOCLING. The LLM is not in the default chain; add it if embedded metadata is unreliable.
  • authors: FILE_METADATA → LLM. The LLM is used as a fallback when embedded metadata is empty or noisy.
  • summary: DOCLING. Abstract section detection is reliable; the LLM is rarely needed.
  • publishing_institute: FILE_METADATA. The LLM can be added for documents without well-tagged metadata.
  • acknowledgements: LLM. Only the LLM can extract these reliably from free text.
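The default orders above can be written out as plain strategy-key lists. The dict form below is illustrative only; the real defaults live in PDFDocumentConfig as FieldExtractionConfig objects:

```python
# Default strategy orders from the table above, using the string key
# documented for each ExtractionStrategy. Illustrative only.
DEFAULT_STRATEGY_ORDER = {
    "title":                ["file_metadata", "docling"],
    "authors":              ["file_metadata", "llm"],
    "summary":              ["docling"],
    "publishing_institute": ["file_metadata"],
    "acknowledgements":     ["llm"],
}
```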

Configuring chunking and embedding

Section chunking and vector embedding are controlled via SectionsConfig. Disable chunking entirely, or plug in any AbstractChunkingStrategy and AbstractChunkEmbedder implementation.

from database_builder_libs.sources.pdf_source import SectionsConfig
from database_builder_libs.utility.chunk.summary_and_sections import SummaryAndSectionsStrategy
from database_builder_libs.utility.embed_chunk.openai_compatible import OpenAICompatibleChunkEmbedder

src.connect({
    "folder_path": "/data/papers",
    "sections": SectionsConfig(
        chunking_strategy=SummaryAndSectionsStrategy(),
        embedder=OpenAICompatibleChunkEmbedder(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
            model="nomic-embed-text",
        ),
    ),
})

To skip chunking and embedding entirely (metadata and file stats only):

src.connect({
    "folder_path": "/data/papers",
    "sections": SectionsConfig(enabled=False),
})

Skipping fields already known from an external source

When metadata for a document is already available from an external source (e.g. Zotero), pass FieldExtractionConfig(enabled=False) explicitly for those fields. Omitting a field from the config is not sufficient — PDFDocumentConfig has defaults for every field and will fall back to them silently.

src.connect({
    "folder_path": "/data/papers",
    "title":   FieldExtractionConfig(enabled=False),  # already known from Zotero
    "authors": FieldExtractionConfig(enabled=False),  # already known from Zotero
    # summary, publishing_institute and acknowledgements will use their defaults
})

Content object structure

PDFSource.get_content() returns one Content object per artefact. Content.content is a dict with the following keys:

  • file_path (str): Absolute path to the PDF file
  • file_name (str): Filename only
  • file_size (int): File size in bytes
  • num_pages (int | None): Page count from Docling; None if conversion failed
  • pdf_meta (dict): Raw embedded PDF metadata from pypdf, keys stripped of leading "/"
  • metadata (dict): DocumentMetadata serialised via dataclasses.asdict()
  • num_sections (int): Section count from Docling
  • num_tables (int): Table count from Docling
  • num_figures (int): Figure count from Docling
  • section_titles (list[str]): Ordered list of section header strings
  • chunks (list[dict]): Chunk objects serialised via dataclasses.asdict()
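For example, a caller can summarise one Content.content dict using the keys above (the helper name is illustrative):

```python
# Illustrative consumer of one Content.content dict with the keys above.
def summarise(content: dict) -> str:
    # num_pages is None when Docling conversion failed for this file
    pages = content["num_pages"] if content["num_pages"] is not None else "?"
    return (
        f"{content['file_name']}: {pages} page(s), "
        f"{content['num_sections']} section(s), "
        f"{len(content['chunks'])} chunk(s)"
    )
```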

The metadata dict mirrors DocumentMetadata and includes a source sub-dict that maps each populated field to the strategy that filled it, for example:

{
  "title": "A Randomised Controlled Trial ...",
  "authors": ["Heyman, Bob", "Harrington, Barbara"],
  "publishing_institute": {"name": "Housing Studies", "parent": null},
  "summary": "This paper discusses ...",
  "acknowledgements": [
    {"name": "NEARG", "type": "organization", "relation": "collaboration"}
  ],
  "source": {
    "title": "zotero",
    "authors": "zotero",
    "summary": "docling_heuristic",
    "publishing_institute": "llm",
    "acknowledgements": "llm"
  },
  "keywords": null,
  "literature_type": null,
  "strategic_overview": null,
  "target_groups": null,
  "best_practices": null
}
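The source sub-dict makes provenance queries straightforward. For instance, to find which fields a given strategy filled (the helper name is illustrative):

```python
# Return the fields that a given strategy filled, from the serialised
# metadata dict shown above. Helper name is illustrative.
def fields_filled_by(metadata: dict, strategy: str) -> list[str]:
    return [
        field
        for field, filled_by in metadata.get("source", {}).items()
        if filled_by == strategy
    ]
```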

Docstring

pdf_source

Classes:

  • PDFDocumentConfig

    Full configuration for the PDFSource pipeline.

  • PDFSource

    Self-contained PDF source that parses, extracts metadata, chunks, and embeds in a single configurable pipeline.

PDFDocumentConfig



Full configuration for the PDFSource pipeline.

Every metadata field has a FieldExtractionConfig with an ordered list of ExtractionStrategy values. The pipeline tries each strategy in order and stops on the first success when stop_on_success=True.

Available strategies

  • FILE_METADATA: Reads embedded PDF metadata via pypdf (fast, no ML).
  • DOCLING: Infers fields from the Docling structural IR.
  • LLM: Sends the first ~60 lines to the configured LLM. Requires llm_base_url and llm_api_key to be set.

Defaults

  • title : FILE_METADATA → DOCLING
  • authors : FILE_METADATA → LLM
  • summary : DOCLING
  • publishing_institute: FILE_METADATA
  • acknowledgements : LLM

PDFSource


(Class diagram: AbstractSource → PDFSource)

Self-contained PDF source that parses, extracts metadata, chunks, and embeds in a single configurable pipeline.

Mapping

  • PDF file path (relative) → Content.id_
  • File mtime (UTC) → Content.date

Content.content keys

  • file_path, file_name, file_size, num_pages: File-level stats.
  • pdf_meta: Raw pypdf embedded metadata dict.
  • num_sections, num_tables, num_figures, section_titles: Structural counts from Docling.
  • metadata: DocumentMetadata serialised via dataclasses.asdict().
  • chunks: List of Chunk dicts.

Methods:

  • connect

    Establish connection to the external source.

  • get_all_documents_metadata

    Return lightweight file-level metadata for all PDFs (no Docling conversion).

  • get_content

    Run the full parse → metadata → chunk → embed pipeline for each artefact.

  • get_list_artefacts

    Return PDFs modified after last_synced, sorted ascending by mtime.

connect

connect(config: Mapping[str, Any] | None = None) -> None

Establish connection to the external source.

Idempotent: safe to call multiple times.

Raises

ConnectionError PermissionError ValueError

Source code in src/database_builder_libs/models/abstract_source.py
def connect(self, config: Mapping[str, Any] | None = None) -> None:
    """
    Establish connection to the external source.

    Idempotent: safe to call multiple times.

    Raises
    ------
    ConnectionError
    PermissionError
    ValueError
    """
    if self._connected:
        return

    self._connect_impl(config or {})
    self._connected = True

get_all_documents_metadata

get_all_documents_metadata(limit: int = -1) -> List[dict[str, Any]]

Return lightweight file-level metadata for all PDFs (no Docling conversion).

Source code in src/database_builder_libs/sources/pdf_source.py
def get_all_documents_metadata(self, limit: int = -1) -> List[dict[str, Any]]:
    """Return lightweight file-level metadata for all PDFs (no Docling conversion)."""
    self._ensure_connected()
    assert self._config is not None
    results: List[dict[str, Any]] = []
    for pdf_path in sorted(self._config.folder_path.rglob("*.pdf")):
        if limit != -1 and len(results) >= limit:
            break
        stat = pdf_path.stat()
        results.append({
            "id":       str(pdf_path.relative_to(self._config.folder_path)),
            "path":     pdf_path,
            "size":     stat.st_size,
            "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
            "pdf_meta": self._read_pdf_meta(str(pdf_path)),
        })
    return results

get_content

get_content(artefacts: list[tuple[str, datetime]]) -> list[Content]

Run the full parse → metadata → chunk → embed pipeline for each artefact.

Source code in src/database_builder_libs/sources/pdf_source.py
def get_content(self, artefacts: list[tuple[str, datetime]]) -> list[Content]:
    """Run the full parse → metadata → chunk → embed pipeline for each artefact."""
    self._ensure_connected()
    assert self._config is not None
    assert self._parser is not None

    contents: list[Content] = []
    for relative_id, modified in artefacts:
        pdf_path = self._config.folder_path / relative_id
        if not pdf_path.exists():
            raise KeyError(f"Artefact '{relative_id}' no longer exists.")

        stat         = pdf_path.stat()
        pdf_path_str = str(pdf_path.resolve())

        parsed: Optional[ParsedDocument] = None
        num_pages: Optional[int] = None
        try:
            parsed    = self._parser.parse(pdf_path_str)
            num_pages = len(parsed.doc.pages) if parsed.doc.pages else None
        except DocumentConversionError as exc:
            logger.warning(f"Docling conversion failed for '{relative_id}': {exc}")
        except Exception as exc:
            logger.warning(f"Unexpected parse error for '{relative_id}': {exc}")

        metadata = self._extract_metadata(pdf_path_str=pdf_path_str, parsed=parsed)
        chunks   = self._chunk(parsed=parsed, document_id=relative_id) if (
            self._config.sections.enabled and parsed is not None
        ) else []
        if chunks and self._config.sections.embedder is not None:
            chunks = self._embed(chunks)

        contents.append(Content(
            date=modified,
            id_=relative_id,
            content={
                "file_path":  pdf_path_str,
                "file_name":  pdf_path.name,
                "file_size":  stat.st_size,
                "num_pages":  num_pages,
                "pdf_meta":   self._read_pdf_meta(pdf_path_str),
                "metadata":   dataclasses.asdict(metadata),
                **self._structural_meta(parsed),
                "chunks":     [dataclasses.asdict(chunk) for chunk in chunks],
            },
        ))
        logger.debug(f"Processed '{relative_id}': {len(chunks)} chunk(s).")

    logger.info(f"get_content returning {len(contents)} Content object(s).")
    return contents

get_list_artefacts

get_list_artefacts(last_synced: Optional[datetime]) -> list[tuple[str, datetime]]

Return PDFs modified after last_synced, sorted ascending by mtime.

Source code in src/database_builder_libs/sources/pdf_source.py
def get_list_artefacts(self, last_synced: Optional[datetime]) -> list[tuple[str, datetime]]:
    """Return PDFs modified after ``last_synced``, sorted ascending by mtime."""
    self._ensure_connected()
    assert self._config is not None
    if last_synced is not None and last_synced.tzinfo is None:
        last_synced = last_synced.replace(tzinfo=timezone.utc)
    artefacts = [
        (str(pdf_path.relative_to(self._config.folder_path)),
         datetime.fromtimestamp(pdf_path.stat().st_mtime, tz=timezone.utc))
        for pdf_path in self._config.folder_path.rglob("*.pdf")
        if last_synced is None
        or datetime.fromtimestamp(pdf_path.stat().st_mtime, tz=timezone.utc) > last_synced
    ]
    artefacts.sort(key=lambda artefact: artefact[1])
    logger.info(f"get_list_artefacts: {len(artefacts)} PDF(s) since {last_synced}.")
    return artefacts
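A typical driver advances its sync cursor from the returned artefacts. Because get_list_artefacts returns artefacts sorted ascending by mtime, the last artefact's timestamp is the new cursor. The sync_once helper below is an illustrative sketch against the documented interface, not part of the library:

```python
from datetime import datetime
from typing import Optional

# Illustrative sync driver for any object exposing the interface above.
def sync_once(source, last_synced: Optional[datetime]):
    artefacts = source.get_list_artefacts(last_synced)
    if not artefacts:
        return [], last_synced          # nothing changed; keep the old cursor
    contents = source.get_content(artefacts)
    return contents, artefacts[-1][1]   # new cursor = newest mtime seen
```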