PDF source

Overview

The PDF source allows you to parse, extract metadata from, chunk, and embed PDF documents stored in a local folder. It implements the AbstractSource interface to provide incremental synchronization based on file modification times, and exposes a fully configurable extraction pipeline that can combine embedded PDF metadata, structural heuristics from Docling, and LLM-based extraction for each metadata field independently.

Design Notes

Interaction Patterns

The PDFSource follows these interaction patterns:

  1. Connection Pattern:

    • Validate folder path exists on disk
    • Initialise the Docling document parser
    • Build the LLM client if credentials are provided
    • All field extraction strategies are configured at connect time
  2. Synchronization Pattern:

    • Scan folder recursively for .pdf files
    • Compare file mtime against last_synced cursor
    • Return stable relative paths as artefact identifiers
    • Sorted ascending by mtime for deterministic ordering
  3. Content Retrieval Pattern:

    • Parse each PDF with Docling to produce a structural IR
    • Run the metadata extraction pipeline (see Configuration below)
    • Chunk the parsed sections using the configured strategy
    • Embed chunks if an embedder is configured
    • Pack everything into a Content object
  4. Metadata Extraction Pattern:

    • Each field (title, authors, summary, publishing_institute, acknowledgements) has its own ordered strategy list
    • Strategies are tried in order; extraction stops on first success when stop_on_success=True
    • Expensive operations (LLM call, PDF metadata read) are cached and run at most once per document
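The ordered-strategy loop in pattern 4 can be sketched in plain Python. This is an illustrative sketch, not the library's API: `extract_field` and `runners` are hypothetical names, with each runner standing in for one strategy and returning a value or None on failure.

```python
from typing import Callable, Optional

# Illustrative sketch of the per-field strategy loop described above.
# `runners` maps a strategy key to a zero-argument callable that returns
# the extracted value, or None on failure. Names are not the library's API.
def extract_field(
    order: list[str],
    runners: dict[str, Callable[[], Optional[str]]],
    stop_on_success: bool = True,
) -> Optional[str]:
    value: Optional[str] = None
    for strategy in order:
        result = runners[strategy]()  # expensive calls are cached in the real pipeline
        if result is not None:
            value = result            # a later success overwrites an earlier one
            if stop_on_success:
                break                 # first success wins when stop_on_success=True
    return value
```

With `stop_on_success=False` every strategy still runs, which is how a later strategy in the order can overwrite an earlier result.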

Implementation Details

  • Incremental Sync: Uses filesystem mtime for change detection. Deletions are not reported.
  • Stable Identifiers: Artefact IDs are paths relative to folder_path and remain stable as long as files are not moved.
  • Parser Resilience: Docling conversion failures are caught and logged per file; the pipeline continues and returns a Content with empty chunks and metadata for that file.
  • LLM Caching: The LLM is called at most once per document regardless of how many fields request it. The result is cached and reused across fields.
  • PDF Metadata Robustness: The embedded PDF metadata reader handles both plain dicts and pypdf DocumentInformation objects, and gracefully drops any non-string keys caused by library version mismatches.
  • Chunking: Sections produced by Docling are passed to the configured AbstractChunkingStrategy. Chunking failures are caught and logged without aborting the pipeline.
  • Embedding: If an AbstractChunkEmbedder is configured, it is called on the chunk list after chunking. Embedding failures return the original un-embedded chunks.
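The once-per-document LLM caching amounts to memoising a single call. A minimal sketch, assuming nothing about the real implementation beyond the documented behaviour (the class name here is illustrative; the real pipeline caches on the source instance):

```python
from typing import Any, Callable, Optional

# Minimal sketch of the once-per-document LLM cache described above.
class CachedLLMCall:
    def __init__(self, call_llm: Callable[[str], dict[str, Any]]):
        self._call_llm = call_llm
        self._result: Optional[dict[str, Any]] = None
        self._called = False

    def get(self, document_header: str) -> dict[str, Any]:
        # The LLM runs at most once per document; every field that asks
        # for ExtractionStrategy.LLM reuses the same cached response.
        if not self._called:
            self._result = self._call_llm(document_header)
            self._called = True
        return self._result
```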

Configuring the PDFSource

Configuration is passed as a dict to src.connect({...}) and validated into a PDFDocumentConfig internally. All keys except folder_path are optional.

Minimal setup

Only folder_path is required. With no other config, metadata extraction uses embedded PDF metadata and Docling heuristics only — no LLM calls are made.

src = PDFSource()
src.connect({"folder_path": "/data/papers"})

Enabling LLM extraction

Provide llm_base_url, llm_api_key, and optionally llm_model to enable LLM-based extraction. The LLM receives the first 60 lines of the document and is expected to return JSON with title, authors, publishing_institute, and acknowledgements.

src = PDFSource()
src.connect({
    "folder_path":  "/data/papers",
    "llm_base_url": "http://localhost:11434/v1",
    "llm_api_key":  "ollama",
    "llm_model":    "gemma2:9b",
})

Any OpenAI-compatible endpoint is supported. The default model is gpt-4.1-mini.
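A response of the expected shape might look like the following. Only the four keys are documented above; the value shapes shown here are an assumption for illustration:

```json
{
  "title": "A Randomised Controlled Trial ...",
  "authors": ["Heyman, Bob", "Harrington, Barbara"],
  "publishing_institute": "Housing Studies",
  "acknowledgements": ["NEARG"]
}
```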

Configuring extraction strategies per field

Each metadata field can be configured independently using a FieldExtractionConfig that holds an OrderedStrategyConfig. The three available strategies are:

  • ExtractionStrategy.FILE_METADATA ("file_metadata"): Reads embedded PDF metadata via pypdf. Fast, no ML. Reliable for well-tagged PDFs; often noisy for Word-derived documents.
  • ExtractionStrategy.DOCLING ("docling"): Infers values from the Docling structural IR. Uses section headers for title and abstract section detection for summary.
  • ExtractionStrategy.LLM ("llm"): Sends the document header to the configured LLM. Most flexible, but requires LLM credentials and adds latency.

The OrderedStrategyConfig controls the order strategies are tried and whether to stop on the first success:

from database_builder_libs.sources.pdf_source import (
    FieldExtractionConfig,
    OrderedStrategyConfig,
    ExtractionStrategy,
)

# Try LLM first, fall back to Docling heuristics
title_config = FieldExtractionConfig(
    strategies=OrderedStrategyConfig(
        order=[ExtractionStrategy.LLM, ExtractionStrategy.DOCLING],
        stop_on_success=True,   # default — stops as soon as one strategy succeeds
    )
)

# Run both strategies regardless of success — LLM overwrites FILE_METADATA result
authors_config = FieldExtractionConfig(
    strategies=OrderedStrategyConfig(
        order=[ExtractionStrategy.FILE_METADATA, ExtractionStrategy.LLM],
        stop_on_success=False,  # LLM always runs and overwrites
    )
)

# Disable a field entirely
acks_config = FieldExtractionConfig(enabled=False)

Pass these configs by field name in the connect dict:

src.connect({
    "folder_path":          "/data/papers",
    "llm_base_url":         "http://localhost:11434/v1",
    "llm_api_key":          "ollama",
    "title":                title_config,
    "authors":              authors_config,
    "acknowledgements":     acks_config,
})

Default extraction strategies

When no field config is provided, PDFSource uses these defaults:

  • title: FILE_METADATA → DOCLING. The LLM is not in the default chain; add it if embedded metadata is unreliable.
  • authors: FILE_METADATA → LLM. The LLM is used as a fallback when embedded metadata is empty or noisy.
  • summary: DOCLING. Abstract section detection is reliable; the LLM is rarely needed.
  • publishing_institute: FILE_METADATA. The LLM can be added for documents without well-tagged metadata.
  • acknowledgements: LLM. Only the LLM can extract these reliably from free text.
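The default orders above can be written out as plain strategy-key lists. The dict form below is illustrative only; the real defaults live in PDFDocumentConfig as FieldExtractionConfig objects:

```python
# Default strategy orders from the table above, using the string key
# documented for each ExtractionStrategy. Illustrative only.
DEFAULT_STRATEGY_ORDER = {
    "title":                ["file_metadata", "docling"],
    "authors":              ["file_metadata", "llm"],
    "summary":              ["docling"],
    "publishing_institute": ["file_metadata"],
    "acknowledgements":     ["llm"],
}
```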

Configuring chunking and embedding

Section chunking and vector embedding are controlled via SectionsConfig. Disable chunking entirely, or plug in any AbstractChunkingStrategy and AbstractChunkEmbedder implementation.

from database_builder_libs.sources.pdf_source import SectionsConfig
from database_builder_libs.utility.chunk.summary_and_sections import SummaryAndSectionsStrategy
from database_builder_libs.utility.embed_chunk.openai_compatible import OpenAICompatibleChunkEmbedder

src.connect({
    "folder_path": "/data/papers",
    "sections": SectionsConfig(
        chunking_strategy=SummaryAndSectionsStrategy(),
        embedder=OpenAICompatibleChunkEmbedder(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
            model="nomic-embed-text",
        ),
    ),
})

To skip chunking and embedding entirely (metadata and file stats only):

src.connect({
    "folder_path": "/data/papers",
    "sections": SectionsConfig(enabled=False),
})

Skipping fields already known from an external source

When metadata for a document is already available from an external source (e.g. Zotero), pass FieldExtractionConfig(enabled=False) explicitly for those fields. Omitting a field from the config is not sufficient — PDFDocumentConfig has defaults for every field and will fall back to them silently.

src.connect({
    "folder_path": "/data/papers",
    "title":   FieldExtractionConfig(enabled=False),  # already known from Zotero
    "authors": FieldExtractionConfig(enabled=False),  # already known from Zotero
    # summary, publishing_institute and acknowledgements will use their defaults
})

Content object structure

PDFSource.get_content() returns one Content object per artefact. Content.content is a dict with the following keys:

  • file_path (str): Absolute path to the PDF file
  • file_name (str): Filename only
  • file_size (int): File size in bytes
  • num_pages (int | None): Page count from Docling; None if conversion failed
  • pdf_meta (dict): Raw embedded PDF metadata from pypdf, keys stripped of leading "/"
  • metadata (dict): DocumentMetadata serialised via dataclasses.asdict()
  • num_sections (int): Section count from Docling
  • num_tables (int): Table count from Docling
  • num_figures (int): Figure count from Docling
  • section_titles (list[str]): Ordered list of section header strings
  • chunks (list[dict]): Chunk objects serialised via dataclasses.asdict()
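For example, a caller can summarise one Content.content dict using the keys above (the helper name is illustrative):

```python
# Illustrative consumer of one Content.content dict with the keys above.
def summarise(content: dict) -> str:
    # num_pages is None when Docling conversion failed for this file
    pages = content["num_pages"] if content["num_pages"] is not None else "?"
    return (
        f"{content['file_name']}: {pages} page(s), "
        f"{content['num_sections']} section(s), "
        f"{len(content['chunks'])} chunk(s)"
    )
```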

The metadata dict mirrors DocumentMetadata and includes a source sub-dict that maps each populated field to the strategy that filled it, for example:

{
  "title": "A Randomised Controlled Trial ...",
  "authors": ["Heyman, Bob", "Harrington, Barbara"],
  "publishing_institute": {"name": "Housing Studies", "parent": null},
  "summary": "This paper discusses ...",
  "acknowledgements": [
    {"name": "NEARG", "type": "organization", "relation": "collaboration"}
  ],
  "source": {
    "title": "zotero",
    "authors": "zotero",
    "summary": "docling_heuristic",
    "publishing_institute": "llm",
    "acknowledgements": "llm"
  },
  "keywords": null,
  "literature_type": null,
  "strategic_overview": null,
  "target_groups": null,
  "best_practices": null
}
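The source sub-dict makes provenance queries straightforward. For instance, to find which fields a given strategy filled (the helper name is illustrative):

```python
# Return the fields that a given strategy filled, from the serialised
# metadata dict shown above. Helper name is illustrative.
def fields_filled_by(metadata: dict, strategy: str) -> list[str]:
    return [
        field
        for field, filled_by in metadata.get("source", {}).items()
        if filled_by == strategy
    ]
```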

Docstring

pdf_source

Classes:

  • PDFDocumentConfig

    Full configuration for the PDFSource pipeline.

  • PDFSource

    Self-contained PDF source that parses, extracts metadata, chunks, and embeds in a single configurable pipeline.

PDFDocumentConfig



Full configuration for the PDFSource pipeline.

Every metadata field has a FieldExtractionConfig with an ordered list of ExtractionStrategy values. The pipeline tries each strategy in order and stops on the first success when stop_on_success=True.

Available strategies

  • FILE_METADATA: Reads embedded PDF metadata via pypdf (fast, no ML).
  • DOCLING: Infers fields from the Docling structural IR.
  • LLM: Sends the first ~60 lines to the configured LLM. Requires llm_base_url and llm_api_key to be set.

Defaults

  • title : FILE_METADATA → DOCLING
  • authors : FILE_METADATA → LLM
  • summary : DOCLING
  • publishing_institute: FILE_METADATA
  • acknowledgements : LLM

PDFSource


(Class diagram: AbstractSource → PDFSource)

Self-contained PDF source that parses, extracts metadata, chunks, and embeds in a single configurable pipeline.

Mapping

  • PDF file path (relative) → Content.id_
  • File mtime (UTC) → Content.date

Content.content keys

  • file_path, file_name, file_size, num_pages: File-level stats.
  • pdf_meta: Raw pypdf embedded metadata dict.
  • num_sections, num_tables, num_figures, section_titles: Structural counts from Docling.
  • metadata: DocumentMetadata serialised via dataclasses.asdict().
  • chunks: List of Chunk dicts.

Methods:

  • connect

    Establish connection to the external source.

  • get_all_documents_metadata

    Return lightweight file-level metadata for all PDFs (no Docling conversion).

  • get_content

    Run the full parse → metadata → chunk → embed pipeline for each artefact.

  • get_list_artefacts

    Return PDFs modified after last_synced, sorted ascending by mtime.

connect

connect(config: Mapping[str, Any] | None = None) -> None

Establish connection to the external source.

Idempotent: safe to call multiple times.

Raises

ConnectionError PermissionError ValueError

Source code in src/database_builder_libs/models/abstract_source.py
def connect(self, config: Mapping[str, Any] | None = None) -> None:
    """
    Establish connection to the external source.

    Idempotent: safe to call multiple times.

    Raises
    ------
    ConnectionError
    PermissionError
    ValueError
    """
    if self._connected:
        return

    self._connect_impl(config or {})
    self._connected = True

get_all_documents_metadata

get_all_documents_metadata(limit: int = -1) -> List[dict[str, Any]]

Return lightweight file-level metadata for all PDFs (no Docling conversion).

Source code in src/database_builder_libs/sources/pdf_source.py
def get_all_documents_metadata(self, limit: int = -1) -> List[dict[str, Any]]:
    """Return lightweight file-level metadata for all PDFs (no Docling conversion)."""
    self._ensure_connected()
    assert self._config is not None
    results: List[dict[str, Any]] = []
    for pdf_path in sorted(self._config.folder_path.rglob("*.pdf")):
        if limit != -1 and len(results) >= limit:
            break
        stat = pdf_path.stat()
        results.append({
            "id":       str(pdf_path.relative_to(self._config.folder_path)),
            "path":     pdf_path,
            "size":     stat.st_size,
            "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
            "pdf_meta": self._read_pdf_meta(str(pdf_path)),
        })
    return results

get_content

get_content(artefacts: list[tuple[str, datetime]]) -> list[Content]

Run the full parse → metadata → chunk → embed pipeline for each artefact.

Source code in src/database_builder_libs/sources/pdf_source.py
def get_content(self, artefacts: list[tuple[str, datetime]]) -> list[Content]:
    """Run the full parse → metadata → chunk → embed pipeline for each artefact."""
    self._ensure_connected()
    assert self._config is not None
    assert self._parser is not None

    contents: list[Content] = []
    for relative_id, modified in artefacts:
        pdf_path = self._config.folder_path / relative_id
        if not pdf_path.exists():
            raise KeyError(f"Artefact '{relative_id}' no longer exists.")

        stat         = pdf_path.stat()
        pdf_path_str = str(pdf_path.resolve())

        parsed: Optional[ParsedDocument] = None
        num_pages: Optional[int] = None
        try:
            parsed    = self._parser.parse(pdf_path_str)
            num_pages = len(parsed.doc.pages) if parsed.doc.pages else None
        except DocumentConversionError as exc:
            logger.warning(f"Docling conversion failed for '{relative_id}': {exc}")
        except Exception as exc:
            logger.warning(f"Unexpected parse error for '{relative_id}': {exc}")

        metadata = self._extract_metadata(pdf_path_str=pdf_path_str, parsed=parsed)
        chunks   = self._chunk(parsed=parsed, document_id=relative_id) if (
            self._config.sections.enabled and parsed is not None
        ) else []
        if chunks and self._config.sections.embedder is not None:
            chunks = self._embed(chunks)

        contents.append(Content(
            date=modified,
            id_=relative_id,
            content={
                "file_path":  pdf_path_str,
                "file_name":  pdf_path.name,
                "file_size":  stat.st_size,
                "num_pages":  num_pages,
                "pdf_meta":   self._read_pdf_meta(pdf_path_str),
                "metadata":   dataclasses.asdict(metadata),
                **self._structural_meta(parsed),
                "chunks":     [dataclasses.asdict(chunk) for chunk in chunks],
            },
        ))
        logger.debug(f"Processed '{relative_id}': {len(chunks)} chunk(s).")

    logger.info(f"get_content returning {len(contents)} Content object(s).")
    return contents

get_list_artefacts

get_list_artefacts(last_synced: Optional[datetime]) -> list[tuple[str, datetime]]

Return PDFs modified after last_synced, sorted ascending by mtime.

Source code in src/database_builder_libs/sources/pdf_source.py
def get_list_artefacts(self, last_synced: Optional[datetime]) -> list[tuple[str, datetime]]:
    """Return PDFs modified after ``last_synced``, sorted ascending by mtime."""
    self._ensure_connected()
    assert self._config is not None
    if last_synced is not None and last_synced.tzinfo is None:
        last_synced = last_synced.replace(tzinfo=timezone.utc)
    artefacts = [
        (str(pdf_path.relative_to(self._config.folder_path)),
         datetime.fromtimestamp(pdf_path.stat().st_mtime, tz=timezone.utc))
        for pdf_path in self._config.folder_path.rglob("*.pdf")
        if last_synced is None
        or datetime.fromtimestamp(pdf_path.stat().st_mtime, tz=timezone.utc) > last_synced
    ]
    artefacts.sort(key=lambda artefact: artefact[1])
    logger.info(f"get_list_artefacts: {len(artefacts)} PDF(s) since {last_synced}.")
    return artefacts
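A typical driver advances its sync cursor from the returned artefacts. Because get_list_artefacts returns artefacts sorted ascending by mtime, the last artefact's timestamp is the new cursor. The sync_once helper below is an illustrative sketch against the documented interface, not part of the library:

```python
from datetime import datetime
from typing import Optional

# Illustrative sync driver for any object exposing the interface above.
def sync_once(source, last_synced: Optional[datetime]):
    artefacts = source.get_list_artefacts(last_synced)
    if not artefacts:
        return [], last_synced          # nothing changed; keep the old cursor
    contents = source.get_content(artefacts)
    return contents, artefacts[-1][1]   # new cursor = newest mtime seen
```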