Text structure extraction (DocumentParserDocling)¶

Overview¶

DocumentParserDocling converts raw document files into structured text using the Docling library. It supports PDF, Word, PowerPoint, Excel, HTML, Markdown, and CSV, and produces a ParsedDocument that is consumed by the chunking and embedding stages of the pipeline.

Supported formats¶

Extension	Format
`.pdf`	PDF (PyPdfium backend)
`.docx`	Word
`.pptx`	PowerPoint
`.xlsx`	Excel
`.html`	HTML
`.md`	Markdown
`.csv`	CSV
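
The table above can be read as an extension allow-list. The sketch below shows one way such a check might look; `SUPPORTED_EXTENSIONS` and `validate_extension` are hypothetical names, and the case-insensitive comparison is an assumption rather than documented parser behaviour.

```python
from pathlib import PurePath

# Hypothetical allow-list mirroring the table above; the parser's own constant may differ.
SUPPORTED_EXTENSIONS = frozenset(
    {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".md", ".csv"}
)


def validate_extension(name: str) -> str:
    """Return the lower-cased extension of *name*, or raise ValueError."""
    # Assumption: extensions are compared case-insensitively.
    suffix = PurePath(name).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file extension {suffix!r} for {name!r}")
    return suffix
```

Because the check only inspects the name, it can run before any bytes are read.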

Design notes¶

Processing flow¶

DocumentParserDocling follows this processing pattern:

1. Format validation
   - The file extension is checked against the allow-list before any I/O takes place.
   - Unsupported extensions raise ValueError immediately.
2. Document conversion
   - The input (file path or byte stream) is wrapped in a DocumentStream and handed to Docling's DocumentConverter.
   - Format-specific options are applied (PDF uses the PyPdfium backend; all others use the default pipeline).
   - A hard file-size cap of 64 MB and a 180-second timeout are enforced on every call.
3. Error handling
   - If Docling reports errors or the resulting document has no pages, a DocumentConversionError is raised.
   - Each failure is wrapped in a ConversionFault that captures the raw ErrorItem list, the document hash, and the file path, so callers have full context for logging or retry logic.
4. Content extraction
   - A single iterate_items(BODY) walk over the DoclingDocument node graph populates sections, tables, figures, code blocks, list blocks, and footnotes in document order.
   - Section text is accumulated in a buffer and flushed whenever a SectionHeaderItem or the end of the document is reached, producing (title, text, tables) tuples.
   - Consecutive LIST_ITEM nodes within the same section are grouped into a single ExtractedListBlock rather than emitted as isolated bullets.
   - A second iterate_items(BODY + FURNITURE) walk collects page headers and footers, deduplicated across pages, into ExtractedFurniture entries.
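
The buffering and list-grouping rules in the content extraction step can be sketched without Docling. The example below models the node stream as plain `(label, text)` tuples; the labels, `ListBlock`, and `walk` are hypothetical stand-ins. It also assumes that list items contribute to the section body and that a titled section is emitted even when its body is empty, neither of which is confirmed by the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ListBlock:
    """Hypothetical stand-in for ExtractedListBlock."""
    section: str
    items: list = field(default_factory=list)


def walk(nodes):
    """Group a flat (label, text) stream into sections and list blocks."""
    sections = []        # (title, text, tables) tuples
    list_blocks = []
    title, body, tables = "", [], []
    current_list = None  # open run of consecutive list items, if any

    def flush():
        # Emit the buffered section; an empty untitled leading section is skipped.
        if title or body or tables:
            sections.append((title, "\n".join(body), list(tables)))

    for label, text in nodes:
        if label != "list_item":
            current_list = None  # any other node ends the current run
        if label == "section_header":
            flush()
            title, body, tables = text, [], []
        elif label == "table":
            tables.append(text)
        elif label == "list_item":
            if current_list is None:
                current_list = ListBlock(section=title)
                list_blocks.append(current_list)
            current_list.items.append(text)
            body.append(text)  # assumption: list text is part of the section body
        else:
            body.append(text)
    flush()  # the end of the document also flushes the buffer
    return sections, list_blocks
```

A run of list items is closed by any non-list node, so two lists separated by a paragraph in the same section yield two blocks.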

Configuration details¶

Setting	Value	Notes
PDF backend	`PyPdfiumDocumentBackend`	Fast, no native deps
OCR	Disabled (`do_ocr=False`)	Set in `PdfPipelineOptions`; the constructor does not expose a `do_ocr` argument
OCR languages	English, Dutch	Configurable via `EasyOcrOptions`
Document timeout	180 s	Applies to all formats
File size limit	64 MB (`67_108_864` bytes)	Enforced in `convert()` call
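
The byte count in the table equals 64 × 1024 × 1024. Callers who want to reject oversized input before handing it to the parser could measure a seekable stream first. This is only a sketch: `MAX_FILE_SIZE`, `stream_size`, and `check_size` are hypothetical helpers, not part of the parser, and the approach assumes the stream supports `seek` and `tell`.

```python
import io
from typing import IO

MAX_FILE_SIZE = 64 * 1024 * 1024  # 67_108_864 bytes, matching the table above


def stream_size(stream: IO[bytes]) -> int:
    """Return the total size of a seekable stream and restore its position."""
    position = stream.tell()
    try:
        return stream.seek(0, io.SEEK_END)  # seek returns the new absolute offset
    finally:
        stream.seek(position)


def check_size(stream: IO[bytes], limit: int = MAX_FILE_SIZE) -> None:
    """Raise ValueError when the stream is larger than *limit* bytes."""
    size = stream_size(stream)
    if size > limit:
        raise ValueError(f"Document is {size} bytes; the limit is {limit} bytes")
```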

Return type — `ParsedDocument`¶

ParsedDocument is a frozen dataclass. All fields are populated during a single extraction call (the body walk followed by the furniture walk), and fields cannot be reassigned after construction.

Field	Type	Description
`doc`	`DoclingDocument`	Full Docling IR. Retain for downstream access to the raw node graph, bounding boxes, or provenance.
`name`	`str`	Original filename passed to the converter.
`sections`	`list[RawSection]`	Body text as `(title, text, tables)` tuples, grouped by section header. The leading nameless section (content before the first header) is included when non-empty, with `""` as its title. Primary input to chunking strategies.
`tables`	`list[ExtractedTable]`	All body tables with captions, in document order.
`figures`	`list[ExtractedFigure]`	All pictures with captions, in document order.
`code_blocks`	`list[ExtractedCodeBlock]`	All `CODE`-labelled items, attributed to their enclosing section.
`list_blocks`	`list[ExtractedListBlock]`	Consecutive `LIST_ITEM` runs grouped per section.
`footnotes`	`list[ExtractedFootnote]`	All `FOOTNOTE`-labelled items, in document order.
`furniture`	`list[ExtractedFurniture]`	Page headers and footers, deduplicated across pages.
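
What "frozen" means for a dataclass with list fields can be shown with a reduced stand-in. `ParsedDocumentSketch` below is hypothetical and carries only two of the fields from the table. In standard Python, `frozen=True` blocks rebinding a field but does not freeze the list objects the fields point to.

```python
from dataclasses import FrozenInstanceError, dataclass, field


@dataclass(frozen=True)
class ParsedDocumentSketch:
    # Hypothetical two-field stand-in for the real class.
    name: str
    sections: list = field(default_factory=list)


doc = ParsedDocumentSketch(name="report.pdf", sections=[("", "Intro text", [])])

rebinding_blocked = False
try:
    doc.name = "other.pdf"  # rebinding a field is rejected
except FrozenInstanceError:
    rebinding_blocked = True

# The list itself is still mutable in place.
doc.sections.append(("Appendix", "Extra text", []))
```

Code that needs an isolated view of the sections may therefore prefer to copy the list before modifying it.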

Error types¶

`DocumentConversionError(ValueError)`¶

Raised when the Docling pipeline fails or the output document is empty.

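# `parser` is a DocumentParserDocling instance; `stream` is a readable binary stream.
try:
    result = parser.parse_stream(name="report.pdf", stream=stream)
except DocumentConversionError as exc:
    # Each ConversionFault describes one failed document.
    for fault in exc.faults:
        print(fault.path_file_document, fault.hashvalue, fault.faults)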

`ConversionFault`¶

Dataclass attached to DocumentConversionError.faults. One entry per failed document.

Field	Type	Description
`faults`	`Sequence[ErrorItem]`	Raw Docling error items from the pipeline.
`hashvalue`	`str`	Document hash assigned by Docling.
`path_file_document`	`Path`	Path or name of the file that failed.
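
The relationship between the exception and its fault records can be sketched with stand-in classes. The `...Sketch` names and `summarize` are hypothetical, plain strings replace Docling's ErrorItem objects, and the constructor signature is an assumption; only the field names and the ValueError base class come from the tables above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Sequence


@dataclass
class ConversionFaultSketch:
    faults: Sequence[str]  # stand-in for Docling ErrorItem objects
    hashvalue: str
    path_file_document: Path


class DocumentConversionErrorSketch(ValueError):
    # Assumed constructor; the real signature is not documented here.
    def __init__(self, message: str, faults: Sequence[ConversionFaultSketch]) -> None:
        super().__init__(message)
        self.faults = list(faults)


def summarize(exc: DocumentConversionErrorSketch) -> list:
    """Build one log line per failed document."""
    return [
        f"{fault.path_file_document.name} [{fault.hashvalue}]: "
        f"{len(fault.faults)} error(s)"
        for fault in exc.faults
    ]
```

Because the exception derives from ValueError, a bare `except ValueError` also catches it. To treat conversion failures differently from unsupported extensions, place the more specific `except` clause first.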

Docstring vectorize document¶

DocumentParserDocling ¶

DocumentParserDocling(*, path_dir_artifacts: str | None = None)

Docling implementation of the document parsing pipeline.

Converts a raw document file into a ParsedDocument containing the full Docling IR and all structured content extracted in a single pass.

Mapping¶

Raw file / byte stream → DoclingDocument → ParsedDocument

Extraction pass¶

One iterate_items(BODY) walk produces sections, tables, figures, code blocks, list blocks, and footnotes. A second iterate_items(BODY + FURNITURE) walk collects page headers/footers.
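
Deduplicating page headers and footers amounts to keeping the first occurrence of each repeated string. The sketch below assumes furniture arrives as `(kind, text, page)` tuples and that whitespace is normalised before comparison. Both are illustrative assumptions, and `dedupe_furniture` is a hypothetical helper.

```python
def dedupe_furniture(items):
    """Keep the first occurrence of each (kind, text) pair, in document order."""
    seen = set()
    unique = []
    for kind, text, page in items:
        normalized = " ".join(text.split())  # assumption: whitespace is not significant
        key = (kind, normalized)
        if key in seen:
            continue
        seen.add(key)
        unique.append((kind, normalized, page))
    return unique
```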

Supported formats¶

PDF, DOCX, PPTX, XLSX, HTML, Markdown, CSV.

Lifecycle¶

Instantiate once and call parse or parse_stream per document.

Parameters¶

path_dir_artifacts : str | None
    When provided, Docling loads ML model artefacts from this directory instead of downloading them at runtime.

Methods:

- parse: Convert a file on disk and extract all content types.
- parse_stream: Convert an in-memory byte stream and extract all content types.

Source code in src/database_builder_libs/utility/extract/document_parser_docling.py

def __init__(self, *, path_dir_artifacts: str | None = None) -> None:
    pdf_opts = PdfPipelineOptions(
        artifacts_path=path_dir_artifacts,
        do_ocr=False,
        document_timeout=180,
        ocr_options=EasyOcrOptions(download_enabled=False, lang=["en", "nl"]),
    )
    default_opts = PipelineOptions(document_timeout=180)
    self._converter = DocumentConverter(
        allowed_formats=_ALLOWED_FORMATS,
        format_options={
            InputFormat.CSV:  CsvFormatOption(pipeline_options=default_opts),
            InputFormat.DOCX: WordFormatOption(pipeline_options=default_opts),
            InputFormat.HTML: HTMLFormatOption(pipeline_options=default_opts),
            InputFormat.MD:   MarkdownFormatOption(pipeline_options=default_opts),
            InputFormat.PDF:  PdfFormatOption(pipeline_options=pdf_opts, backend=PyPdfiumDocumentBackend),
            InputFormat.PPTX: PowerpointFormatOption(pipeline_options=default_opts),
            InputFormat.XLSX: ExcelFormatOption(pipeline_options=default_opts),
        },
    )

parse ¶

parse(path: str) -> ParsedDocument

Convert a file on disk and extract all content types.

Parameters¶

path : str Absolute or relative path to the document file.

Returns¶

ParsedDocument

Raises¶

FileNotFoundError
    If path does not point to an existing file.
ValueError
    If the file extension is not supported.
DocumentConversionError
    If the Docling pipeline reports errors or produces an empty document.

Source code in src/database_builder_libs/utility/extract/document_parser_docling.py

def parse(self, path: str) -> ParsedDocument:
    """
    Convert a file on disk and extract all content types.

    Parameters
    ----------
    path : str
        Absolute or relative path to the document file.

    Returns
    -------
    ParsedDocument

    Raises
    ------
    FileNotFoundError
        If *path* does not point to an existing file.
    ValueError
        If the file extension is not supported.
    DocumentConversionError
        If the Docling pipeline reports errors or produces an empty document.
    """
    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(f"Document not found: '{path}'")
    with file_path.open("rb") as fh:
        return self._convert_and_extract(name=file_path.name, stream=fh)

parse_stream ¶

parse_stream(name: str, stream: IO[bytes]) -> ParsedDocument

Convert an in-memory byte stream and extract all content types.

Useful when the document is not stored on disk (e.g. downloaded from an API or read from object storage).
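
For content that already sits in memory as `bytes`, a thin wrapper can build the stream. The sketch below is illustrative: `StreamParser` and `parse_bytes` are hypothetical names, and the protocol stands in for any object with a compatible `parse_stream` method, so the example runs without Docling installed.

```python
import io
from typing import IO, Any, Protocol


class StreamParser(Protocol):
    """Anything exposing parse_stream(name, stream), e.g. the parser above."""

    def parse_stream(self, name: str, stream: IO[bytes]) -> Any: ...


def parse_bytes(parser: StreamParser, name: str, payload: bytes) -> Any:
    """Wrap raw bytes in a BytesIO and pass them to parser.parse_stream."""
    # `name` must keep its extension, since it determines the input format.
    with io.BytesIO(payload) as stream:
        return parser.parse_stream(name=name, stream=stream)
```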

Parameters¶

name : str
    Filename including extension (e.g. "report.pdf"). Used by Docling to determine the input format.
stream : IO[bytes]
    Readable byte stream of the document content.

Returns¶

ParsedDocument

Raises¶

ValueError
    If the file extension is not supported.
DocumentConversionError
    If conversion fails or produces an empty document.

Source code in src/database_builder_libs/utility/extract/document_parser_docling.py

def parse_stream(self, name: str, stream: IO[bytes]) -> ParsedDocument:
    """
    Convert an in-memory byte stream and extract all content types.

    Useful when the document is not stored on disk (e.g. downloaded from
    an API or read from object storage).

    Parameters
    ----------
    name : str
        Filename including extension (e.g. ``"report.pdf"``).
        Used by Docling to determine the input format.
    stream : IO[bytes]
        Readable byte stream of the document content.

    Returns
    -------
    ParsedDocument

    Raises
    ------
    ValueError
        If the file extension is not supported.
    DocumentConversionError
        If conversion fails or produces an empty document.
    """
    return self._convert_and_extract(name=name, stream=stream)
