Text structure extraction (DocumentParserDocling)

Overview

DocumentParserDocling converts raw document files into structured text using the Docling library. It supports PDF, Word, PowerPoint, Excel, HTML, Markdown, and CSV, and produces a ParsedDocument that is consumed by the chunking and embedding stages of the pipeline.


Supported formats

| Extension | Format |
| --- | --- |
| .pdf | PDF (PyPdfium backend) |
| .docx | Word |
| .pptx | PowerPoint |
| .xlsx | Excel |
| .html | HTML |
| .md | Markdown |
| .csv | CSV |
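
The allow-list check implied by this table can be sketched as follows. This is a minimal illustration, not the parser's own helper: the names `SUPPORTED_EXTENSIONS` and `check_extension` are hypothetical, and case-insensitive matching is an assumption.

```python
from pathlib import Path

# Hypothetical allow-list mirroring the table above.
SUPPORTED_EXTENSIONS = frozenset(
    {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".md", ".csv"}
)


def check_extension(name: str) -> str:
    """Return the lowercased extension of *name*, or raise ValueError."""
    suffix = Path(name).suffix.lower()  # assumption: matching ignores case
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file extension {suffix!r} for {name!r}")
    return suffix
```

Only the filename is inspected, so a check like this can run before the file content is read.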

Design notes

Processing flow

DocumentParserDocling follows this processing pattern:

  1. Format validation

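    • The file extension is checked against the allow-list before the document is handed to Docling. (parse() first confirms that the path exists and opens the file; see its source below.)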
    • Unsupported extensions raise ValueError immediately.
  2. Document conversion

    • The input (file path or byte stream) is wrapped in a DocumentStream and handed to Docling's DocumentConverter.
    • Format-specific options are applied (PDF uses the PyPdfium backend; all others use the default pipeline).
    • A hard file-size cap of 64 MB and a 180-second timeout are enforced on every call.
  3. Error handling

    • If Docling reports errors or the resulting document has no pages, a DocumentConversionError is raised.
    • Each failure is wrapped in a ConversionFault that captures the raw ErrorItem list, the document hash, and the file path, so callers have full context for logging or retry logic.
  4. Content extraction

    • A single iterate_items(BODY) walk over the DoclingDocument node graph populates sections, tables, figures, code blocks, list blocks, and footnotes in document order.
    • Section text is accumulated in a buffer and flushed whenever a SectionHeaderItem or the end of the document is reached, producing (title, body_text, tables) tuples.
    • Consecutive LIST_ITEM nodes within the same section are grouped into a single ExtractedListBlock rather than emitted as isolated bullets.
    • A second iterate_items(BODY + FURNITURE) walk collects page headers and footers, deduplicated across pages, into ExtractedFurniture entries.
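
The buffering and list-grouping rules in step 4 can be sketched with plain `(label, text)` pairs standing in for Docling nodes. Everything here is illustrative: the label constants and `group_sections` are hypothetical, tables are left out for brevity, and including list items in the section body text is an assumption.

```python
# Stand-in labels; the real parser walks DoclingDocument nodes instead.
HEADER, TEXT, LIST_ITEM = "section_header", "text", "list_item"


def group_sections(
    nodes: list[tuple[str, str]],
) -> tuple[list[tuple[str, str]], list[tuple[str, list[str]]]]:
    """Return (sections, list_blocks) from (label, text) pairs in document order."""
    sections: list[tuple[str, str]] = []            # (title, body_text)
    list_blocks: list[tuple[str, list[str]]] = []   # (section title, items)
    title, body, run = "", [], []

    def close_run() -> None:
        # Emit the current run of consecutive list items as one block.
        nonlocal run
        if run:
            list_blocks.append((title, run))
            run = []

    def close_section() -> None:
        close_run()
        # The untitled leading section is kept only when it has content.
        if title or body:
            sections.append((title, "\n".join(body)))

    for label, text in nodes:
        if label == HEADER:
            close_section()          # flush the buffer at every section header
            title, body = text, []
        elif label == LIST_ITEM:
            run.append(text)
            body.append(text)
        else:
            close_run()              # any non-list node ends the current run
            body.append(text)
    close_section()                  # flush once more at the end of the document
    return sections, list_blocks
```

A list interrupted by a paragraph therefore yields two blocks within the same section, while uninterrupted items stay together.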

Configuration details

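| Setting | Value | Notes |
| --- | --- | --- |
| PDF backend | PyPdfiumDocumentBackend | Fast, no native deps |
| OCR | Disabled (do_ocr=False) | Set do_ocr=True on PdfPipelineOptions to enable; the constructor shown below hard-codes False and exposes no switch |
| OCR languages | English, Dutch | Configurable via EasyOcrOptions |
| Document timeout | 180 s | Applies to all formats |
| File size limit | 64 MB (67_108_864 bytes) | Enforced in convert() call |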
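
For callers who want to reject oversized inputs before invoking the parser, a pre-check along these lines is one option. It is a sketch under assumptions: `MAX_FILE_SIZE_BYTES`, `stream_size`, and `ensure_within_limit` are hypothetical names, the stream is assumed to be seekable, and raising ValueError is a choice made here, since this page does not say which exception the built-in limit produces.

```python
import io
from typing import IO

MAX_FILE_SIZE_BYTES = 64 * 1024 * 1024  # 64 MB, i.e. 67_108_864 bytes


def stream_size(stream: IO[bytes]) -> int:
    """Return the total size of a seekable stream without moving its cursor."""
    position = stream.tell()
    size = stream.seek(0, io.SEEK_END)  # seek() returns the new absolute position
    stream.seek(position)               # restore the caller's position
    return size


def ensure_within_limit(stream: IO[bytes], limit: int = MAX_FILE_SIZE_BYTES) -> None:
    """Raise ValueError when the stream is larger than *limit* bytes."""
    size = stream_size(stream)
    if size > limit:
        raise ValueError(f"Document is {size} bytes; the limit is {limit} bytes")
```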

Return type — ParsedDocument

ParsedDocument is a frozen dataclass. All fields are populated during extraction (the body walk plus the furniture walk described above), and none can be reassigned after construction.

| Field | Type | Description |
| --- | --- | --- |
| doc | DoclingDocument | Full Docling IR. Retain for downstream access to the raw node graph, bounding boxes, or provenance. |
| name | str | Original filename passed to the converter. |
| sections | list[RawSection] | Body text as (title, text, tables) tuples, grouped by section header. The leading nameless section (content before the first header) is included when non-empty, with "" as its title. Primary input to chunking strategies. |
| tables | list[ExtractedTable] | All body tables with captions, in document order. |
| figures | list[ExtractedFigure] | All pictures with captions, in document order. |
| code_blocks | list[ExtractedCodeBlock] | All CODE-labelled items, attributed to their enclosing section. |
| list_blocks | list[ExtractedListBlock] | Consecutive LIST_ITEM runs grouped per section. |
| footnotes | list[ExtractedFootnote] | All FOOTNOTE-labelled items, in document order. |
| furniture | list[ExtractedFurniture] | Page headers and footers, deduplicated across pages. |
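
What "frozen" means in practice can be shown with a reduced stand-in. `ParsedDocumentSketch` is hypothetical and carries only two of the fields listed above; the real class is not reproduced here.

```python
from dataclasses import FrozenInstanceError, dataclass, field


@dataclass(frozen=True)
class ParsedDocumentSketch:
    """Hypothetical, reduced stand-in for ParsedDocument."""

    name: str
    # (title, text, tables) tuples, as described for the sections field.
    sections: list[tuple[str, str, list]] = field(default_factory=list)


doc = ParsedDocumentSketch(
    name="report.pdf",
    sections=[("", "Preamble.", []), ("Intro", "Body.", [])],
)

try:
    doc.name = "other.pdf"  # rebinding a field on a frozen dataclass fails
except FrozenInstanceError as exc:
    print(f"rejected: {exc}")

# The untitled leading section carries "" as its title.
titles = [title or "(untitled)" for title, _text, _tables in doc.sections]
```

Note that `frozen=True` only blocks attribute reassignment. The lists held by the fields are ordinary Python lists, so callers should still treat them as read-only.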

Error types

DocumentConversionError(ValueError)

Raised when the Docling pipeline fails or the output document is empty.

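# Assumes `parser` is a DocumentParserDocling instance and `stream` is an open binary stream.
try:
    result = parser.parse_stream(name="report.pdf", stream=stream)
except DocumentConversionError as exc:
    # exc.faults holds one ConversionFault per failed document.
    for fault in exc.faults:
        print(fault.path_file_document, fault.hashvalue, fault.faults)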

ConversionFault

Dataclass attached to DocumentConversionError.faults. One entry per failed document.

| Field | Type | Description |
| --- | --- | --- |
| faults | Sequence[ErrorItem] | Raw Docling error items from the pipeline. |
| hashvalue | str | Document hash assigned by Docling. |
| path_file_document | Path | Path or name of the file that failed. |
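
Because DocumentConversionError subclasses ValueError, a handler for ValueError also catches conversion failures. Code that needs to tell an unsupported extension apart from a failed conversion should test for the more specific type first. The sketch below uses hypothetical stand-in classes (the `Sketch` suffix marks them as such); the real constructor signature is not documented on this page, so the one shown is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Sequence


@dataclass
class ConversionFaultSketch:
    """Stand-in for ConversionFault with the three documented fields."""

    faults: Sequence[str]        # the real field holds Docling ErrorItem objects
    hashvalue: str
    path_file_document: Path


class DocumentConversionErrorSketch(ValueError):
    """Stand-in for DocumentConversionError; the signature is an assumption."""

    def __init__(self, message: str, faults: Sequence[ConversionFaultSketch]) -> None:
        super().__init__(message)
        self.faults = list(faults)


def classify(exc: Exception) -> str:
    """Map an exception to a coarse failure category, most specific type first."""
    if isinstance(exc, DocumentConversionErrorSketch):
        return "conversion-failed"
    if isinstance(exc, ValueError):
        return "unsupported-format"
    return "other"
```

The same ordering applies to `except` clauses: place the DocumentConversionError handler before any ValueError handler.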

Docstring vectorize document

DocumentParserDocling

DocumentParserDocling(*, path_dir_artifacts: str | None = None)

Docling implementation of the document parsing pipeline.

Converts a raw document file into a ParsedDocument containing the full Docling IR and all structured content extracted in a single pass.

Mapping

Raw file / byte stream → DoclingDocument → ParsedDocument

Extraction pass

One iterate_items(BODY) walk produces sections, tables, figures, code blocks, list blocks, and footnotes. A second iterate_items(BODY + FURNITURE) walk collects page headers/footers.

Supported formats

PDF, DOCX, PPTX, XLSX, HTML, Markdown, CSV.

Lifecycle

Instantiate once and call parse or parse_stream per document.
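
A batch helper that follows this lifecycle might look like the sketch below. `parse_many` is a hypothetical name, and the parser is passed in as an argument so that the block does not depend on a package import path, which this page does not state.

```python
from typing import Any, Iterable


def parse_many(
    parser: Any, paths: Iterable[str]
) -> tuple[dict[str, Any], dict[str, Exception]]:
    """Parse each path with one shared parser; collect results and failures."""
    parsed: dict[str, Any] = {}
    failed: dict[str, Exception] = {}
    for path in paths:
        try:
            parsed[path] = parser.parse(path)
        except (FileNotFoundError, ValueError) as exc:
            # ValueError also covers DocumentConversionError, which subclasses it.
            failed[path] = exc
    return parsed, failed
```

In real use, the first argument would be a single DocumentParserDocling instance created once and reused for every path.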

Parameters

path_dir_artifacts : str | None When provided, Docling loads ML model artefacts from this directory instead of downloading them at runtime.

Methods:

  • parse

    Convert a file on disk and extract all content types.

  • parse_stream

    Convert an in-memory byte stream and extract all content types.

Source code in src/database_builder_libs/utility/extract/document_parser_docling.py
def __init__(self, *, path_dir_artifacts: str | None = None) -> None:
    pdf_opts = PdfPipelineOptions(
        artifacts_path=path_dir_artifacts,
        do_ocr=False,
        document_timeout=180,
        ocr_options=EasyOcrOptions(download_enabled=False, lang=["en", "nl"]),
    )
    default_opts = PipelineOptions(document_timeout=180)
    self._converter = DocumentConverter(
        allowed_formats=_ALLOWED_FORMATS,
        format_options={
            InputFormat.CSV:  CsvFormatOption(pipeline_options=default_opts),
            InputFormat.DOCX: WordFormatOption(pipeline_options=default_opts),
            InputFormat.HTML: HTMLFormatOption(pipeline_options=default_opts),
            InputFormat.MD:   MarkdownFormatOption(pipeline_options=default_opts),
            InputFormat.PDF:  PdfFormatOption(pipeline_options=pdf_opts, backend=PyPdfiumDocumentBackend),
            InputFormat.PPTX: PowerpointFormatOption(pipeline_options=default_opts),
            InputFormat.XLSX: ExcelFormatOption(pipeline_options=default_opts),
        },
    )

parse

parse(path: str) -> ParsedDocument

Convert a file on disk and extract all content types.

Parameters

path : str Absolute or relative path to the document file.

Returns

ParsedDocument

Raises

FileNotFoundError If path does not point to an existing file.
ValueError If the file extension is not supported.
DocumentConversionError If the Docling pipeline reports errors or produces an empty document.

Source code in src/database_builder_libs/utility/extract/document_parser_docling.py
def parse(self, path: str) -> ParsedDocument:
    """
    Convert a file on disk and extract all content types.

    Parameters
    ----------
    path : str
        Absolute or relative path to the document file.

    Returns
    -------
    ParsedDocument

    Raises
    ------
    FileNotFoundError
        If *path* does not point to an existing file.
    ValueError
        If the file extension is not supported.
    DocumentConversionError
        If the Docling pipeline reports errors or produces an empty document.
    """
    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(f"Document not found: '{path}'")
    with file_path.open("rb") as fh:
        return self._convert_and_extract(name=file_path.name, stream=fh)

parse_stream

parse_stream(name: str, stream: IO[bytes]) -> ParsedDocument

Convert an in-memory byte stream and extract all content types.

Useful when the document is not stored on disk (e.g. downloaded from an API or read from object storage).

Parameters

name : str Filename including extension (e.g. "report.pdf"). Used by Docling to determine the input format.
stream : IO[bytes] Readable byte stream of the document content.

Returns

ParsedDocument

Raises

ValueError If the file extension is not supported.
DocumentConversionError If conversion fails or produces an empty document.

Source code in src/database_builder_libs/utility/extract/document_parser_docling.py
def parse_stream(self, name: str, stream: IO[bytes]) -> ParsedDocument:
    """
    Convert an in-memory byte stream and extract all content types.

    Useful when the document is not stored on disk (e.g. downloaded from
    an API or read from object storage).

    Parameters
    ----------
    name : str
        Filename including extension (e.g. ``"report.pdf"``).
        Used by Docling to determine the input format.
    stream : IO[bytes]
        Readable byte stream of the document content.

    Returns
    -------
    ParsedDocument

    Raises
    ------
    ValueError
        If the file extension is not supported.
    DocumentConversionError
        If conversion fails or produces an empty document.
    """
    return self._convert_and_extract(name=name, stream=stream)
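
When the document arrives as raw bytes (for example, the body of an HTTP response), it can be wrapped in an in-memory stream before calling parse_stream. The helper below is a hypothetical convenience wrapper, not part of the library; the parser is passed in as an argument so the sketch stays independent of the package import path.

```python
import io
from typing import Any


def parse_bytes(parser: Any, name: str, payload: bytes) -> Any:
    """Wrap raw bytes in a BytesIO and hand them to parser.parse_stream."""
    # The name must keep its extension, since it determines the input format.
    with io.BytesIO(payload) as stream:
        return parser.parse_stream(name=name, stream=stream)
```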