Text structure extraction (DocumentParserDocling)¶
Overview¶
DocumentParserDocling converts raw document files into structured text using the
Docling library. It supports PDF, Word, PowerPoint,
Excel, HTML, Markdown, and CSV, and produces a ParsedDocument that is consumed by the
chunking and embedding stages of the pipeline.
Supported formats¶
| Extension | Format |
|---|---|
| .pdf | PDF (PyPdfium backend) |
| .docx | Word |
| .pptx | PowerPoint |
| .xlsx | Excel |
| .html | HTML |
| .md | Markdown |
| .csv | CSV |
Design notes¶
Processing flow¶
DocumentParserDocling follows this processing pattern:
- Format validation
    - The file extension is checked against the allow-list before any I/O takes place.
    - Unsupported extensions raise ValueError immediately.
- Document conversion
    - The input (file path or byte stream) is wrapped in a DocumentStream and handed to Docling's DocumentConverter.
    - Format-specific options are applied (PDF uses the PyPdfium backend; all others use the default pipeline).
    - A hard file-size cap of 64 MB and a 180-second timeout are enforced on every call.
- Error handling
    - If Docling reports errors or the resulting document has no pages, a DocumentConversionError is raised.
    - Each failure is wrapped in a ConversionFault that captures the raw ErrorItem list, the document hash, and the file path, so callers have full context for logging or retry logic.
- Content extraction
    - A single iterate_items(BODY) walk over the DoclingDocument node graph populates sections, tables, figures, code blocks, list blocks, and footnotes in document order.
    - Section text is accumulated in a buffer and flushed whenever a SectionHeaderItem or the end of the document is reached, producing (title, body_text, tables) tuples.
    - Consecutive LIST_ITEM nodes within the same section are grouped into a single ExtractedListBlock rather than emitted as isolated bullets.
    - A second iterate_items(BODY + FURNITURE) walk collects page headers and footers, deduplicated across pages, into ExtractedFurniture entries.
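The buffering and grouping rules of the content extraction step can be illustrated without Docling. The following is a minimal sketch under stated assumptions, not the parser's actual implementation: Item, group_sections, and group_list_runs are hypothetical names, the label strings stand in for Docling's item types, and sections are reduced to (title, body_text) pairs with tables left out.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a Docling node: only a label and its text.
@dataclass
class Item:
    label: str  # "section_header", "text", or "list_item" (illustrative labels)
    text: str


def group_sections(items):
    """Return (title, body_text) pairs, flushing at each header and at the end."""
    sections = []
    title, buffer = "", []

    def flush():
        # The leading untitled section is kept only when it has content.
        if title or buffer:
            sections.append((title, "\n".join(buffer)))

    for item in items:
        if item.label == "section_header":
            flush()  # close the section that was being accumulated
            title, buffer = item.text, []
        else:
            buffer.append(item.text)
    flush()  # end of document closes the last section
    return sections


def group_list_runs(items):
    """Group consecutive list items into runs; any other item ends the run."""
    runs, current = [], []
    for item in items:
        if item.label == "list_item":
            current.append(item.text)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs
```

Because a section header is itself a non-list item, it ends any open run, so a run never spans two sections in this sketch.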
Configuration details¶
| Setting | Value | Notes |
|---|---|---|
| PDF backend | PyPdfiumDocumentBackend | Fast, no native deps |
| OCR | Disabled by default | Enable by passing do_ocr=True |
| OCR languages | English, Dutch | Configurable via EasyOcrOptions |
| Document timeout | 180 s | Applies to all formats |
| File size limit | 64 MB (67_108_864 bytes) | Enforced in the convert() call |
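The allow-list and the size cap can also be checked by the caller before handing a file to the parser. The sketch below is an assumption-laden illustration: check_input and the two constants are hypothetical names, the case-insensitive extension match is an assumption, and the parser itself enforces the size cap inside convert() rather than with this function.

```python
from pathlib import Path

# Values copied from the tables above; the constant names are hypothetical.
SUPPORTED_EXTENSIONS = frozenset(
    {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".md", ".csv"}
)
MAX_FILE_SIZE_BYTES = 64 * 1024 * 1024  # 67_108_864 bytes


def check_input(name: str, size_bytes: int) -> None:
    """Raise ValueError if the extension is unsupported or the file is too large."""
    suffix = Path(name).suffix.lower()  # lower-casing is an assumption
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported extension: {suffix!r}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        raise ValueError(
            f"{name} is {size_bytes} bytes; the limit is {MAX_FILE_SIZE_BYTES}"
        )
```

A pre-check like this lets a batch job reject oversized uploads cheaply, while the cap inside the converter remains the authoritative limit.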
Return type — ParsedDocument¶
ParsedDocument is a frozen dataclass. All fields are populated in a single extraction
pass and are immutable after construction.
| Field | Type | Description |
|---|---|---|
| doc | DoclingDocument | Full Docling IR. Retain for downstream access to the raw node graph, bounding boxes, or provenance. |
| name | str | Original filename passed to the converter. |
| sections | list[RawSection] | Body text as (title, text, tables) tuples, grouped by section header. The leading nameless section (content before the first header) is included when non-empty, with "" as its title. Primary input to chunking strategies. |
| tables | list[ExtractedTable] | All body tables with captions, in document order. |
| figures | list[ExtractedFigure] | All pictures with captions, in document order. |
| code_blocks | list[ExtractedCodeBlock] | All CODE-labelled items, attributed to their enclosing section. |
| list_blocks | list[ExtractedListBlock] | Consecutive LIST_ITEM runs grouped per section. |
| footnotes | list[ExtractedFootnote] | All FOOTNOTE-labelled items, in document order. |
| furniture | list[ExtractedFurniture] | Page headers and footers, deduplicated across pages. |
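To illustrate how a frozen dataclass behaves and how sections might feed a chunking step, here is a small sketch. ParsedDocumentSketch and section_chunks are hypothetical stand-ins, not the library's types, and sections are simplified to (title, text) pairs without tables.

```python
from dataclasses import FrozenInstanceError, dataclass, field


# Hypothetical, reduced stand-in for ParsedDocument (two fields only).
@dataclass(frozen=True)
class ParsedDocumentSketch:
    name: str
    sections: list[tuple[str, str]] = field(default_factory=list)


def section_chunks(doc):
    """Yield one text chunk per section, prefixing the title when present."""
    for title, text in doc.sections:
        yield f"{title}\n{text}" if title else text


doc = ParsedDocumentSketch(
    name="report.pdf",
    sections=[("", "Preamble text."), ("Results", "Body text.")],
)
chunks = list(section_chunks(doc))

try:
    doc.name = "other.pdf"  # rebinding a field on a frozen dataclass fails
except FrozenInstanceError:
    rebinding_blocked = True
```

Note that frozen=True only blocks attribute reassignment in Python; it does not freeze the contents of list fields, so consumers should treat those lists as read-only.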
Error types¶
DocumentConversionError(ValueError)¶
Raised when the Docling pipeline fails or the output document is empty.
```python
try:
    result = parser.parse_stream(name="report.pdf", stream=stream)
except DocumentConversionError as exc:
    # Each fault carries the file path, the Docling hash, and the raw error items.
    for fault in exc.faults:
        print(fault.path_file_document, fault.hashvalue, fault.faults)
```
ConversionFault¶
Dataclass attached to DocumentConversionError.faults. One entry per failed document.
| Field | Type | Description |
|---|---|---|
| faults | Sequence[ErrorItem] | Raw Docling error items from the pipeline. |
| hashvalue | str | Document hash assigned by Docling. |
| path_file_document | Path | Path or name of the file that failed. |
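Because DocumentConversionError subclasses ValueError, the order of except clauses matters when a caller wants to tell a failed conversion apart from an unsupported extension. The sketch below uses stand-in classes that only mirror the fields documented above; the constructor signature and describe_failure are assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Sequence


# Stand-ins mirroring the documented fields; not the library's own classes.
@dataclass
class ConversionFault:
    faults: Sequence[Any]
    hashvalue: str
    path_file_document: Path


class DocumentConversionError(ValueError):
    def __init__(self, message: str, faults: Sequence[ConversionFault]):
        super().__init__(message)
        self.faults = list(faults)  # assumed attribute layout


def describe_failure(action: Callable[[], Any]) -> str:
    """Run action and report which kind of failure occurred, if any."""
    try:
        action()
    except DocumentConversionError as exc:
        # Must come before ValueError, otherwise the broader clause wins.
        return f"conversion failed: {len(exc.faults)} fault(s)"
    except ValueError as exc:
        return f"rejected: {exc}"
    return "ok"
```

Swapping the two except clauses would route every conversion failure into the ValueError branch and hide the faults list.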
Docstring vectorize document¶
DocumentParserDocling¶
DocumentParserDocling(*, path_dir_artifacts: str | None = None)
Docling implementation of the document parsing pipeline.
Converts a raw document file into a ParsedDocument containing
the full Docling IR and all structured content extracted in a single pass.
Mapping¶
Raw file / byte stream → DoclingDocument → ParsedDocument
Extraction pass¶
One iterate_items(BODY) walk produces sections, tables, figures,
code blocks, list blocks, and footnotes. A second
iterate_items(BODY + FURNITURE) walk collects page headers/footers.
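One simple way to deduplicate repeated page headers and footers while keeping their first-seen order is shown below. This is only a sketch of the idea: dedupe_furniture is a hypothetical helper, entries are modelled as (kind, text) tuples, and the parser's actual deduplication key is not documented here.

```python
def dedupe_furniture(entries):
    """Keep the first occurrence of each (kind, text) pair, preserving order."""
    # dict keys are unique and keep insertion order, so this drops repeats only.
    return list(dict.fromkeys(entries))


pages = [
    ("page_header", "ACME Annual Report"),
    ("page_footer", "Confidential"),
    ("page_header", "ACME Annual Report"),  # repeated on the next page
    ("page_footer", "Confidential"),
]
unique = dedupe_furniture(pages)
```

Exact-match keys would not merge footers that embed a page number; handling that case would need a normalisation step that this sketch does not attempt.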
Supported formats¶
PDF, DOCX, PPTX, XLSX, HTML, Markdown, CSV.
Lifecycle¶
Instantiate once and call parse or parse_stream per document.
Parameters¶
path_dir_artifacts : str | None
When provided, Docling loads ML model artefacts from this directory instead of downloading them at runtime.
Methods:
- parse – Convert a file on disk and extract all content types.
- parse_stream – Convert an in-memory byte stream and extract all content types.
Source code in src/database_builder_libs/utility/extract/document_parser_docling.py
parse¶
parse(path: str) -> ParsedDocument
Convert a file on disk and extract all content types.
Parameters¶
path : str
Absolute or relative path to the document file.
Returns¶
ParsedDocument
Raises¶
FileNotFoundError
If path does not point to an existing file.
ValueError
If the file extension is not supported.
DocumentConversionError
If the Docling pipeline reports errors or produces an empty document.
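A batch job can reuse one parser instance and collect failures instead of stopping at the first bad file. The helper below is a hedged sketch: parse_many is a hypothetical name, and the parser argument is duck-typed so that anything exposing parse(path) works.

```python
from typing import Any, Iterable


def parse_many(parser: Any, paths: Iterable[str]):
    """Parse each path with the same parser; collect failures instead of stopping."""
    parsed, failed = {}, {}
    for path in paths:
        try:
            parsed[path] = parser.parse(path)
        except (FileNotFoundError, ValueError) as exc:
            # ValueError also covers DocumentConversionError, which subclasses it.
            failed[path] = exc
    return parsed, failed
```

Callers can then inspect the failed mapping and decide per exception type whether a retry makes sense.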
parse_stream¶
Convert an in-memory byte stream and extract all content types.
Useful when the document is not stored on disk (e.g. downloaded from an API or read from object storage).
Parameters¶
name : str
Filename including extension (e.g. "report.pdf").
Used by Docling to determine the input format.
stream : IO[bytes]
Readable byte stream of the document content.
Returns¶
ParsedDocument
Raises¶
ValueError
If the file extension is not supported.
DocumentConversionError
If conversion fails or produces an empty document.
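When the document arrives as raw bytes (for example, the body of an HTTP response), it can be wrapped in io.BytesIO before calling parse_stream. The helper below is a minimal sketch: parse_bytes is a hypothetical name, and the parser is duck-typed so the example stays independent of the library.

```python
import io
from typing import Any


def parse_bytes(parser: Any, name: str, payload: bytes) -> Any:
    """Wrap raw bytes in a BytesIO and forward them to parser.parse_stream."""
    # name must keep its extension, since it determines the input format.
    return parser.parse_stream(name=name, stream=io.BytesIO(payload))
```

Keeping the original filename, extension included, matters because the format is inferred from it rather than from the byte content.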