Chunking strategies¶

Overview¶

The AbstractChunkingStrategy class defines the contract that all concrete chunking implementations must satisfy. It encapsulates the transformation of raw document sections into a flat, ordered list of Chunk objects that are ready for embedding and indexing. By adhering to this interface, different splitting approaches — section-based, fixed-size, sliding-window, or summary-prefixed — can be swapped interchangeably while the rest of the pipeline remains agnostic to the underlying strategy.

Design notes¶

Interaction pattern¶

AbstractChunkingStrategy follows a single-phase transformation pattern:

Input — an ordered sequence of RawSection tuples, each carrying a section title, body text, and any tables extracted from that section. These are produced directly by DocumentParserDocling and passed in without further transformation.
Chunking — the strategy splits, merges, or partitions the sections according to its own logic and returns a flat list of Chunk objects.
Output — every returned Chunk has a non-empty text, a stable chunk_index starting from 0, the caller-supplied document_id, and an empty vector list. Embedding is a separate downstream concern.
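The three steps above can be sketched as a minimal concrete strategy. This is an illustration only: the `Chunk` dataclass and `RawSection` alias below are hypothetical stand-ins whose exact definitions in the library may differ, and `ParagraphChunkingStrategy` is an invented example class, not one of the shipped implementations.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional, Sequence

# Hypothetical stand-ins for the library's own types; real definitions may differ.
RawSection = tuple[str, str, list[Any]]  # assumed shape: (title, text, tables)


@dataclass
class Chunk:
    document_id: str
    chunk_index: int
    text: str
    vector: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


class AbstractChunkingStrategy(ABC):
    @abstractmethod
    def chunk(
        self,
        sections: Sequence[RawSection],
        *,
        document_id: str,
        summary: Optional[str] = None,
    ) -> list[Chunk]:
        raise NotImplementedError


class ParagraphChunkingStrategy(AbstractChunkingStrategy):
    """Illustrative only: one chunk per blank-line-separated paragraph."""

    def chunk(self, sections, *, document_id, summary=None):
        chunks: list[Chunk] = []
        for title, text, _tables in sections:
            for paragraph in text.split("\n\n"):
                body = paragraph.strip()
                if not body:
                    continue  # the contract requires non-empty text
                chunks.append(
                    Chunk(
                        document_id=document_id,
                        chunk_index=len(chunks),  # 0, 1, 2, ... in output order
                        text=body,
                        metadata={"section_title": title},
                    )
                )
        return chunks


sections = [("Intro", "First paragraph.\n\nSecond paragraph.", []), ("Empty", "   ", [])]
chunks = ParagraphChunkingStrategy().chunk(sections, document_id="doc-1")
```

Because the caller only depends on `chunk()`, swapping this class for any other strategy requires no change to the surrounding pipeline code.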

Choosing a strategy¶

Class	Chunks produced	When to use
`SectionChunkingStrategy`	One per section	Clean heading structure; sections are already semantically coherent
`FixedSizeChunkingStrategy`	One or more per section	Uniform context window needed; no overlap required
`SlidingWindowChunkingStrategy`	More than fixed-size due to overlap	Boundary recall matters; dense technical text with cross-boundary sentences
`SummaryAndSectionsStrategy`	One per section (+ 1 summary if provided)	Section structure must be preserved with an optional LLM-generated summary chunk prepended

Common chunk fields¶

Every Chunk returned by any strategy has the following fields:

Field	Type	Description
`document_id`	`str`	The `document_id` passed to `.chunk()`, unchanged.
`chunk_index`	`int`	Monotonically increasing from 0.
`text`	`str`	Non-empty chunk body.
`vector`	`list`	Always `[]` until the embedding stage populates it.
`metadata`	`dict`	Strategy-specific; see each strategy below.
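The field guarantees in the table can be checked mechanically, for example in a unit test for a new strategy. The sketch below uses a hypothetical `Chunk` stand-in and an invented helper, `check_common_fields`. It also assumes indices are consecutive (0, 1, 2, ...), which is stricter than "monotonically increasing".

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:  # hypothetical stand-in mirroring the fields table
    document_id: str
    chunk_index: int
    text: str
    vector: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def check_common_fields(chunks: list, document_id: str) -> list:
    """Return human-readable violations; an empty list means every check passed."""
    problems = []
    for position, chunk in enumerate(chunks):
        if chunk.document_id != document_id:
            problems.append(f"chunk {position}: document_id changed")
        # Assumption: indices are consecutive, so index must equal list position.
        if chunk.chunk_index != position:
            problems.append(f"chunk {position}: chunk_index is {chunk.chunk_index}")
        if not chunk.text.strip():
            problems.append(f"chunk {position}: empty text")
        if chunk.vector != []:
            problems.append(f"chunk {position}: vector already populated")
    return problems
```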

Implementations¶

SectionChunkingStrategy¶

Produces exactly one chunk per non-empty document section. This is the simplest strategy and maps cleanly onto the heading structure that Docling extracts. It is the right default when sections are already semantically coherent units such as academic papers or reports.

Parameters

Parameter	Type	Default	Description
`min_chars`	`int`	`20`	Sections shorter than this after stripping are silently dropped, preventing index pollution from stub sections such as lone headings with no body.
`include_title_in_text`	`bool`	`False`	When `True`, the section title is prepended to the chunk text as `"<title>\n<text>"`. Useful when the title adds retrieval signal that does not appear in the body.

Metadata fields

Key	Type	Description
`section_title`	`str`	Title of the source section.
`has_tables`	`bool`	`True` if the section contained at least one table.
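A minimal sketch of the behaviour described above, using plain dicts in place of `Chunk` objects. The function name `section_chunks` and the `(title, text, tables)` tuple shape are assumptions for illustration, as is the choice to apply `min_chars` to the body before the title is prepended.

```python
from typing import Any, Sequence

RawSection = tuple[str, str, list[Any]]  # assumed shape: (title, text, tables)


def section_chunks(
    sections: Sequence[RawSection],
    *,
    document_id: str,
    min_chars: int = 20,
    include_title_in_text: bool = False,
) -> list[dict[str, Any]]:
    """One chunk per section whose stripped text has at least min_chars characters."""
    chunks: list[dict[str, Any]] = []
    for title, text, tables in sections:
        body = text.strip()
        if len(body) < min_chars:
            continue  # stub section, e.g. a lone heading with no body
        if include_title_in_text:
            body = f"{title}\n{body}"
        chunks.append({
            "document_id": document_id,
            "chunk_index": len(chunks),
            "text": body,
            "vector": [],
            "metadata": {"section_title": title, "has_tables": bool(tables)},
        })
    return chunks


sections = [
    ("Methods", "We measured latency across three regions.", [["region", "ms"]]),
    ("Appendix", "TBD", []),  # shorter than min_chars, so it is dropped
]
chunks = section_chunks(sections, document_id="doc-1", include_title_in_text=True)
```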

FixedSizeChunkingStrategy¶

Splits each section's text into non-overlapping fixed-size character windows. Each section may produce one or more chunks depending on its length relative to chunk_size. Splits are made on whitespace boundaries wherever possible to avoid cutting words mid-token.

Parameters

Parameter	Type	Default	Description
`chunk_size`	`int`	`1000`	Target maximum number of characters per chunk.
`min_chars`	`int`	`20`	Windows shorter than this are dropped; typically catches the last fragment of a short section.

Metadata fields

Key	Type	Description
`section_title`	`str`	Title of the source section.
`has_tables`	`bool`	`True` if the source section contained at least one table.
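The whitespace-aware splitting could look roughly like the sketch below. The function `split_fixed` is hypothetical and the library's actual boundary logic may differ. This version only treats spaces and newlines as boundaries and falls back to a hard cut when a window contains neither.

```python
def split_fixed(text: str, chunk_size: int = 1000, min_chars: int = 20) -> list:
    """Split text into non-overlapping windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    text = text.strip()
    windows = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer the last space or newline at or before the hard limit.
            cut = max(text.rfind(" ", start, end + 1), text.rfind("\n", start, end + 1))
            if cut > start:
                end = cut
        window = text[start:end].strip()
        if window and len(window) >= min_chars:
            windows.append(window)
        start = end  # the next window starts where this one stopped
    return windows
```

For example, `split_fixed("alpha beta gamma delta", chunk_size=12, min_chars=1)` cuts after "beta" rather than in the middle of "gamma".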

SlidingWindowChunkingStrategy¶

Produces overlapping character windows across each section's text. Overlapping windows preserve cross-boundary context that non-overlapping splits lose, at the cost of a larger index and some retrieval redundancy. Useful for dense technical text where important sentences often span what would otherwise be a hard split boundary.

Parameters

Parameter	Type	Default	Description
`chunk_size`	`int`	`1000`	Target maximum number of characters per window.
`overlap`	`int`	`200`	Number of characters shared between consecutive windows. Must be strictly less than `chunk_size`.
`min_chars`	`int`	`20`	Windows shorter than this are dropped.

Raises

Exception	Condition
`ValueError`	`overlap >= chunk_size`.

Metadata fields

Key	Type	Description
`section_title`	`str`	Title of the source section.
`has_tables`	`bool`	`True` if the source section contained at least one table.
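The windowing arithmetic can be sketched as follows: each window starts `chunk_size - overlap` characters after the previous one. The function `sliding_windows` is a hypothetical illustration that cuts on raw character offsets. Whether the real strategy also snaps to whitespace is not stated here, and the negative-overlap check is an added assumption.

```python
def sliding_windows(
    text: str, chunk_size: int = 1000, overlap: int = 200, min_chars: int = 20
) -> list:
    """Return overlapping character windows over text."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be strictly less than chunk_size")
    if overlap < 0:  # assumption: negative overlap is rejected as well
        raise ValueError("overlap must be non-negative")
    step = chunk_size - overlap  # distance between consecutive window starts
    windows = []
    for start in range(0, len(text), step):
        window = text[start:start + chunk_size]
        if len(window) >= min_chars:
            windows.append(window)
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return windows
```

With `chunk_size=4` and `overlap=2`, the text `"abcdefghij"` yields four windows, each sharing two characters with its neighbour. The same text split without overlap yields three, which shows where the extra index size comes from.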

SummaryAndSectionsStrategy¶

Produces one optional summary chunk followed by one chunk per non-empty section, preserving the document's natural heading structure. Unlike SummaryAndNSectionsStrategy, section boundaries and titles are never merged or discarded — each section maps to exactly one body chunk. Use this strategy when section-level retrieval granularity must be preserved and an optional LLM-generated summary chunk is desired at index 0.

Chunk layout

index 0      →  summary text            (only when summary is provided and non-blank)
index 1..N   →  one chunk per section   (in document order)

index 0..N-1 →  one chunk per section   (when no summary is provided)

Parameters

Parameter	Type	Default	Description
`min_chars`	`int`	`20`	Sections shorter than this after stripping are silently dropped.

Metadata fields — summary chunk

Key	Type	Description
`chunk_type`	`str`	Always `"summary"`.

Metadata fields — body chunks

Key	Type	Description
`chunk_type`	`str`	Always `"body"`.
`section_title`	`str`	Title of the source section.
`has_tables`	`bool`	`True` if the source section contained at least one table.
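The chunk layout and metadata above can be sketched with plain dicts. The function name `summary_and_sections` is hypothetical, and the sketch assumes that `min_chars` applies only to body sections, not to the summary.

```python
from typing import Any, Optional, Sequence

RawSection = tuple[str, str, list[Any]]  # assumed shape: (title, text, tables)


def summary_and_sections(
    sections: Sequence[RawSection],
    *,
    document_id: str,
    summary: Optional[str] = None,
    min_chars: int = 20,
) -> list[dict[str, Any]]:
    chunks: list[dict[str, Any]] = []

    def add(text: str, metadata: dict) -> None:
        chunks.append({
            "document_id": document_id,
            "chunk_index": len(chunks),  # summary (if any) takes index 0
            "text": text,
            "vector": [],
            "metadata": metadata,
        })

    # A missing or blank summary produces no summary chunk at all.
    if summary is not None and summary.strip():
        add(summary.strip(), {"chunk_type": "summary"})
    for title, text, tables in sections:
        body = text.strip()
        if len(body) < min_chars:
            continue
        add(body, {"chunk_type": "body", "section_title": title, "has_tables": bool(tables)})
    return chunks


sections = [
    ("Intro", "This report covers the first quarter.", []),
    ("Results", "Revenue grew in every region we track.", [["region", "growth"]]),
]
with_summary = summary_and_sections(sections, document_id="d", summary="Quarterly overview.")
without_summary = summary_and_sections(sections, document_id="d", summary="   ")
```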

Docstrings AbstractChunkingStrategy¶

abstract_chunk_strategy ¶

Classes:

AbstractChunkingStrategy –

Interface for all chunking strategies.

AbstractChunkingStrategy ¶



Interface for all chunking strategies.

A ChunkingStrategy transforms a sequence of raw document sections into a flat list of Chunk objects that are ready for embedding and indexing.

Contract¶

chunk() is the single method every implementation must provide.
Returned chunk_index values must be monotonically increasing from 0 and stable across re-indexing runs for identical input.
text on each Chunk must be non-empty.
vector is left as an empty sequence — embedding is a separate concern.
metadata may carry arbitrary JSON-serialisable fields but must never influence chunk identity.

Implementations¶

`SectionChunkingStrategy` – one chunk per docling section (default)
`FixedSizeChunkingStrategy` – splits text into fixed-size windows
`SlidingWindowChunkingStrategy` – overlapping fixed-size windows
`SummaryAndNSectionsStrategy` – summary chunk + N evenly-merged body chunks

Methods:

chunk –

Convert raw sections into Chunk objects.

chunk `abstractmethod` ¶

chunk(sections: Sequence[RawSection], *, document_id: str, summary: str | None = None) -> list[Chunk]

Convert raw sections into Chunk objects.

Parameters¶

sections:
    Ordered sequence of (title, text, tables) tuples as produced by TextStructureExtractor.extract_sections().
document_id:
    Stable identifier of the parent document. Passed through unchanged into every Chunk.document_id.
summary:
    Optional pre-extracted summary string (e.g. from TextMetadata.summary). Most strategies ignore this; it is consumed by `SummaryAndNSectionsStrategy`.

Returns¶

list[Chunk]
    Flat, ordered list of chunks. May be empty if sections is empty or all sections are blank after cleaning.

Source code in src/database_builder_libs/models/abstract_chunk_strategy.py

@abstractmethod
def chunk(
    self,
    sections: Sequence[RawSection],
    *,
    document_id: str,
    summary: str | None = None,
) -> list[Chunk]:
    """
    Convert raw sections into ``Chunk`` objects.

    Parameters
    ----------
    sections:
        Ordered sequence of ``(title, text, tables)`` tuples as produced
        by ``TextStructureExtractor.extract_sections()``.
    document_id:
        Stable identifier of the parent document.  Passed through
        unchanged into every ``Chunk.document_id``.
    summary:
        Optional pre-extracted summary string (e.g. from
        ``TextMetadata.summary``).  Most strategies ignore this; it is
        consumed by :class:`SummaryAndNSectionsStrategy`.

    Returns
    -------
    list[Chunk]
        Flat, ordered list of chunks.  May be empty if *sections* is empty
        or all sections are blank after cleaning.
    """
    raise NotImplementedError

Docstrings SectionChunkingStrategy¶

n_points_section ¶

Classes:

SectionChunkingStrategy –

Produces exactly one Chunk per non-empty document section.

SectionChunkingStrategy `dataclass` ¶

SectionChunkingStrategy(min_chars: int = 20, include_title_in_text: bool = False)


Inheritance: AbstractChunkingStrategy → SectionChunkingStrategy

Produces exactly one Chunk per non-empty document section.

This is the simplest strategy and maps cleanly onto the heading structure that Docling extracts. It is the best default when sections are already semantically coherent units (e.g. academic papers, reports).

Attributes¶

min_chars : int
    Sections whose text is shorter than this threshold (after stripping) are silently dropped. Prevents index pollution from stub sections such as lone headings with no body. Default: 20.
include_title_in_text : bool
    When True, the section title is prepended to the chunk text as "<title>\n<text>". Useful when the title adds retrieval signal that does not appear in the body. Default: False.

Docstrings FixedSizeChunkingStrategy¶

n_points_fixed_size ¶

Classes:

FixedSizeChunkingStrategy –

Splits section text into non-overlapping fixed-size character windows.

FixedSizeChunkingStrategy `dataclass` ¶

FixedSizeChunkingStrategy(chunk_size: int = 1000, min_chars: int = 20)


Inheritance: AbstractChunkingStrategy → FixedSizeChunkingStrategy

Splits section text into non-overlapping fixed-size character windows.

Each section may produce one or more chunks depending on its length relative to chunk_size. Splits are made on whitespace boundaries wherever possible to avoid cutting words mid-token.

Attributes¶

chunk_size : int
    Target maximum number of characters per chunk. Default: 1000.
min_chars : int
    Windows shorter than this are dropped (typically the last fragment of a short section). Default: 20.

Docstrings SlidingWindowChunkingStrategy¶

n_points_sliding_window ¶

Classes:

SlidingWindowChunkingStrategy –

Produces overlapping character windows across each section's text.

SlidingWindowChunkingStrategy `dataclass` ¶

SlidingWindowChunkingStrategy(chunk_size: int = 1000, overlap: int = 200, min_chars: int = 20)


Inheritance: AbstractChunkingStrategy → SlidingWindowChunkingStrategy

Produces overlapping character windows across each section's text.

Overlapping windows preserve cross-boundary context that non-overlapping splits lose, at the cost of a larger index and some retrieval redundancy. Useful for dense technical text where important sentences often span what would otherwise be a hard split boundary.

Attributes¶

chunk_size : int
    Target maximum number of characters per window. Default: 1000.
overlap : int
    Number of characters shared between consecutive windows. Must be strictly less than chunk_size. Default: 200.
min_chars : int
    Windows shorter than this threshold are dropped. Default: 20.

Raises¶

ValueError
    If overlap >= chunk_size.

Docstrings SummaryAndSectionsStrategy¶

summary_and_sections ¶

Classes:

SummaryAndSectionsStrategy –

Produces one optional summary chunk followed by one chunk per section.

SummaryAndSectionsStrategy `dataclass` ¶

SummaryAndSectionsStrategy(min_chars: int = 20)


Inheritance: AbstractChunkingStrategy → SummaryAndSectionsStrategy

Produces one optional summary chunk followed by one chunk per section.

Combines the summary prepending of `SummaryAndNSectionsStrategy` with the section-preserving behaviour of `SectionChunkingStrategy`. Section boundaries, titles, and table flags are all preserved on each body chunk.

Chunk layout (when summary is present)¶

index 0      →  summary text
index 1..N   →  one chunk per non-empty section, in document order

Chunk layout (when summary is absent)¶

index 0..N-1 → one chunk per non-empty section, in document order

Attributes¶

min_chars : int
    Sections whose text is shorter than this after stripping are silently dropped. Default: 20.
