Chunking strategies¶
Overview¶
The AbstractChunkingStrategy class defines the contract that all concrete chunking
implementations must satisfy. It encapsulates the transformation of raw document sections
into a flat, ordered list of Chunk objects that are ready for embedding and indexing.
By adhering to this interface, different splitting approaches — section-based, fixed-size,
sliding-window, or summary-prefixed — can be swapped interchangeably while the rest of
the pipeline remains agnostic to the underlying strategy.
Design notes¶
Interaction pattern¶
AbstractChunkingStrategy follows a single-phase transformation pattern:
- Input: an ordered sequence of RawSection tuples, each carrying a section title, body text, and any tables extracted from that section. These are produced directly by DocumentParserDocling and passed in without further transformation.
- Chunking: the strategy splits, merges, or partitions the sections according to its own logic and returns a flat list of Chunk objects.
- Output: every returned Chunk has a non-empty text, a stable chunk_index starting from 0, the caller-supplied document_id, and an empty vector list. Embedding is a separate downstream concern.
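The pattern above can be sketched in a few lines of Python. This is only an illustration: the Chunk dataclass, the RawSection alias, and the ToyStrategy class are hypothetical stand-ins written for this example, not the library's actual definitions; only the chunk() signature is taken from this page.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Sequence

# Hypothetical stand-in for the library's RawSection: (title, text, tables).
RawSection = tuple[str, str, list[Any]]


@dataclass
class Chunk:
    """Hypothetical stand-in mirroring the documented chunk fields."""
    document_id: str
    chunk_index: int
    text: str
    vector: list[float] = field(default_factory=list)   # filled in later by embedding
    metadata: dict[str, Any] = field(default_factory=dict)


class ChunkingStrategy(ABC):
    """Sketch of the contract: one method, sections in, flat chunk list out."""

    @abstractmethod
    def chunk(self, sections: Sequence[RawSection], *, document_id: str,
              summary: str | None = None) -> list[Chunk]:
        ...


class ToyStrategy(ChunkingStrategy):
    """Illustrative strategy: one chunk per non-blank section."""

    def chunk(self, sections, *, document_id, summary=None):
        chunks: list[Chunk] = []
        for title, text, _tables in sections:
            body = text.strip()
            if not body:
                continue  # blank sections never produce a chunk
            # len(chunks) yields indices 0, 1, 2, ... in output order
            chunks.append(Chunk(document_id, len(chunks), body,
                                metadata={"section_title": title}))
        return chunks


sections = [("Intro", "Hello world.", []),
            ("Empty", "   ", []),
            ("Methods", "We did things.", [])]
chunks = ToyStrategy().chunk(sections, document_id="doc-1")
print([(c.chunk_index, c.text) for c in chunks])
```

Because callers depend only on chunk(), swapping ToyStrategy for any other implementation would leave the calling code unchanged.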
Choosing a strategy¶
| Class | Chunks produced | When to use |
|---|---|---|
| SectionChunkingStrategy | One per section | Clean heading structure; sections are already semantically coherent |
| FixedSizeChunkingStrategy | One or more per section | Uniform context window needed; no overlap required |
| SlidingWindowChunkingStrategy | More than fixed-size due to overlap | Boundary recall matters; dense technical text with cross-boundary sentences |
| SummaryAndSectionsStrategy | One per section (+ 1 summary if provided) | Section structure must be preserved with an optional LLM-generated summary chunk prepended |
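To make the "Chunks produced" column concrete, the following back-of-the-envelope sketch estimates how many windows a single section yields under the fixed-size and sliding-window approaches. It assumes windows are cut at exact character offsets and ignores whitespace snapping and min_chars filtering, so real counts may differ slightly; the function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive b."""
    return -(-a // b)


def fixed_size_count(length: int, chunk_size: int) -> int:
    """Windows needed to cover `length` characters without overlap."""
    if length <= 0:
        return 0
    return ceil_div(length, chunk_size)


def sliding_window_count(length: int, chunk_size: int, overlap: int) -> int:
    """Windows needed when consecutive windows share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be strictly less than chunk_size")
    if length <= 0:
        return 0
    if length <= chunk_size:
        return 1
    # After the first window, each further window adds (chunk_size - overlap)
    # new characters.
    return ceil_div(length - overlap, chunk_size - overlap)


print(fixed_size_count(3000, 1000))           # 3
print(sliding_window_count(3000, 1000, 200))  # 4
```

With the default sizes, a 3000-character section goes from three chunks to four, which is the index-size cost that overlap buys boundary recall with.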
Common chunk fields¶
Every Chunk returned by any strategy has the following fields:
| Field | Type | Description |
|---|---|---|
| document_id | str | The document_id passed to .chunk(), unchanged. |
| chunk_index | int | Monotonically increasing from 0. |
| text | str | Non-empty chunk body. |
| vector | list | Always [] until the embedding stage populates it. |
| metadata | dict | Strategy-specific; see each strategy below. |
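These field guarantees are easy to check mechanically, for example in a unit test for a custom strategy. The helper below is a hypothetical sketch, not part of the library, and for simplicity it assumes chunks are represented as plain dicts keyed by the field names above.

```python
from typing import Any, Mapping, Sequence


def contract_violations(chunks: Sequence[Mapping[str, Any]],
                        document_id: str) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems: list[str] = []
    previous_index = -1
    for position, chunk in enumerate(chunks):
        index = chunk["chunk_index"]
        if position == 0 and index != 0:
            problems.append("first chunk_index is not 0")
        elif position > 0 and index <= previous_index:
            problems.append(f"chunk_index {index} does not increase")
        previous_index = index
        if chunk["document_id"] != document_id:
            problems.append(f"chunk {index}: document_id was changed")
        if not chunk["text"]:
            problems.append(f"chunk {index}: text is empty")
        if list(chunk["vector"]) != []:
            problems.append(f"chunk {index}: vector is already populated")
        if not isinstance(chunk["metadata"], dict):
            problems.append(f"chunk {index}: metadata is not a dict")
    return problems


good = [
    {"document_id": "doc-1", "chunk_index": 0, "text": "First.",
     "vector": [], "metadata": {}},
    {"document_id": "doc-1", "chunk_index": 1, "text": "Second.",
     "vector": [], "metadata": {}},
]
print(contract_violations(good, "doc-1"))  # prints []
```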
Implementations¶
SectionChunkingStrategy¶
Produces exactly one chunk per non-empty document section. This is the simplest strategy and maps cleanly onto the heading structure that Docling extracts. It is the right default when sections are already semantically coherent units such as academic papers or reports.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_chars | int | 20 | Sections shorter than this after stripping are silently dropped, preventing index pollution from stub sections such as lone headings with no body. |
| include_title_in_text | bool | False | When True, the section title is prepended to the chunk text as `"<title>\n<text>"`. Useful when the title adds retrieval signal that does not appear in the body. |
Metadata fields
| Key | Type | Description |
|---|---|---|
| section_title | str | Title of the source section. |
| has_tables | bool | True if the section contained at least one table. |
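The behaviour described above can be approximated in a few lines. This is a minimal sketch, not the library's implementation: section_chunks is a hypothetical function, chunks are modelled as plain dicts, and min_chars is assumed to be measured on the stripped body text before any title is prepended.

```python
from typing import Any, Sequence


def section_chunks(sections: Sequence[tuple[str, str, list[Any]]], *,
                   document_id: str, min_chars: int = 20,
                   include_title_in_text: bool = False) -> list[dict[str, Any]]:
    """One chunk per section whose stripped text has at least min_chars characters."""
    chunks: list[dict[str, Any]] = []
    for title, text, tables in sections:
        body = text.strip()
        if len(body) < min_chars:
            continue  # stub section: dropped silently
        chunk_text = f"{title}\n{body}" if include_title_in_text else body
        chunks.append({
            "document_id": document_id,
            "chunk_index": len(chunks),  # dropped sections leave no gaps
            "text": chunk_text,
            "vector": [],
            "metadata": {"section_title": title, "has_tables": bool(tables)},
        })
    return chunks


sections = [
    ("Introduction", "This section introduces the problem.", []),
    ("Stub", "", []),
    ("Results", "Accuracy improved on all benchmarks.", [["metric", "value"]]),
]
chunks = section_chunks(sections, document_id="doc-1", include_title_in_text=True)
for c in chunks:
    print(c["chunk_index"], repr(c["text"]), c["metadata"])
```

Note that the dropped "Stub" section does not consume an index, so the surviving chunks are still numbered 0 and 1.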
FixedSizeChunkingStrategy¶
Splits each section's text into non-overlapping fixed-size character windows. Each section
may produce one or more chunks depending on its length relative to chunk_size. Splits are
made on whitespace boundaries wherever possible to avoid cutting words mid-token.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 1000 | Target maximum number of characters per chunk. |
| min_chars | int | 20 | Windows shorter than this are dropped; typically catches the last fragment of a short section. |
Metadata fields
| Key | Type | Description |
|---|---|---|
| section_title | str | Title of the source section. |
| has_tables | bool | True if the source section contained at least one table. |
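One way to split on whitespace boundaries "wherever possible" is to take a full window and then back up to the nearest whitespace character, falling back to a hard cut when a single word is longer than the window. The function below is an illustrative sketch under that assumption; split_fixed is a hypothetical name and the library's actual splitting logic may differ.

```python
def split_fixed(text: str, chunk_size: int = 1000, min_chars: int = 20) -> list[str]:
    """Split text into non-overlapping windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    text = text.strip()
    windows: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the nearest whitespace so no word is cut in half.
            cut = end
            while cut > start and not text[cut].isspace():
                cut -= 1
            if cut > start:
                end = cut  # otherwise: one very long word, keep the hard cut
        piece = text[start:end].strip()
        if len(piece) >= min_chars:
            windows.append(piece)
        start = end
    return windows


print(split_fixed("alpha beta gamma delta epsilon", chunk_size=12, min_chars=1))
# ['alpha beta', 'gamma delta', 'epsilon']
```

Raising min_chars to 8 in the call above would drop the trailing 'epsilon' fragment, which is the "last fragment of a short section" case mentioned in the parameter table.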
SlidingWindowChunkingStrategy¶
Produces overlapping character windows across each section's text. Overlapping windows preserve cross-boundary context that non-overlapping splits lose, at the cost of a larger index and some retrieval redundancy. Useful for dense technical text where important sentences often span what would otherwise be a hard split boundary.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 1000 | Target maximum number of characters per window. |
| overlap | int | 200 | Number of characters shared between consecutive windows. Must be strictly less than chunk_size. |
| min_chars | int | 20 | Windows shorter than this are dropped. |
Raises
| Exception | Condition |
|---|---|
| ValueError | overlap >= chunk_size. |
Metadata fields
| Key | Type | Description |
|---|---|---|
| section_title | str | Title of the source section. |
| has_tables | bool | True if the source section contained at least one table. |
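The windowing logic can be sketched as follows. This is a simplified illustration with a hypothetical function name: it cuts at exact character offsets (no whitespace snapping) and advances by chunk_size - overlap each step, which is also why overlap must be strictly smaller than chunk_size: otherwise the window would never move forward.

```python
def sliding_windows(text: str, chunk_size: int = 1000, overlap: int = 200,
                    min_chars: int = 20) -> list[str]:
    """Return overlapping character windows over text."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be strictly less than chunk_size")
    text = text.strip()
    step = chunk_size - overlap  # new characters contributed by each window
    windows: list[str] = []
    start = 0
    while start < len(text):
        piece = text[start:start + chunk_size]
        if len(piece.strip()) >= min_chars:
            windows.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return windows


print(sliding_windows("abcdefghij", chunk_size=4, overlap=2, min_chars=1))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window in the output repeats the last two characters of the previous one, so text near a boundary appears intact in at least one window.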
SummaryAndSectionsStrategy¶
Produces one optional summary chunk followed by one chunk per non-empty section, preserving
the document's natural heading structure. Unlike SummaryAndNSectionsStrategy, section
boundaries and titles are never merged or discarded — each section maps to exactly one body
chunk. Use this strategy when section-level retrieval granularity must be preserved and an
optional LLM-generated summary chunk is desired at index 0.
Chunk layout
With a summary (provided and non-blank):
index 0 → summary text
index 1..N → one chunk per section (in document order)
Without a summary:
index 0..N-1 → one chunk per section (in document order)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_chars | int | 20 | Sections shorter than this after stripping are silently dropped. |
Metadata fields — summary chunk
| Key | Type | Description |
|---|---|---|
| chunk_type | str | Always "summary". |
Metadata fields — body chunks
| Key | Type | Description |
|---|---|---|
| chunk_type | str | Always "body". |
| section_title | str | Title of the source section. |
| has_tables | bool | True if the source section contained at least one table. |
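The layout and metadata rules above can be illustrated with a short sketch. The function name and the dict-based chunk representation are assumptions made for this example; it also assumes that min_chars applies only to sections, since the summary is documented as included whenever it is provided and non-blank.

```python
from __future__ import annotations

from typing import Any, Sequence


def summary_and_sections(sections: Sequence[tuple[str, str, list[Any]]], *,
                         document_id: str, summary: str | None = None,
                         min_chars: int = 20) -> list[dict[str, Any]]:
    """Optional summary chunk first, then one body chunk per kept section."""
    chunks: list[dict[str, Any]] = []

    def add(text: str, metadata: dict[str, Any]) -> None:
        chunks.append({"document_id": document_id, "chunk_index": len(chunks),
                       "text": text, "vector": [], "metadata": metadata})

    if summary is not None and summary.strip():
        add(summary.strip(), {"chunk_type": "summary"})  # lands at index 0
    for title, text, tables in sections:
        body = text.strip()
        if len(body) < min_chars:
            continue  # stub section: dropped silently
        add(body, {"chunk_type": "body", "section_title": title,
                   "has_tables": bool(tables)})
    return chunks


sections = [("Background", "Prior work is reviewed here in detail.", []),
            ("Data", "The dataset contains twelve thousand rows.", [["col"]])]
with_summary = summary_and_sections(sections, document_id="doc-1",
                                    summary="A short overview of the paper.")
without_summary = summary_and_sections(sections, document_id="doc-1", summary="   ")
print([c["metadata"]["chunk_type"] for c in with_summary])
print([c["metadata"]["chunk_type"] for c in without_summary])
```

A blank summary is treated the same as no summary, so body chunks start at index 0 in that case.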
Docstrings AbstractChunkingStrategy¶
abstract_chunk_strategy¶
Classes:
- AbstractChunkingStrategy – Interface for all chunking strategies.
AbstractChunkingStrategy¶
Interface for all chunking strategies.
A ChunkingStrategy transforms a sequence of raw document sections into a
flat list of Chunk objects that are ready for embedding and indexing.
Contract¶
- chunk() is the single method every implementation must provide.
- Returned chunk_index values must be monotonically increasing from 0 and stable across re-indexing runs for identical input.
- text on each Chunk must be non-empty.
- vector is left as an empty sequence; embedding is a separate concern.
- metadata may carry arbitrary JSON-serialisable fields but must never influence chunk identity.
Implementations¶
- SectionChunkingStrategy – one chunk per Docling section (default)
- FixedSizeChunkingStrategy – splits text into fixed-size windows
- SlidingWindowChunkingStrategy – overlapping fixed-size windows
- SummaryAndNSectionsStrategy – summary chunk + N evenly-merged body chunks
Methods:
- chunk – Convert raw sections into Chunk objects.
chunk (abstractmethod)¶
chunk(sections: Sequence[RawSection], *, document_id: str, summary: str | None = None) -> list[Chunk]
Convert raw sections into Chunk objects.
Parameters¶
sections:
Ordered sequence of (title, text, tables) tuples as produced
by TextStructureExtractor.extract_sections().
document_id:
Stable identifier of the parent document. Passed through
unchanged into every Chunk.document_id.
summary:
Optional pre-extracted summary string (e.g. from
TextMetadata.summary). Most strategies ignore this; it is
consumed by SummaryAndNSectionsStrategy.
Returns¶
list[Chunk]: Flat, ordered list of chunks. May be empty if sections is empty or all sections are blank after cleaning.
Source code in src/database_builder_libs/models/abstract_chunk_strategy.py
Docstrings SectionChunkingStrategy¶
n_points_section¶
Classes:
- SectionChunkingStrategy – Produces exactly one Chunk per non-empty document section.
SectionChunkingStrategy (dataclass)¶
Bases: AbstractChunkingStrategy
Produces exactly one Chunk per non-empty document section.
This is the simplest strategy and maps cleanly onto the heading structure that Docling extracts. It is the best default when sections are already semantically coherent units (e.g. academic papers, reports).
Attributes¶
min_chars : int
Sections whose text is shorter than this threshold (after stripping)
are silently dropped. Prevents index pollution from stub sections
such as lone headings with no body. Default: 20.
include_title_in_text : bool
When True the section title is prepended to the chunk text as
"<title>\n<text>". Useful when the title adds retrieval signal
that does not appear in the body. Default: False.
Docstrings FixedSizeChunkingStrategy¶
n_points_fixed_size¶
Classes:
- FixedSizeChunkingStrategy – Splits section text into non-overlapping fixed-size character windows.
FixedSizeChunkingStrategy (dataclass)¶
Bases: AbstractChunkingStrategy
Splits section text into non-overlapping fixed-size character windows.
Each section may produce one or more chunks depending on its length
relative to chunk_size. Splits are made on whitespace boundaries
wherever possible to avoid cutting words mid-token.
Attributes¶
chunk_size : int
Target maximum number of characters per chunk. Default: 1000.
min_chars : int
Windows shorter than this are dropped (typically the last fragment of a short section). Default: 20.
Docstrings SlidingWindowChunkingStrategy¶
n_points_sliding_window¶
Classes:
- SlidingWindowChunkingStrategy – Produces overlapping character windows across each section's text.
SlidingWindowChunkingStrategy (dataclass)¶
Bases: AbstractChunkingStrategy
Produces overlapping character windows across each section's text.
Overlapping windows preserve cross-boundary context that non-overlapping splits lose, at the cost of index size and some retrieval redundancy. Useful for dense technical text where important sentences often span what would otherwise be a hard split boundary.
Attributes¶
chunk_size : int
Target maximum number of characters per window. Default: 1000.
overlap : int
Number of characters shared between consecutive windows.
Must be strictly less than chunk_size. Default: 200.
min_chars : int
Windows shorter than this threshold are dropped. Default: 20.
Raises¶
ValueError
If overlap >= chunk_size.
Docstrings SummaryAndSectionsStrategy¶
summary_and_sections¶
Classes:
- SummaryAndSectionsStrategy – Produces one optional summary chunk followed by one chunk per section.
SummaryAndSectionsStrategy (dataclass)¶
SummaryAndSectionsStrategy(min_chars: int = 20)
Bases: AbstractChunkingStrategy
Produces one optional summary chunk followed by one chunk per section.
Combines the summary prepending of SummaryAndNSectionsStrategy
with the section-preserving behaviour of SectionChunkingStrategy.
Section boundaries, titles, and table flags are all preserved on each body
chunk.
Chunk layout (when summary is present)¶
index 0 → summary text
index 1..N → one chunk per non-empty section, in document order
Chunk layout (when summary is absent)¶
index 0..N-1 → one chunk per non-empty section, in document order
Attributes¶
min_chars : int
Sections whose text is shorter than this after stripping are silently dropped. Default: 20.