Chunking strategies

Overview

The AbstractChunkingStrategy class defines the contract that every concrete chunking implementation must satisfy: it transforms raw document sections into a flat, ordered list of Chunk objects that are ready for embedding and indexing. Because all strategies share this interface, different splitting approaches (section-based, fixed-size, sliding-window, or summary-prefixed) can be swapped without changes to the rest of the pipeline.


Design notes

Interaction pattern

AbstractChunkingStrategy follows a single-phase transformation pattern:

  1. Input — an ordered sequence of RawSection tuples, each carrying a section title, body text, and any tables extracted from that section. These are produced directly by DocumentParserDocling and passed in without further transformation.
  2. Chunking — the strategy splits, merges, or partitions the sections according to its own logic and returns a flat list of Chunk objects.
  3. Output — every returned Chunk has a non-empty text, a stable chunk_index starting from 0, the caller-supplied document_id, and an empty vector list. Embedding is a separate downstream concern.
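
The three phases above can be sketched end to end as follows. This is a minimal illustration rather than the library's code: the Chunk dataclass and the one_chunk_per_section helper are hypothetical stand-ins, and RawSection is assumed to be a plain (title, text, tables) tuple.

```python
from dataclasses import dataclass, field
from typing import Any, Sequence

# Assumed shape of a raw section: (title, text, tables).
RawSection = tuple[str, str, list[Any]]


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's Chunk model."""
    document_id: str
    chunk_index: int
    text: str
    vector: list[float] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


def one_chunk_per_section(
    sections: Sequence[RawSection], *, document_id: str
) -> list[Chunk]:
    """Toy strategy: keep every non-blank section as a single chunk."""
    chunks: list[Chunk] = []
    for title, text, _tables in sections:
        body = text.strip()
        if not body:
            continue  # blank sections never become chunks
        chunks.append(
            Chunk(document_id, len(chunks), body, metadata={"section_title": title})
        )
    return chunks


# 1. Input: ordered raw sections. 2. Chunking. 3. Output: flat list of chunks.
sections = [("Intro", "Chunking splits documents.", []), ("Empty", "   ", [])]
chunks = one_chunk_per_section(sections, document_id="doc-1")
```

The blank "Empty" section is skipped, so the result holds a single chunk with index 0 and an empty vector.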

Choosing a strategy

Class | Chunks produced | When to use
SectionChunkingStrategy | One per section | Clean heading structure; sections are already semantically coherent
FixedSizeChunkingStrategy | One or more per section | Uniform context window needed; no overlap required
SlidingWindowChunkingStrategy | More than fixed-size due to overlap | Boundary recall matters; dense technical text with cross-boundary sentences
SummaryAndSectionsStrategy | One per section (+ 1 summary if provided) | Section structure must be preserved with an optional LLM-generated summary chunk prepended

Common chunk fields

Every Chunk returned by any strategy has the following fields:

Field | Type | Description
document_id | str | The document_id passed to .chunk(), unchanged.
chunk_index | int | Monotonically increasing from 0.
text | str | Non-empty chunk body.
vector | list | Always [] until the embedding stage populates it.
metadata | dict | Strategy-specific; see each strategy below.
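
One way to make the field table concrete is a small checker that asserts each invariant. This is a sketch under assumptions: check_common_fields is a hypothetical helper, the SimpleNamespace objects merely imitate Chunk, and indices are assumed to be consecutive (0, 1, 2, ...), which is stricter than "monotonically increasing".

```python
from types import SimpleNamespace
from typing import Any, Iterable


def check_common_fields(chunks: Iterable[Any], document_id: str) -> None:
    """Raise AssertionError if a chunk violates the common-field rules."""
    for expected_index, chunk in enumerate(chunks):
        assert chunk.document_id == document_id, "document_id must pass through unchanged"
        assert chunk.chunk_index == expected_index, "chunk_index must count up from 0"
        assert isinstance(chunk.text, str) and chunk.text.strip(), "text must be non-empty"
        assert list(chunk.vector) == [], "vector stays empty until embedding"
        assert isinstance(chunk.metadata, dict), "metadata must be a dict"


# Hand-built stand-ins for Chunk objects, for demonstration only.
demo = [
    SimpleNamespace(document_id="doc-1", chunk_index=i, text=t, vector=[], metadata={})
    for i, t in enumerate(["first chunk", "second chunk"])
]
check_common_fields(demo, "doc-1")  # passes silently
```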

Implementations

SectionChunkingStrategy

Produces exactly one chunk per non-empty document section. This is the simplest strategy and maps cleanly onto the heading structure that Docling extracts. It is the right default when sections are already semantically coherent units such as academic papers or reports.

Parameters

Parameter | Type | Default | Description
min_chars | int | 20 | Sections shorter than this after stripping are silently dropped, preventing index pollution from stub sections such as lone headings with no body.
include_title_in_text | bool | False | When True, the section title is prepended to the chunk text as "<title>\n<text>". Useful when the title adds retrieval signal that does not appear in the body.

Metadata fields

Key | Type | Description
section_title | str | Title of the source section.
has_tables | bool | True if the section contained at least one table.
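
The behaviour described above can be approximated in a few lines. This is a sketch, not the library implementation: section_chunks is a hypothetical function, plain dicts stand in for Chunk objects, and it assumes min_chars is checked against the stripped body before any title is prepended.

```python
from typing import Any, Sequence


def section_chunks(
    sections: Sequence[tuple[str, str, list[Any]]],
    *,
    min_chars: int = 20,
    include_title_in_text: bool = False,
) -> list[dict[str, Any]]:
    """Return one dict per kept section; dicts stand in for Chunk objects."""
    out: list[dict[str, Any]] = []
    for title, text, tables in sections:
        body = text.strip()
        if len(body) < min_chars:
            continue  # stub section: dropped silently
        if include_title_in_text:
            body = f"{title}\n{body}"
        out.append({
            "chunk_index": len(out),
            "text": body,
            "metadata": {"section_title": title, "has_tables": bool(tables)},
        })
    return out


sections = [
    ("Methods", "We split each document by heading.", [["a", "b"]]),
    ("Appendix", "TBD", []),  # shorter than min_chars, so it is dropped
]
result = section_chunks(sections, include_title_in_text=True)
```

Only the "Methods" section survives; its text starts with the title line and its metadata records that a table was present.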

FixedSizeChunkingStrategy

Splits each section's text into non-overlapping fixed-size character windows. Each section may produce one or more chunks depending on its length relative to chunk_size. Splits are made on whitespace boundaries wherever possible to avoid cutting words mid-token.

Parameters

Parameter | Type | Default | Description
chunk_size | int | 1000 | Target maximum number of characters per chunk.
min_chars | int | 20 | Windows shorter than this are dropped; typically catches the last fragment of a short section.

Metadata fields

Key | Type | Description
section_title | str | Title of the source section.
has_tables | bool | True if the source section contained at least one table.
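
To illustrate non-overlapping windows that prefer whitespace boundaries, here is a minimal sketch. The split_fixed function is hypothetical and simplified: it considers only space characters as boundaries and falls back to a hard cut when a window contains none. The library's actual splitting rules may differ.

```python
def split_fixed(text: str, chunk_size: int = 1000, min_chars: int = 20) -> list[str]:
    """Split text into non-overlapping windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    text = text.strip()
    windows: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last space in the window so words stay intact.
            cut = text.rfind(" ", start + 1, end + 1)
            if cut > start:
                end = cut
        window = text[start:end].strip()
        if len(window) >= min_chars:
            windows.append(window)  # short fragments are dropped
        start = end
    return windows


pieces = split_fixed("alpha beta gamma delta epsilon zeta", chunk_size=12, min_chars=1)
# → ['alpha beta', 'gamma delta', 'epsilon', 'zeta']
```

Raising min_chars to 5 in the same call would drop the trailing "zeta" fragment.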

SlidingWindowChunkingStrategy

Produces overlapping character windows across each section's text. Overlapping windows preserve cross-boundary context that non-overlapping splits lose, at the cost of a larger index and some retrieval redundancy. Useful for dense technical text where important sentences often span what would otherwise be a hard split boundary.

Parameters

Parameter | Type | Default | Description
chunk_size | int | 1000 | Target maximum number of characters per window.
overlap | int | 200 | Number of characters shared between consecutive windows. Must be strictly less than chunk_size.
min_chars | int | 20 | Windows shorter than this are dropped.

Raises

Exception | Condition
ValueError | overlap >= chunk_size.

Metadata fields

Key | Type | Description
section_title | str | Title of the source section.
has_tables | bool | True if the source section contained at least one table.
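
The relationship between chunk_size, overlap, and the resulting windows is easiest to see in code. The sketch below is an assumption-laden illustration: sliding_windows is a hypothetical function that cuts at exact character offsets (no whitespace snapping) and advances by chunk_size - overlap each step.

```python
def sliding_windows(
    text: str, chunk_size: int = 1000, overlap: int = 200, min_chars: int = 20
) -> list[str]:
    """Return overlapping character windows over text."""
    if overlap < 0:
        raise ValueError("overlap must not be negative")
    if overlap >= chunk_size:
        raise ValueError("overlap must be strictly less than chunk_size")
    text = text.strip()
    step = chunk_size - overlap  # distance between consecutive window starts
    windows: list[str] = []
    for start in range(0, len(text), step):
        window = text[start:start + chunk_size]
        if len(window) >= min_chars:
            windows.append(window)
        if start + chunk_size >= len(text):
            break  # the text is fully covered
    return windows


demo = sliding_windows("abcdefghij", chunk_size=4, overlap=2, min_chars=1)
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window shares its last two characters with the start of the next one, which is why this approach yields more chunks than a non-overlapping split of the same text.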

SummaryAndSectionsStrategy

Produces one optional summary chunk followed by one chunk per non-empty section, preserving the document's natural heading structure. Unlike SummaryAndNSectionsStrategy, section boundaries and titles are never merged or discarded — each section maps to exactly one body chunk. Use this strategy when section-level retrieval granularity must be preserved and an optional LLM-generated summary chunk is desired at index 0.

Chunk layout

With a non-blank summary (N = number of non-empty sections):
index 0        →  summary text
index 1..N     →  one chunk per section   (in document order)

Without a summary:
index 0..N-1   →  one chunk per section   (in document order)

Parameters

Parameter | Type | Default | Description
min_chars | int | 20 | Sections shorter than this after stripping are silently dropped.

Metadata fields — summary chunk

Key | Type | Description
chunk_type | str | Always "summary".

Metadata fields — body chunks

Key | Type | Description
chunk_type | str | Always "body".
section_title | str | Title of the source section.
has_tables | bool | True if the source section contained at least one table.
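
The chunk layout and metadata described above can be sketched as follows. The summary_and_sections function is hypothetical and uses plain dicts in place of Chunk objects; it assumes a summary made up only of whitespace counts as absent.

```python
from typing import Any, Optional, Sequence


def summary_and_sections(
    sections: Sequence[tuple[str, str, list[Any]]],
    *,
    summary: Optional[str] = None,
    min_chars: int = 20,
) -> list[dict[str, Any]]:
    """Optional summary chunk at index 0, then one chunk per kept section."""
    out: list[dict[str, Any]] = []
    if summary and summary.strip():
        out.append({
            "chunk_index": 0,
            "text": summary.strip(),
            "metadata": {"chunk_type": "summary"},
        })
    for title, text, tables in sections:
        body = text.strip()
        if len(body) < min_chars:
            continue  # short sections are dropped silently
        out.append({
            "chunk_index": len(out),  # continues after the summary, if any
            "text": body,
            "metadata": {
                "chunk_type": "body",
                "section_title": title,
                "has_tables": bool(tables),
            },
        })
    return out


sections = [("Results", "Accuracy improved on every benchmark.", [])]
with_summary = summary_and_sections(sections, summary="Short overview of the paper.")
without_summary = summary_and_sections(sections, summary="   ")
```

With a summary the body chunk lands at index 1; with a blank or missing summary it lands at index 0.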

Docstrings AbstractChunkingStrategy

abstract_chunk_strategy

Classes:

AbstractChunkingStrategy



Interface for all chunking strategies.

A ChunkingStrategy transforms a sequence of raw document sections into a flat list of Chunk objects that are ready for embedding and indexing.

Contract

  • chunk() is the single method every implementation must provide.
  • Returned chunk_index values must be monotonically increasing from 0 and stable across re-indexing runs for identical input.
  • text on each Chunk must be non-empty.
  • vector is left as an empty sequence — embedding is a separate concern.
  • metadata may carry arbitrary JSON-serialisable fields but must never influence chunk identity.
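
Two of the contract points, stability across runs and JSON-serialisable metadata, lend themselves to an automated check. The sketch below is illustrative only: check_contract and toy_chunk are hypothetical names, and dicts stand in for Chunk objects.

```python
import json
from typing import Any, Callable, Sequence

Section = tuple[str, str, list[Any]]


def check_contract(
    chunk_fn: Callable[..., list[dict[str, Any]]],
    sections: Sequence[Section],
    document_id: str,
) -> None:
    """Run chunk_fn twice on identical input and check two contract points."""
    first = chunk_fn(sections, document_id=document_id)
    second = chunk_fn(sections, document_id=document_id)
    # Stability: identical input yields identical indices and text.
    assert [(c["chunk_index"], c["text"]) for c in first] == [
        (c["chunk_index"], c["text"]) for c in second
    ]
    for c in first:
        json.dumps(c["metadata"])  # raises TypeError if not JSON-serialisable


def toy_chunk(sections: Sequence[Section], *, document_id: str) -> list[dict[str, Any]]:
    """Hypothetical strategy used only to exercise the checker."""
    kept = [(title, body.strip()) for title, body, _ in sections if body.strip()]
    return [
        {"document_id": document_id, "chunk_index": i, "text": body,
         "vector": [], "metadata": {"section_title": title}}
        for i, (title, body) in enumerate(kept)
    ]


check_contract(toy_chunk, [("A", "alpha", []), ("B", " ", [])], "doc-1")
```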

Implementations

  • SectionChunkingStrategy – one chunk per docling section (default)
  • FixedSizeChunkingStrategy – splits text into fixed-size windows
  • SlidingWindowChunkingStrategy – overlapping fixed-size windows
  • SummaryAndNSectionsStrategy – summary chunk + N evenly-merged body chunks

Methods:

  • chunk

    Convert raw sections into Chunk objects.

chunk abstractmethod

chunk(sections: Sequence[RawSection], *, document_id: str, summary: str | None = None) -> list[Chunk]

Convert raw sections into Chunk objects.

Parameters

sections: Ordered sequence of (title, text, tables) tuples as produced by TextStructureExtractor.extract_sections().
document_id: Stable identifier of the parent document. Passed through unchanged into every Chunk.document_id.
summary: Optional pre-extracted summary string (e.g. from TextMetadata.summary). Most strategies ignore this; it is consumed by SummaryAndNSectionsStrategy.

Returns

list[Chunk] Flat, ordered list of chunks. May be empty if sections is empty or all sections are blank after cleaning.

Source code in src/database_builder_libs/models/abstract_chunk_strategy.py
@abstractmethod
def chunk(
    self,
    sections: Sequence[RawSection],
    *,
    document_id: str,
    summary: str | None = None,
) -> list[Chunk]:
    """
    Convert raw sections into ``Chunk`` objects.

    Parameters
    ----------
    sections:
        Ordered sequence of ``(title, text, tables)`` tuples as produced
        by ``TextStructureExtractor.extract_sections()``.
    document_id:
        Stable identifier of the parent document.  Passed through
        unchanged into every ``Chunk.document_id``.
    summary:
        Optional pre-extracted summary string (e.g. from
        ``TextMetadata.summary``).  Most strategies ignore this; it is
        consumed by :class:`SummaryAndNSectionsStrategy`.

    Returns
    -------
    list[Chunk]
        Flat, ordered list of chunks.  May be empty if *sections* is empty
        or all sections are blank after cleaning.
    """
    raise NotImplementedError
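
A new strategy would typically subclass the abstract base and override chunk(). The sketch below shows that pattern with stand-ins: ChunkingStrategySketch is a hypothetical mirror of the interface (not the library class), UppercaseStrategy is a toy subclass, and dicts replace Chunk objects.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional, Sequence


class ChunkingStrategySketch(ABC):
    """Hypothetical mirror of the abstract interface; not the library class."""

    @abstractmethod
    def chunk(
        self,
        sections: Sequence[tuple[str, str, list[Any]]],
        *,
        document_id: str,
        summary: Optional[str] = None,
    ) -> list[dict[str, Any]]:
        raise NotImplementedError


class UppercaseStrategy(ChunkingStrategySketch):
    """Toy subclass: one upper-cased chunk per non-blank section."""

    def chunk(self, sections, *, document_id, summary=None):
        texts = [text.strip().upper() for _, text, _ in sections if text.strip()]
        return [
            {"document_id": document_id, "chunk_index": i, "text": t,
             "vector": [], "metadata": {}}
            for i, t in enumerate(texts)
        ]


chunks = UppercaseStrategy().chunk([("T", "hello", [])], document_id="doc-1")
```

Because chunk() is declared abstract, instantiating the base class directly raises TypeError; only subclasses that implement it can be constructed.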

Docstrings SectionChunkingStrategy

n_points_section

Classes:

SectionChunkingStrategy dataclass

SectionChunkingStrategy(min_chars: int = 20, include_title_in_text: bool = False)

Bases: AbstractChunkingStrategy

Produces exactly one Chunk per non-empty document section.

This is the simplest strategy and maps cleanly onto the heading structure that Docling extracts. It is the best default when sections are already semantically coherent units (e.g. academic papers, reports).

Attributes

min_chars : int
    Sections whose text is shorter than this threshold (after stripping) are silently dropped. Prevents index pollution from stub sections such as lone headings with no body. Default: 20.
include_title_in_text : bool
    When True, the section title is prepended to the chunk text as "<title>\n<text>". Useful when the title adds retrieval signal that does not appear in the body. Default: False.

Docstrings FixedSizeChunkingStrategy

n_points_fixed_size

Classes:

FixedSizeChunkingStrategy dataclass

FixedSizeChunkingStrategy(chunk_size: int = 1000, min_chars: int = 20)

Bases: AbstractChunkingStrategy

Splits section text into non-overlapping fixed-size character windows.

Each section may produce one or more chunks depending on its length relative to chunk_size. Splits are made on whitespace boundaries wherever possible to avoid cutting words mid-token.

Attributes

chunk_size : int
    Target maximum number of characters per chunk. Default: 1000.
min_chars : int
    Windows shorter than this are dropped (typically the last fragment of a short section). Default: 20.

Docstrings SlidingWindowChunkingStrategy

n_points_sliding_window

Classes:

SlidingWindowChunkingStrategy dataclass

SlidingWindowChunkingStrategy(chunk_size: int = 1000, overlap: int = 200, min_chars: int = 20)

Bases: AbstractChunkingStrategy

Produces overlapping character windows across each section's text.

Overlapping windows preserve cross-boundary context that non-overlapping splits lose, at the cost of index size and some retrieval redundancy. Useful for dense technical text where important sentences often span what would otherwise be a hard split boundary.

Attributes

chunk_size : int
    Target maximum number of characters per window. Default: 1000.
overlap : int
    Number of characters shared between consecutive windows. Must be strictly less than chunk_size. Default: 200.
min_chars : int
    Windows shorter than this threshold are dropped. Default: 20.

Raises

ValueError If overlap >= chunk_size.

Docstrings SummaryAndSectionsStrategy

summary_and_sections

Classes:

SummaryAndSectionsStrategy dataclass

SummaryAndSectionsStrategy(min_chars: int = 20)

Bases: AbstractChunkingStrategy

Produces one optional summary chunk followed by one chunk per section.

Combines the summary prepending of SummaryAndNSectionsStrategy with the section-preserving behaviour of SectionChunkingStrategy. Section boundaries, titles, and table flags are all preserved on each body chunk.

Chunk layout (when summary is present; N = number of non-empty sections)

index 0 → summary text
index 1..N → one chunk per non-empty section, in document order

Chunk layout (when summary is absent)

index 0..N-1 → one chunk per non-empty section, in document order

Attributes

min_chars : int
    Sections whose text is shorter than this after stripping are silently dropped. Default: 20.