
Abstract source

Overview

The AbstractSource class defines the contract that all concrete data-source adapters must implement. It encapsulates the lifecycle of a synchronizable external system and separates connection handling, artefact discovery, and content retrieval. Because every adapter implements the same interface, back-ends (e.g., Zotero, SharePoint, REST APIs) can be swapped while the rest of the pipeline remains agnostic to the underlying source.

Design notes

Interaction Pattern

The AbstractSource follows a three-phase interaction pattern:

  1. Connection Phase: Establish connection to external system using backend-specific configuration
  2. Discovery Phase: Query for artefacts modified since last synchronization timestamp
  3. Retrieval Phase: Fetch normalized content for discovered artefacts

This design enables efficient incremental synchronization while maintaining consistency through stable identifiers and deterministic content serialization.
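The three phases can be sketched with a toy in-memory source. This is a minimal illustration, not the library's code: `InMemorySource`, its `items` config key, and the plain-dict return values are hypothetical stand-ins, and "modified since" is assumed to mean strictly later than `last_synced`.

```python
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


class InMemorySource:
    """Hypothetical stand-in that mimics the AbstractSource method names."""

    def __init__(self) -> None:
        self._connected = False
        self._items: dict[str, tuple[datetime, dict[str, Any]]] = {}

    def connect(self, config: Optional[Mapping[str, Any]] = None) -> None:
        # Phase 1: a real adapter would open a session or authenticate here.
        if self._connected:
            return
        self._items = dict((config or {}).get("items", {}))
        self._connected = True

    def get_list_artefacts(
        self, last_synced: Optional[datetime]
    ) -> list[tuple[str, datetime]]:
        # Phase 2: report (id, last_modified) pairs changed after last_synced.
        changed = [
            (artefact_id, modified)
            for artefact_id, (modified, _) in self._items.items()
            if last_synced is None or modified > last_synced
        ]
        return sorted(changed, key=lambda pair: pair[1])

    def get_content(
        self, artefacts: list[tuple[str, datetime]]
    ) -> list[dict[str, Any]]:
        # Phase 3: fetch the payload for every discovered artefact.
        return [
            {
                "id_": artefact_id,
                "date": self._items[artefact_id][0],
                "content": self._items[artefact_id][1],
            }
            for artefact_id, _ in artefacts
        ]


source = InMemorySource()
source.connect(
    {
        "items": {
            "doc-1": (datetime(2024, 1, 10, tzinfo=timezone.utc), {"title": "Old"}),
            "doc-2": (datetime(2024, 2, 1, tzinfo=timezone.utc), {"title": "New"}),
        }
    }
)
last_synced = datetime(2024, 1, 15, tzinfo=timezone.utc)
changed = source.get_list_artefacts(last_synced)  # only doc-2 is newer
contents = source.get_content(changed)
```

Passing `None` as `last_synced` corresponds to a first, full synchronization; later runs pass the timestamp of the last successful run so that only changed artefacts are fetched.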

Docstring

abstract_source

Classes:

  • AbstractSource

    Abstract interface describing a synchronizable external data source.

  • Content

    Representation of a single artefact retrieved from a source.

AbstractSource


            

Abstract interface describing a synchronizable external data source.

A Source implementation is responsible for:

  1. Establishing a connection to a remote system
  2. Discovering which artefacts changed since a timestamp
  3. Retrieving normalized content for those artefacts

The interface is designed for incremental synchronization workflows.

Lifecycle

connect() MUST be called before any other method.

Consistency guarantees

Implementations must ensure:

  • Stable artefact identifiers across runs
  • Monotonic modification timestamps per artefact
  • Deterministic content serialization

Typical implementations: SharePoint, Zotero, REST APIs, file repositories, databases.
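One way to check the deterministic-serialization guarantee is to compare fingerprints of a canonical JSON encoding. The sketch below is an assumption about how a caller might do this; `content_fingerprint` is a hypothetical helper and not part of the documented interface.

```python
import hashlib
import json
from typing import Any


def content_fingerprint(content: dict[str, Any]) -> str:
    """Hash a canonical JSON encoding of a content dict (hypothetical helper)."""
    # sort_keys and fixed separators make the encoding independent of
    # dict insertion order and of whitespace choices.
    canonical = json.dumps(
        content, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = {"title": "Report", "tags": ["a", "b"]}
second = {"tags": ["a", "b"], "title": "Report"}  # same data, different key order
changed = {"title": "Report v2", "tags": ["a", "b"]}

same_state = content_fingerprint(first) == content_fingerprint(second)
different_state = content_fingerprint(first) != content_fingerprint(changed)
```

Key order is normalized here, but list order is not, so an implementation that returns lists in an unstable order would still produce differing fingerprints for identical source state.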

Methods:

  • connect

    Establish connection to the external source.

  • get_content

    Retrieve normalized content for provided artefacts.

  • get_list_artefacts

    Return identifiers of artefacts modified since last_synced.

connect

connect(config: Mapping[str, Any] | None = None) -> None

Establish connection to the external source.

Idempotent: safe to call multiple times.

Raises

  • ConnectionError
  • PermissionError
  • ValueError

Source code in src/database_builder_libs/models/abstract_source.py
def connect(self, config: Mapping[str, Any] | None = None) -> None:
    """
    Establish connection to the external source.

    Idempotent: safe to call multiple times.

    Raises
    ------
    ConnectionError
    PermissionError
    ValueError
    """
    if self._connected:
        return

    self._connect_impl(config or {})
    self._connected = True
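The listing above delegates the backend-specific work to `_connect_impl`. A minimal sketch of how a subclass might plug into it follows; `BaseSource` reproduces only the `connect()` logic shown above, and `RestSource`, its `base_url` key, and the `connect_calls` counter are hypothetical. Initializing `_connected` in `__init__` is also an assumption, since the constructor is not shown.

```python
from typing import Any, Mapping, Optional


class BaseSource:
    """Stand-in that reproduces only the connect() logic from the listing."""

    def __init__(self) -> None:
        self._connected = False  # assumed initial state

    def connect(self, config: Optional[Mapping[str, Any]] = None) -> None:
        if self._connected:
            return  # idempotent: repeated calls are no-ops
        self._connect_impl(config or {})
        self._connected = True

    def _connect_impl(self, config: Mapping[str, Any]) -> None:
        raise NotImplementedError


class RestSource(BaseSource):
    """Hypothetical adapter that validates its configuration."""

    def __init__(self) -> None:
        super().__init__()
        self.base_url: Optional[str] = None
        self.connect_calls = 0  # counts real connection attempts that succeeded

    def _connect_impl(self, config: Mapping[str, Any]) -> None:
        if "base_url" not in config:
            raise ValueError("config must contain 'base_url'")
        self.base_url = config["base_url"]
        self.connect_calls += 1


source = RestSource()
source.connect({"base_url": "https://example.invalid/api"})
source.connect({"base_url": "https://other.invalid/api"})  # ignored: already connected
```

Because `_connected` is set only after `_connect_impl` returns, a failed attempt leaves the source disconnected and `connect()` can be retried with a corrected configuration.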

get_content abstractmethod

get_content(artefacts: list[tuple[str, datetime]]) -> list[Content]

Retrieve normalized content for provided artefacts.

Parameters

artefacts : list[tuple[str, datetime]]
    Artefacts returned from get_list_artefacts().

Returns

list[Content]
    Content objects corresponding to requested artefacts.

Guarantees

  • One Content object per artefact_id
  • Returned content.date must match the provided timestamp unless the source updated during retrieval.

Notes

Implementations should batch requests where possible.

Raises

RuntimeError
    If called before connect().
KeyError
    If an artefact no longer exists.

Source code in src/database_builder_libs/models/abstract_source.py
@abstractmethod
def get_content(self, artefacts: list[tuple[str, datetime]]) -> list[Content]:
    """
    Retrieve normalized content for provided artefacts.

    Parameters
    ----------
    artefacts : list[tuple[str, datetime]]
        Artefacts returned from get_list_artefacts().

    Returns
    -------
    list[Content]
        Content objects corresponding to requested artefacts.

    Guarantees
    ----------
    - One Content object per artefact_id
    - Returned content.date must match the provided timestamp
      unless the source updated during retrieval.

    Notes
    -----
    Implementations should batch requests where possible.

    Raises
    ------
    RuntimeError
        If called before connect().
    KeyError
        If an artefact no longer exists.
    """
    raise NotImplementedError
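A concrete get_content might look roughly like the sketch below, which batches lookups and raises the documented exceptions. Everything here is illustrative: the `Content` dataclass is a stand-in carrying the three documented attributes, and `DictBackedSource`, `_fetch_batch`, and `batch_size` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class Content:
    """Stand-in carrying the three documented attributes."""
    date: datetime
    id_: str
    content: dict[str, Any]


class DictBackedSource:
    """Hypothetical source whose 'remote system' is a plain dict."""

    def __init__(
        self,
        records: dict[str, tuple[datetime, dict[str, Any]]],
        batch_size: int = 2,
    ) -> None:
        self._records = records
        self._batch_size = batch_size
        self._connected = False
        self.batches_fetched = 0

    def connect(self) -> None:
        self._connected = True

    def _fetch_batch(self, ids: list[str]) -> dict[str, tuple[datetime, dict[str, Any]]]:
        # One simulated round trip per batch; a missing id raises KeyError.
        self.batches_fetched += 1
        return {artefact_id: self._records[artefact_id] for artefact_id in ids}

    def get_content(self, artefacts: list[tuple[str, datetime]]) -> list[Content]:
        if not self._connected:
            raise RuntimeError("connect() must be called before get_content()")
        results: list[Content] = []
        for start in range(0, len(artefacts), self._batch_size):
            chunk = artefacts[start:start + self._batch_size]
            fetched = self._fetch_batch([artefact_id for artefact_id, _ in chunk])
            for artefact_id, _requested in chunk:
                modified, payload = fetched[artefact_id]
                # Use the timestamp reported by the source at retrieval time.
                results.append(Content(date=modified, id_=artefact_id, content=payload))
        return results


stamp = datetime(2024, 3, 1, tzinfo=timezone.utc)
records = {name: (stamp, {"name": name}) for name in ("a", "b", "c")}
source = DictBackedSource(records, batch_size=2)
source.connect()
contents = source.get_content([("a", stamp), ("b", stamp), ("c", stamp)])
```

With three artefacts and a batch size of two, the sketch performs two simulated round trips and returns exactly one Content object per requested identifier, in request order.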

get_list_artefacts abstractmethod

get_list_artefacts(last_synced: Optional[datetime]) -> list[tuple[str, datetime]]

Return identifiers of artefacts modified since last_synced.

Parameters

last_synced : datetime | None
    UTC timestamp of last successful synchronization. If None, the implementation must return ALL available artefacts.

Returns

list[tuple[str, datetime]]
    A list of (artefact_id, last_modified_timestamp).

Requirements

  • Returned timestamps must be timezone-aware.
  • Each artefact_id must appear at most once.
  • The list should be ordered by timestamp ascending if possible.

Raises

RuntimeError
    If called before connect().

Source code in src/database_builder_libs/models/abstract_source.py
@abstractmethod
def get_list_artefacts(
    self, last_synced: Optional[datetime]
) -> list[tuple[str, datetime]]:
    """
    Return identifiers of artefacts modified since `last_synced`.

    Parameters
    ----------
    last_synced : datetime | None
        UTC timestamp of last successful synchronization.
        If None, the implementation must return ALL available artefacts.

    Returns
    -------
    list[tuple[str, datetime]]
        A list of (artefact_id, last_modified_timestamp).

    Requirements
    ------------
    - Returned timestamps must be timezone-aware.
    - Each artefact_id must appear at most once.
    - The list should be ordered by timestamp ascending if possible.

    Raises
    ------
    RuntimeError
        If called before connect().
    """
    raise NotImplementedError
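The requirements listed in the docstring can be checked mechanically before a result is handed to the rest of the pipeline. The helper below is a hypothetical sketch, not part of the interface: it rejects naive timestamps and duplicate identifiers, and returns the list ordered by timestamp.

```python
from datetime import datetime, timezone


def validate_artefact_list(
    artefacts: list[tuple[str, datetime]],
) -> list[tuple[str, datetime]]:
    """Check the documented requirements and return a timestamp-ordered copy."""
    seen: set[str] = set()
    for artefact_id, modified in artefacts:
        # A datetime is timezone-aware only if utcoffset() returns a value.
        if modified.utcoffset() is None:
            raise ValueError(f"timestamp for {artefact_id!r} is not timezone-aware")
        if artefact_id in seen:
            raise ValueError(f"artefact_id {artefact_id!r} appears more than once")
        seen.add(artefact_id)
    return sorted(artefacts, key=lambda pair: pair[1])


raw = [
    ("doc-2", datetime(2024, 2, 1, tzinfo=timezone.utc)),
    ("doc-1", datetime(2024, 1, 10, tzinfo=timezone.utc)),
]
ordered = validate_artefact_list(raw)
```

Sorting only works when all timestamps are timezone-aware, because Python refuses to compare aware and naive datetimes; checking awareness first keeps the failure message specific.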

Content


            

Representation of a single artefact retrieved from a source.

An artefact corresponds to a uniquely identifiable entity in the external system (e.g., SharePoint document, Zotero item, database record).

Attributes

date : datetime
    Last modification timestamp of the artefact in the source system. Must be timezone-aware (UTC recommended).
id_ : str
    Stable unique identifier of the artefact in the source. This identifier MUST remain constant across synchronizations.
content : dict
    Normalized payload retrieved from the source. The structure is implementation-specific but must be JSON-serializable and deterministic: identical source state must produce an identical dict.
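The attribute constraints can be enforced at construction time. The sketch below is an assumption about how that could be done and does not reflect how the real Content class is defined; `ValidatedContent` is a hypothetical name.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class ValidatedContent:
    """Hypothetical variant of Content that checks its own constraints."""
    date: datetime
    id_: str
    content: dict[str, Any]

    def __post_init__(self) -> None:
        if self.date.utcoffset() is None:
            raise ValueError("date must be timezone-aware")
        # json.dumps raises TypeError for values that cannot be encoded as JSON.
        json.dumps(self.content, sort_keys=True)


item = ValidatedContent(
    date=datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
    id_="zotero:ABCD1234",  # illustrative identifier format, not a real key
    content={"title": "Example", "authors": ["A. Author"]},
)
```

A payload containing, for example, a `set` or a raw `datetime` would be rejected with a `TypeError`, which signals that the adapter still needs to normalize that field.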