Zotero source

Overview

The Zotero source allows you to retrieve documents and metadata from a Zotero database using its API. It implements the AbstractSource interface to provide incremental synchronization of Zotero library items.

Design Notes

Interaction Patterns

The ZoteroSource follows these interaction patterns:

  1. Connection Pattern:

    • Initialize with library credentials
    • Create pyzotero client instance
    • Optional collection filtering
  2. Synchronization Pattern:

    • Query items modified since last sync
    • Convert Zotero timestamps to UTC
    • Return stable item keys
  3. Content Retrieval Pattern:

    • Fetch full item metadata
    • Normalize to Content objects
    • Preserve Zotero data structure
  4. Attachment Download Pattern:

    • Check for local file availability first
    • Fall back to API download if needed
    • Save as {item_id}.pdf

Implementation Details

  • Timestamp Handling: All timestamps are converted to timezone-aware UTC for consistency
  • Deletion Limitation: Zotero API doesn't report deleted items in sync
  • Attachment Priority: Prefers local Zotero storage over API downloads for performance
  • Error Handling: Gracefully handles missing attachments and continues processing
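
The timestamp handling above can be illustrated with the standard library alone. This is a minimal sketch, not the module's own code (which uses isoparse); the helper name to_utc is hypothetical and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timezone


def to_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp and return it as timezone-aware UTC."""
    # Python versions before 3.11 reject a trailing "Z", so spell out the offset.
    parsed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


# Two spellings of the same instant compare equal once normalized.
zulu = to_utc("2024-03-01T12:30:00Z")
offset = to_utc("2024-03-01T14:30:00+02:00")
```

Normalizing before comparison matters because Python refuses to order a naive datetime against an aware one.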

Docstring

zotero_source

Classes:

ZoteroSource


Class hierarchy: AbstractSource (database_builder_libs.models.abstract_source) → ZoteroSource (database_builder_libs.sources.zotero_source)

Zotero implementation of AbstractSource.

Provides incremental synchronization of a Zotero library and exposes items as canonical Content objects.

Mapping

  • Zotero item → Content
  • item.key → Content.id_
  • item.data → Content.content
  • item.dateModified → Content.date
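
The mapping can be sketched on a single item dict. The Content dataclass below is a hypothetical stand-in for the library's model, to_content is an illustrative helper, and the sample item is made up; dateModified is read from the data dict, as the get_list_artefacts source does.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass
class Content:
    """Hypothetical stand-in for the library's Content model."""
    id_: str
    date: datetime
    content: dict[str, Any]


def to_content(item: dict[str, Any]) -> Content:
    """Apply the key/data/dateModified mapping to one Zotero item dict."""
    data = item["data"]
    modified = datetime.fromisoformat(data["dateModified"].replace("Z", "+00:00"))
    return Content(
        id_=item["key"],
        date=modified.astimezone(timezone.utc),
        content=data,
    )


# Made-up item with a top-level key plus a data dict.
sample = {
    "key": "ABCD2345",
    "data": {
        "key": "ABCD2345",
        "itemType": "journalArticle",
        "title": "An example title",
        "dateModified": "2024-03-01T12:30:00Z",
    },
}
content = to_content(sample)
```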

Synchronization semantics

  • get_list_artefacts() performs incremental sync by comparing each item's dateModified (or dateAdded when dateModified is missing) against last_synced
  • Returned timestamps are UTC
  • Identifiers are stable across runs
  • Deleted items are NOT reported (Zotero API limitation)

Attachment handling

download_zotero_item() retrieves the first attachment:

  • Prefers local Zotero storage when available
  • Falls back to API download

Lifecycle

connect() must be called before using the source.

Methods:

connect

connect(config: Mapping[str, Any] | None = None) -> None

Establish connection to the external source.

Idempotent: safe to call multiple times.

Raises

  • ConnectionError
  • PermissionError
  • ValueError

Source code in src/database_builder_libs/models/abstract_source.py
def connect(self, config: Mapping[str, Any] | None = None) -> None:
    """
    Establish connection to the external source.

    Idempotent: safe to call multiple times.

    Raises
    ------
    ConnectionError
    PermissionError
    ValueError
    """
    if self._connected:
        return

    self._connect_impl(config or {})
    self._connected = True
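
The idempotency of connect() can be demonstrated with a small stand-in. SketchSource is hypothetical and only mirrors the logic shown above; the library_id config key is an assumption made for illustration, not a documented setting.

```python
from typing import Any, Mapping, Optional


class SketchSource:
    """Hypothetical stand-in that mirrors the connect() logic shown above."""

    def __init__(self) -> None:
        self._connected = False
        self.connect_calls = 0  # counts how often the real work runs

    def connect(self, config: Optional[Mapping[str, Any]] = None) -> None:
        if self._connected:
            return
        self._connect_impl(config or {})
        self._connected = True

    def _connect_impl(self, config: Mapping[str, Any]) -> None:
        # A real subclass would validate config and create its API client here.
        self.connect_calls += 1


source = SketchSource()
source.connect({"library_id": "123"})  # hypothetical config key
source.connect({"library_id": "456"})  # no-op: already connected
```

Two consequences follow from the guard: a config passed to a later call is ignored, and because the flag is set only after _connect_impl returns, a failed attempt can be retried.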

download_zotero_item

download_zotero_item(*, item_id: str, download_path: str) -> None

Download the first attachment of the specified Zotero item to the specified path.

This function wraps the pyzotero dump call so that attachments can be fetched from local storage as well as from the cloud API. On its own, dump currently supports only cloud downloads.

Parameters:

  • `item_id`

    The key of the Zotero item whose attachment (PDF) should be downloaded (the key attribute of the Zotero item dict)

  • `download_path`

    The folder to download the item to; the file is written to <download_path>/<item_id>.pdf

Source code in src/database_builder_libs/sources/zotero_source.py
def download_zotero_item(
    self,
    *,
    item_id: str,
    download_path: str,
) -> None:
    """Download the first attachment of specified zotero item to specified path

    This function wraps the pyzotero dump call so that attachments can be fetched from local storage
    as well as from the cloud API. On its own, dump currently supports only cloud downloads.

    Args:
       `item_id`: The key of the Zotero item whose attachment (PDF) should be downloaded (the `key` attribute of the Zotero item dict)
       `download_path`: The folder to download the item to; the file is written to <download_path>/<item_id>.pdf
    """
    self._ensure_connected()
    assert self._zotero is not None

    logger.debug("Fetching File: {}", item_id)

    children = self._zotero.children(item_id)
    if not children:
        logger.warning("No child attachments found for item {}", item_id)
        return

    attachments = [
        c for c in children if c.get("data", {}).get("itemType") == "attachment"
    ]
    if not attachments:
        logger.warning("No attachment-type children for item {}", item_id)
        return

    attachment = attachments[0]
    data = attachment.get("data", {})
    local_path = data.get("path")

    download_dir = Path(download_path)
    download_dir.mkdir(parents=True, exist_ok=True)
    target = download_dir / f"{item_id}.pdf"
    if local_path and Path(local_path).exists():
        logger.info("Copying local attachment from {}", local_path)
        shutil.copy(local_path, target)
        return

    logger.info("Local attachment not found, downloading via Zotero API")

    self._zotero.dump(
        itemkey=attachment["key"],
        filename=target.name,
        path=download_path,
    )
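
Because download_zotero_item() returns None in every case, including when it only logs a warning that no attachment exists, a caller may want to confirm that the file actually appeared. The following is a minimal sketch of such a check; downloaded_file is a hypothetical helper and the item keys are made up.

```python
import tempfile
from pathlib import Path
from typing import Optional


def downloaded_file(download_path: str, item_id: str) -> Optional[Path]:
    """Return <download_path>/<item_id>.pdf if it exists, else None."""
    target = Path(download_path) / f"{item_id}.pdf"
    return target if target.is_file() else None


with tempfile.TemporaryDirectory() as folder:
    # Simulate a successful download for one made-up item key.
    (Path(folder) / "ABCD2345.pdf").write_bytes(b"%PDF-1.4")
    found = downloaded_file(folder, "ABCD2345")
    found_name = found.name if found else None
    # No file was written for this key, so the check reports None.
    missing = downloaded_file(folder, "WXYZ6789")
```

In real use the check would follow a call such as source.download_zotero_item(item_id=..., download_path=...). Note that the target name always ends in .pdf, whatever the attachment's actual file type.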

get_all_documents_metadata

get_all_documents_metadata(collection_id: str) -> List[dict[str, Any]]

Retrieve the metadata of all documents within a collection.

This function calls the Zotero collection items API, 'https://api.zotero.org/users/<library_id>/collections/<collection_id>/items/top', through the pyzotero library and returns a list of metadata dictionaries. Keep in mind that the returned structures are large and take some time to retrieve.

Parameters:

  • `collection_id`

    The collection to retrieve document metadata from (should be visible in WebURL when using zotero webportal)

Returns:

  • List[dict[str, Any]]

    List containing one document-metadata dict per document in the collection. The dict format closely resembles the output of pyzotero: https://pyzotero.readthedocs.io/en/latest/#zotero.Zotero.collection_items_top

Source code in src/database_builder_libs/sources/zotero_source.py
def get_all_documents_metadata(self, collection_id: str) -> List[dict[str, Any]]:
    """Retrieve the metadata of all documents within collection

    This function calls the Zotero collection items API,
    'https://api.zotero.org/users/<library_id>/collections/<collection_id>/items/top',
    through the pyzotero library and returns a list of metadata dictionaries.
    Keep in mind that the returned structures are large and take some time to retrieve.

    Args:
        `collection_id`: The collection to retrieve document metadata from
                        (should be visible in WebURL when using zotero webportal)

    Returns:
        List containing one document-metadata dict per document in the collection.
        The dict format closely resembles the output of pyzotero:
        https://pyzotero.readthedocs.io/en/latest/#zotero.Zotero.collection_items_top
    """
    self._ensure_connected()
    assert self._zotero is not None

    return self._zotero.everything(
        self._zotero.collection_items_top(collection_id, limit=None)
    )

get_content

get_content(artefacts: list[tuple[str, datetime]]) -> list[Content]

Fetch normalized content for Zotero items.

Each artefact is retrieved individually and converted to Content.

Guarantees
  • One Content object per artefact whose item can be retrieved; artefacts whose lookup returns nothing are skipped
  • Content.date reflects the modification timestamp observed during listing.
  • Content.content may represent a newer revision if the item changed during retrieval.
  • Content.content contains raw Zotero data field

This method does not download attachments.

Source code in src/database_builder_libs/sources/zotero_source.py
def get_content(self, artefacts: list[tuple[str, datetime]]) -> list[Content]:
    """
    Fetch normalized content for Zotero items.

    Each artefact is retrieved individually and converted to `Content`.

    Guarantees
    ----------
    - One Content object per artefact whose item can be retrieved; artefacts whose lookup returns nothing are skipped
    - Content.date reflects the modification timestamp observed during listing.
    - Content.content may represent a newer revision if the item changed during retrieval.
    - Content.content contains raw Zotero `data` field

    This method does not download attachments.
    """
    self._ensure_connected()
    assert self._zotero is not None

    contents: list[Content] = []

    for item_key, modified in artefacts:
        item = self._zotero.item(item_key)
        if not item:
            continue
        data = item.get("data", {})

        contents.append(
            Content(
                id_=item_key,
                date=modified,
                content=data,
            )
        )

    return contents
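
Since the loop skips artefacts whose lookup yields nothing, the result can be shorter than the input. The sketch below mirrors that loop with plain dicts in place of Content so a caller can see which keys were dropped. FakeZotero is a hypothetical stand-in for the pyzotero client; how the real client reports a missing item is not shown in the source, so the fake simply returns an empty dict.

```python
from datetime import datetime, timezone
from typing import Any


class FakeZotero:
    """Hypothetical stand-in for the pyzotero client; serves items from a dict."""

    def __init__(self, items: dict[str, dict[str, Any]]) -> None:
        self._items = items

    def item(self, key: str) -> dict[str, Any]:
        # Return an empty dict for unknown keys, which the loop treats as "no item".
        return self._items.get(key, {})


def fetch(client: FakeZotero, artefacts: list[tuple[str, datetime]]) -> list[dict[str, Any]]:
    """Mirror the get_content loop, using plain dicts in place of Content."""
    contents = []
    for item_key, modified in artefacts:
        item = client.item(item_key)
        if not item:
            continue  # skipped silently, as in get_content
        contents.append({"id_": item_key, "date": modified, "content": item.get("data", {})})
    return contents


when = datetime(2024, 3, 1, tzinfo=timezone.utc)
client = FakeZotero({"ABCD2345": {"data": {"title": "An example title"}}})
artefacts = [("ABCD2345", when), ("WXYZ6789", when)]

contents = fetch(client, artefacts)
# Keys that were listed but produced no content.
skipped = sorted({key for key, _ in artefacts} - {c["id_"] for c in contents})
```

Comparing the listed keys with the returned ids is one way to notice items that disappeared between listing and retrieval.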

get_list_artefacts

get_list_artefacts(last_synced: Optional[datetime]) -> list[tuple[str, datetime]]

Return Zotero items modified after last_synced.

Parameters

last_synced : datetime | None
    UTC timestamp of last successful sync. If None, all items are returned.

Returns

list[(item_key, modified_time)]

Sync guarantees
  • item_key is stable across runs
  • timestamps are timezone-aware UTC
  • includes newly created and modified items
  • DOES NOT include deleted items (Zotero limitation)

Notes

Zotero since uses server modification time, not file change time.

Source code in src/database_builder_libs/sources/zotero_source.py
def get_list_artefacts(
    self, last_synced: Optional[datetime]
) -> list[tuple[str, datetime]]:
    """
    Return Zotero items modified after `last_synced`.

    Parameters
    ----------
    last_synced : datetime | None
        UTC timestamp of last successful sync.
        If None, all items are returned.

    Returns
    -------
    list[(item_key, modified_time)]

    Sync guarantees
    ---------------
    - item_key is stable across runs
    - timestamps are timezone-aware UTC
    - includes newly created and modified items
    - DOES NOT include deleted items (Zotero limitation)

    Notes
    -----
    Zotero `since` uses server modification time, not file change time.
    """
    self._ensure_connected()
    assert self._zotero is not None
    assert self._config is not None

    if self._config.collection:
        items_iter = self._zotero.collection_items_top(
            self._config.collection, limit=None
        )
    else:
        items_iter = self._zotero.items()

    items = list(self._zotero.everything(items_iter))

    artefacts: list[tuple[str, datetime]] = []

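    # If no cursor → epoch
    if last_synced is None:
        last_synced = datetime(1970, 1, 1, tzinfo=timezone.utc)
    elif last_synced.tzinfo is None:
        # Treat a naive cursor as UTC so the comparison below cannot raise TypeError
        last_synced = last_synced.replace(tzinfo=timezone.utc)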

    for item in items:
        data = item.get("data", {})
        key = data.get("key")

        # Zotero reality: sometimes only dateAdded exists
        modified_str = data.get("dateModified") or data.get("dateAdded")
        if not key or not modified_str:
            continue

        modified = isoparse(modified_str).astimezone(timezone.utc)

        if modified > last_synced:
            artefacts.append((key, modified))

    return artefacts
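
A caller needs a new cursor for the next run. One possible approach, sketched here under the assumption that the caller stores the cursor itself, is to keep the newest modification time among the returned artefacts. next_cursor is a hypothetical helper and the sample data is made up.

```python
from datetime import datetime, timezone
from typing import Optional

Artefact = tuple[str, datetime]


def next_cursor(artefacts: list[Artefact], last_synced: Optional[datetime]) -> Optional[datetime]:
    """Return the newest modification time seen, or the old cursor if nothing changed."""
    if not artefacts:
        return last_synced
    return max(modified for _, modified in artefacts)


previous = datetime(2024, 3, 1, tzinfo=timezone.utc)
listed = [
    ("ABCD2345", datetime(2024, 3, 2, 9, 0, tzinfo=timezone.utc)),
    ("WXYZ6789", datetime(2024, 3, 5, 17, 45, tzinfo=timezone.utc)),
]
cursor = next_cursor(listed, previous)
unchanged = next_cursor([], previous)
```

Because the listing keeps only items with modified > last_synced, an item whose timestamp equals the stored cursor is not listed again on the next run.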