v1.0.0·9,758 records·0 files·12.5 MB
Jun 24, 2026
README

Cambridge Open Engage

Preprints and early research outputs from Cambridge Open Engage, Cambridge University Press's open scholarship platform (built on the Engage preprint engine). Scraped from the public API: https://www.cambridge.org/engage/coe/public-api/v1.

Record types

| Type | Description |
| --- | --- |
| Preprint | A research output (working paper, poster, presentation, …) with full metadata, metrics, funders, and a direct link to the PDF as hosted by Open Engage. |
| Author | Authors, deduplicated by ORCID (falling back to normalized name), with institutions and ROR IDs. |
| Subject | The 37 top-level disciplinary subjects, with item counts. |
| Category | Finer-grained disciplinary sub-categories, linked to their subject. |
| Event | Academic events / conferences associated with preprints. |
| EventGroup | Parent series grouping related events (e.g. "MPSA Annual Meeting"). |
| Community | Partner communities / originating societies (the item origin, e.g. APSA, COE). |
| License | Content licenses offered on the platform (CC0, CC BY 4.0, …). |
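
The Author deduplication rule (ORCID first, normalized name as fallback) could look roughly like the sketch below. The README does not say how names are normalized, so the normalization here (strip diacritics, lowercase, collapse whitespace) is an assumption, and `RawAuthor`, `normalizeName`, `authorKey`, and `dedupeAuthors` are hypothetical names rather than the scraper's own.

```typescript
// Hypothetical input shape: only the fields needed for deduplication.
interface RawAuthor {
  name: string;
  orcid?: string | null;
}

// Assumed normalization: strip diacritics, lowercase, collapse whitespace.
function normalizeName(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim();
}

// Prefer the ORCID as the identity key; fall back to the normalized name.
function authorKey(author: RawAuthor): string {
  const orcid = author.orcid ? author.orcid.trim() : "";
  return orcid ? `orcid:${orcid}` : `name:${normalizeName(author.name)}`;
}

// Keep the first record seen for each key.
function dedupeAuthors(authors: RawAuthor[]): Map<string, RawAuthor> {
  const byKey = new Map<string, RawAuthor>();
  for (const author of authors) {
    const key = authorKey(author);
    if (!byKey.has(key)) byKey.set(key, author);
  }
  return byKey;
}
```

One consequence of this scheme is that an author who appears once with an ORCID and once without ends up as two records, since the two keys never match.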

Relationships are expressed by reference: a Preprint stores subjectId, categoryIds, eventId, licenseId, communityId, and authorIds.
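
As an illustration of working with these references, the sketch below indexes records by ID and resolves a Preprint's authorIds into Author records. The reference field names come from the list above. The `id` and `name` fields, the nullability of `eventId`, and the helpers `indexById` and `resolveAuthors` are assumptions made for the example.

```typescript
// Trimmed-down, hypothetical record shapes. The reference fields are the ones
// named in the README; `id`, `name`, and the optional eventId are assumptions.
interface PreprintRefs {
  id: string;
  subjectId: string;
  categoryIds: string[];
  eventId?: string | null;
  licenseId: string;
  communityId: string;
  authorIds: string[];
}

interface AuthorRecord {
  id: string;
  name: string;
}

// Build an id -> record lookup for any record type.
function indexById<T extends { id: string }>(records: T[]): Map<string, T> {
  const index = new Map<string, T>();
  for (const record of records) index.set(record.id, record);
  return index;
}

// Resolve authorIds in order, silently skipping IDs with no matching record.
function resolveAuthors(
  preprint: PreprintRefs,
  authors: Map<string, AuthorRecord>,
): AuthorRecord[] {
  return preprint.authorIds
    .map((id) => authors.get(id))
    .filter((a): a is AuthorRecord => a !== undefined);
}
```

The same `indexById` lookup works for subjects, categories, events, licenses, and communities.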

Files / PDFs

PDF binaries are not uploaded with the records. Each Preprint carries assetUrl, a direct link to the PDF as currently hosted by Open Engage, plus assetFileName, assetMimeType, assetFileSizeBytes, and a thumbnailUrl. Supplementary materials are listed under supplementaryFiles with their own hosted URLs.
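
Because the link points at whatever Open Engage currently hosts, a consumer who downloads a PDF may want to check it against the recorded metadata. The sketch below is one way to do that, assuming assetFileSizeBytes is an exact byte count and assetMimeType is `application/pdf` for PDFs. Neither is confirmed by this README, and `AssetInfo` and `checkDownloadedAsset` are hypothetical names.

```typescript
// Asset fields named in the README; the field types are assumptions.
interface AssetInfo {
  assetUrl: string;
  assetFileName: string;
  assetMimeType: string;
  assetFileSizeBytes: number;
}

// Returns a list of problems; an empty list means the bytes look consistent
// with the recorded metadata.
function checkDownloadedAsset(meta: AssetInfo, bytes: Uint8Array): string[] {
  const problems: string[] = [];
  if (bytes.length !== meta.assetFileSizeBytes) {
    problems.push(
      `size mismatch: expected ${meta.assetFileSizeBytes}, got ${bytes.length}`,
    );
  }
  if (meta.assetMimeType === "application/pdf") {
    // PDF files begin with the ASCII bytes "%PDF-".
    const magic = [0x25, 0x50, 0x44, 0x46, 0x2d];
    const hasHeader = magic.every((b, i) => bytes[i] === b);
    if (!hasHeader) problems.push("missing %PDF- header");
  }
  return problems;
}
```

A size mismatch does not necessarily mean a corrupt download. It may simply mean the hosted file was replaced after the record was scraped.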

Scraping

pnpm scrape cambridge-up/open-engage
pnpm push cambridge-up/open-engage --env dev

The scraper enumerates all items via the paginated /items endpoint, then hydrates each one via /items/{id} to get the full record. Detail responses are cached to data/.cache/, so reruns are fast and don't re-hit the API. Requests are throttled with bounded concurrency and backoff on 429 and 5xx responses.
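
The enumerate, hydrate, cache, and throttle flow can be sketched as below. This is not the scraper's actual code: `listPage` and `getDetail` are hypothetical callbacks standing in for the /items and /items/{id} calls, offset/limit pagination is an assumption about the API, failed calls are assumed to throw an error with a numeric `status`, and an in-memory Map stands in for the on-disk data/.cache/ directory.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumption: failed HTTP calls throw an object with a numeric `status` field.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// Retry on 429 and 5xx with exponential backoff; rethrow anything else.
async function withBackoff<T>(fn: () => Promise<T>, retries = 4, baseMs = 500): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = statusOf(err);
      const retryable = status !== undefined && (status === 429 || status >= 500);
      if (!retryable || attempt >= retries) throw err;
      await sleep(baseMs * 2 ** attempt);
    }
  }
}

// Run fn over items with at most `limit` calls in flight; results keep input order.
async function mapLimit<T, R>(items: T[], limit: number, fn: (item: T) => Promise<R>): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers: Promise<void>[] = [];
  for (let w = 0; w < Math.min(limit, items.length); w++) workers.push(worker());
  await Promise.all(workers);
  return results;
}

// Hypothetical callbacks for the two endpoints.
interface ScrapeDeps<D> {
  listPage(offset: number, limit: number): Promise<string[]>; // IDs on one /items page
  getDetail(id: string): Promise<D>; // full record from /items/{id}
}

async function scrape<D>(
  deps: ScrapeDeps<D>,
  cache: Map<string, D>,
  pageSize = 50,
  concurrency = 4,
  baseMs = 500,
): Promise<D[]> {
  // 1. Enumerate: walk pages until a short page signals the end.
  const ids: string[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = await withBackoff(() => deps.listPage(offset, pageSize), 4, baseMs);
    ids.push(...page);
    if (page.length < pageSize) break;
  }
  // 2. Hydrate: fetch details with bounded concurrency, consulting the cache first.
  return mapLimit(ids, concurrency, async (id) => {
    const hit = cache.get(id);
    if (hit !== undefined) return hit;
    const detail = await withBackoff(() => deps.getDetail(id), 4, baseMs);
    cache.set(id, detail);
    return detail;
  });
}
```

Passing the same cache to a second `scrape` call skips every detail request already answered, which is the property that makes reruns cheap.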