Cambridge Open Engage
Preprints and early research outputs from Cambridge Open Engage,
Cambridge University Press's open scholarship platform (built on the Engage
preprint engine). Scraped from the public API:
https://www.cambridge.org/engage/coe/public-api/v1.
Record types
| Type | Description |
|---|---|
| Preprint | A research output (working paper, poster, presentation, …) with full metadata, metrics, funders, and a direct link to the PDF as hosted by Open Engage. |
| Author | Authors, deduplicated by ORCID (falling back to normalized name), with institutions and ROR IDs. |
| Subject | The 37 top-level disciplinary subjects, with item counts. |
| Category | Finer-grained disciplinary sub-categories, linked to their subject. |
| Event | Academic events / conferences associated with preprints. |
| EventGroup | Parent series grouping related events (e.g. "MPSA Annual Meeting"). |
| Community | Partner communities / originating societies (the item origin, e.g. APSA, COE). |
| License | Content licenses offered on the platform (CC0, CC BY 4.0, …). |
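
The Author row says records are deduplicated by ORCID, falling back to a normalized name. The scraper's actual normalization rules are not documented here, so the following is only a minimal sketch of one way such a key could be built. `RawAuthor`, `normalizeName`, `authorKey`, and `dedupeAuthors` are hypothetical names, and the normalization steps (strip diacritics, lowercase, collapse whitespace) are assumptions.

```typescript
// Hypothetical input shape: only the fields needed for deduplication.
interface RawAuthor {
  name: string;
  orcid?: string;
}

// Assumed normalization: strip diacritics, lowercase, collapse whitespace.
function normalizeName(name: string): string {
  return name
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim();
}

// Prefer the ORCID as identity; fall back to the normalized name.
function authorKey(author: RawAuthor): string {
  const orcid = author.orcid?.trim();
  return orcid ? `orcid:${orcid}` : `name:${normalizeName(author.name)}`;
}

// Keep the first record seen for each key.
function dedupeAuthors(authors: RawAuthor[]): Map<string, RawAuthor> {
  const byKey = new Map<string, RawAuthor>();
  for (const author of authors) {
    const key = authorKey(author);
    if (!byKey.has(key)) byKey.set(key, author);
  }
  return byKey;
}
```

Note that in this sketch an author who appears once with an ORCID and once without is kept as two records, since the two keys differ. Whether the real scraper merges such cases is not stated.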
Relationships are expressed by reference: a Preprint stores subjectId,
categoryIds, eventId, licenseId, communityId, and authorIds.
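
As a rough illustration of what "by reference" means for a consumer, the sketch below models the reference fields listed above and resolves `authorIds` against a lookup table. The field names come from this README. Everything else is an assumption: the `id` and `name` fields, which references are optional, and the `resolveAuthors` helper.

```typescript
// Reference fields as named in this README. Optionality is assumed.
interface Preprint {
  id: string; // assumed identifier field
  subjectId: string;
  categoryIds: string[];
  eventId?: string; // assumed optional: not every preprint has an event
  licenseId: string;
  communityId: string;
  authorIds: string[];
}

// Hypothetical minimal Author record.
interface Author {
  id: string;
  name: string;
}

// Resolve author references, reporting IDs that have no matching record.
function resolveAuthors(
  preprint: Preprint,
  authorsById: Map<string, Author>
): { found: Author[]; missing: string[] } {
  const found: Author[] = [];
  const missing: string[] = [];
  for (const id of preprint.authorIds) {
    const author = authorsById.get(id);
    if (author) found.push(author);
    else missing.push(id);
  }
  return { found, missing };
}
```

Returning the unresolved IDs rather than dropping them silently makes dangling references easy to spot after a partial scrape.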
Files / PDFs
PDF binaries are not uploaded. Instead, each Preprint carries assetUrl, a
direct link to the PDF as currently hosted by Open Engage, along with
assetFileName, assetMimeType, assetFileSizeBytes, and a thumbnailUrl.
Supplementary materials are listed under supplementaryFiles, each with its
own hosted URL.
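
Because only links are stored, a consumer that wants the files has to collect the URLs itself. The sketch below gathers the main PDF and any supplementary files into one list. The top-level field names are taken from this section. The inner shape of a `supplementaryFiles` entry (`url`, `fileName`) is an assumption, and `hostedFiles` is a hypothetical helper.

```typescript
// Assumed shape of one supplementaryFiles entry; the real fields may differ.
interface SupplementaryFile {
  url: string;
  fileName: string;
}

// File-related Preprint fields as named in this README.
interface PreprintAssets {
  assetUrl: string;
  assetFileName: string;
  assetMimeType: string;
  assetFileSizeBytes: number;
  thumbnailUrl: string;
  supplementaryFiles: SupplementaryFile[];
}

interface HostedFile {
  kind: "main" | "supplementary";
  url: string;
  fileName: string;
}

// List every hosted file for a preprint, main PDF first.
function hostedFiles(p: PreprintAssets): HostedFile[] {
  const files: HostedFile[] = [
    { kind: "main", url: p.assetUrl, fileName: p.assetFileName },
  ];
  for (const s of p.supplementaryFiles) {
    files.push({ kind: "supplementary", url: s.url, fileName: s.fileName });
  }
  return files;
}
```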
Scraping
pnpm scrape cambridge-up/open-engage
pnpm push cambridge-up/open-engage --env dev
The scraper enumerates all items via the paginated /items endpoint, then
hydrates each one via /items/{id} for full fidelity. Detail responses are
cached to data/.cache/, so reruns are fast and do not re-hit the API. Requests
are throttled (bounded concurrency, plus backoff on 429/5xx responses) to stay
polite to the API.
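
The enumerate, hydrate, cache, and throttle steps described above can be sketched as follows. This is not the scraper's actual code. It assumes a `limit`/`skip` pagination scheme and a simplified page shape (`totalCount`, `ids`), uses an in-memory `Map` in place of the data/.cache/ directory, and takes an injected `Fetcher` so that it runs without network access. All function and type names are hypothetical.

```typescript
// Hypothetical transport: returns an HTTP status and a parsed JSON body.
type Fetcher = (path: string) => Promise<{ status: number; body: unknown }>;

// Assumed, simplified page shape for the /items listing.
interface Page {
  totalCount: number;
  ids: string[];
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry 429 and 5xx responses with exponential backoff; fail fast otherwise.
async function fetchWithBackoff(
  fetcher: Fetcher,
  path: string,
  retries = 4,
  baseMs = 250
): Promise<unknown> {
  for (let attempt = 0; ; attempt++) {
    const { status, body } = await fetcher(path);
    if (status >= 200 && status < 300) return body;
    const retryable = status === 429 || status >= 500;
    if (!retryable || attempt >= retries) {
      throw new Error(`GET ${path} failed with status ${status}`);
    }
    await sleep(baseMs * 2 ** attempt);
  }
}

// Walk the paginated listing until every ID has been collected.
async function listAllIds(fetcher: Fetcher, limit = 50): Promise<string[]> {
  const ids: string[] = [];
  for (let skip = 0; ; skip += limit) {
    const path = `/items?limit=${limit}&skip=${skip}`;
    const page = (await fetchWithBackoff(fetcher, path)) as Page;
    ids.push(...page.ids);
    if (page.ids.length === 0 || ids.length >= page.totalCount) return ids;
  }
}

// Hydrate each ID with at most `concurrency` requests in flight,
// skipping anything already present in the cache.
async function hydrateAll(
  fetcher: Fetcher,
  ids: string[],
  cache: Map<string, unknown>,
  concurrency = 4
): Promise<void> {
  const queue = [...ids];
  const worker = async (): Promise<void> => {
    for (let id = queue.shift(); id !== undefined; id = queue.shift()) {
      if (cache.has(id)) continue;
      cache.set(id, await fetchWithBackoff(fetcher, `/items/${id}`));
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
}
```

A fixed pool of workers pulling from a shared queue bounds concurrency without an extra dependency. Checking the cache before each request is what lets a second run skip every detail fetch.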