Cambridge Open Engage

Preprints and early research outputs from Cambridge Open Engage, Cambridge University Press's open scholarship platform (built on the Engage preprint engine). Scraped from the public API: https://www.cambridge.org/engage/coe/public-api/v1.

Record types

Type	Description
Preprint	A research output (working paper, poster, presentation, …) with full metadata, metrics, funders, and a direct link to the PDF as hosted by Open Engage.
Author	Authors, deduplicated by ORCID (falling back to normalized name), with institutions and ROR IDs.
Subject	The 37 top-level disciplinary subjects, with item counts.
Category	Finer-grained disciplinary sub-categories, linked to their subject.
Event	Academic events / conferences associated with preprints.
EventGroup	Parent series grouping related events (e.g. "MPSA Annual Meeting").
Community	Partner communities / originating societies (the item `origin`, e.g. APSA, COE).
License	Content licenses offered on the platform (CC0, CC BY 4.0, …).
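
The Author row above describes deduplication by ORCID with a fallback to a normalized name. A minimal sketch of how such a key could be computed follows. The `RawAuthor` shape, the function names, and the exact normalization (strip diacritics, lower-case, collapse whitespace) are assumptions for illustration; the scraper's actual rules are not documented here.

```typescript
// Hypothetical input shape: only "ORCID, else normalized name" comes from the README.
interface RawAuthor {
  name: string;
  orcid?: string | null;
}

// Build a dedup key: prefer the ORCID iD, otherwise fall back to a normalized name.
function authorKey(author: RawAuthor): string {
  // Accept a bare iD or a full https://orcid.org/... URL.
  const match = author.orcid?.match(/(\d{4}-\d{4}-\d{4}-\d{3}[\dXx])/);
  if (match) {
    return `orcid:${match[1].toUpperCase()}`;
  }
  // Assumed normalization: drop combining marks, lower-case, collapse whitespace.
  const name = author.name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim();
  return `name:${name}`;
}

// Keep the first record seen for each key.
function dedupeAuthors(authors: readonly RawAuthor[]): RawAuthor[] {
  const seen = new Map<string, RawAuthor>();
  for (const author of authors) {
    const key = authorKey(author);
    if (!seen.has(key)) {
      seen.set(key, author);
    }
  }
  return Array.from(seen.values());
}
```

Name-based matching can merge distinct people who share a name, which is presumably why it is only the fallback.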

Relationships are expressed by reference: a Preprint stores subjectId, categoryIds, eventId, licenseId, communityId, and authorIds.
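
To illustrate what reference-based relationships mean for a consumer, here is a hedged sketch that resolves a Preprint's authorIds against an index of Author records. The reference field names come from the sentence above; the `id` and `name` fields, the optionality of each reference, and the helper names are assumptions.

```typescript
// Reference fields as named in this README; everything else is an assumed shape.
interface Preprint {
  id: string; // assumed identifier field
  subjectId: string;
  categoryIds: string[];
  eventId?: string; // assumed optional: not every preprint has an event
  licenseId?: string;
  communityId?: string;
  authorIds: string[];
}

interface Author {
  id: string; // assumed identifier field
  name: string; // assumed display field
}

// Build an id -> record lookup for any record type that carries an id.
function indexById<T extends { id: string }>(records: readonly T[]): Map<string, T> {
  const index = new Map<string, T>();
  for (const record of records) {
    index.set(record.id, record);
  }
  return index;
}

// Follow authorIds to Author records, preserving order and skipping dangling ids.
function resolveAuthors(preprint: Preprint, authors: Map<string, Author>): Author[] {
  const resolved: Author[] = [];
  for (const id of preprint.authorIds) {
    const author = authors.get(id);
    if (author) {
      resolved.push(author);
    }
  }
  return resolved;
}
```

The same `indexById` lookup would apply to subjectId, categoryIds, eventId, licenseId, and communityId.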

Files / PDFs

PDF binaries are not uploaded. Each Preprint carries assetUrl — a direct link to the PDF as currently hosted by Open Engage — plus assetFileName, assetMimeType, assetFileSizeBytes, and a thumbnailUrl. Supplementary materials are listed under supplementaryFiles with their own hosted URLs.

Scraping

pnpm scrape cambridge-up/open-engage
pnpm push cambridge-up/open-engage --env dev

The scraper enumerates all items via the paginated /items endpoint, then fetches each one via /items/{id} to get the complete record. Detail responses are cached in data/.cache/, so reruns are fast and do not hit the API again. Requests are throttled (bounded concurrency, plus backoff on 429 and 5xx responses) to stay polite to the API.
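
The throttling described above can be sketched roughly as follows. This is not the scraper's actual code: `fetchJsonWithBackoff`, `mapWithConcurrency`, the `FetchLike` type, the retry count, and the exponential delay schedule are all assumptions chosen for illustration, and the disk cache is left out.

```typescript
// Minimal response shape so the fetch function can be swapped out (e.g. in tests).
type FetchLike = (url: string) => Promise<{
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry on 429 and 5xx with exponential backoff; fail fast on other errors.
async function fetchJsonWithBackoff(
  url: string,
  doFetch: FetchLike,
  maxRetries = 5, // assumed limit
  baseDelayMs = 500, // assumed base delay
): Promise<unknown> {
  for (let attempt = 0; ; attempt++) {
    const res = await doFetch(url);
    if (res.ok) {
      return res.json();
    }
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt >= maxRetries) {
      throw new Error(`GET ${url} failed with status ${res.status}`);
    }
    await sleep(baseDelayMs * 2 ** attempt);
  }
}

// Run worker over items with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two runners share an index
      results[i] = await worker(items[i]);
    }
  }
  const runners = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runners }, () => run()));
  return results;
}
```

With a list of item ids in hand, hydration could then look like `mapWithConcurrency(ids, 4, (id) => fetchJsonWithBackoff(`https://www.cambridge.org/engage/coe/public-api/v1/items/${id}`, fetch))`, where the concurrency of 4 is an arbitrary example value.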