Data and Model Sharing (Quest)

Published on Sep 17, 2019
Questions for data collaboratives

1. Provenance

1.1 Sources
1.11 What sources are used; drawn from what set?
1.12 Are the sources versioned?
1.13 How often is each source pulled / pushed / otherwise updated?
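
One way to make the answers to 1.11 through 1.13 concrete is a small manifest entry per source. The sketch below is illustrative only: the class name `SourceRecord`, its field names, and the example values are hypothetical and not drawn from any particular collaborative.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical manifest entry describing one upstream source."""
    name: str                  # which source is used (1.11)
    version: str               # version label or snapshot id (1.12)
    update_interval_days: int  # expected pull/push cadence (1.13)
    last_pulled: date          # when this source was last refreshed

    def is_due(self, today: date) -> bool:
        """True once the stated cadence has elapsed since the last pull."""
        return today - self.last_pulled >= timedelta(days=self.update_interval_days)


# Example entry; all values are made up for illustration.
example = SourceRecord(
    name="regional-survey",
    version="2019-09",
    update_interval_days=30,
    last_pulled=date(2019, 9, 1),
)
```

Calling `example.is_due(some_date)` then gives a simple check for whether a source has fallen behind its own stated update schedule.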

1.2 Formats
1.21 What formats + schemas + shapes are used?
1.22 Are these defined in an overall spec?
1.23 Are the shapes + their specs versioned?

1.3 Credit
1.31 Who was involved in producing and sharing data?
1.32 Is there CRediT-style attribution for different roles?
1.33 What social / institutional / environmental dependencies are there?

2. Process provenance

2.1 Toolchains
2.11 What toolchains and pipelines are involved?
2.12 What upstreams contribute to this work?
2.13 How are changes to these workflows recorded?
2.14 When do changes trigger a recompilation?

2.2 Dependencies
2.21 Is there an explicit process- or workflow-dependency tree?
2.22 When does a stale dependency trigger a recompilation?
2.23 Are there any push options for updates, or flagging of critical updates?
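
The questions in 2.2 assume that dependencies are recorded explicitly enough to compute what is stale. A minimal sketch, assuming the tree is stored as a mapping from each step to the steps it depends on: the function name `stale_steps` and the step names in the example are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def stale_steps(depends_on: Dict[str, List[str]], changed: Iterable[str]) -> Set[str]:
    """Return the changed steps plus every step downstream of them."""
    # Invert the graph: for each step, which steps consume its output?
    dependents: Dict[str, List[str]] = {}
    for step, upstreams in depends_on.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, []).append(step)

    # Breadth-first walk from the changed steps; the `stale` set also
    # prevents revisiting a step if the graph happens to contain a cycle.
    stale: Set[str] = set(changed)
    queue = deque(stale)
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in stale:
                stale.add(downstream)
                queue.append(downstream)
    return stale


# Hypothetical workflow: report <- merge <- (clean <- raw, lookup)
example_tree = {
    "clean": ["raw"],
    "merge": ["clean", "lookup"],
    "report": ["merge"],
}
```

With this tree, a change to `lookup` marks `merge` and `report` for recompilation but leaves `clean` alone, which is the kind of answer 2.22 asks for.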

3. Reuse

3.1 Dumps
3.11 How are dumps provided: name, format, versioning?
3.12 Is there a feed of updates to dumps?
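
For 3.11, a dump can carry its name, format, and version in a small descriptor alongside a checksum, so that reusers can tell which dump they hold. The sketch below is one possible convention, not a standard: the function name `dump_descriptor`, the `dataset-version.format` filename pattern, and the field names are all assumptions.

```python
import hashlib
from typing import Dict, Union


def dump_descriptor(dataset: str, version: str, fmt: str, payload: bytes) -> Dict[str, Union[str, int]]:
    """Describe one dump: a predictable filename plus an integrity checksum."""
    return {
        "filename": f"{dataset}-{version}.{fmt}",  # assumed naming pattern
        "version": version,
        "format": fmt,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }
```

A list of such descriptors, ordered by version, could also serve as the simple update feed that 3.12 asks about.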

3.2 Logging use
3.21 What downstreams are using this work?
3.22 Is a log of this use kept, and at what level of detail?
3.23 Is this usage visible to other reusers, via pingbacks or other means?

3.3 Metastudies
3.31 Is this used in any metastudies?
3.32 What processing (schema mapping, fuzzing or anonymization, other) is applied for each metastudy that includes it?
3.33 Is the mapping for use in any metastudy encoded in a named package or configuration file that others could use?
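
To illustrate what a reusable, named mapping (3.33) might look like, here is a minimal sketch. The mapping name `MAPPING_V1`, the field names, and the function `apply_mapping` are hypothetical; a real metastudy would define its own.

```python
from typing import Any, Dict

# Hypothetical named mapping: source field -> metastudy field.
MAPPING_V1: Dict[str, str] = {
    "resp_id": "participant_id",
    "yr": "year",
}


def apply_mapping(record: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rename mapped fields and drop every field not listed in the mapping."""
    return {
        target: record[source]
        for source, target in mapping.items()
        if source in record
    }
```

Because unlisted fields are dropped, the mapping itself documents which fields leave the source dataset. Keeping it in a named, versioned file lets another metastudy apply the same transformation.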

4. Data selection

4.1 Selection filter
4.11 How was data chosen for measurement/inclusion?
4.12 How is it noted when this changes?

4.2 Data cleaning
4.21 What data cleaning or noise correction was used in compiling the data?
4.22 What other workflows were applied to the raw data?
4.23 How were these workflows registered before the raw data was gathered?
4.24 How are these workflows and pipelines named and versioned?

4.3 What similar efforts or alternatives exist?

5. Replication

5.1 Replicability
5.11 What is the whole tale of your work -- what environment and setup are needed to replicate it?
5.12 Is this articulated in a [whole tale] file?
5.13 Does this file include workflow + usage notes?
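
As a rough illustration of the environment portion of 5.11, the sketch below records the interpreter version, platform, and installed versions of named packages using only the Python standard library. It is not the Whole Tale format; the function name `describe_environment` and the output keys are assumptions made for this example.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable, Optional


def describe_environment(packages: Iterable[str]) -> Dict[str, object]:
    """Record interpreter, platform, and installed versions of the named packages."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }
```

A record like this covers only the software environment; the workflow and usage notes asked about in 5.13 would still need to be written alongside it.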

5.2 Replicatedness
5.21 Has your process been replicated in practice?
5.22 By how many independent parties has it been replicated?


