Checklists for data product staging

Naming the level of development of a data product is a matter of judgement and experience, but the following list of questions can help you develop that judgement and be consistent in its application. Feel free to use it as a starting point for your own checklist.

Data projects come in many shapes and sizes. If there are questions you use to judge a project’s maturity, or if a major aspect of data projects is missing below, please email me at joseph@simplicial.io and I’ll add it.

Developmental status of a data set

Here, we’ll consider a data set to be a static set of facts or observations. You might imagine a CSV file, a set of images, or a snapshot of a database table.

  • Is this data set more raw/unprocessed or more clean? Are there known holes in the data? Meaningless variables/columns? Is the format of each variable (especially dates and times) consistent? Are there any strange encodings (e.g. -9999 means "n/a", 0 means "not reported")? (A sketch of checks like these appears after this list.)

  • Are the "items" or "rows" in the data set all comparable units or observations? Do some rows mean different things than others? Are there "summary" rows embedded in the data set such as totals or averages? Is this a report or a data set?

  • Is the data generation more automatic or more manual? Is the process for generating this data documented? Is it repeatable? Who understands it well and what would happen if they weren’t available? If this data set is derived from others, what level of development are they at?

  • Is this data semantically sound and understood or does it lack context? Does each variable have a clear intended meaning and are those meanings documented anywhere? Do categorical variables have consistent and well-understood levels? Is the region of applicability of this data understood? Are the bounds, edge cases, and limits of the observations understood?

  • Are basic summary statistics available for this data set? Do you have a sense for what "normal" values are for each variable and where the outliers live? Do you have a sense for the marginal distributions? What about the higher-order (pairwise, etc.) correlations and distributions?

  • Is the data set highly accessible or is it a challenge to get to it? Do you have the data locally? If it is remote, do the access policies support your needs? Is the data available digitally or only on paper or some other medium?

  • Is the generating or sampling process well understood? Do you know what sampling biases might exist? Do you have a sense for the scale of that impact? Is the sampling mechanism ignorable? Is it observed? Is that process even known? Is it documented? Do you have summary statistics showing the exposure or coverage of the data?

  • Is the data contextualized against other known data sets? Can you establish links between items in this data set and items in other known data sets? Is it easy to do? Are there mappings between identifiers? Are there consistent, global identifiers? How are these linkages checked and maintained? Is there a process for rectifying data sets if linkages are broken?

  • Is this data contemporaneous or up-to-date? How long ago was it updated? Can it be updated automatically? If so, are the summary statistics and other outputs from this data set also maintained and updated? Are snapshots of previous versions maintained?

  • Is this data traceable to its sources? Can you walk backward from each entity in the data set and determine how it was generated? Can you attribute variability or error in the data set to upstream sources?

  • Finally, and perhaps most importantly, is the organizational context of this data clear? Do the semantics of the variables make sense to decision makers? Does this data get applied in organizational orientation or decision making? Is it even available? Is the data properly visualized, and is that visualization linked to compelling argumentation that supports and contextualizes the meaning the data captures?
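
Several of these questions have cheap, concrete probes. Below is a minimal sketch using pandas on a hypothetical file, survey.csv; the file name, the reported_at column, and the sentinel values are illustrative assumptions, so treat it as a starting point to adapt rather than a prescription.

```python
import pandas as pd

# Hypothetical raw export; the file name and column names are illustrative only.
df = pd.read_csv("survey.csv")

# Strange encodings: sentinel values like -9999 sometimes stand in for "n/a"
# or "not reported". Which values are suspicious depends entirely on your data.
sentinels = [-9999, 9999]
for col in df.select_dtypes("number").columns:
    n = df[col].isin(sentinels).sum()
    if n:
        print(f"{col}: {n} rows with suspected sentinel values")

# Format consistency: dates that fail to parse hint at mixed or inconsistent formats.
parsed = pd.to_datetime(df["reported_at"], errors="coerce")
print(f"unparseable dates: {parsed.isna().sum()}")

# Known holes and basic summary statistics: missingness, marginal summaries,
# and pairwise correlations give a first sense of "normal" values and outliers.
print(df.isna().mean().sort_values(ascending=False))
print(df.describe())
print(df.select_dtypes("number").corr())
```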

Developmental status of a model, inference, or analysis

Here, we’ll consider "inferential products" including models, tests, analyses, etc. There are a lot of varieties to these kinds of data products, but they are similar in that they tend to be the result of consuming a data set and summarizing and/or extrapolating from it.

  • Is the performance metric of this model practically relevant? Is the model just optimizing for some convenient metric or is it clear how this metric relates to organizational value? If it is a proxy metric, is it the best proxy metric? Does the metric treat outliers or data transformations as it should? Is there a more robust version of it available and would that be beneficial?

  • Does this model perform and generalize well to real data? Is it being appropriately evaluated, or are there ways for it to "cheat" by exploiting training information? Does the evaluation data set put the model through a strenuous test dealing with edge cases and "exotic" situations or events? (A sketch of a held-out evaluation appears after this list.)

  • Are the outputs of the model interpretable? Are they properly transformed and contextualized to make sense in real, domain-specific terms? Can additional outputs be generated to better contextualize the results? Can the model benefit from additionally outputting comparable examples, adversarial examples, or simulations?

  • Is the reasoning of the model available? Can you use its output to learn about the world or to tell stories about the results? Are those stories believable? Is it understood where they break down and how?

  • Does the model validate physically or practically? If the model predicts a physical process, can you extract known relations from its outputs and validate them? If it predicts something guarded by common sense, do common-sense relationships hold in its outputs?

  • Is there a process for supporting the model? Is its performance evaluated regularly with fresh data over relevant scales? Does the model get retrained or even rebuilt when its performance degrades? Do you know or have a process to learn how quickly the model will degrade?

  • Is the output of the model visualized? Do those visualizations contextualize the results using the language of the domain? Do they contextualize the stability, accuracy, or error of the results? Do they illustrate the mechanism or explain the result?

  • Does the model produce predictions sufficiently quickly? Can its speed or memory performance be better optimized? Would that make it more practically applicable?

  • Is the model consistent or is it biased? If it is biased, is that bias conservative with respect to the decisions being made on the basis of the model outputs? Is the bias acceptably small?

  • Does the model represent its own uncertainty? Do predictions come with error bars, posterior predictive distributions, or a sense for other, comparable results? Can you compare predictions the model is confident in to predictions the model cannot speak to? Is there a sense for the region of applicability of the model?

  • Finally, and perhaps most importantly, is the model connected to organizational outcomes? Do its outputs fit naturally into the workflows and worldviews of decision makers? Are they using it regularly? Does their use hint at a different, better operational design? Do they trust the outputs, or do they just read and then ignore them?
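
As with the data set checklist, a few of these questions have concrete, if crude, probes. Here is a minimal sketch using scikit-learn on made-up data (the model choice, features, and quantile levels are illustrative assumptions, not a recommended setup): a held-out evaluation that cannot exploit training information, a simple bias check on the residuals, and a rough sense of uncertainty via quantile models.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Made-up data standing in for your features X and target y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=1000)

# Evaluate on data the model never saw, so it cannot "cheat" with training information.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor().fit(X_train, y_train)
residuals = y_test - model.predict(X_test)

# Bias: the mean residual should sit near zero; its sign says which way the model leans.
print(f"mean residual (bias): {residuals.mean():.3f}")

# Uncertainty: quantile models give rough prediction intervals instead of bare point estimates.
lo = GradientBoostingRegressor(loss="quantile", alpha=0.1).fit(X_train, y_train)
hi = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X_train, y_train)
inside = (lo.predict(X_test) <= y_test) & (y_test <= hi.predict(X_test))
print(f"empirical coverage of the 10%-90% interval: {inside.mean():.2f}")
```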