Data Discovery

Data Source Inventory

Following an initial screening inventory, a subset of the sources is selected for a full inventory.

Example from a recent project:

Full Inventory Process

  • Description/Features
    • What is the temporal nature of the data: longitudinal, time-series, or one time point?
    • Are the data geospatial? If yes, at what level (e.g., census tracts, coordinates)?
  • Metadata
    • Is there information available to assess the transparency and soundness of the methods used to gather the data for our purposes (i.e., supplementing the census)?
    • Is there a description of each variable in the source, along with its valid values?
    • Are there unique IDs for unique elements that can be used for linking data?
    • Is there a data dictionary or codebook?
  • Selectivity
    • What unit is represented at the record level of the data source (e.g., person, household, family, housing unit, property)?
    • Does this universe match the stated intentions for the data collection? If not, what has been included or excluded and why (e.g., do the data exclude certain individuals due to the way the data are collected)?
    • What is the sampling technique used (if applicable, e.g., convenience, snowball, random)?
    • What is the coverage (e.g., response rate)?
  • Stability/Coherence
    • Were there any changes to the universe of data being captured (including geographical areas covered), and if so, what were they (e.g., changed the geographical boundaries of census tracts)?
    • Were there any changes in the data capture method, and if so, what were they (e.g., revised questions, data collection mode, classification categories, algorithms for social media data)?
    • Were there any changes in the sources of data, and if so, what were they (e.g., data were reported by teachers in 2010 and reported by principals in 2011; used Current Population Survey in 2011 and American Community Survey in 2012)?
  • Accuracy
    • Are there any known sources of error (e.g., missing records, missing values, duplications, erroneous inclusions)?
    • Describe any quality control checks performed by the data’s owner (e.g., deleted duplicates, checked for recording errors).
  • Accessibility
    • Are any records or fields collected but not included in the data source, such as for confidentiality reasons (e.g., does not include any student files in which there are fewer than 5 students in a category)?
    • Is there a subset of variables and/or data that must be obtained through a separate process (e.g., state-level data are openly available, but one must apply to get census tract data)?
    • If yes, are there separate legal, regulatory, or administrative restrictions on accessing the data source?
    • What is the cost? Is it a one-time, annual, or project-based payment?
  • Privacy and Security
    • Was consent given by participants? If so, how was consent given (e.g., online form, in-person discussion)?
    • Are there legal limitations or restrictions on the use of the data (e.g., Family Educational Rights and Privacy Act, FERPA)?
    • What confidentiality policies are in place (e.g., cannot share data outside of requesting institution; does not include personally identifiable information)?
  • Research
    • What research has been done with this dataset (e.g., impact of policies, predictors of student success, housing stock inventory assessment)?
    • Include any links to research if provided.
    • List any other data use notes provided by the supplier.
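
Some of the Metadata and Accuracy questions above (unique IDs, missing values, duplications) can be partly checked by profiling a sample extract before asking the data owner. The following is a minimal sketch, not a prescribed method: the function name profile_records, the field names (unit_id, tract, value), and the sample records are all hypothetical, and it treats None and empty strings as missing, which may not fit every source.

```python
from collections import Counter

def profile_records(records, id_field):
    """Summarize missing values, repeated IDs, and exact duplicate rows.

    `records` is a list of dicts (one dict per record); `id_field` is the
    name of the column expected to hold a unique identifier.
    """
    # Count missing values per field (assumption: None or "" means missing).
    missing = Counter()
    for row in records:
        for field, value in row.items():
            if value is None or value == "":
                missing[field] += 1

    # IDs that appear on more than one record cannot be used as-is for linking.
    id_counts = Counter(row.get(id_field) for row in records)
    duplicate_ids = sorted(
        key for key, n in id_counts.items() if key is not None and n > 1
    )

    # Exact duplicate rows: every field/value pair matches another record.
    row_counts = Counter(tuple(sorted(row.items())) for row in records)
    duplicate_rows = sum(n - 1 for n in row_counts.values() if n > 1)

    return {
        "n_records": len(records),
        "missing_by_field": dict(missing),
        "duplicate_ids": duplicate_ids,
        "duplicate_rows": duplicate_rows,
    }

# Hypothetical sample extract, for illustration only.
sample = [
    {"unit_id": "A1", "tract": "51013", "value": 10},
    {"unit_id": "A2", "tract": "", "value": 12},
    {"unit_id": "A2", "tract": "51013", "value": None},
    {"unit_id": "A1", "tract": "51013", "value": 10},
]

report = profile_records(sample, "unit_id")
print(report)
```

A report like this only flags candidates for follow-up. Whether a repeated ID is an error or a legitimate repeated observation (for example, in longitudinal data) still has to be answered from the source's documentation or by its owner.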