Data Discovery

Data Source Screening

The first step before conducting a full data inventory is to screen the data sources, identifying which sources are worthy of a deeper look and which are worthy of consideration for profiling. The screening includes five questions and a qualitative evaluation of purpose, data collection method, selectivity, accessibility, and description.

Example from a recent project:

Screening Inventory Process

  1. Are the data collected opinion-based (e.g., people’s attitudes, preferences)?
  2. Is the data collection recurring (i.e., are the data collected at least annually)?
  3. Are there data available for 2013?
  4. Geographic granularity
    • For Education
      1. Are the data collected at least at the school level?
      2. Can the data be linked to other education/workforce datasets (e.g., K-12, higher education, workforce)?
      3. If this is a state dataset, how do they define school districts within this state?
      4. If applicable, what types of schools does it cover (e.g., public, private, charter)?
    • For Housing
      1. Are the data collected at the property or housing unit level?
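
The screening questions above can be recorded per data source and checked against a project's requirements. The following is a minimal Python sketch, not part of the original process: the class, field, and function names are hypothetical, and the required answers are assumptions, since the checklist does not state which answers qualify a source.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScreeningAnswers:
    """Answers to the screening questions for one candidate data source."""
    source_name: str
    opinion_based: bool                # question 1
    recurring_annually: bool           # question 2
    available_for_2013: bool           # question 3
    # Question 4 (geographic granularity); None means "not applicable".
    school_level: Optional[bool] = None            # education sources
    property_or_unit_level: Optional[bool] = None  # housing sources


def unmet_criteria(answers: ScreeningAnswers, required: Dict[str, bool]) -> List[str]:
    """List the questions whose recorded answer differs from the required one."""
    return [
        name for name, expected in required.items()
        if getattr(answers, name) != expected
    ]


# Assumed requirements for an education source; adjust per project.
REQUIRED_EDUCATION = {
    "opinion_based": False,
    "recurring_annually": True,
    "available_for_2013": True,
    "school_level": True,
}

example = ScreeningAnswers(
    source_name="Example state enrollment file",  # invented for illustration
    opinion_based=False,
    recurring_annually=True,
    available_for_2013=False,
    school_level=True,
)
print(unmet_criteria(example, REQUIRED_EDUCATION))  # ['available_for_2013']
```

Keeping the required answers in a separate dictionary lets the same record be screened against different project requirements, for example a housing project that checks property_or_unit_level instead of school_level.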

Additional Screening Information

  • Purpose
    • What is the purpose of the organization collecting the data (e.g., the Virginia Department of Education (VDOE) coordinates education for the state and makes policy recommendations)?
    • Why are the data collected and how does the organization use the data (e.g., VDOE collects the data for administrative purposes to assess student and school progress and to inform school policies)?
    • Who else uses these data (e.g., businesses, policy-makers, citizens, researchers)?
    • Who do they sell the data to (e.g., Zillow for individual homeowners, CoreLogic for multiple uses, businesses for economic development, chief economists at trade associations)?

  • Method
    • What is the data collection method (e.g., paper questionnaire, operator entry, online survey, interview, sensors, algorithms for creating datasets from Twitter feeds)?
    • What is the type of data collected (e.g., designed collection, intentional observation, administrative data, digital data)?
    • If designed, who created the questions (e.g., government, researchers, private business)?
    • What are the raw sources of the collected data prior to any aggregation (e.g., self-report, third party)?
  • Description
    • What is the general topic of the data (e.g., student learning, housing quality)?
    • What are the earliest and latest dates for which data are available (e.g., 1995-2005)?
  • Timeliness
    • Are the data collected and available periodically (e.g., every year or decade)?
    • How soon after a reference period ends can a data source be prepared and provided (e.g., one year)?
  • Selectivity
    • What is the universe (e.g., population) that the data represent (e.g., students who attended public school in Virginia in 1995)?
  • Accessibility
    • How are the data accessed (e.g., API, file download as CSV or TXT)?
    • Are they open data?
    • Are there any legal, regulatory, or administrative restrictions on accessing the data source?
    • What is the cost? Is it a one-time, annual, or project-based payment?
    • Describe any gaps/concerns you see with this dataset
  • Does this dataset appear to meet the needs of your study? Yes/No
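
The qualitative items above could be kept as one record per data source so that unanswered items are easy to spot. This is an illustrative Python sketch only: the class, its one-free-text-field-per-category layout, and the helper function are assumptions rather than part of the original process, and the sample values reuse the VDOE examples from the list.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ScreeningProfile:
    """Free-text notes for each category; None means not yet answered."""
    purpose: Optional[str] = None
    method: Optional[str] = None
    description: Optional[str] = None
    timeliness: Optional[str] = None
    selectivity: Optional[str] = None
    accessibility: Optional[str] = None
    gaps_or_concerns: Optional[str] = None
    meets_study_needs: Optional[bool] = None  # the final Yes/No item


def unanswered(profile: ScreeningProfile) -> List[str]:
    """Return the categories that still have no recorded answer."""
    return [f.name for f in fields(profile) if getattr(profile, f.name) is None]


profile = ScreeningProfile(
    purpose="VDOE coordinates education for the state and makes policy recommendations",
    selectivity="Students who attended public school in Virginia in 1995",
)
print(unanswered(profile))
```

A source would be ready for the final Yes/No decision once unanswered returns only meets_study_needs, or an empty list after the decision is recorded.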