Are datasets, observation units, and their attributes consistently named, sufficiently described, and appropriately formatted for combination with other datasets?

Metadata is generally defined as "data that provides information about other data". []. The main purpose of metadata is to facilitate the discovery of relevant information pertaining to a particular object or resource. It does this by "allowing resources to be found by relevant criteria, identifying resources, bringing similar resources together, distinguishing dissimilar resources, and giving location information." [National Information Standards Organization; Rebecca Guenther; Jaqueline Radebaugh (2004). Understanding Metadata. Bethesda, MD: NISO Press. ISBN 1-880124-62-9. Retrieved 2 April 2014.] A lack of metadata for a dataset can therefore present significant impediments to the use of that dataset. More specifically, when dealing with data that is to be used for research purposes, it is of vital importance to determine whether the datasets (tables), their observation units (records/rows), and their attributes (fields/columns) are consistently named, sufficiently described, and appropriately formatted for analysis and for combination with other project datasets. Additionally, does information exist regarding any transformations that have been applied to the original data sources in the creation of the dataset, as well as who performed those transformations?

Observation Unit Definition

When a dataset is provided without a definition of its purpose, we have an issue with the observation unit definition. Why was this data collected? When dealing with datasets not originally collected for research purposes (e.g., administrative data), there is often no easy answer to this question. To correct issues of observation unit definition, it is often necessary to first generate separate new datasets from the dataset provided, each representing only a single observation unit type. At that point a new observation unit definition can be created.

This type of metadata issue is quite common, and we encountered it when dealing with certain third-party housing datasets. A single dataset would include, within each record (row), multiple potential observation units (e.g., housing unit data, listing service data, owner data, neighborhood data). Only after defining the observation units needed and extracting the necessary fields could a new observation unit definition be generated (e.g., Housing Unit Specifications for Arlington County VA from CY 20XX to CY 20XX).
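The extraction step described above can be sketched in pandas. The dataset, column names, and key are hypothetical, chosen only to illustrate splitting one mixed-unit table into one dataset per observation unit type:

```python
import pandas as pd

# Hypothetical combined housing dataset: each row mixes attributes from
# several potential observation units (housing unit, listing, owner).
combined = pd.DataFrame({
    "parcel_id":  ["A1", "A2"],
    "sqft":       [1200, 950],         # housing-unit attribute
    "list_price": [450000, 380000],    # listing-service attribute
    "owner_name": ["Smith", "Jones"],  # owner attribute
})

# Extract one new dataset per observation unit type, keeping the shared
# key (parcel_id) so the units can be re-linked later if needed.
housing_units = combined[["parcel_id", "sqft"]]
listings      = combined[["parcel_id", "list_price"]]
owners        = combined[["parcel_id", "owner_name"]]
```

Each resulting table now represents a single observation unit type, for which a definition can be written.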

Observation Unit Attributes Definition

Attributes (columns) without definitions and/or with non-meaningful or confusing names

Note: it is sometimes necessary to separate multiple variables represented in a single attribute (column) before creating definitions.
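Separating multiple variables packed into one attribute might look like the following sketch; the column name and delimiter are assumptions for illustration:

```python
import pandas as pd

# Hypothetical attribute packing two variables (city and state) into one column.
df = pd.DataFrame({"location": ["Arlington-VA", "Bethesda-MD"]})

# Split the single attribute into its component variables before
# writing definitions for each.
df[["city", "state"]] = df["location"].str.split("-", expand=True)
```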

Semantic Confusion

The concept of semantic interoperability here refers to the ability of data systems to exchange data with other data systems unambiguously. Semantic interoperability is concerned not just with data syntax, but also with the transmission of the meaning along with the data (semantics). This is generally accomplished by adding metadata to a dataset, thereby defining a controlled, shared vocabulary. Without this shared vocabulary, \textbf{Semantic Confusion} can occur, where names and syntax may agree but definitions do not. For example, while combining two datasets, it may be found that two fields (attributes) have the same name (say “Grade”) but completely different definitions (because “Grade” can refer to both a ‘score’ for a test/class and the ‘level/year’ of schooling).
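The “Grade” example can be sketched as follows. The datasets and the disambiguated names (`test_score`, `school_year`) are hypothetical; the point is that renaming against a shared vocabulary must happen before the datasets are combined:

```python
import pandas as pd

# Two hypothetical datasets whose "Grade" columns share a name
# but carry entirely different meanings.
tests   = pd.DataFrame({"student_id": [1, 2], "Grade": [88, 92]})  # test score
rosters = pd.DataFrame({"student_id": [1, 2], "Grade": [10, 11]})  # year of schooling

# Disambiguate the attribute names per a shared vocabulary, then combine.
tests   = tests.rename(columns={"Grade": "test_score"})
rosters = rosters.rename(columns={"Grade": "school_year"})
merged  = tests.merge(rosters, on="student_id")
```

Without the rename step, a naive merge would either collide on “Grade” or silently suffix the columns, leaving the ambiguity unresolved.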

Multiple Attribute Names

Attributes with different names but the same definition

e.g. attribute names “Grade” and “Year” both referring to the ‘level/year’ of schooling
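Harmonizing such attributes can be sketched by mapping both names onto one canonical attribute before stacking the records; the datasets and the canonical name `school_year` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical datasets where "Grade" and "Year" share one definition.
district_a = pd.DataFrame({"student_id": [1], "Grade": [9]})
district_b = pd.DataFrame({"student_id": [2], "Year": [10]})

# Map both names onto a single canonical attribute, then combine.
canonical = {"Grade": "school_year", "Year": "school_year"}
combined = pd.concat(
    [district_a.rename(columns=canonical),
     district_b.rename(columns=canonical)],
    ignore_index=True,
)
```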

Inconsistent Attribute Formats

Attributes of the same type that are formatted differently

e.g. most commonly an issue when dealing with dates and times
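A sketch of normalizing inconsistent date formats before combination, using hypothetical values in two common formats:

```python
import pandas as pd

# Hypothetical date attributes recorded in two different formats.
a = pd.Series(["01/15/2019", "02/03/2019"])  # MM/DD/YYYY
b = pd.Series(["2019-01-15", "2019-02-03"])  # ISO 8601

# Parse each with its own format so both share one datetime representation.
a_norm = pd.to_datetime(a, format="%m/%d/%Y")
b_norm = pd.to_datetime(b, format="%Y-%m-%d")
```

After normalization the two attributes are directly comparable and can be combined safely.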

Metadata Management

Provided here are the specifications of the Lexicon, a metadata repository.


Data Provenance

The concept of provenance is very broad and has different meanings within different fields of inquiry. For the purposes of data profiling, we find it useful to apply the definition provided by the World Wide Web Consortium (W3C):

“Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.” []

Housing Example An example of why it is important to consider data provenance was seen in the housing case study. Some of the datasets used were provided by third-party vendors. As part of the added value of these data products, such vendors often perform a set of transformations on the original data to enhance data consistency and quality. Sometimes the transformation processes used are readily available to the client, who can then validate their application by repeating the transformations and producing identical results. Other times this information is not made available, necessitating further investigation and experimentation on the part of the client to ensure that the data provided is, in fact, a true representation of the original source data.
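The validation strategy described above (repeat the documented transformation, compare to the delivered product) can be sketched as follows. The source data, delivered data, and the documented transformation (mean imputation of missing prices) are all hypothetical:

```python
import pandas as pd

# Hypothetical original source data and the vendor-delivered product.
source    = pd.DataFrame({"price": [450000.0, None, 380000.0]})
delivered = pd.DataFrame({"price": [450000.0, 415000.0, 380000.0]})

# Assumed documented transformation: impute missing prices with the mean.
replicated = source.copy()
replicated["price"] = replicated["price"].fillna(replicated["price"].mean())

# Identical results support the claim that the delivered data is a
# faithful transformation of the original source.
matches = replicated["price"].equals(delivered["price"])
```

When `matches` is false, further investigation of the vendor's undocumented processing is warranted.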

Property Crime Example Another example from our studies involved a commercial data provider that supplied indicators of neighborhood quality based on patented algorithms. We were unable to reconcile differences between their crime indexes and data from the Arlington County, Virginia Police Incident Tracking system. Figure 3.5 presents the misalignment in these data sources by census tract. The figure shows property crime counts as calculated by the commercial provider and as pulled directly from the Arlington County Police Incident Tracking system. The county data had five census tracts with counts greater than 300; these were not shown in the boxplot to allow a better-scaled comparison with the commercial provider. The commercial provider did not describe, nor make available, their methods for adjusting the counts.