Data Cleaning & Transformation

Data cleaning refers to the process of either fixing or removing data in a data source that is incorrect, incomplete, improperly formatted, or duplicated. Data transformation refers to the mapping of dataset field values from a provided format into an expected or more useful format. In practice, these two activities often occur together and include assessing the following:
  • Missing values
  • Date and time formats
  • De-duplication
  • Outliers
  • Normalization

Aggregation

Normalization

When normalizing a dataset, one maps the original data range into another data range. This is a form of scaling. The general steps for normalizing a dataset field are: identify the minimum and maximum values (min and max) in the original dataset field, choose the minimum and maximum values (min’ and max’) of the new normalized scale, then calculate the normalized value of any value in the original field using an equation of the form newvalue = (max’ - min’) / (max - min) * (value - max) + max’. With this equation, min maps to min’ and max maps to max’, and every value in between is scaled linearly.
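
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name rescale, its parameter names, and the sample scores are assumptions made for the example, and the default target range of 0 to 1 is just one common choice.

```python
def rescale(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max].

    Hypothetical helper for illustration; uses the equation from the text:
    newvalue = (max' - min') / (max - min) * (value - max) + max'
    """
    old_min = min(values)
    old_max = max(values)
    if old_max == old_min:
        # A constant field has no range to scale, so the equation is undefined.
        raise ValueError("cannot rescale a field whose values are all equal")
    span_ratio_num = new_max - new_min
    span_ratio_den = old_max - old_min
    return [(v - old_max) * span_ratio_num / span_ratio_den + new_max
            for v in values]


# Made-up sample field: the smallest score maps to 0 and the largest to 1.
scores = [10, 20, 25, 40]
print(rescale(scores))
# A different target range, here -1 to 1.
print(rescale(scores, new_min=-1.0, new_max=1.0))
```

The guard against a constant field is a design choice for this sketch; other conventions, such as mapping every value to the midpoint of the new range, are equally possible.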

Winsorization

Many statistics can be heavily influenced by outliers. One strategy is to set all outliers to a specified percentile of the data. This is called Winsorization or Winsorizing. In Winsorization, extreme values in the data are limited to reduce the effect of possibly spurious outliers. For example, a 90% Winsorization would see all data below the 5th percentile set to the 5th percentile, and all data above the 95th percentile set to the 95th percentile. Winsorized estimators are usually more robust to outliers than their standard forms, although there are alternatives, such as trimming, that achieve a similar effect. Note that Winsorizing is not equivalent to trimming or truncation. In a trimmed estimator, the extreme values are discarded; in a Winsorized estimator, the extreme values are instead replaced by certain percentiles (the trimmed minimum and maximum). Therefore, statistics derived from the two sets would not be equivalent (e.g. a Winsorized mean is not equal to a truncated mean).
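
A 90% Winsorization can be sketched with the Python standard library as follows. The function name winsorize, its parameters, and the sample data are assumptions made for this example. Percentiles can be defined in several ways; this sketch uses the "inclusive" method of statistics.quantiles (available in Python 3.8 and later), and other definitions may give slightly different cut-offs.

```python
from statistics import mean, quantiles


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values to the given lower and upper percentiles.

    Hypothetical helper for illustration; lower_pct and upper_pct are
    whole-number percentiles between 1 and 99.
    """
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(values, n=100, method="inclusive")
    low = cuts[lower_pct - 1]
    high = cuts[upper_pct - 1]
    # Extreme values are replaced, not discarded, so the length is unchanged.
    return [min(max(v, low), high) for v in values]


# Made-up data: the integers 1..20 plus one extreme value.
data = list(range(1, 21)) + [1000]
clamped = winsorize(data)

print(len(data) == len(clamped))   # no observations are dropped
print(mean(data), mean(clamped))   # the mean is far less affected by 1000
```

Because the extreme values are replaced rather than removed, the result has the same number of observations as the input, which is the key difference from trimming.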

Smoothing

Smoothing a dataset refers to the creation of an approximating function that attempts to capture important patterns in the data while leaving out noise and other unimportant patterns. There are many forms of smoothing.
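
As one illustration, a centered moving average replaces each value with the mean of its neighbours. The text does not single out a particular method, so this choice, along with the function name moving_average, the window parameter, and the sample readings, is an assumption made for the example.

```python
def moving_average(values, window=3):
    """Smooth a sequence with a centered moving average.

    Hypothetical helper for illustration. Near the ends of the sequence
    the window is shortened to the values that are available.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighbourhood = values[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed


# Made-up readings with a single spike in the middle.
readings = [1, 2, 9, 2, 1]
print(moving_average(readings, window=3))
```

A wider window removes more noise but also flattens genuine short-lived patterns, so the window size is a trade-off to tune for the data at hand.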