17 Data Transformation of a Variable

Sometimes, data may not exhibit the characteristics of the assumed distribution. In that scenario, you have the option to eliminate the outliers. More about this here: http://blog.einext.com/r-1/outlier-analysis. Otherwise, you may apply a transformation so that the new data do not show any outliers.

For illustration, we will use the Olympics 2016 medal count data.

Load Data

url = "https://raw.githubusercontent.com/einext/data/master/Olympics2016.csv"

olympics = read.csv(url)

totals = olympics$Total

Draw a boxplot of totals. You will see that there are outliers.

boxplot(totals)
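To list the outlying values instead of only seeing them in the plot, boxplot.stats() can be used. The sketch below runs on a small made-up vector x (an assumption for illustration, not the real medal data); replace x with totals to apply it to the Olympics data.

```r
# Hypothetical medal totals, used only for illustration (not the real data)
x = c(1, 2, 2, 3, 4, 5, 7, 8, 10, 15, 42, 70, 121)

# boxplot.stats() returns, in $out, the points that boxplot() draws
# individually: values more than 1.5 times the hinge spread beyond the hinges
out = boxplot.stats(x)$out
out
```

The multiplier 1.5 is the default of the coef argument and can be changed if a stricter or looser rule is needed.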

One common technique to transform data is to scale it so that mean = 0 and SD = 1.

Scaling (z-scores):

totals.z = scale(totals)

Mean of z-score should be 0

round(mean(totals.z))

SD of z-score should be 1

round(sd(totals.z))
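The z-score is simply (value - mean) / SD. The following sketch, on a made-up vector x (an assumption, not the real medal data), computes it by hand and checks that it agrees with scale().

```r
# Hypothetical values, used only for illustration
x = c(1, 2, 2, 3, 4, 5, 7, 8, 10, 15, 42, 70, 121)

# z-score by hand: subtract the mean, divide by the standard deviation
z.manual = (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so convert it to a plain vector
z.scale = as.vector(scale(x))

# TRUE when both approaches give the same numbers
all.equal(z.manual, z.scale)
```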

Boxplot to find outliers

boxplot(totals.z, main = "Boxplot of Z-score of Total Medal Count")
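Keep in mind that z-scoring only shifts and rescales the data, so the shape of the distribution is unchanged and the same points are flagged by the boxplot. A minimal sketch on a made-up vector x (an assumption, not the real medal data):

```r
# Hypothetical values, used only for illustration
x = c(1, 2, 2, 3, 4, 5, 7, 8, 10, 15, 42, 70, 121)
z = as.vector(scale(x))

# Count the points a boxplot would flag, before and after scaling
n.raw = length(boxplot.stats(x)$out)
n.z = length(boxplot.stats(z)$out)
c(raw = n.raw, z = n.z)
```

Both counts are the same, which is why a non-linear transformation such as the logarithm is needed to pull the outliers in.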

Logarithmic Transformation

  • Useful when dealing with data that take large values, such as population or revenue.

  • If negative values are present, square each value before taking logs. Note that this discards the sign of the original values.

  • If 0 is a possible value, add 1 or 0.1 to each value before taking the log.
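The zero case can be sketched as follows. The vector x0 is made up for illustration (an assumption, not part of the medal data).

```r
# Hypothetical counts that include a zero
x0 = c(0, 1, 3, 9, 99, 999)

# The log of 0 is -Inf, which breaks most later calculations
log10(x0)

# Adding 1 before taking the log keeps every value finite
x0.log = log10(x0 + 1)
x0.log

# For natural logs, base R provides log1p(x), which computes log(1 + x)
log1p(x0)
```

The choice of constant (1 or 0.1) changes the transformed values for small counts, so it is worth stating which one was used.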

totals.log = log10(totals)

boxplot(totals.log, main = "Boxplot of Log of Total Medal Count")

The above plot does not show any outliers, so the data are now almost ready for further analysis. Remember that many statistical methods are applicable only when the data exhibit a normal distribution, which means data without outliers, along with several other properties.
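The effect of the log transformation on outliers can also be checked numerically rather than by eye. A minimal sketch on a made-up, right-skewed vector x (an assumption, not the real medal data):

```r
# Hypothetical right-skewed values, used only for illustration
x = c(1, 2, 2, 3, 4, 5, 7, 8, 10, 15, 42, 70, 121)
x.log = log10(x)

# Number of points flagged by the boxplot rule, before and after the log
n.before = length(boxplot.stats(x)$out)
n.after = length(boxplot.stats(x.log)$out)
c(before = n.before, after = n.after)
```

The same check can be run with totals and totals.log in place of x and x.log.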

Ranking

The ranking transformation keeps the rank order of the data but not the actual values.

rank(totals)
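Medal totals usually contain ties, and rank() has a ties.method argument that controls how they are handled. A short sketch on a made-up vector x (an assumption, not the real medal data):

```r
# Hypothetical totals with one tie (the two 3s)
x = c(10, 3, 3, 121, 7)

# Default: tied values share the average of their ranks
r.avg = rank(x)

# "min": tied values all get the lowest rank of the group
r.min = rank(x, ties.method = "min")

# Rank 1 for the largest value, as in a medal table: rank the negated values
r.desc = rank(-x, ties.method = "min")

r.avg
r.min
r.desc
```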

Dichotomize

Divide the data into 2 or more buckets. Here is an example with 2 buckets:

totals.dic = ifelse(totals > 20, "High", "Low")
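For more than 2 buckets, cut() is usually more convenient than nested ifelse() calls. The sketch below uses a made-up vector x and arbitrary cut-offs of 5 and 20 (both are assumptions for illustration, not values from the medal data).

```r
# Hypothetical totals, used only for illustration
x = c(1, 2, 2, 3, 4, 5, 7, 8, 10, 15, 42, 70, 121)

# Three buckets: (0, 5], (5, 20] and (20, Inf]; intervals are closed on the right
x.bucket = cut(x, breaks = c(0, 5, 20, Inf), labels = c("Low", "Medium", "High"))

# Number of values that fall into each bucket
table(x.bucket)
```

cut() returns a factor, so the bucket order (Low, Medium, High) is preserved in tables and plots.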