# 17 Data Transformation of a Variable

Sometimes data do not exhibit the characteristics of the assumed distribution. In that scenario, one option is to eliminate the outliers; more about this at http://blog.einext.com/r-1/outlier-analysis. Alternatively, you may apply a transformation so that the transformed data no longer show outliers.

For illustration, we will use the Olympics 2016 medal count data.

Load Data

``url = "https://raw.githubusercontent.com/einext/data/master/Olympics2016.csv"``

``olympics = read.csv(url) ``
``totals = olympics$Total``

A boxplot of the totals shows that there are outliers.

``boxplot(totals)``

## Scaling (z-scores)

One common technique to transform data is to scale it so that the mean is 0 and the SD is 1.

``totals.z = scale(totals)``

Mean of z-score should be 0

``round(mean(totals.z))``

SD of z-score should be 1

``round(sd(totals.z))``
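To make explicit what `scale()` does with its default arguments, here is a minimal sketch that computes the z-scores by hand and compares them with the built-in result. The vector `totals.demo` is made-up data, used only so the example runs without downloading the CSV.

```r
# Hypothetical medal totals (an assumption, not the Olympics data)
totals.demo = c(1, 2, 3, 5, 8, 13, 21, 46, 70, 121)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z.manual = (totals.demo - mean(totals.demo)) / sd(totals.demo)

# scale() returns a one-column matrix; as.numeric() turns it into a plain vector
z.scale = as.numeric(scale(totals.demo))

# The two results agree up to floating-point error
all.equal(z.manual, z.scale)
```

Since scaling only shifts and rescales the values, the shape of the distribution stays the same, which is why the z-scores are checked with a boxplot again.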

Boxplot to find outliers

``boxplot(totals.z, main = "Boxplot of Z-score of Total Medal Count")``

## Logarithmic Transformation

• Useful when dealing with large-valued data, such as population or revenue.
• If negative values are present, square each value before taking logs.
• If 0 is a possible value, add 1 or 0.1 to each value before taking the log.

``totals.log = log10(totals)``
``boxplot(totals.log, main = "Boxplot of Log of Total Medal Count")``

The plot above does not show any outliers, so the data are now almost ready for further analysis. Remember that many statistical methods are applicable only when the data exhibit a normal distribution, which means data without outliers, among several other properties.
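The rule about zeros in the list above can be sketched as follows. The vector `counts.demo` is hypothetical and is not taken from the Olympics file; the offset of 1 is one of the two choices the list mentions.

```r
# Hypothetical counts that include a zero (an assumption for illustration)
counts.demo = c(0, 1, 9, 99, 999)

# log10(0) is -Inf, which breaks most downstream summaries
log10(counts.demo)[1]

# Adding 1 before taking the log keeps every value finite
counts.log = log10(counts.demo + 1)
counts.log

# log1p(x) computes the natural log of (1 + x) and stays accurate for small x
counts.ln = log1p(counts.demo)
```

Note that `log1p()` uses the natural logarithm, so its values differ from the `log10()` results by a constant factor.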

## Ranking

A ranking transformation keeps the rank order of the data but not the actual values.

``rank(totals)``
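Medal totals often contain ties, and `rank()` handles them through its `ties.method` argument. The sketch below uses a small made-up vector, `totals.tie`, to show the default behaviour and one alternative.

```r
# Hypothetical totals containing a tie (two entries with 8 medals)
totals.tie = c(46, 8, 121, 8, 70)

# Default ties.method = "average": tied values share the mean of their ranks
rank(totals.tie)                         # 3.0 1.5 5.0 1.5 4.0

# ties.method = "min" gives tied values the lowest rank of their group
rank(totals.tie, ties.method = "min")    # 3 1 5 1 4

# Negate the values so that the largest total gets rank 1
rank(-totals.tie, ties.method = "min")   # 3 4 1 4 2
```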

## Dichotomize

Divide the data into 2 or more buckets. Here is an example with 2 buckets:

``totals.dic = ifelse(totals > 20, "High", "Low")``
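For more than 2 buckets, one possible approach is `cut()`. This is a sketch only: the break points 5 and 20, the labels, and the vector `totals.demo` are assumptions chosen for illustration.

```r
# Hypothetical totals (not the Olympics data)
totals.demo = c(1, 5, 12, 20, 21, 46, 121)

# Break points 5 and 20 are assumed values, not taken from the original text.
# With right = TRUE (the default) each interval includes its upper bound,
# so 20 falls in "Medium", consistent with the `totals > 20` rule for "High".
totals.bucket = cut(totals.demo,
                    breaks = c(-Inf, 5, 20, Inf),
                    labels = c("Low", "Medium", "High"))

# Count how many values land in each bucket
table(totals.bucket)
```

`cut()` returns a factor, which keeps the bucket order (Low, Medium, High) for later tables and plots.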