# 17 Data Transformation of a Variable

Sometimes data do not exhibit the characteristics of the assumed distribution. In that case, one option is to eliminate the outliers; more about this at http://blog.einext.com/r-1/outlier-analysis. Alternatively, you may apply a transformation so that the transformed data no longer show outliers.

For illustration, we will use olympics2016 medal count data.

url = "https://raw.githubusercontent.com/einext/data/master/Olympics2016.csv"

olympics = read.csv(url)

totals = olympics$Total

A boxplot of totals shows that there are outliers.

boxplot(totals)

One common technique to transform data is to scale it so that mean = 0 and SD = 1.

## Scaling (z-scores)

totals.z = scale(totals)

Mean of z-score should be 0

round(mean(totals.z))

SD of z-score should be 1

round(sd(totals.z))
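To make the two checks above concrete, here is a minimal sketch that computes z-scores by hand and compares them with `scale()`. The vector `x` is made-up illustration data, not the Olympics totals.

```r
# Hypothetical medal totals, used only for illustration
x <- c(2, 3, 5, 8, 13, 40, 121)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z.manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() turns it back into a vector
z.scale <- as.numeric(scale(x))

all.equal(z.manual, z.scale)   # TRUE

# Count the points a boxplot would flag, before and after scaling
n.out.raw <- length(boxplot.stats(x)$out)
n.out.z   <- length(boxplot.stats(z.manual)$out)
```

Because z-scoring is a linear rescaling, it shifts and stretches the axis but leaves the shape of the distribution unchanged, so points flagged as outliers before scaling are still flagged afterwards.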

Boxplot to find outliers

boxplot(totals.z, main = "Boxplot of Z-score of Total Medal Count")

## Logarithmic Transformation

• Useful when dealing with large-valued data, such as population or revenue.

• If negative values are present, square each value before taking the log.

• If 0 is a possible value, add 1 or 0.1 to each value before taking the log.

totals.log = log10(totals)
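The zero-handling rule in the list above can be sketched as follows. The vector `counts` is hypothetical illustration data; the shift of 1 is one of the two offsets suggested above.

```r
# Hypothetical counts that include a zero, used only for illustration
counts <- c(0, 1, 9, 99, 999)

# Without a shift, the zero maps to -Inf, which breaks plots and summaries
raw.log <- log10(counts)

# Adding 1 before taking the log keeps every value finite
shifted.log <- log10(counts + 1)   # 0, 0.30103, 1, 2, 3

# log1p() is the natural-log counterpart of log(counts + 1)
natural.log <- log1p(counts)
```

The offset changes the transformed values slightly, most noticeably for small counts, so it is worth reporting which offset was used.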

boxplot(totals.log, main = "Boxplot of Log of Total Medal Count")

The above plot does not show any outliers, so the data are now almost ready for further analysis. Remember that many statistical methods apply only when the data are approximately normally distributed, which implies, among several other properties, the absence of extreme outliers.

## Ranking

A ranking transformation keeps the rank order of the data but not the actual values.

rank(totals)
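Medal totals often contain ties, and `rank()` lets you choose how ties are handled through its `ties.method` argument. The short sketch below uses a made-up vector `x` to show the default behaviour and one alternative.

```r
# Hypothetical totals with a tie, used only for illustration
x <- c(10, 46, 10, 121, 70)

r.avg  <- rank(x)                        # 1.5 3.0 1.5 5.0 4.0 (ties share the average rank)
r.min  <- rank(x, ties.method = "min")   # 1 3 1 5 4 (ties share the lowest rank)
r.desc <- rank(-x, ties.method = "min")  # 4 3 4 1 2 (largest total gets rank 1)
```

Negating the vector is a simple way to rank in descending order, which matches how medal tables are usually read.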

## Dichotomize

Divide the data into two or more buckets. Here is an example with two buckets:

totals.dic = ifelse(totals > 20, "High", "Low")
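For more than two buckets, base R's `cut()` is one option. The sketch below is an illustration under assumed cut points: the vector `totals.demo`, the extra threshold of 5 and the label "Medium" are hypothetical, while the threshold of 20 mirrors the two-bucket example above.

```r
# Hypothetical totals, used only for illustration
totals.demo <- c(1, 4, 8, 20, 21, 46, 121)

# Intervals are right-closed by default: (-Inf, 5], (5, 20], (20, Inf)
buckets <- cut(totals.demo,
               breaks = c(-Inf, 5, 20, Inf),
               labels = c("Low", "Medium", "High"))

table(buckets)   # Low: 2, Medium: 2, High: 3
```

With right-closed intervals, a total of exactly 20 falls in "Medium", which agrees with the `totals > 20` rule used for "High" above.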