17 Data Transformation of a Variable
Sometimes data do not exhibit the characteristics of the assumed distribution. In that case, one option is to eliminate the outliers; more about this here: http://blog.einext.com/r-1/outlier-analysis. Otherwise, you may apply a transformation so that the new data do not show any outliers.
For illustration, we will use the Olympics 2016 medal count data.
url = "https://raw.githubusercontent.com/einext/data/master/Olympics2016.csv"
A boxplot of the medal totals shows that there are outliers.
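The outliers a boxplot draws as separate points can also be listed numerically. The following is a minimal sketch, assuming the common 1.5 × IQR whisker rule; the `totals` list is made-up illustration data, not the contents of Olympics2016.csv, and `iqr_outliers` is a hypothetical helper name.

```python
import statistics

# Made-up medal totals for illustration only (NOT the real Olympics2016.csv values).
totals = [1, 2, 2, 3, 4, 5, 6, 8, 10, 15, 40, 120]

def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR whiskers."""
    # quantiles() with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

outliers = iqr_outliers(totals)
print(outliers)
```

Different tools compute quartiles slightly differently, so the flagged points may differ at the margins from those a particular boxplot shows.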
One common technique to transform data is to standardize it (convert it to z-scores), so that mean = 0 and SD = 1.
Mean of z-score should be 0
SD of z-score should be 1
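As a rough sketch of the standardization and the two checks above, assuming the sample standard deviation (n - 1 denominator) is the one intended; `totals` is made-up illustration data and `z_scores` is a hypothetical helper name.

```python
import statistics

# Made-up medal totals for illustration only (NOT the real Olympics2016.csv values).
totals = [1, 2, 2, 3, 4, 5, 6, 8, 10, 15, 40, 120]

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator (an assumption)
    return [(v - mean) / sd for v in values]

z = z_scores(totals)

# Both checks hold up to floating-point rounding.
print("mean of z:", round(statistics.mean(z), 10))
print("SD of z:", round(statistics.stdev(z), 10))
```

Standardization is a linear rescaling: it changes the units but not the shape of the distribution, which is why the boxplot is checked again afterwards.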
Boxplot to find outliers
Log transformation is useful when dealing with large-valued data, such as population or revenue.
If negative values are present, square each value before taking logs (note that this discards the sign).
If 0 is a possible value, add 1 or 0.1 to each value before taking the log.
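The zero-handling rule can be sketched as follows, assuming a natural log and an offset of 1; `totals` is made-up illustration data, and `log_transform` and its `offset` parameter are hypothetical names.

```python
import math

# Made-up medal totals for illustration only; note the 0, which log() alone cannot handle.
totals = [0, 1, 2, 2, 3, 4, 5, 6, 8, 10, 15, 40, 120]

def log_transform(values, offset=1.0):
    """Return the natural log of (value + offset) for each value.

    The offset (1 or 0.1 in the text) keeps zeros from producing log(0).
    """
    return [math.log(v + offset) for v in values]

logged = log_transform(totals)
print([round(x, 3) for x in logged])
```

The transform preserves the order of the values while compressing the large ones far more than the small ones, which is what pulls extreme points back toward the rest of the data.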
The above plot does not show any outliers, so the data are now almost ready for further analysis. Remember that many statistical methods are applicable only when the data exhibit a normal distribution, which means, among several other properties, data without outliers.
A ranking transformation keeps the rank (order) of the data but not the actual values.
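A minimal sketch of a rank transformation, assuming tied values should share the average of their ranks (one common convention among several); `rank_transform` is a hypothetical helper and the input list is made up.

```python
def rank_transform(values):
    """Return 1-based ranks; tied values get the average of their positions."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` across the run of values equal to the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks

# Made-up totals for illustration only.
totals = [120, 2, 40, 2, 15]
ranks = rank_transform(totals)
print(ranks)
```

The largest value simply becomes the highest rank, so however extreme it was, it now sits one step above its neighbour.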
Binning divides the data into 2 or more buckets. Here is an example with 2 buckets.
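One way to sketch a two-bucket split, assuming the median as the cut point (the choice of threshold is an assumption; a domain-specific cutoff works the same way). `totals` is made-up illustration data, and `two_buckets`, `threshold` and `labels` are hypothetical names.

```python
import statistics

# Made-up medal totals for illustration only (NOT the real Olympics2016.csv values).
totals = [1, 2, 2, 3, 4, 5, 6, 8, 10, 15, 40, 120]

def two_buckets(values, threshold=None, labels=("low", "high")):
    """Label each value "low" or "high" relative to a threshold.

    If no threshold is given, the median is used (an assumption).
    Values equal to the threshold fall in the "low" bucket.
    """
    if threshold is None:
        threshold = statistics.median(values)
    return [labels[1] if v > threshold else labels[0] for v in values]

buckets = two_buckets(totals)
for value, bucket in zip(totals, buckets):
    print(value, bucket)
```

After binning, 40 and 120 are simply "high" alongside 6 and 8. The outliers no longer stand apart, at the cost of discarding the magnitude information.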