16 Outlier Analysis

Outliers for discrete or categorical variables

A categorical value is treated as an outlier when its density (its proportion of the population) is less than 10%.
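The density rule above can be sketched on a small made-up sample. The vector `category` and its values are hypothetical and only illustrate the 10% threshold. They are not taken from the Olympics data used below.

```r
# Hypothetical categorical sample (made-up values, for illustration only)
category <- c(rep("red", 50), rep("blue", 40), rep("green", 7), rep("purple", 3))

# Proportion of the sample that falls in each category
density <- prop.table(table(category))

# Categories whose proportion is below the 10% threshold
rare <- names(density)[as.vector(density) < 0.10]

print(density)
print(rare)
```

Here `table()` counts each category and `prop.table()` turns the counts into proportions, so the comparison against 0.10 is a direct application of the rule.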

For illustration, let's use Olympics 2016 data on medal count.

Load the Olympics dataset from GitHub and create a data frame.

> url = "https://raw.githubusercontent.com/einext/data/master/Olympics2016.csv"

> olympics = read.table(url, header = TRUE, sep = ",", quote = "\"")

Draw a boxplot of the total medal count.

> boxplot(olympics$Total, main = "Boxplot for Total Medal Count")

Run boxplot.stats to see the outliers.

> boxplot.stats(olympics$Total)
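To read the result of boxplot.stats, it helps to try it on a tiny vector first. The sketch below uses a made-up vector `x` (hypothetical values, not the medal counts) and looks at the two components that matter most here, `stats` and `out`.

```r
# Made-up numeric sample with one obviously extreme value
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 40)

s <- boxplot.stats(x)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(s$stats)

# out: points lying more than 1.5 * IQR beyond the hinges (the default coef)
print(s$out)
```

The `out` component is the same set of points that boxplot draws as individual dots beyond the whiskers.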

Add a proportion column to the dataset

> olympics$Proportion = olympics$Total / sum(olympics$Total)

Create a subset that keeps only the rows for which Proportion is greater than 0.05, dropping the low-density rows.

> olympics2 = olympics[olympics$Proportion > 0.05, ]

Draw a boxplot of olympics2, the slimmed-down dataset, to see whether there are any outliers.

The plot shows that there are no outliers in the reduced sample.
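The proportion, subset and re-check steps can also be tried without a network connection. The sketch below uses a made-up data frame `medals` (hypothetical countries and totals, not the real Olympics2016.csv) and checks for outliers numerically with boxplot.stats. With the real data, the analogous plot would presumably be boxplot(olympics2$Total).

```r
# Made-up medal totals (hypothetical, not the real dataset)
medals <- data.frame(
  Country = c("A", "B", "C", "D", "E", "F"),
  Total   = c(120, 70, 65, 55, 10, 5)
)

# Share of all medals won by each country
medals$Proportion <- medals$Total / sum(medals$Total)

# Keep only the rows above the 0.05 cutoff
medals2 <- medals[medals$Proportion > 0.05, ]

# Points that a boxplot of the reduced data would mark as outliers
out <- boxplot.stats(medals2$Total)$out

print(medals2)
print(length(out))
```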

Outliers for continuous or quantitative variables

For illustration, we will use the beaver1 dataset, which is built into R.


> summary(beaver1$temp)

Min. 1st Qu. Median Mean 3rd Qu. Max.

36.33 36.76 36.87 36.86 36.96 37.53

Draw a histogram

> hist(beaver1$temp)

The histogram does not reveal any obvious outliers in this case, but it does show a roughly bell-shaped, approximately normal distribution.

Let's draw a boxplot to see if there are any outliers.

> boxplot(beaver1$temp, main = "Boxplot of Temp in Beaver1 Dataset")

The boxplot clearly shows outliers at both the upper and the lower end of the range.

Remove the outliers using a selection statement. The boundaries used here are the Q1 and Q3 values (36.76 and 36.96) from the summary above, so the filter keeps only the values between the first and third quartiles.

> beaver1.slim = beaver1[beaver1$temp >= 36.76 & beaver1$temp <= 36.96, ]
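Filtering on Q1 and Q3 keeps only about the middle half of the observations. A less aggressive alternative, shown here only as a sketch, is to use the whisker ends reported by boxplot.stats as the boundaries, so that only the points the boxplot actually marks as outliers are dropped. The name `beaver1.whisker` is introduced for this example.

```r
# Five-number summary and outliers as used by boxplot()
s <- boxplot.stats(beaver1$temp)

# Whisker ends: the most extreme values that are not flagged as outliers
lower <- s$stats[1]
upper <- s$stats[5]

# Keep the rows whose temperature lies within the whiskers
beaver1.whisker <- beaver1[beaver1$temp >= lower & beaver1$temp <= upper, ]

# Compare the number of rows before and after filtering
print(nrow(beaver1))
print(nrow(beaver1.whisker))
```

Because the whisker ends are computed from the data, the boundaries do not have to be copied by hand from the summary output.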

Now boxplot again

> boxplot(beaver1.slim$temp, main = "Boxplot of Temp in Beaver1 Dataset")

There are no more outliers. It is possible for the new dataset to show outliers of its own, because the quartiles are recomputed on the reduced sample. You can repeat the above process until no outliers remain and the sample is satisfactory.
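The repeat-until-clean idea can be sketched as a small loop. The helper `remove_outliers` and its `max_iter` argument are hypothetical names for this illustration. The loop drops whatever boxplot.stats flags, recomputes, and stops when nothing is flagged or the iteration cap is reached.

```r
# Hypothetical helper: repeatedly drop the points flagged by boxplot.stats()
remove_outliers <- function(x, max_iter = 20) {
  for (i in seq_len(max_iter)) {
    out <- boxplot.stats(x)$out
    if (length(out) == 0) break   # nothing flagged, so stop
    x <- x[!(x %in% out)]         # drop the flagged values and try again
  }
  x
}

temp.clean <- remove_outliers(beaver1$temp)

# Number of observations before and after cleaning
print(length(beaver1$temp))
print(length(temp.clean))
```

Each pass shrinks the sample, so it is worth checking how many observations remain before relying on the cleaned data.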

For more practice: remove the skew from the ggplot2::diamonds$price data.