09 Working with Missing Data
Consider the following dataset.
x = c(1, 2, 3, 4, NA, 6, 7, 8, NA)
What is mean of x .. it should say NA.
mean(x)
Compute mean excluding missing values
mean(x, na.rm = TRUE)
Check which values are NA
is.na(x)
Find indexes of NA values
which(is.na(x))
One way of handling NA values is to replace them with 0.
x[is.na(x)] = 0
The works if you are performing is "sum" type operation, but will not be appropriate if you are multiplying values or take logs.
Another option could be replace NA values with mean or median of non NA values.
x = c(1, 2, 3, 4, NA, 6, 7, 8, NA)
x[is.na(x)] = mean(x, na.rm = TRUE)
If you do not want to modify your original data, you can use ifelse function
x = c(1, 2, 3, 4, NA, 6, 7, 8, NA)
x.new = ifelse(is.na(x), 0, x)
Let's load some sample data.
> df = read.csv("data/mobile-sales-data.csv")
> df
Country Age Salary Purchased
1 France 44 72000 No
2 Spain 27 48000 Yes
3 Germany 30 54000 No
4 Spain 38 61000 No
5 Germany 40 NA Yes
6 France 35 58000 Yes
7 Spain NA 52000 No
8 France 48 79000 Yes
9 Germany 50 83000 No
10 France 37 67000 Yes
You can see there are 2 NA values there. Let's replace them by median respective column.
Which columns have NA in them
> which(colSums(is.na(df)) > 0)
Which rows have NA in them
> which(rowSums(is.na(df)) > 0)
Find media values for Age and Salary column ignoring the NA
> median(df$Age, na.rm = TRUE)
[1] 38
> median(df$Salary, na.rm = TRUE)
[1] 61000
Replace NA values with respective median values
> df$Age[is.na(df$Age)] = median(df$Age, na.rm = TRUE)
> df$Salary[is.na(df$Salary)] = median(df$Salary, na.rm = TRUE)
> df
Country Age Salary Purchased
1 France 44 72000 No
2 Spain 27 48000 Yes
3 Germany 30 54000 No
4 Spain 38 61000 No
5 Germany 40 61000 Yes
6 France 35 58000 Yes
7 Spain 38 52000 No
8 France 48 79000 Yes
9 Germany 50 83000 No
10 France 37 67000 Yes
If you want replace all replace NA in all columns with median
for(i in 1:ncol(df)){
if(is.numeric(df[,i])){
df[is.na(df[,i]), i] = median(df[,i], na.rm = TRUE)
}
}
For categorical column, you can replace the NA values with a generic "None" value that acts like a placeholder.
Handling Missing Values using MICE
> require(mice)
> require(VIM)
> md.pattern(df)
Country Purchased Age Salary
8 1 1 1 1 0
1 1 1 0 1 1
1 1 1 1 0 1
0 0 1 1 2
> aggr_plot <- aggr(df, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=1, ylab=c("Histogram of missing data","Pattern"))
Imputation
More sophisticated technique to handle missing data is by leveraging underlying distribution of the data. To do so, use one of the packages available.
mi
mice
imputation
For practice use the following dataset
car::Davis