09 Working with Missing Data

Consider the following dataset.

x = c(1, 2, 3, 4, NA, 6, 7, 8, NA)

What is mean of x .. it should say NA.

mean(x)

Compute mean excluding missing values

mean(x, na.rm = TRUE)

Check which values are NA

is.na(x)

Find indexes of NA values

which(is.na(x))

One way of handling NA values is to replace them with 0.

x[is.na(x)] = 0

The works if you are performing is "sum" type operation, but will not be appropriate if you are multiplying values or take logs.

Another option could be replace NA values with mean or median of non NA values.

x = c(1, 2, 3, 4, NA, 6, 7, 8, NA)
x[is.na(x)] = mean(x, na.rm = TRUE)

If you do not want to modify your original data, you can use ifelse function

x = c(1, 2, 3, 4, NA, 6, 7, 8, NA)

x.new = ifelse(is.na(x), 0, x)

Let's load some sample data.

> df = read.csv("data/mobile-sales-data.csv")

> df

Country Age Salary Purchased

1 France 44 72000 No

2 Spain 27 48000 Yes

3 Germany 30 54000 No

4 Spain 38 61000 No

5 Germany 40 NA Yes

6 France 35 58000 Yes

7 Spain NA 52000 No

8 France 48 79000 Yes

9 Germany 50 83000 No

10 France 37 67000 Yes

You can see there are 2 NA values there. Let's replace them by median respective column.

Which columns have NA in them

> which(colSums(is.na(df)) > 0)

Which rows have NA in them

> which(rowSums(is.na(df)) > 0)

Find media values for Age and Salary column ignoring the NA

> median(df$Age, na.rm = TRUE)
[1] 38
> median(df$Salary, na.rm = TRUE)
[1] 61000

Replace NA values with respective median values

> df$Age[is.na(df$Age)] = median(df$Age, na.rm = TRUE)
> df$Salary[is.na(df$Salary)] = median(df$Salary, na.rm = TRUE)
> df
   Country Age Salary Purchased
1   France  44  72000        No
2    Spain  27  48000       Yes
3  Germany  30  54000        No
4    Spain  38  61000        No
5  Germany  40  61000       Yes
6   France  35  58000       Yes
7    Spain  38  52000        No
8   France  48  79000       Yes
9  Germany  50  83000        No
10  France  37  67000       Yes

If you want replace all replace NA in all columns with median

for(i in 1:ncol(df)){ 
    if(is.numeric(df[,i])){
        df[is.na(df[,i]), i] = median(df[,i], na.rm = TRUE)
    }
}

For categorical column, you can replace the NA values with a generic "None" value that acts like a placeholder.

Handling Missing Values using MICE

> require(mice)
> require(VIM)
> md.pattern(df)
  Country Purchased Age Salary  
8       1         1   1      1 0
1       1         1   0      1 1
1       1         1   1      0 1
        0         0   1      1 2
> aggr_plot <- aggr(df, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=1, ylab=c("Histogram of missing data","Pattern"))

Imputation

More sophisticated technique to handle missing data is by leveraging underlying distribution of the data. To do so, use one of the packages available.

  • mi
  • mice
  • imputation

For practice use the following dataset

  • car::Davis