# 09 Working with Missing Data

Consider the following dataset.

``x = c(1, 2, 3, 4, NA, 6, 7, 8, NA)``

What is mean of x .. it should say NA.

``mean(x)``

Compute mean excluding missing values

``mean(x, na.rm = TRUE)``

Check which values are NA

``is.na(x)``

Find indexes of NA values

``which(is.na(x))``

One way of handling NA values is to replace them with 0.

``x[is.na(x)] = 0``

The works if you are performing is "sum" type operation, but will not be appropriate if you are multiplying values or take logs.

Another option could be replace NA values with mean or median of non NA values.

``x = c(1, 2, 3, 4, NA, 6, 7, 8, NA)``
``x[is.na(x)] = mean(x, na.rm = TRUE)``

If you do not want to modify your original data, you can use ifelse function

x = c(1, 2, 3, 4, NA, 6, 7, 8, NA)

``x.new = ifelse(is.na(x), 0, x)``

``> df = read.csv("data/mobile-sales-data.csv")``

> df

Country Age Salary Purchased

1 France 44 72000 No

2 Spain 27 48000 Yes

3 Germany 30 54000 No

4 Spain 38 61000 No

5 Germany 40 NA Yes

6 France 35 58000 Yes

7 Spain NA 52000 No

8 France 48 79000 Yes

9 Germany 50 83000 No

10 France 37 67000 Yes

You can see there are 2 NA values there. Let's replace them by median respective column.

Which columns have NA in them

``> which(colSums(is.na(df)) > 0)``

Which rows have NA in them

``> which(rowSums(is.na(df)) > 0)``

Find media values for Age and Salary column ignoring the NA

``> median(df\$Age, na.rm = TRUE)``
`` 38``
``> median(df\$Salary, na.rm = TRUE)``
`` 61000``

Replace NA values with respective median values

``> df\$Age[is.na(df\$Age)] = median(df\$Age, na.rm = TRUE)``
``> df\$Salary[is.na(df\$Salary)] = median(df\$Salary, na.rm = TRUE)``
``> df``
``   Country Age Salary Purchased``
``1   France  44  72000        No``
``2    Spain  27  48000       Yes``
``3  Germany  30  54000        No``
``4    Spain  38  61000        No``
``5  Germany  40  61000       Yes``
``6   France  35  58000       Yes``
``7    Spain  38  52000        No``
``8   France  48  79000       Yes``
``9  Germany  50  83000        No``
``10  France  37  67000       Yes``

If you want replace all replace NA in all columns with median

``for(i in 1:ncol(df)){ ``
``    if(is.numeric(df[,i])){``
``        df[is.na(df[,i]), i] = median(df[,i], na.rm = TRUE)``
``    }``
``}``

For categorical column, you can replace the NA values with a generic "None" value that acts like a placeholder.

Handling Missing Values using MICE

``> require(mice)``
``> require(VIM)``
``> md.pattern(df)``
``  Country Purchased Age Salary  ``
``8       1         1   1      1 0``
``1       1         1   0      1 1``
``1       1         1   1      0 1``
``        0         0   1      1 2``
``> aggr_plot <- aggr(df, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=1, ylab=c("Histogram of missing data","Pattern"))`` ## Imputation

More sophisticated technique to handle missing data is by leveraging underlying distribution of the data. To do so, use one of the packages available.

• mi
• mice
• imputation

For practice use the following dataset

• car::Davis