09 Working with Missing Data
Consider the following dataset.
What is the mean of x? Because x contains missing values, the result is NA.
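As a minimal sketch, the snippet below defines the vector x from this section and calls mean() with its default arguments. The variable name result is only illustrative.

```r
# The example vector from this section, with two missing values
x <- c(1, 2, 3, 4, NA, 6, 7, 8, NA)

# By default, any NA in the input makes the mean NA
result <- mean(x)
print(result)  # NA
```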
Compute the mean, excluding the missing values.
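One way to do this is the na.rm argument of mean(), which drops NA values before computing. The sketch below assumes the same vector x; the name mean_no_na is illustrative.

```r
x <- c(1, 2, 3, 4, NA, 6, 7, 8, NA)

# na.rm = TRUE removes the NA values before averaging the remaining 7 numbers
mean_no_na <- mean(x, na.rm = TRUE)
print(mean_no_na)  # 4.428571 (that is, 31 / 7)
```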
Check which values are NA
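The base function is.na() returns a logical vector of the same length as its input, which is probably the simplest way to do this check. A short sketch, with illustrative variable names:

```r
x <- c(1, 2, 3, 4, NA, 6, 7, 8, NA)

# TRUE wherever the value is missing, FALSE otherwise
na_flags <- is.na(x)
print(na_flags)

# Summing the logical vector counts the missing values
na_count <- sum(na_flags)
print(na_count)  # 2
```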
Find the indices of the NA values.
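Wrapping is.na() in which() converts the logical vector into integer positions. A minimal sketch (na_index is an illustrative name):

```r
x <- c(1, 2, 3, 4, NA, 6, 7, 8, NA)

# which() returns the positions where the condition is TRUE
na_index <- which(is.na(x))
print(na_index)  # 5 9
```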
One way of handling NA values is to replace them with 0.
This works if you are performing a "sum" type operation, but it is not appropriate if you are multiplying values or taking logs: a single 0 turns a product into 0, and log(0) is -Inf.
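The sketch below illustrates both points on a copy of x (named x_zero here for illustration): the sum is unaffected by the zeros, while the product and the logs are not.

```r
x <- c(1, 2, 3, 4, NA, 6, 7, 8, NA)

# Work on a copy and overwrite the NA positions with 0
x_zero <- x
x_zero[is.na(x_zero)] <- 0

total <- sum(x_zero)      # 31, same as sum(x, na.rm = TRUE)
product <- prod(x_zero)   # 0, because the zeros wipe out the product
logs <- log(x_zero)       # -Inf at the positions that were NA

print(total)
print(product)
print(logs)
```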
Another option is to replace the NA values with the mean or median of the non-NA values.
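A minimal sketch of both variants, each applied to its own copy of x (the names x_mean and x_median are illustrative):

```r
x <- c(1, 2, 3, 4, NA, 6, 7, 8, NA)

# Replace NA with the mean of the observed values
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)

# Replace NA with the median of the observed values
x_median <- x
x_median[is.na(x_median)] <- median(x, na.rm = TRUE)

print(x_mean)    # positions 5 and 9 become 4.428571
print(x_median)  # positions 5 and 9 become 4
```

The median is usually the safer choice when the observed values contain outliers, since the mean is pulled toward them.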
If you do not want to modify your original data, you can use the ifelse function, which returns a new vector and leaves the original untouched.
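For instance, the following sketch builds a filled-in vector (x_filled, an illustrative name) using the median as the replacement value, while x itself keeps its NA values:

```r
x <- c(1, 2, 3, 4, NA, 6, 7, 8, NA)

# ifelse(test, yes, no) picks the median where x is NA and the original value elsewhere
x_filled <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

print(x_filled)  # 1 2 3 4 4 6 7 8 4
print(x)         # unchanged, still contains NA
```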
The vector x referred to above is defined as:
x = c(1, 2, 3, 4, NA, 6, 7, 8, NA)
Let's load some sample data.
   Country Age Salary Purchased
1   France  44  72000        No
2    Spain  27  48000       Yes
3  Germany  30  54000        No
4    Spain  38  61000        No
5  Germany  40     NA       Yes
6   France  35  58000       Yes
7    Spain  NA  52000        No
8   France  48  79000       Yes
9  Germany  50  83000        No
10  France  37  67000       Yes
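The original file this data was loaded from is not shown, so as an assumption the sketch below rebuilds the same table inline with data.frame(). The name df is illustrative.

```r
# Rebuild the sample data shown above (assumption: typed in by hand, not read from a file)
df <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"),
  Age       = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary    = c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000),
  Purchased = c("No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

print(df)
```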
You can see there are 2 NA values: one in Age (row 7) and one in Salary (row 5). Let's replace each with the median of its respective column.
Which columns have NA in them?
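One possible approach is to count the NA values per column with colSums() over is.na(). The sketch assumes the sample data is held in a data frame called df (rebuilt inline here so the snippet runs on its own).

```r
df <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"),
  Age       = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary    = c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000),
  Purchased = c("No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_per_col <- colSums(is.na(df))
print(na_per_col)

# Names of the columns that contain at least one NA
na_cols <- names(df)[na_per_col > 0]
print(na_cols)  # "Age" "Salary"
```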
Which rows have NA in them?
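The base function complete.cases() flags rows with no missing values, so negating it gives the rows that do have one. A sketch, again assuming a data frame named df (only the two numeric columns are rebuilt here to keep it short):

```r
df <- data.frame(
  Age    = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary = c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000)
)

# Row numbers with at least one NA
na_rows <- which(!complete.cases(df))
print(na_rows)  # 5 7

# The incomplete rows themselves
print(df[na_rows, ])
```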
Find the median values for the Age and Salary columns, ignoring the NA values.
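As with mean(), median() accepts na.rm = TRUE. A minimal sketch with illustrative variable names:

```r
df <- data.frame(
  Age    = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary = c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000)
)

# Medians computed over the 9 observed values in each column
age_median <- median(df$Age, na.rm = TRUE)
salary_median <- median(df$Salary, na.rm = TRUE)

print(age_median)     # 38
print(salary_median)  # 61000
```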
Replace NA values with respective median values
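A sketch of the replacement step, using logical indexing to overwrite only the missing entries (df is rebuilt inline so the snippet is self-contained):

```r
df <- data.frame(
  Age    = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary = c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000)
)

# Overwrite only the NA positions in each column with that column's median
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
df$Salary[is.na(df$Salary)] <- median(df$Salary, na.rm = TRUE)

print(df[c(5, 7), ])  # row 5 now has Salary 61000, row 7 has Age 38
```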
If you want to replace the NA values in all numeric columns with the column median, you can loop over the columns instead of handling them one by one.
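One way this could look is a loop over the numeric columns only, since a median is not meaningful for text columns. This is a sketch under that assumption; num_cols and col are illustrative names.

```r
df <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"),
  Age       = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary    = c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000),
  Purchased = c("No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Pick out the numeric columns; character columns are skipped
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]

# Fill each numeric column's NA values with that column's median
for (col in num_cols) {
  col_median <- median(df[[col]], na.rm = TRUE)
  df[[col]][is.na(df[[col]])] <- col_median
}

print(df)
```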
For a categorical column, you can replace the NA values with a generic "None" value that acts as a placeholder.
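The sample data has no missing categorical values, so the sketch below uses a small hypothetical vector instead. It shows the character case and the factor case, where the new level has to be added before it can be assigned.

```r
# Hypothetical categorical data with one missing entry (not from the sample data)
country_chr <- c("France", NA, "Germany")

# Character vector: plain assignment is enough
country_chr[is.na(country_chr)] <- "None"
print(country_chr)  # "France" "None" "Germany"

# Factor: "None" must exist as a level before it can be assigned
country_fct <- factor(c("France", NA, "Germany"))
levels(country_fct) <- c(levels(country_fct), "None")
country_fct[is.na(country_fct)] <- "None"
print(country_fct)
```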
Handling Missing Values using MICE
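A more sophisticated technique for handling missing data is to leverage the underlying distribution of the data, so that each missing value is predicted from the other variables rather than filled with a single constant. To do so, use one of the available packages, such as mice (Multivariate Imputation by Chained Equations).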
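A minimal sketch of how this might look with the mice package, assuming it is installed and applied to the numeric columns of the sample data. The choices of predictive mean matching (method = "pmm"), five imputed datasets, and the seed value are illustrative assumptions, not settings taken from this tutorial.

```r
df <- data.frame(
  Age    = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary = c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000)
)

if (requireNamespace("mice", quietly = TRUE)) {
  suppressPackageStartupMessages(library(mice))

  # Build m = 5 imputed datasets using predictive mean matching
  imp <- mice(df, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

  # Extract the first of the imputed datasets as an ordinary data frame
  completed <- complete(imp, 1)
  print(completed)
} else {
  message("Package 'mice' is not installed; install.packages(\"mice\") would be needed first.")
}
```

With only ten rows the imputed values are not very informative; the point of the sketch is the workflow of imputing with mice() and then extracting a dataset with complete().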
For practice, use the following dataset.