18 Tidy Data

Causes of messiness of data

A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

  • Column headers are values not variable names
  • Multiple variables are stored in a single column
  • Variables are stored in both rows and columns
  • Multiple types of experimental unit stored in the same table
  • One type of experimental unit in multiple tables

Tidy Data (… it fits the 3NF model in relational database)

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table

Tools

  • reshape2::melt - changes columns into rows
  • reshape2::decast - shifts the variables from rows to columns
  • stringr::str_replace
  • stringr::str_sub
  • stringr::match
  • stringr::str_split_fixed
  • plyr::arrange

Key function for data manipulation:

  • subset: filter data
  • transform: add new columns to the dataframe base on existing columns
  • summarize: create new dataframe from an existing dataframe
  • arrange: change the order of the rows
> library(reshape2) # Data Tiding ... dcast and melt functions
> library(ggplot2) # Visualization
> library(plyr) # Data transformation and join
> library(stringr) # String manipulation using regular expression
> library(MASS) # Contains lot of useful dataset

Case 1: Column headers are values not variable names

> url = "http://stat405.had.co.nz/data/pew.txt"
> pew = read.table(url, header = TRUE, check.names = FALSE)
> str(pew)
'data.frame': 18 obs. of 11 variables:
$ religion : Factor w/ 18 levels "Agnostic","Atheist",..: 1 2 3 4 5 6 7 8 9 10 ...
$ <$10k : int 27 12 27 418 15 575 1 228 20 19 ...
$ $10-20k : int 34 27 21 617 14 869 9 244 27 19 ...
$ $20-30k : int 60 37 30 732 15 1064 7 236 24 25 ...
$ $30-40k : int 81 52 34 670 11 982 9 238 24 25 ...
$ $40-50k : int 76 35 33 638 10 881 11 197 21 30 ...
$ $50-75k : int 137 70 58 1116 35 1486 34 223 30 95 ...
$ $75-100k : int 122 73 62 949 21 949 47 131 15 69 ...
$ $100-150k : int 109 59 39 792 17 723 48 81 11 87 ...
$ >150k : int 84 74 53 633 18 414 54 78 6 151 ...
$ Don't know/refused: int 96 76 54 1489 116 1529 37 339 37 162 ...

Melting Data keeping religion as column, rest of the columns into row attribute.

> pew.melt = melt(pew, "religion")
> head(pew.melt)
religion variable value
1 Agnostic <$10k 27
2 Atheist <$10k 12
3 Buddhist <$10k 27
4 Catholic <$10k 418
5 Don’t know/refused <$10k 15
6 Evangelical Prot <$10k 575

Set header of the pew.melt data frame

> colnames(pew.melt) = c("Relion", "Income", "Count")
> str(pew.melt)

'data.frame': 180 obs. of 3 variables:

 $ Relion: Factor w/ 18 levels "Agnostic","Atheist",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Income: Factor w/ 10 levels "<$10k","$10-20k",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Count : int  27 12 27 418 15 575 1 228 20 19 ...

Here is another example based on tuber dataset on Tuberculosis

> tb = read.csv("http://stat405.had.co.nz/data/tb.csv")
> colnames(tb)
 [1] "iso2"         "year"         "new_sp"       "new_sp_m04"  
 [5] "new_sp_m514"  "new_sp_m014"  "new_sp_m1524" "new_sp_m2534"
 [9] "new_sp_m3544" "new_sp_m4554" "new_sp_m5564" "new_sp_m65"  
[13] "new_sp_mu"    "new_sp_f04"   "new_sp_f514"  "new_sp_f014" 
[17] "new_sp_f1524" "new_sp_f2534" "new_sp_f3544" "new_sp_f4554"

[21] "new_sp_f5564" "new_sp_f65" "new_sp_fu"

Clean up column names

> require(stringr)
> colnames(tb) = str_replace(colnames(tb), "new_sp", "")
> colnames(tb) = str_replace(colnames(tb), "_", "")
> colnames(tb)
 [1] "iso2"  "year"  ""      "m04"   "m514"  "m014"  "m1524" "m2534" "m3544"
[10] "m4554" "m5564" "m65"   "mu"    "f04"   "f514"  "f014"  "f1524" "f2534"
[19] "f3544" "f4554" "f5564" "f65"   "fu"   

Melt data ... columns to rows

> tb.melt = melt(tb, id = c("iso2", "year"), na.rm = TRUE)
> str(tb.melt)

'data.frame': 38831 obs. of 4 variables:

 $ iso2    : Factor w/ 212 levels "AD","AE","AF",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year    : int  1993 1994 1996 1997 1998 1999 2000 2001 2002 2003 ...
 $ variable: Factor w/ 21 levels "","m04","m514",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : int  15 24 8 17 1 4 1 3 2 7 ...

Rename "value" column to "cases".

> colnames(tb.melt)[4] = "cases"
> str(tb.melt)

'data.frame': 38831 obs. of 4 variables:

 $ iso2    : Factor w/ 212 levels "AD","AE","AF",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year    : int  1993 1994 1996 1997 1998 1999 2000 2001 2002 2003 ...
 $ variable: Factor w/ 21 levels "","m04","m514",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ cases   : int  15 24 8 17 1 4 1 3 2 7 ...

Order data

> require(plyr)

> tb.melt = arrange(tb.melt, iso2, variable, year)

Now let's split “variable” column into separate columns

> str_sub(tb.melt$variable, 2)

Extract gender field from the "variable" column.

> tb.melt$sex = as.factor(str_sub(tb.melt$variable, 1, 1))

Create a lookup table of a single vector

> ages = c("04" = "0-4", "514" = "5-14", "014" = "0-14", "1524" = "15-24", 
    "2534" = "25-34", "3544" = "35-44", "4554" = "45-54", "5564" = "55-64", 

"65" = "65+", "u" = NA)

Replace the column value with more user friendly values bases on ages vector

> tb.melt$age = factor(ages[str_sub(tb.melt$variable, 2)], levels = ages)

Remove "variable" column from the dataset.

> tb.melt$variable = NULL

Reorder the columns

> tb.melt = tb.melt[c("iso2", "year", "sex", "age", "cases")]

> tb.melt[5:25, ]

iso2 year sex age cases

5 AD 1998 <NA> 1

6 AD 1999 <NA> 4

7 AD 2000 <NA> 1

8 AD 2001 <NA> 3

9 AD 2002 <NA> 2

10 AD 2003 <NA> 7

11 AD 2004 <NA> 3

12 AD 2005 <NA> 5

13 AD 2006 <NA> 8

14 AD 2007 <NA> 2

15 AD 2008 <NA> 3

16 AD 2005 m 0-4 0

17 AD 2006 m 0-4 0

18 AD 2008 m 0-4 0

19 AD 2005 m 5-14 0

20 AD 2006 m 5-14 0

21 AD 2008 m 5-14 0

22 AD 1996 m 0-14 0

23 AD 1997 m 0-14 0

24 AD 1998 m 0-14 0

25 AD 1999 m 0-14 0

The above data is much cleaner is ready for further processing.

Variables in Row and Column Names

Load weather data set

> weather = read.csv("http://stat405.had.co.nz/data/weather.txt", sep = "\t")
> head(weather)
           id year month element d1  d2  d3 d4  d5 d6 d7 d8 d9 d10 d11 d12
1 MX000017004 2010     1    TMAX NA  NA  NA NA  NA NA NA NA NA  NA  NA  NA
2 MX000017004 2010     1    TMIN NA  NA  NA NA  NA NA NA NA NA  NA  NA  NA
3 MX000017004 2010     2    TMAX NA 273 241 NA  NA NA NA NA NA  NA 297  NA
4 MX000017004 2010     2    TMIN NA 144 144 NA  NA NA NA NA NA  NA 134  NA
5 MX000017004 2010     3    TMAX NA  NA  NA NA 321 NA NA NA NA 345  NA  NA
6 MX000017004 2010     3    TMIN NA  NA  NA NA 142 NA NA NA NA 168  NA  NA
  d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30
1  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 278
2  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 145
3  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 299  NA  NA  NA  NA  NA  NA  NA
4  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 107  NA  NA  NA  NA  NA  NA  NA
5  NA  NA  NA 311  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
6  NA  NA  NA 176  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
  d31
1  NA
2  NA
3  NA
4  NA
5  NA
6  NA

Melt weather data keep first 4 columns fixed

> weather.melt = melt(weather, id = 1:4, na.rm = TRUE)
> head(weather.melt)
            id year month element variable value
21 MX000017004 2010    12    TMAX       d1   299
22 MX000017004 2010    12    TMIN       d1   138
25 MX000017004 2010     2    TMAX       d2   273
26 MX000017004 2010     2    TMIN       d2   144
41 MX000017004 2010    11    TMAX       d2   313
42 MX000017004 2010    11    TMIN       d2   163

Extract the day number from "variable" column and remove variable column

> weather.melt$day = as.integer(str_replace(weather.melt$variable, "d", ""))
> weather.melt$variable = NULL

Lowercase values in "element" column

> weather.melt$element = tolower(weather.melt$element)
> head(weather.melt)
            id year month element value day
21 MX000017004 2010    12    tmax   299   1
22 MX000017004 2010    12    tmin   138   1
25 MX000017004 2010     2    tmax   273   2
26 MX000017004 2010     2    tmin   144   2
41 MX000017004 2010    11    tmax   313   2
42 MX000017004 2010    11    tmin   163   2

Rearrange the columns in weather.melt data.

> weather.melt = weather.melt[c("id", "year", "month", "element", "day", "value")]

Shift the "element" column to separate columns

> weather.melt2 = dcast(weather.melt, ... ~ element)
> head(weather.melt2)
           id year month day tmax tmin
1 MX000017004 2010     1  30  278  145
2 MX000017004 2010     2   2  273  144
3 MX000017004 2010     2   3  241  144
4 MX000017004 2010     2  11  297  134
5 MX000017004 2010     2  23  299  107
6 MX000017004 2010     3   5  321  142

Now the data looks much cleaner and is now ready for further analysis.