18 Tidy Data

Causes of messiness of data

A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

  • Column headers are values not variable names

  • Multiple variables are stored in a single column

  • Variables are stored in both rows and columns

  • Multiple types of experimental unit stored in the same table

  • One type of experimental unit in multiple tables

Tidy Data (… it fits the 3NF model in relational database)

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table

Tools

  • reshape2::melt - changes columns into rows

  • reshape2::decast - shifts the variables from rows to columns

  • stringr::str_replace

  • stringr::str_sub

  • stringr::match

  • stringr::str_split_fixed

  • plyr::arrange

Key function for data manipulation:

  • subset: filter data

  • transform: add new columns to the dataframe base on existing columns

  • summarize: create new dataframe from an existing dataframe

  • arrange: change the order of the rows

> library(reshape2) # Data Tiding ... dcast and melt functions

> library(ggplot2) # Visualization

> library(plyr) # Data transformation and join

> library(stringr) # String manipulation using regular expression

> library(MASS) # Contains lot of useful dataset

Case 1: Column headers are values not variable names

> url = "http://stat405.had.co.nz/data/pew.txt"

> pew = read.table(url, header = TRUE, check.names = FALSE)

> str(pew)

'data.frame': 18 obs. of 11 variables:

$ religion : Factor w/ 18 levels "Agnostic","Atheist",..: 1 2 3 4 5 6 7 8 9 10 ...

$ <$10k : int 27 12 27 418 15 575 1 228 20 19 ...

$ $10-20k : int 34 27 21 617 14 869 9 244 27 19 ...

$ $20-30k : int 60 37 30 732 15 1064 7 236 24 25 ...

$ $30-40k : int 81 52 34 670 11 982 9 238 24 25 ...

$ $40-50k : int 76 35 33 638 10 881 11 197 21 30 ...

$ $50-75k : int 137 70 58 1116 35 1486 34 223 30 95 ...

$ $75-100k : int 122 73 62 949 21 949 47 131 15 69 ...

$ $100-150k : int 109 59 39 792 17 723 48 81 11 87 ...

$ >150k : int 84 74 53 633 18 414 54 78 6 151 ...

$ Don't know/refused: int 96 76 54 1489 116 1529 37 339 37 162 ...

Melting Data keeping religion as column, rest of the columns into row attribute.

> pew.melt = melt(pew, "religion")

> head(pew.melt)

religion variable value

1 Agnostic <$10k 27

2 Atheist <$10k 12

3 Buddhist <$10k 27

4 Catholic <$10k 418

5 Don’t know/refused <$10k 15

6 Evangelical Prot <$10k 575

Set header of the pew.melt data frame

> colnames(pew.melt) = c("Relion", "Income", "Count")

> str(pew.melt)

'data.frame': 180 obs. of 3 variables:

$ Relion: Factor w/ 18 levels "Agnostic","Atheist",..: 1 2 3 4 5 6 7 8 9 10 ...

$ Income: Factor w/ 10 levels "<$10k","$10-20k",..: 1 1 1 1 1 1 1 1 1 1 ...

$ Count : int 27 12 27 418 15 575 1 228 20 19 ...

Here is another example based on tuber dataset on Tuberculosis

> tb = read.csv("http://stat405.had.co.nz/data/tb.csv")

> colnames(tb)

[1] "iso2" "year" "new_sp" "new_sp_m04"

[5] "new_sp_m514" "new_sp_m014" "new_sp_m1524" "new_sp_m2534"

[9] "new_sp_m3544" "new_sp_m4554" "new_sp_m5564" "new_sp_m65"

[13] "new_sp_mu" "new_sp_f04" "new_sp_f514" "new_sp_f014"

[17] "new_sp_f1524" "new_sp_f2534" "new_sp_f3544" "new_sp_f4554"

[21] "new_sp_f5564" "new_sp_f65" "new_sp_fu"

Clean up column names

> require(stringr)

> colnames(tb) = str_replace(colnames(tb), "new_sp", "")

> colnames(tb) = str_replace(colnames(tb), "_", "")

> colnames(tb)

[1] "iso2" "year" "" "m04" "m514" "m014" "m1524" "m2534" "m3544"

[10] "m4554" "m5564" "m65" "mu" "f04" "f514" "f014" "f1524" "f2534"

[19] "f3544" "f4554" "f5564" "f65" "fu"

Melt data ... columns to rows

> tb.melt = melt(tb, id = c("iso2", "year"), na.rm = TRUE)

> str(tb.melt)

'data.frame': 38831 obs. of 4 variables:

$ iso2 : Factor w/ 212 levels "AD","AE","AF",..: 1 1 1 1 1 1 1 1 1 1 ...

$ year : int 1993 1994 1996 1997 1998 1999 2000 2001 2002 2003 ...

$ variable: Factor w/ 21 levels "","m04","m514",..: 1 1 1 1 1 1 1 1 1 1 ...

$ value : int 15 24 8 17 1 4 1 3 2 7 ...

Rename "value" column to "cases".

> colnames(tb.melt)[4] = "cases"

> str(tb.melt)

'data.frame': 38831 obs. of 4 variables:

$ iso2 : Factor w/ 212 levels "AD","AE","AF",..: 1 1 1 1 1 1 1 1 1 1 ...

$ year : int 1993 1994 1996 1997 1998 1999 2000 2001 2002 2003 ...

$ variable: Factor w/ 21 levels "","m04","m514",..: 1 1 1 1 1 1 1 1 1 1 ...

$ cases : int 15 24 8 17 1 4 1 3 2 7 ...

Order data

> require(plyr)

> tb.melt = arrange(tb.melt, iso2, variable, year)

Now let's split “variable” column into separate columns

> str_sub(tb.melt$variable, 2)

Extract gender field from the "variable" column.

> tb.melt$sex = as.factor(str_sub(tb.melt$variable, 1, 1))

Create a lookup table of a single vector

> ages = c("04" = "0-4", "514" = "5-14", "014" = "0-14", "1524" = "15-24",

"2534" = "25-34", "3544" = "35-44", "4554" = "45-54", "5564" = "55-64",

"65" = "65+", "u" = NA)

Replace the column value with more user friendly values bases on ages vector

> tb.melt$age = factor(ages[str_sub(tb.melt$variable, 2)], levels = ages)

Remove "variable" column from the dataset.

> tb.melt$variable = NULL

Reorder the columns

> tb.melt = tb.melt[c("iso2", "year", "sex", "age", "cases")]

> tb.melt[5:25, ]

iso2 year sex age cases

5 AD 1998 <NA> 1

6 AD 1999 <NA> 4

7 AD 2000 <NA> 1

8 AD 2001 <NA> 3

9 AD 2002 <NA> 2

10 AD 2003 <NA> 7

11 AD 2004 <NA> 3

12 AD 2005 <NA> 5

13 AD 2006 <NA> 8

14 AD 2007 <NA> 2

15 AD 2008 <NA> 3

16 AD 2005 m 0-4 0

17 AD 2006 m 0-4 0

18 AD 2008 m 0-4 0

19 AD 2005 m 5-14 0

20 AD 2006 m 5-14 0

21 AD 2008 m 5-14 0

22 AD 1996 m 0-14 0

23 AD 1997 m 0-14 0

24 AD 1998 m 0-14 0

25 AD 1999 m 0-14 0

The above data is much cleaner is ready for further processing.

Variables in Row and Column Names

Load weather data set

> weather = read.csv("http://stat405.had.co.nz/data/weather.txt", sep = "\t")

> head(weather)

id year month element d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12

1 MX000017004 2010 1 TMAX NA NA NA NA NA NA NA NA NA NA NA NA

2 MX000017004 2010 1 TMIN NA NA NA NA NA NA NA NA NA NA NA NA

3 MX000017004 2010 2 TMAX NA 273 241 NA NA NA NA NA NA NA 297 NA

4 MX000017004 2010 2 TMIN NA 144 144 NA NA NA NA NA NA NA 134 NA

5 MX000017004 2010 3 TMAX NA NA NA NA 321 NA NA NA NA 345 NA NA

6 MX000017004 2010 3 TMIN NA NA NA NA 142 NA NA NA NA 168 NA NA

d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30

1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 278

2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 145

3 NA NA NA NA NA NA NA NA NA NA 299 NA NA NA NA NA NA NA

4 NA NA NA NA NA NA NA NA NA NA 107 NA NA NA NA NA NA NA

5 NA NA NA 311 NA NA NA NA NA NA NA NA NA NA NA NA NA NA

6 NA NA NA 176 NA NA NA NA NA NA NA NA NA NA NA NA NA NA

d31

1 NA

2 NA

3 NA

4 NA

5 NA

6 NA

Melt weather data keep first 4 columns fixed

> weather.melt = melt(weather, id = 1:4, na.rm = TRUE)

> head(weather.melt)

id year month element variable value

21 MX000017004 2010 12 TMAX d1 299

22 MX000017004 2010 12 TMIN d1 138

25 MX000017004 2010 2 TMAX d2 273

26 MX000017004 2010 2 TMIN d2 144

41 MX000017004 2010 11 TMAX d2 313

42 MX000017004 2010 11 TMIN d2 163

Extract the day number from "variable" column and remove variable column

> weather.melt$day = as.integer(str_replace(weather.melt$variable, "d", ""))

> weather.melt$variable = NULL

Lowercase values in "element" column

> weather.melt$element = tolower(weather.melt$element)

> head(weather.melt)

id year month element value day

21 MX000017004 2010 12 tmax 299 1

22 MX000017004 2010 12 tmin 138 1

25 MX000017004 2010 2 tmax 273 2

26 MX000017004 2010 2 tmin 144 2

41 MX000017004 2010 11 tmax 313 2

42 MX000017004 2010 11 tmin 163 2

Rearrange the columns in weather.melt data.

> weather.melt = weather.melt[c("id", "year", "month", "element", "day", "value")]

Shift the "element" column to separate columns

> weather.melt2 = dcast(weather.melt, ... ~ element)

> head(weather.melt2)

id year month day tmax tmin

1 MX000017004 2010 1 30 278 145

2 MX000017004 2010 2 2 273 144

3 MX000017004 2010 2 3 241 144

4 MX000017004 2010 2 11 297 134

5 MX000017004 2010 2 23 299 107

6 MX000017004 2010 3 5 321 142

Now the data looks much cleaner and is now ready for further analysis.