10 Correlation Analysis

For illustration, we are going to use the Boston Housing price data. You can find a description of the dataset at the link below.

https://archive.ics.uci.edu/ml/datasets/Housing

> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"

> housing = read.table(url)

> str(housing)

How many variables and rows are there?

506 observations and 14 variables.
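You can confirm this with a quick check; base R's dim() reports the number of rows and columns of a data frame in one call:

> dim(housing)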

Assign user-friendly names to the columns

> colnames(housing) = c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE" ,"DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")

Find the correlations among the variables

> cor(housing)

Reduce decimal places in the correlation matrix

> round(cor(housing), 2)

Find answers to the following questions (a short code sketch after the list shows one way to read them off)

  1. Which variables have an absolute correlation coefficient with MEDV greater than 0.5?

  2. Which variables are positively correlated with MEDV, and which are negatively correlated?

  3. Which pairs of variables are strongly correlated with each other (absolute correlation coefficient close to 0.80)?
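One way to read the answers off the matrix is to keep only the MEDV column and filter it. This is a minimal sketch; medv_cor is just an illustrative name.

> medv_cor = round(cor(housing), 2)[, "MEDV"]

> medv_cor[names(medv_cor) != "MEDV" & abs(medv_cor) > 0.5]

> sort(medv_cor)   # the sign of each coefficient shows whether the correlation is positive or negative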

Find the confidence interval for a pair of variables

> cor.test(housing$MEDV, housing$LSTAT)
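If you save the result of cor.test(), you can pull out the individual pieces; res below is just an illustrative name.

> res = cor.test(housing$MEDV, housing$LSTAT)

> res$estimate    # the correlation coefficient

> res$conf.int    # the 95% confidence interval

> res$p.value     # the p-value of the test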

Find the statistical significance of the correlations. For that, you can use the Hmisc package.

> require(Hmisc)

> rcorr(as.matrix(housing))
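rcorr() returns a list rather than a single matrix: its r component holds the correlation coefficients and its P component holds the corresponding p-values. A small sketch, with rc as an illustrative name:

> rc = rcorr(as.matrix(housing))

> round(rc$r, 2)   # correlation coefficients

> rc$P             # p-values for each pair of variables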

The p-value indicates how statistically significant a correlation is. Hypothesis tests use the p-value to weigh the strength of the evidence (what the data are telling you about the population). It is a number between 0 and 1. Interpret it as follows (see the example after this list):

  • A small p-value (< 0.05) indicates strong evidence against the null hypothesis, so you reject it.

  • A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you cannot reject it.

  • A marginal p-value (close to 0.05) is inconclusive.
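For example, applying the 0.05 cut-off to the MEDV/LSTAT pair tested earlier (a one-line sketch):

> cor.test(housing$MEDV, housing$LSTAT)$p.value < 0.05   # TRUE means strong evidence against the null hypothesis of zero correlation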

Exercise: find the correlations in the Swiss fertility dataset.
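A possible starting point (the swiss data frame ships with base R's datasets package):

> data(swiss)

> round(cor(swiss), 2)   # then repeat the cor.test() and rcorr() steps above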