10 Correlation Analysis
For illustration, we are going to use the Boston Housing price data. You can find its description at the link below.
https://archive.ics.uci.edu/ml/datasets/Housing
> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
> housing = read.table(url)
> str(housing)
How many variables and rows are there?
506 observations and 14 variables.
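A quick way to confirm this (assuming the data frame loaded as above):
> dim(housing)    # number of rows and columns, here 506 and 14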
Set user-friendly names for the columns
> colnames(housing) = c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
Find correlation among the variables
> cor(housing)
Reduce decimal places in the correlation matrix
> round(cor(housing), 2)
Find answers to the following questions (a short code sketch follows the list)
Which variables are correlated with MEDV with an absolute correlation coefficient > 0.5?
Which variables are positively correlated with MEDV and which are negatively correlated?
Which pairs of variables are strongly correlated (absolute correlation coefficient close to 0.80)?
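One way to tackle the first two questions is to pull out the MEDV column of the correlation matrix and filter or sort it; this is only a sketch, and the 0.5 cut-off comes from the question above.
> medv.cor = cor(housing)[, "MEDV"]
> round(medv.cor[abs(medv.cor) > 0.5 & names(medv.cor) != "MEDV"], 2)   # |r| > 0.5, excluding MEDV itself
> sort(round(medv.cor, 2))    # the sign separates positive from negative correlations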
Find the confidence interval for a pair of variables
> cor.test(housing$MEDV, housing$LSTAT)
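cor.test() prints the estimate, the 95% confidence interval, and the p-value. You can also store the result and pick out the pieces you need (a small sketch using the same pair of variables):
> ct = cor.test(housing$MEDV, housing$LSTAT)
> ct$estimate    # sample correlation coefficient
> ct$conf.int    # 95% confidence interval
> ct$p.value     # p-value of the test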
Find the statistical significance of the correlations. For that you can use the Hmisc package.
> require(Hmisc)
> rcorr(as.matrix(housing))
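rcorr() returns a list with the correlation matrix (r), the number of observations (n), and the p-values (P). A minimal sketch for inspecting each piece:
> res = rcorr(as.matrix(housing))
> round(res$r, 2)    # correlation coefficients
> round(res$P, 4)    # p-value for each pair (NA on the diagonal)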
The p-value measures the statistical significance of a correlation. Hypothesis tests use the p-value to weigh the strength of the evidence (what the data are telling you about the population). It is a number between 0 and 1. Interpret it as follows (a worked check comes after the list):
A small p-value (< 0.05) indicates strong evidence against the null hypothesis, so you reject it.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
A marginal p-value (p ≈ 0.05) is inconclusive.
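As a quick worked check, apply the rule to the MEDV/LSTAT test from above (0.05 is the conventional significance level):
> ct = cor.test(housing$MEDV, housing$LSTAT)
> ct$p.value < 0.05    # TRUE means the correlation is significant at the 5% level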
Exercise: find the correlations in the Swiss fertility dataset.
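A possible starting point (the swiss data frame ships with base R; the rest of the analysis follows the same steps used above):
> data(swiss)
> str(swiss)
> round(cor(swiss), 2)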