10 Correlation Analysis
For illustration, we are going to use the Boston Housing price data. You can find the description in the link below.
How many variables and rows are there?
506 observations and 14 variables.
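The counts above can be checked with a few base R calls. This is a minimal sketch that assumes the data is the `Boston` data frame shipped with the MASS package; the original loading step is not shown, so both that source and the object name `housing` used for it here are assumptions.

```r
# Assumption: the Boston Housing data comes from the MASS package
library(MASS)
housing <- Boston

dim(housing)    # number of rows, then number of columns
nrow(housing)   # observations
ncol(housing)   # variables
str(housing)    # type and first few values of each column
```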
Set user-friendly names for the columns
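One possible way to rename the columns is sketched below. It assumes the data comes from `MASS::Boston`, whose column names are lower case, and simply upper-cases them so that the response variable is called `MEDV` as in the rest of this section. The naming scheme is a hypothetical choice; any character vector of 14 names can be assigned the same way.

```r
# Assumption: the Boston Housing data comes from the MASS package
library(MASS)
housing <- Boston

# Hypothetical naming scheme: upper-case the existing column names
names(housing) <- toupper(names(housing))
names(housing)
```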
Find the correlations among the variables
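A minimal sketch of this step, assuming `housing` is built from `MASS::Boston` with upper-cased column names (both are assumptions, since the loading code is not shown). The base R function `cor()` returns the Pearson correlation for every pair of columns.

```r
# Assumption: the Boston Housing data comes from the MASS package
library(MASS)
housing <- Boston
names(housing) <- toupper(names(housing))

# Pearson correlation for every pair of columns (the default method)
corr_matrix <- cor(housing)

dim(corr_matrix)          # one row and one column per variable
corr_matrix["MEDV", ]     # correlations of each variable with MEDV
```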
Reduce decimal places in the correlation matrix
> round(cor(housing), 2)
Find answers to the following questions
Which variables have a correlation with MEDV whose absolute value is greater than 0.5?
Which variables are positively correlated with MEDV and which are negatively correlated?
Which pairs of variables are strongly correlated (absolute value of the correlation coefficient close to 0.80 or higher)?
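The three questions above can be answered by filtering the correlation matrix. The sketch below assumes `housing` is `MASS::Boston` with upper-cased column names, and it uses 0.75 as a hypothetical cutoff for "close to 0.80"; adjust the cutoff as you see fit.

```r
# Assumption: the Boston Housing data comes from the MASS package
library(MASS)
housing <- Boston
names(housing) <- toupper(names(housing))

corr_matrix <- cor(housing)

# Correlation of every other variable with MEDV
medv_corr <- corr_matrix["MEDV", colnames(corr_matrix) != "MEDV"]

# Question 1: variables with |r| > 0.5 against MEDV
strong_with_medv <- names(medv_corr)[abs(medv_corr) > 0.5]

# Question 2: split the variables by the sign of their correlation with MEDV
positive <- names(medv_corr)[medv_corr > 0]
negative <- names(medv_corr)[medv_corr < 0]

# Question 3: pairs with |r| >= 0.75 (hypothetical cutoff);
# only the upper triangle is scanned so each pair appears once
idx <- which(abs(corr_matrix) >= 0.75 & upper.tri(corr_matrix), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr_matrix)[idx[, "row"]],
  var2 = colnames(corr_matrix)[idx[, "col"]],
  r    = round(corr_matrix[idx], 2)
)

strong_with_medv
positive
negative
strong_pairs
```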
Find the confidence interval of the correlation for a pair of variables
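Base R's `cor.test()` reports a confidence interval along with the estimate. The sketch below assumes `housing` is `MASS::Boston` with upper-cased column names; the pair (`RM`, `MEDV`) is only an illustrative choice.

```r
# Assumption: the Boston Housing data comes from the MASS package
library(MASS)
housing <- Boston
names(housing) <- toupper(names(housing))

# Pearson correlation test for one pair of variables
ct <- cor.test(housing$RM, housing$MEDV, conf.level = 0.95)

ct$estimate   # sample correlation coefficient
ct$conf.int   # lower and upper bound of the 95% confidence interval
```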
Find the statistical significance of the correlations. For that you can use the Hmisc package.
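A minimal sketch using `rcorr()` from Hmisc, which returns the correlation matrix together with a matrix of p-values. It assumes `housing` is `MASS::Boston` with upper-cased column names, and it guards the call in case Hmisc is not installed.

```r
# Assumption: the Boston Housing data comes from the MASS package
library(MASS)
housing <- Boston
names(housing) <- toupper(names(housing))

# Hmisc is not part of base R; run install.packages("Hmisc") if it is missing
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # rcorr() expects a numeric matrix and returns a list with r, n and P
  res <- Hmisc::rcorr(as.matrix(housing), type = "pearson")
  print(round(res$r, 2))    # correlation coefficients
  print(signif(res$P, 3))   # p-values; the diagonal is NA
}
```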
A p-value is the probability, assuming the null hypothesis is true (here, that the true correlation is zero), of observing a correlation at least as extreme as the one found in the sample. Hypothesis tests use the p-value to weigh the strength of the evidence (what the data are telling you about the population). It is a number between 0 and 1, interpreted as follows:
A small p-value (< 0.05) indicates strong evidence against the null hypothesis, so you reject it.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
A p-value close to 0.05 is marginal and not conclusive.
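The decision rule above can be written out for a single pair of variables. This is a sketch under the same assumptions as before (`housing` built from `MASS::Boston` with upper-cased names); the pair (`LSTAT`, `MEDV`) and the 0.05 threshold are illustrative choices.

```r
# Assumption: the Boston Housing data comes from the MASS package
library(MASS)
housing <- Boston
names(housing) <- toupper(names(housing))

alpha <- 0.05   # significance level used in the text

# Null hypothesis: the true correlation between LSTAT and MEDV is zero
p_value <- cor.test(housing$LSTAT, housing$MEDV)$p.value

decision <- if (p_value < alpha) {
  "reject the null hypothesis"
} else {
  "fail to reject the null hypothesis"
}

p_value
decision
```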
Exercise: find the correlations among the variables in the Swiss fertility dataset.