# 10 Correlation Analysis

For illustration, we are going to use the Boston Housing price data. You can find its description at the link below.

https://archive.ics.uci.edu/ml/datasets/Housing

``> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"``
``> housing = read.table(url)``
``> str(housing)``

How many variables and rows are there?

506 observations and 14 variables.

Give the columns user-friendly names

``> colnames(housing) = c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE" ,"DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")``

Find the correlations among the variables

``> cor(housing)``

Reduce decimal places in the correlation matrix

``> round(cor(housing), 2)``

Find answers to the following questions

1. Which variables have a correlation with MEDV whose absolute value is greater than 0.5?
2. Which variables are positively correlated with MEDV and which are negatively correlated?
3. Which pairs of variables are strongly correlated (absolute correlation coefficient close to 0.80)?
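Reading these answers off a 14 x 14 matrix by eye is error-prone, so the filtering can also be done in code. The following is a minimal sketch, not part of the original exercise: `strong_with` is a hypothetical helper name, and the built-in `mtcars` data frame is used as a stand-in so that the example runs without downloading anything. Passing `housing` and `"MEDV"` instead should work the same way, assuming all columns are numeric.

```r
# Hypothetical helper: correlations of every column with `target`,
# keeping only those whose absolute value exceeds `cutoff`.
strong_with <- function(df, target, cutoff = 0.5) {
  r <- cor(df)[, target]                # correlations with the target column
  r <- r[names(r) != target]            # drop the target's correlation with itself
  r <- r[abs(r) > cutoff]               # keep only the strong ones
  r[order(abs(r), decreasing = TRUE)]   # strongest first
}

# mtcars is a stand-in data set that ships with R
res <- strong_with(mtcars, "mpg")
round(res, 2)

# Separate the positively and negatively correlated variables
split(names(res), ifelse(res > 0, "positive", "negative"))
```

The sign of each retained coefficient answers the second question, and raising `cutoff` towards 0.8 narrows the list to the strongest relationships with the target.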

Find the confidence interval of the correlation for a pair of variables

``> cor.test(housing$MEDV, housing$LSTAT)``
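`cor.test` prints a full report, but the individual numbers can also be pulled out of the object it returns. The sketch below shows one way to do that; it uses the built-in `mtcars` data as a stand-in for `housing` so that it runs offline, and the variable names `ct`, `r`, `ci` and `p` are only illustrative.

```r
# mtcars stands in for housing; both columns are numeric
ct <- cor.test(mtcars$mpg, mtcars$wt)

r  <- unname(ct$estimate)   # sample Pearson correlation coefficient
ci <- ct$conf.int           # confidence interval (95% by default)
p  <- ct$p.value            # p-value for H0: the true correlation is 0

cat(sprintf("r = %.2f, 95%% CI [%.2f, %.2f], p = %.3g\n", r, ci[1], ci[2], p))
```

A different confidence level can be requested with the `conf.level` argument, for example `cor.test(x, y, conf.level = 0.99)`.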

Find the statistical significance of the correlations between all the variables. For that you can use the Hmisc package.

``> require(Hmisc)``
``> rcorr(as.matrix(housing))``
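If Hmisc is not installed, a matrix of p-values can be approximated in base R by calling `cor.test` on every pair of columns. This is only a sketch under stated assumptions: `cor_pvalues` is a hypothetical helper, it expects a data frame with numeric columns, and it uses `mtcars` as a stand-in data set. Its results should be comparable to the p-values reported by `rcorr` for Pearson correlations, but that has not been checked here.

```r
# Hypothetical helper: matrix of p-values from pairwise cor.test calls
cor_pvalues <- function(df) {
  vars <- colnames(df)
  p <- matrix(NA_real_, length(vars), length(vars),
              dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i < j) {
        # test each pair once and mirror the result across the diagonal
        p[i, j] <- p[j, i] <- cor.test(df[[i]], df[[j]])$p.value
      }
    }
  }
  p   # the diagonal stays NA: a variable is not tested against itself
}

pm <- cor_pvalues(mtcars[, c("mpg", "wt", "hp", "qsec")])
signif(pm, 2)
```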

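The p-value is not a confidence score for the correlation. It is the probability, assuming the null hypothesis is true (here, that the true correlation is zero), of observing a correlation at least as strong as the one in the sample. Hypothesis tests use the p-value to weigh the strength of the evidence (what the data are telling you about the population). It is a number between 0 and 1, and with the conventional 0.05 cutoff it is interpreted like this: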

• A small p-value (< 0.05) indicates strong evidence against the null hypothesis, so you reject it.
• A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
• A p-value at or very near the cutoff (p ≈ 0.05) is marginal and not conclusive.
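To see the decision rule in action, it can help to test data where the answer is known in advance. The sketch below simulates one pair of variables that are related by construction and one pair generated independently. The seed, sample size and noise level are arbitrary choices made for illustration, and `decide` is a hypothetical helper.

```r
set.seed(1)      # arbitrary seed, only for reproducibility
alpha <- 0.05    # conventional significance level

x <- rnorm(100)
y_related   <- x + rnorm(100, sd = 0.5)   # built to depend on x
y_unrelated <- rnorm(100)                 # generated independently of x

p_related   <- cor.test(x, y_related)$p.value
p_unrelated <- cor.test(x, y_unrelated)$p.value

# Apply the cutoff rule to a single p-value
decide <- function(p) if (p < alpha) "reject H0" else "fail to reject H0"

decide(p_related)
decide(p_unrelated)
```

For the independent pair the p-value will usually, but not always, exceed the cutoff: with a 0.05 threshold, roughly one such test in twenty rejects the null hypothesis by chance alone.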

Exercise: find the correlations in the Swiss fertility dataset (available in R as the built-in ``swiss`` data frame).