ML - 01 Linear Regression

> fit1 <- lm(price ~ ., diamonds)
> summary(fit1)

Call:
lm(formula = price ~ ., data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max
-21376.0   -592.4   -183.5    376.4  10694.2

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept)  5753.762    396.630   14.507  < 2e-16 ***
carat       11256.978     48.628  231.494  < 2e-16 ***
cut.L         584.457     22.478   26.001  < 2e-16 ***
cut.Q        -301.908     17.994  -16.778  < 2e-16 ***
cut.C         148.035     15.483    9.561  < 2e-16 ***
cut^4         -20.794     12.377   -1.680  0.09294 .
color.L     -1952.160     17.342 -112.570  < 2e-16 ***
color.Q      -672.054     15.777  -42.597  < 2e-16 ***
color.C      -165.283     14.725  -11.225  < 2e-16 ***
color^4        38.195     13.527    2.824  0.00475 **
color^5       -95.793     12.776   -7.498 6.59e-14 ***
color^6       -48.466     11.614   -4.173 3.01e-05 ***
clarity.L    4097.431     30.259  135.414  < 2e-16 ***
clarity.Q   -1925.004     28.227  -68.197  < 2e-16 ***
clarity.C     982.205     24.152   40.668  < 2e-16 ***
clarity^4    -364.918     19.285  -18.922  < 2e-16 ***
clarity^5     233.563     15.752   14.828  < 2e-16 ***
clarity^6       6.883     13.715    0.502  0.61575
clarity^7      90.640     12.103    7.489 7.06e-14 ***
depth         -63.806      4.535  -14.071  < 2e-16 ***
table         -26.474      2.912   -9.092  < 2e-16 ***
x           -1008.261     32.898  -30.648  < 2e-16 ***
y               9.609     19.333    0.497  0.61918
z             -50.119     33.486   -1.497  0.13448
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
F-statistic: 2.688e+04 on 23 and 53916 DF, p-value: < 2.2e-16

Interpretations

Residuals:

The residuals are the differences between the actual values of the variable you're predicting and the values predicted by your regression (y - ŷ). For most regressions you want the residuals to look like a normal distribution when plotted. Residuals that are roughly normal and centered on 0 indicate that when we miss, we miss both short and long of the actual value about equally often, and that large misses are less likely than small ones. In the summary above the median is -183.5 and the minimum (-21376.0) is about twice as far from 0 as the maximum (10694.2), so these residuals are somewhat skewed.

> hist(fit1$residuals, breaks = 50, border = 0, col = "orange", probability = TRUE)
> curve(dnorm(x, mean = mean(fit1$residuals), sd = sd(fit1$residuals)), add = TRUE)

Think of it like a dartboard. A good model is going to hit the bullseye some of the time (but not every time). When it doesn't hit the bullseye, it's missing in all of the other buckets evenly (i.e. not just missing in the 16 bin) and it also misses closer to the bullseye as opposed to on the outer edges of the dartboard.
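As a numeric complement to the histogram, here is a minimal sketch of the residual checks described above. It assumes the built-in mtcars data as a stand-in so that it runs without the diamonds data; fit_demo is a hypothetical name, and the same calls should apply to fit1.

```r
# Stand-in model on built-in data; fit_demo is a hypothetical name.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals are actual minus predicted values (y - y_hat).
manual_resid <- mtcars$mpg - fitted(fit_demo)
same_resid <- isTRUE(all.equal(unname(manual_resid), unname(residuals(fit_demo))))

# With an intercept in the model, the residual mean is zero up to rounding.
resid_mean <- mean(residuals(fit_demo))

# Min, quartiles and max, as in the "Residuals:" block of summary().
resid_quantiles <- quantile(residuals(fit_demo))

print(same_resid)
print(resid_mean)
print(resid_quantiles)
```

A median far from 0, or a minimum and maximum of very different size, is a hint that the residuals are not symmetric.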

Significance Stars

The stars are shorthand for significance levels, with the number of asterisks displayed according to the p-value computed: *** for high significance and * for low significance. In this case, *** on carat indicates that it's unlikely that no relationship exists between the carat of a diamond and its price.

Estimated Coefficients

The estimated coefficient is the value of the slope calculated by the regression. It might seem a little confusing that the Intercept also has a value, but just think of it as a slope that is always multiplied by 1. The size of each coefficient depends on the scale of the variable it belongs to, so it's always good to spot check this number to make sure it seems reasonable. The .L, .Q, .C and ^4 suffixes appear because cut, color and clarity are ordered factors, which R encodes by default with polynomial contrasts (linear, quadratic, cubic, and so on).
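The "slope that is always multiplied by 1" idea can be sketched with the design matrix. This example assumes the built-in mtcars data as a stand-in, and fit_demo is a hypothetical name.

```r
# Stand-in model on built-in data; fit_demo is a hypothetical name.
fit_demo <- lm(mpg ~ wt, data = mtcars)

# The design matrix has a column of 1s for the intercept plus one column per predictor.
X <- model.matrix(fit_demo)
intercept_column_is_ones <- all(X[, "(Intercept)"] == 1)

# Predictions are X %*% coefficients, so the intercept is multiplied by 1 in every row.
manual_pred <- as.vector(X %*% coef(fit_demo))
matches_fitted <- isTRUE(all.equal(manual_pred, unname(fitted(fit_demo))))

print(intercept_column_is_ones)
print(matches_fitted)
```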

Standard Error of the Coefficient Estimate

Measure of the variability in the estimate for the coefficient. Lower means better but this number is relative to the value of the coefficient. As a rule of thumb, you'd like this value to be at least an order of magnitude less than the coefficient estimate.

In our model, for carat, Estimated Slope = 11256.978 and Std Error = 48.628. So, t-statistic = Estimated Slope / Std Error = 11256.978 / 48.628 = 231.49, which matches the t value in the summary up to rounding.
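The same division can be checked for several rows at once. This is a small sketch that uses estimates and standard errors copied by hand from the summary(fit1) output above; because those printed values are rounded, the results should agree with the t value column only to about two decimals.

```r
# Estimates and standard errors copied from the summary(fit1) output above.
estimate  <- c(carat = 11256.978, depth = -63.806, y = 9.609)
std_error <- c(carat = 48.628,    depth = 4.535,   y = 19.333)

# t value = estimate / standard error
t_value <- estimate / std_error
print(round(t_value, 3))
```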

t-value or t-statistics

Score that measures whether or not the coefficient for this variable is meaningful for the model. You probably won't use this value itself, but know that it is used to calculate the p-value and the significance levels.

You can calculate the p-value from the t-statistic and the degrees of freedom.

http://www.socscistatistics.com/pvalues/tdistribution.aspx
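The same calculation can be done in R with pt(), the cumulative distribution function of the t distribution. This is a minimal sketch using t values and the residual degrees of freedom copied from the summary(fit1) output above; a two-sided test is assumed, which is what the Pr(>|t|) column reports.

```r
# t values and residual degrees of freedom copied from the summary(fit1) output above.
t_value <- c("cut^4" = -1.680, "color^4" = 2.824, "clarity^6" = 0.502)
df_resid <- 53916

# Two-sided p-value: probability of a |t| at least this large if the true coefficient is 0.
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)
print(signif(p_value, 3))
```

The results should land close to the Pr(>|t|) values printed for these rows; small differences come from the t values being rounded to three decimals.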

Significance Legend

The more punctuation there is next to your variables, the better.

Blank = bad (p > 0.1), Dot = pretty good (p < 0.1), Star = good (p < 0.05), More Stars = very good (p < 0.01 or p < 0.001)
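The legend can be reproduced by binning p-values at the cutpoints listed in the "Signif. codes" line. This is a sketch using base R's cut(), not necessarily how R builds the column internally; the p-values are copied from the summary(fit1) output above.

```r
# p-values copied from the summary(fit1) output above.
p_value <- c("color^5" = 6.59e-14, "color^4" = 0.00475,
             "cut^4" = 0.09294, "clarity^6" = 0.61575)

# Same cutpoints as the "Signif. codes" legend: 0, 0.001, 0.01, 0.05, 0.1, 1.
stars <- cut(p_value,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)
stars <- setNames(as.character(stars), names(p_value))
print(stars)
```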

Residual Std Error

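The Residual Std Error is essentially the standard deviation of your residuals (it divides by the residual degrees of freedom rather than n - 1). You'd like this number to be consistent with the quartiles of the residuals in the Residuals section above. For a normal distribution, the 1st and 3rd quartiles should sit at about ±0.67 times the std error, roughly ±762 here. The observed quartiles (-592.4 and 376.4) are closer to 0, which suggests heavier tails than a normal distribution.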

> sd(fit1$residuals)
[1] 1129.853
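The small gap between sd(fit1$residuals) and the reported 1130 comes from the divisor. The sketch below shows the calculation on the built-in mtcars data as a stand-in (fit_demo is a hypothetical name), and then the quartiles a normal distribution with the diamonds std error of 1130 would have.

```r
# Stand-in model on built-in data; fit_demo is a hypothetical name.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Residual standard error: sqrt(RSS / residual degrees of freedom).
rss <- sum(residuals(fit_demo)^2)
rse_manual <- sqrt(rss / df.residual(fit_demo))
rse_reported <- summary(fit_demo)$sigma

# sd() divides by n - 1 instead, so it is slightly smaller than the RSE.
sd_resid <- sd(residuals(fit_demo))

# Quartiles that normally distributed residuals with sd = 1130 would have.
expected_quartiles <- qnorm(c(0.25, 0.75), mean = 0, sd = 1130)

print(c(manual = rse_manual, reported = rse_reported, sd = sd_resid))
print(expected_quartiles)
```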

Degrees of Freedom

The Degrees of Freedom is the difference between the number of observations included in your training sample and the number of variables used in your model (intercept counts as a variable).

Degrees of Freedom (DF) = 53940 (total number of observations) - 24 (number of coefficients: 23 predictor terms plus the intercept) = 53916
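The bookkeeping can be checked in a few lines. The first part uses the counts from the diamonds model above; the second part repeats it on a stand-in model fitted to the built-in mtcars data (fit_demo is a hypothetical name), assuming no coefficients are dropped for collinearity.

```r
# Counts from the diamonds model above: 53940 rows, 23 predictor terms + intercept.
n_obs <- 53940
n_coef <- 23 + 1
df_diamonds <- n_obs - n_coef
print(df_diamonds)

# Same bookkeeping on a stand-in model (fit_demo is a hypothetical name).
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
df_manual <- nobs(fit_demo) - length(coef(fit_demo))
df_matches <- df_manual == df.residual(fit_demo)
print(df_matches)
```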

Variable p-value

The p-value is the probability of seeing a t-statistic at least this extreme if the variable were NOT relevant, that is, if its true coefficient were 0. You want this number to be as small as possible. If the number is really small, R will display it in scientific notation. In our example < 2e-16 means that the odds of seeing such a strong carat effect by chance alone are less than about 1⁄5000000000000000

Under the null hypothesis H0 the slope is 0. For carat, p-value = Probability (|t| ≥ 231.494 given H0) = < .00001 (You can use http://www.socscistatistics.com/pvalues/tdistribution.aspx)

The p-value calculator takes 4 parameters: t-statistic, DF, significance level (usually 0.05), and one/two tailed

R-squared

Metric for evaluating the goodness of fit of your model. Higher is better with 1 being the best. An R2 value close to 1 indicates that the model explains a large portion of the variance in the response variable.

WARNING: While a high R-squared indicates good correlation, correlation does not always imply causation.

SSR = sum((y_pred - y_actual) ^ 2)

SST = sum((y_actual - mean(y_actual)) ^ 2)

R2 = 1 - SSR / SST

> SSR = sum(fit1$residuals ^ 2)
> SST = sum((diamonds$price - mean(diamonds$price)) ^ 2)
> R2 = 1 - SSR / SST
> R2
[1] 0.9197915

R-square adjusted

R-square adjusted is R-squared penalized for the number of parameters in the model. Here n is the number of observations and df is the residual degrees of freedom.

R2_adjusted = 1 - (1 - R2) * (n - 1) / df

> 1 - (1 - R2) * (nrow(diamonds) - 1) / fit1$df.residual

[1] 0.9197573
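To confirm that the manual formulas agree with what summary() reports, here is a minimal sketch on the built-in mtcars data as a stand-in (fit_demo is a hypothetical name); the same comparison should work with fit1.

```r
# Stand-in model on built-in data; fit_demo is a hypothetical name.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit_demo)

# R-squared from the sums of squares.
ssr <- sum(residuals(fit_demo)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - ssr / sst

# Adjusted R-squared: penalize by (n - 1) / residual degrees of freedom.
n <- nobs(fit_demo)
r2_adj_manual <- 1 - (1 - r2_manual) * (n - 1) / df.residual(fit_demo)

print(c(manual = r2_manual, reported = s$r.squared))
print(c(manual = r2_adj_manual, reported = s$adj.r.squared))
```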

F-statistic & resulting p-value

F = (explained variance) / (unexplained variance)

Performs an F-test on the model. This takes the parameters of our model (in our case 23 predictor terms plus the intercept) and compares it to a model that has fewer parameters (here, the intercept-only model). As we add more features, the R2 score improves but that may lead to overfitting. If the model with more parameters (your model) doesn't perform better than the model with fewer parameters, the F-test will have a high p-value (the boost is probably NOT significant). If the model with more parameters is better than the model with fewer parameters, you will have a lower p-value.

When there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1. A large F-statistic suggests that at least some predictors are related to the response variable.
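The F-statistic in the summary can be recovered from R2 and the two degrees-of-freedom values. This is a sketch using numbers copied from the output above (R2 = 0.9197915, 23 and 53916 DF); because R2 is rounded, the result should match the reported 2.688e+04 only to about four significant digits.

```r
# Values copied from the summary(fit1) output and the R2 computed above.
r2 <- 0.9197915
k <- 23            # numerator DF: predictor terms, excluding the intercept
df_resid <- 53916  # denominator DF: residual degrees of freedom

# F = (explained variance per predictor term) / (unexplained variance per residual DF)
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# The upper tail of the F distribution gives the model p-value.
f_p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

print(f_stat)
print(f_p_value)
```

With an F-statistic this far above 1, the p-value is too small to represent in double precision, which is consistent with the "< 2.2e-16" shown in the summary.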

Source:

> help(summary.lm)