Definition of the multiple linear regression model
Motivation for multiple regression
Incorporate more explanatory factors into the model
Explicitly hold fixed other factors that otherwise would be in the error term
Allow for more flexible functional forms
Example: Wage equation
Interpretation of the multiple regression model
The multiple linear regression model manages to hold the values of other explanatory variables fixed even if, in reality, they are correlated with the explanatory variable under consideration
"Ceteris paribus" interpretation
It must still be assumed that unobserved factors do not change when the explanatory variables are changed
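The ceteris paribus idea can be illustrated with simulated data. The sketch below is not part of the lecture material: the data-generating process, the coefficient values (2 and 3) and all variable names are assumptions chosen for illustration. It compares a multiple regression with a simple regression when the two regressors are correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: x1 and x2 share a common component, so they are correlated.
common = rng.normal(size=n)
x1 = common + rng.normal(size=n)
x2 = common + rng.normal(size=n)
u = rng.normal(size=n)              # unobserved factors, independent of x1 and x2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + u   # assumed true partial effect of x1 is 2

# Multiple regression of y on a constant, x1 and x2.
X_multiple = np.column_stack([np.ones(n), x1, x2])
beta_multiple = np.linalg.lstsq(X_multiple, y, rcond=None)[0]

# Simple regression of y on a constant and x1 only.
X_simple = np.column_stack([np.ones(n), x1])
beta_simple = np.linalg.lstsq(X_simple, y, rcond=None)[0]

print("slope on x1, multiple regression:", beta_multiple[1])
print("slope on x1, simple regression:  ", beta_simple[1])
```

In this setup the multiple regression slope should lie close to the assumed partial effect, while the simple regression slope also picks up part of the effect of x2.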
Example: Determinants of college GPA
Holding ACT fixed, another point of high school grade point average is associated with another .453 points of college grade point average
Or: If we compare two students with the same ACT, but the hsGPA of student A is one point higher, we predict student A to have a colGPA that is .453 higher than that of student B
Holding high school grade point average fixed, another 10 points on ACT are associated with less than one point on college GPA
Standard assumptions for the multiple regression model
Assumption MLR.1 (Linear in parameters)
Assumption MLR.2 (Random sampling)
Assumption MLR.3 (No perfect collinearity)
Remarks on MLR.3
The assumption only rules out perfect collinearity/correlation between explanatory variables; imperfect correlation is allowed
If an explanatory variable is a perfect linear combination of other explanatory variables it is superfluous and may be eliminated
Constant variables are also ruled out (collinear with intercept)
Example for perfect collinearity: small sample
Example for perfect collinearity: relationships between regressors
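Both kinds of perfect collinearity can be checked numerically through the rank of the design matrix. The following is a minimal sketch with made-up regressors (x1, x2, x3 are hypothetical names); it only relies on `numpy.linalg.matrix_rank`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2                      # exact linear combination of x1 and x2

const = np.ones(n)
X_ok = np.column_stack([const, x1, x2])                  # no perfect collinearity
X_combo = np.column_stack([const, x1, x2, x3])           # x3 is redundant
X_const = np.column_stack([const, x1, np.full(n, 5.0)])  # constant regressor
X_small = X_ok[:2]                                       # only 2 observations, 3 parameters

# MLR.3 requires full column rank: rank must equal the number of columns.
for name, X in [("ok", X_ok), ("combo", X_combo),
                ("const", X_const), ("small", X_small)]:
    rank = np.linalg.matrix_rank(X)
    print(f"{name}: rank {rank}, columns {X.shape[1]}, MLR.3 holds: {rank == X.shape[1]}")
```

Whenever the rank falls short of the number of columns, the OLS normal equations have no unique solution.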
Assumption MLR.4 (Zero conditional mean)
In a multiple regression model, the zero conditional mean assumption is much more likely to hold because fewer things end up in the error
Example: Average test scores
Discussion of the zero conditional mean assumption
Explanatory variables that are correlated with the error term are called endogenous; endogeneity is a violation of assumption MLR.4
Explanatory variables that are uncorrelated with the error term are called exogenous; MLR.4 holds if all explanatory variables are exogenous
Exogeneity is the key assumption for a causal interpretation of the regression, and for unbiasedness of the OLS estimators
Theorem 3.1 (Unbiasedness of OLS)
Unbiasedness is an average property in repeated samples; in a given sample, the estimates may still be far away from the true values
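A small Monte Carlo experiment makes the "average property" concrete. This is a sketch under assumed values: the true coefficients, the sample size of 30 and the number of replications are all hypothetical choices, and the errors are generated so that the zero conditional mean assumption holds by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
true_beta = np.array([1.0, 0.5, -0.3])   # hypothetical intercept and two slopes
n, replications = 30, 2000

estimates = np.empty((replications, 3))
for r in range(replications):
    x = rng.normal(size=(n, 2))
    u = rng.normal(size=n)                # E(u | x) = 0 by construction
    X = np.column_stack([np.ones(n), x])
    y = X @ true_beta + u
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

mean_estimates = estimates.mean(axis=0)   # average over repeated samples
spread = estimates.std(axis=0)            # variation across single samples

print("true values:       ", true_beta)
print("mean of estimates: ", mean_estimates)
print("std. of estimates: ", spread)
```

The averages should sit near the assumed true values, while the standard deviations show how far an estimate from one particular sample can be from them.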
Including irrelevant variables in a regression model
Omitting relevant variables: the simple case
Conclusion: All estimated coefficients will be biased
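For the simple case with one included and one omitted regressor, the slope of the short regression equals the long-regression slope plus the omitted variable's coefficient times the slope from regressing the omitted variable on the included one. The sketch below checks this in-sample identity on simulated data; the numbers and the helper `ols` are assumptions made for illustration.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients for a design matrix X that already contains a constant."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)          # omitted variable, correlated with x1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

const = np.ones(n)
b_long = ols(np.column_stack([const, x1, x2]), y)   # y on x1 and x2
b_short = ols(np.column_stack([const, x1]), y)      # y on x1 only (x2 omitted)
delta = ols(np.column_stack([const, x1]), x2)       # x2 on x1

# Short-regression slope implied by the long regression and the auxiliary regression.
implied_short_slope = b_long[1] + b_long[2] * delta[1]

print("short regression slope:", b_short[1])
print("implied by identity:   ", implied_short_slope)
print("long regression slope: ", b_long[1])
```

Because x2 has a positive effect and is positively related to x1 in this setup, the short regression overstates the effect of x1.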
Assumption MLR.5 (Homoscedasticity)
Example: Wage equation
Short hand notation
Assumption MLR.6 (Normality of error terms)
Theorem 3.2 (Sampling variances of OLS slope estimators)
An example for multicollinearity
Discussion of the multicollinearity problem
In the above example, it would probably be better to lump all expenditure categories together because effects cannot be disentangled
In other cases, dropping some independent variables may reduce multicollinearity (but this may lead to omitted variable bias)
Only the sampling variance of the variables involved in multicollinearity will be inflated; the estimates of other effects may be very precise
Note that multicollinearity is not a violation of MLR.3 in the strict sense
Multicollinearity may be detected through "variance inflation factors"
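A variance inflation factor for regressor j is commonly defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on all other regressors. The function below is a minimal sketch of that definition using plain NumPy; the function name and the simulated regressors are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2) for each column of X.

    X holds the regressors without a constant; each column is regressed
    on a constant and all remaining columns.
    """
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ssr = np.sum((target - fitted) ** 2)
        sst = np.sum((target - target.mean()) ** 2)
        vifs[j] = sst / ssr            # SST / SSR equals 1 / (1 - R_j^2)
    return vifs

rng = np.random.default_rng(4)
n = 1000
z = rng.normal(size=n)
x1 = z + 0.1 * rng.normal(size=n)      # x1 and x2 are almost the same variable
x2 = z + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)                # unrelated to x1 and x2

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print("VIFs:", vifs)
```

Only the two nearly identical regressors should show large factors; the unrelated regressor should stay close to one, in line with the remark that other effects may still be estimated precisely.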
Estimating the error variance
Theorem 3.3 (Unbiased estimator of the error variance)
Efficiency of OLS: The Gauss-Markov Theorem
Under assumptions MLR.1 - MLR.5, OLS is unbiased
However, under these assumptions there may be many other estimators that are unbiased
Which one is the unbiased estimator with the smallest variance?
In order to answer this question one usually limits oneself to linear estimators, i.e. estimators linear in the dependent variable
Theorem 3.4 (Gauss-Markov Theorem)
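Under assumptions MLR.1 - MLR.5, the OLS estimators are the best linear unbiased estimators (BLUEs) of the regression coefficients, i.e. among all estimators that are linear in the dependent variable and unbiased, the OLS estimators have the smallest variance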
OLS is only the best estimator if MLR.1 – MLR.5 hold; if there is heteroscedasticity for example, there are better estimators.
Estimation of the sampling variances of the OLS estimators
Note that these formulas are only valid under assumptions MLR.1-MLR.5 (in particular, there has to be homoscedasticity)
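The estimated error variance and the standard errors can be computed directly from the data. The sketch below assumes homoscedasticity, as the note above requires; the function name `ols_with_se` and the simulated data are hypothetical. It also compares the matrix-based standard error with the form sigma^2 / (SST_j (1 - R_j^2)).

```python
import numpy as np

def ols_with_se(X, y):
    """OLS estimates, error-variance estimate and standard errors.

    X must contain a column of ones; the formulas assume MLR.1 to MLR.5.
    """
    n, p = X.shape                        # p = k + 1 estimated parameters
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)      # SSR / (n - k - 1)
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, sigma2, se

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.2 * x2 + rng.normal(size=n)

const = np.ones(n)
beta, sigma2, se = ols_with_se(np.column_stack([const, x1, x2]), y)

# Same standard error for x1 via SST_1 and the R-squared of x1 on the other regressor.
aux = np.column_stack([const, x2])
resid_1 = x1 - aux @ np.linalg.lstsq(aux, x1, rcond=None)[0]
sst_1 = np.sum((x1 - x1.mean()) ** 2)
r2_1 = 1 - resid_1 @ resid_1 / sst_1
se_x1_formula = np.sqrt(sigma2 / (sst_1 * (1 - r2_1)))

print("se(x1), matrix form:", se[1])
print("se(x1), SST/R2 form:", se_x1_formula)
```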
Theorem 4.1 (Normal sampling distributions)
Testing hypotheses about a single population parameter
Theorem 4.2 (t-distribution for standardized estimators)
Null hypothesis (for more general hypotheses, see below)
t-statistic (or t-ratio)
Distribution of the t-statistic if the null hypothesis is true
Goal: Define a rejection rule so that, if it is true, H0 is rejected only with a small probability (= significance level, e.g. 5%)
Testing against one-sided alternatives (greater than zero)
Example: Wage equation
Test whether, after controlling for education and tenure, higher work experience leads to higher hourly wages
Example: Wage equation (cont.)
Testing against one-sided alternatives (less than zero)
Example: Student performance and school size
Test whether smaller school size leads to better student performance
Example: Student performance and school size (cont.)
Alternative specification of functional form:
Example: Student performance and school size (cont.)
Testing against two-sided alternatives
Example: Determinants of college GPA
"Statistically significant" variables in a regression
If a regression coefficient is significantly different from zero in a two-sided test, the corresponding variable is said to be "statistically significant"
If the number of degrees of freedom is large enough so that the normal approximation applies, the following rules of thumb apply:
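The thresholds behind such rules of thumb are the two-sided critical values of the standard normal distribution (roughly 1.645, 1.96 and 2.576 for the 10%, 5% and 1% levels). A short sketch using only the Python standard library reproduces them; the variable names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# Two-sided test: reject if |t| exceeds the (1 - level/2) quantile.
critical_values = {
    level: std_normal.inv_cdf(1 - level / 2)
    for level in (0.10, 0.05, 0.01)
}

for level, c in critical_values.items():
    print(f"significant at the {level:.0%} level if |t| > {c:.3f}")
```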
Guidelines for discussing economic and statistical significance
If a variable is statistically significant, discuss the magnitude of the coefficient to get an idea of its economic or practical importance
The fact that a coefficient is statistically significant does not necessarily mean it is economically or practically significant!
If a variable is statistically and economically important but has the "wrong" sign, the regression model might be misspecified
If a variable is statistically insignificant at the usual levels (10%, 5%, 1%), one may think of dropping it from the regression
If the sample size is small, effects might be imprecisely estimated so that the case for dropping insignificant variables is less strong
Testing more general hypotheses about a regression coefficient
Null hypothesis
The test works exactly as before, except that the hypothesized value is subtracted from the estimate when forming the statistic
Example: Campus crime and enrollment
An interesting hypothesis is whether crime increases by one percent if enrollment is increased by one percent
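A test of this kind can be sketched as follows. The estimate, standard error and degrees of freedom below are made-up numbers, not the results of the campus crime regression, and the one-sided alternative (elasticity greater than one) is an assumption of this sketch. The critical value comes from `scipy.stats.t.ppf`.

```python
from scipy import stats

def t_statistic(estimate, std_error, hypothesized=0.0):
    """t-statistic for H0: coefficient equals the hypothesized value."""
    return (estimate - hypothesized) / std_error

# Hypothetical regression output, for illustration only.
elasticity_hat = 1.30
se_hat = 0.12
df = 95

t_stat = t_statistic(elasticity_hat, se_hat, hypothesized=1.0)
critical = stats.t.ppf(0.95, df)     # one-sided test at the 5% level
reject = t_stat > critical

print(f"t = {t_stat:.3f}, critical value = {critical:.3f}, reject H0: {reject}")
```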
Computing p-values for t-tests
If the significance level is made smaller and smaller, there will be a point where the null hypothesis cannot be rejected anymore
The reason is that lowering the significance level means being less and less willing to make the error of rejecting a correct H0
The smallest significance level at which the null hypothesis is still rejected is called the p-value of the hypothesis test
A small p-value is evidence against the null hypothesis because one would reject the null hypothesis even at small significance levels
A large p-value means that the data provide little evidence against the null hypothesis
P-values are more informative than tests at fixed significance levels
How the p-value is computed (here: two-sided test)
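For a two-sided test, the p-value is the probability that a t-distributed variable exceeds the observed statistic in absolute value. A minimal sketch using `scipy.stats.t.sf` (the survival function, 1 minus the CDF) follows; the t-statistic of 1.85 and the 40 degrees of freedom are hypothetical.

```python
from scipy import stats

def two_sided_p_value(t_stat, df):
    """P(|T| > |t_stat|) for T following a t distribution with df degrees of freedom."""
    return 2 * stats.t.sf(abs(t_stat), df)

# Hypothetical t-statistic and degrees of freedom.
p_example = two_sided_p_value(1.85, 40)
print(f"p-value: {p_example:.4f}")

# Decision at fixed levels follows from comparing the p-value with the level.
for level in (0.10, 0.05, 0.01):
    print(f"reject at {level:.0%}: {p_example < level}")
```

Reporting the p-value lets the reader apply any significance level, which is the sense in which it is more informative than a single fixed-level test.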
Confidence intervals
Simple manipulation of the result in Theorem 4.2 implies that
Interpretation of the confidence interval
The bounds of the interval are random
In repeated samples, the interval that is constructed in the above way will cover the population regression coefficient in 95% of the cases
Confidence intervals for typical confidence levels
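A confidence interval takes the form estimate plus or minus a critical value times the standard error, with the critical value taken from the t distribution. The sketch below computes intervals for the usual 90%, 95% and 99% levels; the estimate, standard error and degrees of freedom are hypothetical.

```python
from scipy import stats

def confidence_interval(estimate, std_error, df, level=0.95):
    """Two-sided confidence interval based on the t distribution."""
    c = stats.t.ppf(1 - (1 - level) / 2, df)   # e.g. the 97.5% quantile for level 0.95
    return estimate - c * std_error, estimate + c * std_error

# Hypothetical regression output, for illustration only.
estimate, std_error, df = 0.80, 0.25, 28

intervals = {
    level: confidence_interval(estimate, std_error, df, level)
    for level in (0.90, 0.95, 0.99)
}
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} interval: [{lower:.3f}, {upper:.3f}]")
```

A higher confidence level widens the interval; a hypothesized value lying outside the 95% interval would be rejected by a two-sided test at the 5% level.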
Relationship between confidence intervals and hypothesis tests
Example: Model of firms' R&D expenditures
Testing hypotheses about a linear combination of parameters
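Example: Return to education at 2-year vs. 4-year colleges
Impossible to compute with standard regression output because the standard error of the difference depends on the covariance between the two coefficient estimates, which is usually not reported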
Alternative method
Estimation results
This method always works for single linear hypotheses
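The reparametrization idea can be checked numerically. To test whether two coefficients are equal, define theta as their difference and rewrite the model so that theta is the coefficient on the first regressor and the sum of both regressors replaces the second. The sketch below uses simulated data and hypothetical names; it compares the reparametrized regression with the direct calculation that needs the covariance of the estimates.

```python
import numpy as np

def ols_with_cov(X, y):
    """OLS coefficients and their estimated covariance matrix (homoscedastic errors)."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    cov = (resid @ resid / (n - p)) * xtx_inv
    return beta, cov

rng = np.random.default_rng(7)
n = 400
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 0.6 * x1 + 0.8 * x2 + rng.normal(size=n)
const = np.ones(n)

# Direct route: needs the covariance between the two slope estimates.
b, V = ols_with_cov(np.column_stack([const, x1, x2]), y)
theta_direct = b[1] - b[2]
se_direct = np.sqrt(V[1, 1] + V[2, 2] - 2 * V[1, 2])

# Alternative route: regress y on x1 and (x1 + x2); the x1 coefficient is theta.
b_r, V_r = ols_with_cov(np.column_stack([const, x1, x1 + x2]), y)
theta_reparam = b_r[1]
se_reparam = np.sqrt(V_r[1, 1])

print("theta:", theta_direct, theta_reparam)
print("se:   ", se_direct, se_reparam)
```

Both routes give the same estimate and standard error, but the second one only needs ordinary regression output.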
Testing multiple linear restrictions: The F-test
Testing exclusion restrictions
Estimation of the unrestricted model
Estimation of the restricted model
Test statistic
Rejection rule (Figure 4.7)
Test decision in example
The three variables are "jointly significant"
They were not significant when tested individually
The likely reason is multicollinearity between them
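The F-statistic for exclusion restrictions compares the sum of squared residuals of the restricted and unrestricted models. The sketch below uses simulated data with two highly correlated regressors (the example above involves three variables; two are used here to keep the code short). All names and numbers are assumptions for illustration; the p-value comes from `scipy.stats.f.sf`.

```python
import numpy as np
from scipy import stats

def ssr(X, y):
    """Sum of squared OLS residuals for design matrix X (constant included)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(size=n)
common = rng.normal(size=n)
x2 = common + 0.2 * rng.normal(size=n)   # x2 and x3 are highly correlated
x3 = common + 0.2 * rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.4 * x2 + 0.4 * x3 + rng.normal(size=n)

const = np.ones(n)
X_ur = np.column_stack([const, x1, x2, x3])   # unrestricted model
X_r = np.column_stack([const, x1])            # restricted: x2 and x3 excluded

q = 2                                         # number of restrictions
df_ur = n - X_ur.shape[1]                     # n - k - 1
ssr_ur, ssr_r = ssr(X_ur, y), ssr(X_r, y)

f_stat = ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)
p_value = stats.f.sf(f_stat, q, df_ur)

print(f"F = {f_stat:.2f}, p-value = {p_value:.4g}")
```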
Test of overall significance of a regression
The test of overall significance is reported in most regression packages; the null hypothesis is usually overwhelmingly rejected
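When all slope coefficients are restricted to zero, the F-statistic can be written in terms of the R-squared of the regression as (R^2 / k) / ((1 - R^2) / (n - k - 1)). A minimal sketch with hypothetical input values:

```python
def overall_f_statistic(r_squared, n, k):
    """F-statistic for H0: all k slope coefficients are zero."""
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# Hypothetical values: R-squared of 0.5, 103 observations, 2 regressors.
f_overall = overall_f_statistic(0.5, n=103, k=2)
print("overall F-statistic:", f_overall)
```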
Testing general linear restrictions with the F-test
Example: Test whether house price assessments are rational
Unrestricted regression
Restricted regression
Test statistic
Regression output for the unrestricted regression
The F-test works for general multiple linear hypotheses
For all tests and confidence intervals, validity of assumptions MLR.1 – MLR.6 has been assumed. Tests may be invalid otherwise.
Models with interaction terms
Interaction effects complicate interpretation of parameters
Reparametrization of interaction effects
Advantages of reparametrization
Easy interpretation of all parameters
Standard errors for partial effects at the mean values available
If necessary, interaction may be centered at other interesting values
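The effect of centering can be verified numerically. In the sketch below (simulated data, hypothetical names and coefficient values), the interaction term is built from deviations from the sample means. The coefficient on x1 in the centered regression then equals the partial effect of x1 evaluated at the mean of x2 in the original parametrization.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients for a design matrix X that already contains a constant."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(8)
n = 500
x1 = rng.normal(loc=2.0, size=n)
x2 = rng.normal(loc=5.0, size=n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.2 * x1 * x2 + rng.normal(size=n)
const = np.ones(n)

# Original parametrization: the x1 coefficient is the effect of x1 when x2 = 0.
b = ols(np.column_stack([const, x1, x2, x1 * x2]), y)

# Reparametrization: interaction of deviations from the sample means.
m1, m2 = x1.mean(), x2.mean()
d = ols(np.column_stack([const, x1, x2, (x1 - m1) * (x2 - m2)]), y)

partial_effect_x1_at_mean = b[1] + b[3] * m2

print("x1 coefficient, original model:        ", b[1])
print("x1 coefficient, centered model:        ", d[1])
print("partial effect of x1 at the mean of x2:", partial_effect_x1_at_mean)
```

The interaction coefficient itself is unchanged by centering; only the interpretation of the coefficients on x1 and x2 changes, and their reported standard errors now refer to the partial effects at the means.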
Qualitative Information
Examples: gender, race, industry, region, rating grade, …
A way to incorporate qualitative information is to use dummy variables
They may appear as the dependent or as independent variables
A single dummy independent variable
Dummy variable trap
Estimated wage equation with intercept shift
Does that mean that women are discriminated against?
Not necessarily. Being female may be correlated with other productivity characteristics that have not been controlled for.
Using dummy explanatory variables in equations for log(y)
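When the dependent variable is in logs, a dummy coefficient multiplied by 100 only approximates the percentage difference; the exact figure is 100 * (exp(coefficient) - 1). The sketch below compares the two for a made-up coefficient of -0.30; the function name is hypothetical.

```python
import math

def exact_percentage_effect(dummy_coefficient):
    """Exact percentage difference in y implied by a dummy coefficient in a log(y) equation."""
    return 100.0 * (math.exp(dummy_coefficient) - 1.0)

coefficient = -0.30                      # hypothetical estimate
approximate = 100.0 * coefficient        # approximation, accurate for small coefficients
exact = exact_percentage_effect(coefficient)

print(f"approximate effect: {approximate:.1f}%")
print(f"exact effect:       {exact:.1f}%")
```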
Using dummy variables for multiple categories
1) Define membership in each category by a dummy variable
2) Leave out one category (which becomes the base category)
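The two steps, and the dummy variable trap they avoid, can be sketched with a small made-up categorical variable. The region labels below are hypothetical; the rank check shows why one category has to be left out when the model contains an intercept.

```python
import numpy as np

# Hypothetical categorical variable with four categories.
regions = np.array(["north", "south", "east", "west",
                    "south", "north", "east", "west"])
categories = np.unique(regions)          # sorted unique labels

# Step 1: one dummy per category.
all_dummies = np.column_stack([(regions == c).astype(float) for c in categories])
const = np.ones(len(regions))

# Dummy variable trap: intercept plus all dummies (the dummies sum to the constant).
X_trap = np.column_stack([const, all_dummies])

# Step 2: leave out one category, which becomes the base category.
base_category = categories[0]
X_base = np.column_stack([const, all_dummies[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_base = np.linalg.matrix_rank(X_base)

print("base category:", base_category)
print(f"with all dummies: rank {rank_trap} of {X_trap.shape[1]} columns")
print(f"base left out:    rank {rank_base} of {X_base.shape[1]} columns")
```

With the base category removed, each remaining dummy coefficient measures the difference relative to that base category.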