1.
ASB1114: Regression
2.
The two-variable linear regression model: Estimation
The population (‘true’) regression line:
The linear regression model is used to investigate causal relationships between two (or more) random variables.
For example, let xi = disposable income of household i in 2013, and yi = consumer expenditure of household i in 2013.
We expect to find a positive relationship between xi and yi. We might also expect that xi (disposable income) largely determines yi (consumer expenditure).
3.
The two-variable linear regression model: Estimation
The population (‘true’) regression line:
xi is an independent random variable (or explanatory variable) determined exogenously.
yi is the dependent variable determined endogenously (i.e. ‘explained’, at least in part, by xi)
The population regression line shows, for each value of xi, the mean or average value of the yi’s
associated with that value of xi.
4.
The two-variable linear regression model: Estimation
The population (‘true’) regression line:
We can write:
yi = β1 + β2xi + ui , where:
β1 represents the intercept of the population regression line (the point at which the population regression line intersects the vertical axis).
β2 represents the slope of the population regression line:
If β2 > 0, the population regression line is upward sloping;
If β2 < 0, the population regression line is downward sloping.
5.
The two-variable linear regression model: Estimation
The population (‘true’) regression line:
We can write:
yi = β1 + β2xi + ui , where:
ui is the error term. In practice, the relationship between xi and yi for any particular household is never described exactly by the position of the population regression line. Each individual household is located somewhere either above or below the line (their expenditure is either a bit more or a bit less than the average for households of their income level). The error term allows for this divergence.
We assume E(ui) = 0 and var(ui) = σ2.
6.
The two-variable linear regression model: Estimation
The population (‘true’) regression line:
The population scatter diagram plots the values of xi and yi against one another for all
households in the population.
A regression analysis describes the relationship between xi and yi that is summarised by this
diagram.
ui is represented diagrammatically by the vertical distance between the ‘point’ for the i’th
household, and the population regression line.
7.
Population scatter diagram
[Figure: population scatter diagram with the population regression line E(yi|xi) = β1 + β2xi, intercept β1 and slope β2.]
8.
Population regression line
[Figure: population regression line E(yi|xi) = β1 + β2xi with three sample points (x1, y1), (x2, y2), (x3, y3); the errors u1 < 0, u2 > 0 and u3 > 0 are the vertical distances from each point to the line.]
9.
The two-variable linear regression model: Estimation
The sample (‘estimated’) regression line:
So far, the discussion refers to the population (or ‘true’) regression model:
yi = β1 + β2xi + ui
β1 and β2 are unknown parameters, representing the true relationship between xi and yi.
To identify the values of the unknown parameters β1, β2 and σ2, we would need complete information about every member of the population. However, we can take a random sample of observations of yi and xi in order to obtain estimates of β1, β2 and σ2.
10.
The two-variable linear regression model: Estimation
The sample (‘estimated’) regression line:
If we take a random sample of points from the population scatter diagram, and using just those
points, fit a line through the centre of them, we obtain the sample regression line.
Because the sample is never perfectly representative of the parent population, the sample
regression line never coincides precisely with the population regression line.
We will always overestimate or underestimate 1 and 2 to some degree.
11.
The two-variable linear regression model: Estimation
The sample (‘estimated’) regression line:
Therefore it is important to develop notation to distinguish the ‘true’ parameters (β1 and β2) from their sample estimates, as follows:
Population (‘true’) model: yi = β1 + β2xi + ui (for i = 1, …, N, where N = population size)
Sample (‘estimated’) model: yi = β̂1 + β̂2xi + ei (for i = 1, …, n, where n = sample size)
12.
The two-variable linear regression model: Estimation
The sample (‘estimated’) regression line:
In the sample (estimated) model, β̂1 and β̂2 are the sample estimators of β1 and β2.
ei is the sample estimator of ui. ei is the estimated error term, or residual.
ŷi = β̂1 + β̂2xi are the estimated values or fitted values of the dependent variable.
13.
The two-variable linear regression model: Estimation
The sample (‘estimated’) regression line:
In the population (true) model, the ui’s measure the vertical distances between the points in the population scatter diagram and the population regression line. The ui’s are unknown (because β1 and β2, and therefore the position of the population regression line, are unknown).
In the sample (estimated) model, the ei’s measure the vertical distances between the points selected in the sample and the sample regression line. The ei’s are the sample estimators of the ui’s. The ei’s can be calculated once we have obtained β̂1 and β̂2, the estimates of β1 and β2.
14.
Sample scatter diagram
[Figure: sample scatter diagram showing the sample regression line ŷi = β̂1 + β̂2xi (intercept β̂1, slope β̂2) alongside the population regression line E(yi|xi) = β1 + β2xi.]
15.
Population and sample regression lines
[Figure: population regression line E(yi|xi) = β1 + β2xi and sample regression line ŷi = β̂1 + β̂2xi, with three sample points (x1, y1), (x2, y2), (x3, y3); the residuals e1 < 0, e2 > 0 and e3 > 0 are the vertical distances from each point to the sample regression line.]
16.
Ordinary Least Squares estimation of β1 and β2
How do we use the data in the sample to obtain values for β̂1 and β̂2?
Intuitively, by making the estimated regression line ‘fit’ the collection of points as closely as possible, we make the estimated errors as small as possible. The specific criterion is:
“choose β̂1 and β̂2 to make Σe2i as small as possible”
First, square all of the estimated error terms to make them all positive. Then choose β̂1 and β̂2 to minimise the sum of the squared error terms. This method is known as Ordinary Least Squares (OLS) estimation.
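The least-squares criterion can be checked numerically. The sketch below (not part of the lecture; it uses the income/square-footage sample from the worked example in these slides) computes the sum of squared errors for the closed-form OLS estimates and confirms that perturbing either coefficient away from those values only increases the sum:

```python
# Minimal sketch of the OLS criterion: the OLS coefficients minimise the
# sum of squared estimated errors, sum of (y_i - b1 - b2*x_i)^2.

x = [22, 60, 45, 37, 30, 50, 56, 34, 26, 40]  # income (example data)
y = [16, 30, 26, 24, 22, 21, 32, 18, 21, 20]  # square footage (example data)

def sse(b1, b2):
    """Sum of squared estimated errors for a candidate intercept b1 and slope b2."""
    return sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))

# Closed-form OLS estimates (obtained by minimising sse analytically):
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b2_ols = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
b1_ols = ybar - b2_ols * xbar

# Moving either coefficient away from its OLS value increases the SSE:
assert sse(b1_ols, b2_ols) < sse(b1_ols + 0.5, b2_ols)
assert sse(b1_ols, b2_ols) < sse(b1_ols, b2_ols + 0.05)
```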
17.
Ordinary Least Squares estimation of β1 and β2
The formulae for β̂1 and β̂2 are as follows:
β̂2 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)2   or   β̂2 = [nΣxiyi − ΣxiΣyi] / [nΣx2i − (Σxi)2]
β̂1 = ȳ − β̂2x̄   or   β̂1 = Σyi/n − β̂2Σxi/n
18.
Ordinary Least Squares estimation of β1 and β2
Example:
A property developer is interested in investigating the relationship between annual family income (X, in £’000s) and square footage of their homes (Y, in hundreds of square feet). A random sample of 10 families is selected, with the following results:

Family           1   2   3   4   5   6   7   8   9   10
Income, xi      22  60  45  37  30  50  56  34  26  40
Sq. Footage, yi 16  30  26  24  22  21  32  18  21  20

Estimate the coefficients of the two-variable linear regression model yi = β1 + β2xi + ui.
19.
Ordinary Least Squares estimation of β1 and β2
Solution:
20.
Ordinary Least Squares estimation of β1 and β2
Solution:
x̄ = Σxi/n = 400/10 = 40
ȳ = Σyi/n = 230/10 = 23
β̂2 = [nΣxiyi − ΣxiΣyi] / [nΣx2i − (Σxi)2] = [10 × 9670 − 400 × 230] / [10 × 17446 − 400 × 400] = 0.3250
β̂1 = ȳ − β̂2x̄ = 23 − 0.3250 × 40 = 9.9986
Alternatively, using Σ(xi − x̄)(yi − ȳ) = 470 and Σ(xi − x̄)2 = 1446:
β̂2 = 470/1446 = 0.3250
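The worked solution above can be reproduced in a few lines of Python (a sketch for checking the arithmetic, not lecture material):

```python
# Sketch: OLS coefficient estimates for the income/square-footage example,
# using the n*sum(xy) - sum(x)*sum(y) form of the slope formula.

x = [22, 60, 45, 37, 30, 50, 56, 34, 26, 40]  # income, £'000s
y = [16, 30, 26, 24, 22, 21, 32, 18, 21, 20]  # square footage, 100s sq ft
n = len(x)

sum_x, sum_y = sum(x), sum(y)                  # 400 and 230
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 9670
sum_x2 = sum(xi ** 2 for xi in x)              # 17446

b2 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope, ~0.3250
b1 = sum_y / n - b2 * sum_x / n                                # intercept, ~9.9986
```

The same slope follows from 470/1446 using the deviation-from-mean form of the formula.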
21.
Estimation of σ2 = var(ui), and estimation of var(β̂1) and var(β̂2)
For purposes of statistical inference, it is useful to have a method for estimating σ2 = var(ui), the parameter that measures the dispersion of the points around the population regression line in the population scatter diagram.
Note that ui = yi − β1 − β2xi
σ2 = var(ui) = E(u2i) − [E(ui)]2 = E(u2i), because E(ui) = 0
22.
Estimation of σ2 = var(ui), and estimation of var(β̂1) and var(β̂2)
The following formula is used for estimating σ2 = E(u2i):
σ̂2 = Σe2i / (n − 2)
The use of n − 2 (rather than n) in the denominator is known as a ‘degrees of freedom’ correction.
The adjustment reflects the loss of 2 degrees of freedom (2 of the n pieces of information in the sample) in obtaining β̂1 and β̂2. There are only n − 2 degrees of freedom (pieces of information) left with which to estimate σ2.
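This estimator can be sketched in Python using the income/square-footage example from this lecture: fit the line, form the residuals, then divide the residual sum of squares by n − 2 (a sketch, not lecture material):

```python
# Sketch: estimating sigma^2 with the degrees-of-freedom correction.

x = [22, 60, 45, 37, 30, 50, 56, 34, 26, 40]
y = [16, 30, 26, 24, 22, 21, 32, 18, 21, 20]
n = len(x)

# OLS estimates, as in the worked example:
xbar, ybar = sum(x) / n, sum(y) / n
b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b1 = ybar - b2 * xbar

# Residuals e_i = y_i - (b1 + b2*x_i), then the variance estimate:
residuals = [yi - (b1 + b2 * xi) for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)   # sum of e_i^2, ~79.2338
sigma2_hat = rss / (n - 2)             # divide by n-2, not n: 2 d.o.f. were
                                       # used up estimating b1 and b2
```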
23.
Estimation of σ2 = var(ui), and estimation of var(β̂1) and var(β̂2)
σ̂2 is an estimated variance. The corresponding estimated standard deviation is known as the standard error of the regression:
σ̂ = √[Σe2i / (n − 2)]
24.
Estimation of σ2 = var(ui), and estimation of var(β̂1) and var(β̂2)
The expression for the estimated variance of β̂2 is:
vâr(β̂2) = σ̂2 / Σ(xi − x̄)2
The corresponding standard deviation is known as the standard error of β̂2:
se(β̂2) = √vâr(β̂2)
N.B. This result is used for statistical inference: using the sample estimate (β̂2) to test a hypothesis about β2. There are equivalent results for β̂1.
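Continuing the income/square-footage example, the estimated variance and standard error of the slope estimator can be sketched as follows (a sketch for checking the arithmetic, not lecture material):

```python
# Sketch: var-hat(b2) = sigma2_hat / sum((x_i - xbar)^2), and its square root.
import math

x = [22, 60, 45, 37, 30, 50, 56, 34, 26, 40]
y = [16, 30, 26, 24, 22, 21, 32, 18, 21, 20]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)   # sum of (x_i - xbar)^2 = 1446
b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b1 = ybar - b2 * xbar

rss = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
sigma2_hat = rss / (n - 2)                # ~9.9042

var_b2 = sigma2_hat / sxx                 # estimated variance of b2, ~0.006849
se_b2 = math.sqrt(var_b2)                 # standard error of b2, ~0.08276
```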
25.
The two-variable linear regression model: Statistical inference
With reference to the two-variable linear regression model, the key problem (as always in statistics) is to decide what the values of the sample estimates (β̂1 and β̂2) allow us to infer about the unknown values of the corresponding true parameters β1 and β2.
To test the null hypothesis H0: β2 = θ (where θ is some numerical value being proposed for β2):
Test statistic: t = (β̂2 − θ) / se(β̂2) ~ t(n − 2) if H0 is true.
As seen in Section 3, the decision rule depends upon whether the test is one-tail or two-tail, and, in the case of the former, on the structure of the alternative hypothesis.
26.
The two-variable linear regression model: Statistical inference
The procedures are as follows (significance level: α = 0.05):
One-tail tests:
(a) Test H0: β2 = θ against H1: β2 > θ
Decision rule: Accept H0 if t ≤ t0.05; Reject H0 if t > t0.05
(b) Test H0: β2 = θ against H1: β2 < θ
Decision rule: Accept H0 if t ≥ −t0.05; Reject H0 if t < −t0.05
27.
The two-variable linear regression model: Statistical inference
The procedures are as follows (significance level: α = 0.05):
Two-tail test:
(c) Test H0: β2 = θ against H1: β2 ≠ θ
Decision rule: Accept H0 if −t0.025 ≤ t ≤ t0.025; Reject H0 if t < −t0.025 or t > t0.025
28.
The two-variable linear regression model: Statistical inference
Comments on hypothesis testing are as follows:
(a) Often, in the context of the regression model, we want to test H0: β2 = 0 against a suitably defined (one- or two-sided) alternative. This tests whether there is any relationship between xi and yi. Acceptance of H0: β2 = 0 implies that there is no relationship. Usually we want to reject H0; otherwise there is no point in trying to explain yi using xi.
(b) The form of the alternative hypothesis usually depends on whether we have any pre-conceived idea as to the direction of the relationship between xi and yi.
If a positive relationship is expected:
Test H0: β2 = 0 against H1: β2 > 0
If a negative relationship is expected:
Test H0: β2 = 0 against H1: β2 < 0
If there is no pre-conceived idea of the direction of the relationship:
Test H0: β2 = 0 against H1: β2 ≠ 0
29.
The two-variable linear regression model: Statistical inference
Comments on hypothesis testing are as follows:
(c) A test of H0: β2 = 0 against H1: β2 ≠ 0 will always produce the same results (test statistic and critical values) as a test of H0: ρ = 0 against H1: ρ ≠ 0, where ρ is the correlation coefficient between xi and yi.
30.
Ordinary Least Squares estimation of β1 and β2
Example:
A property developer is interested in investigating the relationship between annual family income (X, in £’000s) and square footage of their homes (Y, in hundreds of square feet). A random sample of 10 families is selected, with the following results:

Family           1   2   3   4   5   6   7   8   9   10
Income, xi      22  60  45  37  30  50  56  34  26  40
Sq. Footage, yi 16  30  26  24  22  21  32  18  21  20

For the regression model yi = β1 + β2xi + ui, test the null hypothesis H0: β2 = 0 against H1: β2 > 0.
31.
Ordinary Least Squares estimation of β1 and β2
Solution:
32.
Ordinary Least Squares estimation of β1 and β2
Solution:
As before, β̂2 = [nΣxiyi − ΣxiΣyi] / [nΣx2i − (Σxi)2] = [10 × 9670 − 400 × 230] / [10 × 17446 − 400 × 400] = 0.3250
β̂1 = ȳ − β̂2x̄ = 23 − 0.3250 × 40 = 9.9986
σ̂2 = Σe2i / (n − 2) = 79.2338/8 = 9.9042
σ̂ = 3.1471
33.
Ordinary Least Squares estimation of β1 and β2
Solution:
Using Σ(xi − x̄)2 = 1446:
vâr(β̂2) = σ̂2 / Σ(xi − x̄)2 = 9.9042/1446 = 0.006849
se(β̂2) = √0.006849 = 0.08276
34.
Ordinary Least Squares estimation of β1 and β2
Solution:
35.
Ordinary Least Squares estimation of 1and 2
Solution:
where t0.05 is the 5% critical value from the t(8) distribution.
t0.05 from t(8) is 1.8595
t > t0.05
decision is reject H0.
This test is identical to the test of H0: =0 against H1: >0.
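The whole one-tail test can be assembled in Python from the pieces computed in this example. The sketch below (not lecture material) hard-codes the t(8) critical value 1.8595 given on this slide rather than computing it, to stay dependency-free:

```python
# Sketch: one-tail test of H0: beta2 = 0 vs H1: beta2 > 0 for the
# income/square-footage example, at the 5% significance level.
import math

x = [22, 60, 45, 37, 30, 50, 56, 34, 26, 40]
y = [16, 30, 26, 24, 22, 21, 32, 18, 21, 20]
n = len(x)

# OLS fit:
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b1 = ybar - b2 * xbar

# Standard error of b2:
rss = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
se_b2 = math.sqrt((rss / (n - 2)) / sxx)

# Test statistic under H0: beta2 = 0, and the decision:
t_stat = (b2 - 0) / se_b2      # ~3.93
t_crit = 1.8595                # 5% one-tail critical value from t(8) tables

reject_h0 = t_stat > t_crit    # True: reject H0, evidence of a positive slope
```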
36.
MS Excel
To run a regression in Excel, click on the Data tab, then Data Analysis, then Regression.
For this to work, the Analysis ToolPak must be installed (see the Panopto recording of the lecture for details on how to set the following up and run the regression):
Click on ‘File’
Choose ‘Options’
Choose ‘Add-Ins’ and ‘Manage Excel Add-Ins’
Tick ‘Analysis ToolPak’ option
37.
[Figure: Excel regression output.]
38.
Regression analysis: Summary
Linear regression model: used to investigate causal relationships between two or more variables.
Independent/explanatory variable, xi: determined exogenously (i.e. outside the model). The variable thought to be responsible for the change in a model.
Dependent variable, yi: determined endogenously (i.e. explained/partly explained by the independent/explanatory variable). The variable we are seeking to explain by a model.
39.
Regression analysis: Summary
Population (‘true’) regression line: shows the average value of the dependent variable, yi, associated with each possible value of the independent variable, xi. For a two-variable linear regression model, we can express the population regression line as follows:
yi = β1 + β2xi + ui
• β1 represents the intercept of the population regression line, i.e. the point at which the line intersects the vertical axis.
• β2 represents the slope of the population regression line.
• ui represents the error term.
40.
Regression analysis: Summary
Sample (‘estimated’) regression line: we usually take a random sample of n observations of the dependent and independent variable in order to obtain estimates of β1, β2 and σ2. The sample regression line shows the average value of the dependent variable, yi, associated with each value of the independent variable, xi, for the sample of n observations. For a two-variable regression model, we can express the sample regression line as follows:
yi = β̂1 + β̂2xi + ei
• β̂1 represents the sample estimator of β1.
• β̂2 represents the sample estimator of β2.
• ei represents the estimated error term, or residual, and is the sample estimator of ui.
41.
Regression analysis: Summary
Ordinary least squares (OLS) estimation: a method used for obtaining estimates of β1 and β2, using the data obtained from your sample of n observations. The formulas are as follows:
β̂2 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)2   or   β̂2 = [nΣxiyi − ΣxiΣyi] / [nΣx2i − (Σxi)2]
β̂1 = ȳ − β̂2x̄   or   β̂1 = Σyi/n − β̂2Σxi/n
42.
Regression analysis: Summary
Standard error of the regression: an estimated standard deviation. The formula is as follows:
σ̂ = √[Σe2i / (n − 2)]
Estimated variance of β̂2: the formula is as follows:
vâr(β̂2) = σ̂2 / Σ(xi − x̄)2
43.
Regression analysis: Summary
Standard error of β̂2: the square root of the corresponding estimated variance. The formula is as follows:
se(β̂2) = √vâr(β̂2)
Hypothesis testing relating to the two-variable linear regression model: the test statistic is calculated as follows:
t = (β̂2 − θ) / se(β̂2) ~ t(n − 2) if H0 is true
44.
Readings
Curwin, J., Slater, R. and Eadson, D. (2013). Quantitative Methods for Business Decisions, 7th ed. Hampshire: Cengage.
o Chapters 15, 16
Newbold, P., Carlson, W.L. and Thorne, B. (2013). Statistics for Business and Economics, 8th ed. Harlow: Pearson.
o Chapters 11, 12