
MODULE 4 UNIT 3

Correlation and regression

© 2023 LSE
All Rights Reserved
lse.ac.uk

Table of contents
1. Introduction
2. Defining the relationship between variables
3. Regression analysis
3.1 The regression coefficient
3.2 Accuracy of fit: Standard error
3.3 Accuracy of fit: R²
3.4 The p-value
4. Analysing the results of a regression
5. Multivariate regression
5.1 Accuracy of fit
5.2 Policy application
6. Conclusion
7. Bibliography


Learning outcomes:

LO4: Discuss the relevance of regression and correlation to policy problems.

LO5: Use statistical skills to interpret policy data.

1. Introduction
Policy stakeholders need to understand the relationship between different variables when
deciding which policies to support or take forward. In this unit, you will learn how to interpret
one of the most common statistical tests used to measure relationships: regression
analysis. There are many relationships that are important when it comes to deciding on
goals for public policy. For example, the high incidence of diabetes has become a policy
issue of considerable concern in many high-income nations in recent years. It is important
to understand the links between diabetes and various risk factors in order to create targeted
health policy. By establishing the strength of the relationship between different variables
and the disease, policymakers can identify the most impactful use of public funds.

In these notes, you will learn more about the various elements that make up a regression
analysis. This includes the correlation coefficient, standard error, and p-value. This
introduction to regression is intended to provide you with the skills to critically engage with
regression analyses that you are presented with in your role as a policy stakeholder.

2. Defining the relationship between variables


When a researcher uses regression analysis, they seek to understand what impact
changing the independent variable will have on the dependent variable. This is known as
a bivariate regression, as there are two variables being studied. Consider the following
example. A researcher is trying to establish the impact of exercise on the concentration
levels of school children. The independent variable in this study is exercise, as the
researcher wants to see what impact changing exercise will have on concentration (the
dependent variable).

There are many different relationships that could exist between these two variables. These
relationships can be presented visually using a scatterplot, such as those presented in
Figure 1. The independent variable is always listed on the horizontal axis, while the
dependent variable is on the vertical axis.
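If you would like to see how such a plot is produced, the short Python sketch below, using the matplotlib library, draws a scatterplot of this kind. The exercise and concentration figures are invented purely for illustration.

import matplotlib.pyplot as plt

# Hypothetical data: hours of exercise per week (independent variable)
# and a concentration score (dependent variable).
exercise_hours = [0, 1, 2, 3, 4, 5, 6]
concentration = [42, 50, 55, 61, 64, 70, 73]

plt.scatter(exercise_hours, concentration)
plt.xlabel("Exercise (hours per week)")   # independent variable: horizontal axis
plt.ylabel("Concentration score")         # dependent variable: vertical axis
plt.title("Exercise and concentration (illustrative data)")
plt.show()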


Figure 1: Examples of scatterplots.

You can use the pattern of the data spread on the graph to make predictions about the
relationships between the variables. Scatterplot 5, for example, shows that an increase in
exercise leads to an increase in concentration up until a certain level, at which point further
increases in exercise lead to reductions in concentration.

In these notes, you will learn how regression can be used to interpret linear relationships,
such as the one shown in Scatterplot 3. A linear relationship means that a change in the
independent variable will lead to a directly proportional change in the dependent variable,
either positively or negatively. Scatterplot 3 shows a positive relationship between exercise
and concentration, where an increase in exercise leads to an increase in concentration.

Pause and reflect:

Imagine that you are a policymaker in the education department, and you are charged
with designing a new initiative to include more sports training in the school curriculum.
What policy recommendation would you make, based on the relationship in Scatterplot
5 compared to that in Scatterplot 6?

Sometimes the relationship between the variables is not as clear as those shown in
Scatterplots 3 and 5. In some research, the data may look more like Scatterplots 4 or
6, or it may not have a pattern that you recognise. Regression helps researchers to
establish the underlying trends in the data, even when the trends are small. Sometimes
you may want to use non-linear regression methods – for instance, when the data you are
analysing is known to follow a non-linear relationship. However, because of the minimal
assumptions needed to justify linear regression, it is often used even when looking at non-
linear processes. Frequently, you can also transform your data to make the relationship
linear, even if the data-generating process itself is not linear.
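As an illustrative sketch of this idea, the Python snippet below simulates data from a non-linear (exponential) process and shows that taking the logarithm of the dependent variable produces a straight-line relationship that linear regression can recover. All numbers are invented for demonstration.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(0.0, 0.05, size=50)  # non-linear process

# Taking logs linearises the relationship: log(y) = log(2.0) + 0.3 * x.
slope, intercept = np.polyfit(x, np.log(y), deg=1)
print(f"estimated slope: {slope:.2f} (true value 0.3)")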


In this unit, you will learn about linear regression, which can occur between variables with
either a positive or negative relationship. A positive relationship indicates that an increase
in the independent variable leads to an increase in the dependent variable, while a negative
relationship means that an increase in the independent variable leads to a decrease in the
dependent variable. You will learn more about modelling the relationship between variables
mathematically in the next section on regression analysis.

3. Regression analysis
Linear regression is used to identify the relationship between variables plotted on a
scatterplot, such as those shown in Figure 1 in the previous section. One advantage of
linear regression is that it can be used to understand the relationship between one
independent variable and one dependent variable, as in the examples provided earlier,
or the relationship between many independent variables and one dependent variable.

A regression analysis results in an equation that can be used to determine the slope and
intercept of the regression line. An example of a regression graph is shown in Figure 2.
The datapoints used for this research are shown in blue, and the regression line has been
added in red.

Figure 2: An example of a regression.

The line drawn using a regression is a “line of best fit”. The line of best fit is calculated
mathematically to minimise the overall distance between the line and the full collection
of datapoints. Consider the regressions in Figure 3 in order to understand the
concept of the line of best fit. The distance between each datapoint and the regression line
has been indicated in black. The distances between the regression line and each point are


known as the errors. The line of best fit is the line for which the sum of the squared
errors is smallest.

Compare the regression lines illustrated in Regression 1 and Regression 2, as shown in
Figure 3. Which do you think is the more useful regression line?

Figure 3: Accuracy of fit.

The red line in Regression 2 would be a poor line of best fit, as only a few datapoints are
closely related to this line. It therefore does not give an accurate indication of the
relationship between the variables.
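To see why one line fits better than another, consider the minimal Python sketch below. It totals the squared vertical distances (the errors) between some invented datapoints and two candidate lines; the line of best fit is the one with the smallest total.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

def sum_of_squared_errors(a, b):
    # Squared vertical distance from each datapoint to the line y = a + b*x.
    errors = y - (a + b * x)
    return float(np.sum(errors ** 2))

print(sum_of_squared_errors(a=1.0, b=1.0))   # close to the data: small total
print(sum_of_squared_errors(a=5.0, b=-0.5))  # poor fit: much larger total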

A regression equation takes the same form as the formula for a straight line, which is as
follows:

y = a + bx

Where,

y = Dependent variable

x = Independent variable

a = Intercept

b = Slope

Regression analysis calculates the values of a and b that best fit the data. The slope of
the line shows the relationship between variables. If the slope of the graph is 1, this
indicates that for each single unit increase in the independent variable, the dependent
variable will increase by 1. Conversely, if the slope of the graph is −0.5, this indicates that
for each single unit increase in the independent variable, the dependent variable will
decrease by a value of 0.5. You will learn more about the correlation between variables in
Section 3.1.

The regression line can be used to make predictions about values in the data. If you know
the value of one variable in the data set, for example, you can use this to predict the value


of the second variable. Regression is also useful for understanding the strength of the
relationship between the independent and dependent variables. In other words, it allows
researchers to identify whether certain independent variables will have a more profound
impact on changing the dependent variable relative to others, and this is typically indicated
by the magnitude of the slope.
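The sketch below estimates a and b for some invented data using NumPy's least-squares fit, and then uses the fitted line to make a prediction. It is a minimal illustration under made-up numbers, not a full analysis.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # independent variable
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])   # dependent variable

b, a = np.polyfit(x, y, deg=1)   # polyfit returns the slope first, then the intercept
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")

# The fitted line can be used for prediction: if you know x, estimate y.
# A positive slope b means each one-unit increase in x is associated
# with an increase of roughly b units in y.
x_new = 7.0
print(f"predicted y when x = {x_new}: {a + b * x_new:.2f}")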

You will learn more about the elements of a regression in the following sections. In order
to interpret a regression, you need to understand several pieces of information calculated
during a regression: the regression coefficient, the standard error, the R2 value, and
the p-value.

Explore further:

This unit is intended to equip you with the skills to analyse, rather than conduct, a
regression analysis. Regression analysis can be done by hand, and there are also
many online tools that can speed up the process. You may watch the tutorial by Dr
Eugene O’Loughlin from the National College of Ireland if you would like to learn more
about how to conduct a regression yourself.

3.1 The regression coefficient


The regression coefficient is an indicator of the strength of the relationship between two
variables that have a linear relationship. If you refer to the equation used to calculate a
line, the regression coefficient is the slope of the line, which is b in the equation:

y = a + bx

A positive value indicates a positive correlation between the independent and
dependent variables, and a negative value indicates a negative correlation.

Consider the following example. A researcher is interested in the relationship between
post-secondary years of education and monthly income in pounds. The researcher
observed a regression coefficient of 70.5. This indicates that each additional year of
education results in an increase in income of £70.50 per month. A regression coefficient
of 5 would indicate that each additional year of education only leads to an additional £5
in income per month on average. In the first case, policymakers might be justified in
efforts to promote post-secondary education. In the second case, policymakers might
consider other constraints on people’s earning potential.

Correlation is valuable to policy specialists, as it provides an indicator of the likely impact
a policy intervention might have on a target group. For example, if there is a strong positive
relationship between sugar consumption and heart disease, policy interventions that
reduce sugar consumption should also reduce the incidence of heart disease in the
population. If, on the other hand, there is a weak correlation between meat consumption
and heart disease, a campaign advocating vegetarianism would be less effective than a
campaign targeting sugar.

Note that correlation between two variables does not mean that one variable is the cause
of another. The terms “correlation” and “causation” are often confused, but the distinction


between these concepts is important. In Video 1 Part 1, Dr Daniel Berliner explains the
difference between correlation and causation, and highlights several factors you should
consider when you interpret research claiming a relationship between variables. In Video
1 Part 2, Daniel considers the challenge of establishing correlation and causation in public
policy.

Video 1 Part 1: Dr Daniel Berliner considers the impact of data relationships on public policy.
(Access this set of notes on the Online Campus to engage with this video and download its
transcript.)

Video 1 Part 2: Dr Daniel Berliner considers the impact of data relationships on public policy.
(Access this set of notes on the Online Campus to engage with this video and download its
transcript.)

As you engage with policy issues in the future, remember to think critically about claims of
correlation and causation. These terms are often conflated, especially in clickbait articles
making radical claims based on confounding variables. When interpreting research, you


can use the principles highlighted by Sir Austin Bradford Hill (1965), the pioneer of
randomised trials, which you will learn about in Module 5. In his influential work on
causation, Hill highlights several considerations, including the following three:

1. Consistency: The finding should be consistently observed by different people in
different circumstances.

2. Temporality: A change in the independent variable should lead to a change in the
dependent variable, so the change in the dependent variable must occur after that
of the independent variable.

3. Coherence: The findings of the research should be logical, given current research
in the field.

As well as considering these factors, you should also critically engage with whether causal
relationships are plausible. One trend that has developed alongside big data tools, which
use computing power to find patterns and trends in large volumes of data, is “data
dredging”. This entails searching for patterns in data without a clear hypothesis or
research question.

Data dredging often results in unexpected correlations that are entirely the result of chance.
For example, the graphs for per-capita consumption of mozzarella cheese and civil
engineering doctorates awarded in the US are remarkably similar. It is clear to most people,
however, that these two factors are unrelated, and that the similarity is not an example of
causation. You should critically engage with the plausibility of the results of the research,
especially in instances where you suspect that the researchers have been using data
dredging.
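As a hedged illustration of why data dredging produces chance findings, the Python simulation below generates 200 mutually independent random series and searches for the strongest pairwise correlation. With roughly 20,000 comparisons, an apparently “strong” correlation emerges even though every series is unrelated by construction.

import numpy as np

rng = np.random.default_rng(42)
series = rng.normal(size=(200, 30))   # 200 unrelated variables, 30 observations each

corr = np.corrcoef(series)            # all pairwise correlations (200 x 200 matrix)
np.fill_diagonal(corr, 0.0)           # ignore each series' correlation with itself
print(f"strongest correlation found: {np.max(np.abs(corr)):.2f}")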

Explore further:

Several sites have been developed to demonstrate how data dredging leads to
spurious, and often ridiculous, conclusions. Tyler Vigen’s website is one such
humorous compilation of examples that illustrate the dangers of data dredging.

3.2 Accuracy of fit: Standard error


The standard error measures the uncertainty of the relationship between a single
dependent variable and a single independent variable. This is calculated by measuring the
average distance that the sample data falls from the regression line. In Unit 1, you learnt
that for normally distributed data approximately 68% of the data falls within 1 standard
deviation of the mean. If you imagine the regression line as a mean in the data, 68% of
data should fall within 1 standard error of the line, while approximately 95% should fall
within 2 standard errors.

Consider an example examining the relationship between sugar consumption and
weight. If the predicted weight at a given level of sugar consumption is 79kg, and the
standard error is 1.5kg, 68% of the data will likely fall between 77.5kg and 80.5kg. The
smaller the standard error, the closer the data
is to the fitted line. You will notice from this example that the standard error calculation
results in the same units as the dependent variable (kg).
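A minimal sketch of this calculation, in the spirit of the description above, is shown below using invented sugar and weight figures. The standard error of the regression is roughly the average distance of the datapoints from the fitted line, in kilograms.

import numpy as np

sugar = np.array([20.0, 35.0, 50.0, 65.0, 80.0, 95.0, 110.0])   # grams per day
weight = np.array([74.0, 76.5, 77.8, 79.2, 80.5, 82.1, 83.0])   # kilograms

b, a = np.polyfit(sugar, weight, deg=1)
residuals = weight - (a + b * sugar)

# Divide by n - 2 because two parameters (a and b) were estimated from the data.
n = len(weight)
standard_error = np.sqrt(np.sum(residuals ** 2) / (n - 2))
print(f"standard error of the regression: {standard_error:.2f} kg")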


3.3 Accuracy of fit: R²


The R2 value is also a measure of accuracy of fit, alongside the standard error you learnt
about in the previous section. It is used to measure the strength of the relationship
between the dependent and independent variables in the regression model. R2 values
range between 0 and 1, where 1 indicates that the regression is a perfect fit for the data
and 0 indicates that the regression does not fit the data. This value is usually expressed
as a percentage, so that a value of 1 is expressed as an R2 of 100%. Figure 4 gives a visual
representation of how the R2 value relates to the distribution of data in a scatterplot. As
you can see, Graph 1 has a low R2 value of 20%. Graph 2, on the other hand, has an R2 of
80%.

Figure 4: A comparison of R2 values.

If a regression has a low R2 value, it does not necessarily mean that the model is poor.
Human behaviour studies, for example, may result in R2 values below 50%. This is due to
the high incidence of unexplained variation in human behaviour. A low R2 value in other
fields that require precision may, however, be problematic. Similarly, a high R2 value is
also not always an indicator of a good model. In cases of data dredging, for example, the
regression may result in a high R2 value for two variables that cannot plausibly be linked.
You should therefore interpret the R2 value critically, taking into account other statistical
values and the broader context of the experiment.
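The short sketch below computes R2 directly from its definition: the share of the total variation in the dependent variable that the fitted line accounts for. The data is invented for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.1, 3.9, 3.4, 5.2, 4.8, 6.3, 6.0])

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)        # variation left unexplained by the line
ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation around the mean of y
r_squared = 1 - ss_res / ss_tot
print(f"R2 = {r_squared:.0%}")           # 100% = perfect fit, 0% = no fit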

3.4 The p-value


As discussed in the previous unit, researchers often rely on samples from some larger
population for their study. For instance, researchers may only be able to study the weight
and sugar consumption of 1,000 individuals in a country. When this is the case,
researchers often want to know whether the relationship they observe in their sample is a
relationship that also exists in the population of interest (e.g., the country).


One indicator of uncertainty in the population is the p-value. The p-value ranges between
0 and 1 and indicates the probability of observing a relationship at least as strong as the
one in the sample if there were actually no relationship in the population. For example, suppose
we found that men consume 10g more sugar in a week than women. How confident should
we be that this difference would also hold if we looked at the full population of men and
women?

Based on our sample size, suppose we calculated a p-value of 0.05. In this case, we can
be fairly confident that the difference between men and women is not just due to random
variation in our sample: if there were no difference in the population, we would expect to
observe a difference this large only 5% of the time. Conversely, if the p-value is 0.5, it is
highly likely that the difference between men and women is just due to random variation
in our sample. That is, it is quite plausible that we just happened to sample men who were
more likely to eat sugar, and that if we were to sample more men and women, we would
find that in fact there is no difference in sugar consumption.

One reason we may observe a large p-value is that there is no relationship between our
variables in the population. However, another reason for observing a large p-value is that
the sample is too small. Suppose we looked at whether unemployed individuals were less
likely to pay taxes in a sample of 500 individuals. If we observed a large p-value, this might
indicate that there is no relationship between unemployment and taxation. A more likely
explanation, however, is that the relationship between these variables is small and we do
not have enough data to make firm conclusions about the relationship. If there are only a
few unemployed individuals in our sample, for instance, then our sample might be very
different from the population. When a researcher observes a p-value greater than 0.05,
it is often good practice to collect more data in order to be more certain about the
relationship between the variables.
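If you would like to see a p-value produced in practice, the hedged sketch below simulates a modest relationship and tests it with scipy.stats.linregress; all data is generated purely for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.3 * x + rng.normal(size=100)   # a modest true relationship plus noise

result = stats.linregress(x, y)
# A small p-value suggests the observed slope would be unlikely
# if there were no relationship in the population.
print(f"slope = {result.slope:.2f}, p-value = {result.pvalue:.4f}")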

Many researchers consider the p-value to be a good measure of statistical significance.
When a finding is statistically significant, it means that it is unlikely that the finding is due
to sampling error or random chance. Researchers differentiate between statistical
significance, as indicated by the p-value, and practical significance. Practical significance
considers whether the results of the regression have a useful implication. Consider
research that indicates that drinking coffee is associated with an increase in body mass of
0.02kg over 3 years, with a p-value of 0.04. This finding is statistically, but not practically,
significant, as 0.02kg over 3 years is of little practical relevance to diet.

4. Analysing the results of a regression


Now that you understand the basic components of a regression, you can apply your
learning to a practical case study. The regression in Figure 5 has been created from a
hypothetical survey of 200 students. The research seeks to understand the relationship
between high school and university grades as percentage scores.


Figure 5: Regression analysis.

What deductions can you make about the relationship between the two variables, based
on the information given in Figure 5? Consider the following two questions before reading
the answers that follow:

1. What can you surmise about the relationship between the variables, based on the
regression coefficient?

2. What does the R2 value indicate to you?

You can surmise the following information from Figure 5. The variables have a positive
relationship: an increase in high school grades is related to an increase in university
grades, with a regression coefficient of 0.75. The R2 value is 64%. This indicates that
the model generated by the regression strongly fits the sample data.

In the previous sections, you learnt about bivariate regression, which seeks to understand
the relationship between two variables. In the next section, you will learn more about
multivariate regression, which uses the same principles to compare relationships between
several different variables.

5. Multivariate regression
In policy, it is likely that you will need to understand the correlation between more than two
variables. This is often the case in policy evaluations where a complex interplay of variables
might influence the dependent variable. Researchers use multivariate regression in order
to control for other variables that might explain why an independent variable is related to a
dependent variable.


Consider, for instance, the relationship between secondary school education and income.
While it is likely that education leads to higher levels of income, there are also other
variables that might explain this relationship. For instance, students in urban areas or
students with educated parents might be more likely to have more years of secondary
education and more job opportunities. Therefore, in order to establish whether promoting
secondary school education will lead to higher income, researchers also need to consider
what other variables affect both education and income. Multivariate regression is one tool
to account for these other variables.

A multivariate regression results in the same indicators you learnt about in Section 3, such
as the regression coefficients, standard errors, p-values, and the R2 value. You can
therefore apply what you learnt about bivariate regression to understanding a multivariate
regression.

5.1 Accuracy of fit


In a multivariate regression, a dependent variable is modelled as a function of several
independent variables, each with their own coefficient. This gives an idea of each
independent variable’s relationship with the dependent variable while holding all else
equal, namely controlling for the other independent variables. Researchers calculate the
standard error for each variable, which provides an indicator of the uncertainty around that
specific relationship. The R2 value, on the other hand, provides a measure of how well the
model as a whole fits the data. If the R2 value is low, researchers may choose to
add additional variables to a multivariate regression. This helps to explain some of the
previously unexplained variation, which reduces the total uncertainty of the model and thus
yields a higher R2 value.
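The sketch below, using the statsmodels library on simulated data, fits a multivariate regression of income on two independent variables; the printed summary reports a coefficient and standard error for each variable, plus an overall R2. The variable names and values are invented for illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
education = rng.normal(12.0, 2.0, size=n)
experience = rng.normal(10.0, 4.0, size=n)
income = 5.0 + 0.8 * education + 0.3 * experience + rng.normal(0.0, 3.0, size=n)

# Stack the independent variables and add an intercept term.
X = sm.add_constant(np.column_stack([education, experience]))
model = sm.OLS(income, X).fit()
print(model.summary())   # coefficients, standard errors, p-values, and R2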

In a multivariate regression, there is not necessarily a close relationship between standard
error and the R2 value. A researcher could have a very precisely estimated relationship
between one of the independent variables and the dependent variable but still have a low
R2 value in the model. If a researcher regresses income on education for a large sample
of people, for example, that relationship will be highly statistically significant, but the R2
value might still be very low.

5.2 Policy application


In this section you will apply what you have learnt about regression to real-world data.
Imagine that you are a policymaker interested in determining the relationships between
education and income. The results of a regression of hourly wages on years of education
have been provided in Figure 6. This is from a 1976 survey of 3,010 individuals in the US.
The regression coefficients are shown in Columns 1 and 2 with the standard errors in
parentheses.


Figure 6: Relationship between education and income. (Adapted from: rdrr.io, n.d.)

Focus on the linear regression data given in Column 1. What conclusions can you draw
from this information about the relationship between education and income? Try to make
your own deductions about the relationship based on the p-value, R2 value, and coefficients
given in the column before reading the answers in the next paragraph.

You will notice from the coefficient in Column 1 that each additional year of education is
associated with about 29.6 additional cents per hour in wages. Note that the p-value
associated with this coefficient is smaller than 0.01, and the standard error is quite small.
This means it is likely that the true relationship between education and income in the
population is greater than zero. The regression table also shows an R2 value of 0.09. This
means that education explains about 9% of the variation in wages in the sample, while
91% of the variation in wages is explained by other factors.

As a policymaker, you might be sceptical of the statement that each year of education
causes an increase of 29.6 cents in wages. Can you think of other variables that might
explain the relationship between education and income? As you learnt previously, one
possibility is that parental income and education explain part of a person’s access to
educational and employment opportunities. For example, parents with tertiary education
might be more willing to pay for higher education for their children. If so, the regression
coefficient in Column 1 may overestimate the true causal effect of education on income.

In response to this issue, you might choose to use a multivariate regression to control for
the confounding effect of parental education on income. The results of jointly estimating
the effect of a person’s education and their parents’ education using a multivariate
regression are given in Column 2. Note that in Column 2 the regression coefficient for a
person’s years of education decreases from 29.6 to 25.9. This means that each additional
year of education is associated with an increase of 25.9 cents per hour in wages. This
demonstrates that the
results in Column 1 are indeed likely to be an overestimate of the causal relationship


between education and income, as some of the correlation between education and income
is due to the relationship between a person’s education and their parents’ education.

You can see from this example how multivariate regression can be a powerful tool to
evaluate the effects of policies. A multivariate regression allows researchers to control for
several variables that might explain the relationship between an independent variable and
a dependent variable. This example also illustrates the importance of considering possible
confounding variables when interpreting regression coefficients.
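The following hedged simulation reproduces the logic of this section with invented numbers: parental education raises both a person's education and their wages, so a bivariate regression overstates the effect of education until parental education is controlled for.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 3000
parent_ed = rng.normal(11.0, 2.0, size=n)
education = 0.5 * parent_ed + rng.normal(6.0, 1.5, size=n)  # confounded with parental education
wages = 20.0 * education + 15.0 * parent_ed + rng.normal(0.0, 40.0, size=n)  # true effect: 20

bivariate = sm.OLS(wages, sm.add_constant(education)).fit()
multivariate = sm.OLS(wages, sm.add_constant(np.column_stack([education, parent_ed]))).fit()

print(f"bivariate coefficient on education:    {bivariate.params[1]:.1f}")    # overstated
print(f"multivariate coefficient on education: {multivariate.params[1]:.1f}")  # close to 20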

6. Conclusion
In these notes, you learnt more about how to interpret a linear regression. You learnt about
the information that results from a regression, including the regression coefficient,
standard error, p-value, and R2 value. It is important to understand regression in order to
engage critically with the statistics you are presented with. The next module on policy
evaluation builds on these skills. You will learn more about choosing and testing a
hypothesis, as well as how to use experimental design in policy formulation.

7. Bibliography
Hill, A. 1965. The environment and disease: association or causation? Proceedings of
the Royal Society of Medicine. 58(5):295-300.

Tredoux, C. & Durrheim, K. 2002. Numbers, hypotheses & conclusions: a course in
statistics for the social sciences. Cape Town: UCT Press.

Rdrr.io. n.d. Schooling: wages and schooling. [Data set]. Available:
https://rdrr.io/cran/Ecdat/man/Schooling.html.
