
Section 2 Mathematics as A Tool GECMAT CHMSU – CAS Mathematics Department

Introduction
Statistics is a branch of applied mathematics that deals with gathering, organizing, presenting, analyzing,
and interpreting data. There are two major fields of applied statistics: descriptive and inferential
statistics. Descriptive statistics involves collecting, organizing, describing, summarizing, and presenting
gathered data in a meaningful and informative way, while inferential statistics refers to the process of drawing
conclusions and making decisions about a population based on evidence obtained from a sample. Inferential
statistics includes estimation and hypothesis testing.
Performing these processes requires the application of statistical tools and techniques.
Statistical tools derived from mathematics are useful for processing and managing numerical data in order to
describe a phenomenon and predict values.
The essential processes arrange the data to be analyzed and interpreted. These refer to gathering and
organizing data, which can be done using a frequency distribution (grouped data) or, when there are only a few
values, a simple series of values. The measures of central tendency, namely the mean, the median, and the mode,
help us determine a central value that can be used to describe the general or overall performance of a certain
group of values. The measures of dispersion, namely the range, the standard deviation, and the
variance, tell us how close to or far from each other the data values are. They are also helpful in describing
whether the groups being studied or the data gathered are heterogeneous (dispersed, scattered, varied, or spread
out) or homogeneous (clustered or close to each other). These measures also include the measures of relative
position: the z-scores, percentiles, quartiles, deciles, and box-and-whisker plots.
Probability and the normal distribution are also discussed in this module. This is to equip the students
with further knowledge and skills in obtaining probabilities and solving problems about normal
distributions.
The topics on linear regression and correlation, such as the least-squares line and linear correlation, are
covered in the module to help students interpret data and test assumptions based on the problem given.

Lesson 1: Gathering and Organizing Data, Representing Data using Graphs and Charts and Interpreting
Organized Data

In today’s world, we process an enormous amount of data almost every day. In schools, laboratories, and
companies, volumes of data are processed. Data management plays a very important role in processing these data.
To help analyze a certain phenomenon, we need to manage data with the help of statistics. The use of statistics
in predicting outcomes and possibly explaining what is happening is very evident. When data are managed
efficiently, the result is a better understanding of the nature of the phenomenon, which can further improve
lives in the modern world.
The bar graph below depicts the confirmed COVID-19 cases, deaths, and recoveries in Southeast Asian
countries.

Revision 02

Provide concise answers (maximum of 5 sentences) to the following questions.


1. What are data?
2. What are categorical data and continuous data?
3. What are the four levels or scales of measurement? Differentiate the four levels or scales and give an example for each.
4. What is a variable?
5. Differentiate independent and dependent variables.
6. Differentiate quantitative and qualitative variables.
7. Differentiate discrete and continuous variables.

For each item, identify the misleading graph. After that, form a small group of four members and share
your answers or thoughts about the graphs. Then, after synthesizing the answers of your group, choose one
representative to present the answer to the class.

[Graphs for items 1 to 4 are not reproduced here.]


Gathering and Organizing Data

When conducting a statistical study, the researcher must gather data for the particular variable under
study. For example, if a researcher wishes to study the number of people who were bitten by poisonous snakes
in a specific geographic area over the past several years, he or she has to gather the data from various doctors,
hospitals, or health departments.
To describe situations, draw conclusions, or make inferences about events, the researcher must organize
the data in some meaningful way. The data gathered should be presented, analyzed, and interpreted in a way that
can be easily understood by the reader. Data may be presented in textual, tabular, or graphical form, or in a
combination of these.
Textual presentation uses statements with numerals, in expository form, to describe the data and give concrete
information. Its purpose is to discuss the data and the information and interpretation they carry. For example,
the math test scores of 15 students out of 50 items are 47, 48, 49, 42, 36, 38, 40, 35, 50, 26, 25, 31, 34, 19, 41.
Tabular presentation uses a statistical table to directly display the quantities or values collected as data. It is a
systematic arrangement of information into columns and rows. Examples of tabular presentation are the simple
frequency distribution (or just frequency distribution), cumulative frequency distribution, grouped frequency
distribution, and cumulative grouped frequency distribution.
Graphical presentation illustrates data in the form of graphs, helping readers understand the text easily. It is
the most attractive, effective, and convincing way of describing the data. There are various types of graphs we
can prepare, such as the bar graph, circle graph (pie chart), line graph, pictograph, histogram, frequency
polygon, and scatter diagram.

Tabular Presentation of Data


Frequency distribution
The frequency distribution table (FDT) is a statistical table showing the frequency or number of
observations contained in each of the defined classes or categories.
Parts of a Statistical Table
a. The table heading includes the table number and the title of the table.
b. The body is the main part of the table that contains the information or figures.
c. Stubs or classes are the classifications or categories describing the data, usually found at the leftmost side of
the table.
d. The caption is the designation or identification of the information contained in a column, usually found at the
top of the column.

Sample frequency distribution table

Table 1
Distribution of Students in YS High School According to Year Level

Year Level    No. of Students (Frequency)    Percentage Frequency
Freshmen      300                            27.27%
Sophomore     350                            31.82%
Junior        250                            22.73%
Senior        200                            18.18%
              N = 1,100
SOURCE: YS High School Registrar

In this sample, "Table 1" is the table number, the line below it is the table title, the first row of the table
holds the column headers, the year levels are the row classifiers, and the last line is the source note.

Rules in Constructing Tables


Tables should have the following labels:
➢ Table number. It is written on top of the table.
➢ Table title. It briefly explains the contents of the table and follows the table number.
➢ Column headers. They describe the entry in each column.
➢ Row classifier. It classifies the rows.
➢ Source note. It acknowledges the sources of the data.
➢ Frequency. It shows the number of entries per category.
➢ Total frequency. It shows the general picture of the total population.
➢ Percentage frequency. It gives a view of the characteristics of the data set.


Two types of frequency distributions that are most often used are the qualitative or categorical frequency
distribution and the quantitative or grouped frequency distribution. Each raw data value is placed into a
quantitative or qualitative category called a class. The frequency of a class then is the number of data values
contained in a specific class.
Types of FDT
a) Qualitative or Categorical FDT. It is a frequency distribution table where the data are grouped according to some
qualitative characteristics; data are grouped into non-numerical categories. Categorical frequency distribution is
used for data that can be placed in specific categories, such as nominal- or ordinal-level data. For example, data
such as political affiliation, religious affiliation, or major field of study would use categorical frequency
distributions.
b) Quantitative FDT. It is a frequency distribution table where the data are grouped according to some numerical or
quantitative characteristics.

Example 1. Twenty-five army inductees were given a blood test to determine their blood type. The data set is

Construct a frequency distribution for the data.


Solution
Since the data are categorical, discrete classes can be used. There are four blood types: A, B, O, and AB. These
types will be used as the classes for the distribution. The procedure for constructing a frequency distribution for
categorical data is given next.
Step 1. Make a table with columns A (class), B (tally), C (frequency), and D (percent).
Step 2. Tally the data and place the results in column B.
Step 3. Count the tallies and place the results in column C.



Step 4. Find the percentage of values in each class by using the formula
\[ \% = \frac{f}{n} \times 100 \]
where $f$ = frequency of the class and $n$ = total number of values. For example, in the class of type A blood,
the percentage is
\[ \% = \frac{f}{n} \times 100 = \frac{5}{25} \times 100 = 20\% \]
Percentages are not normally part of a frequency distribution, but they can be added since they are used
in certain types of graphs such as pie graphs. Also, the decimal equivalent of a percent is called a relative
frequency.
Step 5. Find the totals for columns C (frequency) and D (percent). The completed table is shown. It is a good
idea to add the percent column to make sure it sums to 100%. This column won’t always sum to 100% because
of rounding.
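
The five steps above can be sketched in Python as a rough illustration. The blood-type list below is hypothetical and made up only to show the mechanics; it is not the data set of the 25 inductees, and the function name `categorical_fdt` is likewise an illustrative choice.

```python
from collections import Counter

# Hypothetical observations, invented for illustration only.
# They are NOT the 25 inductees' data from Example 1.
blood_types = ["A", "B", "B", "AB", "O", "O", "O", "B", "AB", "B"]


def categorical_fdt(values):
    """Return (rows, n), where each row is (class, frequency, percent)."""
    counts = Counter(values)          # Steps 2 and 3: tally and count
    n = len(values)                   # total number of values
    rows = [(cls, f, f * 100 / n)     # Step 4: % = f / n * 100
            for cls, f in sorted(counts.items())]
    return rows, n


rows, n = categorical_fdt(blood_types)
for cls, f, pct in rows:
    print(f"{cls:<4}{f:>3}{pct:>8.1f}%")

# Step 5: totals of the frequency and percent columns.
print(f"Total {sum(f for _, f, _ in rows)}  {sum(p for _, _, p in rows):.1f}%")
```

Summing the percent column at the end mirrors the check in Step 5; with rounded percentages the total may differ slightly from 100%.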


When the range of the data is large, the data must be grouped into classes that are more than one unit in width, in what is
called a grouped or quantitative frequency distribution. It is a frequency distribution table where the data are grouped
according to some numerical or quantitative characteristics. For example, a distribution of the blood glucose levels in
milligrams per deciliter (mg/dL) for 50 randomly selected college students is shown.

Constructing a quantitative or grouped FDT (frequency distribution table)


Steps in Constructing a quantitative or grouped FDT (frequency distribution table)
Step 1. Determine the range ($R$): $R = \text{highest value} - \text{lowest value}$
Step 2. Determine the number of classes ($k$). A rough guide is to use between 5 and 20 classes, or to use
$k = \sqrt{N}$, where $N$ is the total number of observations in the data set.
Note: Sometimes the computed number of classes ($k$) is not followed. An extra class is added to accommodate the
highest observed value in the data set, and a class is deleted if it turns out to be empty.
Step 3. Determine the class size ($c$) by dividing the range ($R$) by the number of classes ($k$):
\[ c = \frac{R}{k} \]
Preferably, though not necessarily, $c$ is an odd number. It should have the same number of decimal places
as the raw data; i.e., if the observations in the data set are all whole numbers, then $c$ should also be a whole
number.
Step 4. Enumerate the classes or categories. The classes must be mutually exclusive. Mutually exclusive classes
have nonoverlapping class limits so that data cannot be placed into two classes. The classes must be continuous.
Even if there are no values in a class, the class must be included in the frequency distribution. There should be
no gaps in a frequency distribution. The only exception occurs when the class with a zero frequency is the first
or last class. A class with a zero frequency at either end can be omitted without affecting the distribution. The
classes must be exhaustive. There should be enough classes to accommodate all the data.
Step 5. Tally the observations and determine the frequency of each class interval.
Step 6. Compute for the values in the other columns of the FDT as deemed necessary.

Other Columns in the FDT


1. True Class Boundaries (TCB)
a. True Lower-Class Boundary (TLCB) = Lower Limit − 0.5 unit of measure
b. True Upper-Class Boundary (TUCB) = Upper Limit + 0.5 unit of measure
2. Class Mark (CM). It is the midpoint of the class interval.
\[ \text{CM} = \frac{1}{2}(\text{LL} + \text{UL}) \quad \text{or} \quad \text{CM} = \frac{1}{2}(\text{TLCB} + \text{TUCB}) \]
where LL and UL are the lower and upper class limits.
3. Relative Frequency (RF). It is the proportion of observations falling in a class, expressed as a percentage.
\[ \text{RF} = \left( \frac{f}{\sum f} \times 100 \right)\% \]
4. Cumulative Frequency (CF). It is the accumulated frequency of the classes. A cumulative frequency distribution is a
distribution that shows the number of data values less than or equal to a specific value (usually an upper boundary).
The values are found by adding the frequencies of the classes less than or equal to the upper class boundary of a
specific class. This gives an ascending cumulative frequency. Cumulative frequencies are used to show how many
data values are accumulated up to and including a specific class.
a. Less than CF (<CF). It is the accumulated frequency from the lowest class interval.
b. Greater Than CF (>CF). It is the accumulated frequency from the highest class interval
5. Relative Cumulative Frequency (RCF)
a. Less than RCF (<RCF)
b. Greater than RCF (>RCF)

Example 2. Suppose a researcher wished to do a study on the ages of the 40 patients confined at a certain
hospital. The researcher first would have to get the data on the ages of the participants. When the data are in
original form, they are called raw data and are listed next. Construct the FDT of the given data set.
Age (in years) of 40 patients confined at a certain hospital.
5 13 20 25 30 33 36 42 45 53
5 15 21 25 31 33 38 43 50 55
7 15 23 26 31 34 40 44 51 57
10 20 24 27 32 35 42 45 52 57
Solution
Step 1. $R = \text{highest value} - \text{lowest value} = 57 - 5 = 52$
Step 2. $k = \sqrt{N} = \sqrt{40} \approx 6.32$, or 6 classes
Step 3. $c = \frac{R}{k} = \frac{52}{6} \approx 8.67$, or 9
Step 4 and Step 5.
Table 1
Frequency Distribution of Age (in years) of 40 Patients Confined at a Certain Hospital

Age (in years)    Tally          f
5 – 13            IIII           5
14 – 22           IIII           5
23 – 31           IIII - IIII    9
32 – 40           IIII - III     8
41 – 49           IIII - I       6
50 – 58           IIII - II      7
TOTAL                            ∑ 𝑓 = 40
Other Columns in the FDT
Table 2
Frequency Distribution of Age (in years) of 40 Patients Confined at a Certain Hospital

Age (in years)   Tally         f    TLCB   TUCB   CM   RF (%)   <CF   >CF   <RCF (%)   >RCF (%)
5 – 13           IIII          5    4.5    13.5   9    12.5     5     40    12.5       100.0
14 – 22          IIII          5    13.5   22.5   18   12.5     10    35    25.0       87.5
23 – 31          IIII - IIII   9    22.5   31.5   27   22.5     19    30    47.5       75.0
32 – 40          IIII - III    8    31.5   40.5   36   20.0     27    21    67.5       52.5
41 – 49          IIII - I      6    40.5   49.5   45   15.0     33    13    82.5       32.5
50 – 58          IIII - II     7    49.5   58.5   54   17.5     40    7     100.0      17.5
TOTAL                          ∑ 𝑓 = 40           100.0

TLCB and TUCB are the true class boundaries (TCB); <CF and >CF are the cumulative frequencies (CF); <RCF and >RCF
are the relative cumulative frequencies (RCF).
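
As a minimal sketch, the construction in Example 2 can be reproduced in Python. The function name `grouped_fdt` and the dictionary keys are illustrative choices. The sketch also assumes that $k$ is $\sqrt{N}$ rounded to the nearest whole number, that $c$ is $R/k$ rounded up, and that the first class starts at the lowest observed value; other conventions are possible.

```python
import math

# Ages (in years) of the 40 patients from Example 2.
ages = [
    5, 13, 20, 25, 30, 33, 36, 42, 45, 53,
    5, 15, 21, 25, 31, 33, 38, 43, 50, 55,
    7, 15, 23, 26, 31, 34, 40, 44, 51, 57,
    10, 20, 24, 27, 32, 35, 42, 45, 52, 57,
]


def grouped_fdt(data, unit=1):
    """Build a grouped FDT as a list of row dictionaries.

    Assumptions of this sketch: k = sqrt(N) rounded to the nearest whole
    number, c = R / k rounded up, first class starts at the lowest value.
    """
    n = len(data)
    low, high = min(data), max(data)
    k = round(math.sqrt(n))           # Step 2: number of classes
    c = math.ceil((high - low) / k)   # Steps 1 and 3: range and class size
    rows, less_cf, lower = [], 0, low
    while lower <= high:              # Step 4: enumerate the classes
        upper = lower + c - unit
        f = sum(lower <= x <= upper for x in data)  # Step 5: frequency
        greater_cf = n - less_cf      # this class and all higher classes
        less_cf += f                  # this class and all lower classes
        rows.append({
            "class": (lower, upper),
            "f": f,
            "tlcb": lower - unit / 2,
            "tucb": upper + unit / 2,
            "cm": (lower + upper) / 2,
            "rf": f * 100 / n,
            "less_cf": less_cf,
            "greater_cf": greater_cf,
            "less_rcf": less_cf * 100 / n,
            "greater_rcf": greater_cf * 100 / n,
        })
        lower += c
    return rows


table = grouped_fdt(ages)
for row in table:
    lo, hi = row["class"]
    print(f"{lo:>2} - {hi:<2}  f={row['f']:<2} CM={row['cm']:<5} "
          f"RF={row['rf']:<5} <CF={row['less_cf']:<2} >CF={row['greater_cf']}")
```

Under these assumptions the sketch yields six classes of size 9, from 5 – 13 up to 50 – 58, the same classes used in the tables of this example.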

Graphical Presentation of Data


A graph or a chart is a device for showing numerical values or relationships in pictorial form. The
following are the advantages of graphical presentation of data.
➢ The main features and implications of a body of data can be seen at once.
➢ It can attract attention and hold the reader’s interest.
➢ It simplifies concepts that would otherwise have been expressed in so many words.
➢ It can readily clarify data, frequently bring out hidden facts and relationships.

The only allowable calculation on nominal data is to count the frequency of each value of the variable.
The graphs discussed here are the pie chart, bar chart, scatter plot, time series graph, and pictograph.
1. Pie Chart (Circle graph). A pie chart is a circular graph that is useful in showing how a total quantity is distributed
among a group of categories. The “pieces of the pie” represent the proportion of the total that falls into each category.
It is useful for data sorted into categories for a specific period. Its emphasis is on showing the component parts with
respect to the total in terms of the percentage distribution. Use the pie chart if there are fewer than 8 categories in the
data set. The purpose of the pie graph is to show the relationship of the parts to the whole by visually comparing the
sizes of the sections. Percentages or proportions can be used. The variable is nominal or categorical. The graph is a
circle divided into sections or wedges according to the percentage of frequencies in each category of the distribution.

Guidelines on Pie Chart


➢ plot the biggest slice at 12 o’clock
➢ arrange components of the pie chart according to magnitude

➢ if there is an “Others” category, put it in the last section


➢ use different colors, shadings, or patterns to distinguish one section of the pie to the other sections

Example 3. Super Bowl Snack Foods


This frequency distribution shows the number of pounds of each snack food eaten during the Super
Bowl. Construct a pie graph for the data.

Solution
Step 1. Since there are 360° in a circle, the frequency for each class must be converted to a proportional part of
the circle. This conversion is done by using the formula
\[ \text{Degrees} = \frac{f}{n} \times 360^\circ \]
where $f$ = frequency for each class and $n$ = sum of the frequencies.
Hence, the following conversions are obtained. The degrees should sum to 360°.

Step 2. Each frequency must also be converted to a percentage, using the formula
\[ \% = \frac{f}{n} \times 100 \]
Hence, the following percentages are obtained. The percentages should sum to 100%.

Step 3. Next, using a protractor and a compass, draw the graph, using the appropriate degree measures found in
Step 1, and label each section with the name and percentages, as shown in the figure below.
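
The conversions in Steps 1 and 2 can be sketched in Python. Since the snack-food frequencies are not reproduced in this text, the sketch uses the year-level frequencies from the sample table of YS High School (300, 350, 250, and 200 students) instead; the function name `pie_slices` is an illustrative choice.

```python
# Frequencies from the sample table "Distribution of Students in
# YS High School According to Year Level" earlier in this lesson.
year_levels = {"Freshmen": 300, "Sophomore": 350, "Junior": 250, "Senior": 200}


def pie_slices(freqs):
    """Map each category to (degrees, percent) of the whole circle."""
    n = sum(freqs.values())                      # sum of the frequencies
    return {label: (f * 360 / n, f * 100 / n)    # Step 1 and Step 2
            for label, f in freqs.items()}


slices = pie_slices(year_levels)
for label, (degrees, percent) in slices.items():
    print(f"{label:<10}{degrees:>8.2f} deg{percent:>8.2f} %")

# Checks suggested in the steps: degrees sum to 360, percents to 100.
print(round(sum(d for d, _ in slices.values()), 6))
print(round(sum(p for _, p in slices.values()), 6))
```

The two sums printed at the end correspond to the checks stated in Steps 1 and 2; rounding each slice before summing can make the totals slightly off.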

2. Bar chart (Column graph). Like pie charts, column graphs or bar charts are applicable only to grouped data. They
should be used for discrete, grouped data of nominal or ordinal scale. A column chart is appropriate for comparing the
magnitudes of a variable, shown on the y-axis, across the different categories placed on the x-axis. For time series data,
its emphasis is on the magnitude and not the movement or trend. The usual space between bars is around one-fourth of
the width of the column. When the data are qualitative or categorical, bar graphs can be used to represent the data. A
bar chart can be drawn using either horizontal or vertical bars.

A horizontal bar chart is used for qualitative data at a specific time. It compares the magnitudes of the
different categories of a qualitative variable. The categories of the qualitative variable are placed on the y-axis and
the amount or number on the horizontal axis. The spaces between the bars may be one-fifth to one-half the width of
the bar.

Example 4. College Spending for First-Year Students


The table shows the average money spent by first-year college students. Draw a horizontal and vertical bar
graph for the data.

Solution
Step 1. Draw and label the x and y axes. For the horizontal bar graph place the frequency scale on the x axis,
and for the vertical bar graph place the frequency scale on the y axis.
Step 2. Draw the bars corresponding to the frequencies. See the figure below.

The graphs show that first-year college students spend the most on electronic equipment.
Bar charts can also be used to compare data for two or more groups. These types of bar graphs are called
compound bar graphs. Consider the following data for the number (in millions) of never married adults in the
United States.

The figure shows a bar graph that compares the number of never married males with the number of never married
females for the years shown. The comparison is made by placing the bars next to each other for the specific years.
The heights of the bars can be compared. This graph shows that there have consistently been more never married
males than never married females and that the difference in the two groups has increased slightly over the last
50 years.


3. Scatter plot (Scatter graph) is a graph used to represent measurements or values that are thought to be related. It is
used to examine possible relationships between two numerical variables. The two variables are plotted on the 𝑥-axis and
the 𝑦-axis.
4. Time series graph represents data that occur over a specific period of time under observation. It shows trends and
patterns, supports forecasts, and is applicable to one or more time series for comparison purposes.
5. Pictograph (Pictogram) immediately suggests the nature of the data being shown. It gives an approximation only of
the actual figures and compares the different categories. The symbols selected should be self-explanatory and easy to
understand. Each symbol represents a number.

Graphical Presentation using Statistical data


When the data set contains a large number of values, drawing conclusions from an ordered array or stem-
and-leaf plot is often difficult. We need graphs or charts in such situations. There are a number of graphs or
charts that visually show numerical data. These include the histogram, frequency polygon, and cumulative frequency
polygon (ogive).
1. Histogram is a bar chart that displays the classes on the horizontal axis and the frequencies of the classes on the
vertical axis; the vertical lines of the bars are erected at the class boundaries and the height of each bar corresponds
to the class frequency. Histograms are applicable only to quantitative data. A histogram does not show data over
time; it shows all the data at one point in time. The histogram is a graph that displays the data by using
contiguous vertical bars (unless the frequency of a class is 0) of various heights to represent the frequencies of the
classes.

Example 5. Construct a histogram to represent the data shown for the record high temperatures (in ℉) for each
of the 50 provinces in Luzon.

Solution
Step 1. Draw and label the x and y axes. The x axis is always the horizontal axis, and the y axis is always the
vertical axis.
Step 2. Represent the frequency on the y axis and the class boundaries on the x axis.
Step 3. Using the frequencies as the heights, draw vertical bars for each class.

As the histogram shows, the class with the greatest number of data values (18) is 109.5–114.5, followed by 13
for 114.5–119.5. The graph also has one peak with the data clustering around it.
2. Frequency polygon is a graph constructed by plotting the frequencies at the class marks and connecting the plotted
points by means of straight lines; the polygon is closed by considering an additional class at each end, so that the ends
of the lines are brought down to the horizontal axis at the midpoints of the additional classes. The frequency
polygon is a graph that displays the data by using lines that connect points plotted for the frequencies at the
midpoints of the classes. The frequencies are represented by the heights of the points.


Example 6. Using the frequency distribution below, construct a frequency polygon.

Solution
Step 1. Find the midpoints of each class. Recall that midpoints are found by adding the upper and lower
boundaries and dividing by 2:
\[ \frac{99.5 + 104.5}{2} = 102 \qquad \frac{104.5 + 109.5}{2} = 107 \]
and so on. The midpoints are

Step 2. Draw the x and y axes. Label the x axis with the midpoint of each class, and then use a suitable scale on
the y axis for the frequencies.
Step 3. Using the midpoints for the x values and the frequencies as the y values, plot the points.
Step 4. Connect adjacent points with line segments. Draw a line back to the x axis at the beginning and end of
the graph, at the same distance that the previous and next midpoints would be located, as shown in the figure
below.
Figure 1
Frequency Polygon of the Record High Temperatures
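
The midpoint calculation and the closing of the polygon in Steps 1 to 4 can be sketched in Python. Because the temperature frequencies are not reproduced in this text, the sketch plots the age distribution of the 40 patients from Example 2 instead (class boundaries 4.5 to 58.5, frequencies 5, 5, 9, 8, 6, 7). The function names are illustrative, and equal class widths are assumed.

```python
def class_midpoints(bounds):
    """Midpoint of each class from consecutive class boundaries."""
    return [(lo + hi) / 2 for lo, hi in zip(bounds, bounds[1:])]


def polygon_points(bounds, freqs):
    """Points of a frequency polygon, closed on the x axis at both ends.

    Assumes equal class widths, so the two closing points sit one class
    width before the first midpoint and one after the last midpoint.
    """
    mids = class_midpoints(bounds)
    width = bounds[1] - bounds[0]
    return [(mids[0] - width, 0)] + list(zip(mids, freqs)) + [(mids[-1] + width, 0)]


# The two midpoints worked out in Step 1 of Example 6.
print(class_midpoints([99.5, 104.5, 109.5]))

# Age distribution of the 40 patients from Example 2.
age_bounds = [4.5, 13.5, 22.5, 31.5, 40.5, 49.5, 58.5]
age_freqs = [5, 5, 9, 8, 6, 7]
points = polygon_points(age_bounds, age_freqs)
for x, y in points:
    print(x, y)
```

The resulting list of (midpoint, frequency) pairs can be handed to any plotting tool and joined with straight line segments, as described in Step 4.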

3. Cumulative frequency polygon (Ogive) is a graph that displays the cumulative frequencies for the classes in a
frequency distribution. The vertical axis represents the cumulative frequency of the distribution while the
horizontal axis represents the class boundaries of the frequency distribution.
The less than cumulative frequency polygon (less than ogive) is plotted against the upper class boundaries while the
greater than cumulative frequency polygon (greater than ogive) is plotted against the lower class boundaries.

Example 7. Construct an ogive for the frequency distribution described below.


Solution
Step 1. Find the cumulative frequency for each class.

Step 2. Draw the x and y axes. Label the x axis with the class boundaries. Use an appropriate scale for the y axis
to represent the cumulative frequencies. (Depending on the numbers in the cumulative frequency columns,
scales such as 0, 1, 2, 3,..., or 5, 10, 15, 20,..., or 1000, 2000, 3000,... can be used. Do not label the y axis with
the numbers in the cumulative frequency column.) In this example, a scale of 0, 5, 10, 15, . . . will be used.
Step 3. Plot the cumulative frequency at each upper class boundary, as shown in the figure below. Upper
boundaries are used since the cumulative frequencies represent the number of data values accumulated up to the
upper boundary of each class.
Step 4. Starting with the first upper class boundary, 104.5, connect adjacent points with line segments, as shown
in the figure below. Then extend the graph to the first lower class boundary, 99.5, on the x axis.

Cumulative frequency graphs are used to visually represent how many values are below a certain upper
class boundary. For example, to find out how many record high temperatures are less than 114.5℉, locate
114.5℉ on the x axis, draw a vertical line up until it intersects the graph, and then draw a horizontal line at that
point to the y axis. The y axis value is 28, as shown in the figure below.


Activity:

1. Percentage of People Who Completed 4 or More Years of College


Listed by province are the percentages of the population who have completed 4 or more years of a college
education. Construct a frequency distribution with 7 classes.

21 26 25 19 29 35 34 26 25 23
27 29 24 29 22 24 28 20 20 27
35 38 25 31 19 25 27 28 22 33
34 25 32 26 26 24 23 28 26 30
23 25 22 25 29 34 34 30 17 25

2. SJS Travel Agency, a nationwide local travel agency, offers special rates during the summer period. The owner wants
additional information on the ages of the people taking travel tours. Construct a histogram, frequency polygon, and
cumulative frequency polygon. What conclusions can you reach based on the information presented?

Class limits   Class boundaries   Midpoints   Frequency   <cf
18-26          17.5-26.5          22          3           3
27-35          26.5-35.5          31          5           8
36-44          35.5-44.5          40          9           17
45-53          44.5-53.5          49          14          31
54-62          53.5-62.5          58          11          42
63-71          62.5-71.5          67          6           48
72-80          71.5-80.5          76          2           50

Solve the following problems and show complete solution (15 points each)

3. Cereal Calories. The number of calories per serving for selected ready-to-eat cereals is listed here. Construct a
frequency distribution, using 7 classes. Draw a histogram, a frequency polygon, and an ogive for the data, using relative
frequencies. Describe the shape of the histogram.
130 190 140 80 100 120 220 220 110 100
210 130 100 90 210 120 200 120 180 120
190 210 120 200 130 180 260 270 100 160
190 240 80 120 90 190 200 210 190 180
115 210 110 225 190 130

4. The table below presents the COVID-19 death cases in the Philippines by region as of September 2021. Sketch the bar
chart and pie chart of the given data and interpret the data.

Region / Location Deaths


Metro Manila 9,142
Central Luzon 4,445
Calabarzon 4,073
Central Visayas 3,388
Western Visayas 1,971
Cagayan Valley 1,426
Davao Region 1,259
Ilocos Region 986
CAR 805
Caraga 692
Northern Mindanao 610
Zamboanga Peninsula 580
SOCCSKSARGEN 544
Bicol 506
Eastern Visayas 474
MIMAROPA 430


Lesson 2: Measures of Central tendency, Dispersion, and Relative Position

Any given data in statistics are useless if we don’t interpret them. The most appropriate measures found
to be useful in describing a distribution of observations are the measures of central tendency, measures of
dispersion, measures of relative position, 𝑧-scores, box and whisker plot, probability and normal curve, linear
regression and correlation.

Measures of Central Tendency

A measure of central tendency is any single value that is used to identify the “center” of the data or the
typical value. It is called measure of central tendency because when the data points are arranged according to
magnitude, it tends to lie centrally within the set. It is the representative value of the data set. It is the value
around which most of the data points are found.

Mean
The mean represents the center of the data. It is the most important measure if the distribution is symmetric
and the most stable measure of location. It is used when the data is at least interval. When n is small, the mean is
very sensitive to extreme values.
It is computed by summing all the observations in the sample and dividing the sum by the number of
observations.

Properties of Mean
a) A set of data has only one mean.
b) Mean can be applied for interval and ratio data.
c) All values in the data set are included in computing the mean.
d) The mean is very useful in comparing two or more data sets.
e) Mean is affected by the extreme small or large values on a data set.
f) Mean is most appropriate in symmetrical data.

For ungrouped data, the formulas for the mean are as follows.
Population mean ($\mu$): $\mu = \frac{\sum x_i}{N}$, where $x_i$ is the $i$th score or observation, and $N$ is the number of observations in the population.
Sample mean ($\bar{X}$): $\bar{X} = \frac{\sum x_i}{n}$, where $x_i$ is the $i$th score or observation, and $n$ is the number of observations in the sample.

Example 1: During a particular summer month, the eight hospitals in a particular province reported the following
number of admissions in their respective ICUs: 8, 11, 5, 14, 8, 11, 16, and 11.
Solution: Considering this month as the statistical population of interest, the mean number of ICU admissions is
$\mu = \frac{\sum x_i}{N} = \frac{8+11+5+14+8+11+16+11}{8} = \frac{84}{8} = 10.5$ ICU admissions

Example 2. Determine the mean age (in years) of a sample group of children whose ages are 9, 11, 7, 10, 9, 8, 8, 7, 12, 7 and 13.
Solution: $\bar{X} = \frac{\sum x_i}{n} = \frac{9+11+7+10+9+8+8+7+12+7+13}{11} = \frac{101}{11} \approx 9.18$ years
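As a quick check of Examples 1 and 2, the two means can be recomputed with a few lines of Python. This is only an illustrative sketch; the variable names are chosen here and are not part of the lesson.

```python
from statistics import mean

# Example 1: ICU admissions of the eight hospitals (treated as a population)
icu_admissions = [8, 11, 5, 14, 8, 11, 16, 11]
mu = sum(icu_admissions) / len(icu_admissions)   # sum of x_i divided by N

# Example 2: ages of the sample of 11 children
ages = [9, 11, 7, 10, 9, 8, 8, 7, 12, 7, 13]
x_bar = mean(ages)                               # same formula, library version

print(mu)               # 10.5
print(round(x_bar, 2))  # 9.18
```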

For grouped data, we have $\mu = \frac{\sum fM}{N}$ or $\bar{X} = \frac{\sum fM}{n}$, where $f$ is the frequency of the class interval and $M$ is the midpoint of the class interval.

Example 3. Calculate the mean grade of 50 students in statistics below and give its description or interpretation.

Grade 𝑓
90 – 94 7
85 – 89 13
80 – 84 16
75 – 79 8
70 – 74 6

Scale Description
95 – 100 Outstanding
89 – 94 Very Satisfactory
83 – 88 Satisfactory
77 – 82 Fair
70 – 76 Poor

Solution.
First, determine the midpoint (𝑀) of each interval and the total frequency (∑ 𝑓) or 𝑛.
Grade 𝑓 𝑀
90 – 94 7 92
85 – 89 13 87
80 – 84 16 82
75 – 79 8 77
70 – 74 6 72
𝑛 = 50
Second, add a column for 𝑓𝑀, which is the product of a frequency (𝑓) and the midpoint (𝑀) of the class
interval, and find the sum of 𝑓𝑀 column or ∑ 𝑓𝑀.

Grade 𝑓 𝑀 𝑓𝑀
90 – 94 7 92 644
85 – 89 13 87 1131
80 – 84 16 82 1312
75 – 79 8 77 616
70 – 74 6 72 432
𝑛 = 50 ∑ 𝑓𝑀 = 4135

Third, solve for the mean using the formula $\bar{X} = \frac{\sum fM}{n}$.

$\bar{X} = \frac{\sum fM}{n} = \frac{4135}{50} = 82.7$, or 83. Hence, the mean grade of the 50 students in statistics is Satisfactory.
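The grouped-mean steps of Example 3 (midpoints, the $fM$ column, and the final division) can be sketched in Python as follows. The tuple layout `(lower limit, upper limit, frequency)` is an assumption made for this illustration.

```python
# Grade classes from Example 3 as (lower limit, upper limit, frequency)
classes = [(90, 94, 7), (85, 89, 13), (80, 84, 16), (75, 79, 8), (70, 74, 6)]

n = sum(f for _, _, f in classes)            # total frequency
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
fm = [f * m for (_, _, f), m in zip(classes, midpoints)]   # the fM column
sum_fm = sum(fm)
grouped_mean = sum_fm / n

print(midpoints)                  # [92.0, 87.0, 82.0, 77.0, 72.0]
print(n, sum_fm, grouped_mean)    # 50 4135.0 82.7
```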

Weighted mean ($\bar{X}_w$ or $\mu_w$) is the sum of each value (or group mean) multiplied by its respective weight, divided by the sum of the weights. (For the ordinary mean, the weights of all values are equal.) An example of a weighted mean is the weighted average of a student's grades in a semester, computed to determine whether he or she belongs on the dean's list. Each grade has a corresponding number of units (for example, GECMAT is 3 units, a major subject is 4 or 5 units, and so on).
The formula for the weighted mean is
$\bar{X}_w = \frac{X_1 w_1 + X_2 w_2 + \cdots + X_n w_n}{w_1 + w_2 + \cdots + w_n}$
Example 4. Francis answered 20 calculus problems. He spent 1½ hours on each of the first 6 problems, 45 minutes on each of the next 3, and 3 hours on each of the last 11 problems. What was the average time (in minutes) he spent per problem?

Solution
This problem requires the weighted mean because each time value carries a weight, namely the number of problems that took that long. With the times in minutes ($X$ = 90, 45, 180) and the weights $w$ = 6, 3, 11, we have

$\bar{X}_w = \frac{X_1 w_1 + X_2 w_2 + X_3 w_3}{w_1 + w_2 + w_3} = \frac{90(6) + 45(3) + 180(11)}{6 + 3 + 11} = \frac{540 + 135 + 1980}{20} = \frac{2655}{20} = 132.75$ minutes
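Returning to the dean's-list illustration, a weighted average of grades by units can be sketched as below. The grades and unit loads are hypothetical values invented for this sketch, and `weighted_mean` is a helper name chosen here, not something defined in the lesson.

```python
def weighted_mean(values, weights):
    """Return sum(X_i * w_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# Hypothetical grades and their unit loads (illustration only)
grades = [90, 85, 88]
units = [3, 5, 4]

print(weighted_mean(grades, units))   # 87.25
```

With equal weights the function returns the ordinary mean, which matches the remark that the mean is the special case in which all weights are equal.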

Median (Population median: 𝜇̃, Sample median: 𝑋̃)


When a data set is ordered, whether ascending or descending, it is called a data array. The median is the positional middle of the data array: one-half of the values precede the median and one-half follow it. The median is an appropriate measure of central tendency for data that are at least ordinal, and it is most valuable for ordinal data.

Properties of Median
a) The median is unique; there is only one median for a set of data.
b) The median is found by arranging the set of data from lowest to highest (or highest to lowest) and getting the value of the middle observation.
c) The median is not affected by extremely small or large values.
d) The median can be applied to ordinal, interval and ratio data.
e) The median is most appropriate for skewed data.
For ungrouped data, the first step in calculating the median, denoted by $\tilde{X}$, is to arrange the data in an array. Let $X_{(i)}$ be the $i$th observation in the array, $i = 1, 2, \ldots, N$.
If $N$ is odd, the median position equals $\frac{N+1}{2}$, and the value of the $\left(\frac{N+1}{2}\right)$th observation in the array is taken as the median, i.e. $\tilde{X} = X_{\left(\frac{N+1}{2}\right)}$.
If $N$ is even, the mean of the two middle values in the array is the median, i.e.
$\tilde{X} = \frac{X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}}{2}$
Example 5. Find the median of the given data set: 75, 67, 71, 75, and 72.
Solution
First, arrange the data set in ascending order: 67, 71, 72, 75, 75.
Since $N = 5$ is odd, we use $\tilde{X} = X_{\left(\frac{N+1}{2}\right)}$; hence,
$\tilde{X} = X_{\left(\frac{5+1}{2}\right)} = X_{(3)} = 72$
The third value in the array 67, 71, 72, 75, 75 is 72.
Therefore, $\tilde{X} = 72$.
Example 6. The reaction times for a random sample of 9 subjects to a stimulant were recorded as 2.5, 3.6, 3.1,
4.3, 2.9, 2.3, 2.6, 4.1, and 3.4 seconds. Calculate the median.
Solution
Array: 2.3, 2.5, 2.6, 2.9, 3.1, 3.4, 3.6, 4.1, 4.3
𝑋̃ = 3.1 seconds
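The odd and even cases of the ungrouped median can be sketched in one small Python function. `median_from_array` is a name chosen for this illustration; Python's built-in `statistics.median` gives the same results.

```python
def median_from_array(data):
    """Median of ungrouped data: sort, then take the middle value(s)."""
    array = sorted(data)
    n = len(array)
    mid = n // 2
    if n % 2 == 1:
        # N odd: the ((N + 1) / 2)th observation (index mid when counting from 0)
        return array[mid]
    # N even: mean of the (N / 2)th and (N / 2 + 1)th observations
    return (array[mid - 1] + array[mid]) / 2

print(median_from_array([75, 67, 71, 75, 72]))                           # 72
print(median_from_array([2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 3.4]))  # 3.1
```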
For grouped data, the formula for the median is $\tilde{X} = X_{LB} + \left(\frac{\frac{n}{2} - {<}cf}{f_m}\right) i$

where
$X_{LB}$ = lower boundary of the class containing the median
$n$ = sample size
${<}cf$ = cumulative frequency of the classes preceding the class containing the median
$f_m$ = number of observations in the class containing the median
$i$ = width of the interval containing the median

Example 7. Calculate the median grade of 50 students in statistics in Example 3 and give its description or
interpretation.
Grade 𝑓
90 – 94 7
85 – 89 13
80 – 84 16
75 – 79 8
70 – 74 6

Scale Description
95 – 100 Outstanding
89 – 94 Very Satisfactory
83 – 88 Satisfactory
77 – 82 Fair
70 – 76 Poor
Solution
First, we add two columns for class boundaries and less than cumulative frequency (< 𝑐𝑓).
Grade 𝑓 Class boundaries < 𝑐𝑓
90 – 94 7 89.5 – 94.5 50
85 – 89 13 84.5 – 89.5 43
80 – 84 16 79.5 – 84.5 30
75 – 79 8 74.5 – 79.5 14
70 – 74 6 69.5 – 74.5 6

Then, determine the median class by locating the $\left(\frac{n}{2}\right)$th item in the distribution. Hence,
$\left(\frac{n}{2}\right)$th $= \left(\frac{50}{2}\right)$th $=$ 25th
If the scores are arranged in an ordered array, the 25th score of the distribution falls in the class with boundaries 79.5 – 84.5. Hence, 79.5 – 84.5 is the median class.


𝑋𝐿𝐵 = 79.5
𝑛 = 50
< 𝑐𝑓 = 14
𝑓𝑚 = 16
𝑖=5
By substitution,
$\tilde{X} = X_{LB} + \left(\frac{\frac{n}{2} - {<}cf}{f_m}\right) i = 79.5 + \left(\frac{\frac{50}{2} - 14}{16}\right) 5 = 79.5 + \frac{55}{16} = 82.94$, or 83.

Hence, the median grade of 50 students in statistics is Satisfactory.
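The grouped-median computation in Example 7 can be sketched as a short function. It assumes integer class limits (so each class boundary lies 0.5 below the lower limit) and classes listed from lowest to highest; `grouped_median` is a name used only for this illustration.

```python
def grouped_median(classes):
    """classes: list of (lower limit, upper limit, frequency), lowest class first."""
    n = sum(f for _, _, f in classes)
    cumulative = 0                      # <cf of the classes already passed
    for lower, upper, f in classes:
        if cumulative + f >= n / 2:     # the (n/2)th item falls in this class
            lower_boundary = lower - 0.5
            width = upper - lower + 1
            return lower_boundary + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency distribution")

# Grades of the 50 students, listed from the lowest class upward
grades = [(70, 74, 6), (75, 79, 8), (80, 84, 16), (85, 89, 13), (90, 94, 7)]
print(round(grouped_median(grades), 2))   # 82.94
```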

Mode (Population mode: 𝜇̂ , Sample mode: 𝑋̂)

The mode is the observed value that occurs most frequently. It locates the point where the observation values occur with the greatest density. It does not always exist, and if it does, it may not be unique. A data set is said to be unimodal if there is only one mode, bimodal if there are two modes, and multimodal if there are three or more. In some cases all values in a data set have the same frequency. When this occurs, the data set is said to have no mode.
Properties of Mode
a) The mode is found by locating the most frequently occurring value.
b) The mode is the easiest average to compute.
c) There can be more than one mode or even no mode in any given data set.
d) Mode is not affected by the extreme small or large values.
e) Mode can be applied for nominal, ordinal, interval, and ratio data.

Example 8. The eight hospitals described in Example 1 had the following number of ICU admissions: 8, 11, 5,
14, 8, 11, 16, and 11. Find the mode.
Solution
𝜇̂ = 11.0 ICU admissions or 𝑋̂ = 11.0 ICU admissions
Example 9. The reaction times for a random sample of 9 subjects described in Example 6 were recorded as 2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, and 3.4 seconds. Calculate the mode.
Solution
𝜇̂ or 𝑋̂ does not exist since all values have the same frequency.
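Examples 8 and 9 can be checked with a small frequency count. The sketch below follows the convention stated above that a data set has no mode when every value occurs equally often; `modes` is a helper name chosen for this illustration.

```python
from collections import Counter

def modes(data):
    """Return the list of modes, or an empty list when no mode exists."""
    counts = Counter(data)
    highest = max(counts.values())
    # All distinct values share the same frequency: no mode
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)

print(modes([8, 11, 5, 14, 8, 11, 16, 11]))                   # [11]
print(modes([2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 3.4]))   # []
```

A list with one entry corresponds to a unimodal data set, two entries to a bimodal one, and so on.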

For grouped data, the formula for the mode is $\hat{X} = X_{Mo} + \left(\frac{d_1}{d_1 + d_2}\right) i$
where
$X_{Mo}$ = lower class boundary of the modal class
$d_1$ = difference between the frequency of the modal class and that of the immediately preceding (lower) class
$d_2$ = difference between the frequency of the modal class and that of the immediately following (higher) class
$i$ = class width or size

Example 10. Calculate the modal grade of 50 students in statistics in Example 3 and give its description or
interpretation.
Grade 𝑓
90 – 94 7
85 – 89 13
80 – 84 16
75 – 79 8
70 – 74 6

Scale Description
95 – 100 Outstanding
89 – 94 Very Satisfactory
83 – 88 Satisfactory
77 – 82 Fair
70 – 76 Poor
Solution
First, determine the modal class of the distribution, which is the class with the highest frequency. Hence, 80 – 84 (with a frequency of 16) is the modal class.
$X_{Mo} = 79.5$
$d_1 = 16 - 8 = 8$
$d_2 = 16 - 13 = 3$
$i = 5$
By substitution,
$\hat{X} = X_{Mo} + \left(\frac{d_1}{d_1 + d_2}\right) i = 79.5 + \left(\frac{8}{8 + 3}\right) 5 = 79.5 + \frac{40}{11} = 83.14$, or 83.
Hence, the modal grade of the 50 students in statistics is Satisfactory.
Example 11: Find the mean, median, and mode of the following ages in years below.
a) 3, 4, 5, 5, 6, 7, 9, 10, 14
b) 7, 8, 9, 9, 10, 10, 11, 12

Solution
a) 3, 4, 5, 5, 6, 7, 9, 10, 14
Mean: $\bar{X} = \frac{\sum x_i}{n} = \frac{3+4+5+5+6+7+9+10+14}{9} = \frac{63}{9} = 7$ years
Median: Since $N$ is 9 (which is odd), use the formula $\tilde{X} = X_{\left(\frac{N+1}{2}\right)}$.
$\tilde{X} = X_{\left(\frac{9+1}{2}\right)} = X_{(5)} = 6$
Hence, the median is 6 years.
Mode: The mode is 5 since it has the highest frequency (it appears twice in the distribution).

b) 7, 8, 9, 9, 10, 10, 11, 12

Mean: $\bar{X} = \frac{\sum x_i}{n} = \frac{7+8+9+9+10+10+11+12}{8} = \frac{76}{8} = 9.5$ years
Median: Since $N$ is 8 (which is even), use the formula $\tilde{X} = \frac{X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}}{2}$.
$\tilde{X} = \frac{X_{\left(\frac{8}{2}\right)} + X_{\left(\frac{8}{2}+1\right)}}{2} = \frac{X_{(4)} + X_{(5)}}{2} = \frac{9+10}{2} = 9.5$; therefore, the median is 9.5 years.
Mode: The modes are 9 and 10 since they have the highest frequency (each appears twice). The data set is bimodal.

Activity:
Determine the following:
1. Find the value of each of the following expressions using the values of the variables below.
$X_1 = 2, X_2 = 4, X_3 = 6, X_4 = 8, X_5 = 10$.
a. $\sum_{i=2}^{5} X_i$
b. $\sum 5X_i$
c. $\sum X_i^2$
2. Make up your own set of at least five numbers and demonstrate that $\sum X_i^2 \neq \left(\sum X_i\right)^2$.
3. Round off the following numbers to two decimal places (assume digits to the right of those shown are zero):
a. 144.0135 _______________
b. 67.245 _______________
c. 99.707 _______________
d. 13.345 _______________
e. 7.3451 _______________
f. 5.9817 _______________
g. 5.9977 _______________

For each item, answer the following questions. After that, form a small group of four members and share your answers or thoughts about the questions. Then, synthesizing the answers of your group, choose one representative to present them to the class.
1. Why is a certain shoe size or brand of shirt made more available than others?
2. Why does a certain basketball player get more playing time than the rest of his teammates?
3. Have you ever computed the average of your grades because you wanted to compare it with your classmates' grades? How did you compute it?
4. The set of data below shows the scores of 35 students in their periodical test.

34 35 40 40 48 21 9
21 20 19 34 45 21 20
19 17 18 15 16 20 28
21 20 18 17 10 45 48
19 17 29 45 50 48 25

a) What score is typical to the group of students? Why?


b) What score frequently appears?
c) What score appears to be in the middle? How many students fall below this score?


Measures of Dispersion

Computing a measure of variability is important because without it a measure of central tendency provides
an incomplete description of a distribution. The mean, for example, only indicates the central score and where the
most frequent scores are. Thus, to completely describe a set of data, we need to know not only the central tendency
but also how much the individual scores differ from each other and from the center. We obtain this information
by calculating statistics called measures of variability.

Measures of variability (or dispersion) indicate the extent to which individual items in a series are scattered about an average. They are used to determine the extent of the scatter so that steps may be taken to control the existing variation. They are also used as a measure of the reliability of the average value.
Measures of variability describe the extent to which scores in a distribution differ from each other. With
many, large differences among the scores, our statistic will be a larger number, and we say the data are more
variable or show greater variability. Measures of variability communicate three related aspects of the data. First,
the opposite of variability is consistency. Small variability indicates few and/or small differences among the
scores, so the scores must be consistently close to each other (and reflect that similar behaviors are occurring).
Conversely, larger variability indicates that scores (and behaviors) were inconsistent. Second, recall that a score
indicates a location on a variable and that the difference between two scores is the distance that separates them.
From this perspective, by measuring differences, measures of variability indicate how spread out the scores and
the distribution are. Third, a measure of variability tells us how accurately the measure of central tendency
describes the distribution. Our focus will be on the mean, so the greater the variability, the more the scores are
spread out, and the less accurately they are summarized by the one, mean score. Conversely, the smaller the
variability, the closer the scores are to each other and to the mean.

One way to describe variability is to determine how far the lowest score is from the highest score. The
descriptive statistic that indicates the distance between the two most extreme scores in a distribution is called the
range.

Range
Probably the simplest and easiest way to determine measure of dispersion is the range. The range of a set
of measurements is the difference between the largest value and the smallest value. Range (𝑅) = Maximum
value − Minimum value

Example 12. The IQ scores of 5 members of the CHMSC men's varsity basketball team are 108, 112, 127, 116, and 113. Find the range.
Solution: $R = 127 - 108 = 19$

Variance and Standard Deviation


The variance and standard deviation are two measures of variability that indicate how much the scores are
spread out around the mean.
Mathematically, the distance between a score and the mean is the difference between them. Recall that this
difference is symbolized by 𝑋 − 𝑋̅ , which is the amount that a score deviates from the mean. Thus, a score’s
deviation indicates how far it is spread out from the mean. Of course, some scores will deviate by more than
others, so it makes sense to compute something like the average amount the scores deviate from the mean. Let’s
call this the “average of the deviations.” The larger the average of the deviations, the greater the variability.
The sample variance is the average of the squared deviations of scores around the sample mean.
The symbol for the sample variance is 𝑠 2 . Always include the squared sign because it is part of the symbol.
The formula for the variance is similar to the previous formula for the average deviation except that we add the squared sign. The definitional formula for the sample variance is $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$,
where $s^2$ is the sample variance, $s$ is the sample standard deviation, $X$ is the value of any particular observation or measurement, $\bar{X}$ is the sample mean, and $n$ is the sample size.
The measure of variability that more directly communicates the “average of the deviations” is the standard
deviation. The symbol for the sample standard deviation is 𝑠 (which is the square root of the symbol for the sample
variance: √𝑠 2 = 𝑠).


To create the definitional formula here, we simply add the square root sign to the previous defining formula for variance. The definitional formula for the sample standard deviation is $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$.

Example 13: A sample of 5 households showed the following number of household members:
3, 8, 5, 4, and 4. Find the variance and standard deviation.
Solution
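First, solve for the sample mean ($\bar{X}$) and add the columns for $(X - \bar{X})$ and $(X - \bar{X})^2$.

$\bar{X} = \frac{3+8+5+4+4}{5} = \frac{24}{5} = 4.8$

X (X − X̄) (X − X̄)²
3 −1.8 3.24
8 3.2 10.24
5 0.2 0.04
4 −0.8 0.64
4 −0.8 0.64
∑(X − X̄)² = 14.8

Second, solve for the sample variance and sample standard deviation by substitution,

$s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{14.8}{5 - 1} = 3.7$

$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = \sqrt{3.7} \approx 1.92$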
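The definitional formulas can be traced step by step for the household data of Example 13. The sketch below is illustrative only; the variable names are chosen here.

```python
from math import sqrt

members = [3, 8, 5, 4, 4]            # household sizes from Example 13
n = len(members)
x_bar = sum(members) / n             # sample mean

# Squared deviation of each observation from the mean
squared_deviations = [(x - x_bar) ** 2 for x in members]

variance = sum(squared_deviations) / (n - 1)   # s^2, with n - 1 in the denominator
std_dev = sqrt(variance)                       # s

print(x_bar)                               # 4.8
print(round(sum(squared_deviations), 2))   # 14.8
print(round(variance, 2))                  # 3.7
print(round(std_dev, 2))                   # 1.92
```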
Example 14: Find the measures of variability for the grades in Mathematics of the two sample groups of students.
Male: 100, 65, 75, 85, 95 Female: 84, 86, 85, 82, 83
Solution
For Range R,
Male Group, Range R = 100 − 65 = 35 Female Group, Range R = 86 − 82 = 4
For the sample variance and sample standard deviation:
The mean for the male group is $\bar{X} = \frac{65+75+85+95+100}{5} = 84$.
The mean for the female group is $\bar{X} = \frac{82+83+84+85+86}{5} = 84$.

For the male group, the sample variance and sample standard deviation are
$s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{820}{5 - 1} = 205$ and $s = \sqrt{205} \approx 14.32$

For the female group, the sample variance and sample standard deviation are
$s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{10}{5 - 1} = 2.5$ and $s = \sqrt{2.5} \approx 1.58$

Table 1
Statistics Grades of Students When Grouped According to Sex

Group Mean Range Sample Variance Standard Deviation
Male 84 35 205 14.32
Female 84 4 2.5 1.58

Findings. Both groups have the same mean but differ on all measures of variability. The male group is more variable than the female group.

Conclusion. The grades of the female group are less variable than those of the male group because they have a smaller standard deviation. The female group has a more uniform set of grades in Statistics than the male group.
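The entries of Table 1 can be reproduced with Python's `statistics` module, whose `variance` and `stdev` functions use the sample ($n - 1$) formulas. The dictionary layout below is an assumption made for this sketch.

```python
from statistics import mean, stdev, variance

groups = {
    "Male": [100, 65, 75, 85, 95],
    "Female": [84, 86, 85, 82, 83],
}

summary = {}
for name, grades in groups.items():
    summary[name] = {
        "mean": mean(grades),
        "range": max(grades) - min(grades),
        "variance": variance(grades),          # sample variance
        "std_dev": round(stdev(grades), 2),    # sample standard deviation
    }

for name, stats in summary.items():
    print(name, stats)
```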

Measures of Relative Position


When presenting or analyzing data set it is sometimes helpful to group subjects into several equal groups.
For example, to create four equal groups we need the values that split the data such that 25% of the observations
are in each group. The cut off points are called quartiles, and there are three (3) of them (the middle one also being
called the median). The general term for such cut off points is quantiles; other values likely to be encountered are
deciles, which split data into 10 parts, and percentiles, which split the data into 100 parts (also called centiles).
Values such as quartiles can also be expressed as percentiles; for example, the lowest quartile is also the 25th percentile, and the median is the 50th percentile or the 5th decile.

1. Percentiles

Percentiles are values that divide a set of observations in an array into 100 equal parts. Thus, $P_1$, read as first percentile, is the value below which 1% of the values fall; $P_2$, read as second percentile, is the value below which 2% of the values fall; …; $P_{99}$, read as ninety-ninth percentile, is the value below which 99% of the values fall.
Example. The 80th percentile of a distribution is a value such that at least 80 percent of the ordered
observations are less than its value and at least 20 percent of the ordered observations are larger than its value. If
𝑃80 = 75: At least 80% of the ordered observations are less than 75 or at least 20% of the ordered observations
are larger than 75. So any observation that is smaller than 𝑃80 value belongs in the lower 80% of the distribution
while any observation greater than 𝑃80 value belongs in the upper 20% of the distribution.
To compute the $i$th percentile, we have
$P_i$ = the value of the $\left[\frac{i(n+1)}{100}\right]$th observation in the array
Note:
➢ If the computed position $\frac{i(n+1)}{100}$ is a whole number $k$, the $i$th percentile is the average of the $k$th observation and the $(k+1)$th observation.
➢ If the computed position has a fractional value, round it up to the next integer; the $i$th percentile is the observation at that position.

Example 15. The following were the scores of 10 students in a short quiz. Find the 64th percentile.
2 8 6 9 7 5 8 10 10 1
Solution: First arrange the data from lowest to highest.
1 2 5 6 7 8 8 9 10 10
Then, using $P_i = \left[\frac{i(n+1)}{100}\right]$th observation, we have


$P_{64} = \left[\frac{64(10+1)}{100}\right]$th $=$ 7.04th, or the 8th observation (a fractional position is always rounded up to the next whole number).
Since the 8th observation in the ordered array is 9, the 64th percentile of the distribution is 9, which is interpreted as 64% of the scores being below 9.

Approximating the ith Percentile from a Frequency distribution


To solve for the percentile in grouped data, we have
$P_i = X_{P_i} + c\left(\frac{\frac{in}{100} - {<}cf_{P_i}}{f_{P_i}}\right)$
where the $P_i$th class is the class in which the $\left(\frac{in}{100}\right)$th item falls, and
$X_{P_i}$ = the lower class boundary of the $P_i$th class
$c$ = class size of the $P_i$th class
${<}cf_{P_i}$ = less-than cumulative frequency of the class preceding the $P_i$th class
$f_{P_i}$ = frequency of the $P_i$th class
Example 16. Find the 35th percentile of the given frequency distribution of 110 scores in achievement test below.

Score Frequency
50 – 54 10
55 – 59 3
60 – 64 8
65 – 69 13
70 – 74 17
75 – 79 19
80 – 84 22
85 – 89 13
90 – 94 4
95 – 99 1
TOTAL 110

Solution
First, add a column to the frequency distribution table (FDT) for ${<}cf$ and determine the $P_{35}$ class using $\frac{in}{100}$.
Score Frequency < 𝑐𝑓
50 – 54 10 10
55 – 59 3 13
60 – 64 8 21
65 – 69 13 34
70 – 74 17 51
75 – 79 19 70
80 – 84 22 92
85 – 89 13 105
90 – 94 4 109
95 – 99 1 110
TOTAL 110

Using $\frac{in}{100}$, we have $\frac{in}{100} = \frac{35(110)}{100} = 38.5$. Since the 38.5th item falls in the class interval 70 – 74, the $P_{35}$ class is 70 – 74. Therefore, we have
$\frac{in}{100} = 38.5$
$X_{P_{35}} = 69.5$
$c = 5$
${<}cf_{P_{35}} = 34$
$f_{P_{35}} = 17$


By substitution, we have
$P_i = X_{P_i} + c\left(\frac{\frac{in}{100} - {<}cf_{P_i}}{f_{P_i}}\right)$
$P_{35} = 69.5 + 5\left(\frac{38.5 - 34}{17}\right) = 70.82$

Hence, thirty-five percent of the scores in the achievement test are below 70.82.
2. Deciles
Deciles are values that divide the array into 10 equal parts. Thus, $D_1$, read as first decile, is the value below which 10% of the values fall; $D_2$, read as second decile, is the value below which 20% of the values fall; …; $D_9$, read as ninth decile, is the value below which 90% of the values fall.
To compute the $i$th decile, we have
$D_i$ = the value of the $\left[\frac{i(n+1)}{10}\right]$th observation in the array
Example 17. From the given set of scores in a quiz, find the 4th decile or $D_4$.
3 8 9 11 12 18 19
Solution
Since the data are already arranged from lowest to highest, we may proceed to find the 4th decile.
3 8 9 11 12 18 19
Using $D_i = \left[\frac{i(n+1)}{10}\right]$th observation, we have
$D_4 = \left[\frac{4(7+1)}{10}\right]$th observation $=$ 3.2th, or the 4th observation (a fractional position is always rounded up to the next whole number).
Since the 4th observation in the ordered array of the given distribution is 11, the 4th decile of the distribution is 11, which is interpreted as 40% of the scores being below 11.
Approximating the ith Decile from a Frequency distribution
To solve for the decile in grouped data, we have
$D_i = X_{D_i} + c\left(\frac{\frac{in}{10} - {<}cf_{D_i}}{f_{D_i}}\right)$
where the $D_i$th class is the class in which the $\left(\frac{in}{10}\right)$th item falls, and
$X_{D_i}$ = the lower class boundary of the $D_i$th class
$c$ = class size of the $D_i$th class
${<}cf_{D_i}$ = less-than cumulative frequency of the class preceding the $D_i$th class
$f_{D_i}$ = frequency of the $D_i$th class

Example 18. Find the 6th decile of the given frequency distribution of 110 scores in achievement test below.

Score Frequency
50 – 54 10
55 – 59 3
60 – 64 8
65 – 69 13
70 – 74 17
75 – 79 19
80 – 84 22
85 – 89 13
90 – 94 4
95 – 99 1
TOTAL 110

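Solution
Using $\frac{in}{10}$, we have $\frac{in}{10} = \frac{6(110)}{10} = 66$. Since the 66th item falls in the class interval 75 – 79 (the less-than cumulative frequencies are those tabulated in Example 16), the $D_6$ class is 75 – 79. Therefore, we have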
$\frac{in}{10} = 66$
$X_{D_6} = 74.5$
$c = 5$
${<}cf_{D_6} = 51$
$f_{D_6} = 19$
By substitution, we have
$D_6 = X_{D_6} + c\left(\frac{\frac{in}{10} - {<}cf_{D_6}}{f_{D_6}}\right)$
$D_6 = 74.5 + 5\left(\frac{66 - 51}{19}\right) = 78.45$

Hence, sixty percent of the scores in the achievement test are below 78.45.

3. Quartiles

Quartiles are values that divide the array into 4 equal parts. Thus, $Q_1$, read as first quartile, is the value below which 25% of the values fall; $Q_2$, read as second quartile, is the value below which 50% of the values fall; and $Q_3$, read as third quartile, is the value below which 75% of the values fall.
Example 19. From the given set of scores in a quiz, find the 3rd quartile or $Q_3$.
3 8 9 11 12 18 19
Solution
Since the data are already arranged from lowest to highest, we may proceed to find the 3rd quartile.
3 8 9 11 12 18 19
Using $Q_i = \left[\frac{i(n+1)}{4}\right]$th observation, we have
$Q_3 = \left[\frac{3(7+1)}{4}\right]$th observation $=$ 6th observation.
Since the 6th observation in the ordered array of the given distribution is 18, the 3rd quartile of the distribution is 18, which is interpreted as 75% of the scores being below 18.
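The position rule used in Examples 15, 17, and 19 differs only in the divisor (100, 10, or 4), so one small function can sketch all three. This sketch follows the worked examples: a fractional position is rounded up, and a whole-number position is used as is (as in Example 19); the averaging case described in the Note under Percentiles is not applied here, which is an assumption of this sketch. `quantile_by_position` is a name chosen for illustration.

```python
def quantile_by_position(data, i, parts):
    """Value at position i(n + 1)/parts of the ordered array, rounded up if fractional."""
    array = sorted(data)
    n = len(array)
    # Integer ceiling of i(n + 1)/parts, avoiding floating-point rounding
    position = (i * (n + 1) + parts - 1) // parts
    if not 1 <= position <= n:
        raise ValueError("position falls outside the array")
    return array[position - 1]       # positions count from 1, indexes from 0

quiz_a = [2, 8, 6, 9, 7, 5, 8, 10, 10, 1]    # Example 15
quiz_b = [3, 8, 9, 11, 12, 18, 19]           # Examples 17 and 19

print(quantile_by_position(quiz_a, 64, 100))   # 9   (64th percentile)
print(quantile_by_position(quiz_b, 4, 10))     # 11  (4th decile)
print(quantile_by_position(quiz_b, 3, 4))      # 18  (3rd quartile)
```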
Approximating the ith Quartile from a Frequency distribution
To solve for the quartile in grouped data, we have
$Q_i = X_{Q_i} + c\left(\frac{\frac{in}{4} - {<}cf_{Q_i}}{f_{Q_i}}\right)$
where the $Q_i$th class is the class in which the $\left(\frac{in}{4}\right)$th item falls, and
$X_{Q_i}$ = the lower class boundary of the $Q_i$th class
$c$ = class size of the $Q_i$th class
${<}cf_{Q_i}$ = less-than cumulative frequency of the class preceding the $Q_i$th class
$f_{Q_i}$ = frequency of the $Q_i$th class

Example 20. Find the 1st quartile of the given frequency distribution of 110 scores in achievement test below.

Score Frequency
50 – 54 10
55 – 59 3
60 – 64 8
65 – 69 13
70 – 74 17
75 – 79 19
80 – 84 22
85 – 89 13
90 – 94 4
95 – 99 1
TOTAL 110
Solution
Using $\frac{in}{4}$, we have $\frac{in}{4} = \frac{1(110)}{4} = 27.5$. Since the 27.5th item falls in the class interval 65 – 69, the $Q_1$ class is 65 – 69. Therefore, we have
$\frac{in}{4} = 27.5$
$X_{Q_1} = 64.5$
$c = 5$
${<}cf_{Q_1} = 21$
$f_{Q_1} = 13$
By substitution, we have
$Q_1 = X_{Q_1} + c\left(\frac{\frac{in}{4} - {<}cf_{Q_1}}{f_{Q_1}}\right)$
$Q_1 = 64.5 + 5\left(\frac{27.5 - 21}{13}\right) = 67$
Hence, 25% of the scores in the achievement test are below 67.
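The grouped-data formulas for percentiles, deciles, and quartiles share one pattern: locate the class containing the $\frac{in}{k}$th item (with $k$ = 100, 10, or 4) and interpolate within it. A minimal sketch under that reading, assuming integer class limits and classes listed from lowest to highest, is given below; `grouped_quantile` is a hypothetical helper name.

```python
def grouped_quantile(classes, i, parts):
    """classes: list of (lower limit, upper limit, frequency), lowest class first."""
    n = sum(f for _, _, f in classes)
    target = i * n / parts               # in/100, in/10, or in/4
    cumulative = 0                       # <cf of the classes already passed
    for lower, upper, f in classes:
        if cumulative + f >= target:     # target item falls in this class
            lower_boundary = lower - 0.5
            class_size = upper - lower + 1
            return lower_boundary + class_size * (target - cumulative) / f
        cumulative += f
    raise ValueError("target position exceeds the total frequency")

# The 110 achievement-test scores used in Examples 16, 18, and 20
scores = [(50, 54, 10), (55, 59, 3), (60, 64, 8), (65, 69, 13), (70, 74, 17),
          (75, 79, 19), (80, 84, 22), (85, 89, 13), (90, 94, 4), (95, 99, 1)]

print(round(grouped_quantile(scores, 35, 100), 2))   # 70.82  (P35)
print(round(grouped_quantile(scores, 6, 10), 2))     # 78.45  (D6)
print(round(grouped_quantile(scores, 1, 4), 2))      # 67.0   (Q1)
```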

4. 𝒛 −Score
A $z$-score is used to determine the position of one observation relative to the others in a set of data. Let us say we want to know how a student's score of 42, on a quiz with a total of 50 points, compares with the scores of the other students in the class. The mean and the standard deviation of the scores can be used to compute a $z$-score, which measures the relative standing of a measurement in a data set.
A $z$-score measures the distance between an observation and the mean, measured in units of standard deviation. The following formulas show how to compute the $z$-score for a data value $X$ in a population and in a sample.
For population: $z = \frac{X - \mu}{\sigma}$
For sample: $z = \frac{X - \bar{X}}{s}$

Example 21: The monthly expenditures of a large group of households have a mean of ₱48,700 and a standard deviation of ₱10,400. What are the $z$-values of monthly expenditures of ₱59,400 and ₱38,300?
Solution
Let $\mu$ = ₱48,700 and $\sigma$ = ₱10,400. The $z$-values for the two $X$ values (₱59,400 and ₱38,300) are computed as follows:
For ₱59,400: $z = \frac{X - \mu}{\sigma} = \frac{59{,}400 - 48{,}700}{10{,}400} = \frac{10{,}700}{10{,}400} \approx 1.03$
For ₱38,300: $z = \frac{X - \mu}{\sigma} = \frac{38{,}300 - 48{,}700}{10{,}400} = \frac{-10{,}400}{10{,}400} = -1.00$
The $z$ of about 1.03 indicates that a monthly expenditure of ₱59,400 is slightly more than one standard deviation above the mean, and the $z$ of −1.00 shows that a ₱38,300 monthly expenditure is exactly one standard deviation below the mean. Note that the two monthly expenditures lie at nearly the same distance from the mean: ₱10,700 above it and ₱10,400 below it.

Example 22: Raul has taken two tests in his mathematics class. He scored 72 on the first test, for which the mean
of all scores was 65 and the standard deviation was 8. He received a 60 on a second test, for which the mean of
all scores was 45 and the standard deviation was 12. In comparison to the other students, did Raul do better on
the first test or the second test?
Solution: Find the $z$-score for each test.
$z_{72} = \frac{72 - 65}{8} = 0.875$ and $z_{60} = \frac{60 - 45}{12} = 1.25$
Raul scored 0.875 standard deviation above the mean on the first test and 1.25 standard deviations above the mean
on the second test. These 𝑧 −scores indicate that, in comparison to his classmates, Raul scored better on the
second test than he did on the first test.
Example 23: A consumer group tested a sample of 100 light bulbs. It found that the mean life expectancy of the bulbs was 842 h, with a standard deviation of 90 h. One particular bulb from the DuraBright Company had a $z$-score of 1.2. What was the life span of this light bulb?


Solution: Substitute the given values into the $z$-score equation and solve for $X$.
$z = \frac{X - \bar{X}}{s}$
$1.2 = \frac{X - 842}{90}$
$108 = X - 842$
$950 = X$
The light bulb had a life span of 950 h.
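The $z$-score formula and its inverse (recovering $X$ from a given $z$) can be sketched with two one-line functions, checked here against Examples 22 and 23. The function names are chosen for this illustration.

```python
def z_score(x, mean, std_dev):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / std_dev

def value_from_z(z, mean, std_dev):
    """Solve z = (x - mean) / std_dev for x."""
    return mean + z * std_dev

# Example 22: Raul's two tests
first_test = z_score(72, 65, 8)
second_test = z_score(60, 45, 12)
print(first_test, second_test)         # 0.875 1.25
print(second_test > first_test)        # True: relatively better on the second test

# Example 23: life span of the bulb with z = 1.2
print(round(value_from_z(1.2, 842, 90), 2))   # 950.0
```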

5. Box-and-Whisker Plot
A box-and-whisker plot (sometimes called a boxplot) is often used to provide a visual summary of a set of data. It is a graph of a data set obtained by drawing a horizontal line from the minimum data value to the first quartile ($Q_1$), drawing a horizontal line from the third quartile ($Q_3$) to the maximum data value, and drawing a box whose vertical sides pass through $Q_1$ and $Q_3$, with a vertical line inside the box passing through the median or second quartile ($Q_2$).

The boxplot will give the following information:


a) If the median is near the center of the box, the distribution is approximately symmetric.
b) If the median falls to the right of the center of the box, the distribution is negatively skewed.
c) If the median falls to the left of the center of the box, the distribution is positively skewed.
d) If the lines (whiskers) are about the same length, the distribution is approximately symmetric.
e) If the left line is longer than the right line, the distribution is negatively skewed.
f) If the right line is longer than the left line, the distribution is positively skewed.

Example 24: Construct a boxplot for the data set of the ages of 9 middle-management employees of a certain
company. The ages are 53, 45, 59, 48, 54, 46, 51, 58, and 55. What can you say about the distribution of the
data set?
Solution:
Step 1: Determine the 𝑄1, Median, and 𝑄3 of the given data set. Recall that 𝑄1 = 47, Median = 53, and 𝑄3 =
56.5.
Step 2: Locate the lowest value, 𝑄1, the median, 𝑄3 , and the highest value on the scale.
Step 3: Draw a box around 𝑄1 and 𝑄3 , draw a vertical line through the median, and connect the upper and lower
values, as shown in the figure below.
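The five values on which the boxplot of Example 24 is built can be checked with Python's `statistics.quantiles`. Its default `"exclusive"` method places the $i$th quartile at position $\frac{i(n+1)}{4}$ and interpolates between neighboring observations, which reproduces the quartiles quoted in Step 1; treating that as the intended method is an assumption of this sketch.

```python
from statistics import quantiles

ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]

# quantiles() sorts the data itself; n=4 returns the three quartile cut points
q1, median, q3 = quantiles(ages, n=4)
five_number_summary = (min(ages), q1, median, q3, max(ages))

print(five_number_summary)   # (45, 47.0, 53.0, 56.5, 59)
```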

Activity:

Solve the following problems and show complete solution.


1. The fuel efficiency, in miles per gallon, of 15 small utility trucks was measured. The results are listed below. Find the mean, median, mode, range, variance, and standard deviation of these data. Fuel efficiency (mpg): 22, 25, 23, 27, 15, 24, 24, 32, 23, 22, 25, 22, 26, 21, 20
2. A professor grades students on 5 tests, a project, and a final examination. Each test counts as 10% of the course grade. The project counts as 20% of the course grade. The final examination counts as 30% of the course grade. Samantha has test scores of 70, 65, 82, 94 and 85. Samantha's project score is 92. Her final examination score is 80. Use the weighted mean (average) formula to find Samantha's average for the course. (Hint: The sum of all weights is 100% or 1)
3. A random sample of 80 tires showed that the mean mileage per tire was 41,700 miles, with a standard deviation of
4,300 miles.
a) Determine the 𝑧-score, to the nearest hundredth, for a tire that provided 46,300 miles of wear.
b) The 𝑧-score for one tire was −2.44. What mileage did this tire provide? Round your result to the nearest hundred
miles.


4. The table below shows the heights, in inches, of 15 randomly selected National Basketball Association (NBA) players
and 15 randomly selected Division I National Collegiate Athletic Association (NCAA) players.
NBA 84, 76, 79, 75, 81, 81, 76, 85,
78, 79, 78, 78, 84, 75, 76
NCAA 78, 73, 73, 78, 77, 76, 75, 74,
74, 81, 75, 78, 78, 79, 73
Using the same scale, draw a box-and-whisker plot for each of the two data sets, placing the second plot below the first.
Write a valid conclusion based on the data.
5. A survey of 16 energy drinks noted the caffeine concentration of each drink in milligrams per ounce. The results are
given below. Find the mean, median, mode, range, variance, and standard deviation of these data. Concentration of
caffeine (mg/oz): 9.1, 7.8, 7.5, 8.9, 9.0, 8.2, 9.1, 8.7, 9.0, 7.7, 8.8, 8.9, 9.0, 9.1, 8.2, 8.9, 7.0
6. Given the data set below, find the mean, median and mode.
Frequency Distribution of Grades in College Algebra
Grade Number of Students
90 – 100 9
80 - 89 30
70 – 79 35
60 – 69 8
50 – 59 9
40 – 49 2
30 – 39 3
20 – 29 1
10 – 19 2
0–9 1
Total 100
7. A professor grades students on 4 tests, a term paper, and a final examination. Each test counts as 15% of the course
grade. The term paper counts as 20% of the course grade. The final examination counts as 20% of the course grade.
Alan has test scores of 80, 78, 92, and 84. Alan received an 84 on his term paper. His final examination score was 88.
Use the weighted mean (average) formula to find Alan’s average for the course. (Hint: The sum of all weights is 100%
or 1)
8. A psychologist obtained the IQ scores of 10 students. The IQ scores are as follows:
110 95 85 140 132 100 95 70 85 100
Find P65, D3, D9 and Q3. Interpret the values.
9. A test involving 380 men ages 20 to 24 found that their blood cholesterol levels had a mean of 182 mg/dl and a standard
deviation of 44.2 mg/dl.
a) Determine the 𝑧-score, to the nearest hundredth, for one of the men who had a blood cholesterol level of 214 mg/dl.
b) The 𝑧-score for one man was −1.58. What was his blood cholesterol level? Round to the nearest hundredth.
10. The blood lead concentrations, in micrograms per deciliter (𝜇𝑔/𝑑𝐿), of 20 children from two different neighborhoods
were measured. The results are recorded in the table.

Neighborhood 1 3.97, 3.91, 3.98, 3.70, 4.13, 3.97, 4.01, 3.88, 4.11, 3.70, 3.96,
3.77, 4.30, 4.08, 4.12, 4.93, 3.93, 3.94, 3.85, 3.83
Neighborhood 2 4.31, 4.22, 3.78, 4.10, 4.34, 4.20, 4.35, 4.20, 4.01, 4.04, 4.28,
4.12, 4.59, 4.12, 4.01, 3.85, 3.96, 4.28, 4.39, 4.13

Using the same scale, draw a box-and-whisker plot for each of the two data sets, placing the second plot below the first.
Considering that high blood lead concentrations are harmful to humans, in which of the two neighborhoods would you
prefer to live?
11. Find the P20, D4, D6 and Q3 of the following distribution of the ages of the members of a labor union. Interpret the
values.
AGE (Years) FREQUENCY
15 – 19 18
20 – 24 42
25 – 29 78
30 – 34 115
35 – 39 178
40 – 44 107
45 – 49 88
50 – 54 52
55 – 59 30
60 – 64 11
TOTAL 719

Lesson 3: Probabilities and Normal Distributions


One of the most important statistical distributions of data is known as a normal distribution. This
distribution occurs in a variety of applications. Types of data that may demonstrate a normal distribution include
the lengths of leaves on a tree, the weights of newborns in a hospital, the lengths of time of a student’s trip from
home to school over a period of months, the SAT scores of a large group of students, and the life spans of light
bulbs.
The accompanying table shows the scores of 50 students in a Mathematics examination. Using the data, construct a
histogram and answer the questions that follow.
Score interval Frequency
15 – 19 2
20 – 24 6
25 – 29 10
30 – 34 14
35 – 39 10
40 – 44 6
45 – 50 2

1. What are the mean, median, and mode of the given data? What can you say about the values of the three measures of
central tendency?
2. What is the maximum data value as shown on the histogram? (What is the largest value on the data axis?)
3. What is the minimum data value as shown on the histogram? (What is the smallest value on the data axis?)
4. Is the histogram symmetric, skewed to the left, skewed to the right, bell-shaped, uniform or does it have no special shape?
5. How many peaks does the histogram have, and where are they located? (Peaks are bars with shorter bars on each side. First bars that
are taller than second bars or last bars that are taller than the preceding are also called peaks. Two or more adjacent bars of the same
height with neighboring shorter bars - a plateau - would be considered one peak.)

According to Frost (2020), the normal distribution is the most important probability distribution in
statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and
IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve. The
normal distribution is a probability function that describes how the values of a variable are distributed. It is a
symmetric distribution where most of the observations cluster around the central peak and the probabilities for
values further away from the mean taper off equally in both directions. Extreme values in both tails of the
distribution are similarly unlikely (Frost, 2020). The standard normal distribution is the distribution of a normal
random variable with a mean equal to zero (𝜇 = 0) and a standard deviation equal to one (𝜎 = 1). It is represented
by the standard normal curve.
The graph of a normal distribution depends on two factors: the mean and the
standard deviation. The mean of the distribution determines the location of the center of the graph, and the
standard deviation determines the height and width of the graph. The graphs of normal distributions are
symmetric, bell-shaped curves, as shown below:

When the standard deviation is small, the curve is tall and narrow; and when the standard deviation is big, the
curve is short and wide (see above).
Definition. Central Limit Theorem. If 𝑛 (the sample size) is large, the theoretical sampling distribution of the
mean can be approximated closely with a normal distribution.
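
The theorem can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings that do not come from the text: samples of size 30 are drawn from a uniform distribution on the interval 0 to 1, the process is repeated 2,000 times, and a fixed seed keeps the run repeatable. If the theorem holds, the sample means should center near the population mean 0.5 and their spread should be close to 𝜎/√𝑛.

```python
import random
import statistics

def sample_means(n, trials, seed=0):
    """Draw `trials` samples of size n from Uniform(0, 1) and return their means."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    return [statistics.fmean([rng.random() for _ in range(n)]) for _ in range(trials)]

means = sample_means(n=30, trials=2000)

center = statistics.fmean(means)   # expected to be close to the population mean 0.5
spread = statistics.stdev(means)   # expected to be close to sigma / sqrt(n)

# For Uniform(0, 1), sigma = sqrt(1/12); the theorem predicts sigma / sqrt(30).
expected_spread = (1 / 12) ** 0.5 / 30 ** 0.5

print(round(center, 3), round(spread, 3), round(expected_spread, 3))
```

A histogram of `means` would look roughly bell-shaped even though the individual values come from a flat (uniform) distribution.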
Properties of a Normal Distribution
Every normal distribution, once standardized to 𝑧-scores, has the following properties.
a) It is symmetrical about the vertical line drawn through 𝑧 = 0. This means that the shape of the distribution at the
right is a mirror image of the left.
b) The highest point of the standard normal curve is 𝑦 ≈ 0.3989 (that is, 1/√(2𝜋)), attained at 𝑧 = 0.


c) The curve is asymptotic to the x-axis. This means that both positive and negative ends approach the horizontal
axis but do not touch it.
d) For all practical purposes, the area under the normal curve from 𝑧 = −3 to 𝑧 = +3 equals 1.0 or 100%, hence the
term unit normal curve.
e) The three measures of central tendency (mean, median, and mode) coincide with each other.
f) It is dependent on the values of the mean and standard deviation.

In a normal curve, approximately


• 68% of the area under the curve falls within 1 standard
deviation of the mean
• 95% of the area under the curve falls within 2 standard
deviations of the mean
• 99.7% of the area under the curve falls within 3 standard
deviations of the mean.
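
These three percentages can be checked numerically. The following is a minimal sketch that assumes the standard relationship between the normal curve and the error function `erf` in Python's `math` module; the function name `area_within` is chosen here only for illustration.

```python
import math

def area_within(k):
    """Area under the standard normal curve between z = -k and z = +k."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(area_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

The rounded figures 68%, 95%, and 99.7% quoted above are the usual approximations of these values.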

Areas under the normal curve may represent several things


1) Probability of an event
2) Percentile rank of the score
3) Percentage distribution of the whole population

Example 1: The area under a bell curve gives the probability that a randomly caught fish’s weight is
somewhere in a given interval. Thus, there is a big chance of catching a fish whose weight is somewhere
between 450 grams to 550 grams. Moreover, there is a small chance of catching a fish that weighs more than
550 grams. The question is how does one get the area under that curve? The z-Table gives the answer to this
question.


To do that, standardize the normal curve. Then refer to the z-Table to obtain the value. There is a need to
standardize a normal variable. Without this process, finding the area under a particular curve is close to
impossible. Setting up table of values for a normal variable just like the z-table is very difficult; added to this
burden is the number of countless possibilities for the mean and standard deviation of a normal variable. To
simplify the task of getting area, refer to the z-table (area under the normal curve) above. The z-table (area
under the normal curve) has the following properties.
1) The total area under the normal curve is 1 or 100%.
2) Since the normal curve is symmetrical about the mean, then half the normal curve has an area of 0.5.
3) The table on the next page gives only the area to the right of the mean.
4) The given area in the table is the area from 𝑧 = 0 to ±𝑧.
5) The area is always positive, but 𝑧 can be either positive or negative.
6) Always draw the curve and shade the given region.
7) Simple addition and subtraction are the only operations needed to get the correct area.

The normal random variable of a standard normal distribution is called a standard score or a 𝑧-score. Every
normal random variable 𝑋 can be transformed into a 𝑧-score using the following equation
$$z = \frac{X - \mu}{\sigma}$$
where 𝑋 is a normal random variable, 𝜇 is the mean of 𝑋, and 𝜎 is the standard deviation of 𝑋.
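
As a quick illustration of the transformation, the sketch below wraps the formula in a small function. The function name and the sample values (a mean of 60 and a standard deviation of 5) are assumptions made only for this example.

```python
def z_score(x, mean, std_dev):
    """Standardize x: how many standard deviations x lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev

# Illustrative values (assumed, not taken from the text)
print(z_score(70, 60, 5))    # 2.0  -> two standard deviations above the mean
print(z_score(52.5, 60, 5))  # -1.5 -> one and a half standard deviations below
```

A positive 𝑧-score lies to the right of the mean and a negative 𝑧-score lies to the left.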
Example 2: Determine the area under the normal curve between 𝑧 = 0 and 𝑧 = 1.85.
Solution: Draw the figure and represent the area.

Since the z-table gives the area between 0 and any 𝑧-value to the right of 0, we only need to look up the 𝑧-value in
the table. Find 1.8 in the left column and 0.05 in the top row. The value where the column and row meet in the
table is the answer, 0.4678.
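
The table lookup can be cross-checked by computation. This sketch assumes the error function `math.erf` as a stand-in for the printed z-table; the helper name `area_zero_to` is hypothetical.

```python
import math

def area_zero_to(z):
    """Area under the standard normal curve between 0 and z (the quantity the z-table lists)."""
    # abs() is used because, by symmetry, -z and +z enclose the same area with 0.
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

print(round(area_zero_to(1.85), 4))  # 0.4678
```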


Example 3: Determine the area under the normal curve between 𝑧 = 0 and 𝑧 = −1.15.
Solution: Draw the figure and represent the area.

By symmetry, the area between 𝑧 = −1.15 and 𝑧 = 0 equals the area between 𝑧 = 0 and 𝑧 = 1.15, which the z-table
gives as 0.3749. Hence, 𝑃(−1.15 < 𝑧 < 0) = 0.3749. Therefore, the area is 0.3749 or 37.49%.

Example 4: Find the area under the normal curve to the right of 𝑧 = 1.15.
Solution: Draw the figure and represent the area.

The required area is the right tail of the normal curve. Since the z-table gives the area between 𝑧 = 0 and 𝑧 = 1.15,
first find that area.
𝑃(0 < 𝑧 < 1.15) = 0.3749
Then subtract 𝑃(0 < 𝑧 < 1.15) = 0.3749 from 0.5000, since half of the area under the normal curve is to the right of
𝑧 = 0.
𝑃(𝑧 > 1.15) = 0.5000 − 𝑃(0 < 𝑧 < 1.15)
= 0.5000 − 0.3749
= 0.1251
Therefore, the area to the right of 𝑧 = 1.15 is 0.1251 or 12.51%.
Example 5: Determine the area under the normal curve between 𝑧 = 0.75 and 𝑧 = 1.85.

Solution: 𝑃(0 < 𝑧 < 1.85) = 0.4678 and 𝑃(0 < 𝑧 < 0.75) = 0.2734
Since both 𝑧-values are on the same side of 𝑧 = 0, subtract the smaller area from the larger one.
Hence, 𝑃(0.75 < 𝑧 < 1.85) = 𝑃(0 < 𝑧 < 1.85) − 𝑃(0 < 𝑧 < 0.75) = 0.4678 − 0.2734 = 0.1944
Therefore, the area is 0.1944 or 19.44%.


Example 6: Find the area under the normal curve between 𝑧 = −1.85 and 𝑧 = 1.15.
Solution: 𝑃(−1.85 < 𝑧 < 0) = 0.4678 and 𝑃(0 < 𝑧 < 1.15) = 0.3749

Since the two areas are on opposite sides of 𝑧 = 0, we must find both areas and add them.
𝑃(−1.85 < 𝑧 < 1.15) = 𝑃(−1.85 < 𝑧 < 0) + 𝑃(0 < 𝑧 < 1.15)
= 0.4678 + 0.3749 = 0.8427
Hence, the total area is 0.8427 or 84.27%.
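
The tail, same-side, and opposite-side cases in Examples 4 to 6 all reduce to a difference of cumulative areas. The sketch below assumes a cumulative function `phi` built from `math.erf`, which is not part of the text's table-based method. Because the z-table entries are rounded to four decimal places, direct computation can differ from the table arithmetic by one unit in the last digit.

```python
import math

def phi(z):
    """Cumulative area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def area_between(z_low, z_high):
    """Area under the standard normal curve between z_low and z_high."""
    return phi(z_high) - phi(z_low)

right_tail = 1 - phi(1.15)                    # Example 4: table method gives 0.1251
between_same_side = area_between(0.75, 1.85)  # Example 5: table method gives 0.1944
between_opposite = area_between(-1.85, 1.15)  # Example 6: table method gives 0.8427

print(round(right_tail, 4), round(between_same_side, 4), round(between_opposite, 4))
```

With a cumulative function, there is no need to decide whether to add or subtract: the area between two 𝑧-values is always the left-area of the larger minus the left-area of the smaller.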

Example 7: Find the 𝑧-value such that the area under the normal curve between 0 and the 𝑧-value is 0.3962.
Solution: Find the area 0.3962 in the body of the z-table. Then read the corresponding 𝑧-value: 1.2 in the left column
and 0.06 in the top row. Add these two values to get 1.26.

Hence, the 𝑧-value is 1.26.
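
The reverse lookup can also be done by computation. This sketch assumes Python's `statistics.NormalDist` (available in Python 3.8 and later) in place of the printed table; the helper name is hypothetical. Since the table area starts at 0, the left half of the curve (0.5) is added before inverting.

```python
from statistics import NormalDist

def z_for_area_from_zero(area):
    """Positive z such that the area between 0 and z equals `area`."""
    if not 0 <= area < 0.5:
        raise ValueError("area between 0 and z must be in [0, 0.5)")
    # Total area to the left of z = left half (0.5) + area from 0 to z.
    return NormalDist().inv_cdf(0.5 + area)

print(round(z_for_area_from_zero(0.3962), 2))  # 1.26
```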

Application of Normal Distribution


Example 8: Suppose that the data concerning the first-year salaries of CHMSC graduates are normally
distributed with the population mean 𝜇 = ₱60,000 and the population standard deviation 𝜎 = ₱15,000. Find
the probability of a randomly selected CHMSC graduate earning less than ₱45,000 annually. To answer this
question, we have to find the portion of the area under the normal curve from ₱45,000 all the way to the left.


Solution: We first have to transform the given value into a 𝑧-score.
Using the transformation formula, find the value of 𝑧 and then find the number that corresponds to that 𝑧 in the
body of the 𝑧-table:
$$z = \frac{X - \mu}{\sigma} = \frac{45{,}000 - 60{,}000}{15{,}000} = \frac{-15{,}000}{15{,}000} = -1$$
The z-table value that corresponds to the area from 𝑧 = −1 to 𝑧 = 0 is 0.3413. Since we are looking for the
percentage of graduates earning less than ₱45,000, that is, the area to the left of 𝑧 = −1,
𝑃(𝑧 < −1) = 0.5000 − 0.3413 = 0.1587
Therefore, the percentage of CHMSC graduates earning less than ₱45,000 a year is 15.87%.
Example 9: The average Pag-ibig salary loan for RFS Pharmacy Inc. employees is ₱23,000. If the debt is
normally distributed with a standard deviation of ₱2,500, find the probability that the employee owes less than
₱18,500.
Solution:
First, draw a figure and represent the area.

Second, find the 𝑧-value for ₱18,500.
$$z = \frac{X - \mu}{\sigma} = \frac{18{,}500 - 23{,}000}{2{,}500} = \frac{-4{,}500}{2{,}500} = -1.80$$
Third, find the appropriate area. The area obtained from the 𝑧-table is 0.4641, which corresponds to the area
between 𝑧 = 0 and 𝑧 = −1.80.
𝑃 (−1.80 < 𝑧 < 0) = 0.4641
Fourth, subtract 0.4641 from 0.5000.
𝑃 (𝑧 < −1.80) = 0.5000 − 𝑃(−1.80 < 𝑧 < 0) = 0.5000 − 0.4641 = 0.0359.
Hence, the probability that the employee owes less than ₱18,500 in Pag-ibig salary loan is 0.0359 or 3.59%.
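
The left-tail probabilities in Examples 8 and 9 can be verified without a table. The sketch below assumes `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name `prob_less_than` is chosen only for this illustration.

```python
from statistics import NormalDist

def prob_less_than(x, mean, std_dev):
    """P(X < x) for a normal variable X with the given mean and standard deviation."""
    return NormalDist(mu=mean, sigma=std_dev).cdf(x)

# Example 8: first-year salaries, mean 60,000 and standard deviation 15,000
print(round(prob_less_than(45_000, 60_000, 15_000), 4))  # 0.1587

# Example 9: salary loans, mean 23,000 and standard deviation 2,500
print(round(prob_less_than(18_500, 23_000, 2_500), 4))   # 0.0359
```

The standardization step is carried out internally, so the raw values can be passed directly.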
Example 10: The average age of bank managers is 40 years. Assume the variable is normally distributed. If the
standard deviation is 5 years, find the probability that the age of a randomly selected bank manager will be in
the range between 35 and 46 years old.
Solution:
Assume that the ages of bank managers are normally distributed; the cut-off points are then as shown in the figure.
First, draw a figure and represent the area.

Second, find the two 𝑧-values.
$$z = \frac{X - \mu}{\sigma} = \frac{35 - 40}{5} = \frac{-5}{5} = -1.00$$
$$z = \frac{X - \mu}{\sigma} = \frac{46 - 40}{5} = \frac{6}{5} = 1.20$$
Third, find the appropriate areas for 𝑧 = −1.00 and 𝑧 = 1.20 in the 𝑧-table.
𝑃(−1.00 < 𝑧 < 0) = 0.3413 and 𝑃(0 < 𝑧 < 1.20) = 0.3849


Fourth, add 𝑃(−1.00 < 𝑧 < 0) and 𝑃(0 < 𝑧 < 1.20).
𝑃(35 < 𝑋 < 46) = 𝑃(−1.00 < 𝑧 < 1.20) = 𝑃(−1.00 < 𝑧 < 0) + 𝑃(0 < 𝑧 < 1.20) = 0.3413 + 0.3849
= 0.7262
Hence, the probability that a randomly selected bank manager is between 35 and 46 years old is 0.7262 or
72.62%.
Example 11: The Emotional Quotient (EQ) score on the latest version of the Sirug EQ Test is transformed so as to
follow a normal distribution with a mean of 110 and a standard deviation of 15. Find the 96th percentile of the
distribution of Emotional Quotient scores.
Solution:
First, sketch the graph to illustrate the scenario.

Second, find the value of 𝑧 corresponding to the 96th percentile.
The 96th percentile has 96% of the area to its left, so the area between 0 and 𝑧 is 0.96 − 0.50 = 0.46. Using the
z-table, the 𝑧-value whose area is closest to 0.46 is 1.75 (area 0.4599). Hence, 𝑃(𝑧 < 1.75) ≈ 0.96.
Third, substitute 𝑧 = 1.75, 𝜎 = 15, and 𝜇 = 110 into the transformation formula and solve for 𝑋.
$$z = \frac{X - \mu}{\sigma}$$
Solving for 𝑋 in terms of the other variables,
𝑋 = 𝑧𝜎 + 𝜇 = (1.75)(15) + 110 = 136.25
Thus, about 96% of all people have EQ scores below 136.25.
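
A percentile can also be obtained by inverting the cumulative area directly. This is a sketch that assumes `statistics.NormalDist.inv_cdf` from the Python standard library rather than the z-table. Because 𝑧 is not rounded to 1.75 here, the result can differ slightly from 136.25 in the second decimal place.

```python
from statistics import NormalDist

eq_scores = NormalDist(mu=110, sigma=15)

z_96 = NormalDist().inv_cdf(0.96)  # z-value with 96% of the area to its left
x_96 = eq_scores.inv_cdf(0.96)     # the same percentile on the EQ scale

print(round(z_96, 2), round(x_96, 2))
```

The two results are linked by the rearranged transformation formula 𝑋 = 𝑧𝜎 + 𝜇.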


Activity:

1. Find the following probability.


a) 𝑃(𝑧 < −1.47)
b) 𝑃(−1.89 < 𝑧 < −1.32)
c) 𝑃(𝑧 > 1.72)
d) 𝑃(𝑧 < 2.05)
e) 𝑃(−1.04 < 𝑧 < 0.56)
2. Find the following probability.
a) 𝑃(𝑧 < −0.63)
b) 𝑃(−0.76 < 𝑧 < −0.41)
c) 𝑃(𝑧 > 0.732)
d) 𝑃(𝑧 < 1.76)
e) 𝑃(−1.32 < 𝑧 < 1.62)
3. Past records show that the average daily consumption of a person who drinks beer is 500 ml with a standard deviation
of 25 ml.
a) What’s the chance that the average daily consumption of a person who drinks is less than 500 ml?
b) If a random person who drinks is chosen, what is the chance that he consumes less than 575 ml but greater than
500 ml?
4. In a math competition, past records show that the average score of competitors is 78 with a standard deviation of 8.
a) What is the chance that a competitor’s score is less than 70?
b) Suppose an outstanding merit award is offered for any competitor who scored more than 90. What is the chance of
winning the award?
c) What’s the chance of scoring greater than 75 but less than 85?
5. A company produces different types of energy drinks. The filling machines are adjusted to pour 500 milliliters (ml) of
energy drinks into each plastic bottle. Nonetheless, the actual amount of energy drink poured into each bottle is not
exactly 500 ml., it varies from bottle to bottle. It has been observed that the amount of energy drink in a bottle is normally
distributed with a mean of 500 ml and a standard deviation of 4.75 ml. What percentage of the energy drink bottles
contains 505 to 513 milliliters?

6. The average daily jail population in the New Bilibid Prison in Muntinlupa City is 36,290. If the distribution is normal
and the standard deviation is 3,750, find the probability that on a randomly selected day, the jail population is greater
than 40,145.
7. The average credit card debt for a public school teacher is ₱14,970. If the debt is normally distributed with a standard
deviation of ₱5,650, find the probabilities: (a) that the teacher owes at least ₱6,740, (b) that the teacher owes more than
₱19,270, and (c) that the teacher owes between ₱6,740 and ₱19,270.
8. To qualify for a Master’s degree program in Business Administration at San Sebastian College, candidates must score
in the top 20% on a mental ability test. The test has a mean of 180 and a standard deviation of 25. Find the lowest
possible score to qualify. Assume the test scores are normally distributed.


Lesson 4: Linear Regression and Correlation

Linear Regression
When performing research studies, scientists often wish to know whether two variables are related. If the
variables are determined to be related, a scientist may then wish to find an equation that can be used to model
the relationship. For instance, the zoology professor R. McNeill Alexander wanted to determine whether the stride
length of a dinosaur, as shown by its fossilized footprints, could be used to estimate the speed of the dinosaur.
Stride length for an animal is defined as the distance x from a particular point on a footprint to that same point on
the next footprint of the same foot. (See the figure below.) Because no dinosaurs were available, Alexander and
fellow scientist A. S. Jayes carried out experiments with many types of animals, including adult men, dogs,
camels, ostriches, and elephants. The results of these experiments tended to support the idea that the speed y of
an animal is related to the animal’s stride length x. To better understand this relationship, examine the data in the
table below, which are similar to, but less extensive than, the data collected by Alexander and Jayes.

Table 1. Stride length 𝑥 (in meters) and average speed 𝑦 (in meters per second): 1.a Adult men, 1.b Dogs, 1.c Camels

A graph of the ordered pairs in Table 1 is shown in Figure 1. In this graph, which is called a scatter
diagram or scatter plot, the x-axis represents the stride lengths
in meters and the y-axis represents the average speeds in meters per second. The scatter diagram seems to
indicate that for each of the three species, a larger stride length generally produces a faster speed. Also note that
for each species, a straight line can be drawn such that all of the points for that species lie on or very close to the
line. Thus, the relationship between speed and stride length appears to be a linear relationship.

Figure 1: Scatter diagram for Table 1
Figure 2: Vertical deviations



After a relationship between paired data, which are referred to as bivariate data, has been discovered, a
scientist tries to model the relationship with an equation. One method of determining a linear relationship for
bivariate data is called linear regression. To see how linear regression is carried out, let us concentrate on the
bivariate data for the dogs, which is shown by the green points in Figures 1 and 2. There are many lines that can
be drawn such that the data points lie close to the line; however, scientists are generally interested in the line
called the line of best fit or the least-squares regression line.

The least-squares regression line for a set of bivariate data is the line that minimizes the sum of the squares of
the vertical deviations from each data point to the line.

The least-squares regression line is also called the least-squares line. The approximate equation of the
least-squares line for the bivariate data for the dogs is 𝑦̂ = 3.2𝑥 − 1.1. Figure 2 shows the graph of these data
and the graph of 𝑦̂ = 3.2𝑥 − 1.1. In Figure 2, the vertical deviations from the ordered pairs to the graph of 𝑦̂ =
3.2𝑥 − 1.1 are 0, −0.06, 0.5, −0.52, −0.16, −0.6, 0.34 and 0.2. It is traditional to use the symbol 𝑦̂
(pronounced y-hat) in place of y in the equation of a least-squares line. This also helps us differentiate the line’s
y-values from the y-values of the given ordered pairs. The next formula can be used to determine the equation
of the least-squares line for a given set of ordered pairs.

The Formula for the Least-Squares Line


The equation of the least-squares line for the 𝑛 ordered pairs (𝑥₁, 𝑦₁), (𝑥₂, 𝑦₂), (𝑥₃, 𝑦₃), … , (𝑥ₙ, 𝑦ₙ) is 𝑦̂ = 𝑎𝑥 + 𝑏,
where
$$a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \quad \text{and} \quad b = \bar{y} - a\bar{x}$$
In the formula for the least-squares regression line, ∑𝑥 represents the sum of all the 𝑥 values, ∑𝑦 represents the
sum of all the 𝑦 values, ∑𝑥𝑦 represents the sum of the 𝑛 products 𝑥₁𝑦₁, 𝑥₂𝑦₂, … , 𝑥ₙ𝑦ₙ, and ∑𝑥² represents the sum
of the squares of the 𝑥 values. The notation 𝑥̅ represents the mean of the 𝑥 values, and 𝑦̅ represents the mean of
the 𝑦 values. The following example illustrates a procedure that can be used to calculate efficiently the sums
needed to find the equation of the least-squares line for a given set of data.

Example 1: Find the equation of the least-squares line for the ordered pairs in Table 1.a.
Solution
The ordered pairs are (2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6), (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7)
The number of ordered pairs is 𝑛 = 8. Organize the data in four columns, as shown in Table 2. Then find the
sum of each column.
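Table 2
𝑥       𝑦       𝑥²        𝑥𝑦
2.5     3.4     6.25      8.50
3.0     4.9     9.00      14.70
3.3     5.5     10.89     18.15
3.5     6.6     12.25     23.10
3.8     7.0     14.44     26.60
4.0     7.7     16.00     30.80
4.2     8.3     17.64     34.86
4.5     8.7     20.25     39.15
Sum     28.8    52.1      106.72    195.86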

Find the slope 𝑎.
$$a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{(8)(195.86) - (28.8)(52.1)}{(8)(106.72) - (28.8)^2} \approx 2.7303$$
Find 𝑥̅ and 𝑦̅.
$$\bar{x} = \frac{\sum x}{n} = \frac{28.8}{8} = 3.6 \qquad \bar{y} = \frac{\sum y}{n} = \frac{52.1}{8} = 6.5125$$
Find the 𝑦-intercept 𝑏.
𝑏 = 𝑦̅ − 𝑎𝑥̅ = 6.5125 − (2.7303)(3.6) = −3.31658

If 𝑎 and 𝑏 are each rounded to the nearest tenth, to reflect the accuracy of the original data, then we have as our
equation of the least-squares line:
𝑦̂ = 𝑎𝑥 + 𝑏
𝑦̂ = 2.7𝑥 − 3.3
See figure 3 below.

Figure 3. Least-squares line for speed versus stride length in adult men
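
The calculation in Example 1 can be reproduced with a short program. This is a minimal sketch of the summation formulas for 𝑎 and 𝑏; the function names are chosen here for illustration. It also compares the sum of squared vertical deviations of the fitted line with that of a slightly steeper line, to illustrate what "least squares" means. Because 𝑎 is not rounded before 𝑏 is computed, the intercept can differ from the hand calculation in the later decimal places.

```python
def least_squares_line(points):
    """Return (a, b) for the line y-hat = a*x + b using the summation formulas."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    a = (n * sum_xy - sum_x * sum_y) / denominator
    b = sum_y / n - a * (sum_x / n)  # b = y-bar - a * x-bar
    return a, b

def sum_squared_deviations(points, a, b):
    """Sum of squared vertical deviations from each point to the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in points)

# Ordered pairs (stride length, speed) for adult men, from Example 1
adult_men = [(2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6),
             (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7)]

a, b = least_squares_line(adult_men)
best_sse = sum_squared_deviations(adult_men, a, b)
steeper_sse = sum_squared_deviations(adult_men, a + 0.1, b)  # a nearby, non-optimal line

print(round(a, 1), round(b, 1))  # 2.7 -3.3
print(best_sse < steeper_sse)    # True
```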

Example 2: Use the equation of the least-squares line from Example 1 to predict the average speed of an adult
man for each of the following stride lengths. Round your results to the nearest tenth of a meter per second.
a) 2.8 m
b) 4.8 m
Solution
a) In Example 1, we found the equation of the least-squares line to be 𝑦̂ = 2.7𝑥 − 3.3. Substituting 2.8 for 𝑥 gives
𝑦̂ = 2.7𝑥 − 3.3 = 2.7(2.8) − 3.3 = 4.26
Rounding 4.26 to the nearest tenth produces 4.3. Thus 4.3 m/s is the predicted average speed for an adult man
with a stride length of 2.8 m.

b) In Example 1, we found the equation of the least-squares line to be 𝑦̂ = 2.7𝑥 − 3.3


Substituting 4.8 for 𝑥 gives
𝑦̂ = 2.7𝑥 − 3.3 = 2.7(4.8) − 3.3 = 9.66
Rounding 9.66 to the nearest tenth produces 9.7. Thus 9.7 m/s is the predicted average speed for an adult man
with a stride length of 4.8 m.
The procedure in Example 2a made use of an equation to determine a point between given data points. This
procedure is referred to as interpolation. In Example 2b, an equation was used to determine a point to the right
of the given data points. The process of using an equation to determine a point to the right or left of given data
points is referred to as extrapolation. See Figure 4.

Figure 4. Interpolation and extrapolation
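
The distinction between interpolation and extrapolation can be made explicit in code. The sketch below uses the rounded line 𝑦̂ = 2.7𝑥 − 3.3 from Example 1 and the smallest and largest stride lengths in the adult-men data (2.5 m and 4.5 m); the function name and the returned label are assumptions made for illustration.

```python
def predict_speed(stride_m, a=2.7, b=-3.3, x_min=2.5, x_max=4.5):
    """Predict speed (m/s) from stride length (m) and label the kind of prediction."""
    y_hat = a * stride_m + b
    # Inside the range of the observed x values: interpolation; outside: extrapolation.
    kind = "interpolation" if x_min <= stride_m <= x_max else "extrapolation"
    return round(y_hat, 1), kind

print(predict_speed(2.8))  # (4.3, 'interpolation')
print(predict_speed(4.8))  # (9.7, 'extrapolation')
```

Extrapolated values should be treated with more caution, since the data give no evidence that the linear pattern continues outside the observed range.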



Activity:

1. Which diagram shows two variables which have


a) positive correlation,
b) negative correlation,
c) no correlation?
2. Classify the variables in each item whether categorical, ordinal, interval or ratio level (each item contains two variables).
a)
b)
c)
3. Construct a scatter plot for the data shown for car rental companies in the Philippines for a recent year.
Company    Cars (in hundreds)    Revenue (in millions of pesos)
A 63.0 7.0
B 29.0 3.9
C 20.8 2.1
D 19.1 2.8
E 13.4 1.4
F 8.5 1.5
4. Construct a scatter plot for the data obtained in a study on the number of absences and the final examination
results of seven randomly selected students from a statistics class. The data are shown here.
Student    Number of absences    Final examination
A 6 82
B 2 86
C 15 43
D 9 74
E 12 58
F 5 90
G 8 78
5. Construct a scatter plot for the data obtained in a study on the number of hours that nine people exercise each
week and the amount of milk (in ounces) each person consumes per week. The data are shown.
Subject Hours Amount
A 3 48
B 0 8
C 2 32
D 5 64
E 8 10
F 5 32
G 10 56
H 2 72
I 1 48

a. Analyze the three scatter plots and determine which type of relationship (positive, negative, no), if any,
exists. Compare the three scatter plots.
b. Give one real-life example of positive relationship, negative relationship, and no relationship between
two variables.


Linear Correlation Coefficient


To determine the strength of a linear relationship between two variables, statisticians use a statistic
called the linear correlation coefficient, which is denoted by the variable 𝑟 and is defined as follows.
Linear Correlation Coefficient
For the 𝑛 ordered pairs (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), (𝑥3 , 𝑦3 ), … , (𝑥𝑛 , 𝑦𝑛 ), the linear correlation coefficient 𝑟 is given by
$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}$$
If the linear correlation coefficient 𝑟 is positive, the relationship between the variables has a positive correlation.
In this case, if one variable increases, the other variable also tends to increase. If r is negative, the linear
relationship between the variables has a negative correlation. In this case, if one variable increases, the other
variable tends to decrease. Figure 5 shows some scatter diagrams along with the type of linear correlation that
exists between the x and y variables. The closer |𝑟| is to 1, the stronger the linear relationship between the
variables.

Figure 5. Linear correlation


Example 3: Find the linear correlation coefficient for stride length versus speed of an adult man. Use the data in
Table 1.a. Round your result to the nearest hundredth.
Solution: The ordered pairs are
(2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6), (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7). The number of ordered
pairs is 𝑛 = 8. In Table 2, we found:
∑𝑥 = 28.8, ∑𝑦 = 52.1, ∑𝑥² = 106.72, ∑𝑥𝑦 = 195.86
The only additional value that is needed is
∑𝑦² = (3.4)² + (4.9)² + (5.5)² + (6.6)² + (7.0)² + (7.7)² + (8.3)² + (8.7)² = 362.25
Substituting the above values into the equation for the linear correlation coefficient gives us
$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} = \frac{8(195.86) - (28.8)(52.1)}{\sqrt{[8(106.72) - (28.8)^2][8(362.25) - (52.1)^2]}} \approx 0.993715$$
To the nearest hundredth, the linear correlation coefficient is 0.99.
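
The same computation can be written as a short function. This is a minimal sketch of the summation formula for 𝑟 given above; the function name `linear_correlation` is chosen here for illustration.

```python
import math

def linear_correlation(points):
    """Linear correlation coefficient r for a list of (x, y) ordered pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Ordered pairs (stride length, speed) for adult men, from Table 1.a
adult_men = [(2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6),
             (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7)]

r = linear_correlation(adult_men)
print(round(r, 2))  # 0.99
```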


What is the significance of the fact that the linear correlation coefficient is positive in Example 3? (Answer: It
indicates a positive correlation between a man’s stride length and his speed. That is, as a man’s stride length
increases, his speed also increases.)

The linear correlation coefficient indicates the strength of a linear relationship between two variables; however,
it does not indicate the presence of a cause-and-effect relationship. For instance, the data in Table 3 show the
hours per week that a student spent playing pool and the student’s weekly algebra test scores for those same
weeks.
Table 3. Algebra Test Scores vs. Hours Spent Playing Pool

The linear correlation coefficient for the ordered pairs in the table is 𝑟 ≈ 0.98. Thus there is a strong
positive linear relationship between the student’s algebra test scores and the time the student spent playing pool.
This does not mean that the higher algebra test scores were caused by the increased time spent playing pool.
The fact that the student’s test scores increased with the increase in the time spent playing pool could be due to
many other factors or it could just be a coincidence. In your work with applications that involve the linear
correlation coefficient 𝑟, it is important to remember the following properties of 𝑟.

Properties of the Linear Correlation Coefficient


a) The linear correlation coefficient 𝑟 is always a real number between 1 and −1, inclusive. In the case in which
• all of the ordered pairs lie on a line with positive slope, 𝑟 is 1.
• all of the ordered pairs lie on a line with negative slope, 𝑟 is −1.
b) For any set of ordered pairs, the linear correlation coefficient 𝑟 and the slope of the least-squares line both have
the same sign.
c) Interchanging the variables in the ordered pairs does not change the value of 𝑟. Thus the value of 𝑟 for the ordered
pairs (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) is the same as the value of 𝑟 for the ordered pairs (𝑦1 , 𝑥1 ), (𝑦2 , 𝑥2 ), … , (𝑦𝑛 , 𝑥𝑛 ).
d) The value of 𝑟 does not depend on the units used. You can change the units of a variable from, for example, feet
to inches and the value of 𝑟 will remain the same.
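
Properties (c) and (d) can be checked numerically. The sketch below reuses the adult-men data from Table 1.a, swaps the coordinates of each pair, and converts stride lengths from meters to centimeters; the unit conversion is an illustrative choice, not one made in the text.

```python
import math

def linear_correlation(points):
    """Linear correlation coefficient r for a list of (x, y) ordered pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)
    numerator = n * sum_xy - sum_x * sum_y
    return numerator / math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

adult_men = [(2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6),
             (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7)]

r_original = linear_correlation(adult_men)
r_swapped = linear_correlation([(y, x) for x, y in adult_men])            # property (c)
r_centimeters = linear_correlation([(100 * x, y) for x, y in adult_men])  # property (d)

print(math.isclose(r_original, r_swapped))      # True
print(math.isclose(r_original, r_centimeters))  # True
```

`math.isclose` is used instead of `==` because floating-point rounding can change the last few binary digits even when the values are mathematically equal.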

In Activity 3 and 4, find the equation of the least-squares line and the linear correlation coefficient for the given data.
Round the constants 𝑎, 𝑏, and 𝑟 to the nearest hundredth.
1. {(2,6), (3, 6), (4,8), (6,11), (8,18)}
2. {(2, −3), (3, −4), (4, −9), (5, −10), (7, −12)}
3. A student has recorded the data in the following table, which shows the distance a spring stretches in inches for a
given weight in pounds.

a) Find the linear correlation coefficient.


b) Find the equation of the least-squares line.
c) Use the equation of the least-squares line from part b to predict the distance a weight of 195 pounds will stretch
the spring.

In Activity 3 and 4, find the equation of the least-squares line and the linear correlation coefficient for the given data.
Round the constants 𝑎, 𝑏, and 𝑟 to the nearest hundredth.
1. {(−7, −11.7), (−5, −9.8), (−3, −8.1), (1, −5.9), (2, −5.7)}
2. {(1, 4.1), (2, 6.0), (4, 8.2), (6, 11.5), (8, 16.2)}
3. The following table shows the percent of water and the number of calories in various canned soups to which 100
grams of water are added.


a) Find the equation of the least-squares line for the data. Round constants to the nearest hundredth.
b) Use the equation in part 𝑎 to find the expected number of calories in a soup that is 89% water. Round to the
nearest whole number.
c) Determine the linear correlation coefficient between the percent of water and the number of calories in various
canned soups.
