Section 2 Mathematics As A Tool Gecmat Chmsu - Cas Mathematics Department
Introduction
Statistics is a branch of applied mathematics that deals with gathering, organizing, presenting, analyzing, and interpreting data. There are two major fields of applied statistics: descriptive and inferential statistics. Descriptive statistics involves collecting, organizing, describing, summarizing, and presenting gathered data in a meaningful and informative way, while inferential statistics refers to the process of drawing conclusions and making decisions about a population based on evidence obtained from a sample. Inferential statistics includes estimation and hypothesis testing.
In performing all these processes, the application of statistical tools and techniques is necessary. Statistical tools derived from mathematics are useful in processing and managing numerical data in order to describe a phenomenon and predict values.
The essential processes arrange the data to be analyzed and interpreted. These refer to gathering and organizing data, which can be done using a frequency distribution (grouped data) or, when there are only a few values, a series of values. The measures of central tendency, namely the mean, the median, and the mode, help us determine a central value that describes the general or overall performance of a group of values. The measures of dispersion, namely the range, the standard deviation, and the variance, tell us how close or far the values are from each other. They also help describe whether the groups being studied or the data gathered are heterogeneous (dispersed, scattered, or spread out) or homogeneous (clustered or close to each other). Also covered are the measures of relative position, which include the z-scores, percentiles, quartiles, and deciles, together with box-and-whisker plots.
Probability and the normal distribution are also discussed in this module, to equip students with further knowledge and skills in obtaining probabilities and solving problems about normal distributions.
Topics on linear regression and correlation, such as the least-squares line and linear correlation, are also covered in the module to help students interpret data and check assumptions based on the given problem.
Lesson 1: Gathering and Organizing Data, Representing Data using Graphs and Charts and Interpreting
Organized Data
In today’s world, we process an enormous amount of data almost every day. In schools, laboratories, and companies, volumes of data are processed. Data management plays a very important role in processing these data. To help analyze a certain phenomenon, we need to manage data with the help of statistics. The use of statistics in predicting outcomes and explaining what is happening is very evident. When data are managed efficiently, the result is a better understanding of the nature of the phenomenon, which in turn can improve lives in the modern world.
The bar graph below depicts the confirmed COVID-19 cases, deaths, and recoveries in Southeast Asian
countries.
Revision 02
For each item, identify the misleading graph. After that, form a small group of four members and share your answers or thoughts about the graphs. Then, after synthesizing the answers of your group, choose one representative to present the answer to the class.
1.
2.
3.
4.
When conducting a statistical study, the researcher must gather data for the particular variable under
study. For example, if a researcher wishes to study the number of people who were bitten by poisonous snakes
in a specific geographic area over the past several years, he or she has to gather the data from various doctors,
hospitals, or health departments.
To describe situations, draw conclusions, or make inferences about events, the researcher must organize the data in some meaningful way. The data gathered should be presented, analyzed, and interpreted in a way that the reader can easily understand. Data may be presented in textual, tabular, or graphical form, or in a combination of these.
Textual presentation uses statements with numerals, in expository form, to describe the data and to discuss the information and interpretation they carry. For example, the math test scores of 15 students out of 50 items are 47, 48, 49, 42, 36, 38, 40, 35, 50, 26, 25, 31, 34, 19, 41.
Tabular presentation uses a statistical table to directly display the quantities or values collected as data. It is a systematic arrangement of information into columns and rows. Examples of tabular presentation are the simple frequency distribution (or simply a frequency distribution), cumulative frequency distribution, grouped frequency distribution, and cumulative grouped frequency distribution.
Graphical presentation illustrates data in the form of graphs, helping readers understand the text easily. It is often the most attractive and effective way of describing the data. There are various types of graphs we can prepare, such as the bar graph, circle graph (pie chart), line graph, pictograph, histogram, frequency polygon, and scatter diagram.
Two types of frequency distributions that are most often used are the qualitative or categorical frequency
distribution and the quantitative or grouped frequency distribution. Each raw data value is placed into a
quantitative or qualitative category called a class. The frequency of a class then is the number of data values
contained in a specific class.
Types of Frequency Distribution Table (FDT)
a) Qualitative or Categorical FDT. It is a frequency distribution table where the data are grouped according to some
qualitative characteristics; data are grouped into non-numerical categories. Categorical frequency distribution is
used for data that can be placed in specific categories, such as nominal- or ordinal-level data. For example, data
such as political affiliation, religious affiliation, or major field of study would use categorical frequency
distributions.
b) Quantitative FDT. It is a frequency distribution table where the data are grouped according to some numerical or
quantitative characteristics.
Example 1. Twenty-five army inductees were given a blood test to determine their blood type. The data set is
When the range of the data is large, the data must be grouped into classes that are more than one unit in width, in what is
called a grouped or quantitative frequency distribution. It is a frequency distribution table where the data are grouped
according to some numerical or quantitative characteristics. For example, a distribution of the blood glucose levels in
milligrams per deciliter (mg/dL) for 50 randomly selected college students is shown.
Example 2. Suppose a researcher wished to do a study on the ages of the 40 patients confined at a certain
hospital. The researcher first would have to get the data on the ages of the participants. When the data are in
original form, they are called raw data and are listed next. Construct the FDT of the given data set.
Age (in years) of 40 patients confined at a certain hospital.
5 13 20 25 30 33 36 42 45 53
5 15 21 25 31 33 38 43 50 55
7 15 23 26 31 34 40 44 51 57
10 20 24 27 32 35 42 45 52 57
Solution
Step 1. $R = \text{highest value} - \text{lowest value} = 57 - 5 = 52$
Step 2. $k = \sqrt{N} = \sqrt{40} \approx 6.32$, or 6 classes
Step 3. $c = \dfrac{R}{k} = \dfrac{52}{6} \approx 8.67$, or 9
Step 4 and Step 5.
Table 1
Frequency Distribution of Age (in years) of 40 Patients Confined at a Certain Hospital
Age (in years) Tally f
5 – 13 IIII 5
14 – 22 IIII 5
23 – 31 IIII - IIII 9
32 - 40 IIII - III 8
41 – 49 IIII - I 6
50 – 58 IIII - II 7
TOTAL ∑ 𝑓 = 40
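The five steps above can be sketched in Python as a quick check of Table 1. This is a minimal illustration, not part of the module: the variable names are assumptions, and the rounding choices (round the number of classes to the nearest integer, round the class width up) follow what the solution does.

```python
import math

# Raw data: ages (in years) of the 40 patients in Example 2.
ages = [5, 13, 20, 25, 30, 33, 36, 42, 45, 53,
        5, 15, 21, 25, 31, 33, 38, 43, 50, 55,
        7, 15, 23, 26, 31, 34, 40, 44, 51, 57,
        10, 20, 24, 27, 32, 35, 42, 45, 52, 57]

data_range = max(ages) - min(ages)     # Step 1: R = 57 - 5 = 52
k = round(math.sqrt(len(ages)))        # Step 2: sqrt(40) is about 6.32, so 6 classes
width = math.ceil(data_range / k)      # Step 3: 52 / 6 is about 8.67, rounded up to 9

# Steps 4 and 5: build the class limits and count the values in each class.
table = []
lower = min(ages)
for _ in range(k):
    upper = lower + width - 1
    frequency = sum(lower <= age <= upper for age in ages)
    table.append((lower, upper, frequency))
    lower = upper + 1

for low, high, frequency in table:
    print(f"{low:>2} - {high:<2}  f = {frequency}")
```

The printed classes and frequencies should match the rows of Table 1.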
Other Columns in the FDT
Table 2
Frequency Distribution of Age (in years) of 40 Patients Confined at a Certain Hospital
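| Age (in years) | Tally | f | TLCB | TUCB | CM | RF (%) | <CF | >CF | <RCF | >RCF |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 – 13 | IIII | 5 | 4.5 | 13.5 | 9 | 12.5 | 5 | 40 | 12.5 | 100.0 |
| 14 – 22 | IIII | 5 | 13.5 | 22.5 | 18 | 12.5 | 10 | 35 | 25.0 | 87.5 |
| 23 – 31 | IIII - IIII | 9 | 22.5 | 31.5 | 27 | 22.5 | 19 | 30 | 47.5 | 75.0 |
| 32 – 40 | IIII - III | 8 | 31.5 | 40.5 | 36 | 20.0 | 27 | 21 | 67.5 | 52.5 |
| 41 – 49 | IIII - I | 6 | 40.5 | 49.5 | 45 | 15.0 | 33 | 13 | 82.5 | 32.5 |
| 50 – 58 | IIII - II | 7 | 49.5 | 58.5 | 54 | 17.5 | 40 | 7 | 100.0 | 17.5 |
| TOTAL | | ∑ 𝑓 = 40 | | | | 100.0 | | | | |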
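The additional columns of Table 2 can be derived from the class limits and frequencies alone. The sketch below assumes the column abbreviations mean true lower and upper class boundaries (TLCB, TUCB), class mark (CM), relative frequency (RF), cumulative frequency (CF), and relative cumulative frequency (RCF); the dictionary keys are illustrative names only.

```python
# Class limits and frequencies from Table 1: (lower limit, upper limit, f).
classes = [(5, 13, 5), (14, 22, 5), (23, 31, 9),
           (32, 40, 8), (41, 49, 6), (50, 58, 7)]
n = sum(f for _, _, f in classes)

rows = []
less_cf = 0
for lower, upper, f in classes:
    less_cf += f                      # running total up to and including this class
    greater_cf = n - less_cf + f      # this class and every class above it
    rows.append({
        "tlcb": lower - 0.5,          # true lower class boundary
        "tucb": upper + 0.5,          # true upper class boundary
        "cm": (lower + upper) / 2,    # class mark (midpoint)
        "rf": 100 * f / n,            # relative frequency in percent
        "less_cf": less_cf,
        "greater_cf": greater_cf,
        "less_rcf": 100 * less_cf / n,
        "greater_rcf": 100 * greater_cf / n,
    })

for row in rows:
    print(row)
```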
The only allowable calculation on nominal data is to count the frequency of each value of the variable.
We can display data in five ways: pie chart, bar chart, scatter plot, time series graph, and pictograph.
1. Pie Chart (Circle graph). A pie chart is a circular graph that shows how a total quantity is distributed among a group of categories. The circle is divided into sections or wedges according to the percentage of frequencies in each category, so the “pieces of the pie” represent the proportion of the total that falls into each category. It is useful for nominal or categorical data sorted into categories for a specific period, and its purpose is to show the relationship of the parts to the whole by visually comparing the sizes of the sections. Percentages or proportions can be used. Use a pie chart when there are fewer than 8 categories in the data set.
Solution
Step 1. Since there are 360° in a circle, the frequency for each class must be converted to a proportional part of
the circle. This conversion is done by using the formula
$$\text{Degrees} = \frac{f}{n} \times 360^\circ$$
where 𝑓 = frequency for each class and 𝑛 = sum of the frequencies.
Hence, the following conversions are obtained. The degrees should sum to
360°.
Step 3. Next, using a protractor and a compass, draw the graph, using the appropriate degree measures found in
Step 1, and label each section with the name and percentages, as shown in the figure below.
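The conversion from frequencies to degrees and percentages can be sketched as follows. The category names and counts here are hypothetical, chosen only to illustrate the formula; they are not data from the module.

```python
# Hypothetical category counts (illustrative only, not from the module).
frequencies = {"Red": 5, "Blue": 7, "Green": 9, "Yellow": 4}
n = sum(frequencies.values())

# Degrees = f / n x 360 for each category; the percentage uses 100 instead of 360.
degrees = {name: f / n * 360 for name, f in frequencies.items()}
percents = {name: f / n * 100 for name, f in frequencies.items()}

for name in frequencies:
    print(f"{name}: {degrees[name]:.1f} degrees, {percents[name]:.0f}%")

# The wedges must account for the whole circle.
print("Total degrees:", round(sum(degrees.values()), 1))
```

A useful sanity check is that the degrees sum to 360 and the percentages to 100, up to rounding.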
2. Bar chart (Column graph). Like pie charts, column graphs or bar charts are applicable only to grouped data. They should be used for discrete, grouped data measured on a nominal or ordinal scale. When the data are qualitative or categorical, bar graphs can be used to represent the data. A column chart is appropriate for comparing the magnitudes of one variable, shown on the y-axis, across the categories of another variable, shown on the x-axis. For time series data, its emphasis is on the magnitude and not the movement or trend. The usual space between bars is around one-fourth of the width of a column. A bar chart can be drawn using either horizontal or vertical bars.
A horizontal bar chart is used for qualitative data at a specific time. It compares the magnitudes of the different categories of a qualitative variable. It places the categories of the qualitative variable on the y-axis and the amount or number on the x-axis. The spaces between the bars may be one-fifth to one-half the width of a bar.
Solution
Step 1. Draw and label the x and y axes. For the horizontal bar graph place the frequency scale on the x axis,
and for the vertical bar graph place the frequency scale on the y axis.
Step 2. Draw the bars corresponding to the frequencies. See the figure below.
The graphs show that first-year college students spend the most on electronic equipment.
Bar charts can also be used to compare data for two or more groups. These types of bar graphs are called
compound bar graphs. Consider the following data for the number (in millions) of never married adults in the
United States.
3. Scatter plot (Scatter graph) is a graph used to represent measurements or values that are thought to be related. It is used to examine possible relationships between two numerical variables. The two variables are plotted on the 𝑥-axis and the 𝑦-axis.
4. Time series graph represents data that occur over a specific period of time under observation. It shows trends and patterns, supports forecasts, and can display one or more time series for comparison.
5. Pictograph (Pictogram) immediately suggests the nature of the data being shown. It gives an approximation only of
the actual figures and compares the different categories. The symbols selected should be self-explanatory and easy to
understand. Each symbol represents a number.
Solution
Step 1. Draw and label the x and y axes. The x axis is always the horizontal axis, and the y axis is always the
vertical axis.
Step 2. Represent the frequency on the y axis and the class boundaries on the x axis.
Step 3. Using the frequencies as the heights, draw vertical bars for each class.
As the histogram shows, the class with the greatest number of data values (18) is 109.5–114.5, followed by 13
for 114.5–119.5. The graph also has one peak with the data clustering around it.
2. Frequency polygon is a graph that displays the data by using straight lines that connect points plotted for the frequencies at the class marks (midpoints) of the classes. The frequencies are represented by the heights of the points. The polygon is closed by considering an additional class at each end, so that the ends of the lines are brought down to the horizontal axis at the midpoints of the additional classes.
Solution
Step 1. Find the midpoints of each class. Recall that midpoints are found by adding the upper and lower
boundaries and dividing by 2:
$$\frac{99.5 + 104.5}{2} = 102 \qquad \frac{104.5 + 109.5}{2} = 107$$
and so on. The midpoints are
Step 2. Draw the x and y axes. Label the x axis with the midpoint of each class, and then use a suitable scale on
the y axis for the frequencies.
Step 3. Using the midpoints for the x values and the frequencies as the y values, plot the points.
Step 4. Connect adjacent points with line segments. Draw a line back to the x axis at the beginning and end of
the graph, at the same distance that the previous and next midpoints would be located, as shown in the figure
below.
Figure 1
Frequency Polygon of the Record High Temperatures
3. Cumulative frequency polygon (Ogive) is a graph that displays the cumulative frequencies for the classes in a frequency distribution. The vertical axis represents the cumulative frequency of the distribution while the horizontal axis represents the upper class boundaries of the frequency distribution.
The less than cumulative frequency polygon (less than ogive) is plotted against upper class boundaries while the
greater than cumulative frequency polygon (greater than ogive) is plotted against lower class boundaries.
Solution
Step 1. Find the cumulative frequency for each class.
Step 2. Draw the x and y axes. Label the x axis with the class boundaries. Use an appropriate scale for the y axis
to represent the cumulative frequencies. (Depending on the numbers in the cumulative frequency columns,
scales such as 0, 1, 2, 3,..., or 5, 10, 15, 20,..., or 1000, 2000, 3000,... can be used. Do not label the y axis with
the numbers in the cumulative frequency column.) In this example, a scale of 0, 5, 10, 15, . . . will be used.
Step 3. Plot the cumulative frequency at each upper class boundary, as shown in the figure below. Upper
boundaries are used since the cumulative frequencies represent the number of data values accumulated up to the
upper boundary of each class.
Step 4. Starting with the first upper class boundary, 104.5, connect adjacent points with line segments, as shown
in the figure below. Then extend the graph to the first lower class boundary, 99.5, on the x axis.
Cumulative frequency graphs are used to visually represent how many values are below a certain upper
class boundary. For example, to find out how many record high temperatures are less than 114.5℉, locate
114.5℉ on the x axis, draw a vertical line up until it intersects the graph, and then draw a horizontal line at that
point to the y axis. The y axis value is 28, as shown in the figure below.
Activity:
21 26 25 19 29 35 34 26 25 23
27 29 24 29 22 24 28 20 20 27
35 38 25 31 19 25 27 28 22 33
34 25 32 26 26 24 23 28 26 30
23 25 22 25 29 34 34 30 17 25
2. SJS Travel Agency, a nationwide travel agency, offers special rates during the summer period. The owner wants additional information on the ages of the people taking travel tours. Construct a histogram, frequency polygon, and cumulative frequency polygon. What conclusions can you reach based on the information presented?
Class limits Class boundaries Midpoints Frequency < 𝑐𝑓
18-26 17.5-26.5 22 3 3
27-35 26.5-35.5 31 5 8
36-44 35.5-44.5 40 9 17
45-53 44.5-53.5 49 14 31
54-62 53.5-62.5 58 11 42
63-71 62.5-71.5 67 6 48
72-80 71.5-80.5 76 2 50
Solve the following problems and show complete solution (15 points each)
3. Cereal Calories. The number of calories per serving for selected ready-to-eat cereals is listed here. Construct a
frequency distribution, using 7 classes. Draw a histogram, a frequency polygon, and an ogive for the data, using relative
frequencies. Describe the shape of the histogram.
130 190 140 80 100 120 220 220 110 100
210 130 100 90 210 120 200 120 180 120
190 210 120 200 130 180 260 270 100 160
190 240 80 120 90 190 200 210 190 180
115 210 110 225 190 130
4. The table below presents the COVID-19 death cases in the Philippines by region as of September 2021. Sketch the bar
chart and pie chart of the given data and interpret the data.
Any given data in statistics are useless if we don’t interpret them. The most appropriate measures found
to be useful in describing a distribution of observations are the measures of central tendency, measures of
dispersion, measures of relative position, 𝑧-scores, box and whisker plot, probability and normal curve, linear
regression and correlation.
A measure of central tendency is any single value that is used to identify the “center” of the data or the
typical value. It is called measure of central tendency because when the data points are arranged according to
magnitude, it tends to lie centrally within the set. It is the representative value of the data set. It is the value
around which most of the data points are found.
Mean
The mean represents the center of the data. It is the most important measure if the distribution is symmetric
and the most stable measure of location. It is used when the data is at least interval. When n is small, the mean is
very sensitive to extreme values.
It is computed by summing all the observations in the sample and dividing the sum by the number of
observations.
Properties of Mean
a) A set of data has only one mean.
b) Mean can be applied for interval and ratio data.
c) All values in the data set are included in computing the mean.
d) The mean is very useful in comparing two or more data sets.
e) Mean is affected by the extreme small or large values on a data set.
f) Mean is most appropriate in symmetrical data.
For the ungrouped data, the following are the formulas of the mean.
Population Mean ($\mu$): $\mu = \dfrac{\sum x_i}{N}$, where $x_i$ is the $i$th score or observation, and $N$ is the number of observations in the population.
Sample Mean ($\bar{X}$): $\bar{X} = \dfrac{\sum x_i}{n}$, where $x_i$ is the $i$th score or observation, and $n$ is the number of observations in the sample.
Example 1: During a particular summer month, the eight hospitals in a particular province reported the following
number of admissions in their respective ICUs: 8, 11, 5, 14, 8, 11, 16, and 11.
Solution: Considering this month as the statistical population of interest, the mean number of ICU admissions is
$$\mu = \frac{\sum x_i}{N} = \frac{8+11+5+14+8+11+16+11}{8} = \frac{84}{8} = 10.5 \text{ ICU admissions}$$
Example 2. Determine mean age (in years) of a sample group of children whose ages are 9, 11, 7, 10, 9, 8, 8, 7,
12, 7 and 13.
Solution: $\bar{X} = \dfrac{\sum x_i}{n} = \dfrac{9 + 11 + 7 + 10 + 9 + 8 + 8 + 7 + 12 + 7 + 13}{11} = \dfrac{101}{11} \approx 9.18$ years
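The two computations above can be reproduced with a short Python sketch. The helper name `mean` is an illustrative choice; the data are those of Examples 1 and 2.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)

# Example 1: ICU admissions in eight hospitals (treated as a population).
icu_admissions = [8, 11, 5, 14, 8, 11, 16, 11]
print(mean(icu_admissions))           # population mean

# Example 2: ages (in years) of a sample of eleven children.
ages = [9, 11, 7, 10, 9, 8, 8, 7, 12, 7, 13]
print(round(mean(ages), 2))           # sample mean, rounded to two decimals
```

The formula is the same for a population and a sample; only the symbol ($\mu$ or $\bar{X}$) and the interpretation differ.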
For grouped data, we have $\mu = \dfrac{\sum fM}{N}$ or $\bar{X} = \dfrac{\sum fM}{n}$, where $f$ is the frequency of the class interval and $M$ is the midpoint of the class interval.
Example 3. Calculate the mean grade of 50 students in statistics below and give its description or interpretation.
Grade 𝑓
90 – 94 7
85 – 89 13
80 – 84 16
75 – 79 8
70 – 74 6
Scale Description
95 – 100 Outstanding
89 – 94 Very Satisfactory
83 – 88 Satisfactory
77 – 82 Fair
70 – 76 Poor
Solution.
First, determine the midpoint (𝑀) of each interval and the total frequency (∑ 𝑓) or 𝑛.
Grade 𝑓 𝑀
90 – 94 7 92
85 – 89 13 87
80 – 84 16 82
75 – 79 8 77
70 – 74 6 72
𝑛 = 50
Second, add a column for 𝑓𝑀, which is the product of a frequency (𝑓) and the midpoint (𝑀) of the class
interval, and find the sum of 𝑓𝑀 column or ∑ 𝑓𝑀.
Grade 𝑓 𝑀 𝑓𝑀
90 – 94 7 92 644
85 – 89 13 87 1131
80 – 84 16 82 1312
75 – 79 8 77 616
70 – 74 6 72 432
𝑛 = 50 ∑ 𝑓𝑀 = 4135
Using the formula $\bar{X} = \dfrac{\sum fM}{n}$, solve for the mean.
$$\bar{X} = \frac{\sum fM}{n} = \frac{4135}{50} = 82.7 \text{ or } 83$$
Hence, the mean grade of 50 students in statistics is Satisfactory.
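As a check on the grouped-data computation, the sketch below recomputes each midpoint from the class limits, so the midpoint of 75 – 79 comes out as 77, consistent with the product 616 in the table. The variable names are illustrative assumptions.

```python
# Grade classes from Example 3: (lower limit, upper limit, frequency).
grade_classes = [(90, 94, 7), (85, 89, 13), (80, 84, 16), (75, 79, 8), (70, 74, 6)]

n = sum(f for _, _, f in grade_classes)
sum_fm = 0
for lower, upper, f in grade_classes:
    midpoint = (lower + upper) / 2    # M, the class midpoint
    sum_fm += f * midpoint            # accumulate f x M

grouped_mean = sum_fm / n
print(n, sum_fm, grouped_mean)
```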
Weighted mean ($\bar{X}_w$ or $\mu_w$) is the sum of the mean of each group multiplied by its respective weight, divided by the sum of the weights. (For the ordinary mean, the weights are all equal.) An example of a weighted mean is the weighted average of a student's grades in a semester, used to determine whether he or she belongs to the dean’s list. Each grade has a corresponding number of units (for example, GECMAT is 3 units, a major subject is 4 or 5 units, and so on).
The formula of the weighted mean is
$$\bar{X}_w = \frac{X_1 w_1 + X_2 w_2 + \dots + X_n w_n}{w_1 + w_2 + \dots + w_n}$$
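The dean's-list situation described above can be sketched in Python. The grades and unit counts below are hypothetical values made up for illustration; only the formula comes from the text.

```python
def weighted_mean(values, weights):
    """Sum of value x weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# Hypothetical semester: three subjects with different numbers of units.
grades = [90, 85, 92]   # illustrative grades, not from the module
units = [3, 5, 4]       # illustrative unit weights

print(round(weighted_mean(grades, units), 2))

# With equal weights the weighted mean reduces to the ordinary mean.
print(weighted_mean(grades, [1, 1, 1]))
```

The 5-unit subject pulls the result toward its grade of 85, which is why the weighted mean here is lower than the unweighted mean of the same three grades.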
Example 4. Francis answered 20 calculus problems. He spent 1½ hours on each of the first 6 problems, 45 minutes on each of the next 3, and 3 hours on each of the last 11 problems. What was the average time (in minutes) he spent per problem?
Solution
This problem requires the weighted mean because each time (90, 45, and 180 minutes) applies to a different number of problems. The times are the values $X_i$ and the numbers of problems (6, 3, and 11) are the weights $w_i$.
$$\bar{X}_w = \frac{X_1 w_1 + X_2 w_2 + \dots + X_n w_n}{w_1 + w_2 + \dots + w_n} = \frac{90(6) + 45(3) + 180(11)}{6 + 3 + 11} = \frac{540 + 135 + 1980}{20} = \frac{2655}{20} = 132.75 \text{ minutes}$$
Median is the positional middle of the data array. In the data array, one-half of the values precede the median and
one-half follow it. When the data set is ordered, whether ascending or descending, it is called a data array. Median
is an appropriate measure of central tendency for data that are ordinal or above, but is more valuable in an ordinal
type of data.
Properties of Median
a) The median is unique; there is only one median for a set of data.
b) The median is found by arranging the set of data from lowest to highest (or highest to lowest) and getting the value of the middle observation.
c) The median is not affected by extremely small or large values.
d) The median can be applied to ordinal, interval, and ratio data.
e) The median is most appropriate for skewed data.
For ungrouped data, the first step in calculating the median, denoted by $\tilde{X}$, is to arrange the data in an array. Let $X_{(i)}$ be the $i$th observation in the array, $i = 1, 2, \dots, N$.
If $N$ is odd, the median position equals $\frac{N+1}{2}$, and the value of the $\left(\frac{N+1}{2}\right)$th observation in the array is taken as the median, i.e. $\tilde{X} = X_{\left(\frac{N+1}{2}\right)}$.
If $N$ is even, the mean of the two middle values in the array is the median, i.e.
$$\tilde{X} = \frac{X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}}{2}$$
Example 5. Find the median of the given data set: 75, 67, 71, 75, and 72
Solution
First, arrange the data set in ascending order: 67, 71, 72, 75, 75
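Since $N = 5$ is odd, we use $\tilde{X} = X_{\left(\frac{N+1}{2}\right)}$. Hence,
$$\tilde{X} = X_{\left(\frac{5+1}{2}\right)} = X_{(3)} = 72$$
The third value in the array 67, 71, 72, 75, 75 is 72.
Therefore, $\tilde{X} = 72$.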
Example 6. The reaction times for a random sample of 9 subjects to a stimulant were recorded as 2.5, 3.6, 3.1,
4.3, 2.9, 2.3, 2.6, 4.1, and 3.4 seconds. Calculate the median.
Solution
Array: 2.3, 2.5, 2.6, 2.9, 3.1, 3.4, 3.6, 4.1, 4.3
𝑋̃ = 3.1 seconds
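The odd and even rules for ungrouped data can be written as one small function. This is a sketch with an illustrative function name; note that the formulas count positions from 1 while Python lists count from 0, hence the `- 1` adjustments.

```python
def median(values):
    """Median of ungrouped data using the positional rules for odd and even N."""
    array = sorted(values)              # arrange the data in an array first
    n = len(array)
    if n % 2 == 1:
        # Odd N: the ((N + 1) / 2)th observation.
        return array[(n + 1) // 2 - 1]
    # Even N: mean of the (N / 2)th and (N / 2 + 1)th observations.
    return (array[n // 2 - 1] + array[n // 2]) / 2

print(median([75, 67, 71, 75, 72]))                            # Example 5
print(median([2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 3.4]))   # Example 6
print(median([7, 8, 9, 9, 10, 10, 11, 12]))                    # an even-sized data set
```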
For grouped data, the formula for the median is
$$\tilde{X} = X_{LB} + \left(\frac{\frac{n}{2} - {<}cf}{f_m}\right) i$$
where
𝑋𝐿𝐵 = lower boundary of class containing the median
𝑛 = sample size
< 𝑐𝑓 = cumulative frequency of classes preceding class containing the median
𝑓𝑚 = number of observations in class containing the median
𝑖 = width of the interval containing the median
Example 7. Calculate the median grade of 50 students in statistics in Example 3 and give its description or
interpretation.
Grade 𝑓
90 – 94 7
85 – 89 13
80 – 84 16
75 – 79 8
70 – 74 6
Scale Description
95 – 100 Outstanding
89 – 94 Very Satisfactory
83 – 88 Satisfactory
77 – 82 Fair
70 – 76 Poor
Solution
First, we add two columns for class boundaries and less than cumulative frequency (< 𝑐𝑓).
Grade 𝑓 Class boundaries < 𝑐𝑓
90 – 94 7 89.5 – 94.5 50
85 – 89 13 84.5 – 89.5 43
80 – 84 16 79.5 – 84.5 30
75 – 79 8 74.5 – 79.5 14
70 – 74 6 69.5 – 74.5 6
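Then, determine the median class using the $\left(\frac{n}{2}\right)$th item in the distribution. Hence,
$$\left(\frac{n}{2}\right)\text{th} = \left(\frac{50}{2}\right)\text{th} = 25\text{th}$$
If the scores are arranged in an ordered array, the 25th score of the distribution falls in the class interval 79.5 – 84.5, since 14 scores lie below 79.5 and 30 scores lie below 84.5.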
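The grouped-median formula can be applied to the table in Example 7 with the sketch below. It assumes the classes are listed from lowest to highest and share a common width; the function name and tuple layout are illustrative choices, not notation from the module.

```python
def grouped_median(classes):
    """Median of grouped data; classes are (lower limit, upper limit, f), ascending."""
    n = sum(f for _, _, f in classes)
    width = classes[0][1] - classes[0][0] + 1   # class width i (assumed equal for all classes)
    cf_before = 0                               # <cf of the classes preceding the current one
    for lower, _, f in classes:
        if cf_before + f >= n / 2:              # this class contains the (n / 2)th item
            lower_boundary = lower - 0.5        # X_LB
            return lower_boundary + (n / 2 - cf_before) / f * width
        cf_before += f
    raise ValueError("no classes given")

# Grades from Example 7, reordered from lowest to highest class.
grades = [(70, 74, 6), (75, 79, 8), (80, 84, 16), (85, 89, 13), (90, 94, 7)]
print(grouped_median(grades))
```

With these data the median class is 80 – 84, so the substitution is $79.5 + \left(\frac{25 - 14}{16}\right)5$, which rounds to 83.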
Mode is the observed value that occurs most frequently. It locates the point where the observation values occur with the greatest density. It does not always exist, and if it does, it may not be unique. A data set is said to be unimodal if there is only one mode, bimodal if there are two modes, and multimodal if there are three or more. In some cases all values in a data set have the same frequency. When this occurs, the data set is said to have no mode.
Properties of Mode
a) The mode is found by locating the most frequently occurring value.
b) The mode is the easiest average to compute.
c) There can be more than one mode or even no mode in any given data set.
d) Mode is not affected by the extreme small or large values.
e) Mode can be applied for nominal, ordinal, interval, and ratio data.
Example 8. The eight hospitals described in Example 1 had the following number of ICU admissions: 8, 11, 5,
14, 8, 11, 16, and 11. Find the mode.
Solution
𝜇̂ = 11.0 ICU admissions or 𝑋̂ = 11.0 ICU admissions
Example 9. The reaction times for a random sample of 9 subjects described in Example 6 were recorded as 2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, and 3.4 seconds. Calculate the mode.
Solution
𝜇̂ or 𝑋̂ does not exist since all values have the same frequency.
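Finding the mode of ungrouped data amounts to counting how often each value occurs. The sketch below uses `collections.Counter` from the Python standard library and follows the convention stated above that a data set whose values all share the same frequency has no mode; the function name `modes` is an illustrative choice.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or an empty list when every value is equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []                      # all values tie: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([8, 11, 5, 14, 8, 11, 16, 11]))                    # Example 8
print(modes([2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 3.4]))    # Example 9
print(modes([7, 8, 9, 9, 10, 10, 11, 12]))                     # a bimodal data set
```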
For grouped data, the formula of the mode is
$$\hat{X} = X_{Mo} + \left(\frac{d_1}{d_1 + d_2}\right) i$$
where
𝑋𝑀𝑜 = lower class boundary of the modal class
𝑑1 = difference between the frequency of the modal class and that of the immediately preceding lower class
𝑑2 = difference between the frequency of the modal class and that of the immediately following the higher class
𝑖 = class width or size
Example 10. Calculate the modal grade of 50 students in statistics in Example 3 and give its description or
interpretation.
Grade 𝑓
90 – 94 7
85 – 89 13
80 – 84 16
75 – 79 8
70 – 74 6
Scale Description
95 – 100 Outstanding
89 – 94 Very Satisfactory
83 – 88 Satisfactory
77 – 82 Fair
70 – 76 Poor
Solution
First, determine the modal class of the distribution. The modal class of the distribution has the highest frequency.
Hence, 80-84 is the modal class.
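$X_{Mo} = 79.5$
$d_1 = 16 - 8 = 8$
$d_2 = 16 - 13 = 3$
$i = 5$
By substitution,
$$\hat{X} = X_{Mo} + \left(\frac{d_1}{d_1 + d_2}\right) i = 79.5 + \left(\frac{8}{8 + 3}\right) 5 = 79.5 + \frac{40}{11} \approx 83.14 \text{ or } 83$$
Hence, the modal grade of 50 students in statistics is Satisfactory.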
Example 11: Find the mean, median, and mode of the following ages in years below.
a) 3, 4, 5, 5, 6, 7, 9, 10, 14
b) 7, 8, 9, 9, 10, 10, 11, 12
Solution
a) 3, 4, 5, 5, 6, 7, 9, 10, 14
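Mean: $\bar{X} = \dfrac{\sum x_i}{n} = \dfrac{3+4+5+5+6+7+9+10+14}{9} = \dfrac{63}{9} = 7$ years
Median: Since $N$ is 9 (which is odd), use the formula $\tilde{X} = X_{\left(\frac{N+1}{2}\right)}$.
$$\tilde{X} = X_{\left(\frac{9+1}{2}\right)} = X_{\left(\frac{10}{2}\right)} = X_{(5)}$$
Hence, $\tilde{X} = X_{(5)} = 6$.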
Mode: The mode is 5 since it has the highest frequency (it appears twice in the distribution)
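b) 7, 8, 9, 9, 10, 10, 11, 12
Mean: $\bar{X} = \dfrac{\sum x_i}{n} = \dfrac{7+8+9+9+10+10+11+12}{8} = \dfrac{76}{8} = 9.5$ years
Median: Since $N$ is 8 (which is even), use the formula $\tilde{X} = \dfrac{X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}}{2}$.
$$\tilde{X} = \frac{X_{\left(\frac{8}{2}\right)} + X_{\left(\frac{8}{2}+1\right)}}{2} = \frac{X_{(4)} + X_{(5)}}{2} = \frac{9 + 10}{2} = 9.5$$
Therefore, the median is 9.5.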
Mode: The modes are 9 and 10 since they have the highest frequency (appeared twice). It is bimodal.
Activity:
Determine the following:
1. Find the value of each of the following expressions using the values of the variables below.
𝑋1 = 2, 𝑋2 = 4, 𝑋3 = 6, 𝑋4 = 8, 𝑋5 = 10.
a. $\sum_{i=2}^{5} X_i$
b. $\sum 5X_i$
c. $\sum X_i^2$
2. Make up your own set of at least five numbers and demonstrate that $\sum X_i^2 \neq \left(\sum X_i\right)^2$.
3. Round off the following numbers to two decimal places (assume digits to the right of those shown are zero):
a. 144.0135 _______________
b. 67.245 _______________
c. 99.707 _______________
d. 13.345 _______________
e. 7.3451 _______________
f. 5.9817 _______________
g. 5.9977 _______________
For each item, answer the following questions. After that, create a small group of four members and share your
answer or thoughts about the questions. Then synthesizing the answers of your group, choose one representative
to present the answer to the class.
1. Why is a certain shoe size or brand of shirt made more available than others?
2. Why does a certain basketball player get more playing time than the rest of his teammates?
3. Have you ever computed the average of your grades because you wanted to compare it with your classmates’ grades? How did you compute it?
4. The set of data shows the scores of 35 students in their periodical test.
34 35 40 40 48 21 9
21 20 19 34 45 21 20
19 17 18 15 16 20 28
21 20 18 17 10 45 48
19 17 29 45 50 48 25
Measures of Dispersion
Computing a measure of variability is important because without it a measure of central tendency provides
an incomplete description of a distribution. The mean, for example, only indicates the central score and where the
most frequent scores are. Thus, to completely describe a set of data, we need to know not only the central tendency
but also how much the individual scores differ from each other and from the center. We obtain this information
by calculating statistics called measures of variability.
Measures of variability/dispersion indicate the extent to which individual items in a series are scattered
about an average. It is used to determine the extent of the scatter so that steps may be taken to control the existing
variation. It is also used as a measure of reliability of the average value.
Measures of variability describe the extent to which scores in a distribution differ from each other. With
many, large differences among the scores, our statistic will be a larger number, and we say the data are more
variable or show greater variability. Measures of variability communicate three related aspects of the data. First,
the opposite of variability is consistency. Small variability indicates few and/or small differences among the
scores, so the scores must be consistently close to each other (and reflect that similar behaviors are occurring).
Conversely, larger variability indicates that scores (and behaviors) were inconsistent. Second, recall that a score
indicates a location on a variable and that the difference between two scores is the distance that separates them.
From this perspective, by measuring differences, measures of variability indicate how spread out the scores and
the distribution are. Third, a measure of variability tells us how accurately the measure of central tendency
describes the distribution. Our focus will be on the mean, so the greater the variability, the more the scores are
spread out, and the less accurately they are summarized by the one, mean score. Conversely, the smaller the
variability, the closer the scores are to each other and to the mean.
One way to describe variability is to determine how far the lowest score is from the highest score. The
descriptive statistic that indicates the distance between the two most extreme scores in a distribution is called the
range.
Range
The range is probably the simplest and easiest measure of dispersion to determine. The range of a set
of measurements is the difference between the largest value and the smallest value.
Range (𝑅) = Maximum value − Minimum value
Example 12. The IQ scores of 5 members of the CHMSC men's basketball varsity team are 108, 112, 127,
116, and 113. Find the range.
Solution: R = 127 − 108 = 19
To create the definitional formula here, we simply add the square root sign to the previous defining formula
for variance. The definitional formula for the sample standard deviation is
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}}$$
Example 13: A sample of 5 households showed the following number of household members:
3, 8, 5, 4, and 4. Find the variance and standard deviation.
Solution
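First, solve for the sample mean (𝑋̅), then compute the deviations (𝑋 − 𝑋̅) and the squared deviations (𝑋 − 𝑋̅)².
$$\bar{X} = \frac{3+8+5+4+4}{5} = \frac{24}{5} = 4.8$$
The squared deviations are 3.24, 10.24, 0.04, 0.64, and 0.64, so $\sum (X_i - \bar{X})^2 = 14.8$.
Second, solve for the sample variance and sample standard deviation by substitution,
$$s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{14.8}{5-1} = 3.7$$
$$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}} = \sqrt{3.7} \approx 1.92$$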
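The computation in Example 13 can be checked with a short script. This is only a sketch: the function name `sample_variance` and the variable names are illustrative and do not come from the module.

```python
import math

def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

household_sizes = [3, 8, 5, 4, 4]            # data from Example 13
variance = sample_variance(household_sizes)  # about 3.7
std_dev = math.sqrt(variance)                # about 1.92
print(round(variance, 2), round(std_dev, 2))
```

Dividing by n − 1 rather than n is what makes this the sample (not population) variance, matching the formula used in the example.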
Example 14: Find the measures of variability for the grades in Mathematics of the two sample groups of students.
Male: 100, 65, 75, 85, 95 Female: 84, 86, 85, 82, 83
Solution
For Range R,
Male Group, Range R = 100 − 65 = 35 Female Group, Range R = 86 − 82 = 4
For Sample Variance and Sample Standard Deviation
The mean for the male group is $\bar{X} = \frac{65+75+85+95+100}{5} = 84$.
The mean for the female group is $\bar{X} = \frac{82+83+84+85+86}{5} = 84$.
For the male group, the sample variance and sample standard deviation are
$$s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{820}{5-1} = 205, \qquad s = \sqrt{205} \approx 14.32$$
For the female group, the sample variance and sample standard deviation are
$$s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{10}{5-1} = 2.5, \qquad s = \sqrt{2.5} \approx 1.58$$
Table 1
Findings. Both groups have the same mean but differ on all measures of variability. The male group is more
variable than the female group.
Conclusion. The grades of the female group are less variable than those of the male group because the female
group has the smaller standard deviation. The female group has a more uniform set of grades in Mathematics than the male group.
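The comparison in Example 14 can also be reproduced with Python's standard `statistics` module. This is a minimal sketch; the helper name `summarize` is an assumption made for illustration.

```python
import statistics

male = [100, 65, 75, 85, 95]    # grades from Example 14
female = [84, 86, 85, 82, 83]

def summarize(grades):
    """Return (mean, range, sample variance, sample standard deviation)."""
    return (
        statistics.mean(grades),
        max(grades) - min(grades),
        statistics.variance(grades),  # divides by n - 1
        statistics.stdev(grades),     # square root of the sample variance
    )

male_summary = summarize(male)
female_summary = summarize(female)
print(male_summary)
print(female_summary)
```

Both groups share the same mean, so only the range, variance, and standard deviation reveal that the male grades are more spread out.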
1. Percentiles
Percentiles are values that divide a set of observations in an array into 100 equal parts. Thus, P1, read as
first percentile, is the value below which 1% of the values fall; P2, read as second percentile, is the value below
which 2% of the values fall; …; P99, read as ninety-ninth percentile, is the value below which 99% of the values fall.
Example. The 80th percentile of a distribution is a value such that at least 80 percent of the ordered
observations are less than its value and at least 20 percent of the ordered observations are larger than its value. If
𝑃80 = 75: At least 80% of the ordered observations are less than 75 or at least 20% of the ordered observations
are larger than 75. So any observation that is smaller than 𝑃80 value belongs in the lower 80% of the distribution
while any observation greater than 𝑃80 value belongs in the upper 20% of the distribution.
To compute the ith percentile, we have
$$P_i = \text{the value of the } \left[\frac{i(n+1)}{100}\right]^{\text{th}} \text{ observation in the array}$$
Note:
➢ If the computed position $\frac{i(n+1)}{100}$ is a whole number $k$, the ith percentile is the average of the $k$th and $(k+1)$th observations.
➢ If the computed position has a fractional value, round it up to the next integer and take the observation at that position as the ith percentile.
Example 15. The following were the scores of 10 students in a short quiz. Find the 64th percentile.
2 8 6 9 7 5 8 10 10 1
Solution: First arrange the data from lowest to highest.
1 2 5 6 7 8 8 9 10 10
Then, using
$$P_i = \left[\frac{i(n+1)}{100}\right]^{\text{th}} \text{ observation},$$
we have
$$P_{64} = \left[\frac{64(10+1)}{100}\right]^{\text{th}} = 7.04^{\text{th}} \text{ or } 8^{\text{th}} \text{ observation}$$
(always round up to the next whole number).
Since the 8th observation in the ordered array is 9, the 64th percentile of the distribution is 9, which is
interpreted as 64% of the scores are below 9.
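The steps of Example 15 can be sketched in code. The function below follows the rounding-up rule used in the worked example; it does not implement the averaging rule for whole-number positions, and the clamping of the position to the array bounds is an added assumption. The name `ungrouped_percentile` is illustrative.

```python
import math

def ungrouped_percentile(values, i):
    """i-th percentile using position i(n + 1)/100, rounded up to a whole position."""
    ordered = sorted(values)                 # arrange from lowest to highest
    n = len(ordered)
    position = math.ceil(i * (n + 1) / 100)  # 1-based position in the array
    position = min(max(position, 1), n)      # assumption: keep the position inside the array
    return ordered[position - 1]

quiz_scores = [2, 8, 6, 9, 7, 5, 8, 10, 10, 1]  # data from Example 15
p64 = ungrouped_percentile(quiz_scores, 64)     # position 7.04 rounds up to 8
print(p64)
```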
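Example 16. Find the 35th percentile of the given frequency distribution of 110 scores in an achievement test.
Score Frequency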
50 – 54 10
55 – 59 3
60 – 64 8
65 – 69 13
70 – 74 17
75 – 79 19
80 – 84 22
85 – 89 13
90 – 94 4
95 – 99 1
TOTAL 110
Solution
First, add one column in the FDT for < 𝑐𝑓 and determine the P35th class using $\frac{in}{100}$.
Score Frequency < 𝑐𝑓
50 – 54 10 10
55 – 59 3 13
60 – 64 8 21
65 – 69 13 34
70 – 74 17 51
75 – 79 19 70
80 – 84 22 92
85 – 89 13 105
90 – 94 4 109
95 – 99 1 110
TOTAL 110
Using $\frac{in}{100}$, we have $\frac{in}{100} = \frac{35(110)}{100} = 38.5$. Since 38.5 falls in the class interval 70 – 74, the P35th
class is 70 – 74. Therefore, we have
$\frac{in}{100} = 38.5$
$X_{P_{35}} = 69.5$
$c = 5$
${<}cf_{P_{35}} = 34$
$f_{P_{35}} = 17$
By substitution, we have
$$P_i = X_{P_i} + c\left(\frac{\frac{in}{100} - {<}cf_{P_i}}{f_{P_i}}\right)$$
$$P_{35} = 69.5 + 5\left(\frac{38.5 - 34}{17}\right) = 70.82$$
Hence, thirty-five percent of the scores in the achievement test are below 70.82.
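The grouped-data computation above can be traced step by step in code. This is a sketch that assumes integer class limits, so each lower class boundary is the lower limit minus 0.5; the variable names are illustrative.

```python
from itertools import accumulate

lower_limits = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
frequencies = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]
class_size = 5

cumulative = list(accumulate(frequencies))  # the < cf column
n = cumulative[-1]                          # total frequency, 110
target = 35 * n / 100                       # in/100 = 38.5

# The P35 class is the first class whose < cf reaches the target.
k = next(idx for idx, cf in enumerate(cumulative) if cf >= target)
lower_boundary = lower_limits[k] - 0.5      # assumption: boundaries sit 0.5 below the limits
cf_before = cumulative[k - 1] if k > 0 else 0
p35 = lower_boundary + class_size * (target - cf_before) / frequencies[k]
print(round(p35, 2))
```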
2. Deciles
Deciles are values that divide the array into 10 equal parts. Thus, D1, read as first decile, is the value below
which 10% of the values fall; D2, read as second decile, is the value below which 20% of the values fall; …; D9,
read as ninth decile, is the value below which 90% of the values fall.
To compute the ith decile, we have
$$D_i = \text{the value of the } \left[\frac{i(n+1)}{10}\right]^{\text{th}} \text{ observation in the array}$$
Example 17. From the given set of scores in a quiz, find the 4th decile or D4.
3 8 9 11 12 18 19
Solution
Since the data are already arranged from lowest to highest, we may proceed to find the 4th decile.
3 8 9 11 12 18 19
Using $D_i = \left[\frac{i(n+1)}{10}\right]^{\text{th}}$, we have
$$D_4 = \left[\frac{4(7+1)}{10}\right]^{\text{th}} \text{ observation} = 3.2^{\text{th}} \text{ or } 4^{\text{th}} \text{ observation}$$
(always round up to the next whole number).
Since the 4th observation in the ordered array of the given distribution is 11, the 4th decile of the
distribution is 11, which is interpreted as 40% of the scores are below 11.
Approximating the ith Decile from a Frequency distribution
To solve for the decile in grouped data, we have
$$D_i = X_{D_i} + c\left(\frac{\frac{in}{10} - {<}cf_{D_i}}{f_{D_i}}\right)$$
where
The Dith class is the class where $\frac{in}{10}$ falls.
$X_{D_i}$ = the lower class boundary of the Dith class
$c$ = class size of the Dith class
${<}cf_{D_i}$ = less than cumulative frequency of the class preceding the Dith class
$f_{D_i}$ = frequency of the Dith class
Example 18. Find the 6th decile of the given frequency distribution of 110 scores in achievement test below.
Score Frequency
50 – 54 10
55 – 59 3
60 – 64 8
65 – 69 13
70 – 74 17
75 – 79 19
80 – 84 22
85 – 89 13
90 – 94 4
95 – 99 1
TOTAL 110
Using $\frac{in}{10}$, we have $\frac{in}{10} = \frac{6(110)}{10} = 66$. Since 66 falls in the class interval 75 – 79, the D6th class is
75 – 79. Therefore, we have
$\frac{in}{10} = 66$
$X_{D_6} = 74.5$
$c = 5$
${<}cf_{D_6} = 51$
$f_{D_6} = 19$
By substitution, we have
$$D_6 = X_{D_6} + c\left(\frac{\frac{in}{10} - {<}cf_{D_6}}{f_{D_6}}\right)$$
$$D_6 = 74.5 + 5\left(\frac{66 - 51}{19}\right) = 78.45$$
Hence, sixty percent of the scores in the achievement test are below 78.45.
3. Quartiles
Quartiles are values that divide the array into 4 equal parts. Thus, Q1, read as first quartile, is the value
below which 25% of the values fall; Q2, read as second quartile, is the value below which 50% of the values fall;
and Q3, read as third quartile, is the value below which 75% of the values fall.
Example 19. From the given set scores in a quiz find the 3rd quartile or Q3
3 8 9 11 12 18 19
Solution
Since the data are already arranged from lowest to highest, we may proceed to find the 3rd quartile.
3 8 9 11 12 18 19
Using $Q_i = \left[\frac{i(n+1)}{4}\right]^{\text{th}}$, we have
$$Q_3 = \left[\frac{3(7+1)}{4}\right]^{\text{th}} \text{ observation} = 6^{\text{th}} \text{ observation}.$$
Since the 6th observation in the ordered array of the given distribution is 18, the 3rd quartile of
the distribution is 18, which is interpreted as 75% of the scores are below 18.
Approximating the ith Quartile from a Frequency distribution
To solve for the quartile in grouped data, we have
$$Q_i = X_{Q_i} + c\left(\frac{\frac{in}{4} - {<}cf_{Q_i}}{f_{Q_i}}\right)$$
where
The Qith class is the class where $\frac{in}{4}$ falls.
$X_{Q_i}$ = the lower class boundary of the Qith class
$c$ = class size of the Qith class
${<}cf_{Q_i}$ = less than cumulative frequency of the class preceding the Qith class
$f_{Q_i}$ = frequency of the Qith class
Example 20. Find the 1st quartile of the given frequency distribution of 110 scores in achievement test below.
Score Frequency
50 – 54 10
55 – 59 3
60 – 64 8
65 – 69 13
70 – 74 17
75 – 79 19
80 – 84 22
85 – 89 13
90 – 94 4
95 – 99 1
TOTAL 110
Using $\frac{in}{4}$, we have $\frac{in}{4} = \frac{1(110)}{4} = 27.5$. Since 27.5 falls in the class interval 65 – 69, the Q1st class is
65 – 69. Therefore, we have
$\frac{in}{4} = 27.5$
$X_{Q_1} = 64.5$
$c = 5$
${<}cf_{Q_1} = 21$
$f_{Q_1} = 13$
By substitution, we have
$$Q_1 = X_{Q_1} + c\left(\frac{\frac{in}{4} - {<}cf_{Q_1}}{f_{Q_1}}\right)$$
$$Q_1 = 64.5 + 5\left(\frac{27.5 - 21}{13}\right) = 67$$
Hence, 25% of the scores in the achievement test are below 67.
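The grouped formulas for percentiles, deciles, and quartiles differ only in the divisor (100, 10, or 4), so one function can cover all three. The sketch below makes that explicit; the name `grouped_quantile`, the `parts` parameter, and the assumption of integer class limits (boundaries 0.5 below the lower limit) are illustrative choices, not part of the module.

```python
def grouped_quantile(classes, i, parts):
    """i-th quantile of grouped data split into `parts` equal parts.

    classes: list of (lower_limit, upper_limit, frequency) in ascending order.
    parts:   100 for percentiles, 10 for deciles, 4 for quartiles.
    """
    n = sum(freq for _, _, freq in classes)
    target = i * n / parts
    cf_before = 0  # less-than cumulative frequency of the preceding classes
    for lower, upper, freq in classes:
        if cf_before + freq >= target:
            lower_boundary = lower - 0.5    # assumption: integer class limits
            class_size = upper - lower + 1
            return lower_boundary + class_size * (target - cf_before) / freq
        cf_before += freq
    raise ValueError("i must not exceed parts")

scores = [(50, 54, 10), (55, 59, 3), (60, 64, 8), (65, 69, 13), (70, 74, 17),
          (75, 79, 19), (80, 84, 22), (85, 89, 13), (90, 94, 4), (95, 99, 1)]

d6 = grouped_quantile(scores, 6, 10)  # Example 18
q1 = grouped_quantile(scores, 1, 4)   # Example 20
print(round(d6, 2), round(q1, 2))
```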
4. 𝒛 −Score
A 𝑧-score is used to find the position of one observation relative to the others in a set of data. Let us say we
want to compare a student's score of 42 with the scores of the other students in the class on a quiz
with a total of 50 points. The mean and the standard deviation of the scores can be used to compute a 𝑧-score,
which measures the relative standing of a measurement in a data set.
A 𝑧-score measures the distance between an observation and the mean, in units of standard
deviation. The following formulas show how to compute the 𝑧-score for a data value 𝑋 in a population and in a
sample.
For population: $z = \frac{X - \mu}{\sigma}$        For sample: $z = \frac{X - \bar{X}}{s}$
Example 21: The monthly expenditures of a large group of households have a mean of ₱48,700 and a standard
deviation of ₱10,400. What are the 𝑧-values of monthly expenditures of ₱59,100 and ₱38,300?
Solution
Let 𝜇 = ₱48,700 and 𝜎 = ₱10,400.
Using the formula for 𝑧, the 𝑧-values for the two 𝑋 values (₱59,100 and ₱38,300) are computed as
follows:
For ₱59,100: $z = \frac{X - \mu}{\sigma} = \frac{59{,}100 - 48{,}700}{10{,}400} = 1.00$
For ₱38,300: $z = \frac{X - \mu}{\sigma} = \frac{38{,}300 - 48{,}700}{10{,}400} = -1.00$
The 𝑧 of 1.00 indicates that a monthly expenditure of ₱59,100 for households is one standard deviation above the
mean, and a 𝑧 of −1.00 shows that a ₱38,300 monthly expenditure is one standard deviation below the mean.
Note that both household monthly expenditures (₱59,100 and ₱38,300) are the same distance (₱10,400) from the
mean.
Example 22: Raul has taken two tests in his mathematics class. He scored 72 on the first test, for which the mean
of all scores was 65 and the standard deviation was 8. He received a 60 on a second test, for which the mean of
all scores was 45 and the standard deviation was 12. In comparison to the other students, did Raul do better on
the first test or the second test?
Solution: Find the 𝑧-score for each test.
$$z_{72} = \frac{72 - 65}{8} = 0.875 \qquad z_{60} = \frac{60 - 45}{12} = 1.25$$
Raul scored 0.875 standard deviation above the mean on the first test and 1.25 standard deviations above the mean
on the second test. These 𝑧 −scores indicate that, in comparison to his classmates, Raul scored better on the
second test than he did on the first test.
Example 23: A consumer group tested a sample of 100 light bulbs. It found that the mean life expectancy of the
bulbs was 842 h, with a standard deviation of 90 h. One particular bulb from the DuraBright Company had a
𝑧-score of 1.2. What was the life span of this light bulb?
Solution: Substitute the given values into the 𝑧-score equation and solve for 𝑋.
$$z = \frac{X - \bar{X}}{s}$$
$$1.2 = \frac{X - 842}{90}$$
$$108 = X - 842$$
$$950 = X$$
The light bulb had a life span of 950 h.
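Examples 22 and 23 use the 𝑧-score formula in both directions: from a value to its 𝑧-score, and from a 𝑧-score back to a value. A minimal sketch follows; the function names `z_score` and `value_from_z` are illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

def value_from_z(z, mean, std_dev):
    """Invert the z-score formula: x = mean + z * std_dev."""
    return mean + z * std_dev

# Example 22: Raul's two tests
z_first = z_score(72, 65, 8)    # 0.875
z_second = z_score(60, 45, 12)  # 1.25
better_test = "second" if z_second > z_first else "first"

# Example 23: life span of the bulb with z = 1.2
bulb_life = value_from_z(1.2, 842, 90)  # approximately 950 h
print(z_first, z_second, better_test, round(bulb_life))
```

Comparing 𝑧-scores rather than raw scores is what allows two tests with different means and spreads to be placed on the same scale.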
5. Box-and-Whisker Plot
A box-and-whisker plot (sometimes called a boxplot) is often used to provide a visual summary of a set of data.
It is a graph of a data set obtained by drawing a horizontal line from the minimum data value to the first quartile (𝑄1),
drawing a horizontal line from the third quartile (𝑄3) to the maximum data value, and drawing a box whose vertical sides
pass through 𝑄1 and 𝑄3, with a vertical line inside the box passing through the median or second quartile (𝑄2).
Example 24: Construct a boxplot for the data set of the ages of 9 middle-management employees of a certain
company. The ages are 53, 45, 59, 48, 54, 46, 51, 58, and 55. What can you say about the distribution of the
data set?
Solution:
Step 1: Determine the 𝑄1, Median, and 𝑄3 of the given data set. Recall that 𝑄1 = 47, Median = 53, and 𝑄3 =
56.5.
Step 2: Locate the lowest value, 𝑄1, the median, 𝑄3 , and the highest value on the scale.
Step 3: Draw a box around 𝑄1 and 𝑄3 , draw a vertical line through the median, and connect the upper and lower
values, as shown in the figure below.
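The five values needed for the boxplot in Example 24 can be obtained with the standard library. This sketch uses `statistics.quantiles` with its "exclusive" method, which interpolates between neighboring observations at position i(n + 1)/4 instead of rounding the position up; for this data set it reproduces the quartiles quoted in Step 1.

```python
import statistics

ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]  # data from Example 24

# quantiles() sorts the data internally and returns the three cut points Q1, Q2, Q3.
q1, median, q3 = statistics.quantiles(ages, n=4, method="exclusive")
five_number_summary = (min(ages), q1, median, q3, max(ages))
print(five_number_summary)
```

The whiskers run from the minimum to 𝑄1 and from 𝑄3 to the maximum, and the box spans 𝑄1 to 𝑄3 with a line at the median.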
Activity:
4. The table below shows the heights, in inches, of 15 randomly selected National Basketball Association (NBA) players
and 15 randomly selected Division I National Collegiate Athletic Association (NCAA) players.
NBA 84, 76, 79, 75, 81, 81, 76, 85,
78, 79, 78, 78, 84, 75, 76
NCAA 78, 73, 73, 78, 77, 76, 75, 74,
74, 81, 75, 78, 78, 79, 73
Using the same scale, draw a box-and-whisker plot for each of the two data sets, placing the second plot below the first.
Write a valid conclusion based on the data.
5. A survey of 16 energy drinks noted the caffeine concentration of each drink in milligrams per ounce. The results are
given below. Find the mean, median, mode, range, variance, and standard deviation of these data. Concentration of
caffeine (mg/oz): 9.1, 7.8, 7.5, 8.9, 9.0, 8.2, 9.1, 8.7, 9.0, 7.7, 8.8, 8.9, 9.0, 9.1, 8.2, 8.9, 7.0
6. Given the data set below, find the mean, median and mode.
Frequency Distribution of Grades in College Algebra
Grade Number of Students
90 – 100 9
80 - 89 30
70 – 79 35
60 – 69 8
50 – 59 9
40 – 49 2
30 – 39 3
20 – 29 1
10 – 19 2
0–9 1
Total 100
7. A professor grades students on 4 tests, a term paper, and a final examination. Each test counts as 15% of the course
grade. The term paper counts as 20% of the course grade. The final examination counts as 20% of the course grade.
Alan has test scores of 80, 78, 92, and 84. Alan received an 84 on his term paper. His final examination score was 88.
Use the weighted mean (average) formula to find Alan’s average for the course. (Hint: The sum of all weights is 100%
or 1)
8. A psychologist obtained the IQ scores of 10 students. The IQ scores are as follows:
110 95 85 140 132 100 95 70 85 100
Find P65, D3, D9 and Q3. Interpret the values.
9. A test involving 380 men ages 20 to 24 found that their blood cholesterol levels had a mean of 182 mg/dl and a standard
deviation of 44.2 mg/dl.
a) Determine the 𝑧-score, to the nearest hundredth, for one of the men who had a blood cholesterol level of 214 mg/dl.
b) The 𝑧-score for one man was −1.58. What was his blood cholesterol level? Round to the nearest hundredth.
10. The blood lead concentrations, in micrograms per deciliter (𝜇𝑔/𝑑𝐿), of 20 children from two different neighborhoods
were measured. The results are recorded in the table.
Neighborhood 1 3.97, 3.91, 3.98, 3.70, 4.13, 3.97, 4.01, 3.88, 4.11, 3.70, 3.96,
3.77, 4.30, 4.08, 4.12, 4.93, 3.93, 3.94, 3.85, 3.83
Neighborhood 2 4.31, 4.22, 3.78, 4.10, 4.34, 4.20, 4.35, 4.20, 4.01, 4.04, 4.28,
4.12, 4.59, 4.12, 4.01, 3.85, 3.96, 4.28, 4.39, 4.13
Using the same scale, draw a box-and-whisker plot for each of the two data sets, placing the second plot below the first.
Considering that high blood lead concentrations are harmful to humans, in which of the two neighborhoods would you
prefer to live?
11. Find the P20, D4, D6 and Q3 of the following distribution of the ages of the members of a labor union. Interpret the
values.
AGE (Years) FREQUENCY
15 – 19 18
20 – 24 42
25 – 29 78
30 – 34 115
35 – 39 178
40 – 44 107
45 – 49 88
50 – 54 52
55 – 59 30
60 – 64 11
TOTAL 719
1. What is the mean, median and the mode of the given data? What can you say about the values of three central
tendencies?
2. What is the maximum data value as shown on the histogram? (What is the largest value on the data axis?)
3. What is the minimum data value as shown on the histogram? (What is the smallest value on the data axis?)
4. Is the histogram symmetric, skewed to the left, skewed to the right, bell-shaped, uniform or does it have no special shape?
5. How many peaks does the histogram have, and where are they located? (Peaks are bars with shorter bars on each side. First bars that
are taller than second bars or last bars that are taller than the preceding are also called peaks. Two or more adjacent bars of the same
height with neighboring shorter bars - a plateau - would be considered one peak.)
According to Frost (2020), the normal distribution is the most important probability distribution in
statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and
IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve. The
normal distribution is a probability function that describes how the values of a variable are distributed. It is a
symmetric distribution where most of the observations cluster around the central peak and the probabilities for
values further away from the mean taper off equally in both directions. Extreme values in both tails of the
distribution are similarly unlikely (Frost, 2020). The standard normal distribution is the distribution of a normal
random variable with a mean equal to zero (𝜇 = 0) and a standard deviation equal to one (𝜎 = 1). It is represented by a normal curve.
The graph of a normal distribution depends on two factors: the mean and the
standard deviation. The mean of the distribution determines the location of the center of the graph, and the
standard deviation determines the height and width of the graph. The graph of every normal distribution is a
symmetric, bell-shaped curve.
When the standard deviation is small, the curve is tall and narrow; and when the standard deviation is big, the
curve is short and wide (see above).
Definition. Central Limit Theorem. If 𝑛 (the sample size) is large, the theoretical sampling distribution of the
mean can be approximated closely with a normal distribution.
Properties of a Normal Distribution
Every normal distribution has the following properties.
a) It is symmetrical about the vertical line drawn through 𝑧 = 0. This means that the shape of the distribution at the
right is a mirror image of the left.
b) The highest point in the curve is 𝑦 = 0.3989.
c) The curve is asymptotic to the x-axis. This means that both positive and negative ends approach the horizontal
axis but do not touch it.
d) For all practical purposes, the area under the normal curve from 𝑧 = −3 to 𝑧 = +3 equals 1.0 or 100%, hence the
term unit normal curve.
e) The three measures of central tendency (mean, median, and mode) coincide with each other.
f) It is dependent on the values of the mean and standard deviation.
Example 1: The area under a bell curve gives the probability that a randomly caught fish’s weight is
somewhere in a given interval. Thus, there is a big chance of catching a fish whose weight is somewhere
between 450 grams to 550 grams. Moreover, there is a small chance of catching a fish that weighs more than
550 grams. The question is how does one get the area under that curve? The z-Table gives the answer to this
question.
To do that, standardize the normal curve. Then refer to the z-Table to obtain the value. There is a need to
standardize a normal variable. Without this process, finding the area under a particular curve is close to
impossible. Setting up table of values for a normal variable just like the z-table is very difficult; added to this
burden is the number of countless possibilities for the mean and standard deviation of a normal variable. To
simplify the task of getting area, refer to the z-table (area under the normal curve) above. The z-table (area
under the normal curve) has the following properties.
1) The total area under the normal curve is 1 or 100%.
2) Since the normal curve is symmetrical about the mean, then half the normal curve has an area of 0.5.
3) The table on the next page gives only the area to the right of the mean.
4) The given area in the table is the area from 𝑧 = 0 to ±𝑧.
5) Area is always positive, but 𝑧 can be either positive or negative.
6) Always draw the curve and shade the given region.
7) Simple arithmetic, addition and subtraction are the only operations needed to get the correct area.
The normal random variable of a standard normal distribution is called a standard score or a 𝑧-score. Every
normal random variable 𝑋 can be transformed into a 𝑧-score using the following equation
$$z = \frac{X - \mu}{\sigma}$$
where 𝑋 is a normal random variable, 𝜇 is the mean of 𝑋, and 𝜎 is the standard deviation of 𝑋.
Example 2: Determine the area under the normal curve between 𝑧 = 0 and 𝑧 = 1.85.
Solution: Draw the figure and represent the area.
Since z-table gives the area between 0 and any 𝑧-value to the right of 0, we only need to look up the 𝑧-value in
the table. Find 1.8 in the left column and 0.05 in the top row. The value where the column and row meet in the
table is the answer, 0.4678.
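A table lookup like the one in Example 2 can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_from_zero` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_from_zero(z):
    """Area under the standard normal curve between 0 and z (never negative)."""
    return abs(standard_normal.cdf(z) - 0.5)

area = area_from_zero(1.85)
print(round(area, 4))
```

Because the curve is symmetric, the same function applies to negative values: `area_from_zero(-1.15)` equals `area_from_zero(1.15)`.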
Example 3: Determine the area under the normal curve between 𝑧 = 0 and 𝑧 = −1.15.
Solution: Draw the figure and represent the area.
Example 4: Find the area under the normal curve to the right of 𝑧 = 1.15.
Solution: Draw the figure and represent the area.
The required area is the right tail of the normal curve. Since the z-table gives the area between 𝑧 = 0 and 𝑧 = 1.15,
first find that area.
𝑃 (0 < 𝑧 < 1.15) = 0.3749
Then subtract 𝑃(0 < 𝑧 < 1.15) = 0.3749 from 0.5000, since half of the area under the normal curve is to the right of
𝑧 = 0.
𝑃 (𝑧 > 1.15) = 0.5000 − 𝑃(0 < 𝑧 < 1.15)
= 0.5000 − 0.3749
= 0.1251
Therefore, the area to the right of 𝑧 = 1.15 is 0.1251 or 12.51%.
Example 5: Determine the area under the normal curve between 𝑧 = 0.75 and 𝑧 = 1.85.
Solution: 𝑃(0 < 𝑧 < 1.85) = 0.4678 and 𝑃(0 < 𝑧 < 0.75) = 0.2734
Hence, 𝑃(0.75 < 𝑧 < 1.85) = 𝑃(0 < 𝑧 < 1.85) − 𝑃(0 < 𝑧 < 0.75) = 0.4678 − 0.2734 = 0.1944
Therefore, the area is 0.1944 or 19.44%.
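The tail and between-two-values areas of Examples 4 and 5 can be cross-checked with the cumulative distribution function. This is a sketch using `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Example 4: area to the right of z = 1.15
right_tail = 1 - Z.cdf(1.15)

# Example 5: area between z = 0.75 and z = 1.85
between = Z.cdf(1.85) - Z.cdf(0.75)

print(round(right_tail, 4), round(between, 4))
```

The computed values can differ from the table-based answers in the fourth decimal place, because the table entries are rounded before they are added or subtracted.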
Example 6: Find the area under the normal curve between 𝑧 = −1.85 and 𝑧 = 1.15.
Solution: 𝑃(−1.85 < 𝑧 < 0) = 0.4678 and 𝑃(0 < 𝑧 < 1.15) = 0.3749
Hence, 𝑃(−1.85 < 𝑧 < 1.15) = 𝑃(−1.85 < 𝑧 < 0) + 𝑃(0 < 𝑧 < 1.15) = 0.4678 + 0.3749 = 0.8427
Example 7: Find the 𝑧-value such that the area under the normal curve between 0 and the 𝑧-value is 0.3962.
Solution: Find the area 0.3962 in the body of the z-table. Then read the corresponding 𝑧-value: 1.2 in the left column
and 0.06 in the top row. Add these two values to get 𝑧 = 1.26.
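The reverse lookup in Example 7 corresponds to the inverse of the cumulative distribution function. A minimal sketch, assuming the area is measured from 0 to a positive 𝑧, so the cumulative area to the left of 𝑧 is 0.5 plus the given area:

```python
from statistics import NormalDist

area_from_zero_to_z = 0.3962
# Total area to the left of z is the left half (0.5) plus the given area.
z = NormalDist().inv_cdf(0.5 + area_from_zero_to_z)
print(round(z, 2))
```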
Fourth, add 𝑃(−1.00 < 𝑧 < 0) and 𝑃(0 < 𝑧 < 1.20).
𝑃 (35 < 𝑋 < 46) = 𝑃 (−1.00 < 𝑧 < 1.20) = 𝑃(−1.00 < 𝑧 < 0) + 𝑃(0 < 𝑧 < 1.20) = 0.3413 + 0.3849
= 0.7262
Hence, the probability that a randomly selected bank manager is between 35 and 46 years old is 0.7262 or
72.62%.
Example 11: The Emotional Quotient (EQ) score on the latest version of the Sirug EQ Test is transformed so as to
follow a normal distribution with a mean of 110 and a standard deviation of 15. Find the 96th percentile of the
distribution of Emotional Quotient scores.
Solution:
First, sketch the graph to illustrate the scenario.
Activity:
6. The average daily jail population in the New Bilibid Prison in Muntinlupa City is 36,290. If the distribution is normal
and the standard deviation is 3,750, find the probability that on a randomly selected day, the jail population is greater
than 40,145.
7. The average credit card debt for a public school teacher is ₱14,970. If the debt is normally distributed with a standard
deviation of ₱5,650, find the probabilities: (a) that the teacher owes at least ₱6,740, (b) that the teacher owes more than
₱19,270, and (c) that the teacher owes between ₱6,740 and ₱19,270
8. To qualify for a Master’s degree program in Business Administration at San Sebastian College, candidates must score
in the top 20% on a mental ability test. The test has a mean of 180 and a standard deviation of 25. Find the lowest
possible score to qualify. Assume the test scores are normally distributed.
Linear Regression
When performing research studies, scientists often wish to know whether two variables are related. If the
variables are determined to be related, a scientist may then wish to find an equation that can be used to model
the relationship. For instance, the zoology professor R. McNeill Alexander wanted to determine whether the stride
length of a dinosaur, as shown by its fossilized footprints, could be used to estimate the speed of the dinosaur.
Stride length for an animal is defined as the distance x from a particular point on a footprint to that same point on
the next footprint of the same foot. (See the figure below.) Because no dinosaurs were available, Alexander and
fellow scientist A. S. Jayes carried out experiments with many types of animals, including adult men, dogs,
camels, ostriches, and elephants. The results of these experiments tended to support the idea that the speed y of
an animal is related to the animal’s stride length x. To better understand this relationship, examine the data in the
table below, which are similar to, but less extensive than, the data collected by Alexander and Jayes.
Table 1.a Adult men
Table 1.b Dogs
Table 1.c Camels
A graph of the ordered pairs in Table 1 is shown in Figure 1. In this graph, which is called a scatter
diagram or scatter plot, the x-axis represents the stride lengths
in meters and the y-axis represents the average speeds in meters per second. The scatter diagram seems to
indicate that for each of the three species, a larger stride length generally produces a faster speed. Also note that
for each species, a straight line can be drawn such that all of the points for that species lie on or very close to the
line. Thus, the relationship between speed and stride length appears to be a linear relationship.
After a relationship between paired data, which are referred to as bivariate data, has been discovered, a
scientist tries to model the relationship with an equation. One method of determining a linear relationship for
bivariate data is called linear regression. To see how linear regression is carried out, let us concentrate on the
bivariate data for the dogs, which is shown by the green points in Figures 1 and 2. There are many lines that can
be drawn such that the data points lie close to the line; however, scientists are generally interested in the line
called the line of best fit or the least-squares regression line.
The least-squares regression line for a set of bivariate data is the line that minimizes the sum of the squares of
the vertical deviations from each data point to the line.
The least-squares regression line is also called the least-squares line. The approximate equation of the
least-squares line for the bivariate data for the dogs is 𝑦̂ = 3.2𝑥 − 1.1. Figure 2 shows the graph of these data
and the graph of 𝑦̂ = 3.2𝑥 − 1.1. In Figure 2, the vertical deviations from the ordered pairs to the graph of 𝑦̂ =
3.2𝑥 − 1.1 are 0, −0.06, 0.5, −0.52, −0.16, −0.6, 0.34 and 0.2. It is traditional to use the symbol 𝑦̂
(pronounced y-hat) in place of y in the equation of a least-squares line. This also helps us differentiate the line’s
y-values from the y-values of the given ordered pairs. The next formula can be used to determine the equation
of the least-squares line for a given set of ordered pairs.
Example 1: Find the equation of the least-squares line for the ordered pairs in Table 1.a.
Solution
The ordered pairs are (2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6), (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7)
The number of ordered pairs is 𝑛 = 8. Organize the data in four columns, as shown in Table 2. Then find the
sum of each column.
Table 2
If 𝑎 and 𝑏 are each rounded to the nearest tenth, to reflect the accuracy of the original data, then we have as our
equation of the least-squares line:
𝑦̂ = 𝑎𝑥 + 𝑏
𝑦̂ = 2.7𝑥 − 3.3
See figure 3 below.
Figure 3. Least-squares line for speed versus stride length in adult men
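The result of Example 1 can be reproduced from the ordered pairs. The sketch below assumes the standard closed-form least-squares formulas, a = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and b = ȳ − a·x̄; the function name `least_squares_line` is illustrative.

```python
def least_squares_line(points):
    """Return (a, b) for the least-squares line y-hat = a*x + b."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    b = sum_y / n - a * sum_x / n                                 # intercept
    return a, b

# (stride length in m, speed in m/s) for adult men, from Example 1
adult_men = [(2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6),
             (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7)]

a, b = least_squares_line(adult_men)
print(round(a, 1), round(b, 1))
```

Rounded to the nearest tenth, the slope and intercept agree with the equation 𝑦̂ = 2.7𝑥 − 3.3 given in the example.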
Example 2: Use the equation of the least-squares line from Example 1 to predict the average speed of an adult
man for each of the following stride lengths. Round your results to the nearest tenth of a meter per second.
a) 2.8 m
b) 4.8 m
Solution
a) In Example 1, we found the equation of the least-squares line to be 𝑦̂ = 2.7𝑥 − 3.3. Substituting 2.8 for 𝑥 gives
𝑦̂ = 2.7𝑥 − 3.3 = 2.7(2.8) − 3.3 = 4.26
Rounding 4.26 to the nearest tenth produces 4.3. Thus 4.3 m/s is the predicted average speed for an adult man
with a stride length of 2.8 m.
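The substitution in Example 2 amounts to evaluating the line at a given stride length. A minimal sketch, using the rounded coefficients from Example 1 as defaults; the function name `predict_speed` is illustrative.

```python
def predict_speed(stride_length_m, a=2.7, b=-3.3):
    """Predicted average speed (m/s) from the least-squares line y-hat = a*x + b."""
    return a * stride_length_m + b

speed_a = round(predict_speed(2.8), 1)  # part a: stride length 2.8 m
speed_b = round(predict_speed(4.8), 1)  # part b: stride length 4.8 m
print(speed_a, speed_b)
```

A stride length of 4.8 m lies outside the range of the observed data (2.5 m to 4.5 m), so the second prediction is an extrapolation and should be read with more caution than the first.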
Activity:
a. Analyze the three scatter plots and determine which type of relationship (positive, negative, no), if any,
exists. Compare the three scatter plots.
b. Give one real-life example of positive relationship, negative relationship, and no relationship between
two variables.
What is the significance of the fact that the linear correlation coefficient is positive in Example 3? (Answer: It
indicates a positive correlation between a man’s stride length and his speed. That is, as a man’s stride length
increases, his speed also increases.)
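For illustration, the linear correlation coefficient can be computed for the adult-men pairs listed in Example 1. The sketch assumes the standard formula r = (nΣxy − ΣxΣy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)] and assumes that these are the data Example 3 refers to; the function name `linear_correlation` is illustrative.

```python
import math

def linear_correlation(points):
    """Linear (Pearson) correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# (stride length in m, speed in m/s) for adult men, from Example 1
adult_men = [(2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6),
             (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7)]

r = linear_correlation(adult_men)
print(round(r, 2))
```

A value of r close to 1 reflects the strong positive linear relationship between stride length and speed.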
The linear correlation coefficient indicates the strength of a linear relationship between two variables; however,
it does not indicate the presence of a cause-and-effect relationship. For instance, the data in Table 3 show the
hours per week that a student spent playing pool and the student’s weekly algebra test scores for those same
weeks.
Table 3. Algebra Test Scores vs. Hours Spent Playing Pool
The linear correlation coefficient for the ordered pairs in the table is 𝑟 ≈ 0.98. Thus there is a strong
positive linear relationship between the student’s algebra test scores and the time the student spent playing pool.
This does not mean that the higher algebra test scores were caused by the increased time spent playing pool.
The fact that the student’s test scores increased with the increase in the time spent playing pool could be due to
many other factors or it could just be a coincidence. In your work with applications that involve the linear
correlation coefficient 𝑟, it is important to remember the following properties of 𝑟.
In Activity 3 and 4, find the equation of the least-squares line and the linear correlation coefficient for the given data.
Round the constants 𝑎, 𝑏, and 𝑟 to the nearest hundredth.
1. {(2,6), (3, 6), (4,8), (6,11), (8,18)}
2. {(2, −3), (3, −4), (4, −9), (5, −10), (7, −12)}
3. A student has recorded the data in the following table, which shows the distance a spring stretches in inches for a
given weight in pounds.
In Activity 3 and 4, find the equation of the least-squares line and the linear correlation coefficient for the given data.
Round the constants 𝑎, 𝑏, and 𝑟 to the nearest hundredth.
1. {(−7, −11.7), (−5, −9.8), (−3, −8.1), (1, −5.9), (2, −5.7)}
2. {(1, 4.1), (2, 6.0), (4, 8.2), (6, 11.5), (8, 16.2)}
3. The following table shows the percent of water and the number of calories in various canned soups to which 100
grams of water are added.
a) Find the equation of the least-squares line for the data. Round constants to the nearest hundredth.
b) Use the equation in part 𝑎 to find the expected number of calories in a soup that is 89% water. Round to the
nearest whole number.
c) Determine the linear correlation coefficient between the percent of water and the number of calories in various
canned soups.