22UCS303 DS-Unit III-N

Uploaded by

niheleshmu

Dr. N.G.P.

Institute of Technology - Coimbatore-48

(An Autonomous Institution)

22UCS303 - DATA SCIENCE ESSENTIALS

Unit- III DESCRIBING DATA

Text Books:
1. Python Data Science Handbook: Essential Tools for Working with Data, Jake VanderPlas, O'Reilly Media, 2nd edition, 2022.
2. Data Science from Scratch: First Principles with Python, Joel Grus, O'Reilly, 2nd edition, 2019.
Unit- III DESCRIBING DATA
Types of Data - Types of Variables - Describing Data with Tables and Graphs - Outliers, Relative
Frequency Distributions, Cumulative Frequency Distributions, Frequency Distributions for
Qualitative (Nominal) Data, Graphs for Quantitative Data, Histogram, Frequency Polygon, Stem
and Leaf Display, Typical Shapes, A Graph for Qualitative (Nominal) Data, Describing Data with
Averages, Mode, Median, Mean

10/3/2024 2
❖ Course Objectives: The course aims to

COB1: Describe the various types of data

COB2: Discuss the data science process

COB3: Describe data using various statistical techniques

COB4: Understand pandas and NumPy arrays

COB5: Understand plotting functions

Course outcomes: At the end of the course, students will be able to

CO1: Identify the phases involved in the life cycle of Data Science.

CO2: Apply the Data Science process to real-time scenarios.
CO3: Realize the various data analytics techniques for labeled/columnar data using Python Pandas.
CO4: Explore a flexible range of data visualization approaches in Python.
CO5: Analyze various Machine Learning algorithms for data modeling with Python.

In data science, data can be categorized into various types based on its
nature and characteristics.

Understanding these types of data is essential for data analysis and
modeling. Here are some common types of data in data science:
1. Numerical Data:
•Continuous Data: Data that can take any real value within a
given range. Examples include temperature, weight, and height.
•Discrete Data: Data that can only take specific, distinct values.
Examples include the number of cars in a parking lot or the
count of customer arrivals.

2. Categorical Data:
•Nominal Data: Data that represents categories or labels without any inherent order.
Examples include colors, types of fruits, or customer IDs.
•Ordinal Data: Data with categories that have a meaningful order or ranking. Examples
include education levels (e.g., high school, bachelor's, master's) or customer satisfaction
ratings (e.g., very dissatisfied, dissatisfied, neutral, satisfied, very satisfied).
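The practical difference between nominal and ordinal data can be sketched in a few lines of Python. This is only an illustration: the observations and the names `fruits`, `ratings`, and `SATISFACTION_ORDER` are invented here, and the ranking list simply follows the satisfaction scale quoted above.

```python
from collections import Counter

# Hypothetical observations, invented for illustration only.
fruits = ["apple", "mango", "apple", "banana", "apple"]          # nominal
ratings = ["satisfied", "neutral", "very satisfied", "neutral"]  # ordinal

# Nominal data: categories can be counted, but they have no inherent order.
fruit_counts = Counter(fruits)

# Ordinal data: an explicit ranking makes order comparisons meaningful.
SATISFACTION_ORDER = [
    "very dissatisfied", "dissatisfied", "neutral", "satisfied", "very satisfied",
]
rank = {label: position for position, label in enumerate(SATISFACTION_ORDER)}
ratings_sorted = sorted(ratings, key=rank.__getitem__)
lowest_rating = ratings_sorted[0]

print(fruit_counts.most_common(1))  # most frequent category
print(ratings_sorted)               # ratings from least to most satisfied
```

Counting works for both kinds of data; sorting by rank only makes sense for the ordinal ratings.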

3. Text Data:
Unstructured data that includes text, documents, or free-form comments. Text data
often requires natural language processing (NLP) techniques for analysis.

4. Time Series Data:
Data collected over a sequence of time intervals. Examples include stock prices,
temperature measurements over time, or web traffic data.

5. Spatial Data:
Data associated with geographic or spatial information. Examples include
GPS coordinates, maps, or geospatial data used in GIS (Geographic
Information Systems).
6. Binary Data:
Data that can take only two values, often represented as 0 and 1. Examples
include yes/no, true/false, or on/off.
7. Image Data:
Data consisting of pixel values that form images. Common in computer
vision tasks.
8. Audio Data:
Data representing sound waves, used in applications like speech
recognition and audio analysis.
9. Multi-dimensional Data:
Data with multiple dimensions or features. This can include data from
sensors, IoT devices, or high-dimensional datasets used in machine
learning.
10. Relational Data:
Data stored in databases with tables and relationships, often used in data
analysis and SQL-based queries.
11. Graph Data:
Data represented as a graph or network, where nodes and edges represent
entities and their connections. Used in social network analysis,
transportation networks, and more.
12. Anomaly Data:
Data that may contain anomalies or outliers, which are values significantly
different from the majority of the data. Detecting anomalies is crucial in
various applications, such as fraud detection.
Types of Variables
Data science commonly distinguishes four types of variables:
1. Numerical
2. Categorical
3. DateTime
4. Mixed

Basic Statistical Descriptions of Data

• Statistics exists because of the prevalence of variability in the real world.

• WHAT IS STATISTICS?
• Statistics is the study of the collection, analysis, interpretation, presentation, and
organization of data.

• Descriptive Statistics: statistics provides us with tools (tables, graphs, averages,
ranges, correlations) for organizing and summarizing the inevitable variability in
collections of actual observations or scores.
• Example: a graph showing the annual change in global temperature during the last 30 years.

• Inferential Statistics: this more advanced area provides tools that permit us to use a
relatively small collection of actual observations to evaluate, for example, an assertion
about the relationship between job satisfaction and overall happiness.
What type of statistic is this?
(a) Students in my statistics class are, on average, 23 years old.
(b) The population of the world exceeds 7 billion (that is, 7,000,000,000 or 1 million
multiplied by 7000).
(c) Either four or eight years have been the most frequent terms of office actually
served by U.S. presidents.
(d) Sixty-four percent of all college students favor right-to-abortion laws.

Answers
(a) descriptive statistics
(b) inferential statistics
(c) descriptive statistics
(d) inferential statistics
Data ??

• Data: A collection of actual observations or scores in a survey or an experiment.
• Qualitative Data: A set of observations where any single observation is a word, letter, or
numerical code that represents a class or category.
• Ranked Data: A set of observations where any single observation is a number that
indicates relative standing.
• Quantitative Data: A set of observations where any single observation is a number that
represents an amount or a count.

Identify which type of data
(a) ethnic group: qualitative
(b) age: quantitative
(c) family size: quantitative
(d) academic major: qualitative
(e) sexual preference: qualitative
(f) IQ score: quantitative
(g) net worth (dollars): quantitative
(h) third-place finish: ranked
(i) gender: qualitative
(j) temperature: quantitative

• Relative standing of ranked data that reflects differences in degree
based on order

Levels of Measurement

TYPES OF VARIABLES
• Variable vs. Constant
• A variable is a characteristic or property that can take on different values.
• A constant is a characteristic or property that can take on only one value.
• Discrete and Continuous Variables
• A discrete variable consists of isolated numbers separated by gaps. Examples include
most counts, such as the number of children in a family (1, 2, 3, etc., but never 1/2).
• Discrete variables represent counts.
• A continuous variable consists of numbers whose values, at least in theory, have no
restrictions.
• Whenever values are rounded off, as is always the case with actual values for
continuous variables, the resulting numbers are approximate, never exact.
• Continuous variables represent measurable amounts (e.g. water volume or
weight). For example, there are infinitely many values between 1 and 2.
• Approximate Numbers:
• Numbers that are rounded off, as is always the case with values for
continuous variables
• Example: 140.01438 → 140

Indicate whether the following quantitative observations
are discrete or continuous.

(a) number of students in your class: discrete
(b) cooking time for pasta: continuous
(c) IQ: continuous
(d) age: continuous
(e) population of your hometown: discrete
(f) speed of a jetliner: continuous

Independent and Dependent Variables
• Experiment: A study in which the investigator decides who receives
the special treatment.
• Independent Variable → The independent variable is the cause.
Its value is independent of other variables in your study.

• Dependent Variable → A variable that is believed to have been
influenced by the independent variable.
• The dependent variable is the effect. Its value depends on changes
in the independent variable.

Observational Studies
• A sociologist might collect paired measures of poverty level and crime rate
for each individual in some group.
• If a statistical analysis reveals that these two variables are related or
correlated, then, given some person’s poverty level, the sociologist can
better predict that person’s crime rate or vice versa.
• On the other hand, both poverty and crime might be caused by one or
some combination of more basic variables, such as inadequate education,
racial discrimination, unstable family environment, and so on.
• Such studies are often referred to as observational studies.
• An observational study focuses on detecting relationships between
variables not manipulated by the investigator, and it yields less clear-cut
conclusions about cause-effect relationships than does an experiment.
Confounding Variable
• Confounding Variable → An uncontrolled variable that
compromises the interpretation of a study

• A confounding variable is a third variable that influences both the
independent and dependent variables.

Experiment or Observational Study

Try This
• For each of the listed studies, indicate whether it is an experiment or an
observational study. If it is an experiment, identify the independent variable
and note any possible confounding variables.

(a) years of education and annual income

(b) prescribed hours of sleep deprivation and subsequent amount of REM (dream) sleep
(c) weight loss among obese males who choose to participate either in a weight-loss
program or a self-esteem enhancement program
(d) estimated study hours and subsequent test score
(e) recidivism among substance abusers assigned randomly to different rehabilitation
programs
(f) subsequent GPAs of college applicants who, as the result of a housing lottery, live either
on campus or off campus
Got it right!!!
(a) observational study
(b) experiment (independent variable: prescribed hours of sleep
deprivation)
(c) experiment (independent variable: two programs; possible
confounding variable: self-selection of program)
(d) observational study
(e) experiment (independent variable: different rehabilitation
programs)
(f) experiment (independent variable: on campus or off campus)

Describing Data with Tables and Graphs
Tables (frequency distributions)
Frequency distributions for quantitative data
Guidelines
Outliers
Relative frequency distributions
Cumulative frequency distributions
Frequency distributions for qualitative (nominal) data
Interpreting distributions constructed by others
Graphs
Graphs for quantitative data
Typical shapes
A graph for qualitative (nominal) data
Misleading graphs
Doing it yourself
Frequency distributions for quantitative data
• Frequency Distribution: A collection of observations
produced by sorting observations into classes and
showing their frequency (f) of occurrence in each class.
• When observations are sorted into classes of single values, as in
Table 2.1, the result is referred to as a frequency distribution for
ungrouped data.

[Figure: grouped and ungrouped frequency distributions]

Frequency Distribution for Grouped Data
• A frequency distribution produced whenever observations are
sorted into classes of more than one value.
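Sorting observations into classes of more than one value can be sketched as below. This is a minimal illustration, not the procedure from the slides: the function name `grouped_frequency`, its parameters, and the `weights` list are hypothetical, and it assumes whole-number data, equal class widths, and a starting boundary no larger than the smallest observation.

```python
from collections import Counter

def grouped_frequency(observations, lowest, width):
    """Count whole-number observations in equal-width classes starting at `lowest`."""
    # Class index 0 covers lowest..lowest+width-1, index 1 the next class, and so on.
    counts = Counter((x - lowest) // width for x in observations)
    table = []
    for k in range(max(counts) + 1):
        lower = lowest + k * width
        upper = lower + width - 1  # tabled upper boundary for whole-number data
        table.append((f"{lower}-{upper}", counts.get(k, 0)))  # keep empty classes
    return table

# Hypothetical weights, invented for illustration only.
weights = [152, 165, 148, 171, 158, 163, 190, 155, 169, 160]
table = grouped_frequency(weights, lowest=140, width=10)

for label, f in table:
    print(f"{label}: {f}")
```

Classes with zero frequency (here 180-189) are still listed, so gaps in the data remain visible in the table.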

GUIDELINES
• The “Guidelines for Frequency Distributions” box lists
seven rules for producing a well-constructed frequency
distribution.
• The first three rules are essential and should not be
violated.
• The last four rules are optional and can be modified or
ignored as circumstances warrant
• The IQ scores for a group of 35 high
school dropouts are as follows:

• (a) Construct a frequency distribution for grouped data.
• (b) Specify the real limits for the lowest class interval in this frequency
distribution.
Answer: (b) 64.5–69.5
Real Limits for frequency distribution
• Gaps cannot be ignored when you are determining the actual width of any class
interval.
• The real limits are located at the midpoint of the gap between adjacent tabled
boundaries; that is, one-half of one unit of measurement below the lower tabled
boundary and one-half of one unit of measurement above the upper tabled boundary.

• For example, the real limits for 140–149 in the table are 139.5 (140 minus one-
half of the unit of measurement of 1) and 149.5 (149 plus one-half of the unit of
measurement of 1), and the actual width of the class interval would be 10
(from 149.5 − 139.5 = 10).

• If weights had been reported to the nearest tenth of a pound, the real limits for
140.0–149.9 would be 139.95 (140.0 minus one-half of the unit of measurement of
.1) and 149.95 (149.9 plus one-half of one unit of measurement of .1), and the actual
width of the class interval still would be 10 (from 149.95 − 139.95 = 10).
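The real-limit rule above is simple arithmetic, and a short sketch can confirm the two worked examples. The helper name `real_limits` is hypothetical, and `Decimal` is used here only so that values such as 139.95 come out exactly rather than with floating-point noise.

```python
from decimal import Decimal

def real_limits(lower, upper, unit):
    """Return (real lower limit, real upper limit, actual class width).

    `lower` and `upper` are the tabled boundaries and `unit` is the unit of
    measurement; all three are passed as strings so Decimal stays exact.
    """
    half_unit = Decimal(unit) / 2
    real_lower = Decimal(lower) - half_unit
    real_upper = Decimal(upper) + half_unit
    return real_lower, real_upper, real_upper - real_lower

print(real_limits("140", "149", "1"))        # weights to the nearest pound
print(real_limits("140.0", "149.9", "0.1"))  # weights to the nearest tenth of a pound
```

In both cases the actual width is 10, even though the tabled boundaries look only 9 (or 9.9) units apart.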
OUTLIERS
• An outlier is a very extreme score.
• Check for accuracy → Was 3.06 recorded as 0.06? If the score passes the
accuracy check, treat it as a legitimate score.
• Might exclude from summaries (upper/lower bound)
• Might enhance understanding (e.g., crime rates or learning rates differ)
Identify any outliers in each of the following sets of data
collected from nine college students.

Answer: Outliers are a summer income of $25,700; an age of 61; and a family
size of 18. No outliers for GPA.
Relative Frequency Distributions
• Relative frequency distributions show the frequency of
each class as a part or fraction of the total frequency for
the entire distribution.
• Consider the weight data of students (53 observations in total).
• The 160–169 class holds about one-fourth of all observations:
12/53 ≈ 0.23, or 23%.
• Equivalently, each observation contributes 100/53 ≈ 1.8868 percentage points,
so 12 × 1.8868 ≈ 22.64%, which rounds to 23%.

Constructing Relative Frequency Distributions
• To convert a frequency distribution into a
relative frequency distribution, divide the
frequency for each class by the total frequency
for the entire distribution.
• The relative frequencies of the weight data are obtained this way.

• Proportion:
• The conversion to proportions is
straightforward. For instance, to obtain the
proportion of .06 for the class 130–139, divide
the frequency of 3 for that class by the total
frequency of 53.
• 3/53 ≈ 0.057 (5.7%), which rounds to .06 (6%).

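The conversion from frequencies to proportions and percentages can be checked with a few lines of Python. The function name `relative_frequency` is hypothetical; the numbers are the two weight-data classes quoted in the text (frequencies of 12 and 3 out of a total of 53).

```python
def relative_frequency(frequency, total):
    """Return (proportion rounded to two decimals, percentage rounded to a whole number)."""
    proportion = frequency / total
    return round(proportion, 2), round(100 * proportion)

print(relative_frequency(12, 53))  # class with a frequency of 12
print(relative_frequency(3, 53))   # class 130-139, frequency of 3
```

Rounding to two decimal places is what turns 3/53 ≈ 0.0566 into the tabled proportion of .06.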
Percentages or Proportions?
• Both can be used.
• Percentages are usually reported with less decimal precision than proportions.
• A proportion always varies between 0 and 1, whereas a percentage
always varies between 0 percent and 100 percent.

Skill Test
• GRE scores for a group of graduate school applicants are
distributed as follows.
• Convert to a relative frequency distribution. When calculating
proportions, round numbers to two digits to the right of the
decimal point, using the rounding procedure.

CUMULATIVE FREQUENCY DISTRIBUTIONS
• Cumulative frequency distributions show the total number of
observations in each class and in all lower-ranked classes.
• This type of distribution can be used effectively with sets of scores,
such as test scores for intellectual or academic aptitude,
where the relative standing within the distribution assumes
primary importance.
• Under these circumstances, cumulative frequencies are usually
converted, in turn, to cumulative percentages.
• Cumulative percentages are often referred to as percentile ranks.

Constructing Cumulative Frequency Distributions
• To convert a frequency distribution into a cumulative frequency
distribution, add to the frequency of each class the sum of the
frequencies of all classes ranked below it.
• Cumulative percentages for the weight data (total of 53 observations), for example:
53/53 = 100%
52/53 = 98%
49/53 = 92%

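Running totals are exactly what `itertools.accumulate` computes, so the construction can be sketched as follows. The function names and the `freqs` list are hypothetical (the full weight table is not reproduced in these notes); only the final check reuses the cumulative frequencies 49, 52, and 53 out of 53 quoted above.

```python
from itertools import accumulate

def cumulative_frequencies(freqs):
    """Running totals of class frequencies, lowest-ranked class first."""
    return list(accumulate(freqs))

def cumulative_percents(freqs):
    """Cumulative frequencies expressed as whole-number percentages of the total."""
    total = sum(freqs)
    return [round(100 * c / total) for c in accumulate(freqs)]

# Hypothetical class frequencies, lowest class first (invented for illustration).
freqs = [2, 5, 8, 4, 1]
print(cumulative_frequencies(freqs))
print(cumulative_percents(freqs))

# Check against the figures quoted for the weight data (total of 53).
weight_check = [round(100 * c / 53) for c in (49, 52, 53)]
print(weight_check)
```

The last cumulative frequency always equals the total, so the last cumulative percentage is always 100.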
Cumulative Percentages
• As has been suggested, if relative standing within a distribution is
particularly important, then cumulative frequencies are converted to
cumulative percentages.

• 75 percent of all weights are the same as or lighter than the weights
between 170 and 179 lbs

Frequency Distributions for Qualitative data (Nominal)
• Frequency distributions for qualitative data are easy to
construct.
• Simply determine the frequency with which observations
occupy each class, and report these frequencies.

• Ordered Qualitative Data:

• For nominal data, the order in which Yes and No are listed is arbitrary.
• However, when the qualitative data have an ordinal level of
measurement because observations can be ordered from least
to most, that order should be preserved in the table
(for example, 1. No and 2. Yes).

Relative and Cumulative Distributions for Qualitative
Data
• Frequency distributions for qualitative variables can always be
converted into relative frequency distributions.

• For example, a captain has an approximate percentile rank of 63
among officers, since 62.5 (or 63) is the cumulative percent for this
class.
• Movie ratings reflect ordinal measurement because they can be
ordered from most to least restrictive: NC-17, R, PG-13, PG, and G.
The ratings of some films shown recently in San Francisco are as
follows:

Answer this
(a) Construct a frequency distribution.
(b) Convert to relative frequencies, expressed as percentages.
(c) Construct a cumulative frequency distribution.
(d) Find the approximate percentile rank for those films with a PG
rating.

Solution
• Percentile rank for films with a PG rating is 55 (from 11/20
multiplied by 100).

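The four steps of the movie-rating exercise can be sketched end to end. The original list of ratings is not reproduced in these notes, so the `counts` below are hypothetical, chosen only so that the PG result agrees with the 11/20 stated in the solution; the variable names are likewise illustrative.

```python
from itertools import accumulate

# Ratings ordered from least to most restrictive, so cumulation runs upward.
order = ["G", "PG", "PG-13", "R", "NC-17"]

# Hypothetical frequencies (a): NOT the original data, which is not shown here.
counts = {"G": 3, "PG": 8, "PG-13": 3, "R": 4, "NC-17": 2}

total = sum(counts.values())
freqs = [counts[r] for r in order]

# (b) relative frequencies as percentages
relative_percent = {r: 100 * f / total for r, f in zip(order, freqs)}

# (c) cumulative frequencies
cumulative = dict(zip(order, accumulate(freqs)))

# (d) approximate percentile rank = cumulative percent for the class
percentile_rank = {r: 100 * c / total for r, c in cumulative.items()}

print(cumulative["PG"], total, percentile_rank["PG"])
```

Cumulative figures are only meaningful here because the ratings are ordinal; for purely nominal classes the running totals would depend on an arbitrary listing order.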
GRAPHS
• Data can be described clearly and concisely with the aid of a well-
constructed frequency distribution.
• Data can often be described even more vividly, particularly when
you’re attempting to communicate with a general audience, by
converting frequency distributions into graphs.

Graphs
• Common chart types: Bar Graph, Column Chart, Line Graph, Dual Axis Chart,
Area Chart, Stacked Bar Graph, Mekko Chart, Pie Chart, Scatter Plot Chart,
Bubble Chart, Waterfall Chart, Funnel Chart, Bullet Chart, Heat Map
• Graphs for quantitative data: Histogram, Frequency Polygon, Stem and Leaf Displays
• Graphs for qualitative data: Pie Chart, Bar Chart, Pictogram

Histograms
• Histogram: A bar-type graph for quantitative data. The common
boundaries between adjacent bars emphasize the continuity of the
data, as with continuous variables.

[Figure: histogram; horizontal axis: class interval; vertical axis: frequency; origin at the intersection of the X and Y axes]
• Equal units along the horizontal axis (the X axis, or abscissa)
reflect the various class intervals of the frequency
distribution.
• Equal units along the vertical axis (the Y axis, or ordinate)
reflect increases in frequency. (The units along the vertical
axis do not have to be the same width as those along the
horizontal axis.)
• The intersection of the two axes defines the origin at which
both numerical scales equal 0.
• Numerical scales always increase from left to right along the
horizontal axis and from bottom to top along the vertical axis
• The body of the histogram consists of a series of bars
whose heights reflect the frequencies for the various
classes.
Frequency Polygon (In Histogram)
• A frequency polygon can be drawn over a histogram or as a line
graph on its own.
• Frequency polygons may be constructed directly from frequency
distributions.

• Frequency Polygon → A line graph for quantitative data that
also emphasizes the continuity of continuous variables.

Process of making Frequency polygon

A. This panel shows the histogram for the weight distribution.

B. Place dots at the midpoints of each bar top or, in the absence of bar
tops, at midpoints for classes on the horizontal axis, and connect them
with straight lines.
[To find the midpoint of any class, such as 160–169, simply add the
two tabled boundaries (160 + 169 = 329) and divide this sum by 2
(329/2 = 164.5).]

C. Anchor the frequency polygon to the horizontal axis. First, extend the
upper tail to the midpoint of the first unoccupied class (250–259) on the
upper flank of the histogram. Then extend the lower tail to the midpoint of
the first unoccupied class (120–129) on the lower flank of the histogram. Now
all of the area under the frequency polygon is enclosed completely.

D. Finally, erase all of the histogram bars, leaving only the frequency
polygon. Frequency polygons are particularly useful when two or more
frequency distributions or relative frequency distributions are to be
included in the same graph.
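The points of a frequency polygon (class midpoints plus the two zero-frequency anchors) can be computed as in this sketch. The helper names and the three `classes` are hypothetical; the code assumes equal-width classes listed from lowest to highest, with the classes just beyond each end unoccupied.

```python
def midpoint(lower, upper):
    """Midpoint of a class from its two tabled boundaries."""
    return (lower + upper) / 2

def polygon_points(classes):
    """classes: list of (lower, upper, frequency), lowest class first, equal widths.

    Returns (x, y) points, anchored to the horizontal axis at the midpoints
    of the first unoccupied class below and above the data.
    """
    width = classes[1][0] - classes[0][0]
    first_lower, first_upper, _ = classes[0]
    last_lower, last_upper, _ = classes[-1]
    points = [(midpoint(first_lower - width, first_upper - width), 0)]
    points += [(midpoint(lo, hi), f) for lo, hi, f in classes]
    points.append((midpoint(last_lower + width, last_upper + width), 0))
    return points

# Hypothetical classes and frequencies, invented for illustration only.
classes = [(150, 159, 4), (160, 169, 7), (170, 179, 2)]
points = polygon_points(classes)
print(points)
print(midpoint(160, 169))  # the worked example: (160 + 169) / 2
```

Connecting these points in order with straight lines, for instance with a plotting library, gives the polygon described in steps B through D.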
Skill Test
• The following frequency distribution shows the annual incomes in dollars
for a group of college graduates.

• (a) Construct a histogram.
• (b) Construct a frequency polygon.
• (c) Is this distribution balanced or lopsided?

Stem and Leaf Displays
• Another technique for summarizing quantitative data is a stem and
leaf display.

• Stem and Leaf Display: A device for sorting quantitative data on the basis
of leading and trailing digits.

• Stem and leaf displays are ideal for summarizing distributions, such
as that for weight data, without destroying the identities of
individual observations.

Constructing a Display (STEM)

• Selection of Stems: Stem values are not limited to units of 10. Depending on the data, you might
identify the stem with one or more leading digits that culminates in some variation on a stem
Skill Test
• Selection of Stems:
• annual income of $23,784: stem 23
• SAT test score of 689: stem 68
• cutoff score of 189: stem 18
• cutoff score of 189.88: stem 189

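A stem and leaf display can be sketched with a dictionary that maps each stem to its leaves. The function `stem_and_leaf`, its `leaf_unit` parameter, and the `weights` list are hypothetical; the sketch handles whole numbers only and drops any digits below the leaf position.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Group whole numbers by stem (leading digits) with single-digit leaves.

    `leaf_unit` is the place value of the leaf: 1 makes the units digit the
    leaf, 100 makes the hundreds digit the leaf (lower digits are dropped).
    """
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v // leaf_unit, 10)
        display[stem].append(leaf)
    return dict(display)

# Hypothetical weights, invented for illustration only.
weights = [152, 165, 148, 171, 158, 163]
display = stem_and_leaf(weights)
for stem, leaves in display.items():
    print(f"{stem} | {''.join(map(str, leaves))}")

# Stems from the skill test: 689 -> 68, 189 -> 18, 23,784 -> 23.
print(stem_and_leaf([689]), stem_and_leaf([189]), stem_and_leaf([23784], leaf_unit=100))
```

Because every leaf is kept, the individual observations can still be read back from the display, unlike a grouped frequency table.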
Typical Shapes - Histogram

[Figure: four typical shapes: normal, bimodal, positively skewed, negatively skewed]

Normal

Bimodal
• A bimodal distribution reflects the coexistence of two different types of
observations in the same distribution.
• For example, the distribution of the ages of residents in a
neighborhood consisting largely of either new parents or their infants
has a bimodal shape.

Positively Skewed
• A distribution that includes a few extreme observations in the
positive direction (to the right of the majority of observations)
• The distribution of incomes among U.S. families has a positive skew,
with most family incomes under $200,000 and relatively few family
incomes spanning a wide range of values above $200,000.

Negatively Skewed

[Figure: distribution of retirement age]

Solve this!!!
Describe the probable shape (normal, bimodal, positively skewed, or negatively
skewed) for each of the following distributions:
(a) female beauty contestants’ scores on a masculinity test, with a higher
score indicating a greater degree of masculinity
(b) scores on a standardized IQ test for a group of people selected from the
general population
(c) test scores for a group of high school students on a very difficult college-
level math exam
(d) reading achievement scores for a third-grade class consisting of about
equal numbers of regular students and learning-challenged students
(e) scores of students at the Eastman School of Music on a test of music
aptitude (designed for use with the general population)

Answers
(a) Positively skewed
(b) Normal
(c) Positively skewed
(d) Bimodal
(e) Negatively skewed
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
• A bar graph is used for qualitative data (example: Facebook replies).
• In a bar graph, the vertical axis (Y axis) shows the frequency
of each class.
• The horizontal axis (X axis) shows the classes.
• A bar graph can also be used with quantitative data
to emphasize the discontinuous nature of a discrete variable.
• Example: number of children in a family
MISLEADING GRAPHS
• Graphs can be constructed in an unscrupulous manner to support a
particular point of view.
• Indeed, this type of statistical fraud gives credibility to popular
sayings, including “Numbers don’t lie, but statisticians do”
• and “There are three kinds of lies: lies, damned lies, and statistics.”

• For example, to imply that comparatively many students responded
Yes to the Facebook profile question, an unscrupulous person might
resort to the various tricks shown:

• The width of the Yes bar is more than three times that of the No bar,
thus violating the custom that bars be equal in width.

• The lower end of the frequency scale is omitted, thus violating the
custom that the entire scale be reproduced, beginning with zero.

• The height of the vertical axis is several times the width of the
horizontal axis.

• Beware of graphs in which the vertical axis is many times smaller
than the horizontal axis and frequency differences are suppressed.
Describing Data with Averages
• Describing Data with Averages - Describing Variability - Normal
Distributions and Standard (z) Scores
• Topics: Mode, Median, Mean, Which Average?, Averages for Qualitative and
Ranked Data
Describing Data with Averages
• Averages are measures of central tendency: numbers or words that attempt
to describe, most generally, the middle or typical value for a distribution.

• The three measures of central tendency are the mode, median, and mean.

• Each of these has its special uses, but the mean is the most important
average in both descriptive and inferential statistics.

MEAN

MEDIAN
• The median reflects the middle value when observations are ordered
from least to most.
• Example: the scores 23, 45, 34 are ordered as 23, 34, 45, so the median is 34.

MODE
• The mode reflects the value of the most frequently
occurring score
• Table shows the number of years served by 20 recent
U.S. presidents, beginning with Benjamin Harrison (4
years) and ending with Bill Clinton (8 years).
• Mode = 4
• Ordered Values

Check This!!!
• Determine the mode for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63,
60, 65, 63.

• The owner of a new car conducts six gas mileage tests and obtains the following
results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9. Find the mode
for these data.

Answers
• Retirement ages: mode = 63
• Gas mileage: mode = 27.4
• Retirement ages: median = 63
• Gas mileage: median = 27.15 (halfway between 26.9 and 27.4)
• Retirement ages: mean = 672/11 = 61.09
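The answers for the retirement ages and gas mileage data can be verified with Python's standard `statistics` module. The variable names are illustrative; the data are the two lists given in the exercise.

```python
import statistics

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

age_mode = statistics.mode(retirement_ages)            # most frequent age
age_median = statistics.median(retirement_ages)        # middle of 11 ordered ages
age_mean = round(statistics.mean(retirement_ages), 2)  # 672 / 11, rounded

mileage_mode = statistics.mode(mileage)
mileage_median = round(statistics.median(mileage), 2)  # average of the two middle values

print(age_mode, age_median, age_mean)
print(mileage_mode, mileage_median)
```

With an even number of scores, `statistics.median` averages the two middlemost values, which matches the "halfway between 26.9 and 27.4" rule.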
Finding Median
1 Order scores from least to most.
2 Find the middle position by adding one to the total number of scores and
dividing by 2.
3 If the middle position is a whole number, as in the left-hand panel below, use this
number to count into the set of ordered scores.
4 The value of the median equals the value of the score located at the middle
position.
5 If the middle position is not a whole number, as in the right-hand panel below, use
the two nearest whole numbers to count into the set of ordered scores.
6 The value of the median equals the value midway between those of the two
middlemost scores; to find the midway value, add the two given values and divide
by 2.
Example
• Set of five scores: 2, 8, 2, 7, 6
Step 1: 2, 2, 6, 7, 8
Step 2: (5 + 1)/2 = 3
Step 3: count to the 3rd position in 2, 2, 6, 7, 8
Step 4: median = 6

• Set of six scores: 3, 8, 9, 3, 1, 8
Step 1: 1, 3, 3, 8, 8, 9
Step 2: (6 + 1)/2 = 3.5
Step 5: count to the 3rd and 4th positions in 1, 3, 3, 8, 8, 9
Step 6: median = (3 + 8)/2 = 5.5

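The six-step procedure translates almost line for line into code. This is a sketch that follows the listed steps rather than a library routine; the function name `median_by_steps` is hypothetical, and it assumes a non-empty list of numbers.

```python
def median_by_steps(scores):
    """Median via the middle-position rule; assumes `scores` is non-empty."""
    ordered = sorted(scores)                # step 1: order from least to most
    position = (len(ordered) + 1) / 2       # step 2: middle position
    if position.is_integer():               # steps 3-4: whole-number position
        return ordered[int(position) - 1]   # positions are 1-based, indexes 0-based
    lower = int(position)                   # steps 5-6: two middlemost scores
    return (ordered[lower - 1] + ordered[lower]) / 2

print(median_by_steps([2, 8, 2, 7, 6]))     # five scores
print(median_by_steps([3, 8, 9, 3, 1, 8]))  # six scores
```

The only subtlety is the shift between the 1-based positions used in the steps and Python's 0-based list indexes.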
More Than One Mode
• Distributions can have more than one mode (or no mode at all).
• A distribution with two obvious peaks is bimodal.
• Distributions with more than two peaks are referred to as
multimodal.
• The presence of more than one mode might reflect important
differences among subsets of data.

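Python's `statistics.multimode` (available from Python 3.8) returns every value tied for the highest frequency, which makes it a convenient way to spot more than one mode. The three small lists below are invented for illustration.

```python
import statistics

# Hypothetical score sets, invented for illustration only.
one_mode = statistics.multimode([4, 8, 4, 2, 4, 8])  # a single most frequent value
two_modes = statistics.multimode([1, 1, 2, 3, 3])    # two values tie: bimodal
all_tied = statistics.multimode([5, 6, 7])           # every value occurs once

print(one_mode, two_modes, all_tied)
```

When every value occurs equally often, `multimode` returns all of them; in the terms used above, such a set is better described as having no mode at all.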
Sample Vs Population
• Population: A complete set of scores
• Sample: A subset of scores
• Formula for Sample Mean: X̄ = ΣX / n

• Sample Mean (X̄)
• The balance point for a sample, found by dividing the sum for the
values of all scores in the sample by the number of scores in the
sample.
• Sample Size (n): The total number of scores in the sample.

Formula for Population Mean
• μ = ΣX / N

• Population Mean (μ): The balance point for a population, found by
dividing the sum for all scores in the population by the number of
scores in the population.
• Population Size (N): The total number of scores in the population

• The population mean usually is unknown but fixed as a constant, while
the sample mean is known but varies from sample to sample.
Mean Serves as the balance point for its
frequency distribution
• The sum of all scores, expressed as positive and negative deviations
from the mean, always equals zero.
• Mean = 5.60

Score  Deviation (score − mean)    Score  Deviation (score − mean)
4      −1.6                        8      2.4
4      −1.6                        8      2.4
4      −1.6                        2      −3.6
8      2.4                         6      0.4
4      −1.6                        5      −0.6
8      2.4                         3      −2.6
2      −3.6                        4      −1.6
6      0.4                         8      2.4
4      −1.6                        4      −1.6
12     6.4                         8      2.4
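The balance-point property can be checked numerically with the 20 scores from the table above. In this sketch each deviation is taken as score minus mean; the variable names are illustrative.

```python
# The 20 scores from the table (left-hand column first, then right-hand column).
scores = [4, 4, 4, 8, 4, 8, 2, 6, 4, 12,
          8, 8, 2, 6, 5, 3, 4, 8, 4, 8]

mean = sum(scores) / len(scores)         # 112 / 20
deviations = [x - mean for x in scores]  # positive above the mean, negative below
total_deviation = sum(deviations)

print(mean)
print(round(total_deviation, 10))  # zero, up to floating-point rounding
```

The positive deviations (scores above the mean) exactly offset the negative ones, which is why the mean acts as the balance point.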
Average ?
• If Distribution Is Not Skewed :
• When a distribution of scores is not too skewed, the
values of the mode, median, and mean are similar,
and any of them can be used to describe the central
tendency of the distribution.
• Example (figure not reproduced): mean = 30; median (score at the 10th position) = 30

Interpreting Differences between Mean and Median

• Positively skewed: the mean exceeds the median.
• Negatively skewed: the median exceeds the mean.
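The comparison between mean and median can be turned into a rough check for the direction of skew. This is only a heuristic sketch: the function `skew_direction` and the three small data sets are hypothetical, and a real analysis would also look at a graph of the distribution.

```python
import statistics

def skew_direction(scores):
    """Compare mean and median as a rough indicator of skew direction."""
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    if mean > median:
        return "positively skewed"
    if mean < median:
        return "negatively skewed"
    return "not skewed (mean equals median)"

# Hypothetical data sets, invented for illustration only.
incomes = [20, 25, 30, 35, 40, 200]  # one extreme high value
ages = [30, 62, 64, 65, 66, 67]      # one extreme low value
balanced = [1, 2, 3, 4, 5]

print(skew_direction(incomes))
print(skew_direction(ages))
print(skew_direction(balanced))
```

A few extreme scores pull the mean toward them while leaving the median almost unchanged, which is what the comparison picks up.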

No ratings yet
A Project Report ON 'The Impact of Training and Development On Employees Performance in An Organization'' Equitas Small Finance Bank
57 pages
LESSON PLAN in Trends1
No ratings yet
LESSON PLAN in Trends1
2 pages
Gujarat Technological University: Page 1 of 3
No ratings yet
Gujarat Technological University: Page 1 of 3
3 pages
The B Level Speaking Test
No ratings yet
The B Level Speaking Test
31 pages
Lean Six Sigma Green Belt Exam Topics: Section 1: Define & Matching ( 15 Questions)
No ratings yet
Lean Six Sigma Green Belt Exam Topics: Section 1: Define & Matching ( 15 Questions)
2 pages
Research in Daily Life 2 (HRE 121) - LAP 5 - Week 9-10
No ratings yet
Research in Daily Life 2 (HRE 121) - LAP 5 - Week 9-10
21 pages
SPLISS An International Comparitive Study Sport Po
No ratings yet
SPLISS An International Comparitive Study Sport Po
27 pages
Information Security Objectives
No ratings yet
Information Security Objectives
20 pages
CT 2021-2022 (34) QTM BT
No ratings yet
CT 2021-2022 (34) QTM BT
3 pages
WWWW2
No ratings yet
WWWW2
1 page
Is Ricardo Still Relevant? An Empirical Re-Examination of Ricardian Trade Theory
No ratings yet
Is Ricardo Still Relevant? An Empirical Re-Examination of Ricardian Trade Theory
22 pages
Mixed Methods: Integrating Qualitative and Quantitative Research
No ratings yet
Mixed Methods: Integrating Qualitative and Quantitative Research
11 pages
Research Methodology
100% (1)
Research Methodology
63 pages
WT Lecture 8
No ratings yet
WT Lecture 8
15 pages
Summative Test
63% (8)
Summative Test
7 pages
Cohen Chap 7 T Test For Independent Sample Means (Screen)
No ratings yet
Cohen Chap 7 T Test For Independent Sample Means (Screen)
20 pages
NCSC BP
No ratings yet
NCSC BP
25 pages
Sources
No ratings yet
Sources
13 pages
PR1 Methodology Group 3
No ratings yet
PR1 Methodology Group 3
3 pages
Che 231 Lesson Plan
No ratings yet
Che 231 Lesson Plan
2 pages
Unit 1 The Nature and Context of Social Research
No ratings yet
Unit 1 The Nature and Context of Social Research
48 pages
Use of Social Media by College Students Relationsh
No ratings yet
Use of Social Media by College Students Relationsh
14 pages
Sullivan Et Al 2018 - Should Multiple Imputation Be The Method of Choice For Handling Missing Data in Randomized Trials
No ratings yet
Sullivan Et Al 2018 - Should Multiple Imputation Be The Method of Choice For Handling Missing Data in Randomized Trials
17 pages
Module_11_GCLP_EXTERNAL_QUALITY_ASSESSMENT
No ratings yet
Module_11_GCLP_EXTERNAL_QUALITY_ASSESSMENT
17 pages


