EDA 1

The document provides an introduction to data, its types, and classifications, including measurement scales such as nominal, ordinal, interval, and ratio. It emphasizes the importance of data in decision-making, strategy formulation, and process improvement, while also discussing various sources of data and its applications in analytics. The document further explores the properties of data and the operations applicable to different data types.

Uploaded by

sudeep shah

Quote

► The simple things are also the most extraordinary things, and only the wise can see them.
► Be minute to everything around you. The world is a great teacher!
► PAULO COELHO, Brazilian author


UNIT I
Introduction to Data and its types
Contents

► Definition and importance of data


► Classification of data :
❖ based on measurement – ratio, interval, ordinal and nominal
❖ based on observation – cross-sectional, time series and panel data
❖ based on availability – primary, secondary, tertiary
❖ based on structural form – structured, semi structured and unstructured
❖ based on inherent nature – quantitative and qualitative
► concepts on sample data and population
► small sample and large sample
► statistic and parameter
► types of statistics and its application in different business scenarios
► frequency distribution of data.
Introduction to data

► Example:
40, 95, …, Jain University, 18BTRCR75, [email protected]
Anything else?

► Data vs. Information


100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0

Is there any information?

How large is your data?
► What is the maximum file size you have dealt with so far?
► Movies/files/streaming video that you have used?
► What is the maximum download speed you get?
► To retrieve data stored in distant locations?
► How fast is your computation?
► How much time does it take just to transfer the data from you, process it and get the result?
Sources of data
► Social media and networks (All of us are generating data)
► Scientific instruments (Collecting all sorts of data)
► Mobile devices (Tracking all objects all the time)
► Sensor technology and networks (Measuring all kinds of data)
Sources of data
► “Every day, we create 2.5 quintillion bytes of data.” (Source: Social Media Today)
► By 2025, there will be 175 zettabytes of data in the global datasphere. (Seagate UK)
► By 2025, the amount of data generated each day will reach 463 exabytes globally. (World Economic Forum, High Scalability)
► The data come from several sources:
❖ sensors used to gather climate information
❖ posts to social media sites
❖ digital pictures and videos
❖ purchase transaction records
❖ cell phone GPS signals
etc., to name a few!
Data

► Data – a collection of facts (numbers, words, measurements, observations, etc.) that has been translated into a form that computers can process
► Wikipedia: Data is a set of values of qualitative or quantitative
variables. Pieces of data are individual pieces of information.
► Computer Hope: In general, data is any set of characters that has
been gathered and translated for some purpose, usually analysis. It
can be any character, including text and numbers, pictures, sound,
or video. If data is not put into context, it doesn't do anything to a
human or computer.
► Cambridge dictionary: data – noun
information, especially facts or numbers, collected to be examined and
considered and used to help decision-making, or information in an
electronic form that can be stored and used by a computer
Importance of Data
► Create personalized services (recommendations)
► Drive strategy (decisions based on assumptions about
customers' behavior)
► Measure impact (success rates associated with service
offerings).
► Improve products and processes (user feedback through
surveys or questionnaires).
► Protect safety & property (machine learning algorithms that
detect fraud).
► Ensure compliance with regulations.
► It helps you make better decisions.
Importance of Data

Training machines with data: Machine Learning
Sensing the physical world in real-time: IoT devices
Digitizing physical matter: 3D printing
Humans augmented with data: Smart watches, ear buds, head-mounted devices
Data helps you make better decisions
Data helps you solve problems
Data helps you see performance
Data helps you improve processes
► Describing is the use of data to set out what has
happened – e.g. a website received 1,000 visitors in the
first week of January this year
► Diagnosing is the use of data to explain why something
happened – e.g. 1,000 visitors came to the website on
that week because an email was sent to 10,000 customers
► Predicting is the use of data to define what will
happen – e.g. if we send an email to 20,000 customers
we will get 2,000 visitors to the website
► Prescribing is the use of data to define what will be
done – e.g. we will send a weekly email
Data in Data Analytics

► Entity: A particular thing is called an entity or object.
► Attribute: An attribute is a measurable or observable property of an entity.
► Data: A measurement of an attribute is called data.
► Note
❖ Data defines an entity.
❖ A computer can manage all types of data (e.g., audio, video, text, etc.).
Classification of data
based on measurement

based on observation

based on availability

based on structural form

based on inherent nature


Based on Measurement
⚫ Measurement of some attribute of a set of things is the process of assigning
numbers or other symbols to the things in such a way that relationships of
the numbers or symbols reflect relationships of the attribute being measured.
⚫ A particular way of assigning numbers or symbols to measure something is
called a scale of measurement.
⚫ In general, there are many types of data that can be used to measure the properties of an entity.
⚫ A good understanding of data scales (also called scales of measurement) is important.
⚫ Depending on the scale of measurement, different techniques are followed to derive unknown knowledge in the form of patterns, associations, anomalies or similarities from a volume of data.
NOIR
Classification of data based on scales of measurement
NOIR classification
⚫ The most commonly recommended scales of measurement are
N: Nominal
O: Ordinal
I: Interval
R: Ratio

The NOIR scale is the fundamental building block on which the extended data types are built.
NOIR Classification
⚫ Nominal: Binary (Symmetric, Asymmetric), Ternary, Others
⚫ Ordinal: Alphabetically ordered, Numerically ordered, Literally ordered
⚫ Interval and Ratio: Discrete, Continuous
⚫ Nominal and Ordinal are Categorical (Qualitative); Interval and Ratio are Numeric (Quantitative).



Properties of data
⚫ The following FOUR properties (operations) of data are pertinent.

# | Property | Operation | Type
1. | Distinctiveness | = and ≠ | Categorical (Qualitative)
2. | Order | <, ≤, >, ≥ | Categorical (Qualitative)
3. | Addition | + and - | Numerical (Quantitative)
4. | Multiplication | * and / | Numerical (Quantitative)
NOIR summary
✔ Nominal (with distinctiveness property only)

✔ Ordinal (with distinctiveness and order properties only)

✔ Interval (with additive property + properties of Ordinal data)

✔ Ratio (with multiplicative property + properties of Interval data)

⚫ Further, nominal and ordinal data are collectively referred to as categorical or qualitative data, whereas interval and ratio data are collectively referred to as quantitative or numeric data.
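
The cumulative nature of the NOIR properties can be sketched in a few lines of Python. This is only an illustration of the summary above; the names PROPERTIES, SCALES, properties_of and is_quantitative are hypothetical and not part of the original slides.

```python
# Each NOIR scale keeps every property of the scale before it
# and adds exactly one more (hypothetical helper names).
PROPERTIES = ["distinctiveness", "order", "addition", "multiplication"]
SCALES = ["nominal", "ordinal", "interval", "ratio"]


def properties_of(scale: str) -> list[str]:
    """Return the properties that hold for the given scale."""
    position = SCALES.index(scale)
    return PROPERTIES[: position + 1]


def is_quantitative(scale: str) -> bool:
    """Interval and ratio data are numeric; nominal and ordinal are categorical."""
    return scale in ("interval", "ratio")


for scale in SCALES:
    kind = "quantitative" if is_quantitative(scale) else "categorical"
    print(scale, kind, properties_of(scale))
```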

Nominal scale
⚫ Definition
A variable that takes a value among a set of mutually exclusive codes that have no logical order is
known as a nominal variable.

⚫ Examples
Gender: letters or numbers, { M, F } or { 1, 0 }
Blood groups: strings, { A, B, AB, O }
Rhesus (Rh) factors: symbols, { +, - }
Country code: ??

Nominal scale

⚫ The nominal scale is used to label data categories using a consistent naming convention.
⚫ The labels can be numbers, letters, strings, enumerated constants or other keyboard symbols.
⚫ Nominal data thus divides a set of data into “categories”.
⚫ The number of categories should be two (binary) or more (ternary, etc.), but finite.

Nominal scale
Note
⚫ Nominal data may be numerical in form, but the numerical values have no mathematical interpretation.
⚫ For example, prisoners may be numbered 100, 101, …, 110, but 100 + 110 = 210 is meaningless. The numbers are simply labels.
⚫ Two labels may be identical ( = ) or dissimilar ( ≠ ).
⚫ These labels do not have any ordering among themselves.
⚫ For example, we cannot say blood group B is better or worse than group
A.
⚫ Labels (from two different attributes) can be combined to give another
nominal variable.
⚫ For example, blood group with Rh factor ( A+ , A- , AB+, etc.)

Binary scale
⚫ Definition
A nominal variable with exactly two mutually exclusive categories that have no logical order is known as a binary or “dichotomous” variable.
⚫ Examples
Switch: {ON, OFF}
Attendance: {True, False}
Entry: {Yes, No}
etc.
Note
⚫ A binary variable is a special case of a nominal variable that takes only two possible values.

Symmetric and Asymmetric Binary Scale
⚫ Different binary variables may have unequal importance.
⚫ If the two choices of a binary variable have equal importance, then it is called a symmetric binary variable.
⚫ Example: Gender = {male, female}
// usually of equal probability.
⚫ If the two choices of a binary variable have unequal importance, it is called an asymmetric binary variable.
⚫ Example: Food preference = {V, NV}

Operations on Nominal variables
⚫ Summary statistics applicable to nominal data are mode, contingency correlation, etc.
⚫ Arithmetic (+, -, * and /) and ordering operations (<, >, ≤, ≥) are not permitted; only equality tests (= and ≠) are meaningful.
⚫ The allowed operations are: accessing (read, check, etc.) and re-coding (into another non-overlapping symbol set, that is, one-to-one mapping), etc.
⚫ Nominal data can be visualized using bar charts or pie charts, etc.
⚫ Two or more nominal variables can be combined to generate another nominal variable.
⚫ Example: Gender (M,F) × Marital status (S, M, D, W)
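
A minimal sketch of these operations in Python, assuming the combined label is written as "gender-status" (the separator and the list of observations are invented for illustration). It builds the combined nominal variable as a Cartesian product and computes the mode, the one summary statistic named above that needs only equality tests.

```python
from collections import Counter
from itertools import product

gender = ["M", "F"]
marital_status = ["S", "M", "D", "W"]

# Combining two nominal variables gives another nominal variable
# with 2 x 4 = 8 possible labels.
combined_categories = [f"{g}-{m}" for g, m in product(gender, marital_status)]

# Hypothetical observations of the combined variable.
observations = ["F-S", "M-M", "F-S", "M-D", "F-W"]

# The mode only requires counting equal labels; no ordering or arithmetic is used.
mode_label, mode_count = Counter(observations).most_common(1)[0]

print(combined_categories)
print(mode_label, mode_count)
```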

Nominal Examples
⚫ Gender: Male, Female, Other.
⚫ Hair Color: Brown, Black, Blonde, Red, Other.
⚫ Type of living accommodation: House, Apartment,
Trailer, Other.
⚫ Religious preference: Buddhist, Mormon, Muslim, Jewish,
Christian, Other.

Ordinal scale
⚫ Definition
Ordered nominal data are known as ordinal data, and the variable that generates them is called an ordinal variable.
⚫ Example:
Shirt size = { S, M, L, XL, XXL}

Note
The values assumed by an ordinal variable can be ordered among
themselves as each pair of values can be compared literally or
using relational operators ( < , ≤ , > , ≥ ).

Operation on Ordinal data
⚫ Relational operators can usually be used on ordinal data.
⚫ The summary measures mode and median can be used on ordinal data.
⚫ Ordinal data can be ranked (numerically, alphabetically, etc.). Hence, we can find any of the percentile measures of ordinal data.
⚫ Calculations based on order are permitted (such as count, min, max, etc.).
⚫ Spearman’s rank correlation coefficient (Spearman’s R) can be used as a measure of the strength of association between two sets of ordinal data.
⚫ A numerical variable can be transformed into an ordinal variable and vice versa, but with a loss of information.
⚫ For example, Age [1, … 100] = [young, middle-aged, old]
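
The age example can be sketched as follows. The cut-offs (30 and 60) and the sample ages are assumptions chosen only for illustration; the slides do not specify them. The sketch also shows a median computed purely from the ordering of the labels.

```python
# Rank of each ordinal label; only the order matters, not the numbers themselves.
ORDER = {"young": 0, "middle-aged": 1, "old": 2}


def age_group(age: int) -> str:
    """Map a numeric age to an ordinal label (cut-offs are assumed)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "old"


ages = [12, 25, 37, 45, 61, 78, 90]  # hypothetical data
groups = [age_group(a) for a in ages]

# Median of ordinal data: sort by rank and take the middle element.
ranked = sorted(groups, key=ORDER.get)
median_group = ranked[len(ranked) // 2]

print(groups)
print(median_group)
```

Note that 61 and 90 both map to "old": the original ages cannot be recovered from the labels, which is the loss of information mentioned above.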

Ordinal Examples
⚫ Class ranking: 1st, 9th, 87th…
⚫ Socioeconomic status: poor, middle class, rich.
⚫ The Likert Scale: strongly disagree, disagree, neutral,
agree, strongly agree.
⚫ Level of Agreement: yes, maybe, no.
⚫ Time of Day: dawn, morning, noon, afternoon, evening,
night.
⚫ Political Orientation: left, center, right.

Interval scale
⚫ Definition
Interval-scale variables are continuous measurements on a roughly linear scale.
⚫ Example:
latitude, longitude, weather temperature (in Celsius or Fahrenheit), calendar dates, etc.

Note
⚫ Interval data have well-defined intervals between values.
⚫ Interval data are measured on a numeric scale (with +ve, 0 (zero), and –ve values).
⚫ Interval data have a zero point (origin). However, the origin does not imply a true absence of the measured characteristic.
⚫ For example, for temperature in Celsius or Fahrenheit, 0° does not mean absence of temperature, that is, no heat!

Operation on Interval data
⚫ We can add a quantity to interval data.
⚫ For example: date1 + x-days = date2
⚫ Subtraction can also be performed.
⚫ For example: current date – date of birth = age
⚫ Negation (changing the sign) and multiplication by a constant are permitted.
⚫ All operations defined on ordinal data are also valid here.
⚫ Linear (affine) transformations (e.g. cx + d) are permissible.
⚫ Other one-to-one non-linear transformations (e.g., log, exp, etc.) can also be applied.
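
The two date examples can be tried directly with Python's standard datetime module. The specific dates below and the helper name age_in_years are illustrative assumptions.

```python
from datetime import date, timedelta

# date1 + x-days = date2
date1 = date(2024, 1, 15)
date2 = date1 + timedelta(days=30)

# Subtracting two dates gives an interval (a timedelta), not another date.
gap = date2 - date1


def age_in_years(date_of_birth: date, today: date) -> int:
    """current date - date of birth = age, in completed years."""
    birthday_pending = (today.month, today.day) < (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - int(birthday_pending)


print(date2, gap.days)
print(age_in_years(date(2000, 6, 1), date(2024, 5, 31)))
```

Python deliberately leaves date + date undefined (it raises TypeError), which matches the idea that calendar dates support differences but not sums or ratios of the values themselves.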

Operation on Interval data
Note
⚫ Interval data can be transformed to the nominal or ordinal scale, but with loss of information.
⚫ Interval data can be graphed using a histogram, frequency polygon, etc.

Interval Examples
⚫ Celsius Temperature.
⚫ Fahrenheit Temperature.
⚫ IQ (intelligence scale).
⚫ SAT scores.
⚫ Time on a clock with hands.

Ratio scale
⚫ Definition
Interval data with a clear definition of “zero” are called ratio data.
⚫ Example:
Temperature on the Kelvin scale, intensity of an earthquake on the Richter scale, sound intensity in decibels, cost of an article, population of a country, etc.

Note
⚫ All ratio data are interval data but the reverse is not true.
⚫ On a ratio scale, both differences between data values and ratios of (non-zero) data pairs are meaningful.
⚫ Ratio data may be in linear or non-linear scale.
⚫ Both interval and ratio data can be stored in same data type (i.e.,
integer, float, double, etc.)
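
A small worked example may help separate the two scales. The temperatures are invented; the conversion K = °C + 273.15 is the standard one. Differences agree on both scales, but only the Kelvin ratio is physically meaningful.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scale reading (Celsius) to a ratio-scale one (Kelvin)."""
    return celsius + 273.15


t1_c, t2_c = 10.0, 20.0  # hypothetical readings in Celsius
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

# Differences are meaningful on both scales and agree.
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k

# The Celsius ratio is an artefact of the arbitrary zero:
# 20 degrees C is not "twice as hot" as 10 degrees C.
ratio_c = t2_c / t1_c
ratio_k = t2_k / t1_k

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 4))
```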

Operation on Ratio data
⚫ All arithmetic operations on interval data are applicable to ratio data.
⚫ In addition, multiplication, division, etc. are allowed.
⚫ Any linear transformation of the form (ax + b)/c is allowed.

Ratio Examples
⚫ Weight.
⚫ Height.
⚫ Sales Figures.
⚫ Ruler measurements.
⚫ Income earned in a week.
⚫ Years of education.
⚫ Number of children.
⚫ Age.*

Note
⚫ Nominal - names only
⚫ Ordinal - has an order
⚫ Interval - also has meaningful distances
⚫ Ratio - also has a meaningful 0.

Questions
1. Consider an image as an entity.
• What are the attributes you should think to represent an
image?
• Categorize each attribute according to the NOIR data
classification.
2. Give FOUR differences between data of types “interval” and
“ratio-scale” .
3. What are the different properties used to categorize the data
according to NOIR data categorization?

4. Given an entity say “STUDENT” with the following attributes.
Identify the NOIR category to which each of them belongs.

⚫ Scholarship amount
⚫ Name
⚫ RollNo
⚫ DoB
⚫ Aadhaar No.
⚫ Gender
⚫ Mobile No.
⚫ Email Id

⚫ Atherosclerosis, the thickening and hardening of internal artery walls, is one of the main causes of death for men above 35 and women above 45. One of its consequences is myocardial infarction. An artery wall is made of three layers; innermost to outermost, they are called intima, media and adventitia. Intima‐media thickness is a marker of atherosclerosis. This question is based on ultrasonography measurements made on a sample of 110 subjects. Figure 1 gives the data description.
⚫ Examine the variable types for each variable in this dataset (numerical, continuous, categorical, ordinal etc.).
⚫ Source: US National Library of Medicine, National Institutes of Health
Figure 1
⚫ Gender: categorical, nominal (binary)
⚫ Age: numerical, continuous
⚫ Height: numerical, continuous
⚫ Weight: numerical, continuous
⚫ Tobacco: categorical, ordinal
⚫ Packyear: numerical, discrete
⚫ Sport: categorical, nominal (binary)
⚫ Measure: numerical, continuous
⚫ Alcohol: categorical, ordinal

gender & sport – categorical, nominal (binary);
Tobacco & alcohol - categorical, ordinal;
Age, height, weight, & measure – numerical, continuous;
packyear - numerical, discrete.
Mention the types of statistical tests performed on nominal, ordinal, interval and ratio data types.

Measure | Nominal | Ordinal | Interval | Ratio
Mode | Yes | Yes | Yes | Yes
Median | No | Yes | Yes | Yes
Mean | No | No | Yes | Yes
Frequency Distribution | Yes | Yes | Yes | Yes
Range | No | Yes | Yes | Yes
Add and Subtract | No | No | Yes | Yes
Multiply and Divide | No | No | No | Yes
Standard Deviation | No | No | Yes | Yes
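
As a rough sketch of how the table could guide analysis code, the function below returns only the summary measures marked "Yes" for a given scale. The function name summarize, its parameters and the sample values are hypothetical; only the standard library is used.

```python
from collections import Counter
from statistics import mean, median, stdev


def summarize(values, scale, order=None):
    """Compute only the measures the table allows for `scale`.

    `order` lists the ordinal labels from lowest to highest and is
    required when scale == "ordinal".
    """
    summary = {"mode": Counter(values).most_common(1)[0][0]}
    if scale == "nominal":
        return summary
    if scale == "ordinal":
        ranked = sorted(values, key=order.index)
        # Middle element by rank (upper-middle for an even count).
        summary["median"] = ranked[len(ranked) // 2]
        return summary
    # Interval and ratio data additionally allow mean and standard deviation.
    summary["median"] = median(values)
    summary["mean"] = mean(values)
    summary["stdev"] = stdev(values)
    return summary


print(summarize(["A", "B", "A", "O"], "nominal"))
print(summarize(["S", "M", "L", "M", "XL"], "ordinal", order=["S", "M", "L", "XL", "XXL"]))
print(summarize([10, 20, 30], "interval"))
```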


Some of the attributes of an image are:
⚫ Resolution: the number of pixels in the image
⚫ Color depth: the number of bits used to represent the color of each pixel
⚫ Aspect ratio: the ratio of the width to the height of the image
⚫ File format: the type of file that the image is saved in (e.g. JPEG,
PNG, GIF)
⚫ Compression: the degree to which the image data has been
compressed
⚫ Metadata: information about the image such as the date it was taken,
camera settings, and location
⚫ Size: the file size of the image in bytes or megabytes
⚫ Orientation: the orientation of the image (e.g. landscape or portrait)
⚫ Bit depth: the number of bits used to represent the brightness of each
pixel
⚫ Shape: the shape of the image, such as rectangular or square.

NOIR Categorization:
⚫ Resolution: ratio (a pixel count with a true zero)
⚫ Color depth: ratio
⚫ Aspect ratio: ratio
⚫ File format: nominal
⚫ Compression: nominal
⚫ Metadata: nominal
⚫ Size: ratio
⚫ Orientation: nominal
⚫ Bit depth: ratio
⚫ Shape: nominal.
Suppose, two images are given. Give an idea to check if two images are identical or not.

To check if two images are identical or not, you can follow the below steps:
⚫ Check the resolution of both images. If the resolution is different, then the
images are not identical.
⚫ Check the color depth of both images. If the color depth is different, then the
images are not identical.
⚫ Check the aspect ratio of both images. If the aspect ratio is different, then the
images are not identical.
⚫ Check the file format of both images. If the file format is different, then the
images are not identical.
⚫ If the above attributes are the same for both images, then compare the pixel
values of each image. If any pixel values are different between the two
images, then they are not identical.
⚫ To quantify how much the pixel values differ, you can use image comparison metrics such as Mean Squared Error (MSE), Structural Similarity Index (SSIM), or Peak Signal-to-Noise Ratio (PSNR). These metrics calculate a score from the differences in pixel values. Two images are identical only if every pixel matches, which corresponds to an MSE of 0 (or an SSIM of 1). If the score only indicates a close match (for example, an MSE below or an SSIM above a chosen threshold), the images should be considered similar rather than identical.
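
A minimal sketch of the resolution check and the pixel comparison, assuming each image is a grayscale picture stored as a list of rows of integer pixel values. Real code would load the pixels with an imaging library; the tiny images here are made up.

```python
def mean_squared_error(img_a, img_b):
    """MSE between two grayscale images given as lists of pixel rows."""
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images differ in resolution")
    pixel_count = sum(len(row) for row in img_a)
    if pixel_count == 0:
        return 0.0
    squared_error = sum(
        (pa - pb) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for pa, pb in zip(row_a, row_b)
    )
    return squared_error / pixel_count


def are_identical(img_a, img_b):
    """Identical means same resolution and an MSE of exactly 0."""
    try:
        return mean_squared_error(img_a, img_b) == 0
    except ValueError:
        return False


image_a = [[0, 255], [128, 64]]
image_b = [[0, 255], [128, 64]]
image_c = [[0, 255], [128, 65]]  # one pixel differs by 1

print(are_identical(image_a, image_b))
print(mean_squared_error(image_a, image_c))
```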

Classification of data based on availability – primary, secondary,
tertiary
⚫ Primary data: The data which is available very close to the
origin of a particular topic or an event.
⚫ An eyewitness account of a traffic accident is an example of a
primary source.
⚫ Other examples include:
⚫ archeological artifacts
⚫ photographs
⚫ videos
⚫ historical documents such as diaries
⚫ census results
⚫ maps
⚫ transcripts of surveillance, public hearings, trials, or interviews
⚫ un-tabulated results of surveys or questionnaires
Examples of primary data (contd.)
⚫ the original written or recorded notes of laboratory and field
research,
⚫ experiments or observations which have not been published in a
peer reviewed source;
⚫ original philosophical works,
⚫ religious scripture,
⚫ administrative documents,
⚫ patents,
⚫ artistic and fictional works such as poems, scripts, screenplays,
novels, motion pictures, videos, and television programs.

Examples of primary data (contd.)
⚫ Surveys conducted by a company to collect customer feedback
⚫ Interviews with experts in a particular field to gather insights
⚫ Focus groups conducted to understand consumer preferences and
opinions
⚫ Observations made by researchers during experiments or studies
⚫ Sales data collected by a company on its own products or services
⚫ Experiment results gathered by scientists in a laboratory
⚫ User testing sessions conducted on a new product or service
⚫ Data collected by a company through user feedback forms
⚫ Customer service call recordings
⚫ Social media analytics data collected by a company on its own brand
or products.

Secondary Data
⚫ Secondary data is the data that has been collected in the past by
someone else but made available for others to use.
⚫ Secondary data are accounts at least one step removed from an event or body of primary-source material and may include an interpretation, analysis, or synthetic claims about the subject.
⚫ Secondary sources may draw on primary sources and other secondary
sources to create a general overview, or to make analytic or synthetic
claims.
⚫ Sources of secondary data:
Books
Published Sources
Journals
Newspapers
Websites
Blogs
Diaries

Examples of secondary data (contd.)
⚫ Reports published by government agencies on economic indicators
like GDP, inflation, and unemployment
⚫ Market research reports published by consulting firms
⚫ Academic papers and research studies published by universities and
research institutions
⚫ Industry statistics and trends published by trade associations
⚫ News articles and press releases
⚫ Company financial statements and annual reports
⚫ Whitepapers and case studies published by companies and research
firms
⚫ Social media analytics data collected by third-party providers
⚫ Customer reviews and ratings of products and services
⚫ Patent filings and other intellectual property data.

Tertiary Data
⚫ Tertiary data is based on primary and secondary data.
⚫ Tertiary sources are publications such as encyclopedias or
other compendia (A compendium is a concise collection of
information pertaining to a body of knowledge) that sum up
secondary and primary sources.
⚫ For example, Wikipedia itself is a tertiary source.
⚫ Many introductory textbooks may also be considered tertiary to
the extent that they sum up multiple primary and secondary
sources.
⚫ Other examples include manuals, guidebooks, almanacs, handbooks, and indexing and abstracting sources.

Examples of tertiary data (contd.)
⚫ Business directories and databases containing information on companies and
industries
⚫ Online search engine results pages (SERPs) containing information about a
particular topic
⚫ Reference books and encyclopedias
⚫ Online databases containing information on academic journals and
publications
⚫ Online forums and discussion boards
⚫ Publicly available data on websites such as government websites and social
media sites
⚫ Online user-generated content such as blogs and wikis
⚫ Online news articles and archives
⚫ Online marketplaces such as Amazon or eBay
⚫ Online learning platforms such as Coursera and Udemy.

Examples of primary, secondary,
tertiary data

Based on structural form
Structured
Unstructured
Semi structured
Structured Data
⚫ Data that has a structure and is well organized, either in the form of tables or in some other way, and that can be easily operated on is known as structured data.
⚫ Searching and accessing information from such type of data is
very easy.
⚫ Structured data is relatively simple to enter, store, query, and
analyze, but it must be strictly defined in terms of field name and
type.
⚫ Example:
✔ Data stored in the relational database in the form of tables
having multiple rows and columns.
✔ A spreadsheet is another good example of structured data.

Examples of structured data (contd.)
⚫ Relational database tables
⚫ Excel spreadsheets
⚫ JSON data with a consistent schema
⚫ CSV files with consistent column headers and data types
⚫ Sensor data from Internet of Things (IoT) devices with
well-defined data structures
⚫ Financial transaction records
⚫ Stock market data with consistent formats
⚫ Medical records with a standard format
⚫ Government census data in a tabular format
⚫ Web server logs with consistent fields and formats
Unstructured data
⚫ Unstructured data refers to the data that lacks any specific form or structure.
⚫ This makes it very difficult and time-consuming to process and analyze unstructured
data.
Examples:
⚫ Emails
⚫ Word Processing Files
⚫ PDF files
⚫ Digital Images
⚫ Video
⚫ Audio
⚫ Social Media Posts

Examples of unstructured data (contd.)
⚫ Social media posts (e.g. tweets, Facebook posts)
⚫ Audio and video recordings
⚫ Images and videos
⚫ Emails and instant messages
⚫ Text documents without consistent formatting or structure
⚫ Voice recordings of customer service interactions
⚫ Web pages with unstructured HTML
⚫ Handwritten notes or letters
⚫ News articles or blogs
⚫ Surveillance camera footage

Semi-structured data
⚫ Semi-structured data is information that doesn’t reside in a
relational database but that does have some organizational
properties that make it easier to analyze.
⚫ Because the information is only partly organized, semi-structured data is more difficult to retrieve, analyze and store than structured data.
⚫ Processing it at scale may require a software framework such as Apache Hadoop.
⚫ Examples:
⚫ XML
⚫ JSON

Examples of semi-structured data (contd.)
⚫ XML data with a flexible schema
⚫ HTML files with structured data in tags
⚫ JSON data with some variation in schema
⚫ Emails with a consistent format (e.g. sender, recipient, subject, body)
but variable content
⚫ Sensor data with variable fields depending on the device
⚫ Log files with structured fields but variable contents
⚫ Configuration files with some structure but also free-form text
⚫ Social media posts with hashtags or mentions
⚫ Invoices with a consistent structure but variable content
⚫ E-commerce product listings with structured fields but variable
descriptions.
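
To make "JSON data with some variation in schema" concrete, the sketch below parses three invented records whose fields differ and flattens them into a table-like structure, filling missing fields with None. The records and field names are assumptions for illustration only.

```python
import json

# Hypothetical semi-structured input: every record has "id" and "name",
# but the remaining fields vary from record to record.
raw = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]},
  {"id": 3, "name": "Meera", "address": {"city": "Bangalore"}}
]
"""

records = json.loads(raw)

# The union of all keys becomes the column set of a structured table.
columns = sorted({key for record in records for key in record})

# Missing fields are filled with None so that every row has the same shape.
rows = [[record.get(column) for column in columns] for record in records]

print(columns)
for row in rows:
    print(row)
```

The tags (keys) give enough organization to build the table, but nested values such as the phone list or the address still need further handling before they fit a strict relational schema.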

Based on inherent nature
quantitative
qualitative
Quantitative data
⚫ Quantitative data are anything that can be expressed as a
number, or quantified.
⚫ It is data that can either be counted or compared on a
numeric scale.
⚫ Examples of quantitative data are scores on achievement
tests, number of hours of study, or weight of a subject.
⚫ These data may be represented by ordinal, interval or ratio
scales and lend themselves to most statistical manipulation.

Qualitative data
⚫ Qualitative data cannot be expressed as a number.
⚫ It describes qualities or characteristics.
⚫ It is collected using questionnaires, interviews, or
observation, and frequently appears in narrative form.
⚫ Data that represent nominal scales, such as gender, socioeconomic status, or religious preference, are usually considered to be qualitative data.

Data unit | Numeric variable | Quantitative data | Categorical variable | Qualitative data
A person | "How many hours do you work?" | 40 hours per week | "Do you work full-time or part-time?" | Full-time
A person | "How much do you earn?" | 10,00,000 p.a. | "What is your occupation?" | Data Analyst
A person | "How many children do you have?" | 2 children | "In which country were your children born?" | India
A house | "How many square metres is the house?" | 200 square metres | "In which city or town is the house located?" | Bangalore
A business | "How many workers are currently employed?" | 200 employees | "What is the industry of the business?" | Education
A farm | "How many milk cows are located on the farm?" | 36 cows | "What is the main activity of the farm?" | Dairy

Based on observation
⚫ Time-series Data
⚫ Cross-sectional Data
⚫ Panel Data

Time-series Data

⚫ Time-series data refers to a set of observations taken over a given period of time at specific and equally-spaced time intervals.
⚫ Because the observations are taken at specific points in time, the time intervals are discrete.
⚫ A good example of time-series data could be the daily or weekly
closing price of a stock recorded over a period spanning 10 weeks.
⚫ Other appropriate examples could be the set of monthly profits (both
positive and negative) earned by Samsung between the 1st of January
2020 and the 1st of December 2020.
⚫ Time-series data can be used to predict future values of a given
financial vehicle.

Cross-sectional Data
⚫ Cross-sectional data refers to a set of observations taken at a single point in time.
⚫ Samples are constructed by collecting the data of interest across a range of observational units – people, objects, firms – at the same time.
⚫ A good example of cross-sectional data is the stock returns earned by shareholders of Microsoft, IBM, and Samsung for the year ended 31st December 2020.

Panel Data
⚫ Panel data, sometimes referred to as longitudinal data, is data that
contains observations about different cross sections across time.
OR
⚫ Panel data is a collection of quantities obtained across multiple
individuals, that are assembled over even intervals in time and ordered
chronologically.
⚫ Examples of groups that may make up panel data series include
countries, firms, individuals, or demographic groups.

⚫ It is possible to pool time series data and cross-sectional data. If we were to study a particular characteristic or phenomenon across several entities over a period of time, we would end up with what’s referred to as panel data.
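
The relationship between the three kinds of data can be sketched with a small dictionary keyed by (entity, year). All return figures are invented, and the helper names cross_section and time_series are hypothetical.

```python
# Panel data: several entities observed over several periods.
# Keys are (firm, year); values are annual returns in percent (made-up numbers).
panel = {
    ("Microsoft", 2019): 5.0, ("Microsoft", 2020): 7.0,
    ("IBM", 2019): 2.0, ("IBM", 2020): 1.5,
    ("Samsung", 2019): 4.0, ("Samsung", 2020): 6.0,
}


def cross_section(data, year):
    """All entities at a single point in time."""
    return {firm: value for (firm, y), value in data.items() if y == year}


def time_series(data, firm):
    """One entity followed over time, in chronological order."""
    return [(y, value) for (f, y), value in sorted(data.items()) if f == firm]


print(cross_section(panel, 2020))
print(time_series(panel, "IBM"))
```

Fixing the year yields cross-sectional data, fixing the entity yields a time series, and keeping both dimensions gives the panel.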
• Sample data and Population
• Small sample and Large sample
• Meaning of Statistic and Parameter
• Types of Statistics
• Application of statistics in different business scenarios
• Frequency Distribution of Data
Sample Data and Population

The two important types of data sets are populations and samples. The
definition of Sample and Population data is as follows:
• A sample consists of one or more observations drawn from the population.
• A population includes all the elements from a set of data.
Depending on the sampling method, a sample can have fewer observations than
the population. More than one sample can be derived from the same population.
Sample Data and Population

Differences between Sample and Population based on nomenclature, notation, and computations can
also be identified. For example:
• A measurable characteristic of a population, such as a mean or standard deviation, is called a
parameter; but a measurable characteristic of a sample is called a statistic.
• The mean of a population is denoted by the symbol μ; but the mean of a sample is denoted by the symbol x̄.
• The formula for the standard deviation of a population is different from the formula for the standard
deviation of a sample.
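
The difference between the two formulas is the divisor: the population standard deviation divides the sum of squared deviations by N, while the sample standard deviation divides by n - 1. Python's statistics module provides both as pstdev and stdev. The data below are made up, and the "sample" is simply the first four values rather than a random draw.

```python
from statistics import mean, pstdev, stdev

population = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical complete population
mu = mean(population)                   # parameter: population mean
sigma = pstdev(population)              # parameter: divides by N

sample = population[:4]                 # illustrative sample, not randomly drawn
x_bar = mean(sample)                    # statistic: sample mean
s = stdev(sample)                       # statistic: divides by n - 1

print(mu, sigma)
print(x_bar, s)
```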
Small Sample and Large Sample

Large sample theory: If the sample size n is greater than or equal to 30 (n ≥ 30), it is known as a large sample. For large samples, the sampling distributions of statistics are approximately normal (Z test). A study of the sampling distribution of a statistic for a large sample is known as large sample theory.

Small sample theory: If the sample size n is less than 30 (n < 30), it is known as a small sample. For small samples, the sampling distributions are the t, F and χ² distributions. A study of sampling distributions for small samples is known as small sample theory.
Parameter and Statistics
• A measurable characteristic of a population, such as a mean or standard deviation, is called a parameter;
but a measurable characteristic of a sample is called a statistic.
• A parameter never changes, because everyone (or everything) was surveyed to find it. For example, if the average age of everyone in a class needs to be calculated, then everyone is asked and the average age is found to be, say, 25. That is a parameter, because everyone in the class was asked. Now suppose the average age of everyone in the grade or year is required. If you use the information from your class to take a guess at that average age, then the information becomes a statistic. That is because you cannot be sure your guess is correct (although it will probably be close).
Exploratory Data Analysis

Parameter and Statistics


Statistic (Roman or Parameter (Greek or
Measurement
lowercase) uppercase)
Population Proportion p P
Data Elements x X
Population Mean x̄ μ
Standard deviation s σ
Variance s2 σ2
Number of elements n N
Correlation Coefficient r ρ

https://www.statisticshowto.datasciencecentral.com/what-is-a-parameter-statisticshowto/
Exploratory Data Analysis

Types of Statistics
Exploratory Data Analysis

Descriptive Statistics:
Descriptive statistics describes the data at hand, whether a sample or a whole population,
through numerical calculations, graphs, or tables. It summarizes the data numerically and
graphically without drawing conclusions beyond it.
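As a minimal sketch of a numerical description, the snippet below summarizes the ten student heights used in this unit's frequency-distribution example. It relies only on Python's standard `statistics` module; the choice of which summaries to report is an assumption made for illustration.

```python
import statistics

# Heights (cm) of ten students, taken from the frequency-distribution example in this unit.
heights = [139, 145, 150, 145, 136, 150, 152, 144, 138, 138]

summary = {
    "count": len(heights),
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "min": min(heights),
    "max": max(heights),
    "sample_stdev": statistics.stdev(heights),  # divides by n - 1
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Every number printed describes only these ten students; nothing is inferred about a wider population.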
Inferential Statistics

•Inferential statistics makes inferences and predictions about a population based on a
sample of data taken from that population.
•It generalizes from a sample to the larger population and applies probability to draw conclusions.
•It is used to analyze and interpret results, explaining what the descriptive statistics
mean for the population.
•Inferential statistics is closely associated with hypothesis testing, which checks whether
the sample gives enough evidence to reject the null hypothesis.
•Hypothesis testing is an inferential procedure that uses sample data to assess the
credibility of a hypothesis about a population.
•Inferential statistics is also used to determine how strong a relationship observed in the
sample is likely to be in the population. In practice, it is often very difficult to obtain a
complete population list and draw a truly random sample from it.
Steps of Inferential Statistics
⚫ Start with a theory.
⚫ Generate a research hypothesis.
⚫ Operationalize the variables.
⚫ Identify the population to which the study applies.
⚫ Form a null hypothesis for this population.
⚫ Collect a sample from the population and run the study.
⚫ Perform statistical tests to check whether the characteristics of the
sample are sufficiently different from what would be expected under
the null hypothesis, so that the null hypothesis can be rejected.
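The last steps can be sketched with a one-sample t-test. This is an illustration under stated assumptions, not a prescribed procedure: the sample reuses the ten student heights from this unit's frequency example, the hypothesized mean of 140 cm is invented for the example, and the critical value 2.262 is the usual two-sided 5% table value for 9 degrees of freedom.

```python
import math
import statistics

# Sample: heights (cm) of ten students from this unit's frequency example.
sample = [139, 145, 150, 145, 136, 150, 152, 144, 138, 138]

# Null hypothesis (hypothetical value, chosen only for illustration):
# the population mean height is 140 cm.
mu_0 = 140

n = len(sample)
x_bar = statistics.mean(sample)      # sample mean
s = statistics.stdev(sample)         # sample standard deviation (n - 1 divisor)
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-sided critical value of Student's t for df = 9 at the 5% level,
# taken from a standard t table.
T_CRITICAL = 2.262

reject_null = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.3f}, reject null hypothesis: {reject_null}")
```

With these numbers t is about 2.03, which is below 2.262, so the sample is not different enough from 140 cm to reject the null hypothesis at the 5% level.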

Exploratory Data Analysis

Types of inferential statistics


•One sample test of difference/One sample hypothesis test
•Confidence Interval
•Contingency Tables and Chi-Square Statistic
•t-test or ANOVA
•Pearson Correlation
•Bivariate Regression
•Multivariate Regression

Frequency Distribution of Data


A frequency distribution shows how often each distinct value (or each interval of values) occurs in a data
set, presented as a list, table, or graph. Frequency distributions may be grouped or ungrouped. Raw data is
just a collection of numbers or values, and it must be organized before it becomes useful.

Sample Data and Population


Sampling: Sampling is the process of selecting a subset of the population in order to make statistical
inferences from it and to estimate characteristics of the whole population. Researchers use sampling, for example
in market research, so that they do not need to study the entire population to collect actionable insights. It also
saves time and cost, and hence forms the basis of most research designs.

For example, if a drug manufacturer would like to research the adverse side effects of a drug on the population
of the country, it is close to impossible to conduct a study that involves everyone. In this case, the researcher
selects a sample of people from each demographic and conducts the research on them, which gives indicative
feedback on the behaviour of the drug in the population.

Sample Data and Population

Types of Sampling:

Probability Sampling: Probability sampling is a sampling method that selects members of a population at
random according to a fixed selection procedure. Every member has a known, non-zero chance (in the simplest
designs, an equal chance) of being part of the sample.
Non-probability Sampling: Non-probability sampling depends on the researcher’s judgement or convenience
rather than on random selection. Because there is no fixed or pre-defined selection process, not all elements of
the population have an equal opportunity to be included in the sample.

https://www.questionpro.com/blog/types-of-sampling-for-social-research/

Sample Data and Population


Probability Sampling: Probability sampling is a sampling technique in which the sample from a larger
population is chosen using a method based on the theory of probability. This sampling method considers every
member of the population and forms samples based on a fixed process. For example, in a population of 1000
members, each member has a 1/1000 chance of being picked on any single random draw. This reduces
selection bias and gives all members a fair chance of being included in the sample.
There are four types of probability sampling techniques:
• Simple Random Sampling
• Cluster Sampling
• Systematic Sampling
• Stratified Random Sampling

Sample Data and Population


Simple Random Sampling: This is one of the simplest probability sampling techniques and helps save time
and resources. Every member of the population is chosen purely by chance, and each has exactly the same
probability of being selected for the sample.

For example, in an organization of 500 employees, if the HR team needs to choose participants for team
building activities, it could pick chits out of a bowl. In this case, each of the 500 employees has an equal
opportunity of being selected.
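The chits-in-a-bowl idea can be sketched with Python's standard `random` module. The employee IDs, the sample size of 50 and the fixed seed are all assumptions made for illustration.

```python
import random

# Hypothetical employee IDs 1..500, mirroring the organization in the example.
employees = list(range(1, 501))

# Fixed seed (an assumption for reproducibility; drop it for a fresh draw each run).
rng = random.Random(42)

# Draw 50 employees without replacement; the sample size 50 is an arbitrary choice.
chosen = rng.sample(employees, k=50)

print(sorted(chosen)[:10])
```

`sample` draws without replacement, so nobody is picked twice and every employee has the same chance of ending up in the sample.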

Sample Data and Population


Cluster Sampling: Cluster sampling is a method where the researchers divide the entire population into sections
or clusters, each of which is meant to represent the population. Clusters are identified using parameters such as
location, age or sex, and a random selection of whole clusters is then included in the sample, which makes it
easier for a survey creator to collect and analyse feedback.

For example, if the government of the United States wishes to evaluate the number of immigrants living in the
US, it can divide the country into clusters based on states such as California, Texas, Florida, Massachusetts,
Colorado, Hawaii etc. Conducting the survey this way is more practical, as the results will be organized
by state and provide insightful immigration data.
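A minimal sketch of one-stage cluster sampling follows, assuming a made-up population of 100 residents per state; the resident labels, the number of clusters drawn and the seed are hypothetical.

```python
import random

# Hypothetical population: resident IDs grouped by state (the cluster).
states = ["California", "Texas", "Florida", "Massachusetts", "Colorado", "Hawaii"]
population = {state: [f"{state}-{i}" for i in range(1, 101)] for state in states}

rng = random.Random(7)  # fixed seed, assumed only for reproducibility

# Step 1: randomly choose whole clusters (here 2 of the 6 states).
selected_states = rng.sample(states, k=2)

# Step 2: survey everyone inside the chosen clusters.
sample = [person for state in selected_states for person in population[state]]

print(selected_states, len(sample))
```

The randomness sits at the cluster level: states are drawn at random, and all residents of a drawn state enter the sample.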

Sample Data and Population


Systematic Sampling: Using the systematic sampling method, members of a sample are chosen at regular
intervals from the population. It requires a starting point and a fixed sampling interval, which is then applied
repeatedly. Because the interval is predefined, this is among the least time-consuming sampling techniques.

For example, a researcher intends to collect a systematic sample of 500 people in a population of 5000. Each
element of the population will be numbered from 1 to 5000 and every 10th individual will be chosen to be a part
of the sample (Total population / Sample size = 5000/500 = 10).
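The 5000/500 example can be sketched as below. The random choice of the starting position and the fixed seed are assumptions; the slide only states that a starting point is needed.

```python
import random

population = list(range(1, 5001))            # members numbered 1..5000
sample_size = 500
interval = len(population) // sample_size    # 5000 / 500 = 10

rng = random.Random(0)  # fixed seed, assumed only for reproducibility
start = rng.randrange(interval)              # random starting index in 0..9

# Take every 10th member beginning at the random start.
sample = population[start::interval]

print(interval, len(sample), sample[:5])
```

Whatever the starting position, stepping through 5000 members in strides of 10 yields exactly 500 members.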

Sample Data and Population


Stratified Random Sampling: Stratified random sampling is a method where the population is divided
into smaller groups (strata) that do not overlap but together represent the entire population. A random sample
is then drawn from each group separately.

For example, a researcher looking to analyze the characteristics of people belonging to different annual income
divisions will create strata (groups) according to annual family income, such as up to $20,000, $20,001 to
$30,000, $30,001 to $40,000, $40,001 to $50,000 etc. People belonging to the different income groups can then
be observed to draw conclusions about which income strata have which characteristics. Marketers can analyze
which income groups to target and which to leave out, to create a roadmap that would bear fruitful results.
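A minimal sketch of stratified sampling by income bracket is given below. The household data is randomly generated for illustration, the brackets are assumed to be contiguous and non-overlapping, and the 10% draw from each stratum (proportional allocation) is one possible design choice rather than the only one.

```python
import random

rng = random.Random(1)  # fixed seed, assumed only for reproducibility

# Hypothetical households with an annual income, for illustration only.
households = [{"id": i, "income": rng.randint(5_000, 50_000)} for i in range(1, 1001)]

def stratum(income):
    """Map an income to one of the non-overlapping brackets."""
    if income <= 20_000:
        return "up to 20,000"
    if income <= 30_000:
        return "20,001 to 30,000"
    if income <= 40_000:
        return "30,001 to 40,000"
    return "40,001 to 50,000"

# Group households into strata.
strata = {}
for h in households:
    strata.setdefault(stratum(h["income"]), []).append(h)

# Draw 10% of each stratum at random (proportional allocation).
sample = []
for name, members in strata.items():
    k = max(1, round(0.10 * len(members)))
    sample.extend(rng.sample(members, k))

print({name: len(members) for name, members in strata.items()}, len(sample))
```

Because a separate draw is made inside every bracket, each income group is guaranteed to appear in the sample.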

Sample Data and Population


Uses of the Probability Sampling Method: There are multiple uses of the probability sampling method.
Reduce Sample Bias: With probability sampling, the selection bias in the sample drawn from a population is
small, because members are chosen by a random process rather than by the researcher’s own understanding
and inference. Probability sampling leads to higher quality data collection, as the population is appropriately
represented by the sample.
Diverse Population: When the population is large and diverse, it is important to have adequate representation so
that the data is not skewed towards one demographic. For example, if Square would like to understand the
people who could use their point-of-sale devices, a survey of a sample of people across the US from
different industries and socio-economic backgrounds helps.
Create an Accurate Sample: Probability sampling helps researchers plan and create an accurate sample. This
helps to obtain well-defined data.

Sample Data and Population


Non-probability Sampling Methods: The non-probability method is a sampling method that collects feedback
based on the sample selection judgement of a researcher or statistician, not on a fixed selection process. In
most cases, a survey conducted with a non-probability sample gives skewed results, which may not fully
represent the target population. In studies with tight cost constraints, however, non-probability sampling can
be more practical than probability sampling.

There are 4 types of non-probability sampling, which explain the purpose of this sampling method:
• Convenience sampling
• Judgmental or Purposive Sampling
• Snowball sampling
• Quota sampling

Sample Data and Population


Convenience sampling: This method depends on the ease of access to subjects, such as surveying customers
at a mall or passers-by on a busy street. It is termed convenience sampling because it is carried out based
on how easy it is for a researcher to contact the subjects. Researchers have nearly no authority over selecting
elements of the sample; selection is based purely on proximity and not on representativeness. This
non-probability sampling method is used when there are time and cost limitations in collecting feedback, for
example in the initial stages of research.

For example, startups and NGOs usually conduct convenience sampling at a mall to distribute leaflets of
upcoming events or to promote a cause. They do that by standing at the entrance of the mall and giving out
pamphlets to whoever passes by.

Sample Data and Population

Judgmental or Purposive Sampling: In judgmental or purposive sampling, the sample is formed at the
discretion of the researcher, purely considering the purpose of the study along with an understanding of the
target audience. Also known as deliberate sampling, it selects participants solely based on research
requirements, and elements that do not serve the purpose are kept out of the sample.

For instance, suppose researchers want to understand the thought process of people who are interested in
studying for their master’s degree. The selection criterion will be: “Are you interested in studying for Masters
in …?” and those who respond with a “No” will be excluded from the sample.

Sample Data and Population


Snowball sampling: Snowball sampling is a sampling method used in studies of subjects who are difficult to
trace. Researchers start with a few subjects they can reach and ask them to refer others from the same group.

For example, it will be extremely challenging to survey shelterless people or illegal immigrants. In such cases,
using the snowball approach, researchers can track down a few members of that category to interview and reach
further members through them. This sampling method is also used where the topic is highly sensitive and not
openly discussed, such as surveys to gather information about HIV/AIDS. Not many patients will readily
respond to the questions, but researchers can contact people they might know, or volunteers associated with the
cause, to get in touch with them and collect information.

Sample Data and Population

Quota sampling: In quota sampling, members are selected based on a pre-set standard. Because the sample is
formed on the basis of specific attributes, the created sample will have the same attributes that are found in the
total population. It is an extremely quick method of collecting samples.

Sample Data and Population


There are multiple uses of the non-probability sampling method, such as:
Create a hypothesis: The non-probability sampling method is used to create a hypothesis when limited to no
prior information is available. This method helps with the immediate return of data and helps to build a base for
any further research.
Exploratory research: This sampling technique is widely used when researchers aim at conducting qualitative
research, pilot studies or exploratory research.
Budget and time constraints: The non-probability method is used when there are budget and time constraints
and some preliminary data has to be collected. Since the survey design is not rigid, it is easier to pick whichever
respondents are readily available and have them take the survey or questionnaire.

Small Sample and Large Sample


Test of Significance: The theory of tests of significance consists of various test statistics. The theory has been
developed under two broad headings:

Test of significance for large samples (n ≥ 30): large sample test, asymptotic test or Z test.
Test of significance for small samples (n < 30): small sample test or exact test (t, F and χ²).
It may be noted that small sample tests can also be used for large samples.
Large sample tests are of two kinds:
• Sampling from attributes
• Sampling from variables
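As a hedged illustration for the case of testing a single mean, the Z and t statistics take the standard textbook forms below. The symbol μ₀ (the mean stated in the null hypothesis) is introduced for this sketch; x̄, s, σ and n follow this unit's notation.

```latex
Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \quad (n \ge 30),
\qquad
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \quad (n < 30,\ \text{with } n - 1 \text{ degrees of freedom})
```

In the large-sample case, the sample standard deviation s is commonly substituted when σ is unknown.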

Parameter and Statistics


• A measurable characteristic of a population, such as a mean or standard deviation, is called a parameter;
but a measurable characteristic of a sample is called a statistic.
• A parameter is a fixed value, because everyone (or everything) in the population is surveyed to find it. For
example, if the average age of everyone in a class needs to be calculated, then everyone is asked and the
average age is found to be, say, 25. That’s a parameter because everyone in the class was asked. Now suppose
the average age of everyone in your grade or year is required. If you use the information from your class to
take a guess at that average age, then that information becomes a statistic. That’s because you cannot be sure
your guess is correct (although it will probably be close).

Parameter and Statistics

• Statistics vary. You know the average age of your classmates is 25. You might guess that the average age of
everyone in your grade or year is 24, 25, or 26. You might guess the average age for other colleges in your
area is the same. And you might even guess that’s the average age for college students in the US. These may
not be bad guesses, but they are statistics because you did not ask everyone.
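The contrast between a fixed parameter and varying statistics can be sketched with a small simulation. The population of 2000 student ages, the class size of 40 and the seed are all hypothetical values chosen for illustration.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed, assumed only for reproducibility

# Hypothetical population: ages of 2000 students in a year group.
population = [rng.randint(22, 28) for _ in range(2000)]

# Parameter: computed from everyone, so it is a single fixed number.
mu = statistics.mean(population)

# Statistics: each class of 40 students gives its own sample mean.
sample_means = [statistics.mean(rng.sample(population, 40)) for _ in range(5)]

print(f"population mean (parameter): {mu:.2f}")
print("sample means (statistics):", [round(m, 2) for m in sample_means])
```

The population mean is computed once from all 2000 ages, whereas each class of 40 produces its own estimate, typically close to the parameter but not identical to it.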

Parameter and Statistics


Notation of Parameters and Statistics: Parameters are usually Greek letters (e.g. σ) or capital letters (e.g. P).
Statistics are usually Roman letters (e.g. s).
In most cases, if you see a lowercase letter (e.g. p), it’s a statistic.

Parameter and Statistics


Measurement               Statistic (Roman or lowercase)   Parameter (Greek or uppercase)
Proportion                p                                P
Data elements             x                                X
Mean                      x̄                                μ
Standard deviation        s                                σ
Variance                  s²                               σ²
Number of elements        n                                N
Correlation coefficient   r                                ρ

https://www.statisticshowto.datasciencecentral.com/what-is-a-parameter-statisticshowto/

Parameter and Statistics


BASIS FOR COMPARISON   STATISTIC                                   PARAMETER
Meaning                A statistic is a measure which describes    A parameter is a measure which describes
                       a fraction of the population.               the whole population.
Numerical value        Variable and known                          Fixed and unknown
Statistical notation   x̄ = Sample mean                             μ = Population mean
                       s = Sample standard deviation               σ = Population standard deviation
                       p̂ = Sample proportion                       P = Population proportion
                       x = Data elements                           X = Data elements
                       n = Size of sample                          N = Size of population
                       r = Correlation coefficient                 ρ = Correlation coefficient

https://keydifferences.com/difference-between-statistic-and-parameter.html

Types of Statistics

A statistic is a piece of data from a portion of a population. It is the counterpart of a parameter, which is
data from a census. A census surveys everyone.

For example, if you have a bit of information, it’s a statistic. If you look at part of a data set, it’s a statistic. If
you know something about 10% of people, that’s a statistic too. Parameters describe all the information, and all
the information is rarely known.

Types of Statistics

Statistics is a way to understand the data that is collected. For example, every time a package is sent
through the mail, that package is tracked in a huge database. The UPS database is 17 terabytes - about as
big as if you catalogued every book in the Library of Congress.

All data is meaningless without a way to interpret it, which is where statistics comes in. Statistics is about
data and variables. It is also about analysing that data and producing some meaningful information about
that data.

Types of Statistics

Types of Statistics: A statistic can be of more than one type. For example, the sample standard deviation can be
used as a descriptive statistic, to describe the spread of a sample. It can be used as an estimator, to
estimate the population standard deviation. And it can be used to test a theory (a hypothesis).
• Descriptive Statistics
• Inferential statistics

Application of Statistics
State: For the effective functioning of the State, statistics is indispensable. Different departments and
authorities require various facts and figures on different matters. They use this data to frame policies and
guidelines so that they can function smoothly.

Traditionally, people used statistics to collect data of manpower, crime, wealth, income, etc. for the formation
of suitable military and fiscal policies.

Application of Statistics
Economics: Economics is about allocating limited resources among unlimited ends in the most optimal
manner. Statistics offers information to answer some basic questions in economics –

• What to produce?
• How to produce?
• For whom to produce?

Statistical information helps to understand the economic problems and formulation of economic policies.
Traditionally, the application of statistics was limited since the economic theories were based on deductive
logic. Also, most statistical techniques were not developed enough for application in all disciplines.

Frequency Distribution of Data

Frequency: The frequency of any value is the number of times that value appears in a data set. For example, if
two children in a survey of favourite colours choose blue, the frequency of blue is two. To make sense of raw
data, we must organise it, and finding the frequency of each data value is how this organization is done.

Frequency Distribution of Data


Frequency Distribution
Many times it is not easy or feasible to read the frequencies directly from a very large dataset. So to make
sense of the data we make a frequency table and graphs. Let us take the example of the heights of ten students
in cm:

139, 145, 150, 145, 136, 150, 152, 144, 138, 138

Frequency Distribution Table

Height (cm)   Frequency
136           1
138           2
139           1
144           1
145           2
150           2
152           1
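One way to build the frequency table for these heights programmatically is sketched below, using `collections.Counter` from the Python standard library. The column headings and the use of `|` characters as tally marks are presentation choices assumed for this example.

```python
from collections import Counter

heights = [139, 145, 150, 145, 136, 150, 152, 144, 138, 138]

# Count how many times each distinct height occurs.
frequency = Counter(heights)

print("Height (cm)  Tally  Frequency")
for height in sorted(frequency):
    count = frequency[height]
    print(f"{height:<12} {'|' * count:<6} {count}")
```

Each row pairs a distinct height with its tally and count, and the counts add up to the ten observations.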

Frequency Distribution of Data


This frequency table will help us make better sense of the data given. When the data set is bigger (say, if
we were dealing with 100 students), we use tally marks for counting: one stroke is drawn for each
observation, and every fifth stroke is drawn across the previous four. This makes the task more organized
and easier.

Frequency Distribution of Data

Types of Frequency Distribution


• Grouped frequency distribution
• Ungrouped frequency distribution
• Cumulative frequency distribution
• Relative frequency distribution
• Relative cumulative frequency distribution

Frequency Distribution of Data


Grouped Data
At certain times, to ensure that we are making correct and relevant observations from the data set, we may
need to group the data into class intervals. This ensures that the frequency distribution best represents the
data. Let us make a grouped frequency table of the same example about the heights of students.

Class Interval   Frequency
130-140          4
140-150          3
150-160          3

From the table, you can see that the value of 150 is put in the class interval of 150-160 and not 140-150.
This is the convention that must be followed: each class includes its lower limit and excludes its upper limit.
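The grouping convention can be sketched in a few lines. The comparison `lo <= h < hi` is what places 150 in the 150-160 class; the dictionary layout is an assumption made for illustration.

```python
heights = [139, 145, 150, 145, 136, 150, 152, 144, 138, 138]

# Class intervals as (lower, upper); lower is included, upper is excluded.
classes = [(130, 140), (140, 150), (150, 160)]

grouped = {f"{lo}-{hi}": 0 for lo, hi in classes}
for h in heights:
    for lo, hi in classes:
        if lo <= h < hi:
            grouped[f"{lo}-{hi}"] += 1
            break

print(grouped)
```

The resulting counts are 4, 3 and 3, matching the grouped frequency table for the ten heights.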

Frequency Distribution of Data

Un-Grouped Data: Given below are the marks obtained by 20 students in Maths out of 25.

21, 23, 19, 17, 12, 15, 15, 17, 17, 19,
23, 23, 21, 23, 25, 25, 21, 19, 19, 19

https://www.math-only-math.com/frequency-distribution-of-ungrouped-and-grouped-data.html

Frequency Distribution of Data


Cumulative Frequency Distribution: A cumulative frequency distribution gives, for each class, the sum of the
frequency of that class and of all classes below it in a frequency distribution. All that means is adding each
value to all the values that came before it. Here’s a simple example: You get paid $250 for a week of work.
The second week you get paid $300 and the third week, $350. Your cumulative amount for week 2 is $550
($300 for week 2 and $250 for week 1). Your cumulative amount for week 3 is $900 ($350 for week 3, $300
for week 2 and $250 for week 1). Cumulative frequency distributions can be summarized in a table.
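The weekly pay example amounts to a running total, which can be sketched with `itertools.accumulate` from the standard library. The variable names are assumptions.

```python
from itertools import accumulate

weekly_pay = [250, 300, 350]          # weeks 1, 2, 3

# Running total: each entry adds the current week to everything before it.
cumulative_pay = list(accumulate(weekly_pay))

for week, (pay, total) in enumerate(zip(weekly_pay, cumulative_pay), start=1):
    print(f"week {week}: pay {pay}, cumulative {total}")
```

The same running-total step turns any column of class frequencies into cumulative frequencies.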

Frequency Distribution of Data


Cumulative Frequency Distribution: There are a couple of reasons for the use of a cumulative frequency
distribution.
1. You want to check that your math is correct. By adding up all the numbers and comparing the total to
your sample size, you know whether you’ve included all your data. For example, if your sample size was
44 and the cumulative frequency distribution ends at 43, you’d know that you’re missing one piece of data.
2. You’re interested in studying a population to find out a “more” or “less” question. For example, you’re
thinking of opening a bargain grocery store and you want to know how many people in a particular
geographic area spend up to $6000 per person per year on groceries.
https://www.statisticshowto.datasciencecentral.com/cumulative-frequency-distribution/

Frequency Distribution of Data


Cumulative Frequency Distribution:

The right column of the cumulative frequency table tells you that 614 people spend up to $6000 per year.
The count includes everyone who spends any amount up to $6000.

Frequency Distribution of Data

Relative Frequency Distributions: A relative frequency is the fraction or proportion of times a value occurs
in a data set. To find the relative frequencies, divide each frequency by the total number of data points in the
sample. Relative frequencies can be written as fractions, percentages, or decimals.

https://courses.lumenlearning.com/boundless-statistics/chapter/frequency-distributions-for-quantitative-data/

Frequency Distribution of Data


Relative Frequency Distributions: How to Construct a Relative Frequency Distribution
Constructing a relative frequency distribution is not much different from constructing a regular frequency
distribution. The beginning process is the same, and the same guidelines must be used when creating classes
for the data. Recall the following:
• Each data value should fit into one class only (classes are mutually exclusive)
• The classes should be of equal size
• Classes should not be open-ended
• Try to use between 5 and 20 classes

https://courses.lumenlearning.com/boundless-statistics/chapter/frequency-distributions-for-quantitative-data/

Frequency Distribution of Data


Relative Frequency Distributions
Create the frequency distribution table. However, this time, you will need to add a third column. The first
column should be labeled Class or Category. The second column should be labeled Frequency. The third
column should be labeled Relative Frequency. Fill in your class limits in column one. Then, count the number
of data points that fall in each class and write that number in column two.

Next, start to fill in the third column. The entries will be calculated by dividing the frequency of that class by
the total number of data points. For example, suppose we have a frequency of 5 in one class, and there are a
total of 50 data points.

https://courses.lumenlearning.com/boundless-statistics/chapter/frequency-distributions-for-quantitative-data/

Frequency Distribution of Data


Relative Frequency Distributions
The relative frequency for that class would be calculated as follows:

$\frac{5}{50} = 0.10$

You can choose to write the relative frequency as a decimal (0.10), as a fraction (1/10), or as a percent (10%).
Since we are dealing with proportions, the relative frequency column should add up to 1 (or 100%). It may be
slightly off due to rounding. Relative frequency distributions are often displayed in histograms and in
frequency polygons. The only difference between a relative frequency distribution graph and a frequency
distribution graph is that the vertical axis uses proportional or relative frequency rather than simple frequency.
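The three-column table described above can be sketched as follows, reusing the 20 Maths marks from the ungrouped-data example. Treating each distinct mark as its own class is an assumption made to keep the example short.

```python
from collections import Counter

# Marks (out of 25) of the 20 students from the ungrouped-data example.
marks = [21, 23, 19, 17, 12, 15, 15, 17, 17, 19,
         23, 23, 21, 23, 25, 25, 21, 19, 19, 19]

frequency = Counter(marks)
total = len(marks)

print("Class  Frequency  Relative Frequency")
relative = {}
for mark in sorted(frequency):
    relative[mark] = frequency[mark] / total
    print(f"{mark:<6} {frequency[mark]:<10} {relative[mark]:.2f}")

print("sum of relative frequencies:", round(sum(relative.values()), 10))
```

Each relative frequency is the class frequency divided by 20, and the column sums to 1 as expected for proportions.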

Frequency Distribution of Data

Relative Frequency Histogram


This graph shows a relative frequency histogram. Notice
the vertical axis is labeled with percentages rather than
simple frequencies.

https://courses.lumenlearning.com/boundless-statistics/chapter/frequency-distributions-for-quantitative-data/

Frequency Distribution of Data

Relative Cumulative Frequency: The relative cumulative frequency is the quotient of the cumulative
frequency of a particular value and the total number of data points. It can be expressed as a percentage.

Example
A city has recorded the following daily maximum temperatures during the month:
32, 31, 28, 29, 33, 32, 31, 30, 31, 31, 27, 28, 29, 30, 32, 31, 31, 30, 30, 29, 29, 30, 30, 31, 30, 31, 34, 33, 33,
29, 29.

Frequency Distribution of Data

Relative Cumulative Frequency

In the first column of the table are the values of the variable, ordered from lowest to highest; in the second
column is the absolute frequency; in the third is the cumulative frequency; and in the fourth is the relative
cumulative frequency.

xi      fi    Fi    Ni
27      1     1     0.032
28      2     3     0.097
29      6     9     0.290
30      7     16    0.516
31      8     24    0.774
32      3     27    0.871
33      3     30    0.968
34      1     31    1
Total   31
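The table for the temperature data can be recomputed with the short sketch below. The column names xi, fi, Fi and Ni follow the table; the variable names and the print format are assumptions.

```python
from collections import Counter
from itertools import accumulate

temperatures = [32, 31, 28, 29, 33, 32, 31, 30, 31, 31, 27, 28, 29, 30, 32, 31,
                31, 30, 30, 29, 29, 30, 30, 31, 30, 31, 34, 33, 33, 29, 29]

frequency = Counter(temperatures)
values = sorted(frequency)                          # xi
absolute = [frequency[v] for v in values]           # fi
cumulative = list(accumulate(absolute))             # Fi
n = len(temperatures)
relative_cumulative = [c / n for c in cumulative]   # Ni

for row in zip(values, absolute, cumulative, relative_cumulative):
    print("{:<4} {:<4} {:<4} {:.3f}".format(*row))
```

For 30 degrees the cumulative frequency is 16, so the relative cumulative frequency is 16/31, about 0.516.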

Summary
This sub module covered the concepts of sample data and population, the difference between a statistic
and a parameter, the distinction between small and large samples, the types of statistics and their
applications in different business scenarios, and the frequency distribution of data.

Self Assessment Question

1. Which of the following statements is true?


i. The mean of a population is denoted by x.
ii. Sample size is never bigger than population size.
iii. The population mean is a statistic.
a) Only i
b) Only ii
c) Only iii
d) All of the above
e) None of the above
Answer: e

Self Assessment Question


2. The classification method in which the upper limit of one class interval is the same as the lower limit of
the next class interval is called _______________.

a) Exclusive method
b) Inclusive method
c) Mid-point method
d) Ratio method

Answer: a

Self Assessment Question


3. Summary and presentation of data in tabular form with several non-overlapping classes is
referred as _________________.

a) Nominal distribution
b) Ordinal distribution
c) Chronological distribution
d) Frequency distribution

Answer: d

Self Assessment Question


4. If the largest value is 60, the smallest value is 40 and the number of classes desired is 5, then the class
interval is:

a) 20
b) 4
c) 25
d) 15

Answer: b

Document Links
Topic: Sample and Population Data
URL: https://stattrek.com/sampling/populations-and-samples.aspx
Notes: The link explains Sample and Population data

Topic: Types of Sampling Methods
URL: https://www.questionpro.com/blog/types-of-sampling-for-social-research/
Notes: The link explains types of sampling methods

Topic: Large and Small Samples
URL: http://ecoursesonline.iasri.res.in/mod/page/view.php?id=15455
Notes: The link explains Large and Small Samples

Topic: Parameter and Statistics
URL: https://www.statisticshowto.datasciencecentral.com/what-is-a-parameter-statisticshowto/
Notes: The link explains Parameter and Statistics

Topic: Types of Statistics
URL: https://www.statisticshowto.datasciencecentral.com/statistic/
Notes: The link explains Statistics

Topic: Application of Statistics
URL: https://www.toppr.com/guides/business-economics-cs/descriptive-statistics/application-of-statistics/
Notes: The link explains the application of statistics in business

Topic: Frequency Distribution of Data
URL: https://www.toppr.com/guides/maths/data-handling/data-and-its-frequency-distribution/
Notes: The link explains Frequency Distribution of Data

Video Links

Topic: Data Classification
URL: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-videos/lecture-14-classification-and-statistical-sins/
URL: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-videos/lecture-13-classification/
Notes: The links explain Data Classification

Topic: Statistics
URL: https://www.youtube.com/watch?v=kBYt67NDToI
Notes: The link explains Sample and Population
URL: https://www.youtube.com/watch?v=VPZD_aij8H0
Notes: The link explains the Basics of Statistics



E- Book Links
EBook name: Introduction to Data Analysis Handbook
Chapter: 1 and 2
Page No.: 1 to 9
URL: https://files.eric.ed.gov/fulltext/ED536788.pdf

EBook name: Exploratory Data Analysis
Chapter: Whole Book
Page No.: Whole Book
URL: https://www.itl.nist.gov/div898/handbook/eda/eda.htm

EBook name: Statistical Analysis Handbook
Chapter: Whole Book
Page No.: Whole Book
URL: https://www.statsref.com/StatsRefSample.pdf
