EDA 1
► Example:
40, 95, …, Jain University, 18BTRCR75, [email protected]
Anything else?
How large is your data?
► What is the maximum file size you have
dealt with so far?
► Movies/files/streaming video that you have
used?
based on observation
based on availability
NOIR
Classification of data based on scales of
Measurement
NOIR classification
⚫ The most commonly recommended scales of measurement are
N: Nominal
O: Ordinal
I: Interval
R: Ratio
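NOIR Classification
[Figure: NOIR classification tree. Surviving labels: Alphabetical, Binary, Ternary, Others, Ordered, Discrete, Numerically, Symmetric, Continuous, Literally, Asymmetric.]
Properties that distinguish the scales:
1. Distinctiveness: = and ≠
2. Order: <, ≤, >, ≥
3. Addition: + and -
4. Multiplication: * and /
Properties 1 and 2 apply to categorical (qualitative) data; properties 3 and 4 additionally apply to numerical (quantitative) data.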
NOIR summary
✔ Nominal (with distinctiveness property only)
Nominal scale
⚫ Definition
A variable that takes a value among a set of mutually exclusive codes that have no logical order is
known as a nominal variable.
⚫ Examples
Gender, coded with letters or numbers:
{ M, F } or { 1, 0 }
Country code ??
????
Nominal scale
Note
⚫ Nominal data may be numerical in form, but the numerical values have
no mathematical interpretation.
⚫ For example, prisoners may be numbered 100, 101, …, 110, but 100 + 110 = 210 is
meaningless. The numbers are simply labels.
⚫ Two labels may be identical ( = ) or dissimilar ( ≠ ).
⚫ These labels do not have any ordering among themselves.
⚫ For example, we cannot say blood group B is better or worse than group
A.
⚫ Labels (from two different attributes) can be combined to give another
nominal variable.
⚫ For example, blood group with Rh factor ( A+ , A- , AB+, etc.)
Binary scale
⚫ Definition
A nominal variable with exactly two mutually exclusive categories that
have no logical order is known as a binary (or “dichotomous”) variable.
⚫ Examples
Switch: {ON, OFF}
Attendance: {True, False}
Entry: {Yes, No}
etc.
Note
⚫ A Binary variable is a special case of a nominal variable that takes
only two possible values.
Symmetric and Asymmetric Binary Scale
⚫ The two values of a binary variable may have unequal importance.
⚫ If the two choices of a binary variable have equal importance, then
it is called a symmetric binary variable.
⚫ Example: Gender = {male, female}
// usually of equal probability.
⚫ If the two choices do not have equal importance, it is called an
asymmetric binary variable.
Operations on Nominal variables
⚫ Summary statistics applicable to nominal data are mode,
contingency correlation, etc.
⚫ Arithmetic (+, -, *, /) and ordering comparisons (<, >, ≤, ≥) are not
permitted; only tests of equality (=, ≠) are meaningful.
⚫ The allowed operations are: accessing (read, check, etc.) and
re-coding (into another non-overlapping symbol set, that is,
one-to-one mapping), etc.
⚫ Nominal data can be visualized using bar charts or pie charts, etc.
⚫ Two or more nominal variables can be combined to generate another
nominal variable.
⚫ Example: Gender (M, F) × Marital status (S, M, D, W)
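The operations permitted on nominal data can be sketched in Python. This is a minimal illustration rather than part of the original slides; the sample values and the names `gender`, `marital`, `recode` and `combined` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical nominal observations (labels only: no order, no arithmetic).
gender = ["M", "F", "F", "M", "F"]
marital = ["S", "M", "M", "D", "M"]

# Mode: the most frequent label is the only "average" that makes sense.
mode_gender = Counter(gender).most_common(1)[0][0]

# Re-coding: a one-to-one mapping into another symbol set.
recode = {"M": 1, "F": 0}
gender_coded = [recode[g] for g in gender]

# Combining two nominal variables gives another nominal variable.
combined = [f"{g}-{m}" for g, m in zip(gender, marital)]

# Equality is the only meaningful comparison between labels.
same_first_two = gender[0] == gender[1]

print(mode_gender, gender_coded, combined, same_first_two)
```

Note that the re-coded values 1 and 0 are still labels: adding or averaging them would have no interpretation.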
Nominal Examples
⚫ Gender: Male, Female, Other.
⚫ Hair Color: Brown, Black, Blonde, Red, Other.
⚫ Type of living accommodation: House, Apartment,
Trailer, Other.
⚫ Religious preference: Buddhist, Mormon, Muslim, Jewish,
Christian, Other.
Ordinal scale
⚫ Definition
Ordered nominal data are known as ordinal data and the variable that
generates it is called ordinal variable.
⚫ Example:
Shirt size = { S, M, L, XL, XXL}
Note
The values assumed by an ordinal variable can be ordered among
themselves as each pair of values can be compared literally or
using relational operators ( < , ≤ , > , ≥ ).
Operation on Ordinal data
⚫ Usually relational operators can be used on ordinal data.
⚫ Summary measures mode and median can be used on ordinal data.
⚫ Ordinal data can be ranked (numerically, alphabetically, etc.) Hence, we
can find any of the percentiles measures of ordinal data.
⚫ Calculations based on order are permitted (such as count, min, max, etc.).
⚫ Spearman’s rank correlation coefficient can be used as a measure of the
strength of association between two sets of ordinal data.
⚫ A numerical variable can be transformed into an ordinal variable and vice versa,
but with a loss of information.
⚫ For example, Age [1, …, 100] → {young, middle-aged, old}
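The order-based operations listed above can be sketched as follows. This is an illustrative example only: the shirt observations and the age cut-offs (30 and 60) are assumptions, not values given in the slides.

```python
# Ordered categories; the position in the list defines the order.
SIZES = ["S", "M", "L", "XL", "XXL"]
rank = {label: i for i, label in enumerate(SIZES)}

shirts = ["M", "XL", "S", "L", "M", "XXL", "M"]  # hypothetical observations

# Sorting by rank gives min, max and the median (middle) category.
ordered = sorted(shirts, key=rank.__getitem__)
smallest, largest = ordered[0], ordered[-1]
median_size = ordered[len(ordered) // 2]  # odd number of observations

def age_group(age):
    """Map a numerical age to an ordinal label (cut-offs are assumptions)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "old"

groups = [age_group(a) for a in (12, 35, 59, 60, 88)]
print(smallest, largest, median_size, groups)
```

Once an age has been mapped to a label, the exact age cannot be recovered, which is the loss of information the slide mentions.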
Ordinal Examples
⚫ Class ranking: 1st, 9th, 87th…
⚫ Socioeconomic status: poor, middle class, rich.
⚫ The Likert Scale: strongly disagree, disagree, neutral,
agree, strongly agree.
⚫ Level of Agreement: yes, maybe, no.
⚫ Time of Day: dawn, morning, noon, afternoon, evening,
night.
⚫ Political Orientation: left, center, right.
Interval scale
⚫ Definition
Interval-scale variables are continuous measurements on a roughly linear scale.
⚫ Example:
weight, height, latitude, longitude, weather, temperature, calendar dates, etc.
Note
⚫ Interval data have well-defined, equal intervals between values.
⚫ Interval data are measured on a numeric scale (with +ve, 0 (zero), and –ve
values).
⚫ Interval data have a zero point, but this origin is arbitrary and does not
imply a true absence of the measured characteristic.
⚫ For example, for temperature in Celsius or Fahrenheit, 0° does not mean
absence of temperature, that is, no heat!
Operation on Interval data
⚫ We can add to or subtract from interval data.
⚫ For example: date1 + x-days = date2
⚫ Subtraction can also be performed.
⚫ For example: current date – date of birth = age
⚫ Negation (changing the sign) and multiplication by a constant
are permitted.
⚫ All operations on ordinal data defined are also valid here.
⚫ Linear or affine transformations (e.g., cx + d) are permissible.
⚫ Other one-to-one non-linear transformations (e.g., log, exp, etc.)
can also be applied.
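A short Python sketch of these interval-scale operations follows. The dates and temperatures are hypothetical values chosen for illustration; the Celsius-to-Fahrenheit conversion is used as a familiar instance of an affine transformation cx + d.

```python
from datetime import date, timedelta

# Adding an offset to an interval value gives another value on the same scale.
date1 = date(2024, 1, 1)            # hypothetical date
date2 = date1 + timedelta(days=45)

# Subtracting two interval values gives a meaningful difference.
elapsed_days = (date2 - date1).days

def celsius_to_fahrenheit(c):
    """Affine transformation of the form cx + d."""
    return 9 * c / 5 + 32

c_low, c_high = 10.0, 20.0
f_low, f_high = celsius_to_fahrenheit(c_low), celsius_to_fahrenheit(c_high)

# Ratios are not preserved, because the zero point is arbitrary.
ratio_c = c_high / c_low    # 2.0
ratio_f = f_high / f_low    # 68 / 50 = 1.36
print(date2, elapsed_days, ratio_c, ratio_f)
```

The same two temperatures give a ratio of 2.0 in Celsius and 1.36 in Fahrenheit, which is why "twice as hot" is not a meaningful statement on an interval scale.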
Note
⚫ Interval data can be transformed to nominal or ordinal
scale, but with loss of information.
Interval Examples
⚫ Celsius Temperature.
⚫ Fahrenheit Temperature.
⚫ IQ (intelligence scale).
⚫ SAT scores.
⚫ Time on a clock with hands.
Ratio scale
⚫ Definition
Interval data with a clear definition of “zero” are called ratio data.
⚫ Example:
Temperature on the Kelvin scale, intensity of an earthquake on the Richter scale, sound
intensity in decibels, cost of an article, population of a country, etc.
Note
⚫ All ratio data are interval data but the reverse is not true.
⚫ On a ratio scale, both differences between data values and ratios of
(non-zero) data pairs are meaningful.
⚫ Ratio data may be in linear or non-linear scale.
⚫ Both interval and ratio data can be stored in same data type (i.e.,
integer, float, double, etc.)
Operation on Ratio data
⚫ All arithmetic operations on interval data are applicable to
ratio data; in addition, multiplication and division (ratios) are meaningful.
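As a hedged illustration of why a true zero matters, the sketch below moves two hypothetical Celsius readings onto the Kelvin scale and takes their ratio. The prices are also made-up values.

```python
def celsius_to_kelvin(c):
    """Shift Celsius readings onto the Kelvin scale, which has a true zero."""
    return c + 273.15

k_low = celsius_to_kelvin(10.0)     # 283.15 K
k_high = celsius_to_kelvin(20.0)    # 293.15 K

# On a ratio scale both differences and ratios are meaningful.
difference = k_high - k_low
ratio = k_high / k_low              # about 1.035, not 2.0

# Ratios are equally meaningful for amounts such as costs.
price_a, price_b = 250.0, 500.0     # hypothetical costs of two articles
times_as_expensive = price_b / price_a
print(round(difference, 2), round(ratio, 3), times_as_expensive)
```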
Ratio Examples
⚫ Weight.
⚫ Height.
⚫ Sales Figures.
⚫ Ruler measurements.
⚫ Income earned in a week.
⚫ Years of education.
⚫ Number of children.
⚫ Age.*
Note
⚫ Nominal - names only
⚫ Ordinal - has an order
⚫ Interval - also has meaningful distances
⚫ Ratio - also has a meaningful 0.
Questions
1. Consider an image as an entity.
• What are the attributes you should think to represent an
image?
• Categorize each attribute according to the NOIR data
classification.
2. Give FOUR differences between data of types “interval” and
“ratio-scale” .
3. What are the different properties used to categorize the data
according to NOIR data categorization?
4. Given an entity say “STUDENT” with the following attributes.
Identify the NOIR category to which each of them belongs.
⚫ Scholarship amount
⚫ Name
⚫ RollNo
⚫ DoB
⚫ Aadhaar No.
⚫ Gender
⚫ Mobile No.
⚫ Email Id
⚫ Atherosclerosis, thickening and hardening of internal artery walls,
is one of the main causes of death for men above 35 and women
above 45. One of its consequences is myocardial infarction. An
artery wall is made of three layers; innermost to outermost, they are
called intima, media and adventitia. Intima‐media thickness is a
marker of atherosclerosis. This question is based on ultrasonography
measurements made on a sample of 110 subjects. Figure 1 gives
the data description.
⚫ Examine the variable types for each variable in this dataset
(numerical, continuous, categorical, ordinal etc.).
gender & sport – categorical, nominal (binary);
Tobacco & alcohol - categorical, ordinal;
Age, height, weight, & measure – numerical, continuous;
packyear - numerical, discrete.
Mention the types of statistical tests performed on nominal,
ordinal, interval and ratio data types.
[Table: columns Nominal, Ordinal, Interval, Ratio; rows Mode, Median, Mean, Frequency Distribution, Range, Add and Subtract, Standard Deviation (cell entries not shown).]
NOIR Categorization:
⚫ Resolution: interval
⚫ Color depth: ratio
⚫ Aspect ratio: interval
⚫ File format: nominal
⚫ Compression: nominal
⚫ Metadata: nominal
⚫ Size: ratio
⚫ Orientation: nominal
⚫ Bit depth: ratio
⚫ Shape: nominal.
Suppose two images are given. Suggest a way to check whether the two images are identical.
To check whether two images are identical, you can follow the steps below:
⚫ Check the resolution of both images. If the resolution is different, then the
images are not identical.
⚫ Check the color depth of both images. If the color depth is different, then the
images are not identical.
⚫ Check the aspect ratio of both images. If the aspect ratio is different, then the
images are not identical.
⚫ Check the file format of both images. If the file format is different, then the
images are not identical.
⚫ If the above attributes are the same for both images, then compare the pixel
values of each image. If any pixel values are different between the two
images, then they are not identical.
⚫ To quantify how much the pixel values differ, you can use image comparison
measures such as Mean Squared Error (MSE), Structural Similarity Index (SSIM), or Peak
Signal-to-Noise Ratio (PSNR). An MSE of 0 means every pixel matches, while
higher SSIM or PSNR values indicate greater similarity. If the score crosses a
chosen threshold, the images can be considered identical for practical
purposes.
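The steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions: images are represented as lists of rows of grayscale pixel values, and the helper names `shape`, `mse` and `identical` are hypothetical. A real program would read the files with an image library and would also compare attributes such as color depth and file format.

```python
def shape(img):
    """Return (height, width) of an image stored as a list of pixel rows."""
    return len(img), (len(img[0]) if img else 0)

def mse(img_a, img_b):
    """Mean squared error between two equally sized grayscale images."""
    total = sum((a - b) ** 2
                for row_a, row_b in zip(img_a, img_b)
                for a, b in zip(row_a, row_b))
    height, width = shape(img_a)
    return total / (height * width)

def identical(img_a, img_b):
    """Cheap attribute check first, then an exact pixel comparison."""
    if shape(img_a) != shape(img_b):
        return False
    return img_a == img_b

# Hypothetical 2x2 grayscale images with pixel values 0..255.
a = [[0, 10], [20, 30]]
b = [[0, 10], [20, 30]]
c = [[0, 10], [20, 34]]
print(identical(a, b), identical(a, c), mse(a, c))
```

Checking the dimensions first is a design choice: it is much cheaper than a pixel-by-pixel pass and rules out many non-identical pairs immediately.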
Classification of data based on availability – primary, secondary,
tertiary
⚫ Primary data: The data which is available very close to the
origin of a particular topic or an event.
⚫ An eyewitness account of a traffic accident is an example of a
primary source.
⚫ Other examples include:
⚫ archeological artifacts
⚫ photographs
⚫ videos
⚫ historical documents such as diaries
⚫ census results
⚫ Maps
⚫ transcripts of surveillance, public hearings, trials, or interviews
⚫ un-tabulated results of surveys or questionnaires
Examples of primary data (contd)
⚫ the original written or recorded notes of laboratory and field
research,
⚫ experiments or observations which have not been published in a
peer reviewed source;
⚫ original philosophical works,
⚫ religious scripture,
⚫ administrative documents,
⚫ patents,
⚫ artistic and fictional works such as poems, scripts, screenplays,
novels, motion pictures, videos, and television programs.
Examples of primary data (contd)
⚫ Surveys conducted by a company to collect customer feedback
⚫ Interviews with experts in a particular field to gather insights
⚫ Focus groups conducted to understand consumer preferences and
opinions
⚫ Observations made by researchers during experiments or studies
⚫ Sales data collected by a company on its own products or services
⚫ Experiment results gathered by scientists in a laboratory
⚫ User testing sessions conducted on a new product or service
⚫ Data collected by a company through user feedback forms
⚫ Customer service call recordings
⚫ Social media analytics data collected by a company on its own brand
or products.
Secondary Data
⚫ Secondary data is the data that has been collected in the past by
someone else but made available for others to use.
⚫ Secondary data are accounts at least one step removed from an event or
body of primary-source material and may include an interpretation,
analysis, or synthetic claims about the subject.
⚫ Secondary sources may draw on primary sources and other secondary
sources to create a general overview, or to make analytic or synthetic
claims.
⚫ Sources of secondary data:
Books
Published Sources
Journals
Newspapers
Websites
Blogs
Diaries
Examples of Secondary data(contd)
⚫ Reports published by government agencies on economic indicators
like GDP, inflation, and unemployment
⚫ Market research reports published by consulting firms
⚫ Academic papers and research studies published by universities and
research institutions
⚫ Industry statistics and trends published by trade associations
⚫ News articles and press releases
⚫ Company financial statements and annual reports
⚫ Whitepapers and case studies published by companies and research
firms
⚫ Social media analytics data collected by third-party providers
⚫ Customer reviews and ratings of products and services
⚫ Patent filings and other intellectual property data.
Tertiary Data
⚫ Tertiary data is based on primary and secondary data.
⚫ Tertiary sources are publications such as encyclopedias or
other compendia (A compendium is a concise collection of
information pertaining to a body of knowledge) that sum up
secondary and primary sources.
⚫ For example, Wikipedia itself is a tertiary source.
⚫ Many introductory textbooks may also be considered tertiary to
the extent that they sum up multiple primary and secondary
sources.
⚫ Other tertiary sources include manuals, guidebooks, almanacs, handbooks,
and indexing and abstracting sources.
Examples of tertiary data(contd)
⚫ Business directories and databases containing information on companies and
industries
⚫ Online search engine results pages (SERPs) containing information about a
particular topic
⚫ Reference books and encyclopedias
⚫ Online databases containing information on academic journals and
publications
⚫ Online forums and discussion boards
⚫ Publicly available data on websites such as government websites and social
media sites
⚫ Online user-generated content such as blogs and wikis
⚫ Online news articles and archives
⚫ Online marketplaces such as Amazon or eBay
⚫ Online learning platforms such as Coursera and Udemy.
Examples of primary, secondary,
tertiary data
Based on structural form
Structured
Unstructured
Semi structured
Structured Data
⚫ The data that has a structure and is well organized either in the
form of tables or in some other way and can be easily operated is
known as structured data.
⚫ Searching and accessing information from such type of data is
very easy.
⚫ Structured data is relatively simple to enter, store, query, and
analyze, but it must be strictly defined in terms of field name and
type.
⚫ Example:
✔ Data stored in the relational database in the form of tables
having multiple rows and columns.
✔ A spreadsheet is another good example of structured data.
Examples of structured data(contd)
⚫ Relational database tables
⚫ Excel spreadsheets
⚫ JSON data with a consistent schema
⚫ CSV files with consistent column headers and data types
⚫ Sensor data from Internet of Things (IoT) devices with
well-defined data structures
⚫ Financial transaction records
⚫ Stock market data with consistent formats
⚫ Medical records with a standard format
⚫ Government census data in a tabular format
⚫ Web server logs with consistent fields and formats
Unstructured data
⚫ Unstructured data refers to the data that lacks any specific form or structure.
⚫ This makes it very difficult and time-consuming to process and analyze unstructured
data.
Examples:
⚫ Emails
⚫ Word Processing Files
⚫ PDF files
⚫ Digital Images
⚫ Video
⚫ Audio
⚫ Social Media Posts
Examples of unstructured data (contd)
⚫ Social media posts (e.g. tweets, Facebook posts)
⚫ Audio and video recordings
⚫ Images and videos
⚫ Emails and instant messages
⚫ Text documents without consistent formatting or structure
⚫ Voice recordings of customer service interactions
⚫ Web pages with unstructured HTML
⚫ Handwritten notes or letters
⚫ News articles or blogs
⚫ Surveillance camera footage
Semi-structured data
⚫ Semi-structured data is information that doesn’t reside in a
relational database but that does have some organizational
properties that make it easier to analyze.
⚫ Because it is less organized, semi-structured data is more difficult
to retrieve, analyze and store than structured data.
⚫ Processing it at scale often relies on a software framework such as
Apache Hadoop.
⚫ Examples:
⚫ XML
⚫ JSON
Examples of semi-structured data(contd)
⚫ XML data with a flexible schema
⚫ HTML files with structured data in tags
⚫ JSON data with some variation in schema
⚫ Emails with a consistent format (e.g. sender, recipient, subject, body)
but variable content
⚫ Sensor data with variable fields depending on the device
⚫ Log files with structured fields but variable contents
⚫ Configuration files with some structure but also free-form text
⚫ Social media posts with hashtags or mentions
⚫ Invoices with a consistent structure but variable content
⚫ E-commerce product listings with structured fields but variable
descriptions.
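To make the difference between semi-structured and structured data concrete, the sketch below parses JSON records whose fields vary and flattens them into rows with a fixed set of columns. The sensor records and field names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical sensor readings: every record has "device" and "ts",
# but the remaining fields vary from device to device.
raw = """
[
  {"device": "t-01", "ts": 1, "temperature": 21.5},
  {"device": "h-07", "ts": 1, "humidity": 40, "battery": 0.9},
  {"device": "t-01", "ts": 2, "temperature": 21.7}
]
"""
records = json.loads(raw)

# Collect the union of all field names to build a fixed set of columns.
columns = sorted({key for rec in records for key in rec})

# Flatten into rows with the same columns; missing fields become None.
rows = [[rec.get(col) for col in columns] for rec in records]

print(columns)
for row in rows:
    print(row)
```

After flattening, every row has the same fields in the same order, which is the strict definition of field names and types that structured data requires.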
Based on inherent nature
quantitative
qualitative
Quantitative data
⚫ Quantitative data are anything that can be expressed as a
number, or quantified.
⚫ It is data that can either be counted or compared on a
numeric scale.
⚫ Examples of quantitative data are scores on achievement
tests, number of hours of study, or weight of a subject.
⚫ These data may be represented by ordinal, interval or ratio
scales and lend themselves to most statistical manipulation.
Qualitative data
⚫ Qualitative data cannot be expressed as a number.
⚫ It describes qualities or characteristics.
⚫ It is collected using questionnaires, interviews, or
observation, and frequently appears in narrative form.
⚫ Data that represent nominal scales such as gender, socioeconomic
status, religious preference are usually considered
to be qualitative data.
Data unit | Numeric variable | = Quantitative data | Categorical variable | = Qualitative data
A person | "How many hours do you work?" | 40 hours per week | "Do you work full-time or part-time?" | Full-time
A person | "How much do you earn?" | 10,00,000 p.a. | "What is your occupation?" | Data Analyst
A person | "How many children do you have?" | 2 children | "In which country were your children born?" | India
A house | "How many square metres is the house?" | 200 square metres | "In which city or town is the house located?" | Bangalore
A business | "How many workers are currently employed?" | 200 employees | "What is the industry of the business?" | Education
A farm | "How many milk cows are located on the farm?" | 36 cows | "What is the main activity of the farm?" | Dairy
Based on observation
⚫ Time-series Data
⚫ Cross-sectional Data
⚫ Panel Data
Time-series Data
Cross-sectional Data
⚫ Cross-sectional data refers to a set of observations taken at
a single point in time.
⚫ Samples are constructed by collecting the data of interest
across a range of observational units – people, objects,
firms – at the same time.
⚫ A good example of cross-sectional data is the stock
returns earned by shareholders of Microsoft, IBM, and
Samsung for the year ended 31st December 2020.
Panel Data
⚫ Panel data, sometimes referred to as longitudinal data, is data that
contains observations about different cross sections across time.
OR
⚫ Panel data is a collection of quantities obtained across multiple
individuals, that are assembled over even intervals in time and ordered
chronologically.
⚫ Examples of groups that may make up panel data series include
countries, firms, individuals, or demographic groups.
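The relationship between panel, cross-sectional and time-series data can be sketched with a small dictionary keyed by (unit, time). The firm names and return figures below are invented for illustration, and the helper names `cross_section` and `time_series` are hypothetical.

```python
# Hypothetical panel: (firm, year) -> annual return in percent.
panel = {
    ("FirmA", 2019): 4.1, ("FirmA", 2020): 5.3, ("FirmA", 2021): 3.8,
    ("FirmB", 2019): 2.7, ("FirmB", 2020): 1.9, ("FirmB", 2021): 2.2,
}

def cross_section(data, year):
    """All units observed at one point in time."""
    return {unit: v for (unit, y), v in data.items() if y == year}

def time_series(data, unit):
    """One unit observed over time, in chronological order."""
    return [(y, v) for (u, y), v in sorted(data.items()) if u == unit]

print(cross_section(panel, 2020))
print(time_series(panel, "FirmA"))
```

Fixing the year yields a cross-section and fixing the unit yields a time series, so panel data contains both views.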
Sample Data and Population
The two important types of data sets are populations and samples. They are
defined as follows:
• A sample consists of one or more observations drawn from the population.
• A population includes all the elements from a set of data.
Depending on the sampling method, a sample can have fewer observations than
the population. More than one sample can be derived from the same population.
Differences between Sample and Population based on nomenclature, notation, and computations can
also be identified. For example:
• A measurable characteristic of a population, such as a mean or standard deviation, is called a
parameter; but a measurable characteristic of a sample is called a statistic.
• The mean of a population is denoted by the symbol μ; but the mean of a sample is denoted by the
symbol x̄ (x-bar).
• The formula for the standard deviation of a population is different from the formula for the standard
deviation of a sample.
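The difference between the two standard deviation formulas is the divisor: N for a population and n - 1 for a sample. The sketch below computes both by hand on hypothetical data and checks them against the standard library functions `statistics.pstdev` and `statistics.stdev`.

```python
import math
import statistics

data = [4, 8, 6, 5, 7]    # hypothetical observations
n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

# Population standard deviation: divide by N.
sigma = math.sqrt(squared_deviations / n)

# Sample standard deviation: divide by n - 1.
s = math.sqrt(squared_deviations / (n - 1))

# The standard library provides both under different names.
assert math.isclose(sigma, statistics.pstdev(data))
assert math.isclose(s, statistics.stdev(data))
print(mean, round(sigma, 4), round(s, 4))
```

The sample value is always the larger of the two for the same data, because it divides the same sum by a smaller number.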
Small Sample and Large Sample
Large sample theory: If the sample size n is greater than or equal to 30 (n ≥ 30), it is known as a large sample. For
large samples, the sampling distributions of statistics are approximately normal (Z test). The study of the sampling distribution of
a statistic for large samples is known as large sample theory.
Small sample theory: If the sample size n is less than 30 (n < 30), it is known as a small sample. For small
samples, the sampling distributions used are the t, F and χ² distributions. The study of sampling distributions for small
samples is known as small sample theory.
Parameter and Statistics
• A measurable characteristic of a population, such as a mean or standard deviation, is called a parameter;
but a measurable characteristic of a sample is called a statistic.
• A parameter does not vary, because everyone (or everything) in the population was surveyed to find it. For
example, if the average age of everyone in a class is needed, everyone is asked and the average age is
found to be 25. That is a parameter because everyone in the class was asked. Now suppose
the average age of everyone in your grade or year is required. If you use the information from
your class to guess that average age, the information becomes a statistic, because you
cannot be sure your guess is correct (although it will probably be close).
Exploratory Data Analysis
https://www.statisticshowto.datasciencecentral.com/what-is-a-parameter-statisticshowto/
Types of Statistics
Descriptive Statistics:
Descriptive statistics describes a data set through numerical calculations,
graphs or tables. It provides a numerical or graphical summary of the
data.
Inferential Statistics
For example, if a drug manufacturer would like to research the adverse side effects of a drug on the population
of the country, it is close to impossible to conduct a research study that involves everyone. In this
case, the researcher selects a sample of people from each demographic and then conducts the research on them,
which gives indicative feedback on the behaviour of the drug in the population.
Types of Sampling:
Probability Sampling: Probability sampling is a sampling method that selects random members of a population
by setting a few selection criteria. These selection parameters allow every member to have equal opportunities to
be a part of various samples.
Non-probability Sampling: Non-probability sampling relies on the researcher’s judgment or convenience rather than
on random selection. It does not use a fixed or pre-defined selection process, which makes it
difficult for all elements of a population to have equal opportunities to be included in a sample.
https://www.questionpro.com/blog/types-of-sampling-for-social-research/
For example, in an organization of 500 employees, if the HR team decides on conducting team building
activities, it is highly likely that they would prefer picking chits out of a bowl. In this case, each of the 500
employees has an equal opportunity of being selected.
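Picking chits out of a bowl corresponds to simple random sampling without replacement. A minimal sketch follows; the sample size of 20 and the fixed seed are assumptions made only so that the example is repeatable.

```python
import random

employees = list(range(1, 501))        # employee numbers 1..500

rng = random.Random(42)                # fixed seed, for repeatability only
chosen = rng.sample(employees, k=20)   # sample size 20 is an assumption

print(sorted(chosen))
```

`sample` never returns the same employee twice, just as a chit drawn from the bowl is not put back.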
For example, if the government of the United States wishes to evaluate the number of immigrants living in the
Mainland US, they can divide it into clusters based on states such as California, Texas, Florida, Massachusetts,
Colorado, Hawaii etc. This way of conducting a survey will be more effective as the results will be organized
into states and provide insightful immigration data.
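One common way to use such clusters is to pick a few of them at random and survey everyone inside the chosen clusters. The sketch below assumes that variant of cluster sampling, which the example above does not spell out; the respondent identifiers are invented.

```python
import random

# Hypothetical respondents grouped by the state (cluster) they live in.
clusters = {
    "California": ["c1", "c2", "c3"],
    "Texas": ["t1", "t2"],
    "Florida": ["f1", "f2", "f3"],
    "Hawaii": ["h1"],
}

rng = random.Random(7)                 # fixed seed, for repeatability only
chosen_states = rng.sample(sorted(clusters), k=2)

# Everyone in a chosen cluster is included in the sample.
sample = [person for state in chosen_states for person in clusters[state]]
print(chosen_states, sample)
```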
For example, a researcher intends to collect a systematic sample of 500 people in a population of 5000. Each
element of the population will be numbered from 1-5000 and every 10th individual will be chosen to be a part of
the sample (Total population/ Sample Size = 5000/500 = 10).
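The systematic sampling example can be sketched directly. The random starting point within the first interval is an assumption (a common choice); the example above only says that every 10th individual is chosen.

```python
import random

population = list(range(1, 5001))       # elements numbered 1..5000
sample_size = 500
step = len(population) // sample_size   # 5000 / 500 = 10

# Start somewhere in the first interval, then take every 10th element.
start = random.Random(0).randrange(step)
sample = population[start::step]

print(step, len(sample), sample[:3])
```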
For example, a researcher looking to analyze the characteristics of people belonging to different annual income
divisions, will create strata (groups) according to annual family income such as – Less than $20,000, $21,000 –
$30,000, $31,000 to $40,000, $41,000 to $50,000 etc. and people belonging to different income groups can be
observed to draw conclusions of which income strata have which characteristics. Marketers can analyze which
income groups to target and which one to eliminate to create a roadmap that would bear fruitful results.
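A hedged sketch of stratified sampling follows. The population is randomly generated, the band boundaries are simplified, and the 10% sampling fraction is an assumption; none of these values come from the example above.

```python
import random
from collections import defaultdict

def income_stratum(income):
    """Assign an annual family income to a band (cut-offs are assumptions)."""
    if income <= 20_000:
        return "up to 20k"
    if income <= 30_000:
        return "20k-30k"
    if income <= 40_000:
        return "30k-40k"
    return "above 40k"

rng = random.Random(1)
# Hypothetical population of (person_id, income) pairs.
people = [(i, rng.randrange(10_000, 50_000)) for i in range(1000)]

strata = defaultdict(list)
for person in people:
    strata[income_stratum(person[1])].append(person)

# Draw the same fraction (10%) at random from every stratum.
sample = {name: rng.sample(members, k=max(1, len(members) // 10))
          for name, members in strata.items()}

for name, members in sample.items():
    print(name, len(members))
```

Sampling within each stratum guarantees that every income band is represented, which a single random draw from the whole population does not.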
Common types of non-probability sampling, each of which reflects the purpose of this sampling method, include:
• Convenience sampling
• Judgmental or Purposive Sampling
• Snowball sampling
For example, startups and NGOs usually conduct convenience sampling at a mall to distribute leaflets of
upcoming events or promotion of a cause – they do that by standing at the entrance of the mall and giving out
pamphlets randomly.
Judgmental or Purposive Sampling: In judgmental or purposive sampling, the sample is formed at the
discretion of the researcher, considering only the purpose of the study and an understanding of the target audience.
Also known as deliberate sampling, the participants are selected solely based on research requirements, and
elements that do not serve the purpose are kept out of the sample.
For instance, suppose researchers want to understand the thought process of people who are interested in studying
for their master’s degree. The selection criterion will be: “Are you interested in studying for Masters in …?” and
those who respond with a “No” will be excluded from the sample.
For example, it will be extremely challenging to survey shelterless people or illegal immigrants. In such cases,
using the snowball theory, researchers can track a few people of that category to interview, and results will be derived on
that basis. This sampling method is implemented in situations where the topic is highly sensitive and not openly
discussed, such as conducting surveys to gather information about HIV/AIDS. Not many victims will readily
respond to the questions, but researchers can contact people they might know or volunteers associated with the
cause to get in touch with the victims and collect information.
Test of significance for large samples (n ≥ 30): Large sample test, asymptotic test or Z test.
Test of significance for small samples (n < 30): Small sample test or exact test (t, F and χ²).
It may be noted that small sample tests can be used in the case of large samples also.
Large sample tests are of two kinds:
• Sampling from attributes
• Sampling from variables
• Statistics vary. You know the average age of your classmates is 25. You might guess that the average age of
everyone in your grade or year is 24, 25, or 26. You might guess the average age for other colleges in your area is
the same. And you might even guess that’s the average age for college students in the US. These may not be
bad guesses, but they are statistics because you did not ask everyone.
https://keydifferences.com/difference-between-statistic-and-parameter.html
Types of Statistics
A statistic is a piece of data from a portion of a population. It is the opposite of a parameter. A parameter is
data from a census. A census surveys everyone.
For example – If you have a bit of information, it’s a statistic. If you look at part of a data set, it’s a statistic. If
you know something about 10% of people, that’s a statistic too. Parameters are all the information. And all the
information is rarely known.
Statistics is a way to understand the data that is collected. For example, every time a package is sent
through the mail, that package is tracked in a huge database. The UPS database is 17 terabytes - about as
big as if you catalogued every book in the Library of Congress.
All data is meaningless without a way to interpret it, which is where statistics comes in. Statistics is about
data and variables. It is also about analysing that data and producing some meaningful information about
that data.
Types of Statistics: A statistic can be more than one type. For example, the sample standard deviation can be
used as a descriptive statistic to describe the standard deviation of a sample. It can be used as an estimator: To
estimate the population standard deviation. And it can be used to test a theory (a hypothesis).
• Descriptive Statistics
• Inferential statistics
Application of Statistics
State: For the effective functioning of the State, statistics is indispensable. Different departments and
authorities require various facts and figures on different matters. They use this data to frame policies and
guidelines so that they can function smoothly.
Traditionally, people used statistics to collect data of manpower, crime, wealth, income, etc. for the formation
of suitable military and fiscal policies.
Economics: Economics is about allocating limited resources among unlimited ends in the most optimal
manner. Statistics offers information to answer some basic questions in economics –
• What to produce?
• How to produce?
• For whom to produce?
Statistical information helps to understand the economic problems and formulation of economic policies.
Traditionally, the application of statistics was limited since the economic theories were based on deductive
logic. Also, most statistical techniques were not developed enough for application in all disciplines.
Frequency: The frequency of any value is the number of times that value appears in a data set. So from the
above examples of colours, we can say two children like the colour blue, so its frequency is two. To make
sense of the raw data, we must organise it, and finding the frequency of the data values is how this
organization is done.
the data into class intervals. This ensures that the frequency
distribution best represents the data. Let us make a grouped
[Table residue: class interval 130-140, frequency 4; class interval 150-160, frequency 3]
From the table, you can see that the value of 150 is put in the
class interval of 150-160 and not 140-150. This is the
convention that must be followed.
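The convention that a boundary value belongs to the interval it starts can be sketched as follows. The data values and the class width of 10 are assumptions made for the example; they are not the values behind the table above.

```python
from collections import Counter

heights = [131, 138, 140, 149, 150, 150, 157, 160]   # hypothetical values
width = 10

def class_interval(value):
    """Return the interval [lower, upper) that contains the value."""
    lower = (value // width) * width
    return lower, lower + width

grouped = Counter(class_interval(h) for h in heights)
for (lower, upper), freq in sorted(grouped.items()):
    print(f"{lower}-{upper}: {freq}")
```

Because each interval includes its lower bound and excludes its upper bound, 150 is counted in 150-160 and never in 140-150.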
21, 23, 19, 17, 12, 15, 15, 17, 17, 19,
23, 23, 21, 23, 25, 25, 21, 19, 19, 19
https://www.math-only-math.com/frequency-distribution-of-ungrouped-and-grouped-data.html
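An ungrouped frequency distribution of the twenty values listed above can be built with `collections.Counter`. This is a minimal sketch; only the variable names are introduced here.

```python
from collections import Counter

data = [21, 23, 19, 17, 12, 15, 15, 17, 17, 19,
        23, 23, 21, 23, 25, 25, 21, 19, 19, 19]

frequency = Counter(data)
for value in sorted(frequency):
    print(value, frequency[value])
```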
The right column will tell you that 614 people spend up to $6000 per year. It includes everyone
who spends up to $6000.
Relative Frequency Distributions: A relative frequency is the fraction or proportion of times a value occurs
in a data set. To find the relative
frequencies, divide each frequency by the total number of data points in the sample. Relative frequencies can
be written as fractions, percentages, or decimals.
https://courses.lumenlearning.com/boundless-statistics/chapter/frequency-distributions-for-quantitative-data/
Next, start to fill in the third column. Each entry is calculated by dividing the frequency of that class by
the total number of data points. For example, suppose we have a frequency of 5 in one class, and there are a
total of 50 data points. The relative frequency of that class is then 5/50 = 0.10.
You can choose to write the relative frequency as a decimal (0.10), as a fraction (1/10), or as a percent (10%).
Since we are dealing with proportions, the relative frequency column should add up to 1 (or 100%). It may be
slightly off due to rounding. Relative frequency distributions are often displayed in histograms and in
frequency polygons. The only difference between a relative frequency distribution graph and a frequency
distribution graph is that the vertical axis uses proportional or relative frequency rather than simple frequency.
Relative Cumulative Frequency: The relative cumulative frequency is the cumulative frequency of a
particular value divided by the total number of data points. It can be expressed as a percentage.
Example
A city has recorded the following daily maximum temperatures during the month:
32, 31, 28, 29, 33, 32, 31, 30, 31, 31, 27, 28, 29, 30, 32, 31, 31, 30, 30, 29, 29, 30, 30, 31, 30, 31, 34, 33, 33,
29, 29.
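As a sketch of how the definition applies to the temperatures listed above, the following Python builds the frequency, cumulative frequency, and relative cumulative frequency columns. The variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Daily maximum temperatures listed above (31 days).
temperatures = [32, 31, 28, 29, 33, 32, 31, 30, 31, 31, 27, 28, 29, 30, 32,
                31, 31, 30, 30, 29, 29, 30, 30, 31, 30, 31, 34, 33, 33, 29, 29]

# Frequency of each distinct temperature, sorted by temperature.
table = sorted(Counter(temperatures).items())
values = [value for value, _ in table]
frequencies = [count for _, count in table]

# Running total of the frequencies, then divide by the number of days.
cumulative = list(accumulate(frequencies))
total = len(temperatures)
relative_cumulative = [c / total for c in cumulative]

print("Temp  Freq  Cum  RelCum")
for v, f, c, r in zip(values, frequencies, cumulative, relative_cumulative):
    print(f"{v:>4}  {f:>4}  {c:>3}  {r:>6.1%}")
```

The final relative cumulative frequency is 100%, because the last class accumulates every observation.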
Summary
This sub-module covered the concepts of sample data and population, the difference
between a statistic and a parameter, the use of small-sample and large-sample
statistics and parameters, the types of statistics and their application in different
business scenarios, and the frequency distribution of data.
a) Exclusive method
b) Inclusive method
c) Mid-point method
d) Ratio method
Answer: a
a) Nominal distribution
b) Ordinal distribution
c) Chronological distribution
d) Frequency distribution
Answer: d
a) 20
b) 4
c) 25
d) 15
Answer: b
Document Links
Topic | URL | Notes
Sample and Population Data | https://stattrek.com/sampling/populations-and-samples.aspx | The link explains about Sample and Population data
Types of Sampling Methods | https://www.questionpro.com/blog/types-of-sampling-for-social-research/ | The link explains about types of sampling methods
Large and Small Samples | http://ecoursesonline.iasri.res.in/mod/page/view.php?id=15455 | The link explains about Large and Small Samples
Parameter and Statistics | https://www.statisticshowto.datasciencecentral.com/what-is-a-parameter-statisticshowto/ | The link explains about Parameter and Statistics
Types of Statistics | https://www.statisticshowto.datasciencecentral.com/statistic/ | The link explains about Statistics
Application of Statistics | https://www.toppr.com/guides/business-economics-cs/descriptive-statistics/application-of-statistics/ | The link explains about Application of statistics in business
Frequency of Data Distribution | https://www.toppr.com/guides/maths/data-handling/data-and-its-frequency-distribution/ | The link explains about Frequency Distribution of Data
Video Links
Data Classification: The link explains about Data Classification
https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-videos/lecture-14-classification-and-statistical-sins/
https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-videos/lecture-13-classification/
Statistics
E-Book Links
EBook name | Chapter | Page No. | URL
Exploratory Data Analysis | Whole Book | Whole Book | https://www.itl.nist.gov/div898/handbook/eda/eda.htm
Statistical Analysis Handbook | Whole Book | Whole Book | https://www.statsref.com/StatsRefSample.pdf