Foundation of Data Science - Important Notes


Primary data is original data collected by the researcher for a specific research project or purpose. It is generated by the researcher themselves and is gathered directly from a first-hand source.

Data quality refers to the condition or characteristics of data that make it suitable for its
intended use. High-quality data ensures that analysis, decision-making, and predictions are
accurate and reliable.

An outlier is a data point that significantly differs from other observations in a dataset. It lies
outside the expected range or pattern and can occur due to variability in the data, measurement
errors, or unusual events.

Interquartile range (IQR) is a measure of statistical dispersion, representing the range within
which the central 50% of a dataset lies. It is calculated as the difference between the third
quartile (Q3) and the first quartile (Q1):
IQR=Q3−Q1
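A minimal Python sketch (using NumPy, with a made-up sample) that computes the IQR and applies the common 1.5 × IQR rule of thumb to flag outliers:

```python
import numpy as np

# Hypothetical sample with one unusually large value (an outlier)
data = np.array([12, 14, 15, 15, 16, 18, 19, 20, 21, 95])

q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # IQR = Q3 - Q1

# Points beyond 1.5 * IQR from the quartiles are commonly flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print("IQR:", iqr)
print("Outliers:", outliers)   # -> [95]
```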

Missing values refer to the absence of data for certain observations or attributes in a dataset.
These gaps occur when no value is recorded for a variable in a specific instance. Missing
values can arise due to various reasons, such as errors during data collection, equipment
malfunction, or respondents skipping questions in surveys.
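A short pandas sketch, with invented survey data, showing how missing values are detected and then either dropped or imputed:

```python
import pandas as pd
import numpy as np

# Hypothetical survey data where one respondent skipped the "age" question
df = pd.DataFrame({"name": ["A", "B", "C"], "age": [25, np.nan, 31]})

print(df.isna().sum())                              # count missing values per column

df_dropped = df.dropna()                            # option 1: remove rows with missing values
df_filled = df.fillna({"age": df["age"].mean()})    # option 2: impute with the column mean
```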

Zip files are compressed files that bundle multiple files or folders into a single archive, often
reducing their size. They are widely used for their convenience and functionality.
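A small Python sketch using the standard zipfile module; the file names are assumptions and must exist in the working directory:

```python
import zipfile

# Bundle two (assumed existing) text files into one compressed archive
with zipfile.ZipFile("notes.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("chapter1.txt")
    zf.write("chapter2.txt")

# List the archive's contents and extract them into a folder
with zipfile.ZipFile("notes.zip") as zf:
    print(zf.namelist())
    zf.extractall("unpacked")
```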

XML (eXtensible Markup Language) is a markup language used to store, organize, and share
data in a structured and human-readable format. It is widely used for data exchange between
systems due to its simplicity and flexibility.
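A short Python example parsing a made-up XML snippet with the standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

# A small, invented XML document describing students
xml_text = """
<students>
    <student id="1"><name>Asha</name><marks>88</marks></student>
    <student id="2"><name>Ravi</name><marks>92</marks></student>
</students>
"""

root = ET.fromstring(xml_text)
for student in root.findall("student"):
    print(student.get("id"), student.find("name").text, student.find("marks").text)
```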

Data discretization is the process of converting continuous data into discrete intervals or
categories. This technique is commonly used in data preprocessing, particularly in machine
learning and data mining, to simplify data analysis and improve model performance.
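A minimal pandas sketch that discretizes a made-up set of ages into labelled intervals:

```python
import pandas as pd

# Hypothetical continuous ages binned into discrete, labelled groups
ages = pd.Series([5, 17, 23, 35, 47, 62, 71])
bins = [0, 18, 40, 60, 100]
labels = ["child", "young adult", "middle-aged", "senior"]

age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.value_counts())
```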

Tag cloud (also known as a word cloud) is a visual representation of text data, typically used to
depict keywords, tags, or terms. The size, color, or weight of each word in the cloud indicates its
frequency, importance, or relevance in the dataset.
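A short sketch assuming the third-party wordcloud package (and Matplotlib) is installed; the text is invented for illustration:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "data science data analysis machine learning data visualization statistics data"

# Word size in the cloud reflects how often each term appears in the text
wc = WordCloud(width=600, height=300, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```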

Visual encoding is the process of converting information into a visual format, such as graphs,
charts, maps, or diagrams, to make the data more understandable and accessible. It plays a
crucial role in data visualization by transforming raw data into visual representations that can be
interpreted quickly and accurately.

Applications of data science

Image recognition and speech recognition, the gaming world, internet search, transport, healthcare, and risk detection.

Unstructured data refers to data that does not have a predefined structure or format, making it
difficult to organize and analyze using traditional methods or relational databases.
CSV (Comma-Separated Values) file is a simple text file used to store tabular data, such as a
spreadsheet or database, where each line represents a row, and each value within a row is
separated by a comma or another delimiter (such as a semicolon or tab).
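A minimal pandas sketch assuming a hypothetical sales.csv file with a header row; the same reader handles other delimiters such as tabs:

```python
import pandas as pd

# Assume a file "sales.csv" whose first line is a header row: date,region,amount
df = pd.read_csv("sales.csv")                 # comma is the default delimiter
df_tab = pd.read_csv("sales.tsv", sep="\t")   # hypothetical tab-separated variant

print(df.head())
```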

Applications of Data Science


1. In Search Engines
One of the most visible applications of Data Science is search engines. When we search for something on the internet, we mostly use search engines such as Google, Yahoo, DuckDuckGo, and Bing. Data Science is used to return relevant results faster.
2. In Transport
Data Science is also used in real-time applications in the transport field, such as driverless cars. Driverless cars help reduce the number of accidents.
3. In Finance
Data Science plays a key role in the financial industry, which constantly faces fraud and the risk of losses. Financial institutions therefore automate risk and loss analysis to support strategic decisions for the company.
4. In E-Commerce
E-commerce websites like Amazon and Flipkart use Data Science to provide a better user experience through personalized recommendations.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. It is used for detecting tumors, drug discovery, medical image analysis, and virtual medical bots.
6. Image Recognition
Currently, Data Science is also used in image recognition, for example to identify objects or faces in photos.

Null Hypothesis (H₀): The null hypothesis represents a statement of no effect or no difference.
It assumes that any observed effect or difference in the sample data is due to random chance or
sampling error rather than a true effect in the population. Purpose: The null hypothesis serves
as the baseline or default assumption that is tested against the sample data.

Alternative Hypothesis (H₁ or Ha): The alternative hypothesis represents a statement of an effect or a difference. It suggests that there is a true effect or relationship in the population, and any observed data deviations are not due to chance. Purpose: The alternative hypothesis is what the researcher aims to support or prove. It is typically the opposite of the null hypothesis.
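A small Python sketch of testing H₀ against H₁ with a one-sample t-test from SciPy; the sample values and the hypothesised mean of 50 are made up for illustration:

```python
from scipy import stats
import numpy as np

# H0: the population mean equals 50; H1: it does not (two-sided test)
sample = np.array([52, 49, 55, 51, 53, 50, 54, 56, 48, 52])

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the mean differs from 50")
else:
    print("Fail to reject H0: no evidence of a difference")
```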

The 3 V's of Data Science:


Volume: Refers to the sheer amount of data being generated and stored. In the context of data
science, volume measures how much data is available for analysis. The increasing volume of
data from various sources (such as social media, sensors, transaction logs, and more) presents
challenges in terms of storage, processing, and analysis.

Velocity: Refers to the speed at which data is generated, processed, and analyzed. This
dimension is particularly important for real-time or near-real-time applications. Data is being
created continuously and often needs to be processed quickly to provide timely insights. This
includes the speed of incoming data, the speed of data processing, and the speed at which
decisions must be made.

Variety: Refers to the different types and formats of data available. Data comes in many forms,
including structured, semi-structured, and unstructured formats. The variety of data adds
complexity to its processing, as different types of data may require different methods of storage,
cleaning, and analysis. It includes data from multiple sources and formats, such as text, images,
videos, logs, and more.

Noisy data refers to irrelevant or random variations in the data that do not represent the true
underlying patterns or relationships. It is data that contains errors, inconsistencies, or random
fluctuations, which can distort the results of data analysis, making it harder to extract meaningful
insights.
Causes of Noisy Data
1. Measurement errors occur when data is collected using tools, devices, or systems that
are not accurate or precise. These errors can introduce noise into the dataset, making it
difficult to obtain reliable results.
2. Human errors occur when individuals input or process data incorrectly. This can happen
due to mistakes during data entry, misunderstanding of instructions, or negligence.

Data visualization refers to the graphical representation of data and information using visual
elements like charts, graphs, maps, and plots. The goal of data visualization is to make complex
data more accessible, understandable, and usable by presenting it in a visual format, which
allows for easier identification of patterns, trends, correlations, and outliers.
Examples -
Matplotlib (Python): Matplotlib is a widely used Python library for creating static, interactive,
and animated visualizations. It provides a variety of chart types such as line plots, histograms,
bar charts, and scatter plots.
Seaborn (Python): Seaborn is a Python library based on Matplotlib that provides a high-level
interface for drawing attractive and informative statistical graphics. It simplifies the process of
creating complex visualizations like heatmaps, violin plots, and pair plots.
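A short sketch using Matplotlib and Seaborn together; it assumes Seaborn's bundled "tips" example dataset can be loaded (it is downloaded on first use):

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")   # small example dataset shipped with Seaborn

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].hist(tips["total_bill"], bins=20)                        # plain Matplotlib histogram
axes[0].set_title("Total bill (histogram)")

sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[1])  # Seaborn scatter plot
axes[1].set_title("Tip vs. total bill")

plt.tight_layout()
plt.show()
```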

Data Cube Aggregation in Data Reduction

1. Multi-dimensional Representation: Data is organized across different dimensions (such as time, location, product type, etc.), and a measure is associated with each combination of these dimensions.
2. Aggregation: Instead of keeping detailed records of every individual data point,
aggregation combines data from smaller units into larger groups. For example, instead of
keeping sales data for each individual transaction, we may aggregate the data by week,
region, or product category. This reduces the overall volume of data.
3. Data Reduction: Aggregation leads to data reduction by summarizing large datasets
into compact summaries. This reduces the number of data points, making it easier to
analyze and store.
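A minimal pandas sketch, with invented transaction data, that aggregates detailed records into a compact week-by-region summary (a simple two-dimensional "cube"):

```python
import pandas as pd

# Hypothetical transaction-level sales data
sales = pd.DataFrame({
    "week":    [1, 1, 1, 2, 2, 2],
    "region":  ["North", "South", "North", "North", "South", "South"],
    "product": ["A", "A", "B", "A", "B", "B"],
    "amount":  [100, 150, 80, 120, 90, 60],
})

# Aggregate along the week x region dimensions: many transactions become one cell each
cube = sales.pivot_table(index="week", columns="region", values="amount", aggfunc="sum")
print(cube)
```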

Data Visualization Tools

1. Tableau - is one of the most widely used data visualization tools, known for its powerful,
user-friendly interface. It allows users to create a wide variety of interactive
visualizations, from basic charts to complex dashboards. Features: Drag-and-drop
interface that allows users to create visualizations without needing advanced coding
skills. Powerful features for creating dashboards, reports, and real-time data
visualizations. Allows sharing of visualizations on the web or with stakeholders.
2. Power BI - is a business analytics tool by Microsoft that allows users to visualize and
share insights from their data. It's known for its deep integration with Microsoft products
and cloud services. Features: Interactive dashboards, visualizations, and reports.
Integrates seamlessly with Microsoft Excel, SQL Server, and other data sources.
Provides cloud-based services for sharing and collaborating on visualizations. Offers
both desktop and cloud-based applications for creating and accessing visual reports.
3. Google Data Studio - is a free tool for creating custom reports and dashboards. It
integrates well with other Google services like Google Analytics, Google Ads, and
Google Sheets. Features: Completely free to use and cloud-based, making it easy to
collaborate. Allows integration with various Google services and third-party connectors.
Provides various customizable visualizations like charts, tables, and geospatial maps.
Real-time collaboration and sharing of reports with a link or embed code.
4. D3.js - (Data-Driven Documents) is a powerful JavaScript library used to create
interactive and dynamic data visualizations in web browsers. Unlike other visualization
tools, D3.js provides fine-grained control over the design and functionality of
visualizations. Features: Allows creation of highly customizable, interactive charts, maps,
and graphics. Works directly with HTML, CSS, and SVG to render complex
visualizations on web pages. Supports animation, transitions, and real-time updates to
data visualizations.

Data attributes refer to the characteristics or properties that describe data. These attributes are
the variables or features that are measured, observed, or collected for a particular entity in a
dataset. In the context of data science and machine learning, attributes define the various
dimensions along which data can be analyzed and are essential for understanding and
processing data.
Types -
1. Nominal attributes are categorical variables that represent categories or groups. The
values of these attributes are names or labels and have no inherent order or ranking.
Example - Gender: {Male, Female, Other} (These categories cannot be ordered, and
they are just labels).
2. Ordinal attributes represent categorical variables that have a meaningful order or
ranking, but the intervals between the categories are not necessarily equal. Example -
Education Level: {High School, Bachelor’s Degree, Master’s Degree, PhD} (These
categories are ordered from lowest to highest).
3. Interval attributes represent continuous variables where the differences between
values are meaningful and consistent, but there is no true zero point. This means that
the scale does not have a true "absence" of the attribute. Example - Temperature (in
Celsius or Fahrenheit): The difference between 10°C and 20°C is the same as between
20°C and 30°C, but 0°C does not represent "no temperature".
4. Ratio attributes are continuous variables that have all the properties of interval
attributes, but they also have a true zero point. This means that 0 represents the
absence of the attribute, and ratios between values are meaningful. Example - Height: A
height of 0 meters means "no height", and a height of 180 cm is exactly twice as tall as
one of 90 cm.

How do you visualize geospatial data? Explain in detail.


Visualizing geospatial data involves the process of displaying geographic information and
related data on maps to help understand spatial relationships, patterns, and trends. Geospatial
data refers to information about locations and physical features, often containing geographical
coordinates (latitude and longitude), addresses, or boundaries that correspond to real-world
locations. There are various methods and tools for visualizing geospatial data, and the choice of
method often depends on the type of data, the analysis goals, and the audience.
Methods for Visualizing Geospatial Data:
1. Static maps are images that represent geographical information without interaction.
They are commonly used in reports, presentations, and printed documents.
Types - Choropleth maps, Dot maps, Proportional symbol maps
2. Interactive maps allow users to explore geospatial data dynamically, offering features
like zooming, panning, and clicking for more information. These are commonly used in
web applications, dashboards, and GIS (Geographic Information System) tools.
Types - Web maps, Heatmaps, Time series maps

3. Geospatial Visualization Tools:


● Google Maps / Google Earth:
a. Google Maps API: Used for embedding maps into web applications, showing
geospatial data like markers, polygons, and paths. It also supports custom
overlays and popups for detailed information.
b. Google Earth: A 3D visualization tool that allows you to explore satellite imagery,
3D terrain, and geospatial data on a global scale.
● Leaflet: Leaflet is an open-source JavaScript library for creating interactive maps. It is
lightweight, easy to use, and supports a wide range of plugins for advanced mapping
features.
● ArcGIS is a powerful Geographic Information System (GIS) software suite used for
creating, analyzing, and visualizing geospatial data. It is commonly used by
professionals for detailed spatial analysis and mapping.
● QGIS is an open-source GIS tool that enables users to visualize, manage, and analyze
geospatial data. It supports a wide range of data formats and provides both basic and
advanced geospatial analysis features.
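As a small illustration of an interactive map, the sketch below uses the third-party folium package (a Python wrapper around Leaflet); the coordinates are made up for illustration:

```python
import folium

# Create an interactive Leaflet map centred on approximate, made-up coordinates
m = folium.Map(location=[18.52, 73.86], zoom_start=12)

# Add a marker with a popup for one point of interest
folium.Marker([18.5204, 73.8567], popup="City centre").add_to(m)

m.save("map.html")  # open the saved HTML file in a browser to pan and zoom
```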

What do you mean by Data transformation? Explain strategies of data transformation.

Data transformation is the process of converting data from its original format or structure into a
format that is more suitable for analysis, storage, or other operations. It is an essential part of
data preprocessing in data science, machine learning, and data integration processes. Data
transformation helps to make the data consistent, clean, and usable for various analytical tasks,
such as reporting, decision-making, and machine learning modeling.

1. Normalization is the process of adjusting the scale of data to ensure that each feature
contributes equally to the analysis or machine learning model. It is particularly useful
when working with algorithms that are sensitive to the scale of input data (e.g., k-nearest
neighbors, neural networks).
2. Discretization is the process of converting continuous data into discrete intervals or
bins. This is useful for simplifying the data and for making it compatible with certain types
of machine learning algorithms (e.g., decision trees).
3. Aggregation involves summarizing or combining multiple values into a single value,
typically at a higher level of granularity. This is often used to reduce the dimensionality of
data and to improve computational efficiency.
4. Data encoding involves converting categorical variables into numerical representations.
This is necessary for machine learning algorithms that require numeric inputs.
5. Feature extraction involves creating new features from existing data to better capture
the underlying patterns in the data, improving the performance of machine learning
models.
6. Smoothing is used to remove noise or outliers in the data, making it easier to identify
trends and patterns.
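A short Python sketch, assuming pandas and scikit-learn are installed, showing three of the strategies above (normalization, discretization, and encoding) on invented data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "income": [20000, 35000, 50000, 120000],
    "city":   ["Pune", "Mumbai", "Pune", "Delhi"],
})

# Normalization: rescale income to the [0, 1] range
df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Discretization: bin income into three labelled groups
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])

# Encoding: convert the categorical city column into numeric dummy columns
df = pd.get_dummies(df, columns=["city"])
print(df)
```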

What are the different methods for measuring the data dispersion?

1. Range - The range is the simplest measure of dispersion and is the difference between
the maximum and minimum values in a dataset.

Range=Max value−Min value

2. Variance - measures the average squared deviation of each data point from the
mean. It provides an idea of how much the values differ from the average.
3. Standard Deviation - The standard deviation is the square root of the variance
and provides a measure of the average distance of data points from the mean, in
the same units as the data.
4. Interquartile Range (IQR) - measures the spread of the middle 50% of the data. It is
the difference between the third quartile (Q3) and the first quartile (Q1):
IQR=Q3−Q1
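A minimal NumPy sketch computing all four dispersion measures on a made-up sample:

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])

data_range = data.max() - data.min()   # Range = max value - min value
variance = np.var(data, ddof=1)        # sample variance
std_dev = np.std(data, ddof=1)         # sample standard deviation

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                          # Interquartile range

print(data_range, variance, std_dev, iqr)
```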
