Foundation of Data Science - Important Notes


Primary data is original data collected by the researcher for a specific research project or purpose. It is generated by the researcher themselves and is gathered directly from a first-hand source.

Data quality refers to the condition or characteristics of data that make it suitable for its
intended use. High-quality data ensures that analysis, decision-making, and predictions are
accurate and reliable.

An outlier is a data point that significantly differs from other observations in a dataset. It lies
outside the expected range or pattern and can occur due to variability in the data, measurement
errors, or unusual events.

Interquartile range (IQR) is a measure of statistical dispersion, representing the range within
which the central 50% of a dataset lies. It is calculated as the difference between the third
quartile (Q3) and the first quartile (Q1):
IQR=Q3−Q1
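A minimal Python sketch (using NumPy, with a made-up sample) that computes the IQR and applies the common 1.5 × IQR rule of thumb to flag outliers:

```python
import numpy as np

# Hypothetical sample with one unusually large value (an outlier)
data = np.array([12, 14, 15, 15, 16, 18, 19, 20, 21, 95])

q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # IQR = Q3 - Q1

# Points beyond 1.5 * IQR from the quartiles are commonly flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print("IQR:", iqr)
print("Outliers:", outliers)   # -> [95]
```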

Missing values refer to the absence of data for certain observations or attributes in a dataset.
These gaps occur when no value is recorded for a variable in a specific instance. Missing
values can arise due to various reasons, such as errors during data collection, equipment
malfunction, or respondents skipping questions in surveys.
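A short pandas sketch, with invented survey data, showing how missing values are detected and then either dropped or imputed:

```python
import pandas as pd
import numpy as np

# Hypothetical survey data where one respondent skipped the "age" question
df = pd.DataFrame({"name": ["A", "B", "C"], "age": [25, np.nan, 31]})

print(df.isna().sum())                              # count missing values per column

df_dropped = df.dropna()                            # option 1: remove rows with missing values
df_filled = df.fillna({"age": df["age"].mean()})    # option 2: impute with the column mean
```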

Zip files are compressed files that bundle multiple files or folders into a single archive, often
reducing their size. They are widely used for their convenience and functionality.
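A small Python sketch using the standard zipfile module; the file names are assumptions and must exist in the working directory:

```python
import zipfile

# Bundle two (assumed existing) text files into one compressed archive
with zipfile.ZipFile("notes.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("chapter1.txt")
    zf.write("chapter2.txt")

# List the archive's contents and extract them into a folder
with zipfile.ZipFile("notes.zip") as zf:
    print(zf.namelist())
    zf.extractall("unpacked")
```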

XML (eXtensible Markup Language) is a markup language used to store, organize, and share
data in a structured and human-readable format. It is widely used for data exchange between
systems due to its simplicity and flexibility.
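A short Python example parsing a made-up XML snippet with the standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

# A small, invented XML document describing students
xml_text = """
<students>
    <student id="1"><name>Asha</name><marks>88</marks></student>
    <student id="2"><name>Ravi</name><marks>92</marks></student>
</students>
"""

root = ET.fromstring(xml_text)
for student in root.findall("student"):
    print(student.get("id"), student.find("name").text, student.find("marks").text)
```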

Data discretization is the process of converting continuous data into discrete intervals or
categories. This technique is commonly used in data preprocessing, particularly in machine
learning and data mining, to simplify data analysis and improve model performance.
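A minimal pandas sketch that discretizes a made-up set of ages into labelled intervals:

```python
import pandas as pd

# Hypothetical continuous ages binned into discrete, labelled groups
ages = pd.Series([5, 17, 23, 35, 47, 62, 71])
bins = [0, 18, 40, 60, 100]
labels = ["child", "young adult", "middle-aged", "senior"]

age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.value_counts())
```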

Tag cloud (also known as a word cloud) is a visual representation of text data, typically used to
depict keywords, tags, or terms. The size, color, or weight of each word in the cloud indicates its
frequency, importance, or relevance in the dataset.
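A short sketch assuming the third-party wordcloud package (and Matplotlib) is installed; the text is invented for illustration:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "data science data analysis machine learning data visualization statistics data"

# Word size in the cloud reflects how often each term appears in the text
wc = WordCloud(width=600, height=300, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```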

Visual encoding is the process of converting information into a visual format, such as graphs,
charts, maps, or diagrams, to make the data more understandable and accessible. It plays a
crucial role in data visualization by transforming raw data into visual representations that can be
interpreted quickly and accurately.

Applications of data science

Image recognition and speech recognition, the gaming world, internet search, transport, healthcare, and risk detection.

Unstructured data refers to data that does not have a predefined structure or format, making it
difficult to organize and analyze using traditional methods or relational databases.
CSV (Comma-Separated Values) file is a simple text file used to store tabular data, such as a
spreadsheet or database, where each line represents a row, and each value within a row is
separated by a comma or another delimiter (such as a semicolon or tab).
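A minimal pandas sketch assuming a hypothetical sales.csv file with a header row; the same reader handles other delimiters such as tabs:

```python
import pandas as pd

# Assume a file "sales.csv" whose first line is a header row: date,region,amount
df = pd.read_csv("sales.csv")                 # comma is the default delimiter
df_tab = pd.read_csv("sales.tsv", sep="\t")   # hypothetical tab-separated variant

print(df.head())
```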

Applications of Data Science


1. In Search Engines
One of the most visible applications of Data Science is search engines. When we search for something on the internet, we mostly use search engines such as Google, Yahoo, DuckDuckGo, and Bing. Data Science is used to return relevant results faster.
2. In Transport
Data Science is also used in real-time applications in the transport field, such as driverless cars. Driverless cars help reduce the number of accidents.
3. In Finance
Data Science plays a key role in the financial industry, which constantly faces fraud and the risk of losses. Financial institutions therefore automate risk and loss analysis to support strategic decisions for the company.
4. In E-Commerce
E-commerce websites like Amazon and Flipkart use Data Science to provide a better user experience through personalized recommendations.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. It is used for detecting tumors, drug discovery, medical image analysis, and virtual medical bots.
6. Image Recognition
Currently, Data Science is also used in image recognition, for example to identify objects or faces in photos.

Null Hypothesis (H₀): The null hypothesis represents a statement of no effect or no difference.
It assumes that any observed effect or difference in the sample data is due to random chance or
sampling error rather than a true effect in the population. Purpose: The null hypothesis serves
as the baseline or default assumption that is tested against the sample data.

Alternative Hypothesis (H₁ or Ha): The alternative hypothesis represents a statement of an effect or a difference. It suggests that there is a true effect or relationship in the population, and any observed data deviations are not due to chance. Purpose: The alternative hypothesis is what the researcher aims to support or prove. It is typically the opposite of the null hypothesis.
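A small Python sketch of testing H₀ against H₁ with a one-sample t-test from SciPy; the sample values and the hypothesised mean of 50 are made up for illustration:

```python
from scipy import stats
import numpy as np

# H0: the population mean equals 50; H1: it does not (two-sided test)
sample = np.array([52, 49, 55, 51, 53, 50, 54, 56, 48, 52])

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the mean differs from 50")
else:
    print("Fail to reject H0: no evidence of a difference")
```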

The 3 V's of Data Science:


Volume: Refers to the sheer amount of data being generated and stored. In the context of data
science, volume measures how much data is available for analysis. The increasing volume of
data from various sources (such as social media, sensors, transaction logs, and more) presents
challenges in terms of storage, processing, and analysis.

Velocity: Refers to the speed at which data is generated, processed, and analyzed. This
dimension is particularly important for real-time or near-real-time applications. Data is being
created continuously and often needs to be processed quickly to provide timely insights. This
includes the speed of incoming data, the speed of data processing, and the speed at which
decisions must be made.

Variety: Refers to the different types and formats of data available. Data comes in many forms,
including structured, semi-structured, and unstructured formats. The variety of data adds
complexity to its processing, as different types of data may require different methods of storage,
cleaning, and analysis. It includes data from multiple sources and formats, such as text, images,
videos, logs, and more.

Noisy data refers to irrelevant or random variations in the data that do not represent the true
underlying patterns or relationships. It is data that contains errors, inconsistencies, or random
fluctuations, which can distort the results of data analysis, making it harder to extract meaningful
insights.
Causes of Noisy Data
1. Measurement errors occur when data is collected using tools, devices, or systems that
are not accurate or precise. These errors can introduce noise into the dataset, making it
difficult to obtain reliable results.
2. Human errors occur when individuals input or process data incorrectly. This can happen
due to mistakes during data entry, misunderstanding of instructions, or negligence.

Data visualization refers to the graphical representation of data and information using visual
elements like charts, graphs, maps, and plots. The goal of data visualization is to make complex
data more accessible, understandable, and usable by presenting it in a visual format, which
allows for easier identification of patterns, trends, correlations, and outliers.
Examples -
Matplotlib (Python): Matplotlib is a widely used Python library for creating static, interactive,
and animated visualizations. It provides a variety of chart types such as line plots, histograms,
bar charts, and scatter plots.
Seaborn (Python): Seaborn is a Python library based on Matplotlib that provides a high-level
interface for drawing attractive and informative statistical graphics. It simplifies the process of
creating complex visualizations like heatmaps, violin plots, and pair plots.
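A short sketch using Matplotlib and Seaborn together; it assumes Seaborn's bundled "tips" example dataset can be loaded (it is downloaded on first use):

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")   # small example dataset shipped with Seaborn

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].hist(tips["total_bill"], bins=20)                        # plain Matplotlib histogram
axes[0].set_title("Total bill (histogram)")

sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[1])  # Seaborn scatter plot
axes[1].set_title("Tip vs. total bill")

plt.tight_layout()
plt.show()
```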

Data Cube Aggregation in Data Reduction

1. Multi-dimensional Representation: Data is organized across different dimensions (such as time, location, product type, etc.), and a measure is associated with each combination of these dimensions.
2. Aggregation: Instead of keeping detailed records of every individual data point,
aggregation combines data from smaller units into larger groups. For example, instead of
keeping sales data for each individual transaction, we may aggregate the data by week,
region, or product category. This reduces the overall volume of data.
3. Data Reduction: Aggregation leads to data reduction by summarizing large datasets
into compact summaries. This reduces the number of data points, making it easier to
analyze and store.
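A minimal pandas sketch, with invented transaction data, that aggregates detailed records into a compact week-by-region summary (a simple two-dimensional "cube"):

```python
import pandas as pd

# Hypothetical transaction-level sales data
sales = pd.DataFrame({
    "week":    [1, 1, 1, 2, 2, 2],
    "region":  ["North", "South", "North", "North", "South", "South"],
    "product": ["A", "A", "B", "A", "B", "B"],
    "amount":  [100, 150, 80, 120, 90, 60],
})

# Aggregate along the week x region dimensions: many transactions become one cell each
cube = sales.pivot_table(index="week", columns="region", values="amount", aggfunc="sum")
print(cube)
```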

Data Visualization Tools

1. Tableau - is one of the most widely used data visualization tools, known for its powerful,
user-friendly interface. It allows users to create a wide variety of interactive
visualizations, from basic charts to complex dashboards. Features: Drag-and-drop
interface that allows users to create visualizations without needing advanced coding
skills. Powerful features for creating dashboards, reports, and real-time data
visualizations. Allows sharing of visualizations on the web or with stakeholders.
2. Power BI - is a business analytics tool by Microsoft that allows users to visualize and
share insights from their data. It's known for its deep integration with Microsoft products
and cloud services. Features: Interactive dashboards, visualizations, and reports.
Integrates seamlessly with Microsoft Excel, SQL Server, and other data sources.
Provides cloud-based services for sharing and collaborating on visualizations. Offers
both desktop and cloud-based applications for creating and accessing visual reports.
3. Google Data Studio - is a free tool for creating custom reports and dashboards. It
integrates well with other Google services like Google Analytics, Google Ads, and
Google Sheets. Features: Completely free to use and cloud-based, making it easy to
collaborate. Allows integration with various Google services and third-party connectors.
Provides various customizable visualizations like charts, tables, and geospatial maps.
Real-time collaboration and sharing of reports with a link or embed code.
4. D3.js - (Data-Driven Documents) is a powerful JavaScript library used to create
interactive and dynamic data visualizations in web browsers. Unlike other visualization
tools, D3.js provides fine-grained control over the design and functionality of
visualizations. Features: Allows creation of highly customizable, interactive charts, maps,
and graphics. Works directly with HTML, CSS, and SVG to render complex
visualizations on web pages. Supports animation, transitions, and real-time updates to
data visualizations.

Data attributes refer to the characteristics or properties that describe data. These attributes are
the variables or features that are measured, observed, or collected for a particular entity in a
dataset. In the context of data science and machine learning, attributes define the various
dimensions along which data can be analyzed and are essential for understanding and
processing data.
Types -
1. Nominal attributes are categorical variables that represent categories or groups. The
values of these attributes are names or labels and have no inherent order or ranking.
Example - Gender: {Male, Female, Other} (These categories cannot be ordered, and
they are just labels).
2. Ordinal attributes represent categorical variables that have a meaningful order or
ranking, but the intervals between the categories are not necessarily equal. Example -
Education Level: {High School, Bachelor’s Degree, Master’s Degree, PhD} (These
categories are ordered from lowest to highest).
3. Interval attributes represent continuous variables where the differences between
values are meaningful and consistent, but there is no true zero point. This means that
the scale does not have a true "absence" of the attribute. Example - Temperature (in
Celsius or Fahrenheit): The difference between 10°C and 20°C is the same as between
20°C and 30°C, but 0°C does not represent "no temperature".
4. Ratio attributes are continuous variables that have all the properties of interval
attributes, but they also have a true zero point. This means that 0 represents the
absence of the attribute, and ratios between values are meaningful. Example - Height: A
height of 0 meters means "no height", and a height of 180 cm is exactly twice as tall as
one of 90 cm.

How do you visualize geospatial data? Explain in detail.


Visualizing geospatial data involves the process of displaying geographic information and
related data on maps to help understand spatial relationships, patterns, and trends. Geospatial
data refers to information about locations and physical features, often containing geographical
coordinates (latitude and longitude), addresses, or boundaries that correspond to real-world
locations. There are various methods and tools for visualizing geospatial data, and the choice of
method often depends on the type of data, the analysis goals, and the audience.
Methods for Visualizing Geospatial Data:
1. Static maps are images that represent geographical information without interaction.
They are commonly used in reports, presentations, and printed documents.
Types - Choropleth maps, Dot maps, Proportional symbol maps
2. Interactive maps allow users to explore geospatial data dynamically, offering features
like zooming, panning, and clicking for more information. These are commonly used in
web applications, dashboards, and GIS (Geographic Information System) tools.
Types - Web maps, Heatmaps, Time series maps

3. Geospatial Visualization Tools:


● Google Maps / Google Earth:
a. Google Maps API: Used for embedding maps into web applications, showing
geospatial data like markers, polygons, and paths. It also supports custom
overlays and popups for detailed information.
b. Google Earth: A 3D visualization tool that allows you to explore satellite imagery,
3D terrain, and geospatial data on a global scale.
● Leaflet: Leaflet is an open-source JavaScript library for creating interactive maps. It is
lightweight, easy to use, and supports a wide range of plugins for advanced mapping
features.
● ArcGIS is a powerful Geographic Information System (GIS) software suite used for
creating, analyzing, and visualizing geospatial data. It is commonly used by
professionals for detailed spatial analysis and mapping.
● QGIS is an open-source GIS tool that enables users to visualize, manage, and analyze
geospatial data. It supports a wide range of data formats and provides both basic and
advanced geospatial analysis features.
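As a small illustration of an interactive map, the sketch below uses the third-party folium package (a Python wrapper around Leaflet); the coordinates are made up for illustration:

```python
import folium

# Create an interactive Leaflet map centred on approximate, made-up coordinates
m = folium.Map(location=[18.52, 73.86], zoom_start=12)

# Add a marker with a popup for one point of interest
folium.Marker([18.5204, 73.8567], popup="City centre").add_to(m)

m.save("map.html")  # open the saved HTML file in a browser to pan and zoom
```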

What do you mean by Data transformation? Explain strategies of data transformation.

Data transformation is the process of converting data from its original format or structure into a
format that is more suitable for analysis, storage, or other operations. It is an essential part of
data preprocessing in data science, machine learning, and data integration processes. Data
transformation helps to make the data consistent, clean, and usable for various analytical tasks,
such as reporting, decision-making, and machine learning modeling.

1. Normalization is the process of adjusting the scale of data to ensure that each feature
contributes equally to the analysis or machine learning model. It is particularly useful
when working with algorithms that are sensitive to the scale of input data (e.g., k-nearest
neighbors, neural networks).
2. Discretization is the process of converting continuous data into discrete intervals or
bins. This is useful for simplifying the data and for making it compatible with certain types
of machine learning algorithms (e.g., decision trees).
3. Aggregation involves summarizing or combining multiple values into a single value,
typically at a higher level of granularity. This is often used to reduce the dimensionality of
data and to improve computational efficiency.
4. Data encoding involves converting categorical variables into numerical representations.
This is necessary for machine learning algorithms that require numeric inputs.
5. Feature extraction involves creating new features from existing data to better capture
the underlying patterns in the data, improving the performance of machine learning
models.
6. Smoothing is used to remove noise or outliers in the data, making it easier to identify
trends and patterns.
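A short Python sketch, assuming pandas and scikit-learn are installed, showing three of the strategies above (normalization, discretization, and encoding) on invented data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "income": [20000, 35000, 50000, 120000],
    "city":   ["Pune", "Mumbai", "Pune", "Delhi"],
})

# Normalization: rescale income to the [0, 1] range
df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Discretization: bin income into three labelled groups
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])

# Encoding: convert the categorical city column into numeric dummy columns
df = pd.get_dummies(df, columns=["city"])
print(df)
```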

What are the different methods for measuring the data dispersion?

1. Range - The range is the simplest measure of dispersion and is the difference between
the maximum and minimum values in a dataset.

Range=Max value−Min value

2. Variance - measures the average squared deviation of each data point from the
mean. It provides an idea of how much the values differ from the average.
3. Standard Deviation - The standard deviation is the square root of the variance
and provides a measure of the average distance of data points from the mean, in
the same units as the data.
4. Interquartile Range (IQR) - measures the spread of the middle 50% of the data. It is
the difference between the third quartile (Q3) and the first quartile (Q1):
IQR=Q3−Q1
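A minimal NumPy sketch computing all four dispersion measures on a made-up sample:

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])

data_range = data.max() - data.min()   # Range = max value - min value
variance = np.var(data, ddof=1)        # sample variance
std_dev = np.std(data, ddof=1)         # sample standard deviation

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                          # Interquartile range

print(data_range, variance, std_dev, iqr)
```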
