Introduction to Data Science (23CSH-283)
Compiled by: Subhayu
Contents:
(Click on a unit below to skip to that unit)
Unit 1
Unit 2
Unit 3
MST 1 and 2 solutions
Sample Questions
UNIT-1 : Data Science - An Overview
Contact Hours: 10
Chapter 1 : Introduction
Definition and Description
1. Data: Information in raw form that can be structured (tables, databases)
or unstructured (text, images, videos).
○ Big Data: Extremely large datasets that traditional data-processing
tools cannot handle efficiently.
○ Metadata: Data about data, providing information like structure,
format, and origin.
2. Data Science Pipeline:
○ Data Acquisition: Gathering raw data.
○ Data Cleaning: Removing inaccuracies or inconsistencies.
○ Exploratory Data Analysis (EDA): Gaining preliminary insights into
the dataset.
○ Model Building: Using statistical or machine learning models.
○ Model Evaluation: Testing the accuracy and performance of models.
○ Deployment: Applying the model to real-world data.
3. Machine Learning (ML): A subset of AI focused on building models that
enable computers to learn from and make decisions based on data.
4. Artificial Intelligence (AI): Broader than ML, it involves machines mimicking
human intelligence to perform tasks.
5. Feature Engineering: The process of selecting, transforming, or creating
variables (features) to improve model performance.
6. Overfitting and Underfitting:
○ Overfitting: Model performs well on training data but poorly on new
data.
○ Underfitting: Model is too simple and performs poorly on both
training and test data.
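A minimal sketch of how these two failure modes show up in practice (scikit-learn with synthetic data; all numbers are hypothetical):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data: a noisy quadratic relationship
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = X.ravel() ** 2 + rng.normal(0, 1, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 2, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))
# Degree 1 scores poorly on both sets (underfitting); degree 15 scores
# near-perfectly on training data but worse on test data (overfitting).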
1. Data Wrangling: The process of cleaning and structuring raw data into a
desired format.
2. Exploratory Data Analysis (EDA):
○ Understanding data characteristics using visualization (e.g.,
histograms, scatter plots).
○ Summarizing data using descriptive statistics like mean, median,
mode, and variance.
3. Statistical Modeling:
○ Regression Analysis: Understanding the relationship between
variables.
○ Hypothesis Testing: Checking if assumptions about data are valid.
4. Machine Learning:
○ Supervised Learning: Predicting outcomes using labeled data (e.g.,
Linear Regression, Decision Trees).
○ Unsupervised Learning: Finding patterns in unlabeled data (e.g.,
Clustering, PCA).
○ Reinforcement Learning: Learning through trial and error to
maximize rewards.
5. Data Visualization: Creating visual representations of data using tools like
Matplotlib, Seaborn, Tableau, and Power BI.
6. Big Data Analytics: Using frameworks like Hadoop, Spark to process and
analyze massive datasets.
7. Natural Language Processing (NLP): Techniques for analyzing and
processing text data (e.g., sentiment analysis, text summarization).
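A short sketch tying items 2 and 5 together, summarizing and visualizing a small dataset (pandas and Matplotlib; the student data is made up for illustration):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical student data for illustration
df = pd.DataFrame({
    "height_ft": [5.2, 5.6, 5.9, 6.1, 5.4, 5.8],
    "weight_kg": [55, 62, 70, 78, 58, 68],
})

print(df.describe())   # mean, std, and quartiles per column
print(df.corr())       # correlation between height and weight

df["height_ft"].hist()                          # distribution of heights
df.plot.scatter(x="height_ft", y="weight_kg")   # check for correlation
plt.show()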
1. Data Collection: Gathering data from various sources such as databases,
APIs, sensors, or web scraping.
○ Tools: SQL, NoSQL, APIs.
2. Data Processing: Cleaning and transforming raw data into usable formats.
○ Techniques: Handling missing values, normalization, encoding
categorical data.
3. Data Analysis: Analyzing data to identify trends, correlations, and
patterns.
○ Methods: Statistical analysis, exploratory data analysis (EDA).
4. Data Visualization: Representing data visually for better understanding.
○ Tools: Tableau, Power BI, Matplotlib, Seaborn.
5. Modeling and Algorithms: Using machine learning or statistical models for
predictions and solutions.
○ Examples: Regression, Classification, Clustering.
6. Deployment and Communication: Deploying models in production and
communicating results to stakeholders.
○ Tools: Flask, Streamlit, Dash, Excel for reports.
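A compressed, illustrative pass through the six steps above, using scikit-learn's bundled Iris dataset so the sketch stays self-contained (the model choice and parameters are only examples):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2. Collect and process: load data, handle missing values
iris = load_iris(as_frame=True)
df = iris.frame.dropna()

# Step 3. Analyze: a quick look at per-class averages
print(df.groupby("target").mean())

# Step 5. Model: fit a classifier on a train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df[iris.feature_names], df["target"], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 6. Communicate: report the evaluation result
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))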
The Data Science hierarchy describes the step-by-step process involved in data science workflows, moving from data collection at the base up to modeling and decision-making at the top.
🔹 Example:
● Predicting stock prices → Uses probability & statistics.
● Image recognition → Uses linear algebra for processing pixel data.
🔹 Example:
● Weather forecasting → Predicts rain probability based on past data.
Statistics:
🔹 Example:
● Medical studies → Analyzing patient recovery data using statistical tests.
🔹 Example:
● A company wants to analyze employee salaries. They compute average
salary, salary distribution, and standard deviation to understand
disparities.
🔹 Example:
● Predicting house prices based on past sales data.
🔹 Example:
● Amazon’s recommendation system suggests products based on user
preferences and past purchases.
EDA Steps:
🔹 Example:
● Descriptive Statistics: "The average height of students in a class is 5.6 ft."
● Exploratory Data Analysis: "Let's check if height and weight are
correlated using a scatter plot."
✅ Conclusion:
1. Descriptive Models: Summarize past data patterns (e.g., mean, variance,
histograms).
2. Predictive Models: Forecast future trends using past data (e.g., regression
models).
3. Prescriptive Models: Suggest actions based on predictive insights (e.g.,
decision trees).
Statistical models play a crucial role in machine learning, data analysis, and
hypothesis testing by allowing us to quantify relationships between variables.
2. Descriptive Statistics
Descriptive statistics help summarize and organize data for easy interpretation.
● Mean (Average):
Mean = (ΣXᵢ) / n
It is sensitive to outliers.
● Skewness:
○ Positive Skew: Tail on the right, data is concentrated on the left.
○ Negative Skew: Tail on the left, data is concentrated on the right.
● Kurtosis: Measures the "tailedness" of a distribution (high kurtosis = heavy
tails).
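A small pandas sketch of these measures; the salaries are hypothetical, with one deliberate outlier to show why the mean is sensitive to outliers:

import pandas as pd

# Hypothetical salaries; one outlier at the top
salaries = pd.Series([30_000, 32_000, 35_000, 38_000, 40_000, 250_000])

print(salaries.mean())     # 70833.33 -> pulled up by the outlier
print(salaries.median())   # 36500.0  -> robust to the outlier
print(salaries.skew())     # > 0: positive skew, long right tail
print(salaries.kurtosis()) # heavy tails relative to a normal curve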
3. Notion of Probability
4. Probability Distributions
5.2 Variance
Var(X) = E[X²] − (E[X])²
5.3 Covariance
Cov(X,Y)=E[(X−μX)(Y−μY)]
6. Covariance Matrix
Σ = E[(X − μ)(X − μ)ᵀ]
where:
● X is a vector of variables.
● μ is the mean vector.
● Σ is the covariance matrix.
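A minimal NumPy sketch of computing μ and Σ from data (the observations below are hypothetical):

import numpy as np

# Hypothetical observations: each row is a sample, each column a variable
data = np.array([[2.0, 8.0],
                 [4.0, 10.0],
                 [6.0, 14.0],
                 [8.0, 16.0]])

mu = data.mean(axis=0)              # mean vector (one mean per variable)
sigma = np.cov(data, rowvar=False)  # covariance matrix (2 x 2)

print(mu)     # [5. 12.]
print(sigma)  # diagonal: variances; off-diagonal: Cov(X, Y)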
Applications:
1. Supervised Learning
● Examples:
○ Handwriting recognition
2. Unsupervised Learning
● Examples:
○ Customer segmentation
○ Market basket analysis
3. Reinforcement Learning
● Examples:
○ Robotics
○ Self-driving cars
Key components:
● Examples:
○ Image recognition
○ Speech recognition
○ Language translation
○ ChatGPT
Structure:
Conclusion
Machine Learning is the engine driving intelligent systems today. Understanding its
various types and their applications helps in building efficient solutions tailored to
different kinds of data and problems. As data grows, so does the scope and power
of ML.
Machine Learning algorithms are the core tools used to analyze data, learn from
patterns, and make decisions or predictions without being explicitly programmed.
● Naive Bayes (Classification): Uses probability and Bayes’ Theorem for text classification and spam filtering.
● Neural Networks (Both classification and regression): Mimic the human brain to learn from large datasets.
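As a rough illustration of the Naive Bayes entry above, a tiny spam filter (scikit-learn; the training messages are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training messages for illustration
texts = ["win a free prize now", "limited offer click here",
         "meeting at noon tomorrow", "please review the report"]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed the Bayes' Theorem-based classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize offer"]))        # likely ['spam']
print(model.predict(["report for the meeting"]))  # likely ['ham']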
2. Importance of Machine Learning in Today’s Business
6. Healthcare:
7. Finance:
3. Classification vs Prediction
Both classification and prediction are part of Supervised Learning, but they serve
different purposes.
● Classification examples: Spam or Not Spam, Yes or No, Class A/B/C.
● Prediction examples: House price, temperature, sales forecast.
Example of Classification:
Given customer data, predict whether they will buy a product (Yes/No).
Example of Prediction:
Given features like square footage, number of bedrooms, and location, predict the
price of a house.
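Both examples can be sketched with decision trees, which handle classification and prediction (regression) alike; the features, labels, and prices below are hypothetical:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical customer features -> discrete label (classification)
X_c = [[25, 30_000], [40, 80_000], [35, 60_000], [22, 25_000]]  # age, income
y_c = ["No", "Yes", "Yes", "No"]            # will they buy the product?
clf = DecisionTreeClassifier().fit(X_c, y_c)
print(clf.predict([[38, 70_000]]))          # a discrete class, e.g. ['Yes']

# Hypothetical house features -> continuous value (prediction/regression)
X_r = [[1200, 2], [2000, 3], [1500, 3], [2500, 4]]  # sq ft, bedrooms
y_r = [200_000, 340_000, 260_000, 420_000]          # price
reg = DecisionTreeRegressor().fit(X_r, y_r)
print(reg.predict([[1800, 3]]))             # a continuous value, e.g. [340000.]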
5. Conclusion
Solutions/Answers:
Answer:
Answer:
An overdetermined system has more equations than unknowns. This often arises in real-world data, where measurements or constraints outnumber the variables.
Characteristics:
x + y = 2
2x + 3y = 5
4x + 5y = 6
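The three equations above are inconsistent (the first two force x = y = 1, which violates the third), so only an approximate solution exists. A NumPy least squares sketch:

import numpy as np

# The system above in matrix form: A @ v = b, with v = [x, y]
A = np.array([[1.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0]])
b = np.array([2.0, 5.0, 6.0])

v, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(v)         # best-fit values, about [-1.5, 2.5]
print(residual)  # nonzero sum of squared errors: no exact solution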
Answer:
Answer:
Impact on Decision-Making:
Answer:
1. Structured Data
● Definition: Data organized in a fixed schema of rows and columns, as in relational database tables.
● Tools: SQL, Excel
2. Semi-Structured Data
● Definition: Data that doesn't reside in a traditional database but still has
some organizational properties (tags or markers).
Example:
{
"Name": "Alice",
"Age": 30,
"Salary": 50000
}
● Format: XML, JSON
3. Unstructured Data
● Definition: Data with no predefined format or organization, such as free text, images, audio, and video.
Answer:
An overdetermined system has more equations than unknowns (m > n). Such
systems often have no exact solution, especially when inconsistent.
Ax = b
Where:
● A is the m × n coefficient matrix (m > n),
● x is the vector of n unknowns,
● b is the vector of m observed values.
Since exact solutions often don’t exist, we approximate x such that the error
between Ax and b is minimized. This leads to least squares approximation.
Pseudo-Inverse Approach: the least squares solution is x = (AᵀA)⁻¹Aᵀb, often written x = A⁺b, where A⁺ is the Moore-Penrose pseudo-inverse of A.
Why it is important:
Example:
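A minimal NumPy sketch of the pseudo-inverse approach, reusing the small inconsistent system from the earlier answer:

import numpy as np

A = np.array([[1.0, 1.0], [2.0, 3.0], [4.0, 5.0]])  # m = 3 > n = 2
b = np.array([2.0, 5.0, 6.0])

x = np.linalg.pinv(A) @ b   # x = A⁺b, the least squares solution
print(x)                    # about [-1.5, 2.5]
print(A @ x)                # the closest achievable approximation of b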
Solutions/Answers:
Answer:
Example: In linear regression, gradient descent finds the optimal slope and
intercept that minimize the error between predicted and actual values.
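A bare-bones gradient descent sketch for exactly this case (the data points are hypothetical, chosen to lie near y = 2x + 1):

import numpy as np

# Hypothetical data roughly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

slope, intercept, lr = 0.0, 0.0, 0.01   # start from zero, small step size
for _ in range(5000):
    pred = slope * x + intercept
    # Gradients of the mean squared error with respect to each parameter
    d_slope = -2 * np.mean(x * (y - pred))
    d_intercept = -2 * np.mean(y - pred)
    slope -= lr * d_slope
    intercept -= lr * d_intercept

print(round(slope, 2), round(intercept, 2))  # converges to about 1.94 and 1.15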
Significance:
● Speeds up decision-making
Example: A line graph showing the rise in global temperature over the years
quickly communicates climate trends.
Answer:
Answer:
The Learning Rate (η) controls how big a step the gradient descent algorithm
takes toward the minimum during each iteration.
Roles:
● A large learning rate converges faster but risks overshooting the minimum or diverging.
● A small learning rate takes safer steps but converges slowly and may stall before reaching the minimum.
Balance is key: An ideal learning rate ensures the algorithm converges efficiently
without oscillation or divergence.
Illustration:
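A small numeric illustration (a toy loss f(x) = x² with its minimum at x = 0; the learning rates are arbitrary examples):

def descend(lr, x=10.0, steps=20):
    for _ in range(steps):
        x -= lr * 2 * x   # gradient of f(x) = x**2 is 2x
    return x

print(descend(0.01))  # ~6.68: too small, still far from the minimum
print(descend(0.1))   # ~0.12: converges steadily
print(descend(1.1))   # ~383 and growing: overshoots and diverges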
Answer:
Let’s say a company claims their product has an average weight of 500g, but a
competitor suspects it’s less.
We take a sample of 10 products and find the average weight = 490g, with
standard deviation = 15g.
● Null Hypothesis (H₀): μ = 500 (the mean weight is 500g, as claimed)
● Alternative Hypothesis (H₁): μ < 500 (the mean is less than 500g)
● α = 0.05 (5%)
Use a t-test (since the population standard deviation is unknown and n < 30):
t = (x̄ − μ₀) / (s / √n) = (490 − 500) / (15 / √10) ≈ −2.11
The one-tailed critical value at α = 0.05 with 9 degrees of freedom is −1.833. Since −2.11 < −1.833, we reject H₀.
Step 6: Conclusion
There is enough evidence at 5% level to conclude that the average weight is less
than 500g.
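The same test can be checked numerically from the summary statistics (SciPy; a sketch, assuming the figures given above):

from math import sqrt
from scipy.stats import t

x_bar, mu0, s, n = 490.0, 500.0, 15.0, 10
t_stat = (x_bar - mu0) / (s / sqrt(n))
p_value = t.cdf(t_stat, df=n - 1)   # one-tailed (left-sided) p-value
print(t_stat, p_value)              # about -2.11 and 0.032
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")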
Answer:
1. Measures of Central Tendency: mean, median, mode.
2. Measures of Dispersion: range, variance, standard deviation, interquartile range.
3. Measures of Shape: skewness and kurtosis.
Example: Positive skew indicates long tail on the right (income distributions often
show this).
Summary Table:
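One way to produce such a summary table in pandas (the income figures are hypothetical, skewed to the right as in the example above):

import pandas as pd

# Hypothetical, right-skewed income data
income = pd.Series([25_000, 28_000, 30_000, 32_000, 35_000,
                    40_000, 48_000, 60_000, 95_000, 180_000])

summary = income.describe()       # count, mean, std, min, quartiles, max
summary["skew"] = income.skew()   # > 0: long tail on the right
summary["kurtosis"] = income.kurtosis()
print(summary)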
9. State any two differences between deep learning and traditional ML.
5. Discuss three key types of machine learning algorithms and their areas of
application.
2. Assume a dataset with customer transactions for a bank. Design a machine
learning approach to classify customers as ‘high risk’ or ‘low risk’.
5. Design a use-case that combines supervised and unsupervised learning for a
real-world business scenario.