Introduction to Data Science __ 23CSH-283

The document provides an introduction to Data Science, covering its definition, key components, and techniques such as data collection, processing, analysis, and visualization. It discusses the importance of mathematics and statistics in Data Science, including probability, descriptive and predictive statistics, and exploratory data analysis. Additionally, it highlights the challenges and opportunities within the field, emphasizing the role of data science in improving decision-making and business strategies.

Uploaded by

anilskoooo137
Copyright
© © All Rights Reserved

Introduction to Data Science

(23CSH-283)

ALL UNITS - NOTES & QUESTIONS

​ Compiled by : Subhayu

Contents :
Unit 1
Unit 2
Unit 3
MST 1 and 2 solutions
Sample Questions
UNIT-1 : Data Science - An Overview

Contact Hours: 10

Chapter 1 : Introduction
Definition and Description

Data Science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured and
unstructured data. It combines elements of mathematics, statistics, computer
science, and domain expertise to analyze and solve real-world problems.

●​ Key Components of Data Science:


○​ Data Collection: Gathering data from various sources (databases,
web scraping, sensors, etc.).
○​ Data Processing: Cleaning, transforming, and organizing data to
make it usable.
○​ Data Analysis: Using statistical methods and algorithms to
understand data patterns.
○​ Modeling: Building predictive or descriptive models using Machine
Learning (ML) and Artificial Intelligence (AI).
○​ Visualization: Presenting data insights in a human-readable format
using charts, graphs, and dashboards.

Important Terminologies in Data Science

1.​ Data: Information in raw form that can be structured (tables, databases)
or unstructured (text, images, videos).
○​ Big Data: Extremely large datasets that traditional data-processing
tools cannot handle efficiently.
○​ Metadata: Data about data, providing information like structure,
format, and origin.
2.​ Data Science Pipeline:
○​ Data Acquisition: Gathering raw data.
○​ Data Cleaning: Removing inaccuracies or inconsistencies.
○​ Exploratory Data Analysis (EDA): Gaining preliminary insights into
the dataset.
○​ Model Building: Using statistical or machine learning models.
○​ Model Evaluation: Testing the accuracy and performance of models.
○​ Deployment: Applying the model to real-world data.
3.​ Machine Learning (ML): A subset of AI focused on building models that
enable computers to learn from and make decisions based on data.
4.​ Artificial Intelligence (AI): Broader than ML, it involves machines mimicking
human intelligence to perform tasks.
5.​ Feature Engineering: The process of selecting, transforming, or creating
variables (features) to improve model performance.
6.​ Overfitting and Underfitting:
○​ Overfitting: Model performs well on training data but poorly on new
data.
○​ Underfitting: Model is too simple and performs poorly on both
training and test data.

Overview of Data Science Techniques

1.​ Data Wrangling: The process of cleaning and structuring raw data into a
desired format.
2.​ Exploratory Data Analysis (EDA):
○​ Understanding data characteristics using visualization (e.g.,
histograms, scatter plots).
○​ Summarizing data using descriptive statistics like mean, median,
mode, and variance.
3.​ Statistical Modeling:
○​ Regression Analysis: Understanding the relationship between
variables.
○​ Hypothesis Testing: Checking if assumptions about data are valid.
4.​ Machine Learning:
○​ Supervised Learning: Predicting outcomes using labeled data (e.g.,
Linear Regression, Decision Trees).
○​ Unsupervised Learning: Finding patterns in unlabeled data (e.g.,
Clustering, PCA).
○​ Reinforcement Learning: Learning through trial and error to
maximize rewards.
5.​ Data Visualization: Creating visual representations of data using tools like
Matplotlib, Seaborn, Tableau, and Power BI.
6. Big Data Analytics: Using frameworks like Hadoop and Spark to process and
analyze massive datasets.
7.​ Natural Language Processing (NLP): Techniques for analyzing and
processing text data (e.g., sentiment analysis, text summarization).

Challenges in Data Science

1.​ Data Quality Issues:


○​ Missing, inconsistent, or inaccurate data can hinder analysis.
2.​ High Dimensionality:
○​ Handling datasets with many features or variables is computationally
challenging.
3.​ Data Privacy and Security:
○​ Ensuring the ethical and secure use of sensitive data.
4.​ Scalability:
○​ Managing and processing massive datasets efficiently.
5.​ Model Interpretability:
○​ Making complex machine learning models understandable to
stakeholders.
6.​ Domain Knowledge:
○​ Lack of subject expertise can lead to incorrect assumptions or
interpretations.
7.​ Evolving Tools and Techniques:
○​ Rapidly changing technologies make it challenging to keep up.
Opportunities in Data Science

1.​ Business Insights:


○​ Providing actionable insights to improve business strategies.
2.​ Personalization:
○​ Enhancing customer experiences through recommendation systems
(e.g., Netflix, Amazon).
3.​ Healthcare:
○​ Using predictive analytics for early diagnosis and personalized
treatments.
4.​ Automation:
○​ Automating repetitive tasks with AI-driven systems.
5.​ Fraud Detection:
○​ Identifying anomalies in financial transactions to prevent fraud.
6.​ Environmental Monitoring:
○​ Using data to track climate change, predict natural disasters, etc.

Chapter 2 : Data Science and Business Analytics

1. Difference between Data Science and Business Analytics
2. Importance of Data Science

●​ Improved Decision-Making: Provides actionable insights for better business


strategies.
●​ Automation: Enables the development of AI systems to automate routine
tasks.
●​ Trend Identification: Helps detect patterns and trends for proactive
business strategies.
●​ Customer Understanding: Personalizes user experiences by analyzing
consumer behavior.
●​ Optimization: Optimizes processes, resources, and operations to increase
efficiency.
●​ Competitive Advantage: Offers deeper insights than traditional analytics,
helping businesses stay ahead.

3. Primary Components of Data Science

1.​ Data Collection: Gathering data from various sources such as databases,
APIs, sensors, or web scraping.
○​ Tools: SQL, NoSQL, APIs.
2.​ Data Processing: Cleaning and transforming raw data into usable formats.
○​ Techniques: Handling missing values, normalization, encoding
categorical data.
3.​ Data Analysis: Analyzing data to identify trends, correlations, and
patterns.
○​ Methods: Statistical analysis, exploratory data analysis (EDA).
4.​ Data Visualization: Representing data visually for better understanding.
○​ Tools: Tableau, Power BI, Matplotlib, Seaborn.
5.​ Modeling and Algorithms: Using machine learning or statistical models for
predictions and solutions.
○​ Examples: Regression, Classification, Clustering.
6.​ Deployment and Communication: Deploying models in production and
communicating results to stakeholders.
○​ Tools: Flask, Streamlit, Dash, Excel for reports.

4. Users of Data Science

●​ Business Analysts: Use insights for strategic planning and decision-making.


●​ Data Scientists: Build predictive models and develop machine learning
algorithms.
●​ Marketing Professionals: Analyze customer behavior and create targeted
campaigns.
●​ Healthcare Professionals: Predict diseases and improve patient care.
●​ Engineers: Use data science for predictive maintenance and system
optimization.
●​ Government: Leverage data for policy-making and citizen services.

5. Data Science Hierarchy

The Data Science hierarchy describes the step-by-step process involved in data
science workflows:

1.​ Data Collection


○​ Collecting structured and unstructured data from multiple sources.
2.​ Data Cleaning and Preprocessing
○​ Removing errors, handling missing values, and transforming data.
3.​ Data Exploration (EDA)
○​ Understanding data distributions, patterns, and anomalies.
4.​ Feature Engineering
○​ Creating new features and selecting the most relevant ones.
5.​ Model Building
○​ Training predictive models using machine learning techniques.
6.​ Model Evaluation
○​ Testing model performance using metrics like accuracy, precision,
recall, etc.
7.​ Deployment and Monitoring
○​ Deploying the model in production and monitoring its performance
over time.
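
The evaluation metrics named in step 6 can be computed directly from predicted and actual labels. The sketch below is a minimal illustration for a binary problem; the function name `classification_metrics` and the label lists are hypothetical and not part of the original notes.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: predicted negative but actually positive.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Accuracy counts all correct predictions, precision asks how many predicted positives were right, and recall asks how many actual positives were found. Which one matters most depends on the cost of each kind of error.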

Chapter 3 : Linear Algebra in Data Science


Sample Questions :
UNIT-2: Mathematics & Statistics in Data Science
Contact Hours: 10

Chapter 4 : Mathematics in Data Science
1. Role of Mathematics in Data Science

Mathematics is the backbone of Data Science, enabling modeling, analysis, and


decision-making. The key mathematical areas used in Data Science include:

●​ Linear Algebra → Used for handling datasets (e.g., matrices in machine


learning).
●​ Probability & Statistics → Helps in predictions, measuring uncertainty, and
hypothesis testing.
●​ Calculus → Used in optimization (e.g., gradient descent for training ML
models).
●​ Discrete Mathematics → Important for algorithms and data structures in
Data Science.

🔹 Example:
●​ Predicting stock prices → Uses probability & statistics.
●​ Image recognition → Uses linear algebra for processing pixel data.

2. Importance of Probability & Statistics in Data Science
Probability:

●​ Measures the likelihood of an event happening.


●​ Used in Bayesian inference, Markov Chains, and Machine Learning models.
●​ Helps in making predictions based on past data.

🔹 Example:
●​ Weather forecasting → Predicts rain probability based on past data.

Statistics:

●​ Helps in analyzing, summarizing, and visualizing data.


●​ Provides insights into data trends, variability, and patterns.
●​ Used in feature selection, anomaly detection, and model evaluation.

🔹 Example:
●​ Medical studies → Analyzing patient recovery data using statistical tests.

3. Important Types of Statistical Measures in Data Science

(i) Descriptive Statistics

●​ Summarizes data to provide insights.


●​ Includes: Mean, Median, Mode, Standard Deviation, Variance.
●​ Used in: Exploratory Data Analysis (EDA).

🔹 Example:
●​ A company wants to analyze employee salaries. They compute average
salary, salary distribution, and standard deviation to understand
disparities.

(ii) Predictive Statistics


●​ Helps in making future predictions based on patterns in data.
●​ Includes: Regression, Time Series Analysis.
●​ Used in: Forecasting trends and future outcomes.

🔹 Example:
●​ Predicting house prices based on past sales data.

(iii) Prescriptive Statistics

●​ Provides actionable recommendations based on data analysis.


●​ Uses optimization techniques and decision models to suggest the best
actions.

🔹 Example:
●​ Amazon’s recommendation system suggests products based on user
preferences and past purchases.

4. Exploratory Data Analysis (EDA) and Visualization Techniques

EDA is the process of analyzing datasets to summarize key characteristics using
visuals and statistics.

EDA Steps:

1.​ Understand the dataset (Columns, Data Types).


2.​ Check for missing values.
3.​ Detect outliers using boxplots.
4.​ Identify patterns & correlations using scatter plots, histograms, and
heatmaps.
5.​ Summarize statistics using mean, variance, standard deviation.

🔹 Common Visualization Techniques:


●​ Histograms → Show frequency distribution.
●​ Boxplots → Detect outliers.
●​ Scatter Plots → Identify relationships between two variables.
●​ Heatmaps → Show correlations between multiple variables.
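
The EDA steps above can be sketched without any plotting library. This is an illustrative example only: the records, column names, and the helper `pearson` are hypothetical, and the outlier check uses the 1.5 × IQR rule that a boxplot draws as its whiskers.

```python
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"height_cm": 150, "weight_kg": 50},
    {"height_cm": 160, "weight_kg": 58},
    {"height_cm": 165, "weight_kg": None},
    {"height_cm": 170, "weight_kg": 66},
    {"height_cm": 175, "weight_kg": 70},
    {"height_cm": 180, "weight_kg": 120},
]

# Steps 1-2: list the columns and count missing values in each.
columns = list(rows[0].keys())
missing = {c: sum(1 for r in rows if r[c] is None) for c in columns}

# Step 3: flag outliers with the 1.5 * IQR rule (the boxplot rule).
weights = [r["weight_kg"] for r in rows if r["weight_kg"] is not None]
q1, _, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1
outliers = [w for w in weights if w < q1 - 1.5 * iqr or w > q3 + 1.5 * iqr]


# Step 4: Pearson correlation, the number a heatmap cell would show.
def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


complete = [r for r in rows if None not in r.values()]
r_hw = pearson([r["height_cm"] for r in complete],
               [r["weight_kg"] for r in complete])

# Step 5: summary statistics for one column.
summary = {"mean": statistics.fmean(weights), "stdev": statistics.pstdev(weights)}
print(missing, outliers, round(r_hw, 3), summary)
```

In practice the same checks are usually done with a dataframe library and plotted, but the underlying calculations are the ones shown here.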

5. Difference Between Exploratory and Descriptive Statistics

| Feature | Exploratory Statistics (EDA) | Descriptive Statistics |
|---------|------------------------------|------------------------|
| Purpose | Finds patterns & relationships in data | Summarizes data in a meaningful way |
| Tools Used | Graphs, visualizations, hypothesis testing | Measures like mean, median, standard deviation |
| Example | Checking missing values, detecting trends in data | Computing average sales revenue of a company |

🔹 Example:
●​ Descriptive Statistics: "The average height of students in a class is 5.6 ft."
●​ Exploratory Data Analysis: "Let's check if height and weight are
correlated using a scatter plot."

✅ Conclusion:

●​ Mathematics is essential for data-driven decision-making.


●​ Probability & Statistics help in understanding data, predicting trends, and
making decisions.
●​ EDA & Visualization are crucial for analyzing datasets and finding hidden
patterns.
Chapter 5 : Statistics in Data Science

1. Statistical Modeling in Data Science

Statistical modeling is the process of applying statistical techniques to understand


data patterns, relationships, and trends. It helps in making predictions, estimating
probabilities, and optimizing decision-making processes.

Types of Statistical Models

1.​ Descriptive Models: Summarize past data patterns (e.g., mean, variance,
histograms).
2.​ Predictive Models: Forecast future trends using past data (e.g., regression
models).
3.​ Prescriptive Models: Suggest actions based on predictive insights (e.g.,
decision trees).

Statistical models play a crucial role in machine learning, data analysis, and
hypothesis testing by allowing us to quantify relationships between variables.

2. Descriptive Statistics

Descriptive statistics help summarize and organize data for easy interpretation.

2.1 Measures of Central Tendency

These describe the "center" of a dataset:

● Mean (Average): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} X_i$
It is sensitive to outliers.

●​ Median: The middle value when data is sorted. It is robust to outliers.​


●​ Mode: The most frequently occurring value in the dataset. Useful for
categorical data.​
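
A short sketch of why the mean is sensitive to outliers while the median is robust. The salary figures are hypothetical.

```python
import statistics

salaries = [30, 32, 35, 35, 38, 40]      # hypothetical salaries, in thousands
with_outlier = salaries + [400]          # the same data plus one extreme value

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)      # pulled strongly upward

median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)  # barely affected

mode_value = statistics.mode(salaries)          # most frequent value

print(mean_before, mean_after)
print(median_before, median_after)
print(mode_value)
```

One extreme salary more than doubles the mean, while the median stays at 35. This is why the median is often reported for skewed data such as incomes.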

2.2 Measures of Dispersion (Spread of Data)

● Range: Difference between the maximum and minimum values.
● Variance: Measures the spread of data around the mean $\bar{x}$: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{x})^2$
● Standard Deviation: Square root of variance, providing a more
interpretable measure of data spread: $\sigma = \sqrt{\text{variance}}$
● Interquartile Range (IQR): Difference between the 75th and 25th
percentiles, useful for detecting outliers.
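
The four dispersion measures can be computed with Python's standard `statistics` module. The data values are hypothetical. The sketch uses the population formulas (dividing by n), matching the variance formula in these notes. Quartiles are estimated with the "inclusive" method; other quartile conventions can give slightly different IQR values.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical observations

data_range = max(data) - min(data)
variance = statistics.pvariance(data)    # population variance (divides by n)
std_dev = statistics.pstdev(data)        # square root of the population variance

# Quartiles: cut the sorted data into four parts; IQR = Q3 - Q1.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, variance, std_dev, iqr)
```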

2.3 Shape of Distribution

●​ Skewness:
○​ Positive Skew: Tail on the right, data is concentrated on the left.
○​ Negative Skew: Tail on the left, data is concentrated on the right.
●​ Kurtosis: Measures the "tailedness" of a distribution (high kurtosis = heavy
tails).

3. Notion of Probability

Probability quantifies the likelihood of an event occurring. It is essential in


statistics for modeling uncertainty.

3.1 Basic Probability Rules

● Probability of an event P(A): 0≤P(A)≤1


●​ Addition Rule: P(A∪B)=P(A)+P(B)−P(A∩B)
●​ Multiplication Rule (for independent events): P(A∩B)=P(A)×P(B)
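
The addition and multiplication rules can be checked by counting outcomes. The sketch below assumes a fair six-sided die and uses exact fractions; the events A and B are chosen only for illustration.

```python
from fractions import Fraction

outcomes = set(range(1, 7))          # faces of one fair six-sided die


def prob(event):
    """Probability of an event (a set of faces) when all faces are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))


A = {2, 4, 6}        # the roll is even
B = {4, 5, 6}        # the roll is greater than 3

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)

# Multiplication rule: two separate dice are independent, so
# P(both show 6) should equal P(6) * P(6).
two_dice = [(d1, d2) for d1 in outcomes for d2 in outcomes]
both_six = Fraction(sum(1 for d1, d2 in two_dice if d1 == 6 and d2 == 6),
                    len(two_dice))

print(lhs, rhs, both_six)
```

Subtracting P(A∩B) avoids double-counting the faces 4 and 6, which belong to both events.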

3.2 Conditional Probability


P(A∣B)=P(A∩B)/P(B)
It describes the probability of event A occurring given that event B has already
occurred.

3.3 Bayes’ Theorem


P(A∣B)=[P(B∣A)×P(A)] / P(B)

Used in classification algorithms like Naïve Bayes in machine learning.
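
A worked instance of Bayes' theorem in the form P(A∣B) = P(B∣A) × P(A) / P(B), where A is "the email is spam" and B is "the email contains the word free". All probabilities below are hypothetical numbers chosen for illustration.

```python
# Hypothetical inputs for a toy spam filter.
p_spam = 0.2                # P(A): prior probability that an email is spam
p_word_given_spam = 0.6     # P(B|A): the word appears in a spam email
p_word_given_ham = 0.05     # P(B|not A): the word appears in a legitimate email

# Law of total probability gives P(B), the denominator.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_word, 4), round(p_spam_given_word, 4))
```

Seeing the word raises the probability of spam from the prior 0.2 to about 0.75. A Naïve Bayes classifier repeats this update for many words, assuming the words are conditionally independent given the class.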

4. Probability Distributions

A probability distribution describes how probability is spread over the possible values of a random variable.

4.1 Discrete Probability Distributions

1.​ Bernoulli Distribution: Models a single binary outcome (success/failure).


2. Binomial Distribution: Number of successes in a fixed number of independent
trials, each with the same probability of success.
3. Poisson Distribution: Number of events occurring in a fixed interval of time
or space, given a constant average rate.

4.2 Continuous Probability Distributions

1.​ Normal (Gaussian) Distribution: Bell-shaped curve used in statistics and


machine learning.
○​ Properties: Symmetric, mean = median = mode, 68%-95%-99.7%
rule.
2.​ Exponential Distribution: Models waiting times between independent
events.
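
The distributions listed above can be evaluated with the standard library. The helper names `binomial_pmf` and `poisson_pmf` and the parameter values are illustrative assumptions. The last lines check the 68%-95%-99.7% rule numerically for a standard normal distribution.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events in an interval with average rate lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


p_three_heads = binomial_pmf(3, 5, 0.5)   # 3 heads in 5 fair coin flips
p_two_events = poisson_pmf(2, 3.0)        # 2 events when 3 are expected on average

# Probability mass within 1, 2 and 3 standard deviations of the mean.
std_normal = NormalDist(mu=0, sigma=1)
within = [std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)]

print(p_three_heads, round(p_two_events, 4), [round(w, 4) for w in within])
```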

5. Mean, Variance, and Covariance

5.1 Mean (Expected Value)

The mean (expected value) of a discrete random variable X is the probability-weighted average of its possible values:

$E[X] = \sum_i X_i \, P(X_i)$

5.2 Variance

Variance measures the spread of data points from the mean:

$\mathrm{Var}(X) = E[X^2] - (E[X])^2$

Higher variance means greater spread.
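
As a check on the shortcut formula Var(X) = E[X²] − (E[X])², the sketch below computes the variance of a fair six-sided die two ways: with the shortcut and directly from the definition E[(X − E[X])²]. The die is an assumed example; exact fractions avoid rounding error.

```python
from fractions import Fraction

faces = range(1, 7)          # outcomes of a fair die
p = Fraction(1, 6)           # each face is equally likely

e_x = sum(x * p for x in faces)                  # E[X]
e_x2 = sum(x * x * p for x in faces)             # E[X^2]

var_shortcut = e_x2 - e_x**2                     # E[X^2] - (E[X])^2
var_definition = sum((x - e_x) ** 2 * p for x in faces)   # E[(X - E[X])^2]

print(e_x, e_x2, var_shortcut, var_definition)
```

Both routes give 35/12, which is about 2.92.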

5.3 Covariance

Covariance measures the relationship between two variables:

Cov(X,Y)=E[(X−μX)(Y−μY)]

●​ Positive Covariance: Variables increase together.


●​ Negative Covariance: One increases while the other decreases.

6. Covariance Matrix

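A covariance matrix summarizes the relationships between multiple variables. For variables X₁, …, Xₙ, the entry in row i and column j is the covariance between Xᵢ and Xⱼ, so the diagonal holds the variances and the matrix is symmetric:

$\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$

Used in Principal Component Analysis (PCA) for dimensionality reduction.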

7. Understanding Univariate and Multivariate Normal Distributions

7.1 Univariate Normal Distribution


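A normal distribution with one variable, defined by its mean μ and standard deviation σ:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$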

7.2 Multivariate Normal Distribution

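An extension of the normal distribution for multiple variables:

$f(X) = \frac{1}{\sqrt{(2\pi)^k \, |\Sigma|}} \exp\left(-\frac{1}{2}(X-\mu)^T \Sigma^{-1} (X-\mu)\right)$

where:

● X is a vector of variables.
● μ is the mean vector.
● Σ is the covariance matrix.
● k is the number of variables.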

Applications:

●​ Used in machine learning, PCA, and Gaussian Mixture Models (GMMs).


UNIT-3: Machine Learning in Data Science
Contact Hours: 10

Chapter 6 : Machine Learning

What is Machine Learning?

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that enables


systems to automatically learn and improve from experience without being
explicitly programmed.​
The primary goal is to develop algorithms that can identify patterns in data and
make decisions or predictions based on it.

Role of Machine Learning in Data Science

Machine Learning plays a crucial role in extracting patterns, making predictions,


and automating decisions within the broader field of Data Science.

●​ Prediction: Predicting customer behavior, product sales, stock prices.​

●​ Classification: Email spam detection, sentiment analysis.​

●​ Clustering: Customer segmentation, anomaly detection.​

●​ Recommendation: Netflix/movie recommendations, product suggestions.​

●​ Automation: Fraud detection, self-driving cars, chatbots.​


Types of Machine Learning Techniques

Machine Learning is broadly divided into four types:

1. Supervised Learning

●​ Definition: The algorithm is trained on labeled data, i.e., input-output pairs


are known.​

●​ Objective: Learn a mapping from inputs to outputs to predict unseen data.​

●​ Examples:​

○​ Email spam detection (Spam / Not Spam)​

○​ Loan approval prediction​

○​ Handwriting recognition​

Types of Supervised Learning:

●​ Classification: Output is a category (e.g., 'yes' or 'no').​

●​ Regression: Output is a continuous value (e.g., price of a house).​

2. Unsupervised Learning

●​ Definition: The algorithm is trained on unlabeled data, and it tries to


discover hidden patterns or structures.​

●​ Objective: Find groupings or relationships in the data.​

●​ Examples:​

○​ Customer segmentation​
○​ Market basket analysis​

○​ Anomaly detection (fraud)​

Types of Unsupervised Learning:

●​ Clustering: Grouping similar data points (e.g., K-means)​

●​ Dimensionality Reduction: Reducing data features (e.g., PCA)​

3. Reinforcement Learning

●​ Definition: The algorithm learns by interacting with an environment. It


receives rewards or penalties based on its actions and learns to maximize
cumulative reward.​

●​ Objective: Learn a policy to take optimal actions.​

●​ Examples:​

○​ Game playing (e.g., AlphaGo)​

○​ Robotics​

○​ Self-driving cars​

Key components:

●​ Agent: Learner or decision-maker​

●​ Environment: Where the agent operates​

●​ Action: What the agent can do​

●​ Reward: Feedback from the environment​


4. Deep Learning

●​ Definition: A subset of machine learning that uses neural networks with


multiple layers to model complex patterns.​

●​ Objective: Automatically extract features and solve problems where


traditional algorithms fail.​

●​ Examples:​

○​ Image recognition​

○​ Speech recognition​

○​ Language translation​

○​ ChatGPT​

Structure:

●​ Built with Artificial Neural Networks (ANNs).​

●​ Works best with large amounts of data and computational power.​

Comparison of Machine Learning Techniques

| Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning | Deep Learning |
|--------|---------------------|-----------------------|------------------------|---------------|
| Data Requirement | Labeled | Unlabeled | Interactive feedback | Large labeled/unlabeled |
| Goal | Predict output | Find structure/patterns | Learn through trial/error | High-level feature learning |
| Example Algorithms | Linear regression, SVM | K-means, PCA | Q-learning, SARSA | CNN, RNN, LSTM |
| Application Areas | Fraud detection, diagnosis | Customer grouping | Game playing, robotics | Vision, NLP, audio |
| Feedback Mechanism | Direct (known outputs) | None | Reward-based | Through error backpropagation |

Scope of Machine Learning

Machine Learning has vast applications across almost every industry:

●​ Healthcare: Disease prediction, drug discovery​

●​ Finance: Stock price prediction, credit scoring​

●​ Retail: Customer behavior prediction, demand forecasting​

●​ Manufacturing: Predictive maintenance, quality control​

●​ Agriculture: Crop yield prediction, soil health monitoring​

●​ Entertainment: Content recommendation​

●​ Transportation: Route optimization, autonomous vehicles​

Conclusion
Machine Learning is the engine driving intelligent systems today. Understanding its
various types and their applications helps in building efficient solutions tailored to
different kinds of data and problems. As data grows, so does the scope and power
of ML.

Chapter 7 : Classification & Prediction

1. Introduction to Machine Learning Algorithms

Machine Learning algorithms are the core tools used to analyze data, learn from
patterns, and make decisions or predictions without being explicitly programmed.

Common Machine Learning Algorithms Used in Classification and Prediction

| Algorithm | Type | Description |
|-----------|------|-------------|
| Linear Regression | Prediction | Models linear relationships between inputs and continuous outputs. |
| Logistic Regression | Classification | Estimates the probability that a data point belongs to a certain class. |
| Decision Trees | Both | Splits the dataset based on features to predict class or value. |
| K-Nearest Neighbors (KNN) | Classification | Classifies based on the most common class among neighbors. |
| Support Vector Machine | Classification | Finds a hyperplane that best separates classes. |
| Naive Bayes | Classification | Uses probability and Bayes’ Theorem for text classification and spam filtering. |
| Random Forest | Both | An ensemble of decision trees to increase accuracy and prevent overfitting. |
| Neural Networks | Both | Mimics the human brain to learn from large datasets. |

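
To make one entry concrete, K-Nearest Neighbors can be sketched in a few lines: find the k closest training points and take a majority vote. The function name `knn_predict`, the training points, and the choice k = 3 are illustrative assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (features, label) pairs; query: a feature tuple.
    """
    # Sort the training points by Euclidean distance to the query, keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D points in two well-separated classes.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

label_near_a = knn_predict(train, (1.8, 1.6))
label_near_b = knn_predict(train, (6.2, 6.4))
print(label_near_a, label_near_b)
```

KNN has no training phase; all the work happens at prediction time, which becomes slow on large datasets.
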
2. Importance of Machine Learning in Today’s Business

Machine Learning provides data-driven decision-making capabilities that enhance


efficiency, customer experience, and profits.

Key Business Benefits:

1.​ Customer Insights:​

○​ Predict purchasing behavior.​

○​ Personalize marketing campaigns.​

2.​ Fraud Detection:​

○​ Identify unusual patterns in transactions.​

3.​ Product Recommendations:​

○ Netflix and Amazon recommend content/products using models that predict
what a user is likely to choose (often framed as classification or ranking).

4.​ Forecasting and Demand Prediction:​

○​ Predict product demand, inventory requirements.​

5.​ Process Automation:​

○​ Chatbots, automated support, document classification.​

6.​ Healthcare:​

○​ Disease prediction and diagnosis models.​

7.​ Finance:​

○​ Credit scoring, stock trend prediction.​

3. Classification vs Prediction
Both classification and prediction are part of Supervised Learning, but they serve
different purposes.

| Feature | Classification | Prediction (Regression) |
|---------|----------------|-------------------------|
| Output Type | Categorical (discrete labels) | Numerical (continuous values) |
| Goal | Assign input to a category | Estimate a numeric value |
| Examples | Spam or Not Spam, Yes or No, Class A/B/C | House price, temperature, sales forecast |
| Algorithms | Decision Trees, SVM, Naive Bayes, KNN | Linear Regression, Decision Trees, ANN |

Example of Classification:

Given customer data, predict whether they will buy a product (Yes/No).

Example of Prediction:

Given features like square footage, number of bedrooms, and location, predict the
price of a house.
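
The prediction example can be sketched with simple linear regression on a single feature. This is a simplification: it uses only square footage, whereas the example above lists several features. The data are hypothetical and exactly linear so the result is easy to verify. `fit_line` is an illustrative helper that applies the closed-form least-squares formulas.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: price (in thousands) = 0.1 * square footage + 50.
sqft = [1000, 1500, 2000, 2500]
price = [150, 200, 250, 300]

slope, intercept = fit_line(sqft, price)
predicted = slope * 1800 + intercept     # estimate for an 1800 sq ft house
print(slope, intercept, predicted)
```

The output is a continuous number, not a category, which is what distinguishes this task from the Yes/No classification example.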

4. Use Cases to Understand the Difference

| Scenario | Task Type |
|----------|-----------|
| Predict if a loan applicant will default | Classification |
| Forecast next quarter’s revenue | Prediction |
| Detect fraudulent transaction | Classification |
| Estimate fuel consumption | Prediction |
| Classify handwritten digits | Classification |

5. Conclusion

●​ Classification is about identifying what category a data point belongs to.​


●​ Prediction is about estimating a value based on input data.​

●​ Both techniques are critical in solving real-world business problems and


contribute significantly to data-driven strategies across industries.​
Solutions of Mid Semester Tests

Mid Semester Test 1

Section A (2x5= 10)


1.​ Identify the advantages of Data Science.
2. Describe the characteristics of overdetermined equation systems.
3.​ Explain the difference between Data Science and Business Analytics.
4.​ Differentiate between structured and unstructured data.
5.​ Describe a scenario where the responsibilities of a data scientist directly
impact decision-making.

Section B (5x2= 10)


6. Explain three types of data with examples. Differentiate among them.
7. Demonstrate how the pseudo-inverse helps in solving an overdetermined
system of linear equations, and why it is important.

Solutions/Answers :

1. Identify the advantages of Data Science.

Answer:

●​ Better Decision Making: Enables data-driven decisions by uncovering


hidden patterns and trends.​

●​ Business Intelligence: Helps businesses improve strategies, marketing,


operations, and financial planning.​

●​ Automation: Machine learning and AI reduce human effort by automating


tasks.​

●​ Customer Insights: Improves customer understanding through behavior


analysis, leading to better personalization.​

●​ Innovation: Enables development of new products and services by analyzing


large datasets.​
●​ Competitive Advantage: Gives organizations a strategic edge over
competitors by utilizing data efficiently.​

2. Describe the characteristics of overdetermined equation


systems.

Answer:

An overdetermined system has more equations than unknowns. This often arises
in real-world data where measurements or constraints are more than the number
of variables.

Characteristics:

●​ Typically no exact solution exists (inconsistent system).​

●​ Often used in regression analysis to find the best approximate solution


using least squares.​

●​ Represented as: Ax = b where A is an m × n matrix with m > n.​

Example: If we have 3 equations and 2 variables:

x + y = 2
2x + 3y = 5
4x + 5y = 6

→ Overdetermined system (3 equations, 2 unknowns)


3. Explain the difference between Data Science and Business
Analytics.

Answer:

| Feature | Data Science | Business Analytics |
|---------|--------------|--------------------|
| Focus | Technical, scientific approach to data | Business-driven insights and strategies |
| Techniques | Machine Learning, AI, Big Data | Descriptive & Predictive Analytics |
| Scope | Broader; includes data engineering, ML | Narrower; focused on decision-making |
| Tools | Python, R, TensorFlow | Excel, Power BI, Tableau, SQL |
| Objective | Discover patterns, build models | Understand business problems and solve them |

4. Differentiate between structured and unstructured data.

Answer:

| Feature | Structured Data | Unstructured Data |
|---------|-----------------|-------------------|
| Format | Organized in rows and columns (tables) | No predefined format |
| Storage | Relational Databases (SQL) | Data lakes, NoSQL, cloud storage |
| Examples | Excel sheets, SQL tables | Images, videos, audio, emails, PDFs |
| Ease of Analysis | Easy to analyze using traditional tools | Requires advanced processing (NLP, ML) |

5. Describe a scenario where the responsibilities of a data scientist


directly impact decision-making.
Answer:

Scenario: A retail company wants to optimize its inventory and avoid


overstocking.

Data Scientist’s Role:

●​ Analyze historical sales, seasonal demand, and customer preferences.​

●​ Use predictive models to forecast future demand.​

●​ Recommend inventory levels and reorder times.​

Impact on Decision-Making:

●​ Helps management decide how much stock to keep.​

●​ Reduces storage cost and wastage.​

●​ Improves customer satisfaction by ensuring availability.​

This shows how a data scientist contributes quantitatively to a critical business


function, impacting profits and operations directly.

6. Explain three types of data with example. Differentiate among


them.

Answer:

The three main types of data are:

1. Structured Data

●​ Definition: Data that is organized in a predefined format (rows and


columns), usually stored in relational databases.​
Example:​

| Name | Age | Salary |
|--------|-----|--------|
| Alice | 30 | 50000 |
| Bob | 28 | 60000 |

●​ Tools: SQL, Excel​

2. Semi-Structured Data

●​ Definition: Data that doesn't reside in a traditional database but still has
some organizational properties (tags or markers).​

Example:​

{
"Name": "Alice",
"Age": 30,
"Salary": 50000
}

●​ Format: XML, JSON​

3. Unstructured Data

●​ Definition: Data without a fixed structure. It’s typically raw, unorganized,


and requires preprocessing before analysis.​

●​ Example: Images, videos, emails, PDFs, social media posts​

| Feature | Structured | Semi-Structured | Unstructured |
|---------|------------|-----------------|--------------|
| Organization | Highly organized | Partially organized | Not organized |
| Storage | Relational Databases | NoSQL, JSON/XML files | Data lakes, file systems |
| Ease of Analysis | Easy | Moderate | Complex |
| Examples | SQL Tables | JSON/XML | Images, Audio, Video |

7. Demonstrate how the pseudo-inverse helps in solving an


overdetermined system of linear equations, and why it is
important.

Answer:

An overdetermined system has more equations than unknowns (m > n). Such
systems often have no exact solution, especially when inconsistent.

Let the system be:

Ax = b

Where:

●​ A is an m×n matrix (with m > n)​

●​ x is an n×1 vector of variables​

●​ b is an m×1 vector of constants​

Since exact solutions often don’t exist, we approximate x such that the error
between Ax and b is minimized. This leads to least squares approximation.

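Pseudo-Inverse Approach:

The least-squares solution is $\hat{x} = A^{+} b$, where $A^{+} = (A^T A)^{-1} A^T$ is the Moore-Penrose pseudo-inverse of A. This closed form requires A to have full column rank, so that $A^T A$ is invertible.
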
Why it is important:

●​ Stability: Works even when exact solutions are not possible.​

●​ Efficiency: Widely used in machine learning for fitting linear regression


models.​

●​ Generalization: Helps deal with inconsistent systems arising from


real-world data.​
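
As an illustration, the sketch below applies the pseudo-inverse to the three-equation system from Question 2 (x + y = 2, 2x + 3y = 5, 4x + 5y = 6). It assumes A has full column rank and uses the closed form A⁺ = (AᵀA)⁻¹Aᵀ with exact fractions. The helpers `transpose`, `matmul`, and `inverse_2x2` are written out for this example only; numerical libraries usually compute the pseudo-inverse through a singular value decomposition instead.

```python
from fractions import Fraction

# Overdetermined system A x = b: 3 equations, 2 unknowns.
A = [[1, 1], [2, 3], [4, 5]]
b = [2, 5, 6]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def inverse_2x2(M):
    (a, b_), (c, d) = M
    det = Fraction(a * d - b_ * c)       # must be non-zero (full column rank)
    return [[d / det, -b_ / det], [-c / det, a / det]]


At = transpose(A)
A_pinv = matmul(inverse_2x2(matmul(At, A)), At)   # (A^T A)^-1 A^T, a 2x3 matrix

# Least-squares solution x_hat = A_pinv @ b.
x_hat = [sum(p * bi for p, bi in zip(row, b)) for row in A_pinv]

# Residual b - A x_hat: what the best approximate solution cannot explain.
residual = [bi - sum(a * x for a, x in zip(row, x_hat)) for row, bi in zip(A, b)]
print(x_hat, residual)
```

The result is x = −3/2, y = 5/2. No equation is satisfied exactly (the residual is non-zero), but no other choice of x and y gives a smaller sum of squared errors.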

Example:

Solve using pseudo-inverse:


Mid Semester Test 2

Section A (2x5= 10)


1.​ State the purpose of the gradient descent algorithm in optimization.
2.​ Define data visualization and discuss its significance in data analysis.
3.​ Differentiate between predictive and prescriptive statistics with examples.
4.​ Determine the covariance between two datasets : X={2,4,6} and Y={3,6,9}.
5.​ Identify the role of Learning Rate in convergence of Gradient Descent
algorithm.

Section B (5x2= 10)


6. Apply hypothesis testing techniques to solve data analysis problems and
demonstrate the step-by-step process of conducting a hypothesis test.
7.​ Compare the three types of statistical measures and analyze their
applications in data analysis.

Solutions/Answers :

1. State the purpose of the gradient descent algorithm in optimization.

Answer:

The purpose of the Gradient Descent algorithm is to find a minimum of a function. It is commonly used to solve optimization problems in machine learning and deep learning.

●​ In ML, it minimizes loss functions to improve model accuracy.​

● It does this by iteratively adjusting parameters (weights) in the direction of the steepest decrease of the function (negative gradient).

Example: In linear regression, gradient descent finds the optimal slope and
intercept that minimize the error between predicted and actual values.
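The linear regression example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the data points, the function name fit_line, the learning rate of 0.05, and the step count are all assumptions chosen so the loop converges on this small dataset.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors for the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2 / n) * sum(errors)
        # Step in the direction of the negative gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Because the data lie exactly on a line, the loop should recover a slope close to 2 and an intercept close to 1; with noisy data it would instead settle on the least squares line.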

2. Define data visualization and discuss its significance in data analysis.

Answer:

Data Visualization is the graphical representation of information and data using visual elements like charts, graphs, maps, and plots.

Significance:

●​ Simplifies complex data and patterns​

●​ Enhances understanding and communication of insights​

●​ Speeds up decision-making​

●​ Helps identify outliers, trends, and correlations​

Example: A line graph showing the rise in global temperature over the years
quickly communicates climate trends.

3. Differentiate between predictive and prescriptive statistics with examples.

Aspect              | Predictive Statistics                          | Prescriptive Statistics
Purpose             | Forecast future outcomes                       | Suggest actions to achieve a desired outcome
Based On            | Historical data and statistical models         | Predictive models + optimization and simulation methods
Question It Answers | "What is likely to happen?"                    | "What should we do about it?"
Example             | Predicting next month’s sales based on trends  | Recommending how many items to produce to maximize profit

4. Determine the covariance between two datasets: X = {2, 4, 6}, Y = {3, 6, 9}.

Answer:
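A worked sketch of the calculation follows. The means are x̄ = 4 and ȳ = 6, so the deviations are (-2, 0, 2) for X and (-3, 0, 3) for Y, and the sum of their products is 6 + 0 + 6 = 12. Dividing by n - 1 = 2 gives a sample covariance of 6; dividing by n = 3 gives a population covariance of 4. Which divisor the question intends is not stated, so both are shown; the function name covariance and its sample flag are assumptions for illustration.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equal-length datasets.

    sample=True divides by n - 1 (sample covariance);
    sample=False divides by n (population covariance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


X = [2, 4, 6]
Y = [3, 6, 9]

sample_cov = covariance(X, Y)
population_cov = covariance(X, Y, sample=False)
print("sample covariance:", sample_cov)
print("population covariance:", population_cov)
```

The positive value reflects that X and Y increase together; here Y is exactly 1.5 times X.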

5. Identify the role of Learning Rate in convergence of Gradient Descent algorithm.

Answer:

The Learning Rate (η) controls how big a step the gradient descent algorithm
takes toward the minimum during each iteration.

Roles:

●​ A small learning rate ensures smooth convergence but takes longer.​

● A large learning rate can converge faster but risks overshooting the minimum
or diverging.

Balance is key: An ideal learning rate ensures the algorithm converges efficiently
without oscillation or divergence.

Illustration:

●​ If η is too small: The process is slow.​


●​ If η is too large: The updates might jump over the minimum repeatedly or
never converge.
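The effect can be seen on a simple test function. The sketch below assumes f(x) = x², whose gradient is 2x and whose minimum is at x = 0; the function name run_gradient_descent, the starting point, and the three η values are illustrative choices, not values from the notes.

```python
def run_gradient_descent(learning_rate, start=10.0, steps=50):
    """Minimize f(x) = x**2 from a fixed starting point and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x  # derivative of x**2
        x -= learning_rate * gradient
    return x


too_small = run_gradient_descent(0.001)  # barely moves from the start
reasonable = run_gradient_descent(0.1)   # ends close to the minimum at 0
too_large = run_gradient_descent(1.1)    # each step overshoots further

for name, value in [("eta=0.001", too_small), ("eta=0.1", reasonable), ("eta=1.1", too_large)]:
    print(f"{name}: x after 50 steps = {value:.6g}")
```

For this function each update multiplies x by (1 - 2η), so the iterates shrink toward 0 only when that factor has magnitude below 1, which is why η = 1.1 diverges.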

6. Apply hypothesis testing techniques to solve data analysis problems and demonstrate the step-by-step process of conducting a hypothesis test.

Answer:

Let’s say a company claims their product has an average weight of 500g, but a
competitor suspects it’s less.

We take a sample of 10 products and find the average weight = 490g, with
standard deviation = 15g.

We want to test at 5% significance level if the mean is less than 500g.

Step-by-step Hypothesis Testing:

Step 1: State the Hypotheses

●​ Null Hypothesis (H₀): μ = 500 (the mean is 500g)​

●​ Alternative Hypothesis (H₁): μ < 500 (the mean is less than 500g)​

Step 2: Select the Significance Level (α)

●​ α = 0.05 (5%)

Step 3: Calculate the Test Statistic

Use a t-test (since the population standard deviation is unknown and n < 30):

t = (x̄ - μ₀) / (s / √n) = (490 - 500) / (15 / √10) ≈ -10 / 4.743 ≈ -2.11

Step 4: Determine the Critical Value

Degrees of freedom (df) = 10 - 1 = 9

From the t-distribution table, critical value at 5% (one-tailed) for df = 9 ≈ -1.833


Step 5: Make a Decision

●​ Since -2.11 < -1.833, we reject the null hypothesis.​

Step 6: Conclusion

There is enough evidence at the 5% significance level to conclude that the average
weight is less than 500g.
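The steps above can be checked with a short script. This is a minimal sketch using only the standard library: the critical value -1.833 is taken from the t-table lookup in Step 4 rather than computed, and the variable names are assumptions for illustration.

```python
import math

# Values from the worked example.
sample_mean = 490.0   # grams
claimed_mean = 500.0  # grams, value under the null hypothesis
sample_std = 15.0     # grams
n = 10

# Step 3: one-sample t statistic.
standard_error = sample_std / math.sqrt(n)
t_statistic = (sample_mean - claimed_mean) / standard_error

# Step 4: one-tailed critical value for df = 9 at alpha = 0.05 (from the t-table).
degrees_of_freedom = n - 1
critical_value = -1.833

# Step 5: left-tailed test, so reject H0 when t falls below the critical value.
reject_null = t_statistic < critical_value

print(f"t = {t_statistic:.2f}, df = {degrees_of_freedom}, reject H0: {reject_null}")
```

With the raw sample available instead of summary statistics, a library routine such as SciPy's scipy.stats.ttest_1samp could perform the same test and also report a p-value.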

7. Compare the three types of statistical measures and analyze their applications in data analysis.

Answer:

The three types of statistical measures are:

1. Measures of Central Tendency

●​ Includes Mean, Median, and Mode.​

●​ Purpose: Describe the center or average of the data.​

●​ Application: Used in summarizing salary data, average scores, etc.​

Example: Average marks of students in a class.

2. Measures of Dispersion

●​ Includes Range, Variance, Standard Deviation, Interquartile Range (IQR).​

●​ Purpose: Measure spread or variability in data.​

● Application: Used to analyze risk, consistency in quality control, and investment volatility.

Example: A low standard deviation in product weight shows consistent manufacturing.

3. Measures of Shape

●​ Includes Skewness and Kurtosis.​

● Purpose: Describe the symmetry (skewness) and the tail weight or peakedness (kurtosis) of the data distribution.

● Application: Crucial in understanding distribution properties before applying statistical models (especially in finance and machine learning).

Example: Positive skew indicates a long tail on the right (income distributions often
show this).

Summary Table:

Measure Type     | Key Metrics              | Purpose               | Application Example
Central Tendency | Mean, Median, Mode       | Central Value         | Average salary of employees
Dispersion       | Std Dev, Variance, Range | Spread of data        | Risk in stock returns
Shape            | Skewness, Kurtosis       | Shape of distribution | Assessing normality in predictive models
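As a rough illustration of the three measure types on one dataset, the sketch below uses Python's standard statistics module. The salary figures are hypothetical, and the skewness helper is an assumption: it implements the simple moment-based (population) definition, one of several in use.

```python
import statistics


def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed std dev."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)


# Hypothetical salaries in thousands; one high earner stretches the right tail.
salaries = [30, 32, 35, 35, 40, 120]

# 1. Central tendency
mean = statistics.mean(salaries)
median = statistics.median(salaries)
mode = statistics.mode(salaries)

# 2. Dispersion
value_range = max(salaries) - min(salaries)
std_dev = statistics.stdev(salaries)  # sample standard deviation

# 3. Shape
skew = skewness(salaries)

print(f"mean={mean:.2f}, median={median}, mode={mode}")
print(f"range={value_range}, std_dev={std_dev:.2f}")
print(f"skewness={skew:.2f}")
```

The single large value pulls the mean above the median and produces a positive skewness, matching the income example given for measures of shape.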
Question Bank for UNIT-3: Optimization and Complexity

2-Marks Questions (12 Questions)


1.​ Define Machine Learning in the context of Data Science.​

2.​ State any two real-life applications of Supervised Learning.​

3.​ What is the role of training data in Machine Learning?​

4.​ List two differences between supervised and unsupervised learning.​

5.​ Mention any two commonly used Machine Learning algorithms.​

6.​ What is the difference between classification and prediction in ML?​

7.​ Define reinforcement learning with a simple example.​

8.​ What is a labeled dataset? How is it used in supervised learning?​

9.​ State any two differences between deep learning and traditional ML.​

10.​Why is machine learning important in today’s business environment?​

11.​ Give an example where prediction is preferred over classification.​

12.​What does “learning from data” mean in ML?​

5-Marks Questions (6 Questions)


1.​ Compare and contrast supervised, unsupervised, and reinforcement learning
using suitable examples.​

2. Explain the significance of machine learning in modern business decision-making with a relevant scenario.

3. Illustrate how classification is performed using any one ML algorithm (e.g., Decision Tree or KNN).

4. Differentiate between deep learning and machine learning in terms of architecture, data requirements, and performance.

5.​ Discuss three key types of machine learning algorithms and their areas of
application.​

6. Explain how machine learning supports personalization in online platforms (e.g., Netflix or Amazon).

10-Marks Questions (6 Questions)


1.​ Explain in detail the various types of Machine Learning techniques
(Supervised, Unsupervised, Reinforcement, and Deep Learning).​

2.​ Assume a dataset with customer transactions for a bank. Design a machine
learning approach to classify customers as ‘high risk’ or ‘low risk’.​

3. Discuss the process of building a classification model using logistic regression.

4. Elaborate on how machine learning is revolutionizing predictive analytics in industries.

5.​ Design a use-case that combines supervised and unsupervised learning for a
real-world business scenario.​

6.​ Differentiate classification and prediction using appropriate datasets.
