data science

The document covers various aspects of data science programming, including key technologies like classification, prediction, and clustering, as well as an analytic plan for a bank to predict customer churn. It discusses the importance of data cleaning, the differences between univariate, bivariate, and multivariate analysis, and the significance of exploratory data analysis (EDA) in healthcare. Additionally, it highlights the differences between big data analytics and business intelligence, and outlines the data science lifecycle and methodologies such as CRISP-DM.


QUESTION ONE

(a) Explain the following technologies as used in Data Science Programming:


i. Classification:
Classification is a supervised learning technique in which a model is trained on labeled data to
predict the category or class of new, unseen data points. For example, classifying emails as "spam"
or "not spam."

ii. Prediction:
Prediction involves using historical data to create a model that forecasts future values or trends. It
is a key aspect of regression analysis, such as predicting house prices based on features like size
and location.

iii. Clustering:
Clustering is an unsupervised learning technique used to group similar data points into clusters
based on their features. An example is segmenting customers into groups based on purchasing
behavior.

(b) Analytic Plan for the Bank:

1. Data Collection:
o Gather customer data, including demographics, transaction history, and reasons for
churn.
o Include external data such as economic indicators.
2. Data Cleaning:
o Handle missing or inconsistent values and remove duplicates.
3. Exploratory Data Analysis (EDA):
o Analyze churn rates and identify patterns.
o Use visualizations like histograms and bar plots to understand customer behavior.
4. Customer Segmentation:
o Use clustering techniques to segment customers based on value and risk.
5. Churn Prediction Model:
o Build a classification model (e.g., Logistic Regression or Random Forest) to predict
churn.
o Use this model to identify at-risk customers.
6. Retention Strategies:
o Personalize campaigns based on insights.
o Offer tailored incentives to retain high-value customers.
7. Evaluation:
o Monitor churn rates and campaign effectiveness.
8. Data Warehousing:
o Build a data warehouse to store cleaned and processed data for future use.
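Steps 2 and 3 of the plan above can be sketched in a few lines of pandas. The table below is entirely invented for illustration: the column names (customer_id, age, balance, churned) and values are assumptions, not real bank data.

```python
import pandas as pd

# Hypothetical customer table; a real bank would pull this from its warehouse
df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, 4],
    'age': [34.0, 51.0, 51.0, None, 42.0],
    'balance': [1200.0, 300.0, 300.0, 5000.0, 150.0],
    'churned': [0, 1, 1, 0, 1],
})

# Step 2 (cleaning): drop duplicate customers, impute missing age with the median
df = df.drop_duplicates(subset='customer_id').copy()
df['age'] = df['age'].fillna(df['age'].median())

# Step 3 (EDA): churn rate by balance band as a first look at the data
df['band'] = pd.cut(df['balance'], bins=[0, 500, 2000, 10000],
                    labels=['low', 'mid', 'high'])
churn_rate = df.groupby('band', observed=True)['churned'].mean()
print(churn_rate)
```

A summary like this often motivates the segmentation step: if low-balance customers churn disproportionately, they become a natural target segment for retention offers.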

(c) Justification of Data Cleaning Efforts:


Data cleaning is commonly estimated to take up to 80% of analysis time because:
1. Dirty data (missing, incorrect, or inconsistent entries) can bias results.
2. Ensures datasets are structured and meaningful for analysis.
3. Reduces errors and improves model accuracy.

(d) Code to Store Height and Weight in NumPy Arrays:

import numpy as np

height = [180, 165, 190, 175]
weight = [75, 68, 85, 80]

height_array = np.array(height)
weight_array = np.array(weight)

(e) Explanation of the Source Code:

import matplotlib.pyplot as plt

# Labels for the pie chart
labels = ['Python', 'C++', 'Ruby', 'Java']

# Sizes of each slice
sizes = [215, 130, 245, 210]

# Colors for each slice
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']

# Explode the first slice slightly
explode = (0.1, 0, 0, 0)

# Create the pie chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)

# Equal aspect ratio ensures the pie is circular
plt.axis('equal')

# Display the chart
plt.show()

The script draws a pie chart of four programming languages. Each slice's angle is proportional to its entry in sizes; explode=(0.1, 0, 0, 0) pulls the first slice ('Python') slightly out of the pie, autopct='%1.1f%%' labels each slice with its percentage share to one decimal place, shadow=True adds a drop shadow, startangle=140 rotates the chart, and plt.axis('equal') keeps the pie circular rather than elliptical.

(f) Time Series Analysis Plan:


i. Steps to Perform Time Series Analysis:

1. Data Collection: Gather historical electricity consumption data.


2. Preprocessing: Handle missing values, remove outliers, and ensure uniform time intervals.
3. Exploratory Analysis: Plot data trends, seasonality, and cycles using visualizations like
line graphs.
4. Model Selection: Use ARIMA, LSTM, or Prophet models based on data patterns.
5. Evaluation: Split data into training and testing sets, then calculate error metrics like RMSE
or MAE.
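Steps 4 and 5 can be illustrated with a deliberately simple baseline in place of ARIMA, LSTM, or Prophet: a naive "last observation" forecast scored with RMSE and MAE. The consumption series below is synthetic, generated only to demonstrate the chronological split and the error metrics.

```python
import numpy as np

# Synthetic hourly consumption: trend + daily seasonality + noise
rng = np.random.default_rng(0)
t = np.arange(200)
series = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, 200)

# Step 5: chronological train/test split (never shuffle a time series)
train, test = series[:160], series[160:]

# Naive baseline: predict each point with the previous observation
preds = np.concatenate(([train[-1]], test[:-1]))

rmse = np.sqrt(np.mean((test - preds) ** 2))
mae = np.mean(np.abs(test - preds))
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}")
```

Any real model (ARIMA, Prophet, LSTM) should beat this baseline on the same split; if it does not, the extra complexity is not earning its keep.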

ii. Challenges and Solutions:


1. Seasonality Variation: Use decomposition techniques to handle seasonality.
2. Data Quality Issues: Ensure robust preprocessing and outlier removal.
3. Model Overfitting: Regularize the model or use cross-validation techniques.

QUESTION TWO (20 MARKS)

(a) Difference between Univariate, Bivariate, and Multivariate Analysis:

 Univariate: Analysis of a single variable to summarize its characteristics. Example: analyzing the distribution of customer ages using histograms.
 Bivariate: Examines the relationship between two variables. Example: analyzing the correlation between age and purchase frequency using scatter plots.
 Multivariate: Involves multiple variables to understand their interdependence. Example: analyzing age, income, and purchase category to understand customer behavior trends.

(b) Data Cleaning and Preprocessing:


i. Steps to Clean the Dataset:

1. Handle Missing Values:


o Impute missing values using mean/median for numerical data and mode for
categorical data.
o Use libraries like pandas to manage missing entries.
2. Remove Duplicates:
o Identify and drop duplicate rows using pandas.DataFrame.drop_duplicates().
3. Standardize Data:
o Ensure consistent formatting (e.g., "USD" vs "usd").
4. Outlier Detection:
o Use Z-scores or IQR to identify and handle outliers.
5. Data Type Conversion:
o Convert data types (e.g., strings to dates) for analysis compatibility.
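The five steps above can be sketched with pandas on a made-up sales table; the column names and values are invented, and the IQR rule stands in for the Z-score variant mentioned in step 4 (with only a handful of rows, Z-scores cannot flag anything).

```python
import pandas as pd

df = pd.DataFrame({
    'price': [10.0, None, 10.0, 999.0, 12.5],
    'currency': ['USD', 'usd', 'USD', 'USD', 'Usd'],
    'date': ['2024-01-01', '2024-01-02', '2024-01-01', '2024-01-03', '2024-01-04'],
})

# 1. Handle missing values: impute the numeric price with the median
df['price'] = df['price'].fillna(df['price'].median())

# 2. Remove exact duplicate rows
df = df.drop_duplicates().copy()

# 3. Standardize inconsistent formatting ("USD" vs "usd")
df['currency'] = df['currency'].str.upper()

# 4. Outlier detection with the IQR rule (999 is an obvious entry error)
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df['price'] >= q1 - 1.5 * iqr) & (df['price'] <= q3 + 1.5 * iqr)].copy()

# 5. Type conversion: date strings to real datetime objects
df['date'] = pd.to_datetime(df['date'])
print(df)
```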

ii. Potential Consequences of Not Cleaning Data:

1. Bias in Results: Skewed insights due to inconsistent or incorrect data.


2. Model Errors: Reduced accuracy and reliability in machine learning models.
3. Decision-Making Impact: Faulty decisions based on flawed analysis.

QUESTION THREE (20 MARKS)

(a) Big Data Analytics vs. Business Intelligence:


 Purpose: Big Data Analytics analyzes vast amounts of complex data to uncover patterns and trends, while Business Intelligence generates reports and dashboards for business decision-making.
 Tools: Big Data Analytics uses Hadoop, Spark, and TensorFlow; Business Intelligence uses Tableau, Power BI, and Excel.
 Focus: Big Data Analytics emphasizes predictive and prescriptive analytics; Business Intelligence emphasizes descriptive analytics.

Example:

 Big Data Analytics: Predict customer preferences using real-time transaction data.
 Business Intelligence: Report monthly sales trends.

(b) EDA in Healthcare Dataset:


i. Techniques and Visualizations:

1. Descriptive Statistics:
o Summarize attributes like average age, gender distribution, and most common
conditions.
o Use histograms and boxplots for distributions.
2. Correlation Analysis:
o Analyze relationships between variables such as age and condition severity.
o Use heatmaps for visualization.

ii. Importance of Outliers/Anomalies Detection:


Outliers might represent data entry errors or rare but critical cases. Detecting them ensures better
model performance and provides insights into exceptional medical conditions.
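One common way to flag such outliers is the IQR rule. The blood-pressure readings below are invented, with one value (300) planted as the kind of entry that deserves review.

```python
import pandas as pd

# Invented systolic blood-pressure readings; 300 is a likely entry error
bp = pd.Series([118, 122, 115, 130, 125, 121, 300, 119])

q1, q3 = bp.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = bp[(bp < q1 - 1.5 * iqr) | (bp > q3 + 1.5 * iqr)]
print(outliers)  # flags 300: either a data-entry error or a critical case
```

Whether a flagged value is dropped, corrected, or kept as a genuine rare case is a clinical judgment, not an automatic one.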

(c) Data Science in Telecommunications and Biological Data Analysis:

1. Telecommunications:
o Optimize network usage using predictive models.
o Reduce churn through customer segmentation.
2. Biological Data Analysis:
o Analyze DNA sequences using clustering techniques.
o Predict disease outbreaks with machine learning.

(d) Random Forest Model with Overfitting:


A training error of 0.00 and a high validation error indicate overfitting, where the model
memorizes training data but fails on new data.
Solution:

 Reduce model complexity (e.g., limit maximum tree depth or raise the minimum samples per leaf).
 Use cross-validation or pruning to check and improve generalization.
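The same train/validation gap can be demonstrated without a Random Forest: a degree-9 polynomial fitted to 10 noisy points memorizes them (near-zero training error) but generalizes far worse than a simpler degree-3 fit. This is a stand-in illustration on synthetic data, not the bank's actual model.

```python
import numpy as np

# Ten noisy training points from a sine curve
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

# A denser, noise-free test grid from the same underlying function
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 9):
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

The degree-9 model interpolates all 10 training points exactly, which is precisely the "training error of 0.00" symptom described above.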
QUESTION FOUR (20 MARKS)

(a) Two Benefits of Using NumPy Arrays Over Nested Python Lists:

1. Performance:
o NumPy arrays are optimized for numerical computations and run significantly faster
than lists.
Example:

import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = a + b # Vectorized addition

2. Memory Efficiency:
o Arrays store elements of the same type and require less memory compared to lists.
Example:

python_list = [1, 2, 3, 4]
numpy_array = np.array(python_list)
print(numpy_array.nbytes) # Displays memory usage

(b) Python Code to Manipulate a List:


i. Create the List:

time = [11.25, 18.0, 20.0, 10.75, 9.50, 15.5, 14.5, 14.5]

ii. Add Items to the List:

time.extend([24.5, 15.45])

iii. Print the List:

print(time)

iv. Reverse the List:

time.reverse()
print(time)

(c) Code to Remove Duplicate Rows in a DataFrame:

import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 25]}
df = pd.DataFrame(data)

# Remove duplicates based on 'Name' column
df_unique = df.drop_duplicates(subset='Name')
print(df_unique)
(d) General Syntax for Calling Functions and Saving Results:

1. Syntax:

result = function_name(arguments)

Example:

def add_numbers(a, b):
    return a + b

result = add_numbers(5, 10)  # Saves 15 to result
print(result)

Python Basics and Flow Control

Q5: What are mutable and immutable data types in Python? Provide examples.
A:

 Mutable Data Types: Can be changed after creation. Examples: list, dict, set.

my_list = [1, 2, 3]
my_list.append(4) # Modifies the list
print(my_list) # Output: [1, 2, 3, 4]

 Immutable Data Types: Cannot be changed after creation. Examples: int, float, str, tuple.

my_str = "hello"
my_str[0] = "H" # TypeError: strings are immutable

B: What are the key data types in Python, and what are their uses?
 int: Represents integers (e.g., 10).
 float: Represents decimal numbers (e.g., 3.14).
 str: Represents text (e.g., "hello").
 list: Ordered and mutable collection (e.g., [1, 2, 3]).
 tuple: Ordered and immutable collection (e.g., (1, 2, 3)).
 dict: Key-value pairs (e.g., {"key": "value"}).
 set: Unordered, unique elements (e.g., {1, 2, 3})

C: What is the purpose of the break and continue statements in loops?


A:
 break: Exits the loop prematurely.
 continue: Skips the rest of the current iteration.
Example:

for i in range(5):
    if i == 3:
        break  # Stops at 3
    print(i)

for i in range(5):
    if i == 3:
        continue  # Skips 3
    print(i)

Q6: What are loops, and how are they used in Python? Provide examples of for and while
loops.
A: Loops are used for repetitive tasks.

 For Loop: Iterates over sequences.

for i in range(3):
    print(i)  # Output: 0, 1, 2

 While Loop: Executes as long as a condition is true.

count = 0
while count < 3:
    print(count)
    count += 1
# Output: 0, 1, 2

Introduction to Data Science and Its Value

Q7: What skills are required to succeed in Data Science?


A:

1. Technical Skills:
o Python/R, SQL, and statistical analysis.
o Data visualization tools (e.g., Tableau).
2. Soft Skills:
o Problem-solving.
o Communication for presenting insights.
3. Domain Knowledge: Understanding industry-specific challenges.

Q8: What are some real-world applications of data science?


A:

 Healthcare: Predicting disease outbreaks.


 Finance: Fraud detection.
 Retail: Customer segmentation for personalized marketing.
 Transportation: Optimizing delivery routes.

Data Science Lifecycle/Methodology

Q9: What challenges arise during the Data Science Lifecycle?


A:

 Data Collection: Incomplete or noisy data.


 Modeling: Choosing the right algorithm.
 Deployment: Ensuring scalability in production.

Q10: What is CRISP-DM, and how does it relate to Data Science?


A: CRISP-DM (Cross-Industry Standard Process for Data Mining) is a methodology:

1. Business Understanding.
2. Data Understanding.
3. Data Preparation.
4. Modeling.
5. Evaluation.
6. Deployment.

Python Libraries for Data Science and Visualization

Q11: How does Seaborn enhance data visualization?


A: Seaborn builds on Matplotlib and simplifies creating advanced visualizations.
Example:

import seaborn as sns
import matplotlib.pyplot as plt

sns.barplot(x=['A', 'B', 'C'], y=[10, 20, 15])
plt.show()
Q12: What is Scikit-learn, and why is it important?
A: Scikit-learn provides tools for:

 Supervised learning: Linear regression, decision trees.


 Unsupervised learning: Clustering, PCA.
 Model evaluation: Accuracy, F1 score.

Data Wrangling and Feature Engineering

Q13: What is one-hot encoding, and when is it used?


A: Converts categorical variables into binary vectors.
Example:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})
encoded_df = pd.get_dummies(df, columns=['Color'])
print(encoded_df)

Q14: What is the difference between normalization and standardization?


A:

 Normalization: Scales data to a range (0, 1).


 Standardization: Scales data to have a mean of 0 and standard deviation of 1.
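Both rescalings fit in a short NumPy sketch (libraries such as scikit-learn offer MinMaxScaler and StandardScaler for the same job):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max): squeeze values into the range [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): shift to mean 0 and scale to std 1
standardized = (x - x.mean()) / x.std()

print(normalized)     # [0.   0.25 0.5  0.75 1.  ]
print(standardized.mean(), standardized.std())
```

Normalization is sensitive to outliers (one extreme value compresses everything else), while standardization keeps relative spacing but produces unbounded values.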

Exploratory Data Analysis (EDA)

Q15: What is the role of correlation in EDA?


A: Correlation quantifies the relationship between variables.

 Positive correlation: Variables increase together.


 Negative correlation: One increases as the other decreases.
Example:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 6, 9]})
print(df.corr())  # Shows the correlation matrix

Machine Learning Models

Q16: What is the difference between regression and classification models?


A:
 Regression: Predicts continuous values (e.g., house prices).
 Classification: Predicts discrete labels (e.g., spam vs. not spam).

Q17: What are some examples of clustering algorithms?


A:

 K-Means: Partitions data into K clusters.


 DBSCAN: Detects clusters based on density.
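The K-Means assign/update loop is small enough to sketch in plain NumPy on 1-D data (a real project would use sklearn.cluster.KMeans; the points and starting centroids here are hand-picked for clarity):

```python
import numpy as np

# Two obvious 1-D groups; initial centroids are hand-picked guesses
points = np.array([1.0, 1.5, 2.0, 9.0, 10.0, 11.0])
centroids = np.array([0.0, 5.0])

for _ in range(10):
    # Assignment step: each point joins its nearest centroid
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # Update step: move each centroid to the mean of its assigned points
    centroids = np.array([points[labels == k].mean() for k in range(2)])

print(centroids)  # converges to the two cluster centers, 1.5 and 10.0
```

K-Means needs the number of clusters up front and assumes roughly spherical groups; DBSCAN avoids both assumptions by growing clusters from dense regions.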

Mini Data Science Projects

Q18: Describe a project using a public dataset.


Project: Analyzing Titanic dataset.
Steps:

1. Load and clean data.


2. Visualize survival rates based on gender and class.
3. Build a logistic regression model to predict survival.

MySQL for Data Science

Q19: How can SQL be used for feature extraction?


A: Use SQL queries to group data, calculate aggregates, or extract time-based features.
Example:

SELECT customer_id, AVG(purchase_amount) AS avg_purchase
FROM purchases
GROUP BY customer_id;
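The query above can be tried end-to-end with Python's built-in sqlite3 module; the table contents below are invented for the demonstration.

```python
import sqlite3

# In-memory toy database standing in for a real purchases table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE purchases (customer_id INTEGER, purchase_amount REAL)')
conn.executemany('INSERT INTO purchases VALUES (?, ?)',
                 [(1, 100.0), (1, 200.0), (2, 50.0)])

# One average-spend feature per customer
rows = conn.execute(
    'SELECT customer_id, AVG(purchase_amount) AS avg_purchase '
    'FROM purchases GROUP BY customer_id'
).fetchall()
print(rows)
conn.close()
```

Each resulting (customer_id, avg_purchase) pair is ready to join back onto a feature table for modeling.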
