data science

The document covers various aspects of data science programming, including key technologies like classification, prediction, and clustering, as well as an analytic plan for a bank to predict customer churn. It discusses the importance of data cleaning, the differences between univariate, bivariate, and multivariate analysis, and the significance of exploratory data analysis (EDA) in healthcare. Additionally, it highlights the differences between big data analytics and business intelligence, and outlines the data science lifecycle and methodologies such as CRISP-DM.


QUESTION ONE

(a) Explain the following technologies as used in Data Science Programming:


i. Classification:
Classification is a supervised learning technique in which a model is trained on labeled data to
predict the category or class of new, unseen data points. For example, classifying emails as "spam"
or "not spam."

ii. Prediction:
Prediction involves using historical data to create a model that forecasts future values or trends. It
is a key aspect of regression analysis, such as predicting house prices based on features like size
and location.

iii. Clustering:
Clustering is an unsupervised learning technique used to group similar data points into clusters
based on their features. An example is segmenting customers into groups based on purchasing
behavior.

(b) Analytic Plan for the Bank:

1. Data Collection:
o Gather customer data, including demographics, transaction history, and reasons for
churn.
o Include external data such as economic indicators.
2. Data Cleaning:
o Handle missing or inconsistent values and remove duplicates.
3. Exploratory Data Analysis (EDA):
o Analyze churn rates and identify patterns.
o Use visualizations like histograms and bar plots to understand customer behavior.
4. Customer Segmentation:
o Use clustering techniques to segment customers based on value and risk.
5. Churn Prediction Model:
o Build a classification model (e.g., Logistic Regression or Random Forest) to predict
churn.
o Use this model to identify at-risk customers.
6. Retention Strategies:
o Personalize campaigns based on insights.
o Offer tailored incentives to retain high-value customers.
7. Evaluation:
o Monitor churn rates and campaign effectiveness.
8. Data Warehousing:
o Build a data warehouse to store cleaned and processed data for future use.
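Steps 2 and 3 of the plan above can be sketched in a few lines of pandas. The table below is entirely invented for illustration: the column names (customer_id, age, balance, churned) and values are assumptions, not real bank data.

```python
import pandas as pd

# Hypothetical customer table; a real bank would pull this from its warehouse
df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, 4],
    'age': [34.0, 51.0, 51.0, None, 42.0],
    'balance': [1200.0, 300.0, 300.0, 5000.0, 150.0],
    'churned': [0, 1, 1, 0, 1],
})

# Step 2 (cleaning): drop duplicate customers, impute missing age with the median
df = df.drop_duplicates(subset='customer_id').copy()
df['age'] = df['age'].fillna(df['age'].median())

# Step 3 (EDA): churn rate by balance band as a first look at the data
df['band'] = pd.cut(df['balance'], bins=[0, 500, 2000, 10000],
                    labels=['low', 'mid', 'high'])
churn_rate = df.groupby('band', observed=True)['churned'].mean()
print(churn_rate)
```

A summary like this often motivates the segmentation step: if low-balance customers churn disproportionately, they become a natural target segment for retention offers.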

(c) Justification of Data Cleaning Efforts:


Data cleaning is commonly estimated to take up to 80% of analysis time because:
1. Dirty data (missing, incorrect, or inconsistent entries) can bias results.
2. Ensures datasets are structured and meaningful for analysis.
3. Reduces errors and improves model accuracy.

(d) Code to Store Height and Weight in NumPy Arrays:

import numpy as np

height = [180, 165, 190, 175]
weight = [75, 68, 85, 80]

height_array = np.array(height)
weight_array = np.array(weight)

(e) Explanation of the Source Code:

import matplotlib.pyplot as plt

# Labels for the pie chart
labels = ['Python', 'C++', 'Ruby', 'Java']

# Sizes of each slice
sizes = [215, 130, 245, 210]

# Colors for each slice
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']

# Explode the first slice slightly
explode = (0.1, 0, 0, 0)

# Create the pie chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)

# Equal aspect ratio ensures the pie is circular
plt.axis('equal')

# Display the chart
plt.show()

The script draws a pie chart of four programming languages. Each slice's angle is proportional to its entry in sizes; explode=(0.1, 0, 0, 0) pulls the first slice ('Python') slightly out of the pie, autopct='%1.1f%%' labels each slice with its percentage share to one decimal place, shadow=True adds a drop shadow, startangle=140 rotates the chart, and plt.axis('equal') keeps the pie circular rather than elliptical.

(f) Time Series Analysis Plan:


i. Steps to Perform Time Series Analysis:

1. Data Collection: Gather historical electricity consumption data.


2. Preprocessing: Handle missing values, remove outliers, and ensure uniform time intervals.
3. Exploratory Analysis: Plot data trends, seasonality, and cycles using visualizations like
line graphs.
4. Model Selection: Use ARIMA, LSTM, or Prophet models based on data patterns.
5. Evaluation: Split data into training and testing sets, then calculate error metrics like RMSE
or MAE.
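Steps 4 and 5 can be illustrated with a deliberately simple baseline in place of ARIMA, LSTM, or Prophet: a naive "last observation" forecast scored with RMSE and MAE. The consumption series below is synthetic, generated only to demonstrate the chronological split and the error metrics.

```python
import numpy as np

# Synthetic hourly consumption: trend + daily seasonality + noise
rng = np.random.default_rng(0)
t = np.arange(200)
series = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, 200)

# Step 5: chronological train/test split (never shuffle a time series)
train, test = series[:160], series[160:]

# Naive baseline: predict each point with the previous observation
preds = np.concatenate(([train[-1]], test[:-1]))

rmse = np.sqrt(np.mean((test - preds) ** 2))
mae = np.mean(np.abs(test - preds))
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}")
```

Any real model (ARIMA, Prophet, LSTM) should beat this baseline on the same split; if it does not, the extra complexity is not earning its keep.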

ii. Challenges and Solutions:


1. Seasonality Variation: Use decomposition techniques to handle seasonality.
2. Data Quality Issues: Ensure robust preprocessing and outlier removal.
3. Model Overfitting: Regularize the model or use cross-validation techniques.

QUESTION TWO (20 MARKS)

(a) Difference between Univariate, Bivariate, and Multivariate Analysis:

 Univariate: Analysis of a single variable to summarize its characteristics. Example: analyzing the distribution of customer ages using histograms.
 Bivariate: Examines the relationship between two variables. Example: analyzing the correlation between age and purchase frequency using scatter plots.
 Multivariate: Involves multiple variables to understand their interdependence. Example: analyzing age, income, and purchase category to understand customer behavior trends.

(b) Data Cleaning and Preprocessing:


i. Steps to Clean the Dataset:

1. Handle Missing Values:


o Impute missing values using mean/median for numerical data and mode for
categorical data.
o Use libraries like pandas to manage missing entries.
2. Remove Duplicates:
o Identify and drop duplicate rows using pandas.DataFrame.drop_duplicates().
3. Standardize Data:
o Ensure consistent formatting (e.g., "USD" vs "usd").
4. Outlier Detection:
o Use Z-scores or IQR to identify and handle outliers.
5. Data Type Conversion:
o Convert data types (e.g., strings to dates) for analysis compatibility.
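The five steps above can be sketched with pandas on a made-up sales table; the column names and values are invented, and the IQR rule stands in for the Z-score variant mentioned in step 4 (with only a handful of rows, Z-scores cannot flag anything).

```python
import pandas as pd

df = pd.DataFrame({
    'price': [10.0, None, 10.0, 999.0, 12.5],
    'currency': ['USD', 'usd', 'USD', 'USD', 'Usd'],
    'date': ['2024-01-01', '2024-01-02', '2024-01-01', '2024-01-03', '2024-01-04'],
})

# 1. Handle missing values: impute the numeric price with the median
df['price'] = df['price'].fillna(df['price'].median())

# 2. Remove exact duplicate rows
df = df.drop_duplicates().copy()

# 3. Standardize inconsistent formatting ("USD" vs "usd")
df['currency'] = df['currency'].str.upper()

# 4. Outlier detection with the IQR rule (999 is an obvious entry error)
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df['price'] >= q1 - 1.5 * iqr) & (df['price'] <= q3 + 1.5 * iqr)].copy()

# 5. Type conversion: date strings to real datetime objects
df['date'] = pd.to_datetime(df['date'])
print(df)
```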

ii. Potential Consequences of Not Cleaning Data:

1. Bias in Results: Skewed insights due to inconsistent or incorrect data.


2. Model Errors: Reduced accuracy and reliability in machine learning models.
3. Decision-Making Impact: Faulty decisions based on flawed analysis.

QUESTION THREE (20 MARKS)

(a) Big Data Analytics vs. Business Intelligence:


 Purpose: Big Data Analytics analyzes vast amounts of complex data to uncover patterns and trends, while Business Intelligence generates reports and dashboards for business decision-making.
 Tools: Big Data Analytics uses Hadoop, Spark, and TensorFlow; Business Intelligence uses Tableau, Power BI, and Excel.
 Focus: Big Data Analytics emphasizes predictive and prescriptive analytics; Business Intelligence emphasizes descriptive analytics.

Example:

 Big Data Analytics: Predict customer preferences using real-time transaction data.
 Business Intelligence: Report monthly sales trends.

(b) EDA in Healthcare Dataset:


i. Techniques and Visualizations:

1. Descriptive Statistics:
o Summarize attributes like average age, gender distribution, and most common
conditions.
o Use histograms and boxplots for distributions.
2. Correlation Analysis:
o Analyze relationships between variables such as age and condition severity.
o Use heatmaps for visualization.

ii. Importance of Outliers/Anomalies Detection:


Outliers might represent data entry errors or rare but critical cases. Detecting them ensures better
model performance and provides insights into exceptional medical conditions.
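One common way to flag such outliers is the IQR rule. The blood-pressure readings below are invented, with one value (300) planted as the kind of entry that deserves review.

```python
import pandas as pd

# Invented systolic blood-pressure readings; 300 is a likely entry error
bp = pd.Series([118, 122, 115, 130, 125, 121, 300, 119])

q1, q3 = bp.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = bp[(bp < q1 - 1.5 * iqr) | (bp > q3 + 1.5 * iqr)]
print(outliers)  # flags 300: either a data-entry error or a critical case
```

Whether a flagged value is dropped, corrected, or kept as a genuine rare case is a clinical judgment, not an automatic one.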

(c) Data Science in Telecommunications and Biological Data Analysis:

1. Telecommunications:
o Optimize network usage using predictive models.
o Reduce churn through customer segmentation.
2. Biological Data Analysis:
o Analyze DNA sequences using clustering techniques.
o Predict disease outbreaks with machine learning.

(d) Random Forest Model with Overfitting:


A training error of 0.00 and a high validation error indicate overfitting, where the model
memorizes training data but fails on new data.
Solution:

 Reduce model complexity (e.g., limit maximum tree depth or raise the minimum samples per leaf).
 Use cross-validation or pruning to check and improve generalization.
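The same train/validation gap can be demonstrated without a Random Forest: a degree-9 polynomial fitted to 10 noisy points memorizes them (near-zero training error) but generalizes far worse than a simpler degree-3 fit. This is a stand-in illustration on synthetic data, not the bank's actual model.

```python
import numpy as np

# Ten noisy training points from a sine curve
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

# A denser, noise-free test grid from the same underlying function
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 9):
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

The degree-9 model interpolates all 10 training points exactly, which is precisely the "training error of 0.00" symptom described above.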
QUESTION FOUR (20 MARKS)

(a) Two Benefits of Using NumPy Arrays Over Nested Python Lists:

1. Performance:
o NumPy arrays are optimized for numerical computations and run significantly faster
than lists.
Example:

import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = a + b # Vectorized addition

2. Memory Efficiency:
o Arrays store elements of the same type and require less memory compared to lists.
Example:

python_list = [1, 2, 3, 4]
numpy_array = np.array(python_list)
print(numpy_array.nbytes) # Displays memory usage

(b) Python Code to Manipulate a List:


i. Create the List:

time = [11.25, 18.0, 20.0, 10.75, 9.50, 15.5, 14.5, 14.5]

ii. Add Items to the List:

time.extend([24.5, 15.45])

iii. Print the List:

print(time)

iv. Reverse the List:

time.reverse()
print(time)

(c) Code to Remove Duplicate Rows in a DataFrame:

import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 25]}
df = pd.DataFrame(data)

# Remove duplicates based on 'Name' column
df_unique = df.drop_duplicates(subset='Name')
print(df_unique)
(d) General Syntax for Calling Functions and Saving Results:

1. Syntax:

result = function_name(arguments)

Example:

def add_numbers(a, b):
    return a + b

result = add_numbers(5, 10)  # Saves 15 to result
print(result)

Python Basics and Flow Control

Q5: What are mutable and immutable data types in Python? Provide examples.
A:

 Mutable Data Types: Can be changed after creation. Examples: list, dict, set.

my_list = [1, 2, 3]
my_list.append(4) # Modifies the list
print(my_list) # Output: [1, 2, 3, 4]

 Immutable Data Types: Cannot be changed after creation. Examples: int, float, str, tuple.

my_str = "hello"
my_str[0] = "H" # TypeError: strings are immutable

B: What are the key data types in Python, and what are their uses?
 int: Represents integers (e.g., 10).
 float: Represents decimal numbers (e.g., 3.14).
 str: Represents text (e.g., "hello").
 list: Ordered and mutable collection (e.g., [1, 2, 3]).
 tuple: Ordered and immutable collection (e.g., (1, 2, 3)).
 dict: Key-value pairs (e.g., {"key": "value"}).
 set: Unordered, unique elements (e.g., {1, 2, 3})

C: What is the purpose of the break and continue statements in loops?


A:
 break: Exits the loop prematurely.
 continue: Skips the rest of the current iteration.
Example:

for i in range(5):
    if i == 3:
        break  # Stops at 3
    print(i)

for i in range(5):
    if i == 3:
        continue  # Skips 3
    print(i)

Q6: What are loops, and how are they used in Python? Provide examples of for and while
loops.
A: Loops are used for repetitive tasks.

 For Loop: Iterates over sequences.

for i in range(3):
    print(i)  # Output: 0, 1, 2

 While Loop: Executes as long as a condition is true.

count = 0
while count < 3:
    print(count)
    count += 1
# Output: 0, 1, 2

Introduction to Data Science and Its Value

Q7: What skills are required to succeed in Data Science?


A:

1. Technical Skills:
o Python/R, SQL, and statistical analysis.
o Data visualization tools (e.g., Tableau).
2. Soft Skills:
o Problem-solving.
o Communication for presenting insights.
3. Domain Knowledge: Understanding industry-specific challenges.

Q8: What are some real-world applications of data science?


A:

 Healthcare: Predicting disease outbreaks.


 Finance: Fraud detection.
 Retail: Customer segmentation for personalized marketing.
 Transportation: Optimizing delivery routes.

Data Science Lifecycle/Methodology

Q9: What challenges arise during the Data Science Lifecycle?


A:

 Data Collection: Incomplete or noisy data.


 Modeling: Choosing the right algorithm.
 Deployment: Ensuring scalability in production.

Q10: What is CRISP-DM, and how does it relate to Data Science?


A: CRISP-DM (Cross-Industry Standard Process for Data Mining) is a methodology:

1. Business Understanding.
2. Data Understanding.
3. Data Preparation.
4. Modeling.
5. Evaluation.
6. Deployment.

Python Libraries for Data Science and Visualization

Q11: How does Seaborn enhance data visualization?


A: Seaborn builds on Matplotlib and simplifies creating advanced visualizations.
Example:

import seaborn as sns
import matplotlib.pyplot as plt

sns.barplot(x=['A', 'B', 'C'], y=[10, 20, 15])
plt.show()
Q12: What is Scikit-learn, and why is it important?
A: Scikit-learn provides tools for:

 Supervised learning: Linear regression, decision trees.


 Unsupervised learning: Clustering, PCA.
 Model evaluation: Accuracy, F1 score.

Data Wrangling and Feature Engineering

Q13: What is one-hot encoding, and when is it used?


A: Converts categorical variables into binary vectors.
Example:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})
encoded_df = pd.get_dummies(df, columns=['Color'])
print(encoded_df)

Q14: What is the difference between normalization and standardization?


A:

 Normalization: Scales data to a range (0, 1).


 Standardization: Scales data to have a mean of 0 and standard deviation of 1.
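Both rescalings fit in a short NumPy sketch (libraries such as scikit-learn offer MinMaxScaler and StandardScaler for the same job):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max): squeeze values into the range [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): shift to mean 0 and scale to std 1
standardized = (x - x.mean()) / x.std()

print(normalized)     # [0.   0.25 0.5  0.75 1.  ]
print(standardized.mean(), standardized.std())
```

Normalization is sensitive to outliers (one extreme value compresses everything else), while standardization keeps relative spacing but produces unbounded values.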

Exploratory Data Analysis (EDA)

Q15: What is the role of correlation in EDA?


A: Correlation quantifies the relationship between variables.

 Positive correlation: Variables increase together.


 Negative correlation: One increases as the other decreases.
Example:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 6, 9]})
print(df.corr())  # Shows the correlation matrix

Machine Learning Models

Q16: What is the difference between regression and classification models?


A:
 Regression: Predicts continuous values (e.g., house prices).
 Classification: Predicts discrete labels (e.g., spam vs. not spam).

Q17: What are some examples of clustering algorithms?


A:

 K-Means: Partitions data into K clusters.


 DBSCAN: Detects clusters based on density.
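The K-Means assign/update loop is small enough to sketch in plain NumPy on 1-D data (a real project would use sklearn.cluster.KMeans; the points and starting centroids here are hand-picked for clarity):

```python
import numpy as np

# Two obvious 1-D groups; initial centroids are hand-picked guesses
points = np.array([1.0, 1.5, 2.0, 9.0, 10.0, 11.0])
centroids = np.array([0.0, 5.0])

for _ in range(10):
    # Assignment step: each point joins its nearest centroid
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # Update step: move each centroid to the mean of its assigned points
    centroids = np.array([points[labels == k].mean() for k in range(2)])

print(centroids)  # converges to the two cluster centers, 1.5 and 10.0
```

K-Means needs the number of clusters up front and assumes roughly spherical groups; DBSCAN avoids both assumptions by growing clusters from dense regions.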

Mini Data Science Projects

Q18: Describe a project using a public dataset.


Project: Analyzing Titanic dataset.
Steps:

1. Load and clean data.


2. Visualize survival rates based on gender and class.
3. Build a logistic regression model to predict survival.

MySQL for Data Science

Q19: How can SQL be used for feature extraction?


A: Use SQL queries to group data, calculate aggregates, or extract time-based features.
Example:

SELECT customer_id, AVG(purchase_amount) AS avg_purchase
FROM purchases
GROUP BY customer_id;
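The query above can be tried end-to-end with Python's built-in sqlite3 module; the table contents below are invented for the demonstration.

```python
import sqlite3

# In-memory toy database standing in for a real purchases table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE purchases (customer_id INTEGER, purchase_amount REAL)')
conn.executemany('INSERT INTO purchases VALUES (?, ?)',
                 [(1, 100.0), (1, 200.0), (2, 50.0)])

# One average-spend feature per customer
rows = conn.execute(
    'SELECT customer_id, AVG(purchase_amount) AS avg_purchase '
    'FROM purchases GROUP BY customer_id'
).fetchall()
print(rows)
conn.close()
```

Each resulting (customer_id, avg_purchase) pair is ready to join back onto a feature table for modeling.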
