FDS - 3 SOLVED

TYBCS Foundation of Data Science solved question paper

Q1) Attempt any EIGHT of the following: [8×1=8]

a) List any two applications of Data Science.

Sol:

1. Healthcare: Data science is used in healthcare for predictive modeling to foresee
patient outcomes, disease diagnosis, personalized treatment plans, and management
of hospital resources. For example, machine learning algorithms can predict the
likelihood of diseases based on patient data, helping in early intervention and
prevention strategies.

2. Finance: In finance, data science is applied for fraud detection, risk management,
algorithmic trading, and credit scoring. By analyzing transaction data, financial
institutions can identify unusual patterns that may indicate fraudulent activities and
develop models to assess the risk profile of loan applicants.

b) What are outliers?

Sol:

Outliers are data points that deviate significantly from other observations in a dataset. They
can be caused by variability in the data, errors in data collection, or unusual events.
Outliers can affect the results of data analysis and statistical modeling, making it
important to detect and address them appropriately.

c) What are missing values?

Sol:

Missing values are data points that are not recorded in a dataset. They can occur due to
various reasons, such as errors in data collection, data entry issues, or intentional
omission. Handling missing values is crucial in data analysis, as they can lead to biased
estimates, reduced statistical power, and incorrect conclusions.
d) Define Variance.

Sol:

Variance is a statistical measure that quantifies the dispersion or spread of a set of data
points around their mean. It is calculated as the average of the squared differences
between each data point and the mean of the dataset. A higher variance indicates that the
data points are more spread out from the mean, while a lower variance indicates that they
are closer to the mean.
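
For example, a minimal NumPy sketch (the sample values below are purely illustrative) shows how this definition translates into code:

import numpy as np

data = [4, 8, 6, 5, 3]   # illustrative values
mean = np.mean(data)     # mean of the dataset
variance = np.var(data)  # population variance: average of squared deviations from the mean
print(mean, variance)    # 5.2 2.96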

e) What is a nominal attribute?

Sol:

A nominal attribute is a type of categorical attribute that represents discrete and unordered
categories or labels. Nominal attributes are used to identify or classify data without
implying any order or hierarchy among the categories. Examples include gender (male,
female), colors (red, blue, green), and types of vehicles (car, truck, bike).

f) What is data transformation?

Sol:

Data transformation involves converting data from one format or structure to another. This
process is essential for preparing data for analysis, ensuring compatibility with analytical
tools, and improving the quality of data. Common data transformation techniques include
normalization, standardization, aggregation, and encoding of categorical variables.

g) What is one hot encoding?

Sol:

One hot encoding is a technique used to convert categorical variables into a binary (0 or 1)
format. Each category of the variable is transformed into a separate binary column, where
the presence of the category is represented by 1, and the absence is represented by 0. This
method is commonly used in machine learning to handle categorical data.
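
As a brief illustration, a pandas sketch using get_dummies (the 'color' column and its values are made up for the example):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})  # illustrative categorical data
encoded = pd.get_dummies(df, columns=['color'])                # one binary column per category
print(encoded)
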
h) What is the use of a Bubble plot?

Sol:

A bubble plot is a type of data visualization that displays three dimensions of data. Each
point in the plot represents an observation, with its position determined by two variables (X
and Y coordinates), and the size of the bubble representing a third variable. Bubble plots
are useful for visualizing relationships among multiple variables and identifying patterns or
trends.
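
As an illustration, a minimal Matplotlib sketch (all values are made up) in which bubble size encodes the third variable:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]              # first variable (X position)
y = [10, 20, 25, 30, 35]         # second variable (Y position)
sizes = [40, 100, 300, 60, 200]  # third variable, shown as bubble size
plt.scatter(x, y, s=sizes, alpha=0.5)
plt.xlabel('X variable')
plt.ylabel('Y variable')
plt.title('Sample Bubble Plot')
plt.show()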

i) Define Data visualization.

Sol:

Data visualization is the process of representing data in a graphical or pictorial format to
make it easier to understand, interpret, and communicate. Visualization techniques such
as charts, graphs, and maps help to uncover insights, identify trends, and highlight
relationships in the data, making it accessible to a wider audience.

j) Define standard deviation.

Sol:

Standard deviation is a statistical measure that quantifies the amount of variation or
dispersion in a set of data points. It is the square root of the variance and provides an
indication of how much individual data points differ from the mean. A low standard
deviation indicates that data points are close to the mean, while a high standard deviation
indicates greater variability.
Q2) Attempt any FOUR of the following. [4×2=8]

a) Differentiate between structured and unstructured data.

Sol:

| Feature | Structured Data | Unstructured Data |
----------------------------------------------------------
| Definition | Organized and formatted in a fixed schema or structure | Lacks a predefined format or organization |
| Storage | Stored in relational databases, spreadsheets, data warehouses | Stored in data lakes, NoSQL databases, file systems |
| Examples | Tables in databases, Excel sheets, SQL databases | Text documents, images, videos, emails, social media posts |
| Querying | Easily queried using SQL and other structured query languages | Requires more complex processing and analytics tools |
| Ease of Analysis | Easier to analyze due to well-defined structure | More challenging to analyze; requires advanced tools for processing |
| Data Types | Typically numeric or categorical | Can include text, multimedia, and mixed types |
| Data Size | Usually smaller and manageable | Often larger in size and complexity |
| Flexibility | Less flexible, as schema must be defined beforehand | More flexible, can accommodate various types of data without a predefined schema |

b) What is inferential statistics?

Sol:

Inferential statistics is a branch of statistics that allows us to make predictions or
inferences about a population based on a sample of data drawn from that population. It
involves using data from a sample to estimate population parameters and test hypotheses.
This helps in understanding patterns, relationships, and trends that might not be apparent
from the sample data alone. Inferential statistics employs methods such as hypothesis
testing, confidence intervals, and regression analysis to draw conclusions and make
decisions.
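
As a brief illustration, a hypothesis-testing sketch with SciPy (the sample values and the hypothesized population mean of 50 are made up for the example):

from scipy import stats

sample = [52, 49, 55, 50, 53, 48, 51, 54]                 # illustrative sample from a larger population
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)   # test H0: population mean equals 50
print("t =", t_stat, "p =", p_value)
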
c) What do you mean by data preprocessing?

Sol:

Data preprocessing is the process of transforming raw data into a clean and usable format
before it is fed into a machine learning model or any data analysis tool. This involves
several steps:

1. Data Cleaning: Removing noise, handling missing values, and correcting
inconsistencies in the data.

2. Data Integration: Combining data from multiple sources to create a cohesive dataset.

3. Data Transformation: Normalizing or scaling the data to ensure all features
contribute equally to the analysis.

4. Data Reduction: Reducing the volume of data by aggregating, selecting features, or
using dimensionality reduction techniques.

Data preprocessing is crucial for improving the quality of the data and ensuring that the
analytical model built on it performs well.
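
A minimal sketch of the cleaning and transformation steps, assuming pandas and scikit-learn are available (the data values are illustrative):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative raw data with a missing value
df = pd.DataFrame({'age': [25, 30, None, 35, 40],
                   'income': [40000, 52000, 61000, 58000, 75000]})

# Data cleaning: fill the missing age with the column mean
df['age'] = df['age'].fillna(df['age'].mean())

# Data transformation: scale both features to the [0, 1] range
scaler = MinMaxScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])

print(df)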

d) Define Data Discretization

Sol:

Data discretization is the process of converting continuous data attributes into discrete
categories or intervals. This is often done to simplify the data and make it more
manageable for analysis. There are several methods of data discretization:

1. Binning: Dividing the range of a continuous variable into intervals (bins), and then
assigning each data point to a bin.

2. Clustering: Grouping data points into clusters based on their similarities, and then
treating each cluster as a discrete category.

3. Decision Tree: Using decision tree algorithms to create bins based on the criteria
that best split the data.

Data discretization is particularly useful in transforming numerical data into categorical
data for use in classification algorithms.
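
For instance, a short binning sketch with pandas (the age values and bin edges are illustrative):

import pandas as pd

ages = pd.Series([15, 22, 37, 45, 63, 70])     # illustrative continuous values
bins = pd.cut(ages, bins=[0, 18, 40, 60, 100],
              labels=['child', 'young adult', 'middle-aged', 'senior'])
print(bins)                                    # each value mapped to a discrete interval
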
e) What is visual encoding?

Sol:

Visual encoding refers to the process of representing data in a visual format that leverages
human visual perception to communicate information effectively. This involves mapping
data attributes to visual elements such as position, size, shape, color, and texture. The goal
of visual encoding is to create clear and intuitive visualizations that enable users to quickly
understand and interpret complex data. Effective visual encoding can highlight patterns,
trends, and anomalies, making it easier to derive insights from data.

Q3) Attempt any two of the following. [2×4=8]

a) Explain outlier detection methods in brief

Sol:

Outlier detection involves identifying data points that deviate significantly from the rest of
the data. These outliers can indicate errors, variability in measurements, or novel
phenomena. Several methods are used for detecting outliers:

1. Statistical Methods:

• Z-Score: Measures the number of standard deviations a data point is from the
mean. Data points with a Z-score beyond a certain threshold (e.g., ±3) are
considered outliers.

Z = (X − μ) / σ

• IQR (Interquartile Range): Identifies outliers as data points that fall below Q1 -
1.5IQR or above Q3 + 1.5IQR, where Q1 and Q3 are the first and third quartiles,
respectively.

IQR = Q3 − Q1

• Boxplot: A graphical method using quartiles and whiskers to identify outliers visually.

2. Machine Learning Methods:

o Isolation Forest: Constructs an ensemble of trees to isolate observations by
randomly selecting a feature and splitting the data. Outliers are isolated more
quickly compared to normal points.

o DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters
data points based on their density. Points in low-density regions are considered
outliers.

3. Distance-Based Methods:

o K-Nearest Neighbors (KNN): Identifies outliers by measuring the distance of each
data point to its K nearest neighbors. Points with distances significantly greater
than others are considered outliers.

4. Visual Methods:

o Scatter Plots: Plots data points on a Cartesian plane to identify outliers visually.

o Histogram: Visualizes the distribution of data and helps identify outliers as bars
with significantly lower or higher frequency.

Outlier detection is an essential step in data preprocessing to ensure the quality and
reliability of the analysis or predictive models built on the data.
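
As a brief illustration of the machine learning approach, an Isolation Forest sketch with scikit-learn (the data values and contamination setting are illustrative):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([10, 12, 14, 15, 18, 21, 100]).reshape(-1, 1)   # illustrative 1-D data with one extreme value

model = IsolationForest(contamination=0.15, random_state=0)
labels = model.fit_predict(X)        # -1 marks points isolated as outliers, 1 marks inliers
print(X[labels == -1].ravel())       # the extreme value is expected to be flagged
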
b) Write different data visualization libraries in Python

Sol:

Python offers several powerful libraries for data visualization. Some of the most commonly
used libraries are:

1. Matplotlib:

o Description: Matplotlib is one of the most widely used data visualization libraries in
Python. It provides a flexible platform for creating static, animated, and interactive
visualizations.

o Capabilities: Line plots, scatter plots, bar charts, histograms, pie charts, and more.

o Example:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 35]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sample Line Plot')
plt.show()

2. Seaborn:

o Description: Built on top of Matplotlib, Seaborn offers a high-level interface for
drawing attractive and informative statistical graphics.

o Capabilities: Enhanced visualizations, including heatmaps, violin plots, box plots, and
pair plots.

o Example:

import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset("iris")
sns.pairplot(data, hue="species")
plt.show()

3. Plotly:

o Description: Plotly is an interactive, open-source plotting library that supports a wide
range of visualization types and interactive features.

o Capabilities: 3D plots, geographic maps, interactive charts, and dashboards.

o Example:

import plotly.express as px

df = px.data.iris()
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species')
fig.show()

4. Bokeh:

o Description: Bokeh provides an elegant and concise way to create interactive
visualizations for modern web browsers.

o Capabilities: Interactive plots, dashboards, and data applications.

o Example:

from bokeh.plotting import figure, show

p = figure(title="Bokeh Plot Example", x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], legend_label="Line", line_width=2)
show(p)

5. Altair:

o Description: Altair is a declarative statistical visualization library based on Vega and
Vega-Lite, designed for creating simple yet powerful visualizations.

o Capabilities: Interactive visualizations with concise code.

o Example:
import altair as alt
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [3, 4, 5, 6, 7]
})

chart = alt.Chart(data).mark_line().encode(
    x='a',
    y='b'
)

chart.show()  # in a notebook, simply evaluating `chart` renders it

c) What is data cleaning? Explain any two data cleaning methods in detail.

Sol:

Data cleaning is the process of identifying, correcting, or removing errors and
inconsistencies in a dataset. This involves handling missing values, outliers, duplicate
records, and formatting issues to ensure that the data is accurate, complete, and ready for
analysis.

Two Data Cleaning Methods:

1. Handling Missing Values:

o Description: Missing values can occur due to various reasons such as data entry
errors, data corruption, or incomplete data collection. Handling missing values is
crucial for accurate data analysis.

o Techniques:

i. Imputation: Replacing missing values with estimated values, such as the mean,
median, or mode of the attribute.

▪ Example:

import pandas as pd

data = {'age': [25, 30, None, 35, 40]}
df = pd.DataFrame(data)
df['age'] = df['age'].fillna(df['age'].mean())
print(df)

▪ Explanation: In this example, the missing value in the 'age' column is replaced with
the mean of the non-missing values.

ii. Deletion: Removing records with missing values from the dataset, either through
complete case analysis (removing entire rows) or pairwise deletion (removing
specific values).

▪ Example:

import pandas as pd

data = {'age': [25, 30, None, 35, 40]}
df = pd.DataFrame(data)
df.dropna(inplace=True)
print(df)

▪ Explanation: In this example, any row with a missing value in the 'age' column is
removed from the dataset.

2. Handling Outliers:

o Description: Outliers are data points that significantly differ from other observations
in a dataset. They can skew statistical analyses and affect the accuracy of models.

o Techniques:

i. Z-Score Method: Identifying outliers using the Z-score, which measures the
number of standard deviations a data point is from the mean. Data points whose
Z-score exceeds a chosen threshold in absolute value (commonly 3) are considered outliers.

▪ Example:

import numpy as np

data = [10, 12, 14, 15, 18, 21, 100]  # 100 is an outlier
mean = np.mean(data)
std_dev = np.std(data)
z_scores = [(x - mean) / std_dev for x in data]
# A threshold of 3 suits large datasets; for this small sample a threshold of 2
# is used so that the extreme value is flagged.
outliers = [x for x, z in zip(data, z_scores) if np.abs(z) > 2]
print("Outliers:", outliers)

▪ Explanation: In this example, the Z-score method flags 100 as an outlier because its
Z-score (about 2.4) exceeds the chosen threshold, while every other value lies well within it.

ii. IQR Method: Identifying outliers using the Interquartile Range (IQR), which
measures the spread of the middle 50% of the data. Data points that fall below Q1
- 1.5 IQR or above Q3 + 1.5 IQR are considered outliers.

▪ Example:

import numpy as np

data = [10, 12, 14, 15, 18, 21, 100]  # 100 is an outlier
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print("Outliers:", outliers)

▪ Explanation: In this example, the IQR method identifies 100 as an outlier because it
falls outside the range defined by Q1 - 1.5 IQR and Q3 + 1.5 IQR.
Q4) Attempt any two of the following:

a) Explain 3 V’ s of Data Science

Sol:

1. Volume:

o Description: The volume characteristic of data refers to the sheer amount of
data generated and stored in data systems. It signifies the massive quantities
of data that organizations collect, process, and analyze.

o Example: Social media platforms generating terabytes of user data every day.

2. Velocity:

o Description: Velocity refers to the speed at which data is generated,
processed, and analyzed. It emphasizes the need for real-time or near
real-time data processing to gain timely insights.

o Example: Stock market data where prices and trades are updated every
second.

3. Variety:

o Description: Variety refers to the different types and sources of data. It
includes structured data (databases), semi-structured data (XML, JSON), and
unstructured data (text, images, videos).

o Example: Data from social media posts, emails, transaction records, and
multimedia content.
b) Explain data cube aggregation method in detail

Sol: -

Data cube aggregation is a method used in data warehousing and Online Analytical
Processing (OLAP) to represent multi-dimensional data. It allows users to explore and
analyze data in a summarized manner by aggregating data across different dimensions. A
data cube consists of cells, each representing an aggregated value such as sum, count,
average, etc., of a measure over multiple dimensions.

Key Concepts:

1. Dimensions: These are the perspectives or entities with respect to which data can
be organized. For example, time, location, and product.

2. Measures: These are the numerical values that are analyzed. For example, sales,
revenue, and profit.

3. Aggregation Operations: Common operations include sum, average, min, max, and
count.

Steps in Data Cube Aggregation:

1. Data Loading: Raw data is loaded into the data warehouse from different sources.

2. Data Preprocessing: Cleaning and transformation are applied to ensure data quality.

3. Cube Creation: The data cube is created by defining the dimensions and measures.
Each cell in the cube represents an aggregated value for a specific combination of
dimension values.

4. Data Aggregation: Aggregation operations are performed to compute the values for
each cell in the data cube. For example, summing sales for each product-category
over different time periods.

5. Querying the Cube: Users can query the data cube to retrieve summarized data. For
example, querying total sales for a specific product across different regions.

Example:

Consider a sales data warehouse with the following dimensions: Time (Year, Quarter,
Month), Product (Category, Sub-Category), and Location (Country, State, City). The
measure is Sales.
A data cube for this scenario would look like:

| Time | Product | Location | Sales |

----------------------------------------------------------

| Year, Quarter, Month | Category, Sub-Category | Country, State, City | Aggregated Sales |
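
A rough pandas sketch of the same idea, using pivot_table to aggregate a measure over two dimensions (the sample records are illustrative):

import pandas as pd

sales = pd.DataFrame({
    'Year': [2022, 2022, 2023, 2023],
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1200, 800, 1500, 950],
})

# Aggregate the Sales measure over every combination of Year and Category
cube = sales.pivot_table(index='Year', columns='Category', values='Sales',
                         aggfunc='sum', margins=True)   # margins adds roll-up totals
print(cube)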

Advantages:

• Improved Query Performance: Pre-aggregated data allows faster query responses.

• Flexible Analysis: Users can drill down or roll up to different levels of aggregation.

• Holistic View: Provides a comprehensive view of data across multiple dimensions.

Disadvantages:

• Storage Space: Requires significant storage space for large datasets.

• Complexity: Can be complex to design and maintain for very large data sets with
many dimensions.
c) Explain any two data transformation techniques in detail

Sol: -

1. Normalization:

o Normalization is the process of scaling numerical data to a standard range, usually
between 0 and 1, or -1 and 1.

o Purpose: It ensures that no single feature dominates others due to its scale,
especially important for algorithms that calculate distances, such as k-means
clustering and k-nearest neighbors.

o Technique:

▪ Min-Max Scaling: Transforms each feature to a range between 0 and 1.

Xnorm = (X – Xmin ) / (Xmax − Xmin)

Example: If the original range of a feature is [10, 20], the normalized value for 15 would be:

Xnorm = (15 − 10) / (20 − 10) = 5 / 10 = 0.5

2. Log Transformation:

o Log transformation is the process of applying a logarithmic function to each data
point. Commonly used log functions are the natural log (ln), base-2 log (log2), and
base-10 log (log10).

o Purpose: It reduces skewness in the data, stabilizes variance, and makes the data
more normally distributed. It is particularly useful for data with exponential growth or
wide-ranging values.

o Technique:

▪ Application: Apply the log function to each data point.

X′ = log (X + 1)

• Example: If the original data point is 1000, the log-transformed value using natural
log would be:

X′ = ln (1000+1) ≈ 6.908
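
A short NumPy sketch of both techniques, using made-up values with a wide range:

import numpy as np

data = np.array([10, 15, 20, 1000], dtype=float)   # illustrative values

# Min-max normalization to the [0, 1] range
normalized = (data - data.min()) / (data.max() - data.min())

# Log transformation (natural log, with +1 to handle zero values)
log_transformed = np.log(data + 1)

print(normalized)
print(log_transformed)
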
Q5) Attempt any ONE of the following. [1×3=3]

a) Write a short note on feature extraction

Sol: -

Feature Extraction is a crucial step in the data preprocessing phase of machine learning
and data science. It involves transforming raw data into a set of meaningful features that
can be effectively used by machine learning models to improve their performance. The
main objective of feature extraction is to reduce the dimensionality of the data while
retaining its significant information and characteristics.

Key Concepts:

1. Dimensionality Reduction:

o Principal Component Analysis (PCA): A statistical technique that transforms the
original features into a set of linearly uncorrelated components, ordered by the
amount of variance they explain in the data. PCA reduces the dimensionality of the
dataset while preserving as much variability as possible (see the sketch after this list).

o Linear Discriminant Analysis (LDA): A method used to find a linear combination of
features that best separates two or more classes of objects. LDA is used for both
dimensionality reduction and classification.

2. Feature Engineering:

o Creating New Features: Deriving new features from existing ones using domain
knowledge. For example, creating a feature that represents the day of the week from a
timestamp.

o Transformation: Applying mathematical transformations such as logarithmic scaling,
polynomial transformations, or normalizations to existing features.
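
As mentioned above, a minimal PCA sketch with scikit-learn (the Iris dataset is used purely as an example):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)        # 4 original features

pca = PCA(n_components=2)                # keep the 2 components explaining the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # share of variance captured by each component
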
Steps in Feature Extraction:

1. Identify Relevant Features:

o Analyze the dataset to identify features that are most relevant to the problem at
hand. This may involve domain knowledge and exploratory data analysis.

2. Extract Features:

o Use algorithms and statistical methods to extract meaningful features from the raw
data. For example, in image processing, edge detection algorithms can extract
features representing edges in an image.

3. Select the Best Features:

o Apply feature selection techniques to choose the most relevant features for the
machine learning model. Techniques include filter methods (e.g., correlation
coefficient), wrapper methods (e.g., recursive feature elimination), and embedded
methods (e.g., Lasso).

Applications:

1. Image Processing:

o Features such as edges, textures, and shapes are extracted from images using
techniques like SIFT (Scale-Invariant Feature Transform) or HOG (Histogram of
Oriented Gradients).

2. Natural Language Processing (NLP):

o Features such as word frequencies, n-grams, and word embeddings (e.g.,
Word2Vec, GloVe) are extracted from text data to represent semantic information.
3. Signal Processing:

o Features such as frequency components, amplitude, and phase are extracted
from audio signals for applications like speech recognition or music classification.

Benefits:

• Improved Model Performance: Reduces overfitting by eliminating irrelevant and
redundant features, leading to more accurate models.

• Reduced Computational Cost: Simplifies the dataset, making it more efficient to
process and analyze.

• Enhanced Interpretability: Helps in understanding the underlying structure and
relationships in the data.

Example:

Consider a dataset containing timestamps, and you need to extract features for a machine
learning model. Using feature extraction techniques, you could create features such as the
hour of the day, day of the week, and whether the timestamp falls on a weekend or
weekday. These new features can provide valuable information for the model to improve its
performance.
b) Explain Exploratory Data Analysis (EDA) in detail

Sol: -

Exploratory Data Analysis (EDA) is the process of analyzing and summarizing the main
characteristics of a dataset using statistical and graphical methods. EDA aims to uncover
patterns, detect anomalies, and test hypotheses to gain insights into the data.

EDA is crucial in the data analysis process as it helps in understanding the data's structure,
identifying potential issues, and informing the choice of further analysis techniques and
models. It provides a foundation for building predictive models and making data-driven
decisions.

1. Steps in EDA:

o Data Collection: Gathering relevant data from various sources.

o Data Cleaning: Handling missing values, outliers, and inconsistencies to ensure data
quality.

o Data Transformation: Transforming data into a suitable format for analysis, such as
normalization or standardization.

o Summary Statistics: Calculating measures such as mean, median, mode, variance,
and standard deviation to understand the distribution of the data.

o Data Visualization: Using graphical techniques such as histograms, boxplots, scatter
plots, and correlation matrices to visualize data and identify relationships.

2. Techniques:

o Descriptive Statistics: Summarizing data using numerical measures (mean, median,
variance, etc.).

o Graphical Methods: Visualizing data using charts and plots to detect patterns, trends,
and outliers.

o Correlation Analysis: Assessing relationships between variables using correlation
coefficients and scatter plots.
3. Example:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
df = sns.load_dataset('iris')

# Summary statistics
print(df.describe())

# Histogram
df['sepal_length'].hist()
plt.title('Sepal Length Distribution')
plt.show()

# Scatter plot
sns.scatterplot(x='sepal_length', y='sepal_width', data=df, hue='species')
plt.title('Sepal Length vs Sepal Width')
plt.show()

# Correlation matrix (numeric columns only, since 'species' is categorical)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
