FDS - 3 SOLVED
Sol:
2. Finance: In finance, data science is applied for fraud detection, risk management,
algorithmic trading, and credit scoring. By analyzing transaction data, financial
institutions can identify unusual patterns that may indicate fraudulent activities and
develop models to assess the risk profile of loan applicants.
Sol:
Outliers are data points that deviate significantly from other observations in a dataset. They
can be caused by variability in the data, errors in data collection, or unusual events.
Outliers can affect the results of data analysis and statistical modeling, making it
important to detect and address them appropriately.
Sol:
Missing values are data points that are not recorded in a dataset. They can occur due to
various reasons, such as errors in data collection, data entry issues, or intentional
omission. Handling missing values is crucial in data analysis, as they can lead to biased
estimates, reduced statistical power, and incorrect conclusions.
d) Define Variance.
Sol:
Variance is a statistical measure that quantifies the dispersion or spread of a set of data
points around their mean. It is calculated as the average of the squared differences
between each data point and the mean of the dataset. A higher variance indicates that the
data points are more spread out from the mean, while a lower variance indicates that they
are closer to the mean.
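In symbols, for a dataset of N values X1, X2, …, XN with mean μ, the population variance is:
σ² = Σ (Xi − μ)² / N
(The sample variance divides by N − 1 instead of N.)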
Sol:
A nominal attribute is a type of categorical attribute that represents discrete and unordered
categories or labels. Nominal attributes are used to identify or classify data without
implying any order or hierarchy among the categories. Examples include gender (male,
female), colors (red, blue, green), and types of vehicles (car, truck, bike).
Sol:
Data transformation involves converting data from one format or structure to another. This
process is essential for preparing data for analysis, ensuring compatibility with analytical
tools, and improving the quality of data. Common data transformation techniques include
normalization, standardization, aggregation, and encoding of categorical variables.
Sol:
One hot encoding is a technique used to convert categorical variables into a binary (0 or 1)
format. Each category of the variable is transformed into a separate binary column, where
the presence of the category is represented by 1, and the absence is represented by 0. This
method is commonly used in machine learning to handle categorical data.
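A minimal sketch using pandas get_dummies (the 'color' column and its values below are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})
# Each category becomes a separate 0/1 column
encoded = pd.get_dummies(df, columns=['color'], dtype=int)
print(encoded)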
h) What is the use of a Bubble plot?
Sol:
A bubble plot is a type of data visualization that displays three dimensions of data. Each
point in the plot represents an observation, with its position determined by two variables (X
and Y coordinates), and the size of the bubble representing a third variable. Bubble plots
are useful for visualizing relationships among multiple variables and identifying patterns or
trends.
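A minimal Matplotlib sketch of a bubble plot, with assumed sample values for the three variables:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]                 # first variable (X position)
y = [10, 20, 15, 25, 30]            # second variable (Y position)
sizes = [100, 300, 200, 400, 500]   # third variable (bubble size)

plt.scatter(x, y, s=sizes, alpha=0.5)
plt.xlabel('X variable')
plt.ylabel('Y variable')
plt.title('Bubble Plot')
plt.show()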
Sol:
Data preprocessing is the process of transforming raw data into a clean and usable format
before it is fed into a machine learning model or any data analysis tool. This involves
several steps: data cleaning (handling missing values, outliers, and inconsistencies), data
integration (combining data from multiple sources), data transformation (such as
normalization and encoding), and data reduction (keeping only the most relevant features).
Sol:
Data discretization is the process of converting continuous data attributes into discrete
categories or intervals. This is often done to simplify the data and make it more
manageable for analysis. There are several methods of data discretization:
1. Binning: Dividing the range of a continuous variable into intervals (bins), and then
assigning each data point to a bin.
2. Clustering: Grouping data points into clusters based on their similarities, and then
treating each cluster as a discrete category.
3. Decision Tree: Using decision tree algorithms to create bins based on the criteria
that best split the data. Data discretization is particularly useful in transforming
numerical data into categorical data for use in classification algorithms.
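A minimal sketch of the binning method using pandas cut (the ages and bin edges below are assumptions for illustration):
import pandas as pd

ages = pd.Series([5, 17, 25, 34, 48, 62, 71])
# Assign each continuous value to a labelled interval (bin)
bins = pd.cut(ages, bins=[0, 18, 40, 65, 100], labels=['child', 'young', 'middle-aged', 'senior'])
print(bins)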
e) What is visual encoding?
Sol:
Visual encoding refers to the process of representing data in a visual format that leverages
human visual perception to communicate information effectively. This involves mapping
data attributes to visual elements such as position, size, shape, color, and texture. The goal
of visual encoding is to create clear and intuitive visualizations that enable users to quickly
understand and interpret complex data. Effective visual encoding can highlight patterns,
trends, and anomalies, making it easier to derive insights from data.
Sol:
Outlier detection involves identifying data points that deviate significantly from the rest of
the data. These outliers can indicate errors, variability in measurements, or novel
phenomena. Several methods are used for detecting outliers:
1. Statistical Methods:
• Z-Score: Measures the number of standard deviations a data point is from the
mean. Data points with a Z-score beyond a certain threshold (e.g., ±3) are
considered outliers.
Z = (X − μ) / σ
• IQR (Interquartile Range): Identifies outliers as data points that fall below Q1 -
1.5IQR or above Q3 + 1.5IQR, where Q1 and Q3 are the first and third quartiles,
respectively.
IQR = Q3 − Q1
2. Distance-Based Methods:
o Points whose distance from their nearest neighbours is unusually large (for example,
based on the k-nearest-neighbours distance) are flagged as outliers.
3. Visual Methods:
o Scatter Plots: Plots data points on a Cartesian plane to identify outliers visually.
o Histogram: Visualizes the distribution of data and helps identify outliers as bars
with significantly lower or higher frequency.
Outlier detection is an essential step in data preprocessing to ensure the quality and
reliability of the analysis or predictive models built on the data.
b) Write different data visualization libraries in Python
Sol:
Python offers several powerful libraries for data visualization. Some of the most commonly
used libraries are:
1. Matplotlib:
o Description: Matplotlib is one of the most widely used data visualization libraries in
Python. It provides a flexible platform for creating static, animated, and interactive
visualizations.
o Capabilities: Line plots, scatter plots, bar charts, histograms, pie charts, and more.
o Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # sample values
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
2. Seaborn:
o Description: Seaborn is built on top of Matplotlib and provides a high-level interface for
drawing attractive statistical graphics.
o Capabilities: Enhanced visualizations, including heatmaps, violin plots, box plots, and
pair plots.
o Example:
import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset("iris")
sns.pairplot(data, hue="species")
plt.show()
3. Plotly:
o Description: Plotly is a library for creating interactive, web-based visualizations.
o Capabilities: Interactive scatter plots, line charts, 3D plots, maps, and dashboards.
o Example:
import plotly.express as px

df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()
4. Bokeh:
o Description: Bokeh creates interactive visualizations that render in modern web browsers.
o Capabilities: Interactive plots, dashboards, and applications for streaming or large datasets.
o Example:
from bokeh.plotting import figure, show

p = figure(title="Simple Line Plot", x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4, 5], [2, 4, 6, 8, 10], line_width=2)
show(p)
5. Altair:
o Description: Altair is a declarative visualization library based on Vega-Lite.
o Capabilities: Concise, declarative charts such as line, bar, scatter, and layered plots.
o Example:
import altair as alt
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],  # sample values
    'b': [3, 4, 5, 6, 7]
})
chart = alt.Chart(data).mark_line().encode(
    x='a',
    y='b'
)
chart.show()
c) What is data cleaning? Explain any two data cleaning methods in detail.
Sol:
Data cleaning is the process of detecting and correcting or removing inaccurate, incomplete,
or inconsistent data from a dataset so that it is suitable for analysis. Two common data
cleaning methods are described below.
1. Handling Missing Values:
o Description: Missing values can occur due to various reasons such as data entry
errors, data corruption, or incomplete data collection. Handling missing values is
crucial for accurate data analysis.
o Techniques:
i. Imputation: Replacing missing values with estimated values, such as the mean,
median, or mode of the attribute.
▪ Example:
import pandas as pd

data = {'age': [25, 30, None, 35, 40]}
df = pd.DataFrame(data)
# Replace the missing value with the mean of the non-missing values
df['age'] = df['age'].fillna(df['age'].mean())
print(df)
▪ Explanation: In this example, the missing value in the 'age' column is replaced with
the mean of the non-missing values.
ii. Deletion: Removing records with missing values from the dataset, either through
complete case analysis (removing entire rows) or pairwise deletion (removing
specific values).
▪ Example:
import pandas as pd

data = {'age': [25, 30, None, 35, 40]}
df = pd.DataFrame(data)
# Drop any row that contains a missing value
df = df.dropna()
print(df)
▪ Explanation: In this example, any row with a missing value in the 'age' column is
removed from the dataset.
2. Handling Outliers:
o Description: Outliers are data points that significantly differ from other observations
in a dataset. They can skew statistical analyses and affect the accuracy of models.
o Techniques:
i. Z-Score Method: Identifying outliers using the Z-score, which measures the
number of standard deviations a data point is from the mean. Data points whose absolute
Z-score exceeds a chosen threshold (commonly 3, or a smaller value such as 2 for very
small samples) are considered outliers.
▪ Example:
import numpy as np

data = [10, 12, 14, 15, 18, 21, 100]  # 100 is an outlier
mean = np.mean(data)
std_dev = np.std(data)
z_scores = (np.array(data) - mean) / std_dev
# A threshold of 2 (rather than 3) is used because the sample is very small
print([x for x, z in zip(data, z_scores) if abs(z) > 2])  # prints [100]
▪ Explanation: In this example, the Z-score method flags 100 as an outlier because its
Z-score (about 2.4) exceeds the chosen threshold of 2.
ii. IQR Method: Identifying outliers using the Interquartile Range (IQR), which
measures the spread of the middle 50% of the data. Data points that fall below Q1
- 1.5 IQR or above Q3 + 1.5 IQR are considered outliers.
▪ Example:
import numpy as np
data = [10, 12, 14, 15, 18, 21, 100] # 100 is an outlier
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
# Flag values outside [lower, upper]
print([x for x in data if x < lower or x > upper])  # prints [100]
▪ Explanation: In this example, the IQR method identifies 100 as an outlier because it
falls outside the range defined by Q1 - 1.5 IQR and Q3 + 1.5 IQR.
Q4) Attempt any two of the following:
Sol:
1. Volume:
o Description: Refers to the enormous amount of data generated and stored, often
measured in terabytes or petabytes.
o Example: Social media platforms generating terabytes of user data every day.
2. Velocity:
o Description: Refers to the speed at which data is generated, collected, and processed.
o Example: Stock market data where prices and trades are updated every
second.
3. Variety:
o Description: Refers to the different types and formats of data, including structured,
semi-structured, and unstructured data.
o Example: Data from social media posts, emails, transaction records, and
multimedia content.
b) Explain data cube aggregation method in detail
Sol: -
Data cube aggregation is a method used in data warehousing and Online Analytical
Processing (OLAP) to represent multi-dimensional data. It allows users to explore and
analyze data in a summarized manner by aggregating data across different dimensions. A
data cube consists of cells, each representing an aggregated value such as sum, count,
average, etc., of a measure over multiple dimensions.
Key Concepts:
1. Dimensions: These are the perspectives or entities with respect to which data can
be organized. For example, time, location, and product.
2. Measures: These are the numerical values that are analyzed. For example, sales,
revenue, and profit.
3. Aggregation Operations: Common operations include sum, average, min, max, and
count.
Steps in Data Cube Aggregation:
1. Data Loading: Raw data is loaded into the data warehouse from different sources.
2. Data Preprocessing: Cleaning and transformation are applied to ensure data quality.
3. Cube Creation: The data cube is created by defining the dimensions and measures.
Each cell in the cube represents an aggregated value for a specific combination of
dimension values.
4. Data Aggregation: Aggregation operations are performed to compute the values for
each cell in the data cube. For example, summing sales for each product-category
over different time periods.
5. Querying the Cube: Users can query the data cube to retrieve summarized data. For
example, querying total sales for a specific product across different regions.
Example:
Consider a sales data warehouse with the following dimensions: Time (Year, Quarter,
Month), Product (Category, Sub-Category), and Location (Country, State, City). The
measure is Sales.
A data cube for this scenario stores an aggregated Sales value for every combination of the
dimension levels:
| Time (Year, Quarter, Month) | Product (Category, Sub-Category) | Location (Country, State, City) | Aggregated Sales |
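The aggregation step can be sketched with pandas, using a hypothetical set of sales records (the column names and values below are assumptions for illustration; pivot_table plays the role of the cube's aggregation operation):
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    'Year': [2023, 2023, 2023, 2024, 2024],
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics'],
    'Country': ['India', 'India', 'USA', 'USA', 'India'],
    'Sales': [1000, 500, 1500, 700, 1200],
})

# Aggregate Sales over the Year and Category dimensions (a 2-D slice of the cube)
cube = sales.pivot_table(values='Sales', index='Year', columns='Category', aggfunc='sum')
print(cube)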
Advantages:
• Flexible Analysis: Users can drill down or roll up to different levels of aggregation.
Disadvantages:
• Complexity: Can be complex to design and maintain for very large data sets with
many dimensions.
c) Explain any two data transformation techniques in detail
Sol: -
1. Normalization:
o Purpose: Normalization rescales numerical features to a common range, typically [0, 1].
It ensures that no single feature dominates others due to its scale, which is especially
important for algorithms that calculate distances, such as k-means clustering and
k-nearest neighbors.
o Technique: Min-max normalization
X_norm = (X − X_min) / (X_max − X_min)
Example: If the original range of a feature is [10, 20], the normalized value for 15 would be:
X_norm = (15 − 10) / (20 − 10) = 5 / 10 = 0.5
2. Log Transformation:
o Purpose: It reduces skewness in the data, stabilizes variance, and makes the data
more normally distributed. It is particularly useful for data with exponential growth or
wide-ranging values.
o Technique:
X′ = log (X + 1)
• Example: If the original data point is 1000, the log-transformed value using natural
log would be:
X′ = ln (1000+1) ≈ 6.908
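The two transformations above can be sketched in Python as follows (the sample values are assumptions for illustration):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [15.0], [20.0]])  # assumed sample feature values

# Min-max normalization: rescale the column to the range [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())  # [0.  0.5 1. ]

# Log transformation: log(X + 1) compresses large values and reduces skew
print(np.log1p(1000))  # ≈ 6.908, matching the worked example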
Q5) Attempt any ONE of the following. [1×3=3]
Sol: -
Feature Extraction is a crucial step in the data preprocessing phase of machine learning
and data science. It involves transforming raw data into a set of meaningful features that
can be effectively used by machine learning models to improve their performance. The
main objective of feature extraction is to reduce the dimensionality of the data while
retaining its significant information and characteristics.
Key Concepts:
1. Dimensionality Reduction:
o Reducing the number of input features while retaining the most important information,
for example using Principal Component Analysis (PCA).
2. Feature Engineering:
o Creating New Features: Deriving new features from existing ones using domain
knowledge. For example, creating a feature that represents the day of the week from a
timestamp.
Steps in Feature Extraction:
1. Identify Relevant Features:
o Analyze the dataset to identify features that are most relevant to the problem at
hand. This may involve domain knowledge and exploratory data analysis.
2. Extract Features:
o Use algorithms and statistical methods to extract meaningful features from the raw
data. For example, in image processing, edge detection algorithms can extract
features representing edges in an image.
3. Select Features:
o Apply feature selection techniques to choose the most relevant features for the
machine learning model. Techniques include filter methods (e.g., correlation
coefficient), wrapper methods (e.g., recursive feature elimination), and embedded
methods (e.g., Lasso).
Applications:
1. Image Processing:
o Features such as edges, textures, and shapes are extracted from images using
techniques like SIFT (Scale-Invariant Feature Transform) or HOG (Histogram of
Oriented Gradients).
Benefits:
• Reduces dimensionality and noise, speeds up model training, and often improves model
accuracy and interpretability.
Example:
Consider a dataset containing timestamps, and you need to extract features for a machine
learning model. Using feature extraction techniques, you could create features such as the
hour of the day, day of the week, and whether the timestamp falls on a weekend or
weekday. These new features can provide valuable information for the model to improve its
performance.
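A minimal pandas sketch of the timestamp example above (the timestamps and derived column names are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-05 09:30', '2024-01-06 18:45'])})

# Derive new features from the raw timestamp
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.day_name()
df['is_weekend'] = df['timestamp'].dt.dayofweek >= 5  # Saturday=5, Sunday=6
print(df)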
b) Explain Exploratory Data Analysis (EDA) in detail
Sol: -
Exploratory Data Analysis (EDA) is the process of analyzing and summarizing the main
characteristics of a dataset using statistical and graphical methods. EDA aims to uncover
patterns, detect anomalies, and test hypotheses to gain insights into the data.
EDA is crucial in the data analysis process as it helps in understanding the data's structure,
identifying potential issues, and informing the choice of further analysis techniques and
models. It provides a foundation for building predictive models and making data-driven
decisions.
1. Steps in EDA:
o Data Cleaning: Handling missing values, outliers, and inconsistencies to ensure data
quality.
o Data Transformation: Transforming data into a suitable format for analysis, such as
normalization or standardization.
2. Techniques:
o Summary Statistics: Computing measures such as the mean, median, standard deviation,
and correlations to summarize the data numerically.
o Graphical Methods: Visualizing data using charts and plots to detect patterns, trends,
and outliers.
3. Example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
df = sns.load_dataset('iris')

# Summary Statistics
print(df.describe())

# Histogram
df['sepal_length'].hist()
plt.show()

# Scatter Plot
plt.scatter(df['sepal_length'], df['sepal_width'])
plt.xlabel('sepal_length')
plt.ylabel('sepal_width')
plt.show()

# Correlation Matrix (numeric columns only)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Matrix')
plt.show()