
Unit 1 Introduction to Data Science
Syllabus
Introduction: Definition of Data Science - Big Data and Data Science hype, and getting past the hype - Datafication - Current landscape of perspectives - Statistical Inference - Populations and samples - Statistical modeling, probability distributions, fitting a model - Overfitting.
Basics of R: Introduction, R-Environment Setup, Programming with R, Basic
Data Types.

Definition of Data Science


Data Science is the area of study that extracts, manages, manipulates, and
interprets knowledge from vast amounts of data using various scientific
methods, algorithms, and processes.

Activities or Lifecycle of Data Science (with neat diagram)



1. Understand the Business Problem
Definition: The first step involves understanding the business challenge
and identifying the objectives to be solved with data.

Key Points:

Clear problem statement.

Business goals and metrics.

Stakeholder collaboration.

Example: A company wants to reduce customer churn. The business problem is identifying the factors contributing to churn and predicting future churn.

2. Preparing the Data


Definition: Preparing the data involves data collection, cleaning, and
transformation to make it usable for analysis.

Key Points:



Data Collection: Gathering raw data from sources like databases, files,
and APIs.

Data Cleaning: Handling missing values, duplicates, and outliers.

Feature Engineering: Creating new features or transforming existing ones.

Example: Removing duplicates from customer data and imputing missing values for customer age.
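A minimal base-R sketch of this step, using a small made-up customers data frame (the column names and values are assumptions for illustration):

# hypothetical customer data containing a duplicate row and a missing age
customers <- data.frame(
  id  = c(1, 2, 2, 3),
  age = c(34, 45, 45, NA)
)
customers <- customers[!duplicated(customers), ]    # data cleaning: drop duplicate rows
customers$age[is.na(customers$age)] <- mean(customers$age, na.rm = TRUE)   # impute missing age
customers$senior <- customers$age > 60              # feature engineering: new derived feature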

3. Exploratory Data Analysis (EDA)


Definition: EDA is the process of analyzing datasets to summarize their
main characteristics, often using visual methods.

Key Points:

Summary Statistics: Mean, median, standard deviation.

Visualizations: Histograms, box plots, scatter plots.

Correlation: Identifying relationships between variables.

Example: Visualizing sales data to identify seasonal trends or correlations between features like advertising spend and sales.
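The same ideas in a short R sketch, using made-up advertising and sales figures:

# hypothetical monthly advertising spend and sales
ad_spend <- c(10, 12, 15, 14, 20, 22)
sales    <- c(100, 110, 130, 125, 160, 170)
summary(sales)           # summary statistics: mean, median, quartiles
sd(sales)                # standard deviation
hist(sales)              # histogram of sales
plot(ad_spend, sales)    # scatter plot: spend vs sales
cor(ad_spend, sales)     # correlation between the two variables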

4. Modeling the Data


Definition: Building and training machine learning models to make
predictions or classify data.

Key Points:

Model Selection: Choose the right algorithm (e.g., linear regression, decision trees).

Training: Use the training data to teach the model.

Hyperparameter Tuning: Fine-tuning model parameters to improve accuracy.

Example: Using a decision tree classifier to predict customer churn based on demographic and usage data.
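A minimal sketch of this step using the rpart package and a tiny made-up churn dataset (column names and values are assumptions):

library(rpart)    # decision-tree package; install with install.packages("rpart") if needed
# hypothetical churn data: age, monthly usage, and churn label
churn_data <- data.frame(
  age   = c(25, 40, 35, 50, 23, 60),
  usage = c(5, 20, 15, 30, 2, 40),
  churn = factor(c("yes", "no", "no", "no", "yes", "no"))
)
model <- rpart(churn ~ age + usage, data = churn_data,
               method = "class", control = rpart.control(minsplit = 2))
predict(model, newdata = churn_data, type = "class")   # predicted churn labels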

5. Evaluating the Model



Definition: Evaluate the performance of the trained model using relevant
metrics.

Key Points:

Metrics: Accuracy, precision, recall, F1-score, ROC-AUC.

Validation: Cross-validation to prevent overfitting.

Confusion Matrix: To understand false positives and negatives.

Example: Using accuracy and precision to evaluate a model that predicts whether a customer will churn.
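A short base-R sketch of these metrics, using made-up true labels and predictions:

# hypothetical true labels and model predictions for churn
actual    <- factor(c("yes", "no", "no", "yes", "no", "yes"))
predicted <- factor(c("yes", "no", "yes", "yes", "no", "no"))
cm <- table(Predicted = predicted, Actual = actual)   # confusion matrix
cm
accuracy  <- sum(diag(cm)) / sum(cm)                  # overall fraction correct
precision <- cm["yes", "yes"] / sum(cm["yes", ])      # of predicted "yes", fraction actually "yes"
c(accuracy = accuracy, precision = precision)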

6. Deploying the Model


Definition: Deploying the model into a production environment where it can
be used by users or other systems.

Key Points:

Integration: Incorporating the model into business processes or applications.

Monitoring: Tracking model performance in real-time.

Updates: Periodically retraining the model with new data.

Example: A recommendation engine deployed on an e-commerce website, suggesting products based on user behavior.

Applications of Data Science


1. Healthcare

2. Finance

3. E-Commerce

4. Retail

5. Marketing

6. Transportation

7. Media and Entertainment

8. Education



9. Cybersecurity

10. Energy

Limitations of Data Science


Mastering Data Science is nearly impossible: Being a mixture of many fields, Data Science stems from Statistics, Computer Science, and Mathematics. It is hardly possible to master each field and be equally expert in all of them.

Large Amount of Domain Knowledge Required: Another disadvantage of Data Science is its dependency on domain knowledge. Even a person with a considerable background in Statistics and Computer Science will find it difficult to solve a Data Science problem without knowledge of the problem's domain.

Arbitrary Data May Yield Unexpected Results: A Data Scientist analyzes the data and makes careful predictions in order to facilitate the decision-making process. Many times, the data provided is arbitrary and does not yield the expected results.

Problem of Data Privacy: For many industries, data is their fuel. Data
Scientists help companies make data-driven decisions. However, the data
utilized in the process may breach the privacy of customers.

What is Data Science used for?


Data science is used to study data in four main ways:

1. Descriptive analysis: Descriptive analysis examines data to gain insights into what happened or what is happening in the data environment.

2. Diagnostic analysis: Diagnostic analysis is a deep-dive or detailed data examination to understand why something happened.

3. Predictive analysis: Predictive analysis uses historical data to make accurate forecasts about data patterns that may occur in the future.

4. Prescriptive analysis: Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to happen but also suggests an optimum response to that outcome. It can analyze the potential implications of different choices and recommend the best course of action.



Big Data
Big data refers to extremely large and diverse collections of structured,
unstructured, and semi-structured data that continues to grow exponentially
over time.

Characteristics
Big Data is commonly described by five characteristics (the 5 Vs):

1. Volume: Big Data involves large datasets that are too complex for traditional
data processing tools to handle. These datasets can range from terabytes,
petabytes to Zettabytes of information.

2. Variety: The data comes in multiple forms, including structured data (like
databases),unstructured data (like text, images, and videos) and semi-
structured data (like XML files).

3. Velocity: Velocity refers to the speed at which data is generated. Today, data is often produced in real time or near real time, and it must therefore be processed, accessed, and analyzed at the same rate.

4. Veracity: Veracity refers to the degree of accuracy in data sets and how
trustworthy they are.

5. Value: Not all the data that's collected has real business value or benefits. It's essential to determine the business value of the data you collect.

Types of Big Data


Structured Data
Any data that can be stored, accessed, and processed in a fixed format is termed structured data (e.g., database tables).

Unstructured Data
Data with no predefined format, often a heterogeneous mix of plain text files, images, videos, etc. (e.g., the output of a Google search).

Semi-Structured Data
Semi-structured data can contain both forms of data: it appears structured in form but does not conform to a rigid, fixed schema (e.g., XML files).



Big Data vs Data Science

Datafication
Datafication is the process of transforming every aspect of a business into quantifiable data that can be tracked, monitored, and analyzed.

Current Landscape of Perspectives



Data science is not merely Statistics, hacking skills, or Mathematics.

Data science includes

Statistics (traditional mathematical analysis)

Data processing (parsing, scraping, and formatting data)

Visualization (graphs, tools, etc.)

Statistical Inference
Statistical inference, also called inferential statistics, is a branch of Statistics that uses a random sample drawn from a population to make inferences about the whole population using statistical techniques.
Statistical inference is carried out using two kinds of information: processes and data.

Process
The activities or functions that happen in and around the world are called processes.



Data
The records or traces of real-world processes are called data.

Types of Statistical Inference


1. Hypothesis Testing
It’s like testing a claim or idea about a population using a sample.

Example: If someone claims the average height of students is 5.5 feet, we can use sample data to confirm or reject this claim.
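A minimal sketch in R, using a made-up sample of heights:

# hypothetical sample of student heights (in feet)
heights <- c(5.4, 5.6, 5.5, 5.7, 5.3, 5.6, 5.5, 5.8)
t.test(heights, mu = 5.5)   # one-sample t-test of the claim that the mean height is 5.5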

2. Confidence Intervals
These give a range where the true value of a population parameter (like an
average) is likely to fall.

Example: "We are 95% confident that the average score of students is
between 70 and 80."
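With made-up scores, for instance, a 95% confidence interval for the mean can be read off a t-test:

# hypothetical student scores
scores <- c(72, 75, 78, 74, 80, 76, 73, 77)
t.test(scores, conf.level = 0.95)$conf.int   # 95% confidence interval for the average score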

3. Regression Analysis
A method to explore the relationship between one dependent variable (e.g.,
sales) and one or more independent variables (e.g., advertising budget).

Example: Predicting house prices based on factors like size and location.
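A small sketch with assumed house sizes and prices:

# hypothetical house sizes (square feet) and prices (in thousands)
size  <- c(1000, 1500, 1800, 2200, 2600)
price <- c(150, 210, 250, 300, 360)
fit <- lm(price ~ size)                            # simple linear regression
summary(fit)
predict(fit, newdata = data.frame(size = 2000))    # predicted price for a 2000 sq ft house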

4. Analysis of Variance (ANOVA)


Compares averages of two or more groups to check if they are significantly
different.

Example: Determining if students from different schools have different average test scores.
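A minimal sketch with made-up scores from three schools:

# hypothetical test scores from three schools
scores <- c(70, 72, 68, 80, 82, 79, 90, 88, 91)
school <- factor(rep(c("A", "B", "C"), each = 3))
summary(aov(scores ~ school))   # one-way ANOVA comparing the group means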

5. Chi-Square Tests
Used to find out if there’s a relationship between two categorical variables.

Example: Testing if age group affects people’s choice of a smartphone brand.
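A short sketch with an assumed table of counts:

# hypothetical counts: age group vs preferred smartphone brand
counts <- matrix(c(30, 20, 25, 35), nrow = 2,
                 dimnames = list(AgeGroup = c("young", "older"),
                                 Brand = c("BrandX", "BrandY")))
chisq.test(counts)   # test of independence between age group and brand choice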



Population
It refers to the entire group of individuals about whom you wish to draw
conclusions

Sample
It refers to the subset of the population from which you’ll be collecting data to draw conclusions and make inferences about the population.
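A one-line R sketch of drawing a simple random sample (population values assumed):

# simple random sample of 10 individuals from a population of 1000
population <- 1:1000
sample(population, size = 10)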

Types of Sampling Methods


https://www.scribbr.com/methodology/sampling-methods/

Statistical Modelling
It refers to the process of creating a mathematical model that describes the relationship between variables. The goal is to use this model to analyze data and to predict or explain outcomes.
In other words, it is the process of applying statistical analysis techniques to observe, analyze, and predict trends and patterns in data.
Probability distributions are the foundation of statistical models.

Process of modelling

💡 Define a problem → Collect and prepare data → Choose a model → Split the data → Train the model → Evaluate the model → Model tuning and optimization → Deploy the model

Probability Distributions
A probability distribution is a mathematical function that describes the
probability of different possible values of a variable. Probability distributions
are often depicted using graphs



Types of Probability Distributions
Probability Distribution of One Random Variable:
Describes how the values of a single random variable are distributed.

For discrete variables, we use Probability Mass Function (PMF), and for
continuous variables, we use Probability Density Function (PDF).

Probability Distribution of Multiple Random Variables:


Joint Probability: The probability of two events, A and B, occurring together.
P(A ∩ B) = P(A | B) × P(B)

Conditional Probability: The probability of A occurring given that B has occurred.
P(A | B) = P(A ∩ B) / P(B)

Independence and Exclusivity:

Independence: Two events are independent if one does not affect the other.
P(A ∩ B) = P(A) × P(B)

Exclusivity: Two events are mutually exclusive if they cannot happen together.
P(A ∩ B) = 0
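A short R sketch of these ideas, using assumed example values (coin flips and independent dice rolls):

# PMF of a discrete variable: P(exactly 3 heads in 10 fair coin flips)
dbinom(3, size = 10, prob = 0.5)
# PDF of a continuous variable: standard normal density at x = 0
dnorm(0, mean = 0, sd = 1)
# joint and conditional probability for two independent dice rolls
p_a <- 1 / 6                 # P(first die shows a six)
p_b <- 1 / 6                 # P(second die shows a six)
p_joint <- p_a * p_b         # P(A ∩ B); the product rule applies because the rolls are independent
p_cond  <- p_joint / p_b     # P(A | B) = P(A ∩ B) / P(B), which equals P(A) here
c(joint = p_joint, conditional = p_cond)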

Fitting a Model and Overfitting


1. Fitting a Model:

Involves finding the best parameters to represent the relationship between input and target variables. The goal is to minimize the error on the training data.

2. Overfitting:

Occurs when a model learns both the patterns and noise from training
data, leading to poor performance on unseen data.

Signs: High Variance and Low Bias, High Training Accuracy and Low
Test Accuracy



Prevention: Use simpler models, apply regularization, use cross-validation, use ensembles, or remove features.
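A minimal sketch of spotting overfitting with a train/test split, using simulated data (all values here are made up):

# simulate a simple linear relationship with noise
set.seed(1)
d <- data.frame(x = runif(40))
d$y <- 2 * d$x + rnorm(40, sd = 0.3)
train <- d[1:30, ]; test <- d[31:40, ]
simple  <- lm(y ~ x, data = train)             # simple model
complex <- lm(y ~ poly(x, 15), data = train)   # overly flexible model
rmse <- function(model, data) sqrt(mean((data$y - predict(model, data))^2))
c(simple_test = rmse(simple, test), complex_test = rmse(complex, test))
# the flexible model fits the training data more closely but usually has the larger test error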

What is an Operator?
Definition: An operator in R is a symbol or function that performs
operations on variables or values. Operators are used to perform
calculations, manipulate data, and perform logical comparisons.

Different Operators in R Programming


1. Arithmetic Operators:
Used for basic mathematical operations.

+ : Addition

- : Subtraction

* : Multiplication

/ : Division

%% : Modulus (remainder)

^ : Exponentiation

2. Relational Operators:
Used for comparisons between values.

== : Equal to

!= : Not equal to

< : Less than

> : Greater than

<= : Less than or equal to

>= : Greater than or equal to

3. Logical Operators:
Used for logical operations (AND, OR, NOT).

& : AND

| : OR



! : NOT

&& : AND (short-circuit; used with single logical values)

|| : OR (short-circuit; used with single logical values)

4. Assignment Operators:
Used to assign values to variables.

<- or = : Assign a value to a variable

-> : Assign a value to a variable (right-hand assignment, e.g., 5 -> x )

5. Miscellaneous Operators:

: : Sequence operator (e.g., 1:5 gives 1, 2, 3, 4, 5 )

%in% : Checks whether an element is in a vector (e.g., 3 %in% c(1, 2, 3) returns TRUE )
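A quick console sketch exercising these operators (values are arbitrary):

a <- 10; b <- 3         # assignment with <-
a * b                   # multiplication: 30
a %% b                  # modulus: 1
a ^ b                   # exponentiation: 1000
a >= b                  # relational: TRUE
(a > 5) & (b < 5)       # logical AND: TRUE
!(a == b)               # logical NOT: TRUE
42 -> answer            # right-hand assignment
1:5                     # sequence: 1 2 3 4 5
3 %in% c(1, 2, 3)       # membership: TRUE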

Data Types in R Language


R has several data types that represent different kinds of information.

1. Numeric:
Represents real numbers (e.g., 10.5, 3.14).

Can be integer or floating-point numbers.

2. Integer:
Whole numbers (e.g., 5, 100).

Declared using L suffix (e.g., 5L ).

3. Character:

Represents text or strings (e.g., "Hello" , "R Programming" ).

4. Logical:
Represents Boolean values: TRUE or FALSE .

5. Complex:
Represents complex numbers with real and imaginary parts (e.g., 2 + 3i ).

6. Raw:
Represents raw bytes (used for binary data).

7. Factor:



Used for categorical data with a fixed number of unique values (e.g.,
gender with levels male and female ).

8. List:
A collection of elements, which can be of different data types (e.g.,
numbers, strings, vectors).

9. Data Frame:
A two-dimensional table-like structure where columns can have different
data types. It’s used to store datasets in R.

10. Matrix:
A two-dimensional array where all elements must be of the same data type.
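A short sketch creating each of these types (example values assumed); class() reports the type:

num <- 10.5;          class(num)    # "numeric"
int <- 5L;            class(int)    # "integer"
chr <- "Hello";       class(chr)    # "character"
lgl <- TRUE;          class(lgl)    # "logical"
cmp <- 2 + 3i;        class(cmp)    # "complex"
rw  <- as.raw(255);   class(rw)     # "raw"
fct <- factor(c("male", "female")); levels(fct)        # factor with its levels
lst <- list(1, "a", TRUE)                              # list of mixed types
df  <- data.frame(id = 1:3, name = c("A", "B", "C"))   # data frame: columns of different types
mat <- matrix(1:6, nrow = 2)                           # 2 x 3 matrix, all elements the same type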

Steps for R Environment Setup


1. Download and Install R:

Go to the official R website: https://cran.r-project.org/.

Download the R installer for your operating system (Windows, macOS, or Linux).

Run the installer and follow the on-screen instructions to complete the
installation.

2. Download and Install RStudio (Optional but Recommended):

RStudio is an integrated development environment (IDE) that makes it easier to work with R.

Visit the RStudio website: https://www.rstudio.com/.

Download the free version of RStudio Desktop for your operating system.

Run the installer and follow the instructions to complete the installation.

3. Verify Installation:

After installing both R and RStudio, open RStudio.

In the Console window of RStudio, type the following command to check if R is working:



version

This will display the version of R installed on your system.

4. Install Required Packages:

R comes with many built-in functions, but you often need additional
packages for more advanced tasks.

To install a package, use the following command in the RStudio console:

install.packages("package_name")

For example, to install the ggplot2 package:

install.packages("ggplot2")

5. Load the Package:

After installation, you need to load the package before using it in your R
session.

Use the library() function to load a package:

library(ggplot2)

6. Test R with a Simple Script:

Test the environment by running a simple script or code.

For example, to check if everything is working properly:

print("Hello, R!")

