
Unit 1 Introduction to Data Science
Syllabus
Introduction: Definition of Data Science - Big Data and Data Science hype, and getting past the hype - Datafication - Current landscape of perspectives - Statistical Inference - Populations and samples - Statistical modeling, probability distributions, fitting a model - Overfitting.
Basics of R: Introduction, R-Environment Setup, Programming with R, Basic
Data Types.

Definition of Data Science


Data Science is the area of study that extracts, manages, manipulates, and
interprets knowledge from vast amounts of data using various scientific
methods, algorithms, and processes.

Activities or Lifecycle of Data Science (with neat diagram)



1. Understand the Business Problem
Definition: The first step involves understanding the business challenge
and identifying the objectives to be solved with data.

Key Points:

Clear problem statement.

Business goals and metrics.

Stakeholder collaboration.

Example: A company wants to reduce customer churn. The business problem is identifying the factors contributing to churn and predicting future churn.

2. Preparing the Data


Definition: Preparing the data involves data collection, cleaning, and
transformation to make it usable for analysis.

Key Points:



Data Collection: Gathering raw data from sources like databases, files,
and APIs.

Data Cleaning: Handling missing values, duplicates, and outliers.

Feature Engineering: Creating new features or transforming existing ones.

Example: Removing duplicates from customer data and imputing missing values for customer age.
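A minimal base-R sketch of this step, using a small made-up customers data frame (the column names and values are assumptions for illustration):

# hypothetical customer data containing a duplicate row and a missing age
customers <- data.frame(
  id  = c(1, 2, 2, 3),
  age = c(34, 45, 45, NA)
)
customers <- customers[!duplicated(customers), ]    # data cleaning: drop duplicate rows
customers$age[is.na(customers$age)] <- mean(customers$age, na.rm = TRUE)   # impute missing age
customers$senior <- customers$age > 60              # feature engineering: new derived feature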

3. Exploratory Data Analysis (EDA)


Definition: EDA is the process of analyzing datasets to summarize their
main characteristics, often using visual methods.

Key Points:

Summary Statistics: Mean, median, standard deviation.

Visualizations: Histograms, box plots, scatter plots.

Correlation: Identifying relationships between variables.

Example: Visualizing sales data to identify seasonal trends or correlations between features like advertising spend and sales.
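The same ideas in a short R sketch, using made-up advertising and sales figures:

# hypothetical monthly advertising spend and sales
ad_spend <- c(10, 12, 15, 14, 20, 22)
sales    <- c(100, 110, 130, 125, 160, 170)
summary(sales)           # summary statistics: mean, median, quartiles
sd(sales)                # standard deviation
hist(sales)              # histogram of sales
plot(ad_spend, sales)    # scatter plot: spend vs sales
cor(ad_spend, sales)     # correlation between the two variables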

4. Modeling the Data


Definition: Building and training machine learning models to make
predictions or classify data.

Key Points:

Model Selection: Choose the right algorithm (e.g., linear regression, decision trees).

Training: Use the training data to teach the model.

Hyperparameter Tuning: Fine-tuning model parameters to improve accuracy.

Example: Using a decision tree classifier to predict customer churn based on demographic and usage data.
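A minimal sketch of this step using the rpart package and a tiny made-up churn dataset (column names and values are assumptions):

library(rpart)    # decision-tree package; install with install.packages("rpart") if needed
# hypothetical churn data: age, monthly usage, and churn label
churn_data <- data.frame(
  age   = c(25, 40, 35, 50, 23, 60),
  usage = c(5, 20, 15, 30, 2, 40),
  churn = factor(c("yes", "no", "no", "no", "yes", "no"))
)
model <- rpart(churn ~ age + usage, data = churn_data,
               method = "class", control = rpart.control(minsplit = 2))
predict(model, newdata = churn_data, type = "class")   # predicted churn labels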

5. Evaluating the Model



Definition: Evaluate the performance of the trained model using relevant
metrics.

Key Points:

Metrics: Accuracy, precision, recall, F1-score, ROC-AUC.

Validation: Cross-validation to prevent overfitting.

Confusion Matrix: To understand false positives and negatives.

Example: Using accuracy and precision to evaluate a model that predicts whether a customer will churn.
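A short base-R sketch of these metrics, using made-up true labels and predictions:

# hypothetical true labels and model predictions for churn
actual    <- factor(c("yes", "no", "no", "yes", "no", "yes"))
predicted <- factor(c("yes", "no", "yes", "yes", "no", "no"))
cm <- table(Predicted = predicted, Actual = actual)   # confusion matrix
cm
accuracy  <- sum(diag(cm)) / sum(cm)                  # overall fraction correct
precision <- cm["yes", "yes"] / sum(cm["yes", ])      # of predicted "yes", fraction actually "yes"
c(accuracy = accuracy, precision = precision)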

6. Deploying the Model


Definition: Deploying the model into a production environment where it can
be used by users or other systems.

Key Points:

Integration: Incorporating the model into business processes or applications.

Monitoring: Tracking model performance in real-time.

Updates: Periodically retraining the model with new data.

Example: A recommendation engine deployed on an e-commerce website, suggesting products based on user behavior.

Applications of Data Science


1. Healthcare

2. Finance

3. E-Commerce

4. Retail

5. Marketing

6. Transportation

7. Media and Entertainment

8. Education



9. Cybersecurity

10. Energy

Limitations of Data Science


Mastering Data Science is nearly impossible: Being a mixture of many fields, Data Science stems from Statistics, Computer Science, and Mathematics. It is hardly possible to master each field and be equally expert in all of them.

Large Amount of Domain Knowledge Required: Another disadvantage of Data Science is its dependency on domain knowledge. Even a person with a considerable background in Statistics and Computer Science will find it difficult to solve a Data Science problem without knowledge of the problem's domain.

Arbitrary Data May Yield Unexpected Results: A Data Scientist analyzes the data and makes careful predictions in order to facilitate the decision-making process. Many times, the data provided is arbitrary and does not yield the expected results.

Problem of Data Privacy: For many industries, data is their fuel. Data
Scientists help companies make data-driven decisions. However, the data
utilized in the process may breach the privacy of customers.

What is Data Science used for?


Data science is used to study data in four main ways:

1. Descriptive analysis: Descriptive analysis examines data to gain insights into what happened or what is happening in the data environment.

2. Diagnostic analysis: Diagnostic analysis is a deep-dive or detailed data examination to understand why something happened.

3. Predictive analysis: Predictive analysis uses historical data to make accurate forecasts about data patterns that may occur in the future.

4. Prescriptive analysis: Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to happen but also suggests an optimum response to that outcome. It can analyze the potential implications of different choices and recommend the best course of action.



Big Data
Big data refers to extremely large and diverse collections of structured,
unstructured, and semi-structured data that continues to grow exponentially
over time.

Characteristics
Big Data is commonly described by five characteristics (the 5 Vs):

1. Volume: Big Data involves large datasets that are too complex for traditional
data processing tools to handle. These datasets can range from terabytes,
petabytes to Zettabytes of information.

2. Variety: The data comes in multiple forms, including structured data (like
databases),unstructured data (like text, images, and videos) and semi-
structured data (like XML files).

3. Velocity: Velocity refers to the speed at which data is generated. Today, data is often produced in real time or near real time, and it must therefore be processed, accessed, and analyzed at the same rate.

4. Veracity: Veracity refers to the degree of accuracy in data sets and how
trustworthy they are.

5. Value: Not all the data that's collected has real business value or benefits. It's essential to determine the business value of the data you collect.

Types of Big Data


Structured Data
Any data that can be stored, accessed, and processed in a fixed format is termed structured data (e.g., database tables).

Unstructured Data
Data with no predefined format, often a heterogeneous mix of plain text files, images, videos, etc. (e.g., the output of a Google search).

Semi-Structured Data
Semi-structured data can contain both forms of data: it appears structured in form but does not conform to a rigid, fixed schema (e.g., XML files).



Big Data vs Data Science

Datafication
Datafication is the process of transforming every aspect of a business into quantifiable data that can be tracked, monitored, and analyzed.

Current Landscape of Perspectives



Data science is not merely Statistics, hacking skills, or Mathematics.

Data science includes

Statistics (traditional mathematical analysis)

Data processing (parsing, scraping, and formatting data)

Visualization (graphs, tools, etc.)

Statistical Inference
Statistical inference, also called inferential statistics, is a branch of Statistics that uses a random sample drawn from a population to make inferences about the whole population using statistical techniques.
Statistical inference is carried out using two kinds of information: processes and data.

Process
The activities or functions that happen in and around the world are called processes.



Data
The records or traces of real-world processes are called data.

Types of Statistical Inference


1. Hypothesis Testing
It’s like testing a claim or idea about a population using a sample.

Example: If someone claims the average height of students is 5.5 feet, we can use sample data to confirm or reject this claim.
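A minimal sketch in R, using a made-up sample of heights:

# hypothetical sample of student heights (in feet)
heights <- c(5.4, 5.6, 5.5, 5.7, 5.3, 5.6, 5.5, 5.8)
t.test(heights, mu = 5.5)   # one-sample t-test of the claim that the mean height is 5.5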

2. Confidence Intervals
These give a range where the true value of a population parameter (like an
average) is likely to fall.

Example: "We are 95% confident that the average score of students is
between 70 and 80."
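With made-up scores, for instance, a 95% confidence interval for the mean can be read off a t-test:

# hypothetical student scores
scores <- c(72, 75, 78, 74, 80, 76, 73, 77)
t.test(scores, conf.level = 0.95)$conf.int   # 95% confidence interval for the average score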

3. Regression Analysis
A method to explore the relationship between one dependent variable (e.g.,
sales) and one or more independent variables (e.g., advertising budget).

Example: Predicting house prices based on factors like size and location.
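A small sketch with assumed house sizes and prices:

# hypothetical house sizes (square feet) and prices (in thousands)
size  <- c(1000, 1500, 1800, 2200, 2600)
price <- c(150, 210, 250, 300, 360)
fit <- lm(price ~ size)                            # simple linear regression
summary(fit)
predict(fit, newdata = data.frame(size = 2000))    # predicted price for a 2000 sq ft house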

4. Analysis of Variance (ANOVA)


Compares averages of two or more groups to check if they are significantly
different.

Example: Determining if students from different schools have different average test scores.
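A minimal sketch with made-up scores from three schools:

# hypothetical test scores from three schools
scores <- c(70, 72, 68, 80, 82, 79, 90, 88, 91)
school <- factor(rep(c("A", "B", "C"), each = 3))
summary(aov(scores ~ school))   # one-way ANOVA comparing the group means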

5. Chi-Square Tests
Used to find out if there’s a relationship between two categorical variables.

Example: Testing if age group affects people’s choice of a smartphone brand.
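A short sketch with an assumed table of counts:

# hypothetical counts: age group vs preferred smartphone brand
counts <- matrix(c(30, 20, 25, 35), nrow = 2,
                 dimnames = list(AgeGroup = c("young", "older"),
                                 Brand = c("BrandX", "BrandY")))
chisq.test(counts)   # test of independence between age group and brand choice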



Population
It refers to the entire group of individuals about whom you wish to draw
conclusions

Sample
It refers to the subset of the population from which you’ll be collecting data to draw conclusions and make inferences about the population.
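A one-line R sketch of drawing a simple random sample (population values assumed):

# simple random sample of 10 individuals from a population of 1000
population <- 1:1000
sample(population, size = 10)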

Types of Sampling Methods


https://www.scribbr.com/methodology/sampling-methods/

Statistical Modelling
It refers to the process of creating a mathematical model that describes the relationship between variables. The goal is to use this model to analyze data and to predict or explain outcomes.
In other words, it is the process of applying statistical analysis techniques to observe, analyze, and predict trends and patterns in data.
Probability distributions are the foundation of statistical models.

Process of modelling

💡 Define a problem → Collect and prepare data → Choose a model → Split the data → Train the model → Evaluate the model → Model tuning and optimization → Deploy the model

Probability Distributions
A probability distribution is a mathematical function that describes the
probability of different possible values of a variable. Probability distributions
are often depicted using graphs



Types of Probability Distributions
Probability Distribution of One Random Variable:
Describes how the values of a single random variable are distributed.

For discrete variables, we use Probability Mass Function (PMF), and for
continuous variables, we use Probability Density Function (PDF).

Probability Distribution of Multiple Random Variables:


Joint Probability: The probability of two events, A and B, occurring together.
P(A ∩ B) = P(A | B) × P(B)

Conditional Probability: The probability of A occurring given that B has occurred.
P(A | B) = P(A ∩ B) / P(B)

Independence and Exclusivity:

Independence: Two events are independent if one does not affect the other.
P(A ∩ B) = P(A) × P(B)

Exclusivity: Two events are mutually exclusive if they cannot happen together.
P(A ∩ B) = 0
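A short R sketch of these ideas, using assumed example values (coin flips and independent dice rolls):

# PMF of a discrete variable: P(exactly 3 heads in 10 fair coin flips)
dbinom(3, size = 10, prob = 0.5)
# PDF of a continuous variable: standard normal density at x = 0
dnorm(0, mean = 0, sd = 1)
# joint and conditional probability for two independent dice rolls
p_a <- 1 / 6                 # P(first die shows a six)
p_b <- 1 / 6                 # P(second die shows a six)
p_joint <- p_a * p_b         # P(A ∩ B); the product rule applies because the rolls are independent
p_cond  <- p_joint / p_b     # P(A | B) = P(A ∩ B) / P(B), which equals P(A) here
c(joint = p_joint, conditional = p_cond)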

Fitting a Model and Overfitting


1. Fitting a Model:

Involves finding the best parameters to represent the relationship between input and target variables. The goal is to minimize the error on the training data.

2. Overfitting:

Occurs when a model learns both the patterns and noise from training
data, leading to poor performance on unseen data.

Signs: High Variance and Low Bias, High Training Accuracy and Low
Test Accuracy



Prevention: Use simpler models, apply regularization, use cross-validation, use ensembles, or remove features.
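A minimal sketch of spotting overfitting with a train/test split, using simulated data (all values here are made up):

# simulate a simple linear relationship with noise
set.seed(1)
d <- data.frame(x = runif(40))
d$y <- 2 * d$x + rnorm(40, sd = 0.3)
train <- d[1:30, ]; test <- d[31:40, ]
simple  <- lm(y ~ x, data = train)             # simple model
complex <- lm(y ~ poly(x, 15), data = train)   # overly flexible model
rmse <- function(model, data) sqrt(mean((data$y - predict(model, data))^2))
c(simple_test = rmse(simple, test), complex_test = rmse(complex, test))
# the flexible model fits the training data more closely but usually has the larger test error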

What is an Operator?
Definition: An operator in R is a symbol or function that performs
operations on variables or values. Operators are used to perform
calculations, manipulate data, and perform logical comparisons.

Different Operators in R Programming


1. Arithmetic Operators:
Used for basic mathematical operations.

+ : Addition

- : Subtraction

* : Multiplication

/ : Division

%% : Modulus (remainder)

^ : Exponentiation

2. Relational Operators:
Used for comparisons between values.

== : Equal to

!= : Not equal to

< : Less than

> : Greater than

<= : Less than or equal to

>= : Greater than or equal to

3. Logical Operators:
Used for logical operations (AND, OR, NOT).

& : AND

| : OR



! : NOT

&& : AND (short-circuit; used with single logical values)

|| : OR (short-circuit; used with single logical values)

4. Assignment Operators:
Used to assign values to variables.

<- or = : Assign a value to a variable

-> : Assign a value to a variable (right-hand assignment, e.g., 5 -> x )

5. Miscellaneous Operators:

: : Sequence operator (e.g., 1:5 gives 1, 2, 3, 4, 5 )

%in% : Checks whether an element is in a vector (e.g., 3 %in% c(1, 2, 3) returns TRUE )
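A quick console sketch exercising these operators (values are arbitrary):

a <- 10; b <- 3         # assignment with <-
a * b                   # multiplication: 30
a %% b                  # modulus: 1
a ^ b                   # exponentiation: 1000
a >= b                  # relational: TRUE
(a > 5) & (b < 5)       # logical AND: TRUE
!(a == b)               # logical NOT: TRUE
42 -> answer            # right-hand assignment
1:5                     # sequence: 1 2 3 4 5
3 %in% c(1, 2, 3)       # membership: TRUE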

Data Types in R Language


R has several data types that represent different kinds of information.

1. Numeric:
Represents real numbers (e.g., 10.5, 3.14).

Can be integer or floating-point numbers.

2. Integer:
Whole numbers (e.g., 5, 100).

Declared using L suffix (e.g., 5L ).

3. Character:

Represents text or strings (e.g., "Hello" , "R Programming" ).

4. Logical:
Represents Boolean values: TRUE or FALSE .

5. Complex:
Represents complex numbers with real and imaginary parts (e.g., 2 + 3i ).

6. Raw:
Represents raw bytes (used for binary data).

7. Factor:



Used for categorical data with a fixed number of unique values (e.g.,
gender with levels male and female ).

8. List:
A collection of elements, which can be of different data types (e.g.,
numbers, strings, vectors).

9. Data Frame:
A two-dimensional table-like structure where columns can have different
data types. It’s used to store datasets in R.

10. Matrix:
A two-dimensional array where all elements must be of the same data type.
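A short sketch creating each of these types (example values assumed); class() reports the type:

num <- 10.5;          class(num)    # "numeric"
int <- 5L;            class(int)    # "integer"
chr <- "Hello";       class(chr)    # "character"
lgl <- TRUE;          class(lgl)    # "logical"
cmp <- 2 + 3i;        class(cmp)    # "complex"
rw  <- as.raw(255);   class(rw)     # "raw"
fct <- factor(c("male", "female")); levels(fct)        # factor with its levels
lst <- list(1, "a", TRUE)                              # list of mixed types
df  <- data.frame(id = 1:3, name = c("A", "B", "C"))   # data frame: columns of different types
mat <- matrix(1:6, nrow = 2)                           # 2 x 3 matrix, all elements the same type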

Steps for R Environment Setup


1. Download and Install R:

Go to the official R website: https://cran.r-project.org/.

Download the R installer for your operating system (Windows, macOS, or Linux).

Run the installer and follow the on-screen instructions to complete the
installation.

2. Download and Install RStudio (Optional but Recommended):

RStudio is an integrated development environment (IDE) that makes it easier to work with R.

Visit the RStudio website: https://www.rstudio.com/.

Download the free version of RStudio Desktop for your operating system.

Run the installer and follow the instructions to complete the installation.

3. Verify Installation:

After installing both R and RStudio, open RStudio.

In the Console window of RStudio, type the following command to check if R is working:



version

This will display the version of R installed on your system.

4. Install Required Packages:

R comes with many built-in functions, but you often need additional
packages for more advanced tasks.

To install a package, use the following command in the RStudio console:

install.packages("package_name")

For example, to install the ggplot2 package:

install.packages("ggplot2")

5. Load the Package:

After installation, you need to load the package before using it in your R
session.

Use the library() function to load a package:

library(ggplot2)

6. Test R with a Simple Script:

Test the environment by running a simple script or code.

For example, to check if everything is working properly:

print("Hello, R!")

