IDS-UNIT-1-FINAL
Essentially, data science is about using scientific methods to unlock the potential of
data, uncover patterns, make predictions, and drive informed decision-making across
various domains and industries.
The term “data science” combines two key elements: “data” and “science.”
1. Data: It refers to the raw information that is collected, stored, and processed. In
today's digital age, enormous amounts of data are generated from various sources
such as sensors, social media, transactions, and more. This data can come in
structured formats (e.g., databases) or unstructured formats (e.g., text, images,
videos).
2. Science: It refers to the systematic methods, statistical, computational, and analytical,
that are applied to this data in order to extract knowledge and insights from it.
Data Science has a wide array of applications across various industries, significantly
impacting the way businesses operate and how services are delivered. Here are some
key applications of Data Science:
1. Healthcare:
Predictive Analytics: Predicting disease outbreaks, patient readmissions, and
individual health risks.
Medical Imaging: Enhancing image recognition to diagnose conditions from X-rays,
MRIs, and CT scans.
Personalized Medicine: Tailoring treatment plans based on genetic information and
patient history.
2. Finance:
Risk Management: Identifying and mitigating financial risks through predictive
modeling.
Fraud Detection: Analyzing transactions to detect fraudulent activities.
Algorithmic Trading: Using data-driven algorithms to execute high-frequency
trading strategies.
3. Marketing:
Customer Segmentation: Grouping customers based on purchasing behavior and
preferences for targeted marketing.
Sentiment Analysis: Analyzing customer feedback and social media interactions to
gauge public sentiment.
Predictive Analytics: Forecasting sales trends and customer lifetime value.
4. Retail:
Inventory Management: Optimizing stock levels based on demand forecasting.
Recommendation Systems: Providing personalized product recommendations to
customers.
Price Optimization: Adjusting prices dynamically based on market trends and
consumer behavior.
5. Transportation:
Route Optimization: Enhancing logistics by determining the most efficient routes.
Predictive Maintenance: Forecasting equipment failures to schedule timely
maintenance.
Autonomous Vehicles: Developing self-driving cars using machine learning
algorithms.
6. Education:
Learning Analytics: Tracking student performance to identify learners who need extra support.
Adaptive Learning: Personalizing course content and pacing based on individual progress.
Dropout Prediction: Identifying students at risk of dropping out so institutions can intervene early.
Big Data
Big Data refers to the vast volumes of data generated at high velocity from a variety of
sources. This data is characterized by the three V’s: Volume, Velocity, and Variety.
1. Volume: Big Data involves large datasets that are too complex for traditional data
processing tools to handle. These datasets can range from terabytes to petabytes of
information.
2. Velocity: Big Data is generated in real-time or near real-time, requiring fast processing
to extract meaningful insights.
3. Variety: The data comes in multiple forms, including structured data (like databases),
semi-structured data (like XML files), and unstructured data (like text, images, and
videos).
Big Data’s primary role is to collect and store this massive amount of information
efficiently. Technologies such as Hadoop, Apache Spark, and NoSQL databases like
MongoDB are commonly used to manage and process Big Data.
Data Science is an interdisciplinary field that utilizes scientific methods, algorithms, and
systems to extract knowledge and insights from structured and unstructured data.
It encompasses a variety of techniques from statistics, machine learning, data
mining, and big data analytics.
Data scientists typically work with data in three ways:
1. Analyze: They examine complex datasets to identify patterns, trends, and correlations.
2. Model: Using statistical models and machine learning algorithms, they create
predictive models that can forecast future trends or behaviors.
3. Interpret: They translate data findings into actionable business strategies and
decisions.
Data Science involves a broad skill set, including proficiency in programming languages
like Python and R, knowledge of databases, and expertise in machine learning
frameworks such as TensorFlow and Scikit-Learn.
While Big Data and Data Science are interrelated, they serve different purposes and
require different skill sets.
Objective of Big Data: efficient storage, processing, and management of data.
Objective of Data Science: analyzing data to inform decisions and predict trends.
Despite their differences, Big Data and Data Science are complementary fields.
Big Data provides the foundation by collecting and storing vast amounts of information.
Without this foundational layer, Data Science would lack the raw material needed for
analysis.
Conversely, Data Science adds value to Big Data by analyzing and interpreting the data.
The insights derived from Data Science can help businesses leverage Big Data more
effectively, uncovering trends and patterns that can inform strategic decisions.
For instance, in the healthcare sector, Big Data technologies can aggregate patient data
from various sources, including electronic health records, wearable devices, and genomic
databases. Data Science can then analyze this data to predict disease outbreaks,
personalize treatment plans, and improve patient outcomes.
Big Data focuses on managing and processing large datasets, whereas Data Science
aims to analyze this data and derive actionable insights.
There is a well-known prediction, attributed to the research firm Gartner, that 85% of all
productionized machine learning models will fail.
A failed data science investment can impose massive costs on an organization,
especially if an incorrect business decision is made based on the output of a predictive
algorithm.
Machine learning is not a one-size-fits-all solution to every problem, a point that
stakeholders and non-technical managers need to grasp. Not every problem can be solved with
machine learning, and not every problem should be.
As a data scientist, therefore, do not start building a predictive algorithm until you are sure
that machine learning modelling is your best option.
Datafication:
Datafication is the transformation of social action into online quantified data, thus
allowing for real-time tracking and predictive analysis.
Simply put, it is about taking a previously invisible process or activity and turning it into
data that can be monitored, tracked, analysed, and optimized.
Insurance: Data used to update risk profile development and business models.
Banking: Data used to establish trustworthiness and likelihood of a person paying back
a loan.
Human resources: Data used to identify, for example, employees' risk-taking profiles.
Hiring and recruitment: Data used to replace personality tests.
Social science research: Datafication replaces sampling techniques and restructures
the manner in which social science research is performed.
This information can then be tracked, processed, monitored, analyzed, and utilized to
improve an organization and the products and services it offers to customers. The benefits
below put this into perspective.
Benefits of Datafication
Datafication is financially advantageous to pursue because it creates significant
opportunities for streamlining business processes, and it provides a foundation for building
systems that are both secure and innovative.
1. Actionable Insights
Datafication converts unstructured, hard-to-interpret data into usable insights, giving you
visibility into your processes and procedures, the basis of any organization.
It means you will be better able to understand the strengths, weaknesses, potential,
and future of your business. Additionally, it gives you knowledge about the results and
effects of your efforts, allowing you to evaluate what you're doing and how you're doing it.
2. Digital Transformation
You need usable data if you want to benefit from the latest and most advanced
technologies. It holds the key to improving the effectiveness and efficiency of business
operations, and it helps you understand the organization's current situation and the next
actions needed to advance.
3. Manage Information
Any business generates a lot of data, which is constantly collected and stored. A well-
managed data set will produce superior outcomes. Otherwise, it could become
overwhelming or useless information.
By correctly organising it through datafication, you can use the information to make
judgements. You will be able to access and analyse the data in addition to storing it.
How is Datafication Performed?
1. Data Collection:
It all starts with data collection and retrieval. This could be from anything
you do—clicking on a website, using an app, or even just walking around
with your smartphone. Devices and sensors collect this data, often without
you even noticing.
2. Data Storage:
Once collected, this data needs a place to go. Think of it like storing your
favorite movies or music. The data is saved in databases or cloud storage,
where it can be accessed and used later.
3. Data Processing:
Here’s where things get interesting. The raw data collected isn’t very useful
on its own. It’s like having all the ingredients for a cake but not baking it yet.
Data processing involves data cleaning, organizing, and transforming this
data into a more usable format. For example, if you’ve ever used a
spreadsheet to track your expenses, you’ve engaged in a basic form of
data processing.
4. Data Analysis:
This is the magic moment when data becomes valuable information. Using
various tools and techniques, analysts can look at patterns and trends in
the data. For example, they might discover that people tend to buy more
ice cream on hot days—a useful insight for a business.
5. Data Visualization:
To make the data easy to understand, it’s often presented visually, like in
charts or graphs. If you’ve ever seen a bar chart showing monthly sales or
a line graph tracking your steps over time, you’ve encountered data
visualization. This step helps people quickly grasp the insights hidden in
the data.
6. Data Application:
Finally, the insights gained from data analysis are put to use. This could
mean anything from tweaking a marketing strategy to designing a new
product. For example, if data shows that customers prefer shopping online
at certain times of the day, a business might run targeted ads during those
hours.
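These steps can be tied together in a small, illustrative R sketch. The dataset below (daily
temperatures and ice-cream sales) is made up purely for illustration, echoing the example in
step 4:

# Toy datafication pipeline in R (illustrative, made-up data)

# Steps 1-2. Collection and storage: a small "collected" dataset kept in a data frame
sales_log <- data.frame(
  temperature = c(18, 22, 25, 30, 31, 35, 20, NA),   # degrees C
  ice_creams  = c(40, 55, 60, 90, 95, 120, 48, 70)   # units sold
)

# Step 3. Processing: clean the raw data (drop the record with a missing temperature)
clean_log <- na.omit(sales_log)

# Step 4. Analysis: quantify the pattern "more ice cream on hot days"
model <- lm(ice_creams ~ temperature, data = clean_log)
summary(model)$coefficients

# Step 5. Visualization: present the insight as a simple chart
plot(clean_log$temperature, clean_log$ice_creams,
     xlab = "Temperature (deg C)", ylab = "Ice creams sold",
     main = "Sales rise on hot days")
abline(model)

# Step 6. Application: e.g. flag hot days on which extra stock should be ordered
hot_days <- clean_log$temperature > 28
clean_log[hot_days, ]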
Statistical Inference
Statistical inference is the process of using data analysis to infer properties of an
underlying distribution of a population. It is a branch of statistics that deals with making
inferences about a population based on data from a sample.
Statistical inference is based on probability theory and probability distributions. It involves
making assumptions about the population and the sample, and using statistical models to
analyze the data.
It involves using statistical methods to analyze sample data and make inferences or
predictions about parameters or characteristics of the entire population from which the
sample was drawn.
Parameter Estimation
Parameter estimation is another primary goal of statistical inference. Parameters are
numerical characteristics or properties of the population you are studying that can be
inferred from sample data; examples include the population mean and the population variance.
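For instance, a minimal R sketch (using simulated data) that estimates the population mean and
variance from a sample, along with a 95% confidence interval for the mean:

set.seed(42)
sample_data <- rnorm(100, mean = 50, sd = 10)   # a sample of n = 100 observations

mean(sample_data)              # point estimate of the population mean
var(sample_data)               # point estimate of the population variance
t.test(sample_data)$conf.int   # 95% confidence interval for the mean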
Hypothesis Testing
Hypothesis testing uses sample data to decide whether there is enough evidence to reject a
claim (a hypothesis) about a population parameter.
There are various methods of statistical inference; some of these methods are:
Parametric Methods
Non-parametric Methods
Bayesian Methods
Parametric Methods
In this scenario, parametric statistical methods assume that the data are drawn from
a population characterized by a known probability distribution, most commonly the normal
distribution, which allows one to make inferences about the population in question.
For example, t-tests and ANOVA are parametric tests that give accurate results under the
assumption that the data are approximately normally distributed.
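As a small sketch in R (on simulated data), a two-sample t-test and a one-way ANOVA can be run
as follows:

set.seed(1)
group_a <- rnorm(30, mean = 100, sd = 15)   # simulated scores for group A
group_b <- rnorm(30, mean = 108, sd = 15)   # simulated scores for group B

t.test(group_a, group_b)                    # parametric test comparing the two means

# ANOVA for more than two groups (also parametric)
scores <- c(group_a, group_b, rnorm(30, mean = 95, sd = 15))
groups <- factor(rep(c("A", "B", "C"), each = 30))
summary(aov(scores ~ groups))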
Non-Parametric Methods
These are less assumptive and more flexible analysis methods for dealing with data that do
not follow a normal distribution. They are also used when one is uncertain about meeting
the assumptions of parametric methods, or when the available data are limited.
Some of the non-parametric tests include Wilcoxon signed-rank test and Kruskal-Wallis
test among others.
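A corresponding sketch of these non-parametric tests in R, again on simulated (skewed) data
where normality cannot be assumed:

set.seed(2)
before <- rexp(25, rate = 1)           # skewed (non-normal) paired measurements
after  <- before + rexp(25, rate = 2)

wilcox.test(before, after, paired = TRUE)   # Wilcoxon signed-rank test

# Kruskal-Wallis test for comparing three independent groups
values <- c(rexp(20, 1), rexp(20, 0.8), rexp(20, 1.2))
labels <- factor(rep(c("G1", "G2", "G3"), each = 20))
kruskal.test(values ~ labels)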
Bayesian Methods
Bayesian methods combine a prior belief about a parameter with observed data, using Bayes'
theorem, to produce an updated (posterior) estimate of that parameter.
Example: consider a situation where a doctor is investigating a new treatment and has
the prior belief about the success rate of the treatment. Upon conducting a new clinical
trial, the doctor uses Bayesian method to update his “prior belief” with the data from the
new trials to estimate the true success rate of the treatment.
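The doctor's update can be sketched in R with a Beta-Binomial model; the prior parameters and
trial numbers below are made-up values, chosen only to show how the prior belief is combined
with new data:

# Prior belief about the success rate: Beta(8, 2), i.e. roughly 80% expected success
prior_a <- 8
prior_b <- 2

# New clinical trial (hypothetical numbers): 14 successes out of 20 patients
successes <- 14
failures  <- 6

# The posterior is again a Beta distribution (conjugate update)
post_a <- prior_a + successes
post_b <- prior_b + failures

post_a / (post_a + post_b)               # posterior mean of the success rate
qbeta(c(0.025, 0.975), post_a, post_b)   # 95% credible interval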
Population Vs Sample
The population refers to the entire group of individuals or items that we are interested in
studying and drawing conclusions about. It can be a group of individuals or a set of items.
The population size is usually denoted by N.
A sample is a smaller subset of the population that is actually selected and measured; the
sample size is denoted by n.
Populations are used when your research question requires, or when you have
access to, data from every member of the population. Usually, it is only
straightforward to collect data from a whole population when it is small, accessible
and cooperative.
Example:
A marketing manager for a small local bakery wants to understand customer preferences
for different types of bread sold at their shop. Since they are solely interested in analyzing
the preferences of customers who visit their bakery, they decide to collect data on bread
preferences from every customer who makes a purchase over the course of a month. By
using the entire dataset of bread purchases, including preferences indicated by
customers, they aim to identify trends and patterns in bread choices specifically among
their bakery’s clientele.
Samples are used when the population is too large, too dispersed, or too costly to study in full.
Example:
Suppose you are conducting research on smartphone usage habits among teenagers in a
specific city. Your population comprises all teenagers aged 13-18 living in that city, which
could number in the tens of thousands. Due to logistical constraints and the difficulty of
reaching every teenager in the city, you opt to use a sample of 500 teenagers randomly
selected from different schools within the city. This sample will participate in interviews or
surveys to provide insights into their smartphone usage patterns, preferences, and
behaviors.
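The distinction can be sketched in R with simulated data: the full (simulated) population gives
the true parameter, and a random sample of 500 gives the estimate, roughly mirroring the
smartphone example above:

set.seed(3)
N <- 40000                                            # simulated population size
population_hours <- rnorm(N, mean = 4.2, sd = 1.5)    # daily smartphone hours (made up)

mean(population_hours)       # the population parameter (rarely observable in practice)

n <- 500
sample_hours <- sample(population_hours, size = n)    # simple random sample
mean(sample_hours)           # the sample statistic used to estimate the parameter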
Statistical Modeling
Statistical modeling is the process of applying statistical analysis to data in order to
describe how the data were generated. Unlike deterministic models, where variables have
specific fixed values, variables in statistical models are stochastic, reflecting random
characteristics of the true data-generating process. A statistical model therefore embodies a
set of assumptions about the data; simply put, these assumptions make it easy to calculate the
probability of an event.
For example, assume that we have two fair dice, so each face has an equal probability of
showing up, i.e. 1/6. Under this assumption, we can calculate the probability that both dice
show a 5 as 1/6 × 1/6 = 1/36.
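This can be checked by simulation in R, a small sketch that rolls two fair dice many times and
compares the observed proportion with the theoretical 1/36 (about 0.0278):

set.seed(4)
rolls <- 100000
die1 <- sample(1:6, rolls, replace = TRUE)
die2 <- sample(1:6, rolls, replace = TRUE)

mean(die1 == 5 & die2 == 5)   # should be close to 1/36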
Before looking at overfitting and underfitting, a few key terms:
o Signal: It refers to the true underlying pattern of the data that helps
the machine learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the
performance of the model.
o Bias: Bias is a prediction error introduced in the model by
oversimplifying the machine learning algorithm; it is the difference
between the predicted values and the actual values.
o Variance: Variance occurs when the machine learning model performs well
with the training dataset but does not perform well with the test dataset.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the
data points, or more data points than necessary, in the given dataset.
Because of this, the model starts capturing the noise and inaccurate
values present in the dataset, and all these factors reduce the efficiency and
accuracy of the model. The overfitted model has low bias and high
variance.
In a scatter plot, an overfitted regression curve passes through nearly every data point.
It may look efficient, but in reality it is not: the goal of a regression model is to find
the best-fit line, and a curve that chases every individual point has not found it, so it
will generate prediction errors on new data.
How to avoid Overfitting in a Model
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
Underfitting
Underfitting occurs when our machine learning model is not able to
capture the underlying trend of the data. To avoid overfitting, the
feeding of training data can be stopped at an early stage, but then
the model may not learn enough from the training data. As a
result, it may fail to find the best fit for the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from
the training data, and hence it reduces the accuracy and produces
unreliable predictions.
There are two other methods by which we can find a good balance point for our
model: using a resampling method to estimate model accuracy, and holding back a
validation dataset.
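These ideas can be illustrated with a small R sketch on simulated data: a straight line
underfits a curved relationship, a very high-degree polynomial overfits the training noise,
and a held-out test set exposes the difference:

set.seed(5)
n <- 60
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)        # noisy non-linear relationship

train <- dat[1:40, ]
test  <- dat[41:60, ]

fit_under <- lm(y ~ x,           data = train)  # degree 1: underfits the curve
fit_good  <- lm(y ~ poly(x, 5),  data = train)  # moderate flexibility
fit_over  <- lm(y ~ poly(x, 20), data = train)  # chases the training noise: overfits

test_mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2)

sapply(list(underfit = fit_under, good = fit_good, overfit = fit_over), test_mse)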
What is Probability Distribution?
A probability distribution of a random variable is a list of all possible outcomes with their
corresponding probability values.
If a random variable is a discrete variable, its probability distribution is called a discrete
probability distribution.
Example: flipping two coins.
A function that represents a discrete probability distribution is known as a Probability Mass
Function (PMF).
If a random variable is a continuous variable, its probability distribution is called a
continuous probability distribution.
Example: measuring temperature over a period of time.
A function that represents a continuous probability distribution is known as a Probability
Density Function (PDF).
1. Uniform Distribution
What is a Uniform Distribution?
A probability distribution in which all outcomes have equal probability is known as a
Uniform Distribution.
Example: a perfect random number generator.
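A short R sketch of both cases, using the two-coin example for a discrete distribution and the
uniform distribution for a continuous one:

# Discrete: number of heads when flipping two fair coins (PMF via dbinom)
heads <- 0:2
dbinom(heads, size = 2, prob = 0.5)   # probabilities 0.25, 0.50, 0.25

# Continuous: uniform distribution on [0, 1]
dunif(0.3, min = 0, max = 1)          # density is constant (1) everywhere on [0, 1]
runif(5)                              # five draws from a "perfect" random generator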
Why Use R?
It is a great resource for data analysis, data visualization, data science and machine
learning
It provides many statistical techniques (such as statistical tests, classification, clustering
and data reduction)
It is easy to draw graphs in R, such as pie charts, histograms, box plots, and scatter plots
It works on different platforms (Windows, Mac, Linux)
It is open-source and free
It has a large community support
It has many packages (libraries of functions) that can be used to solve different problems
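As a quick illustration of the plotting point above, a small sketch using base-R graphics on
made-up numbers:

sales <- c(12, 19, 7, 25, 16)                   # hypothetical monthly figures
hist(sales, main = "Histogram of sales")        # histogram
pie(sales, labels = c("Jan", "Feb", "Mar", "Apr", "May"),
    main = "Sales by month")                    # pie chart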
How to Install R
To install R, go to https://cloud.r-project.org/ and download the latest version of R for Windows,
Mac or Linux.
When you have downloaded and installed R, you can run R on your computer.
When you run R, an interactive console opens where you can type commands.
Syntax
To output text in R, use single or double quotes:
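"Hello World!"
'Hello World!'   # single quotes work too; R prints the string with double quotes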
Print
Unlike many other programming languages, you can output values in R without using a print
function. However, R does have a print() function available if you want to use it. For example:
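"Hello World!"          # auto-printed at the top level, no print() needed
print("Hello World!")   # the same output, using print() explicitly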
Comments
Comments can be used to explain R code and to make it more readable. They can also be used to
prevent execution when testing alternative code.
Comments start with a #. When executing code, R will ignore anything that starts with #. For example:
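# This is a comment
"Hello World!"   # a comment can also follow code on the same line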
Creating Variables in R
Variables are containers for storing data values.
R does not have a command for declaring a variable. A variable is created the moment you first
assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the
variable value, just type the variable name:
Example
name <- "John"
age <- 40
From the example above, name and age are variables, while "John" and 40 are values.
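Typing the variable names afterwards prints their values:

name   # outputs [1] "John"
age    # outputs [1] 40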
R Command Prompt
Once you have the R environment set up, it is easy to start the R command prompt by typing
the following command at your shell prompt:
$ R
This will launch the R interpreter and you will get a prompt > where you can start typing your
program as follows:
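> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"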
Here, the first statement defines a string variable myString, to which we assign the string
"Hello, World!", and the next statement uses print() to print the value stored in myString.
R Script File
Usually, you will do your programming by writing your programs in script files and then
executing those scripts at your command prompt with the help of the R interpreter called
Rscript. So let's start by writing the following code in a text file called test.R:
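# test.R: the same Hello, World! program as above
myString <- "Hello, World!"
print(myString)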
Save the above code in a file test.R and execute it at the Linux command prompt as given below.
Even if you are using Windows or another system, the syntax will remain the same.
$ Rscript test.R
Generally, while programming in any language, you need to use various variables to store
information. Variables are nothing but reserved memory locations to store values; when you
create a variable, you reserve some space in memory.
You may want to store information of various data types such as character, numeric, integer,
complex, and logical (Boolean). Based on the data type of a variable, memory is allocated and
R decides what can be stored in the reserved location.
We can use the class() function to check the data type of a variable:
Example
# numeric
x <- 10.5
class(x)   # "numeric"
# integer
x <- 1000L
class(x)   # "integer"
# complex
x <- 9i + 3
class(x)   # "complex"
# character/string
x <- "R is exciting"
class(x)   # "character"
# logical/boolean
x <- TRUE
class(x)   # "logical"