
INTRODUCTION TO DATA SCIENCE (IDS)

Prepared by N.Pandu Ranga Reddy


UNIT-1
Definition of Data Science:
Data is widely considered a crucial resource in different organizations across every
industry.
Data Science can be described as a separate field of work that deals with the
management and processing of data using statistical methods, artificial intelligence,
and other tools in partnership with domain specialists.

Essentially, data science is about using scientific methods to unlock the potential of
data, uncover patterns, make predictions, and drive informed decision-making across
various domains and industries.

The term “data science” combines two key elements: “data” and “science.”

1. Data: It refers to the raw information that is collected, stored, and processed. In
today’s digital age, enormous amounts of data are generated from various sources
such as sensors, social media, transactions, and more. This data can come in
structured formats (e.g., databases) or unstructured formats (e.g., text, images,
videos).

2. Science: It refers to the systematic study and investigation of phenomena using
scientific methods and principles. Science involves forming hypotheses, conducting
experiments, analyzing data, and drawing conclusions based on evidence.
Thus, "data + science" refers to the scientific study of data.

What are Data Science Applications?

Data Science has a wide array of applications across various industries, significantly
impacting the way businesses operate and how services are delivered. Here are some
key applications of Data Science:

1. Healthcare:
 Predictive Analytics: Predicting disease outbreaks, patient readmissions, and
individual health risks.
 Medical Imaging: Enhancing image recognition to diagnose conditions from X-rays,
MRIs, and CT scans.
 Personalized Medicine: Tailoring treatment plans based on genetic information and
patient history.

2. Finance:
 Risk Management: Identifying and mitigating financial risks through predictive
modeling.
 Fraud Detection: Analyzing transactions to detect fraudulent activities.
 Algorithmic Trading: Using data-driven algorithms to execute high-frequency
trading strategies.

3. Marketing:
 Customer Segmentation: Grouping customers based on purchasing behavior and
preferences for targeted marketing.
 Sentiment Analysis: Analyzing customer feedback and social media interactions to
gauge public sentiment.
 Predictive Analytics: Forecasting sales trends and customer lifetime value.

4. Retail:
 Inventory Management: Optimizing stock levels based on demand forecasting.
 Recommendation Systems: Providing personalized product recommendations to
customers.
 Price Optimization: Adjusting prices dynamically based on market trends and
consumer behavior.

5. Transportation:
 Route Optimization: Enhancing logistics by determining the most efficient routes.
 Predictive Maintenance: Forecasting equipment failures to schedule timely
maintenance.
 Autonomous Vehicles: Developing self-driving cars using machine learning
algorithms.

6. Education:
 Personalized Learning: Creating customized learning experiences based on
student performance and preferences.
 Academic Analytics: Analyzing data to improve student retention and graduation
rates.
 Curriculum Development: Using data to develop and refine educational programs.

7. Entertainment:
 Content Recommendation: Suggesting movies, shows, and music based on user
preferences.
 Audience Analytics: Understanding audience behavior to improve content delivery.
 Production Analytics: Optimizing production schedules and budgets through data
analysis.

8. Manufacturing:
 Quality Control: Using data to monitor and improve product quality.
 Supply Chain Optimization: Streamlining supply chain processes through
predictive analytics.
 Process Automation: Implementing automated systems for efficient production
workflows.

9. Energy:
 Smart Grids: Enhancing the efficiency and reliability of energy distribution.
 Predictive Maintenance: Forecasting and preventing equipment failures in power
plants.
 Energy Consumption Analytics: Analyzing patterns to optimize energy usage and
reduce costs.

10. Government:
 Public Safety: Analyzing crime data to improve law enforcement strategies.
 Urban Planning: Using data to plan and develop smarter cities.
 Policy Making: Leveraging data to make informed decisions and create effective
policies.

What is Big Data?

Big Data refers to the vast volumes of data generated at high velocity from a variety of
sources. This data is characterized by the three V’s: Volume, Velocity, and Variety.

1. Volume: Big Data involves large datasets that are too complex for traditional data
processing tools to handle. These datasets can range from terabytes to petabytes of
information.
2. Velocity: Big Data is generated in real-time or near real-time, requiring fast processing
to extract meaningful insights.
3. Variety: The data comes in multiple forms, including structured data (like databases),
semi-structured data (like XML files), and unstructured data (like text, images, and
videos).

Big Data’s primary role is to collect and store this massive amount of information
efficiently. Technologies such as Hadoop, Apache Spark, and NoSQL databases like
MongoDB are commonly used to manage and process Big Data.


What is Data Science?

Data Science is an interdisciplinary field that utilizes scientific methods, algorithms, and
systems to extract knowledge and insights from structured and unstructured data.
It encompasses a variety of techniques from statistics, machine learning, data
mining, and big data analytics.

Data Scientists use their expertise to:

1. Analyze: They examine complex datasets to identify patterns, trends, and correlations.

2. Model: Using statistical models and machine learning algorithms, they create
predictive models that can forecast future trends or behaviors.

3. Interpret: They translate data findings into actionable business strategies and
decisions.

Data Science involves a broad skill set, including proficiency in programming languages
like Python and R, knowledge of databases, and expertise in machine learning
frameworks such as TensorFlow and Scikit-Learn.

Key Differences Between Big Data and Data Science

While Big Data and Data Science are interrelated, they serve different purposes and
require different skill sets.

Aspect | Big Data | Data Science
Definition | Handling and processing vast amounts of data | Extracting insights and knowledge from data
Objective | Efficient storage, processing, and management of data | Analyzing data to inform decisions and predict trends
Focus | Volume, velocity, and variety of data | Analytical methods, models, and algorithms
Primary Tasks | Collection, storage, and processing of data | Data analysis, modeling, and interpretation
Tools/Technologies | Hadoop, Spark, NoSQL databases (e.g., MongoDB) | Python, R, TensorFlow, Scikit-Learn
Data Types | Structured, semi-structured, and unstructured data | Processed and cleaned data for analysis
Outcome | Accessible data repositories for analysis | Actionable insights, predictive models
Skill Set | Data engineering, distributed computing | Statistical analysis, machine learning, programming
Typical Roles | Data Engineers, Big Data Analysts | Data Scientists, Machine Learning Engineers
Applications | Real-time data processing, large-scale data storage | Predictive analytics, data-driven decision making
Key Techniques | Distributed computing, data warehousing | Statistical modeling, machine learning algorithms

How Big Data and Data Science Complement Each Other

Despite their differences, Big Data and Data Science are complementary fields.
Big Data provides the foundation by collecting and storing vast amounts of information.
Without this foundational layer, Data Science would lack the raw material needed for
analysis.

Conversely, Data Science adds value to Big Data by analyzing and interpreting the data.
The insights derived from Data Science can help businesses leverage Big Data more
effectively, uncovering trends and patterns that can inform strategic decisions.

For instance, in the healthcare sector, Big Data technologies can aggregate patient data
from various sources, including electronic health records, wearable devices, and genomic
databases. Data Science can then analyze this data to predict disease outbreaks,
personalize treatment plans, and improve patient outcomes.

Big Data focuses on managing and processing large datasets, whereas Data Science
aims to analyze this data and derive actionable insights.

Data Science hype – and getting past the hype


Over time, data science has been hyped up for its own good. The field is so glamorized
that the term is thrown around by people who don't really understand what it entails.

Research firm Gartner predicted that "85% of all productionized machine learning models
will fail." A failed data science investment can incur massive expense for an organization,
especially if an incorrect business decision is made based on the output of a predictive
algorithm.

Machine learning isn't a one-size-fits-all solution, a point that stakeholders and
non-technical managers need to grasp. Not every problem can be solved with
machine learning, and not every problem should be.

As a data scientist, don't start building a predictive algorithm until you are sure
that machine learning modelling is your best option.

Building machine learning models, especially on large amounts of data, is computationally
heavy and can incur unnecessary expense for the organization.
If the problem can be solved with hard-coded logic or simple calculations in a
spreadsheet, why waste time building a machine learning model?

Keep the following in mind when attempting to build a data science solution:

 Don’t use machine learning if there is a simpler solution available.


 Only start building predictive models once you understand the business
requirement. Otherwise, you will end up spending time on a fancy algorithm that nobody
else uses.

Datafication:
Datafication is the transformation of social action into online quantified data, thus
allowing for real-time tracking and predictive analysis.

Simply put, it is about taking a previously invisible process or activity and turning it into
data that can be monitored, tracked, analysed and optimized.

Examples where datafication process is actively used:

 Insurance: Data used to update risk profile development and business models.
 Banking: Data used to establish trustworthiness and likelihood of a person paying back
a loan.
 Human resources: Data used to identify, for example, employees' risk-taking profiles.
 Hiring and recruitment: Data used to replace personality tests.
 Social science research: Datafication replaces sampling techniques and restructures
the manner in which social science research is performed.

Datafication enables the transformation of an organization's operations, behaviors, and
actions, in addition to those of its clients and consumers, into quantifiable, usable, and
actionable data.

This information can then be tracked, processed, monitored, analyzed, and utilized to
improve an organization and the products and services it offers to customers. To put
this into perspective:

 Google transforms our searches into data


 Facebook transforms our friendships into data
 LinkedIn transforms our professional life into data
 Netflix or Amazon Prime transforms our watched TV shows and films into data
 Tinder transforms our dating activities into data
 Amazon transforms our shopping into data

Benefits of Datafication
Datafication is a technique that is financially advantageous to pursue since it provides
great opportunity for streamlining corporate procedures. Datafication is a cutting-edge
process for creating a futuristic framework that is both secure and inventive.

1. Actionable Insights
Datafication converts unstructured, incomprehensible data into usable insights, allowing
you to get insight into your processes and procedures – the basis of any organization.

Datafication suggests that you will be better able to comprehend the advantages,
disadvantages, potential, and future of your business. Additionally, it gives you
knowledge about the results and effects of your efforts, allowing you to evaluate what
you’re doing and how you’re doing it.

2. Digital Transformation

In order for organisations to stay relevant and up-to-date in a constantly shifting
ecosystem, digital transformation is no longer just a passing trend but rather a
necessity.

You should have usable data if you want to benefit from the most recent and cutting-
edge technologies. It holds the key to enhancing the effectiveness and efficiency of
corporate operations. You will be better able to comprehend the organization’s current
situation and the necessary next actions to advance.

3. Improve Productivity and Efficiency

Datafication helps you understand what you are doing and where you can improve.
Streamlining operations will improve the utilisation of all assets, including staff
members, increasing overall output and efficiency and turning your company into a
prosperous corporation.

4. Manage Information
Any business generates a lot of data, which is constantly collected and stored. A well-
managed data set will produce superior outcomes. Otherwise, it could become
overwhelming or useless information.

By correctly organising it through datafication, you can use the information to make
judgements. You will be able to access and analyse the data in addition to storing it.

How is Datafication Performed?

1.Data Collection:
It all starts with data collection and retrieval. This could be from anything
you do—clicking on a website, using an app, or even just walking around
with your smartphone. Devices and sensors collect this data, often without
you even noticing.

2. Data Storage:
Once collected, this data needs a place to go. Think of it like storing your
favorite movies or music. The data is saved in databases or cloud storage,
where it can be accessed and used later.

3. Data Processing:
Here’s where things get interesting. The raw data collected isn’t very useful
on its own. It’s like having all the ingredients for a cake but not baking it yet.
Data processing involves data cleaning, organizing, and transforming this
data into a more usable format. For example, if you’ve ever used a
spreadsheet to track your expenses, you’ve engaged in a basic form of
data processing.

4. Data Analysis:
This is the magic moment when data becomes valuable information. Using
various tools and techniques, analysts can look at patterns and trends in
the data. For example, they might discover that people tend to buy more
ice cream on hot days—a useful insight for a business.

5. Data Visualization:
To make the data easy to understand, it’s often presented visually, like in
charts or graphs. If you’ve ever seen a bar chart showing monthly sales or
a line graph tracking your steps over time, you’ve encountered data
visualization. This step helps people quickly grasp the insights hidden in
the data.

6. Data Application:
Finally, the insights gained from data analysis are put to use. This could
mean anything from tweaking a marketing strategy to designing a new
product. For example, if data shows that customers prefer shopping online
at certain times of the day, a business might run targeted ads during those
hours.
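
The following is a minimal, hypothetical sketch of these six steps in R, using simulated purchase data; the data frame, column names, and the hot-weather/ice-cream insight are invented purely for illustration.

# Hypothetical datafication pipeline on simulated data (illustrative sketch)

# 1. Data collection: simulate "collected" purchase events
set.seed(42)
purchases <- data.frame(temperature = round(runif(200, 10, 40), 1))   # sensor readings (°C)
purchases$ice_creams <- rpois(200, lambda = purchases$temperature / 5)

# 2-3. Data storage and processing: keep a cleaned copy
clean <- purchases[!is.na(purchases$ice_creams), ]

# 4. Data analysis: is there a relationship between heat and sales?
model <- lm(ice_creams ~ temperature, data = clean)
summary(model)

# 5. Data visualization: scatter plot with the fitted trend
plot(clean$temperature, clean$ice_creams,
     xlab = "Temperature (°C)", ylab = "Ice creams sold")
abline(model, col = "red")

# 6. Data application: predict demand for a hot day
predict(model, newdata = data.frame(temperature = 35))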

Why is Datafication Important?


An exciting aspect of datafication is its predictive capability. By analyzing past
behaviors and trends, businesses can predict future outcomes.

Current Landscape Perspective of Data Science

The current landscape perspective of data science is characterized by several key
trends and developments:
 AI and Machine Learning: Data science is increasingly intertwined with
artificial intelligence (AI) and machine learning, enabling more advanced
predictive analytics and automation of data-driven processes.

 Big Data: The proliferation of big data continues to shape the data science
landscape, with organizations leveraging large volumes of data from
diverse sources to gain insights and make informed decisions.

 Ethics and Privacy: There is a growing emphasis on ethical
considerations and privacy concerns in data science, with a focus on
responsible data usage, transparency, and compliance with regulations
such as GDPR.

 Interdisciplinary Collaboration: Data science is increasingly viewed as a
multidisciplinary field, requiring collaboration between data scientists,
domain experts, and stakeholders to ensure the meaningful interpretation
and application of data-driven insights.

 Data Visualization and Interpretability: The importance of effective data
visualization and interpretability techniques is on the rise, as stakeholders
seek to comprehend and communicate complex data findings in a more
accessible manner.

 Automation and Scalability: There is a shift towards automating repetitive
data tasks and scaling data science processes through the use of cloud-
based platforms and tools, enabling greater efficiency and agility.

 Continuous Learning and Upskilling: Given the rapid evolution of data
science technologies and methodologies, professionals in the field are
increasingly focused on continuous learning and upskilling to stay abreast
of the latest developments.

In summary, the current landscape perspective of data science is
characterized by the convergence of AI and machine learning, the impact
of big data, a heightened focus on ethics and privacy, interdisciplinary
collaboration, emphasis on data visualization, automation, and the need for
continuous learning and upskilling.

Statistical Inference
Statistical inference is the process of using data analysis to infer properties of an
underlying distribution of a population. It is a branch of statistics that deals with making
inferences about a population based on data from a sample.
Statistical inference is based on probability theory and probability distributions. It involves
making assumptions about the population and the sample, and using statistical models to
analyze the data.

In other words, statistical inference is the process of drawing conclusions or making
predictions about a population based on data collected from a sample of that population.

It involves using statistical methods to analyze sample data and make inferences or
predictions about parameters or characteristics of the entire population from which the
sample was drawn.

Branches of Statistical Inference

There are two main branches of statistical inference:


 Parameter Estimation
 Hypothesis Testing

Parameter Estimation
Parameter estimation is another primary goal of statistical inference. Parameters are
capable of being deduced; they are quantified traits or properties related to the population
you are studying. Some instances comprise the population mean, population variance.

There are two broad methods of parameter estimation:


 Point Estimation: a single best-guess value for the parameter (for example, using the
sample mean to estimate the population mean).
 Interval Estimation: Confidence Intervals (CI)

A confidence interval gives a range of plausible values within which the population
parameter is expected to lie, at a stated confidence level – usually 95%. In simpler
terms, a CI provides an estimate of the population value together with the level of
uncertainty that comes with it.
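
As a small illustration with arbitrary numbers, the following R sketch computes a point estimate and a 95% confidence interval for a population mean from a simulated sample.

# Point estimate and 95% confidence interval for a mean (illustrative sketch)
set.seed(1)
sample_data <- rnorm(50, mean = 170, sd = 10)   # e.g. heights of 50 sampled people

mean(sample_data)                               # point estimation of the population mean

t.test(sample_data, conf.level = 0.95)$conf.int # interval estimation: 95% CI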

Hypothesis Testing

Hypothesis testing is used to make decisions or draw conclusions about a population
based on sample data. It involves formulating a hypothesis about the population
parameter, collecting sample data, and then using statistical methods to determine
whether the data provide enough evidence to reject or fail to reject the hypothesis.
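
For instance, a one-sample t-test in R (a sketch with made-up numbers) tests whether a population mean equals a hypothesized value:

# One-sample t-test: is the population mean weight 500 g? (illustrative sketch)
set.seed(2)
weights <- rnorm(40, mean = 497, sd = 8)   # hypothetical sample of 40 packages

result <- t.test(weights, mu = 500)
result$p.value                             # a small p-value is evidence against H0: mean = 500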

Statistical Inference Methods

There are various methods of statistical inference, some of these methods are:

 Parametric Methods
 Non-parametric Methods
 Bayesian Methods

Parametric Methods

Parametric statistical methods assume that the data are drawn from a population
characterized by a known probability distribution, most commonly the normal
distribution, which allows one to make inferences about the population in question.
For example, t-tests and ANOVA are parametric tests that give accurate results under
the assumption that the data are (approximately) normally distributed.

Non-Parametric Methods

These are more flexible methods that make fewer assumptions, used when the data do
not follow a normal distribution. They are also used when one is uncertain about meeting
the assumptions of parametric methods, or when the data are limited or inadequate.
Some non-parametric tests include the Wilcoxon signed-rank test and the Kruskal-Wallis
test, among others.
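
To illustrate the parametric/non-parametric distinction, the sketch below applies a t-test and its non-parametric counterpart, the Wilcoxon rank-sum test, to the same two simulated (skewed) groups; the data are invented.

# Parametric vs. non-parametric comparison of two groups (illustrative sketch)
set.seed(3)
group_a <- rexp(30, rate = 1/10)   # skewed, clearly non-normal data
group_b <- rexp(30, rate = 1/12)

t.test(group_a, group_b)           # parametric: assumes (roughly) normal data
wilcox.test(group_a, group_b)      # non-parametric: rank-based, fewer assumptions
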
Bayesian Methods

Bayesian statistics is distinct from conventional methods in that it incorporates prior
knowledge and beliefs. It determines the probability of a hypothesis being true in light
of both current and previous knowledge, and thus allows updating the likelihood of
beliefs as new data arrive.

 Example: consider a situation where a doctor is investigating a new treatment and has
the prior belief about the success rate of the treatment. Upon conducting a new clinical
trial, the doctor uses Bayesian method to update his “prior belief” with the data from the
new trials to estimate the true success rate of the treatment.
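
A minimal sketch of this idea in R, using a Beta prior for the treatment success rate and updating it with hypothetical trial results (all numbers invented):

# Bayesian updating of a success rate with a Beta-Binomial model (illustrative sketch)
prior_alpha <- 8; prior_beta <- 2        # prior belief: success rate around 80%

successes <- 12; failures <- 18          # hypothetical new trial: 12 of 30 succeeded

post_alpha <- prior_alpha + successes    # conjugate (Beta-Binomial) update
post_beta  <- prior_beta + failures

post_alpha / (post_alpha + post_beta)    # posterior mean: updated belief about the rate

curve(dbeta(x, prior_alpha, prior_beta), from = 0, to = 1, ylab = "Density")  # prior
curve(dbeta(x, post_alpha, post_beta), add = TRUE, col = "blue")              # posterior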

Population vs Sample in Statistics


In statistics, understanding the difference between a population and a sample is
fundamental to many aspects of data analysis and inference.

Population Vs Sample

The population refers to the entire group of individuals or items that we are interested in
studying and drawing conclusions about. It can be a group of individuals or a set of items.

The population size is usually denoted by N.


A sample is a subset of the population selected for study. It is a representative portion of
the population from which we collect data in order to make inferences or draw conclusions
about the entire population.

The sample size is denoted by n.

Population | Sample
The population includes all members of a specified group. | A sample is a subset of the population.
Collecting data from an entire population can be time-consuming, expensive, and sometimes impractical or impossible. | Samples offer a more feasible approach to studying populations, allowing researchers to draw conclusions based on smaller, manageable datasets.
Example: includes all residents in the city. | Example: consists of 1000 households, a subset of the entire population.

Collecting Data From Population and Sample

Populations are used when your research question requires, or when you have
access to, data from every member of the population. Usually, it is only
straightforward to collect data from a whole population when it is small, accessible
and cooperative.

Example:

A marketing manager for a small local bakery wants to understand customer preferences
for different types of bread sold at their shop. Since they are solely interested in analyzing
the preferences of customers who visit their bakery, they decide to collect data on bread
preferences from every customer who makes a purchase over the course of a month. By
using the entire dataset of bread purchases, including preferences indicated by
customers, they aim to identify trends and patterns in bread choices specifically among
their bakery’s clientele.

When your population is large in size, geographically dispersed, or difficult to
contact, it's necessary to use a sample. With statistical analysis, you can use
sample data to make estimates or test hypotheses about population data.

Example:

Suppose you are conducting research on smartphone usage habits among teenagers in a
specific city. Your population comprises all teenagers aged 13-18 living in that city, which
could number in the tens of thousands. Due to logistical constraints and the difficulty of
reaching every teenager in the city, you opt to use a sample of 500 teenagers randomly
selected from different schools within the city. This sample will participate in interviews or
surveys to provide insights into their smartphone usage patterns, preferences, and
behaviors.

When Should Samples be Used?

 When studying a large population where it is impractical or impossible to collect data
from every individual.
 When resources such as time, cost, and manpower are limited, making it more feasible
to collect data from a subset of the population.
 When conducting research or experiments where it is important to minimize potential
biases in data collection.
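
To see the population/sample distinction in code, here is a hypothetical R sketch in which simulated ages stand in for a real population of teenagers:

# Estimating a population mean from a random sample (illustrative sketch)
set.seed(4)
population <- rnorm(50000, mean = 15.5, sd = 1.6)  # ages of ~50,000 teenagers (simulated)

N <- length(population)                            # population size
n <- 500                                           # sample size

sample_ages <- sample(population, size = n)        # simple random sample

mean(population)   # true population mean (usually unknown in practice)
mean(sample_ages)  # sample estimate of the population mean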

Statistical Modeling

A statistical model is a type of mathematical model that comprises the assumptions
undertaken to describe the data generation process.

Let us focus on the two highlighted terms above:

1. Type of mathematical model? A statistical model is non-deterministic, unlike other
mathematical models where variables take specific values. Variables in a statistical
model are stochastic, i.e. they have probability distributions.

2. Assumptions? How do those assumptions help us understand the properties or
characteristics of the true data? Simply put, these assumptions make it easy to calculate
the probability of an event.

An example to better understand the role of statistical assumptions in data modeling:

 Assumption 1: we have 2 fair dice, and each face has an equal probability of showing
up, i.e. 1/6. We can then calculate the probability of both dice showing a 5 as
1/6 * 1/6 = 1/36. Since we can calculate the probability of every event this way, the
assumption constitutes a statistical model.
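
A quick sketch of this model in R, enumerating all outcomes of two fair dice and checking the probability that both show 5:

# Statistical model for two fair dice: every outcome has probability 1/36 (sketch)
outcomes <- expand.grid(die1 = 1:6, die2 = 1:6)   # all 36 equally likely outcomes

mean(outcomes$die1 == 5 & outcomes$die2 == 5)     # 1/36 ≈ 0.0278, i.e. 1/6 * 1/6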


Why Do We Need Statistical Modeling?

A statistical model provides the basis for finding statistical evidence.

The statistical model plays a fundamental role in carrying out statistical inference,
which helps in making propositions about the unknown properties and characteristics
of the population.

Fitting a Model-Underfitting and Overfitting

Overfitting and Underfitting in Machine Learning


Overfitting and Underfitting are the two main problems that occur in
machine learning and degrade the performance of the machine learning
models.

The main goal of each machine learning model is to generalize well.

Here, generalization is the ability of an ML model to provide suitable output for
previously unseen input. It means that after training on the dataset, the model can
produce reliable and accurate output on new data. Hence, underfitting and overfitting
are the two conditions that must be checked to judge whether the model is
generalizing well or not.

o Signal: It refers to the true underlying pattern of the data that helps
the machine learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the
performance of the model.
o Bias: Bias is a prediction error that is introduced in the model due to
oversimplifying the machine learning algorithms. Or it is the difference
between the predicted values and the actual values.
o Variance: If the machine learning model performs well with the
training dataset, but does not perform well with the test dataset, then
variance occurs.
Overfitting

Overfitting occurs when our machine learning model tries to cover all the data points,
or more data points than required, in the given dataset. Because of this, the model
starts capturing the noise and inaccurate values present in the dataset, and all these
factors reduce the efficiency and accuracy of the model. An overfitted model has low
bias and high variance.

The chances of overfitting increase the more training we give our model: the more we
train it, the higher the chance of producing an overfitted model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of overfitting can be understood from the graph of a linear
regression output below:

As we can see from the graph, the model tries to cover all the data points present in
the scatter plot. It may look efficient, but in reality it is not, because the goal of the
regression model is to find the best-fit line; since no single best fit is found here, the
model will generate prediction errors on new data.
How to Avoid Overfitting in a Model

Both overfitting and underfitting degrade the performance of a machine learning model,
but the more common problem is overfitting, so there are several ways by which we can
reduce its occurrence in our model:

o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling

Underfitting

Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting, the feeding of training data can be
stopped at an early stage, due to which the model may not learn enough from the
training data. As a result, it may fail to find the best fit of the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training data,
and hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

Example: We can understand underfitting using the output of the linear regression
model below:

As we can see from the diagram, the model is unable to capture the data points
present in the plot.

How to avoid underfitting:

o By increasing the training time of the model.
o By increasing the number of features.

Goodness of Fit

The "goodness of fit" term is taken from statistics, and the goal of machine learning
models is to achieve a good fit. In statistical modeling, it defines how closely the results
or predicted values match the true values of the dataset.

The model with a good fit lies between the underfitted and the overfitted model; ideally
it makes predictions with zero error, but in practice this is difficult to achieve.

As we train our model, the errors in the training data go down, and the same happens
with the test data. But if we train the model for too long, its performance may decrease
due to overfitting, because the model also learns the noise present in the dataset. The
errors in the test dataset then start increasing, so the point just before the errors start
rising is the good point, and we can stop there to achieve a good model.

There are two other methods by which we can get a good point for our model: the
resampling method to estimate model accuracy, and a validation dataset.
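
The following R sketch (with simulated data and arbitrary polynomial degrees) shows underfitting, a reasonable fit, and overfitting by fitting polynomials of increasing degree to the same noisy data and comparing training and test error:

# Underfitting vs. overfitting with polynomial regression (illustrative sketch)
set.seed(5)
x <- runif(60, 0, 10)
y <- sin(x) + rnorm(60, sd = 0.3)                 # noisy signal
train <- data.frame(x = x[1:40], y = y[1:40])
test  <- data.frame(x = x[41:60], y = y[41:60])

rmse <- function(model, data) sqrt(mean((data$y - predict(model, data))^2))

for (degree in c(1, 4, 15)) {                     # low, moderate, high complexity
  fit <- lm(y ~ poly(x, degree), data = train)
  cat("degree", degree,
      "train RMSE:", round(rmse(fit, train), 3),
      "test RMSE:",  round(rmse(fit, test), 3), "\n")
}
# Typically: degree 1 underfits (both errors high), degree 15 overfits
# (low training error, high test error), and a moderate degree generalizes best.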
What is a Probability Distribution?
A Probability Distribution of a random variable is a list of all possible outcomes with
corresponding probability values.

Note: A probability value always lies between 0 and 1.

What is a Random Variable?

A variable whose possible values are the numerical outcomes of a random experiment
is called a Random Variable. It is represented by X.

Example: Outcome of coin toss

What is an example of Probability Distribution?

Let’s understand the probability distribution by an example:


When two six-sided dice are rolled, let the possible outcome of a roll be
denoted by (a, b), where

a : number on the top of the first die

b : number on the top of the second die

Then, the possible sums a + b are:


Sum of a + b    Possible outcomes (a, b)
2 (1,1)
3 (1,2), (2,1)
4 (1,3), (2,2), (3,1)
5 (1,4), (2,3), (3,2), (4,1)
6 (1,5), (2,4), (3,3), (4,2), (5,1)
7 (1,6), (2,5), (3,4),(4,3), (5,2), (6,1)
8 (2,6), (3,5), (4,4), (5,3), (6,2)
9 (3,6), (4,5), (5,4), (6,3)
10 (4,6), (5,5), (6,4)
11 (5,6), (6,5)
12 (6,6)
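
A small R sketch that builds this probability distribution of the sum of two dice:

# Probability distribution of the sum of two fair dice (sketch)
outcomes <- expand.grid(a = 1:6, b = 1:6)    # 36 equally likely outcomes
sums <- outcomes$a + outcomes$b

table(sums) / length(sums)                   # P(sum = k) for k = 2, ..., 12

barplot(table(sums) / length(sums),
        xlab = "Sum of two dice", ylab = "Probability")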

If a random variable is a discrete variable, its probability distribution is called a discrete
probability distribution.
 Example: flipping two coins

 A function that represents a discrete probability distribution is known as a Probability
Mass Function.

 If a random variable is a continuous variable, its probability distribution is called a
continuous probability distribution.
 Example: measuring temperature over a period of time
 A function that represents a continuous probability distribution is known as a Probability
Density Function.
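
To connect these terms to code, a brief R sketch (with arbitrary parameters) evaluates a probability mass function and a probability density function:

# PMF vs. PDF (illustrative sketch)

dbinom(0:2, size = 2, prob = 0.5)   # PMF of heads in 2 coin flips: P(X = 0), P(X = 1), P(X = 2)

dnorm(25, mean = 25, sd = 3)        # PDF of a Normal(25, 3) at x = 25 (a density, not a probability)
pnorm(28, mean = 25, sd = 3) - pnorm(22, mean = 25, sd = 3)   # P(22 < X < 28)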

Types of Random Variable:


 Discrete Random Variable

 X is discrete if it has a countable set of values between two numbers
 Example: number of balls in a bag, number of tails when tossing coins

 Continuous Random Variable

 X is continuous if it has an infinite number of values between two values
 Example: distance travelled, height of students

Types of Probability Distributions

1. Uniform Distribution
What is a Uniform Distribution?
A probability distribution in which all outcomes have equal probability is known as a
Uniform Distribution.
Example: a perfect random number generator

2. Normal Distribution (Gaussian Distribution):

A continuous probability distribution that is symmetric about its mean value (i.e. data
near the mean are more frequent in occurrence) is known as the Normal Distribution.

 A random variable X is normally distributed if its probability density function is given by:

f(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²)),  −∞ < x < ∞

where μ is the mean and σ is the standard deviation.
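
A brief R sketch (arbitrary parameters) sampling from and plotting these two distributions:

# Uniform and Normal distributions in R (illustrative sketch)
set.seed(6)
u <- runif(1000, min = 0, max = 1)   # uniform: every value in [0, 1] equally likely
z <- rnorm(1000, mean = 0, sd = 1)   # normal: symmetric about the mean

par(mfrow = c(1, 2))
hist(u, main = "Uniform(0, 1)", xlab = "x")
hist(z, main = "Normal(0, 1)", xlab = "x")

dnorm(0, mean = 0, sd = 1)           # density at the mean: 1 / sqrt(2 * pi) ≈ 0.3989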


R Programming Language – Introduction
The R language stands out as a powerful tool in the modern era of statistical computing
and data analysis. It offers an extensive suite of packages and libraries tailored for data
manipulation, statistical modeling, and visualization.
What is R?
R is a popular programming language used for statistical computing and graphical presentation.

Its most common use is to analyze and visualize data.

Why Use R?
 It is a great resource for data analysis, data visualization, data science and machine
learning
 It provides many statistical techniques (such as statistical tests, classification, clustering
and data reduction)
 It is easy to draw graphs in R, like pie charts, histograms, box plots, scatter plots, etc.
 It works on different platforms (Windows, Mac, Linux)
 It is open-source and free
 It has a large community support
 It has many packages (libraries of functions) that can be used to solve different problems

How to Install R
To install R, go to https://cloud.r-project.org/ and download the latest version of R for Windows,
Mac or Linux.
When you have downloaded and installed R, you can run R on your computer.

The screenshot below shows how it may look when you run R on a Windows PC:

Syntax
To output text in R, use single or double quotes:

To output numbers, just type the number (without quotes):


To do simple calculations, just add numbers together:
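
A few minimal examples (the values are arbitrary; outputs are shown as comments):

"Hello World!"     # text in double quotes
'Hello World!'     # or single quotes

5                  # a number, no quotes needed
10.5

5 + 5              # a simple calculation; prints [1] 10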

Print
Unlike many other programming languages, you can output values in R without using a print
function:

However, R does have a print() function available if you want to use it.

Comments
Comments can be used to explain R code and to make it more readable. They can also be used
to prevent execution when testing alternative code.

Comments start with a #. When executing code, R will ignore anything that starts with #.
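
For example:

# This is a comment and is ignored by R
"Hello World!"   # A comment can also follow code on the same line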

Creating Variables in R
Variables are containers for storing data values.

R does not have a command for declaring a variable. A variable is created the moment you first
assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the
variable value, just type the variable name:

Example
name <- "John"
age <- 40

name # output "John"


age # output 40

From the example above, name and age are variables, while "John" and 40 are values.

R Command Prompt
Once you have R environment setup, then it’s easy to start your R command prompt by just
typing the following command at your command prompt −

$R

This will launch the R interpreter and you will get a prompt > where you can start typing your
program as follows −

> myString <- "Hello, World!"


> print ( myString)
[1] "Hello, World!"

Here the first statement defines a string variable myString, to which we assign the string
"Hello, World!"; the next statement uses print() to print the value stored in myString.

R Script File
Usually, you will do your programming by writing your programs in script files and then
executing those scripts at your command prompt with the help of the R interpreter called
Rscript. So let's start by writing the following code in a text file called test.R:

# My first program in R Programming


myString <- "Hello, World!"
print ( myString)

Save the above code in a file test.R and execute it at the Linux command prompt as given below.
Even if you are using Windows or another system, the syntax will remain the same.

$ Rscript test.R

When we run the above program, it produces the following result.

[1] "Hello, World!"

Basic Data Types in R-program


Data types

Generally, while doing programming in any programming language, you need to use various
variables to store various information. Variables are nothing but reserved memory locations to
store values. This means that, when you create a variable you reserve some space in memory.

You may like to store information of various data types like character, wide character, integer,
floating point, double floating point, Boolean etc. Based on the data type of a variable, the
operating system allocates memory and decides what can be stored in the reserved memory.

Basic data types in R can be divided into the following types:

 numeric - (10.5, 55, 787)


 integer - (1L, 55L, 100L, where the letter "L" declares this as an integer)
 complex - (9 + 3i, where "i" is the imaginary part)
 character (a.k.a. string) - ("k", "R is exciting", "FALSE", "11.5")
 logical (a.k.a. boolean) - (TRUE or FALSE)

We can use the class() function to check the data type of a variable:

Example
# numeric
x <- 10.5
class(x)

# integer
x <- 1000L
class(x)

# complex
x <- 9i + 3
class(x)

# character/string
x <- "R is exciting"
class(x)

# logical/boolean
x <- TRUE
class(x)
