IDS-UNIT-1-FINAL
Essentially, data science is about using scientific methods to unlock the potential of
data, uncover patterns, make predictions, and drive informed decision-making across
various domains and industries.
The term “data science” combines two key elements: “data” and “science.”
1. Data: It refers to the raw information that is collected, stored, and processed. In
today's digital age, enormous amounts of data are generated from various sources
such as sensors, social media, transactions, and more. This data can come in
structured formats (e.g., databases) or unstructured formats (e.g., text, images,
videos).
2. Science: It refers to the systematic methods, statistical, computational, and analytical,
that are applied to this data in order to extract knowledge and insights from it.
Data Science has a wide array of applications across various industries, significantly
impacting the way businesses operate and how services are delivered. Here are some
key applications of Data Science:
1. Healthcare:
Predictive Analytics: Predicting disease outbreaks, patient readmissions, and
individual health risks.
Medical Imaging: Enhancing image recognition to diagnose conditions from X-rays,
MRIs, and CT scans.
Personalized Medicine: Tailoring treatment plans based on genetic information and
patient history.
2. Finance:
Risk Management: Identifying and mitigating financial risks through predictive
modeling.
Fraud Detection: Analyzing transactions to detect fraudulent activities.
Algorithmic Trading: Using data-driven algorithms to execute high-frequency
trading strategies.
3. Marketing:
Customer Segmentation: Grouping customers based on purchasing behavior and
preferences for targeted marketing.
Sentiment Analysis: Analyzing customer feedback and social media interactions to
gauge public sentiment.
Predictive Analytics: Forecasting sales trends and customer lifetime value.
4. Retail:
Inventory Management: Optimizing stock levels based on demand forecasting.
Recommendation Systems: Providing personalized product recommendations to
customers.
Price Optimization: Adjusting prices dynamically based on market trends and
consumer behavior.
5. Transportation:
Route Optimization: Enhancing logistics by determining the most efficient routes.
Predictive Maintenance: Forecasting equipment failures to schedule timely
maintenance.
Autonomous Vehicles: Developing self-driving cars using machine learning
algorithms.
6. Education:
Learning Analytics: Tracking student performance to identify learners who need extra support.
Adaptive Learning: Personalizing course content and pacing based on individual progress.
Dropout Prediction: Identifying students at risk of dropping out so institutions can intervene early.
Big Data
Big Data refers to the vast volumes of data generated at high velocity from a variety of
sources. This data is characterized by the three V’s: Volume, Velocity, and Variety.
1. Volume: Big Data involves large datasets that are too complex for traditional data
processing tools to handle. These datasets can range from terabytes to petabytes of
information.
2. Velocity: Big Data is generated in real-time or near real-time, requiring fast processing
to extract meaningful insights.
3. Variety: The data comes in multiple forms, including structured data (like databases),
semi-structured data (like XML files), and unstructured data (like text, images, and
videos).
Big Data’s primary role is to collect and store this massive amount of information
efficiently. Technologies such as Hadoop, Apache Spark, and NoSQL databases like
MongoDB are commonly used to manage and process Big Data.
Data Science is an interdisciplinary field that utilizes scientific methods, algorithms, and
systems to extract knowledge and insights from structured and unstructured data.
It encompasses a variety of techniques from statistics, machine learning, data
mining, and big data analytics.
Data scientists typically work with data in three ways:
1. Analyze: They examine complex datasets to identify patterns, trends, and correlations.
2. Model: Using statistical models and machine learning algorithms, they create
predictive models that can forecast future trends or behaviors.
3. Interpret: They translate data findings into actionable business strategies and
decisions.
Data Science involves a broad skill set, including proficiency in programming languages
like Python and R, knowledge of databases, and expertise in machine learning
frameworks such as TensorFlow and Scikit-Learn.
While Big Data and Data Science are interrelated, they serve different purposes and
require different skill sets.
Objective of Big Data: efficient storage, processing, and management of data.
Objective of Data Science: analyzing data to inform decisions and predict trends.
Despite their differences, Big Data and Data Science are complementary fields.
Big Data provides the foundation by collecting and storing vast amounts of information.
Without this foundational layer, Data Science would lack the raw material needed for
analysis.
Conversely, Data Science adds value to Big Data by analyzing and interpreting the data.
The insights derived from Data Science can help businesses leverage Big Data more
effectively, uncovering trends and patterns that can inform strategic decisions.
For instance, in the healthcare sector, Big Data technologies can aggregate patient data
from various sources, including electronic health records, wearable devices, and genomic
databases. Data Science can then analyze this data to predict disease outbreaks,
personalize treatment plans, and improve patient outcomes.
Big Data focuses on managing and processing large datasets, whereas Data Science
aims to analyze this data and derive actionable insights.
There is a well-known prediction, attributed to the research firm Gartner, that 85% of all
productionized machine learning models will fail.
A failed data science investment can impose massive costs on an organization,
especially if an incorrect business decision is made based on the output of a predictive
algorithm.
Machine learning is not a one-size-fits-all solution to every problem, a point that
stakeholders and non-technical managers need to grasp. Not every problem can be solved with
machine learning, and not every problem should be.
As a data scientist, therefore, do not start building a predictive algorithm until you are sure
that machine learning modelling is your best option.
Datafication:
Datafication is the transformation of social action into online quantified data, thus
allowing for real-time tracking and predictive analysis.
Simply put, it is about taking a previously invisible process or activity and turning it into
data that can be monitored, tracked, analysed, and optimized.
Insurance: Data used to update risk profile development and business models.
Banking: Data used to establish trustworthiness and likelihood of a person paying back
a loan.
Human resources: Data used to identify, for example, employees' risk-taking profiles.
Hiring and recruitment: Data used to replace personality tests.
Social science research: Datafication replaces sampling techniques and restructures
the manner in which social science research is performed.
This information can then be tracked, processed, monitored, analyzed, and utilized to
improve an organization and the products and services it offers to customers. The benefits
below put this into perspective.
Benefits of Datafication
Datafication is financially advantageous to pursue because it creates significant
opportunities for streamlining business processes, and it provides a foundation for building
systems that are both secure and innovative.
1. Actionable Insights
Datafication converts unstructured, hard-to-interpret data into usable insights, giving you
visibility into your processes and procedures, the basis of any organization.
It means you will be better able to understand the strengths, weaknesses, potential,
and future of your business. Additionally, it gives you knowledge about the results and
effects of your efforts, allowing you to evaluate what you're doing and how you're doing it.
2. Digital Transformation
You need usable data if you want to benefit from the latest and most advanced
technologies. It holds the key to improving the effectiveness and efficiency of business
operations, and it helps you understand the organization's current situation and the next
actions needed to advance.
3. Manage Information
Any business generates a lot of data, which is constantly collected and stored. A well-
managed data set will produce superior outcomes. Otherwise, it could become
overwhelming or useless information.
By correctly organising it through datafication, you can use the information to make
judgements. You will be able to access and analyse the data in addition to storing it.
How is Datafication Performed?
1. Data Collection:
It all starts with data collection and retrieval. This could be from anything
you do—clicking on a website, using an app, or even just walking around
with your smartphone. Devices and sensors collect this data, often without
you even noticing.
2. Data Storage:
Once collected, this data needs a place to go. Think of it like storing your
favorite movies or music. The data is saved in databases or cloud storage,
where it can be accessed and used later.
3. Data Processing:
Here’s where things get interesting. The raw data collected isn’t very useful
on its own. It’s like having all the ingredients for a cake but not baking it yet.
Data processing involves data cleaning, organizing, and transforming this
data into a more usable format. For example, if you’ve ever used a
spreadsheet to track your expenses, you’ve engaged in a basic form of
data processing.
4. Data Analysis:
This is the magic moment when data becomes valuable information. Using
various tools and techniques, analysts can look at patterns and trends in
the data. For example, they might discover that people tend to buy more
ice cream on hot days—a useful insight for a business.
5. Data Visualization:
To make the data easy to understand, it’s often presented visually, like in
charts or graphs. If you’ve ever seen a bar chart showing monthly sales or
a line graph tracking your steps over time, you’ve encountered data
visualization. This step helps people quickly grasp the insights hidden in
the data.
6. Data Application:
Finally, the insights gained from data analysis are put to use. This could
mean anything from tweaking a marketing strategy to designing a new
product. For example, if data shows that customers prefer shopping online
at certain times of the day, a business might run targeted ads during those
hours.
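These steps can be tied together in a small, illustrative R sketch. The dataset below (daily
temperatures and ice-cream sales) is made up purely for illustration, echoing the example in
step 4:

# Toy datafication pipeline in R (illustrative, made-up data)

# Steps 1-2. Collection and storage: a small "collected" dataset kept in a data frame
sales_log <- data.frame(
  temperature = c(18, 22, 25, 30, 31, 35, 20, NA),   # degrees C
  ice_creams  = c(40, 55, 60, 90, 95, 120, 48, 70)   # units sold
)

# Step 3. Processing: clean the raw data (drop the record with a missing temperature)
clean_log <- na.omit(sales_log)

# Step 4. Analysis: quantify the pattern "more ice cream on hot days"
model <- lm(ice_creams ~ temperature, data = clean_log)
summary(model)$coefficients

# Step 5. Visualization: present the insight as a simple chart
plot(clean_log$temperature, clean_log$ice_creams,
     xlab = "Temperature (deg C)", ylab = "Ice creams sold",
     main = "Sales rise on hot days")
abline(model)

# Step 6. Application: e.g. flag hot days on which extra stock should be ordered
hot_days <- clean_log$temperature > 28
clean_log[hot_days, ]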
Statistical Inference
Statistical inference is the process of using data analysis to infer properties of an
underlying distribution of a population. It is a branch of statistics that deals with making
inferences about a population based on data from a sample.
Statistical inference is based on probability theory and probability distributions. It involves
making assumptions about the population and the sample, and using statistical models to
analyze the data.
It involves using statistical methods to analyze sample data and make inferences or
predictions about parameters or characteristics of the entire population from which the
sample was drawn.
Parameter Estimation
Parameter estimation is another primary goal of statistical inference. Parameters are
numerical characteristics or properties of the population you are studying that can be
inferred from sample data; examples include the population mean and the population variance.
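For instance, a minimal R sketch (using simulated data) that estimates the population mean and
variance from a sample, along with a 95% confidence interval for the mean:

set.seed(42)
sample_data <- rnorm(100, mean = 50, sd = 10)   # a sample of n = 100 observations

mean(sample_data)              # point estimate of the population mean
var(sample_data)               # point estimate of the population variance
t.test(sample_data)$conf.int   # 95% confidence interval for the mean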
Hypothesis Testing
Hypothesis testing uses sample data to decide whether there is enough evidence to reject a
claim (a hypothesis) about a population parameter.
There are various methods of statistical inference; some of these methods are:
Parametric Methods
Non-parametric Methods
Bayesian Methods
Parametric Methods
In this scenario, parametric statistical methods assume that the data are drawn from
a population characterized by a known probability distribution, most commonly the normal
distribution, which allows one to make inferences about the population in question.
For example, t-tests and ANOVA are parametric tests that give accurate results under the
assumption that the data are approximately normally distributed.
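As a small sketch in R (on simulated data), a two-sample t-test and a one-way ANOVA can be run
as follows:

set.seed(1)
group_a <- rnorm(30, mean = 100, sd = 15)   # simulated scores for group A
group_b <- rnorm(30, mean = 108, sd = 15)   # simulated scores for group B

t.test(group_a, group_b)                    # parametric test comparing the two means

# ANOVA for more than two groups (also parametric)
scores <- c(group_a, group_b, rnorm(30, mean = 95, sd = 15))
groups <- factor(rep(c("A", "B", "C"), each = 30))
summary(aov(scores ~ groups))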
Non-Parametric Methods
These are less assumptive and more flexible analysis methods for dealing with data that do
not follow a normal distribution. They are also used when one is uncertain about meeting
the assumptions of parametric methods, or when the available data are limited.
Some of the non-parametric tests include Wilcoxon signed-rank test and Kruskal-Wallis
test among others.
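A corresponding sketch of these non-parametric tests in R, again on simulated (skewed) data
where normality cannot be assumed:

set.seed(2)
before <- rexp(25, rate = 1)           # skewed (non-normal) paired measurements
after  <- before + rexp(25, rate = 2)

wilcox.test(before, after, paired = TRUE)   # Wilcoxon signed-rank test

# Kruskal-Wallis test for comparing three independent groups
values <- c(rexp(20, 1), rexp(20, 0.8), rexp(20, 1.2))
labels <- factor(rep(c("G1", "G2", "G3"), each = 20))
kruskal.test(values ~ labels)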
Bayesian Methods
Bayesian methods combine a prior belief about a parameter with observed data, using Bayes'
theorem, to produce an updated (posterior) estimate of that parameter.
Example: consider a situation where a doctor is investigating a new treatment and has
the prior belief about the success rate of the treatment. Upon conducting a new clinical
trial, the doctor uses Bayesian method to update his “prior belief” with the data from the
new trials to estimate the true success rate of the treatment.
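The doctor's update can be sketched in R with a Beta-Binomial model; the prior parameters and
trial numbers below are made-up values, chosen only to show how the prior belief is combined
with new data:

# Prior belief about the success rate: Beta(8, 2), i.e. roughly 80% expected success
prior_a <- 8
prior_b <- 2

# New clinical trial (hypothetical numbers): 14 successes out of 20 patients
successes <- 14
failures  <- 6

# The posterior is again a Beta distribution (conjugate update)
post_a <- prior_a + successes
post_b <- prior_b + failures

post_a / (post_a + post_b)               # posterior mean of the success rate
qbeta(c(0.025, 0.975), post_a, post_b)   # 95% credible interval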
Population Vs Sample
The population refers to the entire group of individuals or items that we are interested in
studying and drawing conclusions about. It can be a group of individuals or a set of items.
The population size is usually denoted by N.
A sample is a smaller subset of the population that is actually selected and measured; the
sample size is denoted by n.
Populations are used when your research question requires, or when you have
access to, data from every member of the population. Usually, it is only
straightforward to collect data from a whole population when it is small, accessible
and cooperative.
Example:
A marketing manager for a small local bakery wants to understand customer preferences
for different types of bread sold at their shop. Since they are solely interested in analyzing
the preferences of customers who visit their bakery, they decide to collect data on bread
preferences from every customer who makes a purchase over the course of a month. By
using the entire dataset of bread purchases, including preferences indicated by
customers, they aim to identify trends and patterns in bread choices specifically among
their bakery’s clientele.
Samples are used when the population is too large, too dispersed, or too costly to study in full.
Example:
Suppose you are conducting research on smartphone usage habits among teenagers in a
specific city. Your population comprises all teenagers aged 13-18 living in that city, which
could number in the tens of thousands. Due to logistical constraints and the difficulty of
reaching every teenager in the city, you opt to use a sample of 500 teenagers randomly
selected from different schools within the city. This sample will participate in interviews or
surveys to provide insights into their smartphone usage patterns, preferences, and
behaviors.
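The distinction can be sketched in R with simulated data: the full (simulated) population gives
the true parameter, and a random sample of 500 gives the estimate, roughly mirroring the
smartphone example above:

set.seed(3)
N <- 40000                                            # simulated population size
population_hours <- rnorm(N, mean = 4.2, sd = 1.5)    # daily smartphone hours (made up)

mean(population_hours)       # the population parameter (rarely observable in practice)

n <- 500
sample_hours <- sample(population_hours, size = n)    # simple random sample
mean(sample_hours)           # the sample statistic used to estimate the parameter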
Statistical Modeling
Statistical modeling is the process of applying statistical analysis to data in order to
describe how the data were generated. Unlike deterministic models, where variables have
specific fixed values, variables in statistical models are stochastic, reflecting random
characteristics of the true data-generating process. A statistical model therefore embodies a
set of assumptions about the data; simply put, these assumptions make it easy to calculate the
probability of an event.
For example, assume that we have two fair dice, so each face has an equal probability of
showing up, i.e. 1/6. Under this assumption, we can calculate the probability that both dice
show a 5 as 1/6 × 1/6 = 1/36.
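This can be checked by simulation in R, a small sketch that rolls two fair dice many times and
compares the observed proportion with the theoretical 1/36 (about 0.0278):

set.seed(4)
rolls <- 100000
die1 <- sample(1:6, rolls, replace = TRUE)
die2 <- sample(1:6, rolls, replace = TRUE)

mean(die1 == 5 & die2 == 5)   # should be close to 1/36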
Before looking at overfitting and underfitting, a few key terms:
o Signal: It refers to the true underlying pattern of the data that helps
the machine learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the
performance of the model.
o Bias: Bias is a prediction error introduced in the model by
oversimplifying the machine learning algorithm; it is the difference
between the predicted values and the actual values.
o Variance: Variance occurs when the machine learning model performs well
with the training dataset but does not perform well with the test dataset.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the
data points, or more data points than necessary, in the given dataset.
Because of this, the model starts capturing the noise and inaccurate
values present in the dataset, and all these factors reduce the efficiency and
accuracy of the model. The overfitted model has low bias and high
variance.
In a scatter plot, an overfitted regression curve passes through nearly every data point.
It may look efficient, but in reality it is not: the goal of a regression model is to find
the best-fit line, and a curve that chases every individual point has not found it, so it
will generate prediction errors on new data.
How to avoid Overfitting in a Model
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
Underfitting
Underfitting occurs when our machine learning model is not able to
capture the underlying trend of the data. To avoid overfitting, the
feeding of training data can be stopped at an early stage, but then
the model may not learn enough from the training data. As a
result, it may fail to find the best fit for the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from
the training data, and hence it reduces the accuracy and produces
unreliable predictions.
There are two other methods by which we can find a good balance point for our
model: using a resampling method to estimate model accuracy, and holding back a
validation dataset.
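These ideas can be illustrated with a small R sketch on simulated data: a straight line
underfits a curved relationship, a very high-degree polynomial overfits the training noise,
and a held-out test set exposes the difference:

set.seed(5)
n <- 60
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)        # noisy non-linear relationship

train <- dat[1:40, ]
test  <- dat[41:60, ]

fit_under <- lm(y ~ x,           data = train)  # degree 1: underfits the curve
fit_good  <- lm(y ~ poly(x, 5),  data = train)  # moderate flexibility
fit_over  <- lm(y ~ poly(x, 20), data = train)  # chases the training noise: overfits

test_mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2)

sapply(list(underfit = fit_under, good = fit_good, overfit = fit_over), test_mse)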
What is Probability Distribution?
A probability distribution of a random variable is a list of all possible outcomes with their
corresponding probability values.
If a random variable is a discrete variable, its probability distribution is called a discrete
probability distribution.
Example: flipping two coins.
A function that represents a discrete probability distribution is known as a Probability Mass
Function (PMF).
If a random variable is a continuous variable, its probability distribution is called a
continuous probability distribution.
Example: measuring temperature over a period of time.
A function that represents a continuous probability distribution is known as a Probability
Density Function (PDF).
1. Uniform Distribution
What is a Uniform Distribution?
A probability distribution in which all outcomes have equal probability is known as a
Uniform Distribution.
Example: a perfect random number generator.
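A short R sketch of both cases, using the two-coin example for a discrete distribution and the
uniform distribution for a continuous one:

# Discrete: number of heads when flipping two fair coins (PMF via dbinom)
heads <- 0:2
dbinom(heads, size = 2, prob = 0.5)   # probabilities 0.25, 0.50, 0.25

# Continuous: uniform distribution on [0, 1]
dunif(0.3, min = 0, max = 1)          # density is constant (1) everywhere on [0, 1]
runif(5)                              # five draws from a "perfect" random generator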
Why Use R?
It is a great resource for data analysis, data visualization, data science and machine
learning
It provides many statistical techniques (such as statistical tests, classification, clustering
and data reduction)
It is easy to draw graphs in R, such as pie charts, histograms, box plots, and scatter plots
It works on different platforms (Windows, Mac, Linux)
It is open-source and free
It has a large community support
It has many packages (libraries of functions) that can be used to solve different problems
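As a quick illustration of the plotting point above, a small sketch using base-R graphics on
made-up numbers:

sales <- c(12, 19, 7, 25, 16)                   # hypothetical monthly figures
hist(sales, main = "Histogram of sales")        # histogram
pie(sales, labels = c("Jan", "Feb", "Mar", "Apr", "May"),
    main = "Sales by month")                    # pie chart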
How to Install R
To install R, go to https://cloud.r-project.org/ and download the latest version of R for Windows,
Mac or Linux.
When you have downloaded and installed R, you can run R on your computer.
When you run R, an interactive console opens where you can type commands.
Syntax
To output text in R, use single or double quotes:
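"Hello World!"
'Hello World!'   # single quotes work too; R prints the string with double quotes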
Print
Unlike many other programming languages, you can output values in R without using a print
function. However, R does have a print() function available if you want to use it. For example:
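"Hello World!"          # auto-printed at the top level, no print() needed
print("Hello World!")   # the same output, using print() explicitly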
Comments
Comments can be used to explain R code and to make it more readable. They can also be used to
prevent execution when testing alternative code.
Comments start with a #. When executing code, R will ignore anything that starts with #. For example:
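# This is a comment
"Hello World!"   # a comment can also follow code on the same line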
Creating Variables in R
Variables are containers for storing data values.
R does not have a command for declaring a variable. A variable is created the moment you first
assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the
variable value, just type the variable name:
Example
name <- "John"
age <- 40
From the example above, name and age are variables, while "John" and 40 are values.
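Typing the variable names afterwards prints their values:

name   # outputs [1] "John"
age    # outputs [1] 40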
R Command Prompt
Once you have the R environment set up, it is easy to start the R command prompt by typing
the following command at your shell prompt:
$ R
This will launch the R interpreter and you will get a prompt > where you can start typing your
program as follows:
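> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"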
Here, the first statement defines a string variable myString, to which we assign the string
"Hello, World!", and the next statement uses print() to print the value stored in myString.
R Script File
Usually, you will do your programming by writing your programs in script files and then
executing those scripts at your command prompt with the help of the R interpreter called
Rscript. So let's start by writing the following code in a text file called test.R:
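# test.R: the same Hello, World! program as above
myString <- "Hello, World!"
print(myString)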
Save the above code in a file test.R and execute it at the Linux command prompt as given below.
Even if you are using Windows or another system, the syntax will remain the same.
$ Rscript test.R
Generally, while programming in any language, you need to use various variables to store
information. Variables are nothing but reserved memory locations to store values; when you
create a variable, you reserve some space in memory.
You may want to store information of various data types such as character, numeric, integer,
complex, and logical (Boolean). Based on the data type of a variable, memory is allocated and
R decides what can be stored in the reserved location.
We can use the class() function to check the data type of a variable:
Example
# numeric
x <- 10.5
class(x)   # "numeric"
# integer
x <- 1000L
class(x)   # "integer"
# complex
x <- 9i + 3
class(x)   # "complex"
# character/string
x <- "R is exciting"
class(x)   # "character"
# logical/boolean
x <- TRUE
class(x)   # "logical"