EDS Unit 1?
Key Points:
Stakeholder collaboration.
2. Finance
3. E-Commerce
4. Retail
5. Marketing
6. Transportation
8. Education
10. Energy
Problem of Data Privacy: For many industries, data is their fuel. Data
Scientists help companies make data-driven decisions. However, the data
utilized in the process may breach the privacy of customers.
Characteristics
Big Data has five characteristics:
1. Volume: Big Data involves datasets that are too large and complex for
traditional data processing tools to handle. These datasets can range from
terabytes and petabytes up to zettabytes of information.
2. Variety: The data comes in multiple forms, including structured data (like
databases), unstructured data (like text, images, and videos), and
semi-structured data (like XML files).
4. Veracity: Veracity refers to the degree of accuracy in data sets and how
trustworthy they are.
5. Value: Not all the data that's collected has real business value or benefit.
It's essential to determine the business value of the data you collect.
Unstructured Data
A heterogeneous data source containing a combination of simple text files,
images, videos, etc. (for example, the output of a Google search).
Semi-Structured Data
Semi-structured data can contain both forms of data. It looks structured in
form, but its structure is not formally defined.
Datafication
Datafication is the process of transforming every aspect of a business into
quantifiable data that can be tracked, monitored, and analyzed.
Statistical Inference
Statistical inference, also called inferential statistics, is a branch of
statistics that uses a random sample of a population to make inferences
about the whole population using statistical techniques.
Statistical inference draws on two kinds of information: process and data.
Process
The activities or functions happening in and around the world are called
processes.
2. Confidence Intervals
These give a range where the true value of a population parameter (like an
average) is likely to fall.
Example: "We are 95% confident that the average score of students is
between 70 and 80."
3. Regression Analysis
A method to explore the relationship between one dependent variable (e.g.,
sales) and one or more independent variables (e.g., advertising budget).
Example: Predicting house prices based on factors like size and location.
5. Chi-Square Tests
Used to find out if there’s a relationship between two categorical variables.
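The three techniques above can be tried out in R with the built-in functions
t.test(), lm(), and chisq.test(). The following is only a minimal sketch: all
of the numbers (scores, advertising budgets, sales, and the count table) are
hypothetical values made up for illustration, not data from these notes.

```r
# Hypothetical data: exam scores for a sample of 10 students.
scores <- c(72, 75, 78, 71, 69, 80, 77, 74, 76, 73)

# Confidence interval: t.test() reports an interval for the population mean.
ci <- t.test(scores, conf.level = 0.95)$conf.int
print(ci)

# Regression analysis: sales (dependent) explained by advertising budget
# (independent). Both vectors are hypothetical.
ads   <- c(10, 20, 30, 40, 50)
sales <- c(25, 45, 62, 85, 105)
fit <- lm(sales ~ ads)
print(coef(fit))                                     # intercept and slope
print(predict(fit, newdata = data.frame(ads = 60)))  # prediction for a new budget

# Chi-square test: is there a relationship between two categorical variables?
# Hypothetical counts of customers by gender and preferred product.
counts <- matrix(c(30, 20, 15, 35), nrow = 2,
                 dimnames = list(gender = c("F", "M"), product = c("A", "B")))
chi <- chisq.test(counts)
print(chi$p.value)  # a small p-value suggests the variables are related
```

The confidence interval is read the same way as the example above: a range that
is likely to contain the true average. For the chi-square test, a p-value below
the chosen significance level (commonly 0.05) is taken as evidence of a
relationship.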
Sample
A sample is the subset of the population (for example, people) from which you
collect data in order to draw conclusions and make inferences about the whole
population.
Statistical Modelling
Statistical modelling is the process of creating a mathematical model that
describes the relationship between variables. The goal is to use this model to
analyze data and to predict or explain outcomes.
It also refers to the process of applying statistical analysis techniques to
observe, analyze, and predict trends and patterns in data.
Probability distributions are the foundation of statistical models.
Process of modelling
Probability Distributions
A probability distribution is a mathematical function that describes the
probability of different possible values of a variable. Probability distributions
are often depicted using graphs.
For discrete variables, we use Probability Mass Function (PMF), and for
continuous variables, we use Probability Density Function (PDF).
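As a small illustration of the difference between a PMF and a PDF, the sketch
below uses R's built-in dbinom() (binomial PMF) and dnorm() (normal PDF). The
choice of a fair coin tossed 10 times and of the standard normal distribution
is an assumption made only for the example.

```r
# PMF (discrete variable): probability of exactly 3 heads in 10 fair coin tosses.
p_three_heads <- dbinom(3, size = 10, prob = 0.5)
print(p_three_heads)

# The PMF values over all possible outcomes (0 to 10 heads) sum to 1.
pmf_total <- sum(dbinom(0:10, size = 10, prob = 0.5))
print(pmf_total)

# PDF (continuous variable): density of the standard normal distribution at 0.
# A density is not itself a probability.
density_at_zero <- dnorm(0, mean = 0, sd = 1)
print(density_at_zero)

# The total area under a PDF is 1.
pdf_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
print(pdf_area)

# Probabilities for a continuous variable are areas under the curve,
# here P(-1 <= Z <= 1) computed with the cumulative distribution function.
p_within_one_sd <- pnorm(1) - pnorm(-1)
print(p_within_one_sd)
```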
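Independent Events: For independent events A and B, the probability that
both occur is the product of their probabilities.
P(A ∩ B) = P(A) × P(B)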
Conditional Probability: Probability of A occurring given that B has
occurred.
P(A ∣ B) = P(A ∩ B) / P(B), where P(B) > 0
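The conditional probability formula can be checked by hand on a single roll of
a fair die. In the sketch below, the events A, B, and C and the helper p() are
hypothetical names chosen for this example.

```r
outcomes <- 1:6               # one roll of a fair die, all outcomes equally likely

A <- outcomes %% 2 == 0       # event A: the roll is even       {2, 4, 6}
B <- outcomes > 3             # event B: the roll is above 3    {4, 5, 6}
C <- outcomes <= 2            # event C: the roll is at most 2  {1, 2}

# Probability of an event = share of equally likely outcomes where it is TRUE.
p <- function(event) mean(event)

# Conditional probability: P(A | B) = P(A and B) / P(B) = (2/6) / (3/6) = 2/3
p_A_and_B   <- p(A & B)
p_A_given_B <- p_A_and_B / p(B)
print(p_A_given_B)

# Two events are independent when P(X and Y) equals P(X) * P(Y).
independent_AB <- isTRUE(all.equal(p(A & B), p(A) * p(B)))  # 1/3 vs 1/4
independent_AC <- isTRUE(all.equal(p(A & C), p(A) * p(C)))  # 1/6 vs 1/6
print(independent_AB)
print(independent_AC)
```

Knowing that B occurred raises the probability of A from 1/2 to 2/3, so A and B
are not independent, while A and C are.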
2. Overfitting:
Occurs when a model learns both the patterns and noise from training
data, leading to poor performance on unseen data.
Signs: high variance with low bias; high training accuracy with low test
accuracy.
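Overfitting can be made visible by comparing training and test error for a
simple and a very flexible model. The sketch below is an illustration under
assumptions: the data is simulated (a straight line plus random noise), and the
helper names make_data() and rmse() are hypothetical.

```r
set.seed(42)  # fixed seed so the simulated data is reproducible

# Simulated data: the true pattern is linear; rnorm() adds the noise.
make_data <- function(n) {
  x <- runif(n, 0, 10)
  data.frame(x = x, y = 2 * x + rnorm(n, sd = 3))
}
train_set <- make_data(20)
test_set  <- make_data(200)   # unseen data

simple   <- lm(y ~ x, data = train_set)            # matches the true pattern
flexible <- lm(y ~ poly(x, 10), data = train_set)  # flexible enough to fit noise

# Root mean squared error of a model on a given data set.
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

errors <- data.frame(
  model = c("degree 1", "degree 10"),
  train = c(rmse(simple, train_set), rmse(flexible, train_set)),
  test  = c(rmse(simple, test_set),  rmse(flexible, test_set))
)
print(errors)
```

The degree-10 model cannot have a higher training error than the straight line,
because it contains the straight line as a special case. An overfit model shows
up when its test error is clearly worse than its training error.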
What is an Operator?
Definition: An operator in R is a symbol or function that performs
operations on variables or values. Operators are used to perform
calculations, manipulate data, and perform logical comparisons.
+ : Addition
- : Subtraction
* : Multiplication
/ : Division
%% : Modulus (remainder)
^ : Exponentiation
2. Relational Operators:
Used for comparisons between values.
== : Equal to
!= : Not equal to
3. Logical Operators:
Used for logical operations (AND, OR, NOT).
& : AND
| : OR
4. Assignment Operators:
Used to assign values to variables.
5. Miscellaneous Operators:
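The operators listed above can be tried out in an R session as follows. This is
a minimal sketch: the variable names a, b, and k are arbitrary, and the
assignment and miscellaneous operators shown (<-, =, ->, : and %in%) are common
examples of those groups that these notes do not list explicitly.

```r
# Assignment operators
a <- 17   # leftward assignment, the usual style in R
b = 5     # = also assigns at the top level
3 -> k    # rightward assignment

# Arithmetic operators
print(a + b)    # addition: 22
print(a - b)    # subtraction: 12
print(a * b)    # multiplication: 85
print(a / b)    # division: 3.4
print(a %% b)   # modulus, remainder of 17 / 5: 2
print(a ^ 2)    # exponentiation: 289

# Relational operators return TRUE or FALSE
print(a == b)   # FALSE
print(a != b)   # TRUE

# Logical operators combine TRUE/FALSE values
print((a > 10) & (b > 10))   # AND: FALSE
print((a > 10) | (b > 10))   # OR: TRUE

# Miscellaneous operators (assumed examples)
print(1:5)           # : creates the sequence 1 2 3 4 5
print(b %in% 1:5)    # %in% tests membership: TRUE
```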
1. Numeric:
Represents real numbers (e.g., 10.5, 3.14).
2. Integer:
Whole numbers (e.g., 5L, 100L). The L suffix marks an integer literal; a
plain 5 is stored as numeric.
3. Character:
Represents text strings (e.g., "hello").
4. Logical:
Represents Boolean values: TRUE or FALSE.
5. Complex:
Represents complex numbers with real and imaginary parts (e.g., 2 + 3i).
6. Raw:
Represents raw bytes (used for binary data).
7. Factor:
Represents categorical data with a fixed set of levels.
8. List:
A collection of elements, which can be of different data types (e.g.,
numbers, strings, vectors).
9. Data Frame:
A two-dimensional table-like structure where columns can have different
data types. It’s used to store datasets in R.
10. Matrix:
A two-dimensional array where all elements must be of the same data type.
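One way to see these data types in practice is to create a value of each and
ask R for its class. The values and variable names below are hypothetical
examples chosen for illustration.

```r
num  <- 10.5                                  # numeric
int  <- 5L                                    # integer (note the L suffix)
chr  <- "hello"                               # character
lgl  <- TRUE                                  # logical
cplx <- 2 + 3i                                # complex
raw_bytes <- charToRaw("R")                   # raw bytes of the string "R"
fct  <- factor(c("low", "high", "low"))       # factor with levels "high", "low"
lst  <- list(1, "a", c(TRUE, FALSE))          # list mixing several types
df   <- data.frame(id = 1:3,
                   name = c("a", "b", "c"))   # columns of different types
mat  <- matrix(1:6, nrow = 2)                 # 2 x 3 matrix, one type throughout

# class() reports the type; [1] keeps the first entry when there are several.
values <- list(num, int, chr, lgl, cplx, raw_bytes, fct, lst, df, mat)
types  <- sapply(values, function(v) class(v)[1])
print(types)

# A whole number written without the L suffix is stored as numeric.
print(class(5))
```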
Run the installer and follow the on-screen instructions to complete the
installation.
Run the installer and follow the instructions to complete the installation.
3. Verify Installation:
R comes with many built-in functions, but you often need additional
packages for more advanced tasks.
install.packages("package_name")
install.packages("ggplot2")
After installation, you need to load the package before using it in your R
session.
library(ggplot2)
print("Hello, R!")