Unit-3
Components of learning - Types of machine learning - Data Objects and Attribute Types - Basic Statistical Descriptions of Data - Data Preprocessing: Cleaning - Integration - Feature selection - Feature extraction by Principal Component Analysis - Data Transformation by Normalization - Discretization: Binning and Histogram analysis.
Machine Learning Basics and Preprocessing
What is machine learning?
• Machine learning, as the name suggests, gives machines the ability to learn autonomously from experience and observation, by analysing patterns within a given data set, without being explicitly programmed.
Example of machine learning
• Facebook: Facebook's facial recognition algorithm prompts you to tag people whenever you upload a photo.
• Alexa, Cortana, and other voice assistants: Voice assistants use machine learning to identify and service the user's request.
• Tesla automobiles: Tesla's Autopilot feature is one more example.
Every machine learning algorithm has three components
• Representation: How knowledge is represented. Examples include decision trees, sets of rules, instances, graphical models, neural networks, support vector machines, model ensembles and others.
• Evaluation: How candidate programs (hypotheses) are evaluated. Examples include accuracy, precision and recall, squared error, likelihood, posterior probability, cost, margin, entropy, K-L divergence and others.
• Optimization: How candidate programs are generated, also known as the search process. Examples include combinatorial optimization, convex optimization, and constrained optimization.
Types of machine learning
Data Objects
• A data object represents an entity. A data object is described by a group of attributes of that entity.
• For example, in a sales database, the data objects may represent customers, sales, or purchases. When data objects are stored in a database, they are called data tuples.
Attribute
• For a customer object, the attributes can be customer ID, address, etc.
• Types of attributes:
• Qualitative (Nominal (N), Ordinal (O), Binary (B)).
• Quantitative (Numeric: Discrete, Continuous)
1. Quantitative Data Type
• This data type consists of numerical values: anything that is measured by numbers.
• E.g., profit, quantity sold, height, weight, temperature, etc.
A.) Discrete Data Type
• Numeric data that take discrete values or whole numbers. A value of this type has no proper meaning if expressed as a decimal. Its values can be counted.
• E.g., number of cars you have, number of marbles in a container, students in a class, etc.
B.) Continuous Data Type
• Numerical measures that can take any value within a certain range. A value of this type remains meaningful when expressed as a decimal. Its values cannot be counted, only measured, and the number of possible values is infinite.
• E.g., height, weight, time, area, distance, measurement of rainfall, etc.
2. Qualitative Data Type
• These are data types that cannot be expressed in numbers.
• They describe categories or groups and are hence known as categorical data types.
A. Structured Data
• This type of data is either numbers or words. It can take numerical values, but mathematical operations cannot be performed on it. This type of data is expressed in tabular format.
• E.g., Sunny=1, Cloudy=2, Windy=3, or binary data such as 0 or 1, good or bad, etc.
B. Unstructured Data
• This type of data does not have a proper format and is therefore known as unstructured data. It comprises textual data, sounds, images, videos, etc.
• Besides these, there are also other types, referred to as data type preliminaries or data measures:
• Nominal
• Ordinal
• Interval
• Ratio
I. Nominal Data Type
• This is used to express names or labels that are not ordered or measurable.
• E.g., male or female (gender), race, country, etc.
II. Ordinal Data Type
• This is also a categorical data type like nominal data, but it has some natural ordering associated with it.
• E.g., Likert rating scale, shirt sizes, ranks, grades, etc.
III. Interval Data Type
• This is numeric data that has a proper order and in which the differences between values are meaningful. Zero does not mean a complete absence of the quantity; it is just another value on the scale, so there is no absolute zero. This is the local scale.
• E.g., temperature measured in degrees Celsius, time, SAT score, credit score, pH, etc.
IV. Ratio Data Type
• This quantitative data type is the same as the interval data type but has an absolute zero. Here zero means complete absence, and the scale starts from zero. This is the global scale.
• E.g., temperature in kelvin, height, weight, etc.
Basic Statistical Descriptions of Data
• Data scientist has been described as the most desirable profession of the 21st century. Machine learning and statistics are the two core skills required to become a data scientist.
• Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to identify patterns and trends. It is also referred to as quantitative analysis.
1. Descriptive Statistics
• Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the main characteristics of that data.
• Mean: The central value, commonly known as the arithmetic average.
• Mode: The value that appears most often in a data set.
• Median: The middle value of the ordered set, which divides it into two equal halves.
2. Variability
• Standard Deviation: A statistic that measures the dispersion of a data set relative to its mean; it is the square root of the variance.
• Range: The difference between the largest and smallest values of a dataset.
• Percentile: A measure that indicates the value below which a given percentage of observations in the dataset falls.
• Quartile: The values that divide the ordered data points into four equal parts.
• Variance: A statistical measure of spread, computed as the average of the squared deviations from the mean.
3. Correlation
• It is one of the major statistical techniques that measure the relationship between two variables.
• A correlation coefficient greater than zero indicates a positive relationship.
• A correlation coefficient less than zero indicates a negative relationship.
• A correlation coefficient of zero indicates that there is no linear relationship between the two variables.
4. Probability Distribution
• It specifies the likelihood of all possible events. In simple terms, an event is the result of an experiment, such as tossing a coin. Events are of two types: dependent and independent.
• Independent event (tossing a coin)
• Dependent event (drawing a queen from a deck of cards)
5. Regression
• It is a method used to determine the relationship between one or more independent variables and a dependent variable.
• Linear regression: It is used to fit a regression model that explains the relationship between a numeric response variable and one or more predictor variables.
• Logistic regression: It is used to fit a regression model that explains the relationship between a binary response variable and one or more predictor variables.
6. Normal Distribution
7. Bias
• In statistical terms, bias means that a model or sample is not representative of the complete population. It needs to be minimized to get the desired outcome.
• The three most common types of bias are:
• Selection bias: It occurs when the group of data chosen for statistical analysis is not selected at random, so the data is unrepresentative of the whole population.
• Confirmation bias: It occurs when the person performing the statistical analysis has some predefined assumption.
• Time interval bias: It is caused intentionally by specifying a certain time range to favour a particular outcome.
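Worked example: the descriptive statistics, variability measures and correlation coefficient from the sections above can be sketched in Python as follows. This is a minimal illustration, not part of the original slides. The `sales` and `ad_spend` values are made-up data, the `pearson` helper is a hypothetical name introduced here, and the population forms of variance and standard deviation are assumed.

```python
import math
import statistics as st

# Made-up sample data, for illustration only.
sales = [12, 15, 15, 18, 20, 22, 30]            # e.g. units sold per day
ad_spend = [1.0, 1.5, 1.4, 2.0, 2.2, 2.5, 3.5]  # e.g. spend on the same days

# 1. Descriptive statistics
mean = st.mean(sales)      # arithmetic average
median = st.median(sales)  # middle value of the ordered data
mode = st.mode(sales)      # most frequent value

# 2. Variability
data_range = max(sales) - min(sales)  # largest minus smallest value
variance = st.pvariance(sales)        # mean squared deviation (population form)
std_dev = st.pstdev(sales)            # square root of the variance
quartiles = st.quantiles(sales, n=4)  # three cut points: Q1, Q2, Q3


# 3. Correlation (Pearson coefficient, written out by hand)
def pearson(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    mx, my = st.mean(x), st.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


r = pearson(sales, ad_spend)

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "variance:", variance, "std dev:", std_dev)
print("quartiles:", quartiles)
print("correlation:", r)
```

A value of `r` above zero indicates a positive relationship and a value below zero a negative one. `st.quantiles` uses one particular interpolation convention, so other tools may report slightly different quartiles for the same data.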