
FHA UNIT 1 INTRODUCTION

This document provides an introduction to data and its significance in the information age, detailing types of data, sources, and the journey of data science. It also covers key concepts in statistics, including variables, populations, samples, and types of statistics such as descriptive and inferential statistics. Additionally, it highlights the applications of computational science and the tools and skills necessary for data analysis and machine learning.

Uploaded by

ancy rodhan

UNIT 1 INTRODUCTION

Introduction:
We are frequently reminded of the fact that we are living in the information age.
Appropriately, then, this book is about information—how it is obtained, how it is analyzed, and
how it is interpreted. The information about which we are concerned we call data, and the data
are available to us in the form of numbers.

Basic Concepts:
What is data?
Data is a collection of facts, such as numbers, words, measurements, and observations.
Types of Data:
1. Structured Data: highly organized (for example, spreadsheets and databases)
2. Unstructured Data: no regular structure (emails, social media posts, online blogs,
newspapers, books, and scientific publications)
3. Big Data: structured (databases), semi-structured (XML, HTML), and unstructured (photo,
video)
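
The three levels of structure listed above can be illustrated with a minimal Python sketch. The patient records, column names, and XML tags below are hypothetical and serve only as an illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and named columns, as in a spreadsheet or database table.
# (Hypothetical records, used only for illustration.)
csv_text = "patient_id,weight_lb\n1,150\n2,132\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
weights = [int(row["weight_lb"]) for row in rows]

# Semi-structured: tagged fields, but no fixed table layout (XML here).
xml_text = (
    "<patients><patient id='1'>"
    "<weight unit='lb'>150</weight>"
    "</patient></patients>"
)
root = ET.fromstring(xml_text)
xml_weight = int(root.find("./patient/weight").text)

# Unstructured: free text; any structure has to be inferred by the reader or by code.
note = "Patient weighed about 150 pounds at today's visit."
word_count = len(note.split())

print(weights, xml_weight, word_count)
```

The structured and semi-structured values can be read by field name, while the free-text note only yields the weight after further text processing.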
What are the sources of Data?
1.Routinely kept records
2.Surveys.
3.Experiments.
4.External sources (already existing data)

What are the differences between data, information, and knowledge?


Big Data goes beyond the simple concept of the data type or volume used. It also integrates
(i) analytical techniques (e.g., machine learning),
(ii) technologies that make it possible (e.g., parallel and cloud computing),
(iii) modern visualization solutions (e.g., interactive graphing and infographics), and
(iv) Big Data processing by applying a specialized combination of scientific approaches (e.g.,
statistics, mathematics, informatics, and background knowledge in a specific area).

Storage - Data at Rest


Transfer - Data in Transit
Data in transit is digital data which is exchanged between the computing machines at the exact
moment of the transfer.
Protocols and mechanisms associated with data in transit include Secure File Transfer Protocol
(SFTP), Hypertext Transfer Protocol Secure (HTTPS), Off-the-Record Messaging (OTR),
peer-to-peer (P2P) communication, and Secure Sockets Layer (SSL) for data encryption.
Process - Data in Use
Data in use is digital data being actively processed by computer applications at the exact
moment. That data is temporarily stored in random-access memory (RAM), processor registers, or
hardware cache.
Journey of Data Science:
• 1962: John Tukey writes The Future of Data Analysis, where he envisions a new field for learning
insights from data
• 1977: Tukey publishes the book Exploratory Data Analysis, which is a key part of data science
today
• 1991: Guido van Rossum publishes the Python programming language online for the first time,
which goes on to become the top data science language used at the time of writing
• 1993: The R programming language is publicly released, which goes on to become the second
most-used data science general-purpose language
• 2008: Jeff Hammerbacher and DJ Patil use the term "data scientist" in job postings after trying to
come up with a good job title for their work
• 2010: Kaggle.com launches as an online data science community and data science competition
website
• 2010s: Universities begin offering master's and bachelor's degrees in data science; data science job
postings explode to new heights year after year; big breakthroughs are made in deep learning; the
number of data science software libraries and publications burgeons.
• 2015: TensorFlow (a deep learning and machine learning library) is released.
• 2018: Google releases Cloud AutoML, democratizing a new automatic technique for machine
learning and data science.
• 2020: Amazon SageMaker Studio is released, which is a cloud tool for building, training,
deploying, and analyzing machine learning models

Data Science Competitions:


• Kaggle ($10K)
• Analytics Vidhya
• HackerRank
• DrivenData (focused on social justice)
• AIcrowd
A couple of websites that list data science competitions are:
ods.ai
www.mlcontests.com
The top data science tools and skills:
• Programming Languages: Python, R.
• Libraries: Pandas, NumPy, Scikit-learn.
• Databases: SQL, NoSQL.
• Big Data Tools: Hadoop, Spark
• GUI tools: Excel, GraphPad Prism
• Cloud tools:
    • Amazon Web Services (AWS) (general purpose)
    • Google Cloud Platform (GCP) (general purpose)
    • Microsoft Azure (general purpose)
    • IBM (general purpose)
    • Databricks (data science and AI platform)
    • Snowflake (data warehousing)
Statistical methods and math:
• Exploratory analysis statistics (exploratory data analysis, or EDA), like statistical plotting
and aggregate calculations such as quantiles
• Statistical tests and their principles, like p-values, chi-squared tests, t-tests, and ANOVA
• Machine learning modeling, including regression, classification, and clustering methods
• Probability and statistical distributions, like Gaussian and Poisson distributions
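
A few of the quantities named in this list (quantiles, the Gaussian distribution, and the Poisson distribution) can be computed with the Python standard library alone. The sketch below uses made-up values, and the helper `poisson_pmf` is a hypothetical name introduced only for this illustration.

```python
import math
from statistics import NormalDist, quantiles

# Aggregate calculation: quartiles of a small, made-up dataset.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
quartiles = quantiles(values, n=4)  # three cut points; the middle one is the median

# Gaussian (normal) distribution: probability of falling within one
# standard deviation of the mean.
standard_normal = NormalDist(mu=0, sigma=1)
prob_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Poisson distribution: probability of exactly 2 events when 3 are expected on average.
prob_two_events = poisson_pmf(2, lam=3.0)

print(quartiles, round(prob_within_one_sd, 4), round(prob_two_events, 4))
```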
Software development:
Python, Git & GitHub, Docker and Kubernetes
Communication and presentation:
Using Jupyter Notebook to create a presentation allows one to actively demo Python or other
code during the presentation, unlike classic presentation software
Scope of the field:

Applications of Computational Science:

Data analysis: used to analyze large datasets to extract meaningful insights and patterns
Modeling and simulation: used to develop and analyze mathematical models of complex systems
Machine learning: used to develop and apply algorithms that automatically learn from data and
make predictions
Optimization: used to find the best or the most efficient/robust solution to a problem
Visualization: used to create visual representations of data and models

Data:
The raw material of statistics is data. For our purposes we may define data as numbers.
The two kinds of numbers that we use in statistics are numbers that result from the taking—in
the usual sense of the term—of a measurement, and those that result from the process of
counting. For example, when a nurse weighs a patient or takes a patient’s temperature, a
measurement, consisting of a number such as 150 pounds or 100 degrees Fahrenheit, is
obtained. Quite a different type of number is obtained when a hospital administrator counts the
number of patients—perhaps 20—discharged from the hospital on a given day. Each of the
three numbers is a datum, and the three taken together are data.
Statistics
Statistics is a field of study concerned with the collection, organization, summarization,
and analysis of data; and the drawing of inferences about a body of data when only a part of
the data is observed. The person who performs these statistical activities must be
prepared to interpret and to communicate the results to someone else as the situation demands.
Simply put, we may say that data are numbers, numbers contain information, and the purpose
of statistics is to investigate and evaluate the nature and meaning of this information.
Biostatistics
The tools of statistics are employed in many fields—business, education, psychology,
agriculture, and economics, to mention only a few. When the data analyzed are derived from
the biological sciences and medicine, we use the term biostatistics to distinguish this particular
application of statistical tools and concepts.
Variable - If, as we observe a characteristic, we find that it takes on different values in different
persons, places, or things, we label the characteristic a variable. Some examples of variables
include diastolic blood pressure, heart rate, the heights of adult males, the weights of preschool
children, and the ages of patients seen in a clinic.
1. Quantitative Variables - A quantitative variable is one that can be measured in the usual
sense. We can, for example, obtain measurements on the heights of adult males, the weights of
preschool children, and the ages of patients seen in a dental clinic. These are examples of
quantitative variables.
2. Qualitative Variables - Some characteristics are not capable of being measured in the sense
that height, weight, and age are measured. Many characteristics can be categorized only, as, for
example, when an ill person is given a medical diagnosis, a person is designated as belonging
to an ethnic group, or a person, place, or object is said to possess or not to possess some
characteristic of interest. In such cases measuring consists of categorizing. We refer to variables
of this kind as qualitative variables.
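
The distinction matters in practice because the two kinds of variables are summarized differently. The following minimal sketch, with hypothetical heights and diagnoses, averages a quantitative variable and counts the categories of a qualitative one.

```python
from collections import Counter
from statistics import mean

# Quantitative variable: measured values (hypothetical heights in centimeters).
heights_cm = [172, 168, 180, 176]

# Qualitative variable: categories only (hypothetical diagnoses).
diagnoses = ["asthma", "diabetes", "asthma", "hypertension"]

# Measurements can be averaged; categories can only be counted.
mean_height = mean(heights_cm)
diagnosis_counts = Counter(diagnoses)

print(mean_height)
print(diagnosis_counts.most_common(1))
```

Averaging the diagnoses would be meaningless, which is why measuring a qualitative variable consists of categorizing.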

3. Random Variable - Whenever we determine the height, weight, or age of an individual, the
result is frequently referred to as a value of the respective variable. When the values obtained
arise as a result of chance factors, so that they cannot be exactly predicted in advance, the
variable is called a random variable. An example of a random variable is adult height. When a
child is born, we cannot predict exactly his or her height at maturity. Attained adult height is
the result of numerous genetic and environmental factors. Values resulting from measurement
procedures are often referred to as observations or measurements.
4. Discrete Random Variable - Variables may be characterized further as to whether they are
discrete or continuous. A discrete variable is characterized by gaps or interruptions in the values
that it can assume. The number of daily admissions to a general hospital is a discrete random
variable since the number of admissions each day must be represented by a whole number,
such as 0, 1, 2, or 3. The number of admissions on a given day cannot be a number such as 1.5,
2.997, or 3.333.
5. Continuous Random Variable - A continuous random variable does not possess the gaps or
interruptions characteristic of a discrete random variable. A continuous random variable can
assume any value within a specified relevant interval of values assumed by the variable.
Examples of continuous variables include the various measurements that can be made on
individuals such as height, weight, and skull circumference. No matter how close together the
observed heights of two people, for example, we can, theoretically, find another person whose
height falls somewhere in between.
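
The contrast between gaps and no gaps can be sketched in a few lines of Python. The admission counts and heights are made-up values, and `midpoint` is a hypothetical helper; computer floating-point numbers only approximate a true continuum, so this is an illustration rather than a proof.

```python
# Discrete: daily admissions are whole numbers, so neighboring values are a full unit apart.
admissions = [0, 1, 2, 3]
all_whole_numbers = all(isinstance(count, int) for count in admissions)
gaps = [b - a for a, b in zip(admissions, admissions[1:])]

def midpoint(a, b):
    """Return the value halfway between a and b."""
    return (a + b) / 2

# Continuous: between two observed heights there is always another possible height.
height_1, height_2 = 170.0, 170.1  # hypothetical heights in centimeters
height_between = midpoint(height_1, height_2)

print(all_whole_numbers, gaps, height_between)
```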
6. Population - A population or collection of entities may consist of animals,
machines, places, or cells. For our purposes, we define a population of entities as the largest
collection of entities for which we have an interest at a particular time, and a population of
values as the largest collection of values of a random variable for which we have an interest
at a particular time. If, for example, we are interested in
the weights of all the children enrolled in a certain county elementary school system, our
population consists of all these weights. If our interest lies only in the weights of first-grade
students in the system, we have a different population—weights of first-grade students enrolled
in the school system. Hence, populations are determined or defined by our sphere of interest.
Populations may be finite or infinite. If a population of values consists of a fixed number of
these values, the population is said to be finite. If, on the other hand, a population consists of
an endless succession of values, the population is an infinite one.
7. Sample - A sample may be defined simply as a part of a population. Suppose our population
consists of the weights of all the elementary school children enrolled in a certain county school
system. If we collect for analysis the weights of only a fraction of these children, we have only
a part of our population of weights, that is, we have a sample.
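
The relationship between a population and a sample can be sketched as follows. The weights are simulated rather than real, and the population size (500), sample size (50), and random seed are arbitrary assumptions made for the illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so that the sketch is repeatable

# Population: the weights (in pounds) of all 500 children of interest (simulated values).
population = [round(rng.gauss(60, 10), 1) for _ in range(500)]

# Sample: the weights of only a fraction of these children, drawn at random.
sample = rng.sample(population, 50)

population_mean = mean(population)
sample_mean = mean(sample)

print(len(population), len(sample))
print(round(population_mean, 1), round(sample_mean, 1))
```

The sample mean will usually be close to, but not exactly equal to, the population mean. Quantifying that difference is the subject of inferential statistics.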

Types Of Statistics:
Consider an example of a book dealing with a lot of information. The objectives of this
book are twofold: (1) to teach the student to organize and summarize data, and (2) to teach the
student how to reach decisions about a large body of data by examining only a small part of it.
The concepts and methods necessary for achieving the first objective are presented under the
heading of descriptive statistics, and the second objective is reached through the study of what
is called inferential statistics.

Definition of Descriptive statistics:


Descriptive statistics summarize and organize data to make it easier to understand. Instead of
presenting raw data, descriptive statistics help describe, show, or summarize data meaningfully.
Types of Descriptive Statistics:
Measures of Central Tendency: Describe the central point of a dataset (Mean,
Median, and Mode).
Measures of Dispersion (Spread): Describe the variability or spread of data points
(Range, Variance, and Standard Deviation).
Measures of Shape: Describe the shape of the distribution of data (Skewness and
Kurtosis).
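
The three families of measures can be computed for a small, made-up dataset as sketched below. The sketch assumes the population (divide-by-n) formulas for variance, skewness, and kurtosis; sample versions use slightly different denominators. The kurtosis shown is the plain (non-excess) form, for which a normal distribution has the value 3.

```python
from statistics import mean, median, mode, pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations
n = len(data)

# Measures of central tendency.
center = {"mean": mean(data), "median": median(data), "mode": mode(data)}

# Measures of dispersion (population formulas, dividing by n).
spread = {
    "range": max(data) - min(data),
    "variance": pvariance(data),
    "std_dev": pstdev(data),
}

# Measures of shape, from the central moments m2, m3, and m4.
mu = center["mean"]
m2 = sum((x - mu) ** 2 for x in data) / n
m3 = sum((x - mu) ** 3 for x in data) / n
m4 = sum((x - mu) ** 4 for x in data) / n
shape = {"skewness": m3 / m2**1.5, "kurtosis": m4 / m2**2}

print(center)
print(spread)
print(shape)
```

For this dataset the positive skewness reflects the longer tail toward the larger values (7 and 9).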
