Chapter 1.1 Introduction to Data
Data are collected and analyzed; data become information suitable for making
decisions only once they have been analyzed in some fashion.
Data are used in scientific research, business management (e.g., sales data, revenue,
profits, stock price), finance, governance (e.g., crime rates, unemployment
rates, literacy rates), and in virtually every other form of human organizational
activity (e.g., censuses of the number of homeless people by non-profit
organizations).
Data are measured, collected, reported, and analyzed, and used to create
data visualizations such as graphs, tables or images. Data as a general concept refers
to the fact that some existing information or knowledge is represented or coded in
some form suitable for better usage or processing.
Types of Data
1. Quantitative data
Quantitative data are the easiest to explain. They answer key questions such as “how
many”, “how much” and “how often”.
Quantitative data can be expressed as a number or can be quantified. Simply put, they can be
measured by numerical variables.
Quantitative data are easily amenable to statistical manipulation and can be represented by a
wide variety of statistical graphs and charts, such as line graphs, bar graphs, and scatter
plots.
Examples of quantitative data:
Scores on tests and exams, e.g., 85, 67, 90.
The weight of a person or a subject.
Your shoe size.
The temperature in a room.
There are two general types of quantitative data: discrete data and continuous data. Both are
explained later in this article.
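As a small illustration of the statistical manipulation mentioned above, the following sketch summarizes the three test scores from the examples using Python's standard `statistics` module. The variable names are chosen here for illustration only.

```python
from statistics import mean, median

# Test scores taken from the examples listed above.
scores = [85, 67, 90]

average = mean(scores)   # arithmetic mean of the scores
middle = median(scores)  # middle value once the scores are sorted

print(f"mean = {average:.2f}, median = {middle}")  # mean = 80.67, median = 85
```

Because the values are numbers, operations such as averaging are meaningful, which is what separates quantitative data from the qualitative types described next.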
2. Qualitative data
Qualitative data can’t be expressed as a number and can’t be measured. Qualitative data
consist of words, pictures, and symbols, not numbers.
Qualitative data is also called categorical data because the information can be sorted by
category, not by number.
Qualitative data can answer questions such as “how this has happened” and “why this has
happened”.
3. Nominal data
Nominal data is used just for labeling variables, without any type of quantitative value. The
name ‘nominal’ comes from the Latin word “nomen” which means ‘name’.
Nominal data simply name a thing without placing it in any order. In effect, nominal data
could just be called “labels.”
Eye color is a nominal variable having a few categories (Blue, Green, Brown) and there is no
way to order these categories from highest to lowest.
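A minimal sketch of what can be done with nominal data: the labels can be counted, and the most frequent label can be reported, but there is no meaningful average or ranking. The sample of eye colors below is hypothetical and exists only for illustration.

```python
from collections import Counter

# Hypothetical sample of eye colors; the labels carry no numeric value.
eye_colors = ["Brown", "Blue", "Brown", "Green", "Brown", "Blue"]

# Counting how often each label occurs is a valid operation on nominal data.
counts = Counter(eye_colors)
most_common_color, frequency = counts.most_common(1)[0]

print(most_common_color, frequency)  # Brown 3
```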
4. Ordinal data
Ordinal data shows where a value falls in an order. This is the crucial difference from nominal
types of data.
Ordinal data is data that is placed into some kind of order by its position on a scale.
Ordinal data may indicate superiority.
However, you cannot do arithmetic with ordinal numbers because they only show
sequence.
Ordinal variables are considered to be “in between” qualitative and quantitative variables.
In other words, ordinal data is qualitative data for which the values are ordered.
Nominal data, by comparison, is qualitative data for which the values
cannot be placed in an order.
We can also assign numbers to ordinal data to show their relative position, but we cannot do
math with those numbers. For example: “first, second, third, etc.”
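The following sketch shows one way to work with ordinal labels in Python: the labels can be compared and sorted by their position on the scale, but they are never added or averaged. The names `RANK_ORDER` and `rank_position`, and the sample results, are assumptions made for this illustration.

```python
# Ranks listed from best to worst; the list position encodes the order.
RANK_ORDER = ["first", "second", "third"]


def rank_position(label):
    """Return the position of a rank label on the ordinal scale."""
    return RANK_ORDER.index(label)


# Hypothetical results to be put in order.
results = ["third", "first", "second", "first"]

# Sorting and comparing are valid because the labels have an order.
ordered = sorted(results, key=rank_position)
first_beats_third = rank_position("first") < rank_position("third")

print(ordered)            # ['first', 'first', 'second', 'third']
print(first_beats_third)  # True
```

The positions are used only for ordering. Adding "first" and "second" would have no meaning, which is why no arithmetic appears in the sketch.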
5. Discrete data
Discrete data is a count that involves only integers. The discrete values cannot be subdivided
into parts.
For example, the number of children in a class is discrete data. You can count whole
individuals. You can’t count 1.5 kids.
To put it in other words, discrete data can take only certain values. The data variables cannot be
divided into smaller parts.
6. Continuous data
Continuous data, by contrast, can be divided into ever smaller parts.
For example, you can measure your height at very precise scales: meters, centimeters,
millimeters, etc.
You can record continuous data for many different kinds of measurement: width, temperature,
time, etc. This is where the key difference from discrete types of data lies.
Continuous variables can take any value between two numbers. For example, between 50
and 72 inches, there are literally millions of possible heights: 52.04762 inches, 69.948376
inches, etc.
A good rule for deciding whether data is continuous or discrete is that if the point of
measurement can be reduced in half and still make sense, the data is continuous.
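The halving rule can be sketched in a few lines of Python. This is only an illustration: it reads the rule as halving one unit of the quantity, and the helper `is_valid_count` is a hypothetical name introduced here.

```python
def is_valid_count(value):
    """Return True if value is a non-negative whole number, i.e. a usable count."""
    return value >= 0 and float(value).is_integer()


one_child = 1    # discrete: children are counted
one_inch = 1.0   # continuous: height is measured

half_child = one_child / 2  # 0.5 children is not a count anyone can observe
half_inch = one_inch / 2    # 0.5 inches is still a perfectly good length

print(is_valid_count(one_child))   # True
print(is_valid_count(half_child))  # False
print(half_inch)                   # 0.5
```

Halving the count produces a value that is no longer a valid count, so the data is discrete. Halving the length produces another valid length, so the data is continuous.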
The term data source is most commonly used in the context of databases and database management
systems, or any system that primarily deals with data. There it is referred to by a data source name
(DSN), which is defined in the application so that it can find the location of the data. It
simply means what the words say: where the data is coming from.
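The idea of a name that an application resolves to a data location can be sketched as follows. This is a simplified stand-in, not an actual ODBC DSN configuration: the registry `DATA_SOURCES`, the name `sales_demo`, the `items` table, and the in-memory SQLite database are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical registry: the application maps a short name to a data location.
# ":memory:" asks SQLite for a temporary in-memory database.
DATA_SOURCES = {"sales_demo": ":memory:"}


def connect(data_source_name):
    """Look up where the named data lives and open a connection to it."""
    location = DATA_SOURCES[data_source_name]
    return sqlite3.connect(location)


conn = connect("sales_demo")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, color TEXT)")
conn.execute("INSERT INTO items (color) VALUES (?)", ("Gray",))
rows = conn.execute("SELECT id, color FROM items").fetchall()
conn.close()

print(rows)  # [(1, 'Gray')]
```

The calling code refers to the data only by name. Where the data actually lives is decided in one place, which is the role a data source name plays in an application.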
A data source is
(1) the physical or digital location where data under question is stored as a data table
(or other format),
(2) the degree of originality of a data table,
(3) a brand name data provider,
(4) the data used via a self-service data tool such as Excel, Tableau, or Power BI,
(5) the computer storage type, i.e., File Data Source or Machine Data Source,
(6) a technical database such as Amazon AWS or Microsoft Azure,
(7) a legacy data source with a proper name within an organization,
(8) a data type such as stock, accounting, or economic indicator.
1. Data Table Level
The basic interaction with data sources is found at the data table level. A data table is nothing
more than columns and rows. Each row holds an ID and entries under each column that
describe the row, whereas each column holds the entries for every ID on the specific
attribute that the column describes. In my article on data sets, I explain this with the following example
table:
Item Color Weight
Gray 1 2 tons
2. Conceptual Level
3. Research Level
When we’re looking for data from an external provider such as Google Finance or Data.gov,
“data source” refers to the brands themselves. This is the research level because it occurs
when we’re looking for external data to use in an internal assessment, i.e., research. In my
article on data sets, I outlined the following data sources that can be used in research:
1. Kaggle. Kaggle has a good variety of data sets on machine learning. It requires
registration but is worth it.
2. FiveThirtyEight. FiveThirtyEight is a news and sports site with data sets that
are available on GitHub.
3. BuzzFeed. BuzzFeed is a news and entertainment site that publishes data used
in its articles on GitHub.
4. Reddit. Reddit hosts data sets shared by contributors.
4. Self-Service Application Level
When we’re working with self-service data applications such as Tableau and Power BI, the
data source is tabular data available via our connection. We can connect to different servers,
tables, and joins, but that is the extent of it.
At the self-service application level, data source can mean data from any brand, and data
that’s original or aggregate, as long as it’s available for connection.
5. Computer Level
When we’re talking about computers and the actual location of data storage, the topic is
slightly different. The computer level does not concern the tabular data used by analysts, but
rather how a computer stores information.
6. Database Level
Perhaps the most common place for data sources is databases. A database is defined not only
by the data it holds but also by the brand of the tool used to create it. Common examples
include Microsoft Azure, Amazon AWS, Dynamics 365, and SAP. Each of these tools works
as a data warehouse or as an enterprise resource planning (ERP) tool.
You may hear the question “what is the database data source?” At the database level, the correct answer is the
brand name of the software that hosts the data and the data itself.
7. Legacy Level
Legacy data sources are databases whose technical structure is built within a company that
does not specialize in database creation.
Many digital companies have built internal data warehouses to handle transactional data.
Today, databases are most often outsourced (to AWS or Azure for example), but there was a
time when in-house solutions were preferable. As you can imagine, once the data
infrastructure is set, it’s not altogether easy to modify, so these legacy systems still exist in
many places.
You may hear the question “what is the data source?” If the question is at the legacy level,
the correct answer is the name of the legacy system.
8. Data Type Level
Data sources can also be thought of as data types, such as accounting, stock, transactional, or
economic indicators. Usually the data type comes from an external source, and there are few
subcategories to choose from.
For example, NASA Earth Observation Data is concerned with the biosphere, agriculture, and
other Earth-related topics: