
DATA SCIENCE

UNIT – 1
PART – A
1. What is the Data Science process? Explain.
The data science process helps data scientists use tools to find unseen patterns,
extract and prepare data, and convert information into actionable insights that are meaningful
to the company.

2. Differentiate Business Intelligence (BI) and Data Science

3. Compare Data Science and Statistics.


Statistics is a mathematically-based field which seeks to collect and interpret
quantitative data. In contrast, data science is a multidisciplinary field which uses scientific
methods, processes, and systems to extract knowledge from data in a range of forms. Data
scientists use methods from many disciplines, including statistics.
4. Define Data Science
Data Science is a blend of various tools, algorithms, and machine learning principles
with the goal of discovering hidden patterns in raw data.

5. List out the areas in which Data Science can be applied


Data science has become fuel for industries. Various industries use data science, including:
➢ Banking
➢ Finance
➢ Manufacturing
➢ Transport
➢ E-commerce
➢ Education
and many more.
6. Who is a Data Scientist?
There are several definitions of a Data Scientist. In simple words, a Data
Scientist is one who practices the art of Data Science. The term “Data Scientist” was
coined in recognition of the fact that a Data Scientist draws a lot of information from the
scientific fields and their applications, whether statistics or mathematics.

7. Compare Big Data with Data Science.


Big data analysis caters to very large data sets, and is also known as data
mining, whereas data science makes use of machine learning algorithms to design and develop
statistical models that generate knowledge from the pile of big data.

8. State the purpose of reporting and analysis


Reporting:
The process of organizing data into informational summaries in order to monitor how
different areas of a business are performing.
Analysis:
The process of exploring data and reports in order to extract meaningful insights,
which can be used to better understand and improve business performance.

One way to distinguish between reporting and analysis is by identifying the primary
tasks that are being performed.
Activities such as building, configuring, consolidating, organizing, formatting, and
summarizing come under reporting.
Analysis focuses on different tasks such as questioning, examining, interpreting,
comparing, and confirming.

9. List out the advantages of web scraping.


➢ Reliable and Robust performance
➢ Fast and efficient
➢ Low maintenance cost
➢ Unique and rich datasets
➢ Effective data management
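
For illustration, a minimal Python web scraping sketch is shown below. It assumes the
requests and BeautifulSoup (bs4) libraries are installed; the URL is a hypothetical
placeholder and should be replaced with a site whose terms permit scraping.

import requests
from bs4 import BeautifulSoup

# Hypothetical example page; replace with a site whose terms permit scraping.
url = "https://example.com/products"
response = requests.get(url, timeout=10)
response.raise_for_status()                 # stop early on an HTTP error

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every level-2 heading on the page into a small dataset.
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headings)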

10. Can Data Science Predict the Stock Market? Examine


Yes, data science can be, and is, applied to stock market analysis.
Data science can give us a unique perspective on how we understand the
stock market and financial data. Certain basic principles are followed during trading, such as
sell, buy, or hold, and the main goal is to make high profits. Various trading platforms are
becoming increasingly popular. It is important to understand certain key concepts of data
science to be able to determine whether it is worth investing in a particular stock and to
carry out stock market analysis.

11. Discuss about analysis and reporting


Reporting:
The process of organizing data into informational summaries in order to monitor how
different areas of a business are performing.
Analysis:
The process of exploring data and reports in order to extract meaningful insights,
which can be used to better understand and improve business performance.
12. Give Drew Conway’s Venn diagram of Data Science

13. Specify the life cycle of Data Science.

14. Illustrate the use of Data Science with an example.


Data science can identify patterns in seemingly unstructured or unrelated data,
permitting the making of inferences and predictions. Tech companies that collect user
data can use these techniques to turn what is collected into sources of useful or profitable
information.
For example, an algorithm created by researchers at the Massachusetts Institute of
Technology can be used to detect differences between 3D medical images—such as MRI
scans—more than one thousand times faster than a human. Because of this time saved,
doctors can respond to urgent issues revealed in the scans and potentially save patients’ lives.

15. Show the ways in which decision making and predictions are made in Data Science
DECISION-MAKING:
By applying data science to operational procedures, decision makers are able to
implement changes much more efficiently and to monitor much more closely, through trial
and error, whether those changes are successful.

PREDICTIVE ANALYSIS:
Predictive analytics uses historical data to predict future events. Typically, historical
data is used to build a mathematical model that captures important trends. That predictive
model is then used on current data to predict what will happen next, or to suggest actions to
take for optimal outcomes.
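
As a simple, hedged illustration of this idea, the sketch below fits a model on historical
data and uses it to predict the next value. It assumes scikit-learn and NumPy are installed;
the monthly sales figures are made-up numbers for demonstration only.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: month index vs. monthly sales (made-up numbers).
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([100, 110, 118, 131, 140, 152])

# Build a mathematical model that captures the trend in the historical data.
model = LinearRegression()
model.fit(months, sales)

# Use the model to predict what will happen next (month 7).
predicted = model.predict(np.array([[7]]))
print(f"Predicted sales for month 7: {predicted[0]:.1f}")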

16. Differentiate Data Mining and Data Science.


DATA MINING:
➢ Data mining is a technique used in both business and data science.
➢ Its goal is to render data more usable for a specific business purpose.
➢ Data mining deals mostly with structured data; exploring huge amounts of raw,
unprocessed data falls within the bounds of data science.

DATA SCIENCE:
➢ Data science is an actual field of scientific study or discipline.
➢ Data science, in contrast, aims to create data-driven products and outcomes,
usually in a business context.
➢ Mining is part of what a data scientist might do, and it is a skill that is part of the science.

17. Analyze Data Science ethics.


In particular, privacy rights, data validity, and algorithmic fairness in the areas of Big
Data, Artificial Intelligence, and Machine Learning are the most important ethical challenges
in need of more thorough investigation.

18. Analyze the roles of Data Science.


A data scientist's role combines computer science, statistics, and mathematics.
They analyze, process, and model data then interpret the results to create actionable plans for
companies and other organizations.

19. Bootstrap is more thorough in terms of the magnitude of replication. Justify.

20. Develop a general algorithm for Data Science process

PART – B
1. (i) What is Big Data?
Big data is a field that treats ways to analyze, systematically extract information from,
or otherwise deal with data sets that are too large or complex to be dealt with by
traditional data-processing application software.
Basics of Big Data:
➢ A Big Data platform is an IT solution which combines several Big Data tools and utilities
into one packaged solution for managing and analyzing Big Data.
➢ A Big Data platform is a type of IT solution that combines the features and capabilities
of several Big Data applications and utilities within a single solution.
➢ It is an enterprise-class IT platform that enables organizations to develop,
deploy, operate, and manage a Big Data infrastructure/environment.

(ii) Describe the main features of Big Data in detail


Features of a Big Data Platform:
Here are the most important features of any good Big Data analytics platform:
➢ A Big Data platform should be able to accommodate new platforms and tools based on
the business requirement, because business needs can change due to new technologies
or due to changes in business processes.
➢ It should support linear scale-out
➢ It should have the capability for rapid deployment
➢ It should support a variety of data formats
➢ The platform should provide data analysis and reporting tools
➢ It should provide real-time data analysis software
➢ It should have tools for searching through large data sets

Big data is a term for data sets that are so large or complex that traditional data processing
applications are inadequate.

Challenges include
1. Analysis,
2. Capture,
3. Data Curation,
4. Search,
5. Sharing,
6. Storage,
7. Transfer,
8. Visualization,
9. Querying,
10. Updating
11. Information Privacy
➢ The term often refers simply to the use of predictive analytics or certain other
advanced methods to extract value from data, and seldom to a particular size of data
set.
➢ ACCURACY in big data may lead to more confident decision making, and better
decisions can result in greater operational efficiency, cost reduction and reduced risk.
➢ Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable elapsed
time.
➢ Big data "size" is a constantly moving target.
➢ Big data requires a set of techniques and technologies with new forms of integration
to reveal insights from datasets that are diverse, complex, and of a massive scale

List of BigData Platforms:


a) Hadoop
b) Cloudera
c) Amazon Web Services
d) Hortonworks
e) MapR
f) IBM Open Platform
g) Microsoft HDInsight
h) Intel Distribution for Apache Hadoop
i) Datastax Enterprise Analytics
j) Teradata Enterprise Access for Hadoop
k) Pivotal HD

a) Hadoop
➢ Hadoop is an open-source, Java-based programming framework and server software
which is used to store and analyze data with the help of hundreds or even thousands of
commodity servers in a clustered environment.
➢ Hadoop is designed to store and process large datasets extremely fast and in a fault-
tolerant way.
➢ Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of
commodity computers. If any server goes down, it knows how to replicate the data, so
there is no loss of data even on hardware failure.
➢ Hadoop is an Apache-sponsored project, and it consists of many software packages
which run on top of the Apache Hadoop system.
➢ Hadoop provides a set of tools and software that forms the backbone of a Big Data
analytics system; the top commercial Big Data analytics platforms are Hadoop-based.
➢ The Hadoop ecosystem provides the necessary tools and software for handling and
analyzing Big Data.
➢ On top of the Hadoop system, many applications can be developed and plugged in
to provide an ideal solution for Big Data needs.

b) Cloudera
➢ Cloudera is one of the first commercial Hadoop-based Big Data analytics platforms
offering Big Data solutions.
➢ Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera
Data Science & Engineering, and Cloudera Essentials.
➢ All these products are based on Apache Hadoop and provide real-time processing
and analytics of massive data sets.

c) Amazon Web Services


➢ Amazon offers a Hadoop environment in the cloud as part of its Amazon Web Services
package.
➢ The AWS Hadoop solution is a hosted solution which runs on Amazon’s Elastic Compute
Cloud (EC2) and Simple Storage Service (S3).
➢ Enterprises can use Amazon AWS to run their Big Data processing and analytics in
the cloud environment.
➢ Amazon EMR allows companies to set up and easily scale Apache Hadoop, Spark,
HBase, Presto, Hive, and other Big Data frameworks using its cloud hosting
environment.
d) Hortonworks
➢ Hortonworks uses 100% open-source software without any proprietary software.
Hortonworks was the first to integrate support for Apache HCatalog.
➢ Hortonworks is a Big Data company based in California.
➢ The company develops and supports applications for Apache Hadoop.

The Hortonworks Hadoop distribution is 100% open source, and it is enterprise-ready with
the following features:
• Centralized management and configuration of clusters
• Security and data governance as built-in features of the system
• Centralized security administration across the system

e) MapR
➢ MapR is another Big Data platform which uses a Unix-based file system for handling
data.
➢ It does not use HDFS, and the system is easy to learn for anyone familiar with the Unix
system.
➢ This solution integrates Hadoop, Spark, and Apache Drill with a real-time data
processing feature.

f) IBM Open Platform


➢ IBM also offers a Big Data platform which is based on Hadoop ecosystem
software.
➢ IBM is a well-known company in software and data computing.

It uses the latest Hadoop software and provides the following features (IBM Open Platform
features):
➢ Based on 100% open-source software
➢ Native support for rolling Hadoop upgrades
➢ Support for long-running applications within YARN
➢ Support for heterogeneous storage, which includes HDFS for in-memory and SSD in
addition to HDD
➢ Native support for Spark; developers can use Java, Python, and Scala to write
programs
➢ The platform includes Ambari, which is a tool for provisioning, managing, and
monitoring Apache Hadoop clusters
➢ IBM Open Platform includes all the software of the Hadoop ecosystem, e.g. HDFS,
YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig,
Snappy, Solr, Spark, Sqoop, ZooKeeper, OpenJDK, Knox, Slider
➢ Developers can download a trial Docker image or native installer for testing and
learning the system
➢ The platform is well supported by the IBM technology team

g) Microsoft HDInsight
➢ Microsoft HDInsight is also based on the Hadoop distribution, and it is a
commercial Big Data platform from Microsoft.
➢ Microsoft is a software giant which develops the Windows operating system
for desktop and server users.
➢ HDInsight is a Hadoop distribution offering which runs on the Windows and Azure
environments.
➢ It offers customized, optimized, open-source Hadoop-based analytics clusters which
use Spark, Hive, MapReduce, HBase, Storm, Kafka, and R Server running on the
Hadoop system in the Windows/Azure environment.

2. Describe life cycle of Data Science with neat diagram

LIFE CYCLE OF DATA SCIENCE:

Phase 1—Discovery:
Before you begin the project, it is important to understand the various specifications,
requirements, priorities and required budget. You must possess the ability to ask the right
questions. Here, you assess if you have the required resources present in terms of people,
technology, time, and data to support the project. In this phase, you also need to frame the
business problem and formulate initial hypotheses (IH) to test.

Phase 2—Data preparation:


In this phase, you require an analytical sandbox in which you can perform analytics for
the entire duration of the project. You need to explore, pre-process and condition data prior to
modeling. Further, you will perform ETLT (extract, transform, load and transform) to get
data into the sandbox. Let’s have a look at the Statistical Analysis flow below.
You can use R for data cleaning, transformation, and visualization. This will help you to spot
the outliers and establish a relationship between the variables. Once you have cleaned and
prepared the data, it’s time to do exploratory analytics on it.
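
The text above mentions R; as an equivalent, hedged sketch in Python (the language used
elsewhere in these notes), the snippet below conditions a small, made-up dataset with pandas
and spots an outlier using the interquartile-range rule. The column names and values are
illustrative assumptions only.

import pandas as pd

# Hypothetical raw data with a missing value and an obvious outlier.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "monthly_spend": [120.0, 135.0, None, 128.0, 5000.0],
})

# Condition the data: fill the missing value with the column median.
clean = raw.fillna({"monthly_spend": raw["monthly_spend"].median()})

# Spot outliers with the interquartile-range (IQR) rule.
spend = clean["monthly_spend"]
q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
iqr = q3 - q1
print(clean[(spend < q1 - 1.5 * iqr) | (spend > q3 + 1.5 * iqr)])   # flags the 5000.0 row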

Phase 3—Model planning:


Here, you will determine the methods and techniques to draw the relationships
between variables. These relationships will set the base for the algorithms which you will
implement in the next phase. You will apply Exploratory Data Analytics (EDA) using various
statistical formulas and visualization tools. Let’s have a look at various model planning tools.

1. R has a complete set of modeling capabilities and provides a good environment for
building interpretive models.
2. SQL Analysis services can perform in-database analytics using common data mining
functions and basic predictive models.
3. SAS/ACCESS can be used to access data from Hadoop and is used for creating
repeatable and reusable model flow diagrams.
Although many tools are present in the market, R is the most widely used. Now that
you have insights into the nature of your data and have decided on the algorithms to be used,
in the next stage you will apply the algorithms and build up a model.

Phase 4—Model building:


In this phase, you will develop datasets for training and testing purposes. Here you
need to consider whether your existing tools will suffice for running the models or it will
need a more robust environment (like fast and parallel processing). You will analyse various
learning techniques like classification, association, and clustering to build the model.

You can achieve model building through the following tools.

Phase 5—Operationalize:
In this phase, you deliver final reports, briefings, code and technical documents. In
addition, sometimes a pilot project is also implemented in a real-time production
environment. This will provide you a clear picture of the performance and other related
constraints on a small scale before full deployment.

Phase 6—Communicate results:


Now it is important to evaluate if you have been able to achieve your goal that you
had planned in the first phase. So, in the last phase, you identify all the key findings,
communicate to the stakeholders, and determine if the results of the project are a success, or a
failure based on the criteria developed in Phase 1.

3.List the main characteristics of Big Data.

MAIN CHARACTERISTICS OF BIG DATA:


➢ Volume – The name Big Data itself is related to a size which is enormous. Size of
data plays a very crucial role in determining value out of data. Also, whether a
particular data can actually be considered as a Big Data or not, is dependent upon the
volume of data. Hence, 'Volume' is one characteristic which needs to be considered
while dealing with Big Data.
➢ Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous
sources and the nature of data, both structured and unstructured. During earlier days,
spreadsheets and databases were the only sources of data considered by most of the
applications. Nowadays, data in the form of emails, photos, videos, monitoring
devices, PDFs, audio, etc. are also being considered in the analysis applications. This
variety of unstructured data poses certain issues for storage, mining and analyzing
data.
➢ Velocity – The term 'velocity' refers to the speed of generation of data. How fast the
data is generated and processed to meet the demands, determines real potential in the
data. Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites, sensors, Mobile
devices, etc. The flow of data is massive and continuous.
➢ Variability – This refers to the inconsistency which can be shown by the data at times,
thus hampering the process of being able to handle and manage the data effectively.
Benefits of Big Data Processing
Ability to process Big Data brings in multiple benefits, such as
1. Businesses can utilize outside intelligence while taking decisions
- Access to social data from search engines and sites like Facebook and Twitter is
enabling organizations to fine-tune their business strategies.
2. Improved customer service
- Traditional customer feedback systems are getting replaced by new systems
designed with Big Data technologies. In these new systems, Big Data and
natural language processing technologies are being used to read and evaluate
consumer responses.
3. Early identification of risk to the product/services, if any
4. Better operational efficiency
- Big Data technologies can be used for creating a staging area or landing zone
for new data before identifying what data should be moved to the data
warehouse. In addition, such integration of Big Data technologies and a data
warehouse helps an organization to offload infrequently accessed data.

4.(i) Discuss nature of data.

Data:
➢ Data is a set of values of qualitative or quantitative variables; restated, pieces of data
are individual pieces of information.
➢ Data is measured, collected and reported, and analyzed, whereupon it can be
visualized using graphs or images

TYPES OF DATA
➢ In order to understand the nature of data it is necessary to categorize them into various
types.
➢ Different categorizations of data are possible.
➢ The first such categorization may be on the basis of disciplines, e.g., Sciences, Social
Sciences, etc. in which they are generated.
➢ Within each of these fields, there may be several ways in which data can be
categorized into types.
➢ There are four types of data:
1. Nominal
2. Ordinal
3. Interval
4. Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can
be performed.
The distinction between the four types of scales centers on three different characteristics:
1. The order of responses – whether it matters or not
2. The distance between observations – whether it matters or is interpretable
3. The presence or inclusion of a true zero

Nominal Scales
Nominal scales measure categories and have the following characteristics:
➢ Order: The order of the responses or observations does not matter.
➢ Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is
not the same as between a 2 and a 3.
➢ True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.

Appropriate statistics for nominal scales:


mode, count, frequencies
Displays:
histograms or bar charts

Ordinal Scales
At the risk of providing a tautological definition, ordinal scales measure, well, order.
So, our characteristics for ordinal scales are:
➢ Order: The order of the responses or observations matters.
➢ Distance: Ordinal scales do not hold distance. The distance between first and second
is unknown as is the distance between first and third along with all observations.
➢ True Zero: There is no true or real zero. An item, observation, or category cannot
finish zero.

Appropriate statistics for ordinal scales:


count, frequencies, mode
Displays:
histograms or bar charts

Interval Scales
Interval scales provide insight into the variability of the observations or data. Classic
interval scales are Likert scales (e.g., 1 - strongly agree and 9 – strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light).
In an interval scale, users could respond to “I enjoy opening links to the website from
a company email” with a response ranging on a scale of values.
The characteristics of interval scales are:
➢ Order: The order of the responses or observations does matter.
➢ Distance: Interval scales do offer distance. That is, the distance from 1 to 2 appears
the same as 4 to 5. Also, six is twice as much as three and two is half of four. Hence,
we can perform arithmetic operations on the data.
➢ True Zero: There is no true zero with interval scales. However, data can be rescaled in a
manner that contains zero. An interval scale measured from 1 to 9 remains the same as
11 to 19 because we added 10 to all values. Similarly, a 1 to 9 interval scale is the
same as a -4 to 4 scale because we subtracted 5 from all values. Although the new scale
contains zero, zero remains uninterpretable because it only appears in the scale through
the transformation.

Appropriate statistics for interval scales:


count, frequencies, mode, median, mean, standard deviation (and variance),
skewness, and kurtosis.

Displays:
histograms or bar charts, line charts, and scatter plots.

Ratio Scales
Ratio scales appear as interval scales with a true zero.
They have the following characteristics:
➢ Order: The order of the responses or observations matters.
➢ Distance: Ratio scales do have an interpretable distance.
➢ True Zero: There is a true zero.

Income is a classic example of a ratio scale:


➢ Order is established. We would all prefer $100 to $1!
➢ Zero dollars means we have no income (or, in accounting terms, our revenue exactly
equals our expenses!)
➢ Distance is interpretable, in that $20 appears as twice $10 and $50 is half of a $100.

For the web analyst, the statistics for ratio scales are the same as for interval scales.

Appropriate statistics for ratio scales:


count, frequencies, mode, median, mean, standard deviation (and variance), skewness,
and kurtosis.

Displays:
histograms or bar charts, line charts, and scatter plots.
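
As a hedged illustration of how the scale type constrains the choice of statistics, the Python
sketch below computes the statistics listed above on a small, made-up survey dataset. It
assumes pandas and SciPy are installed; the column names and values are assumptions for
demonstration only.

import pandas as pd
from scipy.stats import skew, kurtosis

survey = pd.DataFrame({
    "browser":   ["Chrome", "Firefox", "Chrome", "Safari"],   # nominal
    "education": [1, 3, 2, 3],                                # ordinal (1=school, 2=UG, 3=PG)
    "likert":    [7, 5, 9, 6],                                # interval (1-9 agreement scale)
    "income":    [52000, 61000, 48000, 75000],                # ratio (USD)
})

# Nominal and ordinal scales: counts, frequencies, and mode are appropriate.
print(survey["browser"].value_counts())
print(survey["education"].mode())

# Interval and ratio scales: mean, median, standard deviation, skewness, kurtosis also apply.
print(survey["likert"].mean(), survey["likert"].median(), survey["likert"].std())
print(skew(survey["income"]), kurtosis(survey["income"]))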

(ii) Give detail description of applications of data.

APPLICATIONS OF DATA:
To examine the properties of data, refer to the various definitions of data.
These definitions reveal that the following are the properties of data:
a) Amenability of use
b) Clarity
c) Accuracy
d) Essence
e) Aggregation
f) Compression
g) Refinement

a) Amenability of use:
From the dictionary meaning of data, it is learnt that data are facts used in deciding
something. In short, data are meant to be used as a base for arriving at definitive conclusions.

b) Clarity:
Data are a crystallized presentation. Without clarity, the meaning desired to be
communicated will remain hidden.

c) Accuracy:
Data should be real, complete and accurate. Accuracy is thus an essential
property of data.
d) Essence:
Large quantities of data are collected, and they have to be compressed and
refined. Data so refined can present the essence, or derived qualitative value, of the
matter.
e) Aggregation:
Aggregation is cumulating or adding up.
f) Compression:
Large amounts of data are always compressed to make them more meaningful.
Compress data to a manageable size. Graphs and charts are some examples of
compressed data.
g) Refinement:
Data require processing or refinement. When refined, they are capable of
leading to conclusions or even generalizations. Conclusions can be drawn only when
data are processed or refined.

5.(i) Give the Difference between Traditional Business Intelligence (BI) versus Big Data
Business Intelligence (BI) vs. Data Science:
➢ Business Intelligence (BI) basically analyzes the previous data to find hindsight and
insight to describe business trends. Here BI enables you to take data from external and
internal sources, prepare it, run queries on it and create dashboards to answer
questions like quarterly revenue analysis or business problems. BI can evaluate the
impact of certain events in the near future.
➢ Data Science is a more forward-looking approach, an exploratory way with the focus
on analyzing the past or current data and predicting the future outcomes with the aim
of making informed decisions. It answers the open-ended questions as to “what” and
“how” events occur.

Let’s have a look at some contrasting features.


(ii) Give the various drawbacks of using Traditional system approach

6.(i) Demonstrate the ETL (Extract, Transform and Load) system?

ETL stands for "Extract, Transform, and Load":


➢ EXTRACT data from its original source
➢ TRANSFORM data by deduplicating it, combining it, and ensuring quality, to
then
➢ LOAD data into the target database

ETL tools enable data integration strategies by allowing companies to gather data
from multiple data sources and consolidate it into a single, centralized location. ETL tools
also make it possible for different types of data to work together.

WORKING OF ETL:
The ETL process is comprised of 3 steps that enable data integration from source to
destination: data extraction, data transformation, and data loading.

Step 1: Extraction
Most businesses manage data from a variety of data sources and use a number of data
analysis tools to produce business intelligence. To execute such a complex data strategy, the
data must be able to travel freely between systems and apps. Before data can be moved to a
new destination, it must first be extracted from its source — such as a data warehouse or data
lake. In this first step of the ETL process, structured and unstructured data is imported and
consolidated into a single repository. Volumes of data can be extracted from a wide range of
data sources, including:
➢ Existing databases and legacy systems
➢ Cloud, hybrid, and on-premises environments
➢ Sales and marketing applications
➢ Mobile devices and apps
➢ CRM systems
➢ Data storage platforms
➢ Data warehouses
➢ Analytics tools

Step 2: Transformation
During this phase of the ETL process, rules and regulations can be applied that ensure
data quality and accessibility. You can also apply rules to help your company meet reporting
requirements. The process of data transformation is comprised of several sub-processes:
➢ Cleansing — inconsistencies and missing values in the data are resolved.
➢ Standardization — formatting rules are applied to the dataset.
➢ Deduplication — redundant data is excluded or discarded.
➢ Verification — unusable data is removed and anomalies are flagged.
➢ Sorting — data is organized according to type.
➢ Other tasks — any additional/optional rules can be applied to improve data quality.

Transformation is generally considered to be the most important part of the ETL process.
Data transformation improves data integrity — removing duplicates and ensuring that raw
data arrives at its new destination fully compatible and ready to use

Step 3: Loading
The final step in the ETL process is to load the newly transformed data into a new
destination (data lake or data warehouse.) Data can be loaded all at once (full load) or at
scheduled intervals (incremental load).

Full loading
— In an ETL full loading scenario, everything that comes from the transformation
assembly line goes into new, unique records in the data warehouse or data repository. Though
there may be times this is useful for research purposes, full loading produces datasets that
grow exponentially and can quickly become difficult to maintain.

Incremental loading
— A less comprehensive but more manageable approach is incremental loading.
Incremental loading compares incoming data with what’s already on hand, and only produces
additional records if new and unique information is found. This architecture allows smaller,
less expensive data warehouses to maintain and manage business intelligence.
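
To make the three steps concrete, a minimal, hedged ETL sketch using Python and pandas is
shown below. The file and column names are hypothetical assumptions; a real pipeline would
typically extract from databases or APIs and load into a data warehouse using a dedicated
ETL tool.

import pandas as pd

# EXTRACT: read raw data from a hypothetical source file.
source = pd.read_csv("orders_export.csv")                  # assumed source file

# TRANSFORM: cleanse, standardize, and deduplicate.
source["order_date"] = pd.to_datetime(source["order_date"], errors="coerce")
source = source.dropna(subset=["order_id"])                # cleansing
source = source.drop_duplicates(subset=["order_id"])       # deduplication

# LOAD (incremental): append only records not already in the target table.
target = pd.read_csv("warehouse_orders.csv")               # assumed existing target
new_rows = source[~source["order_id"].isin(target["order_id"])]
pd.concat([target, new_rows], ignore_index=True).to_csv("warehouse_orders.csv", index=False)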

(ii) Explain Big Data Technology Landscape


7. Analyze and write short notes on the following.
i. Hadoop Distributed File System (HDFS).

HDFS is a distributed file system that handles large data sets running on commodity
hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even
thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the
others being MapReduce and YARN. HDFS should not be confused with or replaced
by Apache HBase, which is a column-oriented non-relational database management
system that sits on top of HDFS and can better support real-time data needs with its
in-memory processing engine.

Fast recovery from hardware failures


Because one HDFS instance may consist of thousands of servers, failure of at least
one server is inevitable. HDFS has been built to detect faults and automatically recover
quickly.
Access to streaming data
HDFS is intended more for batch processing versus interactive use, so the emphasis in
the design is for high data throughput rates, which accommodate streaming access to data
sets.

Accommodation of large data sets


HDFS accommodates applications that have data sets typically gigabytes to terabytes
in size. HDFS provides high aggregate data bandwidth and can scale to hundreds of nodes in
a single cluster.

Portability
To facilitate adoption, HDFS is designed to be portable across multiple hardware
platforms and to be compatible with a variety of underlying operating systems.

(ii) YARN.

YARN stands for “Yet Another Resource Negotiator”. It was introduced in Hadoop
2.0 to remove the bottleneck on the Job Tracker which was present in Hadoop 1.0. YARN was
described as a “Redesigned Resource Manager” at the time of its launch, but it has
now evolved to be known as a large-scale distributed operating system used for Big Data
processing.
The YARN architecture basically separates the resource management layer from the
processing layer. With YARN, the responsibility of the Hadoop 1.0 Job Tracker is split between
the resource manager and the application manager. YARN also allows different data processing
engines such as graph processing, interactive processing, stream processing, and batch
processing to run and process data stored in HDFS (Hadoop Distributed File System), thus
making the system much more efficient. Through its various components, it can dynamically
allocate resources and schedule the application processing. For large-volume data
processing, it is quite necessary to manage the available resources properly so that every
application can leverage them.
YARN Features:
YARN gained popularity because of the following features
Scalability:
The scheduler in Resource manager of YARN architecture allows Hadoop to extend
and manage thousands of nodes and clusters.

Compatibility:
YARN supports the existing map-reduce applications without disruptions thus
making it compatible with Hadoop 1.0 as well.

Cluster Utilization:
YARN supports dynamic utilization of clusters in Hadoop, which enables
optimized cluster utilization.

Multi-tenancy:
It allows multiple engine access thus giving organizations a benefit of multi-tenancy.

The main components of YARN architecture include:


Client:
It submits map-reduce jobs.
Resource Manager:
It is the master daemon of YARN and is responsible for resource assignment and
management among all the applications. Whenever it receives a processing request, it
forwards it to the corresponding node manager and allocates resources for the completion of
the request accordingly. It has two major components:
Scheduler:
It performs scheduling based on the allocated application and available resources. It is
a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and
does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as
the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.

Application manager:
It is responsible for accepting the application and negotiating the first container from
the resource manager. It also restarts the Application Master container if a task fails.

Node Manager:
It takes care of an individual node in the Hadoop cluster and manages the applications and
workflow on that particular node. Its primary job is to keep up with the Resource Manager. It
monitors resource usage, performs log management and also kills a container based on
directions from the resource manager. It is also responsible for creating the container process
and starting it at the request of the Application Master.

Application Master:
An application is a single job submitted to the framework. The application master is
responsible for negotiating resources with the resource manager, tracking the status and
monitoring the progress of a single application. The application master requests the container
from the node manager by sending a Container Launch Context (CLC) which includes
everything an application needs to run. Once the application is started, it sends a health
report to the resource manager from time to time.

Container:
It is a collection of physical resources such as RAM, CPU cores and disk on a single node.
Containers are invoked by a Container Launch Context (CLC), which is a record that contains
information such as environment variables, security tokens, dependencies etc.

8. Explain the following Detail:


(i) Map Reduce

MapReduce is a programming model for performing parallel processing on large data


sets. Although it is a powerful technique, its basics are relatively simple.

Imagine we have a collection of items we’d like to process somehow. For instance,
the items might be website logs, the texts of various books, image files, or anything else. A
basic version of the MapReduce algorithm consists of the following steps:
1. Use a mapper function to turn each item into zero or more key-value pairs. (Often
this is called the map function, but there is already a Python function called map and we don’t
need to confuse the two.)
2. Collect together all the pairs with identical keys.
3. Use a reducer function on each collection of grouped values to produce output
values for the corresponding key.
This is all sort of abstract, so let’s look at a specific example. There are few absolute
rules of data science, but one of them is that your first MapReduce example has to involve
counting words.

Example: Word Count


Data Sciencester has grown to millions of users! This is great for your job security,
but it makes routine analyses slightly more difficult. For example, your VP of Content wants
to know what sorts of things people are talking about in their status updates. As a first
attempt, you decide to count the words that appear, so that you can prepare a report on the
most frequent ones.
When you had a few hundred users this was simple to do:

import re
from collections import Counter

def tokenize(document):
    return re.findall("[a-z0-9']+", document.lower())   # simple tokenizer, assumed so the example runs

def word_count_old(documents):
    """word count not using MapReduce"""
    return Counter(word
                   for document in documents
                   for word in tokenize(document))

With millions of users the set of documents (status updates) is suddenly too big to fit
on your computer. If you can just fit this into the MapReduce model, you can use some “big
data” infrastructure that your engineers have implemented.
First, we need a function that turns a document into a sequence of key-value pairs.
We’ll want our output to be grouped by word, which means that the keys should be words.
And for each word, we’ll just emit the value 1 to indicate that this pair corresponds to one
occurrence of the word:
def wc_mapper(document):
    """for each word in the document, emit (word, 1)"""
    for word in tokenize(document):
        yield (word, 1)
Skipping the “plumbing” step 2 for the moment, imagine that for some word we’ve
collected a list of the corresponding counts we emitted. Then to produce the overall count
for that word we just need:
def wc_reducer(word, counts):
    """sum up the counts for a word"""
    yield (word, sum(counts))

Returning to step 2, we now need to collect the results from wc_mapper and feed
them to wc_reducer. Let’s think about how we would do this on just one computer:
from collections import defaultdict

def word_count(documents):
    """count the words in the input documents using MapReduce"""
    collector = defaultdict(list)                    # place to store grouped values
    for document in documents:
        for word, count in wc_mapper(document):
            collector[word].append(count)
    return [output
            for word, counts in collector.items()    # iteritems() in Python 2
            for output in wc_reducer(word, counts)]
Imagine that we have three documents ["data science", "big data", "science
fiction"]. Then wc_mapper applied to the first document yields the two pairs ("data", 1) and
("science", 1). After we’ve gone through all three documents, the collector contains
{ "data" : [1, 1],
"science" : [1, 1],
"big" : [1],
"fiction" : [1] }
Then wc_reducer produces the count for each word:

[("data", 2), ("science", 2), ("big", 1), ("fiction", 1)]

(ii) YARN.

YARN stands for “Yet Another Resource Negotiator”. Its features (scalability, compatibility,
cluster utilization, and multi-tenancy) and its architecture components (Client, Resource
Manager, Scheduler, Application Manager, Node Manager, Application Master, and Container)
are described in detail in the answer to Question 7(ii) above; the same answer applies here.

9. (i) Assess the difference between analysis and analytics.

There is an important difference between analysis and analytics. A lack of
understanding can affect marketers' ability to leverage customer intelligence to their best
advantage.
Analysis is the process of exploring data and reports in order to extract meaningful insights
which can be used to better understand and improve business performance. Analysis focuses
on different tasks such as questioning, examining, interpreting, comparing, and confirming
(I’ve left out testing as I view optimization efforts as part of the action stage).
Analysis is the separation of a whole into its component parts, and analytics is the
method of logical analysis. One way to compare analysis and analytics is to think in terms of
past and future. Analysis looks backwards over time, providing marketers with a historical
view of what has happened. Typically, analytics looks forward to model the future or predict
a result.

Data analytics is a broader term and includes data analysis as a necessary
subcomponent. Analytics defines the science behind the analysis. The science means
understanding the cognitive processes an analyst uses to understand problems and explore
data in meaningful ways. Analytics also includes data extraction, transformation, and loading;
specific tools, techniques, and methods; and how to successfully communicate results.
(ii) Discuss the importance of big data analytics?
Importance of Big data:
Big Data importance doesn’t revolve around the amount of data a company has. Its
importance lies in how the company utilizes the gathered data. Every company
uses its collected data in its own way. The more effectively a company uses its data, the
more rapidly it grows.

The companies in the present market need to collect it and analyze it because:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to
businesses when they have to store large amounts of data. These tools help
organizations in identifying more effective ways of doing business.

2. Time-Saving
Real-time in-memory analytics helps companies to collect data from various
sources. Tools like Hadoop help them to analyze data immediately thus helping in
making quick decisions based on the learnings.

3. Understand the market conditions


Big Data analysis helps businesses to get a better understanding of market
situations.
For example, analysis of customer purchasing behavior helps companies to
identify the products sold most and thus produce those products accordingly. This
helps companies to get ahead of their competitors.

4. Social Media Listening


Companies can perform sentiment analysis using Big Data tools. These enable
them to get feedback about their company, that is, who is saying what about the
company. Companies can use Big data tools to improve their online presence.

5. Boost Customer Acquisition and Retention


Customers are a vital asset on which any business depends. No single
business can achieve success without building a robust customer base. But even
with a solid customer base, companies can’t ignore the competition in the market.

If we don’t know what our customers want, it will degrade the company’s
success. It will result in the loss of clientele, which creates an adverse effect on
business growth.

Big data analytics helps businesses to identify customer related trends and
patterns. Customer behavior analysis leads to a profitable business.

6. Solve Advertisers Problem and Offer Marketing Insights


Big data analytics shapes all business operations. It enables companies to
fulfill customer expectations. Big data analytics helps in changing the company’s
product line. It ensures powerful marketing campaigns.

7. The driver of Innovations and Product Development


Big data makes companies capable of innovating and redeveloping their products.
10. Extrapolate big data analytics and Develop a summary of various applications in the
real world scenario.

BIG DATA:
Big Data refers to massive amounts of data produced by different sources like social
media platforms, web logs, sensors, IoT devices, and many more. It can be either structured
(like tables in a DBMS), semi-structured (like XML files), or unstructured (like audio, video,
and images).
Traditional database management systems are not able to handle this vast amount of
data. Big Data helps companies to generate valuable insights.
Companies use Big Data to refine their marketing campaigns and techniques.
Companies use it in machine learning projects to train machines, predictive modeling, and
other advanced analytics applications.

NEED OF BIG DATA:


Big Data initiatives were rated as “extremely important” to 93% of companies.
Leveraging a Big Data analytics solution helps organizations to unlock the strategic values
and take full advantage of their assets.

REAL-TIME BENEFITS OF BIG DATA:


Big Data analytics has expanded its roots in all the fields. This results in the use of
Big Data in a wide range of industries including
• Finance and Banking,
• Healthcare,
• Education,
• Government,
• Retail,
• Manufacturing, and many more.

There are many companies like Amazon, Netflix, Spotify, LinkedIn, Swiggy, etc.,
which use big data analytics. The banking sector makes the maximum use of Big Data analytics.
The education sector is also using data analytics to enhance students’ performance as well as
to make teaching easier for instructors.

Big Data analytics help retailers from traditional to e-commerce to understand


customer behaviour and recommend products as per customer interest. This helps them in
developing new and improved products which help the firm enormously.

Big Data in Education Industry:


Proper study and analysis of this data can provide insights that can be used to improve
the operational effectiveness and working of educational institutes.
Following are some of the fields in the education industry that have been transformed by big
data-motivated changes:
• Customized and Dynamic Learning Programs
• Reframing Course Material
• Grading Systems
• Career Prediction
Example
The University of Alabama has more than 38,000 students and an ocean of data. In
the past when there were no real solutions to analyze that much data, some of them seemed
useless. Now, administrators can use analytics and data visualizations for this data to draw
out patterns of students revolutionizing the university’s operations, recruitment, and retention
efforts.

Big Data in Healthcare Industry:


Healthcare is yet another industry that is bound to generate a huge amount of data.
Following are some of the ways in which big data has contributed to healthcare:
• Big data reduces the costs of a treatment since there are fewer chances of having to
perform unnecessary diagnoses.
• It helps in predicting outbreaks of epidemics and also in deciding what preventive
measures could be taken to minimize the effects of the same.
• It helps avoid preventable diseases by detecting them in the early stages. It prevents
them from getting any worse which in turn makes their treatment easy and effective.
• Patients can be provided with evidence-based medicine which is identified and
prescribed after researching past medical results.

Big Data in Government Sector:


Governments, be it of any country, come face to face with a very huge amount of data
on an almost daily basis. The reason for this is, they have to keep track of various records and
databases regarding their citizens, their growth, energy resources, geographical surveys, and
many more. All this data contributes to big data. The proper study and analysis of this data,
hence, helps governments in endless ways. A few of them are as follows:

• Welfare Schemes
- In making faster and more informed decisions regarding various political
programs
- To identify areas that are in immediate need of attention
- To overcome national challenges such as unemployment, terrorism, energy
resources exploration, and much more

• Cyber Security
Big Data is hugely used for deceit recognition in the domain of cyber security.
It is also used in catching tax evaders.
Cyber security engineers protect networks and data from unauthorized access.

11. Describe the roles and stages in a data science project.

ROLES AND STAGES IN A DATA SCIENCE PROJECT:

The stages of a data science project follow the data science life cycle described in the answer
to Question 2 above: Phase 1—Discovery, Phase 2—Data preparation, Phase 3—Model planning,
Phase 4—Model building, Phase 5—Operationalize, and Phase 6—Communicate results. Refer to
that answer for the detailed description of each phase and the tools used in it.

12.(i) Illustrate the importance of big data

Importance of Big data:


Big Data importance doesn’t revolve around the amount of data a company has. Its
importance lies in the fact that how the company utilizes the gathered data. Every company
uses its collected data in its own way. More effectively the company uses its data, more
rapidly it grows.

The companies in the present market need to collect it and analyze it because:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to
businesses when they have to store large amounts of data. These tools help
organizations in identifying more effective ways of doing business.

2. Time-Saving
Real-time in-memory analytics helps companies to collect data from various
sources. Tools like Hadoop help them to analyze data immediately thus helping in
making quick decisions based on the learnings.

3. Understand the market conditions


Big Data analysis helps businesses to get a better understanding of market
situations.
For example, analysis of customer purchasing behavior helps companies to
identify the products sold most and thus produces those products accordingly. This
helps companies to get ahead of their competitors.

4. Social Media Listening


Companies can perform sentiment analysis using Big Data tools. These enable
them to get feedback about their company, that is, who is saying what about the
company. Companies can use Big data tools to improve their online presence.

5. Boost Customer Acquisition and Retention


Customers are a vital asset on which any business depends. No business can
achieve success without building a robust customer base, and even with a solid
customer base, companies cannot ignore the competition in the market.
If a company does not know what its customers want, its success will suffer: it
will lose clientele, which has an adverse effect on business growth.

Big data analytics helps businesses to identify customer related trends and
patterns. Customer behavior analysis leads to a profitable business.

6. Solve Advertisers' Problems and Offer Marketing Insights


Big Data analytics shapes all business operations. It enables companies to
fulfill customer expectations, helps them refine their product lines, and supports
powerful marketing campaigns.

7. Driver of Innovation and Product Development


Big Data makes companies capable of innovating and redeveloping their products.
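
To make point 4 above concrete, here is a minimal sentiment-analysis sketch using NLTK's VADER analyzer. The example posts are invented, and a real social-media listening pipeline involves much more (data collection, language handling, aggregation); this is only an illustration of the scoring step.

# A minimal sentiment-analysis sketch using NLTK's VADER (example texts are made up).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time download of the VADER lexicon

analyzer = SentimentIntensityAnalyzer()
posts = [
    "Love the new product line, great job!",
    "Terrible customer service, very disappointed.",
]
for post in posts:
    scores = analyzer.polarity_scores(post)   # returns neg/neu/pos/compound scores
    print(post, "->", scores["compound"])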

(ii) List out the various challenges faced in big data in detail
VARIOUS CHALLENGES FACED IN BIG DATA:
1. Lack of Knowledgeable Professionals
To run modern technologies and Big Data tools, companies need skilled data
professionals. These professionals include data scientists, data analysts, and data engineers
who can work with the tools and make sense of large data sets. A shortage of Big Data
professionals is one of the challenges any company faces. This is because data handling tools
have evolved rapidly, but in most cases the professionals have not.

Solution:
Companies are investing more money in the recruitment of skilled professionals. They
also have to offer training programs to existing staff to get the most out of them. Another
important step taken by organizations is purchasing data analytics solutions powered by
artificial intelligence / machine learning.

2. Lack of Proper Understanding of Big Data


Companies often fail in their Big Data initiatives due to insufficient understanding.
Employees may not know what data is, or understand its storage, processing, importance, and
sources. Data professionals may know what is happening, but others may not have a clear
picture.

Solution:
Big Data workshops and seminars must be held at companies for everybody. Basic
training programs must be arranged for all employees who handle data regularly and are part
of Big Data projects. A basic understanding of data concepts must be inculcated at all levels
of the organization.

3. Data Growth Issues:


One of the most pressing challenges of Big Data is storing these huge sets of data
properly. The amount of data being stored in companies' data centers and databases is
increasing rapidly, and as these data sets grow exponentially over time, they become
challenging to handle. Most of the data is unstructured and comes from documents, videos,
audio, text files, and other sources.
4. Confusion during Big Data Tool Selection:
Companies often get confused while selecting the best tool for Big Data analysis
and storage. Is HBase or Cassandra the better technology for data storage? Is Hadoop
MapReduce good enough, or will Spark be a better option for data analytics and storage?
These questions bother companies, and sometimes they are unable to find the answers. They
end up making poor decisions and selecting inappropriate technology.

Solution:
You can either hire experienced professionals who know much more about these tools,
or go for Big Data consulting, where consultants will recommend the best tools for your
company's scenario.

5. Integrating Data from a Variety of Sources:


Data in an organization comes from various sources, such as social media pages, ERP
applications, customer logs, financial reports, e-mails, presentations, and reports created by
employees. Combining all this data to prepare reports is a challenging task.

Solution:
Companies need to solve their data integration problems by purchasing the right
tools. Some of the best data integration tools are mentioned below:

• Talend Data Integration
• Centerprise Data Integrator
• ArcESB
• IBM InfoSphere
• Xplenty
• Informatica PowerCenter
• CloverDX
• Microsoft SQL
• QlikView

6. Securing Data
Securing these huge sets of data is one of the daunting challenges of Big Data.
Companies are often so busy understanding, storing, and analyzing their data sets that they
push data security to later stages. This is not a smart move, as unprotected data repositories
can become breeding grounds for malicious hackers. Companies can lose up to $3.7 million
for a stolen record or a data breach.

Solution:
Companies are recruiting more cybersecurity professionals to protect their data. Other
steps taken for securing Big Data include:
• Data encryption (a minimal sketch is shown after this list)
• Data segregation
• Identity and access control
• Implementation of endpoint security
• Real-time security monitoring
• Use of Big Data security tools, such as IBM Guardium
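
As a concrete illustration of the "data encryption" measure above, the following minimal Python sketch uses the cryptography package's Fernet recipe. The record content is invented, and a real deployment needs proper key management (for example, a key vault) rather than a key held in memory.

# A minimal data-encryption sketch using the cryptography package's Fernet recipe
# (illustrative only; production key management needs far more care).
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in practice, store this key in a secure key vault
f = Fernet(key)

record = b"customer_id=1001,card_last4=4242"   # hypothetical sensitive record
token = f.encrypt(record)          # ciphertext that is safe to store at rest
print(f.decrypt(token))            # original record recovered only with the key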

13. Explain storage considerations in Big Data.

STORAGE CONSIDERATIONS IN BIG DATA:


In any environment intended to support the analysis of massive amounts of data, there
must be infrastructure supporting the data lifecycle from acquisition through preparation,
integration, and execution. The need to acquire and manage massive amounts of data
suggests a need for specialty storage systems to accommodate big data applications.
When evaluating specialty storage offerings, some variables to consider include:
• Scalability, which looks at whether expectations for performance improvement are
aligned with the addition of storage resources, and the degree to which the storage
subsystem can support massive data volumes of increasing size.
• Extensibility, which examines how flexible the storage system’s architecture is in
allowing the system to be grown without the constraint of artificial limits.
• Accessibility, which looks at any limitations or constraints in providing simultaneous
access to an expanding user community without compromising performance.
• Fault tolerance, which imbues the storage environment with the capability to recover
from intermittent failures.
• High-speed I/O capacity, which measures whether the input/output channels can
satisfy the demanding timing requirements for absorbing, storing, and sharing large
data volumes.
• Integrability, which measures how well the storage environment can be integrated into
the production environment.

Often, the storage framework involves a software layer that manages a collection of
storage resources and provides many of these capabilities. The software configures storage
for replication to provide a level of fault tolerance, and manages communications among the
different processing nodes using standard protocols (such as UDP or TCP/IP). In addition,
some frameworks replicate stored data, providing redundancy in the event of a fault or
failure.

14. Discuss Data Cleaning and Sampling


DATA CLEANING:
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data within a dataset. When combining multiple data
sources, there are many opportunities for data to be duplicated or mislabeled. If data is
incorrect, outcomes and algorithms are unreliable, even though they may look correct. There
is no one absolute way to prescribe the exact steps in the data cleaning process because the
processes will vary from dataset to dataset. But it is crucial to establish a template for your
data cleaning process so you know you are doing it the right way every time.
When using data, most people agree that your insights and analysis are only as good
as the data you are using. Essentially, garbage data in is garbage analysis out. Data cleaning,
also referred to as data cleansing and data scrubbing, is one of the most important steps for
your organization if you want to create a culture of quality data decision-making.
While the techniques used for data cleaning may vary according to the types of data
your company stores, you can follow these basic steps to map out a framework for your
organization.
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate and irrelevant
observations. Duplicate observations happen most often during data collection: when you
combine data sets from multiple places, scrape data, or receive data from clients or multiple
departments, there are opportunities to create duplicate data. De-duplication is one of the
largest areas to be considered in this process. Irrelevant observations are those that do not fit
the specific problem you are trying to analyze. For example, if you want to analyze data
regarding millennial customers but your dataset includes older generations, you might remove
those irrelevant observations. This makes analysis more efficient, minimizes distraction from
your primary target, and creates a more manageable and more performant dataset.
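
A minimal pandas sketch of Step 1 might look like the following; the file name customers.csv, the column birth_year, and the year bounds are all hypothetical.

# A minimal sketch of Step 1 with pandas (hypothetical file and column names).
import pandas as pd

df = pd.read_csv("customers.csv")

df = df.drop_duplicates()                      # remove exact duplicate rows

# Keep only the observations relevant to the question being analyzed,
# e.g. millennial customers (birth-year bounds are illustrative).
df = df[(df["birth_year"] >= 1981) & (df["birth_year"] <= 1996)]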

Step 2: Fix structural errors
Structural errors arise when you measure or transfer data and notice strange naming
conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabelled
categories or classes. For example, you may find that both “N/A” and “Not Applicable”
appear, but they should be analyzed as the same category.
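
For example, a small pandas sketch of Step 2 could standardize such labels; the file and column names are hypothetical.

# A minimal sketch of Step 2: normalising inconsistent labels (hypothetical column name).
import pandas as pd

df = pd.read_csv("survey.csv")

df["status"] = (
    df["status"]
    .str.strip()
    .str.lower()
    .replace({"not applicable": "n/a"})   # treat both spellings as one category
)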

Step 3: Filter unwanted outliers
Often there will be one-off observations that, at a glance, do not appear to fit within
the data you are analyzing. If you have a legitimate reason to remove an outlier, such as
improper data entry, doing so will help the performance of the data you are working with.
However, sometimes it is the appearance of an outlier that will prove a theory you are
working on.
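
A minimal sketch of Step 3 using the common interquartile-range rule is shown below; the file and column names are hypothetical, and whether to drop the flagged rows remains a judgement call.

# A minimal sketch of Step 3: filtering outliers with the interquartile-range rule.
import pandas as pd

df = pd.read_csv("orders.csv")

q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_no_outliers = df[mask]                      # keep only in-range observations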

Step 4: Handle missing data
You cannot simply ignore missing data, because many algorithms will not accept
missing values. There are a few ways to deal with missing data; none of them is optimal, but
all can be considered (a minimal sketch of the first two options is shown after the list below).

1. As a first option, you can drop observations that have missing values, but doing this
will drop or lose information, so be mindful of this before you remove them.
2. As a second option, you can impute missing values based on other observations;
again, there is an opportunity to lose the integrity of the data because you may be operating
from assumptions rather than actual observations.
3. As a third option, you might alter the way the data is used so that it effectively
navigates null values.
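
As referenced above, here is a minimal pandas sketch of the first two options; the file and column names are hypothetical.

# A minimal sketch of Step 4: two common options for missing values.
import pandas as pd

df = pd.read_csv("sensor_readings.csv")

# Option 1: drop rows that are missing a critical field (information is lost)
df_dropped = df.dropna(subset=["reading"])

# Option 2: impute missing values from other observations (assumption-based)
df_imputed = df.fillna({"reading": df["reading"].median()})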

Step 5: Validate and QA
At the end of the data cleaning process, you should be able to answer these questions
as part of basic validation:
• Does the data make sense?
• Does the data follow the appropriate rules for its field?
• Does it prove or disprove your working theory, or bring any insight to light?
• Can you find trends in the data to help you form your next theory?
• If not, is that because of a data quality issue?

False conclusions because of incorrect or “dirty” data can inform poor business
strategy and decision-making. False conclusions can lead to an embarrassing moment in a
reporting meeting when you realize your data doesn’t stand up to scrutiny. Before you get
there, it is important to create a culture of quality data in your organization. To do this, you
should document the tools you might use to create this culture and what data quality means to
you.

SAMPLING:
In data analysis, sampling is the practice of analyzing a subset of all data in order to
uncover meaningful information in the larger data set. For example, if you wanted to
estimate the number of trees in a 100-acre area where the distribution of trees was fairly
uniform, you could count the number of trees in 1 acre and multiply by 100, or count the trees
in half an acre and multiply by 200, to get an accurate estimate for the entire 100 acres.
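
A minimal pandas sketch of this idea, analyzing a 1% sample and scaling the result back up, could look like the following; the file and column names are hypothetical.

# A minimal sketch of sampling: estimating a total from a 1% subset.
import pandas as pd

df = pd.read_csv("sessions.csv")

sample = df.sample(frac=0.01, random_state=0)        # analyze only 1% of the rows
estimated_total_revenue = sample["revenue"].sum() * 100   # scale the sample total back up
print("Estimated total revenue:", estimated_total_revenue)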

SAMPLING THRESHOLDS:
Default reports are not subject to sampling. Ad-hoc queries of your data are subject to
the following general thresholds for sampling:
• Analytics Standard: 500k sessions at the property level for the date range you
are using
• Analytics 360: 100M sessions at the view level for the date range you are
using
• Queries may include events, custom variables, and custom dimensions and
metrics. All other queries have a threshold of 1M
• Historical data is limited to up to 14 months (on a rolling basis)

In some circumstances, you may see fewer sessions sampled. This can result from the
complexity of your Analytics implementation, the use of view filters, query complexity for
segmentation, or some combination of those factors. Although we make a best effort to
sample up to the thresholds described above, it's normal to sometimes see slightly fewer
sessions returned for an ad-hoc query.

SAMPLING USED:
The following sections explain where you can expect session sampling in Analytics
reports.
Default reports
Analytics has a set of preconfigured, default reports listed in the left pane under
Audience, Acquisition, Behavior, and Conversions. Analytics stores one complete, unfiltered
set of data for each property in each account. For each reporting view in a property, Analytics
also creates tables of aggregated dimensions and metrics from the complete, unfiltered data.
When you run a default report, Analytics queries the tables of aggregated data to quickly
deliver unsampled results.

PART – C

1. Create a brief summary about the challenges faced in processing big data nowadays.

VARIOUS CHALLENGES FACED IN BIG DATA:


1. Lack of Knowledgeable Professionals
To run modern technologies and Big Data tools, companies need skilled data
professionals. These professionals include data scientists, data analysts, and data engineers
who can work with the tools and make sense of large data sets. A shortage of Big Data
professionals is one of the challenges any company faces. This is because data handling tools
have evolved rapidly, but in most cases the professionals have not.

Solution:
Companies are investing more money in the recruitment of skilled professionals. They
also have to offer training programs to existing staff to get the most out of them. Another
important step taken by organizations is purchasing data analytics solutions powered by
artificial intelligence / machine learning.

2. Lack of Proper Understanding of Big Data


Companies often fail in their Big Data initiatives due to insufficient understanding.
Employees may not know what data is, or understand its storage, processing, importance, and
sources. Data professionals may know what is happening, but others may not have a clear
picture.

Solution:
Big Data workshops and seminars must be held at companies for everybody. Basic
training programs must be arranged for all employees who handle data regularly and are part
of Big Data projects. A basic understanding of data concepts must be inculcated at all levels
of the organization.

3. Data Growth Issues:


One of the most pressing challenges of Big Data is storing these huge sets of data
properly. The amount of data being stored in companies' data centers and databases is
increasing rapidly, and as these data sets grow exponentially over time, they become
challenging to handle. Most of the data is unstructured and comes from documents, videos,
audio, text files, and other sources.
4. Confusion during Big Data Tool Selection:
Companies often get confused while selecting the best tool for Big Data analysis
and storage. Is HBase or Cassandra the better technology for data storage? Is Hadoop
MapReduce good enough, or will Spark be a better option for data analytics and storage?
These questions bother companies, and sometimes they are unable to find the answers. They
end up making poor decisions and selecting inappropriate technology.

Solution:
You can either hire experienced professionals who know much more about these tools,
or go for Big Data consulting, where consultants will recommend the best tools for your
company's scenario.

5. Integrating Data from a Variety of Sources:


Data in an organization comes from various sources, such as social media pages, ERP
applications, customer logs, financial reports, e-mails, presentations, and reports created by
employees. Combining all this data to prepare reports is a challenging task.

Solution:
Companies need to solve their data integration problems by purchasing the right
tools. Some of the best data integration tools are mentioned below:

• Talend Data Integration
• Centerprise Data Integrator
• ArcESB
• IBM InfoSphere
• Xplenty
• Informatica PowerCenter
• CloverDX
• Microsoft SQL
• QlikView

6. Securing Data
Securing these huge sets of data is one of the daunting challenges of Big Data.
Companies are often so busy understanding, storing, and analyzing their data sets that they
push data security to later stages. This is not a smart move, as unprotected data repositories
can become breeding grounds for malicious hackers. Companies can lose up to $3.7 million
for a stolen record or a data breach.

Solution:
Companies are recruiting more cybersecurity professionals to protect their data. Other
steps taken for securing Big Data include:
• Data encryption
• Data segregation
• Identity and access control
• Implementation of endpoint security
• Real-time security monitoring
• Use of Big Data security tools, such as IBM Guardium

2. Evaluate in detail about the case study of big data solutions.

3. Explain Traditional Vs Big data business approach with its drawbacks

4. Evaluate the various formats of data and illustrate with a real time examples.
