DS Unit 1
UNIT – 1
PART – A
1.What is Data Science process? Explain.
A data science process helps data scientists use tools to find unseen patterns, extract data, and convert information into actionable insights that can be meaningful to the company.
One way to distinguish between reporting and analysis is by identifying the primary tasks that are being performed.
Activities such as building, configuring, consolidating, organizing, formatting, and summarizing come under reporting.
Analysis focuses on different tasks such as questioning, examining, interpreting, comparing, and confirming.
15. Show the ways in which decision making and predictions are made in Data Science
DECISION-MAKING:
By applying data science to operational procedures, decision makers can implement changes much more efficiently and monitor much more closely, through trial and error, whether those changes are successful.
PREDICTIVE ANALYSIS:
Predictive analytics uses historical data to predict future events. Typically, historical
data is used to build a mathematical model that captures important trends. That predictive
model is then used on current data to predict what will happen next, or to suggest actions to
take for optimal outcomes.
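As an illustration (not part of the original notes), a minimal sketch of this build-then-predict workflow is shown below, assuming scikit-learn is installed; the spend/sales figures and variable names are hypothetical.

# Minimal sketch of predictive analytics: fit a model on historical data,
# then apply it to current data. Data and feature names are hypothetical.
from sklearn.linear_model import LinearRegression

# Historical observations: monthly ad spend (in $1000s) vs. units sold
historical_spend = [[10], [15], [20], [25], [30]]
historical_sales = [120, 170, 210, 260, 300]

model = LinearRegression()
model.fit(historical_spend, historical_sales)   # capture the important trend

# Predict the outcome for a planned spend of $28k next month
predicted_sales = model.predict([[28]])
print(f"Predicted units sold: {predicted_sales[0]:.0f}")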
PART – B
1. (i) What is Big Data?
Big data is a field that treats ways to analyze, systematically extract information from,
or otherwise deal with data sets that are too large or complex to be dealt with by
traditional data-processing application software.
Basics of Big Data:
➢ A Big Data platform is an IT solution which combines several Big Data tools and utilities into one packaged solution for managing and analysing Big Data.
➢ A Big Data platform combines the features and capabilities of several Big Data applications and utilities within a single solution.
➢ It is an enterprise-class IT platform that enables organizations to develop, deploy, operate, and manage a Big Data infrastructure/environment.
Big data is a term for data sets that are so large or complex that traditional data processing
applications are inadequate.
Challenges include
1. Analysis,
2. Capture,
3. Data Curation,
4. Search,
5. Sharing,
6. Storage,
7. Transfer,
8. Visualization,
9. Querying,
10. Updating
11. Information Privacy.
➢ The term often refers simply to the use of predictive analytics or certain other
advanced methods to extract value from data, and seldom to a particular size of data
set.
➢ Accuracy in big data may lead to more confident decision making, and better
decisions can result in greater operational efficiency, cost reduction and reduced risk.
➢ Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable elapsed
time.
➢ Big data "size" is a constantly moving target.
➢ Big data requires a set of techniques and technologies with new forms of integration
to reveal insights from datasets that are diverse, complex, and of a massive scale
a) Hadoop
➢ Hadoop is an open-source, Java-based programming framework and server software used to store and analyze data with the help of hundreds or even thousands of commodity servers in a clustered environment.
➢ Hadoop is designed to store and process large datasets extremely fast and in a fault-tolerant way.
➢ Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of commodity computers. If any server goes down, it knows how to replicate the data, so there is no loss of data even in the event of hardware failure.
➢ Hadoop is an Apache-sponsored project, and it consists of many software packages which run on top of the Apache Hadoop system.
➢ Top Hadoop-based commercial Big Data analytics platforms are described below.
➢ Hadoop provides a set of tools and software that form the backbone of a Big Data analytics system.
➢ The Hadoop ecosystem provides the necessary tools and software for handling and analyzing Big Data.
➢ On top of the Hadoop system, many applications can be developed and plugged in to provide an ideal solution for Big Data needs.
b) Cloudera
➢ Cloudera is one of the first commercial Hadoop-based Big Data analytics platforms offering a Big Data solution.
➢ Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data Science & Engineering and Cloudera Essentials.
➢ All these products are based on Apache Hadoop and provide real-time processing and analytics of massive data sets.
The Hortonworks Hadoop distribution is 100% open source and enterprise-ready, with the following features:
• Centralized management and configuration of clusters
• Security and data governance are built-in features of the system
• Centralized security administration across the system
e) MapR
➢ MapR is another Big Data platform; it uses the Unix file system for handling data.
➢ It does not use HDFS, and the system is easy to learn for anyone familiar with the Unix system.
➢ This solution integrates Hadoop, Spark, and Apache Drill with a real-time data processing feature.
IBM Open Platform uses the latest Hadoop software and provides the following features:
➢ Based on 100% Open source software
➢ Native support for rolling Hadoop upgrades
➢ Support for long-running applications within YARN.
➢ Support for heterogeneous storage which includes HDFS for in-memory and SSD in
addition to HDD
➢ Native support for Spark; developers can use Java, Python and Scala to write programs
➢ The platform includes Ambari, a tool for provisioning, managing and monitoring Apache Hadoop clusters
➢ IBM Open Platform includes all the software of the Hadoop ecosystem, e.g. HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, OpenJDK, Knox, Slider
➢ Developers can download a trial Docker image or native installer for testing and learning the system
➢ The platform is well supported by the IBM technology team
g) Microsoft HDInsight
➢ Microsoft HDInsight is also based on the Hadoop distribution and is a commercial Big Data platform from Microsoft.
➢ Microsoft is a software giant that develops the Windows operating system for desktop and server users.
➢ This is a major Hadoop distribution offering that runs in the Windows and Azure environments.
➢ It offers customized, optimized open-source Hadoop-based analytics clusters that use Spark, Hive, MapReduce, HBase, Storm, Kafka and R Server, running on the Hadoop system in a Windows/Azure environment.
Phase 1—Discovery:
Before you begin the project, it is important to understand the various specifications,
requirements, priorities and required budget. You must possess the ability to ask the right
questions. Here, you assess if you have the required resources present in terms of people,
technology, time, and data to support the project. In this phase, you also need to frame the
business problem and formulate initial hypotheses (IH) to test.
1. R has a complete set of modeling capabilities and provides a good environment for
building interpretive models.
2. SQL Analysis services can perform in-database analytics using common data mining
functions and basic predictive models.
3. SAS/ACCESS can be used to access data from Hadoop and is used for creating
repeatable and reusable model flow diagrams.
Although many tools are available in the market, R is the most widely used. Now that you have gained insight into the nature of your data and have decided on the algorithms to be used, in the next stage you will apply the algorithms and build a model.
Phase 5—Operationalize:
In this phase, you deliver final reports, briefings, code and technical documents. In addition, sometimes a pilot project is also implemented in a real-time production environment. This will provide you with a clear picture of the performance and other related constraints on a small scale before full deployment.
Data:
➢ Data is a set of values of qualitative or quantitative variables; restated, pieces of data
are individual pieces of information.
➢ Data is measured, collected, reported, and analyzed, whereupon it can be visualized using graphs or images.
TYPES OF DATA
➢ In order to understand the nature of data it is necessary to categorize them into various
types.
➢ Different categorizations of data are possible.
➢ The first such categorization may be on the basis of disciplines, e.g., Sciences, Social
Sciences, etc. in which they are generated.
➢ Within each of these fields, there may be several ways in which data can be
categorized into types.
➢ There are four types of data:
1. Nominal
2. Ordinal
3. Interval
4. Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can
be performed.
The distinction between the four types of scales centers on three different characteristics:
1. The order of responses – whether it matters or not
2. The distance between observations – whether it matters or is interpretable
3. The presence or inclusion of a true zero
Nominal Scales
Nominal scales measure categories and have the following characteristics:
➢ Order: The order of the responses or observations does not matter.
➢ Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is not the same as between a 2 and a 3.
➢ True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.
Ordinal Scales
At the risk of providing a tautological definition, ordinal scales measure, well, order.
So, our characteristics for ordinal scales are:
➢ Order: The order of the responses or observations matters.
➢ Distance: Ordinal scales do not hold distance. The distance between first and second is unknown, as is the distance between first and third, and so on for all observations.
➢ True Zero: There is no true or real zero. An item, observation, or category cannot finish in position zero.
Interval Scales
Interval scales provide insight into the variability of the observations or data. Classic
interval scales are Likert scales (e.g., 1 - strongly agree and 9 – strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light).
In an interval scale, users could respond to “I enjoy opening links to the website from
a company email” with a response ranging on a scale of values.
The characteristics of interval scales are:
➢ Order: The order of the responses or observations does matter.
➢ Distance: Interval scales do offer distance. That is, the distance from 1 to 2 is the same as the distance from 4 to 5, so we can meaningfully add and subtract values. However, because there is no true zero, ratio statements (for example, six being "twice as much" as three) are not strictly meaningful.
➢ True Zero: There is no true zero with interval scales. However, data can be rescaled in a manner that contains zero. An interval scale measured from 1 to 9 remains the same as one from 11 to 19 because we added 10 to all values. Similarly, a 1 to 9 interval scale is the same as a -4 to 4 scale because we subtracted 5 from all values. Although the new scale contains zero, zero remains uninterpretable because it only appears in the scale through the transformation.
Displays:
histograms or bar charts, line charts, and scatter plots.
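A tiny illustrative sketch of the rescaling point above (with made-up Likert-style responses): shifting every value by a constant leaves the distances between observations unchanged, which is why the shifted scale is "the same" scale.

# Illustrative only: shifting an interval scale by a constant preserves distances.
responses = [1, 4, 7, 9]              # hypothetical Likert-style ratings on a 1-9 scale
shifted = [r - 5 for r in responses]  # rescaled to a -4..4 scale

def gaps(values):
    """Distances between consecutive observations."""
    return [b - a for a, b in zip(values, values[1:])]

print(gaps(responses))  # [3, 3, 2]
print(gaps(shifted))    # [3, 3, 2]  -- identical distances, so the scale is unchanged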
Ratio Scales
Ratio scales appear as interval scales with a true zero.
They have the following characteristics:
➢ Order: The order of the responses or observations matters.
➢ Distance: Ratio scales do have an interpretable distance.
➢ True Zero: There is a true zero.
For the web analyst, the statistics for ratio scales are the same as for interval scales.
Displays:
histograms or bar charts, line charts, and scatter plots.
PROPERTIES OF DATA:
For examining the properties of data, refer to the various definitions of data. Reference to these definitions reveals that the following are the properties of data:
a) Amenability of use
b) Clarity
c) Accuracy
d) Essence
e) Aggregation
f) Compression
g) Refinement
a) Amenability of use:
From the dictionary meaning of data it is learnt that data are facts used in deciding
something. In short, data are meant to be used as a base for arriving at definitive conclusions.
b) Clarity:
Data are a crystallized presentation. Without clarity, the meaning desired to be
communicated will remain hidden.
c) Accuracy:
Data should be real, complete and accurate. Accuracy is thus, an essential
property of data.
d) Essence:
Large quantities of data are collected and they have to be compressed and refined. Data so refined can present the essence, or derived qualitative value, of the matter.
e) Aggregation:
Aggregation is cumulating or adding up.
f) Compression:
Large amounts of data are always compressed to make them more meaningful.
Compress data to a manageable size. Graphs and charts are some examples of
compressed data.
g) Refinement:
Data require processing or refinement. When refined, they are capable of leading to conclusions or even generalizations. Conclusions can be drawn only when data are processed or refined.
5.(i) Give the Difference between Traditional Business Intelligence (BI) versus Big Data
Business Intelligence (BI) vs. Data Science:
➢ Business Intelligence (BI) basically analyzes past data to provide hindsight and insight that describe business trends. BI enables you to take data from external and internal sources, prepare it, run queries on it, and create dashboards to answer questions such as quarterly revenue analysis or business problems. BI can also evaluate the impact of certain events in the near future.
➢ Data Science is a more forward-looking approach, an exploratory way with the focus
on analyzing the past or current data and predicting the future outcomes with the aim
of making informed decisions. It answers the open-ended questions as to “what” and
“how” events occur.
ETL tools enable data integration strategies by allowing companies to gather data
from multiple data sources and consolidate it into a single, centralized location. ETL tools
also make it possible for different types of data to work together.
WORKING OF ETL:
The ETL process is comprised of 3 steps that enable data integration from source to
destination: data extraction, data transformation, and data loading.
Step 1: Extraction
Most businesses manage data from a variety of data sources and use a number of data
analysis tools to produce business intelligence. To execute such a complex data strategy, the
data must be able to travel freely between systems and apps. Before data can be moved to a
new destination, it must first be extracted from its source — such as a data warehouse or data
lake. In this first step of the ETL process, structured and unstructured data is imported and
consolidated into a single repository. Volumes of data can be extracted from a wide range of
data sources, including:
➢ Existing databases and legacy systems
➢ Cloud, hybrid, and on-premises environments
➢ Sales and marketing applications
➢ Mobile devices and apps
➢ CRM systems
➢ Data storage platforms
➢ Data warehouses
➢ Analytics tools
Step 2: Transformation
During this phase of the ETL process, rules and regulations can be applied that ensure
data quality and accessibility. You can also apply rules to help your company meet reporting
requirements. The process of data transformation is comprised of several sub-processes:
➢ Cleansing — inconsistencies and missing values in the data are resolved.
➢ Standardization — formatting rules are applied to the dataset.
➢ Deduplication — redundant data is excluded or discarded.
➢ Verification — unusable data is removed and anomalies are flagged.
➢ Sorting — data is organized according to type.
➢ Other tasks — any additional/optional rules can be applied to improve data quality.
Transformation is generally considered to be the most important part of the ETL process.
Data transformation improves data integrity — removing duplicates and ensuring that raw
data arrives at its new destination fully compatible and ready to use
Step 3: Loading
The final step in the ETL process is to load the newly transformed data into a new destination (data lake or data warehouse). Data can be loaded all at once (full load) or at scheduled intervals (incremental load).
Full loading
— In an ETL full loading scenario, everything that comes from the transformation
assembly line goes into new, unique records in the data warehouse or data repository. Though
there may be times this is useful for research purposes, full loading produces datasets that
grow exponentially and can quickly become difficult to maintain.
Incremental loading
— A less comprehensive but more manageable approach is incremental loading.
Incremental loading compares incoming data with what’s already on hand, and only produces
additional records if new and unique information is found. This architecture allows smaller,
less expensive data warehouses to maintain and manage business intelligence.
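A minimal, illustrative sketch of the three ETL steps in Python is shown below; the file name, column names, and SQLite destination are hypothetical, and a real pipeline would usually rely on a dedicated ETL tool.

# Illustrative ETL sketch: extract from a CSV, transform, load into SQLite.
# The file 'sales.csv' and its columns (customer, amount) are hypothetical.
import csv
import sqlite3

def extract(path):
    """Step 1: pull raw records out of the source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Step 2: cleanse (drop incomplete rows) and deduplicate."""
    seen, clean = set(), []
    for row in rows:
        if not row["customer"] or not row["amount"]:
            continue                      # cleansing: skip incomplete records
        key = (row["customer"].strip().lower(), row["amount"])
        if key in seen:
            continue                      # deduplication: skip repeated records
        seen.add(key)
        clean.append((row["customer"].strip(), float(row["amount"])))
    return clean

def load(rows, db_path="warehouse.db"):
    """Step 3: write the transformed records to the destination."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales.csv")))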
HDFS is a distributed file system that handles large data sets running on commodity
hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even
thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the
others being MapReduce and YARN. HDFS should not be confused with or replaced
by Apache HBase, which is a column-oriented non-relational database management
system that sits on top of HDFS and can better support real-time data needs with its in-memory processing engine.
Portability
To facilitate adoption, HDFS is designed to be portable across multiple hardware
platforms and to be compatible with a variety of underlying operating systems.
(ii) YARN.
YARN stands for “Yet Another Resource Negotiator “. It was introduced in Hadoop
2.0 to remove the bottleneck on Job Tracker which was present in Hadoop 1.0. YARN was
described as a “Redesigned Resource Manager” at the time of its launching, but it has
now evolved to be known as large-scale distributed operating system used for Big Data
processing.
The YARN architecture basically separates the resource management layer from the processing layer. The responsibility that the Job Tracker held in Hadoop 1.0 is now split between the resource manager and the application master. YARN also allows different data processing engines such as graph processing, interactive processing, stream processing, and batch processing to run and process data stored in HDFS (Hadoop Distributed File System), thus making the system much more efficient. Through its various components, it can dynamically allocate resources and schedule application processing. For large-volume data processing, it is quite necessary to manage the available resources properly so that every application can leverage them.
YARN Features:
YARN gained popularity because of the following features
Scalability:
The scheduler in Resource manager of YARN architecture allows Hadoop to extend
and manage thousands of nodes and clusters.
Compatibility:
YARN supports the existing map-reduce applications without disruptions thus
making it compatible with Hadoop 1.0 as well.
Cluster Utilization:
YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
Multi-tenancy:
It allows multiple engine access thus giving organizations a benefit of multi-tenancy.
Application manager:
It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Master container if a task fails.
Node Manager:
It takes care of an individual node in a Hadoop cluster and manages the application and workflow on that particular node. Its primary job is to keep up with the Resource Manager. It monitors resource usage, performs log management and also kills a container based on directions from the resource manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
Application Master:
An application is a single job submitted to a framework. The application master is responsible for negotiating resources with the resource manager, tracking the status and monitoring the progress of a single application. The application master requests the container from the node manager by sending a Container Launch Context (CLC), which includes everything an application needs to run. Once the application is started, it sends a health report to the resource manager from time to time.
Container:
It is a collection of physical resources such as RAM, CPU cores and disk on a single node. Containers are invoked by a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
Imagine we have a collection of items we’d like to process somehow. For instance,
the items might be website logs, the texts of various books, image files, or anything else. A
basic version of the MapReduce algorithm consists of the following steps:
1. Use a mapper function to turn each item into zero or more key-value pairs. (Often
this is called the map function, but there is already a Python function called map and we don’t
need to confuse the two.)
2. Collect together all the pairs with identical keys.
3. Use a reducer function on each collection of grouped values to produce output
values for the corresponding key.
This is all sort of abstract, so let’s look at a specific example. There are few absolute
rules of data science, but one of them is that your first MapReduce example has to involve
counting words.
def word_count_old(documents):
    """word count not using MapReduce"""
    # assumes: from collections import Counter, and a tokenize() helper defined earlier
    return Counter(word
                   for document in documents
                   for word in tokenize(document))
With millions of users the set of documents (status updates) is suddenly too big to fit
on your computer. If you can just fit this into the MapReduce model, you can use some “big
data” infrastructure that your engineers have implemented.
First, we need a function that turns a document into a sequence of key-value pairs.
We’ll want our output to be grouped by word, which means that the keys should be words.
And for each word, we’ll just emit the value 1 to indicate that this pair corresponds to one
occurrence of the word:
def wc_mapper(document):
    """for each word in the document, emit (word, 1)"""
    for word in tokenize(document):
        yield (word, 1)
Skipping the “plumbing” step 2 for the moment, imagine that for some word we’ve
collected a list of the corresponding counts we emitted. Then to produce the overall count
for that word we just need:
def wc_reducer(word, counts):
    """sum up the counts for a word"""
    yield (word, sum(counts))
Returning to step 2, we now need to collect the results from wc_mapper and feed
them to wc_reducer. Let’s think about how we would do this on just one computer:
def word_count(documents):
    """count the words in the input documents using MapReduce"""
    collector = defaultdict(list)   # place to store grouped values
    for document in documents:
        for word, count in wc_mapper(document):
            collector[word].append(count)
    return [output
            for word, counts in collector.items()   # .iteritems() in Python 2
            for output in wc_reducer(word, counts)]
Imagine that we have three documents ["data science", "big data", "science
fiction"]. Then wc_mapper applied to the first document yields the two pairs ("data", 1) and
("science", 1). After we’ve gone through all three documents, the collector contains
{ "data" : [1, 1],
"science" : [1, 1],
"big" : [1],
"fiction" : [1] }
Then wc_reducer produces the count for each word:
[("data", 2), ("science", 2), ("big", 1), ("fiction", 1)]
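The excerpt assumes a tokenize helper and standard-library imports defined earlier in the source text. A minimal, hypothetical stand-in that makes the example runnable end to end, together with the wc_mapper, wc_reducer, and word_count functions above, might look like this:

# Minimal glue to run the word-count example; tokenize() here is a
# simple stand-in for the original helper, not the author's exact code.
import re
from collections import Counter, defaultdict

def tokenize(document):
    """Lowercase the document and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", document.lower())

documents = ["data science", "big data", "science fiction"]
print(word_count(documents))
# e.g. [('data', 2), ('science', 2), ('big', 1), ('fiction', 1)]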
Companies in the present market need to collect and analyze data because:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to
businesses when they have to store large amounts of data. These tools help
organizations in identifying more effective ways of doing business.
2. Time-Saving
Real-time in-memory analytics helps companies to collect data from various
sources. Tools like Hadoop help them to analyze data immediately thus helping in
making quick decisions based on the learnings.
If a company does not know what its customers want, its success will suffer. It will result in the loss of clientele, which creates an adverse effect on business growth.
Big data analytics helps businesses to identify customer-related trends and patterns. Customer behavior analysis leads to a profitable business.
BIG DATA:
Big Data refers to massive amounts of data produced by different sources like social media platforms, web logs, sensors, IoT devices, and many more. It can be either structured (like tables in a DBMS), semi-structured (like XML files), or unstructured (like audio, video, and images).
Traditional database management systems are not able to handle this vast amount of
data. Big Data helps companies to generate valuable insights.
Companies use Big Data to refine their marketing campaigns and techniques.
Companies use it in machine learning projects to train machines, predictive modeling, and
other advanced analytics applications.
There are many companies like Amazon, Netflix, Spotify, LinkedIn, Swiggy, etc., which use big data analytics. Banking sectors make the maximum use of Big Data analytics. The education sector is also using data analytics to enhance students' performance as well as to make teaching easier for instructors.
• Welfare Schemes
In making faster and more informed decisions regarding various political
programs
To identify areas that are in immediate need of attention
To overcome national challenges such as unemployment, terrorism, energy
resources exploration, and much more.
• Cyber Security
Big Data is hugely used for fraud detection in the domain of cyber security.
It is also used in catching tax evaders.
Cyber security engineers protect networks and data from unauthorized access.
11.Describe the roles and stages in data science project.
Phase 1—Discovery:
Before you begin the project, it is important to understand the various specifications,
requirements, priorities and required budget. You must possess the ability to ask the right
questions. Here, you assess if you have the required resources present in terms of people,
technology, time, and data to support the project. In this phase, you also need to frame the
business problem and formulate initial hypotheses (IH) to test.
You can use R for data cleaning, transformation, and visualization. This will help you to spot
the outliers and establish a relationship between the variables. Once you have cleaned and
prepared the data, it’s time to do exploratory analytics on it.
1. R has a complete set of modeling capabilities and provides a good environment for
building interpretive models.
2. SQL Analysis services can perform in-database analytics using common data mining
functions and basic predictive models.
3. SAS/ACCESS can be used to access data from Hadoop and is used for creating
repeatable and reusable model flow diagrams.
Although many tools are available in the market, R is the most widely used. Now that you have gained insight into the nature of your data and have decided on the algorithms to be used, in the next stage you will apply the algorithms and build a model.
(ii) List out the various challenges faced in big data in detail
VARIOUS CHALLENGES FACED IN BIG DATA:
1. Lack of Knowledgeable Professionals
To run these modern technologies and Big Data tools, companies need skilled data professionals. These professionals include data scientists, data analysts, and data engineers who work with the tools and make sense of giant data sets. One of the Big Data challenges that any company faces is a shortage of Big Data professionals. This is often because data handling tools have evolved rapidly, but in most cases, the professionals haven't.
Solution:
Companies are investing extra money in the recruitment of skilled professionals. They also have to offer training programs to the existing staff to get the most out of them. Another important step taken by organizations is purchasing data analytics solutions powered by artificial intelligence/machine learning.
Solution:
Big Data workshops and seminars must be held at companies for everybody. Basic training programs must be arranged for all the employees who handle data regularly and are part of Big Data projects. All levels of the organization must inculcate a basic understanding of data concepts.
Solution:
You can either hire experienced professionals who know far more about these tools, or go for Big Data consulting. Here, consultants will give a recommendation of the best tools based on your company's scenario.
Solution:
Companies need to solve their data integration problems by purchasing the proper data integration tools.
6. Securing Data
Securing these huge sets of data is one of the daunting challenges of Big Data. Often companies are so busy understanding, storing, and analyzing their data sets that they push data security to later stages. This is not a smart move, as unprotected data repositories can become breeding grounds for malicious hackers. Companies can lose up to $3.7 million for a stolen record or a data breach.
Solution:
Companies are recruiting more cybersecurity professionals to protect their data. Other steps taken for securing Big Data include:
• Data encryption
• Data segregation
• Identity and access control
• Implementation of endpoint security
• Real-time security monitoring
• Use of Big Data security tools, such as IBM Guardium
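As a small illustration of the data-encryption step (not part of the original notes), a sketch using the cryptography package's Fernet API might look like the following; the record content and key handling are simplified, hypothetical examples.

# Illustrative only: symmetric encryption of a sensitive record at rest
# using the 'cryptography' package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, store this in a key vault
cipher = Fernet(key)

record = b"customer_id=1042,card_last4=9921"   # hypothetical sensitive record
token = cipher.encrypt(record)                  # ciphertext that is safe to store

assert cipher.decrypt(token) == record          # only key holders can read it back
print(token[:16], b"...")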
Often, the storage framework involves a software layer for managing a collection of storage resources and providing many of these capabilities. The software configures storage for replication to provide a level of fault tolerance, as well as managing
communications using standard protocols (such as UDP or TCP/IP) among the different
processing nodes. In addition, some frameworks will replicate stored data, providing
redundancy in the event of a fault or failure.
Step 2: Fix structural errors Structural errors are when you measure or transfer data
and notice strange naming conventions, typos, or incorrect capitalization. These
inconsistencies can cause mislabelled categories or classes. For example, you may find
“N/A” and “Not Applicable” both appear, but they should be analyzed as the same category.
Step 3: Filter unwanted outliers Often, there will be one-off observations where, at a
glance, they do not appear to fit within the data you are analyzing. If you have a legitimate
reason to remove an outlier, like improper data entry, doing so will help the performance of
the data you are working with. However, sometimes it is the appearance of an outlier that will
prove a theory you are working on.
Step 4: Handle missing data You can’t ignore missing data because many algorithms
will not accept missing values. There are a few ways to deal with missing data. None of them is optimal, but each can be considered.
1. As a first option, you can drop observations that have missing values, but doing this
will drop or lose information, so be mindful of this before you remove it.
2. As a second option, you can input missing values based on other observations;
again, there is an opportunity to lose integrity of the data because you may be operating from
assumptions and not actual observations.
3. As a third option, you might alter the way the data is used to effectively navigate
null values.
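A hedged sketch of Steps 2-4 using pandas is shown below; the DataFrame, column names, and chosen strategies are hypothetical examples rather than a prescribed recipe.

# Illustrative data-cleaning sketch with pandas; the data is made up.
import pandas as pd

df = pd.DataFrame({
    "status": ["N/A", "Not Applicable", "active", "Active", None],
    "age":    [34, 29, 41, 41, 27],
})

# Step 2: fix structural errors -- unify naming and capitalization
df["status"] = df["status"].str.strip().str.lower()
df["status"] = df["status"].replace({"not applicable": "n/a"})

# Step 3: filter unwanted outliers (here: an implausible age range)
df = df[df["age"].between(0, 120)]

# Step 4: handle missing data -- option 1: drop, or option 2: impute
df_dropped = df.dropna(subset=["status"])        # drop observations with missing values
df_imputed = df.fillna({"status": "unknown"})    # impute a value based on other knowledge

print(df_imputed)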
Step 5: Validate and QA At the end of the data cleaning process, you should be able
to answer these questions as a part of basic validation:
• Does the data make sense?
• Does the data follow the appropriate rules for its field?
• Does it prove or disprove your working theory, or bring any insight to light?
• Can you find trends in the data to help you form your next theory?
• If not, is that because of a data quality issue?
False conclusions because of incorrect or “dirty” data can inform poor business
strategy and decision-making. False conclusions can lead to an embarrassing moment in a
reporting meeting when you realize your data doesn’t stand up to scrutiny. Before you get
there, it is important to create a culture of quality data in your organization. To do this, you
should document the tools you might use to create this culture and what data quality means to
you.
In data analysis, sampling is the practice of analyzing a subset of all data in order to
uncover the meaningful information in the larger data set. For example, if you wanted to
estimate the number of trees in a 100-acre area where the distribution of trees was fairly
uniform, you could count the number of trees in 1 acre and multiply by 100, or count the trees
in a half acre and multiply by 200 to get an accurate representation of the entire 100 acres.
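A small illustrative sketch of this idea, with randomly generated tree counts (purely synthetic numbers), follows:

# Illustrative only: estimate a population total from a random sample.
import random

random.seed(42)
acres = [random.randint(80, 120) for _ in range(100)]   # trees per acre (synthetic)

sample = random.sample(acres, 10)                        # inspect only 10 of the 100 acres
estimate = sum(sample) / len(sample) * len(acres)        # scale the sample mean up

print("Estimated total trees:", round(estimate))
print("Actual total trees:   ", sum(acres))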
SAMPLING THRESHOLDS:
Default reports are not subject to sampling. Ad-hoc queries of your data are subject to
the following general thresholds for sampling:
• Analytics Standard: 500k sessions at the property level for the date range you
are using
• Analytics 360: 100M sessions at the view level for the date range you are
using
• Queries may include events, custom variables, and custom dimensions and
metrics. All other queries have a threshold of 1M
• Historical data is limited to up to 14 months (on a rolling basis)
In some circumstances, you may see fewer sessions sampled. This can result from the
complexity of your Analytics implementation, the use of view filters, query complexity for
segmentation, or some combination of those factors. Although we make a best effort to
sample up to the thresholds described above, it's normal to sometimes see slightly fewer
sessions returned for an ad-hoc query.
SAMPLING USED:
The following sections explain where you can expect session sampling in Analytics
reports.
Default reports
Analytics has a set of preconfigured, default reports listed in the left pane under
Audience, Acquisition, Behavior, and Conversions. Analytics stores one complete, unfiltered
set of data for each property in each account. For each reporting view in a property, Analytics
also creates tables of aggregated dimensions and metrics from the complete, unfiltered data.
When you run a default report, Analytics queries the tables of aggregated data to quickly
deliver unsampled results.
PART – C
1. Create a brief summary of the challenges faced in processing big data nowadays.
Solution:
Companies are investing extra money in the recruitment of skilled professionals. They also have to offer training programs to the existing staff to get the most out of them. Another important step taken by organizations is purchasing data analytics solutions powered by artificial intelligence/machine learning.
Solution:
Big Data workshops and seminars must be held at companies for everybody. Basic training programs must be arranged for all the employees who handle data regularly and are part of Big Data projects. All levels of the organization must inculcate a basic understanding of data concepts.
Solution:
You can either hire experienced professionals who know far more about these tools, or go for Big Data consulting. Here, consultants will give a recommendation of the best tools based on your company's scenario.
Solution:
Companies need to solve their data integration problems by purchasing the proper data integration tools.
6. Securing Data
Securing these huge sets of data is one of the daunting challenges of Big Data. Often companies are so busy understanding, storing, and analyzing their data sets that they push data security to later stages. This is not a smart move, as unprotected data repositories can become breeding grounds for malicious hackers. Companies can lose up to $3.7 million for a stolen record or a data breach.
Solution:
Companies are recruiting more cybersecurity professionals to protect their data. Other steps taken for securing Big Data include:
• Data encryption
• Data segregation
• Identity and access control
• Implementation of endpoint security
• Real-time security monitoring
• Use of Big Data security tools, such as IBM Guardium
4. Evaluate the various formats of data and illustrate with real-time examples.