By Mohamed El-Refaey
July 15, 2015 10:15 AM EDT
This post is the first in a series of blog posts that will explore Big Data and analytics tools. I will walk through easy steps to start working with tools such as Apache Hadoop, Pig and Mahout, use them to solve some large-scale analytics and learning problems, and shed light on some of the challenges we face while working with them.
1. Apache Hadoop
1.1 Overview
Hadoop is a framework that simplifies the processing of data sets distributed across clusters of servers. Two of the main components of Hadoop are HDFS and MapReduce. HDFS is the file system that Hadoop uses to store all of its data. This file system spans all the nodes that Hadoop uses, which can be on a single server or spread across a large number of servers.
In this section, we will go through the instructions for getting Hadoop up and running, with the configuration needed to make it useful for other components and frameworks that integrate with or depend on Hadoop (e.g. Hive, Pig, HBase).
Note: The installation will be in pseudo-distributed mode, in which all the Hadoop daemons run on a single machine.
1.2 Tools and Versions
I've used the following tools and versions throughout this installation:
- Ubuntu 14.04 LTS
- Java 1.7.0_65 (java-7-openjdk-amd64)
- Hadoop 2.5.1
1.3 Installation and Configurations
1. Install Java using the following command:
apt-get update
apt-get install default-jdk
2. Create SSH keys with an empty passphrase using the following commands:
ssh-keygen -t rsa -P ''
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
3. Download Hadoop tar file using:
wget http://www.webhostingreviewjam.com/mirror/apache/hadoop/common/hadoop-2.5.1/hadoop-2.5.1.tar.gz
4. Extract the tar file using:
tar -xzvf hadoop-2.5.1.tar.gz
5. Move the extracted files to a version-independent location that is easy to recognize, so that the Hadoop version can be changed later without much modification:
mv hadoop-2.5.1/ /usr/local/hadoop
6. Configure the following environment variables in the ~/.bashrc file, so that they are set in every new shell session:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
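If this step gets repeated, the same block can end up in the rc file twice. As a minimal sketch, assuming the START marker comment is used to detect an earlier run, the append can be made idempotent. The function name append_hadoop_vars is hypothetical, and the demonstration writes to a scratch file rather than the real ~/.bashrc:

```shell
# Hypothetical helper: append the Hadoop variables to an rc file only once.
append_hadoop_vars() {
  rc_file="$1"
  # Skip if the start marker is already there (fixed string, whole line).
  if grep -qxF '#HADOOP VARIABLES START' "$rc_file" 2>/dev/null; then
    return 0
  fi
  # Quoted heredoc delimiter: $PATH and $HADOOP_INSTALL are written
  # literally and only expanded later, when the rc file is sourced.
  cat >> "$rc_file" <<'EOF'
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
EOF
}

# Demonstration against a scratch file.
demo_rc="$(mktemp)"
append_hadoop_vars "$demo_rc"
append_hadoop_vars "$demo_rc"   # second call finds the marker and does nothing
```

To apply it for real, call append_hadoop_vars ~/.bashrc instead of using the scratch file.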
7. Source the bashrc file after the changes, so that the current shell picks them up:
source ~/.bashrc
8. Edit the hadoop-env.sh file using vim:
vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
In hadoop-env.sh, set JAVA_HOME explicitly to the JDK path. That makes the value of JAVA_HOME always available to Hadoop whenever it starts.
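The same edit can be scripted instead of done by hand in vim. This is only a sketch under a few assumptions: the helper name set_java_home is hypothetical, the JDK path is the one from step 6, and the demo file imitates an export line of the kind a stock hadoop-env.sh is expected to contain. It runs against a scratch file:

```shell
# Hypothetical helper: make an env file export a fixed JAVA_HOME.
set_java_home() {
  env_file="$1"
  java_home="$2"
  tmp_file="$(mktemp)"
  if grep -q '^export JAVA_HOME=' "$env_file"; then
    # Replace the existing export line. '|' is the sed delimiter because
    # the JDK path contains slashes (the path must not contain '|' or '&').
    sed "s|^export JAVA_HOME=.*|export JAVA_HOME=${java_home}|" "$env_file" > "$tmp_file"
  else
    # No export line yet: keep the content and append one.
    cat "$env_file" > "$tmp_file"
    printf 'export JAVA_HOME=%s\n' "$java_home" >> "$tmp_file"
  fi
  # Write back through redirection so the original file keeps its permissions.
  cat "$tmp_file" > "$env_file"
  rm -f "$tmp_file"
}

# Demonstration on a scratch file (assumed shape of the stock line).
demo_env="$(mktemp)"
printf '%s\n' '# The java implementation to use.' 'export JAVA_HOME=${JAVA_HOME}' > "$demo_env"
set_java_home "$demo_env" /usr/lib/jvm/java-7-openjdk-amd64
cat "$demo_env"
```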
9. Edit the core-site.xml file using vim as well:
vim /usr/local/hadoop/etc/hadoop/core-site.xml
In a pseudo-distributed setup, this file typically sets the default filesystem URI through the fs.defaultFS property.
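As a sketch of what a single-node core-site.xml commonly contains: the value hdfs://localhost:9000 is an assumption (a conventional choice for a local NameNode), not necessarily what I used in this installation. The snippet writes the file into a scratch directory so it can be reviewed before being copied into place:

```shell
# Write a candidate core-site.xml into a scratch directory for review.
# ASSUMPTION: hdfs://localhost:9000 is a conventional single-node value.
core_out="$(mktemp -d)"
cat > "$core_out/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
echo "Review $core_out/core-site.xml, then copy it to /usr/local/hadoop/etc/hadoop/"
```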
10. Edit the YARN configuration file yarn-site.xml as follows:
vim /usr/local/hadoop/etc/hadoop/yarn-site.xml
In a single-node setup, this file typically enables the MapReduce shuffle auxiliary service on the NodeManager.
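A minimal yarn-site.xml for a single node might look like the one generated below. Treat it as an assumption rather than the exact file from this installation: it sets only yarn.nodemanager.aux-services to mapreduce_shuffle, which MapReduce jobs on YARN need for their shuffle phase. The file is written to a scratch directory for review:

```shell
# Write a candidate yarn-site.xml into a scratch directory for review.
# ASSUMPTION: only the shuffle auxiliary service is configured here.
yarn_out="$(mktemp -d)"
cat > "$yarn_out/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
echo "Review $yarn_out/yarn-site.xml, then copy it to /usr/local/hadoop/etc/hadoop/"
```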
11. Create and edit the mapred-site.xml file:
vim /usr/local/hadoop/etc/hadoop/mapred-site.xml
The file contains a property that specifies which framework will be used for MapReduce (mapreduce.framework.name, set to yarn when jobs run on YARN).
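The property in question is most likely mapreduce.framework.name. As a sketch, assuming the value yarn (the usual choice when YARN is started as in this setup), the file can be generated into a scratch directory like this:

```shell
# Write a candidate mapred-site.xml into a scratch directory for review.
# ASSUMPTION: jobs are meant to run on YARN, hence the value "yarn".
mapred_out="$(mktemp -d)"
cat > "$mapred_out/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
echo "Review $mapred_out/mapred-site.xml, then copy it to /usr/local/hadoop/etc/hadoop/"
```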
12. Edit the hdfs-site.xml file to specify the directories that will be used by the datanode and the namenode on this server.
vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Create the two directories:
mkdir -p /usr/local/hadoop_store/hdfs/namenode
mkdir -p /usr/local/hadoop_store/hdfs/datanode
After editing, the file contains properties that point the namenode and the datanode at these directories.
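A sketch of an hdfs-site.xml that matches the two directories created in this step. The directory paths come from the mkdir commands; the property names are the Hadoop 2.x ones, and dfs.replication set to 1 is my assumption for a single-node setup where only one datanode can hold each block:

```shell
# Write a candidate hdfs-site.xml into a scratch directory for review.
# ASSUMPTION: replication factor 1, since there is only one datanode.
hdfs_out="$(mktemp -d)"
cat > "$hdfs_out/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
EOF
echo "Review $hdfs_out/hdfs-site.xml, then copy it to /usr/local/hadoop/etc/hadoop/"
```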
13. Format the new Hadoop file system using the following command:
hdfs namenode -format
Note: This operation needs to be done once before we start using Hadoop. If it is executed again after Hadoop has been used, it will destroy all the data on the Hadoop filesystem.
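Because a second format is destructive, a small guard before running it can help. The sketch below relies on the NameNode keeping its metadata in a current subdirectory of its name directory; the function name namenode_is_unformatted is hypothetical, and the demonstration uses scratch directories instead of /usr/local/hadoop_store/hdfs/namenode:

```shell
# Hypothetical guard: succeeds only if the given NameNode directory has no
# "current" subdirectory, i.e. it has apparently never been formatted.
namenode_is_unformatted() {
  [ ! -d "$1/current" ]
}

# Demonstration with scratch directories standing in for the real one.
fresh_dir="$(mktemp -d)"
used_dir="$(mktemp -d)"
mkdir "$used_dir/current"

for dir in "$fresh_dir" "$used_dir"; do
  if namenode_is_unformatted "$dir"; then
    echo "$dir: no existing metadata, formatting would be safe"
  else
    echo "$dir: already formatted, refusing to format again"
  fi
done
```

In real use, the check would wrap the format command, for example: namenode_is_unformatted /usr/local/hadoop_store/hdfs/namenode && hdfs namenode -format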
14. Now that all the configuration is done, we can start using Hadoop. First, run the following shell scripts:
start-dfs.sh
start-yarn.sh
To make sure everything is okay and the right processes are running, run the jps command. It should list the HDFS daemons (NameNode, DataNode, SecondaryNameNode) and the YARN daemons (ResourceManager, NodeManager).
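Checking the jps listing by eye is easy to get wrong, so here is a sketch of a small filter that reports which of the five expected daemons are absent. The helper name missing_daemons is hypothetical, and the sample input is illustrative only: the PIDs are made up, and NodeManager is left out on purpose to show a failure being reported.

```shell
# Hypothetical helper: read jps output on stdin and print every expected
# daemon that is not listed. Matching is on the exact process name, so
# "SecondaryNameNode" does not count as "NameNode".
missing_daemons() {
  awk '
    BEGIN { n = split("NameNode DataNode SecondaryNameNode ResourceManager NodeManager", want, " ") }
    { seen[$2] = 1 }
    END { for (i = 1; i <= n; i++) if (!(want[i] in seen)) print want[i] }
  '
}

# Illustrative input, not captured from a real run.
sample_jps_output='4868 SecondaryNameNode
4549 NameNode
5019 ResourceManager
4691 DataNode
5610 Jps'

missing="$(printf '%s\n' "$sample_jps_output" | missing_daemons)"
echo "Missing daemons: ${missing:-none}"
```

On a running installation, the helper would be fed directly from the tool: jps | missing_daemons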
15. We can run the MapReduce examples that ship with the Hadoop bundle, but first we need to prepare HDFS.
Create the HDFS directories required to execute MapReduce jobs:
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/mohamed
Then copy the input files to be processed into the distributed filesystem:
hdfs dfs -put {here is the path to the files to be copied} input
16. We can check the web consoles for the resource manager, the HDFS nodes and the running jobs. With the Hadoop 2.x defaults, the ResourceManager console listens on port 8088 and the NameNode console on port 50070.
Issues and problems:
I've experienced some issues related to:
- Formatting the HDFS. I resolved it by changing the permissions and ownership for the user who formats the namenode and datanode.
- Connecting to the resource manager, which failed with the following error:
ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); maxRetries=45
INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
I resolved it by adding a few properties to yarn-site.xml.
We have reached the end of our first post on big data and analytics. I hope you enjoyed reading it and experimenting with Hadoop installation and configuration. The next post will be about Apache Pig.
Mohamed El-Refaey works as head of research and development at EDC (Egypt Development Center), a member of NTG. He previously worked for Qlayer (acquired by Sun Microsystems), where his passion for the cloud computing domain started. He has more than 10 years of experience in software design and development in e-commerce, BPM, EAI, Web 2.0, banking applications, financial markets, Java and J2EE, HIPAA, SOX, BPEL and SOA, and has spent the last two years focusing on virtualization technology and cloud computing through studies, technical papers, research, and participation in international groups and events. He was awarded in recognition of innovation and thought leadership while working as an IT Specialist at EDS (an HP company). He is also a member of the Cloud Computing Interoperability Forum (CCIF) and of the UCI (Unified Cloud Interface) open source project, to which he contributed the project architecture.