How to get started with Hadoop – Hadoop căn bản

1 of the most painful jobs of a system engineer is to build a whole system by installing multiple packages, one-by-one. We all worry about incompatibility and dependencies

With Hadoop, you can do that with big help from HDP (Hortonworks Data Platform)

Great tutorials and documentation can be found here

http://hortonworks.com/hdp/downloads/

The order of methods you should try to install Hadoop

hortonworks

Mình viết bài này cho bạn nào muốn bắt đầu với Hadoop mà không biết bắt đầu từ đâu. Chỉ đơn giản là HDP, những trang khác như Cloudera hay MapR tài liệu không tốt bằng. HDP còn có thể cài trên cả Windows Server (but you should not do that, shouldn’t you)

Chỉ ngắn gọn vậy thôi. Nếu bạn đã từng cài Hadoop bằng cách tải từ Apache về thì thấy rất mệt mỏi.

Work smarter, not harder (well, actually we should be working harder, sometimes)

 

Hadoop 2.2 and Flume 1.4 Protobuf Problem and Solution

I have to say the big THANK to the author of  “Hadoop in Practice” : Alex Holmes

Source : http://grepalex.com/2014/02/09/flume-and-hadoop-2.2/

The problem you may encounter while  trying to integrate Hadoop 2.2 and Flume 1.4 is the incompatibility between protobuf versions :

2014-04-15 13:56:23,251 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR – org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:422)] process failed

java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$RecoverLeaseRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
at java.lang.Class.privateGetPublicMethods(Class.java:2651)
at java.lang.Class.privateGetPublicMethods(Class.java:2661)
at java.lang.Class.getMethods(Class.java:1467)
at sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
at sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
at org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:328)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:235)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:139)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:510)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:453)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:136)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:226)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:220)
at org.apache.flume.sink.hdfs.BucketWriter$8$1.run(BucketWriter.java:536)
at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:160)
at org.apache.flume.sink.hdfs.BucketWriter.access$1000(BucketWriter.java:56)
at org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:533)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Exception in thread “SinkRunner-PollingRunner-DefaultSinkProcessor” java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$RecoverLeaseRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
at java.lang.Class.privateGetPublicMethods(Class.java:2651)
at java.lang.Class.privateGetPublicMethods(Class.java:2661)
at java.lang.Class.getMethods(Class.java:1467)
at sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
at sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
at org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:328)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:235)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:139)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:510)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:453)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:136)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:226)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:220)
at org.apache.flume.sink.hdfs.BucketWriter$8$1.run(BucketWriter.java:536)
at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:160)
at org.apache.flume.sink.hdfs.BucketWriter.access$1000(BucketWriter.java:56)
at org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:533)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

This is his  post :

Google really screwed the pooch with their protobuf 2.5 release. Code generated with protobuf 2.5 is binary incompatible with older protobuf libraries (I guess Google missed the semantic versioning boat on this release). Unfortunately the current stable release of Flume 1.4 packages protobuf 2.4.1 and if you try and use HDFS on Hadoop 2.2 as a sink you’ll be smacked with the following exception:

java.lang.VerifyError: class org.apache.hadoop.security.proto.SecurityProtos$GetDelegationTokenRequestProto
overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    ...
    at org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
    at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:328)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:235)

Hadoop 2.2 uses protobuf 2.5 for its RPC, and Flume loads its older packaged version of protobuf ahead of Hadoop’s, which causes this error. To fix this you’ll need to move both protobuf and guava out of Flume’s lib directory. The following command moves them into your home directory.

$ mv ${flume_bin}/lib/{protobuf-java-2.4.1.jar,guava-10.0.1.jar} ~/

Now if you restart your Flume agent you’ll be able to target HDFS as a sink with Hadoop 2.2. Great success!

Flume’s next release will move to protobuf 2.5 so this problem should magically disappear in due course.

Hadoop 2.2 Single Node Installation on CentOS 6.5

By far the best tutorial for you to get started with Hadoop installation.

Source : http://alanxelsys.com/2014/02/01/hadoop-2-2-single-node-installation-on-centos-6-5/

Introduction

This HOWTO covers Hadoop 2.2 installation with CentOS 6.5. My series of tutorials are meant just as that – tutorials. The intent is to allow the user to gain familiarity with the application and should not be construed as any type of best practices document to be used in a production environment and as such performance, reliability and security considerations are compromised. The tutorials are freely available and may be distributed with the proper acknowledgements. Actual screenshots of the commands are used to eliminate any possibility of typographical errors, in addition long sequences of text are placed in front of the screenshots to facilitate copy and paste. Command text is printed using Courier font. In general the document will only cover the bare minimum of how to get a single node cluster up and running with the emphasis on HOW rather than WHY. For more in depth information the reader should consult the many excellent publications on Hadoop such as Tom White’s – Hadoop: The Definitive Guide, 3rd edition and Eric Sammer’s – Hadoop Operations along with the Apache Hadoop website.

Please consult www.alan-johnson.net for an online version of this document.

Prerequisites

  • CentOS 6.5 installed

Machine configuration

In this HOWTO a physical machine was used; but for educational purposes Vmware Workstation or Virtualbox (https://www.virtualbox.org/) would work just as well. The screenshot below shows acceptable VM machine settings for VMWare.

Note an additional Network Adapter and physical drive have been added. Memory allocation is 2GB which is sufficient for the tutorial.

User configuration

If installing CentOS from scratch then select a user <hadoopuser> at installation time otherwise the user can be added by the command below. In addition create a group called <hadoopgroup>.

Note the initial configuration is done as user root.

=> passwd hadoopuser
to enable log-in for this one. 

Now make hadoopuser a member of hadoopgroup.

usermod –g hadoopgroup hadoopuser

Verify by issuing the id command.ss

id hadoopuser

The next step is to give hadoopuser access to sudo commands. Do this by executing thevisudo command and adding the highlighted line shown below.

Reboot and now log in as user hadoopuser.

Setting up ssh

Setup ssh for password-less authentication using keys.

ssh-keygen -t rsa -P ”

Next change file ownership and mode.

sudo chown hadoopuser ~/.ssh

sudo chmod 700 ~/.ssh

sudo chmod 600 ~/.ssh/id_rsa

Then append the public key to the file authorized_keys

sudo cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Change permissions.

sudo chmod 600 ~/.ssh/authorized_keys

Edit /etc/ssh/sshd_config

Set PasswordAuthentication to no and allow empty passwords
as shown below in the extract of the file.

Verify that login can be accomplished without requiring a password.

Installing and configuring java

It is recommended to install the full openJDK package to take advantage of some of the java tools,

Installing openJDK

yum install java-1.7.0-openjdk*

After the installation verify the java version

java -version

The folder etc/alternatives contains a link to the java installation; perform a long listing of the file to show the link and use it as the location for JAVA_HOME.

Set the JAVA_HOME environmental variable by editing ~/.bashrc

Installing Hadoop

Downloading Hadoop

From the Hadoop releases page http://hadoop.apache.org/releases.html , download hadoop-2.2.0.tar.gz from one of the mirror sites.

Next untar the file

tar xzvf hadoop-2.2.0.tar.gz

Move the untarred folder

sudo mv hadoop-2.2.0 /usr/local/hadoop

Change the ownership with sudo chown -R hadoopuser:hadoopgroup /usr/local/hadoop

Next create namenode and datanode folders

mkdir -p ~/hadoopspace/hdfs/namenode

mkdir -p ~/hadoopspace/hdfs/datanode

Configuring Hadoop

Next edit ~/.bashrc to set up the environmental variables for Hadoop

# User specific aliases and functions

export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export PATH=$PATH:$HADOOP_INSTALL/sbin
export PATH=$PATH:$HADOOP_INSTALL/bin

Now apply the variables.

There are a number of xml files within the Hadoop folder that require editing which are:

  • mapred-site.xml
  • yarn-site.xml
  • core-site.xml
  • hdfs-site.xml
  • hadoop-env.sh

The files can be found in /usr/local/hadoop/etc/hadoop/. First copy themapred-site template file over and then edit it.

mapred-site.xml

Add the following text between the configuration tabs.

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

yarn-site.xml

Add the following text between the configuration tabs.

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

core-site.xml

Add the following text between the configuration tabs.
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

hdfs-site.xml

Add the following text between the configuration tabs.

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoopuser/hadoopspace/hdfs/namenode</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoopuser/hadoopspace/hdfs/datanode</value>
</property>

Note other locations can be used in hdfs by separating values with a comma, e.g.

file:/home/hadoopuser/hadoopspace/hdfs/datanode, .disk2/Hadoop/datanode, . .

hadoop-env.sh

Add an entry for JAVA_HOME

export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/

=> Actually you don’t need to configure JAVA_HOME here since you’ve already done that in ~/.bashrc

Next format the namenode.

. . .

Issue the following commands.

start-dfs.sh
start-yarn.sh

Issue the jps command and verify that the following jobs are running:

At this point Hadoop has been installed and configured

Testing the installation

A number of test files exist that can be used to benchmark Hadoop. Entering the command below without any arguments will list available tests.

The TestDFSIO test below can be used to measure read performance – initially create the files and then read:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 100

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 100

The results are logged in TestDFSIO_results.log which will show throughput rates:

During the test run a message will be printed with a tracking url such as that shown below:

The link can be selected or the address can be pasted into a browser.

Another test is mrbench which is a map/reduce test.

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar mrbench –maps 100

Finally the test below is used to calculate pi. The first parameter refers to the number of maps and the second is the number of samples for each map.

hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 20

. . .

Note accuracy can be improved by increasing the value of the second parameter.

Working from the command line

Invoking a command without any or insufficient parameters will generally print out help data”

hdfs commands

hdfs dfsadmin –help

. . .

hadoop commands

hadoop version

Web Access

The location for checking the Namenode status is at localhost:50070/. This web page contains status information relating to the cluster.

There are also links for browsing the filesystem.

Logs can also be examined from the NameNode Logs link.

. . .

The secondary namenode can be accessed using port 50090

On line documentation

Comprehensive documentation can be found at the Apache website or locally using a browser by pointing it at $HADOOP_INSTALL/share/doc/Hadoop/index.html/

Feedback, corrections and suggestions are welcome, as are suggestions for further HOWTOs.