Get a hadoop cluster running in 20 Minutes


Since Cloudera was starting to build a complete stack it is more easy to get a cluster up and running. They build rpm's for Redhat based systems, I use CentOS, the newest CDH build is CDH2, which comes with a lot nice addons. Thanks for the work, guys!

What hardware we should use? Simple, moderate servers are pretty nice for. 2 CPU, 8 GB RAM, 4x750GB HDD will be enough. The master should have 4 CPU, 8 or more GB RAM and 100 GB RAID 5, we need that hardware twice.Why? The master is the most important server, hosting the namenode, jobtracker and, if you setup, ganglia. The hardware you use depends on your usecases, what I describe there works for eventanalysis and weblogs. But it will be a good start to see what power hadoop can deliver. The systems are located in the same rack so we can rack awareness moving in background. Use lvm and create a large directory, I use /opt/hadoop:


# df -h /opt/hadoop/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/myvg-hadoopvol
                      1.4T  213G  1.1T  17% /opt/hadoop


Important: Grab from java.sun.com the jdk-rpm (1.6.x) and install it.

Hint: A good idea will be to deploy a SSH key to all nodes, so we can authenticate without a password. Done with cp rsa_id.pub from masternode to slaves /root/.ssh/authorized_keys

All servers are running, first get the repo working in your box. Simply add a file in /etc/yum.repos.d/:

# cat cloudera-cdh2.repo
[cloudera-cdh2]
name=Cloudera's Distribution for Hadoop, Version 2
mirrorlist=http://archive.cloudera.com/redhat/cdh/2/mirrors
gpgkey = http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
enabled=1

deploy that file on all nodes.

Now we give a bit more limits:
# cat /etc/security/limits.conf
 hdfs            soft     nofile         5000
 hdfs            hard     nofile         5000
 mapred          hard     nofile         5000
 mapred          soft     nofile         5000
 hadoop          hard     nofile         5000
 hadoop          soft     nofile         5000

# cat /etc/sysctl.conf
 fs.file-max=200000

Setup the namenode and jobtracker on the master-box:
"yum install hadoop-0.20-pipes hadoop-0.20-native hadoop-0.20-jobtracker hadoop-0.20 hadoop-zookeeper hadoop-0.20-namenode -y && mkdir -p /opt/hadoop/name && mkdir -p /opt/hadoop/hdfs/mapred/local && chown -R hdfs:hadoop /opt/hadoop/hdfs/ && chown -R mapred:hadoop /opt/hadoop/hdfs/mapred"

To prevent the configs from a unwanted update we use alternatives here:
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.used
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.used 50

I use port 9000 for hdfs, but that depends on your environment.

Disable IPv6:
edit hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Edit the main config files:

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://<your-server-name>:9000</value>
</property>

hdfs-site.xml:
<property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/hdfs/data</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/opt/hadoop/hdfs/name</value>
</property>
<property>
  <name>dfs.secondary.http.address</name>
  <value>namenode2:50070</value>
</property>

mapred-site.xml:
<property>
<name>mapred.local.dir</name>
  <value>/opt/hadoop/hdfs/mapred/local</value>
</property>
<name>mapred.job.tracker</name>
  <value><your-server-name>:54311</value>
</property>

Describe your cluster. Here we have to edit 2 files, master and slave:
# cat /etc/hadoop-0.20/conf.used/masters
<your-namenode-server>


# cat /etc/hadoop-0.20/conf.used/slaves
datanode1
datanode2
datanode3
datanode4
datanode5
datanode6

Setup the secondary namenode:
# yum install hadoop-0.20-secondarynamenode -y

Format the namenode:
# sudo -u hdfs hadoop namenode -format
You have to wait, simple watch the logs (tail -f /var/log/hadoop-0.20/*.log)

and start:
# for x in /etc/init.d/hadoop-0.20-* ; do $x restart ; done

create the mapred-dirs in hdfs:
# sudo -u hdfs hadoop fs -mkdir /mapred/system
# sudo -u hdfs hadoop fs -chown -R mapred /mapred

Login at the secondary namenode and restart also:
# for x in /etc/init.d/hadoop-0.20-* ; do $x restart ; done

Watch the logs for errors.

Now let us install the datanodes and tasktrackers. Before you let run the script be sure all nodes are via DNS available, SSH keys deployed and hosts are known by the SSH subsystem.

# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh $i 'yum install hadoop-0.20-datanode.noarch hadoop-0.20-native.x86_64 hadoop-0.20-tasktracker.noarch -y && \ 
mkdir -p /opt/hadoop/hdfs/name && mkdir -p /opt/hadoop/hdfs/mapred/local && chown -R hdfs:hadoop /opt/hadoop/hdfs/ && chown -R mapred:hadoop /opt/hadoop/hdfs/mapred'; done

copy the config:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do scp -r /etc/hadoop-0.20/conf.used $i:/etc/hadoop-0.20/ ; done

activate via alternatives:
# for i in $(cat /etc/hadoop-0.20/conf/hadoop_slaves); \ 
do ssh $i alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.as24 50; done

Now it is time to get the cluster running the first time:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do ssh $i 'for x in /etc/init.d/hadoop-0.20-* ; do $x restart; done'; done

Again, watch the logs for errors, and also you can check the whole nodes with:
# for i in $(cat /etc/hadoop-0.20/conf.used/slaves); do echo $i && ssh $i 'for x in /etc/init.d/hadoop-0.20-* ; do $x status; done'; done

If all goes well you can see your running cluster via web:
http://namenode:50070/dfshealth.jsp


Another good idea will be to edit your hosts according your cluster-nodes:

# cat /etc/hosts
...

# hadoop nodes
IP datanode1.FQDN datanode1
IP datanode2.FQDN datanode2
IP datanode3.FQDN datanode3
IP datanode4.FQDN datanode4
IP datanode5.FQDN datanode5
IP datanode6.FQDN datanode6
#hadoop master
IP namenode1.FQDN namenode1
#hadoop secondary
IP namenode2.FQDN namenode2


Enjoy!

created: 25.November 2010