My first experience with Hadoop started when I tried to download it. With several versions of this (at first) mysterious and almost mythical program available, Hadoop has developed a kind of "aura" around itself. Everybody in the IT community has heard of it recently, some have even written articles about it, but very few have seen it working, and even fewer have made it work from scratch. My intention is to try to become one of these few "initiates"... :)
If you are new to a territory, i.e. you're navigating uncharted waters, try to find a guide to help you through it. In my case, I'm following the steps laid out in the book Hadoop in Action by Chuck Lam. The book was written in 2012 (almost a generation ago in the Big Data community in general, and in Hadoop in particular), but I think it is sufficiently good to guide you and sufficiently outdated (yes, outdated) to make you think about how to adapt the commands.
Getting back to the download part, I noticed that there are two versions of Hadoop available for download, 1.2.1 and 2.6.0. I started by downloading 2.6.0. Later I would come back here to www.apache.org and download 1.2.1, but that would happen later.
Also, I tried to install Cygwin so I could still run Hadoop on (at that time) Windows 8, but I started facing so many difficulties that I decided to move my studies of Big Data (at least the Hadoop part of it) to Ubuntu. Yes, this was a momentous decision. I, a guy from the Windows world, was going to start walking the shores of "Linux Middle Earth" in search of this cybernetic version of "The Lord of the Rings" called Hadoop... :)
After fiddling around with the possibilities of having a dual-boot machine (with Windows and Linux), I understood that a machine should (usually) have only one master operating system. In my case this was going to be, at first, Windows 8.1. Unfortunately, Windows 8.1 was so prone to all sorts of problems, strange features, idiosyncrasies and just plain bugs that it soon became impossible to work with. I decided to downgrade my operating system to Windows 7, downloaded Oracle VirtualBox and a master disk of Ubuntu (the so-called "Linux for Human Beings"), installed Ubuntu on a virtual machine (with Windows 7 as the host) and started to work.
Ubuntu is an interesting thing. It is graphical enough to let you do basic work from the start, but it is still Unix. So, if you want to really know how things are done, you need to lose the fear of learning an entirely new language and logic of commands and move to the terminal (as the command line is called in the Unix world). This task has been made easier by the large Ubuntu community: almost any doubt can be resolved with a simple Google search, so you're up and learning in a short amount of time.
Ubuntu in particular, and Unix in general, is an interesting experience for a Windows person. It is also knowledge that I now don't regret having invested a reasonable amount of time in learning. Especially in the Big Data arena, Unix is the OS of choice (sorry, fellow Windows friends, but this is sadly true...), so if you want to do serious Big Data work, get your Ubuntu skills ready! Full disclosure: these are my personal views as of March 29th, 2015. Things change fast, so I can't say whether in a year Windows will or won't become an excellent, or even better, platform for Hadoop development.
Returning to the point at issue here, as I said, I downloaded Hadoop from http://hadoop.apache.org/core/releases.html and saved the file hadoop-2.6.0.tar.gz (this is not the Hadoop source) in /usr/local/src. Then I unpacked hadoop-2.6.0.tar.gz (with the command tar -xzvf hadoop-2.6.0.tar.gz), which created a directory named hadoop-2.6.0.
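For reference, the whole download-and-unpack step can be done from the terminal. The archive.apache.org URL below is just an illustration (pick whichever mirror apache.org suggests), and you may need sudo to write into /usr/local/src:

```shell
# Download Hadoop 2.6.0 into /usr/local/src and unpack it there.
# The mirror URL is an assumption -- use the one apache.org points you to.
cd /usr/local/src
wget http://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
tar -xzvf hadoop-2.6.0.tar.gz   # creates the directory hadoop-2.6.0
```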
Hadoop is a Java program, so anything developed with it is done in Java. The next step, then, is to download the Java Development Kit, the so-called JDK. After that, you need to specify the Java directory in the JAVA_HOME environment variable and the path to the Java binaries in the PATH environment variable. To install the JDK on Ubuntu: sudo apt-get install openjdk-7-jdk. To find the Java directory, use apt-cache search jdk and ls -al /etc/alternatives/java. Keep in mind that the answer to this last command will be /etc/alternatives/java -> /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java, but the "correct" directory to place in JAVA_HOME is /usr/lib/jvm/java-7-openjdk-amd64, and in PATH it is /usr/lib/jvm/java-7-openjdk-amd64/bin. Two commands make all of this happen: export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 and export PATH=$PATH:/usr/lib/jvm/java-7-openjdk-amd64/bin. After this, issue a simple javac command to check that the JDK is working fine.
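One detail worth knowing: export only affects the current shell session. To make the variables permanent you can append the same two lines to ~/.bashrc. The paths below assume OpenJDK 7 on 64-bit Ubuntu, so verify yours first with ls -al /etc/alternatives/java:

```shell
# Set the JDK root (not the jre/bin/java symlink target) as JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
# Append the Java binaries to PATH
export PATH=$PATH:$JAVA_HOME/bin
# To persist across sessions, add the two lines above to ~/.bashrc
```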
Now, edit hadoop-2.6.0/etc/hadoop/hadoop-env.sh and replace the original line export JAVA_HOME=${JAVA_HOME} with the line export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64. In my case, the file was located in /usr/local/src/hadoop-2.6.0/etc/hadoop.
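If you prefer to make that edit from the terminal rather than opening an editor, a sed one-liner can do it. This assumes the file lives where mine did, so adjust the path for your install:

```shell
# Replace the JAVA_HOME line in hadoop-env.sh in place.
# Path assumes Hadoop was unpacked in /usr/local/src -- adjust if yours differs.
HADOOP_ENV=/usr/local/src/hadoop-2.6.0/etc/hadoop/hadoop-env.sh
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64|' "$HADOOP_ENV"
```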
If Hadoop's bin directory is not in the PATH environment variable, and you are not standing in that directory, Hadoop needs to be called by its full path: /usr/local/src/hadoop-2.6.0/bin/hadoop.
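To avoid typing the full path every time, you can add Hadoop's bin directory to PATH as well (again assuming my unpack location under /usr/local/src):

```shell
# Assumes Hadoop was unpacked in /usr/local/src/hadoop-2.6.0
export HADOOP_HOME=/usr/local/src/hadoop-2.6.0
export PATH=$PATH:$HADOOP_HOME/bin
# After this, a plain "hadoop" works from any directory
```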
After all this, I went to the terminal again, typed hadoop and voilà: I got the standard answer the program gives when called without any parameters.
HADOOP IS WORKING!
Best regards,
Gustavo