Sunday, March 29, 2015

Compiling and running Hadoop's WordCount.java

After installing and running Hadoop for the first time, the moment came when I had to compile and run one of Hadoop's example utilities. The first one is WordCount.java.

I first tried making the adjustments in Hadoop 2.6.0, but the differences were large enough to prevent me from following the book. So I downloaded and unpacked Hadoop 1.2.1. I also decided to keep just one directory with Hadoop, which was /usr/local/src/hadoop, plain and simple.

If you just run the compiler with the commands as they are depicted in Chuck Lam's Hadoop in Action, compilation stops with several errors.

My first reaction was to say: "Well, I'll have to learn Java for real one day...". But after a while I decided to look for that specific error on the Internet. To my surprise, the book's site had several suggestions on how to overcome the compilation error in WordCount.java.

To make a long story short, the command that makes it work is javac -classpath /usr/local/src/hadoop/hadoop-core-1.2.1.jar:/usr/local/src/hadoop/lib/commons-cli-1.2.jar -d playground/classes playground/src/WordCount.java, where javac is the Java compiler and -classpath lists the locations of hadoop-core and of any other Java library that the compiler cannot find by default.
 
By contrast, if you issue just the standard command as written in the book, "javac -classpath hadoop-core-1.2.1.jar -d playground/classes playground/src/WordCount.java", it fails with several errors.
 
This can be partially overcome if you specify the path to hadoop-core-1.2.1.jar explicitly. In this case you would issue: "javac -classpath /usr/local/src/hadoop/hadoop-core-1.2.1.jar -d playground/classes playground/src/WordCount.java". This narrows the errors down to a single one: "class file for org.apache.commons.cli.Options not found".
 
Looking for the commons-cli library in the Hadoop tree you will see that it lives at /usr/local/src/hadoop/lib and that its full name is commons-cli-1.2.jar. Now issue the command with both jars fully specified in the -classpath option (as presented at the beginning of this explanation) and it should compile smoothly.
 
This command can be shortened, but not through PATH: PATH only tells the shell where to find executables, while -classpath needs actual jar locations. What does work is putting the two jars in the CLASSPATH environment variable (export CLASSPATH=/usr/local/src/hadoop/hadoop-core-1.2.1.jar:/usr/local/src/hadoop/lib/commons-cli-1.2.jar), after which the command becomes simply: "javac -d playground/classes playground/src/WordCount.java".
 
One last note: it is also good advice to add Hadoop's command-line utility directory to the PATH environment variable (export PATH=$PATH:/usr/local/src/hadoop/bin), because this makes Hadoop available for running just by typing "hadoop". Keep in mind, though, that PATH only affects how the shell finds executables; it has no effect on how javac finds the jars in the lib directories.
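To recap, here is the whole sequence as a sketch. It assumes the book's playground/src and playground/classes layout and Hadoop 1.2.1 unpacked at /usr/local/src/hadoop; the jar name (wordcount.jar) and the input and output directories are placeholders of mine, so adjust everything to your own setup.

# compile, telling javac where both required jars live
javac -classpath /usr/local/src/hadoop/hadoop-core-1.2.1.jar:/usr/local/src/hadoop/lib/commons-cli-1.2.jar -d playground/classes playground/src/WordCount.java

# package the compiled classes into a jar so Hadoop can run them
jar -cvf playground/wordcount.jar -C playground/classes .

# run it (use the fully qualified class name declared in your copy of WordCount.java,
# and replace input/output with your own directories)
hadoop jar playground/wordcount.jar WordCount input output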

Long live Hadoop!

Gustavo

Installing Hadoop in an Ubuntu environment

My first experience with Hadoop started when I tried to download it. With several versions of this (at first) mysterious and almost fantastic program, Hadoop has developed a kind of "aura" of its own. Everybody in the IT community has heard of it recently, some have even written articles about it, but very few have seen it working, and even fewer have made it work from scratch. My intention is to try to become one of these few "initiates"... :)

If you are new to a territory, i.e. navigating uncharted waters, try to find a guide to help you through it. In my case, I'm following the steps depicted in the book Hadoop in Action by Chuck Lam. The book was written in 2012 (almost a generation ago in the Big Data community in general and in Hadoop in particular), but I think it is good enough to guide you and outdated enough (yes, outdated) to make you think about how to adapt the commands.

Getting back to the download part, I noticed that there are two versions of Hadoop available for download, 1.2.1 and 2.6.0. I started by downloading 2.6.0; later I would come back to www.apache.org and download 1.2.1, but that would happen later.

I also tried to install Cygwin so I could still run Hadoop on Windows 8 (my system at the time), but I started facing so many difficulties that I decided to move my Big Data studies (at least the Hadoop part of them) to Ubuntu. Yes, this was a momentous decision. I, a guy from the Windows world, was going to start walking the shores of "Linux Middle Earth" in search of this cybernetic version of "The Lord of The Rings" called Hadoop... :)

After fiddling around with the possibility of a dual-boot machine (with Windows and Linux), I concluded that a machine should usually have only one master operating system. In my case this was going to be, at first, Windows 8.1. Unfortunately Windows 8.1 was so prone to all sorts of problems, strange features, idiosyncrasies and plain bugs that it soon became impossible to work with. I decided to downgrade my operating system to Windows 7, downloaded Oracle VirtualBox and an Ubuntu install image (the so-called "Linux for human beings"), installed Ubuntu on a virtual machine (with Windows 7 as host) and started to work.

Ubuntu is an interesting thing. It is graphical enough to let you do basic work from the start, but it is still Unix. So, if you want to really know how things are done, you need to lose the fear of learning an entirely new language and logic of commands and move to the terminal (as the command line is called in the Unix world). This task has been made easier by the large Ubuntu community: almost any doubt can be solved with a simple Google search, so you're up and learning in a short amount of time.

Ubuntu in particular, and Unix in general, is an interesting experience for a Windows person. It is also knowledge I don't regret having invested a reasonable amount of time in acquiring. Especially in the Big Data arena, Unix is the OS of choice (sorry, fellow Windows friends, but this is sadly true...), so if you want to do serious Big Data work, get your Ubuntu skills ready! Full disclosure: these are my personal views as of March 29th, 2015. Things change fast, so I can't say whether or not in a year Windows will have become an excellent, or even better, platform for Hadoop development.

Returning to the point at issue: as I said, I downloaded Hadoop from http://hadoop.apache.org/core/releases.html and saved the file hadoop-2.6.0.tar.gz (this is not the Hadoop source) in /usr/local/src. Then I unpacked hadoop-2.6.0.tar.gz (with the command tar -xzvf hadoop-2.6.0.tar.gz), which created a directory named hadoop-2.6.0.
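In shell terms, the unpacking step looks like this (assuming the tarball was saved to /usr/local/src as described):

cd /usr/local/src
tar -xzvf hadoop-2.6.0.tar.gz    # creates the hadoop-2.6.0 directory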

Hadoop is a Java program, so anything developed with it is done in Java. The next step is therefore to download the Java Development Kit, the so-called JDK. After that, you need to point the JAVA_HOME environment variable to the Java directory and add the path to the Java binaries to the PATH environment variable. To install the JDK on Ubuntu, run sudo apt-get install openjdk-7-jdk; to find the Java directory, use apt-cache search jdk and ls -al /etc/alternatives/java. Keep in mind that the answer to this last command will be /etc/alternatives/java -> /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java, but the "correct" directory to put in JAVA_HOME is /usr/lib/jvm/java-7-openjdk-amd64, and the one for PATH is /usr/lib/jvm/java-7-openjdk-amd64/bin. Two commands make all of this happen: export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 and export PATH=$PATH:/usr/lib/jvm/java-7-openjdk-amd64/bin. After this, issue a simple javac to check that the JDK is working fine.
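Putting those steps together, the sequence looks like this (the JDK directory below is the one on my machine; yours may differ depending on the JDK build installed):

sudo apt-get install openjdk-7-jdk    # install the JDK
ls -al /etc/alternatives/java         # shows where the "java" alternative points
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-7-openjdk-amd64/bin
javac                                 # should print the compiler's usage message, proving the JDK works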
     
Now, edit hadoop-2.6.0/etc/hadoop/hadoop-env.sh and replace the original line "export JAVA_HOME=${JAVA_HOME}" with the line "export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64". In my case, the file was located in /usr/local/src/hadoop-2.6.0/etc/hadoop.
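The change in hadoop-env.sh is a single line (the JDK path, again, is the one from my machine):

# in /usr/local/src/hadoop-2.6.0/etc/hadoop/hadoop-env.sh
# before:
export JAVA_HOME=${JAVA_HOME}
# after:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64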

If Hadoop's bin directory is not in the PATH environment variable (and neither is the current directory), Hadoop needs to be called by its full path: /usr/local/src/hadoop-2.6.0/bin/hadoop.
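In other words, either of the following works (the path assumes Hadoop was unpacked under /usr/local/src):

# call Hadoop by its full path...
/usr/local/src/hadoop-2.6.0/bin/hadoop

# ...or add its bin directory to PATH and call it by name
export PATH=$PATH:/usr/local/src/hadoop-2.6.0/bin
hadoop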

After all this, I went to the terminal again, typed hadoop and voilà, got the standard usage message that appears when the program is called without any parameters.

HADOOP IS WORKING!

Best regards, 

Gustavo,

Friday, March 20, 2015

Food for Thought: People's Perceptions of Highly Improbable Events and The Crisis of 2008 (post originally in Brazilian Portuguese)

The fat-tails phenomenon as a statistical illusion, or: the 2008 crisis is an event ENTIRELY within economic and financial theory

I was listening to Kahneman's book, at the part where he talks about how people perceive the probability of rare or extreme events, when an idea occurred to me that could perhaps be turned into a paper.

Since the 2008 crisis I have heard (and read) fairly often that the distribution of financial asset returns would not follow, in its extremes, a normal curve, but rather a curve with more pronounced extremes ("fat tails"). Perhaps this supposed phenomenon is actually an illusion caused by people's distorted perception of the probability of rare events. The argument starts from how people perceive the probability of rare events, that is, events with a very low likelihood of occurring.

Negative events that people have never experienced (or never even heard of, precisely because they are rare) lead to a perceived probability much lower than the one they actually have. This creates a false sense of security and, as a consequence, an underestimation of the associated risk. People (investors in general) thus take unnecessarily risky positions, which, when the event does occur, leads to the opposite perception: that such events are more likely to happen than they really are.

Negative events that people have experienced (because, although rare, the absolute number of occurrences is enough to keep them visible in the media or in everyone's memory) lead to a perceived probability higher than what the statistics allow. This is what lets insurance companies, for example, sell policies whose actuarial value is below the selling price.

The research idea would be to show that what changes when a rare event happens is people's perception of its probability of occurrence, not some basic change (or failure) in the model of return probabilities, which keeps following a normal curve.

Best regards,

Gustavo,

Thursday, March 19, 2015

Research Program on Information Diffusion in Massively Connected Networks

Some time ago I prepared a research program for studying the spread of information in massively connected networks. The program was inspired by an article on the same subject, which can be found at https://www.mpi-inf.mpg.de/~tfried/paper/CACM1.pdf.

For those interested in this subject, the presentation can be downloaded from https://docs.google.com/presentation/d/1UsRiYGDP1Cz3T4wwCmKP3C3bNzUTQkrOKMKsshcztzg/edit?usp=sharing

Best regards,

Gustavo

Richard Bellman and The Origins of Dynamic Programming

For those interested in Dynamic Programming, I prepared a seminar about Richard Bellman, the topics that led to the development of Dynamic Programming and the debate between him and Simon about the future of computing.

The presentation can be found at https://docs.google.com/presentation/d/1gaK6kO4Sy4iPVHF4Wv-lE0j7mxR--krZDFx10AcDg84/edit?usp=sharing .

Good reading!

Gustavo,

Linear Programming and George Dantzig

For those interested in optimization problems in general and linear programming in particular, there's an excellent scientific mini-biography of George Dantzig available for download at http://www.ams.org/notices/200703/fea-cottle.pdf. Of special note for today's researchers and practitioners is one of Dantzig's last subjects of interest: stochastic optimization. In his own words (see the article above), "stochastic optimization is where the real problems are".

Best regards,

The Consequences for Management Science of Robert McNamara's "Body Count Measure" during The Vietnam War (post originally in Brazilian Portuguese)


In 2013 (almost two years ago now...) I finished reading the book "Big Data Revolution". It is a book aimed at the general public, presenting, among other topics, a long list of companies and business/consulting ideas being applied in the United States and Europe.

The theme of this post concerns one of the book's passages. It describes (briefly) the career of Robert McNamara, the U.S. Secretary of Defense during the 1960s, his passion for "management science", and his influence on how the Vietnam War was conducted.

Here are a few passages from the book: “McNamara developed his love of numbers as a student at Harvard Business School and then its youngest assistant professor at age 24. He applied this rigor during the Second World War as part of an elite Pentagon team called Statistical Control, which brought data-driven decision-making to one of the world’s largest bureaucracies”. "At war’s end, the group decided to stick together and offer their skills to corporate America". Here, most probably, the hand of the RAND Corporation and of the group of "enlightened ones" led by Dantzig and company appears... once again.

“McNamara rose swiftly up the ranks (at Ford Motor Company), trotting out a data point for every situation. Harried factory managers produced the figures he demanded—whether they were correct or not”. At this point the book digresses into the Vietnam War and its management by numbers. McNamara proposed and enforced the "body count" criterion to "manage" the war, a criterion that led to a series of mismanagements in American military operations.
“McNamara epitomized the mid XXth century manager, the hyper-rational executive who relied on numbers rather than sentiments, and who could apply his quantitative skills to any industry he turned them to”

“The use, abuse, and misuse of data by the U.S. military during the Vietnam War is a troubling lesson about the limitations of information in an age of small data, a lesson that must be heeded as the world hurls toward the big-data era. The quality of the underlying data can be poor”. That was when I wrote the following note in the e-book: Perhaps this mistake was the trigger for the downplaying of highly quantitative techniques in the 70s. Here we have a paper: the effect of McNamara and the Vietnam War on the development of the decision sciences during the 70s and 80s. There's a book called The War Managers that exposes this problem (like the body count measures).

There seems to be room here for an article about a part of the history of Operations Research, its influence, and the consequences of its use.

Best regards to all.

Research Topics on The Origins and History of Operations Research

A possible list of research topics about the origins and history of Operations Research could include:


  1. Initial Meetings and research trends in the 1940s - How O.R. was popularized, or how O.R. became a fashionable topic (from an academic and managerial perspective)
  2. The so called "Paradox Decade" (50s)
    1. The critics of optimization and hard solutions for social problems (Ackoff for instance)
    2. The reality of economic theory and the challenge of providing plenty for many
  3. Researchers outside the mainstream path: Bellman, Raiffa
  4. Thomas Edison and OR before 1940
  5. Cooper and company vs. Dantzig
  6. Computation + automation of tasks and OR
  7. The Bellman vs. Simon Debate and their forecasts - Which was more prescient?
  8. The 4Ps of O.R.: Paradoxes, Programmers, Pamphleteers, Propagandists
  9. The effect of Robert McNamara and "The Body Count Measure"