Embedding Pig for CDH4/MRv1 java apps fer realz.

After working with Hadoop more and more and wading through the dependency hell of each crazy component of Hadoop’s little-documented ecosystem, I’ve come up with the following cheeky definition for HADOOP: “Huge Ass Dependency Orgy Of Poo”. OK, it’s not really that bad, but if you’ve ever tried to embed Pig in your own Java program as hinted at here, you might run into some pitfalls like I did.


shooting things

My heart was pounding. I couldn’t really say why. I mean, I kinda knew what I was doing. The bullets go in the magazine. The magazine goes in the gun. Load the round in the chamber. Safety off. Point it away; far, far away from anything you care about. Squeeze. It should be easy. My little sister was standing right behind me, encouraging. My brother and father were in the lane to the left of me. 75% of the people I care most about in the world were less than 5 feet away. And I felt like a bumbling moron with a bomb and a hammer.

The instructions were hard to hear through the explosions and ear protection. I bent over to be closer to my sister’s head, which usually bobbles around 5 feet off the ground. The laughing head said “you just turned the safety ON”. “You mean, it’s turned OFF by default?!” My heart pounded a bit more in its cage. I had no idea why this was affecting me so much. I flipped the safety off, relaxed my shoulders, pointed the handgun at pink ink on a paper target, made sure I absolutely cared for nothing about the target except hitting it, and squeezed.

Voluntary actions turned involuntary. My arms were suddenly noodles and flew up ten feet in the air. My eyes squeezed shut. Something hot flew past my head. When I opened my eyes, I realized that the gun hadn’t flown out of my hands. It hadn’t pointed behind me to where my sister was. Smoke lingered around the end of the barrel; I smelled sulfur. A small black dot marred the pink ink not far from where I was aiming. I adjusted my expectations, and shot the rest of the magazine. When I realized nothing happened when I squeezed, I looked at the gun and realized the chamber had popped open like a little bird, screaming for more food. I could feel adrenaline in my fingernails as I ejected the magazine and shakily put the gun on the table in front of me. At some point, I remembered to breathe.

I actually enjoyed myself, once I got to the problem-solving aspect of what I was doing. “Why won’t the black dot go where I want it to go?!” But the adrenaline in my body was telling me something foreboding and dangerous was going on. I found this strange, because I am no stranger to adrenaline or danger. I routinely smile as I dangle myself off a cliff, thousands of feet off the ground, sometimes without a rope. But there was something about discharging a ‘weapon’ that felt disturbingly dangerous. The truth of the matter is that guns are designed to kill. And each time you fire one, you are putting yourself in the hypothetical position of taking something’s, or someone’s, life. When I walk precipitously close to the edge of a cliff or a bridge, I’m doing something potentially dangerous to me. But when I fire a gun, I’m doing something that could be potentially dangerous to someone else.

When we were done, our little pink target looked like Swiss cheese. It was obvious my sister and I both needed some practice. I smelled of gunpowder and sweat. I smiled and laughed and had a blast (pun intended). But I couldn’t suppress another jolt of the jitters when I realized that my brother and dad had both shot 6-inch clusters of holes in the place where a chest would be in their human-shaped targets.

Sqoop hive import fails with IncompatibleClassChangeError

Exception in thread "main" java.lang.IncompatibleClassChangeError: class com.facebook.fb303.FacebookService$Client has interface org.apache.thrift.TServiceClient as super class

As with most obscuro Hadoop issues, I found absolutely no documentation about this error anywhere on the interwebs. It turns out, this is because no one else had a sqoop install as hosed as ours 🙂

For some reason our sqoop installation had some random thrift jars lying around in /usr/lib/sqoop/lib, including:

# ls -l /usr/lib/sqoop/lib/*thrift*

I have absolutely no idea why, but maybe someone just copied stuff directly from a CDH3 installation or hacked around a previous problem. Either way, that’s a lot of unnecessary thrift crap, especially when none of those jars match up with the version of thrift that Hive is using (libthrift-0.7.0.jar).

So, if you happen to inherit a crazy hacked up CDH4 installation like I did, you can fix it with:

# rm /usr/lib/sqoop/lib/*thrift*
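If deleting jars outright on a box you inherited makes you nervous, a slightly more cautious sketch is to move them aside instead, so the change is easy to undo. The function name and the backup directory are made up for illustration; only /usr/lib/sqoop/lib comes from the setup described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of sqoop or CDH): move any *thrift* files out
# of a lib directory into a backup directory instead of deleting them.
quarantine_thrift_jars() {
    lib_dir="$1"      # e.g. /usr/lib/sqoop/lib
    backup_dir="$2"   # any directory you like; created if missing
    mkdir -p "$backup_dir" || return 1
    for jar in "$lib_dir"/*thrift*; do
        # If the glob matched nothing, the literal pattern is left behind; skip it.
        [ -e "$jar" ] || continue
        mv "$jar" "$backup_dir"/ && echo "moved $(basename "$jar")"
    done
}
```

Something like `quarantine_thrift_jars /usr/lib/sqoop/lib /root/sqoop-thrift-backup` (run as root, with a backup path of your choosing) would then stand in for the rm above. If the import still fails afterwards, you can move the jars back.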

Make Hive know about hive-contrib’s RegexSerDe class indefinitely

At Metric Insights we specialize in quick big-data visualizations for the end user. As such, we ran into a problem getting our product to routinely query Apache Hive tables that needed third-party jars added to Hadoop’s classpath.

This is basically how it works: Someone creates a table in Hive that uses the RegexSerDe class in the hive-contrib.jar. Then users query it by running something like this:

hive> add jar /usr/lib/hive/lib/hive-contrib-0.8.1-cdh4.0.0b2.jar;
hive> select count(*) from regex_serde_table;

While this works, it has quite a few drawbacks, including:

  • It’s difficult to remember (and nearly impossible for run-of-the-mill database folks to figure out in the first place)
  • It’s temporary. When you quit your session, the next time you log in you will have to re-add the jar
  • It makes applications that make use of hive more complicated

While searching for a way to make this a more permanent Hive configuration, I came across this Cloudera article which explains 3 methods for solving this simple problem. Unfortunately, the first two methods are also temporary and require unacceptable setup steps for any of our users who wish to query Hive with Metric Insights.

That leaves the third option, which is to load the jar into the MapReduce task tracker’s classpath. Cloudera gives us two examples of how to do this, the easiest of which is to put the jar in your $HADOOP_HOME/lib directory on each task tracker server. For CDH4 on Debian, this means making sure the hive-contrib-0.8.1-cdh4.0.0b2.jar is in /usr/lib/hadoop/lib on each task tracker. (If you don’t know where your $HADOOP_HOME directory is, check /etc/default/hadoop-0.20.)

After you have copied the jar into $HADOOP_HOME/lib on all your task tracker boxes, restart the tasktracker processes with:

/etc/init.d/hadoop-0.20-mapreduce-tasktracker stop

/etc/init.d/hadoop-0.20-mapreduce-tasktracker start
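If you have more than a couple of task trackers, the copy-and-restart routine can be sketched as a small loop. This assumes passwordless root ssh/scp to every task tracker, which may well not match your setup. The function names and the DRY_RUN switch are made up for illustration. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch: copy a jar into /usr/lib/hadoop/lib on each task tracker, then
# restart the tasktracker there. Assumes root ssh access (an assumption!).
# With DRY_RUN=1 (the default) the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

deploy_jar() {
    jar="$1"; shift            # first argument: local path to the jar
    for host in "$@"; do       # remaining arguments: task tracker hostnames
        run scp "$jar" "root@$host:/usr/lib/hadoop/lib/"
        run ssh "root@$host" /etc/init.d/hadoop-0.20-mapreduce-tasktracker stop
        run ssh "root@$host" /etc/init.d/hadoop-0.20-mapreduce-tasktracker start
    done
}
```

For example, `deploy_jar /usr/lib/hive/lib/hive-contrib-0.8.1-cdh4.0.0b2.jar tt1 tt2` prints the six commands for two hosts (tt1 and tt2 are placeholder hostnames). Set DRY_RUN=0 once the output looks right.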

At this point, all your hive users and applications can automatically query RegexSerDe tables!

On a closing note, Cloudera mentions that you should just be able to modify the $HADOOP_TASKTRACKER_OPTS variable in /etc/hadoop/conf/hadoop-env.sh. Unfortunately, this doesn’t work. What happens is that the resulting Task Tracker java process ends up with two -classpath options, only one of which is used by the java process, so copying to $HADOOP_HOME/lib appears to be the only surefire way of making this config permanent.
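If you want to check whether your own task tracker has the doubled-up classpath problem, one rough way is to count the classpath flags on the process’s command line. The helper below is a made-up sketch, not anything shipped with Hadoop. It just splits a command line on whitespace and counts standalone -classpath or -cp arguments.

```shell
#!/bin/sh
# Hypothetical helper: print how many times -classpath (or its short form
# -cp) appears as a standalone argument in a command line string.
count_classpath_flags() {
    # One argument per line, then count exact-match lines.
    # grep exits non-zero when the count is 0, hence the "|| true".
    printf '%s\n' "$1" | tr -s ' \t' '\n' | grep -c -x -e '-classpath' -e '-cp' || true
}
```

Feeding it the task tracker’s arguments, for instance `count_classpath_flags "$(ps -o args= -p "$pid")"` where $pid is the task tracker’s process id, should print 1 on a sane setup. A result of 2 would match the problem described above.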

Happy Hadooping!

blog take 18?

This is what I wrote the last time I tried to re-start a blog after a long hiatus:

Consistency is the last refuge of the unimaginative.

Is this really the 5th incarnation of a local steve blog?

1998 – 1999 – localsteve / back when home-baked php / html was cool

2000 – 2002 – localsteve – / back when .cc was cool

2002 – 2003/4 – localsteve / back when geeklog was cool

2004 – 2006 – localsteve.net / back when wordpress was cool

2007 – 2010 – intermittent attempts over on rockpunk.org

2010 – fine screw it.  i’ll use a hosted solution.

Honestly, I can’t stand it, but there’s a nice free droid app for tumblr, so here goes nothin…

Well, you can see how long that lasted. . .