Make Hive know about hive-contrib’s RegexSerDe class permanently

At Metric Insights we specialize in quick big-data visualizations for the end user. As such, we ran into a problem getting our product to routinely query Apache Hive tables that needed third-party jars added to Hadoop’s classpath.

This is basically how it works: Someone creates a table in Hive that uses the RegexSerDe class in the hive-contrib.jar. Then users query it by running something like this:

hive> add jar /usr/lib/hive/lib/hive-contrib-0.8.1-cdh4.0.0b2.jar;
hive> select count(*) from regex_serde_table;

While this works, it has quite a few drawbacks, including:

  • It’s difficult to remember (and nearly impossible for run-of-the-mill database folks to figure out in the first place)
  • It’s temporary. Once you quit your session, you have to re-add the jar the next time you log in
  • It makes applications that use Hive more complicated
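To illustrate that last point, here is a rough sketch of the kind of wrapper an application ends up carrying around just to run one query. The function name with_contrib_jar is made up for this example; the jar path is the one from the session above.

```shell
#!/bin/sh
# Jar path taken from the interactive session above.
HIVE_CONTRIB_JAR=/usr/lib/hive/lib/hive-contrib-0.8.1-cdh4.0.0b2.jar

# Hypothetical helper: prefix any query with the "add jar" statement,
# because the jar has to be re-added in every new Hive session.
with_contrib_jar() {
  printf 'add jar %s; %s\n' "$HIVE_CONTRIB_JAR" "$1"
}
```

With the hive CLI on the path, it would be used like this: hive -e "$(with_contrib_jar 'select count(*) from regex_serde_table;')". Every caller has to know the jar's exact path and version, which is exactly the kind of baggage we wanted to get rid of.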

While searching for a way to make this a more permanent Hive configuration, I came across this Cloudera article, which explains three methods for solving this simple problem. Unfortunately, the first two methods are also temporary and require unacceptable setup steps for any of our users who wish to query Hive with Metric Insights.

That leaves the third option, which is to load the jar into the MapReduce task tracker’s classpath. Cloudera gives us two examples of how to do this, the easiest of which is to put the jar in your $HADOOP_HOME/lib directory on each task tracker server. For CDH4 on Debian, this means making sure the hive-contrib-0.8.1-cdh4.0.0b2.jar is in /usr/lib/hadoop/lib on each task tracker. (If you don’t know where your $HADOOP_HOME directory is, check /etc/default/hadoop-0.20.)

After you have copied the jar into $HADOOP_HOME/lib on all your task tracker boxes, restart the tasktracker processes with:

/etc/init.d/hadoop-0.20-mapreduce-tasktracker stop

/etc/init.d/hadoop-0.20-mapreduce-tasktracker start
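If you have more than a couple of task trackers, the copy-and-restart routine can be scripted. The following is only a sketch: it assumes you can scp and ssh to each box with enough privileges to write to /usr/lib/hadoop/lib and run the init script, and the hostnames in the usage note are placeholders. It prints the commands rather than running them, so you can look them over first.

```shell
#!/bin/sh
# Paths come from the steps above (CDH4 on Debian); adjust for your install.
JAR=/usr/lib/hive/lib/hive-contrib-0.8.1-cdh4.0.0b2.jar
DEST=/usr/lib/hadoop/lib
INIT=/etc/init.d/hadoop-0.20-mapreduce-tasktracker

# Print (do not execute) the copy and restart commands for each host
# given as an argument. Assumes ssh/scp access to every task tracker.
deploy_commands() {
  for host in "$@"; do
    echo "scp $JAR $host:$DEST/"
    echo "ssh $host $INIT stop"
    echo "ssh $host $INIT start"
  done
}
```

Calling deploy_commands tt1.example.com tt2.example.com prints three lines per host. Once the output looks right, pipe it to sh to actually run it.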

At this point, all your Hive users and applications can query RegexSerDe tables without adding the jar first!

On a closing note, Cloudera mentions that you should be able to just modify the $HADOOP_TASKTRACKER_OPTS variable in /etc/hadoop/conf/hadoop-env.sh. Unfortunately, this doesn’t work. The resulting task tracker Java process ends up with two -classpath options, only one of which the JVM actually uses, so copying the jar to $HADOOP_HOME/lib appears to be the only surefire way of making this config permanent.

Happy Hadooping!

blog take 18?

This is what I wrote the last time I tried to re-start a blog after a long hiatus:

Consistency is the last refuge of the unimaginative.

Is this really the 5th incarnation of a local steve blog?

1998 – 1999 – localsteve / back when home-baked php / html was cool

2000 – 2002 – localsteve – / back when .cc was cool

2002 – 2003/4 – localsteve / back when geeklog was cool

2004 – 2006 – localsteve / .net back when wordpress was cool

2007 – 2010 – intermittent attempts over on rockpunk.org

2010 – fine screw it.  i’ll use a hosted solution.

Honestly, I can’t stand it, but there’s a nice free droid app for tumblr, so here goes nothin…

Well, you can see how long that lasted. . .