At Metric Insights we specialize in quick big-data visualizations for the end user. As such, we ran into a problem getting our product to routinely query Apache Hive tables that needed third-party jars added to Hadoop’s classpath.
This is basically how it works: Someone creates a table in Hive that uses the RegexSerDe class in the hive-contrib.jar. Then users query it by running something like this:
hive> add jar /usr/lib/hive/lib/hive-contrib-0.8.1-cdh4.0.0b2.jar;
hive> select count(*) from regex_serde_table;
While this works, it has quite a few drawbacks, including:
- It’s difficult to remember, and nearly impossible for run-of-the-mill database folks to figure out on their own
- It’s temporary. When you quit your session, you will have to re-add the jar the next time you log in
- It makes applications that use Hive more complicated
While searching for a way to make this a more permanent Hive configuration, I came across this Cloudera article, which explains three methods for solving this simple problem. Unfortunately, the first two methods are also temporary and require unacceptable setup steps for any of our users who wish to query Hive with Metric Insights.
That leaves the third option, which is to load the jar into the MapReduce task tracker’s classpath. Cloudera gives us two examples of how to do this, the easiest of which is to put the jar in your $HADOOP_HOME/lib directory on each task tracker server. For CDH4 on Debian, this means making sure that hive-contrib-0.8.1-cdh4.0.0b2.jar is in /usr/lib/hadoop/lib on each task tracker. (If you don’t know where your $HADOOP_HOME directory is, check /etc/default/hadoop-0.20.)
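If you have more than a couple of task trackers, the copy step can be scripted. The following is only a rough sketch, not something from the Cloudera article: the host names are placeholders, the TASKTRACKERS and DRY_RUN variables and the copy_jar helper are made up for illustration, and it assumes you have SSH access with permission to write to /usr/lib/hadoop/lib on each box. By default it only prints the scp commands it would run.

```shell
#!/bin/sh
# Hypothetical host list: replace with your own task tracker host names.
TASKTRACKERS="${TASKTRACKERS:-tt1.example.com tt2.example.com}"

# Jar and destination paths taken from the CDH4-on-Debian example in the text.
JAR="/usr/lib/hive/lib/hive-contrib-0.8.1-cdh4.0.0b2.jar"
DEST="/usr/lib/hadoop/lib"

# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Copy the jar to one host, or print the scp command in dry-run mode.
copy_jar() {
    host="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "scp $JAR $host:$DEST/"
    else
        scp "$JAR" "$host:$DEST/"
    fi
}

for host in $TASKTRACKERS; do
    copy_jar "$host"
done
```

Run it once as-is to check the printed commands, then again with DRY_RUN=0 to perform the copy.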
After you have copied the jar into $HADOOP_HOME/lib on all your task tracker boxes, restart the tasktracker processes with:
At this point, all your Hive users and applications can automatically query RegexSerDe tables!
On a closing note, Cloudera mentions that you should be able to simply modify the $HADOOP_TASKTRACKER_OPTS variable in /etc/hadoop/conf/hadoop-env.sh. Unfortunately, this doesn’t work. The resulting task tracker java process ends up with two -classpath options, only one of which the JVM actually uses. Copying the jar to $HADOOP_HOME/lib therefore appears to be the only surefire way of making this config permanent.
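If you want to see the duplicate -classpath options for yourself, something like the sketch below may help. It assumes a Linux box where ps -eo args is available and where the task tracker runs with the main class org.apache.hadoop.mapred.TaskTracker. The count_classpath_flags helper is a made-up name for illustration, and it only counts the literal -classpath token (not the -cp shorthand).

```shell
#!/bin/sh
# Print how many "-classpath" tokens appear in the given command line.
count_classpath_flags() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -c -x -e '-classpath'
}

# Grab the command line of the running TaskTracker JVM, if there is one.
# The [o] trick keeps grep from matching its own process.
cmdline=$(ps -eo args | grep '[o]rg.apache.hadoop.mapred.TaskTracker' | head -n 1)

# A result greater than 1 means the JVM was started with duplicate options.
count_classpath_flags "$cmdline"
```

On a machine with no task tracker running, this simply prints 0.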