Embedding Pig in CDH4/MRv1 Java apps, fer realz.

After working with Hadoop more and more and wading through the dependency hell of each crazy component of Hadoop’s little-documented ecosystem, I’ve come up with the following cheeky definition for HADOOP: “Huge Ass Dependency Orgy Of Poo”. OK, it’s not really that bad, but if you’ve ever tried to embed Pig in your own Java program as hinted at here, you might run into some pitfalls like I did.

SETUP

To follow along at home, run:

  1. $ hadoop fs -put /etc/passwd /user/<you>/passwd
     $ mkdir /tmp/pigtest
  2. Now create the following two files, being sure to modify the path locations as necessary.

/tmp/pigtest/query.pig

A = LOAD '/user/<you>/passwd' USING PigStorage(':');
B = FOREACH A GENERATE $0 AS id;
DUMP B;

/tmp/pigtest/test.java

import java.io.*;
import org.apache.pig.*;
import org.apache.pig.tools.grunt.GruntParser;

public class test {
  public static void main(String[] args) throws Exception {
    GruntParser g = new GruntParser(new FileInputStream("/tmp/pigtest/query.pig")); // parse the script with Grunt, Pig's shell parser
    g.setInteractive(false);                       // batch mode, not an interactive shell
    PigServer server = new PigServer("mapreduce"); // "mapreduce" = run on the cluster (vs. "local")
    g.setParams(server);                           // point the parser at our PigServer
    g.parseStopOnError(true);                      // execute the script; this is test.java:11 in the traces below
  }
}

By the way, if you don’t know how I came up with this code, there is some documentation for embedded Pig, but it’s back in the 0.8.1 version of the docs and is kind of hard to find. You can also find more helpful information in the Pig 0.9.2 javadocs.
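For comparison, those javadocs also show that you can skip GruntParser entirely and drive PigServer directly with registerQuery() and openIterator(). A rough sketch of that style (test2 is just my throwaway name, and I’m writing this from the 0.9.2 API docs rather than something I ran against CDH4):

import java.util.Iterator;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class test2 {
  public static void main(String[] args) throws Exception {
    PigServer server = new PigServer("mapreduce");
    // Register each statement; nothing executes until an output operator fires.
    server.registerQuery("A = LOAD '/user/<you>/passwd' USING PigStorage(':');");
    server.registerQuery("B = FOREACH A GENERATE $0 AS id;");
    Iterator<Tuple> it = server.openIterator("B"); // kicks off the job, like 'dump B;'
    while (it.hasNext()) {
      System.out.println(it.next());
    }
  }
}

Either way you embed it, the classpath pain described below applies identically.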

Now compile test.java with a simple:

$ javac -cp /usr/lib/pig/pig-0.9.2-cdh4.0.0b2-core.jar test.java

Finally, test that your simple query.pig script actually works (run it from /tmp/pigtest):

# pig query.pig

GETTING THE JAVA CODE TO WORK

OK, so now on to getting this simple program to work. Like most Hadoop problems, this all boils down to setting up the right Java classpath. Unfortunately, getting the right classpath requires that you understand a couple of things that may or may not be obvious to you.
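Before walking through the failures, here’s a generic trick that helped me untangle them: ask the JVM where a class was actually loaded from. This is stock Java introspection, nothing Pig-specific, and whence is just a name I made up:

import java.security.CodeSource;

public class whence {
  public static void main(String[] args) throws Exception {
    for (String name : args) {
      CodeSource src = Class.forName(name).getProtectionDomain().getCodeSource();
      // getCodeSource() returns null for JDK bootstrap classes, hence the guard.
      System.out.println(name + " -> " + (src == null ? "<bootstrap>" : src.getLocation()));
    }
  }
}

Run it with the same classpath you intend to hand your app, e.g. java -cp <your classpath> whence org.apache.hadoop.conf.Configuration, and it tells you which jar won.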

Fail 1 – Using Pig with embedded Hadoop libraries

Pig 0.9.2 comes with several jars in /usr/lib/pig/ to choose from:

# ls -l /usr/lib/pig/*.jar
-rw-r--r-- 1 root root 5631233 Apr 23 01:23 /usr/lib/pig/pig-0.9.2-cdh4.0.0b2-core.jar
-rw-r--r-- 1 root root 29050153 Apr 23 01:23 /usr/lib/pig/pig-0.9.2-cdh4.0.0b2.jar
lrwxrwxrwx 1 root root 29 May 30 09:44 /usr/lib/pig/pig-withouthadoop.jar -> pig-0.9.2-cdh4.0.0b2-core.jar

The difference between the two might be obvious from the ‘pig-withouthadoop.jar’ symlink, but here it is explicitly:

  • skinny pig = pig-withouthadoop.jar = pig-0.9.2-cdh4.0.0b2-core.jar = Pig core classes and API
  • fat pig = pig-0.9.2-cdh4.0.0b2.jar = contains pig core classes above + Hadoop core libraries!

So, here’s where CDH4 Pig 0.9.2 and Apache Pig 0.9.2 are slightly different. The Apache Pig 0.9.2 docs state that fat pig includes the embedded Hadoop 0.20 libraries. However, it seems that the nice folks at Cloudera have instead modified pig-0.9.2-cdh4 so that fat pig includes the hadoop-0.23.1 jars *and* the MRv2/YARN libraries. This seems reasonable for folks on MRv2, but if you’re still on MRv1, it can be problematic.

So, if you lazily decide to link against /usr/lib/pig/* or specifically choose /usr/lib/pig/pig-0.9.2-cdh4.0.0b2.jar, you might end up seeing something like the following error:

“Failed to specify server’s Kerberos principal name”

# java -cp /usr/lib/pig/pig-0.9.2-cdh4.0.0b2.jar:.:/usr/lib/hadoop/lib/*:/etc/hadoop/conf test
2012-09-30 21:26:53,041 INFO [main] executionengine.HExecutionEngine (HExecutionEngine.java:init(204)) - Connecting to hadoop file system at: hdfs://<your.hadoop.namenode>.com:9000
2012-09-30 21:26:53,048 WARN [main] conf.Configuration (Configuration.java:warnOnceIfDeprecated(664)) - fs.default.name is deprecated. Instead, use fs.defaultFS
2012-09-30 21:26:54,168 WARN [main] conf.Configuration (Configuration.java:warnOnceIfDeprecated(664)) - fs.default.name is deprecated. Instead, use fs.defaultFS
2012-09-30 21:26:54,644 ERROR [main] security.UserGroupInformation (UserGroupInformation.java:doAs(1180)) - PriviledgedActionException as:mapred/<yourhost>@<YOURREALM> (auth:KERBEROS) cause:java.io.IOException: Failed to specify server's Kerberos principal name
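A quick way to confirm which Hadoop jars actually won on your classpath is Hadoop’s own org.apache.hadoop.util.VersionInfo utility. A minimal sketch (hadoopversion is my throwaway name; run it with the exact classpath you’re debugging):

import org.apache.hadoop.util.VersionInfo;

public class hadoopversion {
  public static void main(String[] args) {
    // Reports the version of whichever Hadoop jars the JVM resolved first.
    System.out.println("Hadoop " + VersionInfo.getVersion()
        + ", built from " + VersionInfo.getBranch());
  }
}

If that prints fat pig’s bundled version instead of the one under /usr/lib/hadoop, you’ve found your shadowing problem.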

Fail 2 – Pseudo-success: the wrong pig jar with the correct Hadoop libs

OK, let’s say you didn’t know about the fat and skinny pig differences. Instead, you fixed the above problem by adding in the appropriate Hadoop client libs but are still using the fat pig jar. That might lead you to:

# java -cp .:/usr/lib/pig/pig-0.9.2-cdh4.0.0b2.jar:/etc/hadoop/conf:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/* test

Holy crap, it worked!? This actually runs to completion! You win, right?

The problem is that this simple script doesn’t actually run any MapReduce jobs. Let’s modify query.pig to look like this instead, grouping /etc/passwd entries by login shell (field 7) and counting them:

/tmp/pigtest/query.pig

A = LOAD '/user/<you>/passwd' USING PigStorage(':');
B = GROUP A BY $6;
C = FOREACH B GENERATE group, COUNT($1);
DUMP C;

Now, using the same classpath as above (fat pig + correct Hadoop 0.23.1 client libs), you’ll get something like this:

Exception in thread "main" org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C
 at org.apache.pig.PigServer.openIterator(PigServer.java:901)
 at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:680)
 at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
 at test.main(test.java:11)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias C
 at org.apache.pig.PigServer.storeEx(PigServer.java:1000)
 at org.apache.pig.PigServer.store(PigServer.java:963)
 at org.apache.pig.PigServer.openIterator(PigServer.java:876)
 ... 4 more
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 6000: Output Location Validation Failed for: 'hdfs://<your.hadoop.namenode>:9000/tmp/temp2007927520/tmp-1798007485 More info to follow:
Can't get Master Kerberos principal for use as renewer

Fail 3 – Using the Wrong Map Reduce Libraries

OK, so now you’re wise to fat and skinny pig, so you tell fat pig to take a hike and swap in skinny pig.

As you may know, CDH4 now comes with the new version of the MapReduce framework (MRv2), called YARN (because what Hadoop really needs is another stupid, non-descriptive name for a subsystem). MRv2/YARN lives in /usr/lib/hadoop-mapreduce and MRv1 lives in /usr/lib/hadoop-0.20-mapreduce. You can install both of them, but you should really only run one of them on any given node. As far as I can tell, the “Can’t get Master Kerberos principal” error above comes from the MRv2 client code looking for YARN’s ResourceManager principal, which an MRv1 cluster config never defines. So, if your cluster is set up to use MRv1, but someone also installed the YARN MRv2 libraries on your node, you’ll get into the same errors as above if you run:

$ java -cp /usr/lib/pig/pig-withouthadoop.jar:.:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-hdfs/*:/etc/hadoop/conf test

2012-10-01 01:13:11,999 INFO [main] pigstats.ScriptState (ScriptState.java:setScriptFeatures(344)) - Pig features used in the script: GROUP_BY
Exception in thread "main" org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C
at org.apache.pig.PigServer.openIterator(PigServer.java:901)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:680)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at test.main(test.java:11)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias C
at org.apache.pig.PigServer.storeEx(PigServer.java:1000)
at org.apache.pig.PigServer.store(PigServer.java:963)
at org.apache.pig.PigServer.openIterator(PigServer.java:876)
... 4 more
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 6000: Output Location Validation Failed for: 'hdfs://<your.hadoop.namenode>:9000/tmp/temp-1298069119/tmp458602354 More info to follow:
Can't get Master Kerberos principal for use as renewer
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:95)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:77)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:294)
at org.apache.pig.PigServer.compilePp(PigServer.java:1360)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1297)
at org.apache.pig.PigServer.storeEx(PigServer.java:996)
... 6 more
Caused by: java.io.IOException: Can't get Master Kerberos principal for use as renewer
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:114)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:80)
... 18 more
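Before the fix, one more diagnostic that would have saved me time: check which MapReduce flavor your client config actually selects. Here’s a sketch using JobConf, which (unlike a bare Configuration) pulls mapred-site.xml into play; whichmr is my throwaway name and the two keys are the standard MRv2/MRv1 ones:

import org.apache.hadoop.mapred.JobConf;

public class whichmr {
  public static void main(String[] args) {
    JobConf conf = new JobConf(); // reads mapred-site.xml from the /etc/hadoop/conf on your classpath
    // "yarn" means the config expects MRv2; "classic" or unset points at MRv1.
    System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name"));
    // host:port of the MRv1 JobTracker, if one is configured.
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
  }
}

If this says MRv1 but your classpath drags in /usr/lib/hadoop-mapreduce, you’ve reproduced exactly the mismatch above.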

SUCCESS! – Take 4 – Correct pig jar and correct MRv1 libraries

OK! Now, armed with your new knowledge, you can change your classpath to the following and yell “holy shit, it worked!” (Well, maybe that was just me.)

$ java -cp /usr/lib/pig/pig-withouthadoop.jar:.:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-0.20-mapreduce/*:/usr/lib/hadoop-0.20-mapreduce/lib/*:/usr/lib/hadoop-hdfs/*:/etc/hadoop/conf test 

12/10/01 01:37:57 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: hdfs://<your.hadoop.namenode>
12/10/01 01:37:57 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/10/01 01:37:58 INFO executionengine.HExecutionEngine: Connecting to map-reduce job tracker at: hdfs://<your.hadoop.jobtracker>
:
:
:
12/10/01 01:38:36 INFO input.FileInputFormat: Total input paths to process : 1
12/10/01 01:38:36 INFO util.MapRedUtil: Total input paths to process : 1
(/bin/sh,18)
(/bin/bash,5)
(/bin/sync,1)
(/bin/false,17)
(/usr/sbin/nologin,1)

CONCLUSION

So yeah, maybe I could have just cut to the chase and given you the working classpath, but I wanted to write out each error and why it happened, in case other people hit the same errors and are googling for them. So, to conclude: to embed Pig 0.9.2-CDH4 properly in Java, you need to link against the version of Hadoop you’re actually running, and use the MapReduce libraries (MRv1 or MRv2) that your cluster actually uses. It seems pretty simple and obvious, but it’s easy to lose track of when you get lost in the Hadoop weeds.
