pig aws emr jython serialization error

I am trying to run a trivial Python UDF in Pig on Amazon EMR and it throws a java serialization error:
java.io.IOException: Deserialization error: could not instantiate 'org.apache.pig.scripting.jython.JythonFunction' with arguments '[/tmp/pig4877832484731242596tmp/simple.py, aprs]'
I've searched here and elsewhere and seen somewhat related questions and solutions posted, but none of the solutions seem to apply, including one post over a year ago that seemed to indicate this was working OK with Pig 0.9.1 on Amazon EMR.
$ pig --version
Apache Pig version 0.9.2-amzn (rexported)
compiled Aug 06 2012, 20:34:29
$ hadoop version
Hadoop 1.0.3
Here's my trivial python UDF:
#!/usr/bin/python
@outputSchema("data:chararray")
def aprs(l):
    return l
And here's the Pig script invocation that shows the UDF is loaded and the @outputSchema decorator did the right thing:
grunt> Register 's3n://n2ygk/simple.py' using jython as myudf;
grunt> raw = LOAD 's3n://aprs-is/small-sample.log' USING TextLoader as (line:chararray);
grunt> cooked = LIMIT raw 1000;
grunt> aprs = FOREACH cooked GENERATE FLATTEN(myudf.aprs(line));
grunt> DESCRIBE aprs;
aprs: {data: chararray}
grunt> dump aprs;
Any suggestions?

The short answer is to download, build and use Pig 0.11.1 as described in this post on the AWS forum. Once I did this, the Python UDF code "just worked".
I first did this naively by using the above-referenced script that downloads and builds Pig 0.11.1. Building Pig from scratch is likely overkill, though: having read up more on Pig versions, I believe 0.11.1 is already available via --pig-versions. (Although the doc doesn't list 0.11.1, the Pig installer script does include it.) I used the following command to test the EMR distro of pig-0.11.1:
./elastic-mapreduce --create --name 'Pig11 2013-06-03-21:35:24' --alive \
--num-instances 1 --instance-type m1.small --pig-interactive \
--pig-versions 0.11.1 \
--bootstrap-action s3n://us-west-2.elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive \
--bootstrap-name 'memory intensive'
Unfortunately, this throws the following error, which appears to be related to Java 1.6 vs. 1.7:
hadoop@ip-10-253-41-55:~$ pig
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/pig/Main : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:180)
hadoop@ip-10-253-41-55:~$ hadoop version
Hadoop 1.0.3
When I retested the above without specifying --pig-versions 0.11.1, it turned out that only 0.9.2 gets installed for the default of --pig-versions latest.
I see from (s3://us-west-2.elasticmapreduce/libs/pig/pig-script) that Pig 0.11.1 on Hadoop 1.0.3 requires Java 7, whereas 0.9.2 on Hadoop 1.0.3 uses Java 6. I'm guessing that somehow Java 7 did not get installed.
If someone can straighten me out on what I'm doing wrong, I'd really appreciate it. Otherwise, I'll live with the workaround of downloading and building Pig myself; it wastes a little time but does work.
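For anyone else hitting this, here is a slightly fuller Jython UDF sketch showing how @outputSchema describes a tuple return. The parse() name, the output field names, and the rough "SRCCALL>DEST,PATH:payload" line shape are my own illustrative assumptions, not something from the original script:

#!/usr/bin/python
# Sketch of a Jython UDF that splits an APRS-IS style line into two fields.
# @outputSchema is injected by Pig's Jython scripting engine at registration
# time, so no import is needed here.
@outputSchema("packet:(src:chararray, payload:chararray)")
def parse(line):
    if line is None:
        return (None, None)
    # Assumed line shape: "SRCCALL>DEST,PATH:payload"
    head, sep, payload = line.partition(':')
    src = head.split('>', 1)[0]
    return (src, payload)

It registers the same way as above (Register ... using jython as myudf;) and is called as myudf.parse(line) in a FOREACH.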

Related

Databricks Connect: Unable to run scala program in databricks cluster from IntelliJ

I followed the steps mentioned in these docs. The databricks-connect test command works fine. However, when I launch the test Scala program from IntelliJ, I'm seeing the following error:
Exception in thread "main" java.lang.NoSuchMethodError: com.databricks.spark.util.MetricDefinitions$.EVENT_INITIAL_CONFIG_LOG()Lcom/databricks/spark/util/MetricDefinition;
at com.databricks.spark.util.InitialConfigLogging$.recordInitialConfigs(InitialConfigLogging.scala:47)
at org.apache.spark.sql.internal.SQLConf.recordSessionInitialConfs(SQLConf.scala:4166)
at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:150)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1033)
at com.srikanth.Demo$.main(Demo.scala:13)
at com.srikanth.Demo.main(Demo.scala)
Environment details:
python - 3.8
java - 1.8
databricks-connect - 9.1
I had the same error message; it turns out I had not added the databricks-connect library properly. You should add the library as described here: https://docs.databricks.com/dev-tools/databricks-connect.html#intellij-scala-or-java

Create Dataframe issue in Pyspark from Windows 10

I am unable to execute the below command from PySpark on Windows:
schemaPeople = spark.createDataFrame(people)
I have set HADOOP_HOME to winutils
I have provided 777 permissions to C:/tmp/hive
Still I am getting the below error -
Py4JJavaError: An error occurred while calling o23.applySchemaToPythonRDD.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:189)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
I have gone through a lot of similar questions before posting this; I'd appreciate any help here.
I got this error a bunch when trying to set up Spark on Windows using the winutils file. I had to set up Spark differently to get around this.
I ended up downloading the Hadoop binary for my version of Spark and going from there. I documented the whole thing with a walkthrough if you are interested: Spark on windows
The gist is that the official Hadoop release from Apache does not include a Windows binary, and compiling from source can be tedious, so some really helpful people have made compiled distributions available. If you want to use Spark 2.0.2, download the binaries from Steve Loughran's GitHub; for 2.1.0 you can download from here. From there you should be able to set it up as expected.
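Once the binaries are downloaded and HADOOP_HOME is set, a quick sanity check from Python can confirm the pieces are in place before retrying createDataFrame. This is only a sketch, assuming HADOOP_HOME points at the folder that contains bin\winutils.exe; the sample rows are made up:

# Quick environment check before retrying createDataFrame.
import os

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME = {}".format(hadoop_home))
if hadoop_home:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe present: {}".format(os.path.exists(winutils)))

from pyspark.sql import Row, SparkSession

# Building the session without enableHiveSupport() may also sidestep the
# SessionHiveMetaStoreClient error when the Hive metastore isn't actually needed.
spark = SparkSession.builder.appName("winutils-check").getOrCreate()

people = [Row(name="Ada", age=36), Row(name="Linus", age=47)]  # made-up sample rows
schemaPeople = spark.createDataFrame(people)
schemaPeople.show()

If winutils.exe shows up as missing, fix HADOOP_HOME (or PATH) before worrying about the Hive metastore error.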

Error in installation of Hive 2.1.0 on Hadoop 2.7.2 - Pseudo distributed mode

I followed the Apache Hadoop installation links and was able to install it along with Pig. They are all working fine.
Following is the configuration:
Hadoop: 2.7.2
Hive: 2.1.0
Machine: Ubuntu 14.04 LTS 64-bit
Java: Version 9
Now I tried to install Apache Hive 2.1.0 according to this link: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation#AdminManualInstallation-InstallingfromaTarball
... and started a test execution of the Hive CLI, but every time it throws the following error and exits.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Exception in thread "main" java.lang.ClassCastException: jdk.internal.loader.ClassLoaders$AppClassLoader (in module: java.base) cannot be cast to java.net.URLClassLoader (in module: java.base)
at org.apache.hadoop.hive.ql.session.SessionState.<init> (SessionState.java:374)
at org.apache.hadoop.hive.ql.session.SessionState.<init>(SessionState.java:350)
at org.apache.hadoop.hive.cli.CliSessionState.<init>(CliSessionState.java:60)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:663)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@9-ea/Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@9-ea/NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@9-ea/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@9-ea/Method.java:533)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
But there is a catch: if I invoke the Beeline CLI, it works fine.
Could you please help:
a. Are the Beeline CLI and Hive CLI the same, or is there a specific difference?
b. How do I install/configure Hive on my machine?
A: Beeline CLI vs. Hive CLI: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_dataintegration/content/beeline-vs-hive-cli.html
B: According to
http://openjdk.java.net/projects/jigsaw/talks/prepare-for-jdk9-j1-2015.pdf
Java 9 no longer uses java.net.URLClassLoader.
However, I was able to solve the issue by pointing Hive to JDK 8.
** I have only begun using Hive/Hadoop... Perhaps someone could provide a better explanation or a workaround so that we are able to use JDK 9...

Nutch problems executing crawl

I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands in Windows 7.
Nutch is running; I am getting results from running bin/nutch, but I keep getting error messages when I try to run a crawl.
I am getting the following error when I try to execute a crawl with Nutch:
Error running: /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt
Failed with exit value 127.
I have my JAVA_HOME set, and I have altered the hosts file to include 127.0.0.1 as localhost.
I am curious whether I am referencing the right directory, and if maybe that is the problem.
The full printout looks like:
User5@User5-PC /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local
$ bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/ TestCrawl/ 2
Injecting seed URLs
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Injector: starting at 2015-12-23 17:48:21
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Error running:
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Failed with exit value 127.
The hadoop log that I think may have something to do with the error I am getting is:
2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:432)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:478)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
2016-01-07 12:24:40,450 ERROR crawl.Injector - Injector: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parse(URI.java:3048)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 4 more
You are running Linux commands from Cygwin, and there is no C:\ path on Linux systems. The correct command should be something like:
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch1.11/runtime/local/bin/nutch inject TestCrawl/crawldb /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch1.11/runtime/local/urls/seed.txt
You have the answer to your problem in this message:
2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
This is happening because the Hadoop version included with Nutch 1.11 is designed to work on Linux out of the box, not on Windows.
I had the same situation and ended up running Nutch 1.11 in an Ubuntu VirtualBox VM.
The hadoop-core jar file is needed when you are working with Nutch. The hadoop-core jar compatible with Nutch 1.11 is 0.20.0. Please download the jar from this link:
http://www.java2s.com/Code/Jar/h/Downloadhadoop0200corejar.htm
Paste that jar into the "C:\cygwin64\home\apache-nutch-1.11\lib" folder and it will run successfully.
The problem is pretty clear. According to your Hadoop log, it cannot find the winutils.exe file. Include winutils.exe in the %HADOOP_HOME%/bin folder.

ERROR 1066: Unable to open iterator for alias - Pig

I just started with Pig; I am trying to load data from a file and then dump it. Loading seems to be fine and no error is thrown. Below is the query:
NYSE = LOAD '/root/Desktop/Works/NYSE-2000-2001.tsv' USING
PigStorage() AS (exchange:chararray, stock_symbol:chararray,
date:chararray, stock_price_open:float, stock_price_high:float,
stock_price_low:float, stock_price_close:float, stock_volume:int,
stock_price_adj_close:float);
When I try to do the Dump, it throws the following error:
Pig Stack Trace
ERROR 1066: Unable to open iterator for alias NYSE
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias NYSE
at org.apache.pig.PigServer.openIterator(PigServer.java:857)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:490)
at org.apache.pig.Main.main(Main.java:111)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
at org.apache.pig.PigServer.openIterator(PigServer.java:849)
Any idea what's causing the issue?
Are you running a Pig 0.12.0 or earlier jar against Hadoop 2.2? If that is the case, then read on.
I managed to get around this error by recompiling the Pig jar from source. Here is a summary of the steps involved on a Debian-type box:
download pig-0.12.0.tar.gz
unpack the tarball and set permissions
inside the unpacked directory, compile the source with 'ant clean jar -Dhadoopversion=23'
then you need to get the jar on your classpath in Maven; for example, in the same directory:
mvn install:install-file -Dfile=pig.jar -DgroupId={set a groupId} -DartifactId={set a artifactId} -Dversion=1.0 -Dpackaging=jar
or, if in Eclipse, add the jar as an external library/dependency.
I was getting your exact trace trying to run Pig 0.12 on Hadoop 2.2.0, and the above steps worked for me.
UPDATE
I posted my issue on the Pig JIRA and they responded. They have a Pig jar already compiled for Hadoop 2, pig-h2.jar, here: http://search.maven.org/#artifactdetails|org.apache.pig|pig|0.12.0|jar
A Maven dependency for this jar is:
<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <classifier>h2</classifier>
    <version>0.12.0</version>
    <scope>provided</scope>
</dependency>
This could be due to a change in Pig starting with version 0.12. The specific change is that Pig used to be permissive and would automatically ignore the first line in the data file, or interpret that line as column names; in the newer version of Pig this permissiveness was removed. The workaround is to delete the column names from the input file, and this should solve the problem; a small sketch of that is below.
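For what it's worth, the header row can be stripped with a few lines of Python before loading. This is just a sketch: the -noheader output name is made up and the input path is the one from the question:

# Drop the first (header) line so Pig doesn't interpret column names as data.
infile = '/root/Desktop/Works/NYSE-2000-2001.tsv'            # path from the question
outfile = '/root/Desktop/Works/NYSE-2000-2001-noheader.tsv'  # hypothetical output name

with open(infile) as src, open(outfile, 'w') as dst:
    next(src)              # skip the header row
    for line in src:
        dst.write(line)

Then point the LOAD at the -noheader file instead.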
Kapil
I also met this problem, and then I found this link: http://www.fanli7.net/a/JAVAbiancheng/ANT/20140325/441264.html
I just replaced the Pig version from 0.12.0 to 0.13.0 and the problem was solved. (Here, my Hadoop version is 2.3.0.)
You can place a breakpoint in the PigServer class, in the store() method.
for (JobStats js : stats.getJobGraph()) {
    if (js.getException() != null) {
        ex = js.getException();
    }
}
Inside the js object there is a field errorMessage, and it may contain a description of the problem.