Getting exception while running Pig Script - apache-pig

I am getting the below error while running a Pig Script on approximately a 300GB dataset.
Error: Exceeded limits on number of counters - Counters=120 Limit=120
Does anybody have any ideas on how to resolve the issue without modifying counter config in the Pig properties file?

This may not qualify as a proper answer, since you still need to modify configuration files. I don't think there is currently any way of doing this without modifying some configuration file.
Now, this is pure nitpicking, but you can actually do this without modifying the Pig properties. All you need to do is configure the counter limit in the Hadoop configuration file.
Add mapreduce.job.counters.max or mapreduce.job.counters.limit, depending on your Hadoop version, to your mapred-site.xml file.
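A sketch of what such an entry might look like, assuming the Hadoop 2.x property name and an arbitrary limit of 500 (the value is only an illustration, not a recommendation). The helper name print_property is made up for this example:

```shell
# Hypothetical helper: prints a <property> element for mapred-site.xml.
# $1 = property name, $2 = value (500 below is an arbitrary illustration).
print_property() {
  cat <<EOF
<property>
  <name>$1</name>
  <value>$2</value>
</property>
EOF
}

# Paste the output inside the <configuration> element of mapred-site.xml.
print_property mapreduce.job.counters.max 500
```

On older Hadoop versions the same call would take mapreduce.job.counters.limit as the property name instead.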
Remember to restart all node managers and also the history server.


Alluxio + Hive on EMR

I have Alluxio 1.8 installed on an EMR 5.19.0 cluster, and can see my S3 tables using /usr/local/alluxio/bin/alluxio fs ls /.
However, when I start up hive and issue
hive> [[DDL w/ LOCATION = alluxio://master_host:19998/my_table ]]], I get the following:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found
Is there a way of getting past this? I've tried starting hive with --auxpath pointing to both /usr/local/alluxio/client/alluxio-1.8.1-client.jar and a copy of the jar on hdfs without any success.
Any help?
I posted a blog talking about the reasons for the error message java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found. Here are some tips, hope they can help:
For Hive, set environment variable HIVE_AUX_JARS_PATH in conf/
export HIVE_AUX_JARS_PATH=/<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar:${HIVE_AUX_JARS_PATH}
which I guess is equivalent to what you have done to set --auxpath.
Depending on your Hive setup (e.g., Hive on MR, Spark, or Tez), you may also need to make sure the runtime can access the client jar. Taking Hive on MR as an example, you may also need to append the path of the Alluxio client jar to mapreduce.application.classpath or yarn.application.classpath so that each task of the MR jobs can access this jar.
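As a rough sketch of the first tip, the helper below prepends a jar to a colon-separated path variable without leaving a dangling colon when the variable is empty. The function name prepend_path is hypothetical, and the jar location is the one mentioned in the question; substitute the actual path of your Alluxio client jar.

```shell
# Hypothetical helper: prepend $1 to the colon-separated list $2.
# Avoids a dangling ':' when the existing list is empty.
prepend_path() {
  if [ -n "$2" ]; then
    printf '%s:%s\n' "$1" "$2"
  else
    printf '%s\n' "$1"
  fi
}

# Assumed jar location (taken from the question); adjust to your installation.
ALLUXIO_CLIENT_JAR=/usr/local/alluxio/client/alluxio-1.8.1-client.jar

HIVE_AUX_JARS_PATH=$(prepend_path "$ALLUXIO_CLIENT_JAR" "${HIVE_AUX_JARS_PATH:-}")
export HIVE_AUX_JARS_PATH
echo "HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH"
```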

Parquet Warning Filling up Logs in Hive MapReduce on Amazon EMR

I am running a custom UDAF on a table stored as parquet on Hive on Tez. Our Hive jobs are run on YARN, all set up in Amazon EMR. However, due to the fact that the parquet data we have was generated with an older version of Parquet (1.5), I am getting a warning that is filling up the YARN logs and causing the disk to run out of space before the job finishes. This is the warning:
PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring
because created_by could not be parsed (see PARQUET-251): parquet-mr version
It also prints a stack trace. I have been trying to silence the warning logs to no avail. I have managed to turn off just about every type of log except this warning. I have tried modifying just about every Log4j settings file using the AWS config as outlined here.
Things I have tried so far:
I set the following settings in tez-site.xml (written here in JSON format because that's what AWS requires for configuration; on the actual instance they are in proper XML format, of course):
"": "OFF",
"tez.task.log.level": "OFF",
"": "-Dhadoop.metrics.log.level=OFF -Dtez.root.logger=OFF,CLA",
"tez.task-specific.log.level": "OFF;org.apache.parquet=OFF"
I have the following settings on mapred-site.xml. These settings effectively turned off all logging that occurs in my YARN logs except for the warning in question.
"": "OFF",
"mapreduce.reduce.log.level": "OFF",
"": "OFF"
I have these settings in just about every other file I found in the list shown in the AWS link above.
"": "OFF",
"": "OFF",
"log4j.rootLogger": "OFF, console"
Honestly, at this point I just want to find some way to turn off the logs and get the job running somehow. I've read about similar issues, such as this link, where they fixed it by changing log4j settings, but that's for Spark and it just doesn't seem to work on Hive/Tez and Amazon. Any help is appreciated.
OK, so I ended up fixing this by modifying the Java properties file on EVERY single data node and the master node in EMR. In my case the file was located at /etc/alternatives/jre/lib/
I added a shell command to the bootstrap action file to automatically add the following two lines to the end of the properties file:
org.apache.parquet.CorruptStatistics.level = SEVERE
Just wanted to update in case anyone else faced the same issue as this is really not set up properly by Amazon and required a lot of trial and error.
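The bootstrap step described above can be sketched roughly as follows. PROPS_FILE is a placeholder for the properties file mentioned in the answer (it defaults to a scratch file here so the sketch can be tried safely), and append_once is a hypothetical helper that avoids adding the same line twice if the bootstrap action runs again.

```shell
# Hypothetical helper: append line $2 to file $1 unless it is already there.
append_once() {
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Placeholder: point this at the Java properties file on each node.
# Defaults to a scratch file so nothing real is modified by accident.
PROPS_FILE=${PROPS_FILE:-$(mktemp)}

append_once "$PROPS_FILE" "org.apache.parquet.CorruptStatistics.level = SEVERE"
echo "updated $PROPS_FILE"
```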

Hive - How can I stop logs displaying in console?

I have been trying to keep logs out of the console while querying in Hive, but they still show up.
If you are opening the hive console by typing
> hive
in your terminal and then write queries, you can solve this by simply using
> hive -S
This basically means that you are starting hive in silent mode.
Hope that helps.
You could increase the polling interval to minutes or hours:
SET hive.exec.counters.pull.interval=[millis];
The default is 1000 milliseconds, but you can increase it to anything you like. That should decrease the number of logs written to stdout.
If you don't want any logs on the console while starting the shell, you can set the hive.root.logger property:
$HIVE_HOME/bin/hive --hiveconf hive.root.logger=INFO,DRFA
hive.root.logger specifies the logging level as well as the log
destination. Specifying console as the target sends the logs to
standard error (instead of the log file).
If you want to see only ERROR messages on the console, you can use this command:
$HIVE_HOME/bin/hive --hiveconf hive.root.logger=ERROR,console
Start hive in silent mode using
$ hive -S
then set the logger level to ERROR, which keeps warnings and info messages from being printed.
hive> set logger.PerfLogger.level = ERROR;
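Combining the suggestions above (silent mode plus an ERROR-level root logger), a small wrapper might look like this. The function name quiet_hive is made up for illustration, and the logger setting is passed with --hiveconf, the Hive CLI option for setting a configuration property on the command line. Whether this silences everything depends on your Hive version and logging setup.

```shell
# Hypothetical wrapper: start the Hive CLI quietly.
#   -S          silent mode
#   --hiveconf  sets hive.root.logger so only ERROR messages reach the console
# Any extra arguments (for example -e 'query') are passed through unchanged.
quiet_hive() {
  hive -S --hiveconf hive.root.logger=ERROR,console "$@"
}
```

It could then be used as quiet_hive -e 'SHOW TABLES;' in place of a plain hive call.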
If there is "SLF4J: Class path contains multiple SLF4J bindings." in your log, it means that there are multiple SLF4J binding jars (possibly different versions with different behavior) on the class path.
I don't know the internals of log4j, but following the Hadoop configuration file, you can try these steps:
cd $HIVE_HOME/conf
cat > <<EOL
log4j.rootLogger=WARN, CA
log4j.appender.CA.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n
After starting Hive (Apache Hive 3.1.2), the log level should be set to WARN. This may not work in every setup, but it is worth trying.

Cannot Load Hive Table into Pig via HCatalog

I am currently configuring a Cloudera HDP dev image using this tutorial on CentOS 6.5, installing the base and then adding the different components as I need them. Currently, I am installing / testing HCatalog using this section of the tutorial linked above.
I have successfully installed the package and am now testing HCatalog integration with Pig with the following script:
A = LOAD 'groups' USING org.apache.hcatalog.pig.HCatLoader();
I have previously created and populated a 'groups' table in Hive before running the command. When I run the script with the command pig -useHCatalog test.pig I get an exception rather than the expected output. Below is the initial part of the stacktrace:
Pig Stack Trace
ERROR 2245: Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader
at org.apache.pig.PigServer$Graph.parseQuery(
at org.apache.pig.PigServer$Graph.registerQuery(
at org.apache.pig.PigServer.registerQuery(
Has anyone encountered this error before? Any help would be much appreciated. I would be happy to provide more information if you need it.
The error was caused by HBase's Thrift server not being properly configured. I installed and configured Thrift and added the following to my configuration, with the proper server information filled in:
<value>thrift://<!--URL of Your Server-->:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
I thought the snippet above was not required since I am running Cloudera HDP in pseudo-distributed mode. It turns out that both it and HBase Thrift are required to use HCatalog with Pig.

Weblogic forces recompile of EJBs when migrating from 9.2.1 to 9.2.3

I have a few EJBs compiled with Weblogic's EJBC, compliant with Weblogic 9.2.1.
Our customer uses Weblogic 9.2.3.
During server start Weblogic gives the following message:
<BEA-010087> <The EJB deployment named: YYY.jar is being recompiled within the WebLogic Server. Please consult the server logs if there are any errors. It is also possible to run weblogic.appc as a stand-alone tool to generate the required classes. The generated source files will be placed in .....>
Consequently, server start takes 1.5 hours instead of 20 min. The next server start takes exactly the same time, meaning Weblogic does not cache the products of the recompilation. Needless to say, we cannot recompile all our EJBs to 9.2.3 just for this specific customer, so we need an on-site solution.
My questions are:
1. Is there any way of telling Weblogic to leave those EJB jars as they are and avoid the re-compilation during server start?
2. Can I tell Weblogic to cache the recompiled EJBs to avoid prolonged restarts?
Our current workaround was to write a script that does this recompilation manually before the EAR's creation and deployment (by simply running java weblogic.appc <jar-name>), but we would rather avoid this solution being used in production.
I fixed this problem after spending a great deal of time researching
and decompiling some classes. I encountered it when migrating from WebLogic 8 to 10.
By now you have probably experienced the pain of dealing with Oracle WebLogic tech support.
Unfortunately, they did not have a server configuration setting to disable this.
You need to do two things.
Step 1. If you open the EJB jar files, you will see a marker file
containing a hash code for each of your EJB names. Set these hash codes to zero,
repack the jar file, and deploy it on the server.
This is just a marker file that weblogic.appc keeps in each EJB jar to trigger the recompilation
during server boot. I automated the process of setting these hash codes to zero.
The hash codes remain the same for each EJB even if you run appc more than once.
If you add a new EJB class or delete a class, the entries in this marker file change accordingly.
Note 1:
How do you get this file?
If you open domains/yourdomain/servers/yourServerName/cache/EJBCompilerCache/XXXXXXXXX
you will see this file for each EJB. WebLogic sets the hash codes to zero after it recompiles.
Note 2:
When you generate EJBs using appc, generate them into an exploded directory using -output C:\myejb
instead of C:\myejb.jar. This way you can experiment with the marker file.
Step 2. You also need a patch from WebLogic. When you install the patch, you see a message like this:
"PATH CRXXXXXX installed successfully.Eliminate EJB recomilation for appc".
I don't remember the patch number, but you can request it from WebLogic support.
You need both steps to fix the problem; the patch fixes only part of it.
The marker file in the EJB jars is WL_GENERATED.
Just to update the solution we went with - eventually we opted to recompile the EJBs once at the Customer's site instead of messing with the EJBs' internal markers (we don't want Oracle saying they cannot support problems derived from this scenario).
We created two KSH scripts - the first iterates over all the EJB jars, copies them to a temp dir and then re-compiles them in parallel by running several instances of the 2nd script which does only one thing: java -Drecompiler=yes -cp $CLASSPATH weblogic.appc $1 (With error handling of course :))
This solution reduced compilation time from 70min to 15min. After this we re-create the EAR file and redeploy it with the new EJBs. We do this once per several UAT environment creations, so we save quite a lot of time here (55min X num of envs per drop X num of drops)
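The two-script approach described above could be sketched along these lines. This is not the original KSH code: the function names recompile_one and recompile_all, the default of four parallel jobs, and the batch-style throttling are assumptions, and the error handling mentioned in the answer is left out for brevity. Only the appc command line is taken from the answer.

```shell
# Second "script": recompile a single EJB jar (command line from the answer).
recompile_one() {
  java -Drecompiler=yes -cp "$CLASSPATH" weblogic.appc "$1"
}

# First "script": copy all jars to a temp dir and recompile them in parallel.
# $1 = directory containing the EJB jars, $2 = max parallel jobs (assumed default 4).
recompile_all() {
  src_dir=$1
  max_jobs=${2:-4}
  work_dir=$(mktemp -d) || return 1
  cp "$src_dir"/*.jar "$work_dir"/ || return 1
  running=0
  for jar in "$work_dir"/*.jar; do
    recompile_one "$jar" &
    running=$((running + 1))
    # Simple throttle: once a batch is full, wait for it before starting more.
    if [ "$running" -ge "$max_jobs" ]; then
      wait
      running=0
    fi
  done
  wait
  # Print where the recompiled jars ended up.
  echo "$work_dir"
}
```

A call such as recompile_all /path/to/ejb-jars 4 would print the temp directory holding the recompiled jars, from which the EAR can then be rebuilt.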