Pig script on AWS EMR with Tez occasionally fails with OutOfMemoryError - apache-pig

I have a Pig script running on an EMR cluster (emr-5.4.0) that uses a custom UDF. The UDF looks up dimensional data, for which it loads a (somewhat) large amount of text data.
In the pig script, the UDF is used as follows:
DEFINE LookupInteger com.ourcompany.LookupInteger(<some parameters>);
The UDF stores some data in a Map<Integer, Integer>.
On some input data the aggregation fails with the following exception:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.split(String.java:2377)
at java.lang.String.split(String.java:2422)
[...]
at com.ourcompany.LocalFileUtil.toMap(LocalFileUtil.java:71)
at com.ourcompany.LookupInteger.exec(LookupInteger.java:46)
at com.ourcompany.LookupInteger.exec(LookupInteger.java:19)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextInteger(POUserFunc.java:379)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.genericGetNext(POBinCond.java:76)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNextInteger(POBinCond.java:118)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
This does not occur when the pig aggregation is run with mapreduce, so a workaround for us is to replace pig -t tez with pig -t mapreduce.
As I'm new to Amazon EMR and to Pig with Tez, I'd appreciate some hints on how to analyse or debug the issue.
EDIT:
It looks like strange runtime behaviour when the Pig script runs on the Tez stack.
Please note that the Pig script uses
- replicated joins (the smaller relations to be joined need to fit into memory), and
- the already mentioned UDF, which initialises a Map<Integer, Integer> and produces the aforementioned OutOfMemoryError.

We found another workaround that keeps the Tez backend: increasing the values of mapreduce.map.memory.mb and mapreduce.map.java.opts (the latter is 0.8 times mapreduce.map.memory.mb). Those values are bound to the EC2 instance type and are usually fixed (see the AWS EMR task configuration).
By (temporarily) doubling the values, we were able to make the Pig script succeed.
The defaults for an m3.xlarge core instance are:
mapreduce.map.java.opts := -Xmx1152m
mapreduce.map.memory.mb := 1440
Pig startup command:
pig -Dmapreduce.map.java.opts=-Xmx2304m \
-Dmapreduce.map.memory.mb=2880 -stop_on_failure -x tez ... script.pig
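The relationship between the two settings (heap at 0.8 times the container size) can be checked with a short calculation. This is only a minimal sketch: the helper name heap_opts_for is made up for illustration, and the 0.8 factor is taken from the description above, not from any EMR API.

```python
def heap_opts_for(container_mb: int, heap_fraction: float = 0.8) -> str:
    """Return a -Xmx flag sized as a fraction of the container memory (in MB)."""
    # heap_fraction=0.8 is the ratio described in the post (an assumption here).
    heap_mb = int(round(container_mb * heap_fraction))
    return f"-Xmx{heap_mb}m"


# The m3.xlarge default (1440) and the doubled value (2880) from the post.
for container_mb in (1440, 2880):
    print(f"mapreduce.map.memory.mb={container_mb}",
          f"mapreduce.map.java.opts={heap_opts_for(container_mb)}")
```

With the values from the post, 1440 maps to -Xmx1152m and 2880 maps to -Xmx2304m, which matches the default and the doubled setting.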
EDIT
One colleague came up with the following idea:
Another workaround for the OutOfMemoryError (GC overhead limit exceeded) could be to add explicit STORE and LOAD statements for the problematic relations, which would make Tez flush the data to storage. This could also help in debugging the issue, as the (temporary, intermediate) data can then be inspected with other Pig scripts.

Related

"not a Parquet file (too small)" from Presto during Spark structured streaming run

I have a pipeline set up that reads data from Kafka, processes it using Spark structured streaming, and then writes Parquet files to HDFS. Downstream clients query the data using Presto, which is configured to read the data as Hive tables.
Kafka --> Spark --> Parquet on HDFS --> Presto
In general this works. The problem arises when a query happens while the Spark job is running a batch. The Spark job creates a zero-length Parquet file on HDFS. If Presto attempts to open this file in the course of processing a query, then it throws an error:
Query 20171116_170937_07282_489cc failed: Error opening Hive split hdfs://namenode:50071/hive/warehouse/table/part-00000-5a7c242a-3e53-46d0-9ee4-5d004ef4b1e4-c000.snappy.parquet (offset=0, length=0): hdfs://namenode:50071/hive/warehouse/table/part-00000-5a7c242a-3e53-46d0-9ee4-5d004ef4b1e4-c000.snappy.parquet is not a Parquet file (too small)
The file is indeed zero bytes at this time, so the error is strictly correct, but this is not the behavior I want for the pipeline. I would like to be able to continuously write into the appropriate HDFS folders without disturbing the Presto queries.
The Spark scala code for the job looks like this:
val FilesOnDisk = 1
Spark
  .initKafkaStream("fleet_profile_test")
  .filter(_.name.contains(job.kafkaTag))
  .flatMap(job.parser)
  .coalesce(FilesOnDisk)
  .writeStream
  .trigger(ProcessingTime("1 hours"))
  .outputMode("append")
  .queryName(job.queryName)
  .format("parquet")
  .option("path", job.outputFilesPath)
  .start()
The job starts at the top of the hour, :00. The file is first visible on HDFS as a zero-length file at :05. It is not updated until it is written completely at :21, just before the job finishes. This makes the table effectively unusable from Presto 25% of the time.
Each file is only a little over 500kB, so I wouldn't expect the physical writing of the file to take very long. From my understanding, Parquet files have their metadata at the end of the file so someone writing bigger files would have even more trouble.
What strategies have people used to integrate Spark structured streaming and Presto while working around this Presto error?
You could try to persuade Presto (or the Presto team) to ignore empty files, but that wouldn't help: the program writing the file (here: Spark) will eventually flush partial data, and the file would then appear partial, non-empty and not well formed, which leads to an error as well.
The way to prevent Presto (or any other program reading the table data, for that matter) from seeing a partial file is to assemble the file in a different location and then atomically move it into the correct location.
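The write-elsewhere-then-move pattern can be sketched as follows. This is an illustration on a local filesystem only, not Spark or HDFS code: the function name publish_atomically and the staging_dir parameter are made up here, and the analogous step on HDFS would be a rename of the finished file from a staging directory into the table directory.

```python
import os
import tempfile


def publish_atomically(data: bytes, staging_dir: str, final_path: str) -> None:
    """Assemble the file in staging_dir, then move it into place in one step."""
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        # os.replace is atomic only if both paths are on the same filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Never leave a half-written file behind in the staging area.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers that list the target directory then see either no file or the complete file, never a zero-length or partially written one.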

Mapreduce job not launching when running hive query with where clause

I am using apache-hive-1.2.2 on Hadoop 2.6.0. When I run a Hive query with a WHERE clause, it returns results immediately without launching any MapReduce job. I'm not sure what is happening. The table has over 100k records.
I am quoting this from the Hive documentation:
hive.fetch.task.conversion
Some select queries can be converted to a single FETCH task,
minimizing latency. Currently the query should be single sourced not
having any subquery and should not have any aggregations or distincts
(which incur RS – ReduceSinkOperator, requiring a MapReduce task),
lateral views and joins.
Any kind of aggregation, such as max, min or count, is going to require a MapReduce job, so it depends on what your query does. Take this query, for example:
select * from tablename;
It just reads the raw data from the files in HDFS, so it does not need MapReduce and is much faster without it.
This is due to the property "hive.fetch.task.conversion". The default value is set to "more" (Hive 2.1.0), which results in Hive trying to go straight at the data by launching a single Fetch task instead of a MapReduce job wherever possible.
This behaviour however might not be desirable in case you have a huge table (say 500 GB+) as it would cause a single thread to be launched instead of multiple threads as happens in the case of a Map Reduce job.
You can set this property to "minimal" or "none" in hive-site.xml to bypass the behaviour.

How to decrease the number of mappers in Hive when the files are bigger than the block size?

I have a table in Hive with more than 720 partitions. Each partition has more than 400 files, and the average file size is 1 GB.
Now I execute the following SQL:
insert overwrite table test_abc select * from DEFAULT.abc A WHERE A.P_HOUR ='2017042400' ;
This partition (P_HOUR = '2017042400') has 409 files. When I submit this SQL, I get the following output:
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:409
INFO : Submitting tokens for job: job_1482996444961_9384015
I searched through many documents for how to decrease the number of mappers, but most of them solve the problem for the case where the files are small.
I have tried the following settings in beeline, but they did not work:
-- first attempt
set mapred.min.split.size =5000000000;
set mapred.max.split.size =10000000000;
set mapred.min.split.size.per.node=5000000000;
set mapred.min.split.size.per.rack=5000000000;
-- second attempt
set mapreduce.input.fileinputformat.split.minsize =5000000000;
set mapreduce.input.fileinputformat.split.maxsize=10000000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=5000000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=5000000000;
My Hadoop version is:
Hadoop 2.7.2
Compiled by root on 11 Jul 2016 10:58:45
My Hive version is:
Connected to: Apache Hive (version 1.3.0)
Driver: Hive JDBC (version 1.3.0)
In addition to the settings in your post, set:
set hive.hadoop.supports.splittable.combineinputformat=true;
hive.hadoop.supports.splittable.combineinputformat
- Default Value: false
- Added In: Hive 0.6.0
Whether to combine small input files so that fewer mappers are spawned.
MRv2 uses CombineInputFormat, while Tez uses grouped splits to determine the number of mappers. If your execution engine is mr and you would like to reduce the number of mappers, use:
mapreduce.input.fileinputformat.split.maxsize=xxxxx
If maxSplitSize is specified, then blocks on the same node are combined to a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If the maxSplitSize is equal to the block size, then this class is similar to the default splitting behavior in Hadoop
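To get a feel for what a given maxsize could do, here is a rough back-of-the-envelope estimate. It is only a sketch: the function name estimate_combined_splits is made up for illustration, it ignores node and rack locality, and it assumes the inputs are splittable and actually get combined, so the real number of mappers will usually differ.

```python
def estimate_combined_splits(file_sizes, max_split_size):
    """Greedy estimate: pack input bytes into splits of at most max_split_size."""
    if max_split_size <= 0:
        raise ValueError("max_split_size must be positive")
    splits = 0
    current = 0  # bytes already assigned to the split being filled
    for size in file_sizes:
        remaining = size
        while remaining > 0:
            take = min(remaining, max_split_size - current)
            current += take
            remaining -= take
            if current == max_split_size:
                splits += 1
                current = 0
    if current > 0:
        splits += 1  # last, partially filled split
    return splits


# Numbers from the question: 409 files of about 1 GB (taken as 10**9 bytes),
# combined with the 10000000000-byte max split size that was tried.
print(estimate_combined_splits([10**9] * 409, 10**10))
```

Under these simplifying assumptions the 409 files would collapse into 41 splits; if the job still reports 409 splits, the combining is not taking effect at all.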
This link can be helpful to control Mapper in Hive if your execution engine is mr
If your execution engine is tez and you would like to control the number of mappers, then use:
set tez.grouping.max-size = XXXXXX;
There is also a good reference on parallelism in Hive for the Tez execution engine.

Hive query in oozie coordinator

I am running 10 Hive scripts using an Oozie coordinator. It gets stuck in one of the scripts, in the reduce stage, always at the same percentage and without any error. The scripts are simple insert statements, and when I tested them on the command line they worked fine. How do I debug this?
It was a data skew issue: 80% of the data was mapped to a single key. Once we updated to Hive 10, the skew join optimization resolved the issue.
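One way to spot this kind of skew is to measure how much of the data falls on the most frequent key. The following is a minimal sketch on an in-memory sample of keys; the function name largest_key_share is made up for illustration, and on a real table you would compute the same thing with a GROUP BY on the join or grouping key.

```python
from collections import Counter


def largest_key_share(keys):
    """Return (key, fraction) for the most frequent key in an iterable of keys."""
    counts = Counter(keys)
    if not counts:
        raise ValueError("no keys given")
    key, count = counts.most_common(1)[0]
    return key, count / sum(counts.values())


# Hypothetical sample in which one key dominates.
sample = ["a"] * 8 + ["b", "c"]
print(largest_key_share(sample))
```

A share close to 1 for a single key means one reducer receives most of the rows, which matches a job that sits at the same reduce percentage for a long time.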

Running Elastic Mapreduce Hive Queries from an Application

I've run Hive on elastic mapreduce in interactive mode:
./elastic-mapreduce --create --hive-interactive
and in script mode:
./elastic-mapreduce --create --hive-script --arg s3://mybucket/myfile.q
I'd like to have an application (preferably in PHP, R, or Python) on my own server be able to spin up an elastic mapreduce cluster and run several Hive commands while getting their output in a parsable form.
I know that spinning up a cluster can take some time, so maybe my application might have to do that in a separate step and wait for the cluster to become ready. But is there any way to do something like this somewhat concrete hypothetical example:
create Hive table customer_orders
run Hive query "SELECT dt, count(*) FROM customer_orders GROUP BY dt"
wait for result
parse result in PHP
run Hive query "SELECT MAX(id) FROM customer_orders"
wait for result
parse result in PHP
...
Does anyone have any recommendations on how I might do this?
You may use mrjob. It lets you write MapReduce jobs in Python 2.5+ and run them on several platforms.
An alternative is HiPy, an awesome project that may well be enough for all your needs. The purpose of HiPy is to support programmatic construction of Hive queries in Python and easier management of queries, including queries with transform scripts.
HiPy enables grouping together in a single script of query
construction, transform scripts and post-processing. This assists in
traceability, documentation and re-usability of scripts. Everything
appears in one place and Python comments can be used to document the
script.
Hive queries are constructed by composing a handful of Python objects,
representing things such as Columns, Tables and Select statements.
During this process, HiPy keeps track of the schema of the resulting
query output.
Transform scripts can be included in the main body of the Python
script. HiPy will take care of providing the code of the script to
Hive as well as of serialization and de-serialization of data to/from
Python data types. If any of the data columns contain JSON, HiPy takes
care of converting that to/from Python data types too.
Check out the Documentation for details!
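Independent of the libraries above, the run, wait and parse loop from the question can be sketched with the plain Hive CLI. This is a minimal sketch under assumptions: it has to run on a machine where the hive command is available (for example the cluster's master node), the function names are made up for illustration, and it relies on the CLI printing one tab-separated row per line.

```python
import subprocess


def parse_hive_rows(raw_output: str):
    """Split tab-separated CLI output into a list of rows (lists of strings)."""
    return [line.split("\t") for line in raw_output.splitlines() if line.strip()]


def run_hive_query(query: str, hive_binary: str = "hive"):
    """Run one query through the Hive CLI, block until it ends, return the rows."""
    # -S is silent mode, -e passes the query string on the command line.
    result = subprocess.run(
        [hive_binary, "-S", "-e", query],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_hive_rows(result.stdout)
```

A call such as run_hive_query("SELECT dt, count(*) FROM customer_orders GROUP BY dt") would then return one list per result row, with all values as strings.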