Hive count(1) leads to OOM - hive
I have a new cluster built with CDH 6.3; Hive is ready and the 3 nodes have 30 GB of memory.
I created a target Hive table stored as Parquet and put some Parquet files downloaded from another cluster into the HDFS directory of this table. When I run
select count(1) from tableA
it eventually shows:
INFO : 2021-09-05 14:09:06,505 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 436.69 sec
INFO : 2021-09-05 14:09:07,520 Stage-1 map = 74%, reduce = 0%, Cumulative CPU 426.94 sec
INFO : 2021-09-05 14:09:10,562 Stage-1 map = 94%, reduce = 0%, Cumulative CPU 464.3 sec
INFO : 2021-09-05 14:09:26,785 Stage-1 map = 94%, reduce = 31%, Cumulative CPU 464.73 sec
INFO : 2021-09-05 14:09:50,112 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 464.3 sec
INFO : MapReduce Total cumulative CPU time: 7 minutes 44 seconds 300 msec
ERROR : Ended Job = job_1630821050931_0003 with errors
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 18 Reduce: 1 Cumulative CPU: 464.3 sec HDFS Read: 4352500295 HDFS Write: 0 HDFS EC Read: 0 FAIL
INFO : Total MapReduce CPU Time Spent: 7 minutes 44 seconds 300 msec
INFO : Completed executing command(queryId=hive_20210905140833_6a46fec2-91fb-4214-a734-5b76e59a4266); Time taken: 77.981 seconds
Looking into the MR logs, they repeatedly show:
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.parquet.bytes.HeapByteBufferAllocator.allocate(HeapByteBufferAllocator.java:32)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1080)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:712)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:213)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:101)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:63)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
... 16 more
The Parquet files are only 4.5 GB in total, so why does count() run out of memory? Which MapReduce parameters should I change?
There are two ways to fix an OOM in the mappers: (1) increase mapper parallelism, or (2) increase the mapper size.
Try to increase parallelism first.
Check the current values of these parameters (see the quick check sketched after the settings below) and reduce mapreduce.input.fileinputformat.split.maxsize to get more, smaller mappers:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapreduce.input.fileinputformat.split.minsize=16000; -- 16 KB; files smaller than the min size will be combined and processed by the same mapper
set mapreduce.input.fileinputformat.split.maxsize=128000000; -- 128 MB; files bigger than the max size will be split. Decrease your setting to get about 2x more, smaller mappers
-- These figures are examples only. Compare with your current values and decrease accordingly until you get about 2x more mappers
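To see what the current values actually are before changing anything: in the Hive CLI or Beeline, running set with just a property name prints its current value, for example:
set mapreduce.input.fileinputformat.split.maxsize;
set mapreduce.map.memory.mb;
set mapreduce.map.java.opts;
-- each statement echoes the property back as name=value, so you can compare against the suggestions above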
Alternatively try to increase the mapper size:
set mapreduce.map.memory.mb=4096; --compare with current setting and increase
set mapreduce.map.java.opts=-Xmx3000m; --set ~30% less than mapreduce.map.memory.mb
Also try disabling map-side aggregation (map-side aggregation often causes mapper OOMs):
set hive.map.aggr=false;
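Putting the suggestions together, here is a minimal session sketch using the table name from the question (tableA); the split size and memory figures are illustrative only and should be tuned against your cluster's current defaults:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapreduce.input.fileinputformat.split.maxsize=64000000;   -- half of the 128 MB example above, to roughly double the mapper count
set mapreduce.map.memory.mb=4096;                             -- mapper container size
set mapreduce.map.java.opts=-Xmx3000m;                        -- heap ~30% below the container size, as above
set hive.map.aggr=false;                                      -- rule out map-side aggregation as the cause
select count(1) from tableA;
If the job still fails with the same Parquet stack trace, note that Parquet is read one row group at a time, so a very large row group in one of the copied files can force a single mapper to allocate a big buffer regardless of split-size tuning; in that case only a bigger mapper heap helps.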
Related
How vCores and Memory get allocated from Spark Pool
I have the below Spark pool config: Nodes: 3 to 10. Looking at the Spark job config and the resulting allocation, it looks like the job is using all 10 nodes from the pool: 10 x 8 vCores = 80 vCores; 10 x 64 GB = 640 GB. But I have set the number of executors (min & max) to 4 to 6, so shouldn't it go at most to 6 x 8 vCores and 6 x 64 GB? Please correct me if I am missing something here.
You are confusing the Spark pool's allocated vCores and memory with the Spark job's executor size; they are two different things. You created a ContractsMed Spark pool with a maximum of 10 nodes, each node sized at 8 vCores and 64 GB of memory. That is why the last snippet you shared shows the Spark pool's allocated vCores and memory, not the Spark job details: 80 vCores and 640 GB is the Spark pool size, not the Spark job's. Now coming to the Spark job configuration, which uses the ContractsMed Spark pool: as you have configured a maximum of 6 executors with 8 vCores and 56 GB of memory each, those resources, i.e. 6 x 8 = 48 vCores and 6 x 56 = 336 GB of memory, will be taken from the Spark pool and used by the job. The remaining resources (80 - 48 = 32 vCores and 640 - 336 = 304 GB of memory) in the Spark pool will stay unused and can be used by any other Spark job.
MapReduce Job continues to run with map = 0%, reduce = 0% for hours
I am running a Hive query that looks like
create table table1 as select split(comments,' ') as words from table2;
The comments column has review comments in the form of strings separated by spaces. When I run this query, the MapReduce job starts and continues to run with map = 0% for hours. It does not give any error during this process.
hive> create table jw_1 as select split(comments,' ') from removed_null_values;
Query ID = xxx-190418201314_7781cf59-6afb-4e82-ab75-c7e343c4985e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555607912038_0013, Tracking URL = http://xxx-VirtualBox:8088/proxy/application_1555607912038_0013/
Kill Command = /usr/local/bin/hadoop-3.2.0/bin/mapred job -kill job_1555607912038_0013
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-18 20:13:30,568 Stage-1 map = 0%, reduce = 0%
2019-04-18 20:14:31,140 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 39.6 sec
2019-04-18 20:15:31,311 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 101.64 sec
2019-04-18 20:16:31,451 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 146.5 sec
2019-04-18 20:17:31,684 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 212.08 sec
However, when I try
select split(comments,' ') from table2;
I can see the comments as an array in the shell:
["\"Lauren","was","promptly","responsive","in","advance","of","our","booking.","providing","a","lot","of","helpful","info.","And","she","stayed","in","contact","and","was","readily","available","prior","to","and","during","our","stay.","which","was","awesome.","The","location.","price","and","privacy","were","the","real","benefits."]
I have also run a few other queries where the MapReduce jobs complete and produce the desired result. I am currently using Hive 3.1.1. Basically, I want to create a new table with an array containing the words and later tokenize that column. I am new to Hive and I am working on sentiment analysis on a data file of about 35 MB.
In your first case, you most likely don't have the resources necessary to complete the Hive query once it is converted to MapReduce. You would have to look at either YARN or MR1 to determine whether you have enough compute resources to run your MapReduce job. In the second case, some Hive queries don't trigger MapReduce jobs at all, and that is why it comes back immediately. See How does Hive decide when to use map reduce and when not to? for more information.
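As a small illustration of that second point (a sketch only, reusing the table names from the question and assuming Hive's default fetch-task behaviour): simple projections can be answered by a fetch task without launching a job, while CTAS always compiles to MapReduce.
set hive.fetch.task.conversion;                                    -- typically 'more' by default in recent Hive versions
select split(comments,' ') from removed_null_values limit 10;      -- simple projection: served by a fetch task, no MR job
create table jw_1 as
select split(comments,' ') as words from removed_null_values;      -- CTAS: always launches a MapReduce job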
What is the performance overhead of XADisk for read and write operations?
What does XADisk do in addition to reading/writing from the underlying file? How does that translate into a percentage of the read/write throughput (approximately)?
It depends on the size; if the writes are large, set the "heavyWrite" flag to true when opening the xaFileOutputStream. In a test with 500 files of 1 MB each, the time taken, averaged over 10 executions, was:
Java IO - 37.5 seconds
Java NIO - 24.8 seconds
XADisk - 30.3 seconds
Suppress map and reduce progress for Hive queries
Is there an option to suppress the map and reduce progress for a Hive query, i.e.
2013-04-07 19:21:05,538 Stage-1 map = 13%, reduce = 4%, Cumulative CPU 28830.05 sec
2013-04-07 19:21:06,558 Stage-1 map = 13%, reduce = 4%, Cumulative CPU 28830.05 sec
while keeping all other output, particularly the query itself with the -v option?
hive -S -v -e "select * from test;"
-S is for silent mode
-v is for query display (verbose)
-e is for executing the query
Using `overlap`, `kernel time` and `utilization` to optimize one's kernels
My kernel achieves 100% utilization, but the kernel time is only 3% and there is no time overlap between memory copies and kernels. In particular, the combination of high utilization and low kernel time doesn't make sense to me. So how should I proceed in optimizing my kernel? I already made sure that I only have coalesced and pinned memory access, as the profiler recommended.
Quadro FX 580 utilization = 100.00% (62117.00/62117.00)
Kernel time = 3.05 % of total GPU time
Memory copy time = 0.9 % of total GPU time
Kernel taking maximum time = Pinned (0.7% of total GPU time)
Memory copy taking maximum time = memcpyHtoD (0.5% of total GPU time)
There is no time overlap between memory copies and kernels on GPU
Furthermore, I have no warp serialization, no divergent branches, and no occupancy limiting factor.
Kernel details: Grid size: [4 1 1], Block size: [256 1 1]
Register Ratio: 0.9375 ( 7680 / 8192 ) [10 registers per thread]
Shared Memory Ratio: 0.09375 ( 1536 / 16384 ) [60 bytes per Block]
Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)
Active threads per SM: 768 (Maximum Active threads per SM: 768)
Potential Occupancy: 1 ( 24 / 24 )
Achieved occupancy: 0.333333 (on 4 SMs)
Occupancy limiting factor: None
P.S. I don't claim that I wrote wundercode, but I just don't know how to proceed from here.
It seems the grid size of your kernel is too small to make full use of the SMs. Why not decrease the block size and increase the grid size? I think that will help.