Is there an option to suppress the map and reduce progress for a hive query, i.e.
2013-04-07 19:21:05,538 Stage-1 map = 13%, reduce = 4%, Cumulative CPU 28830.05 sec
2013-04-07 19:21:06,558 Stage-1 map = 13%, reduce = 4%, Cumulative CPU 28830.05 sec
while keeping all other output, particularly the query itself with the -v option.
hive -S -v -e "select * from test;"
S is for silent mode
v is for query display
e is for executing the query
Related
I have a new cluster built by cdh 6.3, hive is ready now and 3 nodes have 30GB memory.
I create a target hive table stored as parquet. I put some parquet files downloaded from another cluster to the HDFS directory of this hive table, and when I run
select count(1) from tableA
I finally shows:
INFO : 2021-09-05 14:09:06,505 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 436.69 sec
INFO : 2021-09-05 14:09:07,520 Stage-1 map = 74%, reduce = 0%, Cumulative CPU 426.94 sec
INFO : 2021-09-05 14:09:10,562 Stage-1 map = 94%, reduce = 0%, Cumulative CPU 464.3 sec
INFO : 2021-09-05 14:09:26,785 Stage-1 map = 94%, reduce = 31%, Cumulative CPU 464.73 sec
INFO : 2021-09-05 14:09:50,112 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 464.3 sec
INFO : MapReduce Total cumulative CPU time: 7 minutes 44 seconds 300 msec
ERROR : Ended Job = job_1630821050931_0003 with errors
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 18 Reduce: 1 Cumulative CPU: 464.3 sec HDFS Read: 4352500295 HDFS Write: 0 HDFS EC Read: 0 FAIL
INFO : Total MapReduce CPU Time Spent: 7 minutes 44 seconds 300 msec
INFO : Completed executing command(queryId=hive_20210905140833_6a46fec2-91fb-4214-a734-5b76e59a4266); Time taken: 77.981 seconds
Looking into MR logs, it repeatedly shows:
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.parquet.bytes.HeapByteBufferAllocator.allocate(HeapByteBufferAllocator.java:32)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1080)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:712)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:213)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:101)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:63)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
... 16 more
The parquet files are only 4.5 GB in total, why could count() runs oom? What parameter should I change in MapReduce?
There are two ways how you can fix OOM in mapper: 1 - increase mapper parallelism, 2 - increase the mapper size.
Try to increase parallelism first.
Check current values of these parameters and reduce mapreduce.input.fileinputformat.split.maxsize to get more smaller mappers:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapreduce.input.fileinputformat.split.minsize=16000; -- 16 KB files. smaller than min size will be processed on the same mapper combined
set mapreduce.input.fileinputformat.split.maxsize=128000000; -- 128Mb -files bigger than max size will be splitted. Decrease your setting to get 2x more smaller mappers
--These figures are example only. Compare with yours and decrease accordingly untill you get 2x more mappers
Alternatively try to increase the mapper size:
set mapreduce.map.memory.mb=4096; --compare with current setting and increase
set mapreduce.map.java.opts=-Xmx3000m; --set ~30% less than mapreduce.map.memory.mb
Also try to disable map-side aggregation (map-side aggregation often leads to OOM on mapper)
set hive.map.aggr=false;
Two questions:
According to Nsight Compute, my kernel is compute bound. The SM % of utilization relative to peak performance is 74% and the memory utilization is 47%. However, when I look at each pipeline utilization percentage, LSU utilization is way higher than others (75% vs 10-15%). Wouldn't that be an indication that my kernel is memory bound? If the utilization of compute and memory resources doesn't correspond to pipeline utilization, I don't know how to interpret those terms.
The schedulers are only issuing every 4 cycles, wouldn't that mean that my kernel is latency bound? People usually define that in terms of utilization of compute and memory resources. What is the relationship between both?
In Nsight Compute on CC7.5 GPUs
SM% is defined by sm__throughput, and
Memory% is defined by gpu__compute_memory_throughtput
sm_throughput is the MAX of the following metrics:
sm__instruction_throughput
sm__inst_executed
sm__issue_active
sm__mio_inst_issued
sm__pipe_alu_cycles_active
sm__inst_executed_pipe_cbu_pred_on_any
sm__pipe_fp64_cycles_active
sm__pipe_tensor_cycles_active
sm__inst_executed_pipe_xu
sm__pipe_fma_cycles_active
sm__inst_executed_pipe_fp16
sm__pipe_shared_cycles_active
sm__inst_executed_pipe_uniform
sm__instruction_throughput_internal_activity
sm__memory_throughput
idc__request_cycles_active
sm__inst_executed_pipe_adu
sm__inst_executed_pipe_ipa
sm__inst_executed_pipe_lsu
sm__inst_executed_pipe_tex
sm__mio_pq_read_cycles_active
sm__mio_pq_write_cycles_active
sm__mio2rf_writeback_active
sm__memory_throughput_internal_activity
gpu__compute_memory_throughput is the MAX of the following metrics:
gpu__compute_memory_access_throughput
l1tex__data_bank_reads
l1tex__data_bank_writes
l1tex__data_pipe_lsu_wavefronts
l1tex__data_pipe_tex_wavefronts
l1tex__f_wavefronts
lts__d_atomic_input_cycles_active
lts__d_sectors
lts__t_sectors
lts__t_tag_requests
gpu__compute_memory_access_throughput_internal_activity
gpu__compute_memory_access_throughput
l1tex__lsuin_requests
l1tex__texin_sm2tex_req_cycles_active
l1tex__lsu_writeback_active
l1tex__tex_writeback_active
l1tex__m_l1tex2xbar_req_cycles_active
l1tex__m_xbar2l1tex_read_sectors
lts__lts2xbar_cycles_active
lts__xbar2lts_cycles_active
lts__d_sectors_fill_device
lts__d_sectors_fill_sysmem
gpu__dram_throughput
gpu__compute_memory_request_throughput_internal_activity
In your case the limiter is sm__inst_executed_pipe_lsu which is an instruction throughput. If you review sections/SpeedOfLight.py latency bound is defined as having both sm__throughput and gpu__compute_memory_throuhgput < 60%.
Some set of instruction pipelines have lower throughput such as fp64, xu, and lsu (varies with chip). The pipeline utilization is part of sm__throughput. In order to improve performance the options are:
Reduce instructions to the oversubscribed pipeline, or
Issue instructions of different type to use empty issue cycles.
GENERATING THE BREAKDOWN
As of Nsight Compute 2020.1 there is not a simple command line to generate the list without running a profiling session. For now you can collect one throughput metric using breakdown:<throughput metric>avg.pct_of_peak_sustained.elapsed and parse the output to get the sub-metric names.
For example:
ncu.exe --csv --metrics breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed --details-all -c 1 cuda_application.exe
generates:
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
...
The keyword breakdown can be used in Nsight Compute section files to expand a throughput metric. This is used in the SpeedOfLight.section.
I am running one Hive query which looks like
create table table1 as select split(comments,' ') as words from table2;
comments column has review comments in the form of Strings separated by space.
When I run this query, MapReduce job starts and continues to run with Map 0% for hours. It does not give any error during this process.
hive> create table jw_1 as select split(comments,' ') from removed_null_values;
Query ID = xxx-190418201314_7781cf59-6afb-4e82-ab75-c7e343c4985e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555607912038_0013, Tracking URL = http://xxx-VirtualBox:8088/proxy/application_1555607912038_0013/
Kill Command = /usr/local/bin/hadoop-3.2.0/bin/mapred job -kill job_1555607912038_0013
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-04-18 20:13:30,568 Stage-1 map = 0%, reduce = 0%
2019-04-18 20:14:31,140 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 39.6 sec
2019-04-18 20:15:31,311 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 101.64 sec
2019-04-18 20:16:31,451 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 146.5 sec
2019-04-18 20:17:31,684 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 212.08 sec
However when I try
select split(comments,' ') from table2;
I can see comments in the form of an array in the shell.
["\"Lauren","was","promptly","responsive","in","advance","of","our","booking.","providing","a","lot","of","helpful","info.","And","she","stayed","in","contact","and","was","readily","available","prior","to","and","during","our","stay.","which","was","awesome.","The","location.","price","and","privacy","were","the","real","benefits."]
I have also run a few other queries where the MapReduce jobs complete and produce the desired result
I am currently using Hive 3.1.1
Basically, I want to create a new table with an array containing words and later on tokenize that column
I am new to Hive and I am working on sentimental analysis on data file of size 35MB.
In your first case, you most likely don't have the resources necessary to complete the Hive query when converted to MapReduce. You would have to look at either YARN or MR1 to determine if you have enough compute resources to run your MapReduce job.
I the second query, some Hive queries trigger don't trigger MapReduce jobs and that is why it comes back. See How does Hive decide when to use map reduce and when not to? for more information.
I was inspired by this post called "Only fast languages are interesting" to look at the problem he suggests (sum'ing a couple of million numbers from a vector) in Haskell and compare to his results.
I'm a Haskell newbie so I don't really know how to time correctly or how to do this efficiently, my first attempt at this problem was the following. Note that I'm not using random numbers in the vector as I'm not sure how to do in a good way. I'm also printing stuff in order to ensure full evaluation.
import System.TimeIt
import Data.Vector as V
vector :: IO (Vector Int)
vector = do
let vec = V.replicate 3000000 10
print $ V.length vec
return vec
sumit :: IO ()
sumit = do
vec <- vector
print $ V.sum vec
time = timeIt sumit
Loading this up in GHCI and running time tells me that it took about 0.22s to run for 3 million numbers and 2.69s for 30 million numbers.
Compared to the blog authors results of 0.02s and 0.18s in Lush it's quite a lot worse, which leads me to believe this can be done in a better way.
Note: The above code needs the package TimeIt to run. cabal install timeit will get it for you.
First of all, realize that GHCi is an interpreter, and it's not designed to be very fast. To get more useful results you should compile the code with optimizations enabled. This can make a huge difference.
Also, for any serious benchmarking of Haskell code, I recommend using criterion. It uses various statistical techniques to ensure that you're getting reliable measurements.
I modified your code to use criterion and removed the print statements so that we're not timing the I/O.
import Criterion.Main
import Data.Vector as V
vector :: IO (Vector Int)
vector = do
let vec = V.replicate 3000000 10
return vec
sumit :: IO Int
sumit = do
vec <- vector
return $ V.sum vec
main = defaultMain [bench "sumit" $ whnfIO sumit]
Compiling this with -O2, I get this result on a pretty slow netbook:
$ ghc --make -O2 Sum.hs
$ ./Sum
warming up
estimating clock resolution...
mean is 56.55146 us (10001 iterations)
found 1136 outliers among 9999 samples (11.4%)
235 (2.4%) high mild
901 (9.0%) high severe
estimating cost of a clock call...
mean is 2.493841 us (38 iterations)
found 4 outliers among 38 samples (10.5%)
2 (5.3%) high mild
2 (5.3%) high severe
benchmarking sumit
collecting 100 samples, 8 iterations each, in estimated 6.180620 s
mean: 9.329556 ms, lb 9.222860 ms, ub 9.473564 ms, ci 0.950
std dev: 628.0294 us, lb 439.1394 us, ub 1.045119 ms, ci 0.950
So I'm getting an average of just over 9 ms with a standard deviation of less than a millisecond. For the larger test case, I'm getting about 100ms.
Enabling optimizations is especially important when using the vector package, as it makes heavy use of stream fusion, which in this case is able to eliminate the data structure entirely, turning your program into an efficient, tight loop.
It may also be worthwhile to experiment with the new LLVM-based code generator by using -fllvm option. It is apparently well-suited for numeric code.
Your original file, uncompiled, then compiled without optimization, then compiled with a simple optimization flag:
$ runhaskell boxed.hs
3000000
30000000
CPU time: 0.35s
$ ghc --make boxed.hs -o unoptimized
$ ./unoptimized
3000000
30000000
CPU time: 0.34s
$ ghc --make -O2 boxed.hs
$ ./boxed
3000000
30000000
CPU time: 0.09s
Your file with import qualified Data.Vector.Unboxed as V instead of import qualified Data.Vector as V (Int is an unboxable type) --
first without optimization then with:
$ ghc --make unboxed.hs -o unoptimized
$ ./unoptimized
3000000
30000000
CPU time: 0.27s
$ ghc --make -O2 unboxed.hs
$ ./unboxed
3000000
30000000
CPU time: 0.04s
So, compile, optimize ... and where possible use Data.Vector.Unboxed
Try to use an unboxed vector, although I'm not sure whether it makes a noticable difference in this case. Note also that the comparison is slightly unfair, because the vector package should optimize the vector away entirely (this optimization is called stream fusion).
If you use big enough vectors, Vector Unboxed might become impractical. For me pure (lazy) lists are quicker, if vector size > 50000000:
import System.TimeIt
sumit :: IO ()
sumit = print . sum $ replicate 50000000 10
main :: IO ()
main = timeIt sumit
I get these times:
Unboxed Vectors
CPU time: 1.00s
List:
CPU time: 0.70s
Edit: I've repeated the benchmark using Criterion and making sumit pure. Code and results as follow:
Code:
import Criterion.Main
sumit :: Int -> Int
sumit m = sum $ replicate m 10
main :: IO ()
main = defaultMain [bench "sumit" $ nf sumit 50000000]
Results:
warming up
estimating clock resolution...
mean is 7.248078 us (80001 iterations)
found 24509 outliers among 79999 samples (30.6%)
6044 (7.6%) low severe
18465 (23.1%) high severe
estimating cost of a clock call...
mean is 68.15917 ns (65 iterations)
found 7 outliers among 65 samples (10.8%)
3 (4.6%) high mild
4 (6.2%) high severe
benchmarking sumit
collecting 100 samples, 1 iterations each, in estimated 46.07401 s
mean: 451.0233 ms, lb 450.6641 ms, ub 451.5295 ms, ci 0.950
std dev: 2.172022 ms, lb 1.674497 ms, ub 2.841110 ms, ci 0.950
It looks like print makes a big difference, as it should be expected!
A CPU has a five-stage pipeline and runs at 1 GHz frequency. Instruction fetch
happens in the first stage of the pipeline. A conditional branch instruction
computes the target address and evaluates the condition in the third stage of the
pipeline. The processor stops fetching new instructions following a conditional
branch until the branch outcome is known. A program executes 10^9 instructions
out of which 20% are conditional branches. If each instruction takes one cycle to
complete on average, the total execution time of the program is:
(A) 1.0 second
(B) 1.2 seconds
(C) 1.4 seconds
(D) 1.6 seconds
Total_execution_time = (1+stall_cycle*stall_frequency)*exec_time_each_inst
exec_time_each_inst = 1s [i.e #1ghz need to execute 10^9 inst => 1 inst = 1 sec]
stall_frequency = 20% = .20
stall_cycle = 2
[i.e in 3rd stage of pipeline we know branch result, so there will be 2 stall cycles]
therefore Total_execution_time = (1+2*.20)*1 = 1.4 seconds
I don't know how to explain it better but hope it helps a bit :)