Hive: GC Overhead or Heap space error - dynamic partitioned table - hive

Could you please help me resolve this GC overhead and heap space error?
I am trying to insert into a partitioned table from another table (dynamic partitioning) using the query below:
INSERT OVERWRITE table tbl_part PARTITION(county)
SELECT col1, col2.... col47, county FROM tbl;
I have set the following parameters:
export HADOOP_CLIENT_OPTS=" -Xmx2048m"
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2048;
SET hive.exec.max.dynamic.partitions.pernode=256;
set mapreduce.map.memory.mb=2048;
set yarn.scheduler.minimum-allocation-mb=2048;
set hive.exec.max.created.files=250000;
set hive.vectorized.execution.enabled=true;
set hive.merge.smallfiles.avgsize=283115520;
set hive.merge.size.per.task=209715200;
I also added the following to yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
Running free -m:
total used free shared buffers cached
Mem: 15347 11090 4256 0 174 6051
-/+ buffers/cache: 4864 10483
Swap: 15670 18 15652
It's a standalone cluster with 1 core. I am preparing test data to run my unit test cases in Spark.
Could you suggest what else I could do?
The source table has the below properties:
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 13
numRows 10509065
rawDataSize 3718599422
totalSize 3729108487
transient_lastDdlTime 1470909228
Thank you.

Add DISTRIBUTE BY county to your query:
INSERT OVERWRITE table tbl_part PARTITION(county) SELECT col1, col2.... col47, county FROM tbl DISTRIBUTE BY county;
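Why this helps can be illustrated with a small simulation. This is only a sketch in plain Python, not Hive code: the row counts, the number of tasks, and the crc32-based routing are assumptions made for illustration. It counts how many distinct county values (and therefore open partition writers) each task has to handle with and without DISTRIBUTE BY county.

```python
import random
import zlib

def writers_per_task(rows, num_tasks, distribute_by_key):
    """Return, for each task, the set of partition values it must write."""
    open_writers = [set() for _ in range(num_tasks)]
    for i, county in enumerate(rows):
        if distribute_by_key:
            # DISTRIBUTE BY county: every row of a county goes to the same task.
            task = zlib.crc32(county.encode()) % num_tasks
        else:
            # No DISTRIBUTE BY: rows reach tasks regardless of county
            # (round-robin here, standing in for input-split order).
            task = i % num_tasks
        open_writers[task].add(county)
    return open_writers

random.seed(0)
counties = [f"county_{n}" for n in range(200)]          # hypothetical values
rows = [random.choice(counties) for _ in range(20000)]  # hypothetical data

without = writers_per_task(rows, 8, distribute_by_key=False)
with_key = writers_per_task(rows, 8, distribute_by_key=True)

max_without = max(len(s) for s in without)
max_with = max(len(s) for s in with_key)
files_without = sum(len(s) for s in without)
files_with = sum(len(s) for s in with_key)

print("max writers per task without DISTRIBUTE BY:", max_without)
print("max writers per task with DISTRIBUTE BY:   ", max_with)
print("total files without / with:", files_without, "/", files_with)
```

In this toy model each county is written by exactly one task when the rows are distributed by county, so both the number of open writers per task and the total number of output files drop.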

Related

Optimize Hive Query. java.lang.OutOfMemoryError: Java heap space/GC overhead limit exceeded

How can I optimize a query of this form, since I keep running into this OOM error? Or is there a better execution plan? If I remove the substring clause, the query works fine, which suggests that it takes a lot of memory.
When the job fails, the beeline output shows the OOM Java heap space error. What I read online suggested increasing HADOOP_HEAPSIZE (via export), but this still results in the error. I also tried increasing hive.tez.container.size and hive.tez.java.opts (the Tez heap), but the error persists. The YARN logs show GC overhead limit exceeded, which suggests either not enough memory or a query plan so inefficient that the JVM cannot reclaim enough memory.
I am using Azure HDInsight Interactive Query 4.0 with 20 worker nodes (D13v2, 8 cores and 56 GB RAM).
Source table
create external table database.sourcetable(
a,
b,
c,
...
(183 total columns)
...
)
PARTITIONED BY (
W string,
X int,
Y string,
Z int
)
Target Table
create external table database.NEWTABLE(
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z
)
PARTITIONED BY (
aAAA,
bBBB
)
Query
insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z,
cast(a as string) as aAAA,
from_unixtime(unix_timestamp(b,'yyMMdd'),'yyyyMMdd') as bBBB,
substring(upper(c),1,2) as cCCC
from database.sourcetable
If everything else is okay, try adding distribute by on the partition keys at the end of your query:
from database.sourcetable
distribute by aAAA, bBBB, cCCC
As a result, all rows for a given partition go to the same reducer, so each reducer keeps far fewer partition files open and consumes less memory.
Try sorting on the partition columns:
SET hive.optimize.sort.dynamic.partition=true;
When enabled, dynamic partitioning column will be globally sorted. This way we can keep only one record writer open for each partition value in the reducer thereby reducing the memory pressure on reducers.
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
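The effect described in the quoted documentation can be sketched with a toy model. This is plain Python, not Hive internals; the function name and the data are hypothetical. It tracks how many record writers one reducer holds open at the same time when rows arrive sorted by partition value (the previous writer can be closed as soon as the value changes) versus unsorted (any partition may reappear, so nothing can be closed early).

```python
import random

def peak_open_writers(partition_values, close_on_change):
    """Peak number of writers held open while processing rows in order."""
    open_writers = set()
    peak = 0
    previous = None
    for value in partition_values:
        if close_on_change and previous is not None and value != previous:
            # Sorted input: the previous partition is finished, close its writer.
            open_writers.discard(previous)
        open_writers.add(value)
        peak = max(peak, len(open_writers))
        previous = value
    return peak

random.seed(0)
values = [random.randrange(500) for _ in range(50000)]  # hypothetical partition ids

peak_unsorted = peak_open_writers(values, close_on_change=False)
peak_sorted = peak_open_writers(sorted(values), close_on_change=True)

print("peak open writers, unsorted input:", peak_unsorted)
print("peak open writers, sorted input:  ", peak_sorted)
```

With sorted input the model never needs more than one open writer, whereas the unsorted run ends up holding one writer per distinct partition value it has seen.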

Ignite: a simple group by sql query costs about 40s with 20 million pieces of data

Setup:
16 cores, 32 GB RAM
Ignite server version 2.7.5
sqlline version 1.3.0
JDK 1.8
data file: CSV format, size 3 GB, 10 million rows
data region: 15 GB
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<!-- Redefining the default region's settings -->
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="name" value="Default_Region"/>
<property name="maxSize" value="#{15L * 1024 * 1024 * 1024}"/>
</bean>
</property>
</bean>
</property>
load data
create table 'test'
load data from data.csv into table test
execute sql
SELECT division_code AS division_code,
sum(num) AS num,
sum(amount) AS amount
FROM test
GROUP BY division_code
It takes 47.089 seconds.
Then I added an index on the division_code field:
CREATE INDEX division_code ON test (division_code);
It still takes 47.844 seconds.
EXPLAIN output:
SELECT
A__Z0.DIVISION_CODE AS __C0_0,
SUM(A__Z0.NUM) AS __C0_1,
SUM(A__Z0.AMOUNT) AS __C0_2
FROM PUBLIC.TEST A__Z0
/* PUBLIC.IDX_DIVISION_CODE */
GROUP BY A__Z0.DIVISION_CODE
/* group sorted */
SELECT
__C0_0 AS DIVISION_CODE,
CAST(CAST(SUM(__C0_1) AS DOUBLE) AS DOUBLE) AS NUM,
CAST(CAST(SUM(__C0_2) AS DOUBLE) AS DOUBLE) AS AMOUNT
FROM PUBLIC.__T0
/* PUBLIC."merge_scan" */
GROUP BY __C0_0
Parsing gc.log with GCViewer:
While executing the preceding GROUP BY query, the total pause time is 0.2 s and there is never a full GC.
Has anyone done Ignite performance testing? Should I turn on some setting?
Thanks in advance
It's recommended to use scan queries or partition-affinity runs for this kind of aggregation instead of SQL. SQL will hold a lot of data on the heap for such operations, and Ignite has only rudimentary optimizations for these cases (huge GROUP BY or ORDER BY).
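The idea behind a partition-local aggregation can be sketched as follows. This is plain Python and deliberately does not use the Ignite API; the function names and the sample rows are hypothetical. Each partition computes partial sums per division_code over its own rows, and only the small partial results are merged, which mirrors the two-step map and reduce plan shown in the EXPLAIN output.

```python
from collections import defaultdict

def aggregate_partition(rows):
    """Partial SUM(num) and SUM(amount) per division_code for one partition."""
    partial = defaultdict(lambda: [0, 0.0])
    for division_code, num, amount in rows:
        acc = partial[division_code]
        acc[0] += num
        acc[1] += amount
    return partial

def merge_partials(partials):
    """Combine the per-partition partial sums into the final result."""
    total = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for code, (num, amount) in partial.items():
            total[code][0] += num
            total[code][1] += amount
    return {code: tuple(sums) for code, sums in total.items()}

# Hypothetical data: three partitions of (division_code, num, amount) rows.
partitions = [
    [("A", 1, 10.0), ("B", 2, 20.0)],
    [("A", 3, 30.0)],
    [("B", 4, 40.0), ("C", 5, 50.0)],
]

result = merge_partials(aggregate_partition(p) for p in partitions)
print(result)
```

Only one small dictionary per partition crosses the partition boundary, so the memory needed for the merge step depends on the number of distinct division codes rather than on the number of rows.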
What query do you mean as "pre group by sql"?

Hive: java.lang.OutOfMemoryError: Java heap space and Job running in-process (local Hadoop)

My setup: 4 node cluster in Google Cloud Platform (1 master, 3 workers) running NixOS Linux.
I have been using the TPC-DS toolkit to generate the data, and the queries are the standard ones. On smaller datasets and simpler queries they work just fine.
The queries I take from here: https://github.com/hortonworks/hive-testbench/tree/hdp3/sample-queries-tpcds
This is the first one, query1.sql:
WITH customer_total_return AS
(
SELECT sr_customer_sk AS ctr_customer_sk ,
sr_store_sk AS ctr_store_sk ,
Sum(sr_fee) AS ctr_total_return
FROM store_returns ,
date_dim
WHERE sr_returned_date_sk = d_date_sk
AND d_year =2000
GROUP BY sr_customer_sk ,
sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1 ,
store ,
customer
WHERE ctr1.ctr_total_return >
(
SELECT Avg(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'NM'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id limit 100;
At first I could not run this successfully at all; it failed with java.lang.OutOfMemoryError: Java heap space.
What I did was:
Increased the power of the GCP nodes (up to 7.5 GB of RAM and dual-core CPUs)
Set these variables inside of the Hive CLI:
set mapreduce.map.memory.mb=2048;
set mapreduce.map.java.opts=-Xmx1024m;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmxe3072m;
set mapred.child.java.opts=-Xmx1024m;
Restarted Hive
Then this query worked (along with other similar ones) on a 1 GB dataset. I monitored the situation with htop: memory usage does not exceed 2 GB, while both CPU cores are at almost 100% constantly.
Now the problem is that with more complex queries on a larger dataset, the error appears again.
The query runs fine for a minute or so but ends in a FAIL. Full stack trace:
hive> with customer_total_return as
> (select sr_customer_sk as ctr_customer_sk
> ,sr_store_sk as ctr_store_sk
> ,sum(SR_FEE) as ctr_total_return
> from store_returns
> ,date_dim
> where sr_returned_date_sk = d_date_sk
> and d_year =2000
> group by sr_customer_sk
> ,sr_store_sk)
> select c_customer_id
> from customer_total_return ctr1
> ,store
> ,customer
> where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
> from customer_total_return ctr2
> where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
> and s_store_sk = ctr1.ctr_store_sk
> and s_state = 'TN'
> and ctr1.ctr_customer_sk = c_customer_sk
> order by c_customer_id
> limit 100;
No Stats for default#store_returns, Columns: sr_returned_date_sk, sr_fee, sr_store_sk, sr_customer_sk
No Stats for default#date_dim, Columns: d_date_sk, d_year
No Stats for default#store, Columns: s_state, s_store_sk
No Stats for default#customer, Columns: c_customer_sk, c_customer_id
Query ID = root_20190811164854_c253c67c-ef94-4351-b4d3-74ede4c5d990
Total jobs = 14
Stage-29 is selected by condition resolver.
Stage-1 is filtered out by condition resolver.
Stage-30 is selected by condition resolver.
Stage-10 is filtered out by condition resolver.
SLF4J: Found binding in [jar:file:/nix/store/jjm6636r99r0irqa03dc1za9gs2b4fx6-source/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/nix/store/q9jpwzbqbg8k8322q785xfavg0p0v18i-hadoop-3.1.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
Execution completed successfully
MapredLocal task succeeded
SLF4J: Found binding in [jar:file:/nix/store/jjm6636r99r0irqa03dc1za9gs2b4fx6-source/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/nix/store/q9jpwzbqbg8k8322q785xfavg0p0v18i-hadoop-3.1.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Execution completed successfully
MapredLocal task succeeded
Launching Job 3 out of 14
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-08-11 16:49:19,415 Stage-20 map = 0%, reduce = 0%
2019-08-11 16:49:22,418 Stage-20 map = 100%, reduce = 0%
Ended Job = job_local404291246_0005
Launching Job 4 out of 14
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-08-11 16:49:24,718 Stage-22 map = 0%, reduce = 0%
2019-08-11 16:49:27,721 Stage-22 map = 100%, reduce = 0%
Ended Job = job_local566999875_0006
Launching Job 5 out of 14
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-08-11 16:49:29,958 Stage-2 map = 0%, reduce = 0%
2019-08-11 16:49:33,970 Stage-2 map = 100%, reduce = 0%
2019-08-11 16:49:35,974 Stage-2 map = 100%, reduce = 100%
Ended Job = job_local1440279093_0007
Launching Job 6 out of 14
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-08-11 16:49:37,235 Stage-11 map = 0%, reduce = 0%
2019-08-11 16:49:40,421 Stage-11 map = 100%, reduce = 0%
2019-08-11 16:49:42,424 Stage-11 map = 100%, reduce = 100%
Ended Job = job_local1508103541_0008
SLF4J: Found binding in [jar:file:/nix/store/jjm6636r99r0irqa03dc1za9gs2b4fx6-source/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/nix/store/q9jpwzbqbg8k8322q785xfavg0p0v18i-hadoop-3.1.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2019-08-11 16:49:51 Dump the side-table for tag: 1 with group count: 21 into file: file:/tmp/root/3ab30b3b-380d-40f5-9f72-68788d998013/hive_2019-08-11_16-48-54_393_105456265244058313-1/-local-10019/HashTable-Stage-19/MapJoin-mapfile71--.hashtable
Execution completed successfully
MapredLocal task succeeded
Launching Job 7 out of 14
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-08-11 16:49:53,956 Stage-19 map = 100%, reduce = 0%
Ended Job = job_local2121921517_0009
Stage-26 is filtered out by condition resolver.
Stage-27 is selected by condition resolver.
Stage-4 is filtered out by condition resolver.
2019-08-11 16:50:01 Dump the side-table for tag: 0 with group count: 99162 into file: file:/tmp/root/3ab30b3b-380d-40f5-9f72-68788d998013/hive_2019-08-11_16-48-54_393_105456265244058313-1/-local-10017/HashTable-Stage-17/MapJoin-mapfile60--.hashtable
2019-08-11 16:50:02 Uploaded 1 File to: file:/tmp/root/3ab30b3b-380d-40f5-9f72-68788d998013/hive_2019-08-11_16-48-54_393_105456265244058313-1/-local-10017/HashTable-Stage-17/MapJoin-mapfile60--.hashtable (2832042 bytes)
Execution completed successfully
MapredLocal task succeeded
Launching Job 9 out of 14
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-08-11 16:50:04,004 Stage-17 map = 0%, reduce = 0%
2019-08-11 16:50:05,005 Stage-17 map = 100%, reduce = 0%
Ended Job = job_local694362009_0010
Stage-24 is selected by condition resolver.
Stage-25 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
SLF4J: Found binding in [jar:file:/nix/store/q9jpwzbqbg8k8322q785xfavg0p0v18i-hadoop-3.1.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2019-08-11 16:50:12 Starting to launch local task to process map join; maximum memory = 239075328
Execution completed successfully
MapredLocal task succeeded
Launching Job 11 out of 14
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-08-11 16:50:14,254 Stage-13 map = 100%, reduce = 0%
Ended Job = job_local1812693452_0011
Launching Job 12 out of 14
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-08-11 16:50:15,481 Stage-6 map = 0%, reduce = 0%
Ended Job = job_local920309638_0012 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-20: HDFS Read: 8662606197 HDFS Write: 0 SUCCESS
Stage-Stage-22: HDFS Read: 9339349675 HDFS Write: 0 SUCCESS
Stage-Stage-2: HDFS Read: 9409277766 HDFS Write: 0 SUCCESS
Stage-Stage-11: HDFS Read: 9409277766 HDFS Write: 0 SUCCESS
Stage-Stage-19: HDFS Read: 4704638883 HDFS Write: 0 SUCCESS
Stage-Stage-17: HDFS Read: 4771516428 HDFS Write: 0 SUCCESS
Stage-Stage-13: HDFS Read: 4771516428 HDFS Write: 0 SUCCESS
Stage-Stage-6: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
The problem in the hive.log file is still the same:
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
I also realized that my worker nodes don't actually do anything (htop showed that they were idle while only the master node was working).
The stack trace shows it too:
Job running in-process (local Hadoop)
How can I make Hive use HDFS not just Local Hadoop?
Running hdfs dfs -df -h hdfs:<redacted>:9000/ returns
Filesystem Size Used Available Use%
hdfs://<redacted>:9000 88.5 G 34.3 G 35.2 G 39%
This is correct: I have 3 worker nodes, each with a 30 GB disk.
java.lang.OutOfMemoryError: Java heap space happens when you try to push too much data onto a single machine.
Based on the query provided, there are a few things you can try:
Change your join conditions to explicit joins (remove the join conditions from the WHERE clause and use INNER/LEFT JOIN), e.g.
FROM customer_total_return ctr1
INNER JOIN store s
ON ctr1.ctr_store_sk = s.s_store_sk
AND s_state = 'NM'
INNER JOIN customer c
ON ctr1.ctr_customer_sk = c.c_customer_sk
Check if you have skewed data for one of the following fields:
store_returns -> sr_returned_date_sk
store_returns -> sr_store_sk
store_returns -> sr_customer_sk
customer -> c_customer_sk
store -> s_store_sk
It is possible that one of the keys accounts for a high percentage of the values, which can overload a single node (when the data size is huge).
Basically, you are trying to eliminate the possible causes of node overloading.
Let me know if it helps.
It could be a resource issue. Hive queries are executed internally as MapReduce jobs. You can check the Job History logs for the failed Hive MapReduce jobs. Sometimes executing queries from the shell is faster than from the Hive query editor.
Most of the time, OOM issues are related to query performance.
There are two parts to the query here:
Part 1:
WITH customer_total_return AS
(
SELECT sr_customer_sk AS ctr_customer_sk ,
sr_store_sk AS ctr_store_sk ,
Sum(sr_fee) AS ctr_total_return
FROM store_returns ,
date_dim
WHERE sr_returned_date_sk = d_date_sk
AND d_year =2000
GROUP BY sr_customer_sk ,
sr_store_sk)
Part 2:
SELECT c_customer_id
FROM customer_total_return ctr1 ,
store ,
customer
WHERE ctr1.ctr_total_return >
(
SELECT Avg(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'NM'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id limit 100;
Try enabling JMX for the Hive cluster and look at the memory usage of both parts of the query, and of the inner query in Part 2 as well.
A few Hive optimizations can be tried for the queries above:
Use SORT BY instead of ORDER BY: the SORT BY clause orders the data only within each reducer.
Partition the tables on the join keys so that only the relevant data is read instead of scanning the whole table.
Cache the small Hive table in the distributed cache and use a map-side join to reduce shuffling.
For example:
select /*+MAPJOIN(b)*/ col1,col2,col3,col4
from table_A a
join
table_B b
on
a.account_number=b.account_number
If any of the tables may contain skewed data, use the following parameters:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000; (i.e. the number of rows with the same join key above which the key is treated as skewed)
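To judge whether a threshold like this would ever trigger, you can count rows per join key on a sample. The following is a minimal sketch in plain Python; the function name and the sample data are hypothetical, and in practice the counts would come from a GROUP BY on the join key rather than from an in-memory list.

```python
from collections import Counter

SKEW_THRESHOLD = 100000  # same value as hive.skewjoin.key above

def skewed_keys(keys, threshold=SKEW_THRESHOLD):
    """Return the keys that occur more than `threshold` times, with counts."""
    counts = Counter(keys)
    return {key: count for key, count in counts.items() if count > threshold}

# Hypothetical sample of join-key values: key 1 is heavily over-represented.
sample = [1] * 150000 + [2] * 500 + [3] * 99999

hot = skewed_keys(sample)
print(hot)
```

Keys that exceed the threshold are the ones worth checking first when a single node runs out of memory during a join.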

Issue in Hive Query due to memory

We have an insert query that inserts data into a partitioned table by reading from a non-partitioned table.
Query -
insert into db1.fact_table PARTITION(part_col1, part_col2)
( col1,
col2,
col3,
col4,
col5,
col6,
.
.
.
.
.
.
.
col32
LOAD_DT,
part_col1,
Part_col2 )
select
col1,
col2,
col3,
col4,
col5,
col6,
.
.
.
.
.
.
.
col32,
part_col1,
Part_col2
from db1.main_table WHERE col1=0;
The table has 34 columns. The number of records in the main table depends on the size of the input file, which we receive daily.
The number of partitions (part_col1, part_col2) inserted in each run varies from 4000 to 5000.
Sometimes this query fails with the error below.
2019-04-28 13:23:31,715 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 177220.23 sec
2019-04-28 13:24:25,989 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 163577.82 sec
MapReduce Total cumulative CPU time: 1 days 21 hours 26 minutes 17 seconds 820 msec
Ended Job = job_1556004136988_155295 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1556004136988_155295_m_000003 (and more) from job job_1556004136988_155295
Examining task ID: task_1556004136988_155295_m_000004 (and more) from job job_1556004136988_155295
Task with the most failures(4):
-----
Task ID: task_1556004136988_155295_m_000000
-----
Diagnostic Messages for this Task:
Exception from container-launch.
Container id: container_e81_1556004136988_155295_01_000015
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:563)
at org.apache.hadoop.util.Shell.run(Shell.java:460)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:748)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:356)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:88)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Shell output: main : command provided 1
main : user is bldadmin
main : requested yarn user is bldadmin
Container exited with a non-zero exit code 255
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 10 Cumulative CPU: 163577.82 sec MAPRFS Read: 0 MAPRFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 1 days 21 hours 26 minutes 17 seconds 820 msec
Current hive properties.
Using Tez Engine -
set hive.execution.engine=tez;
set hive.tez.container.size=3072;
set hive.tez.java.opts=-Xmx1640m;
set hive.vectorized.execution.enabled=false;
set hive.vectorized.execution.reduce.enabled=false;
set hive.enforce.bucketing=true;
set hive.exec.parallel=true;
set hive.auto.convert.join=false;
set hive.enforce.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.optimize.bucketmapjoin=true;
set hive.exec.tmp.maprfsvolume=false;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.stats.fetch.partition.stats=true;
set hive.support.concurrency=true;
set hive.exec.max.dynamic.partitions=999999999;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
Based on input from other teams, we changed the engine to mr, and the properties are:
set hive.execution.engine=mr;
set hive.auto.convert.join=false;
set mapreduce.map.memory.mb=16384;
set mapreduce.map.java.opts=-Xmx14745m;
set mapreduce.reduce.memory.mb=16384;
set mapreduce.reduce.java.opts=-Xmx14745m;
With these properties the query completed without any errors a few times.
How can I debug this issue, and are there any Hive properties we can set so that we don't hit these issues in the future?
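As a quick sanity check on the numbers above, the ratio of JVM heap (-Xmx) to container size can be computed for both configurations. This is only an arithmetic sketch; the helper name is hypothetical. A frequently cited rule of thumb, which is an assumption here and not something stated in the question, is to give the heap roughly 80% of the container so that some room remains for off-heap memory.

```python
def heap_ratio(xmx_mb, container_mb):
    """Fraction of the container memory that is given to the JVM heap."""
    return xmx_mb / container_mb

# (heap in MB, container in MB), values copied from the settings above.
settings = {
    "tez: hive.tez.java.opts / hive.tez.container.size": (1640, 3072),
    "mr map: mapreduce.map.java.opts / mapreduce.map.memory.mb": (14745, 16384),
    "mr reduce: mapreduce.reduce.java.opts / mapreduce.reduce.memory.mb": (14745, 16384),
}

ratios = {name: heap_ratio(xmx, size) for name, (xmx, size) in settings.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

The Tez settings give the heap only about half of the 3072 MB container, while the mr settings give it about 90% of a container that is more than five times larger, so the two runs are not comparable on engine alone.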
Add distribute by on the partition keys. Each reducer will then process only its own partitions rather than every partition, which reduces memory consumption because each reducer creates fewer files and keeps fewer buffers.
insert into db1.fact_table PARTITION(part_col1, part_col2)
select
col1,
...
col32,
part_col1,
Part_col2
from db1.main_table WHERE col1=0
distribute by part_col1, Part_col2; --add this
Use predicate pushdown; it may help with filtering if the source files are ORC:
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
SET hive.optimize.index.filter=true;
Tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
If your data is too big and the distribution by partition key is uneven, add a random component to distribute by in addition to the partition keys. This helps with skewed data:
distribute by part_col1, Part_col2, FLOOR(RAND()*100.0)%20;
See also https://stackoverflow.com/a/55375261/2700344
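The effect of the extra random component can be illustrated with a toy simulation. This is plain Python, not Hive; the reducer count, the data, and the crc32-based routing are assumptions for illustration. One hot partition holds 90% of the rows, and the sketch compares the busiest reducer when routing by the partition keys alone against routing by the keys plus a salt computed like FLOOR(RAND()*100.0)%20.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 16  # hypothetical reducer count

def reducer_for(*parts):
    """Deterministically map a tuple of distribute-by values to a reducer."""
    key = "|".join(str(p) for p in parts).encode()
    return zlib.crc32(key) % NUM_REDUCERS

random.seed(1)
# Hypothetical rows of (part_col1, part_col2): one hot partition plus a long tail.
rows = [("hot", 0)] * 9000 + [(f"p{random.randrange(50)}", 1) for _ in range(1000)]

# distribute by part_col1, part_col2
plain = Counter(reducer_for(p1, p2) for p1, p2 in rows)

# distribute by part_col1, part_col2, FLOOR(RAND()*100.0)%20
salted = Counter(
    reducer_for(p1, p2, int(random.random() * 100.0) % 20) for p1, p2 in rows
)

print("busiest reducer without salt:", max(plain.values()))
print("busiest reducer with salt:   ", max(salted.values()))
```

The trade-off is that each partition can now be written by up to 20 reducers, so the number of files per partition grows; the salt range is a knob between reducer balance and file count.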

Using presto to query from Hive external table: Invalid UTF-8 start byte

I just installed presto and when I use the presto-cli to query hive data, I get the following error:
~$ presto --catalog hive --schema default
presto:default> select count(*) from test3;
Query 20171213_035723_00007_3ktan, FAILED, 1 node
Splits: 131 total, 14 done (10.69%)
0:18 [1.04M rows, 448MB] [59.5K rows/s, 25.5MB/s]
Query 20171213_035723_00007_3ktan failed: com.facebook.presto.hive.$internal.org.codehaus.jackson.JsonParseException: Invalid UTF-8 start byte 0xa5
at [Source: java.io.ByteArrayInputStream#6eb5bdfd; line: 1, column: 376]
The error only happens if I use aggregate function such as count, sum, etc.
But when I run the same query in the Hive CLI, it works (though it takes a lot of time, since the query is converted into a MapReduce job).
$ hive
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
hive> select count(*) from test3;
...
MapReduce Total cumulative CPU time: 17 minutes 56 seconds 600 msec
Ended Job = job_1511341039258_0024
MapReduce Jobs Launched:
Stage-Stage-1: Map: 87 Reduce: 1 Cumulative CPU: 1076.6 sec HDFS Read: 23364693216 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 17 minutes 56 seconds 600 msec
OK
51751422
Time taken: 269.143 seconds, Fetched: 1 row(s)
The point is that the same query works in Hive but not in Presto, and I could not figure out why. I suspect it is because Hive and Presto use different JSON libraries, but I'm not really sure.
I created the external table on Hive with the query:
hive> create external table test2 (app string, contactRefId string, createdAt struct <`date`: string, timezone: string, timezone_type: bigint>, eventName string, eventTime bigint, shopId bigint) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION '/data/data-new/2017/11/29/';
Can anyone help me with this?
Posting this here for easy reference. The OP documented a solution:
I successfully fixed the problem by using this SerDe: https://github.com/electrum/hive-serde (add it to Presto at /usr/lib/presto/plugin/hive-hadoop2/ and to the Hive cluster at /usr/lib/hive-hcatalog/share/hcatalog/)
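For background on the error message itself: 0xa5 has the bit pattern 10100101, which in UTF-8 is a continuation byte and can never start a character. The short Python sketch below (the helper name is hypothetical) reproduces the same class of error with a strict UTF-8 decoder and shows that the byte is perfectly valid in a single-byte encoding such as Latin-1. It does not claim to know which encoding the data files actually use.

```python
def is_valid_utf8_start_byte(byte):
    """True for ASCII bytes (0x00-0x7F) and multi-byte lead bytes (0xC2-0xF4)."""
    return byte <= 0x7F or 0xC2 <= byte <= 0xF4

raw = bytes([0xA5])

try:
    raw.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError as exc:
    strict_utf8_ok = False
    print("strict UTF-8 decode failed:", exc.reason)

# The same byte decodes without error in Latin-1 (it is the yen sign there).
as_latin1 = raw.decode("latin-1")
print("valid UTF-8 start byte:", is_valid_utf8_start_byte(0xA5))
print("decoded as Latin-1:", as_latin1)
```

A parser that validates UTF-8 strictly rejects such input, while a more lenient reader may pass it through, which is one plausible reason the same files behave differently under two JSON libraries.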