Partitioning Not Working in Hive 2.3.0

I have created a table as follows:
create table emp (
> eid int,
> fname string,
> lname string,
> salary double,
> city string,
> dept string )
> row format delimited fields terminated by ',';
Then, to enable dynamic partitioning, I set the following properties:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
I created the partitioned table as follows:
create table part_emp (
> eid int,
> fname string,
> lname string,
> salary double,
> dept string )
> partitioned by ( city string )
> row format delimited fields terminated by ',';
After creating the table, I issued the insert query:
insert into table part_emp partition(city)
select eid,fname,lname,salary,dept,city from emp;
But it does not work:
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = max_20180311015337_5a67813d-dcc5-46c0-ac4b-a54c11ffb912
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1520757649534_0004, Tracking URL = http://ubuntu:8088/proxy/application_1520757649534_0004/
Kill Command = /home/max/bigdata/hadoop-3.0.0/bin/hadoop job -kill job_1520757649534_0004
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2018-03-11 01:53:44,996 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1520757649534_0004 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
The same statements work successfully on Hive 1.x.

I had the same problem, and set hive.exec.max.dynamic.partitions.pernode=1000; (default 100) solved it. You may try that.
PS: This setting controls the maximum number of dynamic partitions allowed to be created in each mapper/reducer node.
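For reference, a minimal sketch of the whole session before re-running the insert (the limit values are illustrative; adjust them to the number of distinct city values you expect):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;          -- overall limit, raise if you have many partitions
set hive.exec.max.dynamic.partitions.pernode=1000;  -- the setting mentioned above
-- the partition column (city) must come last in the SELECT list
insert into table part_emp partition(city)
select eid,fname,lname,salary,dept,city from emp;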

Related

Hive: Query executing for hours

I'm trying to execute the below Hive query on an Azure HDInsight cluster, but it's taking an unprecedented amount of time to finish. I did apply some Hive settings, but to no avail. Below are the details:
Table
CREATE TABLE DB_MYDB.TABLE1(
MSTR_KEY STRING,
SDNT_ID STRING,
CLSS_CD STRING,
BRNCH_CD STRING,
SECT_CD STRING,
GRP_CD STRING,
GRP_NM STRING,
SUBJ_DES STRING,
GRP_DESC STRING,
DTL_DESC STRING,
ACTV_FLAG STRING,
CMP_NM STRING)
STORED AS ORC
TBLPROPERTIES ('ORC.COMPRESS'='SNAPPY');
Hive Query
INSERT OVERWRITE TABLE DB_MYDB.TABLE1
SELECT
CURR.MSTR_KEY,
CURR.SDNT_ID,
CURR.CLSS_CD,
CURR.BRNCH_CD,
CURR.SECT_CD,
CURR.GRP_CD,
CURR.GRP_NM,
CURR.SUBJ_DES,
CURR.GRP_DESC,
CURR.DTL_DESC,
'Y',
CURR.CMP_NM
FROM DB_MYDB.TABLE2 CURR
LEFT OUTER JOIN DB_MYDB.TABLE3 PREV
ON (CURR.SDNT_ID=PREV.SDNT_ID
AND CURR.CLSS_CD=PREV.CLSS_CD
AND CURR.BRNCH_CD=PREV.BRNCH_CD
AND CURR.SECT_CD=PREV.SECT_CD
AND CURR.GRP_CD=PREV.GRP_CD
AND CURR.GRP_NM=PREV.GRP_NM)
WHERE PREV.SDNT_ID IS NULL;
But the query has been running for hours. Below are the details:
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 46 46 0 0 0 0
Map 3 .......... SUCCEEDED 169 169 0 0 0 0
Reducer 2 .... RUNNING 1009 825 184 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/03 [======================>>----] 84% ELAPSED TIME: 13622.73 s
--------------------------------------------------------------------------------
I did set some Hive properties:
SET hive.execution.engine=tez;
SET hive.tez.container.size=10240;
SET tez.am.resource.memory.mb=10240;
SET tez.task.resource.memory.mb=10240;
SET hive.auto.convert.join.noconditionaltask.size=3470;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.vectorized.execution.reduce.groupby.enabled=true;
SET hive.cbo.enable=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET hive.compute.query.using.stats=true;
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.tezfiles = true;
SET hive.merge.size.per.task=268435456;
SET hive.merge.smallfiles.avgsize=16777216;
SET hive.merge.orcfile.stripe.level=true;
Records in Tables:
DB_MYDB.TABLE2= 337319653
DB_MYDB.TABLE3= 1946526625
These don't seem to have any impact on the query. Can anyone help me to:
Understand why this query is not completing and is taking an indefinite amount of time?
Figure out how I can optimize it to run faster and complete?
Using the versions:
Hadoop 2.7.3.2.6.5.3033-1
Hive 1.2.1000.2.6.5.3033-1
Azure HDInsight 3.6
Attempt 1:
As suggested by @leftjoin, I tried set hive.exec.reducers.bytes.per.reducer=32000000;. This worked until the second-to-last step of the Hive script, but the last step failed with Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress!
Last Query:
INSERT OVERWRITE TABLE DB_MYDB.TABLE3
SELECT
CURR_FULL.MSTR_KEY,
CURR_FULL.SDNT_ID,
CURR_FULL.CLSS_CD,
CURR_FULL.BRNCH_CD,
CURR_FULL.GRP_CD,
CURR_FULL.CHNL_CD,
CURR_FULL.GRP_NM,
CURR_FULL.GRP_DESC,
CURR_FULL.SUBJ_DES,
CURR_FULL.DTL_DESC,
(CASE WHEN CURR_FULL.SDNT_ID = SND_DELTA.SDNT_ID THEN 'Y' ELSE
CURR_FULL.SDNT_ID_FLAG END) AS SDNT_ID_FLAG,
CURR_FULL.CMP_NM
FROM
DB_MYDB.TABLE2 CURR_FULL
LEFT OUTER JOIN DB_MYDB.TABLE1 SND_DELTA
ON (CURR_FULL.SDNT_ID = SND_DELTA.SDNT_ID);
-----------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
-----------------------------------------------------------------
Map 1 ......... RUNNING 1066 1060 6 0 0 0
Map 4 .......... SUCCEEDED 3 3 0 0 0 0
Reducer 2 RUNNING 1009 0 22 987 0 0
Reducer 3 INITED 1 0 0 1 0 0
-----------------------------------------------------------------
VERTICES: 01/04 [================>>--] 99% ELAPSED TIME: 18187.78 s
Error:
Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=8, pendingInputs=1058, fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false
If it is the reducer vertex that is running slow, you can increase reducer parallelism by reducing the bytes-per-reducer configuration. Check your current setting and reduce the figure accordingly until you get 2x or more reducers running:
set hive.exec.reducers.bytes.per.reducer=67108864; --example only, check your current settings
--and reduce accordingly to get twice as many reducers on the Reducer 2 vertex
Change the setting, start the query, and check the number of containers on the Reducer 2 vertex; terminate and change the setting again if the number of containers has not increased.
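For example, if the current value is the 64 MB shown above, halving it should roughly double the reducer parallelism on that vertex (the figure below is illustrative only):
set hive.exec.reducers.bytes.per.reducer=33554432; -- half of 67108864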
If you want to increase parallelism on mappers also, read this answer: https://stackoverflow.com/a/48487306/2700344
If you don't have indexes on your FK columns, you should definitely add them. Here is my suggestion:
create index idx_TABLE2 on table DB_MYDB.TABLE2 (SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;
create index idx_TABLE3 on table DB_MYDB.TABLE3(SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;
Be aware that from Hive version 3.0, indexing has been removed from Hive; as an alternative you can use materialized views (supported from Hive 2.3.0 and above), which give you similar performance.
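For illustration only, a minimal materialized-view sketch of the same idea (the view name is made up; this assumes a Hive version with materialized-view support, and depending on the version the source table may need to be transactional):
create materialized view DB_MYDB.MV_TABLE3_KEYS as
select SDNT_ID, CLSS_CD, BRNCH_CD, SECT_CD, GRP_CD, GRP_NM
from DB_MYDB.TABLE3;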

Loading null values to Hive table

I have a .txt file that has the following rows:
Steve,1 1 1 1 1 5 10 20 10 10 10 10
When I created an external table, loaded the data, and ran select *, I got null values. Please help me show the number values instead of null. I very much appreciate the help!
create external table Teller(Name string, Bill array<int>)
row format delimited
fields terminated by ','
collection items terminated by '\t'
stored as textfile
location '/user/training/hive/Teller';
load data local inpath'/home/training/hive/input/*.txt' overwrite into table Teller;
output:
Steve [null]
It seems the integers are separated by spaces and not tabs, so use a space as the collection items delimiter:
bash
hdfs dfs -mkdir -p /user/training/hive/Teller
echo Steve,1 1 1 1 1 5 10 20 10 10 10 10 | hdfs dfs -put - /user/training/hive/Teller/data.txt
hive
hive> create external table Teller(Name string, Bill array<int>)
> row format delimited
> fields terminated by ','
> collection items terminated by ' '
> stored as textfile
> location '/user/training/hive/Teller';
OK
Time taken: 0.417 seconds
hive> select * from teller;
OK
Steve [1,1,1,1,1,5,10,20,10,10,10,10]
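If the table from the question already exists with the tab delimiter, one option (a sketch, assuming the data should stay at /user/training/hive/Teller) is to drop and recreate the external table with a space as the collection delimiter; dropping an external table does not delete the underlying files:
drop table Teller;
create external table Teller(Name string, Bill array<int>)
row format delimited
fields terminated by ','
collection items terminated by ' '
stored as textfile
location '/user/training/hive/Teller';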

Dynamically Generate file connection for several packages in SSIS

In a project we have several SSIS packages (around 200); all the package names are stored in a control table. We need to create a master package which can run all 200 packages.
Since the max concurrent executables setting was set to 8, I am planning to create 8 Execute Package Tasks in a container, and I was thinking of generating the connection string (Execute Package Task - File Connection String) dynamically using the package names stored in the table.
The control table is in the below format
Id PackageName
---------------
1 Package1
2 Package2
Any ideas on how this should be implemented would help.
I covered this pattern in https://stackoverflow.com/a/34868545/181965, but you're looking for a package that looks something like this:
A Sequence Container contains everything that one of those 8 discrete buckets of work would require. In your case, Variables for:
CurrentPackageName String
rsObject Object
ContainerId Int32
ContainerId will take the values 0 through 7 (since you have 8 buckets of work). As outlined in the other answer, we must scope the variables to the Sequence Container. The default in 2012+ is to create them at the Control Flow level, whereas 2005/2008 would create them at the level of the selected object.
Set up
I created a table and loaded it with 200 rows
CREATE TABLE dbo.so_35415549
(
id int IDENTITY(1,1) NOT NULL
, PackageName sysname
);
INSERT INTO
dbo.so_35415549
(
PackageName
)
SELECT TOP 200
'Package' + CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS varchar(3))
FROM
sys.all_columns AS AC;
Get My Bucket's data
The modulus (modulo, mod, whatever you call it) operator is our friend here. The mod operator returns the remainder after division, e.g. 10 mod 3 is 1 because 3*3 + 1 = 10.
In your case, you'll be modding by 8, so you know the remainder will be bounded between 0 and 7.
SQL Server implements the mod operator as % and you can test the correctness via the following query:
SELECT
S.id
, S.PackageName
, S.id % 8 AS ModValue
FROM
dbo.so_35415549 AS S
ORDER BY
1;
Sample output
id PackageName ModValue
1 Package1 1
2 Package2 2
3 Package3 3
4 Package4 4
5 Package5 5
6 Package6 6
7 Package7 7
8 Package8 0
9 Package9 1
10 Package10 2
...
199 Package199 7
200 Package200 0
SQL Get Work List
Using the above query as a template, we will use the following query. Notice the ? in there. That is the placeholder for an Execute SQL Task's parameterization with an OLE DB Connection Manager.
SELECT
S.PackageName
FROM
dbo.so_35415549 AS S
WHERE
S.id % 8 = ?
ORDER BY
1;
The parameter we pass in will be @[User::ContainerId].
The Result Set option will be changed from None to Full result set, and we push the value into rsObject.
FELC Shred Work List
This is a standard shredding of a recordset with a Foreach Loop Container. Our variable was populated in the previous step, so let's enumerate through the results. There will be one column in our result set, and you will map it to User::CurrentPackageName.
EPT Run Package
This is your Execute Package Task. Use the value of CurrentPackageName and you're set.

External Tables (HIVE) Choose only a few columns from a file

How can I create an external table using only a few columns from a file?
For example: in the file I have six columns, A,B,C,D,E,F, but in my table I want only A, C, F.
Is it possible?
I do not know of a way to selectively include columns from HDFS files for an external table. Depending on your use case, it may be sufficient to define a view based on the external table to only include the columns you want. For example, given the following silly example of an external table:
hive> CREATE EXTERNAL TABLE ext_table (
> A STRING,
> B STRING,
> C STRING,
> D STRING,
> E STRING,
> F STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION '/tmp/ext_table';
OK
Time taken: 0.401 seconds
hive> SELECT * FROM ext_table;
OK
row_1_col_A row_1_col_B row_1_col_C row_1_col_D row_1_col_E row_1_col_F
row_2_col_A row_2_col_B row_2_col_C row_2_col_D row_2_col_E row_2_col_F
row_3_col_A row_3_col_B row_3_col_C row_3_col_D row_3_col_E row_3_col_F
Time taken: 0.222 seconds, Fetched: 3 row(s)
Then create a view to only include the columns you want:
hive> CREATE VIEW filtered_ext_table AS SELECT A, C, F FROM ext_table;
OK
Time taken: 0.749 seconds
hive> DESCRIBE filtered_ext_table;
OK
a string
c string
f string
Time taken: 0.266 seconds, Fetched: 3 row(s)
hive> SELECT * FROM filtered_ext_table;
OK
row_1_col_A row_1_col_C row_1_col_F
row_2_col_A row_2_col_C row_2_col_F
row_3_col_A row_3_col_C row_3_col_F
Time taken: 0.301 seconds, Fetched: 3 row(s)
Another way to achieve what you want would require that you have the ability to modify the HDFS files backing your external table - if the columns you are interested in are all near the beginning of each line, then you can define your external table to capture only the first 3 columns (without regard for how many more columns are actually in the file). For example, with the same data file as above:
hive> DROP TABLE IF EXISTS ext_table;
OK
Time taken: 1.438 seconds
hive> CREATE EXTERNAL TABLE ext_table (
> A STRING,
> B STRING,
> C STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION '/tmp/ext_table';
OK
Time taken: 0.734 seconds
hive> SELECT * FROM ext_table;
OK
row_1_col_A row_1_col_B row_1_col_C
row_2_col_A row_2_col_B row_2_col_C
row_3_col_A row_3_col_B row_3_col_C
Time taken: 0.727 seconds, Fetched: 3 row(s)
I found an answer here:
create table tmpdc_ticket(
SERVICE_ID CHAR(144),
SERVICE_TYPE CHAR(50),
CUSTOMER_NAME CHAR(200),
TELEPHONE_NO CHAR(144),
ACCOUNT_NUMBER CHAR(144),
FAULT_STATUS CHAR(50),
BUSINESS_GROUP CHAR(100)
)
organization external(
type oracle_loader
default directory sample_directory
access parameters(
records delimited by newline
nologfile
skip 1
fields terminated by '|'
missing field values are null
(DUMMY_1,
DUMMY_2,
SERVICE_ID CHAR(144),
SERVICE_TYPE CHAR(50),
CUSTOMER_NAME CHAR(200),
TELEPHONE_NO CHAR(144),
ACCOUNT_NUMBER CHAR(144),
FAULT_STATUS CHAR(50),
BUSINESS_GROUP CHAR(100)
)
)
location(sample_directory:'sample_file.txt')
)
reject limit 1
noparallel
nomonitoring;

Why does a missing primary key/unique key cause deadlock issues on upsert?

I came across a schema and an upsert stored procedure that were causing deadlock issues. I have a general idea about why this causes a deadlock and how to fix it. I can reproduce it, but I don't have a clear understanding of the sequence of steps that causes it. It would be great if someone could explain clearly why this causes a deadlock.
Here is the schema and the stored procedures. This code is being executed on PostgreSQL 9.2.2.
CREATE TABLE counters (
count_type INTEGER NOT NULL,
count_id INTEGER NOT NULL,
count INTEGER NOT NULL
);
CREATE TABLE primary_relation (
id INTEGER PRIMARY KEY,
a_counter INTEGER NOT NULL DEFAULT 0
);
INSERT INTO primary_relation
SELECT i FROM generate_series(1,5) AS i;
CREATE OR REPLACE FUNCTION increment_count(ctype integer, cid integer, i integer) RETURNS VOID
AS $$
BEGIN
LOOP
UPDATE counters
SET count = count + i
WHERE count_type = ctype AND count_id = cid;
IF FOUND THEN
RETURN;
END IF;
BEGIN
INSERT INTO counters (count_type, count_id, count)
VALUES (ctype, cid, i);
RETURN;
EXCEPTION WHEN OTHERS THEN
END;
END LOOP;
END;
$$
LANGUAGE PLPGSQL;
CREATE OR REPLACE FUNCTION update_primary_a_count(ctype integer) RETURNS VOID
AS $$
WITH deleted_counts_cte AS (
DELETE
FROM counters
WHERE count_type = ctype
RETURNING *
), rollup_cte AS (
SELECT count_id, SUM(count) AS count
FROM deleted_counts_cte
GROUP BY count_id
HAVING SUM(count) <> 0
)
UPDATE primary_relation
SET a_counter = a_counter + rollup_cte.count
FROM rollup_cte
WHERE primary_relation.id = rollup_cte.count_id
$$ LANGUAGE SQL;
And here is a python script to reproduce the deadlock.
import os
import random
import time
import psycopg2
COUNTERS = 5
THREADS = 10
ITERATIONS = 500
def increment():
outf = open('synctest.out.%d' % os.getpid(), 'w')
conn = psycopg2.connect(database="test")
cur = conn.cursor()
for i in range(0,ITERATIONS):
time.sleep(random.random())
start = time.time()
cur.execute("SELECT increment_count(0, %s, 1)", [random.randint(1,COUNTERS)])
conn.commit()
outf.write("%f\n" % (time.time() - start))
conn.close()
outf.close()
def update(n):
outf = open('synctest.update', 'w')
conn = psycopg2.connect(database="test")
cur = conn.cursor()
for i in range(0,n):
time.sleep(random.random())
start = time.time()
cur.execute("SELECT update_primary_a_count(0)")
conn.commit()
outf.write("%f\n" % (time.time() - start))
conn.close()
pids = []
for i in range(THREADS):
pid = os.fork()
if pid != 0:
print 'Process %d spawned' % pid
pids.append(pid)
else:
print 'Starting child %d' % os.getpid()
increment()
print 'Exiting child %d' % os.getpid()
os._exit(0)
update(ITERATIONS)
for pid in pids:
print "waiting on %d" % pid
os.waitpid(pid, 0)
# cleanup
update(1)
I recognize that one issue with this is that the upsert can produce duplicate rows (with multiple writers), which will likely result in some double counting. But why does this result in deadlock?
The error I get from PostgreSQL is something like the following:
process 91924 detected deadlock while waiting for ShareLock on transaction 4683083 after 100.559 ms",,,,,"SQL statement ""UPDATE counters
And the client spews something like this:
psycopg2.extensions.TransactionRollbackError: deadlock detected
DETAIL: Process 91924 waits for ShareLock on transaction 4683083; blocked by process 91933.
Process 91933 waits for ShareLock on transaction 4683079; blocked by process 91924.
HINT: See server log for query details.CONTEXT: SQL statement "UPDATE counters
SET count = count + i
WHERE count_type = ctype AND count_id = cid"
PL/pgSQL function increment_count(integer,integer,integer) line 4 at SQL statement
To fix the issue, you need to add a primary key like so:
ALTER TABLE counters ADD PRIMARY KEY (count_type, count_id);
Any insight would be greatly appreciated. Thanks!
Because of the primary key, the number of rows in this table stays small (at most one row per (count_type, count_id) pair), and the primary key ensures that no row is repeated.
When you remove the primary key, some of the threads lag behind and the number of rows increases; at the same time, rows get repeated. When rows are repeated, updates take longer and two or more threads will try to update the same row(s).
Open a new terminal and type:
watch --interval 1 "psql -tc \"select count(*) from counters\" test"
Try this with and without the primary key. When you get the first deadlock, look at the results of the query above. In my case this is what I am left with in the table counters:
test=# select * from counters order by 2;
count_type | count_id | count
------------+----------+-------
0 | 1 | 735
0 | 1 | 733
0 | 1 | 735
0 | 1 | 735
0 | 2 | 916
0 | 2 | 914
0 | 2 | 914
0 | 3 | 882
0 | 4 | 999
0 | 5 | 691
0 | 5 | 692
(11 rows)
Your code is the perfect recipe for a race condition (multiple threads, random sleeps).
The problem is most probably due to locking issues. Since you don't mention the locking mode, I'm going to assume a page-based lock, so you get the following scenario:
Thread 1 starts; it begins to insert records, let's say it locks page 1 and should lock page 2 next.
Thread 2 starts at the same time as thread 1, but it locks page 2 first and should lock page 1 next.
Both threads are now waiting on each other to complete, so you have a deadlock.
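As a generic illustration of that interleaving (session labels and statements are illustrative, not the exact sequence PostgreSQL hit here):
-- session A:
BEGIN;
UPDATE counters SET count = count + 1 WHERE count_type = 0 AND count_id = 1;  -- locks counter (0,1)
-- session B, concurrently:
BEGIN;
UPDATE counters SET count = count + 1 WHERE count_type = 0 AND count_id = 2;  -- locks counter (0,2)
-- session A:
UPDATE counters SET count = count + 1 WHERE count_type = 0 AND count_id = 2;  -- blocks, waiting for B
-- session B:
UPDATE counters SET count = count + 1 WHERE count_type = 0 AND count_id = 1;  -- blocks, waiting for A: deadlock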
Now, why does a PK fix it?
Because locking is done via the index first, the race condition is mitigated: the PK is unique on the inserts, so all threads wait on the index, and on updates access is done via the index, so the record is locked based on its PK.
At some point one user is waiting on a lock another user has, while the first user owns a lock that the second user wants. This is what causes a deadlock.
At a guess, it's because without a primary key (or in fact any key), the UPDATE counters in your increment function has to read the whole table. The same goes for the primary_relation table. This is going to leave locks strewn about and open the way for a deadlock. I'm not a Postgres user, so I don't know the details of exactly when it will place locks, but I'm pretty sure this is what is happening.
Putting a PK on counters makes it possible for the DB to target the rows it reads accurately and take the minimum number of locks. You should really have a PK on primary_relation too!