Why is a reducer required in a Hive insert?

This question is about how the MapReduce job works when we fire an INSERT INTO statement from the Hive command line.
While inserting records into an internal Hive table: since there is no aggregation involved in the insert, why is a reducer also invoked? It should be a map-only job.
What is the role of the reducer here?
insert into table values (1,1);
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2021-04-28 10:30:26,487 Stage-1 map = 0%, reduce = 0%
INFO : 2021-04-28 10:30:30,604 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.96 sec
INFO : 2021-04-28 10:30:36,774 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.35 sec
INFO : MapReduce Total cumulative CPU time: 3 seconds 350 msec
hive> set hive.merge.mapfiles;
hive.merge.mapfiles=true
hive> set hive.merge.mapredfiles;
hive.merge.mapredfiles=false
hive> set mapreduce.job.reduces;
mapreduce.job.reduces=-1
explain insert into test values (10,14);
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
Stage-4
Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
Stage-2 depends on stages: Stage-0
Stage-3
Stage-5
Stage-6 depends on stages: Stage-5
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: _dummy_table
Row Limit Per Split: 1
Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: array(const struct(10,14)) (type: array<struct<col1:int,col2:int>>)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
UDTF Operator
Statistics: Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
function name: inline
Select Operator
expressions: col1 (type: int), col2 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Select Operator
expressions: _col0 (type: int), _col1 (type: int)
outputColumnNames: i, j
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: compute_stats(i, 'hll'), compute_stats(j, 'hll')
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>), _col1 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>)
Reduce Operator Tree:
Group By Operator
aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-7
Conditional Operator
Stage: Stage-4
Move Operator
files:
hdfs directory: true
destination:<path>
Stage: Stage-0
Move Operator
tables:
replace: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-2
Stats Work
Basic Stats Work:
Column Stats Desc:
Columns: i, j
Column Types: int, int
Table: db.test.test
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: db.test
Stage: Stage-5
Map Reduce
Map Operator Tree:
TableScan
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: db.test
Stage: Stage-6
Move Operator
files:
hdfs directory: true
destination: <PATH>
Time taken: 5.123 seconds, Fetched: 121 row(s)

It seems you have statistics auto-gathering enabled:
SET hive.stats.autogather=true;
and the reducer is calculating the statistics:
Reduce Operator Tree:
Group By Operator
aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
mode: mergepartial


Updating a column from values

I want to update a column col_123 in table TT from a VALUES list where certain criteria are met.
The following is a piece of my code with two value rows; in my actual code there are thousands of value rows.
UPDATE TT
SET col_123 = T2.score
FROM
(values ('1007163',2016,3,80.09), ('1034758',2013,4,68.85)) T2(person_id_t2, id_yr_t2, id_qtr_t2, score)
WHERE person_id = T2.person_id_t2 AND id_yr = T2.id_yr_t2 AND id_qtr = T2.id_qtr_t2;
But even with these two rows, it takes forever to update the table. What am I doing wrong?
Here is the output with EXPLAIN ANALYZE:
Update (slice0; segments: 56) (rows=1 width=3903)
-> Hash Join (cost=0.06..750889.50 rows=1 width=3903)
Hash Cond: TT.person_id::text = "*VALUES*".column1 AND TT.id_yr = "*VALUES*".column2::numeric AND TT.id_qtr = "*VALUES*".column3
Rows out: Avg 1.0 rows x 2 workers. Max 1 rows (seg29) with 236406 ms to first row, 236407 ms to end, start offset by 370 ms.
Executor memory: 1K bytes avg, 1K bytes max (seg0).
Work_mem used: 1K bytes avg, 1K bytes max (seg0). Workfile: (0 spilling, 0 reused)
(seg29) Hash chain length 1.0 avg, 1 max, using 2 of 262151 buckets.
-> Seq Scan on seamless_health_index (cost=0.00..466843.92 rows=676299 width=3871)
Rows out: Avg 676405.3 rows x 56 workers. Max 678281 rows (seg27) with 0.524 ms to first row, 243299 ms to end, start offset by 369 ms.
-> Hash (cost=0.03..0.03 rows=1 width=72)
Rows in: Avg 2.0 rows x 56 workers. Max 2 rows (seg0) with 0.080 ms to end, start offset by 375 ms.
-> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=1 width=72)
Rows out: Avg 2.0 rows x 56 workers. Max 2 rows (seg0) with 0.017 ms to first row, 0.020 ms to end, start offset by 375 ms.
Slice statistics:
(slice0) Executor memory: 5769K bytes avg x 56 workers, 5769K bytes max (seg0). Work_mem: 1K bytes max.
Statement statistics:
Memory used: 128000K bytes
Settings: from_collapse_limit=16; join_collapse_limit=16
Total runtime: 308388.391 ms
Thanks!
Note: The table TT has about 40,000,000 rows and 1000 columns, but only two rows, and only col_123, should be updated.
Create an index on TT (person_id::text, id_yr, id_qtr).
Then a nested loop join can be used, which should find the one matching row faster.
You don't have to include all three columns in the index, only those where the join condition is selective.

Column names when exporting ORC files from hive server 2 using beeline

I am facing a problem where exporting results from Hive Server 2 to ORC files shows default column names (e.g. _col0, _col1, _col2) instead of the original ones created in Hive. We are using pretty much the default components from HDP-2.6.3.0.
I am also wondering if the below issue is related:
https://issues.apache.org/jira/browse/HIVE-4243
Below are the steps we are taking:
Connecting:
export SPARK_HOME=/usr/hdp/current/spark2-client
beeline
!connect jdbc:hive2://HOST1:2181,HOST2:2181,HOST2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Creating test table and inserting sample values:
create table test(str string);
insert into test values ('1');
insert into test values ('2');
insert into test values ('3');
Running test query:
select * from test;
+-----------+--+
| test.str |
+-----------+--+
| 1 |
| 2 |
| 3 |
+-----------+--+
Exporting as ORC:
insert overwrite directory 'hdfs://HOST1:8020/tmp/test' stored as orc select * from test;
Getting the results:
hdfs dfs -get /tmp/test/000000_0 test.orc
Checking the results:
java -jar orc-tools-1.4.1-uber.jar data test.orc
Processing data file test.orc [length: 228]
{"_col0":"1"}
{"_col0":"2"}
{"_col0":"3"}
java -jar orc-tools-1.4.1-uber.jar meta test.orc
Processing data file test.orc [length: 228]
Structure for test.orc
File Version: 0.12 with HIVE_13083
Rows: 2
Compression: SNAPPY
Compression size: 262144
Type: struct<_col0:string>
Stripe Statistics:
Stripe 1:
Column 0: count: 2 hasNull: false
Column 1: count: 2 hasNull: false min: 1 max: 3 sum: 2
File Statistics:
Column 0: count: 2 hasNull: false
Column 1: count: 2 hasNull: false min: 1 max: 3 sum: 2
Stripes:
Stripe: offset: 3 data: 11 rows: 2 tail: 60 index: 39
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 28
Stream: column 1 section DATA start: 42 length 5
Stream: column 1 section LENGTH start: 47 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 228 bytes
Padding length: 0 bytes
Padding ratio: 0%
Looking at the results I can see _col0 as the column name while expecting the original str.
Any ideas on what I am missing?
Update
I noticed that the connection from beeline was going to hive 1.x, and not 2.x as wanted. I changed the connection to the Hive Server 2 Interactive URL:
Connected to: Apache Hive (version 2.1.0.2.6.3.0-235)
Driver: Hive JDBC (version 1.21.2.2.6.3.0-235)
Transaction isolation: TRANSACTION_REPEATABLE_READ
And tried again with the same sample. It even prints out the schema correctly:
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test.str, type:string, comment:null)], properties:null)
But still no luck in getting it to the ORC file.
Solution
You need to enable Hive LLAP (Interactive SQL) in Ambari, then change the connection string you are using. For example, my connection became jdbc:hive2://.../;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2
Note the additional "-hive2" at the end of the URL. There is a tutorial video from Hortonworks covering this setup.
"Proof"
After connecting to the updated Hive endpoint, I ran
create table t_orc(customer string, age int) stored as orc;
insert into t_orc values('bob', 12),('kate', 15);
Then
~$ hdfs dfs -copyToLocal /apps/hive/warehouse/t_orc/000000_0 ~/tmp/orc/hive2.orc
~$ orc-metadata tmp/orc/hive2.orc
{ "name": "tmp/orc/hive2.orc",
"type": "struct<customer:string,age:int>",
"rows": 2,
"stripe count": 1,
"format": "0.12", "writer version": "HIVE-13083",
"compression": "zlib", "compression block": 262144,
"file length": 305,
"content": 139, "stripe stats": 46, "footer": 96, "postscript": 23,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 2,
"offset": 3, "length": 136,
"index": 67, "data": 23, "footer": 46
}
]
}
Where orc-metadata is a tool distributed by the ORC repo on github.
You have to set this in the Hive script or the Hive shell; otherwise, put it in a .hiverc file in your home directory or in any of the other Hive user properties files:
set hive.cli.print.header=true;

Make Hive return the value only

I want Hive to return only the value, not the extra information about the processing.
hive> select max(temp) from temp where dtime like '2014-07%' ;
Query ID = hduser_20170608003255_d35b8a43-8cc5-4662-89ce-9ee5f87d3ba0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1496864651740_0008, Tracking URL = http://localhost:8088/proxy/application_1496864651740_0008/
Kill Command = /home/hduser/hadoop/bin/hadoop job -kill job_1496864651740_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-06-08 00:33:01,955 Stage-1 map = 0%, reduce = 0%
2017-06-08 00:33:08,187 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.13 sec
2017-06-08 00:33:14,414 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.91 sec
MapReduce Total cumulative CPU time: 5 seconds 910 msec
Ended Job = job_1496864651740_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.91 sec HDFS Read: 853158 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 910 msec
OK
44.4
Time taken: 20.01 seconds, Fetched: 1 row(s)
I want it to return only the value, which is 44.4.
Thanks in advance...
You can put the result into a variable in a shell script. The max_temp variable will contain only the result:
max_temp=$(hive -e " set hive.cli.print.header=false; select max(temp) from temp where dtime like '2014-07%';")
echo "$max_temp"
You can also use -S (silent mode):
hive -S -e "select max(temp) from temp where dtime like '2014-07%';"

Is there a way to get the database name in a Hive UDF?

I am writing a Hive UDF.
I have to get the name of the database the function is deployed in. Then I need to access a few files from HDFS depending on the database environment. Can you please tell me which function can help with running an HQL query from a Hive UDF?
Write the UDF class and prepare the jar file:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyHiveUdf extends UDF {
    // Prefixes the input text with the database name passed in as the second argument
    public Text evaluate(String text, String dbName) {
        if (text == null) {
            return null;
        } else {
            return new Text(dbName + "." + text);
        }
    }
}
Use this UDF inside a Hive query as shown below:
hive> use mydb;
OK
Time taken: 0.454 seconds
hive> ADD jar /root/MyUdf.jar;
Added [/root/MyUdf.jar] to class path
Added resources: [/root/MyUdf.jar]
hive> create temporary function myUdfFunction as 'com.hiveudf.strmnp.MyHiveUdf';
OK
Time taken: 0.018 seconds
hive> select myUdfFunction(username,current_database()) from users;
Query ID = root_20170407151010_2ae29523-cd9f-4585-b334-e0b61db2c57b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1491484583384_0004, Tracking URL = http://mac127:8088/proxy/application_1491484583384_0004/
Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1491484583384_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-07 15:11:11,376 Stage-1 map = 0%, reduce = 0%
2017-04-07 15:11:19,766 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.12 sec
MapReduce Total cumulative CPU time: 3 seconds 120 msec
Ended Job = job_1491484583384_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.12 sec HDFS Read: 21659 HDFS Write: 381120 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 120 msec
OK
mydb.user1
mydb.user2
mydb.user3
Time taken: 2.137 seconds, Fetched: 3 row(s)
hive>

Write the output of a script to a file in hive

I have a script with a set of 5 queries. I would like to execute the script and write the output to a file. What command should I give from the Hive CLI?
Thanks
Sample queries file (3 queries):
ramisetty#aspire:~/my_tmp$ cat queries.q
show databases; --query1
use my_db; --query2
INSERT OVERWRITE LOCAL DIRECTORY './outputLocalDir' --query3
select * from students where branch = "ECE"; --query3
Run HIVE:
ramisetty#aspire:~/my_tmp$ hive
hive (default)> source ./queries.q;
--output of Q1 on console-----
Time taken: 7.689 seconds
--output of Q2 on console -----
Time taken: 1.689 seconds
____________________________________________________________
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201401251835_0004, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201401251835_0004
Kill Command = /home/ramisetty/VJDATA/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201401251835_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-01-25 19:06:56,689 Stage-1 map = 0%, reduce = 0%
2014-01-25 19:07:05,868 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.07 sec
2014-01-25 19:07:14,047 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.07 sec
2014-01-25 19:07:15,059 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.07 sec
MapReduce Total cumulative CPU time: 2 seconds 70 msec
Ended Job = job_201401251835_0004
Copying data to local directory outputLocalDir
Copying data to local directory outputLocalDir
2 Rows loaded to outputLocalDir
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.07 sec HDFS Read: 525 HDFS Write: 66 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 70 msec
OK
firstname secondname dob score branch
Time taken: 32.44 seconds
Output file:
cat ./outputLocalDir/000000_0