Hive explain plan: where to see a full table scan?

How can I tell from Hive EXPLAIN output whether a query does a full table scan?
For example, is there a full scan here?
The table has 993 rows.
The query is:
explain select latitude,longitude FROM CRIMES WHERE geohash='dp3twhjuyutr'
I have a secondary index on the geohash column.
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: crimes
            filterExpr: (geohash = 'dp3twhjuyutr') (type: boolean)
            Statistics: Num rows: 993 Data size: 265582 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (geohash = 'dp3twhjuyutr') (type: boolean)
              Statistics: Num rows: 496 Data size: 132657 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: latitude (type: double), longitude (type: double)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 496 Data size: 132657 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 496 Data size: 132657 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The absence of a partition predicate in the plan means a full scan (this is a separate matter from predicate push-down in ORC).
Check Data size and Num rows in each operator.
The EXPLAIN DEPENDENCY command shows the full input_partitions collection, so you can check exactly what will be scanned.
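As a sketch of what EXPLAIN DEPENDENCY reports on a partitioned table (the table name, partition column, and values below are hypothetical):

```sql
EXPLAIN DEPENDENCY
SELECT latitude, longitude
FROM crimes_partitioned
WHERE ds = '2018-01-01';
```

The output is a JSON document; when partition pruning works, input_partitions contains only the matching partitions (something like default@crimes_partitioned@ds=2018-01-01), whereas a full scan lists every partition of the table.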

Related

strange query perf result - different expression number of 'in clause' in greenplum 5.0

I noticed a strange result when using an IN clause in Greenplum 5.0.
When the IN list has <= 25 expressions, the query slows down linearly with the list size (as expected), but with more than 25 expressions the query is noticeably faster (than with 25). Why does this happen?
I ran EXPLAIN on the query with both the new and the legacy optimizer; the output is the same. Here are the SQL and the explain results.
query 1 - 26 expressions in the IN list
sql:
select * from table1
where column1 in ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26')
query time: 0.8s ~ 0.9s
explain:
Gather Motion 8:1 (slice1; segments: 8) (cost=0.00..481.59 rows=2021 width=1069)
-> Table Scan on table1 (cost=0.00..475.60 rows=253 width=1069)
Filter: column1 = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}'::text[])
Settings: optimizer=on
Optimizer status: PQO version 2.42.0
explain analyze:
Gather Motion 8:1 (slice1; segments: 8) (cost=0.00..481.53 rows=2003 width=1064)
Rows out: 0 rows at destination with 52 ms to end, start offset by 0.477 ms.
-> Table Scan on table1 (cost=0.00..475.63 rows=251 width=1064)
Filter: column1 = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26}'::text[])
Rows out: 0 rows (seg0) with 51 ms to end, start offset by -358627 ms.
Slice statistics:
(slice0) Executor memory: 437K bytes.
(slice1) Executor memory: 259K bytes avg x 8 workers, 281K bytes max (seg7).
Statement statistics:
Memory used: 262144K bytes
Settings: optimizer=on
Optimizer status: PQO version 2.42.0
Total runtime: 53.107 ms
query 2 - 25 expressions in the IN list
sql:
select * from table1
where column1 in ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25')
query time: 1.2s ~ 1.5s
explain:
Gather Motion 8:1 (slice1; segments: 8) (cost=0.00..481.59 rows=2021 width=1069)
-> Table Scan on table1 (cost=0.00..475.60 rows=253 width=1069)
Filter: column1 = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}'::text[])
Settings: optimizer=on
Optimizer status: PQO version 2.42.0
explain analyze:
Gather Motion 8:1 (slice1; segments: 8) (cost=0.00..481.53 rows=2003 width=1064)
Rows out: 0 rows at destination with 60 ms to end, start offset by 0.517 ms.
-> Table Scan on table1 (cost=0.00..475.63 rows=251 width=1064)
Filter: column1 = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}'::text[])
Rows out: 0 rows (seg0) with 59 ms to end, start offset by -155783 ms.
Slice statistics:
(slice0) Executor memory: 437K bytes.
(slice1) Executor memory: 191K bytes avg x 8 workers, 191K bytes max (seg0).
Statement statistics:
Memory used: 262144K bytes
Settings: optimizer=on
Optimizer status: PQO version 2.42.0
Total runtime: 60.584 ms
Greenplum runs in 3 VMs: 1 master and 2 segment hosts, each segment host with 4 data directories.
table1 has 500,000 rows and 50 columns. The primary key and distribution key is a different column (a UUID); column1 is neither the distribution key nor part of the primary key, just a natural key.
You can run EXPLAIN ANALYZE to see exactly where the plan spends its time, and share the output here.

Using presto to query from Hive external table: Invalid UTF-8 start byte

I just installed presto and when I use the presto-cli to query hive data, I get the following error:
~$ presto --catalog hive --schema default
presto:default> select count(*) from test3;
Query 20171213_035723_00007_3ktan, FAILED, 1 node
Splits: 131 total, 14 done (10.69%)
0:18 [1.04M rows, 448MB] [59.5K rows/s, 25.5MB/s]
Query 20171213_035723_00007_3ktan failed: com.facebook.presto.hive.$internal.org.codehaus.jackson.JsonParseException: Invalid UTF-8 start byte 0xa5
at [Source: java.io.ByteArrayInputStream#6eb5bdfd; line: 1, column: 376]
The error only happens when I use an aggregate function such as count, sum, etc.
But when I run the same query in the Hive CLI, it works (though it takes a long time, since Hive converts the query into a MapReduce job).
$ hive
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
hive> select count(*) from test3;
...
MapReduce Total cumulative CPU time: 17 minutes 56 seconds 600 msec
Ended Job = job_1511341039258_0024
MapReduce Jobs Launched:
Stage-Stage-1: Map: 87 Reduce: 1 Cumulative CPU: 1076.6 sec HDFS Read: 23364693216 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 17 minutes 56 seconds 600 msec
OK
51751422
Time taken: 269.143 seconds, Fetched: 1 row(s)
The point is that the same query works in Hive but not in Presto, and I could not figure out why. I suspect the two JSON libraries used by Hive and Presto are different, but I'm not really sure.
I created the external table on Hive with the query:
hive> create external table test2 (app string, contactRefId string, createdAt struct <`date`: string, timezone: string, timezone_type: bigint>, eventName string, eventTime bigint, shopId bigint) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION '/data/data-new/2017/11/29/';
Can anyone help me with this?
Posting this here for easy reference, from where the OP documented a solution:
I successfully fixed the problem by using this serde: https://github.com/electrum/hive-serde (add it to Presto at /usr/lib/presto/plugin/hive-hadoop2/ and to the Hive cluster at /usr/lib/hive-hcatalog/share/hcatalog/).

Does filtering on top of Hive view push the filter inside the view?

Let there be a view MyView on table MyTable:
CREATE VIEW MyView AS SELECT col1,col2,...,colN from MyTable;
Now let's say we fire the following query:
SELECT * FROM MyView WHERE col="abc";
So does Hive push the filter (col="abc") down into the view when executing the select? Basically, I'm trying to understand whether Hive does 'push-down optimization' here, if I can use that term. Otherwise it would be very inefficient, since the view covers the entire table, and the filter would only be applied after the whole table had been read.
Yes. For example:
create view tmp.v_tmp0823 as select city_id, city_name from dw.dim_city ;
explain select city_id, city_name from tmp.v_tmp0823 where city_id = 123 ;
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: dim_city
          Statistics: Num rows: 530 Data size: 57323 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (city_id = 123) (type: boolean)
            Statistics: Num rows: 265 Data size: 28661 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: 123 (type: bigint), city_name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 265 Data size: 28661 Basic stats: COMPLETE Column stats: NONE
              ListSink

Hive: where + in does not use partition?

I am querying a large table that is partitioned on a field called day.
If I run a query:
select *
from my_table
where day in ('2016-04-01', '2016-03-01')
I get many mappers and reducers and the query takes a long time to run.
If, however, I write a query:
select *
from my_table
where day = '2016-04-01'
or day = '2016-03-01'
I get far fewer mappers and reducers and the query runs quickly. To me this suggests that IN does not take advantage of partitions in a table. Can anyone confirm this and explain why?
Hive Version: 1.2.1
Hadoop Version: 2.3.4.7-4
Details:
I believe the relevant parts of the execution plans are...
Using WHERE with OR:
No filter operator at all.
Using WHERE with IN:
Filter Operator
predicate: (day) IN ('2016-04-01', '2016-03-01') (type: boolean)
Statistics: Num rows: 100000000 Data size: 9999999999
The hive docs just say:
'What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns.'
but they don't elaborate. I couldn't find any SO posts directly related to this.
Thanks!
tl;dr
I am using Hive 1.1.0 with Cloudera 5.13.3 and IN follows the same optimization as the equal operator (=) according to the explain plans I ran in Hue.
Examples
My table is partitioned on LOAD_YEAR (SMALLINT) and LOAD_MONTH (TINYINT) and has these two partitions:
load_year=2018/load_month=10 (19,828,71 rows)
load_year=2018/load_month=11 (702,856 rows)
Below are various queries and their explain plans.
1. Equal (=) operator
Query:
SELECT ID
FROM TBL
WHERE LOAD_MONTH = 11Y
Explain Plan:
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (load_month = 11) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink
2. IN operator
Query (note that there is no month 12 in the data):
SELECT ID
FROM TBL
WHERE LOAD_MONTH IN (11Y, 12Y)
Explain Plan:
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (load_month = 11) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink
3. Equal (=) in conjunction with AND and OR
Query:
SELECT ID
FROM TBL
WHERE
(LOAD_YEAR = 2018S AND LOAD_MONTH = 11Y)
OR (LOAD_YEAR = 2019S AND LOAD_MONTH = 1Y)
Explain Plan:
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (((load_year = 2018) and (load_month = 11)) or ((load_year = 2019) and (load_month = 1))) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink
4. Arithmetic operation
Query:
SELECT ID
FROM TBL
WHERE (LOAD_YEAR * 100 + LOAD_MONTH) IN (201811, 201901)
Side note:
100 doesn't have a suffix, so it's an INT, and therefore (LOAD_YEAR * 100 + LOAD_MONTH) is also an INT, which keeps the result accurate. Since LOAD_YEAR is a SMALLINT and LOAD_MONTH a TINYINT, arithmetic on the two alone would use SMALLINT for the result, whose maximum value is 32,767 (not enough for yyyymm, which needs 6 digits, i.e., values up to 999,999). With 100 as an INT, the calculation is done in INT and allows values up to 2,147,483,647.
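As a quick illustration of the promotion described above (a sketch using HiveQL literal suffixes, where S marks SMALLINT and Y marks TINYINT; exact overflow behavior may vary by Hive version):

```sql
-- All-small-type arithmetic: per the note above, the result type is
-- SMALLINT, and 2018 * 100 = 201800 exceeds the SMALLINT max of 32,767.
SELECT 2018S * 100S + 11Y;
-- Mixing in an unsuffixed INT literal promotes the whole expression to
-- INT, so 201811 fits (INT max is 2,147,483,647).
SELECT 2018S * 100 + 11Y;
```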
Explain Plan:
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (201811) IN (201811, 201901) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink
Summary
All these queries only scan the second partition, thereby avoiding the ~20 million rows in the other partition.

Hive Query Execution Error, return code 3 from MapredLocalTask

I am getting this error while performing a simple join between two tables. I run the query on the Hive command line. I'll call the tables a and b. Table a is a Hive internal table and b is an external table (in Cassandra). Table a has only 1,610 rows and table b has ~8 million rows. In the actual production scenario, table a could grow up to 100K rows. Shown below is my join, with table b as the last table in the join:
SELECT a.col1, a.col2, b.col3, b.col4 FROM a JOIN b ON (a.col1=b.col1 AND a.col2=b.col2);
Shown below is the error
Total MapReduce jobs = 1
Execution log at: /tmp/pricadmn/.log
2014-04-09 07:15:36 Starting to launch local task to process map join; maximum memory = 932184064
2014-04-09 07:16:41 Processing rows: 200000 Hashtable size: 199999 Memory usage: 197529208 percentage: 0.212
2014-04-09 07:17:12 Processing rows: 300000 Hashtable size: 299999 Memory usage: 163894528 percentage: 0.176
2014-04-09 07:17:43 Processing rows: 400000 Hashtable size: 399999 Memory usage: 347109936 percentage: 0.372
...
...
...
2014-04-09 07:24:29 Processing rows: 1600000 Hashtable size: 1599999 Memory usage: 714454400 percentage: 0.766
2014-04-09 07:25:03 Processing rows: 1700000 Hashtable size: 1699999 Memory usage: 901427928 percentage: 0.967
Execution failed with exit status: 3
Obtaining error information
Task failed!
Task ID:
Stage-5
Logs:
/u/applic/pricadmn/dse-4.0.1/logs/hive/hive.log
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
I am using DSE 4.0.1. Here are a few of my settings that might be relevant:
mapred.map.child.java.opts=-Xmx512M
mapred.reduce.child.java.opts=-Xmx512M
mapred.reduce.parallel.copies=20
hive.auto.convert.join=true
I increased mapred.map.child.java.opts to 1G, got past a few more records, and then errored out again; that doesn't look like a good solution. I also changed the order of the tables in the join, but that didn't help. I saw Hive Map join: out of memory Exception, but it didn't solve my issue.
To me it looks like Hive is trying to put the bigger table in memory during the local task phase, which confuses me. As I understand it, the second table (in my case table b) should be streamed in. Correct me if I am wrong. Any help in solving this issue is highly appreciated.
set hive.auto.convert.join = false;
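If disabling map-join conversion entirely is too blunt, a related knob (a sketch; property defaults vary by Hive version) is the small-table size threshold that decides when the local hash-table task is used at all:

```sql
-- The fix above: never convert to a map join automatically.
set hive.auto.convert.join = false;
-- Alternatively, keep auto conversion but adjust the threshold (in
-- bytes) for what counts as a "small" table, so the ~8M-row table is
-- never chosen for the in-memory hash build:
set hive.mapjoin.smalltable.filesize = 25000000;
```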
It appears your task is running out of memory. Check line 324 of the MapredLocalTask class.
} catch (Throwable e) {
  if (e instanceof OutOfMemoryError
      || (e instanceof HiveException && e.getMessage().equals("RunOutOfMeomoryUsage"))) {
    // Don't create a new object if we are already out of memory
    return 3;
  } else {
    // (other throwables are handled further down in the original source)
  }
}
The last table in the join should be the largest one. You can change the order of the tables in the join.
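If reordering the query text is awkward, Hive also supports the STREAMTABLE hint to mark which table should be streamed rather than buffered (a sketch reusing the question's aliases):

```sql
-- Stream the ~8M-row table b; the smaller table a is held in memory.
SELECT /*+ STREAMTABLE(b) */ a.col1, a.col2, b.col3, b.col4
FROM a
JOIN b ON (a.col1 = b.col1 AND a.col2 = b.col2);
```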