Hive: WHERE + IN does not use partitions?

I am querying a large table that is partitioned on a field called day.
If I run a query:
select *
from my_table
where day in ('2016-04-01', '2016-03-01')
I get many mappers and reducers and the query takes a long time to run.
If, however, I write a query:
select *
from my_table
where day = '2016-04-01'
or day = '2016-03-01'
I get far fewer mappers and reducers, and the query runs quickly. To me this suggests that IN does not take advantage of the table's partitions. Can anyone confirm this and explain why?
Hive Version: 1.2.1
Hadoop Version: 2.3.4.7-4
Details:
I believe the relevant parts of the execution plans are:
Using WHERE ... OR:
No Filter Operator at all.
Using WHERE ... IN:
Filter Operator
predicate: (day) IN ('2016-04-01', '2016-03-01') (type: boolean)
Statistics: Num rows: 100000000 Data size: 9999999999
The hive docs just say:
'What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns.'
But they don't elaborate. I couldn't find any SO posts directly addressing this.
Thanks!

tl;dr
I am using Hive 1.1.0 on Cloudera 5.13.3, and according to the explain plans I ran in Hue, IN gets the same partition-pruning optimization as the equality operator (=).
Examples
My table is partitioned on LOAD_YEAR (SMALLINT) and LOAD_MONTH (TINYINT) and has these two partitions:
load_year=2018/load_month=10 (19,828,71 rows)
load_year=2018/load_month=11 (702,856 rows)
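For context, a minimal sketch of the table definition, reconstructed from the column and partition names above (the original DDL is not shown, so this is an assumption):
CREATE TABLE TBL (
    ID STRING
)
PARTITIONED BY (
    LOAD_YEAR  SMALLINT,
    LOAD_MONTH TINYINT
);

SHOW PARTITIONS TBL;
-- load_year=2018/load_month=10
-- load_year=2018/load_month=11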
Below are various queries and their explain plans.
1. Equal (=) operator
Query:
SELECT ID
FROM TBL
WHERE LOAD_MONTH = 11Y
Explain Plan:
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (load_month = 11) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink
2. IN operator
Query (note that there is no month 12 in the data):
SELECT ID
FROM TBL
WHERE LOAD_MONTH IN (11Y, 12Y)
Explain Plan:
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (load_month = 11) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink
3. Equal (=) in conjunction with AND and OR
Query:
SELECT ID
FROM TBL
WHERE
(LOAD_YEAR = 2018S AND LOAD_MONTH = 11Y)
OR (LOAD_YEAR = 2019S AND LOAD_MONTH = 1Y)
Explain Plan:
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (((load_year = 2018) and (load_month = 11)) or ((load_year = 2019) and (load_month = 1))) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink
4. Arithmetic operation
Query:
SELECT ID
FROM TBL
WHERE (LOAD_YEAR * 100 + LOAD_MONTH) IN (201811, 201901)
Side note:
100 doesn't have a suffix, so it's an INT, and (LOAD_YEAR * 100 + LOAD_MONTH) is therefore also an INT. This is what keeps the result accurate: since LOAD_YEAR is a SMALLINT and LOAD_MONTH a TINYINT, arithmetic on those two alone would use SMALLINT for the result, whose maximum value is 32,767 (not enough for yyyymm, which needs six digits, i.e. values up to 999,999). With 100 as an INT, the calculation is carried out as INT and allows values up to 2,147,483,647.
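A quick way to see this type widening in action (a hedged sketch; the exact wrapped value depends on Hive's SMALLINT overflow behavior):
-- SMALLINT * SMALLINT stays SMALLINT (max 32,767), so the result wraps
-- instead of being 201800:
SELECT 2018S * 100S;
-- With 100 as a plain INT literal, the arithmetic is widened to INT:
SELECT 2018S * 100 + 11Y;   -- 201811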
Explain Plan:
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: tbl
          filterExpr: (201811) IN (201811, 201901) (type: boolean)
          Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 702856 Data size: 84342720 Basic stats: COMPLETE Column stats: NONE
            ListSink
Summary
All these queries scan only the second partition, thereby avoiding the ~20 million rows in the other partition.

Related

Strange query performance result: different numbers of expressions in an IN clause in Greenplum 5.0

I noticed a strange result when using an IN clause in Greenplum 5.0.
When the IN list has 25 or fewer expressions, the query slows down linearly as the list grows (as expected), but with more than 25 expressions the query is noticeably faster than with 25. Why does this happen?
I ran EXPLAIN on the query under both the new and the legacy optimizer; the output is the same. Here are the SQL and the explain results.
Query 1: 26 expressions
SQL:
select * from table1
where column1 in ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26')
Query time: 0.8s ~ 0.9s
Explain:
Gather Motion 8:1  (slice1; segments: 8)  (cost=0.00..481.59 rows=2021 width=1069)
  ->  Table Scan on table1  (cost=0.00..475.60 rows=253 width=1069)
        Filter: column1 = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26}'::text[])
Settings: optimizer=on
Optimizer status: PQO version 2.42.0
Explain analyze:
Gather Motion 8:1  (slice1; segments: 8)  (cost=0.00..481.53 rows=2003 width=1064)
  Rows out: 0 rows at destination with 52 ms to end, start offset by 0.477 ms.
  ->  Table Scan on table1  (cost=0.00..475.63 rows=251 width=1064)
        Filter: column1 = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26}'::text[])
        Rows out: 0 rows (seg0) with 51 ms to end, start offset by -358627 ms.
Slice statistics:
  (slice0) Executor memory: 437K bytes.
  (slice1) Executor memory: 259K bytes avg x 8 workers, 281K bytes max (seg7).
Statement statistics:
  Memory used: 262144K bytes
Settings: optimizer=on
Optimizer status: PQO version 2.42.0
Total runtime: 53.107 ms
Query 2: 25 expressions
SQL:
select * from table1
where column1 in ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25')
Query time: 1.2s ~ 1.5s
Explain:
Gather Motion 8:1  (slice1; segments: 8)  (cost=0.00..481.59 rows=2021 width=1069)
  ->  Table Scan on table1  (cost=0.00..475.60 rows=253 width=1069)
        Filter: column1 = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}'::text[])
Settings: optimizer=on
Optimizer status: PQO version 2.42.0
Explain analyze:
Gather Motion 8:1  (slice1; segments: 8)  (cost=0.00..481.53 rows=2003 width=1064)
  Rows out: 0 rows at destination with 60 ms to end, start offset by 0.517 ms.
  ->  Table Scan on table1  (cost=0.00..475.63 rows=251 width=1064)
        Filter: column1 = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}'::text[])
        Rows out: 0 rows (seg0) with 59 ms to end, start offset by -155783 ms.
Slice statistics:
  (slice0) Executor memory: 437K bytes.
  (slice1) Executor memory: 191K bytes avg x 8 workers, 191K bytes max (seg0).
Statement statistics:
  Memory used: 262144K bytes
Settings: optimizer=on
Optimizer status: PQO version 2.42.0
Total runtime: 60.584 ms
Greenplum runs on 3 VMs: 1 master and 2 segment hosts, each segment host with 4 data directories.
table1 has 500,000 rows and 50 columns; the primary key and distribution key is another column, a UUID. column1 is neither the distribution key nor the primary key, just a natural key.
You can run EXPLAIN ANALYZE to see exactly where the plan spends its time, and share the output here.
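For example, using the question's own query shape (EXPLAIN ANALYZE is standard Greenplum/PostgreSQL syntax):
EXPLAIN ANALYZE
SELECT * FROM table1
WHERE column1 IN ('1','2','3','4','5');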

Hive explain plan: where to see a full table scan?

How can I tell from Hive EXPLAIN whether there is a full table scan?
For example, is there a full scan here?
The table has 993 rows.
The query is:
explain select latitude,longitude FROM CRIMES WHERE geohash='dp3twhjuyutr'
I have a secondary index on the geohash column.
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: crimes
            filterExpr: (geohash = 'dp3twhjuyutr') (type: boolean)
            Statistics: Num rows: 993 Data size: 265582 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (geohash = 'dp3twhjuyutr') (type: boolean)
              Statistics: Num rows: 496 Data size: 132657 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: latitude (type: double), longitude (type: double)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 496 Data size: 132657 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 496 Data size: 132657 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
The absence of a partition predicate in the plan means a full scan. (This is a separate matter from predicate push-down in ORC.)
Check Data size and Num rows in each operator.
The EXPLAIN DEPENDENCY command will show the full input_partitions collection, so you can check exactly what will be scanned.
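For example, a minimal sketch using the query from the question (EXPLAIN DEPENDENCY is standard Hive; it prints JSON listing input_tables and input_partitions):
EXPLAIN DEPENDENCY
SELECT latitude, longitude FROM CRIMES WHERE geohash = 'dp3twhjuyutr';
-- For a non-partitioned table, input_partitions comes back empty and the
-- whole table appears in input_tables, i.e. a full scan.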

Using presto to query from Hive external table: Invalid UTF-8 start byte

I just installed Presto, and when I use the presto-cli to query Hive data, I get the following error:
~$ presto --catalog hive --schema default
presto:default> select count(*) from test3;
Query 20171213_035723_00007_3ktan, FAILED, 1 node
Splits: 131 total, 14 done (10.69%)
0:18 [1.04M rows, 448MB] [59.5K rows/s, 25.5MB/s]
Query 20171213_035723_00007_3ktan failed: com.facebook.presto.hive.$internal.org.codehaus.jackson.JsonParseException: Invalid UTF-8 start byte 0xa5
at [Source: java.io.ByteArrayInputStream#6eb5bdfd; line: 1, column: 376]
The error only happens if I use an aggregate function such as count, sum, etc.
When I run the same query in the Hive CLI, it works (but takes a long time, since it converts the query into a MapReduce job).
$ hive
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
hive> select count(*) from test3;
...
MapReduce Total cumulative CPU time: 17 minutes 56 seconds 600 msec
Ended Job = job_1511341039258_0024
MapReduce Jobs Launched:
Stage-Stage-1: Map: 87 Reduce: 1 Cumulative CPU: 1076.6 sec HDFS Read: 23364693216 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 17 minutes 56 seconds 600 msec
OK
51751422
Time taken: 269.143 seconds, Fetched: 1 row(s)
The point is that the same query works on Hive but not on Presto, and I cannot figure out why. I suspect the two JSON libraries used by Hive and by Presto are different, but I'm not really sure.
I created the external table on Hive with the query:
hive> create external table test2 (app string, contactRefId string, createdAt struct <`date`: string, timezone: string, timezone_type: bigint>, eventName string, eventTime bigint, shopId bigint) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION '/data/data-new/2017/11/29/';
Can anyone help me with this?
Posting this here for easy reference, from where the OP documented a solution:
I successfully fixed the problem by using this serde: https://github.com/electrum/hive-serde (add to presto at /usr/lib/presto/plugin/hive-hadoop2/ and to hive cluster at /usr/lib/hive-hcatalog/share/hcatalog/)
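As a quick sanity check before and after swapping the jar, you can confirm which SerDe the table is declared with (SHOW CREATE TABLE is a standard Hive command; the table name is taken from the question):
SHOW CREATE TABLE test3;
-- the ROW FORMAT SERDE line shows the JSON SerDe class in effect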

Does filtering on top of Hive view push the filter inside the view?

Let there be a view MyView on table MyTable:
CREATE VIEW MyView AS SELECT col1,col2,...,colN from MyTable;
Now let's say we fire the following query:
SELECT * FROM MyView WHERE col="abc";
So does Hive push the filter (col="abc") inside the view when executing the select? Basically, I'm trying to understand whether Hive does 'push-down optimization' here, if I can use that term. Otherwise it would be very inefficient: the view covers the entire table, so the whole table would be queried first and the filter applied afterwards, outside the view.
YES.
create view tmp.v_tmp0823 as select city_id, city_name from dw.dim_city ;
explain select city_id, city_name from tmp.v_tmp0823 where city_id = 123 ;
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: dim_city
          Statistics: Num rows: 530 Data size: 57323 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (city_id = 123) (type: boolean)
            Statistics: Num rows: 265 Data size: 28661 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: 123 (type: bigint), city_name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 265 Data size: 28661 Basic stats: COMPLETE Column stats: NONE
              ListSink
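The behavior above relies on Hive's predicate push-down optimization, which is on by default; if pushdown ever seems absent, the relevant switch can be checked (standard Hive configuration key):
SET hive.optimize.ppd;         -- show the current value
SET hive.optimize.ppd=true;    -- predicate push-down (default: true)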

Postgres large table SELECT optimization

I have to extract a DB to an external DB server for a licensed software product.
The DB has to be Postgres, and I cannot change the SELECT query issued by the application (no access to its source code).
The table (it has to be one table) holds around 6.5M rows and has unique values in the main column (prefix).
All requests are reads, no inserts/updates/deletes, and there are ~200k selects/day with peaks of 15 TPS.
Select query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
AND company = 0 and ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC
LIMIT 1;
EXPLAIN ANALYZE shows the following:
Limit  (cost=406433.75..406433.75 rows=1 width=113) (actual time=1721.360..1721.361 rows=1 loops=1)
  ->  Sort  (cost=406433.75..406436.72 rows=1188 width=113) (actual time=1721.358..1721.358 rows=1 loops=1)
        Sort Key: ("position"((prefix)::text, '%'::text)), (char_length(prefix)) DESC
        Sort Method: quicksort  Memory: 25kB
        ->  Seq Scan on table  (cost=0.00..406427.81 rows=1188 width=113) (actual time=1621.159..1721.345 rows=1 loops=1)
              Filter: ((company = 0) AND ('00381691997142'::text ~~ (prefix)::text) AND ((strpos(("Day")::text, (to_char(now(), 'ID'::text))::text) > 0) OR ("Day" IS NULL)) AND (((('now'::cstring)::time with time zone >= (timefrom)::time with time zone) AN (...)
              Rows Removed by Filter: 6417130
Planning time: 0.165 ms
Execution time: 1721.404 ms
The slowest part of the query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
which alone takes 1.6s (I tested only this part of the query).
The plan for that part, tested separately:
Seq Scan on table  (cost=0.00..181819.07 rows=32086 width=113) (actual time=1488.359..1580.607 rows=1 loops=1)
  Filter: ('004366491997142'::text ~~ (prefix)::text)
  Rows Removed by Filter: 6417130
Planning time: 0.061 ms
Execution time: 1580.637 ms
About the data itself:
the first several digits (five) of column "prefix" are identical for all rows, and the rest are distinct, unique values.
Postgres version is 9.5
I've changed the following Postgres settings:
random_page_cost = 40
effective_cache_size = 4GB
shared_buffers = 4GB
work_mem = 1GB
I have tried several index types (unique, gin, gist, hash), but in all cases the indexes are not used (as shown in the explain above) and the speed is the same.
I also ran the following, with no visible improvement:
VACUUM ANALYZE VERBOSE table
Please recommend DB settings and/or an index configuration to speed up the execution time of this query.
Current HW is
an i5, SSD, 16 GB RAM on Win7, but I have the option to buy stronger HW.
As I understand it, for read-dominant workloads (no inserts/updates), faster CPU cores matter much more than the number of cores or disk speed; please confirm.
Add-on 1:
After adding 9 indexes, no index is used either.
Add-on 2:
1) I found the reason the index is not used: the operand order in the LIKE part of the query. If the query were:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE prefix like '00436641997142%'
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
then an index is used.
Notice the difference. The query that cannot use an index:
... WHERE '00436641997142%' like prefix ...
versus the query which uses the index correctly:
... WHERE prefix like '00436641997142%' ...
Since I cannot change the query itself, any idea how to overcome this? I can change the data and the Postgres settings, but not the query.
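For illustration, a minimal sketch comparing the two forms with EXPLAIN (using the table and column names from the query above):
-- Indexable: a left-anchored pattern on the column side can use a B-tree
-- index built with text_pattern_ops/varchar_pattern_ops:
EXPLAIN SELECT prefix FROM erm_table WHERE prefix LIKE '00436641997142%';
-- Not indexable: the column is on the pattern side, so the planner can
-- only evaluate the predicate row by row with a sequential scan:
EXPLAIN SELECT prefix FROM erm_table WHERE '00436641997142' LIKE prefix;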
2) I also installed Postgres 9.6 in order to use parallel seq scan. In this case, a parallel scan is used only if the last part of the query is omitted. So the query:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null))
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
uses parallel mode.
Any idea how to force the original query (which I cannot change):
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM erm_table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
to use parallel seq. scan?
It's hard to make an index for queries of the form string LIKE pattern, because the wildcards (% and _) can appear anywhere in the pattern.
I can suggest one risky solution:
Slightly redesign the table to make it indexable. Add two more columns, prefix_low and prefix_high, of fixed width (for example char(32), or any length sufficient for the task), plus one smallint column for the prefix length. Fill them with the lowest and highest values matching the prefix, and with the prefix length. For example:
select rpad(rtrim('00436641997142%','%'), 32, '0') AS prefix_low,
       rpad(rtrim('00436641997142%','%'), 32, '9') AS prefix_high,
       length(rtrim('00436641997142%','%')) AS prefix_length;

            prefix_low            |           prefix_high            | prefix_length
----------------------------------+----------------------------------+---------------
 00436641997142000000000000000000 | 00436641997142999999999999999999 |            14
Make an index with these values:
CREATE INDEX table_prefix_low_high_idx ON table (prefix_low, prefix_high);
Check the modified query against the table:
SELECT prefix, changeprefix, deletelast, outgroup, tariff
FROM table
WHERE '00436641997142%' BETWEEN prefix_low AND prefix_high
AND company = 0
AND ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY prefix_length DESC
LIMIT 1
Check how well it works with the indexes and try to tune it: add or remove an index on prefix_length, add it to the BETWEEN index, and so on.
Now you need to rewrite the queries sent to the database. Install PgBouncer with the pgbouncer-rr patch. It lets you rewrite queries on the fly with simple Python code like this:
import re

def rewrite_query(username, query):
    # match only the application's fixed query shape: ... '<digits>' LIKE prefix ...
    q1 = r"^SELECT .* WHERE '(?P<id>\d+)' LIKE prefix"
    m = re.match(q1, query)
    if not m:
        return query  # nothing to do with other queries
    # build the index-friendly rewrite here, e.g. use m.group('id') to fill
    # the prefix_low/prefix_high BETWEEN bounds shown above
    new_query = query  # placeholder: replace with the rewritten SQL
    return new_query
Run PgBouncer and connect it to the DB. Try issuing different queries the way your application does and check how they get rewritten. Because you are dealing with text, you will have to tweak the regexps to match all the required queries and rewrite them properly.
When the proxy is ready and debugged, reconnect your application to PgBouncer.
Pros:
no changes to the application
no changes to the basic structure of the DB
Cons:
extra maintenance: you need triggers to keep the new columns filled with current data (see the sketch below)
extra tools to support
the rewrite uses regexps, so it is closely tied to the particular queries issued by your app; you need to run it for some time and build robust rewrite rules
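A hedged sketch of such a maintenance trigger (column names follow the redesign above; adjust the width to your chosen char length):
CREATE OR REPLACE FUNCTION fill_prefix_bounds() RETURNS trigger AS $$
BEGIN
    -- derive the indexable bounds from the stored prefix pattern
    NEW.prefix_low    := rpad(rtrim(NEW.prefix, '%'), 32, '0');
    NEW.prefix_high   := rpad(rtrim(NEW.prefix, '%'), 32, '9');
    NEW.prefix_length := length(rtrim(NEW.prefix, '%'));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER prefix_bounds_trg
    BEFORE INSERT OR UPDATE ON erm_table
    FOR EACH ROW EXECUTE PROCEDURE fill_prefix_bounds();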
Further development:
hijack the parsed query tree in PostgreSQL itself: https://wiki.postgresql.org/wiki/Query_Parsing
If I understand your problem correctly, creating a proxy server that rewrites queries could be a solution here.
Here is an example from another question.
You could then change LIKE to = in your query, and it would run a lot faster.
You should change your index by adding the proper operator class. According to the documentation:
The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.
As an example, you might index a varchar column like this:
CREATE INDEX test_index ON test_table (col varchar_pattern_ops);
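Applied to the table in question, that would look like the sketch below. One caveat: a pattern-ops B-tree index only helps the left-anchored form prefix LIKE 'const%', not the application's fixed 'const' LIKE prefix form, so it works together with the query-rewriting approaches above:
CREATE INDEX erm_table_prefix_pattern_idx
    ON erm_table (prefix varchar_pattern_ops);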