Hive multilevel partitions and select with where clause

I have a Hive table with two partition columns: the first is city and the second is village, so every city partition contains the village partitions under it. Something like below:
city1/village1
city1/village2
city1/village3
city2/village5
city2/village6
So if my select statement is select * from table where village = 'village5', will it search all the partitions under city1 and city2 before outputting the result? Or will it consult the Hive metastore and hit only the village5 partition?

How well this is optimized depends on your Hive version. In my current version (1.1.0), Hive is able to point to the specific partition without scanning the top-level partition.
Here is a quick demonstration.
create table mydb.partition_test
(id string)
partitioned by (city string, village string);
INSERT OVERWRITE TABLE mydb.partition_test PARTITION (city,village)
select * from (
select '1', 'city1', 'village1'
union all
select '1', 'city1', 'village2'
union all
select '1', 'city1', 'village3'
union all
select '1', 'city2', 'village5'
union all
select '1', 'city2', 'village6'
) t;
explain select * from mydb.partition_test where village='village5';
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: partition_test
filterExpr: (village = 'village5') (type: boolean)
Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: id (type: string), city (type: string), 'village5' (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
ListSink
As you can see from the execution plan, Hive can estimate the number of records for that specific partition without launching a MapReduce job, and the table scan points to the specific partition.
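If you want to verify this on your own version, one hedged check (available since roughly Hive 0.10, so it should be present in 1.1.0) is EXPLAIN DEPENDENCY, which lists the exact partitions a query would read:
explain dependency
select * from mydb.partition_test where village = 'village5';
-- The JSON output should list only the pruned-down input, roughly:
-- {"input_tables":[{"tablename":"mydb@partition_test", ...}],
--  "input_partitions":[{"partitionName":"mydb@partition_test@city=city2/village=village5"}]}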

Perform an SQL query excluding a specific partition name

Let's imagine that the table my_table is divided into 1000 partitions, as in the following example:
P1, P2, P3, ... , P997, P998, P999, P1000
Partitions are organized by dates, mostly a partition per day. E.g.:
P0 < 01/01/2000 => Contains around 472M records
P1 = 01/01/2000 => Contains around 15k records
P2 = 02/01/2000 => Contains around 15k records
P3 = 03/01/2000 => Contains around 15k records
... = ../../.... => Contains around ... records
P997 = 07/04/2000 => Contains around 15k records
P998 = 08/04/2000 => Contains around 15k records
P999 = 09/04/2000 => Contains around 15k records
P1000 = 10/04/2000 => Contains around 15k records
Please note that P0 is < 01/01/2000, NOT =.
CURRENT SITUATION:
When looking for a specific record without knowing the date, I am doing:
SELECT * FROM my_schema.my_table WHERE ... ;
But this takes too much time (30 s) because it includes P0.
IMPOSSIBLE SOLUTION:
So the best idea would be to execute an SQL query such as:
SELECT * FROM my_schema.my_table PARTITION (P42) WHERE ... ;
But we never know which partition the record is in, and we don't know the date associated with the partition either. And of course we won't loop over all partitions one by one.
BAD SOLUTION:
I could be clever by doing 5 by 5:
SELECT * FROM my_schema.my_table PARTITION (P40,P41,P42,P43,P44) WHERE ... ;
However, same issue as above: I won't loop over all the partitions, even 5 by 5.
LESS BAD SOLUTION:
I won't do this either (listing every partition except P0):
SELECT * FROM my_schema.my_table PARTITION (P1,P2,...,P99,P100) WHERE ... ;
The query would be too long, and I would have to compute the list of partition names for each request, since it would not always start at P1 or end at P100 (each day some partitions are dropped and some are created).
CLEVER SOLUTION (but does it exist?):
How can I do something like this?
SELECT * FROM my_schema.my_table NOT IN PARTITION(P0) WHERE ... ;
or
SELECT * FROM my_schema.my_table PARTITION(*,-P0) WHERE ... ;
or
SELECT * FROM my_schema.my_table LESS PARTITION(P0) WHERE ... ;
or
SELECT * FROM my_schema.my_table EXCLUDE PARTITION(P0) WHERE ... ;
Is there any way to do that?
I mean a way to select all partitions except one or a few of them?
Note: I don't know the value of dateOfSale in advance. Inside the table, we have something like:
CREATE TABLE my_table
(
recordID NUMBER(16) NOT NULL, --not primary
dateOfSale DATE NOT NULL, --unknown
....
<other fields>
)
Before you answer, read the following:
Index usage: yes, it is already optimized, but remember, we do not know the partitioning date.
No, we won't drop records from P0; we need to keep them for at least a few years (3, 5 and sometimes 10, according to each country's laws).
We can "split" P0 into several partitions, but that won't solve the issue with a global SELECT.
We cannot move that data into a new table; it has to stay in this table since multiple services and applications run selects against it, and we would have to edit too much code to add a second-table query to every service and back end.
We cannot add something like a WHERE date > 2019 clause and index the date field, for multiple reasons that would take too long to explain here.
The query below, i.e. two queries in a UNION ALL of which I only want one row, will stop immediately once a row is found. We do not need to go into the second part of the UNION ALL if we get a row in the first.
SQL> select * from
2 ( select x
3 from t1
4 where x = :b1
5 union all
6 select x
7 from t2
8 where x = :b1
9 )
10 where rownum = 1
11 /
See https://connor-mcdonald.com/golden-oldies/first-match-written-15-10-2007/ for a simple proof of this.
I'm assuming that you're working under the assumption that most of the time, the record you are interested in is in your most recent, smaller partitions. In the absence of any other information to hone in on the right partition, you could do:
select * from
( select ...
from tab
where trans_dt >= DATE'2000-01-01'
and record_id = :my_record
union all
select ...
from tab
where trans_dt < DATE'2000-01-01'
and record_id = :my_record
)
where rownum = 1
which will only scan the big partition if we fall through and don't find the row anywhere else.
But your problem does seem to be screaming out for an index to avoid all this work.
Let's simplify your partitioned table as follows
CREATE TABLE tab
( trans_dt DATE
)
PARTITION BY RANGE (trans_dt)
( PARTITION p0 VALUES LESS THAN (DATE'2000-01-01')
, PARTITION p1 VALUES LESS THAN (DATE'2000-01-02')
, PARTITION p2 VALUES LESS THAN (DATE'2000-01-03')
, PARTITION p3 VALUES LESS THAN (DATE'2000-01-04')
);
If you want to skip your large partition P0 in a query, you simply constrain the partition key with trans_dt >= DATE'2000-01-01' (this works because P0 is the first partition).
You would need two predicates combined with OR to skip a partition in the middle.
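For instance, a minimal sketch against the sample tab table above, skipping only the middle partition p2:
-- skip rows in p2 (DATE'2000-01-02' <= trans_dt < DATE'2000-01-03'), keep everything else
select * from tab
where trans_dt < DATE'2000-01-02'
   or trans_dt >= DATE'2000-01-03';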
The query
select * from tab
where trans_dt >= DATE'2000-01-01';
Checking the execution plan, you see the expected behaviour in Pstart = 2 (i.e. the first partition is pruned).
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 9 | 2 (0)| 00:00:01 | | |
| 1 | PARTITION RANGE ITERATOR | | 1 | 9 | 2 (0)| 00:00:01 | 2 | 4 |
| 2 | TABLE ACCESS STORAGE FULL| TAB | 1 | 9 | 2 (0)| 00:00:01 | 2 | 4 |
---------------------------------------------------------------------------------------------------
Remember, if you scan a partitioned table without constraining the partition key, you will have to scan all partitions.
If you know that most of the query results are in the recent, small partitions, simply scan them in the first query:
select * from tab
where trans_dt >= DATE'2000-01-01' and <your filter>
and only if you fail to get the row, scan the large partition:
select * from tab
where trans_dt < DATE'2000-01-01' and <your filter>
You will get much better response time on average if the assumption holds that the queries mostly refer to recent data.
Although there is no syntax to exclude a specific partition, you can build a pipelined table function that dynamically builds a query that uses every partition except for one.
The table function builds a query like the one below. The function uses the data dictionary view USER_TAB_PARTITIONS to get the partition names to build the SQL, uses dynamic SQL to execute the query, and then pipes the results back to the caller.
select * from my_table partition (P1) union all
select * from my_table partition (P2) union all
...
select * from my_table partition (P1000);
Sample schema
CREATE TABLE my_table
(
recordID NUMBER(16) NOT NULL, --not primary
dateOfSale DATE NOT NULL, --unknown
a NUMBER
)
partition by range (dateOfSale)
(
partition p0 values less than (date '2000-01-01'),
partition p1 values less than (date '2000-01-02'),
partition p2 values less than (date '2000-01-03')
);
insert into my_table
select 1,date '1999-12-31',1 from dual union all
select 2,date '2000-01-01',1 from dual union all
select 3,date '2000-01-02',1 from dual;
commit;
Package and function
create or replace package my_table_pkg is
type my_table_nt is table of my_table%rowtype;
function get_everything_but_p0 return my_table_nt pipelined;
end;
/
create or replace package body my_table_pkg is
function get_everything_but_p0 return my_table_nt pipelined is
v_sql clob;
v_results my_table_nt;
v_cursor sys_refcursor;
begin
--Build SQL that references all partitions.
for partitions in
(
select partition_name
from user_tab_partitions
where table_name = 'MY_TABLE'
and partition_name <> 'P0'
) loop
v_sql := v_sql || chr(10) || 'union all select * from my_table ' ||
'partition (' || partitions.partition_name || ')';
end loop;
v_sql := substr(v_sql, 12);
--Print the query for debugging:
dbms_output.put_line(v_sql);
--Gather the results in batches and pipe them out.
open v_cursor for v_sql;
loop
fetch v_cursor bulk collect into v_results limit 100;
exit when v_results.count = 0;
for i in 1 .. v_results.count loop
pipe row (v_results(i));
end loop;
end loop;
close v_cursor;
end;
end;
/
The package uses 12c's ability to define types in package specifications. If you build this in 11g or below, you'll need to create SQL types instead. This package only works for one table, but if necessary there are ways to create functions that work with any table (using Oracle data cartridge or 18c's polymorphic table functions).
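For reference, a hypothetical 11g-style equivalent of the collection type might look like the sketch below (names are illustrative, not part of the original package); the dynamic SQL would then need to select my_table_row(recordID, dateOfSale, a) instead of * so that the bulk collect can fill the object collection, and the function would return the SQL type.
-- Hypothetical 11g variant: schema-level SQL types instead of the package-spec type
create or replace type my_table_row as object
( recordID   number(16),
  dateOfSale date,
  a          number
);
/
create or replace type my_table_nt as table of my_table_row;
/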
Sample query
SQL> select * from table(my_table_pkg.get_everything_but_p0);
RECORDID DATEOFSAL A
---------- --------- ----------
2 01-JAN-00 1
3 02-JAN-00 1
Performance
This function should perform almost as well as the clever solution you were looking for. There will be overhead because the rows get passed through PL/SQL. But most importantly, the function builds a SQL statement that partition prunes away the large P0 partition.
One possible issue with this function is that the optimizer has no visibility inside it and can't create a good row cardinality estimate. If you use the function as part of another large SQL statement, be aware that the optimizer will blindly guess that the function returns 8168 rows. That bad cardinality guess may lead to a bad execution plan.
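If that becomes a problem, one commonly suggested workaround is the (undocumented, so treat this as an assumption to test) CARDINALITY hint on the table function call; the 15000 below is purely an illustrative estimate:
-- override the default 8168-row guess with a rough estimate of the real row count
select /*+ cardinality(t 15000) */ t.*
from table(my_table_pkg.get_everything_but_p0) t;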

In Athena how do I query a member of a struct in an array in a struct?

I am trying to figure out how to write a query that checks the value of usage, given the following table creation:
CREATE EXTERNAL TABLE IF NOT EXISTS foo.test (
`id` string,
`foo` struct< usages:array< struct< usage:string,
method_id:int,
start_at:string,
end_at:string,
location:array<string> >>>
) PARTITIONED BY (
timestamp date
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1' ) LOCATION 's3://foo.bar/' TBLPROPERTIES ('has_encrypted_data'='false');
I would like to have a query like:
SELECT * FROM "foo"."test" WHERE foo.usages.usage is null;
When I do that I get:
SYNTAX_ERROR: line 1:53: Expression "foo"."usages" is not of type ROW
If I directly index the array, as in the following, the query works.
SELECT * FROM "foo"."test" WHERE foo.usages[1].usage is null;
My overall goal, though, is to query across all items in the usages array and find any row where at least one item has a usage member that is null.
Athena is based on Presto. In Presto 318 you can use any_match:
SELECT * FROM "foo"."test"
WHERE any_match(foo.usages, element -> element.usage IS NULL);
I think the function is not available in Athena yet, but you can emulate it using reduce.
SELECT * FROM "foo"."test"
WHERE reduce(
foo.usages, -- array to reduce
false, -- initial state
(state, element) -> state OR element.usage IS NULL, -- combining function
state -> state); -- output function (identity in this case)
You can achieve this by unnesting the array into rows and then checking those rows for null values. This will result in one row per null-value entry.
select * from test
CROSS JOIN UNNEST(foo.usages) AS t(i)
where i.usage is null
So if you only need the unique set, you must run this through a select distinct.
select distinct id from test
CROSS JOIN UNNEST(foo.usages) AS t(i)
where i.usage is null
Another way to emulate any_match(<array>, <function>) is with cardinality(filter(<array>, <function>)) > 0.
SELECT * FROM "foo"."test"
WHERE any_match(foo.usages, element -> element.usage IS NULL);
Becomes:
SELECT * FROM "foo"."test"
WHERE cardinality(filter(foo.usages, element -> element.usage IS NULL)) > 0

SQL Error in Google BigQuery with UNION ALL on tables with the same schema (EDIT: change in schema from STRING to INT)

I have the following query
SELECT *
FROM `January_2018`
UNION ALL
SELECT *
FROM `February_2018`
I get the following error on the second SELECT call
Column 14 in UNION ALL has incompatible types: STRING, STRING, INT64,
INT64, INT64, INT64, INT64, INT64, INT64, INT64, INT64, INT64 at [7:3]
The column name is travel_type with a type of integer with values 0, 1 and 2.
I am trying to make one large table from several smaller ones (monthly tables of the same data). It seems that one of the fields changed from STRING to INT after the 4th month and stays INT from then on.
Try the following so both table schemas match:
SELECT * EXCEPT(changed_column)
, CAST(changed_column AS STRING) AS changed_column
FROM table1
UNION ALL
SELECT * EXCEPT(changed_column)
, CAST(changed_column AS STRING) AS changed_column
FROM table2
To select data from different tables you can use a wildcard instead of a UNION. A wildcard query runs against all tables whose names satisfy the condition: you can use the wildcard '*' together with a table prefix to select multiple tables at once. Your table names must share the same prefix with different suffixes, e.g. Mytable_1, Mytable_2, Mytable_3, ...
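A minimal sketch with illustrative names (the monthly tables would have to share a prefix, e.g. trips_201801 and trips_201802, for a wildcard to apply):
-- wildcard over all tables matching the prefix, filtered by suffix
SELECT *
FROM `my_project.my_dataset.trips_2018*`
WHERE _TABLE_SUFFIX BETWEEN '01' AND '02'  -- January and February 2018
Note that, as far as I know, wildcard queries also expect compatible schemas (BigQuery takes the schema of the most recently created matching table), so the STRING/INT mismatch would still have to be resolved, for example with the CAST approach above.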

Does Hive Optimizer consider view definition while optimizing queries on views?

I have this schema (given through DDL for tables and views):
hive> create table t_realtime(cust_id int, name string, status string, active_flag int);
hive> create table t_hdfs(cust_id int, name string, status string, active_flag int);
hive> create view t_inactive as select * from t_hdfs where active_flag=0;
hive> create view t_view as select * from t_realtime union all select * from t_inactive;
If I fire a query as follows:
hive> select * from t_view where active_flag = 1;
This query ideally should not visit the t_inactive view or t_hdfs at all, since the view definition for t_inactive has active_flag = 0 and the query predicate has active_flag = 1. However, by default, it does not eliminate the t_inactive part of this union view.
Is there any way to achieve this for such a Hive query? Maybe some Hive optimizer parameter or a hint?
hive> explain extended select * from t_view where active_flag = 1;
OK
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: t_realtime
properties:
insideView TRUE
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: (active_flag = 1) (type: boolean)
Select Operator
expressions: cust_id (type: int), name (type: string), status (type: string), 1 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3
ListSink
This is tested on yesterday's mainline (at d68630b6ed25884a76030a9073cd864032ab85c2). As you can see, it only scans t_realtime and pushes down the predicate active_flag = 1. Whether your particular installation will do this depends on the version you're using. This area is under active development, not only in Hive but also in Calcite (which Hive uses).
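If you are on a release where this rewrite does not happen, there is no single documented hint guaranteed to force it, but it is worth confirming that the cost-based optimizer and predicate pushdown are enabled; a sketch (both parameters exist, though defaults vary by release):
hive> set hive.cbo.enable=true;      -- Calcite-based cost-based optimizer
hive> set hive.optimize.ppd=true;    -- predicate pushdown into the union branches
hive> explain extended select * from t_view where active_flag = 1;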

How can I make this query faster?

I'm running this query on an Oracle DB:
SELECT COUNT(1) FROM db.table WHERE columnA = 'VALUE' AND ROWNUM < 2
There is no index on columnA, and the table has many, many thousands of rows (possibly millions). There are only about twenty values that should be returned, so it's not a huge result set. However, because it triggers a full table scan, it takes eons. How can I make it go faster?
Note: I'm not a DBA, so I have limited access to the database and can't restructure it, add indexes, or get rid of old data.
If you're looking for the existence of a row, not the number of times it appears, then this would be more appropriate:
SELECT 1
FROM DB.TABLE
WHERE ColumnA = 'VALUE'
AND ROWNUM = 1
That will stop the query as fast as possible once a row's been found; however, if you need it to go faster, that's what indexes are for.
Test Case:
create table q8806566
( id number not null,
column_a number not null,
padding char(256), -- so all the rows aren't really short
constraint pk_q8806566 primary key (id)
using index tablespace users
)
tablespace users;
insert into q8806566 -- 4 million rows
(id, column_a, padding)
with generator as
(select --+ materialize
rownum as rn from dba_objects
where rownum <= 2000)
select rownum as id, mod(rownum, 20) as column_a,
v1.rn as padding
from generator v1
cross join generator v2;
commit;
exec dbms_stats.gather_table_stats (ownname => user, tabname => 'q8806566');
The data for column_A is well distributed, and can be found in the first few blocks for all values, so this query runs well:
SELECT 1
FROM q8806566
WHERE Column_A = 1
AND ROWNUM = 1;
Sub-0.1-second execution time and low I/O (on the order of 4 I/Os). However, when looking for a value that's NOT present, things change alarmingly:
SELECT 1
FROM q8806566
WHERE Column_A = 20
AND ROWNUM = 1;
20-40 seconds of execution time, and over 100,000 I/Os.
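As an aside, one way to observe figures like these yourself is SQL*Plus autotrace (an assumption about tooling; any session-level tracing gives the same information):
SQL> set autotrace traceonly statistics
SQL> SELECT 1 FROM q8806566 WHERE Column_A = 20 AND ROWNUM = 1;
-- "consistent gets" and "physical reads" in the statistics section show the I/O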
However, if we add the index:
create index q8806566_idx01 on q8806566 (column_a) tablespace users;
exec dbms_stats.gather_index_stats (ownname => user, indname => 'q8806566_idx01');
We get sub-0.1-second response time and single-digit I/Os from both queries.