Does Hive Optimizer consider view definition while optimizing queries on views? - sql

I have this schema (given through DDL for tables and views):
hive> create table t_realtime(cust_id int, name string, status string, active_flag int);
hive> create table t_hdfs(cust_id int, name string, status string, active_flag int);
hive> create view t_inactive as select * from t_hdfs where active_flag=0;
hive> create view t_view as select * from t_realtime union all select * from t_inactive;
If I fire a query as follows:
hive> select * from t_view where active_flag = 1;
This query ideally should not visit t_inactive view or t_hdfs at all, since the view definition for t_inactive itself has active_flag = 0 and query predicate has active_flag = 1. However, by default, it does not eliminate t_inactive part of this union view.
Is there anyway to achieve this for such hive query? Maybe some hive optimizer parameter or a hint?

hive> explain extended select * from t_view where active_flag = 1;
OK
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: t_realtime
properties:
insideView TRUE
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: (active_flag = 1) (type: boolean)
Select Operator
expressions: cust_id (type: int), name (type: string), status (type: string), 1 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3
ListSink
This is tested on yesterday's mainline (at d68630b6ed25884a76030a9073cd864032ab85c2). As you can see, it only scans t_realtime and pushes down the predicate active_flag = 1. Whether your particular installation will do this or not, it depends on what version you're using. This topic is subject to active development, not only on Hive but also on Calcite (used by Hive).

Related

How to dynamically SELECT from manually partitioned table

Suppose I have table of tenants like so;
CREATE TABLE tenants (
name varchar(50)
)
And for each tenant, I have a corresponding table called {tenants.name}_entities, so for example for tenant_a I would have the following table.
CREATE TABLE tenant_a_entities {
id uuid,
last_updated timestamp
}
Is there a way I can create a query with the following structure? (using create table syntax to show what I'm looking for)
CREATE TABLE all_tenant_entities {
tenant_name varchar(50),
id uuid,
last_updated timestamp
}
--
I do understand this is a strange DB layout, I'm playing around with foreign data in Postgres to federate foreign databases.
Did you consider declarative partitioning for your relational design? List partitioning for your case, with PARTITION BY LIST ...
To answer the question at hand:
You don't need the table tenants for the query at all, just the detail tables. And one way or another you'll end up with UNION ALL to stitch them together.
SELECT 'a' AS tenant_name, id, last_updated FROM tenant_a_entities
UNION ALL SELECT 'b', id, last_updated FROM tenant_b_entities
...
You can add the name dynamically, like:
SELECT tableoid::regclass::text, id, last_updated FROM tenant_a_entities
UNION ALL SELECT tableoid::regclass::text, id, last_updated FROM tenant_a_entities
...
See:
Get the name of a row's source table when querying the parent it inherits from
But it's cheaper to add a constant name while building the query dynamically in your case (the first code example) - like this, for example:
SELECT string_agg(format('SELECT %L AS tenant_name, id, last_updated FROM %I'
, split_part(tablename, '_', 2)
, tablename)
, E'\nUNION ALL '
ORDER BY tablename) -- optional order
FROM pg_catalog.pg_tables
WHERE schemaname = 'public' -- actual schema name
AND tablename LIKE 'tenant\_%\_entities';
Tenant names cannot contain _, or you have to do more.
Related:
Table name as a PostgreSQL function parameter
How to check if a table exists in a given schema
You can wrap it in a custom function to make it completely dynamic:
CREATE OR REPLACE FUNCTION public.f_all_tenant_entities()
RETURNS TABLE(tenant_name text, id uuid, last_updated timestamp)
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE
(
SELECT string_agg(format('SELECT %L AS tn, id, last_updated FROM %I'
, split_part(tablename, '_', 2)
, tablename)
, E'\nUNION ALL '
ORDER BY tablename) -- optional order
FROM pg_tables
WHERE schemaname = 'public' -- your schema name here
AND tablename LIKE 'tenant\_%\_entities'
);
END
$func$;
Call:
SELECT * FROM public.f_all_tenant_entities();
You can use this set-returning function (a.k.a "table-function") just like a table in most contexts in SQL.
Related:
How to UNION a list of tables retrieved from another table with a single query?
Simulate CREATE DATABASE IF NOT EXISTS for PostgreSQL?
Function to loop through and select data from multiple tables
Note that RETIRN QUERY does not allow parallel queriies before Postgres 14. The release notes:
Allow plpgsql's RETURN QUERY to execute its query using parallelism (Tom Lane)

Hive: read table partitions defined in subselect

I have a Hive table which is partitioned by partitionDate field.
I can read partition of my choice via simple
select * from myTable where partitionDate = '2000-01-01'
My task is to specify the partition of my choise dynamically. I.e. first I want to read it from some table, and only then run select to myTable. And of course, I want the power of partitions to be used.
I have written a query which looks like
select * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate
The query works but looks like partitions are not used. The query works too long.
I tried another approach:
select * from myTable where partitionDate in (select reportDate from thatTable)
.. and again I see that the query works too slowly.
Is there a way to implement this in Hive?
update: create table for myTable
CREATE TABLE `myTable`(
`theDate` string,
')
PARTITIONED BY (
`partitionDate` string)
TBLPROPERTIES (
'DO_NOT_UPDATE_STATS'='true',
'STATS_GENERATED_VIA_STATS_TASK'='true',
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='2',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"theDate","type":"string","nullable":true}...
'spark.sql.sources.schema.part.1'='{"name":"partitionDate","type":"string","nullable":true}...',
'spark.sql.sources.schema.partCol.0'='partitionDate')
If you are running Hive on Tez execution engine, try
set hive.tez.dynamic.partition.pruning=true;
Read more details and related configuration in the Jira HIVE-7826
and at the same time try to rewrite as a LEFT SEMI JOIN:
select *
from myTable t
left semi join (select distinct reportDate from thatTable) s on t.partitionDate = s.reportDate
If nothing helps, see this workaround: https://stackoverflow.com/a/56963448/2700344
Or this one: https://stackoverflow.com/a/53279839/2700344
Similar question: Hive Query is going for full table scan when filtering on the partitions from the results of subquery/joins

SQL Error in Google Big Query with UNION ALL on tables with same schema EDIT: change in schema from String to INT

I have the following query
SELECT *
FROM `January_2018`
UNION ALL
SELECT *
FROM `February_2018`
I get the following error on the second SELECT call
Column 14 in UNION ALL has incompatible types: STRING, STRING, INT64,
INT64, INT64, INT64, INT64, INT64, INT64, INT64, INT64, INT64 at [7:3]
The column name is travel_type with a type of integer with values 0, 1 and 2.
I am trying to make one large table from several smaller ones - monthly tables of the same data. It seems that one of the fields has changed from String to Int data type after the 4th month and stays Int ongoing after that.
Try the following so both table schemas match:
SELECT * EXCEPT(changed_column)
, CAST(changed_column AS STRING) AS changed_column
FROM table1
UNION ALL
SELECT * EXCEPT(changed_column)
, CAST(changed_column AS STRING) AS changed_column
FROM table2
TO select data from different tables you can use Wildcard instead of union. Wildcard will execute your query on all tables satisfying the condition. You can use wildcard ‘*’ with table prefix to select multiple tables at once. Your table names must have same Prefix with different suffix. Ex – Mytable_1, Mytabel_2, Mytable_3………

Hive multilevel partitions and select with where clause

I have hive table with 2 partitions and the 1st partition is the city and the second partition is village, So every city partition will contains the list of all village partitions in it.Some thing like below
city1/village1
city1/village2
city1/village3
city2/village5
city2/village6
So if my select statement is select * from table where village = 'village5'
will it search all the partitions in city 1 and city 2 before outputting the result? Or will it see the hive metastore file and hit only the village5 partition.
It will depend on your Hive version how optimize it is. In my current version (1.1.0) Hive is able to point to the specific partition without scanning the top partition
Here is a quick demonstration.
create table mydb.partition_test
(id string)
partitioned by (city string, village string);
INSERT OVERWRITE TABLE mydb.partition_test PARTITION (city,village)
select * from (
select '1', 'city1', 'village1'
union all
select '1', 'city1', 'village2'
union all
select '1', 'city1', 'village3'
union all
select '1', 'city2', 'village5'
union all
select '1', 'city2', 'village6'
) t;
explain select * from mydb.partition_test where village='village5';
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: partition_test
filterExpr: (village = 'village5') (type: boolean)
Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: id (type: string), city (type: string), 'village5' (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: PARTIAL
ListSink
As you can see from the execution plan, it is able to estimate the number of records for that specific partition without mapred operation and the table scan is pointing to the specific partition.

Accessing BigQuery RECORD - Repeated in Tableau

I have a BigQuery Table with a column of RECORD type & mode REPEATED. I have to query and use this table in Tableau. Using UNNEST or FLATTEN in BigQuery is performing CROSS JOIN of the Table which is impacting performance. Is there any other way to use this table in Tableau without flattening it. Have posted the table schema image link below.
[Schema of Table]
https://i.stack.imgur.com/T4jHg.png
Is there any other way to use ... ?
You should not afraid UNNEST just because it “does” CROSS JOIN
The trick is that even though it is cross join but it is cross join within the row only and global to all rows in table. At the same time, there are always way to do stuff different
So, below example 1 – presents dummy example using UNNEST
And then Example 2 – shows how to do the same without using UNNEST, but rather using SQL UDF
You have not presented specifics about your case, so below is generic enough to show ‘other’ way
With Flattening via UNNEST
#standardSQL
WITH yourTable AS (
SELECT 1 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(1,'y','a','xxx'),(2,'n','b','yyy'),(3,'y','c','zzz'),(4,'n','d','vvv')] AS type UNION ALL
SELECT 2 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(11,'t','c','xxx'),(21,'n','a','yyy'),(31,'y','c','zzz'),(41,'f','d','vvv')] AS type
)
SELECT id, SUM(t.details) AS details
FROM yourTable, UNNEST(type) AS t
WHERE t.flag = 'y'
GROUP BY id
With SQL UDF
#standardSQL
CREATE TEMP FUNCTION do_something (
type ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
)
RETURNS INT64 AS ((
SELECT SUM(t.details) AS details
FROM UNNEST(type) AS t
WHERE t.flag = 'y'
));
WITH yourTable AS (
SELECT 1 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(1,'y','a','xxx'),(2,'n','b','yyy'),(3,'y','c','zzz'),(4,'n','d','vvv')] AS type UNION ALL
SELECT 2 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(11,'t','c','xxx'),(21,'n','a','yyy'),(31,'y','c','zzz'),(41,'f','d','vvv')] AS type
)
SELECT id, do_something(type) AS details
FROM yourTable