Hive: read table partitions defined in subselect - sql

I have a Hive table which is partitioned by partitionDate field.
I can read partition of my choice via simple
select * from myTable where partitionDate = '2000-01-01'
My task is to specify the partition of my choise dynamically. I.e. first I want to read it from some table, and only then run select to myTable. And of course, I want the power of partitions to be used.
I have written a query which looks like
select * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate
The query works but looks like partitions are not used. The query works too long.
I tried another approach:
select * from myTable where partitionDate in (select reportDate from thatTable)
.. and again I see that the query works too slowly.
Is there a way to implement this in Hive?
update: create table for myTable
CREATE TABLE `myTable`(
`theDate` string,
')
PARTITIONED BY (
`partitionDate` string)
TBLPROPERTIES (
'DO_NOT_UPDATE_STATS'='true',
'STATS_GENERATED_VIA_STATS_TASK'='true',
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='2',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"theDate","type":"string","nullable":true}...
'spark.sql.sources.schema.part.1'='{"name":"partitionDate","type":"string","nullable":true}...',
'spark.sql.sources.schema.partCol.0'='partitionDate')

If you are running Hive on Tez execution engine, try
set hive.tez.dynamic.partition.pruning=true;
Read more details and related configuration in the Jira HIVE-7826
and at the same time try to rewrite as a LEFT SEMI JOIN:
select *
from myTable t
left semi join (select distinct reportDate from thatTable) s on t.partitionDate = s.reportDate
If nothing helps, see this workaround: https://stackoverflow.com/a/56963448/2700344
Or this one: https://stackoverflow.com/a/53279839/2700344
Similar question: Hive Query is going for full table scan when filtering on the partitions from the results of subquery/joins

Related

IF Conditional to Run Schedule Query

I'm using BigQuery. I have a query-scheduler to generate a table (RESULT TABLE) that depends on another table (SOURCE TABLE). The case is, this source table doesn't always have data, there's a possibility that this source table is empty.
I want to Schedule the Query to make the RESULT TABLE only if there's data in SOURCE TABLE.
The example would be:
IF COUNT(1) FROM data.source_table > 0 THEN RUN:
SELECT *
FROM data.source_table
LEFT JOIN data.other_source_table
ELSE [Don't Run]
Thanks in Advance
The syntax is
IF condition THEN [sql_statement_list]
[ELSEIF condition THEN sql_statement_list]
[ELSEIF condition THEN sql_statement_list]...
[ELSE sql_statement_list]
END IF;
So for your case it's
IF COUNT(1) FROM data.source_table > 0
THEN
SELECT *
FROM data.source_table
LEFT JOIN data.other_source_table;
END IF;
For more details, you can read https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#if
At the moment you can't set a destination table when using BigQuery Scripting. It means that solutions based on IF statement will not work for your case.
Besides that, it seems that when you set a destination table, BigQuery creates the table before your query's execution, which means that independently of the results, the table will be created.
The query below is only SQL. In other words, it doesn't contains scripting. If you use it to create a scheduled query and set a destination table, you will see that even when the sub query is not run an empty table will be created.
SELECT
*
FROM
UNNEST(
(SELECT
(
CASE (SELECT COUNT(1) FROM data.source_table) > 0
WHEN TRUE
THEN (
SELECT ARRAY(
SELECT AS STRUCT *
FROM data.source_table
LEFT JOIN data.other_source_table)
)
END
)
)
)
As a workaround, you could keep your existing scheduled query and create another scheduled query just like below to run some minutes after the first one:
IF (SELECT count(1) FROM `dataset.destination_table`) = 0
THEN DROP TABLE `dataset.destination_table`;
END IF
To summarize, your solution would be:
Run a scheduled query that will create a destination table,
A few minutes later, run a scheduled query that will check if the created table is empty. If so, the table will be deleted.
I hope it helps

Trying to create multiple temporary tables in a single query

I'd like to create 3-4 separate temporary tables within a single BigQuery query (all of the tables are based on different data sources) and then join them in various ways later on in the query.
I'm trying to do this by using multiple WITH statements, but it appears that you're only able to use one WITH statement in a query if you're not nesting them. Each time I've tried, I get an error saying that a 'SELECT' statement is expected.
Am I missing something? I'd prefer to do this all in one query if at all possible.
I don't know what you mean by "temporary tables", but I suspect you mean common table expressions (CTEs).
Of course you can have a query with multiple CTEs. You just need the right syntax:
with t1 as (
select . . .
),
t2 as (
select . . .
),
t3 as (
select . . .
)
select *
from t1 cross join t2 cross join t3;
bq mk --table --expiration [INTEGER] --description "[DESCRIPTION]"
--label [KEY:VALUE, KEY:VALUE] [PROJECT_ID]:[DATASET].[TABLE]
Expiration date will make your table temporary. You can create one table at the time with "bq mk" but you can use this in a script.
You can use the DDL, but here also you can only create one table at the time.
{CREATE TABLE | CREATE TABLE IF NOT EXISTS | CREATE OR REPLACE TABLE}
table_name [( column_name column_schema[, ...] )]
[PARTITION BY partition_expression] [OPTIONS(options_clause[, ...])]
[AS query_statement]
If by "temporary table" you meant "subqueries" this is the syntax you have to use:
WITH
subQ1 AS (
SELECT * FROM Roster
WHERE SchoolID = 52),
subQ2 AS (
SELECT SchoolID FROM subQ1)
SELECT DISTINCT * FROM subQ2;
You should be able to use common table expressions without issue. However, if you have large queries with a significant amount of common table expressions/subqueries you may run into resource issues in BQ specifically regarding the ability for BQ to create an execution plan. Temporary tables have helped me in these scenarios, but there are likely better practices.
CREATE TEMP TABLE T1 AS (
WITH
X as (query),
Y as (query),
Z as (query)
SELECT * FROM
(combination_of_above_queries)
);
CREATE TEMP TABLE T2 AS (
WITH
A as (query),
B as (query),
C as (query)
SELECT * FROM
(combination_of_above_queries)
);
CREATE TABLE new_table AS (
SELECT *
FROM (combination of T1 and T2)
)
Apologies, I just saw happened to come across this question and wanted to share what helped me... hope it is helpful for you.
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#temporary_tables

HiveQL: Using query results as variables

in Hive I'd like to dynamically extract information from a table, save it in a variable and further use it. Consider the following example, where I retrieve the maximum of column var and want to use it as a condition in the subsequent query.
set maximo=select max(var) from table;
select
*
from
table
where
var=${hiveconf:maximo}
It does not work, although
set maximo=select max(var) from table;
${hiveconf:maximo}
shows me the intended result.
Doing:
select '${hiveconf:maximo}'
gives
"select max(var) from table"
though.
Best
Hive substitutes variables as is and does not execute them. Use shell wrapper script to get result into variable and pass it to your Hive script.
maximo=$(hive -e "set hive.cli.print.header=false; select max(var) from table;")
hive -hiveconf "maximo"="$maximo" -f your_hive_script.hql
And after this inside your script you can use select '${hiveconf:maximo}'
#Hein du Plessis
Whilst it's not possible to do exactly what you want from Hue -- a constant source of frustration for me -- if you are restricted to Hue, and can't use a shell wrapper as suggested above, there are workarounds depending on the scenario.
When I once wanted to set a variable by selecting the max of a column in a table to use in a query, I got round it like this:
I first put the result into a table comprising two columns, with the (arbitrary word) 'MAX_KEY' in one column and the result of the max calculation in the other, like this:
drop table if exists tam_seg.tbl_stg_temp_max_id;
create table tam_seg.tbl_stg_temp_max_id as
select
'MAX_KEY' as max_key
, max(pvw_id) as max_id
from
tam_seg.tbl_dim_cc_phone_vs_web;
I then added the word 'MAX_KEY' to a sub-query then joined in the above table so I could use the result in the main query:
select
-- *** here is the joined in value from the table being used ***
cast(mxi.max_id + qry.temp_id as string) as pvw_id
, qry.cc_phone_vs_web
from
(
select
snp.cc_phone_vs_web
, row_number() over(order by snp.cc_phone_vs_web) as temp_id
-- *** here is the key being added to the sub-query ***
, 'MAX_KEY' as max_key
from
(
select distinct cc_phone_vs_web from tam_seg.tbl_stg_base_snapshots
) as snp
left outer join
tam_seg.tbl_dim_cc_phone_vs_web as pvw
on snp.cc_phone_vs_web = pvw.cc_phone_vs_web
where
pvw.cc_phone_vs_web is null
) as qry
-- *** here is the table with the select result in being joined in ***
left outer join
tam_seg.tbl_stg_temp_max_id as mxi
on qry.max_key = mxi.max_key
;
Not sure if this is your scenario but maybe it can be adapted. I'm 99% sure you can't just put a select statement directly into a variable in Hue though.
If I am doing something in just Hue I would probably do the temporary table and join method. But if I were using a shall wrapper anyway I would definitely do it there.
I hope this helps.

how to group by data from hive with specific partition?

I have the following:
hive>show partitions TABLENAME
pt=2012.07.28.08
pt=2012.07.28.09
pt=2012.07.28.10
pt=2012.07.28.11
hive> select pt,count(*) from TABLENAME group by pt;
OK
Why can't the group by get the data?
Check if the hive.mapred.mode is set to "strict", if so it'll not allow all partitions to scan for the submitted query. You can set it to nonstrict as below:
hive>set hive.mapred.mode=nonstrict;
I'm not sure whether this caused NO results out of your query, but trying to address it. Do share the results.
Note: You can check the default value for this parameter in hive-default.xml
You can always achive the same using 2 select statements . For ex
Create table table1(
session_id string,
page_id string
)
partitioned by (metrics_date string);
Consider we are have loaded table for 2 partitions
hive>show partitions table1
metrics_date=2012.07.28.08
metrics_date=2012.07.28.09
select * from table1 ;
1212121212 google.com 2012.07.28.08
1212121212 google.com 2012.07.28.09`
Getting number of rows per partition
select metrics_date,count(*) from (
select * from table1 ) temp
group by metrics_date;
To get whole results along with group by ,You can use the below query.
SELECT pt,count(*) OVER (PARTITION BY pt) FROM TABLENAME;
This can be achiened through partition by.

sql select into subquery

I'm doing a data conversion between systems and have prepared a select statement that identifies the necessary rows to pull from table1 and joins to table2 to display a pair of supporting columns. This select statement also places blank columns into the result in order to format the result for the upload to the destination system.
Beyond this query, I will also need to update some column values which I'd like to do in a separate statement operation in a new table. Therefore, I'm interested in running the above select statement as a subquery inside a SELECT INTO that will essentially plop the results into a staging table.
SELECT
dbo_tblPatCountryApplication.AppId, '',
dbo_tblPatCountryApplication.InvId,
'Add', dbo_tblpatinvention.disclosurestatus, ...
FROM
dbo_tblPatInvention
INNER JOIN
dbo_tblPatCountryApplication ON dbo_tblPatInvention.InvId = dbo_tblPatCountryApplication.InvId
ORDER BY
dbo_tblpatcountryapplication.invid;
I'd like to execute the above statement so that the results are dumped into a new table. Can anyone please advise how to embed the statement into a subquery that will play nicely with a SELECT INTO?
You can simply add an INTO clause to your existing query to create a new table filled with the results of the query:
SELECT ...
INTO MyNewStagingTable -- Creates a new table with the results of this query
FROM MyOtherTable
JOIN ...
However, you will have to make sure each column has a name, as in:
SELECT dbo_tblPatCountryApplication.AppId, -- Cool, already has a name
'' AS Column2, -- Given a name to create that new table with select...into
...
INTO MyNewStagingTable
FROM dbo_tblPatInvention INNER JOIN ...
Also, you might like to use aliases for your tables, too, to make code a little more readable;
SELECT a.AppId,
'' AS Column2,
...
INTO MyNewStagingTable
FROM dbo_tblPatInvention AS i
INNER JOIN dbo_tblPatCountryApplication AS a ON i.InvId = a.InvId
ORDER BY a.InvId
One last note is that it looks odd to have named your tables dbo_tblXXX as dbo is normally the schema name and is separated from the table name with dot notation, e.g. dbo.tblXXX. I'm assuming that you already have a fully working select query before adding the into clause. Some also consider using Hungarian notation in your database (tblName) to be a type of anti-pattern to avoid.
If the staging table doesn't exist and you want to create it on insert then try the following:
SELECT dbo_tblPatCountryApplication.AppId,'', dbo_tblPatCountryApplication.InvId,
'Add', dbo_tblpatinvention.disclosurestatus .......
INTO StagingTable
FROM dbo_tblPatInvention
INNER JOIN dbo_tblPatCountryApplication
ON dbo_tblPatInvention.InvId = dbo_tblPatCountryApplication.InvId;
If you want to insert them in a specific order then use try using a sub-query in the from clause:
SELECT *
INTO StagingTable
FROM
(
SELECT dbo_tblPatCountryApplication.AppId, '', dbo_tblPatCountryApplication.InvId,
'Add', dbo_tblpatinvention.disclosurestatus .......
FROM dbo_tblPatInvention
INNER JOIN dbo_tblPatCountryApplication ON
dbo_tblPatInvention.InvId = dbo_tblPatCountryApplication.InvId
order by dbo_tblpatcountryapplication.invid
) a;
Try
INSERT INTO stagingtable (AppId, ...)
SELECT ... --your select goes here