Google BigQuery - Using wildcard table query with date partitioned table? - google-bigquery

I am trying to use wildcard table functions to query a bunch of date-partitioned tables.
This query works:
select * from `Mydataset.fact_table_1` where _partitiontime='2016-09-30' limit 10
This query does not work:
select * from `Mydataset.fact_table_*` where _partitiontime='2016-09-30' limit 10
Is this operation not supported?
If it is not supported what's the best way to read same day's data from multiple date-partitioned tables?

The following statement (legacy SQL, since TABLE_QUERY is a legacy SQL function) should do the trick:
select * from TABLE_QUERY(YOUR_DATASET,'table_id contains "fact_table_"') where _PARTITIONTIME = TIMESTAMP('2016-09-30')

Related

Partition counts of tables across different datasets in BigQuery

I am looking for a way to find the total number of partitions across BigQuery tables in all datasets of a project (a count of partitions per table, to spot ahead of time any table approaching the 4,000-partition limit). Could someone please help me with the query?
Thanks
You can use the INFORMATION_SCHEMA.PARTITIONS metadata view to extract partition information for a whole schema/dataset.
It works as follows:
SELECT
*
FROM
`project.schema.INFORMATION_SCHEMA.PARTITIONS`
In case you want to look at a specific table, you just need to include it in the WHERE clause:
SELECT
*
FROM
`project.schema.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'partitioned_table'
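To answer the original question more directly, you can aggregate that view per table and flag anything approaching the 4,000-partition limit. A sketch (the 3,500 threshold is just an illustrative early-warning cutoff, and note that this view is scoped to one dataset, so you would repeat or script it per dataset):
SELECT
  table_schema,
  table_name,
  COUNT(DISTINCT partition_id) AS partition_count
FROM
  `project.schema.INFORMATION_SCHEMA.PARTITIONS`
GROUP BY
  table_schema, table_name
HAVING
  partition_count > 3500  -- tables getting close to the 4,000-partition limit
ORDER BY
  partition_count DESC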

BigQuery, Wildcard table over non partitioning tables and field based partitioning tables is not yet supported

I'm trying to run a simple query with a wildcard table using standardSQL on BigQuery. Here's the code:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_*`
WHERE _TABLE_SUFFIX BETWEEN '20150518' AND '20210406'
GROUP BY 1
My sharded dataset contains one table per day since 18/05/2015, so the first table is 'dataset_20150518'.
The error is: 'Wildcard table over non partitioning tables and field based partitioning tables is not yet supported, first normal table dataset_test, first column table dataset_20150518.'
I've tried different kinds of selects and aggregations but the error persists. I just want to query all tables in that timeframe.
This is because all tables matched by the wildcard must share the same schema. In your case, the wildcard also matches dataset_test, which does not have the same schema as the others (is dataset_test a partitioned table?).
You should be able to get around this limitation by deleting dataset_test and any other tables with a different schema, or by narrowing the wildcard so it no longer matches them:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_20*`
WHERE _TABLE_SUFFIX BETWEEN '150518' AND '210406'
GROUP BY 1
Official documentation

Snowflake SQL table name wildcard

What is a good way to "select" from multiple tables at once when the list of tables is not known in advance in Snowflake SQL?
Something that simulates
Select * from mytable*
which would fetch same results as
Select * from mytable_1
union
Select * from mytable_2
...
I tried doing this in multiple steps.
show tables like 'mytable%';
set mytablevar = (
  select listagg('select * from "' || "name" || '"', ' union ') as table_
  from table(result_scan(last_query_id()))
);
The idea was to use the variable mytablevar to hold the union of all the tables for a subsequent query, but the result exceeded the 256-character limit on session variables, as the list of tables is quite large.
Even if you did not hit the 256-character limit, the variable alone would not let you query all these tables. How would you use that session variable?
If you have multiple tables that share the same structure and hold similar data you need to query together, why is the data not in one big table? You can use Snowflake's clustering feature to distribute data across micro-partitions based on a specific column.
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html
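As a sketch of that consolidation idea (table and column names here are hypothetical, standing in for the mytable_1, mytable_2, ... shards):
-- One consolidated table, clustered on the column you would otherwise shard by.
CREATE OR REPLACE TABLE mytable (
  source_id STRING,        -- which original shard the row came from
  payload   VARIANT,
  loaded_at TIMESTAMP_NTZ
)
CLUSTER BY (source_id);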
Anyway, you may create a stored procedure which will create/replace a view.
https://docs.snowflake.com/en/sql-reference/stored-procedures-usage.html#dynamically-creating-a-sql-statement
And then you can query that view:
CALL UPDATE_MY_VIEW( 'myview_name', 'table%' );
SELECT * FROM myview_name;
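A minimal sketch of such a procedure (the name UPDATE_MY_VIEW and its exact behaviour are assumptions from the call above, not Snowflake built-ins). It lists the matching tables and rebuilds a UNION ALL view over them; EXECUTE AS CALLER is needed because SHOW commands run with the caller's rights:
CREATE OR REPLACE PROCEDURE UPDATE_MY_VIEW(VIEW_NAME STRING, TABLE_PATTERN STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
  // List the tables matching the pattern, e.g. 'table%'.
  var rs = snowflake.execute({sqlText: "SHOW TABLES LIKE '" + TABLE_PATTERN + "'"});
  var selects = [];
  while (rs.next()) {
    selects.push('SELECT * FROM "' + rs.getColumnValue("name") + '"');
  }
  if (selects.length === 0) {
    return "no matching tables";
  }
  // Rebuild the view as a UNION ALL over every matching table.
  snowflake.execute({
    sqlText: "CREATE OR REPLACE VIEW " + VIEW_NAME + " AS " + selects.join(" UNION ALL ")
  });
  return "view " + VIEW_NAME + " rebuilt over " + selects.length + " tables";
$$;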

Full table scan on partitioned table

I have two tables (pageviews and base_events) that are both partitioned on a date field derived_tstamp. Every night I'm doing an incremental update to the base_events table, querying the new data from pageviews like so:
select
*
from
`project.sp.pageviews`
where derived_tstamp > (select max(derived_tstamp) from `project.sp_modeled.base_events`)
Looking at the query costs, this query scans the full table instead of only the new data; normally it should pick up only yesterday's data.
Do you have any idea what's wrong with the query?
Subqueries in the partition filter trigger a full table scan, because BigQuery can only prune partitions when the filter value is known before the query runs. The solution is to use scripting. I solved my problem with the following query:
declare event_date_checkpoint DATE default (
  select max(date(derived_tstamp)) from `project.sp_modeled.base_events`
);
select
*
from
`project.sp.pageviews`
where derived_tstamp > event_date_checkpoint
More on scripting:
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#declare

How to select data from partitioned tables into partitioned destination tables?

In Google BigQuery, it's straightforward to select across all sharded tables using wildcard operators. For example, I could select all rows from date-sharded tables with something like this:
SELECT * FROM `project.dataset.table_name__*`;
That would give me all results from project.dataset.table_name__20161127,
project.dataset.table_name__20161128, project.dataset.table_name__20161129, etc.
What I don't understand is how to specify partitioned destination tables. How do I ensure that the result set is written to, as an example, project.dataset.dest_table__20161127,
project.dataset.dest_table__20161128, project.dataset.dest_table__20161129?
Thanks in advance!
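No answer is recorded here, but one common approach (sketched below with hypothetical names, assuming a single date-partitioned destination table rather than per-day sharded tables) is to write each day's result to that day's partition using the $YYYYMMDD partition decorator, one query job per day:
# Assumes project:dataset.dest_table already exists as a date-partitioned table;
# the "$20161127" decorator targets that day's partition.
bq query \
  --use_legacy_sql=false \
  --destination_table='project:dataset.dest_table$20161127' \
  --replace \
  'SELECT * FROM `project.dataset.table_name__20161127`'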