How to select data from partitioned tables into partitioned destination tables? - sql

In Google Big Query, it's straightforward to select across all partitioned tables using wildcard operators. For example, I could select all rows from date partitioned tables with something like this:
SELECT * FROM `project.dataset.table_name__*`;
That would give me all results from project.dataset.table_name__20161127,
project.dataset.table_name__20161128, project.dataset.table_name__20161129, etc.
What I don't understand is how to specify partitioned destination tables. How do I ensure that the result set is written to, for example, project.dataset.dest-table__20161127,
project.dataset.dest-table__20161128, project.dataset.dest-table__20161129?
Thanks in advance!
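No answer is shown above, but one hedged sketch: if the destination is a column-partitioned table rather than a set of date-sharded tables, BigQuery routes each inserted row to the partition matching its partitioning column, so a single DML statement can cover all shards. (Writing to one partition of an ingestion-time partitioned table instead uses the `$YYYYMMDD` decorator on the destination table of a load or query job, e.g. `--destination_table='project:dataset.dest_table$20161127'` with the bq CLI.) The table and column names below are assumptions:

```sql
-- Sketch (assumed names): if dest_table is partitioned on a DATE column
-- `event_date`, rows land in the matching partition automatically.
INSERT INTO `project.dataset.dest_table`
SELECT
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS event_date,  -- derive date from shard suffix
  *
FROM
  `project.dataset.table_name__*`;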

Related

Use wildcard query on dataset with models in BigQuery

I have a series of tables that are named {YYYYMM}_{id} and I have ML models that are named {groupid}_cost_model. I'm attempting to collate some data across all the tables using the following query:
SELECT * FROM `mydataset.20*`
The problem I'm having is that I have a model named 200_cost_model and it causes the following error:
Wildcard table over non partitioning tables and field based partitioning tables is not yet supported, first normal table myproject:mydataset.200_cost_model, first column table myproject:mydataset.202001_4544248676.
Is there a way to filter out the models from wildcard queries or am I stuck joining all the tables together?
When using wildcard tables you can use the _TABLE_SUFFIX pseudo column to filter results:
Queries with wildcard tables support the _TABLE_SUFFIX pseudo column
in the WHERE clause. This column contains the values matched by the
wildcard character, so that queries can filter which tables are
accessed. For example, the following WHERE clauses use comparison
operators to filter the matched tables
I have tested this on my side (although only on standard, freshly created tables); filtering on the suffix should let you exclude the models, for example:
SELECT *
FROM
`mydataset.20*`
WHERE
_TABLE_SUFFIX NOT LIKE '%cost_model';
To check all possible _TABLE_SUFFIX values, this worked for me:
select DISTINCT _TABLE_SUFFIX as suffix from `mydataset.20*`
but I am not sure if this will work in your situation.

Partition counts of tables across different datasets in BigQuery

I am looking for a way to find the total number of partitions (a count of partitions per table, to find ahead of time whether any table is hitting the 4000-partition limit) across BigQuery tables in all datasets of a project. Could someone please help me with the query?
Thanks
You can use the INFORMATION_SCHEMA.PARTITIONS metadata view to extract partition information from a whole schema/dataset.
It works as follows:
SELECT
*
FROM
`project.schema.INFORMATION_SCHEMA.PARTITIONS`
In case you want to look at a specific table, you just need to include it in the WHERE clause:
SELECT
*
FROM
`project.schema.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'partitioned_table'
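To get at the 4000-limit question directly, the same view can be aggregated per table. A sketch (the 3,800 warning threshold is an arbitrary margin, not part of the thread):

```sql
-- Count partitions per table in one dataset and flag tables nearing
-- BigQuery's 4,000-partition limit (threshold chosen arbitrarily here).
SELECT
  table_name,
  COUNT(DISTINCT partition_id) AS partition_count
FROM
  `project.schema.INFORMATION_SCHEMA.PARTITIONS`
WHERE
  partition_id IS NOT NULL  -- skip rows without a partition id
GROUP BY
  table_name
HAVING
  partition_count > 3800
ORDER BY
  partition_count DESC;
```

Note that this view is scoped to one dataset, so covering every dataset in a project means repeating the query per dataset.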

BigQuery, Wildcard table over non partitioning tables and field based partitioning tables is not yet supported

I'm trying to run a simple query with a wildcard table using standardSQL on Bigquery. Here's the code:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_*`
WHERE _TABLE_SUFFIX BETWEEN '20150518' AND '20210406'
GROUP BY 1
My sharded dataset contains one table per day since 18/05/2015, so the first table is 'dataset_20150518'.
The error is: 'Wildcard table over non partitioning tables and field based partitioning tables is not yet supported, first normal table dataset_test, first column table dataset_20150518.'
I've tried different kinds of select and aggregations but the error won't fix. I just want to query on all tables in that timeframe.
This is because a wildcard requires all matched tables to have the same schema. In your case, the wildcard also matches dataset_test, which does not have the same schema as the others (is dataset_test a partitioned table?)
You should be able to get around this limitation by deleting dataset_test and any other tables with a different schema, or by narrowing the wildcard so it no longer matches them:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_20*`
WHERE _TABLE_SUFFIX BETWEEN '150518' AND '210406'
GROUP BY 1
Official documentation

How to disallow loading duplicate rows to BigQuery?

I was wondering if there is a way to disallow duplicates from BigQuery?
Based on this article I can deduplicate a whole or a partition of a table.
To deduplicate a whole table:
CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT DISTINCT * FROM `transactions.testdata`;
To deduplicate a table based on partitions defined in a WHERE clause:
MERGE `transactions.testdata` t
USING (
SELECT DISTINCT *
FROM `transactions.testdata`
WHERE date=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND date=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
If there is no way to disallow duplicates then is this a reasonable approach to deduplicate a table?
BigQuery doesn't have a mechanism like the constraints found in traditional DBMSs. In other words, you can't set a primary key or anything like that, because BigQuery is focused not on transactions but on fast analysis and scalability. You should think of it as a data lake, not as a database with a uniqueness property.
If you have an existing table and need to de-duplicate it, the mentioned approaches will work. If you need your table to have unique rows by default and want to programmatically insert unique rows in your table without resorting to external resources, I can suggest you a workaround:
First, insert your data into a temporary table.
Then, run a query on the temporary table and save the results into your actual table. This step could be done programmatically in a few different ways:
Using the approach you mentioned as a scheduled query
Using a bq command such as bq query --use_legacy_sql=false --destination_table=<dataset.actual_table> 'select distinct * from <dataset.temporary_table>', which queries the distinct values in your temporary table and loads the results into the target table given in the --destination_table flag. It's important to mention that this approach also works for partitioned tables.
Finally, drop the temporary table. Like the previous step, this can be done either with a scheduled query or a bq command.
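The steps above can be sketched in SQL alone; table names are placeholders taken from the bq example, and this variant rebuilds the target table wholesale (appending instead would use the bq flags from the previous step):

```sql
-- 1) Load or stream new data into a staging table first (not shown).
-- 2) Rebuild the actual table from the distinct rows of the staging table.
CREATE OR REPLACE TABLE `dataset.actual_table` AS
SELECT DISTINCT * FROM `dataset.temporary_table`;

-- 3) Drop the staging table once the copy succeeds.
DROP TABLE `dataset.temporary_table`;
```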
I hope it helps

BigQuery, date partitioned tables and decorator

I am familiar with using table decorators to query a table, for example, as it was a week ago or for data inserted over a certain date range.
Introducing date-partitioned tables revealed a pseudo column called _PARTITIONTIME. Using a date decorator syntax, you can add records to a certain partition in the table.
I was wondering whether the pseudo column _PARTITIONTIME is also used, behind the scenes, to support table decorators, or something similarly straightforward.
If yes, can it be accessed/changed, as we do with the pseudo column of partitioned tables?
Is it called _PARTITIONTIME or _INSERTIONTIME? Of course, both didn't work. :)
First, check whether the table is indeed partitioned by reading out its partitions:
SELECT TIMESTAMP(partition_id)
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
If it is not, you will get the error: "Cannot read partition information from a table that is not partitioned".
Another important step: to select the value of _PARTITIONTIME, you must use an alias.
SELECT
_PARTITIONTIME AS pt,
field1
FROM
mydataset.table1
But when you use it in a WHERE clause, the alias is not mandatory; it is only required in the SELECT list.
#legacySQL
SELECT
field1
FROM
mydataset.table1
WHERE
_PARTITIONTIME > DATE_ADD(TIMESTAMP('2016-04-15'), -5, "DAY")
You can always reference a single partition of a partitioned table with a decorator: mydataset.table1$20160519
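As a hedged legacy-SQL sketch of that decorator, reusing the table1 name from the examples above:

```sql
#legacySQL
-- Scan only the 2016-05-19 partition instead of the whole table.
SELECT
  field1
FROM
  [mydataset.table1$20160519]
```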