BigQuery scheduled query to load data to a particular partition - google-bigquery

I am using the BigQuery scheduled query functionality to run a query every 30 minutes.
My destination table is a partitioned table, and the partitioning column is 'event_date'.
The scheduled query copies today's data from source_table -> dest_table
(like select * from source_table where event_date = CURRENT_DATE())
every 30 minutes,
but I would like it to write-truncate the existing partition without write-truncating the whole table (since I don't want to duplicate today's data every 30 minutes).
Currently, when I schedule this query with partition_field set to event_date and WRITE_TRUNCATE, it truncates the whole table, and this causes the previous data to be lost. Is there something else that I am missing?

Instead of specifying a destination table, you may use MERGE to truncate only one partition.
It is unfortunately more expensive, since you also pay for deleting the data from dest_table. (The insert is still free.)
MERGE dest_table t
USING (
  -- only today's rows, matching the scheduled SELECT
  SELECT * FROM source_table WHERE event_date = CURRENT_DATE()
) s
ON FALSE
-- delete today's partition in the destination ...
WHEN NOT MATCHED BY SOURCE AND t.event_date = CURRENT_DATE() THEN DELETE
-- ... then insert today's rows from the source
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
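If your scheduled query can instead run as a multi-statement script (with no destination table configured), a DELETE followed by an INSERT achieves the same partition-level overwrite. A minimal sketch, assuming dest_table is partitioned on event_date:
-- clear only today's partition, then reload it from the source;
-- neither statement touches previous partitions
DELETE FROM dest_table WHERE event_date = CURRENT_DATE();
INSERT INTO dest_table
SELECT * FROM source_table WHERE event_date = CURRENT_DATE();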


Get the most recent Timestamp value

I have a pipeline which reads from a BigQuery table, performs some processing on the data, and saves it into a new BigQuery table. This is a batch process performed on a weekly basis through a cron. Entries keep being added to the source table, so whenever I start the ETL process I want it to only process the rows which have been added since the last time the ETL job was launched.
In order to achieve this, I have thought about making a query to my sink table asking for the most recent timestamp it contains. Then, as a data source, I will perform another query to the source table, filtering for the entries having a timestamp higher than the one I have just recovered. Both my source and sink tables are time-partitioned.
The query I am using for getting the latest entry on my sink table is the following one:
SELECT Timestamp
FROM `myproject.mydataset.mytable`
ORDER BY Timestamp DESC
LIMIT 1
It gives me the correct value, but I feel like it is not the most efficient way of querying it. Does this query take advantage of the partitioned feature of my table? Is there any better way of retrieving the most recent timestamp from my table?
I'm going to refer to the timestamp field as ts_field for your example.
To get the latest timestamp, I would run the following query:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
If your table is also partitioned on the timestamp field, you can do something like this to scan even less bytes:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
WHERE date(ts_field) = current_date()
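One caveat: if no rows have arrived today, the date-filtered query returns NULL. A hedged variant (the 3-day window is an assumption; widen it to match your ingestion cadence) keeps the partition pruning while tolerating quiet days:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
-- scan only the last few daily partitions instead of the whole table
WHERE date(ts_field) >= DATE_SUB(current_date(), INTERVAL 3 DAY)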

How can I avoid and/or clean duplicated rows in BigQuery?

How should I import data into BigQuery on a daily basis when I have potential duplicated rows?
Here is a bit of context. I'm updating data on a daily basis from a spreadsheet to BigQuery. I'm using Google Apps Script with a simple WRITE_APPEND method.
Sometimes I'm importing data I've already imported the day before, so I'm wondering how I can avoid this.
Can I build a SQL query in order to clean my table from duplicate rows every day? Or is it possible to detect duplicates even before importing them (with some specific command in my job definition, for example)?
Thanks!
Step 1: Have a sheet with data to be imported
Step 2: Set up your spreadsheet as a federated data source in BigQuery.
Step 3: Use DML to load data into an existing table
(requires #standardSql)
#standardSQL
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT *
FROM `fh-bigquery.tt.test_import_sheet`
WHERE id NOT IN (
  SELECT id
  FROM `fh-bigquery.tt.test_import_native`
)
WHERE id NOT IN (...) ensures that only rows with new ids are loaded into the table.
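One hedge worth noting: NOT IN returns no rows at all if the subquery produces any NULL id. Assuming id can be NULL in the native table, a LEFT JOIN anti-join is the safer equivalent:
#standardSQL
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT sheet.id, sheet.data
FROM `fh-bigquery.tt.test_import_sheet` sheet
LEFT JOIN `fh-bigquery.tt.test_import_native` native
  ON sheet.id = native.id
WHERE native.id IS NULL  -- only ids not already present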
As far as I know, the answer provided by Felipe Hoffa is the most effective way to avoid duplicate rows, since BigQuery does not normalize data when loading it. The reason is that BigQuery performs best with denormalized data [1]. To better understand it, I'd recommend you have a look at this SO thread.
I would also like to suggest using a SQL aggregate or analytic function to clean the duplicate rows in a BigQuery table, as in Felipe Hoffa's or Jordan Tigani's answers in this SO question.
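A minimal sketch of that analytic-function approach, assuming id is the column that identifies duplicates (it keeps one arbitrary row per id):
#standardSQL
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id) AS rn
  FROM `fh-bigquery.tt.test_import_native`
)
WHERE rn = 1  -- keep a single row per id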
If you have a large partitioned table and only want to remove duplicates in a given range, without scanning (cost saving) and replacing the whole table, use the MERGE SQL below:
-- WARNING: back up the table before this operation
-- FOR large-size timestamp-partitioned tables
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partitioned table, using surrogate_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `your_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    -- keep one arbitrary row per surrogate_key within the range
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
    FROM `your_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
  THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a

BigQuery - Delete rows from Partitioned Table

I have a Day-Partitioned Table on BigQuery. When I try to delete some rows from the table using a query like:
DELETE FROM `MY_DATASET.partitioned_table` WHERE id = 2374180
I get the following error:
Error: DML statements are not yet supported over partitioned tables.
A quick Google search leads me to: https://cloud.google.com/bigquery/docs/loading-data-sql-dml where it also says: "DML statements that modify partitioned tables are not yet supported."
So for now, is there a workaround that we can use for deleting rows from a partitioned table?
DML has some known issues/limitations in this phase, such as:
DML statements cannot be used to modify tables with REQUIRED fields in their schema.
Each DML statement initiates an implicit transaction, which means that changes made by the statement are automatically committed at the end of each successful DML statement. There is no support for multi-statement transactions.
The following combinations of DML statements are allowed to run concurrently on a table:
UPDATE and INSERT
DELETE and INSERT
INSERT and INSERT
Otherwise one of the DML statements will be aborted. For example, if two UPDATE statements execute simultaneously against the table then only one of them will succeed.
Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements. To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer (see the sketch after this list). If it is absent, the table can be modified using UPDATE or DELETE statements.
DML statements that modify partitioned tables are not yet supported.
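A hedged one-liner for that streamingBuffer check, assuming the bq CLI and a hypothetical table mydataset.mytable:
# prints the streamingBuffer section if present; no output means the table
# currently has no streaming buffer and can be modified with UPDATE/DELETE
bq show --format=prettyjson mydataset.mytable | grep -A 3 '"streamingBuffer"'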
Also be aware of the quota limits
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
What you can do is copy the entire partition to a non-partitioned table and execute the DML statement there, then write the temp table back to the partition (a sketch follows). Also, if you run into the UPDATE/DELETE statement limit per day per table, you need to create a copy of the table and run the DML on the new table to avoid the limit.
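A minimal sketch of that copy-out / write-back workaround with the bq CLI, assuming hypothetical names mydataset.mytable, partition 20180101, and a scratch table mydataset.mytable_temp:
# copy one partition out to a regular table
bq cp 'mydataset.mytable$20180101' mydataset.mytable_temp
# run the DML against the non-partitioned copy
bq query --use_legacy_sql=false 'DELETE FROM mydataset.mytable_temp WHERE id = 2374180'
# overwrite the original partition with the cleaned copy, then clean up
bq cp -f mydataset.mytable_temp 'mydataset.mytable$20180101'
bq rm -f -t mydataset.mytable_temp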
You could delete partitions in partitioned tables using the command-line bq rm, like this:
bq rm 'mydataset.mytable$20160301'
I've already done it without a temporary table. Steps:
1) Prepare a query which selects all the rows from the particular partition which should be kept:
SELECT * FROM `your_data_set.tablename`
WHERE _PARTITIONTIME = TIMESTAMP('2017-12-07')
AND condition_to_keep_rows_which_should_not_be_deleted = 'condition'
If necessary, run this for other partitions.
2) Choose a destination table for the result of your query, where you point TO THE PARTICULAR PARTITION; you need to provide a table name like this:
tablename$20171207
3) Check the option "Overwrite table" -> it will overwrite only the particular partition.
4) Run the query; as a result, the redundant rows will be deleted from the pointed partition!
// remember that you may need to run this for other partitions, if the rows to be deleted are spread across more than one partition (a CLI equivalent of steps 2-4 is sketched below)
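A hedged CLI equivalent of steps 2-4 (legacy SQL, the bq default; the keep-condition placeholder comes from step 1):
bq query --destination_table='your_data_set.tablename$20171207' --replace \
  "SELECT * FROM [your_data_set.tablename]
   WHERE _PARTITIONTIME = TIMESTAMP('2017-12-07')
   AND condition_to_keep_rows_which_should_not_be_deleted = 'condition'"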
Looks like, as of my writing, this is no longer a BigQuery limitation!
In standard SQL, a statement like the one above, run over a partitioned table, will succeed, assuming the rows being deleted weren't inserted recently (within the last 30 minutes) via a streaming insert.
Current docs on DML: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language
Example Query that worked for me in the BQ UI:
DELETE
FROM dataset_name.partitioned_table_on_timestamp_column
WHERE
timestamp >= '2020-02-01' AND timestamp < '2020-06-01'
After the hamsters are done spinning, we get the BQ response:
This statement removed 101 rows from partitioned_table_on_timestamp_column

BigQuery - How to keep partition in target table

I need to select rows from a partitioned table and save the result into another table. How can I keep the records' __PARTITIONTIME the same as they are in the source table? I mean, not only keep the value of __PARTITIONTIME, but the whole partition feature, so that I can do further queries on the target table using time decorators and the like.
(I'm using Datalab notebooks)
%%sql -d standard --module TripData
SELECT
  HardwareId,
  TripId,
  StartTime,
  StopTime
FROM
  `myproject.mydataset.TripData`
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR), DAY)
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY)
You cannot do this for multiple partitions at once!
You should do it one partition at a time, specifying the target partition - targetTable$yyyymmdd (see the sketch below).
Note: first you need to create the target table as a partitioned table with the respective schema.
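For example, a hedged shell sketch of that one-partition-at-a-time copy (TripDataCopy is a hypothetical, pre-created partitioned target; legacy SQL, the bq default):
for day in 2017-12-01 2017-12-02 2017-12-03; do
  suffix=${day//-/}  # 2017-12-01 -> 20171201
  bq query --destination_table="mydataset.TripDataCopy\$${suffix}" --replace \
    "SELECT HardwareId, TripId, StartTime, StopTime
     FROM [myproject.mydataset.TripData]
     WHERE _PARTITIONTIME = TIMESTAMP('${day}')"
done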

In time-partitioned BigQuery tables, when is data written to __UNPARTITIONED__? What are the effects?

I ran into some freak undocumented behavior of time-partitioned bigquery tables:
I created a time-partitioned table in BigQuery and inserted data.
I was able to insert normally - data was written to today's partition (I was also able to explicitly specify a partition and write into it)
After some tests with new data, I deleted today's partition in order to have clean data (CLI):
bq --project_id=my-project rm 'v1.mytable$20160613'
I then checked whether it's empty:
select count(*) from [v1.mytable]
Result 270 instead of 0
I tried deleting again and rerunning the query - same result.
So I queried
select count(*) from [v1.mytable$20160613]
Result 0
and likewise a couple of previous dates on which I may have inserted data, but all were 0.
Finally I ran
SELECT partition_id from [v1.mytable$__PARTITIONS_SUMMARY__];
and the result was
UNPARTITIONED, 20160609, 20160613
and all the data was in fact in UNPARTITIONED
My questions:
When is the data written to this special partition instead of the daily partition, and how can I avoid this?
Are there other effects, besides losing the ability to address specific dates (in queries, or when deleting data, etc.)? Should I take care of this case?
While data is in the streaming buffer, it remains in the UNPARTITIONED partition. To address this partition in a query, you can use the value NULL for the _PARTITIONTIME pseudo column.
SELECT ... FROM mydataset.mypartitioned_table WHERE _PARTITIONTIME IS NULL
To delete data for a given partition, we suggest doing a write truncate to it with a query that returns an empty result. For example:
bq query --destination_table=mydataset.mypartitionedtable\$20160121 --replace 'SELECT 1 as field1, "one" as field2 FROM (SELECT 1 as field1, "one" as field2) WHERE FALSE'
Note that the partition will still be around (if you do a SELECT from table$__PARTITIONS_SUMMARY__), but it will have 0 rows.
$ bq query 'SELECT COUNT(*) from [mydataset.mypartitionedtable$20160121]'
+-----+
| f0_ |
+-----+
| 0 |
+-----+
This is a temporary state -- querying an hour later, the records all belonged to today's partition.
The effect is thus similar to a delay in the data write: querying immediately after the insert may not show the most recent data in the correct partition, but eventually it will.