Aggregating last 30 days data in BigQuery - google-bigquery

I am checking the feasibility of moving from Redshift to BigQuery. I need help in implementing the below use case on BigQuery.
We have a by-day product performance table, which is a date-partitioned table called product_performance_by_day. There is a row for every product that was sold each day. Every day we process the data at the end of the day and put it in the partition for that day. Then we aggregate this by-day performance data over the last 30 days and put it in a table called product_performance_last30days. This aggregation saves querying time and, in the case of BigQuery, will save cost as well since it will scan less data.
Here is how we do it in Redshift currently -
We put the aggregated data in a new table, e.g. product_performance_last30days_temp. Then we drop the product_performance_last30days table and rename product_performance_last30days_temp to product_performance_last30days. So there is very minimal downtime for the product_performance_last30days table.
How can we do the same thing in BigQuery?
Currently, BigQuery does not support renaming tables, materialized views, or table aliases. And since we want to save the aggregated data in the same table every day, we cannot use a destination table if the table is not empty.

You can overwrite the same table by using writeDisposition, which specifies the action that occurs if the destination table already exists.
The following values are supported:
WRITE_TRUNCATE: If the table already exists, BigQuery overwrites the table data.
WRITE_APPEND: If the table already exists, BigQuery appends the data to the table.
WRITE_EMPTY: If the table already exists and contains data, a 'duplicate' error is returned in the job result.
The default value is WRITE_EMPTY.
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
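If you run the aggregation as a scheduled SQL statement instead of setting writeDisposition on a query job, a CREATE OR REPLACE TABLE ... AS SELECT gives the same atomic-overwrite behavior as WRITE_TRUNCATE. The table and column names below are placeholders based on the question, not your actual schema:

-- Placeholder names; atomically replaces the aggregate table in a single job.
CREATE OR REPLACE TABLE `my_dataset.product_performance_last30days` AS
SELECT
  product_id,
  SUM(units_sold) AS units_sold,   -- placeholder metric columns
  SUM(revenue)    AS revenue
FROM `my_dataset.product_performance_by_day`
WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY product_id;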
For renaming tables, look at this answer.

Related

Get column timestamps for when they were added in BigQuery

I'm trying to find which new columns were added to a table. Is there any way to find this? I was thinking of getting all the columns for a table with timestamps of when they were created or modified, so that I can filter out the new columns.
With INFORMATION_SCHEMA.SCHEMATA I only get the table creation and modification dates, but nothing for the columns.
With INFORMATION_SCHEMA.COLUMNS I am able to get all the column names and their information, but no details about their creation or modification timestamps.
My table doesn't have a snapshot so I can't compare it with the previous version to get changes.
Is there any way to capture this?
According to the BigQuery columns documentation, this is not metadata currently captured by BigQuery.
A possible solution would be to go into the BigQuery logs to see when and how tables were updated. Source control over the schemas and scripts that create these tables could also give you insight into how and when columns may have been added.
As @RileyRunnoe mentioned, this kind of metadata is not captured by BQ, and a possible solution is to dig into the Audit Logs. Prior to doing this, you should have created a BQ sink that points to the dataset. See creating a sink for more details.
When the sink is created, all executed operations will store data usage logs in the table cloudaudit_googleapis_com_data_access_YYYYMMDD and activity logs in the table cloudaudit_googleapis_com_activity_YYYYMMDD under the BigQuery dataset you selected in your sink. Keep in mind that you can only track usage starting from the date when you set up the log export tables.
The query below has a CTE that reads from cloudaudit_googleapis_com_data_access_*, since this is where data changes are logged, and keeps only completed jobs by filtering for jobservice.jobcompleted. The outer query then selects queries that contain "COLUMN" and excludes queries whose destination dataset starts with an underscore (anonymous result datasets, such as the one used by the query we are about to run).
WITH CTE AS (
  SELECT
    protopayload_auditlog.methodName,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query AS query,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatus.state AS status,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.datasetId AS dataset,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.tableId AS table,
    timestamp
  FROM `my-project.dataset_name.cloudaudit_googleapis_com_data_access_*`
  WHERE protopayload_auditlog.methodName = 'jobservice.jobcompleted'
)
SELECT
  query,
  REGEXP_EXTRACT(query, r'ADD COLUMN (\w+) \w+') AS column,
  table,
  timestamp,
  status
FROM CTE
WHERE query LIKE '%COLUMN%'
  AND NOT REGEXP_CONTAINS(dataset, r'^_')
ORDER BY timestamp DESC

DATA_CONSISTENCY_CHECK is on in my table, but the temporal table still inserts another row for the same data update. How can I restrict this in T-SQL?

DATA_CONSISTENCY_CHECK is on in my table. I'm trying to check data consistency for audit purposes. When I update the main table with the same value, the temporal table still keeps a history row for it, which makes it difficult to track version changes. I'm using MS SQL Server.
You misunderstood the function of the DATA_CONSISTENCY_CHECK option. It is used to check that the time ranges defined by the system_start_time_column_name and system_end_time_column_name columns in PERIOD FOR SYSTEM_TIME do not overlap between the base and history tables when you enable the link between them (this is done when you execute a CREATE/ALTER TABLE command).
If you need data deduplication in the history table, you have to implement it yourself. It can be a maintenance task which disables the link, removes duplicates, updates the time range columns correctly, and then re-enables the link between the base and history tables.
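A rough sketch of such a maintenance task (all object and column names are placeholders; stitching the surviving rows' time ranges back together is left out):

-- Placeholder names: base table dbo.MyTable, history table dbo.MyTableHistory,
-- period columns ValidFrom/ValidTo, business columns Id, Col1, Col2.
ALTER TABLE dbo.MyTable SET (SYSTEM_VERSIONING = OFF);

-- Keep the earliest history row per unchanged set of business values,
-- delete the redundant duplicates.
WITH dups AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY Id, Col1, Col2
               ORDER BY ValidFrom) AS rn
    FROM dbo.MyTableHistory
)
DELETE FROM dups WHERE rn > 1;

-- Re-link the tables; the time range columns must be consistent at this point.
ALTER TABLE dbo.MyTable
    SET (SYSTEM_VERSIONING = ON (
        HISTORY_TABLE = dbo.MyTableHistory,
        DATA_CONSISTENCY_CHECK = ON));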

Keeping track of mutated rows in BigQuery?

I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save an intermediate state of the table using the time travel feature, or to store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within the last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
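For the "store only the diffs" idea, a minimal sketch (the table name is a placeholder, and it assumes all columns are comparable and the snapshot is within the 7-day time-travel window):

-- Rows present now that were not present an hour ago (placeholder table name).
SELECT * FROM `my_dataset.my_table`
EXCEPT DISTINCT
SELECT * FROM `my_dataset.my_table`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);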
An approach would be to have 3 tables:
a base table in "append only" mode: only inserts are added, and updates are written as full rows, so this table holds every version of every record, like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete if a special column is kept in the first table)
a live table where you hold the current data (in this table you would run your MERGE statements, most probably sourced from the base table).
If you choose partitioning and clustering, you can benefit a lot from the discounted long-term storage price and scan less data.
If the table is large but the amount of data updated per day is modest, then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases, e.g. the first check of the day should filter for last_updated_date being either today or yesterday.
Depending on how modest the amount of data updated throughout a day is, even repeatedly querying the entire table all day could be affordable, because the BQ engine will scan only one daily partition.
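For illustration, one possible layout could be the following (names are placeholders, and it assumes the last_updated column discussed below has already been added):

-- Placeholder names; partition on the date portion and cluster on the exact
-- timestamp so the update checks prune most of the table.
CREATE TABLE `my_dataset.big_table_partitioned`
PARTITION BY last_updated_date
CLUSTER BY last_updated AS
SELECT *, DATE(last_updated) AS last_updated_date
FROM `my_dataset.big_table`;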
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
I inferred from this that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones), but that you can modify the table UPDATE statements so that this column gets added to newly modified rows.
Therefore I assumed you can modify the updates further to set the additional last_updated_date column which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old this row is, it will acquire two new columns, last_updated and last_updated_date - unless both columns have already been added by a previous update, in which case the two columns will be updated rather than added. If there are several updates to the same row between the update checks, then the latest update will still make the row discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check will be stored in last_checked; where this piece of data is held (a table, durable config) is implementation-dependent.
determine whether the current check is the first check of the day. If so, then additionally search for last_updated_date=yesterday AND last_updated>last_checked.
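A minimal sketch of such a check (dataset and table names are placeholders, and @last_checked is a query parameter holding the time of the previous check):

-- Covering yesterday's partition as well handles the first check of the day.
SELECT *
FROM `my_dataset.big_table_partitioned`
WHERE last_updated_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND last_updated > @last_checked;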
Note 1: If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause a full table scan. And, subject to the 'modest' assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2: The downside of this approach is that the checks for updates will not find rows that were updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the __NULL__ partition together with rows that were never updated.) But I assume that until the changes to the UPDATE statements are made, it is impossible to distinguish between updated rows and non-updated ones anyway.
Note 3: This is an explanatory concept. In a real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning, clustering (with partitioning on a fake column), or both.
The detailed explanation of the initial answer (above the P.S.) ends here.
Note 4:
clustering only helps performance
From the point of view of avoiding a table scan and achieving a reduction in data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5:
In the comment you mentioned that there is already some partitioning in place. I'd suggest examining whether the existing partitioning is indispensable, or whether it can be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data ultimately needs to end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have an identical structure to MyData, except that MyDataStaging has an additional TIMESTAMP field, "batch_timestamp". This timestamp allows me to determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
Dataflow pushes data directly to MyDataStaging, along with a Timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging into MyDataUpdate (MyDataUpdate will now always contain only a unique list of rows/values that have been changed). Then the process upserts/merges from MyDataUpdate into MyData, and the data is also exported and downloaded to be loaded into PostgreSQL. The staging/update tables are then emptied appropriately.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.
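A rough sketch of the staging-to-update step described above (the key and value columns are placeholders; batch_timestamp is used to keep only the latest version of each row):

-- Placeholder columns (id, value); dedupe the staging rows by batch_timestamp,
-- then upsert them into MyDataUpdate.
MERGE `my_dataset.MyDataUpdate` AS T
USING (
  SELECT id, value
  FROM (
    SELECT id, value,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY batch_timestamp DESC) AS rn
    FROM `my_dataset.MyDataStaging`
  )
  WHERE rn = 1
) AS S
ON T.id = S.id
WHEN MATCHED THEN UPDATE SET value = S.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (S.id, S.value);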

Old rows left unpartitioned in partitioned table

I'm working with a BigQuery partitioned table. The partition is based on a Timestamp column in the data (rather than ingestion-based). We're streaming data into this table at a rate of several million rows per day.
We noticed that our queries based on specific days were scanning much more data than they should in a partitioned table.
Here is the current state of the UNPARTITIONED partition:
I'm assuming that little blip at the bottom-right is normal (streaming buffer for the rows inserted this morning), but there is this massive block of data between mid-November and early-December that lives in the UNPARTITIONED partition, instead of being sent to the proper daily partitions (the partitions for that period don't appear to exist at all in __PARTITIONS_SUMMARY__).
My two questions are:
Is there a particular reason why these rows would not have been partitioned correctly, while data before and after that period is fine?
Is there a way to 'flush' the UNPARTITIONED partition, i.e. force BigQuery to dispatch the rows to their correct daily partition?
I faced a similar type of issue where a lot of rows stayed unpartitioned in a column-based partitioned table. What I observed is that some records were not partitioned because of the source of the streaming insert. As a solution, I updated the table with an UPDATE statement, setting a partitioning date wherever the partitioning column was NULL. To be on the safe side, make sure the partitioning date column is not nullable.
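A minimal sketch of that fix-up, assuming the table is partitioned on an event_timestamp column and has some other timestamp column to fall back on (all names are placeholders):

-- Placeholder names; back-fill the partitioning column so these rows get
-- dispatched to proper daily partitions.
UPDATE `my_dataset.events`
SET event_timestamp = inserted_at
WHERE event_timestamp IS NULL;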

Comparing yesterday's data with today's data

I have 2 parquet tables, one for today and one for yesterday. What I want to do is compare what has changed in today's table, e.g.:
which new rows have been added
which rows have been deleted and when they have been deleted
which rows have been changed
The tables themselves have columns "createdAt" and "updatedAt" which I can use for this purpose.
I'm working with Databricks/Apache Spark so I can either use their built-in functions or an SQL query. I'm not sure how to go about this, any general ideas are appreciated!
Maintain an audit table behind your main table. Data must be inserted into the audit table whenever you perform an insert, update, or delete on your main table. The audit table should include the createdAt of the main table and the current date-stamp.
If you encode the transaction type (insert, update, or delete) as 1, 2, 3, it will be good for query performance.
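For illustration, such an audit table might look like this on Databricks (all names are placeholders; txn_type follows the 1/2/3 encoding above):

-- Placeholder schema for the audit table described above.
CREATE TABLE IF NOT EXISTS audit_main (
  id         BIGINT,     -- key of the affected row in the main table
  createdAt  TIMESTAMP,  -- copied from the main table
  audit_ts   TIMESTAMP,  -- when the audit row was written
  txn_type   INT         -- 1 = insert, 2 = update, 3 = delete
) USING DELTA;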
As I don't know the load type (full or delta) for your table, I will try to cover both scenarios:
Full Load -
For this, you only need today's table, as it will contain all of the previous days' records as well.
Hence you only need a condition that checks for all the records modified after yesterday's load, using the updatedAt column (see the sketch after the Delta Load case), i.e.
updatedAt > yesterday's load date
Delta Load -
For delta, each day you will get only the modified records (new, updated, or deleted), hence simply querying today's table without any condition will serve the purpose.
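For the full-load case, a minimal Spark SQL sketch (the table name and load date are placeholders):

-- Placeholder table name and date; picks up all rows touched since yesterday's load.
SELECT *
FROM todays_table
WHERE updatedAt > to_timestamp('2021-06-01 00:00:00');  -- yesterday's load date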
Now, on the Spark side, as you have a large number of records, you can adjust the number of DataFrame partitions at runtime using something like the below:
spark.sql("set spark.sql.shuffle.partitions = 1500");
please find other optimization techniques here
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/