BigQuery Put Query Result into Day Partition Table - google-bigquery

BigQuery can save a query result into a specified table, but if the target table is day-partitioned I currently use a Python loop to query one day of data at a time and save it into the matching partition. Is there a faster way?
Thanks!

There are two related feature requests that you can vote for and monitor for progress: Update date-partitioned tables from results of a query and Partition on non-date field.
Meantime, conceptually, the way you approach this (using a loop) is correct and the only way as of now (August 2017).
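For reference, here is a minimal sketch of that loop with the google-cloud-bigquery Python client. The project, dataset, table and column names (my-project, my_dataset, source_table, target_table, event_ts) and the date range are placeholders, not anything from the question; each iteration writes one day of results into the corresponding partition via the $YYYYMMDD decorator.

from datetime import date, timedelta
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

day, end = date(2017, 8, 1), date(2017, 8, 7)   # placeholder date range
while day <= end:
    # Target the day's partition with the $YYYYMMDD decorator and replace its contents.
    # Assumes target_table already exists as a day-partitioned table.
    destination = bigquery.TableReference.from_string(
        "my-project.my_dataset.target_table${:%Y%m%d}".format(day)
    )
    job_config = bigquery.QueryJobConfig(
        destination=destination,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    sql = """
        SELECT *
        FROM `my-project.my_dataset.source_table`
        WHERE DATE(event_ts) = '{:%Y-%m-%d}'
    """.format(day)
    client.query(sql, job_config=job_config).result()  # wait for the job to finish
    day += timedelta(days=1)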

Related

Get column timestamps when they were added in BigQuery

I'm trying to find out which new columns were added to a table. Is there any way to do this? I was thinking of getting all columns of a table with the timestamps of when they were created or modified, so that I can filter out the new columns.
With INFORMATION_SCHEMA.SCHEMATA I only get the table creation and modification dates, not the columns'.
With INFORMATION_SCHEMA.COLUMNS I can get all column names and their information, but no details about their creation or modification timestamps.
My table doesn't have a snapshot, so I can't compare it with a previous version to get the changes.
Is there any way to capture this?
According to the BigQuery columns documentation, this is not metadata currently captured by BigQuery.
A possible solution would be to go into the BigQuery logs to see when and how tables were updated. Source control over the schemas and scripts that create these tables could also give you insight into how and when columns may have been added.
As @RileyRunnoe mentioned, this kind of metadata is not captured by BQ, and a possible solution is to dig into the Audit Logs. Prior to doing this, you should have created a BQ sink that points to the dataset; see creating a sink for more details.
Once the sink is created, all executed operations store data-usage logs in the table cloudaudit_googleapis_com_data_access_YYYYMMDD and activity logs in the table cloudaudit_googleapis_com_activity_YYYYMMDD under the BigQuery dataset you selected in your sink. Keep in mind that you can only track usage starting from the date when you set up the log export tables.
The query below has a CTE that reads from cloudaudit_googleapis_com_data_access_*, since that table logs the data changes, and keeps only completed jobs by filtering on jobservice.jobcompleted. The outer query then returns queries that contain "COLUMN" and excludes queries without a destination table (such as the query we are about to run).
WITH CTE AS (
  SELECT
    protopayload_auditlog.methodName,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query AS query,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatus.state AS status,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.datasetId AS dataset,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.tableId AS table,
    timestamp
  FROM `my-project.dataset_name.cloudaudit_googleapis_com_data_access_*`
  WHERE protopayload_auditlog.methodName = 'jobservice.jobcompleted'
)
SELECT
  query,
  REGEXP_EXTRACT(query, r'ADD COLUMN (\w+) \w+') AS column,
  table,
  timestamp,
  status
FROM CTE
WHERE query LIKE '%COLUMN%'
  AND NOT REGEXP_CONTAINS(dataset, r'^_')
ORDER BY timestamp DESC

Partition a BigQuery table with more than 4000 days of data?

I have about 11 years of data in a bunch of Avro files. I wanted to partition by the date of each row, but from the documentation it appears I can't, because there are too many distinct dates.
Does clustering help with this? The natural clustering key for my data would still have some values with data spanning more than 4,000 days.
Two solutions I see:
1)
Combine table sharding (per year) with time partitioning based on your column. I never tested that myself, but it should work, as every shard is seen as a new table in BQ.
With that you are able to easily address the shard plus the partition with one wildcard/variable.
2)
A good workaround is to create an extra column with the date of the field on which you want to partition.
For every entry older than 9 years (e.g. DATE_DIFF(current_date(), DATE('2009-01-01'), YEAR)), set that date to the 1st of its month.
With that you are able to cover another 29 years of data.
Be aware that you cannot filter on that column with a date filter, e.g. in Data Studio, but it works for queries.
Best, Thomas
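A rough sketch of option 2 with the Python client, under assumed names (raw_table, partitioned_table and the event_date column are illustrative, not from the question): rows older than 9 years are collapsed to the first of their month so the total number of partitions stays under the limit, and the result is written to a table partitioned on the derived column.

from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT
      *,
      IF(DATE_DIFF(CURRENT_DATE(), event_date, YEAR) > 9,
         DATE_TRUNC(event_date, MONTH),   -- old rows collapse to monthly partitions
         event_date) AS partition_date
    FROM `my-project.my_dataset.raw_table`
"""

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my-project.my_dataset.partitioned_table"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    time_partitioning=bigquery.TimePartitioning(field="partition_date"),
)
client.query(sql, job_config=job_config).result()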
Currently, as per the docs, clustering is supported for partitioned tables only. In the future it might support non-partitioned tables.
You can put old data into a single partition per year.
You need to add an extra column to your table to partition on.
Say, all data for the year 2011 goes into partition 20110101.
For newer data (2019) you can have a separate partition for each date.
This is not a clean solution to the problem, but with it you can optimize further by using clustering to keep table scans minimal.
4,000 daily partitions is just over 10 years of data. If you require a 'table' with more than 10 years of data, one workaround would be to use a view:
Split your table into decades, ensuring all tables are partitioned on the same field and have the same schema.
Union the tables together in a BigQuery view.
This results in a view with 4,000+ partitions which business users can query without worrying about which version of a table they need to use or unioning the tables themselves.
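A sketch of such a view, assuming two decade tables with made-up names (product_2000s, product_2010s) that share the same schema and partitioning field:

from google.cloud import bigquery

client = bigquery.Client()

# One logical "table" spanning both decades; each underlying table keeps
# its own set of (at most 4,000) daily partitions.
client.query("""
    CREATE OR REPLACE VIEW `my-project.my_dataset.product_all` AS
    SELECT * FROM `my-project.my_dataset.product_2000s`
    UNION ALL
    SELECT * FROM `my-project.my_dataset.product_2010s`
""").result()

Queries against the view should still filter on the partitioning field so that BigQuery can prune partitions in the underlying tables.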
It might make sense to partition by week/month/year instead of day - depending on how much data you have per day.
In that case, see:
Partition by week/year/month to get over the partition limit?

Comparing yesterday's data with today's data

I have 2 Parquet tables, one for today and one for yesterday. What I want to do is compare what has changed in today's table, e.g.:
which new rows have been added
which rows have been deleted, and when they were deleted
which rows have been changed
The tables themselves have "createdAt" and "updatedAt" columns which I can use for this purpose.
I'm working with Databricks/Apache Spark, so I can either use their built-in functions or a SQL query. I'm not sure how to go about this; any general ideas are appreciated!
Maintain an audit table behind your main table. Data must be inserted into the audit table whenever you perform an insert, update or delete on your main table. The audit table should include the main table's createdAt and the current date stamp.
If you encode the transaction type (insert, update or delete) as 1, 2, 3, it will be good for query performance.
As I don't know the load type (full or delta) for your table, I will try to cover both scenarios:
Full load -
For this, you only need today's table, as it will contain all of the previous days' records as well.
Hence you only need a condition that picks up all records modified after yesterday's load, using the updatedAt column, i.e.
updatedAt > yesterday's load date
Delta load -
For delta, each day you get only the modified records (new, updated or deleted), so simply querying today's table without any condition will serve the purpose.
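For the row-level comparison itself, one possible PySpark sketch, assuming both snapshots share a primary-key column named id (an illustrative name, as are the file paths): added and deleted rows fall out of anti-joins, and changed rows are those present in both snapshots whose updatedAt has moved forward. Note that the exact deletion time cannot be recovered from two snapshots alone; it can only be bounded by the two snapshot dates.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

today = spark.read.parquet("/mnt/data/products_today")          # placeholder paths
yesterday = spark.read.parquet("/mnt/data/products_yesterday")

# Rows present today but not yesterday.
added = today.join(yesterday, on="id", how="left_anti")

# Rows present yesterday but missing today (deleted sometime between the two snapshots).
deleted = yesterday.join(today, on="id", how="left_anti")

# Rows present in both snapshots whose updatedAt advanced.
changed = (
    today.alias("t")
    .join(yesterday.alias("y"), F.col("t.id") == F.col("y.id"), "inner")
    .where(F.col("t.updatedAt") > F.col("y.updatedAt"))
    .select("t.*")
)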
Now, on the Spark side, as you have a large number of records, you can adjust the number of shuffle partitions at runtime using something like the following:
spark.sql("set spark.sql.shuffle.partitions = 1500");
You can find other optimization techniques here:
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/

Aggregating last 30 days data in BigQuery

I am checking the feasibility of moving from Redshift to BigQuery. I need help in implementing the below use case on BigQuery.
We have a by-day product performance table, product_performance_by_day, which is date-partitioned. There is a row for every product sold each day. Every day we process the data at the end of the day and put it in that day's partition. Then we aggregate this by-day performance data over the last 30 days and put it in a table called product_performance_last30days. This aggregation saves querying time and, in the case of BigQuery, will save cost as well since it scans less data.
Here is how we do it in Redshift currently -
We put the aggregated data in a new table, e.g. product_performance_last30days_temp, then drop the product_performance_last30days table and rename product_performance_last30days_temp to product_performance_last30days. So there is very minimal downtime for the product_performance_last30days table.
How can we do the same thing in BigQuery?
Currently, BigQuery does not support renaming tables, materialized views, or table aliases. And since we want to save the aggregated data into the same table every day, we cannot use a destination table if the table is not empty.
You can overwrite the same table by using writeDisposition, which specifies the action that occurs if the destination table already exists.
The following values are supported:
WRITE_TRUNCATE: If the table already exists, BigQuery overwrites the table data.
WRITE_APPEND: If the table already exists, BigQuery appends the data to the table.
WRITE_EMPTY: If the table already exists and contains data, a 'duplicate' error is returned in the job result.
The default value is WRITE_EMPTY.
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
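For example, with the google-cloud-bigquery Python client the daily aggregation job could look roughly like this (dataset, table and column names are illustrative, and the partition filter assumes ingestion-time partitioning); WRITE_TRUNCATE atomically replaces the contents of product_performance_last30days when the job completes, so there is no rename step and effectively no downtime:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(
        "my-project.analytics.product_performance_last30days"
    ),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # overwrite in place
)

sql = """
    SELECT product_id, SUM(units_sold) AS units_sold, SUM(revenue) AS revenue
    FROM `my-project.analytics.product_performance_by_day`
    WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY product_id
"""

client.query(sql, job_config=job_config).result()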
For renaming tables, see this answer.

Incremental extraction from DB2

What would be the most efficient way to select only the rows from a DB2 table that have been inserted/updated since the last select (or since some specified time)? There is no field in the table that would allow us to do this easily.
We are extracting data from the table for reporting purposes, and right now we have to extract the whole table every time, which is causing big performance issues.
I found an example of how to select only the rows changed in the last day:
SELECT * FROM ORDERS
WHERE ROW CHANGE TIMESTAMP FOR ORDERS >
CURRENT TIMESTAMP - 24 HOURS;
But I am not sure how efficient this would be, since the table is enormous.
Is there some other way to select only the rows that have changed, that might be more efficient than this?
I also found a solution called ParStream. It seems like something that could speed up demanding queries on the data, but I was unable to find any useful documentation about it.
I propose these options:
You can use Change Data Capture, which will automatically replay the modifications to another data source.
Normally, a SELECT statement does not guarantee the order of the rows. That means you cannot use a SELECT without a time reference in order to retrieve the most recent rows. Thus, you have to have a time column in order to retrieve the most recent changes. You can keep track of the most recent row in a global variable, and the next time retrieve only the rows with a time greater than that variable. If you want to increase performance, you can put the table in append mode, so that new rows are stored physically together. Keeping an index on this time column can be expensive to maintain, but it will speed up the extraction (no table scan) when you need to pull the rows.
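A sketch of that watermark approach from a client script, tied to the ROW CHANGE TIMESTAMP expression shown in the question; pyodbc is used only as an example driver, and the connection string and the way the watermark is persisted are placeholders:

from datetime import datetime
import pyodbc

# Watermark from the previous run; in practice persist it in a file or control table.
last_extracted = datetime(2023, 5, 1, 0, 0, 0)

conn = pyodbc.connect("DSN=MYDB2;UID=user;PWD=secret")   # placeholder connection string
cur = conn.cursor()

# Pull only rows inserted/updated since the previous run, then advance the watermark.
cur.execute(
    "SELECT ORDERS.*, ROW CHANGE TIMESTAMP FOR ORDERS AS changed_at "
    "FROM ORDERS "
    "WHERE ROW CHANGE TIMESTAMP FOR ORDERS > ?",
    last_extracted,
)
rows = cur.fetchall()

# changed_at is the last selected column; use the newest value as the next watermark.
if rows:
    last_extracted = max(row[-1] for row in rows)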
If your server is DB2 for i, use database journaling. You can extract the after-images of inserted records by time period or journal entry number from the journal receiver(s). The data entries can then be copied to your target file.