Row-level timestamp information in Google BigQuery - sql

I am working on a table in BigQuery. The table is already populated with data. I want to know whether BigQuery holds any kind of row-level metadata from which I can find out when a row was inserted or modified.

BigQuery provides no such metadata. You would have to create such fields and populate them yourself.
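For example (a minimal sketch, with table and column names assumed for illustration), you could add explicit audit columns and set them in every statement that writes to the table:

-- Add audit columns to an existing table (names are illustrative)
ALTER TABLE my_dataset.my_table
  ADD COLUMN inserted_at TIMESTAMP,
  ADD COLUMN updated_at TIMESTAMP;

-- Populate them on every write
INSERT INTO my_dataset.my_table (id, payload, inserted_at, updated_at)
VALUES (1, 'some value', CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP());

UPDATE my_dataset.my_table
SET payload = 'new value',
    updated_at = CURRENT_TIMESTAMP()
WHERE id = 1;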

Related

Get the timestamp when columns were added in BigQuery

I'm trying to find which new columns were added to a table. Is there any way to find this? My idea was to get all columns of a table with timestamps of when they were created or modified, so that I can filter out which columns are new.
With INFORMATION_SCHEMA.SCHEMATA I only get the table creation and modification dates, but nothing for columns.
With INFORMATION_SCHEMA.COLUMNS I can get all column names and their information, but no details about their creation or modification timestamps.
My table doesn't have a snapshot so I can't compare it with the previous version to get changes.
Is there any way to capture this?
According to the BigQuery columns documentation, this is not metadata currently captured by BigQuery.
A possible solution would be to go into the BigQuery logs to see when and how tables were updated. Source control over the schemas and scripts that create these tables could also give you insight into how and when columns may have been added.
As @RileyRunnoe mentioned, this kind of metadata is not captured by BQ, and a possible solution is to dig into the Audit Logs. Before doing this, you need to have created a BQ sink that points to the dataset; see creating a sink for more details.
Once the sink is created, all executed operations will store data access logs in the table cloudaudit_googleapis_com_data_access_YYYYMMDD and activity logs in the table cloudaudit_googleapis_com_activity_YYYYMMDD under the BigQuery dataset you selected in your sink. Keep in mind that you can only track usage starting from the date you set up the log export tables.
The query below has a CTE over cloudaudit_googleapis_com_data_access_*, since this is where data changes are logged, and it keeps only completed jobs by filtering on jobservice.jobcompleted. The outer query then selects statements that contain "COLUMN" and excludes queries that don't have a real destination table (such as the query we are about to run, whose destination dataset name starts with an underscore).
WITH CTE AS (
  SELECT
    protopayload_auditlog.methodName,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query AS query,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatus.state AS status,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.datasetId AS dataset,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.tableId AS table,
    timestamp
  FROM `my-project.dataset_name.cloudaudit_googleapis_com_data_access_*`
  WHERE protopayload_auditlog.methodName = 'jobservice.jobcompleted'
)
SELECT
  query,
  REGEXP_EXTRACT(query, r'ADD COLUMN (\w+) \w+') AS column,
  table,
  timestamp,
  status
FROM CTE
WHERE query LIKE '%COLUMN%'
  AND NOT REGEXP_CONTAINS(dataset, r'^_')
ORDER BY timestamp DESC

Create New BigQuery Table Partitioned on a Different Column

I have data streaming into a BigQuery table partitioned by timestamp column A (defined in the streaming service). Now, for analysis, we want to query the data with filters on timestamp column B. So it would be great if there were some way to create a view or table (kept in sync with the source table) that is partitioned on column B. I looked into materialized views, but they only support the same partitioning column as the source table.
Any workaround or suggestion is appreciated.
Thanks in advance.
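One possible workaround (a sketch only; table and column names are assumptions) is a scheduled query that maintains a copy of the source table partitioned on column B and appends newly streamed rows on each run:

-- One-time setup: a copy of the source table, partitioned on column B (assumed TIMESTAMP)
CREATE TABLE IF NOT EXISTS my_dataset.events_by_b
PARTITION BY DATE(column_b) AS
SELECT * FROM my_dataset.events;

-- Scheduled refresh: append rows that arrived since the last run (keyed on column A)
INSERT INTO my_dataset.events_by_b
SELECT *
FROM my_dataset.events
WHERE column_a > (SELECT MAX(column_a) FROM my_dataset.events_by_b);

The copy lags the source by the schedule interval, so it is near-sync rather than a true real-time mirror.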

Stream table data from one BigQuery table to another with existing schema

I have two BigQuery datasets: dataset_a and dataset_b
Each of these datasets contains a table, e.g. dataset_a_table and dataset_b_table.
dataset_a_table contains streaming data and I want to stream data from dataset_a_table to dataset_b_table.
I have the schema of dataset_a_table as a TableSchema. How can I stream rows from one table to another and keep the existing schema?
I have so far looked at the insertAll method of BigQuery, but I am a bit unsure about which data structure to fetch rows into and how to specify the TableSchema when inserting into the new table.
I would appreciate some guidance regarding how to do that. Thanks.
Approach 1: If dataset_b_table needs to simply mirror dataset_a_table, for instance because you have different user permissions on the two datasets, you could consider setting up dataset_b_table as a view instead of a table. Views in BigQuery work across datasets:
CREATE VIEW dataset_b.dataset_b_view AS SELECT * FROM dataset_a.dataset_a_table
Approach 2: If you do want dataset_b_table with the same schema as dataset_a_table, you can use the BigQuery native "transfers" functionality. ("Transfers" > "Create Transfer" > select "Dataset Copy")
Approach 3: If dataset_b_table has a different schema from dataset_a_table, or if dataset_b_table already contains data and you want to merge in data from dataset_a_table, you will need some sort of incremental logic. Assuming your dataset_a_table has some sort of "updated_at" field (and also assuming no updates to existing records), you could go with an incremental load like this:
INSERT INTO dataset_b.dataset_b_table
SELECT
  column_a, column_b, column_c, updated_at
FROM dataset_a.dataset_a_table
WHERE updated_at > (SELECT MAX(updated_at) FROM dataset_b.dataset_b_table)
You can then schedule this to run as often as your timing requirements demand: once a day, once an hour, or every couple of minutes, using either BigQuery's native scheduling functionality or your own logic.
If you need actual streaming within (milli)seconds and the view approach doesn't work for you, you will need to work with whatever fills dataset_a_table in the first place, as BigQuery doesn't support triggers.
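If records in dataset_a_table can also be updated in place, the incremental INSERT above would create duplicates; a MERGE keyed on a unique id is one possible variant (a sketch only; the id column is an assumption):

MERGE dataset_b.dataset_b_table b
USING (
  -- Only rows changed since the last load (id is an assumed unique key)
  SELECT id, column_a, column_b, column_c, updated_at
  FROM dataset_a.dataset_a_table
  WHERE updated_at > (SELECT MAX(updated_at) FROM dataset_b.dataset_b_table)
) a
ON b.id = a.id
WHEN MATCHED THEN
  UPDATE SET column_a = a.column_a,
             column_b = a.column_b,
             column_c = a.column_c,
             updated_at = a.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, column_a, column_b, column_c, updated_at)
  VALUES (a.id, a.column_a, a.column_b, a.column_c, a.updated_at)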

Using HBase in place of Hive

Today we are using Hive as our data warehouse, mainly for batch/bulk data processing: Hive analytics queries, joins, etc. - an ETL pipeline.
Recently we have run into a problem while trying to expose our Hive-based ETL pipeline as a service. The problem is related to Hive's fixed table schema. We have a situation where the table schema is not fixed; it could change, e.g. new columns could be added (at any position in the schema, not necessarily at the end), deleted, or renamed.
In Hive, once partitions are created, I guess they cannot be changed, i.e. we cannot add a new column to an older partition and populate just that column with data. We have to re-create the partition with the new schema and populate data in all columns. New partitions, however, can have the new schema and would contain data for the new column (I'm not sure whether a new column can be inserted at any position in the schema?). Trying to read the value of the new column from an older (unmodified) partition would return NULL.
I want to know if I can use HBase in this scenario and will it solve my above problems?
1. insert new columns at any position in the schema, delete columns, rename columns
2. backfill data in a new column, i.e. for older data (in older partitions), populate only the new column without re-creating the partition or re-populating data in the other columns.
I understand that HBase is schema-less (schema-free), i.e. each record/row can have a different number of columns. I'm not sure whether HBase has a concept of partitions?
You are right that HBase is a semi schema-less database (column families are still fixed)
You will be able to create new columns
You will be able to populate data only in a new column without re-creating partitions or re-populating data in other columns
but
Unfortunately, HBase does not support partitions (speaking in Hive terms); you can see this discussion. That means that if the partition date is not part of the row key, each query will do a full table scan
Renaming a column is not a trivial operation at all
Frequently updating existing records between major compaction intervals will increase query response time
I hope it is helpful.

Google BigQuery has no primary key or unique constraints - how do you prevent duplicate records from being inserted?

Google BigQuery has no primary key or unique constraints.
We cannot use traditional SQL options such as INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE, so how do you prevent duplicate records from being inserted into Google BigQuery?
If I have to call delete first (based on a unique key in my own system) and then insert to prevent duplicate records from being inserted into BigQuery, wouldn't that be too inefficient? I would assume that insert is the cheapest operation: no query, just append data. If I have to call delete for each insert, it would be too inefficient and cost us extra money.
What is your advice and suggestions based on your experience?
It would be nice if BigQuery had primary keys, but perhaps that would conflict with the algorithms/data structures that BigQuery is based on?
So let's clear up some facts first.
BigQuery is a managed data warehouse suitable for large datasets, and it's complementary to a traditional database rather than a replacement.
Up until early 2020 there was a maximum of only 96 DML (UPDATE, DELETE) operations on a table per day. That low limit forced you to think of BQ as a data lake. The limit has since been removed, but it shows that the early design of the system was oriented around "append-only".
So, on BigQuery, you actually let all data in and favor an append-only design. That means that, by design, you have a database that holds a new row for every update. Hence, if you want to use the latest data, you need to pick the last row for each entity and use that.
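For example (a minimal sketch, assuming an id column and an updated_at column), a view that always returns only the latest version of each row:

CREATE OR REPLACE VIEW my_dataset.my_table_latest AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    -- Number each entity's rows from newest to oldest
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM my_dataset.my_table
)
WHERE rn = 1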
We actually leverage insights from every new update we add to the same row. For example, we can detect how long it took the end user to choose his/her country in the signup flow. Because we have a dropdown of countries, it took some time until he/she scrolled to the right country, and the metrics show this: we ended up with two rows in BQ, one before the country was selected and one after, and based on the selection time we were able to optimize the process. Our country drop-down now lists the 5 most recent/frequent countries first, so those users no longer need to scroll and pick a country; it's faster.
"Bulk Delete and Insert" is the approach I am using to avoid the duplicated records. And Google's own "Youtube BigQuery Transfer Services" is using "Bulk Delete and Insert" too.
"Youtube BigQuery Transfer Services" push daily reports to the same set of report tables every day. Each record has a column "date".
When we run Youtube Bigquery Transfer backfill (ask youtube bigquery transfer to push the reports for certain dates again.) Youtube BigQury Transfer services will first, delete the full dataset for that date in the report tables and then insert the full dataset of that date back to the report tables again.
Another approach is to drop the results table first (if it already exists), then re-create the results table and re-insert the results. I use this approach a lot. Every day, my process results are saved in results tables in the daily dataset. If I rerun the process for that day, my script checks whether the results tables for that day exist. If a table exists for that day, it is deleted, a fresh new table is created, and the process results are inserted into the newly created table.
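A minimal sketch of the "Bulk Delete and Insert" pattern for a single report date (table names, column names, and the date literal are all assumptions):

-- Remove everything already stored for the date being backfilled
DELETE FROM my_dataset.report_table
WHERE date = DATE '2020-01-15';

-- Re-insert the full dataset for that date from the source/staging table
INSERT INTO my_dataset.report_table
SELECT *
FROM my_dataset.report_staging
WHERE date = DATE '2020-01-15';

The drop-and-recreate variant of the second approach can be expressed as CREATE OR REPLACE TABLE ... AS SELECT ... instead.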
BigQuery no longer has these DML limits:
https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery