Google BigQuery There are no primary key or unique constraints, how do you prevent duplicated records being inserted? - google-bigquery

Google BigQuery has no primary key or unique constraints.
We cannot use traditional SQL options such as insert ignore or insert on duplicate key update so how do you prevent duplicate records being inserted into Google BigQuery?
If I have to call delete first (based on unique key in my own system) and then insert to prevent duplicate records being inserted into bigquery, wouldn't that that be too inefficient? I would assume that insert is the cheapest operation, no query, just append data. For each insert if I have to call delete, it will be too inefficient and cost us extra money.
What is your advice and suggestions based on your experience?
It would be nice that bigquery has primary key, but it might be conflict with the algorithms/data structure that bigquery is based on?

So let's clear some facts up in the first place.
Bigquery is a managed data warehouse suitable for large datasets, and it's complementary to a traditional database, rather than a replacement.
Up until early 2020 there was only a maximum of 96 DML (update,delete) operations on a table per day. That low limited forced you to think of BQ as a data lake. That limit has been removed but it demonstrates that the early design of the system was oriented around "append-only".
So, on BigQuery, you actually let all data in, and favor an append-only design. That means that by design you have a database that holds a new row for every update. Hence if you want to use the latest data, you need to pick the last row and use that.
We actually leverage insights from every new update we add to the same row. For example, we can detect how long it took for the end-user to choose his/her country at signup flow. Because we have a dropdown of countries, it took some time until he/she scrolled to the right country, and metrics show this, because we ended up in BQ with two rows, one prior country selected, and one after country selected and based on time selection we were able to optimize the process. Now on our country drop-down we have first 5 most recent/frequent countries listed, so those users no longer need to scroll and pick a country; it's faster.

"Bulk Delete and Insert" is the approach I am using to avoid the duplicated records. And Google's own "Youtube BigQuery Transfer Services" is using "Bulk Delete and Insert" too.
"Youtube BigQuery Transfer Services" push daily reports to the same set of report tables every day. Each record has a column "date".
When we run Youtube Bigquery Transfer backfill (ask youtube bigquery transfer to push the reports for certain dates again.) Youtube BigQury Transfer services will first, delete the full dataset for that date in the report tables and then insert the full dataset of that date back to the report tables again.
Another approach is drop the results table (if it already exists) first, and then re-create the results table and re-input the results into the tables again. I used this approach a lot. Everyday, I have my process data results saved in some results tables in the daily dataset. If I rerun the process for that day, my script will check if the results tables for that day exist or not. If table exists for that day, delete it and then re-create a fresh new table, and then reinput the process results to the new created table.

BigQuery now doesn't have DML limits.
https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery

Related

Advice on changing the partition field for dynamic BigQuery tables

I am dealing with the following issue: I have a number of tables imported into BigQuery from an external source via AirByte with _airbyte_emitted_at as the default setting for partition field.
As this default choice for a partition field is not very lucrative, the need to change the partition field naturally presents itself. I am aware of the method available for changing partitions of existing tables, by means of a CREATE TABLE FROM SELECT * statement, however the new tables thus created - essentially copies of the original ones, with modified partition fields - will be mere static snapshots and no longer dynamically update, as the originals do each time new data is recorded in the external source.
Given such a context, what would the experienced members of this forum suggest as a solution to the problem?
Being that I am a relative beginner in such matters, I apologise in advance for any potential lack of clarity. I look forward to improving the clarity, should there be any suggestions to do so from interested readers & users of this forum.
I can think of 2 approaches to overcome this.
Approach 1 :
You can use Scheduled queries to copy the newly inserted rows to your 2nd table. You have to write the query in such a way that it will always select the latest rows from your main table and once you have that you can use Insert Into statement to append the rows in your 2nd table.
Since Schedule queries run at specific times the only drawback will be the the 2nd table will not get updated immediately whenever there is a new row in the main table, it will get the latest data whenever the Scheduled Query runs.
If you do not wish to have the latest data always in your 2nd table then this approach is the easier one to achieve.
Approach 2 :
You can trigger Cloud Actions for BigQuery events such as Insert, delete, update etc. Whenever a new row gets inserted in your main table ,using Cloud Run Actions you can insert that new data in your 2nd table.
You can follow this article , here a detailed solution has been given.
If you wish to have the latest data always in your 2nd table then this would be a good way to do so.

Keeping BigQuery table data up-to-date

This is probably incorrect use case for BigQuery but I have following problem: I need to periodically update Big Query table. Update should be "atomic" in a sense that clients which read data should either use only old version of data or completely new version of data. The only solution I have now is to use date partitions. The problem with this solution is that clients which just need to read up to date data should know about partitions and get data only from certain partitions. Every time I want to make a query I would have first to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like solution to be easy and transparent for clients who read data.
You didn't mention the size of your update, I can only give some general guideline.
Most BigQuery updates, including single DML (INSERT/UPDATE/DELETE/MERGE) and single load job, are atomic. Your reader reads either old data or new data.
Lacking multi-statement transaction right now, if you do have updates which doesn't fit into single load job, the solution is:
Load update into a staging table, after all loads finished
Use single INSERT or MERGE to merge updates from staging table to primary data table
The drawback: scanning staging table is not for free
Update: since you have multiple tables to update atomically, there is a tiny trick which may be helpful.
Assuming for each table that you need an update, there is a ActivePartition column as partition key, you may have a table with only one row.
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to a new active date, then your user use a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active

How to insert nested data to an existing record using bigquery streaming

I am trying to understand bigQuery and see if it fits our needs.
One of the basic requirements we have is to store a nested structure such that the nested part needs to be stored separately than the main record.
e.g.
Let's say we have a record of an employee, after storing the main data for the employee, let's say a minute after, another record would arrive with employee previous work place (and then another such record may arrive)
So we need to store te first employee record, and then update the structure to add a detail about the employee, this detail is also inserted as new record and does not overwrite an existing record.
How can this be done in bigQuerY?
Assuming we may have different sources of the data?
The preferred and recommended way to store that in BigQuery is append-only. That means that you are limited to do update/delete, and you constantly instant new rows.
By having a stream of rows from the same user, you need to write your queries in a such way to pick the last row, to obtain the most recent profile. But you have all the 'versioning' of all the stream that came in.
In other words you use Streaming Insert functionality to constantly add new rows. Then you have your SQL queries usually with Window Functions to pick last row.
You cannot update a row, or append to a record as BigQuery limits DML statements to 96 per table.

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a a timestamp. However our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data/file). As far as I know there is no way to append a timestamp column to each row before bringing them in BigQuery, right?
We are thus toying with the idea of creating daily tables for each source. But don't know how this will work when we have real time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to bigquery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).

Database history for client usage

I'm trying to figure out what would be the best way to have a history on a database, to track any Insert/Delete/Update that is done. The history data will need to be coded into the front-end since it will be used by the users. Creating "history tables" (a copy of each table used to store history) is not a good way to do this, since the data is spread across multiple tables.
At this point in time, my best idea is to create a few History tables, which the tables would reflect the output I want to show to the users. Whenever a change is made to specific tables, I would update this history table with the data as well.
I'm trying to figure out what the best way to go about would be. Any suggestions will be appreciated.
I am using Oracle + VB.NET
I have used very successfully a model where every table has an audit copy - the same table with a few additional fields (time stamp, user id, operation type), and 3 triggers on the first table for insert/update/delete.
I think this is a very good way of handling this, because tables and triggers can be generated from a model and there is little overhead from a management perspective.
The application can use the tables to show an audit history to the user (read-only).
We've got that requirement in our systems. We added two tables, one header, one detail called AuditRow and AuditField. The AuditRow contains one row per row changed in any other table, and the AuditField contains one row per column changed with old value and new value.
We have a trigger on every table that writes a header row (AuditRow) and the needed detail rows (one per changed colum) on each insert/update/delete. This system does rely on the fact that we have a guid on every table that can uniquely represent the row. Doesn't have to be the "business" or "primary" key, but it's a unique identifier for that row so we can identify it in the audit tables. Works like a champ. Overkill? Perhaps, but we've never had a problem with auditors. :-)
And yes, the Audit tables are by far the largest tables in the system.
If you are lucky enough to be on Oracle 11g, you could also use the Flashback Data Archive
Personally, I would stay away from triggers. They can be a nightmare when it comes to debugging and not necessarily the best if you are looking to scale out.
If you are using an PL/SQL API to do the INSERT/UPDATE/DELETEs you could manage this in a simple shift in design without the need (up front) for history tables.
All you need are 2 extra columns, DATE_FROM and DATE_THRU. When a record is INSERTed, the DATE_THRU is left NULL. If that record is UPDATEd or DELETEd, just "end date" the record by making DATE_THRU the current date/time (SYSDATE). Showing the history is as simple as selecting from the table, the one record where DATE_THRU is NULL will be your current or active record.
Now if you expect a high volume of changes, writing off the old record to a history table would be preferable, but I still wouldn't manage it with triggers, I'd do it with the API.
Hope that helps.