How do you deduplicate records in a BigQuery table? - sql

We have a script on GCP (a Cloud Function triggered by Cloud Scheduler) that should run once a day and send data to a table in BigQuery.
Unfortunately, the cron job used to fire every minute during the scheduled hour, which means the file was uploaded 60 times instead of once.
The cron schedule was * 3 * * * instead of 0 3 * * *.
How can we fix the table?
Note that the transferred data has since been deleted from the source. So far we get by with selecting only the unique values, but the table is getting too large.
Any help would be much appreciated.

I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.
Option One
If this is a one-off fix, I recommend you simply
navigate to the table (your_dataset.your_table) in the UI
click 'snapshot' and create a snapshot in case you make a mistake in the next part
run SELECT DISTINCT * FROM your_dataset.your_table in the UI
click 'save results' and select 'bigquery table' then save as a new table (e.g. your_dataset.your_table_deduplicated)
navigate back to the old table and click the 'delete' button, then authorise the deletion
navigate to the new table and click the 'copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table)
delete your_dataset.your_table_deduplicated
This procedure replaces the current table with one that has the same schema but no duplicated records. You should check that it looks as you expect before you discard your snapshot.
Option Two
A quicker approach, if you're comfortable with it, would be using the Data Manipulation Language (DML).
There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to only delete the duplicate rows.
There is a simpler approach, which is equivalent to option one and just requires you to run this query:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table
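Note that, as far as I recall, CREATE OR REPLACE TABLE ... AS SELECT does not carry over partitioning or clustering unless you restate them in the CREATE, so check your table settings afterwards.
Also, if your duplicates are not byte-identical rows but merely share a key, SELECT DISTINCT won't collapse them; a keyed variant is sketched here, assuming a hypothetical unique column some_unique_id:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT *
FROM your_dataset.your_table
WHERE TRUE  -- BigQuery wants a WHERE (or GROUP BY/HAVING) alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY some_unique_id) = 1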
Again, you may wish to take a snapshot before running this.
The Future
If you have a cloud function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. no matter how many times you run it, if the input is the same the output is the same).
A typical pattern would be to add a stage to your function to pre-filter the new records.
Depending on your requirements, this stage could
prepare the new records you want to insert, which should have some unique, immutable ID field
SELECT some_unique_id FROM your_dataset.your_table -> old_record_ids
filter the new records, e.g. in Python: new_records = [record for record in prepared_records if record["id"] not in old_record_ids] (make old_record_ids a set so the membership test is fast)
upload only the records that don't exist yet
This will prevent the sort of issues you have encountered here; an equivalent server-side approach using MERGE is sketched below.
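As a sketch of that server-side variant: load each new batch into a staging table first (staging_table is a hypothetical name) and let BigQuery filter out rows whose some_unique_id already exists; re-running the same load then becomes a no-op.
MERGE your_dataset.your_table T
USING your_dataset.staging_table S
ON T.some_unique_id = S.some_unique_id
-- INSERT ROW copies the whole source row; staging must match the target schema
WHEN NOT MATCHED THEN
  INSERT ROW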

Related

Finding the query that created a table in BigQuery

I am a new employee at the company. The person before me had built some tables in BigQuery. I want to investigate the create table query for that particular table.
Things I would want to check using the query are:
What joins were used?
What are the other tables used to make the table in question?
I have not worked with BigQuery before but I did my due diligence by reading tutorials and the documentation. I could not find anything related there.
A brief outline of the steps is below:
Step 1 - gather all query jobs of that user using Jobs.list API - you must have Is Owner permission for respective projects to get someone else's jobs
Step 2 - extract only those jobs run by the user you mentioned and referencing your table of interest - using destination table attribute
Step 3 - for those extracted jobs, simply check the respective queries, which will show you how that table was populated
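These days the same search can be done in plain SQL via INFORMATION_SCHEMA (job history there is retained for roughly 180 days, if I remember correctly). A sketch, with the region qualifier and table name as placeholders:
SELECT creation_time, user_email, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  -- destination_table is a STRUCT of project_id, dataset_id, table_id
  AND destination_table.table_id = 'your_table'
ORDER BY creation_time DESC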
Hth!
I had been looking for an answer for a long time.
Finally found it:
Go to the three-bars (hamburger) menu at the top left.
From there go to the Analytics section.
Select BigQuery, under which you will find the Scheduled queries option; click on that.
In the filter tab you can enter keywords and get the required query for the table.
For me, I was able to go through my query history and find the query I used.
Step 1.
Go to the BigQuery UI; at the bottom there are Personal history and Project history tabs. If you can use the same account that executed the query, I recommend Personal history.
Step 2.
Click on the tab and there will be a list of queries, ordered from most recently run. Check the time the table was created and find a query that ran just before the table creation time.
Since the query runs first and then creates the table, there will be a slight difference; for me it was only a few seconds.
Step 3.
After you find the query used to create the table, simply copy it. And you're done.

How should I control the duplication of calculation version using Pentaho?

I have a table result_slalom whose data is populated via Pentaho ETL jobs.
When the ETL runs for the first time it creates version 1.
Now, if the data is changed by new calculations it becomes version 2.
I need to make changes only in calculation version 2, and no more than 2 versions should exist in the table result_slalom (version 1 and version 2).
So the logic is:
Check if data exists in the table:
o When data exists and the existing version is 1, then set the version of the new data = 2 --> insert new dataset
o When data exists and the existing version is 2, then set the version of the new data = 2 --> update existing dataset
o When no data exists, then set version = 1 --> insert new dataset
How do I make my Pentaho formula for this logic?
Currently it is:
if([VersionInDB]=1;[Calculationversion];[VersionInDB]+1)
The Dimension lookup/update step does exactly that.
In addition it has validity dates: at the time version 2 is created, version 1 receives an end-date of now and version 2 receives a start-date of now. That makes it easy to retrieve historical information with a WHERE date BETWEEN start-date AND end-date. Plus, you have a single button that writes the CREATE/ALTER TABLE and CREATE INDEX statements for you.
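That said, if you still want a Formula-step expression: the three rules above collapse to "2 whenever any row already exists, otherwise 1", so a sketch would be (this assumes the lookup leaves [VersionInDB] null when no row is found, and that ISBLANK is available, as in standard OpenFormula):
if(ISBLANK([VersionInDB]);1;2)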
The other solution people reach for is to put a trigger on the tables.
Forget about reinventing the wheel in that direction: although I usually like reinventing the wheel, this is one case where redeveloping the logic will lead you into an endless series of tests and bugs.

Bigquery return nested results without flattening it without using a table

It is possible to return nested results (RECORD type) if the noflatten_results flag is specified, but is it possible to just view them on screen without writing them to a table first?
For example, here is a simple user table (my actual table is large: 400+ columns with multiple levels of nesting):
ID,
name: {first, last}
I want to view the record of a particular user & display it in my application, so my query is
SELECT * FROM dataset.user WHERE id=423421 LIMIT 1
Is it possible to return the result directly?
You should write your output to a "temp" table with the noflatten_results option (with an expiration set so the table gets purged after it is used) and serve your client out of this temp table. All "on the fly".
Keep in mind that no matter how small the "temp" table is, if you query it (in the second step above) you will be billed for at least 10MB, so you are better off using the Tabledata.list API for that step (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list), which is free!
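For example, with the legacy bq CLI the two-step flow might look like this (flag names as I remember them; treat this as a sketch):
bq query --allow_large_results --noflatten_results --destination_table=dataset.temp_user 'SELECT * FROM dataset.user WHERE id=423421 LIMIT 1'
bq head dataset.temp_user
The second command reads through tabledata.list rather than running a query, which is why it is free.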
If you try to get repeated records, it will fail in the interface/BQ console with the error:
Error: Cannot output multiple independently repeated fields at the same time.
The way to get past this error is to FLATTEN your output.

Get the oldest row from table

I am coding an application that deals with files, so I have a table that contains information about all the files registered in the application.
My "files" table looks like this: ID, Path and LastScanTime.
The algorithm that I use in my application is simple:
Take the oldest row (LastScanTime is the oldest)
Extract the file path
Do some magics on this file (takes exactly 5 minutes)
Update the LastScanTime to the current time (now)
Go to step "1"
So far, the task is pretty simple. For this, I am going to use this SQL statement to get the oldest item:
SELECT TOP 1 * FROM files ORDER BY [LastScanTime] ASC
and at the end of the item's processing (preventing the item from being selected again immediately):
UPDATE Files SET [LastScanTime]=GETDATE() WHERE Id=#ItemID
Now, I am going to add some complexity to the algorithm:
Take the 3 oldest rows (by LastScanTime)
For each row, do:
A. Extract the file path
B. Do some magics on this file (takes exactly 5 minutes)
C. Update the LastScanTime to the current time (now)
D. Go to step "1"
The problem I am now facing is that the whole process is going to run in parallel (no more serial processing). So changing my SQL statement to the following is not enough!
SELECT TOP 3 * FROM files ORDER BY [LastScanTime] ASC
Why isn't this SQL statement enough?
Let's say I run my code and start executing the first 3 items. Now, a minute later, I want to execute another 3 items. This SQL statement will retrieve exactly the same "oldest" items that we have already started to process.
Possible solution
Implementing a combined SELECT & UPDATE that gets the 3 oldest items and immediately updates their last scan time. Since there is no SELECT & UPDATE in a single statement, what happens if another SELECT comes in while the first SELECT is executing? Both statements will get the same results; this is a problem. Another problem is that we mark the item as "scanned recently" before the scan has really finished. What happens if the scan is terminated by an error?
I'm looking for tips and tricks to solve this problem. The solutions can add columns as needed.
I'd appreciate your help.
Well, I usually have the habit of keeping two different fields in the database: one is AddedDate and another is ModifiedDate.
So the algorithm in your terms will be:
Take the oldest row (AddedDate is the oldest)
Extract the file path
Do some process on this file
Update the ModifiedDate to the current time (now)
It seems that you are about to reinvent an event queue in SQL. Standard approaches like RabbitMQ or ActiveMQ may solve your problem.
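For what it's worth, the "SELECT & UPDATE combined" idea from the question can be expressed in SQL Server; here is a sketch using the question's columns (verify the locking hints against your workload):
WITH oldest AS (
    -- UPDLOCK stops two workers from claiming the same rows;
    -- READPAST skips rows another worker has already locked
    SELECT TOP (3) *
    FROM Files WITH (UPDLOCK, READPAST, ROWLOCK)
    ORDER BY LastScanTime ASC
)
UPDATE oldest
SET LastScanTime = GETDATE()
OUTPUT inserted.Id, inserted.Path;
This still marks rows as scanned before the scan finishes, as the question points out; a separate nullable column such as a hypothetical ScanCompletedTime would let a cleanup job re-queue rows whose processing died.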

What do I gain by adding a timestamp column called recordversion to a table in ms-sql?

You can use that column to make sure your users don't overwrite data from another user.
Let's say user A pulls up record 1 and at the same time user B pulls up record 1. User A edits the record and saves it. Five minutes later, user B edits the record - but doesn't know about user A's changes. When he saves his changes, you use the recordversion column in your UPDATE's WHERE clause, which will prevent user B from overwriting what user A did. You can detect this out-of-date condition and raise some kind of "data out of date" error.
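A sketch of that check (table, column and parameter names are made up; with the rowversion type, SQL Server bumps RecordVersion automatically on every update):
-- earlier read: SELECT Name, RecordVersion FROM Widgets WHERE Id = @Id
UPDATE Widgets
SET Name = @NewName
WHERE Id = @Id
  AND RecordVersion = @OriginalRecordVersion;
IF @@ROWCOUNT = 0
    RAISERROR('Record was changed by another user.', 16, 1);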
Nothing that I'm aware of, or that Google seems to find quickly.
You don't get anything inherent by using that name for a column. Sure, you can create a column and do the record versioning described in the other answers, but there's nothing special about the column name. You could call the column anything you want and do versioning, and you could call any column RecordVersion and nothing special would happen.
Timestamp is mainly used for replication. I have also used it successfully to determine whether the data has been updated since the last feed to the client (when I needed to send a delta feed), and thus to pick out only the records which have changed since then. This does require another table that stores the value of the timestamp (in a varbinary field) at the time you run the feed, so you can compare against it on the next run.
If you think that timestamp records the date or time of the last update, it does not do that; you would need datetime fields plus constraints (to capture the original datetime) and triggers (to update it) to store that information.
Also, keep in mind if you want to keep track of your data, it's a good idea to add these four columns to every table:
CreatedBy(varchar) | CreatedOn(date) | ModifiedBy(varchar) | ModifiedOn(date)
While it doesn't give you full history, it lets you know who created an entry and when, and who last modified it and when. Those four columns provide pretty powerful tracking without any serious overhead on your DB.
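A sketch of adding them, with YourTable as a placeholder name (SUSER_SNAME() is one way to default the "who"; substitute your application's user if identity is tracked elsewhere):
ALTER TABLE YourTable ADD
    CreatedBy  varchar(100) NOT NULL DEFAULT SUSER_SNAME(),
    CreatedOn  datetime     NOT NULL DEFAULT GETDATE(),
    ModifiedBy varchar(100) NULL,  -- set by the application on each update
    ModifiedOn datetime     NULL;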
Obviously, you could build a full-blown logging system that tracks every change and gives you complete history, but that's not the solution for the issue I think you are describing.