Advice on changing the partition field for dynamic BigQuery tables - google-bigquery

I am dealing with the following issue: I have a number of tables imported into BigQuery from an external source via Airbyte, with _airbyte_emitted_at as the default partition field.
As this default partition field is not very useful, I need to change it. I am aware of the method for changing the partitioning of existing tables by means of a CREATE TABLE ... AS SELECT * statement. However, the new tables thus created (essentially copies of the originals with modified partition fields) will be mere static snapshots and will no longer update dynamically, as the originals do each time new data is recorded in the external source.
Given such a context, what would the experienced members of this forum suggest as a solution to the problem?
Being that I am a relative beginner in such matters, I apologise in advance for any potential lack of clarity. I look forward to improving the clarity, should there be any suggestions to do so from interested readers & users of this forum.

I can think of 2 approaches to overcome this.
Approach 1 :
You can use scheduled queries to copy the newly inserted rows to your 2nd table. You have to write the query in such a way that it always selects the latest rows from your main table; once you have that, you can use an INSERT INTO statement to append those rows to your 2nd table.
Since scheduled queries run at specific times, the only drawback is that the 2nd table will not be updated immediately whenever there is a new row in the main table; it will get the latest data whenever the scheduled query runs.
If you do not need the 2nd table to always have the latest data, this approach is the easier one to implement.
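A minimal sketch of the incremental copy in Approach 1. It uses Python's built-in sqlite3 as a stand-in for BigQuery so that it runs anywhere; the table names main_table and second_table and the business_date column are hypothetical, and only _airbyte_emitted_at comes from the question. In BigQuery, the same INSERT INTO ... SELECT would presumably be the body of the scheduled query.

```python
import sqlite3

# SQLite stands in for BigQuery; all names except _airbyte_emitted_at are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (id INTEGER, business_date TEXT, _airbyte_emitted_at TEXT);
    CREATE TABLE second_table (id INTEGER, business_date TEXT, _airbyte_emitted_at TEXT);
""")

# Copy only rows emitted after the newest row already present in second_table.
COPY_NEW_ROWS = """
    INSERT INTO second_table
    SELECT * FROM main_table
    WHERE _airbyte_emitted_at >
          (SELECT COALESCE(MAX(_airbyte_emitted_at), '') FROM second_table)
"""

def run_scheduled_copy(conn):
    """Simulate one scheduled run; return the row count of second_table afterwards."""
    conn.execute(COPY_NEW_ROWS)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM second_table").fetchone()[0]

conn.execute("INSERT INTO main_table VALUES (1, '2023-01-01', '2023-01-02T00:00:00')")
after_first_run = run_scheduled_copy(conn)   # the existing row is copied
after_second_run = run_scheduled_copy(conn)  # nothing new, nothing copied

conn.execute("INSERT INTO main_table VALUES (2, '2023-01-03', '2023-01-04T00:00:00')")
after_third_run = run_scheduled_copy(conn)   # only the new row is copied
```

One caveat of this high-water-mark filter: a row that arrives later with exactly the same timestamp as the current maximum would be skipped, so a real query may need a tie-breaker or an overlap window.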
Approach 2 :
You can trigger Cloud Run actions on BigQuery events such as insert, delete, update etc. Whenever a new row gets inserted in your main table, you can use a Cloud Run action to insert that new data in your 2nd table.
You can follow this article, where a detailed solution is given.
If you need the 2nd table to always have the latest data, this would be a good way to do so.

Related

Keeping BigQuery table data up-to-date

This is probably an incorrect use case for BigQuery, but I have the following problem: I need to periodically update a BigQuery table. The update should be "atomic" in the sense that clients reading the data should see either only the old version of the data or the completely new version. The only solution I have now is to use date partitions. The problem with this solution is that clients which just need to read up-to-date data must know about partitions and read only from certain partitions. Every time I want to make a query, I would first have to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like a solution that is easy and transparent for clients who read the data.
You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including single DML (INSERT/UPDATE/DELETE/MERGE) and single load job, are atomic. Your reader reads either old data or new data.
Lacking multi-statement transactions right now, if you have updates which don't fit into a single load job, the solution is:
Load the updates into a staging table; after all loads have finished,
use a single INSERT or MERGE to merge the updates from the staging table into the primary data table.
The drawback: scanning the staging table is not free.
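The staging pattern above can be sketched as follows. This is an illustration only: SQLite (via Python's sqlite3) stands in for BigQuery, and because SQLite has no MERGE statement, its upsert form (INSERT ... ON CONFLICT DO UPDATE, SQLite 3.24+) is swapped in for it. The table names primary_data and staging are hypothetical.

```python
import sqlite3

# SQLite stand-in; table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE primary_data (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, value TEXT);
    INSERT INTO primary_data VALUES (1, 'old'), (2, 'old');
""")

# Step 1: several separate loads land in the staging table.
# Nothing in primary_data changes yet.
conn.execute("INSERT INTO staging VALUES (2, 'new')")
conn.execute("INSERT INTO staging VALUES (3, 'new')")
untouched = conn.execute("SELECT id, value FROM primary_data ORDER BY id").fetchall()

# Step 2: one single statement applies all staged changes at once.
# "WHERE true" is needed so SQLite can parse ON CONFLICT after a SELECT.
conn.execute("""
    INSERT INTO primary_data (id, value)
    SELECT id, value FROM staging WHERE true
    ON CONFLICT(id) DO UPDATE SET value = excluded.value
""")
conn.commit()

merged = conn.execute("SELECT id, value FROM primary_data ORDER BY id").fetchall()
```

The point of the design is that however many load jobs feed the staging table, readers of the primary table only ever observe the effect of the one final statement.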
Update: since you have multiple tables to update atomically, there is a tiny trick which may be helpful.
Assuming each table that needs an update has an ActivePartition column as its partition key, you can keep a separate table with only one row.
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to the new active date; your users then run a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active
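To see why the one-row pointer table gives readers an all-or-nothing view, here is a small simulation. It reuses the ActivePartition and dataTable names from the answer, but runs on SQLite through Python's sqlite3 as a stand-in for BigQuery; the value column and the dates are made up for the example.

```python
import sqlite3

# SQLite stand-in for BigQuery; the "value" column and the dates are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ActivePartition (active TEXT);
    CREATE TABLE dataTable (ActivePartition TEXT, value INTEGER);
    INSERT INTO ActivePartition VALUES ('2020-01-01');
    INSERT INTO dataTable VALUES ('2020-01-01', 10);
""")

def read_active(conn):
    """What a client sees: only rows of the currently active partition."""
    return conn.execute("""
        SELECT value FROM dataTable
        WHERE ActivePartition = (SELECT active FROM ActivePartition)
        ORDER BY value
    """).fetchall()

before = read_active(conn)

# Load the next partition in as many steps as needed; readers are unaffected.
conn.execute("INSERT INTO dataTable VALUES ('2020-01-02', 20)")
conn.execute("INSERT INTO dataTable VALUES ('2020-01-02', 21)")
during = read_active(conn)

# Flip the pointer with a single-row update once everything is loaded.
conn.execute("UPDATE ActivePartition SET active = '2020-01-02'")
conn.commit()
after = read_active(conn)
```

Readers never need to know which partition is current; they only need the pointer lookup, which could also be hidden behind a view.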

The best way to Update the database table through a pyspark job

I have a spark job that gets data from multiple sources and aggregates into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in. The comparison happens in the spark layer.
I was wondering if there is any better way to compare, that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
"One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in"
IMHO, comparing the entire data set in order to load new data is not performant.
Option 1:
Instead, you can create a BigQuery partitioned table with a partition column for loading the data; while loading new data, you can then check whether the new data falls into the same partition.
Hitting partition-level data in Hive or BigQuery is more efficient than selecting the entire data and comparing it in Spark.
The same applies to Hive as well.
see this Creating partitioned tables
or
Creating and using integer range partitioned tables
Option 2:
Another alternative: BigQuery has a MERGE statement. If your requirement is to merge the data without comparison, you can go ahead with the MERGE statement; see the doc link below.
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we can get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write an individual statement to update changes in the target table.
There are many ways this problem can be solved. One of the less expensive, more performant and scalable ways is to use a datastore on the file system to determine which data is truly new.
As data comes in for the 1st time, write it to 2 places: the database and a file (say, in S3). If data is already in the database, then you need to initialize the local/S3 file with the table data.
From the 2nd time onwards, as data comes in, check whether it is new based on its presence in the local/S3 file.
Mark delta data as new or updated. Export this to database as insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won’t be coming. Regularly truncate this file to keep data within that time range.
You can also bucket and partition this data. You can use deltalake to maintain it too.
One downside is that whenever the database is updated, this file may need to be updated too, depending on whether relevant data has changed. You can maintain a marker column on the database table to record the sync date, and index that column. Read changed records based on this column and update the file/deltalake.
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them off the critical path is better.
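A rough sketch of the "side file" idea in plain Python, to show the classification step only. An in-memory dict stands in for the local/S3 file (or Delta table), records are assumed to carry a hypothetical "id" key, and a content hash is one possible way to tell an unchanged record from an updated one; none of these details come from the answer itself.

```python
import hashlib
import json

def fingerprint(record):
    """Stable content hash of a record (keys sorted so field order does not matter)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def classify(incoming, snapshot):
    """Split incoming records into inserts and updates, refreshing the snapshot.

    `snapshot` maps record id -> fingerprint and stands in for the side file.
    """
    inserts, updates = [], []
    for record in incoming:
        key, digest = record["id"], fingerprint(record)
        if key not in snapshot:
            inserts.append(record)      # never seen before: insert
        elif snapshot[key] != digest:
            updates.append(record)      # seen, but content changed: update
        snapshot[key] = digest          # unchanged records fall through
    return inserts, updates

snapshot = {}
first_inserts, first_updates = classify(
    [{"id": 1, "total": 10}, {"id": 2, "total": 20}], snapshot
)
second_inserts, second_updates = classify(
    [{"id": 1, "total": 10}, {"id": 2, "total": 25}, {"id": 3, "total": 5}], snapshot
)
```

Only the records in the two returned lists would need to be exported to the database, so the table itself is never read for the comparison.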
Shouldn't you have a last update time in your DB? The approach you are using doesn't sound scalable, so if you had a way to set an update time on each row in the table, it would solve the problem.

How to insert nested data to an existing record using bigquery streaming

I am trying to understand BigQuery and see if it fits our needs.
One of the basic requirements we have is to store a nested structure such that the nested part is stored separately from the main record.
e.g.
Let's say we have an employee record. After the main data for the employee is stored, say a minute later, another record arrives with the employee's previous workplace (and then another such record may arrive).
So we need to store the first employee record, and then update the structure to add a detail about the employee; this detail is also inserted as a new record and does not overwrite an existing record.
How can this be done in BigQuery?
Assuming we may have different sources of the data?
The preferred and recommended way to store that in BigQuery is append-only. That means that you are limited in doing updates/deletes, and you constantly insert new rows.
With a stream of rows from the same user, you need to write your queries in such a way that they pick the last row to obtain the most recent profile. But you keep the full 'versioning' of the stream that came in.
In other words, you use the Streaming Insert functionality to constantly add new rows. Then your SQL queries, usually with window functions, pick the last row.
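A small illustration of the append-only pattern with a window function picking the last row per employee. It runs on SQLite (3.25+ for window functions) through Python's sqlite3 rather than on BigQuery, and the employee_events table with its columns is a hypothetical schema invented for the example.

```python
import sqlite3

# SQLite stand-in for BigQuery; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_events (employee_id INTEGER, detail TEXT, inserted_at TEXT)"
)

# Every arriving record is appended; nothing is updated in place.
conn.executemany(
    "INSERT INTO employee_events VALUES (?, ?, ?)",
    [
        (1, "hired", "2020-01-01T09:00:00"),
        (1, "previous workplace: Acme", "2020-01-01T09:01:00"),
        (2, "hired", "2020-01-01T09:05:00"),
    ],
)

# Rank each employee's rows from newest to oldest and keep the newest one.
latest = conn.execute("""
    SELECT employee_id, detail FROM (
        SELECT employee_id, detail,
               ROW_NUMBER() OVER (
                   PARTITION BY employee_id ORDER BY inserted_at DESC
               ) AS rn
        FROM employee_events
    )
    WHERE rn = 1
    ORDER BY employee_id
""").fetchall()

# The full history is still available for "versioning" queries.
history_len = conn.execute(
    "SELECT COUNT(*) FROM employee_events WHERE employee_id = 1"
).fetchone()[0]
```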
You cannot routinely update a row or append to a record, as BigQuery (at the time of writing) limits DML statements to 96 per table per day.

How can I present the changes for updated data in Tableau

I am working on some data sets which get updated daily. By updated, I mean that three things happen:
1. New rows get added.
2. Some rows get deleted.
3. Some existing rows get replaced with new values.
Now I have prepared dashboards in Tableau to analyze daily data, but I would also like to compare how things are changing daily (i.e. are we progressing or making a loss compared to the previous day).
I am aware that we can take extracts from the data set. But if I go this way, I am not sure how to use all the extracts in one worksheet and compare the info given by all of them.
Tableau is simply a mechanism that builds an SQL query in the background and then builds tables, charts and so on from the fetched results. This means that if you delete a row from the table, it no longer exists, so how could Tableau read it? If anything, your DB architecture should create new records and give each one a creation timestamp. You would NOT delete a record and put in a new one; then you would only ever have one record in that table. This sounds like a design issue.

collecting mysql statistics

What would be the easiest way to count the new records that are inserted into a database? Is it possible to include a count query in with the load query?
Or is something more complex needed, such as recording the existing last record and counting everything added after it?
edit:
I have a cron job, that uses LOAD DATA INFILE in a script that is passed directly to mysql. This data is used with a php web application. As part of the php web application, I need to generate weekly reports, including how many records were inserted in the last week.
I am unable to patch MySQL or drastically change the database schema/structure, but I am able to add new tables or fields. I would prefer not to count records from the CSV file and store this result in a text file or something. Instead, I would prefer to do everything from within PHP with queries.
Assuming you're using MySQL 5 or greater, you could create a trigger which would fire upon inserting into a specific table. Note that an "insert" trigger also fires with the "LOAD" command.
Using a trigger would require you to persist the count information in a separate table. Basically you'd need to create a new table with 1 row/column to hold the count. The trigger would then update that value with the amount of data loaded.
Here's the MySQL manual page on triggers; the syntax is fairly straightforward. http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
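As a sketch of the trigger-plus-counter idea, the following runs on SQLite through Python's sqlite3, since that needs no server. SQLite's CREATE TRIGGER syntax differs slightly from MySQL's (MySQL requires FOR EACH ROW and typically a DELIMITER change in the client), so treat this as an illustration of the mechanism, not as MySQL code. The records and insert_counter tables are hypothetical.

```python
import sqlite3

# SQLite stand-in for MySQL; table and trigger names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT);

    -- One row, one column: the running count of inserted rows.
    CREATE TABLE insert_counter (inserted INTEGER NOT NULL);
    INSERT INTO insert_counter VALUES (0);

    -- Fires once per inserted row and bumps the counter.
    CREATE TRIGGER count_inserts AFTER INSERT ON records
    BEGIN
        UPDATE insert_counter SET inserted = inserted + 1;
    END;
""")

# Stand-in for the bulk load performed by the cron job.
conn.executemany("INSERT INTO records (payload) VALUES (?)", [("a",), ("b",), ("c",)])
conn.commit()

inserted = conn.execute("SELECT inserted FROM insert_counter").fetchone()[0]
```

For weekly reports, the counter could be read and reset (or logged with a timestamp) once per reporting period.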
edit
Alternatively, if you don't want to persist the data within the database you could perform your "Load" operations within a stored procedure. This would allow you to perform a select count() on the table before you begin the Load and after the Load is complete. You would just need to subtract the resulting values to determine how many rows were inserted during the Load.
Here's the MySQL manual page on procedures.
http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html
That would probably depend on what is determined as being new. Is it entries entered into the database in the last five minutes or 10 minutes etc? Or is it any record past a certain Auto ID?
If you are looking at time based method of determining what's new, you can have a field (probably of type datetime) that records the time when the record was inserted and to get the number, you simply do a...
select count(*) from table where currentTime > 'time-you-consider-to-be-new'
If you don't want to record the time, you can use an auto-increment key and simply keep track of the last inserted ID, then count the rows that come after it in any given time window. So if one hour ago the last ID was 10000, every record with a greater ID has been inserted since then. You will need to count all records with an ID greater than 10000, keep track of the new last insert ID, and repeat whenever needed.
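The auto-increment variant can be sketched like this, again on SQLite via Python's sqlite3 as a stand-in for MySQL. The records table is hypothetical, and where the remembered ID is stored between runs (a variable here) is left open.

```python
import sqlite3

# SQLite stand-in for MySQL; the table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
conn.executemany("INSERT INTO records (payload) VALUES (?)", [("a",), ("b",)])

def current_max_id(conn):
    """Highest ID handed out so far (0 for an empty table)."""
    return conn.execute("SELECT COALESCE(MAX(id), 0) FROM records").fetchone()[0]

# Remember the high-water mark at the start of the reporting window.
last_seen_id = current_max_id(conn)

# New rows arrive during the window.
conn.executemany("INSERT INTO records (payload) VALUES (?)", [("c",), ("d",), ("e",)])

# Everything above the remembered ID was inserted during the window.
new_count = conn.execute(
    "SELECT COUNT(*) FROM records WHERE id > ?", (last_seen_id,)
).fetchone()[0]

# Move the mark forward for the next window.
last_seen_id = current_max_id(conn)
```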
If you are not looking at a specific table, you can use the following:
show global status like "Com_%";
This will show you statistics for every type of query. These numbers just keep on counting, so if you want to use them, record the initial number when starting to track the queries, and subtract this from your final number (but yea, that's a given).
If you are looking for pure statistics, I can recommend using Munin with the MySQL plugins.
Where do you load the data from? You might consider counting the records before you insert them into the database. If it's an SQL script, you might write a quick and dirty bash script (with grep or something similar) to count the fields.
You say you can't change the structure. Does that mean you can't change the table you are inserting into, or you can't change the database at all? If you can add a table, then just create a table with 2 columns - a timestamp and the key of the table you are loading. Before you load your csv file, create another csv file with just those two columns, and load that csv after your main one.
This might be simpler than you want, but what about a Nagios monitor to track the row count? (Also consider asking around on serverfault.com; this stuff is totally up their alley.)
Perhaps you could write a small shell script that queries the database for the number of rows. You could then have a Cron job that runs every minute/hour/day etc and outputs the COUNT to a log file. Over time, you could review the log file and see the rate at which the database is growing. If you also put a date in the log file, you could review it easier over longer periods.
See if this is the kind of MySQL data collection you're interested in: http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring.
If that is the case, Google offers a MySQL patch (to apply to a clean mysql directory source) at http://google-mysql-tools.googlecode.com/svn/trunk/mysql-patches/all.v4-mysql-5.0.37.patch.gz. You can read more about the patch at http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patches.
If this is not what you're looking for, I suggest you explain yourself a little more in order for us to help you better.
Could you use a trigger on the table which will insert into a table you created, which in the structure has a timestamp?
You could then use a date calculation on a period range to find the information needed.
I don't know what version of MySQL you are using, but here is a link to the syntax for trigger creation in version 5.0: http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
Good luck,
Matt
Well, if you need exhaustive information: which rows were inserted, updated or deleted, it might make sense to create an additional audit table to store those things with a timestamp. You could do this with triggers. I would also write a stored procedure which would execute as event and erase old entries (whatever you consider old).
Refer to the link posted by Lima on how to create triggers in MySQL.
Refer to page 655 of "MySQL Cookbook" by Paul Dubois (2nd Edition) or page 158 of "SQL for smarties" by Joe Celko.
So, will the 'load' only insert new data into the table, or rewrite the whole table?
If it only loads new data, then you can do a select count(*) from yourtable
once before the loading and once after the loading; the difference will show you how many new records were inserted.
If, on the other hand, you rewrite the whole table and want to find the records that differ from the previous version, then you would need a completely different approach.
Which one is it?
Your question is a bit ambiguous, but the MySQL C API provides a function, "mysql_affected_rows", that you can call after each query to get the number of affected rows. For an insert it returns the number of rows inserted. Be aware that for updates it returns the number of rows changed, not the number of rows that matched the where clause.
If you are performing a number of queries and need to know how many were inserted the most reliable way would probably be doing a count before and after the queries.
As noted in sobbayi's answer adding a "created at" timestamp to your tables would allow you to query for records created after (or before) a given time.
UPDATE:
OK here is what you need to do to get a count before and after:
create a table for the counts:
create table row_counts (ts timestamp not null, row_count integer not null);
in your script add the following before and after your LOAD DATA INFILE query:
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
load data infile......
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
the row_counts table will now have the count before and after your load.
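A runnable approximation of the before/after snapshot approach, using SQLite through Python's sqlite3 in place of MySQL. The row_counts table follows the answer; your_table, its payload column and the executemany call standing in for LOAD DATA INFILE are placeholders for this example, and datetime('now') replaces MySQL's now().

```python
import sqlite3

# SQLite stand-in for MySQL; your_table and its contents are placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE your_table (payload TEXT);
    CREATE TABLE row_counts (ts TEXT NOT NULL, row_count INTEGER NOT NULL);
""")

# Same idea as the answer's statement: record a timestamped row count.
SNAPSHOT = """
    INSERT INTO row_counts (ts, row_count)
    SELECT datetime('now'), COUNT(*) FROM your_table
"""

conn.execute(SNAPSHOT)  # count before the load

# Stand-in for the LOAD DATA INFILE step.
conn.executemany(
    "INSERT INTO your_table VALUES (?)", [("r1",), ("r2",), ("r3",), ("r4",)]
)

conn.execute(SNAPSHOT)  # count after the load
conn.commit()

# The difference between the last two snapshots is the number of rows loaded.
counts = [row[0] for row in conn.execute("SELECT row_count FROM row_counts ORDER BY rowid")]
inserted = counts[-1] - counts[-2]
```

Because every snapshot is timestamped, a weekly report can be produced later by differencing the first and last snapshots of the week, assuming rows are only ever added by the load.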
show global status like 'Com_insert';
flush status and show session status... will work for just the current connection.
see http://dev.mysql.com/doc/refman/5.1/en/server-status-variables.html#statvar_Com_xxx
Since you asked for the easiest way, I would suggest using a trigger on insert. You could use a single-column, single-row table as a counter and update it with the trigger.