BigQuery update multi tables - google-bigquery

I hold a large volume of transaction data in daily tables, one per business date:
transaction_20140101
transaction_20140102
transaction_20140103 ...
The process flow is:
1. I load the batch of newly arrived files into a temp table.
2. I group by the transaction_date field to find out which dates the batch contains. For each date, I query the temp table for that date and insert the result into the matching transaction_YYYYMMDD table.
3. I run step 2 in parallel to save time, because the temp table might contain data belonging to 20 different days.
My challenge is what to do if one of these processes fails and the others do not.
I can't simply run everything again, since that would create duplicates in the tables that were already updated successfully.
I solved this by tracking the state of each update myself, but it seems too complex.
Is this the best practice for dealing with multiple tables?
I would be glad to hear how others handle loading data into multiple tables by business date, and not just by insert date (which is easy).

You could add an extra step in the middle, where instead of moving directly from today's temp table into the permanent business-date tables, you extract into temporary daily tables and then copy the data over to the permanent tables.
1. Query from today's temp table, sharded by day into tmp_transaction_YYMMDD. Use the WRITE_EMPTY or WRITE_TRUNCATE write disposition so that this step is idempotent.
2. Verify that all expected tmp_transaction_YYMMDD tables exist. If not, debug the failures and go back to step 1.
3. Run parallel copy jobs from each tmp_transaction_YYMMDD table to append to the corresponding permanent transaction_YYMMDD table.
4. Verify that the copy jobs succeeded. If not, retry the individual failures from step 3.
5. Delete the tmp_transaction_YYMMDD tables.
The advantage of this is that you can catch query errors before affecting any of the end destination tables, then copy over all the added data at once. You may still have the same issue if the copy jobs fail, but they should be easier to debug and retry individually.
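The staged flow described in this answer can be sketched roughly as follows. This is only a local simulation: sqlite3 stands in for BigQuery, dropping and recreating a staging table plays the role of WRITE_TRUNCATE, and the function names, the three-column schema, the temp_batch table name and the `done` set that remembers finished copies are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema shared by the temp, staging and permanent tables.
SCHEMA = "(id INTEGER, transaction_date TEXT, amount REAL)"


def stage_by_day(conn, temp_table="temp_batch"):
    """Shard the temp table into one tmp_transaction_<day> table per date.

    Each staging table is dropped and rebuilt (the WRITE_TRUNCATE analogue),
    so rerunning this step never duplicates rows.
    """
    days = [r[0] for r in conn.execute(
        f"SELECT DISTINCT transaction_date FROM {temp_table} ORDER BY 1")]
    for day in days:
        if not day.isdigit():  # the day becomes part of a table name
            raise ValueError(f"unexpected business date: {day!r}")
        conn.execute(f"DROP TABLE IF EXISTS tmp_transaction_{day}")
        conn.execute(f"CREATE TABLE tmp_transaction_{day} {SCHEMA}")
        conn.execute(
            f"INSERT INTO tmp_transaction_{day} "
            f"SELECT * FROM {temp_table} WHERE transaction_date = ?", (day,))
    conn.commit()
    return days


def append_staged(conn, days, done):
    """Append each staging table to its permanent table.

    `done` records the days already copied, so a retry only repeats
    the copies that have not succeeded yet.
    """
    for day in days:
        if day in done:
            continue
        conn.execute(f"CREATE TABLE IF NOT EXISTS transaction_{day} {SCHEMA}")
        conn.execute(
            f"INSERT INTO transaction_{day} SELECT * FROM tmp_transaction_{day}")
        conn.commit()
        done.add(day)


def drop_staged(conn, days):
    """Delete the staging tables once every copy has succeeded."""
    for day in days:
        conn.execute(f"DROP TABLE IF EXISTS tmp_transaction_{day}")
    conn.commit()
```

Rerunning stage_by_day is harmless because it rebuilds the staging tables from scratch. The append step is the only one that is not naturally idempotent, which is why the sketch tracks completed days explicitly.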

Our incentive for incremental loads is cost, and therefore we are interested in "touching each record only once".
We use table decorators to identify the increment. We manage the increment timestamps independently and add them to the query at run time. It requires some logic to maintain, but nothing too complicated.

Related

Keeping BigQuery table data up-to-date

This is probably not the intended use case for BigQuery, but I have the following problem: I need to periodically update a BigQuery table. The update should be "atomic" in the sense that clients reading the data see either only the old version or the complete new version. The only solution I have now is to use date partitions. The problem with this solution is that clients that just need up-to-date data have to know about the partitions and read only from certain ones. Every time I want to run a query, I first have to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally, I would like a solution that is easy and transparent for the clients who read the data.
You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including a single DML statement (INSERT/UPDATE/DELETE/MERGE) and a single load job, are atomic. Your reader reads either the old data or the new data.
Since multi-statement transactions are not available right now, if you have updates that don't fit into a single load job, the solution is:
1. Load the update into a staging table.
2. After all loads have finished, use a single INSERT or MERGE to merge the updates from the staging table into the primary data table.
The drawback: scanning the staging table is not free.
Update: since you have multiple tables to update atomically, there is a small trick that may be helpful.
Assuming each table that needs an update has an ActivePartition column as its partition key, you can keep a separate table with only one row:
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to the new active date. Your users then run a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active

The best way to Update the database table through a pyspark job

I have a spark job that gets data from multiple sources and aggregates into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in. The comparison happens in the spark layer.
I was wondering if there is any better way to compare, that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
"One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in"
IMHO, comparing the entire data set in order to load new data is not performant.
Option 1:
Instead, you can create a partitioned BigQuery table with a partition column for loading the data. While loading new data, you can then check whether it falls into an existing partition.
Hitting partition-level data in Hive or BigQuery is more efficient than selecting the entire data set and comparing it in Spark, so the same approach applies to Hive as well.
See the BigQuery documentation on creating partitioned tables, or on creating and using integer range partitioned tables.
Option 2:
Another alternative: BigQuery has a MERGE statement. If your requirement is to merge the data without a comparison, you can go ahead with MERGE. From the documentation:
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we can get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write an individual statement for each kind of change to the target table.
There are many ways to solve this problem. One of the less expensive, more performant and scalable ways is to use a datastore on the file system to determine which data is truly new.
When data comes in for the first time, write it to two places: the database and a file (say, in S3). If data is already in the database, you need to initialize the local/S3 file with the table data.
From the second time onwards, check whether incoming data is new based on its presence in the local/S3 file.
Mark the delta data as new or updated, and export it to the database as inserts or updates.
As time goes by, this file will get bigger and bigger. Define a date range beyond which updated data won't be coming, and regularly truncate the file to keep only data within that range.
You can also bucket and partition this data. You can use Delta Lake to maintain it too.
One downside is that whenever the database is updated, the file may need to be updated as well, depending on whether relevant data has changed. You can maintain a marker column on the database table to signify the sync date, and index that column too. Read the changed records based on this column and update the file/Delta Lake table.
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them off the critical path is better.
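A minimal sketch of the delta-detection idea in this answer, in plain Python rather than Spark. The side file is represented by an in-memory dict that maps each key to a hash of its row. The function names, the `id` key and the choice of hashing are assumptions for illustration, and in practice the snapshot would be persisted to a file, S3 or a Delta Lake table.

```python
import hashlib
import json


def row_hash(row):
    """Stable fingerprint of a row, independent of key order."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify(incoming, snapshot, key="id"):
    """Split incoming rows into inserts and updates using the snapshot.

    `snapshot` maps key -> hash of the last version seen. It is updated
    in place so that the next batch is compared against this one.
    Unchanged rows are dropped and never reach the database.
    """
    inserts, updates = [], []
    for row in incoming:
        fingerprint = row_hash(row)
        previous = snapshot.get(row[key])
        if previous is None:
            inserts.append(row)
        elif previous != fingerprint:
            updates.append(row)
        snapshot[row[key]] = fingerprint
    return inserts, updates
```

Only the two returned lists need to be sent to the database, so the existing table is never read for the comparison.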
Shouldn't you have a last update time in your DB? The approach you are using doesn't sound scalable. If you had a way to set an update time on each row in the table, that would solve the problem.

What are the benefits of a Make Table vs a Select query in Access?

I know you can run SELECT queries on top of SELECT queries in Access, but the application also provides the Make Table query type.
I'm wondering what the benefits/reasons for using Make Table might be?
You would usually use Make Table for performance reasons. If you have a fairly complex query that returns a subset of your table's data, and that you may need to retrieve multiple times, it can be expensive to re-run the query multiple times.
Using Make Table allows you to incur the cost of running the expensive query once, and make a copy of the query results into a table. Querying this copy would then be a lot less expensive than running your original expensive query.
This is usually a good option when you don't expect your original data to change frequently, or if you don't mind working off a copy of the data that may not be 100% up-to-date with the original data.
Notice what the following article on Create a make table query has to say:
Typically, you create make table queries when you need to copy or archive data. For example, suppose you have a table (or tables) of past sales data, and you use that data in reports. The sales figures cannot change because the transactions are at least one day old, and constantly running a query to retrieve the data can take time — especially if you run a complex query against a large data store. Loading the data into a separate table and using that table as a data source can reduce workload and provide a convenient data archive. As you proceed, remember that the data in your new table is strictly a snapshot; it has no relationship or connection to its source table or tables.
The main difference here is that a make table query creates a table. When you are done with the table, you then have to spend effort and time deleting it and recovering the VERY LARGE increase in the size of the database file. For general reports, a query of the data makes much more sense. A comparison would be building a NEW garage every time you want to park your car.
The database engine and query system can fetch rows at a very high rate, and those results can then be rendered into a report or form without creating a temp table. It makes little sense to go through all the trouble of having the system create a WHOLE NEW table for results that can easily be sent straight to a report.
In other words, creating a whole table just to display or use data that the database engine has already fetched and returned makes little sense. A table is a set of rows holding data that can be updated, and the results are permanent. A query is an “on the fly” result or subset of the data that exists only in memory and is discarded after you use it.
So for general reporting and display of data, it makes no sense to create a temp table. A MUCH WORSE issue is that if two users want to run a report with different results and you send both results to the SAME temp table, you get a collision between the two users. So using a temp table in Access for the most part makes little sense, and EVEN MORE so in a multi-user environment. And as noted, once the table is created, you need to delete it after you are done, which becomes even more of a problem with many users in a multi-user database.
However, as pointed out, if the resulting data needs additional processing, sending the results to a temp table can be useful even in a multi-user environment. This approach assumes that EACH USER has their own front end, that is, their own copy of the application side. Better still is to create the temp table outside of the front-end application that resides on each computer. Since the front end is placed on each computer, the temp table is not created in the production database (back end), so multiple users can work correctly without each of them creating a temp table in the production back end. So if you adopt a make table query in a multi-user application, it should likely run on each local workstation and not in the back-end database.
Thus, for the most part, a make table query and the reporting and querying of data are VERY different goals and tasks. As a general rule, you don't want to create a whole brand-new table for a simple query. In a multi-user database system, the users might run hundreds of reports in a given day, and FEW if any systems will send that data to a temp table instead of sending the query results directly to the report.
It creates a table, which is useful when you need one for temporary use: for example, when you have to modify the data for calculations or further processing without disturbing the original data.

No waiting while Truncate Table

I have an SSIS package that runs every hour. The package first truncates a table and then populates it with new data, which takes 15-20 minutes. While the package runs, the data is not available to the users, so they have to wait until the package has finished. Is there any way to handle this so that users don't have to wait?
Do not truncate the table. Instead, add an audit column of date data type, partition the table into hourly partitions on this audit column, and drop the old partition once the new partition has been loaded with new data.
Make sure the users' queries are directed to the proper partition with the help of the audit column.
You can do an 'A-B flip'.
Instead of truncating the client-facing table and reloading it, you could use two tables to do the job.
For example, if the table in question is called ACCOUNT:
1. Load the data to a table called STG_ACCOUNT
2. Rename ACCOUNT to ACCOUNT_OLD
3. Rename STG_ACCOUNT to ACCOUNT
4. Rename ACCOUNT_OLD to STG_ACCOUNT
By doing this, you minimize the amount of time the users have an empty table.
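The rename sequence above can be tried out locally with the sketch below. It uses sqlite3 as a stand-in, where ALTER TABLE ... RENAME TO does the renaming; on SQL Server the renames would typically go through sp_rename instead. The function name `ab_flip` and the choice to wrap the three renames in one transaction are assumptions, not part of the answer.

```python
import sqlite3


def ab_flip(conn, live="ACCOUNT", staging="STG_ACCOUNT"):
    """Swap the freshly loaded staging table with the live table.

    The three renames run inside one explicit transaction, so a failure
    halfway through rolls back and leaves the original names in place.
    The connection is expected to be in autocommit mode
    (isolation_level=None) so that BEGIN/COMMIT are under our control.
    """
    old = f"{live}_OLD"
    conn.execute("BEGIN")
    try:
        conn.execute(f"ALTER TABLE {live} RENAME TO {old}")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {live}")
        conn.execute(f"ALTER TABLE {old} RENAME TO {staging}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

After the flip, the previous live data sits in the staging table, ready to be truncated and reloaded on the next run without touching what users read.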
It's a very dangerous practice, but you can change the isolation level of your transactions (I mean the users' queries) from ReadCommitted/Serializable to ReadUncommitted. The behavior of such queries is very hard to predict, though. If your table is being modified by the SSIS package (insert/delete/update) while end users do uncommitted reads (like SELECT * FROM Table1 WITH (NOLOCK)), some rows can be counted several times or missed.
If users want to read only the 'new-hour data', you can try changing the isolation level to 'dirty read', but be careful!
If they can work with data from the previous hour, the best solution is the one described by Arnab, but partitioning is available only in Enterprise edition (in versions before SQL Server 2016 SP1). In other SQL Server editions, use the rename approach, as Zak said.
[Updated] If the main lag (tens of minutes, as you said) is caused by complex calculations (and NOT by the number of loaded rows!), you can use another table as a buffer. Store a batch of rows there (hundreds, thousands, etc.) and then reload them into the main table. That way, new data becomes available in portions without 'dirty read' tricks.

Query Performance help

I have a long-running job. The records to be processed are in a table with around 100K records.
During the whole job, every query against this table runs against all 100K records.
After processing, the status of each record is updated in the same table.
I want to know whether it would be better to add another table where I update the record statuses, and keep deleting processed records from the master table, so that as the job moves forward the number of records in the master table decreases and query performance improves.
EDIT: The master table is basically used for this load only. I receive a flat file, which I upload as is before processing. After running validations on this table, I pick one record at a time and move the data to the appropriate system tables.
I had a similar performance problem, where a table generally has a few million rows but I only need to process what has changed since the start of my last execution. My target table has an IDENTITY column, so when my batch process begins, I get the highest IDENTITY value from the set of rows whose IDs are greater than the one recorded by my previous batch execution. Then, upon successful completion of the batch job, I add a record to a separate table holding this highest successfully processed IDENTITY value, and use it as the start input for the next batch invocation. (I'll also add that my bookmark table is general purpose, so I have multiple different jobs using it, each with a unique job name.)
If you are experiencing locking issues because processing each record takes a long time, you can use the approach above but break your sets into chunks of 1,000 rows (or whatever chunk size your system can process in a timely fashion), so that you only lock smaller sets at any given time.
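A rough sketch of the bookmark approach with chunking, using sqlite3 and an ordinary integer id in place of a SQL Server IDENTITY column. The table names `source` and `bookmark`, the column names and the function name are hypothetical. The sketch also overwrites one bookmark row per job instead of adding a new record on every run, which is a simplification of what the answer describes.

```python
import sqlite3


def process_new_rows(conn, job_name, handle, chunk_size=1000):
    """Process rows added since the last successful run, in chunks.

    The bookmark is advanced only after every chunk has been handled,
    so a failed run starts again from the previous bookmark.
    Returns the number of rows processed.
    """
    row = conn.execute(
        "SELECT last_id FROM bookmark WHERE job_name = ?", (job_name,)).fetchone()
    last_id = row[0] if row else 0

    # Fix the upper bound up front; rows arriving later wait for the next run.
    high = conn.execute(
        "SELECT MAX(id) FROM source WHERE id > ?", (last_id,)).fetchone()[0]
    if high is None:
        return 0

    processed = 0
    start = last_id
    while start < high:
        chunk = conn.execute(
            "SELECT id, payload FROM source WHERE id > ? AND id <= ? "
            "ORDER BY id LIMIT ?", (start, high, chunk_size)).fetchall()
        if not chunk:
            break
        for record in chunk:
            handle(record)
        processed += len(chunk)
        start = chunk[-1][0]

    conn.execute(
        "INSERT OR REPLACE INTO bookmark (job_name, last_id) VALUES (?, ?)",
        (job_name, high))
    conn.commit()
    return processed
```

This assumes `bookmark.job_name` is a primary key, so that INSERT OR REPLACE keeps a single row per job.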
Few pointers (my two cents):
Consider splitting that table, similar to the "slowly changing dimension" technique, into a few "intermediate" tables, one per "system table" destination; then bulk load your system tables instead of loading record by record.
Drop the "input" table before the bulk load and re-create it, to get rid of indexes, etc.
Do not put unnecessary keys or indexes on that table before the load.
Consider switching the DB "recovery model" to bulk-logged, so that bulk operations are minimally logged.
Can you use a SSIS (ETL) task for loading, cleaning and validating?
UPDATE:
Here is a typical ETL scenario -- well, depends on who you talk to.
1. Extract to flat_file_1 (you have that)
2. Clean flat_file_1 --> SSIS --> flat_file_2 (you can validate here)
3. Conform flat_file_2 --> SSIS --> flat_file_3 (apply all company standards)
4. Deliver flat_file_3 --> SSIS (bulk) --> db.ETL.StagingTables (several, one per your destination)
4B. insert into destination_table select * from db.ETL.StagingTable (bulk load your final destination)
This way, if a process (1-4) times out, you can always restart from the intermediate file. You can also inspect each stage and create report files from SSIS for each stage to control your data quality. Operations 1-3 are essentially slow; here they happen outside of the database and can be done on a separate server. If you archive flat_file(1-3), you also have an audit trail of what's going on -- good for debugging too. :)