I'm investigating a data correctness issue in a regularly-running job that I wrote, and the problem seems to be caused by BigQuery overwriting the same table twice in a non-atomic way. More specifically, I had two copies of the same query running at the same time (due to retry logic), both set to overwrite the same table (using the WRITE_TRUNCATE option), and the resulting table had two copies of every row. I was expecting one query to write a table with the query results and the other query to overwrite it with the same results, rather than ending up with a double-sized table.
My understanding when designing the system was that all BigQuery actions are atomic (based on atomic inserts in big query, Can I safely query a BigQuery table being replaced with WRITE_TRUNCATE, and Views are failing when their underlying table is repopulated). Is the issue I'm running into a bug, or am I misunderstanding the exact guarantees I can expect?
Looking through history, it looks like this has happened in at least 4 separate cases in the past week.
Here's the timeline of what causes this to happen (with the specific details applying to the most noticeable case):
At about 18:07 April 30th UTC, my code submitted 82 queries at the same time. Each one queried a table ending in conversions_2014_04_30_14 and another table and wrote to a table ending in conversions_2014_04_30_16 (specifying WRITE_TRUNCATE).
About 25 minutes later, 25 of the queries were still not finished (which is more than usual), so it triggered "retry" logic that gives up on all queries still running and just submits them again (this is to work around an issue I've seen where queries would stay in pending for hours without being run, which I mentioned here: https://code.google.com/p/google-bigquery/issues/detail?id=83&can=1 ). This means that 50 queries were outstanding at once, two of each of the 25 queries that hadn't finished yet.
After all queries finished, 6 of the 82 resulting tables were twice as big as they should be.
Here's one example:
First query job: 124072386181:job_tzqbfxfLmZv_QMYL6ozlQpWlG5U
Second query job: 124072386181:job_j9_7uJEjtvYbyeVmEVP0u2er9Lk
The resulting table: 124072386181:bigbingo_history.video_task_companions_conversions_2014_04_30_16
And another example:
First query job: 124072386181:job_TQJzGabFT9FtHI05ftTkD5O8KKU
Second query job: 124072386181:job_5hogbjnLX_5a2opEJl9Jacnn53s
Table: 124072386181:bigbingo_history.Item_repetition__Elimination_conversions_2014_04_27_16
The tables haven't been touched since these queries ran (aside from a schema addition for the first table), so they still contain the duplicate rows. One way to confirm this is to see that the queries all had "GROUP BY alternative, bingo_id", but the tables have two of each (alternative, bingo_id) pair.
We had a bug in which write-truncate could end up appending in certain cases. We released the fix yesterday (May 22), and haven't seen any further instances of the problem since then.
Related
We process CSV files from our upstream systems and load them to our master tables in our SQL Server database. We are currently on boarding a new upstream system and suddenly our UPDATE statement took very long time. It could be due to incoming data having previous related data in our system and it caused huge update. We are able to find out the table which was getting updated through sp_whoisactive.
My query is:
Post the update, is there a way to figure out the number of rows updated for the table from some place like error log or default trace or through DMV?
During update, if we find these kind of huge update happening in future, can we set up some trace to identify the number of rows will get updated or figure out the update statement with current parameters (current values of parameters) ? In sp_whoisactive we get update statement with variables. But we don't know the current parameters.
Proactively, should we setup extended events or something else to capture these kinds of huge updates in future?
Let's start with your third question first. Yes. If you really want to track specific values for changes, the best way to do this is through Extended Events and you must set it up and have it running ahead of time. As you'll see in the rest of this post, there may be no easy way to retrieve the specific information you're looking for, depending. Something like sql_statement_completed will give you precise row counts for a given event. You can filter it to a specific table.
Second question, during updates, you can't really see how many rows are being updated accurately within a transaction. However, you can get a guess at how many rows are likely to be updated. The execution plan will have the row estimates that it anticipates will occur. So, you can query this from sys.dm_exec_query_plan. Combine it with sys.dm_exec_sql_batch to find the query. I'm sure sp_whoisactive can also supply this information (it's just querying the DMVs). You can also watch Live Query Statistics if you've set your server up correctly ahead of time. That will give you the estimated row counts, but then it will show you the actuals as they occur.
Now for the tough question. Can you get row counts after the fact? Kind of. If the query just executed and hasn't executed again, sys.dm_exec_sql_batch does have a last_rows column that will provide that info. If more than one query has run though, that information is lost because it's only the most recent execution of the query. If you're on Azure SQL Database, or SQL Server 2019, you can also look to sys.dm_exec_query_plan_stats to see the last Execution Plan Plus Runtime Metrics. That will also have row counts Although, if that's all you're looking for, and this is the most recent execution, the batch DMV is easier. I don't know if that column is included in sp_whoisactive, but you can just query the DMV yourself.
However, if the query has run more than once, you're out of luck. You can look to the execution plan, as was mentioned before, to see what the row estimates are. If the query suffered from waits more than 30 seconds, it will show up in the system_health extended event session, but that won't include row counts. Really, unless it's the very last time the exact query was run, there's no way after the fact to get the row count value.
In a DWH environment for performance reasons I need to materialize a view into a table with approx. 100 columns and 50.000.000 records. Daily ~ 60.000 new records are inserted and ~80.000 updates on existing records are performed. By decision I am not allowed to use materialized views because the architect claims this leads to performance issues. I can't argue the case anymore, it's an irrevocable decision and I have to accept.
So I would like to make a daily full load in the night e.g. truncate and insert. But if the job fails the table may not be empty but must contain the data from the last successful population.
Therefore I thought about something like a failover table, that will be used instead if anything wents wrong:
IF v_load_job_failed THEN failover_table
ELSE regular_table
Is there something like a failover table that will be used instead of another table depending on a predefined condition? Something like a trigger that rewrites or manipulates a select-query before execution?
I know that is somewhat of a dirty workaround.
If you have space for (brief) period of time of double storage, I'd recommend
1) Clone existing table (all indexes, grants, etc) but name with _TMP
2) Load _TMP
3) Rename base table to _BKP
4) Rename _TMP to match Base table
5) Rename _BKP to _TMP
6) Truncate _TMP
ETA: #1 would be "one time"; 2-6 would be part of daily script.
This all assumes the performance of (1) detecting all new records and all updated records and (2) using MERGE (INSERT+UPDATE) to integrate those changed records into base table is "on par" with full load.
(Personally, I lean toward the full load approach anyway; on the day somebody tweaks a referential value that's incorporated into the view def and changes the value for all records, you'll find yourself waiting on a week-long update of 50,000,000 records. Such concerns are completely eliminated with full-load approach)
All that said, it should be noted that if MV is defined correctly, the MV-refresh approach is identical to this approach in every way, except:
1) Simpler / less moving pieces
2) More transparent (SQL of view def is attached to MV, not buried in some PL/SQL package or .sql script somewhere)
3) Will not have "blip" of time, between table renames, where queries / processes may not see table and fail.
ETA: It's possible to pull this off with "partition magic" in a couple of ways that avoid a "blip" of time where data or table is missing.
You can, for instance, have an even-day and odd-day partition. On odd-days, insert data (no commit), then truncate even-day (which simultaneously drops old day and exposes new). But is it worth the complexity? You need to add a column to partition by, and deal with complexity of reruns - if you're logic isn't tight, you'll wind up truncating the data you just loaded. This does, however, prevent a blip
One method that does avoid any "blip" and is a little less "whoops" prone:
1) Add "DUMMY" column that always has value 1.
2) Create _TMP table (also with "DUMMY" column) and partition by DUMMY column (so all rows go to same partition)
-- Daily script --
3) Load _TMP table
4) Exchange partition of _TMP table with main base table WITHOUT VALIDATION INCLUDING INDEXES
It bears repeating: all of these methods are equivalent if resource usage to MV-refresh; they're just more complex and tend to make developers feel "savvy" for solving problems that have already been solved.
Final note - addressing David Aldridge - first and foremost, daily refresh tables SHOULD NOT have logging enabled. In recovery scenario, just make sure you have step to run refresh scripts once base tables are restored.
Performance-wise, mileage is going to vary on this; but in my experience, the complexity of identifying and modifying changed/inserted rows can get very sticky (at some point, somebody will do something to base data that your script did not take into account; either yielding incorrect results or performance obstacles). DWH environments tend to be geared to accommodate processes like this with little problem. Unless/until the full refresh proves to have overhead above&beyond what the system can tolerate, it's generally the simplest "set-it-and-forget-it" approach.
On that note, if data can be logically separated into "live rows which might be updated" vs "historic rows that will never be updated", you can come up with a partitioning scheme and process that only truncates/reloads the "live" data on a daily basis.
A materialized view is just a set of metadata with an underlying table, and there's no reason why you cannot maintain a table in a manner similar to a materialized view's internal mechanisms.
I'd suggest using a MERGE statement as a single query rather than a truncate/insert. It will either succeed in its entirety or rollback to leave the previous data intact. 60,000 new records and 80,000 modified records is not much.
I think that you cannot go far wrong if you at least start with a simple, single SQL statement and then see how that works for you. If you do decide to go with a multistep process then ensure that it automatically recovers itself at any stage where it might go wrong part way through -- that might turn out to be the tricky bit.
I have a table with 281,433 records in it, ranging from March 2010 to the current date (Sept 2014). It's a transaction table which consists of records that determine stock which is currently in and out of the warehouse.
When making picks from the warehouse, the system needs to look over every transaction from a particular customer that was ever made (based on the AccountListID field, which determines the customer, a customer might on average have about 300 records in the table). This happens 2-3 times per request from the particular .NET application when a picking run is done.
There are times when the database seemingly locks out. Some requests complete no bother, within about 3 seconds. Others hang for 'up to 4 minutes' according to the end users.
My guess is with 4-5 requests at the same time all looking at this one transaction table things are getting locked up.
I'm thinking about partitioning this table so that the primary transaction table only contains record from the last 2 years. The end user has agreed that any records past this date are unnecessary.
But I can't just delete them, they're used elsewhere in the system. I have indexes already in place and they make a massive difference (going from >30 seconds to <2, on the accountlistid field). It seems partitioning is the next step.
1) Am I going down the right route as a solution to my 'locking' problem?
2) When moving a set of records (e.g. records where the field DateTimeCheckedIn is more than 2 years old) is this a manual process or does partitioning automatically do this?
Partitioning shouldn't be necessary on a table with fewer than 300,000 rows, unless each record is really big. If a record is occupying more than 4k bytes, then you have 300,000 pages (2,400,000,000 bytes) and that is getting larger.
Indexes are usually the solution for something like this. Taking more than a second to return 300 records in an indexed database seems like a long time (unless the records are really big and the network overhead adds to the time). Your table and index should both fit into memory. Check your memory configuration.
The next question is about the application code. If it uses cursors, then these might be the culprit by locking rows under certain circumstances. For read-only cursors, "FAST_FORWARD" or "FORWARD READ_ONLY" should be fast. It is possible that if the application code is locking all the historical records, then you might get contention. After all, this would occur when two records (for different) customers are on the same data page. The solution is to not lock the historical records as you read them. Or, to avoid using cursors all together.
I don't think partitioning will be necessary here. You can probably fix this with a well-placed index: I'm thinking a single index covering (in order) company, part number, and quantity. Or, if it's an old server, possibly just add ram. Finally, since this is reading a lot of older data for transactions, where individual transactions themselves are likely never (or at most very rarely) updated once written, you might do better with a READ UNCOMMITTED isolation level for this query.
I have been trying to solve a problem where two concurrent updates on the same table are causing additional records to be created/inserted. Never experienced this in any other relational database, and nor would i. So i believe it's potential a quirk in redshifts architecture of distributing queries across multiple nodes, however cannot pinpoint or provide a real world example.
Before these two updates are run, i insert new data into the table. The insert contains a daily snapshot that fills out one day of data, most columns have empty values ready for the updates to populate them.
The updates are run concurrently, which are simple update sql's, updating their respective columns. If run individually i do not see additional records created and no duplication.
The updates operate across the entire table, over 200 million records, however the duplication occurs only in the records that where populated recently(the new data for that days period.
This is kind of a worry, as i would never assume an update would ever create new records, addition to the records created with the first insert.
What is even more bizzare is that the duplicate records hold different data.
I have checked to veryify that no other queries are running beyond the expected, by looking at redshifts query logs (stl_query).
I find really hard to believe that an update created new values, are you really sure about this?
I've been trough complicate situations when It comes to concurrent transactions on the same table, so what I suggest is that you explicitly lock your table with:
lock table <table> in exclusive mode;
before you manipulate it (exclusive mode will allow reads but any write attempt will have to wait)
If you don't and 2 transactions try to update (Inserts are fine, BTW) the same table, you are most likely yo get a "ERROR: 1023 - DETAIL: Serializable isolation violation on table" - or the behavior you are reporting
I have one procedure which updates record values, and i want to fire it up against all records in table (over 30k records), procedure execution time is from 2 up to 10 seconds, because it depends on network load.
Now i'm doing UPDATE table SET field = procedure_name(paramns); but with that amount of records it takes up to 40 min to process all table.
Now im using 4 different connections witch fork to background and fires query with WHERE clause set to iterate over modulo of row id's to speed this up, ( WHERE id_field % 4 = ) and this works well and cuts down table populate to ~10 mins.
But i want to avoid using cron, shell jobs and multiple connections for this, i know that it can be done with libpq, but is there a way to fire up a query (4 different non-blocking queries) and do not wait till it ends execution, within single connection?
Or if anyone can point me out to some clues on how to write that function, using postgres internals, or simply in C and bound it as a stored procedure?
Cheers Darius
I've got a sure answer for this question - IF you will share with us what your ab workout is!!! I'm getting fat by the minute and I need answers myself...
OK I'll answer anyway.
If you are updating one table, on one database server, in 40 minutes 'single threaded' and in 10 minutes with 4 threads, the bottleneck is not the database server; otherwise, it would get bogged down in I/O. If you are executing a bunch of UPDATES, one call per record, the network round-trip time is killing you.
I'm pretty sure this is the case and not that it's either an I/O bottleneck on the DB or the possibility that procedure_name(paramns); is taking a long time. (If that were the procedure taking 2-10 seconds it would take like 2500 min to do 30K records). The reason I am sure is that starting 4 concurrent processed cuts the time in 1/4. So especially it is not an i/o issue on the DB server.
This might be the one excuse for putting business logic in an SP on the server. Optimization unfortunately means breaking the rules. The consequence is difficult maintenance. but, duh!!
However, the best solution would be to get this set up to use 'bulk update' queries. That might mean you have to take several strange and unintuitive steps such as this:
This will require a lot of modfication if multiple users can run it concurrently.
refactor the system so procedure_name(paramns) can get all the data it needs to process all records via a select statement. May need to use creative joins. If it's an SP of course now you are moving the logic to the client.
Use that have the program create an XML or other importable flat file format with the PK of the record to update, and the new field value or values. Write all the updates to this file instead of executing them on the DB.
have a temp table on the database that matches the layout of this flat file
run an import on the database - clear the temp table and import the file
do an update of a join of the temp table and the table to be updated, e.g., UPDATE mytbl, mytemp WHERE myPK=mytempPK SET myval=mytempnewval (use the right join syntax of course).
You can try some of these things 'by hand' first before you bother coding, to see if it's worth the speed increase.
If possible, you can still put this all in an SP!
I'm not making any guarantees, especially as I look down at my ever-fattening belly, but, this has the potential to melt your update job down to under a minute.
It is possible to update multiple rows at once. Below an example in postgres:
UPDATE
table_name
SET
column_name = temp.column_name
FROM
(VALUES
(<id1>, <value1>),
(<id2>, <value2>),
(<id3>, <value3>)
) AS temp("id", "column_name")
WHERE
table_name.id = temp.id
PHP has some functions for asynchrone queries:
pg_ send_ execute()
pg_ send_ prepare()
pg_send_query()
pg_ send_ query_ params()
No idea about other programming languages, you have to dig into the manuals.
I think you can't. Single connection can handle single query at once. It's described in libpq documentation chapter "Asynchronous Command Processing":
"After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done."