I'm working on Postgres 9.2.
There are 2 UPDATEs, each in their own transactions. One looks like:
UPDATE foo SET a=1 WHERE b IN (1,2,3,4);
The other is similar:
UPDATE foo SET a=2 WHERE b IN (1,2,3,4);
These could run at the same time, and in reality they have 500+ values in the 'IN' expression.
I'm sometimes seeing deadlocks. Is it true that the order of items in the 'IN' expression may not actually influence the true lock ordering?
There is no ORDER BY in the UPDATE command.
But there is for SELECT. Use row-level locking with the FOR UPDATE clause in a subquery:
UPDATE foo f
SET    a = 1
FROM  (
   SELECT b FROM foo
   WHERE  b IN (1,2,3,4)
   ORDER  BY b
   FOR    UPDATE
   ) upd
WHERE  f.b = upd.b;
Of course, b has to be UNIQUE or you need to add more expressions to the ORDER BY clause to make it unambiguous.
And you need to enforce the same order for all UPDATE, DELETE and SELECT .. FOR UPDATE statements on the table.
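For example, the second UPDATE from the question would be rewritten the same way, so that both transactions acquire their row locks in ascending order of b:
UPDATE foo f
SET    a = 2
FROM  (
   SELECT b FROM foo
   WHERE  b IN (1,2,3,4)
   ORDER  BY b          -- same order as everywhere else
   FOR    UPDATE
   ) upd
WHERE  f.b = upd.b;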
Related, with more details:
Postgres UPDATE … LIMIT 1
Avoiding PostgreSQL deadlocks when performing bulk update and delete operations
Optimizing concurrent updates in Postgres
Yes. I think the main issue here is that IN checks for membership in the specified set, but does not impose any ordering on the UPDATE, which in turn means that no concrete order is imposed on the lock acquisition.
The WHERE clause in an UPDATE statement essentially behaves the same way it does in a SELECT. For example, I will often simulate an UPDATE with a SELECT to check which rows will be updated and make sure they're what I expected.
With that in mind, the following example using SELECT demonstrates that IN does not in itself confer ordering:
Given this schema/data:
create table foo
(
id serial,
val text
);
insert into foo (val)
values ('one'), ('two'), ('three'), ('four');
The following queries:
select *
from foo
where id in (1,2,3,4);
select *
from foo
where id in (4,3,2,1);
yield the exact same results -- the rows in order from id 1-4.
Even that isn't guaranteed, since I did not use an ORDER BY in the select. Rather, without it, Postgres uses whatever order the server decides is fastest (see point 8 about ORDER BY in the Postgres SELECT doc). Given a fairly static table, it's often the same order in which it was inserted (as was the case here). However, there's nothing guaranteeing that, and if there's a lot of churn on the table (lots of dead tuples, rows removed, etc.), it's less likely to be the case.
I suspect that's what's happening here with your UPDATE. Sometimes -- if not most of the time -- it may end up in numerical order if that's the same way the rows were inserted, but there's nothing to guarantee that, and the cases where you see the deadlocks are likely scenarios where the data has changed such that one update is ordered differently from the other.
sqlfiddle with the above code.
Possible fixes/workarounds:
In terms of what you could do about it, there are various options, depending on your requirements. You could explicitly take out a table lock on the table, although that would of course have the effect of serializing the updates there, which may prove to be too large a bottleneck.
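A minimal sketch of that table-lock approach (the lock mode shown is one possible choice; it blocks concurrent writers while still allowing plain reads):
BEGIN;
LOCK TABLE foo IN SHARE ROW EXCLUSIVE MODE;  -- self-conflicting, so concurrent updaters queue up
UPDATE foo SET a = 1 WHERE b IN (1,2,3,4);
COMMIT;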
Another option, which would still allow for concurrency -- is to explicitly iterate over the items using dynamic SQL in, say, Python. That way, you'd have a set of one-row updates that occurred always in the same order, and since you could ensure that consistent order, the normal Postgres locking should be able to handle the concurrency without deadlocking.
That won't perform as well as batch-updating in pure SQL, but it should solve the lock issue. One suggestion to bump up performance is to only COMMIT every so often, and not after every single row -- that saves a lot of overhead.
Another option would be to do the loop in a Postgres function written in PL/pgSQL. That function could then be called externally, in, say, Python, but the looping would be done (also explicitly) server-side, which may save on some overhead, since the looping and UPDATEs are done entirely server-side without having to go over the wire each loop iteration.
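A rough sketch of what such a PL/pgSQL function might look like -- the function name and signature are made up for illustration; the essential point is that the single-row UPDATEs always happen in ascending key order:
CREATE OR REPLACE FUNCTION update_foo_in_order(_a int, _keys int[]) RETURNS void AS
$func$
DECLARE
   _b int;
BEGIN
   FOR _b IN SELECT k FROM unnest(_keys) k ORDER BY k  -- consistent lock order
   LOOP
      UPDATE foo SET a = _a WHERE b = _b;
   END LOOP;
END
$func$ LANGUAGE plpgsql;
Two concurrent calls may still block each other briefly, but because both acquire their row locks in the same order they cannot deadlock.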
It is my understanding that select is not guaranteed to always return the same result.
The following query is not guaranteed to return the same result every time:
select * from myTable offset 10000 limit 100
My question is: if myTable is not changed between executions of the select (no deletions or inserts), can I rely on it returning the same result set every time?
Or, to put it another way: if my database is locked for changes, can I rely on the select returning the same result?
I am using postgresql.
Tables and result sets (without order by) are simply not ordered. It really is that simple.
In some databases, under some circumstances, the order will be consistent. However, you should never depend on this. A subsequent release, for instance, might invalidate that assumption.
For me, I think the simplest way to understand this is by thinking of parallel processing. When you execute a query, different threads might go out and start to fetch data; which values are returned first depends on non-reproducible factors.
Another way to think of it is to consider a page cache that already has pages in memory -- probably from the end of the table. The SQL engine could read the pages in any order (although in practice this doesn't really happen).
Or, some other query might have a row or page lock, so that page gets skipped when reading the records.
So, just accept that unordered means unordered. Add an ORDER BY if you want data in a particular order. If you order by a clustered index key, there is basically no performance hit.
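Applied to the query from the question, and assuming myTable has a unique id column, that would be:
SELECT *
FROM myTable
ORDER BY id            -- any unique ordering makes the paging deterministic
OFFSET 10000 LIMIT 100;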
I have a string vector data containing items that I want to insert into a table named foos. It's possible that some of the elements in data already exist in the table, so I must watch out for those.
The solution I'm using starts by transforming the data vector into virtual table old_and_new; it then builds virtual table old which contains the elements which are already present in foos; then, it constructs virtual table new with the elements
which are really new. Finally, it inserts the new elements in table foos.
WITH old_and_new AS (SELECT unnest ($data :: text[]) AS foo),
old AS (SELECT foo FROM foos INNER JOIN old_and_new USING (foo)),
new AS (SELECT * FROM old_and_new EXCEPT SELECT * FROM old)
INSERT INTO foos (foo) SELECT foo FROM new
This works fine in a non-concurrent setting, but fails if concurrent threads
try to insert the same new element at the same time. I know I can solve this
by setting the isolation level to serializable, but that's very heavy-handed.
Is there some other way I can solve this problem? If only there was a way to
tell PostgreSQL that it was safe to ignore INSERT errors...
Is there some other way I can solve this problem?
There are plenty, but none are a panacea...
You can't lock for inserts like you can do a select for update, since the rows don't exist yet.
You can lock the entire table, but that's even heavier handed than serializing your transactions.
You can use advisory locks, but be super wary about deadlocks. Sort new keys so as to obtain the locks in a consistent, predictable order. (Someone more knowledgeable with PG's source code will hopefully chime in, but I'm guessing that the predicate locks used in the serializable isolation level amount to doing precisely that.)
In pure SQL you could also use a DO statement to loop through the rows one by one and trap the errors as they occur (a sketch follows the links below):
http://www.postgresql.org/docs/9.2/static/sql-do.html
http://www.postgresql.org/docs/9.2/static/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING
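For example, a rough sketch of that DO approach for the foos table from the question, assuming foos.foo carries the UNIQUE constraint that causes the violations in the first place:
DO
$$
DECLARE
   _foo text;
BEGIN
   FOREACH _foo IN ARRAY '{a,b,c}'::text[]   -- stands in for the $data array
   LOOP
      BEGIN
         INSERT INTO foos (foo) VALUES (_foo);
      EXCEPTION WHEN unique_violation THEN
         NULL;  -- already present or inserted concurrently: ignore and continue
      END;
   END LOOP;
END
$$;
Each row gets its own subtransaction, so a duplicate only skips that one value rather than aborting the whole batch.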
Similarly, you could create a convoluted upsert function and call it once per piece of data...
If you're building $data at the app level, you could run the inserts one by one and ignore errors.
And I'm sure I forgot some additional options...
Whatever your course of action is (@Denis gave you quite a few options), this rewritten INSERT command will be much faster:
INSERT INTO foos (foo)
SELECT n.foo
FROM unnest ($data::text[]) AS n(foo)
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL
It also leaves a much smaller time frame for a possible race condition.
In fact, the time frame should be so small, that unique violations should only be popping up under heavy concurrent load or with huge arrays.
Dupes in the array?
Except, if your problem is built-in: do you have duplicates in the input array itself? In this case, transaction isolation is not going to help you. The enemy is within!
Consider this example / solution:
INSERT INTO foos (foo)
SELECT n.foo
FROM (SELECT DISTINCT foo FROM unnest('{foo,bar,foo,baz}'::text[]) AS foo) n
LEFT JOIN foos o USING (foo)
WHERE o.foo IS NULL
I use DISTINCT in the subquery to eliminate the "sleeper agents", a.k.a. duplicates.
People tend to forget that the dupes may come within the import data.
Full automation
This function is one way to deal with concurrency for good. If a UNIQUE_VIOLATION occurs, the INSERT is just retried. The newly present rows are excluded from the new attempt automatically.
It does not take care of the opposite problem, that a row might have been deleted concurrently - this would not get re-inserted. One might argue that this outcome is OK, since such a DELETE happened concurrently. If you want to prevent this, use SELECT ... FOR SHARE to protect rows from concurrent DELETE.
CREATE OR REPLACE FUNCTION f_insert_array(_data text[], OUT ins_ct int) AS
$func$
BEGIN
LOOP
   BEGIN
      INSERT INTO foos (foo)
      SELECT n.foo
      FROM  (SELECT DISTINCT foo FROM unnest(_data) AS foo) n
      LEFT  JOIN foos o USING (foo)
      WHERE  o.foo IS NULL;

      GET DIAGNOSTICS ins_ct = ROW_COUNT;
      RETURN;

   EXCEPTION WHEN UNIQUE_VIOLATION THEN      -- foos.foo has a UNIQUE constraint
      RAISE NOTICE 'It actually happened!';  -- hardly ever happens
   END;
END LOOP;
END
$func$ LANGUAGE plpgsql;
I made the function return the count of inserted rows, which is completely optional.
-> SQLfiddle demo
I like both Erwin and Denis' answers, but another approach might be this: have the concurrent sessions perform the unnesting and load into a separate staging table, optionally eliminating what duplicates they can against the target table; then have a single session select from this staging table, resolve its internal duplicates in an appropriate manner, insert into the target table (checking again for existing values), and delete the selected staging records -- all in the same query using a common table expression.
This would be more batch oriented, in the style of a data warehouse extraction-load-transform paradigm, but would guarantee that no unique constraint issues would need to be dealt with.
Other advantages/disadvantages apply, such as decoupling the final insert from the data gathering (possible advantage), and needing to vacuum the temp table frequently (possible disadvantage), which may not be relevant to Jon's case, but might be useful info to others in the same situation.
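A rough sketch of that idea (the staging table and its name are made up; a regular or UNLOGGED staging table is used instead of a true TEMPORARY table so that every session can see it):
-- Each loading session stages its values; no unique constraint here, so no errors:
CREATE UNLOGGED TABLE IF NOT EXISTS foo_stage (foo text);

INSERT INTO foo_stage (foo)
SELECT DISTINCT n.foo
FROM   unnest('{a,b,c}'::text[]) AS n(foo)   -- stands in for $data
LEFT   JOIN foos o USING (foo)               -- optional pre-filter against the target
WHERE  o.foo IS NULL;

-- A single consolidating session then moves staged rows into the target table:
WITH moved AS (
   DELETE FROM foo_stage
   RETURNING foo
)
INSERT INTO foos (foo)
SELECT DISTINCT m.foo
FROM   moved m
LEFT   JOIN foos o USING (foo)
WHERE  o.foo IS NULL;
Because only one session ever inserts into foos, no unique violations can occur there.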
Nitpicker Question:
I'd like to have a function returning a boolean that checks whether a table has an entry or not. And I need to call this a lot, so some optimizing is needed.
I use MySQL for now, but it should be fairly basic...
So should I use
select id from table where a=b limit 1;
or
select count(*) as cnt from table where a=b;
or something completely different?
I think SELECT with LIMIT should stop after the first find, while count(*) needs to check all entries. So the SELECT could be faster.
The simplest thing would be to do a few loops and test it, but my tests were not helpful. (My test system seemed to be in use by others too, which diluted my results.)
this "need" is often indicative of a situation where you are trying to INSERT or UPDATE. the two most common situations are bulk loading/updating of rows, or hit counting.
checking for existence of a row first can be avoided using the INSERT ... ON DUPLICATE KEY UPDATE statement. for a hit counter, just a single statement is needed. for bulk loading, load the data in to a temporary table, then use INSERT ... ON DUPLICATE KEY UPDATE using the temp table as the source.
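For instance, the hit-counter case then collapses to a single statement (the table and column names here are made up for illustration):
-- assuming something like: CREATE TABLE hits (page_id INT PRIMARY KEY, cnt INT NOT NULL);
INSERT INTO hits (page_id, cnt)
VALUES (42, 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;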
But if you can't use this, then the fastest way will be select id from table where a=b limit 1; along with FORCE INDEX to make sure MySQL looks ONLY at the index.
The LIMIT 1 tells MySQL to stop searching after it finds one row. If there can be multiple rows that match the criteria, this is faster than count(*).
There are more ways to optimize this, but the exact nature would depend on the amount of rows and the spread of a and b. I'd go with the "where a=b" approach until you actually encounter performance issues. Databases are often so fast that most queries are no performance issue at all.
I have a statement that looks something like this:
MERGE INTO someTable st
USING
(
SELECT id,field1,field2,etc FROM otherTable
) ot on st.field1=ot.field1
WHEN NOT MATCHED THEN
INSERT (field1,field2,etc)
VALUES (ot.field1,ot.field2,ot.etc)
where otherTable has an autoincrementing id field.
I would like the insertion into someTable to be in the same order as the id field of otherTable, such that the order of ids is preserved when the non-matching fields are inserted.
A quick look at the docs would appear to suggest that there is no feature to support this.
Is this possible, or is there another way to do the insertion that would fulfil my requirements?
EDIT: One approach to this would be to add an additional field to someTable that captures the ordering. I'd rather not do this if possible.
... upon reflection the approach above seems like the way to go.
I cannot speak to what the Questioner is asking for here because it doesn't make any sense.
So let's assume a different problem:
Let's say, instead, that I have a Heap-Table with no Identity-Field, but it does have a "Visited" Date field.
The Heap-Table logs Person WebPage Visits and I'm loading it into my Data Warehouse.
In this Data Warehouse I'd like to use the Surrogate-Key "WebHitID" to reference these relationships.
Let's use Merge to do the initial load of the table, then continue calling it to keep the tables in sync.
I know that if I'm inserting records into a table, then I'd prefer the IDs (that are being generated by an Identity-Field) to be sequential based on whatever Order-By I choose (let's say the "Visited" Date).
It is not uncommon to expect an Integer-ID to correlate to when it was created relative to the rest of the records in the table.
I know this is not always 100% the case, but humor me for a moment.
This is possible with Merge.
Using (what feels like a hack) TOP will allow for Sorting in our Insert:
MERGE DW.dbo.WebHit AS Target --This table as an Identity Field called WebHitID.
USING
(
SELECT TOP 9223372036854775807 --Biggest BigInt (to be safe).
PWV.PersonID, PWV.WebPageID, PWV.Visited
FROM ProdDB.dbo.Person_WebPage_Visit AS PWV
ORDER BY PWV.Visited --Works only with TOP when inside a MERGE statement.
) AS Source
ON Source.PersonID = Target.PersonID
AND Source.WebPageID = Target.WebPageID
AND Source.Visited = Target.Visited
WHEN NOT MATCHED BY Target THEN --Not in Target-Table, but in Source-Table.
INSERT (PersonID, WebPageID, Visited) --This Insert populates our WebHitID.
VALUES (Source.PersonID, Source.WebPageID, Source.Visited)
WHEN NOT MATCHED BY Source THEN --In Target-Table, but not in Source-Table.
DELETE --In case our WebHit log in Prod is archived/trimmed to save space.
;
You can see I opted to use TOP 9223372036854775807 (the biggest Integer there is) to pull everything.
If you have the resources to merge more than that, then you should be chunking it out.
While this screams "hacky workaround" to me, it should get you where you need to go.
I have tested this on a small sample set and verified it works.
I have not studied the performance impact of it on larger complex sets of data though, so YMMV with and without the TOP.
Following up on MikeTeeVee's answer.
Using TOP will allow you to ORDER BY within a sub-query; however, instead of TOP 9223372036854775807, I would go with
SELECT TOP 100 PERCENT
Unlikely to reach that number, but this way just makes more sense and looks cleaner.
Why would you care about the order of the ids matching? What difference would that make to how you query the data? Related tables should be connected through primary and foreign keys, not the order records were inserted in. Tables are not inherently ordered a particular way in databases. Order should come from the ORDER BY clause.
More explanation as to why you want to do this might help us steer you to an appropriate solution.
Is it safe to use MS SQL's WITH (NOLOCK) option for select statements and insert statements if you never modify a row, but only insert or delete rows?
I.e. you never do an UPDATE to any of the rows.
If you're asking whether or not you'll get data that may no longer be accurate, then it depends on your queries. For example, if you do something like:
SELECT
my_id,
my_date
FROM
My_Table
WHERE
my_date >= '2008-01-01'
at the same time that a row is being inserted with a date on or after 2008-01-01 then you may not get that new row. This can also affect queries which generate aggregates.
If you are just mimicking updates through a delete/insert then you also may get an "old" version of the data.
Not in general. (i.e. UPDATE is not the only locking issue)
If you are inserting (or deleting) records and a select could potentially specify records which would be in that set, then yes, NOLOCK will give you a dirty read which may or may not include those records.
If the inserts or deletes would never potentially be selected (for instance, the data read is always yesterday's data, whereas today's data coming in or being manipulated is never read), then yes, it is "safe".
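For example, using the table from the earlier answer, a report restricted to rows loaded before today is the kind of query where the hint is generally considered harmless:
SELECT my_id, my_date
FROM   My_Table WITH (NOLOCK)
WHERE  my_date < CAST(GETDATE() AS date);  -- only rows from before today, untouched by today's inserts/deletes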
If you are never doing any UPDATEs, then why does locking give you a problem in the first place?
If there are referential integrity or trigger-firing issues at play, then NOLOCK is just going to turn those errors into mysterious inconsistencies.
Well, 'safe' is a very generic term; it all depends on the context of your application and how it is used. But there is always a chance of skipping or double-counting previously committed rows when the NOLOCK hint is used.
Anyways, have a read of this:
http://blogs.msdn.com/b/sqlcat/archive/2007/02/01/previously-committed-rows-might-be-missed-if-nolock-hint-is-used.aspx
Not sure how SELECT statements could conflict if you're limiting yourself to INSERTs and DELETEs. INSERT is problematic because there may have been conflicting primary keys inserted during your query, for instance. Both INSERTs and DELETEs also expose you to the risk that the conditions expressed in your WHERE clauses, JOINs, etc. become invalid during your statement's execution.