Insert statement with no joins results in duplicates where no duplicates existed previously?

I am having an issue with some SQL that is producing results I wouldn't expect. I am storing information from a variety of tables in another table, which is used as part of a search page on a website. All of the page data for each page, along with data from other elements on other pages (like calendars, etc.), is referenced in a table called pageContentCache. This table normally has an index on objectId, created with the following:
alter table pageContentCache add
constraint [IX_pageContentCache] PRIMARY KEY CLUSTERED (
[objectId]
)
An issue has started occurring with one instance of this software, apparently because of a duplicate objectId, resulting in the following error:
Msg 1505, Level 16, State 1 Procedure sp_rebuildPageContentCache, Line 50
The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.pageContentCache' and the index name 'IX_pageContentCache'. The duplicate key value is (21912).
So, to debug the issue, I changed the procedure to first load all of the data it was going to insert into the pageContentCache table into a temporary table, #contentcache, so I could have a look through it.
This is where I'm starting to get a little confused...
Once the data has been inserted into #contentcache (which has two columns, objectId and content), I can run the following SQL statement and it will return nothing:
select objectId, count(objectId) from #contentcache
group by objectId having count(objectId) > 1
If I then run the following SQL:
insert into pageContentCache (objectId, contentData)
select objectId, content
from #contentcache
This inserts all of the data from #contentcache into pageContentCache as you'd expect. However, if I then run the following SQL, it returns duplicates:
select objectId, count(objectId) from pageContentCache
group by objectId having count(objectId) > 1
The output is:
objectId    (no column name)
21912       2
There are no triggers or anything like that associated with this table and the insert statement is merely copying the data from one table to another, so... where is this duplicate coming from?

Try the following:
insert into pageContentCache (objectId, contentData)
select distinct objectId, content
from #contentcache
I can't see why you would have duplicates since, as you mentioned, there are no joins in your select statement. My guess is that the distinct keyword will eliminate them, although it only removes rows where both objectId and content are identical.

This is a SQL Server database error I have seen before. You may want to apply the latest service pack and retry.

I am not so sure that this statement does what you think it does:
select objectId, count(objectId) from #contentcache
group by objectId having count(objectId) > 1
Can you try this instead:
WITH SUBQUERY AS
( select
COUNT(objectId) OVER (PARTITION BY objectId) AS CNT_OBJECT_IDS,
objectId
FROM #contentcache)
SELECT * FROM SUBQUERY WHERE CNT_OBJECT_IDS > 1
See if this gets you any rows back.
Also, I've never worked with clustered indexes before and I am wondering if they do some additional things that we are not aware of. Can you try just saying
PRIMARY KEY
instead of
PRIMARY KEY CLUSTERED
in your constraint definition and see if that affects your problem at all?
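One scenario that neither duplicate check above would reveal is a row with the same objectId already sitting in the target table before the insert runs. Whether that is what happened here is an assumption; nothing in the thread confirms the cause. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server, a plain table named contentcache in place of #contentcache, and a hypothetical leftover row, to show how a duplicate-free source can still produce the reported output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No primary key yet, mimicking the state before the index is recreated.
    CREATE TABLE pageContentCache (objectId INTEGER, contentData TEXT);
    CREATE TABLE contentcache (objectId INTEGER, content TEXT);
    -- Hypothetical leftover row in the target table.
    INSERT INTO pageContentCache VALUES (21912, 'stale row');
    INSERT INTO contentcache VALUES (21911, 'a'), (21912, 'b'), (21913, 'c');
""")

DUPES = "SELECT objectId, COUNT(*) FROM {} GROUP BY objectId HAVING COUNT(*) > 1"

# The source on its own has no duplicates.
source_dupes = conn.execute(DUPES.format("contentcache")).fetchall()

# Keys that exist in both tables before the copy.
overlap = conn.execute("""
    SELECT s.objectId
    FROM contentcache s
    JOIN pageContentCache t ON t.objectId = s.objectId
""").fetchall()

conn.execute("""
    INSERT INTO pageContentCache (objectId, contentData)
    SELECT objectId, content FROM contentcache
""")
target_dupes = conn.execute(DUPES.format("pageContentCache")).fetchall()

print(source_dupes, overlap, target_dupes)
```

Running the overlap query against the real tables just before the insert would confirm or rule out this explanation.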

Related

How to insert a row if not exists otherwise select and return its ID in both cases in MariaDB?

I have a table with ID primary key (autoincrement) and a unique column Name. Is there an efficient way in MariaDB to insert a row into this table if the same Name doesn't exist, otherwise select the existing row and, in both cases, return the ID of the row with this Name?
Here's a solution for Postgres. However, it seems MariaDB doesn't have the RETURNING id clause.
What I have tried so far is brute-force:
INSERT IGNORE INTO services (Name) VALUES ('JohnDoe');
SELECT ID FROM services WHERE Name='JohnDoe';
UPDATE: MariaDB 10.5 has RETURNING clause, however, the queries I have tried so far throw a syntax error:
WITH i AS (INSERT IGNORE INTO services (`Name`) VALUES ('John') RETURNING ID)
SELECT ID FROM i
UNION
SELECT ID FROM services WHERE `Name`='John'
For a single row, assuming id is AUTO_INCREMENT.
INSERT INTO t (name)
VALUES ('JohnDoe')
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SELECT LAST_INSERT_ID();
That looks kludgy, but it is an example in the documentation.
Caution: Most forms of INSERT will "burn" auto_inc ids. That is, they grab the next id(s) before realizing that the id won't be used. This could lead to overflowing the max auto_inc size.
It is also wise not to put the normalization inside the transaction that does the "meat" of the code. It ties up the table unnecessarily long and runs extra risk of burning ids in the case of rollback.
For batch updating of a 'normalization' table like that, see my notes here: http://mysql.rjweb.org/doc.php/staging_table#normalization (It avoids burning ids.)
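The brute-force pair of statements from the question (insert while ignoring duplicates, then select) can be wrapped in a small helper. This is a minimal sketch using Python's sqlite3 module as a stand-in for MariaDB, so INSERT OR IGNORE replaces INSERT IGNORE; the helper name get_or_create_id is hypothetical.

```python
import sqlite3

def get_or_create_id(conn, name):
    """Return the ID for `name`, inserting the row first if it is missing."""
    # The unique constraint on Name turns a repeated insert into a no-op.
    conn.execute("INSERT OR IGNORE INTO services (Name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT ID FROM services WHERE Name = ?", (name,)
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE services ("
    "ID INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT UNIQUE NOT NULL)"
)

first = get_or_create_id(conn, "JohnDoe")   # inserts the row
second = get_or_create_id(conn, "JohnDoe")  # finds the existing row
other = get_or_create_id(conn, "JaneDoe")   # inserts a different row
print(first, second, other)
```

It costs two statements per call, but it returns the same ID whether or not the row already existed.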

Postgres: SELECT or INSERT in high concurrent write load DB

We have a DB for which we need a "selsert" (not upsert) function.
The function should take a text value and return the id column of the existing row (SELECT), or insert the value and return the id of the new row (INSERT).
There are multiple processes that will need to perform this functionality (selsert).
I have been experimenting with pg_advisory_lock and ON CONFLICT clause for INSERT but am still not sure what approach would work best (even when looking at some of the other answers).
So far I have come up with following
WITH
selected AS (
SELECT id FROM test.body_parts WHERE (lower(trim(part))) = lower(trim('finger')) LIMIT 1
),
inserted AS (
INSERT INTO test.body_parts (part)
SELECT trim('finger')
WHERE NOT EXISTS ( SELECT * FROM selected )
-- ON CONFLICT (lower(trim(part))) DO NOTHING -- not sure if this is needed
RETURNING id
)
SELECT id, 'inserted' FROM inserted
UNION
SELECT id, 'selected' FROM selected
Will the above query (within a function) ensure consistency in high-concurrency write workloads?
Are there any other issues I must consider (locking, etc.)?
BTW, I can ensure that there are no duplicate values of (part) by creating a unique index. That is not an issue. What I am after is that SELECT returns the existing value if another process does the INSERT (I hope I am explaining this right).
The unique index would have the following definition:
CREATE UNIQUE INDEX body_parts_part_ux
ON test.body_parts
USING btree
(lower(trim(part)));

Get Id from a conditional INSERT

For a table like this one:
CREATE TABLE Users(
id SERIAL PRIMARY KEY,
name TEXT UNIQUE
);
What would be the correct one-query insert for the following operation:
Given a user name, insert a new record and return the new id. But if the name already exists, just return the id.
I am aware of the new syntax within PostgreSQL 9.5 for ON CONFLICT(column) DO UPDATE/NOTHING, but I can't figure out how, if at all, it can help, given that I need the id to be returned.
It seems that RETURNING id and ON CONFLICT do not belong together.
The UPSERT implementation is hugely complex to be safe against concurrent write access. Take a look at this Postgres Wiki that served as log during initial development. The Postgres hackers decided not to include "excluded" rows in the RETURNING clause for the first release in Postgres 9.5. They might build something in for the next release.
This is the crucial statement in the manual to explain your situation:
The syntax of the RETURNING list is identical to that of the output
list of SELECT. Only rows that were successfully inserted or updated
will be returned. For example, if a row was locked but not updated
because an ON CONFLICT DO UPDATE ... WHERE clause condition was not
satisfied, the row will not be returned.
Bold emphasis mine.
For a single row to insert:
Without concurrent write load on the same table
WITH ins AS (
INSERT INTO users(name)
VALUES ('new_usr_name') -- input value
ON CONFLICT(name) DO NOTHING
RETURNING users.id
)
SELECT id FROM ins
UNION ALL
SELECT id FROM users -- 2nd SELECT never executed if INSERT successful
WHERE name = 'new_usr_name' -- input value a 2nd time
LIMIT 1;
With possible concurrent write load on the table
Consider this instead (for single row INSERT):
Is SELECT or INSERT in a function prone to race conditions?
To insert a set of rows:
How to use RETURNING with ON CONFLICT in PostgreSQL?
How to include excluded rows in RETURNING from INSERT ... ON CONFLICT
All three with very detailed explanation.
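For the concurrent case, one common shape is a retry loop: select, try to insert, and select again if the unique constraint rejects the insert. The following is only a sketch of that idea, not necessarily what the linked answers recommend. It uses Python's sqlite3 module as a stand-in for Postgres, and the function name select_or_insert and the max_attempts parameter are hypothetical.

```python
import sqlite3

def select_or_insert(conn, name, max_attempts=3):
    """Return the id for `name`, inserting it if no row exists yet."""
    for _ in range(max_attempts):
        row = conn.execute(
            "SELECT id FROM users WHERE name = ?", (name,)
        ).fetchone()
        if row is not None:
            return row[0]
        try:
            cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
            conn.commit()
            return cur.lastrowid
        except sqlite3.IntegrityError:
            # Another writer inserted the same name after our SELECT.
            # Discard the failed attempt and loop around to SELECT again.
            conn.rollback()
    raise RuntimeError("could not select or insert %r" % name)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

first_id = select_or_insert(conn, "new_usr_name")   # inserts
second_id = select_or_insert(conn, "new_usr_name")  # selects the same row
print(first_id, second_id)
```

The unique constraint is what makes this safe: a losing writer gets an error instead of a second row, and the next pass through the loop picks up the winner's id.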
For a single row insert and no update:
with i as (
insert into users (name)
select 'the name'
where not exists (
select 1
from users
where name = 'the name'
)
returning id
)
select id
from users
where name = 'the name'
union all
select id from i
The manual about the primary and the with subqueries parts:
The primary query and the WITH queries are all (notionally) executed at the same time
Although that sounds to me like "same snapshot", I'm not sure, since I don't know what "notionally" means in that context.
But there is also:
The sub-statements in WITH are executed concurrently with each other and with the main query. Therefore, when using data-modifying statements in WITH, the order in which the specified updates actually happen is unpredictable. All the statements are executed with the same snapshot
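If I understand correctly, that same snapshot bit means the select and the insert see the same data within this one statement. It does not protect against another session inserting the same name concurrently; with the unique constraint in place, that case surfaces as a duplicate key error. I'm also not sure whether "all the statements" refers only to the statements in the with subqueries, excluding the main query. To avoid any doubt, move the select in the previous query to a with subquery: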
with s as (
select id
from users
where name = 'the name'
), i as (
insert into users (name)
select 'the name'
where not exists (select 1 from s)
returning id
)
select id from s
union all
select id from i

Duplicate key error with PostgreSQL INSERT with subquery

There are some similar questions on StackOverflow, but they don't seem to exactly match my case. I am trying to bulk insert into a PostgreSQL table with composite unique constraints. I created a temporary table (temptable) without any constraints, and loaded the data (with possibly some duplicate values) into it. So far, so good.
Now, I am trying to transfer the data to the actual table (realtable) with unique index. For this, I used an INSERT statement with a subquery:
INSERT INTO realtable
SELECT * FROM temptable WHERE NOT EXISTS (
SELECT 1 FROM realtable WHERE temptable.added_date = realtable.added_date
AND temptable.product_name = realtable.product_name
);
However, I am getting duplicate key errors:
ERROR: duplicate key value violates unique constraint "realtable_added_date_product_name_key"
SQL state: 23505
Detail: Key (added_date, product_name)=(20000103, TEST) already exists.
My question is, shouldn't the WHERE NOT EXISTS clause prevent this from happening? How can I fix it?
The NOT EXISTS clause only prevents rows from temptable conflicting with existing rows from realtable; it will not prevent multiple rows from temptable from conflicting with each other. This is because the SELECT is calculated once based on the initial state of realtable, not re-calculated after each row is inserted.
One solution would be to use a GROUP BY or DISTINCT ON in the SELECT query, to omit duplicates, e.g.
INSERT INTO realtable
SELECT DISTINCT ON (added_date, product_name) *
FROM temptable WHERE NOT EXISTS (
SELECT 1 FROM realtable WHERE temptable.added_date = realtable.added_date
AND temptable.product_name = realtable.product_name
)
ORDER BY ???; -- this ORDER BY will determine which of a set of duplicates is kept by the DISTINCT ON
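The behaviour described above can be reproduced on a small scale. This sketch uses Python's sqlite3 module as a stand-in for Postgres, so it takes the GROUP BY route rather than DISTINCT ON, which SQLite does not have. The qty column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE realtable (added_date INTEGER, product_name TEXT, qty INTEGER,
                            UNIQUE (added_date, product_name));
    CREATE TABLE temptable (added_date INTEGER, product_name TEXT, qty INTEGER);
    -- The first two rows collide with each other, not with realtable.
    INSERT INTO temptable VALUES
        (20000103, 'TEST', 1), (20000103, 'TEST', 2), (20000104, 'TEST', 3);
""")

try:
    # NOT EXISTS only compares against rows already in realtable.
    conn.execute("""
        INSERT INTO realtable
        SELECT * FROM temptable t
        WHERE NOT EXISTS (
            SELECT 1 FROM realtable r
            WHERE r.added_date = t.added_date
              AND r.product_name = t.product_name)
    """)
    naive_failed = False
except sqlite3.IntegrityError:
    naive_failed = True
    conn.rollback()

# Collapse duplicates inside the batch; MAX(qty) decides which value survives.
conn.execute("""
    INSERT INTO realtable
    SELECT added_date, product_name, MAX(qty) FROM temptable t
    WHERE NOT EXISTS (
        SELECT 1 FROM realtable r
        WHERE r.added_date = t.added_date
          AND r.product_name = t.product_name)
    GROUP BY added_date, product_name
""")
rows = conn.execute("SELECT * FROM realtable ORDER BY added_date").fetchall()
print(naive_failed, rows)
```

The aggregate plays the role that ORDER BY plays with DISTINCT ON: it decides which of the conflicting rows is kept.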

INSERT INTO using SELECT and increment value in a column

I am trying to insert missing rows into a table. One of the columns is OrderNumber (sort number), this column should be +1 of the max value of OrderNumber returned for sID in the table. Some sIDs do not appear in the SPOL table which is why there is the WHERE clause at the end of the statement. I would run this statement again but set OrderNumber to 1 for the records where sID does not currently exist in the table.
The statement below doesn't work due to the OrderNumber causing issues with the primary key which is sID + OrderNumber.
How can I get the OrderNumber to increase for each row that is inserted based on the sID column?
INSERT INTO SPOL(sID, OrderNumber, oID)
SELECT
sID, OrderNumber, oID
FROM
(SELECT
sID,
(SELECT Max(OrderNumber) + 1
FROM SPOL
WHERE sID = TMPO.sID) AS OrderNumber,
oID
FROM TMPO
WHERE NOT EXISTS (SELECT * FROM SPOL
WHERE SPOL.oID = TMPO.oID)
) AS MyData
WHERE
OrderNumber IS NOT NULL
It's much better to handle this in the database design with an identity column - you don't mention whether or not you can change the schema but hopefully you can as queries will end up a lot cleaner if you don't have to manage it yourself.
You can set the Identity property to on for your OrderNumber column in SQL Server management studio, but the script it would generate clones the table with the new specification, inserts the values you've already got with Identity_Insert on, drops the original table, and renames the temporary one to replace it - this has massive overheads depending on how many rows you've got.
The most efficient way to go about it is probably:
create an additional column with the identity property on
copy across the values
rename the original column
rename the new column to the same name as the original
remove the original OrderNumber column
Once it's done, it's done though - and looks after itself. Wouldn't you rather your insert statement simply said something like this:
INSERT INTO SPOL (sID, oID)
SELECT sID, oID
FROM TMPO
WHERE OrderNumber IS NOT NULL
Use identity(1,1) to increment your OrderNumber column; this would make your task easy.
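An identity column counts across the whole table rather than per sID. If the numbering has to restart for each sID, as the sID + OrderNumber primary key in the question suggests, a different technique is to add a per-sID row number to the current maximum. This is a sketch of that alternative, not the approach from the answers above. It uses Python's sqlite3 module as a stand-in for SQL Server (window functions need SQLite 3.25 or later), and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SPOL (sID INTEGER, OrderNumber INTEGER, oID INTEGER,
                       PRIMARY KEY (sID, OrderNumber));
    CREATE TABLE TMPO (sID INTEGER, oID INTEGER);
    INSERT INTO SPOL VALUES (1, 1, 100), (1, 2, 101);
    INSERT INTO TMPO VALUES (1, 100), (1, 102), (1, 103), (2, 200), (2, 201);
""")

conn.execute("""
    INSERT INTO SPOL (sID, OrderNumber, oID)
    SELECT t.sID,
           -- Current maximum for this sID (0 if the sID is new) plus the
           -- position of the row among the new rows for the same sID.
           COALESCE(m.maxOrder, 0)
             + ROW_NUMBER() OVER (PARTITION BY t.sID ORDER BY t.oID),
           t.oID
    FROM TMPO t
    LEFT JOIN (SELECT sID, MAX(OrderNumber) AS maxOrder
               FROM SPOL GROUP BY sID) m ON m.sID = t.sID
    WHERE NOT EXISTS (SELECT 1 FROM SPOL s WHERE s.oID = t.oID)
""")

rows = conn.execute(
    "SELECT sID, OrderNumber, oID FROM SPOL ORDER BY sID, OrderNumber"
).fetchall()
print(rows)
```

Because COALESCE supplies 0 for an sID that is not yet in SPOL, the second pass described in the question (setting OrderNumber to 1 for new sIDs) is not needed in this version.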