T-SQL to "Merge" two rows, or "Rekey" all FK relationships - sql

I have a production database where occasionally redundant rows in a single table need to be "Merged".
Let's assume that both rows in this table have identical values, except their IDs.
Table "PrimaryStuff"
ID | SomeValue
1 | "I have value"
2 | "I have value"
3 | "I am different"
Let's also assume that a number of related tables exist. Because duplicates were created in the "PrimaryStuff" table, rows are often created in these child tables that SHOULD all be related to a single record in the PrimaryStuff table. The number and names of these tables are not under my control and should be treated as dynamic at runtime; i.e., I don't know the names or even the number of related tables, as other people may edit the database without my knowledge.
Table "ForeignStuff"
ID | PrimaryStuffId | LocalValue
1  | 1              | "I have the correct FK"
2  | 1              | "I have the correct FK"
3  | 2              | "I should get pointed to an FK of 1"
To resolve the duplication of PrimaryStuff's rows 1 and 2, I wish to have ALL related tables change their FKs from 2 to 1, and then delete PrimaryStuff's row 2. This SHOULD be trivial: if PrimaryStuff's row 1 didn't exist, I could just update the primary key on row 2 to 1, and the changes would cascade out. I cannot do this because that would create a duplicate key in PrimaryStuff's unique index.
Feel free to ask questions and I'll try to clear up anything that's confusing.

First, let's get a list of the rows that need to be updated (as I understand it, you want the lowest ID to replace all the higher IDs):
SELECT MIN(ID) OVER (PARTITION BY SomeValue) AS FirstID,
       ID,
       SomeValue
FROM PrimaryStuff
We can remove the ones where FirstID and ID match, since those rows don't need updating:
SELECT FirstID, ID FROM
(
    SELECT MIN(ID) OVER (PARTITION BY SomeValue) AS FirstID,
           ID,
           SomeValue
    FROM PrimaryStuff
) T
WHERE FirstID != ID
Now we have a change list. We can use this in an update statement; put it in a temp table (or a CTE, as I did below):
WITH ChangeList AS
(
    SELECT FirstID, ID FROM
    (
        SELECT MIN(ID) OVER (PARTITION BY SomeValue) AS FirstID,
               ID
        FROM PrimaryStuff
    ) T
    WHERE FirstID != ID
)
UPDATE ForeignStuff
SET PrimaryStuffId = ChangeList.FirstID
FROM ForeignStuff
JOIN ChangeList ON ForeignStuff.PrimaryStuffId = ChangeList.ID
NB - Code not tested, might have typos.
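Because the question says the child tables are not known up front, the rekeying can also be driven from the catalog views. The following is an untested sketch (SQL Server 2008+ syntax), not a definitive implementation: it assumes single-column foreign keys, and the #ChangeList temp table name is illustrative. It materialises the change list, builds one UPDATE per foreign key that references PrimaryStuff, runs them, and finally deletes the duplicate parent rows:
-- Materialise the change list (illustrative temp table name).
SELECT FirstID, ID
INTO #ChangeList
FROM (
    SELECT MIN(ID) OVER (PARTITION BY SomeValue) AS FirstID, ID
    FROM PrimaryStuff
) T
WHERE FirstID != ID;

-- Build one UPDATE per single-column foreign key referencing PrimaryStuff.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql
    + N'UPDATE child SET ' + QUOTENAME(c.name) + N' = cl.FirstID'
    + N' FROM ' + QUOTENAME(OBJECT_SCHEMA_NAME(fk.parent_object_id)) + N'.'
    + QUOTENAME(OBJECT_NAME(fk.parent_object_id)) + N' AS child'
    + N' JOIN #ChangeList AS cl ON child.' + QUOTENAME(c.name) + N' = cl.ID;'
    + CHAR(10)
FROM sys.foreign_keys AS fk
JOIN sys.foreign_key_columns AS fkc ON fkc.constraint_object_id = fk.object_id
JOIN sys.columns AS c ON c.object_id = fkc.parent_object_id
                     AND c.column_id = fkc.parent_column_id
WHERE fk.referenced_object_id = OBJECT_ID(N'dbo.PrimaryStuff');

EXEC sys.sp_executesql @sql;

-- The duplicates are no longer referenced, so they can be removed.
DELETE p
FROM PrimaryStuff AS p
JOIN #ChangeList AS cl ON p.ID = cl.ID;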

Could you be more proactive? Either reuse the existing ID when SomeValue already exists and enforce a unique constraint on PrimaryStuff.SomeValue, or make SomeValue the primary key of PrimaryStuff. With it as the primary key, you would only ever add a record to PrimaryStuff if SomeValue did not already exist in it.
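A minimal, untested sketch of that first suggestion; the constraint name and the @SomeValue parameter are illustrative, and under concurrent inserts you would still want error handling around the INSERT:
-- Enforce uniqueness so duplicates can never be created again.
ALTER TABLE PrimaryStuff
    ADD CONSTRAINT UQ_PrimaryStuff_SomeValue UNIQUE (SomeValue);

-- Insert only when the value is new, then look up the surviving ID.
DECLARE @SomeValue nvarchar(100) = N'I have value';

INSERT INTO PrimaryStuff (SomeValue)
SELECT @SomeValue
WHERE NOT EXISTS (SELECT 1 FROM PrimaryStuff WHERE SomeValue = @SomeValue);

SELECT ID FROM PrimaryStuff WHERE SomeValue = @SomeValue;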
Lastly, and most simply: if SomeValue is always arbitrarily defined by others and you take whatever they give you, why not just drop PrimaryStuff altogether and let users enter whatever they wish into ForeignStuff? If you need a unique listing of SomeValue, create a view based on your main table. If you need to speed up querying, add an index on the ForeignStuff.SomeValue field.
Here's an (untested) view when there are multiple tables like ForeignStuff:
-- a distinct list of the values of interest
select SomeValue from ForeignStuffA
union select SomeValue from ForeignStuffB
union select SomeValue from ForeignStuffC
-- and so on for each table; UNION (without ALL) removes the duplicates

Related

Identify duplicate fields in a table

I'm trying to identify specific fields that are duplicated in a table in a mariadb-10.4.20 Joomla database. I would like to identify all rows that have a specific field duplicated, then ultimately be able to remove those duplicates, leaving just the one with the highest ID.
This table contains the IDs, titles and aliases for the articles in a Joomla website. The script I'm building (in Perl) will use this information to print the primary title alias and create redirects for any others.
I was previously using "group by", but it appears there's been a recent change in how it's used, and now it doesn't work properly. I don't understand the new format, and I'm not even sure it was previously working fully.
Here's a basic query that shows there are two of the same articles with different IDs:
MariaDB [mydb]> select id,alias,title from db1_content where title = "Unique Title";
+--------+--------------+--------------+
| id     | alias        | title        |
+--------+--------------+--------------+
| 299959 | unique-title | Unique Title |
| 300026 | unique-title | Unique Title |
+--------+--------------+--------------+
Here's an attempt at using "group by", but it returns no results.
MariaDB [mydb]> select id,title,count(title) from db1_content group by id,title having count(title) > 1;
Empty set (0.230 sec)
If I run the same query without the id field, then it does return a list of all titles that are duplicated, along with the number of occurrences of each title.
That's not exactly what I want, though. I need it to print the id, alias and title fields so I can reference them in my perl script to subsequently perform another query to ultimately delete the duplicates and create links to be used in RewriteRules.
What am I doing wrong?
Grouping by id and title puts every row in its own group, because id is unique, so no group can ever have count(title) > 1; that is why your query returns an empty set. Since MariaDB cannot currently delete from a CTE, you could use a derived table to generate row numbers for each title ordered by id descending, JOIN that to your main table, and then delete any row whose row number is greater than 1. For example:
DELETE db1
FROM db1_content db1
JOIN (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn
    FROM db1_content
) dbr ON db1.id = dbr.id
WHERE dbr.rn > 1;
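After the delete you can re-run the aggregate from the question to confirm the duplicates are gone; it should now return an empty set:
SELECT title, COUNT(*) AS occurrences
FROM db1_content
GROUP BY title
HAVING COUNT(*) > 1;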
If you don't want to actually delete the records using SQL, you can just select the ones that need to be deleted by using a CTE:
WITH rns AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn
    FROM db1_content
)
SELECT id, alias, title
FROM rns
WHERE rn > 1;
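Since the Perl script also needs the surviving row's alias to build the redirects, here is an untested variation of the same CTE; the kept_alias column name is illustrative. It pairs each duplicate with the alias of the highest-id row it should redirect to:
WITH rns AS (
    SELECT id, alias, title,
           ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn,
           FIRST_VALUE(alias) OVER (PARTITION BY title ORDER BY id DESC) AS kept_alias
    FROM db1_content
)
SELECT id, alias AS old_alias, kept_alias, title
FROM rns
WHERE rn > 1;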

Optimisation of sql query for deleting duplicate items from large table

Could anyone please help me optimise one of my queries, which takes more than 20 minutes to run against 3 million rows?
Table structure:
id (INT, auto-increment) | name_id (uuid) | name (varchar) | city (varchar) | name_type (varchar)
Query
The purpose of the query is to eliminate duplicates, where a duplicate means rows having the same name_id and name.
DELETE FROM records
WHERE id NOT IN (
    SELECT DISTINCT ON (name_id, name) id
    FROM records
);
I would write your delete using EXISTS logic:
DELETE FROM records r1
WHERE EXISTS (SELECT 1
              FROM records r2
              WHERE r2.name_id = r1.name_id
                AND r2.name = r1.name
                AND r2.id < r1.id);
This delete query will spare the duplicate having the smallest id value. To speed this up, you may try adding the following index:
CREATE INDEX idx ON records (name_id, name, id);
You probably already have a primary key on the identity column; you can then use it to exclude redundant rows by id in the following way:
WITH cte AS (
    SELECT MIN(id) AS id
    FROM records
    GROUP BY name_id, name
)
DELETE FROM records
WHERE NOT EXISTS (SELECT id FROM cte WHERE id = records.id);
Even without the index, this should work relatively fast, probably because of a merge join strategy.
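With either answer, it is worth previewing the rows that would be removed before running the destructive statement. A minimal check using the same EXISTS logic:
SELECT r1.id, r1.name_id, r1.name
FROM records r1
WHERE EXISTS (SELECT 1
              FROM records r2
              WHERE r2.name_id = r1.name_id
                AND r2.name = r1.name
                AND r2.id < r1.id);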

How can I sort table records alphabetically in SQL Server 2014 Management Studio?

I have many records in one table:
1 dog
2 cat
3 lion
I want to recreate the table or sort the data in this alphabetical order:
1 cat
2 dog
3 lion
Table 1
Id   int           (Allow Nulls: unchecked)
name nvarchar(50)  (Allow Nulls: checked)
To create another table from your table:
CREATE TABLE T1
(
    ID INT IDENTITY PRIMARY KEY NOT NULL,
    NAME NVARCHAR(50) NOT NULL
);
GO
INSERT INTO T1 VALUES ('Dog'), ('Cat'), ('Lion');

SELECT ROW_NUMBER() OVER (ORDER BY NAME ASC) AS ID, NAME
INTO T2
FROM T1;
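Querying the new table then shows the IDs following alphabetical order, matching the desired output:
SELECT ID, NAME FROM T2 ORDER BY ID;
-- 1  Cat
-- 2  Dog
-- 3  Lion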
If you just want to sort the table data, use Order by
Select * from table_1 order by Name
If you want to change the Ids as well according to alphabetical order, create a new table and move the records into it in order. Note that RANK() assigns tied names the same Id; use ROW_NUMBER() instead if the new Ids must be unique even when names repeat.
SELECT RANK() OVER (ORDER BY name) AS Id, name
INTO newTable
FROM table_1
In your database, the order of the records as they were inserted into the table does not necessarily dictate the order in which they're returned when queried. Nor does the ordering of a clustered key. There may be situations in which you appear to always get the same ordering of your results, but that is not guaranteed and may change at any time.
If the results of a query must be a specific order, then you must specify that ordering with an ORDER BY clause in your query (ORDER BY [Name] ASC in this particular case).
I understand, based upon your comments above, that you don't want this to be the answer. But this is how SQL Server (and any other relational database) works. If order matters, you specify that upon querying data from the system, not when inserting data into it.

Update table rows to a certain id while deleting the duplicate rows

I have 2 tables
Table name: Attributes
attribute_id | attribute_name
1            | attr_name_1
2            | attr_name_2
3            | attr_name_1
4            | attr_name_2
Table name: Products
product_id | product_name | attribute_id
1          | prod_name_1  | 1
2          | prod_name_2  | 2
3          | prod_name_3  | 3
4          | prod_name_4  | 4
As you can see, attribute_id in the Products table has the ids (1,2,3,4) instead of (1,2,1,2).
The problem is in the Attributes table: there are repeating values (attribute_names) with different IDs. So I want to:
1. Pick one ID from each group of repeats in the Attributes table
2. Update the Products table with that "picked" ID (only where the attribute_id has the same name in the Attributes table)
3. After that, delete the repeating values from the Attributes table which have no use in the Products table
Output:
Table name: Attributes
attribute_id | attribute_name
1            | attr_name_1
2            | attr_name_2
Table name: Products
product_id | product_name | attribute_id
1          | prod_name_1  | 1
2          | prod_name_2  | 2
3          | prod_name_3  | 1
4          | prod_name_4  | 2
Note: it will help me a lot if I can use SQL instead of fixing this issue manually.
update Products
set attribute_id = (
    select min(attribute_id)
    from Attributes a
    where a.attribute_name = (select attribute_name
                              from Attributes a2
                              where a2.attribute_id = Products.attribute_id)
);
DELETE
FROM Attributes
WHERE attribute_id NOT IN
(
SELECT MIN(attribute_id)
FROM Attributes
GROUP BY attribute_name
);
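After running both statements, a quick sanity check (a sketch assuming the same table names) confirms that every product still points at a surviving attribute row; it should return no rows:
SELECT p.product_id
FROM Products p
LEFT JOIN Attributes a ON a.attribute_id = p.attribute_id
WHERE a.attribute_id IS NULL;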
The following may be faster than @Alexander Sigachov's suggestion, but it requires at least SQL Server 2005 to run, while Alexander's solution would work on any (reasonable) version of SQL Server. Still, even if only for the sake of providing an alternative, here you go:
WITH Min_IDs AS (
SELECT
attribute_id,
min_attribute_id = MIN(attribute_id) OVER (PARTITION BY attribute_name)
FROM Attributes
)
UPDATE p
SET p.attribute_id = a.min_attribute_id
FROM Products p
JOIN Min_IDs a ON a.attribute_id = p.attribute_id
WHERE a.attribute_id <> a.min_attribute_id
;
DELETE FROM Attributes
WHERE attribute_id NOT IN (
SELECT attribute_id
FROM Products
WHERE attribute_id IS NOT NULL
)
;
The first statement's CTE returns a row set where every attribute_id is mapped to the minimum attribute_id for the same attribute_name. By joining to this mapping set, the UPDATE statement uses it to replace attribute_ids in the Products table.
When subsequently deleting from Attributes, it is enough to check that Attributes.attribute_id is not found in the Products.attribute_id column, which is what the second statement does. That is to say, grouping and aggregation, as in the other answer, are not needed at this point.
The WHERE attribute_id IS NOT NULL condition is added to the second query's subquery in case the column is nullable and may indeed contain NULLs. NULLs need to be filtered out in this case, or their presence would cause the NOT IN predicate to evaluate to UNKNOWN, which SQL Server treats the same as FALSE (and so no rows would effectively be deleted). If there cannot be NULLs in Products.attribute_id, the condition may be dropped.
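A tiny illustration of that pitfall, with arbitrary literal values:
-- 1 NOT IN (2, NULL) evaluates to UNKNOWN rather than TRUE,
-- so this returns no rows:
SELECT 'kept' AS result WHERE 1 NOT IN (2, NULL);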

Maintaining logical consistency with a soft delete, whilst retaining the original information

I have a very simple table students, structure as below, where the primary key is id. This table is a stand-in for about 20 multi-million row tables that get joined together a lot.
+----+----------+------------+
| id | name     | dob        |
+----+----------+------------+
| 1  | Alice    | 01/12/1989 |
| 2  | Bob      | 04/06/1990 |
| 3  | Cuthbert | 23/01/1988 |
+----+----------+------------+
If Bob wants to change his date of birth, then I have a few options:
Update students with the new date of birth.
Positives: 1 DML operation; the table can always be accessed by a single primary key lookup.
Negatives: I lose the fact that Bob ever thought he was born on 04/06/1990
Add a column, created date default sysdate, to the table and change the primary key to id, created. Every update becomes:
insert into students(id, name, dob) values (:id, :name, :new_dob)
Then, whenever I want the most recent information do the following (Oracle but the question stands for every RDBMS):
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by created desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: All queries over the entire database take that little bit longer. If the table stayed the size indicated this wouldn't matter, but once you're on your 5th left outer join, using range scans rather than unique scans begins to have an effect.
Add a different column, deleted date default to_date('2100/01/01','yyyy/mm/dd'), or whatever overly early, or futuristic, date takes my fancy. Change the primary key to id, deleted then every update becomes:
update students x
set deleted = sysdate
where id = :id
and deleted = ( select max(deleted) from students where id = x.id );
insert into students(id, name, dob) values ( :id, :name, :new_dob );
and the query to get out the current information becomes:
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by deleted desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: Two DML operations; I still have to use ranked queries, with the additional cost of a range scan rather than a unique index scan, in every query.
Create a second table, say student_archive and change every update into:
insert into student_archive select * from students where id = :id;
update students set dob = :newdob where id = :id;
Positives: Never lose any information.
Negatives: 2 DML operations; if you ever want to retrieve all the information, you have to use a union or an extra left outer join.
For completeness, have a horribly de-normalised data structure: id, name1, dob1, name2, dob2, etc.
Number 1 is not an option, since I never want to lose any information and always do a soft delete. Number 5 can be safely discarded as causing more trouble than it's worth.
I'm left with options 2, 3 and 4 with their attendant negative aspects. I usually end up using option 2 and the horrific 150-line (nicely-spaced) multiple sub-select joins that go along with it.
tl;dr I realise I'm skating close to the line on a "not constructive" vote here but:
What is the optimal (singular!) method of maintaining logical consistency while never deleting any data?
Is there a more efficient way than those I have documented? In this context I'll define efficient as "less DML operations" and / or "being able to remove the sub-queries". If you can think of a better definition when (if) answering please feel free.
I'd stick to #4 with some modifications. There's no need to delete data from the original table; it's enough to copy the old values to the archive table before updating (or before deleting) the original record. That can easily be done with a row-level trigger. Retrieving all the information is, in my opinion, not a frequent operation, and I don't see anything wrong with an extra join/union. Also, you can define a view, so all queries will be straightforward from the end user's perspective.
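A minimal, untested sketch of that trigger-plus-view approach, assuming a student_archive table with the same columns as students plus an archived_at date (all names are illustrative):
-- Copy the old row into the archive before it is changed or removed.
CREATE OR REPLACE TRIGGER trg_students_archive
BEFORE UPDATE OR DELETE ON students
FOR EACH ROW
BEGIN
    INSERT INTO student_archive (id, name, dob, archived_at)
    VALUES (:OLD.id, :OLD.name, :OLD.dob, SYSDATE);
END;
/

-- One view exposing current and historical rows together.
CREATE OR REPLACE VIEW students_history AS
SELECT id, name, dob, archived_at FROM student_archive
UNION ALL
SELECT id, name, dob, CAST(NULL AS DATE) AS archived_at FROM students;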