Safely normalizing data via SQL query - sql

Suppose I have a table of customers:
CREATE TABLE customers (
customer_number INTEGER,
customer_name VARCHAR(...),
customer_address VARCHAR(...)
)
This table does not have a primary key. However, customer_name and customer_address should be unique for any given customer_number.
It is not uncommon for this table to contain many duplicate customers. To get around this duplication, the following query is used to isolate only the unique customers:
SELECT
DISTINCT customer_number, customer_name, customer_address
FROM customers
Fortunately, the table has traditionally contained accurate data. That is, there has never been a conflicting customer_name or customer_address for any customer_number. However, suppose conflicting data did make it into the table. I wish to write a query that will fail, rather than returning multiple rows for the customer_number in question.
For example, I tried this query with no success:
SELECT
customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number
Is there a way to write such a query using standard SQL? If not, is there a solution in Oracle-specific SQL?
EDIT: The rationale behind the bizarre query:
Truth be told, this customers table does not actually exist (thank goodness). I created it hoping that it would be clear enough to demonstrate the needs of the query. However, people are (fortunately) catching on that the need for such a query is the least of my worries, based on that example. Therefore, I must now peel away some of the abstraction and hopefully restore my reputation for suggesting such an abomination of a table...
I receive a flat file containing invoices (one per line) from an external system. I read this file, line-by-line, inserting its fields into this table:
CREATE TABLE unprocessed_invoices (
invoice_number INTEGER,
invoice_date DATE,
...
-- other invoice columns
...
customer_number INTEGER,
customer_name VARCHAR(...),
customer_address VARCHAR(...)
)
As you can see, the data arriving from the external system is denormalized. That is, the external system includes both the invoice data and its associated customer data on the same line. It is possible that multiple invoices will share the same customer, therefore it is possible to have duplicate customer data.
The system cannot begin processing the invoices until all customers are guaranteed to be registered with the system. Therefore, the system must identify the unique customers and register them as necessary. This is why I wanted the query: because I was working with denormalized data I had no control over.
SELECT
customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number
Hopefully this helps clarify the original intent of the question.
EDIT: Examples of good/bad data
To clarify: customer_name and customer_address only have to be unique for a particular customer_number.
customer_number | customer_name | customer_address
----------------------------------------------------
1 | 'Bob' | '123 Street'
1 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
3 | 'Fred' | '456 Avenue'
3 | 'Fred' | '789 Crescent'
The first two rows are fine because it is the same customer_name and customer_address for customer_number 1.
The middle two rows are fine because it is the same customer_name and customer_address for customer_number 2 (even though another customer_number has the same customer_name and customer_address).
The last two rows are not okay because there are two different customer_addresses for customer_number 3.
The query I am looking for would fail if run against all six of these rows. However, if only the first four rows actually existed, the query should return:
customer_number | customer_name | customer_address
----------------------------------------------------
1 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
I hope this clarifies what I meant by "conflicting customer_name and customer_address". They have to be unique per customer_number.
I appreciate those who are explaining how to properly import data from external systems. In fact, I am already doing most of that. I purposely hid the details of what I'm doing so that it would be easier to focus on the question at hand. This query is not meant to be the only form of validation. I just thought it would make a nice finishing touch (a last defense, so to speak). This question was simply designed to investigate what is possible with SQL. :)

Your approach is flawed. You do not want data that was successfully stored to then throw an error on a select - that is a land mine waiting to happen and means you never know when a select could fail.
What I recommend is that you add a unique key to the table, and slowly start modifying your application to use this key rather than relying on any combination of meaningful data.
You can then stop caring about duplicate data, which is not really duplicate in the first place. It is entirely possible for two people with the same name to share the same address.
You will also gain performance improvements from this approach.
As an aside, I highly recommend you normalize your data, that is break up the name into FirstName and LastName (optionally MiddleName too), and break up the address field into separate fields for each component (Address1, Address2, City, State, Country, Zip, or whatever)
Update: If I understand your situation correctly (which I am not sure I do), you want to prevent duplicate combinations of name and address from ever occurring in the table (even though that is a possible occurrence in real life). This is best done by a unique constraint or index on these two fields to prevent the data from being inserted. That is, catch the error before you insert it. That will tell you the import file or your resulting app logic is bad and you can choose to take the appropriate measures then.
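For example, the constraint described here might look like this (a sketch against the question's customers table; the constraint name is made up):
ALTER TABLE customers
ADD CONSTRAINT uq_customer_name_address UNIQUE (customer_name, customer_address);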
I still maintain that throwing the error when you query is too late in the game to do anything about it.

A scalar sub-query must only return one row (per result set row...) so you could do something like:
select distinct
customer_number,
(
select distinct
customer_address
from customers c2
where c2.customer_number = c.customer_number
) as customer_address
from customers c
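Extending that idea to both columns gives a query that fails exactly when the question wants it to (a sketch; in Oracle a scalar sub-query returning more than one row raises ORA-01427):
select distinct
c.customer_number,
(
select distinct customer_name
from customers c2
where c2.customer_number = c.customer_number
) as customer_name,
(
select distinct customer_address
from customers c3
where c3.customer_number = c.customer_number
) as customer_address
from customers c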

Making the query fail may be tricky...
This will show you if there are any duplicate records in the table:
select customer_number, customer_name, customer_address
from customers
group by customer_number, customer_name, customer_address
having count(*) > 1
If you just add a unique index across all three fields, no one can create a duplicate record in the table.
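To catch conflicts (same customer_number, different name or address) rather than exact duplicates, a variation on the same idea might be (a sketch):
select customer_number
from customers
group by customer_number
having count(distinct customer_name) > 1
or count(distinct customer_address) > 1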

The de facto key is Name+Address, so that's what you need to group by.
SELECT
Customer_Name,
Customer_Address,
CASE WHEN Count(DISTINCT Customer_Number) > 1
THEN 1/0 ELSE 0 END as LandMine
FROM Customers
GROUP BY Customer_Name, Customer_Address
If you want to do it from the point of view of a Customer_Number, then this is good too.
SELECT *,
CASE WHEN Exists((
SELECT top 1 1
FROM Customers c2
WHERE c1.Customer_Number != c2.Customer_Number
AND c1.Customer_Name = c2.Customer_Name
AND c1.Customer_Address = c2.Customer_Address
)) THEN 1/0 ELSE 0 END as LandMine
FROM Customers c1
WHERE Customer_Number = #Number

If you have dirty data, I would clean it up first.
Use this to find the duplicate customer records...
Select * From customers
Where customer_number in
(Select Customer_number from customers
Group by customer_number Having count(*) > 1)

If you want it to fail, you're going to need an index. If you don't want an index, then you can just create a temp table to do this all in.
CREATE TABLE #temp_customers
(customer_number int,
customer_name varchar(50),
customer_address varchar(50),
PRIMARY KEY (customer_number),
UNIQUE(customer_name, customer_address))
INSERT INTO #temp_customers
SELECT DISTINCT customer_number, customer_name, customer_address
FROM customers
SELECT customer_number, customer_name, customer_address
FROM #temp_customers
DROP TABLE #temp_customers
This will fail if there are conflicts, but the DISTINCT keeps plain duplicate records from causing trouble.

Let's put the data into a temp table or table variable with your distinct query
select distinct customer_number, customer_name, customer_address,
IDENTITY(int, 1,1) AS ID_Num
into #temp
from unprocessed_invoices
Personally, I would add an identity to unprocessed_invoices if possible as well. I never do an import without creating a staging table that has an identity column, just because it is easier to delete duplicate records.
Now let's query the table to find your problem records. I assume you would want to see what is causing the problem, not just fail the load.
Select t1.* from #temp t1
join #temp t2
on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address
where t1.customer_number <> t2.customer_number
select t1.* from #temp t1
join
(select customer_number from #temp group by customer_number having count(*) >1) t2
on t1.customer_number = t2.customer_number
You can use a variation on these queries to delete the problem records from #temp (depending on whether you choose to keep one or delete all possible problems) and then insert from #temp into your production table. You can also provide the problem records back to whoever supplies your data so they can be fixed at the source.
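For instance, deleting every conflicting customer_number from #temp and then loading might look like this (a sketch; the production table name customers is assumed):
DELETE t1
FROM #temp t1
join (select customer_number from #temp group by customer_number having count(*) > 1) t2
on t1.customer_number = t2.customer_number

INSERT INTO customers (customer_number, customer_name, customer_address)
SELECT customer_number, customer_name, customer_address
FROM #temp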

Insert values from another table and update original table with returning values

I'm new to PostgreSQL (and even Stackoverflow).
Say, I have two tables Order and Delivery:
Order
id product address delivery_id
--------------------------------------------------
1 apple mac street (null)
3 coffee java island (null)
4 window micro street (null)
Delivery
id address
----------------
Delivery.id and Order.id are auto-incrementing serial columns.
The table Delivery is currently empty.
I would like to move Order.address to Delivery.address and store the corresponding Delivery.id in Order.delivery_id, arriving at this state:
Order
id product address delivery_id
--------------------------------------------------
1 apple mac street 1
3 coffee java island 2
4 window micro street 3
Delivery
id address
---------------------
1 mac street
2 java island
3 micro street
I'll then remove Order.address.
I found a similar question for Oracle but failed to convert it to PostgreSQL:
How to insert values from one table into another and then update the original table?
I still think it should be possible to use a plain SQL statement with the RETURNING clause and a following INSERT in Postgres.
I tried this (as well as some variants):
WITH ids AS (
INSERT INTO Delivery (address)
SELECT address
FROM Order
RETURNING Delivery.id AS d_id, Order.id AS o_id
)
UPDATE Order
SET Delivery_id = d_id
FROM ids
WHERE Order.id = ids.o_id;
This latest attempt failed with:
ERROR: missing FROM-clause entry for table "Delivery" LINE 1: ...address Order RETURNING Delivery.id...
How to do this properly?
First of all, ORDER is a reserved word. Don't use it as an identifier. Assuming orders as table name instead.
WITH ids AS (
INSERT INTO delivery (address)
SELECT DISTINCT address
FROM orders
ORDER BY address -- optional
RETURNING *
)
UPDATE orders o
SET delivery_id = i.id
FROM ids i
WHERE o.address = i.address;
You have to account for possible duplicates in orders.address. SELECT DISTINCT produces unique addresses.
In the outer UPDATE we can now join back on address because delivery.address is unique. You should probably keep it that way beyond this statement and add a UNIQUE constraint on the column.
This effectively results in a one-to-many relationship between delivery and orders: one row in delivery can have many corresponding rows in orders. Consider enforcing that by adding a FOREIGN KEY constraint accordingly.
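Spelled out, the two constraints suggested above might look like this (a sketch; the constraint names are made up):
ALTER TABLE delivery ADD CONSTRAINT delivery_address_uni UNIQUE (address);
ALTER TABLE orders ADD CONSTRAINT orders_delivery_id_fkey
FOREIGN KEY (delivery_id) REFERENCES delivery (id);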
This statement enjoys the benefit of starting out on an empty delivery table. If delivery wasn't empty, we'd have to work with an UPSERT instead of the INSERT. See:
How to use RETURNING with ON CONFLICT in PostgreSQL?
Related:
Insert data in 3 tables at a time using Postgres
About the cause for the error message you got:
RETURNING causes error: missing FROM-clause entry for table
Use legal, lower-case identifiers exclusively, if you can. See:
Are PostgreSQL column names case-sensitive?
You can't return columns from the FROM relation in the RETURNING clause of the CTE query. You'll have to either manage this in a cursor, or add an order_id column to the Delivery table, something like this:
ALTER TABLE Delivery ADD COLUMN order_id INTEGER;
INSERT INTO Delivery (address, order_id)
SELECT address, id
FROM Order
;
WITH q_ids AS
(
SELECT id, order_id
FROM Delivery
)
UPDATE Order
SET delivery_id = q_ids.id
FROM q_ids
WHERE Order.id = q_ids.order_id;
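Once the UPDATE has run, the helper column has done its job and can be dropped again (a sketch):
ALTER TABLE Delivery DROP COLUMN order_id;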

Unique constraint on Distinct select in Oracle database

I have a data processor that would create a table from a select query.
<_config:table definition="CREATE TABLE TEMP_TABLE (PRODUCT_ID NUMBER NOT NULL, STORE NUMBER NOT NULL, USD NUMBER(20, 5),
CAD NUMBER(20, 5), Description varchar(5), ITEM_ID VARCHAR(256), PRIMARY KEY (ITEM_ID))" name="TEMP_TABLE"/>
and the select query is
<_config:query sql="SELECT DISTINCT ce.PRODUCT_ID, ce.STORE, op.USD ,op.CAD, o.Description, ce.ITEM_ID
FROM PRICE op, PRODUCT ce, STORE ex, OFFER o, SALE t
where op.ITEM_ID = ce.ITEM_ID and ce.STORE = ex.STORE
and ce.PRODUCT_ID = o.PRODUCT_ID and o.SALE_ID IN (2345,1234,3456) and t.MEMBER = ce.MEMBER"/>
When I run that processor, I get a unique constraint error, even though I have a DISTINCT in my select statement.
I tried with CREATE TABLE AS (SELECT .....) and it creates the table fine.
How is it possible to get that error? I'm doing a batch execute, so I'm not able to find the individual record.
The select distinct applies to the entire row, not to each column individually. So, two rows could have the same value of item_id but be different in the other columns.
The ultimate fix might be to have a group by item_id in the query, instead of select distinct. That would require other changes to the logic. Another possibility would be to use row_number() in a subquery and select the first row.
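A sketch of the row_number() variant, keeping the question's joins as-is (the ORDER BY inside the window is an arbitrary tie-breaker; pick a meaningful one if the data offers it):
SELECT PRODUCT_ID, STORE, USD, CAD, Description, ITEM_ID
FROM (
SELECT ce.PRODUCT_ID, ce.STORE, op.USD, op.CAD, o.Description, ce.ITEM_ID,
ROW_NUMBER() OVER (PARTITION BY ce.ITEM_ID ORDER BY ce.PRODUCT_ID) AS rn
FROM PRICE op, PRODUCT ce, STORE ex, OFFER o, SALE t
WHERE op.ITEM_ID = ce.ITEM_ID AND ce.STORE = ex.STORE
AND ce.PRODUCT_ID = o.PRODUCT_ID AND o.SALE_ID IN (2345,1234,3456)
AND t.MEMBER = ce.MEMBER
)
WHERE rn = 1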

Delete duplicates with no primary key

Here I want to delete rows with a duplicated column's value (Product), which will then be used as a primary key.
The column is of type nvarchar, and we don't want to have 2 rows for one product.
The database is a large one, with thousands of rows we need to remove.
During the query for all the duplicates, we want to keep the first item and remove the second one as the duplicate.
There is no primary key yet; we want to add it after this activity of removing duplicates.
Then the Product column could be our primary key.
The database is SQL Server CE.
I tried several methods, mostly getting errors similar to:
There was an error parsing the query. [ Token line number = 2,Token line offset = 1,Token in error = FROM ]
A method which I tried :
DELETE FROM TblProducts
FROM TblProducts w
INNER JOIN (
SELECT Product
FROM TblProducts
GROUP BY Product
HAVING COUNT(*) > 1
)Dup ON w.Product = Dup.Product
The preferred way, which I am trying to learn and adapt my code toward, is something similar to this (it's not correct yet):
SELECT Product, COUNT(*) TotalCount
FROM TblProducts
GROUP BY Product
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
--
;WITH cte -- These 3 lines are the ones I have most doubt about
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Product
ORDER BY ( SELECT 0)) RN
FROM TblProducts)
DELETE FROM cte
WHERE RN > 1
If you have two DIFFERENT records with the same Product column, then you can SELECT the unwanted records with some criterion, e.g.
CREATE TABLE victims AS
SELECT MAX(entryDate) AS date, Product, COUNT(*) AS dups FROM ProductsTable WHERE ...
GROUP BY Product HAVING dups > 1;
Then you can do a DELETE JOIN between ProductTable and Victims.
Or also you can select Product only, and then do a DELETE for some other JOIN condition, for example having an invalid CustomerId, or EntryDate NULL, or anything else. This works if you know that there is one and only one valid copy of Product, and all the others are recognizable by the invalid data.
Suppose you instead have IDENTICAL records (or you have both identical and non-identical, or you may have several dupes for some product and you don't know which). You run exactly the same query. Then, you run a SELECT query on ProductsTable and SELECT DISTINCT all products matching the product codes to be deduped, grouping by Product, and choosing a suitable aggregate function for all fields (if identical, any aggregate should do. Otherwise I usually try for MAX or MIN). This will "save" exactly one row for each product.
At that point you run the DELETE JOIN and kill all the duplicated products. Then, simply reimport the saved and deduped subset into the main table.
Of course, between the DELETE JOIN and the INSERT SELECT, you will have the DB in an unstable state, with every product that had at least one duplicate simply missing.
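A sketch of that save / delete / reimport cycle in MySQL syntax (the column list is illustrative: entryDate comes from the snippet above, CustomerId is only an example):
-- save one surviving row per duplicated product
CREATE TEMPORARY TABLE saved AS
SELECT Product, MAX(entryDate) AS entryDate, MAX(CustomerId) AS CustomerId
FROM ProductsTable
WHERE Product IN (SELECT Product FROM victims)
GROUP BY Product;

-- kill every copy of the duplicated products
DELETE ProductsTable FROM ProductsTable JOIN victims USING (Product);

-- reimport the deduplicated survivors
INSERT INTO ProductsTable (Product, entryDate, CustomerId)
SELECT Product, entryDate, CustomerId FROM saved;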
Another way which should work in MySQL:
-- Create an empty table
CREATE TABLE deduped AS SELECT * FROM ProductsTable WHERE false;
CREATE UNIQUE INDEX deduped_ndx ON deduped(Product);
-- DROP duplicate rows, Joe the Butcher's way
INSERT IGNORE INTO deduped SELECT * FROM ProductsTable;
ALTER TABLE ProductsTable RENAME TO ProductsBackup;
ALTER TABLE deduped RENAME TO ProductsTable;
-- TODO: Copy all indexes from ProductsTable on deduped.
NOTE: the way above DOES NOT WORK if you want to distinguish "good records" and "invalid duplicates". It only works if you have redundant DUPLICATE records, or if you do not care which row you keep and which you throw away!
EDIT:
You say that "duplicates" have invalid fields. In that case you can modify the above with a sorting trick:
SELECT * FROM ProductsTable ORDER BY Product, FieldWhichShouldNotBeNULL IS NULL;
Then if you have only one row for product, all well and good, it will get selected. If you have more, the one for which (FieldWhichShouldNeverBeNull IS NULL) is FALSE (i.e. the one where the FieldWhichShouldNeverBeNull is actually not null as it should) will be selected first, and inserted. All others will bounce, silently due to the IGNORE clause, against the uniqueness of Product. Not a really pretty way to do it (and check I didn't mix true with false in my clause!), but it ought to work.
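Spelled out, the trick reads (a sketch; FieldWhichShouldNotBeNULL is the placeholder from the paragraph above):
INSERT IGNORE INTO deduped
SELECT * FROM ProductsTable
ORDER BY Product, FieldWhichShouldNotBeNULL IS NULL;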
EDIT
actually more of a new answer
This is a simple table to illustrate the problem
CREATE TABLE ProductTable ( Product varchar(10), Description varchar(10) );
INSERT INTO ProductTable VALUES ( 'CBPD10', 'C-Beam Prj' );
INSERT INTO ProductTable VALUES ( 'CBPD11', 'C Proj Mk2' );
INSERT INTO ProductTable VALUES ( 'CBPD12', 'C Proj Mk3' );
There is no index yet, and no primary key. We could still declare Product to be primary key.
But something bad happens. Two new records get in, and both have NULL description.
Yet, the second one is a valid product since we knew nothing of CBPD14 before now, and therefore we do NOT want to lose this record completely. We do want to get rid of the spurious CBPD10 though.
INSERT INTO ProductTable VALUES ( 'CBPD10', NULL );
INSERT INTO ProductTable VALUES ( 'CBPD14', NULL );
A rude DELETE FROM ProductTable WHERE Description IS NULL is out of the question, it would kill CBPD14 which isn't a duplicate.
So we do it like this. First get the list of duplicates:
SELECT Product, COUNT(*) AS Dups FROM ProductTable GROUP BY Product HAVING Dups > 1;
We assume that: "There is at least one good record for every set of bad records".
We check this assumption by positing the opposite and querying for it. If all is copacetic we expect this query to return nothing.
SELECT Dups.Product FROM ProductTable
RIGHT JOIN ( SELECT Product, COUNT(*) AS Dups FROM ProductTable GROUP BY Product HAVING Dups > 1 ) AS Dups
ON (ProductTable.Product = Dups.Product
AND ProductTable.Description IS NOT NULL)
WHERE ProductTable.Description IS NULL;
To further verify, I insert two records that represent this mode of failure; now I do expect the query above to return the new code.
INSERT INTO ProductTable VALUES ( 'AC5', NULL ), ( 'AC5', NULL );
Now the "check" query indeed returns,
AC5
So, the generation of Dups looks good.
I proceed now to delete all duplicate records that are not valid. If there are duplicate, valid records, they will stay duplicate unless some condition may be found, distinguishing among them one "good" record and declaring all others "invalid" (maybe repeating the procedure with a different field than Description).
But ay, there's a rub. Currently, you cannot delete from a table and select from the same table in a subquery ( http://dev.mysql.com/doc/refman/5.0/en/delete.html ). So a little workaround is needed:
CREATE TEMPORARY TABLE Dups AS
SELECT Product, COUNT(*) AS Duplicates
FROM ProductTable GROUP BY Product HAVING Duplicates > 1;
DELETE ProductTable FROM ProductTable JOIN Dups USING (Product)
WHERE Description IS NULL;
Now this will delete all invalid records, provided that they appear in the Dups table.
Therefore our CBPD14 record will be left untouched, because it does not appear there. The "good" record for CBPD10 will be left untouched because it's not true that its Description is NULL. All the others - poof.
Let me state again that if a record has no valid records and yet it is a duplicate, then all copies of that record will be killed - there will be no survivors.
To avoid this, one may first SELECT the rows representing this mode of failure (using the check query above, the one "which should return nothing") into another TEMPORARY TABLE, then INSERT them back into the main table after the deletion (using transactions might be in order).
Create a new table by scripting the old one out and renaming it. Also script all objects (indexes, etc.) from the old table to the new. Insert the keepers into the new table. If your database is in the bulk-logged or simple recovery model, this operation will be minimally logged. Drop the old table and then rename the new one to the old name.
The advantage of this over a delete will be that the insert can be minimally logged. Deletes do double work because not only does the data get deleted, but the delete has to be written to the transaction log. For big tables, minimally logged inserts will be much faster than deletes.
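A sketch of that rebuild against the thread's TblProducts(Product, Description); SELECT ... INTO is the minimally logged insert, and sp_rename swaps the names:
SELECT Product, MAX(Description) AS Description -- MAX picks an arbitrary survivor
INTO TblProducts_new
FROM TblProducts
GROUP BY Product

EXEC sp_rename 'TblProducts', 'TblProducts_old'
EXEC sp_rename 'TblProducts_new', 'TblProducts'
-- recreate indexes and constraints, then drop TblProducts_old when satisfied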
If it's not that big and you have some downtime, and you have SQL Server Management Studio, you can put an identity field on the table using the GUI. Now you have the situation like your CTE, except the rows themselves are truly distinct. So now you can do the following
SELECT MIN(lhs.MyTempIDField)
FROM
table_a lhs
join table_a rhs
on lhs.field1 = rhs.field1
and lhs.field2 = rhs.field2 [etc]
WHERE
lhs.MyTempIDField <> rhs.MyTempIDField
GROUP BY
lhs.field1, lhs.field2 [etc]
This gives you all the 'good' duplicates. Now you can wrap this query with a DELETE FROM query.
DELETE FROM lhs
FROM table_a lhs
join table_a rhs
on lhs.field1 = rhs.field1
and lhs.field2 = rhs.field2 [etc]
WHERE
lhs.MyTempIDField <> rhs.MyTempIDField
and lhs.MyTempIDField not in (
SELECT MIN(lhs.MyTempIDField)
FROM
table_a lhs
join table_a rhs
on lhs.field1 = rhs.field1
and lhs.field2 = rhs.field2 [etc]
WHERE
lhs.MyTempIDField <> rhs.MyTempIDField
GROUP BY
lhs.field1, lhs.field2 [etc]
)
Try this:
DELETE FROM TblProducts
WHERE Product IN
(
SELECT Product
FROM TblProducts
GROUP BY Product
HAVING COUNT(*) > 1)
This suffers from the defect that it deletes ALL the records with a duplicated Product. What you probably want to do is delete all but one of each group of records with a given Product. It might be worthwhile to copy all the duplicates to a separate table first, and then somehow remove duplicates from that table, then apply the above, and then copy remaining products back to the original table.
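If an identity column (say ID) has first been added, as the earlier answer suggests, "all but one" can be expressed directly (a sketch; verify that SQL Server CE accepts the grouped subquery):
DELETE FROM TblProducts
WHERE ID NOT IN
(
SELECT MIN(ID)
FROM TblProducts
GROUP BY Product
)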

Underlying rows in Group By

I have a table with a certain number of columns and a primary key column (suppose OriginalKey). I perform a GROUP BY on a certain sub-set of those columns and store them in a temporary table with primary key (suppose GroupKey). At a later stage, I may need to get more details about one or more of those groupings (which can be found in the temporary table) i.e. I need to know which were the rows from the original table that formed that group. Simply put, I need to know the mappings between GroupKey and OriginalKey. What's the best way to do this? Thanks in advance.
Example:
CREATE TABLE Student(
StudentID INT PRIMARY KEY,
Level INT, -- Grade/Class/Level depending on which country you are from
HomeTown TEXT,
Gender CHAR)
INSERT INTO TempTable SELECT HomeTown, Gender, COUNT(*) AS NumStudents FROM Student GROUP BY HomeTown, Gender
On a later date, I would like to find out details about all towns that have more than 50 male students and know details of every one of them.
How about joining the 2 tables using the GroupKey, which, you say, are the same?
Or how about doing:
select * from OriginalTable where
GroupKey in (select GroupKey from my_temp_table)
You'd need to store the fields you grouped on in your temporary table, so you can join back to the original table. e.g. if you grouped on fieldA, fieldB, and fieldC, you'd need something like:
select original.id
from original
inner join temptable on
temptable.fieldA = original.fieldA and
temptable.fieldB = original.fieldB and
temptable.fieldC = original.fieldC
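Applied to the Student example, that join back might look like this (a sketch; 'M' is assumed to be the gender code for male):
SELECT s.*
FROM Student s
INNER JOIN TempTable t
ON t.HomeTown = s.HomeTown AND t.Gender = s.Gender
WHERE t.Gender = 'M' AND t.NumStudents > 50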

Normalizing a table, from one to the other

I'm trying to normalize a MySQL database...
I currently have a table that contains 11 columns for "categories". The first column is a user_id and the other 10 are category_id_1 - category_id_10. Some rows may only contain a category_id up to category_id_1 and the rest might be NULL.
I then have a table that has 2 columns, user_id and category_id...
What is the best way to transfer all of the data into separate rows in table 2 without adding a row for columns that are NULL in table 1?
thanks!
You can create a single query to do all the work, it just takes a bit of copy and pasting, and adjusting the column name:
INSERT INTO table2
SELECT * FROM (
SELECT user_id, category_id_1 AS category_id FROM table1
UNION ALL
SELECT user_id, category_id_2 FROM table1
UNION ALL
SELECT user_id, category_id_3 FROM table1
) AS T
WHERE category_id IS NOT NULL;
Since you only have to do this 10 times, and you can throw the code away when you are finished, I would think that this is the easiest way.
One table for users:
users(id, name, username, etc)
One for categories:
categories(id, category_name)
One to link the two, including any extra information you might want on that join.
categories_users(user_id, category_id)
-- or with extra information --
categories_users(user_id, category_id, date_created, notes)
To transfer the data across to the link table would be a case of writing a series of SQL INSERT statements. There's probably some awesome way to do it in one go, but since there are only 10 category columns, just copy-and-paste IMO:
INSERT INTO categories_users
SELECT user_id, category_id_1
FROM table1
WHERE category_id_1 IS NOT NULL
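The same statement then repeats for each remaining column (a sketch for the second one; the rest follow mechanically):
INSERT INTO categories_users
SELECT user_id, category_id_2
FROM table1
WHERE category_id_2 IS NOT NULL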