Insert values from another table and update original table with returning values - sql

I'm new to PostgreSQL (and even Stackoverflow).
Say, I have two tables Order and Delivery:
Order

 id | product | address      | delivery_id
----+---------+--------------+------------
  1 | apple   | mac street   | (null)
  3 | coffee  | java island  | (null)
  4 | window  | micro street | (null)

Delivery

 id | address
----+---------
Delivery.id and Order.id are auto-incrementing serial columns.
The table Delivery is currently empty.
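For reference, table definitions along these lines would match the setup (a sketch only; the actual DDL isn't shown in the question, and lower-case names sidestep the reserved-word issue raised in the answer below):

CREATE TABLE delivery (
    id      serial PRIMARY KEY,
    address text
);

CREATE TABLE orders (
    id          serial PRIMARY KEY,
    product     text,
    address     text,
    delivery_id integer
);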
I would like to move Order.address to Delivery.address and its Delivery.id to Order.delivery_id to arrive at this state:
Order

 id | product | address      | delivery_id
----+---------+--------------+------------
  1 | apple   | mac street   | 1
  5 | coffee  | java island  | 2
  7 | window  | micro street | 3

Delivery

 id | address
----+--------------
  1 | mac street
  2 | java island
  3 | micro street
I'll then remove Order.address.
I found a similar question for Oracle but failed to convert it to PostgreSQL:
How to insert values from one table into another and then update the original table?
I still think it should be possible to use a plain SQL statement with the RETURNING clause and a following INSERT in Postgres.
I tried this (as well as some variants):
WITH ids AS (
INSERT INTO Delivery (address)
SELECT address
FROM Order
RETURNING Delivery.id AS d_id, Order.id AS o_id
)
UPDATE Order
SET Delivery_id = d_id
FROM ids
WHERE Order.id = ids.o_id;
This latest attempt failed with:
ERROR: missing FROM-clause entry for table "Delivery" LINE 1: ...address Order RETURNING Delivery.id...
How to do this properly?

First of all, ORDER is a reserved word. Don't use it as an identifier. Assuming orders as the table name instead.
WITH ids AS (
INSERT INTO delivery (address)
SELECT DISTINCT address
FROM orders
ORDER BY address -- optional
RETURNING *
)
UPDATE orders o
SET delivery_id = i.id
FROM ids i
WHERE o.address = i.address;
You have to account for possible duplicates in orders.address. SELECT DISTINCT produces unique addresses.
In the outer UPDATE we can now join back on address because delivery.address is unique. You should probably keep it that way beyond this statement and add a UNIQUE constraint on the column.
This effectively results in a one-to-many relationship between delivery and orders: one row in delivery can have many corresponding rows in orders. Consider enforcing that by adding a FOREIGN KEY constraint accordingly.
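For example, a sketch of those two constraints (the constraint names are just placeholders):

-- keep delivery.address unique beyond this statement
ALTER TABLE delivery
    ADD CONSTRAINT delivery_address_uni UNIQUE (address);

-- enforce the one-to-many relationship from orders to delivery
ALTER TABLE orders
    ADD CONSTRAINT orders_delivery_id_fkey
    FOREIGN KEY (delivery_id) REFERENCES delivery (id);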
This statement enjoys the benefit of starting out on an empty delivery table. If delivery wasn't empty, we'd have to work with an UPSERT instead of the INSERT. See:
How to use RETURNING with ON CONFLICT in PostgreSQL?
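A minimal sketch of that UPSERT variant, assuming the UNIQUE constraint on delivery.address suggested above (the no-op DO UPDATE is only there so that conflicting rows are still returned):

WITH ids AS (
   INSERT INTO delivery (address)
   SELECT DISTINCT address
   FROM   orders
   ON     CONFLICT (address) DO UPDATE
   SET    address = EXCLUDED.address  -- no-op, makes RETURNING cover pre-existing rows
   RETURNING id, address
   )
UPDATE orders o
SET    delivery_id = i.id
FROM   ids i
WHERE  o.address = i.address;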
Related:
Insert data in 3 tables at a time using Postgres
About the cause for the error message you got:
RETURNING causes error: missing FROM-clause entry for table
Use legal, lower-case identifiers exclusively, if you can. See:
Are PostgreSQL column names case-sensitive?

You can't return columns from the FROM relation in the RETURNING clause of the CTE query. You'll have to either manage this in a cursor, or add an order_id column to the Delivery table, something like this:
ALTER TABLE Delivery ADD COLUMN order_id INTEGER;
INSERT INTO Delivery (address, order_id)
SELECT address, id
FROM Order
;
WITH q_ids AS
(
SELECT id, order_id
FROM Delivery
)
UPDATE Order
SET delivery_id = q_ids.id
FROM q_ids
WHERE Order.id = q_ids.order_id;
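Once the update has run, the helper column is no longer needed and can be dropped again, e.g.:

ALTER TABLE Delivery DROP COLUMN order_id;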

Related

SQL data cleaning SELECT DISTINCT from duplicate ID and return list of records. Scenario: Return unique IDs for first and latest instance

Dataset: customer_data
Table: customer_table (30 records)
Fields: customer_id, name
Datatype: customer_id = INTEGER, name = STRING
The problem or request: the customer_table contains 30 rows of customer data. However, there are some duplicate rows and I need to clean the data. I am using Google BigQuery to perform my SQL querying and I want to query the customer_table from the customer_data dataset to return unique customer_id along with the corresponding name.
If duplicate customer_id exists, but the duplicate has a different name, return the first instance record and discard the duplicate and continue returning all unique customer_id and name.
Alternately, if duplicate customer_id exists, but has different name, return the latest instance record from the table and discard the duplicate and continue returning all unique customer_id and name.
My methods:
Identify the unique values using SELECT DISTINCT.
SELECT DISTINCT customer_id
FROM customer_data.customer_table
Result: 24 rows
SELECT DISTINCT name
FROM customer_data.customer_table
Result: 25 rows
After finding that the numbers of unique values for customer_id and name do not match, I suspect one of the customer_id values has two different names.
Visualize which duplicate customer_id has two names:
SELECT DISTINCT customer_id, name
FROM customer_data.customer_table
ORDER BY customer_id ASC
Result: 25 rows
It appears there is one duplicate customer_id, and the same customer_id has two different names.
Example:
customer_id | name
------------+---------------
1890        | Henry Fiction
1890        | Arthur Stories
Return DISTINCT customer_id and name. If there are duplicates return only the first, discard the duplicate, and continue returning unique customer_id and name.
SELECT DISTINCT customer_id, name
FROM
  (SELECT
     customer_id, name,
     ROW_NUMBER() OVER (PARTITION BY customer_id
                        ORDER BY customer_id ASC) AS row_num
   FROM
     customer_data.customer_table) subquery
WHERE
  subquery.row_num = 1
Result: 24 rows
I decided to use ROW_NUMBER() in a subquery so that the inner query first numbers the rows within each customer_id. The outer query then applies a WHERE clause to return a list of DISTINCT customer_id values and the matching name for the first instance each customer_id was recorded in the customer_table.
Excellent! I was able to write a query that returns unique customer_id values along with their name from the customer_table, and when a duplicate customer_id has a different name, it keeps the first instance recorded in the customer_table.
Now, what if I want the query to create a list of unique customer_id and name that, instead of selecting the first record when it encounters duplicates, selects the latest record entry in the table? How should I approach this problem? What query method would you suggest?
Expected result: 24 rows
What I've tried:
SELECT DISTINCT customer_id, name
FROM
(SELECT
customer_id, name,
ROW_NUMBER() OVER (PARTITION BY customer_id
ORDER BY customer_id ASC) AS row_num
FROM
customer_data.customer_table) subquery
WHERE
subquery.row_num > 1
Result : 4 rows
Desired result: 24 rows
I tried changing the WHERE clause to subquery.row_num > 1 just to see what would change and whether the desired values would appear in my list of unique customer_id and name. Of the 4 rows produced by the query, only one row has the duplicate customer_id with the different name that I want, namely the latest duplicate customer_id recorded in the customer_table. Referring back to the example where
SELECT DISTINCT customer_id, name
FROM customer_data.customer_table
revealed:
customer_id | name
------------+---------------
1890        | Henry Fiction
1890        | Arthur Stories
One of the duplicate customer_id records, 1890, was recorded first in the table and the other was recorded later. The alternate request is to return a list of unique customer_id and name where, if the query encounters a duplicate customer_id, it selects the latest record in the customer_table.
If you don't have a timestamp recording when a row was added, I am afraid you won't be able to identify the latest record. Based on this post, BigQuery does not add such a timestamp automatically. Is your table partitioned? If so, you might be able to identify the latest record using the partitions.
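If such a timestamp column existed (created_at here is only a hypothetical column name), a sketch of the "latest instance" variant would simply order the window by it in descending order:

SELECT customer_id, name
FROM
  (SELECT
     customer_id, name,
     ROW_NUMBER() OVER (PARTITION BY customer_id
                        ORDER BY created_at DESC) AS row_num
   FROM
     customer_data.customer_table) subquery
WHERE
  subquery.row_num = 1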

Postgres - How to find id's that are not used in different multiple tables (inactive id's) - badly written query

I have a table towns, which is the main table. This table contains so many rows and has become so 'dirty' (someone inserted 5 million rows) that I would like to get rid of unused towns.
There are 3 referencing tables that use town_id as a reference to towns.
I know there are many towns that are not used in these tables; only if a town_id is not found in any of these 3 tables do I consider it inactive, and I would like to remove that town (because it's not used).
As you can see, towns is used in these 2 different tables:
employees
offices
and for the table vendors there is a vendor_id column in towns, since one vendor can have multiple towns.
So if vendor_id in towns is null and the town's id is not found in any of these 2 tables, it is safe to remove it :)
I created a query which might work, but it takes far too much time to execute. It looks something like this:
select count(*)
from towns
where vendor_id is null
and id not in (select town_id from banks)
and id not in (select town_id from employees)
So basically I said: if vendor_id is null, this town is definitely not related to vendors, and if at the same time the town is not in banks or employees, then it is safe to remove it. But the query took too long and never executed successfully, since towns has 5 million rows, which is the reason it is so dirty.
In fact I'm not able to execute the given query at all, since the server terminated abnormally.
Here is full error message:
ERROR: server closed the connection unexpectedly This probably means
the server terminated abnormally before or while processing the
request.
Any kind of help would be awesome
Thanks!
You can join the tables using LEFT JOIN and then, in the WHERE clause, identify the town_id values for which there is no row in the tables banks and employees:
WITH list AS
( SELECT t.town_id
FROM towns AS t
LEFT JOIN tbl.banks AS b ON b.town_id = t.town_id
LEFT JOIN tbl.employees AS e ON e.town_id = t.town_id
WHERE t.vendor_id IS NULL
AND b.town_id IS NULL
AND e.town_id IS NULL
LIMIT 1000
)
DELETE FROM tbl.towns AS t
USING list AS l
WHERE t.town_id = l.town_id ;
Before launching the DELETE, you can check the indexes on your tables.
Adding an index as follows can be useful:
CREATE INDEX town_id_nulls ON towns (town_id NULLS FIRST) ;
Last but not least, you can add a LIMIT clause in the CTE to limit the number of rows you delete each time you execute the DELETE and avoid the unexpected termination. As a consequence, you will have to relaunch the DELETE several times until there are no more rows to delete.
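A sketch of how those repeated batches could be wrapped in a single DO block (schema prefixes omitted; it assumes the same table and column names as the query above and keeps deleting until a batch removes nothing):

DO $$
DECLARE
   affected bigint;
BEGIN
   LOOP
      WITH list AS (
         SELECT t.town_id
         FROM towns AS t
         LEFT JOIN banks AS b ON b.town_id = t.town_id
         LEFT JOIN employees AS e ON e.town_id = t.town_id
         WHERE t.vendor_id IS NULL
           AND b.town_id IS NULL
           AND e.town_id IS NULL
         LIMIT 1000
      )
      DELETE FROM towns AS t
      USING list AS l
      WHERE t.town_id = l.town_id;

      GET DIAGNOSTICS affected = ROW_COUNT;  -- rows removed in this batch
      EXIT WHEN affected = 0;
   END LOOP;
END
$$;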
You can try a JOIN on big tables; it would be faster than two IN clauses.
You could also try UNION ALL and live with the duplicates, as it is faster than UNION.
Finally, you can use a combined index on id and vendor_id to speed up the query.
CREATE TABLE towns (id int, vendor_id int);
CREATE TABLE banks (town_id int);
CREATE TABLE employees (town_id int);

select count(*)
from towns t1
JOIN (select town_id from banks
      UNION
      select town_id from employees) t2 ON t1.id <> t2.town_id
where vendor_id is null;

count
-----
0

fiddle
The trick is to first make a list of all the town_id's you want to keep and then start removing those that are not there.
By looking in 2 tables you're making life harder for the server so let's just create 1 single list first.
-- build empty temp-table
CREATE TEMPORARY TABLE TEMP_must_keep
AS
SELECT town_id
FROM tbl.towns
WHERE 1 = 2;
-- get id's from first table
INSERT INTO TEMP_must_keep (town_id)
SELECT DISTINCT town_id
FROM tbl.banks;
-- add index to speed up the EXCEPT below
CREATE UNIQUE INDEX idx_uq_must_keep_town_id ON TEMP_must_keep (town_id);
-- add new ones from second table
INSERT INTO TEMP_must_keep (town_id)
SELECT town_id
FROM tbl.employees
EXCEPT -- auto-distincts
SELECT town_id
FROM TEMP_must_keep;
-- rebuild index simply to ensure little fragmentation
REINDEX TABLE TEMP_must_keep;
-- optional, but might help: create a temporary index on the towns table to speed up the delete
CREATE INDEX idx_towns_town_id_where_vendor_null ON tbl.towns (town_id) WHERE vendor_id IS NULL;
-- Now do actual delete
-- You can do a `SELECT COUNT(*)` rather than a `DELETE` first if you feel like it, both will probably take some time depending on your hardware.
DELETE
FROM tbl.towns as del
WHERE vendor_id is null
AND NOT EXISTS ( SELECT *
FROM TEMP_must_keep mk
WHERE mk.town_id = del.town_id);
-- cleanup
DROP INDEX tbl.idx_towns_town_id_where_vendor_null;
DROP TABLE TEMP_must_keep;
The idx_towns_town_id_where_vendor_null index is optional and I'm not sure if it will actually lower the total time, but IMHO it will help with the DELETE operation, if only because the index should give the Query Optimizer a better view of what volumes to expect.

How to get one row from duplicated rows?

I have 2 tables SVC_ServiceTicket and SVC_CustomersVehicle
The table ServiceTicket has a column customerID which is a foreign key to CustomersVehicle. So in ServiceTicket the column customerID can have duplicate values.
When I do
select sst.ServiceTicketID,sst.CustomerID
from ServiceTicket sst,CustomersVehicle scv
where sst.CustomerID=scv.CV_ID
then it gives me duplicate customerID values. So my requirement is: if there are duplicate values of customerID, I want the latest customerID as well as the service ticket corresponding to that latest customerID.
For example, in the screenshot below customerID 13 is repeating, so in this case I want the latest customerID as well as its service ticket, i.e. the values 8008 and 13.
Please tell me how to do this.
Use the aggregate function MAX. Also, I would recommend using an explicit JOIN.
SELECT MAX(sst.ServiceTicketID) AS ServiceTicketID, sst.CustomerID
FROM ServiceTicket sst
JOIN CustomersVehicle scv ON sst.CustomerID = scv.CV_ID
GROUP BY sst.CustomerID
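If you need the whole latest row per customer rather than just the maximum ticket id, a window-function variant works too (a sketch using the same assumed table and column names as above):

SELECT ServiceTicketID, CustomerID
FROM (
    SELECT sst.ServiceTicketID, sst.CustomerID,
           ROW_NUMBER() OVER (PARTITION BY sst.CustomerID
                              ORDER BY sst.ServiceTicketID DESC) AS rn
    FROM ServiceTicket sst
    JOIN CustomersVehicle scv ON sst.CustomerID = scv.CV_ID
) AS latest
WHERE rn = 1;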

Insert data and set foreign keys with Postgres

I have to migrate a large amount of existing data in a Postgres DB after a schema change.
In the old schema a country attribute would be stored in the users table. Now the country attribute has been moved into a separate address table:
users:
country # OLD
address_id # NEW [1:1 relation]
addresses:
id
country
The schema is actually more complex and the address contains more than just the country. Thus, every user needs to have his own address (1:1 relation).
When migrating the data, I'm having problems setting the foreign keys in the users table after inserting the addresses:
INSERT INTO addresses (country)
SELECT country FROM users WHERE address_id IS NULL
RETURNING id;
How do I propagate the IDs of the inserted rows and set the foreign key references in the users table?
The only solution I could come up with so far is creating a temporary user_id column in the addresses table and then updating the address_id:
UPDATE users SET address_id = a.id FROM addresses AS a
WHERE users.id = a.user_id;
However, this turned out to be extremely slow (despite using indices on both users.id and addresses.user_id).
The users table contains about 3 million rows with 300k missing an associated address.
Is there any other way to insert derived data into one table and setting the foreign key reference to the inserted data in the other (without changing the schema itself)?
I'm using Postgres 8.3.14.
Thanks
I have now solved the problem by migrating the data with a Python/sqlalchemy script. It turned out to be much easier (for me) than trying the same with SQL. Still, I'd be interested if anybody knows a way to process the RETURNING result of an INSERT statement in Postgres SQL.
The table users must have some primary key that you did not disclose. For the purpose of this answer I will name it users_id.
You can solve this rather elegantly with data-modifying CTEs introduced with PostgreSQL 9.1:
country is unique
The whole operation is rather trivial in this case:
WITH i AS (
INSERT INTO addresses (country)
SELECT country
FROM users
WHERE address_id IS NULL
RETURNING id, country
)
UPDATE users u
SET address_id = i.id
FROM i
WHERE i.country = u.country;
You mention version 8.3 in your question. Upgrade! Postgres 8.3 has reached end of life.
Be that as it may, this is simple enough with version 8.3. You just need two statements:
INSERT INTO addresses (country)
SELECT country
FROM users
WHERE address_id IS NULL;
UPDATE users u
SET address_id = a.id
FROM addresses a
WHERE address_id IS NULL
AND a.country = u.country;
country is not unique
That's more challenging. You could just create one address and link to it multiple times. But you did mention a 1:1 relationship that rules out such a convenient solution.
WITH s AS (
SELECT users_id, country
, row_number() OVER (PARTITION BY country) AS rn
FROM users
WHERE address_id IS NULL
)
, i AS (
INSERT INTO addresses (country)
SELECT country
FROM s
RETURNING id, country
)
, r AS (
SELECT *
, row_number() OVER (PARTITION BY country) AS rn
FROM i
)
UPDATE users u
SET address_id = r.id
FROM r
JOIN s USING (country, rn) -- select exactly one id for every user
WHERE u.users_id = s.users_id
AND u.address_id IS NULL;
As there is no way to unambiguously assign exactly one id returned from the INSERT to every user in a set with identical country, I use the window function row_number() to make them unique.
Not as straightforward with Postgres 8.3. One possible way:
INSERT INTO addresses (country)
SELECT DISTINCT country -- pick just one per set of dupes
FROM users
WHERE address_id IS NULL;
UPDATE users u
SET address_id = a.id
FROM addresses a
WHERE a.country = u.country
AND u.address_id IS NULL
AND NOT EXISTS (
SELECT * FROM addresses b
WHERE b.country = a.country
AND b.users_id < a.users_id
); -- effectively picking the smallest users_id per set of dupes
Repeat this until the last NULL value is gone from users.address_id.
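A simple way to check whether another pass is still needed (a sketch):

SELECT count(*) AS users_without_address
FROM users
WHERE address_id IS NULL;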

Safely normalizing data via SQL query

Suppose I have a table of customers:
CREATE TABLE customers (
customer_number INTEGER,
customer_name VARCHAR(...),
customer_address VARCHAR(...)
)
This table does not have a primary key. However, customer_name and customer_address should be unique for any given customer_number.
It is not uncommon for this table to contain many duplicate customers. To get around this duplication, the following query is used to isolate only the unique customers:
SELECT
DISTINCT customer_number, customer_name, customer_address
FROM customers
Fortunately, the table has traditionally contained accurate data. That is, there has never been a conflicting customer_name or customer_address for any customer_number. However, suppose conflicting data did make it into the table. I wish to write a query that will fail, rather than returning multiple rows for the customer_number in question.
For example, I tried this query with no success:
SELECT
customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number
Is there a way to write such a query using standard SQL? If not, is there a solution in Oracle-specific SQL?
EDIT: The rationale behind the bizarre query:
Truth be told, this customers table does not actually exist (thank goodness). I created it hoping that it would be clear enough to demonstrate the needs of the query. However, people are (fortunately) catching on that the need for such a query is the least of my worries, based on that example. Therefore, I must now peel away some of the abstraction and hopefully restore my reputation for suggesting such an abomination of a table...
I receive a flat file containing invoices (one per line) from an external system. I read this file, line-by-line, inserting its fields into this table:
CREATE TABLE unprocessed_invoices (
invoice_number INTEGER,
invoice_date DATE,
...
// other invoice columns
...
customer_number INTEGER,
customer_name VARCHAR(...),
customer_address VARCHAR(...)
)
As you can see, the data arriving from the external system is denormalized. That is, the external system includes both the invoice data and its associated customer data on the same line. It is possible that multiple invoices will share the same customer, therefore it is possible to have duplicate customer data.
The system cannot begin processing the invoices until all customers are guaranteed to be registered with the system. Therefore, the system must identify the unique customers and register them as necessary. This is why I wanted the query: because I was working with denormalized data I had no control over.
SELECT
customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number
Hopefully this helps clarify the original intent of the question.
EDIT: Examples of good/bad data
To clarify: customer_name and customer_address only have to be unique for a particular customer_number.
customer_number | customer_name | customer_address
----------------------------------------------------
1 | 'Bob' | '123 Street'
1 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
3 | 'Fred' | '456 Avenue'
3 | 'Fred' | '789 Crescent'
The first two rows are fine because it is the same customer_name and customer_address for customer_number 1.
The middle two rows are fine because it is the same customer_name and customer_address for customer_number 2 (even though another customer_number has the same customer_name and customer_address).
The last two rows are not okay because there are two different customer_addresses for customer_number 3.
The query I am looking for would fail if run against all six of these rows. However, if only the first four rows actually existed, the view should return:
customer_number | customer_name | customer_address
----------------------------------------------------
1 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
I hope this clarifies what I meant by "conflicting customer_name and customer_address". They have to be unique per customer_number.
I appreciate those that are explaining how to properly import data from external systems. In fact, I am already doing most of that already. I purposely hid all the details of what I'm doing so that it would be easier to focus on the question at hand. This query is not meant to be the only form of validation. I just thought it would make a nice finishing touch (a last defense, so to speak). This question was simply designed to investigate just what was possible with SQL. :)
Your approach is flawed. You do not want data that was successfully able to be stored to then throw an error on a select - that is a land mine waiting to happen and means you never know when a select could fail.
What I recommend is that you add a unique key to the table, and slowly start modifying your application to use this key rather than relying on any combination of meaningful data.
You can then stop caring about duplicate data, which is not really duplicate in the first place. It is entirely possible for two people with the same name to share the same address.
You will also gain performance improvements from this approach.
As an aside, I highly recommend you normalize your data, that is break up the name into FirstName and LastName (optionally MiddleName too), and break up the address field into separate fields for each component (Address1, Address2, City, State, Country, Zip, or whatever)
Update: If I understand your situation correctly (which I am not sure I do), you want to prevent duplicate combinations of name and address from ever occurring in the table (even though that is a possible occurrence in real life). This is best done by a unique constraint or index on these two fields to prevent the data from being inserted. That is, catch the error before you insert it. That will tell you the import file or your resulting app logic is bad and you can choose to take the appropriate measures then.
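For instance, a sketch of the suggested constraint (the constraint name is just a placeholder):

ALTER TABLE customers
    ADD CONSTRAINT customers_name_address_uq
    UNIQUE (customer_name, customer_address);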
I still maintain that throwing the error when you query is too late in the game to do anything about it.
A scalar sub-query must only return one row (per result set row...) so you could do something like:
select distinct
customer_number,
(
select distinct
customer_address
from customers c2
where c2.customer_number = c.customer_number
) as customer_address
from customers c
Making the query fail may be tricky...
This will show you if there are any duplicate records in the table:
select customer_number, customer_name, customer_address
from customers
group by customer_number, customer_name, customer_address
having count(*) > 1
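A variant of the same idea, scoped to the requirement that name and address be consistent per customer_number, would list only the conflicting customer_number values (a sketch):

select customer_number
from customers
group by customer_number
having count(distinct customer_name) > 1
    or count(distinct customer_address) > 1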
If you just add a unique index across all three fields, no one can create a duplicate record in the table.
The de facto key is Name+Address, so that's what you need to group by.
SELECT
Customer_Name,
Customer_Address,
CASE WHEN Count(DISTINCT Customer_Number) > 1
THEN 1/0 ELSE 0 END as LandMine
FROM Customers
GROUP BY Customer_Name, Customer_Address
If you want to do it from the point of view of a Customer_Number, then this is good too.
SELECT *,
CASE WHEN Exists((
SELECT top 1 1
FROM Customers c2
WHERE c1.Customer_Number != c2.Customer_Number
AND c1.Customer_Name = c2.Customer_Name
AND c1.Customer_Address = c2.Customer_Address
)) THEN 1/0 ELSE 0 END as LandMine
FROM Customers c1
WHERE Customer_Number = #Number
If you have dirty data, I would clean it up first.
Use this to find the duplicate customer records...
Select * From customers
Where customer_number in
(Select Customer_number from customers
Group by customer_number Having count(*) > 1)
If you want it to fail you're going to need to have an index. If you don't want to have an index, then you can just create a temp table to do this all in.
CREATE TABLE #temp_customers
(customer_number int,
customer_name varchar(50),
customer_address varchar(50),
PRIMARY KEY (customer_number),
UNIQUE (customer_name, customer_address))
INSERT INTO #temp_customers
SELECT DISTINCT customer_number, customer_name, customer_address
FROM customers
SELECT customer_number, customer_name, customer_address
FROM #temp_customers
DROP TABLE #temp_customers
This will fail if there are issues but will keep your duplicate records from causing issues.
Let's put the data into a temp table or table variable with your distinct query
select distinct customer_number, customer_name, customer_address,
IDENTITY(int, 1,1) AS ID_Num
into #temp
from unprocessed_invoices
Personally I would add an identity to unprocessed_invoices if possible as well. I never do an import without creating a staging table that has an identity column, just because it is easier to delete duplicate records.
Now let's query the table to find your problem records. I assume you would want to see what is causing the problem not just fail them.
Select t1.* from #temp t1
join #temp t2
on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address
where t1.customer_number <> t2.customer_number
select t1.* from #temp t1
join
(select customer_number from #temp group by customer_number having count(*) >1) t2
on t1.customer_number = t2.customer_number
You can use a variation on these queries to delete the problem records from #temp (depending on whether you choose to keep one or delete all possible problems) and then insert from #temp into your production table. You can also provide the problem records back to whoever is supplying your data so they can be fixed at their end.