How to write this query to avoid cartesian product?

I want to create a CSV export for orders showing the warehouse_id where each order_item had shipped from, if available.
For brevity, here is the pertinent schema:
create table o (id integer);
orders have many order_items:
create table oi (id integer, o_id integer, sku text, quantity integer);
For each order_item in the CSV we want to show the warehouse_id it shipped from. But that is not stored in order_items; it is stored in the shipment.
An order can be split up into many shipments, potentially from different warehouses.
create table s (id integer, o_id integer, warehouse_id integer);
shipments have many shipment items too:
create table si (id integer, s_id integer, oi_id integer, quantity_shipped integer);
How do I extract the warehouse_id for each order_item, given that warehouse_id is on the shipment and not every order has shipped yet (an order may not have a shipment record or shipment_items)?
We are doing something like this (simplified):
select oi.sku, s.warehouse_id from oi
left join s on s.o_id = oi.o_id;
However, suppose an order has 2 order items, let's call them sku A and B, and that order was split into two shipments, where A was shipped from warehouse '50' and then a second shipment shipped B from '200'.
What we want would be a CSV output like:
sku | warehouse_id
-----|--------------
A | 50
B | 200
But what we get is some kind of cartesian product:
=================================
Here is the sample data:
select * from o;
id
----
1
(1 row)
select * from oi;
id | o_id | sku | quantity
----+------+-----+----------
1 | 1 | A | 1
2 | 1 | B | 1
(2 rows)
select * from s;
id | o_id | warehouse_id
----+------+--------------
1 | 1 | 50
2 | 1 | 200
(2 rows)
select * from si;
id | s_id | oi_id
----+------+------
1 | 1 | 1
2 | 2 | 2
(2 rows)
select oi.sku, s.warehouse_id from oi left join s on s.o_id = oi.o_id;
sku | warehouse_id
-----+--------------
A | 50
A | 200
B | 50
B | 200
(4 rows)
UPDATE ========
Per spencer, I'm adding a different example with different pk ids for more clarity. The following is 2 example orders. Order 2 has items A,B,C. A,B are shipped from shipment 200, C is shipped from shipment 201. Order 3 has 2 items E and A. E is not yet shipped and A is shipped twice out of the same warehouse '700', (like it was on back order).
# select * from o;
id
----
2
3
(2 rows)
# select * from oi;
id | o_id | sku | quantity
-----+------+-----+----------
100 | 2 | A | 1
101 | 2 | B | 1
102 | 2 | C | 1
103 | 3 | E | 1
104 | 3 | A | 2
(5 rows)
# select * from s;
id | o_id | warehouse_id
-----+------+--------------
200 | 2 | 700
201 | 2 | 800
202 | 3 | 700
203 | 3 | 700
(4 rows)
# select * from si;
id | s_id | oi_id
-----+------+-------
300 | 200 | 100
301 | 200 | 101
302 | 201 | 102
303 | 202 | 104
304 | 203 | 104
(5 rows)
I think this works. I use left join to keep the order_items in the report whether or not the order has shipped, and I use group by to squash multiple shipments from the same warehouse. I believe this is what I need.
# select oi.o_id, oi.id, oi.sku, s.warehouse_id from oi left join si on si.oi_id = oi.id left join s on s.id = si.s_id group by oi.o_id, oi.id, oi.sku, s.warehouse_id order by oi.o_id;
o_id | id | sku | warehouse_id
------+-----+-----+--------------
2 | 102 | C | 800
2 | 101 | B | 700
2 | 100 | A | 700
3 | 104 | A | 700
3 | 103 | E |
(5 rows)

Order items that have shipped ...
SELECT oi.id
, oi.sku
, s.warehouse_id
FROM oi
JOIN si ON si.oi_id = oi.id
JOIN s ON s.id = si.s_id
Order items that haven't yet shipped, using anti-join to exclude rows where there is a matching row in si
SELECT oi.id
, oi.sku
, s.warehouse_id
FROM oi
JOIN s ON s.o_id = oi.o_id -- fk to fk shortcut join
-- anti-join
LEFT
JOIN si ON si.oi_id = oi.id
WHERE si.oi_id IS NULL
But this will still produce a (partial) Cartesian product. We can add a GROUP BY clause to collapse the rows...
GROUP BY oi.id
This doesn't avoid producing an intermediate cartesian product; the addition of the GROUP BY clause just collapses the set. And it's indeterminate which of the matching rows from s the returned column values will come from.
The two queries could be combined with a UNION ALL operation. If I did that, I'd likely add a discriminator column (an additional column in each query with different values, which would tell which query returned a row.)
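As a rough sketch of that combination, the two queries can be run against the question's second sample data set using Python's built-in sqlite3 module. The discriminator column name src and its values are made up for illustration, and MIN(s.warehouse_id) is an assumption added only to make the collapsed row deterministic; it is not part of the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table oi (id integer, o_id integer, sku text, quantity integer);
create table s  (id integer, o_id integer, warehouse_id integer);
create table si (id integer, s_id integer, oi_id integer);
insert into oi values (100,2,'A',1),(101,2,'B',1),(102,2,'C',1),(103,3,'E',1),(104,3,'A',2);
insert into s  values (200,2,700),(201,2,800),(202,3,700),(203,3,700);
insert into si values (300,200,100),(301,200,101),(302,201,102),(303,202,104),(304,203,104);
""")

rows = conn.execute("""
-- order items that have shipped
select oi.id as id, oi.sku as sku, s.warehouse_id as warehouse_id, 'shipped' as src
from oi
join si on si.oi_id = oi.id
join s  on s.id = si.s_id
union all
-- order items with no shipment item (anti-join), collapsed to one row each
select oi.id, oi.sku, min(s.warehouse_id), 'unshipped'
from oi
join s on s.o_id = oi.o_id
left join si on si.oi_id = oi.id
where si.oi_id is null
group by oi.id, oi.sku
order by 1, 4
""").fetchall()

for row in rows:
    print(row)
```

Item 104 appears twice because it shipped in two shipments, and item 103 is reported with a warehouse it has not actually shipped from, which is the kind of guess the following paragraphs warn about.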
This set might meet the specification outlined in the OP question. But I don't think this is really the set that needs to be returned. Figuring out which warehouse an item should ship from may involve several factors... total quantity ordered, quantity available in each warehouse, can order be fulfilled from one warehouse, which warehouse is closer to delivery destination, etc.
I don't want to leave anyone with the impression that this query is really a "fix" for the cartesian product problem... this query just hides a bigger problem.

I think you need the si table:
select oi.sku, s.warehouse_id
from si join
oi
on si.oi_id = oi.id join
s
on s.id = si.s_id;
si seems to be the proper junction table between the tables. I'm not sure why there is another join key that doesn't use it.
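A minimal way to check the junction-table route is to load the question's second sample data set into an in-memory SQLite database with Python's sqlite3 module. This is only a sketch: it uses left joins and a group by, as in the question's update, so that unshipped items stay in the output, and it leaves out the quantity_shipped column because the sample data does not show it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table oi (id integer, o_id integer, sku text, quantity integer);
create table s  (id integer, o_id integer, warehouse_id integer);
create table si (id integer, s_id integer, oi_id integer);
insert into oi values (100,2,'A',1),(101,2,'B',1),(102,2,'C',1),(103,3,'E',1),(104,3,'A',2);
insert into s  values (200,2,700),(201,2,800),(202,3,700),(203,3,700);
insert into si values (300,200,100),(301,200,101),(302,201,102),(303,202,104),(304,203,104);
""")

# Go from order item to shipment only through the shipment item,
# never directly on o_id, so each item sees only its own shipments.
rows = conn.execute("""
select oi.o_id, oi.id, oi.sku, s.warehouse_id
from oi
left join si on si.oi_id = oi.id
left join s  on s.id = si.s_id
group by oi.o_id, oi.id, oi.sku, s.warehouse_id
order by oi.o_id, oi.id
""").fetchall()

for row in rows:
    print(row)
```

Item 104 shipped twice from warehouse 700 and is collapsed to one row; item 103 has not shipped and comes back with a NULL warehouse.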

Related

Join query for multiple tables with condition

I'm using MSsql and I'm having a difficult time trying to get the results from a SELECT query. I have 3 tables.
First table Product
second table Seller
third table Customer
(data about customers - buyers and sellers).
select * from Product;
id(PK) | name_product
----------------------
1 | apple
2 | orange
3 | juice
select * from Seller;
id_seller(PK) | id_product | product_placement_date
---------------------------------------------------
45 | 3 | 2020-01-09
46 | 3 | 2020-01-05
58 | 2 | 2020-02-08
49 | 2 | 2020-01-04
43 | 1 | 2020-01-06
select * from Customer;
id_customer(PK) | name_customer
---------------------------
43 | Alice
45 | Sam
46 | Katy
49 | Soul
58 | Fab
I'm looking to select the name of each product and the first seller that placed that product (by earliest placement date).
I've tried with this :
SELECT C.name_product,
P.mindate,
P.name_customer
FROM Product AS C
CROSS APPLY(SELECT MIN(S.product_placement_date) as mindate,
T.name_customer
FROM Seller AS S
JOIN Customer AS T ON T.id_customer = S.id_seller
WHERE S.id_product = C.id) AS P
But I am not getting correct result. I want results as shown below:
name_product | product_placement_date | name_customer
-----------------------------------------------------
apple | 2020-01-06 | Alice
orange | 2020-01-04 | Soul
juice | 2020-01-05 | Katy
Please advise
Looks like you may have an issue with the Seller table. It APPEARS that the seller ID is the foreign key to the Customer table. As the primary key, that would mean a seller could never sell any other item on any other date, unless the primary key for the table were the seller ID, the item sold and the date, all 3 columns together. I would expect the "Seller" table to really be a "SellING" table, more along the lines of
SellingID (PK) | id_seller | id_product | product_placement_date
---------------------------------------------------
1 | 45 | 3 | 2020-01-09
2 | 46 | 3 | 2020-01-05
3 | 58 | 2 | 2020-02-08
4 | 49 | 2 | 2020-01-04
5 | 43 | 1 | 2020-01-06
The next consideration is what happens if two or more people were selling oranges and listed on the same day. In your original, you have no context of who listed theirs first. Or would you want ALL people who listed their product on the earliest date, in which case you could have both names shown? By having this "selling" table with a "sellingid" auto-increment column, you would be able to KNOW who was first based on the earliest SELLINGID for a given product, because somebody would have to commit their record first, even on the same day. Then you might end up with something like
select
p.name_product,
S2.product_placement_date,
c.name_customer
from
( select id_product,
min( sellingid ) as FirstListedID
from
selling
group by
id_product ) First
join selling S2
on First.FirstListedID = s2.sellingID
join customer c
on S2.id_seller = c.id_customer
join product p
on S2.id_product = p.id
Here, the pre-query of selling activity under the alias "First" produces one row per product holding the first selling ID for that product. It deliberately ignores the date and relies on the auto-increment instead, to cover the case of multiple people offering on the same date, as explained above.
Once that is done, re-join to the original selling table on that first "ID". Then you can join out to the product and customer for the final details.
SELECT P.name_product,
S.product_placement_date,
S.name_customer
FROM Product AS P
CROSS APPLY(SELECT TOP 1 S.product_placement_date,
C.name_customer
FROM Seller AS S
INNER JOIN Customer AS C ON C.id_customer = S.id_seller
WHERE S.id_product = P.id
ORDER BY S.product_placement_date) AS S
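Neither CROSS APPLY nor TOP exists outside SQL Server, so the following is only a sketch of the same "earliest row per product" idea in a different form: a derived table with MIN(product_placement_date) per product, joined back to Seller. It runs on SQLite through Python's sqlite3 module and uses the original Seller table from the question; the alias f and the column name first_date are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Product  (id integer, name_product text);
create table Seller   (id_seller integer, id_product integer, product_placement_date text);
create table Customer (id_customer integer, name_customer text);
insert into Product  values (1,'apple'),(2,'orange'),(3,'juice');
insert into Seller   values (45,3,'2020-01-09'),(46,3,'2020-01-05'),
                            (58,2,'2020-02-08'),(49,2,'2020-01-04'),(43,1,'2020-01-06');
insert into Customer values (43,'Alice'),(45,'Sam'),(46,'Katy'),(49,'Soul'),(58,'Fab');
""")

rows = conn.execute("""
select p.name_product, s.product_placement_date, c.name_customer
from (select id_product, min(product_placement_date) as first_date
      from Seller
      group by id_product) f
join Seller   s on s.id_product = f.id_product
               and s.product_placement_date = f.first_date
join Customer c on c.id_customer = s.id_seller
join Product  p on p.id = s.id_product
order by p.id
""").fetchall()

for row in rows:
    print(row)
```

Unlike TOP 1, this form returns every seller tied on the earliest date, so a tie-breaker would still be needed if exactly one row per product is required.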

Join tables - show every row of the left table only once and add a row with data that is not connected to the table

I have tables that look like this:
products:
order_id prices
_______ _____
2 20
3 11
null 40
Orders:
id number
1 30
2 50
3 10
4 10
I want to get the following table:
id number price
-- ------ -----
1 30 null
2 50 20
3 10 11
4 10 null
null(0) null(0) 40
The foreign key is obviously order_id -> orders.id, and it can be null.
As you can probably see, I want to include all the rows from table orders and, where there is a link to products, combine them.
If there is no link, just show null and the '40' (the sum of 'disconnected' products).
Can anyone help me please?
I think you want a full join:
select o.id, o.number, p.prices
from orders o full join
products p
on p.order_id = o.id;
You need a left join from orders to products and union with the products that have null as order_id:
select o.id, o.number, p.prices
from orders o left join products p
on p.order_id = o.id
union all
select null, null, p.prices
from products p
where p.order_id is null
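The LEFT JOIN plus UNION ALL form can be tried on the sample data with Python's sqlite3 module. This sketch makes two additions that are assumptions rather than part of the query above: the second branch uses SUM so that several disconnected products would collapse into one row, and the whole thing is wrapped in a subquery so the NULL row can be sorted last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table orders   (id integer, number integer);
create table products (order_id integer, prices integer);
insert into orders   values (1,30),(2,50),(3,10),(4,10);
insert into products values (2,20),(3,11),(null,40);
""")

rows = conn.execute("""
select id, number, prices
from (
    -- every order, with its product price when one is linked
    select o.id as id, o.number as number, p.prices as prices
    from orders o
    left join products p on p.order_id = o.id
    union all
    -- one extra row holding the total of products linked to no order
    select null, null, sum(p.prices)
    from products p
    where p.order_id is null
)
order by id is null, id
""").fetchall()

for row in rows:
    print(row)
```

If no product has a NULL order_id, the second branch still returns a single all-NULL row, so a real query may want to filter that case out.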
Retrieve the orders
SELECT orders.id, orders.number
FROM orders;
id | number
----+--------
1 | 30
2 | 50
3 | 10
4 | 10
Retrieve the prices associated to orders
SELECT orders.id, orders.number, products.prices
FROM orders
LEFT JOIN products ON orders.id = products.order_id;
id | number | prices
----+--------+--------
1 | 30 |
2 | 50 | 20
3 | 10 | 11
4 | 10 |
Retrieve the prices associated to orders as well as the products without order associated
SELECT orders.id, orders.number, products.prices
FROM orders
FULL JOIN products ON orders.id = products.order_id;
id | number | prices
----+--------+--------
1 | 30 |
2 | 50 | 20
3 | 10 | 11
| | 40
4 | 10 |
Sum the prices with no order associated. We see no difference here since there is only one product with no order associated (the one where order_id is null), but you asked for the sum of these prices so here you go :-)
SELECT orders.id, orders.number, SUM(products.prices) AS prices
FROM orders
FULL JOIN products ON orders.id = products.order_id
GROUP BY orders.id, orders.number;
id | number | prices
----+--------+--------
1 | 30 |
2 | 50 | 20
3 | 10 | 11
| | 40
4 | 10 |
Use your null(0) label and order by id
SELECT
coalesce(orders.id::varchar(255), 'null(0)') AS id,
coalesce(orders.number::varchar(255), 'null(0)') AS number,
SUM(products.prices) AS prices
FROM orders
FULL JOIN products ON orders.id = products.order_id
GROUP BY orders.id, orders.number
ORDER BY id;
id | number | prices
---------+---------+--------
1 | 30 |
2 | 50 | 20
3 | 10 | 11
4 | 10 |
null(0) | null(0) | 40

Join max records in PostgreSQL

I have two tables:
products
+----+--------+
| id | name |
+----+--------+
| 1 | Orange |
| 2 | Juice |
| 3 | Fance |
+----+--------+
reviews
+----+------------+-------+------------+
| id | created_at | price | product_id |
+----+------------+-------+------------+
| 1 | 12/12/20 | 2 | 1 |
| 2 | 12/14/20 | 4 | 1 |
| 3 | 12/15/20 | 5 | 2 |
+----+------------+-------+------------+
How can I get list of products ordered by price of most recent (max created_at) review?
+------------+--------+-----------+-------+
| product_id | name | review_id | price |
+------------+--------+-----------+-------+
| 2 | Juice | 3 | 5 |
| 1 | Orange | 2 | 4 |
| 3 | Fance | | |
+------------+--------+-----------+-------+
I use latest PostgreSQL.
Using DISTINCT ON
SELECT
*
FROM (
SELECT DISTINCT ON (p.id)
p.id,
p.name,
r.id as review_id,
r.price
FROM
reviews r
RIGHT JOIN products p ON r.product_id = p.id
ORDER BY p.id, r.created_at DESC NULLS LAST
) s
ORDER BY price DESC NULLS LAST
1. Join both tables (products LEFT JOIN reviews or reviews RIGHT JOIN products).
2. Now you have to order the rows. First you want the rows of each product grouped together. Then you want the most recent entry per product first (date in descending order).
3. DISTINCT ON always keeps the first row of each ordered group. So you get the most recent entry per product.
4. To sort your product rows, put steps 1-3 into a subquery and order by price afterwards.
DISTINCT ON and an outer join is a good approach, but I would handle this as:
SELECT . . . -- whatever columns you want
FROM products p LEFT JOIN
(SELECT DISTINCT ON (r.product_id) r.*
FROM reviews r
ORDER BY r.product_id, r.created_at DESC NULLS LAST
) r
ON r.product_id = p.id
ORDER BY r.price DESC NULLS LAST;
The difference in doing DISTINCT ON before the JOIN or after may look minor. But this version of the query can take advantage of an index on reviews(product_id, created_at desc). And that could be a big performance win on a lot of data.
Indexes cannot be used for an ORDER BY that mixes columns from different tables.
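DISTINCT ON is PostgreSQL-specific, so for a quick check outside PostgreSQL the "latest review per product" step can be swapped for a correlated MAX(created_at) subquery. The sketch below does that on SQLite through Python's sqlite3 module. It is a different technique from the answers above, and it assumes the dates are stored in ISO format (the question shows them as 12/12/20) so that they compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table products (id integer, name text);
create table reviews  (id integer, created_at text, price integer, product_id integer);
insert into products values (1,'Orange'),(2,'Juice'),(3,'Fance');
insert into reviews  values (1,'2020-12-12',2,1),(2,'2020-12-14',4,1),(3,'2020-12-15',5,2);
""")

rows = conn.execute("""
select p.id, p.name, r.id as review_id, r.price
from products p
left join reviews r
       on r.product_id = p.id
      and r.created_at = (select max(r2.created_at)
                          from reviews r2
                          where r2.product_id = p.id)
-- emulate DESC NULLS LAST: rows without a review sort after the rest
order by r.price is null, r.price desc
""").fetchall()

for row in rows:
    print(row)
```

Two reviews of one product with the same created_at would both be returned here, whereas DISTINCT ON always keeps exactly one row per product.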

Using sql to sum with multiple table calls

I'll get down to the point. So basically I have 3 tables structured as follows:
orders:
i_id | o_id | quantity
-----+--------+----------
1 | 1 | 5
2 | 2 | 2
1 | 3 | 3
1 | 4 | 3
2 | 5 | 4
orderinfos:
o_id | c_id
------+------------
1 | 1
2 | 2
3 | 2
4 | 1
5 | 2
customers:
c_id | name_id
----------+----------
1 | 100001
2 | 100002
then the resulting chart would be:
name_id | i_id | quantity
-----------+----------+----------
100001 | 1 | 8
100002 | 2 | 6
100002 | 1 | 3
So basically, you have a summary of something (in this case, orders) with their quantity, and then where each order has the customer id and the item name associated. Then the resulting chart would be something that gives the quantity per customer, per item, in descending order by the customer. My first implementation was this:
select quantCust.custIdName, quantCust.itemId, quantCust.quant
from
(select O.i_id as itemId,
C.name_id as custIdName,
sum(O.quantity) as quant
from orders as O, orderinfos as I, customers as C
where O.o_id = I.o_id and I.c_id = C.c_id
group by O.i_id, I.c_id) as quantCust
order by quantCust.custId, quantCust.quant desc;
which does not print the correct values.
I think you're close with your approach, but I recommend using explicit JOIN syntax, and using the aggregate SUM (along with GROUP BY) to get your totals:
SELECT c.name_id, o.i_id, SUM(o.quantity) AS quant
FROM customers c
INNER JOIN orderinfos oi
ON c.c_id = oi.c_id
INNER JOIN orders o
ON oi.o_id = o.o_id
GROUP BY c.name_id, o.i_id
ORDER BY c.name_id, quant DESC
This works for me with your sample data, giving the desired output that you indicate.
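To reproduce that check, the join can be run against the question's sample data in an in-memory SQLite database via Python's sqlite3 module. This sketch uses the table and column names exactly as the question defines them (orderinfos, quantity).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table orders     (i_id integer, o_id integer, quantity integer);
create table orderinfos (o_id integer, c_id integer);
create table customers  (c_id integer, name_id integer);
insert into orders     values (1,1,5),(2,2,2),(1,3,3),(1,4,3),(2,5,4);
insert into orderinfos values (1,1),(2,2),(3,2),(4,1),(5,2);
insert into customers  values (1,100001),(2,100002);
""")

# Total quantity per customer and item, largest totals first within each customer.
rows = conn.execute("""
select c.name_id, o.i_id, sum(o.quantity) as quant
from customers c
inner join orderinfos oi on c.c_id = oi.c_id
inner join orders o      on oi.o_id = o.o_id
group by c.name_id, o.i_id
order by c.name_id, quant desc
""").fetchall()

for row in rows:
    print(row)
```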

Select DISTINCT returning too many records

I have two tables: Products and Items. I want to select distinct items that belong to a product based on the condition column, sorted by price ASC.
+-------------------+
| id | name |
+-------------------+
| 1 | Mickey Mouse |
+-------------------+
+-------------------------------------+
| id | product_id | condition | price |
+-------------------------------------+
| 1 | 1 | New | 90 |
| 2 | 1 | New | 80 |
| 3 | 1 | Excellent | 60 |
| 4 | 1 | Excellent | 50 |
| 5 | 1 | Used | 30 |
| 6 | 1 | Used | 20 |
+-------------------------------------+
Desired output:
+----------------------------------------+
| id | name | condition | price |
+----------------------------------------+
| 2 | Mickey Mouse | New | 80 |
| 4 | Mickey Mouse | Excellent | 50 |
| 6 | Mickey Mouse | Used | 20 |
+----------------------------------------+
Here's the query. It returns six records instead of the desired three:
SELECT DISTINCT(items.condition), items.price, products.name
FROM products
INNER JOIN items ON products.id = items.product_id
WHERE products.id = 1
ORDER BY items."price" ASC, products.name;
Correct PostgreSQL query:
SELECT DISTINCT ON (items.condition) items.id, items.condition, items.price, products.name
FROM products
INNER JOIN items ON products.id = items.product_id
WHERE products.id = 1
ORDER BY items.condition, items.price, products.name;
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of
each set of rows where the given expressions evaluate to equal.
There is no distinct() function in SQL. Your query is being parsed as
SELECT DISTINCT (items.condition), ...
which is equivalent to
SELECT DISTINCT items.condition, ...
DISTINCT applies to the whole row - if two or more rows all have the same field values, THEN the "duplicate" row is dropped from the result set.
You probably want something more like
SELECT items.condition, MIN(items.price), products.name
FROM ...
...
GROUP BY products.id, products.name, items.condition
I want to select distinct items that belong to a product based on the
condition column, sorted by price ASC.
You most probably want DISTINCT ON:
SELECT *
FROM (
SELECT DISTINCT ON (i.condition)
i.id AS item_id, p.name, i.condition, i.price
FROM products p
JOIN items i ON i.product_id = p.id
WHERE p.id = 1
ORDER BY i.condition, i.price ASC
) sub
ORDER BY item_id;
Since the leading columns of ORDER BY have to match the columns used in DISTINCT ON , you need a subquery to get the sort order you display.
Better yet:
SELECT i.item_id, p.name, i.condition, i.price
FROM (
SELECT DISTINCT ON (condition)
id AS item_id, product_id, condition, price
FROM items
WHERE product_id = 1
ORDER BY condition, price
) i
JOIN products p ON p.id = i.product_id
ORDER BY item_id;
Should be a bit faster.
Aside: You shouldn't be using the non-descriptive name id as identifier. Use item_id and product_id instead.
More details, links and a benchmark test in this related answer:
Select first row in each GROUP BY group?
Use a SELECT GROUP BY, extracting only the MIN(price) for every PRODUCT/CONDITION.
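A minimal sketch of that GROUP BY approach, run on the question's sample data with Python's sqlite3 module. The alias min_price is made up for illustration. Note that this form returns the cheapest price per condition but not the id of the matching item, which the DISTINCT ON versions do return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table products (id integer, name text);
create table items    (id integer, product_id integer, condition text, price integer);
insert into products values (1,'Mickey Mouse');
insert into items    values (1,1,'New',90),(2,1,'New',80),
                            (3,1,'Excellent',60),(4,1,'Excellent',50),
                            (5,1,'Used',30),(6,1,'Used',20);
""")

# One row per product/condition pair, keeping only the lowest price.
rows = conn.execute("""
select p.name, i.condition, min(i.price) as min_price
from products p
join items i on i.product_id = p.id
where p.id = 1
group by p.id, p.name, i.condition
order by min_price
""").fetchall()

for row in rows:
    print(row)
```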