Improve efficiency of PostgreSQL Query - One to Many, Count is 1 - sql

I would like to improve the efficiency of the following query, if possible:
SELECT * FROM orders o
INNER JOIN order_items oi
ON o.id = oi.order_id
WHERE o.fulfilled = false
AND o.id NOT IN (SELECT order_id
FROM order_items
WHERE sku = '011111'
GROUP BY order_id
HAVING COUNT(order_id) = 1)
There is a one to many relationship between the orders and order_items tables (o.id = oi.order_id).
The goal is to select all of the information from two tables, with the following conditions:
The order has not been fulfilled (orders.fulfilled = false).
Exclude all of the orders that have exactly one order item with an SKU of '011111' (oi.sku like '011111').
Any help is appreciated!

IN can be slow, so I modified the query to use an inner join instead:
select * from orders o
inner join order_items oi
on o.id = oi.order_id
and o.fulfilled = false
inner join( select order_id
from order_items
where sku != '011111'
group by order_id
having count(order_id) = 1) T
on T.order_id = oi.order_id

count(whatever) will usually force a full table scan (the planner has no idea how many order_items rows there are per order, and you cannot create an index on an aggregate), unless there is another clause that can use an index. Most likely a sku not equaling something will not be selective enough (I'm guessing you have a lot of SKUs). Look at the EXPLAIN output; you will probably see a full table scan in the IN part of your query.
If that's the case, you have the option of caching the count data and indexing it, using a trigger function that updates a current_count column every time an order is placed or fulfilled. Or you could cache a query that keeps track of the count (say, if the information does not need to be refreshed very often).
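A minimal sketch of the cached-count idea, run through Python's sqlite3 module as a stand-in for PostgreSQL (trigger syntax differs there). The column name sku_011111_count, the trigger names and the sample rows are assumptions made for illustration; only inserts and deletes are handled, not updates.

```python
import sqlite3

# SQLite is used only so the sketch is runnable; names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    fulfilled INTEGER NOT NULL DEFAULT 0,
    sku_011111_count INTEGER NOT NULL DEFAULT 0   -- cached count column
);
CREATE TABLE order_items (order_id INTEGER NOT NULL, sku TEXT NOT NULL);

-- Keep the cached count in step with inserts and deletes of the tracked SKU.
CREATE TRIGGER order_items_count_ins AFTER INSERT ON order_items
WHEN NEW.sku = '011111'
BEGIN
    UPDATE orders SET sku_011111_count = sku_011111_count + 1
    WHERE id = NEW.order_id;
END;

CREATE TRIGGER order_items_count_del AFTER DELETE ON order_items
WHEN OLD.sku = '011111'
BEGIN
    UPDATE orders SET sku_011111_count = sku_011111_count - 1
    WHERE id = OLD.order_id;
END;

-- The cached column can be indexed, unlike the aggregate itself.
CREATE INDEX orders_cached_count_idx ON orders (fulfilled, sku_011111_count);

INSERT INTO orders (id) VALUES (1), (2), (3);
INSERT INTO order_items VALUES
    (1, '011111'), (2, '011111'), (2, '011111'), (3, '022222');
""")


def unfulfilled_without_single_match(connection):
    """Ids of unfulfilled orders whose cached count is not exactly 1."""
    rows = connection.execute(
        "SELECT id FROM orders "
        "WHERE fulfilled = 0 AND sku_011111_count <> 1 ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


before = unfulfilled_without_single_match(conn)

# Remove one of order 2's two '011111' items; the trigger lowers its count to 1.
conn.execute(
    "DELETE FROM order_items WHERE rowid = "
    "(SELECT MIN(rowid) FROM order_items WHERE order_id = 2)"
)
after = unfulfilled_without_single_match(conn)

print(before, after)
```

The filter then reads a plain indexed column instead of aggregating order_items on every query, at the cost of extra work on each write.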

Can we assume that an order can't have more than one item with the same sku on the same order?
Can we assume that you can't have an order with no items?
If so, writing the opposite might be faster. The query below finds all orders that have any sku other than '011111'. Also, correlated subqueries are usually faster than non-correlated subqueries (although optimizers are smart enough to rewrite this a lot of the time). Exists clauses are usually faster than an in clause since the engine can exit before looking through all of the subquery rows.
SELECT *
FROM orders o
INNER JOIN order_items oi
ON o.id = oi.order_id
WHERE o.fulfilled = false
AND EXISTS (SELECT 'x'
FROM order_items oi2
WHERE o.id = oi2.order_id
AND sku != '011111')
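Whether the EXISTS rewrite returns the same rows as the original NOT IN query depends on how "exactly one order item with an SKU of '011111'" is read. The sketch below is an illustration only: it uses Python's sqlite3 module as a stand-in for PostgreSQL, and the sample rows are made up. It runs the original predicate, the EXISTS predicate, and a correlated count that mirrors the original.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, fulfilled INTEGER NOT NULL);
CREATE TABLE order_items (order_id INTEGER NOT NULL, sku TEXT NOT NULL);
INSERT INTO orders VALUES (1, 0), (2, 0), (3, 0), (4, 1);
INSERT INTO order_items VALUES
    (1, '011111'),                  -- order 1: only the excluded SKU
    (2, '011111'), (2, '022222'),   -- order 2: excluded SKU plus another item
    (3, '022222'),                  -- order 3: no excluded SKU
    (4, '033333');                  -- order 4: already fulfilled
""")

BASE = """
SELECT o.id, oi.sku
FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id
WHERE o.fulfilled = 0
  AND {predicate}
ORDER BY o.id, oi.sku
"""


def run(predicate):
    """Run the shared query with one of the candidate predicates."""
    return conn.execute(BASE.format(predicate=predicate)).fetchall()


# Predicate from the question: drop orders with exactly one '011111' row.
original = run(
    "o.id NOT IN (SELECT order_id FROM order_items "
    "WHERE sku = '011111' GROUP BY order_id HAVING COUNT(order_id) = 1)"
)

# Predicate from the answer above: keep orders that have any other SKU.
exists_other_sku = run(
    "EXISTS (SELECT 1 FROM order_items oi2 "
    "WHERE oi2.order_id = o.id AND oi2.sku <> '011111')"
)

# Correlated count that mirrors the original predicate.
correlated_count = run(
    "(SELECT COUNT(*) FROM order_items oi2 "
    "WHERE oi2.order_id = o.id AND oi2.sku = '011111') <> 1"
)

print(original)
print(exists_other_sku)
print(correlated_count)
```

On this data the EXISTS version keeps order 2 (it has another SKU besides '011111'), while the original query drops it. The two only agree if the intent is to exclude orders whose sole item is '011111'.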

Related

SQL how to count the number of relations between two tables and include zeroes?

I have a table of orders, and a table of products contained in these orders. (The products-table has order_id, a foreign key referring to orders.id).
I would like to query the number of products contained in each order. However, I also want orders to be contained in the results if they do not contain any products at all.
This means that a simple
SELECT *, COUNT(*) n_products FROM `orders` INNER JOIN `products` ON `products`.`order_id` = `orders`.`id` GROUP BY `order_id`
does not work, since orders without any products disappear.
Using a LEFT OUTER JOIN instead would add rows without product-information, but the distinction between an order with 1 product and an order with 0 products is lost.
What am I missing here?
You need a left join here, and you should be counting some column from the products table:
SELECT
o.*,
COUNT(p.order_id) AS n_products
FROM orders o
LEFT JOIN products p
ON p.order_id = o.id
GROUP BY
o.id;
Note that I assume your database allows grouping by orders.id and then selecting all columns from that table. PostgreSQL does when orders.id is the primary key. If yours does not, then you would only be able to select o.id in addition to the count.
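To see why the count has to target a column from the products side, here is a small sketch using Python's sqlite3 module as a stand-in engine, with made-up rows: order 1 has two products and order 2 has none.

```python
import sqlite3

# Hypothetical sample data: order 1 has two products, order 2 has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY);
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    order_id INTEGER REFERENCES orders (id)
);
INSERT INTO orders VALUES (1), (2);
INSERT INTO products (order_id) VALUES (1), (1);
""")

# COUNT(p.order_id) skips the NULLs produced by the outer join.
count_column = conn.execute("""
    SELECT o.id, COUNT(p.order_id) AS n_products
    FROM orders o
    LEFT JOIN products p ON p.order_id = o.id
    GROUP BY o.id
    ORDER BY o.id
""").fetchall()

# COUNT(*) counts the NULL-extended row too, so an empty order looks like 1.
count_star = conn.execute("""
    SELECT o.id, COUNT(*) AS n_products
    FROM orders o
    LEFT JOIN products p ON p.order_id = o.id
    GROUP BY o.id
    ORDER BY o.id
""").fetchall()

print(count_column)
print(count_star)
```

COUNT(*) counts rows, including the single NULL-extended row an empty order gets from the LEFT JOIN, which is exactly the lost distinction between 0 and 1 products described in the question.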

add column to select statement when having a certain condition

I have a SQL statement for data from order_details, a table which has many columns including product name, code, etc. How can I add a column to the select statement that whenever the order has a certain product (The product_code I need is called 'Pap') it writes a flag 'pap', so I can visually know which orders have this product?
I tried the code below:
select distinct order_id, customer_id,
(select distinct order_id from order_details
group by 1 having sum (case when product_code='pap'
then 1 else 0 end)=1
) as pap from orders
left join order_details
on order_details.order_id=orders.order_id
group by 1,2,3
The code I am trying is giving me an error "[Firebird]multiple rows in singleton select; HY000".
At a guess, you want to show 'pap' for orders that have one or more order_details with product_code 'pap'. In that case you can use:
select order_id, customer_id,
(select max(order_details.product_code)
from order_details
where order_details.order_id = orders.order_id
and order_details.product_code = 'pap') as pap
from orders
Or a more generic solution (that doesn't rely on the product_code for the value to display):
select order_id, customer_id,
case
when exists(
select 1
from order_details
where order_details.order_id = orders.order_id
and order_details.product_code = 'pap')
then 'pap'
end as pap
from orders
Let's try to build your query step by step, from simple to more complex, in the old-fashioned bottom-up way :-)
I suggest you run every query to see the results, to see how the data gets refined step by step, and to check early whether your assumptions hold true.
The first unknown is order_details: can one order have several rows with the same product? Is it possible to have an order with 2+3 Paps, or only one summary row of 5 Paps? Is (order_id, product_code) a unique constraint or primary key on that table, or not?
Select Count(1), order_id, product_code
From order_details
Group by 2,3
Order by 1 DESC
This can show whether such repetition exists, but even if it does not, you have to check the metadata (schema) to see whether it is allowed by table constraints or indices.
The thing is, when you JOIN tables, their matching rows get multiplied (in set theory terms). So if you can have several rows about Paps in one order, we have to deal with that specially, which would add extra load on the server unless we find a way to do it for free.
We can easily check for one specific order to have that product.
select 'pap' from order_details where order_id = :parameter_id and product_code='pap'
We can then suppress repetitions, if they are not prohibited by constraints, either in a standard way (which requires extra sorting) or in a Firebird-specific (but free) way.
select DISTINCT 'pap' from order_details where order_id = :parameter_id and product_code='pap'
or
select FIRST(1) 'pap' from order_details where order_id = :parameter_id and product_code='pap'
However, these can suit Mark's answer with correlated sub-query:
select o.order_id, o.customer_id,
coalesce(
( select first(1) 'pap' /* the flag */ from order_details d
where o.order_id = d.order_id and d.product_code = 'pap' )
, '' /* just avoiding NULL */
) as pap
from orders o
Lifehack: notice how the use of coalesce and first(1) here substitutes for the use of case and exists in Mark's original answer. This trick can be used in Firebird wherever you use a singular (and potentially empty) one-column query as an expression.
To avoid multiple sub-queries and switch to an outer join, we need one query that returns ALL the order IDs with Paps, but each only once.
select distinct order_id from order_details where product_code='pap'
That should do the trick, but probably at the cost of extra sorting to suppress possible duplicates (again, are they possible at all?).
select order_id, count(order_id)
from order_details
where product_code='pap'
group by 1 order by 2 desc
That would show us the repetitions if they are already there, just to explain what I mean. It also shows whether you could enforce SQL constraints on the already existing data, if you did not have them and chose to harden your database structure.
This way we just have to outer-join with it and use CASE (or one of its shorthand forms) to do the typical trick of detecting the outer join's NULL rows.
select o.order_id, o.customer_id,
iif( d.order_id is null, '', 'pap') as pap
from orders o
left join (
select distinct order_id
from order_details
where product_code = 'pap'
and product_quantity > 0 ) d
on o.order_id=d.order_id
As someone said this looks ugly, there is one more 'modern' way to write exactly that query, maybe it would look better :-D
with d as (
select distinct order_id
from order_details
where product_code = 'pap'
and product_quantity > 0 )
select o.order_id, o.customer_id,
iif( d.order_id is null, '', 'pap') as pap
from orders o left join d on o.order_id=d.order_id
If 'pap' repetitions can not (notice: not DO not, but CAN not) occur within one single order_id, then the query gets even simpler and faster:
select o.order_id, o.customer_id,
iif( d.order_id is null, '', 'pap') as pap
from orders o
left join order_details d
on o.order_id=d.order_id
and d.product_code='pap'
and d.product_quantity>0
Notice the crucial detail: d.product_code='pap' is set as an internal condition on (before) the join. If you put it into the outer WHERE clause, after the join, it would not work: orders without a 'pap' row would be filtered out entirely.
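The difference between filtering in the ON clause and in the WHERE clause can be seen on three made-up orders. This sketch uses Python's sqlite3 module rather than Firebird, so CASE replaces iif, and the sample rows are assumptions.

```python
import sqlite3

# Hypothetical data: order 1 has a 'pap' row, order 2 has only another
# product, order 3 has no detail rows at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE order_details (order_id INTEGER, product_code TEXT);
INSERT INTO orders VALUES (1, 100), (2, 200), (3, 300);
INSERT INTO order_details VALUES (1, 'pap'), (1, 'xyz'), (2, 'xyz');
""")

# Condition inside ON: every order survives, unmatched ones get NULLs.
condition_in_on = conn.execute("""
    SELECT o.order_id,
           CASE WHEN d.order_id IS NULL THEN '' ELSE 'pap' END AS pap
    FROM orders o
    LEFT JOIN order_details d
           ON o.order_id = d.order_id
          AND d.product_code = 'pap'
    ORDER BY o.order_id
""").fetchall()

# Condition in WHERE: it runs after the join and discards the NULL rows,
# so the outer join behaves like an inner join.
condition_in_where = conn.execute("""
    SELECT o.order_id,
           CASE WHEN d.order_id IS NULL THEN '' ELSE 'pap' END AS pap
    FROM orders o
    LEFT JOIN order_details d
           ON o.order_id = d.order_id
    WHERE d.product_code = 'pap'
    ORDER BY o.order_id
""").fetchall()

print(condition_in_on)
print(condition_in_where)
```

With the condition in WHERE, only orders that actually contain 'pap' are returned, so the flag column carries no information.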
Now, to compare those two approaches, JOIN vs. correlated subqueries, you have to look at the query statistics: how many fetches and cached fetches each would generate. Chances are that on medium-sized tables, with the OS disk caches and Firebird caches warmed up, you would not see a difference in time. But if you at least shut down and restart the Firebird service (or better, the whole computer) to clear those caches, and then fetch those queries to the last row (by issuing "fetch all" or "scroll to the last row" in your database IDE, or by wrapping my and Mark's queries into
select count(1) from ( /* measured query here */ )
), you may start to see the timing change too.
SELECT
...
<foreign_table>.<your_desired_extra_column>
FROM
<current_table>
LEFT JOIN
<foreign_table> ON <foreign_table>.id = <current_table>.id
AND
<current_table>.<condition_field> = <condition_value>
Extra column will be NULL if the condition is not met.

Making select query more efficient (subquery slows run speed)

The below query seems to take forever to run ever since I have added the subquery into it.
I originally tried to accomplish my goal by having two joins but the results were wrong.
Does anyone know the correct way to write this?
SELECT
c.cus_Name,
COUNT(o.orderHeader_id) AS Orders,
(select count(ol.orderLines_id) from orderlines ol where ol.orderLines_orderId = o.orderHeader_id) as linesOrderd,
MAX(o.orderHeader_dateCreated) AS lastOrdered,
SUM(o.orderHeader_totalSell) AS orderTotal,
SUM(o.orderHeader_currentSell) AS sellTotal
FROM
cus c
JOIN
orderheader o ON o.orderHeader_customer = c.cus_id
group by
c.cus_name
order by
orderTotal desc
Example data below
For the data you want, I think this is the way to go:
SELECT c.cus_Name,
COUNT(o.orderHeader_id) AS Orders,
SUM(ol.cnt) as linesOrderd,
MAX(o.orderHeader_dateCreated) AS lastOrdered,
SUM(o.orderHeader_totalSell) AS orderTotal,
SUM(o.orderHeader_currentSell) AS sellTotal
FROM cus c JOIN
orderheader o
ON o.orderHeader_customer = c.cus_id LEFT JOIN
(SELECT ol.orderLines_orderId, count(*) as cnt
FROM orderlines ol
GROUP BY ol.orderLines_orderId
) ol
ON ol.orderLines_orderId = o.orderHeader_id
GROUP BY c.cus_name
ORDER BY orderTotal DESC;
I'm not sure if it will be much faster, but it will at least produce a sensible result -- the total number of order lines for a customer rather than the number of order lines on an arbitrary order.
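The reason a plain second join gives wrong results, and the pre-aggregated subquery does not, can be checked on a tiny data set. This is a sketch with Python's sqlite3 module and made-up rows (one customer, two orders with three and one lines), keeping the column names from the question.

```python
import sqlite3

# Hypothetical data: one customer, order 10 (3 lines) and order 11 (1 line).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cus (cus_id INTEGER PRIMARY KEY, cus_Name TEXT);
CREATE TABLE orderheader (
    orderHeader_id INTEGER PRIMARY KEY,
    orderHeader_customer INTEGER,
    orderHeader_totalSell INTEGER
);
CREATE TABLE orderlines (
    orderLines_id INTEGER PRIMARY KEY,
    orderLines_orderId INTEGER
);
INSERT INTO cus VALUES (1, 'Acme');
INSERT INTO orderheader VALUES (10, 1, 100), (11, 1, 50);
INSERT INTO orderlines (orderLines_orderId) VALUES (10), (10), (10), (11);
""")

# Joining orderlines directly repeats each order once per line,
# which inflates the order count and the totals.
naive_join = conn.execute("""
    SELECT c.cus_Name,
           COUNT(o.orderHeader_id),
           COUNT(ol.orderLines_id),
           SUM(o.orderHeader_totalSell)
    FROM cus c
    JOIN orderheader o ON o.orderHeader_customer = c.cus_id
    JOIN orderlines ol ON ol.orderLines_orderId = o.orderHeader_id
    GROUP BY c.cus_Name
""").fetchone()

# Aggregating the lines per order first keeps one row per order.
pre_aggregated = conn.execute("""
    SELECT c.cus_Name,
           COUNT(o.orderHeader_id),
           SUM(ol.cnt),
           SUM(o.orderHeader_totalSell)
    FROM cus c
    JOIN orderheader o ON o.orderHeader_customer = c.cus_id
    LEFT JOIN (SELECT orderLines_orderId, COUNT(*) AS cnt
               FROM orderlines
               GROUP BY orderLines_orderId) ol
           ON ol.orderLines_orderId = o.orderHeader_id
    GROUP BY c.cus_Name
""").fetchone()

print(naive_join)
print(pre_aggregated)
```

In the naive join the three-line order is counted three times, so both the order count and the sell total are inflated. The derived table has at most one row per order, so the other aggregates stay correct.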
Strange, that subselect should not even be possible, since the count is only very indirectly related to the grouping. You want to count all orderlines of all orders that belong to one customer? Normally this would be done with a second join, but then each orderheader row is repeated once per order line, which produces wrong results in the other aggregations.
Normally this should help: move the subselect into the joined table.
Replace orderheader o with
(select o.*, (select count(ol.orderLines_id) from orderlines ol where ol.orderLines_orderId = o.orderHeader_id) as linesOrder from orderheader o) as o
and replace the subselect in the select list with
sum(o.linesOrder)

SQL Select - Calculated Column if Value Exists in another Table

Trying to work through a SQL query with some very limited knowledge and experience. Tried quite a few things I've found through searches, but haven't come up with my desired result.
I have four tables:
ORDERS
[ID][DATE]
ORDER_DETAILS
[ID][ITEM_NO][QTY]
ITEMS
[ITEM_NO][DESC]
KITS
[KIT_NO][ITEM_NO]
Re: KITS - [KIT_NO] and [ITEM_NO] are both FK to the ITEMS table. The concatenation of them is the PK.
I want to select ORDERS, ORDERS.DATE, ORDER_DETAILS.ITEM_NO, ITEMS.DESC
No problem. A few simple inner joins and I'm on my way.
The difficulty lies in adding a column to the select statement, IS_KIT, that is true if:
EXISTS(SELECT null FROM KITS WHERE KITS.ITEM_NO = ORDER_DETAILS.ITEM_NO).
(if the kits table contains the item, flag this row)
Is there any way to calculate that column?
There are different ways to do this.
The simplest is probably a LEFT JOIN with a CASE calculated column:
SELECT
o.date,
od.item_no,
i.desc,
CASE WHEN k.item_no IS NULL THEN 0 ELSE 1 END AS is_kit
FROM orders o
JOIN order_details od ON od.id=o.id
JOIN items i ON i.item_no = od.item_no
LEFT JOIN kits k ON k.item_no = od.item_no
But you could also use a SUBSELECT:
SELECT
o.date,
od.item_no,
i.desc,
(SELECT COUNT(*) FROM kits k WHERE k.item_no = od.item_no) AS is_kit
FROM orders o
JOIN order_details od ON od.id=o.id
JOIN items i ON i.item_no = od.item_no
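One caveat with both variants: since the primary key of KITS is (KIT_NO, ITEM_NO), the same item can appear in several kits. In that case the LEFT JOIN returns the order line once per kit, and COUNT(*) returns the number of kits rather than a 0/1 flag. The sketch below uses Python's sqlite3 module and made-up rows to compare them with a CASE WHEN EXISTS form, which is an alternative not taken from the answer above.

```python
import sqlite3

# Hypothetical data: item 'A' belongs to two kits, item 'B' to none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_details (id INTEGER, item_no TEXT, qty INTEGER);
CREATE TABLE kits (kit_no TEXT, item_no TEXT, PRIMARY KEY (kit_no, item_no));
INSERT INTO order_details VALUES (1, 'A', 2), (1, 'B', 1);
INSERT INTO kits VALUES ('K1', 'A'), ('K2', 'A');
""")

# LEFT JOIN: the line for item 'A' comes back once per matching kit.
left_join = conn.execute("""
    SELECT od.item_no,
           CASE WHEN k.item_no IS NULL THEN 0 ELSE 1 END AS is_kit
    FROM order_details od
    LEFT JOIN kits k ON k.item_no = od.item_no
    ORDER BY od.item_no
""").fetchall()

# Scalar COUNT(*): one row per line, but the value is a count, not a flag.
scalar_count = conn.execute("""
    SELECT od.item_no,
           (SELECT COUNT(*) FROM kits k WHERE k.item_no = od.item_no) AS is_kit
    FROM order_details od
    ORDER BY od.item_no
""").fetchall()

# CASE WHEN EXISTS: one row per line and a strict 0/1 flag.
exists_flag = conn.execute("""
    SELECT od.item_no,
           CASE WHEN EXISTS (SELECT 1 FROM kits k
                             WHERE k.item_no = od.item_no)
                THEN 1 ELSE 0 END AS is_kit
    FROM order_details od
    ORDER BY od.item_no
""").fetchall()

print(left_join)
print(scalar_count)
print(exists_flag)
```

If an item can never appear in more than one kit in your data, all three forms return the same rows.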

What is the most efficient way to write a select statement with a "not in" subquery?

What is the most efficient way to write a select statement similar to the below.
SELECT *
FROM Orders
WHERE Orders.Order_ID not in (Select Order_ID FROM HeldOrders)
The gist is you want the records from one table when the item is not in another table.
For starters, a link to an old article in my blog on how NOT IN predicate works in SQL Server (and in other systems too):
Counting missing rows: SQL Server
You can rewrite it as follows:
SELECT *
FROM Orders o
WHERE NOT EXISTS
(
SELECT NULL
FROM HeldOrders ho
WHERE ho.OrderID = o.OrderID
)
, however, most databases will treat these queries the same.
Both these queries will use some kind of an ANTI JOIN.
This is useful for SQL Server if you want to check two or more columns, since SQL Server does not support this syntax:
SELECT *
FROM Orders o
WHERE (col1, col2) NOT IN
(
SELECT col1, col2
FROM HeldOrders ho
)
Note, however, that NOT IN may be tricky due to the way it treats NULL values.
If HeldOrders.OrderID is nullable, no match is found, and the subquery returns so much as a single NULL, the whole query will return nothing (both IN and NOT IN evaluate to NULL in this case).
Consider these data:
Orders:
OrderID
---
1
HeldOrders:
OrderID
---
2
NULL
This query:
SELECT *
FROM Orders o
WHERE OrderID NOT IN
(
SELECT OrderID
FROM HeldOrders ho
)
will return nothing, which is probably not what you'd expect.
However, this one:
SELECT *
FROM Orders o
WHERE NOT EXISTS
(
SELECT NULL
FROM HeldOrders ho
WHERE ho.OrderID = o.OrderID
)
will return the row with OrderID = 1.
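The example data above can be reproduced in a few lines. This sketch uses Python's sqlite3 module as a stand-in engine, on the assumption that it applies the same three-valued logic to NOT IN as the systems discussed here.

```python
import sqlite3

# Same rows as in the example: Orders {1}, HeldOrders {2, NULL}.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (OrderID INTEGER);
CREATE TABLE HeldOrders (OrderID INTEGER);
INSERT INTO Orders VALUES (1);
INSERT INTO HeldOrders VALUES (2), (NULL);
""")

# 1 NOT IN (2, NULL) evaluates to NULL, so the row is filtered out.
not_in = conn.execute("""
    SELECT o.OrderID
    FROM Orders o
    WHERE o.OrderID NOT IN (SELECT ho.OrderID FROM HeldOrders ho)
""").fetchall()

# NOT EXISTS only asks whether a matching row exists, so NULLs do no harm.
not_exists = conn.execute("""
    SELECT o.OrderID
    FROM Orders o
    WHERE NOT EXISTS (SELECT NULL
                      FROM HeldOrders ho
                      WHERE ho.OrderID = o.OrderID)
""").fetchall()

print(not_in)
print(not_exists)
```

Adding WHERE ho.OrderID IS NOT NULL inside the NOT IN subquery would make the two queries agree on this data.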
Note that the LEFT JOIN solution proposed by others is far from being the most efficient one.
This query:
SELECT *
FROM Orders o
LEFT JOIN
HeldOrders ho
ON ho.OrderID = o.OrderID
WHERE ho.OrderID IS NULL
will use a filter condition that needs to evaluate and filter out all matching rows, which can be numerous.
An ANTI JOIN method, as used by both NOT IN and NOT EXISTS, only needs to make sure that a record does not exist, once per row in Orders, so it will eliminate all possible duplicates first:
NESTED LOOPS ANTI JOIN and MERGE ANTI JOIN will just skip the duplicates when evaluating HeldOrders.
A HASH ANTI JOIN will eliminate duplicates when building the hash table.
"Most efficient" is going to be different depending on tables sizes, indexes, and so on. In other words it's going to differ depending on the specific case you're using.
There are three ways I commonly use to accomplish what you want, depending on the situation.
1. Your example works fine if Orders.order_id is indexed, and HeldOrders is fairly small.
2. Another method is the "correlated subquery" which is a slight variation of what you have...
SELECT *
FROM Orders o
WHERE o.Order_ID not in (Select Order_ID
FROM HeldOrders h
where h.order_id = o.order_id)
Note the addition of the where clause. This tends to work better when HeldOrders has a large number of rows. Order_ID needs to be indexed in both tables.
3. Another method I use sometimes is left outer join...
SELECT *
FROM Orders o
left outer join HeldOrders h on h.order_id = o.order_id
where h.order_id is null
When using the left outer join, h.order_id will have a value in it matching o.order_id when there is a matching row. If there isn't a matching row, h.order_id will be NULL. By checking for the NULL values in the where clause you can filter on everything that doesn't have a match.
Each of these variations can work more or less efficiently in various scenarios.
You can use a LEFT OUTER JOIN and check for NULL on the right table.
SELECT O1.*
FROM Orders O1
LEFT OUTER JOIN HeldOrders O2
ON O1.Order_ID = O2.Order_Id
WHERE O2.Order_Id IS NULL
I'm not sure what is the most efficient, but other options are:
1. Use EXISTS
SELECT *
FROM ORDERS O
WHERE NOT EXISTS (SELECT 1
FROM HeldOrders HO
WHERE O.Order_ID = HO.Order_ID)
2. Use EXCEPT
SELECT O.Order_ID
FROM ORDERS O
EXCEPT
SELECT HO.Order_ID
FROM HeldOrders HO
Try
SELECT *
FROM Orders
LEFT JOIN HeldOrders
ON HeldOrders.Order_ID = Orders.Order_ID
WHERE HeldOrders.Order_ID IS NULL