Optimize SQL subquery containing multiple inner joins and aggregate functions - sql

I have a select statement which is in fact a subquery within a larger select statement built up programmatically. The problem is that if I elect to include this subquery, it acts as a bottleneck and the whole query becomes painfully slow.
An example of the data is as follows:
Payment
.Receipt_no|.Person |.Payment_date|.Type|.Reversed|
2|John |01/02/2001 |PA | |
1|John |01/02/2001 |GX | |
3|David |15/04/2003 |PA | |
6|Mike |26/07/2002 |PA |R |
5|John |01/01/2001 |PA | |
4|Mike |13/05/2000 |GX | |
8|Mike |27/11/2004 |PA | |
7|David |05/12/2003 |PA |R |
9|David |15/04/2003 |PA | |
The subquery is as follows :
select Payment.Person,
Payment.amount
from Payment
inner join (Select min([min_Receipt].Person) 'Person',
min([min_Receipt].Receipt_no) 'Receipt_no'
from Payment [min_Receipt]
inner join (select min(Person) 'Person',
min(Payment_date) 'Payment_date'
from Payment
where Payment.reversed != 'R' and Payment.Type != 'GX'
group by Payment.Person) [min_date]
on [min_date].Person= [min_Receipt].Person and [min_date].Payment_date = [min_Receipt].Payment_date
where [min_Receipt].reversed != 'R' and [min_Receipt].Type != 'GX'
group by [min_Receipt].Person) [1stPayment]
on [1stPayment].Receipt_no = Payment.Receipt_no
This retrieves the first payment of each person by .Payment_date (ascending), .Receipt_no (ascending) where .Type is not 'GX' and .Reversed is not 'R', as follows:
Payment
.Receipt_No|.Person|.Payment_date
5|John |01/01/2001
3|David |15/04/2003
8|Mike |27/11/2004
Following Ahmad's post:
From the following results
(3|David |15/04/2003)
and (9|David |15/04/2003)
I would only want the record with the lowest receipt_no. So
(3|David |15/04/2003)
So I added the aggregate function 'min(Payment.receipt_no)' grouping by person.
Query 1.
select min(a.Person) 'Person',
min(a.receipt_no) 'receipt_no'
from
Payment a
where
a.type<>'GX' and (a.reversed not in ('R') or a.reversed is null)
and a.payment_date =
(select min(payment_date) from Payment i
where i.Person=a.Person and i.type <> 'GX'
and (i.reversed not in ('R') or i.reversed is null))
group by a.Person
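A quick way to sanity-check Query 1 against the sample data is to run it on a throwaway database. This is only a sketch: it uses Python's sqlite3 as a stand-in for the real server, stores the dates as ISO strings so they compare chronologically, and assumes the blank Reversed cells are NULL.

```python
import sqlite3

# In-memory SQLite stands in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Payment (Receipt_no INTEGER, Person TEXT, "
    "Payment_date TEXT, Type TEXT, Reversed TEXT)"
)
# Sample rows from the question; blank Reversed is assumed to be NULL.
conn.executemany(
    "INSERT INTO Payment VALUES (?, ?, ?, ?, ?)",
    [
        (2, "John", "2001-02-01", "PA", None),
        (1, "John", "2001-02-01", "GX", None),
        (3, "David", "2003-04-15", "PA", None),
        (6, "Mike", "2002-07-26", "PA", "R"),
        (5, "John", "2001-01-01", "PA", None),
        (4, "Mike", "2000-05-13", "GX", None),
        (8, "Mike", "2004-11-27", "PA", None),
        (7, "David", "2003-12-05", "PA", "R"),
        (9, "David", "2003-04-15", "PA", None),
    ],
)

# Earliest qualifying date per person, then the lowest receipt on that date.
first_payments = conn.execute(
    """
    SELECT a.Person, MIN(a.Receipt_no) AS Receipt_no
    FROM Payment a
    WHERE a.Type <> 'GX'
      AND (a.Reversed <> 'R' OR a.Reversed IS NULL)
      AND a.Payment_date = (
            SELECT MIN(i.Payment_date)
            FROM Payment i
            WHERE i.Person = a.Person
              AND i.Type <> 'GX'
              AND (i.Reversed <> 'R' OR i.Reversed IS NULL))
    GROUP BY a.Person
    ORDER BY a.Person
    """
).fetchall()
print(first_payments)
```

On the sample rows this picks receipt 3 for David, 5 for John and 8 for Mike, which matches the expected output listed earlier.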
I added this as a subquery within my much larger query, however it still ran very slowly. So I tried rewriting the query whilst trying to avoid the use of aggregate functions and came up with the following.
Query 2.
SELECT
receipt_no,
person,
payment_date,
amount
FROM
payment a
WHERE
receipt_no IN
(SELECT
top 1 i.receipt_no
FROM
payment i
WHERE
(i.reversed NOT IN ('R') OR i.reversed IS NULL)
AND i.type<>'GX'
AND i.person = a.person
ORDER BY i.payment_date DESC, i.receipt_no ASC)
I wouldn't necessarily think of this as more efficient. In fact, if I run the two queries side by side on my larger data set, Query 1 completes in a matter of milliseconds whereas Query 2 takes several seconds.
However, if I then add them as subqueries within a much larger query, the larger query completes in hours using Query 1 and in 40 seconds using Query 2.
I can only attribute this to the use of aggregate functions in one and not the other.

How do you distinguish the payments
(3|David |15/04/2003)
and (9|David |15/04/2003)
These were both made by the same person on the same date. If the times differ, this query should work fine:
select
receipt_no,
person,
payment_date
from
payment a
where
type<>'GX' and (reversed not in ('R') or reversed is null)
and payment_date =
(select min(payment_date) from payment i
where i.person=a.person and i.type <> 'GX'
and (i.reversed not in ('R') or i.reversed is null))
order by person,payment_date desc
I have set up and tested this query on SQLFiddle, but I am not sure about the performance, since I don't have the amount of data that you have. So check and let me know.
===
SQL Fiddle Demo for the Question above

Following a comment from CodeReview -
I've also re-written the query using the Rank() command as suggested.
Query 3.
left join
(select
a.Person,
a.amount,
(rank () over (Partition by a.Person order by a.payment_date desc, a.receipt_no desc)) 'Ranked'
from
Payment a
Where
(a.reversed not in ('R') or a.reversed is null)
and a.type != 'GX'
) [lastPayment]
on
[lastPayment].Person = [Person].Person
and [lastPayment].ranked = 1
This method has also sped up the larger query, which now takes some 28 seconds.
However, Rank() is only supported from SQL Server 2005 upwards.
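For reference, here is a small sketch of the ranking approach run against the sample data. It is not the exact Query 3: it swaps Rank() for ROW_NUMBER() so that ties can never produce two rows, it orders ascending to pick the first payment rather than the last, and it uses Python's sqlite3 (SQLite 3.25 or later for window functions) as a stand-in for SQL Server. Blank Reversed cells are assumed to be NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real server (assumption)
conn.execute(
    "CREATE TABLE Payment (Receipt_no INTEGER, Person TEXT, "
    "Payment_date TEXT, Type TEXT, Reversed TEXT)"
)
conn.executemany(
    "INSERT INTO Payment VALUES (?, ?, ?, ?, ?)",
    [
        (2, "John", "2001-02-01", "PA", None),
        (1, "John", "2001-02-01", "GX", None),
        (3, "David", "2003-04-15", "PA", None),
        (6, "Mike", "2002-07-26", "PA", "R"),
        (5, "John", "2001-01-01", "PA", None),
        (4, "Mike", "2000-05-13", "GX", None),
        (8, "Mike", "2004-11-27", "PA", None),
        (7, "David", "2003-12-05", "PA", "R"),
        (9, "David", "2003-04-15", "PA", None),
    ],
)

# Number each person's qualifying payments by date, then receipt number,
# and keep only the first one.
ranked_first = conn.execute(
    """
    SELECT Person, Receipt_no, Payment_date
    FROM (
        SELECT a.Person, a.Receipt_no, a.Payment_date,
               ROW_NUMBER() OVER (
                   PARTITION BY a.Person
                   ORDER BY a.Payment_date ASC, a.Receipt_no ASC
               ) AS rn
        FROM Payment a
        WHERE a.Type <> 'GX'
          AND (a.Reversed <> 'R' OR a.Reversed IS NULL)
    ) AS ranked
    WHERE rn = 1
    ORDER BY Person
    """
).fetchall()
print(ranked_first)
```

Filtering on rn in an outer query rather than in a join condition is a design choice made here only to keep the sketch short; the join form used in Query 3 should behave the same way.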

Related

PostgreSQL average for every field

I'm trying to calculate an average limited to 20 entries per ID. The example below works, but only when I select a specific ID (2958 in this case). I'm having a hard time remembering how to calculate this for every ID in the database, rather than having the developer specify the ID explicitly. I.e., it would be optimal to get the following rows (grouped by primary key, with a limit of 20 values per average):
ID: 1 -> avg 5
ID: 2 -> avg 2
ID: 3 -> avg 7
etc....
select avg(acc.amt)
from (
select acc.amt amt
from main_acc main_acc
join transactions trans on main_acc.id = trans.main_acc_id
where main_acc.id = 2958
order by main_acc.track_id, transactions.transaction_time desc
limit 20
) acc;
Any help at all would be greatly appreciated. The only relevant columns are the ones shown above, I can add a schema definition if requested. Thank you!
select main_acc.id, avg(acc.amt) from (select acc.amt amt
from main_acc main_acc
join transactions trans on main_acc.id = trans.main_acc_id
order by main_acc.track_id, transactions.transaction_time desc) acc
group by main_acc.id;
In fact you do not need the subquery.
select acc.id, avg(acc.amt)
from main_acc acc
join transactions trans on acc.id = trans.main_acc_id
group by acc.id
How do you define most recent? Is there a timestamp which is not shown, do you use the greatest id, or something else? Basically, you order by that criterion and then use limit. So expanding the answer from @Tarik and assuming the highest ids are the most recent would yield something like:
select acc.id, avg(acc.amt) avg_for_id
from main_acc acc
join transactions trans on acc.id = trans.main_acc_id
group by acc.id
order by acc.id desc
limit 20;
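Note that a LIMIT on the grouped query caps the number of ids returned, not the number of values that go into each average. If the goal is "average of the N most recent transactions per id", one possible sketch is to number the rows per id first and average only the first N. The table layout below (amt and transaction_time living on transactions) is an assumption, the sketch uses Python's sqlite3 (SQLite 3.25 or later) rather than PostgreSQL, and N is set to 2 only to keep the sample data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per transaction, keyed by account id.
conn.execute(
    "CREATE TABLE transactions (main_acc_id INTEGER, amt REAL, transaction_time TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01"),
        (1, 20.0, "2024-01-02"),
        (1, 60.0, "2024-01-03"),
        (2, 5.0, "2024-01-01"),
    ],
)

per_id_limit = 2  # the question uses 20

# Number each account's transactions from newest to oldest, then average
# only the newest `per_id_limit` rows of every account.
averages = conn.execute(
    """
    SELECT main_acc_id, AVG(amt) AS avg_amt
    FROM (
        SELECT t.main_acc_id, t.amt,
               ROW_NUMBER() OVER (
                   PARTITION BY t.main_acc_id
                   ORDER BY t.transaction_time DESC
               ) AS rn
        FROM transactions t
    ) AS recent
    WHERE rn <= ?
    GROUP BY main_acc_id
    ORDER BY main_acc_id
    """,
    (per_id_limit,),
).fetchall()
print(averages)
```

Account 1 averages only its two newest amounts (20 and 60), and account 2 averages its single row.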

Group table by custom column

I have a table transaction_transaction with columns:
id, status, total_amount, date_made, transaction_type
The status can be: Active, Paid, Trashed, Renewed, Void
So what I want to do is filter by date and status, but since sometimes there are no records with Renewed or Trashed, I get inconsistent data: it returns only Active and Paid when grouping by status (notice Renewed and Trashed are missing). I want it to always return something like:
-----------------------------------
Active | 121 | 2017-08-09
Paid | 122 | 2017-08-19
Trashed | 123 | 2017-08-20
Renewed | 123 | 2017-08-20
The SQL query I use:
SELECT
ST.type,
COALESCE(SUM(TR.total_amount), 0) AS amount
FROM sms_admin_status ST
LEFT JOIN transaction_transaction TR ON TR.status = ST.type
WHERE TR.store_id = 21 AND TR.transaction_type = 'Layaway' AND TR.status != 'Void'
AND TR.date_made >= '2018-02-01' AND TR.date_made <= '2018-02-26'
GROUP BY ST.type
Edit: I created a table sms_admin_status, since you said it's bad not having a table and in the future I might have new statuses, and I also changed the query to fit my needs.
Use a VALUES list in a subquery to LEFT JOIN your transaction table. You may need to COALESCE your sums to have them return 0.
https://www.postgresql.org/docs/10/static/queries-values.html
One possible solution (not a very nice one) is the following
select statuses.s, date_made, coalesce(SUM(amount), 0)
from (values('active'),('inactive'),('deleted')) statuses(s)
left join transactions t on statuses.s = t.status and
date_made >= '2017-08-08'
group by statuses.s, date_made
I assume that you forgot to add date_made to the group by, so I added it there. As you can see, the possible values are hardcoded in the SQL. Another (much cleaner) solution is to create a table with the possible values of status and use it in place of my statuses.
Use SELECT ... FROM (VALUES) with restriction from the transaction table:
select * from (values('active', 0),('inactive', 0),('deleted', 0)) as statuses
where column1 not in (select status from transactions)
union select status, sum(amount) from transactions group by status
Add the date column as need be, I assume it's a static value
The multiple where statements will limit the rows selected unless they are in a sub-query. May I suggest something like the following?
SELECT ST.type, COALESCE((SELECT SUM(TR.total_amount)
FROM transaction_transaction TR
WHERE TR.status = ST.type AND TR.store_id = 21 AND TR.transaction_type = 'Layaway' AND TR.status != 'Void'
AND TR.date_made >= '2018-02-01' AND TR.date_made <= '2018-02-26'), 0) AS amount
FROM sms_admin_status ST
GROUP BY ST.type
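To illustrate why the WHERE conditions drop the empty statuses, here is a small sketch that compares filtering in WHERE with filtering in the ON clause of the LEFT JOIN. It uses Python's sqlite3 as a stand-in database, and the rows are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sms_admin_status (type TEXT)")
conn.execute(
    "CREATE TABLE transaction_transaction (id INTEGER, status TEXT, "
    "total_amount INTEGER, date_made TEXT, transaction_type TEXT, store_id INTEGER)"
)
conn.executemany(
    "INSERT INTO sms_admin_status VALUES (?)",
    [("Active",), ("Paid",), ("Trashed",), ("Renewed",)],
)
# Made-up rows: nothing is Trashed or Renewed, and one row is out of range.
conn.executemany(
    "INSERT INTO transaction_transaction VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Active", 100, "2018-02-05", "Layaway", 21),
        (2, "Active", 21, "2018-02-10", "Layaway", 21),
        (3, "Paid", 122, "2018-02-19", "Layaway", 21),
        (4, "Paid", 50, "2018-03-05", "Layaway", 21),
    ],
)

# Filters in WHERE: unmatched statuses have NULL in every TR column,
# fail the conditions, and disappear from the result.
where_rows = conn.execute(
    """
    SELECT ST.type, COALESCE(SUM(TR.total_amount), 0) AS amount
    FROM sms_admin_status ST
    LEFT JOIN transaction_transaction TR ON TR.status = ST.type
    WHERE TR.store_id = 21 AND TR.transaction_type = 'Layaway'
      AND TR.date_made >= '2018-02-01' AND TR.date_made <= '2018-02-26'
    GROUP BY ST.type
    ORDER BY ST.type
    """
).fetchall()

# Same filters moved into ON: every status survives, with 0 when empty.
on_rows = conn.execute(
    """
    SELECT ST.type, COALESCE(SUM(TR.total_amount), 0) AS amount
    FROM sms_admin_status ST
    LEFT JOIN transaction_transaction TR
           ON TR.status = ST.type
          AND TR.store_id = 21 AND TR.transaction_type = 'Layaway'
          AND TR.date_made >= '2018-02-01' AND TR.date_made <= '2018-02-26'
    GROUP BY ST.type
    ORDER BY ST.type
    """
).fetchall()
print(where_rows)
print(on_rows)
```

The first query returns only Active and Paid, while the second also returns Renewed and Trashed with an amount of 0.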

SQL aggregate functions and sorting

I am still new to SQL and getting my head around the whole sub-query aggregation to display some results and was looking for some advice:
The tables might look something like:
Customer: (custID, name, address)
Account: (accountID, reward_balance)
Shop: (shopID, name, address)
Relational tables:
Holds (custID*, accountID*)
With (accountID*, shopID*)
How can I find the store that has the least reward_balance?
(The customer info is not required at this point)
I tried:
SELECT accountID AS ACCOUNT_ID, shopID AS SHOP_ID, MIN(reward_balance) AS LOWEST_BALANCE
FROM Account, Shop, With
WHERE With.accountID = Account.accountID
AND With.shopID=Shop.shopID
GROUP BY
Account.accountID,
Shop.shopID
ORDER BY MIN(reward_balance);
This works in a way that is not intended:
ACCOUNT_ID | SHOP_ID | LOWEST_BALANCE
1 | 1 | 10
2 | 2 | 40
3 | 3 | 100
4 | 4 | 1000
5 | 4 | 5000
As you can see Shop_ID 4 actually has a balance of 6000 (1000+5000) as there are two customers registered with it. I think I need to SUM the lowest balance of the shops based on their balance and display it from low-high.
I have been trying to aggregate the data prior to display but this is where I come unstuck:
SELECT shopID AS SHOP_ID, MIN(reward_balance) AS LOWEST_BALANCE
FROM (SELECT accountID, shopID, SUM(reward_balance)
FROM Account, Shop, With
WHERE
With.accountID = Account.accountID
AND With.shopID=Shop.shopID
GROUP BY
Account.accountID,
Shop.shopID;
When I run something like this statement I get an invalid identifier error.
Error at Command Line : 1 Column : 24
Error report -
SQL Error: ORA-00904: "REWARD_BALANCE": invalid identifier
00904. 00000 - "%s: invalid identifier"
So I figured I might have my joining condition incorrect and the aggregate sorting incorrect, and would really appreciate any general advice.
Thanks for the lengthy read!
Approach this problem one step at a time.
We're going to assume (and we should probably check this) that "least reward_balance" refers to the total of all reward_balance values associated with a shop, and that we're not just looking for the shop that has the lowest individual reward balance.
First, get all of the individual "reward_balance" for each shop. Looks like the query would need to involve three tables...
SELECT s.shop_id
, a.reward_balance
FROM `shop` s
LEFT
JOIN `with` w
ON w.shop_id = s.shop_id
LEFT
JOIN `account` a
ON a.account_id = w.account_id
That will get us the detail rows: every shop along with the individual reward_balance amounts associated with the shop, if there are any. (We're using outer joins for this query, because we don't see any guarantee that a shop is going to be related to at least one account. Even if it's true for this use case, that's not always true in the more general case.)
Once we have the individual amounts, the next step is to total them for each shop. We can do that using a GROUP BY clause and a SUM() aggregate.
SELECT s.shop_id
, SUM(a.reward_balance) AS tot_reward_balance
FROM `shop` s
LEFT
JOIN `with` w
ON w.shop_id = s.shop_id
LEFT
JOIN `account` a
ON a.account_id = w.account_id
GROUP BY s.shop_id
At this point, with MySQL we could add an ORDER BY clause to arrange the rows in ascending order of tot_reward_balance, and add a LIMIT 1 clause if we only want to return a single row. We can also handle the case when tot_reward_balance is NULL, assigning a zero in place of the NULL.
SELECT s.shop_id
, IFNULL(SUM(a.reward_balance),0) AS tot_reward_balance
FROM `shop` s
LEFT
JOIN `with` w
ON w.shop_id = s.shop_id
LEFT
JOIN `account` a
ON a.account_id = w.account_id
GROUP BY s.shop_id
ORDER BY tot_reward_balance ASC, s.shop_id ASC
LIMIT 1
If there are two (or more) shops with the same least value of tot_reward_balance, this query returns only one of those shops.
Oracle doesn't have a LIMIT clause like MySQL, but we can get an equivalent result using an analytic function (not available in MySQL before version 8.0). We also replace the MySQL IFNULL() function with the Oracle equivalent NVL() function...
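SELECT r.shop_id
, r.tot_reward_balance
FROM (
-- number the per-shop totals from lowest to highest
SELECT v.shop_id
, v.tot_reward_balance
, ROW_NUMBER() OVER (ORDER BY v.tot_reward_balance ASC, v.shop_id ASC) AS rn
FROM (
SELECT s.shop_id
, NVL(SUM(a.reward_balance),0) AS tot_reward_balance
FROM shop s
LEFT
JOIN with w
ON w.shop_id = s.shop_id
LEFT
JOIN account a
ON a.account_id = w.account_id
GROUP BY s.shop_id
) v
) r
-- the rn alias cannot be filtered with HAVING here, so filter in an outer query
WHERE r.rn = 1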
Like the MySQL query, this returns at most one row, even when two or more shops have the same "least" total of reward_balance.
If we want to return all of the shops that have the lowest tot_reward_balance, we need to take a slightly different approach.
The best approach to building queries is stepwise refinement; in this case, start by getting all of the individual reward_balance values for each shop. The next step is to aggregate the individual reward_balance values into a total. The final step is to pick out the row(s) with the lowest total reward_balance.
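As a rough sketch of that last step, returning every shop tied for the lowest total, one option is to compute the totals once and compare each against the minimum. The example runs on Python's sqlite3 rather than Oracle or MySQL, renames the linking table to shop_account (a hypothetical name, chosen to avoid the reserved word), and adds a made-up shop 5 so that there is a tie to show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER, reward_balance INTEGER)")
conn.execute("CREATE TABLE shop (shop_id INTEGER)")
# shop_account is a hypothetical name for the question's linking table.
conn.execute("CREATE TABLE shop_account (account_id INTEGER, shop_id INTEGER)")

conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [(1, 10), (2, 40), (3, 100), (4, 1000), (5, 5000), (6, 10)],
)
conn.executemany("INSERT INTO shop VALUES (?)", [(1,), (2,), (3,), (4,), (5,)])
# Shop 4 holds two accounts; shop 5 and account 6 are made up to force a tie.
conn.executemany(
    "INSERT INTO shop_account VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 3), (4, 4), (5, 4), (6, 5)],
)

# Step 1 and 2: total per shop. Step 3: keep all shops equal to the minimum.
lowest_shops = conn.execute(
    """
    WITH totals AS (
        SELECT s.shop_id,
               COALESCE(SUM(a.reward_balance), 0) AS tot_reward_balance
        FROM shop s
        LEFT JOIN shop_account w ON w.shop_id = s.shop_id
        LEFT JOIN account a ON a.account_id = w.account_id
        GROUP BY s.shop_id
    )
    SELECT shop_id, tot_reward_balance
    FROM totals
    WHERE tot_reward_balance = (SELECT MIN(tot_reward_balance) FROM totals)
    ORDER BY shop_id
    """
).fetchall()
print(lowest_shops)
```

Both shop 1 and shop 5 come back with a total of 10, whereas a LIMIT 1 or ROW_NUMBER() filter would return only one of them.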
In SQL Server, you can try using a CTE:
;with cte_minvalue as
(
select rank() over (order by Sum_Balance) as RowRank,
ShopId,
Sum_Balance
from (SELECT Shop.shopID, SUM(reward_balance) AS Sum_Balance
FROM
[With]
JOIN Shop ON [With].ShopId = Shop.ShopId
JOIN Account ON [With].AccountId = Account.AccountId
GROUP BY
Shop.shopID)ShopSum
)
select ShopId, Sum_Balance from cte_minvalue where RowRank = 1

Join on id or null and get first result

I have created the query below:
select * from store str
left join(
select * from schedule sdl
where day = 3
order by
case when sdl.store_id is null then (
case when sdl.strong is true then 0 else 2 end
) else 1 end, sdl.schedule_id desc
) ovr on (ovr.store_id = str.store_id OR ovr.store_id IS NULL)
Sample data:
STORE
[store_id] [title]
20010 Shoes-Shop
20330 Candy-Shop
[SCHEDULE]
[schedule_id] [store_id] [day] [strong] [some_other_data]
1 20330 3 f 10% Discount
2 NULL 3 t 0% Discount
What I want to get from the LEFT JOIN is either data for NULL store_id (global schedule entry - affects all store entries) OR the actual data for the given store_id.
Joining the query like this, returns results with the correct order, but for both NULL and store_id matches. It makes sense using the OR statement on join clause.
Expected results:
[store_id] [title] [some_other_data]
20010 Shoes-Shop 0% Discount
20330 Candy-Shop 0% Discount
Current Results:
[store_id] [title] [some_other_data]
20010 Shoes-Shop 0% Discount
20330 Candy-Shop 0% Discount
20330 Candy-Shop 10% Discount
If there is a more elegant approach on the subject I would be glad to follow it.
DISTINCT ON should work just fine, as soon as you get ORDER BY right. Basically, matches with strong = TRUE in schedule have priority, then matches with store_id IS NOT NULL:
SELECT DISTINCT ON (st.store_id)
st.store_id, st.title, sl.some_other_data
FROM store st
LEFT JOIN schedule sl ON sl.day = 3
AND (sl.store_id = st.store_id OR sl.store_id IS NULL)
ORDER BY st.store_id, NOT sl.strong, sl.store_id IS NULL;
This works because:
Sorting null values after all others, except special
Basics for DISTINCT ON:
Select first row in each GROUP BY group?
Alternative with a LATERAL join (Postgres 9.3+):
SELECT *
FROM store st
LEFT JOIN LATERAL (
SELECT some_other_data
FROM schedule
WHERE day = 3
AND (store_id = st.store_id OR store_id IS NULL)
ORDER BY NOT strong
, store_id IS NULL
LIMIT 1
) sl ON true;
About LATERAL joins:
What is the difference between LATERAL and a subquery in PostgreSQL?
I think the easiest way to do what you want is to use distinct on. The question is then how you order it:
select distinct on (str.store_id) *
from store str left join
schedule sdl
on (sdl.store_id = str.store_id or sdl.store_id is null) and sdl.day = 3
order by str.store_id,
(case when sdl.store_id is null then 2 else 1 end)
This will return the store record if available, otherwise the schedule record that has a value of NULL. Note: your query has this notion of strength, but the question doesn't explain how to use it. This can be readily modified to include multiple levels of priorities.
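For anyone who wants to try the priority ordering without a Postgres instance, here is a rough equivalent on the sample data. DISTINCT ON and LATERAL are Postgres features, so this sketch swaps in ROW_NUMBER() on Python's sqlite3 (SQLite 3.25 or later); storing strong as 0/1 and breaking remaining ties by the highest schedule_id are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (store_id INTEGER, title TEXT)")
conn.execute(
    "CREATE TABLE schedule (schedule_id INTEGER, store_id INTEGER, day INTEGER, "
    "strong INTEGER, some_other_data TEXT)"
)
conn.executemany(
    "INSERT INTO store VALUES (?, ?)",
    [(20010, "Shoes-Shop"), (20330, "Candy-Shop")],
)
# strong is stored as 0/1 here (assumption); NULL store_id means "global".
conn.executemany(
    "INSERT INTO schedule VALUES (?, ?, ?, ?, ?)",
    [(1, 20330, 3, 0, "10% Discount"), (2, None, 3, 1, "0% Discount")],
)

# Per store: strong entries first, then store-specific before global ones,
# then the highest schedule_id; keep only the top row.
picked = conn.execute(
    """
    SELECT store_id, title, some_other_data
    FROM (
        SELECT st.store_id, st.title, sl.some_other_data,
               ROW_NUMBER() OVER (
                   PARTITION BY st.store_id
                   ORDER BY CASE WHEN sl.strong = 1 THEN 0 ELSE 1 END,
                            sl.store_id IS NULL,
                            sl.schedule_id DESC
               ) AS rn
        FROM store st
        LEFT JOIN schedule sl
               ON sl.day = 3
              AND (sl.store_id = st.store_id OR sl.store_id IS NULL)
    ) AS ranked
    WHERE rn = 1
    ORDER BY store_id
    """
).fetchall()
print(picked)
```

Each store gets exactly one row, and on the sample data both end up with the strong global entry, matching the expected results in the question.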

Count the number of occurrences grouped by some rows

I have made a query to bring me the number of products that have not been in stock (I know that by looking at the orders which the manufacturer returned with some status code), by product, date and storage, that looks like this:
SELECT count(*) as out_of_stock,
prod.id as product_id,
ped.data_envio::date as date,
opl.id as storage_id
from sub_produtos_pedidos spp
left join cad_produtos prod ON spp.ean_produto = prod.cod_ean
left join sub_pedidos sp ON spp.id_pedido = sp.id
left join pedidos ped ON sp.id_pedido = ped.id
left join op_logisticos opl ON sp.id_op_logistico = opl.id
where spp.motivo = '201' -- this is the code that means 'not in inventory'
group by storage_id,product_id,date
That produces an answer like this:
out_of_stock | product_id | date | storage_id
--------------|------------|-------------|-------------
1 | 5 | 2012-10-16 | 1
5 | 4 | 2012-10-16 | 2
Now I need to get the number of occurrences, by product and storage, of products that have been out of stock for 2 or more days, 5 or more days and so on.
So I guess I need to do a new count on the first query, aggregating the resultant rows in some defined day intervals.
I tried looking at the datetime functions in Postgres (http://www.postgresql.org/docs/7.3/static/functions-datetime.html), but couldn't find what I need.
Maybe I didn't understand your question correctly, but it looks like you need to leverage a sub-query.
Now I need to get the number of occurrences, by product and storage, of products that have been out of stock for 2 or more days
So:
SELECT COUNT(*), date, product_id FROM ( YOUR BIG QUERY IS THERE ) a
WHERE a.date < (CURRENT_DATE - interval '2' day)
GROUP BY date, product_id
Since you seem to want every row in the result individually, you cannot aggregate. Use a window function instead to get the count per day. The well known aggregate function count() can also serve as window aggregate function:
SELECT current_date - ped.data_envio::date AS days_out_of_stock
,count(*) OVER (PARTITION BY ped.data_envio::date)
AS count_per_days_out_of_stock
,ped.data_envio::date AS date
,p.id AS product_id
,opl.id AS storage_id
FROM sub_produtos_pedidos spp
LEFT JOIN cad_produtos p ON p.cod_ean = spp.ean_produto
LEFT JOIN sub_pedidos sp ON sp.id = spp.id_pedido
LEFT JOIN op_logisticos opl ON opl.id = sp.id_op_logistico
LEFT JOIN pedidos ped ON ped.id = sp.id_pedido
WHERE spp.motivo = '201' -- code for 'not in inventory'
ORDER BY ped.data_envio::date, p.id, opl.id
Sort order: Products having been out of stock for the longest time first.
Note, you can just subtract dates to get an integer in Postgres.
If you want a running count in the sense of "n rows have been out of stock for this number of days or more", use:
count(*) OVER (ORDER BY ped.data_envio::date) -- ascending order!
AS running_count_per_days_out_of_stock
You get the same count for the same day, peers are lumped together.
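To make the difference between the two window counts concrete, here is a small sketch on made-up rows. It uses Python's sqlite3 (SQLite 3.25 or later) in place of Postgres, and a flattened hypothetical table out_of_stock instead of the five-table join from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened table: one row per out-of-stock order line.
conn.execute("CREATE TABLE out_of_stock (product_id INTEGER, ship_date TEXT)")
conn.executemany(
    "INSERT INTO out_of_stock VALUES (?, ?)",
    [
        (5, "2012-10-14"),
        (4, "2012-10-16"),
        (5, "2012-10-16"),
        (7, "2012-10-17"),
    ],
)

# per_day counts rows sharing the same date; running counts rows up to and
# including the current date, so rows on the same day get the same value.
counts = conn.execute(
    """
    SELECT ship_date,
           product_id,
           COUNT(*) OVER (PARTITION BY ship_date) AS per_day,
           COUNT(*) OVER (ORDER BY ship_date)     AS running
    FROM out_of_stock
    ORDER BY ship_date, product_id
    """
).fetchall()
for row in counts:
    print(row)
```

The two rows dated 2012-10-16 both show a running count of 3, which is the "peers are lumped together" behaviour described above.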