Finding cumulative totals with condition and group by - sql

Here is my case:
I would like to calculate quantity and price for a given item on any given date.
Prices are calculated using total item quantity and unit price so price changes with respect to item's quantity.
warehouse_1 states that item was shipped from that warehouse, warehouse_2 states that item was sent to that warehouse.
Here is my logic:
Fetch deliveries for each item and sum their quantities. (1st CTE)
Find the sum of quantities in both warehouses separately. (2nd CTE)
Calculate final quantity and multiply it by unit price.
Show result which consists of item id, quantity and price.
I wrote a query which does the calculations correctly BUT it gets exponentially slower when data count gets bigger. (Takes 5 seconds on my DB with 6k rows, almost locks DB on my coworker's DB with 21k rows)
How can I optimize this query? I am doing cumulative calculations on 2nd CTE for each row coming from 1st CTE and that needs a rework I believe.
Can I use the `LAG()` function for this use case? I tried that with something like `LAG(a.deliveryTotal) over(order by a.updated desc rows between unbounded preceding and current row)` instead of the CASE block in the 2nd CTE, but I can't seem to figure out how to use filter() or put a condition inside the LAG() statement.
Here is my query:
`
with deliveriesCTE as (
select
row_number() over(partition by it.id
order by
dd.updated asc) as rn,
sum(dd.quantity) as deliveryTotal,
dd.updated as updated,
it.id as item_id,
d.warehouse_1 as outWH,
d.warehouse_2 as inWH,
d.company_code as company
from
deliveries d
join deliveries_detail dd on
dd.deliveries_id = d.id
join items it on
it.id = dd.item_id
where
...
group by
dd.updated,
it.id,
d.warehouse_1,
d.warehouse_2,
d.company_code
order by
dd.updated asc),
cumulativeTotalsByUnit as (
select
distinct on
(a.item_id) a.rn,
a.deliveryTotal,
a.updated,
a.item_id,
a.outWH,
a.inWH,
a.company,
case
when a.rn = 1
and a.outWH is not null then coalesce(a.deliveryTotal,
0)
else (
select
coalesce(sum(b.deliveryTotal) filter(
where b.outWH is not null),
0)
from
deliveriesCTE b
where
a.item_id = b.item_id
and b.rn <= a.rn)
end as outWHTotal,
case
when a.rn = 1
and a.inWH is not null then coalesce(a.deliveryTotal,
0)
else (
select
coalesce(sum(b.deliveryTotal) filter(
where b.inWH is not null),
0)
from
deliveriesCTE b
where
a.item_id = b.item_id
and b.rn <= a.rn)
end as inWHTotal
from
deliveriesCTE a
order by
a.item_id,
a.updated desc)
select
resultView.item_id,
resultView.quantity,
resultView.price
from
(
select
cumTotals.item_id,
cumTotals.inWHTotal - cumTotals.outWHTotal as quantity,
p.price * (cumTotals.inWHTotal - cumTotals.outWHTotal) as price
from
prices p
join cumulativeTotalsByUnit cumTotals on
cumTotals.item_id = p.item_id ) resultView
where
resultView.rn = 1;
`

It's hard to say for sure without an MCVE, but my guess is that what you want is a windowed SUM() calculation rather than LAG(). The window function documentation covers this.
The cumulativeTotalsByUnit query shouldn't be necessary, and it is likely quadratic, because its correlated subqueries rescan deliveriesCTE for every row.
Your delivery CTE should look like:
select
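-- running total of the grouped sums: in a GROUP BY query the window SUM must wrap the aggregate
sum(sum(dd.quantity)) over (partition by it.id order by dd.updated asc) as deliveryTotal,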
dd.updated as updated,
it.id as item_id,
d.warehouse_1 as outWH,
d.warehouse_2 as inWH,
d.company_code as company
from
deliveries d
join deliveries_detail dd on
dd.deliveries_id = d.id
join items it on
it.id = dd.item_id
where
...
group by
dd.updated,
it.id,
d.warehouse_1,
d.warehouse_2,
d.company_code
order by
dd.updated asc
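As a minimal sketch of the windowed SUM() idea with the in/out condition included, the following runs on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer). The `movements` table, its columns, and the sample rows are hypothetical stand-ins for the joined deliveries data, and CASE is used in place of filter() to keep it portable. It is an illustration under those assumptions, not the asker's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical flattened delivery rows: out_wh set = shipped out, in_wh set = received.
CREATE TABLE movements (item_id INTEGER, updated TEXT, quantity INTEGER,
                        out_wh TEXT, in_wh TEXT);
INSERT INTO movements VALUES
  (1, '2021-01-01', 10, NULL, 'W1'),
  (1, '2021-01-02',  4, 'W1', NULL),
  (1, '2021-01-03',  5, NULL, 'W1'),
  (2, '2021-01-01',  7, NULL, 'W2'),
  (2, '2021-01-05',  2, 'W2', NULL);
CREATE TABLE prices (item_id INTEGER, price REAL);
INSERT INTO prices VALUES (1, 2.5), (2, 3.0);
""")

# Running in/out totals per item: one pass, no self-join.
running_sql = """
SELECT item_id, updated,
       SUM(CASE WHEN in_wh  IS NOT NULL THEN quantity ELSE 0 END)
           OVER (PARTITION BY item_id ORDER BY updated) AS in_total,
       SUM(CASE WHEN out_wh IS NOT NULL THEN quantity ELSE 0 END)
           OVER (PARTITION BY item_id ORDER BY updated) AS out_total
FROM movements
"""
running_totals = conn.execute(
    running_sql + " ORDER BY item_id, updated").fetchall()

# Keep only the most recent row per item and price the remaining quantity.
latest_sql = f"""
WITH running AS ({running_sql}),
latest AS (
  SELECT r.*,
         ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY updated DESC) AS rn
  FROM running r
)
SELECT l.item_id,
       l.in_total - l.out_total AS quantity,
       p.price * (l.in_total - l.out_total) AS price
FROM latest l
JOIN prices p ON p.item_id = l.item_id
WHERE l.rn = 1
ORDER BY l.item_id
"""
latest = conn.execute(latest_sql).fetchall()
print(running_totals)
print(latest)
```

Each running total is computed in a single ordered pass per item, which avoids re-reading the CTE once per row.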

Related

Finding the most recent order duplicate from a customer order table - SLOW CROSS APPLY

I want to only show orders where the customer had an order for the same exact item, but only the most recent, completed order placed before.
I want to get the most immediate order placed for the same customer, for the same item. Like showing duplicates, but just the most recent.
The query works fine in accomplishing what I want it to do, but when I add the cross apply to my actual query, it slows it down by a LOT.
EDIT: I've also tried select top 1 rather than using the row number. The rownumber line makes it only 1 second faster.
create table #orders (
ord_id numeric(7,0),
customer_id numeric(4,0),
order_time datetime,
item_id numeric (4,0),
status int NOT NULL
)
insert into #orders values
(1516235,5116,'06/04/2021 11:06:00', 5616, 1),
(1516236,5116,'06/03/2021 13:51:00', 5616, 1),
(1514586,5554,'06/01/2021 08:16:00', 5616, 1),
(1516288,5554,'06/01/2021 15:35:00', 5616, 1),
(1516241,5554,'06/04/2021 11:11:00', 4862, 1),
(1516778,5554,'06/04/2021 11:05:00', 4862, 2)
select distinct *
from #orders o
cross apply (
select a.ord_id, row_number() over (partition by a.customer_id order by a.order_time) as rownum
from #orders a
where a.customer_id = o.customer_id and
a.status != 2 and
a.item_id = o.item_id and
a.order_time < o.order_time
)a
where a.rownum = 1
Is there some other way I can do this? How can I speed this up?
The previous order has to have
an order time before the other orders
the same customer record
the same item record
the most recent of all the other records before
a status of not cancelled (1 = Complete; 2 = Cancelled)
That's silly. Here's a simpler method using cross apply:
select o.*
from #orders o cross apply
(select top (1) a.ord_id
from #orders a
where a.customer_id = o.customer_id and
a.status <> 2 and
a.item_id = o.item_id and
a.order_time < o.order_time
order by a.order_time
) a;
This can use an index on (customer_id, item_id, status, order_time).
Note: If you want the most recent of the previous order, then the order by should use desc. However, that is not how the code is phrased in the question.
And, you should be able to use window functions. If ord_id increases with time:
min(case when status <> 2 then ord_id end) over (partition by customer_id, item_id)
Even if this is not true, there is a variation, but it is more complicated (i.e. requires a subquery) because of the filtering on status.
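One possible shape for the subquery-based window variation is sketched below, run on SQLite through Python's sqlite3 module rather than SQL Server, so the table is a plain `orders` table and the dates are rewritten as ISO strings. It assumes that only non-cancelled orders need to be reported and that "previous" means the most recent earlier order. Cancelled rows are filtered out first, then LAG() picks the preceding order for the same customer and item.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (ord_id INTEGER, customer_id INTEGER, order_time TEXT,
                     item_id INTEGER, status INTEGER NOT NULL);
INSERT INTO orders VALUES
  (1516235, 5116, '2021-06-04 11:06:00', 5616, 1),
  (1516236, 5116, '2021-06-03 13:51:00', 5616, 1),
  (1514586, 5554, '2021-06-01 08:16:00', 5616, 1),
  (1516288, 5554, '2021-06-01 15:35:00', 5616, 1),
  (1516241, 5554, '2021-06-04 11:11:00', 4862, 1),
  (1516778, 5554, '2021-06-04 11:05:00', 4862, 2);
""")

prev_orders = conn.execute("""
SELECT ord_id, prev_ord_id
FROM (
  -- Filter cancelled orders (status 2) before the window is evaluated,
  -- so LAG() only ever sees completed orders.
  SELECT ord_id,
         LAG(ord_id) OVER (PARTITION BY customer_id, item_id
                           ORDER BY order_time) AS prev_ord_id
  FROM orders
  WHERE status <> 2
) AS completed
WHERE prev_ord_id IS NOT NULL
ORDER BY ord_id
""").fetchall()
print(prev_orders)
```

Unlike the cross apply versions, this sketch never returns a cancelled order as the current row, so it is not a drop-in replacement if those rows must appear.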

SQLite query with LIMIT per column

I am trying to compose a query with a where condition to get multiple unique sorted columns without having to do it in multiple queries. That is confusing so here is an example...
Price Table
id | item_id | date | price
I want to query to find the most recent price of multiple items given a date. I was previously iterating through items in my application code and getting the most recent price like this...
SELECT * FROM prices WHERE item_id = ? AND date(date) < date(?) ORDER BY date(date) DESC LIMIT 1
Iterating through each item and doing a query is too slow so I am wondering if there is a way I can accomplish this same query for multiple items in one go. I have tried UNION but I cannot get it to work with the ORDER BY and LIMIT commands like this thread says (https://stackoverflow.com/a/1415380/4400804) for MySQL
Any ideas on how I can accomplish this?
Try this (based on adapting the answer):
SELECT * FROM prices a WHERE a.RowId IN (
SELECT b.RowId
FROM prices b
WHERE a.item_id = b.item_id AND b.date < ?
ORDER BY b.date DESC LIMIT 1
) ORDER BY date DESC;
Window functions (Available with sqlite 3.25 and newer) will likely help:
WITH ranked AS
(SELECT id, item_id, date, price
, row_number() OVER (PARTITION BY item_id ORDER BY date DESC) AS rn
FROM prices
WHERE date < ?)
SELECT id, item_id, date, price
FROM ranked
WHERE rn = 1
ORDER BY item_id;
This returns the most recent row for each item_id among all records older than the given date.
I would simply use a correlated subquery in the `where` clause:
SELECT p.*
FROM prices p
WHERE p.DATE = (SELECT MAX(p2.date)
FROM prices p2
WHERE p2.item_id = p.item_id
);
This is phrased so it works on all items. You can, of course, add filtering conditions (in the outer query) for a given set of items.
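The query above returns the latest price overall. To honor the "given a date" part of the question, the cutoff presumably has to go inside the subquery, so that MAX() only sees earlier rows. A minimal sketch using Python's sqlite3 module, with made-up sample rows and ISO date strings as assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices (id INTEGER PRIMARY KEY, item_id INTEGER,
                     date TEXT, price REAL);
INSERT INTO prices (item_id, date, price) VALUES
  (1, '2021-01-01', 10.0),
  (1, '2021-02-01', 12.0),
  (1, '2021-03-01', 15.0),
  (2, '2021-01-15',  5.0),
  (2, '2021-02-20',  6.0);
""")

cutoff = "2021-02-15"  # hypothetical "given date"
latest_before = conn.execute("""
SELECT p.item_id, p.date, p.price
FROM prices p
WHERE p.date = (SELECT MAX(p2.date)
                FROM prices p2
                WHERE p2.item_id = p.item_id
                  AND p2.date < ?)   -- cutoff applied inside the subquery
ORDER BY p.item_id
""", (cutoff,)).fetchall()
print(latest_before)
```

Filtering on the date only in the outer query would instead drop every item whose overall latest price is on or after the cutoff.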
With NOT EXISTS:
SELECT p.* FROM prices p
WHERE NOT EXISTS (
SELECT 1 FROM prices
WHERE item_id = p.item_id AND date > p.date
)
or with a join of the table to a query that returns the last date for each item_id:
SELECT p.*
FROM prices p INNER JOIN (
SELECT item_id, MAX(date) date
FROM prices
GROUP BY item_id
) t ON t.item_id = p.item_id AND t.date = p.date

SQL Windowing Ranks Functions

SELECT
*
FROM (
SELECT
Product,
SalesAmount,
ROW_NUMBER() OVER (ORDER BY SalesAmount DESC) as RowNum,
RANK() OVER (ORDER BY SalesAmount DESC) as RankOf2007,
DENSE_RANK() OVER (ORDER BY SalesAmount DESC) as DRankOf2007
FROM (
SELECT
c.EnglishProductName as Product,
SUM(a.SalesAmount) as SalesAmount,
b.CalendarYear as CalenderYear
FROM FactInternetSales a
INNER JOIN DimDate b
ON a.OrderDateKey=b.DateKey
INNER JOIN DimProduct c
ON a.ProductKey=c.ProductKey
WHERE b.CalendarYear IN (2007)
GROUP BY c.EnglishProductName,b.CalendarYear
) Sales
) Rankings
WHERE [RankOf2007] <= 5
ORDER BY [SalesAmount] DESC
I am currently sorting products by the sum of their Sales Amount in descending order and ranking them on that sum for 2007, so the product with the highest Sales Amount in that year gets rank 1, and so forth.
Currently my result looks like the table in the image (apart from the RankOf2008 and DRankOf2008 columns). I would like to add the 2008 rankings for the same top 5 products of 2007 (a NULL value if any of those products was unsold in 2008) as side-by-side columns in the same result, as shown in the image.
Maybe you need something like this.
First compute the ranks for all products partitioned by year (that is, the rank of each product within its year), then fetch the required data with the help of a CTE.
WITH cte
AS (
SELECT *
FROM (
SELECT Product
,SalesAmount
,CalenderYear
,ROW_NUMBER() OVER (
PARTITION BY CalenderYear ORDER BY SalesAmount DESC
) AS RowNum
,RANK() OVER (
PARTITION BY CalenderYear ORDER BY SalesAmount DESC
) AS RankOf2007
,DENSE_RANK() OVER (
PARTITION BY CalenderYear ORDER BY SalesAmount DESC
) AS DRankOf2007
FROM (
SELECT c.EnglishProductName AS Product
,SUM(a.SalesAmount) AS SalesAmount
,b.CalendarYear AS CalenderYear
FROM FactInternetSales a
INNER JOIN DimDate b ON a.OrderDateKey = b.DateKey
INNER JOIN DimProduct c ON a.ProductKey = c.ProductKey
--WHERE b.CalendarYear IN (2007)
GROUP BY c.EnglishProductName
,b.CalendarYear
) Sales
) Rankings
--WHERE [RankOf2007] <= 5
--ORDER BY [SalesAmount] DESC
)
SELECT a.*
,b.DRankOf2007 AS [DRankOf2008]
,b.RankOf2007 AS [RankOf2008]
FROM cte a
LEFT JOIN cte b ON a.Product = b.Product
AND b.CalenderYear = 2008
WHERE a.CalenderYear = 2007
AND a.[RankOf2007] <= 5
Use conditional aggregation in your innermost query (i.e. select both years and sum conditionally for one of the years):
select
p.productkey,
p.englishproductname as product,
ranked.salesamount2007,
ranked.salesamount2008,
ranked.rankof2007,
ranked.rankof2008
from
(
select
productkey,
salesamount2007,
salesamount2008,
rank() over (order by salesamount2007 desc) as rankof2007,
rank() over (order by salesamount2008 desc) as rankof2008
from
(
select
s.productkey,
sum(case when d.calendaryear = 2007 then s.salesamount end) as salesamount2007,
sum(case when d.calendaryear = 2008 then s.salesamount end) as salesamount2008
from factinternetsales s
inner join dimdate d on d.datekey = s.orderdatekey
where d.calendaryear in (2007, 2008)
group by s.productkey
) aggregated
) ranked
join dimproduct p on p.productkey = ranked.productkey
where ranked.rankof2007 <= 5
order by ranked.rankof2007 desc;
If there are no rows for a product in 2008, salesamount2008 will be null. In standard SQL we would account for this in the ORDER BY clause:
rank() over (order by salesamount2008 desc nulls last) as rankof2008
But SQL Server doesn't comply with the SQL standard here and doesn't feature NULLS FIRST/LAST in the ORDER BY clause. Fortunately, it sorts nulls last when sorting in descending order, so it implicitly does just what we want here.
By the way: we could do the aggregation and ranking in a single step, but in that case we'd have to repeat the SUM expressions. It's a matter of personal preference, whether to do this in one step (shorter query) or two steps (no repetitive expressions).
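For illustration, the single-step form might look like the sketch below, where the SUM expressions are repeated inside the RANK() window definitions. It runs on SQLite via Python's sqlite3 module instead of SQL Server, and the flat `sales` table and its rows are hypothetical. SQLite also treats NULL as the smallest value, so it sorts last in descending order here as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, year INTEGER, amount INTEGER);
INSERT INTO sales VALUES
  ('A', 2007, 100), ('A', 2007, 50), ('A', 2008, 30),
  ('B', 2007, 200), ('B', 2008, 300),
  ('C', 2007, 80);          -- C has no 2008 sales
""")

ranked = conn.execute("""
SELECT product,
       SUM(CASE WHEN year = 2007 THEN amount END) AS salesamount2007,
       SUM(CASE WHEN year = 2008 THEN amount END) AS salesamount2008,
       -- Aggregation and ranking in one step: each SUM is written twice.
       RANK() OVER (ORDER BY SUM(CASE WHEN year = 2007 THEN amount END) DESC)
           AS rankof2007,
       RANK() OVER (ORDER BY SUM(CASE WHEN year = 2008 THEN amount END) DESC)
           AS rankof2008
FROM sales
WHERE year IN (2007, 2008)
GROUP BY product
ORDER BY rankof2007
""").fetchall()
print(ranked)
```

Window functions are evaluated after GROUP BY, which is why they can rank on the aggregated sums directly.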

How to get the records from inner query results with the MAX value

The results are below. I need to get the records (seller and purchaser) with the max count- grouped by purchaser (marked with yellow)
You can use window functions:
with q as (
<your query here>
)
select q.*
from (select q.*,
row_number() over (order by seller desc) as seqnum_s,
row_number() over (order by purchaser desc) as seqnum_p
from q
) q
where seqnum_s = 1 or seqnum_p = 1;
Try this:
SELECT COUNT,seller,purchaser FROM YourTable ORDER BY seller,purchaser DESC
SELECT T2.MaxCount,T2.purchaser,T1.Seller FROM <Yourtable> T1
Inner JOIN
(
Select Max(Count) as MaxCount, purchaser
FROM <Yourtable>
GROUP BY Purchaser
)T2
On T2.Purchaser=T1.Purchaser AND T2.MaxCount=T1.Count
First you select the Seller from <Yourtable>, which will give you a list of all 5 sellers. Then you write another query where you select only the Purchaser and the Max(Count) grouped by Purchaser, which will give you the two yellow-marked lines. Join the two queries on the fields Purchaser and Max(Count) and add the columns from the joined table to your first query.
I can't think of a faster way, but this works pretty fast even with rather large queries. You can further order the fields as needed.
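The join described above can be sketched in runnable form with Python's sqlite3 module. The `seller_counts` table, the `cnt` column name (used instead of Count to avoid clashing with the function name), and the five sample rows are all assumptions, since the original results were only shown as an image.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seller_counts (cnt INTEGER, seller TEXT, purchaser TEXT);
INSERT INTO seller_counts VALUES
  (3, 's1', 'p1'),
  (7, 's2', 'p1'),
  (4, 's3', 'p2'),
  (2, 's4', 'p2'),
  (1, 's5', 'p2');
""")

top_per_purchaser = conn.execute("""
SELECT t2.max_cnt, t2.purchaser, t1.seller
FROM seller_counts t1
INNER JOIN (
  -- One row per purchaser holding that purchaser's highest count.
  SELECT MAX(cnt) AS max_cnt, purchaser
  FROM seller_counts
  GROUP BY purchaser
) t2 ON t2.purchaser = t1.purchaser
    AND t2.max_cnt = t1.cnt
ORDER BY t2.purchaser
""").fetchall()
print(top_per_purchaser)
```

If two sellers tie for the maximum count of one purchaser, this join returns both rows.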

ORACLE SQL Return only duplicated values (not the original)

I have a database with the following info
Customer_id, plan_id, plan_start_dte,
Since some customer switch plans, there are customers with several duplicated customer_ids, but with different plan_start_dte. I'm trying to count how many times a day members switch to the premium plan from any other plan ( plan_id = 'premium').
That is, I'm trying to do roughly this: return all rows with duplicate customer_id, except for the original plan (min(plan_start_dte)), where plan_id = 'premium', and group them by plan_start_dte.
I'm able to get all duplicate records with their count:
with plan_counts as (
select c.*, count(*) over (partition by CUSTOMER_ID) ct
from CUSTOMERS c
)
select *
from plan_counts
where ct > 1
The other steps have me stuck. First I tried to select everything except the original plan:
SELECT CUSTOMERS c
where START_DTE not in (
select min(PLAN_START_DTE)
from CUSTOMERS i
where c.CUSTOMER_ID = i.CUSTOMER_ID
)
But this failed. If I can solve this I believe all I have to add is an additional condition where c.PLAN_ID = 'premium' and then group by date and do a count. Anyone have any ideas?
I think you want lag():
select c.*
from (select c.*,
lag(plan_id) over (partition by customer_id order by plan_start_dte) as prev_plan_id
from customers c
) c
where prev_plan_id <> 'premium' and plan_id = 'premium';
I'm not sure what output you want. For the number of times this occurs per day:
select plan_start_dte, count(*)
from (select c.*, lag(plan_id) over (partition by customer_id order by plan_start_dte) as prev_plan_id
from customers c
) c
where prev_plan_id <> 'premium' and plan_id = 'premium'
group by plan_start_dte
order by plan_start_dte;
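To see why the comparison on prev_plan_id also excludes the original plan, here is a small sketch of the same lag() query run on SQLite through Python's sqlite3 module rather than Oracle. The sample customers are invented for illustration. For a customer's first row lag() returns NULL, and `NULL <> 'premium'` is not true, so customers who started on premium are not counted as switches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, plan_id TEXT, plan_start_dte TEXT);
INSERT INTO customers VALUES
  (1, 'basic',   '2021-01-01'),
  (1, 'premium', '2021-02-01'),   -- switch
  (2, 'premium', '2021-01-05'),   -- original plan, not a switch
  (3, 'basic',   '2021-01-10'),
  (3, 'gold',    '2021-01-20'),
  (3, 'premium', '2021-02-01'),   -- switch
  (4, 'basic',   '2021-01-03'),
  (4, 'premium', '2021-02-02');   -- switch
""")

switches_per_day = conn.execute("""
SELECT plan_start_dte, COUNT(*)
FROM (
  SELECT c.*,
         LAG(plan_id) OVER (PARTITION BY customer_id
                            ORDER BY plan_start_dte) AS prev_plan_id
  FROM customers c
) c
WHERE prev_plan_id <> 'premium' AND plan_id = 'premium'
GROUP BY plan_start_dte
ORDER BY plan_start_dte
""").fetchall()
print(switches_per_day)
```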