Aggregate functions based on current Row value - sql

I am working with data similar to below,
week | product | sale
1 | ABC | 2
1 | ABC | 1
2 | ABC | 1
3 | ABC | 5
4 | ABC | 1
2 | DEF | 5
Let us say that is my orders table, named tblOrders. In each row, I want to show the total sales from the previous week for that product. For instance, on a week 2 row for product "ABC", I need the aggregated sales amount of week 1 for product ABC. So the output should look something like this:
week | product | sale | ProductPreviousWeekSales
1 | ABC | 2 | 0
1 | ABC | 1 | 0
2 | ABC | 1 | 3
3 | ABC | 5 | 1
4 | ABC | 1 | 5
2 | DEF | 5 | 0
I originally thought I could solve this using aggregates and window functions, but that does not seem to be the case. Another thought was to use a conditional aggregate, something like sum(case when x=currentRow.x then sale else 0 end), but that would not work either.
Here is the SQLFiddle for above sample - http://sqlfiddle.com/#!18/890b7/2
Note: I need to calculate a similar value for the last 4 weeks, so I am trying to avoid a sub-query or multiple joins (if possible). The data set I am working with is very large, and I don't want to add too much performance overhead with this change.

Here is one approach which first aggregates your table in a separate CTE and uses LAG to find the previous week's amount, for each week and product:
WITH cte AS (
SELECT week, product,
LAG(SUM(sale)) OVER (PARTITION BY product ORDER BY week) AS lag_total_sales
FROM yourTable
GROUP BY week, product
)
SELECT t1.week, t1.product, t1.sale,
COALESCE(t2.lag_total_sales, 0) AS ProductPreviousWeekSales
FROM yourTable t1
INNER JOIN cte t2
ON t2.week = t1.week AND
t2.product = t1.product
ORDER BY
t1.product,
t1.week;
Demo
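One caveat with LAG is that it returns the previous row that exists for the product, which is the previous week only when no week is missing. The sketch below is a variation rather than the answer's exact query: it aggregates first, lags both the week and the total, and accepts the total only when the lagged week is exactly week - 1. It runs against SQLite through Python's sqlite3 module purely as a stand-in for the real DBMS (an assumption, window functions need SQLite 3.25 or later), and the function name previous_week_sales is made up for the illustration.

```python
import sqlite3

# Sample rows copied from the question: (week, product, sale).
ROWS = [(1, "ABC", 2), (1, "ABC", 1), (2, "ABC", 1),
        (3, "ABC", 5), (4, "ABC", 1), (2, "DEF", 5)]

QUERY = """
WITH weekly AS (                 -- one row per (product, week)
    SELECT week, product, SUM(sale) AS total
    FROM tblOrder
    GROUP BY week, product
),
lagged AS (                      -- previous existing week and its total
    SELECT week, product,
           LAG(week)  OVER (PARTITION BY product ORDER BY week) AS prev_week,
           LAG(total) OVER (PARTITION BY product ORDER BY week) AS prev_total
    FROM weekly
)
SELECT o.week, o.product, o.sale,
       -- accept the lagged total only if it really belongs to week - 1
       CASE WHEN l.prev_week = o.week - 1 THEN l.prev_total ELSE 0 END
           AS ProductPreviousWeekSales
FROM tblOrder o
JOIN lagged l ON l.week = o.week AND l.product = o.product
ORDER BY o.product, o.week, o.sale DESC
"""


def previous_week_sales(rows):
    """Load rows into an in-memory SQLite table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tblOrder (week INTEGER, product TEXT, sale INTEGER)")
    conn.executemany("INSERT INTO tblOrder VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in previous_week_sales(ROWS):
        print(row)
```

With a product that has sales in weeks 1 and 3 only, the week 3 row gets 0 here, whereas an unguarded LAG would report the week 1 total.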

DISCLAIMER
The query I am showing below doesn't work in SQL Server, unfortunately. Up to SQL Server version 2019 the DBMS lacks full support of the RANGE clause that is essential for the query to work. Running the query in SQL Server results in
Msg 4194 Level 16 State 1 Line 1 RANGE is only supported with UNBOUNDED and CURRENT ROW window frame delimiters.
I am not deleting this answer, because this is standard SQL and the approach may help future readers. It runs fine in a lot of DBMS, and maybe a future version of SQL Server will be able to deal with this, too. I've added demos to show that it runs in PostgreSQL, MySQL and Oracle, but fails in SQL Server 2019.
ORIGINAL ANSWER
Your query shown in the fiddle (select a.*, sum(sale) over(partition by product) ProductPreviousWeekSales from tblOrder a) is merely lacking the appropriate windowing clause. As you are dealing with ties here (more than one row per product and week) this needs to be a RANGE clause:
select a.*,
sum(sale) over(partition by product
order by week range between 1 preceding and 1 preceding
) as ProductPreviousWeekSales
from tblOrder a
order by product, week;
(Use COALESCE if you want to see a zero instead of NULL.)
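To see why ties call for RANGE rather than ROWS, here is a small sketch that runs both frames side by side on the question's sample data. It uses SQLite through Python's sqlite3 module as a stand-in (an assumption: RANGE with a numeric offset needs SQLite 3.28 or later). The id column and the names prev_row, prev_week and prev_4_weeks are added only for the illustration. The last column widens the frame to cover the "last 4 weeks" requirement from the question.

```python
import sqlite3

# Sample rows copied from the question: (week, product, sale).
ROWS = [(1, "ABC", 2), (1, "ABC", 1), (2, "ABC", 1),
        (3, "ABC", 5), (4, "ABC", 1), (2, "DEF", 5)]

QUERY = """
SELECT week, product, sale,
       -- ROWS counts physical rows: this is just the preceding row's sale
       COALESCE(SUM(sale) OVER (PARTITION BY product ORDER BY week, id
                                ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 0) AS prev_row,
       -- RANGE compares ORDER BY values: every row whose week equals week - 1
       COALESCE(SUM(sale) OVER (PARTITION BY product ORDER BY week
                                RANGE BETWEEN 1 PRECEDING AND 1 PRECEDING), 0) AS prev_week,
       -- same idea for the four preceding weeks
       COALESCE(SUM(sale) OVER (PARTITION BY product ORDER BY week
                                RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING), 0) AS prev_4_weeks
FROM tblOrder
ORDER BY product, week, id
"""


def window_comparison(rows):
    """Return (week, product, sale, prev_row, prev_week, prev_4_weeks) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tblOrder (id INTEGER PRIMARY KEY, week INTEGER, product TEXT, sale INTEGER)"
    )
    conn.executemany("INSERT INTO tblOrder (week, product, sale) VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in window_comparison(ROWS):
        print(row)
```

The second week 1 row shows the difference: the ROWS frame picks up the sale of the other week 1 row, while the RANGE frame correctly finds no week 0 and yields 0.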
Demos:
https://dbfiddle.uk/?rdbms=postgres_13&fiddle=149eddbff82500d539b2c615f4167cff
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=a8453970efac08ad69275914910bb13e
https://dbfiddle.uk/?rdbms=oracle_18&fiddle=64ed21150142caa0acb7f8c7ca7d9022
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=149eddbff82500d539b2c615f4167cff

You can do it as follows:
; WITH cteorder AS
(
SELECT DISTINCT product, week FROM dbo.tblOrder
)
SELECT
cte.*,
SUM(ISNULL(b.sale,0)) ProductPreviousWeekSales
-- start from the distinct (product, week) list; also joining tblOrder a here
-- would repeat each previous-week row once per current-week row and inflate the sum
FROM cteorder cte
LEFT JOIN dbo.tblOrder b ON b.product = cte.product AND b.week = (cte.week-1)
GROUP BY cte.product,
cte.week
You can run it here: Fiddle

You need to select from tblOrders twice: once grouping by week and product and summing the sales, and a second time as a row-by-row scan of tblOrders, left-joined with the grouping query on the same product and the week offset by 1.
If the join finds no match, the sales value from the grouping query is NULL. You can put in 0 instead of NULL using COALESCE(). In SQL Server, ISNULL() may be marginally faster, since it takes exactly two arguments while COALESCE() accepts a variable argument list, although the difference is unlikely to be noticeable in practice.
WITH
tblorders(wk,product,sales) AS (
SELECT 1,'ABC',2
UNION ALL SELECT 1,'ABC',1
UNION ALL SELECT 2,'ABC',1
UNION ALL SELECT 3,'ABC',5
UNION ALL SELECT 4,'ABC',1
UNION ALL SELECT 2,'DEF',5
)
,
grp AS (
SELECT
wk
, product
, SUM(sales) AS sales
FROM tblorders
GROUP BY
wk
, product
)
SELECT
o.wk
, o.product
, o.sales
, ISNULL(g.sales,0) AS productpreviousweeksales
FROM tblorders o
LEFT
JOIN grp g
ON o.wk - 1 = g.wk
AND o.product= g.product
ORDER BY 2,1
;
wk | product | sales | productpreviousweeksales
----+---------+-------+--------------------------
1 | ABC | 2 | 0
1 | ABC | 1 | 0
2 | ABC | 1 | 3
3 | ABC | 5 | 1
4 | ABC | 1 | 5
2 | DEF | 5 | 0

Related

How to write this BigQuery query on a retail dataset

I have a table of user retail transactions. It includes sales and cancels: if Qty is positive it is a sale, if negative it is a cancel. I want to attach each cancel to the most appropriate sale. So, I have a table like this:
| CustomerId | StockId | Qty | Date |
|--------------+-----------+-------+------------|
| 1 | 100 | 50 | 2020-01-01 |
| 1 | 100 | -10 | 2020-01-10 |
| 1 | 100 | 60 | 2020-02-10 |
| 1 | 100 | -20 | 2020-02-10 |
| 1 | 100 | 200 | 2020-03-01 |
| 1 | 100 | 10 | 2020-03-05 |
| 1 | 100 | -90 | 2020-03-10 |
User with ID 1 has the following actions: buy 50 -> return 10 -> buy 60 -> return 20 -> buy 200 -> buy 10 -> return 90. For each cancel row (with negative Qty) I want to find the previous row (by Date) with a positive Qty that is greater than the cancel Qty.
So I need a BigQuery query that creates a table like this:
| CustomerId | StockId | Qty | Date | CancelQty |
|--------------+-----------+-------+------------+-------------|
| 1 | 100 | 50 | 2020-01-01 | -10 |
| 1 | 100 | 60 | 2020-02-10 | -20 |
| 1 | 100 | 200 | 2020-03-01 | -90 |
| 1 | 100 | 10 | 2020-03-05 | 0 |
Can anybody help me with this query? I have created one candidate query (split cancels and sales, join them, and do some stuff to remove the extra rows), but it works incorrectly in the above case.
I use BigQuery, so any BQ SQL features could be applied.
Any ideas will be helpful.
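To pin down the matching rule before translating it to SQL, here is a procedural sketch in Python of one reading of it: each cancel goes to the most recent sale for the same customer and stock, dated on or before the cancel, whose Qty is greater than the canceled amount. The function name attach_cancels and the tuple layout are made up for the illustration, and the behaviour when several cancels land on the same sale (they are summed) is an assumption, since the sample data does not cover that case.

```python
from datetime import date

# Sample rows from the table above: (CustomerId, StockId, Qty, Date).
TRANSACTIONS = [
    (1, 100, 50, date(2020, 1, 1)),
    (1, 100, -10, date(2020, 1, 10)),
    (1, 100, 60, date(2020, 2, 10)),
    (1, 100, -20, date(2020, 2, 10)),
    (1, 100, 200, date(2020, 3, 1)),
    (1, 100, 10, date(2020, 3, 5)),
    (1, 100, -90, date(2020, 3, 10)),
]


def attach_cancels(rows):
    """Return one (CustomerId, StockId, Qty, Date, CancelQty) tuple per sale."""
    sales = [r for r in rows if r[2] > 0]
    cancels = [r for r in rows if r[2] < 0]
    cancel_qty = [0] * len(sales)
    for cust, stock, qty, when in cancels:
        # Sales of the same customer and stock, not later than the cancel,
        # and strictly larger than the canceled amount.
        candidates = [
            i for i, s in enumerate(sales)
            if s[0] == cust and s[1] == stock and s[3] <= when and s[2] > -qty
        ]
        if candidates:
            nearest = max(candidates, key=lambda i: sales[i][3])
            cancel_qty[nearest] += qty  # assumption: multiple cancels are summed
    result = [s + (cancel_qty[i],) for i, s in enumerate(sales)]
    return sorted(result, key=lambda r: (r[0], r[1], r[3]))


if __name__ == "__main__":
    for row in attach_cancels(TRANSACTIONS):
        print(row)
```

On the sample data this reproduces the desired table, which makes it usable as a reference when checking a SQL candidate.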
You can use the following query.
;WITH result AS (
select t1.*,t2.Qty as cQty,t2.Date as Date_t2 from
(select *,ROW_NUMBER() OVER (ORDER BY qty DESC) AS [ROW NUMBER] from Test) t1
join
(select *,ROW_NUMBER() OVER (ORDER BY qty) AS [ROW NUMBER] from Test) t2
on t1.[ROW NUMBER] = t2.[ROW NUMBER]
)
select CustomerId,StockId,Qty,Date,ISNULL(cQty, 0) As CancelQty,Date_t2
from (select CustomerId,StockId,Qty,Date,case
when cQty < 0 then cQty
else NULL
end AS cQty,
case
when cQty < 0 then Date_t2
else NULL
end AS Date_t2 from result) t
where qty > 0
order by cQty desc
result: https://dbfiddle.uk
You can do this as a gaps-and-islands problem. Basically, add a grouping column to the rows based on a cumulative reverse count of negative values. Then within each group, choose the first row where the sum is positive. So:
select t.* except (cancelqty, grp),
(case when min(case when cancelqty + qty >= 0 then date end) over (partition by customerid, grp) = date
then cancelqty
else 0
end) as cancelqty
from (select t.*,
-- the cancel quantity of the group; NULL when the group has no cancel
min(case when qty < 0 then qty end) over (partition by customerid, grp) as cancelqty
from (select t.*,
countif(qty < 0) over (partition by customerid order by date desc) as grp
from transactions t
) t
) t;
Note: This works for the data you have provided. However, there may be complicated scenarios where this does not work. In fact, I don't think there is a simple optimal solution assuming that the returns are not connected to the original sales. I would suggest that you fix the data model so you record where the returns come from.
The query below seems to satisfy the conditions and produce the output mentioned. The solution maps each row of the base table (t) to its corresponding cancel row from the same table (t1).
First, a self join on CustomerId and StockId is done, since both rows need to belong to the same customer and product.
Additionally, we bring in only the cancel transactions from t1 that happened on or after the base row in t (t.Dt<=t1.Dt), and the clause t1.Qty<0 ensures that the joined row is a negative quantity.
Further, a cancel cannot be attributed to a sale whose quantity is smaller than the canceled quantity, so I check that the positive quantity is at least as large as the canceled one. This is done by negating the cancel quantity so that the two can be compared directly: -(t1.Qty)<=t.Qty
After the join, we are interested only in the positive quantities, so a WHERE clause (t.Qty>0) filters the cancel rows out of the base table t.
Now each sale is joined to every qualifying cancel row dated on or after the sale. For example, Qty 50 can have several later cancels mapped to it, but we are interested only in the one that came immediately after. So we group by the base row's values and then pick the date of the earliest cancel in the HAVING clause: HAVING IFNULL(t1.dt, '0')=MIN(IFNULL(t1.dt, '0'))
Finally, we get the rows we need, and the last column can be excluded if required using an outer SELECT.
SELECT t.CustomerId,t.StockId,t.Qty,t.Dt,IFNULL(t1.Qty, 0) CancelQty
,t1.dt dt_t1
FROM tbl t
LEFT JOIN tbl t1 ON t.CustomerId=t1.CustomerId AND
t.StockId=t1.StockId
AND t.Dt<=t1.Dt AND t1.Qty<0 AND -(t1.Qty)<=t.Qty
WHERE t.Qty>0
GROUP BY 1,2,3,4
HAVING IFNULL(t1.dt, '0')=MIN(IFNULL(t1.dt, '0'))
ORDER BY 1,2,4,3
fiddle
Consider below approach
with sales as (
select * from `project.dataset.table` where Qty > 0
), cancels as (
select * from `project.dataset.table` where Qty < 0
)
select any_value(s).*,
ifnull(array_agg(c.Qty order by c.Date limit 1)[offset(0)], 0) as CancelQty
from sales s
left join cancels c
on s.CustomerId = c.CustomerId
and s.StockId = c.StockId
and s.Date <= c.Date
and s.Qty > abs(c.Qty)
group by format('%t', s)

SQL select row containing all of the values in interval

I know the question is poorly worded, I'm sorry, I can't really put this problem into words. Here is a representation:
I have two tables: product and availability. A product can have multiple dates when it's available. Example:
Table 1 (products):
id | name | ....
----------------------------------
1 | My product 1 | ....
2 | My product 2 | ....
Table 2 (availability):
id | productId | date
-----------------------------------------
1 | 1 | 2021-01-15
2 | 1 | 2021-01-16
3 | 1 | 2021-01-17
4 | 2 | 2021-01-15
5 | 2 | 2021-01-16
Is there an SQL statement that, given an interval, allows us to fetch a list of products having a row in the availability table for each element of the interval?
For example, given the interval [2021-01-15 -> 2021-01-17], the request should return product 1 because it's available during the entire period (it has a row for each element: the 15th, 16th and 17th). Product 2 isn't returned because it's not available on 2021-01-17.
Is there a way to do this in SQL or do I have to use PL/SQL?
Any help is appreciated,
Thanks
You can use analytical function as follows:
-- distinct: one output row per product rather than one per availability row
select distinct p.* from
(select p.*, count(distinct a.date) over (partition by a.productid) as cnt
from products p
join availability a on a.productid = p.id
where a.date >= date '2021-01-15'
and a.date < date '2021-01-17' + 1 ) p -- alias so that the outer p.* resolves
where cnt = date '2021-01-17' - date '2021-01-15' + 1
Finally, I came up with this, thanks to @Popeye for the inspiration.
select occurence.pid from
(
select a.product_id as pid, count(distinct a.date::date) as cnt
from availability a
where a.date >= '2021-01-15'
and a.date < '2021-01-17'::date + 1
group by a.product_id
) as occurence
where cnt = '2021-01-17'::date - '2021-01-15'::date + 1;
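The counting idea used in both answers (a product qualifies when its number of distinct available days inside the interval equals the length of the interval) can be tried out with a small sketch. It uses SQLite through Python's sqlite3 module as a stand-in for the real database, which is an assumption, and the helper names demo_connection and products_available_throughout are invented for the example. The interval length is computed in Python and passed in as a parameter, which sidesteps dialect-specific date arithmetic.

```python
import sqlite3
from datetime import date


def demo_connection():
    """In-memory copy of the availability table from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE availability (id INTEGER PRIMARY KEY, productId INTEGER, "date" TEXT)'
    )
    conn.executemany(
        'INSERT INTO availability (productId, "date") VALUES (?, ?)',
        [(1, "2021-01-15"), (1, "2021-01-16"), (1, "2021-01-17"),
         (2, "2021-01-15"), (2, "2021-01-16")],
    )
    return conn


def products_available_throughout(conn, start, end):
    """Ids of products that have a row for every day in [start, end]."""
    days = (end - start).days + 1  # inclusive interval length
    rows = conn.execute(
        """
        SELECT productId
        FROM availability
        WHERE "date" BETWEEN ? AND ?
        GROUP BY productId
        HAVING COUNT(DISTINCT "date") = ?
        ORDER BY productId
        """,
        (start.isoformat(), end.isoformat(), days),
    ).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    conn = demo_connection()
    print(products_available_throughout(conn, date(2021, 1, 15), date(2021, 1, 17)))
```

ISO-formatted date strings compare correctly as text, which is why BETWEEN works on the TEXT column here; a real DATE column would not need that trick.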

Counting current items by month

I'm trying to build a monthly tally of active equipment, grouped by service area from a database log table. I think I'm 90% of the way there; I have a list of months, along with the total number of items that existed, and grouped by region.
However, I also need to know the state of each item as they were on the first of each month, and this is the part I'm stuck on. For instance, Item 1 is in region A in January, but moves to Region B in February. Item 2 is marked as 'inactive' in February, so shouldn't be counted. My existing query will always count item 1 in region A, and item 2 as 'active'.
I can correctly show that Item 3 is deleted in March, and Item 4 doesn't show up until the April count. I realize that I'm getting the first values because my query is specifying the min date, I'm just not sure how I need to change it to get what I want.
I think I'm looking for a way to group by Max(OperationDate) for each Month.
The Table looks like this:
| EQUIPID | EQUIPNAME | EQUIPACTIVE | DISTRICT | REGION | OPERATIONDATE | OPERATION |
|---------|-----------|-------------|----------|--------|----------------------|-----------|
| 1 | Item 1 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 2 | Item 2 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 3 | Item 3 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 2 | Item 2 | 0 | 1 | A | 2015-02-10T00:00:00Z | UPD |
| 1 | Item 1 | 1 | 1 | B | 2015-02-15T00:00:00Z | UPD |
| 3 | (null) | (null) | (null) | (null) | 2015-02-21T00:00:00Z | DEL |
| 1 | Item 1 | 1 | 1 | A | 2015-03-01T00:00:00Z | UPD |
| 4 | Item 4 | 1 | 1 | B | 2015-03-10T00:00:00Z | INS |
There is also a subtable that holds attributes that I care about. Its structure is similar. Unfortunately, due to previous design decisions, there is no correlation of operations between the two tables. Any joins will need to be done using the EquipmentID, with the overlapping states matched up for each date.
Current query:
--cte to build date list
WITH calendar (dt) AS
(SELECT &fromdate from dual
UNION ALL
SELECT Add_Months(dt,1)
FROM calendar
WHERE dt < &todate)
SELECT dt, a.district, a.region, count(*)
FROM
(SELECT EQUIPID, DISTRICT, REGION, OPERATION, MIN(OPERATIONDATE ) AS FirstOp, deleted.deldate
FROM Equipment_Log
LEFT JOIN
(SELECT EQUIPID,MAX(OPERATIONDATE) as DelDate
FROM Equipment_Log
WHERE OPERATION = 'DEL'
GROUP BY EQUIPID
) Deleted
ON Equipment_Log.EQUIPID = Deleted.EQUIPID
WHERE OPERATION <> 'DEL' --AND additional unimportant filters
GROUP BY EQUIPID,DISTRICT, REGION , OPERATION, deldate
) a
INNER JOIN calendar
ON (calendar.dt >= FirstOp AND calendar.dt < deldate)
OR (calendar.dt >= FirstOp AND deldate is null)
LEFT JOIN
( SELECT EQUIPID, MAX(OPERATIONDATE) as latestop
FROM SpecialEquip_Table_Log
--where SpecialEquip filters
group by EQUIPID
) SpecialEquip
ON a.EQUIPID = SpecialEquip.EQUIPID and calendar.dt >= SpecialEquip.latestop
GROUP BY dt, district, region
ORDER BY dt, district, region
Take only the last operation of each month for each id. This is what row_number() and where rn = 1 do.
We have the calendar and the data; join them with a partitioned outer join.
I assumed that you need to fill in values for months in which an id has no entries. That is why nvl(lag() ignore nulls) is needed: if something appeared in January, it still exists in February and March, and we need the district and region values from the last non-empty row.
Now you have everything needed for the count. The part involving SpecialEquip_Table_Log is up to you: you left-joined this table but did not use it afterwards, so it is unclear what it is for. Join it if you need it; you have the id.
db<>fiddle
with
calendar(mth) as (
select date '2015-01-01' from dual union all
select add_months(mth, 1) from calendar where mth < date '2015-05-01'),
data as (
select id, dis, reg, dt, op, act
from (
select equipid id, district dis, region reg,
to_char(operationdate, 'yyyy-mm') dt,
row_number()
over (partition by equipid, trunc(operationdate, 'month')
order by operationdate desc) rn,
operation op, nvl(equipactive, 0) act
from t)
where rn = 1 )
select mth, dis, reg, sum(act) cnt
from (
select id, mth,
nvl(dis, lag(dis) ignore nulls over (partition by id order by mth)) dis,
nvl(reg, lag(reg) ignore nulls over (partition by id order by mth)) reg,
nvl(act, lag(act) ignore nulls over (partition by id order by mth)) act
from calendar
left join data partition by (id) on dt = to_char(mth, 'yyyy-mm') )
group by mth, dis, reg
having sum(act) > 0
order by mth, dis, reg
It may seem complicated, so please run subqueries separately at first to see what is going on. And test :) Hope this helps.
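For cross-checking the SQL, the "state as of the first of each month" reading from the question can be sketched procedurally: for every month start, keep the latest log entry per item dated on or before that day, then count the items that are neither deleted nor inactive. This is a different cut-off from the query above, which keeps the last operation inside each month, so the two can legitimately disagree. The function names and the tuple layout below are made up for the illustration.

```python
from collections import Counter
from datetime import date

# (equipid, active, district, region, operation_date, operation), from the question.
LOG = [
    (1, 1, 1, "A", date(2015, 1, 1), "INS"),
    (2, 1, 1, "A", date(2015, 1, 1), "INS"),
    (3, 1, 1, "A", date(2015, 1, 1), "INS"),
    (2, 0, 1, "A", date(2015, 2, 10), "UPD"),
    (1, 1, 1, "B", date(2015, 2, 15), "UPD"),
    (3, None, None, None, date(2015, 2, 21), "DEL"),
    (1, 1, 1, "A", date(2015, 3, 1), "UPD"),
    (4, 1, 1, "B", date(2015, 3, 10), "INS"),
]


def active_counts_as_of(log, as_of):
    """Count active items per (district, region) using each item's latest entry <= as_of."""
    latest = {}
    for entry in sorted(log, key=lambda e: e[4]):
        if entry[4] <= as_of:
            latest[entry[0]] = entry  # later entries overwrite earlier ones
    counts = Counter()
    for _, active, district, region, _, operation in latest.values():
        if operation != "DEL" and active == 1:
            counts[(district, region)] += 1
    return dict(counts)


def monthly_tally(log, month_starts):
    """Map each month start to its per-region counts."""
    return {m: active_counts_as_of(log, m) for m in month_starts}


if __name__ == "__main__":
    months = [date(2015, m, 1) for m in (1, 2, 3, 4)]
    for month, counts in monthly_tally(LOG, months).items():
        print(month, counts)
```

On the sample log, item 3 drops out from March onward and item 4 first appears in the April count, matching the behaviour described in the question.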

Most efficient way of dividing rows by name and then transposing to one column for each name

I'm using standard SQL in Google Bigquery. So I have some data about metrics in this format:
Date | metric_name | metric_level
01/02/2019 | metric_one | 1
02/03/2019 | metric_one | 2
14/02/2019 | metric_two | 6
17/02/2019 | metric_two | 4
01/03/2019 | metric_three | 2
10/03/2019 | metric_three | 7
I want to get it in this format, date history going back one year, and then each metric filled in for each date. If a metric has no data for a particular date then it uses the most recent data point:
Date | metric_one | metric_two | metric_three
..........
01/02/2019 | 1 | null | null
02/02/2019 | 1 | null | null
03/02/2019 | 1 | null | null
...........
...........
13/02/2019 | 1 | null | null
14/02/2019 | 1 | 6 | null
15/02/2019 | 1 | 6 | null
...........
...........
09/03/2019 | 2 | 4 | 2
10/03/2019 | 2 | 4 | 7
11/03/2019 | 2 | 4 | 7
...........
and so on.
I've managed to write some code that does this, but I want to know if there's a more efficient way of doing it. There are actually a lot more than 3 metrics, so if I can make it more efficient in any way then it will save a lot of resources in the long run.
This is my code
WITH date_arr AS(
SELECT
date
FROM UNNEST(
GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(),INTERVAL 365 DAY),
CURRENT_DATE(),
INTERVAL 1 day
)
) AS date
),
metric_one_raw AS (
SELECT
date,
metric_level
FROM database
WHERE metric_name = 'metric_one'
),
metric_one_gapless AS (
SELECT
d.date AS date,
IFNULL(metric_level, LAST_VALUE(metric_level IGNORE NULLS) OVER(window_latest)) AS metric_one
FROM date_arr d
LEFT JOIN metric_one_raw i
ON d.date = i.date
WINDOW window_latest AS (ORDER BY d.date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
),
metric_two_raw AS (
SELECT
date,
metric_level
FROM database
WHERE metric_name = 'metric_two'
),
metric_two_gapless AS (
SELECT
d.date AS date,
IFNULL(metric_level, LAST_VALUE(metric_level IGNORE NULLS) OVER(window_latest)) AS metric_two
FROM date_arr d
LEFT JOIN metric_two_raw i
ON d.date = i.date
WINDOW window_latest AS (ORDER BY d.date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
),
metric_three_raw AS (
SELECT
date,
metric_level
FROM database
WHERE metric_name = 'metric_three'
),
metric_three_gapless AS (
SELECT
d.date AS date,
IFNULL(metric_level, LAST_VALUE(metric_level IGNORE NULLS) OVER(window_latest)) AS metric_three
FROM date_arr d
LEFT JOIN metric_three_raw i
ON d.date = i.date
WINDOW window_latest AS (ORDER BY d.date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
)
SELECT
*
FROM metric_one_gapless
LEFT JOIN metric_two_gapless USING(date)
LEFT JOIN metric_three_gapless USING(date)
Hope that makes sense. Thanks in advance!
You can do the following:
1. Generate the dates.
2. Use a cross join to get all the rows.
3. Use a left join to bring in your data.
4. Use last_value() to fill in NULL values.
In other databases, I would prefer lag(ignore nulls), but BigQuery does not support that.
So:
select d, m.metric,
coalesce(mm.metric_level,
last_value(mm.metric_level ignore nulls) over (partition by m.metric order by d)
) as metric_level
from (select distinct metric from metrics) m cross join
-- generate_date_array takes a start date, an end date and a step
unnest(generate_date_array(date_sub(current_date(), interval 1 year), current_date(), interval 1 day)) d left join
metrics mm
on mm.metric = m.metric and mm.date = d;
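The same steps (generate the dates, carry each metric's most recent value forward, one column per metric) can be sketched in plain Python to check what the result should look like on the question's sample. The function name pivot_forward_fill is made up for the illustration, and values recorded before the start date are deliberately not carried in, which mirrors a left join that only sees the generated dates.

```python
from datetime import date, timedelta

# Sample rows from the question (dd/mm/yyyy there): (date, metric_name, metric_level).
METRICS = [
    (date(2019, 2, 1), "metric_one", 1),
    (date(2019, 3, 2), "metric_one", 2),
    (date(2019, 2, 14), "metric_two", 6),
    (date(2019, 2, 17), "metric_two", 4),
    (date(2019, 3, 1), "metric_three", 2),
    (date(2019, 3, 10), "metric_three", 7),
]


def pivot_forward_fill(rows, start, end):
    """Return one dict per day in [start, end] with the latest known level per metric."""
    names = sorted({name for _, name, _ in rows})
    by_key = {(day, name): level for day, name, level in rows}
    last_seen = {name: None for name in names}  # None until the first data point
    table = []
    day = start
    while day <= end:
        for name in names:
            if (day, name) in by_key:
                last_seen[name] = by_key[(day, name)]
        table.append({"date": day, **last_seen})
        day += timedelta(days=1)
    return table


if __name__ == "__main__":
    for row in pivot_forward_fill(METRICS, date(2019, 2, 1), date(2019, 3, 11)):
        print(row)
```

A single pass over the dates handles any number of metrics, which is the main point of replacing one CTE pair per metric with a partitioned window.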
After doing some research, I came up with the following. Because you are using left joins, and there may be more than one of them (or even a variable number), and because you cannot use DECLARE in the BigQuery web UI, you would probably be better off using the BigQuery REST API through one of its client libraries (C#, Go, Java, Node.js, PHP, Python, or Ruby). That lets you hold the list of metrics in a variable: first run a SELECT DISTINCT to find out which metrics exist, store them, and then loop over them to build and run the left joins you need.
I hope this information helps you, and I'm here if you need more information.

How to return all rows with MAX value meeting a condition of another field in SQL?

I have the following costs table:
+--------+------+-----------+
| Year | ID | Amount |
+--------+------+-----------+
| 1960 | 1 | 100 |
| 1960 | 2 | 200 |
| 1960 | 3 | 200 |
| 1960 | 4 | 150 |
| 1961 | 1 | 300 |
| 1961 | 2 | 200 |
| 1961 | 3 | 100 |
| 1961 | 4 | 300 |
+---------+------+----------+
I want all ID’s having the MAX Amount by Year. For example, for 1960, I want rows with ID's 2 and 3. For 1961, I want rows with ID's 1 and 4.
SELECT Year, ID, Amount FROM costs WHERE Amount = (SELECT MAX(Amount) FROM costs);
The above gets me the rows with the MAX value across all years. But I want a condition that gets me only the rows with the max Amount per year. How do I add a condition to select only records with, for example, Year = 1960?
Please try the query below. It has been tested and works.
You can see the expected result live by following the link below.
SQL Fiddle Live Demo
SELECT
t1.*
FROM
costs t1
WHERE
t1.amount = (
SELECT
MAX(t2.amount)
FROM
costs t2
WHERE
t2.`year` = t1.`year`
);
Try this; it should work.
SELECT
*
FROM
costs
WHERE
(YEAR, amount) IN (
SELECT
YEAR,
max(amount)
FROM
costs
GROUP BY
YEAR
);
One option which should run on all major databases is to use a subquery which finds the max amounts for each year to select the records you want:
SELECT c1.*
FROM costs c1
INNER JOIN
(
SELECT Year, MAX(Amount) AS MaxAmount
FROM costs
GROUP BY Year
) c2
ON c1.Year = c2.Year AND
c1.Amount = c2.MaxAmount
Another way to do this would be to use a correlated subquery:
SELECT c1.*
FROM costs c1
WHERE c1.Amount = (SELECT MAX(c2.Amount) FROM costs c2 WHERE c2.Year = c1.Year)
I expect that joining (the first option) would be the fastest method for larger tables, especially if you have proper indices which could be used.
-- qualify the columns: Year and Amount exist in both T1 and A
SELECT T1.Year, T1.ID, T1.Amount
FROM #Table T1
JOIN
(
SELECT MAX(Amount) Amount,Year
FROM #Table
GROUP BY Year
) A ON A.Year = T1.Year AND A.Amount = T1.Amount
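As a quick check that the join against a grouped subquery and the correlated subquery return the same rows, including ties, here is a small sketch that runs both on the question's costs table. SQLite via Python's sqlite3 module is used as a stand-in for the actual database (an assumption), and the helper name run_query is made up for the example.

```python
import sqlite3

# Rows from the question: (Year, ID, Amount).
COSTS = [(1960, 1, 100), (1960, 2, 200), (1960, 3, 200), (1960, 4, 150),
         (1961, 1, 300), (1961, 2, 200), (1961, 3, 100), (1961, 4, 300)]

# Join each row to the per-year maximum.
JOIN_QUERY = """
SELECT c1.Year, c1.ID, c1.Amount
FROM costs c1
JOIN (SELECT Year, MAX(Amount) AS MaxAmount FROM costs GROUP BY Year) c2
  ON c1.Year = c2.Year AND c1.Amount = c2.MaxAmount
ORDER BY c1.Year, c1.ID
"""

# Look up the maximum of the current row's year for every row.
CORRELATED_QUERY = """
SELECT c1.Year, c1.ID, c1.Amount
FROM costs c1
WHERE c1.Amount = (SELECT MAX(c2.Amount) FROM costs c2 WHERE c2.Year = c1.Year)
ORDER BY c1.Year, c1.ID
"""


def run_query(sql, rows=COSTS):
    """Load rows into an in-memory costs table and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE costs (Year INTEGER, ID INTEGER, Amount INTEGER)")
    conn.executemany("INSERT INTO costs VALUES (?, ?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_query(JOIN_QUERY):
        print(row)
```

Both forms keep every tied row, so IDs 2 and 3 come back for 1960 and IDs 1 and 4 for 1961.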