Adding additional group by in running average

Adding additional group by in running average - sql

I have the current code that is working
select format_date('%Y%m', date) as yyyymm,
(sum(sum(val)) over (order by min(date)) /
sum(count(*)) over (order by min(date))
) as running_avg
from t
group by yyyymm
order by yyyymm;
Returns
yyyymm Score
201712 25.57931742
201801 24.69794466
201802 24.23110781
201803 23.85651947
201804 23.66164799
201805 23.43029053
201806 23.17074628
201807 23.09766588
201808 23.08902284
I am now trying to add an additional group by clause, for department. The query runs however the results are inaccurate, can anyone recognize what i am doing incorrectly?
select format_date('%Y%m', date) as yyyymm, department
(sum(sum(val)) over (order by min(date)) /
sum(count(*)) over (order by min(date))
) as running_avg
from t
group by yyyymm, department
order by yyyymm;
Returns
yyyymm department Score
201712 HR 6.704365079
201712 F&B 8.550338502
201712 Marketing 8.550338502
201712 I.T. 9.857502908
201712 Security 9.551491994
201712 Contractors 9.411654456
201712 Executive Office 9.637075283
201712 Property Services 9.45905826
201712 Corporate 9.57458477
201712 Legal 9.700320268

You need to add department to the partition by:
select department, format_date('%Y%m', date) as yyyymm,
(sum(sum(val)) over (partition by department order by min(date)) /
sum(count(*)) over (partition by department order by min(date))
) as running_avg
from t
group by yyyymm, department
order by department, yyyymm;

Related

Using a where statement with rank and a subquery in SQL

so I have a table that's sort of like this:
DELIVERY_AREA_ID
DELIVERY_RADIUS_METERS
EVENT_STARTED_TIMESTAMP
234sfd
4000
2020-01-01 12:19:29.719
234sfd
6500
2020-01-01 12:31:40.325
234sfd
3500
2020-01-01 12:53:10.538
234sfd
6500
2020-01-01 13:11:36.094
234sfd
3500
2020-01-01 13:32:26.754
234sfd
6500
2020-01-01 13:59:11.104
234sfd
6500
2020-01-02 07:44:16.792
234sfd
3500
2020-01-02 08:07:36.284
234sfd
6500
2020-01-02 08:54:08.014
234sfd
3500
2020-01-02 09:53:05.853
234sfd
6500
2020-01-02 10:04:39.443
234sfd
10000
2020-07-01 08:29:20.194
234sfd
3500
2020-07-03 07:50:41.782
234sfd
10000
2020-07-03 08:33:14.695
234sfd
3500
2020-07-05 07:47:05.539
234sfd
10000
2020-07-05 07:53:13.930
234sfd
3500
2020-07-05 09:18:57.688
234sfd
10000
2020-07-05 09:51:07.547
234sfd
3500
2020-07-19 18:02:14.099
the data is actually much more varied but yeah it follows that format.
I am trying to, in one query, in snowflake database, make a get the top ranked radius by duration. I currently have this:
SELECT DELIVERY_AREA_ID,
MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
MONTH_YEAR,
DELIVERY_RADIUS_METERS,
SUM(DURATION_SECONDS) AS total_duration,
MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP,
RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
ORDER BY SUM(DURATION_SECONDS) DESC) AS RADIUS_RANK
FROM (
-- Add the MONTH_YEAR column to the delivery_radius_log table
SELECT DELIVERY_AREA_ID,
DELIVERY_RADIUS_METERS,
EVENT_STARTED_TIMESTAMP,
CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
FROM delivery_radius_log
) t -- added alias here
GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
I want to get the first rank for each month_year but when I use
where RADIUS_RANK = 1
I get an error: Syntax error: unexpected 'where'. (line 21)
Im not sure how to resolve this
I have tried this link which appears to have the same question but the solution is already what I am trying.

It is not possible to solve this scenario without querying the output of your query, in other words, using the output of that query as an input for another top-level query.
You can not use a field produced at the projection level in the WHERE clause
You can not use analytic functions in the WHERE clause
You can not use analytic functions in a HAVING clause
So the only solution is to query the output of that query and retrieve only the MIN rank.

To filter windowed function at the same query level you need to use QUALIFY clause:
SELECT DELIVERY_AREA_ID,
MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
MONTH_YEAR,
DELIVERY_RADIUS_METERS,
SUM(DURATION_SECONDS) AS total_duration,
MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP,
RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
ORDER BY SUM(DURATION_SECONDS) DESC) AS RADIUS_RANK
FROM (
-- Add the MONTH_YEAR column to the delivery_radius_log table
SELECT DELIVERY_AREA_ID,
DELIVERY_RADIUS_METERS,
EVENT_STARTED_TIMESTAMP,
CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
FROM delivery_radius_log
) t -- added alias here
GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
QUALIFY RADIUS_RANK = 1;
If the rank column is not required then the entire expression could be moved:
SELECT DELIVERY_AREA_ID,
MAX(DELIVERY_RADIUS_METERS) AS default_delivery_radius,
MONTH_YEAR,
DELIVERY_RADIUS_METERS,
SUM(DURATION_SECONDS) AS total_duration,
MAX(EVENT_STARTED_TIMESTAMP) AS MAX_TIMESTAMP
FROM (
-- Add the MONTH_YEAR column to the delivery_radius_log table
SELECT DELIVERY_AREA_ID,
DELIVERY_RADIUS_METERS,
EVENT_STARTED_TIMESTAMP,
CONCAT(MONTH(EVENT_STARTED_TIMESTAMP), '/',
YEAR(EVENT_STARTED_TIMESTAMP)) AS MONTH_YEAR,
DATEADD(second, DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)), EVENT_STARTED_TIMESTAMP) AS end_timestamp,
DATEDIFF(second, EVENT_STARTED_TIMESTAMP, LEAD(EVENT_STARTED_TIMESTAMP) OVER (PARTITION BY DELIVERY_AREA_ID ORDER BY EVENT_STARTED_TIMESTAMP)) AS duration_seconds
FROM delivery_radius_log
) t -- added alias here
GROUP BY DELIVERY_AREA_ID, MONTH_YEAR, DELIVERY_RADIUS_METERS
QUALIFY RANK() OVER (PARTITION BY DELIVERY_AREA_ID, MONTH_YEAR
ORDER BY SUM(DURATION_SECONDS) DESC) = 1;

Sales amounts of the top n selling vendors by month in bigquery

i have a table in bigquery like this (260000 rows):
vendor date item_price
x 2021-07-08 23:41:10 451,5
y 2021-06-14 10:22:10 41,7
z 2020-01-03 13:41:12 74
s 2020-04-12 01:14:58 88
....
exactly what I want is to group this data by month and find the sum of the sales of only the top 20 vendors in that month. Expected output:
month sum_of_only_top20_vendor's_sales
2020-01 7857
2020-02 9685
2020-03 3574
2020-04 7421
.....

Consider below approach
select month, sum(sale) as sum_of_only_top20_vendor_sales
from (
select vendor,
format_datetime('%Y%m', date) month,
sum(item_price) as sale
from your_table
group by vendor, month
qualify row_number() over(partition by month order by sale desc) <= 20
)
group by month

Another solution that potentially can show much much better performance on really big data:
select month,
(select sum(sum) from t.top_20_vendors) as sum_of_only_top20_vendor_sales
from (
select
format_datetime('%Y%m', date) month,
approx_top_sum(vendor, item_price, 20) top_20_vendors
from your_table
group by month
) t
or with a little refactoring
select month, sum(sum) as sum_of_only_top20_vendor_sales
from (
select
format_datetime('%Y%m', date) month,
approx_top_sum(vendor, item_price, 20) top_20_vendors
from your_table
group by month
) t, t.top_20_vendors
group by month

Teradata - Cannot nest aggregate operations min(avg)

I want to get the month a store had its lowest average revenue. I either get a list of all the stores (the code below is giving me all 12 months for a store) or when i try min(avg_rev) in the inner select it says 'Teradata - Cannot nest aggregate operations'. Please help.
| store | yearmonth | min(avg_rev)|
| 102 | 2004 9 | $2000 |
| 103 | 2004 8 | $30000 |
etc
SELECT STORE, month_num||year_num AS yearmonth, min(avg_rev)
FROM (SELECT store, EXTRACT(year from saledate) AS year_num,
EXTRACT(month from saledate) AS month_num,
sum(amt)/ COUNT (distinct saledate) AS avg_rev
FROM trnsact
WHERE stype='p'
GROUP BY year_num, month_num,store
HAVING NOT(year_num=2005 AND month_num=8) AND COUNT (distinct saledate)>20) AS clean_data
GROUP BY store, yearmonth, avg_rev
ORDER BY store asc, min(avg_rev)

If I understand correctly, you can use qualify to choose the month:
SELECT store, EXTRACT(year from saledate) AS year_num,
EXTRACT(month from saledate) AS month_num,
sum(amt)/ COUNT(distinct saledate) AS avg_rev
FROM trnsact
WHERE stype='p'
GROUP BY year_num, month_num, store
QUALIFY ROW_NUMBER() OVER (PARTITION BY store ORDER BY avg_rev ASC) = 1

Find the start and end date of stock difference

Please Suggest good sql query to find the start and end date of stock difference
imagine i data in a table like below.
Sample_table
transaction_date stock
2018-12-01 10
2018-12-02 10
2018-12-03 20
2018-12-04 20
2018-12-05 20
2018-12-06 20
2018-12-07 20
2018-12-08 10
2018-12-09 10
2018-12-10 30
Expected result should be
Start_date end_date stock
2018-12-01 2018-12-02 10
2018-12-03 2018-12-07 20
2018-12-08 2018-12-09 10
2018-12-10 null 30

It is the gap and island problem. You may use row_numer and group by for this.
select t.stock, min(transaction_date), max(transaction_date)
from (
select row_number() over (order by transaction_date) -
row_number() over (partition by stock order by transaction_date) grp,
transaction_date,
stock
from data
) t
group by t.grp, t.stock
In the following DBFIDDLE DEMO I solve also the null value of the last group, but the main idea of finding consecutive rows is build on the above query.
You may check this for an explanation of this solution.

You can try below using row_number()
select stock,min(transaction_date) as start_date,
case when min(transaction_date)=max(transaction_date) then null else max(transaction_date) end as end_date
from
(
select *,row_number() over(order by transaction_date)-
row_number() over(partition by stock order by transaction_date) as rn
from t1
)A group by stock,rn

Try to use GROUP BY with MIN and MAX:
SELECT
stock,
MIN(transaction_date) Start_date,
CASE WHEN COUNT(*)>1 THEN MAX(transaction_date) END end_date
FROM Sample_table
GROUP BY stock
ORDER BY stock

You can try with LEAD, LAG functions as below:
select currentStockDate as startDate,
LEAD(currentStockDate,1) as EndDate,
currentStock
from
(select *
from
(select
LAG(transaction_date,1) over(order by transaction_date) as prevStockDate,
transaction_date as CurrentstockDate,
LAG(stock,1) over(order by transaction_date) as prevStock,
stock as currentStock
from sample_table) as t
where (prevStock <> currentStock) or (prevStock is null)
) as t2

Date bound SQL group by

I have a dataset that looks like this:
StartDate EndDate InstrumentID Dimension DimensionValue
2018-01-01 2018-01-01 123 Currency GBP
2018-01-02 2018-01-02 123 Currency GBP
2018-01-03 2018-01-03 123 Currency USD
2018-01-04 2018-01-04 123 Currency USD
2018-01-05 2018-01-05 123 Currency GBP
2018-01-06 2018-01-06 123 Currency GBP
What I would like is to transform this dataset into a date bound dataset like below:
StartDate EndDate InstrumentID Dimension DimensionValue
2018-01-01 2018-01-02 123 Currency GBP
2018-01-03 2018-01-04 123 Currency USD
2018-01-05 2018-01-06 123 Currency GBP
I thought about writing the SQL like this:
SELECT
MIN(StartDate) AS StartDate
, MAX(EndDate) AS EndDate
, [InstrumentID]
, Dimension
, DimensionValue
FROM #Worktable
GROUP BY InstrumentID, Dimension, DimensionValue
However this obviously won't work as it will ignore the change in date for GBP and just group one record together with start date of 2018-01-01 and end date of 2018-01-06.
Is there a way in which I can do this and achieve the dates I require?
Thanks

This is a common Gaps and Islands question. There are plenty of examples out there on how to do this; for example:
WITH VTE AS(
SELECT CONVERT(date,StartDate) AS StartDate,
CONVERT(Date,EndDate) AS EndDate,
InstrumentID,
Dimension,
DimensionValue
FROM (VALUES('20180101','20180101',123,'Currency','GBP'),
('20180102','20180102',123,'Currency','GBP'),
('20180103','20180103',123,'Currency','USD'),
('20180104','20180104',123,'Currency','USD'),
('20180105','20180105',123,'Currency','GBP'),
('20180106','20180106',123,'Currency','GBP')) V(StartDate,EndDate,InstrumentID,Dimension,DimensionValue)),
Grps AS (
SELECT StartDate,
EndDate,
InstrumentID,
Dimension,
DimensionValue,
ROW_NUMBER() OVER (PARTITION BY InstrumentID, Dimension ORDER BY StartDate) -
ROW_NUMBER() OVER (PARTITION BY InstrumentID, Dimension, DimensionValue ORDER BY StartDate) AS Grp
FROM VTE)
SELECT MIN(StartDate) AS StartDate,
MAX(EndDate) AS EndDate,
InstrumentID,
Dimension,
DimensionValue
FROM Grps
GROUP BY InstrumentID,
Dimension,
DimensionValue,
Grp
ORDER BY StartDate;

This is a form of gaps-and-islands. But because there are start date and end dates, you need to be careful. I recommend lag() and cumulative sum:
select InstrumentID, Dimension, DimensionValue,
min(startdate) as startdate, max(enddate) as enddate
from (select w.*,
sum(case when prev_enddate = startdate then 0 else 1 end)
over (partition by InstrumentID, Dimension,
DimensionValue order by startdate) as grp
from (select w.*,
lag(enddate) over (partition by InstrumentID, Dimension, DimensionValue
order by startdate) as prev_enddate
from #worktable w
) w
group by InstrumentID, Dimension, DimensionValue, grp
order by InstrumentID, Dimension, DimensionValue, min(startdate);

You need to use dense rank like:
with x as(
select DENSE_RANK() OVER
(PARTITION BY DimensionValue) AS Rank , *
from Worktable
) select StartDate AS StartDate
, EndDate AS EndDate
, [InstrumentID]
, Max(Dimension) AS Dimension
, DimensionValue, Rank
FROM x
GROUP BY InstrumentID, StartDate, EndDate, DimensionValue,Rank

Update, I just thought of this, I couldn't test it yet, I think it will work the way you want it to.
Select StartDate, EndDate, InstrumentID, Dimension, DimensionValue From (
SELECT
StartDate AS StartDate
, EndDate AS EndDate
, [InstrumentID]
, Dimension
, DimensionValue
, Count(*)
FROM #Worktable
GROUP BY InstrumentID, StartDate, EndDate, Dimension, DimensionValue) x
Hope this helps!

Try something like the following:
WITH CTE AS(
SELECT StartDate::DATE AS StartDate,
EndDate::DATE AS EndDate,
InstrumentID,
Dimension,
DimensionValue
FROM (VALUES('20180101','20180101',123,'Currency','GBP'),
('20180102','20180102',123,'Currency','GBP'),
('20180103','20180103',123,'Currency','USD'),
('20180104','20180104',123,'Currency','USD'),
('20180105','20180105',123,'Currency','GBP'),
('20180106','20180106',123,'Currency','GBP')) V(StartDate,EndDate,InstrumentID,Dimension,DimensionValue))
SELECT startdate
, enddate
, instrumentid
, dimension
, dimensionvalue
FROM (
SELECT *
, CASE WHEN (LAG(enddate, 1) OVER(PARTITION BY dimensionvalue ORDER BY startdate) IS NULL) OR (enddate - LAG(enddate, 1) OVER(PARTITION BY dimensionvalue ORDER BY startdate) <> 1) THEN 0
ELSE 1 END is_valid
FROM CTE
) a
WHERE is_valid = 1
ORDER BY startdate;
Credit to #Lamu for creating the temp table.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Adding additional group by in running average - sql

Related

Using a where statement with rank and a subquery in SQL

Sales amounts of the top n selling vendors by month in bigquery

Teradata - Cannot nest aggregate operations min(avg)

Find the start and end date of stock difference

Date bound SQL group by

Categories

Resources