How to calculate the average of values without including the last value (SQL)?

I have a table. I partition it by id and want to calculate the average of the values preceding the current one, without including the current value. Here is a sample table:
+----+-------+------------+
| id | Value | Date |
+----+-------+------------+
| 1 | 51 | 2020-11-26 |
| 1 | 45 | 2020-11-25 |
| 1 | 47 | 2020-11-24 |
| 2 | 32 | 2020-11-26 |
| 2 | 51 | 2020-11-25 |
| 2 | 45 | 2020-11-24 |
| 3 | 47 | 2020-11-26 |
| 3 | 32 | 2020-11-25 |
| 3 | 35 | 2020-11-24 |
+----+-------+------------+
In this case, it means calculating the average of the values for dates BEFORE 2020-11-26. This is the expected result:
+----+-------+
| id | Value |
+----+-------+
| 1 | 46 |
| 2 | 48 |
| 3 | 33.5 |
+----+-------+
I have calculated it using ROWS N PRECEDING, but it appears that this averages the N preceding rows plus the current (last) row, and I want to exclude that row (which is the most recent date in my case).
Here is my query:
SELECT ID,
(avg(Value) OVER(
PARTITION BY ID
ORDER BY Date
ROWS 9 PRECEDING )) as avg9
FROM t1

Then define your window frame in full, giving both the start and the end with BETWEEN:
SELECT ID,
(AVG(Value) OVER (PARTITION BY ID ORDER BY Date ROWS BETWEEN 9 PRECEDING AND 1 PRECEDING)) AS avg9
FROM t1;
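A minimal sketch of that frame in action, assuming Python's built-in sqlite3 module (with SQLite 3.25 or later for window functions) stands in for the asker's database. The table and column names mirror the question; the function name and the extra max_date column are hypothetical additions. Because the window query returns one average per row, the sketch keeps only the most recent row per id to reproduce the expected result.

```python
import sqlite3

# Sample rows from the question: (id, value, date)
ROWS = [
    (1, 51, "2020-11-26"), (1, 45, "2020-11-25"), (1, 47, "2020-11-24"),
    (2, 32, "2020-11-26"), (2, 51, "2020-11-25"), (2, 45, "2020-11-24"),
    (3, 47, "2020-11-26"), (3, 32, "2020-11-25"), (3, 35, "2020-11-24"),
]

QUERY = """
SELECT id, avg9
FROM (
    SELECT id,
           date,
           -- frame ends at 1 PRECEDING, so the current row is excluded
           AVG(value) OVER (PARTITION BY id ORDER BY date
                            ROWS BETWEEN 9 PRECEDING AND 1 PRECEDING) AS avg9,
           MAX(date) OVER (PARTITION BY id) AS max_date
    FROM t1
) AS windowed
WHERE date = max_date   -- keep only the most recent row per id
ORDER BY id
"""


def previous_averages():
    """Load the sample rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (id INTEGER, value INTEGER, date TEXT)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(previous_averages())
```

For the first row of each partition the frame is empty, so AVG yields NULL there.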

Why not just filter:
select id, avg(value)
from t1
where date < '2020-11-26'
group by id;
If you want the date to be flexible (say, the most recent date for each id), then:
select id, avg(value)
from (select t1.*,
max(date) over (partition by id) as max_date
from t1
) t1
where date < max_date
group by id;

Do a row_number() over (partition by id order by [Date] desc). This gives row number 1 to the row with the latest date in each partition. Wrap it in a CTE and then calculate the average for each partition where the row number is greater than 1. Please check the syntax.
;with a as
(
select id, value, Date, row_number() over (partition by id order by Date desc) as RN
from t1
)
select id, avg(Value) from a where RN > 1 group by id

Related

How to create a cumulative count distinct with partition by in SQL?

I have a table with user data and want to create a cumulative count distinct but this type of window function does not exist. This is my table
date | user-id | purchase-id
2020-01-01 | 1 | 244
2020-01-03 | 1 | 244
2020-02-01 | 1 | 524
2020-03-01 | 2 | 443
Now, I want a cum count distinct for purchase id like this:
date | user-id | purchase-id | cum_purchase
2020-01-01 | 1 | 244 | 1
2020-01-03 | 1 | 244 | 1
2020-02-01 | 1 | 524 | 2
2020-03-01 | 2 | 443 | 1
I tried
Select
dt,
user_id,
count(distinct purchase_id) over (partition by user_id ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cum_ct
from table
I get an error that I cannot use count distinct with an order by statement. What to do?
Something like this:
Select
dt as [date],
user_id,
purchase_id,
SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) over (partition by user_id ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cum_ct
from (
SELECT
dt,
user_id,
purchase_id,
ROW_NUMBER() OVER (PARTITION BY user_id, purchase_id ORDER BY dt) as RN
FROM sometable
) sub
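A minimal sketch of the same idea, assuming Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in database. The table name purchases and the function name are hypothetical. ROW_NUMBER() marks the first occurrence of each purchase per user, and the running SUM counts only those first occurrences.

```python
import sqlite3

# Sample rows from the question: (dt, user_id, purchase_id)
ROWS = [
    ("2020-01-01", 1, 244),
    ("2020-01-03", 1, 244),
    ("2020-02-01", 1, 524),
    ("2020-03-01", 2, 443),
]

QUERY = """
SELECT dt, user_id, purchase_id,
       -- count a purchase only the first time it appears for the user
       SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END)
           OVER (PARTITION BY user_id ORDER BY dt
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_ct
FROM (
    SELECT dt, user_id, purchase_id,
           ROW_NUMBER() OVER (PARTITION BY user_id, purchase_id
                              ORDER BY dt) AS rn
    FROM purchases
) AS sub
ORDER BY user_id, dt
"""


def cumulative_distinct():
    """Return (dt, user_id, purchase_id, cum_ct) rows for the sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (dt TEXT, user_id INTEGER, purchase_id INTEGER)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in cumulative_distinct():
    print(row)
```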

Grouped LIMIT in PostgreSQL: show the first N rows for each group, BUT only if the first of those rows equals specific data

Consider the following table:
SELECT * FROM report_raw_data;
ts | d_stamp | id_mod | value
-----------+------------+--------+------
1605450647 | 2020-11-15 | 1 | 60
1605464634 | 2020-11-15 | 2 | 54
1605382126 | 2020-11-14 | 1 | 40
1605362085 | 2020-11-14 | 3 | 33
1605355089 | 2020-11-13 | 1 | 60
1605202153 | 2020-11-12 | 2 | 30
What I need is to get the first two rows of each id_mod, ordered by ts, but only if the d_stamp of the first row is the current date (in this case 2020-11-15).
So far I have managed to get the first two rows of each id_mod ordered by ts, but I struggle with restricting this to the current date, 2020-11-15.
Here is my attempt and its (wrong) result:
SELECT * FROM (SELECT ROW_NUMBER() OVER (PARTITION BY id_mod ORDER BY ts DESC) AS r,t.* FROM
report_raw_data t) x WHERE x.r <= 2;
ts | d_stamp | id_mod | value
-----------+------------+--------+------
1605450647 | 2020-11-15 | 1 | 60
1605382126 | 2020-11-14 | 1 | 40
1605464634 | 2020-11-15 | 2 | 54
1605202153 | 2020-11-12 | 2 | 30
1605362085 | 2020-11-14 | 3 | 33
If I add WHERE d_stamp = '2020-11-15' to the query, I only get the records from that date, so I lose the second rows, which I need.
This is what I would like to get (ignoring id_mod number 3, since its first row is not on 2020-11-15):
ts | d_stamp | id_mod | value
-----------+------------+--------+------
1605450647 | 2020-11-15 | 1 | 60
1605382126 | 2020-11-14 | 1 | 40
1605464634 | 2020-11-15 | 2 | 54
1605202153 | 2020-11-12 | 2 | 30
One more note: I will need to be able to use LIMIT and OFFSET with the query to be able to paginate through the results on the frontend.
Starting from your current query, a simple approach is to use a window MAX() in the subquery to recover the latest d_stamp per id_mod. You can then use that for additional filtering in the outer query.
SELECT *
FROM (
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY id_mod ORDER BY ts DESC) AS rn,
MAX(d_stamp) OVER(PARTITION BY id_mod) AS max_d_stamp
FROM report_raw_data t
) x
WHERE rn <= 2 AND max_d_stamp = current_date;
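A minimal sketch of filtering each group on its latest d_stamp, assuming Python's built-in sqlite3 module (SQLite 3.25 or later) in place of PostgreSQL. The :today parameter and the function name are hypothetical; the parameter replaces current_date so the sample data from the question stays reproducible. It also assumes that the row with the highest ts in a group carries the group's latest d_stamp.

```python
import sqlite3

# Sample rows from the question: (ts, d_stamp, id_mod, value)
ROWS = [
    (1605450647, "2020-11-15", 1, 60),
    (1605464634, "2020-11-15", 2, 54),
    (1605382126, "2020-11-14", 1, 40),
    (1605362085, "2020-11-14", 3, 33),
    (1605355089, "2020-11-13", 1, 60),
    (1605202153, "2020-11-12", 2, 30),
]

QUERY = """
SELECT ts, d_stamp, id_mod, value
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY id_mod ORDER BY ts DESC) AS rn,
           MAX(d_stamp) OVER (PARTITION BY id_mod) AS max_d_stamp
    FROM report_raw_data AS t
) AS x
WHERE rn <= 2                 -- first two rows per id_mod
  AND max_d_stamp = :today    -- only groups whose latest row is from today
ORDER BY id_mod, ts DESC
"""


def first_two_per_group(today):
    """Return the top two rows of every id_mod whose latest date is `today`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE report_raw_data "
        "(ts INTEGER, d_stamp TEXT, id_mod INTEGER, value INTEGER)"
    )
    conn.executemany("INSERT INTO report_raw_data VALUES (?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY, {"today": today}).fetchall()
    conn.close()
    return result


for row in first_two_per_group("2020-11-15"):
    print(row)
```

Since the outer query is an ordinary SELECT with a deterministic ORDER BY, LIMIT and OFFSET can be appended to it for pagination.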
Assuming you have no future data, and that only rows stamped with the current date are needed (older second rows are filtered out), I would suggest:
SELECT rdr.*
FROM (SELECT rdr.*,
ROW_NUMBER() OVER (PARTITION BY id_mod ORDER BY ts DESC) AS seqnum
FROM report_raw_data rdr
WHERE d_stamp = current_date
) rdr
WHERE seqnum <= 2;
Filtering on the date in the subquery should significantly improve performance. For optimal performance, you want an index on (d_stamp, id_mod, ts desc).

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than me may be able to help with some direction.
I have a table like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201125 | 1 | 0 |
-----------------------------------
| 4 | 20201114 | 2 | 32 |
-----------------------------------
| 5 | 20201116 | 2 | 0 |
-----------------------------------
| 6 | 20201120 | 2 | 23 |
-----------------------------------
However, from this, I need a record for each user for each day. If a day is missing for a user, the last recorded score should be carried forward, so I would have something like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201122 | 1 | 14 |
-----------------------------------
| 4 | 20201123 | 1 | 14 |
-----------------------------------
| 5 | 20201124 | 1 | 14 |
-----------------------------------
| 6 | 20201125 | 1 | 0 |
-----------------------------------
| 7 | 20201114 | 2 | 32 |
-----------------------------------
| 8 | 20201115 | 2 | 32 |
-----------------------------------
| 9 | 20201116 | 2 | 0 |
-----------------------------------
| 10 | 20201117 | 2 | 0 |
-----------------------------------
| 11 | 20201118 | 2 | 0 |
-----------------------------------
| 12 | 20201119 | 2 | 0 |
-----------------------------------
| 13 | 20201120 | 2 | 23 |
-----------------------------------
I'm trying to do this in BigQuery using Standard SQL. I have an idea of how to carry the same score across the following empty dates, but I really don't know how to add new rows for the missing dates for each user. Also, keep in mind that this example only has 2 users, but my data has more than 1500.
My end goal is to show something like the average score per day. For background, because of our logic, if no score was recorded on a specific day, the user still has the last recorded score, which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success.
Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
select user_id, format_date('%Y%m%d', day) date,
from (
select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
from `project.dataset.table`
group by user_id
) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
If applied to the sample data from your question, the output is: [result image not preserved]
One option uses generate_date_array() to create the series of dates for each user, then brings in the table with a left join. This assumes date is stored as a DATE:
select d.date, d.user_id,
last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
select u.user_id, day as date
from (
-- first and last recorded date per user
select user_id, min(date) as min_date, max(date) as max_date
from mytable
group by user_id
) u
cross join unnest(generate_date_array(u.min_date, u.max_date, interval 1 day)) as day
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date
I think the most efficient method is to use generate_date_array() but in a very particular way:
with t as (
select t.*,
date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
from t
)
select row_number() over (order by t.user_id, dte) as id,
t.user_id, dte, t.score
from t cross join
unnest(generate_date_array(date,
coalesce(next_date, date),
interval 1 day
)
) dte;
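The gap-filling logic can also be sketched outside the database. The following is a plain-Python illustration of the same idea (generate every day between a user's first and last record, then carry the last seen score forward), not BigQuery code. The function name and the (day, user_id, score) tuple layout are assumptions made for the example.

```python
from datetime import date, timedelta

# Sample rows from the question: (day, user_id, score)
SAMPLE = [
    (date(2020, 11, 20), 1, 26),
    (date(2020, 11, 21), 1, 14),
    (date(2020, 11, 25), 1, 0),
    (date(2020, 11, 14), 2, 32),
    (date(2020, 11, 16), 2, 0),
    (date(2020, 11, 20), 2, 23),
]


def fill_missing_days(records):
    """Return one (day, user_id, score) row per user per day.

    Days without a record inherit the most recent earlier score.
    """
    by_user = {}
    for day, user_id, score in records:
        by_user.setdefault(user_id, {})[day] = score

    filled = []
    for user_id in sorted(by_user):
        scores = by_user[user_id]
        day, last_day = min(scores), max(scores)
        current = None
        while day <= last_day:
            # keep the previous score when this day has no record;
            # .get() is used so that a recorded score of 0 is preserved
            current = scores.get(day, current)
            filled.append((day, user_id, current))
            day += timedelta(days=1)
    return filled


for row in fill_missing_days(SAMPLE):
    print(row)
```

With more than 1500 users, doing this inside BigQuery as in the answers above is likely preferable; the sketch only makes the carry-forward step explicit.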

Select Rows whose Sum Value = 80% of the Total

Here is an example of the business problem.
I have 10 sales that resulted in negative margin.
We want to review these records, and we generally use the 20/80 rule in reviews.
That is, 20 percent of the sales will likely represent 80 percent of the negative margin.
So with the records below...
+----+-------+
| ID | Value |
+----+-------+
| 1 | 30 |
| 2 | 30 |
| 3 | 20 |
| 4 | 10 |
| 5 | 5 |
| 6 | 5 |
| 7 | 2 |
| 8 | 2 |
| 9 | 1 |
| 10 | 1 |
+----+-------+
I would want to return...
+----+-------+
| ID | Value |
+----+-------+
| 1 | 30 |
| 2 | 30 |
| 3 | 20 |
| 4 | 10 |
+----+-------+
The total of Value is 106, so 80% is 84.8.
I need all the records, sorted descending, whose summed value gets me to at least 84.8.
We use Microsoft APS PDW SQL, but can process on SMP if needed.
Assuming window functions are supported, you can use
with cte as (select id,value
,sum(value) over(order by value desc,id) as running_sum
,sum(value) over() as total
from tbl
)
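select id,value from cte where running_sum < total*0.8
union all
-- TOP 1 needs its own ORDER BY, so it sits in a derived table
select id,value from (
select top 1 id,value from cte where running_sum >= total*0.8 order by value desc, id
) first_over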
One way is to use running totals:
select
id,
value
from
(
select
id,
value,
sum(value) over () as total,
sum(value) over (order by value desc) as till_here,
sum(value) over (order by value desc rows between unbounded preceding and 1 preceding)
as till_prev
from mytable
) summed_up
where till_here * 1.0 / total <= 0.8
or (till_here * 1.0 / total >= 0.8 and coalesce(till_prev, 0) * 1.0 / total < 0.8)
order by value desc;
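A compact variant of the running-total idea, sketched with Python's built-in sqlite3 module (SQLite 3.25 or later) rather than APS PDW, so the syntax is an assumption for that platform. The table name sales and the function name are hypothetical. A row is kept when the running total before it is still below 80% of the grand total, which also captures the row that crosses the threshold. It assumes all values are positive.

```python
import sqlite3

# Sample rows from the question: (id, value)
ROWS = [(1, 30), (2, 30), (3, 20), (4, 10), (5, 5),
        (6, 5), (7, 2), (8, 2), (9, 1), (10, 1)]

QUERY = """
SELECT id, value
FROM (
    SELECT id, value,
           SUM(value) OVER () AS total,
           -- id breaks ties so the running total is deterministic
           SUM(value) OVER (ORDER BY value DESC, id
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                     AND CURRENT ROW) AS running_sum
    FROM sales
) AS summed
-- running_sum - value is the total accumulated before this row
WHERE running_sum - value < total * 0.8
ORDER BY value DESC, id
"""


def top_contributors():
    """Return the rows needed to reach at least 80% of the total value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, value INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(top_contributors())
```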
This link could be useful, it calculates running totals:
https://www.codeproject.com/Articles/300785/Calculating-simple-running-totals-in-SQL-Server

SQL formula for Row number

I'm trying to rank the rows in the following table that looks like this:
| ID | Key | Date | Row|
*****************************
| P175 | 5 | 2017-01| 2 |
| P175 | 5 | 2017-02| 2 |
| P175 | 5 | 2017-03| 2 |
| P175 | 12 | 2017-03| 1 |
| P175 | 12 | 2017-04| 1 |
| P175 | 12 | 2017-05| 1 |
This person has two Keys at once during 2017-03, but I want the formula to put '1' for the rows where Key=12 since it reflects the most recent records.
I want the same formula to also work for the people who don't have overlapping Keys, putting '1' for the most recent records:
| ID | Key | Date | Row|
*****************************
| P170 | 8 | 2017-01| 2 |
| P170 | 8 | 2017-02| 2 |
| P170 | 8 | 2017-03| 2 |
| P170 | 6 | 2017-04| 1 |
| P170 | 6 | 2017-05| 1 |
I've tried variations of ROW_NUMBER() OVER PARTITION BY and DENSE_RANK but cannot figure out the correct formula. Thanks for your help.
First calculate the max date for the key. Then use dense_rank():
select t.*,
dense_rank() over (partition by id order by max_date desc, key) as row
from (select t.*, max(date) over (partition by id, key) as max_date
from t
) t;
If the ranges for each key did not overlap, you could do this with a cumulative count distinct:
select t.*, count(distinct key) over (partition by id order by date desc) as rank
from t;
However, this would not work in the first case. I just find it interesting that this does almost the same thing as the first query.
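A minimal sketch of the max-date plus dense_rank() approach on both sample people, assuming Python's built-in sqlite3 module (SQLite 3.25 or later). The table name person_keys, the column names person_id, key_id and month, and the function name are hypothetical renamings chosen to avoid keyword clashes.

```python
import sqlite3

# Sample rows from the question: (person_id, key_id, month)
ROWS = [
    ("P175", 5, "2017-01"), ("P175", 5, "2017-02"), ("P175", 5, "2017-03"),
    ("P175", 12, "2017-03"), ("P175", 12, "2017-04"), ("P175", 12, "2017-05"),
    ("P170", 8, "2017-01"), ("P170", 8, "2017-02"), ("P170", 8, "2017-03"),
    ("P170", 6, "2017-04"), ("P170", 6, "2017-05"),
]

QUERY = """
SELECT person_id, key_id, month,
       -- the key with the latest month gets rank 1
       DENSE_RANK() OVER (PARTITION BY person_id
                          ORDER BY max_month DESC, key_id) AS row_rank
FROM (
    SELECT t.*,
           MAX(month) OVER (PARTITION BY person_id, key_id) AS max_month
    FROM person_keys AS t
) AS t
ORDER BY person_id, month, key_id
"""


def rank_keys():
    """Return (person_id, key_id, month, row_rank) for the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person_keys (person_id TEXT, key_id INTEGER, month TEXT)"
    )
    conn.executemany("INSERT INTO person_keys VALUES (?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in rank_keys():
    print(row)
```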
I guess you are looking for something like this
select personid, mykey, month,
dense_rank() over (partition by personid order by mykey desc) rown
from personkeys
order by month
see the example
http://sqlfiddle.com/#!15/cf751/8