Window functions: PARTITION BY one column after ORDER BY another

Window functions: PARTITION BY one column after ORDER BY another - sql

Disclaimer: The shown problem is much more general than I expected first. The example below is taken from a solution to another question. But now I was taking this sample for solving many problems more - mostly related to time series (have a look at the "Linked" section in the right bar).
So I am trying to explain the problem more generally first:
I am using PostgreSQL but I am sure this problem exists in other window function supporting DBMS' (MS SQL Server, Oracle, ...) as well.
Window functions can be used to group certain values together by a common attribute or value. For example you can group rows by a date. Then you are able to calculate the max value within every single date or an average value or counting rows or whatever.
This can be achieved by defining a PARTITION. Grouping by dates would work with PARTITION BY date_column. Now you want to do an operation which needs a special order within your groups (calculating row numbers or sum up a column). This can be done with PARTITON BY date_column ORDER BY an_attribute_column.
Now think about a finer resolution of time series. What if you do not have dates but timestamps. Then you cannot group by the time column anymore. But nevertheless it might be important to analyse the data in the order they were added (maybe the timestamp is the creating time of your data set). Then you realize that some consecutive rows have the same value and you want to group your data by this common value. But the clue is that the rows have different timestamps.
The problem here is that you cannot do a PARTITION BY value_column. Because PARTITION BY forces an ordering first. So your table would be ordered by the value_column before the grouping and is not ordered by the timestamp anymore. This yields in results you are not expecting.
More general speaking: The problem is to ensure a special ordering even if the ordered column is not part of the created partition.
Example:
db<>fiddle
I have the following table:
ts val
100000 50
130100 30050
160100 60050
190200 100
220200 30100
250200 30100
300000 300
500000 100
550000 1000
600000 1000
650000 2000
700000 2000
720000 2000
750000 300
I had the problem that I had to group all tied values of the column val. But I wanted to hold the order by ts. To achieve this I wanted to add a column with a unique ID per val group
Expected result:
ts val group
100000 50 1
130100 30050 2
160100 60050 3
190200 100 4
220200 30100 5 \ same group
250200 30100 5 /
300000 300 6
500000 100 7
550000 1000 8 \ same group
600000 1000 8 /
650000 2000 9 \
700000 2000 9 | same group
720000 2000 9 /
750000 300 10
First try was the use of the rank window function which would do this job normally:
SELECT
*,
rank() OVER (PARTITION BY val ORDER BY ts)
FROM
test
But in this case this doesn't work because the PARTITION BY clause orders the table first by its partition columns (val in this case) and then by its ORDER BY columns. So the order is by val, ts instead of the expected order by ts. So the result was not the expected one of course.
ts val rank
100000 50 1
190200 100 1
500000 100 2
300000 300 1
750000 300 2
550000 1000 1
600000 1000 2
650000 2000 1
700000 2000 2
720000 2000 3
130100 30050 1
220200 30100 1
250200 30100 2
160100 60050 1
The question is: How to get the group ids with respect to the order by ts?
Edit: I added an own solution below but I feel very uncomfortable with it. It seems way too complicated. I was wondering if there's a better way to achieve this result.

I came up with this solution by myself (hoping someone else will get a better one):
demo:db<>fiddle
order by ts
give out the next val value with the lag window function (https://www.postgresql.org/docs/current/static/tutorial-window.html)
check if the next and the current values are the same. Then I can print out a 0 or a 1
sum up these values with an ordered SUM. This generates the groups I am looking for. They group the val column but ensure the ordering by the ts column.
The query:
SELECT
*,
SUM(is_diff) OVER (ORDER BY ts)
FROM (
SELECT
*,
CASE WHEN val = lag(val) over (order by ts) THEN 0 ELSE 1 END as is_diff
FROM test
)s
The result:
ts val is_diff sum
100000 50 1 1
130100 30050 1 2
160100 60050 1 3
190200 100 1 4
220200 30100 1 5 \ group
250200 30100 0 5 /
300000 300 1 6
500000 100 1 7
550000 1000 1 8 \ group
600000 1000 0 8 /
650000 2000 1 9 \
700000 2000 0 9 | group
720000 2000 0 9 /
750000 300 1 10

Related

Slab based calculation in SQL

I want to calculate the slab based logic.
This is my table where the min_bucket and max_bucket range is mention and also the rate of it.
slabs min_bucket max_bucket rate_per_month
----------------------------------------------
Slab 1 0 300000 20
Slab 2 300000 500000 17
Slab 3 500000 1000000 14
Slab 4 1000000 13
We need to calculate as
If there are 450k subs, the payout will be 300k20 + 150k17
If the Total Count is 1000001, then its output should be as
min_bucket max_bucket rate_per_month Count rate_per_month revenue
-----------------------------------------------------------------------
0 300000 20 300000 20 6000000
300000 500000 17 500000 17 8500000
500000 1000000 14 200001 14 2800014
Where count is calculated as 300000+500000+200001 = 1000001, and revenue is calculated as rate_per_month * Count as per the slab.
Can anyone help me write the SQL query for this, which will handle all the cases?

You can build running totals of the slabs table and work with them:
with given as (select 1000001 as value)
, slabs as
(
select
slab,
min_bucket,
max_bucket,
rate_per_month,
sum(min_bucket) over (order by min_bucket) as sum_min_bucket,
sum(coalesce(max_bucket, 2147483647)) over (order by min_bucket) as sum_max_bucket
from mytable
)
select
slabs.slab,
slabs.min_bucket,
slabs.max_bucket,
slabs.rate_per_month,
case when slabs.sum_max_bucket <= given.value
then slabs.max_bucket
else given.value - sum_min_bucket
end as used,
case when slabs.sum_max_bucket <= given.value
then slabs.max_bucket
else given.value - sum_min_bucket
end * slabs.rate_per_month as revenue
from given
join slabs on slabs.sum_min_bucket < given.value
order by slabs.min_bucket;
I don't know your DBMS, but this is standard SQL and likely to work for you (either right away or with a little tweak).
Demo: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=9c4f5f837b6167c7e4f2f7e571f4b26f

Can't get the cumulative sum(running total) within a group in SQL Server

I'm trying to get a running total within a group but my current code just gives me an aggregate sum.
For example, my data looks like this
ID ShiftNum Status Type Rate HourlyWage Hours Total_Amount
12542 1 Full A 1 12.5 40 500
12542 1 Full A 1 12.5 35 420
12542 2 Full A 1 10 40 400
12542 2 Full B 1.2 10 40 480
17842 1 Full A 1 11 27 297
17842 1 Full B 1.3 11 30 429
And what I want is a running total within the same ID, Shift Number, and Status. For example, I want something like this as my final result
ID ShiftNum Status Type Rate HourlyWage Hours Total_Amount Running_Tot
12542 1 Full A 1 12.5 40 500 500
12542 1 Full A 1 12.5 35 420 920
12542 2 Full A 1 10 40 400 400
12542 2 Full B 1.2 10 40 480 880
17842 1 Full A 1 11 27 297 297
17842 1 Full B 1.3 11 30 429 726
However, my current code just gives me the total sum within each group. For example, 920, 920 for row 1&2. Here's my code.
Select a.*,
SUM(Hours) OVER (PARTITION BY ID, ShiftNum, Status ORDER BY ID, ShiftNum, Status) as Runnint_Tot
from table a
How do I fix my code to get the final result I want?

You need an ordering column that uniquely defines each row. There is not an obvious one in your row, but something like this:
SUM(Hours) OVER (PARTITION BY ID, ShiftNum, Status ORDER BY hours) as Running_Tot
Or:
SUM(Hours) OVER (PARTITION BY ID, ShiftNum, Status
ORDER BY (SELECT NULL)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as Running_Tot
The problem you are facing is because the ORDER BY keys have ties. The default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Note the RANGE. That means that all rows with ties are combined.
Also note that there is no utility to including the PARTITION BY keys in the ORDER BY (well . . . there is one exception in SQL Server if you don't care about the ordering, then including a key can be a handy short-cut). The ordering occurs within a partition.
If your rows can have exact duplicates, I would first suggest that you add a primary key. But, in the meantime, you could use:
with a as (
select a.*,
row_number() over (order by id, shiftnum, status) as seqnum
from tablea a
)
Select a.*,
SUM(Hours) OVER (PARTITION BY ID, ShiftNum, Status ORDER BY seqnum) as Running_Tot
from a;
The ordering will be arbitrary, but it will at least accumulate.

SQL: How to group rows with the condition that sum of fields is limited to a certain value?

This is my table:
id user_id date balance
1 1 2016-05-10 10
2 1 2016-05-10 30
3 2 2017-04-24 5
4 2 2017-04-27 10
5 3 2017-11-10 40
I want to group the rows by user_id and sum the balance, but so that the sum is equal or less than 30. Moreover, I need to display the minimum date in the group. It should look like this:
id balance date_start
1-1 10 2016-05-10
1-2 30 2016-05-10
2-1 15 2017-04-24
Excuse for my language. Thanks.

You should be able to do so by using group by & having, here is an example of what may solve your case :
SELECT id, user_id, SUM(balance) as balance, data_start
FROM your_table
GROUP BY user_id
HAVING SUM(balance) >= 30
AND MIN(date_start)
This is a good way to do it with one query, but it is a complex query and you should be careful if using it on a very large tables.

Return rows where specific number is reached for the first time (postgres)

Have hit a roadblock.
Context: am using PostgreSQL 9.5.8
I have a table, as follows, with customers' points accrued. The table has multiple rows per customer as it records every change in points (like an event table). i.e. customer 1 may buy 1 item and accrue 10 points which is one row, then on another day spend some of these points and be left with 5 points which is another row, and then purchase another item and accrue a further 10 bringing them back up to 15 which displays as another row. Each of these rows with point amounts has a created_at column.
Example table:
Customer ID created_at no_points row
123 17/09/2017 5 1
123 09/10/2017 8 2
124 10/10/2017 12 3
123 10/10/2017 15 4
125 12/10/2017 12 5
126 17/09/2017 6 6
123 11/10/2017 11 7
123 12/10/2017 9 8
127 17/09/2017 5 9
124 11/10/2017 5 10
125 13/10/2017 5 11
123 13/10/2017 12 12
I want to track the first time a customer reaches a certain threshold i.e. >= 10 points. It doesn't matter how much they go over 10 points, the only criteria is that I select the first time the customer reaches this threshold. I would also like this query to fetch only rows where the customer has reached the threshold of 10 for the first time in the last week.
Following these rules, in the above example, I would like my query to select rows 3, 4 and 5.
I have tried the following query:
SELECT x.id,
min(x.created_at)
FROM (
SELECT
p.id as id,
p.created_at as created_at,
p.amount as amount
FROM "points" p
WHERE p.amount >= 10 ) x
WHERE x.created_at >= (now()::date - 7)
AND x.created_at < now()::date
GROUP BY x.id
I'm unsure that I'm retrieving the right thing however from the result set I am seeing & the results set is huge so it's not evident. Could someone sense check?
Thanks in advance.

Use cumulative functions:
select p.*
from (select p.*,
sum(num_points) over (partition by p.customer_id order by p.created_at) as cume_num_points
from points p
) p
where cume_num_points >= 10 and
(cume_num_points - num_points) < 10;
EDIT:
I may have misunderstood the question. If you just want the first break, one method uses window functions:
select p.*
from (select p.*,
lag(num_points) over (partition by p.customer_id order by p.created_at) as prev_num_points
from points p
) p
where num_points >= 10 and
prev_num_points < 10;
Or, without a subquery:
select distinct on (p.customer_id) p.*
from customers p
where num_points >= 10
order by p.customer_id, p.created_at;

Grouping of Similar data by amount in Oracle

I have a txn table with columns ac_id, txn_amt. It will store the data txn amounts along with account ids. Below is example of data
AC_ID TXN_AMT
10 1000
10 1000
10 1010
10 1030
10 5000
10 5010
10 10000
20 32000
20 32200
20 5000
I want to write a query in such a way that all the amounts which are within 10% range of the previous amounts should be grouped together. Output should be something like this:
AC_ID TOTAL_AMT TOTAL_CNT GROUP
10 4040 4 1
10 10010 2 2
20 64200 2 3
20 5000 1 4
I tried with LAG function but still clueless. This is the code snippet I tried:
select ac_id, txn_amt, round((((txn_amt - lag(txn_amt, 1) over (partition by ac_id order by ac_id, txn_amt))/txn_amt)*100,2) as amt_diff_pct from txn;
Any clue or help will be highly appreciated.

If by previous you mean "the largest amount less than", then you can do this. You can find where the gaps are (i.e. larger than a 10% difference). Then you can assign a group by counting the number of gaps:
select ac_id, sum(txn_amt) as total_amt, count(*) as total_cnt, grp
from (select t.*,
sum(case when prev_txn_amt * 1.1 > txn_amt then 0 else 1 end) over
(partition by ac_id order by txn_amt) as grp
from (select t.*,
lag(txn_amt) over (partition by ac_id order by txn_amt) as prev_txn_amt
from txn t
) t
) t
group by ac_id, grp;

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas