Sum iteratively in SQL based on the value in the next row?

I want to aggregate a transaction table so that consecutive rows are summed for as long as the next row has the same transaction type; when the type changes, the sum stops and a new output row begins. This might be confusing, so I am adding an example below -
What I have -
ID  date       type      amount
--------------------------------
a   1/1/2023   incoming  10
a   2/1/2023   incoming  10
a   3/1/2023   incoming  10
a   4/1/2023   incoming  10
a   5/1/2023   outgoing  20
a   6/1/2023   outgoing  10
a   7/1/2023   incoming  10
a   8/1/2023   incoming  10
a   9/1/2023   outgoing  30
a   10/1/2023  incoming  10
Summary Output I want -
ID  type      min_date   max_date   amount
-------------------------------------------
a   incoming  1/1/2023   4/1/2023   40
a   outgoing  5/1/2023   6/1/2023   30
a   incoming  7/1/2023   8/1/2023   20
a   outgoing  9/1/2023   9/1/2023   30
a   incoming  10/1/2023  10/1/2023  10
Basically, keep summing as long as the next row has the same transaction type (after sorting by date); when the type changes, start a new row and repeat the same process.
Thanks!
I tried approaches such as a window function (dense_rank) and sum() over (partition by ...), but I am not getting the output I am looking for.

Window functions are the correct approach. You need to identify where the type changes (one way is to use LAG or LEAD) and then assign a group number to each run of rows. See if the following gives your expected results:
with d as (
    select *,
        case when lag(type) over(partition by id order by date) = type then 0 else 1 end diff
    from t
), grp as (
    select *, Sum(diff) over(partition by id order by date) grp
    from d
)
select Id, Type,
    Min(date) Min_Date,
    Max(Date) Max_Date,
    Sum(Amount) Amount
from grp
group by Id, Type, grp
order by Min_Date;
See this example Fiddle
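If you want to try the change-flag plus running-sum idea without a database server, here is a minimal sketch that runs the same query shape through Python's built-in sqlite3 module. The assumptions are mine, not the answer's: SQLite 3.25 or later (for window functions), the columns renamed to txn_date and txn_type, and the sample dates read as ten consecutive days written in ISO form.

```python
import sqlite3

# Sample rows from the question. Dates are rewritten as ISO strings
# (assumption: ten consecutive days) so text ordering matches date ordering.
ROWS = [
    ("a", "2023-01-01", "incoming", 10),
    ("a", "2023-01-02", "incoming", 10),
    ("a", "2023-01-03", "incoming", 10),
    ("a", "2023-01-04", "incoming", 10),
    ("a", "2023-01-05", "outgoing", 20),
    ("a", "2023-01-06", "outgoing", 10),
    ("a", "2023-01-07", "incoming", 10),
    ("a", "2023-01-08", "incoming", 10),
    ("a", "2023-01-09", "outgoing", 30),
    ("a", "2023-01-10", "incoming", 10),
]

QUERY = """
WITH d AS (
    -- diff is 1 on the first row of every run of equal types, else 0
    SELECT *,
           CASE WHEN LAG(txn_type) OVER (PARTITION BY id ORDER BY txn_date) = txn_type
                THEN 0 ELSE 1 END AS diff
    FROM t
), g AS (
    -- the running sum of diff gives each run its own group number
    SELECT *, SUM(diff) OVER (PARTITION BY id ORDER BY txn_date) AS grp
    FROM d
)
SELECT id, txn_type,
       MIN(txn_date) AS min_date,
       MAX(txn_date) AS max_date,
       SUM(amount)   AS total
FROM g
GROUP BY id, txn_type, grp
ORDER BY min_date
"""


def summarize(rows):
    """Load the rows into an in-memory table and return one row per run."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (id TEXT, txn_date TEXT, txn_type TEXT, amount INTEGER)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = summarize(ROWS)
for row in result:
    print(row)
```

The five rows it prints correspond to the five rows of the summary output in the question.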

Related

Query for days since last value change

I have a table with the following data that includes any change to coupon program (Rate & Status)
timestamp      account_ID  active  rate
----------------------------------------
1675894331538  1234        true    5
1675386736152  1234        false   0
1674778434298  1234        true    7
1673500367524  1234        true    5
1673309563251  1234        true    8
I am trying to determine how to best write a query to have the output look like this:
account_ID  days_since_status_change  days_since_rate_change
-------------------------------------------------------------
1234        2                         4
I've been looking into using row_number and partitioning by account_id over timestamp DESC, but I can't wrap my head around how to narrow it down to two specific events and then counting the days since that event happened.
If you can make suggestions, this n00b would really appreciate the help!
I'm using BigQuery if that helps too.
You might consider the query below. Note that it counts the status and rate changes recorded for each account, which reproduces the sample output above.
SELECT account_ID,
       COUNTIF(active_change) OVER w1 AS days_since_status_change,
       COUNTIF(rate_change) OVER w1 AS days_since_rate_change
  FROM (
    SELECT *,
           active <> LAG(active) OVER w0 AS active_change,
           rate <> LAG(rate) OVER w0 AS rate_change
      FROM sample_table
    WINDOW w0 AS (PARTITION BY account_ID ORDER BY timestamp)
  ) QUALIFY timestamp = MAX(timestamp) OVER (PARTITION BY account_ID)
WINDOW w1 AS (PARTITION BY account_ID ORDER BY timestamp);

count most repeated value per group in hive?

I am using hive 0.14.0 in a hortonworks data platform, on a big file similar to this input data:
tpep_pickup_datetime   pulocationid
------------------------------------
2022-01-28 23:32:52.0  100
2022-02-28 23:02:40.0  202
2022-02-28 17:22:45.0  102
2022-02-28 23:19:37.0  102
2022-03-29 17:32:02.0  102
2022-01-28 23:32:40.0  101
2022-02-28 17:28:09.0  201
2022-03-28 23:59:54.0  100
2022-02-28 21:02:40.0  100
I want to find the most common pickup hour for each locationid. This is the result I expect:
locationid  hour
-----------------
100         23
101         17
102         17
201         17
202         23
I was thinking of using a partition clause like this:
select * from (
select hour(tpep_pickup_datetime), pulocationid
(max (hour(tpep_pickup_datetime))) over (partition by pulocationid) as max_hour,
row_number() over (partition by pulocationid) as row_no
from yellowtaxi22
) res
where res.row_no = 1;
but it shows me this error:
SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: Invalid function pulocationid
Is there any other way of doing this?
with raw_groups as -- subquery syntax
(
    select
        struct(
            count(time_stamp), -- must be first for max to use it to sort on
            location.pulocationid,
            hour(time_stamp) as hour
        ) as mylocation -- create a struct to make max do the work for us
    from
        location
    group by
        location.pulocationid,
        hour(time_stamp)
),
grouped_data as -- another subquery syntax based on `with`
(
    select
        max(mylocation) as location -- will pick max based on count(time_stamp)
    from
        raw_groups
    group by
        mylocation.pulocationid
)
select --format data into your requested format
    location.pulocationid,
    location.hour
from
    grouped_data
I do not remember whether Hive 0.14 can use the with clause, but you could easily rewrite the query without it (by substituting each select in place of the table name). I just don't find it as readable:
select --format data into your requested format
    location.pulocationid,
    location.hour
from
(
    select
        max(mylocation) as location -- will pick max based on count(time_stamp)
    from
    (
        select
            struct(
                count(time_stamp), -- must be first for max to use it to sort on
                location.pulocationid,
                hour(time_stamp) as hour
            ) as mylocation -- create a struct to make max do the work for us
        from
            location
        group by
            location.pulocationid,
            hour(time_stamp)
    ) raw_groups -- Hive requires a name for every subquery in from
    group by
        mylocation.pulocationid
) grouped_data
You were halfway there!
The idea was in the right direction, but the syntax is a little bit off.
First, find the count for each hour:
select pulocationid, hour(tpep_pickup_datetime) as pickup_hour, count(*) as cnt
from yellowtaxi22
group by pulocationid, hour(tpep_pickup_datetime)
Then add the row_number, but you need to order it by the total count in descending order (selecting from the query above):
select pulocationid, pickup_hour, cnt,
    row_number() over (partition by pulocationid order by cnt desc) as row_no
from
Last but not least, take only the rows with the highest count (this can be done with the max function rather than row_number, by the way).
Or all together:
select pulocationid, pickup_hour from (
    select pulocationid, pickup_hour, cnt,
        row_number() over (partition by pulocationid order by cnt desc) as row_no
    from (
        select pulocationid, hour(tpep_pickup_datetime) as pickup_hour, count(*) as cnt
        from yellowtaxi22
        group by pulocationid, hour(tpep_pickup_datetime)
    ) counts
) ranked
where row_no = 1
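To check the count-then-rank steps against the sample rows from the question, here is a small sketch using Python's sqlite3 module. Several things here are my assumptions rather than Hive behaviour: SQLite 3.25 or later for window functions, substr() standing in for Hive's hour() function, and the alias pickup_hour. With the sample rows exactly as posted, location 101 has a single pickup, at hour 23.

```python
import sqlite3

# Sample rows from the question (timestamp text, location id).
ROWS = [
    ("2022-01-28 23:32:52.0", 100),
    ("2022-02-28 23:02:40.0", 202),
    ("2022-02-28 17:22:45.0", 102),
    ("2022-02-28 23:19:37.0", 102),
    ("2022-03-29 17:32:02.0", 102),
    ("2022-01-28 23:32:40.0", 101),
    ("2022-02-28 17:28:09.0", 201),
    ("2022-03-28 23:59:54.0", 100),
    ("2022-02-28 21:02:40.0", 100),
]

QUERY = """
SELECT pulocationid, pickup_hour
FROM (
    -- rank the hours of each location by how often they occur
    SELECT pulocationid, pickup_hour, cnt,
           ROW_NUMBER() OVER (PARTITION BY pulocationid ORDER BY cnt DESC) AS row_no
    FROM (
        -- count pickups per (location, hour); characters 12-13 hold the hour
        SELECT pulocationid,
               CAST(substr(tpep_pickup_datetime, 12, 2) AS INTEGER) AS pickup_hour,
               COUNT(*) AS cnt
        FROM yellowtaxi22
        GROUP BY pulocationid, CAST(substr(tpep_pickup_datetime, 12, 2) AS INTEGER)
    ) AS counts
) AS ranked
WHERE row_no = 1
ORDER BY pulocationid
"""


def most_common_hours(rows):
    """Return (pulocationid, most frequent pickup hour) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE yellowtaxi22 (tpep_pickup_datetime TEXT, pulocationid INTEGER)"
    )
    conn.executemany("INSERT INTO yellowtaxi22 VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = most_common_hours(ROWS)
for row in result:
    print(row)
```

If two hours tie for the top count in a location, row_number picks one of them arbitrarily; rank() would keep both.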

SQL Implementation with Sliding-Window functions or Recursive CTEs

I have a problem that is very easy to solve in C# code, for example, but I have no idea how to write it as a SQL query with recursive CTEs or sliding-window functions.
Here is the situation: let's say I have a table with 3 columns (ID, Date, Amount), and here is some data:
ID Date Amount
-----------------------
1 01.01.2016 -500
2 01.02.2016 1000
3 01.03.2016 -200
4 01.04.2016 300
5 01.05.2016 500
6 01.06.2016 1000
7 01.07.2016 -100
8 01.08.2016 200
The result I want to get from the table is this (ID, Amount .... Order By Date):
ID Amount
-----------------------
2 300
4 300
5 500
6 900
8 200
The idea is to distribute the amounts into installments, for each client separately. The catch is that when a negative amount comes into play, you need to remove that amount from the last installment. I don't know how clear I am, so here is an example:
Let's say I have 3 invoices for one client with amounts 500, 200, -300.
If I start distributing these invoices, I first distribute the amount 500, then 200. But when I come to the third one, -300, I need to remove it from the last invoice. In other words, 200 - 300 = -100, so the amount from the second invoice disappears, but there is still -100 that needs to be subtracted from the first invoice. So 500 - 100 = 400. The result I need is a single row (the first invoice with amount 400).
Another example, where the first invoice has a negative amount (-500, 300, 500):
In this case, the first invoice (-500) makes the second disappear, and another 200 is subtracted from the third. So the result is: the third invoice with amount 300.
This is something like a stack implementation in a programming language, but I need to do it with sliding-window functions in SQL Server.
I need an implementation with sliding-window functions or recursive CTEs.
Not with loops ...
Thanks.
OK, I think this is what you want. There are two recursive queries: one for the upward propagation and a second one for the downward propagation.
with your_data_rn as
(
    select *, row_number() over (order by date) rn
    from your_data
), rcte_up(id, date, amount, rn, running) as
(
    select *, amount as running
    from your_data_rn
    union all
    select d.*,
        d.amount + rcte_up.running
    from your_data_rn d
    join rcte_up on rcte_up.running < 0 and d.rn = rcte_up.rn - 1
), data2 as
(
    select id, date, min(running) amount,
        row_number() over (order by date) rn
    from rcte_up
    group by id, date, rn
    having min(running) > 0 or rn = 1
), rcte_down(id, date, amount, rn, running) as
(
    select *, amount as running
    from data2
    union all
    select d.*, d.amount + rcte_down.running
    from data2 d
    join rcte_down on rcte_down.running < 0 and d.rn = rcte_down.rn + 1
)
select id, date, min(running) amount
from rcte_down
group by id, date
having min(running) > 0
demo
I can imagine using just the upward propagation and doing the downward propagation of the first row in some procedural language. Downward propagation is a single scan through the first few rows, so a recursive query may be like using a hammer on a mosquito.
I added a client ID to the table for a more general solution. Then I implemented the stack, stored as XML in a query field, and emulated a program loop with a recursive CTE:
with Data as( -- Numbering rows for iteration on CTE
    select Client, id, Amount,
        cast(row_number() over(partition by Client order by Date) as int) n
    from TabW
),
CTE(Client, n, stack) as( -- Recursive CTE
    select Client, 1, cast(NULL as xml) from Data where n=1
    UNION ALL
    select D.Client, D.n+1, (
        -- Stack operations to process current row (D)
        select row_number() over(order by n) n,
            -- Use calculated amount in first positive and oldest stack cell
            -- Else preserve value stored in stack
            case when n=1 or (n=0 and last=1) then new else Amount end Amount,
            -- Set ID in stack cell for positive and new data
            case when n=1 and D.Amount>0 then D.id else id end id
        from (
            select Y.Amount, Y.id, new,
                -- Count positive stack entries
                sum(case when new<=0 or (n=0 and Amount<0) then 0 else 1 end) over (order by n) n,
                row_number() over(order by n desc) last -- mark oldest stack cell by 1
            from (
                select X.*,sum(Amount) over(order by n) new
                from (
                    select case when C.stack.value('(/row/@Amount)[1]','int')<0 then -1 else 0 end n,
                        D.Amount, D.id -- Data from new record
                    union all -- And expand current stack in XML to table
                    select node.value('@n','int') n, node.value('@Amount','int'), node.value('@id','int')
                    from C.stack.nodes('//row') N(node)
                ) X
            ) Y where n>=0 -- Suppress new cell if the stack contained a negative amount
        ) Z
        where n>0 or (n=0 and last=1)
        for xml raw, type
    )
    from Data D, CTE C
    where D.n=C.n and D.Client=C.Client
) -- Expand stack into result table
select CTE.Client, node.value('@id','int') id, node.value('@Amount','int')
from CTE join (select Client, max(n) max_n from Data group by Client) X on CTE.Client=X.Client and CTE.n=X.max_n+1
cross apply stack.nodes('//row') N(node)
order by CTE.Client, node.value('@n','int') desc
Test on sqlfiddle.com
I think this method is slower than @RadimBača's. It is shown to demonstrate the possibility of implementing a sequential algorithm in SQL.
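For reference, the stack behaviour described in the question is short to write procedurally, which makes it handy for checking what either SQL query should return. This is only a sketch in plain Python, not a replacement for the SQL answers; the function name settle and the debt variable (a negative amount that has not yet been absorbed by any invoice) are my own naming.

```python
def settle(invoices):
    """Apply the question's rule to (id, amount) pairs given in date order.

    Positive amounts are pushed on a stack; a negative amount eats into the
    most recent open invoices. Returns the surviving (id, amount) pairs.
    """
    stack = []  # open invoices, newest last, stored as [id, remaining]
    debt = 0    # negative amount seen while no open invoice could absorb it

    for inv_id, amount in invoices:
        if amount > 0:
            remaining = amount - debt
            if remaining > 0:
                stack.append([inv_id, remaining])
                debt = 0
            else:
                # the whole invoice is consumed by the outstanding debt
                debt = -remaining
        else:
            need = -amount
            while need > 0 and stack:
                top = stack[-1]
                if top[1] > need:
                    top[1] -= need
                    need = 0
                else:
                    need -= top[1]
                    stack.pop()
            debt += need  # anything left waits for later invoices

    return [tuple(item) for item in stack]


# The eight sample rows from the question, already in date order.
sample = [(1, -500), (2, 1000), (3, -200), (4, 300),
          (5, 500), (6, 1000), (7, -100), (8, 200)]
result = settle(sample)
print(result)
```

The same function reproduces both smaller examples from the question: settle([(1, 500), (2, 200), (3, -300)]) leaves the first invoice with 400, and settle([(1, -500), (2, 300), (3, 500)]) leaves the third invoice with 300.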

Selecting n-th to last values

I have a table like so:
id device group
-----------------
1 a 1000
2 a 1000
3 b 1001
4 b 1001
5 b 1001
6 b 1002
8 a 1003
9 a 1003
10 a 1003
11 a 1003
12 b 1004
13 b 1004
All id's and groups are sequential. What I would like is to select id and device based on groups and devices. Think of it as a pagination type selection. Getting the last group is a simple inner selection, but how do I select the second last group, or the third last group - etc.
I tried the row number function like this:
SELECT * FROM
( SELECT *, ROW_NUMBER() OVER (PARTITION BY device ORDER BY group DESC) rn FROM data) tmp
WHERE rn = 1;
.. but changing rn is giving me the previous id, not the previous group.
I would like to end up with a selection that could accommodate these results:
device = a, group = latest:
id device group
10 a 1003
11 a 1003
device = a, group = latest - 1:
id device group
1 a 1000
2 a 1000
Does anyone know how to accomplish this?
Edit:
Use case is a GPS enabled device in a car, sending data every 30 seconds. Imagine going on a drive today. First you go to the shops, then you go home. the first trip is you driving to the shop. The second trip is you driving back. I want to show those trips on a map, but it means I need to identify your last trip, and then the trip before it - ad infinitum, until you run out of trips.
You can try this approach:
with x as (
    select distinct page
    from test_table),
y as (
    select x.page
        ,row_number() over (order by page desc) as row_num
    from x)
select test_table.* from test_table join y on y.page = test_table.page
where y.row_num = 2
I will try to explain what I did here.
The first block (x) returns the distinct groups (pages in my case).
The second block (y) assigns row numbers to the groups according to their rank. In this case the ranking is in descending order of page.
Finally, the third block selects the rows for the desired page. If you want the penultimate page, use row_num = 2; for the third from last, use row_num = 3, and so on.
You can play around with the values here: http://sqlfiddle.com/#!15/190c06/26
Use dense_rank():
select d.*
from (select d.*, dense_rank() over (order by group_id desc) as seqnum
from data d
where device = 'a'
) d
where seqnum = 2;
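Here is a small sketch of the dense_rank() approach run against the sample table with Python's sqlite3 module, with the device and the "n-th from last" position passed as parameters. The assumptions are mine: SQLite 3.25 or later, the group column named group_id (as in the query above, since group is a reserved word), and the helper name nth_last_group.

```python
import sqlite3

# Sample rows from the question: (id, device, group_id).
ROWS = [
    (1, "a", 1000), (2, "a", 1000),
    (3, "b", 1001), (4, "b", 1001), (5, "b", 1001),
    (6, "b", 1002),
    (8, "a", 1003), (9, "a", 1003), (10, "a", 1003), (11, "a", 1003),
    (12, "b", 1004), (13, "b", 1004),
]

QUERY = """
SELECT id, device, group_id
FROM (
    -- dense_rank gives every row of the same group the same number,
    -- counting groups (not rows) from the latest one backwards
    SELECT d.*, DENSE_RANK() OVER (ORDER BY group_id DESC) AS seqnum
    FROM data d
    WHERE device = ?
) ranked
WHERE seqnum = ?
ORDER BY id
"""


def nth_last_group(conn, device, n):
    """Rows of the n-th most recent group for one device (n=1 is the latest)."""
    return conn.execute(QUERY, (device, n)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, device TEXT, group_id INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", ROWS)

latest_a = nth_last_group(conn, "a", 1)
previous_a = nth_last_group(conn, "a", 2)
print(latest_a)
print(previous_a)
```

Because the device filter is applied before ranking, the numbering only counts that device's own groups, so stepping n through 1, 2, 3, ... walks back through its trips until the result is empty.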

GROUP values separated by specific records

I want to make a specific counter which will raise by one after a specific record is found in a row.
time event revenue counter
13.37 START 20 1
13.38 action A 10 1
13.40 action B 5 1
13.42 end 1
14.15 START 20 2
14.16 action B 5 2
14.18 end 2
15.10 START 20 3
15.12 end 3
I need to find the total revenue for every visit (the actions between START and end). I was thinking the best way would be to set a counter like the one in the last column above, so I could group events. But if you have a better solution, I would be grateful.
You can use a query similar to the following:
with StartTimes as
(
select time,
startRank = row_number() over (order by time)
from events
where event = 'START'
)
select e.*, counter = st.startRank
from events e
outer apply
(
select top 1 st.startRank
from StartTimes st
where e.time >= st.time
order by st.time desc
) st
SQL Fiddle with demo.
May need to be updated based on the particular characteristics of the actual data, things like duplicate times, missing events, etc. But it works for the sample data.
SQL Server 2012 supports ORDER BY in the OVER clause for aggregates, so if you're up to date on version, this will give you the counter you want:
count(case when eventname='START' then 1 end) over (order by eventtime)
You could also use the latest START time instead of a counter to group by, like this:
with t as (
select
*,
max(case when eventname='START' then eventtime end)
over (order by eventtime) as timeStart
from YourTable
)
select
timeStart,
max(eventtime) as timeEnd,
sum(revenue) as totalRevenue
from t
group by timeStart;
Here's a SQL Fiddle demo using the schema Ian posted for his solution.
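To see the running conditional count applied to the sample from the question, here is a minimal sketch using Python's sqlite3 module. It assumes SQLite 3.25 or later, stores the times as text (they all have the same width, so they sort correctly), uses event_time as a column name of my own choosing, and leaves revenue NULL on the end rows.

```python
import sqlite3

# Sample rows from the question: (time, event, revenue); "end" rows have no revenue.
ROWS = [
    ("13.37", "START", 20),
    ("13.38", "action A", 10),
    ("13.40", "action B", 5),
    ("13.42", "end", None),
    ("14.15", "START", 20),
    ("14.16", "action B", 5),
    ("14.18", "end", None),
    ("15.10", "START", 20),
    ("15.12", "end", None),
]

QUERY = """
WITH t AS (
    -- the running count of START rows numbers each visit 1, 2, 3, ...
    SELECT *,
           COUNT(CASE WHEN event = 'START' THEN 1 END)
               OVER (ORDER BY event_time) AS counter
    FROM events
)
SELECT counter,
       MIN(event_time) AS time_start,
       MAX(event_time) AS time_end,
       SUM(revenue)    AS total_revenue
FROM t
GROUP BY counter
ORDER BY counter
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_time TEXT, event TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", ROWS)

result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

Each output row is one visit: its counter, first and last event time, and the summed revenue (SUM ignores the NULLs on the end rows).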