Rolling sum over the last 3 hours for just one column in SAS - sql

Everyone,
what I need is to calculate, for every record (every row), the sum of usage over the last 3 hours (usage is one of the columns in the dataset), grouped by User and ID_option.
Every row represents one record (one hour has about a million records). As an example I made a table with just a few records, including the desired column sum_usage_3hr:
User ID_option time usage sum_usage_3hr
1 a1 12OCT2017:11:20:32 3 10
1 a1 12OCT2017:10:23:24 7 14
1 b1 12OCT2017:09:34:55 12 12
2 b1 12OCT2017:08:55:06 4 6
1 a1 12OCT2017:07:59:53 7 7
2 b1 12OCT2017:06:59:12 2 2
I have tried something like the code below, but it returns the sum over all time, not just the last 3 hours. I'm not surprised, but I don't have much idea how to do this in SAS.
proc sql;
CREATE table my_table AS
SELECT *, SUM(usage) AS sum_usage_3hr
FROM prev_table
WHERE time BETWEEN time AND intnx('second', time, -3*3600)
GROUP BY User, ID_option;
QUIT;
Any help is welcome, thanks. It doesn't have to be proc sql; a data step is also acceptable if it's possible. I just assume that I need some kind of partition by.
Thanks in advance.

Why not just use a correlated sub-query to get the sum?
data have ;
input user id_option $ datetime :datetime. usage expected ;
format datetime datetime20.;
cards;
1 a1 12OCT2017:11:20:32 3 10
1 a1 12OCT2017:10:23:24 7 14
1 b1 12OCT2017:09:34:55 12 12
2 b1 12OCT2017:08:55:06 4 6
1 a1 12OCT2017:07:59:53 7 7
2 b1 12OCT2017:06:59:12 2 2
;
proc print; run;
proc sql ;
create table want as
select a.*
, (select sum(b.usage)
from have b
where a.user=b.user and a.id_option=b.id_option
and b.datetime between intnx('hour',a.datetime,-3,'s') and a.datetime
) as usage_3hr
from have a
;
quit;
Results
Obs user id_option datetime usage expected usage_3hr
1 1 a1 12OCT2017:11:20:32 3 10 10
2 1 a1 12OCT2017:10:23:24 7 14 14
3 1 b1 12OCT2017:09:34:55 12 12 12
4 2 b1 12OCT2017:08:55:06 4 6 6
5 1 a1 12OCT2017:07:59:53 7 7 7
6 2 b1 12OCT2017:06:59:12 2 2 2
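For the first row, for instance, the three-hour window runs from 12OCT2017:08:20:32 to 11:20:32, so for user 1 / a1 it picks up the rows at 11:20:32 and 10:23:24 but not 07:59:53, giving 3 + 7 = 10, as expected. As a quick sanity check against the expected column, a query along these lines (reusing the names above) should report zero mismatches:
proc sql;
select count(*) as mismatches
from want
where usage_3hr ne expected; /* 0 if the rolling sums agree with expected */
quit;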

The result is not surprising: the WHERE condition compares each row's time against itself, so it is always true, and the GROUP BY then sums over all time.
I believe the simplest way would be to join the table to itself and select the relevant rows that way:
proc sql;
create table want as
select distinct a.*
,sum(b.USAGE) as sum_usage_3hr
from have as a
left join have as b
on a.USER = b.USER
and a.ID_OPTION = b.ID_OPTION
and b.TIME between intnx('hour', a.TIME, -3, 'same') and a.TIME
group by a.USER, a.ID_OPTION, a.TIME;
quit;
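Note the fourth argument to INTNX ('same', matching the 's' in the first answer): with the default 'beginning' alignment, intnx('hour', a.TIME, -3) would snap to the top of the hour three intervals back and silently widen the window, whereas 'same' keeps the minute and second offsets so the window is exactly three hours.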

Related

Counting SUM(VALUE) from previous cell

I have the following table:
A         Sum(Tickets)
01-2022   5
02-2022   2
03-2022   8
04-2022   1
05-2022   3
06-2022   3
07-2022   4
08-2022   1
09-2022   5
10-2022   5
11-2022   3
I would like to create the following extra column 'TotalSum(Tickets)', but I am stuck. Can anyone help out?
A         Sum(Tickets)   TotalSum(Tickets)
01-2022   5               5
02-2022   2               7
03-2022   8              15
04-2022   1              16
05-2022   3              19
06-2022   3              22
07-2022   4              26
08-2022   1              27
09-2022   5              32
10-2022   5              37
11-2022   3              40
You may use SUM() as a window function here:
SELECT A, SumTickets, SUM(SumTickets) OVER (ORDER BY A) AS TotalSumTickets
FROM yourTable
ORDER BY A;
But this assumes that you actually have a bona fide column SumTickets that contains the sums. If what you showed is really the intermediate result of some aggregation query, you should use:
SELECT A, SUM(Tickets) AS SumTickets,
SUM(SUM(Tickets)) OVER (ORDER BY A) AS TotalSumTickets
FROM yourTable
GROUP BY A
ORDER BY A;
Alternatively, left join the table to itself on rows whose date is not greater, then sum per date:
select
table1.date,
sum(t.tickets)
from
table1
left join table1 t
on t.date <= table1.date
group by
table1.date;
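If window functions are not available (for example, MySQL before 8.0), the same running total can be written as a correlated subquery. A minimal sketch, assuming the same yourTable with columns A and SumTickets, and that A sorts correctly as text (true for the single-year sample shown):
SELECT A, SumTickets,
(SELECT SUM(t2.SumTickets)
FROM yourTable t2
WHERE t2.A <= t.A) AS TotalSumTickets
FROM yourTable t
ORDER BY A;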

SQL How to SUM rows in second column if first column contain

View of a table
ID   kWh
1    3
1    10
1    8
1    11
2    12
2    4
2    7
2    8
3    3
3    4
3    5
I want to receive:
ID   kWh
1    32
2    31
3    12
The table itself is larger and more complex, but the point is this. How can this be done? And I can't know the ID values of the first column in advance.
SELECT T.ID, SUM(T.KWH) AS SUM_KWH
FROM YOUR_TABLE T
GROUP BY T.ID;
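A minimal end-to-end sketch (the table name your_table is a placeholder) showing that the grouped query produces the requested output:
-- demo setup with the sample values from the question
CREATE TABLE your_table (id INT, kwh INT);
INSERT INTO your_table VALUES
(1,3),(1,10),(1,8),(1,11),
(2,12),(2,4),(2,7),(2,8),
(3,3),(3,4),(3,5);

-- grouped sum; no need to know the ids in advance
SELECT id, SUM(kwh) AS sum_kwh
FROM your_table
GROUP BY id
ORDER BY id;
-- returns (1, 32), (2, 31), (3, 12)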
Do you need this one?
Let's assume your database name is 'testdb' and your table name is 'table1'.
SELECT * FROM testdb.table1;
SELECT id, SUM(kwh) AS "kwh2"
FROM testdb.table1
WHERE id = 1;
Keep re-running the query with each of the ids and you will get the desired output.
Hope this helps.

Applying transformations or joining conditions to achieve the result in pyspark or hive

Given two tables or dataframes: one holds datasets and their corresponding tables; the other holds target and source.
I need a solution for the condition below:
once we find ft.dataset = st.source, we need to replace st.source with ft.table and ignore the remaining records in that block.
For example, in the first block of the second table (seq_no 1 to 6) there is a match at Abc, so we replace it with db.table1 and ignore the remaining records in that block. We need to do the same for every block of the second table.
Note that Target is the same in all rows of the second table.
Please help me with a possible solution in pyspark or Hive.
First table(ft):
Dataset | Table
_________________
Abc db.table1
Xyz db.table2
Def db.table3
Second table(st):
Target| seq_no| source
______________________
A 1 A
A 2 B1
A 3 C1
A 4 D1
A 5 Abc
A 6 Xyz
A 1 A
A 2 B1
A 3 C1
A 4 D1
A 5 Def
A 6 Abc
A 7 Xyz
Expected output:
Target| seq_no | source
_______________________
A 1 A
A 2 B1
A 3 C1
A 4 D1
A 5 db.table1
A 1 A
A 2 B1
A 3 C1
A 4 D1
A 5 db.table3
In Hive, you can use a left join to search for a match in the first table, and a window min() to identify the sequence number of the first match:
select target, seq_no, source
from (
select
st.target,
st.seq_no,
coalesce(ft.table, st.source) as source,
min(case when ft.dataset is not null then st.seq_no end) over(partition by st.target) first_matched_seq_no
from st
left join ft on ft.dataset = st.source
) t
where first_matched_seq_no is null or seq_no <= first_matched_seq_no
order by target, seq_no
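With the sample data, Abc, Xyz and Def all match a dataset, so the smallest matched seq_no in the partition is 5; the filter then keeps seq_no 1 to 5 in each block, and the matched rows at seq_no 5 show db.table1 and db.table3 in place of Abc and Def. In PySpark, the same statement can be run unchanged through spark.sql() after registering st and ft as temporary views.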

How to get average runs for each over in SQL?

The first six balls make up the first over, the next six balls the second over, and so on. How do I get the average runs for each over?
Input:
Ball no Runs
1 4
2 6
3 3
4 2
5 6
6 1
1 2
2 4
3 6
4 3
5 1
6 1
1 2
output should be:
Over no avg runs
1 3.66
2 2.83
As Gordon Linoff suggested, SQL tables represent unordered sets, so you have to rely on an ordering column in your table. If you have such a column, you may use the query below:
SELECT Over_no, AVG(Runs) avg_runs
FROM (SELECT Ball_no, Runs,
CEIL(ROW_NUMBER() OVER(ORDER BY ORDER_COLUMN, Ball_no) / 6) Over_no
FROM YOUR_TABLE)
GROUP BY Over_no;
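As a check against the sample data: the first over averages (4+6+3+2+6+1)/6 = 22/6 ≈ 3.67 and the second (2+4+6+3+1+1)/6 = 17/6 ≈ 2.83, so the 3.66 in the question looks truncated rather than rounded.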
I have managed to solve my problem with the following query:
SELECT ROWNUM OVER_NO, AVG_RUNS
FROM (
SELECT ROWNUM RN,
ROUND(AVG(RUNS) OVER(ORDER BY ROWNUM RANGE BETWEEN CURRENT ROW AND 5 FOLLOWING), 2) AVG_RUNS
FROM TABLE_NAME
)
WHERE RN = 1 OR RN = 7;

Delete rows which are duplicated and follow each other consecutively

It's hard to formulate, so I'll just show an example; you are welcome to edit my question and title.
Suppose I have a table
flag id value datetime
0 b 1 343 13
1 a 1 23 12
2 b 1 21 11
3 b 1 32 10
4 c 2 43 11
5 d 2 43 10
6 d 2 32 9
7 c 2 1 8
For each id I want to squeeze the table by the flag column so that consecutive duplicate flag values collapse into one row, with sum aggregation. Desired result:
flag id value
0 b 1 343
1 a 1 23
2 b 1 53
3 c 2 75
4 d 2 32
5 c 2 1
P.S.: I found functions like CONDITIONAL_CHANGE_EVENT, which seem to be able to do this, but the examples in the docs don't work for me.
Use the difference-of-row-numbers approach to assign groups based on consecutive rows carrying the same flag. Then take the sum over each group.
select distinct id, flag, sum(value) over(partition by id, flag, grp) as finalvalue
from (
select t.*, row_number() over(partition by id order by datetime) - row_number() over(partition by id, flag order by datetime) as grp
from tbl t
) t
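To see why this works, take id 1 in ascending datetime order (flags b, b, a, b): the overall row numbers are 1, 2, 3, 4, the per-flag row numbers are 1, 2, 1, 3, and the differences are 0, 0, 2, 1. The two consecutive b rows share grp 0 and collapse to 21 + 32 = 53, matching the desired result. (Including flag in the outer partition guards against two runs of different flags happening to share the same grp value.)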
Here's an approach which uses CONDITIONAL_CHANGE_EVENT:
select
flag,
id,
sum(value) value
from (
select
conditional_change_event(flag) over (order by datetime desc) part,
flag,
id,
value
from so
) t
group by part, flag, id
order by part;
The result differs from the desired result stated in the question because of the order by datetime (the datetime values collide across the two ids). Adding a separate column for the row number and sorting on that gives the correct result.
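A minimal sketch of that fix, assuming a Vertica-style database (where CONDITIONAL_CHANGE_EVENT is available) and a hypothetical table so_numbered, which is so with an added distinct row-number column rn preserving the original order:
select
flag,
id,
sum(value) value
from (
select
conditional_change_event(flag) over (order by rn) part, -- rn breaks the datetime ties
flag,
id,
value
from so_numbered
) t
group by part, flag, id
order by part;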