Distinct count on a rolling time window - SQL

I want to count the number of distinct catalog numbers that have appeared within the last X minutes. This is usually called a rolling time window.
For instance, if I have:
row starttime        orderNumber catalogNumb
1   2007-09-24-15.50 o1          21
2   2007-09-24-15.51 o2          22
3   2007-09-24-15.52 o2          23
4   2007-09-24-15.53 o3          24
5   2007-09-24-15.54 o4          21
6   2007-09-24-15.55 o4          21
7   2007-09-24-15.56 o4          21
8   2007-09-24-15.57 o4          21
For example, with a window of 5 minutes (5 is just one possible value), the output should be:
row starttime        orderNumber catalogNumb countCatalog
1   2007-09-24-15.50 o1          21          1
2   2007-09-24-15.51 o2          22          2
3   2007-09-24-15.52 o2          23          3
4   2007-09-24-15.53 o3          24          4
5   2007-09-24-15.54 o4          21          4
6   2007-09-24-15.55 o4          21          4
7   2007-09-24-15.56 o4          21          4
8   2007-09-24-15.57 o4          21          3
I am using Big SQL for InfoSphere BigInsights v3.0.
The resulting query can use any DB2 OLAP window function except for
count(distinct catalogNumb) OVER(...), which is not supported by my DB2 version.
In addition to count, I may also need to use other aggregate functions (avg, sum, ...) over catalogNumb and other attributes.
Any feedback would be appreciated.

True, DB2 does not support count distinct as an OLAP function, but there is an easy workaround:
you can use
dense_rank
instead - the highest dense rank (its max) is your distinct count!
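For example, a minimal sketch of the trick (the table name mytable is an assumption), computing the distinct count over the whole table:

-- Each distinct catalogNumb gets exactly one dense rank, so the maximum
-- rank equals count(distinct catalogNumb).
SELECT t.*,
       MAX(dr) OVER () AS countCatalog
FROM (SELECT m.*,
             DENSE_RANK() OVER (ORDER BY catalogNumb) AS dr
      FROM mytable m
     ) t;

For the rolling time window itself a per-row window is still needed; see the correlated-subquery sketch after the next answer.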

You can try something like this:
select ...
from mytable
where starttime between current timestamp - 5 minutes and current timestamp
That will get all the rows from the last 5 minutes (5 can be a variable); you can then count(), sum(), or avg() over those rows.
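That only answers "now minus 5 minutes", though. To get the count per row, as in the question's expected output, a correlated subquery is one portable option. A hedged sketch, again assuming the table is named mytable and that DB2-style labeled durations are available in Big SQL:

-- For each row, count the distinct catalogNumb values seen in the
-- 5 minutes up to and including that row's starttime.
SELECT t.*,
       (SELECT COUNT(DISTINCT t2.catalogNumb)
        FROM mytable t2
        WHERE t2.starttime BETWEEN t.starttime - 5 MINUTES
                               AND t.starttime
       ) AS countCatalog
FROM mytable t;

Here count(distinct ...) is an ordinary aggregate, not an OLAP function, so the limitation from the question should not apply.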

Related

Group rows using the cumulative sum of a third column

I have a table with two columns:
sort_column = A column I use for sorting
value_column = My metric of interest (a positive integer)
Using SQL, I need to create contiguous groups of rows, ordered by sort_column, such that the sum of value_column within each group is as large as possible while staying strictly below 100.
Find below an example of my desired result.
Thanks
sort_column  value_column  desired_result
1            53            1
2            25            1
3            33            2
4            25            2
5            10            2
6            46            3
7            9             3
8            49            4
9            48            4
10           53            5
11           33            5
12           52            6
13           29            6
14           16            6
15           66            7
16           1             7
17           62            8
18           57            9
19           47            10
20           12            10
OK, so after a few lengthy attempts, I came to the conclusion that the task is impossible in pure SQL: a given value of the desired column depends on previous values of that same column, in a way that cannot be obtained from the first two columns alone, so the problem cannot be tackled without a recursive CTE, which BigQuery does not support.
I solved the issue by writing a JavaScript UDF for the task. It seems to be working fine and produces the expected results.
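For reference, a minimal sketch of what such a UDF-based solution could look like (the table name my_table and the exact greedy loop are assumptions, not the asker's actual code):

-- Greedy grouping in JavaScript: start a new group whenever adding the
-- next value would push the running sum to 100 or more.
-- (JS UDFs use FLOAT64, since JavaScript has no 64-bit integers.)
CREATE TEMP FUNCTION assign_groups(vals ARRAY<FLOAT64>)
RETURNS ARRAY<FLOAT64>
LANGUAGE js AS """
  var out = [], groupId = 1, running = 0;
  for (var i = 0; i < vals.length; i++) {
    if (running + vals[i] >= 100) { groupId++; running = 0; }
    running += vals[i];
    out.push(groupId);
  }
  return out;
""";

WITH agg AS (
  -- Collapse the table into ordered arrays so the UDF sees every row at once.
  SELECT ARRAY_AGG(sort_column ORDER BY sort_column) AS sorts,
         ARRAY_AGG(CAST(value_column AS FLOAT64) ORDER BY sort_column) AS vals
  FROM my_table
)
SELECT sorts[OFFSET(pos)]                              AS sort_column,
       CAST(v AS INT64)                                AS value_column,
       CAST(assign_groups(vals)[OFFSET(pos)] AS INT64) AS desired_result
FROM agg, UNNEST(vals) AS v WITH OFFSET pos
ORDER BY sort_column;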
Many thanks everyone!

How to write a SQL query to generate a group number for each grouped record

The scenario is as follows:
I have data in the following format:
entryid , ac_no, db/cr, amt
-----------------------------------------------
1 10 D 5
1 11 C 5
2 01 D 8
2 11 C 8
3 12 D 10
3 13 C 10
4 14 D 5
4 16 C 5
5 14 D 2
5 17 C 2
6 14 D 3
6 18 C 3
I want the data in the following format. So far I have achieved the first three columns with this query:
select wm_concat(entryid), ac_no, db_cr, sum(amt) from t1 group by ac_no, db_cr
wm_concat(entryid), ac_no, db/cr, sum(amt), set_id
------------------------------------------------
1 10 D 5 S1
2 01 D 8 S1
1,2 11 C 13 S1
3 12 D 10 S2
3 13 C 10 S2
4,5,6 14 D 10 S3
4 16 C 5 S3
5 17 C 2 S3
6 18 C 3 S3
I want an additional column `set_id` that shows either S1, S2, ... or any number 1, 2, ... so that the sets of debit & credit entries can be identified.
I am making sets of debit and credit entries based on their ac_no values.
Any little help will be highly appreciated. Thanks
Create a new column, say set, and give a unique identifier to each particular set. So, for example, the first three records will have set id S1, the next two will have S2, and so on.
To distinguish a transaction from a set, you can use the db/cr column along with the newly added set column. You can identify that the 3rd row is a set because its transaction type is 'C' whereas the other transactions are of type 'D'.
Here I have assumed that your transactions are debit only; if not, please provide more details in the question. Hope this helps.
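One way to actually generate such a set_id is to treat entries that share an ac_no as connected and label each connected component. A hedged sketch for Oracle (assuming table t1 with columns entryid, ac_no, db_cr, amt; it enumerates paths through shared accounts, so it is only suitable for small sets):

WITH pairs (a, b) AS (
  -- Two entryids are linked when they post to the same account.
  SELECT DISTINCT x.entryid, y.entryid
  FROM t1 x JOIN t1 y ON x.ac_no = y.ac_no
),
walk (start_id, cur_id, path) AS (
  SELECT a, a, ',' || TO_CHAR(a) || ',' FROM (SELECT DISTINCT a FROM pairs)
  UNION ALL
  -- Follow links, carrying a visited list to avoid revisiting entryids.
  SELECT w.start_id, p.b, w.path || TO_CHAR(p.b) || ','
  FROM walk w JOIN pairs p ON p.a = w.cur_id
  WHERE w.path NOT LIKE '%,' || TO_CHAR(p.b) || ',%'
),
roots AS (
  -- The smallest reachable entryid identifies the whole component.
  SELECT start_id, MIN(cur_id) AS set_root
  FROM walk
  GROUP BY start_id
)
SELECT t1.*, 'S' || DENSE_RANK() OVER (ORDER BY r.set_root) AS set_id
FROM t1 JOIN roots r ON r.start_id = t1.entryid
ORDER BY t1.entryid;

On the sample data this yields S1 for entryids 1 and 2 (linked through ac_no 11), S2 for entryid 3, and S3 for entryids 4, 5 and 6 (linked through ac_no 14).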

Multiple cumulative sums [duplicate]

Hopefully the title is explicit enough.
I have a table looking like this:
classes id value
a 1 10
a 2 15
a 3 12
b 1 5
b 2 9
b 3 7
c 1 6
c 2 14
c 3 6
and here is what I would like:
classes id value cumsum
a 1 10 10
a 2 15 25
a 3 12 37
b 1 5 5
b 2 9 14
b 3 7 21
c 1 6 6
c 2 14 20
c 3 6 26
I've seen this solution, and I've already applied it successfully to cases where I don't have multiple classes:
id value cumsum
1 10 10
2 15 25
3 12 37
It was reasonably fast, even with datasets of size equivalent to the one I'm currently working on.
However, when I try to apply the exact same code to the dataset I'm working on now (which looks like the first table of this question, i.e. multiple classes), without subsetting it by a, b, c, it seems to take ages (it has been running for 4 hours now; the dataset is 40,000 rows).
Any idea if there is an issue with the code from the linked answer when used in this context? I have trouble wrapping my head around the triangular join, but I have the feeling the join might grow quickly as the number of rows increases, slowing the whole thing down a lot, perhaps worsened further by the fact that there are multiple "classes" over which to do the cumulative sums.
Is there any way this could be done faster? I'm using SQL in R through the sqldf package. A solution in either R code (with or without a common external package) or SQL code will do.
Thanks
In SQL, you can do a cumulative sum using the ANSI standard sum() over () functionality:
select classes, id, value,
       sum(value) over (partition by classes order by id) as cumsum
from t;
Or you can use by from the base package:
df$cumsum <- unlist(by(df$value, df$classes, cumsum))  # per-class cumsum, flattened back into one vector
# classes id value cumsum
#1 a 1 10 10
#2 a 2 15 25
#3 a 3 12 37
#4 b 1 5 5
#5 b 2 9 14
#6 b 3 7 21
#7 c 1 6 6
#8 c 2 14 20
#9 c 3 6 26

SQL Server 2008 - calculated and ordered list needs to return only 2 entries per supplier

I have a dataset like the one below, but longer. I want to pick the fleet_id rows with the best overall 'StarDriver' value, but return only two results for each 'supplier_id' and a maximum of 20 rows in total.
fleet_id supplier_id Ratings Driver Punctuality Car StarDriver
19442 151 10 5 5 5 5
19634 151 11 5 5 5 5
19437 151 12 5 5 5 5
12832 10 14 5 4.92857142857143 5 4.97619047619048
12217 111 10 5 5 4.9 4.96666666666667
21135 158 19 5 4.89473684210526 5 4.96491228070175
19436 151 14 4.85714285714286 5 5 4.95238095238095
12239 111 12 4.91666666666667 5 4.91666666666667 4.94444444444445
10520 92 12 4.91666666666667 5 4.91666666666667 4.94444444444445
19997 151 12 5 5 4.83333333333333 4.94444444444444
To limit the results to the top 2 for each supplier, use row_number(). This enumerates each supplier's rows, and you can then keep just two of them with where seqnum <= 2.
The rest of the query just selects the top 20 rows based on a field:
select top 20 t.*
from (select t.*,
             row_number() over (partition by supplier_id order by StarDriver desc) as seqnum
      from mytable t  -- substitute your table name
     ) t
where seqnum <= 2
order by StarDriver desc;

SQL Query- Partition into groups & calculate max- min value

Need your help with a SQL query on an Oracle db. I have data that I want to partition into groups wherever event = "Start". E.g. rows 1-6 form a group, and rows 7-9 form a group. I want to ignore rows with event = "Ignore". Finally, I want to calculate max(Value) - min(Value) for each of these groups. I don't have any other column to group the data by.
Can this be achieved? Is it possible to partition by event = "Start"? Sample data is below:
Row Event Value Required result (max-min of Value)
1 Start 10
2 A 11
3 B 12
4 C 13
5 D 14
6 E 15 5
--------------------------------------------
7 Start 16
8 A 18
9 B 20 4
--------------------------------------------
10 Start 27
11 A 30
12 B 33
13 C 34 7
--------------------------------------------
14 Ignore 35
--------------------------------------------
15 Ignore 36
--------------------------------------------
16 Start 33
17 A 34
18 B 35
19 C 36
20 D 37
21 E 38 5
--------------------------------------------
Yes, you can do this in SQL.
The following query first finds the group each row belongs to, by finding the largest "Start" row id at or before the row's own id. This version uses a correlated subquery for that calculation.
It then groups on that id and does the max-min calculation.
select groupid, max(value) - min(value)
from (select t.*,
             (select max(t2.row)
              from t t2
              where t2.row <= t.row and t2.event = 'Start'
             ) as groupid
      from t
     ) t
where event <> 'Ignore'
group by groupid;
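A hedged alternative over the same assumed table t (columns row, event, value): a running max window function avoids the correlated subquery and should scale better on large tables.

-- The running MAX over 'Start' row ids tags every row with the id of the
-- most recent Start row, which serves as the group id.
-- Note: ROW is a reserved word in Oracle, so the column may need quoting
-- or renaming in a real schema.
select groupid, max(value) - min(value) as result
from (select t.*,
             max(case when event = 'Start' then row end)
               over (order by row) as groupid
      from t
     ) t
where event <> 'Ignore'
group by groupid;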