Rolling count over window without NULL or repeated values - google-bigquery

Consider the following window, already grouped by id and sorted by timestamp in descending order.
id   timestamp  val
foo  10:50      NULL
foo  10:40      a
foo  10:30      a
foo  10:20      NULL
foo  10:10      NULL
foo  10:00      b
foo  9:50       c
foo  9:40       NULL
foo  9:30       d
foo  9:20       NULL
A given val will not appear again once a different value occurs; that is, I won't have a,b,a, but a,NULL,a might appear. I want to generate a rolling count on the condition that val is not NULL and has not been previously seen. That is, I'd like something like:
id   timestamp  val   count
foo  10:50      NULL  0
foo  10:40      a     1
foo  10:30      a     1
foo  10:20      NULL  1
foo  10:10      a     1
foo  10:00      b     2
foo  9:50       c     3
foo  9:40       NULL  3
foo  9:30       d     4
foo  9:20       NULL  4
So essentially a "collapsed" count. I've tried something like
SELECT *, COUNT(val) OVER (PARTITION BY id ORDER BY timestamp DESC) count
But this does not disregard a val that has previously occurred.
Any idea how to do this?

Consider below approach
select * except(first_seen),
  -- running count of first occurrences of non-NULL values
  -- (for multiple ids, add "partition by id" to both window clauses)
  countif(first_seen and not val is null) over(order by timestamp desc) distinct_count
from (
  select *,
    -- true only on the most recent (first, in descending order) row of each val
    1 = row_number() over(partition by val order by timestamp desc) first_seen
  from table
)
If applied to the sample data in your question, the output matches the expected rolling count above.
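For reference, a minimal runnable sketch that plugs the query above into a WITH clause built from the sample rows (the mytable name and the CAST on the first NULL are assumptions for illustration):
with mytable as (
  select 'foo' as id, time(10, 50, 0) as timestamp, cast(null as string) as val union all
  select 'foo', time(10, 40, 0), 'a' union all
  select 'foo', time(10, 30, 0), 'a' union all
  select 'foo', time(10, 20, 0), null union all
  select 'foo', time(10, 10, 0), 'a' union all
  select 'foo', time(10, 0, 0), 'b' union all
  select 'foo', time(9, 50, 0), 'c' union all
  select 'foo', time(9, 40, 0), null union all
  select 'foo', time(9, 30, 0), 'd' union all
  select 'foo', time(9, 20, 0), null
)
select * except(first_seen),
  countif(first_seen and not val is null) over(order by timestamp desc) distinct_count
from (
  select *,
    1 = row_number() over(partition by val order by timestamp desc) first_seen
  from mytable
)
order by timestamp desc
The distinct_count column should come back as 0, 1, 1, 1, 1, 2, 3, 3, 4, 4 for the ten rows, matching the count column in the expected output.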

Try this one:
with mytable as
(
select 'foo' as id, time(10, 50, 0) as timestamp, NULL as val union all
select 'foo' as id, time(10, 40, 0) as timestamp, 'a' as val union all
select 'foo' as id, time(10, 30, 0) as timestamp, 'a' as val union all
select 'foo' as id, time(10, 20, 0) as timestamp, NULL as val union all
select 'foo' as id, time(10, 10, 0) as timestamp, 'a' as val union all
select 'foo' as id, time(10, 0, 0) as timestamp, 'b' as val union all
select 'foo' as id, time(9, 50, 0) as timestamp, 'c' as val union all
select 'foo' as id, time(9, 40, 0) as timestamp, NULL as val union all
select 'foo' as id, time(9, 30, 0) as timestamp, 'd' as val union all
select 'foo' as id, time(9, 20, 0) as timestamp, NULL as val
)
select id, timestamp, val,
  -- distinct values collected so far, minus 1 for the 'dummy' placeholder that stands in for NULLs
  (select count(distinct c) from unnest(cnt) as c) - 1 as count
from (
  select *, ARRAY_AGG(IFNULL(val, 'dummy')) OVER (PARTITION BY id ORDER BY timestamp DESC) as cnt
  from mytable
)

Related

How to group-by in Oracle

I have a table like [Original Data] below.
I want to sum VAL over consecutive runs of FIELD, as shown in [Result].
Does anyone have an idea how to write this query?
Thank you in advance for your help.
WITH t1 as (
SELECT 1 AS ID, 'A' AS FIELD, 1 AS VAL FROM dual
UNION SELECT 2 AS ID, 'A' AS FIELD, 2 AS VAL FROM dual
UNION SELECT 3 AS ID, 'A' AS FIELD, 1 AS VAL FROM dual
UNION SELECT 4 AS ID, 'B' AS FIELD, 2 AS VAL FROM dual
UNION SELECT 5 AS ID, 'B' AS FIELD, 2 AS VAL FROM dual
UNION SELECT 6 AS ID, 'B' AS FIELD, 1 AS VAL FROM dual
UNION SELECT 7 AS ID, 'A' AS FIELD, 3 AS VAL FROM dual
UNION SELECT 8 AS ID, 'A' AS FIELD, 2 AS VAL FROM dual
UNION SELECT 9 AS ID, 'A' AS FIELD, 1 AS VAL FROM dual
)
SELECT *
FROM t1
[Original Data]
ID FIELD VAL
1 A 1
2 A 2
3 A 1
4 B 2
5 B 2
6 B 1
7 A 3
8 A 2
9 A 1
[Result]
ID FIELD VAL
1 A 4
4 B 5
7 A 6
This is a gaps-and-islands issue and you can use analytic functions as follows:
WITH t1 as (
  SELECT 1 AS ID, 'A' AS FIELD, 1 AS VAL FROM dual
  UNION SELECT 2 AS ID, 'A' AS FIELD, 2 AS VAL FROM dual
  UNION SELECT 3 AS ID, 'A' AS FIELD, 1 AS VAL FROM dual
  UNION SELECT 4 AS ID, 'B' AS FIELD, 2 AS VAL FROM dual
  UNION SELECT 5 AS ID, 'B' AS FIELD, 2 AS VAL FROM dual
  UNION SELECT 6 AS ID, 'B' AS FIELD, 1 AS VAL FROM dual
  UNION SELECT 7 AS ID, 'A' AS FIELD, 3 AS VAL FROM dual
  UNION SELECT 8 AS ID, 'A' AS FIELD, 2 AS VAL FROM dual
  UNION SELECT 9 AS ID, 'A' AS FIELD, 1 AS VAL FROM dual
)
SELECT MIN(ID) AS ID, FIELD, SUM(VAL)
  FROM (SELECT T1.*,
               SUM(CASE WHEN LAG_FIELD = FIELD THEN 0 ELSE 1 END)
                   OVER (ORDER BY ID) AS SM
          FROM (SELECT T1.*,
                       LAG(FIELD) OVER (ORDER BY ID) AS LAG_FIELD
                  FROM t1
               ) T1
       )
 GROUP BY FIELD, SM
 ORDER BY 1;

        ID F   SUM(VAL)
---------- - ----------
         1 A          4
         4 B          5
         7 A          6
This is indeed a gaps-and-islands problem. I think the simplest approach here is to use the difference between row numbers to identify groups of adjacent rows:
select min(id) as id, field, sum(val) as val
from (
select t1.*,
row_number() over(order by id) rn1,
row_number() over(partition by field order by id) rn2
from t1
) t
group by field, rn1 - rn2
order by min(id)
If id is always incrementing without gaps, this is even simpler:
select min(id) as id, field, sum(val) as val
from (
select t1.*,
row_number() over(partition by field order by id) rn
from t1
) t
group by field, id - rn
order by min(id)
From Oracle 12, you can do it quite simply using MATCH_RECOGNIZE:
WITH t1 as (
SELECT 1 AS ID, 'A' AS FIELD, 1 AS VAL FROM dual
UNION SELECT 2 AS ID, 'A' AS FIELD, 2 AS VAL FROM dual
UNION SELECT 3 AS ID, 'A' AS FIELD, 1 AS VAL FROM dual
UNION SELECT 4 AS ID, 'B' AS FIELD, 2 AS VAL FROM dual
UNION SELECT 5 AS ID, 'B' AS FIELD, 2 AS VAL FROM dual
UNION SELECT 6 AS ID, 'B' AS FIELD, 1 AS VAL FROM dual
UNION SELECT 7 AS ID, 'A' AS FIELD, 3 AS VAL FROM dual
UNION SELECT 8 AS ID, 'A' AS FIELD, 2 AS VAL FROM dual
UNION SELECT 9 AS ID, 'A' AS FIELD, 1 AS VAL FROM dual
)
SELECT *
FROM t1
MATCH_RECOGNIZE (
ORDER BY id
MEASURES
FIRST( id ) AS id,
FIRST( field ) AS field,
SUM( val ) AS total
ONE ROW PER MATCH
PATTERN( same_field+ )
DEFINE same_field AS FIRST(field) = field
)
Which outputs:
ID | FIELD | TOTAL
-: | :---- | ----:
1 | A | 4
4 | B | 5
7 | A | 6
db<>fiddle here

Identify only when value matches

I need to return only the IDs that match a given value, e.g. Value = 'A', but only when the ID has 'A' and no other values.
T1:
ID Value
1 A
1 B
1 C
2 A
3 A
3 B
4 A
5 B
5 D
5 E
5 F
Desired Output:
2
4
How can I achieve this?
When I try the following, IDs 1 and 3 are also returned:
select ID from T1 where Value ='A'
With NOT EXISTS:
select t.id
from tablename t
where t.value = 'A'
and not exists (
select 1 from tablename
where id = t.id and value <> 'A'
)
From the sample data you posted there is no need to use:
select distinct t.id
but if you get duplicates then use it.
Another way if there are no null values:
select id
from tablename
group by id
having sum(case when value <> 'A' then 1 else 0 end) = 0
Or if you want the rows where the id has only 1 value = 'A':
select id
from tablename
group by id
having count(*) = 1 and max(value) = 'A'
I think the simplest way is aggregation with having:
select id
from tablename
group by id
having min(value) = max(value) and
min(value) = 'A';
Note that this ignores NULL values so it could return ids with both NULL and A. If you want to avoid that:
select id
from tablename
group by id
having count(value) = count(*) and
min(value) = max(value) and
min(value) = 'A';
Oracle Setup:
CREATE TABLE test_data ( ID, Value ) AS
SELECT 1, 'A' FROM DUAL UNION ALL
SELECT 1, 'B' FROM DUAL UNION ALL
SELECT 1, 'C' FROM DUAL UNION ALL
SELECT 2, 'A' FROM DUAL UNION ALL
SELECT 3, 'A' FROM DUAL UNION ALL
SELECT 3, 'B' FROM DUAL UNION ALL
SELECT 4, 'A' FROM DUAL UNION ALL
SELECT 5, 'B' FROM DUAL UNION ALL
SELECT 5, 'D' FROM DUAL UNION ALL
SELECT 5, 'E' FROM DUAL UNION ALL
SELECT 5, 'F' FROM DUAL
Query:
SELECT ID
FROM test_data
GROUP BY ID
HAVING COUNT( CASE Value WHEN 'A' THEN 1 END ) = 1
AND COUNT( CASE Value WHEN 'A' THEN NULL ELSE 1 END ) = 0
Output:
| ID |
| -: |
| 2 |
| 4 |
db<>fiddle here

Count total and count having in the same query

Is there a way to get the total count of rows per {id, date} and the count > 1 per {id, date, columnX} in the same query?
For example, having such a table:
id date columnX
1 2017-04-20 a
1 2017-04-20 a
1 2017-04-18 b
1 2017-04-17 c
2 2017-04-20 a
2 2017-04-20 a
2 2017-04-20 c
2 2017-04-19 b
2 2017-04-19 b
2 2017-04-19 b
2 2017-04-19 b
2 2017-04-19 c
As the result, I want to get the following table:
id date columnX count>1 count_total
1 2017-04-20 a 2 2
2 2017-04-20 a 2 3
2 2017-04-19 b 4 5
I tried to do it with PARTITION BY but got weird results. I've heard ROLLUP might be used, but it seems to be available only in legacy SQL, which is not an option for me.
If I understand correctly, you can use window functions:
select id, date, columnx, cnt,
       (case when cnt > 1 then cnt else 0 end) as cnt_gt_1,
       total_cnt
from (select id, date, columnx, count(*) as cnt,
             sum(count(*)) over (partition by id, date) as total_cnt
      from t
      group by id, date, columnx
     ) x
where cnt > 1;
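As a quick sanity check, here is the same query wrapped around a WITH clause built from the sample rows in the question (reusing the t name from the query above):
with t as (
  select 1 as id, '2017-04-20' as date, 'a' as columnx union all
  select 1, '2017-04-20', 'a' union all
  select 1, '2017-04-18', 'b' union all
  select 1, '2017-04-17', 'c' union all
  select 2, '2017-04-20', 'a' union all
  select 2, '2017-04-20', 'a' union all
  select 2, '2017-04-20', 'c' union all
  select 2, '2017-04-19', 'b' union all
  select 2, '2017-04-19', 'b' union all
  select 2, '2017-04-19', 'b' union all
  select 2, '2017-04-19', 'b' union all
  select 2, '2017-04-19', 'c'
)
select id, date, columnx, cnt,
       (case when cnt > 1 then cnt else 0 end) as cnt_gt_1,
       total_cnt
from (select id, date, columnx, count(*) as cnt,
             sum(count(*)) over (partition by id, date) as total_cnt
      from t
      group by id, date, columnx
     ) x
where cnt > 1
This should return the three rows from the desired output: (1, 2017-04-20, a) with counts 2/2, (2, 2017-04-20, a) with 2/3, and (2, 2017-04-19, b) with 4/5.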
Another possibility:
SELECT
id,
date,
data.columnX columnX,
data.count_ count_bigger_1,
count_total
FROM(
SELECT
id,
date,
ARRAY_AGG(columnX) data,
COUNT(1) count_total
FROM
`your_table_name`
GROUP BY
id, date
),
UNNEST(ARRAY(SELECT AS STRUCT columnX, count(1) count_ FROM UNNEST(data) columnX GROUP BY columnX HAVING count(1) > 1)) data
You can test it with simulated data:
WITH data AS(
SELECT 1 AS id, '2017-04-20' AS date, 'a' AS columnX UNION ALL
SELECT 1 AS id, '2017-04-20' AS date, 'a' AS columnX UNION ALL
SELECT 1 AS id, '2017-04-18' AS date, 'b' AS columnX UNION ALL
SELECT 1 AS id, '2017-04-17' AS date, 'c' AS columnX UNION ALL
SELECT 2 AS id, '2017-04-20' AS date, 'a' AS columnX UNION ALL
SELECT 2 AS id, '2017-04-20' AS date, 'a' AS columnX UNION ALL
SELECT 2 AS id, '2017-04-20' AS date, 'c' AS columnX UNION ALL
SELECT 2 AS id, '2017-04-19' AS date, 'b' AS columnX UNION ALL
SELECT 2 AS id, '2017-04-19' AS date, 'b' AS columnX UNION ALL
SELECT 2 AS id, '2017-04-19' AS date, 'b' AS columnX UNION ALL
SELECT 2 AS id, '2017-04-19' AS date, 'b' AS columnX UNION ALL
SELECT 2 AS id, '2017-04-19' AS date, 'c' AS columnX
)
SELECT
id,
date,
data.columnX columnX,
data.count_ count_bigger_1,
count_total
FROM(
SELECT
id,
date,
ARRAY_AGG(columnX) data,
COUNT(1) count_total
FROM
data
GROUP BY
id, date
),
UNNEST(ARRAY(SELECT AS STRUCT columnX, count(1) count_ FROM UNNEST(data) columnX GROUP BY columnX HAVING count(1) > 1)) data
This solution avoids the analytical function (which can be quite expensive depending on the input) and scales well to large volumes of data.
I recommend you add the two rows below to your example
1 2017-04-20 x
1 2017-04-20 x
and check what the solutions in the two previous answers will give you:
It will be something like below:
id date columnX count>1 count_total
1 2017-04-20 a 2 4
1 2017-04-20 x 2 4
2 2017-04-20 a 2 3
2 2017-04-19 b 4 5
Notice the two rows for id=1 and date=2017-04-20, both having count_total=4.
I am not sure if this is what you want; you might not even have considered this scenario in your question.
Anyway, I feel that to support the more generic case above, your expected output should be like below:
Row  id  date        x.columnX  x.countX  count_total
1    1   2017-04-20  x          2         4
                     a          2
2    2   2017-04-20  a          2         3
3    2   2017-04-19  b          4         5
where x is a repeated field and each element holds a columnX value with its count.
The query below does exactly this:
#standardSQL
SELECT id, date,
ARRAY(SELECT x FROM UNNEST(x) AS x WHERE countX > 1) AS x,
count_total
FROM (
SELECT id, date, SUM(countX) AS count_total,
ARRAY_AGG(STRUCT<columnX STRING, countX INT64>(columnX, countX) ORDER BY countX DESC) AS X
FROM (
SELECT id, date,
columnX, COUNT(1) countX
FROM `yourTable`
GROUP BY id, date, columnX
)
GROUP BY id, date
HAVING count_total > 1
)
You can play/test it with the dummy data from your question:
#standardSQL
WITH `yourTable` AS(
SELECT 1 AS id, '2017-04-20' AS date, 'a' AS columnX UNION ALL
SELECT 1, '2017-04-20', 'a' UNION ALL
SELECT 1, '2017-04-20', 'x' UNION ALL
SELECT 1, '2017-04-20', 'x' UNION ALL
SELECT 1, '2017-04-18', 'b' UNION ALL
SELECT 1, '2017-04-17', 'c' UNION ALL
SELECT 2, '2017-04-20', 'a' UNION ALL
SELECT 2, '2017-04-20', 'a' UNION ALL
SELECT 2, '2017-04-20', 'c' UNION ALL
SELECT 2, '2017-04-19', 'b' UNION ALL
SELECT 2, '2017-04-19', 'b' UNION ALL
SELECT 2, '2017-04-19', 'b' UNION ALL
SELECT 2, '2017-04-19', 'b' UNION ALL
SELECT 2, '2017-04-19', 'c'
)
SELECT id, date,
ARRAY(SELECT x FROM UNNEST(x) AS x WHERE countX > 1) AS x,
count_total
FROM (
SELECT id, date, SUM(countX) AS count_total,
ARRAY_AGG(STRUCT<columnX STRING, countX INT64>(columnX, countX) ORDER BY countX DESC) AS X
FROM (
SELECT id, date,
columnX, COUNT(1) countX
FROM `yourTable`
GROUP BY id, date, columnX
)
GROUP BY id, date
HAVING count_total > 1
)

SQL query for segregating max and min values of a column in two different columns say Val1 and Val2

I have below table:
ID DateVal Val
1 1/1/2010 a
1 2/2/2010 b
1 3/3/2010 c
2 4/4/2010 d
2 5/5/2010 e
2 6/6/2010 f
3 7/7/2010 g
3 8/8/2010 h
3 9/9/2010 i
I need below:
ID Val1 Val2
1 a c
2 d f
3 g i
i.e. the Val at the min date in column 'Val1' and the Val at the max date in column 'Val2'.
What queries are there to achieve this output, and which one is the easiest?
Sample data:
with T as (
select 1 as id, to_date('01.01.2010','DD.MM.YYYY') dt, 'a' val
from dual union all
select 1 as id, to_date('02.02.2010','DD.MM.YYYY') dt, 'b' val
from dual union all
select 1 as id, to_date('03.03.2010','DD.MM.YYYY') dt, 'c' val
from dual union all
select 2 as id, to_date('04.04.2010','DD.MM.YYYY') dt, 'd' val
from dual union all
select 2 as id, to_date('05.05.2010','DD.MM.YYYY') dt, 'e' val
from dual union all
select 2 as id, to_date('06.06.2010','DD.MM.YYYY') dt, 'f' val
from dual union all
select 3 as id, to_date('07.07.2010','DD.MM.YYYY') dt, 'g' val
from dual union all
select 3 as id, to_date('08.08.2010','DD.MM.YYYY') dt, 'h' val
from dual union all
select 3 as id, to_date('09.09.2010','DD.MM.YYYY') dt, 'i' val
from dual)
Code
select
  id,
  -- val on the row with the earliest dt for each id
  max(val) keep (dense_rank first order by dt) as val1,
  -- val on the row with the latest dt for each id
  max(val) keep (dense_rank first order by dt desc) as val2
from t
group by id
with t1 as
(select ID, min(dtval) dtval, min(val) val from date_val group by ID),
t2 as
(select ID, max(dtval) dtval, max(val) val from date_val group by ID)
select t1.id, t1.val val1, t2.val val2 from t1 join t2 on t1.id = t2.id;

SQL Grouping by Ranges

I have a data set that has timestamped entries over various sets of groups.
Timestamp -- Group -- Value
---------------------------
1 -- A -- 10
2 -- A -- 20
3 -- B -- 15
4 -- B -- 25
5 -- C -- 5
6 -- A -- 5
7 -- A -- 10
I want to sum these values by the Group field, but with each consecutive run of the same group kept separate, in the order it appears in the data. For example, the above data would result in the following output:
Group -- Sum
A -- 30
B -- 40
C -- 5
A -- 15
I do not want this, which is all I've been able to come up with on my own so far:
Group -- Sum
A -- 45
B -- 40
C -- 5
Using Oracle 11g, this is what I've hobbled together so far. I know that this is wrong, but I'm hoping I'm at least on the right track with RANK(). In the real data, entries with the same group could be 2 timestamps apart, or 100; there could be one entry in a group, or 100 consecutive. It does not matter, I need them separated.
WITH SUB_Q AS
(SELECT K_ID
, GRP
, VAL
-- GET THE RANK FROM TIMESTAMP TO SEPARATE GROUPS WITH SAME NAME
, RANK() OVER(PARTITION BY K_ID ORDER BY TMSTAMP) AS RNK
FROM MY_TABLE
WHERE K_ID = 123)
SELECT T1.K_ID
, T1.GRP
, SUM(CASE
WHEN T1.GRP = T2.GRP THEN
T1.VAL
ELSE
0
END) AS TOTAL_VALUE
FROM SUB_Q T1 -- MAIN VALUE
INNER JOIN SUB_Q T2 -- TIMSTAMP AFTER
ON T1.K_ID = T2.K_ID
AND T1.RNK = T2.RNK - 1
GROUP BY T1.K_ID
, T1.GRP
Is it possible to group in this way? How would I go about doing this?
I approach this problem by defining a group as the difference of two row_number() values:
select group, sum(value)
from (select t.*,
(row_number() over (order by timestamp) -
row_number() over (partition by group order by timestamp)
) as grp
from my_table t
) t
group by group, grp
order by min(timestamp);
The difference of two row numbers is constant for adjacent values.
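To illustrate, here is a small sketch over the sample rows (column names shortened to ts/grp/val to sidestep Oracle's reserved words) that exposes both row numbers and their difference:
with t as (
  select 1 ts, 'A' grp, 10 val from dual union all
  select 2, 'A', 20 from dual union all
  select 3, 'B', 15 from dual union all
  select 4, 'B', 25 from dual union all
  select 5, 'C', 5 from dual union all
  select 6, 'A', 5 from dual union all
  select 7, 'A', 10 from dual
)
select t.*,
       row_number() over (order by ts) as rn1,
       row_number() over (partition by grp order by ts) as rn2,
       row_number() over (order by ts)
         - row_number() over (partition by grp order by ts) as diff
from t
order by ts;
For these rows diff comes out as 0, 0, 2, 2, 4, 3, 3: constant within each consecutive run of the same grp and different between the first run of A and the second, so grouping by (grp, diff) isolates each island.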
A solution using LAG and windowed analytic functions:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE TEST ( "Timestamp", "Group", Value ) AS
SELECT 1, 'A', 10 FROM DUAL
UNION ALL SELECT 2, 'A', 20 FROM DUAL
UNION ALL SELECT 3, 'B', 15 FROM DUAL
UNION ALL SELECT 4, 'B', 25 FROM DUAL
UNION ALL SELECT 5, 'C', 5 FROM DUAL
UNION ALL SELECT 6, 'A', 5 FROM DUAL
UNION ALL SELECT 7, 'A', 10 FROM DUAL;
Query 1:
WITH changes AS (
SELECT t.*,
CASE WHEN LAG( "Group" ) OVER ( ORDER BY "Timestamp" ) = "Group" THEN 0 ELSE 1 END AS hasChangedGroup
FROM TEST t
),
groups AS (
SELECT "Group",
VALUE,
SUM( hasChangedGroup ) OVER ( ORDER BY "Timestamp" ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS grp
FROM changes
)
SELECT "Group",
SUM( VALUE )
FROM Groups
GROUP BY "Group", grp
ORDER BY grp
Results:
| Group | SUM(VALUE) |
|-------|------------|
| A | 30 |
| B | 40 |
| C | 5 |
| A | 15 |
This is a typical "start_of_group" problem (see here: https://timurakhmadeev.wordpress.com/2013/07/21/start_of_group/).
In your case, it would be as follows:
with t as (
select 1 timestamp, 'A' grp, 10 value from dual union all
select 2, 'A', 20 from dual union all
select 3, 'B', 15 from dual union all
select 4, 'B', 25 from dual union all
select 5, 'C', 5 from dual union all
select 6, 'A', 5 from dual union all
select 7, 'A', 10 from dual
)
select min(timestamp), grp, sum(value) sum_value
from (
select t.*
, sum(start_of_group) over (order by timestamp) grp_id
from (
select t.*
, case when grp = lag(grp) over (order by timestamp) then 0 else 1 end
start_of_group
from t
) t
)
group by grp_id, grp
order by min(timestamp)
;