BigQuery Standard SQL - Cumulative Count of (almost) Duplicated Rows - sql

With the following data:
id  field  eventTime
--------------------
1   A      1
1   A      2
1   B      3
1   A      4
1   B      5
1   B      6
1   B      7
For visualisation purposes, I would like to turn it into the below. Consecutive occurrences of the same field value essentially get aggregated into one.
id  field  eventTime
--------------------
1   Ax2    1
1   B      3
1   A      4
1   Bx3    5
I will then use STRING_AGG() to turn it into "Ax2 > B > A > Bx3".
I've tried using ROW_NUMBER() to count the repeated instances, planning to use the highest row number to modify the string in field. But if I partition on eventTime, there are no consecutive "duplicates", and if I don't partition on it, then all rows with the same field value are counted, not just consecutive ones.
I thought about bringing in the previous field value with LAG() for a comparison to reset the row count, but that only works for transitions from one field value to another, and breaks when the same field value is repeated consecutively.
I've been struggling with this to the point where I'm considering writing a script that just CASE WHENs up to a reasonable number of consecutive hits, but I've seen the count get as high as 17 on a given day and really don't want to be doing that!
My other alternative would be to enforce a maximum number of field values to help control this, but now that I've started on this problem I'd quite like to solve it properly, if at all possible.
Thanks!

Consider the below approach:
select id,
  any_value(field) || if(count(1) = 1, '', 'x' || cast(count(1) as string)) field,
  min(eventTime) eventTime
from (
  select id, field, eventTime,
    countif(ifnull(flag, true)) over(partition by id order by eventTime) grp
  from (
    select id, field, eventTime,
      field != lag(field) over(partition by id order by eventTime) flag
    from `project.dataset.table`
  )
)
group by id, grp
# order by eventTime
If applied to the sample data in your question, the output is the desired table above: Ax2, B, A, Bx3.
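A self-contained sketch of this approach, with the sample rows inlined in a CTE so it can be run directly (the table reference is hypothetical; `cast(... as string)` is used for the concatenation):

```sql
with sample_data as (
  select 1 as id, 'A' as field, 1 as eventTime union all
  select 1, 'A', 2 union all
  select 1, 'B', 3 union all
  select 1, 'A', 4 union all
  select 1, 'B', 5 union all
  select 1, 'B', 6 union all
  select 1, 'B', 7
)
select id,
  any_value(field) || if(count(1) = 1, '', 'x' || cast(count(1) as string)) as field,
  min(eventTime) as eventTime
from (
  select id, field, eventTime,
    -- running count of "value changed" flags gives a stable group id per run;
    -- ifnull treats the first row (lag is null) as a change
    countif(ifnull(flag, true)) over(partition by id order by eventTime) as grp
  from (
    select id, field, eventTime,
      field != lag(field) over(partition by id order by eventTime) as flag
    from sample_data
  )
)
group by id, grp
order by eventTime
-- expected rows: (1, 'Ax2', 1), (1, 'B', 3), (1, 'A', 4), (1, 'Bx3', 5)
```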

Just use lag() to detect when the value of field changes. You can now do that with qualify:
select t.*
from t
where 1=1
qualify lag(field, 1, '') over (partition by id order by eventtime) <> field;
For your final step, you can use a subquery:
select id, string_agg(field, '->' order by eventtime)
from (select t.*
from t
where 1=1
qualify lag(field, 1, '') over (partition by id order by eventtime) <> field
) t
group by id;
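A runnable sketch of the qualify approach with the same sample rows inlined; note that it keeps only the first row of each run, so the aggregate here comes out as 'A->B->A->B', without the xN counts:

```sql
with t as (
  select 1 as id, 'A' as field, 1 as eventtime union all
  select 1, 'A', 2 union all
  select 1, 'B', 3 union all
  select 1, 'A', 4 union all
  select 1, 'B', 5 union all
  select 1, 'B', 6 union all
  select 1, 'B', 7
)
select id, string_agg(field, '->' order by eventtime) as fields
from (
  select t.*
  from t
  where 1=1
  -- keep a row only when its field differs from the previous row's field
  qualify lag(field, 1, '') over (partition by id order by eventtime) <> field
) t
group by id
-- expected: (1, 'A->B->A->B')
```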

Related

BigQuery to find max losing streak for each person

Assume I have a database like this. The original database has millions of data for ~1000 unique names.
What I am after is to find the maximum losing streak for each person when sorted by date.
What I am looking for is a query that have all unique names in 1 column and the maximum losing streak they had in the next one.
Another option to consider
select distinct name, count(*) streak
from (
  select name, win_loss,
    count(*) over win - countif(win_loss = 'LOSS') over win as grp
  from your_table
  window win as (partition by name order by date)
)
where win_loss = 'LOSS'
group by name, grp
qualify 1 = rank() over(partition by name order by count(*) desc)
if applied to sample data in your question - output is
(Updated - SKIP to be ignored)
You might consider the trick below.
First, generate a sequence of losses (1) and others (0) over time per user:
ABC - 111011110
XYZ - 1111111011
If the sequence is split with delimiter '0', you get one element per losing streak.
Then find the element with the maximum length in the resulting array:
SELECT Name,
       (SELECT MAX(LENGTH(r)) FROM UNNEST(SPLIT(results, '0')) r) AS losing_streaks
FROM (
  SELECT Name,
         STRING_AGG(
           CASE WinLoss
             WHEN 'LOSS' THEN '1'
             WHEN 'SKIP' THEN NULL
             ELSE '0'
           END, '' ORDER BY Date
         ) AS results
  FROM sample_table
  GROUP BY 1
);
+------+----------------+
| Name | losing_streaks |
+------+----------------+
| ABC  |              4 |
| XYZ  |              7 |
+------+----------------+
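The split-and-measure step can be checked in isolation on a literal sequence:

```sql
-- '111011110' splits on '0' into ['111', '1111', ''], whose longest element is 4
select (select max(length(r)) from unnest(split('111011110', '0')) r) as losing_streaks
-- expected: 4
```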
SKIP to be ignored
ABC has a losing streak of 3 before the first win and then a streak of 4 before the next (ignoring SKIP), so the answer has to be 4, not 3.
To address the above, you can use the below version:
select distinct name, count(*) streak
from (
  select name, win_loss,
    countif(win_loss != 'SKIP') over win - countif(win_loss = 'LOSS') over win as grp
  from your_table
  window win as (partition by name order by date)
)
where win_loss = 'LOSS'
group by name, grp
qualify 1 = rank() over(partition by name order by count(*) desc)
with the expected output: 4 for ABC.

Min and max value per group keeping order

I have a small problem in Redshift with grouping; I have a table like the following:
INPUT
VALUE  CREATED       UPDATED
------------------------------------
1      '2020-09-10'  '2020-09-11'
1      '2020-09-11'  '2020-09-13'
2      '2020-09-15'  '2020-09-16'
1      '2020-09-17'  '2020-09-18'
I want to obtain this output:
VALUE  CREATED       UPDATED
------------------------------------
1      '2020-09-10'  '2020-09-13'
2      '2020-09-15'  '2020-09-16'
1      '2020-09-17'  '2020-09-18'
If I do a simple Min and Max date grouping by the value, it doesn't work.
This is an example of a gaps-and-islands problem. A difference of row numbers is a simple solution:
select value, min(created), max(updated)
from (select t.*,
             row_number() over (order by created) as seqnum,
             row_number() over (partition by value order by created) as seqnum_2
      from t
     ) t
group by value, (seqnum - seqnum_2)
order by min(created);
Why this works is a little tricky to explain. But if you look at the results of the subquery, you will see how the difference between the row numbers identifies adjacent rows with the same value.
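To see it concretely, here is a worked trace of the subquery on the sample rows (the CTE name `t` is hypothetical, inlining the question's data):

```sql
with t as (
  select 1 as value, date '2020-09-10' as created, date '2020-09-11' as updated union all
  select 1, date '2020-09-11', date '2020-09-13' union all
  select 2, date '2020-09-15', date '2020-09-16' union all
  select 1, date '2020-09-17', date '2020-09-18'
)
select value, created,
  row_number() over (order by created) as seqnum,
  row_number() over (partition by value order by created) as seqnum_2,
  -- constant within a run of equal values, changes when the run is interrupted
  row_number() over (order by created)
    - row_number() over (partition by value order by created) as diff
from t
order by created
-- diff comes out 0, 0, 2, 1; the (value, diff) pairs
-- (1,0), (1,0), (2,2), (1,1) identify the three islands
```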

SQL Ranking N records by one criteria and N records by another and repeat

In my table I have 4 columns: Id, Type, InitialRanking & FinalRanking. Based on certain criteria I've managed to apply InitialRanking to the records (1-20). I now need to apply FinalRanking by identifying the top 7 of Type 1 followed by the top 3 of Type 2, then repeat the above until all records have a FinalRanking. My goal is to achieve the output in the final column of the attached image.
The 7 & 3 will vary over time, but for the purposes of this example let's say they are fixed.
You can try something like this:
SELECT * FROM (
  SELECT ID, TYPE,
    CASE WHEN TYPE = 1 THEN
      (SELECT TOP 7 INITIALRANK, FINALRANK FROM table WHERE type = 1)
    ELSE
      (SELECT TOP 3 INITIALRANK, FINALRANK FROM table WHERE type = 2)
    END
  FROM table WHERE TYPE IN (1, 2)
  UNION
  SELECT ID, TYPE, INITIALRANK, FINALRANK
  FROM table WHERE type NOT IN (1, 2)
)
A simple (or simplistic) approach to your Final Rank would be the following:
row_number() over (partition by type order by initrank) +
  case type
    when 1 then (ceil((row_number() over (partition by type order by initrank))/7)-1)*(10-7)
    when 2 then (ceil((row_number() over (partition by type order by initrank))/3)-1)*(10-3)+7
  end FinalRank
This can be generalized for more than 2 groups. For example, with three groups of size 7, 3 and 2, the pattern size is 7+3+2=12. The general form is PartitionedRowNum + (Ceil(PartitionedRowNum/GroupSize)-1)*(PatternSize-GroupSize) + Offset, where the offset is the sum of the preceding group sizes:
row_number() over (partition by type order by initrank) +
  case type
    when 1 then (ceil((row_number() over (partition by type order by initrank))/7)-1)*(12-7)
    when 2 then (ceil((row_number() over (partition by type order by initrank))/3)-1)*(12-3)+7
    when 3 then (ceil((row_number() over (partition by type order by initrank))/2)-1)*(12-2)+7+3
  end FinalRank
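A quick arithmetic check of the formula for the original 7/3 pattern (pattern size 10): the 8th Type-1 row should land just after the first 7+3 block, and the 4th Type-2 row at the end of the second block. Dialects differ on integer division, so the divisions below are written as decimals:

```sql
-- 8 + (ceil(8/7)-1)*(10-7) + 0 = 8 + 3  = 11
-- 4 + (ceil(4/3)-1)*(10-3) + 7 = 4 + 14 = 18
select 8 + (ceil(8/7.0) - 1) * (10 - 7) + 0 as type1_row8,  -- expected 11
       4 + (ceil(4/3.0) - 1) * (10 - 3) + 7 as type2_row4   -- expected 18
```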

Select SQL logic

Folks, I'm at a loss here!
First, this is what I am trying to achieve:
Select all the records from table CUSTOMER_ORDER_DETAILS table shown below and if multiple entries for the same CUSTOMER_NO exist then:
- select the entry with PAID = 1
- if there are multiple PAID = 1 entries, then select the record with TYPE = Y
Expected Result:
877, CU115, lit, 0, 1, X
878, CU111, Toi, 1, 1, Y
879, CU117, Fla, 1, 1, X
My approach was to find rows where count(CUSTOMER_NO) > 1 using GROUP BY on CUSTOMER_NO, but as soon as I add the remaining columns of the table to the SELECT statement, the count column shows a value of 1.
Any pointers to tackle this or implement if-else kind of logic?
This is a prioritization query. Here is one method to do what you want:
select t.*
from (select t.*,
row_number() over (partition by customer_no
order by paid desc, type desc
) as seqnum
from t
) t
where seqnum = 1;
This assumes that paid takes on the values 0 and 1, and that type has the values X and Y.
You can prioritize these conditions with an order by condition in row_number function.
select * from (
select t.*,
row_number() over(partition by customer_no
order by case when paid=1 and type='Y' then 1
when paid=1 then 2
else 3 end) as rnum
from customer_orders t
) t
where rnum=1
This assumes there can only be one row with type='Y' per customer_no if there exist multiple rows with paid=1 for that same customer_no.
If there exist multiple rows with paid =1 and all of them have a type <> 'Y' then a row is arbitrarily picked amongst them.

How to select last entry for one distinct pairing of two columns in Oracle?

I need to select the last row in mytable for a given pair of columns in Oracle v11.2:
id  type  timestamp   raw_value   normal_value
--  ----  ---------   ---------   ------------
1   3     3pm 3-Jun   "Jon"       "Jonathan"
1   3     5pm 3-Jun   "Jonathan"  "Jonathan"
1   3     2pm 4-Jun   "John"      "Jonathan"
1   3     8pm 6-Jun   "Bob"       "Robert"
1   5     6pm 3-Jun   "NYC"       "New York City"
1   5     7pm 5-Jun   "N.Y.C."    "New York City"
4   8     1pm 1-Jun   "IBM"       "International Business Machines"
4   8     5pm 8-Jun   "I.B.M."    "International Business Machines"
I'm thinking the query would be something like this:
SELECT raw_value, normal_value, MAX(timestamp)
FROM mytable
WHERE id = 1 and type = 3
GROUP BY id, type
For the above, this should give me:
"Bob", "Robert", 8pm 6-Jun
I do not actually need the timestamp in my answer, but only need it to select the matching row for the given id and type whose timestamp is greatest.
Will my approach work in Oracle v11.2, and if so, is there a way to omit timestamp from the selected columns since I don't actually need its value?
You can do this with the row_number() function:
select raw_value, normal_value, timestamp
from (select myt.*, ROW_NUMBER() over
(partition by id, type order by timestamp desc)
as seqnum
from mytable myt
) tmp
where seqnum = 1
and id = 1 and type = 3;
row_number() is an analytic function (aka window function) that assigns sequential numbers to rows. Every group defined by id, type gets its own numbers. The first row is the one with the most recent timestamp (order by timestamp desc). The outer select chooses this row in the where clause.
In the case of ties, this version returns only one row. To get all the rows, use rank() instead of row_number().
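A self-contained sketch with the id = 1, type = 3 sample rows inlined; the timestamps are recast as DATE literals and the column is renamed `ts`, since "8pm 6-Jun" is not a storable value and TIMESTAMP clashes with the Oracle type keyword (dates here are illustrative):

```sql
with mytable as (
  select 1 id, 3 type, date '2020-06-03' ts, 'Jon'  raw_value, 'Jonathan' normal_value from dual
  union all
  select 1, 3, date '2020-06-04', 'John', 'Jonathan' from dual
  union all
  select 1, 3, date '2020-06-06', 'Bob', 'Robert' from dual
)
select raw_value, normal_value
from (
  -- number rows within each (id, type) group, newest first
  select myt.*,
         row_number() over (partition by id, type order by ts desc) as seqnum
  from mytable myt
)
where seqnum = 1
  and id = 1 and type = 3
-- expected: ('Bob', 'Robert')
```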
Try this:
SELECT m1.raw_value, m1.normal_value
FROM mytable m1
WHERE id = 1 and type = 3 and timestamp = (
SELECT MAX(timestamp)
FROM mytable m2
WHERE m1.id = m2.id and m1.type = m2.type
GROUP BY m2.id, m2.type
)
You can determine the most recent timestamp using the Oracle analytic RANK function like this:
SELECT
raw_value,
normal_value,
RANK() OVER (ORDER BY timestamp DESC) as TimestampRank
FROM myTable
This will set the TimestampRank column to 1 for the row with the highest timestamp. If there's a tie for the highest timestamp, all rows with the highest timestamp will have TimestampRank set to 1.
To get just the "Bob", "Robert", surround the query above with an outer query that selects just those columns and filters for TimestampRank = 1:
SELECT raw_value, normal_value
FROM (
SELECT
raw_value,
normal_value,
RANK() OVER (ORDER BY timestamp DESC) as TimestampRank
FROM myTable
)
WHERE TimestampRank = 1
Note again that if there's a tie for the highest timestamp, all rows with that value will be returned. If you always want one row regardless of ties, use ROW_NUMBER() instead of RANK() in the query above.
Try
select max(raw_value) keep (dense_rank last order by timestamp),
       max(normal_value) keep (dense_rank last order by timestamp)
from mytable
where id = 1 and type = 3