BigQuery to find max losing streak for each person - sql

Assume I have a database like this. The original database has millions of rows for ~1000 unique names.
What I am after is to find the maximum losing streak for each person when sorted by date.
What I am looking for is a query that has all unique names in one column and the maximum losing streak they had in the next one.

Another option to consider
select distinct name, count(*) streak from (
  select name, win_loss,
    count(*) over win - countif(win_loss = 'LOSS') over win as grp
  from your_table
  window win as (partition by name order by date)
)
where win_loss = 'LOSS'
group by name, grp
qualify 1 = rank() over(partition by name order by count(*) desc)
if applied to sample data in your question - output is
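If you want to try the trick without a real table, here is a self-contained sketch; the inline names, dates, and results are made up for illustration:
with your_table as (
  select 'ABC' as name, date '2024-01-01' as date, 'LOSS' as win_loss union all
  select 'ABC', date '2024-01-02', 'LOSS' union all
  select 'ABC', date '2024-01-03', 'WIN' union all
  select 'ABC', date '2024-01-04', 'LOSS' union all
  select 'ABC', date '2024-01-05', 'LOSS' union all
  select 'ABC', date '2024-01-06', 'LOSS' union all
  select 'ABC', date '2024-01-07', 'WIN'
)
select distinct name, count(*) streak from (
  select name, win_loss,
    count(*) over win - countif(win_loss = 'LOSS') over win as grp
  from your_table
  window win as (partition by name order by date)
)
where win_loss = 'LOSS'
group by name, grp
qualify 1 = rank() over(partition by name order by count(*) desc)
Here grp is the running count of non-LOSS rows, so it stays constant across consecutive losses; grouping by name, grp yields one row per streak (here 2 and 3), and the qualify keeps the longest, returning ABC with a streak of 3.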

(Updated - SKIP to be ignored)
You might consider the trick below.
Firstly, generate sequences of loss(1) and others(0) over time per user.
ABC - 111011110
XYZ - 1111111011
If the sequence is split on the delimiter 0, you get the individual losing streaks.
Then find the maximum length among the resulting losing-streak segments.
SELECT Name,
(SELECT MAX(LENGTH(r)) FROM UNNEST(SPLIT(results, '0')) r) AS losing_streaks
FROM (
SELECT Name,
STRING_AGG(
CASE WinLoss
WHEN 'LOSS' THEN '1'
WHEN 'SKIP' THEN NULL
ELSE '0'
END, '' ORDER BY Date
) AS results
FROM sample_table
GROUP BY 1
);
+------+----------------+
| Name | losing_streaks |
+------+----------------+
| ABC  |              4 |
| XYZ  |              7 |
+------+----------------+

SKIP is to be ignored:
ABC has a losing streak of 3 before the first win and then a losing streak of 4 before the next win (ignoring SKIP), so the answer has to be 4, not 3.
To address the above, you can use the version below, which counts only non-SKIP rows when forming the group key, so SKIP rows no longer break a streak:
select distinct name, count(*) streak from (
  select name, win_loss,
    countif(win_loss != 'SKIP') over win - countif(win_loss = 'LOSS') over win as grp
  from your_table
  window win as (partition by name order by date)
)
where win_loss = 'LOSS'
group by name, grp
qualify 1 = rank() over(partition by name order by count(*) desc)
with output (matching the STRING_AGG answer above, which also ignores SKIP)
+------+--------+
| name | streak |
+------+--------+
| ABC  |      4 |
| XYZ  |      7 |
+------+--------+
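Here is a self-contained sketch of this version over made-up rows shaped like the ABC case described above (three losses, a win, then four losses with a SKIP in the middle):
with your_table as (
  select 'ABC' as name, date '2024-01-01' as date, 'LOSS' as win_loss union all
  select 'ABC', date '2024-01-02', 'LOSS' union all
  select 'ABC', date '2024-01-03', 'LOSS' union all
  select 'ABC', date '2024-01-04', 'WIN' union all
  select 'ABC', date '2024-01-05', 'LOSS' union all
  select 'ABC', date '2024-01-06', 'LOSS' union all
  select 'ABC', date '2024-01-07', 'SKIP' union all
  select 'ABC', date '2024-01-08', 'LOSS' union all
  select 'ABC', date '2024-01-09', 'LOSS'
)
select distinct name, count(*) streak from (
  select name, win_loss,
    countif(win_loss != 'SKIP') over win - countif(win_loss = 'LOSS') over win as grp
  from your_table
  window win as (partition by name order by date)
)
where win_loss = 'LOSS'
group by name, grp
qualify 1 = rank() over(partition by name order by count(*) desc)
The SKIP row increments neither running count, so grp is unchanged across it and the surrounding four losses form a single group; the query returns ABC with a streak of 4.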

Related

Codility SqlEventsDelta (Compute the difference between the latest and the second latest value for each event type)

Recently, I've been practicing code exercises on Codility.
You can find the problem in the Exercises 6 - SQL section; just start a test to see the problem description: SqlEventsDelta
Problem definition:
I wrote this solution to the SqlEventsDelta question in SQLite. It works fine in a local tool, but it was not working in the web tool.
Can anyone give any advice on how I can solve this problem?
※ I searched for this problem on Stack Overflow and I know a better solution than my own.
But, if possible, I want to use my own SQLite code logic and functions.
WITH cte1 AS
(
  SELECT *, CASE WHEN e2.event_type = e2.prev THEN 0
                 WHEN e2.event_type = e2.next THEN 0
                 ELSE 1 END AS grp
  FROM (SELECT *,
               LAG(e1.event_type) OVER(ORDER BY (SELECT 1)) AS prev,
               LEAD(e1.event_type) OVER(ORDER BY (SELECT 1)) AS next
        FROM events e1) e2
),
cte2 AS
(
  SELECT cte1.event_type, cte1.time, cte1.grp,
         cte1.value - LAG(cte1.value) OVER(ORDER BY cte1.event_type, cte1.time) AS value
  FROM cte1
  WHERE cte1.grp = 0
  ORDER BY cte1.event_type, cte1.time
)
SELECT c2.event_type, c2.value
FROM cte2 c2
WHERE (c2.event_type, c2.time) IN (
    SELECT c2.event_type, MAX(c2.time) AS time
    FROM cte2 c2
    GROUP BY c2.event_type)
GROUP BY c2.event_type
ORDER BY c2.event_type, c2.time
It ran just fine in my local tool (DB Browser for SQLite, version 3.12.2) without error.
event_type | value
-----------+------
2          | -5
3          | 4
Execution finished without errors.
Result: 2 rows returned in 7ms
But on the web tool (Codility test editor, SQLite version 3.11.0) it doesn't run, and I get the following errors.
| Compilation successful.
| Example test: (example test)
| Output (stderr):
| error on query: ...
| ...
| ...,
| details: near "(": syntax error
| RUNTIME ERROR (tested program terminated with exit code 1)
Detected some errors.
SqlEventsDelta question:
Write an SQL query that, for each event_type that has been registered more than once, returns the difference between the latest (i.e. the most recent in terms of time) and the second latest value.
The table should be ordered by event_type (in ascending order).
The names of the columns in the rowset don't matter, but their order does.
Given a table events with the following structure :
create table events (
event_type integer not null,
value integer not null,
time timestamp not null,
unique(event_type, time)
);
For example, given the following data :
event_type | value | time
-----------+-------+--------------------
2          |     5 | 2015-05-09 12:42:00
4          |   -42 | 2015-05-09 13:19:57
2          |     2 | 2015-05-09 14:48:30
2          |     7 | 2015-05-09 12:54:39
3          |    16 | 2015-05-09 13:19:57
3          |    20 | 2015-05-09 15:01:09
Given the above data, the output should return the following rowset :
event_type | value
-----------+------
2          | -5
3          | 4
Thank you.
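A note on the error itself (my reading of the version numbers, not something confirmed in the thread): the web tool's SQLite 3.11.0 predates both window functions (LAG/LEAD ... OVER, added in SQLite 3.25) and row-value comparisons such as WHERE (c2.event_type, c2.time) IN (...) (added in 3.15), so this query cannot parse there. "DB Browser for SQLite 3.12.2" is the application's version number, and the application bundles a much newer SQLite engine. If you need something that runs on the old engine, here is a sketch using only correlated subqueries (a different approach from the window-function logic above):
SELECT e.event_type,
       e.value - (SELECT e2.value              -- second latest value for this event_type
                  FROM events e2
                  WHERE e2.event_type = e.event_type
                    AND e2.time < e.time
                  ORDER BY e2.time DESC
                  LIMIT 1) AS diff
FROM events e
WHERE e.time = (SELECT MAX(e3.time)            -- keep only the latest row per event_type
                FROM events e3
                WHERE e3.event_type = e.event_type)
  AND (SELECT COUNT(*)                         -- only event_types registered more than once
       FROM events e4
       WHERE e4.event_type = e.event_type) > 1
ORDER BY e.event_type;
For the sample data this returns (2, -5) and (3, 4).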
I tried a somewhat naive approach. I'm aware that it is very bad for performance due to the many subqueries, but the catch here is PostgreSQL's DISTINCT ON; however, I got 100% 😃
Hope you like it!
select distinct on (event_type) event_type, result * -1
from (select event_type, value, lead(value) over (order by event_type) - value result
from (select *
from events
where event_type in (select event_type
from events
group by event_type
having count(event_type) >= 2)
order by event_type, time desc) a) b
Hi team! This one's my answer. It's largely a goopy conglomerate of the answers above, but it reads more simply and it's commented for context. Being a newbie, I hope it helps other newbies.
with data as (SELECT a.event_type, a.value, a.time,
    --Produce a virtual table that stores the next and previous values for each event_type.
    LEAD(a.value,1) over (PARTITION by a.event_type ORDER by a.time) as recent_val,
    LAG(a.value,1) over (PARTITION by a.event_type ORDER by a.time) as penult_val
  from events a
  JOIN (SELECT event_type
        from events --Filter the initial dataset for duplicates. Store in correct order
        group by event_type HAVING COUNT(*) > 1
        ORDER by event_type) b
    on a.event_type = b.event_type) --Compare the virtual table to the filtered dataset
SELECT event_type, (value - penult_val) as diff --Perform the desired arithmetic
from data
where recent_val is NULL --Filter for the most recent value
I had the same problem when using SQLite.
Try the code below with PostgreSQL:
with data as (
  select e.event_type,
         e.value,
         e.time,
         lead(e.value,1) over (partition by e.event_type order by e.event_type, e.time asc) as next_val,
         lag(e.value,1) over (partition by e.event_type order by e.event_type, e.time asc) as prev_val
  from events e
)
select distinct d.event_type, (d.value - d.prev_val) as diff
from events e, data d
where e.event_type = d.event_type
  and d.next_val is null
  and e.event_type in (select event_type
                       from data
                       group by event_type
                       having count(1) > 1)
order by 1;
Adding another answer involving self-joins -
PostgreSQL
-- write your code in PostgreSQL 9.4
WITH TotalRowCount AS (
SELECT
event_type,
COUNT(*) as row_count
FROM events
GROUP BY 1
),
RankedEventType AS (
SELECT
event_type,
value,
ROW_NUMBER() OVER(PARTITION BY event_type ORDER BY time) as row_num
FROM events
)
SELECT
a.event_type,
a.value - b.value as value
FROM RankedEventType a
INNER JOIN TotalRowCount c
ON a.event_type = c.event_type
INNER JOIN RankedEventType b
ON a.event_type = b.event_type
WHERE 1 = 1
AND a.row_num = c.row_count
AND b.row_num = c.row_count - 1
ORDER BY 1
without nested queries, got 100%
with data as (
  with count as (
    select event_type
    from events
    group by event_type
    having count(event_type) >= 2
  )
  select e.event_type, e.value, e.time
  from events as e
  inner join count as r on e.event_type = r.event_type
  order by e.event_type, e.time desc
)
select distinct on (event_type) event_type,
       value - (LEAD(value) over (order by event_type)) result
from data
Solution with one subquery
WITH diff AS
(SELECT event_type,
value,
LEAD(value) OVER (PARTITION BY event_type
ORDER BY TIME DESC) AS prev
FROM EVENTS
GROUP BY event_type,
value,
time
)
SELECT DISTINCT ON (event_type) event_type,
value - prev
FROM diff
WHERE prev IS NOT NULL;
with deltas as (
select distinct event_type,
first_value(value) over (PARTITION by event_type ORDER by time DESC) -
nth_value(value, 2) over (PARTITION by event_type ORDER by time DESC) as delta
from events
)
select * from deltas where delta is not null order by 1;
--in PostgreSQL 9.4
with ct1 as (SELECT
event_type,
value,
time,
rank() over (partition by event_type order by time desc) as rank
from events),
ct2 as (
select event_type, value, rank, lag (value,1) over (order by event_type) as previous_value
from ct1
order by event_type)
select event_type, previous_value - value from ct2
where rank = 2
order by event_type
My solution:
--Get table with rank 1, 2 group by event_type
with t2 as(
select event_type, value, rank from (
select event_type, value,
rank() over(
partition by event_type
order by time desc) as rank,
count(*) over (partition by event_type) as count
from events) as t
where t.rank <= 2 and t.count > 1
)
--Calculate diff using Lead() and filter out null diff with max
select t3.event_type, max(t3.diff) from (
select event_type,
value - lead(value, 1) over (
partition by event_type
order by rank) as diff
from t2) as t3
group by t3.event_type
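To try any of these answers locally, the question's schema and example rows can be loaded as-is (copied from the problem statement above; works in PostgreSQL and recent SQLite):
create table events (
  event_type integer not null,
  value integer not null,
  time timestamp not null,
  unique(event_type, time)
);

insert into events (event_type, value, time) values
  (2,   5, '2015-05-09 12:42:00'),
  (4, -42, '2015-05-09 13:19:57'),
  (2,   2, '2015-05-09 14:48:30'),
  (2,   7, '2015-05-09 12:54:39'),
  (3,  16, '2015-05-09 13:19:57'),
  (3,  20, '2015-05-09 15:01:09');
-- Expected output from any correct solution: (2, -5) and (3, 4)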

BigQuery Standard SQL - Cumulative Count of (almost) Duplicated Rows

With the following data:
id | field | eventTime
---+-------+----------
1  | A     | 1
1  | A     | 2
1  | B     | 3
1  | A     | 4
1  | B     | 5
1  | B     | 6
1  | B     | 7
For visualisation purposes, I would like to turn it into the below. Consecutive occurrences of the same field value essentially get aggregated into one.
id | field | eventTime
---+-------+----------
1  | Ax2   | 1
1  | B     | 3
1  | A     | 4
1  | Bx3   | 5
I will then use STRING_AGG() to turn it into "Ax2 > B > A > Bx3".
I've tried using ROW_NUMBER() to count the repeated instances, with the plan being to utilise the highest row number to modify the string in field, but if I partition on eventTime there are no consecutive "duplicates", and if I don't partition on it then all rows with the same field value are counted - not just consecutive ones.
I thought about bringing in the previous field with LAG() for a comparison to reset the row count, but that only works for transitions from one field value to another and is a problem if the same field value is repeated consecutively.
I've been struggling with this to the point where I'm considering writing a script that just CASE WHENs up to a reasonable number of consecutive hits, but I've seen it get as high as 17 on a given day and really don't want to be doing that!
My other alternative is just to enforce a maximum number of field values to help control this, but now that I've started this problem I'd quite like to solve it without that, if at all possible.
Thanks!
Consider below
select id,
any_value(field) || if(count(1) = 1, '', 'x' || cast(count(1) as string)) field,
min(eventTime) eventTime
from (
select id, field, eventTime,
countif(ifnull(flag, true)) over(partition by id order by eventTime) grp
from (
select id, field, eventTime,
field != lag(field) over(partition by id order by eventTime) flag
from `project.dataset.table`
)
)
group by id, grp
# order by eventTime
If applied to the sample data in your question, the output is:
id | field | eventTime
---+-------+----------
1  | Ax2   | 1
1  | B     | 3
1  | A     | 4
1  | Bx3   | 5
Just use lag() to detect when the value of field changes. You can now do that with qualify:
select t.*
from t
where 1=1
qualify lag(field, 1, '') over (partition by id order by eventtime) <> field;
For your final step, you can use a subquery:
select id, string_agg(field, '->' order by eventtime)
from (select t.*
from t
where 1=1
qualify lag(field, 1, '') over (partition by id order by eventtime) <> field
) t
group by id;
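If it helps, the two steps of the first answer can also be chained to produce the final string you described (a sketch; `project.dataset.table` and the ' > ' separator are taken from the question):
select id, string_agg(field, ' > ' order by eventTime) as route
from (
  select id, grp,
         any_value(field) || if(count(1) = 1, '', 'x' || cast(count(1) as string)) as field,
         min(eventTime) as eventTime
  from (
    select id, field, eventTime,
           countif(ifnull(flag, true)) over(partition by id order by eventTime) as grp
    from (
      select id, field, eventTime,
             field != lag(field) over(partition by id order by eventTime) as flag
      from `project.dataset.table`
    )
  )
  group by id, grp
)
group by id
-- for the sample rows this returns: 1, 'Ax2 > B > A > Bx3'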

SQL - Find Differences Between Columns

Let's say I have the following table
Sku | Number | Name
----+--------+------
11  | 1      | hat
12  | 1      | hat
13  | 1      | hats
22  | 2      | car
33  | 3      | truck
44  | 4      | boat
45  | 4      | boat
Is there an easy way to find the differences within each Number? For example, with the table above, I would want the query to output:
13 | 1 | hats
The reason for this is that our program processes the rows as long as the number matches the name. If there is an instance where one name doesn't match while the rest do, it will fail.
You can find the most common value (the "mode") using window functions and aggregation:
select t.*
from (select number, name, count(*) as cnt,
row_number() over (partition by number order by count(*) desc) as seqnum
from t
group by number, name
) t
where seqnum = 1;
You could then find everything that is not the mode using a join. The easier way is just to change the where condition:
select t.*
from (select number, name, count(*) as cnt,
row_number() over (partition by number order by count(*) desc) as seqnum
from t
group by number, name
) t
where seqnum > 1;
Note: If there are ties in frequency for the most common value, then an arbitrary most common value is chosen.
EDIT:
Actually, if you want the original skus, you might as well do the join:
with modes as (
select t.*
from (select number, name, count(*) as cnt,
row_number() over (partition by number order by count(*) desc) as seqnum
from t
group by number, name
) t
where seqnum = 1
)
select t.*
from t join
modes
on t.number = modes.number and t.name <> modes.name;
This will ignore NULL values (but the logic can easily be fixed to accommodate them).
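As a quick sanity check, here is that last query run against the question's rows, with the table inlined as a CTE (the CTE is just a stand-in for your real table t):
with t (sku, number, name) as (
  select 11, 1, 'hat'   union all
  select 12, 1, 'hat'   union all
  select 13, 1, 'hats'  union all
  select 22, 2, 'car'   union all
  select 33, 3, 'truck' union all
  select 44, 4, 'boat'  union all
  select 45, 4, 'boat'
),
modes as (
  select t.*
  from (select number, name, count(*) as cnt,
               row_number() over (partition by number order by count(*) desc) as seqnum
        from t
        group by number, name
       ) t
  where seqnum = 1
)
select t.*
from t join
     modes
     on t.number = modes.number and t.name <> modes.name;
-- returns the odd row out: 13 | 1 | hats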

Oracle SQL query : finding the last time a data was changed

I want to retrieve the elapsed days since the last time the data in a specific column was changed. For example:
TABLE_X contains
ID PDATE       DATA1 DATA2
A  10-Jan-2013     5    10
A   9-Jan-2013     5    10
A   8-Jan-2013     5    11
A   7-Jan-2013     5    11
A   6-Jan-2013    14    12
A   5-Jan-2013    14    12
B  10-Jan-2013     3    15
B   9-Jan-2013     3    15
B   8-Jan-2013     9    15
B   7-Jan-2013     9    15
B   6-Jan-2013    14    15
B   5-Jan-2013    14     8
I simplified the table for this example.
The result should be:
ID DATA1_LASTUPDATE DATA2_LASTUPDATE
A  4                2
B  2                5
which says:
- data1 of A was last updated 4 days ago,
- data2 of A was last updated 2 days ago,
- data1 of B was last updated 2 days ago,
- data2 of B was last updated 5 days ago.
Using the query below is OK, but it takes too long to complete if I apply it to the real table, which has lots of records, and add 2 more data columns to find their latest update days.
I use the LEAD function for this purpose.
Are there any alternatives to speed up the query?
with qdata1 as
(
select ID, pdate from
(
select a.*, row_number() over (partition by ID order by pdate desc) rnum from
(
select a.*,
lead(data1,1,0) over (partition by ID order by pdate desc) - data1 as data1_diff
from table_x a
) a
where data1_diff <> 0
)
where rnum=1
),
qdata2 as
(
select ID, pdate from
(
select a.*, row_number() over (partition by ID order by pdate desc) rnum from
(
select a.*,
lead(data2,1,0) over (partition by ID order by pdate desc) - data2 as data2_diff
from table_x a
) a
where data2_diff <> 0
)
where rnum=1
)
select a.ID,
       trunc(sysdate) - b.pdate data1_lastupdate,
       trunc(sysdate) - c.pdate data2_lastupdate
from table_master a, qdata1 b, qdata2 c
where a.ID = b.ID(+)
and a.ID = c.ID(+)
Thanks a lot.
You can avoid the multiple hits on the table and the joins by doing both lag (or lead) calculations together:
with t as (
select id, pdate, data1, data2,
lag(data1) over (partition by id order by pdate) as lag_data1,
lag(data2) over (partition by id order by pdate) as lag_data2
from table_x
),
u as (
select t.*,
case when lag_data1 is null or lag_data1 != data1 then pdate end as pdate1,
case when lag_data2 is null or lag_data2 != data2 then pdate end as pdate2
from t
),
v as (
select u.*,
rank() over (partition by id order by pdate1 desc nulls last) as rn1,
rank() over (partition by id order by pdate2 desc nulls last) as rn2
from u
)
select v.id,
max(trunc(sysdate) - (case when rn1 = 1 then pdate1 end))
as data1_last_update,
max(trunc(sysdate) - (case when rn2 = 1 then pdate2 end))
as data2_last_update
from v
group by v.id
order by v.id;
I'm assuming that you meant your data to be for Jun-2014, not Jan-2013; and that you're comparing the most recent change dates with the current date. With the data adjusted to use 10-Jun-2014 etc., this gives:
ID DATA1_LAST_UPDATE DATA2_LAST_UPDATE
-- ----------------- -----------------
A                  4                 2
B                  2                 5
The first CTE (t) gets the actual table data and adds two extra columns, one for each of the data columns, using lag (which is the same as lead ordered by descending dates).
The second CTE (u) adds two date columns that are only set when the data columns are changed (or when they are first set, just in case they have never changed). So if a row has data1 the same as the previous row, its pdate1 will be blank. You could combine the first two by repeating the lag calculation but I've left it split out to make it a bit clearer.
The third CTE (v) assigns a ranking to those pdate columns such that the most recent is ranked first.
And the final query works out the difference from the current date to the highest-ranked (i.e. most recent) change for each of the data columns.
SQL Fiddle, including all the CTEs run individually so you can see what they are doing.
Your query wasn't returning the right results for me (maybe I missed something), but I also got the correct results with the query below (you can check this SQLFiddle demo):
with ranked as (
select ID,
data1,
data2,
rank() over(partition by id order by pdate desc) r
from table_x
)
select id,
sum(DATA1_LASTUPDATE) DATA1_LASTUPDATE,
sum(DATA2_LASTUPDATE) DATA2_LASTUPDATE
from (
-- here I get when data1 was updated
select id,
count(1) DATA1_LASTUPDATE,
0 DATA2_LASTUPDATE
from ranked
start with r = 1
CONNECT BY (PRIOR data1 = data1)
and PRIOR r = r - 1
group by id
union
-- here I get when data2 was updated
select id,
0 DATA1_LASTUPDATE,
count(1) DATA2_LASTUPDATE
from ranked
start with r = 1
CONNECT BY (PRIOR data2 = data2)
and PRIOR r = r - 1
group by id
)
group by id
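If you want to experiment with either approach, the sample table can be recreated like this (a sketch; the dates are shifted to Jun-2014, as the first answer assumes, and the day counts will of course drift as sysdate moves on):
create table table_x (id varchar2(1), pdate date, data1 number, data2 number);

insert into table_x values ('A', date '2014-06-10',  5, 10);
insert into table_x values ('A', date '2014-06-09',  5, 10);
insert into table_x values ('A', date '2014-06-08',  5, 11);
insert into table_x values ('A', date '2014-06-07',  5, 11);
insert into table_x values ('A', date '2014-06-06', 14, 12);
insert into table_x values ('A', date '2014-06-05', 14, 12);
insert into table_x values ('B', date '2014-06-10',  3, 15);
insert into table_x values ('B', date '2014-06-09',  3, 15);
insert into table_x values ('B', date '2014-06-08',  9, 15);
insert into table_x values ('B', date '2014-06-07',  9, 15);
insert into table_x values ('B', date '2014-06-06', 14, 15);
insert into table_x values ('B', date '2014-06-05', 14,  8);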

Find minimum value in groups of rows

In the SQL space (specifically T-SQL, SQL Server 2008), given this list of values:
Status Date
------ -----------------------
ACT 2012-01-07 11:51:06.060
ACT 2012-01-07 11:51:07.920
ACT 2012-01-08 04:13:29.140
NOS 2012-01-09 04:29:16.873
ACT 2012-01-21 12:39:37.607 <-- THIS
ACT 2012-01-21 12:40:03.840
ACT 2012-05-02 16:27:17.370
GRAD 2012-05-19 13:30:02.503
GRAD 2013-09-03 22:58:48.750
Generated from this query:
SELECT Status, Date
FROM Account_History
WHERE AccountNumber = '1234'
ORDER BY Date
The status for this particular object started at ACT, then changed to NOS, then back to ACT, then to GRAD.
What is the best way to get the minimum date from the latest "group" of records where Status = 'ACT'?
Here is a query that does this, by identifying the groups where the student statuses are the same and then using simple aggregation:
select top 1 StudentStatus, min(WhenLastChanged) as WhenLastChanged
from (SELECT StudentStatus, WhenLastChanged,
(row_number() over (order by "date") -
row_number() over (partition by studentstatus order by "date")
) as grp
FROM Account_History
WHERE AccountNumber = '1234'
) t
where StudentStatus = 'ACT'
group by StudentStatus, grp
order by WhenLastChanged desc;
The row_number() function assigns sequential numbers within groups of rows based on the date. For your data, the two row_numbers() and their difference are:
Status Date                     rn  rn2  diff
------ -----------------------  --  ---  ----
ACT    2012-01-07 11:51:06.060   1    1     0
ACT    2012-01-07 11:51:07.920   2    2     0
ACT    2012-01-08 04:13:29.140   3    3     0
NOS    2012-01-09 04:29:16.873   4    1     3
ACT    2012-01-21 12:39:37.607   5    4     1
ACT    2012-01-21 12:40:03.840   6    5     1
ACT    2012-05-02 16:27:17.370   7    6     1
GRAD   2012-05-19 13:30:02.503   8    1     7
GRAD   2013-09-03 22:58:48.750   9    2     7
Notice the last column is constant within each run of rows that have the same status.
The aggregation brings these together and chooses the latest (top 1 . . . order by date desc) of the first dates (min(date)).
EDIT:
The query is easy to tweak for multiple account numbers. I probably should have written it that way to begin with, except the final selection is trickier. The results from this have the date for each status and account:
select AccountNumber, StudentStatus, min(WhenLastChanged) as WhenLastChanged
from (SELECT StudentStatus, WhenLastChanged, AccountNumber,
             (row_number() over (partition by AccountNumber order by WhenLastChanged) -
              row_number() over (partition by AccountNumber, studentstatus order by WhenLastChanged)
             ) as grp
      FROM Account_History
     ) t
where StudentStatus = 'ACT'
group by AccountNumber, StudentStatus, grp
order by WhenLastChanged desc;
But you can't get the last one per account quite so easily. Another level of subqueries:
select AccountNumber, StudentStatus, WhenLastChanged
from (select AccountNumber, StudentStatus, min(WhenLastChanged) as WhenLastChanged,
row_number() over (partition by AccountNumber, StudentStatus order by min(WhenLastChanged) desc
) as seqnum
from (SELECT AccountNumber, StudentStatus, WhenLastChanged,
(row_number() over (partition by AccountNumber order by WhenLastChanged) -
row_number() over (partition by AccountNumber, studentstatus order by WhenLastChanged)
) as grp
FROM Account_History
) t
where StudentStatus = 'ACT'
group by AccountNumber, StudentStatus, grp
) t
where seqnum = 1;
This uses aggregation along with the window function row_number(). This is assigning sequential numbers to the groups (after aggregation), with the last date for each account getting a value of 1 (order by min(WhenLastChanged) desc). The outermost select then just chooses that row for each account.
SELECT [Status], MIN([Date])
FROM Table_Name
WHERE [Status] = (SELECT [Status]
FROM Table_Name
WHERE [Date] = (SELECT MAX([Date])
FROM Table_Name)
)
GROUP BY [Status]
Try here Sql Fiddle
Hogan: basically, yes. I just want to know the date/time when the
account was last changed to ACT. The records after the point above
marked THIS are just extra.
Instead of just looking for ACT, we can look for the points where the status changes, and then select ACT (and the max) from those.
So... every time a status changes:
with rownumb as
(
  select *, row_number() OVER (order by date asc) as rn
  from Account_History
)
select b.status, b.date
from rownumb A
join rownumb B on A.rn = B.rn-1
where a.status != b.status
Now find the max of the ACT items:
with rownumb as
(
  select *, row_number() OVER (order by date asc) as rn
  from Account_History
), statuschange as
(
  select b.status, b.date
  from rownumb A
  join rownumb B on A.rn = B.rn-1
  where a.status != b.status
)
select max(date)
from statuschange
where status = 'ACT'
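For reference, here is the sample data as a quick T-SQL test setup (column names follow the question's query; the earlier answers suggest the real table uses StudentStatus/WhenLastChanged instead):
create table Account_History (
  AccountNumber varchar(10) not null,
  [Status] varchar(4) not null,
  [Date] datetime not null
);

insert into Account_History (AccountNumber, [Status], [Date]) values
  ('1234', 'ACT',  '2012-01-07 11:51:06.060'),
  ('1234', 'ACT',  '2012-01-07 11:51:07.920'),
  ('1234', 'ACT',  '2012-01-08 04:13:29.140'),
  ('1234', 'NOS',  '2012-01-09 04:29:16.873'),
  ('1234', 'ACT',  '2012-01-21 12:39:37.607'),
  ('1234', 'ACT',  '2012-01-21 12:40:03.840'),
  ('1234', 'ACT',  '2012-05-02 16:27:17.370'),
  ('1234', 'GRAD', '2012-05-19 13:30:02.503'),
  ('1234', 'GRAD', '2013-09-03 22:58:48.750');
-- Expected: the minimum date of the latest ACT group is 2012-01-21 12:39:37.607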