I have this table:
ID BS time
1 1 14:10:00
1 1 14:10:05
1 1 15:04:03
1 2 16:18:05
1 2 17:00:09
1 3 18:33:50
1 1 19:03:14
1 1 19:10:23
and I expect:
ID BS start_time end_time
1 1 14:10:00 16:18:05
1 2 16:18:05 18:33:50
1 3 18:33:50 19:03:14
1 1 19:03:14 19:10:23
I tried to use lead, but I don't know how to handle the case where a BS value repeats after its run has ended:
SELECT id,bs,time,--min(time) time_start,
lead(time,1) over (partition by id order by time) next_time,
FROM `sage-facet-114619.Temp_data.temp_table`
order by id,time
I thought about adding a group by after this, but then rows with the same BS from different runs would end up in one group.
Below is for BigQuery Standard SQL (and it actually returns the expected result, which is not the case with the other two answers)
#standardSQL
SELECT id, bs,
MIN(time) AS start_time,
MAX(IFNULL(end_time, time)) AS end_time
FROM (
SELECT id, bs, time, end_time,
COUNTIF(flag) OVER(PARTITION BY id ORDER BY time) AS grp
FROM (
SELECT *,
LEAD(time) OVER win AS end_time,
bs != LAG(bs) OVER win AS flag
FROM `sage-facet-114619.Temp_data.temp_table`
WINDOW win AS (PARTITION BY id ORDER BY time)
)
)
GROUP BY id, bs, grp
Applied to the sample data from your question, the output is:
Row id bs start_time end_time
1 1 1 14:10:00 16:18:05
2 1 2 16:18:05 18:33:50
3 1 3 18:33:50 19:03:14
4 1 1 19:03:14 19:10:23
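To check the grouping logic outside the database, here is a minimal Python sketch of the same idea: start a new island whenever bs differs from the previous row (the LAG comparison) and take each island's end from the next row's time (the LEAD). The function name collapse_runs and the tuple layout are assumptions made for illustration; they are not part of the query.

```python
from itertools import groupby

# Sample rows from the question as (id, bs, time). Zero-padded HH:MM:SS
# strings sort chronologically, so plain string comparison is enough here.
rows = [
    (1, 1, "14:10:00"), (1, 1, "14:10:05"), (1, 1, "15:04:03"),
    (1, 2, "16:18:05"), (1, 2, "17:00:09"), (1, 3, "18:33:50"),
    (1, 1, "19:03:14"), (1, 1, "19:10:23"),
]

def collapse_runs(rows):
    """Collapse consecutive rows with the same bs into (id, bs, start, end)."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for rid, id_rows in groupby(ordered, key=lambda r: r[0]):
        id_rows = list(id_rows)
        islands = []
        for i, (_, bs, t) in enumerate(id_rows):
            # LEAD(time): the next row's time, or the row's own time on the last row
            next_time = id_rows[i + 1][2] if i + 1 < len(id_rows) else t
            if i == 0 or bs != id_rows[i - 1][1]:
                # bs differs from LAG(bs): a new island starts on this row
                islands.append([rid, bs, t, next_time])
            else:
                # same island: push the end time forward
                islands[-1][3] = max(islands[-1][3], next_time)
        result.extend(tuple(island) for island in islands)
    return result

for row in collapse_runs(rows):
    print(row)
```

The second run of bs = 1 stays separate from the first because islands are keyed on position in the ordered sequence, not on the bs value alone.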
This is a gaps-and-islands problem. The difference of row numbers is one solution:
select id, min(time) as start_time,
lead(min(time)) over (partition by id order by min(time)) as end_time
from (select t.*,
row_number() over (order by time) as seqnum,
row_number() over (partition by id order by time) as seqnum_2
from `sage-facet-114619.Temp_data.temp_table` t
) t
group by id, (seqnum - seqnum_2);
Another solution in this case is lag():
select id, bs, time as start_time,
lead(time) over (partition by id order by time) as end_time
from (select t.*,
lag(bs) over (partition by id order by time) as prev_bs
from `sage-facet-114619.Temp_data.temp_table` t
) t
where prev_bs is null or prev_bs <> bs
I have fixed a few oversights in Gordon's query. As I have not used BigQuery myself, I can't speak to its "standard" syntax or features, but I trust that it is reliable outside of the relatively minor changes I made.
select
id, bs, min(time) as start_time,
coalesce(
lead(min(time)) over (partition by id order by min(time)),
max(time) -- corrected: max() rather than min()
) as end_time
from (
select t.*,
row_number() over (order by time) as seqnum,
row_number() over (partition by id, bs order by time) as seqnum_2
from t
) t
group by id, bs, seqnum - seqnum_2;
Please compare results (running against SQL Server): https://rextester.com/WCSL25882
Related
I'm trying to query a dataset about user status changes. I want to find out the time it takes for the status to change, and the number of steps (rows) in between.
Example data:
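| user_id | Status | date       |
|---------|--------|------------|
| 1       | a      | 2001-01-01 |
| 1       | a      | 2001-01-08 |
| 1       | b      | 2001-01-15 |
| 1       | b      | 2001-01-28 |
| 1       | a      | 2001-01-31 |
| 1       | b      | 2001-02-01 |
| 2       | a      | 2001-01-08 |
| 2       | a      | 2001-01-18 |
| 2       | a      | 2001-01-28 |
| 3       | b      | 2001-03-08 |
| 3       | b      | 2001-03-18 |
| 3       | b      | 2001-03-19 |
| 3       | a      | 2001-03-20 |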
Desired output:
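| user_id | From | to | days in between | Steps in between |
|---------|------|----|-----------------|------------------|
| 1       | a    | b  | 14              | 2                |
| 1       | b    | a  | 16              | 2                |
| 1       | a    | b  | 1               | 1                |
| 3       | b    | a  | 12              | 3                |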
You might consider the approach below.
WITH partitions AS (
SELECT *, COUNTIF(flag) OVER w AS part FROM (
SELECT *, ROW_NUMBER() OVER w AS rn, status <> LAG(status) OVER w AS flag,
FROM sample_data
WINDOW w AS (PARTITION BY user_id ORDER BY date)
) WINDOW w AS (PARTITION BY user_id ORDER BY date)
)
SELECT user_id,
LAG(ANY_VALUE(status)) OVER w AS `from`,
ANY_VALUE(status) AS `to`,
EXTRACT(DAY FROM MIN(date) - LAG(MIN(date)) OVER w) AS days_in_between,
MIN(rn) - LAG(MIN(rn)) OVER w AS steps_in_between
FROM partitions
GROUP BY user_id, part
QUALIFY `from` IS NOT NULL
WINDOW w AS (PARTITION BY user_id ORDER BY MIN(date));
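As a cross-check of the logic, the following Python sketch finds the first row of each run of equal statuses per user and then compares neighbouring runs. The function name status_changes and the tuple layout are assumptions for illustration only; the sample rows are the ones from the question.

```python
from datetime import date
from itertools import groupby

# Sample rows from the question as (user_id, status, date).
rows = [
    (1, "a", date(2001, 1, 1)), (1, "a", date(2001, 1, 8)),
    (1, "b", date(2001, 1, 15)), (1, "b", date(2001, 1, 28)),
    (1, "a", date(2001, 1, 31)), (1, "b", date(2001, 2, 1)),
    (2, "a", date(2001, 1, 8)), (2, "a", date(2001, 1, 18)),
    (2, "a", date(2001, 1, 28)),
    (3, "b", date(2001, 3, 8)), (3, "b", date(2001, 3, 18)),
    (3, "b", date(2001, 3, 19)), (3, "a", date(2001, 3, 20)),
]

def status_changes(rows):
    """Return (user_id, from, to, days_in_between, steps_in_between) tuples."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for user, user_rows in groupby(ordered, key=lambda r: r[0]):
        # One entry per run: (status, first date, row number of the first row).
        runs = []
        for rn, (_, status, d) in enumerate(user_rows, start=1):
            if not runs or runs[-1][0] != status:
                runs.append((status, d, rn))
        # Compare each run with the one before it (the LAG over the grouped rows).
        for prev, cur in zip(runs, runs[1:]):
            out.append((user, prev[0], cur[0],
                        (cur[1] - prev[1]).days, cur[2] - prev[2]))
    return out

for change in status_changes(rows):
    print(change)
```

User 2 never changes status, so it produces no output row, which matches the desired output in the question.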
with main as (
select
*,
dense_rank() over(partition by user_id order by date) as rank_,
row_number() over(partition by user_id, status order by date) as rank_2,
-- constant within each run of consecutive rows that share a status
row_number() over(partition by user_id, status order by date) - dense_rank() over(partition by user_id order by date) as diff,
row_number() over(partition by user_id order by date) as row_num,
lag(status) over(partition by user_id order by date) as prev_status,
concat(lag(status) over(partition by user_id order by date) , ' to ' , status) as status_change
from sample_data
),
new_rank as (
select
*,
-- first date of the run that the row belongs to
min(date) over(partition by user_id, status, diff) as min_date
from main
),
prev_date as (
select
*,
lag(min_date) over(partition by user_id order by date) as prev_min_date
from new_rank
)
select
user_id,
prev_status as `from`,
status as `to`,
date_diff(min_date, prev_min_date, DAY) as days_in_between
from prev_date
where status != prev_status and prev_status is not null
Does this seem to work? I tried to solve this, but it is hard to do without a fiddle. Two notes:
You may remove the extra steps/ranks that I have added; I left them there so you can see what they are doing.
I don't follow your steps logic, so it is missing from the code.
I have a table with 3 columns id, start_date, end_date
Some of the values are as follows:
1 2018-01-01 2030-01-01
1 2017-10-01 2018-10-01
1 2019-01-01 2020-01-01
1 2015-01-01 2016-01-01
2 2010-01-01 2011-02-01
2 2010-10-01 2010-12-01
2 2008-01-01 2009-01-01
I have the above kind of data set, where for each id I have to filter out overlapping date ranges by keeping the largest date range, while also keeping the other date ranges that do not overlap.
Hence desired output should be:
1 2018-01-01 2030-01-01
1 2015-01-01 2016-01-01
2 2010-01-01 2011-02-01
2 2008-01-01 2009-01-01
I am unable to find the right way to code this in Impala. Can someone please help me?
I have tried something like this:
with cte as(
select a.*, row_number() over(partition by id order by datediff(end_date , start_date) desc) as flag from mytable a) select * from cte where flag=1
but this also removes the other date ranges that do not overlap. Please help.
Use row_number together with a countItem for each id:
with cte as(
select *,
row_number() over(partition by id order by id) as seq,
count(*) over(partition by id order by id) as countItem
from mytable
)
select id,start_date,end_date
from cte
where seq = 1 or seq = countItem
or without cte
select id,start_date,end_date
from
(select *,
row_number() over(partition by id order by id) as seq,
count(*) over(partition by id order by id) as countItem
from mytable) t
where seq = 1 or seq = countItem
demo in db<>fiddle
You can use a cumulative max to see if there is any overlap with preceding rows. If there is not, then you have the first row of a new group (row in the result set).
A cumulative sum of the starts assigns each row in the source to a group. Then aggregate:
select id, min(start_date), max(end_date)
from (select t.*,
sum(case when prev_end_date >= start_date then 0 else 1 end) over
(partition by id
order by start_date
rows between unbounded preceding and current row
) as grp
from (select t.*,
max(end_date) over (partition by id
order by start_date
rows between unbounded preceding and 1 preceding
) as prev_end_date
from t
) t
) t
group by id, grp;
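The cumulative-max idea can be sketched in Python as a single pass over the rows sorted by id and start date: a row whose start is not covered by the running maximum end date opens a new group. The function name merge_ranges and the use of ISO date strings (which compare correctly as text) are assumptions for illustration. Like the query, the sketch merges overlapping ranges, so for id 1 it reports 2017-10-01 to 2030-01-01 rather than only the widest single range.

```python
# Sample rows from the question as (id, start_date, end_date).
rows = [
    (1, "2018-01-01", "2030-01-01"),
    (1, "2017-10-01", "2018-10-01"),
    (1, "2019-01-01", "2020-01-01"),
    (1, "2015-01-01", "2016-01-01"),
    (2, "2010-01-01", "2011-02-01"),
    (2, "2010-10-01", "2010-12-01"),
    (2, "2008-01-01", "2009-01-01"),
]

def merge_ranges(rows):
    """Merge overlapping (start, end) ranges per id."""
    merged = []
    for rid, start, end in sorted(rows):
        last = merged[-1] if merged else None
        # last[2] plays the role of prev_end_date: the running max end date
        if last is not None and last[0] == rid and last[2] >= start:
            last[2] = max(last[2], end)        # overlap: extend the current group
        else:
            merged.append([rid, start, end])   # no overlap: start a new group
    return [tuple(m) for m in merged]

for r in merge_ranges(rows):
    print(r)
```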
I have the following result from a SELECT query with ORDER BY player_id ASC, time ASC in a PostgreSQL database:
player_id points time
395 0 2018-06-01 17:55:23.982413-04
395 100 2018-06-30 11:05:21.8679-04
395 0 2018-07-15 21:56:25.420837-04
395 100 2018-07-28 19:47:13.84652-04
395 0 2018-11-27 17:09:59.384-05
395 100 2018-12-02 08:56:06.83033-05
399 0 2018-05-15 15:28:22.782945-04
399 100 2018-06-10 12:11:18.041521-04
454 0 2018-07-10 18:53:24.236363-04
675 0 2018-08-07 20:59:15.510936-04
696 0 2018-08-07 19:09:07.126876-04
756 100 2018-08-15 08:21:11.300871-04
756 100 2018-08-15 16:43:08.698862-04
756 0 2018-08-15 17:22:49.755721-04
756 100 2018-10-07 15:30:49.27374-04
756 0 2018-10-07 15:35:00.975252-04
756 0 2018-11-27 19:04:06.456982-05
756 100 2018-12-02 19:24:20.880022-05
756 100 2018-12-04 19:57:48.961111-05
I'm trying to find each player's longest streak where points = 100, with the tiebreaker being whichever streak began most recently. I also need to determine the time at which that player's longest streak began. The expected result would be:
player_id longest_streak time_began
395 1 2018-12-02 08:56:06.83033-05
399 1 2018-06-10 12:11:18.041521-04
756 2 2018-12-02 19:24:20.880022-05
A gaps-and-islands problem indeed.
Assuming:
"Streaks" are not interrupted by rows from other players.
All columns are defined NOT NULL. (Else you have to do more.)
This should be simplest and fastest as it only needs two fast row_number() window functions:
SELECT DISTINCT ON (player_id)
player_id, count(*) AS seq_len, min(ts) AS time_began
FROM (
SELECT player_id, points, ts
, row_number() OVER (PARTITION BY player_id ORDER BY ts)
- row_number() OVER (PARTITION BY player_id, points ORDER BY ts) AS grp
FROM tbl
) sub
WHERE points = 100
GROUP BY player_id, grp -- omit "points" after WHERE points = 100
ORDER BY player_id, seq_len DESC, time_began DESC;
db<>fiddle here
Using the column name ts instead of time, which is a reserved word in standard SQL. It's allowed in Postgres, but with limitations and it's still a bad idea to use it as identifier.
The "trick" is to subtract row numbers so that consecutive rows fall in the same group (grp) per (player_id, points). Then filter the ones with 100 points, aggregate per group and return only the longest, most recent result per player.
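To see why the subtraction works, the following Python sketch computes the same two row numbers by hand and groups on their difference. The function name longest_streaks is an assumption for illustration, and the timestamps are the question's values for three players shortened to whole seconds without the UTC offset, so that they compare correctly as strings.

```python
from itertools import groupby

# (player_id, points, ts); timestamps shortened from the question's sample data.
rows = [
    (399, 0, "2018-05-15 15:28:22"), (399, 100, "2018-06-10 12:11:18"),
    (454, 0, "2018-07-10 18:53:24"),
    (756, 100, "2018-08-15 08:21:11"), (756, 100, "2018-08-15 16:43:08"),
    (756, 0, "2018-08-15 17:22:49"), (756, 100, "2018-10-07 15:30:49"),
    (756, 0, "2018-10-07 15:35:00"), (756, 0, "2018-11-27 19:04:06"),
    (756, 100, "2018-12-02 19:24:20"), (756, 100, "2018-12-04 19:57:48"),
]

def longest_streaks(rows, target=100):
    """Return {player_id: (streak_length, time_began)} for points == target."""
    best = {}
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for player, player_rows in groupby(ordered, key=lambda r: r[0]):
        seen = {}    # rows seen so far per points value (second row_number)
        groups = {}  # (points, grp) -> timestamps of the rows in that group
        for rn, (_, points, ts) in enumerate(player_rows, start=1):
            seen[points] = seen.get(points, 0) + 1
            grp = rn - seen[points]   # constant while points stays the same
            groups.setdefault((points, grp), []).append(ts)
        streaks = [(len(ts_list), min(ts_list))
                   for (points, _), ts_list in groups.items() if points == target]
        if streaks:
            # max on (length, start) picks the longest, then the most recent streak
            best[player] = max(streaks)
    return best

print(longest_streaks(rows))
```

Player 454 has no row with 100 points and therefore does not appear in the result, just as the WHERE clause removes it in the query.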
Basic explanation for the technique:
Select longest continuous sequence
We can use GROUP BY and DISTINCT ON in the same SELECT, GROUP BY is applied before DISTINCT ON. Consider the sequence of events in a SELECT query:
Best way to get result count before LIMIT was applied
About DISTINCT ON:
Select first row in each GROUP BY group?
This is a gaps-and-islands problem. You can try a conditional SUM aggregate used as a window function to get the gap number,
then use the MAX and COUNT window functions again.
Query 1:
WITH CTE AS (
SELECT *,
SUM(CASE WHEN points = 100 THEN 1 END) OVER(PARTITION BY player_id ORDER BY time) -
SUM(1) OVER(ORDER BY time) RN
FROM T
)
SELECT player_id,
MAX(longest_streak) longest_streak,
MAX(cnt) longest_streak
FROM (
SELECT player_id,
MAX(time) OVER(PARTITION BY rn,player_id) longest_streak,
COUNT(*) OVER(PARTITION BY rn,player_id) cnt
FROM CTE
WHERE points > 0
) t1
GROUP BY player_id
Results:
| player_id | longest_streak | longest_streak |
|-----------|-----------------------------|----------------|
| 756 | 2018-12-04T19:57:48.961111Z | 2 |
| 399 | 2018-06-10T12:11:18.041521Z | 1 |
| 395 | 2018-12-02T08:56:06.83033Z | 1 |
One way to do this is to look at how many rows lie between the previous and next non-100 results. To get the lengths of the streaks:
with s as (
select s.*,
row_number() over (partition by player_id order by time) as seqnum,
count(*) over (partition by player_id) as cnt
from scores s
)
select s.*,
coalesce(next_seqnum, cnt + 1) - coalesce(prev_seqnum, 0) - 1 as length
from (select s.*,
-- latest non-100 row at or before this row
max(seqnum) filter (where score <> 100) over (partition by player_id order by time) as prev_seqnum,
-- earliest non-100 row at or after this row
min(seqnum) filter (where score <> 100) over (partition by player_id order by time desc) as next_seqnum
from s
) s
where score = 100;
You can then incorporate the other conditions:
with s as (
select s.*,
row_number() over (partition by player_id order by time) as seqnum,
count(*) over (partition by player_id) as cnt
from scores s
),
streaks as (
select s.*,
next_seqnum - prev_seqnum - 1 as length,
max(next_seqnum - prev_seqnum - 1) over (partition by player_id) as max_length
from (select s.*,
coalesce(max(seqnum) filter (where score <> 100) over (partition by player_id order by time), 0) as prev_seqnum,
coalesce(min(seqnum) filter (where score <> 100) over (partition by player_id order by time desc), cnt + 1) as next_seqnum
from s
) s
where score = 100
),
longest as (
-- among the longest streaks only, find the most recent one
select s.*,
max(next_seqnum) over (partition by player_id) as max_next_seqnum
from streaks s
where length = max_length
)
select l.*
from longest l
where next_seqnum = max_next_seqnum;
Here is my answer
select
user_id,
non_streak,
streak,
ifnull(non_streak,streak) strk,
max(time) time
from (
Select
user_id,time,
points,
lag(points) over (partition by user_id order by time) prev_point,
case when points + lag(points) over (partition by user_id order by time) = 100 then 1 end as non_streak,
case when points + lag(points) over (partition by user_id order by time) > 100 then 1 end as streak
From players
) where ifnull(non_streak,streak) is not null
group by 1,2,3
order by 1,2
Suppose I have the following table in Teradata SQL.
How can I get the variation in price between the lowest and the highest date, at user level? Regards
Initial table
user date price
1 1-1 10
1 2-1 20
1 3-1 30
2 1-1 12
2 2-1 22
2 3-1 32
3 1-1 13
3 2-1 23
3 3-1 33
Final table
user var_price
1 30/10-1
2 32/12-1
3 33/13-1
Try this-
SELECT B.[user],
CAST(SUM(B.max_price) AS VARCHAR)+'/'+CAST(SUM(B.min_price) AS VARCHAR)+ '-1' var_price,
SUM(B.max_price)/SUM(B.min_price) -1 calculated_var_price
FROM
(
SELECT * FROM
(
SELECT [user],0 max_price,price min_price,ROW_NUMBER() OVER (PARTITION BY [user] ORDER BY DATE) RN
FROM your_table
)A WHERE RN = 1
UNION ALL
SELECT * FROM
(
SELECT [user],price max_price,0 min_price, ROW_NUMBER() OVER (PARTITION BY [user] ORDER BY DATE DESC) RN
FROM your_table
)A WHERE RN = 1
)B
GROUP BY B.[user]
Output is-
user var_price calculated_var_price
1 30/10-1 2
2 32/12-1 1
3 33/13-1 1
Is this what you want?
select user, max(price) / min(price) - 1
from t
group by user;
Your values are monotonically increasing, so max() and min() seem like the simplest solution.
EDIT:
You can use window functions:
select user, max(last_price) / max(first_price) - 1
from (select t.*,
first_value(price) over (partition by user order by date rows between unbounded preceding and current row) as first_price,
first_value(price) over (partition by user order by date desc rows between unbounded preceding and current row) as last_price
from t
) t
group by user;
select user
,price as first_price
,last_value(price)
over (partition by user
order by date
rows between unbounded preceding and unbounded following) as last_price
from mytab
qualify
row_number() -- lowest date only
over (partition by user
order by date) = 1
This returns the row with the lowest date and adds the price of the latest date.
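The first-price and last-price idea can be checked with a short Python sketch that walks each user's rows in date order. The function name price_variation is an assumption for illustration, and the date strings are taken as-is from the question; they only sort correctly as text because every value here has a single-digit first component.

```python
# Sample rows from the question as (user, date, price).
rows = [
    (1, "1-1", 10), (1, "2-1", 20), (1, "3-1", 30),
    (2, "1-1", 12), (2, "2-1", 22), (2, "3-1", 32),
    (3, "1-1", 13), (3, "2-1", 23), (3, "3-1", 33),
]

def price_variation(rows):
    """Return {user: last_price / first_price - 1} by date order."""
    first, last = {}, {}
    for user, _, price in sorted(rows, key=lambda r: (r[0], r[1])):
        first.setdefault(user, price)  # price on the user's earliest date
        last[user] = price             # overwritten until the latest date
    return {user: last[user] / first[user] - 1 for user in first}

print(price_variation(rows))
```

Python's / is floating-point division, so user 2 comes out as roughly 1.67 rather than the truncated 1 that integer division in SQL can produce.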
Assume this is my table:
ID NUMBER DATE
------------------------
1 45 2018-01-01
2 45 2018-01-02
2 45 2018-01-27
I need to separate using partition by and row_number where the difference between one date and another is greater than 5 days. Something like this would be the result of the above example:
ROWNUMBER ID NUMBER DATE
-----------------------------
1 1 45 2018-01-01
2 2 45 2018-01-02
1 3 45 2018-01-27
My actual query is something like this:
SELECT ROW_NUMBER() OVER(PARTITION BY NUMBER ORDER BY ID DESC) AS ROWNUMBER, ...
But as you can notice, it doesn't work for the dates. How can I achieve that?
You can use the lag function:
select *, row_number() over (partition by number, grp order by id) as [ROWNUMBER]
from (select *, (case when datediff(day, lag(date,1,date) over (partition by number order by id), date) <= 5
then 1 else 2
end) as grp
from t
) t;
By using the lag and datediff functions:
select * from
(
select t.*,
datediff(day,
lag(DATE) over (partition by NUMBER order by id),
DATE
) as diff
from t
) as TT where diff>5
http://sqlfiddle.com/#!18/130ae/11
I think you want to identify the groups, using lag() and dateadd() and a cumulative sum. Then use row_number():
select t.*,
row_number() over (partition by number, grp order by date) as rownumber
from (select t.*,
sum(grp_start) over (partition by number order by date) as grp
from (select t.*,
-- a group starts when the previous date is more than 5 days earlier
(case when lag(date) over (partition by number order by date) < dateadd(day, -5, date)
then 1 else 0
end) as grp_start
from t
) t
) t;
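The same logic in a minimal Python sketch: walk the rows per number in date order and restart the counter whenever the gap to the previous date exceeds five days. The function name number_with_gaps and its max_gap_days parameter are assumptions for illustration; the sample uses the three rows of the expected output in the question (ids 1, 2 and 3).

```python
from datetime import date, timedelta

# (id, number, date), taken from the expected output in the question.
rows = [
    (1, 45, date(2018, 1, 1)),
    (2, 45, date(2018, 1, 2)),
    (3, 45, date(2018, 1, 27)),
]

def number_with_gaps(rows, max_gap_days=5):
    """Return (rownumber, id, number, date); numbering restarts after a gap."""
    out = []
    prev = {}     # number -> date of the previous row (the lag)
    counter = {}  # number -> row number inside the current group
    for rid, number, d in sorted(rows, key=lambda r: (r[1], r[2])):
        last = prev.get(number)
        if last is None or d - last > timedelta(days=max_gap_days):
            counter[number] = 0   # gap too large (or first row): new group
        counter[number] += 1
        prev[number] = d
        out.append((counter[number], rid, number, d))
    return out

for row in number_with_gaps(rows):
    print(row)
```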