Group by on column that repeats


I'm having trouble putting this issue into words, which is probably why I can't find an example, so here is what I'd like to do.
I have a table like this:
| counter| timestamp |
| 1 | 2018-01-01T11:11:01 |
| 1 | 2018-01-01T11:11:02 |
| 1 | 2018-01-01T11:11:03 |
| 2 | 2018-01-01T11:11:04 |
| 2 | 2018-01-01T11:11:05 |
| 3 | 2018-01-01T11:11:06 |
| 3 | 2018-01-01T11:11:07 |
| 1 | 2018-01-01T11:11:08 |
| 1 | 2018-01-01T11:11:09 |
| 1 | 2018-01-01T11:11:10 |
What I'd like to do is group each consecutive run of counter values, so that a query like
SELECT counter, MIN(timestamp) as st, MAX(timestamp) as et
FROM table
GROUP BY counter;
the result would be
| counter | st | et |
| 1 | 2018-01-01T11:11:01 | 2018-01-01T11:11:03 |
| 2 | 2018-01-01T11:11:04 | 2018-01-01T11:11:05 |
| 3 | 2018-01-01T11:11:06 | 2018-01-01T11:11:07 |
| 1 | 2018-01-01T11:11:08 | 2018-01-01T11:11:10 |
instead of what actually happens which is
| counter | st | et |
| 1 | 2018-01-01T11:11:01 | 2018-01-01T11:11:10 |
| 2 | 2018-01-01T11:11:04 | 2018-01-01T11:11:05 |
| 3 | 2018-01-01T11:11:06 | 2018-01-01T11:11:07 |
So I'd like some way to combine group by and partition, ideally without nested queries.

You have to number the groups of consecutive rows that share the same counter value. This can be done with two window functions, lag() and a cumulative sum():
select counter, min(timestamp) as st, max(timestamp) as et
from (
select counter, timestamp, sum(grp) over w as grp
from (
select *, (lag(counter, 1, 0) over w <> counter)::int as grp
from my_table
window w as (order by timestamp)
) s
window w as (order by timestamp)
) s
group by counter, grp
order by st
DbFiddle.
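For readers who want to try the idea without a PostgreSQL instance, here is a minimal sketch that runs the same flag-then-running-sum logic in SQLite through Python's sqlite3 module. The table name readings, the column ts, the flag name is_new and the helper group_runs are placeholders chosen for illustration, not names from the question, and the sketch assumes the bundled SQLite is 3.25 or newer (the first release with window functions).

```python
import sqlite3

# Sample rows from the question: (counter, timestamp).
COUNTERS = [1, 1, 1, 2, 2, 3, 3, 1, 1, 1]
ROWS = [(c, f"2018-01-01T11:11:{i:02d}") for i, c in enumerate(COUNTERS, start=1)]

QUERY = """
SELECT counter, MIN(ts) AS st, MAX(ts) AS et
FROM (
    -- the running sum of the flags gives each consecutive run its own number
    SELECT counter, ts, SUM(is_new) OVER (ORDER BY ts) AS grp
    FROM (
        -- flag rows whose counter differs from the previous row's counter;
        -- the first row compares against NULL and lands in the ELSE branch
        SELECT counter, ts,
               CASE WHEN LAG(counter) OVER (ORDER BY ts) = counter
                    THEN 0 ELSE 1 END AS is_new
        FROM readings
    )
)
GROUP BY counter, grp
ORDER BY st
"""


def group_runs(rows):
    """Return one (counter, st, et) tuple per consecutive run of counter."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE readings (counter INTEGER, ts TEXT)")
        conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


runs = group_runs(ROWS)
for run in runs:
    print(run)
```

With the sample data this yields four rows, and counter 1 appears twice because its two runs get different grp values.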

You need to compute a group number for each run of equal counter values:
create table tbl(counter int, ts timestamp);
insert into tbl values
(1, '2018-01-01T11:11:01'),
(1, '2018-01-01T11:11:02'),
(1, '2018-01-01T11:11:03'),
(2, '2018-01-01T11:11:04'),
(2, '2018-01-01T11:11:05'),
(3, '2018-01-01T11:11:06'),
(3, '2018-01-01T11:11:07'),
(1, '2018-01-01T11:11:08'),
(1, '2018-01-01T11:11:09'),
(1, '2018-01-01T11:11:10');
✓
10 rows affected
select min(counter) as counter, min(ts) as st, max(ts) as et
from
(
select counter, ts, sum(rst) over (order by ts) as grp
from
(
select counter, ts,
case when coalesce(lag(counter) over (order by ts), -1) <> counter then 1 end rst
from tbl
) t1
) t2
group by grp
counter | st | et
------: | :------------------ | :------------------
3 | 2018-01-01 11:11:06 | 2018-01-01 11:11:07
1 | 2018-01-01 11:11:08 | 2018-01-01 11:11:10
2 | 2018-01-01 11:11:04 | 2018-01-01 11:11:05
1 | 2018-01-01 11:11:01 | 2018-01-01 11:11:03
db<>fiddle here

You can use ranking functions:
select counter, min(timestamp) st, max(timestamp) et
from (select *,
row_number() over (order by timestamp) Seq1,
row_number() over (partition by counter order by timestamp) Seq2
from table
) t
group by counter, (Seq1-Seq2);
This takes the difference of two row numbers (Seq1-Seq2), which stays constant within each consecutive run of the same counter, and uses it together with counter in the GROUP BY clause.
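To see why the difference works, it can help to look at the intermediate values. The sketch below is an illustration only: it runs in SQLite via Python's sqlite3 (window functions need SQLite 3.25+), and the table name readings, the column ts and the helper sequence_differences are made-up names.

```python
import sqlite3

# Sample rows from the question: (counter, timestamp).
COUNTERS = [1, 1, 1, 2, 2, 3, 3, 1, 1, 1]
ROWS = [(c, f"2018-01-01T11:11:{i:02d}") for i, c in enumerate(COUNTERS, start=1)]

QUERY = """
SELECT counter, seq1, seq2, seq1 - seq2 AS diff
FROM (
    SELECT counter, ts,
           ROW_NUMBER() OVER (ORDER BY ts) AS seq1,                      -- position overall
           ROW_NUMBER() OVER (PARTITION BY counter ORDER BY ts) AS seq2  -- position within the counter
    FROM readings
)
ORDER BY ts
"""


def sequence_differences(rows):
    """Return (counter, seq1, seq2, diff) for every row, in timestamp order."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE readings (counter INTEGER, ts TEXT)")
        conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


steps = sequence_differences(ROWS)
for step in steps:
    print(step)
```

Within a run both row numbers advance together, so diff does not change; when counter 1 comes back, seq1 has moved ahead while seq2 continues from 4, giving a new diff. Two different counters can end up with the same diff, which is why counter has to stay in the GROUP BY.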

Related

Counting number of rows leading to some number

The following is a snippet of my table...
My table has a lot of more users and higher order_rank
I'm trying to get the number of visits leading up to that order_rank in postgres.
So the result I'm trying to generate looks like...

I would address this as a gaps-and-islands problem, where each island ends with the visit that has an order_rank. You want the end of each island, along with the count of preceding records in the same island.
You can define the group with a window count of non-null values that starts from the end of the table. Then, just use that information to count how many records belong to each group:
select *
from (
select t.*,
count(*) over(partition by customer_id, grp) - 1 as number_of_visits
from (
select t.*,
count(order_rank) over(partition by customer_id order by visit_time desc) grp
from mytable t
) t
) t
where order_rank is not null
Demo on DB Fiddle:
customer_id | visit_time | txn_flag | order_rank | grp | number_of_visits
----------: | :--------- | -------: | ---------: | --: | ---------------:
123 | 2020-01-04 | 1 | 1 | 3 | 3
123 | 2020-01-06 | 1 | 2 | 2 | 1
123 | 2020-01-11 | 1 | 3 | 1 | 4

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than me may be able to help me with some direction.
I have a table like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201125 | 1 | 0 |
-----------------------------------
| 4 | 20201114 | 2 | 32 |
-----------------------------------
| 5 | 20201116 | 2 | 0 |
-----------------------------------
| 6 | 20201120 | 2 | 23 |
-----------------------------------
However, I need one record per user per day: if a day is missing for a user, the last recorded score should be carried forward, so I would end up with something like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201122 | 1 | 14 |
-----------------------------------
| 4 | 20201123 | 1 | 14 |
-----------------------------------
| 5 | 20201124 | 1 | 14 |
-----------------------------------
| 6 | 20201125 | 1 | 0 |
-----------------------------------
| 7 | 20201114 | 2 | 32 |
-----------------------------------
| 8 | 20201115 | 2 | 32 |
-----------------------------------
| 9 | 20201116 | 2 | 0 |
-----------------------------------
| 10 | 20201117 | 2 | 0 |
-----------------------------------
| 11 | 20201118 | 2 | 0 |
-----------------------------------
| 12 | 20201119 | 2 | 0 |
-----------------------------------
| 13 | 20201120 | 2 | 23 |
-----------------------------------
I'm trying to do this in BigQuery using Standard SQL. I have an idea of how to carry the same score across the following empty dates, but I really don't know how to add new rows for the missing dates for each user. Also, keep in mind that this example only has 2 users, but my data has more than 1500.
My end goal would be to show something like the average of the score per day. For background, because of our logic, if the score wasn't recorded in a specific day, this means that the user is still in the last score recorded which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success

Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
select user_id, format_date('%Y%m%d', day) date,
from (
select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
from `project.dataset.table`
group by user_id
) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
if applied to sample data from your question - output is

One option uses generate_date_array() to create the series of dates of each user, then brings the table with a left join.
select d.date, d.user_id,
last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
select r.user_id, day as date
from (
select user_id, min(date) as min_date, max(date) as max_date
from mytable
group by user_id
) r
cross join unnest(generate_date_array(r.min_date, r.max_date, interval 1 day)) as day
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date

I think the most efficient method is to use generate_date_array() but in a very particular way:
with t as (
select t.*,
date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
from t
)
select row_number() over (order by t.user_id, dte) as id,
t.user_id, dte, t.score
from t cross join
unnest(generate_date_array(date,
coalesce(next_date, date),
interval 1 day
)
) dte;
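The idea of expanding each record up to the day before the same user's next record can also be checked outside BigQuery. The following is a rough Python sketch of that logic, not a translation of any particular query; the function name fill_daily_scores and the tuple layout (user_id, day, score) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample data from the question: (user_id, day, score).
RECORDS = [
    (1, date(2020, 11, 20), 26),
    (1, date(2020, 11, 21), 14),
    (1, date(2020, 11, 25), 0),
    (2, date(2020, 11, 14), 32),
    (2, date(2020, 11, 16), 0),
    (2, date(2020, 11, 20), 23),
]


def fill_daily_scores(records):
    """Return one (user_id, day, score) row per user per day, carrying scores forward."""
    by_user = defaultdict(list)
    for user_id, day, score in records:
        by_user[user_id].append((day, score))

    filled = []
    for user_id in sorted(by_user):
        entries = sorted(by_user[user_id])
        for i, (day, score) in enumerate(entries):
            # A score stays valid until the day before the next record;
            # the user's last record covers only its own day.
            if i + 1 < len(entries):
                last_day = entries[i + 1][0] - timedelta(days=1)
            else:
                last_day = day
            current = day
            while current <= last_day:
                filled.append((user_id, current, score))
                current += timedelta(days=1)
    return filled


filled = fill_daily_scores(RECORDS)
for row in filled:
    print(row)
```

On the question's sample this produces 13 rows, 6 for user 1 and 7 for user 2, which matches the number of rows in the desired output.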

Get users who took ride for 3 or more consecutive dates

I have below table, it shows user_id and ride_date.
+---------+------------+
| user_id | ride_date |
+---------+------------+
| 1 | 2019-11-01 |
| 1 | 2019-11-03 |
| 1 | 2019-11-05 |
| 2 | 2019-11-03 |
| 2 | 2019-11-04 |
| 2 | 2019-11-05 |
| 2 | 2019-11-06 |
| 3 | 2019-11-03 |
| 3 | 2019-11-04 |
| 3 | 2019-11-05 |
| 3 | 2019-11-06 |
| 4 | 2019-11-05 |
| 4 | 2019-11-07 |
| 4 | 2019-11-08 |
| 4 | 2019-11-09 |
| 5 | 2019-11-11 |
| 5 | 2019-11-13 |
+---------+------------+
I want the user_id of every user who took rides on 3 or more consecutive days, along with the days on which they took those consecutive rides.
The desired result is as below
+---------+-----------------------+
| user_id | consecutive_ride_date |
+---------+-----------------------+
| 2 | 2019-11-03 |
| 2 | 2019-11-04 |
| 2 | 2019-11-05 |
| 2 | 2019-11-06 |
| 3 | 2019-11-03 |
| 3 | 2019-11-04 |
| 3 | 2019-11-05 |
| 3 | 2019-11-06 |
| 4 | 2019-11-07 |
| 4 | 2019-11-08 |
| 4 | 2019-11-09 |
+---------+-----------------------+
SQL Fiddle

With LAG() and LEAD() window functions:
with cte as (
select *,
datediff(
day,
lag([ride_date]) over (partition by [user_id] order by [ride_date]),
[ride_date]
) prev1,
datediff(
day,
lag([ride_date], 2) over (partition by [user_id] order by [ride_date]),
[ride_date]
) prev2,
datediff(
day,
[ride_date],
lead([ride_date]) over (partition by [user_id] order by [ride_date])
) next1,
datediff(
day,
[ride_date],
lead([ride_date], 2) over (partition by [user_id] order by [ride_date])
) next2
from Table1
)
select [user_id], [ride_date]
from cte
where
(prev1 = 1 and prev2 = 2) or
(prev1 = 1 and next1 = 1) or
(next1 = 1 and next2 = 2)
See the demo.
Results:
> user_id | ride_date
> ------: | :---------
> 2 | 03/11/2019
> 2 | 04/11/2019
> 2 | 05/11/2019
> 2 | 06/11/2019
> 3 | 03/11/2019
> 3 | 04/11/2019
> 3 | 05/11/2019
> 3 | 06/11/2019
> 4 | 07/11/2019
> 4 | 08/11/2019
> 4 | 09/11/2019

Here is one way to address this gaps-and-islands problem:
first, assign a rank to each user ride with row_number(), and recover the previous ride_date (aliased lag_ride_date)
then, compare the date of the previous ride to the current one in a conditional sum that increases when the dates are consecutive; subtracting this sum from the rank of the user ride gives groups (aliased grp) that represent consecutive rides with a 1-day spacing
do a window count of how many records belong to each group (aliased cnt)
filter on records whose window count is at least 3
Query:
select user_id, ride_date
from (
select
t.*,
count(*) over(partition by user_id, grp) cnt
from (
select
t.*,
rn1
- sum(case when ride_date = dateadd(day, 1, lag_ride_date) then 1 else 0 end)
over(partition by user_id order by ride_date) grp
from (
select
t.*,
row_number() over(partition by user_id order by ride_date) rn1,
lag(ride_date) over(partition by user_id order by ride_date) lag_ride_date
from Table1 t
) t
) t
) t
where cnt >= 3
Demo on DB Fiddle
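The steps above can be reproduced locally with a small sketch. It assumes SQLite (3.25+ for window functions) driven from Python's sqlite3, so dateadd(day, 1, x) is written as SQLite's date(x, '+1 day'); the table name rides and the helper consecutive_rides are illustrative.

```python
import sqlite3

# Sample data from the question: day of November 2019 per user.
RIDE_DAYS = {1: [1, 3, 5], 2: [3, 4, 5, 6], 3: [3, 4, 5, 6], 4: [5, 7, 8, 9], 5: [11, 13]}
ROWS = [(user, f"2019-11-{day:02d}") for user, days in RIDE_DAYS.items() for day in days]

QUERY = """
SELECT user_id, ride_date
FROM (
    SELECT user_id, ride_date,
           COUNT(*) OVER (PARTITION BY user_id, grp) AS cnt
    FROM (
        -- rank minus "number of consecutive steps so far" is constant inside an island
        SELECT user_id, ride_date,
               rn1 - SUM(CASE WHEN ride_date = date(lag_ride_date, '+1 day')
                              THEN 1 ELSE 0 END)
                     OVER (PARTITION BY user_id ORDER BY ride_date) AS grp
        FROM (
            SELECT user_id, ride_date,
                   ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ride_date) AS rn1,
                   LAG(ride_date) OVER (PARTITION BY user_id ORDER BY ride_date) AS lag_ride_date
            FROM rides
        )
    )
)
WHERE cnt >= 3
ORDER BY user_id, ride_date
"""


def consecutive_rides(rows):
    """Return (user_id, ride_date) rows that belong to a streak of 3+ days."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE rides (user_id INTEGER, ride_date TEXT)")
        conn.executemany("INSERT INTO rides VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


streaks = consecutive_rides(ROWS)
for row in streaks:
    print(row)
```

For user 4 the ride on 2019-11-05 forms an island of its own, so only 2019-11-07 through 2019-11-09 are returned.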

This is a typical gaps-and-islands problem.
We can solve it as follows:
with data
as (
select user_id
,ride_date
,dateadd(day
,-row_number() over(partition by user_id order by ride_date asc)
,ride_date) as grp_field
from Table1
)
,consecutive_days
as(
select user_id
,ride_date
,count(*) over(partition by user_id,grp_field) as cnt
from data
)
select *
from consecutive_days
where cnt>=3
order by user_id,ride_date
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=7bb851d9a12966b54afb4d8b144f3d46
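A variation on the same trick, sketched here under a few assumptions, returns one summary row per streak instead of every ride date. It runs in SQLite (3.25+) through Python's sqlite3, so the date arithmetic uses SQLite's date() modifier syntax rather than dateadd(); the table name rides, the output column names and the helper ride_streaks are made up for the example.

```python
import sqlite3

# Sample data from the question: day of November 2019 per user.
RIDE_DAYS = {1: [1, 3, 5], 2: [3, 4, 5, 6], 3: [3, 4, 5, 6], 4: [5, 7, 8, 9], 5: [11, 13]}
ROWS = [(user, f"2019-11-{day:02d}") for user, days in RIDE_DAYS.items() for day in days]

QUERY = """
SELECT user_id, MIN(ride_date) AS first_day, MAX(ride_date) AS last_day, COUNT(*) AS days
FROM (
    -- ride_date minus its row number is the same date for every ride in a streak
    SELECT user_id, ride_date,
           date(ride_date,
                '-' || ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ride_date) || ' days'
           ) AS grp_field
    FROM rides
)
GROUP BY user_id, grp_field
HAVING COUNT(*) >= 3
ORDER BY user_id, first_day
"""


def ride_streaks(rows):
    """Return (user_id, first_day, last_day, days) for each streak of 3+ days."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE rides (user_id INTEGER, ride_date TEXT)")
        conn.executemany("INSERT INTO rides VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


islands = ride_streaks(ROWS)
for island in islands:
    print(island)
```

Like the original query, this relies on each user having at most one row per date; duplicate dates would shift the row numbers and split a streak.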

There is no need to apply gaps-and-islands methodologies to this problem. The problem is much simpler to solve.
You can return the users and first date just by using LEAD():
SELECT t1.*
FROM (SELECT t1.*,
LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) as ride_date_2
FROM table1 t1
) t1
WHERE ride_date_2 = DATEADD(day, 2, ride_date);
If you want the actual dates, you can unpivot the results:
SELECT DISTINCT t1.user_id, v.ride_date
FROM (SELECT t1.*,
LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) as ride_date_2
FROM table1 t1
) t1 CROSS APPLY
(VALUES (t1.ride_date),
(DATEADD(day, 1, t1.ride_date)),
(DATEADD(day, 2, t1.ride_date))
) v(ride_date)
WHERE t1.ride_date_2 = DATEADD(day, 2, t1.ride_date)
ORDER BY t1.user_id, v.ride_date;

SQL Count In Range

How can I count rows that fall into ranges defined in a configuration table?
Something like this:
CAR_AVBL
+--------+-----------+
| CAR_ID | DATE_AVBL |
+--------+-----------+
| JJ01 | 1 |
| JJ02 | 1 |
| JJ03 | 3 |
| JJ04 | 10 |
| JJ05 | 13 |
| JJ06 | 4 |
| JJ07 | 10 |
| JJ08 | 1 |
| JJ09 | 23 |
| JJ10 | 11 |
| JJ11 | 20 |
| JJ12 | 3 |
| JJ13 | 19 |
| JJ14 | 22 |
| JJ15 | 7 |
+--------+-----------+
ZONE_CFG
+--------+------------+
| DATE | ZONE_DESCR |
+--------+------------+
| 15 | GREEN_ZONE |
| 25 | YELLOW_ZONE|
| 30 | RED_ZONE |
+--------+------------+
Table ZONE_CFG is configurable, so I cannot use static values for this.
The DATE column holds the maximum date for each zone.
The result I expect is:
+------------+----------+
| ZONE_DESCR | AVBL_CAR |
+------------+----------+
| GREEN_ZONE | 11 |
| YELLOW_ZONE| 4 |
| RED_ZONE | 0 |
+------------+----------+
Please could someone help me with this

You can use LAG and GROUP BY as follows:
SELECT
ZC.ZONE_DESCR,
COUNT(1) AS AVBL_CAR
FROM
CAR_AVBL CA
JOIN ( SELECT
ZONE_DESCR,
COALESCE(LAG(DATE) OVER(ORDER BY DATE) + 1, 0) AS START_DATE,
DATE AS END_DATE
FROM ZONE_CFG ) ZC
ON ( CA.DATE_AVBL BETWEEN ZC.START_DATE AND ZC.END_DATE )
GROUP BY
ZC.ZONE_DESCR;
Note: Don't use Oracle reserved keywords (DATE, in your case) as column names. Try changing it to something like DATE_ or DATE_START.
Cheers!!

If you want the zero 0, I might suggest a correlated subquery instead:
select z.*,
(select count(*)
from car_avbl c
where c.date_avbl > z.start_date and
c.date_avbl <= z.date
) as avbl_car
from (select z.*,
lag(date, 1, 0) over (order by date) as start_date
from zone_cfg z
) z;
In Oracle 12C, you can phrase this using a lateral join:
select z.*,
(c.cnt - lag(c.cnt, 1, 0) over (order by z.date)) as cnt
from zone_cfg z left join lateral
(select count(*) as cnt
from car_avbl c
where c.date_avbl <= z.date
) c
on 1=1
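As a way to check the expected counts, including the zero for RED_ZONE, here is a small sketch that derives each zone's lower bound with LAG and then left-joins the cars. It is only an illustration: it runs in SQLite (3.25+) via Python's sqlite3 rather than Oracle, the DATE column is renamed to max_day to stay clear of reserved words, and prev_max_day and count_cars_per_zone are hypothetical names.

```python
import sqlite3

# Sample data from the question.
DAYS = [1, 1, 3, 10, 13, 4, 10, 1, 23, 11, 20, 3, 19, 22, 7]
CARS = [(f"JJ{i:02d}", d) for i, d in enumerate(DAYS, start=1)]
ZONES = [(15, "GREEN_ZONE"), (25, "YELLOW_ZONE"), (30, "RED_ZONE")]

QUERY = """
SELECT z.zone_descr, COUNT(c.car_id) AS avbl_car
FROM (
    -- each zone starts just after the previous zone's maximum (0 for the first zone)
    SELECT zone_descr, max_day,
           LAG(max_day, 1, 0) OVER (ORDER BY max_day) AS prev_max_day
    FROM zone_cfg
) z
LEFT JOIN car_avbl c
       ON c.date_avbl > z.prev_max_day AND c.date_avbl <= z.max_day
GROUP BY z.zone_descr, z.max_day
ORDER BY z.max_day
"""


def count_cars_per_zone(cars, zones):
    """Return (zone_descr, avbl_car) per zone, keeping zones without cars."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE car_avbl (car_id TEXT, date_avbl INTEGER)")
        conn.execute("CREATE TABLE zone_cfg (max_day INTEGER, zone_descr TEXT)")
        conn.executemany("INSERT INTO car_avbl VALUES (?, ?)", cars)
        conn.executemany("INSERT INTO zone_cfg VALUES (?, ?)", zones)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


zone_counts = count_cars_per_zone(CARS, ZONES)
for zone in zone_counts:
    print(zone)
```

Counting c.car_id instead of COUNT(*) matters here: for a zone with no matching car the left join yields a single row of NULLs, which COUNT(c.car_id) reports as 0.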

SQL-Server query to select last and previous information for multiple columns

After searching Stack Overflow, I can't find a solution to this problem.
I'm using this query:
SELECT *
FROM(
SELECT DISTINCT *
FROM Table_01
ORDER BY ID, StartDate
UNION ALL(
SELECT DISTINCT * FROM Table_02
ORDER BY ID, StartDate
)
UNION ALL (...
) a ORDER BY a.ID, a.StartDate
I get something like this. For each ID, I would like to keep the latest and the previous dates, along with the other columns, to record a history:
+------+------------+-----------+-------+-------+
| ID | StartDate | EndDate | Value | rate |
+------+------------+-----------+-------+-------+
| 1 | 2018-06-29 |2018-10-22 | 15 | 77.2 |
| 1 | 2018-04-28 |2018-06-21 | 23 | 55.3 |
| 1 | 2018-02-24 |2018-04-15 | 41 | 44.3 |
| 1 | 2017-06-29 |2017-11-29 | 55 | 44.1 |
| 2 | 2018-07-29 |2018-11-22 | 15 | 106.1 |
| 2 | 2018-03-28 |2018-07-21 | 23 | 10.8 |
| 2 | 2017-12-28 |2018-03-28 | 22 | 11.0 |
| 3 | 2017-09-28 |2018-01-28 | 11 | 87.09 |
| 3 | 2017-06-27 |2018-09-28 | 58 | 100 |
| ... | ... | ... | ... | ... |
+------+------------+-----------+-------+-------+
And I would like to have the next table, to keep the previous information
+------+------------+-----------+------------+-----------+-------+--------+-------+--------+
| ID | StartDate | EndDate | StartDateP | EndDateP | Value | rate | ValueP| rateP |
+------+------------+-----------+------------+-----------+-------+--------+-------+--------+
| 1 | 2018-06-29 |2018-10-22 | 2018-04-28 |2018-06-21 | 15 | 77.2 | 23 | 55.3 |
| 2 | 2018-07-29 |2018-11-22 | 2018-03-28 |2018-07-21 | 15 | 106.1 | 23 | 10.8 |
| 3 | 2017-09-28 |2018-01-28 | 2017-06-27 |2018-09-28 | 11 | 87.09 | 58 | 100 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
+------+------------+-----------+------------+-----------+-------+--------+-------+--------+

If I understand you correctly, you want the row with the latest start date combined with the row whose start date is just before it. This might do the trick:
WITH results AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY ID ORDER BY StartDate DESC) r
FROM (
-- start of your original query
SELECT DISTINCT *
FROM Table_01
ORDER BY ID, StartDate
UNION ALL
(
SELECT DISTINCT *
FROM Table_02
ORDER BY ID, StartDate
)
UNION ALL
(...) a
ORDER BY a.ID, a.StartDate
-- end of your original query
)
)
SELECT
r1.id, r1.startDate, r1.enddate,
r2.startDate startDateP, r2.enddate enddateP,
r1.value, r1.rate,
r2.value valueP, r2.rate rateP
FROM results r1
LEFT JOIN results r2 ON r2.id = r1.id AND r2.r = 2
WHERE r1.r = 1
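The ranking-plus-self-join part can be tried on the sample rows without the UNION ALL query around it. The sketch below is a simplified illustration in SQLite (3.25+) through Python's sqlite3; the single table history, its snake_case column names and the helper latest_with_previous are assumptions that stand in for the combined result of the original query.

```python
import sqlite3

# Sample rows from the question: (id, start_date, end_date, value, rate).
ROWS = [
    (1, "2018-06-29", "2018-10-22", 15, 77.2),
    (1, "2018-04-28", "2018-06-21", 23, 55.3),
    (1, "2018-02-24", "2018-04-15", 41, 44.3),
    (1, "2017-06-29", "2017-11-29", 55, 44.1),
    (2, "2018-07-29", "2018-11-22", 15, 106.1),
    (2, "2018-03-28", "2018-07-21", 23, 10.8),
    (2, "2017-12-28", "2018-03-28", 22, 11.0),
    (3, "2017-09-28", "2018-01-28", 11, 87.09),
    (3, "2017-06-27", "2018-09-28", 58, 100.0),
]

QUERY = """
WITH ranked AS (
    -- r = 1 is the latest row per id, r = 2 the one before it
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date DESC) AS r
    FROM history
)
SELECT cur.id, cur.start_date, cur.end_date,
       prev.start_date AS start_date_p, prev.end_date AS end_date_p,
       cur.value, cur.rate,
       prev.value AS value_p, prev.rate AS rate_p
FROM ranked cur
LEFT JOIN ranked prev ON prev.id = cur.id AND prev.r = 2
WHERE cur.r = 1
ORDER BY cur.id
"""


def latest_with_previous(rows):
    """Return one row per id holding the latest record and the previous one."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE history "
            "(id INTEGER, start_date TEXT, end_date TEXT, value INTEGER, rate REAL)"
        )
        conn.executemany("INSERT INTO history VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


latest = latest_with_previous(ROWS)
for row in latest:
    print(row)
```

The LEFT JOIN keeps IDs that have only one row; their previous-row columns simply come back as NULL.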

Another option is using Row_Number() in concert with a conditional aggregation
Example
Select ID
,StartDate = max(case when RN=1 then StartDate end)
,EndDate = max(case when RN=1 then EndDate end)
,StartDateP = max(case when RN=2 then StartDate end)
,EndDateP = max(case when RN=2 then EndDate end)
,Value = max(case when RN=1 then Value end)
,Rate = max(case when RN=1 then Rate end)
,ValueP = max(case when RN=2 then Value end)
,RateP = max(case when RN=2 then Rate end)
From (
Select *
,RN = Row_Number() over (Partition By ID Order by EndDate Desc)
From YourTable
) A
Group By ID
Returns
