If I have a dataset as follows:
1/01/2014 a
2/01/2014 a
3/01/2014 a
4/01/2014 b
5/01/2014 b
6/01/2014 b
7/01/2014 b
8/01/2014 a
9/01/2014 a
10/01/2014 a
11/01/2014 a
12/01/2014 a
13/01/2014 a
How would I get output that looks like this:
letter min max
a 1/01/2014 3/01/2014
b 4/01/2014 7/01/2014
a 8/01/2014 13/01/2014
Teradata supports window functions, so you can calculate a group identifier for each run of consecutive rows. One method is a difference of row numbers:
select letter, min(date), max(date)
from (select t.*,
(row_number() over (order by date) -
row_number() over (partition by letter order by date)
) as grp
from t
) t
group by letter, grp;
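To see why the difference of row numbers works as a group identifier, here is a minimal Python sketch of the same idea. The sample rows mirror the question's data (day of January 2014 and letter); the function name `islands` is only illustrative and not part of any library.

```python
from collections import defaultdict

# Sample rows from the question: (day of January 2014, letter).
rows = (
    [(d, "a") for d in (1, 2, 3)]
    + [(d, "b") for d in (4, 5, 6, 7)]
    + [(d, "a") for d in range(8, 14)]
)

def islands(rows):
    """Group consecutive rows with the same letter via the row-number difference."""
    per_letter = defaultdict(int)  # row_number() over (partition by letter order by date)
    groups = {}                    # (letter, grp) -> [min_day, max_day]
    for overall, (day, letter) in enumerate(sorted(rows), start=1):
        per_letter[letter] += 1
        # Both counters advance together inside a run, so the difference stays constant.
        grp = overall - per_letter[letter]
        bounds = groups.setdefault((letter, grp), [day, day])
        bounds[1] = day  # rows arrive in date order, so the latest day is the max
    return sorted((lo, hi, letter) for (letter, _), (lo, hi) in groups.items())

result = islands(rows)
```

With the sample data this yields one entry per run: days 1 to 3 for a, 4 to 7 for b, and 8 to 13 for a.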
I have a table with 4 columns: ID, STARTDATE, ENDDATE and BADGE. I want to merge rows based on ID and BADGE values but make sure that only consecutive rows will get merged.
For example, If input is:
Output will be:
I have tried LAG, LEAD, and unbounded and bounded PRECEDING window frames, but I have been unable to get the desired output:
SELECT ID,
STARTDATE,
MAX(ENDDATE),
NAME
FROM (SELECT USERID,
IFF(LAG(NAME) over(Partition by USERID Order by STARTDATE) = NAME,
LAG(STARTDATE) over(Partition by USERID Order by STARTDATE),
STARTDATE) AS STARTDATE,
ENDDATE,
NAME
from myTable )
GROUP BY USERID,
STARTDATE,
NAME
We have to make sure that only consecutive rows with the same ID and Badge are merged.
Help will be appreciated, Thanks.
You can split the problem into two steps:
creating the right partitions
aggregating on the partitions with direct aggregation functions (MIN and MAX)
You can approach the first step with a flag that is 1 when a row does not directly continue the previous row of the same ID and Badge, that is, when row1.ENDDATE + 1 day = row2.STARTDATE does not hold. A value of 1 marks the start of a new partition, so a running sum over this flag numbers the partitions correctly.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL '1 day' = STARTDATE , 0, 1) AS boolval
FROM tab
)
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
Then the second step can be summarized in the direct aggregation of "STARTDATE" and "ENDDATE" using the MIN and MAX function respectively, grouping on your ranking value. For syntax correctness, you need to add "ID" and "Badge" too in the GROUP BY clause, even though their range of action is already captured by the computed ranking value.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL '1 day' = STARTDATE , 0, 1) AS boolval
FROM tab
), cte2 AS (
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
)
SELECT ID,
MIN(STARTDATE) AS STARTDATE,
MAX(ENDDATE) AS ENDDATE,
Badge
FROM cte2
GROUP BY ID,
Badge,
rn
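The two steps can be sketched in plain Python to make the flag and the running sum visible. This is only an illustration under assumptions: the sample rows are hypothetical (the question's input was not preserved), the function name `merge_adjacent` is made up, and the running sum is ordered by ascending ID and STARTDATE.

```python
from datetime import date, timedelta

# Hypothetical sample rows: (id, startdate, enddate, badge).
rows = [
    (1, date(2021, 1, 1), date(2021, 1, 10), "x"),
    (1, date(2021, 1, 11), date(2021, 1, 20), "x"),
    (1, date(2021, 1, 21), date(2021, 1, 31), "y"),
    (1, date(2021, 2, 1), date(2021, 2, 10), "x"),
]

def merge_adjacent(rows):
    # Step 1: flag is 0 when the previous row of the same (id, badge)
    # ends exactly one day before this row starts, otherwise 1.
    last_end = {}
    flagged = []
    for r in sorted(rows, key=lambda r: (r[0], r[3], r[1])):
        key = (r[0], r[3])
        touches = key in last_end and last_end[key] + timedelta(days=1) == r[1]
        flagged.append((r, 0 if touches else 1))
        last_end[key] = r[2]

    # Step 2: running sum of the flags ordered by (id, startdate),
    # then MIN/MAX per (id, badge, running sum).
    flagged.sort(key=lambda x: (x[0][0], x[0][1]))
    merged = {}
    rn = 0
    for (id_, start, end, badge), flag in flagged:
        rn += flag
        cur = merged.get((id_, badge, rn))
        merged[(id_, badge, rn)] = (
            (start, end) if cur is None else (min(cur[0], start), max(cur[1], end))
        )
    return [(k[0], v[0], v[1], k[1]) for k, v in merged.items()]

merged_rows = merge_adjacent(rows)
```

In this sample the first two "x" rows touch and collapse into one period, while the later "x" row stays separate because a "y" period sits in between.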
In Snowflake, gaps-and-islands problems like this one can be solved with the conditional_true_event function, as in the query below.
The first CTE adds a flag column that is true whenever the value of badge differs from the value in the previous row.
The next CTE (cte_1) passes that flag to conditional_true_event, which produces a counter that increments each time the flag is TRUE. This counter is used for grouping in the final query.
The final query is then just MIN and MAX with a GROUP BY.
with cte as (
select
m.*,
case when badge <> lag(badge) over (partition by id order by startdate)
then true
else false end flag
from merge_tab m
), cte_1 as (
select c.*,
conditional_true_event(flag) over (partition by id order by startdate) cn
from cte c
)
select id,min(startdate) ms, max(enddate) me, badge
from cte_1
group by id,badge,cn
order by id desc, ms asc, me asc, badge asc;
Final output -
ID MS ME BADGE
51 1985-02-01 2019-04-28 1
51 2019-04-29 2020-08-16 2
51 2020-08-17 2021-04-03 3
51 2021-04-04 2021-04-05 1
51 2021-04-06 2022-08-20 2
51 2022-08-21 9999-12-31 3
10 2020-02-06 9999-12-31 3
With data -
select * from merge_tab;
ID STARTDATE ENDDATE BADGE
51 1985-02-01 2019-04-28 1
51 2019-04-29 2019-04-28 2
51 2019-09-16 2019-11-16 2
51 2019-11-17 2020-08-16 2
51 2020-08-17 2021-04-03 3
51 2021-04-04 2021-04-05 1
51 2021-04-06 2022-05-05 2
51 2022-05-06 2022-08-20 2
51 2022-08-21 9999-12-31 3
10 2020-02-06 2019-04-28 3
10 2021-03-21 9999-12-31 3
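The counter that conditional_true_event produces can be imitated in a few lines of Python, which may help when checking the grouping by hand. This is a sketch only: it uses the merge_tab sample data above, orders the rows of each id by STARTDATE (an assumption), and the function name `merge_by_badge_change` is made up for illustration.

```python
# merge_tab sample data: (id, startdate, enddate, badge), dates as ISO strings.
merge_tab = [
    (51, "1985-02-01", "2019-04-28", 1),
    (51, "2019-04-29", "2019-04-28", 2),
    (51, "2019-09-16", "2019-11-16", 2),
    (51, "2019-11-17", "2020-08-16", 2),
    (51, "2020-08-17", "2021-04-03", 3),
    (51, "2021-04-04", "2021-04-05", 1),
    (51, "2021-04-06", "2022-05-05", 2),
    (51, "2022-05-06", "2022-08-20", 2),
    (51, "2022-08-21", "9999-12-31", 3),
    (10, "2020-02-06", "2019-04-28", 3),
    (10, "2021-03-21", "9999-12-31", 3),
]

def merge_by_badge_change(rows):
    prev_badge = {}  # last badge seen per id, plays the role of lag(badge)
    counter = {}     # per-id counter, bumped whenever the badge changes
    out = {}         # (id, badge, counter) -> (min startdate, max enddate)
    for id_, start, end, badge in sorted(rows, key=lambda r: (r[0], r[1])):
        changed = id_ in prev_badge and prev_badge[id_] != badge
        counter[id_] = counter.get(id_, 0) + (1 if changed else 0)
        prev_badge[id_] = badge
        key = (id_, badge, counter[id_])
        lo, hi = out.get(key, (start, end))
        # ISO date strings compare correctly as plain strings.
        out[key] = (min(lo, start), max(hi, end))
    result = [(k[0], v[0], v[1], k[1]) for k, v in out.items()]
    return sorted(result, key=lambda r: (-r[0], r[1]))  # id desc, start asc

merged = merge_by_badge_change(merge_tab)
```

On this data the sketch produces seven merged periods, six for id 51 and one for id 10.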
I have a table of names, dates and numeric values. For each name, I want to know the first date and the sum of the numeric values for the first 90 days after that first date.
For example:
name date value
Joe 2020-10-30 3
Bob 2020-12-23 5
Joe 2021-01-03 7
Joe 2021-05-30 2
I want a query that returns
name min_date sum_first_90_days
Joe 2020-10-30 10
Bob 2020-12-23 5
So far I have
SELECT name, min(date) min_date,
sum(value) over (partition by name
order by date
rows between min(date) and dateadd(day,90,min(date))
) as first_90_days_sum
FROM table
but it's not executing. What's a good approach here? How can I set up a window function to use a dynamic date range for each partition?
You can use window functions and aggregation:
select name, min_date, sum(value) as sum_first_90_days
from (select t.*,
min(date) over (partition by name) as min_date
from t
) t
where date <= min_date + interval '90 day'
group by name, min_date;
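The same two passes (find each name's first date, then sum the values within 90 days of it) can be sketched in Python with the question's sample rows. The function name `first_90_days` is illustrative only.

```python
from datetime import date, timedelta

# Sample rows from the question: (name, date, value).
rows = [
    ("Joe", date(2020, 10, 30), 3),
    ("Bob", date(2020, 12, 23), 5),
    ("Joe", date(2021, 1, 3), 7),
    ("Joe", date(2021, 5, 30), 2),
]

def first_90_days(rows):
    # Pass 1: earliest date per name (min(date) over (partition by name)).
    first = {}
    for name, d, _ in rows:
        first[name] = min(first.get(name, d), d)
    # Pass 2: sum only the values that fall within 90 days of that date.
    totals = {name: 0 for name in first}
    for name, d, value in rows:
        if d <= first[name] + timedelta(days=90):
            totals[name] += value
    return {name: (first[name], totals[name]) for name in first}

summary = first_90_days(rows)
```

For Joe, the row from 2021-05-30 falls outside the 90-day window and is excluded, which gives the expected sum of 10.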
I have a Hive table with two columns. The first column is a time and the second is an object, with the same object appearing in scattered places. I want to find every group of identical objects that is continuous in time and pick the first and last record of each group. How can I achieve this in Hive?
For example, I have a table like this:
id time object
1 10:01:00 a
2 10:02:00 a
3 10:03:00 a
4 10:04:00 b
5 10:05:00 b
6 10:06:00 a
7 10:07:00 a
8 10:08:00 a
9 10:09:00 a
10 10:10:00 a
11 10:11:00 c
I want to get this result (object 'a' is continuous from 10:01:00 to 10:03:00 and from 10:06:00 to 10:10:00, so rows 1 and 3 as well as rows 6 and 10 are picked):
id time object
1 10:01:00 a
3 10:03:00 a
4 10:04:00 b
5 10:05:00 b
6 10:06:00 a
10 10:10:00 a
11 10:11:00 c
What should I do to achieve this?
This is a gaps-and-islands problem, and you can use the row_number analytic function as follows:
select id, time, object from
(select t.*,
row_number() over (partition by object, rn - rn_o order by time) as rn_first,
row_number() over (partition by object, rn - rn_o order by time desc) as rn_last
from
(select t.*,
row_number() over (order by time) as rn,
row_number() over (partition by object order by time) as rn_o
from your_table t) t) t
where 1 in (rn_first, rn_last);
You can select the rows which are not equal to the previous row, or not equal to the next row, using lag and lead respectively. Also a check for first/last row is needed to include them.
select id, time, object from
(select *,
(lead(object) over (order by time) != object or row_number() over (order by time desc) = 1) cond1,
(lag(object) over (order by time) != object or row_number() over (order by time) = 1) cond2
from table)
where cond1 or cond2;
I don't see this as a gaps-and-islands problem. You seem to want only the rows where there is a change. This suggests a simple application of lag() and lead():
select t.*
from (select t.*,
lag(object) over (order by time) as prev_object,
lead(object) over (order by time) as next_object
from t
) t
where (prev_object is null or prev_object <> object) or
(next_object is null or next_object <> object);
Hive supports the NULL-safe comparison operator, so you can phrase the where as:
where not (prev_object <=> object) or
not (next_object <=> object)
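As a quick check of the lag()/lead() logic, the filter can be sketched in Python on the question's sample rows. Using `None` for a missing neighbour mirrors the NULL-safe comparison; the function name `run_boundaries` is illustrative only.

```python
# Sample rows from the question: (id, time, object).
rows = [
    (1, "10:01:00", "a"), (2, "10:02:00", "a"), (3, "10:03:00", "a"),
    (4, "10:04:00", "b"), (5, "10:05:00", "b"),
    (6, "10:06:00", "a"), (7, "10:07:00", "a"), (8, "10:08:00", "a"),
    (9, "10:09:00", "a"), (10, "10:10:00", "a"),
    (11, "10:11:00", "c"),
]

def run_boundaries(rows):
    ordered = sorted(rows, key=lambda r: r[1])
    keep = []
    for i, row in enumerate(ordered):
        prev_obj = ordered[i - 1][2] if i > 0 else None                 # lag(object)
        next_obj = ordered[i + 1][2] if i + 1 < len(ordered) else None  # lead(object)
        # Keep a row when it starts a run or ends a run.
        if prev_obj != row[2] or next_obj != row[2]:
            keep.append(row)
    return keep

boundary_ids = [r[0] for r in run_boundaries(rows)]
```

The kept ids are 1, 3, 4, 5, 6, 10 and 11, which matches the expected output in the question.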
I have a list of person IDs and dates. Each row records when a person visited the website (their ID and the date).
For every date, how can I show how many people visited the site two days in a row?
The data looks like this (about 30,000 rows across different dates):
01/03/2019 4616
01/03/2019 17584
01/03/2019 7812
01/03/2019 34
01/03/2019 12177
01/03/2019 7129
01/03/2019 11660
01/03/2019 2428
01/03/2019 17514
01/03/2019 10781
01/03/2019 7629
01/03/2019 11119
I managed to show the number of people who visited the site on each day, but I have not managed to add a column showing the number of people who visited two days in a row.
date number_of_entrance
2019-03-01 7099
2019-03-02 7021
2019-03-03 7195
2019-03-04 7151
2019-03-05 7260
2019-03-06 7169
2019-03-07 7076
2019-03-08 7081
2019-03-09 6987
2019-03-10 7172
select date,count(*) as number_of_entrance
FROM [finalaa].[dbo].[Daily_Activity]
group by Date
order by date;
I would just use lag():
select count(distinct person)
from (select t.*,
lag(date) over (partition by person order by date) as prev_date
from t
) t
where prev_date = dateadd(day, -1, date);
Your code suggests SQL Server, so I used the date functions in that database.
If you want this per date:
select date, count(distinct person)
from (select t.*,
lag(date) over (partition by person order by date) as prev_date
from t
) t
where prev_date = dateadd(day, -1, date)
group by date;
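For a small cross-check outside the database, the per-date count can be sketched in Python. Note that this sketch swaps lag() for a set intersection between each day and the previous day, which gives the same count and also tolerates several visits by one person on the same day. The dates below are hypothetical; only the two person ids are borrowed from the question's sample.

```python
from datetime import date, timedelta

# Hypothetical visits: (date, person id).
visits = [
    (date(2019, 3, 1), 34), (date(2019, 3, 1), 4616),
    (date(2019, 3, 2), 34),
    (date(2019, 3, 3), 34), (date(2019, 3, 3), 4616),
    (date(2019, 3, 4), 4616),
]

def two_days_in_a_row(visits):
    # Collect the distinct visitors of each day.
    by_day = {}
    for d, person in visits:
        by_day.setdefault(d, set()).add(person)
    # For each day, count visitors who were also present the day before.
    counts = {}
    for d, people in by_day.items():
        yesterday = by_day.get(d - timedelta(days=1), set())
        counts[d] = len(people & yesterday)
    return counts

counts = two_days_in_a_row(visits)
```

Unlike the filtered SQL query, this version also reports dates where the count is zero.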
You can use a subquery which returns the number of common entrances in 2 days:
select
t.date,
count(*) as number_of_entrance,
(
SELECT COUNT(g.id) FROM (
SELECT id
FROM [Daily_Activity]
WHERE date IN (t.date, DATEADD(day, -1, t.date))
GROUP BY id
HAVING COUNT(DISTINCT date) = 2
) g
) number_of_entrance_2_days_in_a_row
FROM [Daily_Activity] t
group by t.date
order by t.date;
Replace id with the 2nd column's name in the table.
Please suggest a good SQL query to find the start and end date of each period during which the stock stays the same.
Imagine I have data in a table like the one below.
Sample_table
transaction_date stock
2018-12-01 10
2018-12-02 10
2018-12-03 20
2018-12-04 20
2018-12-05 20
2018-12-06 20
2018-12-07 20
2018-12-08 10
2018-12-09 10
2018-12-10 30
Expected result should be
Start_date end_date stock
2018-12-01 2018-12-02 10
2018-12-03 2018-12-07 20
2018-12-08 2018-12-09 10
2018-12-10 null 30
This is a gaps-and-islands problem. You can use row_number and GROUP BY for this.
select t.stock, min(transaction_date), max(transaction_date)
from (
select row_number() over (order by transaction_date) -
row_number() over (partition by stock order by transaction_date) grp,
transaction_date,
stock
from data
) t
group by t.grp, t.stock
In the following DBFIDDLE DEMO I also handle the null value for the last group, but the main idea of finding consecutive rows is built on the query above.
You may check this for an explanation of this solution.
You can try below using row_number()
select stock,min(transaction_date) as start_date,
case when min(transaction_date)=max(transaction_date) then null else max(transaction_date) end as end_date
from
(
select *,row_number() over(order by transaction_date)-
row_number() over(partition by stock order by transaction_date) as rn
from t1
)A group by stock,rn
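Try GROUP BY with MIN and MAX. Note that this groups all rows with the same stock together, so it only matches the expected result when each stock value occurs in a single consecutive run: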
SELECT
stock,
MIN(transaction_date) Start_date,
CASE WHEN COUNT(*)>1 THEN MAX(transaction_date) END end_date
FROM Sample_table
GROUP BY stock
ORDER BY stock
You can try with LEAD, LAG functions as below:
select currentStockDate as startDate,
LEAD(currentStockDate,1) over(order by currentStockDate) as EndDate, -- start date of the next run
currentStock
from
(select *
from
(select
LAG(transaction_date,1) over(order by transaction_date) as prevStockDate,
transaction_date as CurrentstockDate,
LAG(stock,1) over(order by transaction_date) as prevStock,
stock as currentStock
from sample_table) as t
where (prevStock <> currentStock) or (prevStock is null)
) as t2
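The LAG/LEAD idea (keep only the rows where the stock changes, then look ahead to the next such row) can be sketched in Python with the question's sample data. The function name `runs_with_next_start` is illustrative; note that the end value here is the start date of the next run, not the last date of the current run.

```python
# Sample rows from the question: (transaction_date, stock), dates as ISO strings.
rows = [
    ("2018-12-01", 10), ("2018-12-02", 10),
    ("2018-12-03", 20), ("2018-12-04", 20), ("2018-12-05", 20),
    ("2018-12-06", 20), ("2018-12-07", 20),
    ("2018-12-08", 10), ("2018-12-09", 10),
    ("2018-12-10", 30),
]

def runs_with_next_start(rows):
    ordered = sorted(rows)
    # Keep the first row and every row whose stock differs from the previous row.
    starts = [
        (d, s) for i, (d, s) in enumerate(ordered)
        if i == 0 or ordered[i - 1][1] != s
    ]
    # Pair each run start with the start of the following run (None for the last).
    return [
        (d, starts[i + 1][0] if i + 1 < len(starts) else None, s)
        for i, (d, s) in enumerate(starts)
    ]

runs = runs_with_next_start(rows)
```

To get the last date of each run instead, as in the expected result, the end value would need to be moved back by one day, or taken from the row-number grouping approach.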