I have the following problem. I need to fetch the row for each distinct user_id with the min rank, max end date, max begin date, and max sequence number in that order. Essentially, I want a user's highest ranked, most recent MAJOR_ID. I am trying to avoid making separate temp tables and joining on those aggregate functions like the following:
select USER_ID
, SEQ_NBR
, BEGIN_DATE
, END_DATE
, MAJOR_RANK
, MAJOR_ID
, DEGREE_CODE
into #major0
from majors
select USER_ID
, MIN(MAJOR_RANK) as MAJOR_RANK
into #major1
from #major0
group by USER_ID
select #major0.USER_ID
, #major0.MAJOR_RANK
, MAX(#major0.END_DATE) as END_DATE
into #major2
from #major0
inner join #major1 on #major0.USER_ID = #major1.USER_ID and #major0.MAJOR_RANK = #major1.MAJOR_RANK
group by #major0.USER_ID
, #major0.MAJOR_RANK
etc...
until I get to the row that satisfies all the criteria, and then I join back on all the fields from the original query. Does that make sense? It's a lot of work to write this out, and I can't create a view of it unless I write an absurdly long set of subqueries. I don't think I can use MIN(MAJOR_RANK) OVER (PARTITION BY USER_ID) in a subquery for all these fields, because I would lose records that don't satisfy all of them.
Any suggestions would help! Thank you!
I do not see what the sequence_number is, but you can most likely solve this using a common table expression with row_number():
;with cte as (
select
user_id
, begin_date
, end_date
, major_rank
, major_id
, degree_code
, rn = row_number() over (
partition by user_id
order by major_rank asc, end_date desc, begin_date desc /*, sequence_number? desc*/
)
from majors
)
select
user_id
, begin_date
, end_date
, major_rank
, major_id
, degree_code
from cte
where rn = 1
Without the CTE:
select
user_id
, begin_date
, end_date
, major_rank
, major_id
, degree_code
from (
select
user_id
, begin_date
, end_date
, major_rank
, major_id
, degree_code
, rn = row_number() over (
partition by user_id
order by major_rank asc, end_date desc, begin_date desc /*, sequence_number? desc*/
)
from majors
) sub
where rn = 1
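To see how the ordering inside row_number() picks one row per user, here is a minimal sketch run against SQLite through Python's sqlite3 module. The sample rows are invented for illustration, and seq_nbr is assumed to be the sequence number from the question; the window syntax itself is the same as in the queries above.

```python
import sqlite3

# Hypothetical sample rows: column names follow the question, values are invented.
rows = [
    # user_id, seq_nbr, begin_date, end_date, major_rank, major_id
    (1, 1, "2015-09-01", "2017-06-01", 1, "MATH"),
    (1, 2, "2016-09-01", "2019-06-01", 1, "PHYS"),
    (1, 3, "2018-09-01", "2020-06-01", 2, "CHEM"),
    (2, 1, "2017-09-01", "2019-06-01", 2, "HIST"),
    (2, 2, "2018-01-01", "2019-06-01", 2, "ECON"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table majors (user_id int, seq_nbr int, begin_date text,"
    " end_date text, major_rank int, major_id text)"
)
conn.executemany("insert into majors values (?, ?, ?, ?, ?, ?)", rows)

# Lowest rank first, then latest end date, latest begin date, highest seq_nbr.
query = """
select user_id, major_id
from (
    select m.*,
           row_number() over (
               partition by user_id
               order by major_rank asc, end_date desc,
                        begin_date desc, seq_nbr desc
           ) as rn
    from majors m
) ranked
where rn = 1
order by user_id
"""
result = conn.execute(query).fetchall()
print(result)  # [(1, 'PHYS'), (2, 'ECON')]
```

User 1 has two rank-1 rows, so the later end date wins; user 2 ties on rank and end date, so the later begin date decides.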
I have a table like

date | ticker | Action
:----------- | :----- | :-----
'2022-03-01' | AAPL | BUY
'2022-03-02' | AAPL | SELL
'2022-03-03' | AAPL | BUY
'2022-03-01' | CMG | SELL
'2022-03-02' | CMG | HOLD
'2022-03-03' | CMG | HOLD
'2022-03-01' | GPS | SELL
'2022-03-02' | GPS | SELL
'2022-03-03' | GPS | SELL
I want to group by ticker and then count how many consecutive prior days the Action has held the value it has as of the last date (here 2022-03-03). For this example table the result would be:
ticker | NumSequentialDaysAction
:----- | :----------------------
AAPL | 0
CMG | 1
GPS | 2
It's fine to pass in 2022-03-03 as a value; it doesn't need to be figured out on the fly.
I tried something like this:
---Table Creation---
CREATE TABLE UserTable
([Date] DATETIME2, [Ticker] varchar(5), [Action] varchar(5))
;
INSERT INTO UserTable
([Date], [Ticker], [Action])
VALUES
('2022-03-01' , 'AAPL' , 'BUY'),
('2022-03-02' , 'AAPL' , 'SELL'),
('2022-03-03' , 'AAPL' , 'BUY'),
('2022-03-01' , 'CMG' , 'SELL'),
('2022-03-02' , 'CMG' , 'HOLD'),
('2022-03-03' , 'CMG' , 'HOLD'),
('2022-03-01' , 'GPS' , 'SELL'),
('2022-03-02' , 'GPS' , 'SELL'),
('2022-03-03' , 'GPS' , 'SELL')
;
---Attempted Solution---
I'm thinking that I need a subquery to get the last value and join the table to itself to get the matching values, then apply a window function ordered by date to check that the preceding value is sequential.
WITH CTE AS (SELECT Date, Ticker, Action,
ROW_NUMBER() OVER (PARTITION BY Ticker, Action ORDER BY Date) as row_num
FROM UserTable)
SELECT Ticker, COUNT(DISTINCT Date) as count_of_days
FROM CTE
WHERE row_num = 1
GROUP BY Ticker;
WITH CTE AS (SELECT Date, Ticker, Action,
DENSE_RANK() OVER (PARTITION BY Ticker ORDER BY Action,Date) as rank
FROM UserTable)
SELECT Ticker, COUNT(DISTINCT Date) as count_of_days
FROM CTE
WHERE rank = 1
GROUP BY Ticker;
You can do this with the help of the LEAD function like so. You didn't specify which RDBMS you're using. This solution works in PostgreSQL:
WITH "withSequential" AS (
SELECT
ticker,
(LEAD("Action") OVER (PARTITION BY ticker ORDER BY date ASC) = "Action") AS "nextDayIsSameAction"
FROM UserTable
)
SELECT
ticker,
SUM(
CASE
WHEN "nextDayIsSameAction" IS TRUE THEN 1
ELSE 0
END
) AS "NumSequentialDaysAction"
FROM "withSequential"
GROUP BY ticker
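One thing worth checking with this approach: it counts every adjacent pair of equal actions within a ticker, not only the run that ends on the last date. The two coincide for the sample data. The sketch below reruns the same idea on SQLite via Python's sqlite3 (a stand-in for PostgreSQL, with unquoted lowercase names) and adds a hypothetical ticker XYZ, which is not in the question, to show where the two counts diverge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table usertable (date text, ticker text, action text)")
conn.executemany(
    "insert into usertable values (?, ?, ?)",
    [
        ("2022-03-01", "AAPL", "BUY"),
        ("2022-03-02", "AAPL", "SELL"),
        ("2022-03-03", "AAPL", "BUY"),
        ("2022-03-01", "CMG", "SELL"),
        ("2022-03-02", "CMG", "HOLD"),
        ("2022-03-03", "CMG", "HOLD"),
        ("2022-03-01", "GPS", "SELL"),
        ("2022-03-02", "GPS", "SELL"),
        ("2022-03-03", "GPS", "SELL"),
        # Hypothetical extra ticker: the run of BUYs ends before the last date.
        ("2022-03-01", "XYZ", "BUY"),
        ("2022-03-02", "XYZ", "BUY"),
        ("2022-03-03", "XYZ", "SELL"),
    ],
)

# Same shape as the LEAD query above: flag rows whose next day has the same action.
query = """
with with_sequential as (
    select ticker,
           lead(action) over (partition by ticker order by date) = action
               as next_day_is_same_action
    from usertable
)
select ticker,
       sum(case when next_day_is_same_action then 1 else 0 end) as pair_count
from with_sequential
group by ticker
order by ticker
"""
pair_counts = dict(conn.execute(query).fetchall())
print(pair_counts)  # {'AAPL': 0, 'CMG': 1, 'GPS': 2, 'XYZ': 1}
```

XYZ ends on SELL with no earlier SELL days, so a count of the trailing run would be 0, while the pair count is 1. If the data can contain earlier runs like this, a gaps-and-islands query restricted to the last date is the safer choice.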
Here is a way to do this using a gaps-and-islands solution.
Thanks for sharing the create and insert scripts, which help to build the solution quickly.
dbfiddle link:
https://dbfiddle.uk/rZLDTrNR
with data
as (
select date
,ticker
,action
,case when lag(action) over(partition by ticker order by date) <> action then
1
else 0
end as marker
from usertable
)
,interim_data
as (
select *
,sum(marker) over(partition by ticker order by date) as grp_val
from data
)
,interim_data2
as (
select *
,count(*) over(partition by ticker,grp_val) - 1 as NumSequentialDaysAction -- minus 1 so the last date itself is not counted, matching the expected 0/1/2
from interim_data
)
select ticker,NumSequentialDaysAction
from interim_data2
where date='2022-03-03'
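The three steps (mark a change, running-sum the marks into island ids, count rows per island) can be traced on the question's sample data. This is a sketch on SQLite via Python's sqlite3, not the dbfiddle above; the CTE names islands and sized and the column prior_days are made up here, and 1 is subtracted so that the last date itself is not counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table usertable (date text, ticker text, action text)")
conn.executemany(
    "insert into usertable values (?, ?, ?)",
    [
        ("2022-03-01", "AAPL", "BUY"),
        ("2022-03-02", "AAPL", "SELL"),
        ("2022-03-03", "AAPL", "BUY"),
        ("2022-03-01", "CMG", "SELL"),
        ("2022-03-02", "CMG", "HOLD"),
        ("2022-03-03", "CMG", "HOLD"),
        ("2022-03-01", "GPS", "SELL"),
        ("2022-03-02", "GPS", "SELL"),
        ("2022-03-03", "GPS", "SELL"),
    ],
)

query = """
with data as (
    -- marker = 1 on every row whose action differs from the previous day's
    select date, ticker, action,
           case when lag(action) over (partition by ticker order by date) <> action
                then 1 else 0 end as marker
    from usertable
),
islands as (
    -- the running sum of markers gives each run of equal actions its own id
    select *, sum(marker) over (partition by ticker order by date) as grp_val
    from data
),
sized as (
    select *, count(*) over (partition by ticker, grp_val) as island_size
    from islands
)
select ticker, island_size - 1 as prior_days
from sized
where date = '2022-03-03'
order by ticker
"""
streaks = conn.execute(query).fetchall()
print(streaks)  # [('AAPL', 0), ('CMG', 1), ('GPS', 2)]
```

Because only the island containing the last date is selected, earlier runs of the same action do not affect the result.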
Another option: you could use the difference between two row_numbers approach, as follows:
select [Ticker], count(*)-1 NumSequentialDaysAction -- you could use (distinct) to remove duplicate rows
from
(
select *,
row_number() over (partition by [Ticker] order by [Date]) -
row_number() over (partition by [Ticker], [Action] order by [Date]) grp
from UserTable
where [date] <= '2022-03-03'
) RN_Groups
/* get only rows where [Action] = last date [Action] */
where [Action] = (select top 1 [Action] from UserTable T
where T.[Ticker] = RN_Groups.[Ticker] and [date] <= '2022-03-03'
order by [Date] desc)
group by [Ticker], [Action], grp
PERIOD_SERV

PERSON_NUMBER | DATE_sTART | PERIOD_ID
:------------ | :---------- | :--------
10 | 06-JAN-2020 | 192726
10 | 04-APR-2019 | 12827
11 | 01-FEB-2021 | 282726
11 | 09-APR-2018 | 827266
For each person_number I want to add a column with the previous date_start. When I use the query below, it gives me repeated rows.
I want to get only one row per person, with an additional column holding the date_start that precedes the most recent one. For example:

PERSON_NUMBER | DATE_sTART | PERIOD_ID | PREVIOUS_DATE
:------------ | :---------- | :-------- | :------------
10 | 06-JAN-2020 | 192726 | 04-APR-2019
11 | 01-FEB-2021 | 282726 | 09-APR-2018

I am using the query below but getting two rows for each person:
SELECT person_number,
period_id AS pv_period_id,
LAG(date_start) OVER ( PARTITION BY person_number ORDER BY date_start) AS previous_date
FROM period_serv
You can restrict the set of rows in the outer query:
select person_number, pv_period_id, PREVIOUS_DATE
from (
select person_number,
PERIOD_ID pv_period_id,
lag(date_start) OVER ( partition BY person_number order by DATE_sTART ) PREVIOUS_DATE ,
row_number() OVER ( partition BY person_number order by DATE_sTART desc) rn
from period_serv
) t
where rn = 1
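The reason for computing both window functions in the inner query is that lag() must see all rows before the filter drops them. A minimal sketch of the same pattern on SQLite via Python's sqlite3 follows; the dates are rewritten as ISO text (an assumption made here, since SQLite has no date type) so that they sort correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table period_serv (person_number int, date_start text, period_id int)"
)
conn.executemany(
    "insert into period_serv values (?, ?, ?)",
    [
        (10, "2020-01-06", 192726),
        (10, "2019-04-04", 12827),
        (11, "2021-02-01", 282726),
        (11, "2018-04-09", 827266),
    ],
)

# lag() looks back over every row of the person; rn = 1 then keeps only the latest row.
query = """
select person_number, date_start, period_id, previous_date
from (
    select p.*,
           lag(date_start) over (partition by person_number order by date_start)
               as previous_date,
           row_number() over (partition by person_number order by date_start desc)
               as rn
    from period_serv p
) t
where rn = 1
order by person_number
"""
latest = conn.execute(query).fetchall()
print(latest)
# [(10, '2020-01-06', 192726, '2019-04-04'), (11, '2021-02-01', 282726, '2018-04-09')]
```

Filtering in the same query level as lag() would not work, because window functions are evaluated after WHERE.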
One option is to use the MAX(..) KEEP (DENSE_RANK ..) OVER (PARTITION BY ..) analytic function, such as:
WITH p AS
(
SELECT MAX(date_start) KEEP (DENSE_RANK FIRST ORDER BY date_start)
OVER (PARTITION BY person_number) AS previous_date,
p.*
FROM period_serv p
)
SELECT p.person_number, p.date_start, p.period_id, p.previous_date
FROM p
JOIN period_serv ps
ON ps.person_number = p.person_number
AND ps.period_id = p.period_id
WHERE ps.date_start != previous_date
I have the following query on PostgreSQL. I don't know where it comes from, but each time I run a join query I get an error: "There was a problem fetching the schema of the table. Try again."
Context:
I'm trying to build a table of sessions using subqueries on a table named sheets_app.tracks; I then reuse this subquery to map users to sessions.
This join is very simple but still doesn't work:
WITH sessions as (
SELECT
user_id || '-' || row_number() over(partition by user_id order by event_timestamp) as session_id
, user_id
, event_timestamp as session_start_at
, lead(event_timestamp) over(partition by user_id order by event_timestamp) as next_session_start_at
FROM (
SELECT
user_id
, event_timestamp
, last_event_timestamp
, (event_timestamp-last_event_timestamp) as inactivity_time
FROM (
SELECT
user_id
, event_timestamp
, LAG(event_timestamp) over (PARTITION BY user_id order by timestamp) as last_event_timestamp
FROM (
SELECT
user_id
, timestamp
, EXTRACT("EPOCH" FROM timestamp) as event_timestamp
FROM
sheets_app.tracks as e
) one
) two
) event
WHERE (event.inactivity_time > 30 OR event.inactivity_time is null))
SELECT
*
FROM sessions
LEFT JOIN sheets_app.tracks on sessions.user_id = sheets_app.tracks.user_id
This one is the same, with more conditions on the LEFT JOIN:
WITH sessions as (
SELECT
user_id || '-' || row_number() over(partition by user_id order by event_timestamp) as session_id
, user_id
, event_timestamp as session_start_at
, lead(event_timestamp) over(partition by user_id order by event_timestamp) as next_session_start_at
FROM (
SELECT
user_id
, event_timestamp
, last_event_timestamp
, (event_timestamp-last_event_timestamp) as inactivity_time
FROM (
SELECT
user_id
, event_timestamp
, LAG(event_timestamp) over (PARTITION BY user_id order by timestamp) as last_event_timestamp
FROM (
SELECT
user_id
, timestamp
, EXTRACT("EPOCH" FROM timestamp) as event_timestamp
FROM
sheets_app.tracks as e
) one
) two
) event
WHERE (event.inactivity_time > 30 OR event.inactivity_time is null))
SELECT
COUNT(*) AS sessions_count,
AVG(duration) AS average_session_duration
FROM (
SELECT
session_id
, (EXTRACT("EPOCH" FROM MIN(events.timestamp))-EXTRACT("EPOCH" FROM MAX(events.timestamp))) AS duration
FROM sessions
LEFT JOIN sheets_app.tracks as event on sessions.user_id = event.user_id
AND EXTRACT("EPOCH" FROM events.timestamp) >= sessions.session_start_at
AND EXTRACT("EPOCH" FROM events.timestamp) < sessions.next_session_start_at OR sessions.next_session_start_at is null)
GROUP BY 1
) as events
Do you see where it could come from?
Thank you!
I have a checking account table that contains columns Cust_id (customer id), Open_Date (start date), and Closed_Date (end date). There is one row for each account. A customer can open multiple accounts at any given point. I would like to know how long the person has been a customer.
eg 1:
CREATE TABLE [Cust]
(
[Cust_id] [varchar](10) NULL,
[Open_Date] [date] NULL,
[Closed_Date] [date] NULL
)
insert into [Cust] values ('a123', '10/01/2019', '10/15/2019')
insert into [Cust] values ('a123', '10/12/2019', '11/01/2019')
Ideally I would like to insert this into a table with just one row that says this person has been a customer from 10/01/2019 to 11/01/2019 (as he opened his second account before he closed his previous one).
Similarly eg 2:
insert into [Cust] values ('b245', '07/01/2019', '09/15/2019')
insert into [Cust] values ('b245', '10/12/2019', '12/01/2019')
I would like to see 2 rows in this case: one that shows he was a customer from 07/01 to 09/15, and another from 10/12 to 12/01.
Can you point me to the best way to get this?
I would approach this as a gaps and islands problem. You want to group together adjacent rows whose periods overlap.
Here is one way to solve it using lag() and a cumulative sum(). Every time the open date is greater than the closed date of the previous record, a new group starts.
select
cust_id,
min(open_date) open_date,
max(closed_date) closed_date
from (
select
t.*,
sum(case when not open_date <= lag_closed_date then 1 else 0 end)
over(partition by cust_id order by open_date) grp
from (
select
t.*,
lag(closed_date) over (partition by cust_id order by open_date) lag_closed_date
from cust t
) t
) t
group by cust_id, grp
In this db fiddle with your sample data, the query produces:
cust_id | open_date | closed_date
:------ | :--------- | :----------
a123 | 2019-10-01 | 2019-11-01
b245 | 2019-07-01 | 2019-09-15
b245 | 2019-10-12 | 2019-12-01
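Comparing against only the previous row's closed date works when periods overlap pairwise in order. If one long account can enclose several shorter ones, a running maximum of the earlier closed dates is a common variant. The sketch below, on SQLite via Python's sqlite3, runs both versions side by side; the customer c999 is hypothetical (not in the question), and the dates are rewritten as ISO text so they compare correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table cust (cust_id text, open_date text, closed_date text)")
conn.executemany(
    "insert into cust values (?, ?, ?)",
    [
        ("a123", "2019-10-01", "2019-10-15"),
        ("a123", "2019-10-12", "2019-11-01"),
        ("b245", "2019-07-01", "2019-09-15"),
        ("b245", "2019-10-12", "2019-12-01"),
        # Hypothetical customer: one long account encloses two short ones.
        ("c999", "2019-01-01", "2019-12-31"),
        ("c999", "2019-02-01", "2019-03-01"),
        ("c999", "2019-06-01", "2019-07-01"),
    ],
)

# A new group starts whenever open_date is later than prev_closed.
TEMPLATE = """
select cust_id, min(open_date), max(closed_date)
from (
    select t.*,
           sum(case when open_date > prev_closed then 1 else 0 end)
               over (partition by cust_id order by open_date) as grp
    from (select c.*, {prev_closed} as prev_closed from cust c) t
) t2
group by cust_id, grp
order by 1, 2
"""
# Previous row only, as in the query above.
LAG = "lag(closed_date) over (partition by cust_id order by open_date)"
# Latest closed date among all earlier rows of the same customer.
RUNNING_MAX = (
    "max(closed_date) over (partition by cust_id order by open_date"
    " rows between unbounded preceding and 1 preceding)"
)

with_lag = conn.execute(TEMPLATE.format(prev_closed=LAG)).fetchall()
with_running_max = conn.execute(TEMPLATE.format(prev_closed=RUNNING_MAX)).fetchall()
print(len(with_lag), len(with_running_max))  # 5 4
```

Both versions agree on a123 and b245. For c999 the lag() version splits off the third account, because it only sees the second account's closed date, while the running maximum keeps all three in one period.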
I would solve this with recursion. While this is certainly very heavy, it should accommodate even the most complex account timings (assuming your data has such). However, if the sample data provided is as complex as you need to solve for, I highly recommend sticking with the solution provided above. It is much more concise and clear.
WITH x (cust_id, open_date, closed_date, lvl, grp) AS (
SELECT cust_id, open_date, closed_date, 1, 1
FROM (
SELECT cust_id
, open_date
, closed_date
, row_number()
OVER (PARTITION BY cust_id ORDER BY closed_date DESC, open_date) AS rn
FROM cust
) AS t
WHERE rn = 1
UNION ALL
SELECT cust_id, open_date, closed_date, lvl, grp
FROM (
SELECT c.cust_id
, c.open_date
, c.closed_date
, x.lvl + 1 AS lvl
, x.grp + CASE WHEN c.closed_date < x.open_date THEN 1 ELSE 0 END AS grp
, row_number() OVER (PARTITION BY c.cust_id ORDER BY c.closed_date DESC) AS rn
FROM cust c
JOIN x
ON x.cust_id = c.cust_id
AND c.open_date < x.open_date
) AS t
WHERE t.rn = 1
)
SELECT cust_id, min(open_date) AS first_open_date, max(closed_date) AS last_closed_date
FROM x
GROUP BY cust_id, grp
ORDER BY cust_id, grp
I would also add the caveat that I don't run on SQL Server, so there could be syntax differences that I didn't account for. Hopefully they are minor, if present.
You can try something like this:
select distinct
cust_id,
(select min(Open_Date)
from Cust as b
where b.cust_id = a.cust_id and
a.Open_Date <= b.Closed_Date and
a.Closed_Date >= b.Open_Date
),
(select max(Closed_Date)
from Cust as b
where b.cust_id = a.cust_id and
a.Open_Date <= b.Closed_Date and
a.Closed_Date >= b.Open_Date
)
from Cust as a
So, for every row, you select the minimum and maximum dates from all ranges that overlap it; distinct then filters out the duplicates.
I am trying to make a simple hive transformation.
Can someone provide me a way to do this? I have tried collect_set and am currently looking at Klout's open source UDF.
I think this gives you what you want. I wasn't able to run it and debug it though. Good luck!
select start_points.unit
, start_time as start
, start_time + min(stop_time - start_time) as stop
from
(select * from
(select date_time as start_time
, unit
-- unit of the previous row in time (the following row when ordered descending)
, last_value(unit) over (order by date_time desc rows between current row and 1 following) as previous_unit
from table
) previous
where unit <> previous_unit
) start_points
left outer join
(select * from
(select date_time as stop_time
, unit
-- unit of the next row in time
, last_value(unit) over (order by date_time rows between current row and 1 following) as next_unit
from table
) next
where unit <> next_unit
) stop_points
on start_points.unit = stop_points.unit
where stop_time > start_time
group by start_points.unit, start_time
;
What about using the min and max functions? I think the following will get you what you need:
SELECT
Unit,
MIN(datetime) as start,
MAX(datetime) as stop
from table_name
group by Unit
;
I found it. Thanks for the pointer to use window functions:
select *
from
(select *,
case when lag(unit,1) over (partition by id order by effective_time_ut desc) is NULL THEN 1
when unit<>lag(unit,1) over (partition by id order by effective_time_ut desc) then 1
when lead(unit,1) over (partition by id order by effective_time_ut desc) is NULL then 1
else 0 end as different_loc
from units_we_care) a
where different_loc=1
create table temptable as
select unit, start_date, end_time, row_number() over () as row_num
from (
select unit, min(date_time) as start_date, max(date_time) as end_time
from table
group by unit
) a;

select a.unit, a.start_date as start_date, nvl(b.start_date, a.end_time) as end_time
from temptable a
left outer join temptable b on (a.row_num + 1) = b.row_num;