I want to profile my data set to find data discrepancies.
My sample data set:
id  status    stdate     enddate
1   new       01-JUL-17  31-JUL-17
1   process   01-OCT-17  31-DEC-18
1   new       01-JAN-19  31-JAN-19   --- issue
2   new       01-SEP-14  31-JAN-15
2   process   01-JUN-16  30-NOV-17
2   complete  01-DEC-17  31-DEC-18
....
....
I would like to find out how many of those IDs have a row whose status has moved backwards in the sequence. The expected order of statuses is NEW-PROCESS-COMPLETE, so I want to report all IDs where the most recent status has reverted to an earlier one.
You can use the LAG() function to find the offending rows, as in:
with x (id, status, stdate, enddate,
prev_id, prev_status, prev_stdate, prev_enddate) as (
select
id,
status,
stdate,
enddate,
lag(id) over(partition by id order by stdate),
lag(status) over(partition by id order by stdate),
lag(stdate) over(partition by id order by stdate),
lag(enddate) over(partition by id order by stdate)
from my_table
)
select * from x
where (status = 'new' and prev_status in ('process', 'complete'))
   or (status = 'process' and prev_status = 'complete')
Note: I assume you need to compare only between rows of the same ID.
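Since the question also asks how many of those IDs are affected, a small follow-up sketch (re-using the same CTE and filter as above; the alias ids_with_reversal is just illustrative) could count the distinct offending IDs:
with x (id, status, stdate, enddate,
        prev_id, prev_status, prev_stdate, prev_enddate) as (
  select
    id, status, stdate, enddate,
    lag(id)      over(partition by id order by stdate),
    lag(status)  over(partition by id order by stdate),
    lag(stdate)  over(partition by id order by stdate),
    lag(enddate) over(partition by id order by stdate)
  from my_table
)
-- count how many distinct IDs contain at least one status reversal
select count(distinct id) as ids_with_reversal
from x
where (status = 'new' and prev_status in ('process', 'complete'))
   or (status = 'process' and prev_status = 'complete')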
I have a requirement to write a query that finds records whose POS_ORDER_ID reappears in the table within 30 days as a new record with status 'Canceled' or 'Discontinued', and then to mark the earlier record for that POS_ORDER_ID as not eligible.
Table columns:
POS_ORDER_ID,
Status,
Order_date,
Error_description
A query containing the MAX() and ROW_NUMBER() analytic functions might help, such as:
with t as
(
select t.*,
row_number() over (partition by pos_order_id order by Order_date desc ) as rn,
max(Order_date) over (partition by pos_order_id) as mx
from tab t -- your original table
)
select pos_order_id, Status, Order_date, Error_description,
       case when rn > 1
             and t.status in ('Canceled','Discontinued')
             and mx - t.Order_date <= 30
            then 'Not eligible'
       end as "Extra Status"
from t
Please use the queries below.
Select and validate:
select POS_ORDER_ID, Status, Order_date, Error_description, row_number()
over(partition by POS_ORDER_ID order by Order_date desc)
from table_name;
Update query:
merge into table_name t1
using
(select row_id, POS_ORDER_ID, Status, Order_date, Error_description,
row_number() over(partition by POS_ORDER_ID order by Order_date desc) as rnk
from table_name) t2
on (t1.POS_ORDER_ID = t2.POS_ORDER_ID and t1.row_id = t2.row_id)
when matched then
update set
    -- assumes the eligibility flag is written back to the Status column
    t1.Status = case when t2.rnk = 1 then 'Canceled' else 'Not Eligible' end;
I have a checking account table that contains columns Cust_id (customer id), Open_Date (start date), and Closed_Date (end date). There is one row for each account. A customer can open multiple accounts at any given point. I would like to know how long the person has been a customer.
eg 1:
CREATE TABLE [Cust]
(
[Cust_id] [varchar](10) NULL,
[Open_Date] [date] NULL,
[Closed_Date] [date] NULL
)
insert into [Cust] values ('a123', '10/01/2019', '10/15/2019')
insert into [Cust] values ('a123', '10/12/2019', '11/01/2019')
Ideally I would like to insert this into a table as just one row that says this person has been a customer from 10/01/2019 to 11/01/2019 (as he opened his second account before he closed his previous one).
Similarly eg 2:
insert into [Cust] values ('b245', '07/01/2019', '09/15/2019')
insert into [Cust] values ('b245', '10/12/2019', '12/01/2019')
I would like to see 2 rows in this case: one that shows he was a customer from 07/01 to 09/15, and another from 10/12 to 12/01.
Can you point me to the best way to get this?
I would approach this as a gaps-and-islands problem. You want to group together adjacent rows whose periods overlap.
Here is one way to solve it using lag() and a cumulative sum(). Every time the open date is greater than the closed date of the previous record, a new group starts.
select
cust_id,
min(open_date) open_date,
max(closed_date) closed_date
from (
select
t.*,
sum(case when not open_date <= lag_closed_date then 1 else 0 end)
over(partition by cust_id order by open_date) grp
from (
select
t.*,
lag(closed_date) over (partition by cust_id order by open_date) lag_closed_date
from cust t
) t
) t
group by cust_id, grp
Tested in a db fiddle with your sample data, the query produces:
cust_id | open_date | closed_date
:------ | :--------- | :----------
a123 | 2019-10-01 | 2019-11-01
b245 | 2019-07-01 | 2019-09-15
b245 | 2019-10-12 | 2019-12-01
I would solve this with recursion. While this is certainly very heavy, it should accommodate even the most complex account timings (assuming your data has such). However, if the sample data provided is as complex as you need to solve for, I highly recommend sticking with the solution provided above. It is much more concise and clear.
WITH x (cust_id, open_date, closed_date, lvl, grp) AS (
SELECT cust_id, open_date, closed_date, 1, 1
FROM (
SELECT cust_id
, open_date
, closed_date
, row_number()
OVER (PARTITION BY cust_id ORDER BY closed_date DESC, open_date) AS rn
FROM cust
) AS t
WHERE rn = 1
UNION ALL
SELECT cust_id, open_date, closed_date, lvl, grp
FROM (
SELECT c.cust_id
, c.open_date
, c.closed_date
, x.lvl + 1 AS lvl
, x.grp + CASE WHEN c.closed_date < x.open_date THEN 1 ELSE 0 END AS grp
, row_number() OVER (PARTITION BY c.cust_id ORDER BY c.closed_date DESC) AS rn
FROM cust c
JOIN x
ON x.cust_id = c.cust_id
AND c.open_date < x.open_date
) AS t
WHERE t.rn = 1
)
SELECT cust_id, min(open_date) AS first_open_date, max(closed_date) AS last_closed_date
FROM x
GROUP BY cust_id, grp
ORDER BY cust_id, grp
I would also add the caveat that I don't run on SQL Server, so there could be syntax differences that I didn't account for. Hopefully they are minor, if present.
You can try something like this:
select distinct
cust_id,
(select min(Open_Date)
from Cust as b
where b.cust_id = a.cust_id and
a.Open_Date <= b.Closed_Date and
a.Closed_Date >= b.Open_Date
) as Open_Date,
(select max(Closed_Date)
from Cust as b
where b.cust_id = a.cust_id and
a.Open_Date <= b.Closed_Date and
a.Closed_Date >= b.Open_Date
) as Closed_Date
from Cust as a
So, for every row, you're selecting the minimal and maximal dates from all overlapping ranges; the DISTINCT then filters out duplicates.
My tracking system does not generate session IDs.
I have user_id & event_date_time.
I need a new session_id for each user's session that starts 30 minutes or more after the user's last event_date_time.
My final goal is to calculate the median session time.
I tried to generate session_id=1 and session_id=2 once event_date_time - next_event_time > 30 and guid = guid, but I'm stuck from here:
select a.*,
case when (a.next_event_date-a.event_date)*24*60<30 and userID=next_userID
then 1
when (a.next_event_date-a.event_date)*24*60>=30 and userID=next_userID then
2
end session_id
from
(select f.userID,
lead(f.userID) over (partition by f.guid order by f.event_date)
next_guid,
f.event_date,
lead(f.event_date) over (partition by f.guid order by f.event_date)
next_event_date
from event_table f
)a
where next_event_date is not null
If I understood correctly, you could generate IDs this way:
select id, guid, event_date,
sum(chg) over (partition by guid order by event_date) session_id
from (
select id, guid, event_date,
case when lag(guid) over (partition by guid order by event_date) = guid
and 24 * 60 * (event_date -lag(event_date)
over (partition by guid order by event_date) ) < 30
then 0 else 1
end chg
from event_table ) a
Compare neighbouring rows: if the guids differ or the time difference is greater than 30 minutes, assign 1. Then sum these values analytically.
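Since the final goal was the median session time, a possible follow-up sketch is shown below. It assumes an Oracle database where event_date is a DATE (so date subtraction yields days) and the MEDIAN aggregate is available; the sessions CTE is just a wrapper around the session_id query above.
with sessions as (
  -- the session_id assignment from the query above
  select id, guid, event_date,
         sum(chg) over (partition by guid order by event_date) session_id
  from (
    select id, guid, event_date,
           case when lag(guid) over (partition by guid order by event_date) = guid
                 and 24 * 60 * (event_date - lag(event_date)
                       over (partition by guid order by event_date)) < 30
                then 0 else 1
           end chg
    from event_table) a
)
select median(session_minutes) as median_session_minutes
from (
  -- one row per (guid, session_id) with its duration in minutes
  select guid, session_id,
         (max(event_date) - min(event_date)) * 24 * 60 as session_minutes
  from sessions
  group by guid, session_id
)
Note that single-event sessions get a duration of 0 minutes, which pulls the median down; whether to exclude them depends on how you define a session.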
I think you're on the right track using lead or lag. My recommendation would be to break this into steps and create a temp table to work against:
With the first query, assign every record its own unique ID, either a sequence number or GUID. You could also capture some of the lagged data in this step.
With a second query, find the overlaps (< 30 minutes) and make the overlapping records all the same -- either the same as the earliest or latest in that grouping, doesn't matter as long as it's consistent.
Something like this:
create table events_temp as (
select f.*,
row_number() over (partition by f.userID order by f.event_date) as user_row,
lag(f.userID) over (partition by f.userID order by f.event_date) as prev_userID,
lag(f.event_date) over (partition by f.userID order by f.event_date) as prev_event_date
from event_table f
order by f.userId, f.event_date
)
select a.*,
case when prev_userID = userID
and 24 * 60 * (event_date - prev_event_date) < 30
then lag(user_row) over (partition by userID order by user_row)
else user_row
end as session_id
from events_temp a
Hey, the schema is like this: for the whole dataset, we should order by machine_id first, then by ss2k. After that, for each machine, we should find all the rows with at least 5 consecutive flag = 'census' rows. In this dataset, the result should be all the yellow rows.
I cannot return the last 4 rows of the yellow blocks by using this:
drop table if exists qz_panel_census_228_rank;
create table qz_panel_census_228_rank as
select t.*
from (select t.*,
count(*) filter (where flag = 'census') over (partition by machine_id, date order by ss2k rows between current row and 4 following) as census_cnt5,
count(*) filter (where flag = 'census') over (partition by machine_id, date) as count_census,
row_number() over (partition by machine_id, date order by ss2k) as seqnum,
count(*) over (partition by machine_id, date) as cnt
from qz_panel_census_228 t
) t
where census_cnt5 = 5
group by 1,2,3,4,5,6,7,8,9,10,11
DISTRIBUTED BY (machine_id);
You were close, but you need to search in both directions:
select t.*
from (select t.*,
case when count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between 4 preceding and current row) = 5
or count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between current row and 4 following) = 5
then 1
else 0
end as flag
from qz_panel_census_228 t
) t
where flag = 1
Edit:
This approach will not work unless you add an extra count for each possible 5 row window, e.g. 3 preceding and 1 following, 2 preceding and 2 following, etc. This results in ugly code and is not very flexible.
The common way to solve this gaps & islands problem is to assign consecutive rows to a common group first:
select *
from
(
select t2.*,
count(*) over (partition by machine_id, date, grp) as cnt
from
(
select t1.*
from (select t.*,
-- keep the same number for 'census' rows
sum(case when flag = 'census' then 0 else 1 end)
over (partition by machine_id, date
order by ss2k
rows unbounded preceding) as grp
from qz_panel_census_228 t
) t1
where flag = 'census' -- only census rows
) as t2
) t3
where cnt >= 5 -- only groups of at least 5 census rows
Wow, there has to be a better way of doing this, but the only way I could figure out was to create blocks of consecutive 'census' values. This looks awful but might be a catalyst to a better idea.
with q1 as (
select
machine_id, recorded, ss2k, flag, date,
case
when flag = 'census' and
lag (flag) over (order by machine_id, ss2k) != 'census'
then 1
else 0
end as block
from foo
),
q2 as (
select
machine_id, recorded, ss2k, flag, date,
sum (block) over (order by machine_id, ss2k) as group_id,
case when flag = 'census' then 1 else 0 end as census
from q1
),
q3 as (
select
machine_id, recorded, ss2k, flag, date, group_id,
sum (census) over (partition by group_id order by ss2k) as max_count
from q2
),
groups as (
select group_id
from q3
group by group_id
having max (max_count) >= 5
)
select
q2.machine_id, q2.recorded, q2.ss2k, q2.flag, q2.date
from
q2
join groups g on q2.group_id = g.group_id
where
q2.flag = 'census'
If you run each query within the with clauses in isolation, I think you will see how this evolves.
I am fairly new to SQL (SQL Server Management Studio 2016) and I only joined the site this morning... so my first post! I have been looking for a solution on the site regarding my issue; I have found a few links, but none that (I think) will work, having tried a few. I have a table that holds boiler service data, and one address can have multiple dates/sequence numbers. I am looking to create a script that proves the latest sequence number's start date is less than or equal to the previous sequence number's end date. So, in my example, I'd want to select the start_date of the MAX seq_no and the end_date of the 2nd MAX seq_no to make sure they haven't breached the timescale.
My sample data has been added as an image (hopefully!). It shows just two addresses, but there are 1000s in reality:
I have tried SQL to get the max seq_no for just the end date initially, but it just keeps bringing back all the entries:
select max (seq_no) as SEQNO, end_date, cmpnt_ref, prty_id
FROM hgmpcych
where prty_id in ('ABBEY10_TD12','ABBEY12_TD12') and cmpnt_ref='Boiler' and cycle_no='5'
group by end_date,prty_id,cmpnt_ref,seq_no
order by prty_id
This will probably be quite basic, but I am still pretty new to SQL. Any hints, advice or tips would be very much appreciated.
You could use ROW_NUMBER() to mark the rows in each group and only select the rows marked with 1 or 2 (the two "latest" rows)...
WITH
enumerated_hgmpcych AS
(
SELECT
seq_no, start_date, end_date, cmpnt_ref, prty_id,
ROW_NUMBER() OVER (PARTITION BY prty_id, cmpnt_ref
ORDER BY seq_no DESC
)
desc_seq_enumerator
FROM
hgmpcych
WHERE
prty_id in ('ABBEY10_TD12','ABBEY12_TD12')
AND cmpnt_ref='Boiler'
AND cycle_no='5'
)
SELECT
*
FROM
enumerated_hgmpcych
WHERE
desc_seq_enumerator <= 2
ORDER BY
prty_id,
cmpnt_ref,
seq_no
If you wanted to, you could collapse that to one row per group...
WITH
enumerated_hgmpcych AS
(
SELECT
seq_no, start_date, end_date, cmpnt_ref, prty_id,
ROW_NUMBER() OVER (PARTITION BY prty_id, cmpnt_ref
ORDER BY seq_no DESC
)
desc_seq_enumerator
FROM
hgmpcych
WHERE
prty_id in ('ABBEY10_TD12','ABBEY12_TD12')
AND cmpnt_ref='Boiler'
AND cycle_no='5'
)
SELECT
prty_id,
cmpnt_ref,
MAX(CASE WHEN desc_seq_enumerator = 1 THEN seq_no END) AS final_seq_no,
MAX(CASE WHEN desc_seq_enumerator = 1 THEN start_date END) AS final_start_date,
MAX(CASE WHEN desc_seq_enumerator = 1 THEN end_date END) AS final_end_date,
MAX(CASE WHEN desc_seq_enumerator = 2 THEN seq_no END) AS prev_seq_no,
MAX(CASE WHEN desc_seq_enumerator = 2 THEN start_date END) AS prev_start_date,
MAX(CASE WHEN desc_seq_enumerator = 2 THEN end_date END) AS prev_end_date
FROM
enumerated_hgmpcych
WHERE
desc_seq_enumerator <= 2
GROUP BY
prty_id,
cmpnt_ref
ORDER BY
prty_id,
cmpnt_ref
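If you also want the breach check itself, here is a hedged sketch building on the same idea. The collapsed CTE, the timescale_check column, and the 'OK'/'Breached' labels are just illustrative names, and it assumes a breach means the latest start date falls after the previous end date:
WITH
enumerated_hgmpcych AS
(
    SELECT
        seq_no, start_date, end_date, cmpnt_ref, prty_id,
        ROW_NUMBER() OVER (PARTITION BY prty_id, cmpnt_ref
                           ORDER BY seq_no DESC
        ) desc_seq_enumerator
    FROM
        hgmpcych
    WHERE
        prty_id in ('ABBEY10_TD12','ABBEY12_TD12')
        AND cmpnt_ref='Boiler'
        AND cycle_no='5'
),
collapsed AS
(
    SELECT
        prty_id,
        cmpnt_ref,
        MAX(CASE WHEN desc_seq_enumerator = 1 THEN start_date END) AS final_start_date,
        MAX(CASE WHEN desc_seq_enumerator = 2 THEN end_date END) AS prev_end_date
    FROM
        enumerated_hgmpcych
    WHERE
        desc_seq_enumerator <= 2
    GROUP BY
        prty_id,
        cmpnt_ref
)
SELECT
    prty_id,
    cmpnt_ref,
    final_start_date,
    prev_end_date,
    -- prev_end_date is NULL when an address has only one record; handle as you see fit
    CASE WHEN final_start_date <= prev_end_date THEN 'OK' ELSE 'Breached' END AS timescale_check
FROM
    collapsed
ORDER BY
    prty_id,
    cmpnt_ref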
If you have max(seq_no), then you don't want it in the group by:
select max (seq_no) as SEQNO, end_date, cmpnt_ref, prty_id
from hgmpcych
where prty_id in ('ABBEY10_TD12', 'ABBEY12_TD12') and
cmpnt_ref = 'Boiler' and cycle_no = '5'
group by end_date, prty_id, cmpnt_ref
order by prty_id;