Oracle SQL function or buckets for data filtering - sql

SELECT
transaction
,date
,mail
,status
,ROW_NUMBER() OVER (PARTITION BY mail ORDER BY date) AS rownum
FROM table1
Given the table and query above, I want to filter the transactions as follows: if the rows numbered 1 to 3 all have status 'failed', show row 4 if it also failed; if rows 4, 5 and 6 failed, show row 7 if it also failed, and so on. I considered loading the data into a pandas DataFrame and applying a simple lambda function, but I would much prefer a SQL-only solution.

You could use lead() and lag() to explicitly check:
select t.*
from (select t1.*,
lag(status, 3) over (partition by mail order by date) as status_3,
lag(status, 2) over (partition by mail order by date) as status_2,
lag(status, 1) over (partition by mail order by date) as status_1,
lead(status, 1) over (partition by mail order by date) as status_1n,
lead(status, 2) over (partition by mail order by date) as status_2n,
lead(status, 3) over (partition by mail order by date) as status_3n
from t
) t
where status = 'FAILED' and
( (status_3 = 'FAILED' and status_2 = 'FAILED' and status_1 = 'FAILED') or
(status_2 = 'FAILED' and status_1 = 'FAILED' and status_1n = 'FAILED') or
(status_1 = 'FAILED' and status_1n = 'FAILED' and status_2n = 'FAILED') or
(status_1n = 'FAILED' and status_2n = 'FAILED' and status_3n = 'FAILED')
)
This is a bit brute force, but I think the logic is quite clear.
You could simplify the logic to:
where regexp_like(status_3 || status_2 || status_1 || status || status_1n || status_2n || status_3n,
'(FAILED){4}'
)
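A small aside on the pattern, since it is easy to get wrong: a quantifier such as {4} binds only to the atom right before it, so the word has to be wrapped in a group for the repeat to cover all of it. The sketch below uses Python's re module purely as an illustration (Oracle's regexp_like follows the same quantifier rule); the concatenated status string is made up.

```python
import re

# Hypothetical concatenation of four consecutive statuses.
window = "FAILED" * 4

# Without a group, {4} repeats only the final "D": the pattern means "FAILEDDDD".
ungrouped = re.search(r"FAILED{4}", window)

# With a group, {4} repeats the whole word.
grouped = re.search(r"(FAILED){4}", window)

print(ungrouped is None, grouped is not None)
```

So the ungrouped form would silently match nothing on real status values, while the grouped form matches four consecutive failures.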

Try this:
select * from (
SELECT
transaction
,date
,mail
,status
,ROW_NUMBER() OVER (PARTITION BY mail ORDER BY date) AS rownum
FROM table1
WHERE status = 'FAILED' )
where mod(rownum, 3) = 1;
Richard

One option is to use window functions. Use lag to get the previous status value (based on specified ordering) and compare it with the current row's value and assign groups with a running sum. Then count the values in each group and finally filter for that condition.
SELECT t.*
FROM
( SELECT t.*,
count(*) over(PARTITION BY mail, grp) AS grp_count
FROM
( SELECT t.*,
sum(CASE
WHEN (prev_status IS NULL AND status='FAILED') OR
(prev_status='FAILED' AND status='FAILED') THEN 0
ELSE 1
END) over(PARTITION BY mail ORDER BY "date","transaction") AS grp
FROM
( SELECT t.*,
lag(status) over(PARTITION BY mail ORDER BY "date","transaction") AS prev_status
FROM tbl t
) t
) t
) t
WHERE grp_count>=4
If you are using versions starting with Oracle 12c, there is an option to use MATCH_RECOGNIZE which would simplify this.
select *
from tbl
MATCH_RECOGNIZE (
PARTITION BY mail
ORDER BY "date" ,"transaction"
ALL ROWS PER MATCH
AFTER MATCH SKIP TO LAST FAIL
PATTERN(fail{4,})
DEFINE
fail AS (status='FAILED')
) MR
ORDER BY "date","transaction"
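To try the lag plus running-sum grouping without an Oracle instance, here is a minimal sketch that runs the same idea on SQLite through Python's sqlite3 module. It assumes a SQLite build with window functions (3.25 or later). The table contents are invented, and the columns are renamed to transaction_id and txn_date only to avoid quoting reserved words.

```python
import sqlite3

# Hypothetical sample data: (transaction_id, txn_date, mail, status).
rows = [
    (1, "2020-01-01", "a@x.com", "FAILED"),
    (2, "2020-01-02", "a@x.com", "FAILED"),
    (3, "2020-01-03", "a@x.com", "FAILED"),
    (4, "2020-01-04", "a@x.com", "FAILED"),
    (5, "2020-01-05", "a@x.com", "OK"),
    (6, "2020-01-06", "a@x.com", "FAILED"),
    (7, "2020-01-01", "b@x.com", "FAILED"),
    (8, "2020-01-02", "b@x.com", "OK"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (transaction_id INTEGER, txn_date TEXT, mail TEXT, status TEXT)"
)
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?)", rows)

query = """
SELECT transaction_id
FROM (
    SELECT g.*, COUNT(*) OVER (PARTITION BY mail, grp) AS grp_count
    FROM (
        -- the running sum stays flat while a FAILED streak continues
        SELECT p.*,
               SUM(CASE WHEN status = 'FAILED'
                         AND (prev_status IS NULL OR prev_status = 'FAILED')
                        THEN 0 ELSE 1 END)
                   OVER (PARTITION BY mail ORDER BY txn_date, transaction_id) AS grp
        FROM (
            SELECT t.*,
                   LAG(status) OVER (PARTITION BY mail
                                     ORDER BY txn_date, transaction_id) AS prev_status
            FROM tbl t
        ) p
    ) g
) c
WHERE grp_count >= 4
ORDER BY transaction_id
"""
failed_runs = [r[0] for r in conn.execute(query)]
print(failed_runs)
```

On this sample only transactions 1 to 4 come back: they form the single streak of four failures, while the isolated failures (6 and 7) each end up in a group of size one.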

Related

Convert CTE Query into normal Query

I want to convert my PostgreSQL CTE query into a normal query, because CTEs are mainly used in data warehouse SQL and are not efficient on Postgres production databases.
So I need help converting this CTE query into a normal query:
WITH
cohort AS (
SELECT
*
FROM (
select
activity_id,
ts,
customer,
activity,
case
when activity = 'completed_order' and lag(activity) over (partition by customer order by ts) != 'email'
then null
when activity = 'email' and lag(activity) over (partition by customer order by ts) !='email'
then 1
else 0
end as cndn
from activity_stream where customer in (select customer from activity_stream where activity='email')
order by ts
) AS s
)
(
select
*
from cohort as s
where cndn = 1 OR cndn is null order by ts)
You may just inline the CTE into your outer query:
select *
from
(
select activity_id, ts, customer, activity,
case when activity = 'completed_order' and lag(activity) over (partition by customer order by ts) != 'email'
then null
when activity = 'email' and lag(activity) over (partition by customer order by ts) !='email'
then 1
else 0
end as cndn
from activity_stream
where customer in (select customer from activity_stream where activity = 'email')
) as s
where cndn = 1 OR cndn is null
order by ts;
Note that you have an unnecessary subquery in the CTE, which does an ORDER BY which won't "stick" anyway. But other than this, you might want to keep your current code as is.
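If you want to convince yourself that inlining changes nothing, a quick check is to run both forms on the same data and compare. The sketch below does that on SQLite via Python's sqlite3 module (window functions need SQLite 3.25 or later). It is not PostgreSQL, and the six sample rows are invented, so treat it as an illustration of the equivalence rather than a benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity_stream "
    "(activity_id INTEGER, ts INTEGER, customer TEXT, activity TEXT)"
)
conn.executemany(
    "INSERT INTO activity_stream VALUES (?, ?, ?, ?)",
    [
        (1, 1, "c1", "visit"),
        (2, 2, "c1", "email"),
        (3, 3, "c1", "email"),
        (4, 4, "c1", "completed_order"),
        (5, 5, "c1", "completed_order"),
        (6, 1, "c2", "visit"),  # customer without any email, filtered out
    ],
)

# Shared SELECT body, used once as a CTE and once as a derived table.
body = """
    SELECT activity_id, ts,
           CASE
               WHEN activity = 'completed_order'
                    AND LAG(activity) OVER (PARTITION BY customer ORDER BY ts) != 'email'
                   THEN NULL
               WHEN activity = 'email'
                    AND LAG(activity) OVER (PARTITION BY customer ORDER BY ts) != 'email'
                   THEN 1
               ELSE 0
           END AS cndn
    FROM activity_stream
    WHERE customer IN (SELECT customer FROM activity_stream WHERE activity = 'email')
"""

cte_sql = (
    f"WITH cohort AS ({body}) "
    "SELECT activity_id FROM cohort WHERE cndn = 1 OR cndn IS NULL ORDER BY ts"
)
inline_sql = (
    f"SELECT activity_id FROM ({body}) AS s "
    "WHERE cndn = 1 OR cndn IS NULL ORDER BY ts"
)

cte_ids = [r[0] for r in conn.execute(cte_sql)]
inline_ids = [r[0] for r in conn.execute(inline_sql)]
print(cte_ids == inline_ids, inline_ids)
```

Both forms return activity 2 (the first email after a non-email row) and activity 5 (a completed order whose previous activity was not an email).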

Update value based on value from another record of same table

Here I have a sample table of website visitors. As we can see, visitors sometimes don't provide their email. They may also switch to a different email address over time.
Original table:
I want to update this table to meet the following requirements:
The first time a visitor provides an email, all of his past visits are tagged with that email.
All of his future visits are also tagged with that email until he switches to another one.
Expected table after update:
I was wondering if there is a way of doing this in Redshift or T-SQL?
Thanks everyone!
In SQL Server or Redshift, you can use a subquery to calculate the email:
select t.*,
coalesce(email,
max(email) over (partition by visitor_id, grp),
max(case when activity_date = first_email_date then email end) over (partition by visitor_id)
) as imputed_email
from (select t.*,
min(case when email is not null then activity_date end) over
(partition by visitor_id order by activity_date rows between unbounded preceding and current row) as first_email_date,
count(email) over (partition by visitor_id order by activity_date rows between unbounded preceding and current row) as grp
from t
) t;
You can then use this in an update:
update t
set email = tt.imputed_email
from (select t.*,
coalesce(email,
max(email) over (partition by visitor_id, grp),
max(case when activity_date = first_email_date then email end) over (partition by visitor_id)
) as imputed_email
from (select t.*,
min(case when email is not null then activity_date end) over
(partition by visitor_id order by activity_date) as first_email_date,
count(email) over (partition by visitor_id order by activity_date) as grp
from t
) t
) tt
where tt.visitor_id = t.visitor_id and tt.activity_date = t.activity_date and
t.email is null;
Assuming the table is named Visits and its primary key consists of the columns Visitor_id and Activity_Date, you can do the following in T-SQL.
Using a correlated subquery:
update a
set a.Email = coalesce(
-- select the email used previously
(
select top 1 Email from Visits
where Email is not null and Activity_Date < a.Activity_Date and Visitor_id = a.Visitor_id
order by Activity_Date desc
),
-- if there was no email used previously then select the email used next
(
select top 1 Email from Visits
where Email is not null and Activity_Date > a.Activity_Date and Visitor_id = a.Visitor_id
order by Activity_Date
)
)
from Visits a
where a.Email is null;
Using a window function to provide the ordering:
update v
set Email = vv.Email
from Visits v
join (
select
v.Visitor_id,
coalesce(a.Email, b.Email) as Email,
v.Activity_Date,
row_number() over (partition by v.Visitor_id, v.Activity_Date
order by a.Activity_Date desc, b.Activity_Date) as Row_num
from Visits v
-- previous visits with email
left join Visits a
on a.Visitor_id = v.Visitor_id
and a.Email is not null
and a.Activity_Date < v.Activity_Date
-- next visits with email if there are no previous visits
left join Visits b
on b.Visitor_id = v.Visitor_id
and b.Email is not null
and b.Activity_Date > v.Activity_Date
and a.Visitor_id is null
where v.Email is null
) vv
on vv.Visitor_id = v.Visitor_id
and vv.Activity_Date = v.Activity_Date
where
vv.Row_num = 1;
For each visitor_id you can update the null email value with the previous non-null value. If there is none, use the next non-null value. You can get those values as follows:
select
v.*, v_prev.email prev_email, v_next.email next_email
from
visits v
left join visits v_prev on v.visitor_id = v_prev.visitor_id
and v_prev.activity_date = (select max(v2.activity_date) from visits v2 where v2.visitor_id = v.visitor_id and v2.activity_date < v.activity_date and v2.email is not null)
left join visits v_next on v.visitor_id = v_next.visitor_id
and v_next.activity_date = (select min(v2.activity_date) from visits v2 where v2.visitor_id = v.visitor_id and v2.activity_date > v.activity_date and v2.email is not null)
where
v.email is null
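The rule itself (previous non-null email, else next non-null email) can be checked on a toy table. Here is a rough sketch on SQLite through Python's sqlite3 module; it is neither T-SQL nor Redshift, so TOP 1 becomes LIMIT 1, and the visits rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor_id INTEGER, activity_date TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", None),
        (1, "2020-01-02", "a@x.com"),
        (1, "2020-01-03", None),
        (1, "2020-01-04", "b@x.com"),
        (1, "2020-01-05", None),
        (2, "2020-01-01", None),  # visitor who never gave an email
    ],
)

query = """
SELECT COALESCE(
           v.email,
           -- nearest earlier visit with an email
           (SELECT p.email FROM visits p
             WHERE p.visitor_id = v.visitor_id AND p.email IS NOT NULL
               AND p.activity_date < v.activity_date
             ORDER BY p.activity_date DESC LIMIT 1),
           -- otherwise the nearest later visit with an email
           (SELECT n.email FROM visits n
             WHERE n.visitor_id = v.visitor_id AND n.email IS NOT NULL
               AND n.activity_date > v.activity_date
             ORDER BY n.activity_date LIMIT 1)
       ) AS imputed_email
FROM visits v
ORDER BY v.visitor_id, v.activity_date
"""
imputed = [r[0] for r in conn.execute(query)]
print(imputed)
```

For visitor 1 the first visit is back-filled with a@x.com, the later gaps are forward-filled, and visitor 2 stays null because there is nothing to copy from.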

I need to write a query to mark previous record as “Not eligible” if a new record comes in within 30 days with same POS Order ID

I need to write a query that finds records whose POS_ORDER_ID appears again in the table within 30 days as a new record with status 'Canceled' or 'Discontinued', and that marks the previous record for that POS_ORDER_ID as not eligible.
Table columns:
POS_ORDER_ID,
Status,
Order_date,
Error_description
A query using the MAX() and ROW_NUMBER() analytic functions might help, such as:
with t as
(
select t.*,
row_number() over (partition by pos_order_id order by Order_date desc ) as rn,
max(Order_date) over (partition by pos_order_id) as mx
from tab t -- your original table
)
select pos_order_id, Status, Order_date, Error_description,
case when rn >1
and t.status in ('Canceled','Discontinued')
and mx - t.Order_date <= 30
then
'Not eligible'
end as "Extra Status"
from t
Please use the query below.
Select and validate:
select POS_ORDER_ID, Status, Order_date, Error_description, row_number()
over(partition by POS_ORDER_ID order by Order_date desc)
from table_name;
Update query
merge into table_name t1
using
(select row_id, POS_ORDER_ID, Status, Order_date, Error_description,
row_number() over(partition by POS_ORDER_ID order by Order_date desc) as rnk
from table_name) t2
on (t1.POS_ORDER_ID = t2.POS_ORDER_ID and t1.row_id = t2.row_id)
when matched then
update
set
-- target column assumed to be Status; the original statement did not name one
t1.Status = case when t2.rnk = 1 then 'Canceled' else 'Not Eligible' end;

Select a line equal to 'X' without TOP 'N' plus the previous line 'Y' in SQL Server?

I need to return in a query only the last lines with 'ProductStatus' equal 'Stop' and the previous line.
I have the table:
And need to get this result:
How do I do this in SQL Server?
One method uses window functions to calculate the last stop and then get the row before that:
select t.*
from (select t.*,
lead(seqnum_ps) over (partition by producttype order by datevalue) as next_seqnum_ps,
lead(product_status) over (partition by producttype order by datevalue) as next_product_status
from (select t.*,
row_number() over (partition by producttype, product_status order by datevalue desc) as seqnum_ps
from t
) t
) t
where (seqnum_ps = 1 and product_status = 'Stop') or
(next_seqnum_ps = 1 and next_product_status = 'Stop');
An alternative method gets the maximum stop time and uses that:
select t.*
from (select t.*,
max(case when product_status = 'Stop' then datevalue end) over (partition by producttype) as max_stop_dv,
lead(datevalue) over (partition by producttype order by datevalue) as next_dv
from t
) t
where datevalue = max_stop_dv or
next_dv = max_stop_dv;
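The second method is easy to verify on a few rows. Below is a minimal sketch on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later) rather than SQL Server. The rows are invented, and datevalue is a plain integer here just to keep the sample short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (producttype TEXT, product_status TEXT, datevalue INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("A", "Run", 1), ("A", "Stop", 2), ("A", "Run", 3),
        ("A", "Stop", 4), ("A", "Run", 5),
        ("B", "Run", 1), ("B", "Stop", 2),
    ],
)

query = """
SELECT producttype, datevalue
FROM (
    SELECT t.*,
           -- latest 'Stop' per product type, repeated on every row
           MAX(CASE WHEN product_status = 'Stop' THEN datevalue END)
               OVER (PARTITION BY producttype) AS max_stop_dv,
           -- datevalue of the following row, to spot the row just before the stop
           LEAD(datevalue) OVER (PARTITION BY producttype ORDER BY datevalue) AS next_dv
    FROM t
) x
WHERE datevalue = max_stop_dv OR next_dv = max_stop_dv
ORDER BY producttype, datevalue
"""
last_stop_rows = list(conn.execute(query))
print(last_stop_rows)
```

For product A the earlier stop at 2 is ignored; only the last stop (4) and the row right before it (3) are returned, and likewise rows 1 and 2 for product B.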

How to return all the rows in the yellow census blocks?

The schema is like this. The whole dataset should be ordered by machine_id first, then by ss2k. Then, for each machine, we should find all rows that belong to a run of at least 5 consecutive rows with flag = 'census'. In this dataset, the result should be all the yellow rows.
Using the following, I cannot return the last 4 rows of each yellow block:
drop table if exists qz_panel_census_228_rank;
create table qz_panel_census_228_rank as
select t.*
from (select t.*,
count(*) filter (where flag = 'census') over (partition by machine_id, date order by ss2k rows between current row and 4 following) as census_cnt5,
count(*) filter (where flag = 'census') over (partition by machine_id, date) as count_census,
row_number() over (partition by machine_id, date order by ss2k) as seqnum,
count(*) over (partition by machine_id, date) as cnt
from qz_panel_census_228 t
) t
where census_cnt5 = 5
group by 1,2,3,4,5,6,7,8,9,10,11
DISTRIBUTED BY (machine_id);
You were close, but you need to search in both directions:
select t.*
from (select t.*,
case when count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between 4 preceding and current row) = 5
or count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between current row and 4 following) = 5
then 1
else 0
end as in_census_run -- renamed so it does not clash with the existing flag column
from qz_panel_census_228 t
) t
where in_census_run = 1
Edit:
This approach will not work unless you add an extra count for each possible 5 row window, e.g. 3 preceding and 1 following, 2 preceding and 2 following, etc. This results in ugly code and is not very flexible.
The common way to solve this gaps & islands problem is to assign consecutive rows to a common group first:
select *
from
(
select t2.*,
count(*) over (partition by machine_id, date, grp) as cnt
from
(
select t1.*
from (select t.*,
-- keep the same number for 'census' rows
sum(case when flag = 'census' then 0 else 1 end)
over (partition by machine_id, date
order by ss2k
rows unbounded preceding) as grp
from qz_panel_census_228 t
) t1
where flag = 'census' -- only census rows
) as t2
) t3
where cnt >= 5 -- only groups of at least 5 census rows
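To see the difference between the two approaches, here is a small comparison on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). The panel table and its seven rows are invented, the date column is left out for brevity, and FILTER is replaced by SUM(CASE ...) to stay portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE panel (machine_id INTEGER, ss2k INTEGER, flag TEXT)")
flags = ["census"] * 5 + ["other", "census"]  # one run of five, then a lone census row
conn.executemany(
    "INSERT INTO panel VALUES (1, ?, ?)",
    [(i + 1, f) for i, f in enumerate(flags)],
)

# Fixed windows looking backwards and forwards from the current row.
two_way_sql = """
SELECT ss2k FROM (
    SELECT t.*,
           SUM(CASE WHEN flag = 'census' THEN 1 ELSE 0 END)
               OVER (PARTITION BY machine_id ORDER BY ss2k
                     ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS back5,
           SUM(CASE WHEN flag = 'census' THEN 1 ELSE 0 END)
               OVER (PARTITION BY machine_id ORDER BY ss2k
                     ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING) AS fwd5
    FROM panel t
) x
WHERE back5 = 5 OR fwd5 = 5
ORDER BY ss2k
"""

# Gaps and islands: non-census rows bump the counter, census rows inherit it.
islands_sql = """
SELECT ss2k FROM (
    SELECT g.*, COUNT(*) OVER (PARTITION BY machine_id, grp) AS cnt
    FROM (
        SELECT t.*,
               SUM(CASE WHEN flag = 'census' THEN 0 ELSE 1 END)
                   OVER (PARTITION BY machine_id ORDER BY ss2k
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
        FROM panel t
    ) g
    WHERE flag = 'census'
) c
WHERE cnt >= 5
ORDER BY ss2k
"""

two_way = [r[0] for r in conn.execute(two_way_sql)]
islands = [r[0] for r in conn.execute(islands_sql)]
print(two_way, islands)
```

The fixed windows only catch the two ends of the run (rows 1 and 5), while the grouping version returns all five rows and drops the lone census row at the end.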
Wow, there has to be a better way of doing this, but the only way I could figure out was to create blocks of consecutive 'census' values. This looks awful but might be a catalyst to a better idea.
with q1 as (
select
machine_id, recorded, ss2k, flag, date,
case
when flag = 'census' and
lag (flag) over (order by machine_id, ss2k) != 'census'
then 1
else 0
end as block
from foo
),
q2 as (
select
machine_id, recorded, ss2k, flag, date,
sum (block) over (order by machine_id, ss2k) as group_id,
case when flag = 'census' then 1 else 0 end as census
from q1
),
q3 as (
select
machine_id, recorded, ss2k, flag, date, group_id,
sum (census) over (partition by group_id order by ss2k) as max_count
from q2
),
groups as (
select group_id
from q3
group by group_id
having max (max_count) >= 5
)
select
q2.machine_id, q2.recorded, q2.ss2k, q2.flag, q2.date
from
q2
join groups g on q2.group_id = g.group_id
where
q2.flag = 'census'
If you run each query within the with clauses in isolation, I think you will see how this evolves.