How to convert select row_number query to delete query on PostgreSQL? - sql

I have a query like this, which works well on a PostgreSQL server:
SELECT *
FROM
(SELECT s_o_i,
row_number() OVER (PARTITION BY g_i,
c
ORDER BY o_s DESC) AS rank
FROM b_o) b_o
WHERE rank > 1
AND m_t < CURRENT_DATE - interval '1 months'
When I convert this to this query;
DELETE
FROM
(SELECT s_o_i,
row_number() OVER (PARTITION BY g_i,
c
ORDER BY o_s DESC) AS rank
FROM b_o) b_o
WHERE rank > 1
AND m_t < CURRENT_DATE - interval '1 months'
It throws an error like this;
ERROR: syntax error at or near "("
LINE 3: (SELECT s_o_i,
Why does the DELETE query throw an error?
I expect it to run and delete the records that match the WHERE conditions.

Assuming that your aim is to remove, within each (g_i, c) group, all records from more than a month ago except the one with the highest o_s:
DELETE FROM b_o WHERE ctid in
( SELECT ctid
FROM
( SELECT ctid,
row_number() OVER w AS rank
FROM b_o
WHERE m_t < CURRENT_DATE - interval '1 months'
WINDOW w AS (PARTITION BY g_i, c ORDER BY o_s DESC)
) as alias
WHERE rank > 1
);
You were trying to delete records from a subselect instead of your table b_o, which isn't syntactically correct and doesn't make much sense because the subquery table lives only for the duration of the outer query.
You referred to the m_t column outside the subquery without selecting it there first, so it wouldn't be recognised.
If the s_o_i column uniquely identifies records in the table, you can use it instead of ctid.
In this case common table expressions aren't necessary, but you can use one to flatten things a bit:
WITH cte AS
( SELECT ctid,
row_number() OVER w AS rank
FROM b_o
WHERE m_t < CURRENT_DATE - interval '1 months'
WINDOW w AS (PARTITION BY g_i, c ORDER BY o_s DESC)
)
DELETE FROM b_o WHERE ctid in
( SELECT ctid
FROM cte
WHERE rank > 1
);
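As a rough way to check the shape of this DELETE without a PostgreSQL server, here is a minimal sketch that runs the same pattern on an in-memory SQLite database from Python. It assumes SQLite 3.25 or newer (for window functions), uses s_o_i as the unique key instead of ctid, and replaces the interval arithmetic with SQLite's date() modifiers. The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE b_o (s_o_i INTEGER PRIMARY KEY, g_i INT, c INT, o_s INT, m_t TEXT)"
)
# Hypothetical sample rows; m_t is computed relative to today.
conn.executemany(
    "INSERT INTO b_o VALUES (?, ?, ?, ?, date('now', ?))",
    [
        (1, 1, 1, 30, "-3 months"),  # old, highest o_s in group (1, 1): kept
        (2, 1, 1, 20, "-2 months"),  # old, rank 2: deleted
        (3, 1, 1, 10, "-2 months"),  # old, rank 3: deleted
        (4, 1, 1, 5, "-1 day"),      # recent: filtered out before ranking
        (5, 2, 1, 7, "-4 months"),   # old, only row in group (2, 1): kept
    ],
)

# Delete from the real table; the ranked subquery only supplies the keys.
conn.execute("""
    DELETE FROM b_o WHERE s_o_i IN (
        SELECT s_o_i FROM (
            SELECT s_o_i, row_number() OVER w AS rnk
            FROM b_o
            WHERE m_t < date('now', '-1 month')
            WINDOW w AS (PARTITION BY g_i, c ORDER BY o_s DESC)
        ) AS ranked
        WHERE rnk > 1
    )
""")

remaining = [row[0] for row in conn.execute("SELECT s_o_i FROM b_o ORDER BY s_o_i")]
print(remaining)
```

Note that the age filter runs before the ranking, so the row kept per group is the highest o_s among the old rows only, which is the assumption stated at the top of this answer.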

Related

SQL Get last 7 days from event date

The best way to explain what I need is showing, so, here it is:
Currently I have this query
select
date_
,count(*) as count_
from table
group by date_
which returns the following result:
Now I need a new column that shows the count of all rows from the previous 7 days, relative to the row's date_.
So, if the row is from day 29/06, I have to count all occurrences of that day (my query is already doing that) and also get all occurrences from day 22/06 to 29/06.
The result should be something like this:
If you have values for all dates, without gaps, then you can use window functions with a rows frame:
select
date,
count(*) cnt,
sum(count(*)) over(order by date rows between 7 preceding and current row) cnt_d7
from mytable
group by date
order by date
you can try something like this:
select
date_,
count(*) as count_,
(select count(*)
from table as b
where b.date_ <= a.date_ and b.date_ > a.date_ - interval '7 days'
) as count7days_
from table as a
group by date_
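To see why the ROWS frame only works when every date is present, the following sketch compares it with a date-based subquery on data that has a gap. It is an illustration under assumptions: it runs on SQLite (3.25 or newer) through Python rather than PostgreSQL, the table name events and its rows are hypothetical, and date(x, '-7 days') stands in for the interval arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (date_ TEXT)")
# Hypothetical data: nothing between 2020-06-20 and 2020-06-28.
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("2020-06-20",)] * 2 + [("2020-06-28",)] + [("2020-06-29",)] * 3,
)

# ROWS frame: looks back 7 *rows*, so the gap is silently ignored.
rows_frame = conn.execute("""
    WITH daily AS (SELECT date_, count(*) AS count_ FROM events GROUP BY date_)
    SELECT date_, count_,
           sum(count_) OVER (ORDER BY date_
                             ROWS BETWEEN 7 PRECEDING AND CURRENT ROW)
    FROM daily
    ORDER BY date_
""").fetchall()

# Date-based subquery: looks back 7 *days*, so old rows drop out.
by_date = conn.execute("""
    WITH daily AS (SELECT date_, count(*) AS count_ FROM events GROUP BY date_)
    SELECT a.date_, a.count_,
           (SELECT sum(b.count_) FROM daily b
            WHERE b.date_ <= a.date_
              AND b.date_ > date(a.date_, '-7 days'))
    FROM daily a
    ORDER BY a.date_
""").fetchall()

print(rows_frame)
print(by_date)
```

With this data the two approaches disagree from the second row onward, because the row for 2020-06-20 is more than 7 days old but still within 7 rows.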
If you have gaps, you can do a more complicated solution where you add and subtract the values:
with t as (
select date_, count(*) as count_
from table
group by date_
union all
select date_ + interval '8 day', -count(*) as count_
from table
group by date_
)
select date_,
sum(sum(count_)) over (order by date_ rows between unbounded preceding and current row) - sum(count_)
from t
group by date_;
The - sum(count_) is because you do not seem to want the current day in the cumulated amount.
You can also use the nasty self-join approach . . . which should be okay for 7 days:
with t as (
select date_, count(*) as count_
from table
group by date_
)
select t.date_, t.count_, sum(tprev.count_)
from t left join
t tprev
on tprev.date_ >= t.date_ - interval '7 day' and
tprev.date_ < t.date_
group by t.date_, t.count_;
The performance will get worse and worse as "7" gets bigger.
Try with subquery for the new column:
select
a.date_ as groupdate,
count(a.date_) as date_count,
(select count(b.date_)
from table as b
where b.date_ <= a.date_ and b.date_ >= a.date_ - interval '7 day'
) as total7
from table as a
group by a.date_
order by a.date_

CASE AND WHEN SQL

I have transactional data of customers' purchases. I tried to select customer_id values from the last month and calculate recency as the average number of days between a customer's purchases (AVG(gap)):
SELECT
customer_id,
(
CASE WHEN day::DATE<= '2015-05-01'::DATE AND day::DATE > '2015-05-01'::DATE - INTERVAL '1 month'
THEN
(
SELECT
AVG(gap)
FROM
(
SELECT
customer_id,
( day- LAG(day) OVER ( PARTITION BY customer_id ORDER BY day ) ) AS gap
FROM
baskets
JOIN
basket_lines
USING
( basket_id )
GROUP BY 1
) a
) b
ELSE 0
) AS A
FROM
baskets
JOIN
basket_lines
USING
(basket_id)
GROUP BY
1;
However, I get an error like this:
ERROR: syntax error at or near "b"
LINE 45: GROUP BY 1)a)b ELSE 0) AS A
^
Does this mean I cannot use a subquery after THEN?
A subquery in the THEN clause does not take an alias. Also, you must end your CASE expression with END:
SELECT
customer_id,
(CASE WHEN day::DATE<= '2015-05-01'::DATE AND
day::DATE > '2015-05-01'::DATE - INTERVAL '1 month'
THEN
(SELECT AVG(gap) FROM (
SELECT customer_id,
(day- LAG(day) OVER (PARTITION BY customer_id ORDER BY day)) as gap
FROM baskets
JOIN basket_lines
USING (basket_id)
GROUP BY 1) a) ELSE 0 END) AS A
FROM baskets
JOIN basket_lines
USING (basket_id)
GROUP BY 1;
But you have a correlated subquery in your select statement. This is probably not optimal, and we can likely rewrite your query using a join.
I propose the following refactor:
WITH cte AS (
SELECT
customer_id,
(day- LAG(day) OVER (PARTITION BY customer_id ORDER BY day)) as gap
FROM baskets
INNER JOIN basket_lines
USING (basket_id)
WHERE day::DATE<= '2015-05-01'::DATE AND
day::DATE > '2015-05-01'::DATE - INTERVAL '1 month'
)
SELECT
customer_id,
AVG(gap) AS cust_avg
FROM cte
GROUP BY
customer_id;
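The refactor can be tried on a small dataset. This is a sketch under assumptions, not the original schema: it uses a single hypothetical visits table instead of joining baskets and basket_lines, runs on SQLite (3.25 or newer) through Python, and computes the day difference with julianday() because SQLite has no date subtraction operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (customer_id INT, day TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        (1, "2015-03-01"),  # outside the one-month window, ignored
        (1, "2015-04-05"),
        (1, "2015-04-10"),  # gap of 5 days
        (1, "2015-04-20"),  # gap of 10 days
        (2, "2015-04-15"),  # single visit, so no gap can be computed
    ],
)

avg_gaps = conn.execute("""
    WITH cte AS (
        SELECT customer_id,
               julianday(day)
                 - LAG(julianday(day)) OVER (PARTITION BY customer_id ORDER BY day)
                 AS gap
        FROM visits
        WHERE day <= '2015-05-01'
          AND day > date('2015-05-01', '-1 month')
    )
    SELECT customer_id, AVG(gap) AS cust_avg
    FROM cte
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()

print(avg_gaps)
```

Because the date filter lives in the WHERE clause of the CTE, the first in-window visit of each customer has no previous row and contributes a NULL gap, which AVG skips. A customer with a single visit therefore gets NULL rather than 0.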

SQL SELECT rows where the difference between consecutive columns is less than X

Basically Mysql: Find rows, where timestamp difference is less than x, but I want to stop at the first value whose timestamp difference is larger than X.
I got so far:
SELECT *
FROM (
SELECT *,
(LEAD(datetime) OVER (ORDER BY datetime)) - datetime AS difference
FROM history
) AS sq
WHERE difference < '00:01:00'
Which seems to correctly return all rows where the difference between the row and the one "behind" it is less than a minute, but that means I still get large jumps in the datetimes, which I don't want - I want to select the most recent "run" of rows, where a "run" is defined as "the timestamps in datetime differ by less than a minute".
e.g., I have rows whose hypothetical timestamps are as follows:
24, 22, 21, 19, 18, 12, 11, 9, 7...
And my limit of differences is 3, i.e. I want the run of the rows whose difference between "timestamps" is less than 3; therefore just:
24, 22, 21, 19, 18
Is this possible in SQL?
You can use lag to get the previous row's timestamp and check whether the current row is within 3 minutes of it. Start a new group whenever the condition fails. Once the grouping is done, find the latest such group with max, then select all rows from that group.
Include a partition by clause in the window functions lag, sum and max if this has to be done for each id in the table.
with grps as (
select x.*,sum(col) over(order by dt) grp
from (select t.*
--checking if the current row's timestamp is within 3 minutes of the previous row
,case WHEN dt BETWEEN LAG(dt) OVER (ORDER BY dt)
AND LAG(dt) OVER (ORDER BY dt) + interval '3 minute' THEN 0 ELSE 1 END col
from t) x
)
select dt
from (select g.*,max(grp) over() maxgrp --getting the latest group
from grps g
) g
where grp = maxgrp
The above returns the members of the latest group even if it has only one row. To avoid such results, get the latest group that has more than 1 row.
with grps as (
select x.*,sum(col) over(order by dt) grp
from (select t.*
,case WHEN dt BETWEEN LAG(dt) OVER (ORDER BY dt)
AND LAG(dt) OVER (ORDER BY dt) + 3 THEN 0 ELSE 1 END col
from t) x
)
,grpcnts as (select g.*,count(*) over(partition by grp) grpcnt from grps g)
select dt from (select g.*,max(grp) over() maxgrp
from grpcnts g
where grpcnt > 1
) g
where grp = maxgrp
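The grouping idea can be checked against the integer example from the question (24, 22, 21, 19, 18, 12, 11, 9, 7 with a limit of 3). The sketch below is an assumption-laden illustration: it runs on SQLite (3.25 or newer) through Python, treats dt as a plain integer, and uses a strict "less than 3" comparison as the question words it, whereas the BETWEEN in the queries above is inclusive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (dt INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [(v,) for v in (24, 22, 21, 19, 18, 12, 11, 9, 7)],
)

latest_run = [row[0] for row in conn.execute("""
    WITH flagged AS (
        -- 1 marks the start of a new run: either the first row (LAG is NULL)
        -- or a row that is 3 or more away from its predecessor.
        SELECT dt,
               CASE WHEN dt - LAG(dt) OVER (ORDER BY dt) < 3 THEN 0 ELSE 1 END
                 AS is_start
        FROM t
    ),
    grps AS (
        -- A running sum of the start flags gives each run its own number.
        SELECT dt, sum(is_start) OVER (ORDER BY dt) AS grp
        FROM flagged
    )
    SELECT dt
    FROM grps
    WHERE grp = (SELECT max(grp) FROM grps)
    ORDER BY dt DESC
""")]

print(latest_run)
```

The jump from 12 to 18 starts the second run, so only the rows from 18 upward carry the highest group number.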
You can do this by using a flag based on the lead() or lag() values. I believe this does what you want:
SELECT h.*
FROM (SELECT h.*,
SUM( (next_datetime < datetime + interval '1 minute')::int) OVER (ORDER BY datetime DESC) as grp
FROM (SELECT h.*,
LEAD(h.datetime) OVER (ORDER BY h.datetime) as next_datetime
FROM history h
) h
WHERE next_datetime < datetime + interval '1 hour'
) h
WHERE grp IS NULL OR grp = 0;
This can be easily solved with recursive CTEs (this will select your rows one-by-one and stops when there is no row in range interval '1 min'):
with recursive h as (
select * from (
select *
from history
order by history.datetime desc
limit 1
) s
union all
select * from (
select history.*
from h
join history on history.datetime >= h.datetime - interval '1 min'
and history.datetime < h.datetime
order by history.datetime desc
limit 1
) s
)
select * from h
This should be efficient if you have an index on history.datetime. Though, if you care about performance, you should test it against the window-function based ones. (I personally get a headache when I see as many subqueries and window functions as this problem needs. The irony in my answer is that PostgreSQL does not support the ORDER BY clause directly inside recursive CTEs, so I had to use 2 meaningless subqueries to "hide" them.)

Reactivation SQL

I have the following:
with t as (
SELECT advertisable, EXTRACT(YEAR from day) as yy, EXTRACT(MONTH from day) as mon,
ROUND(SUM(cost)/1e6) as val
FROM adcube dac
WHERE advertisable IN (SELECT advertisable
FROM adcube dac
GROUP BY advertisable
HAVING SUM(cost)/1e6 > 100
)
GROUP BY advertisable, EXTRACT(YEAR from day), EXTRACT(MONTH from day)
)
select advertisable, min(yy * 10000 + mon) as yyyymm
from (select t.*,
(row_number() over (partition by advertisable order by yy, mon) -
row_number() over (partition by advertisable, val order by yy, mon)
) as grp
from t
)as foo
group by advertisable, grp, val
having count(*) >= 6 and val = 0
;
This tracks the date on which an account stops spending for 4 months. However, I would like to track the reactivation date instead. So if an account starts spending again after 4 months, how can I see the new start date for that account?
You want to find accounts where val > 0 and there are 4 (or 6) preceding records with 0s.
Here is an idea:
Calculate the groups of similar values as in your query.
Assign a sequential number to each group (val_seqnum).
Then pull the previous value and sequence number for each record.
Now, you want the records where the following is true:
val > 0
prev_val = 0
The previous val_seqnum >= 4 (or whatever your threshold).
The following query should do this (assuming the same definition of t):
select t.*
from (select t.* ,
lag(val) over (partition by advertisable order by yy, mon) prev_val,
lag(val_seqnum) over (partition by advertisable order by yy, mon) as prev_val_seqnum
from (select t.*,
row_number() over (partition by advertisable, val, grp order by yy, mon) as val_seqnum
from (select t.*,
(row_number() over (partition by advertisable order by yy, mon) -
row_number() over (partition by advertisable, val order by yy, mon)
) as grp
from t
) t
) t
) t
where val > 0 and prev_val = 0 and prev_val_seqnum >= 4;
I think this can be radically simpler (and faster):
SELECT advertisable, ym AS reactivation_ym
FROM (
SELECT advertisable
, date_trunc('month', day) AS ym
, SUM(cost) < 500000 AS asleep
, count(SUM(cost) < 500000 OR NULL)
OVER (PARTITION BY advertisable
ORDER BY date_trunc('month', day)
ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS ct
FROM adcube dac
JOIN (
SELECT advertisable
FROM adcube
GROUP BY 1
HAVING SUM(cost) > 1e8 -- really 100000000 ?
) x USING (advertisable)
GROUP BY 1, 2
) sub
WHERE NOT asleep
AND ct = 4;
Building on a couple of assumptions to fill in for missing information.
I largely untangled your calculations and simplified the code making it shorter and faster than your original.
Count for each advertisable how many of the last 4 months had a total cost below 500000. The row qualifies only if all 4 (existing) months are below the threshold. (If you don't have rows for all months, you need to decide how to handle missing rows. That information is not available in your question.)
Using count() as window aggregate function with a custom frame. Here is a recent related answer with detailed explanation:
Querying count on daily basis with date constraints over multiple weeks
How can you "nest" count() and sum()?
They are not really nested. It's a window function over an aggregate function. Details:
Get the distinct sum of a joined table column
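The framed count can be tried in isolation. The sketch below is a simplification under assumptions: it runs on SQLite (3.25 or newer) through Python, starts from a hypothetical monthly table that already holds one total per advertisable and month (so the window function is not layered over an aggregate here), and uses an arbitrary threshold of 1 in place of 500000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (advertisable TEXT, ym TEXT, cost INT)")
# Hypothetical monthly totals: active, asleep for four months, then active again.
conn.executemany(
    "INSERT INTO monthly VALUES (?, ?, ?)",
    [
        ("x", "2020-01", 100),
        ("x", "2020-02", 0),
        ("x", "2020-03", 0),
        ("x", "2020-04", 0),
        ("x", "2020-05", 0),
        ("x", "2020-06", 50),  # first active month after 4 asleep months
        ("x", "2020-07", 60),  # only 3 of the previous 4 months were asleep
    ],
)

reactivations = conn.execute("""
    SELECT advertisable, ym AS reactivation_ym
    FROM (
        SELECT advertisable, ym,
               cost < 1 AS asleep,
               -- "cond OR NULL" is NULL when cond is false, so count()
               -- only counts the asleep months inside the frame.
               count(cost < 1 OR NULL) OVER (
                   PARTITION BY advertisable
                   ORDER BY ym
                   ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS ct
        FROM monthly
    ) sub
    WHERE NOT asleep
      AND ct = 4
""").fetchall()

print(reactivations)
```

The frame excludes the current row, so a month qualifies only when it is active itself and all four rows before it were below the threshold.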

Aggregates for today and the previous day depending on data

I'm having trouble putting together a query to pull the aggregate values of a given timestamp and the timestamp before it. Given the following schema:
name TEXT,
ts TIMESTAMP,
X NUMERIC,
Y NUMERIC
where there are gaps in the ts column due to gaps in data, I'm trying to construct a query to produce
name,
date_trunc('day', q1.ts),
avg(q1.X),
sum(q1.Y),
date_trunc('day', q2.ts),
avg(q2.X),
sum(q2.Y)
The first half is straightforward:
SELECT q1.name, date_trunc('day', q1.ts), avg(q1.X), sum(q1.Y)
FROM data as q1
GROUP BY 1, 2
ORDER BY 1, 2;
But not sure how to generate the relation to find the "day" before for each row. I'm trying to work an inner join like this:
SELECT q1.name, q1.day, q1.avg, q1.sum, q2.day, q2.avg, q2.sum
FROM (
SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) as sum
FROM data
GROUP BY 1,2
ORDER BY 1,2
) q1 INNER JOIN (
SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) as sum
FROM data
GROUP BY 1,2
ORDER BY 1,2
) q2 ON (
q1.name = q2.name
AND q2.day = q1.day - interval '1 day'
);
The problem with this is that it doesn't cover the cases where the previous "day" with data is more than 1 day before the current day.
The special difficulty here is that you need to number days after aggregating rows. You can do this in a single query level with the window function row_number(), since window functions are applied after aggregation by GROUP BY.
Also, use a CTE to avoid executing the same subquery multiple times:
WITH q AS (
SELECT name, ts::date AS day
,avg(x) AS avg_x, sum(y) AS sum_y
,row_number() OVER (PARTITION BY name ORDER BY ts::date) AS rn
FROM data
GROUP BY 1,2
)
SELECT q1.name, q1.day, q1.avg_x, q1.sum_y
,q2.day AS day2, q2.avg_x AS avg_x2, q2.sum_y AS sum_y2
FROM q q1
LEFT JOIN q q2 ON q1.name = q2.name
AND q1.rn = q2.rn + 1
ORDER BY 1,2;
Using the simpler cast to date (ts::date) instead of date_trunc('day', ts) to get "days".
LEFT [OUTER] JOIN (as opposed to [INNER] JOIN) is instrumental to preserve the corner case of the first row, where there is no previous day.
And ORDER BY should be applied to the outer query.
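To illustrate how the row-number join pairs each day with the previous day that actually has data, here is a small sketch. It makes several assumptions: it runs on SQLite (3.25 or newer) through Python, uses date(ts) in place of ts::date, numbers the rows in a second CTE instead of in the same query level as the GROUP BY, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (name TEXT, ts TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        ("a", "2013-12-01 08:00:00", 1.0, 10.0),
        ("a", "2013-12-01 20:00:00", 3.0, 20.0),
        ("a", "2013-12-05 09:00:00", 5.0, 7.0),  # gap: no rows for Dec 2-4
        ("a", "2013-12-06 09:00:00", 9.0, 1.0),
    ],
)

rows = conn.execute("""
    WITH q AS (
        SELECT name, date(ts) AS day, avg(x) AS avg_x, sum(y) AS sum_y
        FROM data
        GROUP BY name, date(ts)
    ),
    numbered AS (
        SELECT q.*, row_number() OVER (PARTITION BY name ORDER BY day) AS rn
        FROM q
    )
    SELECT q1.name, q1.day, q1.avg_x, q1.sum_y,
           q2.day AS day2, q2.avg_x AS avg_x2, q2.sum_y AS sum_y2
    FROM numbered q1
    LEFT JOIN numbered q2 ON q1.name = q2.name
                         AND q1.rn = q2.rn + 1
    ORDER BY q1.name, q1.day
""").fetchall()

for row in rows:
    print(row)
```

The row for 2013-12-05 is paired with 2013-12-01 even though four calendar days separate them, and the first day keeps NULLs on the right-hand side thanks to the LEFT JOIN.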
The question isn't crystal clear, but it sounds like you're actually trying to fill gaps while keeping track of leading/lagging rows.
To fill the gaps, look into generate_series() and left join it with your table:
select d
from generate_series(timestamp '2013-12-01', timestamp '2013-12-31', interval '1 day') d;
http://www.postgresql.org/docs/current/static/functions-srf.html
For previous and next row values, look into lead() and lag() window functions:
select date_trunc('day', ts) as curr_row_day,
lag(date_trunc('day', ts)) over w as prev_row_day
from data
window w as (order by ts)
http://www.postgresql.org/docs/current/static/tutorial-window.html