How to count the number of days per user on a rolling basis in Oracle SQL?

Following on from a previous question: I have a table called orders that records when each order was placed and which user placed it.
order_timestamp user_id
-------------------- ---------
1-JUN-20 02.56.12 123
3-JUN-20 12.01.01 533
23-JUN-20 08.42.18 123
12-JUN-20 02.53.59 238
19-JUN-20 02.33.72 34
I would like to calculate a daily rolling count of the number of days on which a user placed an order in the past 10 days.
For example, in the 10 days up to 20 June, user 34 placed an order on 5 of those days. In the 10 days up to 21 June, user 34 placed an order on 6 of those days.
In the end, the table should look like this:
date user_id no_of_days
----------- --------- ------------
20-JUN-20 34 5
20-JUN-20 123 10
20-JUN-20 533 2
20-JUN-20 238 3
21-JUN-20 34 6
21-JUN-20 123 10
How would the query be written for this kind of analysis?
Please let me know if my question is unclear or more information is required.
Thanks in advance.

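You can use window functions for this. Start by getting one row per user per day, then use a rolling count:
select day, user_id,
count(*) over (partition by user_id order by day range between interval '10' day preceding and current row) as no_of_days
from (select distinct trunc(order_timestamp) as day, user_id
from t
) t
Note that a range frame needs an order by, and that the range is inclusive at both ends, so interval '10' day preceding covers 11 calendar days. Use interval '9' day if the window should be exactly 10 days including the current one.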

Assuming that a user places at most one order per day, you can use window functions as follows:
select
t.*,
count(*) over(partition by user_id order by trunc(order_timestamp) range 10 preceding) no_of_days
from mytable t
Otherwise, you can get the distinct orders per day first:
select
order_day,
user_id,
count(*) over(partition by user_id order by order_day range 10 preceding) no_of_days
from (select distinct trunc(order_timestamp) order_day, user_id from mytable) t
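To sanity-check the window logic outside the database, the same two steps (one row per user per day, then a count over a trailing date range) can be sketched in Python. This is only an illustration under assumptions: the helper name rolling_day_counts and the sample rows are made up, and the window is taken as the current day plus the 9 preceding days (10 calendar days in total).

```python
from datetime import date, datetime, timedelta


def rolling_day_counts(orders, window_days=10):
    """For each (day, user) that has an order, count that user's distinct
    order days inside the window of `window_days` days ending on that day."""
    # Step 1: one entry per user per day (the SELECT DISTINCT TRUNC(...) step).
    days_by_user = {}
    for ts, user_id in orders:
        days_by_user.setdefault(user_id, set()).add(ts.date())

    # Step 2: count the user's days in [day - (window_days - 1), day].
    result = []
    for user_id, days in days_by_user.items():
        for day in sorted(days):
            start = day - timedelta(days=window_days - 1)
            n = sum(1 for d in days if start <= d <= day)
            result.append((day, user_id, n))
    return sorted(result)


# Hypothetical sample data: user 123 orders twice on 1 June,
# then once on 5 June and once on 12 June.
orders = [
    (datetime(2020, 6, 1, 2, 56), 123),
    (datetime(2020, 6, 1, 9, 0), 123),
    (datetime(2020, 6, 5, 8, 42), 123),
    (datetime(2020, 6, 12, 8, 0), 123),
    (datetime(2020, 6, 3, 12, 1), 533),
]
rows = rolling_day_counts(orders)
```

Like the SQL above, this only produces rows for days on which the user actually ordered. To get a row for every calendar day, the days would have to come from a calendar of dates rather than from the orders themselves.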

Related

Apply SUM( where date between date1 and date2)

My table is currently looking like this:
+---------+---------------+------------+------------------+
| Segment | Product | Pre_Date | ON_Prepaid |
+---------+---------------+------------+------------------+
| RB | 01. Auto Loan | 2020-01-01 | 10645976180.0000 |
| RB | 01. Auto Loan | 2020-01-02 | 4489547174.0000 |
| RB | 01. Auto Loan | 2020-01-03 | 1853117000.0000 |
| RB | 01. Auto Loan | 2020-01-04 | 9350258448.0000 |
+---------+---------------+------------+------------------+
I'm trying to sum values of 'ON_Prepaid' over the course of 7 days, let's say from '2020-01-01' to '2020-01-07'.
Here is what I've tried
drop table if exists ##Prepay_summary_cash
select *,
[1W_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 1 following and 7 following),
[2W_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 8 following and 14 following),
[3W_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 15 following and 21 following),
[1M_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 22 following and 30 following),
[1.5M_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 31 following and 45 following),
[2M_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 46 following and 60 following),
[3M_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 61 following and 90 following),
[6M_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 91 following and 181 following)
into ##Prepay_summary_cash
from ##Prepay1
Things should be fine if the dates are continuous; however, there are some missing days in 'Pre_Date' (you know banks don't work on Sundays, etc.).
So I'm trying to work on something like
[1W] = SUM(ON_Prepaid) over (where Pre_date between dateadd(d,1,Pre_date) and dateadd(d,7,Pre_date))
or something like that. So if, say, there is no record on 2020-01-05, the result should only sum the values for 1, 2, 3, 4, 6 and 7 Jan 2020, instead of 1, 2, 3, 4, 6, 7 and 8 (the 8th being included because of "rows 7 following"). Or, for example, if records are missing over a span of 30 days, then all those 30 days should be summed as 0s, so a 45-day window should return only the value of 15 days.
I've looked all over the forum and the answers I found did not suffice. Can you please help me out, or link me to a thread in which this problem has already been solved?
Thank you so much.
"Things should be fine if the dates are continuous"
Then make them continuous. Left join your real data (grouped so that there is one row per day) onto a calendar table (create one, or use a recursive CTE to generate a list of 360 dates starting from date X), and your query will work:
WITH d as
(
SELECT *
FROM
(
SELECT *
FROM cal
CROSS JOIN
(SELECT DISTINCT segment s, product p FROM ##Prepay1) x
) c
LEFT JOIN ##Prepay1 p
ON
c.d = p.pre_date AND
c.s = p.segment AND
c.p = p.product
WHERE
c.d BETWEEN '2020-01-01' AND '2021-01-01' -- filter on c.d, not p.pre_date (which is null for missing days)
)
--use d.d/s/p not d.pre_date/segment/product in your query (sometimes the latter are null)
select *,
[1W_Prepaid] = sum(ON_Prepaid) over (partition by s, p order by d.d rows between 1 following and 7 following),
...
CAL is just a table with a single column of dates, one row per day, with no time component, extending for several thousand days into the past and future.
Note that months have a variable number of days, so 6M is a bit of a misnomer; it might be better to call the monthly columns 180D, 90D, etc.
Also note that the windows are evaluated per row, looking forward from that row's date. If you want to sum up to 180 days after the date of each row, you need to pull a year's worth of data, so that on a row in June you still have the December data available to sum (December being 6 months after June).
If you then want to restrict the output to rows up to June (while still including data summed from the 6 months after June), you need to wrap the whole query in a subquery and filter outside it. You cannot put "where between Jan and Jun" in the same query that computes the sum over, because the WHERE clause is evaluated before the window functions; doing so would remove the December data before it is summed.
Some other databases make this easier; Oracle and Postgres spring to mind. They can sum over a range of rows whose values are within some distance of the current row's value. SQL Server only usefully supports distancing based on a row's position rather than its value (its RANGE support is limited to "rows that have the same value", rather than "rows that have values n higher or lower than the current row"). I suppose the requirement could be met with a cross apply, or a correlated subquery in the select, though I'd be careful to check the performance:
SELECT *,
(SELECT SUM(tt.a) FROM x tt WHERE t.x = tt.x AND tt.y = t.y AND tt.z BETWEEN DATEADD(d, 1, t.z) AND DATEADD(d, 7, t.z)) AS [1W]
FROM
x t
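The correlated-subquery idea can be tried end to end with a small script. The sketch below is an assumption-laden stand-in, not the SQL Server query itself: it uses Python's built-in sqlite3 module, a hypothetical table named prepay with made-up amounts, and SQLite's date(x, '+N days') in place of DATEADD. What it shows is that the window is defined by dates, so the missing 2020-01-05 does not pull 2020-01-09 into the first row's total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prepay (segment TEXT, product TEXT, pre_date TEXT, on_prepaid REAL)"
)
rows = [
    ("RB", "01. Auto Loan", "2020-01-01", 10.0),
    ("RB", "01. Auto Loan", "2020-01-02", 20.0),
    ("RB", "01. Auto Loan", "2020-01-03", 30.0),
    ("RB", "01. Auto Loan", "2020-01-04", 40.0),
    # 2020-01-05 is deliberately missing (no record on that day).
    ("RB", "01. Auto Loan", "2020-01-06", 60.0),
    ("RB", "01. Auto Loan", "2020-01-07", 70.0),
    ("RB", "01. Auto Loan", "2020-01-08", 80.0),
    ("RB", "01. Auto Loan", "2020-01-09", 90.0),
]
conn.executemany("INSERT INTO prepay VALUES (?, ?, ?, ?)", rows)

# For each row, sum the amounts dated 1 to 7 days after that row's date.
# COALESCE turns "no rows in the window" into 0 instead of NULL.
query = """
SELECT t.pre_date,
       (SELECT COALESCE(SUM(tt.on_prepaid), 0)
          FROM prepay tt
         WHERE tt.segment = t.segment
           AND tt.product = t.product
           AND tt.pre_date BETWEEN date(t.pre_date, '+1 days')
                               AND date(t.pre_date, '+7 days')) AS w1_prepaid
  FROM prepay t
 ORDER BY t.pre_date
"""
sums = dict(conn.execute(query).fetchall())
```

For 2020-01-01 the window is 2020-01-02 to 2020-01-08, giving 20 + 30 + 40 + 60 + 70 + 80 = 300. A "rows between 1 following and 7 following" frame would also have counted the 90 from 2020-01-09.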

Calculating time with datetime by groups

I have two tables, Tickets and Tasks. When a ticket is registered, it appears in the Tickets table, and every action performed on the ticket is saved in the Tasks table. The Tickets table includes information such as who created the ticket and its start and end dates (if it is closed). The Tasks table looks like this:
ID Ticket_ID Task_type_ID Task_type Group_ID Submit_Date
1 120 1 Opened 3 2016-12-09 11:10:22.000
2 120 2 Assign 4 2016-12-09 12:10:22.000
3 120 3 Paused 4 2016-12-09 12:30:22.000
4 120 4 Unpause 4 2016-12-10 10:30:22.000
5 120 2 Assign 6 2016-12-12 10:30:22.000
6 120 2 Assign 7 2016-12-12 15:30:22.000
7 120 5 Modify NULL 2016-12-13 15:30:22.000
8 120 6 Closed NULL 2016-12-13 16:30:22.000
I would like to calculate how long each group took to complete its task. The start time is when the ticket was assigned to a group, and the end time is when that group completes its task (by assigning it elsewhere or closing it). The paused time (Task_type_ID 3 to 4) should not be included. Also, when a ticket is assigned to another group, the new group ID appears in the previous task/row. If the ticket goes through multiple groups, the query should calculate how long the ticket was in the hands of each group.
I know it is complicated but maybe someone has an idea that I can start to build from.
This is quite a sophisticated gaps-and-islands problem.
Here is one approach at it:
select distinct
ticket_id,
group_id,
sum(sum(datediff(minute, submit_date, lead_submit_date)))
over(partition by ticket_id, group_id) elapsed_minutes
from (
select
t.*,
row_number() over(partition by ticket_id order by submit_date) rn1,
row_number() over(partition by ticket_id, group_id order by submit_date) rn2,
lead(submit_date) over(partition by ticket_id order by submit_date) lead_submit_date
from mytable t
) t
where task_type <> 'Paused' and group_id is not null
group by ticket_id, group_id, rn1 - rn2
In the subquery, we assign row numbers to records within two different partitions (by tickets vs by ticket and group), and recover the date of the next record with lead().
We can then use the difference between the row numbers to build groups of "adjacent" records (where the ticket stays in the same group), while not taking into account periods when the ticket was paused. Aggregation comes into play here.
The final step is to compute the overall time spent in each group : this handles the case when a ticket is assigned to the same group more than once during its lifecycle (although that's not showing in your sample data, the description of the question makes it sound like that may happen). We could do this with another level of aggregation but I went for a window sum and distinct, which avoids adding one more level of nesting to the query.
Executing the subquery independently might help understanding the logic better (see the below db fiddle).
For your sample data, the query yields:
ticket_id | group_id | elapsed_minutes
--------: | -------: | --------------:
120 | 3 | 60
120 | 4 | 2900
120 | 6 | 300
120 | 7 | 1440
I actually think this is pretty simple. Just use lead() to get the next submit time value and aggregate by the ticket and group ignoring pauses:
select ticket_id, group_id, sum(dur_sec)
from (select t.*,
datediff(second, submit_date, lead(submit_date) over (partition by ticket_id order by submit_date)) as dur_sec
from mytable t
) t
where task_type <> 'Paused' and group_id is not null
group by ticket_id, group_id;
Here is a db<>fiddle (with thanks to GMB for creating the original fiddle).
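The lead()-based logic can also be traced by hand on the sample data. The following Python sketch is just an illustration (the function name minutes_per_group is made up, and durations are truncated to whole minutes): it pairs every task with the next task of the same ticket, skips the Paused rows and the rows without a group, and adds up what is left per ticket and group.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question: (ticket_id, task_type, group_id, submit_date).
tasks = [
    (120, "Opened", 3, "2016-12-09 11:10:22"),
    (120, "Assign", 4, "2016-12-09 12:10:22"),
    (120, "Paused", 4, "2016-12-09 12:30:22"),
    (120, "Unpause", 4, "2016-12-10 10:30:22"),
    (120, "Assign", 6, "2016-12-12 10:30:22"),
    (120, "Assign", 7, "2016-12-12 15:30:22"),
    (120, "Modify", None, "2016-12-13 15:30:22"),
    (120, "Closed", None, "2016-12-13 16:30:22"),
]


def minutes_per_group(tasks):
    """Return {(ticket_id, group_id): minutes}, ignoring paused intervals."""
    by_ticket = defaultdict(list)
    for ticket_id, task_type, group_id, submitted in tasks:
        ts = datetime.strptime(submitted, "%Y-%m-%d %H:%M:%S")
        by_ticket[ticket_id].append((ts, task_type, group_id))

    totals = defaultdict(int)
    for ticket_id, events in by_ticket.items():
        events.sort(key=lambda e: e[0])
        # Pair each event with the following one: the equivalent of lead(submit_date).
        for (start, task_type, group_id), (end, _, _) in zip(events, events[1:]):
            if task_type == "Paused" or group_id is None:
                continue  # same filter as the WHERE clause
            totals[(ticket_id, group_id)] += int((end - start).total_seconds() // 60)
    return dict(totals)


elapsed = minutes_per_group(tasks)
```

For the sample ticket this gives 60 minutes for group 3, 2900 for group 4 (20 minutes before the pause plus 2880 after it), 300 for group 6 and 1440 for group 7.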

Query for Offset Rolling Sum & Count - Oracle SQL

I've been trying to build a distinct count and a sum of sales based on orders placed 4 to 180 days back, for each day in the data table starting with orders placed on day 181, and then group the results by month and year, but I have been unable to do it.
The end result would look something like the table below. Each order would show up multiple times, up to 176 times, but would be distinct for a given day (for example, order 42999, placed on 10-01-2011, would be counted once on every day between 10-05-2011 and 2-01-2012).
| OrdMonthYr | Grouped Order Count | Sum of Orders |
------------------------------------------------------
| 2011-06 | 140 | $450 |
| 2011-07 | 190 | $500 |
| 2011-08 | 250 | $600 |
------------------------------------------------------
The order count would take the total count of sales for a given day executed 4 to 180 days prior to that day (so March 1st, 2011 would have a distinct order count and order sum for orders placed between Nov 1st, 2010 and Feb 25th, 2011 as an example) followed by a function aggregating each of those totals up to month & year per the table above.
As I understand it, you want a cumulative sum and count over the previous 4 to 180 days, but it's not clear how the result should be rolled up.
If so, you can use analytic functions. The following query calculates it:
select trunc(o.orderdate)
,count(*) over (order by trunc(o.orderdate) range between 180 PRECEDING AND 4 PRECEDING )
,sum(amount) over (order by trunc(o.orderdate) range between 180 PRECEDING AND 4 PRECEDING)
from orders o
As for rolling up the orders by month: maybe you need to take the first day of every month and get the count and sum for it. If so, you can take one row per day from the previous query and keep only the rows that fall on the first of a month:
select ord_date, cnt, sum_amount from (
select trunc(o.orderdate) as ord_date
,count(*) over (order by trunc(o.orderdate) range between 180 PRECEDING AND 4 PRECEDING ) as cnt
,sum(amount) over (order by trunc(o.orderdate) range between 180 PRECEDING AND 4 PRECEDING) as sum_amount
,row_number() over (partition by trunc(o.orderdate) order by rowid) as RN
from orders o)
WHERE rn = 1
and ord_date = trunc(ord_date,'MM')
Does this get at what you want?
select d.dte as orderdate,
(select count(*)
from orders o
where o.orderdate between d.dte - 180 and d.dte - 4
) as cnt,
(select sum(amount)
from orders o
where o.orderdate between d.dte - 180 and d.dte - 4
) as amount
from (select distinct orderdate as dte from orders) d;
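To make the boundaries of the "4 to 180 days back" window concrete, here is a small Python sketch of the same per-day calculation. It is an illustration only: the function name offset_window_totals and the sample orders are invented, and both ends of the window are treated as inclusive, as BETWEEN does.

```python
from datetime import date, timedelta


def offset_window_totals(orders, min_back=4, max_back=180):
    """orders: list of (order_date, amount).
    Returns {day: (order_count, amount_sum)} for every day that has an order,
    counting orders placed between max_back and min_back days before that day."""
    result = {}
    for day in sorted({d for d, _ in orders}):
        lo = day - timedelta(days=max_back)
        hi = day - timedelta(days=min_back)
        amounts = [amt for d, amt in orders if lo <= d <= hi]
        result[day] = (len(amounts), sum(amounts))
    return result


# Hypothetical orders.
orders = [
    (date(2011, 1, 1), 100),
    (date(2011, 1, 3), 50),
    (date(2011, 1, 5), 25),
    (date(2011, 7, 1), 10),
]
totals = offset_window_totals(orders)
```

On 5 January only the 1 January order is at least 4 days old, so the result is (1, 100). On 1 July the 1 January order is 181 days old and has dropped out of the window, leaving (2, 75).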

Get last entry from each user in database

I have a PostgreSQL database, and I'm having trouble getting my query right, even though this seems like a common problem.
My table looks like this:
CREATE TABLE orders (
account_id INTEGER,
order_id INTEGER,
ts TIMESTAMP DEFAULT NOW()
)
Every time there is a new order, I use it to link the account_id and order_id.
Now my problem is that I want to get a list that has the last order (by looking at ts) for each account.
For example, if my data is:
account_id order_id ts
5 178 July 1
5 129 July 6
4 190 July 1
4 181 July 9
3 348 July 1
3 578 July 4
3 198 July 1
3 270 July 12
Then I'd like the query to return only the last row for each account:
account_id order_id ts
5 129 July 6
4 181 July 9
3 270 July 12
I've tried GROUP BY account_id, and I can use that to get the MAX(ts) for each account, but then I have no way to get the associated order_id. I've also tried sub-queries, but I just can't seem to get it right.
Thanks!
select distinct on (account_id) *
from orders
order by account_id, ts desc
https://www.postgresql.org/docs/current/static/sql-select.html#SQL-DISTINCT:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
The row_number() window function can help:
select account_id, order_id, ts
from (select account_id, order_id, ts,
row_number() over(partition by account_id order by ts desc) as rn
from orders) t
where rn = 1
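Both queries boil down to "keep the row with the greatest ts per account_id". As a cross-check, that rule can be sketched in a few lines of Python. The function name is made up, and because the sample data gives no year, 2020 is assumed here purely for illustration.

```python
from datetime import date


def latest_order_per_account(orders):
    """orders: iterable of (account_id, order_id, ts).
    Keeps, for each account, the row with the greatest ts."""
    latest = {}
    for account_id, order_id, ts in orders:
        if account_id not in latest or ts > latest[account_id][2]:
            latest[account_id] = (account_id, order_id, ts)
    # Highest account_id first, to match the listing in the question.
    return sorted(latest.values(), reverse=True)


# Sample data from the question; the year 2020 is an assumption.
orders = [
    (5, 178, date(2020, 7, 1)),
    (5, 129, date(2020, 7, 6)),
    (4, 190, date(2020, 7, 1)),
    (4, 181, date(2020, 7, 9)),
    (3, 348, date(2020, 7, 1)),
    (3, 578, date(2020, 7, 4)),
    (3, 198, date(2020, 7, 1)),
    (3, 270, date(2020, 7, 12)),
]
last_orders = latest_order_per_account(orders)
```

If two rows of one account share the same ts, this sketch keeps the first one it sees; the SQL versions would need an extra tie-breaker in the ORDER BY to be deterministic in that case.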

Using outer query result in a subquery in postgresql

I have two tables points and contacts and I'm trying to get the average points.score per contact grouped on a monthly basis. Note that points and contacts aren't related, I just want the sum of points created in a month divided by the number of contacts that existed in that month.
So, I need to sum points grouped by the created_at month, and I need to take the count of contacts FOR THAT MONTH ONLY. It's that last part that's tripping me up. I'm not sure how I can use a column from an outer query in the subquery. I tried something like this:
SELECT SUM(score) AS points_sum,
EXTRACT(month FROM created_at) AS month,
date_trunc('MONTH', created_at) + INTERVAL '1 month' AS next_month,
(SELECT COUNT(id) FROM contacts WHERE contacts.created_at <= next_month) as contact_count
FROM points
GROUP BY month, next_month
ORDER BY month
So, I'm extracting the actual month that my points are being summed, and at the same time, getting the beginning of the next_month so that I can say "Get me the count of contacts where their created at is < next_month"
But it complains that column next_month doesn't exist. This is understandable, as the subquery knows nothing about the outer query. Qualifying with points.next_month doesn't work either.
So can someone point me in the right direction of how to achieve this?
Tables:
Points
score | created_at
10 | "2011-11-15 21:44:00.363423"
11 | "2011-10-15 21:44:00.69667"
12 | "2011-09-15 21:44:00.773289"
13 | "2011-08-15 21:44:00.848838"
14 | "2011-07-15 21:44:00.924152"
Contacts
id | created_at
6 | "2011-07-15 21:43:17.534777"
5 | "2011-08-15 21:43:17.520828"
4 | "2011-09-15 21:43:17.506452"
3 | "2011-10-15 21:43:17.491848"
1 | "2011-11-15 21:42:54.759225"
sum, month and next_month (without the subselect)
sum | month | next_month
14 | 7 | "2011-08-01 00:00:00"
13 | 8 | "2011-09-01 00:00:00"
12 | 9 | "2011-10-01 00:00:00"
11 | 10 | "2011-11-01 00:00:00"
10 | 11 | "2011-12-01 00:00:00"
Edit
Now with running sum of contacts. My first draft used new contacts per month, which is obviously not what OP wants.
WITH c AS (
SELECT created_at
,count(id) OVER (order BY created_at) AS ct
FROM contacts
), p AS (
SELECT date_trunc('month', created_at) AS month
,sum(score) AS points_sum
FROM points
GROUP BY 1
)
SELECT p.month
,EXTRACT(month FROM p.month) AS month_nr
,p.points_sum
,( SELECT c.ct
FROM c
WHERE c.created_at < (p.month + interval '1 month')
ORDER BY c.created_at DESC
LIMIT 1) AS contacts
FROM p
ORDER BY 1
This works for any number of months across the years.
Assumes that no month is missing in the table points. If you want all months, including missing ones in points, generate a list of months with generate_series() and LEFT JOIN to it.
Build a running sum in a CTE with a window function.
Both CTEs are not strictly necessary; they are there for performance and simplification only.
Get contacts_count in a subselect.
Your original form of the query could work like this:
SELECT month
,EXTRACT(month FROM month) AS month_nr
,points_sum
,(SELECT count(*)
FROM contacts c
WHERE c.created_at < (p.month + interval '1 month')) AS contact_count
FROM (
SELECT date_trunc('MONTH', created_at) AS month
,sum(score) AS points_sum
FROM points p
GROUP BY 1
) p
ORDER BY 1
The fix for the immediate cause of your error is to put the aggregate into a subquery. You were mixing levels in a way that is impossible.
I expect my variant to be slightly faster with big tables. Not sure about smaller tables. Would be great if you'd report back with test results.
Plus a minor fix: < instead of <=.
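The expected numbers for the sample tables can be verified with a short Python sketch of the same idea: sum the points per month, then count the contacts created strictly before the start of the following month. The helper names are invented for this illustration, and the timestamps are the ones listed in the question.

```python
from bisect import bisect_left
from datetime import datetime


def month_start(ts):
    """Equivalent of date_trunc('month', ts)."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def next_month(m):
    """First instant of the month after m (m must be a month start)."""
    if m.month == 12:
        return m.replace(year=m.year + 1, month=1)
    return m.replace(month=m.month + 1)


def points_and_contacts(points, contacts):
    """points: list of (score, created_at); contacts: list of created_at.
    Returns [(month, points_sum, contacts_existing_by_end_of_month)]."""
    sums = {}
    for score, created_at in points:
        m = month_start(created_at)
        sums[m] = sums.get(m, 0) + score
    created_sorted = sorted(contacts)
    # bisect_left counts the contacts with created_at < next_month(m).
    return [
        (m, total, bisect_left(created_sorted, next_month(m)))
        for m, total in sorted(sums.items())
    ]


points = [
    (10, datetime(2011, 11, 15, 21, 44, 0, 363423)),
    (11, datetime(2011, 10, 15, 21, 44, 0, 696670)),
    (12, datetime(2011, 9, 15, 21, 44, 0, 773289)),
    (13, datetime(2011, 8, 15, 21, 44, 0, 848838)),
    (14, datetime(2011, 7, 15, 21, 44, 0, 924152)),
]
contacts = [
    datetime(2011, 7, 15, 21, 43, 17, 534777),
    datetime(2011, 8, 15, 21, 43, 17, 520828),
    datetime(2011, 9, 15, 21, 43, 17, 506452),
    datetime(2011, 10, 15, 21, 43, 17, 491848),
    datetime(2011, 11, 15, 21, 42, 54, 759225),
]
rows = points_and_contacts(points, contacts)
```

With one new contact per month from July to November, the contact count grows from 1 to 5 while the monthly point sums go from 14 down to 10.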