How to find the average difference in order time in SQL

The output of the subquery looks like this:
id itm_id paid_at ord_r total_r
17 3266 2013-05-25 08:27:17 1 3
17 3219 2013-05-25 08:27:17 2 3
17 3964 2013-05-25 08:27:17 3 3
25 2105 2013-05-17 03:11:48 1 2
25 1376 2013-05-17 03:11:48 2 2
63 2140 2013-07-07 11:26:45 1 3
The code below is meant to find the average difference in order time. I found this piece of code while searching, but I don't understand why it uses (total_order-1).
Could someone kindly explain the process?
I am running this on Mode Analytics, and I don't know which database engine it uses, whether MySQL or SQL Server.
SELECT
user_id,
item_id,
CASE WHEN total_order-1 > 0
THEN datediff(day, max(paid_at), min(paid_at))/ (total_order-1)
ELSE datediff(day, max(paid_at), min(paid_at)) END AS avg_time
FROM
(SELECT
user_id,
item_id,
paid_at,
ROW_NUMBER( ) OVER (PARTITION BY user_id ORDER by paid_at ASC) AS order_rank,
COUNT(item_id) OVER(PARTITION BY user_id ORDER BY paid_at ASC) AS total_order
from
dsv1069.orders) user_level
But the problem is this error:
ERROR: column "day" does not exist

Related

How to find the number of events for the first 24 hours for each user id

I'm working on snowflake to solve a problem. I wanted to find the number of events for the first 24 hours for each user id.
This is a snippet of the database table I'm working on. I modified the table and used a date format without the time for simplification purposes.
user_id  client_event_time
1        2022-07-28
1        2022-07-29
1        2022-08-21
2        2022-07-29
2        2022-07-30
2        2022-08-03
I used the following approach to find the minimum event time per user_id.
SELECT user_id, client_event_time,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY client_event_time) row_number,
MIN(client_event_time) OVER (PARTITION BY user_id) MinEventTime
FROM Data
ORDER BY user_id, client_event_time;
user_id  client_event_time  row_number  MinEventTime
1        2022-07-28         1           2022-07-28
1        2022-07-29         2           2022-07-28
1        2022-08-21         3           2022-07-28
2        2022-07-29         1           2022-07-29
2        2022-07-30         2           2022-07-29
2        2022-08-03         3           2022-07-29
Then I tried to find the difference between the minimum event time and client_event_time, and if the difference is less than or equal to 24, I counted the client_event_time.
with NewTable as (
(SELECT user_id,client_event_time, event_type,
row_number() over (partition by user_id order by CLIENT_EVENT_TIME) row_number,
MIN(client_event_time) OVER (PARTITION BY user_id) MinEventTime
FROM Data
ORDER BY user_id, client_event_time))
SELECT user_id,
COUNT(case when timestampdiff(hh, client_event_time, MinEventTime) <= 24 then 1 else 0 end) AS duration
FROM NEWTABLE
GROUP BY user_id
I got the following result:
user_id  duration
1        3
2        3
I wanted to find the following result:
user_id  duration
1        2
2        2
Could you please help me solve this problem? Thanks!
This looks like a problem for windowed functions! I like them a lot.
Here's your sample data:
CREATE TABLE #table (user_id INT, client_event_time DATETIME)
INSERT INTO #table (user_id, client_event_time) VALUES
(1, '2022-07-28 13:30:00'),
(1, '2022-07-29 08:30:00'),
(1, '2022-08-21 12:34:56'),
(2, '2022-07-29 08:30:00'),
(2, '2022-07-30 13:30:00'),
(2, '2022-08-03 12:34:56')
I added some hours to it, so we can look at 24-hour windows more easily. User_id 1 has 2 events within the first 24 hours, counting the initial one; user_id 2 has only the initial one. We can capture that with a MIN OVER, along with the actual datetimes.
SELECT user_id, MIN(client_event_time) OVER (PARTITION BY user_id) AS FirstEventDateTime, client_event_time
FROM #table
user_id FirstEventDateTime client_event_time
-------------------------------------------------------
1 2022-07-28 13:30:00.000 2022-07-28 13:30:00.000
1 2022-07-28 13:30:00.000 2022-07-29 08:30:00.000
1 2022-07-28 13:30:00.000 2022-08-21 12:34:56.000
2 2022-07-29 08:30:00.000 2022-07-29 08:30:00.000
2 2022-07-29 08:30:00.000 2022-07-30 13:30:00.000
2 2022-07-29 08:30:00.000 2022-08-03 12:34:56.000
Now that we have the first datetime and each row's datetime together in the result set, we can make a comparison:
SELECT user_id, MIN(client_event_time) OVER (PARTITION BY user_id) AS FirstEventDateTime, client_event_time, CASE WHEN DATEDIFF(HOUR,MIN(client_event_time) OVER (PARTITION BY user_id), client_event_time) < 24 THEN 1 ELSE 0 END AS EventsInFirst24Hours
FROM #table
user_id FirstEventDateTime client_event_time EventsInFirst24Hours
----------------------------------------------------------------------------
1 2022-07-28 13:30:00.000 2022-07-28 13:30:00.000 1
1 2022-07-28 13:30:00.000 2022-07-29 08:30:00.000 1
1 2022-07-28 13:30:00.000 2022-08-21 12:34:56.000 0
2 2022-07-29 08:30:00.000 2022-07-29 08:30:00.000 1
2 2022-07-29 08:30:00.000 2022-07-30 13:30:00.000 0
2 2022-07-29 08:30:00.000 2022-08-03 12:34:56.000 0
Now we have an indicator telling us which events occurred in the first 24 hours, all we really need is to sum it, but SQL Server is mean about using a windowed function in another aggregate, so we need to cheat and put it into a subquery.
SELECT user_id, SUM(EventsInFirst24Hours) AS CountOfEventsInFirst24Hours
FROM (
SELECT user_id, MIN(client_event_time) OVER (PARTITION BY user_id) AS FirstEventDateTime, client_event_time, CASE WHEN DATEDIFF(HOUR,MIN(client_event_time) OVER (PARTITION BY user_id), client_event_time) < 24 THEN 1 ELSE 0 END AS EventsInFirst24Hours
FROM #table
) a
GROUP BY user_id
And that gets us to the result:
user_id CountOfEventsInFirst24Hours
-----------------------------------
1 2
2 1
A little about what's going on with the windowed function:
MIN - the aggregation we want it to do. The common aggregate functions have windowed counterparts.
(client_event_time) - the value we want to do it to.
OVER (PARTITION BY user_id) - the window we want to set up. In this case we want to know the minimum datetime for each of the user_ids.
We can partition by as many columns as we'd like.
You can also use an ORDER BY with as many columns as you'd like, but that was not necessary here. Ex:
OVER (PARTITION BY column1, column2 ORDER BY column4, column5 DESC)
Partition (or group by) column1 and column2 and order by column4 and column5 descending.
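To try the windowed approach without a SQL Server instance, here is a minimal sketch that runs the same idea in SQLite through Python's sqlite3 module. It assumes an SQLite build with window functions (3.25 or later); the table name events and the column in_first_24h are made up for this sketch, and julianday() stands in for DATEDIFF, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; the rows are the sample data from the answer above.
conn.execute("CREATE TABLE events (user_id INTEGER, client_event_time TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2022-07-28 13:30:00"),
        (1, "2022-07-29 08:30:00"),
        (1, "2022-08-21 12:34:56"),
        (2, "2022-07-29 08:30:00"),
        (2, "2022-07-30 13:30:00"),
        (2, "2022-08-03 12:34:56"),
    ],
)

# julianday() returns fractional days, so a difference below 1.0 means
# "less than 24 hours after the user's first event".
query = """
SELECT user_id, SUM(in_first_24h) AS events_in_first_24h
FROM (
    SELECT user_id,
           CASE WHEN julianday(client_event_time)
                     - MIN(julianday(client_event_time)) OVER (PARTITION BY user_id) < 1.0
                THEN 1 ELSE 0 END AS in_first_24h
    FROM events
) AS flagged
GROUP BY user_id
ORDER BY user_id
"""
counts = conn.execute(query).fetchall()
print(counts)
```

The inner query flags each row against the per-user minimum, and the outer query sums the flags, which mirrors the subquery workaround described above.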
This is easier with a QUALIFY clause:
with cte as
(select *
from mytable
qualify event_time<=min(event_time) over (partition by user_id) + interval '24 hours')
select user_id, count(*) as counts
from cte
group by user_id
If you want the count of events within 24 hours of the minimum event time, you can use a GROUP BY CTE that gives you the minimum event time for each user.
The rest is to select the rows that fall within the time limit:
WITH min_data as
(SELECT user_id,MIN(client_event_time) mindate FROM data GROUP BY user_id)
SELECT d.user_id, COUNT(*)
FROM data d JOIN min_data md ON d.user_id = md.user_id WHERE client_event_time <= mindate + INTERVAL '24 hour'
GROUP BY d.user_id
ORDER BY d.user_id
user_id  count
1        2
2        2
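As a rough check of the join approach, and of why the original query returned 3 for every user, here is a small sketch using SQLite through Python's sqlite3 module. It assumes SQLite 3.25 or later; julianday() replaces the interval arithmetic, and the aliases events_in_first_24h, with_else_0, without_else and days_since_first are invented for illustration. The second query shows that COUNT counts every non-NULL value, so rows mapped to 0 by ELSE 0 are still counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id INTEGER, client_event_time TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [
        (1, "2022-07-28"), (1, "2022-07-29"), (1, "2022-08-21"),
        (2, "2022-07-29"), (2, "2022-07-30"), (2, "2022-08-03"),
    ],
)

# Grouped CTE for the first event per user, then keep rows within one day of it.
join_query = """
WITH min_data AS (
    SELECT user_id, MIN(client_event_time) AS mindate
    FROM data
    GROUP BY user_id
)
SELECT d.user_id, COUNT(*) AS events_in_first_24h
FROM data AS d
JOIN min_data AS md ON d.user_id = md.user_id
WHERE julianday(d.client_event_time) <= julianday(md.mindate) + 1.0
GROUP BY d.user_id
ORDER BY d.user_id
"""
within_24h = conn.execute(join_query).fetchall()

# COUNT ignores only NULLs: "ELSE 0" still yields a non-NULL value per row.
count_query = """
SELECT user_id,
       COUNT(CASE WHEN days_since_first <= 1.0 THEN 1 ELSE 0 END) AS with_else_0,
       COUNT(CASE WHEN days_since_first <= 1.0 THEN 1 END) AS without_else
FROM (
    SELECT user_id,
           julianday(client_event_time)
             - MIN(julianday(client_event_time)) OVER (PARTITION BY user_id)
             AS days_since_first
    FROM data
) AS t
GROUP BY user_id
ORDER BY user_id
"""
count_comparison = conn.execute(count_query).fetchall()
print(within_24h, count_comparison)
```

Dropping the ELSE branch (or switching to SUM) is enough to make the conditional count behave as intended.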

MSSQL - Running sum with reset after gap

I have been trying to solve a problem for a few days now, but I just can't get it solved. Hence my question today.
I would like to calculate the running sum in the following table. My result so far looks like this:
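PersonID | Visit_date | Medication_intake | Previous_date | Date_diff | Running_sum
-------: | ---------- | ----------------: | ------------- | --------: | ----------:
1 | 2012-04-26 | 1 | | |
1 | 2012-11-16 | 1 | 2012-04-26 | 204 | 204
1 | 2013-04-11 | 0 | | |
1 | 2013-07-19 | 1 | | |
1 | 2013-12-05 | 1 | 2013-07-19 | 139 | 343
1 | 2014-03-18 | 1 | 2013-12-05 | 103 | 585
1 | 2014-06-24 | 0 | | |
2 | 2014-12-01 | 1 | | |
2 | 2015-03-09 | 1 | 2014-12-01 | 98 | 98
2 | 2015-09-28 | 0 | | |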
This is my desired result. So only the running sum over contiguous blocks (Medication_intake=1) should be calculated.
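PersonID | Visit_date | Medication_intake | Previous_date | Date_diff | Running_sum
-------: | ---------- | ----------------: | ------------- | --------: | ----------:
1 | 2012-04-26 | 1 | | |
1 | 2012-11-16 | 1 | 2012-04-26 | 204 | 204
1 | 2013-04-11 | 0 | | |
1 | 2013-07-19 | 1 | | |
1 | 2013-12-05 | 1 | 2013-07-19 | 139 | 139
1 | 2014-03-18 | 1 | 2013-12-05 | 103 | 242
1 | 2014-06-24 | 0 | | |
2 | 2014-12-01 | 1 | | |
2 | 2015-03-09 | 1 | 2014-12-01 | 98 | 98
2 | 2015-09-28 | 0 | | |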
I work with Microsoft SQL Server 2019 Express.
Thank you very much for your tips!
This is a gaps and islands problem, and one approach uses the difference in row numbers method:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY PersonID
ORDER BY Visit_date) rn1,
ROW_NUMBER() OVER (PARTITION BY PersonId, Medication_intake
ORDER BY Visit_date) rn2
FROM yourTable
)
SELECT PersonID, Visit_date, Medication_intake, Previous_date, Date_diff,
CASE WHEN Date_diff IS NOT NULL AND Medication_intake = 1
THEN SUM(Date_diff) OVER (PARTITION BY PersonID, rn1 - rn2
ORDER BY Visit_date) END AS Running_sum
FROM cte
ORDER BY PersonID, Visit_date;
The CASE expression in the outer query computes the rolling sum of Date_diff along islands of records having a Medication_intake value of 1. For other records, or for records where Date_diff is null, the value generated is simply null.
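The difference-in-row-numbers idea can be checked against the sample data with a short sketch. This one runs the same query shape in SQLite via Python's sqlite3 module rather than SQL Server, assumes SQLite 3.25 or later, and uses a made-up table name visits with Previous_date and Date_diff stored as plain columns, as in the question's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (PersonID INTEGER, Visit_date TEXT, "
    "Medication_intake INTEGER, Previous_date TEXT, Date_diff INTEGER)"
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2012-04-26", 1, None, None),
        (1, "2012-11-16", 1, "2012-04-26", 204),
        (1, "2013-04-11", 0, None, None),
        (1, "2013-07-19", 1, None, None),
        (1, "2013-12-05", 1, "2013-07-19", 139),
        (1, "2014-03-18", 1, "2013-12-05", 103),
        (1, "2014-06-24", 0, None, None),
        (2, "2014-12-01", 1, None, None),
        (2, "2015-03-09", 1, "2014-12-01", 98),
        (2, "2015-09-28", 0, None, None),
    ],
)

# rn1 - rn2 stays constant inside each run of equal Medication_intake values,
# so it can serve as the island key for the running sum.
query = """
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY Visit_date) AS rn1,
           ROW_NUMBER() OVER (PARTITION BY PersonID, Medication_intake
                              ORDER BY Visit_date) AS rn2
    FROM visits
)
SELECT PersonID, Visit_date,
       CASE WHEN Date_diff IS NOT NULL AND Medication_intake = 1
            THEN SUM(Date_diff) OVER (PARTITION BY PersonID, rn1 - rn2
                                      ORDER BY Visit_date) END AS Running_sum
FROM cte
ORDER BY PersonID, Visit_date
"""
running_sums = [row[2] for row in conn.execute(query)]
print(running_sums)
```

The sum restarts after the row with Medication_intake = 0 because that row changes rn1 - rn2 for everything that follows it.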

Efficient (linear time) nested queries in SQL

From this table:
events
id | event_date | event_score
-: | ---------- | ----------:
12 | 2020-04-10 |
13 | 2020-04-11 |
13 | 2020-04-14 | 8
13 | 2020-04-13 | 6
12 | 2020-04-15 |
14 | 2020-04-16 |
14 | 2020-04-17 |
14 | 2020-04-18 | 11
14 | 2020-04-19 |
14 | 2020-04-20 |
14 | 2020-04-22 |
12 | 2020-04-25 |
14 | 2020-04-30 |
I'm trying to get this result
results
id | first_score | last_score
-: | ----------: | ---------:
12 | |
13 | 6 | 8
14 | 11 | 11
One way to do that is through this query:
SELECT
DISTINCT id,
(
SELECT event_score
FROM events AS subquery
WHERE final_table.id=subquery.id
AND event_score IS NOT NULL
ORDER BY event_date
LIMIT 1
) AS `first score`,
(
SELECT event_score
FROM events AS subquery
WHERE final_table.id=subquery.id
AND event_score IS NOT NULL
ORDER BY event_date DESC
LIMIT 1
) AS `last score`
FROM sensors.events as final_table
but I suspect this takes quadratic time O(n*n) to compute. I know it can be done in linear time O(n) with Python but does anyone know how to do it in linear time with SQL?
The table is in MariaDB/MySQL
If you are running MariaDB 10.2.2 or higher, you could address this as a gaps-and-islands problem. The idea is to count how many non-null values appear on the preceding and following rows. We can then filter on the first non-null value in both directions, using conditional aggregation:
select id,
max(case when grp_asc = 1 then event_score end) as first_score,
max(case when grp_desc = 1 then event_score end) as last_score
from (
select e.*,
count(event_score) over(partition by id order by event_date) as grp_asc,
count(event_score) over(partition by id order by event_date desc) as grp_desc
from events e
) e
group by id
order by id
I cannot assess the time complexity of this algorithm, but I would suspect that this should run faster than your original query, that requires executing two subqueries per distinct id.
Demo on DB Fiddle:
id | first_score | last_score
-: | ----------: | ---------:
12 | null | null
13 | 6 | 8
14 | 11 | 11
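For anyone who wants to reproduce this result locally, here is a minimal sketch of the counting idea in SQLite through Python's sqlite3 module (SQLite 3.25 or later assumed, not MariaDB). It orders both windows by event_date, so the running count of non-null scores follows chronological order; the variable names are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, event_date TEXT, event_score INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (12, "2020-04-10", None), (13, "2020-04-11", None),
        (13, "2020-04-14", 8), (13, "2020-04-13", 6),
        (12, "2020-04-15", None), (14, "2020-04-16", None),
        (14, "2020-04-17", None), (14, "2020-04-18", 11),
        (14, "2020-04-19", None), (14, "2020-04-20", None),
        (14, "2020-04-22", None), (12, "2020-04-25", None),
        (14, "2020-04-30", None),
    ],
)

# COUNT(event_score) skips NULLs, so the running count reaches 1 exactly at
# the first non-null score in each direction and stays 1 until the next one.
query = """
SELECT id,
       MAX(CASE WHEN grp_asc = 1 THEN event_score END) AS first_score,
       MAX(CASE WHEN grp_desc = 1 THEN event_score END) AS last_score
FROM (
    SELECT e.*,
           COUNT(event_score) OVER (PARTITION BY id ORDER BY event_date) AS grp_asc,
           COUNT(event_score) OVER (PARTITION BY id ORDER BY event_date DESC) AS grp_desc
    FROM events AS e
) AS counted
GROUP BY id
ORDER BY id
"""
scores = conn.execute(query).fetchall()
print(scores)
```

Rows that share a count of 1 with the first non-null score all have NULL scores themselves, so the MAX picks out the one real value.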
With an index on (id, event_date, event_score), this should be quite fast:
SELECT id,
(SELECT event_score
FROM events AS subquery
WHERE final_table.id = subquery.id AND event_score IS NOT NULL
ORDER BY event_date
LIMIT 1
) AS `first score`,
(SELECT event_score
FROM events AS subquery
WHERE final_table.id=subquery.id AND event_score IS NOT NULL
ORDER BY event_date DESC
LIMIT 1
) AS `last score`
FROM (SELECT DISTINCT e.id
FROM sensors.events e
) as final_table;
Note that this moves the SELECT DISTINCT to a subquery. This is to ensure that MariaDB does not actually use a "distinct" algorithm for the SELECT DISTINCT -- the other columns would probably cause that to happen.
However, this is O(n log n) because the subqueries need to sort a small amount of data for each id -- as well as using an index to get to the right place.
I cannot think of a way to do this O(n) in SQL. I'm pretty sure the following constructs are all O(n log n):
Using an index for each row.
Sorting any portion of the data.
Using any window function with an order by -- although the sort might be avoided if there is just the right index.
But, SQL queries are still fast, particularly with indexes.

Can't get the cumulative sum(running total) within a group in SQL Server

I'm trying to get a running total within a group but my current code just gives me an aggregate sum.
For example, my data looks like this
ID ShiftNum Status Type Rate HourlyWage Hours Total_Amount
12542 1 Full A 1 12.5 40 500
12542 1 Full A 1 12.5 35 420
12542 2 Full A 1 10 40 400
12542 2 Full B 1.2 10 40 480
17842 1 Full A 1 11 27 297
17842 1 Full B 1.3 11 30 429
And what I want is a running total within the same ID, Shift Number, and Status. For example, I want something like this as my final result
ID ShiftNum Status Type Rate HourlyWage Hours Total_Amount Running_Tot
12542 1 Full A 1 12.5 40 500 500
12542 1 Full A 1 12.5 35 420 920
12542 2 Full A 1 10 40 400 400
12542 2 Full B 1.2 10 40 480 880
17842 1 Full A 1 11 27 297 297
17842 1 Full B 1.3 11 30 429 726
However, my current code just gives me the total sum within each group. For example, 920, 920 for row 1&2. Here's my code.
Select a.*,
SUM(Hours) OVER (PARTITION BY ID, ShiftNum, Status ORDER BY ID, ShiftNum, Status) as Runnint_Tot
from table a
How do I fix my code to get the final result I want?
You need an ordering column that uniquely defines each row. There is not an obvious one in your data, but something like this:
SUM(Hours) OVER (PARTITION BY ID, ShiftNum, Status ORDER BY hours) as Running_Tot
Or:
SUM(Hours) OVER (PARTITION BY ID, ShiftNum, Status
ORDER BY (SELECT NULL)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as Running_Tot
The problem you are facing is because the ORDER BY keys have ties. The default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Note the RANGE. That means that all rows with ties are combined.
Also note that there is no utility to including the PARTITION BY keys in the ORDER BY (well . . . there is one exception in SQL Server if you don't care about the ordering, then including a key can be a handy short-cut). The ordering occurs within a partition.
If your rows can have exact duplicates, I would first suggest that you add a primary key. But, in the meantime, you could use:
with a as (
select a.*,
row_number() over (order by id, shiftnum, status) as seqnum
from tablea a
)
Select a.*,
SUM(Hours) OVER (PARTITION BY ID, ShiftNum, Status ORDER BY seqnum) as Running_Tot
from a;
The ordering will be arbitrary, but it will at least accumulate.
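To see the effect of the default RANGE frame on tied ORDER BY keys, here is a small sketch in SQLite via Python's sqlite3 module (SQLite 3.25 or later assumed; its default frame is also RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). The table name shifts is hypothetical, it sums Total_Amount to match the desired output, and SQLite's built-in rowid plays the role of a unique ordering column, which SQL Server does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shifts (ID INTEGER, ShiftNum INTEGER, Status TEXT, Total_Amount INTEGER)"
)
conn.executemany(
    "INSERT INTO shifts VALUES (?, ?, ?, ?)",
    [
        (12542, 1, "Full", 500), (12542, 1, "Full", 420),
        (12542, 2, "Full", 400), (12542, 2, "Full", 480),
        (17842, 1, "Full", 297), (17842, 1, "Full", 429),
    ],
)

query = """
SELECT
    -- Tied ORDER BY key + default RANGE frame: every peer row gets the group total.
    SUM(Total_Amount) OVER (PARTITION BY ID, ShiftNum, Status ORDER BY ID) AS range_total,
    -- Same tied key, but an explicit ROWS frame accumulates row by row
    -- (the order among tied rows is arbitrary).
    SUM(Total_Amount) OVER (PARTITION BY ID, ShiftNum, Status ORDER BY ID
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_total,
    -- Unique ordering key (rowid, SQLite-specific): no ties, deterministic result.
    SUM(Total_Amount) OVER (PARTITION BY ID, ShiftNum, Status ORDER BY rowid) AS unique_total
FROM shifts
ORDER BY rowid
"""
result = conn.execute(query).fetchall()
range_total = [r[0] for r in result]
rows_total = [r[1] for r in result]
unique_total = [r[2] for r in result]
print(range_total, rows_total, unique_total)
```

The first column reproduces the problem from the question, while the other two correspond to the two fixes suggested above.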

SQL ORDER BY with grouping

I have the following query
SELECT Id, Request, BookingDate, BookingId FROM Table ORDER BY Request DESC, Date
If a row has a similar ForeignKeyId, I would like that to go in before the next ordered row like:
Request Date ForeignKeyId
Request3 01-Jun-11 56
Request2 03-Jun-11 89
NULL 03-Jun-11 89
Request1 05-Jun-11 11
NULL 20-Jul-11 57
I have been looking at RANK and OVER but haven't found a simple fix.
EDIT
I've edited above to show the actual fields and pasted data using the following query from Andomar's answer
select *
from (
select row_number() over (partition by BookingId order by Request DESC) rn
, Request, BookingDate, BookingID
from Table
WHERE Date = '28 aug 11'
) G
order by
rn
, Request DESC, BookingDate
1 ffffff 23/01/2011 15:57 350821
1 ddddddd 10/01/2011 16:28 348856
1 ccccccc 13/09/2010 14:44 338120
1 aaaaaaaaaa 21/05/2011 20:21 364422
1 123 17/09/2010 16:32 339202
1 NULL NULL
2 gggggg 08/12/2010 14:39 346634
2 NULL NULL
2 17/09/2010 16:32 339202
2 NULL 10/04/2011 15:08 361066
2 NULL 02/05/2011 14:12 362619
2 NULL 11/06/2011 13:55 366082
3 NULL NULL
3 16/10/2010 13:06 343023
3 22/10/2010 10:35 343479
3 30/04/2011 10:49 362435
The rows with booking ID 339202 should appear next to each other, but they don't.
You could partition by ForeignKeyId, then sort each second or later row directly below its "head", where the "head" is the first row for that ForeignKeyId. For example, sorting on Request:
; with numbered as
(
select row_number() over (partition by ForeignKeyID order by Request) rn
, *
from #t
)
select *
from numbered n1
order by
(
select Request
from numbered n2
where n2.ForeignKeyID = n1.ForeignKeyID
and n2.rn = 1
)
, n1.Request
The subquery looks up the Request value of each group's head row (rn = 1), so every row sorts first by its group's head and then by its own Request.
Full example at SE Data.
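As a quick way to test the head-row ordering against the sample rows from the question, here is a sketch in SQLite through Python's sqlite3 module (SQLite 3.25 or later assumed). The table name t is made up, and both sorts are descending to match the question's ORDER BY Request DESC; it relies on NULL sorting last in descending order, which holds in SQLite and, as far as I know, in SQL Server too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (Request TEXT, Date TEXT, ForeignKeyId INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("Request1", "05-Jun-11", 11),
        (None, "20-Jul-11", 57),
        ("Request2", "03-Jun-11", 89),
        ("Request3", "01-Jun-11", 56),
        (None, "03-Jun-11", 89),
    ],
)

# rn = 1 marks the head of each ForeignKeyId group; every row is sorted by
# its head's Request first, then by its own Request within the group.
query = """
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY ForeignKeyId ORDER BY Request DESC) AS rn,
           Request, Date, ForeignKeyId
    FROM t
)
SELECT n1.Request, n1.ForeignKeyId
FROM numbered AS n1
ORDER BY (
    SELECT n2.Request
    FROM numbered AS n2
    WHERE n2.ForeignKeyId = n1.ForeignKeyId
      AND n2.rn = 1
) DESC,
n1.Request DESC
"""
ordered = conn.execute(query).fetchall()
print(ordered)
```

The NULL row for ForeignKeyId 89 ends up directly below Request2 because both rows share the same head value.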