Group by date and find median of processing time - sql

I select an input date and an output date from a database and use a formula to compute the processing time. Now I would like the values to be grouped by input date (date of receipt), with the median of the processing time output for each group. Something like this:
The data I select:
input date | output date | processing time
2022-01-03 | 2022-01-03 | 0
2022-01-03 | 2022-01-06 | 3
2022-01-03 | 2022-01-11 | 8
2022-01-05 | 2022-01-10 | 5
2022-01-05 | 2022-01-15 | 10
The output I want:
input date | processing time
2022-01-03 | 3
2022-01-05 | 7.5
My SQL Code:
SELECT [received_date]
      ,CONVERT(date, [exported_on])
      ,DATEDIFF(day, [received_date], [exported_on]) AS processing_time
FROM [request]
WHERE YEAR(received_date) = 2022
GROUP BY received_date, [exported_on]
ORDER BY received_date
How can I do this? Do I need a temp table to do this, or can I modify my query?

You could try using PERCENTILE_CONT
with cte as (
    select input_date,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY processing_time)
               OVER (PARTITION BY input_date) as Median_Process_Time
    FROM tableA
)
SELECT *
FROM cte
GROUP BY input_date, Median_Process_Time
db fiddle
Also, you can check out the discussion here: How to find the SQL medians for a grouping
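Applied directly to the original query, a minimal sketch (reusing the received_date / exported_on columns and the request table from the question; untested, so treat it as a starting point):

WITH cte AS (
    SELECT received_date,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY DATEDIFF(day, received_date, exported_on))
               OVER (PARTITION BY received_date) AS median_processing_time
    FROM [request]
    WHERE YEAR(received_date) = 2022
)
SELECT DISTINCT received_date, median_processing_time
FROM cte
ORDER BY received_date;

DISTINCT does the same de-duplication as the GROUP BY in the answer above, since PERCENTILE_CONT ... OVER returns the same median on every row of a partition.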

Here is my solution. Thank you for your help.
SET NOCOUNT ON;
DECLARE @working TABLE(entry_date date, exit_date date, work_time int);

INSERT INTO @working
SELECT [received] AS date_of_entry
      ,CONVERT(date, [exported]) AS date_of_exit
      ,DATEDIFF(day, [received], [exported]) AS processing_time
FROM [zsdt].[dbo].[antrag]
WHERE YEAR([received]) = 2022
  AND scanner_name IS NOT NULL
  AND exportiert_am IS NOT NULL
  AND NOT scanner_name = 'AP99'
GROUP BY [received], [exported]
ORDER BY [received] ASC;

WITH CTE AS
(   SELECT entry_date,
           work_time,
           [half1] = NTILE(2) OVER(PARTITION BY entry_date ORDER BY work_time),
           [half2] = NTILE(2) OVER(PARTITION BY entry_date ORDER BY work_time DESC)
    FROM @working
    WHERE work_time IS NOT NULL
)
SELECT entry_date,
       (MAX(CASE WHEN half1 = 1 THEN work_time END) +
        MIN(CASE WHEN half2 = 1 THEN work_time END)) / 2.0 AS median_work_time
FROM CTE
GROUP BY entry_date;
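For reference, the NTILE(2) trick works because the ascending tile's MAX and the descending tile's MIN are the two middle values of each entry_date partition (the same value twice when the row count is odd), so (MAX + MIN) / 2.0 is the median.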

Related

SQL, rank for each instance of a partition

I am trying to create a rank for each instance of a status occurring, for example:
| ID | Status      | From_date  | To_date    | rank |
| 1  | Available   | 2022-01-01 | 2022-01-02 | 1    |
| 1  | Available   | 2022-01-02 | 2022-01-03 | 1    |
| 1  | Unavailable | 2022-01-03 | 2022-01-10 | 2    |
| 1  | Available   | 2022-01-10 | 2022-01-20 | 3    |
The rank should increment, for each ID, at each new instance of a status, ordered by from_date ascending.
I want to do this as I see it as the best way of getting to the final result I want, which is:
| ID | Status      | From_date  | To_date    | rank |
| 1  | Available   | 2022-01-01 | 2022-01-03 | 1    |
| 1  | Unavailable | 2022-01-03 | 2022-01-10 | 2    |
| 1  | Available   | 2022-01-10 | 2022-01-20 | 3    |
I tried dense_rank() over (partition by id order by status, from_date) but can see now why that wouldn't work. Not sure how to get to this result.
So with this CTE for the data:
with data(ID, Status, From_date, To_date) as (
select * from values
(1, 'Available', '2022-01-01', '2022-01-02'),
(1, 'Available', '2022-01-02', '2022-01-03'),
(1, 'Unavailable', '2022-01-03', '2022-01-10'),
(1, 'Available', '2022-01-10', '2022-01-20')
)
the first result, the rank, can be done with CONDITIONAL_CHANGE_EVENT:
select *
,CONDITIONAL_CHANGE_EVENT( Status ) OVER ( PARTITION BY ID ORDER BY From_date ) as rank
from data;
| ID | STATUS      | FROM_DATE  | TO_DATE    | RANK |
| 1  | Available   | 2022-01-01 | 2022-01-02 | 0    |
| 1  | Available   | 2022-01-02 | 2022-01-03 | 0    |
| 1  | Unavailable | 2022-01-03 | 2022-01-10 | 1    |
| 1  | Available   | 2022-01-10 | 2022-01-20 | 2    |
and thus keeping the first row of each rank can be achieved with QUALIFY/ROW_NUMBER. Because CONDITIONAL_CHANGE_EVENT is a complex operation, it needs wrapping in a sub-select, so the answer is not as short as I would like:
select * from (
select *
,CONDITIONAL_CHANGE_EVENT( Status ) OVER ( PARTITION BY ID ORDER BY From_date ) as rank
from data
)
qualify row_number() over(partition by id, rank ORDER BY From_date ) = 1
gives:
| ID | STATUS      | FROM_DATE  | TO_DATE    | RANK |
| 1  | Available   | 2022-01-01 | 2022-01-02 | 0    |
| 1  | Unavailable | 2022-01-03 | 2022-01-10 | 1    |
| 1  | Available   | 2022-01-10 | 2022-01-20 | 2    |
Also, the final result minus the ranking can be done with:
select *
from data
qualify nvl(Status <> lag(status) over ( PARTITION BY ID ORDER BY From_date ), true)
| ID | STATUS      | FROM_DATE  | TO_DATE    |
| 1  | Available   | 2022-01-01 | 2022-01-02 |
| 1  | Unavailable | 2022-01-03 | 2022-01-10 |
| 1  | Available   | 2022-01-10 | 2022-01-20 |
and thus a rank can be added at the end:
select *
,rank() over ( PARTITION BY ID ORDER BY From_date ) as rank
from (
select *
from data
qualify nvl(Status <> lag(status) over ( PARTITION BY ID ORDER BY From_date ), true)
)
| ID | STATUS      | FROM_DATE  | TO_DATE    | RANK |
| 1  | Available   | 2022-01-01 | 2022-01-02 | 1    |
| 1  | Unavailable | 2022-01-03 | 2022-01-10 | 2    |
| 1  | Available   | 2022-01-10 | 2022-01-20 | 3    |
This is a typical gaps-and-islands problem, where islands are groups of consecutive rows that have the same status.
Here is one way to solve it with window functions:
select id, status,
min(from_date) from_date, max(to_date) to_date,
row_number() over (partition by id order by min(from_date)) rn
from (
select t.*,
row_number() over (partition by id order by from_date) rn1,
row_number() over (partition by id, status order by from_date) rn2
from mytable t
) t
group by id, status, rn1 - rn2
order by min(from_date)
This works by ranking rows within two different partitions (with and without the status); the difference between the row numbers defines the islands.
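To make this concrete, here is what the inner query produces for the sample data (values traced by hand):
| ID | Status      | From_date  | rn1 | rn2 | rn1 - rn2 |
| 1  | Available   | 2022-01-01 | 1   | 1   | 0         |
| 1  | Available   | 2022-01-02 | 2   | 2   | 0         |
| 1  | Unavailable | 2022-01-03 | 3   | 1   | 2         |
| 1  | Available   | 2022-01-10 | 4   | 3   | 1         |
Each distinct (status, rn1 - rn2) pair identifies one island.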
You can group consecutive statuses using conditional_change_event, then collapse the dates using min and max, and finally use row_number() to rank the events:
with cte as
(
    select *,
           conditional_change_event(status) over (partition by id order by from_date) as rn
    from t
)
select id,
       status,
       min(from_date) as from_date,
       max(to_date) as to_date,
       row_number() over (partition by id order by min(from_date), max(to_date)) as rank
from cte
group by id, status, rn
order by rank
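Note that CONDITIONAL_CHANGE_EVENT and QUALIFY are not standard SQL (they exist in Snowflake and a handful of other dialects), whereas the row-number-difference approach above works in any database with window functions.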

Generate multiple rows of new column based on one value of another column

I have a table like below:
| ID | Date       |
| 1  | 2022-01-01 |
| 2  | 2022-03-21 |
I want to add a new column based on the date, and it should look like this:
| ID | Date       | NewCol     |
| 1  | 2022-01-01 | 2022-02-01 |
| 1  | 2022-01-01 | 2022-03-01 |
| 1  | 2022-01-01 | 2022-04-01 |
| 1  | 2022-01-01 | 2022-05-01 |
| 2  | 2022-03-21 | 2022-04-21 |
| 2  | 2022-03-21 | 2022-05-21 |
Let's say that there is an @EndDate = 2022-05-31 (that's where it should stop).
I'm having a hard time trying to figure out how to do it in SSMS. Would appreciate any insights! Thanks :)
In the following solutions we leverage string_split in combination with replicate to generate new records.
select ID
,Date
,dateadd(month, row_number() over(partition by ID order by (select null)), Date) as NewCol
from (
select *
from t
outer apply string_split(replicate(',',datediff(month, Date, '2022-05-31')-1),',')
) t
| ID | Date       | NewCol     |
| 1  | 2022-01-01 | 2022-02-01 |
| 1  | 2022-01-01 | 2022-03-01 |
| 1  | 2022-01-01 | 2022-04-01 |
| 1  | 2022-01-01 | 2022-05-01 |
| 2  | 2022-03-21 | 2022-04-21 |
| 2  | 2022-03-21 | 2022-05-21 |
Fiddle
For Azure SQL and SQL Server 2022 we have a cleaner solution based on the ordinal output column:
"The enable_ordinal argument and ordinal output column are currently
supported in Azure SQL Database, Azure SQL Managed Instance, and Azure
Synapse Analytics (serverless SQL pool only). Beginning with SQL
Server 2022 (16.x) Preview, the argument and output column are
available in SQL Server."
select ID
,Date
,dateadd(month, ordinal, Date) as NewCol
from (
select *
from t
outer apply string_split(replicate(',',datediff(month, Date, '2022-05-31')-1),',',1)
) t
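Another option is a recursive CTE that adds one month per recursion step until the end date is reached: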
with cal (id, dt) as
(
    select id, date as dt from t
    union all
    select id, dateadd(month, 1, dt) from cal
    where dateadd(month, 1, dt) <= '2022-05-31' -- comparing month(dt) alone would break across year boundaries
)
select t.id
      ,t.date
      ,cal.dt as new_col
from cal
join t on t.id = cal.id and t.date != cal.dt
order by id, new_col
| id | date       | new_col    |
| 1  | 2022-01-01 | 2022-02-01 |
| 1  | 2022-01-01 | 2022-03-01 |
| 1  | 2022-01-01 | 2022-04-01 |
| 1  | 2022-01-01 | 2022-05-01 |
| 2  | 2022-03-21 | 2022-04-21 |
| 2  | 2022-03-21 | 2022-05-21 |
Fiddle
There are many ways to "explode" a row into a set; the simplest in my opinion is a recursive CTE:
DECLARE @endpoint date = '20220531';
DECLARE @prev date = DATEADD(MONTH, -1, @endpoint);

WITH x AS
(
    SELECT ID, date, NewCol = DATEADD(MONTH, 1, date) FROM #d
    UNION ALL
    SELECT ID, date, DATEADD(MONTH, 1, NewCol) FROM x
    WHERE NewCol < @prev
)
SELECT * FROM x
ORDER BY ID, NewCol;
Working example in this fiddle.
Keep in mind that if you could have > 100 months you'll need to add an OPTION (MAXRECURSION n) hint (or just consider using a different solution at scale).
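On SQL Server 2022 (database compatibility level 160), GENERATE_SERIES can replace both the string_split trick and the recursion. A minimal sketch, assuming the same table t and end date as above (the WHERE guard skips rows already in the end month, where the series would be empty or reversed):

DECLARE @EndDate date = '2022-05-31';

SELECT t.ID,
       t.Date,
       DATEADD(MONTH, s.value, t.Date) AS NewCol -- one row per month after Date, up to @EndDate
FROM t
CROSS APPLY GENERATE_SERIES(1, DATEDIFF(MONTH, t.Date, @EndDate)) AS s
WHERE DATEDIFF(MONTH, t.Date, @EndDate) >= 1;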

Create sql Key based on datetime that is persistent overnight

I have a time series with a table like this:
| CarId | EventDateTime    | Event | SessionFlag | ExpectedKey   |
| 1     | 2022-01-01 7:00  | Start | 1           | 1-20220101-7  |
| 1     | 2022-01-01 7:05  | Drive | 1           | 1-20220101-7  |
| 1     | 2022-01-01 8:00  | Park  | 1           | 1-20220101-7  |
| 1     | 2022-01-01 10:00 | Drive | 1           | 1-20220101-7  |
| 1     | 2022-01-01 18:05 | End   | 0           | 1-20220101-7  |
| 1     | 2022-01-01 23:00 | Start | 1           | 1-20220101-23 |
| 1     | 2022-01-01 23:05 | Drive | 1           | 1-20220101-23 |
| 1     | 2022-01-02 2:00  | Park  | 1           | 1-20220101-23 |
| 1     | 2022-01-02 3:00  | Drive | 1           | 1-20220101-23 |
| 1     | 2022-01-02 15:00 | End   | 0           | 1-20220101-23 |
| 1     | 2022-01-02 16:00 | Start | 1           | 1-20220102-16 |
Other CarIds do exist.
What I am attempting to do is create the last column, ExpectedKey.
The problem I face though is midnight, as the same session can exist over two days.
The record above with ExpectedKey 1-20220101-23 is the prime example of what I'm trying to achieve.
I've played with using:
CASE
    WHEN SessionFlag <> 0
     AND SessionFlag = LAG(SessionFlag) OVER (PARTITION BY CarId ORDER BY EventDateTime)
    THEN FIRST_VALUE(CarId + '-' + CONVERT(CHAR(8), EventDateTime, 112) + '-' +
                     CAST(DATEPART(HOUR, EventDateTime) AS VARCHAR))
         OVER (PARTITION BY CarId ORDER BY EventDateTime)
    ELSE CarId + '-' + CONVERT(CHAR(8), EventDateTime, 112) + '-' +
         CAST(DATEPART(HOUR, EventDateTime) AS VARCHAR)
END AS SessionId
But can't seem to make it partition correctly overnight.
Can anyone offer advice?
This is a classic gaps-and-islands problem. There are a number of solutions.
The simplest (if not the most efficient) is partitioning over a windowed conditional count:
WITH Groups AS (
SELECT *,
GroupId = COUNT(CASE WHEN t.Event = 'Start' THEN 1 END)
OVER (PARTITION BY t.CarId ORDER BY t.EventDateTime)
FROM YourTable t
)
SELECT *,
       NewKey = CONCAT_WS('-',
           t.CarId,
           FIRST_VALUE(CONVERT(varchar(8), t.EventDateTime, 112))
               OVER (PARTITION BY t.CarId, t.GroupId ORDER BY t.EventDateTime
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
           FIRST_VALUE(DATEPART(hour, t.EventDateTime))
               OVER (PARTITION BY t.CarId, t.GroupId ORDER BY t.EventDateTime
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))
FROM Groups t;
db<>fiddle
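The conditional COUNT only counts 'Start' rows, so the running total ticks up at each session start and every row until the next Start shares one GroupId, even across midnight; FIRST_VALUE then stamps each row in the group with the date and hour of that session's Start row.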
Using APPLY to get the Start event datetime and forming the key with concat_ws:
select *
from time_series t
cross apply
(
    select top 1
           ExpectedKey = concat_ws('-',
                                   x.CarId,
                                   convert(varchar(10), x.EventDateTime, 112),
                                   datepart(hour, x.EventDateTime))
    from time_series x
    where x.CarId = t.CarId
      and x.Event = 'Start'
      and x.EventDateTime <= t.EventDateTime
    order by x.EventDateTime desc
) k

group a set of records by date in teradata

Currently I have data in a table as shown below:
date id value
1-Jan-13 1 100
2-Jan-13 1 100
3-Jan-13 1 100
4-Jan-13 1 200
5-Jan-13 1 200
6-Jan-13 1 100
7-Jan-13 1 100
I am trying to group the records based on id and value, and version the records with a start date and an end date.
Desired output:
start date end date id value
1-Jan-13 3-Jan-13 1 100
4-Jan-13 5-Jan-13 1 200
6-Jan-13 7-Jan-13 1 100
I'm not an expert in Teradata, but since windowing functions are supported (specifically ROW_NUMBER), you will most likely be able to do something like this:
SELECT MIN(date) start_date, MAX(date) end_date, id, value
FROM
(
    SELECT date, id, value,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) -
           ROW_NUMBER() OVER (PARTITION BY id, value ORDER BY date) island
    FROM table1
) q
GROUP BY id, value, island
ORDER BY start_date, end_date
Sample output:
| START_DATE | END_DATE | ID | VALUE |
|------------|------------|----|-------|
| 2013-01-01 | 2013-01-03 | 1 | 100 |
| 2013-01-04 | 2013-01-05 | 1 | 200 |
| 2013-01-06 | 2013-01-07 | 1 | 100 |
Here is a SQLFiddle demo (it's a SQL Server demo, but should work as expected in Teradata).
The ROW_NUMBER version can be further simplified: modified SQL Fiddle
For Teradata:
SELECT
id,val,MIN(dt),MAX(dt)
FROM
(
SELECT
id,val,dt,
dt - ROW_NUMBER() OVER (PARTITION BY id ORDER BY val, dt) AS dummy
FROM table1
) AS dt
GROUP BY 1,2,dummy
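The dt - ROW_NUMBER() trick works because, within a run of consecutive dates sharing the same val, both dt and the row number increase by one per row, so their difference is constant; the difference only shifts when the value changes (or a date is skipped), which starts a new group.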
And there are some little-known functions in TD13.10 for processing time-series data:
WITH cte(id,val,pd) AS
(
SELECT id, val, PERIOD(dt, dt+1) AS pd
FROM table1
)
SELECT
id, val,
BEGIN(pd) AS start_dt,
LAST(pd) AS end_dt
FROM
TABLE (TD_NORMALIZE_MEET
(NEW VARIANT_TYPE(cte.id,cte.val)
,cte.pd)
RETURNS (id INTEGER
,val INTEGER
,pd PERIOD(DATE)
,Nrm_Count INTEGER)
HASH BY id
LOCAL ORDER BY id, val, pd
) A
ORDER BY start_dt, end_dt

Select distinct users group by time range

I have a table with the following info
|date | user_id | week_beg | month_beg|
SQL to create table with test values:
CREATE TABLE uniques
(
date DATE,
user_id INT,
week_beg DATE,
month_beg DATE
)
INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01')
INPUT TABLE:
| date | user_id | week_beg | month_beg |
| 2013-01-01 | 1 | 2012-12-30 | 2013-01-01 |
| 2013-01-03 | 3 | 2012-12-30 | 2013-01-01 |
| 2013-01-06 | 4 | 2013-01-06 | 2013-01-01 |
| 2013-01-07 | 4 | 2013-01-06 | 2013-01-01 |
OUTPUT TABLE:
| date | time_series | cnt |
| 2013-01-01 | D | 1 |
| 2013-01-01 | W | 1 |
| 2013-01-01 | M | 1 |
| 2013-01-03 | D | 1 |
| 2013-01-03 | W | 2 |
| 2013-01-03 | M | 2 |
| 2013-01-06 | D | 1 |
| 2013-01-06 | W | 1 |
| 2013-01-06 | M | 3 |
| 2013-01-07 | D | 1 |
| 2013-01-07 | W | 1 |
| 2013-01-07 | M | 3 |
I want to calculate the number of distinct user_id's for a date:
1. For that date
2. For that week up to that date (week to date)
3. For the month up to that date (month to date)
1 is easy to calculate.
For 2 and 3 I am trying queries like these:
SELECT
    date,
    'W' AS "time_series",
    COUNT(DISTINCT user_id) OVER (PARTITION BY week_beg) AS "cnt"
FROM user_subtitles;

SELECT
    date,
    'M' AS "time_series",
    COUNT(DISTINCT user_id) OVER (PARTITION BY month_beg) AS "cnt"
FROM user_subtitles;
Postgres does not allow window functions for DISTINCT calculation, so this approach does not work.
I have also tried out a GROUP BY approach, but it does not work as it gives me numbers for whole week/months.
What's the best way to approach this problem?
Count all rows
SELECT date, '1_D' AS time_series, count(DISTINCT user_id) AS cnt
FROM uniques
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM uniques
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM uniques
ORDER BY 1, time_series
Your columns week_beg and month_beg are 100% redundant and can easily be replaced by
date_trunc('week', date + 1) - 1 and date_trunc('month', date) respectively.
Your week seems to start on Sunday (off by one), therefore the + 1 .. - 1.
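For example, for 2013-01-01 (a Tuesday):

SELECT date_trunc('week', DATE '2013-01-01' + 1)::date - 1; -- 2012-12-30, the Sunday-based week_beg above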
The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That's exactly what you need.
Use UNION ALL, not UNION.
Your unfortunate choice for time_series (D, W, M) does not sort well; I renamed it to make the final ORDER BY easier.
This query can deal with multiple rows per day. Counts include all peers for a day.
More about DISTINCT ON:
Select first row in each GROUP BY group?
DISTINCT users per day
To count every user only once per day, use a CTE with DISTINCT ON:
WITH x AS (SELECT DISTINCT ON (1,2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM x
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W'
,count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1)
ORDER BY date)
FROM x
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M'
,count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM x
ORDER BY 1, 2
DISTINCT users over dynamic period of time
You can always resort to correlated subqueries. They tend to be slow with big tables!
Building on the previous queries:
WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1,2)
,d AS (
SELECT date
,(date_trunc('week', date + 1)::date - 1) AS week_beg
,date_trunc('month', date)::date AS month_beg
FROM uniques
GROUP BY 1
)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM du
GROUP BY 1
UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.week_beg AND d.date )
FROM d
GROUP BY date, week_beg
UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.month_beg AND d.date)
FROM d
GROUP BY date, month_beg
ORDER BY 1,2;
SQL Fiddle for all three solutions.
Faster with dense_rank()
@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.
Building on a simplified and sanitized data model:
- without the redundant columns
- day as column name instead of date
date is a reserved word in standard SQL and a basic type name in PostgreSQL, and shouldn't be used as an identifier.
CREATE TABLE uniques(
day date -- instead of "date"
,user_id int
);
Improved query:
WITH du AS (
SELECT DISTINCT ON (1, 2)
day, user_id
,date_trunc('week', day + 1)::date - 1 AS week_beg
,date_trunc('month', day)::date AS month_beg
FROM uniques
)
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM (
SELECT user_id, day
,dense_rank() OVER(PARTITION BY week_beg ORDER BY user_id) AS w
,dense_rank() OVER(PARTITION BY month_beg ORDER BY user_id) AS m
FROM du
) s
GROUP BY day
ORDER BY day;
SQL Fiddle demonstrating the performance of 4 faster variants. It depends on your data distribution which is fastest for you.
All of them are about 10x as fast as the correlated subqueries version (which isn't bad for correlated subqueries).
Without correlated subqueries. SQL Fiddle
with u as (
select
"date", user_id,
date_trunc('week', "date" + 1)::date - 1 week_beg,
date_trunc('month', "date")::date month_beg
from uniques
)
select
"date", count(distinct user_id) D,
max(week_dr) W, max(month_dr) M
from (
select
user_id, "date",
dense_rank() over(partition by week_beg order by user_id) week_dr,
dense_rank() over(partition by month_beg order by user_id) month_dr
from u
) s
group by "date"
order by "date"
Try
SELECT *
FROM
(
    SELECT dates, count(user_id), 'D' as timeseries FROM users_data GROUP BY dates
    UNION
    SELECT max(dates), count(user_id), 'W' FROM users_data GROUP BY date_part('year',dates), date_part('week',dates)
    UNION
    SELECT max(dates), count(user_id), 'M' FROM users_data GROUP BY date_part('year',dates), date_part('month',dates)
) temp ORDER BY dates, timeseries
SQLFIDDLE
Try a query like this:
SELECT count(distinct user_id), date_format(date, '%Y-%m-%d') as date_period
FROM uniques
GROUP BY date_period
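Note that date_format is MySQL syntax; in Postgres the equivalent would be to_char(date, 'YYYY-MM-DD'), though since the column is already a date you can simply GROUP BY date.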