Getting a single 90th percentile value of the data - sql

I want to find the 90th percentile of the date difference. However, the result returns multiple rows of the percentile data. I only want a single row stating the 90th percentile, and I am having trouble getting that.
Here is the sample data.
| OWNER_ID | CREATED_TIME | STATUS_ID |
|---|---|---|
| 1 | 2020-07-16 08:29:29.000 | NEW |
| 1 | 2022-02-21 04:38:01.000 | PROCESSED |
| 3 | 2022-02-28 14:24:28.000 | 1 |
| 3 | 2022-02-28 14:27:32.000 | CONVERTED |
| 4 | 2022-02-28 14:33:06.000 | NEW |
| 4 | 2022-02-28 14:33:19.000 | IN_PROCESS |
| 5 | 2022-03-01 12:01:48.000 | NEW |
| 5 | 2022-03-01 12:02:00.000 | IN_PROCESS |
This is my query for this.
SELECT
    'percentile count' as name,
    PERCENTILE_CONT(0.9) within group (order by temp.diff) over () as percentile_90
FROM
    (SELECT OWNER_ID,
            DATEDIFF(hour,
                     MIN (CASE WHEN STATUS_ID = 'NEW' THEN CREATED_TIME END),
                     MAX (CASE WHEN STATUS_ID = 'IN_PROCESS' THEN CREATED_TIME END) ) as diff
     FROM table
     WHERE CREATED_DATE Between DATEADD(DAY, -14, GETDATE()) AND GETDATE()
       AND RESPONSIBLE_ID IN (2731,2727,2702,2730,2701,2699,2696)
     GROUP BY OWNER_ID) temp
The logic is to first get the date difference between STATUS_ID = NEW and STATUS_ID = IN_PROCESS for each OWNER_ID. Then I want to know the overall 90th percentile of the date differences, returned as a single result row.
My desired output is in this format containing just one row:
| Name | 90th Percentile |
|---|---|
| Percentile count | 100 |
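One way to read the symptom: with an OVER () clause, PERCENTILE_CONT is evaluated as a window function, so the same value is attached to every row of the derived table, and collapsing those identical rows (for instance with SELECT DISTINCT) is one common way to end up with a single row. The Python sketch below only illustrates what that single value is, assuming linear interpolation between the two nearest ranks and NULL differences being ignored. The helper name percentile_cont and the use of seconds instead of hours are assumptions made for illustration, chosen because the sample differences are only a few seconds long.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation; None entries are skipped."""
    data = sorted(v for v in values if v is not None)
    if not data:
        return None
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(data) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(data[lower])
    weight = position - lower
    return data[lower] + weight * (data[upper] - data[lower])


# Differences in seconds between NEW and IN_PROCESS per owner in the sample data.
# Owners 1 and 3 have no NEW/IN_PROCESS pair, so their difference is missing.
diffs_by_owner = {1: None, 3: None, 4: 13, 5: 12}

percentile_90 = percentile_cont(diffs_by_owner.values(), 0.9)

# A single result row in the shape of the desired output.
result_row = ("percentile count", percentile_90)
print(result_row)
```

With only two non-missing differences, the result lies nine tenths of the way from the smaller to the larger one.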

Related

Oracle PL/SQL group by with date field produces confusing results

In a query, I did a GROUP BY on a date field that produced these summary results. The Query was like this:
SELECT
    (CASE
        WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
        WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
        ELSE 'OTHER' END) AS BILLING_SOURCE,
    std.created_on,
    COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM onebiller.t_std_in_detail_his std
INNER JOIN onebiller.t_std_in_header h
    ON h.job_id = std.job_id
WHERE std.invoice_number IS NOT NULL
GROUP BY (CASE
        WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
        WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
        ELSE 'OTHER' END), std.created_on
ORDER BY std.created_on ASC
Note these results for 3 datetimes on 19-Mar-2021 from the created_on field.
I then used TRUNC(created_on) to try to group all the records from a single day. This was the updated query:
SELECT
    (CASE
        WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
        WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
        ELSE 'OTHER' END) AS BILLING_SOURCE,
    TRUNC(std.created_on),
    COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM onebiller.t_std_in_detail_his std
INNER JOIN onebiller.t_std_in_header h
    ON h.job_id = std.job_id
WHERE std.invoice_number IS NOT NULL
GROUP BY (CASE
        WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
        WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
        ELSE 'OTHER' END), TRUNC(std.created_on)
ORDER BY TRUNC(std.created_on) ASC
I was expecting a result that would sum the 3 highlighted rows from the first query (2+165+164), but instead I received a count of 166 for 19-Mar-2021. Why didn't I get the sum of (2+165+164)?
Because you have the same invoice number repeated at different times on the same day. As a simpler example, say you have:
| CREATED_ON | INVOICE_NUMBER |
|---|---|
| 2021-03-19 09:00:00 | 1 |
| 2021-03-19 09:00:00 | 2 |
| 2021-03-19 12:00:00 | 2 |
| 2021-03-19 15:00:00 | 1 |
| 2021-03-19 15:00:00 | 2 |
| 2021-03-19 15:00:00 | 3 |
That shows 6 rows, but only three distinct invoice numbers - 1, 2 and 3.
A simplified version of your first query gives:
SELECT std.created_on,
       COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM std
GROUP BY std.created_on
ORDER BY std.created_on ASC

| CREATED_ON | COUNT_OF_INVOICES |
|---|---|
| 2021-03-19 09:00:00 | 2 |
| 2021-03-19 12:00:00 | 1 |
| 2021-03-19 15:00:00 | 3 |
The sum of those counts currently matches the number of rows in the table, 6. (I haven't included any duplicates at the same time, so the distinct isn't doing anything at the moment, again to keep it simple.) The row for 09:00 counts invoice numbers 1 and 2; the row for 12:00 only counts 2; and the row for 15:00 counts 1, 2 and 3. The counts in all three rows include a count for invoice number 2, the first and third include a count for invoice number 1 - so the same invoice numbers are being counted multiple times.
SELECT TRUNC(std.created_on) as created_on,
       COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM std
GROUP BY TRUNC(std.created_on)
ORDER BY TRUNC(std.created_on) ASC

| CREATED_ON | COUNT_OF_INVOICES |
|---|---|
| 2021-03-19 00:00:00 | 3 |
Now the single result is three, because that's how many distinct invoice numbers there are that day. It now counts 1, 2 and 3 once each, rather than counting invoice 1 twice, invoice 2 three times and invoice 3 once.
If you didn't have the distinct then the second query would also get 6, because it wouldn't be eliminating the duplicates seen in the first query.
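The difference between the two groupings can be reproduced with a small script. This is a sketch only: it uses SQLite through Python's sqlite3 module rather than Oracle, so date() stands in for TRUNC(), and the table is the simplified six-row example from this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE std (created_on TEXT, invoice_number INTEGER)")
conn.executemany(
    "INSERT INTO std VALUES (?, ?)",
    [
        ("2021-03-19 09:00:00", 1),
        ("2021-03-19 09:00:00", 2),
        ("2021-03-19 12:00:00", 2),
        ("2021-03-19 15:00:00", 1),
        ("2021-03-19 15:00:00", 2),
        ("2021-03-19 15:00:00", 3),
    ],
)

# Grouping by the full timestamp: each invoice is counted once per timestamp.
per_timestamp = conn.execute(
    """
    SELECT created_on, COUNT(DISTINCT invoice_number)
    FROM std
    GROUP BY created_on
    ORDER BY created_on
    """
).fetchall()

# Grouping by the day only: each invoice is counted once per day.
# date() is SQLite's stand-in here for Oracle's TRUNC().
per_day = conn.execute(
    """
    SELECT date(created_on), COUNT(DISTINCT invoice_number)
    FROM std
    GROUP BY date(created_on)
    """
).fetchall()

sum_of_counts = sum(count for _, count in per_timestamp)
print(per_timestamp)
print(sum_of_counts)
print(per_day)
```

Adding up the per-timestamp counts gives 6, while the per-day count is 3, which is the same effect as 2+165+164 versus 166 in the question.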

Handling duplicates when rolling totals using OVER Partition by

I'm trying to get the rolling amount column totals for each date, from the 1st day of the month to whatever the date column value is, shown in the input table.
Output Requirements
- Partition by the 'team' column
- Restart rolling totals on the 1st of each month
Question 1
Is my query below correct for the output requirements shown in the Desired Output Table below? It seems to work, but I need to confirm.
SELECT
    *,
    SUM(amount) OVER (
        PARTITION BY
            team,
            month_id
        ORDER BY
            date ASC
    ) rolling_amount_total
FROM input_table;
Question 2
How can I handle duplicate dates, shown in the first 2 rows of Input Table? Whenever there is a duplicate date the amount is a duplicate as well. I see a solution here: https://stackoverflow.com/a/60115061/6388651 but no luck getting it to remove the duplicates. My non-working code example is below.
SELECT
    *,
    SUM(amount) OVER (
        PARTITION BY
            team,
            month_id
        ORDER BY
            date ASC
    ) rolling_amount_total
FROM (
    SELECT DISTINCT
        date,
        amount,
        team,
        month_id
    FROM input_table
) t
Input Table
| date | amount | team | month_id |
|---|---|---|---|
| 2022-04-01 | 1 | A | 2022-04 |
| 2022-04-01 | 1 | A | 2022-04 |
| 2022-04-02 | 2 | A | 2022-04 |
| 2022-05-01 | 4 | B | 2022-05 |
| 2022-05-02 | 4 | B | 2022-05 |
Desired Output Table
| date | amount | team | month_id | Rolling_Amount_Total |
|---|---|---|---|---|
| 2022-04-01 | 1 | A | 2022-04 | 1 |
| 2022-04-02 | 2 | A | 2022-04 | 3 |
| 2022-05-01 | 4 | B | 2022-05 | 4 |
| 2022-05-02 | 4 | B | 2022-05 | 8 |
Q1. Your sum() over () is correct
Q2. Replace from input_table, in your first query, with:
from (select date, sum(amount) as amount, team, month_id
      from input_table
      group by date, team, month_id
     ) as t
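The rolling total over de-duplicated rows can be checked against the sample tables. The sketch below runs on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer), which is an assumption, since the question does not name its database. Note that summing the duplicates, as in the subquery above, would give an amount of 2 for 2022-04-01, whereas the desired output lists 1. The sketch therefore assumes that duplicate rows are exact copies and keeps one copy of each by grouping on all four columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE input_table ("date" TEXT, amount INTEGER, team TEXT, month_id TEXT)'
)
conn.executemany(
    "INSERT INTO input_table VALUES (?, ?, ?, ?)",
    [
        ("2022-04-01", 1, "A", "2022-04"),
        ("2022-04-01", 1, "A", "2022-04"),  # exact duplicate of the row above
        ("2022-04-02", 2, "A", "2022-04"),
        ("2022-05-01", 4, "B", "2022-05"),
        ("2022-05-02", 4, "B", "2022-05"),
    ],
)

rolling = conn.execute(
    """
    SELECT t."date", t.amount, t.team, t.month_id,
           SUM(t.amount) OVER (
               PARTITION BY t.team, t.month_id
               ORDER BY t."date"
           ) AS rolling_amount_total
    FROM (
        -- Assumption: duplicates are exact copies, so grouping on every
        -- column keeps a single copy of each row.
        SELECT "date", amount, team, month_id
        FROM input_table
        GROUP BY "date", amount, team, month_id
    ) AS t
    ORDER BY t.team, t."date"
    """
).fetchall()

for row in rolling:
    print(row)
```

On this data the rows match the Desired Output Table. If duplicate dates could carry different amounts, a different rule for picking the amount would be needed.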

Finding a difference in time in minutes for values in the same column

What I want to do here is find the time difference between StatusID = 'Processed' and StatusID = 'NEW' for each owner ID. After getting the difference for each owner ID, I want to find the maximum, minimum and average time difference. This is the sample data for it.
| OWNER_ID | CREATED_TIME | STATUS_ID |
|---|---|---|
| 1 | 2020-07-16 08:29:29.000 | NEW |
| 1 | 2022-02-21 04:38:01.000 | PROCESSED |
| 3 | 2022-02-28 14:24:28.000 | 1 |
| 3 | 2022-02-28 14:27:32.000 | CONVERTED |
| 4 | 2022-02-28 14:33:06.000 | NEW |
| 4 | 2022-02-28 14:33:19.000 | IN_PROCESS |
| 5 | 2022-03-01 12:01:48.000 | NEW |
| 5 | 2022-03-01 12:02:00.000 | IN_PROCESS |
I have tried out this code to get the time difference but my code is not working.
SELECT
    OWNER_ID,
    DATEDIFF(SECOND, (SELECT CREATED_TIME
                      FROM table
                      WHERE STATUS_ID = 'IN_PROCESS'),
                     (SELECT CREATED_TIME
                      FROM table
                      WHERE STATUS_ID = 'NEW'))
FROM
    table
GROUP BY
    OWNER_ID
The desired output is in this format and after getting the result, I want to find the maximum, minimum and average time difference.
| OWNER_ID | TIME_DIFFERENCE(in mins) |
|---|---|
| 1 | 500 |
| 3 | 800 |
| 4 | 1300 |
Use a CASE expression with aggregate
SELECT OWNER_ID,
       DATEDIFF(SECOND,
                MIN (CASE WHEN STATUS_ID = 'NEW' THEN CREATED_TIME END),
                MAX (CASE WHEN STATUS_ID = 'IN_PROCESS' THEN CREATED_TIME END) )
FROM table
GROUP BY OWNER_ID
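The conditional-aggregate query above can be tried on the sample rows, and the minimum, maximum and average then follow from wrapping it in an outer query. This is a sketch under assumptions: it runs on SQLite through Python's sqlite3 module, so strftime('%s', ...) arithmetic stands in for DATEDIFF(SECOND, ...), and the table name events is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "events" is a hypothetical table name standing in for the question's table.
conn.execute("CREATE TABLE events (owner_id INTEGER, created_time TEXT, status_id TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2020-07-16 08:29:29", "NEW"),
        (1, "2022-02-21 04:38:01", "PROCESSED"),
        (3, "2022-02-28 14:24:28", "1"),
        (3, "2022-02-28 14:27:32", "CONVERTED"),
        (4, "2022-02-28 14:33:06", "NEW"),
        (4, "2022-02-28 14:33:19", "IN_PROCESS"),
        (5, "2022-03-01 12:01:48", "NEW"),
        (5, "2022-03-01 12:02:00", "IN_PROCESS"),
    ],
)

# One row per owner: earliest NEW time subtracted from latest IN_PROCESS time.
# The difference is NULL when either status is missing for that owner.
per_owner_sql = """
    SELECT owner_id,
           CAST(strftime('%s', MAX(CASE WHEN status_id = 'IN_PROCESS' THEN created_time END)) AS INTEGER)
         - CAST(strftime('%s', MIN(CASE WHEN status_id = 'NEW' THEN created_time END)) AS INTEGER)
           AS diff_seconds
    FROM events
    GROUP BY owner_id
"""

per_owner = conn.execute(per_owner_sql + " ORDER BY owner_id").fetchall()

# Aggregates skip the NULL differences.
summary = conn.execute(
    f"SELECT MIN(diff_seconds), MAX(diff_seconds), AVG(diff_seconds) FROM ({per_owner_sql})"
).fetchone()

print(per_owner)
print(summary)
```

Owners 1 and 3 come back with a missing difference because they have no NEW/IN_PROCESS pair in the sample, so only owners 4 and 5 contribute to the summary.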

SQL Troubleshooting Help on Table Structure

I'm attempting to calculate average number of days between a customer's 1st and 3rd purchase, but struggling to get the data ordered in a way that will allow me to calculate.
I currently have the below data table. (Note: Order sequence number refers to the number order for that customer.)
| Order Date | Customer Number | Order Sequence Number |
|---|---|---|
| 2020-09-20 | 1 | 1 |
| 2021-01-20 | 1 | 2 |
| 2021-01-21 | 1 | 3 |
| 2020-10-01 | 2 | 1 |
| 2020-08-06 | 3 | 1 |
| 2020-09-06 | 3 | 2 |
| 2020-09-09 | 3 | 3 |
I've been trying to get the data to look like the following table. [To then be able to calculate datediff on the last two columns.]
| Customer Number | Order Count | First Order Date | Third Order Date |
|---|---|---|---|
| 1 | 3 | 2020-09-20 | 2021-01-21 |
| 2 | 1 | 2020-10-01 | Null |
| 3 | 3 | 2020-08-06 | 2020-09-09 |
I've completely messed up the code, but here's what I've been trying.
CREATE TABLE X2 as
SELECT
    customer_number,
    max(order_sequence_number) as order_count,
    CASE
        WHEN order_sequence_number = 1 then order_date
        ELSE null
    END as first_order_date,
    CASE
        WHEN order_sequence_number = 3 then order_date
        ELSE null
    END as third_order_date
FROM X1
GROUP BY customer_number;
Can someone please tell me what I'm missing? Thanks in advance!
You are on the right track but you need aggregation functions:
SELECT customer_number,
       max(order_sequence_number) as order_count,
       MAX(CASE WHEN order_sequence_number = 1 THEN order_date END) as first_order_date,
       MAX(CASE WHEN order_sequence_number = 3 THEN order_date END) as third_order_date
FROM X1
GROUP BY customer_number;
To get the difference in days, you would just subtract the two expressions using whatever date arithmetic is supported in your database.
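As an illustration of that last step, the sketch below runs the pivot on the sample orders and then averages the gap in days. It assumes SQLite through Python's sqlite3 module, where subtracting julianday() values is one way to do the date arithmetic; other databases have their own functions for this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE X1 (order_date TEXT, customer_number INTEGER, order_sequence_number INTEGER)"
)
conn.executemany(
    "INSERT INTO X1 VALUES (?, ?, ?)",
    [
        ("2020-09-20", 1, 1),
        ("2021-01-20", 1, 2),
        ("2021-01-21", 1, 3),
        ("2020-10-01", 2, 1),
        ("2020-08-06", 3, 1),
        ("2020-09-06", 3, 2),
        ("2020-09-09", 3, 3),
    ],
)

# One row per customer with the first and third order dates side by side.
pivot_sql = """
    SELECT customer_number,
           MAX(order_sequence_number) AS order_count,
           MAX(CASE WHEN order_sequence_number = 1 THEN order_date END) AS first_order_date,
           MAX(CASE WHEN order_sequence_number = 3 THEN order_date END) AS third_order_date
    FROM X1
    GROUP BY customer_number
"""

per_customer = conn.execute(pivot_sql + " ORDER BY customer_number").fetchall()

# Customers without a third order have a NULL gap, which AVG() skips.
avg_days = conn.execute(
    f"""
    SELECT AVG(julianday(third_order_date) - julianday(first_order_date))
    FROM ({pivot_sql})
    """
).fetchone()[0]

print(per_customer)
print(avg_days)
```

Customer 2 has no third order, so only customers 1 and 3 contribute to the average.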

T-SQL filtering records based on dates and time difference with other records

I have a table for which I have to perform a rather complex filter: first a filter by date is applied, but then records from the previous and next days should be included if their time difference does not exceed 8 hours compared to the previous or next record (depending on whether the date is earlier or later than the filter date).
For those adjacent days the selection should stop at the first record that does not satisfy this condition.
This is what my raw data looks like:
| Id | Desc | EntryDate |
|---|---|---|
| 1 | Event type 1 | 2021-03-12 21:55:00.000 |
| 2 | Event type 1 | 2021-03-12 01:10:00.000 |
| 3 | Event type 1 | 2021-03-11 20:17:00.000 |
| 4 | Event type 1 | 2021-03-11 05:04:00.000 |
| 5 | Event type 1 | 2021-03-10 23:58:00.000 |
| 6 | Event type 1 | 2021-03-10 11:01:00.000 |
| 7 | Event type 1 | 2021-03-10 10:00:00.000 |
In this example set, if my filter date is '2021-03-11', my expected result set should be all records from that day plus adjacent records from 03-12 and 03-10 that satisfy the 8 hours condition. Note how the record with Id 7 is not included because the record with Id 6 does not comply:
| Id | EntryDate |
|---|---|
| 2 | 2021-03-12 01:10:00.000 |
| 3 | 2021-03-11 20:17:00.000 |
| 4 | 2021-03-11 05:04:00.000 |
| 5 | 2021-03-10 23:58:00.000 |
Need advice how to write this complex query
This is a variant of gaps-and-islands. Define the difference, and then define groups based on the differences:
with e as (
    select t.*
    from (select t.*,
                 -- start a new group when there is no previous entry or it is 8 or more hours earlier
                 sum(case when prev_entrydate > dateadd(hour, -8, entrydate) then 0 else 1 end) over (order by entrydate) as grp
          from (select t.*,
                       lag(entrydate) over (order by entrydate) as prev_entrydate
                from t
               ) t
         ) t
)
select e.*
from e
where e.grp in (select e2.grp
                from e e2
                where convert(date, e2.entrydate) = @filterdate
               );
Note: I'm not sure exactly how filter date is applied. This assumes that it is any events on the entire day, which means that there might be multiple groups. If there is only one group (say the first group on the day), the query can be simplified a bit from a performance perspective.
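The grouping idea can be checked against the sample rows from the question. This is a sketch under assumptions: it runs on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer) rather than SQL Server, so datetime(entrydate, '-8 hours') and date() stand in for the T-SQL date functions, and the timestamps are stored without the fractional seconds so that string comparison behaves consistently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, entrydate TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        (1, "2021-03-12 21:55:00"),
        (2, "2021-03-12 01:10:00"),
        (3, "2021-03-11 20:17:00"),
        (4, "2021-03-11 05:04:00"),
        (5, "2021-03-10 23:58:00"),
        (6, "2021-03-10 11:01:00"),
        (7, "2021-03-10 10:00:00"),
    ],
)

filter_date = "2021-03-11"

rows = conn.execute(
    """
    WITH e AS (
        SELECT x.*,
               -- Running count of "breaks": a break is a row with no previous
               -- entry or with a previous entry 8 or more hours earlier.
               SUM(CASE WHEN prev_entrydate > datetime(entrydate, '-8 hours')
                        THEN 0 ELSE 1 END)
                   OVER (ORDER BY entrydate) AS grp
        FROM (SELECT t.*,
                     LAG(entrydate) OVER (ORDER BY entrydate) AS prev_entrydate
              FROM t) AS x
    )
    SELECT e.id, e.entrydate
    FROM e
    -- Keep every group that has at least one entry on the filter date.
    WHERE e.grp IN (SELECT e2.grp FROM e AS e2 WHERE date(e2.entrydate) = ?)
    ORDER BY e.entrydate DESC
    """,
    (filter_date,),
).fetchall()

selected_ids = [row[0] for row in rows]
print(selected_ids)
```

On this data the two entries from 2021-03-11 fall into different groups, because they are more than 8 hours apart, and each group pulls in its neighbour from the adjacent day.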
declare @DateTime datetime = '2021-03-11'
select *
from t
where t.EntryDate between DATEADD(hour, -8, @DateTime) and DATEADD(hour, 32, @DateTime)