Oracle PL/SQL group by with date field produces confusing results - sql

In a query, I did a GROUP BY on a date field that produced these summary results. The Query was like this:
SELECT
(CASE
WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
ELSE 'OTHER' END) AS BILLING_SOURCE, std.created_on, COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM onebiller.t_std_in_detail_his std
INNER JOIN onebiller.t_std_in_header h
ON h.job_id = std.job_id
WHERE std.invoice_number IS NOT NULL
GROUP BY (CASE
WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
ELSE 'OTHER' END), std.created_on
ORDER BY std.created_on ASC
Note these results for 3 datetimes on 19-Mar-2021 from the created_on field.
I then used TRUNC(created_on) to try to group all the records from a single day. This was the updated query:
SELECT
(CASE
WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
ELSE 'OTHER' END) AS BILLING_SOURCE, TRUNC(std.created_on), COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM onebiller.t_std_in_detail_his std
INNER JOIN onebiller.t_std_in_header h
ON h.job_id = std.job_id
WHERE std.invoice_number IS NOT NULL
GROUP BY (CASE
WHEN std.attribute_1 like '%709%' OR std.attribute_1 like '%999%' THEN 'COMPA' -- COMPA invoices either start with 709 or 999
WHEN h.manual_upload = 'Y' then 'MANUAL_UPLOAD'
ELSE 'OTHER' END), TRUNC(std.created_on)
ORDER BY TRUNC(std.created_on) ASC
I was expecting a result that would sum the 3 highlighted rows from the first query (2+165+164). Instead, I received a count of 166 for 19-Mar-2021. Why didn't I get the sum of (2+165+164)?

Because you have the same invoice number repeated at different times on the same day. As a simpler example, say you have:
CREATED_ON           INVOICE_NUMBER
2021-03-19 09:00:00  1
2021-03-19 09:00:00  2
2021-03-19 12:00:00  2
2021-03-19 15:00:00  1
2021-03-19 15:00:00  2
2021-03-19 15:00:00  3
That shows 6 rows, but only three distinct invoice numbers - 1, 2 and 3.
A simplified version of your first query gives:
SELECT std.created_on,
COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM std
GROUP BY std.created_on
ORDER BY std.created_on ASC
CREATED_ON           COUNT_OF_INVOICES
2021-03-19 09:00:00  2
2021-03-19 12:00:00  1
2021-03-19 15:00:00  3
The sum of those counts currently matches the number of rows in the table, 6. (I haven't included any duplicates at the same time, so the distinct isn't doing anything at the moment, again to keep it simple.) The row for 09:00 counts invoice numbers 1 and 2; the row for 12:00 only counts 2; and the row for 15:00 counts 1, 2 and 3. The counts in all three rows include a count for invoice number 2, the first and third include a count for invoice number 1 - so the same invoice numbers are being counted multiple times.
SELECT TRUNC(std.created_on) as created_on,
COUNT(DISTINCT std.invoice_number) AS COUNT_OF_INVOICES
FROM std
GROUP BY TRUNC(std.created_on)
ORDER BY TRUNC(std.created_on) ASC
CREATED_ON           COUNT_OF_INVOICES
2021-03-19 00:00:00  3
Now the single result is three, because that's how many distinct invoice numbers there are that day. Each of 1, 2 and 3 is now counted once, whereas the first query counted invoice 1 twice, invoice 2 three times and invoice 3 once.
If you didn't have the distinct then the second query would also get 6, because it wouldn't be eliminating the duplicates seen in the first query.
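To make the difference concrete, here is a minimal sketch that loads the six sample rows above into an in-memory SQLite database from Python and runs both groupings. SQLite is only a stand-in for Oracle here (an assumption for the sake of a runnable example), so `date(created_on)` plays the role of `TRUNC(created_on)`.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the Oracle table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE std (created_on TEXT, invoice_number INTEGER)")
conn.executemany(
    "INSERT INTO std VALUES (?, ?)",
    [
        ("2021-03-19 09:00:00", 1),
        ("2021-03-19 09:00:00", 2),
        ("2021-03-19 12:00:00", 2),
        ("2021-03-19 15:00:00", 1),
        ("2021-03-19 15:00:00", 2),
        ("2021-03-19 15:00:00", 3),
    ],
)

# Grouping by the full timestamp: each invoice is counted once per timestamp.
per_timestamp = conn.execute(
    """
    SELECT created_on, COUNT(DISTINCT invoice_number)
    FROM std
    GROUP BY created_on
    ORDER BY created_on
    """
).fetchall()

# Grouping by the day: each invoice is counted once per day.
# COUNT(*) is included to show what the result would be without DISTINCT.
per_day = conn.execute(
    """
    SELECT date(created_on), COUNT(DISTINCT invoice_number), COUNT(*)
    FROM std
    GROUP BY date(created_on)
    """
).fetchall()

print(per_timestamp)
print(per_day)
```

The per-timestamp counts add up to 6, while the per-day distinct count is 3 and the per-day row count is 6, which mirrors the 2+165+164 versus 166 situation in the question.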

Related

redshift cumulative count records via SQL

I've been struggling to find an answer to this question. I think this question is similar to what I'm looking for, but when I tried it, it didn't work.
In the expected output below, no new unique user_id appears between 02-21 and 02-25, so the cumulative count stays the same on those dates. Then on 02-27 there is a user_id (6) that has not appeared on any previous date, so the count increases.
Here's my input
date user_id
2020-02-20 1
2020-02-20 2
2020-02-20 3
2020-02-20 4
2020-02-20 4
2020-02-20 5
2020-02-21 1
2020-02-22 2
2020-02-23 3
2020-02-24 4
2020-02-25 4
2020-02-27 6
Output table:
date daily_cumulative_count
2020-02-20 5
2020-02-21 5
2020-02-22 5
2020-02-23 5
2020-02-24 5
2020-02-25 5
2020-02-27 6
This is what I tried, and the result is not quite what I want:
select
stat_date,count(DISTINCT user_id),
sum(count(DISTINCT user_id)) over (order by stat_date rows unbounded preceding) as cumulative_signups
from data_engineer_interview
group by stat_date
order by stat_date
It returns this instead:
date,count,cumulative_sum
2022-02-20,5,5
2022-02-21,1,6
2022-02-22,1,7
2022-02-23,1,8
2022-02-24,1,9
2022-02-25,1,10
2022-02-27,1,11
The problem with this task is that it could be done by comparing each row uniquely with all previous rows to see if there is a match in user_id. Since you are using Redshift I'll assume that your data table could be very large so attacking the problem this way will bog down in some form of a loop join.
You want to think about the problem differently to avoid this looping issue. If you derive a dataset with id and first_date_of_id you can then just do a cumulative sum sorted by date. Like this
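select first_date,
-- running total of users whose first appearance is on or before first_date
sum(count(*)) over (order by first_date rows unbounded preceding) as cumulative_users
from (select user_id, min("date") as first_date
from data_engineer_interview
group by user_id) first_seen
group by first_date
order by first_date;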
This is untested and won't produce the full list of dates that you have in your example output but rather only the dates where new ids show up. If this is an issue it is simple to add in the additional dates with no count change.
We can do this via a correlated subquery followed by aggregation:
WITH cte AS (
SELECT
date,
CASE WHEN EXISTS (
SELECT 1
FROM data_engineer_interview d2
WHERE d2.date < d1.date AND
d2.user_id = d1.user_id
) THEN 0 ELSE 1 END AS flag
FROM (SELECT DISTINCT date, user_id FROM data_engineer_interview) d1
)
SELECT date,
SUM(SUM(flag)) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS daily_cumulative_count
FROM cte
GROUP BY date
ORDER BY date;
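As a rough check of the "first date per user, then cumulative sum" idea, here is a sketch that runs against the sample input in an in-memory SQLite database from Python. SQLite stands in for Redshift here, the column is named `stat_date` as in the question's query, and the CTE names `first_seen` and `new_per_day` are made up for this example. It assumes SQLite 3.25 or later for window function support. Unlike the first answer, it left-joins against every distinct date so that dates with no new users still appear.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_engineer_interview (stat_date TEXT, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO data_engineer_interview VALUES (?, ?)",
    [
        ("2020-02-20", 1), ("2020-02-20", 2), ("2020-02-20", 3),
        ("2020-02-20", 4), ("2020-02-20", 4), ("2020-02-20", 5),
        ("2020-02-21", 1), ("2020-02-22", 2), ("2020-02-23", 3),
        ("2020-02-24", 4), ("2020-02-25", 4), ("2020-02-27", 6),
    ],
)

cumulative = conn.execute(
    """
    WITH first_seen AS (
        -- first date on which each user appears
        SELECT user_id, MIN(stat_date) AS first_date
        FROM data_engineer_interview
        GROUP BY user_id
    ),
    new_per_day AS (
        -- number of users first seen on each date (0 when there are none)
        SELECT d.stat_date, COUNT(f.user_id) AS new_users
        FROM (SELECT DISTINCT stat_date FROM data_engineer_interview) d
        LEFT JOIN first_seen f ON f.first_date = d.stat_date
        GROUP BY d.stat_date
    )
    SELECT stat_date,
           SUM(new_users) OVER (ORDER BY stat_date) AS daily_cumulative_count
    FROM new_per_day
    ORDER BY stat_date
    """
).fetchall()

for row in cumulative:
    print(row)
```

On the sample data this yields 5 for every date from 2020-02-20 through 2020-02-25 and 6 for 2020-02-27, which is the output the question asks for.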

Getting a single 90th percentile value of the data

I want to find the 90th percentile of the date difference. However, the result returns multiple rows of the percentile data. I only want a single row stating the 90th percentile, and I am having trouble getting that.
Here is the sample data.
OWNER_ID  CREATED_TIME             STATUS_ID
1         2020-07-16 08:29:29.000  NEW
1         2022-02-21 04:38:01.000  PROCESSED
3         2022-02-28 14:24:28.000  1
3         2022-02-28 14:27:32.000  CONVERTED
4         2022-02-28 14:33:06.000  NEW
4         2022-02-28 14:33:19.000  IN_PROCESS
5         2022-03-01 12:01:48.000  NEW
5         2022-03-01 12:02:00.000  IN_PROCESS
This is my query for this.
SELECT
'percentile count' as name,
PERCENTILE_CONT(0.9) within group (order by temp.diff) over () as percentile_90
FROM
(SELECT OWNER_ID,
DATEDIFF(hour,
MIN (CASE WHEN STATUS_ID = 'NEW' THEN CREATED_TIME END),
MAX (CASE WHEN STATUS_ID = 'IN_PROCESS' THEN CREATED_TIME END) ) as diff
FROM table
WHERE CREATED_DATE Between DATEADD(DAY, -14, GETDATE()) AND GETDATE()
AND RESPONSIBLE_ID IN (2731,2727,2702,2730,2701,2699,2696)
GROUP BY OWNER_ID)temp
The logic of my problem is first to get the date difference between STATUS_ID = NEW and STATUS_ID = IN_PROCESS for each OWNER_ID. Then I want to know the overall 90th percentile of the date differences, returned as a single result row.
My desired output is in this format containing just one row:
Name              90th Percentile
Percentile count  100

SQL Troubleshooting Help on Table Structure

I'm attempting to calculate average number of days between a customer's 1st and 3rd purchase, but struggling to get the data ordered in a way that will allow me to calculate.
I currently have the below data table. (Note: Order sequence number refers to the number order for that customer.)
Order Date  Customer Number  Order Sequence Number
2020-09-20  1                1
2021-01-20  1                2
2021-01-21  1                3
2020-10-01  2                1
2020-08-06  3                1
2020-09-06  3                2
2020-09-09  3                3
I've been trying to get the data to look like the following table. [To then be able to calculate datediff on the last two columns.]
Customer Number  Order Count  First Order Date  Third Order Date
1                3            2020-09-20        2021-01-21
2                1            2020-10-01        Null
3                3            2020-08-06        2020-09-09
I've completely messed up the code, but here's what I've been trying.
CREATE TABLE X2 as
SELECT
customer_number,
max(order_sequence_number) as order_count,
CASE
WHEN order_sequence_number = 1 then order_date
ELSE null
END as first_order_date,
CASE
WHEN order_sequence_number = 3 then order_date
ELSE null
END as third_order_date
FROM X1
GROUP BY customer_number;
Can someone please tell me what I'm missing? Thanks in advance!
You are on the right track but you need aggregation functions:
SELECT customer_number,
max(order_sequence_number) as order_count,
MAX(CASE WHEN order_sequence_number = 1 THEN order_date END) as first_order_date,
MAX(CASE WHEN order_sequence_number = 3 THEN order_date END) as third_order_date
FROM X1
GROUP BY customer_number;
To get the difference in days, you would just subtract the two expressions using whatever date arithmetic is supported in your database.
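For illustration, here is a small sketch of the conditional aggregation plus the day difference, run against the sample data in an in-memory SQLite database from Python. SQLite is an assumption made only to keep the example runnable. Its `julianday()` function is used for the date arithmetic, and other databases would use their own equivalent (for example `DATEDIFF`). The CTE name `per_customer` and the column `days_between` are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE X1 (order_date TEXT, customer_number INTEGER, "
    "order_sequence_number INTEGER)"
)
conn.executemany(
    "INSERT INTO X1 VALUES (?, ?, ?)",
    [
        ("2020-09-20", 1, 1), ("2021-01-20", 1, 2), ("2021-01-21", 1, 3),
        ("2020-10-01", 2, 1),
        ("2020-08-06", 3, 1), ("2020-09-06", 3, 2), ("2020-09-09", 3, 3),
    ],
)

rows = conn.execute(
    """
    WITH per_customer AS (
        -- one row per customer; MAX() picks the single non-NULL date, if any
        SELECT customer_number,
               MAX(order_sequence_number) AS order_count,
               MAX(CASE WHEN order_sequence_number = 1 THEN order_date END)
                   AS first_order_date,
               MAX(CASE WHEN order_sequence_number = 3 THEN order_date END)
                   AS third_order_date
        FROM X1
        GROUP BY customer_number
    )
    SELECT customer_number, order_count, first_order_date, third_order_date,
           CAST(julianday(third_order_date) - julianday(first_order_date)
                AS INTEGER) AS days_between
    FROM per_customer
    ORDER BY customer_number
    """
).fetchall()

# Average over customers that actually have a third order (NULLs skipped).
days = [r[4] for r in rows if r[4] is not None]
average_days = sum(days) / len(days)

for r in rows:
    print(r)
print(average_days)
```

Customers without a third order get NULL for both `third_order_date` and `days_between`, so they drop out of the average rather than distorting it.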

Count the number of transactions per month for an individual group by date Hive

I have a table of customer transactions where each item purchased by a customer is stored as one row. So, for a single transaction there can be multiple rows in the table. I have another col called visit_date.
There is a category column called cal_month_nbr which ranges from 1 to 12 based on which month transaction occurred.
The data looks like below
Id visit_date Cal_month_nbr
---- ------ ------
1 01/01/2020 1
1 01/02/2020 1
1 01/01/2020 1
2 02/01/2020 2
1 02/01/2020 2
1 03/01/2020 3
3 03/01/2020 3
First, I want to know how many times each customer visits per month, based on visit_date, i.e. I want the output below:
id cal_month_nbr visit_per_month
--- --------- ----
1 1 2
1 2 1
1 3 1
2 2 1
3 3 1
Second, I want the average frequency of visits per month for each id, i.e.:
id Avg_freq_per_month
---- -------------
1 1.33
2 1
3 1
I tried the query below, but it counts each item as one transaction:
select avg(count_e) as num_visits_per_month,individual_id
from
(
select r.individual_id, cal_month_nbr, count(*) as count_e
from
ww_customer_dl_secure.cust_scan
GROUP by
r.individual_id, cal_month_nbr
order by count_e desc
) as t
group by individual_id
I would appreciate any help, guidance or suggestions
You can divide the total row count by the number of distinct months (note that this counts every purchased item, not every visit):
select individual_id,
count(*) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
If you want the average number of distinct visit days per month, then:
select individual_id,
count(distinct visit_date) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
Actually, Hive may not be efficient at calculating count(distinct), so multiple levels of aggregation might be faster:
select individual_id, avg(num_visit_days)
from (select individual_id, cal_month_nbr, count(*) as num_visit_days
from (select distinct individual_id, visit_date, cal_month_nbr
from ww_customer_dl_secure.cust_scan c
) iv
group by individual_id, cal_month_nbr
) ic
group by individual_id;
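To check the approach against the sample data, here is a sketch using an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here, and the table is called `cust_scan` without the schema prefix. It counts distinct visit dates per id and month, then averages those counts per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cust_scan (individual_id INTEGER, visit_date TEXT, "
    "cal_month_nbr INTEGER)"
)
conn.executemany(
    "INSERT INTO cust_scan VALUES (?, ?, ?)",
    [
        (1, "01/01/2020", 1), (1, "01/02/2020", 1), (1, "01/01/2020", 1),
        (2, "02/01/2020", 2), (1, "02/01/2020", 2),
        (1, "03/01/2020", 3), (3, "03/01/2020", 3),
    ],
)

# Visits per id and month: several items bought on one date count as one visit.
visits_per_month = conn.execute(
    """
    SELECT individual_id, cal_month_nbr, COUNT(DISTINCT visit_date)
    FROM cust_scan
    GROUP BY individual_id, cal_month_nbr
    ORDER BY individual_id, cal_month_nbr
    """
).fetchall()

# Average of the monthly visit counts per id, rounded to two decimals.
avg_per_id = conn.execute(
    """
    SELECT individual_id, ROUND(AVG(num_visit_days), 2)
    FROM (
        SELECT individual_id, cal_month_nbr,
               COUNT(DISTINCT visit_date) AS num_visit_days
        FROM cust_scan
        GROUP BY individual_id, cal_month_nbr
    )
    GROUP BY individual_id
    ORDER BY individual_id
    """
).fetchall()

print(visits_per_month)
print(avg_per_id)
```

Both results match the tables requested in the question: id 1 has 2, 1 and 1 visits in months 1 to 3, giving an average of 1.33.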

Need to get the minimum start date and maximum end date, when there is no break in months

I have the rows shown below:
Column1 Start_date end_date Row_number
1 2014-02-01 2014-02-28 1
1 2014-03-01 2014-03-31 2
1 2014-04-01 2014-04-30 3
1 2014-05-01 2014-05-31 4
1 2014-07-01 2014-07-31 5
1 2015-02-01 2015-02-28 6
1 2015-03-01 2015-03-31 7
I need result like below,
Column1 Start_date end_date
1 2014-02-01 2014-05-31
1 2014-07-01 2014-07-31
1 2015-02-01 2015-03-31
So when the end_date of one row is one day before the start_date of the next row, I need to group all such continuous rows and get the result shown above. I need to do this using only SQL. Please let me know if anyone has an idea how to solve this.
In the input, the first 4 rows are continuous, the 5th row stands alone, and the 6th and 7th rows are continuous.
Thanks in advance.
The trick here is that you need to first filter out only entries that are the ends of an interval, and then merge them together, rather than trying to keep a running count in one go.
I don't know what flavour of SQL you're running, and I have no idea what you're trying to signify with Column1, but this should do the trick (written for SQL Server, but the only functions you need to adjust are DATEADD and ISNULL).
SELECT DISTINCT
CASE WHEN Q1.IsStart = 1
THEN Q1.start_date
ELSE LAG(start_date) OVER(ORDER BY Q1.Row_number) END AS start_date,
CASE WHEN Q1.IsEnding = 1
THEN Q1.end_date
ELSE LEAD(end_date) OVER(ORDER BY Q1.Row_number) END AS end_date
FROM
(SELECT
start_date,
end_date,
Row_number,
CASE WHEN DATEADD(day,1,end_date) =
ISNULL(LEAD(start_date) OVER(ORDER BY Row_number),
end_date)
THEN 0
ELSE 1 END AS IsEnding,
CASE WHEN DATEADD(day,-1,start_date) =
ISNULL(LAG(end_date) OVER(ORDER BY Row_number),
start_date)
THEN 0
ELSE 1 END AS IsStart
FROM table1) Q1
WHERE Q1.IsEnding = 1 OR Q1.IsStart = 1
For ANSI SQL/For those of you without LAG or LEAD:
SELECT
StartDates.start_date,
MIN(EndDates.end_date)
FROM
(SELECT
MainEntry.start_date,
MainEntry.row_number
FROM
mytable MainEntry
LEFT OUTER JOIN mytable PrevEntry ON PrevEntry.row_number + 1 = MainEntry.row_number
WHERE
PrevEntry.end_date IS NULL OR
EXTRACT(day FROM (MainEntry.start_date - PrevEntry.end_date)) > 1) StartDates
INNER JOIN
(SELECT
MainEntry.end_date,
MainEntry.row_number
FROM
mytable MainEntry
LEFT OUTER JOIN mytable NextEntry ON NextEntry.row_number - 1 = MainEntry.row_number
WHERE
NextEntry.start_date IS NULL OR
EXTRACT(day FROM (NextEntry.start_date - MainEntry.end_date)) > 1) EndDates
ON StartDates.row_number <= EndDates.row_number
GROUP BY
StartDates.start_date
Note that the GROUP BY could contain StartDates.row_number if that takes advantage of an index. Also note that this ANSI solution initially missed the edge cases of rows without any pairs (had INNER JOINs inside the subqueries).
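As a sanity check of the self-join idea, here is a sketch that runs it on the sample rows in an in-memory SQLite database from Python. SQLite is an assumption made only so the example is runnable. The table name `periods` and the column name `row_num` are made up for this example, and `julianday()` replaces the `EXTRACT(day FROM ...)` date arithmetic. The previous row is matched with `row_num + 1` and the next row with `row_num - 1`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE periods (start_date TEXT, end_date TEXT, row_num INTEGER)"
)
conn.executemany(
    "INSERT INTO periods VALUES (?, ?, ?)",
    [
        ("2014-02-01", "2014-02-28", 1),
        ("2014-03-01", "2014-03-31", 2),
        ("2014-04-01", "2014-04-30", 3),
        ("2014-05-01", "2014-05-31", 4),
        ("2014-07-01", "2014-07-31", 5),
        ("2015-02-01", "2015-02-28", 6),
        ("2015-03-01", "2015-03-31", 7),
    ],
)

islands = conn.execute(
    """
    SELECT s.start_date, MIN(e.end_date) AS end_date
    FROM (
        -- rows that start an island: no previous row, or a gap before them
        SELECT m.start_date, m.row_num
        FROM periods m
        LEFT JOIN periods p ON p.row_num + 1 = m.row_num
        WHERE p.end_date IS NULL
           OR julianday(m.start_date) - julianday(p.end_date) > 1
    ) s
    JOIN (
        -- rows that end an island: no next row, or a gap after them
        SELECT m.end_date, m.row_num
        FROM periods m
        LEFT JOIN periods n ON n.row_num - 1 = m.row_num
        WHERE n.start_date IS NULL
           OR julianday(n.start_date) - julianday(m.end_date) > 1
    ) e ON s.row_num <= e.row_num
    GROUP BY s.start_date
    ORDER BY s.start_date
    """
).fetchall()

for island in islands:
    print(island)
```

Each island start is paired with every island end at or after it, and `MIN(end_date)` keeps the nearest one. This relies on the dates being stored in ISO `YYYY-MM-DD` form so that the text values sort chronologically.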