Optimal way to create a histogram/frequency distribution in Oracle? - sql

I have an events table with two columns eventkey (unique, primary-key) and createtime, which stores the creation time of the event as the number of milliseconds since Jan 1 1970 in a NUMBER column.
I would like to create a "histogram" or frequency distribution that shows me how many events were created in each hour of the past week.
Is this the best way to write such a query in Oracle, using the width_bucket() function? Is it possible to derive the number of rows that fall into each bucket using one of the other Oracle analytic functions rather than using width_bucket to determine what bucket number each row belongs to and doing a count(*) over that?
-- 1305504000000 = 5/16/2011 12:00am GMT
-- 1306108800000 = 5/23/2011 12:00am GMT
select
timestamp '1970-01-01 00:00:00' + numtodsinterval((1305504000000/1000 + (bucket * 60 * 60)), 'second') period_start,
numevents
from (
select bucket, count(*) as numevents from (
select eventkey, createtime,
width_bucket(createtime, 1305504000000, 1306108800000, 24 * 7) bucket
from events
where createtime between 1305504000000 and 1306108800000
) group by bucket
)
order by period_start

If your createtime were a date column, this would be trivial:
SELECT TO_CHAR(createtime, 'DAY:HH24'), COUNT(*)
FROM EVENTS
GROUP BY TO_CHAR(createtime, 'DAY:HH24');
As it is, casting the createtime column isn't too hard:
select TO_CHAR(
TO_DATE('19700101', 'YYYYMMDD') + createtime / 86400000,
'DAY:HH24') AS BUCKET, COUNT(*)
FROM EVENTS
WHERE createtime between 1305504000000 and 1306108800000
group by TO_CHAR(
TO_DATE('19700101', 'YYYYMMDD') + createtime / 86400000,
'DAY:HH24')
order by 1
If, alternatively, you're looking for the fencepost values (for example, where do I go from the first decile (0-10%) to the next (10-20%)?), you'd do something like:
select min(createtime) over (partition by decile) as decile_start,
max(createtime) over (partition by decile) as decile_end,
decile
from (select createtime,
ntile (10) over (order by createtime asc) as decile
from events
where createtime between 1305504000000 and 1306108800000
)
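Note that, as written, the analytic min/max repeat the same values on every source row; if you only want one row per decile, a plain GROUP BY version is simpler (a sketch over the same events table):
select decile,
min(createtime) as decile_start,
max(createtime) as decile_end
from (select createtime,
ntile (10) over (order by createtime asc) as decile
from events
where createtime between 1305504000000 and 1306108800000
)
group by decile
order by decile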

I'm unfamiliar with Oracle's date functions, but I'm pretty certain there's an equivalent way of writing this Postgres statement:
select date_trunc('hour', stamp), count(*)
from your_data
group by date_trunc('hour', stamp)
order by date_trunc('hour', stamp)
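Assuming stamp is a DATE or TIMESTAMP column, the closest Oracle equivalent would truncate to the hour with TRUNC (a sketch, reusing the same hypothetical your_data table):
select trunc(stamp, 'HH') as hour_start, count(*)
from your_data
group by trunc(stamp, 'HH')
order by trunc(stamp, 'HH')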

Pretty much the same response as Adam, but I would prefer to keep the period_start as a time field so it is easier to filter further if needed:
with
events as
(
select rownum eventkey, round(dbms_random.value(1305504000000, 1306108800000)) createtime
from dual
connect by level <= 1000
)
select
trunc(timestamp '1970-01-01 00:00:00' + numtodsinterval(createtime/1000, 'second'), 'HH') period_start,
count(*) numevents
from
events
where
createtime between 1305504000000 and 1306108800000
group by
trunc(timestamp '1970-01-01 00:00:00' + numtodsinterval(createtime/1000, 'second'), 'HH')
order by
period_start
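For example, because period_start stays a DATE, narrowing the histogram to a single day afterwards is a plain comparison (a sketch reusing the same randomly generated events; the 2011-05-18 cutoffs are arbitrary):
with
events as
(
select rownum eventkey, round(dbms_random.value(1305504000000, 1306108800000)) createtime
from dual
connect by level <= 1000
)
select period_start, numevents
from (
select trunc(timestamp '1970-01-01 00:00:00' + numtodsinterval(createtime/1000, 'second'), 'HH') period_start,
count(*) numevents
from events
group by trunc(timestamp '1970-01-01 00:00:00' + numtodsinterval(createtime/1000, 'second'), 'HH')
)
where period_start >= date '2011-05-18' and period_start < date '2011-05-19'
order by period_start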

Use the Oracle-provided WIDTH_BUCKET function to bucket continuous or finely discrete data. The following example shows a way to create a histogram with 5 buckets, gathering "COLUMN_VALUE" from 510 to 520 (so each bucket covers a range of 2). WIDTH_BUCKET creates additional buckets with id=0 and id=num_buckets+1 for values below the minimum and above the maximum.
SELECT "BUCKET_ID", count(*),
CASE
WHEN "BUCKET_ID"=0 THEN -1/0F
ELSE 510+(520-510)/5*("BUCKET_ID"-1)
END "BUCKET_MIN",
CASE
WHEN "BUCKET_ID"=5+1 THEN 1/0F
ELSE 510+(520-510)/5*("BUCKET_ID")
END "BUCKET_MAX"
FROM
(
SELECT "COLUMN_VALUE",
WIDTH_BUCKET("COLUMN_VALUE", 510, 520, 5) "BUCKET_ID"
FROM "MY_TABLE"
)
group by "BUCKET_ID"
ORDER BY "BUCKET_ID";
Sample output
 BUCKET_ID   COUNT(*) BUCKET_MIN BUCKET_MAX
---------- ---------- ---------- ----------
         0         45       -Inf   5.1E+002
         1        220   5.1E+002  5.12E+002
         2        189  5.12E+002  5.14E+002
         3         43  5.14E+002  5.16E+002
         4          3  5.16E+002  5.18E+002
In my table, there are no values in 518-520, so the bucket with id=5 is not shown. On the other hand, there are values below the minimum (510), so there is a bucket with id=0 gathering values from -inf to 510.
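If you also want the empty buckets to appear with a count of 0, one option (a sketch, still against the hypothetical "MY_TABLE"/"COLUMN_VALUE" from above) is to outer-join a generated list of bucket ids, including the id=0 and id=num_buckets+1 overflow buckets:
SELECT b."BUCKET_ID", COUNT(t."COLUMN_VALUE") AS "CNT"
FROM (SELECT LEVEL - 1 AS "BUCKET_ID" FROM DUAL CONNECT BY LEVEL <= 5 + 2) b
LEFT JOIN (
SELECT "COLUMN_VALUE",
WIDTH_BUCKET("COLUMN_VALUE", 510, 520, 5) "BUCKET_ID"
FROM "MY_TABLE"
) t ON t."BUCKET_ID" = b."BUCKET_ID"
GROUP BY b."BUCKET_ID"
ORDER BY b."BUCKET_ID";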

Related

Why group by date is returning multiple rows for the same date?

I have a query like the following.
select some_date_col, count(*) as cnt
from <the table>
group by some_date_col
I get something like this as the output:
13-12-2021, 6
13-12-2021, 8
13-12-2021, 9
....
How is that possible? Here some_date_col is of type Date.
A DATE is a binary data-type that is composed of 7 bytes (century, year-of-century, month, day, hour, minute and second) and will always have those components.
The user interface you use to access the database can choose to display some or all of those components of the binary representation of the DATE; however, regardless of whether or not they are displayed by the UI, all the components are always stored in the database and used in comparisons in queries.
When you GROUP BY a date data-type you aggregate values that are identical down to an accuracy of a second (regardless of the accuracy displayed by the user interface).
So, if you have the data:
CREATE TABLE the_table (some_date_col) AS
SELECT DATE '2021-12-13' FROM DUAL CONNECT BY LEVEL <= 6 UNION ALL
SELECT DATE '2021-12-13' + INTERVAL '1' SECOND FROM DUAL CONNECT BY LEVEL <= 8 UNION ALL
SELECT DATE '2021-12-13' + INTERVAL '1' MINUTE FROM DUAL CONNECT BY LEVEL <= 9;
Then the query:
SELECT TO_CHAR(some_date_col, 'YYYY-MM-DD HH24:MI:SS') AS some_date_col,
count(*) as cnt
FROM the_table
GROUP BY some_date_col;
Will output:
SOME_DATE_COL        CNT
-------------------  ---
2021-12-13 00:01:00    9
2021-12-13 00:00:01    8
2021-12-13 00:00:00    6
The values are grouped according to equal values (down to the maximum precision stored in the date).
If you want to GROUP BY dates with the same date component but any time component then use the TRUNCate function (which returns a value with the same date component but the time component set to midnight):
SELECT TRUNC(some_date_col) AS some_date_col,
count(*) as cnt
FROM <the table>
GROUP BY TRUNC(some_date_col)
Which, for the same data outputs:
SOME_DATE_COL  CNT
-------------  ---
13-DEC-21       23
And:
SELECT TO_CHAR(TRUNC(some_date_col), 'YYYY-MM-DD HH24:MI:SS') AS some_date_col,
count(*) as cnt
FROM the_table
GROUP BY TRUNC(some_date_col)
Outputs:
SOME_DATE_COL        CNT
-------------------  ---
2021-12-13 00:00:00   23
The Oracle DATE type holds both a date and a time component. If the time components do not match, grouping by that value will place the same date (with different times) in different groups:
CREATE TABLE test ( xdate date );
INSERT INTO test VALUES (current_date);
INSERT INTO test VALUES (current_date + INTERVAL '1' MINUTE);
With the default display format:
SELECT xdate, COUNT(*) FROM test GROUP BY xdate;
Result:
XDATE      COUNT(*)
---------  --------
13-DEC-21         1
13-DEC-21         1
Now alter the format and rerun:
ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MON-DD HH24:MI:SS';
SELECT xdate, COUNT(*) FROM test GROUP BY xdate;
The result
XDATE                 COUNT(*)
--------------------  --------
2021-DEC-13 23:29:36         1
2021-DEC-13 23:30:36         1
Also try this:
SELECT to_char(xdate, 'YYYY-MON-DD HH24:MI:SS') AS formatted FROM test;
Result:
FORMATTED
2021-DEC-13 23:29:36
2021-DEC-13 23:30:36
and this:
SELECT to_char(xdate, 'YYYY-MON-DD HH24:MI:SS') AS formatted, COUNT(*) FROM test GROUP BY xdate;
Result:
FORMATTED             COUNT(*)
--------------------  --------
2021-DEC-13 23:29:36         1
2021-DEC-13 23:30:36         1
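To collapse those rows into a single per-day group regardless of the time component, truncate before grouping, as in the first answer (a sketch against the same test table):
SELECT TRUNC(xdate) AS xdate, COUNT(*)
FROM test
GROUP BY TRUNC(xdate);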

Condition and round to nearest 30 minutes interval multiple timestamps SQL BIG QUERY

I use Google BigQuery. My query includes 2 different timestamps: start_at and end_at.
The goal of the query is to round these 2 timestamps to the nearest 30 minutes interval, which I manage using this: TIMESTAMP_TRUNC(TIMESTAMP_SUB(start_at, INTERVAL MOD(EXTRACT(MINUTE FROM start_at), 30) MINUTE),MINUTE) and the same goes for end_at.
Events occur (net_lost_orders) at each rounded timestamp.
The 2 problems that I encounter are:
First, as long as start_at and end_at fall in the same 30-minute interval, things work well, but when they do not (for example when start_at is 19:15, whose interval is 19:00, and end_at is 21:15, whose interval is 21:00), the results are not as expected. Additionally, I need not only the 2 extreme intervals but all 30-minute intervals between start_at and end_at (19:00/19:30/20:00/20:30/21:00 in the example).
Secondly, I can't manage to create a condition that shows each interval on a separate row. I have tried to CAST, TRUNCATE and EXTRACT the timestamps, and to use CASE WHEN and GROUP BY, without success.
Here's the final part of the query (timestamps rounded excluded):
...
-------LOST ORDERS--------
a AS (SELECT created_date, closure, zone_id, city_id, l.interval_start,
l.net as net_lost_orders, l.starts_at, CAST(DATETIME(l.starts_at, timezone) AS TIMESTAMP) as start_local_time
FROM `XXX`, UNNEST(lost_orders) as l),
b AS (SELECT city_id, city_name, zone_id, zone_name FROM `YYY`),
lost AS (SELECT DISTINCT created_date, closure, zone_name, city_name, start_local_time,
TIMESTAMP_TRUNC(TIMESTAMP_SUB(start_local_time, INTERVAL MOD(EXTRACT(MINUTE FROM start_local_time), 30) MINUTE),MINUTE) AS lost_order_30_interval,
net_lost_orders
FROM a LEFT JOIN b ON a.city_id=b.city_id AND a.zone_id=b.zone_id AND a.city_id=b.city_id
WHERE zone_name='Atlanta' AND created_date='2021-09-09'
ORDER BY rt ASC),
------PREPARATION CLOSURE START AND END INTERVALS------
f AS (SELECT
DISTINCT TIMESTAMP_TRUNC(TIMESTAMP_SUB(start_at, INTERVAL MOD(EXTRACT(MINUTE FROM start_at), 30) MINUTE),MINUTE) AS start_closure_30_interval,
TIMESTAMP_TRUNC(TIMESTAMP_SUB(end_at, INTERVAL MOD(EXTRACT(MINUTE FROM end_at), 30) MINUTE),MINUTE) AS end_closure_30_interval,
country_code,
report_date,
Day,
CASE
WHEN Day="Monday" THEN 1
WHEN Day="Tuesday" THEN 2
WHEN Day="Wednesday" THEN 3
WHEN Day="Thursday" THEN 4
WHEN Day="Friday" THEN 5
WHEN Day="Saturday" THEN 6
WHEN Day="Sunday" THEN 7
END AS Weekday_order,
report_week,
city_name,
events_mod.zone_name,
closure,
start_at,
end_at,
activation_threshold,
deactivation_threshold,
shrinkage_drive_time,
ROUND(duration/60,2) AS duration,
FROM events_mod
WHERE report_date="2021-09-09"
AND events_mod.zone_name="Atlanta"
GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
ORDER BY report_date, start_at ASC)
------FINAL TABLE------
SELECT DISTINCT
start_closure_30_interval,end_closure_30_interval, report_date, Day, Weekday_order, report_week, f.city_name, f.zone_name, closure,
start_at, end_at, start_time,end_time, activation_threshold, deactivation_threshold, duration, net_lost_orders
FROM f
LEFT JOIN lost ON f.city_name=lost.city_name
AND f.zone_name=lost.zone_name
AND f.report_date=lost.created_date
AND f.start_closure_30_interval=lost.lost_order_30_interval
AND f.end_closure_30_interval=lost.lost_order_30_interval
GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
I would be really grateful if you could help and explain to me how to get all the rounded timestamps between start_at and end_at on separate rows. Thank you in advance. Best, Fabien
Consider below approach
select intervals, any_value(t).*, sum(Nb_lost_orders) as Nb_lost_orders
from table1 t,
unnest(generate_timestamp_array(
timestamp_seconds(div(unix_seconds(starts_at), 1800) * 1800),
timestamp_seconds(div(unix_seconds(ends_at), 1800) * 1800),
interval 30 minute
)) intervals
left join (
select Nb_lost_orders,
timestamp_seconds(div(unix_seconds(Time_when_the_lost_order_occurred), 1800) * 1800) as intervals
from Table2
)
using(intervals)
group by intervals
If applied to the sample data in your question:
with Table1 as (
select 'Closure' Event, timestamp '2021-09-09 11:00:00' starts_at, timestamp '2021-09-09 11:45:00' ends_at union all
select 'Closure', '2021-09-09 12:05:00', '2021-09-09 14:10:00'
), Table2 as (
select 5 Nb_lost_orders, timestamp '2021-09-09 11:38:00' Time_when_the_lost_order_occurred
)
output is
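As an aside, the div(unix_seconds(...), 1800) * 1800 expression used above floors a timestamp to the preceding 30-minute boundary; a minimal standalone check (hypothetical value):
select timestamp_seconds(div(unix_seconds(timestamp '2021-09-09 11:38:00'), 1800) * 1800) as floored
-- returns 2021-09-09 11:30:00 UTC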

ORACLE SQL: Hourly Date to be group by day time and sum of the amount

I have the following situation:
ID DATE_TIME AMOUNT
23 14-MAY-2021 10:47:01 5
23 14-MAY-2021 11:49:52 3
23 14-MAY-2021 12:03:18 4
How can I get the sum of the amount and take the DATE by day, not hourly?
Example:
ID DATE_TIME TOTAL
23 20210514 12
I tried this way but I got an error:
SELECT DISTINCT ID, TO_CHAR(DATE_TIME, 'YYYYMMDD'), SUM(AMOUNT) AS TOTAL FROM MY_TABLE
WHERE ID ='23' AND DATE_TIME > SYSDATE-1
GROUP BY TOTAL, DATE_TIME
You don't need DISTINCT if you use GROUP BY: anything that is grouped is already distinct, unless it is joined to something else later on that causes it to repeat.
You were almost there, too:
SELECT ID, TO_CHAR(DATE_TIME, 'YYYYMMDD') AS DATE_TIME, SUM(AMOUNT) AS TOTAL
FROM MY_TABLE
WHERE ID ='23' AND DATE_TIME > SYSDATE-1
GROUP BY ID, TO_CHAR(DATE_TIME, 'YYYYMMDD')
You need to group by the output of the function, not the input. Not every database can GROUP BY aliases used in the SELECT (technically the SELECT hasn't been evaluated by the time the GROUP BY runs, so the aliases don't exist yet), and you wouldn't group by the total because that's an aggregate (the result of summing every value in the group).
If you need to do further work with that date, don't convert it to a string. Cut the time off using TRUNC:
SELECT ID, TRUNC(DATE_TIME) as DATE_TIME, SUM(AMOUNT) AS TOTAL
FROM MY_TABLE
WHERE ID ='23' AND DATE_TIME > SYSDATE-1
GROUP BY ID, TRUNC(DATE_TIME)
TRUNC can cut a date down to other parts; for example, TRUNC(DATE_TIME, 'HH24') will remove the minutes and seconds but leave the hours.
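For instance, an hourly breakdown of the same data would look something like this (a sketch against the same hypothetical MY_TABLE):
SELECT ID, TRUNC(DATE_TIME, 'HH24') AS HOUR_START, SUM(AMOUNT) AS TOTAL
FROM MY_TABLE
WHERE ID = '23' AND DATE_TIME > SYSDATE - 1
GROUP BY ID, TRUNC(DATE_TIME, 'HH24')
ORDER BY HOUR_START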
Convert the DATE column to a string with the required accuracy and then group on that:
SELECT ID,
TO_CHAR("DATE", 'YYYY-MM-DD'),
SUM(AMOUNT) AS TOTAL FROM MY_TABLE
WHERE ID ='23'
AND "DATE" > SYSDATE-1
GROUP BY ID, TO_CHAR("DATE", 'YYYY-MM-DD')
or truncate the value so that the time component is set to midnight for each date:
SELECT ID,
TRUNC("DATE"),
SUM(AMOUNT) AS TOTAL FROM MY_TABLE
WHERE ID ='23'
AND "DATE" > SYSDATE-1
GROUP BY ID, TRUNC("DATE")
(Note: DATE is a keyword and cannot be used as an identifier unless you use a quoted identifier; you would then need to use the quotes, and the exact case, every time you refer to the column. You would be better off renaming the column to something that is not a keyword.)
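For example, a one-time rename (a sketch, assuming the table really is named MY_TABLE) avoids the quoting problem entirely:
ALTER TABLE MY_TABLE RENAME COLUMN "DATE" TO DATE_TIME;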

Data recurring in previous 90 days

I hope you can support me with a piece of code I'm writing. I'm working with the following query:
SELECT case_id, case_date, people_id FROM table_1;
and I have to search in the DB how many times the same people_id is repeated (with different case_id), considering a case_date minus 90 days timeframe. Any advice on how to address that?
Data sample
Additional info: as a result I'm expecting the list of people_id with how many cases each received in the 90 days from the last case_date.
expected result sample:
The way I understood the question, it would be something like this:
select people_id,
case_id,
count(*)
from table_1
where case_date >= trunc(sysdate) - 90
group by people_id,
case_id
You want to filter WHERE the case_date is greater than or equal to 90 days before the start of today and then GROUP BY the people_id and COUNT the number of DISTINCT (different) case_id:
SELECT people_id,
COUNT( DISTINCT case_id ) AS number_of_cases
FROM table_1
WHERE case_date >= TRUNC( SYSDATE ) - INTERVAL '90' DAY
GROUP BY
people_id;
If you only want to count repeated case_id per people_id then:
SELECT people_id,
COUNT(*) AS number_of_repeated_cases
FROM (
SELECT case_id,
people_id
FROM table_1
WHERE case_date >= TRUNC( SYSDATE ) - INTERVAL '90' DAY
GROUP BY
people_id,
case_id
HAVING COUNT(*) >= 2
)
GROUP BY
people_id;
I think you want window functions:
select t.*,
count(*) over (partition by people_id order by case_date
range between interval '90' day preceding and current row
) as person_count_90_day
from t;

Irregular grouping of timestamp variable

I have a table organized as follows:
id lateAt
1231235 2019/09/14
1242123 2019/09/13
3465345 NULL
5676548 2019/09/28
8986475 2019/09/23
Where lateAt is a timestamp of when a certain loan's payment became late. So, for each current date (I need to look at these numbers daily) there is a certain number of entries which are late by 0-15, 15-30, 30-45, 45-60, 60-90 and 90+ days.
This is my desired output:
lateGroup Count
0-15 20
15-30 22
30-45 25
45-60 32
60-90 47
90+ 57
This is something I can easily calculate in R, but to get the results back to my BI dashboard I'd have to create a new table in my database, which I don't think is a good practice. What is the SQL-native approach to this problem?
I would define the "late groups" using a range, then join against the number of days:
with groups (grp) as (
values
(int4range(0,15, '[)')),
(int4range(15,30, '[)')),
(int4range(30,45, '[)')),
(int4range(45,60, '[)')),
(int4range(60,90, '[)')),
(int4range(90,null, '[)'))
)
select grp, count(t.user_id)
from groups g
left join the_table t on g.grp @> (current_date - t.late_at)
group by grp
order by grp;
int4range(0,15, '[)') creates a range from 0 (inclusive) to 15 (exclusive).
Online example: https://rextester.com/QJSN89445
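For example, a quick check of the range-containment operator used in the join above (hypothetical values):
select int4range(0, 15, '[)') @> 14 as contains_14, -- true
int4range(0, 15, '[)') @> 15 as contains_15; -- false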
The quick and dirty way to do this in SQL is:
SELECT '0-15' AS lateGroup,
COUNT(*) AS lateGroupCount
FROM my_table t
WHERE (CURRENT_DATE - t.lateAt) >= 0
AND (CURRENT_DATE - t.lateAt) < 15
UNION
SELECT '15-30' AS lateGroup,
COUNT(*) AS lateGroupCount
FROM my_table t
WHERE (CURRENT_DATE - t.lateAt) >= 15
AND (CURRENT_DATE - t.lateAt) < 30
UNION
SELECT '30-45' AS lateGroup,
COUNT(*) AS lateGroupCount
FROM my_table t
WHERE (CURRENT_DATE - t.lateAt) >= 30
AND (CURRENT_DATE - t.lateAt) < 45
-- Etc...
For production code, you would want to do something more like Ross' answer.
You didn't mention which DBMS you're using, but nearly all of them will have a construct known as a "value constructor" like this:
select bins.lateGroup, bins.minVal, bins.maxVal FROM
(VALUES
('0-15',0,15),
('15-30',15.0001,30), -- increase by a small fraction so bins don't overlap
('30-45',30.0001,45),
('45-60',45.0001,60),
('60-90',60.0001,90),
('90-99999',90.0001,99999)
) AS bins(lateGroup,minVal,maxVal)
If your DBMS doesn't have it, then you can probably use UNION ALL:
SELECT '0-15' as lateGroup, 0 as minVal, 15 as maxVal
union all SELECT '15-30',15,30
union all SELECT '30-45',30,45
Then your complete query, with the sample data you provided, would look like this:
--- example from SQL Server 2012 SP1
--- first let's set up some sample data
create table #temp (id int, lateAt datetime);
INSERT #temp (id, lateAt) values
(1231235,'2019-09-14'),
(1242123,'2019-09-13'),
(3465345,NULL),
(5676548,'2019-09-28'),
(8986475,'2019-09-23');
--- here's the actual query
select lateGroup, count(*) as Count
from #temp as T,
(VALUES
('0-15',0,15),
('15-30',15.0001,30), -- increase by a small fraction so bins don't overlap
('30-45',30.0001,45),
('45-60',45.0001,60),
('60-90',60.0001,90),
('90-99999',90.0001,99999)
) AS bins(lateGroup,minVal,maxVal)
where datediff(day,lateAt,getdate()) between minVal and maxVal
group by lateGroup
order by lateGroup
--- remove our sample data
drop table #temp;
Here's the output:
lateGroup Count
15-30 2
30-45 2
Note: rows with null lateAt are not counted.
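If you also want those NULL rows reported as their own group, one option (a sketch, run against the same #temp sample table before it is dropped) is an extra branch combined with the query above via UNION ALL:
select 'NULL' as lateGroup, count(*) as Count
from #temp
where lateAt is null;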
I think you can do it all in one clear query:
with cte_lategroup as
(
select *
from (values(0,15,'0-15'),(15,30,'15-30'),(30,45,'30-45')) as t (mini, maxi, designation)
)
select
t2.designation
, count(*)
from test t
left outer join cte_lategroup t2
on current_date - t.lateat >= t2.mini
and current_date - lateat < t2.maxi
group by t2.designation;
With a data set like yours:
create table test
(
id int
, lateAt date
);
insert into test
values (1231235, to_date('2019/09/14', 'yyyy/mm/dd'))
,(1242123, to_date('2019/09/13', 'yyyy/mm/dd'))
,(3465345, null)
,(5676548, to_date('2019/09/28', 'yyyy/mm/dd'))
,(8986475, to_date('2019/09/23', 'yyyy/mm/dd'));