I have a table T:
Entity type starttime sequence duration
1 A 2017010101 1 12
1 A 2017010102 2 11
1 A 2017010103 3 3
1 A 2017010104 4 1
1 A 2017010105 1 19
1 A 2017010106 2 18
2 A 2017010101 1 18
2 A 2017010102 1 100
3 A 2017010101 1 120
I need to aggregate the data so that each run of sequence has a total duration and the first starttime:
Entity type starttime sequence duration
1 A 2017010101 1 27
1 A 2017010105 1 37
2 A 2017010101 1 18
2 A 2017010102 1 100
3 A 2017010101 1 120
I believe this is a gaps-and-islands problem, but I can't quite figure it out...
I have tried to use a lead() over (partition by entity order by sequence) but this keeps grabbing the next run of sequence.
If sequence has no gaps, you can use row_number() and subtract sequence to create a temporary column grp, which is then used for aggregation:
select entity, type, min(starttime) starttime,
min(sequence) sequence, sum(duration) duration
from (select t.*,
row_number() over (partition by entity order by starttime) - sequence grp
from t)
group by entity, type, grp
order by entity, grp
Test:
with t(entity, type, starttime, sequence, duration) as (
select 1, 'A', 2017010101, 1, 12 from dual union all
select 1, 'A', 2017010102, 2, 11 from dual union all
select 1, 'A', 2017010103, 3, 3 from dual union all
select 1, 'A', 2017010104, 4, 1 from dual union all
select 1, 'A', 2017010105, 1, 19 from dual union all
select 1, 'A', 2017010106, 2, 18 from dual union all
select 2, 'A', 2017010101, 1, 18 from dual union all
select 2, 'A', 2017010102, 1, 100 from dual union all
select 3, 'A', 2017010101, 1, 120 from dual )
select entity, type, min(starttime) starttime,
min(sequence) sequence, sum(duration) duration
from (select t.*,
row_number() over (partition by entity order by starttime) - sequence grp
from t)
group by entity, type, grp
order by entity, grp
ENTITY TYPE STARTTIME SEQUENCE DURATION
---------- ---- ---------- ---------- ----------
1 A 2017010101 1 27
1 A 2017010105 1 37
2 A 2017010101 1 18
2 A 2017010102 1 100
3 A 2017010101 1 120
You don't need row_number() for this. You can just subtract the sequence from the starttime -- assuming starttime is a date. The difference is constant for each group of sequential values:
select entity, type, min(starttime) as starttime,
min(sequence) as sequence, sum(duration) as duration
from t
group by entity, type, (starttime - sequence)
order by entity, min(starttime);
If starttime is a string then you need row_number() as Ponder suggests. If starttime is a number, then this works within a single month, but you probably want row_number().
I have a dataset within a date range which has three columns: Product_type, date and metric. For a given product_type, data is not available for all days. For the missing rows, we would like to do a forward date fill for the next n days using the last value of the metric.
Product_type  date        metric
A             2019-10-01  10
A             2019-10-02  12
A             2019-10-03  15
A             2019-10-04  5
A             2019-10-05  5
A             2019-10-06  5
A             2019-10-16  12
A             2019-10-17  23
A             2019-10-18  34
Here, the data from 2019-10-04 to 2019-10-06 has been forward filled. There might be bigger gaps in the dates, but we only want to fill the first n days.
Here, n=2, so rows 5 and 6 have been forward filled.
I am not sure how to implement this logic in SQL.
Here's one option. Read comments within code.
Sample data:
SQL> WITH
2 test (product_type, datum, metric)
3 AS
4 (SELECT 'A', DATE '2019-10-01', 10 FROM DUAL
5 UNION ALL
6 SELECT 'A', DATE '2019-10-02', 12 FROM DUAL
7 UNION ALL
8 SELECT 'A', DATE '2019-10-03', 15 FROM DUAL
9 UNION ALL
10 SELECT 'A', DATE '2019-10-04', 5 FROM DUAL
11 UNION ALL
12 SELECT 'A', DATE '2019-10-16', 12 FROM DUAL
13 UNION ALL
14 SELECT 'A', DATE '2019-10-18', 23 FROM DUAL),
Query begins here:
15 temp
16 AS
17 -- CB_FWD_FILL = 1 if difference between two consecutive dates is larger than 1 day
18 -- (i.e. that's the gap to be forward filled)
19 (SELECT product_type,
20 datum,
21 metric,
22 LEAD (datum) OVER (PARTITION BY product_type ORDER BY datum)
23 next_datum,
24 CASE
25 WHEN LEAD (datum)
26 OVER (PARTITION BY product_type ORDER BY datum)
27 - datum >
28 1
29 THEN
30 1
31 ELSE
32 0
33 END
34 cb_fwd_fill
35 FROM test)
36 -- original data from the table
37 SELECT product_type, datum, metric FROM test
38 UNION ALL
39 -- DATUM is the last date which is OK; add LEVEL pseudocolumn to it to fill the gap
40 -- with PAR_N number of rows
41 SELECT product_type, datum + LEVEL, metric
42 FROM (SELECT product_type, datum, metric
43 FROM (-- RN = 1 means that that's the first gap in data set - that's the one
44 -- that has to be forward filled
45 SELECT product_type,
46 datum,
47 metric,
48 ROW_NUMBER ()
49 OVER (PARTITION BY product_type ORDER BY datum) rn
50 FROM temp
51 WHERE cb_fwd_fill = 1)
52 WHERE rn = 1)
53 CONNECT BY LEVEL <= &par_n
54 ORDER BY datum;
Result:
Enter value for par_n: 2
PRODUCT_TYPE DATUM METRIC
--------------- ---------- ----------
A 2019-10-01 10
A 2019-10-02 12
A 2019-10-03 15
A 2019-10-04 5
A 2019-10-05 5 --> newly added
A 2019-10-06 5 --> rows
A 2019-10-16 12
A 2019-10-18 23
8 rows selected.
SQL>
Another solution:
WITH test (product_type, datum, metric) AS
(
SELECT 'A', DATE '2019-10-01', 10 FROM DUAL
UNION ALL
SELECT 'A', DATE '2019-10-02', 12 FROM DUAL
UNION ALL
SELECT 'A', DATE '2019-10-03', 15 FROM DUAL
UNION ALL
SELECT 'A', DATE '2019-10-04', 5 FROM DUAL
UNION ALL
SELECT 'A', DATE '2019-10-16', 12 FROM DUAL
UNION ALL
SELECT 'A', DATE '2019-10-18', 23 FROM DUAL
),
minmax(mindatum, maxdatum) AS (
SELECT MIN(datum), max(datum) from test
),
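-- alldates: one row per product_type for every calendar day between the MIN and MAX date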
alldates (datum, product_type) AS
(
SELECT mindatum + level - 1, t.product_type FROM minmax,
(select distinct product_type from test) t
connect by mindatum + level - 1 <= (select maxdatum from minmax)
),
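-- grouped: LEFT JOIN leaves metric NULL on missing days; the running COUNT of non-NULL rows gives every gap row the same grp as the last real row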
grouped as (
select a.datum, a.product_type, t.metric,
count(t.product_type) over(partition by a.product_type order by a.datum) as grp
from alldates a
left join test t on t.datum = a.datum and t.product_type = a.product_type
),
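-- final_table: within each (product_type, grp), rn = 0 is the real row and 1, 2, ... are the gap rows; LAST_VALUE ... IGNORE NULLS carries the metric forward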
final_table as (
select g.datum, g.product_type, g.grp, g.rn,
last_value(g.metric ignore nulls) over(partition by g.product_type order by g.datum) as metric
from (
select g.*, row_number() over(partition by product_type, grp order by datum) - 1 as rn
from grouped g
) g
)
select datum, product_type, metric
from final_table
where rn <= &par_n
order by datum
;
I have this table:
Site_ID  Volume  RPT_Date    RPT_Hour
1        10      01/01/2021  1
1        7       01/01/2021  2
1        13      01/01/2021  3
1        11      01/16/2021  1
1        3       01/16/2021  2
1        5       01/16/2021  3
2        9       01/01/2021  1
2        24      01/01/2021  2
2        16      01/01/2021  3
2        18      01/16/2021  1
2        7       01/16/2021  2
2        1       01/16/2021  3
I need to select the RPT_Hour with the highest Volume for each Site_ID and RPT_Date.
Needed Output:
Site_ID  Volume  RPT_Date    RPT_Hour
1        13      01/01/2021  1
1        11      01/16/2021  1
2        24      01/01/2021  2
2        18      01/16/2021  1
SELECT site_id, volume, rpt_date, rpt_hour
FROM (SELECT t.*,
ROW_NUMBER()
OVER (PARTITION BY site_id, rpt_date ORDER BY volume DESC) AS rn
FROM MyTable) t
WHERE rn = 1;
I cannot figure out how to group the table into like date groups. If I could do that, I think the rn = 1 will return the highest volume row for each date.
The way I see it, your query is OK (but rpt_hour in desired output is not).
SQL> with test (site_id, volume, rpt_date, rpt_hour) as
2 (select 1, 10, date '2021-01-01', 1 from dual union all
3 select 1, 7, date '2021-01-01', 2 from dual union all
4 select 1, 13, date '2021-01-01', 3 from dual union all
5 select 1, 11, date '2021-01-16', 1 from dual union all
6 select 1, 3, date '2021-01-16', 2 from dual union all
7 select 1, 5, date '2021-01-16', 3 from dual union all
8 --
9 select 2, 9, date '2021-01-01', 1 from dual union all
10 select 2, 24, date '2021-01-01', 3 from dual union all
11 select 2, 16, date '2021-01-01', 3 from dual union all
12 select 2, 18, date '2021-01-16', 1 from dual union all
13 select 2, 7, date '2021-01-16', 2 from dual union all
14 select 2, 1, date '2021-01-16', 3 from dual
15 ),
16 temp as
17 (select t.*,
18 row_number() over (partition by site_id, rpt_date order by volume desc) rn
19 from test t
20 )
21 select site_id, volume, rpt_date, rpt_hour
22 from temp
23 where rn = 1
24 /
SITE_ID VOLUME RPT_DATE RPT_HOUR
---------- ---------- ---------- ----------
1 13 01/01/2021 3
1 11 01/16/2021 1
2 24 01/01/2021 3
2 18 01/16/2021 1
SQL>
One option would be using the MAX(..) KEEP (DENSE_RANK ..) OVER (PARTITION BY ..) analytic function, without the need for any subquery, such as:
SELECT DISTINCT
site_id,
MAX(volume) KEEP (DENSE_RANK FIRST ORDER BY volume DESC) OVER
(PARTITION BY site_id, rpt_date) AS volume,
rpt_date,
MAX(rpt_hour) KEEP (DENSE_RANK FIRST ORDER BY volume DESC) OVER
(PARTITION BY site_id, rpt_date) AS rpt_hour
FROM t
GROUP BY site_id, rpt_date, volume, rpt_hour
ORDER BY site_id, rpt_date
Demo
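As a follow-up sketch on the design choice (assuming the same table t): because only one row per site_id and rpt_date is wanted, the analytic KEEP ... OVER plus DISTINCT can usually be collapsed into a plain aggregate KEEP, with no DISTINCT and no OVER clause:
SELECT site_id,
       MAX(volume) AS volume,
       rpt_date,
       MAX(rpt_hour) KEEP (DENSE_RANK FIRST ORDER BY volume DESC) AS rpt_hour
FROM t
GROUP BY site_id, rpt_date
ORDER BY site_id, rpt_date;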
I have the following table showing when customers bought a certain product. The data I have is CustomerID, Amount, Dat. I am trying to create the column ProductsIn30Days, which represents how many products a customer bought in the range from Dat minus 30 days up to and including the current day.
For example, ProductsIn30Days for CustomerID 1 on Dat 25.3.2020 is 7, since the customer bought 2 products on 25.3.2020 and 5 more products on 24.3.2020, which falls within 30 days before 25.3.2020.
CustomerID  Amount  Dat        ProductsIn30Days
1           1       23.3.2018  1
1           2       24.3.2020  2
1           3       24.3.2020  5
1           2       25.3.2020  7
1           2       24.5.2020  2
1           1       15.6.2020  3
2           7       24.3.2017  7
2           2       24.3.2020  2
I tried something like this with no success, since the partition only works on a single date rather than on a range like I would need:
select CustomerID, Amount, Dat,
sum(Amount) over (partition by CustomerID, Dat-30)
from table
Thank you for your help.
You can use an analytic SUM function with a range window:
SELECT t.*,
SUM(Amount) OVER (
PARTITION BY CustomerID
ORDER BY Dat
RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW
) AS ProductsIn30Days
FROM table_name t;
Which, for the sample data:
CREATE TABLE table_name (CustomerID, Amount, Dat) AS
SELECT 1, 1, DATE '2018-03-23' FROM DUAL UNION ALL
SELECT 1, 2, DATE '2020-03-24' FROM DUAL UNION ALL
SELECT 1, 3, DATE '2020-03-24' FROM DUAL UNION ALL
SELECT 1, 2, DATE '2020-03-25' FROM DUAL UNION ALL
SELECT 1, 2, DATE '2020-05-24' FROM DUAL UNION ALL
SELECT 1, 1, DATE '2020-06-15' FROM DUAL UNION ALL
SELECT 2, 7, DATE '2017-03-24' FROM DUAL UNION ALL
SELECT 2, 2, DATE '2020-03-24' FROM DUAL;
Outputs:
CUSTOMERID  AMOUNT  DAT                  PRODUCTSIN30DAYS
1           1       2018-03-23 00:00:00  1
1           2       2020-03-24 00:00:00  5
1           3       2020-03-24 00:00:00  5
1           2       2020-03-25 00:00:00  7
1           2       2020-05-24 00:00:00  2
1           1       2020-06-15 00:00:00  3
2           7       2017-03-24 00:00:00  7
2           2       2020-03-24 00:00:00  2
Note: if there are multiple values on the same date then they are tied in the ordering and are always aggregated together (e.g. rows 2 & 3). If you want them to be aggregated separately then you would need to order by something else to break the ties, but that would not work with a RANGE window.
db<>fiddle here
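For the tie-breaking case mentioned in the note above, here is a rough sketch (not part of the answer; ROWID is only a stand-in for whatever unique key the real table has) that totals per row with a correlated subquery instead of a window:
SELECT t.CustomerID, t.Amount, t.Dat,
       (SELECT SUM(t2.Amount)
          FROM table_name t2
         WHERE t2.CustomerID = t.CustomerID
           AND t2.Dat >= t.Dat - 30
           AND (t2.Dat < t.Dat
                OR (t2.Dat = t.Dat AND t2.ROWID <= t.ROWID))) AS ProductsIn30Days
  FROM table_name t;
This trades the single-pass window function for a correlated scan, so it is only worth considering when same-day rows really must keep separate running totals.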
I am using Oracle and trying to retrieve the total number of days a person was out of the office during the year. I have 2 tables involved:
Statuses
1 - Active
2 - Out of the Office
3 - Other
ScheduleHistory
RecordID - primary key
PersonID
PreviousStatusID
NextStatusID
DateChanged
I can easily find when the person went on vacation and when they came back, using
SELECT DateChanged FROM ScheduleHistory WHERE PersonID=111 AND NextStatusID = 2
and
SELECT DateChanged FROM ScheduleHistory WHERE PersonID=111 AND PreviousStatusID = 2
But in case a person went on vacation more than once, how can I calculate the total number of days a person was out of the office? Is it possible to do this programmatically, given only PersonID?
Here is some sample data:
RecordID PersonID PreviousStatusID NextStatusID DateChanged
-----------------------------------------------------------------------------
1 111 1 2 03/11/2020
2 111 2 1 03/13/2020
3 111 1 3 04/01/2020
4 111 3 1 04/07/2020
5 111 1 2 06/03/2020
6 111 2 1 06/05/2020
7 111 1 2 09/14/2020
8 111 2 1 09/17/2020
So from the data above, for the year 2020 for PersonID 111 the query should return 7
Try this:
with aux1 AS (
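-- lag_date: days elapsed since this person's previous status change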
SELECT
a.*,
to_date(datechanged, 'MM/DD/YYYY') - LAG(to_date(datechanged, 'MM/DD/YYYY')) OVER(
PARTITION BY personid
ORDER BY
recordid
) lag_date
FROM
ScheduleHistory a
)
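-- rows with previousstatusid = 2 mark the return to the office, so their lag_date is the length of that absence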
SELECT
personid,
SUM(lag_date) tot_days_ooo
FROM
aux1
WHERE
previousstatusid = 2
GROUP BY
personid;
If you want total days (or weekdays) for each year (and to account for periods when it goes over the year boundary) then:
WITH date_ranges ( personid, status, start_date, end_date ) AS (
SELECT personid,
nextstatusid,
datechanged,
LEAD(datechanged, 1, datechanged) OVER(
PARTITION BY personid
ORDER BY datechanged
)
FROM table_name
),
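-- split_year_ranges: recursively splits each out-of-office range at 1 January so that no row spans two calendar years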
split_year_ranges ( personid, year, start_date, end_date, max_date ) AS (
SELECT personid,
TRUNC( start_date, 'YY' ),
start_date,
LEAST(
end_date,
ADD_MONTHS( TRUNC( start_date, 'YY' ), 12 )
),
end_date
FROM date_ranges
WHERE status = 2
UNION ALL
SELECT personid,
end_date,
end_date,
LEAST( max_date, ADD_MONTHS( end_date, 12 ) ),
max_date
FROM split_year_ranges
WHERE end_date < max_date
)
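-- TRUNC(date, 'IW') is the Monday of that ISO week: full weeks contribute 5 weekdays each and the partial weeks at either end are capped at 5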
SELECT personid,
EXTRACT( YEAR FROM year) AS year,
SUM( end_date - start_date ) AS total_days,
SUM(
( TRUNC( end_date, 'IW' ) - TRUNC( start_date, 'IW' ) ) * 5 / 7
+ LEAST( end_date - TRUNC( end_date, 'IW' ), 5 )
- LEAST( start_date - TRUNC( start_date, 'IW' ), 5 )
) AS total_weekdays
FROM split_year_ranges
GROUP BY personid, year
ORDER BY personid, year
Which, for the sample data:
CREATE TABLE table_name ( RecordID, PersonID, PreviousStatusID, NextStatusID, DateChanged ) AS
SELECT 1, 111, 1, 2, DATE '2020-03-11' FROM DUAL UNION ALL
SELECT 2, 111, 2, 1, DATE '2020-03-13' FROM DUAL UNION ALL
SELECT 3, 111, 1, 3, DATE '2020-04-01' FROM DUAL UNION ALL
SELECT 4, 111, 3, 1, DATE '2020-04-07' FROM DUAL UNION ALL
SELECT 5, 111, 1, 2, DATE '2020-06-03' FROM DUAL UNION ALL
SELECT 6, 111, 2, 1, DATE '2020-06-05' FROM DUAL UNION ALL
SELECT 7, 111, 1, 2, DATE '2020-09-14' FROM DUAL UNION ALL
SELECT 8, 111, 2, 1, DATE '2020-09-17' FROM DUAL UNION ALL
SELECT 9, 222, 1, 2, DATE '2019-12-31' FROM DUAL UNION ALL
SELECT 10, 222, 2, 2, DATE '2020-12-01' FROM DUAL UNION ALL
SELECT 11, 222, 2, 2, DATE '2021-01-02' FROM DUAL;
Outputs:
PERSONID  YEAR  TOTAL_DAYS  TOTAL_WEEKDAYS
111       2020  7           7
222       2019  1           1
222       2020  366         262
222       2021  1           1
db<>fiddle here
Provided no vacation crosses a year boundary:
with grps as (
SELECT sh.*,
row_number() over (partition by PersonID, NextStatusID order by DateChanged) grp
FROM ScheduleHistory sh
WHERE NextStatusID in (1,2) and 3 not in (NextStatusID, PreviousStatusID)
), durations as (
SELECT PersonID, min(DateChanged) DateChanged, max(DateChanged) - min(DateChanged) duration
FROM grps
GROUP BY PersonID, grp
)
SELECT PersonID, sum(duration) days_out
FROM durations
GROUP BY PersonID;
db<>fiddle
year_span is used to split an interval spanning two years into two separate records
H1 adds a row number per PersonID to get the right sequence for each person
H2 pairs each status change with the next one to get the periods, and extracts the first day of the year of the interval end
H3 splits records that span two years and calculates the right start and end date for each interval
H calculates the days elapsed in each interval for each year
the final query sums up the records to get the output
EDIT
If you need workdays instead of total days, you should not use total_days/7*5 because it is a bad approximation and in some cases gives weird results.
I have posted a solution that jumps from Fridays to Mondays here.
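The linked post is not reproduced here, but as a rough illustration of the idea, the ISO-week arithmetic already used in the earlier answer can be wrapped on its own to count the weekdays between a start date d1 (inclusive) and an end date d2 (exclusive); the two dates below are just sample values:
SELECT (TRUNC(d2, 'IW') - TRUNC(d1, 'IW')) * 5 / 7
       + LEAST(d2 - TRUNC(d2, 'IW'), 5)
       - LEAST(d1 - TRUNC(d1, 'IW'), 5) AS weekdays
FROM (SELECT DATE '2020-03-11' AS d1, DATE '2020-03-13' AS d2 FROM dual);
-- returns 2 (Wednesday and Thursday)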
with
statuses (sid, sdescr) as (
select 1, 'Active' from dual union all
select 2, 'Out of the Office' from dual union all
select 3, 'Other' from dual
),
ScheduleHistory(RecordID, PersonID, PreviousStatusID, NextStatusID , DateChanged) as (
select 1, 111, 1, 2, date '2020-03-11' from dual union all
select 2, 111, 2, 1, date '2020-03-13' from dual union all
select 3, 111, 1, 3, date '2020-04-01' from dual union all
select 4, 111, 3, 1, date '2020-04-07' from dual union all
select 5, 111, 1, 2, date '2020-06-03' from dual union all
select 6, 111, 2, 1, date '2020-06-05' from dual union all
select 7, 111, 1, 2, date '2020-09-14' from dual union all
select 8, 111, 2, 1, date '2020-09-17' from dual union all
SELECT 9, 222, 1, 2, date '2019-12-31' from dual UNION ALL
SELECT 10, 222, 2, 2, date '2020-12-01' from dual UNION ALL
SELECT 11, 222, 2, 2, date '2021-01-02' from dual
),
year_span (n) as (
select 1 from dual union all
select 2 from dual
),
H1 AS (
SELECT ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY RecordID) PID, H.*
FROM ScheduleHistory H
),
H2 as (
SELECT
H1.*, H2.DateChanged DateChanged2,
EXTRACT(YEAR FROM H2.DateChanged) - EXTRACT(YEAR FROM H1.DateChanged) + 1 Y,
trunc(H2.DateChanged,'YEAR') Y2
FROM H1 H1
LEFT JOIN H1 H2 ON H1.PID = H2.PID-1 AND H1.PersonID = H2.PersonID
),
H3 AS (
SELECT Y, N, H2.PID, H2.RecordID, H2.PersonID, H2.NextStatusID,
CASE WHEN Y=1 THEN H2.DateChanged ELSE CASE WHEN N=1 THEN H2.DateChanged ELSE Y2 END END D1,
CASE WHEN Y=1 THEN H2.DateChanged2 ELSE CASE WHEN N=1 THEN Y2 ELSE H2.DateChanged2 END END D2
FROM H2
JOIN year_span N ON N.N <=Y
),
H AS (
SELECT PersonID, NextStatusID, EXTRACT(year FROM d1) Y, d2-d1 D
FROM H3
)
select PersonID, sdescr Status, Y, sum(d) d
from H
join statuses s on NextStatusID = s.sid
group by PersonID, sdescr, Y
order by PersonID, sdescr, Y
Output:
PersonID Status Y d
111 Active 2020 177
111 Other 2020 6
111 Out of the Office 2020 7
222 Out of the Office 2019 1
222 Out of the Office 2020 366
222 Out of the Office 2021 1
check the fiddle here
Oracle version 11g.
My table has records similar to these.
calendar_date ID record_count
25-OCT-2017 1 20
25-OCT-2017 2 40
25-OCT-2017 3 60
24-OCT-2017 1 70
24-OCT-2017 2 50
24-OCT-2017 3 10
20-OCT-2017 1 35
20-OCT-2017 2 60
20-OCT-2017 3 90
18-OCT-2017 1 80
18-OCT-2017 2 50
18-OCT-2017 3 45
i.e. for each ID, there is one record count for a given calendar day. The days are NOT continuous, i.e. there may be missing records for weekends/holidays etc. On such days, there will not be records available for any ID. However, on working days there are entries available for each ID.
I need to get the average record count over the last 30 business days for each ID.
I want output like this. (Don't go by the values; it is just a sample.)
ID avg_count_last_30
1 150
2 130
3 110
I am trying to figure out the most efficient way to do this. I thought of using RANGE BETWEEN, ROWS BETWEEN, etc., but I am not sure they would work.
Of course, a query like this won't help, as there are holidays in between:
select id, AVG(record_count) FROM mytable
where calendar_date between SYSDATE - 30 and SYSDATE - 1
group by id;
What I need is something like:
select id , AVG(record_count) FROM mytable
where calendar_date between last_30th_business_day and last_business_day
group by id;
last_30th_business_day would be found by counting distinct business days, starting from the most recent business day and going backwards until I count 30.
last_business_day would be the most recent business day.
I would like to know the experts' opinion on this and the best approach.
Based on your comment, try this one:
WITH mytable (calendar_date, ID, record_count) AS (
SELECT TO_DATE('25-10-2017', 'DD-MM-YYYY'), 1, 20 FROM dual UNION ALL
SELECT TO_DATE('25-10-2017', 'DD-MM-YYYY'), 2, 40 FROM dual UNION ALL
SELECT TO_DATE('25-10-2017', 'DD-MM-YYYY'), 3, 60 FROM dual UNION ALL
SELECT TO_DATE('24-10-2017', 'DD-MM-YYYY'), 1, 70 FROM dual UNION ALL
SELECT TO_DATE('24-10-2017', 'DD-MM-YYYY'), 2, 50 FROM dual UNION ALL
SELECT TO_DATE('24-10-2017', 'DD-MM-YYYY'), 3, 10 FROM dual UNION ALL
SELECT TO_DATE('20-10-2017', 'DD-MM-YYYY'), 1, 35 FROM dual UNION ALL
SELECT TO_DATE('20-10-2017', 'DD-MM-YYYY'), 2, 60 FROM dual UNION ALL
SELECT TO_DATE('20-10-2017', 'DD-MM-YYYY'), 3, 90 FROM dual UNION ALL
SELECT TO_DATE('18-10-2017', 'DD-MM-YYYY'), 1, 80 FROM dual UNION ALL
SELECT TO_DATE('18-10-2017', 'DD-MM-YYYY'), 2, 50 FROM dual UNION ALL
SELECT TO_DATE('18-10-2017', 'DD-MM-YYYY'), 3, 45 FROM dual),
t AS (
SELECT calendar_date, ID, record_count,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY calendar_date desc) AS RN
FROM mytable)
SELECT ID, AVG(RECORD_COUNT)
FROM t
WHERE rn <= 30
group by ID;
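A small variant, as a sketch only: the question states that every ID has a row on each working day, so the query above is fine as is; but if an ID could ever be missing on some business day, ranking the distinct calendar dates instead of each ID's own rows keeps every ID on the same 30-business-day window:
WITH ranked AS (
  SELECT m.*,
         DENSE_RANK() OVER (ORDER BY calendar_date DESC) AS dr
  FROM mytable m
)
SELECT id, AVG(record_count) AS avg_count_last_30
FROM ranked
WHERE dr <= 30
GROUP BY id;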