Oracle Running Total - sql

Looking for advice with 2 different types of sub-totals using PLSQL.
I need to pull a data set with 1) a unique headcount, and 2) a total number of credits, as a running total over time.
Raw Data:
This is the transactional data -- every time a student registers or a course, a record is inserted with the date, student id, and credits (along with course number and a bunch of other relevant data). One record per course per student.
STUDENT_ID CREDITS DATE
1 3 01-JAN-12
1 2 02-JAN-12
57 1 03-JAN-12
1 1 03-JAN-12
Processed Data:
This is what the boss needs to see -- it will be used for trending later (to see, for example, how this year's Jan-01 is measuring up against last year's Jan-01, etc.).
UniqueHeadcount SumCredits Date
1 3 01-JAN-12
1 5 02-JAN-12
2 7 03-JAN-12
The brute approach to this is to write a bunch of separate SELECTS (one for each day), and UNION them together. For example:
SELECT
COUNT(DISTINCT STUDENT_ID) as "UniqueHeadcount",
SUM(CREDIT_HR) as "SumCredits",
'01-JAN-12' as "DATE"
FROM
REGISTRATIONS
WHERE
TO_CHAR(DATE,'yyyymmdd') <= '20120101'
GROUP BY
'01-JAN-12'
UNION
SELECT
COUNT(DISTINCT STUDENT_ID) as "UniqueHeadcount",
SUM(CREDIT_HR) as "SumCredits",
'02-JAN-12' as "DATE"
FROM
REGISTRATIONS
WHERE
TO_CHAR(DATE,'yyyymmdd') <= '20120102'
GROUP BY
'02-JAN-12'
UNION
...
And that works -- the results are accurate -- but as you can see -- this is nowhere near elegant -- and if you have to do it for 365 days, well...it's a beast. There's got to be a better way to do it.
So far in my search, I've learned about an 'OVER' clause that I can use -- like this:
SELECT
COUNT(DISTINCT STUDENT_ID) OVER(ORDER BY TRUNC(RSTS_DATE) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) "UniqueHeadcount",
SUM(CREDIT_HR) OVER(ORDER BY TRUNC(RSTS_DATE) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as "SumCredits",
TRUNC(RSTS_DATE) as "DATE"
FROM
REGISTRATIONS
This query is way, way shorter (yay) -- but has two significant problems that I can't yet find my way around. First is that it doesn't work (by design, aparently?) with the COUNT DISTINCT. So I comment that out for a moment, but then run into the second problem: it ignores the TRUNC() function. The RSTS_DATE, though it appears to be just a day/month/year value when you run a SELECT on it, actually holds the time as well, so the result set I get is not summed simply over date, but also over times -- so instead of one record per day, my processed data returns hundreds of records per day (one for each individual course registration). For example:
UniqueHeadcount SumCredits Date
1 3 01-JAN-12
1 5 02-JAN-12
2 6 03-JAN-12 (hidden time: 07:32:27)
2 7 03-JAN-12 (hidden time: 08:01:33)
Not what I'm after.
So I'm looking for expertise -- if what I've explained so far makes sense -- is there another way to use the OVER clause, or perhaps there may be another feature of PLSQL altogether I should be using for this? I'm not strong in PLSQL if you can't tell, but if anyone can give me some direction -- even just words to google, I'd appreciate the help.
Thanks

Try this:
WITH CRdata AS
(
SELECT COUNT(DISTINCT STUDENT_ID) AS UniqueHeadcount,
SUM(CREDIT_HR) AS SumCredits,
TRUNC(RSTS_DATE) RSTS_DATE
FROM REGISTRATIONS
GROUP BY TRUNC(RSTS_DATE)
)
SELECT SUM(UniqueHeadcount) OVER(ORDER BY RSTS_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS UniqueHeadcount,
SUM(SumCredits) OVER(ORDER BY RSTS_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS SumCredits,
RSTS_DATE
FROM CRdata

Related

Impala get the difference between 2 dates excluding weekends

I'm trying to get the day difference between 2 dates in Impala but I need to exclude weekends.
I know it should be something like this but I'm not sure how the weekend piece would go...
DATEDIFF(resolution_date,created_date)
Thanks!
One approach at such task is to enumerate each and every day in the range, and then filter out the week ends before counting.
Some databases have specific features to generate date series, while in others offer recursive common-table-expression. Impala does not support recursive queries, so we need to look at alternative solutions.
If you have a table wit at least as many rows as the maximum number of days in a range, you can use row_number() to offset the starting date, and then conditional aggregation to count working days.
Assuming that your table is called mytable, with column id as primary key, and that the big table is called bigtable, you would do:
select
t.id,
sum(
case when dayofweek(dateadd(t.created_date, n.rn)) between 2 and 6
then 1 else 0 end
) no_days
from mytable t
inner join (select row_number() over(order by 1) - 1 rn from bigtable) n
on t.resolution_date > dateadd(t.created_date, n.rn)
group by id

SQL Server need the total of the previous 6 rows

I'm using SQL Server and I need to get the sum of the previous 6 rows of my table and place the results in its own column.
I'm able to get the 6th row back with the below query:
SELECT id
,FileSize
,LAG(FileSize,6) OVER (ORDER BY DAY(CompleteTime)) previous
FROM Jobs_analytics
group by id, CompleteTime, Jobs_analytics.FileSize
which gives me the six row back, but what I need is the sum of all six rows previous.
any help would be appreciate
Mike
You can use:
SELECT ja.id, ja.FileSize, CompleteTime,
SUM(FileSize) OVER (ORDER CompleteTime ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) as previous
FROM Jobs_analytics ja;
I don't see why GROUP BY is necessary. There are no aggregation functions.
Note that this takes 6 days including the current day. If you want the six preceding rows:
SELECT ja.id, ja.FileSize, DATE,
SUM(FileSize) OVER (ORDER BY CompleteTime ja.id ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) as previous
FROM Jobs_analytics ja

Query to select appropriate row and calculate elapsed time

I need some help in coming up with a query that will return the answer to the question “How long has a Help Desk Ticket been owned by the currently assigned group?” Following is a subset of the data model with some sample data:
Help Desk Cases
Case ID (PK) Assigned Person Assigned Group
123456 Robert Hardware
Help Desk Case Assignment History
Case ID (PK) Seq # (PK) Assigned Group Assigned Person Elapsed Time Row Added Date/Time
123456 1 Hardware 10
123456 2 Software 2
123456 3 Hardware Sam 1
123456 4 Software Sophie 6
123456 5 Hardware 8
123456 6 Hardware Sam 3
123456 7 Hardware Robert
The Elapsed Time column for the most recent row (Seq #7) is not updated until a subsequent row (Seq #8) is written, so I don’t think I can use an aggregate function. For the sample data above, I need to get the Row Added column from Seq # 5 and subtract it from the current date to get the total amount of time the case has been most recently assigned to the Hardware group (we ignore previous assignments such as Seq # 1 and Seq # 3).
The Query output for the example above should be:
Case ID Assigned Group Assigned Person Time Owned
123456 Hardware Robert Current Date - Seq #5 Row Added Date/Time
With Oracle 12c and higher...
select case_id,
last_assigned_group as assigned_group,
last_assigned_person as assigned_person,
nvl(last_row_added, systimestamp) - first_row_added as time_owned
from help_desk_case_assignment_history
match_recognize (
partition by case_id
order by seq#
measures
first(row_added) as first_row_added,
last(row_added) as last_row_added,
last(assigned_group) as last_assigned_group,
last(assigned_person) as last_assigned_person
one row per match
after match skip past last row
pattern (
assignment_run* case_end
)
define
assignment_run as (assigned_group = next(assigned_group)),
case_end as (elapsed_time is null or next(assigned_group) is null)
)
;
In human words: Per each helpdesk case ID find the last uninterrupted "run" of assignments within the same group. For the last "run" of assignments identify its starting time, ending time, and ending person. And display the found values.
With Oracle 11g and lower...
with xyz as (
select X.*,
case when lnnvl(assigned_group = lag(assigned_group) over (partition by case_id order by seq#)) then seq# end as assignment_run_start
from help_desk_case_assignment_history X
),
xyz2 as (
select X.*,
last_value(assignment_run_start) ignore nulls over (partition by case_id order by seq#) as assignment_run_id
from xyz X
),
xyz3 as (
select case_id, assigned_group, assignment_run_id,
max(assigned_person) keep (dense_rank last order by seq#) as last_assigned_person,
nvl(max(row_added) keep (dense_rank last order by seq#), systimestamp)
- min(row_added) keep (dense_rank first order by seq#)
as time_owned,
row_number() over (partition by case_id order by assignment_run_id desc) as last_group_ind
from xyz2 X
group by case_id, assigned_group, assignment_run_id
)
select case_id, assigned_group, last_assigned_person as assigned_person, time_owned
from xyz3
where last_group_ind = 1
;
Perhaps ugly, but pretty straightforward and working.
In human words:
Identify the boundaries (starts) of assignment runs as increasing numeric IDs.
Extend the found assignment run starts to the whole assignment runs.
Calculate the assignments' run times and last assigned persons.
Restrict the previous calculation to the last (by their ID) assignment run only.

Multiple aggregate sums from different conditions in one sql query

Whereas I believe this is a fairly general SQL question, I am working in PostgreSQL 9.4 without an option to use other database software, and thus request that any answer be compatible with its capabilities.
I need to be able to return multiple aggregate totals from one query, such that each sum is in a new row, and each of the groupings are determined by a unique span of time, e.g. WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'. The number of records that satisfy there WHERE clause is unknown and may be zero, in which case ideally the result is "0". This is what I have worked out so far:
(
SELECT SUM(minutes) AS min
FROM downtime
WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-14' AND '2016-02-21'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-28' AND '2016-03-06'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-06' AND '2016-03-13'
)
UNION ALL
(
SELECT SUM(minutes))
FROM downtime
WHERE time_stamp BETWEEN '2016-03-13' AND '2016-03-20'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-20' AND '2016-03-27'
)
Result:
min
---+-----
1 | 119
2 | 4
3 | 30
4 |
5 | 62
6 | 350
That query gets me almost the exact result that I want; certainly good enough in that I can do exactly what I need with the results. Time spans with no records are blank but that was predictable, and whereas I would prefer "0" I can account for the blank rows in software.
But, while it isn't terrible for the 6 weeks that it represents, I want to be flexible and to be able to do the same thing for different time spans, and for a different number of data points, such as each day in a week, each week in 3 months, 6 months, each month in 1 year, 2 years, etc... As written above, it feels as if it is going to get tedious fast... for instance 1 week spans over a 2 year period is 104 sub-queries.
What I'm after is a more elegant way to get the same (or similar) result.
I also don't know if doing 104 iterations of a similar query to the above (vs. the 6 that it does now) is a particularly efficient usage.
Ultimately I am going to write some code which will help me build (and thus abstract away) the long, ugly query--but it would still be great to have a more concise and scale-able query.
In Postgres, you can generate a series of times and then use these for the aggregation:
select g.dte, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-03-20'::timestamp, interval '7 day') g(dte) left join
downtime dt
on dt.timestamp >= g.dte and dt.timestamp < g.dte + interval '7 day'
group by g.dte
order by g.dte;

count occurrences for each week using db2

I am looking for some general advice rather than a solution. My problem is that I have a list of dates per person where due to administrative procedures, a person may have multiple records stored for this one instance, yet the date recorded is when the data was entered in as this person is passed through the paper trail. I understand this is quite difficult to explain so I'll give an example:
Person Date Audit
------ ---- -----
1 2000-01-01 A
1 2000-01-01 B
1 2000-01-02 C
1 2003-04-01 A
1 2003-04-03 A
where I want to know how many valid records a person has by removing annoying audits that have recorded the date as the day the data was entered, rather than the date the person first arrives in the dataset. So for the above person I am only interested in:
Person Date Audit
------ ---- -----
1 2000-01-01 A
1 2003-04-01 A
what makes this problem difficult is that I do not have the luxury of an audit column (the audit column here is just to present how to data is collected). I merely have dates. So one way where I could crudely count real events (and remove repeat audit data) is to look at individual weeks within a persons' history and if a record(s) exists for a given week, add 1 to my counter. This way even though there are multiple records split over a few days, I am only counting the succession of dates as one record (which after all I am counting by date).
So does anyone know of any db2 functions that could help me solve this problem?
If you can live with standard weeks it's pretty simple:
select
person, year(dt), week(dt), min(dt), min(audit)
from
blah
group by
person, year(dt), week(dt)
If you need seven-day ranges starting with the first date you'd need to generate your own week numbers, a calendar of sorts, e.g. like so:
with minmax(mindt, maxdt) as ( -- date range of the "calendar"
select min(dt), max(dt)
from blah
),
cal(dt,i) as ( -- fill the range with every date, count days
select mindt, 0
from minmax
union all
select dt+1 day , i+1
from cal
where dt < (select maxdt from minmax) and i < 100000
)
select
person, year(blah.dt), wk, min(blah.dt), min(audit)
from
(select dt, int(i/7)+1 as wk from cal) t -- generate week numbers
inner join
blah
on t.dt = blah.dt
group by person, year(blah.dt), wk