Aggregate based on flag (with qualifiers) - sql

I am trying to convert some calculated fields in Hyperion Studio to SQL. I got stuck trying to aggregate enrollment counts based on a shared course location/date/time. I have a flag created using LEAD to mark rows where the course location is identical to the row below. I need to roll up those consecutive rows, based on the flag, to get a total enrollment count for each location. The flag calculation includes exceptions (seen in example below, rows 2 and 3) where I specifically command it not to flag despite a shared location.
This is an example of base data that includes the flag (Roll Up is the field I'm trying to calculate):
SectionID  Course Name  Title   Instructor  Location    Enrollment  Flag  Roll Up
1          EN.100.201   Title1  Prof. W     Building 1  16
2          EN.550.365   Title2  Prof. X     Building 2  5
3          EN.530.403   Title3  Prof. Y     Building 2  30
4          EN.400.401   Title4  Prof. Z     Building 3  25          Y
5          EN.400.601   Title4  Prof. Z     Building 3  10
Here is the output I'm trying to achieve
SectionID  Course Name  Title   Instructor  Location    Enrollment  Flag  Roll Up
1          EN.100.201   Title1  Prof. W     Building 1  16                16
2          EN.550.365   Title2  Prof. X     Building 2  5                 5
3          EN.530.403   Title3  Prof. Y     Building 2  30                30
5          EN.400.601   Title4  Prof. Z     Building 3  10                35
Thanks in advance! Edit: made this easier to read; sorry, I'm new here.
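One possible way to get the roll-up is to treat it as a gaps-and-islands problem: every unflagged row closes a group, so the number of unflagged rows before a given row can serve as a group id. The sketch below runs that idea through Python's sqlite3 module (it needs SQLite 3.25 or later for window functions). The table name `sections`, the shortened column list, and the assumption that Flag has already been computed as 'Y' or NULL are all illustrative, not taken from the asker's actual database.

```python
import sqlite3

# Hypothetical table layout; Flag is assumed to be precomputed ('Y' or NULL).
rows = [
    (1, "EN.100.201", "Building 1", 16, None),
    (2, "EN.550.365", "Building 2", 5, None),
    (3, "EN.530.403", "Building 2", 30, None),
    (4, "EN.400.401", "Building 3", 25, "Y"),
    (5, "EN.400.601", "Building 3", 10, None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sections (SectionID INTEGER, Course TEXT, Location TEXT, "
    "Enrollment INTEGER, Flag TEXT)"
)
conn.executemany("INSERT INTO sections VALUES (?, ?, ?, ?, ?)", rows)

query = """
WITH grouped AS (
    -- grp = number of unflagged rows strictly before this row
    SELECT *,
           COALESCE(SUM(CASE WHEN Flag IS NULL THEN 1 ELSE 0 END)
                    OVER (ORDER BY SectionID
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS grp
    FROM sections
),
rolled AS (
    SELECT *, SUM(Enrollment) OVER (PARTITION BY grp) AS RollUp
    FROM grouped
)
-- keep only the row that closes each group (the unflagged one)
SELECT SectionID, Course, Location, Enrollment, RollUp
FROM rolled
WHERE Flag IS NULL
ORDER BY SectionID
"""
rollup = conn.execute(query).fetchall()
for row in rollup:
    print(row)
```

With the sample data, sections 4 and 5 land in the same group, so section 5 carries a roll-up of 35 while the other rows keep their own enrollment. The same window-function pattern should carry over to most SQL dialects that support `SUM(...) OVER`, though the exact frame syntax may need adjusting.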

Related

Select maximum value where another column is used for the Grouping

I'm trying to join several tables, where one of the tables is acting as a
key-value store, and then after the joins find the maximum value in a
column less than another column. As a simplified example, I have the following three tables:
Documents:
DocumentID  Filename      LatestRevision
1           D1001.SLDDRW  18
2           P5002.SLDPRT  10
Variables:
VariableID  VariableName
1           DateReleased
2           Change
3           Description
VariableValues:
DocumentID  VariableID  Revision  Value
1           2           1         Created
1           3           1         Drawing
1           2           3         Changed Dimension
1           1           4         2021-02-01
1           2           11        Corrected typos
1           1           16        2021-02-25
2           3           1         Generic part
2           3           5         Screw
2           2           4         2021-02-24
I can use the LEFT JOIN/IS NULL thing to get the latest version of
variables relatively easily (see http://sqlfiddle.com/#!7/5982d/3/0).
What I want is the latest version of variables that are less than or equal
to a revision which has a DateReleased, for example:
DocumentID  Filename      Variable     Value              VariableRev  DateReleased  ReleasedRev
1           D1001.SLDDRW  Change       Changed Dimension  3            2021-02-01    4
1           D1001.SLDDRW  Description  Drawing            1            2021-02-01    4
1           D1001.SLDDRW  Description  Drawing            1            2021-02-25    16
1           D1001.SLDDRW  Change       Corrected Typos    11           2021-02-25    16
2           P5002.SLDPRT  Description  Generic Part       1            2021-02-24    4
How do I do this?
I figured this out. Add another JOIN at the start that brings in a second copy of the VariableValues table, selecting only the DateReleased variables, then make sure that every VariableValues Revision selected is less than or equal to the Revision of that DateReleased row. I think the LEFT JOIN has to be added after this table.
The example at http://sqlfiddle.com/#!9/bd6068/3/0 shows this better.
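For readers who cannot open the fiddle, here is a rough sketch of the approach described above, run through Python's sqlite3 module. It is not a copy of the linked query. It loads only the rows for document 1 from the question, and the aliases `rel`, `vv`, and `newer` are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (DocumentID INTEGER, Filename TEXT, LatestRevision INTEGER);
CREATE TABLE Variables (VariableID INTEGER, VariableName TEXT);
CREATE TABLE VariableValues (DocumentID INTEGER, VariableID INTEGER,
                             Revision INTEGER, Value TEXT);
INSERT INTO Documents VALUES (1, 'D1001.SLDDRW', 18);
INSERT INTO Variables VALUES (1, 'DateReleased'), (2, 'Change'), (3, 'Description');
INSERT INTO VariableValues VALUES
    (1, 2, 1, 'Created'), (1, 3, 1, 'Drawing'), (1, 2, 3, 'Changed Dimension'),
    (1, 1, 4, '2021-02-01'), (1, 2, 11, 'Corrected typos'), (1, 1, 16, '2021-02-25');
""")

query = """
SELECT d.DocumentID, d.Filename, v.VariableName, vv.Value,
       vv.Revision AS VariableRev, rel.Value AS DateReleased,
       rel.Revision AS ReleasedRev
FROM Documents d
-- one row per release: VariableID 1 is DateReleased in the sample data
JOIN VariableValues rel
  ON rel.DocumentID = d.DocumentID AND rel.VariableID = 1
-- candidate values at or before that release
JOIN VariableValues vv
  ON vv.DocumentID = d.DocumentID AND vv.VariableID <> 1
 AND vv.Revision <= rel.Revision
JOIN Variables v ON v.VariableID = vv.VariableID
-- LEFT JOIN / IS NULL: discard a candidate if a newer one exists within the release
LEFT JOIN VariableValues newer
  ON newer.DocumentID = vv.DocumentID AND newer.VariableID = vv.VariableID
 AND newer.Revision > vv.Revision AND newer.Revision <= rel.Revision
WHERE newer.DocumentID IS NULL
ORDER BY rel.Revision, v.VariableName
"""
released = conn.execute(query).fetchall()
for row in released:
    print(row)
```

The key point is that the "newer" anti-join is bounded by the release revision as well, so each release gets its own latest value per variable.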

How to count the ID with the same prefix and store the total number in another column

I have a dataset in which the ID encodes classification info: the last 2 digits of the ID are the sub-ID (01, 02, 03, etc.) within the same family. Below is an example. I am trying to add another column (the 2nd column) that stores how many sub-IDs exist for the same family, e.g., 22302 belongs to family 223, which has 3 members: 22301, 22302, and 22303. This would give me a new feature for classification modeling. I'm not sure if there is a better way to extract this information. Can someone let me know how to compute the number of members in the same class (as shown in the 2nd column)?
ID Same class
23401 1
22302 3
43201 1
144501 2
144502 2
22301 3
22303 3
You can do it with a string slice and transform:
# group by the ID with its last two digits stripped, then give every row its group's size
df['New']=df.groupby(df.ID.astype(str).str[:-2]).ID.transform('size')
df
Out[223]:
ID Sameclass New
0 23401 1 1
1 22302 3 3
2 43201 1 1
3 144501 2 2
4 144502 2 2
5 22301 3 3
6 22303 3 3

Query: Employee Training Schedules Based on Position/Workrole

My company sends folks to training. Based on projected new hires/transfers, I was asked to generate a report that estimates the number of seats we need in each course broken out by quarter.
Question: My question is two-fold:
1. What is the best way to represent a sequence of courses (i.e. prerequisites) in a relational DB?
2. How do I create the query(-ies) necessary to produce the following desired output?
Desired Output:
ID PersonnelID CourseID ProjectedStartDate ProjectedEndDate
1 1 1 1/14/2017 1/14/2017
2 2 1 2/17/2017 2/17/2017
3 2 2 2/18/2017 2/19/2017
4 2 3 2/20/2017 2/20/2017
5 3 49 1/18/2017 2/03/2017
6 …
Background Info: The courses are taken in-sequence: the first few courses are orientation courses for the company, and later courses are more specific to the employee's workrole. There are over 50 different courses, 40 different workroles and we're projecting ~1k new hires/transfers. Each work role must take a sequence of courses in a prescribed order, but I'm having trouble representing this ordering and subsequently writing the necessary query.
Existing Tables:
I have several tables that I've used to store the data: Personnel, LnkPersonnelToWorkroles, Workroles, LnkWorkrolesToCourses, and Courses (there are many others as well, but I omit them for the sake of scoping this question down). Here's some notional data from these tables:
Personnel (These are the projected new hires and their estimated arrival date.)
ID DisplayName RequiredCompletionDate
1 Kristel Bump 10/1/2016
2 Shelton Franke 3/11/2017
3 Shaunda Launer 4/16/2017
4 Clarinda Kestler 3/13/2017
5 My Wimsatt 6/6/2017
6 Gillian Bramer 10/25/2016
7 ...
Workroles (These are the positions in the company)
ID Workrole
1 Manager
2 Secretary
3 Admin Asst.
4 ...
LnkPersonnelToWorkroles (Links projected new hires to their projected workrole)
ID PersonnelID WorkroleID
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 ...
Courses (All courses available)
ID CourseName LengthInDays
1 Orientation 1
2 Email Etiquette 2
3 Workplace Safety 1
4 ...
LnkWorkrolesToCourses
(Links workroles to their required courses in a Many-to-Many relationship)
ID WorkroleID CourseID
1 1 1
2 2 1
3 2 2
4 2 3
5 3 49
6 ...
Thoughts: My approach is to first develop a person-by-person schedule based upon the new hire's target completion date and workrole. Then for each class, I could sum the number of new hires starting in that quarter.
I've considered trying to represent the courses in the most general way I could think of (i.e. using a directed acyclic graph), but since most of the courses have only a single prerequisite course, I think it's much easier to represent the prerequisites using the Prerequisites table below; however, I don't know how I would use this in a query.
Prerequisites (Is this a good idea?)
ID CourseID PrereqCourseID
1 2 1
2 3 1
3 4 1
4 5 4
5 ...
Note: I am not currently concerned with whether or not the courses are actually offered on those days; we will figure out the course schedules once we know approximately how many we need each quarter. Right now, we're trying to estimate the demand for each course.
Edit 1: To clarify the Desired Output table: if the person begins course 1 on day D, then they can't start course 2 until after they finish course 1, i.e. until the next day. For a course with a length of L > 1 days, the start date of the subsequent course is delayed by L days. Notice this effect playing out for workrole ID 2 in the Desired Output table: He is expected to arrive on 2/17, start and complete course 1 the same day, begin course 2 the next day (on 2/18), and finish course 2 the day after that (on 2/19).
I'm posting this answer because it gives me an approximate solution; other answers are still welcome.
I avoided a prerequisite table altogether and opted for a simpler approach: a partial ordering of the courses.
First, I drew the course prerequisite tree; it looked similar to this image:
I defined a partial ordering of the courses based on their depth in the prerequisite tree. In the picture above, CHM124 and High School Chem w/ Lab are priority 1, CHM152 is priority 2, CHM 153 is priority 3, CHM260 and CHM 270 are priority 4, and so on... This partial ordering was stored in the CoursePriority table:
CoursePriority:
ID CourseID Priority
1 1 1
2 2 2
3 3 3
4 4 3
5 5 4
6 6 3
7 ...
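As a rough illustration of how a depth-based priority could be derived rather than typed in by hand, the sketch below walks the notional Prerequisites table from the question (not the CoursePriority data above, which comes from a different tree). It assumes every course has at most one prerequisite. The names `prereq_of` and `priority` are made up for this example.

```python
from functools import lru_cache

# CourseID -> PrereqCourseID, taken from the question's Prerequisites table.
prereq_of = {2: 1, 3: 1, 4: 1, 5: 4}


@lru_cache(maxsize=None)
def priority(course_id):
    """Depth in the prerequisite tree; a course with no prerequisite gets 1."""
    parent = prereq_of.get(course_id)
    return 1 if parent is None else priority(parent) + 1


# Every course that appears either as a course or as a prerequisite.
all_courses = sorted(set(prereq_of) | set(prereq_of.values()))
priorities = {course: priority(course) for course in all_courses}
print(priorities)
```

Courses that share a depth (2, 3 and 4 here) are not ordered relative to each other, which is why this is only a partial ordering.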
So that no two courses would ever be taken at the same time, I perturbed each course's priority by a small random number using the following Update query:
UPDATE CoursePriority SET CoursePriority.Priority = [Priority]+Rnd([ID])/1000;
(I used [ID] as input to the Rnd method to ensure each course was perturbed by a different random number.) I ended up with this:
ID CourseID Priority
1 1 1.000005623
2 2 2.000094955
3 3 3.000036401
4 4 3.000052486
5 5 4.000076711
6 6 3.00000535
7 ...
The approach above answers my first question "What is the best [sensible] way to represent a sequence of courses (i.e. prerequisites) in a relational DB?" Now as for generating the course schedule...
First, I created a query qryLnkCoursePriorities to link Courses to the CoursePriority table:
SELECT Courses.ID AS CourseID, Courses.DurationInDays, CoursePriority.Priority
FROM Courses INNER JOIN CoursePriority ON Courses.ID = CoursePriority.CourseID;
Result:
CourseID DurationInDays Priority
1 35 1.000076177
2 21 2.000148297
3 28 3.000094352
4 14 3.000081442
5...
Second, I created the qryWorkrolePriorityDelay query:
SELECT LnkWorkrolesToCourses.WorkroleID,
       qryLnkCoursePriorities.CourseID AS CourseID,
       qryLnkCoursePriorities.Priority,
       qryLnkCoursePriorities.DurationInDays,
       ([DurationInDays]+Nz(DSum("DurationInDays","qryLnkCoursePriorities","[Priority]>" & [Priority] & ""))) AS LeadTimeInDays
FROM LnkWorkrolesToCourses INNER JOIN qryLnkCoursePriorities ON LnkWorkrolesToCourses.CourseID = qryLnkCoursePriorities.CourseID
ORDER BY LnkWorkrolesToCourses.WorkroleID, qryLnkCoursePriorities.Priority;
Simply put: The qryWorkrolePriorityDelay query tells me how many days in advance each course should be taken to ensure the new hire can complete all subsequent courses prior to their required training completion deadline. It looks like this:
WorkroleID CourseID Priority DurationInDays LeadTimeInDays
1 7 1.000060646 7 147
1 1 1.000076177 35 140
1 2 2.000148297 21 105
1 4 3.000081442 14 84
1 6 3.000082824 14 70
1 3 3.000094352 28 56
1 5 4.000106905 28 28
2...
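The LeadTimeInDays column is effectively a running total taken from the end of the priority-ordered list: each course's own duration plus the durations of everything that comes after it. A minimal Python sketch of that arithmetic, using the workrole 1 rows from the table above (the variable names are illustrative):

```python
# (CourseID, DurationInDays) for workrole 1, already sorted by Priority.
courses_by_priority = [(7, 7), (1, 35), (2, 21), (4, 14), (6, 14), (3, 28), (5, 28)]

lead_time = {}
running_total = 0
# Walk from the last course to the first, accumulating durations.
for course_id, duration in reversed(courses_by_priority):
    running_total += duration
    lead_time[course_id] = running_total

for course_id, _ in courses_by_priority:
    print(course_id, lead_time[course_id])
```

This reproduces the 147, 140, 105, 84, 70, 56, 28 sequence shown in the query output.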
Finally, I was able to bring this all together to create the qryCourseSchedule query:
SELECT Personnel.ID AS PersonnelID,
       LnkWorkrolesToCourses.CourseID,
       [ProjectedHireDate]-[leadTimeInDays] AS ProjectedStartDate,
       [ProjectedHireDate]-[leadTimeInDays]+[Courses].[DurationInDays] AS ProjectedEndDate
FROM Personnel INNER JOIN (((LnkWorkrolesToCourses
    INNER JOIN (Courses INNER JOIN qryWorkrolePriorityDelay ON Courses.ID = qryWorkrolePriorityDelay.CourseID)
        ON (Courses.ID = LnkWorkrolesToCourses.CourseID) AND (LnkWorkrolesToCourses.WorkroleID = qryWorkrolePriorityDelay.WorkroleID))
    INNER JOIN LnkPersonnelToWorkroles ON LnkWorkrolesToCourses.WorkroleID = LnkPersonnelToWorkroles.WorkroleID)
    INNER JOIN CoursePriority ON Courses.ID = CoursePriority.CourseID)
    ON Personnel.ID = LnkPersonnelToWorkroles.PersonnelID
ORDER BY Personnel.ID, [ProjectedHireDate]-[leadTimeInDays]+[Courses].[DurationInDays];
This query gives me the following output:
PersonnelID CourseID ProjectedStartDate ProjectedEndDate
1 7 5/7/2016 5/14/2016
1 1 5/14/2016 6/18/2016
1 2 6/18/2016 7/9/2016
1 4 7/9/2016 7/23/2016
1 6 7/23/2016 8/6/2016
1 3 8/6/2016 9/3/2016
1 5 9/3/2016 10/1/2016
2...
With this output, I created a pivot table, where course start dates were grouped by quarter and counted. This gave me exactly what I needed.
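The scheduling and the quarterly pivot can be sketched outside Access as well. The example below is only an approximation of the workflow described above: it treats the RequiredCompletionDate values from the Personnel table as the deadline the lead time is subtracted from, uses personnel 1 and 6 (both linked to workrole 1 in the notional data), and counts starts per course and quarter. The names `plan`, `deadlines`, and `quarter` are made up for the illustration.

```python
from collections import Counter
from datetime import date, timedelta

# (CourseID, DurationInDays, LeadTimeInDays) for workrole 1.
plan = [(7, 7, 147), (1, 35, 140), (2, 21, 105), (4, 14, 84),
        (6, 14, 70), (3, 28, 56), (5, 28, 28)]

# PersonnelID -> deadline (RequiredCompletionDate in the notional data).
deadlines = {1: date(2016, 10, 1), 6: date(2016, 10, 25)}

schedule = []  # (PersonnelID, CourseID, start, end)
for person_id, deadline in deadlines.items():
    for course_id, duration, lead in plan:
        start = deadline - timedelta(days=lead)
        schedule.append((person_id, course_id, start, start + timedelta(days=duration)))


def quarter(d):
    """Label a date with its calendar quarter, e.g. 2016-Q2."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


# Seats needed per course and quarter, counted by projected start date.
seats = Counter((course_id, quarter(start)) for _, course_id, start, _ in schedule)
for key in sorted(seats):
    print(key, seats[key])
```

For person 1 this gives the same start dates as the query output (5/7/2016 for course 7 through 9/3/2016 for course 5), and the counter plays the role of the pivot table.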

SQL Route Finder in Oracle - Recursion?

I am trying to build a simple route finder which calculates and stores the nodes that a route traverses to get from A to B. I have two tables: one made up of stages (the nodes and their 'next possible hops') and a route_stage table which should be able to store each calculated route with a unique route id.
Stage Table
STAGEID START_STATION NEXT_HOP_STATION LENGTH
---------- ------------------------------ ------------------------------ ----------
1 Penzance Plymouth 78
2 Plymouth Exeter 44.8
3 Exeter Taunton 36.6
4 Exeter Salisbury 96.6
5 Salisbury Basingstoke 38.2
6 Basingstoke Southampton 52.7
7 Southampton Poole 37
8 Poole Weymouth 31.6
9 Taunton Reading 99.5
10 Reading Basingstoke 18
11 Reading Paddington 40.9
12 Taunton Bristol 48.8
13 Bristol Bath 13
14 Bath Swindon 37.5
15 Swindon Reading 39.8
Route_Stage Table
ROUTEID STAGEID
---------- ----------
1 1
1 2
1 3
1 9
1 11
2 6
2 7
2 8
2 10
2 11
In the example above, the route with ID 1 starts at Penzance, traverses Plymouth, Exeter, Taunton and Reading, and terminates at Paddington. Ideally I want to create a stored procedure that takes a start and an end station as parameters so the code inside can calculate a suitable route.
I've had a look at recursion but got a bit lost, as I am not sure how the code should react when there are multiple potential paths from a node. How would it know which one was the correct one to go down?
Any help is greatly appreciated. Thanks!
For a single given starting position, this will (I think.. Sorry, typing by hand on an iPad) provide a row for each route that leaves that starting point.
SELECT
LEVEL as route_step,
t1.next_hop_station as next_station,
t1.stageid
FROM
stage t1
START WITH
t1.start_station = 'your start station'
CONNECT BY
-- each child stage starts where its parent stage ends
PRIOR t1.next_hop_station = t1.start_station
So, for start station Penzance:
Route_Step Next_Station StageID
1. Plymouth. 1
2. Exeter. 2
3. Taunton. 3
4. Reading. 9
5. Basingstoke. 10
6. Southampton 6
7. Poole. 7
8. Weymouth 8
5. Paddington. 11
3. Salisbury 4
4. Basingstoke. 5
5. Southampton. 6
6. Poole. 7
7. Weymouth. 8
* excuse the .'s!
Wrapping that with a join on your distinct starting stations (and removing the explicit START WITH clause so that you get routes from all stations, not just a single station) will give you what you need for your output table (although as per previous comments, I'm not sure what use that structure is to you, as you lose pertinent detail):
SELECT
first_stage.stageid as routeid,
q.stageid
FROM
(
SELECT
LEVEL as route_step,
t1.next_hop_station as next_station,
t1.stageid
FROM
stage t1
CONNECT BY
PRIOR t1.next_hop_station = t1.start_station
) q
INNER JOIN stage first_stage
ON first_stage.stageid = q.stageid
AND q.route_step = 1
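On the question of which branch to follow when a node has several next hops: one option is not to choose at all, but to let the recursion expand every branch and then pick among the complete routes, for example by total length. The sketch below shows that with a recursive CTE run through Python's sqlite3 module instead of Oracle's CONNECT BY. The LENGTH column is renamed `stage_length` here, and picking the shortest route is an assumption about what "suitable" means.

```python
import sqlite3

stages = [
    (1, "Penzance", "Plymouth", 78), (2, "Plymouth", "Exeter", 44.8),
    (3, "Exeter", "Taunton", 36.6), (4, "Exeter", "Salisbury", 96.6),
    (5, "Salisbury", "Basingstoke", 38.2), (6, "Basingstoke", "Southampton", 52.7),
    (7, "Southampton", "Poole", 37), (8, "Poole", "Weymouth", 31.6),
    (9, "Taunton", "Reading", 99.5), (10, "Reading", "Basingstoke", 18),
    (11, "Reading", "Paddington", 40.9), (12, "Taunton", "Bristol", 48.8),
    (13, "Bristol", "Bath", 13), (14, "Bath", "Swindon", 37.5),
    (15, "Swindon", "Reading", 39.8),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stage (stageid INTEGER, start_station TEXT, "
    "next_hop_station TEXT, stage_length REAL)"
)
conn.executemany("INSERT INTO stage VALUES (?, ?, ?, ?)", stages)

query = """
WITH RECURSIVE route(station, path, total) AS (
    -- anchor: every stage leaving the origin
    SELECT next_hop_station, CAST(stageid AS TEXT), stage_length
    FROM stage
    WHERE start_station = :origin
    UNION ALL
    -- step: extend each partial route by every possible next hop
    SELECT s.next_hop_station, r.path || ',' || s.stageid, r.total + s.stage_length
    FROM route r
    JOIN stage s ON s.start_station = r.station
)
SELECT path, total
FROM route
WHERE station = :destination
ORDER BY total
LIMIT 1
"""
best = conn.execute(query, {"origin": "Penzance", "destination": "Paddington"}).fetchone()
print(best)
```

For Penzance to Paddington this selects stages 1, 2, 3, 9, 11, which matches route 1 in the Route_Stage table. The sketch relies on the stage data having no loops; with cyclic data the recursion would need a check against stations already visited.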

Row aggregation of count-distinct measure

I have a fairly simple project set up to demonstrate what I want here. Here's the data:
Group
ID Name
1 Group 1
2 Group 2
3 Group 3
Person
ID GroupID Age Name
1 1 18 John
2 1 21 Stephen
3 1 18 Kate
4 2 18 Mary
5 2 19 Joseph
6 2 19 Michael
7 3 21 David
8 3 22 Kevin
9 3 21 Julian
I have 1 measure in my cube called Person Count, which is a distinct count on Person ID.
I have set up each non-ID column in the dimensions as attributes (Age, Person Name, Group).
When I process and browse the cube in Business Intelligence Development Studio, I get the following result set:
But what I actually want is for the Age rows to aggregate the Person Count, so that here there is only one row for 18 and it shows 2.
Is this possible (and how)?
Turns out this was a problem with the way I set up the Age attribute for the dimension.
I had:
KeyColumns = Person.ID
ValueColumn = Person.Age.
I don't know why I did this, but the solution is to delete the content of ValueColumn and set the KeyColumns to Person.Age again.
I now get the following result:
Everything else is the same for the project; this was the only change and is exactly what I wanted. If I get any issues with it I will keep this post updated for anyone else who may run into this in the future.
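To make the intended aggregation concrete outside of SSAS, the following sketch counts distinct person IDs per (GroupID, Age) pair using the Person table from the question. It only illustrates the numbers the cube should show once Age is keyed on Person.Age. It does not model the cube itself, and the variable names are illustrative.

```python
from collections import defaultdict

# (PersonID, GroupID, Age) rows from the Person table.
people = [
    (1, 1, 18), (2, 1, 21), (3, 1, 18),
    (4, 2, 18), (5, 2, 19), (6, 2, 19),
    (7, 3, 21), (8, 3, 22), (9, 3, 21),
]

# Collect the distinct person IDs that fall under each (GroupID, Age) key.
ids_by_key = defaultdict(set)
for person_id, group_id, age in people:
    ids_by_key[(group_id, age)].add(person_id)

person_count = {key: len(ids) for key, ids in sorted(ids_by_key.items())}
for key, count in person_count.items():
    print(key, count)
```

Group 1 ends up with a single row for age 18 and a count of 2, which is the behaviour the question asks for. Keying on Person.ID instead would produce one row per person.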