SQL ranking effective dates - sql

There may be a very simple way to do this, but I can't quite think of it -- I have a dataset that returns a minimum job title and minimum effective date, then all effdts > than the min_effdt. In order to use this data in a charting program, I would like to rank each successive effdt if it exists, as in Min Role Effdt, then 2nd, 3rd, Max. Of course there could be anywhere from 2 to 20 jobs per person.
At first I considered trying a case statement, but I don't think that works when analyzing two columns at once. Is there a SQL statement that will allow ranking? Right now my data looks like
Employee Number | Min Base Role | Min Role Effdt | Base Role | Role Effdt
and comes from two tables, with the 2nd table brought in twice to get the Role / Effdt as Min, then All greater than Min.
I am using ORACLE. Code is below:
SELECT DISTINCT AL4.FULL_NAME,
AL4.EMPLOYEE_NUMBER,
AL4.HIRE_DATE,
AL4.DATE_OF_BIRTH,
AL4.AGE,
AL4.TERM_DATE,
AL4.ETHNIC_ORIGIN,
AL2.RECORDVALUE AS MIN_BASE_ROLE,
AL3.RECORDVALUE AS BASE_ROLE,
AL3.EFFECTIVE_START_DATE AS "ROLE EFFECTIVE DATE",
AL2.EFFECTIVE_START_DATE AS "MIN ROLE EFFDT"
FROM T1 AL2,
T2 AL3,
T3 AL4
WHERE AL4.PERSON_ID = AL2.PERSON_ID
AND AL4.PERSON_ID = AL3.PERSON_ID
AND AL4.EMPLOYEE_NUMBER = AL2.HISL_ID
AND AL4.EMPLOYEE_NUMBER = AL3.HISL_ID
AND AL2.RECORDTYPE = 'BASE_ROLE'
AND AL3.RECORDTYPE = 'BASE_ROLE'
AND AL2.EFFECTIVE_START_DATE = (SELECT MIN(A.EFFECTIVE_START_DATE) from T1 A where A.person_id = al2.person_id and a.recordtype = al2.recordtype)
AND AL3.EFFECTIVE_START_DATE > AL2.EFFECTIVE_START_DATE
AND (AL4.TERM_DATE >= '01-JAN-2012' or AL4.TERM_DATE is NULL)
order by AL4.EMPLOYEE_NUMBER

The function that you are looking for is row_number(). I think the expression you want is:
row_number() over (partition by AL4.EMPLOYEE_NUMBER
order by AL2.EFFECTIVE_START_DATE
) as ranking
The function row_number() says "assign a sequential number to a group of rows". The partition by clause defines the group, where the numbering starts over again at 1. The order by clause specifies the ordering within the group.
Similar functions rank() and dense_rank() might also be useful. They differ in how they handle duplicate values.

Related

How to get first row of 3 specific values of a column using Oracle SQL?

I have a table which has ID, FAMILY, ENV_XML_PATH and CREATED_DATE columns.
ID
FAMILY
ENV_XML_PATH
CREATED_DATE
15826841
CRM
path1.xml
03-09-22 6:50:34AM
15826856
SCM
path3.xml
03-10-22 7:12:20AM
15826786
IC
path4.xml
02-10-22 12:50:52AM
15825965
CRM
path5.xml
02-10-22 1:50:52AM
15653951
null
path6.xml
04-10-22 12:50:52AM
15826840
FIN
path7.xml
03-10-22 2:34:09AM
15826841
SCM
path8.xml
02-10-22 8:40:52AM
15223450
IC
path9.xml
03-09-22 5:34:09AM
15026853
SCM
path10.xml
05-10-22 4:40:59AM
Now there are 18 DISTINCT values in FAMILY column and each value has multiple rows associated (as you can see from the above image).
What I want is to get the first row of 3 specific values (CRM, SCM and IC) in FAMILY column.
Something like this:
ID
FAMILY
ENV_XML_PATH
CREATED_DATE
15826841
CRM
path1.xml
date1
15826856
SCM
path3.xml
date2
15826786
IC
path4.xml
date3
I am new to this, though I understand the logic but I am not sure how to implement it. Kindly help. Thanks.
You can use RANK for that. Something like this:
WITH groupedData AS
(SELECT id, family, env_xml_path, created_date,
RANK () OVER (PARTITION BY family ORDER BY id) AS r_num
FROM yourtable
GROUP BY id, family, env_xml_path, created_date)
SELECT id, family, env_xml_path, created_date
FROM groupedData
WHERE r_num = 1
ORDER BY id;
Thus, within the first query, your data will be grouped by family and sorted by the column you want (in my example, it will be sorted by id).
After that, you will use the second query to only take the first row of each family.
Add a WHERE clause to the first query if you need to apply further restrictions on the result set.
See here a working example: db<>fiddle
You could use a window function to get to know the row number of each partition in family ordered by the created_date, and then filter by the the three families you are interested in:
with row_window as (
select
id,
family,
env_xml_path,
created_date,
row_number() over (partition by family order by created_date asc) as rn
from <your_table>
where family in ('CRM', 'SCM', 'IC')
)
select
id,
family,
env_xml_path,
created_date
from row_window
where rn = 1
Output:
ID
FAMILY
ENV_XML_PATH
CREATED_DATE
15826841
CRM
path1.xml
03-09-22 6:50:34
15826856
SCM
path3.xml
03-10-22 7:12:20
15826786
IC
path4.xml
02-10-22 12:50:52
The question doesn't really specify what 'first' means, but I assume it means the first to be added in the table, aka the person whose date is the oldest. Try this code:
SELECT DISTINCT * FROM (yourTable) WHERE Family = 'CRM' OR
Family = 'SCM' OR Family = 'IC' ORDER BY Created_Date ASC FETCH FIRST (number) ROWS ONLY;
What it does:
Distinct - It selects different rows, which means you won't get same type of rows at the top.
Where - checks if certain condition is true
OR - it means that the select should choose rows that match those requirements. In the current situation the distinct clause means that same rows won't repeat, so you won't be getting 2 different 'CRM' family names, so it will find the first 'CRM' then the first 'SCM' and so on.
ORDER BY - orders the column in specified order. In the current one, if first rows mean the oldest, then by ordering them by date and using ASC the oldest(aka smallest date) will be at the top.
FETCH FIRST (number) ROWS ONLY - It selects only the very first couple of rows you want. For example if you need 3 different 'first' rows you need to get FETCH FIRST 3 ROWS ONLY. Combined with the distinct word it will only show 3 different rows.

How to count Boolean changes in PostgreSQL

My table looks like this
The goal is to count how many times the actuator_state of an specific actuator (from the column actuator_names) changes in a period of time. Keep in mind that a specific actuator has various actuators (For instance Heater has Heator0, Heator1, etc) and the goal is to count how many times has changed Heater0+ Heater1+ Heator2+ Heater3.... (Also the name of the table is state_actuator
I tried this:
SELECT actuator_nome AS NOME,
SUM (DISTINCT CASE WHEN actuator_state.actuator AND DISTINCT actuator_state.actuator_time AND DISTINCT actuator_state.actuator_state THEN 1 ELSE 0) AS TROCAS_ESTADO
FROM actuator_state WHERE actuator_time BETWEEN '2020-05-17 16:58:54' AND '2020-05-17 17:09:58' AND actuator_name='Heater'
The result should be
Heater: 5;
(for instance Heater0 has changed 3 times and Heater1 two times and other Heaters 0 changes)
You can use window functions for this:
select
actuator_name,
count(*) filter(where actuator_state <> lag_actuator_state) no_changes
from (
select
t.*,
lag(actuator_state)
over(partition by actuator_name, actuator order by actuator_time) lag_actuator_state
from mytable t
where actuator_time between '2020-05-17 16:58:54' and '2020-05-17 17:09:58'
) t
group by actuator_name
The subquery uses lag() to retrieve the "previous" state of each actuator. Then, the outer query aggregates by actuator_name, and performs a count that increments by 1 everytime the consecutive values are not equal.
You can add additional filters in the where clause of the subquery as needed.
Note that this query does not count the first value in the period as a change. Only further changes are taken into account.
You can use lag():
select actuator_name,
count(*) filter (where prev_as is distinct from actuator_state)
from (select sa.*,
lag(actuator_state) over (partition by actuator order by actuator_time) as prev_as
from state_actuator sa
) sa
where actuator_time between '2020-05-17 16:58:54' and '2020-05-17 17:09:58'
group by actuator_name;
You can filter on a particular name in the where clause as well.
Note that this counts the first appearance as a "change". It is not clear if that matches your intention.

How to Order only first 20 records in a resultset using SQL?

My requirement is to get the List of Diagnosis based on the most used Diagnosis. So, to achieve that I have added one Column named DiagnosisCounter in the tblDiagnosisMst Table of the database which increases by 1 for each Diagnosis the each time user selects it. So, my query is like below:
select DiagnosisID,DiagnosisCode,Name from tblDiagnosisMst
where GroupName = 'Common' and RecStatus = 'A' order by DiagnosisCounter desc,
Name asc
So, this query is helping me to get the list of Diagnosis but in descending order for Diagnosis and then alphabetically for Diagnosis Name. But now my client wants to show only 20 most used Diagnosis name at the top and then all the names should appear in alphabetical order. But unfortunately I am stuck in this point. It would be so appreciative if I get your helpful advice for this problem.
This should do the trick:
;With Ordered as (
select DiagnosisID,DiagnosisCode,Name,
ROW_NUMBER() OVER (ORDER BY DiagnosisCounter desc) as rn
from tblDiagnosisMst
where GroupName = 'Common' and RecStatus = 'A'
)
select * from Ordered
order by CASE WHEN rn <= 20 THEN rn ELSE 21 END,
Name asc
We use ROW_NUMBER to assign the numbers 1-x to each of the rows, based on the diagnosiscounter. We then use that value for the first ORDER BY condition if it's in 1-20, and all other rows sort equally in position 21. The second condition is then used as a tie-breaker to sort those remaining row by name.
Try this
SELECT TOP 20
* FROM tblDiagnosisMst ORDER BY DiagnosisCounter;

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the where clause but I can't even get a start
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
->SQLfiddle
Uses the aggregate function min() as window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level since window functions are applied after the WHERE clause. So you need at least one CTE of sub-select for a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers, I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
#Craig already mentioned the index to make this fast.
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs #Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.

How to write an SQL query that retrieves high scores over a recent subset of scores -- see explaination

Given a table of responses with columns:
Username, LessonNumber, QuestionNumber, Response, Score, Timestamp
How would I run a query that returns which users got a score of 90 or better on their first attempt at every question in their last 5 lessons? "last 5 lessons" is a limiting condition, rather than a requirement, so if they completely only 1 lesson, but got all of their first attempts for each question right, then they should be included in the results. We just don't want to look back farther than 5 lessons.
About the data: Users may be on different lessons. Some users may have not yet completed five lessons (may only be on lesson 3 for example). Each lesson has a different number of questions. Users have different lesson paths, so they may skip some lesson numbers or even complete lessons out of sequence.
Since this seems to be a problem of transforming temporally non-uniform/discontinuous values into uniform/contiguous values per-user, I think I can solve the bulk of the problem with a couple ranking function calls. The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions completed is variable per-user.
So far...
As a starting point or hint at what may need to happen, I've transformed Timestamp into an "AttemptNumber" for each question, by using "row_number() over (partition by Username,LessonNumber,QuestionNumber order by Timestamp) as AttemptNumber".
I'm also trying to transform LessonNumber from an absolute value into a contiguous ranked value for individual users. I could use "dense_rank() over (partition by Username order by LessonNumber desc) as LessonRank", but that assumes the order lessons are completed corresponds with the order of LessonNumber, which is unfortunately not always the case. However, let's assume that this is the case, since I do have a way of producing such a number through a couple of joins, so I can use the dense_rank transform described to select the "last 5 completed lessons" (i.e. LessonRank <= 5).
For the >90 condition, I think I can transform the score into an integer so that it's "1" if >= 90, and "0" if < 90. I can then introduce a clause like "group by Username having SUM(Score)=COUNT(Score).", which will select only those users with all scores equal to 1.
Any solutions or suggestions would be appreciated.
You kind of gave away the solution:
SELECT DISTINCT Username
FROM Results
WHERE Username NOT in (
SELECT DISTINCT Username
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1 and Score < 90
)
Concerning the LessonRank, I used exactly what you desribed since it is not clear how to order the lessons otherwise: The timestamp of the first attempt of the first question of a lesson? Or the timestamp of the first attempt of any question of a lesson? Or simply the first(or the most recent?) timestamp of any result of any question of a lesson?
The innermost Select adds all the AttemptNumber and LessonRank as provided by you.
The next Select retains only the results which would disqualify a user to be in the final list - all first attempts with an insufficient score in the last 5 lessons. We end up with a list of users we do not want to display in the final result.
Therefore, in the outermost Select, we can select all the users which are not in the exclusion list. Basically all the other users which have answered any question.
EDIT: As so often, second try should be better...
One more EDIT:
Here's a version including your remarks in the comments.
SELECT Username
FROM
(
SELECT Username, CASE WHEN Score >= 90 THEN 1 ELSE 0 END AS QuestionScoredWell
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1
) as ff
Group BY Username
HAVING MIN(QuestionScoredWell) = 1
I used a Having clause with a MIN expression on the calculated QuestionScoredWell value.
When comparing the execution plans for both queries, this query is actually faster. Not sure though whether this is partially due to the low number of data rows in my table.
Random suggestions:
1
The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions is variable per-user.
is equivalent to
There exists no first attempt with a score <= 90 most-recent 5 lessons
which strikes me as a little easier to grab with a NOT EXISTS subquery.
2
First attempt is the same as where timestamp = (select min(timestamp) ... )
You need to identify the top 5 lessons per user first, using the timestamp to prioritize lessons, then you can limit by score. Try:
Select username
from table t inner join
(select top 5 username, lessonNumber
from table
order by timestamp desc) l
on t.username = l.username and t.lessonNumber = l.lessonNumber
from table
where score >= 90