In PostgreSQL, in a table with multiple rows per unique ID, can you select one row by a condition per unique ID and other values? - sql

Specifically, I a table of test data in which I am trying to pick one row of data per student and session (i.e. fall, winter, spring). The problem is that there are some students who re-took the test within the same session, and I would like my query to handle these occurrences.
So let's say a student (with studentid = 12345) took a test twice in the fall--Once on September 23rd with a score of 85/100, and then again on October 3rd with a 75/100. I want to know two different queries, one to handle each of the following:
Return the row of their most recent test (i.e. the test from Oct 3rd)
Return the row of their highest scoring test (i.e. the test from Sept 23rd)
Here is an example of a table similar to the one I am working with:
| studentid | session | testdate | score | schoolyear |
-----------------------------------------------------------
| ...
| 42532 | Fall | '2020-10-01' | 68 | '2020-2021'
| 42532 | Winter | '2021-02-02' | 70 | '2020-2021'
| 12345 | Fall | '2020-09-23' | 85 | '2020-2021' <--- (this student has two records for the fall)
| 12345 | Fall | '2020-10-03' | 75 | '2020-2021' <---
| 12345 | Winter | '2021-01-10' | 79 | '2020-2021'
| 83456 | Fall | '2020-09-08' | 90 | '2019-2020'
| 83456 | Winter | '2021-01-18' | 83 | '2019-2020'
| ...
So I want to run a query similar to the following:
SELECT studentid, session, testdate, score
FROM exam_result
WHERE schoolyear = '2020-2021'
-- (something to filter out the multiples)
Where it returns 1 row per student AND session, for all students
Any help would be greatly appreciated!

If you want just one row, use fetch or limit:
select er.*
from exam_result er
where er.studentid = 12345
order by testdate desc
limit 1;
Just adjust the order by for the row you want.
For all tests, you would use distinct on:
select distinct on (er.studentid) er.*
from exam_result er
where . . . -- whatever other conditions you have
order by er.studentid, testdate desc

Which score you prefer to have; best or worst or something else? This query gives you the best score. I dropped testdate away because it isn't unique. If you need test date it makes query a bit more complicated. And what date you want if studet gets the same score twice?
SELECT studentid, session, MAX(score)
FROM exam_result
WHERE schoolyear = '2020-2021'
GROUP BY studentid, session
If you need exam date this this kind of query gives you the first exam date when a student get his highest score. This isn't tested, but you get idea. Make separate subquery which gets min date for student/score combination and join it to your original query.
SELECT
a.student_id.
a.session,
b.exam_date,
a.score
FROM
exam_resut a JOIN
(SELECT
student_id, session, MIN(exam_date)
FROM exam_result
WHERE schoolyear = '2020-2021'
GROUP BY student_id, session) b
ON a.studet_id = b.student_id and a.session = b.sesion
WHERE a.schoolyear = '2020-2021'
GROUP BY a.studentid, a.session, b.exam_date

Related

Is there a difference between Oracle SQL 'KEEP' for multiple columns and 'KEEP' for one and GROUP BY for the rest?

I'm just now learning about KEEP in Oracle SQL, but I cannot seem to find documentation that explains why their examples use KEEP in all columns that are not indexed.
I have a table with 5 columns
PERSON_ID | BRANCH | YEAR | STATUS | TIMESTAMP
123456 | 0001 | 2017 | 1 | 1-1-2017 (ROW 1)
123456 | 0001 | 2017 | 2 | 2-1-2017 (ROW 2)
123456 | 0002 | 2017 | 3 | 3-1-2017 (ROW 3)
123456 | 0001 | 2017 | 2 | 4-1-2017 (ROW 4)
123456 | 0001 | 2018 | 2 | 1-1-2018 (ROW 5)
123456 | 0001 | 2018 | 3 | 2-1-2018 (ROW 6)
I want to return the row of the most recent timestamp by person, branch, and year, so rows 3, 4, and 6.
RESULTS
PERSON_ID | BRANCH | YEAR | STATUS | TIME_STAMP
123456 | 0002 | 2017 | 3 | 3-1-2017 (ROW 3)
123456 | 0001 | 2017 | 2 | 4-1-2017 (ROW 4)
123456 | 0001 | 2018 | 3 | 2-1-2018 (ROW 6)
To get the entire row, I would normally I would write something like this:
SELECT *
FROM STATUS_TABLE a
WHERE a.TIME_STAMP =
(
SELECT MAX(sub.TIME_STAMP)
FROM STATUS_TABLE sub
WHERE a.PERSON_ID = sub.PERSON_ID
AND a.YEAR = sub.YEAR
AND a.BRANCH = sub.BRANCH
)
But I'm learning I can write this:
SELECT
a.PERSON_ID,
a.YEAR,
a.BRANCH,
MAX(a.STATUS) KEEP (DENSE_RANK FIRST ORDER BY TIME_STAMP DESC)
FROM STATUS_TABLE a
GROUP BY a.PERSON_ID, a.YEAR, a.BRANCH;
My concern is that a lot of the documentation and example I'm finding doesn't put all the group-by columns in GROUP BY, but rather they write a KEEP statement for many columns.
Like this:
SELECT
a.PERSON_ID,
MAX(a.YEAR) KEEP (DENSE_RANK FIRST ORDER BY TIME_STAMP DESC),
MAX(a.BRANCH) KEEP (DENSE_RANK FIRST ORDER BY TIME_STAMP DESC),
MAX(a.STATUS) KEEP (DENSE_RANK FIRST ORDER BY TIME_STAMP DESC)
FROM STATUS_TABLE a
GROUP BY a.PERSON_ID;
QUESTION
If I know that there will never be duplicates on TIME_STAMP for an ID, YEAR, and BRANCH, can I write it the first way or do I still need to write it the 2nd way. Using the first way, I get the results I'm expecting, but I can't seem to find any explanation of this method and what the differences may be.
Are there any?
Your aggregation queries are different. When you have:
GROUP BY a.PERSON_ID, a.YEAR, a.BRANCH
Your result set will have one row in the result set for each combination of the three columns.
If you specify:
GROUP BY a.PERSON_ID
Then there is one row only for each PERSON_ID. Under some circumstances, this is the same as the above version. But only when there is one YEAR and BRANCH per PERSON_ID. That is not true in your data.
These versions are functionally equivalent for most practical purposes to your version with the correlated subquery. One difference is what happens if any of the grouping/correlation columns are NULL. The GROUP BY keeps these groupings. The correlated subquery filters them out.

Calculate time span over a number of records

I have a table that has the following schema:
ID | FirstName | Surname | TransmissionID | CaptureDateTime
1 | Billy | Goat | ABCDEF | 2018-09-20 13:45:01.098
2 | Jonny | Cash | ABCDEF | 2018-09-20 13:45.01.108
3 | Sally | Sue | ABCDEF | 2018-09-20 13:45:01.298
4 | Jermaine | Cole | PQRSTU | 2018-09-20 13:45:01.398
5 | Mike | Smith | PQRSTU | 2018-09-20 13:45:01.498
There are well over 70,000 records and they store logs of transmissions to a web-service. What I'd like to know is how would I go about writing a script that would select the distinct TransmissionID values and also show the timespan between the earliest CaptureDateTime record and the latest record? Essentially I'd like to see what the rate of records the web-service is reading & writing.
Is it even possible to do so in a single SELECT statement or should I just create a stored procedure or report in code? I don't know where to start aside from SELECT DISTINCT TransmissionID for this sort of query.
Here's what I have so far (I'm stuck on the time calculation)
SELECT DISTINCT [TransmissionID],
COUNT(*) as 'Number of records'
FROM [log_table]
GROUP BY [TransmissionID]
HAVING COUNT(*) > 1
Not sure how to get the difference between the first and last record with the same TransmissionID I would like to get a result set like:
TransmissionID | TimeToCompletion | Number of records |
ABCDEF | 2.001 | 5000 |
Simply GROUP BY and use MIN / MAX function to find min/max date in each group and subtract them:
SELECT
TransmissionID,
COUNT(*),
DATEDIFF(second, MIN(CaptureDateTime), MAX(CaptureDateTime))
FROM yourdata
GROUP BY TransmissionID
HAVING COUNT(*) > 1
Use min and max to calculate timespan
SELECT [TransmissionID],
COUNT(*) as 'Number of records',datediff(s,min(CaptureDateTime),max(CaptureDateTime)) as timespan
FROM [log_table]
GROUP BY [TransmissionID]
HAVING COUNT(*) > 1
A method that returns the average time for all transmissionids, even those with only 1 record:
SELECT TransmissionID,
COUNT(*),
DATEDIFF(second, MIN(CaptureDateTime), MAX(CaptureDateTime)) * 1.0 / NULLIF(COUNT(*) - 1, 0)
FROM yourdata
GROUP BY TransmissionID;
Note that you may not actually want the maximum of the capture date for a given transmissionId. You might want the overall maximum in the table -- so you can consider the final period after the most recent record.
If so, this looks like:
SELECT TransmissionID,
COUNT(*),
DATEDIFF(second,
MIN(CaptureDateTime),
MAX(MAX(CaptureDateTime)) OVER ()
) * 1.0 / COUNT(*)
FROM yourdata
GROUP BY TransmissionID;

How to combine in one sql query in extra column the result of 2 group by queries?

Considering the following mdl_course_completions table that describes a course completion for a user:
id,bigint
userid,bigint
course,bigint
timeenrolled,bigint
timestarted,bigint
timecompleted,bigint
reaggregate,bigint
To determinate if a course has been finished by a student, I use a predicate on the timecompleted field.
When this field is null, the student has not finished the course, but when this field is not null, that means the student has finished the course.
Thus, the count of the number of students that finished course by course is given by:
SELECT mdl_course.fullname,count(*) as "number of students that didn't finish courses"
FROM mdl_course_completions
INNER JOIN mdl_course on mdl_course.id = mdl_course_completions.course
WHERE timecompleted IS NOT NULL
GROUP BY mdl_course.fullname
;
the result is:
| course name | number of students that finish courses |
|-------------|----------------------------------------|
| course 1 | 50 |
| course 2 | 200 |
| course 3 | 120 |
AND the count of the number of students that DIDN'T finished course by course is given by:
SELECT mdl_course.fullname,count(*) as "number student that didn't finish courses"
FROM mdl_course_completions
INNER JOIN mdl_course on mdl_course.id = mdl_course_completions.course
WHERE timecompleted IS NULL
GROUP BY mdl_course.fullname
;
the result is:
| course name | number of students that didn't finish courses |
|-------------|-----------------------------------------------|
| course 1 | 12 |
| course 2 | 12 |
| course 3 | 120 |
I wonder how can I combine this 2 queries to get in one query the results in an extra column such as:
| course name | number of students that finish courses | number of students that didn't finish courses |
|-------------|------------------------------------|-------------------------------------------|
| course 1 | 50 | 12 |
| course 2 | 200 | 12 |
| course 3 | 120 | 120 |
I am using postgresql.In my opinion, this kind of stuff is not related to database system. I just don't know how to proceed to combine these 2 queries in one in an extra column with the GROUP BY clause.
Use conditional aggregation.
SELECT mdl_course.fullname
,SUM((timecompleted IS NOT NULL)::int) as "number student that finish courses"
,SUM((timecompleted IS NULL)::int) as "number student that didn't finish courses"
FROM mdl_course_completions
INNER JOIN mdl_course on mdl_course.id = mdl_course_completions.course
GROUP BY mdl_course.fullname
From PostgreSQL 9.4 on, you can use the FILTER clause with aggregate functions:
count(*) FILTER (WHERE timecompleted IS NOT NULL)

SQL Where Query to Return Distinct Values

I have an app that has the built in initial Select option and only allows me to enter from the Where section. I have rows with duplicate values. I'm trying to get the list of just one record for each distinct value but am unsure how to get the statement to work. I've found one that almost does the trick but it doesn't give me any rows that had a dup. I assume due to the = so just need a way to get one for each that matches my where criteria. Examples below.
Initial Data Set
Date | Name | ANI | CallIndex | Duration
---------------------------------------------------------
2/2/2015 | John | 5555051000 | 00000.0001 | 60
2/2/2015 | John | | 00000.0001 | 70
3/1/2015 | Jim | 5555051001 | 00000.0012 | 80
3/4/2015 | Susan | | 00000.0022 | 90
3/4/2015 | Susan | 5555051002 | 00000.0022 | 30
4/10/2015 | April | 5555051003 | 00000.0030 | 35
4/11/2015 | Leon | 5555051004 | 00000.0035 | 10
4/15/2015 | Jane | 5555051005 | 00000.0050 | 20
4/15/2015 | Jane | 5555051005 | 00000.0050 | 60
4/15/2015 | Kevin | 5555051006 | 00000.0061 | 35
What I Want the Query to Return
Date | Name | ANI | CallIndex | Duration
---------------------------------------------------------
2/2/2015 | John | 5555051000 | 00000.0001 | 60
3/1/2015 | Jim | 5555051001 | 00000.0012 | 80
3/4/2015 | Susan | 5555051002 | 00000.0022 | 30
4/10/2015 | April | 5555051003 | 00000.0030 | 35
4/11/2015 | Leon | 5555051004 | 00000.0035 | 10
4/15/2015 | Jane | 5555051005 | 00000.0050 | 20
4/15/2015 | Kevin | 5555051006 | 00000.0061 | 35
Here is what I was able to get but when i run it I don't get the rows that did have dups callindex values. duration doesn't mattern and they never match up so if it helps to query using that as a filter that would be fine. I've added mock data to assist.
use Database
SELECT * FROM table
WHERE Date between '4/15/15 00:00' and '4/15/15 23:59'
and callindex in
(SELECT callindex
FROM table
GROUP BY callinex
HAVING COUNT(callindex) = 1)
Any help would be greatly appreciated.
Ok with the assistance of everyone here i was able to get the query to work perfectly within SQL. That said apparently the app I'm trying this on has a built in character limit and the below query is too long. This is the query i have to use as far as the restrictions and i have to be able to search both ID's at the same time because some get stamped with one or the other rarely both. I'm hoping someone might be able to help me shorten it?
use Database
select * from tblCall
WHERE
flddate between '4/15/15 00:00' and '4/15/15 23:59'
and fldAgentLoginID='1234'
and fldcalldir='incoming'
and fldcalltype='external'
and EXISTS (SELECT * FROM (SELECT MAX(fldCallName) AS fldCallName, fldCallID FROM tblCall GROUP BY fldCallID) derv WHERE tblCall.fldCallName = derv.fldCallName AND tblCall.fldCallID = derv.fldCallID)
or
flddate between '4/15/15 00:00' and '4/15/15 23:59'
and '4/15/15 23:59'
and fldPhoneLoginID='56789'
and fldcalldir='incoming'
and fldcalltype='external'
and EXISTS (SELECT * FROM (SELECT MAX(fldCallName) AS fldCallName, fldCallID FROM tblCall GROUP BY fldCallID) derv WHERE tblCall.fldCallName = derv.fldCallName AND tblCall.fldCallID = derv.fldCallID)
If the constraint is that we can only add to the WHERE clause, I don't think it's possible, due to there being 2 absolutely identical rows:
4/15/2015 | Jane | 5555051005 | 00000.0050
4/15/2015 | Jane | 5555051005 | 00000.0050
Is it possible that you can add HAVING or GROUP BY to the WHERE? or possibly UNION the SELECT to another SELECT statement? That may open up some additional possibilities.
Maybe with an union:
SELECT *
FROM table
GROUP BY Date, Name, ANI, CallIndex
HAVING ( COUNT(*) > 1 )
UNION
SELECT *
FROM table
WHERE Name not in (SELECT name from table
GROUP BY Date, Name, ANI, CallIndex
HAVING ( COUNT(*) > 1 ))
From your sample, it seems like you could just exclude rows in which there was no value in the ANI column. If that is the case you could simply do:
use Database
SELECT * FROM table
WHERE Date between '4/15/15 00:00' and '4/15/15 23:59'
and ANI is not null
If this doesn't work for you, let me know and I can see what else I can do.
Edit:
You've made it sound like the CallIndex combined with the Duration is a unique value. That seems somewhat doubtful to me, but if that is the case you could do something like this:
use Database
SELECT * FROM table
WHERE Date between '4/15/15 00:00' and '4/15/15 23:59'
and cast(callindex as varchar(80))+'-'+cast(min(duration) as varchar(80)) in
(SELECT cast(callindex as varchar(80))+'-'+cast(min(duration) as varchar(80))
FROM table
GROUP BY callindex)
There are two keywords you can use to get non-duplicated data, either DISTINCT or GROUP BY. In this case, I would use a GROUP BY, but you should read up on both.
This query groups all of the records by CallIndex and takes the MAX value for each of the other columns and should give you the results you want:
SELECT MAX(Date) AS Date, MAX(Name) AS Name, MAX(ANI) AS ANI, CallIndex
FROM table
GROUP BY CallIndex
EDIT
Since you can't use GROUP BY directly but you can have any SQL in the WHERE clause you can do:
SELECT *
FROM table
WHERE EXISTS
(
SELECT *
FROM
(
SELECT MAX(Date) AS Date, MAX(Name) AS Name, MAX(ANI) AS ANI, CallIndex
FROM table
GROUP BY CallIndex
) derv
WHERE table.Date = derv.Date
AND table.Name = derv.Name
AND table.ANI = derv.ANI
AND table.CallIndex = derv.CallIndex
)
This selects all rows from the table where there exists a matching row from the GROUP BY.
It won't be perfect, if any two rows match exactly, you'll still have duplicates, but that's the best you'll get with your restriction.
In your data, why not just do this?
SELECT *
FROM table
WHERE Date >= '2015-04-15' and Date < '2015-04-16'
ani is not null;
If the blank values are only a coincidence, then you have a problem just using a where clause. If the results are full duplicates (no column has a different value), then you probably cannot do what you want with just a where clause -- unless you are using SQLite, Oracle, or Postgres.

MIN() Function in SQL

Need help with Min Function in SQL
I have a table as shown below.
+------------+-------+-------+
| Date_ | Name | Score |
+------------+-------+-------+
| 2012/07/05 | Jack | 1 |
| 2012/07/05 | Jones | 1 |
| 2012/07/06 | Jill | 2 |
| 2012/07/06 | James | 3 |
| 2012/07/07 | Hugo | 1 |
| 2012/07/07 | Jack | 1 |
| 2012/07/07 | Jim | 2 |
+------------+-------+-------+
I would like to get the output like below
+------------+------+-------+
| Date_ | Name | Score |
+------------+------+-------+
| 2012/07/05 | Jack | 1 |
| 2012/07/06 | Jill | 2 |
| 2012/07/07 | Hugo | 1 |
+------------+------+-------+
When I use the MIN() function with just the date and Score column I get the lowest score for each date, which is what I want. I don't care which row is returned if there is a tie in the score for the same date. Trouble starts when I also want name column in the output. I tried a few variation of SQL (i.e min with correlated sub query) but I have no luck getting the output as shown above. Can anyone help please:)
Query is as follows
SELECT DISTINCT
A.USername, A.Date_, A.Score
FROM TestTable AS A
INNER JOIN (SELECT Date_,MIN(Score) AS MinScore
FROM TestTable
GROUP BY Date_) AS B
ON (A.Score = B.MinScore) AND (A.Date_ = B.Date_);
Use this solution:
SELECT a.date_, MIN(name) AS name, a.score
FROM tbl a
INNER JOIN
(
SELECT date_, MIN(score) AS minscore
FROM tbl
GROUP BY date_
) b ON a.date_ = b.date_ AND a.score = b.minscore
GROUP BY a.date_, a.score
SQL-Fiddle Demo
This will get the minimum score per date in the INNER JOIN subselect, which we use to join to the main table. Once we join the subselect, we will only have dates with names having the minimum score (with ties being displayed).
Since we only want one name per date, we then group by date and score, selecting whichever name: MIN(name).
If we want to display the name column, we must use an aggregate function on name to facilitate the GROUP BY on date and score columns, or else it will not work (We could also use MAX() on that column as well).
Please learn about the GROUP BY functionality of RDBMS.
SELECT Date_,Name,MIN(Score)
FROM T
GROUP BY Name
This makes the assumption that EACH NAME and EACH date appears only once, and this will only work for MySQL.
To make it work on other RDBMSs, you need to apply another group function on the Date column, like MAX. MIN. etc
SELECT T.Name, T.Date_, MIN(T.Score) as Score FROM T
GROUP BY T.Date_
Edit: This answer is not corrected as pointed out by JNK in comments
SELECT Date_,MAX(Name),MIN(Score)
FROM T
GROUP BY Date_
Here I am using MAX(NAME), it will pick one name if two names were found with the same goal numbers.
This will find Min score for each day (no duplicates), scored by any player. The name that starts with Z will be picked first than the name that starts with A.
Edit: Fixed by removing group by name