Group By Column, Select Most Recent Value - sql

I'm performing a query on a table which tracks the results of a test taken by students. The test is composed of multiple sections, and there is a column for each section score. Each row is an instance of the test taken by a student. The sections can either be taken all at once, or split into multiple attempts. For example, a student can take one section today, and the rest tomorrow. In addition, a student is allowed to retake any section of the test.
Sample Student:
StudentID WritingSection ReadingSection MathSection DateTaken
1 65 85 54 4/1/2013 14:53
1 98 NULL NULL 4/8/2013 13:13
1 NULL NULL 38 5/3/2013 12:43
A NULL means that the section was not administered for the given test instance, and a second section score means the section was retaken.
I want a query that groups by the StudentID such that there is only one row per student, and the most recent score for each section is returned. I'm looking for an efficient way to solve this problem as we have many hundreds of thousands of test attempts in the database.
Expected Result:
StudentID WritingSection ReadingSection MathSection DateTaken
1 98 85 38 5/3/2013 12:43
EDIT:
There have been a lot of good solutions. I want to experiment with each next week a little more before choosing the answer. Thanks everyone!

Sorry - my previous answer answered a DIFFERENT question than the one posed :) It will return all data from the MOST RECENT row. The question asked is to aggregate over all rows to grab the most recent score for each subject individually.
But I'm leaving it up there because the question I answered is a common one, and maybe someone landing on this question actually had that question instead :)
Now to answer the actual question:
I think the cleanest way to do this is with PIVOT and UNPIVOT:
SELECT StudentID,
MAX([WritingSection]) AS WritingSection,
MAX([ReadingSection]) AS ReadingSection,
MAX([MathSection]) AS MathSection,
MAX(DateTaken) AS DateTaken
FROM (
SELECT StudentID, Subject, DateTaken, Score
FROM (
SELECT StudentID, Subject, DateTaken, Score
, row_number() OVER (PARTITION BY StudentID, Subject ORDER BY DateTaken DESC) as rowNum
FROM Students s
UNPIVOT (
Score FOR Subject IN ([WritingSection],[ReadingSection],[MathSection])
) u
) x
WHERE x.rowNum = 1
) y
PIVOT (
MAX(Score) FOR Subject IN ([WritingSection],[ReadingSection],[MathSection])
) p
-- PIVOT groups by every remaining column, including DateTaken, so collapse to one row per student here
GROUP BY StudentID
The innermost subquery (x) uses SQL Server's UNPIVOT operator to normalize the data, turning each student's score on each section of the test into its own row. UNPIVOT also drops NULL scores, so sections that were not administered disappear at this step.
The next subquery out (y) simply filters the rows down to the most recent score for each subject individually. The extra level is needed because window functions such as row_number() are evaluated after the WHERE clause, so they cannot be referenced in the WHERE clause of the query that computes them.
Lastly, since you want the data displayed in the original denormalized format (one column per section of the test), we use SQL Server's PIVOT operator, which turns the rows back into columns. You also said you wanted the most recent test date shown, even though each section can have its own "most recent" date, so we aggregate over those up to three different DateTaken values to find the latest.
This will scale more easily than other solutions if there are more Sections added in the future - just add the column names to the list.
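For anyone without PIVOT/UNPIVOT, here is a rough sketch of the same unpivot, rank, re-pivot idea using Python's built-in sqlite3 module. It swaps UNPIVOT for UNION ALL and PIVOT for conditional aggregation, so it is not the T-SQL above. The in-memory table, the ISO date strings and the CTE names are my own assumptions, and window functions need SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical in-memory copy of the question's sample rows
# (dates stored as ISO text so that they sort chronologically).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Students (
    StudentID INTEGER, WritingSection INTEGER, ReadingSection INTEGER,
    MathSection INTEGER, DateTaken TEXT)""")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", [
    (1, 65, 85, 54, "2013-04-01 14:53"),
    (1, 98, None, None, "2013-04-08 13:13"),
    (1, None, None, 38, "2013-05-03 12:43"),
])

QUERY = """
WITH unpivoted AS (   -- one row per student, subject and attempt
    SELECT StudentID, 'Writing' AS Subject, WritingSection AS Score, DateTaken FROM Students
    UNION ALL
    SELECT StudentID, 'Reading', ReadingSection, DateTaken FROM Students
    UNION ALL
    SELECT StudentID, 'Math', MathSection, DateTaken FROM Students
),
ranked AS (           -- number the non-NULL scores, newest first
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY StudentID, Subject ORDER BY DateTaken DESC) AS rn
    FROM unpivoted
    WHERE Score IS NOT NULL
)
SELECT StudentID,     -- pivot back with conditional aggregation
       MAX(CASE WHEN Subject = 'Writing' THEN Score END) AS WritingSection,
       MAX(CASE WHEN Subject = 'Reading' THEN Score END) AS ReadingSection,
       MAX(CASE WHEN Subject = 'Math'    THEN Score END) AS MathSection,
       MAX(DateTaken) AS DateTaken
FROM ranked
WHERE rn = 1
GROUP BY StudentID
"""
latest_scores = conn.execute(QUERY).fetchall()
print(latest_scores)
```

The explicit WHERE Score IS NOT NULL stands in for the NULL filtering that UNPIVOT does on its own.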

This is tricky. Each section score is coming potentially from a different record. But the normal rules of max() and min() don't apply.
The following query gets a sequence number for each section, starting with the latest non-NULL value. This is then used for conditional aggregation in the outer query:
select s.StudentId,
max(case when ws_seqnum = 1 then WritingSection end) as WritingSection,
max(case when rs_seqnum = 1 then ReadingSection end) as ReadingSection,
max(case when ms_seqnum = 1 then MathSection end) as MathSection,
max(DateTaken) as DateTaken
from (select s.*,
row_number() over (partition by studentid
order by (case when WritingSection is not null then 0 else 1 end), DateTaken desc
) as ws_seqnum,
row_number() over (partition by studentid
order by (case when ReadingSection is not null then 0 else 1 end), DateTaken desc
) as rs_seqnum,
row_number() over (partition by studentid
order by (case when MathSection is not null then 0 else 1 end), DateTaken desc
) as ms_seqnum
from student s
) s
where StudentId = 1
group by StudentId;
The where clause is optional in this query. You can remove it and it should still work on all students.
This query is more complicated than it needs to be, because the data is not normalized. If you have control over the data structure, consider an association/junction table, with one row per student per test with the score and test date as columns in the table. (Full normality would introduce another table for the test dates, but that probably isn't necessary.)
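To illustrate the normalized layout suggested above, here is a small sketch using Python's sqlite3 module. The SectionScore table and its column names are hypothetical, not something from the question, and window functions need SQLite 3.25 or later. With one row per student, section and attempt, a single row_number() is enough:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical normalized layout: one row per student, section and attempt.
conn.execute("""CREATE TABLE SectionScore (
    StudentID INTEGER NOT NULL, Section TEXT NOT NULL,
    Score INTEGER NOT NULL, DateTaken TEXT NOT NULL)""")
conn.executemany("INSERT INTO SectionScore VALUES (?, ?, ?, ?)", [
    (1, "Writing", 65, "2013-04-01 14:53"),
    (1, "Reading", 85, "2013-04-01 14:53"),
    (1, "Math",    54, "2013-04-01 14:53"),
    (1, "Writing", 98, "2013-04-08 13:13"),
    (1, "Math",    38, "2013-05-03 12:43"),
])

# Keep only the newest row for each (student, section) pair.
latest_per_section = conn.execute("""
    SELECT StudentID, Section, Score, DateTaken
    FROM (SELECT s.*, ROW_NUMBER() OVER (
              PARTITION BY StudentID, Section ORDER BY DateTaken DESC) AS seqnum
          FROM SectionScore s)
    WHERE seqnum = 1
    ORDER BY StudentID, Section
""").fetchall()
print(latest_per_section)
```

No NULL handling is needed here because a section that was not administered simply has no row.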

Joe's solution will return only one student id - the one that took the test the latest. The way to get the latest date for each student id is to use analytical functions. Here's an example if you're using Oracle database:
SELECT a.StudentID, a.DateTaken
FROM ( SELECT StudentID,
DateTaken,
ROW_NUMBER ()
OVER (PARTITION BY StudentID ORDER BY DateTaken DESC)
rn
FROM pto.test
ORDER BY DateTaken DESC) a
WHERE a.rn = 1
Note how the row_number() function puts 1 on the latest date for each student id. In the outer select you then just keep the records with rn = 1. Execute only the inner select to see how it works.
Let me know what kind of database you're using so I can give you a solution for it. Each database has its own implementation of analytic functions, but the logic is the same.
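The same logic can be tried outside Oracle. This is a minimal sketch with Python's sqlite3 module (SQLite 3.25 or later for window functions); the table name test and the second student's rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (StudentID INTEGER, DateTaken TEXT)")
conn.executemany("INSERT INTO test VALUES (?, ?)", [
    (1, "2013-04-01 14:53"),
    (1, "2013-04-08 13:13"),
    (1, "2013-05-03 12:43"),
    (2, "2013-04-02 09:00"),   # hypothetical second student
    (2, "2013-03-15 10:30"),
])

# rn = 1 marks the newest attempt within each student's partition.
latest_attempts = conn.execute("""
    SELECT StudentID, DateTaken
    FROM (SELECT StudentID, DateTaken,
                 ROW_NUMBER() OVER (PARTITION BY StudentID
                                    ORDER BY DateTaken DESC) AS rn
          FROM test)
    WHERE rn = 1
    ORDER BY StudentID
""").fetchall()
print(latest_attempts)
```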

This is a pretty classic annoying problem in SQL - there's no super elegant way to do it. Here's the best I've found:
SELECT s.*
FROM Students s
JOIN (
SELECT StudentID, MAX(DateTaken) as MaxDateTaken
FROM Students
GROUP BY StudentID
) f ON s.StudentID = f.StudentID AND s.DateTaken = f.MaxDateTaken
Joining on the date field isn't super ideal (this breaks in the event of ties for a MAX) or fast (depending on how the table is indexed). If you have an int rowID that is unique across all rows, it would be preferable to do:
SELECT s.*
FROM Students s
JOIN (
SELECT rowID
FROM (
SELECT StudentID, rowID, row_number() OVER (PARTITION BY StudentID ORDER BY DateTaken DESC) as rowNumber
FROM Students
) x
WHERE x.rowNumber = 1
) f ON s.rowID = f.rowID

How about using the following to get the maximum DateTaken?
SELECT max(DateTaken) FROM TABLE_NAME
WHERE StudentID = 1
You could then use that in a subquery to get the matching row:
SELECT WritingSection FROM TABLE_NAME
WHERE StudentID = 1 and DateTaken = (SELECT max(DateTaken) FROM TABLE_NAME
WHERE StudentID = 1 and WritingSection IS NOT NULL)
You would need to run this twice more, once each for ReadingSection and MathSection.
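If running three separate queries is awkward, the three lookups can be folded into one statement as correlated subqueries. A sketch with Python's sqlite3 module follows; the table name TestResults is a placeholder of mine, and ORDER BY ... LIMIT 1 is SQLite syntax standing in for the MAX(DateTaken) comparison above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE TestResults (
    StudentID INTEGER, WritingSection INTEGER, ReadingSection INTEGER,
    MathSection INTEGER, DateTaken TEXT)""")
conn.executemany("INSERT INTO TestResults VALUES (?, ?, ?, ?, ?)", [
    (1, 65, 85, 54, "2013-04-01 14:53"),
    (1, 98, None, None, "2013-04-08 13:13"),
    (1, None, None, 38, "2013-05-03 12:43"),
])

# One correlated subquery per section: newest non-NULL score for that student.
per_section = conn.execute("""
    SELECT s.StudentID,
           (SELECT t.WritingSection FROM TestResults t
             WHERE t.StudentID = s.StudentID AND t.WritingSection IS NOT NULL
             ORDER BY t.DateTaken DESC LIMIT 1) AS WritingSection,
           (SELECT t.ReadingSection FROM TestResults t
             WHERE t.StudentID = s.StudentID AND t.ReadingSection IS NOT NULL
             ORDER BY t.DateTaken DESC LIMIT 1) AS ReadingSection,
           (SELECT t.MathSection FROM TestResults t
             WHERE t.StudentID = s.StudentID AND t.MathSection IS NOT NULL
             ORDER BY t.DateTaken DESC LIMIT 1) AS MathSection
    FROM (SELECT DISTINCT StudentID FROM TestResults) s
""").fetchall()
print(per_section)
```

On a large table this probably wants an index on (StudentID, DateTaken), but I have not measured it.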

SELECT student.studentid,
WRITE.writingsection,
READ.readingsection,
math.mathsection,
student.datetaken
FROM
-- list of students / max dates taken
(SELECT studentid,
Max(datetaken) datetaken
FROM test_record
GROUP BY studentid) student,
-- greatest date for student with a writingsection score (dont care what the date is here, just that the score comes from the greatest date)
(SELECT studentid,
writingsection
FROM test_record t
WHERE writingsection IS NOT NULL
AND datetaken = (SELECT Max(datetaken)
FROM test_record
WHERE studentid = t.studentid
AND writingsection IS NOT NULL)) WRITE,
(SELECT studentid,
readingsection
FROM test_record t
WHERE readingsection IS NOT NULL
AND datetaken = (SELECT Max(datetaken)
FROM test_record
WHERE studentid = t.studentid
AND readingsection IS NOT NULL)) READ,
(SELECT studentid,
mathsection
FROM test_record t
WHERE mathsection IS NOT NULL
AND datetaken = (SELECT Max(datetaken)
FROM test_record
WHERE studentid = t.studentid
AND mathsection IS NOT NULL)) math
WHERE
-- outer join in case a student has no score recorded for one or more of the sections
student.studentid = READ.studentid(+)
AND student.studentid = WRITE.studentid(+)
AND student.studentid = math.studentid(+);

Related

How to 'detect' a change of value in a column in SQL?

I'm new to SQL and wanted to ask:
I have combined multiple tables with CTEs and joins, which produced a table with the columns worker_id, name, id_application, job_category, job_id and worker_num.
From this table, I want to detect and count how many workers changed category between their 1st and 2nd job.
For example, Jonathan Carey has 'Sales Lapangan' as his first job_category and changed to 'other' for his 2nd job; I want to count this job_category change as one.
I tried CASE WHEN and WHILE, but I'm only getting more confused.
This is the query I used to create the table:
with data_apply2 as(with data_apply as(with all_apply as(with job_id as(select job_category,
row_number() over(order by job_category) as job_id
from job_post
group by job_category)
select jp.*, job_id.job_id from job_post jp
join job_id
on job_id.job_category=jp.job_category)
select ja.worker_id, wk.name, ja.id as id_application, aa.job_category, aa.job_id
from job_post_application ja
join all_apply aa
on aa.id=ja.job_post_id
join workers wk
on wk.id = ja.worker_id
order by worker_id,ja.id)
select *,
row_number() over(partition by worker_id order by worker_id) as worker_num
from data_apply)
Thank You
You can group by worker and check the number of distinct job categories:
SELECT worker_id,
COUNT(DISTINCT job_category) > 1 category_change
FROM data_apply
GROUP BY worker_id;
select case when job_category <> prev_job_category then 1 else 0 end as cnt
from
(
select
worker_id,
name,
id_application,
job_category,
job_id,
worker_num,
-- previous category for the same worker; the first row falls back to its own category
coalesce(lag(job_category) over(partition by worker_id order by id_application), job_category) as prev_job_category
from
sales_table
) x
This should help. The lag function reads job_category from the same worker's previous row. We compare it with the current job_category and, if they are not equal, count the row as 1. Sum cnt to get the total number of changes.
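Here is a small runnable sketch of the lag approach with Python's sqlite3 module (SQLite 3.25 or later). The data_apply rows are invented apart from the Jonathan Carey example from the question, and the second worker is hypothetical. It sums the changes per worker instead of returning a 0/1 flag per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data_apply (
    worker_id INTEGER, name TEXT, id_application INTEGER, job_category TEXT)""")
conn.executemany("INSERT INTO data_apply VALUES (?, ?, ?, ?)", [
    (1, "Jonathan Carey", 101, "Sales Lapangan"),
    (1, "Jonathan Carey", 102, "other"),
    (2, "Hypothetical Worker", 103, "other"),   # made-up worker who never changes
    (2, "Hypothetical Worker", 104, "other"),
])

# LAG fetches the previous application's category for the same worker;
# the first application has no predecessor (NULL) and is never counted.
category_changes = conn.execute("""
    SELECT worker_id,
           SUM(CASE WHEN prev_category IS NOT NULL
                     AND job_category <> prev_category THEN 1 ELSE 0 END) AS changes
    FROM (SELECT worker_id, job_category,
                 LAG(job_category) OVER (PARTITION BY worker_id
                                         ORDER BY id_application) AS prev_category
          FROM data_apply)
    GROUP BY worker_id
    ORDER BY worker_id
""").fetchall()
print(category_changes)
```

Unlike COUNT(DISTINCT job_category), this also counts a worker who switches away from a category and later back to it as two changes.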

Largest number of job position of employees at a multiple Companies using SQL

I am trying to find the most popular job position among employees across several companies. If there is a tie, all of the tied positions should be included in the result.
I have a file called employees_data.txt.
I have their name, company, job position, and age in that order.
Natali, Google, IT, 45
Nadia, Facebook, Sales, 25
Jacob, Google, IT, 32
Leonard, Bing, Custodian, 65
Kami, Amazon, Driver, 43
Paul, Facebook, Engineer, 31
Ashley, Walmart, IT, 34
Robert, Fedex, IT, 27
Rebecca, Ups, Driver, 29
Mal, Apple, Custodian, 73
Erin, Bing, Sales, 38
I know the expected outcome should be the IT position; I'm just unsure which SQL command would read through the data and keep track of the positions.
Any help is greatly appreciated!
Feels like homework :laugh:
You need an aggregate (count, sum, min, max, etc.) and a group by:
select count(*), position
from t
group by position
https://www.db-fiddle.com/f/dUqdZaUGpHTAYv8vH1YhU1/0
To return only the top record, we can use a self join with a row_number calculation like this. There is probably an easier and cleaner way to do it, but you get the idea.
SELECT count(*) as recordcount, t.position
FROM t
INNER JOIN (
SELECT *
,row_number() OVER (
ORDER BY recordCount DESC
) AS rn
FROM (
SELECT count(*) AS recordCount
,position
FROM t
GROUP BY position
) as a
) d ON t.position = d.position
AND d.rn = 1
group by t.position
https://www.db-fiddle.com/f/dUqdZaUGpHTAYv8vH1YhU1/1
You want aggregation with a window function. That is:
select p.*
from (select position, count(*) as cnt,
rank() over (order by count(*) desc) as seqnum
from t
group by position
) p
where seqnum = 1;
In PostgreSQL 13 and later, you don't even need a subquery, because FETCH FIRST now supports WITH TIES:
select position, count(*) as cnt
from t
group by position
order by count(*) desc
fetch first 1 row with ties;
I suspect your assignment calls for a query something along these lines:
select job_position
from employees_data
group by job_position
order by count(*) desc
fetch first 1 row with ties
Assuming the table is called jobpositions, and the columns are as follows:
name, company, position,age
I would use:
select * from (
select position, COUNT(position) as countpos, ROW_NUMBER() OVER(ORDER BY count(position) DESC) as numpos
from jobpositions group by position order by count(position) desc
) tb1 where tb1.numpos=1
This seems to work in Postgres, and I like it because it is simple. Note that ROW_NUMBER() returns only one row, so a tie for first place would not be reported.
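To see how rank() behaves on ties, here is a sketch that loads the question's sample rows into SQLite through Python's sqlite3 module (SQLite 3.25 or later). The table name employees is my assumption. The second call excludes IT purely to force a three-way tie; that filter is a demonstration device, not part of the original problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, company TEXT, position TEXT, age INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", [
    ("Natali", "Google", "IT", 45), ("Nadia", "Facebook", "Sales", 25),
    ("Jacob", "Google", "IT", 32), ("Leonard", "Bing", "Custodian", 65),
    ("Kami", "Amazon", "Driver", 43), ("Paul", "Facebook", "Engineer", 31),
    ("Ashley", "Walmart", "IT", 34), ("Robert", "Fedex", "IT", 27),
    ("Rebecca", "Ups", "Driver", 29), ("Mal", "Apple", "Custodian", 73),
    ("Erin", "Bing", "Sales", 38),
])

# Count per position, rank the counts, keep every position ranked first.
TOP_QUERY = """
SELECT position, cnt
FROM (SELECT position, cnt, RANK() OVER (ORDER BY cnt DESC) AS seqnum
      FROM (SELECT position, COUNT(*) AS cnt
            FROM employees
            WHERE position <> ?
            GROUP BY position))
WHERE seqnum = 1
ORDER BY position
"""
top_all = conn.execute(TOP_QUERY, ("",)).fetchall()           # nothing excluded
top_without_it = conn.execute(TOP_QUERY, ("IT",)).fetchall()  # forces a tie
print(top_all)
print(top_without_it)
```

With row_number() in place of rank(), the second call would return just one of the tied positions.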

How to count a temporary variable in SQLite?

I am working on a personal analytics project and I need to filter a SQL table. My SQL knowledge is very basic, and what I do know is Oracle, but in this case I have to use SQLite, which seems to be quite different.
For example, suppose the table is
student physics chemistry maths history english
Brian 78 62 100 40 50
Bill 80 70 95 50 60
Brian 80 40 90 95 60
The table has repetition.
I asked a question earlier today, using the same example above, which would let me rank the subjects for each student.
How to RANK multiple columns of an entire table?
What I want to do now is find out which students had Maths in the top 3 among all subjects and group the table for each student. So the goal is to find out how many times did Brian have Maths in Top 3 of his scores.
IT WeiHan's answer to the previous question (https://www.db-fiddle.com/f/bjui5W1VWmHXcqKAhK5iBD/0 ) worked perfectly and displayed the rank of the subjects for each row. I used their answer and tried to modify it for this purpose.
with cte as (
select student,'physics' as class,physics as score from Table1 union all
select student,'chemistry' as class,chemistry as score from Table1 union all
select student,'maths' as class,maths as score from Table1 union all
select student,'history' as class,history as score from Table1 union all
select student,'english' as class,english as score from Table1
)
SELECT name,class,score,rnk,
(CASE
WHEN class = "maths" AND rnk <=3 THEN 1
ELSE 0
END) as maths_rank
FROM
(select student,class,score,RANK() OVER (partition by student order by score desc) rnk
from cte)
which gives a table like
name class score rnk maths_rank
Brian maths 100 1 1
I want to be able to count the maths_rank values or sum it (as it contains 1 or 0 values) and group the table on student name. I tried to count the maths_rank variable but that didn't work and resulted in errors. Please help me out with a solution.
If I understand correctly, you are on the right path. I think you just need a where clause:
with cte as (
select student,'physics' as class,physics as score from Table1 union all
select student,'chemistry' as class,chemistry as score from Table1 union all
select student,'maths' as class,maths as score from Table1 union all
select student,'history' as class,history as score from Table1 union all
select student,'english' as class,english as score from Table1
)
select t.*
from (select t.*,
rank() over (partition by student order by score desc) as subject_rank
from cte t
) t
where class = 'maths' and subject_rank <= 3;
Edit:
If you want the number of times maths was in the top 3, then:
select student, sum(case when class = 'maths' and subject_rank <= 3 then 1 else 0 end) as maths_top3
from (select t.*,
rank() over (partition by student order by score desc) as subject_rank
from cte t
) t
group by student;
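One thing to watch: partitioning by student pools all of a student's rows into a single ranking. If the ranking should be per row (per attempt), each row needs its own identifier. The sketch below, run through Python's sqlite3 module (SQLite 3.25 or later), uses SQLite's implicit rowid for that; treating rowid as an attempt id is my assumption, not something stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Table1 (
    student TEXT, physics INTEGER, chemistry INTEGER,
    maths INTEGER, history INTEGER, english INTEGER)""")
conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?, ?, ?, ?)", [
    ("Brian", 78, 62, 100, 40, 50),
    ("Bill", 80, 70, 95, 50, 60),
    ("Brian", 80, 40, 90, 95, 60),
])

# rowid serves as a per-row attempt id, so subjects are ranked within one row.
maths_top3 = conn.execute("""
    WITH cte AS (
        SELECT rowid AS attempt, student, 'physics' AS class, physics AS score FROM Table1 UNION ALL
        SELECT rowid, student, 'chemistry', chemistry FROM Table1 UNION ALL
        SELECT rowid, student, 'maths', maths FROM Table1 UNION ALL
        SELECT rowid, student, 'history', history FROM Table1 UNION ALL
        SELECT rowid, student, 'english', english FROM Table1
    )
    SELECT student,
           SUM(CASE WHEN class = 'maths' AND subject_rank <= 3 THEN 1 ELSE 0 END) AS maths_top3
    FROM (SELECT t.*,
                 RANK() OVER (PARTITION BY attempt ORDER BY score DESC) AS subject_rank
          FROM cte t) t
    GROUP BY student
    ORDER BY student
""").fetchall()
print(maths_top3)
```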

Select rows of joined values using MAX

Consider the following schema.
Student:
StudentID uniqueidentifier
Name varchar(max)
FKTeacherID uniqueidentifier
TestScore:
TestScoreID uniqueidentifier
Score int
FKStudentID uniqueidentifier
My goal is to write a query that yields each teacher's highest test score and the student that achieved it. Returning the teacher's id (Student.FKTeacherID), the score that was achieved (TestScore.Score) and the student that achieved it (Student.Name).
I can write something like this to get the first two required columns:
SELECT FKTeacherID, MAX(Score) MaxScore
FROM Student
JOIN TestScore on FKStudentID = StudentID
GROUP BY FKTeacherID
But I can't obtain the relevant Student.Name without adding it to the group by clause, which would change the result set.
If I'm understanding correctly, one option is to use row_number() -- here's an example with a common-table-expression:
with cte as (
select s.fkteacherid,
ts.score,
s.name,
row_number() over (partition by s.fkteacherid order by ts.score desc) rn
from student s
inner join testscore ts on s.studentid = ts.fkstudentid
)
select fkteacherid, score, name
from cte
where rn = 1
The basic idea is to partition by fkteacherid, order by score descending within each partition, and take the first record from each partition.
UPDATE
Please try the updated query:
SELECT
FKTeacherID, Name, Score
FROM
Student
JOIN TestScore on FKStudentID = StudentID
JOIN
(
SELECT
B.FKTeacherID AS TeacherID, MAX(A.Score) MaxScore
FROM
Student B
JOIN TestScore A on A.FKStudentID = B.StudentID
GROUP BY
B.FKTeacherID
) As TeachersMaxScore
ON TeachersMaxScore.TeacherID = FKTeacherID
AND TeachersMaxScore.MaxScore = Score
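The two answers differ on ties: row_number() returns one student per teacher, while the join on MAX(Score) returns every student who reached the top score. Swapping row_number() for rank() gives the CTE version the same tie behaviour. Here is a sketch with Python's sqlite3 module (SQLite 3.25 or later); the integer ids and all rows are invented stand-ins for the uniqueidentifier columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (StudentID INTEGER, Name TEXT, FKTeacherID INTEGER)")
conn.execute("CREATE TABLE TestScore (TestScoreID INTEGER, Score INTEGER, FKStudentID INTEGER)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?)", [
    (1, "Ann", 10), (2, "Bob", 10), (3, "Cy", 20), (4, "Di", 20),
])
conn.executemany("INSERT INTO TestScore VALUES (?, ?, ?)", [
    (1, 90, 1), (2, 70, 1), (3, 85, 2),
    (4, 60, 3), (5, 60, 4),   # Cy and Di tie for teacher 20
])

# RANK gives every student with the top score rank 1, so ties are kept.
top_per_teacher = conn.execute("""
    WITH cte AS (
        SELECT s.FKTeacherID, ts.Score, s.Name,
               RANK() OVER (PARTITION BY s.FKTeacherID
                            ORDER BY ts.Score DESC) AS rnk
        FROM Student s
        JOIN TestScore ts ON s.StudentID = ts.FKStudentID
    )
    SELECT FKTeacherID, Score, Name
    FROM cte
    WHERE rnk = 1
    ORDER BY FKTeacherID, Name
""").fetchall()
print(top_per_teacher)
```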

Sum of Highest 5 numbers in SQL Server 2000

I am having a problem querying some data from a database. My table is given below:
What I need is the sum of the 5 highest total_marks for each student.
I tried the code given below, but it does not return what I expected.
SELECT s.studentid, SUM(s.total_marks)
FROM students s
WHERE s.sub_code IN (SELECT TOP 5 sub_code
FROM students a
WHERE a.studentid = s.studentid
ORDER BY total_marks DESC)
GROUP BY studentid
Please help me. Thanks in advance.
Your query could work if there were a unique/primary key on (studentId, subcode). At the moment the query returns 6 records instead of 5 for studentId = 1, for example, because of the duplicate subcode 303.
A table should usually have a unique key. You could add an incremental id and rewrite your query like this:
select s.*
from students as s
where
s.id in (
select top 5 a.id
from students as a
where a.studentId = s.studentId
order by a.total_marks desc
);
Or, if you have unique combinations of (studentId, subcode, total_marks), you can use query like this:
select s.*
from students as s
where
exists (
select *
from (
select top 5 a.subcode, a.total_marks
from students as a
where a.studentId = s.studentId
order by a.total_marks desc
) as b
where b.subcode = s.subcode and b.total_marks = s.total_marks
);
First, number each student's marks from highest to lowest (row_number() requires SQL Server 2005 or later, so this will not run on SQL Server 2000):
select row_number() over (partition by studentid order by total_marks desc) as rn,
studentid,
total_marks
from students
The row number cannot be filtered in the WHERE clause of the same query, so use this as a subquery, filter on it in the outer query, and then use group by and sum:
select studentid, sum(total_marks)
from
(
select row_number() over (partition by studentid order by total_marks desc) as rn,
studentid,
total_marks
from students
) t
where rn <= 5
group by studentid
This isn't ideal, but the method you started to use requires a primary key column. You can simulate one with a temp table and an IDENTITY column, both of which SQL Server 2000 supports.
CREATE TABLE #temp (
StudentID INT,
total_marks INT,
ID INT Identity(1,1)
)
INSERT INTO #temp (
StudentID,
total_Marks
)
Select
StudentID,
total_marks
FROM Students
SELECT s.studentid, SUM(s.total_marks)
FROM #temp s
WHERE s.ID IN (SELECT TOP 5
a.ID
FROM #temp a
WHERE a.studentid = s.studentid
ORDER BY total_marks DESC)
GROUP BY studentid
I think SQL 2000 may have a slightly more compact syntax for this, but SQL Fiddle won't let me test versions that old.
Please test this carefully. You will be dumping this entire table to a temp table and that's almost always a bad idea.
Also, ensure that there is some combination of fields not including the total that uniquely identifies a row, or consider adding a surrogate key column to the table.
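To check the surrogate-key idea on a small data set, here is a sketch with Python's sqlite3 module. It is SQLite, not SQL Server 2000: LIMIT 5 stands in for TOP 5, and the id column, the subject codes and the marks are all invented. The duplicate sub_code 303 mirrors the situation that broke the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id is a surrogate key, playing the role of the IDENTITY column.
conn.execute("""CREATE TABLE students (
    id INTEGER PRIMARY KEY, studentid INTEGER,
    sub_code INTEGER, total_marks INTEGER)""")
conn.executemany(
    "INSERT INTO students (studentid, sub_code, total_marks) VALUES (?, ?, ?)", [
        (1, 301, 90), (1, 302, 80), (1, 303, 80),
        (1, 303, 70), (1, 304, 60), (1, 305, 50),   # six rows, sub_code 303 twice
        (2, 301, 40), (2, 302, 30), (2, 303, 20),   # fewer than five rows
    ])

# For each row, check whether its id is among that student's five best rows.
# Ordering by id as a tie-breaker keeps the choice deterministic.
top5_sums = conn.execute("""
    SELECT s.studentid, SUM(s.total_marks)
    FROM students s
    WHERE s.id IN (SELECT a.id
                   FROM students a
                   WHERE a.studentid = s.studentid
                   ORDER BY a.total_marks DESC, a.id
                   LIMIT 5)
    GROUP BY s.studentid
    ORDER BY s.studentid
""").fetchall()
print(top5_sums)
```

Because the filter is on the unique id and not on sub_code, the duplicate subject code no longer pulls in a sixth row.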