Get top ranked values across multiple fields - sql

Imagine a table that has three fields, a unique user ID and two non-unique traits (eg age, sex, etc): user / traitA / traitB
In this table I want to pull the most frequent value for each trait in a single query. If our traits were School Year / Major then a result could be: Junior / Biology. Note, this does NOT mean Juniors in Biology are the most common combination, just that each value itself is most common in its trait.
This is obviously possible running two separate queries, grouping by a single fields and putting a rank and having combo in. But my specific problem has more fields and the cost to do subsequent queries is expensive.

Selecting single most common trait:
SELECT age
FROM table_name
GROUP BY age
ORDER BY COUNT(*) DESC
LIMIT 1
To select most common values from multiple columns this query worked in Postgre:
SELECT DISTINCT
FIRST_VALUE(age) OVER (ORDER BY count1 DESC) AS top1,
FIRST_VALUE(sex) OVER (ORDER BY count2 DESC) AS top2
FROM (
SELECT age,
sex,
COUNT(age) OVER (PARTITION BY age) AS count1,
COUNT(sex) OVER (PARTITION BY sex) AS count2
FROM some_table
) some_table

Related

SQL Select all MIN Values for each group

So I need to select all the students, having the minimum grade for each prof. For example, if Augustinus had two students with grade 1.0, then I would like to see both in the result.
Table of my data
What the result could look like, if the LIMIT was set to 10
So what I basically want is to see the best students that each prof has.
What I have tried is the following:
SELECT professor, student, min(note)
FROM temp
GROUP BY professor
ORDER BY note
The problem of course being that I only get one minimum value for each prof and not all minimum values.
*temp is just the table name
One way to solve these types of problems is to use a subquery to rank the grades for each class in a descending order. This involves a window function. With a second query you can limit the results based on your criteria of 10.
SELECT professor, student, note
FROM
(
SELECT professor,student,note,
row_number() over(partition by professor order by note desc) as downwardrank
) as rankings
WHERE
downwardrank <= 10
Just found a solution myself:
SELECT professor, student, note
FROM temp
WHERE (professor, note) IN
(SELECT professor, min(note)
FROM temp
GROUP BY professor
ORDER BY note)
ORDER BY note, professor, student
LIMIT 10

Find list of topper across each class when given individual scores for each subject

I need help in writing an efficient query to find a list of toppers (students with maximum total marks in each class) when we are given individual scores for each subject across different classes. We are required to return 3 columns: class, topper_student name and topper_student_total marks.
I have used multiple sub-queries to find a solution. I am sure there would be much better implementations available for this problem (maybe via joins or window functions?).
Input table and my solution can be found at SQL Fiddle link.
http://www.sqlfiddle.com/#!15/2919e/1/0
Input table:
It would be clearer to use temporary tables to store results along the way and make the result traceable, but the solution can be achieved with a single query:
WITH student_marks AS (
SELECT Class_num, Name, SUM(Marks) AS student_total_marks
FROM School
GROUP BY Class_num, Name
)
SELECT Class_num, Name, student_total_marks
FROM (
SELECT Class_num, Name, student_total_marks, ROW_NUMBER() OVER(partition by Class_num order by student_total_marks desc, Class_num) AS beststudentfirst
FROM student_marks
) A
WHERE A.beststudentfirst = 1
The query within WITH statement calculate a sum of marks for every student in a class. At this point, subject is not required anymore. The result is temporarily stored into student_marks.
Next, we need to create a counter (beststudentfirst) using ROW_NUMBER to number the total marks from the highest to the lowest in each class (order by student_total_marks desc, Class_num). The counter should be reinitiated each time the class changes (partition by Class_num order).
From this last result, we only need the counter (beststudentfirst) with the value of one. It is the top student in each class.
Window functions are the most natural way to approach this. If you always want exactly three students, then use row_number():
select Class_num, Name, total_marks
from (select name, class_num, sum(marks) as total_marks,
row_number() over (partition by class_num order by sum(marks) desc) as seqnum
from School
group by Class_num, Name
) s
where seqnum <= 1
order by class_num, total_marks desc;
If you want to take ties into account, then use rank() or dense_rank().
Here is the SQL Fiddle.
select Class_num,[Name],total_marks from
(
select Row_number() over (partition by class_num order by Class_num,SUM(Marks) desc) as
[RN],Class_num,[Name],SUM(Marks) as total_marks
from School
group by Class_num,[Name]
)A
where RN=1

Check if tables are identical using SQL in Oracle

I was asked this question during an interview for a Junior Oracle Developer position, the interviewer admitted it was a tough one:
Write a query/queries to check if the table 'employees_hist' is an exact copy of the table 'employees'. Any ideas how to go about this?
EDIT: Consider that tables can have duplicate records so a simple MINUS will not work in this case.
EXAMPLE
EMPLOYEES
NAME
--------
Jack Crack
Jack Crack
Jill Hill
These two would not be identical.
EMPLOYEES_HIST
NAME
--------
Jack Crack
Jill Hill
Jill Hill
If the tables have the same columns, you can use this; this will return no rows if the rows in both tables are identical:
(
select * from test_data_01
minus
select * from test_data_02
)
union
(
select * from test_data_02
minus
select * from test_data_01
);
Identical regarding what? Metadata or the actual table data too?
Anyway, use MINUS.
select * from table_1
MINUS
select * from table_2
So, if the two tables are really identical, i.e. the metadata and the actual data, it would return no rows. Else, it would prove that the data is different.
If, you receive an error, it would mean the metadata itself is different.
Update If the data is not same, and that one of the table has duplicates.
Just select the unique records from one of the table, and simply apply MINUS against the other table.
One possible solution, which caters for duplicates, is to create a subquery which does a UNION on the two tables, and includes the number of duplicates contained within each table by grouping on all the columns. The outer query can then group on all the columns, including the row count column. If the table match, there should be no rows returned:
create table employees (name varchar2(100));
create table employees_hist (name varchar2(100));
insert into employees values ('Jack Crack');
insert into employees values ('Jack Crack');
insert into employees values ('Jill Hill');
insert into employees_hist values ('Jack Crack');
insert into employees_hist values ('Jill Hill');
insert into employees_hist values ('Jill Hill');
with both_tables as
(select name, count(*) as row_count
from employees
group by name
union all
select name, count(*) as row_count
from employees_hist
group by name)
select name, row_count from both_tables
group by name, row_count having count(*) <> 2;
gives you:
Name Row_count
Jack Crack 1
Jack Crack 2
Jill Hill 1
Jill Hill 2
This tells you that both names appear once in one table and twice in the other, and therefore the tables don't match.
select name, count(*) n from EMPLOYEES group by name
minus
select name, count(*) n from EMPLOYEES_HIST group by name
union all (
select name, count(*) n from EMPLOYEES_HIST group by name
minus
select name, count(*) n from EMPLOYEES group by name)
You could merge the two tables and then subtract one of the tables from the result. If the result of the subtraction is an empty table then you know that the the tables must be the same since merge had no effect (every row and column were effectively the same)
How do I merge two tables with different column number while removing duplicates?
That link provides a good way to merge the two tables without duplicates without knowing what the columns are.
Ensure the rows are unique by adding a pseudo column
WITH t1 AS
(SELECT <All_Columns>
, row_number() OVER
(PARTITION BY <All_Columns>
ORDER BY <All_Columns>) row_num
FROM employees)
, t2 AS
(SELECT <All_Columns>
, row_number() OVER
(PARTITION BY <All_Columns>
ORDER BY <All_Columns>) row_num
FROM employees_hist)
(SELECT *
FROM t1
MINUS
SELECT *
FROM t2
UNION ALL
(SELECT *
FROM t1
MINUS
SELECT *
FROM t2)
Use row_number to make sure there are no duplicate rows. Now you can use minus and if there are no results, the tables are identical.
SELECT ROW_NUMBER() OVER (Order By Name), *
FROM tab1
MINUS
SELECT ROW_NUMBER() OVER (Order By Name), *
FROM tab2

Keeping Unique Rows with Group By Cube

Suppose I have data that includes the SSN of a student, the college campus they attended, and their wages for a given year. Like so...
create table #thetable (SSN int, campus int, wage int);
insert into #thetable(SSN, campus, wage)
values
(111111111,1,100),
(111111111,2,100),
(222222222,1,250),
(222222222,2,250),
(333333333,1,50),
(444444444,2,400);
Now, I want to get the average wage of the students at each campus, and the average wage of students from all campuses put together... So I do something like this:
select campus, avg(wage)
from #thetable
group by cube(campus);
The problem is that I don't want to double-count the students who attended two campuses when I'm grouping the campuses together. This is the output I'm getting (double counts students 111111111 and 2222222222):
Campus (no column name)
1 133
2 250
NULL 191
My desired output is this (no double counting):
Campus (no column name)
1 133
2 250
NULL 200
Can this be accomplished without using multiple queries and the UNION operator? If so, how? (Incidentally, I realize that this table is not normalized... would normalizing help?)
You can't do this with one column. The cube is going to rollup the values based on the calculations on each line. So, if a row is included in one calculation, it will be included in the sum.
You can do this, though, by weighting the values by 1 divided by the frequency. This "divides" a student equally across the campuses to each adds to 1:
select campus, avg(wage) as avg_wage, sum(wage*weight) / sum(weight) avg_wage_weighted
from (select t.*, (1.0 / count(*) over (partition by SSN)) as weight
from #thetable t
) t
group by cube(campus);
The second column should be the value you want. You can then embed this further in a subquery to get one column:
select campus, (case when campus is null then avg_wage_weighted else avg_wage end)
from (select campus, avg(wage) as avg_wage, sum(wage*weight) / sum(weight) avg_wage_weighted
from (select t.*, (1.0 / count(*) over (partition by SSN)) as weight
from #thetable t
) t
group by cube(campus)
) t
Here is a SQL Fiddle showing the solution.
Figured it out with a correlated sub-query. Works for me.
select campus,
(
select avg(wage)
from
(
select ssn, campus, wage, row_number() over(partition by SSN order by wage) as RN
from #thetable as inside
where (inside.campus=outside.campus or outside.campus is null)
) as middle
where RN=1
)
from #thetable outside
group by cube(campus);

sql query finding most often level appear

I have a table Student in SQL Server with these columns:
[ID], [Age], [Level]
I want the query that returns each age value that appears in Students, and finds the level value that appears most often. For example, if there are more 'a' level students aged 18 than 'b' or 'c' it should print the pair (18, a).
I am new to SQL Server and I want a simple answer with nested query.
You can do this using window functions:
select t.*
from (select age, level, count(*) as cnt,
row_number() over (partition by age order by count(*) desc) as seqnum
from student s
group by age, level
) t
where seqnum = 1;
The inner query aggregates the data to count the number of levels for each age. The row_number() enumerates these for each age (the partition by with the largest first). The where clause then chooses the highest values.
In the case of ties, this returns just one of the values. If you want all of them, use rank() instead of row_number().
One more option with ROW_NUMBER ranking function in the ORDER BY clause. WITH TIES used when you want to return two or more rows that tie for last place in the limited results set.
SELECT TOP 1 WITH TIES age, level
FROM dbo.Student
GROUP BY age, level
ORDER BY ROW_NUMBER() OVER(PARTITION BY age ORDER BY COUNT(*) DESC)
Or the second version of the query using amount each pair of age and level, and max values of count pair age and level per age.
SELECT *
FROM (
SELECT age, level, COUNT(*) AS cnt,
MAX(COUNT(*)) OVER(PARTITION BY age) AS mCnt
FROM dbo.Student
GROUP BY age, level
)x
WHERE x.cnt = x.mCnt
Demo on SQLFiddle
Another option but will require later version of sql-server:
;WITH x AS
(
SELECT age,
level,
occurrences = COUNT(*)
FROM Student
GROUP BY age,
level
)
SELECT *
FROM x x
WHERE EXISTS (
SELECT *
FROM x y
WHERE x.occurrences > y.occurrences
)
I realise it doesn't quite answer the question as it only returns the age/level combinations where there are more than one level for the age.
Maybe someone can help to amend it so it includes the single level ages aswell in the result set: http://sqlfiddle.com/#!3/d597b/9
with combinations as (
select age, level, count(*) occurrences
from Student
group by age, level
)
select age, level
from combinations c
where occurrences = (select max(occurrences)
from combinations
where age = c.age)
This finds every age and level combination in the Students table and counts the number of occurrences of each level.
Then, for each age/level combination, find the one whose occurrences are the highest for that age/level combination. Return the age and level for that row.
This has the advantage of not being tied to SQL Server - it's vanilla SQL. However, a window function like Gordon pointed out may perform better on SQL Server.