DISTINCT in a simple SQL query

When executing SQL queries I have been trying to figure out the following:
In this example:
FROM albums AL
Why is there a need to specify DISTINCT? I thought that the Id being a primary key was enough to avoid duplicate results.

When you specify distinct you are specifying that you want the whole row to be distinct. For example if you have two rows:
ID=1 and Name='Joe Smith'
ID=2 and Name='Joe Smith'
then your query is going to return both rows because the different ID values make the rows distinct.
However, if you are selecting only the ID column (and it's your primary key) then the distinct is pointless.
If you're trying to find all of the unique names, then you'd want:
SELECT DISTINCT AL.Name
FROM albums AL

You are right: in your case there should be no need for the word distinct, because you are asking for the id and the name, and the id alone already makes every row different. Now, for the sake of an example where distinct is necessary, say you had multiple ids with the same name. Let It Be is an album by both the Beatles and the Replacements. And let's say you were using your database to write out labels that only included the names of the albums. The query you would want would be:
select distinct al.name
from albums al;
Sometimes your database is not perfect and it ends up with a bunch of junk data. If the id has not been designated as unique, you might end up with duplicate records, and then you might want to avoid seeing the duplicates in your query results.
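To see the difference concretely, here is a minimal sketch using Python's built-in sqlite3 module. The rows and the artist column are made up for illustration; only the albums table with its id and name comes from the question.

```python
import sqlite3

# Hypothetical data: two albums share a name but have different ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE albums (id INTEGER PRIMARY KEY, name TEXT, artist TEXT)")
conn.executemany(
    "INSERT INTO albums (id, name, artist) VALUES (?, ?, ?)",
    [
        (1, "Let It Be", "The Beatles"),
        (2, "Let It Be", "The Replacements"),
        (3, "Abbey Road", "The Beatles"),
    ],
)

# DISTINCT compares whole rows, so the unique id keeps every row distinct.
with_id = conn.execute(
    "SELECT DISTINCT al.id, al.name FROM albums al ORDER BY al.id"
).fetchall()

# Without the id, the two 'Let It Be' rows collapse into one.
names_only = conn.execute(
    "SELECT DISTINCT al.name FROM albums al ORDER BY al.name"
).fetchall()

print(with_id)
print(names_only)
```

With the id in the select list, DISTINCT removes nothing; with only the name, the repeated title is returned once.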


How can I select all fields except for those with non-distinct values?

I have a table that holds data for people who have applied. Each person has one PERSON_ID, but can have multiple APP_IDs. I want to select all of the columns except for APP_ID (because its values differ between a person's rows) for all of the distinct people in the table.
I can list every field individually in both the select and group by clause
This works:
where first = 'Rob' and last='Robot'
But there are twenty more fields that I may or may not use at any given time.
Is there any shorter way to achieve this sort of selection without being so verbose?
select distinct is shorter:
from applications
where first = 'Rob' and last = 'Robot';
But you still have to list out the columns once.
Some more modern databases support an except clause that lets you remove columns from the wildcard list. To the best of my knowledge, Oracle has no similar concept.
You could write a query to bring the columns together from the system tables. That could simplify writing the query and help prevent misspellings.
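As a rough sketch of that last idea, the snippet below builds the column list from the catalog and drops APP_ID. It uses SQLite through Python's sqlite3 module, where PRAGMA table_info plays the role of the system tables; in Oracle you would read a catalog view such as USER_TAB_COLUMNS instead. The table definition and rows are hypothetical and shortened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, shortened version of the applications table.
conn.execute(
    "CREATE TABLE applications (person_id INTEGER, app_id INTEGER, first TEXT, last TEXT)"
)
conn.executemany(
    "INSERT INTO applications VALUES (?, ?, ?, ?)",
    [(1, 10, "Rob", "Robot"), (1, 11, "Rob", "Robot"), (2, 12, "Ann", "Droid")],
)

# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [
    row[1]
    for row in conn.execute("PRAGMA table_info(applications)")
    if row[1].lower() != "app_id"
]

# The names come from the catalog, not from user input, so they are
# concatenated into the statement; the filter values stay parameterized.
query = (
    "SELECT DISTINCT " + ", ".join(columns)
    + " FROM applications WHERE first = ? AND last = ?"
)
rows = conn.execute(query, ("Rob", "Robot")).fetchall()

print(query)
print(rows)
```

The generated statement still lists every column, but you no longer type the list by hand.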

Retrieving duplicate and original rows from a table using sql query

Say I have a student table with the following fields: student id, student name, age, gender, marks, class. Assume that, due to some error, there are multiple entries for each student. I need to identify the duplicate rows in the table, where the matching criterion is the student name and the class. In the query result, in addition to the duplicate records, I also need the original student record that got duplicated. Is there any method to do this? I went through this answer: SQL: How to find duplicates based on two fields?. But it only shows how to find the duplicate rows, not how to identify the actual row that was duplicated. Kindly throw some light on a possible solution. Thanks.
First of all: if the columns you've listed are all in the same table, it looks like your database structure could use some normalization.
In terms of your question: I'm assuming your StudentID field is a database-generated primary key and so has not been duplicated. (If this is not the case, I think you have bigger problems than just duplicates.)
I'm also assuming the duplicate row has a higher value for StudentID than the original row.
I think the following should work. (Note: I haven't created a table to verify this, so it might not be perfect straight away. If it isn't, it should be fairly close.)
select dup.StudentID as DuplicateStudentID,
       dup.StudentName, dup.Age, dup.Gender, dup.Marks, dup.Class,
       orig.StudentID as OriginalStudentID
from StudentTable dup
inner join (
    -- Find first student record for each unique combination
    select Min(StudentID) as StudentID, StudentName, Age, Gender, Marks, Class
    from StudentTable t
    group by StudentName, Age, Gender, Marks, Class
) orig on dup.StudentName = orig.StudentName
    and dup.Age = orig.Age
    and dup.Gender = orig.Gender
    and dup.Marks = orig.Marks
    and dup.Class = orig.Class
    and dup.StudentID > orig.StudentID -- Don't identify the original record as a duplicate
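If the match really should be on student name and class only, as the question states, the same pattern can be narrowed to those two columns. Below is a minimal sketch run against SQLite via Python's sqlite3 module; the sample rows and the reduced column set are invented for illustration, and it keeps the assumption that the lowest StudentID in each group is the original.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StudentTable ("
    "StudentID INTEGER PRIMARY KEY, StudentName TEXT, Class TEXT, Marks INTEGER)"
)
conn.executemany(
    "INSERT INTO StudentTable VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", "10A", 81),
        (2, "Ben", "10A", 74),
        (3, "Asha", "10A", 81),  # duplicate of StudentID 1
        (4, "Asha", "10B", 90),  # same name, different class: not a duplicate
        (5, "Ben", "10A", 74),   # duplicate of StudentID 2
    ],
)

pairs = conn.execute(
    """
    SELECT dup.StudentID  AS DuplicateStudentID,
           orig.StudentID AS OriginalStudentID,
           dup.StudentName,
           dup.Class
    FROM StudentTable dup
    INNER JOIN (
        -- Lowest StudentID per (name, class) is treated as the original.
        SELECT MIN(StudentID) AS StudentID, StudentName, Class
        FROM StudentTable
        GROUP BY StudentName, Class
    ) orig
        ON dup.StudentName = orig.StudentName
       AND dup.Class = orig.Class
       AND dup.StudentID > orig.StudentID
    ORDER BY dup.StudentID
    """
).fetchall()

print(pairs)
```

Each result row pairs one duplicate with the original it copies, so both halves of the requirement come back from a single query.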

SELECT DISTINCT. Please explain?

Wondering if someone could please explain the difference between these two queries and advise why one works and the other doesn't.
This one works. Gives me two records of the distinct GantryRtn value and their corresponding SSD value.
SELECT DISTINCT GantryRtn as Gantry, ROUND(Field.SSD,1) as SSD
FROM Field, PlanSetup, Course, Patient, Radiation
WHERE Field.RadiationSer=Radiation.RadiationSer
AND Radiation.PlanSetupSer=PlanSetup.PlanSetupSer
AND PlanSetup.CourseSer=Course.CourseSer
AND Course.PatientSer=Patient.PatientSer
AND Patient.PatientId='ZZZ456'
AND PlanSetup.PlanSetupId='F T1 R CHEST'
However there is a foreign key in the Field table that links to the primary key of another table that contains a plain text name for each field. I'd also like to extract that name (in a separate query if I have to) by pulling out this foreign key RadiationSer. But as soon as I put RadiationSer into the query, I lose my DISTINCT result.
SELECT DISTINCT GantryRtn as Gantry, ROUND(Field.SSD,1) as SSD, Field.RadiationSer
FROM Field, PlanSetup, Course, Patient, Radiation
WHERE Field.RadiationSer=Radiation.RadiationSer
AND Radiation.PlanSetupSer=PlanSetup.PlanSetupSer
AND PlanSetup.CourseSer=Course.CourseSer
AND Course.PatientSer=Patient.PatientSer
AND Patient.PatientId='ZZZ456'
AND PlanSetup.PlanSetupId='F T1 R CHEST'
This second query gives me 7 records with non-distinct GantryRtn values.
Why does this happen??
I have investigated using GROUP BY but this slows the query down and appears to pull ALL GantryRtn's out of the database (100s of records).
The DISTINCT keyword applies to the whole result row (all selected fields) and not just to the first field.
In your case:
SELECT DISTINCT GantryRtn as Gantry, ROUND(Field.SSD,1) as SSD, Field.RadiationSer
will return every row whose combination of Gantry, SSD, and RadiationSer is distinct.
So you may get 7 records that share a Gantry value but have different values for RadiationSer.
If you'd like one row per distinct Gantry value, you can accomplish that with a sub-query and an inner join, but you must somehow settle on which RadiationSer value to use.
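The effect is easy to reproduce on a small table. The sketch below uses SQLite through Python's sqlite3 module with a flattened, hypothetical stand-in for the joined Field rows; it also shows one possible way to settle on a RadiationSer, namely taking the lowest one per group with GROUP BY and MIN. Whether that is the right choice, or fast enough on the real schema, is an open question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, flattened stand-in for the rows produced by the real joins.
conn.execute(
    "CREATE TABLE Field (RadiationSer INTEGER PRIMARY KEY, GantryRtn REAL, SSD REAL)"
)
conn.executemany(
    "INSERT INTO Field VALUES (?, ?, ?)",
    [(101, 0.0, 95.0), (102, 0.0, 95.0), (103, 180.0, 92.5), (104, 180.0, 92.5)],
)

# Two selected columns: rows with equal (GantryRtn, SSD) collapse.
two_cols = conn.execute(
    "SELECT DISTINCT GantryRtn, SSD FROM Field ORDER BY GantryRtn"
).fetchall()

# Adding the unique RadiationSer makes every row distinct again.
three_cols = conn.execute(
    "SELECT DISTINCT GantryRtn, SSD, RadiationSer FROM Field ORDER BY RadiationSer"
).fetchall()

# One row per (GantryRtn, SSD), arbitrarily keeping the lowest RadiationSer.
one_per_group = conn.execute(
    "SELECT GantryRtn, SSD, MIN(RadiationSer) FROM Field "
    "GROUP BY GantryRtn, SSD ORDER BY GantryRtn"
).fetchall()

print(two_cols)
print(three_cols)
print(one_per_group)
```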

correlated query to update a table based on a select

I have two tables, Genre and Songs. There is a many-to-many relationship between them: one genre can have many songs, and one song may belong to many genres (say there is a song xyz that belongs to rap; it can also belong to hip-hop). A table GenreSongs acts as the many-to-many map between the two, with GenreID and SongID columns. What I am supposed to do is add a column named SongsCount to the Genre table that contains the number of songs in each genre. I can alter the table to add the column, and I can also write a query that gives the song count:
SELECT GenreID, Count(SongID) FROM GenreSongs GROUP BY GenreID
This gives us what we require, the number of songs per genre, but how can I use this query to update the column I made (SongsCount)? One way is to run this query, look at the results, and then manually update that column, but I am sure everyone will agree that's not a programmatic way to do it.
I think I need a query with a subquery that takes the GenreID from the outer query and counts its rows in the inner query (a correlated query), but I can't work out how to write one. Can anyone please help me write this?
The question of how to approach this depends on the size of your data and how frequently it is updated. Here are some scenarios.
If your songs are updated quite frequently and your tables are quite large, then you might want to have a column in Genre with the count, and update the column using a trigger on the Songs table.
Alternatively, you could build an index on the GenreSong table on Genre. Then the following query:
select count(*)
from GenreSong gs
where genre = <whatever>
should run quite fast.
If your songs are updated infrequently or in a batch (say nightly or weekly), then you can update the song count as part of the batch. Your query might look like:
update Genre
set SongCnt = cnt
from (select Genre, count(*) as cnt from GenreSong gs group by Genre) gc
where Genre.genre = gc.Genre
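Not every database accepts a FROM clause in UPDATE. A correlated subquery, which is what the question asked about, is a more portable way to write the same batch update. Here is a minimal sketch against SQLite via Python's sqlite3 module, using the question's table and column names; the Name column and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Genre (GenreID INTEGER PRIMARY KEY, Name TEXT, SongsCount INTEGER);
    CREATE TABLE GenreSongs (GenreID INTEGER, SongID INTEGER);
    INSERT INTO Genre (GenreID, Name) VALUES (1, 'Rap'), (2, 'Hip-hop'), (3, 'Polka');
    INSERT INTO GenreSongs VALUES (1, 100), (1, 101), (2, 100);
    """
)

# The subquery is evaluated once per Genre row, using that row's GenreID.
conn.execute(
    """
    UPDATE Genre
    SET SongsCount = (
        SELECT COUNT(*)
        FROM GenreSongs gs
        WHERE gs.GenreID = Genre.GenreID
    )
    """
)

counts = conn.execute(
    "SELECT Name, SongsCount FROM Genre ORDER BY GenreID"
).fetchall()
print(counts)
```

Because COUNT(*) over no rows is 0, a genre without songs ends up with 0 rather than NULL.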
And yet another possibility is that you don't need to store the value at all. You can make it part of a view/query that does the calculation on the fly.
Relational databases are quite flexible, and there is often more than one way to do things. The right approach depends very much on what you are trying to accomplish.
Adding a column named SongsCount is just plainly bad design (redundant data and update overhead). Instead, use this query for single results:
SELECT ID, ..., (SELECT Count(*) FROM GenreSongs WHERE GenreID = X) AS SongsCount FROM Genre WHERE ID = X
And this for multiple results (much more efficient):
SELECT ID, ..., SongsCount FROM (SELECT GenreID, Count(*) AS SongsCount FROM GenreSongs GROUP BY GenreID) AS sub RIGHT JOIN Genre AS g ON sub.GenreID = g.ID

How do I remove "duplicate" rows from a view?

I have a view which was working fine when I was joining my main table:
However I needed to add the following join:
Although I added DISTINCT, I still get a "duplicate" row. I say "duplicate" because the second row has a different value.
However, if I change the LEFT OUTER to an INNER JOIN, I lose all the rows for the clients who have these "duplicate" rows.
What am I doing wrong? How can I remove these "duplicate" rows from my view?
This question is not applicable in this instance:
How can I remove duplicate rows?
DISTINCT won't help you if the rows have any columns that are different. Obviously, one of the tables you are joining to has multiple rows for a single row in another table. To get one row back, you have to eliminate the other multiple rows in the table you are joining to.
The easiest way to do this is to enhance your where clause or JOIN restriction to only join to the single record you would like. Usually this requires determining a rule which will always select the 'correct' entry from the other table.
Let us assume you have a simple problem such as this:
Person: Jane
Pets: Cat, Dog
If you create a simple join here, you would receive two records for Jane:
This is completely correct if the point of your view is to list all of the combinations of people and pets. However, if your view was instead supposed to list people with pets, or list people and display one of their pets, you hit the problem you have now. For this, you need a rule.
SELECT Person.Name, pets1.Name
FROM Person
LEFT JOIN Pets pets1 ON pets1.PersonID = Person.ID
WHERE NOT EXISTS (SELECT 1
                  FROM Pets pets2
                  WHERE pets2.PersonID = pets1.PersonID
                  AND pets2.ID < pets1.ID);
What this does is apply a rule that restricts the Pets record in the join to the pet with the lowest ID (first in the Pets table). The WHERE clause essentially says "where there are no pets belonging to the same person with a lower ID value".
This would yield a one record result:
The rule you'll need to apply to your view will depend on the data in the columns you have, and which of the 'multiple' records should be displayed in the column. However, that will wind up hiding some data, which may not be what you want. For example, the above rule hides the fact that Jane has a Dog. It makes it appear as if Jane only has a Cat, when this is not correct.
You may need to rethink the contents of your view, and what you are trying to accomplish with your view, if you are starting to filter out valid data.
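The Jane example can be tried end to end with a small sketch like the one below, run against SQLite through Python's sqlite3 module. The second person, Sam, who has no pets, is a made-up addition to show that the LEFT JOIN still keeps people without a match; the NOT EXISTS form of the rule is one way to express "no pet with a lower ID".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Person (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Pets (ID INTEGER PRIMARY KEY, PersonID INTEGER, Name TEXT);
    INSERT INTO Person VALUES (1, 'Jane'), (2, 'Sam');
    INSERT INTO Pets VALUES (1, 1, 'Cat'), (2, 1, 'Dog');
    """
)

# Plain join: one row per (person, pet) combination.
plain_join = conn.execute(
    """
    SELECT Person.Name, pets1.Name
    FROM Person
    LEFT JOIN Pets pets1 ON pets1.PersonID = Person.ID
    ORDER BY Person.ID, pets1.ID
    """
).fetchall()

# With the rule: keep only the pet that has no lower-ID sibling.
one_pet = conn.execute(
    """
    SELECT Person.Name, pets1.Name
    FROM Person
    LEFT JOIN Pets pets1 ON pets1.PersonID = Person.ID
    WHERE NOT EXISTS (
        SELECT 1
        FROM Pets pets2
        WHERE pets2.PersonID = pets1.PersonID
          AND pets2.ID < pets1.ID
    )
    ORDER BY Person.ID
    """
).fetchall()

print(plain_join)
print(one_pet)
```

Jane drops from two rows to one, and the Dog row is the data that the rule hides.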
So you added a left outer join that is matching two rows? I presume OFFICE_MIS.TABLE_CODE is not unique in that table. You need to restrict that join so it only grabs one row. It depends on which row you are looking for, but you can do something like this...
OFFICE_MIS.ID = /* whatever the primary key is? */
(select top 1 om2.ID
from OFFICE_MIS om2
order by om2.ID /* change the order to fit your needs */)
If the second row has even one different value, then it is not really a duplicate and should be included.
Instead of using DISTINCT, you could use a GROUP BY.
Group by all the fields that you want to be returned as unique values.
Use MIN/MAX/AVG or any other function to give you one result for fields that could return multiple values.
SELECT Office.Field1, Client.Field1, MIN(Office.Field2), MIN(Client.Field2)
FROM YourQuery
GROUP BY Office.Field1, Client.Field1
You could try using Distinct Top 1 but, as Hunter pointed out, if even one column is different then the row should either be included or, if you don't care about or need that column, you should probably remove the column from the query. Any other suggestions would probably require more specific info.
EDIT: When using Distinct Top 1 you need to have an appropriate group by statement. You would really be using the Top 1 part. The Distinct is in there because if there is a tie for Top 1 you'll get an error without having some way to avoid a tie. The two most common ways I've seen are adding Distinct to Top 1 or you could add a column to the query that is unique so that sql would have a way to choose which record to pick in what would otherwise be a tie.