Can't get unique values - sql

I'm using MySQL to get unique values from a database.
My query looks as follows, but somehow I'm not getting unique results:
SELECT
DISTINCT(company_name),
sp_id
FROM
Student_Training
ORDER BY
company_name
sp_id is the primary key, and company_name is the company name, which needs to be unique in the results.
The data looks as follows:
sp_id, company_name
1 comp1
2 comp2
3 comp2
4 comp3
The result just isn't coming out unique.

DISTINCT works globally, on all the columns you SELECT. Here you're getting distinct pairs of (sp_id, company_name) values, but the individual values of each column may show duplicates.
That being said, it's extremely deceiving that MySQL authorizes the syntax SELECT DISTINCT(company_name), sp_id when it really means SELECT DISTINCT(company_name, sp_id). You don't need the parentheses at all, by the way.
Edit
Actually there's a reason why DISTINCT(company_name), sp_id is valid syntax: adding parentheses around an expression is always legal although it can be overkill: company_name is the same as (company_name) or even (((company_name))). Hence what that piece of SQL means is really: “DISTINCT [company_name in parentheses], [sp_id]”. The parentheses are attached to the column name, not the DISTINCT keyword, which, unlike aggregate function names, for example, does not need parentheses (AVG sp_id is not legal even if unambiguous for a human reader, it's always AVG(sp_id).)
For that matter, you could write SELECT DISTINCT company_name, (sp_id) or SELECT DISTINCT (company_name), (sp_id), it's exactly the same as the plain syntax without parentheses. Putting the list of columns inside parentheses – (company_name, sp_id) – is not legal SQL syntax, though, you can only SELECT “plain” lists of columns, unparenthesized (the form's spell-checker tells me this last expression is not an English word but I don't care. It's Friday afternoon after all).
Therefore, any database engine should accept this confusing syntax :-(
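The point about parentheses can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; the question is about MySQL, but this particular behaviour is the same) and the four sample rows from the question.

```python
import sqlite3

# In-memory stand-in for the Student_Training table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student_Training (sp_id INTEGER PRIMARY KEY, company_name TEXT)")
conn.executemany(
    "INSERT INTO Student_Training VALUES (?, ?)",
    [(1, "comp1"), (2, "comp2"), (3, "comp2"), (4, "comp3")],
)

# The parentheses bind to company_name only; DISTINCT still covers both columns.
pairs = conn.execute(
    "SELECT DISTINCT (company_name), sp_id FROM Student_Training "
    "ORDER BY company_name, sp_id"
).fetchall()

# Selecting only company_name is what removes the duplicate comp2.
names = conn.execute(
    "SELECT DISTINCT company_name FROM Student_Training ORDER BY company_name"
).fetchall()

print(pairs)  # four rows, comp2 appears twice
print(names)  # three rows
```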

DISTINCT will make unique rows, meaning the unique combination of the field values in your query.
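The following query will return one row per unique company_name, together with one sp_id from each group. MySQL (when ONLY_FULL_GROUP_BY is disabled) is free to pick any sp_id from the group, so don't rely on it being the first match; databases that follow standard SQL reject this query.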
SELECT sp_id, company_name
FROM Student_Training
GROUP BY company_name
And as Arthur Reutenauer has suggested, it is indeed pretty deceiving that MySQL allows the DISTINCT(fieldname) syntax when it actually means DISTINCT(field1, field2, ..., fieldn)

Which id do you want when a single company_name has two or more ids?
This:
SELECT DISTINCT company_name
FROM Student_Training
will select only the company_name.
This:
SELECT company_name, MIN(sp_id)
FROM Student_Training
GROUP BY
company_name
will select the minimum sp_id for each company name.
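As a rough check of the GROUP BY approach, here is a sketch using Python's sqlite3 module (a stand-in for MySQL, which is an assumption) with the sample rows from the question and its sp_id column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student_Training (sp_id INTEGER PRIMARY KEY, company_name TEXT)")
conn.executemany(
    "INSERT INTO Student_Training VALUES (?, ?)",
    [(1, "comp1"), (2, "comp2"), (3, "comp2"), (4, "comp3")],
)

# One row per company, with an explicit rule (MIN) for which sp_id to keep.
rows = conn.execute(
    "SELECT company_name, MIN(sp_id) FROM Student_Training "
    "GROUP BY company_name ORDER BY company_name"
).fetchall()

print(rows)  # [('comp1', 1), ('comp2', 2), ('comp3', 4)]
```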

DISTINCT does not work the way you think it does. It gives you a distinct record. Stop and think about how it could return what you expect it to return. If you had a table as follows
id name
1 joe
2 joe
3 james
If it only returned distinct names, which id would it return for joe?
You may want
SELECT company_name, MIN(sp_id) FROM Student_Training GROUP BY company_name
or perhaps (as above) just
SELECT DISTINCT company_name from student_training

Related

Retrieving column value in table2 via same ID in table1

I have this SQL query that returns overdue assignments
SELECT DUE_DATE,
SUBJECT,
ASSIGNMENT,
STUDENT_NAME,
TEACHER_NAME
FROM(SELECT DISTINCT
a.due_date AS due_date,
a.subject AS subject,
a.assignment AS assignment,
a.student_name AS student_name,
a.student_id AS student_id,
a.teacher_name AS teacher_name,
a.teacher_id AS teacher_id
FROM DB.ASSIGNMENT a,
DB.ALL b,
WHERE (trunc(a.DATE_CREATED) >= trunc(db.utc_sysdate)))
WHERE((trunc(due_date) < trunc(db.utc_sysdate));
and I want to include both the teacher and student emails as additional columns in my SQL query - I was wondering how to map their id in table ASSIGNMENT in order to get their respective emails in table ALL with the existing query I have?
We do lack some information, but - wouldn't your query be like this?
select distinct
a.due_date,
a.subject,
a.assignment,
a.student_name,
a.student_email,
a.teacher_name,
a.teacher_email
from db.assignment a join db.all b
on trunc(a.date_created) >= trunc(b.utc_sysdate)
and trunc(a.due_date) < trunc(b.utc_sysdate);
What's the difference, compared to your query?
- Your query is invalid: there is a stray comma after db.all b, and the final where clause references db. as if it were an alias (although it is probably the schema name, judging by the inline view's from clause).
- There's no point in aliasing columns using exactly the same name; what's the difference between a.due_date as due_date and a.due_date itself? None. So don't do it, you're just causing confusion.
- As you want to include the student's and teacher's e-mail addresses, why don't you just do that? Add those columns to the query ...
- It seems that you don't need an inline view; put both where conditions into the same query and remove the columns you don't need (both IDs).
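If the e-mail addresses really do live in the second table, one common way to map two ids to it is to join that table twice under different aliases. The sketch below is only an illustration under assumptions: the table people stands in for DB.ALL, its id and email columns are hypothetical, and SQLite (via Python's sqlite3) replaces Oracle, so date('now') takes the place of trunc(...) comparisons.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 'people' stands in for DB.ALL; its id/email columns are assumptions.
    CREATE TABLE people (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE assignment (
        assignment TEXT, due_date TEXT, student_id INTEGER, teacher_id INTEGER
    );
    INSERT INTO people VALUES (1, 'student@example.com'), (2, 'teacher@example.com');
    INSERT INTO assignment VALUES ('essay', '2020-01-01', 1, 2);
""")

# Join the lookup table once per role, under two different aliases.
rows = conn.execute("""
    SELECT a.assignment, a.due_date,
           s.email AS student_email,
           t.email AS teacher_email
    FROM assignment a
    JOIN people s ON s.id = a.student_id
    JOIN people t ON t.id = a.teacher_id
    WHERE a.due_date < date('now')   -- overdue assignments only
""").fetchall()

print(rows)
```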

DISTINCT in a simple SQL query

When executing SQL queries I have been trying to figure out the following:
In this example:
SELECT DISTINCT AL.id, AL.name
FROM albums AL
why is there a need to specify distinct? I thought that the Id being a primary key was enough to avoid duplicate results.
When you specify distinct you are specifying that you want the whole row to be distinct. For example if you have two rows:
ID=1 and Name='Joe Smith'
ID=2 and Name='Joe Smith'
then your query is going to return both rows because the different ID values make the rows distinct.
However, if you are selecting only the ID column (and it's your primary key) then the distinct is pointless.
If you're trying to find all of the unique names then you'd want to:
SELECT DISTINCT AL.name
FROM albums AL
You are right, in your case there should be no need for the word distinct because you are asking for the id and the name. Now, for sake of example where distinct is necessary, say you had multiple id's with the same name. Let It Be is an album by both the Beatles and the Replacements. And let's say you were using your database to write out labels that only included the names of the albums. The query you would want would be:
select distinct al.name
from albums al;
Sometimes your database is not perfect and it ends up with a bunch of junk data. If the id has not been designated as unique, you might end up with duplicate records, and then you might want to avoid seeing the duplicates in your query results.
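A quick sketch of the albums case, using Python's sqlite3 module. The artist column and the third album are assumptions added for illustration; only the Let It Be pair comes from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (id INTEGER PRIMARY KEY, name TEXT, artist TEXT);
    INSERT INTO albums VALUES
        (1, 'Let It Be', 'The Beatles'),
        (2, 'Let It Be', 'The Replacements'),
        (3, 'Abbey Road', 'The Beatles');
""")

# With the primary key in the select list, DISTINCT changes nothing.
with_distinct = conn.execute("SELECT DISTINCT AL.id, AL.name FROM albums AL").fetchall()
without_distinct = conn.execute("SELECT AL.id, AL.name FROM albums AL").fetchall()

# Without the key, DISTINCT collapses the two 'Let It Be' rows into one.
names_only = conn.execute(
    "SELECT DISTINCT AL.name FROM albums AL ORDER BY AL.name"
).fetchall()

print(sorted(with_distinct) == sorted(without_distinct))  # True
print(names_only)  # [('Abbey Road',), ('Let It Be',)]
```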

How to insert a count column into a sql query

I need the second column of the table retrieved from a query to have a count of the number of rows, so row one would have a 1, row 2 would have a 2 and so on. I am not very proficient with sql so I am sorry if this is a simple task.
A basic example of what I am doing would be:
SELECT [Name], [I_NEED_ROW_COUNT_HERE],[Age],[Gender]
FROM [customer]
The row count must be the second column and will act as an ID for each row. It must be the second column as the text file it is generating will be sent to the state and they require a specific format.
Thanks for any help.
With your edit, I see that you want a row ID (normally called row number rather than "count") which is best gathered from a unique ID in the database (person_id or some other unique field). If that isn't possible, you can make one for this report with ROW_NUMBER() OVER (ORDER BY EMPLOYEE_ID DESC) AS ID, in your select statement.
select Name, ROW_NUMBER() OVER (ORDER BY Name DESC) AS ID,
Age, Gender
from customer
This function adds a field to the output called ID (see my tips at the bottom to describe aliases). Since this isn't in the database, it needs a method to determine how it will increment. After the over keyword it orders by Name in descending order.
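To see the numbering in action, here is a small sketch run against SQLite through Python's sqlite3 module. It assumes a SQLite build with window-function support (3.25 or newer), and the three customer rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (Name TEXT, Age INTEGER, Gender TEXT);
    INSERT INTO customer VALUES ('Ann', 31, 'F'), ('Bob', 45, 'M'), ('Cy', 28, 'M');
""")

# ROW_NUMBER() assigns 1, 2, 3, ... following the ORDER BY inside OVER().
rows = conn.execute("""
    SELECT Name, ROW_NUMBER() OVER (ORDER BY Name DESC) AS ID, Age, Gender
    FROM customer
    ORDER BY ID
""").fetchall()

for row in rows:
    print(row)
# ('Cy', 1, 28, 'M')
# ('Bob', 2, 45, 'M')
# ('Ann', 3, 31, 'F')
```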
Information on Counting follows (won't be unique by row):
If each customer has multiple entries but the selected fields are the same for that user and you are counting that user's records (summed in one result record for the user) then you would write:
select Name, count(*), Age, Gender
from customer
group by name, age, gender
This will count (see MSDN) all the user's records as grouped by the name, age and gender (if they match, it's a single record).
However, if you are counting all records so that your whole report has the grand total on every line, then you want:
select Name, (select count(*) from customer) as "count", Age, Gender
from customer
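The difference between the two counting queries can be seen side by side in this sketch (Python's sqlite3 module, with invented sample rows in which one customer appears twice).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (Name TEXT, Age INTEGER, Gender TEXT);
    INSERT INTO customer VALUES ('Ann', 31, 'F'), ('Ann', 31, 'F'), ('Bob', 45, 'M');
""")

# Count per customer: identical rows collapse into one group.
per_customer = conn.execute("""
    SELECT Name, COUNT(*), Age, Gender
    FROM customer
    GROUP BY Name, Age, Gender
    ORDER BY Name
""").fetchall()

# Grand total repeated on every line via a scalar subquery.
grand_total = conn.execute("""
    SELECT Name, (SELECT COUNT(*) FROM customer) AS "count", Age, Gender
    FROM customer
    ORDER BY Name
""").fetchall()

print(per_customer)  # [('Ann', 2, 31, 'F'), ('Bob', 1, 45, 'M')]
print(grand_total)   # three rows, each carrying the total 3
```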
TIP: If you're using something like SSMS to write a query, dragging in columns will put brackets around the columns. This is only necessary if you have spaces in column names, but a DBA will tend to avoid that like the plague. Also, if you need a column header to be something specific, you can use the as keyword like in my first example.
W3Schools has a good tutorial on count()
The COUNT(column_name) function returns
the number of values (NULL values will not be counted) of the
specified column:
SELECT COUNT(column_name) FROM table_name;
The COUNT(*) function returns the number of records in a table:
SELECT COUNT(*) FROM table_name;
The COUNT(DISTINCT column_name) function returns the number of
distinct values of the specified column:
SELECT COUNT(DISTINCT column_name) FROM table_name;
COUNT(DISTINCT) works with ORACLE and Microsoft SQL Server, but
not with Microsoft Access.
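The three COUNT forms quoted above differ only when the column holds NULLs or repeated values. A minimal sketch with Python's sqlite3 module; the table t and its single column col are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("a",), (None,), ("b",)])

# COUNT(col) skips NULLs, COUNT(*) counts rows, COUNT(DISTINCT col) counts unique non-NULL values.
non_null, all_rows, distinct_vals = conn.execute(
    "SELECT COUNT(col), COUNT(*), COUNT(DISTINCT col) FROM t"
).fetchone()

print(non_null, all_rows, distinct_vals)  # 3 4 2
```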
It's odd to repeat the same number in every row but it sounds like this is what you're asking for. And note that this might not work in your flavor of SQL. MS Access?
SELECT [Name], (select count(*) from [customer]), [Age], [Gender]
FROM [customer]

JOIN on another table after GROUP BY and COUNT

I'm trying to make sense of the right way to use JOIN, COUNT(*), and GROUP BY to do a pretty simple query. I've actually gotten it to work (see below) but from what I've read, I'm using an extra GROUP BY that I shouldn't be.
(Note: The problem below isn't my actual problem (which deals with more complicated tables), but I've tried to come up with an analogous problem)
I have two tables:
Table: Person
-------------
key name cityKey
1 Alice 1
2 Bob 2
3 Charles 2
4 David 1
Table: City
-------------
key name
1 Albany
2 Berkeley
3 Chico
I'd like to do a query on the Person table (with some WHERE clause) that returns
the number of matching people in each city
the key for the city
the name of the city.
If I do
SELECT COUNT(Person.key) AS count, City.key AS cityKey, City.name AS cityName
FROM Person
LEFT JOIN City ON Person.cityKey = City.key
GROUP BY Person.cityKey, City.name
I get the result that I want
count cityKey cityName
2 1 Albany
2 2 Berkeley
However, I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.
So what's the right way to do this? I've been trying to google for an answer, but I feel like there's something fundamental that I'm just not getting.
I don't think that it's "wrong" in this case, because you've got a one-to-one relationship between city name and city key. You could rewrite it such that you join to a sub-select to get the count of persons to cities by key, to the city table again for the name, but it's debatable that that'd be better. It's a matter of style and opinion I guess.
select PC.ct, City.key, City.name
from City
join (select count(Person.key) ct, cityKey key from Person group by cityKey) PC
on City.key = PC.key
if my SQL isn't too rusty :-)
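A runnable version of the sub-select idea, sketched with Python's sqlite3 module. The key columns are renamed to id and city_id here (an assumption, made to avoid using the keyword key as an identifier); the data is the sample from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE City (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
    INSERT INTO City VALUES (1, 'Albany'), (2, 'Berkeley'), (3, 'Chico');
    INSERT INTO Person VALUES (1, 'Alice', 1), (2, 'Bob', 2), (3, 'Charles', 2), (4, 'David', 1);
""")

# Aggregate first in a derived table, then join to City for the name.
rows = conn.execute("""
    SELECT PC.ct, City.id, City.name
    FROM City
    JOIN (SELECT COUNT(Person.id) AS ct, city_id
          FROM Person
          GROUP BY city_id) PC
      ON City.id = PC.city_id
    ORDER BY City.id
""").fetchall()

print(rows)  # [(2, 1, 'Albany'), (2, 2, 'Berkeley')]
```

Because this is an inner join, Chico (which has no people) does not appear, matching the result in the question.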
...I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.
You misunderstand, you got it backwards.
Standard SQL requires you to specify in the GROUP BY all the columns mentioned in the SELECT that are not wrapped in aggregate functions. If you don't want certain columns in the GROUP BY, wrap them in aggregate functions. Depending on the database, you could use the analytic/windowing function OVER...
However, MySQL and SQLite provide the "feature" where you can omit these columns from the GROUP BY, which leads to no end of "why doesn't this port from MySQL to fill_in_the_blank database?!" questions on Stack Overflow and numerous other sites and forums.
"However, I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong."
It's not wrong. You have to understand how the Query Optimizer sees your query. The order in which it is parsed is what requires you to "throw the last part in." The optimizer sees your query in something akin to this order:
1. The required tables are joined.
2. The composite dataset is filtered through the WHERE clause.
3. The remaining rows are chopped into groups by the GROUP BY clause, and aggregated.
4. They are then filtered again, through the HAVING clause.
5. Finally, they are operated on by SELECT / ORDER BY, UPDATE or DELETE.
The point here is that it's not that the GROUP BY has to name all the columns in the SELECT, but in fact it is the opposite - the SELECT cannot include any columns not already in the GROUP BY.
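The ordering described above (WHERE before grouping, HAVING after) can be observed in a small sketch. It uses Python's sqlite3 module with the question's sample data; the id and city_id column names are assumptions replacing key and cityKey.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE City (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
    INSERT INTO City VALUES (1, 'Albany'), (2, 'Berkeley'), (3, 'Chico');
    INSERT INTO Person VALUES (1, 'Alice', 1), (2, 'Bob', 2), (3, 'Charles', 2), (4, 'David', 1);
""")

# WHERE removes Alice before grouping, so Albany's group holds one person;
# HAVING then drops every group with fewer than two people.
rows = conn.execute("""
    SELECT City.name, COUNT(*) AS n
    FROM Person
    JOIN City ON Person.city_id = City.id
    WHERE Person.name <> 'Alice'
    GROUP BY City.id, City.name
    HAVING COUNT(*) >= 2
""").fetchall()

print(rows)  # [('Berkeley', 2)]
```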
Your query would only work on MySQL, because you group on Person.cityKey but select city.key. All other databases would require you to use an aggregate like min(city.key), or to add City.key to the group by clause.
Because the combination of city name and city key is unique, the following are equivalent:
select count(person.key), min(city.key), min(city.name)
...
group by person.citykey
Or:
select count(person.key), city.key, city.name
...
group by person.citykey, city.key, city.name
Or:
select count(person.key), city.key, max(city.name)
...
group by city.key
All rows in the group will have the same city name and key, so it doesn't matter if you use the max or min aggregate.
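To check that the three forms really agree, here is a sketch that runs them against the question's sample data with Python's sqlite3 module. The column names id and city_id are assumptions standing in for key and cityKey, and the elided parts of the queries are filled with the LEFT JOIN from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE City (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
    INSERT INTO City VALUES (1, 'Albany'), (2, 'Berkeley'), (3, 'Chico');
    INSERT INTO Person VALUES (1, 'Alice', 1), (2, 'Bob', 2), (3, 'Charles', 2), (4, 'David', 1);
""")

JOIN = "FROM Person LEFT JOIN City ON Person.city_id = City.id"
variants = [
    # Aggregate everything that is not in the GROUP BY.
    f"SELECT COUNT(Person.id), MIN(City.id), MIN(City.name) {JOIN} GROUP BY Person.city_id",
    # Or list every selected column in the GROUP BY.
    f"SELECT COUNT(Person.id), City.id, City.name {JOIN} "
    "GROUP BY Person.city_id, City.id, City.name",
    # Or mix the two approaches.
    f"SELECT COUNT(Person.id), City.id, MAX(City.name) {JOIN} GROUP BY City.id",
]
results = [sorted(conn.execute(sql).fetchall()) for sql in variants]

print(results[0])  # [(2, 1, 'Albany'), (2, 2, 'Berkeley')]
print(results[0] == results[1] == results[2])  # True
```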
P.S. If you'd like to count only different persons, even if they have multiple rows, try:
count(DISTINCT person.key)
instead of
count(person.key)

Return all Fields and Distinct Rows

Whats the best way to do this, when looking for distinct rows?
SELECT DISTINCT name, address
FROM table;
I still want to return all fields, ie address1, city etc but not include them in the DISTINCT row check.
Then you have to decide what to do when multiple rows have the same value in the columns you want the distinct check to cover, but different values in the other columns. In that case, how does the query processor know which of the multiple values in the other columns to output? If you don't care, just write a GROUP BY on the distinct columns, with MIN() or MAX() on all the other ones.
EDIT: I agree with the comments from others that, as long as you have multiple dependent columns in the same table (e.g., Address1, Address2, City, State), this approach is going to give you mixed (and therefore inconsistent) results. If each column in the table is independent (say, addresses are all in an Address table and only an AddressId is in this table), then it's not as significant an issue, because at least all the columns from a join to the Address table will come from the same address. You are still getting a more or less random selection of one of the multiple addresses, though.
This will not mix and match your city, state, etc. and should give you the last one added even:
select b.*
from (
select max(id) id, Name, Address
from table a
group by Name, Address) as a
inner join table b
on a.id = b.id
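A sketch of this join-back pattern with Python's sqlite3 module. The table is called people here because table is a reserved word, and the City column and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, Name TEXT, Address TEXT, City TEXT);
    INSERT INTO people VALUES
        (1, 'abc', '123', 'def'),
        (2, 'abc', '123', 'hij'),
        (3, 'xyz', '9',   'klm');
""")

# Pick one id per (Name, Address), then fetch that whole row,
# so City always comes from a single real row.
rows = conn.execute("""
    SELECT b.*
    FROM (SELECT MAX(id) AS id, Name, Address
          FROM people
          GROUP BY Name, Address) AS a
    INNER JOIN people b ON a.id = b.id
    ORDER BY b.id
""").fetchall()

print(rows)  # [(2, 'abc', '123', 'hij'), (3, 'xyz', '9', 'klm')]
```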
When you have a mixed set of fields, some of which you want to be DISTINCT and others that you just want to appear, you require an aggregate query rather than DISTINCT. DISTINCT is only for returning single copies of identical fieldsets. Something like this might work:
SELECT name,
GROUP_CONCAT(DISTINCT address) AS addresses,
GROUP_CONCAT(DISTINCT city) AS cities
FROM the_table
GROUP BY name;
The above will get one row for each name. addresses contains a comma-delimited string of all the addresses for that name, each listed once. cities does the same for all the cities.
However, I don't see how the results of this query are going to be useful. It will be impossible to tell which address belongs to which city.
If, as is often the case, you are trying to create a query that will output rows in the format you require for presentation, you're much better off accepting multiple rows and then processing the query results in your application layer.
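The aggregate query can be tried out in SQLite, which also provides GROUP_CONCAT, through Python's sqlite3 module. The sample rows are invented, and since the order of items inside the concatenated string is not guaranteed, the sketch compares them as sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE the_table (name TEXT, address TEXT, city TEXT);
    INSERT INTO the_table VALUES
        ('abc', '123', 'def'),
        ('abc', '123', 'hij'),
        ('abc', '456', 'def');
""")

name, addresses, cities = conn.execute("""
    SELECT name,
           GROUP_CONCAT(DISTINCT address) AS addresses,
           GROUP_CONCAT(DISTINCT city) AS cities
    FROM the_table
    GROUP BY name
""").fetchone()

# Each distinct value appears once, but the address-to-city pairing is lost.
print(name, set(addresses.split(",")), set(cities.split(",")))
```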
I don't think you can do this because it doesn't really make sense.
name | address | city | etc...
abc | 123 | def | ...
abc | 123 | hij | ...
if you were to include city, but not have it as part of the distinct clause, the value of city would be unpredictable unless you did something like Max(city).
You can do
SELECT Name, Address, MAX(Address1), MAX(City)
FROM table
GROUP BY Name, Address
Use @JBrooks's answer below. He has a better answer.
If you're using SQL Server 2005 or above you can use the ROW_NUMBER function. This will get you the row with the lowest ID for each name. If you want to 'group' by more columns, add them in the PARTITION BY section of the ROW_NUMBER call.
SELECT id, Name, Address, ...
FROM (SELECT id, Name, Address, ...,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
FROM table) sub
WHERE RowNo = 1
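The same PARTITION BY pattern can be sketched outside SQL Server. This version runs on SQLite through Python's sqlite3 module, assuming a SQLite build with window functions (3.25 or newer); the people table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, Name TEXT, Address TEXT);
    INSERT INTO people VALUES (1, 'abc', '123'), (2, 'abc', '456'), (3, 'xyz', '9');
""")

# Number the rows within each Name, lowest id first, and keep only row 1.
rows = conn.execute("""
    SELECT id, Name, Address
    FROM (SELECT id, Name, Address,
                 ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
          FROM people) sub
    WHERE RowNo = 1
    ORDER BY id
""").fetchall()

print(rows)  # [(1, 'abc', '123'), (3, 'xyz', '9')]
```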