Return all Fields and Distinct Rows - sql

Whats the best way to do this, when looking for distinct rows?
SELECT DISTINCT name, address
FROM table;
I still want to return all fields, ie address1, city etc but not include them in the DISTINCT row check.

Then you have to decide what to do when there are multiple rows with the same value for the column you want the distinct check to check against, but with different val;ues in the other columns. In this case how does the query processor know which of the multiple values in the other columns to output, if you don't care, then just write a group by on the distinct column, with Min(), or Max() on all the other ones..
EDIT: I agree with comments from others that as long as you have multiple dependant columns in the same table (e.g., Address1, Address2, City, State ) That this approach is going to give you mixed (and therefore inconsistent ) results. If each column attribute in the table is independant ( if addresses are all in an Address Table and only an AddressId is in this table) then it's not as significant an issue... cause at least all the columns from a join to the Address table will generate datea for the same address, but you are still getting a more or less random selection of one of the set of multiple addresses...

This will not mix and match your city, state, etc. and should give you the last one added even:
select b.*
from (
select max(id) id, Name, Address
from table a
group by Name, Address) as a
inner join table b
on a.id = b.id

When you have a mixed set of fields, some of which you want to be DISTINCT and others that you just want to appear, you require an aggregate query rather than DISTINCT. DISTINCT is only for returning single copies of identical fieldsets. Something like this might work:
SELECT name,
GROUP_CONCAT(DISTINCT address) AS addresses,
GROUP_CONCAT(DISTINCT city) AS cities
FROM the_table
GROUP BY name;
The above will get one row for each name. addresses contains a comma delimted string of all the addresses for that name once. cities does the sames for all the cities.
However, I don't see how the results of this query are going to be useful. It will be impossible to tell which address belongs to which city.
If, as is often the case, you are trying to create a query that will output rows in the format you require for presentation, you're much better off accepting multiple rows and then processing the query results in your application layer.

I don't think you can do this because it doesn't really make sense.
name | address | city | etc...
abc | 123 | def | ...
abc | 123 | hij | ...
if you were to include city, but not have it as part of the distinct clause, the value of city would be unpredictable unless you did something like Max(city).

You can do
SELECT DISTINCT Name, Address, Max (Address1), Max (City)
FROM table
Use #JBrooks answer below. He has a better answer.
Return all Fields and Distinct Rows

If you're using SQL Server 2005 or above you can use the RowNumber function. This will get you the row with the lowest ID for each name. If you want to 'group' by more columns, add them in the PARTITION BY section of the RowNumber.
SELECT id, Name, Address, ...
(select id, Name, Address, ...,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
from table) sub
WHERE RowNo = 1

Related

Finding duplicate entries in any of several columns

I have code that identifies potential duplicate records based on the fact that several rows (with different IDs) have the same value in various other columns. This info gets manually reviewed, so I am not worried about the fact that a husband and wife could legitimately share an email address, for example. An example of the query I am using is this:
SELECT DISTINCT ID, Email
FROM Customers
WHERE Email IS NOT NULL AND Email != '' AND Email IN
(SELECT Email FROM Customers GROUP BY Email HAVING COUNT(DISTINCT ID) > 1)
ORDER BY Email;
Which gives me results like this:
ID Email
108 bob#hotmail.com
381 bob#hotmail.com
205 mary#gmail.com
772 mary#gmail.com
908 mary#gmail.com
This works great for my purposes, except when I try matching by phone number, which has multiple columns (HomePhone, BusinessPhone, CellPhone). This creates two problems - the first, which has been pretty well documented on this forum, is how to identify rows in which any of three columns contain a matching value (If a value in [row 1 column A, B, or C] matches a column in [row 2 column A, B, or C] then I want to select both rows). The second problem, which I haven't figured out yet and haven't found an answer to, is how to select [ID], [Value that Matched] as my output.
I suppose that I could select all three columns and do some further code magic in my program to make sense of it, but that prevents me from reusing existing code and also seems like the type of hack that a developer would use to keep from admitting that he needs help from a DBA. (Help!) In all seriousness, though, I am stuck trying to find an elegant solution, and any help would be appreciated.
Based on my understanding of the question,
You can initially use union all and get the different phone numbers into one column and group by that column to see if there are duplicates. Thereafter, join on the original table to get the customer id.
with cnts as (
select phone
from (select id,homephone phone from customers
union all
select id,businessphone from customers
union all
select id,cellphone from customers) x
group by phone
having count(distinct id) > 1
)
select c.id,cn.phone value_matched
from customers c
join cnts cn on cn.phone in (c.homephone,c.businessphone,c.cellphone)
order by 1,2
I would do this with apply:
select c.*, phone
from (select c.*, count(*) over (partition by phone) as cnt
from customers c cross apply
(select distinct v.phone
from (values (homephone), (businessphone), (cellphone)
) v(phone)
where v.phone is not null
) v(phone)
) c
where cnt > 1
order by phone;
The innermost subquery selects the distinct phones for each customer. The count(*) over . . . then counts the number of times that the phone appears (which because of the distinct is for different customers). The final where chooses phones that appear for multiple customers.

How to insert a count column into a sql query

I need the second column of the table retrieved from a query to have a count of the number of rows, so row one would have a 1, row 2 would have a 2 and so on. I am not very proficient with sql so I am sorry if this is a simple task.
A basic example of what I am doing would be is:
SELECT [Name], [I_NEED_ROW_COUNT_HERE],[Age],[Gender]
FROM [customer]
The row count must be the second column and will act as an ID for each row. It must be the second row as the text file it is generating will be sent to the state and they require a specific format.
Thanks for any help.
With your edit, I see that you want a row ID (normally called row number rather than "count") which is best gathered from a unique ID in the database (person_id or some other unique field). If that isn't possible, you can make one for this report with ROW_NUMBER() OVER (ORDER BY EMPLOYEE_ID DESC) AS ID, in your select statement.
select Name, ROW_NUMBER() OVER (ORDER BY Name DESC) AS ID,
Age, Gender
from customer
This function adds a field to the output called ID (see my tips at the bottom to describe aliases). Since this isn't in the database, it needs a method to determine how it will increment. After the over keyword it orders by Name in descending order.
Information on Counting follows (won't be unique by row):
If each customer has multiple entries but the selected fields are the same for that user and you are counting that user's records (summed in one result record for the user) then you would write:
select Name, count(*), Age, Gender
from customer
group by name, age, gender
This will count (see MSDN) all the user's records as grouped by the name, age and gender (if they match, it's a single record).
However, if you are counting all records so that your whole report has the grand total on every line, then you want:
select Name, (select count(*) from customer) as "count", Age, Gender
from customer
TIP: If you're using something like SSMS to write a query, dragging in columns will put brackets around the columns. This is only necessary if you have spaces in column names, but a DBA will tend to avoid that like the plague. Also, if you need a column header to be something specific, you can use the as keyword like in my first example.
W3Schools has a good tutorial on count()
The COUNT(column_name) function returns
the number of values (NULL values will not be counted) of the
specified column:
SELECT COUNT(column_name) FROM table_name;
The COUNT(*) function returns the number of records in a table:
SELECT COUNT(*) FROM table_name;
The COUNT(DISTINCT column_name) function returns the number of
distinct values of the specified column:
SELECT COUNT(DISTINCT column_name) FROM table_name;
COUNT(DISTINCT) works with ORACLE and Microsoft SQL Server, but
not with Microsoft Access.
It's odd to repeat the same number in every row but it sounds like this is what you're asking for. And note that this might not work in your flavor of SQL. MS Access?
SELECT [Name], (select count(*) from [customer]), [Age], [Gender]
FROM [customer]

JOIN on another table after GROUP BY and COUNT

I'm trying to make sense of the right way to use JOIN, COUNT(*), and GROUP BY to do a pretty simple query. I've actually gotten it to work (see below) but from what I've read, I'm using an extra GROUP BY that I shouldn't be.
(Note: The problem below isn't my actual problem (which deals with more complicated tables), but I've tried to come up with an analogous problem)
I have two tables:
Table: Person
-------------
key name cityKey
1 Alice 1
2 Bob 2
3 Charles 2
4 David 1
Table: City
-------------
key name
1 Albany
2 Berkeley
3 Chico
I'd like to do a query on the People (with some WHERE clause) that returns
the number of matching people in each city
the key for the city
the name of the city.
If I do
SELECT COUNT(Person.key) AS count, City.key AS cityKey, City.name AS cityName
FROM Person
LEFT JOIN City ON Person.cityKey = City.key
GROUP BY Person.cityKey, City.name
I get the result that I want
count cityKey cityName
2 1 Albany
2 2 Berkeley
However, I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.
So what's the right way to do this? I've been trying to google for an answer, but I feel like there's something fundamental that I'm just not getting.
I don't think that it's "wrong" in this case, because you've got a one-to-one relationship between city name and city key. You could rewrite it such that you join to a sub-select to get the count of persons to cities by key, to the city table again for the name, but it's debatable that that'd be better. It's a matter of style and opinion I guess.
select PC.ct, City.key, City.name
from City
join (select count(Person.key) ct, cityKey key from Person group by cityKey) PC
on City.key = PC.key
if my SQL isn't too rusty :-)
...I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.
You misunderstand, you got it backwards.
Standard SQL requires you to specify in the GROUP BY all the columns mentioned in the SELECT that are not wrapped in aggregate functions. If you don't want certain columns in the GROUP BY, wrap them in aggregate functions. Depending on the database, you could use the analytic/windowing function OVER...
However, MySQL and SQLite provide the "feature" where you can omit these columns from the group by - which leads to no end of "why doesn't this port from MySQL to fill_in_the_blank database?!" Stackoverflow and numerous other sites & forums.
However, I've read that throwing in
that last part of the GROUP BY clause
(City.name) just to make it work is
wrong.
It's not wrong. You have to understand how the Query Optimizer sees your query. The order in which it is parsed is what requires you to "throw the last part in." The optimizer sees your query in something akin to this order:
the required tables are joined
the composite dataset is filtered through the WHERE clause
the remaining rows are chopped into groups by the GROUP BY clause, and aggregated
they are then filtered again, through the HAVING clause
finally operated on, by SELECT / ORDER BY, UPDATE or DELETE.
The point here is that it's not that the GROUP BY has to name all the columns in the SELECT, but in fact it is the opposite - the SELECT cannot include any columns not already in the GROUP BY.
Your query would only work on MySQL, because you group on Person.cityKey but select city.key. All other databases would require you to use an aggregate like min(city.key), or to add City.key to the group by clause.
Because the combination of city name and city key is unique, the following are equivalent:
select count(person.key), min(city.key), min(city.name)
...
group by person.citykey
Or:
select count(person.key), city.key, city.name
...
group by person.citykey, city.key, city.name
Or:
select count(person.key), city.key, max(city.name)
...
group by city.key
All rows in the group will have the same city name and key, so it doesn't matter if you use the max or min aggregate.
P.S. If you'd like to count only different persons, even if they have multiple rows, try:
count(DISTINCT person.key)
instead of
count(person.key)

Can't get unique values

I'm using my sql to get unique values from database
My query looks as follows, but somehow I'm not able to get unique results
SELECT
DISTINCT(company_name),
sp_id
FROM
Student_Training
ORDER BY
company_name
sp_id is the primary key, and then company_name is the companies name that needs to be unique
looks as follows
sp_id, company_name
1 comp1
2 comp2
3 comp2
4 comp3
Just not sorting this unique
DISTINCT works globally, on all the columns you SELECT. Here you're getting distinct pairs of (sp_id, company_name) values, but the individual values of each column may show duplicates.
That being said, it's extremely deceiving that MySQL authorizes the syntax SELECT DISTINCT(company_name), sp_id when it really means SELECT DISTINCT(company_name, sp_id). You don't need the parentheses at all, by the way.
Edit
Actually there's a reason why DISTINCT(company_name), sp_id is valid syntax: adding parentheses around an expression is always legal although it can be overkill: company_name is the same as (company_name) or even (((company_name))). Hence what that piece of SQL means is really: “DISTINCT [company_name in parentheses], [sp_id]”. The parentheses are attached to the column name, not the DISTINCT keyword, which, unlike aggregate function names, for example, does not need parentheses (AVG sp_id is not legal even if unambiguous for a human reader, it's always AVG(sp_id).)
For that matter, you could write SELECT DISTINCT company_name, (sp_id) or SELECT DISTINCT (company_name), (sp_id), it's exactly the same as the plain syntax without parentheses. Putting the list of columns inside parentheses – (company_name, sp_id) – is not legal SQL syntax, though, you can only SELECT “plain” lists of columns, unparenthesized (the form's spell-checker tells me this last expression is not an English word but I don't care. It's Friday afternoon after all).
Therefore, any database engine should accept this confusing syntax :-(
DISTINCT will make unique rows, meaning the unique combination of the field values in your query.
The following query will return a list of all the unique company_name together with the first match for sp_id.
SELECT sp_id, company_name
FROM Student_Training
GROUP BY company_name
And as Arthur Reutenauer has suggested, it is indeed pretty deceiving that MySQL allows the DISTINCT(fieldname) syntax when it actually means DISTINCT(field1, field2, ..., fieldn)
Which id you want in case when a single company_name has two or more id's?
This:
SELECT DISTINCT company_name
FROM Student_Training
will select only the company_name.
This:
SELECT company_name, MIN(id)
FROM Student_Training
GROUP BY
company_name
will select minimal id for each company name.
DISTINCT does not work the way you think it does. It gives you a distinct record. Stop and think about how it could return what you expect it to return. If you had a table as follows
id name
1 joe
2 joe
3 james
If it only returned distinct names, which id would it return for joe?
You may want
SELECT company_name, Min(sp_id) FROM Student_Tracking GROUP BY company_name
or perhaps (as above) just
SELECT DISTINCT company_name from student_training

How to randomize order of data in 3 columns

I have 3 columns of data in SQL Server 2005 :
LASTNAME
FIRSTNAME
CITY
I want to randomly re-order these 3 columns (and munge the data) so that the data is no longer meaningful. Is there an easy way to do this? I don't want to change any data, I just want to re-order the index randomly.
When you say "re-order" these columns, do you mean that you want some of the last names to end up in the first name column? Or do you mean that you want some of the last names to get associated with a different first name and city?
I suspect you mean the latter, in which case you might find a programmatic solution easier (as opposed to a straight SQL solution). Sticking with SQL, you can do something like:
UPDATE the_table
SET lastname = (SELECT lastname FROM the_table ORDER BY RAND())
Depending on what DBMS you're using, this may work for only one line, may make all the last names the same, or may require some variation of syntax to work at all, but the basic approach is about right. Certainly some trials on a copy of the table are warranted before trying it on the real thing.
Of course, to get the first names and cities to also be randomly reordered, you could apply a similar query to either of those columns. (Applying it to all three doesn't make much sense, but wouldn't hurt either.)
Since you don't want to change your original data, you could do this in a temporary table populated with all rows.
Finally, if you just need a single random value from each column, you could do it in place without making a copy of the data, with three separate queries: one to pick a random first name, one a random last name, and the last a random phone number.
I suggest using newid with checksum for doing randomization
SELECT LASTNAME, FIRSTNAME, CITY FROM table ORDER BY CHECKSUM(NEWID())
In SQL Server 2005+ you could prepare a ranked rowset containing the three target columns and three additional computed columns filled with random rankings (one for each of the three target columns). Then the ranked rowset would be joined with itself three times using the ranking columns, and finally each of the three target columns would be pulled from their own instance of the ranked rowset. Here's an illustration:
WITH sampledata (FirstName, LastName, CityName) AS (
SELECT 'John', 'Doe', 'Chicago' UNION ALL
SELECT 'James', 'Foe', 'Austin' UNION ALL
SELECT 'Django', 'Fan', 'Portland'
),
ranked AS (
SELECT
*,
FirstNameRank = ROW_NUMBER() OVER (ORDER BY NEWID()),
LastNameRank = ROW_NUMBER() OVER (ORDER BY NEWID()),
CityNameRank = ROW_NUMBER() OVER (ORDER BY NEWID())
FROM sampledata
)
SELECT
fnr.FirstName,
lnr.LastName,
cnr.CityName
FROM ranked fnr
INNER JOIN ranked lnr ON fnr.FirstNameRank = lnr.LastNameRank
INNER JOIN ranked cnr ON fnr.FirstNameRank = cnr.CityNameRank
This is the result:
FirstName LastName CityName
--------- -------- --------
James Fan Chicago
John Doe Portland
Django Foe Austin
select *, rand() from table order by rand();
I understand some versions of SQL have a rand() that doesn't change for each line. Check for yours. Works on MySQL.