SQL Duplicate Rows - sql

I am new to SQL and was wondering if anyone could help me solve my problem.
I have a table that contains information as follows:
firstname lastname group orderinggroup date
tim s A Facebook 6/4/13
tim s A Facebook 6/4/13
tim s A Facebook 6/4/13
dan d B Google 4/5/12
dan d B Google 4/5/12
Something like that. I want it to look like this
firstname lastname group orderinggroup date
tim s A Facebook 6/4/13
dan d B Google 4/5/12
Where there aren't duplicates for tim and dan. I tried using DISTINCT but that only makes one column distinct, and I actually have many people named Tim, Dan, Groups that are A/B, etc. I was wondering if there is a method to take the distinct of multiple roles, e.g., Distinct of firstname, lastname, group, orderinggroup, and date. Last names matter. Thanks!

You could really use the max() function to aggregate some of the columns and do something like this:
select
firstname,
lastname,
[group],
max(orderinggroup) as orderinggroup,
max([date]) as [date]
from (VALUES
('tim','s','A','Facebook','6/4/13'),
('tim','s','A','Facebook','6/4/13'),
('tim','s','A','Facebook','6/4/13'),
('dan','d','B','Google','4/5/12'),
('dan','d','B','Google','4/5/12')) as A (firstname,lastname,[group],orderinggroup,[date])
--replace this with your tablename
GROUP BY firstname, lastname, [group]
ORDER BY [group]

This is bad data and DISTINCT will work. DISTINCT is actually great for this.
SELECT DISTINCT * from table
Why even list columns unless there is direct manipulation or subqueries going on? There's not even a JOIN in your statement. Please tell me with a comment why DISTINCT does not work in this situation as you mention you've used it before.
DISTINCT checks each column for matching values, if it hits a value that doesn't match on record A from record B it will spit out record B also because it's distinct from A. Even if everything before it matches.
In your example:
firstname lastname group orderinggroup date
tim s A Facebook 6/4/13
tim s A Facebook 6/4/13
tim s A Facebook 6/4/13
dan d B Google 4/5/12
dan d B Google 4/5/12
Let's start with Tim. Every field is exactly the same. Therefore Distinct would collapse all these records into one row. Same for Dan. Now if the lastname is different in your actual database (which should be reflected in your example) then DISTINCT will not work. However, the premise of your question would need to change. You would need to discern what in fact you want to reflect in your data set. Do you want to ignore the last name? Do you want to consider it? These questions are pertinent. Group By works also, but is unnecessary as you're not doing aggregates. I hope this helps.

You can use below query:
SELECT firstname, lastname, group, orderinggroup, date
FROM tablename
GROUP BY firstname, lastname, group, orderinggroup, date
HAVING count(*) = 1;

Related

How to do a subquery on a select statement to show something like this

I joined bunch of tables to get diagnosis for some people. I added a filter that only catches certain people who has certain diagnosis. In my example below, I separated those who have fever.
Here, Patient with ID 1 has 4 diagnosis and Patient with ID 2 has 3 diagnosis. They both have fever along with other problems.
Now I want show the other problem these folks have along with the fever diagnosis like this below example.
Is there any way I can group by Patient ID and show all of their diagnosis on one row? Like having a subquery on a select statement. I am not good with SQL so an example code would be very helpful.
You can group_concat
SELECT NAME, DateOfBirth, ID, GROUP_CONCAT(Diagnosis)
FROM <your_table>
GROUP BY NAME, DateOfBirth, ID

Find potential duplicate names in database

I have two tables in a SQL Server Database:
Table: People
Columns: ID, FirstName, LastName
Table: StandardNames
Columns: Nickname, StandardName
Sample Nicknames would be Rick, Rich, Richie when StandardName is Richard.
I would like to find duplicate contacts in my People table but replace any of the nicknames with the standard name. IE: sometimes I have Rich Smith other times it is Richard Smith in the People table. Is this possible? I realize it might be multiple joins to the same table but can't figure out how to start.
Firstly, you need to determine how many duplicates you have in your People table...
SELECT p.FirstName, COUNT(*)
FROM People AS p
INNER JOIN StandardNames AS sn
ON CHARINDEX(sn.Nickname, p.FirstName) > 0 OR
CHARINDEX(sn.Nickname, p.LastName) > 0
GROUP BY p.FirstName
HAVING COUNT(*) > 1
That's just to get an idea of what data you're trying to find in relation to the Nicknames that may possibly exist inside (as a wildcard word search) the Firstname and Lastname columns.
If you are happy with the items found then expand on the query to update the values.
Let's say you wanted to change the Firstname to be the Standardname...
UPDATE p2
SET p2.FirstName = p2.Standardname
FROM
(SELECT p.ID, sn.StandardName
FROM People AS p
INNER JOIN StandardNames AS sn
ON CHARINDEX(sn.Nickname, p.FirstName) > 0 OR
CHARINDEX(sn.Nickname, p.LastName) > 0) AS a
INNER JOIN People AS p2 ON p2.ID = a.ID
So this will obviously find all the People IDs that have a match based on the query above, and it will update the People table by replacing the FirstName with the StandardName.
However, there are issues with this due to the limitation of your question.
the StandardNames table should have its own ID field. All tables should have an ID column as its primary table. That's just my view.
this is only going to work for data it matches using the CHARINDEX() function. What you really need is something to find based on a "sound" or similarity to the nicknames. Check out the SOUNDEX() function and apply your logic from there.
And this is assuming your IDs above are unique!
Good luck
You could standardize the names by joining, and count the number of occurrences. Extracting the ID is a bit fiddly, but also quite possible. I'd suggest the following - use a case expression to find the contact with the standard name, and if you don't have one, just take the id of the first duplicate:
SELECT COALESCE(MIN(CASE FirstName WHEN StandardName THEN id END), MIN(id)),
StandardName,
LastName,
COUNT(*)
FROM People p
LEFT JOIN StandardNames s ON FirstName = Nickname AND
GROUP BY StandardName, LastName

Number of tuples outputted by GROUP BY primitive [duplicate]

I know that if you have one aggregate function in a SELECT statement, then all the other values in the statement must be either aggregate functions, or listed in a GROUP BY clause. I don't understand why that's the case.
If I do:
SELECT Name, 'Jones' AS Surname FROM People
I get:
NAME SURNAME
Dave Jones
Susan Jones
Amy Jones
So, the DBMS has taken a value from each row, and appended a single value to it in the result set. That's fine. But if that works, why can't I do:
SELECT Name, COUNT(Name) AS Surname FROM People
It seems like the same idea, take a value from each row and append a single value. But instead of:
NAME SURNAME
Dave 3
Susan 3
Amy 3
I get:
You tried to execute a query that does not include the specified expression 'ContactName' as part of an aggregate function.
I know it's not allowed, but the two circumstances seem so similar that I don't understand why. Is it to make the DBMS easier to implement? If anyone can explain to me why it doesn't work like I think it should, I'd be very grateful.
Aggregates doesn't work on a complete result, they only work on a group in a result.
Consider a table containing:
Person Pet
-------- --------
Amy Cat
Amy Dog
Amy Canary
Dave Dog
Susan Snake
Susan Spider
If you use a query that groups on Person, it will divide the data into these groups:
Amy:
Amy Cat
Amy Dog
Amy Canary
Dave:
Dave Dog
Susan:
Susan Snake
Susan Spider
If you use an aggreage, for exmple the count aggregate, it will produce one result for each group:
Amy:
Amy Cat
Amy Dog
Amy Canary count(*) = 3
Dave:
Dave Dog count(*) = 1
Susan:
Susan Snake
Susan Spider count(*) = 2
So, the query select Person, count(*) from People group by Person gives you one record for each group:
Amy 3
Dave 1
Susan 2
If you try to get the Pet field in the result also, that doesn't work because there may be multiple values for that field in each group.
(Some databases, like MySQL, does allow that anyway, and just returns any random value from within the group, and it's your responsibility to know if the result is sensible or not.)
If you use an aggregate, but doesn't specify any grouping, the query will still be grouped, and the entire result is a single group. So the query select count(*) from Person will create a single group containing all records, and the aggregate can count the records in that group. The result contains one row from each group, and as there is only one group, there will be one row in the result.
Think about it this way: when you call COUNT without grouping, it "collapses" the table to a single group making it impossible to access the individual items within a group in a select clause.
You can still get your result using a subquery or a cross join:
SELECT p1.Name, COUNT(p2.Name) AS Surname FROM People p1 CROSS JOIN People p2 GROUP BY p1.Name
SELECT Name, (SELECT COUNT(Name) FROM People) AS Surname FROM People
As others explained, when you have a GROUP BY or you are using an aggregate function like COUNT() in the SELECT list, you are doing a grouping of rows and therefore collapsing matching rows into one for every group.
When you only use aggregate functions in the SELECT list, without GROUP BY, think of it as you have a GROUP BY 1, so all rows are grouped, collapsed into one. So, if you have a hundred rows, the database can't really show you a name as there are a hundred of them.
However, for RDBMSs that have "windowing" functions, what you want is feasible. E.g. use aggregate functions without a GROUP BY.
Example for SQL-Server, where all rows (names) in the table are counted:
SELECT Name
, COUNT(*) OVER() AS cnt
FROM People
How does the above work?
It shows the Name like the
COUNT(*) OVER() AS cnt did not
exist and
It shows the COUNT(*) like if it was making a total grouping of the
table.
Another example. If you have a Surname field on the table, you can have something like this to show all rows grouped by Surname and counting how many people have same Surname:
SELECT Name
, Surname
, COUNT(*) OVER(PARTITION BY Surname) AS cnt
FROM People
Your query implicitly asks for different types of rows in your result set, and that is not allowed. All rows returned should be of the same type and have the same kind of columns.
'SELECT name, surname' wants to returns a row for every row in the table.
'SELECT COUNT(*)' wants to return a single row combining the results of all the rows in the table.
I think you're correct that in this case the database could plausibly just do both queries and then copy the result of 'SELECT COUNT(*)' into every result. One reason for not doing this is that it would be a stealth performance hit: you'd effectively be doing an extra self-join without declaring it anywhere.
Other answers have explained how to write a working version of this query, so I won't go into that.
The aggregate function and the group by clause aren't separate things, they're parts of the same thing that appear in different places in the query. If you wish to aggregate on a column, you must say what function to use for aggregation; if you wish to have an aggregation function, it has to be applied over some column.
The aggregate function takes values from multiple rows with a specific condition and combines them into one value. This condition is defined by the GROUP BYin your statement. So you can't use an aggregate function without a GROUP BY
With
SELECT Name, 'Jones' AS Surname FROM People
you simply select an additional column with a fixed value... but with
SELECT Name, COUNT(Name) AS Surname FROM People GROUP BY Name
you tell the DBMS to select the Names, remember how often every Name occured in the table and collapse them into one row. So if you omit the GROUP BY the DBMS can't tell, how to collapse the records

Many Fields In the Group By Clause

I am learning SQL now, and I have a question. I recently came across a query that hand a large number of column names in the group by clause. I've used group by clauses before, and I've only ever seen one column name included in it.
SELECT TransportType.Description, TransportType.CargoCapacity, TransportType.Range, Transport.SerialNumber, Transport.PurchaseDate, Transport.RetiredDate,
MAX(Repair.BeginWorkDate) AS LatestRepairDate
FROM Transport INNER JOIN
TransportType ON Transport.TransportTypeID = TransportType.TransportTypeID LEFT OUTER JOIN
Repair ON Transport.TransportNumber = Repair.TransportNumber
GROUP BY TransportType.Description, TransportType.CargoCapacity, TransportType.Range, Transport.SerialNumber, Transport.PurchaseDate,
Transport.RetiredDate
HAVING (Transport.RetiredDate IS NULL)
ORDER BY TransportType.Description, Transport.SerialNumber
Why are there so many columns in the group by clause?
Except in MySQL & SQLite (which are lenient about the GROUP BY with sometimes indeterminate results), most RDBMS require every non-aggregated column (MAX(),MIN(),SUM(),COUNT(), etc) that appears in the SELECT list to be in the GROUP BY.
The behavior of MySQL & SQLite when columns from SELECT aren't listed in GROUP BY is not well defined. If for example, you execute a query like:
SELECT firstname, lastname, COUNT(*) FROM names GROUP BY lastname
MySQL would give you a result without complaint.
However, if your table included two different values of firstname having the same lastname, your resultant COUNT(*) would count both of them while only returning the firstname of one of them. What's more, which firstname MySQL chooses to return isn't defined so you can't really rely on it returning the first of the pair, for example.
From a table like:
firstname, lastname
--------------------
Jane Smith
John Smith
Peter Jones
The not-fully-correct result might be:
firstname, lastname, COUNT(*)
-----------------------------
Jane Smith 2 <----wrong!
Peter Jones 1
Outside MySQL & SQLite, columns referenced anywhere in the SELECT list not also appearing in the GROUP BY will result in a query parse error.
Commonly here on Stack Overflow, we encounter users with questions about the GROUP BY, having just begun working with an RDBMS that is stricter about its usage. If you learn aggregates in MySQL first, chances are you'll need to relearn to do them properly when moving to a different RDBMS.

Geting value count from an Oracle Table

I have a table, that contains employees. Since the company I'm working for is quite big (>3k employees) It is only natural, that some of them have the same names. Now they can be differentiated by their usernames, but since a webpage needs a drop-down with all of these users, I need to add some extra data to their names.
I know I could first grab all of the users and then run them through a foreach and add a count to each of the user objects. That would be quite ineffective though. Therefore I'm in need of a good SQL query, that would do something like this. Could a sub-query be the thing I need?
My Table looks something like this:
name ----- surname ----- username
John Mayer jmaye
Suzan Harvey sharv
John Mayer jmay3
Now what I think would be great, if the query returned the same 3 fields and also a boolean if there is more than one person with the same name and surname combination.
Adding the flag to Daniel's answer...
SELECT NAME, SURNAME, USERNAME, DECODE(COUNT(*) OVER (PARTITION BY NAME, SURNAME), 1, 'N', 'Y')
FROM
YOUR_TABLE;
Please note that Oracle SQL has no support for booleans (sigh...)
This can be easily done with a count over partition:
SELECT NAME, SURNAME, USERNAME, COUNT(*) OVER (PARTITION BY NAME, SURNAME)
FROM
YOUR_TABLE;