SQL group by with a count - sql

I have a table (simplified below)
|company|name |age|
| 1 | a | 3 |
| 1 | a | 3 |
| 1 | a | 2 |
| 2 | b | 8 |
| 3 | c | 1 |
| 3 | c | 1 |
For various reason the age column should be the same for each company. I have another process that is updating this table and sometimes it put an incorrect age in. For company 1 the age should always be 3
I want to find out which companies have a mismatch of age.
Ive done this
select company, name age from table group by company, name, age
but dont know how to get the rows where the age is different. this table is a lot wider and has loads of columns so I cannot really eyeball it.
Can anyone help?
Thanks

You should not be including age in the group by clause.
SELECT company
FROM tableName
GROUP BY company, name
HAVING COUNT(DISTINCT age) <> 1
SQLFiddle Demo

If you want to find the row(s) with a different age than the max-count age of each company/name group:
WITH CTE AS
(
select company, name, age,
maxAge=(select top 1 age
from dbo.table1 t2
group by company,name, age
having( t1.company=t2.company and t1.name=t2.name)
order by count(*) desc)
from dbo.table1 t1
)
select * from cte
where age <> maxAge
Demontration
If you want to update the incorrect with the correct ages you just need to replace the SELECT with UPDATE:
WITH CTE AS
(
select company, name, age,
maxAge=(select top 1 age
from dbo.table1 t2
group by company,name, age
having( t1.company=t2.company and t1.name=t2.name)
order by count(*) desc)
from dbo.table1 t1
)
UPDATE cte SET AGE = maxAge
WHERE age <> maxAge
Demonstration

Since you mentioned "how to get the rows where the age is different" and not just the comapnies:
Add a unique row id (a primary key) if there isn't already one. Let's call it id.
Then, do
select id from table
where company in
(select company from table
group by company
having count(distinct age)>1)

Related

Why no similar ids in the results set when query with a correlated query inside where clause

I have a table with columns id, forename, surname, created (date).
I have a table such as the following:
ID | Forename | Surname | Created
---------------------------------
1 | Tom | Smith | 2008-01-01
1 | Tom | Windsor | 2008-02-01
2 | Anne | Thorn | 2008-01-05
2 | Anne | Baker | 2008-03-01
3 | Bill | Sykes | 2008-01-20
Basically, I want this to return the most recent name for each ID, so it would return:
ID | Forename | Surname | Created
---------------------------------
1 | Tom | Windsor | 2008-02-01
2 | Anne | Baker | 2008-03-01
3 | Bill | Sykes | 2008-01-20
I get the desired result with this query.
SELECT id, forename, surname, created
FROM name n
WHERE created = (SELECT MAX(created)
FROM name
GROUP BY id
HAVING id = n.id);
I am getting the result I want but I fail to understand WHY THE IDS ARE NOT BEING REPEATED in the result set. What I understand about correlated subquery is it takes one row from the outer query table and run the inner subquery. Shouldn't it repeat "id" when ids repeat in the outer query? Can someone explain to me what exactly is happening behind the scenes?
First, your subquery does not need a GROUP BY. It is more commonly written as:
SELECT n.id, n.forename, n.surname, n.created
FROM name n
WHERE n.created = (SELECT MAX(n2.created)
FROM name n2
WHERE n2.id = n.id
);
You should get in the habit of qualifying all column references, especially when your query has multiple table references.
I think you are asking why this works. Well, each row in the outer query is tested for the condition. The condition is: "is my created the same as the maximum created for all rows in the name table with the same id". In your data, only one row per id matches that condition, so ids are not repeated.
You can also consider joining the tables by created vs max(created) column values :
SELECT n.id, n.forename, n.surname, n.created
FROM name n
RIGHT JOIN ( SELECT id, MAX(created) as created FROM name GROUP BY id ) t
ON n.created = t.created;
or using IN operator :
SELECT id, forename, surname, created
FROM name n
WHERE ( id, created ) IN (SELECT id, MAX(created)
FROM name
GROUP BY id );
or using EXISTS with HAVING clause in the subquery :
SELECT id, forename, surname, created
FROM name n
WHERE EXISTS (SELECT id
FROM name
GROUP BY id
HAVING MAX(created) = n.created
);
Demo

postgresql find duplicates in column with ID

For instance, I have a table say "Name" with duplicate records in it:
Id | Firstname
--------------------
1 | John
2 | John
3 | Marc
4 | Jammie
5 | John
6 | Marc
How can I fetch duplicate records and display them with their receptive primary key ID?
Use Count()Over() window aggregate function
Select * from
(
select Id, Firstname, count(1)over(partition by Firstname) as Cnt
from yourtable
)a
Where Cnt > 1
SELECT t.*
FROM t
INNER JOIN
(SELECT firstname
FROM t
GROUP BY firstname
HAVING COUNT(*) > 1) sub
ON t.firstname = sub.firstname
A sub-query would do the trick. Select the first names that are found more than once your table, t. Then join these names back to the main table to pull in the primary key.

SQL Joins . One to many relationship

I have two tables as below
Table 1
-----------------------------------
UserID | UserName | Age | Salary
-----------------------------------
1 | foo | 22 | 33000
-----------------------------------
Table 2
------------------------------------------------
UserID | Age | Salary | CreatedDate
------------------------------------------------
1 | NULL | 35000 | 2015-01-01
------------------------------------------------
1 | 28 | NULL | 2015-02-01
------------------------------------------------
1 | NULL | 28000 | 2015-03-01
------------------------------------------------
I need the result like this.
Result
-----------------------------------
UserID | UserName | Age | Salary
-----------------------------------
1 | foo | 28 | 28000
-----------------------------------
This is just an example. In my real project I have around 6 columns like Age and Salary in above tables.
In table 2 , each record will have only have one value i.e if Age has value then Salary will be NULL and viceversa.
UPDATE :
Table 2 has CreatedDate Column. So i want to get latest "NOTNULL" CELL Value instead of maximum value.
You can get this done using a simple MAX() and GROUP BY:
select t1.userid,t1.username, MAX(t2.Age) as Age, MAX(t2.Salary) as Salary
from table1 t1 join
table2 t2 on t1.userid=t2.userid
group by t1.userid,t1.username
Result:
userid username Age Salary
--------------------------------
1 foo 28 35000
Sample result in SQL Fiddle
Note: I'm giving you the benefit of the doubt that you know what you're doing, and you just haven't told us everything about your schema.
It looks like Table 2 is actually an "updates" table, in which each row contains a delta of changes to apply to the base entity in Table 1. In which case you can retrieve each column's data with a correlated join (technically an outer-apply) and put the results together. Something like the following:
select a.UserID, a.UserName,
coalesce(aAge.Age, a.Age),
coalesce(aSalary.Salary, a.Salary)
from [Table 1] a
outer apply (
select Age
from [Table 2] x
where x.UserID = a.UserID
and x.Age is not null
and not exists (
select 1
from [Table 2] y
where x.UserID = y.UserID
and y.Id > x.Id
and y.Age is not null
)
) aAge,
outer apply (
select Salary
from [Table 2] x
where x.UserID = a.UserID
and x.Salary is not null
and not exists (
select 1
from [Table 2] y
where x.UserID = y.UserID
and y.Id > x.Id
and y.Salary is not null
)
) aSalary
Do note I am assuming you have at minimum an Id column in Table 2 which is monotonically increasing with each insert. If you have a "change time" column, use this instead to get the latest row, as it is better.
To get the latest value based on CreatedDate, you can use ROW_NUMBER to filter for latest rows. Here the partition is based UserID and the other columns, Age and Salary.
SQL Fiddle
;WITH Cte AS(
SELECT
UserID,
Age = MAX(Age),
Salary = MAX(Salary)
FROM(
SELECT *, Rn = ROW_NUMBER() OVER(
PARTITION BY
UserID,
CASE
WHEN Age IS NOT NULL THEN 1
WHEN Salary IS NOT NULL THEN 2
END
ORDER BY CreatedDate DESC
)
FROM Table2
)t
WHERE Rn = 1
GROUP BY UserID
)
SELECT
t.UserID,
t.UserName,
Age = ISNULL(c.Age, t.Age),
Salary = ISNULL(c.Salary, t.Salary)
FROM Table1 t
LEFT JOIN Cte c
ON t.UserID = c.UserID
following query should work(working fine in MSSQL) :
select a.userID,a.username,b.age,b.sal from <table1> a
inner join
(select userID,MAX(age) age,MAX(sal) sal from <table2> group by userID) b
on a.userID=b.userID

How to select all attributes (*) with distinct values in a particular column(s)?

Here is link to the w3school database for learners:
W3School Database
If we execute the following query:
SELECT DISTINCT city FROM Customers
it returns us a list of different City attributes from the table.
What to do if we want to get all the rows like that we get from SELECT * FROM Customers query, with unique value for City attribute in each row.
DISTINCT when used with multiple columns, is applied for all the columns together. So, the set of values of all columns is considered and not just one column.
If you want to have distinct values, then concatenate all the columns, which will make it distinct.
Or, you could group the rows using GROUP BY.
You need to select all values from customers table, where city is unique. So, logically, I came with such query:
SELECT * FROM `customers` WHERE `city` in (SELECT DISTINCT `city` FROM `customers`)
I think you want something like this:
(change PK field to your Customers Table primary key or index like Id)
In SQL Server (and standard SQL)
SELECT
*
FROM (
SELECT
*, ROW_NUMBER() OVER (PARTITION BY City ORDER BY PK) rn
FROM
Customers ) Dt
WHERE
(rn = 1)
In MySQL
SELECT
*
FORM (
SELECT
a.City, a.PK, count(*) as rn
FROM
Customers a
JOIN
Customers b ON a.City = b.City AND a.PK >= b.PK
GROUP BY a.City, a.PK ) As DT
WHERE (rn = 1)
This query -I hope - will return your Cities distinctly and also shows other columns.
You can use GROUP BY clause for getting distinct values in a particular column. Consider the following table - 'contact':
+---------+------+---------+
| id | name | city |
+---------+------+---------+
| 1 | ABC | Chennai |
+---------+------+---------+
| 2 | PQR | Chennai |
+---------+------+---------+
| 3 | XYZ | Mumbai |
+---------+------+---------+
To select all columns with distinct values in City attribute, use the following query:
SELECT *
FROM contact
GROUP BY city;
This will give you the output as follows:
+---------+------+---------+
| id | name | city |
+---------+------+---------+
| 1 | ABC | Chennai |
+---------+------+---------+
| 3 | XYZ | Mumbai |
+---------+------+---------+

Compare Multiple rows In SQL Server

I have a SQL Server database full of the following (fictional) data in the following structure:
ID | PatientID | Exam | (NON DB COLUMN FOR REFERENCE)
------------------------------------
1 | 12345 | CT | OK
2 | 11234 | CT | OK(Same PID but Different Exam)
3 | 11234 | MRI | OK(Same PID but Different Exam)
4 | 11123 | CT | BAD(Same PID, Same Exam)
5 | 11123 | CT | BAD(Same PID, Same Exam)
6 | 11112 | CT | BAD(Conflicts With ID 8)
7 | 11112 | MRI | OK(SAME PID but different Exam)
8 | 11112 | CT | BAD(Conflicts With ID 6)
9 | 11123 | CT | BAD(Same PID, Same Exam)
10 | 11123 | CT | BAD(Same PID, Same Exam)
I am trying to write a query with will go through an identify everything that isn't bad as per my example above.
Overall, a patient (identified by PatientId) can have many rows, but may not have 2 or more rows with the same exam!
I have attempted various modifications of exams I found on here but still with no luck.
Thanks.
You seem to want to identify duplicates, ranking them as good or bad. Here is a method using window functions:
select t.id, t.patientid, t.exam,
(case when cnt > 1 then 'BAD' else 'OK' end)
from (select t.*, count(*) over (partition by patientid, exam) as cnt
from table t
) t;
use Count() over() :
select *,case when COUNT(*) over(partition by PatientID, Exam) > 1 then 'bad' else 'ok'
from yourtable
You can also use:
;WITH CTE_Patients
(ID, PatientID, Exam, RowNumber)
AS
(
SELECT ID, PatientID, Exam
ROW_NUMBER() OVER (PARTITION BY PatientID, Exam ORDER BY ID)
FROM YourTableName
)
SELECT TableB.ID, TableB.PatientID, TableB.Exam, [DuplicateOf] = TableA.ID
FROM CTE_Patients TableB
INNER JOIN CTE_Patients TableA
ON TableB.PatientID = TableA.PatientID
AND TableB.Exam = TableA.Exam
WHERE TableB.RowNumber > 1 -- Duplicate rows
AND TableA.RowNumber = 1 -- Unique rows
I have a sample here: SQL Server – Identifying unique and duplicate rows in a table, you can identify unique rows as well as duplicate rows
If you don't want to use a CTE or Count Over, you can also group the Source table, and select from there...(but I'd be surprised if #Gordon was too far off the mark with the original answer :) )
SELECT a.PatientID, a.Exam, CASE WHEN a.cnt > 1 THEN 'BAD' ELSE 'OK' END
FROM ( SELECT PatientID
,Exam
,COUNT(*) AS cnt
FROM tableName
GROUP BY Exam
,PatientID
) a
Select those patients that never have 2 or more exams of same type.
select * from patients t1
where not exists (select 1 from patients t2
where t1.PatientID = t2.PatientID
group by exam
having count(*) > 1)
Or, if you want all rows, like in your example:
select ID,
PatientID,
Exam,
case when exists (select 1 from patients t2
where t1.PatientID = t2.PatientID
group by exam
having count(*) > 1) then 'BAD' else 'OK' end
from patients