SQL - find all instances where two columns are the same - sql

So I have a simple table that holds comments from a user that pertain to a specific blog post.
id | user | post_id | comment
----------------------------------------------------------
0 | john@test.com | 1001 | great article
1 | bob@test.com | 1001 | nice post
2 | john@test.com | 1002 | I agree
3 | john@test.com | 1001 | thats cool
4 | bob@test.com | 1002 | thanks for sharing
5 | bob@test.com | 1002 | really helpful
6 | steve@test.com | 1001 | spam post about pills
I want to get all instances where a user commented on the same post twice (meaning same user and same post_id). In this case I would return:
id | user | post_id | comment
----------------------------------------------------------
0 | john@test.com | 1001 | great article
3 | john@test.com | 1001 | thats cool
4 | bob@test.com | 1002 | thanks for sharing
5 | bob@test.com | 1002 | really helpful
I thought DISTINCT was what I needed but that just gives me unique rows.

You can use GROUP BY and HAVING to find pairs of user and post_id that have multiple entries:
SELECT a.*
FROM table_name a
JOIN (SELECT user, post_id
FROM table_name
GROUP BY user, post_id
HAVING COUNT(id) > 1
) b
ON a.user = b.user
AND a.post_id = b.post_id
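To check the GROUP BY / HAVING approach against the sample rows, here is a minimal sketch in Python using the standard-library sqlite3 module. The in-memory SQLite database, the table name comments, the column name usr (used instead of user) and the shortened user names are assumptions made for illustration; they are not part of the original answer.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER, usr TEXT, post_id INTEGER, comment TEXT)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [
        (0, "john", 1001, "great article"),
        (1, "bob", 1001, "nice post"),
        (2, "john", 1002, "I agree"),
        (3, "john", 1001, "thats cool"),
        (4, "bob", 1002, "thanks for sharing"),
        (5, "bob", 1002, "really helpful"),
        (6, "steve", 1001, "spam post about pills"),
    ],
)

# The subquery keeps (usr, post_id) pairs that occur more than once;
# joining back to the table returns every original row of those pairs.
query = """
SELECT a.id
FROM comments a
JOIN (SELECT usr, post_id
      FROM comments
      GROUP BY usr, post_id
      HAVING COUNT(id) > 1) b
  ON a.usr = b.usr
 AND a.post_id = b.post_id
ORDER BY a.id
"""
dup_ids = [row[0] for row in conn.execute(query)]
print(dup_ids)
```

The ids printed are the rows that belong to a repeated (usr, post_id) pair, which is what the question asks for.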

DISTINCT removes all duplicate rows, which is why you're getting unique rows.
You can try using a CROSS JOIN (available as of Hive 0.10 according to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins). Compare the ids so that a row is not matched with itself, otherwise every row in the table is returned:
SELECT DISTINCT mt.*
FROM MYTABLE mt
CROSS JOIN MYTABLE mt2
WHERE mt.user = mt2.user
AND mt.post_id = mt2.post_id
AND mt.id <> mt2.id
The performance might not be the best though. If you want to sort the result, use SORT BY or ORDER BY.

DECLARE @MyTable TABLE (id int, usr varchar(50), post_id int, comment varchar(50))
INSERT @MyTable (id, usr, post_id, comment) VALUES (0,'john@test.com',1001,'great article')
INSERT @MyTable (id, usr, post_id, comment) VALUES (1,'bob@test.com',1001,'nice post')
INSERT @MyTable (id, usr, post_id, comment) VALUES (3,'john@test.com',1002,'I agree')
INSERT @MyTable (id, usr, post_id, comment) VALUES (4,'john@test.com',1001,'thats cool')
INSERT @MyTable (id, usr, post_id, comment) VALUES (5,'bob@test.com',1002,'thanks for sharing')
INSERT @MyTable (id, usr, post_id, comment) VALUES (6,'bob@test.com',1002,'really helpful')
INSERT @MyTable (id, usr, post_id, comment) VALUES (7,'steve@test.com',1001,'spam post about pills')
SELECT
T1.id,
T1.usr,
T1.post_id,
T1.comment
FROM
@MyTable T1
INNER JOIN @MyTable T2
ON T1.usr = T2.usr AND T1.post_id = T2.post_id
GROUP BY
T1.id,
T1.usr,
T1.post_id,
T1.comment
HAVING
Count(T2.id) > 1

Related

Finding only rows with non-duplicated values within a window partition

I want to look at why some descriptions are different for the same permit id. Here's the table (I'm using Snowflake):
create or replace table permits (permit varchar(255), description varchar(255));
// dupe permits, dupe descriptions, throw out
INSERT INTO permits VALUES ('1', 'abc');
INSERT INTO permits VALUES ('1', 'abc');
// dupe permits, unique descriptions, keep
INSERT INTO permits VALUES ('2', 'def1');
INSERT INTO permits VALUES ('2', 'def2');
INSERT INTO permits VALUES ('2', 'def3');
// dupe permits, unique descriptions, keep
INSERT INTO permits VALUES ('3', NULL);
INSERT INTO permits VALUES ('3', 'ghi1');
// unique permit, throw out
INSERT INTO permits VALUES ('5', 'xyz');
What I want is to query this table and get out only the sets of rows that have duplicate permit ids but different descriptions.
The output I want is this:
+---------+-------------+
| PERMIT | DESCRIPTION |
+---------+-------------+
| 2 | def1 |
| 2 | def2 |
| 2 | def3 |
| 3 | |
| 3 | ghi1 |
+---------+-------------+
I've tried this:
with with_dupe_counts as (
select
count(permit) over (partition by permit order by permit) as permit_dupecount,
count(description) over (partition by permit order by permit) as description_dupecount,
permit,
description
from permits
)
select *
from with_dupe_counts
where permit_dupecount > 1
and description_dupecount > 1
Which gives me permits 1 and 2 and counts descriptions whether they are unique or not:
+------------------+-----------------------+--------+-------------+
| PERMIT_DUPECOUNT | DESCRIPTION_DUPECOUNT | PERMIT | DESCRIPTION |
+------------------+-----------------------+--------+-------------+
| 2 | 2 | 1 | abc |
| 2 | 2 | 1 | abc |
| 3 | 3 | 2 | def1 |
| 3 | 3 | 2 | def2 |
| 3 | 3 | 2 | def3 |
+------------------+-----------------------+--------+-------------+
What I think would work would be
count(unique description) over (partition by permit order by permit) as description_dupecount
But as I'm realizing there are lots of things that don't work in window functions. This question isn't necessarily "how do I get count(unique x) to work in a window function" because I don't know if that is the best way to solve this.
A simple group by I don't think will work because I want to get the original rows back.
One method uses min() and max() and count():
select *
from (select p.*,
min(description) over (partition by permit) as min_d,
max(description) over (partition by permit) as max_d,
count(description) over (partition by permit) as cnt_d,
count(*) over (partition by permit) as cnt,
count(permit) over (partition by permit order by permit) as permit_dupecount
from permits p
)
where min_d <> max_d or cnt_d <> cnt;
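The min()/max()/count() idea can be tried outside Snowflake. Below is a minimal sketch in Python with the standard-library sqlite3 module; it assumes an SQLite build with window-function support (3.25 or later), and the derived-table alias flagged is made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Snowflake here (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (permit TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [
        ("1", "abc"), ("1", "abc"),
        ("2", "def1"), ("2", "def2"), ("2", "def3"),
        ("3", None), ("3", "ghi1"),
        ("5", "xyz"),
    ],
)

# min_d <> max_d catches permits with two different non-NULL descriptions;
# cnt_d <> cnt catches permits that mix NULL and non-NULL descriptions,
# because COUNT(description) skips NULLs while COUNT(*) does not.
query = """
SELECT permit, description
FROM (
    SELECT p.*,
           MIN(description)   OVER (PARTITION BY permit) AS min_d,
           MAX(description)   OVER (PARTITION BY permit) AS max_d,
           COUNT(description) OVER (PARTITION BY permit) AS cnt_d,
           COUNT(*)           OVER (PARTITION BY permit) AS cnt
    FROM permits p
) AS flagged
WHERE min_d <> max_d OR cnt_d <> cnt
ORDER BY permit, description
"""
kept = conn.execute(query).fetchall()
print(kept)
```

On the sample data this keeps the three rows of permit 2 and both rows of permit 3, and drops permits 1 and 5.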
I would just use exists:
select p.*
from permits p
where exists (
select 1
from permits p1
where p1.permit = p.permit and p1.description <> p.description
)
To handle the null values, we can use the standard null-safe comparison IS DISTINCT FROM, which Snowflake supports:
select p.*
from permits p
where exists (
select 1
from permits p1
where
p1.permit = p.permit
and p1.description is distinct from p.description
)
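The difference between the two EXISTS variants shows up on permit 3, whose rows are NULL and 'ghi1'. The sketch below runs both in Python with sqlite3. SQLite's IS NOT operator is used as the null-safe inequality, standing in for IS DISTINCT FROM; that substitution and the in-memory database are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (permit TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [
        ("1", "abc"), ("1", "abc"),
        ("2", "def1"), ("2", "def2"), ("2", "def3"),
        ("3", None), ("3", "ghi1"),
        ("5", "xyz"),
    ],
)

template = """
SELECT p.permit, p.description
FROM permits p
WHERE EXISTS (
    SELECT 1
    FROM permits p1
    WHERE p1.permit = p.permit
      AND p1.description {op} p.description
)
ORDER BY p.permit, p.description
"""

# Plain <> evaluates to NULL when either side is NULL, so permit 3 is lost.
plain = conn.execute(template.format(op="<>")).fetchall()

# IS NOT treats NULL as an ordinary comparable value (null-safe inequality).
null_safe = conn.execute(template.format(op="IS NOT")).fetchall()

print(plain)
print(null_safe)
```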
This should work:
SELECT DISTINCT p1.permit, p1.description
FROM permits p1
JOIN permits p2 ON p1.permit = p2.permit
WHERE p1.description != p2.description
OR (p1.description IS NULL AND p2.description IS NOT NULL)
OR (p1.description IS NOT NULL AND p2.description IS NULL)
This is my go to:
with x as (
select permit, count(distinct description) cnt
from permits p1
group by permit
having cnt > 1
)
select p.*
from x
join permits p
on x.permit = p.permit;
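One caveat with the COUNT(DISTINCT ...) approach is that COUNT(DISTINCT description) ignores NULLs, so a permit whose descriptions are NULL and one other value counts as 1. The sketch below, in Python with sqlite3, shows the effect and one possible adjustment that adds 1 when the group contains a NULL. The adjustment is an illustrative assumption, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (permit TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [
        ("1", "abc"), ("1", "abc"),
        ("2", "def1"), ("2", "def2"), ("2", "def3"),
        ("3", None), ("3", "ghi1"),
        ("5", "xyz"),
    ],
)

template = """
SELECT p.permit, p.description
FROM permits p
JOIN (SELECT permit
      FROM permits
      GROUP BY permit
      HAVING {condition}) x
  ON x.permit = p.permit
ORDER BY p.permit, p.description
"""

# NULLs are not counted, so permit 3 (NULL, 'ghi1') has a distinct count of 1.
as_written = conn.execute(
    template.format(condition="COUNT(DISTINCT description) > 1")
).fetchall()

# Hypothetical adjustment: treat NULL as one more distinct value.
with_nulls = conn.execute(
    template.format(
        condition="COUNT(DISTINCT description) + MAX(description IS NULL) > 1"
    )
).fetchall()

print(as_written)
print(with_nulls)
```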

SQL Server - select every combination

We have two tables below, I am trying to write a query that will select EVERY Purchase for EVERY person on the team. For example, it should show PersonA being associated to PurchaseID 1 and 2 because they are on the same Team as TeamA.
Is this possible? I thought a cross join would work but it seemed to bring back too many columns. I am running SQL Server.
Thank you
Purchases
| PurchaseID | PersonID |
|------------ |---------- |
| 1 | TeamA |
| 2 | TeamA |
| 3 | PersonA |
| 4 | PersonB |
| 5 | TeamB |
Teams
| TeamID | PersonID |
|-------- |---------- |
| 1 | PersonA |
| 1 | TeamA |
| 1 | PersonC |
| 2 | PersonB |
| 2 | TeamB |
Expected results (when filtered on PurchaseID 1):
| PurchaseID | PersonID |
|------------ |---------- |
| 1 | TeamA |
| 1 | PersonA |
| 1 | PersonC |
Your data structure is a little odd, but I think I understand what you want.
If PersonA made a purchase, and PersonA is on TeamA, then everyone on TeamA should be shown as being associated with the purchase, right? Like "I bought these doughnuts for my team, so everyone on my team gets a doughnut".
What you're going to want is to join Purchases to Teams on PersonID, as you probably guessed. Then use CROSS APPLY, which evaluates a correlated subquery (or an inline table-valued function) once for each outer row, to return all the people on the same team as the person in the "current row".
I used two common table expressions to represent your tables so I could run it. You'll just want the SELECT part:
with Purchases as (
select 1 as PurchaseID, 'TeamA' as PersonID
union select 2 as PurchaseID, 'TeamA' as PersonID
union select 3 as PurchaseID, 'PersonA' as PersonID
union select 4 as PurchaseID, 'PersonB' as PersonID
union select 5 as PurchaseID, 'TeamB' as PersonID
)
, Teams as (
select 1 as TeamID, 'PersonA' as PersonID
union select 1 as TeamID, 'TeamA' as PersonID
union select 1 as TeamID, 'PersonC' as PersonID
union select 2 as TeamID, 'PersonB' as PersonID
union select 2 as TeamID, 'TeamB' as PersonID
)
select Purchases.PurchaseID
, EveryTeamMember.PersonID
from Purchases
join Teams
on Teams.PersonID = Purchases.PersonID
cross apply (
select PersonID
from Teams InnerTable
where InnerTable.TeamID = Teams.TeamID
) as EveryTeamMember
where Purchases.PurchaseID = 1
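For a correlated subquery this simple, the CROSS APPLY behaves like joining Teams a second time on TeamID. Here is a minimal sketch of that self-join form in Python with sqlite3. SQLite has no CROSS APPLY, so this shows the equivalent join rather than the SQL Server syntax; the aliases buyer and members are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchases (PurchaseID INTEGER, PersonID TEXT)")
conn.execute("CREATE TABLE Teams (TeamID INTEGER, PersonID TEXT)")
conn.executemany(
    "INSERT INTO Purchases VALUES (?, ?)",
    [(1, "TeamA"), (2, "TeamA"), (3, "PersonA"), (4, "PersonB"), (5, "TeamB")],
)
conn.executemany(
    "INSERT INTO Teams VALUES (?, ?)",
    [(1, "PersonA"), (1, "TeamA"), (1, "PersonC"), (2, "PersonB"), (2, "TeamB")],
)

# buyer   = the Teams row of whoever made the purchase
# members = every Teams row that shares the buyer's TeamID
query = """
SELECT pu.PurchaseID, members.PersonID
FROM Purchases pu
JOIN Teams buyer   ON buyer.PersonID = pu.PersonID
JOIN Teams members ON members.TeamID = buyer.TeamID
WHERE pu.PurchaseID = 1
ORDER BY members.PersonID
"""
team_rows = conn.execute(query).fetchall()
print(team_rows)
```

Purchase 1 was made by TeamA, which is on team 1, so all three members of team 1 come back.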
If you want to get all team members when the PersonID starts with Team, then I think you should do a CROSS APPLY over all PersonIDs that start with Team, and UNION (not UNION ALL) the result with the single-person purchases:
DECLARE @Purchases TABLE (
PurchaseID INT,
PersonID Varchar(50)
)
INSERT INTO @Purchases(PersonID,PurchaseID) VALUES ('TeamA', 1);
INSERT INTO @Purchases(PersonID,PurchaseID) VALUES ('TeamA', 2);
INSERT INTO @Purchases(PersonID,PurchaseID) VALUES ('PersonA', 3);
INSERT INTO @Purchases(PersonID,PurchaseID) VALUES ('PersonB', 4);
INSERT INTO @Purchases(PersonID,PurchaseID) VALUES ('TeamB', 5);
DECLARE @Teams TABLE (
TeamID INT,
PersonID Varchar(50)
)
INSERT INTO @Teams(PersonID,TeamID) VALUES ('PersonA', 1);
INSERT INTO @Teams(PersonID,TeamID) VALUES ('TeamA', 1);
INSERT INTO @Teams(PersonID,TeamID) VALUES ('PersonC', 1);
INSERT INTO @Teams(PersonID,TeamID) VALUES ('PersonB', 2);
INSERT INTO @Teams(PersonID,TeamID) VALUES ('TeamB', 2);
SELECT T1.PurchaseID,TeamPersons.PersonID
FROM @Purchases T1
INNER JOIN @Teams T2
ON T2.PersonID = T1.PersonID AND T1.PersonID LIKE 'Team%'
CROSS APPLY (
SELECT PersonID
FROM @Teams T3
WHERE T3.TeamID = T2.TeamID
) AS TeamPersons
UNION
SELECT T1.PurchaseID
, T1.PersonID
FROM @Purchases T1
WHERE T1.PersonID NOT LIKE 'Team%'

Returning most recent row SQL Server

I have this table
CREATE TABLE Test (
OrderID int,
Person varchar(10),
LastModified Date
);
INSERT INTO Test (OrderID, Person, LastModified)
VALUES (1, 'Sam', '2018-05-15'),
(1, 'Tim','2018-05-14'),
(1, 'Kim','2018-05-05'),
(1, 'Dave','2018-05-13'),
(1, 'James','2018-05-11'),
(1, 'Fred','2018-05-05');
select * result:
| OrderID | Person | LastModified |
|---------|--------|--------------|
| 1 | Sam | 2018-05-15 |
| 1 | Tim | 2018-05-14 |
| 1 | Kim | 2018-05-05 |
| 1 | Dave | 2018-05-13 |
| 1 | James | 2018-05-11 |
| 1 | Fred | 2018-05-05 |
I am looking to return the most recent modified row which is the first row with 'Sam'.
Now I know I can use MAX to return the most recent date, but how can I aggregate the Person column to return Sam?
Looking for a result set like
| OrderID | Person | LastModified |
|---------|--------|--------------|
| 1 | Sam | 2018-05-15 |
I ran this:
SELECT
OrderID,
max(Person) AS [Person],
max(LastModified) AS [LastModified]
FROM Test
GROUP BY
OrderID
but this returns:
| OrderID | Person | LastModified |
|---------|--------|--------------|
| 1 | Tim | 2018-05-15 |
Can someone advise me further please? Thanks.
*** UPDATE
INSERT INTO Test (OrderID, Person, LastModified)
VALUES (1, 'Sam', '2018-05-15'),
(1, 'Tim','2018-05-14'),
(1, 'Kim','2018-05-05'),
(1, 'Dave','2018-05-13'),
(1, 'James','2018-05-11'),
(1, 'Fred','2018-05-05'),
(2, 'Dave','2018-05-13'),
(2, 'James','2018-05-11'),
(2, 'Fred','2018-05-05');
So i would be looking for this result to be:
| OrderID | Person | LastModified |
|---------|--------|--------------|
| 1 | Sam | 2018-05-15 |
| 2 | Dave | 2018-05-13 |
If you always want just one record (the latest modified one) per OrderID then this would do it:
SELECT
t2.OrderID
, t2.Person
, t2.LastModified
FROM (
SELECT
MAX( LastModified ) AS LastModified
, OrderID
FROM
Test
GROUP BY
OrderID
) t
INNER JOIN Test t2
ON t2.LastModified = t.LastModified
AND t2.OrderID = t.OrderID
Expanding on your comment ("thanks very much, is there a way i can do this if there is more than one orderID e.g. multiple people and lastmodified for multiple orderID's?"), in xcvd's answer, I assume what you therefore want is this:
WITH CTE AS(
SELECT OrderId,
Person,
LastModified,
ROW_NUMBER() OVER (PARTITION BY OrderID ORDER BY LastModified DESC) AS RN
FROM YourTable)
SELECT OrderID,
Person,
LastModified
FROM CTE
WHERE RN = 1;
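The ROW_NUMBER() approach can be checked against the updated sample data. A minimal sketch in Python with sqlite3 follows; it assumes an SQLite build with window functions (3.25 or later) in place of SQL Server, and stores the dates as ISO-8601 text so that they sort correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Test (OrderID INTEGER, Person TEXT, LastModified TEXT)")
conn.executemany(
    "INSERT INTO Test VALUES (?, ?, ?)",
    [
        (1, "Sam", "2018-05-15"), (1, "Tim", "2018-05-14"),
        (1, "Kim", "2018-05-05"), (1, "Dave", "2018-05-13"),
        (1, "James", "2018-05-11"), (1, "Fred", "2018-05-05"),
        (2, "Dave", "2018-05-13"), (2, "James", "2018-05-11"),
        (2, "Fred", "2018-05-05"),
    ],
)

# Number the rows of each OrderID from newest to oldest, then keep row 1.
query = """
WITH CTE AS (
    SELECT OrderID, Person, LastModified,
           ROW_NUMBER() OVER (PARTITION BY OrderID
                              ORDER BY LastModified DESC) AS RN
    FROM Test
)
SELECT OrderID, Person, LastModified
FROM CTE
WHERE RN = 1
ORDER BY OrderID
"""
latest = conn.execute(query).fetchall()
print(latest)
```

If two rows of the same OrderID share the latest date, ROW_NUMBER() still returns only one of them, and which one is arbitrary unless a tie-breaker column is added to the ORDER BY.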
How about just using TOP (1) and ORDER BY?
SELECT TOP (1) t.*
FROM Test t
ORDER BY LastModified DESC;
If you want this for each orderid, then this is a handy method in SQL Server:
SELECT TOP (1) WITH TIES t.*
FROM Test t
ORDER BY ROW_NUMBER() OVER (PARTITION BY OrderId ORDER BY LastModified DESC);
xcvd's answer is perfect for this. I would just like to add another solution, to show a method that can be used in more complex situations than this one. It uses a nested query (subquery) to find MAX(LastModified) regardless of any other field, and then uses that result in the outer query's WHERE clause to return the rows that match it. Note that the subquery is not correlated on OrderID, so it returns only the rows carrying the latest date in the whole table. Cheers.
SELECT OrderID
, Person
, LastModified
FROM Test
WHERE LastModified IN (SELECT MAX(LastModified)
FROM Test)
Here is one other method :
select t.*
from Test t
where LastModified = (select max(t1.LastModified) from Test t1 where t1.OrderID = t.OrderID);

SQL delete almost identical rows

I have a table that has 5 columns, and instead of doing an update I did an insert of all rows (stupid mistake). How do I get rid of the duplicated records? They are identical except for the id. I can't remove all records, but I want to delete half of them.
ex. table:
+-----+-------+--------+-------+
| id | name | name2 | user |
+-----+-------+--------+-------+
| 1 | nameA | name2A | u1 |
| 12 | nameA | name2A | u1 |
| 2 | nameB | name2B | u2 |
| 192 | nameB | name2B | u2 |
+-----+-------+--------+-------+
How to do this?
I'm using Microsoft Sql Server.
Try the following.
DELETE
FROM MyTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyTable
GROUP BY Name, Name2, [User])
That is untested so may need adapting. The following video will provide you with some more information about this query.
Video
This is a more specific query than @TechDo's, as it finds duplicates where name, name2 and user are all identical, not only name.
with duplicates as
(
select t.id, ROW_NUMBER() over (partition by t.name, t.name2, t.[user] order by t.id) as RowNumber
from YourTable t
)
delete duplicates
where RowNumber > 1
SQLFiddle demo to try it yourself: DEMO
Please try:
with c as
(
select
*, row_number() over(partition by name, name2, [user] order by id) as n
from YourTable
)
delete from c
where n > 1;
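The same keep-the-first-row idea can be tried on the example table. The sketch below uses Python with sqlite3. SQLite cannot delete through a CTE the way SQL Server can, so the numbered rows are fed to DELETE through an IN subquery instead; the table name items, the column name usr and the alias numbered are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT, name2 TEXT, usr TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        (1, "nameA", "name2A", "u1"),
        (12, "nameA", "name2A", "u1"),
        (2, "nameB", "name2B", "u2"),
        (192, "nameB", "name2B", "u2"),
    ],
)

# Number the rows inside each group of identical (name, name2, usr) values,
# lowest id first, and delete every row after the first one.
conn.execute("""
DELETE FROM items
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY name, name2, usr
                                  ORDER BY id) AS rn
        FROM items
    ) AS numbered
    WHERE rn > 1
)
""")
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM items ORDER BY id")]
print(remaining_ids)
```

Ordering by id ascending keeps the original row of each pair; ordering by id DESC would keep the newer copy instead.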

SQL Order By and "Not-So-Much Group"

Let's say I have a table:
--------------------------------------
| ID | DATE | GROUP | RESULT |
--------------------------------------
| 1 | 01/06 | Group1 | 12345 |
| 2 | 01/05 | Group2 | 54321 |
| 3 | 01/04 | Group1 | 11111 |
--------------------------------------
I want to order the result by the most recent date at the top but group the "group" column together, but still have distinct entries. The result that I want would be:
1 | 01/06 | Group1 | 12345
3 | 01/04 | Group1 | 11111
2 | 01/05 | Group2 | 54321
What would be a query to get that result?
thank you!
EDIT:
I'm using MSSQL. I'll look into translating the oracle query into MS SQL and report my results.
EDIT
SQL Server 2000, so OVER/PARTITION is not supported =[
Thank you!
You should specify what RDBMS you are using. This answer is for Oracle, may not work in other systems.
SELECT * FROM table
ORDER BY MAX(date) OVER (PARTITION BY group) DESC, group, date DESC
declare @table table (
ID int not null,
[DATE] smalldatetime not null,
[GROUP] varchar(10) not null,
[RESULT] varchar(10) not null
)
insert @table values (1, '2009-01-06', 'Group1', '12345')
insert @table values (2, '2009-01-05', 'Group2', '12345')
insert @table values (3, '2009-01-04', 'Group1', '12345')
select t.*
from @table t
inner join (
select
max([date]) as [order-date],
[GROUP]
from @table orderer
group by
[GROUP]
) x
on t.[GROUP] = x.[GROUP]
order by
x.[order-date] desc,
t.[GROUP],
t.[DATE] desc
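To see how ordering by each group's latest date differs from simply ordering by group name, here is a minimal sketch in Python with sqlite3. The table name results, the column names day and grp, and the extra fourth row (Group2 on 2009-01-07) are hypothetical additions for illustration. The extra row is there only so that the two orderings differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER, day TEXT, grp TEXT, result TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [
        (1, "2009-01-06", "Group1", "12345"),
        (2, "2009-01-05", "Group2", "54321"),
        (3, "2009-01-04", "Group1", "11111"),
        (4, "2009-01-07", "Group2", "99999"),  # hypothetical extra row
    ],
)

# Groups are ordered by their most recent date; rows inside a group are
# ordered newest first. No window functions are needed.
by_latest = [row[0] for row in conn.execute("""
SELECT t.id
FROM results t
JOIN (SELECT grp, MAX(day) AS latest
      FROM results
      GROUP BY grp) x
  ON x.grp = t.grp
ORDER BY x.latest DESC, t.grp, t.day DESC
""")]

# Ordering by group name alone ignores which group was updated most recently.
by_name = [row[0] for row in conn.execute(
    "SELECT id FROM results ORDER BY grp, day DESC"
)]

print(by_latest)
print(by_name)
```

With the extra row, Group2 holds the newest date, so the join-based query lists Group2 first while the plain ORDER BY still lists Group1 first.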
Use an ORDER BY clause with two columns:
...order by group, date desc
This assumes that your date column holds dates and not varchars.
SELECT table2.myID,
table2.mydate,
table2.mygroup,
table2.myresult
FROM (SELECT DISTINCT mygroup FROM testtable as table1) as grouptable
JOIN testtable as table2
ON grouptable.mygroup = table2.mygroup
ORDER BY grouptable.mygroup,table2.mydate
Sorry, I could not bring myself to use column names that are reserved words; rename the columns to make it work :)
This is much simpler than the accepted answer, by the way.