Finding only rows with non-duplicated values within a window partition


I want to look at why some descriptions are different for the same permit id. Here's the table (I'm using Snowflake):
create or replace table permits (permit varchar(255), description varchar(255));
// dupe permits, dupe descriptions, throw out
INSERT INTO permits VALUES ('1', 'abc');
INSERT INTO permits VALUES ('1', 'abc');
// dupe permits, unique descriptions, keep
INSERT INTO permits VALUES ('2', 'def1');
INSERT INTO permits VALUES ('2', 'def2');
INSERT INTO permits VALUES ('2', 'def3');
// dupe permits, unique descriptions, keep
INSERT INTO permits VALUES ('3', NULL);
INSERT INTO permits VALUES ('3', 'ghi1');
// unique permit, throw out
INSERT INTO permits VALUES ('5', 'xyz');
What I want is to query this table and get out only the sets of rows that have duplicate permit ids but different descriptions.
The output I want is this:
+---------+-------------+
| PERMIT | DESCRIPTION |
+---------+-------------+
| 2 | def1 |
| 2 | def2 |
| 2 | def3 |
| 3 | |
| 3 | ghi1 |
+---------+-------------+
I've tried this:
with with_dupe_counts as (
select
count(permit) over (partition by permit order by permit) as permit_dupecount,
count(description) over (partition by permit order by permit) as description_dupecount,
permit,
description
from permits
)
select *
from with_dupe_counts
where permit_dupecount > 1
and description_dupecount > 1
Which gives me permits 1 and 2 and counts descriptions whether they are unique or not:
+------------------+-----------------------+--------+-------------+
| PERMIT_DUPECOUNT | DESCRIPTION_DUPECOUNT | PERMIT | DESCRIPTION |
+------------------+-----------------------+--------+-------------+
| 2 | 2 | 1 | abc |
| 2 | 2 | 1 | abc |
| 3 | 3 | 2 | def1 |
| 3 | 3 | 2 | def2 |
| 3 | 3 | 2 | def3 |
+------------------+-----------------------+--------+-------------+
What I think would work would be
count(unique description) over (partition by permit order by permit) as description_dupecount
But as I'm realizing there are lots of things that don't work in window functions. This question isn't necessarily "how do I get count(unique x) to work in a window function" because I don't know if that is the best way to solve this.
A simple group by I don't think will work because I want to get the original rows back.

One method uses min() and max() and count():
select *
from (select p.*,
min(description) over (partition by permit) as min_d,
max(description) over (partition by permit) as max_d,
count(description) over (partition by permit) as cnt_d,
count(*) over (partition by permit) as cnt,
count(permit) over (partition by permit order by permit) as permit_dupecount
from permits p
)
where min_d <> max_d or cnt_d <> cnt;

I would just use exists:
select p.*
from permits p
where exists (
select 1
from permits p1
where p1.permit = p.permit and p1.description <> p.description
)
To handle the null values, we can use the standard null-safe comparison operator IS DISTINCT FROM, which Snowflake supports:
select p.*
from permits p
where exists (
select 1
from permits p1
where
p1.permit = p.permit
and p1.description is distinct from p.description
)
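For anyone who wants to try the null-safe EXISTS approach without a Snowflake account, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in (my assumption, not something from the question), and its IS NOT operator is used in place of IS DISTINCT FROM because it also treats NULL as a comparable value.

```python
import sqlite3

# In-memory SQLite database standing in for Snowflake (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (permit TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [
        ("1", "abc"), ("1", "abc"),                      # same description twice
        ("2", "def1"), ("2", "def2"), ("2", "def3"),     # differing descriptions
        ("3", None), ("3", "ghi1"),                      # NULL vs. a value
        ("5", "xyz"),                                    # single row
    ],
)

# Keep a row if another row with the same permit has a different description.
# "IS NOT" is SQLite's null-safe inequality, playing the role of IS DISTINCT FROM.
rows = conn.execute(
    """
    SELECT p.permit, p.description
    FROM permits p
    WHERE EXISTS (
        SELECT 1
        FROM permits p1
        WHERE p1.permit = p.permit
          AND p1.description IS NOT p.description
    )
    ORDER BY p.permit, p.description
    """
).fetchall()

for row in rows:
    print(row)
```

With the sample data from the question this keeps the rows for permits 2 and 3 only, including the NULL row.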

Should work
SELECT DISTINCT p1.permit, p1.description
FROM permits p1
JOIN permits p2 ON p1.permit = p2.permit
WHERE p1.description != p2.description OR (p1.description IS NULL AND p2.description IS NOT NULL) OR (p1.description IS NOT NULL AND p2.description IS NULL)

This is my go to:
with x as (
select permit, count(distinct description) cnt
from permits p1
group by permit
having cnt > 1
)
select p.*
from x
join permits p
on x.permit = p.permit;
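One thing to check with this approach: COUNT(DISTINCT description) ignores NULLs, so permit 3 (NULL and 'ghi1') counts as a single distinct description and gets filtered out. Below is a sketch of one way to count NULL as its own value. It runs against SQLite through Python's sqlite3 module, which is my assumption for illustration; the CASE expression is plain SQL, but I have not run this on Snowflake.

```python
import sqlite3

# In-memory SQLite database standing in for Snowflake (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (permit TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO permits VALUES (?, ?)",
    [
        ("1", "abc"), ("1", "abc"),
        ("2", "def1"), ("2", "def2"), ("2", "def3"),
        ("3", None), ("3", "ghi1"),
        ("5", "xyz"),
    ],
)

# Plain COUNT(DISTINCT ...) skips NULLs, so only permit 2 qualifies.
plain_permits = [
    r[0]
    for r in conn.execute(
        """
        SELECT permit
        FROM permits
        GROUP BY permit
        HAVING COUNT(DISTINCT description) > 1
        ORDER BY permit
        """
    )
]

# Null-aware variant: add 1 when the group contains at least one NULL.
null_aware_rows = conn.execute(
    """
    SELECT p.permit, p.description
    FROM permits p
    JOIN (
        SELECT permit
        FROM permits
        GROUP BY permit
        HAVING COUNT(DISTINCT description)
               + MAX(CASE WHEN description IS NULL THEN 1 ELSE 0 END) > 1
    ) x ON x.permit = p.permit
    ORDER BY p.permit, p.description
    """
).fetchall()

print(plain_permits)
print(null_aware_rows)
```

The join back to permits is what returns the original rows rather than one row per group.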

Related

how to insert row without duplicate values in two differents table ( header and detail)

I have the following scenario: I need to insert a header and its details, which live in two different tables.
The order field should be inserted into the header once, without duplicate values, and each detail row gets its id from the header.
The data received by CSV is:
order, number, qty, price
-------------------------
1000,a1000,1,2.0
1000,a1001,2,3.0
1001,a1000,1,3.0
1001,a1001,1,3.0
1001,a1000,1,3.0
I have the following query in pgsql:
This query does not work; it duplicates the records. How can I solve this problem?
INSERT INTO public.HeaderTable ( order )
SELECT
order
FROM
public.HeaderTable
WHERE
NOT EXISTS (
SELECT
idpo
FROM
public.HeaderTable
WHERE
order = '1000'
)
LIMIT 1;
For this second query, I don't know how to insert the detail rows: each one should pick up the header id, and a row should be inserted only if it does not already exist.
INSERT INTO public.DetailsTable ( idh, product, qty )
SELECT
order
FROM
public.HeaderTable
WHERE
NOT EXISTS (
SELECT
idpo
FROM
public.HeaderTable
WHERE
order = '1000'
) LIMIT 1;
expected result:
note: this is the expected result insert
HeaderTable:
id | order
------------
1 | 1000
2 | 1001
DetailsTable:
id | idh | product | qty
----------------------------
1 | 1 | a1000 | 2.0
2 | 1 | a1001 | 3.0
3 | 2 | a1000 | 3.0
4 | 2 | a1001 | 3.0

Returning most recent row SQL Server

I have this table
CREATE TABLE Test (
OrderID int,
Person varchar(10),
LastModified Date
);
INSERT INTO Test (OrderID, Person, LastModified)
VALUES (1, 'Sam', '2018-05-15'),
(1, 'Tim','2018-05-14'),
(1, 'Kim','2018-05-05'),
(1, 'Dave','2018-05-13'),
(1, 'James','2018-05-11'),
(1, 'Fred','2018-05-05');
select * result:
| OrderID | Person | LastModified |
|---------|--------|--------------|
| 1 | Sam | 2018-05-15 |
| 1 | Tim | 2018-05-14 |
| 1 | Kim | 2018-05-05 |
| 1 | Dave | 2018-05-13 |
| 1 | James | 2018-05-11 |
| 1 | Fred | 2018-05-05 |
I am looking to return the most recently modified row, which is the first row with 'Sam'.
Now I know I can use max to return the most recent date, but how can I aggregate the person column to return Sam?
Looking for a result set like
| OrderID | Person | LastModified |
|---------|--------|--------------|
| 1 | Sam | 2018-05-15 |
I ran this:
SELECT
OrderID,
max(Person) AS [Person],
max(LastModified) AS [LastModified]
FROM Test
GROUP BY
OrderID
but this returns:
| OrderID | Person | LastModified |
|---------|--------|--------------|
| 1 | Tim | 2018-05-15 |
Can someone advise me further please? Thanks
*** UPDATE
INSERT INTO Test (OrderID, Person, LastModified)
VALUES (1, 'Sam', '2018-05-15'),
(1, 'Tim','2018-05-14'),
(1, 'Kim','2018-05-05'),
(1, 'Dave','2018-05-13'),
(1, 'James','2018-05-11'),
(1, 'Fred','2018-05-05'),
(2, 'Dave','2018-05-13'),
(2, 'James','2018-05-11'),
(2, 'Fred','2018-05-05');
So i would be looking for this result to be:
| OrderID | Person | LastModified |
|---------|--------|--------------|
| 1 | Sam | 2018-05-15 |
| 2 | Dave | 2018-05-13 |

If you always want just one record (the latest modified one) per OrderID then this would do it:
SELECT
t2.OrderID
, t2.Person
, t2.LastModified
FROM (
SELECT
MAX( LastModified ) AS LastModified
, OrderID
FROM
Test
GROUP BY
OrderID
) t
INNER JOIN Test t2
ON t2.LastModified = t.LastModified
AND t2.OrderID = t.OrderID

Expanding on your comment ("thanks very much, is there a way i can do this if there is more than one orderID e.g. multiple people and lastmodified for multiple orderID's?"), in xcvd's answer, I assume what you therefore want is this:
WITH CTE AS(
SELECT OrderId,
Person,
LastModified,
ROW_NUMBER() OVER (PARTITION BY OrderID ORDER BY LastModified DESC) AS RN
FROM YourTable)
SELECT OrderID,
Person,
LastModified
FROM CTE
WHERE RN = 1;

How about just using TOP (1) and ORDER BY?
SELECT TOP (1) t.*
FROM Test t
ORDER BY LastModified DESC;
If you want this for each orderid, then this is a handy method in SQL Server:
SELECT TOP (1) WITH TIES t.*
FROM Test t
ORDER BY ROW_NUMBER() OVER (PARTITION BY OrderId ORDER BY LastModified DESC);

"xcvd's" answer is perfect for this, I would just like to add another solution that can be used here for the sake of showing you a method that can be used in more complex situations than this. This solution uses a nested query (sub-query) to find the MAX(LastModified) regardless of any other field and it will use the result in the original query's WHERE clause to find any results that meet the new criteria. Cheers.
SELECT OrderID
, Person
, LastModified
FROM Test
WHERE LastModified IN (SELECT MAX(LastModified)
FROM Test)

Here is one other method :
select t.*
from Test t
where LastModified = (select max(t1.LastModified) from Test t1 where t1.OrderID = t.OrderID);
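To see how the uncorrelated MAX subquery and the correlated one differ once there is more than one OrderID, here is a small sketch using Python's sqlite3 module with the data from the question's update. SQLite and the text date column are my assumptions for illustration (ISO dates stored as text still compare in date order); the queries themselves are the ones from the answers.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Test (OrderID INT, Person TEXT, LastModified TEXT)")
conn.executemany(
    "INSERT INTO Test VALUES (?, ?, ?)",
    [
        (1, "Sam", "2018-05-15"), (1, "Tim", "2018-05-14"), (1, "Kim", "2018-05-05"),
        (1, "Dave", "2018-05-13"), (1, "James", "2018-05-11"), (1, "Fred", "2018-05-05"),
        (2, "Dave", "2018-05-13"), (2, "James", "2018-05-11"), (2, "Fred", "2018-05-05"),
    ],
)

# Uncorrelated subquery: compares every row against the single latest date in the table.
global_latest = conn.execute(
    """
    SELECT OrderID, Person, LastModified
    FROM Test
    WHERE LastModified IN (SELECT MAX(LastModified) FROM Test)
    ORDER BY OrderID
    """
).fetchall()

# Correlated subquery: compares every row against the latest date of its own OrderID.
latest_per_order = conn.execute(
    """
    SELECT t.OrderID, t.Person, t.LastModified
    FROM Test t
    WHERE t.LastModified = (
        SELECT MAX(t1.LastModified) FROM Test t1 WHERE t1.OrderID = t.OrderID
    )
    ORDER BY t.OrderID
    """
).fetchall()

print(global_latest)
print(latest_per_order)
```

Only the correlated form returns a row for OrderID 2. Note that it would return several rows for an order if two people shared the latest date.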

How to copy rows into a new a one to many relationship

I'm trying to copy a set of data in a one to many relationship to create a new set of the same data in a new, but unrelated one to many relationship. Let's call them groups and items. Groups have a 1-* relation with items - one group has many items.
I've tried to create a CTE to do this, however I can't get the items inserted (in y) as the newly inserted groups don't have any items associated with them yet. I think I need to be able to access old. and new. like you would in a trigger, but I can't work out how to do this.
I think I could solve this by introducing a previous parent id into the templateitem table, or maybe a temp table with the data required to enable me to join on that, but I was wondering if it is possible to solve it this way?
SQL Fiddle Keeps Breaking on me, so I've put the code here as well:
DROP TABLE IF EXISTS meta.templateitem;
DROP TABLE IF EXISTS meta.templategroup;
CREATE TABLE meta.templategroup (
templategroup_id serial PRIMARY KEY,
groupname text,
roworder int
);
CREATE TABLE meta.templateitem (
templateitem_id serial PRIMARY KEY,
itemname text,
templategroup_id INTEGER NOT NULL REFERENCES meta.templategroup(templategroup_id)
);
INSERT INTO meta.templategroup (groupname, roworder) values ('Group1', 1), ('Group2', 2);
INSERT INTO meta.templateitem (itemname, templategroup_id) values ('Item1A',1), ('Item1B',1), ('Item2A',2);
WITH
x AS (
INSERT INTO meta.templategroup (groupname, roworder)
SELECT distinct groupname || '_v1' FROM meta.templategroup where templategroup_id in (1,2)
RETURNING groupname, templategroup_id, roworder
),
y AS (
Insert INTO meta.templateitem (itemname, templategroup_id)
Select itemname, x.templategroup_id
From meta.templateitem i
INNER JOIN x on x.templategroup_id = i.templategroup_id
RETURNING *
)
SELECT * FROM y;

Use an auxiliary column templategroup.old_id:
ALTER TABLE meta.templategroup ADD old_id int;
WITH x AS (
INSERT INTO meta.templategroup (groupname, roworder, old_id)
SELECT DISTINCT groupname || '_v1', roworder, templategroup_id
FROM meta.templategroup
WHERE templategroup_id IN (1,2)
RETURNING templategroup_id, old_id
),
y AS (
INSERT INTO meta.templateitem (itemname, templategroup_id)
SELECT itemname, x.templategroup_id
FROM meta.templateitem i
INNER JOIN x ON x.old_id = i.templategroup_id
RETURNING *
)
SELECT * FROM y;
templateitem_id | itemname | templategroup_id
-----------------+----------+------------------
4 | Item1A | 3
5 | Item1B | 3
6 | Item2A | 4
(3 rows)
It's impossible to do that in a single plain sql query without an additional column. You have to store the old ids somewhere. As an alternative you can use plpgsql and anonymous code block:
Before:
select *
from meta.templategroup
join meta.templateitem using (templategroup_id);
templategroup_id | groupname | roworder | templateitem_id | itemname
------------------+-----------+----------+-----------------+----------
1 | Group1 | 1 | 1 | Item1A
1 | Group1 | 1 | 2 | Item1B
2 | Group2 | 2 | 3 | Item2A
(3 rows)
Insert:
do $$
declare
grp record;
begin
for grp in
select distinct groupname || '_v1' groupname, roworder, templategroup_id
from meta.templategroup
where templategroup_id in (1,2)
loop
with insert_group as (
insert into meta.templategroup (groupname, roworder)
values (grp.groupname, grp.roworder)
returning templategroup_id
)
insert into meta.templateitem (itemname, templategroup_id)
select itemname || '_v1', g.templategroup_id
from meta.templateitem i
join insert_group g on grp.templategroup_id = i.templategroup_id;
end loop;
end $$;
After:
select *
from meta.templategroup
join meta.templateitem using (templategroup_id);
templategroup_id | groupname | roworder | templateitem_id | itemname
------------------+-----------+----------+-----------------+-----------
1 | Group1 | 1 | 1 | Item1A
1 | Group1 | 1 | 2 | Item1B
2 | Group2 | 2 | 3 | Item2A
3 | Group1_v1 | 1 | 4 | Item1A_v1
3 | Group1_v1 | 1 | 5 | Item1B_v1
4 | Group2_v1 | 2 | 6 | Item2A_v1
(6 rows)
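If the copy is driven from application code rather than plpgsql, the "store the old ids somewhere" idea can be as simple as a dictionary from old group id to new group id. The sketch below assumes Python's sqlite3 module and SQLite tables without the meta schema, purely for illustration; the `old_to_new` name is hypothetical. It mirrors the loop above: insert one group, remember its new id, then copy that group's items.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE templategroup (
        templategroup_id INTEGER PRIMARY KEY,
        groupname TEXT,
        roworder INT
    );
    CREATE TABLE templateitem (
        templateitem_id INTEGER PRIMARY KEY,
        itemname TEXT,
        templategroup_id INTEGER NOT NULL REFERENCES templategroup(templategroup_id)
    );
    INSERT INTO templategroup (groupname, roworder) VALUES ('Group1', 1), ('Group2', 2);
    INSERT INTO templateitem (itemname, templategroup_id)
    VALUES ('Item1A', 1), ('Item1B', 1), ('Item2A', 2);
    """
)

# Step 1: copy each group and remember which new id replaces which old id.
old_to_new = {}
groups = conn.execute(
    "SELECT templategroup_id, groupname, roworder FROM templategroup "
    "WHERE templategroup_id IN (1, 2) ORDER BY templategroup_id"
).fetchall()
for old_id, groupname, roworder in groups:
    cur = conn.execute(
        "INSERT INTO templategroup (groupname, roworder) VALUES (?, ?)",
        (groupname + "_v1", roworder),
    )
    old_to_new[old_id] = cur.lastrowid

# Step 2: copy the items of each old group, pointing them at the new group.
for old_id, new_id in old_to_new.items():
    conn.execute(
        "INSERT INTO templateitem (itemname, templategroup_id) "
        "SELECT itemname, ? FROM templateitem WHERE templategroup_id = ?",
        (new_id, old_id),
    )

copied_items = conn.execute(
    "SELECT templateitem_id, itemname, templategroup_id FROM templateitem "
    "WHERE templateitem_id > 3 ORDER BY templateitem_id"
).fetchall()
print(old_to_new)
print(copied_items)
```

The mapping lives only in memory, so no extra column is needed, at the cost of one round trip per group.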

SQL - find all instances where two columns are the same

So I have a simple table that holds comments from a user that pertain to a specific blog post.
id | user | post_id | comment
----------------------------------------------------------
0 | john#test.com | 1001 | great article
1 | bob#test.com | 1001 | nice post
2 | john#test.com | 1002 | I agree
3 | john#test.com | 1001 | thats cool
4 | bob#test.com | 1002 | thanks for sharing
5 | bob#test.com | 1002 | really helpful
6 | steve#test.com | 1001 | spam post about pills
I want to get all instances where a user commented on the same post twice (meaning same user and same post_id). In this case I would return:
id | user | post_id | comment
----------------------------------------------------------
0 | john#test.com | 1001 | great article
3 | john#test.com | 1001 | thats cool
4 | bob#test.com | 1002 | thanks for sharing
5 | bob#test.com | 1002 | really helpful
I thought DISTINCT was what I needed but that just gives me unique rows.

You can use GROUP BY and HAVING to find pairs of user and post_id that have multiple entries:
SELECT a.*
FROM table_name a
JOIN (SELECT user, post_id
FROM table_name
GROUP BY user, post_id
HAVING COUNT(id) > 1
) b
ON a.user = b.user
AND a.post_id = b.post_id
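Here is a quick way to check the GROUP BY / HAVING join against the sample comments, sketched with Python's sqlite3 module. SQLite, the `comments` table name and the `usr` column name (used instead of `user`, which is reserved in some databases) are my assumptions for illustration.

```python
import sqlite3

# In-memory SQLite database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INT, usr TEXT, post_id INT, comment TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [
        (0, "john#test.com", 1001, "great article"),
        (1, "bob#test.com", 1001, "nice post"),
        (2, "john#test.com", 1002, "I agree"),
        (3, "john#test.com", 1001, "thats cool"),
        (4, "bob#test.com", 1002, "thanks for sharing"),
        (5, "bob#test.com", 1002, "really helpful"),
        (6, "steve#test.com", 1001, "spam post about pills"),
    ],
)

# The subquery finds (usr, post_id) pairs with more than one comment;
# joining back to the table returns the full rows for those pairs.
repeat_comments = conn.execute(
    """
    SELECT a.id, a.usr, a.post_id, a.comment
    FROM comments a
    JOIN (
        SELECT usr, post_id
        FROM comments
        GROUP BY usr, post_id
        HAVING COUNT(id) > 1
    ) b ON a.usr = b.usr AND a.post_id = b.post_id
    ORDER BY a.id
    """
).fetchall()

for row in repeat_comments:
    print(row)
```

Each qualifying row comes back exactly once, however many comments the user left on the post, because the subquery yields one row per pair.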

DISTINCT removes all duplicate rows, which is why you're getting unique rows.
You can try using a CROSS JOIN (available as of Hive 0.10 according to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins):
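SELECT DISTINCT mt.*
FROM MYTABLE mt
CROSS JOIN MYTABLE mt2
WHERE mt.user = mt2.user
AND mt.post_id = mt2.post_id
AND mt.id <> mt2.id
The id comparison keeps a row from matching only itself, and DISTINCT collapses the repeats when a user has three or more comments on one post. The performance might not be the best though. If you want the result sorted, use SORT BY or ORDER BY.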

DECLARE @MyTable TABLE (id int, usr varchar(50), post_id int, comment varchar(50))
INSERT @MyTable (id, usr, post_id, comment) VALUES (0,'john#test.com',1001,'great article')
INSERT @MyTable (id, usr, post_id, comment) VALUES (1,'bob#test.com',1001,'nice post')
INSERT @MyTable (id, usr, post_id, comment) VALUES (3,'john#test.com',1002,'I agree')
INSERT @MyTable (id, usr, post_id, comment) VALUES (4,'john#test.com',1001,'thats cool')
INSERT @MyTable (id, usr, post_id, comment) VALUES (5,'bob#test.com',1002,'thanks for sharing')
INSERT @MyTable (id, usr, post_id, comment) VALUES (6,'bob#test.com',1002,'really helpful')
INSERT @MyTable (id, usr, post_id, comment) VALUES (7,'steve#test.com',1001,'spam post about pills')
SELECT
T1.id,
T1.usr,
T1.post_id,
T1.comment
FROM
@MyTable T1
INNER JOIN @MyTable T2
ON T1.usr = T2.usr AND T1.post_id = T2.post_id
GROUP BY
T1.id,
T1.usr,
T1.post_id,
T1.comment
HAVING
Count(T2.id) > 1

Selecting row with highest ID based on another column

In SQL Server 2008 R2, suppose I have a table layout like this...
+----------+---------+-------------+
| UniqueID | GroupID | Title |
+----------+---------+-------------+
| 1 | 1 | TEST 1 |
| 2 | 1 | TEST 2 |
| 3 | 3 | TEST 3 |
| 4 | 3 | TEST 4 |
| 5 | 5 | TEST 5 |
| 6 | 6 | TEST 6 |
| 7 | 6 | TEST 7 |
| 8 | 6 | TEST 8 |
+----------+---------+-------------+
Is it possible to select every row with the highest UniqueID number, for each GroupID. So according to the table above - if I ran the query, I would expect this...
+----------+---------+-------------+
| UniqueID | GroupID | Title |
+----------+---------+-------------+
| 2 | 1 | TEST 2 |
| 4 | 3 | TEST 4 |
| 5 | 5 | TEST 5 |
| 8 | 6 | TEST 8 |
+----------+---------+-------------+
Been chomping on this for a while, but can't seem to crack it.
Many thanks,

SELECT *
FROM (SELECT uniqueid, groupid, title,
Row_number()
OVER ( partition BY groupid ORDER BY uniqueid DESC) AS rn
FROM table) a
WHERE a.rn = 1

With SQL-Server as rdbms you can use a ranking function like ROW_NUMBER:
WITH CTE AS
(
SELECT UniqueID, GroupID, Title,
RN = ROW_NUMBER() OVER (PARTITION BY GroupID
ORDER BY UniqueID DESC)
FROM dbo.TableName
)
SELECT UniqueID, GroupID, Title
FROM CTE
WHERE RN = 1
This returns exactly one record for each GroupID even if there are multiple rows with the highest UniqueID (the column name suggests there won't be). If you want to return all of them in that case, use DENSE_RANK instead of ROW_NUMBER.
Here you can see all functions and how they work: http://technet.microsoft.com/en-us/library/ms189798.aspx

Since you have not mentioned any RDBMS, this statement below will work on almost all RDBMS. The purpose of the subquery is to get the greatest uniqueID for every GROUPID. To be able to get the other columns, the result of the subquery is joined on the original table.
SELECT a.*
FROM tableName a
INNER JOIN
(
SELECT GroupID, MAX(uniqueID) uniqueID
FROM tableName
GROUP By GroupID
) b ON a.GroupID = b.GroupID
AND a.uniqueID = b.uniqueID
In case your RDBMS supports analytic functions, you can use ROW_NUMBER():
SELECT uniqueid, groupid, title
FROM
(
SELECT uniqueid, groupid, title,
ROW_NUMBER() OVER (PARTITION BY groupid
ORDER BY uniqueid DESC) rn
FROM tableName
) x
WHERE x.rn = 1
TSQL Ranking Functions
ROW_NUMBER() generates a sequential number that you can filter on. In this case the numbering restarts for each groupid and follows uniqueid in descending order, so the greatest uniqueid in each group gets a value of 1 in rn.

SELECT *
FROM the_table tt
WHERE NOT EXISTS (
SELECT *
FROM the_table nx
WHERE nx.GroupID = tt.GroupID
AND nx.UniqueID > tt.UniqueID
)
;
Should work in any DBMS (no window functions or CTEs are needed)
is probably faster than a sub query with an aggregate
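The NOT EXISTS form is easy to try out on an engine without window functions. Below is a minimal sketch with Python's sqlite3 module and the sample rows from the question; SQLite is my assumption for illustration, and `the_table` is the placeholder name used in the query above.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (UniqueID INT, GroupID INT, Title TEXT)")
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [
        (1, 1, "TEST 1"), (2, 1, "TEST 2"),
        (3, 3, "TEST 3"), (4, 3, "TEST 4"),
        (5, 5, "TEST 5"),
        (6, 6, "TEST 6"), (7, 6, "TEST 7"), (8, 6, "TEST 8"),
    ],
)

# A row is kept when no other row in the same group has a higher UniqueID.
top_rows = conn.execute(
    """
    SELECT tt.UniqueID, tt.GroupID, tt.Title
    FROM the_table tt
    WHERE NOT EXISTS (
        SELECT 1
        FROM the_table nx
        WHERE nx.GroupID = tt.GroupID
          AND nx.UniqueID > tt.UniqueID
    )
    ORDER BY tt.GroupID
    """
).fetchall()

for row in top_rows:
    print(row)
```

An index on (GroupID, UniqueID) is the usual thing to try if this anti-join turns out slow on a large table; whether it beats the aggregate subquery depends on the engine and the data.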

Keeping it simple:
select * from test2
where UniqueID in (select max(UniqueID) from test2 group by GroupID)
Considering:
create table test2
(
UniqueID numeric,
GroupID numeric,
Title varchar(100)
)
insert into test2 values(1,1,'TEST 1')
insert into test2 values(2,1,'TEST 2')
insert into test2 values(3,3,'TEST 3')
insert into test2 values(4,3,'TEST 4')
insert into test2 values(5,5,'TEST 5')
insert into test2 values(6,6,'TEST 6')
insert into test2 values(7,6,'TEST 7')
insert into test2 values(8,6,'TEST 8')
