How to eliminate and show only non-duplicate records - sql

See the below table:
col1 col2
---- ----
1 | a
2 | b
3 | c
4 | a
5 | d
6 | b
7 | e
Now I want to show only the non-duplicate records, which means rows 3, 5, and 7.
How to write a query to get the result?

-- Each surviving group has exactly one row, so MIN(col1) is that row's col1.
SELECT MIN(col1) AS col1, col2
FROM table
GROUP BY col2
HAVING COUNT(*) = 1;

SELECT B.*
FROM
(
SELECT col2
FROM YOURTABLE
GROUP BY col2
HAVING COUNT(*)=1
) A,
YOURTABLE B
WHERE A.col2 = B.col2

SELECT COUNT(*) AS cnt, MIN(col1) AS col1, col2
FROM table
GROUP BY col2
HAVING COUNT(*) = 1;

I believe this is clear and correct enough:
SELECT *
FROM table
WHERE
col2 IN (SELECT col2 FROM table GROUP BY col2 HAVING COUNT(*) = 1)
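As a quick sanity check, the IN-subquery form can be run against the sample data from the question. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a throwaway test harness, and the table name records and the helper non_duplicate_rows are made up for illustration. Other engines may differ in details.

```python
import sqlite3

# Sample rows from the question: (col1, col2)
SAMPLE = [(1, "a"), (2, "b"), (3, "c"), (4, "a"), (5, "d"), (6, "b"), (7, "e")]


def non_duplicate_rows(rows):
    """Return the rows whose col2 value occurs exactly once."""
    conn = sqlite3.connect(":memory:")
    # "records" is a hypothetical table name used only for this sketch.
    conn.execute("CREATE TABLE records (col1 INTEGER, col2 TEXT)")
    conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT col1, col2
        FROM records
        WHERE col2 IN (SELECT col2
                       FROM records
                       GROUP BY col2
                       HAVING COUNT(*) = 1)
        ORDER BY col1
        """
    ).fetchall()
    conn.close()
    return result


print(non_duplicate_rows(SAMPLE))  # [(3, 'c'), (5, 'd'), (7, 'e')]
```

The inner query picks the col2 values that appear once, and the outer query fetches the full rows for those values.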

Related

How can I find groups with more than one row and list the rows in each such group?

I have a table "mytable" in a database.
Given a subset of the table's columns, I would like to group by that subset and find the groups with more than one row.
For example, if the table is
col1 col2 col3
1 1 1
1 1 2
1 2 1
2 2 1
2 2 3
2 1 1
I am interested in finding the groups by col1 and col2 with more than one row, which are:
col1 col2 col3
1 1 1
1 1 2
and
col1 col2 col3
2 2 1
2 2 3
I was wondering how to write a SQL query for that purpose?
Is the following the best way to do that?
First get the col1 and col2 values of such groups:
SELECT col1, col2, COUNT(*)
FROM mytable
GROUP BY col1, col2
HAVING COUNT(*) > 1
Then based on the output of the previous query, manually write a query for each group:
SELECT *
FROM mytable
WHERE col1 = val1 AND col2 = val2
If there are many such groups, then I will have to manually write many queries, which can be a disadvantage.
I am using SQL Server.
Thanks.
This is a common problem. One solution is to get the "keys" in a derived table and join to that to get the rows.
declare @test as table (col1 int, col2 int, col3 int)
insert into @test values (1,1,1),(1,1,2),(1,2,1),(2,2,1),(2,2,3),(2,1,1)
select t.*
from @test t
inner join (
select col1, col2
from @test
group by col1, col2
having count(*) > 1
) k
on k.col1 = t.col1 and k.col2 = t.col2
col1 col2 col3
----------- ----------- -----------
1 1 1
1 1 2
2 2 1
2 2 3
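If you don't have SQL Server at hand, the same derived-table join can be tried in SQLite. The sketch below is an assumption-laden test harness, not the answer's original code: it uses Python's built-in sqlite3 module, and the function name rows_in_duplicate_groups is invented for the example.

```python
import sqlite3

# Sample rows from the question: (col1, col2, col3)
SAMPLE = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 2, 1), (2, 2, 3), (2, 1, 1)]


def rows_in_duplicate_groups(rows):
    """Return every row whose (col1, col2) pair occurs more than once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT t.col1, t.col2, t.col3
        FROM mytable AS t
        INNER JOIN (SELECT col1, col2          -- the "keys" of the duplicate groups
                    FROM mytable
                    GROUP BY col1, col2
                    HAVING COUNT(*) > 1) AS k
            ON k.col1 = t.col1 AND k.col2 = t.col2
        ORDER BY t.col1, t.col2, t.col3
        """
    ).fetchall()
    conn.close()
    return result


print(rows_in_duplicate_groups(SAMPLE))
# [(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 3)]
```

One query returns the rows of every qualifying group, so there is no need to write a separate query per group.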
The window function sum() over() may help here.
Example
with cte as (
Select *
,Cnt = sum(1) over (partition by Col1,Col2)
From YourTable
)
Select *
From cte
Where Cnt>=2
Another option (less performant)
Select top 1 with ties *
From YourTable
Order By case when sum(1) over (partition by Col1,Col2) > 1 then 1 else 2 end
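The windowed-count idea can also be checked outside SQL Server. The following is a rough sketch using Python's built-in sqlite3 module, assuming an SQLite build with window-function support (3.25 or later). It uses COUNT(*) OVER, which yields the same number as sum(1) over, and the function name rows_with_group_count is hypothetical.

```python
import sqlite3

# Sample rows from the question: (col1, col2, col3)
SAMPLE = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 2, 1), (2, 2, 3), (2, 1, 1)]


def rows_with_group_count(rows):
    """Return rows from groups of two or more, each with its group size."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        WITH cte AS (
            SELECT col1, col2, col3,
                   COUNT(*) OVER (PARTITION BY col1, col2) AS cnt
            FROM mytable
        )
        SELECT col1, col2, col3, cnt
        FROM cte
        WHERE cnt >= 2
        ORDER BY col1, col2, col3
        """
    ).fetchall()
    conn.close()
    return result


print(rows_with_group_count(SAMPLE))
# [(1, 1, 1, 2), (1, 1, 2, 2), (2, 2, 1, 2), (2, 2, 3, 2)]
```

The window function attaches the group size to every row, so the table is read once and no self-join is needed.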

Unable to delete duplicate data from Netezza table

I am trying to delete duplicate records from a Netezza table, but a few columns contain null values, so the code below does not work.
DELETE FROM TABLE_NAME a
WHERE ROW_NUMBER() <> ( SELECT MIN( ROW_NUMBER() )
FROM TABLE_NAME b
WHERE a.COL1 = b.COL1
AND a.COL2 = b.COL2
AND a.COL3 = b.COL3);
Sample data:
COL1 COL2 COL3
X NULL Y
A NULL B
X NULL Y
X NULL Y
E VAL F
Expected result:
COL1 COL2 COL3
X NULL Y
A NULL B
E VAL F
Note: the COL2 column contains null values.
The table has 30 columns in total, and 6 of them contain null values in the duplicate records.
Can anyone please help me with this issue?
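-- Keep the row with the smallest rowid in each group of identical rows.
DELETE FROM TABLE_NAME a
WHERE a.rowid <> ( SELECT MIN( b.rowid )
FROM TABLE_NAME b
WHERE nvl(a.COL1,0) = nvl(b.COL1,0)
AND nvl(a.COL2,0) = nvl(b.COL2,0)
AND nvl(a.COL3,0) = nvl(b.COL3,0));
Replace the null values with 0 using the NVL function. The replacement value has to match the column's data type and must not occur in the real data.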
You can use the NVL function to translate nulls to something you can compare.
Edit: you commented that NVL doesn't work. Alternatively, you can rewrite the query to handle NULL explicitly.
For instance:
-- Two values match if they are equal or if both are NULL.
DELETE FROM TABLE_NAME a
WHERE a.rowid <> ( SELECT MIN( b.rowid )
FROM TABLE_NAME b
WHERE ((a.COL1 = b.COL1) or (a.COL1 is null and b.COL1 is null))
AND ((a.COL2 = b.COL2) or (a.COL2 is null and b.COL2 is null))
AND ((a.COL3 = b.COL3) or (a.COL3 is null and b.COL3 is null)));
Try using the /=/ operator instead of =
It usually works for me in these situations
For context, what are the distribution columns for the table, how many rows are in your table, and what percentage of those are you expecting to be duplicates? Depending on the scale a CTAS approach might be a better fit than a DELETE.
That being said, here's an approach that gets the delete logic right but might not be the best performer.
TESTDB.ADMIN(ADMIN)=> select * from table_name;
COL1 | COL2 | COL3
------+------+------
X | | Y
X | | Y
E | VAL | F
A | | B
X | | Y
(5 rows)
delete
from
table_name
where rowid in
( select
rowid
from
( select
rowid,
row_number() over (
partition by col1,
col2 ,
col3
order by
col1) rn
from
table_name
) foo
where rn > 1
) ;
DELETE 2
TESTDB.ADMIN(ADMIN)=> select * from table_name;
COL1 | COL2 | COL3
------+------+------
A | | B
X | | Y
E | VAL | F
(3 rows)
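The reason this version copes with NULLs is that PARTITION BY puts NULL values into the same partition, whereas a plain = comparison never matches them. Here is a small sketch of the same delete run in SQLite through Python's built-in sqlite3 module. It assumes SQLite 3.25 or later for window functions, relies on SQLite's own rowid rather than Netezza's, and the function name delete_duplicates is made up.

```python
import sqlite3

# Sample rows from the question; None stands for NULL.
SAMPLE = [
    ("X", None, "Y"),
    ("A", None, "B"),
    ("X", None, "Y"),
    ("X", None, "Y"),
    ("E", "VAL", "F"),
]


def delete_duplicates(rows):
    """Delete all but one copy of each identical row; return (deleted, remaining)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_name (col1 TEXT, col2 TEXT, col3 TEXT)")
    conn.executemany("INSERT INTO table_name VALUES (?, ?, ?)", rows)
    cur = conn.execute(
        """
        DELETE FROM table_name
        WHERE rowid IN (
            SELECT rid
            FROM (SELECT rowid AS rid,
                         ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                                            ORDER BY rowid) AS rn
                  FROM table_name)
            WHERE rn > 1      -- every copy after the first in its partition
        )
        """
    )
    deleted = cur.rowcount
    remaining = conn.execute(
        "SELECT col1, col2, col3 FROM table_name ORDER BY col1"
    ).fetchall()
    conn.close()
    return deleted, remaining


print(delete_duplicates(SAMPLE))
# (2, [('A', None, 'B'), ('E', 'VAL', 'F'), ('X', None, 'Y')])
```

With 30 columns, listing them all in PARTITION BY is still shorter than writing an "equal or both null" condition per column.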

Query to get previous value

I have a scenario where I need the previous value of a column, but it should not be the same as the current value.
Table A:
+------+------+-------------+
| Col1 | Col2 | Lead_Col2 |
+------+------+-------------+
| 1 | A | NULL |
| 2 | B | A |
| 3 | B | A |
| 4 | C | B |
| 5 | C | B |
| 6 | C | B |
| 7 | D | C |
+------+------+-------------+
As shown above, I need the previous Col2 value that is not the same as the current value.
Try:
select *
from (select col1,
col2,
lag(col2, 1) over(order by col1) as prev_col2
from table_a)
where col2 <> prev_col2
The name lead_col2 is misleading, because you really want a lag.
Here is a brute force method that uses a correlated subquery to get the index of the value and then joins the value in:
select aa.col1, aa.col2, a.col2 as prev_col2
from (select col1, col2,
(select max(a2.col1)
from a a2
where a2.col1 < a.col1 and a2.col2 <> a.col2
) as prev_col1
from a
) aa left join
a
on aa.prev_col1 = a.col1
EDIT:
You can also use logic with lead() and ignore nulls. If a value is the last in its sequence, then use that value; otherwise set it to NULL. Then use lag() with ignore nulls:
select col1, col2,
lag(col3) ignore nulls over (order by col1)
from (select col1, col2,
(case when col2 <> lead(col2) over (order by col1) then col2
end) as col3
from a
) a;
Try this:
select t.col1
,t.col2
,first_value(lag_col2) over (partition by col2 order by ord) lag_col2
from (select t.*
,case when lag_col2 = col2 then 1 else 0 end ord
from (select t.*
,lag (col2) over (order by col1) lag_col2
from table1 t
)t
)t
order by col1
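For engines without IGNORE NULLS, a correlated subquery is a workable fallback. The sketch below runs one such query in SQLite through Python's built-in sqlite3 module. It is only an illustration under assumptions: the table name table_a is taken from the first answer, the function name previous_different_value is invented, and ORDER BY ... LIMIT 1 is SQLite syntax that other engines spell differently.

```python
import sqlite3

# Sample rows from the question: (col1, col2)
SAMPLE = [(1, "A"), (2, "B"), (3, "B"), (4, "C"), (5, "C"), (6, "C"), (7, "D")]


def previous_different_value(rows):
    """For each row, find the nearest earlier col2 that differs from its own."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_a (col1 INTEGER, col2 TEXT)")
    conn.executemany("INSERT INTO table_a VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT cur.col1,
               cur.col2,
               (SELECT p.col2
                FROM table_a AS p
                WHERE p.col1 < cur.col1      -- earlier rows only
                  AND p.col2 <> cur.col2     -- with a different value
                ORDER BY p.col1 DESC
                LIMIT 1) AS prev_col2
        FROM table_a AS cur
        ORDER BY cur.col1
        """
    ).fetchall()
    conn.close()
    return result


for row in previous_different_value(SAMPLE):
    print(row)
# (1, 'A', None)
# (2, 'B', 'A')
# (3, 'B', 'A')
# (4, 'C', 'B')
# (5, 'C', 'B')
# (6, 'C', 'B')
# (7, 'D', 'C')
```

This matches the expected Lead_Col2 column from the question, at the cost of one subquery lookup per row.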

Find duplicate values in oracle

I'm using this query to find duplicate values in a table:
select col1,
count(col1)
from table1
group by col1
having count (col1) > 1
order by 2 desc;
But also I want to add another column from the same table, like this:
select col1,
col2,
count(col1)
from table1
group by col1
having count (col1) > 1
order by 2 desc;
I get an ORA-00979 error with that second query.
How can I add another column to my search?
Your query should be
SELECT * FROM (
select col1,
col2,
count(col1) over (partition by col1) col1_cnt
from table1
)
WHERE col1_cnt > 1
order by 2 desc;
Presumably you want to get col2 for each duplicate of col1 that turns up. You can't really do that in a single query^. Instead, what you need to do is get your list of duplicates, then use that to retrieve any other associated values:
select col1, col2
from table1
where col1 in (select col1
from table1
group by col1
having count (col1) > 1)
order by col2 desc
^ Okay, you can, by using analytic functions, as @rs. demonstrated. For this scenario, I suspect that the nested query will be more efficient, but both should give you the same results.
Based on comments, it seems like you're not clear on why you can't just add the second column. Assume you have sample data that looks like this:
Col1 | Col2
-----+-----
1 | A
1 | B
2 | C
2 | D
3 | E
If you run
select Col1, count(*) as cnt
from table1
group by Col1
having count(*) > 1
then your results will be:
Col1 | Cnt
-----+-----
1 | 2
2 | 2
You can't just add Col2 to this query without adding it to the group by clause because the database will have no way of knowing which value you actually want (i.e. for Col1=1 should the DB return 'A' or 'B'?). If you add Col2 to the group by clause, you get the following:
select Col1, Col2, count(*) as cnt
from table1
group by Col1, Col2
having count(*) > 1
Col1 | Col2 | Cnt
-----+------+----
[no results]
This is because the count is for each combination of Col1 and Col2 (each of which is unique).
Finally, by using either a nested query (as in my answer) or an analytic function (as in @rs.'s answer), you'll get the following result (query changed slightly to return the count):
select t1.col1, t1.col2, t2.cnt
from table1 t1
join (select col1, count(*) as cnt
from table1
group by col1
having count (col1) > 1) t2
on t1.col1 = t2.col1
Col1 | Col2 | Cnt
-----+------+----
1 | A | 2
1 | B | 2
2 | C | 2
2 | D | 2
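The two outcomes described above can be reproduced with a small script. This is a sketch only, using Python's built-in sqlite3 module instead of Oracle. SQLite is more permissive about ungrouped columns and will not raise ORA-00979, so the example shows just the two valid queries. The function name duplicate_report is hypothetical.

```python
import sqlite3

# Sample rows from the explanation above: (col1, col2)
SAMPLE = [(1, "A"), (1, "B"), (2, "C"), (2, "D"), (3, "E")]


def duplicate_report(rows):
    """Return (rows from the join on duplicated col1, rows from grouping by both columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (col1 INTEGER, col2 TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", rows)
    # Find duplicated col1 values first, then join back to pick up col2.
    joined = conn.execute(
        """
        SELECT t1.col1, t1.col2, t2.cnt
        FROM table1 AS t1
        JOIN (SELECT col1, COUNT(*) AS cnt
              FROM table1
              GROUP BY col1
              HAVING COUNT(*) > 1) AS t2
            ON t1.col1 = t2.col1
        ORDER BY t1.col1, t1.col2
        """
    ).fetchall()
    # Grouping by both columns counts each (col1, col2) pair instead.
    grouped_by_both = conn.execute(
        """
        SELECT col1, col2, COUNT(*) AS cnt
        FROM table1
        GROUP BY col1, col2
        HAVING COUNT(*) > 1
        """
    ).fetchall()
    conn.close()
    return joined, grouped_by_both


joined, grouped_by_both = duplicate_report(SAMPLE)
print(joined)           # [(1, 'A', 2), (1, 'B', 2), (2, 'C', 2), (2, 'D', 2)]
print(grouped_by_both)  # []
```

Adding col2 to the GROUP BY makes the error go away but changes what is being counted, which is why the second result is empty.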
You should list all selected columns in the group by clause as well.
select col1,
col2,
count(col1)
from table1
group by col1, col2
having count (col1) > 1
order by 2 desc;
Cause of Error
You tried to execute an SQL SELECT statement that included a GROUP BY
function (ie: SQL MIN Function, SQL MAX Function, SQL SUM Function,
SQL COUNT Function) and an expression in the SELECT list that was not
in the SQL GROUP BY clause.
select col1,
col2,
count(col1)
from table1
group by col1,col2
having count (col1) > 1
order by 2 desc;

SQL Server - Query to return groups with multiple distinct records

My table:
Col1 Col2
1 xyz
1 abc
2 abc
3 yyy
4 zzz
4 zzz
I have a table with two columns. I want to query for records where col1 has more than one DISTINCT col2 value. In the example table given above, the query should return the records for col1 with value "1".
Expected query result:
Col1 Col2
1 xyz
1 abc
SELECT *
FROM tableName
WHERE Col1 IN
(
SELECT Col1
FROM tableName
GROUP BY Col1
HAVING COUNT(DISTINCT col2) > 1
)
select t.col1, t.col2
from (
select col1
from tbl
group by col1
having MIN(col2) <> MAX(col2)
) x
join tbl t on t.col1 = x.col1
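Both variants should return the same rows on the sample data. Below is a minimal sketch that runs them side by side in SQLite via Python's built-in sqlite3 module. The table name tbl comes from the second answer, and the function name groups_with_distinct_values is made up for the example.

```python
import sqlite3

# Sample rows from the question: (col1, col2)
SAMPLE = [(1, "xyz"), (1, "abc"), (2, "abc"), (3, "yyy"), (4, "zzz"), (4, "zzz")]


def groups_with_distinct_values(rows):
    """Return the result of the COUNT(DISTINCT) variant and of the MIN/MAX variant."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (col1 INTEGER, col2 TEXT)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?)", rows)
    by_count = conn.execute(
        """
        SELECT col1, col2
        FROM tbl
        WHERE col1 IN (SELECT col1
                       FROM tbl
                       GROUP BY col1
                       HAVING COUNT(DISTINCT col2) > 1)
        ORDER BY col1, col2
        """
    ).fetchall()
    by_min_max = conn.execute(
        """
        SELECT t.col1, t.col2
        FROM (SELECT col1
              FROM tbl
              GROUP BY col1
              HAVING MIN(col2) <> MAX(col2)) AS x   -- differing min and max imply 2+ values
        JOIN tbl AS t ON t.col1 = x.col1
        ORDER BY t.col1, t.col2
        """
    ).fetchall()
    conn.close()
    return by_count, by_min_max


by_count, by_min_max = groups_with_distinct_values(SAMPLE)
print(by_count)    # [(1, 'abc'), (1, 'xyz')]
print(by_min_max)  # [(1, 'abc'), (1, 'xyz')]
```

Note that col1 = 4 is excluded by both: it has two rows, but only one distinct col2 value.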