SQL query to efficiently select non-perfect duplicates

I have a database table in the entity-attribute-value format which looks like this:
I wish to select all rows that have the same values for the 'entity' and 'attribute' columns, but have different values for the 'value' column. Multiple rows with the same values for all three columns should be treated as a single row. The way I achieved this is by using SELECT DISTINCT.
SELECT entity_id, attribute_name, COUNT(attribute_name) AS NumOcc
FROM (SELECT DISTINCT * FROM radiology) x
GROUP BY entity_id,attribute_name
HAVING COUNT(attribute_name) > 1
However, I have read that using SELECT DISTINCT is quite costly. I plan on using this query on very large tables, so I am looking for a way to optimize it, perhaps without using SELECT DISTINCT.
I am using PostgreSQL 10.3.

select *
from radiology r
join (
select entity_id
, attribute_name
from radiology
group by
entity_id
, attribute_name
having count(distinct value) > 1
) dupe
on r.entity_id = dupe.entity_id
and r.attribute_name = dupe.attribute_name

This should work for you:
select a.* from radiology a join
(select entity_id, attribute_name, count(distinct value) cnt
from radiology
group by entity_id, attribute_name
having count(distinct value)>1)b
on a.entity_id=b.entity_id and a.attribute_name=b.attribute_name

I wish to select all rows that have the same values for the 'entity' and 'attribute' columns, but have different values for the 'value' column.
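Your method does not do this: it returns one row per entity/attribute pair together with a count, not the rows themselves. I would use EXISTS:
select r.*
from radiology r
where exists (select 1
from radiology r2
where r2.entity_id = r.entity_id and r2.attribute_name = r.attribute_name and
r2.value <> r.value
);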
If you just want the entity/attribute pairs that have more than one distinct value, use GROUP BY:
select entity_id, attribute_name
from radiology
group by entity_id, attribute_name
having min(value) <> max(value);
Note that you could use having count(distinct value) > 1, but count(distinct) incurs more overhead than min() and max().
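To try the EXISTS and MIN/MAX approaches side by side, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, purely so that it is self-contained. The column names entity_id, attribute_name and value follow the question's query, and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE radiology (entity_id INTEGER, attribute_name TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO radiology VALUES (?, ?, ?)",
    [
        (1, "color", "red"),
        (1, "color", "red"),    # perfect duplicate, should be ignored
        (2, "color", "red"),
        (2, "color", "blue"),   # non-perfect duplicate of the row above
        (3, "size", "large"),
    ],
)

# Entity/attribute pairs that carry more than one distinct value.
pairs = conn.execute(
    """
    SELECT entity_id, attribute_name
    FROM radiology
    GROUP BY entity_id, attribute_name
    HAVING MIN(value) <> MAX(value)
    """
).fetchall()

# Full rows that have a sibling row with a different value.
rows = conn.execute(
    """
    SELECT r.entity_id, r.attribute_name, r.value
    FROM radiology r
    WHERE EXISTS (SELECT 1
                  FROM radiology r2
                  WHERE r2.entity_id = r.entity_id
                    AND r2.attribute_name = r.attribute_name
                    AND r2.value <> r.value)
    ORDER BY r.value
    """
).fetchall()

print(pairs)
print(rows)
```

Both queries skip NULL values: MIN and MAX ignore them, and a comparison with NULL is never true.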

Related

How do I get all the values from 2 tables without doing a Cross Join

I have 2 tables with the following Schema
First ( id, user_id, user_agent, referrer, browser, device_type, IP)
Second ( id, user_id, name, properties)
Table First has a total of 512 entries for user_id 1. Table Second has a total of 100 entries for user_id 1. The two tables track different user activities. When I join table Second to First for user_id 1:
SELECT COUNT(*)
FROM first f
JOIN second AS s ON s.user_id = f.user_id
WHERE f.user_id = 1
I get a total of 51,200 returned rows. In effect, a cross join (first * second) is being done. Is there a way to get a less enormous result, perhaps first + second results?

I think you can use the UNION ALL operator.
The SQL UNION ALL operator is used to combine the result sets of 2 or more SELECT statements. It does not remove duplicate rows between the various SELECT statements.
Each SELECT statement within the UNION ALL must have the same number of fields in the result sets with similar data types, so you need to add the columns missing from each table as NULL placeholders (for example, null as "name").
Alternatively, you can use UNION:
UNION removes duplicate rows.
UNION ALL does not remove duplicate rows.
select *
from(
SELECT id, user_id, user_agent, referrer, browser, device_type, IP, null as "name",
null as "properties"
FROM first f
UNION ALL
SELECT id, user_id, null as "user_agent", null as "referrer", null as "browser",
null as "device_type", null as "IP", name, properties
FROM second s) x
Where user_id = 1
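The difference in row counts between the join and the UNION ALL can be checked with a small sketch. It uses Python's sqlite3 module rather than the asker's database, and the table names first_activity and second_activity, the reduced column lists, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, trimmed-down versions of the two tables from the question.
conn.execute("CREATE TABLE first_activity (id INTEGER, user_id INTEGER, browser TEXT)")
conn.execute("CREATE TABLE second_activity (id INTEGER, user_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO first_activity VALUES (?, ?, ?)",
    [(1, 1, "firefox"), (2, 1, "chrome"), (3, 1, "safari")],
)
conn.executemany(
    "INSERT INTO second_activity VALUES (?, ?, ?)",
    [(1, 1, "login"), (2, 1, "logout")],
)

# Joining on user_id pairs every row of one table with every row of the other.
join_count = conn.execute(
    """
    SELECT COUNT(*)
    FROM first_activity f
    JOIN second_activity s ON s.user_id = f.user_id
    WHERE f.user_id = 1
    """
).fetchone()[0]

# UNION ALL stacks the rows instead, padding missing columns with NULL.
union_count = conn.execute(
    """
    SELECT COUNT(*)
    FROM (
        SELECT id, user_id, browser, NULL AS name FROM first_activity
        UNION ALL
        SELECT id, user_id, NULL AS browser, name FROM second_activity
    ) x
    WHERE user_id = 1
    """
).fetchone()[0]

print(join_count, union_count)
```

With 3 and 2 rows the join returns 3 * 2 rows and the UNION ALL returns 3 + 2, which mirrors the 512 * 100 versus 512 + 100 situation in the question.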

Use a LEFT JOIN and use the foreign key to query the two tables.

How to take count of distinct rows which have a specific column with NULL values in all rows

I have a table CodeResult as follows:
Here we can notice that Code 123 alone has a Code2 that has a value in Result. I want to take a count of the distinct Codes that have no values at all in Result. This means that, in this example, I should get 2.
I do not want to use group by clause because it will slow down the query.
Below code gives wrong result:
Select count(distinct code) from CodeResult where Result is Null

One method is two levels of aggregation:
select count(*)
from (select code
from t
group by code
having max(result) is null
) c;
A more clever method doesn't use a subquery. It counts the number of distinct codes and then removes the ones that have a result:
select ( count(distinct code) -
count(distinct case when result is not null then code end )
)
from t;
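Both methods can be compared against the question's failing query in a short sketch. It uses Python's sqlite3 module, and the table t with columns code, code2 and result is filled with made-up rows shaped like the question's description (the original table was not preserved): one code has a result in one of its rows, and two codes have none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (code INTEGER, code2 TEXT, result TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (123, "a", "ok"),   # code 123 has a result in one of its rows
        (123, "b", None),
        (456, "a", None),
        (456, "b", None),
        (789, "a", None),
    ],
)

# The query from the question: counts any code with at least one NULL result.
naive = conn.execute(
    "SELECT COUNT(DISTINCT code) FROM t WHERE result IS NULL"
).fetchone()[0]

# Two levels of aggregation: keep codes whose results are all NULL.
two_level = conn.execute(
    """
    SELECT COUNT(*)
    FROM (SELECT code FROM t GROUP BY code HAVING MAX(result) IS NULL) c
    """
).fetchone()[0]

# Single pass: all distinct codes minus the codes that have some result.
single_pass = conn.execute(
    """
    SELECT COUNT(DISTINCT code)
         - COUNT(DISTINCT CASE WHEN result IS NOT NULL THEN code END)
    FROM t
    """
).fetchone()[0]

print(naive, two_level, single_pass)
```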

You simply can't avoid a GROUP BY: In all DBMSs I know, the query plan you get from a:
SELECT DISTINCT a,b,c FROM tab;
is the same as the one for:
SELECT a,b,c FROM tab GROUP BY a,b,c;

The following query will return each of the Code values for which there are no corresponding non-NULL values in CodeResult:
select distinct Code
from CodeResult as CR
where not exists
( select 42 from CodeResult as iCR where iCR.Code = CR.Code and iCR.Result is not NULL );
Counting the rows is left as an exercise for the reader.

How to delete the duplicate data in table (Postgres)

I want to delete the duplicated data in a table. I know I can use
SELECT
fruit,
COUNT( fruit )
FROM
basket
GROUP BY
fruit
HAVING
COUNT( fruit )> 1
ORDER BY
fruit;
to find them, but I need to check that every column's value is equal, which means tableA.* = tableA.* (except id, which is the auto-increment primary key).
I tried this:
SELECT
*,
COUNT( * )
FROM
myTable
GROUP BY
*
HAVING
COUNT( * )> 1
ORDER BY
id;
but it says I can't use GROUP BY *. How can I find and delete the duplicated data (every column's value must be equal except id)?

Use
SELECT DISTINCT *
DISTINCT removes duplicate rows from the result.

Try something similar to the query below. Apply PARTITION BY to the columns other than Id (as it is an incrementing unique value), that is, to the columns you want to check for duplicates.
Also refer to the Postgres documentation on Row_Number and common table expressions.
WITH DuplicateTableRows AS
(
SELECT Id, Row_Number() OVER (PARTITION BY col1, col2... ORDER BY Id) AS rn
FROM
Table1
)
-- keep the lowest Id in each group of duplicates and delete the rest
DELETE FROM Table1
WHERE Id IN (SELECT Id FROM DuplicateTableRows WHERE rn > 1)
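As a self-contained way to try the ROW_NUMBER approach, the sketch below runs it through Python's sqlite3 module, which stands in for Postgres here and needs SQLite 3.25 or later for window functions. The basket table's color column and the sample rows are hypothetical. The subquery reads from the CTE, so only rows numbered 2 and above within each group of identical non-id columns are deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE basket (id INTEGER PRIMARY KEY AUTOINCREMENT, fruit TEXT, color TEXT)"
)
conn.executemany(
    "INSERT INTO basket (fruit, color) VALUES (?, ?)",
    [
        ("apple", "red"),
        ("apple", "red"),     # duplicate of id 1
        ("apple", "green"),
        ("pear", "green"),
        ("pear", "green"),    # duplicate of id 4
        ("pear", "green"),    # duplicate of id 4
    ],
)

# Number the rows inside each group of identical (fruit, color) values,
# then delete everything except the first row (lowest id) of each group.
conn.execute(
    """
    WITH ranked AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY fruit, color ORDER BY id) AS rn
        FROM basket
    )
    DELETE FROM basket
    WHERE id IN (SELECT id FROM ranked WHERE rn > 1)
    """
)

remaining = conn.execute("SELECT id, fruit, color FROM basket ORDER BY id").fetchall()
print(remaining)
```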

You can do this using JSON:
select (to_jsonb(b) - 'id')
from basket b
group by 1
having count(*) > 1;
The result is returned as JSON. Unfortunately, to extract the values back into a record, you need to list the columns individually.

Get row count including column values in sql server

I need to get the row count of a query, and also get the query's columns in one single query. The count should be a part of the result's columns (It should be the same for all rows, since it's the total).
for example, if I do this:
select count(1) from table
I can have the total number of rows.
If I do this:
select a,b,c from table
I'll get the column's values for the query.
What I need is to get the count and the column values in one query, in an efficient way.
For example:
select Count(1), a,b,c from table
with no group by, since I want the total.
The only way I've found is to do a temp table (using variables), insert the query's result, then count, then returning the join of both. But if the result gets thousands of records, that wouldn't be very efficient.
Any ideas?

@Jim H is almost right, but chooses the wrong ranking function:
create table #T (ID int)
insert into #T (ID)
select 1 union all
select 2 union all
select 3
select ID,COUNT(*) OVER (PARTITION BY 1) as RowCnt from #T
drop table #T
Results:
ID RowCnt
1 3
2 3
3 3
Partitioning by a constant makes it count over the whole resultset.
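The same idea can be sketched outside SQL Server with Python's sqlite3 module (SQLite 3.25 or later is assumed for window function support). An empty OVER () clause is used here instead of PARTITION BY 1. It also treats the whole result set as a single partition. The table name t is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# COUNT(*) as a window function keeps every row and repeats the total on each.
rows = conn.execute(
    "SELECT id, COUNT(*) OVER () AS row_cnt FROM t ORDER BY id"
).fetchall()

print(rows)
```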

Using CROSS JOIN:
SELECT a.*, b.numRows
FROM YOUR_TABLE a
CROSS JOIN (SELECT COUNT(*) AS numRows
FROM YOUR_TABLE) b

Look at the Ranking functions of SQL Server.
SELECT ROW_NUMBER() OVER (ORDER BY a) AS 'RowNumber', a, b, c
FROM table;

You could do it like this:
SELECT x.total, a, b, c
FROM
table
JOIN (SELECT total = COUNT(*) FROM table) AS x ON 1=1
which will return the total number of records in the first column, followed by fields a,b & c

unique count of the columns?

I want to get a unique count across multiple columns that may contain similar or different data. I am using SQL Server 2005. For one column I am able to get the unique count, but what is the query to get a count of multiple columns at a time?

You can run the following select, getting the data from a derived table:
select count(*) from (select distinct c1, c2 from t1) dt

To get the number of rows for each unique combination of column values, use
SELECT COUNT(*) FROM TableName GROUP BY UniqueColumn1, UniqueColumn2
To get the unique counts of multiple individual columns, use
SELECT COUNT(DISTINCT Column1), COUNT(DISTINCT Column2)
FROM TableName
Your question does not make clear what exactly you want to achieve.
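The two readings of "unique count" give different numbers, as the rough sketch below shows. It uses Python's sqlite3 module in place of SQL Server 2005, and the table t1 with columns c1 and c2 holds made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (c1 INTEGER, c2 TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [(1, "a"), (1, "a"), (1, "b"), (2, "b")],
)

# Number of distinct (c1, c2) combinations, via a derived table.
combined = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT c1, c2 FROM t1) dt"
).fetchone()[0]

# Distinct counts of each column taken individually.
per_column = conn.execute(
    "SELECT COUNT(DISTINCT c1), COUNT(DISTINCT c2) FROM t1"
).fetchone()

print(combined, per_column)
```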

I think what you're getting at is individual sums from two unique columns in one query. I was able to accomplish this by using
SELECT FiscalYear, SUM(Col1) AS Col1Total, SUM(Col2) AS Col2Total
FROM TableName
GROUP BY FiscalYear
If your data is not numerical in nature, you can use CASE statements
SELECT FiscalYear, SUM(CASE WHEN ColA = 'abc' THEN 1 ELSE 0 END) AS ColATotal,
SUM(CASE WHEN ColB = 'xyz' THEN 1 ELSE 0 END) AS ColBTotal
FROM TableName
GROUP BY FiscalYear
Hope this helps!
