Returning only duplicate rows from two tables - sql

Every thread I've seen so far has been about checking for duplicate rows and avoiding them. I'm trying to get a query that returns only the duplicate rows. I thought it would be as simple as a subquery, but I was wrong. Then I tried the following:
SELECT * FROM a
WHERE EXISTS (
    SELECT * FROM b
    WHERE b.id = a.id
)
That was a bust too. How do I return only the duplicate rows? I'm currently going through two tables, but I'm afraid there is a large number of duplicates.

Use this query; it may be better if you compare only the relevant column(s):
SELECT * FROM a
INTERSECT
SELECT * FROM b

I am sure the code you posted would work too, like:
SELECT * FROM a
WHERE EXISTS (
    SELECT 1 FROM b WHERE b.id = a.id
)
You can also do an INNER JOIN, like:
SELECT a.* FROM a
JOIN b on a.id = b.id;
You can also use an IN operator:
SELECT * FROM a WHERE id IN (SELECT id FROM b);
If none of those work, then you can use UNION ALL (if both tables satisfy the union restriction) along with the ROW_NUMBER() function, like:
SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
    FROM (
        SELECT * FROM a
        UNION ALL
        SELECT * FROM b
    ) xx
) yy
WHERE rn > 1;  -- rn = 1 would return every row once; rn > 1 keeps only the ids that occur in both tables
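An alternative that identifies the ids appearing in both inputs with plain aggregation (a sketch, assuming id identifies a row within each table):
SELECT id
FROM (
    SELECT id FROM a
    UNION ALL
    SELECT id FROM b
) xx
GROUP BY id
HAVING COUNT(*) > 1;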

Note: there's an ambiguity as to what you mean by a duplicate row, and whether you're talking about duplicate keys, or all fields being the same. My answer deals with all fields being the same; some of the others are assuming it's just the keys. It's unclear which you intend.
You might try
SELECT a.id, a.col1, a.col2
FROM a INNER JOIN b ON a.id = b.id
WHERE a.col1 = b.col1 AND a.col2 = b.col2
adding in other columns as necessary. The database engine should be intelligent enough to do the comparisons on the indexed columns first, so it'll be efficient as long as you don't have rows that are different only on lots of non-indexed fields. (If you do, then I don't think anything will do it particularly efficiently.)
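If NULLs can appear in the compared columns, plain = will never treat two NULLs as equal. A NULL-safe sketch of the same comparison, assuming a database that supports IS NOT DISTINCT FROM (e.g. PostgreSQL or Snowflake):
SELECT a.id, a.col1, a.col2
FROM a INNER JOIN b ON a.id = b.id
WHERE a.col1 IS NOT DISTINCT FROM b.col1
  AND a.col2 IS NOT DISTINCT FROM b.col2;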

Can I select several tables in the same WITH query?

I have a long query with a WITH structure (CTEs). At the end of it, I'd like to output two tables. Is this possible?
(The tables and queries are in Snowflake SQL, by the way.)
The code looks like this:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
..... many more alias tables and subqueries here .....
)
select * from table_g where z = 3 ;
But in the very last line, I'd like to query table_g twice, once with z = 3 and once with another condition, so I get two tables as the result. Is there a way of doing that (ending with two queries rather than just one), or do I have to re-run the whole code for each table I want as output?
One query = one result set. That's just the way RDBMSs work.
A CTE (WITH statement) is just syntactic sugar for a subquery.
For instance, a query similar to yours:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
select id,
product_c
from x.z )
select *
from table_a
inner join table_b on table_a.id = table_b.id
inner join table_c on table_b.id = table_c.id;
Is 100% identical to:
select *
from
(select id, product_a from x.x) table_a
inner join (select id, product_b from x.y) table_b
on table_a.id = table_b.id
inner join (select id, product_c from x.z) table_c
on table_b.id = table_c.id
The CTE version doesn't give you any extra features that aren't available in the non-CTE version (with the exception of a recursive CTE), and the execution path will be 100% the same. (EDIT: Please see Simon's answer and comment below, where he notes that Snowflake may materialize the derived table defined by the CTE so that it only has to perform that step once should the CTE be referenced multiple times in the main query.) As such, there is still no way to get a second result set from the single query.
While they are the same syntactically, they don't always get the same execution plan. The first case is when one of the stages in the CTE is expensive and is reused by other CTEs or joined to many times. Under Snowflake, when that stage is written as a CTE, I have witnessed it running the "expensive" part only a single time, which can be good, for example like this:
WITH expensive_select AS (
    SELECT a.a, b.b, c.c
    FROM table_a AS a
    JOIN table_b AS b ON join_condition_ab   -- placeholder join conditions
    JOIN table_c AS c ON join_condition_bc
    WHERE complex_filters
), do_some_thing_with_results AS (
    SELECT a, stuff
    FROM expensive_select
    WHERE filters_1
), do_some_aggregation AS (
    SELECT a, SUM(b) AS sum_b
    FROM expensive_select
    WHERE filters_2
    GROUP BY a
)
SELECT a.a
      ,a.b
      ,b.stuff
      ,c.sum_b
FROM expensive_select AS a
LEFT JOIN do_some_thing_with_results AS b ON a.a = b.a
LEFT JOIN do_some_aggregation AS c ON a.a = c.a;
This was originally written unrolled, and the expensive part was some views where the date range filter applied at the top level was not getting pushed down (due to window functions), which resulted in full table scans, multiple times. Pushing the filter into the CTE meant the cost was paid only once. (In our case, putting the date range filters in the CTE made Snowflake notice them and push them down into the view. Things can change, though: a few weeks later the original code ran as well as the modified version, so they "fixed" something.)
In other cases, the different paths that used the CTE worked on smaller subsets of the results, so using the CTE reduced the remote IO and improved performance, even though there were then more stalls in the execution plan.
I also use CTEs like this to make the code easier to read: give the CTE a meaningful name, but then alias it to something short for use. I really love that.
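A minimal sketch of that naming pattern (the table and column names here are invented purely for illustration):
WITH daily_revenue_per_customer AS (
    SELECT customer_id, order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer_id, order_date
)
SELECT drc.customer_id, drc.revenue
FROM daily_revenue_per_customer AS drc   -- meaningful CTE name, short alias for use
WHERE drc.order_date = CURRENT_DATE;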

Efficient way to check if row exists for multiple records in postgres

I saw answers to a related question, but couldn't really apply what they are doing to my specific case.
I have a large table (300k rows) that I need to join with another even larger (1-2M rows) table efficiently. For my purposes, I only need to know whether a matching row exists in the second table. I came up with a nested query like so:
SELECT
    id,
    CASE cnt WHEN 0 THEN 'NO_MATCH' ELSE 'YES_MATCH' END AS match_exists
FROM
(
    SELECT
        A.id AS id, count(*) AS cnt
    FROM
        A, B
    WHERE
        A.id = B.foreign_id
    GROUP BY A.id
) AS id_and_matches_count
Is there a better and/or more efficient way to do it?
Thanks!
You just want a left outer join:
SELECT
    A.id AS id, count(B.foreign_id) AS cnt
FROM A
LEFT OUTER JOIN B ON
    A.id = B.foreign_id
GROUP BY A.id
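If you still want the YES/NO label from your original query, you can wrap the count directly; a sketch building on the query above:
SELECT
    A.id AS id,
    CASE WHEN count(B.foreign_id) = 0 THEN 'NO_MATCH' ELSE 'YES_MATCH' END AS match_exists
FROM A
LEFT OUTER JOIN B ON A.id = B.foreign_id
GROUP BY A.id;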

Issues with SQL Select utilizing Except and UNION All

SELECT *
FROM (
    SELECT * FROM a
    EXCEPT
    SELECT * FROM b
) x
UNION ALL
SELECT *
FROM (
    SELECT * FROM b
    EXCEPT
    SELECT * FROM a
) y
This SQL statement returns a wildly wrong amount of data. If the SELECT from a returns a million rows, how does this entire statement return 100,000? In this instance, b contains data that is mutually exclusive with a, so there should be no elimination due to the EXCEPT.
As already stated in the comments, EXCEPT does an implicit DISTINCT, and the ALL in your UNION ALL cannot re-create the duplicates that were removed. Hence you cannot use your approach if you want to keep duplicates.
As you want to get the data that is contained in exactly one of the tables a and b, but not in both, a more efficient way to achieve that would be the following (I am just assuming the tables have columns id and c where id is the primary key, as you did not state any column names):
SELECT CASE WHEN a.id IS NULL THEN 'from b' ELSE 'from a' END as source_table
,coalesce(a.id, b.id) as id
,coalesce(a.c, b.c) as c
FROM a
FULL OUTER JOIN b ON a.id = b.id AND a.c = b.c -- use all columns of both tables here!
WHERE a.id IS NULL OR b.id IS NULL
This makes use of a FULL OUTER JOIN, excluding the matching records via the WHERE conditions, as the primary key cannot be null except if it comes from the OUTER side.
If your tables do not have primary keys - which is bad practice anyway - you would have to check across all columns for NULL, not just the one primary key column.
And if you have records completely consisting of NULLs, this method would not work.
Then you could use an approach similar to your original one, just using
SELECT ...
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE <join by all columns>)
UNION ALL
SELECT ...
FROM b
WHERE NOT EXISTS (SELECT 1 FROM a WHERE <join by all columns>)
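With the columns id and c assumed above, that template would look like this sketch (extend the comparisons to cover all columns, and keep in mind the NULL caveat already mentioned):
SELECT *
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id AND b.c = a.c)
UNION ALL
SELECT *
FROM b
WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.id = b.id AND a.c = b.c);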
If you're trying to get any data that is in one table and not in the other regardless of which table, I would try something like the following:
select id, 'table a data not in b' from a where id not in (select id from b)
union
select id, 'table b data not in a' from b where id not in (select id from a)

Querying a table finding if child table's matching records exist in ANSI SQL

I have two tables, A and B, with a one-to-many relationship.
Now I want some records from A, along with an existence field that shows whether B has any matching records. I don't want to use the count function, as B has so many records that it delays SQL execution. Nor do I want to use proprietary keywords like Oracle's rownum, as below, because I need as much compatibility as possible.
select A.*, (
select 1 from B where ref_column = A.ref_column and rownum = 1
) existence
...
You would use a left join + count anyway; a SELECT statement in the select list can be executed multiple times, while a join will be done only once.
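A sketch of that left join + count approach, reusing ref_column from your question (adjust the names to your schema):
SELECT A.*,
       CASE WHEN m.cnt > 0 THEN 1 ELSE 0 END AS existence
FROM A
LEFT JOIN (SELECT ref_column, COUNT(*) AS cnt
           FROM B
           GROUP BY ref_column) m
  ON m.ref_column = A.ref_column;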
Also you can consider EXISTS:
select A.*,
       case when exists (select 1 from B where B.ref_column = A.ref_column)
            then 1 else 0 end as existence
from A
Use an EXISTS clause. If the foreign key in B is indexed, performance should not be an issue.
SELECT *
FROM a
WHERE EXISTS (SELECT 1 FROM b WHERE b.a_id = a.id)

Efficient latest record query with Postgresql

I need to do a big query, but I only want the latest records.
For a single entry I would probably do something like
SELECT * FROM table WHERE id = ? ORDER BY date DESC LIMIT 1;
But I need to pull the latest record for a large number (thousands) of entries, only the latest entry for each.
Here's what I have. It's not very efficient. I was wondering if there's a better way.
SELECT * FROM table a WHERE ID IN $LIST AND date = (SELECT max(date) FROM table b WHERE b.id = a.id);
If you don't want to change your data model, you can use DISTINCT ON to fetch the newest record from table "b" for each entry in "a":
SELECT DISTINCT ON (a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY a.id, b.date DESC
If you want to avoid a "sort" in the query, adding an index like this might help you, but I am not sure:
CREATE INDEX b_id_date ON b (id, date DESC)
SELECT DISTINCT ON (b.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY b.id, b.date DESC
Alternatively, if you want to sort records from table "a" some way:
SELECT DISTINCT ON (sort_column, a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY sort_column, a.id, b.date DESC
Alternative approaches
However, all of the above queries still need to read all referenced rows from table "b", so if you have lots of data, it might still just be too slow.
You could create a new table, which only holds the newest "b" record for each a.id -- or even move those columns into the "a" table itself.
This could be more efficient. The difference: the query on table b is executed only once, while your correlated subquery is executed for every row:
SELECT *
FROM table a
JOIN (SELECT ID, max(date) AS maxDate
      FROM table
      GROUP BY ID) b
  ON a.ID = b.ID AND a.date = b.maxDate
WHERE a.ID IN $LIST
What do you think about this?
select * from (
    SELECT a.*, row_number() over (partition by a.id order by date desc) r
    FROM table a where ID IN $LIST
) sub
WHERE r = 1
I used it a lot in the past.
One method: create a small derivative table containing the most recent update/insertion times on table a; call this table a_latest. Table a_latest will need sufficient granularity to meet your specific query requirements. In your case it should be sufficient to use
CREATE TABLE a_latest
( id   INTEGER   NOT NULL,
  date TIMESTAMP NOT NULL,
  PRIMARY KEY (id) );
Then use a query similar to that suggested by najmeddine:
SELECT a.*
FROM table a
JOIN a_latest USING (id, date);
The trick then is keeping a_latest up to date. Do this using a trigger on insertions and updates. A trigger written in PL/pgSQL is fairly easy to write. I am happy to provide an example if you wish.
The point here is that computation of the latest update time is taken care of during the updates themselves. This shifts more of the load away from the query.
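A minimal sketch of such a trigger, assuming the a_latest definition above, PostgreSQL 9.5+ for ON CONFLICT, and "table" standing in for the main table from the question (names are illustrative):
CREATE OR REPLACE FUNCTION refresh_a_latest() RETURNS trigger AS $$
BEGIN
    -- upsert the newest date seen for this id
    INSERT INTO a_latest (id, date)
    VALUES (NEW.id, NEW.date)
    ON CONFLICT (id) DO UPDATE
        SET date = GREATEST(a_latest.date, EXCLUDED.date);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER a_latest_refresh
AFTER INSERT OR UPDATE ON table   -- replace "table" with the real table name
FOR EACH ROW EXECUTE PROCEDURE refresh_a_latest();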
If you have many rows per id, you definitely want a correlated subquery.
It will make one index lookup per id, but this is faster than sorting the whole table.
Something like:
SELECT a.id,
       (SELECT max(t.date) FROM table t WHERE t.id = a.id) AS lastdate
FROM table2 a;
The 'table2' you will use is not the table you mention in your query above, because here you need a list of distinct ids for good performance. Since your ids are probably FKs into another table, use that one.
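To get the full latest rows rather than just the (id, lastdate) pairs, a sketch joining those pairs back to the main table might look like:
SELECT t.*
FROM table t
JOIN (SELECT a.id,
             (SELECT max(x.date) FROM table x WHERE x.id = a.id) AS lastdate
      FROM table2 a) latest
  ON t.id = latest.id
 AND t.date = latest.lastdate;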
You can use a NOT EXISTS subquery to answer this also. Essentially you're saying "SELECT record... WHERE NOT EXISTS(SELECT newer record)":
SELECT t.id FROM table t
WHERE NOT EXISTS
(SELECT * FROM table n WHERE t.id = n.id AND n.date > t.date)