Querying a table to find whether matching child-table records exist in ANSI SQL - sql

I have two tables, A and B, in a one-to-many relationship.
Now I want some records from A, together with an existence field that shows whether B has any matching records. I don't want to use the count function, because B has so many records that counting delays SQL execution. Nor do I want to use proprietary keywords such as Oracle's rownum, as below, because I need as much compatibility as possible.
select A.*, (
select 1 from B where ref_column = A.ref_column and rownum = 1
) existence
...

You would use left join + count anyway: a select statement in the select list may be executed once per row, while the join is done only once.
Also you can consider EXISTS:
select A.*, case when exists (
select 1 from B where B.ref_column = A.ref_column
) then 1 else 0 end as existence
from A

Use an EXISTS clause. If the foreign key in B is indexed, performance should not be an issue.
SELECT *
FROM a
WHERE EXISTS (SELECT 1 FROM b WHERE b.a_id = a.id)
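To make the existence flag concrete, here is a minimal sketch that runs the CASE WHEN EXISTS form against an in-memory SQLite database from Python. The table layout (a.id, b.a_id), the index name and the sample rows are assumptions for illustration only; the SQL itself avoids vendor-specific keywords, but behavior on other engines is not verified here.

```python
import sqlite3

# Hypothetical schema: a(id) is the parent, b(a_id) the child table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (a_id INTEGER REFERENCES a(id));
    CREATE INDEX b_a_id ON b (a_id);  -- index on the foreign key
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (3);
""")

# One row per parent, with 1/0 telling whether any child row exists.
rows = conn.execute("""
    SELECT a.id,
           CASE WHEN EXISTS (SELECT 1 FROM b WHERE b.a_id = a.id)
                THEN 1 ELSE 0 END AS existence
    FROM a
    ORDER BY a.id
""").fetchall()
print(rows)
```

EXISTS only needs to find one matching child row, so it does not have to count all of them.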

Related

Better way to do a correlated query with a count in the condition in AWS Athena SQL

There are two tables, A and B. Table A has a one-to-many relationship with B.
I want to fetch records from A and a single corresponding record from B (if B has one record).
If there are multiple records in table B, then pick the one with status = 'ACTIVE' first.
Below is the query, which runs in Oracle, but we want the same functionality in AWS Athena; however, correlated queries are not supported in AWS Athena SQL. Athena supports ANSI SQL.
SELECT b.*
FROM A a ,B b
WHERE a.instruction_id = b.txn_report_instruction_id AND b.txn_report_instruction_id IN
(SELECT b2.txn_report_instruction_id FROM B b2
WHERE b2.txn_report_instruction_id=b.txn_report_instruction_id
GROUP BY b2.txn_report_instruction_id
HAVING COUNT(b2.txn_report_instruction_id)=1
)
UNION
SELECT * FROM
(SELECT b.*
FROM A a , B b
WHERE a.instruction_id = b.txn_report_instruction_id AND b.txn_report_instruction_id IN
(SELECT b2.txn_report_instruction_id
FROM B b2
WHERE b2.txn_report_instruction_id=b.txn_report_instruction_id
AND b2.status ='ACTIVE'
GROUP BY b2.txn_report_instruction_id
HAVING COUNT(b2.txn_report_instruction_id)> 1
)
)
We need to put all the fields in the select list or in an aggregate function when using GROUP BY, so GROUP BY is not preferable.
Any help would be much appreciated.
Output result table
Joining the best row can be achieved with a lateral join.
select *
from a
outer apply
(
select *
from b
where b.txn_report_instruction_id = a.instruction_id
order by case when b.status = 'ACTIVE' then 1 else 2 end
fetch first row only
) bb;
Another option is a window function:
select *
from a
left join
(
select
b.*,
row_number() over (partition by txn_report_instruction_id
order by case when status = 'ACTIVE' then 1 else 2 end) as rn
from b
) bb on bb.txn_report_instruction_id = a.instruction_id and bb.rn = 1;
I don't know about Amazon Athena's SQL coverage. This is all standard SQL, however, except for OUTER APPLY, I think. If I am not mistaken, the SQL standard requires LEFT OUTER JOIN LATERAL (...) ON ... instead, for which you need a dummy ON clause, such as ON 1 = 1. So if the above queries fail, there is another option for you :-)
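The window-function variant can be tried out locally. The following is a small sketch using Python's sqlite3 module (it needs SQLite 3.25 or newer for ROW_NUMBER); the extra column b.id and the sample rows are assumptions, and nothing here confirms how Athena treats the same query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (instruction_id INTEGER PRIMARY KEY);
    -- b.id is a hypothetical surrogate key added for readability
    CREATE TABLE b (id INTEGER PRIMARY KEY,
                    txn_report_instruction_id INTEGER,
                    status TEXT);
    INSERT INTO a VALUES (10), (20), (30);
    INSERT INTO b VALUES (1, 10, 'CLOSED'),
                         (2, 20, 'CLOSED'),
                         (3, 20, 'ACTIVE');
""")

# Rank each instruction's rows so that ACTIVE sorts first, then keep rank 1.
rows = conn.execute("""
    SELECT a.instruction_id, bb.id, bb.status
    FROM a
    LEFT JOIN (
        SELECT b.*,
               ROW_NUMBER() OVER (
                   PARTITION BY txn_report_instruction_id
                   ORDER BY CASE WHEN status = 'ACTIVE' THEN 1 ELSE 2 END
               ) AS rn
        FROM b
    ) bb ON bb.txn_report_instruction_id = a.instruction_id AND bb.rn = 1
    ORDER BY a.instruction_id
""").fetchall()
print(rows)
```

Instruction 10 keeps its only row, instruction 20 gets the ACTIVE row, and instruction 30 (no rows in b) comes back with NULLs because of the left join.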

Filter by count from another table

This query works fine only without WHERE, otherwise there is an error:
column "cnt" does not exist
SELECT
*,
(SELECT count(*)
FROM B
WHERE A.id = B.id) AS cnt
FROM A
WHERE cnt > 0
Use a subquery:
SELECT a.*
FROM (SELECT A.*,
(SELECT count(*)
FROM B
WHERE A.id = B.id
) AS cnt
FROM A
) a
WHERE cnt > 0;
Column aliases defined in the SELECT cannot be used by the WHERE (or other clauses) for that SELECT.
Or, if the id on a is unique, you can more simply do:
SELECT a.*, COUNT(B.id)
FROM A LEFT JOIN
B
ON A.id = B.id
GROUP BY A.id
HAVING COUNT(B.id) > 0;
Or, if you don't really need the count, then:
select a.*
from a
where exists (select 1 from b where b.id = a.id);
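As a rough check that the three rewrites above agree, here is a sketch that runs them side by side in an in-memory SQLite database from Python. The sample rows are made up. SQLite is more lenient than PostgreSQL about aliases in WHERE, so this sketch only compares results and does not reproduce the original error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (3);
""")

# 1. Compute cnt in a derived table, filter in the outer query.
derived = conn.execute("""
    SELECT x.id, x.cnt
    FROM (SELECT a.id,
                 (SELECT count(*) FROM b WHERE b.id = a.id) AS cnt
          FROM a) x
    WHERE x.cnt > 0
    ORDER BY x.id
""").fetchall()

# 2. Join, group by the unique key, filter with HAVING.
grouped = conn.execute("""
    SELECT a.id, COUNT(b.id)
    FROM a LEFT JOIN b ON a.id = b.id
    GROUP BY a.id
    HAVING COUNT(b.id) > 0
    ORDER BY a.id
""").fetchall()

# 3. No count needed: EXISTS.
existing = conn.execute("""
    SELECT a.id FROM a
    WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
    ORDER BY a.id
""").fetchall()
print(derived, grouped, existing)
```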
Assumptions:
You need all columns from A in the result, plus the count from B. That's what your demonstrated query does.
You only want rows with cnt > 0. That's what started your question after all.
Most or all B.id exist in A. That's the typical case and certainly true if a FK constraint on B.id references A.id.
Solution
Faster, shorter, correct:
SELECT * -- !
FROM (SELECT id, count(*) AS cnt FROM B GROUP BY id) B
JOIN A USING (id) -- !
-- WHERE cnt > 0 -- this predicate is implicit now!
Major points
Aggregate before the join, that's typically (substantially) faster when processing the whole table or major parts of it. It also defends against problems if you join to more than one n-table. See:
Aggregate functions on multiple joined tables
You don't need to add the predicate WHERE cnt > 0 any more, that's implicit with the [INNER] JOIN.
You can simply write SELECT *, since the join only adds the column cnt to A.* when done with the USING clause - only one instance of the joining column(s) (id in the example) is added to the output columns. See:
How to drop one join key when joining two tables
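The aggregate-before-join idea from the points above can be sketched as follows, again with Python's sqlite3 rather than PostgreSQL. The column a.name, the alias bc and the sample rows are assumptions for illustration; the output columns are listed explicitly here instead of relying on SELECT *.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (1), (1), (3);
""")

# Aggregate b down to one row per id first, then join.
# The inner join drops ids without rows in b, so no "cnt > 0" filter is needed.
rows = conn.execute("""
    SELECT id, a.name, bc.cnt
    FROM (SELECT id, count(*) AS cnt FROM b GROUP BY id) bc
    JOIN a USING (id)
    ORDER BY id
""").fetchall()
print(rows)
```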
Your added question in the comment
postgres really allows to have outside aggregate function attributes that are not behind group by?
That's only true if the PK column(s) is listed in the GROUP BY clause - which covers the whole row. Not the case for a UNIQUE or EXCLUSION constraint. See:
Return a grouped list with occurrences using Rails and PostgreSQL
SQL Fiddle demo (extended version of Gordon's demo).

Value present in more than one table

I have 3 tables. All of them have a column - id. I want to find if there is any value that is common across the tables. Assuming that the tables are named a, b and c, if id value 3 is present in a and b, there is a problem. The query can/should exit at the first such occurrence. There is no need to probe further. What I have now is something like
( select id from a intersect select id from b )
union
( select id from b intersect select id from c )
union
( select id from a intersect select id from c )
Obviously, this is not very efficient. Database is PostgreSQL, version 9.0
id is not unique in the individual tables. It is OK to have duplicates in the same table. But if a value is present in just 2 of the 3 tables, that also needs to be flagged and there is no need to check for existence in the third table, or check if there are more such values. One value, present in more than one table, and I can stop.
Although id is not unique within any given table, it should be unique across the tables; a union of distinct id should be unique, so:
select id from (
select distinct id from a
union all
select distinct id from b
union all
select distinct id from c) x
group by id
having count(*) > 1
Note the use of union all, which preserves duplicates (plain union removes duplicates).
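Here is a small sketch of the union all + group by approach, run through Python's sqlite3 with made-up rows. Value 1 is duplicated inside table a only, which should not be flagged; value 2 appears in both a and c, which should.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (id INTEGER);
    INSERT INTO a VALUES (1), (1), (2);
    INSERT INTO b VALUES (3), (4);
    INSERT INTO c VALUES (2), (5);
""")

# DISTINCT collapses duplicates within a table; UNION ALL keeps the
# cross-table duplicates so that count(*) > 1 can detect them.
rows = conn.execute("""
    SELECT id FROM (
        SELECT DISTINCT id FROM a
        UNION ALL
        SELECT DISTINCT id FROM b
        UNION ALL
        SELECT DISTINCT id FROM c) x
    GROUP BY id
    HAVING count(*) > 1
    ORDER BY id
""").fetchall()
print(rows)
```

Note that this form lists every shared value rather than stopping at the first one.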
I would suggest a simple join:
select a.id
from a join
b
on a.id = b.id join
c
on a.id = c.id
limit 1;
If you have a query that uses union or group by (or order by, but that is not relevant here), then you need to process all the data before returning a single row. A join can start returning rows as soon as the first values are found.
An alternative, but similar method is:
select a.id
from a
where exists (select 1 from b where a.id = b.id) and
exists (select 1 from c where a.id = c.id);
If a is the smallest table and id is indexed in b and c, then this could be quite fast.
Try this
select id from
(
select distinct id, 1 as t from a
union all
select distinct id, 2 as t from b
union all
select distinct id, 3 as t from c
) as t
group by id having count(t)=3
It is OK to have duplicates in the same table.
The query can/should exit at the first such occurrence.
SELECT 'OMG!' AS danger_bill_robinson
WHERE EXISTS (SELECT 1
FROM a,b,c -- maybe there is a place for old-style joins ...
WHERE a.id = b.id
OR a.id = c.id
OR c.id = b.id
);
Update: it appears the optimiser does not like Cartesian joins with 3 OR conditions. The below query is a bit faster:
SELECT 'WTF!' AS danger_bill_robinson
WHERE exists (select 1 from a JOIN b USING (id))
OR exists (select 1 from a JOIN c USING (id))
OR exists (select 1 from c JOIN b USING (id))
;
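The pairwise EXISTS check can be wrapped in a tiny helper for experimentation. This is a sketch using Python's sqlite3; the function name has_shared_id and the sample id lists are hypothetical, and the query returns a single 1/0 flag instead of a literal string.

```python
import sqlite3

def has_shared_id(a_ids, b_ids, c_ids):
    """Return True if any id occurs in at least two of the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (id INTEGER);
        CREATE TABLE b (id INTEGER);
        CREATE TABLE c (id INTEGER);
    """)
    for table, ids in (("a", a_ids), ("b", b_ids), ("c", c_ids)):
        conn.executemany(f"INSERT INTO {table} VALUES (?)",
                         [(i,) for i in ids])
    # Each EXISTS can stop at the first matching pair.
    (flag,) = conn.execute("""
        SELECT EXISTS (SELECT 1 FROM a JOIN b USING (id))
            OR EXISTS (SELECT 1 FROM a JOIN c USING (id))
            OR EXISTS (SELECT 1 FROM c JOIN b USING (id))
    """).fetchone()
    conn.close()
    return bool(flag)

print(has_shared_id([1, 1, 2], [3], [4]))   # duplicates only inside a
print(has_shared_id([1, 2], [3], [2, 5]))   # 2 is in both a and c
```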

Issues with SQL Select utilizing Except and UNION All

Select *
From (
Select a
Except
Select b
) x
UNION ALL
Select *
From (
Select b
Except
Select a
) y
This sql statement returns an extremely wrong amount of data. If Select a returns a million, how does this entire statement return 100,000? In this instance, Select b contains mutually exclusive data, so there should be no elimination due to the except.
As already stated in the comment, EXCEPT does an implicit DISTINCT, according to this and the ALL in your UNION ALL cannot re-create the duplicates. Hence you cannot use your approach if you want to keep duplicates.
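The duplicate-removing behavior of EXCEPT is easy to see on a toy example. The sketch below uses Python's sqlite3 with hypothetical single-column tables; table a holds the value 1 twice, and b is disjoint from a.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (1), (2);
    INSERT INTO b VALUES (3);
""")

total_rows = conn.execute(
    "SELECT (SELECT count(*) FROM a) + (SELECT count(*) FROM b)"
).fetchone()[0]

# Same shape as the query in the question: EXCEPT in both directions,
# glued together with UNION ALL.
result = sorted(conn.execute("""
    SELECT * FROM (SELECT x FROM a EXCEPT SELECT x FROM b)
    UNION ALL
    SELECT * FROM (SELECT x FROM b EXCEPT SELECT x FROM a)
""").fetchall())
print(total_rows, result)
```

Although the tables share no values, the result has fewer rows than the inputs, because the duplicate 1 in a is collapsed by EXCEPT before UNION ALL ever sees it.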
As you want to get the data that is contained in exactly one of the tables a and b, but not in both, a more efficient way to achieve that would be the following (I am just assuming the tables have columns id and c where id is the primary key, as you did not state any column names):
SELECT CASE WHEN a.id IS NULL THEN 'from b' ELSE 'from a' END as source_table
,coalesce(a.id, b.id) as id
,coalesce(a.c, b.c) as c
FROM a
FULL OUTER JOIN b ON a.id = b.id AND a.c = b.c -- use all columns of both tables here!
WHERE a.id IS NULL OR b.id IS NULL
This makes use of a FULL OUTER JOIN, excluding the matching records via the WHERE conditions, as the primary key cannot be null except if it comes from the OUTER side.
If your tables do not have primary keys - which is bad practice anyway - you would have to check across all columns for NULL, not just the one primary key column.
And if you have records completely consisting of NULLs, this method would not work.
Then you could use an approach similar to your original one, just using
SELECT ...
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE <join by all columns>)
UNION ALL
SELECT ...
FROM b
WHERE NOT EXISTS (SELECT 1 FROM a WHERE <join by all columns>)
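A filled-in version of the NOT EXISTS template might look like the sketch below, run with Python's sqlite3. The columns id and c follow the assumption made earlier in this answer, the source-label column src is added for illustration, and the rows are invented. The plain equality comparison assumes no NULLs in the compared columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, c TEXT);
    CREATE TABLE b (id INTEGER, c TEXT);
    INSERT INTO a VALUES (1, 'p'), (1, 'p'), (2, 'q');
    INSERT INTO b VALUES (2, 'q'), (3, 'r');
""")

# Rows found in only one table; duplicates survive because neither
# NOT EXISTS nor UNION ALL removes them.
rows = sorted(conn.execute("""
    SELECT 'a' AS src, id, c
    FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id AND b.c = a.c)
    UNION ALL
    SELECT 'b', id, c
    FROM b
    WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.id = b.id AND a.c = b.c)
""").fetchall())
print(rows)
```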
If you're trying to get any data that is in one table and not in the other regardless of which table, I would try something like the following:
select id, 'table a data not in b' from a where id not in (select id from b)
union
select id, 'table b data not in a' from b where id not in (select id from a)

Efficient latest record query with Postgresql

I need to do a big query, but I only want the latest records.
For a single entry I would probably do something like
SELECT * FROM table WHERE id = ? ORDER BY date DESC LIMIT 1;
But I need to pull the latest records for a large number (thousands) of ids, and only the latest entry for each.
Here's what I have. It's not very efficient. I was wondering if there's a better way.
SELECT * FROM table a WHERE ID IN $LIST AND date = (SELECT max(date) FROM table b WHERE b.id = a.id);
If you don't want to change your data model, you can use DISTINCT ON to fetch the newest record from table "b" for each entry in "a":
SELECT DISTINCT ON (a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY a.id, b.date DESC
If you want to avoid a "sort" in the query, adding an index like this might help you, but I am not sure:
CREATE INDEX b_id_date ON b (id, date DESC);
SELECT DISTINCT ON (b.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY b.id, b.date DESC
Alternatively, if you want to sort records from table "a" some way:
SELECT DISTINCT ON (sort_column, a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY sort_column, a.id, b.date DESC
Alternative approaches
However, all of the above queries still need to read all referenced rows from table "b", so if you have lots of data, it might still just be too slow.
You could create a new table, which only holds the newest "b" record for each a.id -- or even move those columns into the "a" table itself.
This could be more efficient. The difference: the query on table b is executed only once, whereas your correlated subquery is executed for every row:
SELECT *
FROM table a
JOIN (SELECT ID, max(date) maxDate
FROM table
GROUP BY ID) b
ON a.ID = b.ID AND a.date = b.maxDate
WHERE a.ID IN $LIST
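A runnable sketch of this join-to-grouped-max pattern, using Python's sqlite3: the table name readings, the column val, the id list and the rows are all placeholders (the question's literal name table is a reserved word), and dates are stored as ISO-8601 text so that max() orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (id INTEGER, date TEXT, val TEXT);
    INSERT INTO readings VALUES
        (1, '2020-01-01', 'old'),
        (1, '2020-02-01', 'new'),
        (2, '2020-01-15', 'only'),
        (3, '2020-03-01', 'not requested');
""")

wanted_ids = [1, 2]  # stands in for $LIST
placeholders = ", ".join("?" for _ in wanted_ids)

# The grouped subquery runs once; the join keeps only each id's newest row.
rows = conn.execute(f"""
    SELECT r.id, r.date, r.val
    FROM readings r
    JOIN (SELECT id, max(date) AS max_date
          FROM readings
          GROUP BY id) m
      ON r.id = m.id AND r.date = m.max_date
    WHERE r.id IN ({placeholders})
    ORDER BY r.id
""", wanted_ids).fetchall()
print(rows)
```

If two rows of the same id share the maximum date, this pattern returns both of them.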
what do you think about this?
select * from (
SELECT a.*, row_number() over (partition by a.id order by date desc) r
FROM table a where ID IN $LIST
) x
WHERE r=1
I have used it a lot in the past.
One method: create a small derivative table containing the most recent update / insertion times on table a - call this table a_latest. Table a_latest will need sufficient granularity to meet your specific query requirements. In your case it should be sufficient to use
CREATE TABLE
a_latest
( id INTEGER NOT NULL,
date TIMESTAMP NOT NULL,
PRIMARY KEY (id, date) );
Then use a query similar to that suggested by najmeddine :
SELECT a.*
FROM a
JOIN a_latest USING ( id, date );
The trick then is keeping a_latest up to date. Do this using a trigger on insertions and updates. A trigger written in PL/pgSQL is fairly easy to write. I am happy to provide an example if you wish.
The point here is that computation of the latest update time is taken care of during the updates themselves. This shifts more of the load away from the query.
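To illustrate the idea of maintaining a_latest at write time, here is a sketch using SQLite's trigger syntax from Python, not PL/pgSQL, so it would need translating for PostgreSQL. It makes several simplifying assumptions: a_latest is keyed on id alone (one row per id), only inserts are handled, the trigger name and the column val are invented, and dates are ISO-8601 text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER NOT NULL, date TEXT NOT NULL, val TEXT);
    CREATE TABLE a_latest (id INTEGER PRIMARY KEY, date TEXT NOT NULL);

    -- Hypothetical trigger: remember the newest date seen for each id.
    CREATE TRIGGER a_track_latest AFTER INSERT ON a
    BEGIN
        INSERT OR IGNORE INTO a_latest (id, date)
            VALUES (NEW.id, NEW.date);
        UPDATE a_latest SET date = NEW.date
            WHERE id = NEW.id AND date < NEW.date;
    END;
""")

conn.executemany("INSERT INTO a VALUES (?, ?, ?)", [
    (1, '2020-01-01', 'old'),
    (1, '2020-03-01', 'new'),
    (1, '2020-02-01', 'mid'),   # arrives late, must not win
    (2, '2020-01-15', 'only'),
])

latest = conn.execute(
    "SELECT id, date FROM a_latest ORDER BY id").fetchall()
rows = conn.execute("""
    SELECT id, val
    FROM a JOIN a_latest USING (id, date)
    ORDER BY id
""").fetchall()
print(latest, rows)
```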
If you have many rows per id, you definitely want a correlated subquery.
It will make 1 index lookup per id, but this is faster than sorting the whole table.
Something like :
SELECT a.id,
(SELECT max(t.date) FROM table t WHERE t.id = a.id) AS lastdate
FROM table2 a;
The 'table2' you will use is not the table you mention in your query above, because here you need a list of distinct id's for good performance. Since your ids are probably FKs into another table, use this one.
You can use a NOT EXISTS subquery to answer this also. Essentially you're saying "SELECT record... WHERE NOT EXISTS(SELECT newer record)":
SELECT t.id FROM table t
WHERE NOT EXISTS
(SELECT * FROM table n WHERE t.id = n.id AND n.date > t.date)
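The NOT EXISTS formulation can be tried on a toy table like this. The sketch uses Python's sqlite3, with readings as a placeholder table name and invented rows; dates are ISO-8601 text so that the > comparison orders them chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (id INTEGER, date TEXT);
    INSERT INTO readings VALUES
        (1, '2020-01-01'),
        (1, '2020-02-01'),
        (2, '2020-01-15');
""")

# Keep a row only if no newer row exists for the same id.
rows = conn.execute("""
    SELECT t.id, t.date
    FROM readings t
    WHERE NOT EXISTS (SELECT 1
                      FROM readings n
                      WHERE n.id = t.id AND n.date > t.date)
    ORDER BY t.id
""").fetchall()
print(rows)
```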