Efficiently determine if any rows satisfy a predicate in Postgres - sql

I'd like to query the database as to whether or not one or more rows exist that satisfy a given predicate. However, I am not interested in the distinction between there being one such row, two rows, or a million - just whether there are 'zero' or 'one or more'. And I do not want Postgres to waste time producing an exact count that I do not need.
In DB2, I would do it like this:
SELECT 1 FROM SYSIBM.SYSDUMMY1 WHERE EXISTS
(SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE')
and then check whether zero rows or one row was returned from the query.
But Postgres has no dummy table available, so what is the best option?
If I create a one-row dummy table myself and use that in place of SYSIBM.SYSDUMMY1, will the query optimizer be smart enough to not actually read that table when running the query, and otherwise 'do the right thing'?

PostgreSQL doesn't have a dummy table because you don't need one.
SELECT 1 WHERE EXISTS
(SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE')
Alternatively, if you want a true/false answer:
SELECT EXISTS(SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE')
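For example, with a hypothetical orders table (the table and column names here are made up for illustration, not taken from the question):
SELECT EXISTS(SELECT 1 FROM orders WHERE customer_id = 42) AS has_rows;
This always returns exactly one row with a single boolean column: true if at least one row matches, false otherwise.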

How about just doing this?
SELECT (CASE WHEN EXISTS (SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE') THEN 1 ELSE 0 END)
1 means there is a value. 0 means no value.
This will always return one row.

If you are happy with "no row" if no row matches, you can even just:
SELECT 1 FROM real_table WHERE column = 'VALUE' LIMIT 1;
Performance is basically the same as with EXISTS. The key to performance on big tables is a matching index.
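For example, a minimal sketch, assuming the filtered column is really named col (COLUMN in the question is presumably a placeholder, since it is a reserved word):
CREATE INDEX real_table_col_idx ON real_table (col);
With such an index, both the EXISTS form and the LIMIT 1 form can stop as soon as the first matching index entry is found.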

Related

I am using multiple queries to determine which of a set of filters generates an empty table. Is there a performant way to do this with a single query?

Toy example of the problem I am facing.
Suppose I have a collection of filterClauses and a single underlying table. I want to figure out which of these filterClauses fully filter out the entire table and which ones do not.
I am currently performing many of these SQL queries (one for each filterClause):
SELECT CASE WHEN NOT EXISTS
(SELECT * FROM table
WHERE {filterClause})
THEN 1 ELSE 0 END
Is there a better way to group all of these calls into a single query and get back a result set that maps each filterClause to whether it filtered out the entire table? I've considered approaches with CASE statements and UNION statements, but am wondering if there are more efficient ways.
As an additional problem (optional), for each of the clauses that did not produce an empty table, I want to check certain things about the resulting table (e.g., whether a certain column has strictly positive values, or whether all remaining rows are non-null for all columns). With the single-query-per-filter-clause approach, I could do these checks on the individually filtered tables. Any suggestions for how I can perform these checks at the batch level too?
Thanks in advance!
As to checking the effect of the filters: compare the table's total row count with the count of rows matching each filter:
select
count(*) as totalcount,
count(case when {filterClause1} then 1 end) as filter1count,
count(case when {filterClause2} then 1 end) as filter2count,
...
from table;
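If you are on PostgreSQL (as in the parent question), the same conditional counts can also be written with the FILTER clause, available since 9.4 (a sketch using the same placeholders):
select
count(*) as totalcount,
count(*) filter (where {filterClause1}) as filter1count,
count(*) filter (where {filterClause2}) as filter2count
from table;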
As to further narrowing that down in a batch: No idea right now. Sorry.
I'm not sure I got your point but here is what I think you need:
SELECT CASE WHEN NOT EXISTS
(SELECT * FROM table
WHERE {filterClause1})
THEN 1 ELSE 0 END as filterLabel1,
CASE WHEN NOT EXISTS
(SELECT * FROM table
WHERE {filterClause2})
THEN 1 ELSE 0 END as filterLabel2,...
This should give you a map of which conditions are true and which are false. If you want to verify more conditions in case one is true, I only have one idea, and it only allows you to add one more verification:
SELECT CASE WHEN NOT EXISTS
(SELECT * FROM table
WHERE {filterClause1})
THEN
(
Select CASE WHEN
{subFilterClause}
THEN 1 ELSE 0 END
)
ELSE 0 END as filterLabel1
Obviously I do not recommend going this deep; it's better to have a procedure, either on the SQL side or the application side, to get it done for you.
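As for the UNION approach mentioned in the question, a sketch that returns one row per filter instead of one wide row could look like this (same placeholders as above; depending on your database you may need a dummy FROM clause such as FROM DUAL):
SELECT 'filterLabel1' AS filter_label,
CASE WHEN NOT EXISTS (SELECT 1 FROM table WHERE {filterClause1}) THEN 1 ELSE 0 END AS is_empty
UNION ALL
SELECT 'filterLabel2',
CASE WHEN NOT EXISTS (SELECT 1 FROM table WHERE {filterClause2}) THEN 1 ELSE 0 END;
Either way, you still pay for one EXISTS probe per filter; whether this beats the wide single-row version is something you would have to measure.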

Can someone explain this query

Here is the query:
SELECT * FROM customers
WHERE
NOT EXISTS
(
SELECT 1 FROM brochure_requests
WHERE brochure_requests.first_name = customers.customer_first_name AND
brochure_requests.last_name = customers.customer_last_name
)
This query works just fine, but I am not sure why it works. In the NOT EXISTS part, what is the SELECT 1 for? When I ran this query:
select 1 from test2
Here were the results:
1
-----
1
1
1
1
1
1
1
1
1
1
1
..
How does the not exists query work?
The compiler is smart enough to ignore the actual SELECT in an EXISTS. So, basically, if it WOULD return rows because the filters match, that is all it cares about...the SELECT portion of the EXISTS never executes. It only uses the EXISTS clause for evaluation purposes.
I had this misconception for quite some time, since you will see this SELECT 1 a lot. But I have seen 42, *, etc....It never actually cares about the result, only that there would be one :). The key thing to keep in mind is that SQL is a compiled language, so it will optimize this appropriately.
You could put a 1/0 and it will not throw a divide-by-zero exception...thus further proving that the result set is not evaluated. This is shown in this SQLFiddle
Code from Fiddle:
CREATE TABLE test (i int)
CREATE TABLE test2 (i int)
INSERT INTO test VALUES (1)
INSERT INTO test2 VALUES (1)
SELECT i
FROM test
WHERE EXISTS
(
SELECT 1/0
FROM test2
WHERE test2.i = test.i
)
And finally, more to your point, the NOT simply negates the EXISTS, saying to IGNORE any rows that match.
The subquery is a correlated subquery, joining the customers and brochure_requests tables on the selected fields.
The EXISTS clause is simply a predicate that will only return the matching rows (and the NOT negates that).
The query :
select 1 from test2
shows you the value 1 as the value for all the records in test2 table.
Every SELECT query must have at least one column. I think that's why an unnamed column, which has the value 1, is used here.
The sub-query looks for rows in the brochure_requests table that are related to the current row of the Customers table.
NOT EXISTS causes the main query to return all the rows from the Customers table that are not matched in the brochure_requests table.
The relational operator in question is known as 'antijoin' (alternatively 'not match' or 'semi difference'). In natural language: customers who do not match brochure_requests using the common attributes first_name and last_name.
A closely related operator is relational difference (alternatively 'minus' or 'except'), e.g. in SQL:
SELECT customer_last_name, customer_first_name
FROM customers
EXCEPT
SELECT last_name, first_name
FROM brochure_requests;
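Another common way to spell an antijoin is a LEFT JOIN that keeps only the unmatched rows (a sketch on the same tables; note that NOT EXISTS, EXCEPT, and this form can behave differently when the name columns contain NULLs):
SELECT c.*
FROM customers c
LEFT JOIN brochure_requests b
ON b.first_name = c.customer_first_name
AND b.last_name = c.customer_last_name
WHERE b.first_name IS NULL;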
If a customer requested a brochure, the subquery returns 1 for that customer, and because of the NOT EXISTS clause, that customer is not added to the returned result set.
Note: I don't know Oracle, and am not especially expert in SQL.
However, SELECT 1 FROM simply returns a 1 for every row matching the FROM and WHERE clauses. So if the inner select can find a brochure_requests row whose name fields match those of the customer row currently being considered, it will produce a 1 result and fail the NOT EXISTS.
Hence the query selects all customers who do not have a brochure_request matching their name.
For each row of the Customers table, the query returns the row when the sub-query inside NOT EXISTS returns no rows. If the sub-query returns rows, then the corresponding Customers row is not returned.

Conditional ORDER BY depending on column values

I need to write a query that does this:
SELECT TOP 1
FROM a list of tables (Joins, etc)
ORDER BY Column X, Column Y, Column Z
If ColumnX is NOT NULL, then at the moment I reselect using a slightly different ORDER BY.
So I run the same query twice. If the first one has a NULL in a certain column, I return that row from my procedure. However, if the value isn't NULL, I have to do another identical select, except ordered by a different column or two.
What I do now is select it into a temp table the first time. Then check the value of the column. If it's OK, return the temp table, else, redo the select and return that result set.
More details:
In English, the question I am asking the database is:
Return me all the results for a certain court appearance (by indexed foreign key). I expect around 1000 rows. Order them by the date of the appearance (a nullable column, not indexed), last appearance first. Then check an 'importId'. If the import ID is not NULL for that top 1 row, then we need to run the same query again, but this time order by the import ID (last one first), and return that row. Otherwise, just return the top 1 row from the original query.
I'd say the BEST way to do this in a single query is a CASE statement...
SELECT TOP 1 * FROM ... ORDER BY
(CASE WHEN column1 IS NULL THEN column2 ELSE column1 END)
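One caveat: the branches of the CASE must return type-compatible values, so you cannot mix, say, an integer import ID and a date in one expression without casting. A sketch with hypothetical, type-compatible columns (the names are invented for illustration):
SELECT TOP 1 *
FROM my_table
ORDER BY CASE WHEN preferred_date IS NULL THEN fallback_date ELSE preferred_date END DESC;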
You could use a COALESCE function to turn nullable columns into ORDER BY-friendly values.
SELECT CAST(COALESCE(MyColumn, 0) AS money) AS Column1
FROM MyTable
ORDER BY Column1;
I used this in Firebird (the columns are numeric):
ORDER BY CASE <condition> WHEN <value> THEN <column1>*1000 + <column2> ELSE <column3>*1000 + <column4> END

SQL select performance top 1 VS select 1

select 1 from someTable where someColumn = #
or
select top 1 someColumn1 from someTable where someColumn2 = #
Which one will be faster on a large-scale table? I've got no indexes at all on that table, so that won't work. Thanks.
The first one selects one column with the literal value 1 (a number) for as many rows as match, while the second returns an actual column, but only for the first row.
It is not possible to compare their performance, since they are doing different things.
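To make the difference concrete (a sketch; the table, columns, and the value 5 are hypothetical):
-- returns the literal 1 once per matching row (three matches -> three rows of 1)
select 1 from someTable where someColumn = 5
-- returns the value of someColumn1 from at most one matching row
select top 1 someColumn1 from someTable where someColumn2 = 5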

Limit number of rows from join, in Oracle

I apologize in advance for my long-winded question and if the formatting isn't up to par (newbie); here goes.
I have a table MY_TABLE with the following schema -
MY_ID | TYPE | REC_COUNT
1 | A | 1
1 | B | 3
2 | A | 0
2 | B | 0
....
The first column corresponds to an ID, the second is some type, and the third is some count. NOTE that the MY_ID column is not the primary key; there could be many records having the same MY_ID.
I want to write a stored procedure which will take an array of IDs and return the subset of them that match the following criteria -
the ID should match the MY_ID field of at least 1 record in the table and at least 1 matching record should not have TYPE = A OR REC_COUNT = 0.
This is the procedure I came up with -
PROCEDURE get_id_subset(
iIds IN ID_ARRAY,
oMatchingIds OUT NOCOPY ID_ARRAY
)
IS
BEGIN
SELECT t.column_value
BULK COLLECT INTO oMatchingIds
FROM TABLE(CAST(iIds AS ID_ARRAY)) t
WHERE EXISTS (
SELECT /*+ NL_SJ */ 1
FROM MY_TABLE m
WHERE (m.my_id = t.column_value)
AND (m.type != 'A' OR m.rec_count != 0)
);
END get_id_subset;
But I really care about performance, and some IDs could match thousands of records in the table. There is an index on the MY_ID and TYPE columns, but no index on the REC_COUNT column. So I was thinking that if there are more than 1000 rows with a matching MY_ID field, then I'll just return the ID without applying the TYPE and REC_COUNT predicates. Here's this version -
PROCEDURE get_id_subset(
iIds IN ID_ARRAY,
oMatchingIds OUT NOCOPY ID_ARRAY
)
IS
BEGIN
SELECT t.column_value
BULK COLLECT INTO oMatchingIds
FROM TABLE(CAST(iIds AS ID_ARRAY)) t, MY_TABLE m
WHERE (m.my_id = t.column_value)
AND ( ((SELECT COUNT(m.my_id) FROM m WHERE 1) >= 1000)
OR EXISTS (m.type != 'F' OR m.rec_count != 0)
);
END get_id_subset;
But this doesn't compile; I get the following error on the inner select -
PL/SQL: ORA-00936: missing expression
Is there another way of writing this? The inner select needs to work on the joined table.
And to clarify, I'm OK with the result set being different for this query. My assumption is that since there is an index on the my_id column, doing count(*) would be much cheaper than actually applying the rec_count predicate to tens of thousands of rows, since there is no index on that column. Am I wrong?
I don't see your second query as being much, if any, improvement over the first. At best, the first subquery has to hit 1000 matching records in order to determine whether the count is less than 1000, so I don't think it will save a lot of work. Also, it changes the actual result, and it's not clear from your description whether you're saying that's OK as long as it's more efficient. (And if it is OK, then the business logic is very unclear -- why do the other conditions matter at all, if they don't matter when there are lots of records?)
You ask, "will the group by be applied before or after the predicate". I'm not clear on which part of the query you're talking about, but logically speaking the order is always:
Where predicates
Group By
Having predicates
The optimizer can change the order in which things are actually evaluated, but the result must always be logically equivalent to the above order of evaluation (barring optimizer bugs).
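For example, a sketch on a hypothetical table t(grp, val):
SELECT grp, COUNT(*) AS cnt
FROM t
WHERE val > 0 -- row-level predicate, applied first
GROUP BY grp -- grouping happens next
HAVING COUNT(*) > 10; -- group-level predicate, applied last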
1000s of records is really not that much. Have you actually encountered a case where performance of the first query is unacceptable?
For either query, it may be better to rewrite the correlated EXISTS subquery as a non-correlated IN subquery. You need to test this.
You need to show actual execution plans to get more useful feedback.
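For reference, the non-correlated IN rewrite of the first procedure's query would look something like this (a sketch; untested, as noted above):
SELECT t.column_value
BULK COLLECT INTO oMatchingIds
FROM TABLE(CAST(iIds AS ID_ARRAY)) t
WHERE t.column_value IN (
SELECT m.my_id
FROM MY_TABLE m
WHERE m.type != 'A' OR m.rec_count != 0
);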
Edit
For the kind of short-circuiting you're talking about, I think you need to rewrite your subquery (from the initial version of the query) like this (sorry, my first attempt at this wouldn't work because I tried to access a column from the top-level table in a sub-sub-query):
WHERE EXISTS (
SELECT /*+ NL_SJ */ 1
FROM MY_TABLE m
WHERE (m.my_id = t.column_value)
AND rownum <= 1000
HAVING MAX( CASE WHEN m.type != 'A' OR m.rec_count != 0 THEN 1 ELSE NULL END ) IS NOT NULL
OR MAX(rownum) >= 1000
)
That should force it to hit no more than 1,000 records per id, then return a row if either at least one row matches the conditions on type and rec_count, or the 1,000-record limit was reached. If you view the execution plan, you should expect to see a COUNT STOPKEY operation, which shows that Oracle is going to stop running a query block after a certain number of rows are returned.