Fast approximate counting in Postgres - sql

I'm querying my database (Postgres 8.4) with something like the following:
SELECT COUNT(*) FROM table WHERE indexed_varchar LIKE 'bar%';
The complexity of this is O(N) because Postgres has to count each row. Postgres 9.2 has index-only scans, but upgrading isn't an option, unfortunately.
However, getting an exact count of rows seems like overkill, because I only need to know which of the following three cases is true:
Query returns no rows.
Query returns one row.
Query returns two or more rows.
So I don't need to know that the query returns 10,421 rows, just that it returns two or more.
I know how to handle the first two cases:
SELECT EXISTS (SELECT 1 FROM table WHERE indexed_varchar LIKE 'bar%');
Which will return true if one or more rows exist and false if none exist.
Any ideas on how to expand this to cover all three cases in an efficient manner?

SELECT COUNT(*) FROM (
  SELECT * FROM table WHERE indexed_varchar LIKE 'bar%' LIMIT 2
) t;

Should be simple. You can use LIMIT to do what you want and return the count using a CASE expression.
SELECT CASE WHEN c = 2 THEN 'two or more' ELSE CAST(c AS TEXT) END
FROM
(
  SELECT COUNT(*) AS c
  FROM (
    SELECT 1 FROM table WHERE indexed_varchar LIKE 'bar%' LIMIT 2
  ) t
) v
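The capped-count trick works in any SQL engine; here is a small runnable sketch using Python's sqlite3 (the table name, column name, and data are invented for the demonstration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (indexed_varchar TEXT)")
con.executemany("INSERT INTO t VALUES (?)",
                [("bar1",), ("bar2",), ("bar3",), ("baz",)])

def count_capped_at_two(prefix):
    # COUNT over a LIMIT 2 subquery: the scan stops after two matches,
    # so the result is 0, 1, or 2 (meaning "two or more").
    row = con.execute(
        "SELECT COUNT(*) FROM "
        "(SELECT 1 FROM t WHERE indexed_varchar LIKE ? LIMIT 2)",
        (prefix + "%",)).fetchone()
    return row[0]

print(count_capped_at_two("bar"))   # 2 (three rows match, but the count caps at 2)
print(count_capped_at_two("baz"))   # 1
print(count_capped_at_two("qux"))   # 0
```

The engine never counts past the second matching row, which is exactly the property the question asks for.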

Related

How to optimize CTE query

I have the following query (simplified for the sake of this topic):
WITH CTE
(
    Columns,
    DeliverDate,
    LastReplayDate
)
AS
(
    SELECT IIF(LastReplayDate IS NULL,
               IIF(LastReplayDate >= DeliverDate, LastReplayDate, DeliverDate),
               LastReplayDate) AS SortDateColumn,
           R.*
    FROM
    (
        SELECT
            Columns,
            DeliverDate,
            LastReplayDate
        FROM MY_TABLE
        WHERE
            CONDITIONS
            AND (FIRST_HEAVY_FUNCTION)
            AND (SECOND_HEAVY_FUNCTION)
    ) R
    ORDER BY SortDateColumn DESC
    OFFSET (#CurrentPageIndex - 1) * #PageSize ROWS
    FETCH NEXT 10 ROWS ONLY
)
SELECT CTE.*
FROM CTE
OPTION (RECOMPILE);
As you can see, the CTE wraps sorted data from an inner query, which is paged ten rows at a time.
The most problematic for me here is part:
WHERE
CONDITIONS
AND ( FIRST_HEAVY_FUNCTION)
AND ( SECOND_HEAVY_FUNCTION)
Because of those conditions, the response time sometimes reaches 4 minutes or so. Without them it's fairly quick (8-20 seconds).
Of course indexes were created, which brought the query down from 15 minutes. But it is still too slow.
I was wondering: is it possible to move the problematic conditions outside the CTE and still get all 10 rows from the paging? In the case where the row count is < 10, another pass could collect the missing rows so that the final result is exactly 10 rows.
Is that possible? How else could such a query be optimized?
A couple of questions:
Can you rewrite the functions as inline table-valued functions? That can alleviate the pressure.
Do the CONDITIONS alone remove enough rows, so that the HEAVY_FUNCTIONs are just the cherry on top? In that case you might be able to fake a push-down (but I wouldn't guarantee it always works):
SELECT *
FROM (
    SELECT *
    FROM MY_TABLE
    WHERE CONDITIONS
) x
CROSS APPLY (
    SELECT 1 AS test
    WHERE dbo.FN_SLOW_FUNCTION(x.param1, x.param2) = 1
) filter1
CROSS APPLY (
    SELECT 1 AS test
    WHERE dbo.FN_ANOTHER_SLOW_FUNCTION(x.param3, x.param4) = 1
) filter2
ORDER BY SomeDate OFFSET ...
There's a query hint, OPTION (FORCE ORDER), which might keep SQL Server from rearranging the plan too much.
Are your indexes good?

Does PostgreSQL short-circuit its BOOL_OR() evaluation?

EXISTS is faster than COUNT(*) because it can be short-circuited
A lot of times, I like to check for existence of things in SQL. For instance, I do:
-- PostgreSQL syntax, SQL standard syntax:
SELECT EXISTS (SELECT .. FROM some_table WHERE some_boolean_expression)
-- Oracle syntax
SELECT CASE
WHEN EXISTS (SELECT .. FROM some_table WHERE some_boolean_expression) THEN 1
ELSE 0
END
FROM dual
In most databases, EXISTS is "short-circuited", i.e. the database can stop looking for rows in the table as soon as it has found one row. This is usually much faster than comparing COUNT(*) >= 1 as can be seen in this blog post.
Using EXISTS with GROUP BY
Sometimes, I'd like to do this for each group in a GROUP BY query, i.e. I'd like to "aggregate" the existence value. There's no EXISTS aggregate function, but PostgreSQL luckily supports the BOOL_OR() aggregate function, like in this statement:
SELECT something, BOOL_OR(some_boolean_expression)
FROM some_table
GROUP BY something
The documentation mentions something about COUNT(*) being slow because of the obvious sequential scan needed to calculate the count. But unfortunately, it doesn't say anything about BOOL_OR() being short-circuited. Is it the case? Does BOOL_OR() stop aggregating new values as soon as it encounters the first TRUE value?
If you want to check for existence, I'm generally using a LIMIT/FETCH FIRST 1 ROW ONLY query:
SELECT .. FROM some_table WHERE some_boolean_expression
FETCH FIRST 1 ROW ONLY
This generally stops execution after the first hit.
The same technique can be applied using LATERAL for each row (group) from another table.
SELECT *
FROM (
    SELECT something
    FROM some_table
    GROUP BY something
) t1
LEFT JOIN LATERAL (
    SELECT ...
    FROM ...
    WHERE ...
    FETCH FIRST 1 ROW ONLY
) t2 ON (true)
In t2 you can use a WHERE clause that matches any row for the group. It's executed only once per group and aborted as soon as the first hit was found. However, whether this performs better or worse depends on your search predicates and indexing, of course.
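SQLite has no LATERAL or BOOL_OR, but the same per-group existence idea can be sketched with a correlated EXISTS, which is likewise short-circuited per group. Here's a runnable illustration in Python's sqlite3, with an invented orders table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("alice", 50), ("alice", 900), ("bob", 10)])

# For each customer (group), does any order exceed 100?  The correlated
# EXISTS stops scanning that customer's rows at the first qualifying order.
rows = con.execute("""
    SELECT g.customer,
           EXISTS (SELECT 1 FROM orders o
                   WHERE o.customer = g.customer AND o.amount > 100) AS has_big
    FROM (SELECT DISTINCT customer FROM orders) g
    ORDER BY g.customer
""").fetchall()
print(rows)   # [('alice', 1), ('bob', 0)]
```

This is the aggregate-free equivalent of `BOOL_OR(amount > 100) ... GROUP BY customer`; whether the planner actually short-circuits the aggregate form is exactly what the question above asks.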

Efficiently determine if any rows satisfy a predicate in Postgres

I'd like to query the database as to whether or not one or more rows exist that satisfy a given predicate. However, I am not interested in the distinction between there being one such row, two rows or a million - just if there are 'zero' or 'one or more'. And I do not want Postgres to waste time producing an exact count that I do not need.
In DB2, I would do it like this:
SELECT 1 FROM SYSIBM.SYSDUMMY1 WHERE EXISTS
(SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE')
and then checking if zero rows or one row was returned from the query.
But Postgres has no dummy table available, so what is the best option?
If I create a one-row dummy table myself and use that in place of SYSIBM.SYSDUMMY1, will the query optimizer be smart enough to not actually read that table when running the query, and otherwise 'do the right thing'?
PostgreSQL doesn't have a dummy table because you don't need one.
SELECT 1 WHERE EXISTS
(SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE')
Alternatively if you want a true/false answer:
SELECT EXISTS(SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE')
How about just doing this?
SELECT (CASE WHEN EXISTS (SELECT 1 FROM REAL_TABLE WHERE COLUMN = 'VALUE') THEN 1 ELSE 0 END)
1 means there is a value. 0 means no value.
This will always return one row.
If you are happy with "no row" if no row matches, you can even just:
SELECT 1 FROM real_table WHERE column = 'VALUE' LIMIT 1;
Performance is basically the same as with EXISTS. Key to performance for big tables is a matching index.
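Both forms (the true/false EXISTS and the "row or no row" LIMIT 1) can be sketched in a runnable way with Python's sqlite3; the table is invented, and the column is named `col` rather than `column` to avoid the reserved word:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE real_table (col TEXT)")
con.execute("INSERT INTO real_table VALUES ('VALUE')")

# True/false answer: always exactly one row, containing 1 or 0.
exists = con.execute(
    "SELECT EXISTS (SELECT 1 FROM real_table WHERE col = ?)",
    ("VALUE",)).fetchone()[0]

# "Row or no row" answer: fetchone() returns None when nothing matches.
hit = con.execute(
    "SELECT 1 FROM real_table WHERE col = ? LIMIT 1",
    ("MISSING",)).fetchone()

print(exists)  # 1
print(hit)     # None
```

The caller chooses between checking a boolean value and checking whether any row came back at all; the work the database does is essentially the same.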

Can someone explain this query

here is the query
SELECT * FROM customers
WHERE
NOT EXISTS
(
SELECT 1 FROM brochure_requests
WHERE brochure_requests.first_name = customers.customer_first_name AND
brochure_requests.last_name = customers.customer_last_name
)
This query works just fine, but I am not sure why. In the NOT EXISTS part there is a SELECT 1 — what is the 1 for? When I ran this query:
select 1 from test2
Here were the results:
1
-----
1
1
1
1
1
1
1
1
1
1
1
..
How does the not exists query work?
The compiler is smart enough to ignore the actual SELECT list in an EXISTS. Basically, all it cares about is whether rows WOULD be returned because the filters match; the SELECT portion of the EXISTS never executes. The EXISTS clause is used for evaluation purposes only.
I had this misconception for quite some time, since you see SELECT 1 a lot. But I have seen 42, *, etc. It never actually cares about the result, only that there would be one :). The key thing to keep in mind is that SQL is a compiled language, so it will optimize this appropriately.
You could even put 1/0 there and it will not throw a divide-by-zero exception, further proving that the select list is not evaluated. This is shown in this SQLFiddle.
Code from Fiddle:
CREATE TABLE test (i int)
CREATE TABLE test2 (i int)
INSERT INTO test VALUES (1)
INSERT INTO test2 VALUES (1)
SELECT i
FROM test
WHERE EXISTS
(
SELECT 1/0
FROM test2
WHERE test2.i = test.i
)
And finally, more to your point, NOT simply negates the EXISTS, saying to ignore any rows that match.
The subquery is a correlated subquery joining between the customers and brochure_requests tables on the selected fields.
The EXISTS clause is simply a predicate that will only return the matching rows (and the NOT negates that).
The query :
select 1 from test2
shows you the value 1 as the value for all the records in test2 table.
Every SELECT query must have at least one column. I think that's why an unnamed column, which has the value 1, is used here.
The sub-query gives you the rows of the related Customers from the table brochure_requests.
NOT EXISTS causes the main query to return all the rows from the Customers table, which are not in the table brochure_requests.
The relational operator in question is known as 'antijoin' (alternatively 'not match' or 'semi difference'). In natural language: customers who do not match brochure_requests using the common attributes first_name and last_name.
A closely related operator is relational difference (alternatively 'minus' or 'except') e.g. in SQL
SELECT customer_last_name, customer_first_name
FROM customers
EXCEPT
SELECT last_name, first_name
FROM brochure_requests;
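The anti-join and the EXCEPT formulation can be compared side by side in a runnable sketch using Python's sqlite3 (the sample data is invented to mirror the question's tables):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers "
            "(customer_first_name TEXT, customer_last_name TEXT)")
con.execute("CREATE TABLE brochure_requests (first_name TEXT, last_name TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [("Ann", "Lee"), ("Bob", "Ray")])
con.execute("INSERT INTO brochure_requests VALUES ('Bob', 'Ray')")

# Anti-join: customers with no matching brochure request.
anti = con.execute("""
    SELECT customer_first_name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM brochure_requests b
                      WHERE b.first_name = c.customer_first_name
                        AND b.last_name  = c.customer_last_name)
""").fetchall()

# EXCEPT produces the same customers, on the projected name columns.
diff = con.execute("""
    SELECT customer_first_name, customer_last_name FROM customers
    EXCEPT
    SELECT first_name, last_name FROM brochure_requests
""").fetchall()

print(anti)  # [('Ann',)]
print(diff)  # [('Ann', 'Lee')]
```

Note one difference in general: NOT EXISTS keeps whole customer rows (and duplicates), while EXCEPT operates only on the projected columns and removes duplicates.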
If a customer requested a brochure, the subquery returns 1 for that customer, and because of the NOT EXISTS clause that customer is not added to the result set.
Note: I don't know Oracle, and am not especially expert in SQL.
However, SELECT 1 FROM simply returns a 1 for every row matching the FROM clause. So if the inner select can find a brochure_requests row whose name fields match those of the customer row currently being considered, it will produce a 1 result and fail the NOT EXISTS.
Hence the query selects all customers who do not have a brochure_request matching their name.
For each row of the Customers table, the query returns the row when the sub-query in NOT EXISTS returns no rows.
If the sub-query returns rows, the corresponding Customers row is not returned.

Assistance with SQL statement

I'm using sql-server 2005 and ASP.NET with C#.
I have Users table with
userId(int),
userGender(tinyint),
userAge(tinyint),
userCity(tinyint)
(simplified version of course)
For a userId I pass to the query, I always need to select two users of the opposite gender, in an age range of -5 to +10 years, and from the same city.
The important fact is that it always must be two, so I created a condition: if @@ROWCOUNT < 2, re-select without the age and city filters.
Now the problem is that I sometimes get two result sets back, because the first select runs before the @@ROWCOUNT check.
Will it be a problem to use the DataReader object to always read from the second result set? Is there any other way to check how many rows were selected without performing a select that returns results?
Can you simplify it by using SELECT TOP 2 ?
Update: I would perform both selects all the time, union the results, and then select from them based on an order (using SELECT TOP 2), as the union may have produced more than two rows. It's important that this next select takes rows in order of importance, i.e. it prefers rows from your first select.
Alternatively, have the reader logic read the next result-set if there is one and leave the SQL alone.
To avoid getting two separate result sets you can do your first SELECT into a table variable and then do your ##ROWCOUNT check. If >= 2 then just select from the table variable on its own otherwise select the results of the table variable UNION ALLed with the results of the second query.
Edit: There is a slight overhead to using table variables, so you'd need to balance whether this is cheaper than Adam's suggestion of just performing the UNION as a matter of routine, by looking at the execution stats for both approaches:
SET STATISTICS IO ON
Would something along the following lines be of use...
SELECT *
FROM (SELECT 1 AS prio, M2.*
      FROM my_table M1 CROSS JOIN my_table M2
      WHERE M1.userID = supplied_user_id AND
            M1.userGender <> M2.userGender AND
            M2.userAge >= M1.userAge - 5 AND
            M2.userAge <= M1.userAge + 10 AND
            M1.userCity = M2.userCity
      LIMIT TO 2 ROWS
      UNION
      SELECT 2 AS prio, M2.*
      FROM my_table M1 CROSS JOIN my_table M2
      WHERE M1.userID = supplied_user_id AND
            M1.userGender <> M2.userGender
      LIMIT TO 2 ROWS)
ORDER BY prio
LIMIT TO 2 ROWS;
I haven't tried it as I have no SQL Server and there may be dialect issues.
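Along the same lines, the prefer-strict-matches-then-fall-back pattern can be made concrete in a runnable sketch with Python's sqlite3 (schema and data invented; taking MIN(prio) per user avoids returning the same user twice when they qualify under both branches):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users "
            "(userId INTEGER, userGender INTEGER, userAge INTEGER, userCity INTEGER)")
con.executemany("INSERT INTO users VALUES (?, ?, ?, ?)",
                [(1, 0, 30, 5),   # the user we match against
                 (2, 1, 28, 5),   # strict match (age range and city)
                 (3, 1, 60, 9),   # opposite gender only
                 (4, 1, 61, 9)])  # opposite gender only

# Strict candidates get prio 1, opposite-gender-only candidates prio 2;
# MIN(prio) deduplicates, ORDER BY prio prefers strict matches, LIMIT 2 caps.
rows = con.execute("""
    SELECT userId FROM (
        SELECT userId, MIN(prio) AS prio FROM (
            SELECT M2.userId AS userId, 1 AS prio
            FROM users M1, users M2
            WHERE M1.userId = ? AND M1.userGender <> M2.userGender
              AND M2.userAge BETWEEN M1.userAge - 5 AND M1.userAge + 10
              AND M2.userCity = M1.userCity
            UNION
            SELECT M2.userId, 2
            FROM users M1, users M2
            WHERE M1.userId = ? AND M1.userGender <> M2.userGender
        )
        GROUP BY userId
    )
    ORDER BY prio, userId
    LIMIT 2
""", (1, 1)).fetchall()
print(rows)   # [(2,), (3,)] - the strict match first, then one fallback
```

The secondary ORDER BY key (userId) is only there to make the tie-break among fallback rows deterministic; in a real application you would order by whatever "importance" means for your data.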