Should I always prefer EXISTS over COUNT() > 0 in SQL? - sql

I often encounter the advice that, when checking for the existence of any rows from a (sub)query, one should use EXISTS instead of COUNT(*) > 0, for reasons of performance. Specifically, the former can short-circuit and return TRUE (or FALSE in the case of NOT EXISTS) after finding a single row, while COUNT needs to actually evaluate each row in order to return a number, only to be compared to zero.
This all makes perfect sense to me in simple cases. However, I recently ran into a problem where I needed to filter groups in the HAVING clause of a GROUP BY, based on whether all values in a certain column of the group were NULL.
For the sake of clarity, let's see an example. Let's say I have the following schema:
CREATE TABLE profile(
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    google_account_id INTEGER NULL,
    facebook_account_id INTEGER NULL,
    FOREIGN KEY (user_id) REFERENCES user(id),
    CHECK(
        (google_account_id IS NOT NULL) + (facebook_account_id IS NOT NULL) = 1
    )
)
I.e. each user (table not shown for brevity) has 0 or more profiles. Each profile is either a Google or a Facebook account. (This is the translation of subclasses or a sum type with some associated data — in my real schema, the account IDs are also foreign keys to different tables holding that associated data, but this is not relevant to my question.)
Now, say I wanted to count the Facebook profiles for all users who do NOT have any Google profiles.
At first, I wrote the following query using COUNT() = 0:
SELECT user_id, COUNT(facebook_account_id)
FROM profile
GROUP BY user_id
HAVING COUNT(google_account_id) = 0;
But then it occurred to me that the condition in the HAVING clause is actually just an existence check. So I then re-wrote the query using a subquery and NOT EXISTS:
SELECT user_id, COUNT(facebook_account_id)
FROM profile AS p
GROUP BY user_id
HAVING NOT EXISTS (
    SELECT 1
    FROM profile AS q
    WHERE p.user_id = q.user_id
      AND q.google_account_id IS NOT NULL
)
My question is two-fold:
Should I keep the second, re-formulated query, and use NOT EXISTS with a subquery instead of COUNT() = 0? Is this really more efficient? I reckon that the index lookup required by the WHERE p.user_id = q.user_id condition has some additional cost. Whether that additional cost is outweighed by the short-circuiting behavior of EXISTS could well depend on the average cardinality of the groups, could it not?
Or could the DBMS perhaps be smart enough to recognize the fact that the grouping key is being compared against, and optimize this subquery away completely, by replacing it with the current group (instead of actually performing an index lookup for each group)? I seriously doubt that a DBMS could optimize away this subquery, while failing to optimize COUNT() = 0 into NOT EXISTS.
Efficiency aside, the second query seems significantly more convoluted and less obviously correct to me, so I'd be reluctant to use it even if it happened to be faster. What do you think, is there a better way? Could I have my cake and eat it too, by using NOT EXISTS in a simpler manner, for instance by directly referencing the current group from within the HAVING clause?

You should prefer EXISTS/NOT EXISTS over COUNT() in a subquery. So instead of:
select t.*
from t
where (select count(*) from z where z.x = t.x) > 0
You should instead use:
select t.*
from t
where exists (select 1 from z where z.x = t.x)
The reasoning for this is that the subquery can stop processing at the first match.
This reasoning doesn't apply in a HAVING clause after an aggregation -- all the rows have to be generated anyway so there is little value in stopping at the first match.
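Put differently, inside the HAVING clause the current group is already in hand, so an aggregate expression is the most direct way to state the existence check. A minimal sketch, equivalent to the COUNT() = 0 version (MAX works here because it ignores NULLs and returns NULL only when every value in the group is NULL):
HAVING COUNT(google_account_id) = 0
-- or, equivalently:
HAVING MAX(google_account_id) IS NULL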
However, aggregation might not be necessary if you have a users table and don't really need the facebook count. You could use:
select u.*
from users u
where not exists (select 1
                  from profile p
                  where p.user_id = u.user_id and p.google_account_id is not null
                 );
Also, the aggregation might be faster if you filter before the aggregation:
SELECT user_id, COUNT(facebook_account_id)
FROM profile AS p
WHERE NOT EXISTS (
    SELECT 1
    FROM profile p2
    WHERE p2.user_id = p.user_id AND p2.google_account_id IS NOT NULL
)
GROUP BY user_id;
Whether it actually is faster depends on a number of factors, including the number of rows that are actually filtered out.

The first query seems like the right way to do what you want.
It is already an aggregate query, since you want to count the Facebook accounts. The overhead of processing the HAVING clause, which counts the Google accounts, should be tiny.
On the other hand, the second approach requires reopening the table and scanning it, which is most probably more expensive.

Related

update average/count from another table

I've been provided the below schema for this problem and I'm trying to do two things:
Update the ACCOUNT table's average_eval column with the average of the evaluation column from the POST_EVAL table, per account_id.
Update the ACCOUNT table with a count of the number of posts per account_id, with default value 0 if the account_id has no post_id associated to it.
Here's the kicker: I MUST use the UPDATE statement and I'm not allowed to use triggers for these specific problems.
I've tried WITH clauses and GROUP BY but haven't gotten anywhere. Using PostgreSQL's pgAdmin for reference.
Any help setting up these queries?
The first question can be done using something like this:
update account a
set average_eval = t.avg_eval
from (
select account_id, avg(evaluation) as avg_eval
from post_eval
group by account_id
) t
where t.account_id = a.account_id
The second question needs a correlated subquery, as there is no way to express an outer join in an UPDATE statement like the one above:
update account a
set num_posts = (select count(*)
                 from post p
                 where p.account_id = a.account_id);
The count() will return zero (0) if there are no posts for that account. If a join was used (as in the first statement), the rows would not be updated at all, as the "join" condition wouldn't match.
I have not tested either of those statements, so they can contain typos (or even logical errors).
Unrelated, but: I understand that this is some kind of assignment, so you have no choice. But as RiggsFolly has mentioned: in general you should avoid storing information in a relational database that can be derived from existing data. Both values can easily be calculated in a view and then will always be up-to-date.
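For what it's worth, a minimal sketch of such a view (untested; the view name account_stats is made up, while the table and column names are the ones used above):
create view account_stats as
select a.account_id,
       (select avg(e.evaluation) from post_eval e where e.account_id = a.account_id) as average_eval,
       (select count(*) from post p where p.account_id = a.account_id) as num_posts
from account a;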

Time based accumulation based on type: Speed considerations in SQL

Based on surfing the web, I came up with two methods of counting the records in a table "Table1". The counter field increments according to a date field "TheDate". It does this by counting records whose TheDate value is less than or equal to the current record's. Furthermore, records with different values for the compound field (Field1,Field2) are counted using separate counters. Field3 is just an informational field that is included for added awareness and does not affect the counting or how records are grouped for counting.
Method 1: Use correlated subquery
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
(
SELECT SUM(1) FROM Table1 InnerQuery
WHERE InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
) AS RunningCounter
FROM Table1 MainQuery
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
Method 2: Use join and group-by
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
SUM(1) AS RunningCounter
FROM Table1 MainQuery INNER JOIN Table1 InnerQuery
ON InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
There is no inner query per se in Method 2, but I use the table alias InnerQuery so that a ready parallel with Method 1 can be drawn. The role is the same; the 2nd instance of Table1 is for accumulating the counts of the records which have TheDate less than or equal to that of any record in MainQuery (1st instance of Table1) with the same Field1 and Field2 values.
Note that in Method 2, Field3 is included in the GROUP BY clause even though I said that it does not affect how the records are grouped for counting. This is still true, since the counting is done using the matching records in InnerQuery, whereas the GROUP BY applies to Field3 in MainQuery.
I found that Method 1 is noticeably faster. I'm surprised by this because it uses a correlated subquery. The way I think of a correlated subquery is that it is executed for each record in MainQuery (whether or not that is done in practice after optimization). On the other hand, Method 2 doesn't run an inner query over and over again. However, the inner join still has multiple records in InnerQuery matching each record in MainQuery, so in a sense, it deals with a similar order of complexity.
Is there a decent intuitive explanation for this speed difference, as well as best practices or considerations in choosing an approach for time-based accumulation?
I've posted this to Microsoft Answers and Stack Exchange.
In fact, I think the easiest way is to do this:
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
COUNT(*)
FROM Table1 MainQuery
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
(The order by isn't required to get the same data, just to order it. In other words, removing it will not change the number or contents of each row returned, just the order in which they are returned.)
You only need to specify the table once. Doing a self-join (joining a table to itself as both your queries do) is not required. The performance of your two queries will depend on a whole load of things which I don't know - what the primary keys are, the number of rows, how much memory is available, and so on.
First, your experience makes a lot of sense. I'm not sure why you need more intuition. I imagine you learned, somewhere along the way, that correlated subqueries are evil. Well, just as some of the things we teach kids as being really bad ("don't cross the street when the walk sign is not green") turn out to be not so bad, the same is true of correlated subqueries.
The easiest intuition is that the uncorrelated subquery has to aggregate all the data in the table, while the correlated version only has to aggregate the matching rows, although it has to do this over and over.
To put numbers to it, say you have 1,000 rows with 10 rows per group. The output is 100 rows. The first version does 100 aggregations of 10 rows each. The second does one aggregation of 1,000 rows. Aggregation generally scales in a super-linear fashion (O(n log n), technically), so 100 aggregations of 10 records take less time than 1 aggregation of 1,000 records: under that model the first costs on the order of 100 x 10 x log(10) = 1,000 x log(10) row-operations, while the second costs 1,000 x log(1,000) = 3,000 x log(10), three times as much regardless of the base of the logarithm.
You asked for intuition, so the above is to provide some intuition. There are a zillion caveats that go both ways. For instance, the correlated subquery might be able to make better use of indexes for the aggregation. And, the two queries are not equivalent, because the correct join would be LEFT JOIN.
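To illustrate the index point: a composite index covering the correlation columns and the date lets the correlated subquery satisfy each outer row with an index range scan. A hypothetical example (the index name is made up):
CREATE INDEX idx_table1_field1_field2_thedate
    ON Table1 (Field1, Field2, TheDate);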
Actually, I was wrong in my original post. The inner join is way, way faster than the correlated subquery. However, the correlated subquery is able to display its result records as they are generated, so it appears faster.
As a side curiosity, I'm finding that if the correlated sub-query approach is modified to use sum(-1) instead of sum(1), the number of returned records seems to vary from N-3 to N (where N is the correct number, i.e., the number of records in Table1). I'm not sure if this is due to some misbehaviour in Access's rush to display initial records or what-not.
While it seems that the INNER JOIN wins hands-down, there is a major insidious caveat. If the GROUP BY fields do not uniquely distinguish each record in Table1, then you will not get an individual SUM for each record of Table1. Imagine that a particular combination of GROUP BY field values matches (say) THREE records in Table1. You will then get a single SUM for all of them. The problem is, each of these 3 records in MainQuery also matches all 3 of the same records in InnerQuery, so those instances in InnerQuery get counted multiple times. Very insidious (I find).
So it seems that the sub-query may be the way to go, which is awfully disturbing in view of the above problem with repeatability (2nd paragraph above). That is a serious problem that should send shivers down any spine. Another possible solution that I'm looking at is to turn MainQuery into a subquery by SELECTing the fields of interest and DISTINCTifying them before INNER JOINing the result with InnerQuery.
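As an aside for readers not tied to Access: engines with window functions (PostgreSQL, SQL Server, recent MySQL/MariaDB) can express the running counter directly, with neither a self-join nor a correlated subquery. A sketch, untested:
SELECT Field1,
       Field2,
       Field3,
       TheDate,
       -- the default frame (RANGE UNBOUNDED PRECEDING) includes ties,
       -- which matches the TheDate <= comparison used in both methods above
       COUNT(*) OVER (PARTITION BY Field1, Field2
                      ORDER BY TheDate) AS RunningCounter
FROM Table1
ORDER BY Field1, Field2, TheDate, Field3;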

Only return rows that match all criteria

Here is a rough schema:
create table images (
image_id serial primary key,
user_id int references users(user_id),
date_created timestamp with time zone
);
create table images_tags (
images_tag_id serial primary key,
image_id int references images(image_id),
tag_id int references tags(tag_id)
);
The output should look like this:
{"images":[
{"image_id":1, "tag_ids":[1, 2, 3]},
....
]}
The user is allowed to filter images based on user ID, tags, and offset image_id. For instance, someone can say "user_id":1, "tags":[1, 2], "offset_image_id":500, which will give them all images that are from user_id 1, have both tags 1 AND 2, and an image_id of 500 or less.
The tricky part is the "have both tags 1 AND 2". It is more straightforward (and faster) to return all images that have either 1, 2, or both. I don't see any way around this other than aggregating, but it is much slower.
Any help doing this quickly?
Here is the current query I am using which is pretty slow:
select * from (
select i.*,u.handle,array_agg(t.tag_id) as tag_ids, array_agg(tag.name) as tag_names from (
select i.image_id, i.user_id, i.description, i.url, i.date_created from images i
where (?=-1 or i.user_id=?)
and (?=-1 or i.image_id <= ?)
and exists(
select 1 from image_tags t
where t.image_id=i.image_id
and (?=-1 or user_id=?)
and (?=-1 or t.tag_id in (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?))
)
order by i.image_id desc
) i
left join image_tags t on t.image_id=i.image_id
left join tag using (tag_id) --not totally necessary
left join users u on i.user_id=u.user_id --not totally necessary
group by i.image_id,i.user_id,i.description,i.url,i.date_created,u.handle) sub
where (?=-1 or sub.tag_ids #> ?)
limit 100;
When the execution plan of this statement is determined, at prepare time, the PostgreSQL planner doesn't know which of these ?=-1 conditions will be true.
So it has to produce a plan to maybe filter on a specific user_id, or maybe not, and maybe filter on a range on image_id or maybe not, and maybe filter on a specific set of tag_id, or maybe not. It's likely to be a dumb, unoptimized plan, that can't take advantage of indexes.
While your current strategy of a big generic query that covers all cases is OK for correctness, for performance you might need to abandon it in favor of generating the minimal query given the parametrized conditions that are actually filled in.
In such a generated query, the ?=-1 or ... will disappear, only the joins that are actually needed will be present, and the dubious t.tag_id in (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) will go or be reduced to what's strictly necessary.
If it's still slow given certain sets of parameters, then you'll have a much easier starting point to optimize on.
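Purely as an illustration, if the caller supplied only a user_id and two tag ids (no image_id offset), the generated query might reduce to just the inner filtering part, something like:
select i.image_id, i.user_id, i.description, i.url, i.date_created
from images i
where i.user_id = ?
  and exists (
      select 1 from image_tags t
      where t.image_id = i.image_id
        and t.tag_id in (?, ?)
  )
order by i.image_id desc
limit 100;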
As for the gist of the question, testing the exact match on all tags, you might want to try the idiomatic form in an inner subquery:
SELECT image_id FROM image_tags
WHERE tag_id in (?,?,...)
GROUP BY image_id HAVING count(*)=?
where the last ? is the number of tags passed as parameters.
(and completely remove sub.tag_ids #> ? as an outer condition).
Among other things, your GROUP BY clause is likely wider than any of your indices (and/or includes columns in unlikely combinations). I'd probably re-write your query as follows (turning @Daniel's subquery for the tags into a CTE):
WITH Tagged_Images AS (
    SELECT Image_Tags.image_id, ARRAY_AGG(Tag.tag_id) as tag_ids,
           ARRAY_AGG(Tag.name) as tag_names
    FROM Image_Tags
    JOIN Tag
      ON Tag.tag_id = Image_Tags.tag_id
    WHERE Image_Tags.tag_id IN (?, ?)
    GROUP BY Image_Tags.image_id
    HAVING COUNT(*) = ?)
SELECT Images.image_id, Images.user_id,
Images.description, Images.url, Images.date_created,
Tagged_Images.tag_ids, Tagged_Images.tag_names,
Users.handle
FROM Images
JOIN Tagged_Images
ON Tagged_Images.image_id = Images.image_id
LEFT JOIN Users
ON Users.user_id = Images.user_id
WHERE Images.user_id = ?
AND Images.date_created < ?
ORDER BY Images.date_created, Images.image_id
LIMIT 100
(Untested - no provided dataset. Note that I'm assuming you're building the criteria dynamically, to avoid condition flags.)
Here's some other stuff:
Note that Tagged_Images will have at minimum the indicated tags, but might have more. If you want images with only those tags (exactly 2, no more, no less), an additional level needs to be added to the CTE (see the sketch after these notes).
There's a number of examples floating around of stored procs that turn comma-separated lists into virtual tables (heck, I've done it with recursive CTEs), which you could use for the IN() clause. It doesn't matter that much here, though, due to needing dynamic SQL anyways...
Assuming that Images.image_id is auto-generated, doing range searches or ordering by it is largely pointless. There are relatively few cases where humans care about the value held here. Except in cases where you're searching for one specific row (for updating/deleting/whatever), conceptual data sets don't really care either; the value itself is largely meaningless. What does image_id < 500 actually tell me? Nothing - just that a given number was assigned to it. Are you using it to restrict based on "early" versus "late" images? Then use the proper data for that, which would be date_created. For pagination? Well, you have to do that after all the other conditions, or you get weird page lengths (like 0 in some cases). Generated keys should be relied on for one property only: uniqueness. This is the reason I stuck it at the end of the ORDER BY - to ensure a consistent ordering. Assuming that date_created has a high enough resolution as a timestamp, even this is unnecessary.
I'm fairly certain your LEFT JOIN to Users should probably be a regular (INNER) JOIN, but you didn't provide enough information for me to be sure.
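Here is a sketch of that additional level (untested; it compares each image's total tag count against the number of requested tags, reusing the table names above; both of the trailing ? parameters are the number of requested tags):
WITH Image_Tag_Counts AS (
    SELECT image_id, COUNT(*) AS total_tags
    FROM Image_Tags
    GROUP BY image_id
),
Tagged_Images AS (
    SELECT it.image_id, ARRAY_AGG(it.tag_id) AS tag_ids
    FROM Image_Tags it
    JOIN Image_Tag_Counts c ON c.image_id = it.image_id
    WHERE it.tag_id IN (?, ?)
    GROUP BY it.image_id
    HAVING COUNT(*) = ?              -- has all of the requested tags
       AND MIN(c.total_tags) = ?     -- and no tags beyond the requested ones
)
SELECT image_id, tag_ids FROM Tagged_Images;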
Aggregation is not likely to be the thing slowing you down. A query such as:
select images.image_id
from images
join images_tags on (images.image_id=images_tags.image_id)
where images_tags.tag_id in (1,2)
group by images.image_id
having count(*) = 2
will get you all of the images that have tags 1 and 2 and it will run quickly if you have indexes on both images_tags columns:
create index on images_tags(tag_id);
create index on images_tags(image_id);
The slowest part of the query is likely to be the in part of the where clause. You can speed that up if you are prepared to create a temporary table with the target tags in:
create temp table target_tags(tag_id int primary key);
insert into target_tags values (1);
insert into target_tags values (2);
select images.image_id
from images
join images_tags on (images.image_id=images_tags.image_id)
join target_tags on images_tags.tag_id=target_tags.tag_id
group by images.image_id
having count(*) = (select count(*) from target_tags)

OR query performance and strategies with Postgresql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (e.g., a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed, and actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to have a lot of records rather quickly.
Here's the current query (the followings table joins users to feeders, aka users and groups):
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42) ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do, but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way that it lends itself to having performant queries written against it.
Explain analyze and time the query to see if there is a problem.
Also, you could try expressing the query as a union:
SELECT x.* FROM
(
    SELECT feed_items.* FROM feed_items
    INNER JOIN followings
            ON followings.feeder_id = feed_items.subject_id
           AND followings.feeder_type = feed_items.subject_type
    WHERE (followings.follower_id = 42)
    UNION
    SELECT feed_items.* FROM feed_items
    INNER JOIN followings
            ON followings.feeder_id = feed_items.actor_id
           AND followings.feeder_type = feed_items.actor_type
    WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again explain analyze and benchmark.
To find out if there is a performance problem, measure it. PostgreSQL can explain it for you.
I don't think that the query needs simplifying; if you identify a performance problem, then you may need to revise your indexes.
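A minimal way to take that measurement in PostgreSQL (standard EXPLAIN ANALYZE; the statement is simply the query from the question):
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT feed_items.* FROM feed_items
INNER JOIN followings
  ON (followings.feeder_id = feed_items.subject_id
      AND followings.feeder_type = feed_items.subject_type)
  OR (followings.feeder_id = feed_items.actor_id
      AND followings.feeder_type = feed_items.actor_type)
WHERE followings.follower_id = 42
ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0;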

SQL - table alias scope

I've just learned (yesterday) to use "exists" instead of "in".
BAD
select * from table where nameid in (
select nameid from othertable where otherdesc = 'SomeDesc' )
GOOD
select * from table t where exists (
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeDesc' )
And I have some questions about this:
1) The explanation as I understood was: "The reason why this is better is because only the matching values will be returned instead of building a massive list of possible results". Does that mean that while the first subquery might return 900 results, the second will return only 1 (yes or no)?
2) In the past I have had the RDBMS complaining: "only the first 1000 rows might be retrieved"; would this second approach solve that problem?
3) What is the scope of the alias in the second subquery?... does the alias only live within the parentheses?
for example
select * from table t where exists (
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeDesc' )
AND
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeOtherDesc' )
That is, if I use the same alias (o for table othertable) in the second "exists", will it present any problem with the first exists? Or are they totally independent?
Is this something Oracle-specific, or is it valid for most RDBMSes?
Thanks a lot
It's specific to each DBMS and depends on the query optimizer. Some optimizers detect IN clause and translate it.
In all DBMSes I tested, the alias is only valid inside the ( ).
BTW, you can rewrite the query as:
select t.*
from table t
join othertable o on t.nameid = o.nameid
and o.otherdesc in ('SomeDesc','SomeOtherDesc');
And, to answer your questions:
1) Yes
2) Yes
3) Yes
You are treading into complicated territory, known as 'correlated sub-queries'. Since we don't have detailed information about your tables and the key structures, some of the answers can only be 'maybe'.
In your initial IN query, the notation would be valid whether or not OtherTable contains a column NameID (and, indeed, whether OtherDesc exists as a column in Table or OtherTable - which is not clear in any of your examples, but presumably is a column of OtherTable). This behaviour is what makes a correlated sub-query into a correlated sub-query. It is also a routine source of angst for people when they first run into it - invariably by accident. Since the SQL standard mandates the behaviour of interpreting a name in the sub-query as referring to a column in the outer query if there is no column with the relevant name in the tables mentioned in the sub-query but there is a column with the relevant name in the tables mentioned in the outer (main) query, no product that wants to claim conformance to (this bit of) the SQL standard will do anything different.
The answer to your Q1 is "it depends", but given plausible assumptions (NameID exists as a column in both tables; OtherDesc only exists in OtherTable), the results should be the same in terms of the data set returned, but may not be equivalent in terms of performance.
The answer to your Q2 is that in the past, you were using an inferior if not defective DBMS. If it supported EXISTS, then the DBMS might still complain about the cardinality of the result.
The answer to your Q3 as applied to the first EXISTS query is "t is available as an alias throughout the statement, but o is only available as an alias inside the parentheses". As applied to your second example box - with AND connecting two sub-selects (the second of which is missing the open parenthesis when I'm looking at it), then "t is available as an alias throughout the statement and refers to the same table, but there are two different aliases both labelled 'o', one for each sub-query". Note that the query might return no data if OtherDesc is unique for a given NameID value in OtherTable; otherwise, it requires two rows in OtherTable with the same NameID and the two OtherDesc values for each row in Table with that NameID value.
Oracle-specific: When you write a query using the IN clause, you're telling the rule-based optimizer that you want the inner query to drive the outer query. When you write EXISTS in a where clause, you're telling the optimizer that you want the outer query to be run first, using each value to fetch a value from the inner query. See "Difference between IN and EXISTS in subqueries".
Probably.
An alias declared inside a subquery lives inside that subquery. By the way, I don't think your example with 2 ANDed subqueries is valid SQL. Did you mean UNION instead of AND?
Personally I would use a join, rather than a subquery for this.
SELECT t.*
FROM yourTable t
INNER JOIN otherTable ot
ON (t.nameid = ot.nameid AND ot.otherdesc = 'SomeDesc')
It is difficult to generalize that EXISTS is always better than IN. Logically, if that were the case, then the SQL community would have replaced IN with EXISTS...
Also, please note that IN and EXISTS are not the same; the results may be different when you use the two...
With IN, it's usually a full table scan of the inner table, done once, without removing NULLs (so if you have NULLs in your inner table, IN will not remove them by default)... while EXISTS removes NULLs and, in the case of a correlated subquery, runs the inner query for every row from the outer query.
Assuming there are no NULLs and it's a simple query (with no correlation), EXISTS might perform better if the row you are looking for is not the last row. If it happens to be the last row, EXISTS may need to scan to the end, just like IN... so similar performance...
But IN and EXISTS are not interchangeable...
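The difference is easiest to see in the negated form. A minimal illustration (hypothetical tables a and b, each with a single column x, where b contains a NULL; standard SQL semantics):
-- hypothetical data:
--   a.x: 1, 2
--   b.x: 1, NULL

select * from a where a.x not in (select x from b);
-- returns no rows: "2 not in (1, NULL)" evaluates to UNKNOWN, never TRUE

select * from a where not exists (select 1 from b where b.x = a.x);
-- returns the row with x = 2: the NULL in b simply never matches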