Query optimization: getting rid of subqueries (SQL)

I'm using sqlite3, but its SQL support is rather standard, so as long as the SQL doesn't contain any proprietary extensions all should be good. My schema is the following:
create table test (
    _id integer primary key,
    name text,
    enabled integer not null default 1
);
create table task (
    _id integer primary key,
    _test_id integer not null references test on delete cascade,
    favorite integer not null default 0,
    comment text
);
In short: there are tests which may be enabled or not; tests have multiple tasks, which can be favorite and may have a comment.
The two most complex queries I need to write are the following:
A select which retrieves whether the database contains at least one favorite and at least one commented task across all enabled tests (i.e. don't group by test). I came up with the following monstrosity:
select
exists(select task._id from task as task inner join test as test on task._test_id=test._id where task.favorite=1 and test.enabled=1 limit 1) as has_favorite,
exists(select task._id from task as task inner join test as test on task._test_id=test._id where task.comment is not null and test.enabled=1 limit 1) as has_commented;
A select which retrieves test core data (id, name, etc.) along with its task count and whether it contains at least one favorite and at least one commented task. I came up with this:
select
test.*,
(select count(*) from task where _test_id=test._id) as task_count,
exists(select _id from task where favorite=1 and _test_id=test._id limit 1) as has_favorite,
exists(select _id from task where comment is not null and _test_id=test._id limit 1) as has_commented
from test as test where test.enabled=1 order by test._id asc
Actually, 'has_favorite' and 'has_commented' are not the only pieces of information I need, but they illustrate my doubts: these queries are pretty big, contain a fair number of subqueries (and I've read that subselects are bad for performance), and duplicate a lot of logic.
The question: would it be possible to write the queries more easily? Make them better, more concise? Not duplicate so much? For example, I'm thinking maybe there is a way to perform only one join between the task and test tables and somehow derive the data from there.
Edit: so it appears I can write this for the first one:
select
count(*) as task_count,
max(task.favorite) as has_favorite,
count(task.comment) as has_commented
from task as task inner join test as test on task._test_id=test._id where test.enabled=1;
and this for the second one:
select
test.*,
count(*) as task_count,
max(task.favorite) as has_favorite,
count(task.comment) as has_commented
from task as task inner join test as test on task._test_id=test._id where test.enabled=1 group by test._id;
If max(task.favorite) is anything > 0, it means at least one task is a favorite. I could replace it with sum(task.favorite): if the sum is > 0, there is a favorite. Likewise, count(task.comment) counts only non-null comments, so any value > 0 means at least one commented task.
Is this any better than the original proposals (with exists(subselect))? It seems way easier.
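One more wrinkle I noticed (a sketch, not benchmarked): the inner join silently drops enabled tests that have no tasks at all; a left join keeps them, and the flags can be normalized to strict 0/1 values:
select
    test.*,
    count(task._id) as task_count,
    coalesce(max(task.favorite), 0) > 0 as has_favorite,
    count(task.comment) > 0 as has_commented
from test left join task on task._test_id = test._id
where test.enabled = 1
group by test._id
order by test._id asc;
Counting task._id instead of * makes the zero-task case come out as 0 rather than 1, since count(*) would count the null-extended row produced by the left join.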

I eventually went with joins similar to the ones in my edits, as they worked quite nicely and also allowed me to gather other information in one go.

Related

Should I always prefer EXISTS over COUNT() > 0 in SQL?

I often encounter the advice that, when checking for the existence of any rows from a (sub)query, one should use EXISTS instead of COUNT(*) > 0, for reasons of performance. Specifically, the former can short-circuit and return TRUE (or FALSE in the case of NOT EXISTS) after finding a single row, while COUNT needs to actually evaluate each row in order to return a number, only to be compared to zero.
This all makes perfect sense to me in simple cases. However, I recently ran into a problem where I needed to filter groups in the HAVING clause of a GROUP BY, based on whether all values in a certain column of the group were NULL.
For the sake of clarity, let's see an example. Let's say I have the following schema:
CREATE TABLE profile(
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    google_account_id INTEGER NULL,
    facebook_account_id INTEGER NULL,
    FOREIGN KEY (user_id) REFERENCES user(id),
    CHECK(
        (google_account_id IS NOT NULL) + (facebook_account_id IS NOT NULL) = 1
    )
);
I.e. each user (table not shown for brevity) has 0 or more profiles. Each profile is either a Google or a Facebook account. (This is the translation of subclasses or a sum type with some associated data — in my real schema, the account IDs are also foreign keys to different tables holding that associated data, but this is not relevant to my question.)
Now, say I wanted to count the Facebook profiles for all users who do NOT have any Google profiles.
At first, I wrote the following query using COUNT() = 0:
SELECT user_id, COUNT(facebook_account_id)
FROM profile
GROUP BY user_id
HAVING COUNT(google_account_id) = 0;
But then it occurred to me that the condition in the HAVING clause is actually just an existence check. So I then re-wrote the query using a subquery and NOT EXISTS:
SELECT user_id, COUNT(facebook_account_id)
FROM profile AS p
GROUP BY user_id
HAVING NOT EXISTS (
    SELECT 1
    FROM profile AS q
    WHERE p.user_id = q.user_id
      AND q.google_account_id IS NOT NULL
);
My question is two-fold:
Should I keep the second, re-formulated query, and use NOT EXISTS with a subquery instead of COUNT() = 0? Is this really more efficient? I reckon that the index lookup due to the WHERE p.user_id = q.user_id condition has some additional cost. Whether this additional cost is absorbed by the short-circuiting behavior of EXISTS could as well depend on the average cardinality of the groups, could it not?
Or could the DBMS perhaps be smart enough to recognize the fact that the grouping key is being compared against, and optimize this subquery away completely, by replacing it with the current group (instead of actually performing an index lookup for each group)? I seriously doubt that a DBMS could optimize away this subquery, while failing to optimize COUNT() = 0 into NOT EXISTS.
Efficiency aside, the second query seems significantly more convoluted and less obviously correct to me, so I'd be reluctant to use it even if it happened to be faster. What do you think, is there a better way? Could I have my cake and eat it too, by using NOT EXISTS in a simpler manner, for instance by directly referencing the current group from within the HAVING clause?
You should prefer EXISTS/NOT EXISTS over COUNT() in a subquery. So instead of:
select t.*
from t
where (select count(*) from z where z.x = t.x) > 0
You should instead use:
select t.*
from t
where exists (select 1 from z where z.x = t.x)
The reasoning for this is that the subquery can stop processing at the first match.
This reasoning doesn't apply in a HAVING clause after an aggregation -- all the rows have to be generated anyway so there is little value in stopping at the first match.
However, aggregation might not be necessary if you have a users table and don't really need the facebook count. You could use:
select u.*
from users u
where not exists (select 1
                  from profile p
                  where p.user_id = u.user_id and p.google_account_id is not null
                 );
Also, the aggregation might be faster if you filter before the aggregation:
SELECT user_id, COUNT(facebook_account_id)
FROM profile AS p
WHERE NOT EXISTS (
    SELECT 1
    FROM profile p2
    WHERE p2.user_id = p.user_id AND p2.google_account_id IS NOT NULL
)
GROUP BY user_id;
Whether it actually is faster depends on a number of factors, including the number of rows that are actually filtered out.
The first query seems like the right way to do what you want.
It's an aggregate query already, since you want to count the Facebook accounts. The overhead of processing the HAVING clause, which counts the Google accounts, should be tiny.
The second approach, on the other hand, requires re-opening the table and scanning it, which is most probably more expensive.

Teiid not performing optimal join

For our Teiid Springboot project we use a row filter in a where clause to determine what results a user gets.
Example:
SELECT * FROM very_large_table WHERE id IN ('01', '03')
We want the context in the IN clause to be dynamic like so:
SELECT * FROM very_large_table WHERE id IN (SELECT other_id from very_small_table)
The problem now is that Teiid fetches all the data from very_large_table and only then tries to filter with the where clause, which makes the query 10-20 times slower. very_small_table only holds about 1-10 records, based on the user context we get from Java.
very_large_table is located on an Oracle database, while very_small_table lives on the Teiid Pod/Container. Somehow I can't force Teiid to ship the small table's data to Oracle and perform the filtering there.
Things that I have tried:
I have specified the foreign data wrapper as follows:
CREATE FOREIGN DATA WRAPPER "oracle_override" TYPE "oracle" OPTIONS (EnableDependentsJoins 'true');
CREATE SERVER server_name FOREIGN DATA WRAPPER "oracle_override";
I also tried an EXISTS clause, and a JOIN instead of the WHERE clause, to see if pushdown happened. Join hints don't seem to matter either.
Sadly, the performance impact is currently so high that we can't reach our performance targets.
Are there any cardinalities set on very_small_table and very_large_table? If not, the planner will assume a default plan.
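For example, cardinality can be declared as a table option in the VDB DDL (a sketch only; the column definition here is assumed from your query, and the exact DDL depends on your Teiid version):
CREATE FOREIGN TABLE very_small_table (
    other_id integer
) OPTIONS (CARDINALITY 10);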
You can also use a dependent join hint:
SELECT * FROM very_large_table WHERE id IN /*+ dj */ (SELECT other_id from very_small_table)
Often, exists performs better than in:
SELECT vlt.*
FROM very_large_table vlt
WHERE EXISTS (SELECT 1 FROM very_small_table vst WHERE vst.other_id = vlt.id);
However, this might end up scanning the large table.
If id is unique in vlt and there are no duplicates in vst, then a JOIN might optimize better:
select vlt.*
from very_small_table vst
join very_large_table vlt
  on vst.other_id = vlt.id;

SQL - count relation instances

I have this SQL diagram (tables Subscriber and Reporter, linked by a Likes table with SubscriberId and ReporterId columns):
I want to get all the subscribers that like exactly 0 reporters.
I started with: SELECT * FROM subscriber HAVING count(...)
How do I count how many reporters a subscriber likes?
Should a relationship get its own table in an SQL database?
I'm not completely sure I understand your last question, but this sounds like a nice time for a NOT IN clause.
SELECT *
FROM Subscriber
WHERE Id NOT IN (SELECT SubscriberId
                 FROM Likes
                 INNER JOIN Reporter ON Reporter.Id = Likes.ReporterId)
The inner query simply finds all the subscriber ids that like at least one reporter; the outer one then grabs all the rest. You might be able to improve the efficiency of this query by changing that INNER JOIN to another IN, but you'd have to play with it.
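A NOT EXISTS variant is also worth testing (a sketch; it assumes every Likes row references a reporter, which would make the Reporter join redundant):
SELECT *
FROM Subscriber s
WHERE NOT EXISTS (SELECT 1
                  FROM Likes l
                  WHERE l.SubscriberId = s.Id);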
As for the task of counting them, I'd probably just do this. You could group and such, but this is simple:
SELECT *, (SELECT COUNT(*)
           FROM Likes
           INNER JOIN Reporter ON Reporter.Id = Likes.ReporterId
           WHERE Likes.SubscriberId = Subscriber.Id) AS ReportersCount
FROM Subscriber
Note that for your stated task of finding the ones with zero reporters, the first query will be faster, because it can short-circuit rather than having to count every reporter for every row. Of course, neither should be too bad as long as you've got the appropriate indexes.
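If you do want the grouped form you started with, it would look roughly like this (a sketch; some databases require every selected Subscriber column to appear in the GROUP BY):
SELECT s.Id, COUNT(l.ReporterId) AS ReportersCount
FROM Subscriber s
LEFT JOIN Likes l ON l.SubscriberId = s.Id
GROUP BY s.Id
HAVING COUNT(l.ReporterId) = 0;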

Why does breaking out this correlated subquery vastly improve performance?

I tried running this query against two tables which were very different sizes - #temp was about 15,000 rows, and Member is about 70,000,000, about 68,000,000 of which do not have the ID 307.
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id as varchar) NOT IN (
    SELECT IndividualID
    FROM Member m
    INNER JOIN Person p ON p.PersonID = m.PersonID
    WHERE CompanyID <> 307)
This query ran for 18 hours, before I killed it and tried something else, which was:
SELECT IndividualID
INTO #source
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id AS VARCHAR) NOT IN (
    SELECT IndividualID
    FROM #source)
And this ran for less than a second before giving me a result.
I was pretty surprised by this. I'm a middle-tier developer rather than a SQL expert and my understanding of what goes on under the hood is a little murky, but I would have presumed that, since the sub-query in my first attempt is the exact same code, asking for the exact same data as in the second attempt, that these would be roughly equivalent.
But that's obviously wrong. I can't look at the execution plan for my original query to see what SQL Server is trying to do. So can someone kindly explain why splitting the data out into a temp table is so much faster?
EDIT: Table schemas and indexes
The #temp table has two columns, Individual_ID int and Source_Code varchar(50).
Member and Person are more complex. They have 29 and 13 columns respectively, so I don't really want to post them in full. PersonID is an int and is the PK on Person and an FK on Member. IndividualID is a column on Person - this is not clear in the query as written.
I tried using a LEFT JOIN instead of NOT IN before asking the question. The performance on the second query wasn't noticeably different - both were sub-second. On the first query I let it run for an hour before stopping it, presuming it would make no significant difference.
I also added an index on #source, just like on the original table, so the performance impact should be identical.
First, your query has two faux pas that really stick out. You are converting to varchar, but you do not include a length argument. This should not be allowed! The default length varies by context, and you need to be explicit.
Second, you are matching two keys in different tables and they seemingly have different types. Foreign key references should always have the same type. This can have a very big impact on performance. If you are dealing with tables that have millions of rows, then you need to pay some attention to the data structure.
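For instance, if the cast really is necessary, spell out the length (varchar(20) here is an arbitrary illustration; match it to the actual column definition):
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id AS varchar(20)) NOT IN (
    SELECT IndividualID
    FROM #source);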
To understand the difference in performance, you need to understand execution plans. The two queries have very different execution plans. My (educated) guess is that the first version is using a nested loop join algorithm, while the second version is using a more sophisticated one. In your case, this would be due to the ability of SQL Server to maintain statistics on tables. So, instantiating the intermediate results actually helps the optimizer produce a better query plan.
The subject of how best to write this logic has been investigated a lot. Here is a very good discussion on the subject by Aaron Bertrand.
I do agree with Aaron on the preference for not exists in this case:
SELECT COUNT(*)
FROM #temp t
WHERE NOT EXISTS (SELECT 1
                  FROM Member m JOIN
                       Person p
                       ON p.PersonID = m.PersonID
                  WHERE CompanyID <> 307 AND p.IndividualID = t.individual_id
                 );
However, I don't know if this will have better performance in this particular case.
This line is probably what kills the first query
WHERE CAST(individual_id as varchar) NOT IN
My guess would be that this forces a table scan rather than using any indexes.
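If the two columns really are the same type, dropping the CAST entirely gives the optimizer a chance to use an index (a sketch, assuming both IndividualID and individual_id are int, and that IndividualID is never NULL - NOT IN behaves surprisingly when the subquery returns NULLs):
SELECT COUNT(*)
FROM #temp t
WHERE t.individual_id NOT IN (
    SELECT p.IndividualID
    FROM Member m
    INNER JOIN Person p ON p.PersonID = m.PersonID
    WHERE CompanyID <> 307);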

OR query performance and strategies with Postgresql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (e.g., a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed; actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to grow large rather quickly.
Here's the current query (the followings table joins users to feeders, i.e., users and groups):
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
    (followings.feeder_id = feed_items.subject_id
     AND followings.feeder_type = feed_items.subject_type)
    OR
    (followings.feeder_id = feed_items.actor_id
     AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42)
ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.
EXPLAIN ANALYZE and time the query to see if there is a problem.
Also, you could try expressing the query as a UNION:
SELECT x.* FROM
(
    SELECT feed_items.* FROM feed_items
    INNER JOIN followings
        ON followings.feeder_id = feed_items.subject_id
        AND followings.feeder_type = feed_items.subject_type
    WHERE (followings.follower_id = 42)
    UNION
    SELECT feed_items.* FROM feed_items
    INNER JOIN followings
        ON followings.feeder_id = feed_items.actor_id
        AND followings.feeder_type = feed_items.actor_type
    WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again, EXPLAIN ANALYZE and benchmark.
To find out if there is a performance problem, measure it. PostgreSQL can explain it for you.
I don't think that the query needs simplifying, if you identify a performance problem then you may need to revise your indexes.
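For example (ANALYZE actually executes the query, and BUFFERS adds I/O detail):
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT feed_items.* FROM feed_items
INNER JOIN followings
ON (
    (followings.feeder_id = feed_items.subject_id
     AND followings.feeder_type = feed_items.subject_type)
    OR
    (followings.feeder_id = feed_items.actor_id
     AND followings.feeder_type = feed_items.actor_type)
)
WHERE followings.follower_id = 42
ORDER BY feed_items.created_at DESC LIMIT 30;
A sequential scan on feed_items in the output is usually the sign that the OR condition is defeating your indexes.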