Only return rows that match all criteria - SQL

Here is a rough schema:
create table images (
    image_id serial primary key,
    user_id int references users(user_id),
    date_created timestamp with time zone
);
create table images_tags (
    images_tag_id serial primary key,
    image_id int references images(image_id),
    tag_id int references tags(tag_id)
);
The output should look like this:
{"images":[
{"image_id":1, "tag_ids":[1, 2, 3]},
....
]}
The user is allowed to filter images based on user ID, tags, and an offset image_id. For instance, someone can pass "user_id":1, "tags":[1, 2], "offset_image_id":500, which will give them all images that are from user_id 1, have both tags 1 AND 2, and an image_id of 500 or less.
The tricky part is the "have both tags 1 AND 2". It is more straightforward (and faster) to return all images that have either 1, 2, or both. I don't see any way around this other than aggregating, but it is much slower.
Any help doing this quickly?
Here is the current query I am using which is pretty slow:
select * from (
    select i.*, u.handle, array_agg(t.tag_id) as tag_ids, array_agg(tag.name) as tag_names
    from (
        select i.image_id, i.user_id, i.description, i.url, i.date_created
        from images i
        where (?=-1 or i.user_id=?)
          and (?=-1 or i.image_id <= ?)
          and exists(
              select 1 from image_tags t
              where t.image_id=i.image_id
                and (?=-1 or user_id=?)
                and (?=-1 or t.tag_id in (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?))
          )
        order by i.image_id desc
    ) i
    left join image_tags t on t.image_id=i.image_id
    left join tag using (tag_id)               -- not totally necessary
    left join users u on i.user_id=u.user_id   -- not totally necessary
    group by i.image_id, i.user_id, i.description, i.url, i.date_created, u.handle
) sub
where (?=-1 or sub.tag_ids #> ?)
limit 100;

When the execution plan of this statement is determined at prepare time, the PostgreSQL planner doesn't know which of these ?=-1 conditions will be true.
So it has to produce a plan to maybe filter on a specific user_id, or maybe not, and maybe filter on a range on image_id or maybe not, and maybe filter on a specific set of tag_id, or maybe not. It's likely to be a dumb, unoptimized plan, that can't take advantage of indexes.
While your current strategy of a big generic query that covers all cases is OK for correctness, for performance you might need to abandon it in favor of generating the minimal query given the parametrized conditions that are actually filled in.
In such a generated query, the ?=-1 or ... will disappear, only the joins that are actually needed will be present, and the dubious t.tag_id in (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) will go or be reduced to what's strictly necessary.
If it's still slow given certain sets of parameters, then you'll have a much easier starting point to optimize on.
As for the gist of the question, testing the exact match on all tags, you might want to try the idiomatic form in an inner subquery:
SELECT image_id FROM image_tags
WHERE tag_id IN (?, ?, ...)
GROUP BY image_id
HAVING count(*) = ?
where the last ? is the number of tags passed as parameters.
(and completely remove sub.tag_ids #> ? as an outer condition).
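For illustration, here is a minimal sketch of such a generated query when all three filters are supplied (literal values stand in for the parameters; the inner subquery replaces the outer sub.tag_ids #> ? check):
SELECT i.image_id, i.user_id, i.description, i.url, i.date_created,
       array_agg(t.tag_id) AS tag_ids
FROM images i
JOIN (
    -- only images that carry every requested tag
    SELECT image_id
    FROM image_tags
    WHERE tag_id IN (1, 2)
    GROUP BY image_id
    HAVING count(*) = 2  -- assumes (image_id, tag_id) pairs are unique; otherwise use count(DISTINCT tag_id)
) matching ON matching.image_id = i.image_id
JOIN image_tags t ON t.image_id = i.image_id
WHERE i.user_id = 1
  AND i.image_id <= 500
GROUP BY i.image_id, i.user_id, i.description, i.url, i.date_created
ORDER BY i.image_id DESC
LIMIT 100;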

Among other things, your GROUP BY clause is likely wider than any of your indexes (and/or includes columns in unlikely combinations). I'd probably rewrite your query as follows (turning @Daniel's subquery for the tags into a CTE):
WITH Tagged_Images AS (
    SELECT Image_Tags.image_id,
           ARRAY_AGG(Tag.tag_id) AS tag_ids,
           ARRAY_AGG(Tag.name) AS tag_names
    FROM Image_Tags
    JOIN Tag
      ON Tag.tag_id = Image_Tags.tag_id
    WHERE Image_Tags.tag_id IN (?, ?)
    GROUP BY Image_Tags.image_id
    HAVING COUNT(*) = ?
)
SELECT Images.image_id, Images.user_id,
       Images.description, Images.url, Images.date_created,
       Tagged_Images.tag_ids, Tagged_Images.tag_names,
       Users.handle
FROM Images
JOIN Tagged_Images
  ON Tagged_Images.image_id = Images.image_id
LEFT JOIN Users
  ON Users.user_id = Images.user_id
WHERE Images.user_id = ?
  AND Images.date_created < ?
ORDER BY Images.date_created, Images.image_id
LIMIT 100
(Untested, since no dataset was provided. Note that I'm assuming you're building the criteria dynamically, to avoid condition flags.)
Here's some other stuff:
Note that Tagged_Images will have at minimum the indicated tags, but might have more. If you want images with only those tags (exactly 2, no more, no less), an additional level needs to be added to the CTE; see the sketch after these notes.
There are a number of examples floating around of stored procs that turn comma-separated lists into virtual tables (heck, I've done it with recursive CTEs), which you could use for the IN() clause. It doesn't matter that much here, though, due to needing dynamic SQL anyway...
Assuming that Images.image_id is auto-generated, doing range searches or ordering by it is largely pointless. There are relatively few cases where humans care about the value held here. Except in cases where you're searching for one specific row (for updating/deleting/whatever), conceptual data sets don't really care either; the value itself is largely meaningless. What does image_id < 500 actually tell me? Nothing, just that a given number was assigned. Are you using it to restrict based on "early" versus "late" images? Then use the proper data for that, which would be date_created. For pagination? You have to do that after all the other conditions, or you get weird page lengths (like 0 in some cases). Generated keys should be relied on for one property only: uniqueness. That is the reason I stuck image_id at the end of the ORDER BY, to ensure a consistent ordering. Assuming that date_created has a high enough resolution as a timestamp, even this is unnecessary.
I'm fairly certain your LEFT JOIN to Users should probably be a regular (INNER) JOIN, but you didn't provide enough information for me to be sure.
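For that "exactly these tags" case, here is a sketch of the extra level, assuming PostgreSQL (bool_and is PostgreSQL-specific; the tag values are examples):
SELECT image_id
FROM Image_Tags
GROUP BY image_id
HAVING bool_and(tag_id IN (1, 2))   -- no tag outside the requested set
   AND COUNT(DISTINCT tag_id) = 2   -- and both requested tags are present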

Aggregation is not likely to be the thing slowing you down. A query such as:
select images.image_id
from images
join images_tags on (images.image_id=images_tags.image_id)
where images_tags.tag_id in (1,2)
group by images.image_id
having count(*) = 2
will get you all of the images that have tags 1 and 2, and it will run quickly if you have indexes on both images_tags columns:
create index on images_tags(tag_id);
create index on images_tags(image_id);
The slowest part of the query is likely to be the IN part of the WHERE clause. You can speed that up if you are prepared to create a temporary table containing the target tags:
create temp table target_tags(tag_id int primary key);
insert into target_tags values (1);
insert into target_tags values (2);
select images.image_id
from images
join images_tags on (images.image_id=images_tags.image_id)
join target_tags on images_tags.tag_id=target_tags.tag_id
group by images.image_id
having count(*) = (select count(*) from target_tags)
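If the application hands you the tag list as an array, the temp table can also be filled in a single statement instead of row-by-row inserts (a PostgreSQL-specific sketch using unnest):
insert into target_tags
select unnest(?::int[]);  -- e.g. pass '{1,2}' as the array parameter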

What index do I need?

I have problems with the performance of this query. If I remove the ORDER BY clause, everything works well, but I really need it. I have tried many indexes without any results. Can you help me, please?
SELECT *
FROM "refuel_request" AS "refuel_request"
LEFT OUTER JOIN "user" AS "user" ON "refuel_request"."user_id" = "user"."user_id"
LEFT OUTER JOIN "bill_qr" AS "bill_qr" ON "refuel_request"."bill_qr_id" = "bill_qr"."bill_qr_id"
LEFT OUTER JOIN "car" AS "order.car" ON "refuel_request"."car_id" = "order.car"."car_id"
LEFT OUTER JOIN "refuel_request_status" AS "refuel_request_status" ON "refuel_request"."refuel_request_status_id" = "refuel_request_status"."refuel_request_status_id"
WHERE
refuel_request."refuel_request_status_id" IN ( '1', '2', '3')
ORDER BY "refuel_request".created_at desc
LIMIT 10
Here is the EXPLAIN of this query:
EXPLAIN (ANALYZE, BUFFERS)
Primary Keys and/or Foreign Keys
pk_refuel_request_id
refuel_request_bill_qr_id_fkey
refuel_request_user_id_fkey
All outer-joined tables are 1:n related to refuel_request. This means your query is looking for the last ten created refuel requests with status 1 to 3.
You are outer joining the tables because not every refuel_request is related to a user, a bill_qr, a car, and a status; or you are outer joining by mistake. Either way, none of the joins changes the number of retrieved rows; it's still one row per refuel request. In order to join the other tables' rows, the DBMS just needs their primary key indexes. Nothing to worry about.
The only thing we must care about is finding the top refuel_request rows for the statuses you are interested in as quickly as possible.
Use a partial index that only contains data for the statuses in question. The column you index is the created_at column, so as to get the top 10 immediately.
CREATE INDEX idx ON refuel_request (created_at DESC)
WHERE refuel_request_status_id IN (1, 2, 3);
Partial indexes are explained here: https://www.postgresql.org/docs/current/indexes-partial.html
You cannot have an index that supports both the WHERE condition and the ORDER BY, because you are using IN and not =.
The fastest option is to split the query into three parts, so that each part compares refuel_request.refuel_request_status_id with =. Combine these three queries with UNION ALL. Each of the queries has ORDER BY and LIMIT 10, and you wrap the whole thing in an outer query that has another ORDER BY and LIMIT 10.
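A sketch of that rewrite, using the three status IDs from the question (each branch can use the two-column index suggested below):
SELECT *
FROM (
    (SELECT * FROM refuel_request
     WHERE refuel_request_status_id = '1'
     ORDER BY created_at DESC LIMIT 10)
    UNION ALL
    (SELECT * FROM refuel_request
     WHERE refuel_request_status_id = '2'
     ORDER BY created_at DESC LIMIT 10)
    UNION ALL
    (SELECT * FROM refuel_request
     WHERE refuel_request_status_id = '3'
     ORDER BY created_at DESC LIMIT 10)
) AS top_requests  -- alias name is made up
ORDER BY created_at DESC
LIMIT 10;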
Then you need these indexes:
CREATE INDEX ON refuel_request (refuel_request_status_id, created_at);
CREATE INDEX ON "user" (user_id);
CREATE INDEX ON bill_qr (bill_qr_id);
CREATE INDEX ON car (car_id);
CREATE INDEX ON refuel_request_status (refuel_request_status_id);
You need at least the indexes for the joins (do you really need LEFT joins?)
LEFT OUTER JOIN "user" AS "user" ON "refuel_request"."user_id" = "user"."user_id"
So, refuel_request.user_id must be in the index
LEFT OUTER JOIN "bill_qr" AS "bill_qr" ON "refuel_request"."bill_qr_id" =
LEFT OUTER JOIN "car" AS "order.car" ON "refuel_request"."car_id" =
bill_qr_id and car_id too
LEFT OUTER JOIN "refuel_request_status" AS "refuel_request_status" ON "refuel_request"."refuel_request_status_id" =
and refuel_request_status_id
WHERE
refuel_request."refuel_request_status_id" IN ( '1', '2', '3')
refuel_request_status_id must be the first key in the index as we need it in the WHERE
ORDER BY "refuel_request".created_at desc
and then created_at, since it's in the ORDER BY clause. This will not improve performance per se, but it allows the ORDER BY to run without access to the table data, for the same reason we put the other non-WHERE columns in the index. Of course a partial index is even better: we shift the WHERE into the partiality clause and use created_at for the rest. (The LIMIT 10 means we can do without the extra columns in the index, since retrieving the three 1:n joined rows for each of ten results costs very little; in a different situation we might find it useful to keep those extra columns.)
So one index that contains, in this order:
refuel_request_status_id, created_at, bill_qr_id, car_id, user_id
^ WHERE                   ^ ORDER BY  ^ used by the JOINs
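Spelled out as DDL, that covering index would look like this (a sketch; the index name is made up):
CREATE INDEX refuel_request_covering_idx
    ON refuel_request (refuel_request_status_id, created_at, bill_qr_id, car_id, user_id);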
However, do you really need a SELECT *? You'd likely get better performance if you only selected the fields you're actually going to use.
The most effective index for this query would be on refuel_request (refuel_request_status_id, created_at DESC) so that both the main filtering and the ordering can be done using the index. You also want indexes on the columns you're joining, but those tables are small and inconsequential at the moment. In any case, the index I suggest isn't actually going to help much with the performance pain points you're having right now. Here are some suggestions:
Don't use SELECT * unless you really need all of the columns from all of these tables you're joining. Specifying only the necessary columns means Postgres can load less data into memory, and work over it faster.
Postgres is spending a lot of time on the joins, joining about a million rows each time, when you're really only interested in ten of those rows. We can encourage it to do the order/limit first by rearranging the query somewhat:
WITH refuel_request_subset AS MATERIALIZED (
    SELECT *
    FROM refuel_request
    WHERE refuel_request_status_id IN ('1', '2', '3')
    ORDER BY created_at DESC
    LIMIT 10
)
SELECT *
FROM refuel_request_subset AS refuel_request
LEFT OUTER JOIN "user" ON refuel_request.user_id = "user".user_id  -- "user" must stay quoted: it is a reserved word
LEFT OUTER JOIN bill_qr ON refuel_request.bill_qr_id = bill_qr.bill_qr_id
LEFT OUTER JOIN car AS "order.car" ON refuel_request.car_id = "order.car".car_id
LEFT OUTER JOIN refuel_request_status ON refuel_request.refuel_request_status_id = refuel_request_status.refuel_request_status_id;
Note: This assumes that the LEFT JOINS will not add rows to the result set, as is the case with your current dataset.
This trick only really works if you have a fixed number of IDs, but you can do the refuel_request_subset query separately for each ID and then UNION the results, as opposed to using the IN operator. That would allow Postgres to fully use the index mentioned above.

Should I always prefer EXISTS over COUNT() > 0 in SQL?

I often encounter the advice that, when checking for the existence of any rows from a (sub)query, one should use EXISTS instead of COUNT(*) > 0, for reasons of performance. Specifically, the former can short-circuit and return TRUE (or FALSE in the case of NOT EXISTS) after finding a single row, while COUNT needs to actually evaluate each row in order to return a number, only to be compared to zero.
This all makes perfect sense to me in simple cases. However, I recently ran into a problem where I needed to filter groups in the HAVING clause of a GROUP BY, based on whether all values in a certain column of the group were NULL.
For the sake of clarity, let's see an example. Let's say I have the following schema:
CREATE TABLE profile(
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    google_account_id INTEGER NULL,
    facebook_account_id INTEGER NULL,
    FOREIGN KEY (user_id) REFERENCES user(id),
    CHECK(
        (google_account_id IS NOT NULL) + (facebook_account_id IS NOT NULL) = 1
    )
)
I.e. each user (table not shown for brevity) has 0 or more profiles. Each profile is either a Google or a Facebook account. (This is the translation of subclasses or a sum type with some associated data — in my real schema, the account IDs are also foreign keys to different tables holding that associated data, but this is not relevant to my question.)
Now, say I wanted to count the Facebook profiles for all users who do NOT have any Google profiles.
At first, I wrote the following query using COUNT() = 0:
SELECT user_id, COUNT(facebook_account_id)
FROM profile
GROUP BY user_id
HAVING COUNT(google_account_id) = 0;
But then it occurred to me that the condition in the HAVING clause is actually just an existence check. So I rewrote the query using a subquery and NOT EXISTS:
SELECT user_id, COUNT(facebook_account_id)
FROM profile AS p
GROUP BY user_id
HAVING NOT EXISTS (
    SELECT 1
    FROM profile AS q
    WHERE q.user_id = p.user_id
      AND q.google_account_id IS NOT NULL
)
My question is two-fold:
Should I keep the second, re-formulated query, and use NOT EXISTS with a subquery instead of COUNT() = 0? Is this really more efficient? I reckon that the index lookup due to the WHERE p.user_id = q.user_id condition has some additional cost. Whether this additional cost is absorbed by the short-circuiting behavior of EXISTS could as well depend on the average cardinality of the groups, could it not?
Or could the DBMS perhaps be smart enough to recognize the fact that the grouping key is being compared against, and optimize this subquery away completely, by replacing it with the current group (instead of actually performing an index lookup for each group)? I seriously doubt that a DBMS could optimize away this subquery, while failing to optimize COUNT() = 0 into NOT EXISTS.
Efficiency aside, the second query seems significantly more convoluted and less obviously correct to me, so I'd be reluctant to use it even if it happened to be faster. What do you think, is there a better way? Could I have my cake and eat it too, by using NOT EXISTS in a simpler manner, for instance by directly referencing the current group from within the HAVING clause?
You should prefer EXISTS/NOT EXISTS over COUNT() in a subquery. So instead of:
select t.*
from t
where (select count(*) from z where z.x = t.x) > 0
You should instead use:
select t.*
from t
where exists (select 1 from z where z.x = t.x)
The reasoning for this is that the subquery can stop processing at the first match.
This reasoning doesn't apply in a HAVING clause after an aggregation -- all the rows have to be generated anyway so there is little value in stopping at the first match.
However, aggregation might not be necessary if you have a users table and don't really need the facebook count. You could use:
select u.*
from users u
where not exists (select 1
                  from profile p
                  where p.user_id = u.user_id and p.google_account_id is not null
                 );
Also, the aggregation might be faster if you filter before the aggregation:
SELECT user_id, COUNT(facebook_account_id)
FROM profile AS p
WHERE NOT EXISTS (
    SELECT 1
    FROM profile p2
    WHERE p2.user_id = p.user_id AND p2.google_account_id IS NOT NULL
)
GROUP BY user_id;
Whether it actually is faster depends on a number of factors, including the number of rows that are actually filtered out.
The first query seems like the right way to do what you want.
That's already an aggregate query, since you want to count the Facebook accounts. The overhead of processing the HAVING clause, which counts the Google accounts, should be tiny.
On the other hand, the second approach requires reopening the table and scanning it, which is most probably more expensive.
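As for referencing the current group directly from the HAVING clause, as the question muses: an aggregate can express "all google_account_id values in this group are NULL" without any subquery. A sketch using the standard EVERY aggregate (supported by PostgreSQL, among others):
SELECT user_id, COUNT(facebook_account_id)
FROM profile
GROUP BY user_id
HAVING every(google_account_id IS NULL);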

Does EXCEPT execute faster than a JOIN when the table columns are the same

To find all the changes between two databases, I am left joining the tables on the pk and using a date_modified field to choose the latest record. Will using EXCEPT increase performance, since the tables have the same schema? I would like to rewrite it with an EXCEPT, but I'm not sure the implementation of EXCEPT would outperform a JOIN in every case. Hopefully someone has a more technical explanation for when to use EXCEPT.
There is no way anyone can tell you that EXCEPT will always or never out-perform an equivalent OUTER JOIN. The optimizer will choose an appropriate execution plan regardless of how you write your intent.
That said, here is my guideline:
Use EXCEPT when at least one of the following is true:
The query is more readable (this will almost always be true).
Performance is improved.
And BOTH of the following are true:
The query produces semantically identical results, and you can demonstrate this through sufficient regression testing, including all edge cases.
Performance is not degraded (again, in all edge cases, as well as environmental changes such as clearing buffer pool, updating statistics, clearing plan cache, and restarting the service).
It is important to note that it can be a challenge to write an equivalent EXCEPT query as the JOIN becomes more complex and/or you are relying on duplicates in part of the columns but not others. Writing a NOT EXISTS equivalent, while slightly less readable than EXCEPT, should be far easier to accomplish, and will often lead to a better plan (but note that I would never say ALWAYS or NEVER, except in the way I just did).
In this blog post I demonstrate at least one case where EXCEPT is outperformed by both a properly constructed LEFT OUTER JOIN and of course by an equivalent NOT EXISTS variation.
In the following example, the LEFT JOIN is faster than EXCEPT by 70%
(PostgreSQL 9.4.3)
Example:
There are three tables: suppliers, parts, shipments.
We need to get all parts not supplied by any supplier in London.
Database (with indexes on all involved columns):
CREATE TABLE suppliers (
    id bigint primary key,
    city character varying NOT NULL
);
CREATE TABLE parts (
    id bigint primary key,
    name character varying NOT NULL
);
CREATE TABLE shipments (
    id bigint primary key,
    supplier_id bigint NOT NULL,
    part_id bigint NOT NULL
);
Records count:
db=# SELECT COUNT(*) FROM suppliers;
count
---------
1281280
(1 row)
db=# SELECT COUNT(*) FROM parts;
count
---------
1280000
(1 row)
db=# SELECT COUNT(*) FROM shipments;
count
---------
1760161
(1 row)
Query using EXCEPT.
SELECT parts.*
FROM parts
EXCEPT
SELECT parts.*
FROM parts
LEFT JOIN shipments
ON (parts.id = shipments.part_id)
LEFT JOIN suppliers
ON (shipments.supplier_id = suppliers.id)
WHERE suppliers.city = 'London'
;
-- Execution time: 3327.728 ms
Query using LEFT JOIN with a table returned by a subquery.
SELECT parts.*
FROM parts
LEFT JOIN (
SELECT parts.id
FROM parts
LEFT JOIN shipments
ON (parts.id = shipments.part_id)
LEFT JOIN suppliers
ON (shipments.supplier_id = suppliers.id)
WHERE suppliers.city = 'London'
) AS subquery_tbl
ON (parts.id = subquery_tbl.id)
WHERE subquery_tbl.id IS NULL
;
-- Execution time: 1136.393 ms
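For comparison, here is the NOT EXISTS variation mentioned in the first answer, written against the same schema (a sketch, not benchmarked here):
SELECT parts.*
FROM parts
WHERE NOT EXISTS (
    SELECT 1
    FROM shipments
    JOIN suppliers ON suppliers.id = shipments.supplier_id
    WHERE shipments.part_id = parts.id
      AND suppliers.city = 'London'
);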

Database query times out on heroku

I'm stress testing an app by adding loads and loads of items and forcing it to do lots of work.
select *, (
select price
from prices
WHERE widget_id = widgets.id
ORDER BY id DESC
LIMIT 1
) as maxprice
FROM widgets
ORDER BY created_at DESC
LIMIT 20 OFFSET 0
That query selects from widgets (approx. 8,500 rows), and prices has 777,000 or so entries in it.
The query is timing out on the test environment, which is using the basic Heroku shared database. (193 MB in use of the 5 GB max.)
What will solve that timeout issue? The prices update each hour, so every hour you get 8,500 new rows.
It's a hugely excessive amount for the app (in reality it's unlikely it would ever have 8,500 widgets), but I'm wondering what's appropriate to solve this.
Is my query stupid? (i.e. is it a bad style of query to do that subselect - my SQL knowledge is terrible, one of the goals of this project is to improve it!)
Or am I just hitting a limit of a shared db and should expect to move onto a dedicated db (e.g. the min $200 per month dedicated postgres instance from Heroku.) given the size of the prices table? Is there a deeper issue in terms of how I've designed the DB? (i.e. it's a one to many, one widget has many prices.) Is there a more sensible approach?
I'm totally new to the world of sql and queries etc. at scale, hence the utter ignorance expressed above. :)
Final version after comments below:
@Dave wants the latest price per widget. You could do that with sub-queries and LIMIT 1 per widget, but in modern PostgreSQL, a window function does the job more elegantly. Consider first_value() / last_value():
SELECT DISTINCT
       w.*
     , first_value(p.price) OVER (PARTITION BY w.id
                                  ORDER BY p.created_at DESC) AS latest_price
       -- DISTINCT collapses the join to one row per widget: the window function
       -- yields the same latest_price for every row of a given widget
FROM (
    SELECT *
    FROM widgets
    ORDER BY created_at DESC
    LIMIT 20
) w
JOIN prices p ON p.widget_id = w.id;
Original post for the maximum price per widget:
SELECT w.*
     , max(p.price) AS max_price
FROM (
    SELECT *
    FROM widgets
    ORDER BY created_at DESC
    LIMIT 20
) w
JOIN prices p ON p.widget_id = w.id
GROUP BY w.col1, w.col2  -- spell out all columns of w.*
Fix table aliases.
Retrieve all columns of widgets, as the question demonstrates.
In PostgreSQL 8.3 you must spell out all non-aggregated columns of the SELECT list in the GROUP BY clause. In PostgreSQL 9.1 or later, the primary key column would cover the whole table. I quote the manual here:
Allow non-GROUP BY columns in the query target list when the primary key is specified in the GROUP BY clause
I advise never using mixed-case identifiers like maxWidgetPrice. Unquoted identifiers are folded to lower case by default in PostgreSQL. Do yourself a favor and use lower-case identifiers exclusively.
Always use explicit JOIN conditions where possible. It's the canonical SQL way and it's more readable.
OFFSET 0 is just noise.
Indexes:
However, the key to performance is the right set of indexes. I would go with two indexes like these:
CREATE INDEX widgets_created_at_idx ON widgets (created_at DESC);
CREATE INDEX prices_widget_id_idx ON prices(widget_id, price DESC);
The second one is a multicolumn index that should provide the best performance for retrieving the maximum price after you have determined the top 20 widgets using the first index. Not sure if PostgreSQL 8.3 (the default on the Heroku shared db) is already smart enough to make the most of it. PostgreSQL 9.1 certainly is.
For the latest price (see comments), use this index instead:
CREATE INDEX prices_widget_id_idx ON prices(widget_id, created_at DESC);
You don't have to (and shouldn't) just trust me. Test performance and query plans with EXPLAIN ANALYZE with and without indexes and see for yourself. Index creation should be very fast, even for a million rows.
If you consider switching to a standalone PostgreSQL database on Heroku, you may be interested in this recent Heroku blog post:
The default is now PostgreSQL 9.1.
You can also cancel long-running queries there now.
I'm not quite clear on what you are asking, but here is my understanding:
Find the widgets you want to price. In this case it looks like you are looking for the most recent 20 widgets:
SELECT w.id
FROM widgets
ORDER BY created_at DESC
LIMIT 20 OFFSET 0
For each of the 20 widgets you found, it seems you want to find the highest associated price from the prices table:
SELECT s.id, MAX(p.price) AS maxWidgetPrice
FROM (SELECT w.id
FROM widgets
ORDER BY created_at DESC
LIMIT 20 OFFSET 0
) s -- widget subset
, prices p
WHERE s.id = p.widget_id
GROUP BY s.id
prices.widget_id needs to be indexed for this to be effective. You don't want to process the entire prices table each time if it is relatively large, just the subset of rows you need.
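For example, following the same convention as the indexes shown earlier in this document (a minimal sketch):
create index on prices(widget_id);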
EDIT: added "group by" (and no, this was not tested)

OR query performance and strategies with Postgresql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (e.g., a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed, and actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to have a lot of records rather quickly.
Here's the current query (the followings table joins users to feeders, a.k.a. users and groups):
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42)
ORDER BY feed_items.created_at DESC
LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.
EXPLAIN ANALYZE and time the query to see if there is a problem.
Also, you could try expressing the query as a union:
SELECT x.* FROM
(
    SELECT feed_items.* FROM feed_items
    INNER JOIN followings
        ON followings.feeder_id = feed_items.subject_id
        AND followings.feeder_type = feed_items.subject_type
    WHERE (followings.follower_id = 42)
    UNION
    SELECT feed_items.* FROM feed_items
    INNER JOIN followings
        ON followings.feeder_id = feed_items.actor_id
        AND followings.feeder_type = feed_items.actor_type
    WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again, EXPLAIN ANALYZE and benchmark.
To find out if there is a performance problem, measure it. PostgreSQL can explain it for you.
I don't think that the query needs simplifying, if you identify a performance problem then you may need to revise your indexes.
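If measuring does reveal a problem, indexes matching the two halves of the OR (or the two branches of the union above) would be the first thing to try; a sketch, with made-up index names:
CREATE INDEX followings_follower_idx ON followings (follower_id, feeder_type, feeder_id);
CREATE INDEX feed_items_subject_idx ON feed_items (subject_type, subject_id);
CREATE INDEX feed_items_actor_idx ON feed_items (actor_type, actor_id);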