OR query performance and strategies with Postgresql - sql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (eg, a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed, and actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use an normal index. And since a record is created every time an action is performed, I expect this table to have a lot of records rather quickly.
Here's the current query (the following table joins users to feeders, aka, users and groups)
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42) ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?

What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.

Explain analyze and time query to see if there is a problem.
Aso you could try expressing the query as a union
SELECT x.* FROM
(
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type
WHERE (followings.follower_id = 42)
UNION
SELECT feed_items.* FROM feed_items
INNER JOIN followings
followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again explain analyze and benchmark.

To find out if there is a performance problem measure it. PostgreSQL can explain it for you.
I don't think that the query needs simplifying, if you identify a performance problem then you may need to revise your indexes.

Related

Abap subquery Where Cond [duplicate]

I have a requirement to pull records, that do not have history in an archive table. 2 Fields of 1 record need to be checked for in the archive.
In technical sense my requirement is a left join where right side is 'null' (a.k.a. an excluding join), which in abap openSQL is commonly implemented like this (for my scenario anyways):
Select * from xxxx //xxxx is a result for a multiple table join
where xxxx~key not in (select key from archive_table where [conditions] )
and xxxx~foreign_key not in (select key from archive_table where [conditions] )
Those 2 fields are also checked against 2 more tables, so that would mean a total of 6 subqueries.
Database engines that I have worked with previously usually had some methods to deal with such problems (such as excluding join or outer apply).
For this particular case I will be trying to use ABAP logic with 'for all entries', but I would still like to know if it is possible to use results of a sub-query to check more than than 1 field or use another form of excluding join logic on multiple fields using SQL (without involving application server).
I have tested quite a few variations of sub-queries in the life-cycle of the program I was making. NOT EXISTS with multiple field check (shortened example below) to exclude based on 2 keys works in certain cases.
Performance acceptable (processing time is about 5 seconds), although, it's noticeably slower than the same query when excluding based on 1 field.
Select * from xxxx //xxxx is a result for a multiple table inner joins and 1 left join ( 1-* relation )
where NOT EXISTS (
select key from archive_table
where key = xxxx~key OR key = XXXX-foreign_key
)
EDIT:
With changing requirements (for more filtering) a lot has changed, so I figured I would update this. The construct I marked as XXXX in my example contained a single left join ( where main to secondary table relation is 1-* ) and it appeared relatively fast.
This is where context becomes helpful for understanding the problem:
Initial requirement: pull all vendors, without financial records in 3
tables.
Additional requirements: also exclude based on alternative
payers (1-* relationship). This is what example above is based on.
More requirements: also exclude based on alternative payee (*-* relationship between payer and payee).
Many-to-many join exponentially increased the record count within the construct I labeled XXXX, which in turn produces a lot of unnecessary work. For instance: a single customer with 3 payers, and 3 payees produced 9 rows, with a total of 27 fields to check (3 per row), when in reality there are only 7 unique values.
At this point, moving left-joined tables from main query into sub-queries and splitting them gave significantly better performance.
than any smarter looking alternatives.
select * from lfa1 inner join lfb1
where
( lfa1~lifnr not in ( select lifnr from bsik where bsik~lifnr = lfa1~lifnr )
and lfa1~lifnr not in ( select wyt3~lifnr from wyt3 inner join t024e on wyt3~ekorg = t024e~ekorg and wyt3~lifnr <> wyt3~lifn2
inner join bsik on bsik~lifnr = wyt3~lifn2 where wyt3~lifnr = lfa1~lifnr and t024e~bukrs = lfb1~bukrs )
and lfa1~lifnr not in ( select lfza~lifnr from lfza inner join bsik on bsik~lifnr = lfza~empfk where lfza~lifnr = lfa1~lifnr )
)
and [3 more sets of sub queries like the 3 above, just checking different tables].
My Conclusion:
When exclusion is based on a single field, both not in/not exits work. One might be better than the other, depending on filters you use.
When exclusion is based on 2 or more fields and you don't have many-to-many join in main query, not exists ( select .. from table where id = a.id or id = b.id or... ) appears to be the best.
The moment your exclusion criteria implements a many-to-many relationship within your main query, I would recommend looking for an optimal way to implement multiple sub-queries instead (even having a sub-query for each key-table combination will perform better than a many-to-many join with 1 good sub-query, that looks good).
Anyways, any additional insight into this is welcome.
EDIT2: Although it's slightly off topic, given how my question was about sub-queries, I figured I would post an update. After over a year I had to revisit the solution I worked on to expand it. I learned that proper excluding join works. I just failed horribly at implementing it the first time.
select header~key
from headers left join items on headers~key = items~key
where items~key is null
if it is possible to use results of a sub-query to check more than
than 1 field or use another form of excluding join logic on multiple
fields
No, it is not possible to check two columns in subquery, as SAP Help clearly says:
The clauses in the subquery subquery_clauses must constitute a scalar
subquery.
Scalar is keyword here, i.e. it should return exactly one column.
Your subquery can have multi-column key, and such syntax is completely legit:
SELECT planetype, seatsmax
FROM saplane AS plane
WHERE seatsmax < #wa-seatsmax AND
seatsmax >= ALL ( SELECT seatsocc
FROM sflight
WHERE carrid = #wa-carrid AND
connid = #wa-connid )
however you say that these two fields should be checked against different tables
Those 2 fields are also checked against two more tables
so it's not the case for you. Your only choice seems to be multi-join.
P.S. FOR ALL ENTRIES does not support negation logic, you cannot just use some sort of NOT IN FOR ALL ENTRIES, it won't be that easy.

In an EXISTS can my JOIN ON use a value from the original select

I have an order system. Users with can be attached to different orders as a type of different user. They can download documents associated with an order. Documents are only given to certain types of users on the order. I'm having trouble writing the query to check a user's permission to view a document and select the info about the document.
I have the following tables and (applicable) fields:
Docs: DocNo, FileNo
DocAccess: DocNo, UserTypeWithAccess
FileUsers: FileNo, UserType, UserNo
I have the following query:
SELECT Docs.*
FROM Docs
WHERE DocNo = 1000
AND EXISTS (
SELECT * FROM DocAccess
LEFT JOIN FileUsers
ON FileUsers.UserType = DocAccess.UserTypeWithAccess
AND FileUsers.FileNo = Docs.FileNo /* Errors here */
WHERE DocAccess.UserNo = 2000 )
The trouble is that in the Exists Select, it does not recognize Docs (at Docs.FileNo) as a valid table. If I move the second on argument to the where clause it works, but I would rather limit the initial join rather than filter them out after the fact.
I can get around this a couple ways, but this seems like it would be best. Anything I'm missing here? Or is it simply not allowed?
I think this is a limitation of your database engine. In most databases, docs would be in scope for the entire subquery -- including both the where and in clauses.
However, you do not need to worry about where you put the particular clause. SQL is a descriptive language, not a procedural language. The purpose of SQL is to describe the output. The SQL engine, parser, and compiler should be choosing the most optimal execution path. Not always true. But, move the condition to the where clause and don't worry about it.
I am not clear why do you need to join with FileUsers at all in your subquery?
What is the purpose and idea of the query (in plain English)?
In any case, if you do need to join with FileUsers then I suggest to use the inner join and move second filter to the WHERE condition. I don't think you can use it in JOIN condition in subquery - at least I've never seen it used this way before. I believe you can only correlate through WHERE clause.
You have to use aliases to get this working:
SELECT
doc.*
FROM
Docs doc
WHERE
doc.DocNo = 1000
AND EXISTS (
SELECT
*
FROM
DocAccess acc
LEFT OUTER JOIN
FileUsers usr
ON
usr.UserType = acc.UserTypeWithAccess
AND usr.FileNo = doc.FileNo
WHERE
acc.UserNo = 2000
)
This also makes it more clear which table each field belongs to (think about using the same table twice or more in the same query with different aliases).
If you would only like to limit the output to one row you can use TOP 1:
SELECT TOP 1
doc.*
FROM
Docs doc
INNER JOIN
FileUsers usr
ON
usr.FileNo = doc.FileNo
INNER JOIN
DocAccess acc
ON
acc.UserTypeWithAccess = usr.UserType
WHERE
doc.DocNo = 1000
AND acc.UserNo = 2000
Of course the second query works a bit different than the first one (both JOINS are INNER). Depeding on your data model you might even leave the TOP 1 out of that query.

SQL queries with views and subqueries

select nid, avg, std from sView1
where sid = 4891
and nid in (select distinct nid from tblref where rid = 799)
and oidin (select distinct oid from tblref where rid = 799)
and anscount > 3
This is a query I'm currently trying to run. And running it like this takes about 3-4 seconds. However, if I replace the "4891" value with a subquery saying (select distinct sid from tblref where rid = 799) the procedure just hangs, even though the subquery only returns one sid.
The query is supposed to return a dataset with averages (avg) and standard deviations (std) over a resultset which is calculated through nested views in sView1. This dataset is then run through another view to get some top-level averages and stdevs.
The averages may need to include more than 1 sid (sid identifies a dataset).
It's difficult describing it more without revealing codebase and codestructure that shouldn't be revealed ;)
Can anyone suggest why the query hangs when trying to use the subquery? (The code is rebuilt from originally using nested cursors, since I have been told that cursors are the work of the devil, and nested cursors may make me sterile)
Try this. Exists returns as soon as it finds a matching condition, select distinct will require going through the dataset and optionally sorting it to remove the duplicates.
SELECT nid,avg,std from sView1 AS SV
WHERE EXISTS (SELECT * FROM TblRef AS TR WHERE sv.sid = Tr.sid AND Sv.nid = tr.nid AND sv.oid = tr.oid AND tr.rid = 799)
AND ansCount>3
Also, it is pretty difficult to provide a meaningful answer without access to query plans and table structures. So DDL and sample data will definitely help.

Sub-query Optimization Talk with an example case

I need advises and want to share my experience about Query Optimization. This week, I found myself stuck in an interesting dilemma.
I'm a novice person in mySql (2 years theory, less than one practical)
Environment :
I have a table that contains articles with a column 'type', and another table article_version that contain a date where an article is added in the DB, and a third table that contains all the article types along with types label and stuffs...
The 2 first tables are huge (800000+ fields and growing daily), the 3rd one is naturally small sized. The article tables have a lot of column, but we will only need 'ID' and 'type' in articles and 'dateAdded' in article_version to simplify things...
What I want to do :
A Query that, for a specified 'dateAdded', returns the number of articles for each types (there is ~ 50 types to scan).
What was already in place is 50 separate count, one for each document types oO ( not efficient, long(~ 5sec in general), ).
I wanted to do it all in one query and I came up with that :
SELECT type,
(SELECT COUNT(DISTINCT articles.ID)
FROM articles
INNER JOIN article_version
ON article_version.ARTI_ID = legi_arti.ID
WHERE type = td.NEW_ID
AND dateAdded = '2009-01-01 00:00:00') AS nbrArti
FROM type_document td
WHERE td.NEW_ID != ''
GROUP BY td.NEW_ID;
The external select (type_document) allow me to get the 55 types of documents I need.
The sub-Query is counting the articles for each type_document for the given date '2009-01-01'.
A common result is like :
* type * nbrArti *
*************************
* 123456 * 23 *
* 789456 * 5 *
* 16578 * 98 *
* .... * .... *
* .... * .... *
*************************
This query get the job done, but the join in the sub-query is making this extremely slow, The reason, if I'm right, is that a join is made by the server for each types, so 50+ times, this solution is even more slower than doing the 50 queries independently for each types, awesome :/
A Solution
I came up with a solution myself that drastically improve the performance with the same result, I just created a view corresponding to the subQuery, making the join on ids for each types... And Boom, it's f.a.s.t.
I think, correct me if I'm wrong, that the reason is the server only runs the JOIN statement once.
This solution is ~5 time faster than the solution that was already there, and ~20 times faster than my first attempt. Sweet
Questions / thoughts
With yet another view, I'll now need to check if I don't loose more than win when documents get inserted...
Is there a way to improve the original Query, by getting the JOIN statement out of the sub-query? (And getting rid of the view)
Any other tips/thoughts? (In Server Optimizing for example...)
Apologies for my approximating English, it'is not my primary language.
You cannot create a single index on (type, date_added), because these fields are in different tables.
Without the view, the subquery most probably selects article as a leading table and the index on type which is not very selective.
By creating the view, you force the subquery to calculate the sums for all types first (using a selective the index on date) and then use a JOIN BUFFER (which is fast enough for only 55 types).
You can achieve similar results by rewriting your query as this:
SELECT new_id, COALESCE(cnt, 0) AS cnt
FROM type_document td
LEFT JOIN
(
SELECT type, COUNT(DISTINCT article_id) AS cnt
FROM article_versions av
JOIN articles a
ON a.id = av.article_id
WHERE av.date = '2009-01-01 00:00:00'
GROUP BY
type
) q
ON q.type = td.new_id
Unfortunately, MySQL is not able to do table spools or hash joins, so to improve the performance you'll need to denormalize your tables: add type to article_version and create a composite index on (date, type).

Sorting and Filtering Records

is there any known pattern/algorithm on how to perform sorting or filtering a list of records (from database) in the correct way? My current attempt involves usage of a form that provides some filtering and sorting options, and then append these criteria and sorting algorithm to my existing SQL. However, I find it can be easily abused that users may get results that they are not suppose to see.
The application that I am building is a scheduler where I store all the events in a database table. Then depending on user's level/role/privilege a different subset of data will be displayed. Therefore my query can be as complicated as
SELECT *
FROM app_event._event_view
WHERE ((class_id = (SELECT class_id FROM app_event._ical_class WHERE name = 'PUBLIC') AND class_id != (SELECT class_id FROM app_event._ical_class WHERE name = 'PRIVATE') AND class_id != (SELECT class_id FROM app_event._ical_class WHERE name = 'CONFIDENTIAL')) OR user_id = '1')
AND calendar_id IN (SELECT calendar_id FROM app_event.calendar WHERE is_personal = 't')
AND calendar_id = (SELECT calendar_id FROM app_event.calendar WHERE name = 'personal')
AND state_id IN (1,2,3,4,5,6) AND EXTRACT(year FROM dtstart) = '2008'
AND EXTRACT(month FROM dtstart) = '11'
As I am more concern about serving the correct subset of information, performance is not a major concern at the moment (as the sql mentioned was generated by the script, clause by clause). I am thinking of turning all these complicated SQL statements to views, so there will be less chances that the SQL generated is inappropriate.
First off, this query will look and perform better if you use joins:
SELECT *
FROM
app_event._event_view EV
INNER JOIN app_event.calendar C
ON EV.calendar_id = C.calendar.id
INNER JOIN app_event._ical_class IC
ON C.class_id = EV.class_id
WHERE
C.is_personal = 't'
AND C.name = 'personal'
AND state_id IN (1,2,3,4,5,6)
AND EXTRACT(year FROM dtstart) = '2008'
AND EXTRACT(month FROM dtstart) = '11'
AND (
IC.name = 'PUBLIC' -- other two are redundant
OR user_id = '1'
)
That's a good start for keeping complexity down. Then, if you want to add additional filters directly to the SQL, you can append more "AND ..." statements at the end. Since all the criteria are connected with AND (the OR is safely contained within parentheses), there's no possibility of someone adding a criterion after these that somehow undoes the ones above - that is, once you've restricted the results with one part of a clause, you can't "un-restrict" it with a later clause if it's strung together with ANDs. (Of course, if those additional clauses go anywhere near user text input, you have to sanitize them to prevent SQL injection attacks.)
The database can filter faster than any other tier, in general, because using a WHERE clause on the database will often allow the database to avoid even reading the unnecessary records off the disk, which is several orders of magnitude slower than anything you can do with the CPU. Sorting is often faster on the DB as well, because the DB can often do the joins in such a way that records are already sorted in the order you want them. So, performance wise, if you can do it on the DB, all the better.
You said performance isn't really the key here, though, so adding filters programatically in the business logic might be fine for you, and easier to read than messing with SQL strings.
It's hard to understand that query, because I have to scroll massively and since I don't know the database...
But if the privilegues are "one dimensional", e.g. admins can see everything, power users can see less than admins, guests can see less than power users, etc.: you could probably implement the privilegues as an integer in both the event and the user, then
select from event where conditions AND event.privilegue <= user.privilegue
If you set the privilegue values something like 10 000 (high/admin), 5000, 1(lowest/guest) you can later add more levels in between without changing all your code.
Use ORM tool or use parameters like:
SELECT *
FROM app_event._event_view
WHERE
(
:p_start_year is null or
(state_id IN (1,2,3,4,5,6) AND EXTRACT(year FROM dtstart) = :p_start_year)
)
and
(
:p_date_start is null or
AND EXTRACT(month FROM dtstart) = :p_date_start
)