Why is this PostgreSQL query so slow?

I'm no database expert, but I have enough knowledge to get myself into trouble, as is the case here. This query
SELECT DISTINCT p.*
FROM points p, areas a, contacts c
WHERE ( p.latitude > 43.6511659465
AND p.latitude < 43.6711659465
AND p.longitude > -79.4677941889
AND p.longitude < -79.4477941889)
AND p.resource_type = 'Contact'
AND c.user_id = 6
is extremely slow. The points table has fewer than 2000 records, but the query takes about 8 seconds to execute. There are indexes on the latitude and longitude columns. Removing the clauses concerning resource_type and user_id makes no difference.
The latitude and longitude fields are both formatted as number(15,10) -- I need the precision for some calculations.
There are many, many other queries in this project where points are compared, but no execution time problems. What's going on?

Did you forget something from your actual query? The three tables are listed ANSI-89 style (separated by commas) but the join conditions between them are missing, so you get a Cartesian product and then pull out only the POINTS records.

You're joining three tables, p, a, and c, but you aren't specifying how to attach them together. What you're getting is a full Cartesian join: every row of points and contacts that matches the criteria, combined with every row in areas.
You probably want to attach something in points to something in areas. And something in contacts with ... well, I don't know what your schema looks like.
Try sticking an "EXPLAIN" at the beginning for information on what's happening.
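
To see how quickly the row count blows up, here is a minimal sketch using Python's built-in sqlite3 module rather than PostgreSQL. The schema is an assumption: the `point_id` and `area_id` foreign-key columns are hypothetical, since the question does not show how the tables relate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: point_id and area_id are assumed link columns.
cur.executescript("""
CREATE TABLE points   (id INTEGER PRIMARY KEY, resource_type TEXT);
CREATE TABLE areas    (id INTEGER PRIMARY KEY, point_id INTEGER);
CREATE TABLE contacts (id INTEGER PRIMARY KEY, area_id INTEGER, user_id INTEGER);
""")
cur.executemany("INSERT INTO points VALUES (?, 'Contact')", [(i,) for i in range(1, 21)])
cur.executemany("INSERT INTO areas VALUES (?, ?)", [(i, i) for i in range(1, 21)])
cur.executemany("INSERT INTO contacts VALUES (?, ?, 6)", [(i, i) for i in range(1, 21)])

# Comma-separated tables with no join condition: every combination of rows.
cross_rows = cur.execute(
    "SELECT COUNT(*) FROM points p, areas a, contacts c WHERE c.user_id = 6"
).fetchone()[0]

# Same tables, each tied to the previous one by a join condition.
joined_rows = cur.execute("""
    SELECT COUNT(*)
    FROM points p
    JOIN areas a    ON a.point_id = p.id
    JOIN contacts c ON c.area_id = a.id
    WHERE c.user_id = 6
""").fetchone()[0]

print(cross_rows, joined_rows)
```

With only 20 rows per table the unjoined version already produces 20 x 20 x 20 intermediate rows, all of which DISTINCT then has to collapse again.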

You are probably missing the joins. Joining the tables would look something like this:
SELECT DISTINCT p.*
FROM points p
JOIN areas a ON a.FkPoint = p.id
JOIN contacts c ON c.FkArea = a.id
WHERE ( p.latitude > 43.6511659465
AND p.latitude < 43.6711659465
AND p.longitude > -79.4677941889
AND p.longitude < -79.4477941889)
AND p.resource_type = 'Contact'
AND c.user_id = 6
For better indexing of coordinates, use a quadtree or R-tree index implementation.
If you left out the joins intentionally, try a subquery like this:
SELECT DISTINCT thePoints.*
FROM (
SELECT DISTINCT p.*
FROM points p
WHERE ( p.latitude > 43.6511659465
AND p.latitude < 43.6711659465
AND p.longitude > -79.4677941889
AND p.longitude < -79.4477941889)
AND p.resource_type = 'Contact'
) AS thePoints
, areas a, contacts c
WHERE c.user_id = 6

You need an R-tree index and the @ operator; a normal index won't work.
R-Tree
http://www.postgresql.org/docs/8.1/static/indexes-types.html
@ operator
http://www.postgresql.org/docs/8.1/static/functions-geometry.html

Related

Oracle complex query with multiple joins on same table

I am dealing with a monster query (~800 lines) on Oracle 11, and it's consuming a lot of resources.
The main problem is the table mouvement, which has about 18 million rows and which I left join about 30 times.
LEFT JOIN mouvement mracct_ad1
ON mracct_ad1.code_portefeuille = t.code_portefeuille
AND mracct_ad1.statut_ligne = 'PROPRE'
AND substr(mracct_ad1.code_valeur,1,4) = 'MRAC'
AND mracct_ad1.code_transaction = t.code_transaction
LEFT JOIN mouvement mracct_zias
ON mracct_zias.code_portefeuille = t.code_portefeuille
AND mracct_zias.statut_ligne = 'PROPRE'
AND substr(mracct_zias.code_valeur,1,4) = 'PRAC'
AND mracct_zias.code_transaction = t.code_transaction
LEFT JOIN mouvement mracct_zixs
ON mracct_zias.code_portefeuille = t.code_portefeuille
AND mracct_zias.statut_ligne = 'XROPRE'
AND substr(mracct_zias.code_valeur,1,4) = 'MRAT'
AND mracct_zias.code_transaction = t.code_transaction
Is there some way to get rid of the left joins (a union or a different kind of join, for example) to make the query faster and less resource-hungry? Should I look at the execution plan or something?
Just a note on performance. Usually you want to "rephrase" conditions like:
AND substr(mracct_ad1.code_valeur,1,4) = 'MRAC'
In simple terms, an expression wrapped around the column on the left side of the equality prevents the best use of indexes and may push the SQL optimizer toward a less-than-optimal plan. The database engine will end up doing more work than is really needed, and the query will be [much] slower. In extreme cases it can even decide to use a Full Table Scan. In this case you can rephrase it as:
AND mracct_ad1.code_valeur like 'MRAC%'
or:
AND mracct_ad1.code_valeur >= 'MRAC' AND mracct_ad1.code_valeur < 'MRAD'
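
As a quick sanity check of that rewrite, here is a small sketch using Python's built-in sqlite3 module. It is SQLite, not Oracle, so the plans are only indicative, and the table contents and index name are made up for the demonstration. It checks that the range predicate returns the same rows as the substr() version and asks SQLite how it would execute each one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mouvement (id INTEGER PRIMARY KEY, code_valeur TEXT)")
cur.execute("CREATE INDEX idx_mouvement_code_valeur ON mouvement (code_valeur)")
cur.executemany(
    "INSERT INTO mouvement (code_valeur) VALUES (?)",
    [("MRAC001",), ("MRAC",), ("MRAD001",), ("MRAB999",), ("PRAC001",), ("MRACZZZ",)],
)

def ids(where):
    """Return the ids of the rows matching the given WHERE clause."""
    sql = f"SELECT id FROM mouvement WHERE {where} ORDER BY id"
    return [row[0] for row in cur.execute(sql)]

def plan(where):
    """Return the first line of SQLite's query plan description."""
    sql = f"EXPLAIN QUERY PLAN SELECT id FROM mouvement WHERE {where}"
    return cur.execute(sql).fetchall()[0][3]

SUBSTR_FORM = "substr(code_valeur, 1, 4) = 'MRAC'"
RANGE_FORM = "code_valeur >= 'MRAC' AND code_valeur < 'MRAD'"

with_substr = ids(SUBSTR_FORM)
with_range = ids(RANGE_FORM)
plan_substr = plan(SUBSTR_FORM)   # a SCAN: every row is visited
plan_range = plan(RANGE_FORM)     # a SEARCH: the index narrows the rows

print(with_substr == with_range)
print(plan_substr)
print(plan_range)
```

On Oracle the equivalent check would be to compare the execution plans of both forms; a function-based index on the substr() expression is another option if the query cannot be changed.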
I am guessing so. Your code sample doesn't make much sense, but you can probably do conditional aggregation:
left join
(select m.code_portefeuille, m.code_transaction,
max(case when m.statut_ligne = 'PROPRE' and m.code_valeur like 'MRAC%' then ? end) as ad1,
max(case when m.statut_ligne = 'PROPRE' and m.code_valeur like 'PRAC%' then ? end) as zia,
. . . -- for all the rest of the joins as well
from mouvement m
group by m.code_portefeuille, m.code_transaction
) m
on m.code_portefeuille = t.code_portefeuille and m.code_transaction = t.code_transaction
You can probably replace all 30 joins with a single join to the aggregated table.
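
Here is a minimal sketch of that idea, run with Python's built-in sqlite3 module instead of Oracle. The `montant` column is hypothetical and stands in for whatever value the `?` placeholder is meant to pull from each joined row. Note the assumption that at most one mouvement row matches each condition per portfolio and transaction; if several can match, MAX picks one of them while the original joins would multiply rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE t (code_portefeuille TEXT, code_transaction TEXT);
CREATE TABLE mouvement (
    code_portefeuille TEXT,
    code_transaction  TEXT,
    statut_ligne      TEXT,
    code_valeur       TEXT,
    montant           REAL   -- hypothetical value column, stands in for the "?"
);
INSERT INTO t VALUES ('P1', 'T1'), ('P1', 'T2');
INSERT INTO mouvement VALUES
    ('P1', 'T1', 'PROPRE', 'MRAC01', 10.0),
    ('P1', 'T1', 'PROPRE', 'PRAC01', 20.0),
    ('P1', 'T2', 'PROPRE', 'MRAC02', 30.0);
""")

# One LEFT JOIN per code prefix, as in the question.
many_joins = cur.execute("""
    SELECT t.code_transaction, ad1.montant, zias.montant
    FROM t
    LEFT JOIN mouvement ad1
           ON ad1.code_portefeuille = t.code_portefeuille
          AND ad1.code_transaction  = t.code_transaction
          AND ad1.statut_ligne = 'PROPRE'
          AND ad1.code_valeur LIKE 'MRAC%'
    LEFT JOIN mouvement zias
           ON zias.code_portefeuille = t.code_portefeuille
          AND zias.code_transaction  = t.code_transaction
          AND zias.statut_ligne = 'PROPRE'
          AND zias.code_valeur LIKE 'PRAC%'
    ORDER BY t.code_transaction
""").fetchall()

# A single LEFT JOIN to mouvement, pivoted once with conditional aggregation.
one_join = cur.execute("""
    SELECT t.code_transaction, m.ad1, m.zias
    FROM t
    LEFT JOIN (
        SELECT code_portefeuille, code_transaction,
               MAX(CASE WHEN statut_ligne = 'PROPRE' AND code_valeur LIKE 'MRAC%'
                        THEN montant END) AS ad1,
               MAX(CASE WHEN statut_ligne = 'PROPRE' AND code_valeur LIKE 'PRAC%'
                        THEN montant END) AS zias
        FROM mouvement
        GROUP BY code_portefeuille, code_transaction
    ) m ON m.code_portefeuille = t.code_portefeuille
       AND m.code_transaction  = t.code_transaction
    ORDER BY t.code_transaction
""").fetchall()

print(many_joins == one_join)
```

The aggregated version reads mouvement once, however many pivoted columns are added to the subquery.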

Position of ON and WHERE clauses and the efficiency performance

I have two tables, one called Health_User and the other called Diary. They hold users' demographic information and their recorded values, respectively. What I want to do is retrieve the recorded values, but:
Excluding testers (not real users) with the "is_tester" column (boolean values) in Health_User, and
Excluding unreasonable values with too high or too low measurements in Diary.
So I have several queries which should get the same results:
-- Query 1
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN (
SELECT id
FROM Health_User
WHERE is_tester = false
) AS u
ON d.user_id = u.id
WHERE ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
-- Query 2
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Diary AS d
JOIN Health_User AS u
ON d.user_id = u.id
WHERE u.is_tester = false
AND ((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02 AND d.unit = 'mmol/L'));
-- Query 3
SELECT d.user_id, d.id AS diary_id, d.glucose_value, d.unit
FROM Health_User AS u
JOIN (
SELECT id, user_id, glucose_value, unit
FROM Diary
WHERE ((glucose_value >= 20 AND glucose_value <= 600 AND unit = 'mg/dL')
OR (glucose_value >= 20/18.02 AND glucose_value <= 600/18.02 AND unit = 'mmol/L'))
) AS d
ON d.user_id = u.id
WHERE u.is_tester = false;
Here I have three questions:
Question 1: I would speculate that Query 1 would have better performance than Query 2, because a) it joins only one column instead of the whole table of Health_User and b) it filters out testers before joining the tables. Am I correct?
Question 2: The filter condition is more complex for Diary (see the last WHERE clause in Query 1). Is it better to move Diary inside the JOIN and put Health_User outside, as in Query 3, or does it make no difference?
Question 3: Is there any even better solution in terms of performance?
There would be a difference if the database executed the queries in the order your queries suggest (first filter, then join or vice versa).
As it is, PostgreSQL has a query optimizer that rearranges the query to find the most efficient execution order, and all your queries will end up with the same execution plan, which you can verify using the SQL statement EXPLAIN.
For inner joins, it does not influence the result if you filter before or after the join; you could also write all the conditions into the join condition without changing the result. The optimizer knows that.
You can speed up execution by creating appropriate indexes. Whether a certain index is useful depends on the distribution of the data. The rule of thumb is that indexes on selective conditions (ones that filter out many rows) are more useful. Work with EXPLAIN to find the best indexes.
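
The point that filter placement does not change the result of an inner join can be checked with a small sketch. This uses Python's built-in sqlite3 module rather than PostgreSQL, with made-up sample rows, and stores is_tester as 0/1 because that is what SQLite offers for booleans. It only shows that the three query shapes return the same rows; it says nothing about PostgreSQL's plans, for which EXPLAIN is the tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Health_User (id INTEGER PRIMARY KEY, is_tester INTEGER);
CREATE TABLE Diary (id INTEGER PRIMARY KEY, user_id INTEGER,
                    glucose_value REAL, unit TEXT);
INSERT INTO Health_User VALUES (1, 0), (2, 1), (3, 0);
INSERT INTO Diary VALUES
    (1, 1, 100, 'mg/dL'),    -- kept
    (2, 1, 900, 'mg/dL'),    -- out of range
    (3, 2, 100, 'mg/dL'),    -- belongs to a tester
    (4, 3, 5.5, 'mmol/L'),   -- kept
    (5, 3, 0.5, 'mmol/L');   -- out of range
""")

# The range filter from the question, written once and reused.
RANGE = """((d.glucose_value >= 20 AND d.glucose_value <= 600 AND d.unit = 'mg/dL')
         OR (d.glucose_value >= 20/18.02 AND d.glucose_value <= 600/18.02
             AND d.unit = 'mmol/L'))"""

# Query 1: filter users in a subquery, filter diary rows after the join.
q1 = cur.execute(f"""
    SELECT d.user_id, d.id FROM Diary AS d
    JOIN (SELECT id FROM Health_User WHERE is_tester = 0) AS u ON d.user_id = u.id
    WHERE {RANGE} ORDER BY d.id
""").fetchall()

# Query 2: plain join, both filters in WHERE.
q2 = cur.execute(f"""
    SELECT d.user_id, d.id FROM Diary AS d
    JOIN Health_User AS u ON d.user_id = u.id
    WHERE u.is_tester = 0 AND {RANGE} ORDER BY d.id
""").fetchall()

# Query 3: filter diary rows in a subquery, filter users after the join.
q3 = cur.execute(f"""
    SELECT d.user_id, d.id FROM Health_User AS u
    JOIN (SELECT d.id, d.user_id, d.glucose_value, d.unit
          FROM Diary AS d WHERE {RANGE}) AS d ON d.user_id = u.id
    WHERE u.is_tester = 0 ORDER BY d.id
""").fetchall()

print(q1, q1 == q2 == q3)
```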

How can I optimize this SQL query? (Solarwinds Orion)

I'm very new to SQL, and still learning. I'm using a reporting tool called Solarwinds Orion, and I'm honestly not sure how specific the query I have written is to the program, so if there's anything in the query that's confusing, let me know and I'll try to figure out if it's specific to the program or not.
The problem with the query I'm running is that it times out after a very long time (maybe an hour) of running. The database I'm using is huge. Unfortunately I don't really know how huge, but I've been told it's huge.
Is there anything I am doing wrong that would have a huge performance impact?
SELECT TOP 10000
Nodes.Caption AS NodeName,
NetflowApplicationSummary.AppName AS Application_Name,
SUM(NetflowApplicationSummary.TotalBytes) AS SUM_of_Bytes_Transferred,
AVG(Case OutBandwidth
When 0 Then 0
Else (NetflowApplicationSummary.TotalBytes/OutBandwidth) * 100
End) AS TEST_PERCENT
FROM
((NetflowApplicationSummary
INNER JOIN Nodes ON (NetflowApplicationSummary.NodeID = Nodes.NodeID))
INNER JOIN InterfaceTraffic ON (Nodes.NodeID = InterfaceTraffic.InterfaceID))
INNER JOIN Interfaces ON (Nodes.NodeID = Interfaces.NodeID)
WHERE
( InterfaceTraffic.DateTime > (GetDate()-30) )
AND
(Nodes.WANCircuit = 1)
GROUP BY Nodes.Caption, NetflowApplicationSummary.AppName
EDIT: I ran COUNT() on each of my tables with the below result.
SELECT COUNT(*) FROM NetflowApplicationSummary # 50671011
SELECT COUNT(*) FROM Nodes # 898
SELECT COUNT(*) FROM InterfaceTraffic # 18000166
SELECT COUNT(*) FROM Interfaces # 3938
# Total : 68,676,013
I really have no idea if 68 million items is a huge database to be honest.
A couple of notes:
The INNER JOIN operator is associative, so get rid of those parentheses in the FROM clause and let the optimizer figure out the best join order.
You may have an implied cursor from the getdate() function being called for every row. Store the value in a local variable and compare to that.
The resulting SQL should look like this:
DECLARE @Date as datetime = getdate() - 30;
SELECT TOP 10000
Nodes.Caption AS NodeName,
NetflowApplicationSummary.AppName AS Application_Name,
SUM(NetflowApplicationSummary.TotalBytes) AS SUM_of_Bytes_Transferred,
AVG(Case OutBandwidth
When 0 Then 0
Else (NetflowApplicationSummary.TotalBytes/OutBandwidth) * 100
End) AS TEST_PERCENT
FROM NetflowApplicationSummary
INNER JOIN Nodes ON NetflowApplicationSummary.NodeID = Nodes.NodeID
INNER JOIN InterfaceTraffic ON Nodes.NodeID = InterfaceTraffic.InterfaceID
INNER JOIN Interfaces ON Nodes.NodeID = Interfaces.NodeID
WHERE InterfaceTraffic.DateTime > @Date
AND Nodes.WANCircuit = 1
GROUP BY Nodes.Caption, NetflowApplicationSummary.AppName
Also, make sure you have an index on table InterfaceTraffic with a leading field of DateTime. If this doesn't exist you may need to pay the penalty of a first time creation of it.
If this doesn't help, then you may need to post the execution plan where it can be inspected.
Out of interest, also perform a count() on all four tables and post that result, just so members here can make their own assessment of how big your database really is. It is amazing how many non-technical people still think a 1 or 10 GB database is huge, while I run that easily on my workstation!

Sorting rows by count of a many-to-many associated record

I know there are a lot of other SO entries that seem like this one, but I haven't found one that actually answers my question so hopefully one of you can either answer it or point me to another SO question that is related.
Basically, I have the following query that returns Venues that have any CheckIns that contain the searched Keyword ("foobar" in this example).
SELECT DISTINCT v.*
FROM "venues" v
INNER JOIN "check_ins" c ON c."venue_id" = v."id"
INNER JOIN "keywordings" ks ON ks."check_in_id" = c."id"
INNER JOIN "keywords" k ON ks."keyword_id" = k."id"
WHERE (k."name" = 'foobar')
I want to SELECT and ORDER BY the count of the matched Keyword for each given Venue. E.g. if there have been 5 CheckIns that have been created, associated with that Keyword, then there should be a returned column (called something like keyword_count) with the value 5 which is sorted.
Ideally this should be done without any queries in the SELECT clause, or preferably none at all.
I've been struggling with this for a while and my mind is just going blank (perhaps it's been too long a day) so some help would be greatly appreciated here.
Thanks in advance!
Sounds like you need something like:
SELECT v.x, v.y, count(*) AS keyword_count
FROM "venues" v
INNER JOIN "check_ins" c ON c."venue_id" = v."id"
INNER JOIN "keywordings" ks ON ks."check_in_id" = c."id"
INNER JOIN "keywords" k ON ks."keyword_id" = k."id"
WHERE (k."name" = 'foobar')
GROUP BY v.x, v.y
ORDER BY 3
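
For a runnable illustration of the GROUP BY approach, here is a small sketch using Python's built-in sqlite3 module. The `name` column on venues and all sample rows are assumptions, since the real venue columns are not shown (the `v.x, v.y` above are placeholders). It sorts by the count alias in descending order so the most-matched venue comes first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE venues      (id INTEGER PRIMARY KEY, name TEXT);  -- name is assumed
CREATE TABLE check_ins   (id INTEGER PRIMARY KEY, venue_id INTEGER);
CREATE TABLE keywords    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE keywordings (check_in_id INTEGER, keyword_id INTEGER);

INSERT INTO venues VALUES (1, 'Cafe'), (2, 'Bar'), (3, 'Park');
INSERT INTO check_ins VALUES (1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 3);
INSERT INTO keywords VALUES (1, 'foobar'), (2, 'other');
INSERT INTO keywordings VALUES (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 2);
""")

# One output row per venue, with the number of check-ins tagged 'foobar'.
rows = cur.execute("""
    SELECT v.id, v.name, COUNT(*) AS keyword_count
    FROM venues v
    INNER JOIN check_ins c    ON c.venue_id = v.id
    INNER JOIN keywordings ks ON ks.check_in_id = c.id
    INNER JOIN keywords k     ON ks.keyword_id = k.id
    WHERE k.name = 'foobar'
    GROUP BY v.id, v.name
    ORDER BY keyword_count DESC
""").fetchall()

for row in rows:
    print(row)
```

Venues with no matching check-in (Park in this sample) drop out because of the inner joins, which matches the behaviour of the original DISTINCT query.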

Is it possible to do this in NHibernate without using CreateSQLQuery?

Is it possible to do this in NHibernate without using CreateSQLQuery, preferably with Linq to NHibernate? The biggest question is how to do joins that are not on a primary key.
SELECT DISTINCT calEvent.* From CalendarEvent as calEvent
LEFT JOIN UserChannelInteraction as channelInteraction on channelInteraction.ChannelToFollow_id = calEvent.Channel_id
LEFT JOIN UserCalendarEventInteraction as eventInteraction on eventInteraction.CalendarEvent_id = calEvent.Id
LEFT JOIN UserChannelInteraction as eventInteractionEvent on eventInteractionEvent.UserToFollow_id = eventInteraction.User_id
WHERE (calEvent.Channel_id = @intMainChannelID
OR channelInteraction.User_id = @intUserID
OR eventInteraction.User_id = @intUserID
OR (eventInteractionEvent.User_id = @intUserID AND eventInteraction.Status = 'Accepted'))
AND calEvent.StartDateTime >= @dtStartDate
AND calEvent.StartDateTime <= @dtEndDate
ORDER BY calEvent.StartDateTime asc
Hmmm... maybe you need to try to leverage subqueries?
Check this out: http://devlicio.us/blogs/derik_whittaker/archive/2009/04/06/simple-example-of-using-a-subquery-in-nhibernate-when-using-icriteria.aspx
You can do arbitrary joins by using Theta joins. A theta join is the Cartesian product, so it results in all possible combinations, which then can be filtered.
In NHibernate you can perform a theta style join like this (HQL):
from Book b, Review r where b.Isbn = r.Isbn
You can then add any filtering conditions you want to, order the results and everything else you might want to do.
from Book b, Review r where b.Isbn = r.Isbn and (b.Title = 'My Title' or r.Name = 'John Doe') order by b.Author asc
Here is an article about theta joins in Hibernate (not NHibernate, but it's still relevant).
However, since the theta join is a Cartesian product, you might want to think twice and do some performance testing before you use that approach to do a three-join query.
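
To make the "Cartesian product, then filter" description concrete, here is a small sketch in plain SQL run through Python's built-in sqlite3 module. It is not HQL or NHibernate, and the book and review tables with their rows are invented for the illustration. It joins on isbn, which is not a primary key in either table, and compares the theta-style form with an explicit JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE book   (id INTEGER PRIMARY KEY, isbn TEXT, title TEXT, author TEXT);
CREATE TABLE review (id INTEGER PRIMARY KEY, isbn TEXT, name TEXT);
INSERT INTO book VALUES (1, '111', 'My Title', 'Ann'), (2, '222', 'Other', 'Bob');
INSERT INTO review VALUES
    (1, '111', 'John Doe'), (2, '111', 'Jane'),
    (3, '222', 'John Doe'), (4, '333', 'Sam');
""")

# The raw Cartesian product: every book paired with every review.
product_size = cur.execute("SELECT COUNT(*) FROM book b, review r").fetchone()[0]

# Theta style: list both tables, then filter the combinations in WHERE.
theta = cur.execute("""
    SELECT b.title, r.name
    FROM book b, review r
    WHERE b.isbn = r.isbn
    ORDER BY b.id, r.id
""").fetchall()

# Explicit join on the same non-key column, for comparison.
explicit = cur.execute("""
    SELECT b.title, r.name
    FROM book b
    JOIN review r ON r.isbn = b.isbn
    ORDER BY b.id, r.id
""").fetchall()

print(product_size, theta == explicit)
```

Logically both forms describe the same rows; whether the engine actually materializes the full product depends on its optimizer, which is why performance testing is still worthwhile.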