I am executing a Postgresql database on a 2GB RAM VPS.
The settings are :
max_connections = 100
work_mem=1MB
shared_buffers=128MB
I am executing a pretty simple query with a million rows :
SELECT s.executionTime, g.date, s.name
FROM SimulationStatsGroup g
LEFT JOIN SimulationStats s ON s.group_id = g.id
WHERE g.name = 'general'
ORDER BY g.date DESC
I have 2 tables : SimulationStatsGroup and SimulationStats. SimulationStatsGroup contains between 1 to 13 SimulationStats. SimulationStats is a simple entity that contains numeric values like executionTime used by my application. Each SimulationStatsGroup and SimulationStats have a name.
Here is the EXPLAIN ANALYZE that I get : http://explain.depesz.com/s/auLK
Why is my query so long to execute ?
Create indexes on SimulationStats(group_id) and SimulationStatsGroup(id).
In the Sort (step #2) in the explain plan, it looks like the database is either lugging around unreferenced columns (not optimum) and/or sorting by them (ouch). Honestly though, I don't work on Postgres, so that is just an educated guess. The database engine may not be smart enough to discard unreferenced columns early in the process. I'd try this SQL to nudge the database engine into discarding unreferenced columns before it gets to the sort, and you may see a significant runtime improvement:
SELECT s.executionTime, g.date, s.name
FROM ( select id, date from SimulationStatsGroup WHERE g.name = 'general') as g
LEFT JOIN ( select s.group_id, s.name, s.executionTime from SimulationStats ) as s
ON s.group_id = g.id
ORDER BY g.date DESC
If this version shows a runtime improvement, please run another Explain, and let us know if there are less columns list in the Sort step. If so, my hunch was likely correct. If correct, hopefully the Postgres developers will take note and try to discard unreferenced columns for us in a future version, instead of us manually coding it.
Related
I've got a query that's decently sized on it's own, but there's one section of it that turns it into something ridiculously large (billions of rows returned type thing).
There must be a better way to write it than what I have done.
To simplify the section of the query in question, it takes the client details from one table and tries to find the most recent transaction dates in their savings and spending accounts (not the actual situation, but close enough).
I've joined it with left joins because if someone (for example) doesn't have a savings account, I still want the client details to pop up. But when there's hundreds of thousands of clients with tens of thousands of transactions, it's a little slow to run.
select client_id, max(e.transation_date), max(s.transaction_date)
from client_table c
left join everyday_account e
on c.client_id = e.client_id
left join savings_account s
on c.client_id = s.client_id
group by client_id
I'm still new to this so I'm not great at knowing how to optimise things, so is there any thing I should be looking at? Perhaps different joins, or something other than max()?
I've probably missed some key details while trying to simplify it, let me know if so!
Sometimes aggregating first, then joining to the aggregated result is faster. But this depends on the actual DBMS being used and several other factors.
select client_id, e.max_everyday_transaction_date, s.max_savings_transaction_date
from client_table c
left join (
select client_id, max(transaction_date) as max_everyday_transaction_date
from everyday_account
group by client_id
) e on c.client_id = e.client_id
left join (
select client_id, max(transaction_date) as max_savings_transaction_date
from savings_account
) s on c.client_id = s.client_id
The indexes suggested by Tim Biegeleisen should help in this case as well.
But as the query has to process all rows from all tables there no good way to speed up this query, other than throwing more hardware at it. If your database supports it, make sure parallel query is enabled (which will distribute the total work over multiple threads in the backend which can substantially improve query performance if the I/O system can keep up)
There are no WHERE or HAVING clauses, which basically means there is no explicit filtering in your SQL query. However, we can still try to optimize the joins using appropriate indices. Consider:
CREATE INDEX idx1 ON everyday_account (client_id, transation_date);
CREATE INDEX idx2 ON savings_account (client_id, transation_date);
These two indices, if chosen for use, should speed up the two left joins in your query. I also cover the transaction_date in both cases, in case that might help.
Side note: You might want to also consider just having a single table containing all customer accounts. Include a separate column which distinguishes between everyday and savings accounts.
I would suggest correlated subqueries:
select client_id,
(select max(e.transation_date)
from everyday_account e
where c.client_id = e.client_id
),
(select max(s.transaction_date)
from savings_account s
where c.client_id = s.client_id
)
from client_table c;
Along with indexes on everyday_account(client_id, transaction_date desc) and savings_account(client_id, transaction_date desc).
The subqueries should basically be index lookups (or very limited index scans), with no additional joining needed.
I've written the following query:
WITH m2 AS (
SELECT m.id, m.original_title, m.votes, l.name as lang
FROM movies m
JOIN movie_languages ml ON m.id = ml.movie_id
JOIN languages l ON l.id = ml.language_id
)
SELECT m.original_title
FROM movies m
WHERE NOT EXISTS (
SELECT 1
FROM m2
WHERE m.id = m2.id AND m2.lang <> 'English'
)
The results appear after 1.5 seconds.
After adding the following line at the end of the query, it takes at least 5 minutes to run it:
ORDER BY votes DESC;
It's not the size of the data, as ORDER BY on the entire table return results in notime.
What am I doing wrong?
Why is the ORDER BY adds so much time? (The query SELECT * FROM movies ORDER BY votes DESC returns immediately).
The order by in the CTE is irrelevant. But I would suggest aggregation for this purpose:
SELECT m.original_title
FROM movies m JOIN
movie_languages ml
ON m.id = ml.movie_id JOIN
languages l
ON l.id = ml.language_id
GROUP BY m.original_title, m.id
HAVING SUM(lang = 'English') = 0;
In order to examine your queries you may turn on the timer by entering .time on at the SQLite prompt. More importantly utilize the EXPLAIN function to see details on your query.
The query initially written does seem to be rather more complex than necessary as already pointed out above. It does not seem apparent what the necessity is for 'movie_languages' and 'languages' tables in general, but especially in this particular query. That would require more explanation on your part but I believe at least one could be removed thus speeding up your query.
The ORDER BY clause in SQLite is handled as described below.
SQLite attempts to use an index to satisfy the ORDER BY clause of a query when possible. When faced with the choice of using an index to satisfy WHERE clause constraints or satisfying an ORDER BY clause, SQLite does the same cost analysis described above and chooses the index that it believes will result in the fastest answer.
SQLite will also attempt to use indices to help satisfy GROUP BY clauses and the DISTINCT keyword. If the nested loops of the join can be arranged such that rows that are equivalent for the GROUP BY or for the DISTINCT are consecutive, then the GROUP BY or DISTINCT logic can determine if the current row is part of the same group or if the current row is distinct simply by comparing the current row to the previous row. This can be much faster than the alternative of comparing each row to all prior rows.
Since there is no index or type on votes stated and the above logic may be followed thus choosing 'the index that it believes will result in the fastest answer'. With the over-complicated query and no index on votes which is being used as ORDER BY then there is much more for it to figure out than necessary. Since the simple query with ORDER BY executes then the complexity of the query causing SQLite much more to compute than necessary.
Additionally the type of the column, most likely INTEGER, is important when sorting (and joining). Attempting to sort on a character type will not only get you wrong results in this case if votes end up above single digits it would be the wrong type to use (I'm not assuming you are just mentioning it).
So simplify the query, ensure your PRIMARY KEYS are properly set, and test it. If it is still not returning in time try an index on votes. This will give you much better insight into what is going on and how different changes affect your queries.
SQLite Documentation - check all and note 6. Sorting, Grouping and Compound SELECTs
SQLite Documentation - check 10. ORDER BY optimizations
You can do it with NOT EXISTS, without joins and aggregation (assuming that there is always at least 1 row for each movie in the table movie_languages):
SELECT m.*
FROM movies m
WHERE NOT EXISTS (
SELECT 1 FROM movie_languages ml
WHERE m.id = ml.movie_id
AND ml.language_id <> (SELECT l.id FROM languages l WHERE l.lang = 'English')
)
ORDER BY m.votes DESC
or with a LEFT join to languages to get the unmatched rows:
SELECT m.*
FROM movies m
INNER JOIN movie_languages ml ON m.id = ml.movie_id
LEFT JOIN languages l ON l.id = ml.language_id AND l.lang <> 'English'
WHERE l.id IS NULL
ORDER BY m.votes DESC
Refer to this link for more information:
here
In a nutshell, When you include an order by clause, the database builds a list of the rows in the correct order and then returns the data in that order.
The creation of the list mentioned above takes a lot of extra processing, translating into a longer execution time.
I am not sure whether the title of this question is correct or not.
I have a table for example users which contains different types of users. Like user type 10, 20, 30 etc.
In a query I need to join the user table, but I want only user type 20. So which of the below query perform better.
SELECT fields
FROM consumer c
INNER JOIN user u ON u.userid = c.userid
WHERE u.type = 20
In another way,
SELECT fields
FROM consumer c
INNER JOIN (SELECT user_fields FROM user WHERE type = 20) u ON u.userid = c.userid
Please advice.
Let's start with this query:
SELECT . . .
FROM consumer c INNER JOIN
user u
ON u.userid = c.userid
WHERE u.type = 20;
Assuming that type is relatively rare, you want indexes on the tables. The best indexes are probably user(type, userid) and customer(userid). It is possible that an index on user(userid, type) would be better (and would be unnecessary if userid is a clustered primary key).
The second query . . . well, from the SQL Server perspective it is probably the same. Why? SQL Server has a good optimizer. You can check the execution plans if you like. Because of the optimizer:
There is no benefit to having a subquery select only a handful of columns. For better or worse, SQL Server pushes that information down to the node that reads the data.
The where clause is not necessarily going to be evaluated before the join. SQL Server is smart enough to re-arrange operations.
Not all optimizers are this smart. In a database such as MySQL, MS Access, or SQLite, I'm pretty sure the first version is much better than the second.
Run the two queries in SSMS as a batch, and click "execution plan" , you will find that the execution plan of both queries, and the query cost (relative to the batch ): 50%
That means they are the same.
If they are different (in case of some optimization), you find the ratio different.
I simulated your query and find the query cost=50% ===> i.e they are the same.
It really depends on a various number of factors:
is "userid" on both table indexed?
is "type" on table "users" indexed?
how many rows in each table?
Usually a subquery produces slower performances, but depending on the conditions listed above and how your sql server installation is configured, both query can be resolved (and so, executed) as the same by the query analyzer.
SQLServer takes your query and tries to optimize it so it can happen that query B is "transformed" in query A.
Look at the QueryAnalyzer tool for both queries, and see if they have differences.
Generally speaking inner queries are better to be avoided, and you'll probably get the best performances doing query A.
Both your options are valid. Personally would code it like this;
SELECT fields
FROM consumer c
INNER JOIN user u ON u.userid = c.userid and u.type = 20
Run both queries in SQL Management Studio (query) and tick 'Include actual execution plan'. This will let you see the performance of your queries against each other. It will depend on your particular database.
I was just tidying up some sql when I came across this query:
SELECT
jm.IMEI ,
jm.MaxSpeedKM ,
jm.MaxAccel ,
jm.MaxDeccel ,
jm.JourneyMaxLeft ,
jm.JourneyMaxRight ,
jm.DistanceKM ,
jm.IdleTimeSeconds ,
jm.WebUserJourneyId ,
jm.lifetime_odo_metres ,
jm.[Descriptor]
FROM dbo.Reporting_WebUsers AS wu WITH (NOLOCK)
INNER JOIN dbo.Reporting_JourneyMaster90 AS jm WITH (NOLOCK) ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j WITH (NOLOCK) ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE ( wu.isActive = 1 )
AND ( j.JourneyDuration > 2 )
AND ( j.JourneyDuration < 1000 )
AND ( j.JourneyDistance > 0 )
My question is does it make any performance difference the order of the joins as for the above query I would have done
FROM dbo.Reporting_JourneyMaster90 AS jm
and then joined the other 2 tables to that one
Join order in SQL2008R2 server does unquestionably affect query performance, particularly in queries where there are a large number of table joins with where clauses applied against multiple tables.
Although the join order is changed in optimisation, the optimiser does't try all possible join orders. It stops when it finds what it considers a workable solution as the very act of optimisation uses precious resources.
We have seen queries that were performing like dogs (1min + execution time) come down to sub second performance just by changing the order of the join expressions. Please note however that these are queries with 12 to 20 joins and where clauses on several of the tables.
The trick is to set your order to help the query optimiser figure out what makes sense. You can use Force Order but that can be too rigid. Try to make sure that your join order starts with the tables where the will reduce data most through where clauses.
No, the JOIN by order is changed during optimization.
The only caveat is the Option FORCE ORDER which will force joins to happen in the exact order you have them specified.
I have a clear example of inner join affecting performance. It is a simple join between two tables. One had 50+ million records, the other has 2,000. If I select from the smaller table and join the larger it takes 5+ minutes.
If I select from the larger table and join the smaller it takes 2 min 30 seconds.
This is with SQL Server 2012.
To me this is counter intuitive since I am using the largest dataset for the initial query.
Usually not. I'm not 100% this applies verbatim to Sql-Server, but in Postgres the query planner reserves the right to reorder the inner joins as it sees fit. The exception is when you reach a threshold beyond which it's too expensive to investigate changing their order.
JOIN order doesn't matter, the query engine will reorganize their order based on statistics for indexes and other stuff.
For test do the following:
select show actual execution plan and run first query
change JOIN order and now run the query again
compare execution plans
They should be identical as the query engine will reorganize them according to other factors.
As commented on other asnwer, you could use OPTION (FORCE ORDER) to use exactly the order you want but maybe it would not be the most efficient one.
AS a general rule of thumb, JOIN order should be with table of least records on top, and most records last, as some DBMS engines the order can make a difference, as well as if the FORCE ORDER command was used to help limit the results.
Wrong. SQL Server 2005 it definitely matters since you are limiting the dataset from the beginning of the FROM clause. If you start with 2000 records instead of 2 million it makes your query faster.
I was just tidying up some sql when I came across this query:
SELECT
jm.IMEI ,
jm.MaxSpeedKM ,
jm.MaxAccel ,
jm.MaxDeccel ,
jm.JourneyMaxLeft ,
jm.JourneyMaxRight ,
jm.DistanceKM ,
jm.IdleTimeSeconds ,
jm.WebUserJourneyId ,
jm.lifetime_odo_metres ,
jm.[Descriptor]
FROM dbo.Reporting_WebUsers AS wu WITH (NOLOCK)
INNER JOIN dbo.Reporting_JourneyMaster90 AS jm WITH (NOLOCK) ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j WITH (NOLOCK) ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE ( wu.isActive = 1 )
AND ( j.JourneyDuration > 2 )
AND ( j.JourneyDuration < 1000 )
AND ( j.JourneyDistance > 0 )
My question is does it make any performance difference the order of the joins as for the above query I would have done
FROM dbo.Reporting_JourneyMaster90 AS jm
and then joined the other 2 tables to that one
Join order in SQL2008R2 server does unquestionably affect query performance, particularly in queries where there are a large number of table joins with where clauses applied against multiple tables.
Although the join order is changed in optimisation, the optimiser does't try all possible join orders. It stops when it finds what it considers a workable solution as the very act of optimisation uses precious resources.
We have seen queries that were performing like dogs (1min + execution time) come down to sub second performance just by changing the order of the join expressions. Please note however that these are queries with 12 to 20 joins and where clauses on several of the tables.
The trick is to set your order to help the query optimiser figure out what makes sense. You can use Force Order but that can be too rigid. Try to make sure that your join order starts with the tables where the will reduce data most through where clauses.
No, the JOIN by order is changed during optimization.
The only caveat is the Option FORCE ORDER which will force joins to happen in the exact order you have them specified.
I have a clear example of inner join affecting performance. It is a simple join between two tables. One had 50+ million records, the other has 2,000. If I select from the smaller table and join the larger it takes 5+ minutes.
If I select from the larger table and join the smaller it takes 2 min 30 seconds.
This is with SQL Server 2012.
To me this is counter intuitive since I am using the largest dataset for the initial query.
Usually not. I'm not 100% this applies verbatim to Sql-Server, but in Postgres the query planner reserves the right to reorder the inner joins as it sees fit. The exception is when you reach a threshold beyond which it's too expensive to investigate changing their order.
JOIN order doesn't matter, the query engine will reorganize their order based on statistics for indexes and other stuff.
For test do the following:
select show actual execution plan and run first query
change JOIN order and now run the query again
compare execution plans
They should be identical as the query engine will reorganize them according to other factors.
As commented on other asnwer, you could use OPTION (FORCE ORDER) to use exactly the order you want but maybe it would not be the most efficient one.
AS a general rule of thumb, JOIN order should be with table of least records on top, and most records last, as some DBMS engines the order can make a difference, as well as if the FORCE ORDER command was used to help limit the results.
Wrong. SQL Server 2005 it definitely matters since you are limiting the dataset from the beginning of the FROM clause. If you start with 2000 records instead of 2 million it makes your query faster.