I've got a query that's decently sized on its own, but there's one section of it that turns it into something ridiculously large (billions of rows returned, that sort of thing).
There must be a better way to write it than what I have done.
To simplify the section of the query in question, it takes the client details from one table and tries to find the most recent transaction dates in their savings and spending accounts (not the actual situation, but close enough).
I've joined it with left joins because if someone (for example) doesn't have a savings account, I still want the client details to pop up. But when there are hundreds of thousands of clients with tens of thousands of transactions each, it's a little slow to run.
select c.client_id, max(e.transaction_date), max(s.transaction_date)
from client_table c
left join everyday_account e
on c.client_id = e.client_id
left join savings_account s
on c.client_id = s.client_id
group by c.client_id
I'm still new to this and not great at knowing how to optimise things, so is there anything I should be looking at? Perhaps different joins, or something other than max()?
I've probably missed some key details while trying to simplify it, let me know if so!
Sometimes aggregating first, then joining to the aggregated result is faster. But this depends on the actual DBMS being used and several other factors.
select c.client_id, e.max_everyday_transaction_date, s.max_savings_transaction_date
from client_table c
left join (
select client_id, max(transaction_date) as max_everyday_transaction_date
from everyday_account
group by client_id
) e on c.client_id = e.client_id
left join (
select client_id, max(transaction_date) as max_savings_transaction_date
from savings_account
group by client_id
) s on c.client_id = s.client_id
The indexes suggested by Tim Biegeleisen should help in this case as well.
But as the query has to process all rows from all tables, there is no good way to speed it up further, other than throwing more hardware at it. If your database supports it, make sure parallel query is enabled; this distributes the total work over multiple threads in the backend, which can substantially improve query performance if the I/O system can keep up.
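For example, a minimal sketch assuming PostgreSQL (other databases expose parallelism through different settings or hints), showing how to check and raise the number of workers a query may use:
-- Sketch, assuming PostgreSQL: a value of 0 disables parallel query entirely.
SHOW max_parallel_workers_per_gather;
-- Allow up to 4 parallel workers for this session, then re-run the query
-- under EXPLAIN ANALYZE and look for Gather / Parallel Seq Scan nodes
-- to confirm the plan is actually parallel.
SET max_parallel_workers_per_gather = 4;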
There are no WHERE or HAVING clauses, which basically means there is no explicit filtering in your SQL query. However, we can still try to optimize the joins using appropriate indices. Consider:
CREATE INDEX idx1 ON everyday_account (client_id, transaction_date);
CREATE INDEX idx2 ON savings_account (client_id, transaction_date);
These two indices, if the optimizer chooses to use them, should speed up the two left joins in your query. I also include transaction_date in both, so the max() can be answered from the index alone.
Side note: You might also want to consider having a single table containing all customer accounts, with a separate column that distinguishes between everyday and savings accounts.
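If you went that route, the two left joins would collapse into a single join with conditional aggregation. This is a sketch only, where accounts and account_type are hypothetical names for the combined table and its discriminator column:
-- Hypothetical combined table: accounts(client_id, account_type, transaction_date)
select c.client_id,
       max(case when a.account_type = 'everyday' then a.transaction_date end) as max_everyday_transaction_date,
       max(case when a.account_type = 'savings'  then a.transaction_date end) as max_savings_transaction_date
from client_table c
left join accounts a on a.client_id = c.client_id
group by c.client_id;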
I would suggest correlated subqueries:
select client_id,
(select max(e.transaction_date)
from everyday_account e
where c.client_id = e.client_id
),
(select max(s.transaction_date)
from savings_account s
where c.client_id = s.client_id
)
from client_table c;
Along with indexes on everyday_account(client_id, transaction_date desc) and savings_account(client_id, transaction_date desc).
The subqueries should basically be index lookups (or very limited index scans), with no additional joining needed.
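Spelled out, the suggested indexes would look something like this (the index names are arbitrary, and the DESC modifier is optional in most databases since a max() can be read from either end of the index):
-- Composite indexes so each correlated max() becomes a quick index seek.
CREATE INDEX idx_everyday_client_date ON everyday_account (client_id, transaction_date DESC);
CREATE INDEX idx_savings_client_date ON savings_account (client_id, transaction_date DESC);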
Related
Join tables and then group by multiple columns (like title) or group rows in sub-query and then join other tables?
Is the second method slow because of lack of indexes after grouping? Should I order rows manually for second method to trigger merge join instead of nested loop?
How to do it properly?
This is the first method. It became quite a mess because contragent_title and product_title are required to be in the GROUP BY under strict mode, and I only work with strict GROUP BY mode.
SELECT
s.contragent_id,
s.contragent_title,
s.product_id AS sort_id,
s.product_title AS sort_title,
COALESCE(SUM(s.amount), 0) AS amount,
COALESCE(SUM(s.price), 0) AS price,
COALESCE(SUM(s.discount), 0) AS discount,
COUNT(DISTINCT s.product_id) AS sorts_count,
COUNT(DISTINCT s.contragent_id) AS contragents_count,
dd.date,
~grouping(dd.date, s.contragent_id, s.product_id) :: bit(3) AS mask
FROM date_dimension dd
LEFT JOIN (
SELECT
s.id,
s.created_at,
s.contragent_id,
ca.title AS contragent_title,
p.id AS product_id,
p.title AS product_title,
sp.amount,
sp.price,
sp.discount
FROM sales s
LEFT JOIN sold_products sp
ON s.id = sp.sale_id
LEFT JOIN products p
ON sp.product_id = p.id
LEFT JOIN contragents ca
ON s.contragent_id = ca.id
WHERE s.created_at BETWEEN :caf AND :cat
AND s.plant_id = :plant_id
AND (s.is_cache = :is_cache OR :is_cache IS NULL)
AND (sp.product_id = :sort_id OR :sort_id IS NULL)
) s ON dd.date = date(s.created_at)
WHERE (dd.date BETWEEN :caf AND :cat)
GROUP BY GROUPING SETS (
(dd.date, s.contragent_id, s.contragent_title, s.product_id, s.product_title),
(dd.date, s.contragent_id, s.contragent_title),
(dd.date)
)
This is an example of what you are talking about:
Join, then aggregate:
select d.name, count(e.employee_id) as number_of_johns
from departments d
left join employees e on e.department_id = d.department_id
                     and e.first_name = 'John'
group by d.department_id;
Aggregate then join:
select d.name, coalesce(number_of_johns, 0) as number_of_johns
from departments d
left join
(
select department_id, count(*) as number_of_johns
from employees
where first_name = 'John'
group by department_id
) e on e.department_id = d.department_id;
Question
You want to know whether one is faster than the other, assuming the latter may be slower because it loses the direct table links via IDs. (While every query result is a table, and hence the subquery result is one too, it is not a physical table stored in the database and therefore has no indexes.)
Thinking and guessing
Let's see what the queries do:
The first query is supposed to join all departments and employees and only keep the Johns. How will it do that? It will probably find all Johns first. If there is an index on employees(first_name), it will probably use that; otherwise it will read the full table. Then it finds the counts per department_id. If the index mentioned above also contained the department (an index on employees(first_name, department_id)), the DBMS would already have the Johns presorted and could just count. If it doesn't, the DBMS may sort the employee rows now and then count, or use some other counting method. And if we were looking for two names instead of just one, the compound index would be of little or no benefit compared to the plain index on first_name. Finally the DBMS will read all departments and join in the counts it found. Our count result rows are not a table, so there is no index to use; the DBMS will either just loop over the results or have them sorted anyway, so the join is easy. So much for what I think the DBMS will do. There are a lot of ifs in my assumptions, and the DBMS may have other methods to choose from, or may not use an index at all because the tables are so small anyway, or whatever.
The second query? Well, much the same reasoning applies.
Answer
You see, we can only guess how a DBMS will approach joins with aggregations. It may or may not come up with the same execution plan for the two queries. A perfect DBMS would create the same plan, as the two queries do the same thing. A not so perfect DBMS may create different plans, but which is better we can hardly guess. Let's just rely on the DBMS to do a good job concerning this.
I am using Oracle mainly and just tried about the same thing as shown with two of my tables. It shows exactly the same execution plan for both queries. PostgreSQL is also a great DBMS. Nothing to worry about, I'd say :-)
Better focus on writing readable, maintainable queries. With these small queries there is no big difference; the first one is a tad more compact and easier to grasp, the second a tad more sophisticated.
I, personally, prefer the second query. It is good style to aggregate before joining, and such queries can easily be extended with further aggregations, which can be much more difficult with the first one. Only if I ran into performance issues would I try a different approach.
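Applied to the GROUPING SETS query above, an aggregate-then-join version might look roughly like this. This is only a simplified sketch: the GROUPING SETS layer and the optional :is_cache / :sort_id filters are omitted, and all table and column names are taken from the question.
-- Simplified sketch of the "aggregate first, then join" variant.
SELECT
    dd.date,
    agg.contragent_id,
    ca.title AS contragent_title,
    agg.product_id AS sort_id,
    p.title AS sort_title,
    agg.amount,
    agg.price,
    agg.discount
FROM date_dimension dd
LEFT JOIN (
    SELECT
        date(s.created_at) AS sale_date,
        s.contragent_id,
        sp.product_id,
        COALESCE(SUM(sp.amount), 0) AS amount,
        COALESCE(SUM(sp.price), 0) AS price,
        COALESCE(SUM(sp.discount), 0) AS discount
    FROM sales s
    LEFT JOIN sold_products sp ON sp.sale_id = s.id
    WHERE s.created_at BETWEEN :caf AND :cat
      AND s.plant_id = :plant_id
    GROUP BY date(s.created_at), s.contragent_id, sp.product_id
) agg ON agg.sale_date = dd.date
LEFT JOIN contragents ca ON ca.id = agg.contragent_id
LEFT JOIN products p ON p.id = agg.product_id
WHERE dd.date BETWEEN :caf AND :cat;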
I was just tidying up some sql when I came across this query:
SELECT
jm.IMEI ,
jm.MaxSpeedKM ,
jm.MaxAccel ,
jm.MaxDeccel ,
jm.JourneyMaxLeft ,
jm.JourneyMaxRight ,
jm.DistanceKM ,
jm.IdleTimeSeconds ,
jm.WebUserJourneyId ,
jm.lifetime_odo_metres ,
jm.[Descriptor]
FROM dbo.Reporting_WebUsers AS wu WITH (NOLOCK)
INNER JOIN dbo.Reporting_JourneyMaster90 AS jm WITH (NOLOCK) ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j WITH (NOLOCK) ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE ( wu.isActive = 1 )
AND ( j.JourneyDuration > 2 )
AND ( j.JourneyDuration < 1000 )
AND ( j.JourneyDistance > 0 )
My question is: does the order of the joins make any performance difference? For the above query I would have started with
FROM dbo.Reporting_JourneyMaster90 AS jm
and then joined the other two tables to that one.
Join order in SQL Server 2008 R2 does unquestionably affect query performance, particularly in queries where there are a large number of table joins with WHERE clauses applied against multiple tables.
Although the join order is changed in optimisation, the optimiser doesn't try all possible join orders. It stops when it finds what it considers a workable solution, as the very act of optimisation uses precious resources.
We have seen queries that were performing like dogs (1 min+ execution time) come down to sub-second performance just by changing the order of the join expressions. Please note however that these are queries with 12 to 20 joins and WHERE clauses on several of the tables.
The trick is to set your order to help the query optimiser figure out what makes sense. You can use FORCE ORDER but that can be too rigid. Try to make sure that your join order starts with the tables whose WHERE clauses will reduce the data the most.
No, the JOIN order is changed during optimization.
The only caveat is the OPTION (FORCE ORDER) hint, which will force joins to happen in the exact order you have them specified.
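For illustration, the hint goes at the end of the statement. A sketch using a trimmed version of the query from the question (column list shortened, NOLOCK hints dropped):
-- With OPTION (FORCE ORDER), SQL Server joins the tables in exactly the
-- order written, so you would want the most selective table listed first.
SELECT jm.IMEI, jm.MaxSpeedKM, jm.DistanceKM
FROM dbo.Reporting_WebUsers AS wu
INNER JOIN dbo.Reporting_JourneyMaster90 AS jm ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE wu.isActive = 1
  AND j.JourneyDuration > 2
  AND j.JourneyDuration < 1000
  AND j.JourneyDistance > 0
OPTION (FORCE ORDER);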
I have a clear example of inner join order affecting performance. It is a simple join between two tables. One has 50+ million records, the other has 2,000. If I select from the smaller table and join the larger, it takes 5+ minutes.
If I select from the larger table and join the smaller it takes 2 min 30 seconds.
This is with SQL Server 2012.
To me this is counter intuitive since I am using the largest dataset for the initial query.
Usually not. I'm not 100% sure this applies verbatim to SQL Server, but in Postgres the query planner reserves the right to reorder the inner joins as it sees fit. The exception is when you reach a threshold beyond which it's too expensive to investigate changing their order.
JOIN order doesn't matter, the query engine will reorganize their order based on statistics for indexes and other stuff.
To test, do the following:
select Include Actual Execution Plan and run the first query
change JOIN order and now run the query again
compare execution plans
They should be identical as the query engine will reorganize them according to other factors.
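Alongside comparing the graphical plans, you can also compare the numbers; a sketch for SQL Server, where the reads and timings show up in the Messages tab:
-- Show logical reads and CPU/elapsed time for each statement so the two
-- join orders can be compared numerically.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- ... run the first query here, then the reordered one ...
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;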
As commented on another answer, you could use OPTION (FORCE ORDER) to use exactly the order you want, but maybe it would not be the most efficient one.
As a general rule of thumb, the JOIN order should put the table with the fewest records first and the one with the most records last, as in some DBMS engines the order can make a difference, especially if FORCE ORDER is used to help limit the results.
Wrong. In SQL Server 2005 it definitely matters, since you are limiting the dataset from the beginning of the FROM clause. If you start with 2,000 records instead of 2 million, your query is faster.
I'm experimenting with PostgreSQL (v9.3). I have a quite large database, and often I need to execute queries with 8-10 joined tables (as source of large data grids). I'm using Devexpress XPO as the ORM above PostgreSQL, so unfortunately I don't have any control over how joins are generated.
The following example is a fairly simplified one; the real scenario is more complex, but as far as I can tell the main problem shows up here too.
Consider the following variants of the (semantically) same query:
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN orderdetails od ON o.details = od.oid
LEFT JOIN customers c ON o.customer = c.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
SELECT o.*, c.*, od.*
FROM orders o
LEFT JOIN customers c ON o.customer = c.oid
LEFT JOIN orderdetails od ON o.details = od.oid
WHERE c.code = 32435 and o.date > '2012-01-01';
The orders table contains about 1 million rows, and customers about 30 thousand. The orderdetails table contains the same number of rows as orders due to a one-to-one relation.
UPDATE:
It seems like the example is too simplified to reproduce the issue, because I checked again and in this case the two execution plans are identical. However, in my real query, where there are many more joins, the problem occurs: if I put customers as the first join, the execution is 100x faster. I'll add my real query, but the Hungarian language and the fact that it's been generated by XPO and Npgsql make it less readable.
The first query is significantly slower (about 100x) than the second, and when I output the plans with EXPLAIN ANALYZE I can see that the order of the joins reflects their position in the query string. So first the two "giant" tables are joined together, and only afterwards is the filtered customers table joined in (where the filter selects only one row).
The second query is faster because the join starts with that one customer row, and after that it joins the 20-30 order details rows.
Unfortunately in my case XPO generates the first version so I'm suffering with performance.
Why does the PostgreSQL query planner not notice that the join on customers has a condition in the WHERE clause? IMO the correct optimization would be to process the joins that have any kind of filter first, and only then the joins that participate solely in the selection.
Any kind of help or advice is appreciated.
Join order only matters if your query's joins are not collapsed. Collapsing is done internally by the query planner, but you can influence the process with the join_collapse_limit runtime option.
Note however that the query planner will not always find the best join order by default:
Constraining the planner's search in this way is a useful technique both for reducing planning time and for directing the planner to a good query plan. If the planner chooses a bad join order by default, you can force it to choose a better order via JOIN syntax — assuming that you know of a better order, that is. Experimentation is recommended.
For the best performance, I recommend using some kind of native querying, if available. Raising join_collapse_limit can be a good-enough solution though, if you make sure it doesn't cause other problems.
It is also worth mentioning that raising join_collapse_limit will most likely increase planning time.
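A minimal sketch of the relevant settings, assuming PostgreSQL:
-- Check and raise the collapse limits for the current session.
SHOW join_collapse_limit;        -- the default is 8
SET join_collapse_limit = 12;    -- let the planner reorder larger join lists
SET from_collapse_limit = 12;    -- same idea for subqueries pulled up into FROM
-- Or, to make the written JOIN order binding (if you know a good order):
SET join_collapse_limit = 1;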
Suppose that I have a customers table which has an account_number and a billing_account_number, if I link to another table which I need to test for either, I could use the following or clause:
select c.name
from customers c
left join credit_terms ct on ct.account = c.account_number
or ct.account = c.billing_account
However, I have found that the following works equally well:
select c.name
from customers c
left join credit_terms ct on ct.account in (c.account_number, c.billing_account)
Now suppose that credit_terms.account is indexed, would that index get used in both cases? Are both statements just as equal? Is there any cost associated with one or the other?
I do apologise for being naive, though; I am still fairly new to anything beyond basic SQL.
OpenEdge uses a cost based optimizer so the particular query plan will be influenced by the statistics relevant to the query -- it may, or may not, use the indexes that you expect depending on what the optimizer knows about the data.
This knowledgebase article explains OpenEdge's SQL query plan:
http://progresscustomersupport-survey.force.com/ProgressKB/articles/Article/P62658?retURL=%2Fapex%2FProgressKBHome&popup=false
You must also periodically update SQL statistics for the optimizer to do a good job:
http://progresscustomersupport-survey.force.com/ProgressKB/articles/Article/20992?retURL=%2Fapex%2Fprogresskbsearch&popup=false
I have no knowledge of OpenEdge, but I do not think either one will use the index on account. The two of them are basically the same.
In case you want to use the index I would do something like this:
select c.name
from customers c
left join credit_terms ct on ct.account = c.account_number
union
select c.name
from customers c
left join credit_terms ct on ct.account = c.billing_account
Then again, this is somewhat speculative since, as I said, I have no knowledge of OpenEdge.
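Another possible rewrite, offered as my own sketch rather than part of the answer above: join the table twice, once per column, so that each join is a plain equality that can use the index on credit_terms.account.
-- Sketch: one index-friendly equality join per account column.
select c.name
from customers c
left join credit_terms ct1 on ct1.account = c.account_number
left join credit_terms ct2 on ct2.account = c.billing_account;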