Optimizing PostgreSQL query with multiple joins, order by and small limit - sql

I need some tips for optimizing queries fetching from large tables.
In this example I have 5 tables:
Brands
- id_brand
- b_name
Products
- id_product
- p_name
- ean
...
- fk_brand
Prod_attributes
- id_prod_att
- size_basic
...
- fk_product
Stores
- id_store
- s_name
...
Stocks
- id_stock
- stock_amount
- fk_prod_att
- fk_store
I need a query with ordered list of stocks, limited, so this is the general approach I used:
SELECT stores.s_name, stocks.stock_amount, prod_attributes.size_basic,
products.p_name, products.ean, brands.b_name
FROM (stocks
INNER JOIN stores
ON stocks.fk_store = stores.id_store)
INNER JOIN (prod_attributes
INNER JOIN (products
INNER JOIN brands
ON products.fk_brand = brands.id_brand)
ON prod_attributes.fk_product = products.id_product)
ON stocks.fk_prod_att = prod_attributes.id_prod_att
ORDER BY s_name, p_name, size_basic
LIMIT 25 OFFSET 0
This works fast on small tables, but when the tables grow the query gets very expensive. With 3.5M rows in Stocks, 300K in Prod_attributes and 25K in Products it executes in over 8800 ms, which is not acceptable to me.
All foreign keys have indexes and the DB has recently been vacuum-analyzed.
I know that the issue lies in the ORDER BY clause: because of it the query does not use indexes and falls back to sequential scans. If I remove the ordering, the query is very fast.
To solve this I know I could remove the ORDER BY, but that is not a feasible option for me. Denormalizing the DB or a materialized view could also help here; again, I would like to avoid that if possible.
What else can I do to speed this query up?
EXPLAIN ANALYZE:
- slow with order by: http://explain.depesz.com/s/AHO
- fast without order by: http://explain.depesz.com/s/NRxr

A possible way to go is to remove stores from the join. Instead, you could:
Loop through the stores (ordered by s_name) in a stored procedure or in application code and, for each store, execute the join filtered on stocks.fk_store. You can break the loop as soon as you have obtained enough records (see the sketch below).
If possible, partition stocks on the fk_store key, to heavily reduce the number of tuples entering each join.
Either way you should see a good benefit.
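A rough sketch of that per-store loop, written as a PL/pgSQL set-returning function; the function name, parameter and column types are my assumptions, while the table and column names come from the question. Each iteration touches only one store's stocks through the fk_store index, and the loop exits as soon as the page is full:
CREATE OR REPLACE FUNCTION stock_page(p_limit int DEFAULT 25)
RETURNS TABLE (s_name text, stock_amount int, size_basic text,
               p_name text, ean text, b_name text)
LANGUAGE plpgsql AS
$$
DECLARE
    v_store   record;
    v_rows    int;
    v_missing int := p_limit;
BEGIN
    -- Visit stores in the same order as the outer ORDER BY.
    FOR v_store IN SELECT id_store, stores.s_name FROM stores ORDER BY stores.s_name LOOP
        RETURN QUERY
        SELECT v_store.s_name, st.stock_amount, pa.size_basic,
               pr.p_name, pr.ean, br.b_name
        FROM stocks st
        JOIN prod_attributes pa ON st.fk_prod_att = pa.id_prod_att
        JOIN products pr        ON pa.fk_product  = pr.id_product
        JOIN brands br          ON pr.fk_brand    = br.id_brand
        WHERE st.fk_store = v_store.id_store      -- small slice, served by the fk_store index
        ORDER BY pr.p_name, pa.size_basic         -- s_name is constant within one store
        LIMIT v_missing;

        GET DIAGNOSTICS v_rows = ROW_COUNT;       -- rows just emitted by RETURN QUERY
        v_missing := v_missing - v_rows;
        EXIT WHEN v_missing <= 0;                 -- page is full, stop early
    END LOOP;
END;
$$;
You would call it as SELECT * FROM stock_page(25); the overall ordering still holds because stores are visited in s_name order and each slice is ordered by p_name, size_basic. Handling a non-zero OFFSET would need an extra parameter, which I have left out.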

Related

Cost (%CPU) around the same after denormalization?

Using Oracle SQL, I created a query to find the total count of orders for each food.
EXPLAIN PLAN FOR
SELECT FOOD.F_NAME, COUNT(ORDERS.O_ORDERID)
FROM ORDERS
INNER JOIN CUSTOMER ON O_CUSTID = C_CUSTID
INNER JOIN FOOD ON C_FOODKEY = F_FOODKEY
GROUP BY FOOD.F_NAME;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
This returns a cost (%CPU) of 3250 at row ID 0 in the plan table output.
I learnt that denormalization will speed up the query and reduce the cost. In this case, I copied the food name from my table FOOD to ORDERS to avoid the INNER JOIN, so I should get a better cost (%CPU).
I then used this query:
EXPLAIN PLAN FOR
SELECT ORDERS.F_NAME, COUNT(ORDERS.O_ORDERID)
FROM ORDERS
GROUP BY ORDERS.F_NAME;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
The cost (%CPU) did not change much at all - the value is 3120 at row ID 0 in the plan table output.
Aren't denormalization and removal of the INNER JOIN supposed to improve my cost? The improvement is insignificant in my case. What's the issue here?
This is too long for a comment. You would have to study the execution plan. However, joins on primary keys are often not particularly expensive.
What is expensive is the GROUP BY, because this requires moving data around. You could try adding an index on F_NAME in the second query.
Your data model is also unusual. It is unclear why a column called FOOD would be stored at the CUSTOMER level.
C_FOODKEY should most likely be O_FOODKEY for this to make sense.
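A minimal sketch of the index suggested above, assuming the denormalized ORDERS.F_NAME column from the second query (the index name is arbitrary):
CREATE INDEX ORDERS_F_NAME_IX ON ORDERS (F_NAME);
Whether the optimizer actually uses it for this GROUP BY is worth verifying in the plan output.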
You do not need denormalization. What you are doing is exploding all the rows with joins and then grouping them back together. Avoid doing that in the first place and try something like this:
SELECT FOOD.F_NAME,
       (SELECT COUNT(*)
        FROM ORDERS
        WHERE O_FOODKEY = F_FOODKEY) AS fCount
FROM FOOD

Oracle FIRST_ROWS optimizer hint

I'm writing a query against what is currently a small table in development. In production, we expect it to grow quite large over the life of the table (the primary key is a number(10)).
My query does a selection for the top N rows of my table, filtered by specific criteria and ordered by date ascending. Essentially, we're assigning records, in bulk, to a specific user for processing. In my case, N will only be 10, 20, or 30.
I'm currently selecting my primary keys inside a subselect, using rownum to limit my results, like so:
SELECT log_number FROM (
SELECT
il2.log_number,
il2.final_date
FROM log il2
INNER JOIN agent A ON A.agent_id = il2.agent_id
INNER JOIN activity lat ON il2.activity_id = lat.activity_id
WHERE (p_criteria1 IS NULL OR A.criteria1 = p_criteria1)
AND lat.criteria2 = p_criteria2
AND lat.criteria3 = p_criteria3
AND il2.criteria3 = p_criteria4
AND il2.current_user IS NULL
GROUP BY il2.log_number, il2.final_date
ORDER BY il2.final_date ASC)
WHERE ROWNUM <= p_how_many;
Although I have a stopkey due to the rownum, I'm wondering if using an Oracle hint here (/*+ FIRST_ROWS(p_how_many) */) on the inner select will affect the query plan in the future. I'd like to know more about what the database does when this hint is specified; does it actually make a difference if you have to order the table? (Seems like it wouldn't.) Or does it only affect the select portion, after the access and join parts?
Looking at the explain plan now doesn't get me much as the table hasn't grown yet.
Thanks for your help!
Even with an ORDER BY, different execution plans could be selected when you limit the number of rows returned. It can be easier to select the top n rows by some order key, then sort those, than to sort the entire table then select the top n rows.
However, the GROUP BY is likely to restrict the benefit of this sort of optimization. Grouping (or a DISTINCT operation) generally prevents the optimizer from using a plan that can pipe individual rows into a STOPKEY operation.
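For what it's worth, this is roughly where the hint would sit. Note that, as far as I know, FIRST_ROWS(n) takes an integer literal rather than a bind variable, so p_how_many cannot appear inside the hint and you would pick a representative constant (30 here, the largest batch size):
SELECT log_number FROM (
SELECT /*+ FIRST_ROWS(30) */
il2.log_number,
il2.final_date
FROM log il2
INNER JOIN agent A ON A.agent_id = il2.agent_id
INNER JOIN activity lat ON il2.activity_id = lat.activity_id
WHERE (p_criteria1 IS NULL OR A.criteria1 = p_criteria1)
AND lat.criteria2 = p_criteria2
AND lat.criteria3 = p_criteria3
AND il2.criteria3 = p_criteria4
AND il2.current_user IS NULL
GROUP BY il2.log_number, il2.final_date
ORDER BY il2.final_date ASC)
WHERE ROWNUM <= p_how_many;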

Does Sql JOIN order affect performance?

I was just tidying up some sql when I came across this query:
SELECT
jm.IMEI ,
jm.MaxSpeedKM ,
jm.MaxAccel ,
jm.MaxDeccel ,
jm.JourneyMaxLeft ,
jm.JourneyMaxRight ,
jm.DistanceKM ,
jm.IdleTimeSeconds ,
jm.WebUserJourneyId ,
jm.lifetime_odo_metres ,
jm.[Descriptor]
FROM dbo.Reporting_WebUsers AS wu WITH (NOLOCK)
INNER JOIN dbo.Reporting_JourneyMaster90 AS jm WITH (NOLOCK) ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j WITH (NOLOCK) ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE ( wu.isActive = 1 )
AND ( j.JourneyDuration > 2 )
AND ( j.JourneyDuration < 1000 )
AND ( j.JourneyDistance > 0 )
My question is: does the order of the joins make any performance difference? For the above query I would have done
FROM dbo.Reporting_JourneyMaster90 AS jm
and then joined the other 2 tables to that one
Join order in SQL2008R2 server does unquestionably affect query performance, particularly in queries where there are a large number of table joins with where clauses applied against multiple tables.
Although the join order is changed in optimisation, the optimiser doesn't try all possible join orders. It stops when it finds what it considers a workable solution, as the very act of optimisation uses precious resources.
We have seen queries that were performing like dogs (1min + execution time) come down to sub second performance just by changing the order of the join expressions. Please note however that these are queries with 12 to 20 joins and where clauses on several of the tables.
The trick is to set your order to help the query optimiser figure out what makes sense. You can use Force Order but that can be too rigid. Try to make sure that your join order starts with the tables whose where clauses will reduce the data the most.
No, the JOIN order is changed during optimization.
The only caveat is the Option FORCE ORDER which will force joins to happen in the exact order you have them specified.
I have a clear example of inner join order affecting performance. It is a simple join between two tables. One has 50+ million records, the other has 2,000. If I select from the smaller table and join the larger it takes 5+ minutes.
If I select from the larger table and join the smaller it takes 2 min 30 seconds.
This is with SQL Server 2012.
To me this is counter intuitive since I am using the largest dataset for the initial query.
Usually not. I'm not 100% sure this applies verbatim to SQL Server, but in Postgres the query planner reserves the right to reorder the inner joins as it sees fit. The exception is when you reach a threshold beyond which it's too expensive to investigate changing their order.
JOIN order doesn't matter, the query engine will reorganize their order based on statistics for indexes and other stuff.
To test this, do the following:
enable "show actual execution plan" and run the first query
change the JOIN order and run the query again
compare the execution plans
They should be identical as the query engine will reorganize them according to other factors.
As noted in another answer, you could use OPTION (FORCE ORDER) to get exactly the order you want, but it may not be the most efficient one.
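For illustration, here is the asker's query with the join order pinned via that hint, starting from Reporting_JourneyMaster90 as suggested in the question (T-SQL; most of the jm columns and the NOLOCK hints are trimmed for brevity):
SELECT jm.IMEI, jm.MaxSpeedKM, jm.DistanceKM -- remaining jm columns as in the original query
FROM dbo.Reporting_JourneyMaster90 AS jm
INNER JOIN dbo.Reporting_WebUsers AS wu ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE wu.isActive = 1
AND j.JourneyDuration > 2
AND j.JourneyDuration < 1000
AND j.JourneyDistance > 0
OPTION (FORCE ORDER);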
As a general rule of thumb, the JOIN order should put the table with the fewest records first and the table with the most records last, since in some DBMS engines the order can make a difference, particularly if FORCE ORDER is used to help limit the results.
Wrong. In SQL Server 2005 it definitely matters, since you are limiting the dataset from the beginning of the FROM clause. If you start with 2,000 records instead of 2 million, your query is faster.

Hash Join Showing up on Full Text Query - SQL Server 2005

I have the following Query in SQL Server 2005:
SELECT
PRODUCT_ID
FROM
PRODUCTS P
WHERE
SUPPLIER_ORGANIZATION_ID = 13225
AND ACTIVE_FLAG = 'Y'
AND CONTAINS(*, 'FORMSOF(Inflectional, "%clip%") ')
What's interesting is that using this generates a Hash Match whereas if I use a different SUPPLIER_ORGANIZATION_ID (older supplier), it uses a Merge Join. Obviously the Hash is much slower than the Merge Join. What I don't get is why there is a difference, and what's needed to make it run faster?
FYI, there are about 5 million records in the PRODUCTS table. When supplier organization id is selected (13225), there are about 25000 products for that supplier.
Thanks in advance.
I'd try using the OPTIMIZE FOR Query Hint to force it one way or the other.
SELECT
PRODUCT_ID
FROM
PRODUCTS P
WHERE
SUPPLIER_ORGANIZATION_ID = @Supplier_Organisation_Id
AND ACTIVE_FLAG = 'Y'
AND CONTAINS(*, 'FORMSOF(Inflectional, @Keywords) ')
OPTION (OPTIMIZE FOR (@Supplier_Organisation_Id = 1000))
One other thing: your STATISTICS might be out of date; the tipping point for automatic updates is often not low enough, meaning that the query plan chosen may not be ideal for your data. I'd suggest updating the STATISTICS on your PRODUCTS table, and perhaps creating a job to do this regularly if this turns out to be part of the problem.
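A one-off refresh might look like this, assuming the table lives in the dbo schema; FULLSCAN is optional but gives the most accurate estimates on a 5-million-row table:
UPDATE STATISTICS dbo.PRODUCTS WITH FULLSCAN;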