Cost (%CPU) around the same after denormalization? - sql

Using Oracle SQL. I created a query to find the total count of orders for each food.
EXPLAIN PLAN FOR
SELECT FOOD.F_NAME, COUNT(ORDERS.O_ORDERID)
FROM ORDERS
INNER JOIN CUSTOMER ON O_CUSTID = C_CUSTID
INNER JOIN FOOD ON C_FOODKEY = F_FOODKEY
GROUP BY FOOD.F_NAME;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
This returns cost (%CPU) of 3250 at row ID 0 in the plan table output.
I learned that denormalization can speed up a query and reduce its cost. In this case, I copied the food name from my FOOD table into ORDERS to avoid the INNER JOINs, so I expected a better cost (%CPU).
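For reference, the denormalization was roughly this (just a sketch; the column length is arbitrary):
ALTER TABLE ORDERS ADD F_NAME VARCHAR2(100);
UPDATE ORDERS O
SET F_NAME = (SELECT F.F_NAME
              FROM CUSTOMER C
              INNER JOIN FOOD F ON C.C_FOODKEY = F.F_FOODKEY
              WHERE C.C_CUSTID = O.O_CUSTID);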
I then used this query:
EXPLAIN PLAN FOR
SELECT ORDERS.F_NAME, COUNT(ORDERS.O_ORDERID)
FROM ORDERS
GROUP BY ORDERS.F_NAME;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
The cost (%CPU) did not change much at all - the value is 3120 at row ID 0 in the plan table output.
Aren't denormalization and removal of the INNER JOINs supposed to improve my cost? The improvement is so insignificant in my case. What's the issue here?

This is too long for a comment. You would have to study the execution plan. However, joins on primary keys are often not particularly expensive.
What is expensive is the GROUP BY, because this requires moving data around. You could try adding an index on F_NAME in the second query.
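For example (a minimal sketch; the index name is made up, and O_ORDERID is included so the second query could be answered from the index alone):
CREATE INDEX IX_ORDERS_F_NAME ON ORDERS (F_NAME, O_ORDERID);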
Your data model is also unusual. It is unclear why a food key would be stored at the CUSTOMER level.

C_FOODKEY should most likely be O_FOODKEY for this to make sense.
You do not need denormalization. What you are doing is exploding all the rows with joins and then grouping them back together. Do not do that in the first place; try something like this instead:
SELECT FOOD.F_NAME,
       (SELECT COUNT(*)
        FROM ORDERS
        WHERE O_FOODKEY = F_FOODKEY) AS fCount
FROM FOOD

Related

Do I need to filter all subqueries before joining very large tables for optimization

I have three tables that I'm trying to join with over a billion rows per table. Each table has an index on the billing date column. If I just filter the left table and do the joins, will the query run efficiently or do I need to put the same date filter in each subquery?
I.e., will the first query run much slower than the second one?
select item, billing_dollars, IC_billing_dollars
from billing
left join IC_billing on billing.invoice_number = IC_billing.invoice_number
where billing.date = '2019-09-24'
select item, billing_dollars, IC_billing_dollars
from billing
left join (select * from IC_billing where IC_billing.date = '2019-09-24') IC_billing on billing.invoice_number = IC_billing.invoice_number
where billing.date = '2019-09-24'
I don't want to run this without knowing whether the query will perform well, as there aren't many safeguards for poorly performing queries. Also, if I need to write the query the second way, is there a way to have the date filter in only one location rather than having it show up multiple times in the query?
That depends.
Consider your query:
select b.item, b.billing_dollars, icb.IC_billing_dollars
from billing b left join
IC_billing icb
on b.invoice_number = icb.invoice_number
where b.date = '2019-09-24';
(Assuming I have the columns coming from the correct tables.)
The optimal strategy is an index on billing(date, invoice_number) -- perhaps also adding item and billing_dollars to the index -- and on ic_billing(invoice_number) -- perhaps with IC_billing_dollars.
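As a sketch of those indexes (the names are arbitrary; add the extra columns only if you want the query fully covered):
create index ix_billing_date_invoice on billing(date, invoice_number, item, billing_dollars);
create index ix_ic_billing_invoice on IC_billing(invoice_number, IC_billing_dollars);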
I can think of two situations where filtering on date in ic_billing would be useful.
First, if there is an index on (invoice_date, invoice_number), particularly a primary key definition. Then using this index is usually preferred, even if another index is available.
Second, if ic_billing is partitioned by invoice_date. In that case, you will want to specify the partition for performance.
Generally, though, the additional restriction on the invoice date does not help. In some databases, it might even hurt performance (particularly if the subquery is materialized and the outer query does not use an appropriate index).

Optimizing PostgreSQL query with multiple joins, order by and small limit

I need some tips for optimizing queries fetching from large tables.
In this example I have 5 tables:
Brands
- id_brand
- b_name
Products
- id_product
- p_name
- ean
...
- fk_brand
Prod_attributes
- id_prod_att
- size_basic
...
- fk_product
Stores
- id_store
- s_name
...
Stocks
- id_stock
- stock_amount
- fk_prod_att
- fk_store
I need a query returning an ordered, limited list of stocks, so this is the general approach I used:
SELECT stores.s_name, stocks.stock_amount, prod_attributes.size_basic,
products.p_name, products.ean, brands.b_name
FROM (stocks
INNER JOIN stores
ON stocks.fk_store = stores.id_store)
INNER JOIN (prod_attributes
INNER JOIN (products
INNER JOIN brands
ON products.fk_brand = brands.id_brand)
ON prod_attributes.fk_product = products.id_product)
ON stocks.fk_prod_att = prod_attributes.id_prod_att
ORDER BY s_name, p_name, size_basic
LIMIT 25 OFFSET 0
This works fast on small tables, but when the tables grow the query gets very expensive. With 3.5M rows in Stocks, 300K in Prod_attributes and 25K in Products, it executes in over 8800 ms, which is not acceptable for me.
All foreign keys have indexes and the DB has been vacuum-analyzed recently.
I know that the issue lies in the ORDER BY part; because of it, the query does not use indexes and does sequential scans. If I remove the ordering, the query is very fast.
To solve this I know I could remove the ORDER BY, but that is not a feasible option for me. Denormalization of the DB or a materialized view could also help here - again, I would like to avoid this if possible.
What else can I do to speed this query up?
EXPLAIN ANALYZE:
- slow with order by: http://explain.depesz.com/s/AHO
- fast without order by: http://explain.depesz.com/s/NRxr
A possible way to go is to remove stores from the join. Instead, you could:
Loop through stores (ordered by s_name) in a stored procedure or in application code and, for each store, execute the join filtered on stocks.fk_store; you can break the loop once you have obtained a sufficient number of records (see the sketch below).
If possible, partition stocks on the fk_store key, in order to heavily reduce the number of tuples in the join.
This way you should see a good benefit.
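A minimal PL/pgSQL sketch of the loop-per-store idea, assuming the table and column names from the question (the temp table's column types are guesses, and the page size of 25 matches your LIMIT):
CREATE TEMP TABLE page_result (
    s_name text, stock_amount integer, size_basic text,
    p_name text, ean text, b_name text
);

DO $$
DECLARE
    st record;
BEGIN
    -- walk the stores in output order and fetch stocks store by store
    FOR st IN SELECT id_store, s_name FROM stores ORDER BY s_name LOOP
        INSERT INTO page_result
        SELECT st.s_name, stocks.stock_amount, prod_attributes.size_basic,
               products.p_name, products.ean, brands.b_name
        FROM stocks
        INNER JOIN prod_attributes ON stocks.fk_prod_att = prod_attributes.id_prod_att
        INNER JOIN products ON prod_attributes.fk_product = products.id_product
        INNER JOIN brands ON products.fk_brand = brands.id_brand
        WHERE stocks.fk_store = st.id_store;
        -- stop as soon as one page worth of rows has been collected
        EXIT WHEN (SELECT count(*) FROM page_result) >= 25;
    END LOOP;
END;
$$;

SELECT * FROM page_result ORDER BY s_name, p_name, size_basic LIMIT 25;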

Performance issue with select query in Firebird

I have two tables, one small (~ 400 rows), one large (~ 15 million rows), and I am trying to find the records from the small table that don't have an associated entry in the large table.
I am encountering massive performance issues with the query.
The query is:
SELECT * FROM small_table WHERE NOT EXISTS
(SELECT NULL FROM large_table WHERE large_table.small_id = small_table.id)
The column large_table.small_id references small_table's id field, which is its primary key.
The query plan shows that the foreign key index is used for the large_table:
PLAN (large_table (RDB$FOREIGN70))
PLAN (small_table NATURAL)
Statistics have been recalculated for indexes on both tables.
The query takes several hours to run. Is this expected?
If so, can I rewrite the query so that it will be faster?
If not, what could be wrong?
I'm not sure about Firebird, but in other DBs a join is often faster:
SELECT *
FROM small_table st
LEFT JOIN large_table lt
ON st.id = lt.small_id
WHERE lt.small_id IS NULL
Maybe give that a try?
Another option, if you're really stuck, and depending on the situation this needs to be run in, is to extract the small_id column from large_table, possibly into a temp table, and then do the LEFT JOIN / EXISTS query against that.
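A rough sketch of that idea, using a Firebird global temporary table (the table and column names here are made up, and the INTEGER type is an assumption):
CREATE GLOBAL TEMPORARY TABLE tmp_small_ids (small_id INTEGER)
ON COMMIT PRESERVE ROWS;

INSERT INTO tmp_small_ids
SELECT DISTINCT small_id FROM large_table;

SELECT st.*
FROM small_table st
WHERE NOT EXISTS (SELECT 1 FROM tmp_small_ids t WHERE t.small_id = st.id);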
If the large table only has relatively few distinct values for small_id, the following might perform better:
select *
from small_table st left outer join
     (select distinct small_id
      from large_table
     ) lt
     on lt.small_id = st.id
where lt.small_id is null
In this case, performance would be better with a full scan of the large table and then index lookups in the small table -- the opposite of what it is doing now. The DISTINCT could be satisfied by just an index scan on the large table, which then uses the primary key index on the small table.

SQL Server search filter and order by performance issues

We have a table-valued function that returns a list of people you may access, and we have a relation between a search and a person called search result.
What we want to do is select all people from the search and present them.
The query looks like this:
SELECT qm.PersonID, p.FullName
FROM QueryMembership qm
INNER JOIN dbo.GetPersonAccess(1) ON GetPersonAccess.PersonID = qm.PersonID
INNER JOIN Person p ON p.PersonID = qm.PersonID
WHERE qm.QueryID = 1234
There are only 25 rows with QueryID=1234 but there are almost 5 million rows total in the QueryMembership table. The person table has about 40K people in it.
QueryID is not a PK, but it is indexed. The query plan tells me 97% of the total cost is spent doing a "Key Lookup" with the seek predicate:
QueryMembershipID = Scalar Operator (QueryMembership.QueryMembershipID as QM.QueryMembershipID)
Why is the PK in there when it's not used in the query at all? And why is it taking so long?
There are only 25 people in total. With the index, this should be a scan of all the QueryMembership rows that have QueryID = 1234 and then a JOIN against the 25 people that exist in the table-valued function - which, by the way, only has to be evaluated once and completes in less than 1 second.
If you want to avoid the "Key Lookup", use a covering index:
create index ix_QueryMembership_NameHere on QueryMembership (QueryID)
include (PersonID);
Add the other column names that you are going to select to the INCLUDE arguments.
As for why the PK's "Key Lookup" is working so slowly, try DBCC FREEPROCCACHE, ALTER INDEX ALL ON QueryMembership REBUILD, or ALTER INDEX ALL ON QueryMembership REORGANIZE.
This may help if your PK's index is disabled or the cache is keeping a wrong plan.
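For reference, those maintenance commands as runnable statements (a full rebuild can be heavy on a 5-million-row table, so schedule it with care):
DBCC FREEPROCCACHE;
ALTER INDEX ALL ON QueryMembership REBUILD;
ALTER INDEX ALL ON QueryMembership REORGANIZE;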
You should define indexes on the tables you query. In particular on columns referenced in the WHERE and ORDER BY clauses.
Use the Database Tuning Advisor to see what SQL Server recommends.
For specifics, of course you would need to post your query and table design.
But I have to make a couple of points here:
You've already jumped to the conclusion that the slowness is a result of the ORDER BY clause. I doubt it. The real test is whether or not removing the ORDER BY speeds up the query, which you haven't done. Dollars to donuts, it won't make a difference.
You only get the "log n" in your big-O claim when the optimizer actually chooses to use the index you defined. That may not be happening because your index may not be selective enough. The thing that makes your temp table solution faster than the optimizer's solution is that you know something about the subset of data being returned that the optimizer does not (specifically, that it is a really small subset of data). If your indexes are not selective enough for your query, the optimizer can't always reasonably assume this, and it will choose a plan that avoids what it thinks could be a worst-case scenario of tons of index lookups, followed by tons of seeks and then a big sort. Oftentimes, it chooses to scan and hash instead. So what you did with the temp table is often a way to solve this problem. Often you can narrow down your indexes or create an indexed view on the subset of data you want to work against. It all depends on the specifics of your query.
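As a rough illustration of the indexed-view idea (hypothetical names; in SQL Server the view needs SCHEMABINDING and a unique clustered index, and the filter must be a constant, so this only works if the subset you care about is fixed):
CREATE VIEW dbo.vw_QueryMembers_1234
WITH SCHEMABINDING
AS
SELECT QueryMembershipID, PersonID
FROM dbo.QueryMembership
WHERE QueryID = 1234;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_QueryMembers_1234
ON dbo.vw_QueryMembers_1234 (QueryMembershipID);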
You need indexes on the columns in your WHERE and ORDER BY clauses. I am not an expert, but I would bet it is doing a table scan for each row. Since your speed issue is resolved by removing the INNER JOIN or the ORDER BY, I bet the issue is specifically with the join. I bet it is doing the table scan on your joined table because of the sort. By putting an index on the columns in your WHERE clause first, you will be able to see if that is in fact the case.
Have you tried restructuring the query into a CTE to separate the TVF call? So, something like:
With QueryMembershipPerson As
(
    Select QM.PersonId, P.Fullname
    From QueryMembership As QM
    Join Person As P
        On P.PersonId = QM.PersonId
    Where QM.QueryId = 1234
)
Select QMP.PersonId, QMP.Fullname
From QueryMembershipPerson As QMP
Join dbo.GetPersonAccess(1) As PA
    On PA.PersonId = QMP.PersonId
EDIT: Btw, I'm assuming that there is an index on PersonId in both the QueryMembership and the Person table.
EDIT: What about two common table expressions, like so:
With
QueryMembershipPerson As
(
    Select QM.PersonId, P.Fullname
    From QueryMembership As QM
    Join Person As P
        On P.PersonId = QM.PersonId
    Where QM.QueryId = 1234
),
PersonAccess As
(
    Select PersonId
    From dbo.GetPersonAccess(1)
)
Select QMP.PersonId, QMP.Fullname
From QueryMembershipPerson As QMP
Join PersonAccess As PA
    On PA.PersonId = QMP.PersonId
Yet another solution would be a derived table like so:
Select ...
From (
    Select QM.PersonId, P.Fullname
    From QueryMembership As QM
    Join Person As P
        On P.PersonId = QM.PersonId
    Where QM.QueryId = 1234
) As QueryMembershipPerson
Join dbo.GetPersonAccess(1) As PA
    On PA.PersonId = QueryMembershipPerson.PersonId
If pushing some of the query into a temp table and then joining on that works, I'd be surprised that you couldn't combine that concept into a CTE or a query with a derived table.

Hash Join Showing up on Full Text Query - SQL Server 2005

I have the following Query in SQL Server 2005:
SELECT
PRODUCT_ID
FROM
PRODUCTS P
WHERE
SUPPLIER_ORGANIZATION_ID = 13225
AND ACTIVE_FLAG = 'Y'
AND CONTAINS(*, 'FORMSOF(Inflectional, "%clip%") ')
What's interesting is that using this generates a Hash Match whereas if I use a different SUPPLIER_ORGANIZATION_ID (older supplier), it uses a Merge Join. Obviously the Hash is much slower than the Merge Join. What I don't get is why there is a difference, and what's needed to make it run faster?
FYI, there are about 5 million records in the PRODUCTS table. When that supplier organization ID (13225) is selected, there are about 25,000 products for that supplier.
Thanks in advance.
I'd try using the OPTIMIZE FOR Query Hint to force it one way or the other.
SELECT
PRODUCT_ID
FROM
PRODUCTS P
WHERE
SUPPLIER_ORGANIZATION_ID = @Supplier_Organisation_Id
AND ACTIVE_FLAG = 'Y'
AND CONTAINS(*, 'FORMSOF(Inflectional, @Keywords) ')
OPTION (OPTIMIZE FOR (@Supplier_Organisation_Id = 1000))
One other thing: your STATISTICS might be out of date. The tipping point for automatic updates is often not low enough, meaning that the query plan chosen may not be ideal for your data. I'd suggest updating the STATISTICS on your Products table, perhaps creating a job to do this on a regular basis if this turns out to be part of the problem.
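A minimal sketch of the manual refresh (FULLSCAN is optional and can be heavy on a 5-million-row table):
UPDATE STATISTICS dbo.PRODUCTS WITH FULLSCAN;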