Hash Join Showing up on Full Text Query - SQL Server 2005

I have the following Query in SQL Server 2005:
SELECT
PRODUCT_ID
FROM
PRODUCTS P
WHERE
SUPPLIER_ORGANIZATION_ID = 13225
AND ACTIVE_FLAG = 'Y'
AND CONTAINS(*, 'FORMSOF(Inflectional, "%clip%") ')
What's interesting is that using this generates a Hash Match whereas if I use a different SUPPLIER_ORGANIZATION_ID (older supplier), it uses a Merge Join. Obviously the Hash is much slower than the Merge Join. What I don't get is why there is a difference, and what's needed to make it run faster?
FYI, there are about 5 million records in the PRODUCTS table. When supplier organization id is selected (13225), there are about 25000 products for that supplier.
Thanks in advance.

I'd try using the OPTIMIZE FOR Query Hint to force it one way or the other.
SELECT
PRODUCT_ID
FROM
PRODUCTS P
WHERE
SUPPLIER_ORGANIZATION_ID = @Supplier_Organisation_Id
AND ACTIVE_FLAG = 'Y'
AND CONTAINS(*, 'FORMSOF(Inflectional, @Keywords) ')
OPTION (OPTIMIZE FOR (@Supplier_Organisation_Id = 1000 ))
One other thing: your STATISTICS might be out of date; the tipping point for automatic updates is often not low enough, meaning that the query plan chosen may not be ideal for your data. I'd suggest updating the STATISTICS on your PRODUCTS table, and perhaps creating a job to do this on a regular basis if this turns out to be part of the problem.
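For example, a minimal sketch of a manual refresh, assuming the PRODUCTS table from the question (sp_updatestats is the database-wide variant):
UPDATE STATISTICS dbo.PRODUCTS WITH FULLSCAN;
-- or refresh out-of-date statistics across the whole database:
EXEC sp_updatestats;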

Related

Cost (%CPU) around the same after denormalization?

Using Oracle SQL, I created a query to find the total count of orders for each food.
EXPLAIN PLAN FOR
SELECT FOOD.F_NAME, COUNT(ORDERS.O_ORDERID)
FROM ORDERS
INNER JOIN CUSTOMER ON O_CUSTID = C_CUSTID
INNER JOIN FOOD ON C_FOODKEY = F_FOODKEY
GROUP BY FOOD.F_NAME;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
This returns cost (%CPU) of 3250 at row ID 0 in the plan table output.
I learnt that denormalization can speed up a query and reduce its cost, so I copied the food name from my FOOD table into ORDERS to avoid the INNER JOINs. I should get a better cost (%CPU).
I then used this query:
EXPLAIN PLAN FOR
SELECT ORDERS.F_NAME, COUNT(ORDERS.O_ORDERID)
FROM ORDERS
GROUP BY ORDERS.F_NAME;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
The cost (%CPU) did not change much at all - the value is 3120 at row ID 0 in the plan table output.
Aren't denormalization and removal of the INNER JOINs supposed to improve my cost? The improvement is insignificant in my case. What's the issue here?
This is too long for a comment. You would have to study the execution plan. However, joins on primary keys are often not particularly expensive.
What is expensive is the GROUP BY, because this requires moving data around. You could try adding an index on F_NAME in the second query.
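For instance, a sketch of the suggested index on the denormalized ORDERS table (the index name is made up; whether the optimizer actually uses it depends on the plan):
CREATE INDEX ix_orders_f_name ON ORDERS (F_NAME);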
Your data model is also unusual. It is unclear why a column called FOOD would be stored at the CUSTOMER level.
C_FOODKEY should most likely be O_FOODKEY for the model to make sense.
You do not need denormalization. What you are doing is exploding all the rows with joins and then grouping them back together. Avoid doing that in the first place and try something like this:
SELECT FOOD.F_NAME,
(SELECT COUNT(*)
FROM ORDERS
WHERE O_FOODKEY = F_FOODKEY) AS fCount
FROM FOOD

T-SQL. What is better: join then group or group then join

I have 2 tables:
Order:
IdProduct (what is ordered - FK to Product table)
Price (what is the total price for offer)
Piece (i.e. count - how many products are ordered?)
Product:
Id
Name
And there are 2 SQL statements that return products for the best price per item:
Statement #1:
SELECT
p.Name,
MIN (Price / Piece) AS MinPrice
FROM
[ORDER] o
JOIN
Product p ON IdProduct = p.Id
GROUP BY
p.Name
Statement #2:
SELECT p.Name, t.MinPrice
FROM
(SELECT IdProduct, MIN(Price/Piece) AS MinPrice
FROM [Order]
GROUP BY IdProduct) t
JOIN
Product p ON p.Id = t.IdProduct
I investigated execution plans in Microsoft SQL Server Management Studio and they look very similar, though I have several observations:
Why does the first plan use an [order by name] step? Its output has product names ordered ascending even though I don't use a T-SQL ORDER BY instruction.
This implicit "order by name asc" slows down the first SQL. When I add "order by name asc" to the second SQL, they become identical in execution plan cost.
I guess that SQL #2 should outperform #1 because:
a) it groups by the PK (an integer), not by name (an nvarchar column which, moreover, is not indexed);
b) it joins the tables only after the first one is grouped, which should maximize performance (compared to joining the two full tables, as is expected for the first SQL) - yet the execution plans show the same estimated execution cost nevertheless.
What SQL statement would you prefer and why? Maybe you have your own version of the SQL statement?
Personally, I would prefer statement 2. My reason is quite different from what you would expect.
Have you realized your 2 statements are not built to return the same results?
The 1st query does NOT group records by product, it groups them by product name. In most DBs out there, columns called name are not unique. Therefore, the 2 GROUP BYs are not equivalent (maybe your test data happens to make the 2 results identical, but that's only luck playing here).
Here is what should have been written:
SELECT
p.Name,
MIN (Price / Piece) AS MinPrice
FROM
[ORDER] o
JOIN
Product p ON IdProduct = p.Id
GROUP BY
IdProduct, p.Name /* GROUP BY PK on Product */
IMHO, the 2nd syntax is good protection against that kind of mistake; I'd advise using it.
That will save you some trouble when you work on a legacy DB with 100+ tables instead of the 2 tables you created and filled yourself, not to mention that the 1st statement could appear to work correctly for a long time until, finally, Product.Name becomes non-unique.
BTW, the implicit order by was hinting that it is not grouping by the PK column. It is not slowing down your query; it is ordering records in preparation for the GROUP BY.
PS: to answer your question about performance, your 2nd statement and the one I have written should perform very similarly (thanks to the query planner).
I have sometimes seen the 1st statement be significantly slower, but never significantly faster, than the 2nd (if exceptions exist, they are rare enough for me to have missed them).
PPS: Since you aggregate data from Product, adding a WHERE on a field from Order could make things more complicated for performance.
I am afraid that is the kind of thing you have to test every single time a new query is developed.
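To illustrate the PPS with a sketch (the OrderDate column is hypothetical; it does not exist in the schema above): with statement #2, such a filter can be pushed into the derived table, so the grouping only sees the matching rows.
SELECT p.Name, t.MinPrice
FROM
(SELECT IdProduct, MIN(Price / Piece) AS MinPrice
FROM [Order]
WHERE OrderDate >= '20240101' -- hypothetical column, for illustration only
GROUP BY IdProduct) t
JOIN
Product p ON p.Id = t.IdProduct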

Query equivalence with DISTINCT

Let us have a simple table orders(id: int, category: int, order_date: int) created using the following script:
IF OBJECT_ID('dbo.orders', 'U') IS NOT NULL DROP TABLE dbo.orders
SELECT TOP 1000000
NEWID() id,
ABS(CHECKSUM(NEWID())) % 100 category,
ABS(CHECKSUM(NEWID())) % 10000 order_date
INTO orders
FROM sys.sysobjects
CROSS JOIN sys.all_columns
Now, I have two equivalent queries (at least I believe that they are equivalent):
-- Q1
select distinct o1.category,
(select count(*) from orders o2 where order_date = 1 and o1.category = o2.category)
from orders o1
-- Q2
select o1.category,
(select count(*) from orders o2 where order_date = 1 and o1.category = o2.category)
from (select distinct category from orders) o1
However, when I run these queries they behave quite differently. Q2 is twice as fast for my data, and that is clearly caused by the fact that its query plan first finds the unique categories (a hash match in the query plans) before the join.
The difference is still there if I add the requested index:
CREATE NONCLUSTERED INDEX ix_order_date ON orders(order_date)
INCLUDE (category)
Moreover, Q2 can also use the following index efficiently, whereas Q1 remains the same:
CREATE NONCLUSTERED INDEX ix_orders_kat ON orders(category, order_date)
My questions are:
Are these queries equivalent?
If yes, what prevents the SQL Server 2016 query optimizer from finding the second query plan in the case of Q1? (I believe that the search space must be quite small in this case.)
If no, could you post a counterexample?
EDIT
My motivation for the question is that I would like to understand why query optimizers are so poor at rewriting even simple queries, and why they rely on SQL syntax so heavily. SQL is a declarative language, so why are SQL query processors driven by syntax so often, even for simple queries like this one?
The queries are functionally equivalent, meaning that they should return the same data.
However, they are interpreted differently by the SQL engine. The first (SELECT DISTINCT) generates all the results and then removes the duplicates.
The second extracts the distinct values first, so the subquery is only called on the appropriate subset.
An index might make either query more efficient, but it won't fundamentally affect whether the distinct processing occurs before or after the subquery.
In this case, the results are the same. However, that is not necessarily true depending on the subquery.
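A sketch of such a case: with a non-deterministic subquery, the two forms can return different numbers of rows. Here NEWID() stands in for the count (an illustration, not from the question):
-- Q1 variant: the subquery runs once per source row, so DISTINCT removes (almost) nothing
select distinct o1.category, (select NEWID()) as tag
from orders o1
-- Q2 variant: the subquery runs once per distinct category, returning ~100 rows
select o1.category, (select NEWID()) as tag
from (select distinct category from orders) o1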

Small vs Large and Large vs Small sql joins [duplicate]

I was just tidying up some sql when I came across this query:
SELECT
jm.IMEI ,
jm.MaxSpeedKM ,
jm.MaxAccel ,
jm.MaxDeccel ,
jm.JourneyMaxLeft ,
jm.JourneyMaxRight ,
jm.DistanceKM ,
jm.IdleTimeSeconds ,
jm.WebUserJourneyId ,
jm.lifetime_odo_metres ,
jm.[Descriptor]
FROM dbo.Reporting_WebUsers AS wu WITH (NOLOCK)
INNER JOIN dbo.Reporting_JourneyMaster90 AS jm WITH (NOLOCK) ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j WITH (NOLOCK) ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE ( wu.isActive = 1 )
AND ( j.JourneyDuration > 2 )
AND ( j.JourneyDuration < 1000 )
AND ( j.JourneyDistance > 0 )
My question is: does the order of the joins make any performance difference? For the above query I would have started with
FROM dbo.Reporting_JourneyMaster90 AS jm
and then joined the other 2 tables to that one
Join order in SQL Server 2008 R2 unquestionably affects query performance, particularly in queries with a large number of table joins and WHERE clauses applied against multiple tables.
Although the join order is changed during optimisation, the optimiser doesn't try all possible join orders. It stops when it finds what it considers a workable solution, as the very act of optimisation uses precious resources.
We have seen queries that were performing like dogs (1min + execution time) come down to sub second performance just by changing the order of the join expressions. Please note however that these are queries with 12 to 20 joins and where clauses on several of the tables.
The trick is to set your order so as to help the query optimiser figure out what makes sense. You can use FORCE ORDER, but that can be too rigid. Try to make sure that your join order starts with the tables whose WHERE clauses will reduce the data the most.
No, the JOIN order is changed during optimization.
The only caveat is the Option FORCE ORDER which will force joins to happen in the exact order you have them specified.
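For example, a sketch using the tables from the question (hint syntax only; whether forcing the order helps is workload-dependent):
SELECT jm.IMEI
FROM dbo.Reporting_JourneyMaster90 AS jm
INNER JOIN dbo.Reporting_WebUsers AS wu ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE wu.isActive = 1
OPTION (FORCE ORDER); -- joins happen exactly in the order written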
I have a clear example of inner join order affecting performance. It is a simple join between two tables. One has 50+ million records, the other has 2,000. If I select from the smaller table and join the larger, it takes 5+ minutes.
If I select from the larger table and join the smaller it takes 2 min 30 seconds.
This is with SQL Server 2012.
To me this is counterintuitive, since I am using the largest dataset for the initial query.
Usually not. I'm not 100% sure this applies verbatim to SQL Server, but in Postgres the query planner reserves the right to reorder the inner joins as it sees fit. The exception is when you reach a threshold beyond which it's too expensive to investigate changing their order.
JOIN order doesn't matter; the query engine will reorganize the order based on index statistics and other factors.
For test do the following:
select 'Include Actual Execution Plan' and run the first query
change JOIN order and now run the query again
compare execution plans
They should be identical as the query engine will reorganize them according to other factors.
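A sketch of that comparison harness; SET STATISTICS IO/TIME print per-query I/O and timing to the Messages tab:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the query with the original join order here
-- then change the join order and run it again
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;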
As commented in another answer, you could use OPTION (FORCE ORDER) to get exactly the order you want, but it may not be the most efficient one.
As a general rule of thumb, the JOIN order should put the table with the fewest records first and the one with the most records last, as in some DBMS engines the order can make a difference, especially when FORCE ORDER is used to pin it.
Wrong. In SQL Server 2005 it definitely matters, since you are limiting the dataset from the beginning of the FROM clause. If you start with 2,000 records instead of 2 million, your query is faster.

Optimising a SELECT query that runs slow on Oracle which runs quickly on SQL Server

I'm trying to run the following SQL statement in Oracle, and it takes ages to run:
SELECT orderID FROM tasks WHERE orderID NOT IN
(SELECT DISTINCT orderID FROM tasks WHERE
engineer1 IS NOT NULL AND engineer2 IS NOT NULL)
If I run just the sub-part that is in the IN clause, that runs very quickly in Oracle, i.e.
SELECT DISTINCT orderID FROM tasks WHERE
engineer1 IS NOT NULL AND engineer2 IS NOT NULL
Why does the whole statement take such a long time in Oracle? In SQL Server the whole statement runs quickly.
Alternatively is there a simpler/different/better SQL statement I should use?
Some more details about the problem:
Each order is made of many tasks
Each order is either allocated (one or more of its tasks have engineer1 and engineer2 set) or unallocated (all of its tasks have null values for the engineer fields)
I am trying to find all the orderIDs that are unallocated.
Just in case it makes any difference, there are ~120k rows in the table, and 3 tasks per order, so ~40k different orders.
Responses to answers:
I would prefer a SQL statement that works in both SQL Server and Oracle.
The tasks table only has an index on orderID and taskID.
I tried the NOT EXISTS version of the statement, but it ran for over 3 minutes before I cancelled it. Perhaps I need a JOIN version of the statement?
There is an "orders" table as well with the orderID column. But I was trying to simplify the question by not including it in the original SQL statement.
I guess that in the original SQL statement the sub-query is run every time for each row in the first part of the SQL statement - even though it is static and should only need to be run once?
Executing
ANALYZE TABLE tasks COMPUTE STATISTICS;
made my original SQL statement execute much faster.
Although I'm still curious why I have to do this, and if/when I would need to run it again?
The statistics give Oracle's cost-based optimizer information that it needs to determine the efficiency of different execution plans: for example, the number of rows in a table, the average width of rows, highest and lowest values per column, number of distinct values per column, clustering factor of indexes, etc.
In a small database you can just set up a job to gather statistics every night and leave it alone. In fact, this is the default under 10g. For larger implementations you usually have to weigh the stability of the execution plans against the way that the data changes, which is a tricky balance.
Oracle also has a feature called "dynamic sampling" that is used to sample tables to determine relevant statistics at execution time. It's much more often used with data warehouses, where the overhead of the sampling is outweighed by the potential performance increase for a long-running query.
Often this type of problem goes away if you analyze the tables involved (so Oracle has a better idea of the distribution of the data)
ANALYZE TABLE tasks COMPUTE STATISTICS;
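Note that ANALYZE ... COMPUTE STATISTICS is deprecated for gathering optimizer statistics in later Oracle releases; the DBMS_STATS package is the documented replacement. A minimal sketch of the equivalent call:
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TASKS');
END;
/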
The "IN" - clause is known in Oracle to be pretty slow. In fact, the internal query optimizer in Oracle cannot handle statements with "IN" pretty good. try using "EXISTS":
SELECT orderID FROM tasks WHERE orderID NOT EXISTS
(SELECT DISTINCT orderID FROM tasks WHERE
engineer1 IS NOT NULL AND engineer2 IS NOT NULL)
Caution: Please check if the query builds the same data results.
Edit: oops, the query is not well formed, but the general idea is correct. Oracle has to perform a full table scan for the second (inner) query, build the results, and then compare them to the first (outer) query; that's why it's slow. Try
SELECT t1.orderID FROM tasks t1 WHERE NOT EXISTS
(SELECT 1 FROM tasks t2 WHERE
t2.engineer1 IS NOT NULL AND t2.engineer2 IS NOT NULL AND t1.orderID = t2.orderID)
or something similar ;-)
I would try using joins instead
SELECT
t.orderID
FROM
tasks t
LEFT JOIN tasks t1
ON t.orderID = t1.orderID
AND t1.engineer1 IS NOT NULL
AND t1.engineer2 IS NOT NULL
WHERE
t1.orderID IS NULL
Also, your original query would probably be easier to understand if it were specified as:
SELECT orderID FROM orders WHERE orderID NOT IN
(SELECT DISTINCT orderID FROM tasks WHERE
engineer1 IS NOT NULL AND engineer2 IS NOT NULL)
(assuming you have orders table with all the orders listed)
which can be then rewritten using joins as:
SELECT
o.orderID
FROM
orders o
LEFT JOIN tasks t
ON o.orderID = t.orderID
AND t.engineer1 IS NOT NULL
AND t.engineer2 IS NOT NULL
WHERE
t.orderID IS NULL
I agree with TZQTZIO, I don't get your query.
If we assume the query did make sense, then you might want to try using EXISTS as some suggest, and avoid IN. IN is not always bad, though, and there are likely cases in which one could show it actually performs better than EXISTS.
The question title is not very helpful. I could set this query up in one Oracle database and make it run slowly, and make it run fast in another. There are many factors that determine how the database resolves the query: object statistics, SYS schema statistics, and parameters, as well as server performance. SQL Server vs. Oracle isn't the problem here.
For those interested in query tuning and performance and want to learn more some of the google terms to search are "oak table oracle" and "oracle jonathan lewis".
Some questions:
How many rows are there in tasks?
What indexes are defined on it?
Has the table been analyzed recently?
Another way to write the same query would be:
select orderid from tasks
minus
select orderid from tasks
where engineer1 IS NOT NULL AND engineer2 IS NOT NULL
However, I would rather expect the query to involve an "orders" table:
select orderid from ORDERS
minus
select orderid from tasks
where engineer1 IS NOT NULL AND engineer2 IS NOT NULL
or
select orderid from ORDERS
where orderid not in
( select orderid from tasks
where engineer1 IS NOT NULL AND engineer2 IS NOT NULL
)
or
select orderid from ORDERS
where not exists
( select null from tasks
where tasks.orderid = orders.orderid
and engineer1 IS NOT NULL AND engineer2 IS NOT NULL
)
I think several people have pretty much the right SQL, but are missing a join between the inner and outer queries.
Try this:
SELECT t1.orderID
FROM tasks t1
WHERE NOT EXISTS
(SELECT 1
FROM tasks t2
WHERE t2.orderID = t1.orderID
AND t2.engineer1 IS NOT NULL
AND t2.engineer2 IS NOT NULL)
"Although I'm still curious why I have to do this, and if/when I would need to run it again?"
The statistics give Oracle's cost-based optimizer information that it needs to determine the efficiency of different execution plans: for example, the number of rows in a table, the average width of rows, highest and lowest values per column, number of distinct values per column, clustering factor of indexes, etc.
In a small database you can just set up a job to gather statistics every night and leave it alone. In fact, this is the default under 10g. For larger implementations you usually have to weigh the stability of the execution plans against the way that the data changes, which is a tricky balance.
Oracle also has a feature called "dynamic sampling" that is used to sample tables to determine relevant statistics at execution time. It's much more often used with data warehouses, where the overhead of the sampling is outweighed by the potential performance increase for a long-running query.
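If you want to experiment, dynamic sampling can also be requested per query with a hint; a sketch (level 4 is just an example value):
SELECT /*+ dynamic_sampling(tasks 4) */ orderID
FROM tasks
WHERE engineer1 IS NULL OR engineer2 IS NULL;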
Isn't your query the same as
SELECT orderID FROM tasks
WHERE engineer1 IS NULL OR engineer2 IS NULL
?
How about:
SELECT DISTINCT orderID FROM tasks t1 WHERE NOT EXISTS (SELECT * FROM tasks t2 WHERE t2.orderID=t1.orderID AND (engineer1 IS NOT NULL OR engineer2 IS NOT NULL));
I am not a guru of optimization, but maybe you also overlooked some indexes in your Oracle database.
Another option is to use MINUS (EXCEPT on MSSQL)
SELECT orderID FROM tasks
MINUS
SELECT DISTINCT orderID FROM tasks WHERE engineer1 IS NOT NULL
AND engineer2 IS NOT NULL
If you decide to create an ORDERS table, I'd add an ALLOCATED flag to it and create a bitmap index on it. This approach also forces you to modify the business logic to keep the flag updated, but the queries will be lightning fast. It depends on how critical the queries are for the application.
Regarding the answers, the simpler the better in this case. Forget subqueries, joins, distinct and group bys, they are not needed at all!
The Oracle optimizer does a good job of processing MINUS statements. If you re-write your query using MINUS, it is likely to run quite quickly:
SELECT orderID FROM tasks
MINUS
SELECT DISTINCT orderID FROM tasks WHERE
engineer1 IS NOT NULL AND engineer2 IS NOT NULL
What proportion of the rows in the table meet the condition "engineer1 IS NOT NULL AND engineer2 IS NOT NULL"?
This tells you (roughly) whether it might be worth trying to use an index to retrieve the associated orderid's.
Another way to write the query in Oracle that would handle unindexed cases very well would be:
select distinct orderid
from
(
select orderid,
max(case when engineer1 is null and engineer2 is null then 0 else 1 end)
over (partition by orderid)
as max_null_finder
from tasks
)
where max_null_finder = 0
New take.
Iff:
The COUNT() function does not count NULL values
and
You want the orderID of all tasks where none of the tasks have either engineer1 or engineer2 set to a value
then this should do what you want:
SELECT orderID
FROM tasks
GROUP BY orderID
HAVING COUNT(engineer1) = 0 AND COUNT(engineer2) = 0
Please test it.
I agree with ΤΖΩΤΖΙΟΥ and wearejimbo that your query should be...
SELECT DISTINCT orderID FROM Tasks
WHERE Engineer1 IS NULL OR Engineer2 IS NULL;
I don't know about SQL Server, but this query won't be able to take advantage of any indexes, because null rows aren't in indexes. The solution would be to rewrite the query in a way that allows a function-based index to be created that only includes the null-value rows. This could be done with NVL2, but would likely not be portable to SQL Server.
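A sketch of that NVL2 trick, illustrated with engineer1 only (the index name is made up). The indexed expression is NULL for rows where engineer1 is set, and rows whose entire key is NULL are not stored in a b-tree index, so the index contains only the unallocated rows:
CREATE INDEX ix_tasks_eng1_null ON tasks (NVL2(engineer1, NULL, orderID));
-- the query must repeat the indexed expression for the index to be usable
SELECT DISTINCT NVL2(engineer1, NULL, orderID) AS orderID
FROM tasks
WHERE NVL2(engineer1, NULL, orderID) IS NOT NULL;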
I think the best answer is one that does not meet your criteria: write a different statement for each platform, each best suited to that platform.
If you have no index over the Engineer1 and Engineer2 columns then you are always going to generate a Table Scan in SQL Server and the equivalent whatever that may be in Oracle.
If you just need the Orders that have unallocated tasks, then the following should work just fine on both platforms, but you should also consider adding indexes to the Tasks table to improve query performance (a sketch follows the query below).
SELECT DISTINCT orderID
FROM tasks
WHERE (engineer1 IS NULL OR engineer2 IS NULL)
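If you add those indexes, one possible shape (index names invented; INCLUDE is SQL Server syntax, and the Oracle equivalent would simply omit it):
CREATE INDEX ix_tasks_engineer1 ON tasks (engineer1) INCLUDE (orderID);
CREATE INDEX ix_tasks_engineer2 ON tasks (engineer2) INCLUDE (orderID);
-- two single-column indexes give the optimizer the option of an index union for the OR predicate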
Here is an alternate approach which I think gives what you want:
SELECT orderID
FROM tasks
GROUP BY orderID
HAVING COUNT(engineer1) = 0 OR COUNT(engineer2) = 0
I'm not sure if you want "AND" or "OR" in the HAVING clause. It sounds like according to business logic these two fields should either both be populated or both be NULL; if this is guaranteed then you could reduce the condition to just checking engineer1.
Your original query would, I think, give multiple rows per orderID, whereas mine will only give one. I am guessing this is OK since you are only fetching the orderID.
Sub-queries are "bad" with Oracle; it's generally better to use joins.
Here's an article on how to rewrite your subqueries with joins:
http://www.dba-oracle.com/sql/t_rewrite_subqueries_performance.htm