Why the difference in speed between these SQL queries? - sql

I'm currently trying to write a query to sift through our ERP DB and I noticed a very odd drop in speed when I removed a filter condition.
This is what the Query looked like before (it took less than a second to complete, typically returned anywhere from 10 to hundreds of records depending on the order and item)
SELECT TOP 1000 jobmat.job, jobmat.suffix, jobmat.item, jobmat.matl_qty,
jobmat.ref_type, jobmat.ref_num, spec.NoteContent, spec.NoteDesc,
job.ord_num, jobmat.RowPointer
FROM jobmat
INNER JOIN ObjectNotes AS obj ON obj.RefRowPointer = jobmat.RowPointer
INNER JOIN SpecificNotes AS spec ON obj.SpecificNoteToken = spec.SpecificNoteToken
INNER JOIN job ON job.job = jobmat.job AND job.suffix = jobmat.suffix
WHERE ord_num LIKE '%3766%' AND ref_type = 'P' AND
(spec.NoteDesc LIKE '%description%' OR spec.NoteContent LIKE '%COMPANY%DWG%1162402%')
And this is what I changed the WHERE Statement too:
WHERE ord_num LIKE '%3766%' AND ref_type = 'P' AND
spec.NoteContent LIKE '%COMPANY%DWG%1162402%'
Running it after having made this modifcation bumped my runtime up to like 9 seconds (returns normally 1-3 records). Is there an obvious reason that I'm missing? I would have thought that the same should have been roughly the same. Any help is greatly appreciated!
Edit: I have run both versions of this query many times to test, and the runtimes are fairly consistant; <1 second for version 1, ~9 seconds for version 2.

Your original query has to find 1000 matching rows before it returns the result set. It is doing this by creating a result set based on the JOINs and then filtering based on the WHERE clause.
The condition:
(spec.NoteDesc LIKE '%description%' OR spec.NoteContent LIKE '%COMPANY%DWG%1162402%')
matches more rows than:
(spec.NoteContent LIKE '%COMPANY%DWG%1162402%')
Hence, the second version has to go through and process more rows to get 1,000 matches.
That is why the second version is slower.
If you want the second version to seem faster, you can add an ORDER BY clause. This requires reading all the data in both cases. Because the ORDER BY then takes additional time related to the number of matching rows, you'll probably see that the second is slightly faster than the first. One caveat: both will then take at least 9 seconds.
EDIT:
I like the above explanation, but if all the rows are processed in both cases, then it might not be correct. Here is another explanation, but it is more speculative.
It depends on NoteContent being very long, so the LIKE has a noticeable effect on query performance. The second version might be "optimized" by pushing the filtering clauses to the node in the processing that reads the data. This is something that SQL Server regularly does. That means that the LIKE is being processed on all the data in the second case.
Because of the OR condition, the first example cannot push the filtering down to that level. Then, the condition on NoteDescription is short-circuited in multiple ways:
The join conditions.
The other WHERE conditions.
The first LIKE condition.
Because of all this filtering, the second condition is run on many fewer rows, and the code runs faster. Usually filtering before joining is a good idea, but there are definitely exceptions to this rule.

Related

Is SELECT TOP always the fastest way to get a preview of a query you want on SQL?

I have the following query that I've ran:
SELECT TOP 100 certs.CertId, COUNT(cluster.BGTJobId) C
FROM [CentralDB_US_33].[dbo].[JobSkillClusterIndex] cluster
INNER JOIN [Eagle].[raw].[certs] certs
ON certs.BGTJobId = cluster.BGTJobId
GROUP BY cluster.skillClusterId, certs.CertId
Ultimately, I want to get the full results and not just the top 100, but for previewing purposes, is this the fastest way to go?
Since you've mentioned this is for preview purposes, so I'm assuming you just want data out of the query and you want it to run FAST regardless of the data it returns, and seeing that you mentioned that the query takes 14 minutes to execute, a quick 'hack-fix' would be to use something like below:
SELECT
certs.CertId
, COUNT(cluster.BGTJobId)
FROM
(SELECT TOP 100
certs.CertId
FROM [Eagle].[raw].[certs] certs) certs
INNER JOIN [CentralDB_US_33].[dbo].[JobSkillClusterIndex] cluster
ON certs.BGTJobId = cluster.BGTJobId
GROUP BY cluster.skillClusterId, certs.CertId
Aggregating data (in your case COUNT) is a very expensive operation and should be done only at the last part of the query on as little data as possible. That is why, for "preview" purposes I have selected onyl the first 100 certificates and made the COUNT on that data.
However, because you mentioned that the query takes 14 minutes to run, the problems are elsewhere and usually this is due to design (query design, index design or even table design).
You should ask yourself if you really want to go over all of the data in the tables and get all of the matching rows from both tables, and aren't you possibly missing a WHERE clause?
If you do decide that there is a WHERE clause needed, are there any indexes to help filtering the data based on the conditions of your WHERE clause (and even the join columns - certs.BGTJobId and cluster.BGTJobId?
Yes top select query is fastest for preview purposes that why its also shown in management studio GUI right click. But if you are running custom query just check that where clause / grouping etc are done is part of the clustered index.

How to do server paging in SQL correct?

My situation: My application is slow. As slow as it gets... mostly because I have the feeling my Server paging for my dataTables / grids are wrongly implemented.
Let's start:
I have a SQL Server 2008 database, one table with all the information, 10 columns in it, at the moment 19K rows
My application is based on a JavaScript and ASP.Net backend code.
My SQL query is:
WITH Ordered AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Created DESC) AS 'RowNumber'
FROM Meetings
WHERE State IN ('Appointed', 'Accepted')
AND [xxx] LIKE '%1%'
AND [yyy] LIKE '%2%'
)
SELECT *
FROM Ordered
WHERE RowNumber BETWEEN 1 AND 41;
So at the moment this query runs around 27 to 32 seconds, which means over 30 seconds I got a timeout... on 19k rows in 1 year... which means in 1 month latest every query will run against dead...
As far as I am understand the order for this query is the problem: No index done here.
Because the query first sorts, then selects all with a manual row number, then selects only 40... (of course on page 2 of my grid it gets Rows 41 to 81...)
I COULD do an Index on my "Created desc" and the query would be much much faster, BUT every column is sortable for my grid which means "Created desc" could be every other column of my table and of course desc and asc order!
So, how to improve this?
//Edit:
Sorry to forget that:
The inner query (Inner Select) runs 6 seconds, while the total query runs 31 seconds...
Which means the "WITH ORDERES AS" is the problem here!
First things first: you have a performance problem, approach it with a proper methodology and measure appropriately. The inner query (Inner Select) runs 6 seconds, while the total query runs 31 seconds... Which means ... is amateurism. Read How to analyse SQL Server performance for correct ways to measure performance. And before we continue, if you start from 6 seconds you have already lost the game.
Now, on to the question.
WHERE State in('Appointed','Accepted') AND [xxx] LIKE '%1%' AND [yyy] LIKE '%2%'
This expression is basically non-indexable. Even if you add an index on State it will not help because of the low cardinality (few values with many rows each). And like '% ... %' is unindexable because it searches for values in the middle of the text.
You could try to replace like '% ... %' with a full-text search like CONTAINS ... which will be faster, provider you search for specific enough terms. But it does require you to deploy and configure properly the full-text indexes.
As for the paging, I do not favor much the ROWNUMBER approach. Even when a sort column exists, it involves a scan and count to skip the number of rows and gets slower and slower as you go to higher pages. I much more recommend the key based approach:
SELECT TOP (page size) ...
WHERE keys > <last row>
ORDER BY...
but this approach is more difficult to implement as it requires keeping track of keys rather than the page number.
But expect no miracles. You are asking a relational OLTP system to do the work of an ElasticSearch/Solr. It will never work as you expect. Use a tool appropriate for the job (a Search engine). Also read Dynamic Search Conditions in T‑SQL for a more thorough discussion, but again, expect no miracles.

SQL Server: Returning different number of rows from the same source data?

I have encountered today by far the most strange behaviour of SQL Server 2008 for me so far (or maybe I was already too tired to figure out what is going on).
I have a fairly complicated statement operating on around 1,100,000 rows. My source data is in table used by the subquery and it doesn't change. The statement looks something like this (I'll include only the parts that could cause the error in my opinion):
SELECT
-- here are some columns straight from the subquery and some hardcoded columns like
'My company' AS CompanyName
-- and here is the most important part
SUM(Subq.ImportantFloatWithBigPrecision)
FROM
(
SELECT
-- Here are some simple columns fetched from table and after that:
udf.GetSomeMappedValue(Value) AS MappedValueFromFunction
,ImportantFloatWithBigPrecision
FROM MyVeryBigTable
WHERE ImportantFloatWithBigPrecision <> 0
AND(...)
) Subq
GROUP BY everything except SUM(Subq.ImportantFloatWithBigPrecision)
HAVING SUM(Subq.ImportantFloatWithBigPrecision)<>0
ORDER BY (...)
I've checked the subquery quite a few times and it returns the same results every time, but the whole query returns 850-855 rows from the same data! Most of the time (around 80%) it is 852 rows, but sometimes its 855, sometimes 850 and I really have no idea why.
Surprisingly removing the <> 0 condition from the subquery helped at first glance (so far 6 times I got the same results), but it has drastic impact on performance (for this amount of rows it runs about 8-9 mins longer (remember the udf? :/)
Could someone explain this to me? Any ideas/questions? I'm completely clueless...
EDIT (inspired by Crono): We're running this on two environments: dev and test and comparing the results so maybe some of the settings may differ.
EDIT 2: Values of ImportantFloatWithBigPrecision are from a really wide range (around -1,000,000 to + 1,000,000) but there are 'few' (propably in this scale around 25k-30k) rows which have values really close to 0 (first non-zero digit on 6-th place after separator, some start even further) both negative and positive.

Select optimization in Access

I'm in serious trouble, I've a huge subtle query that takes huge time to execute. Actually it freezes Access and sometimes I have to kill it the query looks like:
SELECT
ITEM.*,
ERA.*,
ORDR.*,
ITEM.COnTY1,
(SELECT TOP 1 New FROM MAPPING WHERE Old = ITEM.COnTY1) AS NewConTy1,
ITEM.COnValue1,
(SELECT TOP 1 KBETR FROM GEN_KUMV WHERE KNUMV = ERA.DOCCOND AND KSCHL = (SELECT TOP 1 New FROM MAPPING WHERE Old = ITEM.COnTY1)) AS NewCOnValue1
--... etc: this continues until ConTy40
FROM
GEN_ITEMS AS ITEM,
GEN_ORDERS AS ORDR,
GEN_ERASALES AS ERA
WHERE
ORDR.ORDER_NUM = ITEM.ORDER_NUM AND -- link between ITEM and ORDR
ERA.concat = ITEM.concat -- link between ERA and ITEM
I won't provide you with the tables schema since the query works, what I'd like to know is if there's a way to add the NewConTy1 and NewConValue1 using another technique to make it more efficient. The thing is that the Con* fields goes from 1 to 40 so I've to align them along (NewConTy1 next to ConTy1 with NewConValue1 next to new ConValue2... etc until 40).
ConTy# and ConTyValue# are in ITEMS (each in a field)
NewConty# and NewConValue# are in ERA (each in a record)
I really hope my explanation is enough to figure out my issue,
Looking forward to hearing from you guys
EDIT:
Ignore the TOP 1 in the SELECTS, it's because current dumps of data I have aren't accurate it's going to be removed later
EDIT 2:
Another thing my query returns up to 230 fields also lol
Thanks
Miloud
Have you considered a union query to normalize items?
SELECT "ConTy1" As CTName, Conty1 As CTVal,
"ConTyValue1" As CTVName, ConTyValue1" As CTVVal
FROM ITEMS
UNION ALL
SELECT "ConTy2" As CTName, Conty2 As CTVal,
"ConTyValue2" As CTVName, ConTyValue2" As CTVVal
FROM ITEMS
<...>
UNION ALL
SELECT "ConTy40" As CTName, Conty40 As CTVal,
"ConTyValue40" As CTVName, ConTyValue40" As CTVVal
FROM ITEMS
This can either be a separate query that links in to your main query, or a sub query of your main query, if that is more convenient. It should then be easy enough to draw in the relationship to the NewConty# and NewConValue# in ERA.
Remou's answer gives what you want - significantly different approach. It's been a while since I've meddled with MS Access query optimization, and had forgot about the details of its planner, but you might want to try a trivial suggestion to actually make your
WHERE conditions
into
INNER JOIN ON conditions
You are firing 40ish correlated subqueries so the above probably will not help (again Remou's answer takes significantly different approach and you might see real improvements there), but do let us know as it is trivial to test.
Another approach that you can take is to materialize expensive part and take Remou's idea but split it into different parts where you can join directly.
For example your first subquery is correlated on ITEM.COnTY1, your second is correlated on ERA.DOCCOND and ITEM.ConTY1.
If you classify your subqueries according to correlated keys then you can save them as queries (or materialize them as make table queries) and join on them (or the newly created tables), which should might perform much faster (and in the case of make tables will perform much faster, at the expense of materializing - so you'll have to run some queries before getting latest data - this can be encapsulated in a macro or VBA function/sub).
Otherwise (for example if you run the above query regularly as a part of your normal business use case) - redesign your DB.

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In general case the partial result cannot be produced. For example, if you have an aggregate function with GROUP BY clause, then all data should be analysed, before the 1st row is returned. LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give a concrete data and SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, if you have table create privelages and you have some mega-huge table X with key unique_id and some data data_value
If unique_id is numeric, in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out, and you can inner join your sample_table against the other tables to improve speed of testing / query results. Thanks to the sampling your query results should be roughly representative of what you will get. Note, the number you're modding with has to be prime otherwise it won't give a correct sample. The example above will shrink your table down to about 0.1% of the original size (.0987% to be exact).
Most databases also have better sampling and random number methods than just using mod. Check the documentaion to see what's available for your version.
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - such as might happen for queries with group by or order by or having clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted using "mysql-use-result" as an attribute of the database handler rather than the default "mysql-store-result". This is true for the Perl and Java interfaces: I think in the C interface, you have to use an unbuffered version of the function that executes the query.