I'm trying to optimize my queries by using ColdFusion's Query of Queries feature to access a cached query of about 45,000 words.
With the query below I had great success in speed after switching to QoQ:
<cfquery name="FindAnagrams" dbtype="query" >
SELECT AllWords.Word, AllWords.AnagramKey
FROM AllWords
WHERE AllWords.WordLength = #i#
</cfquery>
Executions went from ~400ms to ~15ms.
The query below, however, was only slightly reduced in execution time (from ~500ms to ~400ms):
<cfquery name="TopStartWith" dbtype="query" maxrows="15">
SELECT AllWords.Word
FROM AllWords
WHERE AllWords.Word LIKE <cfoutput>'#Word#%' </cfoutput>
AND AllWords.Word <> '#Word#'
ORDER BY AllWords.Frequency DESC;
</cfquery>
Removing maxrows did not really help. My database fields are indexed and I'm at the end of my knowledge of optimizing queries (can you index a column of a CF QoQ object?). I suspect it is the ORDER BY that is causing the delay, but I am unsure. How can I further improve the speed of such queries? Many thanks.
For optimizing the second query, there are a couple of approaches you could take.
Firstly, see if your database supports something like function-based indexes (an Oracle term, but the feature is available on other platforms). See this for a MySQL example: Is it possible to have function-based index in MySQL?
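As an illustration, a minimal Oracle-flavoured sketch against the backing database table (the table and column names are borrowed from the question; the index name is made up):
-- Index the first three characters of each word so that prefix lookups
-- can be resolved through the index:
CREATE INDEX idx_word_prefix ON AllWords (SUBSTR(Word, 1, 3));
-- A predicate written against the same expression can then use the index:
SELECT Word FROM AllWords WHERE SUBSTR(Word, 1, 3) = 'the';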
Secondly, you could pre-process your words into a structure which supports the query you're after. I'm assuming you're currently loading the query into application or session scope elsewhere. When you do that you could also process the words into a structure like:
{
'tha':['thames','that'],
'the':['them','then','there'],
//etc
}
Instead of running a QoQ, you get the first 3 letters of the word, look up the array, then iterate over it, finding matches. Essentially, it's pretty similar to what a function-based index does, but in code. You're trading memory for speed, but with only 45,000 words the structure isn't going to be enormous.
The LIKE clause is probably what causes the poor performance of your second query. You can see a similar performance penalty if you use LIKE in a regular database query: since LIKE performs a wildcard search against the entire string stored in the column, the engine can't just do an EQUALS comparison.
I'm a database noobie when it comes to even moderately large data sets. I have a SQL database (multiple SQL databases actually: a SQLite, a Postgres, and a MySQL database), all containing the same data, dumped from IMDB. I want to benchmark these different databases. The main table I want to query has about 15 million rows. I want a query that crosses two movies; right now it looks like this:
SELECT * from acted_in INNER JOIN actors
ON acted_in.idactors = actors.idactors WHERE
(acted_in.idmovies = %d OR acted_in.idmovies = %d)
The parameters are randomly generated ids. I want to test the relative speed of the databases by running this query multiple times for randomly generated movies and measuring the average time it takes. My question is: is there any better way to write the same query? I want to join who acted in what with their actor information, for either of the two movies, as this will be the core functionality of the project I am working on. Right now the speed is abysmal; the current average time for a single query is:
sqlite: 7.160171360969543
postgres: 8.263306670188904
mysql: 13.27652293920517
This is the average time per query (a sample of only 100 queries, but it is significant enough for now). So, can I do any better? The current run time is completely unacceptable for any practical use. I don't think the join takes much of the time: removing it gives nearly the same results, so I believe the lookup is what takes so long, since I don't gain a significant speed-up when I drop either the join or the OR conditional.
The thing you don't mention here is having any indexes in the databases. Generally, the way you speed up a query (except for terribly written ones, which this is not) is by adding indexes to the columns used in join or WHERE criteria. This slows down updates, since the indexes need to be updated any time the table is updated, but speeds up selections on those columns quite substantially. You may wish to consider adding indexes to any columns you use which are not already primary keys. Be sure to use the same index type in all databases to be fair.
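As a minimal sketch, using the table and column names from the question (the index names are made up) and standard syntax that works in SQLite, Postgres, and MySQL alike:
-- actors.idactors is presumably already the primary key, so the side
-- of the join that usually needs help is acted_in:
CREATE INDEX idx_acted_in_idmovies ON acted_in (idmovies);
CREATE INDEX idx_acted_in_idactors ON acted_in (idactors);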
First off, microbenchmarks on databases are pretty uninformative and it's not a good idea to base your decision on them. There are dozens of better criteria for selecting a db, like reliability, behavior under heavy loads, availability of certain features (eg extensions like PostGIS for Postgres, partitioning, ...), license (!!), and so on.
Second, if you want to tune your database, or database server, there's a number of things you have to consider. Some important ones:
db's like lots of memory and fast disks, so set up your server with ample quantities of both.
use the query analysis features offered by all major db's (eg the very visual explain feature in pgAdmin for Postgres) to analyze the behavior of the queries that are important for your use case, and adapt the db based on what you learn from those analyses (eg extra or different indexes); see the sketch after this list
study your db server until you understand it well; these are pretty sophisticated programs with lots of settings that influence their behavior and performance
make sure you understand the workload your db is subjected to, eg by using a tool like pgfouine for postgres, others exist for other brands of databases.
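For example, a minimal Postgres sketch of the query-analysis point, reusing the query from the question (the movie ids are placeholders); note that EXPLAIN ANALYZE actually executes the query:
EXPLAIN ANALYZE
SELECT * FROM acted_in
INNER JOIN actors ON acted_in.idactors = actors.idactors
WHERE acted_in.idmovies = 12345 OR acted_in.idmovies = 67890;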
I have a query that looks something like this:
select xmlelement("rootNode",
(case
when XH.ID is not null then
xmlelement("xhID", XH.ID)
else
xmlelement("xhID", xmlattributes('true' AS "xsi:nil"), XH.ID)
end),
(case
when XH.SER_NUM is not null then
xmlelement("serialNumber", XH.SER_NUM)
else
xmlelement("serialNumber", xmlattributes('true' AS "xsi:nil"), XH.SER_NUM)
end),
/*repeat this pattern for many more columns from the same table...*/
)
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
It's ugly and I don't like it, and it is also the slowest-executing query (there are others of similar form, but much smaller, and they aren't causing any major problems - yet). Maintenance is relatively easy, as this is mostly a generated query, but my concern now is performance. I am wondering how much overhead there is for all of these case expressions.
To see if there was any difference, I wrote another version of this query as:
select xmlelement("rootNode",
xmlforest(XH.ID, XH.SER_NUM,...
(I know that this query does not produce exactly the same thing; my plan was to move the logic for handling the renaming and the xsi:nil attribute into XSL, or maybe into PL/SQL.)
I tried to get execution plans for both versions, but they are the same. I'm guessing that the logic does not get factored into the execution plan. My gut tells me the second version should execute faster, but I'd like some way to prove that (other than writing a PL/SQL test function with timing statements before and after the query and running that code over and over again to get a test sample).
Is it possible to get a good idea of how much the case-when will cost?
Also, I could write the case expressions using the DECODE function instead. Would that perform better than CASE?
Just about anything in your SELECT list, unless it is a user-defined function which reads a table or view, or a nested subselect, can usually be neglected for the purpose of analyzing your query's performance.
Open your connection properties and turn SET STATISTICS IO on. Check how many reads are happening. View the query plan. Are your indexes being used properly? Do you know how to analyze the plan to see?
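A minimal sketch of that, assuming a SQL Server session (SET STATISTICS IO is SQL Server syntax):
SET STATISTICS IO ON;
-- run the query, then check the Messages tab for logical/physical reads
SELECT * FROM XH WHERE XH.ID = 'SOMETHINGOROTHER';
SET STATISTICS IO OFF;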
For the purposes of performance tuning you are dealing with this statement:
SELECT *
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
How does that query perform? If it returns in markedly less time than the XML version then you need to consider the performance of the functions, but I would be astonished if that were the case (oh ho!).
Does this return one row or several? If one row then you have only two things to work with:
is XH.ID indexed and, if so, is the index being used?
does the "many more columns from the same table" indicate a problem with chained rows?
If the query returns several rows then ... Well, actually you have the same two things to work with. It's just the emphasis is different with regards to indexes. If the index has a very poor clustering factor then it could be faster to avoid using the index in favour of a full table scan.
Beyond that you would need to look at physical problems - I/O bottlenecks, poor interconnects, a dodgy disk. The reason your scope for tuning this query is so restricted is that, as presented, it is a single-table, single-column read. Most tuning is about efficient joining. Now, if XH turns out to be a view over a complex query, then it is a different matter.
You can use good old tkprof to analyze statistics; one of the many forms of ALTER SESSION turns on stats gathering. The DBMS_PROFILER package also gathers statistics if your cursor is in a PL/SQL code block.
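A minimal sketch of the tracing route, assuming a SQL*Plus session (the trace file name is made up and will differ on your system):
ALTER SESSION SET SQL_TRACE = TRUE;
-- run the XML query here, then:
ALTER SESSION SET SQL_TRACE = FALSE;
-- afterwards, format the raw trace file with tkprof from the OS shell:
-- tkprof orcl_ora_12345.trc xh_report.txt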
I’ve just found out that the execution plan performance between the following two select statements are massively different:
select * from your_large_table
where LEFT(some_string_field, 4) = '2505'
select * from your_large_table
where some_string_field like '2505%'
The relative costs in the execution plan are 98% and 2% respectively. Bit of a difference in speed, then. I was actually shocked when I saw it.
I've always done LEFT(xxx) = 'yyy' as it reads well.
I actually found this out by checking the LINQ-generated SQL against my hand-crafted SQL. I assumed the LIKE command would be slower, but it is in fact much, much faster.
My question is: why is LEFT() slower than LIKE '2505%'? They are, after all, logically identical?
Also, is there a CPU hit by using LEFT()?
More generally speaking, you should never apply a function to the column side of a comparison in a WHERE clause. If you do, SQL won't use an index - it has to evaluate the function for every row of the table. The goal is to make sure that your WHERE clause is "sargable".
Some other examples:
Bad: Select ... WHERE isNull(FullName,'') = 'Ed Jones'
Fixed: Select ... WHERE ((FullName = 'Ed Jones') OR (FullName IS NULL))
Bad: Select ... WHERE SUBSTRING(DealerName,1,4) = 'Ford'
Fixed: Select ... WHERE DealerName Like 'Ford%'
Bad: Select ... WHERE DateDiff(mm,OrderDate,GetDate()) >= 30
Fixed: Select ... WHERE OrderDate < DateAdd(mm,-30,GetDate())
Bad: Select ... WHERE Year(OrderDate) = 2003
Fixed: Select ... WHERE OrderDate >= '2003-1-1' AND OrderDate < '2004-1-1'
It looks like the expression LEFT(some_string_field, 4) is evaluated for every row of a full table scan, while the "like" expression will use the index.
Optimizing "like" to use an index if it is a front-anchored pattern is a much easier optimization than analyzing arbitrary expressions involving string functions.
There's a huge impact from using function calls in WHERE clauses, as SQL Server must calculate the result for each row. LIKE, on the other hand, is a built-in language feature which is highly optimized.
If you use a function on an indexed column, then the db no longer uses the index (at least with Oracle, anyway).
So I am guessing that your example field some_string_field has an index on it which doesn't get used for the query with LEFT().
Why do you say they are identical? They might solve the same problem, but their approach is different. At least it seems like that...
The query using LEFT optimizes the test, since it already knows the length of the prefix and so on, so in a C/C++/... program, or without an index, an algorithm using LEFT to implement a certain LIKE behavior would be the fastest. But in contrast to most non-declarative languages, on a SQL database a lot of optimizations are done for you. For example, LIKE is probably implemented by first looking for the % sign, and if the % is noticed to be the last char in the string, the query can be optimized in much the same way as you did using LEFT, but directly using an index.
So, indeed, I think you were right after all: they probably are identical in their approach. The only difference is that the db server can use an index for the query using LIKE, because there is no function transforming the column value into something unknown in the WHERE clause.
What happened here is that either the RDBMS is not capable of using an index on the LEFT() predicate while it is capable of using one on the LIKE, or it simply made the wrong call about which would be the more appropriate access method.
Firstly, it may be true for some RDBMSs that applying a function to a column prevents an index-based access method from being used, but that is not a universal truth, nor is there any logical reason why it needs to be. An index-based access method (such as Oracle's full index scan or fast full index scan) might be beneficial but in some cases the RDBMS is not capable of the operation in the context of a function-based predicate.
Secondly, the optimiser may simply get the arithmetic wrong in estimating the benefits of the different available access methods. Assuming that the system can perform an index-based access method, it first has to estimate the number of rows that will match the predicate, either from statistics on the table, statistics on the column, by sampling the data at parse time, or by using a heuristic rule (eg. "assume 5% of rows will match"). Then it has to assess the relative costs of a full table scan and the available index-based methods. Sometimes it will get the arithmetic wrong, sometimes the statistics will be misleading or inaccurate, and sometimes the heuristic rules will not be appropriate for the data set.
The key point is to be aware of a number of issues:
What operations can your RDBMS support?
What would be the most appropriate operation in the case you are working with?
Is the system's choice correct?
What can be done to either allow the system to perform a more efficient operation (eg. add a missing not null constraint, update the statistics; see the sketch after this list)?
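On that last point, a couple of hedged examples of refreshing statistics (the object names are placeholders taken from the question):
-- SQL Server:
UPDATE STATISTICS your_large_table;
-- Oracle equivalent:
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'YOUR_LARGE_TABLE');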
In my experience this is not a trivial task, and is often best left to experts. Or on the other hand, just post the problem to Stackoverflow -- some of us find this stuff fascinating, dog help us.
As @BradC mentioned, you shouldn't use functions in a WHERE clause if you have indexes and want to take advantage of them.
If you read the section entitled "Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present" from these SQL Performance Tips, there are more examples.
It also hints at questions you'll encounter on the MCSE SQL Server 2012 exams if you're interested in taking those too. :-)
My system does some pretty heavy processing, and I've been attacking the performance in order to give me the ability to run more test runs in shorter times.
I have quite a few cases where a UDF has to get called on say, 5 million rows (and I pretty much thought there was no way around it).
Well, it turns out there is a way to work around it, and it gives huge performance improvements when UDFs are called over a set of distinct parameters somewhat smaller than the total set of rows.
Consider a UDF that takes a set of inputs and returns a result based on complex logic, but where, for the set of inputs over 5m rows, there are only, say, 100,000 distinct inputs, so it will only produce 100,000 distinct result tuples (my particular cases vary from interest rates to complex code assignments, but they are all discrete - the fundamental point of this technique is that you can determine whether the trick will work simply by running a SELECT DISTINCT).
I found that by doing something like this:
INSERT INTO PreCalcs
SELECT param1
,param2
,dbo.udf_result(param1, param2) AS result
FROM (
SELECT DISTINCT param1, param2 FROM big_table
) AS distinct_params -- SQL Server requires an alias on the derived table
When PreCalcs is suitably indexed, the combination of that with:
SELECT big_table.param1
,big_table.param2
,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
ON PreCalcs.param1 = big_table.param1
AND PreCalcs.param2 = big_table.param2
You get a HUGE boost in performance. Apparently, just because something is deterministic doesn't mean SQL Server caches the past calls and re-uses them, as one might think.
The only thing you have to watch out for is where NULLs are allowed - then you need to fix up your joins carefully:
SELECT big_table.param1
,big_table.param2
,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
ON (
PreCalcs.param1 = big_table.param1
OR COALESCE(PreCalcs.param1, big_table.param1) IS NULL
)
AND (
PreCalcs.param2 = big_table.param2
OR COALESCE(PreCalcs.param2, big_table.param2) IS NULL
)
Hope this helps and any similar tricks with UDFs, or refactoring queries for performance are welcome.
I guess the question is, why is manual caching like this necessary - isn't that the point of the server knowing that the function is deterministic? And if it makes such a big difference, and if UDFs are so expensive, why doesn't the optimizer just do it in the execution plan?
Yes - the optimizer will not memoize UDF calls for you. Your trick is very nice in the cases where you can collapse the output set down this way.
Another technique that can improve performance if your UDF's parameters are indices into other tables, and the UDF selects values from those tables to calculate the scalar result, is to rewrite your scalar UDF as a table-valued UDF that selects the result value over all your potential parameters.
I've used this approach when the tables the UDF query was based on were subject to a lot of inserts and updates, the involved query was relatively complex, and the number of rows the original UDF had to be applied to was large. You can achieve a great improvement in performance in this case, as the table-valued UDF only needs to be run once and can run as an optimized set-oriented query.
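A sketch of what that rewrite can look like in T-SQL. Everything here apart from big_table and the param columns is made up: the dbo.rates table and the arithmetic merely stand in for whatever lookups and logic the original scalar UDF performed:
-- Hypothetical inline table-valued function: compute the result for every
-- distinct parameter combination in one set-oriented pass.
CREATE FUNCTION dbo.tvf_results ()
RETURNS TABLE
AS
RETURN
(
SELECT p.param1,
p.param2,
r.rate * p.param2 AS result -- stand-in for the scalar UDF's logic
FROM (SELECT DISTINCT param1, param2 FROM big_table) AS p
JOIN dbo.rates AS r ON r.id = p.param1
);
GO
-- Join it once instead of invoking a scalar UDF per row:
SELECT b.param1, b.param2, f.result
FROM big_table AS b
JOIN dbo.tvf_results() AS f
ON f.param1 = b.param1
AND f.param2 = b.param2;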
How would SQL Server know that you have 100,000 discrete combinations within 5 million rows?
By using the PreCalcs table, you are simply running the udf over 100k rows rather than 5 million rows, before expanding back out again.
No optimiser in existence would be able to divine this useful information.
The scalar udf is a black box.
For a more practical solution, I'd use a computed, persisted column that does the udf call.
That way it's available in all queries and can be indexed/included.
This suits OLTP more, maybe... I query a table to get trading cash and positions in real time in many different ways, so this approach suits me, avoiding the udf math overhead every time.
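A minimal sketch of that computed-column approach, reusing big_table and dbo.udf_result from the earlier answer (note that SQL Server only allows PERSISTED here if the UDF is deterministic and created WITH SCHEMABINDING; the index name is made up):
ALTER TABLE big_table
ADD result AS dbo.udf_result(param1, param2) PERSISTED;
-- The persisted column can then be indexed like any ordinary column:
CREATE INDEX idx_big_table_result ON big_table (result);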