SQL Server: Performance of searching for hex strings in large tables (using LIKE, Full-Text Search, etc.)

I have a table with 40+ million rows in MS SQL Server 2019. One of the columns stores pure hexadecimal strings (representing both binary and readable ASCII content). I need to search this table for rows containing a specific hex string.
Normally, I would do this:
SELECT * FROM transactionoutputs WHERE outhex LIKE '%74657374%' ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;
Since the results are paginated, it can take less than a second to find the first 10 results. However, when increasing the offset, or searching for strings that only appear 1-2 times in the entire table, it can take more than a minute, at which point my application will time out.
The execution plan for this query is this:
Are there any easy ways to improve the performance of such a search?
Using this answer, I was able to reduce the query time from 33 seconds to 27 seconds:
SELECT * FROM transactionoutputs WHERE
CHARINDEX('74657374' collate Latin1_General_BIN, outhex collate Latin1_General_BIN) > 0
ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;
When I leave out the ORDER BY and pagination, I can reduce this to 19 seconds, but that is not ideal because I need both the ordering and the pagination. Either way, it still has to scan the entire table.
I have tried the following:
Creating an index on that column. This had no noticeable effect.
I came across this article about slow queries. Initially, I was using parameterized queries in my application, which was much slower than running them in SSMS. I have since moved to the query shown above, but it is still slow.
I tried to enable Multiple Active Result Sets (MARS), but without any improvement in query time.
I also tried using Full-Text Search. This seemed to be the most promising solution, as text search is exactly what I need. I created a full-text index and can run a similar query to the one above, but using the index:
SELECT * FROM transactionoutputs WHERE CONTAINS(outhex,'7465') ORDER BY id desc OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;
This returns results almost instantly. However, when the search string is longer than a few characters (often 4), it doesn't return anything. Am I doing something wrong, or why is it behaving that way?
The execution plan:
My understanding is that my case is not the ideal use case for FTS, as it is designed to search readable text rather than hexadecimal strings. Is it possible to use it anyway, and if so, how?
After reading dozens of articles and SO posts, I cannot confidently say I know how to improve the performance of such queries for my specific use case, if it is even possible at all. So, is there any easy option to improve this?

First, kudos for the fantastic explanation of your problem. This helps you get better answers fast. You should also include DDL, including indexes, when possible. This will become clear as I answer your question.
I'm going to tackle a couple of issues with your query which are unrelated to how you're parsing your text now, and I'll talk about how to handle the string problem later tonight.
Answer Part 1: Unrelated to string parsing
It's quite possible that the way you are searching through the string is the main performance problem. Let's start with the SELECT * - do you absolutely need all the columns? Specifically, do you absolutely need all the columns that are included in that Key lookup? This is the most important thing to sort out. Let me explain.
Your query is performing a scan against your nonclustered index named outhex-index, then performs a key lookup to retrieve the columns not covered by outhex-index. Key lookups destroy performance, especially against a 40,000,000-row table, because each one is a separate trip back to the clustered index.
If you do need those columns, then you should consider adding them as included columns to your outhex-index index. I say consider because I don't know how many columns there are nor their data types. Included columns speed up queries by eliminating costly key lookups, but they slow down data modification, sometimes dramatically depending on the number and type of indexes. If you need columns not included in outhex-index and they are big columns (MAX/BLOB/LOB data types, XML, etc.), then a covering index is NOT an option. If you don't need them, then you should refactor your SELECT * statement to only include the columns you need.
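If you do go the covering-index route, a rough sketch would be something along these lines (the INCLUDE list is hypothetical; it should contain only the columns your SELECT actually returns):
CREATE NONCLUSTERED INDEX [outhex-index]
ON dbo.transactionoutputs (outhex)
INCLUDE (ColumnA, ColumnB) -- hypothetical names: the non-key columns your query needs
WITH (DROP_EXISTING = ON); -- rebuilds the existing outhex-index in place with the included columns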
Full-text indexing is not an option here unless you can find a way to lose that sort. Sorting has N log N complexity, which means the sort gets more expensive the more rows you sort. A 40-million-row sort should be avoided whenever possible. This will be hard to avoid with full-text indexing, for reasons which would take more time to explain than I have. Adding or modifying a 40-million-row index can be expensive and take a lot of time. If you do go that route, I suggest taking an offline copy of that table to time how long the index takes to build. You can also consider creating a filtered index, if possible, to reduce your search area.
I noticed, too, that both queries are getting serial execution plans. I don't know if a parallel plan will help the first query with the key lookup, but I know that it will likely help with the second one, as there is a sort involved. Parallel execution plans can really speed up sorts. Consider testing your query with OPTION (QUERYTRACEON 8649) or make_parallel() by Adam Machanic.
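For example, to test a parallel plan on your CHARINDEX version (a sketch only; 8649 is an undocumented trace flag, so treat it as a test tool rather than something to ship):
SELECT * FROM transactionoutputs
WHERE CHARINDEX('74657374' collate Latin1_General_BIN, outhex collate Latin1_General_BIN) > 0
ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
OPTION (QUERYTRACEON 8649);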
I'll update this post tonight with some ideas to parse your string faster. One thing you could look into in the meantime is Paul White's clever Trigram Wildcard String Search trick which might be an option.


Oracle SQL: What is the best way to select a subset of a very large table

I have been roaming these forums for a few years and I've always found my questions had already been asked, and a fitting answer was already present.
I have a pretty generic (and maybe easy) question now though, but I haven't been able to find a thread asking the same one yet.
The situation:
I have a payment table with 10-50M records per day, a history of 10 days and hundreds of columns. About 10-20 columns are indexed. One of the indices is batch_id.
I have a batch table with considerably fewer records and columns, say 10k a day and 30 columns.
If I want to select all payments from one specific sender, I could just do this:
Select * from payments p
where p.sender_id = 'SenderA'
This runs for a while, even though sender_id is also indexed. So I figure it's better to select the batches first, then go into the payments table with the batch_id:
select * from payments p
where p.batch_id in
(select b.batch_id from batches b where b.sender_id = 'SenderA')
--and p.sender_id = 'SenderA'
Now, my questions are:
In the second script, should I uncomment the Sender_id in my where clause on the payments table? It doesn't feel very efficient to filter on sender_id twice, even though it's in different tables.
Is it better if I make it an inner join instead of a nested query?
Is it better if I make it a common table expression instead of a nested query or inner join?
I suppose it could all fit into one question: What is the best way to query this?
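For reference, the join and CTE versions I have in mind would look roughly like this (same tables and columns as above):
select p.*
from payments p
join batches b on b.batch_id = p.batch_id
where b.sender_id = 'SenderA';
with sender_batches as (select batch_id from batches where sender_id = 'SenderA')
select p.*
from payments p
join sender_batches sb on sb.batch_id = p.batch_id;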
In the worst case the two queries should run in the same time and in the best case I would expect the first query to run quicker. If it is running slower, there is some problem elsewhere. You don't need the additional condition in the second query.
The first query will retrieve index entries for a single value, so that is going to access less blocks than the second query which has to find index entries for multiple batches (as well as executing the subquery, but that is probably not significant).
But the danger as always with Oracle is that there are a lot of factors determining which query plan the optimizer chooses. I would immediately verify that the statistics on your indexed columns are up-to-date. If they are not, this might be your problem and you don't need to read any further.
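For example, you can see when the tables were last analyzed and refresh the statistics with something like this (a sketch; MYSCHEMA is a placeholder for the actual owner):
select table_name, last_analyzed from all_tables where owner = 'MYSCHEMA' and table_name in ('PAYMENTS', 'BATCHES');
exec dbms_stats.gather_table_stats(ownname => 'MYSCHEMA', tabname => 'PAYMENTS', cascade => TRUE); -- cascade => TRUE also gathers index statistics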
The next step is to obtain a query execution plan. My guess is that this will tell you that your query is running a full-table-scan.
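For example, with the standard Oracle tooling (using the first query as a sketch):
explain plan for
select * from payments p where p.sender_id = 'SenderA';
select * from table(dbms_xplan.display);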
Whether or not Oracle chooses to perform a full table scan on a query such as this depends on the number of rows returned and on whether Oracle thinks it is more efficient to use the index or to simply read the whole table. The threshold for flipping between the two is not a fixed number: it depends on a lot of things, one of them being a parameter called DB_FILE_MULTIBLOCK_READ_COUNT.
This is set up by Oracle, and in theory it should be configured such that the transition between indexed and full-table-scan queries is smooth. In other words, at the transition point where your query is returning enough rows to just about make a full table scan more efficient, the index scan and the table scan should take roughly the same time.
Unfortunately, I have seen systems where this is way out and Oracle flips to doing full table scans far too quickly, resulting in a long query time once the number of rows gets over a certain threshold.
As I said before, first check your statistics. If that doesn't work, get a QEP and start tuning your Oracle instance.
Tuning Oracle is a very complex subject that can't be answered in full here, so I am forced to recommend links. Here is a useful page on the parameter (reducing it might help): Why Change the Oracle DB_FILE_MULTIBLOCK_READ_COUNT.
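For example, you can inspect the current value from SQL*Plus and, with DBA privileges, adjust it (the value 16 is purely illustrative; test any instance parameter change carefully):
show parameter db_file_multiblock_read_count
alter system set db_file_multiblock_read_count = 16 scope = both;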
Other than that, the general Oracle performance tuning guide is here: (Oracle) Configuring a Database for Performance.
If you are still having problems, you need to progress your investigation further and then come up with a more specific question.
EDIT:
Based on your comment, your query is returning 4M rows out of the 10M-50M in the table. If it is 4M out of 10M, there is no way an index will be of any use. Even with 4M out of 50M, it is still pretty certain that a full table scan would be the most efficient approach.
You say that you have a lot of columns, so probably this 4M row fetch is returning a huge amount of data.
You could perhaps consider splitting off some of the columns that are not required and putting them into a child table. In particular, if you have columns containing a lot of data (e.g., some text comments or whatever) they might be better being kept outside the main table.
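A rough sketch of that kind of split (all names here are illustrative; the wide, rarely used columns move into a child table keyed by the payment's primary key):
create table payment_details (
  payment_id   number primary key references payments(payment_id),
  comment_text clob,
  raw_message  clob
);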
Remember - small is fast, not only in terms of number of rows, but also in terms of the size of each row.
SQL is a declarative language. This means that you specify what you want, not how to get it.
Check your indexes, both primary and "normal" ones...

Does using the TOP X * format in SQL speed up queries significantly?

So lately when I run queries on huge tables I'll use the top 10 * notation, like so:
select top 10 * from BI_Sessions (nolock)
where SessionSID like 'b6d%'
and CreateDate between '03-15-2012' AND '05-18-2012'
I thought that it lets the query run faster, but it doesn't seem so; this one took 4 minutes (or is that an OK time)?
I guess I'm curious about whether the TOP functionality happens after it pulls all the data anyway (which would seem inefficient).
thanks
It entirely depends on the query, with the exception of "TOP 0". "TOP 0" does return much faster.
In your case, the query has to look through the rows in a huge table to find rows that match the WHERE clause. If no rows are found, the number of rows being returned doesn't help. If the rows are at the end of the table scan, then the number of rows being returned doesn't help.
There are certain cases with more complicated queries where the "top" could affect performance. There is a difference between optimizing overall and for the first row returned. I'm not sure if SQL Server's optimizer recognizes this difference.
Well, it depends. If you do not have a covering index on BI_Sessions and it's a large database, then the answer is probably yes: it has to do the work of finding the matching rows before the TOP 10 can be applied. A good covering index may be something like: CreateDate, SessionSID, and all the columns you actually need to return. If you do have a covering index, then SQL will not even read the table; it will get all the data it needs from the covering index. If you also specify only the columns you actually need to return, the 10 rows should come back in a fraction of a second.
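For example, a covering index along those lines might look like this (a sketch; the INCLUDE list is hypothetical and should contain only the columns you actually return):
CREATE NONCLUSTERED INDEX IX_BI_Sessions_CreateDate_SessionSID
ON dbo.BI_Sessions (CreateDate, SessionSID)
INCLUDE (ColumnA, ColumnB); -- hypothetical: the columns your query selects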
for more useful info
http://www.mssqltips.com/sqlservertip/1078/improve-sql-server-performance-with-covering-index-enhancements/
and a bit more technical:
http://www.simple-talk.com/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/
also
http://www.sqlserverinternals.com/
and
http://www.insidesqlserver.com/thebooks.html

Why I would bother using full text search?

I am new to Full-Text Search. I used the following query:
Select * From Students Where FullName LIKE '%abc%'
The Students table contains a million records, all random, which look like this:
'QZAQHIEK VABCNLRM KFFZJYUU'
It took only 2 seconds and returned 1100 rows. If a million records can be searched in two seconds, why would I bother using Full-Text Search? Did the LIKE predicate use the full-text index as well?
No. LIKE does not make use of full text indexing. See here.
Computers are pretty darn fast these days but if you're seeing search results faster than you expect it's possible that you simply got back a cached result set because you executed the same query previously. To be sure you're not getting cached results you could use DBCC DROPCLEANBUFFERS. Take a look at this post for some SQL Server cache clearing options.
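For example (only on a test server, since this flushes the buffer cache for the whole instance):
CHECKPOINT;
DBCC DROPCLEANBUFFERS;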
Excerpt from the linked page:
Comparing LIKE to Full-Text Search
In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.
I think you have answered your own question, at least to your own satisfaction. If your prototyping produces results in an acceptable amount of time, and you are certain that caching does not explain the quick response (per Paul Sasik), by all means skip the overhead of full-text indexing and proceed with LIKE.
You might be interested in full-text search if you care about ranking your result set or about lexical stemming.
No, in fact your example query can't even take advantage of a regular index to speed things up because it doesn't know the first letters of any potential matches.
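To illustrate, assuming an ordinary index on FullName:
Select * From Students Where FullName LIKE 'QZAQ%'  -- known prefix, the index can be used to seek
Select * From Students Where FullName LIKE '%abc%'  -- leading wildcard, every row must be checked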
In general, full-text search will be faster here; a LIKE with a leading wildcard is considerably slower because it cannot use a regular index.

Lucene.Net memory consumption and slow search when too many clauses used

I have a DB holding text file attributes and text file primary key IDs, and I have indexed around 1 million text files along with their IDs (the primary keys in the DB).
Now, I am searching at two levels.
The first is a straightforward DB search, where I get primary keys as the result (roughly 2 or 3 million IDs).
Then I make a Boolean query, for instance like the following,
+Text:"test*" +(pkID:1 pkID:4 pkID:100 pkID:115 pkID:1041 .... )
and search it in my Index file.
The problem is that such a query (having 2 million clauses) takes far too much time to return results and consumes far too much memory.
Is there any optimization solution for this problem ?
Assuming you can reuse the pkID part of your queries:
Split the query into two parts: one part (the text query) will become the query and the other part (the pkID query) will become the filter
Make both parts into queries
Convert the pkid query to a filter (by using QueryWrapperFilter)
Convert the filter into a cached filter (using CachingWrapperFilter)
Hang onto the filter, perhaps via some kind of dictionary
Next time you do a search, use the overload that allows you to use a query and filter
As long as the pkID search can be reused, you should see quite a large improvement. As long as you don't optimise your index, the effect of caching should even work across commit points (I understand the bit sets are calculated on a per-segment basis).
HTH
p.s.
I think it would be remiss of me not to note that I think you're putting your index through all sorts of abuse by using it like this!
The best optimization is NOT to use the query with 2 million clauses. Any Lucene query with 2 million clauses will run slowly no matter how you optimize it.
In your particular case, I think it will be much more practical to search your index first with the +Text:"test*" query and then limit the results by running a DB query against the Lucene hits.

When do sql optimizations become overkill?

I'm updating tables with millions of records and I need to be as efficient as possible. Is there a point at which adding more criteria to the where clause will actually hurt rather than help?
For example, if I know I want to set a column to 3, I could use this query:
update mytable set col = 3
Or I could update the record only if it's different
update mytable set col = 3 where col <> 3
I could also filter it so it only updates records added since the last time I ran this process
update mytable set col = 3 where col <> 3 and createDate > #lastRunDate
And perhaps I could look for more things in additional columns.
I guess my question is if there is a point where the cost of looking at additional columns outweighs the cost of the update itself and if there's a principle you can use to determine where to draw the line.
Update
So here's the principle I'm trying to piece together based on what was said. Feel free to argue with this and I'll update it accordingly:
If there are no indexed columns to filter on, add as many criteria as possible to limit the records being updated, since a full table scan is going to happen anyway.
If the difference in records between filtering on only indexed columns and filtering on all possible columns is marginal, only use the indexed columns and avoid the full table scan.
If you have a mix of indexed and non-indexed columns, definitely use the indexed columns if you can, and only use non-indexed columns if... [[I'm still struggling with this part. What's the threshold for introducing the non-indexed columns in the where clause?]]
Update #2
Sounds like I have my answer.
If you have an index on "col", then running your first query will update millions of rows regardless; your second query would potentially only update a few and find those quickly if there's an index available. If you don't have an index on that column, the effect will be marginal since a full table or index scan must occur to check all rows in your table (you'll just have fewer actual updates, but that's it).
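A minimal sketch of that, using the table and column names from the question (the index name is arbitrary):
create index ix_mytable_col on mytable (col);
update mytable set col = 3 where col <> 3;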
The whole point of restricting your queries using WHERE clauses is to reduce the scope of your query, i.e. the number of rows SQL Server has to look at. Less data to process is always faster than doing the work against all of the millions of rows.
In response to your update: the main goal of using a WHERE clause is to reduce the number of rows you need to inspect / touch. If you have a means (typically an index) to reduce that number from 100% to a few percent, then it's definitely worth it. That's the whole point of having indices (mostly for SELECTs, but applies to other operations, too, of course).
If you have a suitable index, and thus you can pluck out a few hundred rows to check against a criteria versus having to inspect millions of rows, you'll always be faster. If you have a good book index in a bookstore that guides you easily to the two shelves where the books that interest you are located, you'll find what you're looking for more quickly than when you have to criss-cross the whole bookstore since there's no index available.
There obviously is a point where yet another criteria or index doesn't help anymore. If that's the case, typically yet another WHERE clause won't really help much - or at all. But in this case, the SQL query optimizer will find those cases and filter them out (possibly even just ignoring them when deciding on what the best query execution plan is).
This really comes down to index usage and query optimization. I would suggest looking at the query plan before making any decisions.
Adding indexed fields to the where clause will often improve query time, however, adding non-indexed fields can result in table scans which will slow your query.
My suggestion is to write a query that works, look at the execution time, and work to reduce it to an acceptable level by looking at the query plan. Don't over-optimize; go for the acceptable solution.