Why is my SQL query using CONTAINS with numbers taking up to 2 minutes - sql

I have a table named Locations with a FullText Index on all columns. There's one PK Column (INT) and the rest are TEXT/VARCHAR. This table has 300,000 records.
The following query is taking 2 minutes to return one record.
SELECT TOP 1 * FROM Locations WHERE CONTAINS(*, '"1*"') ORDER BY LocationID
This slow query time is consistent when using any combination of numbers from 1 to 3 digits in length.
Searches using letters (a-zA-Z) perform normally, with a sub-25-millisecond response time.
Any idea why the numeric values are causing such a performance hit?

I suspect it is a combination of 2 causes.
Cause 1: Wildcard searches on common prefixes are slow. Do the records contain a lot of strings (numeric or alphanumeric) that begin with "1"? If so, that might explain the poor performance.
Wildcard searches tend to be slower than other full text searches. The more terms there are that contain the prefix ("1" in your case), the more work the full text engine has to do.
Although 300,000 records is not a lot of records for the full text engine to handle, factors like the number of unique terms in each record and the number of records and columns in which each of those terms is found will contribute even more to the search performance.
Cause 2: Missing index on ORDER BY columns. You should make sure the LocationID column is indexed since that is how you're sorting the results. It is possible that "1*" is generating a lot of results, all of which need to be sorted. If there is no index, the sort could take a long time.
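Both suspected causes can be checked directly. The following is a minimal T-SQL sketch, assuming the table lives in the dbo schema (the question does not say). It counts how many rows the prefix term matches, counts how many distinct indexed terms begin with "1", and lists the existing indexes so you can confirm that LocationID is covered.

```sql
-- Cause 1: how many rows does the prefix search match?
-- Every matching row is a candidate for the ORDER BY / TOP 1 step.
SELECT COUNT(*) AS matching_rows
FROM CONTAINSTABLE(Locations, *, '"1*"');

-- Cause 1: how many distinct full-text terms start with "1"?
-- (dbo is an assumed schema name.)
SELECT COUNT(DISTINCT display_term) AS terms_starting_with_1
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('dbo.Locations'))
WHERE display_term LIKE '1%';

-- Cause 2: list the indexes on the table to confirm LocationID is indexed.
EXEC sp_helpindex 'dbo.Locations';
```

Running the first two queries again with a letter prefix such as "a*" gives a baseline for comparison with the fast case.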

Related

What is a better schema for indexing: a combined varchar column or several integer columns?

I want to make my table schema better. This table will receive one insert per microsecond.
The table is already too big, so I could not test on the table itself.
Current setup (columns id, name, one, two, three):
SELECT *
FROM table
WHERE name = 'foo'
AND one = 1
AND two = 2
AND three = 3;
Maybe in the future (columns id, name, path):
SELECT *
FROM table
WHERE
name = 'foo'
AND path = '1/2/3';
If I change three integer columns to one varchar column, will the SQL run faster than now?
Using PostgreSQL
The varchar length will be 5 to 12.
I think I can use bigint with zerofill (1/2/3 to 1000010200003) which may be faster than varchar.
Premature optimization is the root of all evil.
If you have a fixed number of integers, or at least a reasonable upper limit, stick with having an individual column for each.
You would then use a combined index over all columns, ideally with the non-nullable and selective columns first.
If you want to optimize, use smallint which only takes up two bytes.
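A minimal PostgreSQL sketch of that layout follows. The table name items and the index name are hypothetical (the question only calls the table "table", which is a reserved word), and the column order in the index assumes name is the most selective column.

```sql
-- Hypothetical table: one smallint column per path component.
CREATE TABLE items (
    id    bigserial PRIMARY KEY,
    name  text      NOT NULL,
    one   smallint  NOT NULL,
    two   smallint  NOT NULL,
    three smallint  NOT NULL
);

-- Combined index over all columns used in the WHERE clause.
CREATE INDEX items_name_one_two_three_idx
    ON items (name, one, two, three);

-- Inspect the plan to confirm the index is chosen for the equality lookup.
EXPLAIN
SELECT *
FROM items
WHERE name = 'foo'
  AND one = 1
  AND two = 2
  AND three = 3;
```

On a small or empty table the planner may still choose a sequential scan, so the plan is best checked against realistic data volumes.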
If I change three integer columns to one varchar column, will the SQL run faster than now?
Not noticeably so. You might produce some small impacts on performance, balancing things such as:
Are the string columns bigger or smaller than the integer keys (resulting in marginally bigger or smaller data pages and indexes)?
Is an index on two variable-length strings less efficient than an index on one variable-length string and three fixed-length keys?
Do the results match what you need or is additional processing needed after you fetch a record?
In either case the available index is going to be used to find the row(s) that match the conditions. This is an index seek, because the comparisons are all equality. Postgres will then go directly to the rows you need. There is a lot of work going on beyond just the index comparisons.
You are describing 1,000,000 inserts per second, or about 86 billion inserts each day -- that is a lot. Under such circumstances, you are not using an off-the-shelf instance of Postgres running on your laptop. You should have proper DBA support to answer a question like this.

Finding the "next 25 rows" in Oracle SQL based on an indexed column

I have a large table (~200M rows) that is indexed on a numeric column, Z. There is also an index on the key column, K.
K Z
= ==========================================
1 0.6508784068583483336644518457703156855132
2 0.4078768075307567089075462518978907890789
3 0.5365440453204830852096396398565048002638
4 0.7573281573257782352853823856682368153782
What I need to be able to do is find the 25 records "surrounding" a given record. For instance, the "next" record starting at K=3 would be K=1, followed by K=4.
I have been led to believe by several sources (most notably this paper from some folks at Florida State University) that SQL like the following should work. It's not hard to imagine that scanning along the indexed column in ascending or descending order would be efficient.
select * from (
select *
from T
where Z >= [origin's Z value]
order by Z asc
) where rownum <= 25;
In theory, this should find the 25 "next" rows, and a similar variation would find the 25 "previous" rows. However, this can take minutes and the explain plan consistently contains a full table scan. A full table scan is simply too expensive for my purpose, but nothing I do seems to prompt the query optimizer to take advantage of the index (short, of course, of changing the ">=" above to an equals sign, which indicates that the index is present and operational). I have tried several hints to no avail (index, index_asc in several permutations).
Is what I am trying to do impossible? If I were trying to do this on a large data structure over which I had more control, I'd build a linked list on the indexed column's values and a tree to find the right entry point. Then traversing the list would be very inexpensive (yes I might have to run all over the disk to find the records I'm looking for, but I surely wouldn't have to scan the whole table).
I'll add in case it's important to my query that the database I'm using is running Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit.
I constructed a small test case with 10K rows. When I populated the table such that the Z values were already ordered, the exact query you gave tended to use the index. But when I populated it with random values, and refreshed the table statistics, it started doing full table scans, at least for some values of n larger than 25. So there is a tipping point at which the optimizer decides that the amount of work it will do to look up index entries then find the corresponding rows in the table is more than the amount of work to do a full scan. (It might be wrong in its estimate, of course, but that is what it has to go on.)
I noticed that you are using SELECT *, which means the query is returning both columns. This means that the actual table rows must be accessed, since neither index includes both columns. This might push the optimizer towards preferring a full table scan for larger samples. If the query could be fulfilled from the index alone, it would be more likely to use the index.
One possibility is that you don't really need to return the values of K at all. If so, I'd suggest that you change both occurrences of SELECT * to SELECT z. In my test, this change caused a query that had been doing a full table scan to use an index scan instead (and not access the table itself at all).
If you do need to include K in the result, then you might try creating an index on (Z, K). This index could be used to satisfy the query without accessing the table.
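As a sketch of the (Z, K) suggestion in Oracle SQL: the index name and the bind variable :origin_z are hypothetical, and whether the optimizer actually picks the index still depends on statistics, so it is worth checking the explain plan afterwards.

```sql
-- Composite index that contains every column the query returns.
CREATE INDEX t_z_k_idx ON T (Z, K);

-- The "next 25" rows: list the columns explicitly so the index alone
-- can answer the query without visiting the table.
SELECT *
FROM (
    SELECT Z, K
    FROM T
    WHERE Z >= :origin_z
    ORDER BY Z ASC
)
WHERE ROWNUM <= 25;

-- The "previous 25" rows: same idea, walking the index downwards.
SELECT *
FROM (
    SELECT Z, K
    FROM T
    WHERE Z <= :origin_z
    ORDER BY Z DESC
)
WHERE ROWNUM <= 25;
```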

How to speed up a clustered index scan while selecting all fields on a range of rows or all the rows

I have a table
Books(BookId, Name, ...... , PublishedYear)
I do have about 30 fields in my Books table, where BookId is the primary key (Identity column). I have about 2 million records for this table.
I know SELECT * is an evil performance killer.
My situation requires selecting a range of rows, or all the rows, with all the columns.
Select * from Books;
This query takes more than 2 seconds to scan through the data pages and get all the records. On checking the execution plan, it still uses a clustered index scan.
Obviously 2 seconds may not be that bad. However, when this table is joined with other tables in a batch, the batch takes over 15 minutes. (There are no duplicate records in the final result, as the count matches.) The join criteria are pretty simple and yield no duplication.
Excluding this table alone lets the batch complete in under a second.
Is there a way to optimize this, given that I have to select all the columns? :(
Thanks in advance.
I've just run a batch against my developer instance, one SELECT specifying all columns and one using *. There is no evidence (nor should there be) of any difference aside from the raw parsing of my input. If I remember correctly, that old saying really means: do not SELECT columns you are not using; they use up resources without benefit.
When you try to improve performance in your code, always check your assumptions; they might only apply to some older version (of SQL Server etc.) or to another method.

SQL Select is very slow with CHARINDEX

I am using sql server and have a table with 2 columns
myId varchar(80)
cchunk varchar(max)
Basically it stores large chunks of text, which is why I need varchar(max).
My problem is when I do a query like this:
select *
from tbchunks
where
(CHARINDEX('mystring1',tbchunks.cchunk)< CHARINDEX('mystring2',tbchunks.cchunk))
AND
CHARINDEX('mystring2',tbchunks.cchunk) - CHARINDEX('mystring1',tbchunks.cchunk) <=10
It takes about 3 seconds to complete. The table tbchunks has only about 500,000 records, and the query above returns anywhere between 0 and 800 rows.
I have a nonclustered index on the myId column. It helped make select count(*) fast but didn't help with the above query.
I tried using full-text search but it was slow. I also tried splitting the text in cchunk into smaller parts and adding an id column to connect the split chunks, so that I could add an index. I ended up with a table of 2 million records of split chunks of text, and the query was even slower.
EDIT:
modified the table to include primary key (int)
created a full-text catalog with "Accent Sensitive=true"
created a full-text index on my table on column "cchunk"
ran the same query as above and it ended up taking 22 seconds, which is much slower
UPDATE
Thanks everyone for suggesting full-text search (@Aaron Bertrand thanks!). I converted my query to this:
SELECT * FROM tbchunks AS FT_TBL INNER JOIN
CONTAINSTABLE(tbchunks, cchunk, '(mystring1 NEAR mystring2)') AS KEY_TBL
ON FT_TBL.cID = KEY_TBL.[KEY]
By the way, cID is the primary key I added later.
Anyway, I am getting broad results, and I notice that the higher the returned RANK column, the better the results. My question is: at what RANK value do the results start to become accurate?
An index isn't going to help with CHARINDEX at all. An index on a particular column is only going to be able to quickly find rows where the value in the indexed field is exactly an indexed value. I'm actually quite surprised that query only takes 3 seconds given that it has to read every single row four times (or at the very least, twice).
As good as the ideas presented here were, nobody managed to really solve my problem. They did provide helpful tips that led me to the solution, which I would love to share.
Using full-text search really was the answer, as many mentioned, but I managed to use CONTAINS in combination with NEAR so that it can totally replace my current SQL query and provide awesome speed.
CONTAINS(cchunk, 'NEAR ((mystring1, mystring2), 3, TRUE)')
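For context, here is a sketch of how that predicate might sit in a complete query. It assumes the full-text indexed column cchunk is passed as the first argument to CONTAINS, and that the server supports the custom proximity form of NEAR (SQL Server 2012 or later).

```sql
-- Rows where mystring1 appears before mystring2 with at most
-- 3 non-search terms between them.
SELECT *
FROM tbchunks
WHERE CONTAINS(
    cchunk,
    'NEAR((mystring1, mystring2), 3, TRUE)'  -- 3 = max distance, TRUE = keep term order
);
```

The distance here is measured in terms rather than characters, so this is not an exact equivalent of the original CHARINDEX condition (a gap of at most 10 characters). The TRUE flag plays the role of the first CHARINDEX comparison by requiring mystring1 to come before mystring2.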

Selecting 'highest' X rows without sorting

I've got a table with a huge amount of data. Let's say 10GB of rows, containing a bunch of crap. I need to select, for example, the X rows (X is usually below 10) with the highest value in the amount column.
Is there any way to do it without sorting the whole table? Sorting this amount of data is extremely time-expensive. I'd be OK with one scan through the whole table that selects the X highest values and leaves the rest untouched. I'm using SQL Server.
Create an index on amount then SQL Server can select the top 10 from that and do bookmark lookups to retrieve the missing columns.
SELECT TOP 10 Amount FROM myTable ORDER BY Amount DESC
if it is indexed, the query optimizer should use the index.
If not, I do not see how one could avoid scanning the whole thing...
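A minimal T-SQL sketch of the index approach described above. The index name is hypothetical, and myTable and Amount are taken from the example query.

```sql
-- Index on the column used for ordering, so the top rows can be read
-- straight from one end of the index instead of sorting the table.
CREATE NONCLUSTERED INDEX IX_myTable_Amount
    ON myTable (Amount DESC);

-- Top 10 rows by Amount; the remaining columns are fetched with
-- lookups for just those 10 rows.
SELECT TOP (10) *
FROM myTable
ORDER BY Amount DESC;
```

Building the index costs one pass over the data plus a sort, and it adds overhead to every later insert and update, so it pays off mainly when the search is run repeatedly.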
Whether an index is useful or not depends on how often you do that search.
You could also consider putting that query into an indexed view. I think this will give you the best benefit/cost ratio.