How can I improve the speed of a SQL query searching for a collection of strings - sql

I have a table called T_TICKET with a column CallId varchar(30).
Here is an example of my data:
CallId | RelatedData
===========================================
MXZ_SQzfGMCPzUA | 0000
MXyQq6wQ7gVhzUA | 0001
MXwZN_d5krgjzUA | 0002
MXw1YXo7JOeRzUA | 0000
...
I am attempting to find records that match a collection of CallId's. Something like this:
SELECT * FROM T_TICKET WHERE CALLID IN(N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', N'MXZ_SQzfGMCPzUA', ... ,N'MXyQq6wQ7gVhzUA')
And I have anywhere from 200 - 300 CallId's that I am looking up at a time using this query. The query takes around 35 seconds to run. Is there anything I can do to either the table structure, the column type, the index, or the query itself to improve the performance of this query?
There are around 300,000 rows in T_TICKET currently. CallId is not unique, and RelatedData is not unique. I also have a non-clustered index on CallId.
I know the basics of SQL, but I'm not a pro. Some things I've thought of doing are:
Change the type of CallId from varchar to char.
Shorten the length of CallId (its length is 30, but in reality, right now, I am using only 15 bytes).
I have not tried any of these yet because it requires changes to live production data. And, I am not sure they would make a significant improvement.
Would either of these options make a significant improvement? Or, are there other things I could do to make this perform faster?

First, be sure that the types on both sides of the comparison are the same -- either both VARCHAR or both NVARCHAR. Then, add an index:
create index idx_t_ticket_callid on t_ticket(callid);
If the types are compatible, SQL Server should make use of the index.
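If you are not sure what CallId was declared as, a quick way to check (SQL Server):
SELECT DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'T_TICKET' AND COLUMN_NAME = 'CallId';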

Your table is what we call a heap (a table without a clustered index). This kind of table is only good for data loading and/or as a staging table. I would recommend converting your table to have a clustering key. A good clustering key should be unique, static, narrow, non-nullable, and ever-increasing (e.g. an int/bigint identity column).
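A minimal sketch of that conversion (the surrogate key column name TicketId is hypothetical):
-- add a narrow, ever-increasing surrogate key and cluster on it
ALTER TABLE T_TICKET ADD TicketId BIGINT IDENTITY(1,1) NOT NULL;
ALTER TABLE T_TICKET ADD CONSTRAINT PK_T_TICKET PRIMARY KEY CLUSTERED (TicketId);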
Another downside of a heap is that when you have lots of UPDATEs/DELETEs on your table, it will slow down your SELECTs because of forwarded records. Quoting Paul Randal on forwarded records:
If a forwarding record occurs in a heap, when the record locator points to that location, the Storage Engine gets there and says Oh, the record isn't really here – it's over there! And then it has to do another (potentially physical) I/O to get to the page with the forwarded record on. This can result in a heap being less efficient than an equivalent clustered index.
Lastly, list only the columns you need in your SELECT and avoid SELECT *. I'm guessing you are getting a table scan when you execute the query. What you can do is INCLUDE the columns from your SELECT list in your index, like this (DROP_EXISTING=ON only works if an index with this exact name already exists; otherwise create the index without that option and drop the old CallId index separately):
CREATE INDEX [IX_T_TICKET_CallId_INCLUDE] ON [T_TICKET] ([CallId]) INCLUDE ([RelatedData]) WITH (DROP_EXISTING=ON)

It turns out there is in fact a way to significantly optimize my query without changing any data types.
This query:
SELECT * FROM T_TICKET
WHERE CALLID IN(N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', N'MXZ_SQzfGMCPzUA', ... ,N'MXyQq6wQ7gVhzUA')
is using NVARCHAR literals as the input params (N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA'...). As I specified in my question, CallId is VARCHAR. SQL Server was converting CallId in every row of the table to NVARCHAR to do the comparison, which was taking a long time (even though I have an index on CallId).
I was able to optimize it by simply passing the parameters as plain VARCHAR literals, without the N prefix:
SELECT * FROM T_TICKET
WHERE CALLID IN('MXZInrBl1DCnzUA', 'MXZ0TWkUhHprzUA', 'MXZ_SQzfGMCPzUA', ... ,'MXyQq6wQ7gVhzUA')
Now, instead of taking over 30 seconds to run, it only takes around .03 seconds. Thanks for all the input.
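As a side note, another way to keep the comparison entirely in VARCHAR terms (just a sketch, I have not benchmarked it against the plain IN list) is to stage the ids in a temp table whose column matches the CallId type and join to it:
CREATE TABLE #CallIds (CallId VARCHAR(30) NOT NULL PRIMARY KEY);
INSERT INTO #CallIds (CallId)
VALUES ('MXZInrBl1DCnzUA'), ('MXZ0TWkUhHprzUA'); -- ... the rest of the 200-300 ids
SELECT t.*
FROM T_TICKET t
JOIN #CallIds c ON c.CallId = t.CallId; -- same type on both sides, so the index on CallId stays usable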

Related

What is a better schema for indexing: a combined varchar column or several integer columns?

I want to make my table schema better. This table will insert a record per microsecond.
The table is already too big, so I could not test the table itself.
Current setup (columns id, name, one, two, three):
SELECT *
FROM table
WHERE name = 'foo'
AND one = 1
AND two = 2
AND three = 3;
Maybe in the future (columns id, name, path):
SELECT *
FROM table
WHERE
name = 'foo'
AND path = '1/2/3';
If I change three integer columns to one varchar column, will the SQL run faster than now?
Using PostgreSQL
The varchar length will be 5~12.
I think I could use bigint with zero-fill (1/2/3 to 1000010200003), which may be faster than varchar.
Premature optimization is the root of all evil.
If you have a fixed number of integers, or at least a reasonable upper limit, stick with having an individual column for each.
You would then use a combined index over all columns, ideally with the non-nullable and most selective columns first.
If you want to optimize further, use smallint, which only takes up two bytes.
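For example (a sketch; events stands in for the real table name, since table itself is a reserved word):
-- composite index, with the always-filtered, most selective columns first
CREATE INDEX idx_events_name_one_two_three ON events (name, one, two, three);
-- if the integers fit in two bytes, smallint saves space per row and per index entry
ALTER TABLE events
    ALTER COLUMN one TYPE smallint,
    ALTER COLUMN two TYPE smallint,
    ALTER COLUMN three TYPE smallint;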
If I change three integer columns to one varchar column, will the SQL run faster than now?
Not noticeably so. You might produce some small impacts on performance, balancing things such as:
Are the string columns bigger or smaller than the integer keys (resulting in marginally bigger or smaller data pages and indexes)?
Is an index on two variable-length strings less efficient than an index on one variable-length string and three fixed-length keys?
Do the results match what you need or is additional processing needed after you fetch a record?
In either case the available index is going to be used to find the row(s) that match the conditions. This is an index seek, because the comparisons are all equality. Postgres will then go directly to the rows you need. There is a lot of work going on beyond just the index comparisons.
You are describing 1,000,000 inserts per second, or over 86 billion inserts each day -- that is a lot. Under such circumstances, you are not using an off-the-shelf instance of Postgres running on your laptop. You should have proper DBA support to answer a question like this.

dictionary database, one table vs table for each char

I have a very simple database that contains one table T:
wOrig nvarchar(50), not null
wTran nvarchar(50), not null
The table has over 50 million rows. I execute a simple query:
select wTran from T where wOrig = 'myword'
The query takes about 40 sec to complete. I divided the table based on the first char of wOrig, and the execution time became much smaller than before (roughly in line with each new table's size).
Am I missing something here? Shouldn't the database use a more efficient way to do the search, like a binary search?
My question: what changes to the database options - based on this situation - could make the search more efficient, so that I can keep all the data in one table?
You should be using an index. For your query, you want an index on T(wOrig). Your query will be much faster:
create index idx_t_worig on T(wOrig);
Depending on considerations such as space and insert/update characteristics, a clustered index on (wOrig) or (wOrig, wTran) might be the best solution.
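For example (SQL Server syntax, assuming the table really is named T as in the question):
-- clustering on wOrig keeps all rows for a given word together; adding wTran makes the query covered
CREATE CLUSTERED INDEX cix_T_wOrig_wTran ON T (wOrig, wTran);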

sql select condition performance

I have a table 'Tab' with data such as:
id | value
---------------
1 | Germany
2 | Argentina
3 | Brasil
4 | Holland
Which way of selecting is better for performance?
1. SELECT * FROM Tab WHERE value IN ('Argentina', 'Holland')
or
2. SELECT * FROM Tab WHERE id IN (2, 4)
I suppose that the second select would be faster, because int comparison is faster than string comparison. Is that true for MS SQL?
This is a premature optimization. The comparison between integers and strings is generally going to have a minimal impact on query performance. The drivers of query performance are more along the lines of tables sizes, query plans, available memory, and competition for resources.
In general, it is a good idea to have indexes on columns used for either comparison. The first column looks like a primary key, so it automatically gets an index. The string column should have an index built on it. In general, indexes built on an integer column will have marginally better performance compared to indexes built on variable-length string columns. However, this type of performance difference really makes a difference only in environments with very high levels of transactions (think thousands of data modification operations per second).
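For example, the index on the string column would simply be (a sketch; the id column, if it is the primary key, is already indexed):
CREATE INDEX IX_Tab_Value ON Tab (value);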
You should use the logic that best fits the application and worry about other aspects of the code.
To answer the simple question: yes, option 2, SELECT * FROM Tab WHERE id IN (2, 4), would be faster, as you said, because int comparison is faster than string comparison.
One way to speed it up is to add indexes to your columns to speed up evaluation, filtering, and the final retrieval of results.
If this table were to grow even more, you should also not SELECT * but SELECT id, value; otherwise you may be pulling more data than you need.
You can also speed up your queries by adding WITH (NOLOCK), as the speed of your query might be affected by other sessions accessing the tables at the same time. For example: SELECT * FROM Tab WITH (NOLOCK) WHERE id IN (2, 4). Keep in mind, though, that NOLOCK is not a turbo button and should only be used in appropriate situations, since it reads uncommitted data.

Index on VARCHAR column

I have a table of 32,589 rows, and one of the columns is called 'Location' and is a Varchar(40) column type. The column holds a location, which is actually a suburb, all uppercase text.
A function that uses this table does a:
IF EXISTS(SELECT * FROM MyTable WHERE Location = 'A Suburb')
...
Would it be beneficial to add an index to this column, for efficiency? This is mostly a read-only table, so there are not many edits or inserts except for maintenance.
Without an index, SQL Server will have to perform a table scan to find the first instance of the location you're looking for. You might get lucky and have the value be in one of the first few rows, but it could be at row 32,000, which would be a waste of time. Adding an index only takes a few seconds, and you'll probably see a big performance gain.
I concur with Brian Shamblen's answer.
Also, try using TOP 1 in the inner select:
IF EXISTS(SELECT TOP 1 * FROM MyTable WHERE Location = 'A Suburb')
You don't have to select all the records matching your criteria for EXISTS, one is enough.
An opportunistic approach to performance tuning is usually a bad idea.
To answer the specific question - if your function is using location in a where clause, and the table has more than a few hundred rows, and the values in the location column are not all identical, creating an index will speed up your function.
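Something along these lines (a sketch):
CREATE INDEX IX_MyTable_Location ON MyTable (Location);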
Whether you notice any difference is hard to say - there may be much bigger performance problems lurking in the database, and you might be fixing the wrong problem.

Checking for the presence of text in a text column efficiently

I have a table with about 2,000,000 rows. I need to query one of the columns to retrieve the rows where a string exists as part of the value.
When I run the query I will know the position of the string, but not beforehand. So a view which takes a substring is not an option.
As far as I can see I have three options
using LIKE '% %'
using INSTR
using SUBSTR
I do have the option of creating a function-based index, if I am nice to the DBA.
At the moment all queries are taking about two seconds. Does anyone have experience of which of these options will work best, or whether there is another option? The select will be used for deletes every few seconds, and it will typically select about 10 rows.
edit with some more info
The problem comes about because we are using a table for storing objects with arbitrary keys and values. The objects come from outside our system, so we have limited scope to control them, and the text column ends up as something like 'key1=abc,key2=def,keyn=ghi'. I know this is horribly denormalised, but as we don't know what the keys will be (to some extent) it is a reliable way to store and retrieve values. Retrieving a row is fairly fast when we search on the whole of the column, which is indexed. But the performance is not good if we want to retrieve the rows with key2=def.
We may be able to create a table with columns for the most common keys, but I was wondering if there was a way to improve performance with the existing set up.
In Oracle 10:
CREATE TABLE test (tst_test VARCHAR2(200));
CREATE INDEX ix_re_1 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1'))
SELECT *
FROM TEST
WHERE REGEXP_REPLACE(REGEXP_SUBSTR(TST_TEST, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1') = 'TEST'
This will use the newly created index.
You will need as many indices as there are KEYs in your data.
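For example, the matching index for KEY2 would be:
CREATE INDEX ix_re_2 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY2=[^,]*'), 'KEY2=([^,]*)', '\1'))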
Presence of an INDEX, of course, impacts performance, but it depends very little on REGEXP being there:
SQL> CREATE INDEX ix_test ON test (tst_test)
2 /
Index created
Executed in 0,016 seconds
SQL> INSERT
2 INTO test (tst_test)
3 SELECT 'KEY1=' || level || ';KEY2=' || (level + 10000)
4 FROM dual
5 CONNECT BY
6 LEVEL <= 1000000
7 /
1000000 rows inserted
Executed in 47,781 seconds
SQL> TRUNCATE TABLE test
2 /
Table truncated
Executed in 2,546 seconds
SQL> DROP INDEX ix_test
2 /
Index dropped
Executed in 0 seconds
SQL> CREATE INDEX ix_re_1 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1'))
2 /
Index created
Executed in 0,015 seconds
SQL> INSERT
2 INTO test (tst_test)
3 SELECT 'KEY1=' || level || ';KEY2=' || (level + 10000)
4 FROM dual
5 CONNECT BY
6 LEVEL <= 1000000
7 /
1000000 rows inserted
Executed in 53,375 seconds
As you can see, on my not very fast machine (Core2 4300, 1 GB RAM) you can insert about 20,000 records per second into an indexed table, and this rate hardly depends on the type of INDEX being used: plain or function-based.
You can use Tom Kyte's runstats package to compare the performance of different implementations - running each say 1000 times in a loop. For example, I just compared LIKE with SUBSTR and it said that LIKE was faster, taking about 80% of the time of SUBSTR.
Note that "col LIKE '%xxx%'" is different from "SUBSTR(col,5,3) = 'xxx'". The equivalent LIKE would be:
col LIKE '____xxx%'
using one '_' for each leading character to be ignored.
I think whichever way you do it, the results will be similar - it always involves a full table (or perhaps full index) scan. A function-based index would only help if you knew the offset of the substring at the time of creating the index.
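For example, if the interesting substring always started at position 5 with length 3, a function-based index on that expression (a sketch, reusing the test table from the answer above) would let the optimizer use an index instead of a full scan:
CREATE INDEX ix_test_sub ON test (SUBSTR(tst_test, 5, 3));
-- the query must use the exact same expression for the index to be considered
SELECT * FROM test WHERE SUBSTR(tst_test, 5, 3) = 'xxx';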
I am rather concerned when you say that "The select will be used for deletes every few seconds". This does rather suggest a design flaw somewhere, but without knowing the requirements it's hard to say.
UPDATE:
If your column values are like 'key1=abc,key2=def,keyn=ghi' then perhaps you could consider adding another table like this:
create table key_values
( main_table_id number references main_table
, key_value varchar2(50)
, primary key (main_table_id, key_value)
);
create index key_values_idx on key_values (key_value);
Split the key values up and store them in this table like this:
main_table_id key_value
123 key1=abc
123 key2=def
123 key3=ghi
(This could be done in an AFTER INSERT trigger on main_table for example)
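A minimal sketch of such a trigger (the column names id and obj_text on main_table are hypothetical, and edge cases such as empty values are ignored):
CREATE OR REPLACE TRIGGER trg_main_table_keys
AFTER INSERT ON main_table
FOR EACH ROW
DECLARE
  l_rest VARCHAR2(4000) := :NEW.obj_text || ','; -- trailing comma simplifies the loop
  l_pos  PLS_INTEGER;
BEGIN
  WHILE l_rest IS NOT NULL LOOP
    l_pos := INSTR(l_rest, ',');
    IF l_pos > 1 THEN
      INSERT INTO key_values (main_table_id, key_value)
      VALUES (:NEW.id, SUBSTR(l_rest, 1, l_pos - 1)); -- one row per key=value pair
    END IF;
    l_rest := SUBSTR(l_rest, l_pos + 1);
  END LOOP;
END;
/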
Then your delete could be:
delete main_table
where id in (select main_table_id from key_values
where key_value = 'key2=def');
Can you provide a bit more information?
Are you querying for an arbitrary substring of a string column, or is there some syntax on the strings stored in the columns that would allow for some preprocessing to minimize repeated work?
Have you already done any timing tests on your three options to determine their relative performance on the data you're querying?
I suggest reconsidering your logic.
Instead of looking for where a string exists, it may be faster to check if it has a length of >0 and is not a string.
You can use the TRANSLATE function in Oracle to convert all non-string characters to nulls, then check whether the result is null.
Separate answer to comment on the table design.
Can't you at least have a KEY/VALUE structure, so that instead of storing 'key1=abc,key2=def,keyn=ghi' in a single column, you would have a child table like:
KEY VALUE
key1 abc
key2 def
key3 ghi
Then you can create a single index on key and value and your queries are much simpler (since I take it you are actually looking for an exact match on a given key's value).
Some people will probably comment that this is a horrible design, but I think it's better than what you have now.
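A sketch of that structure (the names are made up):
CREATE TABLE object_attrs
( object_id NUMBER       NOT NULL  -- points back to the existing objects table
, key_name  VARCHAR2(50) NOT NULL
, key_value VARCHAR2(200)
);
CREATE INDEX ix_object_attrs_kv ON object_attrs (key_name, key_value);
-- finding all objects where key2 = 'def' is then a straight index lookup:
SELECT object_id FROM object_attrs WHERE key_name = 'key2' AND key_value = 'def';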
If you're always going to be looking for the same substring, then using INSTR and a function-based index makes sense to me. You could also do this if you have a small set of constant substrings you will be looking for, creating one FBI for each one.
Quassnoi's REGEXP idea looks promising too. I haven't used regular expressions inside Oracle yet.
I think that Oracle Text would be another way to go. Info on that here
Not sure about improving the existing setup, but Lucene (a full-text search library, ported to many platforms) can really help. There's the extra burden of synchronizing the index with the DB, but if you have anything that resembles a service layer in some programming language, this becomes an easy task.
Similar to Anton Gogolev's response, Oracle does incorporate a text search engine documented here
There's also extensible indexing, so you can build your own index structures, documented here
As you've agreed, this is a very poor data structure, and I think you will struggle to achieve the aim of deleting stuff every few seconds. Depending on how this data gets input, I'd look at properly structuring the data on load, at least to the extent of having rows of "parent_id", "key_name", "key_value".