SQL Server: Get records that match specific data at a specific location?

I have a varchar column that has data such as 00110100001110100100010011111, and I need to get back records that have 1 in position 5 and 0 in position 11. What is the fastest way I can search for them?
Right now I'm thinking of using substring: substring(column, 5, 1) = '1' and substring(column, 11, 1) = '0'. Is this the best way?
Thanks.

LIKE '____1_____0%' is the simplest way with your current structure. It will involve a full table scan though due to the leading wildcard.
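For example, the whole query might look like this (dbo.FlagTable and FlagString are only placeholders for your actual table and column names):
SELECT *
FROM dbo.FlagTable
WHERE FlagString LIKE '____1_____0%';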
What does this string of characters represent though?
If it is a fixed set of boolean values you might consider separating them out into individual bit columns and indexing them individually.
This is more space efficient as 8 values can fit into 2 bytes (including null bitmap) as opposed to 2 values in 2 bytes for the varchar version.
You might well still end up with table scans, however, because these indexes will not be selective enough to be used unless the values are skewed and you are searching for the less common values. At least SQL Server will be able to maintain separate column statistics and use the indexes when this would help.
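A rough sketch of that fixed-set alternative (all names here are illustrative, not from the question):
CREATE TABLE dbo.EntityFlags
(
    EntityId INT PRIMARY KEY,
    Flag05   BIT NOT NULL,  -- was position 5 of the varchar
    Flag11   BIT NOT NULL   -- was position 11 of the varchar
    -- ... one bit column per position
);
CREATE INDEX IX_EntityFlags_Flag05 ON dbo.EntityFlags (Flag05);
CREATE INDEX IX_EntityFlags_Flag11 ON dbo.EntityFlags (Flag11);

SELECT EntityId
FROM dbo.EntityFlags
WHERE Flag05 = 1
  AND Flag11 = 0;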
If it is an arbitrary set (e.g. an ever growing history of states) then you should probably separate out into a new table (EntityId, Position (int), Value (bit)). You can then use a relational division query to bring back all EntityIds matching the desired pattern.
SELECT EntityId
FROM EntityState -- the new (EntityId, Position, Value) table; the name here is illustrative
WHERE ( Position = 5  AND Value = 1 )
   OR ( Position = 11 AND Value = 0 )
GROUP BY EntityId
HAVING COUNT(*) = 2

Use SUBSTRING. You can parameterise SUBSTRING, so if you later want positions 3 and 13 you can change the parameters, or wrap it in a UDF, etc.
It depends on what you want, of course.
If the positions are static, use Martin Smith's answer because it's cleaner.
I suspect you need to refactor this column into several discrete ones, though.

Do positions 5 and 11 stay constant? Do you have the ability to create computed columns and indexes?
If the answer to both of these questions is "yes", then you should be able to achieve good performance by implementing the following general idea:
Create a computed column on substring(column, 5, 1).
Create a computed column on substring(column, 11, 1).
Create a composite index on both of these columns.
Then, in your query, use exactly the same expressions as in the definitions of your computed columns (such as substring(column, 5, 1) = '1' and substring(column, 11, 1) = '0', as you already proposed).
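A minimal sketch of that approach, assuming a hypothetical table dbo.Flags with the varchar column FlagString:
ALTER TABLE dbo.Flags ADD Pos5  AS SUBSTRING(FlagString, 5, 1);
ALTER TABLE dbo.Flags ADD Pos11 AS SUBSTRING(FlagString, 11, 1);

CREATE INDEX IX_Flags_Pos5_Pos11 ON dbo.Flags (Pos5, Pos11);

-- Repeating the computed-column expressions lets the optimizer match them to the index
SELECT *
FROM dbo.Flags
WHERE SUBSTRING(FlagString, 5, 1) = '1'
  AND SUBSTRING(FlagString, 11, 1) = '0';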
That being said, if you can, do yourself a favor and normalize your data model. Your table is not even in the 1st normal form!

Related

What is a better schema for indexing: a combined varchar column or several integer columns?

I want to make my table schema better. A record will be inserted into this table every microsecond.
The table is already too big, so I could not test the table itself.
Current setup (columns id, name, one, two, three):
SELECT *
FROM table
WHERE name = 'foo'
AND one = 1
AND two = 2
AND three = 3;
Maybe in the future (columns id, name, path):
SELECT *
FROM table
WHERE
name = 'foo'
AND path = '1/2/3';
If I change three integer columns to one varchar column, will the SQL run faster than now?
Using PostgreSQL
The varchar length will be 5-12 characters.
I think I can use bigint with zero-fill (1/2/3 becomes 1000010200003), which may be faster than varchar.
Premature optimization is the root of all evil.
If you have a fixed number of integers, or at least a reasonable upper limit, stick with having an individual column for each.
You would then use a combined index over all columns, ideally with the not-nullable and most selective columns first.
If you want to optimize, use smallint which only takes up two bytes.
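For example, a combined index over all columns, using the names from the question ("table" is just the question's placeholder name, and the index name is arbitrary):
CREATE INDEX idx_name_one_two_three ON "table" (name, one, two, three);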
If I change three integer columns to one varchar column, will the SQL run faster than now?
Not noticeably so. You might produce some small impacts on performance, balancing things such as:
Are the string columns bigger or smaller than the integer keys (resulting in marginal bigger or smaller data pages and indexes)?
Is an index on two variable-length strings less efficient than an index on one variable-length string and three fixed-length keys?
Do the results match what you need or is additional processing needed after you fetch a record?
In either case the available index is going to be used to find the row(s) that match the conditions. This is an index seek, because the comparisons are all equality. Postgres will then go directly to the rows you need. There is a lot of work going on beyond just the index comparisons.
You are describing 1,000,000 inserts per second, or roughly 86 billion inserts each day -- that is a lot. Under such circumstances, you are not using an off-the-shelf instance of Postgres running on your laptop. You should have proper DBA support to answer a question like this.

Select query to find all unique keys from a table where a any word from a list appears in a free text field?

Looking for a little bit of SQL-foo to help find the most efficient way to do this query.
I have a table with two columns, ID and a small character field (<300 chars). The ID field is not unique, and I would like the result to be a distinct list of ID numbers. I also have an input list of words that I want to query on, say 'foo', 'bar' as the base case. For a result to be valid, it also must have at least one matching row for each word that is input.
What is a clean and efficient way to write this as one query? I am also open to multiple queries if there is no single-query way to execute it efficiently.
Please note that in the specific environment I am working with I cannot use more than 10 subqueries, and I may have 10 or more words provided as input (although I may be able to limit the input to 10 as long as the user is aware of this). Also note that I cannot use the 'IN' clause if it is possible that the list of values in it grows to be larger than a few thousand. I am querying a table with potentially millions of ID-text pairs.
Thanks for any and all advice!
Use a UDF that returns a table:
Consider writing a user-defined function (UDF) that takes a string containing all values that you wish to search for, separated by a delimiter. The UDF would split the data in the string and return it as a table. Then, include the table that the UDF returns as a join on the table in question.
Here's an example: http://everysolution.wordpress.com/2011/07/28/udf-to-split-a-delimited-string-and-return-it-as-a-table/
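A sketch of the join (dbo.SplitList is a hypothetical name for such a UDF, assumed to return a single column named value; tbl and wordfield stand in for your table and text column). The GROUP BY/HAVING part enforces the question's requirement that every input word has at least one match:
SELECT t.id
FROM tbl AS t
JOIN dbo.SplitList('foo,bar', ',') AS w
  ON t.wordfield = w.value
GROUP BY t.id
HAVING COUNT(DISTINCT w.value) = 2;  -- one match for each of the 2 input words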
If that small character field is always one word and you're looking for an exact match with a word in your list, I don't see why the below would not work. That is, if you're looking for IDs with 'foo', do you want only IDs that are 'foo', or might there be 'fooish', which should also be a match? In the latter case this won't work, in the former it should.
The query below assumes:
That your 2 column table is called "tbl"
That you can put the list of these 'input' words into a table; in my example below this other table is called "othertbl". It should contain however many words you're searching on, and it can be over 1,000 (the exists subquery doesn't have that limitation)
As stated before, I am assuming you are looking for exact matches on the 2nd column of "tbl", not partial or fuzzy matches
For performance reasons, you'll want to ensure that tbl.wordfield and othertbl.word are indexed (whatever the column names actually are)
select distinct id
from tbl
where exists
(select 'x' from othertbl where othertbl.word = tbl.wordfield)

How to create an index for a string column in sql?

I have a table with 3 columns: a list id, name and numeric value.
The goal is to use the table to retrieve and update the numeric value for a name in various lists.
The problem is that SQL Server refuses to create an index on the name column because it's a string column with variable length.
Without an index, selecting by name will be inefficient, and the option of using a fixed-length text column would waste a lot of storage space, as names can be fairly long.
What is the best way to build this table and its indexes?
(running sql server 2008)
If your string is longer than 900 bytes, then it can't be an index key, regardless of whether it is variable or fixed length.
One idea would be to at least make seeks more selective by adding a computed column. e.g.
CREATE TABLE dbo.Strings
(
  -- other columns,
  WholeString VARCHAR(4000),
  Substring AS CONVERT(VARCHAR(10), WholeString) PERSISTED
);
CREATE INDEX ss ON dbo.Strings(Substring);
Now when searching for a row to update, you can say:
WHERE s.Substring = LEFT(@string, 10)
  AND s.WholeString = @string;
This will at least help the optimizer narrow its search down to the index pages where the exact match is most likely to live. You may want to experiment with that length as it depends on how many similar strings you have and what will best help the optimizer weed out a single page. You may also want to experiment with including some or all of the other columns in the ss index, with or without using the INCLUDE clause (whether this is useful will vary greatly on various factors such as what else your update query does, read/write ratio, etc).
A regular index can't be created on ntext or text columns (I guess your name column is of that type, or an (n)varchar longer than 900 bytes). You can create a full-text index on that column type.
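A sketch of the full-text option, assuming the table is dbo.Lists and it already has a unique index (e.g. its primary key PK_Lists), which full-text indexing requires:
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.Lists(name) KEY INDEX PK_Lists;

-- Query with CONTAINS rather than =
SELECT * FROM dbo.Lists WHERE CONTAINS(name, 'SomeName');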

Creating indexes for optimizing the execution of Stored Procedures

The WHERE clause of one of my queries looks like this:
and tbl0.Type = 'Alert'
AND (tbl0.AccessRights like '%'+TblCUG0.userGroup+'%'
or tbl0.AccessRights like 'All' )
AND (tbl0.ExpiryDate > CONVERT(varchar(8), GETDATE(), 1)
or tbl0.ExpiryDate is null)
order by tbl0.Priority,tbl0.PublishedDate desc, tbl0.Title asc
I would like to know which columns I should create indexes on and which type of index will suit best. Also, I have heard that indexes don't work with LIKE and wildcards at the start of the pattern. So what should be the approach to optimizing these queries?
1 and tbl0.Type = 'Alert'
2 AND (tbl0.AccessRights like '%'+TblCUG0.userGroup+'%'
3 or tbl0.AccessRights like 'All' )
4 AND (tbl0.ExpiryDate > CONVERT(varchar(8), GETDATE(), 1)
5 or tbl0.ExpiryDate is null)
most likely, you will not be able to use an index with a WHERE clause like this.
Line 1, You could create an index on tbl0.Type, but if you have many rows and few actual values, SQL Server will just skip the index and table scan anyway. Also, having nothing to do with the index issue, a column like this, a code/flag value is better as a fixed width value char(1), tiny int, etc, where "A"=alert or 1=alert. I would name the column XyzType, where Xyz is what the type describes (DoctorType, CarType, etc). I would create a new table XyzTye, with a FK back to this column in tb10. this new table would have two columns XyzType PK and XyzDescription, where you expand out the name.
Line 2, Are you combining multiple values into tbl0.AccessRights and trying to use LIKE to find values within it? If so, split this out into a different table; then you can remove the LIKE and possibly add an index there.
Line 3, OR kills index usage. Imagine looking through the phone book for all names that are "Smith" or start with "G"; you can't just use the index. You may try splitting the query into a UNION or UNION ALL around the OR so an index can be used (one part looks for "Smith" and the other part looks for "G"); a sketch follows these line-by-line comments. You have not provided enough of the query to determine if this is possible or not in your case. You may need to use a derived table that contains this UNION so you can join it to the rest of your query.
Line 4, tbl0.ExpiryDate could benefit from an index, but the OR will kill its usage; see the Line 3 comment.
Line 5, You may try the OR/UNION trick discussed above, or just not use NULL; put in a default like '01/01/3000' so you don't need the OR.
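Here is a sketch of that OR-to-UNION rewrite, applied to the ExpiryDate predicate only (the other conditions and the original CONVERT on GETDATE() are left out for brevity). UNION ALL is safe here because no row can satisfy both branches:
SELECT tbl0.*
FROM tbl0
WHERE tbl0.ExpiryDate > GETDATE()
UNION ALL
SELECT tbl0.*
FROM tbl0
WHERE tbl0.ExpiryDate IS NULL;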
SQL Server's Database Tuning Advisor can suggest which indexes will optimize your query, including covering indexes that will optimize the selected columns that you do not include in your query. Just because you add an index doesn't mean that the query optimizer will use it. Some indexes may cost more to use than others, so the optimizer will choose the best indexes using the underlying tables' statistics.
Offhand, you could add all ordering and criteria columns to an index, but that would be useless if, for example, there are too few distinct Priority values to make it worth the storage.
You are right about LIKE and wildcards. An index is a B-tree, which means it can speed up searches for specific values or range queries. A wildcard at the beginning means that the query will have to touch all records to check whether they match the pattern. A wildcard at the end means that the query will only have to touch items that start with the substring up to the wildcard, partially turning this into a range query that can benefit from an index.
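For example (Title is just an illustrative column taken from the ORDER BY):
SELECT * FROM tbl0 WHERE Title LIKE 'Smi%';  -- trailing wildcard: the index can seek to the 'Smi' range
SELECT * FROM tbl0 WHERE Title LIKE '%Smi';  -- leading wildcard: every entry must be examined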

Checking for the presence of text in a text column efficiently

I have a table with about 2,000,000 rows. I need to query one of the columns to retrieve the rows where a string exists as part of the value.
When I run the query I will know the position of the string, but not beforehand, so a view which takes a substring is not an option.
As far as I can see I have three options
using like '% %'
using instr
using substr
I do have the option of creating a function-based index, if I am nice to the DBA.
At the moment all queries are taking about two seconds. Does anyone have experience of which of these options will work best, or whether there is another option? The select will be used for deletes every few seconds, and it will typically select about 10 rows.
edit with some more info
The problem comes about because we are using a table to store objects with arbitrary keys and values. The objects come from outside our system, so we have limited scope to control them, and the text column is something like 'key1=abc,key2=def,keyn=ghi'. I know this is horribly denormalised, but as we don't know (to some extent) what the keys will be, it is a reliable way to store and retrieve values. Retrieving a row is fairly fast when we search on the whole of the column, which is indexed, but performance is not good if we want to retrieve the rows with key2=def.
We may be able to create a table with columns for the most common keys, but I was wondering if there was a way to improve performance with the existing set up.
In Oracle 10:
CREATE TABLE test (tst_test VARCHAR2(200));
CREATE INDEX ix_re_1 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1'))
SELECT *
FROM TEST
WHERE REGEXP_REPLACE(REGEXP_SUBSTR(TST_TEST, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1') = 'TEST'
This query will use the newly created index.
You will need as many indexes as there are KEYs in your data.
The presence of an INDEX, of course, impacts insert performance, but this depends very little on the REGEXP being there:
SQL> CREATE INDEX ix_test ON test (tst_test)
2 /
Index created
Executed in 0,016 seconds
SQL> INSERT
2 INTO test (tst_test)
3 SELECT 'KEY1=' || level || ';KEY2=' || (level + 10000)
4 FROM dual
5 CONNECT BY
6 LEVEL <= 1000000
7 /
1000000 rows inserted
Executed in 47,781 seconds
SQL> TRUNCATE TABLE test
2 /
Table truncated
Executed in 2,546 seconds
SQL> DROP INDEX ix_test
2 /
Index dropped
Executed in 0 seconds
SQL> CREATE INDEX ix_re_1 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1'))
2 /
Index created
Executed in 0,015 seconds
SQL> INSERT
2 INTO test (tst_test)
3 SELECT 'KEY1=' || level || ';KEY2=' || (level + 10000)
4 FROM dual
5 CONNECT BY
6 LEVEL <= 1000000
7 /
1000000 rows inserted
Executed in 53,375 seconds
As you can see, on my not-very-fast machine (Core2 4300, 1 GB RAM) you can insert about 20,000 records per second into an indexed field, and this rate hardly depends on the type of index being used: plain or function-based.
You can use Tom Kyte's runstats package to compare the performance of different implementations - running each say 1000 times in a loop. For example, I just compared LIKE with SUBSTR and it said that LIKE was faster, taking about 80% of the time of SUBSTR.
Note that "col LIKE '%xxx%'" is different from "SUBSTR(col,5,3) = 'xxx'". The equivalent LIKE would be:
col LIKE '____xxx%'
using one '_' for each leading character to be ignored.
I think whichever way you do it, the results will be similar - it always involves a full table (or perhaps full index) scan. A function-based index would only help if you knew the offset of the substring at the time of creating the index.
I am rather concerned when you say that "The select will be used for deletes every few seconds". This does rather suggest a design flaw somewhere, but without knowing the requirements it's hard to say.
UPDATE:
If your column values are like 'key1=abc,key2=def,keyn=ghi' then perhaps you could consider adding another table like this:
create table key_values
( main_table_id number references main_table  -- use the datatype of main_table's primary key
, key_value     varchar2(50)
, primary key (main_table_id, key_value)
);
create index key_values_idx on key_values (key_value);
Split the key values up and store them in this table like this:
main_table_id key_value
123 key1=abc
123 key2=def
123 key3=ghi
(This could be done in an AFTER INSERT trigger on main_table for example)
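A sketch of such a trigger, assuming main_table has columns id and text_col holding the comma-delimited 'key1=abc,key2=def,...' string (all names here are illustrative):
CREATE OR REPLACE TRIGGER trg_main_table_ai
AFTER INSERT ON main_table
FOR EACH ROW
DECLARE
  l_rest  VARCHAR2(4000) := :NEW.text_col;
  l_piece VARCHAR2(4000);
  l_pos   PLS_INTEGER;
BEGIN
  -- Walk the comma-delimited string and insert one row per key=value pair
  WHILE l_rest IS NOT NULL LOOP
    l_pos := INSTR(l_rest, ',');
    IF l_pos > 0 THEN
      l_piece := SUBSTR(l_rest, 1, l_pos - 1);
      l_rest  := SUBSTR(l_rest, l_pos + 1);
    ELSE
      l_piece := l_rest;
      l_rest  := NULL;
    END IF;
    INSERT INTO key_values (main_table_id, key_value)
    VALUES (:NEW.id, l_piece);
  END LOOP;
END;
/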
Then your delete could be:
delete main_table
where id in (select main_table_id from key_values
where key_value = 'key2=def');
Can you provide a bit more information?
Are you querying for an arbitrary substring of a string column, or is there some syntax on the strings stored in the columns that would allow for some preprocessing to minimize repeated work?
Have you already done any timing tests on your three options to determine their relative performance on the data you're querying?
I suggest reconsidering your logic.
Instead of looking for where a string exists, it may be faster to check if it has a length of >0 and is not a string.
You can use the TRANSLATE function in Oracle to convert all non-string characters to nulls, then check if the result is null.
Separate answer to comment on the table design.
Can't you at least have a KEY/VALUE structure? Instead of storing 'key1=abc,key2=def,keyn=ghi' in a single column, you would have a child table like:
KEY VALUE
key1 abc
key2 def
key3 ghi
Then you can create a single index on key and value and your queries are much simpler (since I take it you are actually looking for an exact match on a given key's value).
Some people will probably comment that this is a horrible design, but I think it's better than what you have now.
If you're always going to be looking for the same substring, then using INSTR and a function-based index makes sense to me. You could also do this if you have a small set of constant substrings you will be looking for, creating one FBI for each one.
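A sketch of that, for one constant substring (table, column and index names are illustrative):
CREATE INDEX ix_key2_def ON main_table (INSTR(text_col, 'key2=def'));

SELECT *
FROM main_table
WHERE INSTR(text_col, 'key2=def') > 0;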
Quassnoi's REGEXP idea looks promising too. I haven't used regular expressions inside Oracle yet.
I think that Oracle Text would be another way to go. Info on that here
Not sure about improving existing setup stuff, but Lucene (full-text search library; ported to many platforms) can really help. There's extra burden of synchronizing index with the DB, but if you have anything that resembles a service layer in some programming language this becomes an easy task.
Similar to Anton Gogolev's response, Oracle does incorporate a text search engine documented here
There's also extensible indexing, so you can build your own index structures, documented here
As you've agreed, this is a very poor data structure, and I think you will struggle to achieve the aim of deleting stuff every few seconds. Depending on how this data gets input, I'd look at properly structuring the data on load, at least to the extent of having rows of "parent_id", "key_name", "key_value".