How do I optimize a database for superstring queries?

So I have a database table in MySQL that has a column containing a string. Given a target string, I want to find all the rows that have a substring contained in the target, i.e. all the rows for which the target string is a superstring of the column value. At the moment I'm using a query along the lines of:
SELECT * FROM table WHERE 'my superstring' LIKE CONCAT('%', column, '%')
My worry is that this won't scale. I'm currently doing some tests to see if this is a problem but I'm wondering if anyone has any suggestions for an alternative approach. I've had a brief look at MySQL's full-text indexing but that also appears to be geared toward finding a substring in the data, rather than finding out if the data exists in a given string.

You could create a temporary table with a full text index and insert 'my superstring' into it. Then you could use MySQL's full text match syntax in a join query with your permanent table. You'll still be doing a full table scan on your permanent table because you'll be checking for a match against every single row (what you want, right?). But at least 'my superstring' will be indexed so it will likely perform better than what you've got now.
Alternatively, you could consider simply selecting the column from the table and performing the match in a higher-level language. Depending on how many rows the table has, this approach might make more sense. Offloading heavy tasks to a client of the database (such as the web server) can often be a win because it reduces load on the database server.

If your superstrings are URLs, and you want to find substrings in them, it would be useful to know if your substrings can be anchored on the dots.
For instance, you have superstrings :
www.mafia.gov.ru
www.mymafia.gov.ru
www.lobbies.whitehouse.gov
If your rules contain "mafia" and you want the first 2 to match, then what I'll say doesn't apply.
Otherwise, you can parse your URLs into tokens like: [ 'www', 'mafia', 'gov', 'ru' ]
Then, it will be much easier to look up each element in your table.
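A minimal sketch of that token approach in MySQL (the url_tokens table and its columns are hypothetical, and the splitting happens in application code):
CREATE TABLE url_tokens (
    url_id INT NOT NULL,
    token VARCHAR(64) NOT NULL,
    INDEX idx_token (token)
);
-- 'www.mafia.gov.ru' split on the dots becomes four rows:
INSERT INTO url_tokens (url_id, token)
VALUES (1, 'www'), (1, 'mafia'), (1, 'gov'), (1, 'ru');
-- looking up each element of a parsed target is then an indexed lookup, not a scan:
SELECT DISTINCT url_id
FROM url_tokens
WHERE token IN ('www', 'mafia', 'gov', 'ru');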

Well it appears the answer is that you don't. This type of indexing is generally not available and if you want it within your MySQL database you'll need to create your own extensions to MySQL. The alternative I'm pursuing is to do the indexing in my application.
Thanks to everyone that responded!

I created a search solution using views that needed to be robust enough to grow with the customer's needs. For example:
CREATE TABLE tblMyData
(
    MyId bigint identity(1,1),
    Col01 varchar(50),
    Col02 varchar(50),
    Col03 varchar(50)
)
GO
CREATE VIEW viewMySearchData
AS
SELECT
    MyId,
    ISNULL(Col01,'') + ' ' +
    ISNULL(Col02,'') + ' ' +
    ISNULL(Col03,'') + ' ' AS SearchData
FROM tblMyData
GO
SELECT
    t1.MyId,
    t1.Col01,
    t1.Col02,
    t1.Col03
FROM tblMyData t1
INNER JOIN viewMySearchData t2
    ON t1.MyId = t2.MyId
WHERE t2.SearchData LIKE '%search string%'
If they then decide to add columns to tblMyData and want those columns searched, modify viewMySearchData by adding the new columns to the "AS SearchData" section, as in the sketch below.
If they decide that there are too many columns in the search, modify viewMySearchData by removing the unwanted columns from the "AS SearchData" section.
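For example, assuming a hypothetical Col04 was later added to tblMyData, only the view changes:
ALTER VIEW viewMySearchData
AS
SELECT
    MyId,
    ISNULL(Col01,'') + ' ' +
    ISNULL(Col02,'') + ' ' +
    ISNULL(Col03,'') + ' ' +
    ISNULL(Col04,'') + ' ' AS SearchData
FROM tblMyData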

Related

Select query to find all unique keys from a table where any word from a list appears in a free text field?

Looking for a little bit of SQL-foo to help find the most efficient way to do this query.
I have a table with two columns, ID and a small character field (<300 chars). The ID field is not unique, and I would like the result to be a distinct list of ID numbers. I also have an input list of words that I want to query on, say 'foo', 'bar' as the base case. For a result to be valid, it also must have at least one matching row for each word that is input.
What is a clean and efficient way to write this as one query? I am also open to multiple queries if there is no single-query way to execute it efficiently.
Please note that in the specific environment I am working with I cannot use more than 10 subqueries, and I may have 10 or more words provided as input (although I may be able to limit the input to 10 as long as the user is aware of this). Also note that I cannot use the 'IN' clause if it is possible that the list of values in it grows to be larger than a few thousand. I am querying a table with potentially millions of ID-text pairs.
Thanks for any and all advice!
Use a UDF that returns a table:
Consider writing a user-defined function (UDF) that takes a string containing all values that you wish to search for, separated by a delimiter. The UDF would split the data in the string and return it as a table. Then, include the table that the UDF returns as a join on the table in question.
Here's an example: http://everysolution.wordpress.com/2011/07/28/udf-to-split-a-delimited-string-and-return-it-as-a-table/
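A hypothetical usage sketch, assuming the linked UDF is dbo.Split(@list, @delimiter) returning one row per piece in a column named value, and the two-column table is tbl(id, wordfield):
SELECT DISTINCT t.id
FROM tbl t
INNER JOIN dbo.Split('foo,bar', ',') s
    ON s.value = t.wordfield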
If that small character field is always one word and you're looking for an exact match with a word in your list, I don't see why the below would not work. That is, if you're looking for IDs with 'foo', do you want only IDs whose field is exactly 'foo', or might there be 'fooish', which should also be a match? In the latter case this won't work; in the former it should.
The query below assumes:
That your 2 column table is called "tbl"
That you can put the list of these 'input' words into a table; in my example below this other table is called "othertbl". It should contain however many words you're searching on, and it can be over 1,000 (the exists subquery doesn't have that limitation)
As stated before, I am assuming you are looking for exact matches on the 2nd column of "tbl", not partial or fuzzy matches
For performance reasons, you'll want to ensure that tbl.wordfield and othertbl.word are indexed (whatever the column names actually are)
select distinct id
from tbl
where exists
(select 'x' from othertbl where othertbl.word = tbl.wordfield)
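Note that the exists form returns IDs matching any word in othertbl. If the stated requirement (at least one matching row for each input word) must hold, a relational-division sketch over the same assumed tables would be:
select t.id
from tbl t
inner join othertbl o on o.word = t.wordfield
group by t.id
having count(distinct o.word) = (select count(*) from othertbl)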

Is there a way to search within multiple tables in SQL

Right now I have 100 tables in SQL and I am looking for a specific string value in all tables, and I do not know which column it is in.
select * from table1, table2 where column1 = 'MyLostString' will not work because I do not know which column it has to be in.
Is there a SQL query for that, or must I brute-force search every table and every column for 'MyLostString'?
If I were to brute-force search across all tables, is there an efficient query for that?
For instance:
select * from table3 where allcolumns = 'MyLostString'
It is the defining feature of an RDBMS (or at least one of them) that the meaning of a value depends on the column it is in. E.g., the value 17 will have quite different meanings if it stands in a customer_id column than in the product_id of a fictional orders table.
This leads to the fact that RDBMSs are not well equipped to search for a value no matter in which column of which tables it might be used.
My recommendation is to first study the data model to try to find out which column of which table should be holding the value. If this really fails, you have a problem much worse than a "lost string".
The last resort is to transform the DB into something better suited for full-text search ... such as a flat file. You might want to try mydbexportcommand --options | grep -C10 'My lost string' or friends.
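If you do end up brute-forcing it, a hedged sketch of one common trick (not from the answers above; assumes MySQL, with 'your_database' as a placeholder) is to let the catalog generate one probe query per text column, then run the statements it prints:
SELECT CONCAT('SELECT ''', table_name, '.', column_name, ''' AS hit, COUNT(*) AS matches FROM ',
              table_schema, '.', table_name,
              ' WHERE ', column_name, ' = ''MyLostString'';') AS stmt
FROM information_schema.columns
WHERE table_schema = 'your_database'
  AND data_type IN ('char', 'varchar', 'text');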

Sql Un-Wizardry: Compare values from one list in another

I have a comparison I'd like to make more efficient in SQL.
The input field (fldInputField) is a comma separated list of "1,3,4,5"
The database has a field (fldRoleList) which contains "1,2,3,4,5,6,7,8"
So, for the first occurrence of fldInputField within fldRoleList, tell us which value it was.
Is there a way to achieve the following in MySQL or a Stored Procedure?
pseudo-code:
SELECT *
FROM aTable t1
WHERE fldInputField in t1.fldRoleList
I'm guessing there might be some functions that are best suited for this type of comparison? I couldn't find anything in the search; if someone can direct me, I'll delete the question... Thanks!
UPDATE: This isn't the ideal (or good) way to do things. It's inherited code and we are simply trying to put in a quick fix while we look at building in the logic to deal with this via normalized rows. Luckily this isn't heavily used code.
I agree with @Ken White's answer that comma-delimited lists have no place in a normalized database design.
The solution would be simpler and perform better if you stored the fldRoleList as multiple rows in a dependent table:
SELECT t1.*, r1.fldRole
FROM aTable t1 JOIN aTableRoles r1 USING (aTable_id)
WHERE FIND_IN_SET(r1.fldRole, fldInputField);
(see the MySQL function FIND_IN_SET())
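For reference, FIND_IN_SET(needle, haystack) returns the 1-based position of needle within the comma-separated haystack string, or 0 when it is absent, so it is truthy exactly when a match exists:
SELECT FIND_IN_SET('3', '1,3,4,5'); -- 2
SELECT FIND_IN_SET('9', '1,3,4,5'); -- 0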
But that outputs multiple rows if multiple roles match the comma-separated input string. If you need to restrict the result to one row per aTable entry, with the first matching role:
SELECT t1.*, MIN(r1.fldRole) AS First_fldRole
FROM aTable t1 JOIN aTableRoles r1 USING (aTable_id)
WHERE FIND_IN_SET(r1.fldRole, fldInputField)
GROUP BY t1.aTable_id;
You have a terrible schema design, you know. Comma-delimited lists have no business in a DB.
That being said... You're looking for LIKE.
SELECT * FROM aTable t1 WHERE t1.fldRoleList LIKE CONCAT(fldInputField, '%')
If the content might not always match at the beginning, add another wildcard before fldInputField.
SELECT * FROM aTable t1 WHERE t1.fldRoleList LIKE CONCAT('%', fldInputField, '%')

What is the reason not to use select *?

I've seen a number of people claim that you should specifically name each column you want in your select query.
Assuming I'm going to use all of the columns anyway, why would I not use SELECT *?
Even considering the question *SQL query - Select * from view or Select col1, col2, … colN from view*, I don't think this is an exact duplicate as I'm approaching the issue from a slightly different perspective.
One of our principles is to not optimize before it's time. With that in mind, it seems like using SELECT * should be the preferred method until it is proven to be a resource issue or the schema is pretty much set in stone. Which, as we know, won't occur until development is completely done.
That said, is there an overriding issue to not use SELECT *?
The essence of the quote about not prematurely optimizing is to go for simple and straightforward code and then use a profiler to point out the hot spots, which you can then optimize to be efficient.
When you use select * you make it impossible to profile; therefore you're not writing clear and straightforward code and you are going against the spirit of the quote. select * is an anti-pattern.
So selecting columns is not a premature optimization. A few things off the top of my head:
If you specify columns in a SQL statement, the SQL execution engine will raise an error if one of those columns is removed from the table and the query is executed.
You can more easily scan code to see where a column is being used.
You should always write queries to bring back the least amount of information.
As others mention, if you use ordinal column access you should never use select *.
If your SQL statement joins tables, select * gives you all columns from all tables in the join.
The corollary is that using select * ...
The columns used by the application are opaque
DBAs and their query profilers are unable to help with your application's poor performance
The code is more brittle when changes occur
Your database and network suffer because they bring back too much data (I/O)
Database engine optimizations are minimal, as you're bringing back all data regardless (logical)
Writing correct SQL is just as easy as writing Select *. So the real lazy person writes proper SQL, because they don't want to revisit the code and try to remember what they were doing when they did it. They don't want to explain every bit of code to the DBAs. They don't want to explain to their clients why the application runs like a dog.
If your code depends on the columns being in a specific order, your code will break when there are changes to the table. Also, you may be fetching too much from the table when you select *, especially if there is a binary field in the table.
Just because you are using all the columns now, it doesn't mean someone else isn't going to add an extra column to the table.
It also adds overhead to execution plan caching, since the engine has to fetch the metadata about the table to know what columns are in *.
One major reason is that if you ever add/remove columns from your table, any query/procedure that is making a SELECT * call will now be getting more or less columns of data than expected.
In a roundabout way you are breaking the modularity rule about using strict typing wherever possible. Explicit is almost universally better.
Even if you now need every column in the table, more could be added later, which will be pulled down every time you run the query and could hurt performance. It hurts performance because:
You are pulling more data over the wire; and
You might defeat the optimizer's ability to pull the data right out of the index (for queries on columns that are all part of an index) rather than doing a lookup in the table itself.
When TO use select *
When you explicitly NEED every column in the table, as opposed to needing every column in the table THAT EXISTED AT THE TIME YOU WROTE THE QUERY. For example, if you were writing a DB management app that needed to display the entire contents of the table (whatever they happened to be), you might use that approach.
There are a few reasons:
If the number of columns in a database changes and your application expects there to be a certain number...
If the order of columns in a database changes and your application expects them to be in a certain order...
Memory overhead. 8 unnecessary INTEGER columns would add 32 bytes of wasted memory. That doesn't sound like a lot, but this is for each query and INTEGER is one of the small column types... the extra columns are more likely to be VARCHAR or TEXT columns, which adds up quicker.
Network overhead. Related to memory overhead: if I issue 30,000 queries and have 8 unnecessary INTEGER columns, I've wasted 960kB of bandwidth. VARCHAR and TEXT columns are likely to be considerably larger.
Note: I chose INTEGER in the above example because INTEGER columns have a fixed size of 4 bytes.
If your application gets data with SELECT * and the table structure in the database is changed (say a column is removed), your application will fail in every place that you reference the missing field. If you instead include all the columns in your query, your application will break in the (hopefully) one place where you initially get the data, making the fix easier.
That being said, there are a number of situations in which SELECT * is desirable. One is a situation that I encounter all the time, where I need to replicate an entire table into another database (like SQL Server to DB2, for example). Another is an application written to display tables generically (i.e. without any knowledge of any particular table).
I actually noticed a strange behaviour when I used select * in views in SQL Server 2005.
Run the following query and you will see what I mean.
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[starTest]') AND type in (N'U'))
DROP TABLE [dbo].[starTest]
CREATE TABLE [dbo].[starTest](
[id] [int] IDENTITY(1,1) NOT NULL,
[A] [varchar](50) NULL,
[B] [varchar](50) NULL,
[C] [varchar](50) NULL
) ON [PRIMARY]
GO
insert into dbo.starTest
select 'a1','b1','c1'
union all select 'a2','b2','c2'
union all select 'a3','b3','c3'
go
IF EXISTS (SELECT * FROM sys.views WHERE object_id = OBJECT_ID(N'[dbo].[vStartest]'))
DROP VIEW [dbo].[vStartest]
go
create view dbo.vStartest as
select * from dbo.starTest
go
IF EXISTS (SELECT * FROM sys.views WHERE object_id = OBJECT_ID(N'[dbo].[vExplicittest]'))
DROP VIEW [dbo].[vExplicittest]
go
create view dbo.[vExplicittest] as
select a,b,c from dbo.starTest
go
select a,b,c from dbo.vStartest
select a,b,c from dbo.vExplicitTest
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[starTest]') AND type in (N'U'))
DROP TABLE [dbo].[starTest]
CREATE TABLE [dbo].[starTest](
[id] [int] IDENTITY(1,1) NOT NULL,
[A] [varchar](50) NULL,
[B] [varchar](50) NULL,
[D] [varchar](50) NULL,
[C] [varchar](50) NULL
) ON [PRIMARY]
GO
insert into dbo.starTest
select 'a1','b1','d1','c1'
union all select 'a2','b2','d2','c2'
union all select 'a3','b3','d3','c3'
select a,b,c from dbo.vStartest
select a,b,c from dbo.vExplicittest
Compare the results of the last 2 select statements.
I believe what you will see is a result of Select * referencing columns by index instead of name.
If you rebuild the view it will work fine again.
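In SQL Server, one way to do that rebuild is sp_refreshview, which rebinds the view's column metadata:
EXEC sp_refreshview N'dbo.vStartest';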
EDIT
I have added a separate question, *“select * from table” vs “select colA, colB, etc. from table” interesting behaviour in SQL Server 2005*, to look into that behaviour in more detail.
You might join two tables and use column A from the second table. If you later add column A to the first table (with same name but possibly different meaning) you'll most likely get the values from the first table and not the second one as earlier. That won't happen if you explicitly specify the columns you want to select.
Of course specifying the columns also sometimes causes bugs if you forget to add the new columns to every select clause. If the new column is not needed every time the query is executed, it may take some time before the bug gets noticed.
I understand where you're going regarding premature optimization, but that really only goes to a point. The intent is to avoid unnecessary optimization in the beginning. Are your tables unindexed? Would you use nvarchar(4000) to store a zip code?
As others have pointed out, there are other positives to specifying each column you intend to use in the query (such as maintainability).
When you're specifying columns, you're also tying yourself to a specific set of columns and making yourself less flexible, making Feuerstein roll over in, well, wherever he is. Just a thought.
SELECT * is not always evil. In my opinion, at least. I use it quite often for dynamic queries returning a whole table, plus some computed fields.
For instance, I want to compute geographical geometries from a "normal" table, that is, a table without any geometry field but with fields containing coordinates.
I use postgresql, and its spatial extension postgis. But the principle applies for many other cases.
An example:
a table of places, with coordinates stored in fields labeled x, y, z:
CREATE TABLE places (place_id integer, x numeric(10, 3), y numeric(10, 3), z numeric(10, 3), description varchar);
let's feed it with a few example values:
INSERT INTO places (place_id, x, y, z, description)
VALUES
(1, 2.295, 48.863, 64, 'Paris, Place de l''Étoile'),
(2, 2.945, 48.858, 40, 'Paris, Tour Eiffel'),
(3, 0.373, 43.958, 90, 'Condom, Cathédrale St-Pierre');
I want to be able to map the contents of this table, using some GIS client. The normal way is to add a geometry field to the table, and build the geometry, based on the coordinates.
But I would prefer to get a dynamic query: this way, when I change coordinates (corrections, more accuracy, etc.), the objects mapped actually move, dynamically.
So here is the query with the SELECT *:
CREATE OR REPLACE VIEW places_points AS
SELECT *,
GeomFromewkt('SRID=4326; POINT ('|| x || ' ' || y || ' ' || z || ')')
FROM places;
Refer to PostGIS for GeomFromewkt() function usage.
Here is the result:
SELECT * FROM places_points;
place_id | x | y | z | description | geomfromewkt
----------+-------+--------+--------+------------------------------+--------------------------------------------------------------------
1 | 2.295 | 48.863 | 64.000 | Paris, Place de l'Étoile | 01010000A0E61000005C8FC2F5285C02405839B4C8766E48400000000000005040
2 | 2.945 | 48.858 | 40.000 | Paris, Tour Eiffel | 01010000A0E61000008FC2F5285C8F0740E7FBA9F1D26D48400000000000004440
3 | 0.373 | 43.958 | 90.000 | Condom, Cathédrale St-Pierre | 01010000A0E6100000AC1C5A643BDFD73FB4C876BE9FFA45400000000000805640
(3 rows)
The rightmost column can now be used by any GIS program to properly map the points.
If, in the future, some fields get added to the table: no worries, I just have to run the same VIEW definition again.
I wish the definition of the VIEW could be kept "as is", with the *, but alas it is not the case: this is how it is internally stored by postgresql:
SELECT places.place_id, places.x, places.y, places.z, places.description, geomfromewkt(((((('SRID=4326; POINT ('::text || places.x) || ' '::text) || places.y) || ' '::text) || places.z) || ')'::text) AS geomfromewkt FROM places;
Even if you use every column but address the result row as an array by numeric index, you will have problems if another column is added later on.
So basically it is a question of maintainability! If you don't use the * selector you will not have to worry about your queries.
Selecting only the columns you need keeps the dataset in memory smaller and therefore keeps your application faster.
Also, a lot of tools (e.g. stored procedures) cache query execution plans too. If you later add or remove a column (particularly easy if you're selecting off a view), the tool will often error when it doesn't get back results that it expects.
It makes your code more ambiguous and more difficult to maintain, because you're adding extra unused data to the domain and it's not clear which of it you've intended and which not. (It also suggests that you might not know, or care.)
To answer your question directly: do not use "SELECT *" when it makes your code more fragile to changes in the underlying tables. Your code should break only when a change is made to the table that directly affects the requirements of your program.
Your application should take advantage of the abstraction layer that Relational access provides.
I don't use SELECT * simply because it is nice to see and know what fields I am retrieving.
It's generally bad to use 'select *' inside of views, because you will be forced to recompile the view in the event of a table column change. If the underlying table columns of a view change, you will get errors for non-existent columns until you go back and recompile.
It's OK when you're doing exists(select * ...), since it never gets expanded. Otherwise it's really only useful when exploring tables with temporary select statements, or if you have a CTE defined above and you want every column without typing them all out again.
Just to add one thing that no one else has mentioned: select * returns all the columns, and someone may later add a column that you don't necessarily want the users to be able to see, such as who last updated the data, a timestamp, or notes that only managers (not all users) should see.
Further, when adding a column, the impact on existing code should be reviewed and considered to see if changes are needed based on what information is stored in the column. By using select *, that review will often be skipped, because the developer will assume that nothing will break. And indeed nothing may explicitly appear to break, but queries may now start returning the wrong thing. Just because nothing explicitly breaks doesn't mean that there should not have been changes to the queries.
because "select * " will waste memory when you don't need all the fields.But for sql server, their performence are the same.

Use a LIKE clause in part of an INNER JOIN

Can/Should I use a LIKE criteria as part of an INNER JOIN when building a stored procedure/query? I'm not sure I'm asking the right thing, so let me explain.
I'm creating a procedure that is going to take a list of keywords to be searched for in a column that contains text. If I was sitting at the console, I'd execute it as such:
SELECT Id, Name, Description
FROM dbo.Card
WHERE Description LIKE '%warrior%'
OR
Description LIKE '%fiend%'
OR
Description LIKE '%damage%'
But a trick I picked up a little while ago to do "strongly typed" list parsing in a stored procedure is to parse the list into a table variable/temporary table, converting it to the proper type and then doing an INNER JOIN against that table in my final result set. This works great when sending, say, a list of integer IDs to the procedure. I wind up having a final query that looks like this:
SELECT Id, Name, Description
FROM dbo.Card
INNER JOIN #tblExclusiveCard ON dbo.Card.Id = #tblExclusiveCard.CardId
I want to use this trick with a list of strings. But since I'm looking for a particular keyword, I am going to use the LIKE clause. So ideally I'm thinking I'd have my final query look like this:
SELECT Id, Name, Description
FROM dbo.Card
INNER JOIN #tblKeyword ON dbo.Card.Description LIKE '%' + #tblKeyword.Value + '%'
Is this possible/recommended?
Is there a better way to do something like this?
The reason I'm putting wildcards on both ends of the clause is because there are "archfiend", "beast-warrior", "direct-damage" and "battle-damage" terms that are used in the card texts.
I'm getting the impression that depending on the performance, I can either use the query I specified or use a full-text keyword search to accomplish the same task?
Other than having the server do a text index on the fields I want to text search, is there anything else I need to do?
Try this
select * from Table_1 a
left join Table_2 b on b.type LIKE '%' + a.type + '%'
This practice is not ideal. Use with caution.
Your first query will work but will require a full table scan because any index on that column will be ignored. You will also have to do some dynamic SQL to generate all your LIKE clauses.
Try a full-text search if you're using SQL Server, or check out one of the Lucene implementations. Joel talked about his success with it recently.
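For example, a minimal sketch of the full-text form in SQL Server, assuming a full-text index has already been created on dbo.Card.Description:
SELECT Id, Name, Description
FROM dbo.Card
WHERE CONTAINS(Description, '"warrior" OR "fiend" OR "damage"')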
Try it:
select * from table11 a inner join table2 b on b.id like '%' + a.id + '%' where a.city = 'abc'
It works for me. :-)
It seems like you are looking for full-text search. Because you want to query a set of keywords against the card description and find any hits? Correct?
Personally, I have done it before, and it has worked out well for me. The only issue I could see is possibly with an unindexed column, but I think you would have the same issue with a WHERE clause.
My advice to you is to just look at the execution plans between the two. I'm sure which one is better will differ depending on the situation, just like all good programming problems.
@Dillie-O
How big is this table?
What is the data type of Description field?
If either is small, a full-text search will be overkill.
@Dillie-O
Maybe not the answer you were looking for, but I would advocate a schema change...
proposed schema:
create table name(
    nameID int identity(1,1)
    ,name varchar(50))

create table description(
    descID int identity(1,1)
    ,descText varchar(50)) -- something reasonable; and to make the most of it, always lower-case your values

create table nameDescJunc(
    nameID int
    ,descID int)
This will let you use indexes without having to implement a bolt-on solution, and keeps your data atomic.
related: Recommended SQL database design for tags or tagging
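A hedged usage sketch against the proposed schema (the keyword 'warrior' is just an example): to find every name whose description words include it, join through the junction table:
select distinct n.name
from name n
inner join nameDescJunc j on j.nameID = n.nameID
inner join description d on d.descID = j.descID
where d.descText = 'warrior'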
a trick I picked up a little while ago to do "strongly typed" list parsing in a stored procedure is to parse the list into a table variable/temporary table
I think what you might be alluding to here is to put the keywords to include into a table then use relational division to find matches (could also use another table for words to exclude). For a worked example in SQL see Keyword Searches by Joe Celko.
Performance will depend on the actual server that you use, on the schema of the data, and on the amount of data. With current versions of MS SQL Server, that query should run just fine (MS SQL Server 7.0 had issues with that syntax, but it was addressed in SP2).
Have you run that code through a profiler? If the performance is fast enough and the data has the appropriate indexes in place, you should be all set.
LIKE '%fiend%' will never use an index seek; LIKE 'fiend%' will. Put simply, a leading-wildcard search is not sargable.
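To see the difference, compare the plans of these two (an index on Description is assumed):
-- can seek on an index over Description:
SELECT Id FROM dbo.Card WHERE Description LIKE 'fiend%'
-- the leading wildcard forces a scan:
SELECT Id FROM dbo.Card WHERE Description LIKE '%fiend%'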
Try this:
SELECT Id, Name, Description
FROM dbo.Card
INNER JOIN #tblKeyword ON dbo.Card.Description LIKE '%' + #tblKeyword.Value + '%'