In MySql, find strings with a given prefix - sql

In MySql, I want to locate records where the string value in one of the columns begins with (or is the same as) a query string. The column is indexed with the appropriate collation order. There is no full-text search index on the column though.
A good solution will:
Use the index on the column. Solutions that need to iterate over all the records in the table aren't good enough (several million records in the table)
Work with strings with any character values. Some of the column values contain punctuation characters. The query string might too. Keep this in mind if your solution includes regex characters or similar. The strings are UTF-8 encoded, but if your solution only works with ASCII it could still be useful.
The closest I have at this point is
SELECT * FROM TableName WHERE ColumnName BETWEEN query AND <<query+1>>
Where <<query+1>> is pre-computed to lexicographically follow query in the collation order. For example, if query is "o hai" then <<query+1>> is "o haj".

Surprisingly, a LIKE query will use an index just fine if you're doing a prefix search.
SELECT * from TableName Where ColumnName LIKE 'o hai%'
will indeed use an index since it does not begin with a wildcard character.
This (and other behavior) is documented in the "How MySQL uses Indexes" doc:
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
You will need to escape the '%' character and follow normal quoting rules, but other than that any utf-8 input prefix ought to work and do the job. Run an EXPLAIN query to make sure, sometimes other reasons can preclude indexes from working such as needing to do an OPTIMIZE TABLE to update index cardinalities (though this can take ages and locks your table)

Try this:
SELECT * FROM tablename WHERE columname LIKE CONCAT(query, '%');

Related

MS SQL fulltext search with ignoring special characters

I have a following (simplified) database structure:
[places]
name NVARCHAR(255)
description TEXT (usually quite a lot of text)
region_id INT FK
[regions]
id INT PK
name NVARCHAR(255)
[regions_translations]
lang_code NVARCHAR(5) FK
label NVARCHAR(255)
region_id INT FK
In real db I have few more fields in [places] table to search in, and [countries] table with similar structure to [regions].
My requirements are:
Search using name, description and region label, using the same behaviour as name LIKE '%text%' OR description LIKE '%text5' OR regions_translations.label LIKE '%text%'
Ignore all special characters like Ą, Ć, Ó, Š, Ö, Ü, etc. so for example, when someone search for
PO ZVAIGZDEM I return a place with name PO ŽVAIGŽDĖM - but of course also return this record, when user uses proper characters with accents.
Quite fast. ;)
I had a few approaches to solve this issue.
Create new column 'searchable_content', normalize text (so replace Ą -> A, Ö -> O and so on) and just do simple SELECT ... FROM places WHERE searchable_content LIKE '%text%' but it was slow
Add fulltext search index to table places and regions_translations - it was faster, but I could not find a way to ignore special characters (characters are from various of languages, so specyfying index language will not work)
Create new column as in first attempt, and addfulltext index only on that column - it was faster then attempt 1 (probably because I do not need to join tables) and I could manually normalize the content, but I feel like it's not a great solution.
Question is - what is the best approach here?
My top priority is to ignore special characters.
EDIT:
ALTER FULLTEXT CATALOG [catalog_name] REBUILD WITH ACCENT_SENSITIVITY = OFF
Probably is a solution to my issue with special characters (need to test it a bit more) - I query too fast, and index did not rebuild, that's why I did not get any records.
You can use the COLLATE clause on a column to specify a sql collation that will treat these special characters as their non-accented counterparts. Think of it as essentially casting one data type as another, except you're casting é as e (for example). You can use the same tool to return case sensitive or case insensitive results.
The documentation talks a little more about it, and you can do a search to find exactly which collation works best for you.
https://learn.microsoft.com/en-us/sql/t-sql/statements/collations?view=sql-server-ver16

Wildcard Search for a numeric range

I am trying to filter out a column (plan_name) which have internet plans of customers. Eg (200GB Fast Unlimited,Free Additional Mailbox,Unlimited VDSL...). I specifically want to filter out plans which do not have any numbers in them. The column is of type varchar2 and I'm querying an Oracle database. I have written the following code:
SELECT *
FROM plans
WHERE plan_name NOT LIKE '%[0-9]%'
However, this code still returns plans like 200GB fast that have numbers in them? Can anyone explain why this is?
Oracle's like syntax does not support the kind of pattern matching you are trying. This is SQL Server syntax. Oracle interprets the pattern as a litteral '[0-9]' (which, obviously, something like '200GB'does not match).
However, unlike SQL Server, Oracle has proper suport for regular expression, through the regexp_* functions. If you want values that contain no digit, you can do:
where not regexp_like(plan_name, '\d')

Fastest way to find string by substring in SQL?

I have huge table with 2 columns: Id and Title. Id is bigint and I'm free to choose type of Title column: varchar, char, text, whatever. Column Title contains random text strings like "abcdefg", "q", "allyourbasebelongtous" with maximum of 255 chars.
My task is to get strings by given substring. Substrings also have random length and can be start, middle or end of strings. The most obvious way to perform it:
SELECT * FROM t LIKE '%abc%'
I don't care about INSERT, I need only to do fast selects. What can I do to perform search as fast as possible?
I use MS SQL Server 2008 R2, full text search will be useless, as far as I see.
if you dont care about storage, then you can create another table with partial Title entries, beginning with each substring (up to 255 entries per normal title ).
in this way, you can index these substrings, and match only to the beginning of the string, should greatly improve performance.
If you want to use less space than Randy's answer and there is considerable repetition in your data, you can create an N-Ary tree data structure where each edge is the next character and hang each string and trailing substring in your data on it.
You number the nodes in depth first order. Then you can create a table with up to 255 rows for each of your records, with the Id of your record, and the node id in your tree that matches the string or trailing substring. Then when you do a search, you find the node id that represents the string you are searching for (and all trailing substrings) and do a range search.
Sounds like you've ruled out all good alternatives.
You already know that your query
SELECT * FROM t WHERE TITLE LIKE '%abc%'
won't use an index, it will do a full table scan every time.
If you were sure that the string was at the beginning of the field, you could do
SELECT * FROM t WHERE TITLE LIKE 'abc%'
which would use an index on Title.
Are you sure full text search wouldn't help you here?
Depending on your business requirements, I've sometimes used the following logic:
Do a "begins with" query (LIKE 'abc%') first, which will use an index.
Depending on if any rows are returned (or how many), conditionally move on to the "harder" search that will do the full scan (LIKE '%abc%')
Depends on what you need, of course, but I've used this in situations where I can show the easiest and most common results first, and only move on to the more difficult query when necessary.
You can add another calculated column on the table: titleLength as len(title) PERSISTED. This would store the length of the "title" column. Create an index on this.
Also, add another calculated column called: ReverseTitle as Reverse(title) PERSISTED.
Now when someone searches for a keyword, check if the length of keyword is same as titlelength. If so, do a "=" search. If length of keyword is less than the length of the titleLength, then do a LIKE. But first do a title LIKE 'abc%', then do a reverseTitle LIKE 'cba%'. Similar to Brad's approach - ie you do the next difficult query only if required.
Also, if the 80-20 rules applies to your keywords/ substrings (ie if most of the searches are on a minority of the keywords), then you can also consider doing some sort of caching. For eg: say you find that many users search for the keyword "abc" and this keyword search returns records with ids 20, 22, 24, 25 - you can store this in a separate table and have this indexed.
And now when someone searches for a new keyword, first look in this "cache" table to see if the search was already performed by an earlier user. If so, no need to look again in main table. Simply return results from "cache" table.
You can also combine the above with SQL Server TextSearch. (assuming you have a valid reason not to use it). But you could nevertheless use Text search first to shortlist the result set. and then run a SQL query against your table to get exact results using the Ids returned by the TExt Search as a parameter along with your keyword.
All this is obviously assuming you have to use SQL. If not, you can explore something like Apache Solr.
Create index view there is new feature in sql create index on the column that you need to search and use that view after in your search that will give your more faster result.
Use ASCII charset with clustered indexing the char column.
The charset influences the search performance because of the data
size on both ram and disk. The bottleneck is often I/O.
Your column is 255 characters long so you can use normal index on
your char field rather than full text, which is faster. Do not
select unnecessary columns in your select statement.
Lastly, add more RAM to the server and Increase cache size.
Do one thing, use primary key on specific column & index it in cluster form.
Then search using any method (wild card or = or any), it will search optimally because the table is already in clustered form, so it knows where he can find (because column is already in sorted form)

Various ways to query strings?

I'm working on an app that involves a lot of carefully designed strings. I'm in the process of designing the string format and for that I need to know what's possible and what's not when I'm querying the same data.
Which ones of these are possible with MySQL? .. and how do I accomplish them?
Results which contain this exact string -- not case sensitive
Results which contain this exact string -- case sensitive
Results which contain a similar string -- not case sensitive
Results which contain a similar string -- but individual characters must be of the same case
1. Results which contain this exact string -- not case sensitive
2. Results which contain this exact string -- case sensitive
These can both be accomplished. See the page documenting string functions in MySQL, in particular INSTR.
Case sensitivity is determined by the collation of a column. If you want values in a column to be compared in a case-sensitive fashion, then you give it a case-sensitive collation, as in the following example:
ALTER TABLE MyTable ADD MyColumn VARCHAR(10) CHARACTER SET ascii COLLATE ascii_general_cs NOT NULL
Conversely, if you want values in a column to be compared in a case-insensitive fashion, then give it a case-insensitive collation.
If you might want values in a column to be compared either way, then there are ways to do that too, though it's slightly more complicated.
3. Results which contain a similar string -- not case sensitive
4. Results which contain a similar string -- but individual characters must be of the same case
Depends what exactly you mean by "similar", but for some values of "similar" yes this is available. You will probably find it useful to consult the page I linked above.
SELECT this FROM that WHERE LOWER(string) = LOWER("blablabla");
SELECT this FROM that WHERE string = "blablabla";
SELECT this FROM that WHERE LOWER(string) LIKE LOWER("blablabla");
SELECT this FROM that WHERE string LIKE "blablabla";
Hope that's right.

How to SQL compare columns when one has accented chars?

I have two SQLite tables, that I would love to join them on a name column. This column contains accented characters, so I am wondering how can I compare them for join. I would like the accents dropped for the comparison to work.
You can influence the comparison of characters (such as ignoring case, ignoring accents) by using a Collation. SQLLite has only a few built in collations, although you can add your own.
SqlLite, Data types, Collating Sequences
SqlLite, Define new Collating Sequence
EDIT:
Given that it seems doubtful if Android supports UDFs and computed columns, here's another approach:
Add another column to your table, normalizedName
When your app writes out rows to your table, it normalizes name itself, removing accents and performing other changes. It saves the result in normalizedName.
You use normalizedName in your join.
As the normalization function is now in java, you should have few restrictions in coding it. Several examples for removing accents in java are given here.
There is an easy solution, but not very elegant.
Use the REPLACE function, to remove your accents. Exemple:
SELECT YOUR_COLUMN FROM YOUR_TABLE WHERE replace(replace(replace(replace(replace(replace(replace(replace(
replace(replace(replace( lower(YOUR_COLUMN), 'á','a'), 'ã','a'), 'â','a'), 'é','e'), 'ê','e'), 'í','i'),
'ó','o') ,'õ','o') ,'ô','o'),'ú','u'), 'ç','c') LIKE 'SEARCH_KEY%'
Where SEARCH_KEY is the key word that you wanna find on the column.
As mdma says, a possible solution would be a User-Defined-Function (UDF). There is a document here describing how to create such a function for SQLite in PHP. You could write a function called DROPACCENTS() which drops all the accents in the string. Then, you could join your column with the following code:
SELECT * FROM table1
LEFT JOIN table2
ON DROPACCENTS(table1.column1) = DROPACCENTS(table2.column1)
Much similar to how you would use the UCASE() function to perform a case-insensitive join.
Since you cannot use PHP on Android, you would have to find another way to create the UDF. Although it has been said that creating a UDF is not possible on Android, there is another Stack Overflow article claiming that a content provider could do the trick. The latter sounds slightly complicated, but promising.
Store a special "neutral" column without accented characters and compare / search only this column.