MS SQL fulltext search with ignoring special characters - sql

I have the following (simplified) database structure:
[places]
name NVARCHAR(255)
description TEXT (usually quite a lot of text)
region_id INT FK
[regions]
id INT PK
name NVARCHAR(255)
[regions_translations]
lang_code NVARCHAR(5) FK
label NVARCHAR(255)
region_id INT FK
In the real database I have a few more fields in the [places] table to search, and a [countries] table with a structure similar to [regions].
My requirements are:
Search using name, description and region label, with the same behaviour as name LIKE '%text%' OR description LIKE '%text%' OR regions_translations.label LIKE '%text%'
Ignore all special characters like Ą, Ć, Ó, Š, Ö, Ü, etc. For example, when someone searches for PO ZVAIGZDEM, I return a place named PO ŽVAIGŽDĖM, and of course I also return this record when the user types the proper accented characters.
Quite fast. ;)
I have tried a few approaches to solve this issue.
Create a new column 'searchable_content', normalize the text (replace Ą -> A, Ö -> O and so on) and do a simple SELECT ... FROM places WHERE searchable_content LIKE '%text%', but it was slow.
Add a fulltext search index to the places and regions_translations tables. It was faster, but I could not find a way to ignore special characters (the characters come from various languages, so specifying an index language will not work).
Create a new column as in the first attempt, and add a fulltext index only on that column. It was faster than attempt 1 (probably because I do not need to join tables) and I could normalize the content manually, but I feel like it's not a great solution.
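The manual normalization used in the first and third approaches could be done in application code before the value is written to 'searchable_content'. The following is only a sketch in Python, not T-SQL; the helper name strip_accents is made up for illustration. It relies on Unicode decomposition, which handles letters such as Ą, Ć, Ó, Š, Ö, Ü but not letters that have no decomposition (for example Ł or Ø), so those would still need an explicit mapping.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep every other character unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("PO ŽVAIGŽDĖM"))  # prints: PO ZVAIGZDEM
```

The same function would have to be applied to the user's search text, so that both sides of the comparison are normalized in the same way.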
Question is - what is the best approach here?
My top priority is to ignore special characters.
EDIT:
ALTER FULLTEXT CATALOG [catalog_name] REBUILD WITH ACCENT_SENSITIVITY = OFF
This is probably a solution to my issue with special characters (I need to test it a bit more). I queried too soon, before the index had finished rebuilding, which is why I did not get any records.

You can use the COLLATE clause on a column to specify a sql collation that will treat these special characters as their non-accented counterparts. Think of it as essentially casting one data type as another, except you're casting é as e (for example). You can use the same tool to return case sensitive or case insensitive results.
The documentation talks a little more about it, and you can do a search to find exactly which collation works best for you.
https://learn.microsoft.com/en-us/sql/t-sql/statements/collations?view=sql-server-ver16

Related

Fastest way to find string by substring in SQL?

I have huge table with 2 columns: Id and Title. Id is bigint and I'm free to choose type of Title column: varchar, char, text, whatever. Column Title contains random text strings like "abcdefg", "q", "allyourbasebelongtous" with maximum of 255 chars.
My task is to get strings by given substring. Substrings also have random length and can be start, middle or end of strings. The most obvious way to perform it:
SELECT * FROM t WHERE Title LIKE '%abc%'
I don't care about INSERT, I need only to do fast selects. What can I do to perform search as fast as possible?
I use MS SQL Server 2008 R2, full text search will be useless, as far as I see.
If you don't care about storage, you can create another table with partial Title entries, one for each substring that runs from some position to the end of the title (up to 255 entries per normal title).
This way you can index these substrings and match only against the beginning of the string, which should greatly improve performance.
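As a rough illustration of what that extra table would contain, here is a small Python sketch. The function name suffix_rows and the (Id, suffix) row layout are assumptions made for the example, not part of the answer.

```python
from __future__ import annotations

def suffix_rows(record_id: int, title: str) -> list[tuple[int, str]]:
    """Build one (record_id, suffix) row per starting position in title."""
    return [(record_id, title[start:]) for start in range(len(title))]

rows = suffix_rows(7, "abcdefg")

# A search for "cde" becomes a prefix match against the suffix column,
# e.g. suffix LIKE 'cde%', which is the form an index can serve.
matches = {rec for rec, suffix in rows if suffix.startswith("cde")}
```

A title of length n produces n rows, which is where the "up to 255 entries per title" figure comes from.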
If you want to use less space than Randy's answer and there is considerable repetition in your data, you can create an N-Ary tree data structure where each edge is the next character and hang each string and trailing substring in your data on it.
You number the nodes in depth first order. Then you can create a table with up to 255 rows for each of your records, with the Id of your record, and the node id in your tree that matches the string or trailing substring. Then when you do a search, you find the node id that represents the string you are searching for (and all trailing substrings) and do a range search.
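The tree idea can be sketched in memory as follows. This is a minimal Python illustration under stated assumptions: the names Node, build_index and search are invented here, and the sorted list of (node id, record Id) pairs stands in for the database table that would hold up to 255 rows per record.

```python
from bisect import bisect_left

class Node:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.first = 0       # this node's depth-first number
        self.last = 0        # largest number inside this node's subtree

def build_index(titles):
    """titles maps record Id -> Title; returns (root, sorted (node_id, Id) rows)."""
    root, ends = Node(), []
    for rec_id, title in titles.items():
        for start in range(len(title)):          # every trailing substring
            node = root
            for ch in title[start:]:
                node = node.children.setdefault(ch, Node())
            ends.append((rec_id, node))
    counter = 0
    def number(node):                            # depth-first numbering
        nonlocal counter
        node.first = counter
        counter += 1
        for ch in sorted(node.children):
            number(node.children[ch])
        node.last = counter - 1
    number(root)
    return root, sorted((node.first, rec_id) for rec_id, node in ends)

def search(root, rows, needle):
    """Walk the tree along needle, then range-search the numbered rows."""
    node = root
    for ch in needle:
        node = node.children.get(ch)
        if node is None:
            return set()
    lo = bisect_left(rows, (node.first,))
    hi = bisect_left(rows, (node.last + 1,))
    return {rec_id for _, rec_id in rows[lo:hi]}

root, rows = build_index({1: "abcdefg", 2: "q", 3: "xabcx"})
found = search(root, rows, "abc")
```

Because the numbering is depth-first, every string that continues the search text lands in one contiguous range of node ids, so the lookup is a single range scan over an indexed column.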
Sounds like you've ruled out all good alternatives.
You already know that your query
SELECT * FROM t WHERE TITLE LIKE '%abc%'
won't use an index, it will do a full table scan every time.
If you were sure that the string was at the beginning of the field, you could do
SELECT * FROM t WHERE TITLE LIKE 'abc%'
which would use an index on Title.
Are you sure full text search wouldn't help you here?
Depending on your business requirements, I've sometimes used the following logic:
Do a "begins with" query (LIKE 'abc%') first, which will use an index.
Depending on if any rows are returned (or how many), conditionally move on to the "harder" search that will do the full scan (LIKE '%abc%')
Depends on what you need, of course, but I've used this in situations where I can show the easiest and most common results first, and only move on to the more difficult query when necessary.
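The two-step logic above can be sketched like this. The example uses Python with an in-memory SQLite database purely so that it is runnable; SQLite stands in for SQL Server here, the function name tiered_search is made up, and the table t with columns Id and Title follows the question.

```python
import sqlite3

def tiered_search(conn, needle):
    """Try the index-friendly prefix match first; fall back to a full scan."""
    # Note: wildcards inside needle are not escaped in this sketch.
    prefix_sql = "SELECT Id, Title FROM t WHERE Title LIKE ? || '%' ORDER BY Id"
    rows = conn.execute(prefix_sql, (needle,)).fetchall()
    if rows:
        return rows
    # Only reached when the cheap query found nothing.
    contains_sql = ("SELECT Id, Title FROM t "
                    "WHERE Title LIKE '%' || ? || '%' ORDER BY Id")
    return conn.execute(contains_sql, (needle,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (Id INTEGER PRIMARY KEY, Title TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "abcdefg"), (2, "q"), (3, "xxabc")])

first_try = tiered_search(conn, "abc")   # satisfied by the prefix query
fallback = tiered_search(conn, "bc")     # needs the full scan
```

Whether to stop after the first query on any hit, or only when it returns enough rows, is a business decision, as the answer says.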
You can add a computed column to the table: titleLength AS LEN(title) PERSISTED. This stores the length of the title column. Create an index on it.
Also add another computed column: ReverseTitle AS REVERSE(title) PERSISTED.
Now when someone searches for a keyword, compare the length of the keyword with titleLength. If they are equal, do an "=" search. If the keyword is shorter than titleLength, do a LIKE: first title LIKE 'abc%', then reverseTitle LIKE 'cba%'. This is similar to Brad's approach, i.e. you run the next, more difficult query only if required.
Also, if the 80-20 rule applies to your keywords/substrings (i.e. most of the searches are for a minority of the keywords), you can consider some sort of caching. For example, say you find that many users search for the keyword "abc" and this search returns records with ids 20, 22, 24, 25: you can store this in a separate, indexed table.
Then when someone searches for a keyword, first look in this "cache" table to see whether the search was already performed by an earlier user. If so, there is no need to look in the main table again. Simply return the results from the "cache" table.
You can also combine the above with SQL Server full-text search (assuming you have a valid reason not to use it on its own). You could use full-text search first to shortlist the result set, and then run a SQL query against your table to get exact results, passing the Ids returned by the full-text search as a parameter along with your keyword.
All this is obviously assuming you have to use SQL. If not, you can explore something like Apache Solr.
Create an indexed view (a newer feature in SQL Server): create an index on the column you need to search, and then query that view in your search, which should give you faster results.
Use ASCII charset with clustered indexing the char column.
The charset influences the search performance because of the data
size on both ram and disk. The bottleneck is often I/O.
Your column is 255 characters long so you can use normal index on
your char field rather than full text, which is faster. Do not
select unnecessary columns in your select statement.
Lastly, add more RAM to the server and increase the cache size.
Use a primary key on the specific column and index it as a clustered index.
Then search using any method (wildcard, = or anything else); it will search optimally because the table is already clustered, so the engine knows where to find the rows (the column is already stored in sorted order).

Firebird configuration - turn case-sensitivity off

I'm looking to perform case-insensitive search in a Firebird database, without modifying actual queries. In other words, I'd like all my existing "SELECT/WHERE/LIKE" statements to retrieve BOB, Bob, and bob. Does Firebird configuration allow to modify this behavior?
Try using something like:
Imagine you have a table of persons like this one:
CREATE TABLE PERSONS (
PERS_ID INTEGER NOT NULL PRIMARY KEY,
LAST_NAME VARCHAR(50),
FIRST_NAME VARCHAR(50)
);
Now there is an application, which allows the user to search for persons by last name and/or first name. So the user inputs the last name of the person he is searching for.
We want this search to be case insensitive, i.e. no matter if the user enters "Presley", "presley", "PRESLEY", or even "PrESley", we always want to find the King.
Ah yes, and we want that search to be fast, please. So there must be an index speeding it up.
A simple way to do case insensitive comparisons is to uppercase both strings and then compare the uppercased versions of both strings.
Uppercasing has limitations, because some letters cannot be uppercased. Note also that there are languages/scripts where there is no such thing as case. So the technique described in this article will work best for European languages.
In order to get really perfect results one would need a case insensitive (CI) and/or accent insensitive (AI) collation. However, at the time of this writing (July 2006) there are only two Czech AI/CI collations for Firebird 2.0. The situation will hopefully improve over time.
(You should know the concepts of Character Sets and Collations in order to understand what comes next. I use the DE_DE collation in my examples, this is the collation for German/Germany in the ISO8859_1 character set.)
In order to get correct results from the UPPER() function that is built into Firebird, you must specify a collation. This can be in the DDL definition of the table:
CREATE TABLE PERSONS (
PERS_ID INTEGER NOT NULL PRIMARY KEY,
LAST_NAME VARCHAR(50) COLLATE DE_DE,
FIRST_NAME VARCHAR(50) COLLATE DE_DE
);
or it can be done when calling the UPPER() function:
SELECT UPPER (LAST_NAME COLLATE DE_DE) FROM PERSONS;
http://www.destructor.de/firebird/caseinsensitivesearch.htm
or you can edit your queries and add the lower() function
LOWER()
Available in: DSQL, ESQL, PSQL
Added in: 2.0
Description: Returns the lower-case equivalent of the input string. This function also correctly lowercases non-ASCII characters, even if the default (binary) collation is used. The character set must be appropriate though: with ASCII or NONE for instance, only ASCII characters are lowercased; with OCTETS, the entire string is returned unchanged.
Result type: VAR(CHAR)
Syntax:
LOWER (str)
Important
If the external function LOWER is declared in your database, it will obfuscate the internal function. To make the internal function available, DROP or ALTER the external function (UDF).
Example:
select field from table
where lower(Name) = 'bob'
http://www.firebirdsql.org/refdocs/langrefupd21-intfunc-lower.html
Necromancing.
Instead of a lowercase-field, you can specify the collation:
SELECT field FROM table
WHERE Name = 'bob' COLLATE UNICODE_CI
See
https://firebirdsql.org/refdocs/langrefupd21-collations.html
Maybe you can also specify it as the default character-set for the database, see
http://www.destructor.de/firebird/charsets.htm
Eventually I went with creating shadow columns containing lower-cased versions of required fields.

SQL Create table script from DBMS contains [ ]

If someone right-clicks a table in a database in Microsoft's SQL Server Management Studio and scripts the table to a query window, the CREATE TABLE code is displayed in the window. I noticed something like the following
CREATE TABLE [dbo].[Login](
[UserId] [int] NOT NULL,
[LoginName] nvarchar COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
etc
)
Is this how it appears in other DBMS too ?
Is this something specific to DBMS used ?
Is it just a fancy view only ?
Why are "[ ]" used around column names and table names. Simple TSQL definition would look something like
CREATE TABLE table_name (
Column#1 Datatype NOT NULL,
Column#2 Datatype NOT NULL UNIQUE,
-- etc...
)
Please don't mind my silly questions.
Thanks in advance,
Balaji S
Those are identifier quotes. They mark the contents as a database identifier (column name, table name, etc) and allow spaces, special characters, and reserved words to be used as identifiers. Usually none of those appear in identifiers so the ID quotes are, strictly speaking, unnecessary. When the text cannot be parsed because of special characters or reserved words, the ID quotes are required.
It's easier for automated tools to simply always use the ID quotes than figure out when you could get away without them.
Different database products use different characters for ID quotes. You often see the back-tick (`) used for this.
Just to add to Larry Lustig's excellent answer, SQL Server's 'create script' function spits out SQL code whose first priority is to be parser-friendly and that's why it always uses its preferred delimited identifier characters being square brackets, rather than say the double quotes used by the SQL-92 standard (SQL Server code can support the Standard delimited identifier by the use of SET QUOTED_IDENTIFIER ON but I don't think this can be specified as an option for output). Being easy on the human eye (use of whitespace, line breaks, etc) is merely a secondary consideration.
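To show why a tool finds it easier to always quote, here is a small Python sketch of what such a generator might do. The function name quote_identifier and the style labels are invented for this example. It covers the three delimiter styles mentioned above (square brackets, standard double quotes, back-ticks), and in each style the closing delimiter is escaped by doubling it.

```python
def quote_identifier(name: str, style: str = "mssql") -> str:
    """Wrap name in identifier quotes, doubling any embedded closing quote."""
    if style == "mssql":      # SQL Server square brackets
        return "[" + name.replace("]", "]]") + "]"
    if style == "ansi":       # SQL-92 double quotes
        return '"' + name.replace('"', '""') + '"'
    if style == "mysql":      # back-ticks
        return "`" + name.replace("`", "``") + "`"
    raise ValueError(f"unknown quoting style: {style}")

columns = ["UserId", "Login Name", "order"]
quoted = [quote_identifier(col) for col in columns]
```

Quoting every name unconditionally means the generator never has to know which names contain spaces or collide with reserved words such as "order".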

In MySql, find strings with a given prefix

In MySql, I want to locate records where the string value in one of the columns begins with (or is the same as) a query string. The column is indexed with the appropriate collation order. There is no full-text search index on the column though.
A good solution will:
Use the index on the column. Solutions that need to iterate over all the records in the table aren't good enough (several million records in the table)
Work with strings with any character values. Some of the column values contain punctuation characters. The query string might too. Keep this in mind if your solution includes regex characters or similar. The strings are UTF-8 encoded, but if your solution only works with ASCII it could still be useful.
The closest I have at this point is
SELECT * FROM TableName WHERE ColumnName BETWEEN query AND <<query+1>>
Where <<query+1>> is pre-computed to lexicographically follow query in the collation order. For example, if query is "o hai" then <<query+1>> is "o haj".
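The <<query+1>> value could be pre-computed along these lines. This Python sketch assumes plain code point ordering, which is a simplification: a real collation (case-insensitive, accent-insensitive, language-specific) may order strings differently, so the result would need checking against the column's actual collation. The function name upper_bound is made up for the example.

```python
def upper_bound(prefix: str) -> str:
    """Smallest string that sorts after every string starting with prefix,
    assuming code point order (an assumption, not the column's collation)."""
    chars = list(prefix)
    while chars:
        code = ord(chars[-1])
        if code < 0x10FFFF:
            code += 1
            if 0xD800 <= code <= 0xDFFF:   # skip the surrogate range
                code = 0xE000
            chars[-1] = chr(code)
            return "".join(chars)
        chars.pop()                        # last char is the maximum: carry
    raise ValueError("no upper bound exists for this prefix")

bound = upper_bound("o hai")
```

Since BETWEEN includes both ends, a row whose value is exactly the bound would also match; a half-open range (ColumnName >= query AND ColumnName < bound) avoids that.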
Surprisingly, a LIKE query will use an index just fine if you're doing a prefix search.
SELECT * from TableName Where ColumnName LIKE 'o hai%'
will indeed use an index since it does not begin with a wildcard character.
This (and other behavior) is documented in the "How MySQL uses Indexes" doc:
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
You will need to escape the '%' character and follow the normal quoting rules, but other than that any UTF-8 input prefix ought to work. Run an EXPLAIN query to make sure: sometimes other factors can prevent the index from being used, such as the table needing an OPTIMIZE TABLE to update index cardinalities (though this can take ages and locks your table).
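The escaping step might look like this on the application side. It is a sketch in Python with an invented function name, like_prefix_pattern, and it assumes the default LIKE escape character, the backslash; the underscore is escaped too because LIKE treats it as a single-character wildcard.

```python
def like_prefix_pattern(prefix: str, escape: str = "\\") -> str:
    """Turn a literal prefix into a LIKE pattern that matches it as a prefix."""
    # Escape the escape character first so it is not doubled twice.
    escaped = prefix.replace(escape, escape + escape)
    escaped = escaped.replace("%", escape + "%").replace("_", escape + "_")
    return escaped + "%"

pattern = like_prefix_pattern("100%_sure")
# Pass the pattern as a bound parameter, for example
#   SELECT * FROM TableName WHERE ColumnName LIKE %s
# so that string quoting is left to the database driver.
```

Only the trailing '%' added by the function acts as a wildcard; everything the user typed is matched literally.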
Try this:
SELECT * FROM TableName WHERE ColumnName LIKE CONCAT(query, '%');

Fulltext search (sql server 2005) works only on some fields

OK this is the situation..
I am enabling fulltext search on a table but it only works on some fields..
CREATE FULLTEXT CATALOG [defaultcatalog]
CREATE UNIQUE INDEX ui_staticid on static(id)
CREATE FULLTEXT INDEX ON static(title_gr LANGUAGE 19,title_en,description_gr LANGUAGE 19,description_en) KEY INDEX staticid ON [defaultcatalog] WITH CHANGE_TRACKING AUTO
Now, why does the following return results
Select * from static where freetext(description_en, N'str')
and this one does not (while both columns have text with str in it)?
Select * from static where freetext(description_gr, N'str')
(I have also tried it without the language specification, Greek in this case.)
(The collation of the database is Greek_CI_AS.)
btw
Select * from static where description_gr like N'%str%'
will work just fine ..
All fields are of type nvarchar, and the _gr fields hold both English and Greek text (which should not matter).
All help will be greatly appreciated
Just trying to figure out what's going on: what do you get with this query here?
SELECT * FROM static WHERE FREETEXT(*, N'str')
If you're not explicitly specifying any column to search in - does it give you the expected results?
Another point: I think you have a wrong language ID in your statement. According to SQL Server Books Online:
When specified as a string, language_term corresponds to the alias column value in the syslanguages system table. The string must be enclosed in single quotation marks, as in 'language_term'. When specified as an integer, language_term is the actual LCID that identifies the language.
and from what I found on the internet searching around, the LCID for Greek is 1032 - not 19. Can you try with 1032 instead of 19? Does that make a difference?
Marc