Firebird configuration - turn case-sensitivity off - sql

I'm looking to perform case-insensitive searches in a Firebird database without modifying the actual queries. In other words, I'd like all my existing "SELECT/WHERE/LIKE" statements to retrieve BOB, Bob, and bob. Does Firebird's configuration allow this behavior to be changed?

Try using something like:
Imagine you have a table of persons like this one:
CREATE TABLE PERSONS (
PERS_ID INTEGER NOT NULL PRIMARY KEY,
LAST_NAME VARCHAR(50),
FIRST_NAME VARCHAR(50)
);
Now there is an application, which allows the user to search for persons by last name and/or first name. So the user inputs the last name of the person he is searching for.
We want this search to be case insensitive, i.e. no matter if the user enters "Presley", "presley", "PRESLEY", or even "PrESley", we always want to find the King.
Ah yes, and we want that search to be fast, please. So there must be an index speeding it up.
A simple way to do case insensitive comparisons is to uppercase both strings and then compare the uppercased versions of both strings.
Uppercasing has limitations, because some letters cannot be uppercased. Note also that there are languages/scripts where there is no such thing as case. So the technique described in this article will work best for European languages.
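To make the limitation concrete, here is a small illustration in Python rather than Firebird. It uses Python's built-in Unicode case rules, which are only a stand-in: what Firebird's UPPER() returns depends on the character set and collation in use. The sample word is my own choice.

```python
# Illustration only: Python's Unicode case mapping, not Firebird's UPPER().
word = "straße"

# The German sharp s has no single-letter uppercase form in this mapping,
# so uppercasing changes the length of the string.
upper = word.upper()

# Lowercasing again does not bring back the original spelling.
round_trip = upper.lower()

print(upper)       # STRASSE
print(round_trip)  # strasse
```

So a search that compares uppercased strings can treat "straße" and "strasse" as equal, which may or may not be what you want.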
In order to get really perfect results one would need a case insensitive (CI) and/or accent insensitive (AI) collation. However, at the time of this writing (July 2006) there are only two Czech AI/CI collations for Firebird 2.0. The situation will hopefully improve over time.
(You should know the concepts of Character Sets and Collations in order to understand what comes next. I use the DE_DE collation in my examples, this is the collation for German/Germany in the ISO8859_1 character set.)
In order to get correct results from the UPPER() function that is built into Firebird, you must specify a collation. This can be in the DDL definition of the table:
CREATE TABLE PERSONS (
PERS_ID INTEGER NOT NULL PRIMARY KEY,
LAST_NAME VARCHAR(50) COLLATE DE_DE,
FIRST_NAME VARCHAR(50) COLLATE DE_DE
);
or it can be done when calling the UPPER() function:
SELECT UPPER (LAST_NAME COLLATE DE_DE) FROM PERSONS;
http://www.destructor.de/firebird/caseinsensitivesearch.htm
or you can edit your queries and add the lower() function
LOWER()
Available in: DSQL, ESQL, PSQL
Added in: 2.0
Description: Returns the lower-case equivalent of the input string. This function also correctly lowercases non-ASCII characters, even if the default (binary) collation is used. The character set must be appropriate though: with ASCII or NONE for instance, only ASCII characters are lowercased; with OCTETS, the entire string is returned unchanged.
Result type: VAR(CHAR)
Syntax:
LOWER (str)
Important
If the external function LOWER is declared in your database, it will obfuscate the internal function. To make the internal function available, DROP or ALTER the external function (UDF).
Example:
select field from table
where lower(Name) = 'bob'
http://www.firebirdsql.org/refdocs/langrefupd21-intfunc-lower.html

Necromancing.
Instead of a lowercase-field, you can specify the collation:
SELECT field FROM table
WHERE Name = 'bob' COLLATE UNICODE_CI
See
https://firebirdsql.org/refdocs/langrefupd21-collations.html
Maybe you can also specify it as the default character-set for the database, see
http://www.destructor.de/firebird/charsets.htm

Eventually I went with creating shadow columns containing lower-cased versions of required fields.
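For readers who want to see the shape of the shadow-column approach, here is a minimal sketch. It uses Python's sqlite3 module as a stand-in for Firebird, and the column name last_name_lc, the index name, and the helper find_by_last_name are all hypothetical. The lowercasing is done in the application before writing and before searching.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# last_name keeps the original spelling; last_name_lc is the shadow column.
conn.execute(
    "CREATE TABLE persons ("
    " pers_id INTEGER PRIMARY KEY,"
    " last_name TEXT,"
    " last_name_lc TEXT)"
)
# The index goes on the shadow column, because that is what gets searched.
conn.execute("CREATE INDEX idx_persons_last_name_lc ON persons (last_name_lc)")

for name in ["Presley", "PRESLEY", "Lennon"]:
    conn.execute(
        "INSERT INTO persons (last_name, last_name_lc) VALUES (?, ?)",
        (name, name.lower()),
    )

def find_by_last_name(term):
    """Return the original spellings whose lowercased form equals the term."""
    rows = conn.execute(
        "SELECT last_name FROM persons WHERE last_name_lc = ? ORDER BY pers_id",
        (term.lower(),),
    )
    return [row[0] for row in rows]

print(find_by_last_name("PrESley"))  # ['Presley', 'PRESLEY']
```

In a real database the shadow column would more likely be maintained by a trigger, so that the two columns cannot drift apart.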

Related

MS SQL fulltext search with ignoring special characters

I have a following (simplified) database structure:
[places]
name NVARCHAR(255)
description TEXT (usually quite a lot of text)
region_id INT FK
[regions]
id INT PK
name NVARCHAR(255)
[regions_translations]
lang_code NVARCHAR(5) FK
label NVARCHAR(255)
region_id INT FK
In real db I have few more fields in [places] table to search in, and [countries] table with similar structure to [regions].
My requirements are:
Search using name, description and region label, with the same behaviour as name LIKE '%text%' OR description LIKE '%text%' OR regions_translations.label LIKE '%text%'
Ignore all special characters like Ą, Ć, Ó, Š, Ö, Ü, etc. For example, when someone searches for PO ZVAIGZDEM, I return a place named PO ŽVAIGŽDĖM, and of course I also return this record when the user types the proper accented characters.
Quite fast. ;)
I had a few approaches to solve this issue.
Create a new column 'searchable_content', normalize the text (replace Ą -> A, Ö -> O and so on) and just do a simple SELECT ... FROM places WHERE searchable_content LIKE '%text%'. It was slow.
Add a fulltext search index to the places and regions_translations tables. It was faster, but I could not find a way to ignore special characters (the characters come from various languages, so specifying an index language will not work).
Create a new column as in the first attempt, and add a fulltext index only on that column. It was faster than attempt 1 (probably because I do not need to join tables) and I could normalize the content manually, but I feel like it's not a great solution.
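The manual normalization step used for 'searchable_content' could look roughly like the sketch below. It is written in Python on the assumption that the normalization runs in application code before the row is written; the helper names strip_accents and searchable are hypothetical.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def searchable(text: str) -> str:
    """Accent-free, uppercased form to store in the search column."""
    return strip_accents(text).upper()

print(strip_accents("PO ŽVAIGŽDĖM"))  # PO ZVAIGZDEM
```

The same function has to be applied to the user's search term. Note that letters such as Ł or Ø have no Unicode decomposition, so they pass through unchanged and would need an explicit replacement table.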
Question is - what is the best approach here?
My top priority is to ignore special characters.
EDIT:
ALTER FULLTEXT CATALOG [catalog_name] REBUILD WITH ACCENT_SENSITIVITY = OFF
This is probably a solution to my issue with special characters (I need to test it a bit more). I queried too soon, before the index had finished rebuilding, which is why I did not get any records.
You can use the COLLATE clause on a column to specify a sql collation that will treat these special characters as their non-accented counterparts. Think of it as essentially casting one data type as another, except you're casting é as e (for example). You can use the same tool to return case sensitive or case insensitive results.
The documentation talks a little more about it, and you can do a search to find exactly which collation works best for you.
https://learn.microsoft.com/en-us/sql/t-sql/statements/collations?view=sql-server-ver16

HSQL Query constraint for alphabets and numbers

I have a column FIRST_NAME which should contain only letters, and a phone number column which should contain only digits, in HSQL.
I wrote something like:
ALTER TABLE USER
ADD CONSTRAINT CHECK_USER_FIRST_NAME CHECK (FIRST_NAME NOT LIKE '%[^A-Z ]%' )
But it allows Tp1, which I don't want.
Can someone help me with the constraint in HSQL?
LIKE doesn't support regular expressions. The only wildcards it supports are % for multiple characters and _ for a single character. To match against a regular expression you need regexp_matches()
ALTER TABLE user
ADD CONSTRAINT check_user_first_name
CHECK (regexp_matches(first_name, '[A-Z]+'))
This will allow only uppercase letters, so you probably want to use '[A-Za-z]+' instead. It also requires at least one character in the name. If you want to allow empty strings, change the + to *. This will still allow null values though.
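To see why the pattern rejects Tp1, here is a small Python sketch. It assumes the constraint requires the whole value to match the pattern (check the HSQLDB documentation for the exact semantics of regexp_matches). The phone pattern and the helper names are my own additions.

```python
import re

NAME_PATTERN = re.compile(r"[A-Za-z]+")   # letters only, at least one
PHONE_PATTERN = re.compile(r"[0-9]+")     # digits only, at least one

def is_valid_name(value: str) -> bool:
    # fullmatch succeeds only if the entire string fits the pattern.
    return NAME_PATTERN.fullmatch(value) is not None

def is_valid_phone(value: str) -> bool:
    return PHONE_PATTERN.fullmatch(value) is not None

print(is_valid_name("Tp"))    # True
print(is_valid_name("Tp1"))   # False
print(is_valid_phone("5551234"))  # True
```

A partial match would behave differently: re.search with the same name pattern finds "Tp" inside "Tp1", so whole-string matching is what makes the check strict.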
Note that USER is a reserved keyword in SQL. You shouldn't create a table with that name. If you try this on e.g. Oracle or PostgreSQL it will fail unless the name is quoted, and relying on quoting is not a good idea. You should find a different name.

SQL Create table script from DBMS contains [ ]

If someone right-clicks a table in a database using Microsoft's SQL Server Management Studio and chooses to script the table to a query window, the CREATE TABLE code is displayed in that window. I notice something like the following:
CREATE TABLE [dbo].[Login](
[UserId] [int] NOT NULL,
[LoginName] nvarchar COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
etc
)
Is this how it appears in other DBMS too ?
Is this something specific to DBMS used ?
Is it just a fancy view only ?
Why are "[ ]" used around column names and table names. Simple TSQL definition would look something like
CREATE TABLE table_name (
Column#1 Datatype NOT NULL,
Column#2 Datatype NOT NULL UNIQUE,
-- etc...
)
Please dont mind my silly questions.
Thanks in advance,
Balaji S
Those are identifier quotes. They mark the contents as a database identifier (column name, table name, etc) and allow spaces, special characters, and reserved words to be used as identifiers. Usually none of those appear in identifiers so the ID quotes are, strictly speaking, unnecessary. When the text cannot be parsed because of special characters or reserved words, the ID quotes are required.
It's easier for automated tools to simply always use the ID quotes than figure out when you could get away without them.
Different database products use different characters for ID quotes. You often see the back-tick (`) used for this.
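A quick way to watch identifier quotes at work is SQLite, used here through Python only because it is easy to run. SQLite happens to accept square brackets, double quotes and back-ticks for compatibility with other products. The table and column names are deliberately chosen reserved words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Without quotes, reserved words cannot be parsed as identifiers.
try:
    conn.execute("CREATE TABLE order (select INTEGER)")
    unquoted_ok = True
except sqlite3.OperationalError:
    unquoted_ok = False

# With identifier quotes the same names are accepted.
conn.execute("CREATE TABLE [order] ([select] INTEGER)")        # SQL Server style
conn.execute('INSERT INTO "order" ("select") VALUES (1)')      # standard SQL style
value = conn.execute("SELECT `select` FROM `order`").fetchone()[0]  # MySQL style

print(unquoted_ok)  # False
print(value)        # 1
```

Other products are stricter about which quote characters they accept, so this only shows the principle.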
Just to add to Larry Lustig's excellent answer, SQL Server's 'create script' function spits out SQL code whose first priority is to be parser-friendly and that's why it always uses its preferred delimited identifier characters being square brackets, rather than say the double quotes used by the SQL-92 standard (SQL Server code can support the Standard delimited identifier by the use of SET QUOTED_IDENTIFIER ON but I don't think this can be specified as an option for output). Being easy on the human eye (use of whitespace, line breaks, etc) is merely a secondary consideration.

When should I use the SQL Server Unicode 'N' Constant?

I've been looking into the use of the Unicode 'N' constant within my code, for example:
select object_id(N'VW_TABLE_UPDATE_DATA', N'V');
insert into SOME_TABLE (Field1, Field2) values (N'A', N'B');
After doing some reading about when to use it, I'm still not entirely clear about the circumstances under which it should and should not be used.
Is it as simple as using it when data types or parameters expect a Unicode data type (as per the above examples), or is it more sophisticated than that?
The following Microsoft site gives an explanation, but I'm also a little unclear as to some of the terms it is using
http://msdn.microsoft.com/en-us/library/ms179899.aspx
Or to precis:
Unicode constants are interpreted as Unicode data, and are not evaluated by using a code page. Unicode constants do have a collation. This collation primarily controls comparisons and case sensitivity. Unicode constants are assigned the default collation of the current database, unless the COLLATE clause is used to specify a collation.
What does it mean by:
'evaluated by using a code page'?
Collation?
I realise this is quite a broad question, but any links or help would be appreciated.
Thanks
Is it as simple as using it when data types or parameters expect a unicode data type?
Pretty much.
To answer your other points:
A code page is another name for encoding of a character set. For example, windows code page 1255 encodes Hebrew. This is normally used for 8bit encodings for characters. In terms of your question, strings may be evaluated using different code pages (so the same bit pattern may be interpreted as a Japanese character or an Arabic one, depending on what code page was used to evaluate it).
Collation is about how SQL Server is to order strings - this depends on code page, as you would order strings in different languages differently. See this article for an example.
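The point about code pages can be shown with a single byte. The sketch below uses Python's built-in codecs for three Windows code pages; the byte value 0xE0 is just a convenient example of my choosing.

```python
# One and the same byte, evaluated with three different code pages.
raw = bytes([0xE0])

as_western = raw.decode("cp1252")   # Western European: Latin small a with grave
as_hebrew = raw.decode("cp1255")    # Hebrew: letter alef
as_cyrillic = raw.decode("cp1251")  # Cyrillic: small letter a

for label, ch in [("cp1252", as_western), ("cp1255", as_hebrew), ("cp1251", as_cyrillic)]:
    print(label, "U+%04X" % ord(ch))
```

A Unicode (N-prefixed) constant avoids this ambiguity because its characters are not tied to any code page.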
The national character types nchar and nvarchar use two bytes per character (four for supplementary characters outside the Basic Multilingual Plane) and support international character sets -- think internet.
The N prefix converts a string constant to two bytes per character. So if you have people from different countries and would like their names properly stored -- something like:
CREATE TABLE SomeTable (
id int
,FirstName nvarchar(50)
);
Then use:
INSERT INTO SomeTable
( Id, FirstName )
VALUES ( 1, N'Guðjón' );
and
SELECT *
FROM SomeTable
WHERE FirstName = N'Guðjón';
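What can go wrong without the N prefix is easiest to see by encoding the name by hand. This Python sketch is only an analogy: the choice of code page 1255 as the "wrong" code page is an assumption for illustration, and what actually happens in SQL Server depends on the database's default collation.

```python
name = "Guðjón"

# Two bytes per character, as with nvarchar and N'...' constants.
utf16 = name.encode("utf-16-le")

# A code page that contains both letters keeps the name intact.
western = name.encode("cp1252")

# A code page without them loses data; "?" marks the unmappable characters.
lossy = name.encode("cp1255", errors="replace")

print(len(utf16))    # 12
print(len(western))  # 6
print(lossy)         # b'Gu?j?n'
```

Once a character has been replaced this way, the original cannot be recovered from the stored value.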

SQL server ignore case in a where expression

How do I construct a SQL query (MS SQL Server) where the "where" clause is case-insensitive?
SELECT * FROM myTable WHERE myField = 'sOmeVal'
I want the results to come back ignoring the case
In the default configuration of a SQL Server database, string comparisons are case-insensitive. If your database overrides this setting (through the use of an alternate collation), then you'll need to specify what sort of collation to use in your query.
SELECT * FROM myTable WHERE myField = 'sOmeVal' COLLATE SQL_Latin1_General_CP1_CI_AS
Note that the collation I provided is just an example (though it will more than likely function just fine for you). A more thorough outline of SQL Server collations can be found here.
Usually, string comparisons are case-insensitive. If your database is configured to case sensitive collation, you need to force to use a case insensitive one:
SELECT balance FROM people WHERE email = 'billg@microsoft.com'
COLLATE SQL_Latin1_General_CP1_CI_AS
I found another solution elsewhere; that is, to use
upper(@yourString)
but everyone here is saying that, in SQL Server, it doesn't matter because it's ignoring case anyway? I'm pretty sure our database is case-sensitive.
The top 2 answers (from Adam Robinson and Andrejs Cainikovs) are kinda, sorta correct, in that they do technically work, but their explanations are wrong and so could be misleading in many cases. For example, while the SQL_Latin1_General_CP1_CI_AS collation will work in many cases, it should not be assumed to be the appropriate case-insensitive collation. In fact, given that the O.P. is working in a database with a case-sensitive (or possibly binary) collation, we know that the O.P. isn't using the collation that is the default for so many installations (especially any installed on an OS using US English as the language): SQL_Latin1_General_CP1_CI_AS. Sure, the O.P. could be using SQL_Latin1_General_CP1_CS_AS, but when working with VARCHAR data, it is important to not change the code page as it could lead to data loss, and that is controlled by the locale / culture of the collation (i.e. Latin1_General vs French vs Hebrew etc). Please see point # 9 below.
The other four answers are wrong to varying degrees.
I will clarify all of the misunderstandings here so that readers can hopefully make the most appropriate / efficient choices.
Do not use UPPER(). That is completely unnecessary extra work. Use a COLLATE clause. A string comparison needs to be done in either case, but using UPPER() also has to check, character by character, to see if there is an upper-case mapping, and then change it. And you need to do this on both sides. Adding COLLATE simply directs the processing to generate the sort keys using a different set of rules than it was going to by default. Using COLLATE is definitely more efficient (or "performant", if you like that word :) than using UPPER(), as proven in this test script (on PasteBin).
There is also the issue noted by @Ceisc on @Danny's answer:
In some languages case conversions do not round-trip. i.e. LOWER(x) != LOWER(UPPER(x)).
The Turkish upper-case "İ" is the common example.
No, collation is not a database-wide setting, at least not in this context. There is a database-level default collation, and it is used as the default for altered and newly created columns that do not specify the COLLATE clause (which is likely where this common misconception comes from), but it does not impact queries directly unless you are comparing string literals and variables to other string literals and variables, or you are referencing database-level meta-data.
No, collation is not per query.
Collations are per predicate (i.e. something operand something) or expression, not per query. And this is true for the entire query, not just the WHERE clause. This covers JOINs, GROUP BY, ORDER BY, PARTITION BY, etc.
No, do not convert to VARBINARY (e.g. convert(varbinary, myField) = convert(varbinary, 'sOmeVal')) for the following reasons:
that is a binary comparison, which is not case-insensitive (which is what this question is asking for)
if you do want a binary comparison, use a binary collation. Use one that ends with _BIN2 if you are using SQL Server 2008 or newer, else you have no choice but to use one that ends with _BIN. If the data is NVARCHAR then it doesn't matter which locale you use as they are all the same in that case, hence Latin1_General_100_BIN2 always works. If the data is VARCHAR, you must use the same locale that the data is currently in (e.g. Latin1_General, French, Japanese_XJIS, etc) because the locale determines the code page that is used, and changing code pages can alter the data (i.e. data loss).
using a variable-length datatype without specifying the size will rely on the default size, and there are two different defaults depending on the context where the datatype is being used. It is either 1 or 30 for string types. When used with CONVERT() it will use the 30 default value. The danger is, if the string can be over 30 bytes, it will get silently truncated and you will likely get incorrect results from this predicate.
Even if you want a case-sensitive comparison, binary collations are not case-sensitive (another very common misconception).
No, LIKE is not always case-sensitive. It uses the collation of the column being referenced, or the collation of the database if a variable is compared to a string literal, or the collation specified via the optional COLLATE clause.
LCASE is not a SQL Server function. It appears to be either Oracle or MySQL. Or possibly Visual Basic?
Since the context of the question is comparing a column to a string literal, neither the collation of the instance (often referred to as "server") nor the collation of the database have any direct impact here. Collations are stored per each column, and each column can have a different collation, and those collations don't need to be the same as the database's default collation or the instance's collation. Sure, the instance collation is the default for what a newly created database will use as its default collation if the COLLATE clause wasn't specified when creating the database. And likewise, the database's default collation is what an altered or newly created column will use if the COLLATE clause wasn't specified.
You should use the case-insensitive collation that is otherwise the same as the collation of the column. Use the following query to find the column's collation (change the table's name and schema name):
SELECT col.*
FROM sys.columns col
WHERE col.[object_id] = OBJECT_ID(N'dbo.TableName')
AND col.[collation_name] IS NOT NULL;
Then just change the _CS to be _CI. So, Latin1_General_100_CS_AS would become Latin1_General_100_CI_AS.
If the column is using a binary collation (ending in _BIN or _BIN2), then find a similar collation using the following query:
SELECT *
FROM sys.fn_helpcollations() col
WHERE col.[name] LIKE N'{CurrentCollationMinus"_BIN"}[_]CI[_]%';
For example, assuming the column is using Japanese_XJIS_100_BIN2, do this:
SELECT *
FROM sys.fn_helpcollations() col
WHERE col.[name] LIKE N'Japanese_XJIS_100[_]CI[_]%';
For more info on collations, encodings, etc, please visit: Collations Info
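The round-trip problem quoted in the first point, LOWER(x) != LOWER(UPPER(x)), can be reproduced outside SQL Server. The sketch below uses Python's built-in Unicode case rules as a stand-in; the helper round_trips is hypothetical, and the sample characters (the Turkish dotless i and the German sharp s) are my own choices.

```python
def round_trips(text: str) -> bool:
    """True if lowercasing gives the same result before and after uppercasing."""
    return text.lower() == text.upper().lower()

dotless_i = "\u0131"  # Turkish lower-case dotless i
sharp_s = "\u00df"    # German sharp s

print(round_trips("bob"))      # True
print(round_trips(dotless_i))  # False: uppercases to "I", which lowercases to "i"
print(round_trips(sharp_s))    # False: uppercases to "SS", which lowercases to "ss"
```

This is one more reason to let a collation decide equality instead of converting both sides with UPPER() or LOWER().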
No, only using LIKE will not work. LIKE searches values matching exactly your given pattern. In this case LIKE would find only the text 'sOmeVal' and not 'someval'.
A practicable solution is using the LCASE() function. LCASE('sOmeVal') gets the lowercase string of your text: 'someval'. If you use this function for both sides of your comparison, it works:
SELECT * FROM myTable WHERE LCASE(myField) LIKE LCASE('sOmeVal')
The statement compares two lowercase strings, so that your 'sOmeVal' will match every other notation of 'someval' (e.g. 'Someval', 'sOMEVAl' etc.).
You can force a case-sensitive comparison by casting to varbinary, like this:
SELECT * FROM myTable
WHERE convert(varbinary, myField) = convert(varbinary, 'sOmeVal')
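The silent truncation mentioned elsewhere in this thread (CONVERT without a size falls back to a length of 30) is easy to emulate. The Python sketch below is an emulation only, not SQL Server itself: the helper to_varbinary and the use of code page 1252 are assumptions for illustration.

```python
DEFAULT_CONVERT_LENGTH = 30  # default length cited in this thread for CONVERT()

def to_varbinary(text: str, length: int = DEFAULT_CONVERT_LENGTH) -> bytes:
    """Emulate convert(varbinary, text): encode, then cut off at the length."""
    return text.encode("cp1252")[:length]

# The comparison is case-sensitive, as intended.
print(to_varbinary("sOmeVal") == to_varbinary("someval"))  # False

# But two different values that share their first 30 bytes compare as equal.
a = "x" * 30 + "A"
b = "x" * 30 + "B"
print(to_varbinary(a) == to_varbinary(b))          # True
print(to_varbinary(a, 50) == to_varbinary(b, 50))  # False
```

If this technique is used at all, an explicit size such as varbinary(255) avoids the false match.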
What database are you on? With MS SQL Server, it's a database-wide setting, or you can over-ride it per-query with the COLLATE keyword.