Collation issue in SQL query


I have been reading about collation in SQL, but I am still confused. Why is it that this code works fine:
...case WHEN _AccountID not in ('00000000P','0000000P9','899') THEN 'blah'
but the following does not work and produces an error message
"Cannot resolve the collation conflict between
"SQL_Latin1_General_CP1_CS_AS" and "SQL_Latin1_General_CP1_CI_AS" in
the equal to operation"
...case WHEN _AccountID not in (select _AccountID from tMyTable) THEN 'blah'
especially when the rest of the query is exactly the same!
Actually, I can write other queries where even the latter syntax works fine (so I wouldn't think it's because of actual column values, right?), but my above examples are both from the otherwise exact same query. I can't understand what to look for enough in my data to differentiate the queries in which it works from the queries where it doesn't work.

Collations are used to determine things like sort order and handling for case sensitivity. Collations can be set at the server, database, table and column level. So two columns in the one table could potentially have different collations. In your error message, one collation is case insensitive (CI) and one is case sensitive (CS). What we don't know yet, from the information you've posted, is which server, database and tables the two columns called _AccountID are stored in. Nevertheless, they have different collations. CI and CS are addressed in BOL thus:
Distinguishes between uppercase and lowercase letters. If selected,
lowercase letters sort ahead of their uppercase versions. If this
option is not selected, the collation is case-insensitive. That is,
SQL Server considers the uppercase and lowercase versions of letters
to be identical for sorting purposes. You can explicitly select case
insensitivity by specifying _CI.
One workaround assuming the first _AccountID has a different collation to the database's default collation (and the second one uses the database default), might be:
...case WHEN _AccountID collate database_default not in (select _AccountID from tMyTable) THEN 'blah'
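To make the column-level collation point concrete, here is a minimal sketch that uses SQLite through Python's sqlite3 module rather than SQL Server; the table t and its columns cs and ci are made up for the example. SQLite never raises a collation conflict: when two columns with different collations are compared it quietly uses the left operand's collation, and an explicit COLLATE overrides that. SQL Server declines to guess and raises the error instead, which is why the explicit collate above is needed there.

```python
import sqlite3

# Hypothetical table: one case-sensitive and one case-insensitive column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (
        cs TEXT COLLATE BINARY,  -- case-sensitive column
        ci TEXT COLLATE NOCASE   -- case-insensitive column
    );
    INSERT INTO t VALUES ('0000000p9', '0000000P9');
""")

def matches(condition):
    """Return how many rows of t satisfy the given WHERE condition."""
    sql = f"SELECT COUNT(*) FROM t WHERE {condition}"
    return conn.execute(sql).fetchone()[0]

left_cs = matches("cs = ci")                  # left operand is BINARY
left_ci = matches("ci = cs")                  # left operand is NOCASE
explicit = matches("cs COLLATE NOCASE = ci")  # explicit COLLATE takes priority

print(left_cs, left_ci, explicit)
```

The same two values compare as different or equal depending only on which collation ends up being applied, which is the ambiguity SQL Server is complaining about.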
As an aside, assuming you're using SQL Server, you might want to consider using
WHERE NOT EXISTS (SELECT * FROM tMyTable tbl WHERE tbl._AccountID = <the_other>._AccountID)
...which will perform better than WHERE NOT IN (SELECT...)
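Apart from speed, the two forms are not interchangeable when the subquery can return NULL. The following is a minimal sketch using SQLite via Python's sqlite3 as a stand-in (the tables t_outer and t_inner are hypothetical). The three-valued logic involved is standard SQL, so SQL Server should behave the same way, but that is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_outer (acct TEXT);
    CREATE TABLE t_inner (acct TEXT);
    INSERT INTO t_outer VALUES ('A'), ('B');
    INSERT INTO t_inner VALUES ('A'), (NULL);
""")

# 'B' NOT IN ('A', NULL) evaluates to NULL, so the row is filtered out.
not_in = conn.execute(
    "SELECT acct FROM t_outer "
    "WHERE acct NOT IN (SELECT acct FROM t_inner)"
).fetchall()

# NOT EXISTS only asks whether a matching row exists, so 'B' is returned.
not_exists = conn.execute(
    "SELECT acct FROM t_outer o "
    "WHERE NOT EXISTS (SELECT 1 FROM t_inner i WHERE i.acct = o.acct)"
).fetchall()

print(not_in, not_exists)
```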

Related

What's the most efficient way to do a case-insensitive like expression?

In Pervasive v13, is there a "more performant" way to perform a case-insensitive like expression than is shown below?
select * from table_name
where upper(field_name) like '%TEST%'
The UPPER function above has performance cost that I'd like to avoid.
I disagree with those who say that the performance-overhead of UPPER is minor; it is doubling the execution time compared to the exact same query without UPPER.
Background:
I was very satisfied with the execution time of this wildcard-like-expression until I realized the result set was missing records due to mismatched capitalization.
Then, I implemented the UPPER technique (above). This achieved including those missing records, but it doubled the execution time of my query.
This UPPER technique, for case-insensitive comparison, seems outlandishly intensive to me at even a conceptual level. Instead of changing a field's case, for every record in a large database table, I'm hoping that the SQL standard provides some type of syntactical flag that modifies the like-expression's behavior regarding case-sensitivity.
From there, behind the scenes, the database engine could generate a compiled regular expression (or some other optimized case-insensitive evaluator) that could well outperform this UPPER technique. This seems like a possibility that might exist.
However, I must admit, at some level there still must be a conversion to make the letter-comparisons. And perhaps, this UPPER technique is no worse than any other method that might achieve the same result set.
Regardless, I'm posting this question in hopes someone might reveal a more performant syntax I'm unaware of.

You do not need UPPER if you define the column with the CASE attribute.
The CASE keyword causes PSQL to ignore case when evaluating
restriction clauses involving a string column. CASE can be specified
as a column attribute in a CREATE TABLE or ALTER TABLE statement, or
in an ORDER BY clause of a SELECT statement.
(see: https://docs.actian.com/psql/psqlv13/index.html#page/sqlref%2Fsqlref.CASE_(string).htm )
CREATE TABLE table_name (field_name VARCHAR(100) CASE)

Why are table aliases not compiled out of existence when sharing SQL statements (on Oracle DBMS)

Quest Software\Knowledge Xpert states:
If two identical SQL statements vary because an identical table has two different aliases, then the SQL is different and will not be shared.
What sense does this make?
I understand that if I have table A and table B and I fail to alias an ambiguous column what I'm trying to do is mathematically ambiguous, but the names of the aliases themselves shouldn't matter should they? Why would SQL/Oracle care that table A's alias is FOO in one statement and BAR in another when determining for caching purposes if they are identical?
On a similar line why should whitespace or word case matter at all?
"SQL cannot be shared within the SGA unless it is absolutely identical. Statement components that must be the same include:
Word case (uppercase and lowercase characters)
Whitespace
Underlying schema objects"
Underlying schema objects makes sense, because after all mathematically that's something different. Is the idea I might be an idiot and have columns named "Foo" "FOO" and "foo" and we don't want to accidentally cache?

I think it's to avoid the extra overhead of "normalizing" each SQL statement before creating a SQL_ID.
The SQL_ID is a hash of the SQL statement. In order to do what you are asking, it would require the SQL parser to do extra work (for limited benefit) in order to make a uniform SQL statement that would compare exactly with another statement that was equivalent, but had mixed case, extra spaces, etc.
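A small sketch of why a text hash is sensitive to aliases, case and whitespace. It uses SHA-256 from Python's hashlib purely as a stand-in (Oracle's own hashing is internal and not reproduced here), and the statements and the accounts table are invented for the illustration.

```python
import hashlib

def text_hash(sql: str) -> str:
    """Hash the raw statement text, with no normalization at all."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()[:16]

# Three statements that mean the same thing but differ textually.
alias_foo = "SELECT foo.id FROM accounts foo WHERE foo.id = 1"
alias_bar = "SELECT bar.id FROM accounts bar WHERE bar.id = 1"
lower_ws = "select foo.id  from accounts foo where foo.id = 1"

hashes = {text_hash(s) for s in (alias_foo, alias_bar, lower_ws)}
print(len(hashes))  # number of distinct hashes among the three statements
```

Making these three collapse to one cache entry would mean parsing and canonicalizing every incoming statement before the lookup, which is the extra work the hash lookup is meant to avoid.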

I think these restrictions are due to the SQL processing mechanism Oracle uses. It calculates a hash value of the query text, and if this hash matches one stored in the SGA, it can skip the hard parsing steps. More details are here.

Should I use an empty string instead of NULL when checking for unique?

I'm still relatively new to database design, and I'm making a table with SQLite. I thought I was taught that it's best to use NULL in place of empty strings, so that's what I've been doing. I'm building an address table with the line:
CREATE TABLE addresses (
addressID INTEGER PRIMARY KEY,
officeName TEXT,
address TEXT NOT NULL CHECK(address<>''),
UNIQUE (officeName, address)
);
And adding addresses to the database (through PHP PDO) using the line
INSERT OR IGNORE INTO addresses (officeName,address) VALUES (?,?)
That line should check to see if the officeName/address is already in the database, and ignore it if it is, or add it if it isn't. "Address" is always a non-null string, but sometimes the officeName is blank. And if I make it NULL, it keeps getting added as if each NULL was distinct (it works fine if it's just an empty string). I did find this article saying that yes, NULLs are treated as distinct in a unique column. That now makes me wonder… should I always just use an empty string instead of NULL? Is there ever a case where it's "best practice" to use NULL instead? I thought it was always best practice, but now I'm thinking it might never be best practice.

NULL and the empty string are semantically different, just as NULL and 0 are semantically different.
NULL means "no value". In your case, that would be "no address".
The empty string is a string value of zero length. In your case, that would be an address that is the empty string.
Whether or not to use NULL or the empty string depends on the semantics of the situation, just like the decision of whether to use NULL or 0.
However, NULLs are a bit of a mess when it comes to comparison, IN, indexes, DISTINCT, and GROUP BY. Everyone seems to do things a little differently (FYI, this link doesn't cover SQL Server, which does it yet another way), so unfortunately, compromises are often made to accommodate particular desired behavior, depending on the DBMS.
In your case, you will have to use empty strings if you want to use the SQLite functionality you are interested in.
SQLite was originally coded in such a way that [NULLs are never distinct]. But the experiments run
on other SQL engines showed that none of them worked this way. So
SQLite was modified to work the same as Oracle, PostgreSQL, and DB2.
This involved making NULLs indistinct for the purposes of the SELECT
DISTINCT statement and for the UNION operator in a SELECT. NULLs are
still distinct in a UNIQUE column. This seems somewhat arbitrary, but
the desire to be compatible with other engines outweighed that
objection.
Know, however, that INSERT OR IGNORE is unique to SQLite; for no other DBMS would you be asking about using that statement.
Best practice is to base your decision on what you mean: no value, or the value with no characters. (Of course, you may always choose to forgo best practice for your own personal reasons.)
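The behavior described in the question can be reproduced with a short sketch using Python's built-in sqlite3 module instead of PHP PDO; the street addresses are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE addresses (
        addressID INTEGER PRIMARY KEY,
        officeName TEXT,
        address TEXT NOT NULL CHECK(address<>''),
        UNIQUE (officeName, address)
    )
""")

insert = "INSERT OR IGNORE INTO addresses (officeName, address) VALUES (?, ?)"

# Same row twice with a NULL officeName: NULLs count as distinct for UNIQUE,
# so nothing is ignored and both rows are stored.
for _ in range(2):
    conn.execute(insert, (None, "1 Main St"))

# Same row twice with an empty-string officeName: the second insert is ignored.
for _ in range(2):
    conn.execute(insert, ("", "2 Side St"))

null_rows = conn.execute(
    "SELECT COUNT(*) FROM addresses WHERE officeName IS NULL").fetchone()[0]
empty_rows = conn.execute(
    "SELECT COUNT(*) FROM addresses WHERE officeName = ''").fetchone()[0]

print(null_rows, empty_rows)
```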

SQL like in different DBMS

I am performing a simple like query such as
SELECT * FROM table WHERE column LIKE '%searchterm%';
on the same data imported to a SQLite DB and a Postgres DB.
However, the number of results varies between the two databases.
I tried googling but I couldn't really find out if there any major implementation differences.

One of the main differences you'll find is that Postgres LIKE queries are case-sensitive, while SQLite's aren't (at least for ASCII characters). You'll need to use ILIKE to get a case-insensitive match in Postgres.
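The SQLite half of this can be checked with a minimal sketch through Python's sqlite3 module. The notes table and its rows are invented, and the Postgres side is not run here; SQLite's case-sensitive GLOB operator is used only as a rough stand-in for what a case-sensitive LIKE would return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE notes (body TEXT);
    INSERT INTO notes VALUES
        ('a searchterm here'),
        ('A SearchTerm Here'),
        ('A SEARCHTERM HERE');
""")

# SQLite LIKE ignores case for ASCII letters by default.
like_count = conn.execute(
    "SELECT COUNT(*) FROM notes WHERE body LIKE '%searchterm%'").fetchone()[0]

# GLOB is case-sensitive, so only the all-lowercase row matches.
glob_count = conn.execute(
    "SELECT COUNT(*) FROM notes WHERE body GLOB '*searchterm*'").fetchone()[0]

print(like_count, glob_count)
```

A difference of this kind between the two counts is what you would expect to see between the SQLite and Postgres result sets when the data contains mixed-case matches.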

Is SQL syntax case sensitive?

Is SQL case sensitive? I've used MySQL and SQL Server which both seem to be case insensitive. Is this always the case? Does the standard define case-sensitivity?

The SQL keywords are case insensitive (SELECT, FROM, WHERE, etc), but they are often written in all caps. However, in some setups, table and column names are case sensitive.
MySQL has a configuration option to enable/disable it. Usually case-sensitive table names are the default on Linux MySQL and case-insensitive used to be the default on Windows, but now the installer asks about this during setup. For SQL Server it is a function of the database's collation setting.
Here is the MySQL page about name case-sensitivity
Here is the article in MSDN about collations for SQL Server

This isn't strictly SQL language, but in SQL Server if your database collation is case-sensitive, then all table names are case-sensitive.

The SQL-92 specification states that identifiers might be quoted, or unquoted. If both sides are unquoted then they are always case insensitive, e.g., table_name == TAble_nAmE.
However, quoted identifiers are case sensitive, e.g., "table_name" != "TAble_naME". Also based on the specification if you wish to compare unquoted identifiers with quoted ones, then unquoted and quoted identifiers can be considered the same, if the unquoted characters are uppercased, e.g. TABLE_NAME == "TABLE_NAME", but TABLE_NAME != "table_name" or TABLE_NAME != "TAble_NaMe".
Here is the relevant part of the specification (section 5.2.13):
A <regular identifier> and a <delimited identifier> are equivalent if the <identifier body> of the <regular identifier> (with
every letter that is a lower-case letter replaced by the equivalent upper-case letter or letters) and the <delimited identifier
body> of the <delimited identifier> (with all occurrences of
<quote> replaced by <quote symbol> and all occurrences of <doublequote symbol> replaced by <double quote>), considered as
the repetition of a <character string literal> that specifies a
<character set specification> of SQL_TEXT and an implementation-
defined collation that is sensitive to case, compare equally
according to the comparison rules in Subclause 8.2, "<comparison
predicate>".
Note that, just like with other parts of the SQL standard, not all databases follow this section fully. PostgreSQL, for example, stores all unquoted identifiers lowercased instead of uppercased, so table_name == "table_name" (which is exactly the opposite of the standard). Also, some databases are case insensitive all the time, or their case sensitivity depends on some setting in the DB or on some property of the system, usually whether the file system is case sensitive or not.
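The equivalence rule can be modelled in a few lines of Python. This is a simplified sketch, not a parser: fold_identifier and same_identifier are hypothetical helper names, and the fold parameter switches between the standard's uppercasing and PostgreSQL-style lowercasing.

```python
def fold_identifier(ident: str, fold=str.upper) -> str:
    """Return the name an identifier resolves to under a simplified model."""
    if len(ident) >= 2 and ident.startswith('"') and ident.endswith('"'):
        # Delimited identifier: keep the case, unescape doubled quotes.
        return ident[1:-1].replace('""', '"')
    # Regular identifier: fold the case (upper per SQL-92 by default).
    return fold(ident)

def same_identifier(a: str, b: str, fold=str.upper) -> bool:
    """True if both spellings refer to the same name under the given folding."""
    return fold_identifier(a, fold) == fold_identifier(b, fold)

# SQL-92 style folding (uppercase).
print(same_identifier("table_name", "TAble_nAmE"))
print(same_identifier("TABLE_NAME", '"table_name"'))

# PostgreSQL-style folding (lowercase).
print(same_identifier("table_name", '"table_name"', fold=str.lower))
```

With uppercase folding, TABLE_NAME and "TABLE_NAME" match while "table_name" does not; with lowercase folding the situation is reversed.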
Note that some database tools might send identifiers quoted all the time, so in instances where you mix queries generated by some tool (like a CREATE TABLE query generated by Liquibase or other DB migration tool), with hand made queries (like a simple JDBC select in your application) you have to make sure that the cases are consistent, especially on databases where quoted and unquoted identifiers are different (DB2, PostgreSQL, etc.)

In SQL Server it is an option. Turning it on sucks.
I'm not sure about MySQL.

Identifiers and reserved words should not be case sensitive, although many follow a convention to use capitals for reserved words and upper camel case for identifiers.
See SQL-92 Sec. 5.2

My understanding is that the SQL standard calls for case-insensitivity. I don't believe any databases follow the standard completely, though.
MySQL has a configuration setting as part of its "strict mode" (a grab bag of several settings that make MySQL more standards-compliant) for case sensitive or insensitive table names. Regardless of this setting, column names are still case-insensitive, although I think it affects how the column-names are displayed. I believe this setting is instance-wide, across all databases within the RDBMS instance, although I'm researching today to confirm this (and hoping the answer is no).
I like how Oracle handles this far better. In straight SQL, identifiers like table and column names are case insensitive. However, if for some reason you really desire to get explicit casing, you can enclose the identifier in double-quotes (which are quite different in Oracle SQL from the single-quotes used to enclose string data). So:
SELECT fieldName
FROM tableName;
will query fieldname from tablename, but
SELECT "fieldName"
FROM "tableName";
will query fieldName from tableName.
I'm pretty sure you could even use this mechanism to insert spaces or other non-standard characters into an identifier.
In this situation if for some reason you found explicitly-cased table and column names desirable it was available to you, but it was still something I would highly caution against.
My convention when I used Oracle on a daily basis was that in code I would put all Oracle SQL keywords in uppercase and all identifiers in lowercase. In documentation I would put all table and column names in uppercase. It was very convenient and readable to be able to do this (although sometimes a pain to type so many capitals in code -- I'm sure I could've found an editor feature to help, here).
In my opinion MySQL is particularly bad for differing about this on different platforms. We need to be able to dump databases on Windows and load them into Unix, and doing so is a disaster if the installer on Windows forgot to put the RDBMS into case-sensitive mode. (To be fair, part of the reason this is a disaster is our coders made the bad decision, long ago, to rely on the case-sensitivity of MySQL on UNIX.) The people who wrote the Windows MySQL installer made it really convenient and Windows-like, and it was great to move toward giving people a checkbox to say "Would you like to turn on strict mode and make MySQL more standards-compliant?" But it is very inconvenient for MySQL to differ so significantly from the standard, and then make matters worse by turning around and differing from its own de facto standard on different platforms. I'm sure that on differing Linux distributions this may be further compounded, as packagers for different distros probably have at times incorporated their own preferred MySQL configuration settings.
Here's another Stack Overflow question that gets into discussing if case-sensitivity is desirable in an RDBMS.

No. MySQL is not case sensitive, and neither is the SQL standard. It's just common practice to write the commands upper-case.
Now, if you are talking about table/column names, then yes they are, but not the commands themselves.
So
SELECT * FROM foo;
is the same as
select * from foo;
but not the same as
select * from FOO;

I found this blog post to be very helpful (I am not the author). Summarizing (please read, though):
...delimited identifiers are case sensitive ("table_name" != "Table_Name"), while non quoted identifiers are not, and are transformed to upper case (table_name => TABLE_NAME).
He found DB2, Oracle and Interbase/Firebird are 100% compliant:
PostgreSQL ... lowercases every unquoted identifier, instead of uppercasing it. MySQL ... file system dependent. SQLite and SQL Server ... case of the table and field names are preserved on creation, but they are completely ignored afterwards.
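The SQLite part of that summary is easy to check with a minimal sketch via Python's sqlite3 module; the table and column names are made up for the example, and the other databases mentioned are not exercised here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Create the table with quoted, mixed-case names.
conn.execute('CREATE TABLE "TableName" ("FieldName" TEXT)')

# Refer to it with different casing, both unquoted and quoted.
conn.execute("INSERT INTO tablename (FIELDNAME) VALUES ('x')")
value = conn.execute('SELECT "fieldname" FROM "TABLENAME"').fetchone()[0]

# The original spelling is still what the schema table records.
stored_name = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchone()[0]

print(value, stored_name)
```

The case given at creation time is preserved in the schema, yet every later spelling resolves to the same table and column, even when quoted.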

I don't think SQL Server is case sensitive, at least not by default.
When I'm querying manually via SQL Server Management Studio, I mess up case all the time and it cheerfully accepts it:
select cOL1, col2 FrOM taBLeName WheRE ...

SQL keywords are case insensitive themselves.
Names of tables, columns, etc., have a case sensitivity which is database dependent - you should probably assume that they are case sensitive unless you know otherwise (in many databases they aren't though; in MySQL table names are sometimes case sensitive, but most other names are not).
Comparing data using =, >, <, etc., has a case awareness which is dependent on the collation settings which are in use on the individual database, table or even column in question. It's normal however, to keep collation fairly consistent within a database. We have a few columns which need to store case sensitive values; they have a collation specifically set.

Have the best of both worlds
These days you can just write all your SQL statements in lowercase and if you ever need to have it formatted then just install a plugin that will do it for you. This is only applicable if your code editor has those plug-ins available. Visual Studio Code has many extensions that can do this.
Here's a couple you can use: vscode-sql-formatter and SqlFormatter-VSCode
