Non-latin-characters ordering in database with "order by"

I just found some strange behavior in the database "order by" clause. In string comparison, I expected characters such as '[' and '_' to be greater than Latin letters and digits such as 'I' or '2', given their order in the ASCII table. However, the sort results from the database "order by" clause differ from my expectation. Here's my test:
SQLite version 3.6.23
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table products(name varchar(10));
sqlite> insert into products values('ipod');
sqlite> insert into products values('iphone');
sqlite> insert into products values('[apple]');
sqlite> insert into products values('_ipad');
sqlite> select * from products order by name asc;
[apple]
_ipad
iphone
ipod
select * from products order by name asc;
name
...
[B#
_ref
123
1ab
...
This behavior is different from Java's string comparison (it cost me some time to track this issue down). I can reproduce it in both SQLite 3.6.23 and Microsoft SQL Server 2005. I did some web searching but could not find any related documentation. Could someone shed some light on it? Is it part of the SQL standard? Where can I find some information about this? Thanks in advance.

The set of rules for comparing and ordering characters in a database is called a collation.
How strings are sorted depends on the collation, which is usually set in the server, client, or session properties.
In MySQL:
SELECT *
FROM (
SELECT 'a' AS str
UNION ALL
SELECT 'A' AS str
UNION ALL
SELECT 'b' AS str
UNION ALL
SELECT 'B' AS str
) q
ORDER BY
str COLLATE UTF8_BIN
--
'A'
'B'
'a'
'b'
and
SELECT *
FROM (
SELECT 'a' AS str
UNION ALL
SELECT 'A' AS str
UNION ALL
SELECT 'b' AS str
UNION ALL
SELECT 'B' AS str
) q
ORDER BY
str COLLATE UTF8_GENERAL_CI
--
'a'
'A'
'b'
'B'
UTF8_BIN sorts characters according to their Unicode code points. Capital letters have lower code points and therefore go first.
UTF8_GENERAL_CI sorts characters according to their alphabetical position, disregarding case.
Collation is also important for indexes, since indexes rely heavily on sorting and comparison rules.
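Since the question was tested on SQLite, here is a rough sketch of the same idea there, run through Python's built-in sqlite3 module. The table name words and the three sample values are made up for illustration; BINARY and NOCASE are SQLite's built-in collations, not the MySQL ones shown above.

```python
import sqlite3

# In-memory SQLite database; the table and its values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (str TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("apple",), ("Banana",), ("_x",)],
)

def ordered(collation):
    # A collation name cannot be bound as a parameter, so it is interpolated.
    # Only the two fixed names used below are ever passed in.
    sql = f"SELECT str FROM words ORDER BY str COLLATE {collation}"
    return [row[0] for row in conn.execute(sql)]

binary_order = ordered("BINARY")  # compares the raw bytes
nocase_order = ordered("NOCASE")  # folds ASCII A-Z to lower case first

print(binary_order)  # ['Banana', '_x', 'apple']
print(nocase_order)  # ['_x', 'apple', 'Banana']
```

The same three rows come back in a different order depending only on the collation named in the ORDER BY.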

The important keyword in this case is 'collation'. I have no experience with SQLite, but would expect it to be similar to other database engines in that you can define the collation to use for whole databases, single tables, per connection, etc.
Check your DB documentation for the options available to you.

The ASCII codes for lower-case characters such as 'i' are greater than the ones for '[' and '_':
'i': 105
'[': 91
'_': 95
However, try inserting upper-case values, e.g. "IPOD" or "Iphone": those will come before "_" and "[" with the default binary collation.
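A quick way to check this outside the database is a small Python sketch. Python's default string sort compares code points, which for these ASCII values behaves like a binary collation; the list reuses the question's names plus the two upper-case variants suggested above.

```python
# Product names from the question, plus two upper-case variants.
names = ["ipod", "iphone", "[apple]", "_ipad", "IPOD", "Iphone"]

# Code points of the leading characters being compared.
codes = {ch: ord(ch) for ch in "I[_i"}
print(codes)  # {'I': 73, '[': 91, '_': 95, 'i': 105}

# sorted() on str compares code point by code point.
by_code_point = sorted(names)
print(by_code_point)
# ['IPOD', 'Iphone', '[apple]', '_ipad', 'iphone', 'ipod']
```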

Related

When is comparing strings case-sensitive and when case-insensitive in SQL?

When is comparing strings case-sensitive and when case-insensitive in popular databases such as PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and SQLite?
I mean comparison using operators like: 'ab' = 'AB', or comparison of strings performed inside string functions like: POSITION('b' IN 'ABC'), INSTR('ABC', 'b'), REPLACE('ABC', 'b', 'x'), TRANSLATE('ABC', 'b', 'x'), TRIM('XabcX', 'x').
I think I may know the answer, but I don't know whether it is correct.
Additionally, does the SQL standard define the case-sensitivity of string comparison?
Unfortunately, I only found a question about case-sensitivity of SQL syntax, and not in string comparison.
Edit: I am asking about the default settings of the RDBMS, without explicitly setting the collation of a database, table, or column.
I am asking only about ASCII letters A-Z and a-z.

Oracle DB functions and operators are case-sensitive, with some exceptions, including the regexp_like function, which can be switched to case-insensitive mode:
--case sensitive demo
select q.txt
from (select 'myPhrase' txt from dual) q
where regexp_like(q.txt, 'myphrase'); --empty result set
select q.txt
from (select 'myPhrase' txt from dual) q
where q.txt like 'myphrase'; --empty result set
select q.txt
from (select 'myPhrase' txt from dual) q
where q.txt like 'myPhrase'; --exact match; returns value
select q.txt
from (select 'myPhrase' txt from dual) q
where regexp_like(q.txt, 'myphrase', 'i'); --case insensitive switch 'i'; returns value
select q.txt
from (select 'myPhrase' txt from dual) q
where lower(q.txt) like 'myphrase'; --enforced match; returns value;
For reference, see the Oracle documentation on collation rules for different SQL operations and on data type comparison rules:
Once a pseudo-collation is determined as the collation to use, the NLS_SORT and NLS_COMP session parameters are checked to provide the actual named collation to apply.
Usually these two are set to BINARY:
SELECT
t.parameter,
t.value
FROM
nls_database_parameters t
WHERE
t.parameter IN (
'NLS_COMP', 'NLS_SORT'
);
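For SQLite, which the question also lists, the defaults can be probed with a short Python sketch using the built-in sqlite3 module. This only covers SQLite with no collation configured; the scalar helper is a made-up convenience, and the other engines would need to be checked separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    # Hypothetical helper: run a one-row, one-column query and return the value.
    return conn.execute(sql).fetchone()[0]

# '=' uses the BINARY collation by default, so it is case-sensitive.
eq_default = scalar("SELECT 'ab' = 'AB'")
# An explicit COLLATE NOCASE makes the same comparison case-insensitive.
eq_nocase = scalar("SELECT 'ab' = 'AB' COLLATE NOCASE")
# LIKE is case-insensitive for ASCII letters by default in SQLite.
like_default = scalar("SELECT 'ab' LIKE 'AB'")
# String functions such as instr() and replace() match case-sensitively.
instr_pos = scalar("SELECT instr('ABC', 'b')")
replaced = scalar("SELECT replace('ABC', 'b', 'x')")

print(eq_default, eq_nocase, like_default, instr_pos, replaced)
# 0 1 1 0 ABC
```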

How to make Oracle and SQL Server ORDER BY the same?

I need to compare table counts for an Oracle schema to a SQL Server database. However, when I make my query, the results are always off because of the way each handles the underscore ('_') in terms of ordering. I've included an example of what I'm seeing below.
In Oracle:
SELECT FIELD1 FROM ORACLE_ORDER ORDER BY FIELD1 ASC;
Result:
'ABC'
'ABCD'
'ABC_D'
In SQL Server:
SELECT FIELD1 FROM SQL_ORDER ORDER BY FIELD1 ASC;
Result:
'ABC'
'ABC_D'
'ABCD'
As you can see above, Oracle and SQL Server treat the underscore differently when it comes to ordering. How can I modify either of the queries (or environments) so that they order the same way?

On the SQL Server side, use the following:
Select * from SQL_ORDER
ORDER BY FIELD1 Collate SQL_Latin1_General_CP850_BIN
The collation SQL_Latin1_General_CP850_BIN compares strings by their binary code values. Here the ASCII code of the underscore is 95, 'A' is 65, and 'Z' is 90, so the underscore sorts after every upper-case letter. Remember that lower-case "a" has a higher value than upper-case "A", and so on.

A simple way to achieve this is to add a COLLATE SQL_Latin1_General_CP850_BIN clause to the ORDER BY:
SELECT * FROM (
SELECT 'ABC' AS TAB UNION
SELECT 'ABC_D' UNION
SELECT 'ABCD' UNION
SELECT 'ABC_' UNION
SELECT 'ABC' UNION
SELECT 'A_C' UNION
SELECT 'ABC_DE_FGH' UNION
SELECT 'ABCXDEYFGH') AS X
ORDER BY X.Tab COLLATE SQL_Latin1_General_CP850_BIN
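To see which order a byte-wise comparison gives for the question's three values, here is a small Python sketch. It assumes that the binary collation compares the code page 850 byte values, which is what the collation name suggests; it is not SQL Server's actual implementation.

```python
# The three values from the question.
values = ["ABC", "ABC_D", "ABCD"]

# Sort by the encoded bytes. For these ASCII characters, code page 850
# bytes equal the ASCII codes: 'D' is 68 and '_' is 95.
binary_order = sorted(values, key=lambda s: s.encode("cp850"))

print(binary_order)  # ['ABC', 'ABCD', 'ABC_D']
```

This is the order shown for Oracle in the question, which is why forcing a binary collation on the SQL Server side lines the two results up.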

SQL Server sort order with nonprintable characters

I have a scalar-valued function that returns a varchar containing the ASCII unit separator, Char(31). I am using this result as part of an Order By clause and attempting to sort in ascending order.
My scalar-valued function returns results like the following (the nonprintable character is spelled out for reference):
ABC
ABC (CHAR(31)) DEF
ABC (CHAR(31)) DEF (CHAR(31)) HIJ
I would expect that when I order by ascending the results would be the following:
ABC
ABCDEF
ABCDEFHIJ
instead I am seeing the results as the complete opposite:
ABCDEFHIJ
ABCDEF
ABC
Now I am fairly certain that this has to do with the non-printable characters, but I am not sure why. Any idea as to why that is the case?
Thanks

The sort order can be influenced by your COLLATION settings. The following script, which explicitly uses Latin1_General_CI_AS as the collation, orders the items as you would expect.
;WITH q (Col) AS (
SELECT 'ABC' UNION ALL
SELECT 'ABC' + CHAR(31) + 'DEF' UNION ALL
SELECT 'ABC' + CHAR(31) + 'DEF' + CHAR(31) + 'HIJ'
)
SELECT *
FROM q
ORDER BY
Col COLLATE Latin1_General_CI_AS
What collation are you using? You can check the collation of the current database with
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS SQLCollation;

I am able to duplicate this behavior in SQL Server 2008 R2 with collation set to SQL_Latin1_General_CP1_CI_AS.
If you cannot change your collation settings, set the field to nvarchar instead of varchar. This solved the issue for me.
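For comparison, this is the order a plain code-point comparison produces for the three values in the question. It is only a Python sketch of the expected result, not a reproduction of any SQL Server collation.

```python
US = chr(31)  # ASCII unit separator, as returned by the function in the question

rows = [
    "ABC" + US + "DEF" + US + "HIJ",
    "ABC" + US + "DEF",
    "ABC",
]

# Under code-point comparison a string sorts before any longer string
# that it is a prefix of, so the shortest value comes first.
by_code_point = sorted(rows)

print([r.replace(US, "<US>") for r in by_code_point])
# ['ABC', 'ABC<US>DEF', 'ABC<US>DEF<US>HIJ']
```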

Is the 'as' keyword required in Oracle to define an alias?

Is the 'AS' keyword required in Oracle to define an alias name for a column in a SELECT statement?
I noticed that
SELECT column_name AS "alias"
is the same as
SELECT column_name "alias"
I am wondering what the consequences are of defining a column alias in the latter way.

According to the select_list section of Oracle's SELECT documentation, AS is optional.
As a personal note, I think the query is easier to read with AS.

(Tested on Oracle 11g)
About AS:
When used on result column, AS is optional.
When used on table name, AS shouldn't be added, otherwise it's an error.
About double quote:
It's optional & valid for both result column & table name.
e.g
-- 'AS' is optional for result column
select (1+1) as result from dual;
select (1+1) result from dual;
-- 'AS' shouldn't be used for table name
select 'hi' from dual d;
-- Adding double quotes for alias name is optional, but valid for both result column & table name,
select (1+1) as "result" from dual;
select (1+1) "result" from dual;
select 'hi' from dual "d";

Using AS without double quotes works well.
SELECT employee_id,department_id AS department
FROM employees
order by department
--ok--
SELECT employee_id,department_id AS "department"
FROM employees
order by department
--error on oracle--
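So it is better to use AS without double quotes if you use an ORDER BY clause: the quoted alias "department" is case-sensitive, while the unquoted department in ORDER BY is resolved as DEPARTMENT.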

Both are correct. Oracle allows the use of both.

My conclusion (tested on 12c):
AS is always optional, with or without ""; AS makes no difference (column aliases only; you cannot use AS before a table alias).
However, using "" does make a difference, because "" makes lower-case letters possible in an alias.
Thus:
SELECT {T / t} FROM (SELECT 1 AS T FROM DUAL); -- Correct
SELECT "tEST" FROM (SELECT 1 AS "tEST" FROM DUAL); -- Correct
SELECT {"TEST" / tEST} FROM (SELECT 1 AS "tEST" FROM DUAL ); -- Incorrect
SELECT test_value AS "doggy" FROM test ORDER BY "doggy"; --Correct
SELECT test_value AS "doggy" FROM test WHERE "doggy" IS NOT NULL; --You can not do this, column alias not supported in WHERE & HAVING
SELECT * FROM test "doggy" WHERE "doggy".test_value IS NOT NULL; -- Do not use AS preceding table alias
So the reason that using AS together with "" causes problems is not AS.
Note: double quotes are required if the alias contains a space, or if it contains lower-case characters that must show up in the result set as lower case. In all other scenarios they are optional and can be omitted.

The quotes are required when the alias name contains a space, for example:
SELECT employee_id,department_id AS "Department ID"
FROM employees
order by "Department ID"

There is no difference between the two. AS is just a more explicit way of stating the alias, which is good because some dependent libraries rely on this small keyword, e.g. JDBC 4.0. Depending on whether it is used, different behaviour can be observed.
I would always suggest using the full form of the syntax to avoid such issues.

Flatten national characters in SQL Server

I have a column that contains pet names with national characters. How do I write the query to match them all in one condition?
|PetName|
Ćin
ćin
Ĉin
ĉin
Ċin
ċin
Čin
čin
Something like a FLATTEN function here:
...WHERE LOWER(FLATTEN(PetName)) = 'cin'
I tried casting it from NVARCHAR to VARCHAR, but it didn't help. I'd like to avoid using REPLACE for every character.

This should work because the accent-insensitive Cyrillic collation reduces characters with diacritics such as Đ, Ž, Ć, Č, Š to their base letters:
-- a table variable is declared with @ (a # prefix is for temporary tables)
declare @t table(PetName nvarchar(100))
insert into @t
SELECT N'Ćin' union all
SELECT N'ćin' union all
SELECT N'Ĉin' union all
SELECT N'ĉin' union all
SELECT N'Ċin' union all
SELECT N'ċin' union all
SELECT N'Čin' union all
SELECT N'čin'
SELECT *
FROM @t
WHERE lower(PetName) = 'cin' COLLATE Cyrillic_General_CS_AI

You can change the collation used for the comparison:
WHERE PetName COLLATE Cyrillic_General_CI_AI = 'cin'

There isn't really a way or built-in function that will strip accents from characters.
If you are doing comparisons (LIKE, IN, PATINDEX etc), you can just force COLLATE if the column/db is not already accent insensitive.
Normally, a query like this
with test(col) as (
select 'Ćin' union all
select 'ćin')
select * from test
where col='cin'
will return both rows, since the default collation (unless you change it) is insensitive. This won't work for FULLTEXT indexes though.
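If the flattening can be done in application code instead of in SQL Server, a common approach is Unicode decomposition. The sketch below is an assumption about how that could look in Python; the flatten function is a made-up name mirroring the FLATTEN wished for in the question, not a built-in.

```python
import unicodedata

def flatten(text):
    # Decompose each character into base letter plus combining marks (NFD),
    # then drop the combining marks. Letters with no decomposition, such as
    # Đ, are left unchanged by this approach.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Pet names from the question.
pet_names = ["Ćin", "ćin", "Ĉin", "ĉin", "Ċin", "ċin", "Čin", "čin"]

matches = [name for name in pet_names if flatten(name).lower() == "cin"]
print(len(matches))  # 8
```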
