DB2 sql query to find non ascii characters in strings - sql

I have a table (say ELEMENTS) with a VARCHAR field named NAME encoded in ccsid 1144. I need to find all the strings in the NAME field which contain "non ascii characters", that is characters that are in the ccsid 1144 set of characters without the ascii ones.

I think you should be able to create a function like this:
CREATE FUNCTION CONTAINS_NON_ASCII(INSTR VARCHAR(4000))
RETURNS CHAR(1)
DETERMINISTIC NO EXTERNAL ACTION CONTAINS SQL
BEGIN ATOMIC
DECLARE POS, LEN INT;
IF INSTR IS NULL THEN
RETURN NULL;
END IF;
SET (POS, LEN) = (1, LENGTH(INSTR));
WHILE POS <= LEN DO
IF ASCII(SUBSTR(INSTR, POS, 1)) > 127 THEN
RETURN 'Y';
END IF;
SET POS = POS + 1;
END WHILE;
RETURN 'N';
END
And then write:
SELECT NAME
FROM ELEMENTS
WHERE CONTAINS_NON_ASCII(NAME) = 'Y'
;
(Disclaimer: completely untested.)
By the way, judging by the documentation, it seems that VARCHAR is a string of bytes, not of Unicode characters. (Bytes range from 0 to 0xFF; Unicode code points range from 0 to 0x10FFFF.) If you're interested in supporting Unicode, you might want to use a different data type.
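To sanity-check the logic outside the database, the same per-character test can be sketched in Python. This is only an illustration: it assumes the value has already been decoded into a Unicode string, and `contains_non_ascii` is a hypothetical helper that mirrors the UDF above, not a DB2 API.

```python
def contains_non_ascii(value):
    """Return None for None, else True if any character lies outside ASCII (0-127)."""
    if value is None:
        return None
    return any(ord(ch) > 127 for ch in value)


print(contains_non_ascii("ELEMENT"))            # False
print(contains_non_ascii("\u00c9L\u00c9MENT"))  # True (contains U+00C9)
```

ASCII ends at code 127, which is why the comparison is against 127 and not 128.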

Related

Strip out all non-alphanumeric characters and non-set punctuation

Hi, I've taken a function that strips everything out of a string except alphanumeric characters and some punctuation characters.
ALTER FUNCTION [dbo].[Remove_Non_Alphanumeric]
(@String_Parameter VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Alphanumeric_Characters VARCHAR(289) = '%[^A-Za-z0-9| |?|,|&|\|/|.|'+CHAR(13)+'|'+CHAR(10)+'|(|)|]|[|-]%';
WHILE PATINDEX(@Alphanumeric_Characters, @String_Parameter) > 0
BEGIN
SELECT @String_Parameter = STUFF(@String_Parameter, PATINDEX(@Alphanumeric_Characters, @String_Parameter), 1, '');
END
RETURN @String_Parameter;
END
This mostly seems to work. However, there are odd passages where some characters are returned as a ?, and odder still, some bullet points are changed to ? while others are left, despite neither being in my list of acceptable characters.
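For comparison, the intended whitelist behaviour can be sketched in Python. This is an illustration only, not a fix for the T-SQL function: the allowed set below is my reading of the pattern above (letters, digits, space, `?`, `,`, `&`, backslash, `/`, `.`, parentheses, square brackets, hyphen, CR and LF), the pipe characters are deliberately left out, and `remove_non_whitelisted` is a hypothetical name.

```python
import re

# Any character NOT in this set is removed (assumed whitelist, see above).
_DISALLOWED = re.compile(r"[^A-Za-z0-9 ?,&\\/.()\[\]\-\r\n]")


def remove_non_whitelisted(text):
    """Delete every character that is not on the whitelist."""
    return _DISALLOWED.sub("", text)


print(remove_non_whitelisted("A\u2022B (c)!"))  # AB (c)
```

With this approach a bullet point (U+2022) is simply dropped, provided it reaches the function as a real bullet character rather than as an already-substituted `?`.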

ORACLE PL-SQL how to create function to split a string and return n-long "chunks" into an array?

I need to create a function that will accept a string input of any length and return an array of strings, each containing n-character chunks. For example, an input of This is a test with 3-character chunks should return:
Thi
s i
s a
tes
t
I have created the following function to do so. My question is: is there possibly a better, faster way to approach this? I know that this function may be called many times using very long strings, and I do not want it to slow down the system. Additionally, I eventually need to set the function up so that it also creates a new entry upon detection of a delimiter. For example, with a "chunk length" of three:
Testing with comma delimiters, one, two, three, test
Should return:
Tes
tin
g w
ith
co
mma
del
imi
ter
s,
one
,
two
,
thr
ee,
te
st
Notice that I do not want the delimiter itself to be deleted or replaced. I just want a new line/new array entry to start right after it is detected.
Here is my code so far:
CREATE OR REPLACE FUNCTION SPLIT_STRING (
p_str VARCHAR2, --String to split
p_del VARCHAR2, --Delimiter
p_len INTEGER, --Length of each chunk
p_force NUMBER) --Forces split when length is reached (1=on, 0=off)
RETURN VARCHAR2 IS
l_tmp_str VARCHAR2(32767);
l_chnk_len INTEGER;
l_str VARCHAR2(32767);
l_chunk VARCHAR2(32767);
l_pos INTEGER;
l_len INTEGER;
l_chnksize NUMBER;
BEGIN
--Determine the string's total length
l_len:= LENGTH(p_str);
IF (l_len > 0)
THEN
l_tmp_str:= p_str;
--Determine the necessary number of chunks
l_chnksize:=(l_len/p_len);
IF MOD(l_chnksize,1) != 0
THEN
l_chnksize:= CEIL(l_chnksize);
END IF;
--Split the string into chunks
IF p_force = 1
THEN
l_pos:=1;
FOR loop_num IN 1..l_chnksize
LOOP
IF (loop_num>1)
THEN
l_str:=l_str||CHR(10)||CHR(13)||SUBSTR(p_str,l_pos,p_len);
ELSE
l_str:=SUBSTR(p_str,l_pos,p_len);
END IF;
--Increment position placeholder
l_pos:=l_pos+p_len;
END LOOP;
ELSE
l_str:='UNFORCED, NOT IMPLEMENTED';
END IF;
END IF;
--Return the delimited string
RETURN l_str;
END SPLIT_STRING;
/
My specific question is: Is there a FASTER way to do this for LARGE string inputs?
I don't know if this is faster, but definitely simpler. You are not actually putting the chunks in arrays, but inserting newline character after every delimiter or a group of characters. This can be easily done using regular expressions.
select regexp_replace('Testing with comhm,a sdfdeli,mitjers,one,two,three,test',
'(.{0,3},)|(.{5})',
'\1\2' ||chr(10)) chunks
from dual;
CHUNKS
-------
Testi
ng wi
th co
mhm,
a sdf
deli,
mitje
rs,
one,
two,
three
,
test
Regex Explanation:
(.{0,3},) : Group of up to 3 characters followed by a comma(delimiter), Assuming 5 as the length of each chunk.
(.{5}) : Group of 5 characters, Assuming 5 as the length of each chunk.
These first and second capture groups are replaced by themselves appended with a newline character.
Generic expression would be,
'(.{0,'||(length-2)||'}'||delimiter||')|(.{'||(length)||'})'
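A procedural take on the same idea can be sketched in Python. This is not a translation of the regular expression above: `split_chunks` is a hypothetical helper that closes a chunk either when it reaches the requested size or right after a single-character delimiter, which is how I read the requirement in the question.

```python
def split_chunks(text, size, delimiter=None):
    """Split text into pieces of at most `size` characters.

    If `delimiter` (a single character) is given, a piece also ends
    right after each delimiter; the delimiter itself is kept.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    chunks = []
    current = ""
    for ch in text:
        current += ch
        if len(current) == size or ch == delimiter:
            chunks.append(current)
            current = ""
    if current:  # trailing partial chunk
        chunks.append(current)
    return chunks


print(split_chunks("This is a test", 3))
# ['Thi', 's i', 's a', ' te', 'st']
print(split_chunks("a,bcd,efghij", 3, ","))
# ['a,', 'bcd', ',', 'efg', 'hij']
```

Returning a list keeps the chunks as separate entries; joining them with a newline would give the same shape of output as the regexp_replace query.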

How to cut varchar/text before n'th occurrence of delimiter? PostgreSQL

I have strings (saved in the database as varchar) and I have to cut them just before the n'th occurrence of a delimiter.
Example input:
String: 'My-Example-Awesome-String'
Delimiter: '-'
Occurrence: 2
Output:
My-Example
I implemented this function for fast prototype:
CREATE OR REPLACE FUNCTION find_position_delimiter(fulltext varchar, delimiter varchar, occurence integer)
RETURNS varchar AS
$BODY$
DECLARE
result varchar = '';
arr text[] = regexp_split_to_array( fulltext, delimiter);
word text;
counter integer := 0;
BEGIN
FOREACH word IN ARRAY arr LOOP
EXIT WHEN ( counter = occurence );
IF (counter > 0) THEN result := result || delimiter;
END IF;
result := result || word;
counter := counter + 1;
END LOOP;
RETURN result;
END;
$BODY$
LANGUAGE 'plpgsql' IMMUTABLE;
SELECT find_position_delimiter('My-Example-Awesome-String', '-', 2);
For now it assumes that the string is not empty (guaranteed by the query from which I will call the function) and that the string contains at least one occurrence of the delimiter.
But now I need something better for a performance test. If possible, I would love to see the most universal solution, because not every user of my system works on a PostgreSQL database (a few of them prefer Oracle, MySQL or SQLite), but that is not the most important thing. Performance is, because for a specific search that function can be called up to a few hundred times.
I didn't find anything about a fast and easy way of treating a varchar as an array of chars and checking for occurrences of the delimiter (I could remember the positions of the occurrences and then create a substring from the first char to the n'th delimiter position - 1). Any ideas? Are there smarter solutions?
EDIT: Yes, I know that the function will be a bit different in every database, but the body of the function can be very similar or the same. Generality is not the main goal :) And sorry for the bad working name of the function; I just noticed that it does not have the right meaning.
You can try doing something based on this:
select
varcharColumnName,
INSTR(varcharColumnName,'-',1,2),
case when INSTR(varcharColumnName,'-',1,2) <> 0
THEN SUBSTR(varcharColumnName, 1, INSTR(varcharColumnName,'-',1,2) - 1)
else '...'
end
from tableName;
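Of course, you have to handle the ELSE branch the way you want. It worked for me on Postgres and Oracle (tested). Note that the four-argument INSTR is an Oracle-style function rather than standard SQL, so other DBMSs (and a stock PostgreSQL install) may need an equivalent function or a compatibility extension.
Edit: the same thing as a function, although this way it's rather hard to make it cross-DBMS: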
CREATE OR REPLACE FUNCTION find_position_delimiter(fulltext varchar, delimiter varchar, occurence integer)
RETURNS varchar as
$BODY$
DECLARE
result varchar := '';
delimiterPos integer := 0;
BEGIN
delimiterPos := INSTR(fulltext,delimiter,1,occurence);
result := SUBSTR(fulltext, 1, delimiterPos - 1);
RETURN result;
END;
$BODY$
LANGUAGE 'plpgsql' IMMUTABLE;
SELECT find_position_delimiter('My-Example-Awesome-String', '-', 2);
create or replace function trunc(string text, delimiter char, occurence int) returns text as $$
return delimiter.join(string.split(delimiter)[:occurence])
$$ language plpythonu;
# select trunc('My-Example-Awesome-String', '-', 2);
trunc
------------
My-Example
(1 row)
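The position-based idea from the question (find the n'th delimiter, then take the substring before it) can be sketched in plain Python without splitting the whole string. `cut_before_nth` is a hypothetical name, and returning the full string when there are fewer delimiters than requested is my assumption about the desired behaviour.

```python
def cut_before_nth(text, delimiter, occurrence):
    """Return the part of text before the n'th occurrence of delimiter."""
    if occurrence < 1:
        raise ValueError("occurrence must be at least 1")
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    start = 0
    pos = -1
    for _ in range(occurrence):
        pos = text.find(delimiter, start)
        if pos == -1:
            return text  # fewer than `occurrence` delimiters: keep everything
        start = pos + len(delimiter)
    return text[:pos]


print(cut_before_nth("My-Example-Awesome-String", "-", 2))  # My-Example
```

Because the scan stops at the n'th delimiter, the rest of the string is never examined, which may matter when the function is called on long values.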

How to get size in bytes of a CLOB column in Oracle?

How do I get the size in bytes of a CLOB column in Oracle?
LENGTH() and DBMS_LOB.getLength() both return the number of characters used in the CLOB, but I need to know how many bytes are used (I'm dealing with multibyte character sets).
After some thinking I came up with this solution:
LENGTHB(TO_CHAR(SUBSTR(<CLOB-Column>,1,4000)))
SUBSTR returns only the first 4000 characters (max string size)
TO_CHAR converts from CLOB to VARCHAR2
LENGTHB returns the length in Bytes used by the string.
I'm adding my comment as an answer because it solves the original problem for a wider range of cases than the accepted answer. Note: you must still know the maximum length and the approximate proportion of multi-byte characters that your data will have.
If you have a CLOB greater than 4000 bytes, you need to use DBMS_LOB.SUBSTR rather than SUBSTR. Note that the amount and offset parameters are reversed in DBMS_LOB.SUBSTR.
Next, you may need to substring an amount less than 4000, because this parameter is the number of characters, and if you have multi-byte characters then 4000 characters will be more than 4000 bytes long, and you'll get ORA-06502: PL/SQL: numeric or value error: character string buffer too small because the substring result needs to fit in a VARCHAR2 which has a 4000 byte limit. Exactly how many characters you can retrieve depends on the average number of bytes per character in your data.
So my answer is:
LENGTHB(TO_CHAR(DBMS_LOB.SUBSTR(<CLOB-Column>,3000,1)))
+NVL(LENGTHB(TO_CHAR(DBMS_LOB.SUBSTR(<CLOB-Column>,3000,3001))),0)
+NVL(LENGTHB(TO_CHAR(DBMS_LOB.SUBSTR(<CLOB-Column>,3000,6001))),0)
+...
where you add as many chunks as you need to cover your longest CLOB, and adjust the chunk size according to average bytes-per-character of your data.
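The arithmetic behind the chunked sum can be illustrated in Python. This is a sketch that assumes UTF-8 as the multi-byte encoding and a chunk size of 3000 characters; `byte_length_chunked` is a hypothetical helper, not an Oracle function.

```python
def byte_length_chunked(text, chunk_chars=3000, encoding="utf-8"):
    """Sum the encoded byte length of consecutive fixed-size character chunks."""
    total = 0
    for offset in range(0, len(text), chunk_chars):
        chunk = text[offset:offset + chunk_chars]
        total += len(chunk.encode(encoding))
    return total


sample = "\u00e9" * 5000 + "abc"    # 5000 two-byte characters plus 3 ASCII ones
print(len(sample))                  # 5003 characters
print(byte_length_chunked(sample))  # 10003 bytes
```

Each chunk is measured in characters but counted in bytes, so the chunk size has to be small enough that the encoded chunk still fits the byte limit of the target type.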
Try this one for CLOB sizes bigger than VARCHAR2:
We have to split the CLOB into parts of "VARCHAR2-compatible" size, run lengthb over every part of the CLOB data, and sum all the results.
declare
my_sum int;
begin
for x in ( select COLUMN, ceil(DBMS_LOB.getlength(COLUMN) / 2000) steps from TABLE )
loop
my_sum := 0;
for y in 1 .. x.steps
loop
my_sum := my_sum + lengthb(dbms_lob.substr( x.COLUMN, 2000, (y-1)*2000+1 ));
-- some additional output
dbms_output.put_line('step:' || y );
dbms_output.put_line('char length:' || DBMS_LOB.getlength(dbms_lob.substr( x.COLUMN, 2000 , (y-1)*2000+1 )));
dbms_output.put_line('byte length:' || lengthb(dbms_lob.substr( x.COLUMN, 2000, (y-1)*2000+1 )));
continue;
end loop;
dbms_output.put_line('char summary:' || DBMS_LOB.getlength(x.COLUMN));
dbms_output.put_line('byte summary:' || my_sum);
continue;
end loop;
end;
/
The simple solution is to cast the CLOB to a BLOB and then request the length of the BLOB!
The problem is that Oracle doesn't have a function that casts a CLOB to a BLOB, but we can simply define a function to do that:
create or replace
FUNCTION clob2blob (p_in clob) RETURN blob IS
v_blob blob;
v_desc_offset PLS_INTEGER := 1;
v_src_offset PLS_INTEGER := 1;
v_lang PLS_INTEGER := 0;
v_warning PLS_INTEGER := 0;
BEGIN
dbms_lob.createtemporary(v_blob,TRUE);
dbms_lob.converttoblob
( v_blob
, p_in
, dbms_lob.getlength(p_in)
, v_desc_offset
, v_src_offset
, dbms_lob.default_csid
, v_lang
, v_warning
);
RETURN v_blob;
END;
The SQL command to use to obtain number of bytes is
SELECT length(clob2blob(fieldname)) as nr_bytes
or
SELECT dbms_lob.getlength(clob2blob(fieldname)) as nr_bytes
I have tested this on Oracle 10g without using Unicode (UTF-8).
But I think this solution should also be correct on a Unicode (UTF-8) Oracle instance :-)
I want to thank Nashev, who posted a solution for converting a CLOB to a BLOB (How convert CLOB to BLOB in Oracle?), and this post written in German (the code is in PL/SQL), 13ter.info.blog, which additionally gives a function to convert a BLOB to a CLOB!
Can somebody test the two commands on a Unicode (UTF-8) CLOB so I can be sure that this works with Unicode?
NVL(length(clob_col_name),0) works for me.
Check the LOB segment name from dba_lobs using the table name.
select TABLE_NAME,OWNER,COLUMN_NAME,SEGMENT_NAME from dba_lobs where TABLE_NAME='<<TABLE NAME>>';
Now use the segment name to find the bytes used in dba_segments.
select s.segment_name, s.partition_name, bytes/1048576 "Size (MB)"
from dba_segments s, dba_lobs l
where s.segment_name = l.segment_name
and s.owner = '<< OWNER >> ' order by s.segment_name, s.partition_name;
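That only works up to 4000 bytes. If the CLOB is bigger than 4000 bytes, we can use this:
declare
v_clob clob := 'sample text'; -- placeholder value for illustration
v_clob_size number;
begin
-- For a CLOB, DBMS_LOB.getlength counts characters, not bytes
v_clob_size := DBMS_LOB.getlength(v_clob) / 1024 / 1024;
DBMS_OUTPUT.put_line('CLOB Size ' || v_clob_size);
end;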
or
select (DBMS_LOB.getlength(your_column_name))/1024/1024 from your_table

Natural Sort in MySQL

Is there an elegant way to have performant, natural sorting in a MySQL database?
For example if I have this data set:
Final Fantasy
Final Fantasy 4
Final Fantasy 10
Final Fantasy 12
Final Fantasy 12: Chains of Promathia
Final Fantasy Adventure
Final Fantasy Origins
Final Fantasy Tactics
Any other elegant solution than to split up the games' names into their components
Title: "Final Fantasy"
Number: "12"
Subtitle: "Chains of Promathia"
to make sure that they come out in the right order? (10 after 4, not before 2).
Doing so is a pain in the a** because every now and then there's another game that breaks that mechanism of parsing the game title (e.g. "Warhammer 40,000", "James Bond 007")
Here is a quick solution:
SELECT alphanumeric,
integer
FROM sorting_test
ORDER BY LENGTH(alphanumeric), alphanumeric
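The effect of ordering by length first and text second can be tried out in Python. This is an illustrative sketch: `length_then_text` is a hypothetical key function playing the role of the ORDER BY clause above.

```python
def length_then_text(value):
    """Sort key equivalent to ORDER BY LENGTH(value), value."""
    return (len(value), value)


numbers = ["10", "9", "2", "100", "1"]
print(sorted(numbers))                        # ['1', '10', '100', '2', '9']
print(sorted(numbers, key=length_then_text))  # ['1', '2', '9', '10', '100']

# The trick relies on shorter meaning smaller, so mixed text can misorder:
print(sorted(["a10", "b1"], key=length_then_text))  # ['b1', 'a10']
```

It works well for values that are entirely numeric or share a fixed-length prefix, and less well for free-form titles.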
Just found this:
SELECT names FROM your_table ORDER BY games + 0 ASC
Does a natural sort when the numbers are at the front, might work for middle as well.
Same function as posted by @plalx, but rewritten to MySQL:
DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
DELIMITER ;;
CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000))
RETURNS int
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
DECLARE position int;
DECLARE tmp_position int;
SET position = 5000;
SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
IF (position = 5000) THEN RETURN 0; END IF;
RETURN position;
END
;;
DELIMITER ;
DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
DELIMITER ;;
CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50))
RETURNS varchar(4000)
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
DECLARE sortString varchar(4000);
DECLARE numStartIndex int;
DECLARE numEndIndex int;
DECLARE padLength int;
DECLARE totalPadLength int;
DECLARE i int;
DECLARE sameOrderCharsLen int;
SET totalPadLength = 0;
SET instring = TRIM(instring);
SET sortString = instring;
SET numStartIndex = udf_FirstNumberPos(instring);
SET numEndIndex = 0;
SET i = 1;
SET sameOrderCharsLen = CHAR_LENGTH(sameOrderChars);
WHILE (i <= sameOrderCharsLen) DO
SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
SET i = i + 1;
END WHILE;
WHILE (numStartIndex <> 0) DO
SET numStartIndex = numStartIndex + numEndIndex;
SET numEndIndex = numStartIndex;
WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
SET numEndIndex = numEndIndex + 1;
END WHILE;
SET numEndIndex = numEndIndex - 1;
SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);
IF padLength < 0 THEN
SET padLength = 0;
END IF;
SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));
SET totalPadLength = totalPadLength + padLength;
SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, CHAR_LENGTH(instring) - numEndIndex));
END WHILE;
RETURN sortString;
END
;;
DELIMITER ;
Usage:
SELECT name FROM products ORDER BY udf_NaturalSortFormat(name, 10, ".")
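The core of this function, padding every run of digits to a fixed width so that plain string comparison orders numbers correctly, can be sketched in a few lines of Python. This is a simplified illustration: `zero_pad_key` is a hypothetical helper and it leaves out the `sameOrderChars` replacement step.

```python
import re


def zero_pad_key(text, number_length=10):
    """Left-pad every run of ASCII digits with zeros up to number_length."""
    return re.sub(
        r"[0-9]+",
        lambda match: match.group().zfill(number_length),
        text.strip(),
    )


print(zero_pad_key("R2", 4))                           # R0002
print(sorted(["R11", "R2", "R1"], key=zero_pad_key))   # ['R1', 'R2', 'R11']
```

As in the SQL version, numbers longer than `number_length` are left as they are, so the width should be at least the length of the longest number in the data.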
I think this is why a lot of things are sorted by release date.
A solution could be to create another column in your table for the "SortKey". This could be a sanitized version of the title which conforms to a pattern you create for easy sorting or a counter.
I've written this function for MSSQL 2000 a while ago:
/**
* Returns a string formatted for natural sorting. This function is very useful when having to sort alpha-numeric strings.
*
* @author Alexandre Potvin Latreille (plalx)
* @param {nvarchar(4000)} string The formatted string.
* @param {int} numberLength The length each number should have (including padding). This should be the length of the longest number. Defaults to 10.
* @param {char(50)} sameOrderChars A list of characters that should have the same order. Ex: '.-/'. Defaults to empty string.
*
* @return {nvarchar(4000)} A string for natural sorting.
* Example of use:
*
* SELECT Name FROM TableA ORDER BY Name
* TableA (unordered) TableA (ordered)
* ------------ ------------
* ID Name ID Name
* 1. A1. 1. A1-1.
* 2. A1-1. 2. A1.
* 3. R1 --> 3. R1
* 4. R11 4. R11
* 5. R2 5. R2
*
*
* As we can see, humans would expect A1., A1-1., R1, R2, R11 but that's not how SQL is sorting it.
* We can use this function to fix this.
*
* SELECT Name FROM TableA ORDER BY dbo.udf_NaturalSortFormat(Name, default, '.-')
* TableA (unordered) TableA (ordered)
* ------------ ------------
* ID Name ID Name
* 1. A1. 1. A1.
* 2. A1-1. 2. A1-1.
* 3. R1 --> 3. R1
* 4. R11 4. R2
* 5. R2 5. R11
*/
CREATE FUNCTION dbo.udf_NaturalSortFormat(
@string nvarchar(4000),
@numberLength int = 10,
@sameOrderChars char(50) = ''
)
RETURNS varchar(4000)
AS
BEGIN
DECLARE @sortString varchar(4000),
@numStartIndex int,
@numEndIndex int,
@padLength int,
@totalPadLength int,
@i int,
@sameOrderCharsLen int;
SELECT
@totalPadLength = 0,
@string = RTRIM(LTRIM(@string)),
@sortString = @string,
@numStartIndex = PATINDEX('%[0-9]%', @string),
@numEndIndex = 0,
@i = 1,
@sameOrderCharsLen = LEN(@sameOrderChars);
-- Replace all char that has to have the same order by a space.
WHILE (@i <= @sameOrderCharsLen)
BEGIN
SET @sortString = REPLACE(@sortString, SUBSTRING(@sameOrderChars, @i, 1), ' ');
SET @i = @i + 1;
END
-- Pad numbers with zeros.
WHILE (@numStartIndex <> 0)
BEGIN
SET @numStartIndex = @numStartIndex + @numEndIndex;
SET @numEndIndex = @numStartIndex;
WHILE(PATINDEX('[0-9]', SUBSTRING(@string, @numEndIndex, 1)) = 1)
BEGIN
SET @numEndIndex = @numEndIndex + 1;
END
SET @numEndIndex = @numEndIndex - 1;
SET @padLength = @numberLength - (@numEndIndex + 1 - @numStartIndex);
IF @padLength < 0
BEGIN
SET @padLength = 0;
END
SET @sortString = STUFF(
@sortString,
@numStartIndex + @totalPadLength,
0,
REPLICATE('0', @padLength)
);
SET @totalPadLength = @totalPadLength + @padLength;
SET @numStartIndex = PATINDEX('%[0-9]%', RIGHT(@string, LEN(@string) - @numEndIndex));
END
RETURN @sortString;
END
GO
MySQL doesn't allow this sort of "natural sorting", so it looks like the best way to get what you're after is to split your data set up as you've described above (separate id field, etc), or failing that, perform a sort based on a non-title element, indexed element in your db (date, inserted id in the db, etc).
Having the db do the sorting for you is almost always going to be quicker than reading large data sets into your programming language of choice and sorting it there, so if you've any control at all over the db schema here, then look at adding easily-sorted fields as described above, it'll save you a lot of hassle and maintenance in the long run.
Requests to add a "natural sort" come up from time to time on the MySQL bugs and discussion forums, and many solutions revolve around stripping out specific parts of your data and casting them for the ORDER BY part of the query, e.g.
SELECT * FROM table ORDER BY CAST(mid(name, 6, LENGTH(name) -5) AS unsigned)
This sort of solution could just about be made to work on your Final Fantasy example above, but isn't particularly flexible and unlikely to extend cleanly to a dataset including, say, "Warhammer 40,000" and "James Bond 007" I'm afraid.
So, while I know that you have found a satisfactory answer, I was struggling with this problem for a while, and we'd previously determined that it could not be done reasonably well in SQL and we were going to have to use JavaScript on a JSON array.
Here's how I solved it just using SQL. Hopefully this is helpful for others:
I had data such as:
Scene 1
Scene 1A
Scene 1B
Scene 2A
Scene 3
...
Scene 101
Scene XXA1
Scene XXA2
I actually didn't "cast" things though I suppose that may also have worked.
I first replaced the parts that were unchanging in the data, in this case "Scene ", and then did a LPAD to line things up. This seems to allow pretty well for the alpha strings to sort properly as well as the numbered ones.
My ORDER BY clause looks like:
ORDER BY LPAD(REPLACE(`table`.`column`,'Scene ',''),10,'0')
Obviously this doesn't help with the original problem which was not so uniform - but I imagine this would probably work for many other related problems, so putting it out there.
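The same REPLACE-then-LPAD key can be mimicked in Python to see how it orders the sample data. This is a sketch: `scene_key` is a hypothetical helper, and the truncation to `width` is there because MySQL's LPAD shortens values that are longer than the requested length.

```python
def scene_key(label, prefix="Scene ", width=10):
    """Strip the constant prefix, then left-pad with zeros to a fixed width."""
    stripped = label.replace(prefix, "")
    # rjust pads on the left; the slice imitates LPAD cutting long values.
    return stripped.rjust(width, "0")[:width]


scenes = ["Scene 2A", "Scene 101", "Scene 1", "Scene 1A"]
print(sorted(scenes, key=scene_key))
# ['Scene 1', 'Scene 1A', 'Scene 2A', 'Scene 101']
```

The padding makes shorter labels compare as smaller, which is what puts "Scene 2A" ahead of "Scene 101".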
Add a Sort Key (Rank) in your table. ORDER BY rank
Utilise the "Release Date" column. ORDER BY release_date
When extracting the data from SQL, make your object do the sorting. For example, if extracting into a Set, make it a TreeSet, and make your data model implement Comparable and enact the natural sort algorithm there (insertion sort will suffice if you are using a language without collections), as you'll be reading the rows from SQL one by one as you create your model and insert it into the collection.
Regarding the best response from Richard Toth https://stackoverflow.com/a/12257917/4052357
Watch out for UTF8 encoded strings that contain 2byte (or more) characters and numbers e.g.
12 南新宿
Using MySQL's LENGTH() in udf_NaturalSortFormat function will return the byte length of the string and be incorrect, instead use CHAR_LENGTH() which will return the correct character length.
In my case using LENGTH() caused queries to never complete and result in 100% CPU usage for MySQL
DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
DELIMITER ;;
CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50))
RETURNS varchar(4000)
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
DECLARE sortString varchar(4000);
DECLARE numStartIndex int;
DECLARE numEndIndex int;
DECLARE padLength int;
DECLARE totalPadLength int;
DECLARE i int;
DECLARE sameOrderCharsLen int;
SET totalPadLength = 0;
SET instring = TRIM(instring);
SET sortString = instring;
SET numStartIndex = udf_FirstNumberPos(instring);
SET numEndIndex = 0;
SET i = 1;
SET sameOrderCharsLen = CHAR_LENGTH(sameOrderChars);
WHILE (i <= sameOrderCharsLen) DO
SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
SET i = i + 1;
END WHILE;
WHILE (numStartIndex <> 0) DO
SET numStartIndex = numStartIndex + numEndIndex;
SET numEndIndex = numStartIndex;
WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
SET numEndIndex = numEndIndex + 1;
END WHILE;
SET numEndIndex = numEndIndex - 1;
SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);
IF padLength < 0 THEN
SET padLength = 0;
END IF;
SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));
SET totalPadLength = totalPadLength + padLength;
SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, CHAR_LENGTH(instring) - numEndIndex));
END WHILE;
RETURN sortString;
END
;;
DELIMITER ;
p.s. I would have added this as a comment to the original but I don't have enough reputation (yet)
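The difference between byte length and character length for the example string can be checked quickly in Python. This is only an illustration and assumes the string is stored as UTF-8: `len` on the encoded bytes plays the role of MySQL's LENGTH(), and `len` on the string plays the role of CHAR_LENGTH().

```python
text = "12 南新宿"

char_count = len(text)                  # what CHAR_LENGTH() would report
byte_count = len(text.encode("utf-8"))  # what LENGTH() would report

print(char_count)  # 6
print(byte_count)  # 12
```

Index arithmetic that mixes the two counts overshoots the end of the string, which is consistent with the runaway loop described above.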
Add a field for "sort key" that has all strings of digits zero-padded to a fixed length and then sort on that field instead.
If you might have long strings of digits, another method is to prepend the number of digits (fixed-width, zero-padded) to each string of digits. For example, if you won't have more than 99 digits in a row, then for "Super Blast 10 Ultra" the sort key would be "Super Blast 0210 Ultra".
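The length-prefix variant can be sketched in Python as follows. `length_prefixed_key` is a hypothetical helper; the two-digit prefix matches the "no more than 99 digits in a row" assumption from the example above.

```python
import re


def length_prefixed_key(text, prefix_width=2):
    """Prefix every run of ASCII digits with its zero-padded digit count."""
    def add_prefix(match):
        digits = match.group()
        if len(digits) >= 10 ** prefix_width:
            raise ValueError("digit run too long for the chosen prefix width")
        return str(len(digits)).zfill(prefix_width) + digits

    return re.sub(r"[0-9]+", add_prefix, text)


print(length_prefixed_key("Super Blast 10 Ultra"))  # Super Blast 0210 Ultra
print(sorted(["v10", "v9", "v100"], key=length_prefixed_key))
# ['v9', 'v10', 'v100']
```

Compared with zero-padding, the key grows by only a fixed number of characters per number, however long the number is.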
To order:
0
1
2
10
23
101
205
1000
a
aac
b
casdsadsa
css
Use this query:
SELECT
column_name
FROM
table_name
ORDER BY
column_name REGEXP '^\d*[^\da-z&\.\' \-\"\!\@\#\$\%\^\*\(\)\;\:\\,\?\/\~\`\|\_\-]' DESC,
column_name + 0,
column_name;
If you do not want to reinvent the wheel or have a headache with a lot of code that does not work, just use Drupal Natural Sort. Just run the SQL that comes zipped (MySQL or PostgreSQL), and that's it. When making a query, simply order using:
... ORDER BY natsort_canon(column_name, 'natural')
Another option is to do the sorting in memory after pulling the data from mysql. While it won't be the best option from a performance standpoint, if you are not sorting huge lists you should be fine.
If you take a look at Jeff's post, you can find plenty of algorithms for what ever language you might be working with.
Sorting for Humans : Natural Sort Order
You can also create in a dynamic way the "sort column" :
SELECT name, (name = '-') boolDash, (name = '0') boolZero, (name+0 > 0) boolNum
FROM table
ORDER BY boolDash DESC, boolZero DESC, boolNum DESC, (name+0), name
That way, you can create groups to sort.
In my query, I wanted the '-' in front of everything, then the numbers, then the text. Which could result in something like :
-
0
1
2
3
4
5
10
13
19
99
102
Chair
Dog
Table
Windows
That way you don't have to maintain the sort column in the correct order as you add data. You can also change your sort order depending on what you need.
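The grouping logic of that ORDER BY can be sketched as a tuple sort key in Python. This is an approximation under stated assumptions: `leading_number` is a hypothetical stand-in for MySQL's `name + 0` (numeric prefix, or 0 when there is none), and Python's string comparison is case-sensitive, unlike many MySQL collations.

```python
import re

_LEADING_NUMBER = re.compile(r"\s*[+-]?[0-9]+(?:\.[0-9]*)?")


def leading_number(text):
    """Rough stand-in for `name + 0`: the numeric prefix, or 0.0 if none."""
    match = _LEADING_NUMBER.match(text)
    return float(match.group()) if match else 0.0


def group_key(name):
    number = leading_number(name)
    return (
        name != "-",     # boolDash DESC: the dash comes first
        name != "0",     # boolZero DESC: then a literal zero
        not number > 0,  # boolNum DESC: then positive numbers
        number,          # (name+0): numeric order within the numbers
        name,            # name: text order for everything else
    )


rows = ["Dog", "10", "-", "2", "0", "Chair", "102"]
print(sorted(rows, key=group_key))
# ['-', '0', '2', '10', '102', 'Chair', 'Dog']
```

Each boolean in the tuple corresponds to one of the computed columns, so changing the group order is a matter of reordering the tuple.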
A lot of other answers I see here (and in the duplicate questions) basically only work for very specifically formatted data, e.g. a string that's entirely a number, or for which there's a fixed-length alphabetic prefix. This isn't going to work in the general case.
It's true that there's not really any way to implement a 100% general nat-sort in MySQL, because to do it what you really need is a modified comparison function, that switches between lexicographic sorting of the strings and numeric sort if/when it encounters a number. Such code could implement any algorithm you could desire for recognising and comparing the numeric portions within two strings. Unfortunately, though, the comparison function in MySQL is internal to its code, and cannot be changed by the user.
This leaves a hack of some kind, where you try to create a sort key for your string in which the numeric parts are re-formatted so that the standard lexicographic sort actually sorts them the way you want.
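For contrast, outside the database the comparison described above (text compared as text, digit runs compared as numbers) is easy to express. The Python sketch below is one common way to do it and only handles plain non-negative integers; `natural_key` is a hypothetical name.

```python
import re


def natural_key(text):
    """Split into text and digit runs; compare digit runs as integers."""
    parts = re.split(r"([0-9]+)", text)
    # re.split with a capturing group alternates text, digits, text, ...
    # so odd positions always hold the digit runs.
    return [int(part) if index % 2 else part.lower()
            for index, part in enumerate(parts)]


titles = ["Final Fantasy 12", "Final Fantasy 4", "Final Fantasy 10", "Final Fantasy"]
print(sorted(titles, key=natural_key))
# ['Final Fantasy', 'Final Fantasy 4', 'Final Fantasy 10', 'Final Fantasy 12']
```

A SQL sort key has to encode the same ordering into a plain string, which is where the padding and length-prefix tricks come in.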
For plain integers up to some maximum number of digits, the obvious solution is to simply left-pad them with zeros so that they're all fixed width. This is the approach taken by the Drupal plugin, and the solutions of @plalx / @RichardToth. (@Christian has a different and much more complex solution, but it offers no advantages that I can see).
As @tye points out, you can improve on this by prepending a fixed-digit length to each number, rather than simply left-padding it. There's much, much more you can improve on, though, even given the limitations of what is essentially an awkward hack. Yet, there don't seem to be any pre-built solutions out there!
For example, what about:
Plus and minus signs? +10 vs 10 vs -10
Decimals? 8.2, 8.5, 1.006, .75
Leading zeros? 020, 030, 00000922
Thousand separators? "1,001 Dalmations" vs "1001 Dalmations"
Version numbers? MariaDB v10.3.18 vs MariaDB v10.3.3
Very long numbers? 103,768,276,592,092,364,859,236,487,687,870,234,598.55
Extending on @tye's method, I've created a fairly compact NatSortKey() stored function that will convert an arbitrary string into a nat-sort key, and that handles all of the above cases, is reasonably efficient, and preserves a total sort-order (no two different strings have sort keys that compare equal). A second parameter can be used to limit the number of numbers processed in each string (e.g. to the first 10 numbers, say), which can be used to ensure the output fits within a given length.
NOTE: Sort-key string generated with a given value of this 2nd parameter should only be sorted against other strings generated with the same value for the parameter, or else they might not sort correctly!
You can use it directly in ordering, e.g.
SELECT myString FROM myTable ORDER BY NatSortKey(myString,0); ### 0 means process all numbers - resulting sort key might be quite long for certain inputs
But for efficient sorting of large tables, it's better to pre-store the sort key in another column (possibly with an index on it):
INSERT INTO myTable (myString,myStringNSK) VALUES (@theStringValue,NatSortKey(@theStringValue,10)), ...
...
SELECT myString FROM myTable ORDER BY myStringNSK;
[Ideally, you'd make this happen automatically by creating the key column as a computed stored column, using something like:
CREATE TABLE myTable (
...
myString varchar(100),
myStringNSK varchar(150) AS (NatSortKey(myString,10)) STORED,
...
KEY (myStringNSK),
...);
But for now neither MySQL nor MariaDB allow stored functions in computed columns, so unfortunately you can't yet do this.]
My function affects sorting of numbers only. If you want to do other sort-normalization things, such as removing all punctuation, or trimming whitespace off each end, or replacing multi-whitespace sequences with single spaces, you could either extend the function, or it could be done before or after NatSortKey() is applied to your data. (I'd recommend using REGEXP_REPLACE() for this purpose).
It's also somewhat Anglo-centric in that I assume '.' for a decimal point and ',' for the thousands-separator, but it should be easy enough to modify if you want the reverse, or if you want that to be switchable as a parameter.
It might be amenable to further improvement in other ways; for example it currently sorts negative numbers by absolute value, so -1 comes before -2, rather than the other way around. There's also no way to specify a DESC sort order for numbers while retaining ASC lexicographical sort for text. Both of these issues can be fixed with a little more work; I will update the code if/when I get the time.
There are lots of other details to be aware of, including some critical dependencies on the charset and collation that you're using, but I've put them all into a comment block within the SQL code. Please read this carefully before using the function for yourself!
So, here's the code. If you find a bug, or have an improvement I haven't mentioned, please let me know in the comments!
delimiter $$
CREATE DEFINER=CURRENT_USER FUNCTION NatSortKey (s varchar(100), n int) RETURNS varchar(350) DETERMINISTIC
BEGIN
/****
Converts numbers in the input string s into a format such that sorting results in a nat-sort.
Numbers of up to 359 digits (before the decimal point, if one is present) are supported. Sort results are undefined if the input string contains numbers longer than this.
For n>0, only the first n numbers in the input string will be converted for nat-sort (so strings that differ only after the first n numbers will not nat-sort amongst themselves).
Total sort-ordering is preserved, i.e. if s1!=s2, then NatSortKey(s1,n)!=NatSortKey(s2,n), for any given n.
Numbers may contain ',' as a thousands separator, and '.' as a decimal point. To reverse these (as appropriate for some European locales), the code would require modification.
Numbers preceded by '+' sort with numbers not preceded with either a '+' or '-' sign.
Negative numbers (preceded with '-') sort before positive numbers, but are sorted in order of ascending absolute value (so -7 sorts BEFORE -1001).
Numbers with leading zeros sort after the same number with no (or fewer) leading zeros.
Decimal-part-only numbers (like .75) are recognised, provided the decimal point is not immediately preceded by either another '.', or by a letter-type character.
Numbers with thousand separators sort after the same number without them.
Thousand separators are only recognised in numbers with no leading zeros that don't immediately follow a ',', and when they format the number correctly.
(When not recognised as a thousand separator, a ',' will instead be treated as separating two distinct numbers).
Version-number-like sequences consisting of 3 or more numbers separated by '.' are treated as distinct entities, and each component number will be nat-sorted.
The entire entity will sort after any number beginning with the first component (so e.g. 10.2.1 sorts after both 10 and 10.995, but before 11)
Note that the first number component in an entity like this is also permitted to contain thousand separators.
To achieve this, numbers within the input string are prefixed and suffixed according to the following format:
- The number is prefixed by a 2-digit base-36 number representing its length, excluding leading zeros. If there is a decimal point, this length only includes the integer part of the number.
- A 3-character suffix is appended after the number (after the decimals if present).
- The first character is a space, or a '+' sign if the number was preceded by '+'. Any preceding '+' sign is also removed from the front of the number.
- This is followed by a 2-digit base-36 number that encodes the number of leading zeros and whether the number was expressed in comma-separated form (e.g. 1,000,000.25 vs 1000000.25)
- The value of this 2-digit number is: (number of leading zeros)*2 + (1 if comma-separated, 0 otherwise)
- For version number sequences, each component number has the prefix in front of it, and the separating dots are removed.
Then there is a single suffix that consists of a ' ' or '+' character, followed by a pair of base-36 digits for each number component in the sequence.
e.g. here is how some simple sample strings get converted:
'Foo055' --> 'Foo0255 02'
'Absolute zero is around -273 centigrade' --> 'Absolute zero is around -03273 00 centigrade'
'The $1,000,000 prize' --> 'The $071000000 01 prize'
'+99.74 degrees' --> '0299.74+00 degrees'
'I have 0 apples' --> 'I have 00 02 apples'
'.5 is the same value as 0000.5000' --> '00.5 00 is the same value as 00.5000 08'
'MariaDB v10.3.0018' --> 'MariaDB v02100130218 000004'
The restriction to numbers of up to 359 digits comes from the fact that the first character of the base-36 prefix MUST be a decimal digit, and so the highest permitted prefix value is '9Z' or 359 decimal.
The code could be modified to handle longer numbers by increasing the size of (both) the prefix and suffix.
A higher base could also be used (by replacing CONV() with a custom function), provided that the collation you are using sorts the "digits" of the base in the correct order, starting with 0123456789.
However, while the maximum number length may be increased this way, note that the technique this function uses is NOT applicable where strings may contain numbers of unlimited length.
The function definition does not specify the charset or collation to be used for string-type parameters or variables: The default database charset & collation at the time the function is defined will be used.
This is to make the function code more portable. However, there are some important restrictions:
- Collation is important here only when comparing (or storing) the output value from this function, but it MUST order the characters " +0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" in that order for the natural sort to work.
This is true for most collations, but not all of them, e.g. in Lithuanian 'Y' comes before 'J' (according to Wikipedia).
To adapt the function to work with such collations, replace CONV() in the function code with a custom function that emits "digits" above 9 that are characters ordered according to the collation in use.
- For efficiency, the function code uses LENGTH() rather than CHAR_LENGTH() to measure the length of strings that consist only of digits 0-9, '.', and ',' characters.
This works for any single-byte charset, as well as any charset that maps standard ASCII characters to single bytes (such as utf8 or utf8mb4).
If using a charset that maps these characters to multiple bytes (such as, e.g. utf16 or utf32), you MUST replace all instances of LENGTH() in the function definition with CHAR_LENGTH()
Length of the output:
Each number converted adds 5 characters (2 prefix + 3 suffix) to the length of the string. n is the maximum count of numbers to convert;
This parameter is provided as a means to limit the maximum output length (to input length + 5*n).
If you do not require the total-ordering property, you could edit the code to use suffixes of 1 character (space or plus) only; this would reduce the maximum output length for any given n.
Since a string of length L has at most ((L+1) DIV 2) individual numbers in it (every 2nd character a digit), for n<=0 the maximum output length is (inputlength + 5*((inputlength+1) DIV 2))
So for the current input length of 100, the maximum output length is 350.
If changing the input length, the output length must be modified according to the above formula. The DECLARE statements for x,y,r, and suf must also be modified, as the code comments indicate.
****/
DECLARE x,y varchar(100); # need to be same length as input s
DECLARE r varchar(350) DEFAULT ''; # return value: needs to be same length as return type
DECLARE suf varchar(101); # suffix for a number or version string. Must be (((inputlength+1) DIV 2)*2 + 1) chars to support version strings (e.g. '1.2.33.5'), though it's usually just 3 chars. (Max version string e.g. 1.2. ... .5 has ((length of input + 1) DIV 2) numeric components)
DECLARE i,j,k int UNSIGNED;
IF n<=0 THEN SET n := -1; END IF; # n<=0 means "process all numbers"
LOOP
SET i := REGEXP_INSTR(s,'\\d'); # find position of next digit
IF i=0 OR n=0 THEN RETURN CONCAT(r,s); END IF; # no more numbers to process -> we're done
SET n := n-1, suf := ' ';
IF i>1 THEN
IF SUBSTRING(s,i-1,1)='.' AND (i=2 OR SUBSTRING(s,i-2,1) RLIKE '[^.\\p{L}\\p{N}\\p{M}\\x{608}\\x{200C}\\x{200D}\\x{2100}-\\x{214F}\\x{24B6}-\\x{24E9}\\x{1F130}-\\x{1F149}\\x{1F150}-\\x{1F169}\\x{1F170}-\\x{1F189}]') AND (SUBSTRING(s,i) NOT RLIKE '^\\d++\\.\\d') THEN SET i:=i-1; END IF; # Allow decimal number (but not version string) to begin with a '.', provided preceding char is neither another '.', nor a member of the unicode character classes: "Alphabetic", "Letter", "Block=Letterlike Symbols" "Number", "Mark", "Join_Control"
IF i>1 AND SUBSTRING(s,i-1,1)='+' THEN SET suf := '+', j := i-1; ELSE SET j := i; END IF; # move any preceding '+' into the suffix, so equal numbers with and without preceding "+" signs sort together
SET r := CONCAT(r,SUBSTRING(s,1,j-1)); SET s = SUBSTRING(s,i); # add everything before the number to r and strip it from the start of s; preceding '+' is dropped (not included in either r or s)
END IF;
SET x := REGEXP_SUBSTR(s,IF(SUBSTRING(s,1,1) IN ('0','.') OR (SUBSTRING(r,-1)=',' AND suf=' '),'^\\d*+(?:\\.\\d++)*','^(?:[1-9]\\d{0,2}(?:,\\d{3}(?!\\d))++|\\d++)(?:\\.\\d++)*+')); # capture the number + following decimals (including multiple consecutive '.<digits>' sequences)
SET s := SUBSTRING(s,LENGTH(x)+1); # NOTE: LENGTH() can be safely used instead of CHAR_LENGTH() here & below PROVIDED we're using a charset that represents digits, ',' and '.' characters using single bytes (e.g. latin1, utf8)
SET i := INSTR(x,'.');
IF i=0 THEN SET y := ''; ELSE SET y := SUBSTRING(x,i); SET x := SUBSTRING(x,1,i-1); END IF; # move any following decimals into y
SET i := LENGTH(x);
SET x := REPLACE(x,',','');
SET j := LENGTH(x);
SET x := TRIM(LEADING '0' FROM x); # strip leading zeros
SET k := LENGTH(x);
SET suf := CONCAT(suf,LPAD(CONV(LEAST((j-k)*2,1294) + IF(i=j,0,1),10,36),2,'0')); # (j-k)*2 + IF(i=j,0,1) = (count of leading zeros)*2 + (1 if there are thousands-separators, 0 otherwise) Note the first term is bounded to <= base-36 'ZY' as it must fit within 2 characters
SET i := LOCATE('.',y,2);
IF i=0 THEN
SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x,y,suf); # k = count of digits in number, bounded to be <= '9Z' base-36
ELSE # encode a version number (like 3.12.707, etc)
SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x); # k = count of digits in number, bounded to be <= '9Z' base-36
WHILE LENGTH(y)>0 AND n!=0 DO
IF i=0 THEN SET x := SUBSTRING(y,2); SET y := ''; ELSE SET x := SUBSTRING(y,2,i-2); SET y := SUBSTRING(y,i); SET i := LOCATE('.',y,2); END IF;
SET j := LENGTH(x);
SET x := TRIM(LEADING '0' FROM x); # strip leading zeros
SET k := LENGTH(x);
SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x); # k = count of digits in number, bounded to be <= '9Z' base-36
SET suf := CONCAT(suf,LPAD(CONV(LEAST((j-k)*2,1294),10,36),2,'0')); # (j-k)*2 = (count of leading zeros)*2, bounded to fit within 2 base-36 digits
SET n := n-1;
END WHILE;
SET r := CONCAT(r,y,suf);
END IF;
END LOOP;
END
$$
delimiter ;
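The prefix/suffix scheme described in the comment block can be illustrated outside SQL. The following is a minimal Python sketch of the idea only, not a port of NatSortKey: it handles plain digit runs and ignores signs, decimals, thousands separators, version strings and the n limit. The names `to_base36` and `simple_nat_key` are made up for this illustration.

```python
import re

DIGITS36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base36(value: int, width: int = 2) -> str:
    """Render a non-negative integer in base 36, left-padded with '0'."""
    out = ""
    while value:
        value, rem = divmod(value, 36)
        out = DIGITS36[rem] + out
    return out.rjust(width, "0")


def simple_nat_key(text: str) -> str:
    """Prefix each digit run with its length and suffix it with a leading-zero code."""

    def encode(match: re.Match) -> str:
        run = match.group(0)
        stripped = run.lstrip("0")          # digits without leading zeros
        zeros = len(run) - len(stripped)    # count of leading zeros
        prefix = to_base36(min(len(stripped), 359))   # capped at '9Z'
        suffix = " " + to_base36(min(zeros * 2, 1294))  # capped at 'ZY'
        return prefix + stripped + suffix

    return re.sub(r"[0-9]+", encode, text)


print(simple_nat_key("Foo055"))  # Foo0255 02
print(sorted(["file10", "file9", "file2"], key=simple_nat_key))
```

Because the length prefix comes before the digits, a plain string comparison of the keys orders shorter numbers before longer ones, which is what makes the sort "natural".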
Other answers are correct, but you may want to know that MariaDB 10.11 LTS has a natural_sort_key() function. The function is documented here.
If you're using PHP, you can do the natural sort in PHP instead.
$keys = array();
$values = array();
foreach ($results as $index => $row) {
$key = $row['name'].'__'.$index; // Add the index to create a unique key.
$keys[] = $key;
$values[$key] = $row;
}
natsort($keys);
$sortedValues = array();
foreach($keys as $index) {
$sortedValues[] = $values[$index];
}
I hope MySQL will implement natural sorting in a future version, but the feature request (#1588) has been open since 2003, so I wouldn't hold my breath.
A simplified non-UDF version of the best answer by @plaix/Richard Toth/Luke Hoggett, which works only for the first integer in the field, is:
SELECT name,
LEAST(
IFNULL(NULLIF(LOCATE('0', name), 0), ~0),
IFNULL(NULLIF(LOCATE('1', name), 0), ~0),
IFNULL(NULLIF(LOCATE('2', name), 0), ~0),
IFNULL(NULLIF(LOCATE('3', name), 0), ~0),
IFNULL(NULLIF(LOCATE('4', name), 0), ~0),
IFNULL(NULLIF(LOCATE('5', name), 0), ~0),
IFNULL(NULLIF(LOCATE('6', name), 0), ~0),
IFNULL(NULLIF(LOCATE('7', name), 0), ~0),
IFNULL(NULLIF(LOCATE('8', name), 0), ~0),
IFNULL(NULLIF(LOCATE('9', name), 0), ~0)
) AS first_int
FROM table
ORDER BY IF(first_int = ~0, name, CONCAT(
SUBSTR(name, 1, first_int - 1),
LPAD(CAST(SUBSTR(name, first_int) AS UNSIGNED), LENGTH(~0), '0'),
SUBSTR(name, first_int + LENGTH(CAST(SUBSTR(name, first_int) AS UNSIGNED)))
)) ASC
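The query above builds its sort key by zero-padding the first integer in the name to a fixed width (LENGTH(~0) is 20, the number of digits in the largest BIGINT UNSIGNED). A rough Python sketch of the same idea follows; `first_int_key` is a hypothetical helper name, and unlike the SQL it pads the whole digit run, so it is an approximation rather than an exact translation.

```python
import re

PAD_WIDTH = 20  # digits in 18446744073709551615, i.e. LENGTH(~0) in MySQL


def first_int_key(name: str) -> str:
    """Zero-pad the first run of digits so string comparison matches numeric order."""
    match = re.search(r"[0-9]+", name)
    if match is None:
        return name  # no digits: sort by the name itself
    return name[:match.start()] + match.group(0).zfill(PAD_WIDTH) + name[match.end():]


names = ["FF 12", "FF 2", "Intro", "FF 112"]
print(sorted(names, key=first_int_key))
```

Only the first integer is normalised, so names that differ in a second number (for example "v1.10" and "v1.9") still compare as plain text after the first one.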
I have tried several solutions, but it is actually very simple:
SELECT test_column FROM test_table ORDER BY LENGTH(test_column), test_column
/*
Result
--------
value_1
value_2
value_3
value_4
value_5
value_6
value_7
value_8
value_9
value_10
value_11
value_12
value_13
value_14
value_15
...
*/
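Sorting by length first and then by value, both ascending, only produces natural order when every value shares the same non-numeric prefix. A small Python sketch of the same two-part key (the sample data is made up) shows both the working case and the case where it breaks:

```python
def length_then_value(text: str):
    """Two-part sort key: string length first, then the string itself."""
    return (len(text), text)


same_prefix = ["value_10", "value_2", "value_1", "value_11"]
print(sorted(same_prefix, key=length_then_value))
# ['value_1', 'value_2', 'value_10', 'value_11']

mixed_prefix = ["b1", "a10", "a2"]
print(sorted(mixed_prefix, key=length_then_value))
# ['a2', 'b1', 'a10']  (length wins, so 'b1' lands between the 'a' values)
```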
There is also natsort. It is intended to be part of a Drupal plugin, but it works fine stand-alone.
Here is a simple one if titles only have the version as a number:
ORDER BY CAST(REGEXP_REPLACE(title, "[a-zA-Z]+", "") AS INT)
Otherwise you can use simple SQL if you use a pattern (this pattern uses a # before the version):
create table titles(title varchar(100));
insert into titles (title) values
('Final Fantasy'),
('Final Fantasy #03'),
('Final Fantasy #11'),
('Final Fantasy #10'),
('Final Fantasy #2'),
('Bond 007 ##2'),
('Final Fantasy #01'),
('Bond 007'),
('Final Fantasy #11}');
select REGEXP_REPLACE(title, "#([0-9]+)", "\\1") as title from titles
ORDER BY REGEXP_REPLACE(title, "#[0-9]+", ""),
CAST(REGEXP_REPLACE(title, ".*#([0-9]+).*", "\\1") AS INT);
+-------------------+
| title |
+-------------------+
| Bond 007 |
| Bond 007 #2 |
| Final Fantasy |
| Final Fantasy 01 |
| Final Fantasy 2 |
| Final Fantasy 03 |
| Final Fantasy 10 |
| Final Fantasy 11 |
| Final Fantasy 11} |
+-------------------+
9 rows in set, 2 warnings (0.001 sec)
You can use other patterns if needed.
For example, if you have titles such as "I'm #1" and "I'm #1 part 2", where the # is part of the name, you could wrap the version in braces instead, e.g. "Final Fantasy {11}".
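The two ORDER BY expressions in this answer amount to a two-part key: the title with the "#number" marker removed, then the number itself. Here is a hedged Python sketch of that key, using the sample titles from the insert statement. `pattern_key` and `display_title` are names invented for the illustration, and treating titles without a marker as version 0 is an assumption meant to mirror CAST returning 0 for non-numeric text.

```python
import re


def pattern_key(title: str):
    """Sort key: title without '#<digits>', then the last '#<digits>' as an integer."""
    numbers = re.findall(r"#([0-9]+)", title)
    version = int(numbers[-1]) if numbers else 0  # assumption: no marker sorts as 0
    return (re.sub(r"#[0-9]+", "", title), version)


def display_title(title: str) -> str:
    """Drop the '#' marker but keep the digits, as in the SELECT list."""
    return re.sub(r"#([0-9]+)", r"\1", title)


titles = [
    "Final Fantasy", "Final Fantasy #03", "Final Fantasy #11",
    "Final Fantasy #10", "Final Fantasy #2", "Bond 007 ##2",
    "Final Fantasy #01", "Bond 007", "Final Fantasy #11}",
]
for title in sorted(titles, key=pattern_key):
    print(display_title(title))
```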
I know this topic is ancient but I think I've found a way to do this:
SELECT * FROM `table` ORDER BY
CONCAT(
GREATEST(
LOCATE('1', name),
LOCATE('2', name),
LOCATE('3', name),
LOCATE('4', name),
LOCATE('5', name),
LOCATE('6', name),
LOCATE('7', name),
LOCATE('8', name),
LOCATE('9', name)
),
name
) ASC
Scrap that, it sorted the following set incorrectly (It's useless lol):
Final Fantasy 1
Final Fantasy 2
Final Fantasy 5
Final Fantasy 7
Final Fantasy 7: Advent Children
Final Fantasy 12
Final Fantasy 112
FF1
FF2
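One way to see why this ordering comes out wrong is to compute the sort key by hand. The Python sketch below mimics the CONCAT(GREATEST(LOCATE(...)), name) expression; `locate` and `flawed_key` are illustrative names, and plain Python string comparison stands in for the database collation. The key starts with a position rendered as text, so "3FF1" sorts after "15Final Fantasy 1" simply because '3' is greater than '1'.

```python
def locate(digit: str, name: str) -> int:
    """Mimic MySQL LOCATE(): 1-based position of the first match, 0 if absent."""
    return name.find(digit) + 1


def flawed_key(name: str) -> str:
    """Largest first-occurrence position among the digits 1-9, prepended to the name."""
    position = max(locate(d, name) for d in "123456789")
    return str(position) + name


titles = [
    "FF1", "FF2", "Final Fantasy 112", "Final Fantasy 12",
    "Final Fantasy 1", "Final Fantasy 2", "Final Fantasy 5",
    "Final Fantasy 7", "Final Fantasy 7: Advent Children",
]
for title in sorted(titles, key=flawed_key):
    print(flawed_key(title))
```

The position is compared character by character rather than numerically, and it reflects where a digit sits in the string rather than the value of the number, so the key cannot give a natural sort in general.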