Convert escaped Unicode character to Unicode notation - SQL

I have a DB2 LUW table with escaped unicode characters in it.
I want to convert this to a real unicode string.
$ db2 "select loc_longtext from fh01tq07 where loc_longtext like '%\\u%'"
LOC_LONGTEXT
------------
S\u00e4ule
After a long time of trial and error, I'm at this point:
$ db2 "select loc_longtext, xmlquery('fn:replace(\$LOC_LONGTEXT,''\\\u([0-9a-f]{1,4})'',''&#x\$1;'')') from fh01tq07 where loc_longtext like '%\\u%'"
SQL16002N An XQuery expression has an unexpected token "&#x" following "]{1,4})','". Expected tokens may include: "<". Error QName=err:XPST0003. SQLSTATE=10505
But fn:normalize-unicode expects this kind of escaped Unicode format.
Any suggestions?
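One direction that might work, as a rough and untested sketch: avoid the XQuery entity problem entirely by building the character reference with REGEXP_REPLACE (available in DB2 LUW 11.1 and later) and letting XMLPARSE decode it. Apart from the column and table names taken from the question, everything here is an assumption, not a verified solution:
-- Sketch: turn S\u00e4ule into <t>S&#x00e4;ule</t>, parse it as XML and pull out the
-- text node, which should yield "Säule". Assumes the column contains no other
-- XML-special characters (& < >) and that the escapes always use 4 hex digits.
SELECT XMLCAST(
         XMLQUERY('$D/t/text()'
                  PASSING XMLPARSE(DOCUMENT
                    '<t>' ||
                    REGEXP_REPLACE(loc_longtext, '\\u([0-9a-fA-F]{4})', '&#x$1;') ||
                    '</t>') AS "D")
         AS VARCHAR(200)) AS loc_longtext_decoded
FROM fh01tq07
WHERE loc_longtext LIKE '%\u%'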

Related

How to convert string with German characters to Blob in Firebird?

I want to convert a string into a blob with the f_strblob(CSTRING) function of FreeAdhocUDF. So far I have not found a way to get my special characters like ß or ä to show up in the blob.
The result of f_strblob('Gemäß') is Gem..
I tried changing the character set of my variables to UTF8, but that does not help.
Is there a masking option which I did not find?
You don't need that function, and the FreeAdhocUDF documentation also marks it as obsolete for that reason.
In a lot of situations, Firebird will automatically convert string literals to blobs (eg in statements where a string literal is assigned to a blob value), and otherwise you can explicitly cast using cast('your string' as blob sub_type text).
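For example, an explicit cast could look like this (table and column names are made up for illustration; the point is that no UDF is needed):
-- Hypothetical table and column; casts a string literal containing ä and ß to a
-- text blob. The connection character set still has to be able to represent those
-- characters (e.g. UTF8).
UPDATE documents
SET description = CAST('Gemäß' AS BLOB SUB_TYPE TEXT)
WHERE id = 1;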

Removing replacement character � from column

Based on my research so far this character indicates bad encoding between the database and front end. Unfortunately, I don't have any control over either of those. I'm using Teradata Studio.
How can I filter this character out? I'm trying to use the REGEXP_SUBSTR function on a column that occasionally contains �, which throws the error "The string contains an untranslatable character".
Here is my SQL. AIRCFT_POSITN_ID is the column that contains the replacement character.
SELECT DISTINCT AIRCFT_POSITN_ID,
REGEXP_SUBSTR(AIRCFT_POSITN_ID, '[0-9]+') AS AUTOROW
FROM PROD_MAE_MNTNC_VW.FMR_DISCRPNCY_DFRL
WHERE DFRL_CREATE_TMS > CURRENT_DATE -25
Your diagnosis is correct. First of all, you might want to check the Session Character Set (it is part of the connection definition).
If it is ASCII change it to UTF8 and you will be able to see the original characters instead of the substitute character.
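If you want to double-check from SQL rather than from the connection settings, HELP SESSION reports the session's character set among other attributes (a quick aside, assuming your tool lets you run it):
-- The result set includes a column showing the session character set (ASCII, UTF8, ...)
HELP SESSION;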
And in case the character is indeed part of the data and not just an indication of encoding translation issues:
The substitute character, AKA SUB (DEC: 26, HEX: 1A), gets special treatment in Teradata.
You cannot use it directly:
select '�';
-- [6706] The string contains an untranslatable character.
select '1A'XC;
-- [6706] The string contains an untranslatable character.
If you are using version 14.0 or above you can generate it with the CHR function:
select chr(26);
If you're below version 14.0 you can generate it like this:
select translate (_unicode '05D0'XC using unicode_to_latin with error);
Once you have generated the character you can now use it with REPLACE or OTRANSLATE
create multiset table t (i int,txt varchar(100) character set latin) unique primary index (i);
insert into t (i,txt) values (1,translate ('Hello שלום world עולם' using unicode_to_latin with error));
select * from t;
-- Hello ���� world ����
select otranslate (txt,chr(26),'') from t;
-- Hello world
select otranslate (txt,translate (_unicode '05D0'XC using unicode_to_latin with error),'') from t;
-- Hello world
BTW, there are 2 versions of OTRANSLATE and OREPLACE:
The functions under SYSLIB work with LATIN.
The functions under TD_SYSFNLIB work with UNICODE.
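If name resolution picks up the wrong one, the call can be qualified explicitly; a sketch against the table created above (assuming the embedded TD_SYSFNLIB versions are installed, as they normally are from 14.0 on):
-- Explicitly call the UNICODE-capable embedded version:
SELECT TD_SYSFNLIB.OTRANSLATE(txt, CHR(26), '') FROM t;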
In addition to Dudu's excellent answer above, I wanted to add the following now that I've encountered the issue again and had more time to experiment. The following SELECT command produced an untranslatable character:
SELECT IDENTIFY FROM PROD_MAE_MNTNC_VW.SCHD_MNTNC;
IDENTIFY
24FEB1747659193DC330A163DCL�ORD
Trying to perform a REGEXP_REPLACE or OREPLACE directly on this character produces an error:
Failed [6706 : HY000] The string contains an untranslatable character.
I changed the CHARSET property in my Teradata connection from UTF8 to ASCII, and I could now see the offending character; it looks like a tab:
IDENTIFY
Running the TRANSLATE_CHK function with this specific conversion succeeds and identifies the position of the offending character (note that this does not work with the UTF8 charset):
TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) AS BADCHAR
BADCHAR
28
Now this character can be dealt with using a CASE expression to remove the bad character and retain the remainder of the string:
SELECT CASE
         WHEN TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) = 0 THEN IDENTIFY
         ELSE SUBSTR(IDENTIFY, 1, TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) - 1)
       END AS IDENTIFY
FROM PROD_MAE_MNTNC_VW.SCHD_MNTNC;
Hope this helps someone out.

Regex literal in Frege

What is the Unicode code point for the grave accent mark used to specify a regex literal in Frege?
The character is actually called Acute Accent and its Unicode code point is U+00B4. In Ubuntu, you can type it with Ctrl+Shift+U, then 00B4, then space. However, you don't really have to use it if your regex literal is more than one character, in which case you can just use apostrophes.
Quoting the doc:
Regular expression literals have type Regex and are written:
´\b(foo|bar)\b´ -- string enclosed in grave accents
'\w+' -- string with length > 1 enclosed in apostrophes
The notation with the apostrophes has been introduced because many have a hard time entering a grave accent mark on their terminal. However, it is not possible to write a regular expression of length 1 this way, because then the literal gets interpreted as a Char literal. (One can write something like '(?:X)' for a Regex that matches a single 'X'.)

Is it ok to insert ascii decimal characters in postgres select query?

I need the output of a select statement in this form:
You are assigned to following projects:
A
B
C
There are some spaces, special characters and new lines. To do that I used the character codes chr(10), chr(32) and chr(8226).
It's working fine, but the query doesn't look good and I'm not sure whether this is a good approach.
The query looks like this
SELECT
'You are assigned to following projects:' || chr(10) || chr(32) || chr(8226) || chr(32) ||
string_agg(e.projects, chr(10) || chr(32) || chr(8226) || chr(32))
Also, will this work on every OS and in every environment?
You have a few options:
Insert characters literally. Usually the best option. Want a "•"? Use a string like 'this is a •'. Nothing more is required if your client_encoding is correct and the encoding you're using includes the character you want (like •). This is SQL-standard. Newlines may be included as literals:
SELECT '
' AS "this_is_a_newline";
This approach may not work for some non-printable characters, depending on the database implementation. For PostgreSQL it's fine for everything except \x00, the zero byte, which PostgreSQL doesn't support in text / varchar etc at all, only in bytea.
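For instance (the exact error wording varies by version, but the behaviour is the same):
SELECT chr(0);           -- rejected: the zero byte is not allowed in text values
SELECT '\x00'::bytea;    -- fine: a one-byte bytea containing the zero byte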
Watch out to make sure your text editor / SQL editor's text encoding matches what your connection tells PostgreSQL the client_encoding is, otherwise you'll get mangled strings or weird errors. Users of unix-like terminals also need to make sure the terminal encoding matches client_encoding to avoid weird output errors. These days Windows is the only platform where this is generally an issue.
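To see what the current connection claims, and to change it if needed, this is the setting in question:
SHOW client_encoding;            -- what PostgreSQL thinks the client sends and expects
SET client_encoding = 'UTF8';    -- must match what your editor/terminal really uses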
Insert characters by hex or Unicode escape in an E'' escape-string, e.g. E'this is a \u2022'. Note that \u escapes are hexadecimal: 0x2022 is decimal 8226. The E'' syntax is a PostgreSQL extension.
For characters that have shorthand escapes defined, use the shorthand escapes in an escape-string, e.g. E'\n'. This is a PostgreSQL extension.
Use chr(8226), as you described, but note that chr interprets the code according to your server_encoding (the database's text encoding), so I do not encourage it. For multi-byte chars you'll just get an error like ERROR: requested character too large for encoding: 8226:
regress=> CREATE DATABASE latin ENCODING 'latin-1' LC_CTYPE 'C' LC_COLLATE 'C' TEMPLATE template0;
CREATE DATABASE
regress=> \c latin
You are now connected to database "latin" as user "craig".
latin=> SHOW server_encoding;
server_encoding
-----------------
LATIN1
(1 row)
latin=> SHOW client_encoding;
client_encoding
-----------------
UTF8
(1 row)
latin=> select chr(8226);
ERROR: requested character too large for encoding: 8226
But for chars whose ordinal is in the 1-byte range, you can get an unexpected character instead. Take ü: its code is 0xfc (decimal 252) in both Unicode and latin-1 (iso-8859-1), but in iso-8859-5 the byte 0xfc is ќ. So:
regress=> SHOW server_encoding;
server_encoding
-----------------
UTF8
regress=> SELECT chr(252);
chr
-----
ü
regress=> CREATE DATABASE iso5 ENCODING 'iso-8859-5' LC_CTYPE 'C' LC_COLLATE 'C' TEMPLATE template0;
regress=> \c iso5
iso5=> SELECT chr(252);
chr
-----
ќ
So my advice: Always use literals where possible. Where you must use escapes, use E'' strings with unicode escapes to prevent ambiguity about the meaning of a codepoint based on the current server encoding. Avoid \x escapes and chr.
For the specific example you wrote, you should use:
SELECT 'You are assigned to following projects:
• A
• B
• C';
Note for readers on very old PostgreSQL versions: extremely old PostgreSQL releases didn't support E'' strings and treated all strings as if they were escape strings, so '\n' meant "newline", whereas modern PostgreSQL follows the SQL standard in which '\n' is just the string "\n". Only marginally prehistoric versions still did this, but raised a warning about it and let you request the standard behaviour by setting standard_conforming_strings = on. This has been the default for quite a while.
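A quick way to see the standard behaviour on a modern server:
SELECT length('\n');    -- 2: a backslash followed by the letter n
SELECT length(E'\n');   -- 1: an actual newline character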
Instead of
chr(10) || chr(32) || chr(8226) || chr(32)
just use an escape string:
E'\n • '
There is no reason to use chr() in this case.
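Putting that together with the string_agg query from the question, a sketch could look like this (the FROM clause is invented here, since the question only shows the alias e):
-- employee_projects is a hypothetical table name; the E'' escape string builds the
-- same "newline, space, bullet, space" separator that the chr() chain produced.
SELECT 'You are assigned to following projects:' || E'\n \u2022 ' ||
       string_agg(e.projects, E'\n \u2022 ')
FROM employee_projects AS e;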

What characters should be escaped in SQL string parameters

I need a complete list of characters that should be escaped in SQL string parameters to prevent exceptions. I assume that I need to replace all the offending characters with the escaped version before I pass it to my ObjectDataSource filter parameter.
No, the ObjectDataSource will handle all the escaping for you. Any parameterized query will also require no escaping.
As others have pointed out, in 99% of the cases where someone thinks they need to ask this question, they are doing it wrong. Parameterization is the way to go. If you really need to escape yourself, try to find out if your DB access library offers a function for this (for example, MySQL has mysql_real_escape_string).
From SQL Books Online, search for String Literals:
String Literals
A string literal consists of zero or more characters surrounded by quotation marks. If a string contains quotation marks, these must be escaped in order for the expression to parse. Any two-byte character except \x0000 is permitted in a string, because the \x0000 character is the null terminator of a string.
Strings can include other characters that require an escape sequence. The following table lists escape sequences for string literals.
\a        Alert
\b        Backspace
\f        Form feed
\n        New line
\r        Carriage return
\t        Horizontal tab
\v        Vertical tab
\"        Quotation mark
\\        Backslash
\xhhhh    Unicode character in hexadecimal notation
Here's a way I used to get rid of apostrophes. You could do the same thing with other offending characters that you run into. (example in VB.Net)
Dim companyFilter = Trim(Me.ddCompany.SelectedValue)
If (Me.ddCompany.SelectedIndex > 0) Then
    ' Double up single quotes so the value is safe inside the quoted literal
    filterString += String.Format("LegalName like '{0}'", companyFilter.Replace("'", "''"))
End If
Me.objectDataSource.FilterExpression = filterString
Me.displayGrid.DataBind()