how replace accented letter in a varchar2 column in oracle - sql

I have a varchar2 column named NAME_USER. For example, the data is JUAN ROMÄN, but I want to show JUAN ROMAN, replacing Á with A in my query results. How can I do that? Thanks in advance.

Use the CONVERT function with the appropriate character set:
select CONVERT('JUAN ROMÄN', 'US7ASCII') from dual;
Below are some of the character sets that can be used in Oracle:
US7ASCII: US 7-bit ASCII character set
WE8DEC: West European 8-bit character set
WE8HP: HP West European Laserjet 8-bit character set
F7DEC: DEC French 7-bit character set
WE8EBCDIC500: IBM West European EBCDIC Code Page 500
WE8PC850: IBM PC Code Page 850
WE8ISO8859P1: ISO 8859-1 West European 8-bit character set

You could use replace, regexp_replace or translate, but they would each require you to map all possible accented characters to their unaccented versions.
Alternatively, there's a function called nlssort() which is typically used to override the default language settings used for the order by clause. It has an option for accent-insensitive sorting, which can be creatively misused to solve your problem. nlssort() returns a binary, so you have to convert back to varchar2 using utl_raw.cast_to_varchar2():
select utl_raw.cast_to_varchar2(nlssort(NAME_USER, 'nls_sort=binary_ai'))
from YOUR_TABLE;
Try this, for a list of accented characters from the extended ASCII set, together with their derived, unaccented values:
select level+192 ascii_code,
chr(level+192) accented,
utl_raw.cast_to_varchar2(nlssort(chr(level+192),'nls_sort=binary_ai')) unaccented
from dual
connect by level <= 63
order by 1;
Not really my answer - I've used this before and it seemed to work ok, but have to credit this post: https://community.oracle.com/thread/1117030
ETA: nlssort() can't do accent-insensitive without also doing case-insensitive, so this solution will always convert to lower case. Enclosing the expression above in upper() will of course get your example value back to "JUAN ROMAN". If your values can be mixed case, and you need to preserve the case of each character, and initcap() isn't flexible enough, then you'll need to write a bit of PL/SQL.
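For the mixed-case situation, the "bit of PL/SQL" could look something like the sketch below. The function name strip_accents_keep_case is made up for illustration. It assumes the nlssort() trick behaves the same way when applied one character at a time, and it has not been tested against multibyte data or against characters such as ß that have no single-character upper-case form.

```sql
-- Hypothetical helper (the name is an assumption, not an Oracle built-in).
CREATE OR REPLACE FUNCTION strip_accents_keep_case (p_text IN VARCHAR2)
  RETURN VARCHAR2
IS
  l_result VARCHAR2(32767);
  l_char   VARCHAR2(10);
  l_plain  VARCHAR2(40);
BEGIN
  FOR i IN 1 .. NVL(LENGTH(p_text), 0) LOOP
    l_char := SUBSTR(p_text, i, 1);
    -- binary_ai sort key: accent removed, lower-cased, trailing NUL stripped
    l_plain := REPLACE(
                 UTL_RAW.CAST_TO_VARCHAR2(NLSSORT(l_char, 'nls_sort=binary_ai')),
                 CHR(0));
    -- restore upper case when the source character was upper case
    IF l_char = UPPER(l_char) THEN
      l_plain := UPPER(l_plain);
    END IF;
    l_result := l_result || l_plain;
  END LOOP;
  RETURN l_result;
END strip_accents_keep_case;
/

SELECT strip_accents_keep_case(NAME_USER) FROM YOUR_TABLE;
```

Calling a PL/SQL function per row is slower than a pure SQL expression, so this is probably only worth it when the case really has to be preserved.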

select replace('JUAN ROMÄN','Ä','A')
from dual;
If you have more mappings to make, you could use TRANSLATE ...

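You can use regular expressions with an equivalence class, one call per base letter (written without a + quantifier, which would collapse a run of adjacent matches into a single A):
SELECT regexp_replace('JUAN ROMÄNí','[[=A=]]','A' )
FROM dual;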

Related

Remove accents from string in Oracle

When trying to remove all accents from a string in Oracle using the techniques described in this stackoverflow answer: how replace accented letter in a varchar2 column in oracle I’m getting mixed results.
select CONVERT('JUAN ROMÄN', 'US7ASCII') from dual;
Returns the original string, but characters such as ñ are replaced by a question mark (probably because of the chosen charset; tests with different charsets led to different results).
Using the following technique:
select utl_raw.cast_to_varchar2(nlssort(NAME_USER, 'nls_sort=binary_ai')) from YOUR_TABLE;
Returns the complete string but also places a NUL value at the end of the string.
Is there a character set that I can use with Spanish accents to get a correct result (the original string with the different accents removed)? Is there a way to avoid the NUL value in the utl_raw.cast_to_varchar2 technique?
Based on the comments, replacing chr(0) seems to remove the NUL value. For example:
select
upper(utl_raw.cast_to_varchar2((nlssort('this is áà ñew test','nls_sort=binary_ai')))) as test,
replace(upper(utl_raw.cast_to_varchar2((nlssort('this is áà ñew test','nls_sort=binary_ai')))),chr(0),'') as test2
from dual;
If possible, I would however like to have a more 'straightforward/simpler' solution.
You can use TRANSLATE(your_string, from_chars, to_chars) https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions196.htm
Just put all chars with accents in from_chars string and their corresponding replacement chars in to_chars.
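As a rough sketch of that approach, the query below maps a handful of upper-case accented letters to their plain equivalents, reusing the NAME_USER column and YOUR_TABLE placeholder from the question. The character list is only an illustration and is far from complete. Both strings must line up position by position, and the accented literals have to reach the server unchanged, which depends on your client character set.

```sql
-- from_chars and to_chars are matched by position (23 characters each)
SELECT TRANSLATE(NAME_USER,
                 'ÁÀÂÄÃÉÈÊËÍÌÎÏÓÒÔÖÕÚÙÛÜÑ',
                 'AAAAAEEEEIIIIOOOOOUUUUN') AS name_plain
FROM YOUR_TABLE;
```

Lower-case letters would need their own entries in both strings. Unlike the nlssort() technique, this leaves case untouched and adds no NUL character.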

SQL: Storing Extended ASCII (128 to 255) in VARCHAR

How do you store chars 128 to 255 in VARCHAR..?
SQL seems to change some of these to char(63) '?'. I'm not sure if it's something to do with collation? UTF-8? N'..'? I've tried COLLATE Latin1_General_Bin, not sure if it supports extended ascii though..
Obviously works with NVARCHAR, but in theory this should work in VARCHAR too..?
The characters stored in varchar/char columns beyond the ASCII 0-127 range are determined by the code page associated with the collation. Characters not specifically defined by the code page are either mapped to a similar character or, when there is none, to '?'.
You can list collations along with the associated code page with this query:
SELECT name, description, COLLATIONPROPERTY(name, 'CodePage') AS CodePage
FROM fn_helpcollations();
Dan's answer got me on the right track.
VARCHAR definitely does store Extended ASCII, but it depends on the code page associated with the collation. I'm using Latin1_General_100_BIN which uses code page 1252.
https://en.wikipedia.org/wiki/Windows-1252
According to this code page, the following chars are undefined:
129, 141, 143, 144, 157
In reality it looks like SQL excludes most chars from 128 to 159. The easy solution was just to remove those characters.
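To see how a given code page actually treats the range 128 to 255, something like the following sketch may help. It assumes SQL Server and a current database whose default collation uses a single-byte code page. The CTE name codes and the column aliases are my own. CHAR() builds each character in the database's code page, and UNICODE() shows which Unicode code point it maps to.

```sql
WITH codes AS (
    -- 128 rows numbered 128..255; sys.all_objects is just a convenient row source
    SELECT TOP (128)
           CAST(127 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int) AS code
    FROM sys.all_objects
)
SELECT code,
       CHAR(code)                                AS stored_char,
       UNICODE(CONVERT(nvarchar(1), CHAR(code))) AS unicode_code_point
FROM codes
ORDER BY code;
```

Comparing code with unicode_code_point row by row shows how each byte value is interpreted under that collation.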

Removing replacement character � from column

Based on my research so far this character indicates bad encoding between the database and front end. Unfortunately, I don't have any control over either of those. I'm using Teradata Studio.
How can I filter this character out? I'm trying to apply the REGEXP_SUBSTR function to a column that occasionally contains �, which throws the error "The string contains an untranslatable character".
Here is my SQL. AIRCFT_POSITN_ID is the column that contains the replacement character.
SELECT DISTINCT AIRCFT_POSITN_ID,
REGEXP_SUBSTR(AIRCFT_POSITN_ID, '[0-9]+') AS AUTOROW
FROM PROD_MAE_MNTNC_VW.FMR_DISCRPNCY_DFRL
WHERE DFRL_CREATE_TMS > CURRENT_DATE -25
Your diagnosis is correct, so first of all, you might want to check the Session Character Set (it is part of the connection definition).
If it is ASCII, change it to UTF8 and you will be able to see the original characters instead of the substitute character.
In case the character is indeed part of the data and not just a sign of encoding translation issues:
The substitute character, AKA SUB (DEC: 26, HEX: 1A), is quite unique in Teradata.
You cannot use it directly:
select '�';
-- [6706] The string contains an untranslatable character.
select '1A'XC;
-- [6706] The string contains an untranslatable character.
If you are using version 14.0 or above you can generate it with the CHR function:
select chr(26);
If you're below version 14.0 you can generate it like this:
select translate (_unicode '05D0'XC using unicode_to_latin with error);
Once you have generated the character, you can use it with OREPLACE or OTRANSLATE:
create multiset table t (i int,txt varchar(100) character set latin) unique primary index (i);
insert into t (i,txt) values (1,translate ('Hello שלום world עולם' using unicode_to_latin with error));
select * from t;
-- Hello ���� world ����
select otranslate (txt,chr(26),'') from t;
-- Hello world
select otranslate (txt,translate (_unicode '05D0'XC using unicode_to_latin with error),'') from t;
-- Hello world
BTW, there are 2 versions of OTRANSLATE and OREPLACE:
The functions under syslib work with LATIN.
The functions under TD_SYSFNLIB work with UNICODE.
In addition to Dudu's excellent answer above, I wanted to add the following now that I've encountered the issue again and had more time to experiment. The following SELECT command produced an untranslatable character:
SELECT IDENTIFY FROM PROD_MAE_MNTNC_VW.SCHD_MNTNC;
IDENTIFY
24FEB1747659193DC330A163DCL�ORD
Trying to perform a REGEXP_REPLACE or OREPLACE directly on this character produces an error:
Failed [6706 : HY000] The string contains an untranslatable character.
I changed the CHARSET property in my Teradata connection from UTF8 to ASCII and I could now see the offending character, looks like a tab
IDENTIFY
The TRANSLATE_CHK function with this specific conversion succeeds and identifies the position of the offending character (note that this does not work with the UTF8 charset):
TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) AS BADCHAR
BADCHAR
28
Now this character can be dealt with using a CASE expression that cuts the string off at the bad character and keeps the part before it:
CASE WHEN TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) = 0 THEN IDENTIFY
ELSE SUBSTR(IDENTIFY, 1, TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE)-1)
END AS IDENTIFY
Hope this helps someone out.
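If the text after the bad character is needed as well, a variation along these lines might work. It is an untested sketch that reuses the IDENTIFY column, the SCHD_MNTNC view and the KANJI1_SBC_TO_UNICODE conversion from the answer. The alias IDENTIFY_CLEAN is my own, and only the first untranslatable character in each value is handled.

```sql
SELECT CASE
         -- 0 means TRANSLATE_CHK found nothing untranslatable
         WHEN TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) = 0
           THEN IDENTIFY
         -- otherwise glue together the text before and after the reported position
         ELSE SUBSTR(IDENTIFY, 1, TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) - 1)
              || SUBSTR(IDENTIFY, TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) + 1)
       END AS IDENTIFY_CLEAN
FROM PROD_MAE_MNTNC_VW.SCHD_MNTNC;
```

Values with several bad characters would need the same expression applied repeatedly, or the OTRANSLATE approach from the first answer.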

Converting symbols from outside of alphabet caused by copying text in different encoding

My database should only contain data written in the Polish alphabet, but sometimes it contains symbols from outside that alphabet (words copied from a source with a different encoding) that correspond to Polish letters in another encoding. Is it possible to somehow convert the symbols outside the Polish alphabet to the corresponding letters?
The only solution I have come up with is to find and replace those characters manually, but maybe you have a better solution to my problem.
The question concerns Oracle SQL.
I don't have the database in front of me, but if I remember correctly the example could look like this (two rows from my db):
ŚWIAT
ÚWIAT
What I need is to convert Ú, which doesn't belong to the Polish alphabet, to Ś.
You can try this. Experiment with it first to see if it works.
If I want to change every occurrence of the letter z with a j in a string, I would use the translate function: translate(text_string, 'z', 'j'). I don't have to use the letters z and j; instead, I can write translate(text_string, chr(122), chr(106)). To find out the character code, I use select ascii('z') from dual;. For example:
SQL> select translate('banzo', chr(122), chr(106)) from dual;
TRANS
-----
banjo
This changes every occurrence of z to j in text_string.
Now you will have to find the codes for the characters you want to change (both the "from" and the "to" characters) in your character set; it should be your session character set, not the database character set. (At least I think this is correct; experiment with it or read the documentation for CHR and perhaps for TRANSLATE. CHR returns the character for a given code in the DATABASE character set unless you indicate otherwise, while I believe TRANSLATE uses the session character set.)
The function ascii may or may not work for non-ASCII characters, but if you google the name of your character set, you should find a character set table that will show you the codes for all the letters available in that character set.
Then, if this works, you can do the translation in one shot - translate(text_string, 'abcd', 'qrst') will change every 'a' to a 'q', every 'b' to an 'r' etc. And with chr(...), instead of 'abcd' you can write chr(97) || chr(98) || chr(99) || chr(100).
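Putting that together for the example in the question, a sketch might look like this. DUMP shows the bytes and the character set Oracle has actually stored, which helps when working out the codes. The names your_table and word are placeholders, and the Ú to Ś pair is just the one mapping mentioned in the question.

```sql
-- Inspect the stored bytes (format 1016 = hexadecimal plus character set name)
SELECT word, DUMP(word, 1016) AS stored_bytes
FROM your_table;

-- Map the stray character to the intended Polish letter
SELECT TRANSLATE(word, 'Ú', 'Ś') AS fixed_word
FROM your_table;
```

Further pairs can be appended to both strings once the DUMP output shows which stray characters occur in the data.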

Convert text with HTML character encoding to database characterset

Our application receives data from various sources. Some of these contain HTML character markup instead of regular characters. So instead of string "â" we receive string "&acirc;".
How can we convert "&acirc;" to a character in the database character set using SQL/PLSQL?
Our database is 10GR2.
UTL_I18N.UNESCAPE_REFERENCE (and its counterpart ESCAPE_REFERENCE) is, I believe, what you're looking for:
UTL_I18N.UNESCAPE_REFERENCE('hello &lt; &aring;')
This returns 'hello <'||chr(229).
http://docs.oracle.com/cd/B28359_01/appdev.111/b28419/u_i18n.htm#i998992
You can use the CHR() function to convert a character code to the corresponding character.
SELECT chr(226)
FROM dual;
CHR(226)
--------
â
For more information see: http://www.techonthenet.com/oracle/functions/chr.php
Hope it helps...
One solution:
replace(your_test, '&acirc;', chr(226))
but you'd have to nest many replace functions, one for each entity you need to replace. This might be very slow if you have many to replace.
You can write your own function, searching for the ampersand and replacing when found.
Have you searched the Oracle Supplied Packages manual? I know they have a function that does the opposite for a few entities.
To convert a column in Oracle which contains HTML items to plain text, you could use:
trim(regexp_replace(UTL_I18N.unescape_reference(column_name), '<[^>]+>'))
It will replace HTML character references as stated above, but will also remove HTML tags and remove leading and trailing spaces.
I hope it will help someone.