Convert text with HTML character encoding to database characterset - sql

Our application receives data from various sources. Some of these contain HTML character makeup instead of regular characters. So instead of string "â" we receive string "â".
How can we convert "â" to a character in the database character set using SQL/PLSQL?
Our database is 10GR2.

Unescape_reference and excape_reference I believe is what you're looking for
UTL_I18N.UNESCAPE_REFERENCE('hello < å')
This returns 'hello <'||chr(229).
http://docs.oracle.com/cd/B28359_01/appdev.111/b28419/u_i18n.htm#i998992

You can use the CHR() function to convert an ascii character number to a character representation.
SELECT chr(226)
FROM dual;
CHR(226)
--------
â
For more information see: http://www.techonthenet.com/oracle/functions/chr.php
Hope it helps...

one solution
replace(your_test, 'â', chr(226))
but you'd have to nest many replace functions, one for each entity you need to replace. This might be very slow if you have to replace many.
You can wrote your own function, seqrching for the ampersand and replacing when found.
Have you searched the Oracle Supplied Packages manual? I know they have a function that does the opposite for a few entities.

to convert a column in oracle which contains HTML items to plain text, you could use:
trim(regexp_replace(UTL_I18N.unescape_reference(column_name), '<[^>]+>'))
It will replace HTML character as above stated but will also remove HTML tags en remove leading and trailing spaces.
I hope it will help someone.

Related

How to include apostrophe in character set for REGEXP_SUBSTR()

The IBM i implementation of regex uses apostrophes (instead of e.g. slashes) to delimit a regex string, i.e.:
... where REGEXP_SUBSTR(MYFIELD,'myregex_expression')
If I try to use an apostrophe inside a [group] within the expression, it always errors - presumably thinking I am giving a closing quote. I have tried:
- escaping it: \'
- doubling it: '' (and tripling)
No joy. I cannot find anything relevant in the IBM SQL manual or by google search.
I really need this to, for instance, allow names like O'Leary.
Thanks to Wiktor Stribizew for the answer in his comment.
There are a couple of "gotchas" for anyone who might land on this question with the same problem. The first is that you have to give the (presumably Unicode) hex value rather than the EBCDIC value that you would use, e.g. in ordinary interactive SQL on the IBM i. So in this case it really is \x27 and not \x7D for an apostrophe. Presumably this is because the REGEXP_ ... functions are working through Unicode even for EBCDIC data.
The second thing is that it would seem that the hex value cannot be the last one in the set. So this works:
^[A-Z0-9_\+\x27-]+ ... etc.
But this doesn't
^[A-Z0-9_\+-\x27]+ ... etc.
I don't know how to highlight text within a code sample, so I draw your attention to the fact that the hyphen is last in the first sample and second-to-last in the second sample.
If anyone knows why it has to not be last, I'd be interested to know. [edit: see Wiktor's answer for the reason]
btw, using double quotes as the string delimiter with an apostrophe in the set didn't work in this context.
A single quote can be defined with the \x27 notation:
^[A-Z0-9_+\x27-]+
^^^^
Note that when you use a hyphen in the character class/bracket expression, when used in between some chars it forms a range between those symbols. When you used ^[A-Z0-9_\+-\x27]+ you defined a range between + and ', which is an invalid range as the + comes after ' in the Unicode table.

Remove accents from string in Oracle

When trying to remove all accents from a string in Oracle using the techniques described in this stackoverflow answer: how replace accented letter in a varchar2 column in oracle I’m getting mixed results.
select CONVERT('JUAN ROMÄN', 'US7ASCII') from dual;
Returns the original string but replaces characters with for example ñ by a question mark (probably because of the chosen charset - tests with different charsets led to different results).
Using the following technique:
select utl_raw.cast_to_varchar2(nlssort(NAME_USER, 'nls_sort=binary_ai')) from YOUR_TABLE;
Returns the complete string but also places a NUL value at the end of the string.
Is there a characterset that I can use with Spanish accents to get a correct result (the original string with the different accents removed); is there a way to avoid the NUL value in the utl_raw.cast_to_varchar2 technique?
Based on the comments the the replace char(0) seems to remove the NUL value. For example
select
upper(utl_raw.cast_to_varchar2((nlssort('this is áà ñew test','nls_sort=binary_ai')))) as test,
replace(upper(utl_raw.cast_to_varchar2((nlssort('this is áà ñew test','nls_sort=binary_ai')))),chr(0),'') as test2
from dual;
If possible I would however to have a more 'straightforward/simpler' solution.
You can use TRANSLATE(your_string, from_chars, to_chars) https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions196.htm
Just put all chars with accents in from_chars string and their corresponding replacement chars in to_chars.

Remove Special Characters from an Oracle String

From within an Oracle 11g database, using SQL, I need to remove the following sequence of special characters from a string, i.e.
~!##$%^&*()_+=\{}[]:”;’<,>./?
If any of these characters exist within a string, except for these two characters, which I DO NOT want removed, i.e.: "|" and "-" then I would like them completely removed.
For example:
From: 'ABC(D E+FGH?/IJK LMN~OP' To: 'ABCD EFGHIJK LMNOP' after removal of special characters.
I have tried this small test which works for this sample, i.e:
select regexp_replace('abc+de)fg','\+|\)') from dual
but is there a better means of using my sequence of special characters above without doing this string pattern of '\+|\)' for every special character using Oracle SQL?
You can replace anything other than letters and space with empty string
[^a-zA-Z ]
here is online demo
As per below comments
I still need to keep the following two special characters within my string, i.e. "|" and "-".
Just exclude more
[^a-zA-Z|-]
Note: hyphen - should be in the starting or ending or escaped like \- because it has special meaning in the Character class to define a range.
For more info read about Character Classes or Character Sets
Consider using this regex replacement instead:
REGEXP_REPLACE('abc+de)fg', '[~!##$%^&*()_+=\\{}[\]:”;’<,>.\/?]', '')
The replacement will match any character from your list.
Here is a regex demo!
The regex to match your sequence of special characters is:
[]~!##$%^&*()_+=\{}[:”;’<,>./?]+
I feel you still missed to escape all regex-special characters.
To achieve that, go iteratively:
build a test-tring and start to build up your regex-string character by character to see if it removes what you expect to be removed.
If the latest character does not work you have to escape it.
That should do the trick.
SELECT TRANSLATE('~!##$%sdv^&*()_+=\dsv{}[]:”;’<,>dsvsdd./?', '~!##$%^&*()_+=\{}[]:”;’<,>./?',' ')
FROM dual;
result:
TRANSLATE
-------------
sdvdsvdsvsdd
SQL> select translate('abc+de#fg-hq!m', 'a+-#!', etc.) from dual;
TRANSLATE(
----------
abcdefghqm

mysql query to find every occurrence of a non-ascii character in a table column and replace it with its HTML equivalent

I have a column that contains HTML strings and has characters like ® and ™ in it. I want to replace all such characters with their HTML equivalent. Is this possible?
You can do it with multiple nested REPLACE function calls http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_replace .
REPLACE("nonascii2", "&htmlnonascii2", REPLACE("nonascii1", "&htmlnonascii1", "mystring"))

How can I write special character in VB code

I have a Sql statament using special character (ex: ('), (/), (&)) and I don't know how to write them in my VB.NET code. Please help me. Thanks.
Find out the Unicode code point for the character (from http://www.unicode.org) and then use ChrW to convert from the code point to the character. (To put this in another string, use concatenation. I'm somewhat surprised that VB doesn't have an escape sequence, but there we go.)
For example, for the Euro sign (U+20AC) you'd write:
Dim euro as Char = ChrW(&H20AC)
The advantage of this over putting the character directly into source code is that your source code stays "just pure ASCII" - which means you won't have any strange issues with any other program trying to read it, diff it, etc. The disadvantage is that it's harder to see the symbol in the code, of course.
The most common way seems to be to append a character of the form Chr(34)... 34 represents a double quote character. The character codes can be found from the windows program "charmap"... just windows/Run... and type charmap
If you are passing strings to be processed as SQL statement try doubling the characters for example.
"SELECT * FROM MyRecords WHERE MyRecords.MyKeyField = ""With a "" Quote"" "
The '' double works with the other special characters as well.
The ' character can be doubled up to allow it into a string e.g
lSQLSTatement = "Select * from temp where name = 'fred''s'"
Will search for all records where name = fred's
Three points:
1) The example characters you've given are not special characters. They're directly available on your keyboard. Just press the corresponding key.
2) To type characters that don't have a corresponding key on the keyboard, use this:
Alt + (the ASCII code number of the special character)
For example, to type ¿, press Alt and key in 168, which is the ASCII code for that special character.
You can use this method to type a special character in practically any program not just a VB.Net text editor.
3) What you probably looking for is what is called 'escaping' characters in a string. In your SQL query string, just place a \ before each of those characters. That should do.
Chr() is probably the most popular.
ChrW() can be used if you want to generate unicode characters
The ControlChars class contains some special and 'invisible' characters, plus the quote - for example, ControlChars.Quote