How to efficiently replace special characters in an XML in Oracle SQL? - sql

I'm parsing an xml in oracle sql.
XMLType(replace(column1,'&','<![CDATA[&]]>')) -- column1 is a column name that has xml data
While parsing, I'm temporarily wrapping '&' in CDATA to prevent any xml exception. After getting rid of the exception caused by '&', I'm getting "invalid character 32 (' ') found in a Name or Nmtoken". This is because of '<' character.
E.g: <child> 40 < 50 </child> // This causes the above exception.
So I tried the below and it works.
XMLType(replace(replace(column1,'&','<![CDATA[&]]>'),'< ','<![CDATA[< ]]>'))
In the above, I'm wrapping '< '(less than symbol followed by space) in CDATA. But the above is a bit time consuming. So I'm trying to use regex to reduce the time taken.
Does anyone know how to implement the above action using regex in Oracle sql??
Input : <child> 40 & < 50 </child>
Expected Output : <child> 40 <![CDATA[&]]> <![CDATA[< ]]> 50 </child>
Note: Replacing '&' with the entity '&amp;' sometimes leads to an 'entity reference not well formed' exception. Hence I have opted to wrap it in CDATA.

You can do that with a regexp like this:
select regexp_replace(column1,'(&|< )','<![CDATA[\1]]>') from your_table; -- your_table stands for the table that holds column1
However, regexp_replace (and all the regexp_* functions) are often slower than using plain replace, because they do more complicated logic. So I'm not sure if it'll be faster or not.
You might already be aware, but your underlying problem here is that you're starting out with invalid XML that you're trying to fix, which is a hard problem! The ideal solution is to not have invalid XML in the first place - if possible, you should escape special characters when originally generating your XML. There are built-in functions which can do that quickly, like DBMS_XMLGEN.CONVERT or HTF.ESCAPE_SC.
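To see what the pattern does without touching the database, here is a minimal sketch in Python rather than Oracle SQL. It is only an illustration: the helper name wrap_in_cdata is made up, and Python's regex engine is not Oracle's, although this simple pattern behaves the same way in both.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as the regexp_replace call: a bare '&', or a '<' followed by a space.
UNSAFE = re.compile(r"(&|< )")


def wrap_in_cdata(xml_text: str) -> str:
    """Wrap every match in its own CDATA section (hypothetical helper)."""
    return UNSAFE.sub(r"<![CDATA[\1]]>", xml_text)


broken = "<child> 40 & < 50 </child>"
fixed = wrap_in_cdata(broken)

# Parsing the unfixed string would raise ParseError; the fixed one parses.
parsed = ET.fromstring(fixed)
print(fixed)
print(repr(parsed.text))
```

Note that the space after '<' is part of the match, so it ends up inside the CDATA section. The pattern also wraps an '&' that already starts a valid entity such as &amp;, which is one more reason to escape the data when the XML is generated.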

Related

How to include apostrophe in character set for REGEXP_SUBSTR()

The IBM i implementation of regex uses apostrophes (instead of e.g. slashes) to delimit a regex string, i.e.:
... where REGEXP_SUBSTR(MYFIELD,'myregex_expression')
If I try to use an apostrophe inside a [group] within the expression, it always errors - presumably thinking I am giving a closing quote. I have tried:
- escaping it: \'
- doubling it: '' (and tripling)
No joy. I cannot find anything relevant in the IBM SQL manual or by google search.
I really need this to, for instance, allow names like O'Leary.
Thanks to Wiktor Stribizew for the answer in his comment.
There are a couple of "gotchas" for anyone who might land on this question with the same problem. The first is that you have to give the (presumably Unicode) hex value rather than the EBCDIC value that you would use, e.g. in ordinary interactive SQL on the IBM i. So in this case it really is \x27 and not \x7D for an apostrophe. Presumably this is because the REGEXP_ ... functions are working through Unicode even for EBCDIC data.
The second thing is that it would seem that the hex value cannot be the last one in the set. So this works:
^[A-Z0-9_\+\x27-]+ ... etc.
But this doesn't:
^[A-Z0-9_\+-\x27]+ ... etc.
I don't know how to highlight text within a code sample, so I draw your attention to the fact that the hyphen is last in the first sample and second-to-last in the second sample.
If anyone knows why it has to not be last, I'd be interested to know. [edit: see Wiktor's answer for the reason]
btw, using double quotes as the string delimiter with an apostrophe in the set didn't work in this context.
A single quote can be defined with the \x27 notation:
^[A-Z0-9_+\x27-]+
          ^^^^
Note that when you use a hyphen in the character class/bracket expression, when used in between some chars it forms a range between those symbols. When you used ^[A-Z0-9_\+-\x27]+ you defined a range between + and ', which is an invalid range as the + comes after ' in the Unicode table.
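The range rule can be checked with any regex engine that supports \x escapes. The sketch below uses Python's re module purely as a stand-in (it is not the engine IBM i uses, so error messages will differ), and the test name is invented.

```python
import re

# Hyphen last: it is a literal '-', and \x27 is a literal apostrophe.
good = re.compile(r"^[A-Z0-9_+\x27-]+$")
print(bool(good.match("O'LEARY-SMITH")))

# Hyphen between '+' (0x2B) and \x27 (0x27): this asks for a range that runs
# backwards, so the pattern is rejected when it is compiled.
try:
    re.compile(r"^[A-Z0-9_\+-\x27]+$")
    range_error = None
except re.error as exc:
    range_error = str(exc)
print(range_error)
```

Placing the hyphen first or last in the class, or escaping it, avoids the accidental range.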

Why does Replace '&amp;' with '&' not work for XML data?

I need to download an XML file whose data is retrieved from a stored procedure.
My problem is that if the data contains an '&' symbol, the XML file shows it as
'&amp;'
I have used REPLACE function in my Procedure as shown below but...
SELECT @V_NAME = REPLACE(@V_NAME, '&amp;', '&');
UPDATE #TMP_RS_XML
SET OBJECT_ID=@V_ID,
FNAME=@V_FILE,
DOCUMENT=(SELECT @V_NAME as 'Description',
...
Now, the output is:
&amp;
This is not the way this is supposed to work...
XML is not just some text with fancy extras but with very strict rules. As any text-based container you will need either magic words or special characters to tell the consumer what is the content and what is the markup.
The most important markup characters in XML are < and > - of course. If you want these characters to be part of your content, you'll have to replace them. That is done with xml entities.
Within the content, any XML entity will start with an ampersand (< comes out as &lt;), therefore the ampersand is the third most important special character. If you want an ampersand within the content you must use an entity (&amp;) as a code for "in this place we want an ampersand".
You must distinguish between the text you see, when you look at the XML and the actual content taken out of the XML.
Try this:
DECLARE @SomeStringWithSpecialCharacters NVARCHAR(200)=N'This & that -> let''s see, why how some foreign characters behave: அரிச். And what about a line break?' + CHAR(13) + CHAR(10) + 'Here is the second line. And an unprintable?' + CHAR(2);
--Here we use FOR XML, all the escaping is done implicitly
SELECT @SomeStringWithSpecialCharacters AS TestIt FOR XML PATH('test');
The result
<test>
<TestIt>This &amp; that -&gt; let's see, why how some foreign characters behave: அரிச். And what about a line break?&#x0D;
Here is the second line. And an unprintable?&#x02;</TestIt>
</test>
Now I take the XML as it came out of the first part and place it into a XML-typed variable.
Attention: I had to remove the &#x02; entity, check it out...
DECLARE @SomeXML XML=
N'<test>
<TestIt>This &amp; that -&gt; let''s see, why how some foreign characters behave: அரிச். And what about a line break?&#x0D;
Here is the second line. And an unprintable?</TestIt>
</test>';
--Now we do the magic using .value() against a native XML:
SELECT @SomeXML.value('(/test/TestIt/text())[1]','nvarchar(max)');
The result comes out with all entities decoded again:
This & that -> let's see, why how some foreign characters behave: அரிச். And what about a line break?
Here is the second line. And an unprintable?
The general hint is: Never do the replacements yourself. Pushing content into the XML will need escaping and reading content out of XML will need the opposite. All this is done for you implicitly, when you use the proper tools.
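The same round trip can be sketched outside SQL Server. The snippet below uses Python's standard xml.etree.ElementTree only as an illustration of "let the tool do the escaping"; it is not the T-SQL API, and the element name TestIt is simply borrowed from the example above.

```python
import xml.etree.ElementTree as ET

content = "This & that -> let's see"

# Writing: the serializer escapes markup characters in text content.
element = ET.Element("TestIt")
element.text = content
serialized = ET.tostring(element, encoding="unicode")
print(serialized)

# Reading: the parser resolves the entities back to the original characters.
restored = ET.fromstring(serialized).text
print(restored == content)
```

The text you see in the serialized form contains entities, while the content read back through the parser does not, and no manual replacement was needed in either direction.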
'&' is a special character in XML, so it is rendered as '&amp;'.
The best practice here would be to decode the XML when reading it; see the reference below:
https://learn.microsoft.com/en-us/dotnet/api/system.web.httputility.htmldecode?redirectedfrom=MSDN&view=netframework-4.8#overloads

selecting alpha numeric node in XQuery

I have this XQuery
declare @XML xml
set @XML =
'
<root>
<row1>
<value>1</value>
</row1>
<1row2>
<value>2</value>
</1row2>
</root>
'
select @XML.query('/root/1row2')
I keep getting an error while trying to select 1row2.
This is the error:
XQuery [query()]: Syntax error near '1', expected a step expression.
It seems that I get this error whenever an XML node name starts with a number. Is there a way to select that node?
From XML Naming Rules, XML elements must follow these naming rules:
Element names are case-sensitive
Element names must start with a letter or underscore
Element names cannot start with the letters xml (or XML, or Xml, etc)
Element names can contain letters, digits, hyphens, underscores, and periods
Element names cannot contain spaces
Any name can be used, no words are reserved (except xml).
So element names must start with a letter or underscore. On SQL Server 2016 SP1 your XML is not even valid, so the statement cannot be executed.
You need to either repair your string to be a valid XML or to query the data using some other technique (for example, SQL CLR function to implement regex expression support or splitting the nodes).
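A quick way to confirm that the name itself is the problem is to run the document through any conforming XML parser. This is a minimal sketch using Python's standard library, not SQL Server; the helper is_well_formed and the renamed element _1row2 are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Element name starts with a digit: rejected before any query can run.
bad = "<root><1row2><value>2</value></1row2></root>"
# Same structure with a leading underscore: accepted.
good = "<root><_1row2><value>2</value></_1row2></root>"

print(is_well_formed(bad))
print(is_well_formed(good))
```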

Trying to parse non well-formed XML using NSXMLParser

I am parsing XML data using NSXMLParser and I have noticed that the elements can contain all characters, including for example a &. Since the parser gives an error when it comes across this character, I replaced every occurrence of it.
Now I want to make sure I handle each of the characters that may cause errors.
What are they and how do you think I should handle these characters best?
Thanks in advance!
To answer half your question, XML has 5 special characters that you may want to escape:
< -- replace with &lt;
> -- replace with &gt;
& -- replace with &amp;
' -- replace with &apos;
and
" -- replace with &quot;
Now, for the other half--how to find and replace these without also replacing all the tags, etc... Not easy, but I'd look in to regular expressions and NSRegularExpression: http://developer.apple.com/library/ios/#documentation/Foundation/Reference/NSRegularExpression_Class/Reference/Reference.html
Remember, depending on your use case, to escape the values of the parameters (attributes) on tags, too: <tag parameter="with &quot;quotes&quot;" />
You should encode these characters, for instance & becomes &amp; or " becomes &quot;
When it goes through the parser it should come out ok. Your other option is to use a different XML parser like TBXML which doesn't do format checking.
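The five replacements are mechanical enough that most standard libraries ship them. The sketch below uses Python's xml.sax.saxutils as an illustration (it is not an NSXMLParser or Foundation API); the sample string and the two mapping dictionaries are invented for the example.

```python
from xml.sax.saxutils import escape, quoteattr, unescape

# escape() handles &, < and > by itself; the two quote characters are opt-in.
QUOTES = {"'": "&apos;", '"': "&quot;"}
UNQUOTES = {"&apos;": "'", "&quot;": '"'}

raw = 'Tom & "Jerry" < O\'Leary'
escaped = escape(raw, QUOTES)
print(escaped)

# unescape() reverses the mapping, resolving &amp; last.
print(unescape(escaped, UNQUOTES) == raw)

# quoteattr() picks a quoting style that is safe for an attribute value.
print("<tag parameter=%s />" % quoteattr('with "quotes"'))
```

This only works on the text values before they are placed into the markup; applying it to a whole document would also escape the tags themselves.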

Replace character in SQL results

This is from an Oracle SQL query. It has these weird skinny rectangle shapes in the database in places where apostrophes should be. (I wish we could paste screenshots in here.)
It looks like this when I copy and paste the results.
spouse�s
Is there a way to write a SQL SELECT statement that searches for this character in the field and replaces it with an apostrophe in the results?
Edit: I need to change only the results in a SELECT statement for reporting purposes, I can't change the Database.
I ran this
select dump('�') from dual;
which returned
Typ=96 Len=3: 239,191,189
This seems to work so far
select translate('What is your spouse�s first name?', '�', '''') from dual;
but this doesn't work
select translate(Fieldname, '�', '''') from TableName
Select FN from TN
What is your spouse�s first name?
SELECT DUMP(FN, 1016) from TN
Typ=1 Len=33 CharacterSet=US7ASCII: 57,68,61,74,20,69,73,20,79,6f,75,72,20,73,70,6f,75,73,65,92,73,20,66,69,72,73,74,20,6e,61,6d,65,3f
EDIT:
So I have established that it is the backquote character. I can't get the DB updated, so I'm trying this code
SELECT REGEX_REPLACE(FN,"\0092","\0027") FROM TN
and I'm getting ORA-00904:"Regex_Replace":invalid identifier
This seems to be a problem with your character set configuration. Check your NLS_LANG and the other NLS_xxx environment/registry values. You have to check the Oracle server, your client, and the client that inserted the data.
Try to DUMP the value. You can do it with a select as simple as:
SELECT DUMP(the_column)
FROM xxx
WHERE xxx
UPDATE: I think that before trying to replace anything, you should look for the root of the problem. If this happens because of a character set problem, you can get into big trouble with bad data.
UPDATE 2: Answering the comments. The problem may not be on the database server side; it may be on the client side. The problem (if this is the problem) can be a translation during server-to-client communication, caused by server and client configurations that do not match. For instance, if the server has a UTF8 character set defined and your client uses US7ASCII, then all accented characters will appear as ?.
Another possibility is that the server and your client both use UTF8 but the application is not able to show UTF8 characters; then the problem is on the application side.
UPDATE 3: On your examples:
select translate('What...: This works because the � is exactly the same character on both sides: you pasted it in both places.
select translate(Fieldname...: This does not work because the � is not what is stored in the database; it is the character the client receives, possibly because some translation occurs between the table and what is shown to you.
Next step: look up the DUMP syntax and try to extract the codes for the mysterious character (from the table, not by pasting �!).
I would say there's a good chance the character is a single-tick "smart quote" (I hate the name). The smart quotes are characters 0x91-0x94 (using a Windows encoding), or Unicode U+2018, U+2019, U+201C, and U+201D.
I'm going to propose a front-end application-based, client-side approach to the problem:
I suspect that this problem has more to do with a mismatch between the font you are trying to display the word spouse�s with, and the character �. That icon appears when you are trying to display a character in a Unicode font that doesn't have the glyph for the character's code.
The Oracle database will dutifully return whatever characters were INSERTed into its column. It's up to you and your application to interpret what they will look like in the font you are using to display your data, so I suggest investigating what this mysterious � character is that is replacing your apostrophes. Start by using FerranB's recommended DUMP().
Try running the following query to get the character code:
SELECT DUMP(<column with weird character>, 1016)
FROM <your table>
WHERE <column with weird character> like '%spouse%';
If that doesn't grab your actual text from the database, you'll need to modify the WHERE clause to actually grab the offending column.
Once you've found the code for the character, you could just replace the character by using the regexp_replace() built-in function by determining the raw hex code of the character and then supplying the ASCII / C0 Controls and Basic Latin character 0x0027 ('), using code similar to this:
UPDATE <table>
set <column with offending character>
= REGEXP_REPLACE(<column with offending character>,
'<character code of �>',
'''')
WHERE REGEXP_LIKE(<column with offending character>,'<character code of �>');
If you aren't familiar with Unicode and different ways of character encoding, I recommend reading Joel's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I wasn't until I read that article.
EDIT: If you're seeing 0x92, there's likely a charset mismatch here:
0x92 in CP-1252 (default Windows code page) is the right single quotation mark (U+2019), a "smart" apostrophe. This code isn't a valid ASCII character, and in ISO-8859-1 it is a non-printing control code rather than a quote. So probably either the database is in CP-1252 encoding (I don't find that likely), or a database connection which spoke CP-1252 inserted it, or somehow the apostrophe got converted to 0x92. The database is returning values that are valid in CP-1252 (or some other charset where 0x92 is valid), but your db client connection isn't expecting CP-1252. Hence the weird question mark.
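The byte values from the two DUMP outputs in the question can be decoded offline to confirm this reading. The following is a small Python sketch, offered only as a way to inspect the bytes; it does not touch the database, and the variable names are made up.

```python
# Bytes copied from the DUMP(FN, 1016) output in the question.
dump_hex = (
    "57,68,61,74,20,69,73,20,79,6f,75,72,20,73,70,6f,75,73,65,92,"
    "73,20,66,69,72,73,74,20,6e,61,6d,65,3f"
)
raw = bytes.fromhex(dump_hex.replace(",", ""))

# Interpreted as CP-1252, byte 0x92 is U+2019 (right single quotation mark).
text = raw.decode("cp1252")
print(text)

# Swap the typographic apostrophe for the plain ASCII one (0x27).
repaired = text.replace("\u2019", "'")
print(repaired)

# The 239,191,189 from dump('�') is UTF-8 for U+FFFD, the replacement
# character that a client shows when it cannot decode a byte.
placeholder = bytes([239, 191, 189]).decode("utf-8")
print(placeholder == "\ufffd")
```

In other words, the � pasted into the query is the client's placeholder, while the table holds the single byte 0x92, which is why translating the pasted character finds nothing in the column.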
And FerranB is likely right. I would talk with your DBA or some other admin about this to get the issue straightened out. If you can't, I would try either doing the update above (seems like you can't), or doing this:
INSERT INTO <table> (<normal table columns>,...,<column with offending character>)
SELECT <all normal columns>, REPLACE(<column with offending character>,
CHR(146), -- 0x92, assuming a single-byte database character set
'''') -- ASCII/ISO-8859-1 apostrophe
FROM <table>
WHERE INSTR(<column with offending character>, CHR(146)) > 0;
DELETE FROM <table> WHERE INSTR(<column with offending character>, CHR(146)) > 0;
Before you do this you need to understand what actually happened. It looks to me as if someone inserted non-ASCII strings into the database, for example Unicode or UTF-8. Before you fix this, be very sure that this is actually a bug. The apostrophe comes in many forms, not just the "'".
TRANSLATE() is a useful function for replacing or eliminating known single character codes.