Trying to parse non well-formed XML using NSXMLParser - objective-c

I am parsing XML data using NSXMLParser and I have noticed that the elements can contain any character, including, for example, an &. Since the parser reports an error when it comes across this character, I replaced every occurrence of it.
Now I want to make sure I handle every character that may cause errors.
What are they, and how do you think I should best handle them?
Thanks in advance!

To answer half your question, XML has 5 special characters that you may want to escape:
< -- replace with &lt;
> -- replace with &gt;
& -- replace with &amp;
' -- replace with &apos;
and
" -- replace with &quot;
Now, for the other half: how to find and replace these without also replacing all the tags, etc. Not easy, but I'd look into regular expressions and NSRegularExpression: http://developer.apple.com/library/ios/#documentation/Foundation/Reference/NSRegularExpression_Class/Reference/Reference.html
Remember, depending on your use case, to escape the values of the parameters on tags, too; <tag parameter="with "quotes"" /> has to become <tag parameter="with &quot;quotes&quot;" />

You should encode these characters; for instance, & becomes &amp; and " becomes &quot;
When it goes through the parser it should come out ok. Your other option is to use a different XML parser like TBXML which doesn't do format checking.
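The escaping both answers describe can be sketched as follows. This is an illustration in Python rather than Objective-C, using the standard library's xml.sax.saxutils helpers; the names EXTRA_ENTITIES and escape_text and the <item> element are hypothetical and not part of the question.

```python
from xml.sax.saxutils import escape, quoteattr

# escape() always handles &, < and >; the two quote characters are
# passed in as extra entities (this mapping is an assumption of the sketch).
EXTRA_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_text(value):
    """Escape all five special XML characters in element content."""
    return escape(value, EXTRA_ENTITIES)


# Element content: escape the raw value before placing it between tags.
element = "<item>%s</item>" % escape_text("Fuel & Additives")

# Attribute values: quoteattr() escapes the value and adds the quotes itself.
attribute = "<tag parameter=%s />" % quoteattr('with "quotes"')

print(element)
print(attribute)
```

Escaping the values before the document is assembled avoids the harder problem of finding stray characters in finished markup without touching the tags.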

Related

How to include apostrophe in character set for REGEXP_SUBSTR()

The IBM i implementation of regex uses apostrophes (instead of e.g. slashes) to delimit a regex string, i.e.:
... where REGEXP_SUBSTR(MYFIELD,'myregex_expression')
If I try to use an apostrophe inside a [group] within the expression, it always errors - presumably thinking I am giving a closing quote. I have tried:
- escaping it: \'
- doubling it: '' (and tripling)
No joy. I cannot find anything relevant in the IBM SQL manual or by google search.
I really need this to, for instance, allow names like O'Leary.
Thanks to Wiktor Stribizew for the answer in his comment.
There are a couple of "gotchas" for anyone who might land on this question with the same problem. The first is that you have to give the (presumably Unicode) hex value rather than the EBCDIC value that you would use, e.g. in ordinary interactive SQL on the IBM i. So in this case it really is \x27 and not \x7D for an apostrophe. Presumably this is because the REGEXP_ ... functions are working through Unicode even for EBCDIC data.
The second thing is that it would seem that the hex value cannot be the last one in the set. So this works:
^[A-Z0-9_\+\x27-]+ ... etc.
But this doesn't
^[A-Z0-9_\+-\x27]+ ... etc.
I don't know how to highlight text within a code sample, so I draw your attention to the fact that the hyphen is last in the first sample and second-to-last in the second sample.
If anyone knows why it has to not be last, I'd be interested to know. [edit: see Wiktor's answer for the reason]
btw, using double quotes as the string delimiter with an apostrophe in the set didn't work in this context.
A single quote can be defined with the \x27 notation:
^[A-Z0-9_+\x27-]+
          ^^^^
Note that when you use a hyphen in the character class/bracket expression, when used in between some chars it forms a range between those symbols. When you used ^[A-Z0-9_\+-\x27]+ you defined a range between + and ', which is an invalid range as the + comes after ' in the Unicode table.
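The range problem can be reproduced outside IBM i. The sketch below uses Python's re module, which is a different regex engine from the one behind REGEXP_SUBSTR, so it only illustrates the same rule about hyphens inside a bracket expression; the helper names is_valid_name and compiles are hypothetical.

```python
import re

# Hyphen last in the class: it is a literal hyphen, and \x27 is the apostrophe.
NAME_PATTERN = re.compile(r"^[A-Z0-9_+\x27-]+$")


def is_valid_name(value):
    """Return True when value consists only of the allowed characters."""
    return NAME_PATTERN.match(value) is not None


def compiles(pattern):
    """Return True when the pattern is accepted by the regex engine."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(is_valid_name("O'LEARY"))
# Hyphen between + (0x2B) and \x27 (0x27): a reversed, hence invalid, range.
print(compiles(r"^[A-Z0-9_\+-\x27]+"))
```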

Why does Replace '&amp;' with '&' not work for XML data?

I need to download an XML file whose data is retrieved from a stored procedure.
My problem is that if the data contains any '&' symbol, it shows up in the XML file as
'&amp;'
I have used the REPLACE function in my procedure as shown below, but...
SELECT @V_NAME = REPLACE(@V_NAME, ' &amp; ', ' & ');
UPDATE #TMP_RS_XML
SET OBJECT_ID=@V_ID,
FNAME=@V_FILE,
DOCUMENT=(SELECT @V_NAME as 'Description',
...
Now, the output is:
&amp;
This is not the way this is supposed to work...
XML is not just some text with fancy extras; it has very strict rules. Like any text-based container, it needs either magic words or special characters to tell the consumer what is content and what is markup.
The most important markup characters in XML are, of course, < and >. If you want these characters to be part of your content, you have to replace them. That is done with XML entities.
Within the content, every XML entity starts with an ampersand (< comes out as &lt;), so the ampersand is the third most important special character. If you want an ampersand within the content, you must use an entity (&amp;) as a code meaning "in this place we want an ampersand".
You must distinguish between the text you see when you look at the XML and the actual content taken out of the XML.
Try this:
DECLARE @SomeStringWithSpecialCharacters NVARCHAR(200)=N'This & that -> let''s see, why how some foreign characters behave: அரிச். And what about a line break?' + CHAR(13) + CHAR(10) + 'Here is the second line. And an unprintable?' + CHAR(2);
--Here we use FOR XML, all the escaping is done implicitly
SELECT @SomeStringWithSpecialCharacters AS TestIt FOR XML PATH('test');
The result
<test>
<TestIt>This &amp; that -&gt; let's see, why how some foreign characters behave: அரிச். And what about a line break?&#x0D;
Here is the second line. And an unprintable?&#x02;</TestIt>
</test>
Now I take the XML as it came out of the first part and place it into a XML-typed variable.
Attention: I had to remove the &#x02; entity, check it out...
DECLARE @SomeXML XML=
N'<test>
<TestIt>This &amp; that -&gt; let''s see, why how some foreign characters behave: அரிச். And what about a line break?&#x0D;
Here is the second line. And an unprintable?</TestIt>
</test>';
--Now we do the magic using .value() against a native XML:
SELECT @SomeXML.value('(/test/TestIt/text())[1]','nvarchar(max)');
The result comes out with all entities un-escaped again:
This & that -> let's see, why how some foreign characters behave: அரிச். And what about a line break?
Here is the second line. And an unprintable?
The general hint is: Never do the replacements yourself. Pushing content into the XML will need escaping and reading content out of XML will need the opposite. All this is done for you implicitly, when you use the proper tools.
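The same round trip can be sketched outside SQL Server. The example below uses Python's xml.etree.ElementTree purely as an illustration of "let the tool do the escaping"; the element names mirror the T-SQL sample above, and nothing here is specific to the asker's stored procedure.

```python
import xml.etree.ElementTree as ET

original = "This & that -> let's see"

# Writing: assign the raw text and let the library escape it on output.
root = ET.Element("test")
child = ET.SubElement(root, "TestIt")
child.text = original
serialized = ET.tostring(root, encoding="unicode")

# Reading: the parser turns the entities back into plain characters.
recovered = ET.fromstring(serialized).find("TestIt").text

print(serialized)   # the ampersand appears as &amp; in the markup
print(recovered)    # the content is the original string again
```

No REPLACE-style call appears anywhere: the escaped form only exists in the serialized markup, never in the content the program works with.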
'&' is a special character that is being rendered out of ' &amp ; '
The best practice here would be to decode the XML, adding a reference below:
https://learn.microsoft.com/en-us/dotnet/api/system.web.httputility.htmldecode?redirectedfrom=MSDN&view=netframework-4.8#overloads

enclosing all special characters in xml in cdata

I am trying to parse an XML file but I am getting exceptions because the file contains special characters. To make the file parse successfully I have to enclose the values in CDATA. Since there are many special characters in different places in my XML files, is there any way, like a declaration or something, to have these fields wrapped in CDATA automatically, which would save the manual work every time?
Well, either your parser is broken, or your XML.
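One way to tell which of the two is at fault is to feed a conforming parser a tiny sample. The sketch below uses Python's xml.etree.ElementTree as a stand-in for "a parser that follows the rules"; the <desc> element and the helper name parses are hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return the element text if the document is well-formed, else None."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


# A raw ampersand is not well-formed XML, so a conforming parser rejects it.
raw = parses("<desc>Fuel & Additives</desc>")

# Both an entity and a CDATA section are valid and yield the same content.
with_entity = parses("<desc>Fuel &amp; Additives</desc>")
with_cdata = parses("<desc><![CDATA[Fuel & Additives]]></desc>")

print(raw, with_entity, with_cdata)
```

If the raw form is what the file contains, the fix belongs in whatever generates the XML rather than in the parser.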
XML has 5 predefined entities, which your parser should support, and whatever generates your XML should use them when appropriate:
&lt; <
&gt; >
&amp; &
&apos; '
&quot; "

Approximate search with openldap

I am trying to write a search that queries our directory server running openldap.
The users are going to be searching using the first or last name of the person they're interested in.
I found a problem with accented characters (like áéíóú), because first and last names are written in Spanish, so while the proper way is Pérez it can be written for the sake of the search as Perez, without the accent.
If I use '(cn=*Perez*)' I get only the non-accented results.
If I use '(cn=*Pérez*)' I get only accented results.
If I use '(cn=~Perez)' I get weird results, or at least nothing I can use: while the results contain both Perez and Pérez occurrences, I also get some results that apparently have nothing to do with the query.
In Spanish this happens quite a lot. Be it laziness or whatever you want to call it, the fact is that for this kind of thing people tend NOT to write the accents, because it's assumed that all these searches work with both options (I guess since Google allows it, everybody assumes it's supposed to work that way).
Other than updating the database to remove all accents and stripping them from the query, can you think of another solution?
You have your ~ and = swapped above. It should be (cn~=Perez). I still don't know how well that will work; Soundex has always been strange. Since many attributes, including cn, are multi-valued, you could store a second value on the attribute that has the extended characters converted to their base versions. You would still have the original value to go off of when you needed it. You could also get real fancy and prefix the converted value with something, then use the valuesReturnFilter to filter it out of your results.
#Sample object
dn:cn=Pérez,ou=x,dc=y
cn:Pérez
cn:{stripped}Perez
sn:Pérez
#etc.
Then modify your query to use an or expression.
(|(cn=Pérez)(cn={stripped}Perez))
And you would include a valuesReturnFilter that looked like
(!(cn={stripped}*))
See RFC3876 http://www.networksorcery.com/enp/rfc/rfc3876.txt for details. The method for adding a request control varies by what platform/library you are using to access the directory.
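The "second value with the accents removed" idea can be sketched like this. It is a minimal Python illustration, assuming the {stripped} prefix from the sample object above; the function names strip_accents and build_name_filter are hypothetical, and the name is not escaped here, so real user input would need filter escaping first.

```python
import unicodedata

# Prefix marking the accent-free copy of the value (taken from the sample object).
STRIPPED_PREFIX = "{stripped}"


def strip_accents(text):
    """Decompose characters and drop the combining marks (e.g. the acute accent)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_name_filter(name):
    """Match either the name as typed or the stored accent-free copy."""
    return "(|(cn=%s)(cn=%s%s))" % (name, STRIPPED_PREFIX, strip_accents(name))


print(build_name_filter("P\u00e9rez"))
print(build_name_filter("Perez"))
```

Note that this also turns ñ into n, which may or may not be wanted for Spanish names; the same function would be used to generate the extra cn value when the entry is written.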
Search filters ("queries") were originally specified by RFC2254, which has since been replaced by RFC4515.
Encoding:
RFC2254 actually requires filters (indirectly defined) to be an OCTET STRING, i.e. a string of 8-bit bytes: AttributeValue is an OCTET STRING, and MatchingRuleId and AttributeDescription are LDAPString, which is itself an OCTET STRING.
The standard on escaping: use "\<two hex digits>" to replace special characters
(https://www.rfc-editor.org/rfc/rfc4515#page-4, examples https://www.rfc-editor.org/rfc/rfc4515#page-5).
Quote:
The <valueencoding> rule ensures that the entire filter string is a
valid UTF-8 string and provides that the octets that represent the
ASCII characters "*" (ASCII 0x2a), "(" (ASCII 0x28), ")" (ASCII
0x29), "\" (ASCII 0x5c), and NUL (ASCII 0x00) are
represented as a backslash "\" (ASCII 0x5c) followed by the two hexadecimal digits
representing the value of the encoded octet.
Additionally, you should probably replace all characters that semantically modify the filter (RFC 4515's grammar gives a list), and do a Regex replace of non-ASCII characters with wildcards (*) to be sure. This will also help you with characters like "é".
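The escaping rule quoted above can be sketched in a few lines. This is a minimal Python illustration that covers only the five characters named in the quote; the function name escape_filter_value is hypothetical, and LDAP client libraries usually ship their own equivalent helper, which would be the better choice where one is available.

```python
# The characters RFC 4515 requires to be escaped inside a filter value.
_SPECIAL = {"*", "(", ")", "\\", "\x00"}


def escape_filter_value(value):
    """Replace each special character with a backslash and its two hex digits."""
    return "".join(
        "\\%02x" % ord(ch) if ch in _SPECIAL else ch
        for ch in value
    )


user_input = "Perez (Madrid)*"
print("(cn=%s)" % escape_filter_value(user_input))
```

Wildcards that are meant as wildcards are added after escaping, for example "(cn=*%s*)" % escape_filter_value(name), so that only the user-supplied part is neutralised.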

string is partially parsed when symbol "&" is present

If I have a string like "Fuel & Additives" when my XML parser goes through it ignores anything BEFORE the "&" symbol, why?
if ([elementname isEqualToString:@"GLDesc"])
{
currentParsedObjectContainer.GLDesc = currentNodeContent;
NSLog(@"%@",currentParsedObjectContainer.GLDesc);
}
The ampersand character (&) and the
left angle bracket (<) may appear in
their literal form only when used as
markup delimiters, or within a
comment, a processing instruction, or
a CDATA section. If they are needed
elsewhere, they must be escaped using
either numeric character references or
the strings "&amp;" and "&lt;"
respectively.
http://www.w3.org/TR/2000/REC-xml-20001006#syntax
As the above snippet states, you'll need to escape & to the string &amp; before passing it to the XML parser.
I'm going to say it has something to do with character codes (for example &lt). I'm not really familiar with xml though, so I'm not sure. Try &amp?