Dealing with whitespace when parsing XML

I have a problem parsing XML.
I parse data for two cities, Amsterdam and Den Bosch.
Amsterdam works fine, but Den Bosch does not.
The cause is almost certainly the space: "Den Bosch" contains whitespace.
Should I trim the whitespace in my application or in the web service?
Which is the better place to handle the space problem?
EDIT:
The OP and @PeterMurray-Rust seem to agree that the problem is that the third-party app returns URL-escaped strings of the form:
"Den%20Bosch"
XML does not recognize %20 as anything special, so it will be necessary to replace each occurrence with a space. A typical scripting approach would be
s/%20/ /g
This is likely to be quite a common problem, although I'm not clear why the content should be URL-encoded.
[OP, please comment if I have got this wrong]

From your update I assume that the data is something like:
<city>Den%20Bosch</city>
The string "%20" is three characters to which XML attaches no special meaning. Depending on your language, or on whether you use XSLT, you will need to replace them yourself. In Java with the XOM library I might write
String value = cityNode.getValue().replaceAll("%20", " ");
I can't help with the specifics of Cocoa - I think you'll have to investigate the API to find how to get content values.
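The same replacement can be sketched in Python, used here only for illustration since the thread itself is about Cocoa. This sketch assumes the payload looks like the `<city>` element above; the helper name `city_name` is hypothetical. It uses the standard library's `unquote`, which decodes every percent-escape rather than only `%20`.

```python
from urllib.parse import unquote
import xml.etree.ElementTree as ET


def city_name(xml_text):
    """Return the text of a single <city> element with percent-escapes decoded."""
    node = ET.fromstring(xml_text)
    # node.text is None for an empty element, so fall back to an empty string.
    return unquote(node.text or "")


print(city_name("<city>Den%20Bosch</city>"))  # Den Bosch
```

Decoding all escapes is usually what is wanted if the service URL-encodes its values, but that is an assumption about the service, not something the thread confirms.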

I assume that you are parsing the XML at the application level, and that by whitespace you mean trailing whitespace rather than the space between the words "Den" and "Bosch". In any case, I think you should trim the spaces at the web service level. That way, any other application calling the web service does not have to trim the spaces itself, because the web service handles it internally. This would be a single point of change for you.

I don't know much about Cocoa or about your XML. Are the city names the inner text of a node, or are they tag names? If a name with a space is used as a tag name or as an unquoted attribute value, parsing will fail. If it is inner text, it should work. There is also the CDATA section, which tells the parser to treat its contents as literal character data rather than as markup.
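The distinction can be checked with a small sketch in Python's standard `xml.etree.ElementTree` parser (used for illustration only; the element and attribute names are made up, and the helper `parses` is hypothetical). A space inside element text or a CDATA section is fine, while a space in a tag name or an unquoted attribute value is rejected.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Whitespace in inner text is ordinary character data.
plain = ET.fromstring("<city>Den Bosch</city>")

# A CDATA section keeps its contents literal, including & and <.
cdata = ET.fromstring("<city><![CDATA[Den Bosch & <Vught>]]></city>")

print(plain.text)                         # Den Bosch
print(cdata.text)                         # Den Bosch & <Vught>
print(parses("<Den Bosch/>"))             # False
print(parses("<city name=Den Bosch/>"))   # False
```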

This is the code I implemented; it is working fine.
if ([appDelegate.cityListArray count] > 0) {
    aDJInfo = [appDelegate.cityListArray objectAtIndex:indexPath.row];
    // Example: http://compliantbox.com/party_temperature/citysearch.php?city=Amsterdam&latitude=52.366125&longitude=4.899171
    url = @"http://compliantbox.com/party_temperature/citysearch.php?city=";
    // Percent-encode spaces so that "Den Bosch" is sent as "Den%20Bosch".
    NSString *string = [aDJInfo.city_Name stringByReplacingOccurrencesOfString:@" " withString:@"%20"];
    url = [url stringByAppendingString:string];
    NSLog(@"abbbbbbbbbbb %@", string);
    url = [url stringByAppendingString:@"&latitude=52.366125&longitude=4.899171"];
    [self parseEventName:[[NSURL alloc] initWithString:url]];
}
} // closes the enclosing method (not shown)
@All thanks a lot.
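The fix above encodes the space on the request side. For comparison, here is a minimal Python sketch of the same idea, assuming the endpoint and parameter names shown in the snippet; the function name `city_search_url` is hypothetical. Letting a library build the query string covers every reserved character, not only spaces.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the snippet above.
BASE_URL = "http://compliantbox.com/party_temperature/citysearch.php"


def city_search_url(city, latitude, longitude):
    """Build the search URL with every parameter value percent-encoded."""
    params = {"city": city, "latitude": latitude, "longitude": longitude}
    # quote_via=quote encodes a space as %20; the default would produce '+'.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(city_search_url("Den Bosch", "52.366125", "4.899171"))
# http://compliantbox.com/party_temperature/citysearch.php?city=Den%20Bosch&latitude=52.366125&longitude=4.899171
```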

Related

NSString containing apostrophe (') is not set properly to SOAP service

I have what seems to me a strange problem. I have a UITextField that contains a username. When it contains an apostrophe ('), the service cannot read the username properly. I suppose it is connected with Unicode. I tried to see what the character codes are, and I get:
L'TEST - contains code 8217
' - is 39
` - is 96
Can anyone explain to me why this happens, so I can fix this issue?

L'TEST - contains code 8217
That would be L’TEST with U+2019 ‹’› \N{RIGHT SINGLE QUOTATION MARK}. If you look closely, in most fonts this character is displayed with a slight curl. It's not an apostrophe, but misused as one.
can anyone explains to me why this happens
Common causes:
"The Fool" Some mischievous input system silently replaced the apostrophe with a quotation mark. Word processors and mobile OS on-screen keyboards do that. It's well-meaning, but sometimes produces the wrong result.
"The Clueless" The user does not know how to type an apostrophe correctly and picked the similar-looking quotation mark.
"The Angry" Your UI text field (or something else in the chain originating from the user) forbids entry of apostrophes for some misguided reason. The user absolutely refuses to write something orthographically incorrect and manually replaces the apostrophe with a quotation mark to work around the defective software.
so I can fix this issue
This is a social problem, not a software problem.

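Since a SOAP service is based on XML and single quotes are a delimiter in XML (they can enclose attribute values), you may need to escape single quotes by replacing "'" with "&apos;" in all text fields. XML has no backslash escapes, so "\'" will not work. Strictly, the escape is required only inside single-quoted attribute values. It's a very common issue.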
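A short Python sketch of both points in this thread, for illustration only: the character reported as code 8217 is U+2019, not the ASCII apostrophe (39), and XML escaping uses entities. The helper names are hypothetical, and whether to normalize U+2019 to an apostrophe at all is a policy decision that the answers above do not settle.

```python
from xml.sax.saxutils import escape


def normalize_apostrophes(text):
    """Replace RIGHT SINGLE QUOTATION MARK (U+2019) with the ASCII apostrophe."""
    return text.replace("\u2019", "'")


def xml_escape(text):
    """Escape &, < and > plus both quote characters for use in XML."""
    return escape(text, {"'": "&apos;", '"': "&quot;"})


username = "L\u2019TEST"
print(ord(username[1]))                             # 8217
print(xml_escape(normalize_apostrophes(username)))  # L&apos;TEST
```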

Scrapy: how to solve the "empty" item in html due to a foreign language symbol?

One of the scraped items seems to contain no content in the HTML. In the MySQL database it does have content, including a non-standard dash (-) that is slightly longer than usual. It could be a dash symbol from Chinese input, or something similar. I am copying it below, though I am not sure whether it will keep its original form. The web link is here, and this non-standard dash is in the title and at the beginning of the description.
**Hospitalist – Chattanooga**
To further prove it, the CSV file exported from MySQL converts this odd dash to ?€?. Most likely this symbol causes the display problem.
I want to either delete this symbol or replace it with a comma or a regular dash. Where can that be done? During the Scrapy run? Or in MySQL? Sorry that this is not a specific coding question; I need some guidance before writing any code for this problem.

The long dash is called an em dash (see the fileformat entry for EM DASH).
The reason you are seeing it is likely the chosen encoding.
Try setting a different encoding, or replace the em dash with the comma you mentioned in your question.
In PHP you can do so with the following code (chr(151) is the em dash only in the Windows-1252 encoding):
$output = str_replace(chr(151), ',', $input);
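Since the question is about Scrapy, the replacement could also happen in Python before the item is stored. This is a minimal sketch, not Scrapy-specific code: the function names are hypothetical, and it handles both the en dash (U+2013) and the em dash (U+2014) because the two are easy to confuse. The second helper only illustrates how a dash turns into debris when UTF-8 bytes are read as Windows-1252.

```python
def replace_dashes(text, replacement="-"):
    """Replace en and em dashes with a plain replacement string."""
    return text.translate({0x2013: replacement, 0x2014: replacement})


def misread_as_cp1252(text):
    """Show what UTF-8 text looks like when decoded with the wrong encoding."""
    return text.encode("utf-8").decode("cp1252")


title = "Hospitalist \u2013 Chattanooga"
print(replace_dashes(title))       # Hospitalist - Chattanooga
print(replace_dashes(title, ","))  # Hospitalist , Chattanooga
print(misread_as_cp1252("\u2013"))
```

If the stored text is correct and only the export is wrong, fixing the connection or export encoding is the cleaner route, as the answer suggests.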

SQL Strip the Font Format(Colour or other)

I need to strip the formatting out of the text in a notes table.
Here is an example:
";\red31\green73\blue125;
\viewkind4\uc1\ltrpar\f0\fs20 USEFUL TEXT BODY \cf1\f3
\ltrpar\f0\fs17
"
How can I get rid of this formatting? I want to play it safe and avoid simply removing everything after a '\'.
Many thanks,
Rick

You're making it quite difficult for yourself by not replacing '\'.
If you look at http://other9.tripod.com/Refs/easy-rtf.html you will see that there are many different RTF codes and that the codes have no fixed length.
Additionally, unlike HTML, RTF does not require a "closing" tag for each code, which makes it even more difficult.
The only thing I can think of is to record all possible RTF codes (or use an RTF parser library) so that you can recognize whether a \ starts an RTF code or not.
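As a rough illustration of recognizing RTF codes by their shape rather than by a complete list, here is a Python sketch (the question is about SQL, so this would run outside the database). It assumes a control word is a backslash, letters, an optional numeric parameter and one optional trailing space. It is deliberately simplistic: it does not handle escaped backslashes, braces, groups such as font or colour tables, or hex escapes, so a real RTF parser library remains the safer choice.

```python
import re

# Assumed shape of a control word, e.g. \fs20 or \ltrpar, plus one delimiter space.
CONTROL_WORD = re.compile(r"\\[a-zA-Z]+(?:-?\d+)? ?")


def strip_rtf_control_words(fragment):
    """Remove control words and return the remaining text on one line."""
    without_codes = CONTROL_WORD.sub("", fragment)
    # Drop leftover table delimiters (;) and blank lines from the fragment.
    parts = [line.strip(" ;\t") for line in without_codes.splitlines()]
    return " ".join(part for part in parts if part)


sample = (
    r";\red31\green73\blue125;" + "\n"
    + r"\viewkind4\uc1\ltrpar\f0\fs20 USEFUL TEXT BODY \cf1\f3" + "\n"
    + r"\ltrpar\f0\fs17" + "\n"
)
print(strip_rtf_control_words(sample))  # USEFUL TEXT BODY
```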

Remove & character from string objective c

How would I go about removing the "&" symbol from a string? It's making my XML parser fail.
I have tried
[currentParsedCharacterData setString: [currentParsedCharacterData stringByReplacingOccurrencesOfString:@"&" withString:@"and"]];
But it seems to have no effect.

Really what this boils down to is that you want to gracefully handle invalid XML. The XML parser is correctly telling you that this XML is invalid, and it therefore fails to parse. Assuming you have no control over this XML content, I would suggest pre-parsing it for common errors like this one; the output would be a sanitized XML document that has a better chance of parsing successfully.
Sanitizing the document may be as simple as a search and replace. The problem with a blanket replace of every & is that there are valid uses of &, for example &amp; or &copy;. You would end up munging the XML by creating something like this: andcopy;
You could search for "ampersand space", but that won't catch a string that has an ampersand as its last character (an edge case that might be easily handled). What you are really searching for are occurrences of & that are not followed by a ;, or where any kind of whitespace appears before the following ;, because the semicolon is fine on its own.
If you need more power because you need to detect this and other errors, I would suggest using NSScanner or regex matching to search for occurrences of this and other common errors during your sanitization step. XML files are also often rather large, so be careful when handling them as in-memory strings, as this can easily lead to application crashes. Breaking the input up into manageable chunks is something NSScanner can do very well.
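The rule described above (an & that does not begin an entity reference) can be expressed as a single regular expression. The sketch below uses Python for illustration; the same pattern idea should carry over to other regex engines. The function name is hypothetical, and the pattern accepts any name-like entity such as &copy;, even though plain XML only predefines five of them.

```python
import re

# An '&' that is NOT followed by a named, decimal or hex reference ending in ';'.
BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Escape stray ampersands while leaving existing entity references alone."""
    return BARE_AMPERSAND.sub("&amp;", text)


print(escape_bare_ampersands("Tom & Jerry &copy; R&D &amp; &#169;"))
# Tom &amp; Jerry &copy; R&amp;D &amp; &#169;
```

Escaping to &amp; instead of substituting the word "and" keeps the original text recoverable after parsing.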

For a quick attempt, look at stringByReplacingOccurrencesOfString on NSString:
NSString *str = @"a & b";
// NSString is immutable, so keep the returned string. It would be better to replace with &amp; than with "and".
str = [str stringByReplacingOccurrencesOfString:@"&" withString:@"and"];
However, you should also deal with other special characters, e.g. < and >.

How to Parse Some Wiki Markup

Hey guys, given a data set in plain text such as the following:
==Events==
* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].
* [[710]] – [[Saracen]] invasion of [[Sardinia]].
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].
*[[1275]] – Traditional founding of the city of [[Amsterdam]].
*[[1524]] – [[Italian Wars]]: The French troops lay siege to [[Pavia]].
*[[1553]] – Condemned as a [[Heresy|heretic]], [[Michael Servetus]] is [[burned at the stake]] just outside [[Geneva]].
*[[1644]] – [[Second Battle of Newbury]] in the [[English Civil War]].
*[[1682]] – [[Philadelphia]], [[Pennsylvania]] is founded.
I would like to end up with an NSDictionary or other form of collection so that I can have the year (The Number on the left) mapping to the excerpt (The text on the right). So this is what the 'template' is like:
*[[YEAR]] – THE_TEXT
Though I would like the excerpt to be plain text, that is, no wiki markup so no [[ sets. Actually, this could prove difficult with alias links such as [[Edmund I of England|Edmund I]].
I am not all that experienced with regular expressions so I have a few questions. Should I first try to 'beautify' the data? For example, removing the first line which will always be ==Events==, and removing the [[ and ]] occurrences?
Or perhaps a better solution: Should I do this in passes? So for example, the first pass I can separate each line into * [[710]] and [[Saracen]] invasion of [[Sardinia]]. and store them into different NSArrays.
Then go through the first NSArray of years and only get the text within the [[]] (I say text and not number because it can be 530 BC), so * [[710]] becomes 710.
And then for the excerpt NSArray, go through and if an [[some_article|alias]] is found, make it only be [[alias]] somehow, and then remove all of the [[ and ]] sets?
Is this possible? Should I use regular expressions? Are there any ideas you can come up with for regular expressions that might help?
Thanks! I really appreciate it.
EDIT: Sorry for the confusion, but I only want to parse the above data. Assume that that's the only type of markup that I will encounter. I'm not necessarily looking forward to parsing wiki markup in general, unless there is already a pre-existing library which does this. Thanks again!

This code assumes you are using RegexKitLite:
NSString *data = @"* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].\n\
* [[710]] – [[Saracen]] invasion of [[Sardinia]].\n\
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].\n\
*[[1275]] – Traditional founding of the city of [[Amsterdam]].";
NSString *captureRegex = @"(?i)(?:\\* *\\[\\[)([0-9]*)(?:\\]\\] \\– )(.*)";
NSRange captureRange;
NSRange stringRange;
stringRange.location = 0;
stringRange.length = data.length;
do
{
    captureRange = [data rangeOfRegex:captureRegex inRange:stringRange];
    if ( captureRange.location != NSNotFound )
    {
        NSString *year = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:1 error:NULL];
        NSString *textStuff = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:2 error:NULL];
        stringRange.location = captureRange.location + captureRange.length;
        stringRange.length = data.length - stringRange.location;
        NSLog(@"Year:%@, Stuff:%@", year, textStuff);
    }
}
while ( captureRange.location != NSNotFound );
Note that you really need to study regular expressions to build these well, but here's what the one I have is saying:
(?i)
Ignore case. I could have left that out since I'm not matching letters.
(?:\* *\[\[)
?: means don't capture this group. I escape * to match it literally, then there are zero or more spaces (" *"), then I escape two opening brackets (since brackets are also special characters in a regex).
([0-9]*)
Capture a run of digits (the year).
(?:\]\] \– )
Here's where we ignore stuff again, basically matching "]] – ". Note that for any "\" in the regex I have to add another one in the Objective-C string above, since "\" is a special character in a string. And yes, that means a regex-escaped single "\" ends up as "\\" in an Obj-C string.
(.*)
Just grab everything else. By default "." does not match a newline, so the regex engine stops matching at the end of the line, which is why this doesn't swallow the rest of the data. You'll have to add code to strip out the [[LINK]] stuff from the text.
The NSRange variables are used to keep matching through the data without re-matching the text that has already been matched.
Don't forget that after you add the RegexKitLite class files, you also need to add the special linker flag or you'll get lots of link errors (the RegexKitLite site has installation instructions).
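For comparison, the whole task, including the link stripping that the answer leaves as an exercise, can be sketched in a few lines of Python. This is an illustration of the approach, not a port of the RegexKitLite code: the function name `parse_events` is hypothetical, it assumes every entry has the form `*[[YEAR]] – TEXT` with an en dash, and a repeated year would overwrite the earlier entry.

```python
import re

# "* [[YEAR]] – TEXT": group 1 is the year (may be "530 BC"), group 2 the excerpt.
LINE = re.compile(r"^\*\s*\[\[([^\]]+)\]\]\s*–\s*(.*)$")
# [[Target|Alias]] or [[Target]]: group 1 is the text a reader would see.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def parse_events(wikitext):
    """Map each year to its excerpt with the wiki link markup removed."""
    events = {}
    for line in wikitext.splitlines():
        match = LINE.match(line.strip())
        if match:  # headings such as ==Events== simply do not match
            year, text = match.groups()
            events[year] = LINK.sub(r"\1", text)
    return events


data = """==Events==
* [[710]] – [[Saracen]] invasion of [[Sardinia]].
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].
*[[1275]] – Traditional founding of the city of [[Amsterdam]]."""

for year, text in parse_events(data).items():
    print(year, "->", text)
```

Doing the work in two small patterns (one per line, one per link) mirrors the "passes" idea from the question and keeps each expression readable.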

I'm no good with regular expressions, but this sounds like a job for them. I imagine a regex would sort this out for you quite easily.
Have a look at the RegexKitLite library.

If you want to be able to parse Wikitext in general, you have a lot of work to do. Just one complicating factor is templates. How much effort do you want to go to cope with these?
If you're serious about this, you probably should be looking for an existing library which parses Wikitext. A brief look round finds this CPAN library, but I have not used it, so I can't cite it as a personal recommendation.
Alternatively, you might want to take a simpler approach and decide which particular parts of Wikitext you're going to cope with. This might be, for example, links and headings, but not lists. Then you have to focus on each of these and turn the Wikitext into whatever you want that to look like. Yes, regular expressions will help a lot with this bit, so read up on them, and if you have specific problems, come back and ask.
Good luck!
