Regex to get value within tag - objective-c

I have a sample set of XML returned back:
<rsp stat="ok">
<site>
<id>1234</id>
<name>testAddress</name>
<hostname>anotherName</hostname>
...
</site>
<site>
<id>56789</id>
<name>ba</name>
<hostname>alphatest</hostname>
...
</site>
</rsp>
I want to extract everything within <name></name> but not the tags themselves, and to have that only for the first instance (or based on some other test select which item).
Is this possible with regex?

<disclaimer>I don't use Objective-C</disclaimer>
You should be using an XML parser, not regexes. XML is not a regular language, so it cannot be parsed reliably with a regular expression. Don't do it.
Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard and it's unlikely your code will be correct in the sense that it will properly parse all well-formed XML input, and even if it does, you're wasting your time because (as just mentioned) every language in common usage has XML support. It is unprofessional to use regular expressions to parse XML.
You could use Expat, which has Objective-C bindings.
Apple's options are:
The CF xml parser
The tree-based Cocoa parser (Mac OS X 10.4 and later)

Without knowing your language or environment, here are some Perl expressions. Hopefully they will give you the right idea for your application.
Your regular expression to capture the text content of a tag would look something like this:
m/>([^<]*)</
This will capture the content in each tag. You will have to loop on the match to extract all content. Note that this does not account for self-terminated tags. You would need a regex engine with negative lookbehinds to accomplish that. Without knowing your environment, it's hard to say if it would be supported.
You could also just strip all tags from your source using something like:
s/<[^>]*>//g
Also depending on your environment, if you can use an XML-parsing library, it will make your life much easier. After all, by taking the regex approach, you lose everything that XML really offers you (structured data, context awareness, etc).

The best tool for this kind of task is XPath.
NSURL *rspURL = [NSURL fileURLWithPath:[@"~/rsp.xml" stringByExpandingTildeInPath]];
NSXMLDocument *document = [[[NSXMLDocument alloc] initWithContentsOfURL:rspURL options:NSXMLNodeOptionsNone error:NULL] autorelease];
NSArray *nodes = [document nodesForXPath:@"/rsp/site[1]/name" error:NULL];
NSString *name = [nodes count] > 0 ? [[nodes objectAtIndex:0] stringValue] : nil;
If you want the name of the site which has id 56789, use this XPath: /rsp/site[id='56789']/name instead. I suggest you read W3Schools XPath tutorial for a quick overview of the XPath syntax.

As others say, you should really be using NSXMLParser for this sort of thing.
HOWEVER, if you only need to extract the stuff in the name tags, then RegexKitLite can do it quite easily:
NSString * xmlString = ...;
NSArray * captures = [xmlString arrayOfCaptureComponentsMatchedByRegex:@"<name>(.*?)</name>"];
for (NSArray * captureGroup in captures) {
    // Index 0 is the whole match; index 1 is the text between the tags.
    NSLog(@"Name: %@", [captureGroup objectAtIndex:1]);
}
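If you would rather not pull in a third-party library, roughly the same thing can be sketched with Foundation's NSRegularExpression. This is only a sketch: the helper name FirstNameTagValue is made up for illustration, and it assumes the tags are plain <name> with no prefix, attributes, or CDATA, which is exactly where regex-based extraction falls down.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text of the first <name>...</name> pair,
// or nil when nothing matches. Assumes plain, un-prefixed <name> tags.
static NSString *FirstNameTagValue(NSString *xml) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<name>(.*?)</name>"
                                                  options:NSRegularExpressionDotMatchesLineSeparators
                                                    error:&error];
    if (regex == nil) {
        return nil; // the pattern failed to compile
    }
    NSTextCheckingResult *match =
        [regex firstMatchInString:xml options:0 range:NSMakeRange(0, [xml length])];
    if (match == nil) {
        return nil;
    }
    // Capture group 1 is the text between the tags.
    return [xml substringWithRange:[match rangeAtIndex:1]];
}
```

To pick a site by some other test, you would loop over matchesInString:options:range: instead of taking the first match, at which point an XML parser is usually less work.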

Careful about namespaces:
<prefix:name xmlns:prefix="">testAddress</prefix:name>
is equivalent XML that will break regexp based code. For XML, use an XML parser. XPath is your friend for things like this. The XPath code below will return a sequence of strings with the info you want:
./rsp/site/name/text()
Cocoa has NSXML support for XPath.

Related

Generating large NSString xml

I have a plist and I want to convert it to XML. The XML itself is going to be around 1.2 MB in size. What is the best way to generate this XML? Simply with an NSMutableString? I am just worried about performance, and whether there is a better way to generate XML.
Thanks
For those wondering, what I have right now is something like this:
NSString *xml = [NSString stringWithFormat:@"<Sheet>%@</Sheet>", [self getSheetXMLString]];
and then, in the getSheetXMLString method, I have more methods like the one above, which drill down until the plist is fully traversed.
Thanks again.
What do you plan to do with the XML? If you are going to send it over a network or write it to a file, then instead of creating an NSString you could write straight out to the network or file. If you plan to manipulate the XML, you may want to consider libxml2, which is a C library included in iOS.
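As a rough sketch of the "write straight out" idea, the following streams each fragment to an NSOutputStream as it is produced, so the whole document never has to sit in one string. The function names and the <Cell> element are invented for illustration; only <Sheet> comes from the question.

```objective-c
#import <Foundation/Foundation.h>

// Escapes the characters that are significant in XML text content.
static NSString *EscapeXMLText(NSString *text) {
    NSString *escaped = [text stringByReplacingOccurrencesOfString:@"&" withString:@"&amp;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@"<" withString:@"&lt;"];
    return [escaped stringByReplacingOccurrencesOfString:@">" withString:@"&gt;"];
}

// Writes one UTF-8 encoded fragment, looping until every byte is accepted.
static BOOL WriteString(NSOutputStream *stream, NSString *string) {
    NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
    const uint8_t *bytes = (const uint8_t *)[data bytes];
    NSUInteger remaining = [data length];
    while (remaining > 0) {
        NSInteger written = [stream write:bytes maxLength:remaining];
        if (written <= 0) {
            return NO; // stream error or stream closed
        }
        bytes += written;
        remaining -= (NSUInteger)written;
    }
    return YES;
}

// Hypothetical writer: emits <Sheet> with one <Cell> per string (an open stream is assumed).
static BOOL WriteSheet(NSOutputStream *stream, NSArray *cells) {
    if (!WriteString(stream, @"<Sheet>")) return NO;
    for (NSString *cell in cells) {
        NSString *element = [NSString stringWithFormat:@"<Cell>%@</Cell>", EscapeXMLText(cell)];
        if (!WriteString(stream, element)) return NO;
    }
    return WriteString(stream, @"</Sheet>");
}
```

The same functions work with a stream created by outputStreamToFileAtPath:append: or with a socket stream; only the stream you pass in changes.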

Manipulating HTML

I need to read a HTML file and search for some tags in it. Based on the results, some tags would need to be removed, other ones changed and maybe refining some attributes — to then write the file back.
Is NSXMLDocument the way to go? I don't think that a parser is really needed in this case, it could even mean more work. And I don't want to touch the entire file, all I need to do is to load the file in memory, change some things, and save it again.
Note that, I'll be dealing with HTML, and not XHTML. Could that be a problem for NSXMLDocument? Maybe some unmatched tags or un-closed ones could make it stop working.
NSXMLDocument is the way to go. That way you can use Xpath/Xquery to find the tags you want. Bad HTML might be a problem but you can set NSXMLDocumentTidyHTML and it should be OK unless it's really bad.
NSRange startRange = [string rangeOfString:@"<htmlTag>"];
NSRange endRange = [string rangeOfString:@"</htmlTag>"];
NSString *subStr = [string substringWithRange:NSMakeRange(startRange.location+startRange.length, endRange.location-startRange.location-startRange.length)];
// replacement is a placeholder for whatever new content you want between the tags
NSString *finalStr = [string stringByReplacingOccurrencesOfString:subStr withString:replacement];
and then write finalStr to the file.
This is what I would do. Note that I don't know exactly what the advantages of using NSXMLDocument would be; this should do it perfectly.
NSXMLDocument will possibly fail, because HTML pages are often not well formed, but you can try NSXMLDocumentTidyHTML/NSXMLDocumentTidyXML (you can use them both to improve results) as outlined here, and also have a look at this for an approach to modifying the HTML.

How do I find non-length-specified substrings in a string in Objective-C?

I'm trying, for the first time in my life, to contribute to open source software. Therefore I'm trying to help out on this ticket, as it seems to be a good "beginner ticket".
I have successfully got the string from the Twitter API: however, it's in this format:
<a href="http://twitter.com">Tweetie for Mac</a>
What I want to extract from this string is the URL (http://twitter.com) and the name of the Twitter client (Tweetie for Mac). How can I do this in Objective-C? As the URLs aren't the same I can't search for a specified index, and the same applies for the client name.
Assuming you have the HTML link already and aren't parsing an entire HTML page.
//Your HTML Link
NSString *link = [urlstring text];
//Length of HTML href Link
int length = [link length];
//Range of the first quote
NSRange firstQuote = [link rangeOfString:@"\""];
//Subrange to search for another quote in the HTML href link
NSRange nextQuote = NSMakeRange(firstQuote.location+1, length-firstQuote.location-1);
//Range of the second quote after the first
NSRange secondQuote = [link rangeOfString:@"\"" options:NSCaseInsensitiveSearch range:nextQuote];
//Extracts the http://twitter.com
NSRange urlRange = NSMakeRange(firstQuote.location+1, (secondQuote.location-1) - (firstQuote.location));
NSString *url = [link substringWithRange:urlRange];
//Gets the > right before Tweetie for Mac
NSRange firstCaret = [link rangeOfString:@">"];
//This appears at the start of the href link, we want the next one
NSRange firstClosedCaret = [link rangeOfString:@"<"];
NSRange nextClosedCaret = NSMakeRange(firstClosedCaret.location+1, length-firstClosedCaret.location-1);
//Gets the < right after Tweetie for Mac
NSRange secondClosedCaret = [link rangeOfString:@"<" options:NSCaseInsensitiveSearch range:nextClosedCaret];
//Range of the twitter client
NSRange rangeOfTwitterClient = NSMakeRange(firstCaret.location+1, (secondClosedCaret.location-1)-(firstCaret.location));
NSString *twitterClient = [link substringWithRange:rangeOfTwitterClient];
You know that this portion of the string will be the same:
...
So what you really want is to search for the first " and for the closing > of the opening a tag.
The easiest way to do this would be to find what is in the quotes (see this link for how to search NSStrings) and then get the text after the second to last > for your actual name.
You could also use an NSXMLParser as that works on XML specifically, but that may be overkill for this case.
I haven't looked at Adium source but you should check if there are any categories available that extend e.g. NSString with methods for parsing html/xml to more usable structures, like a node tree for example. Then you could simply walk the tree and search for the required attributes.
If not, you may either parse it yourself by dividing the string into tokens (tag open, tag close, tag attributes, quoted strings and so on), then look for the required attributes. Alternatively you could even use a regular expression if the strings always consist of a single html anchor element.
I know it's been discussed many times that regular expressions simply don't work for HTML parsing, but this is a specific scenario where it's actually reasonable, and better than running a full-blown, generic HTML/XML parser. That would be, as slycrel said, overkill.
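For the single-anchor case, a regular expression along these lines might be enough. It is a sketch only: the helper name PartsOfAnchor and the dictionary keys are made up, and it assumes the string is exactly one a element whose href is in double quotes.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: splits a single anchor such as
// <a href="http://twitter.com">Tweetie for Mac</a> into its URL and its text.
// Returns nil when the input does not look like one double-quoted anchor.
static NSDictionary *PartsOfAnchor(NSString *anchor) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:
             @"<a\\s[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a>"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:anchor options:0 range:NSMakeRange(0, [anchor length])];
    if (match == nil) {
        return nil; // e.g. a client reported as plain text rather than as a link
    }
    // Group 1 is the href value, group 2 is the text between the tags.
    return @{ @"url"  : [anchor substringWithRange:[match rangeAtIndex:1]],
              @"name" : [anchor substringWithRange:[match rangeAtIndex:2]] };
}
```

Returning nil for anything that is not a lone anchor lets the caller fall back to showing the raw string.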

next line character a huge influence on xmlparser?

I have a question about a basic XML file I'm parsing, and about what happens when I simply add newlines (Enter) to it.
I'll try to explain my problem with the following example.
I'm (still) building an XML tree, and all it has to do (this is a test tree) is put the summary in an item list. I then export it to a plist so I can see whether everything is done correctly.
A method that does this is in the parser which looks like this
if([elementName isEqualToString:@"Book"]) {
[appDelegate.books addObject:aBook];
[aBook release];
aBook = nil;
}
else
{
[aBook setValue:currentElementValue forKey:elementName];
NSString *directions = [NSString stringWithFormat:currentElementValue];
[directionTree = setObject:directions forKey:@"directions"];
}
[currentElementValue release];
currentElementValue = nil;
}
The export to the plist file happens at the end tag of Books.
Below is the first xmlfile
<?xml version="1.0" encoding="UTF-8"?>
<Books><Book id="1"><summary>Ero adn the ancient quest to measure the globe.</summary></Book><Book id="2"><summary>how the scientific revolution began.</summary></Book></Books>
This is my output
http://img139.imageshack.us/img139/9175/picture6rtn.png
If I make some adjustments like here
<?xml version="1.0" encoding="UTF-8"?>
<Books><Book id="1">
<summary>Ero adn the ancient quest to measure the globe.</summary>
</Book>
<Book id="2">
<summary>how the scientific revolution began.</summary>
</Book>
</Books>
My directions key with type string remains empty...
http://img248.imageshack.us/img248/5838/picture7y.png
I never knew that if I just put in an enter it would have such an influence.
Does anyone know a solution to this since my real xml file looks like this.
ps. the funny thing is I can actually see ( when debugging)my directions string (NSString directions ) fill up with the currentElementValue in both cases.
Instrument your code; specifically, just above the line that reads...
[directionTree setObject:directions forKey:@"directions"];
... (I removed a stray =) try adding ...
NSLog(@"setting directions to '%@'", directions);
I bet you'll see the above is logged multiple times per element. Specifically, the newline between the </summary> and the </Book> tag is, in and of itself, a text node, just like the text in the <summary></summary> tag is a text node.
Now, you could continue down the path of trying to special case for this that and the other, but that would be wrong.
You need to parse the XML as a structured document -- as a tree of nodes. Specifically, you should be looking for the <summary> tag somewhere and then grabbing the node that hangs below it (that should be of, IIRC, the TEXT type in XML parlance -- been a while).
Or, better yet, use one of the XML parsing APIs on the system. NSXMLDocument comes immediately to mind. If working on the iPhone (which this question didn't indicate), you'll need to use NSXMLParser and not NSXMLDocument, as NSXMLDocument is not available there.
Or, even better, since this looks like pretty straightforward XML encapsulation of a regular data schema, use CoreData. CoreData is ideal for storing this kind of information. If your XML is intended to be an interchange format, you won't want to use CoreData as the XML it produces is entirely of its own design.
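If you stay with NSXMLParser, one way to stop the stray newlines from clobbering the value is to accumulate text only while inside <summary> and store it when that element ends. The sketch below assumes ARC; the class name SummaryCollector and the function SummariesInXML are invented for illustration and are not from the question's code.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical delegate: collects the text of every <summary> element and
// ignores whitespace-only text that appears between other tags.
@interface SummaryCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableArray *summaries;
@property (nonatomic, strong) NSMutableString *currentText;
@end

@implementation SummaryCollector

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attributeDict {
    if ([elementName isEqualToString:@"summary"]) {
        self.currentText = [NSMutableString string]; // start a fresh buffer
    }
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // currentText is nil outside <summary>, so newlines between tags are
    // dropped here (a message to nil does nothing).
    [self.currentText appendString:string];
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    if ([elementName isEqualToString:@"summary"]) {
        NSString *trimmed = [self.currentText stringByTrimmingCharactersInSet:
                                [NSCharacterSet whitespaceAndNewlineCharacterSet]];
        [self.summaries addObject:trimmed];
        self.currentText = nil; // stop collecting until the next <summary>
    }
}

@end

// Parses the data synchronously and returns the summaries, or nil on a parse error.
static NSArray *SummariesInXML(NSData *xmlData) {
    SummaryCollector *collector = [[SummaryCollector alloc] init];
    collector.summaries = [NSMutableArray array];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:xmlData];
    parser.delegate = collector;
    return [parser parse] ? collector.summaries : nil;
}
```

The point of the design is that the value is stored in didEndElement for the one element you care about, not for every end tag, so the whitespace reported after </summary> never overwrites it.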

Getting the value of an Element in Cocoa using TouchXML

I am using TouchXml because of the given limitations of NSXML on the actual iPhone. Anyway, I'm just starting out with Objective-C, I come from a C# background, and felt like learning something new..anyhow.. here is my xml file...
<?xml version="1.0" encoding="utf-8"?>
<FundInfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://tempuri.org/webservices">
<FundsReleaseDate>2009-02-11T00:00:00</FundsReleaseDate>
<FundValue>7800</FundValue>
<FundShares>1000</FundShares>
</FundInfo>
I'm trying to get 7800, i.e FundValue. Can someone point me in the correct direction, I am using the TouchXml CXMLDocument and myParser is of type CXMLDocument, I have tried
NSArray *nodes = [myParser nodesForXPath:@"//FundInfo" error:&err];
Basically nodes evaluates to nil
if ([nodes count] > 0 ) {
amount = [nodes objectAtIndex:1];
}
UPDATE 1: I have abandoned parsing using XPath, and have replaced it with NSURLRequest, so now I have the entire XML in a string, but my rude introduction to objective-C continues...I just realized how spoiled I have become to the .NET BCL where things like Regex are so easily available.
UPDATE2: I've figured out how to use RegexKitLite for regex.
So the question now is, how do I get the "7800" from inside FundValue which is now one big string. I have confirmed I have it in my NSString by writing it using NSLog.
The problem is that your XML has a namespace, which means your XPath does not actually match. Unfortunately there is no default mapping, and XPath does not actually define a good way to handle this internally, so you can't fix it simply by changing the XPath. Instead you need to inform the interpreter how you want to map it in your XPath.
It looks like TouchXML implements support for this via:
- (NSArray *)nodesForXPath:(NSString *)xpath namespaceMappings:(NSDictionary *)inNamespaceMappings error:(NSError **)error;
So you can try something like:
NSDictionary *mappings = [NSDictionary dictionaryWithObject:@"http://tempuri.org/webservices" forKey:@"tempuri"];
[myParser nodesForXPath:@"//tempuri:FundInfo" namespaceMappings:mappings error:&err];
I had a similar problem. ANY selection on an XML that has a namespace specified (xmlns="http://somenamespace") will result in no nodes found. Very strange considering it does support the additional namespace formats such as xmlns:somenamespace="http://somenamespace".
In any case, by far the easiest thing to do is a string replace: replace xmlns="http://tempuri.org/webservices" with an empty string.
Overall I like touchxml, but I can't believe this bug still exists.
Try
NSArray *nodes = [myParser nodesForXPath:@"/FundInfo/*" error:&err];