Manipulating HTML - objective-c

I need to read an HTML file and search for some tags in it. Based on the results, some tags need to be removed, others changed, and some attributes refined, and then the file has to be written back.
Is NSXMLDocument the way to go? I don't think that a parser is really needed in this case, it could even mean more work. And I don't want to touch the entire file, all I need to do is to load the file in memory, change some things, and save it again.
Note that I'll be dealing with HTML, not XHTML. Could that be a problem for NSXMLDocument? Maybe some unmatched or unclosed tags could make it stop working.

NSXMLDocument is the way to go. That way you can use XPath/XQuery to find the tags you want. Bad HTML might be a problem, but you can set NSXMLDocumentTidyHTML and it should be OK unless the markup is really bad.
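A minimal sketch of that approach, assuming macOS (NSXMLDocument is not available on iOS) and ARC. The helper name HTMLByRemovingNodes and the sample XPath in the usage note are hypothetical, not part of any API:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: parse (possibly sloppy) HTML, remove every node matched
// by the XPath expression, and return the serialized document.
// Returns nil if the markup cannot be parsed even after tidying.
static NSString *HTMLByRemovingNodes(NSString *html, NSString *xpath) {
    NSError *error = nil;
    NSXMLDocument *document =
        [[NSXMLDocument alloc] initWithXMLString:html
                                         options:NSXMLDocumentTidyHTML
                                           error:&error];
    if (document == nil) {
        return nil;
    }

    // nodesForXPath: returns a separate array, so detaching while iterating is safe.
    NSArray *matches = [document nodesForXPath:xpath error:&error];
    for (NSXMLNode *node in matches) {
        [node detach];
    }
    return [document XMLString];
}
```

For example, HTMLByRemovingNodes(html, @"//span[@class='ad']") would drop every span with that class. Note that tidying rewrites the whole document as XHTML, so the output will not be byte-identical to the input outside the edited tags.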

NSRange startRange = [string rangeOfString:@"<htmlTag>"];
NSRange endRange = [string rangeOfString:@"</htmlTag>"];
// The text between the opening and the closing tag.
NSString *subStr = [string substringWithRange:NSMakeRange(NSMaxRange(startRange), endRange.location - NSMaxRange(startRange))];
// Replace it with whatever you need; an empty string removes it.
NSString *finalStr = [string stringByReplacingOccurrencesOfString:subStr withString:@""];
Then write finalStr to the file.
This is what I would do. Note that I don't know exactly what the advantages of using NSXMLDocument would be; this should do it perfectly.

NSXMLDocument may fail because HTML pages are often not well formed, but you can try NSXMLDocumentTidyHTML/NSXMLDocumentTidyXML (you can use them both to improve results) as outlined here, and also have a look at this for an approach to modifying the HTML.

Related

Removing non-ascii characters from NSData?

First off, I'm not exactly sure what is happening or if I fully understand it enough to describe the issue so I'll try my best.
I'm encoding an NSData object that contains JSON, and one of the objects contains a degree symbol. We believe this is what is causing the issue and would like to remove it before encoding, since the problem occurs during encoding.
I have found plenty of options for removing certain characters from strings, but none for doing it on the NSData object itself. I'm wondering if this is even possible, or if it's an issue with how I'm already encoding it.
This is how the NSData object is being decoded and turned back into an NSData object so that it can be serialized to JSON. Right now I'm not trying to remove the degree symbol. I'm using Latin-1 because of another character I wanted to use but no longer need. This probably isn't the best way to do it, but it works for the majority of the data objects that pass through it, just not this one, so it needs to change.
NSString* stringISOLatin1 = [NSString stringWithCString:data.bytes encoding:NSISOLatin1StringEncoding];
NSData* dataUTF8 = [stringISOLatin1 dataUsingEncoding:NSUTF8StringEncoding allowLossyConversion:NO];
The results are a little weird, most of the time it works fine, even including the degree symbol in the text when displayed on screen. Other times after encoding the string comes back messed up at the end which makes it unable to be serialized.
Any help would be appreciated, even if it just leads to a better explanation of what is happening. Thanks
The problem is likely that you are using +[NSString stringWithCString:encoding:] to convert your data object. This method requires the bytes to be NULL terminated. NSData objects do not have to be NULL terminated because they have an explicit length. If the NULL character is missing, the method keeps reading whatever happens to follow the buffer in memory, giving you either garbage at the end of the string or possibly a crash from a memory access violation.
Instead try using this:
NSString *stringISOLatin1 = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
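As a small sketch of the difference, assuming ARC, the hypothetical helper below (UTF8DataFromLatin1Data is not an existing API) decodes using the length stored in the NSData object, so it works even when the buffer has no trailing NULL byte:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decode Latin-1 bytes using the explicit length carried
// by NSData, then re-encode the resulting string as UTF-8.
static NSData *UTF8DataFromLatin1Data(NSData *latin1Data) {
    NSString *decoded = [[NSString alloc] initWithData:latin1Data
                                              encoding:NSISOLatin1StringEncoding];
    return [decoded dataUsingEncoding:NSUTF8StringEncoding];
}
```

A degree symbol is the single byte 0xB0 in Latin-1 and the two bytes 0xC2 0xB0 in UTF-8, so the output is expected to be one byte longer per degree symbol, with nothing appended after the end of the original data.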

Performance of sorting NSURLs with localizedStandardCompare

I need to sort a NSMutableArray containing NSURLs with localizedStandardCompare:
[array sortUsingComparator:^NSComparisonResult(id obj1, id obj2) {
NSString *f1 = [(NSURL *)obj1 absoluteString];
NSString *f2 = [(NSURL *)obj2 absoluteString];
return [f1 localizedStandardCompare:f2];
}];
This works fine, but I worry a bit about the performance: the block will be evaluated on the order of n log n times during the sort, so I'd like it to be fast (the array might have up to 100,000 elements). Since localizedStandardCompare: is only available on NSString, I need to convert the URLs to strings. Above, I use absoluteString, but there are other methods that return an NSString, for example relativeString. Reading the NSURL class reference, I get the impression that relativeString might be faster, since the URL does not need to be resolved, but this is my first time with Cocoa and OS X, so that is just a wild guess.
Additional constraint: in this case, all URLs come from a NSDirectoryEnumerator on local storage, so all are file URLs. It would be a bonus if the method would work for all kinds of URL, though.
My question: which method should I use to convert NSURL to NSString for best performance?
Profiling all possible methods might be possible, but I have only one (rather fast) OS X machine, and who knows, one day the code might end up on iOS.
I'm using Xcode 4.5.2 on OS X 10.8.2, but the program should work on older versions, too (within reasonable bounds).
You may need to use Carbon's FSCatalogSearch, which is faster than NSDirectoryEnumerator. As for getting the path, I see no choice.
The only thing you may consider for speeding up the sorting is that the paths are partially sorted, because the file system will return all the files of the same folder in alphabetical order.
So you may want to take all the paths of the same directory as one already-sorted run and merge them with the other results.
For example the home contents may be:
ab1.txt
bb.txt
c.txt
The documents directory may contain:
adf.txt
fgh.txt
So you just merge them with a customized algorithm, which just applies the merge part of a mergesort.
I benchmarked the sort. It turned out that absoluteString and relativeString are much faster than path or relativePath.
Sorting about 26000 entries:
relativeString 550ms
absoluteString 580ms
path 920ms
relativePath 960ms
field access 480ms
For field access, I put the value of absoluteString into a field prior to the sort and access that. So, the ...String accessors are almost as fast as field access, and thus a good choice for my use case.
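The "field access" variant can be sketched roughly as follows, assuming ARC and that every URL returns a non-nil absoluteString (true for the file URLs described above). SortedURLsUsingCachedStrings is a hypothetical name, and the pair-array layout is just one way to hold the cached key:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: compute each URL's string once, sort on the cached
// strings, then return the URLs in sorted order.
static NSArray *SortedURLsUsingCachedStrings(NSArray *urls) {
    NSMutableArray *pairs = [NSMutableArray arrayWithCapacity:[urls count]];
    for (NSURL *url in urls) {
        // pair[0] is the cached sort key, pair[1] is the original URL.
        [pairs addObject:@[[url absoluteString], url]];
    }

    [pairs sortUsingComparator:^NSComparisonResult(id obj1, id obj2) {
        NSString *key1 = [(NSArray *)obj1 objectAtIndex:0];
        NSString *key2 = [(NSArray *)obj2 objectAtIndex:0];
        return [key1 localizedStandardCompare:key2];
    }];

    NSMutableArray *sorted = [NSMutableArray arrayWithCapacity:[pairs count]];
    for (NSArray *pair in pairs) {
        [sorted addObject:[pair objectAtIndex:1]];
    }
    return sorted;
}
```

Given the numbers above, the gain over calling absoluteString inside the comparator is modest, so this is only worth the extra memory if profiling on the target device shows the accessor to be a hot spot.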

What is the most efficient way to compare an NSString in this way

I have an app (Cocoa Touch, Web Browser), however I need to be able to compare an NSString with thousands of other strings. Here's the deal.
When a WebView loads, I get the URL. I need to compare this URL with literally thousands of results (27,847). Each of those numbers represents a line of text in a plain text file.
I would like to know the best way to go about getting the data from the text file, and comparing it with the NSString. I need to know if the URL that the WebView is loading contains any of these strings.
The app needs to be very fast, so I can't just parse through every line in the text file, turn it into an array, and then compare each and every result.
Please share your ideas. Thanks.
I think the cleanest solution is to:
Create a web service that can offload the work to a server and return a response. Since it sounds like you're building a web protection service, your database may grow to be quite substantial over time, and you can just scale your server up to increase its speed. Furthermore, you don't want to have to update your app every time the lookup data changes.
Other options are:
Use a local SQLite database. SQL databases should perform lookups relatively fast.
If you don't want to use any database, have you tried putting all the search strings into an NSDictionary or NSMutableDictionary object? This way, you would just check if the valueForKey: for the string you're searching for is nil.
Sample code for this:
NSDictionary *searchDictionary = [NSDictionary dictionaryWithObjectsAndKeys:
    [NSNumber numberWithBool:YES], @"google.com",
    [NSNumber numberWithBool:YES], @"yahoo.com",
    [NSNumber numberWithBool:YES], @"bing.com",
    nil];
NSString *searchString = @"bing.com";
if ([searchDictionary valueForKey:searchString]) {
    // search string found
} else {
    // search string not found
}
Note: if you want the NSDictionary to perform case-insensitive comparisons, pre-load all values lowercase, and make the search string lowercase when using valueForKey:.
How much memory this could take is a whole other story, but I don't see how this comparison could be made much faster locally. I strongly recommend the remote web service approach, though.
Create a string from the file and enumerate through the lines.
NSString *stringToCheck; // set this to the string you are looking for
NSData *bytesOfFile = [NSData dataWithContentsOfFile:@"/path/myfile.txt"];
NSString *fileString = [[NSString alloc] initWithData:bytesOfFile
                                             encoding:NSUTF8StringEncoding];
__block BOOL foundMatch = NO;
[fileString enumerateLinesUsingBlock:^(NSString *line, BOOL *stop) {
    if ([stringToCheck isEqualToString:line]) {
        *stop = YES;
        foundMatch = YES;
    }
}];
This is a job for regular expressions. Take all of the substrings you're looking for/filtering against, escape them appropriately (escaping characters such as [, ], |, and \, among others, with \), and join them with a |. The resulting string is your regular expression, which you apply to each URL.
You could loop through an entire array full of substrings, doing rangeOfString:options: with each one, but that's the slow way. A good regular expression implementation is built for this sort of thing, and I would hope that Apple's implementation is suitable.
That said, profile the hell out of it. I've seen some regex implementations choke on the | operator, so you'll want to make sure that Apple's is not one of them.
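A rough sketch of the regular-expression approach using NSRegularExpression, assuming ARC and a non-empty list of substrings. The function names are hypothetical, and the case-insensitive option is an assumption rather than something the question asked for:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: escape every substring so it is matched literally,
// join the results with "|", and compile the alternation once.
// Assumes the array is non-empty.
static NSRegularExpression *RegexMatchingAnySubstring(NSArray *substrings) {
    NSMutableArray *escaped = [NSMutableArray arrayWithCapacity:[substrings count]];
    for (NSString *substring in substrings) {
        [escaped addObject:[NSRegularExpression escapedPatternForString:substring]];
    }
    NSString *pattern = [escaped componentsJoinedByString:@"|"];
    return [NSRegularExpression regularExpressionWithPattern:pattern
                                                     options:NSRegularExpressionCaseInsensitive
                                                       error:NULL];
}

// Returns YES if the URL string contains at least one of the substrings.
static BOOL URLStringContainsAny(NSString *urlString, NSRegularExpression *regex) {
    NSRange wholeString = NSMakeRange(0, [urlString length]);
    return [regex firstMatchInString:urlString options:0 range:wholeString] != nil;
}
```

The compiled expression should be built once, for example when the text file is loaded, and reused for every page load. Whether an alternation of roughly 28,000 branches performs well is exactly the thing to profile.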
If you need to compare each string in your text file, you are going to have to compare it, no way around it.
What you can do however is do it on a background thread while showing some loading or something, and it won't feel as if the app got stuck.
I would suggest you try with NSDictionary first. You can load up all your URLs into this, and internally it will use some sort of hash table/map for very quick (O(1)) lookup.
You can then check the result of [dictionary objectForKey:userURL], and if it returns something then the URL matched one in the dictionary.
The only problem with this is that it requires an exact string match. If your dictionary contains http://server/foobar and the user enters http://server/FOOBAR (because it's a case-insensitive server), you are going to get a miss on your lookup. Similarly, adding ?foobar queries to the end of URLs will result in a miss. You could also add an explicit port with server:80, and with %XX character encoding you can create hundreds of variations of the same URL. You will have to account for this and canonicalize both the URLs in your dictionary, and the URL entered by the user prior to lookup.
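One possible canonicalization step, sketched with NSURLComponents (which requires iOS 7 or OS X 10.9 and later) and assuming ARC. CanonicalKeyForURLString is a hypothetical name, and the specific rules here (drop query and fragment, lowercase scheme, host and path, drop port 80 for http) are assumptions that only make sense if the servers involved really are case-insensitive:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: reduce a URL string to a canonical dictionary key.
// Returns nil if the string cannot be parsed as a URL.
static NSString *CanonicalKeyForURLString(NSString *urlString) {
    NSURLComponents *components = [NSURLComponents componentsWithString:urlString];
    if (components == nil) {
        return nil;
    }

    // Ignore anything after the path.
    components.query = nil;
    components.fragment = nil;

    // Scheme and host are case-insensitive by specification.
    components.scheme = [components.scheme lowercaseString];
    components.host = [components.host lowercaseString];

    // Assumption: the path is treated as case-insensitive too.
    // The getter decodes %XX escapes and the setter re-encodes them.
    components.path = [components.path lowercaseString];

    // An explicit default port is equivalent to no port at all.
    if ([components.scheme isEqualToString:@"http"] &&
        [components.port isEqualToNumber:@80]) {
        components.port = nil;
    }
    return components.string;
}
```

The same function has to be applied both when loading the entries into the dictionary and to the URL being looked up, otherwise the keys will not line up.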

Generating large NSString xml

I have a plist and I want to convert it to XML. The XML itself is going to be around 1.2 MB in size. What is the best way to generate this XML? Simply with an NSMutableString? I am just worried about performance and whether there is a better way to generate XML.
Thanks
For those wondering, what I have right now is something like this:
NSString *xml = [NSString stringWithFormat:@"<Sheet>%@</Sheet>", [self getSheetXMLString]];
and then, in the getSheetXMLString method, I have more methods like the one above which drill down until the plist is fully traversed.
Thanks again.
What do you plan to do with the XML? If it is to be sent over a network or written to a file, then instead of creating an NSString you could write straight out to the network or file. If you plan to manipulate the XML, you may want to consider libxml2, a C library included in iOS.

Regex to get value within tag

I have a sample set of XML returned back:
<rsp stat="ok">
<site>
<id>1234</id>
<name>testAddress</name>
<hostname>anotherName</hostname>
...
</site>
<site>
<id>56789</id>
<name>ba</name>
<hostname>alphatest</hostname>
...
</site>
</rsp>
I want to extract everything within <name></name> but not the tags themselves, and to have that only for the first instance (or based on some other test select which item).
Is this possible with regex?
<disclaimer>I don't use Objective-C</disclaimer>
You should be using an XML parser, not regexes. XML is not a regular language, hence not easily parsed by a regular expression. Don't do it.
Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard and it's unlikely your code will be correct in the sense that it will properly parse all well-formed XML input, and even if it does, you're wasting your time because (as just mentioned) every language in common usage has XML support. It is unprofessional to use regular expressions to parse XML.
You could use Expat, which has Objective-C bindings.
Apple's options are:
The CF xml parser
The tree based Cocoa parser (10.4 only)
Without knowing your language or environment, here are some Perl expressions. Hopefully they will give you the right idea for your application.
Your regular expression to capture the text content of a tag would look something like this:
m/>([^<]*)</
This will capture the content in each tag. You will have to loop on the match to extract all content. Note that this does not account for self-terminated tags. You would need a regex engine with negative lookbehinds to accomplish that. Without knowing your environment, it's hard to say if it would be supported.
You could also just strip all tags from your source using something like:
s/<[^>]*>//g
Also depending on your environment, if you can use an XML-parsing library, it will make your life much easier. After all, by taking the regex approach, you lose everything that XML really offers you (structured data, context awareness, etc).
The best tool for this kind of task is XPath.
NSURL *rspURL = [NSURL fileURLWithPath:[@"~/rsp.xml" stringByExpandingTildeInPath]];
NSXMLDocument *document = [[[NSXMLDocument alloc] initWithContentsOfURL:rspURL options:NSXMLNodeOptionsNone error:NULL] autorelease];
NSArray *nodes = [document nodesForXPath:@"/rsp/site[1]/name" error:NULL];
NSString *name = [nodes count] > 0 ? [[nodes objectAtIndex:0] stringValue] : nil;
If you want the name of the site which has id 56789, use this XPath: /rsp/site[id='56789']/name instead. I suggest you read W3Schools XPath tutorial for a quick overview of the XPath syntax.
As others say, you should really be using NSXMLParser for this sort of thing.
HOWEVER, if you only need to extract the stuff in the name tags, then RegexKitLite can do it quite easily:
NSString *xmlString = ...;
NSArray *captures = [xmlString arrayOfCaptureComponentsMatchedByRegex:@"<name>(.*?)</name>"];
for (NSArray *captureGroup in captures) {
    // Index 0 is the whole match; index 1 is the text between the tags.
    NSLog(@"Name: %@", [captureGroup objectAtIndex:1]);
}
Careful about namespaces:
<prefix:name xmlns:prefix="">testAddress</prefix:name>
is equivalent XML that will break regex-based code. For XML, use an XML parser. XPath is your friend for things like this. The XPath expression below will return a sequence of strings with the info you want:
./rsp/site/name/text()
Cocoa has NSXML support for XPath.