Parsing an SRT file with Objective-C

Text example:
1
00:00:00,000 --> 00:00:01,000
This is the first line
2
00:00:01,000 --> 00:00:02,000
This is the second line
3
00:00:02,000 --> 00:00:03,000
This is the last line
In JavaScript I would certainly parse this with a regular expression. I'm just wondering whether that is the best way to do it in Objective-C. I'm sure I could figure out some way to do this, but I want to do it the appropriate way.
I only need to know where to start and I'm happy to do the rest, but for understanding's sake, I'm going to end up with something like this (pseudocode):
NSDictionary
index -> [0-9]+
start -> hh:mm:ss,mmm
end -> hh:mm:ss,mmm
text -> one of the lines of text
In this case, I'd be parsing three entries into my dictionary.

Some background: I wrote a small app and created a file called stuff.srt containing your examples that resides in the bundle; hence, my means of accessing it.
This is just a quick and dirty thing, a proof-of-concept. Note that it doesn't check results. Real applications always check their results. As you can see, the work takes place in the -applicationDidFinishLaunching: method (I'm working in Mac OS X, not iOS).
EDIT:
It's been pointed out that the code as originally posted didn't handle multiple text lines correctly. To address this, I take advantage of the fact that SRT files use CRLF as their line breaks, and search for two occurrences of this sequence. I then change all occurrences of CRLF in the text string to spaces, based on what I observed here. This doesn't account for leading or trailing spaces in each line of the text.
I changed the contents of the stuff.srt file to this:
1
00:00:00,000 --> 00:00:01,000
This is the first line
and it has a secondary line
2
00:00:01,000 --> 00:00:02,000
This is the second line
3
00:00:02,000 --> 00:00:03,000
This is the last line
and it has a secondary line too
and the code has been revised as follows (I also put everything into an @autoreleasepool directive; there might be a lot of autoreleased objects generated in the course of parsing the file!):
- (void)applicationDidFinishLaunching:(NSNotification *)aNotification
{
    NSString *path = [[NSBundle mainBundle] pathForResource:@"stuff" ofType:@"srt"];
    NSString *string = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:NULL];
    NSScanner *scanner = [NSScanner scannerWithString:string];
    while (![scanner isAtEnd])
    {
        @autoreleasepool
        {
            NSString *indexString;
            (void) [scanner scanUpToCharactersFromSet:[NSCharacterSet newlineCharacterSet] intoString:&indexString];
            NSString *startString;
            (void) [scanner scanUpToString:@" --> " intoString:&startString];
            // My string constant doesn't begin with spaces because scanners
            // skip spaces and newlines by default.
            (void) [scanner scanString:@"-->" intoString:NULL];
            NSString *endString;
            (void) [scanner scanUpToCharactersFromSet:[NSCharacterSet newlineCharacterSet] intoString:&endString];
            NSString *textString;
            // (void) [scanner scanUpToCharactersFromSet:[NSCharacterSet newlineCharacterSet] intoString:&textString];
            // BEGIN EDIT
            (void) [scanner scanUpToString:@"\r\n\r\n" intoString:&textString];
            textString = [textString stringByReplacingOccurrencesOfString:@"\r\n" withString:@" "];
            // Addresses trailing space added if CRLF is on a line by itself at the end of the SRT file
            textString = [textString stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];
            // END EDIT
            NSDictionary *dictionary = [NSDictionary dictionaryWithObjectsAndKeys:
                                        indexString, @"index",
                                        startString, @"start",
                                        endString  , @"end",
                                        textString , @"text",
                                        nil];
            NSLog(@"%@", dictionary);
        }
    }
}
The revised output looks like this:
2013-02-09 16:10:17.727 SRTFileScan[4846:303] {
end = "00:00:01,000";
index = 1;
start = "00:00:00,000";
text = "This is the first line and it has a secondary line";
}
2013-02-09 16:10:17.729 SRTFileScan[4846:303] {
end = "00:00:02,000";
index = 2;
start = "00:00:01,000";
text = "This is the second line";
}
2013-02-09 16:10:17.730 SRTFileScan[4846:303] {
end = "00:00:03,000";
index = 3;
start = "00:00:02,000";
text = "This is the last line and it has a secondary line too";
}
One other thing I learned from what I've read today: The SRT file format originated in France, and the comma seen in the input is the decimal separator used there.
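Since the code above deliberately skips result checking, here is a minimal sketch of the same scanner sequence wrapped in a function that does check each scan. The function name SRTEntriesFromString and the decision to stop at the first malformed entry are assumptions for illustration, not part of the original answer. Like the code above, it assumes CRLF line breaks.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: parses CRLF-delimited SRT text into an array of
// dictionaries with "index", "start", "end" and "text" keys.
// Parsing stops at the first entry that cannot be scanned completely.
static NSArray<NSDictionary *> *SRTEntriesFromString(NSString *srt)
{
    NSMutableArray<NSDictionary *> *entries = [NSMutableArray array];
    NSScanner *scanner = [NSScanner scannerWithString:srt];
    NSCharacterSet *newlines = [NSCharacterSet newlineCharacterSet];

    while (![scanner isAtEnd]) {
        NSString *index = nil, *start = nil, *end = nil, *text = nil;

        // Every scan method returns NO on failure, so one chained test
        // covers the whole entry.
        BOOL ok = [scanner scanUpToCharactersFromSet:newlines intoString:&index]
               && [scanner scanUpToString:@" --> " intoString:&start]
               && [scanner scanString:@"-->" intoString:NULL]
               && [scanner scanUpToCharactersFromSet:newlines intoString:&end]
               && [scanner scanUpToString:@"\r\n\r\n" intoString:&text];
        if (!ok) {
            break;
        }

        // Join multi-line text with spaces and drop any trailing space.
        text = [text stringByReplacingOccurrencesOfString:@"\r\n" withString:@" "];
        text = [text stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];

        [entries addObject:@{ @"index": index, @"start": start, @"end": end, @"text": text }];
    }
    return entries;
}
```

Breaking out of the loop on the first failed scan also keeps a malformed file from leaving the scanner stuck at one position.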

Apple has sample code that parses subtitle files. See the relevant part here:
https://developer.apple.com/library/mac/samplecode/avsubtitleswriterOSX/Listings/avsubtitleswriter_SubtitlesTextReader_m.html#//apple_ref/doc/uid/DTS40013409-avsubtitleswriter_SubtitlesTextReader_m-DontLinkElementID_5

My suggestion is to use an NSDateFormatter to parse the second line of each entry. I would first split that line into two strings (see componentsSeparatedByString: in the NSString class reference), doing this while reading the file line by line.
So the loop would be:
If the file still contains data, read the next line;
If the line number is a multiple of 4, allocate a new object. This object should be able to hold two dates, one integer and one string;
If the line number is not a multiple of 4, read the line and assign its value to the corresponding field.
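As a rough illustration of the splitting step, here is a sketch that breaks the timing line apart with componentsSeparatedByString:. Note that it converts each timestamp to seconds by hand instead of using NSDateFormatter, which is a swapped-in simplification. Both function names are hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: converts "hh:mm:ss,mmm" to seconds.
// Returns -1 when the string does not split into exactly four fields.
static NSTimeInterval SRTSecondsFromTimestamp(NSString *timestamp)
{
    NSCharacterSet *separators = [NSCharacterSet characterSetWithCharactersInString:@":,"];
    NSArray<NSString *> *parts = [timestamp componentsSeparatedByCharactersInSet:separators];
    if (parts.count != 4) {
        return -1;
    }
    return parts[0].integerValue * 3600
         + parts[1].integerValue * 60
         + parts[2].integerValue
         + parts[3].integerValue / 1000.0;
}

// Hypothetical helper: splits "start --> end" and converts both halves.
// Returns NO if the line does not have the expected shape.
static BOOL SRTParseTimingLine(NSString *line, NSTimeInterval *start, NSTimeInterval *end)
{
    NSArray<NSString *> *halves = [line componentsSeparatedByString:@" --> "];
    if (halves.count != 2) {
        return NO;
    }
    *start = SRTSecondsFromTimestamp(halves[0]);
    *end = SRTSecondsFromTimestamp(halves[1]);
    return *start >= 0 && *end >= 0;
}
```

Storing plain seconds avoids having to pick a reference date, which a date formatter would otherwise need for a value that is really a duration.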

Related

Replacing bad words in a string in Objective-C

I have a game with a public highscore list where I allow players to enter their name (or anything up to 12 characters). I am trying to create a couple of functions to filter out bad words, using a list of bad words I have in a text file. I have two methods:
One to read in the text file:
-(void) getTheBadWordsAndSaveForLater {
    badWordsFilePath = [[NSBundle mainBundle] pathForResource:@"badwords" ofType:@"txt"];
    badWordFile = [[NSString alloc] initWithContentsOfFile:badWordsFilePath encoding:NSUTF8StringEncoding error:nil];
    badwords =[[NSArray alloc] initWithContentsOfFile:badWordFile];
    badwords = [badWordFile componentsSeparatedByString:@"\n"];
    NSLog(@"Number Of Words Found in file: %i",[badwords count]);
    for (NSString* words in badwords) {
        NSLog(@"Word in Array----- %@",words);
    }
}
And one to check a word (NSString*) against the list that I read in:
-(NSString *) removeBadWords :(NSString *) string {
    // If I hard code this line below, it works....
    // *****************************************************************************
    //badwords =[[NSMutableArray alloc] initWithObjects:@"shet",@"shat",@"shut",nil];
    // *****************************************************************************
    NSLog(@"checking: %@",string);
    for (NSString* words in badwords) {
        string = [string stringByReplacingOccurrencesOfString:words withString:@"-" options:NSCaseInsensitiveSearch range:NSMakeRange(0, string.length)];
        NSLog(@"Word in Array: %@",words);
    }
    NSLog(@"Cleaned Word Returned: %@",string);
    return string;
}
The issue I'm having is that when I hardcode the words into an array (see the commented-out line above), it works like a charm. But when I use the array I read in with the first method, it doesn't work: stringByReplacingOccurrencesOfString:words does not seem to have any effect. I have traced out to the log so I can see whether the words are coming through, and they are. That one line just doesn't seem to see the words unless I hardcode them into the array.
Any suggestions?
A couple of thoughts:
You have two lines:
badwords =[[NSArray alloc] initWithContentsOfFile:badWordFile];
badwords = [badWordFile componentsSeparatedByString:@"\n"];
There's no point in calling initWithContentsOfFile if you're just going to replace its result with componentsSeparatedByString on the next line. Plus, initWithContentsOfFile assumes the file is a property list (plist), but the rest of your code clearly assumes it's a newline-separated text file. Personally, I would have used the plist format (it obviates the need to trim the whitespace from the individual words), but you can use whichever you prefer. Just use one or the other, not both.
If you're staying with the newline-separated list of bad words, then just get rid of the line that calls initWithContentsOfFile, since you disregard its result anyway. Thus:
- (void)getTheBadWordsAndSaveForLater {
    // these should be local variables, so get rid of your instance variables of the same name
    NSString *badWordsFilePath = [[NSBundle mainBundle] pathForResource:@"badwords" ofType:@"txt"];
    NSString *badWordFile = [[NSString alloc] initWithContentsOfFile:badWordsFilePath encoding:NSUTF8StringEncoding error:nil];
    // calculate `badwords` solely from `componentsSeparatedByString`, not `initWithContentsOfFile`
    badwords = [badWordFile componentsSeparatedByString:@"\n"];
    // confirm what we got (count is an NSUInteger, so cast it for %lu)
    NSLog(@"Found %lu words: %@", (unsigned long)[badwords count], badwords);
}
You might want to look for whole word occurrences only, rather than just the presence of the bad word anywhere:
- (NSString *) removeBadWords:(NSString *) string {
    NSLog(@"checking: %@ for occurrences of these bad words: %@", string, badwords);
    for (NSString* badword in badwords) {
        NSString *searchString = [NSString stringWithFormat:@"\\b%@\\b", badword];
        string = [string stringByReplacingOccurrencesOfString:searchString
                                                   withString:@"-"
                                                      options:NSCaseInsensitiveSearch | NSRegularExpressionSearch
                                                        range:NSMakeRange(0, string.length)];
    }
    NSLog(@"resulted in: %@", string);
    return string;
}
This uses a "regular expression" search, where \b stands for "a boundary between words". Thus, \bhell\b (or, because backslashes have to be escaped in an NSString literal, @"\\bhell\\b") will search for the word "hell" as a separate word, but won't match "hello", for example.
Note, above, I am also logging badwords to see if that variable was reset somehow. That's the only thing that would make sense given the symptoms you describe, namely that the loading of the bad words from the text file works but replace process fails. So examine badwords before you replace and make sure it's still set properly.
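To make the whitespace-trimming point concrete, here is a hedged sketch that cleans the word list while loading it and escapes each word before building the regular expression. Both function names are hypothetical. Splitting on newlineCharacterSet rather than @"\n" is a swapped-in choice so that a file saved with CRLF line breaks does not leave a stray "\r" on every word.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: turns the raw file contents into a clean word list.
static NSArray<NSString *> *BadWordsFromFileContents(NSString *contents)
{
    NSMutableArray<NSString *> *words = [NSMutableArray array];
    NSCharacterSet *newlines = [NSCharacterSet newlineCharacterSet];
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];

    for (NSString *line in [contents componentsSeparatedByCharactersInSet:newlines]) {
        // Strip stray spaces and skip the empty pieces left by blank lines
        // or by the two characters of a CRLF pair.
        NSString *word = [line stringByTrimmingCharactersInSet:whitespace];
        if (word.length > 0) {
            [words addObject:word];
        }
    }
    return words;
}

// Hypothetical helper: replaces whole-word, case-insensitive matches with "-".
static NSString *StringByRemovingBadWords(NSString *string, NSArray<NSString *> *badWords)
{
    for (NSString *word in badWords) {
        // Escape the word so characters such as "." or "*" match literally.
        NSString *pattern = [NSString stringWithFormat:@"\\b%@\\b",
                             [NSRegularExpression escapedPatternForString:word]];
        string = [string stringByReplacingOccurrencesOfString:pattern
                                                   withString:@"-"
                                                      options:NSCaseInsensitiveSearch | NSRegularExpressionSearch
                                                        range:NSMakeRange(0, string.length)];
    }
    return string;
}
```

A word that still carries an invisible trailing character looks correct in the log but never matches, which would fit the symptoms described in the question.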

Yet another NSScanner characterSetWithCharactersInString newb

Let's assume I have a string ("G00 X0.0000 Y0.0000") and I need to parse its contents. Here is my code:
NSCharacterSet *params = [NSCharacterSet characterSetWithCharactersInString:@"XY"];
//setup the scanner
NSScanner *scanner = [NSScanner scannerWithString:stringToBeScanned];
NSString *scanned = nil;
//scan the string
NSLog(@"%@", stringToBeScanned);
while ([scanner scanUpToCharactersFromSet:params intoString:&scanned]) {
    struct keypair code;
    code.key = [scanned characterAtIndex:0];
    code.value = [[scanned substringFromIndex:1] doubleValue];
    NSLog(@"--> %@ [%lu]= (%c, %.4f)", scanned, [scanner scanLocation], code.key, code.value);
}
And the output to NSLog:
G00 X0.0000 Y0.0000
--> G00 [4]= (G, 0.0000)
My characterSet includes both 'X' and 'Y' and I can't figure out why my NSScanner won't scan in the 'X0.0000 ' - it should find that Y and pull in everything from X up to Y according to my understanding.
I can see from the scanLocation that the scanner is stopping at index 4 (correctly), but the loop either doesn't continue or evaluates to false. Shouldn't the scanner keep looping and finding my delimiters (from the characterSet) and grabbing data?
scanUpToCharactersFromSet:intoString: scans up to the "X" and gives you the characters it scanned "G00 ".
Note that it does not scan the "X". When you call the method again, it looks at the next character (the "X"), notices that it is a character in the set, and stops scanning. As it scanned no characters, it then returns NO.
To scan the "X" (or "Y"), you will want to use scanCharactersFromSet:intoString: as well.
I solved this issue. Basically I receive a string with a list of "codes", each followed by a value associated with that command/parameter. There could be several different "commands" in each string, or none at all. The key was to use scanCharactersFromSet: and scanUpToCharactersFromSet: together in order to capture the right pairings and parse the entire string while staying very flexible. It's a little ugly, I know.
Here is my code:
//setup the scanner
NSScanner *scanner = [NSScanner scannerWithString:[self stringByAppendingString:@"!"]];
NSCharacterSet *codeset = [NSCharacterSet characterSetWithCharactersInString:@"GMTFIJKPRSXYZ!"];
NSString *scanned = nil;
char codechar;
//perform the first scan
[scanner scanCharactersFromSet:codeset intoString:&scanned];
if (scanned)
    codechar = [scanned characterAtIndex:0];
//scan the string
while ([scanner scanUpToCharactersFromSet:codeset intoString:&scanned]) {
    struct keypair code;
    code.key = codechar;
    code.value = [scanned doubleValue];
    NSLog(@"--> (%c, %.4f)", code.key, code.value);
    //skip over the delimiter we encountered
    [scanner scanCharactersFromSet:codeset intoString:&scanned];
    if (scanned)
        codechar = [scanned characterAtIndex:0];
}

Creating substrings from text file

I have a text file that contains two lines of numbers. All I want to do is turn each line into a string, then add it to an array (called fields). My problem arises when trying to find the EOF character. I can read from the file with no problem: I turn its content into an NSString, then pass it to this method.
-(void)parseString:(NSString *)inputString{
    NSLog(@"[parseString] *inputString: %@", inputString);
    //the end of the previous line, this is also the start of the next line
    int endOfPreviousLine = 0;
    //count of how many characters we've gone through
    int charCount = 0;
    //while we haven't gone through every character
    while(charCount <= [inputString length]){
        NSLog(@"[parseString] while loop count %i", charCount);
        //if it's an end of line character or end of file
        if([inputString characterAtIndex:charCount] == '\n' || [inputString characterAtIndex:charCount] == '\0'){
            //add a substring into the array
            [fields addObject:[inputString substringWithRange:NSMakeRange(endOfPreviousLine, charCount)]];
            NSLog(@"[parseString] string added into array: %@", [inputString substringWithRange:NSMakeRange(endOfPreviousLine, charCount)]);
            //set the endOfPreviousLine to the current char count, this is where the next string will start from
            endOfPreviousLine = charCount+1;
        }
        charCount++;
    }
    NSLog(@"[parseString] exited while. endOfPrevious: %i, charCount: %i", endOfPreviousLine, charCount);
}
The contents of my text file look like this:
123
456
I can get the first string (123) no problem. The call would be:
[fields addObject:[inputString substringWithRange:NSMakeRange(0, 3)]];
Next, I make the call for the second String:
[fields addObject:[inputString substringWithRange:NSMakeRange(4, 7)]];
But I get an error, and I think it is because my index is out of bounds. Since the index starts from 0, there is no index 7 (well I think its supposed to be the EOF character), and I get an error.
To sum everything up: I don't know how to deal with an index of 7 when there are only 6 characters + the EOF character.
Thanks.
You can use componentsSeparatedByCharactersInSet: to get the effect that you are looking for:
-(NSArray*)parseString:(NSString *)inputString {
return [inputString componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];
}
Short answer is to use [inputString componentsSeparatedByString:@"\n"] and get the array of numbers.
Example:
Use the following code to get the lines in an array
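NSString *path = [[NSBundle bundleForClass:[self class]] pathForResource:@"aaa" ofType:@"txt"];
// initWithContentsOfFile: without an encoding is deprecated, so pass one explicitly
NSString *str = [[NSString alloc] initWithContentsOfFile:path encoding:NSUTF8StringEncoding error:NULL];
NSArray *lines = [str componentsSeparatedByString:@"\n"];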
NSLog(@"str = %@", str);
NSLog(@"lines = %@", lines);
The above code assumes that you have a file called "aaa.txt" in your resources which is a plain text file.
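On the out-of-bounds error itself: NSMakeRange takes a location and a length, not a start index and an end index, and an NSString has no EOF character to look for. A minimal sketch, with a hypothetical helper name, that converts a start/end pair into the range substringWithRange: expects:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the characters from index `start` up to,
// but not including, index `end`.
static NSString *SubstringBetween(NSString *string, NSUInteger start, NSUInteger end)
{
    // NSMakeRange(location, length): the second argument is a length,
    // so it has to be computed from the two indexes.
    return [string substringWithRange:NSMakeRange(start, end - start)];
}
```

For the two-line file in the question, the second line starts at index 4 and has length 3, so the range would be NSMakeRange(4, 3) rather than NSMakeRange(4, 7).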

Split NSString into NSArray by blank lines

I am reading a *.srt subtitle file into an NSString. The content of this string looks like this:
1
00:00:20,000 --> 00:00:24,400
Altocumulus clouds occur between six thousand
2
00:00:24,600 --> 00:00:27,800
and twenty thousand feet above ground level.
I am looking for an elegant solution to split this string into an NSArray in which each element contains the information which is related to one particular subtitle-"frame", e.g. the zeroth element would look like this:
1
00:00:20,000 --> 00:00:24,400
Altocumulus clouds occur between six thousand
Any ideas how to accomplish this task in an elegant manner? I tried splitting the original string using the method
[string componentsSeparatedByString:@"\n\n"];
but this method fails to detect the blank lines.
Thanks for your help!
tobi
If [string componentsSeparatedByString:@"\n\n"] doesn't work, then there are two possibilities:
Your file contains MSDOS-style line breaks, which are \r\n. So try splitting on @"\r\n\r\n".
Your supposedly blank lines contain spaces or tabs. You can check this from the shell using cat -e.
I'd suggest using NSScanner instead. It's more flexible and you don't have to worry about whether your line breaks are Windows or Unix style and whether the blank lines contain any spaces. Here's an example:
NSMutableArray *lines = [NSMutableArray array];
NSString *s = @"foo\n\nbar\r\n \t \r\nbaz"; //intentionally mixed line breaks
NSScanner *scanner = [NSScanner scannerWithString:s];
while (![scanner isAtEnd]) {
    [scanner scanCharactersFromSet:[NSCharacterSet newlineCharacterSet] intoString:NULL];
    NSString *line = nil;
    [scanner scanUpToCharactersFromSet:[NSCharacterSet newlineCharacterSet] intoString:&line];
    if (line) {
        [lines addObject:line];
    }
}
NSLog(@"%@", lines);
According to http://en.wikipedia.org/wiki/SubRip, the line breaks are a CRLF, which would be \r\n.
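The scanner example above yields individual lines. To get one array element per subtitle "frame", as the question asks, a possible sketch is to walk the lines with enumerateLinesUsingBlock: and start a new group at every empty or whitespace-only line. The function name is hypothetical, and joining the lines of a group with "\n" is an arbitrary choice.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: groups the lines of an SRT string into one string per
// subtitle, treating empty or whitespace-only lines as separators.
static NSArray<NSString *> *SRTBlocksFromString(NSString *srt)
{
    NSMutableArray<NSString *> *blocks = [NSMutableArray array];
    NSMutableArray<NSString *> *current = [NSMutableArray array];
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceCharacterSet];

    // enumerateLinesUsingBlock: recognises LF, CR and CRLF line endings.
    [srt enumerateLinesUsingBlock:^(NSString *line, BOOL *stop) {
        if ([line stringByTrimmingCharactersInSet:whitespace].length == 0) {
            // Separator line: close the group collected so far, if any.
            if (current.count > 0) {
                [blocks addObject:[current componentsJoinedByString:@"\n"]];
                [current removeAllObjects];
            }
        } else {
            [current addObject:line];
        }
    }];

    // The last group has no separator line after it.
    if (current.count > 0) {
        [blocks addObject:[current componentsJoinedByString:@"\n"]];
    }
    return blocks;
}
```

This handles both of the failure cases listed above, since neither the line-break style nor stray spaces on the blank lines affect the grouping.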

scanUpToCharactersFromSet stops after one loop

I'm trying to get the contents of a CSV file into an array. When I've done this before I had one record per line, and used the newline character with scanUpToCharactersFromSet:intoString:, passing newlineCharacterSet as the character set:
while ([lineScanner scanUpToCharactersFromSet:[NSCharacterSet newlineCharacterSet]
intoString:&line])
Now, I'm working with a file where many of the entries themselves contain newline characters. I've tried adding a unique character to the end of each record (a * character) but my loop only runs once. Is there something which is making the while loop break that I don't know about? Here's the code I'm using now:
NSError *error;
NSString *data = [[NSString alloc] initWithContentsOfFile:[[self delegate] filePath] encoding:NSUTF8StringEncoding error:&error];
NSScanner *lineScanner = [NSScanner scannerWithString:data];
NSString *line = nil;
// Start parsing the CSV file
while ([lineScanner scanUpToCharactersFromSet:[NSCharacterSet characterSetWithCharactersInString:@"*"]
                                   intoString:&line]) {
    NSArray *elements = [line componentsSeparatedByString:@","];
    NSLog(@"Name: %@", [elements objectAtIndex:1]);
}
Edit: Thanks to Peter's answer below, I found that my scanner was stuck behind the * character. I added this line in the loop:
[lineScanner scanCharactersFromSet:[NSCharacterSet characterSetWithCharactersInString:@"*"] intoString:NULL];
and now it's working like it should.
Let's go through one pass at a time:
First:
while ([lineScanner scanUpToCharactersFromSet:[NSCharacterSet newlineCharacterSet] intoString:&line]) {
The scanner puts everything before the line break into line. It advances up to the newline.
Second:
while ([lineScanner scanUpToCharactersFromSet:[NSCharacterSet newlineCharacterSet] intoString:&line]) {
The scanner is already on a line break, so it scans no characters. As documented, since it scanned no characters, it returns NO. Your loop terminates.
The solution is to scan the line break at the end of the loop, to get the scanner past it. You can pass NULL for the output parameter, assuming you don't care what the line break was.
This is correct behavior: If you did/do care what the characters you scanned up to were, this lets you obtain them. That would be more difficult if NSScanner scanned past the characters automatically.
I think the while condition is wrong. According to the String Programming Guide, it should be something like:
while ([lineScanner isAtEnd] == NO) {
    [lineScanner scanUpToCharactersFromSet:[NSCharacterSet characterSetWithCharactersInString:@"*"] intoString:&line];
    // ...
}
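Putting the two answers together, a minimal sketch of a loop that tests isAtEnd, scans up to the delimiter, and then steps over it might look like this. The function name and the delimiters parameter are hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits `data` into records that are terminated by any
// character in `delimiters` (for example @"*").
static NSArray<NSString *> *RecordsFromString(NSString *data, NSString *delimiters)
{
    NSMutableArray<NSString *> *records = [NSMutableArray array];
    NSScanner *scanner = [NSScanner scannerWithString:data];
    NSCharacterSet *stopSet = [NSCharacterSet characterSetWithCharactersInString:delimiters];

    while (![scanner isAtEnd]) {
        NSString *record = nil;
        // Stops in front of the delimiter without consuming it.
        if ([scanner scanUpToCharactersFromSet:stopSet intoString:&record]) {
            [records addObject:record];
        }
        // Step over the delimiter so the next pass starts on fresh input.
        [scanner scanCharactersFromSet:stopSet intoString:NULL];
    }
    return records;
}
```

Because the loop condition no longer depends on the return value of the scan, an empty record between two delimiters does not end the loop early.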