Using NSXMLParser with ISO-8859-1 truncates words with accents - objective-c

I have the same exact problem that's in this question, but it didn't get any good answers.
I'm trying to parse an XML file with ISO-8859-1 encoding, but every time there's an accented word, it gets truncated and doesn't display properly.
Example:
Original Word: Interés
Word Shown: és

You're making the assumption that you only get one -parser:foundCharacters: delegate method for the text. In this case, that's wrong. You're getting two calls to -parser:foundCharacters:, the first being the text up to the accented character, and the second being the text after it. Your logs even demonstrate this.
Therefore, what you need to do is, when you start a new element, you should also initialize a new NSMutableString* instance. Then when you get -parser:foundCharacters: you append to this string instead of replacing it. When the tag closes, this string now contains all of the text in the tag, instead of just the last text block.

You must use an NSMutableString and append the characters to it in the parser:foundCharacters: method, instead of replacing the string on each call.
That's why your string becomes truncated.
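The approach both answers describe can be sketched roughly as follows. This is a minimal sketch, not the asker's code: the class name TextCollector, its two properties, and the helper ElementTexts are hypothetical, and it assumes the text of interest sits in leaf elements (no mixed content).

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate (the name TextCollector is not from the question).
// It gathers the complete text of each leaf element.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *currentText;      // buffer for the open element
@property (nonatomic, strong) NSMutableArray<NSString *> *texts; // finished element texts
@end

@implementation TextCollector

- (instancetype)init {
    if ((self = [super init])) {
        _texts = [NSMutableArray array];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser
didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI
 qualifiedName:(NSString *)qName
    attributes:(NSDictionary<NSString *, NSString *> *)attributeDict {
    // A new element starts: begin with a fresh, empty buffer.
    self.currentText = [NSMutableString string];
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // This can be called more than once per element, so append; never assign.
    [self.currentText appendString:string];
}

- (void)parser:(NSXMLParser *)parser
 didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI
 qualifiedName:(NSString *)qName {
    // The buffer now holds all the text of the element that just closed.
    if (self.currentText != nil) {
        [self.texts addObject:[self.currentText copy]];
        self.currentText = nil;
    }
}

@end

// Hypothetical helper: parses xmlData and returns the text of each leaf element.
static NSArray<NSString *> *ElementTexts(NSData *xmlData) {
    TextCollector *collector = [[TextCollector alloc] init];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:xmlData];
    parser.delegate = collector;
    [parser parse];
    return collector.texts;
}
```

The important line is the appendString: call in parser:foundCharacters:. Where the buffer is created and where it is read back can be arranged differently to suit the document structure.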

Related

Hebrew punctuation displayed incorrectly in Objective-C

I have the very basic line:
self.label.text = @",הם הכריחו אותה לשתות ויסקי";
Notice the comma at the left of the string. When this displays in a UILabel, I see the comma at the right. This is one example of punctuation problems I am seeing with Hebrew.
Any ideas for resolving this?
Most of the text you have is right-to-left, but a comma is left-to-right. You are displaying source code here as it is displayed by Xcode. It's not at all obvious what rules Xcode would choose to display such text. You would be much more confident about what your source code contains if you wrote
self.label.text = @"הם הכריחו אותה לשתות ויסקי" ",";
for example or
self.label.text = @"," "הם הכריחו אותה לשתות ויסקי";
so you know for certain what text your source contains. After that, I'm afraid it's very much a matter of reading the documentation and seeing what you need to do. While the characters in a text have a direction, a text field has an overall text direction of its own as well. You can have Latin text with a bit of Hebrew inside, or Hebrew (right-to-left) text with a bit of Latin inside, and they will behave differently.
What you describe looks like a left-to-right text field being used to display Hebrew text, so the overall display order is left to right, but the Hebrew items inside (not the comma) are displayed right to left. You'd need to change the display order of the text field itself.
I've been reading up on bidirectional text, and it seems that certain Unicode characters set properties of the text that follows them. Through experimentation, I've found that the Right-To-Left Isolate character, U+2067 ⁧, causes the text that follows to be displayed in the correct order. So the Objective-C solution to the problem was:
self.label.text = [@"\u2067" stringByAppendingString: @",הם הכריחו אותה לשתות ויסקי"];

Add bold formatting to localized string that has placeholders

I have a localization string with a placeholder:
Verb {0}
I use this string in my view-model to return a string to my view that, in turn, is displayed in a TextBlock. Easy enough. But a new requirement has arisen saying that the "Verb" portion (everything other than the placeholder's inserted value) be displayed in bold.
Using a string with placeholders seems like the typical and easiest way to indicate word order. So the first question, then, is: where should I parse the localization string in order to add the bold formatting? The parse operation will need knowledge of the original placeholder's location. So far, the view-model has been responsible for utilizing the localization strings by using string.Format to insert values and return its result to the view. If I leave this responsibility in the view-model, as is probably necessary, then the view-model also needs to return rich text.
Is binding to rich text even supported by RichTextBlock? Even if it is supported, I've never had a view-model return formatted text before. It initially feels sacrilegious to a follower of MVVM-ism, but perhaps upon further consideration I may find it acceptable.
What's the best way to add bold formatting to a localized string that has placeholders? Is returning rich text from the view-model the best way?

Comparing NSString to NSTextView Range prior to Appending

Coding in Objective-C, I'm appending text to a NSTextView object named subCap in my code like so:
[[[_subCAP textStorage] mutableString] appendString:[NSString stringWithFormat:@"%@", subcapLine]];
subcapLine will have two timecode values such as: "01:00:00:00 01:00:01:00" separated by a single space, then a newline (\n) character, then a string like "ONC314_001_001" followed by two newline chars (\n\n).
The end result will create a list similar to:
01:00:00:00 01:00:01:00
ONC314_001_001
01:00:01:00 01:00:02:00
ONC314_001_002
01:00:02:00 01:00:03:00
ONC314_001_003
etc, etc, etc.
It's a sub caption file for placing text (the ONC314 lines) at appropriate times in a video file, as indicated by the timecodes.
However, I've determined that there is an odd set of circumstances where a timecode pair could be the same as the previous timecode pair, and if that happens, I want to skip appending that line.
So, my question is, given that the timecodes are always 11 chars apiece, separated by a space, can anybody think of a way I can easily grab the prior TC pair and compare it to my current pair in the subcapLine I'm preparing to append? The problem is the text of the sub caption could be random lengths. In my example they're the same, but that isn't always the case.
If I need to check prior to compiling my subcapLine, I can do that too, but I just thought it might be more slick to use a range of some sort to grab the prior pair of TCs from the last-written line in the NSTextView object and compare (again, using a range?) against the TCs in the line I'm about to append?
Thoughts and suggestions much appreciated.
Chris Conlee
When you add a timecode, store the length of the text view's string just before you append it. That length is the offset of the timecode pair you are about to add.
Then, before adding the next timecode pair, use the stored offset to extract the previous pair as a substring and compare it with the new pair to see whether they are identical.
This way you always have an offset to the previous timecode pair, regardless of the length of the subtitles.
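A rough sketch of that idea follows, assuming every line starts with two 11-character timecodes separated by one space (23 characters in total). The class CaptionAppender and its method names are hypothetical, and the sketch works on a plain NSMutableString so that it stays independent of AppKit.

```objc
#import <Foundation/Foundation.h>

// Two 11-character timecodes plus the single space between them.
static const NSUInteger kTimecodePairLength = 23;

// Hypothetical helper that remembers where the last timecode pair was written.
@interface CaptionAppender : NSObject
- (instancetype)initWithText:(NSMutableString *)text;
// Appends subcapLine unless its timecode pair equals the previous one.
// Returns YES when the line was appended.
- (BOOL)appendLine:(NSString *)subcapLine;
@end

@implementation CaptionAppender {
    NSMutableString *_text;
    NSUInteger _lastOffset; // offset of the previously appended timecode pair
    BOOL _hasPrevious;      // NO until the first line has been appended
}

- (instancetype)initWithText:(NSMutableString *)text {
    if ((self = [super init])) {
        _text = text;
    }
    return self;
}

- (BOOL)appendLine:(NSString *)subcapLine {
    if (subcapLine.length < kTimecodePairLength) {
        return NO; // too short to contain a timecode pair
    }
    NSString *newPair = [subcapLine substringToIndex:kTimecodePairLength];
    if (_hasPrevious) {
        NSRange previousRange = NSMakeRange(_lastOffset, kTimecodePairLength);
        NSString *previousPair = [_text substringWithRange:previousRange];
        if ([previousPair isEqualToString:newPair]) {
            return NO; // same timecodes as the last entry, so skip it
        }
    }
    _lastOffset = _text.length; // length before appending is the new pair's offset
    _hasPrevious = YES;
    [_text appendString:subcapLine];
    return YES;
}

@end
```

With the text view from the question, the appender would presumably be created once with [[_subCAP textStorage] mutableString]. The stored offset is only valid while nothing else edits the text in between.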

User input text translation

I'm working on a translator that will take English language text (as user input into a UITextView) and (with a button press) replace specific words with alternatives. I have both the English words in scope plus their alternatives in separate Arrays (englishArray and alternativeArray), indexed correspondingly.
My challenge is finding an algorithm that will allow me to identify a word in the input text (a UITextView) ignoring characters like <",.()>, lookup the word in englishArray (case insensitive), locate the corresponding word in alternativeArray and then use that word in place of the original - writing it back to the UITextView.
Any help greatly appreciated.
NB. I have created a Category extending the NSArray functionality with an indexOfCaseInsensitiveString method that ignores case when doing an indexOfObject type lookup, if that helps.
Tony.
I think that using an NSScanner would be best to parse the string into separate words which you could then pass to your indexOfCaseInsensitiveString method. scanCharactersFromSet:intoString: using a set of all the characters you want to ignore, including whitespace and newline characters should get you to the start of a word, and then you could use scanUpToCharactersFromSet:intoString: using the same set to scan to the end of the word. Using scanLocation at the beginning and end of each scan should allow you to get the range of that word, so if you find a match in your array, you will know where in your string to make the replacement.
Thanks for your suggestion. It's working with one exception.
I want to capture all punctuation so I can recreate the original input but with the substituted words. Even though I have a 'space' in my Character Set, the scanner is not putting the spaces into the 'intoString'. Other characters I specify in the Character Set such as '(' and ';' are represented in the 'intoString'.
Net is that when I recreate the input, it's perfect except that I get individual words running into each other.
UPDATE: I fixed that issue by including:
[theScanner setCharactersToBeSkipped:nil];
Thanks again.
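For reference, the scanning loop discussed in this thread might look roughly like the sketch below. The function name TranslateText is hypothetical. The separator set (whitespace, newlines and punctuation) is an assumption about which characters to ignore, and indexOfObjectPassingTest: stands in for the asker's indexOfCaseInsensitiveString category method, which is not shown in the question.

```objc
#import <Foundation/Foundation.h>

// Hypothetical function: returns input with every word found in englishArray
// replaced by the entry at the same index in alternativeArray.
static NSString *TranslateText(NSString *input,
                               NSArray<NSString *> *englishArray,
                               NSArray<NSString *> *alternativeArray) {
    // Characters that separate words (an assumption; adjust as needed).
    NSMutableCharacterSet *separators =
        [[NSCharacterSet whitespaceAndNewlineCharacterSet] mutableCopy];
    [separators formUnionWithCharacterSet:[NSCharacterSet punctuationCharacterSet]];

    NSScanner *scanner = [NSScanner scannerWithString:input];
    // By default the scanner silently skips whitespace; turn that off so
    // spaces are captured and the output can be rebuilt exactly.
    scanner.charactersToBeSkipped = nil;

    NSMutableString *output = [NSMutableString string];
    while (!scanner.isAtEnd) {
        // Copy a run of separators through unchanged.
        NSString *gap = nil;
        if ([scanner scanCharactersFromSet:separators intoString:&gap]) {
            [output appendString:gap];
        }
        // Read one word and substitute it when it is in englishArray.
        NSString *word = nil;
        if ([scanner scanUpToCharactersFromSet:separators intoString:&word]) {
            NSUInteger index = [englishArray indexOfObjectPassingTest:
                ^BOOL(NSString *candidate, NSUInteger idx, BOOL *stop) {
                    return [candidate caseInsensitiveCompare:word] == NSOrderedSame;
                }];
            [output appendString:(index == NSNotFound) ? word : alternativeArray[index]];
        }
    }
    return output;
}
```

Building a new string avoids having to track ranges in the original text while it changes length. The result can then be written back to the UITextView in one assignment.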

How do I match non-ASCII characters with RegexKitLite?

I am using RegexKitLite and I'm trying to match a pattern.
The following regex patterns do not capture my word, which includes an N with a tilde: ñ.
Is there a string conversion I am missing?
subjectString = @"define_añadir";
//regexString = @"^define_(.*)"; //this pattern does not match, so I assume I need to add the ñ
//regexString = @"^define_([.ñ]*)"; //tried this pattern first with a range
regexString = @"^define_((?:\\w|ñ)*)"; //tried second
NSString *captured= [subjectString stringByMatching:regexString capture:1L];
//I want captured == añadir
Looks like an encoding problem to me. Either you're saving the source code in an encoding that can't handle that character (like ASCII), or the compiler is using the wrong encoding to read the source files. Going back to the original regex, try creating the subject string like this:
subjectString = @"define_a\xC3\xB1" "adir"; // literal split so the hex escape does not absorb "ad"
or this:
subjectString = @"define_a\u00F1adir";
If that works, check the encoding of your source code files and make sure it's the same encoding the compiler expects.
EDIT: I've never worked with the iPhone technology stack, but according to this doc you should be using the stringWithUTF8String method to create the NSString, not the #"" literal syntax. In fact, it says you should never use non-ASCII characters (that is, anything not in the range 0x00..0x7F) in your code; that way you never have to worry about the source file's encoding. That's good advice no matter what language or toolset you're using.
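As a minimal sketch of that last suggestion, the subject string can be built from explicit UTF-8 bytes so that the source file itself stays pure ASCII. Note the swap: this uses Foundation's NSRegularExpression rather than RegexKitLite, so it only illustrates the encoding point and says nothing about RegexKitLite's behavior. The function names are hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds "define_añadir" without any non-ASCII source text.
static NSString *SubjectFromUTF8Bytes(void) {
    // 0xC3 0xB1 is the UTF-8 encoding of ñ (U+00F1). The C literal is split
    // because a \x escape would otherwise also consume the hex digits "ad".
    return [NSString stringWithUTF8String:"define_a\xC3\xB1" "adir"];
}

// Hypothetical helper: returns whatever follows "define_", or nil if no match.
static NSString *CapturedSuffix(NSString *subject) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"^define_(.*)"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        return nil; // the pattern failed to compile
    }
    NSTextCheckingResult *match =
        [regex firstMatchInString:subject
                          options:0
                            range:NSMakeRange(0, subject.length)];
    if (match == nil) {
        return nil;
    }
    return [subject substringWithRange:[match rangeAtIndex:1]];
}
```

If the original pattern matches a string built this way but not the literal typed into the source file, that points at the source file encoding rather than at the regex.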