How to handle 32-bit Unicode characters in an NSString - objective-c

I have an NSString containing a Unicode character above U+FFFF, such as the MUSICAL SYMBOL G CLEF '𝄞'. I can create the NSString and display it.
NSString *s = @"A\U0001d11eB"; // "A𝄞B"
NSLog(@"String = \"%@\"", s);
The log is correct and displays the 3 characters. This tells me the NSString is well formed and there is no encoding problem.
String = "A𝄞B"
But when I try to loop through all characters using the method
- (unichar)characterAtIndex:(NSUInteger)index
everything goes wrong.
The type unichar is 16 bits so I expect to get the wrong character for the musical symbol. But the length of the string is also incorrect!
NSLog(@"Length = %d", [s length]);
for (int i=0; i<[s length]; i++)
{
    NSLog(@" Character %d = %c", i, [s characterAtIndex:i]);
}
displays
Length = 4
Character 0 = A
Character 1 = 4
Character 2 = .
Character 3 = B
What methods should I use to correctly parse my NSString and get my 3 unicode characters?
Ideally the right method should return a type like wchar_t in place of unichar.
Thank you

NSString *s = @"A\U0001d11eB";
NSData *data = [s dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
const wchar_t *wcs = [data bytes];
for (int i = 0; i < [data length]/4; i++) {
    NSLog(@"%#010x", wcs[i]);
}
Output:
0x00000041
0x0001d11e
0x00000042
(The code assumes that wchar_t has a size of 4 bytes and little-endian encoding.)
length and characterAtIndex: do not give the expected result because \U0001d11e
is stored internally as a UTF-16 surrogate pair, which occupies two unichar values.
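To make the surrogate pair concrete, here is a minimal sketch in plain C (which also compiles inside an Objective-C file) of how a code point above U+FFFF is split into two 16-bit units. The function names are hypothetical and not part of any Apple API.

```c
#include <assert.h>
#include <stdint.h>

/* High (lead) surrogate for a code point in the range U+10000..U+10FFFF. */
static uint16_t high_surrogate(uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

/* Low (trail) surrogate for the same code point. */
static uint16_t low_surrogate(uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

For U+1D11E this yields 0xD834 and 0xDD1E, which is why length reports 4 for "A𝄞B". The low byte of 0xD834 is 0x34, the ASCII code for '4', which is consistent with the "Character 1 = 4" line in the question's output when a unichar is printed with %c.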
Another useful method for general Unicode strings is
[s enumerateSubstringsInRange:NSMakeRange(0, [s length])
                      options:NSStringEnumerationByComposedCharacterSequences
                   usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
    NSLog(@"%@", substring);
}];
Output:
A
𝄞
B
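As a rough illustration of why three items come back rather than four, the following plain C sketch counts code points in a UTF-16 buffer by treating a high surrogate followed by a low surrogate as one unit. It is a simplification under stated assumptions: the function name is hypothetical, and real composed character sequences also cover combining marks, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts code points in a UTF-16 buffer.
   A high surrogate immediately followed by a low surrogate counts once;
   an unpaired surrogate is counted as one (malformed) unit. */
static size_t utf16_code_point_count(const uint16_t *units, size_t length) {
    assert(units != NULL || length == 0);
    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        int is_high = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        int next_is_low = i + 1 < length &&
                          units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
        if (is_high && next_is_low) {
            i++; /* skip the low surrogate: the pair is one code point */
        }
        count++;
    }
    return count;
}
```

Applied to the four code units of "A𝄞B" (0x0041, 0xD834, 0xDD1E, 0x0042), the count is 3.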

Related

Uppercase random characters in a NSString

I'm trying to figure out the best approach to a problem. I have an essentially random alphanumeric string that I'm generating on the fly:
NSString *string = @"e04325ca24cf20ac6bd6ebf73c376b20ac57192dad83b22602264e92dac076611b51142ae12d2d92022eb2c77f";
You can see that there are no special characters, just numbers and letters, and all the letters are lowercase. Changing all the letters in this string to uppercase is easy:
[string uppercaseString];
The hard part is that I want to capitalize random characters in this string, not all of them. For example, this could be the output on one execution:
E04325cA24CF20ac6bD6eBF73C376b20Ac57192DAD83b22602264e92daC076611b51142AE12D2D92022Eb2C77F
This could be the output on another, since it's random:
e04325ca24cf20aC6bd6eBF73C376B20Ac57192DAd83b22602264E92dAC076611B51142AE12D2d92022EB2c77f
In case it makes this easier, let's say I have two variables as well:
int charsToUppercase = 12;//hardcoded value for how many characters to uppercase here
int totalChars = 90;//total string length
In this instance it would mean that 12 random characters out of the 90 in this string would be uppercased. What I've figured out so far is that I can loop through each char in the string relatively easily:
NSUInteger len = [string length];
unichar buffer[len+1];
[string getCharacters:buffer range:NSMakeRange(0, len)];
NSLog(@"loop through each char");
for(int i = 0; i < len; i++) {
    NSLog(@"%C", buffer[i]);
}
Still stuck with selecting random chars in this loop to uppercase, so not all are uppercased. I'm guessing a condition in the for loop could do the trick well, given that it's random enough.
Here's one way, not particularly concerned with efficiency, but not silly efficiency-wise either: create an array of the characters in the original string, building an index of which ones are letters along the way...
NSString *string = @"e04325ca24cf20ac6bd6ebf73c376b20ac57192dad83b22602264e92dac076611b51142ae12d2d92022eb2c77f";
NSMutableArray *chars = [@[] mutableCopy];
NSMutableArray *letterIndexes = [@[] mutableCopy];
for (int i=0; i<string.length; i++) {
    unichar ch = [string characterAtIndex:i];
    // add each char as a string to a chars collection (%C is the specifier for unichar)
    [chars addObject:[NSString stringWithFormat:@"%C", ch]];
    // record the index of letters
    if ([[NSCharacterSet letterCharacterSet] characterIsMember:ch]) {
        [letterIndexes addObject:@(i)];
    }
}
Now, select randomly from the letterIndexes (removing them as we go) to determine which letters shall be upper case. Convert the member of the chars array at that index to uppercase...
int charsToUppercase = 12;
for (int i=0; i<charsToUppercase && letterIndexes.count; i++) {
    NSInteger randomLetterIndex = arc4random_uniform((u_int32_t)(letterIndexes.count));
    NSInteger indexToUpdate = [letterIndexes[randomLetterIndex] intValue];
    [letterIndexes removeObjectAtIndex:randomLetterIndex];
    [chars replaceObjectAtIndex:indexToUpdate withObject:[chars[indexToUpdate] uppercaseString]];
}
Notice the && check on letterIndexes.count. This guards against the case where charsToUppercase exceeds the number of letters. The upper bound on conversions to uppercase is the number of letters in the original string.
Now all that's left is to join the chars array into a string...
NSString *result = [chars componentsJoinedByString:#""];
NSLog(@"%@", result);
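The same select-without-replacement idea can be sketched in plain C for ASCII input. This is an illustration rather than the answer's code: the function name is hypothetical, rand() stands in for arc4random_uniform so the sketch stays portable (it has a slight modulo bias and should be seeded by the caller), and only lowercase ASCII letters are treated as candidates.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Uppercases up to `count` randomly chosen lowercase ASCII letters of `text`
   in place. Returns how many letters were actually changed. */
static size_t uppercase_random_letters(char *text, size_t count) {
    assert(text != NULL);
    size_t length = strlen(text);
    size_t *letters = malloc((length ? length : 1) * sizeof *letters);
    if (letters == NULL) {
        return 0;
    }

    /* Index of every candidate letter, like letterIndexes above. */
    size_t letter_count = 0;
    for (size_t i = 0; i < length; i++) {
        if (islower((unsigned char)text[i])) {
            letters[letter_count++] = i;
        }
    }

    size_t changed = 0;
    while (changed < count && letter_count > 0) {
        size_t pick = (size_t)rand() % letter_count;
        size_t index = letters[pick];
        text[index] = (char)toupper((unsigned char)text[index]);
        /* Remove the picked entry in O(1) by moving the last one into its slot. */
        letters[pick] = letters[--letter_count];
        changed++;
    }

    free(letters);
    return changed;
}
```

Because each picked index is removed from the candidate list, no letter is chosen twice, and asking for more letters than exist simply uppercases all of them.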
EDIT: Following the discussion in the OP comments, you could, instead of a charsToUppercase input parameter, be given a probability of uppercase change as an input. That would compress this idea into a single loop with a little less data transformation...
NSString *string = @"e04325ca24cf20ac6bd6ebf73c376b20ac57192dad83b22602264e92dac076611b51142ae12d2d92022eb2c77f";
float upperCaseProbability = 0.5;
NSMutableString *result = [@"" mutableCopy];
for (int i=0; i<string.length; i++) {
    NSString *chString = [string substringWithRange:NSMakeRange(i, 1)];
    BOOL toUppercase = arc4random_uniform(1000) / 1000.0 < upperCaseProbability;
    if (toUppercase) {
        chString = [chString uppercaseString];
    }
    [result appendString:chString];
}
NSLog(@"%@", result);
However this assumes a given uppercase probability for any character, not any letter, so it won't result in a predetermined number of letters changing case.

ObjC / iOS: How to retrieve unicode hex code for character?

So, I know how to convert a unicode hex code into an NSString consisting of the unicode character referenced by that code:
NSString *ucStr = @"\\u004A"; // hex code for capital J
NSString *theLetter = [ucStr mutableCopy];
CFStringRef transform = CFSTR("Any-Hex/Java");
CFStringTransform((__bridge CFMutableStringRef)theLetter, NULL, transform, YES);
// theLetter is now @"J"
...However, I don't seem to understand how to go in the other direction, i.e. starting with an NSString @"J", output the NSString @"004A".
Simply extract each character and format it using the format string @"%04x", as below:
NSString *input = @"How now brown cow";
for (NSUInteger i = 0; i < [input length]; i++) {
    unichar c = [input characterAtIndex:i];
    NSLog(@"%04x", (unsigned)c);
    // or NSString *s = [NSString stringWithFormat:@"%04x", (unsigned)c];
}
BTW I don't understand the code you have posted, but as that wasn't the question, it doesn't matter.
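Note that the question asks for "004A", while %04x prints lowercase digits ("004a"); %04X prints uppercase ones. A minimal plain C sketch of the formatting step, with a hypothetical helper name and assuming the input is a single 16-bit code unit:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Writes `unit` as four uppercase hex digits plus a terminating NUL,
   e.g. 0x004A -> "004A". `out` must have room for 5 bytes. */
static void format_code_unit_hex(uint16_t unit, char out[5]) {
    assert(out != NULL);
    snprintf(out, 5, "%04X", (unsigned)unit);
}
```

Characters above U+FFFF occupy two code units in an NSString, so this per-unit formatting would print the two surrogates rather than the code point itself.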

How to "concat" bytes from NSString and its length together?

I have an NSString [WORD] that has some length [LEN]. What I need to do is get the bytes of this string and append the length as a short (2 bytes), so that I end up with [WORD][LEN].
E.g.
The string "AB" in UTF-8 is 4142 in hex. The length of this string is 2, i.e. 0002 in hex.
So everything together is 41420002. How do I put these bytes together?
I think this code does what you want.
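NSString *myString = @"AB";
const char *chars = [myString UTF8String];
// number of UTF-8 bytes; this can exceed [myString length] for non-ASCII text
NSUInteger byteCount = strlen(chars);
NSMutableString *result = [NSMutableString string];
for (NSUInteger i = 0; i < byteCount; i++) {
    // two hex digits per byte; the cast avoids sign extension for bytes >= 0x80
    [result appendFormat:@"%02X", (unsigned char)chars[i]];
}
[result appendFormat:@"%04X", (unsigned)[myString length]];
NSLog(@"%@", result);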
Hope it helps!
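If the goal is the raw bytes rather than a hex string, the layout from the question ([WORD] followed by a 2-byte length) can be sketched in plain C. The assumptions here are mine, not the asker's: the length is the number of UTF-8 bytes, it is stored big-endian as in the 0002 example, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies the bytes of the NUL-terminated `word` into `out`, followed by the
   byte count as a big-endian 16-bit value.
   Returns the number of bytes written, or 0 if the result does not fit. */
static size_t append_length_suffix(const char *word, uint8_t *out, size_t capacity) {
    assert(word != NULL && out != NULL);
    size_t length = strlen(word);
    if (length > 0xFFFF || length + 2 > capacity) {
        return 0;
    }
    memcpy(out, word, length);
    out[length] = (uint8_t)(length >> 8);       /* high byte of the length */
    out[length + 1] = (uint8_t)(length & 0xFF); /* low byte of the length */
    return length + 2;
}
```

For "AB" the buffer ends up holding the four bytes 41 42 00 02.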

How do I split a string with special characters into a NSMutableArray

I'm trying to separate a string with Danish characters into an NSMutableArray, but something is not working. :(
My code:
NSString *danishString = @"æøå";
NSMutableArray *characters = [[NSMutableArray alloc] initWithCapacity:[danishString length]];
for (int i=0; i < [danishString length]; i++)
{
    NSString *ichar = [NSString stringWithFormat:@"%c", [danishString characterAtIndex:i ]];
    [characters addObject:ichar];
}
If I do an NSLog on the danishString, it works (returns æøå).
But if I do an NSLog on the characters (the array), I get some very strange characters. What is wrong?
/Morten
First of all, your code is incorrect. characterAtIndex: returns a unichar, so you should use @"%C" (uppercase) as the format specifier.
Even with the correct format specifier, your code is unsafe and, strictly speaking, still incorrect, because not all Unicode characters can be represented by a single unichar. You should always handle Unicode strings per substring:
It's common to think of a string as a sequence of characters, but when
working with NSString objects, or with Unicode strings in general, in
most cases it is better to deal with substrings rather than with
individual characters. The reason for this is that what the user
perceives as a character in text may in many cases be represented by
multiple characters in the string.
You should definitely read the String Programming Guide.
Finally, the correct code for you:
NSString *danishString = @"æøå";
NSMutableArray *characters = [[NSMutableArray alloc] initWithCapacity:[danishString length]];
[danishString enumerateSubstringsInRange:NSMakeRange(0, danishString.length) options:NSStringEnumerationByComposedCharacterSequences usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
    [characters addObject:substring];
}];
If NSLog(@"%@", characters); shows "strange characters" of the form "\Uxxxx", that's correct. It's the default stringification behavior of NSArray's description method. You can print these Unicode characters one by one if you want to see the "normal characters":
for (NSString *c in characters) {
    NSLog(@"%@", c);
}
In your example, characterAtIndex: gives you a unichar, not an NSString. If you want NSStrings, try getting a substring instead:
NSString *danishString = @"æøå";
NSMutableArray *characters = [[NSMutableArray alloc] initWithCapacity:[danishString length]];
for (int i=0; i < [danishString length]; i++)
{
    NSRange r = NSMakeRange(i, 1);
    NSString *ichar = [danishString substringWithRange:r];
    [characters addObject:ichar];
}
You could do something like the following, which should be fine with Danish characters, but would break down if you have decomposed characters. I suggest reading the String Programming Guide for more information.
NSString *danishString = @"æøå";
NSMutableArray* characters = [NSMutableArray array];
for( int i = 0; i < [danishString length]; i++ ) {
    NSString* subchar = [danishString substringWithRange:NSMakeRange(i, 1)];
    if( subchar ) [characters addObject:subchar];
}
That would split the string into an array of individual characters, assuming that all the code points were composed characters.
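To illustrate what "decomposed characters" means here, this plain C sketch counts user-visible characters in an array of code points by attaching combining marks to the preceding base character. It only knows the Combining Diacritical Marks block (U+0300 to U+036F), which is a deliberate simplification of the real Unicode segmentation rules, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True for code points in the Combining Diacritical Marks block only. */
static int is_combining_diacritic(uint32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Counts clusters: a base code point plus any combining marks that
   directly follow it is counted as one user-visible character. */
static size_t count_clusters(const uint32_t *code_points, size_t length) {
    assert(code_points != NULL || length == 0);
    size_t clusters = 0;
    for (size_t i = 0; i < length; i++) {
        if (i == 0 || !is_combining_diacritic(code_points[i])) {
            clusters++;
        }
    }
    return clusters;
}
```

The precomposed form of "æøå" is three code points (U+00E6, U+00F8, U+00E5). If å is stored decomposed as U+0061 followed by U+030A, the string has four code points but still three user-visible characters, which is the case where splitting at every index breaks down.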
It is printing the Unicode escape codes of the characters. In any case, you can use those escapes (with \u) anywhere.

Convert NSString to C string, increment and come back to NSString

I'm trying to develop a simple application where I can encrypt a message. The algorithm is Caesar's algorithm and, for example, for 'Hello World' it prints 'KHOOR ZRUOG' if the increment is 3 (standard).
My problem is how to take each single character and increment it...
I've tried this:
NSString *text = @"hello";
int q, increment = 3;
NSString *string;
for (q = 0; q < [text length]; q++) {
    string = [text substringWithRange:NSMakeRange(q, 1)];
    const char *c = [string UTF8String] + increment;
    NSLog(@"%@", [NSString stringWithUTF8String:c]);
}
It is very simple, but it doesn't work. My theory was: take each single character, convert it into a C string and increment it, then convert it back to an NSString and print it. But Xcode prints nothing, and if I print the char 'c' I can't see the result in the console either. Where is the problem?
First of all, incrementing byte by byte only works for ASCII strings. If you use UTF-8, you will get garbage for glyphs that have multi-byte representations.
With that in mind, this should work (and work faster than characterAtIndex: and similar methods):
NSString *foo = @"FOOBAR";
int increment = 3;
NSUInteger bufferSize = [foo length] + 1;
char *buffer = (char *)calloc(bufferSize, sizeof(char));
if ([foo getCString:buffer maxLength:bufferSize encoding:NSASCIIStringEncoding]) {
    int bufferLen = strlen(buffer);
    for (int i = 0; i < bufferLen; i++) {
        buffer[i] += increment;
        if (buffer[i] > 'Z') {
            buffer[i] -= 26;
        }
    }
    NSString *encoded = [NSString stringWithCString:buffer
                                           encoding:NSASCIIStringEncoding];
}
free(buffer);
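For comparison, a Caesar shift that leaves non-letters alone and wraps both cases can be sketched in plain C. This is not the code above: it is a hypothetical helper that assumes ASCII input and preserves case rather than uppercasing the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Shifts ASCII letters of `text` in place by `shift` positions, wrapping
   within A-Z and a-z. Digits, spaces and all other bytes are left untouched. */
static void caesar_shift(char *text, int shift) {
    assert(text != NULL);
    /* Normalise to 0..25 so that negative shifts (decryption) also work. */
    int offset = ((shift % 26) + 26) % 26;
    size_t length = strlen(text);
    for (size_t i = 0; i < length; i++) {
        char c = text[i];
        if (c >= 'A' && c <= 'Z') {
            text[i] = (char)('A' + (c - 'A' + offset) % 26);
        } else if (c >= 'a' && c <= 'z') {
            text[i] = (char)('a' + (c - 'a' + offset) % 26);
        }
    }
}
```

Shifting "Hello World" by 3 gives "Khoor Zruog", and shifting that result by -3 restores the original.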
First of all, replace your code with this:
for (q = 0; q < [text length]; q++) {
    string = [text substringWithRange:NSMakeRange(q, 1)];
    const char *c = [string UTF8String];
    NSLog(@"Addr: 0x%X", c);
    c = c + increment;
    NSLog(@"Addr: 0x%X", c);
    NSLog(@"%@", [NSString stringWithUTF8String:c]);
}
Now you can figure out your problem. const char *c is a pointer. A pointer saves a memory address.
When I run this code the first log output is this:
Addr: 0x711DD10
that means the char 'h' from the NSString named string with the value @"h" is saved at address 0x711DD10 in memory.
Now we increment this address by 3. Next output is this:
Addr: 0x711DD13
In my case at this address there is a '0x00'. But it doesn't matter what is actually there because a 'k' won't be there (unless you are very lucky).
If you are lucky, there is a 0x00 there too, because then nothing bad will happen. If you are unlucky, there is something else. If there is something other than 0x00 (the string delimiter, or "end of string" marker), NSString will try to convert it. It might crash while trying this, or it might open a huge security hole.
So instead of manipulating pointers, you have to manipulate the values they point to.
You can do it like this:
for (q = 0; q < [text length]; q++) {
    string = [text substringWithRange:NSMakeRange(q, 1)];
    const char *c = [string UTF8String]; // get the pointer
    char character = *c; // get the character from this pointer address
    character = character + 3; // add 3 to the letter
    char cString[2] = {0, 0}; // create a cstring with length of 1. The second char is \0, the delimiter (the "end marker") of the string
    cString[0] = character; // assign our changed character to the first character of the cstring
    NSLog(@"%@", [NSString stringWithUTF8String:cString]); // profit...
}
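The difference between the two operations can be isolated in a few lines of plain C. Both helper names are hypothetical, and the bounds check in the second one is my addition to keep the sketch from reading past the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Arithmetic on the VALUE: returns the first character shifted by `increment`. */
static char shift_value(const char *s, int increment) {
    assert(s != NULL && s[0] != '\0');
    return (char)(s[0] + increment);
}

/* Arithmetic on the POINTER: returns whatever character is stored `offset`
   positions further along. Only defined up to the terminating NUL. */
static char read_at_offset(const char *s, size_t offset) {
    assert(s != NULL && offset <= strlen(s));
    return *(s + offset);
}
```

With "h", shift_value returns 'k'. With "hello", read_at_offset at 3 returns the second 'l', which has nothing to do with shifting a letter. For the one-character strings in the question, an offset of 3 already lies beyond the terminator, so the original code was reading unrelated memory.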