Set line-terminator string in NSDocument? - objective-c

(This question has been rewritten from an issue with NSTextView following some further research)
UPDATE: You can download a very basic project that displays the issue here:
http://w3style.co.uk/~d11wtq/DocumentApp.tar.gz
(Do a grep -c "\r" file.txt on the file you save to get a line count where \r occurs... repeat for \n).
I've realised all files created by NSDocument have \r as line endings, not the standard \n, even though the NSData my document subclass returns does not contain \r; it contains only \n. Is there a way to configure this?
I thought Macs used UNIX line endings these days, so it seems weird that AppKit is still using the antiquated Mac endings. Weirder is that NSDocument asks for NSData, then rather unkindly corrupts that NSData by transforming the line endings.
The switch to \r is happening after producing NSData, so NSDocument itself is doing some replacements on the bytes:
const unsigned char *bytes = [data bytes];
NSUInteger i, len;
for (i = 0, len = [data length]; i < len; ++i) {
    NSLog(@"byte %lu = %02x", (unsigned long)i, bytes[i]);
}
Outputs (note 0a is the hex value of \n):
> 2010-12-17 12:45:59.076
> MojiBaker[74929:a0f] byte 0 = 66
> 2010-12-17 12:45:59.076
> MojiBaker[74929:a0f] byte 1 = 6f
> 2010-12-17 12:45:59.076
> MojiBaker[74929:a0f] byte 2 = 6f
> 2010-12-17 12:45:59.077
> MojiBaker[74929:a0f] byte 3 = 0a
> 2010-12-17 12:45:59.077
> MojiBaker[74929:a0f] byte 4 = 62
> 2010-12-17 12:45:59.077
> MojiBaker[74929:a0f] byte 5 = 61
> 2010-12-17 12:45:59.077
> MojiBaker[74929:a0f] byte 6 = 72
> 2010-12-17 12:45:59.077
> MojiBaker[74929:a0f] byte 7 = 0a
If NSDocument is going to ask for NSData then it should respect that and not modify it.
Here's the full code of the -dataOfType:error: method in my document:
-(NSData *)dataOfType:(NSString *)typeName error:(NSError **)outError {
    NSString *string = [textView string];
    // DEBUG CODE...
    NSArray *unixLines = [string componentsSeparatedByString:@"\n"];
    NSArray *windowsLines = [string componentsSeparatedByString:@"\r\n"];
    NSArray *macLines = [string componentsSeparatedByString:@"\r"];
    NSLog(@"TextView has %lu LF, %lu CRLF, %lu CR",
          (unsigned long)([unixLines count] - 1),
          (unsigned long)([windowsLines count] - 1),
          (unsigned long)([macLines count] - 1));
    NSData *data = [NSData dataWithBytes:[string cStringUsingEncoding:NSUTF8StringEncoding]
                                  length:[string lengthOfBytesUsingEncoding:NSUTF8StringEncoding]];
    // unsigned char avoids sign extension when logging bytes >= 0x80
    const unsigned char *bytes = [data bytes];
    NSUInteger i, len;
    for (i = 0, len = [data length]; i < len; ++i) {
        NSLog(@"byte %lu = %02x", (unsigned long)i, bytes[i]);
    }
    if (data != nil) {
        [textView breakUndoCoalescing];
    }
    return data;
}

NSDocument doesn’t care about line termination; it’s a semi-abstract class, designed to be subclassed. By itself it imposes nothing on a file format.
It’s the particular implementation of an NSDocument subclass - one that happens to read and write plain text - that will care about line termination characters.

Related

Create code challenge (base64 encoded, sha 256 ascii) from string

For some code challenge used in the oauth2 login process I need to do the following:
code_challenge = BASE64URL-ENCODE(SHA256(ASCII(code_verifier)))
How can I do this from my random string contained in code_verifier?
UPDATE: Can you check if this is correct? Or is some of it unnecessary or deprecated? I don't really have an idea what I am doing here; I just copied code from everywhere to solve it...
- (NSString *)createCodeChallengeWithVerifier:(NSString *)codeVerifier {
//Create ASCII
const char *asciiString = [codeVerifier cStringUsingEncoding:NSASCIIStringEncoding];
//Sha256
unsigned char buf[CC_SHA256_DIGEST_LENGTH];
CC_SHA256(asciiString, strlen(asciiString), buf);
NSMutableString * shaString = [NSMutableString stringWithCapacity:(CC_SHA256_DIGEST_LENGTH * 2)];
for (int i = 0; i < CC_SHA256_DIGEST_LENGTH; ++i) {
[shaString appendFormat:@"%02x", buf[i]];
}
//Base 64 encode
NSData *dataFromShaString = [shaString dataUsingEncoding:NSUTF8StringEncoding];
return([dataFromShaString base64EncodedStringWithOptions:0]);
}
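For comparison with the formula in the question: in the PKCE specification (RFC 7636), BASE64URL-ENCODE is applied to the raw 32 digest bytes, using the URL-safe alphabet ("-" and "_" instead of "+" and "/") and without "=" padding. The method above instead hex-encodes the digest first and then applies standard Base64, which gives a different string. Below is a minimal plain-C sketch of the encoding step only (plain C compiles unchanged inside an Objective-C file). The function name base64url_encode is made up for illustration, and the SHA-256 step is assumed to stay with CC_SHA256 as in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes the unpadded base64url form of in[0..len) into
   out and returns the number of characters written (excluding the final NUL).
   out must have room for at least 4 * ((len + 2) / 3) + 1 bytes. */
size_t base64url_encode(const unsigned char *in, size_t len, char *out)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
    size_t i = 0, o = 0;

    assert(out != NULL);

    /* Full 3-byte groups become 4 characters each. */
    for (; i + 2 < len; i += 3) {
        out[o++] = alphabet[in[i] >> 2];
        out[o++] = alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[o++] = alphabet[((in[i + 1] & 0x0F) << 2) | (in[i + 2] >> 6)];
        out[o++] = alphabet[in[i + 2] & 0x3F];
    }
    /* One or two left-over bytes; no '=' padding is appended. */
    if (len - i == 1) {
        out[o++] = alphabet[in[i] >> 2];
        out[o++] = alphabet[(in[i] & 0x03) << 4];
    } else if (len - i == 2) {
        out[o++] = alphabet[in[i] >> 2];
        out[o++] = alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[o++] = alphabet[(in[i + 1] & 0x0F) << 2];
    }
    out[o] = '\0';
    return o;
}
```

With the 32-byte buf from the question, a 45-byte output buffer is enough and the result is 43 characters long.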

Converting NSData that contains UTF-8 and null bytes to string

I have an __NSCFData object. I know what's inside it.
61 70 70 6c 65 2c 74 79 70 68 6f 6f 6e 00 41 52 4d 2c 76 38 00
I tried converting it to a string with initWithData: and stringWithUTF8String: and it gives me "apple,typhoon". The conversion is terminated at 00
The data actually is
61 a
70 p
70 p
6c l
65 e
2c ,
74 t
79 y
70 p
68 h
6f o
6f o
6e n
00 (null)
41 A
52 R
4d M
2c ,
76 v
38 8
00 (null)
How can I properly convert this without loss of information?
The documentation for stringWithUTF8String describes its first parameter as:
A NULL-terminated C array of bytes in UTF8 encoding.
Which is why your conversion stops at the first null byte.
What you appear to have is a collection of C strings packed into a single NSData. You can convert each one individually. Use the NSData methods bytes and length to obtain a pointer to the bytes/first C string and the total number of bytes respectively. The standard C function strlen() will give you the length in bytes of an individual string. Combine these and some simple pointer arithmetic and you can write a loop which converts each string and, for example, stores them all into an array or concatenates them.
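The pointer walk described above might look roughly like the following plain-C sketch. It uses memchr instead of strlen so that a missing terminator cannot run past the end of the buffer; that swap, and the name split_packed_strings, are my own choices for illustration, not something from the answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the NUL-terminated strings packed into
   buf[0..len) and stores a pointer to the start of each one in starts
   (up to max_starts entries). Returns the number of strings found.
   A final run of bytes with no terminating NUL is ignored. */
size_t split_packed_strings(const char *buf, size_t len,
                            const char **starts, size_t max_starts)
{
    size_t count = 0;
    size_t offset = 0;

    assert(buf != NULL && starts != NULL);

    while (offset < len && count < max_starts) {
        /* Bounded search for the terminator of the current string. */
        const char *end = memchr(buf + offset, '\0', len - offset);
        if (end == NULL) {
            break;
        }
        if (end != buf + offset) {          /* skip empty strings */
            starts[count++] = buf + offset;
        }
        offset = (size_t)(end - buf) + 1;   /* step past the NUL */
    }
    return count;
}
```

Called with [data bytes] and [data length], each returned pointer is an ordinary C string that can be handed to stringWithUTF8String: one at a time.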
If you get stuck implementing the solution ask a new question, show your code, and explain the issue. Someone will undoubtedly help you with the next step.
HTH
Contrary to what some answers imply, the characters stored in an NSString instance are not 0-terminated. There may be problems when writing them out (since the underlying C output functions expect a 0-terminated string), but the instance itself can contain a \0:
NSString *zeroIncluded = @"A\0B";
NSLog(@"%lu", (unsigned long)[zeroIncluded length]);
// prints 3
To create such an instance you can use methods that take a bytes and a length parameter, for example -initWithBytes:length:encoding:. Therefore something like this should work:
NSData *data = …
[[NSString alloc] initWithBytes:[data bytes] length:[data length] encoding:NSUTF8StringEncoding];
However, as CRD suggests, you should check whether you actually want such a string.
0, or NUL, is the sentinel value which terminates C strings, so you're going to have to deal with it somehow if you want to automatically dump the bytes into a string. If you don't, anything that treats the bytes as a C string, or tries to print them as one, will assume the end of the string has been reached at the first NUL.
Just replace the bytes as they occur with something printable, like a space. Use whatever value works for you.
Example:
// original data you have from somewhere
char something[] = "apple,typhoon\0ARM,v8\0";
NSData *data = [NSData dataWithBytes:something length:sizeof(something)];
// Find each null terminated string in the data
NSMutableArray *strings = [NSMutableArray new];
NSMutableString *temp = [NSMutableString string];
const char *bytes = [data bytes];
for (int i = 0; i < [data length]; i++) {
unsigned char byte = (unsigned char)bytes[i];
if (byte == 0) {
if ([temp length] > 0) {
[strings addObject:temp];
temp = [NSMutableString string];
}
} else {
[temp appendFormat:@"%c", byte];
}
}
// Results
NSLog(@"strings count: %lu", (unsigned long)[strings count]);
[strings enumerateObjectsUsingBlock:^(NSString *string, NSUInteger idx, BOOL * _Nonnull stop) {
NSLog(@"%lu: %@", (unsigned long)idx, string);
}];
// strings count: 2
// 0: apple,typhoon
// 1: ARM,v8

Convert NSData byte array to string?

I have an NSData object. I need to convert its bytes to a string and send as JSON. description returns hex and is unreliable (according to various SO posters). So I'm looking at code like this:
NSUInteger len = [imageData length];
Byte *byteData = (Byte*)malloc(len);
[imageData getBytes:byteData length:len];
How do I then send byteData as JSON? I want to send the raw bytes.
CODE:
NSString *jsonBase64 = [imageData base64EncodedString];
NSLog(@"BASE 64 FINGERPRINT: %@", jsonBase64);
NSData *b64 = [NSData dataFromBase64String:jsonBase64];
NSLog(@"Equal: %d", [imageData isEqualToData:b64]);
NSLog(@"b64: %@", b64);
NSLog(@"original: %@", imageData);
NSString *decoded = [[NSString alloc] initWithData:b64 encoding:NSUTF8StringEncoding];
NSLog(@"decoded: %@", decoded);
I get values for everything except for the last line - decoded.
Which would indicate to me that the raw bytes are not formatted in NSUTF8encoding?
The string conversion was considered 'unreliable' in previous Stack Overflow posts because they, too, passed NSData bytes that are not terminated with a NUL byte:
NSString *jsonString = [NSString stringWithUTF8String:[nsDataObj bytes]];
// This is unreliable because it may result in NULL string values
Whereas the example below should give you the desired result, because the length is passed explicitly instead of relying on a NUL terminator:
NSString *jsonString = [[NSString alloc] initWithBytes:[nsDataObj bytes] length:[nsDataObj length] encoding: NSUTF8StringEncoding];
You were on the right track and hopefully this is able to help you solve your current problem. Best of luck!
~ EDIT ~
Make sure you are declaring your NSData Object from an image like so:
NSData *imageData = UIImagePNGRepresentation(yourImage);
Have you tried using something like this:
@implementation NSData (Base64)
- (NSString *)base64EncodedString
{
    return [self base64EncodedStringWithWrapWidth:0];
}
This will turn your NSData into a base64 string, and on the other side you just need to decode it.
EDIT: @Lucas said you can do something like this:
NSString *myString = [[NSString alloc] initWithData:myData encoding:NSUTF8StringEncoding];
but I had some problems with this method because of some special characters, and because of that I started using base64 strings for communication.
EDIT3: Try this method, base64EncodedString:
@implementation NSData (Base64)
- (NSString *)base64EncodedString
{
return [self base64EncodedStringWithWrapWidth:0];
}
//Helper Method
- (NSString *)base64EncodedStringWithWrapWidth:(NSUInteger)wrapWidth
{
//ensure wrapWidth is a multiple of 4
wrapWidth = (wrapWidth / 4) * 4;
const char lookup[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
long long inputLength = [self length];
const unsigned char *inputBytes = [self bytes];
long long maxOutputLength = (inputLength / 3 + 1) * 4;
maxOutputLength += wrapWidth? (maxOutputLength / wrapWidth) * 2: 0;
unsigned char *outputBytes = (unsigned char *)malloc((NSUInteger)maxOutputLength);
long long i;
long long outputLength = 0;
for (i = 0; i < inputLength - 2; i += 3)
{
outputBytes[outputLength++] = lookup[(inputBytes[i] & 0xFC) >> 2];
outputBytes[outputLength++] = lookup[((inputBytes[i] & 0x03) << 4) | ((inputBytes[i + 1] & 0xF0) >> 4)];
outputBytes[outputLength++] = lookup[((inputBytes[i + 1] & 0x0F) << 2) | ((inputBytes[i + 2] & 0xC0) >> 6)];
outputBytes[outputLength++] = lookup[inputBytes[i + 2] & 0x3F];
//add line break
if (wrapWidth && (outputLength + 2) % (wrapWidth + 2) == 0)
{
outputBytes[outputLength++] = '\r';
outputBytes[outputLength++] = '\n';
}
}
//handle left-over data
if (i == inputLength - 2)
{
// = terminator
outputBytes[outputLength++] = lookup[(inputBytes[i] & 0xFC) >> 2];
outputBytes[outputLength++] = lookup[((inputBytes[i] & 0x03) << 4) | ((inputBytes[i + 1] & 0xF0) >> 4)];
outputBytes[outputLength++] = lookup[(inputBytes[i + 1] & 0x0F) << 2];
outputBytes[outputLength++] = '=';
}
else if (i == inputLength - 1)
{
// == terminator
outputBytes[outputLength++] = lookup[(inputBytes[i] & 0xFC) >> 2];
outputBytes[outputLength++] = lookup[(inputBytes[i] & 0x03) << 4];
outputBytes[outputLength++] = '=';
outputBytes[outputLength++] = '=';
}
if (outputLength >= 4)
{
//truncate data to match actual output length
outputBytes = realloc(outputBytes, (NSUInteger)outputLength);
return [[NSString alloc] initWithBytesNoCopy:outputBytes
length:(NSUInteger)outputLength
encoding:NSASCIIStringEncoding
freeWhenDone:YES];
}
else if (outputBytes)
{
free(outputBytes);
}
return nil;
}
@end
Null termination is not the only problem when converting from NSData to NSString.
NSString is not designed to hold arbitrary binary data. It expects an encoding.
If your NSData contains an invalid UTF-8 sequence, initializing the NSString will fail.
The documentation isn't completely clear on this point, but for initWithData it says:
Returns nil if the initialization fails for some reason (for example
if data does not represent valid data for encoding).
Also: The JSON specification defines a string as a sequence of Unicode characters.
That means even if you're able to get your raw data into a JSON string, parsing could fail on the receiving end if the code performs UTF-8 validation.
If you don't want to use Base64, take a look at the answers here.
All code in this answer is pseudo-code fragments, you need to convert the algorithms into Objective-C or other language yourself.
Your question raises many questions... You start with:
I have an NSData object. I need to convert its bytes to a string and send as JSON. description returns hex and is unreliable (according to various SO posters).
This appears to suggest you wish to encode the bytes as a string, ready to decode them back to bytes the other end. If this is the case you have a number of choices, such as Base-64 encoding etc. If you want something simple you can just encode each byte as its two character hex value, pseudo code outline:
NSMutableString *encodedString = @"".mutableCopy;
foreach aByte in byteData
    [encodedString appendFormat:@"%02x", aByte];
The format %02x means two hexadecimal digits with zero padding. This results in a string which can be sent as JSON and decoded easily the other end. The byte size over the wire will probably be twice the byte length as UTF-8 is the recommended encoding for JSON over the wire.
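Since the outline above is pseudo-code, here is one way the same hex encoding could be written out in plain C (which also compiles inside an Objective-C file). The name hex_encode and the caller-supplied output buffer are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes two lowercase hex digits per input byte,
   like "%02x". out must have room for 2 * len + 1 bytes. */
void hex_encode(const unsigned char *in, size_t len, char *out)
{
    static const char digits[] = "0123456789abcdef";

    assert(in != NULL || len == 0);
    assert(out != NULL);

    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[in[i] >> 4];     /* high nibble */
        out[2 * i + 1] = digits[in[i] & 0x0F];   /* low nibble */
    }
    out[2 * len] = '\0';
}
```

The input would come from [imageData bytes] and [imageData length]; the receiver turns each pair of digits back into one byte.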
However in response to one of the answer you write:
But I need absolutely the raw bits.
What do you mean by this? Is your receiver going to interpret the JSON string it gets as a sequence of raw bytes? If so you have a number of problems to address. JSON strings are a subset of JavaScript strings and are stored as UCS-2 or UTF-16, that is they are sequences of 16-bit values, not 8-bit values. If you encode each byte into a character in a string then it will be represented using 16 bits; if your receiver can access the byte stream it has to skip every other byte. Of course if your receiver accesses the string a character at a time, each 16-bit character can be truncated back to an 8-bit byte. Now you might think that if you take this approach then each 8-bit byte can just be output as a character as part of a string, but that won't work. While all values 1-255 are valid Unicode character code points, and JavaScript/JSON allow NULs (0 value) in strings, not all those values are printable, you cannot put a double quote " into a string without escaping it, and the escape character is \ - all these will need to be encoded into the string. You'd end up with something like:
NSMutableString *encodedString = @"".mutableCopy;
foreach aByte in byteData
    if (isprint(aByte) && aByte != '"' && aByte != '\\')
        [encodedString appendFormat:@"%c", aByte];
    otherwise
        [encodedString appendFormat:@"\\u00%02x", aByte]; // JSON unicode escape sequence
This will produce a string which when parsed by a JSON decoder will give you one character (16-bits) for each byte, the top 8-bits being zero. However if you pass this string to a JSON encoder it will encode the unicode escape sequences, which are already encoded... So you really need to send this string over the wire yourself to avoid this...
Confused? Getting complicated? Well why are you trying to send binary byte data as a string? You never say what your high-level goal is or what, if anything, is known about the byte data (e.g. does it represent characters in some encoding).
If this is really just an array of bytes then why not send it as JSON array of numbers - a byte is just a number in the range 0-255. To do this you would use code along the lines of:
NSMutableArray *encodedBytes = [NSMutableArray new];
foreach aByte in byteData
    [encodedBytes addObject:@(aByte)]; // add aByte as an NSNumber object
Now pass encodedBytes to NSJSONSerialization and it will produce a JSON array of numbers to send over the wire; the receiver reverses the process, packing each number back into a byte buffer, and you have your bytes back.
This method avoids all issues of valid strings, encodings and escapes.
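To make the wire format concrete, here is a small plain-C sketch that renders a byte buffer as the text of a JSON array of numbers. It only illustrates what the serializer would emit; the helper name bytes_to_json_array and the fixed-capacity output buffer are assumptions of this sketch, and in an app NSJSONSerialization would do this step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes e.g. "[97,0,255]" into out (capacity cap bytes).
   Returns the number of characters written, or 0 if out is too small. */
size_t bytes_to_json_array(const unsigned char *in, size_t len,
                           char *out, size_t cap)
{
    size_t o = 0;

    assert(out != NULL);
    if (cap < 3) {
        return 0;                   /* need room for "[]" plus NUL */
    }
    out[o++] = '[';
    for (size_t i = 0; i < len; i++) {
        /* Worst case per element is ",255", then "]" and NUL must still fit. */
        if (cap - o < 6) {
            return 0;
        }
        if (i > 0) {
            out[o++] = ',';
        }
        o += (size_t)snprintf(out + o, cap - o, "%u", (unsigned)in[i]);
    }
    out[o++] = ']';
    out[o] = '\0';
    return o;
}
```

Every element is a plain number in the range 0-255, so no string escaping or text encoding is involved.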
HTH

Decoding partial UTF-8 into NSString

While fetching a UTF-8-encoded file over the network using the NSURLConnection class, there's a good chance the delegate's connection:didReceiveData: message will be sent with an NSData that cuts the UTF-8 stream mid-character, because UTF-8 is a multi-byte encoding scheme and a single character can be split across two separate NSData objects.
In other words, if I join all the data I get from connection:didReceiveData: I will have a valid UTF-8 file, but each separate piece of data is not necessarily valid UTF-8.
I do not want to store all the downloaded file in memory.
What I want is: given NSData, decode whatever you can into an NSString. In case the last few bytes of the NSData are an incomplete multi-byte sequence, tell me, so I can save them for the next NSData.
One obvious solution is repeatedly trying to decode using initWithData:encoding:, each time truncating the last byte, until success. This, unfortunately, can be very wasteful.
If you want to make sure that you don't stop in the middle of a UTF-8 multi-byte sequence, you're going to need to look at the end of the byte array and check the top 2 bits.
If the top bit is 0, then it's a single-byte (ASCII) character, and you're done.
If the top bit is 1 and the second-from-top is 0, then it is a continuation byte of a multi-byte sequence and might be the last byte of that sequence, so you will need to set the byte aside for later and then look at the preceding byte.
If the top bit is 1 and the second-from-top is also 1, then it is the beginning of a multi-byte sequence and you need to determine how many bytes are in the sequence by looking for the first 0 bit.
Look at the multi-byte table in the Wikipedia entry: http://en.wikipedia.org/wiki/UTF-8
// assumes that receivedData contains both the leftovers and the new data
const unsigned char *data = [receivedData bytes];
NSUInteger byteCount = [receivedData length];
if (byteCount < 1)
    return nil; // or @"";
unsigned char lastByte = data[byteCount-1];
if ((lastByte & 0x80) == 0) {
    NSString *newString = [[NSString alloc] initWithBytes:data length:byteCount
                                                 encoding:NSUTF8StringEncoding];
    // verify success
    // remove bytes from mutable receivedData, or set overflow to empty
    return newString;
}
// now eat all of the continuation bytes
NSUInteger backCount = 0;
while ((byteCount > 0) && ((lastByte & 0xc0) == 0x80)) {
    backCount++;
    byteCount--;
    if (byteCount > 0) {
        lastByte = data[byteCount-1]; // never read before the start of the buffer
    }
}
// at this point, either we have exhausted byteCount or we have the initial character
// if we exhaust the byte count we're probably in an illegal sequence, as we should
// always have the initial character in the receivedData
if (byteCount < 1) {
    // error!
    return nil;
}
// at this point, you can either use just byteCount, or you can compute the
// length of the sequence from the lastByte in order
// to determine if you have exactly the right number of characters to decode UTF-8.
NSUInteger requiredBytes = 0;
if ((lastByte & 0xe0) == 0xc0) { // 110xxxxx
    // 2 byte sequence
    requiredBytes = 1;
} else if ((lastByte & 0xf0) == 0xe0) { // 1110xxxx
    // 3 byte sequence
    requiredBytes = 2;
} else if ((lastByte & 0xf8) == 0xf0) { // 11110xxx
    // 4 byte sequence
    requiredBytes = 3;
} else if ((lastByte & 0xfc) == 0xf8) { // 111110xx
    // 5 byte sequence
    requiredBytes = 4;
} else if ((lastByte & 0xfe) == 0xfc) { // 1111110x
    // 6 byte sequence
    requiredBytes = 5;
} else {
    // shouldn't happen, illegal UTF8 seq
}
// now we know how many characters we need and we know how many
// (backCount) we have, so either use them, or take the
// introductory character away.
if (requiredBytes == backCount) {
    // we have the right number of bytes
    byteCount += backCount;
} else {
    // we don't have the right number of bytes, so remove the intro character
    byteCount -= 1;
}
NSString *newString = [[NSString alloc] initWithBytes:data length:byteCount
                                             encoding:NSUTF8StringEncoding];
// verify success
// remove byteCount bytes from mutable receivedData, or set overflow to the
// bytes between byteCount and [receivedData length]
return newString;
UTF-8 is a pretty simple encoding to parse and was designed to make it easy to detect incomplete sequences and, if you start in the middle of an incomplete sequence, to find its beginning.
Search backward from the end for a byte that's either <= 0x7f or >= 0xc0. If it's <= 0x7f, it's complete. If it's between 0xc0 and 0xdf, inclusive, it requires one following byte to be complete. If it's between 0xe0 and 0xef, it requires two following bytes to be complete. If it's between 0xf0 and 0xf7, it requires three following bytes to be complete.
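A compact plain-C sketch of that backward search follows. It assumes sequences of at most four bytes (current UTF-8) and only inspects the tail of the buffer; the name utf8_complete_prefix is invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns how many leading bytes of buf[0..len) can be
   decoded now without cutting a multi-byte UTF-8 sequence at the end.
   The remaining len - result bytes should be kept for the next chunk.
   Only the tail is inspected; the rest of the buffer is not validated. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t trailing = 0;   /* continuation bytes seen at the end */

    assert(buf != NULL || len == 0);

    /* Walk back over at most three continuation bytes (10xxxxxx). */
    while (i > 0 && trailing < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        trailing++;
    }
    if (i == 0) {
        return len;   /* no lead byte found: let the decoder report the error */
    }

    unsigned char lead = buf[i - 1];
    size_t needed;    /* continuation bytes the lead byte calls for */
    if (lead < 0x80)                needed = 0;   /* ASCII */
    else if ((lead & 0xE0) == 0xC0) needed = 1;
    else if ((lead & 0xF0) == 0xE0) needed = 2;
    else if ((lead & 0xF8) == 0xF0) needed = 3;
    else return len;  /* not a valid lead byte: leave it to the decoder */

    /* If the last sequence is incomplete, stop just before its lead byte. */
    return trailing < needed ? i - 1 : len;
}
```

In connection:didReceiveData: you would decode that many bytes with initWithBytes:length:encoding: and carry the rest over to the next call.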
I have a similar problem - partly decoding utf8
before
NSString * adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
adsInfo->adsTopic = malloc(sizeof(char) * adsTopic.length + 1);
strncpy(adsInfo->adsTopic, [adsTopic UTF8String], adsTopic.length + 1);
after [solved]
NSString *adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
NSUInteger byteCount = [adsTopic lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
NSLog(@"number of UTF-8 bytes in the string topic == %lu", (unsigned long)byteCount);
adsInfo->adsTopic = malloc(byteCount + 1);
strncpy(adsInfo->adsTopic, [adsTopic UTF8String], byteCount + 1);
NSString *text = [NSString stringWithCString:adsInfo->adsTopic encoding:NSUTF8StringEncoding];
NSLog(@"=== %@", text);

Most efficient way to iterate over all the chars in an NSString

What's the best way to iterate over all the chars in an NSString? Would you want to loop over the length of the string and use this method:
[aNSString characterAtIndex:index];
or would you want to use a char buffer based on the NSString?
I think it's important that people understand how to deal with unicode, so I ended up writing a monster answer, but in the spirit of tl;dr I will start with a snippet that should work fine. If you want to know details (which you should!), please continue reading after the snippet.
NSUInteger len = [str length];
unichar buffer[len+1];
[str getCharacters:buffer range:NSMakeRange(0, len)];
NSLog(@"getCharacters:range: with unichar buffer");
for(int i = 0; i < len; i++) {
    NSLog(@"%C", buffer[i]);
}
Still with me? Good!
The current accepted answer seems to be confusing bytes with characters/letters. This is a common problem when encountering unicode, especially from a C background. Strings in Objective-C are represented as unicode characters (unichar, a 16-bit UTF-16 code unit), which are bigger than bytes and shouldn't be used with standard C string manipulation functions.
(Edit: This is not the full story! To my great shame, I'd completely forgotten to account for composable characters, where a "letter" is made up of multiple unicode codepoints. This gives you a situation where you can have one "letter" resolving to multiple unichars, which in turn are multiple bytes each. Hoo boy. Please refer to this great answer for the details on that.)
The proper answer to the question depends on whether you want to iterate over the characters/letters (as distinct from the type char) or the bytes of the string (what the type char actually means). In the spirit of limiting confusion, I will use the terms byte and letter from now on, avoiding the possibly ambiguous term character.
If you want to do the former and iterate over the letters in the string, you need to exclusively deal with unichars (sorry, but we're in the future now, you can't ignore it anymore). Finding the amount of letters is easy, it's the string's length property. An example snippet is as such (same as above):
NSUInteger len = [str length];
unichar buffer[len+1];
[str getCharacters:buffer range:NSMakeRange(0, len)];
NSLog(@"getCharacters:range: with unichar buffer");
for(int i = 0; i < len; i++) {
    NSLog(@"%C", buffer[i]);
}
If, on the other hand, you want to iterate over the bytes in a string, it starts getting complicated and the result will depend entirely upon the encoding you choose to use. The decent default choice is UTF8, so that's what I will show.
Doing this you have to figure out how many bytes the resulting UTF8 string will be, a step where it's easy to go wrong and use the string's -length. One main reason this is very easy to get wrong, especially for a US developer, is that a string with letters falling into the 7-bit ASCII spectrum will have equal byte and letter lengths. This is because UTF8 encodes 7-bit ASCII letters with a single byte, so a simple test string and basic English text might work perfectly fine.
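To see the mismatch in isolation, here is a small plain-C sketch that counts code points in a UTF-8 buffer and compares that with strlen. The helper name utf8_code_points is made up, and it assumes valid UTF-8 input. Note that it counts code points, which matches NSString's -length only while every letter fits in a single unichar.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts code points in a NUL-terminated UTF-8 string
   by counting every byte that is not a continuation byte (10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

For the Cyrillic word буква, written as the bytes "\xd0\xb1\xd1\x83\xd0\xba\xd0\xb2\xd0\xb0", strlen reports 10 while this function reports 5; for "abc" both report 3, which is why ASCII-only test strings hide the bug.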
The proper way to do this is to use the method -lengthOfBytesUsingEncoding:NSUTF8StringEncoding (or other encoding), allocate a buffer with that length, then convert the string to the same encoding with -cStringUsingEncoding: and copy it into that buffer. Example code here:
NSUInteger byteLength = [str lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
char proper_c_buffer[byteLength+1];
strncpy(proper_c_buffer, [str cStringUsingEncoding:NSUTF8StringEncoding], byteLength);
NSLog(@"strncpy with proper length");
for(int i = 0; i < byteLength; i++) {
    NSLog(@"%c", proper_c_buffer[i]);
}
Just to drive the point home as to why it's important to keep things straight, I will show example code that handles this iteration in four different ways, two wrong and two correct. This is the code:
#import <Foundation/Foundation.h>
int main() {
    NSString *str = @"буква";
    NSUInteger len = [str length];
    // Try to store unicode letters in a char array. This will fail horribly
    // because getCharacters:range: takes a unichar array and will probably
    // overflow or do other terrible things. (the compiler will warn you here,
    // but warnings get ignored)
    char c_buffer[len+1];
    [str getCharacters:c_buffer range:NSMakeRange(0, len)];
    NSLog(@"getCharacters:range: with char buffer");
    for(int i = 0; i < len; i++) {
        NSLog(@"Byte %d: %c", i, c_buffer[i]);
    }
    // Copy the UTF string into a char array, but use the amount of letters
    // as the buffer size, which will truncate many non-ASCII strings.
    strncpy(c_buffer, [str UTF8String], len);
    NSLog(@"strncpy with UTF8String");
    for(int i = 0; i < len; i++) {
        NSLog(@"Byte %d: %c", i, c_buffer[i]);
    }
    // Do It Right (tm) for accessing letters by making a unichar buffer with
    // the proper letter length
    unichar buffer[len+1];
    [str getCharacters:buffer range:NSMakeRange(0, len)];
    NSLog(@"getCharacters:range: with unichar buffer");
    for(int i = 0; i < len; i++) {
        NSLog(@"Letter %d: %C", i, buffer[i]);
    }
    // Do It Right (tm) for accessing bytes, by using the proper
    // encoding-handling methods
    NSUInteger byteLength = [str lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
    char proper_c_buffer[byteLength+1];
    const char *utf8_buffer = [str cStringUsingEncoding:NSUTF8StringEncoding];
    // We copy here because the documentation tells us the string can disappear
    // under us and we should copy it. Just to be safe
    strncpy(proper_c_buffer, utf8_buffer, byteLength);
    NSLog(@"strncpy with proper length");
    for(int i = 0; i < byteLength; i++) {
        NSLog(@"Byte %d: %c", i, proper_c_buffer[i]);
    }
    return 0;
}
Running this code will output the following (with NSLog cruft trimmed out), showing exactly HOW different the byte and letter representations can be (the two last outputs):
getCharacters:range: with char buffer
Byte 0: 1
Byte 1:
Byte 2: C
Byte 3:
Byte 4: :
strncpy with UTF8String
Byte 0: Ð
Byte 1: ±
Byte 2: Ñ
Byte 3:
Byte 4: Ð
getCharacters:range: with unichar buffer
Letter 0: б
Letter 1: у
Letter 2: к
Letter 3: в
Letter 4: а
strncpy with proper length
Byte 0: Ð
Byte 1: ±
Byte 2: Ñ
Byte 3:
Byte 4: Ð
Byte 5: º
Byte 6: Ð
Byte 7: ²
Byte 8: Ð
Byte 9: °
While Daniel's solution will probably work most of the time, I think the solution is dependent on the context. For example, I have a spelling app and need to iterate over each character as it appears onscreen which may not correspond to the way it is represented in memory. This is especially true for text provided by the user.
Using something like this category on NSString:
- (void) dumpChars
{
    NSMutableArray *chars = [NSMutableArray array];
    NSUInteger len = [self length];
    unichar buffer[len+1];
    [self getCharacters: buffer range: NSMakeRange(0, len)];
    for (int i=0; i<len; i++) {
        [chars addObject: [NSString stringWithFormat: @"%C", buffer[i]]];
    }
    NSLog(@"%@ = %@", self, [chars componentsJoinedByString: @", "]);
}
And feeding it a word like mañana might produce:
mañana = m, a, ñ, a, n, a
But it could just as easily produce:
mañana = m, a, n, ̃, a, n, a
The former will be produced if the string is in precomposed unicode form and the latter if it's in decomposed form.
You might think this could be avoided by using the result of NSString's precomposedStringWithCanonicalMapping or precomposedStringWithCompatibilityMapping, but this is not necessarily the case as Apple warns in Technical Q&A 1225. For example a string like e̊gâds (which I totally made up) still produces the following even after converting to a precomposed form.
e̊gâds = e, ̊, g, â, d, s
The solution for me is to use NSString's enumerateSubstringsInRange passing NSStringEnumerationByComposedCharacterSequences as the enumeration option. Rewriting the earlier example to look like this:
- (void) dumpSequences
{
    NSMutableArray *chars = [NSMutableArray array];
    [self enumerateSubstringsInRange: NSMakeRange(0, [self length]) options: NSStringEnumerationByComposedCharacterSequences
        usingBlock: ^(NSString *inSubstring, NSRange inSubstringRange, NSRange inEnclosingRange, BOOL *outStop) {
            [chars addObject: inSubstring];
        }];
    NSLog(@"%@ = %@", self, [chars componentsJoinedByString: @", "]);
}
If we feed this version e̊gâds then we get
e̊gâds = e̊, g, â, d, s
as expected, which is what I want.
The section of documentation on Characters and Grapheme Clusters may also be helpful in explaining some of this.
Note: Looks like some of the unicode strings I used are tripping up SO when formatted as code. The strings I used are mañana, and e̊gâds.
Neither. The "Optimize Your Text Manipulations" section of the "Cocoa Performance Guidelines" in the Xcode Documentation recommends:
If you want to iterate over the
characters of a string, one of the
things you should not do is use the
characterAtIndex: method to retrieve
each character separately. This method
is not designed for repeated access.
Instead, consider fetching the
characters all at once using the
getCharacters:range: method and
iterating over the bytes directly.
If you want to search a string for
specific characters or substrings, do
not iterate through the characters one
by one. Instead, use higher level
methods such as rangeOfString:,
rangeOfCharacterFromSet:, or
substringWithRange:, which are
optimized for searching the NSString
characters.
See this Stack Overflow answer on How to remove whitespace from right end of NSString for an example of how to let rangeOfCharacterFromSet: iterate over the characters of the string instead of doing it yourself.
I would definitely get a char buffer first, then iterate over that.
NSString *someString = ...
unsigned int len = [someString length];
char buffer[len];
//This way:
strncpy(buffer, [someString UTF8String], len);
//Or this way (preferred):
[someString getCharacters:buffer range:NSMakeRange(0, len)];
for(int i = 0; i < len; ++i) {
char current = buffer[i];
//do something with current...
}
Try enumerating the string with blocks.
Create a category on NSString:
.h
@interface NSString (Category)
- (void)enumerateCharactersUsingBlock:(void (^)(NSString *character, NSInteger idx, bool *stop))block;
@end
.m
@implementation NSString (Category)
- (void)enumerateCharactersUsingBlock:(void (^)(NSString *character, NSInteger idx, bool *stop))block
{
    bool _stop = NO;
    for(NSInteger i = 0; i < [self length] && !_stop; i++)
    {
        NSString *character = [self substringWithRange:NSMakeRange(i, 1)];
        block(character, i, &_stop);
    }
}
@end
example
NSString *string = @"Hello World";
[string enumerateCharactersUsingBlock:^(NSString *character, NSInteger idx, bool *stop) {
    NSLog(@"char %@, i: %li",character, (long)idx);
}];
This is a slightly different solution to the question, but I thought it might be useful for someone. What I wanted was to iterate over the actual unicode characters in an NSString. So, I found this solution:
NSString * str = @"hello 🤠💩";
NSRange range = NSMakeRange(0, str.length);
[str enumerateSubstringsInRange:range
                        options:NSStringEnumerationByComposedCharacterSequences
                     usingBlock:^(NSString *substring, NSRange substringRange,
                                  NSRange enclosingRange, BOOL *stop)
{
    NSLog(@"%@", substring);
}];
Although you would technically be getting individual NSString values, here is an alternative approach:
NSRange range = NSMakeRange(0, 1);
for (__unused int i = range.location; range.location < [aNSString length]; range.location++) {
    NSLog(@"%@", [aNSString substringWithRange:range]);
}
(The __unused int i bit is necessary to silence the compiler warning.)
You should not use
NSUInteger len = [str length];
unichar buffer[len+1];
you should use memory allocation
NSUInteger len = [str length];
unichar* buffer = (unichar*) malloc((len+1)*sizeof(unichar));
and in the end use
free(buffer);
in order to avoid memory problems.