NSUnicodeStringEncoding prepends FFFE to every string - objective-c

I'm trying to append a string to a file by encoding it as NSUnicodeStringEncoding first. I'm doing this:
NSData *data = [@"data" dataUsingEncoding:NSUnicodeStringEncoding];
NSFileHandle *output = [NSFileHandle fileHandleForUpdatingAtPath:@"file"];
[output seekToEndOfFile];
[output writeData:data];
If I do this a number of times and then take a look at the file I notice that every string added has FFFE prepended to it. But when I switch from NSUnicodeStringEncoding to NSUTF8StringEncoding this prefix goes away.

That's called a byte order mark (BOM), and it is put there because NSUnicodeStringEncoding doesn't specify whether the characters are stored in big- or little-endian order.
To prevent 0xFFFE or 0xFEFF from appearing at the beginning of a string, use one of NSUTF16BigEndianStringEncoding, NSUTF16LittleEndianStringEncoding, NSUTF32BigEndianStringEncoding, or NSUTF32LittleEndianStringEncoding, depending on your specific needs. (For reference: Intel and ARM processors as used by Apple are little endian.)
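If you already have files with these prefixes in them, a small check like the following can tell you which byte order a chunk was written in. This is only a sketch in plain C (which you can call from Objective-C); the names detect_utf16_bom and utf16_bom are made up for illustration, and it assumes the data really is UTF-16.

```c
#include <assert.h>
#include <stddef.h>

/* Byte order indicated by a leading UTF-16 byte order mark (U+FEFF).
   These names are illustrative, not part of any Apple API. */
typedef enum { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE } utf16_bom;

/* Inspect the first two bytes of a buffer for a UTF-16 BOM. */
utf16_bom detect_utf16_bom(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (len >= 2) {
        /* U+FEFF stored low byte first: FF FE */
        if (data[0] == 0xFF && data[1] == 0xFE) return BOM_UTF16_LE;
        /* U+FEFF stored high byte first: FE FF */
        if (data[0] == 0xFE && data[1] == 0xFF) return BOM_UTF16_BE;
    }
    return BOM_NONE;
}
```

Seeing FF FE in a hex dump therefore means the string was encoded little endian, which matches the processors mentioned above.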

Related

Parsing file with percent signs (%) in Objective-C

I'm writing a parser for fortune files. Fortune is a small app on *nix platforms that just prints out a random "fortune". The fortune files are straight text, with each fortune being separated by a percent sign on its own line. For example:
A little suffering is good for the soul.
-- Kirk, "The Corbomite Maneuver", stardate 1514.0
%
A man either lives life as it happens to him, meets it head-on and
licks it, or he turns his back on it and starts to wither away.
-- Dr. Boyce, "The Menagerie" ("The Cage"), star date unknown
%
What I've found is that when parsing the file, stringWithContentsOfFile returns a string with the % signs in place. For example:
@"A little suffering is good for the soul.\n\t\t-- Kirk, \"The Corbomite Maneuver\", stardate 1514.0\n%\nA man either lives life as it happens to him, meets it head-on and\nlicks it, or he turns his back on it and starts to wither away.\n\t\t-- Dr. Boyce, \"The Menagerie\" (\"The Cage\"), stardate unknown\n%"
However, when I call componentsSeparatedByCharactersInSet on the file contents, everything is parsed as a string, with the exception of the percent signs, which are NSTaggedPointerString. When I print out the lines, the percent signs are gone.
Is this because the percent sign is a format specifier for strings? I would think in that case that the initial content pull would escape those.
Here's the code:
NSFileManager *fileManager;
fileManager = [NSFileManager defaultManager];
NSStringEncoding stringEncoding;
// NSString *fileContents = [NSString stringWithContentsOfFile:fileName encoding:NSASCIIStringEncoding error:nil];
NSString *fileContents = [NSString stringWithContentsOfFile:fileName usedEncoding:&stringEncoding error:nil];
NSArray *fileLines = [fileContents componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];
The used encoding ends up being UTF-8. You can see I have also tried specifying plain ASCII, but it yields the same results.
So the question is, how do I retain the percent signs? Or maybe I should use the percent sign as the separator character and then parse each of the resulting pieces individually.
You are calling NSLog() but passing the line strings as the format string. Something like:
NSLog(lineString);
Therefore, any percent characters in the line strings are interpreted as format specifiers. You should (almost) never pass strings that come from outside sources — i.e. strings which are not hard-coded in your code — as format strings to any function (NSLog(), printf(), +[NSString stringWithFormat:], etc.). It's not safe and you'll sometimes get unexpected results like you've seen.
You should always log a single string like this:
NSLog(@"%@", lineString);
That is, you need to pass a hard-coded format string and use the foreign string as data for that to format.
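The same rule applies to the C formatting functions. As a rough sketch (format_line is a hypothetical helper name, not something from the question), the untrusted line is only ever passed as an argument to a fixed "%s" format:

```c
#include <assert.h>
#include <stdio.h>

/* Format an untrusted line safely: the line is data for a hard-coded "%s"
   format and is never used as the format string itself.
   Returns what snprintf returns: the length the full output would have. */
int format_line(char *out, size_t out_size, const char *line)
{
    assert(out != NULL && line != NULL);
    return snprintf(out, out_size, "%s", line);
}
```

Because the percent signs are in the data rather than in the format, they come through untouched.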
NSTaggedPointerString is just a subclass of NSString. You can use it anywhere an NSString is expected.
But in your string
@"A little suffering is good for the soul.\n\t\t-- Kirk, \"The Corbomite Maneuver\", stardate 1514.0\n%\nA man either lives life as it happens to him, meets it head-on and\nlicks it, or he turns his back on it and starts to wither away.\n\t\t-- Dr. Boyce, \"The Menagerie\" (\"The Cage\"), stardate unknown\n%"
the % sign is not treated as a literal percent sign when the string is used as a format string. In a format string, a literal percent sign is written by doubling the % mark:
@"%%"

Objective-C / C Convert UTF8 Literally to Real string

I'm wondering how to convert
NSString = "\xC4"; ....
to real NSString represented in normal format
Fundamentally related to xcode UTF-8 literals. Of course, it is ambiguous what you actually mean by "\xC4" - without an encoding specified, it means nothing.
If you mean the character whose Unicode code point is 0x00C4 then I would think (though I haven't tested) that this will do what you want.
NSString *s = @"\u00C4";
First are you sure you have \xC4 in your string? Consider:
NSString *one = @"\xC4\x80";
NSString *two = @"\\xC4\\x80";
NSLog(@"%@ | %@", one, two);
This will output:
Ā | \xC4\x80
If you are certain your string contains the four characters \xC4, are you sure it is UTF-8 encoded as ASCII? Above you will see I added \x80. This is because \xC4 on its own is not valid UTF-8; it is the first byte of a two-byte sequence. Maybe you have only shown a sample of your input and the second byte is present. If not, you do not have UTF-8 encoded as ASCII.
If you are certain it is UTF-8 encoded as ASCII, you will have to convert it yourself. It might seem the Cocoa string encoding methods would handle it, especially as what you appear to have is a string as it might be written in Objective-C source code. Unfortunately the obvious encoding, NSNonLossyASCIIStringEncoding, only handles octal and Unicode escapes, not the hexadecimal escapes in your string.
You can use any algorithm you like to convert it. One choice would be a simple finite state machine which scans the input a byte at a time, recognises the four-byte sequence \, x, hex-digit, hex-digit, and combines the two hex digits into a single byte. NSString is not the best choice for byte-at-a-time string processing, so you may be better off converting to C strings, e.g.:
// sample input, all characters should be ASCII
NSString *input = @"\\xC4\\x80";
// obtain a C string containing the ASCII characters
const char *cInput = [input cStringUsingEncoding:NSASCIIStringEncoding];
// allocate a buffer of the correct length for the result
// (the decoded output is never longer than the input)
char cOutput[strlen(cInput)+1];
// call your function to decode the hexadecimal escapes
convertAsciiEncodedUTF8(cInput, cOutput);
// create a NSString from the result
NSString *output = [NSString stringWithCString:cOutput encoding:NSUTF8StringEncoding];
You just need to write the finite state machine, or other algorithm, for convertAsciiEncodedUTF8.
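For what it's worth, here is a minimal sketch of what convertAsciiEncodedUTF8 could look like in C. It is my assumption about the behaviour wanted, not a tested library routine: it uses a simple lookahead scan rather than an explicit state machine, decodes only \x followed by exactly two hex digits, and copies everything else unchanged. The helper hex_value is made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit; the caller has already checked isxdigit(). */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;
}

/* Copy `in` to `out`, replacing each \xHH escape with the byte it names.
   Anything else is copied unchanged. `out` must have room for
   strlen(in) + 1 bytes. Returns the number of bytes written, not counting
   the terminating NUL. */
size_t convertAsciiEncodedUTF8(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    while (*in != '\0') {
        /* Short-circuit evaluation stops at the first mismatch, so this
           never reads past the terminating NUL. */
        if (in[0] == '\\' && in[1] == 'x' &&
            isxdigit((unsigned char)in[2]) && isxdigit((unsigned char)in[3])) {
            out[n++] = (char)(hex_value(in[2]) * 16 + hex_value(in[3]));
            in += 4;
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

One thing to watch: an input containing \x00 would put a NUL byte in the middle of the output, which a C-string based conversion back to NSString would treat as the end of the string.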
(If you write an algorithm and it fails ask another question showing your code, somebody will probably help you. But don't expect someone to write it for you.)
HTH

ASCII Code to NSData

I'm trying to figure out ESC/POS commands and I need the code "GS" (ASCII code 29) put into NSData.
Currently I can put the strings I want to print without problems using the code:
NSString *str = @"Text I want to print";
NSData *data = [str dataUsingEncoding:NSASCIIStringEncoding];
Is there any easy way to do that using either C++ or OBJ-C?
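C, C++, and Objective-C let you put arbitrary ASCII codes into a string using so-called escape sequences.
A hexadecimal escape sequence starts with \x followed by one or more hex digits, and an octal escape sequence is a backslash followed by one to three octal digits. Note that a hex escape consumes every hex digit that follows it, so be careful when the next character of the string is itself a valid hex digit.
ASCII GS is 29 in decimal, 1D in hex, or 035 in octal, so you can put a GS in an NSData like this:
NSData *data = [@"\x1D" dataUsingEncoding:NSASCIIStringEncoding];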
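If you would rather avoid string escapes altogether, you can assemble the bytes yourself. The following is just a sketch: build_gs_sequence and ASCII_GS are names I made up, and the bytes that follow GS are whatever your printer's command reference calls for (nothing here is a real ESC/POS command).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ASCII_GS 0x1D /* group separator, decimal 29 */

/* Write GS followed by the bytes of `rest` into `out`. No NUL terminator is
   added, since command bytes are raw data. Returns the number of bytes
   written, or 0 if `out` is too small. */
size_t build_gs_sequence(unsigned char *out, size_t out_size, const char *rest)
{
    assert(out != NULL && rest != NULL);
    size_t rest_len = strlen(rest);
    if (out_size < rest_len + 1) return 0;
    out[0] = ASCII_GS;
    memcpy(out + 1, rest, rest_len);
    return rest_len + 1;
}
```

The filled buffer and the returned length can then be handed to +[NSData dataWithBytes:length:].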

How do I read a specific line from a large text file with Objective-C?

Say I have text file my.txt like this
this is line 1
this is line 2
....
this is line 999999
this is line 1000000
In Unix I can get the line of "this is line 1000" by issuing command like "head -1000 my.txt | tail -1". What is the corresponding way to get this in Objective-C?
If it's not too inefficient to have the whole thing in memory at once then the most compact sequence of calls (which I've expanded onto multiple lines for simpler exposition) would be:
NSError *error = nil;
NSString *sourceString = [NSString stringWithContentsOfFile:@"..."
                                                   encoding:NSUTF8StringEncoding
                                                      error:&error];
NSArray *lines = [sourceString componentsSeparatedByCharactersInSet:
                     [NSCharacterSet newlineCharacterSet]];
// Arrays are zero-indexed, so line 1000 is at index 999.
NSString *relevantLine = [lines objectAtIndex:999];
You should check the value of error and the count of lines for validation.
EDIT: to compare to Nathan's answer, the benefit of splitting by characters in set is that you'll accept any of the Unicode characters that can delimit a line break. Note, though, that adjacent separator characters produce empty strings in the result, so a \r\n pair yields an extra empty component that you need to account for when indexing.
NSInputStream is probably what you're going to have to deal with if memory footprint is an issue, which is barely more evolved than C's stdio.h fopen/fread/etc so you're going to have to write your own little loop to dash through.
The answer does not explain how to read a file too LARGE to keep in memory. There is no nice solution in Objective-C for reading large text files without putting them into memory (which isn't always an option).
In these cases I like to use the C functions:
FILE *file = fopen("path to my file", "r");
if (file != NULL) {
    size_t length;
    char *cLine;
    // fgetln returns NULL at end of file or on error
    while ((cLine = fgetln(file, &length)) != NULL) {
        // copy the line, because fgetln does not NUL-terminate it
        char str[length + 1];
        strncpy(str, cLine, length);
        str[length] = '\0';
        NSString *line = [NSString stringWithFormat:@"%s", str];
        // Do what you want here.
    }
    fclose(file);
}
Note that the line returned by fgetln includes the trailing newline character, if there is one, and is not NUL-terminated. We add 1 to the length of str because we want to make space for the NUL terminator.
The simplest is to just load the file using one of the NSString file methods and then use the -[NSString componentsSeparatedByString:] method to get an array of every line.
Or you could use NSScanner to scan for newline/carriage return characters, counting them until you get to your line of interest.
If you are really concerned about memory usage you could look at NSInputStream and use that to read in the file, keeping count of the number of newlines. It's a shame that NSScanner doesn't work with NSInputStream.
I don't think this is an exact duplicate, because it sounds like you want to skip some lines in the file, but you could easily use an approach like the one here:
Objective-C: Reading a file line by line (Specific answer that has some sample code)
Loop on the input file, reading in a chunk of data, and look for newlines. Count them up and when you hit the right number, output the data after that one and until the next.
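A rough sketch of that loop in plain C, in case it helps. The function name read_nth_line and the 4096-byte chunk size are my own choices, and it assumes lines end in \n (with \r\n files the \r stays at the end of the result).

```c
#include <assert.h>
#include <stdio.h>

/* Copy line number `target` (1-based) of the file at `path` into `out`,
   without the trailing newline. Lines longer than out_size - 1 bytes are
   truncated. Returns 1 if the line exists, 0 otherwise. */
int read_nth_line(const char *path, unsigned long target,
                  char *out, size_t out_size)
{
    assert(path != NULL && out != NULL && out_size > 0 && target > 0);
    FILE *file = fopen(path, "rb");
    if (file == NULL) return 0;

    char chunk[4096];
    unsigned long line = 1; /* number of the line currently being scanned */
    size_t used = 0;        /* bytes of the target line copied so far */
    int found = 0;          /* saw at least one byte of the target line */
    int done = 0;           /* saw the newline that ends the target line */
    size_t got;

    while (!done && (got = fread(chunk, 1, sizeof chunk, file)) > 0) {
        for (size_t i = 0; i < got && !done; i++) {
            if (chunk[i] == '\n') {
                if (line == target) done = 1;
                else line++;
            } else if (line == target) {
                found = 1;
                if (used + 1 < out_size) out[used++] = chunk[i];
            }
        }
    }
    fclose(file);
    out[used] = '\0';
    return found || done;
}
```

Only one chunk is held in memory at a time, and reading stops as soon as the requested line is complete.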
Your example looks like you might have hundreds of thousands of lines, so definitely don't just read in the file into a NSString, and definitely don't convert it to an NSArray.
If you want to do it the fancier NSInputStream way (which has some key advantages in character set decoding), here is a great example that shows the basic idea of polling to consume all of the data from a stream source (for a file, it's somewhat overkill). It's for output, but the idea is fine for input too:
Polling versus Run Loop Scheduling

Problem creating UTF8 text file with NSFileHandle

I want to use NSFileHandle to write large text files to avoid handling very large NSStrings in memory. I'm having a problem where after creating the file and opening it in the Text Edit app (Mac), it is not displaying the Unicode characters correctly. If I write the same text to a file using the NSString writeToFile:atomically:encoding:error: method, Text Edit displays everything correctly.
I'm opening both the files in Text Edit with the "opening files encoding" option set to automatic, so I'm not sure why one works and the other method doesn't. Is there some form of header to declare the format is UTF8?
// Standard string
NSString *myString = @"This is a test with a star character \u272d";
// This works fine
// Displays: "This is a test with a star character ✭" in Text Edit
[myString writeToFile:path atomically:YES encoding:NSUTF8StringEncoding error:nil];
// This doesn't work
// Displays: "This is a test with a star character ‚ú≠" in Text Edit
[fileManager createFileAtPath:path contents:nil attributes:nil];
fileHandle = [NSFileHandle fileHandleForWritingAtPath:path];
[fileHandle writeData:[myString dataUsingEncoding:NSUTF8StringEncoding]];
The problem is not with your code, but with TextEdit: It doesn't try to decode the file as UTF-8 unless it has a UTF-8 BOM identifying it as such. Presumably, the first version of your code adds such a BOM. See this question for further discussion.
UTF-8 data generally should not include a BOM, so you probably shouldn't modify your code from the second version at all—it's working correctly. If opening the file in TextEdit has to work, you should be able to force the BOM by including it (\ufeff) explicitly at the start of the string, but, again, you should not do that unless you really need to.
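If you want to confirm whether a given file actually starts with a BOM, you can check its first three bytes. This is a small sketch in C; has_utf8_bom and UTF8_BOM are illustrative names, and the byte values are simply the UTF-8 encoding of U+FEFF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The UTF-8 encoding of U+FEFF, which some editors treat as a signature. */
static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/* Return 1 if the buffer starts with the UTF-8 BOM, 0 otherwise. */
int has_utf8_bom(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    return len >= sizeof UTF8_BOM &&
           memcmp(data, UTF8_BOM, sizeof UTF8_BOM) == 0;
}
```

Running this over the start of each of your two files would show whether the BOM is really what differs between them.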