Ints to Bytes: Endianess a Concern? - objective-c

Do I have to worry about endianness in this case (integers MUST be 0-127):
int a = 120;
int b = 100;
int c = 50;
char theBytes[] = {a, b, c};
I think that, since each integer sits in its own byte, I don't have to worry about Endianess in passing the byte array between systems. This has also worked out empirically. Am I missing something?

Endianness only affects the ordering of bytes within an individual value. Individual bytes are not subject to endian issues, and arrays are always sequential, so byte arrays are the same on big- and little-endian architectures.
Note that this doesn't necessarily mean that only using chars will make datatypes 100% byte-portable. Structs may still include architecture-dependent padding, for example, and one system may have unsigned chars while another uses signed (though I see you sidestep this by only allowing 0-127).

No, you don't need to worry, compiler produces code which makes correct casting and assignment.

Related

Is there any reason to use NSInteger instead of uint8_t with NS_ENUM?

The general standard appears to use NS_ENUM with NSInteger as the base type. Why is this the case? Assuming less than 256 cases (which covers almost any enumeration), is there any reason to use that instead of uint8_t, which could use less memory space? Either imports into Swift fine.
This is different than NS_OPTIONS, where a larger type makes sense, since you shouldn't be doing any bit math with enumerations, and you can use every number representable by the base type as a value.
The answer to the question in the title:
Is there any reason to use NSInteger instead of uint8_t with NS_ENUM?
is probably not.
When declaring an enum in C if no underlying type is specified the compiler is free to choose any suitable type from char and the signed and unsigned integer types which can at least represent all the values required. The current Xcode/Clang compiler picks a 4-byte integer. One could reasonably assume the compiler writers made an informed choice - some balance of performance and storage.
Smaller types, such as uint8_t, will usually be aligned on smaller boundaries in memory (or on disc) - but that is only of benefit if the adjacent field matches the alignment e.g. if a 2-byte size typed field follows a 1-byte sized typed field then unless otherwise specified (e.g. with a #pragma packed) there will probably be an intervening unused byte.
Whether any performance or storage differences are significant will be heavily dependent on the application. Follow the usual rule of thumb - don't optimise until an issue is found.
However if you find semantic benefit in limiting the size then certainly do so - there is no general reason you shouldn't. The choice is similar to picking signed vs. unsigned integers, some programmers avoid unsigned types for values that will be ≥ 0 unless absolutely required for the extra range, while others appreciate the semantic benefit.
Summary: There is no right answer, its largely a subjective issue.
HTH
First of all: The memory footprint is close to completely meaningless. You are talking about 1 Byte vs. 4/8 Bytes. (If the memory alignment does not force the usage of 4/8 bytes whatever you chosed.) How many NS_ENUM (C) objects do you want to have in your running app?
I guess that the reason is pretty easy: NSInteger is akin of "catch all" integer type in Cocoa. That makes assignments easier, especially you do not have to care about assigning a bigger integer type to a smaller one. Without casting this would lead to warnings.
Having more than one integer type in a desktop app with a 32/64 bit model is akin of an anachronism. Nor a Mac neither a MacBook neither an iPhone is an embedded micro controller …
You can use any integer data type including uint8_t with NS_ENUM as.
typedef NS_ENUM(uint8_t, eEnumAddEditViewMode) {
eWBEnumAddMode,
eWBEnumEditMode
};
In old c style standard NSInteger is default, because NSInteger is akin of "catch all" integer type in objective c. and developer can easily type boxing and unboxing with their own variable. This is just developer friendly best practise.

Difference between Objective-C primitive numbers

What is the difference between objective-c C primitive numbers? I know what they are and how to use them (somewhat), but I'm not sure what the capabilities and uses of each one is. Could anyone clear up which ones are best for some scenarios and not others?
int
float
double
long
short
What can I store with each one? I know that some can store more precise numbers and some can only store whole numbers. Say for example I wanted to store a latitude (possibly retrieved from a CLLocation object), which one should I use to avoid loosing any data?
I also noticed that there are unsigned variants of each one. What does that mean and how is it different from a primitive number that is not unsigned?
Apple has some interesting documentation on this, however it doesn't fully satisfy my question.
Well, first off types like int, float, double, long, and short are C primitives, not Objective-C. As you may be aware, Objective-C is sort of a superset of C. The Objective-C NSNumber is a wrapper class for all of these types.
So I'll answer your question with respect to these C primitives, and how Objective-C interprets them. Basically, each numeric type can be placed in one of two categories: Integer Types and Floating-Point Types.
Integer Types
short
int
long
long long
These can only store, well, integers (whole numbers), and are characterized by two traits: size and signedness.
Size means how much physical memory in the computer a type requires for storage, that is, how many bytes. Technically, the exact memory allocated for each type is implementation-dependendant, but there are a few guarantees: (1) char will always be 1 byte (2) sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long).
Signedness means, simply whether or not the type can represent negative values. So a signed integer, or int, can represent a certain range of negative or positive numbers (traditionally –2,147,483,648 to 2,147,483,647), and an unsigned integer, or unsigned int can represent the same range of numbers, but all positive (0 to 4,294,967,295).
Floating-Point Types
float
double
long double
These are used to store decimal values (aka fractions) and are also categorized by size. Again the only real guarantee you have is that sizeof(float) <= sizeof(double) <= sizeof (long double). Floating-point types are stored using a rather peculiar memory model that can be difficult to understand, and that I won't go into, but there is an excellent guide here.
There's a fantastic blog post about C primitives in an Objective-C context over at RyPress. Lots of intro CPS textbooks also have good resources.
Firstly I would like to specify the difference between au unsigned int and an int. Say that you have a very high number, and that you write a loop iterating with an unsigned int:
for(unsigned int i=0; i< N; i++)
{ ... }
If N is a number defined with #define, it may be higher that the maximum value storable with an int instead of an unsigned int. If you overflow i will start again from zero and you'll go in an infinite loop, that's why I prefer to use an int for loops.
The same happens if for mistake you iterate with an int, comparing it to a long. If N is a long you should iterate with a long, but if N is an int you can still safely iterate with a long.
Another pitfail that may occur is when using the shift operator with an integer constant, then assigning it to an int or long. Maybe you also log sizeof(long) and you notice that it returns 8 and you don't care about portability, so you think that you wouldn't lose precision here:
long i= 1 << 34;
Bit instead 1 isn't a long, so it will overflow and when you cast it to a long you have already lost precision. Instead you should type:
long i= 1l << 34;
Newer compilers will warn you about this.
Taken from this question: Converting Long 64-bit Decimal to Binary.
About float and double there is a thing to considerate: they use a mantissa and an exponent to represent the number. It's something like:
value= 2^exponent * mantissa
So the more the exponent is high, the more the floating point number doesn't have an exact representation. It may also happen that a number is too high, so that it will have a so inaccurate representation, that surprisingly if you print it you get a different number:
float f= 9876543219124567;
NSLog("%.0f",f); // On my machine it prints 9876543585124352
If I use a double it prints 9876543219124568, and if I use a long double with the .0Lf format it prints the correct value. Always be careful when using floating points numbers, unexpected things may happen.
For example it may also happen that two floating point numbers have almost the same value, that you expect they have the same value but there is a subtle difference, so that the equality comparison fails. But this has been treated hundreds of times on Stack Overflow, so I will just post this link: What is the most effective way for float and double comparison?.

Does the "C" code algorithm in RFC1071 work well on big-endian machine?

As described in RFC1071, an extra 0-byte should be added to the last byte when calculating checksum in the situation of odd count of bytes:
But in the "C" code algorithm, only the last byte is added:
The above code does work on little-endian machine where [Z,0] equals Z, but I think there's some problem on big-endian one where [Z,0] equals Z*256.
So I wonder whether the example "C" code in RFC1071 only works on little-endian machine?
-------------New Added---------------
There's one more example of "breaking the sum into two groups" described in RFC1071:
We can just take the data here (addr[]={0x00, 0x01, 0xf2}) for example:
Here, "standard" represents the situation described in the formula [2], while "C-code" representing the C code algorithm situation.
As we can see, in "standard" situation, the final sum is f201 regardless of endian-difference since there's no endian-issue with the abstract form of [Z,0] after "Swap". But it matters in "C-code" situation because f2 is always the low-byte whether in big-endian or in little-endian.
Thus, the checksum is variable with the same data(addr&count) on different endian.
I think you're right. The code in the RFC adds the last byte in as low-order, regardless of whether it is on a litte-endian or big-endian machine.
In these examples of code on the web we see they have taken special care with the last byte:
https://github.com/sjaeckel/wireshark/blob/master/epan/in_cksum.c
and in
http://www.opensource.apple.com/source/tcpdump/tcpdump-23/tcpdump/print-ip.c
it does this:
if (nleft == 1)
sum += htons(*(u_char *)w<<8);
Which means that this text in the RFC is incorrect:
Therefore, the sum may be calculated in exactly the same way
regardless of the byte order ("big-endian" or "little-endian")
of the underlaying hardware. For example, assume a "little-
endian" machine summing data that is stored in memory in network
("big-endian") order. Fetching each 16-bit word will swap
bytes, resulting in the sum; however, storing the result
back into memory will swap the sum back into network byte order.
The following code in place of the original odd byte handling is portable (i.e. will work on both big- and little-endian machines), and doesn't depend on an external function:
if (count > 0)
{
char buf2[2] = {*addr, 0};
sum += *(unsigned short *)buf2;
}
(Assumes addr is char * or const char *).

Storing integers in a redis ordered set?

I have a system which deals with keys that have been turned into unsigned long integers (by packing short sequences into byte strings). I want to try storing these in Redis, and I want to do it in the best way possible. My concern is mainly memory efficiency.
From playing with the online REPL I notice that the two following are identical
zadd myset 1.0 "123"
zadd myset 1.0 123
This means that even if I know I want to store an integer, it has to be set as a string. I notice from the documentation that keys are just stored as char*s and that commands like SETBIT indicate that Redis is not averse to treating strings as bytestrings in the client. This hints at a slightly more efficient way of storing unsigned longs than as their string representation.
What is the best way to store unsigned longs in sorted sets?
Thanks to Andre for his answer. Here are my findings.
Storing ints directly
Redis keys must be strings. If you want to pass an integer, it has to be some kind of string. For small, well-defined sets of values, Redis will parse the string into an integer, if it is one. My guess is that it will use this int to tailor its hash function (or even statically dimension a hash table based on the value). This works for small values (examples being the default values of 64 entries of a value of up to 512). I will test for larger values during my investigation.
http://redis.io/topics/memory-optimization
Storing as strings
The alternative is squashing the integer so it looks like a string.
It looks like it is possible to use any byte string as a key.
For my application's case it actually didn't make that much difference storing the strings or the integers. I imagine that the structure in Redis undergoes some kind of alignment anyway, so there may be some pre-wasted bytes anyway. The value is hashed in any case.
Using Python for my testing, so I was able to create the values using the struct.pack. long longs weigh in at 8 bytes, which is quite large. Given the distribution of integer values, I discovered that it could actually be advantageous to store the strings, especially when coded in hex.
As redis strings are "Pascal-style":
struct sdshdr {
long len;
long free;
char buf[];
};
and given that we can store anything in there, I did a bit of extra Python to code the type into the shortest possible type:
def do_pack(prefix, number):
"""
Pack the number into the best possible string. With a prefix char.
"""
# char
if number < (1 << 8*1):
return pack("!cB", prefix, number)
# ushort
elif number < (1 << 8*2):
return pack("!cH", prefix, number)
# uint
elif number < (1 << 8*4):
return pack("!cI", prefix, number)
# ulonglong
elif number < (1 << 8*8):
return pack("!cQ", prefix, number)
This appears to make an insignificant saving (or none at all). Probably due to struct padding in Redis. This also drives Python CPU through the roof, making it somewhat unattractive.
The data I was working with was 200000 zsets of consecutive integer => (weight, random integer) × 100, plus some inverted index (based on random data). dbsize yields 1,200,001 keys.
Final memory use of server: 1.28 GB RAM, 1.32 Virtual. Various tweaks made a difference of no more than 10 megabytes either way.
So my conclusion:
Don't bother encoding into fixed-size data types. Just store the integer as a string, in hex if you want. It won't make all that much difference.
References:
http://docs.python.org/library/struct.html
http://redis.io/topics/internals-sds
I'm not sure of this answer, it's more of a suggestion than anything else. I'd have to give it a try and see if it works.
As far as I can tell, Redis only supports UTF-8 strings.
I would suggest grabbing a bit representation of your long integer and pad it accordingly to fill up the nearest byte. Encode each set of 8 bytes to a UTF-8 string (ending up with 8x*utf8_char* string) and store that in Redis. The fact that they're unsigned means that you don't care about that first bit but if you did, you could add a flag to the string.
Upon retrieving the data, you have to remember to pad each character to 8 bytes again as UTF-8 will use less bytes for the representation if the character can be stored with less bytes.
End result is that you store a maximum of 8 x 8 byte characters instead of (possibly) a maximum of 64 x 8 byte characters.

Is there a practical limit to the size of bit masks?

There's a common way to store multiple values in one variable, by using a bitmask. For example, if a user has read, write and execute privileges on an item, that can be converted to a single number by saying read = 4 (2^2), write = 2 (2^1), execute = 1 (2^0) and then add them together to get 7.
I use this technique in several web applications, where I'd usually store the variable into a field and give it a type of MEDIUMINT or whatever, depending on the number of different values.
What I'm interested in, is whether or not there is a practical limit to the number of values you can store like this? For example, if the number was over 64, you couldn't use (64 bit) integers any more. If this was the case, what would you use? How would it affect your program logic (ie: could you still use bitwise comparisons)?
I know that once you start getting really large sets of values, a different method would be the optimal solution, but I'm interested in the boundaries of this method.
Off the top of my head, I'd write a set_bit and get_bit function that could take an array of bytes and a bit offset in the array, and use some bit-twiddling to set/get the appropriate bit in the array. Something like this (in C, but hopefully you get the idea):
// sets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// result is 0 on success, non-zero on failure (offset out-of-bounds)
int set_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//set the right bit
bytes[offset >> 3] |= (1 << (offset & 0x7));
return 0; //success
}
//gets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// returns (-1) on error, 0 if bit is "off", positive number if "on"
int get_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//get the right bit
return (bytes[offset >> 3] & (1 << (offset & 0x7));
}
I've used bit masks in filesystem code where the bit mask is many times bigger than a machine word. think of it like an "array of booleans";
(journalling masks in flash memory if you want to know)
many compilers know how to do this for you. Adda bit of OO code to have types that operate senibly and then your code starts looking like it's intent, not some bit-banging.
My 2 cents.
With a 64-bit integer, you can store values up to 2^64-1, 64 is only 2^6. So yes, there is a limit, but if you need more than 64-its worth of flags, I'd be very interested to know what they were all doing :)
How many states so you need to potentially think about? If you have 64 potential states, the number of combinations they can exist in is the full size of a 64-bit integer.
If you need to worry about 128 flags, then a pair of bit vectors would suffice (2^64 * 2).
Addition: in Programming Pearls, there is an extended discussion of using a bit array of length 10^7, implemented in integers (for holding used 800 numbers) - it's very fast, and very appropriate for the task described in that chapter.
Some languages ( I believe perl does, not sure ) permit bitwise arithmetic on strings. Giving you a much greater effective range. ( (strlen * 8bit chars ) combinations )
However, I wouldn't use a single value for superimposition of more than one /type/ of data. The basic r/w/x triplet of 3-bit ints would probably be the upper "practical" limit, not for space efficiency reasons, but for practical development reasons.
( Php uses this system to control its error-messages, and I have already found that its a bit over-the-top when you have to define values where php's constants are not resident and you have to generate the integer by hand, and to be honest, if chmod didn't support the 'ugo+rwx' style syntax I'd never want to use it because i can never remember the magic numbers )
The instant you have to crack open a constants table to debug code you know you've gone too far.
Old thread, but it's worth mentioning that there are cases requiring bloated bit masks, e.g., molecular fingerprints, which are often generated as 1024-bit arrays which we have packed in 32 bigint fields (SQL Server not supporting UInt32). Bit wise operations work fine - until your table starts to grow and you realize the sluggishness of separate function calls. The binary data type would work, were it not for T-SQL's ban on bitwise operators having two binary operands.
For example .NET uses array of integers as an internal storage for their BitArray class.
Practically there's no other way around.
That being said, in SQL you will need more than one column (or use the BLOBS) to store all the states.
You tagged this question SQL, so I think you need to consult with the documentation for your database to find the size of an integer. Then subtract one bit for the sign, just to be safe.
Edit: Your comment says you're using MySQL. The documentation for MySQL 5.0 Numeric Types states that the maximum size of a NUMERIC is 64 or 65 digits. That's 212 bits for 64 digits.
Remember that your language of choice has to be able to work with those digits, so you may be limited to a 64-bit integer anyway.