Should there be a difference between an empty BSTR and a NULL BSTR?

When maintaining a COM interface should an empty BSTR be treated the same way as NULL?
In other words should these two function calls produce the same result?
// Empty BSTR
CComBSTR empty(L""); // Or SysAllocString(L"")
someObj->Foo(empty);
// NULL BSTR
someObj->Foo(NULL);

Yes - a NULL BSTR is the same as an empty one. I remember we had all sorts of bugs that were uncovered when we switched from VS6 to VS2003: the CComBSTR class changed its default constructor to initialise the string with NULL rather than an empty string. Bugs like that surface when you, for example, treat a BSTR as a regular C-style string and pass it to a function like strlen, or try to initialise a std::string with it.
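For instance (a hypothetical sketch assuming ATL's <atlbase.h>; the commented-out lines show the failure mode):
#include <atlbase.h>   // CComBSTR
#include <string>

void Sketch()
{
    CComBSTR s;                   // default-constructed: m_str is NULL in VS2003 and later
    // size_t n = wcslen(s);      // crashes: wcslen dereferences a NULL pointer
    // std::wstring w(s.m_str);   // undefined behaviour for the same reason
    std::wstring w(s.m_str ? s.m_str : L"");  // defensive form that handles both cases
}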
Eric Lippert discusses BSTR's in great detail in Eric's Complete Guide To BSTR Semantics:
Let me list the differences first and then discuss each point in excruciating detail.
1. A BSTR must have identical semantics for NULL and for "". A PWSZ frequently has different semantics for those.
2. A BSTR must be allocated and freed with the SysAlloc* family of functions. A PWSZ can be an automatic-storage buffer from the stack or allocated with malloc, new, LocalAlloc or any other memory allocator.
3. A BSTR is of fixed length. A PWSZ may be of any length, limited only by the amount of valid memory in its buffer.
4. A BSTR always points to the first valid character in the buffer. A PWSZ may be a pointer to the middle or end of a string buffer.
5. When allocating an n-byte BSTR you have room for n/2 wide characters. When you allocate n bytes for a PWSZ you can store n/2 - 1 characters -- you have to leave room for the null.
6. A BSTR may contain any Unicode data including the zero character. A PWSZ never contains the zero character except as an end-of-string marker. Both a BSTR and a PWSZ always have a zero character after their last valid character, but in a BSTR a valid character may be a zero character.
7. A BSTR may actually contain an odd number of bytes -- it may be used for moving binary data around. A PWSZ is almost always an even number of bytes and used only for storing Unicode strings.

The easiest way to handle this dilemma is to use CComBSTR and check whether .Length() is zero. That works for both empty and NULL values.
However, keep in mind that even an empty BSTR must be released, or there will be a memory leak. I saw some of those recently in other people's code; they are quite hard to find if you are not looking carefully.
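A minimal sketch of that approach (assuming ATL is available; Foo and its handling branch are hypothetical):
#include <atlbase.h>

void Foo(BSTR value) // hypothetical callee
{
    // CComBSTR::Length() returns 0 for both a NULL BSTR and L"", and the
    // wrapper releases its copy automatically, avoiding the leak noted above.
    CComBSTR str(value);      // copies; safe even when value is NULL
    if (str.Length() == 0)
    {
        // treat "empty" and "NULL" identically here
        return;
    }
    // ... use str ...
}                             // destructor calls SysFreeString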

Related

How are strings written down in FlatBuffers

I am researching the FlatBuffers file structure and I want to know how strings are written down. From what I could gather, the string orc (for example) is written as the letter count in little endian (0x3 0x0 0x0 0x0), followed by the actual letters, followed by something else. I am trying to understand what that something else is. What bytes follow the letters? I am only asking about the representation of this specific string in the buffer/file.
According to the FlatBuffers documentation:
"Strings are simply a vector of bytes, and are always null-terminated. Vectors are stored as contiguous aligned scalar elements prefixed by a 32bit element count (not including any null termination). Neither is stored inline in their parent, but are referred to by offset. A vector may consist of more than one offset pointing to the same value if the user explicitly serializes the same offset twice."
Thus, the 4 bytes at the front are the 32-bit element count, and 0x3 0x0 0x0 0x0 means there are 3 bytes in the string excluding the zero termination. The byte immediately after the letters is that null terminator; any bytes beyond it are simply the next (aligned) parts of the buffer, not part of the string itself. (FlatBuffers defaults to little-endian; see the documentation quoted above.)
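As a hypothetical byte-level sketch of the string orc under those rules (the length is decoded manually to stay endian-independent):
#include <cassert>
#include <cstdint>
#include <cstring>

int main()
{
    // 32-bit little-endian length prefix, the letters, then the null terminator.
    const uint8_t encoded[] = { 0x03, 0x00, 0x00, 0x00, 'o', 'r', 'c', 0x00 };

    uint32_t len = encoded[0] | (encoded[1] << 8) | (encoded[2] << 16)
                 | (uint32_t(encoded[3]) << 24);
    assert(len == 3);                                // count excludes the terminator
    assert(std::memcmp(encoded + 4, "orc", 3) == 0); // the letters themselves
    assert(encoded[4 + len] == 0x00);                // the byte after the letters
    return 0;
}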

What does this notation mean in TLS documentation?

For example, looking at RFC 7301, which defines ALPN:
enum {
    application_layer_protocol_negotiation(16), (65535)
} ExtensionType;
The (16) is the enum value to be used, but how should I read the (65535) part?
From the same document:
opaque ProtocolName<1..2^8-1>;

struct {
    ProtocolName protocol_name_list<2..2^16-1>
} ProtocolNameList;
...how should I read the <1..2^8-1> and <2..2^16-1> parts?
The notation is described in https://www.rfc-editor.org/rfc/rfc8446.
For "enumerateds" (enums), see https://www.rfc-editor.org/rfc/rfc8446#section-3.5, which says that the value in brackets is the value of that enum member, and that the enum occupies as many octets as required by the highest documented value.
Thus, if you want to leave some room, you need an un-named enum member with a sufficiently high value.
One may optionally specify a value without its associated tag to force the width definition without defining a superfluous element.
In the following example, Taste will consume two bytes in the data stream but can only assume the values 1, 2, or 4.
enum { sweet(1), sour(2), bitter(4), (32000) } Taste;
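In a hypothetical byte-level view (an illustration, not from the RFC), each Taste value therefore travels as two octets:
#include <cstdint>

// Taste's highest documented value is 32000, which needs two octets,
// so every Taste member is encoded in two bytes (network byte order).
const uint8_t sweet_wire[2]  = { 0x00, 0x01 }; // sweet(1)
const uint8_t bitter_wire[2] = { 0x00, 0x04 }; // bitter(4)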
For vectors, see https://www.rfc-editor.org/rfc/rfc8446#section-3.4. This says:
Variable-length vectors are defined by specifying a subrange of legal lengths, inclusively, using the notation <floor..ceiling>. When these are encoded, the actual length precedes the vector's contents in the byte stream. The length will be in the form of a number consuming as many bytes as required to hold the vector's specified maximum (ceiling) length.
So the notation <1..2^8-1> means that ProtocolName must be at least one octet, and up to 255 octets in length.
Similarly <2..2^16-1> means that protocol_name_list must have at least 2 octets (not entries), and can have up to 65535 octets (not entries).
In this particular case, the minimum of 2 octets is because it must contain at least one entry, which is itself at least 2 octets long (u8 length prefix, at least one octet in the value).
To make the octets/entries distinction clear, later in that section, it says:
uint16 longer<0..800>;
/* zero to 400 16-bit unsigned integers */
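Putting the two vector rules together, a hypothetical ProtocolNameList carrying the single entry "http/1.1" would be laid out like this (a sketch, not taken from the RFC):
#include <cassert>
#include <cstdint>
#include <cstring>

int main()
{
    const uint8_t wire[] = {
        0x00, 0x09,   // protocol_name_list length: 9 octets
                      // (2-byte prefix because the ceiling 2^16-1 needs 2 bytes)
        0x08,         // ProtocolName length: 8 octets
                      // (1-byte prefix because the ceiling 2^8-1 needs 1 byte)
        'h', 't', 't', 'p', '/', '1', '.', '1'
    };
    uint16_t listLen = (uint16_t(wire[0]) << 8) | wire[1];  // network byte order
    assert(listLen == 9 && wire[2] == 8);
    assert(std::memcmp(wire + 3, "http/1.1", 8) == 0);
    return 0;
}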

Memcpy and Memset on structures of Short Type in C

I have a query about using memset and memcpy on structures and their reliability. For example,
I have code that looks like this:
typedef struct
{
    short a[10];
    short b[10];
} tDataStruct;

tDataStruct m, n;
memset(&m, 2, sizeof(m));
memcpy(&n, &m, sizeof(m));
My questions are:
1) With memset, if I set to 0 it is fine. But when setting 2, I get m.a and m.b filled with 514 instead of 2. When I make them char instead of short it is fine. Does it mean we cannot use memset for any initialization other than 0? Is it a limitation with short, for example?
2) Is it reliable to do memcpy between two structures of short, as above? I have huge arrays a, b, c, d, e... and I need to make sure the copy is a perfect one-to-one.
3) Am I better off using memset and memcpy on the individual arrays rather than collecting them in a structure as above?
One more query: in the structure above I have arrays of values. But suppose I am passed pointers to these arrays and I want to collect those pointers in a structure:
typedef struct
{
    short *pa[10];
    short *pb[10];
} tDataStruct;

tDataStruct m, n;
memset(&m, 2, sizeof(m));
memcpy(&n, &m, sizeof(m));
In this case, if I memset or memcpy, it only changes the addresses rather than the values. How do I change the values instead? Is the prototype wrong?
Please suggest; your inputs are very important.
Thanks,
dsp guy
1. memset sets bytes, not shorts -- always. 514 = (256 * 2) + (1 * 2), i.e. 0x0202: the byte value 2 appears in both bytes of each short.
1.a. This does, admittedly, lessen its usefulness for purposes such as the array fill you're attempting.
2. Reliable as long as both structs are of the same type. Just to be clear, these structures are NOT of "type short" as you suggest.
3. If I understand your question, I don't believe it matters as long as they are of the same type.
Just remember, these are byte-level operations, nothing more, nothing less.
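A quick sketch of point 1 (hypothetical, and assuming a 16-bit short, just to show the byte fill):
#include <cassert>
#include <cstring>

int main()
{
    short a[10];
    std::memset(a, 2, sizeof(a));   // writes the byte 0x02 into every byte
    assert(a[0] == 0x0202);         // so each 16-bit short reads back as 0x0202 == 514
    return 0;
}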
For the second part of your question, try
memset(m.pa, 0, sizeof(*(m.pa)));
memset(m.pb, 0, sizeof(*(m.pb)));
Note the two separate operations on two different addresses (m.pa and m.pb are effectively addresses, as you recognised). Note also the sizeof: not sizeof the references, but sizeof what is being referenced. Similarly for memcpy.
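Note, though, that with the declaration exactly as shown in the question, pa is an array of ten pointers, so clearing the pointed-to values takes a loop. A hedged sketch, assuming each element points at a single valid short (an assumption the post does not state):
#include <cstring>

typedef struct
{
    short *pa[10];
    short *pb[10];
} tPtrStruct;   // hypothetical name, to avoid clashing with the earlier tDataStruct

void clear_values(tPtrStruct *p)
{
    for (int i = 0; i < 10; ++i)
    {
        // memset through each pointer: this changes the pointed-to values,
        // not the addresses stored in the struct
        if (p->pa[i]) std::memset(p->pa[i], 0, sizeof(*p->pa[i]));
        if (p->pb[i]) std::memset(p->pb[i], 0, sizeof(*p->pb[i]));
    }
}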

How are the digits in ObjC method type encoding calculated?

This is a follow-up to my previous question:
What are the digits in an ObjC method type encoding string?
Say there is an encoding:
v24#0:4:8#12B16#20
How are those numbers calculated? B is a char-sized type, so it should occupy just 1 byte (not 4 bytes). Does it have something to do with alignment? What is the size of void?
Is it correct to calculate the numbers as follows: call sizeof on every item, round each result up to a multiple of 4, and make the first number the sum of all the others?
The numbers were used in the m68K days to denote stack layout. That is, you could literally decode the method signature and, for just about all types, know exactly which bytes at what offset within the stack frame you could diddle to get/set arguments.
This worked because the m68K's ABI was entirely [IIRC -- been a long long time] stack based argument/return passing. There wasn't anything shoved into registers across call boundaries.
However, as Objective-C was ported to other platforms, always-on-the-stack was no longer the calling convention. Arguments and return values are often passed in registers.
Thus, those offsets are now useless. As well, the type encoding used by the compiler is no longer complete (because it never was terribly useful) and there will be types that won't be encoded. Not to mention that encoding some C++ templatized types yields method type encoding strings that can be many kilobytes in size (I think the record I ran into was around 30K of type information).
So, no, it isn't correct to use sizeof() to generate the numbers because they are effectively meaningless to everything. The only reason why they still exist is for binary compatibility; there are bits of esoteric code here and there that still parse the type encoding string with the expectation that there will be random numbers sprinkled here and there.
Note that there are vestiges of API in the ObjC runtime that still lead one to believe that it might be possible to encode/decode stack frames on the fly. It really isn't as the C ABI doesn't guarantee that argument registers will be preserved across call boundaries in the face of optimization. You'd have to drop to assembly and things get ugly really really fast (>shudder<).
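Purely for illustration, here is a hypothetical parser that separates the single-character type codes from the digit runs in an encoding like v24#0:4:8#12B16#20 (it does not handle struct, pointer, or qualifier encodings):
#include <cctype>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main()
{
    const std::string enc = "v24#0:4:8#12B16#20";
    std::vector<std::pair<char, int>> parts;   // (type code, historical offset)

    for (std::size_t i = 0; i < enc.size(); )
    {
        char type = enc[i++];                  // one-character type code
        int offset = 0;
        while (i < enc.size() && std::isdigit(static_cast<unsigned char>(enc[i])))
            offset = offset * 10 + (enc[i++] - '0');
        parts.emplace_back(type, offset);
    }

    for (const auto& p : parts)
        std::cout << p.first << " -> " << p.second << "\n";
    // v -> 24 (total frame size), then # -> 0, : -> 4, : -> 8, # -> 12, B -> 16, # -> 20
    return 0;
}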
The full encoding string is constructed (in clang) by the method ASTContext::getObjCEncodingForMethodDecl, which you can find in lib/AST/ASTContext.cpp.
The method that does the size rounding is ASTContext::getObjCEncodingTypeSize, in the same file. It forces each size to be at least the size of an int. On all of Apple's current platforms, an int is 4 bytes.
The stack frame size and argument offsets are calculated by the compiler. I'm actually trying to track this down in the Clang source myself this week; it possibly has something to do with CodeGenTypes::arrangeObjCMessageSendSignature. (Looks like Rob just made my life a lot easier!)
The first number is the sum of the others, yes -- it's the total space occupied by the arguments. To get the size of the type represented by an ObjC type encoding in your code, you should use NSGetSizeAndAlignment().

Is there a practical limit to the size of bit masks?

There's a common way to store multiple values in one variable, by using a bitmask. For example, if a user has read, write and execute privileges on an item, that can be converted to a single number by saying read = 4 (2^2), write = 2 (2^1), execute = 1 (2^0) and then add them together to get 7.
I use this technique in several web applications, where I'd usually store the variable into a field and give it a type of MEDIUMINT or whatever, depending on the number of different values.
What I'm interested in, is whether or not there is a practical limit to the number of values you can store like this? For example, if the number was over 64, you couldn't use (64 bit) integers any more. If this was the case, what would you use? How would it affect your program logic (ie: could you still use bitwise comparisons)?
I know that once you start getting really large sets of values, a different method would be the optimal solution, but I'm interested in the boundaries of this method.
Off the top of my head, I'd write a set_bit and get_bit function that could take an array of bytes and a bit offset in the array, and use some bit-twiddling to set/get the appropriate bit in the array. Something like this (in C, but hopefully you get the idea):
// sets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// result is 0 on success, non-zero on failure (offset out-of-bounds)
int set_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
    // make sure offset is valid (it is unsigned, so only the upper bound needs checking)
    if (offset >= (num_bytes << 3)) { return -1; }

    // set the right bit
    bytes[offset >> 3] |= (1 << (offset & 0x7));
    return 0; // success
}

// gets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// returns (-1) on error, 0 if bit is "off", positive number if "on"
int get_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
    // make sure offset is valid
    if (offset >= (num_bytes << 3)) { return -1; }

    // get the right bit
    return (bytes[offset >> 3] & (1 << (offset & 0x7)));
}
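And a hypothetical usage of that sketch, assuming the two functions above are in scope -- a 16-byte array gives you 128 flags:
#include <cstdio>
#include <cstring>

int main()
{
    char mask[16];                        // 16 bytes = 128 bits
    std::memset(mask, 0, sizeof(mask));

    set_bit(mask, sizeof(mask), 100);     // turn flag #100 on
    std::printf("%d\n", get_bit(mask, sizeof(mask), 100) != 0);  // prints 1
    std::printf("%d\n", get_bit(mask, sizeof(mask), 99) != 0);   // prints 0
    return 0;
}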
I've used bit masks in filesystem code where the bit mask is many times bigger than a machine word. Think of it as an "array of booleans" (journalling masks in flash memory, if you want to know).
Many compilers know how to do this for you. Add a bit of OO code to get types that operate sensibly, and your code starts expressing its intent instead of bit-banging.
My 2 cents.
With a 64-bit integer, you can store values up to 2^64 - 1, and 64 is only 2^6. So yes, there is a limit, but if you need more than 64 bits' worth of flags, I'd be very interested to know what they were all doing :)
How many states do you potentially need to think about? If you have 64 potential flags, the combinations they can exist in fill the full range of a 64-bit integer.
If you need to worry about 128 flags, then a pair of bit vectors (2 * 64 bits) would suffice.
Addition: in Programming Pearls, there is an extended discussion of using a bit array of length 10^7, implemented in integers (for recording which toll-free 800 numbers are in use) - it's very fast, and very appropriate for the task described in that chapter.
Some languages (I believe Perl does, though I'm not sure) permit bitwise arithmetic on strings, giving you a much greater effective range: (strlen * 8-bit chars) combinations.
However, I wouldn't use a single value to superimpose more than one type of data. The basic r/w/x triplet of 3-bit ints is probably the upper "practical" limit, not for space-efficiency reasons, but for practical development reasons.
(PHP uses this system to control its error messages, and I have already found it a bit over the top when you have to define values where PHP's constants are not available and you have to generate the integer by hand; to be honest, if chmod didn't support the 'ugo+rwx' style syntax, I'd never want to use it, because I can never remember the magic numbers.)
The instant you have to crack open a constants table to debug code, you know you've gone too far.
Old thread, but it's worth mentioning that there are cases requiring bloated bit masks, e.g. molecular fingerprints, which are often generated as 1024-bit arrays that we have packed into 32 bigint fields (SQL Server does not support UInt32). Bitwise operations work fine - until your table starts to grow and you notice the sluggishness of the separate function calls. The binary data type would work, were it not for T-SQL's ban on bitwise operators having two binary operands.
For example, .NET uses an array of integers as the internal storage for its BitArray class.
Practically, there is no other way around it.
That being said, in SQL you will need more than one column (or a BLOB) to store all the states.
You tagged this question SQL, so I think you need to consult the documentation for your database to find the size of an integer. Then subtract one bit for the sign, just to be safe.
Edit: Your comment says you're using MySQL. The documentation for MySQL 5.0 Numeric Types states that the maximum size of a NUMERIC is 64 or 65 digits. That's 212 bits for 64 digits.
Remember that your language of choice has to be able to work with those digits, so you may be limited to a 64-bit integer anyway.