Is WCHAR in COM interfaces a good thing? - com

Is WCHAR in COM interfaces a good thing?
I've been searching the internet for an answer to this question with no results.
Basically, should char* / wchar_t* be used in COM, or should I use BSTR instead?
Is it safe, or does it depend?
In this example it's strings (code grabbed from a random source):
STDMETHOD(SetAudioLanguageOrder(WCHAR *nValue)) = 0;
STDMETHOD_(WCHAR *, GetAudioLanguageOrder()) = 0;
I'm confused over when to use what with all marshaling, memory boundaries, etc. that comes up when talking about COM.
What about data buffers (byte*)?

It depends on the context in which the caller will call you. First, if you use a non-automation type, marshaling will not be automatically performed for you. Therefore, you'll end up having to write your own marshaler to move a wchar_t* across process boundaries.
That said, there's no rule that says you can't pass a wchar_t* in a COM interface. There are many COM interfaces that pass custom types (structs, pointers to structs, callbacks, etc), and it's all just about your needs.
In your interface, if you do use WCHAR strings, I'd declare SetAudioLanguageOrder this way:
STDMETHOD(SetAudioLanguageOrder(const WCHAR *nValue)) = 0;
This makes it clearer who is (not) supposed to free the string, and provides more context as to how the string should be treated (the caller is discouraged from modifying it, though a caller determined to write bad code can certainly force that behavior).
The GetAudioLanguageOrder call is OK, but now the question is: who frees the returned string, and how should it be freed? Via free(...)? Or C++ delete[]? If you use a BSTR, then you know: use SysFreeString. That's part of the reason to use BSTRs instead of raw WCHAR strings.
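To make the ownership rules concrete, here's a minimal sketch of the same pair of methods using BSTR (hypothetical interface shape; error handling omitted):

struct IAudioSettings : public IUnknown {
    STDMETHOD(SetAudioLanguageOrder)(BSTR value) = 0;   // [in]: callee copies it, caller keeps ownership
    STDMETHOD(GetAudioLanguageOrder)(BSTR *pValue) = 0; // [out]: callee allocates, caller frees
};

void Demo(IAudioSettings *pItf) {
    BSTR order = SysAllocString(L"eng;jpn");
    pItf->SetAudioLanguageOrder(order);
    SysFreeString(order);                    // we allocated it, so we free it

    BSTR current = NULL;
    if (SUCCEEDED(pItf->GetAudioLanguageOrder(&current))) {
        // ... use current ...
        SysFreeString(current);              // out-params are always freed with SysFreeString
    }
}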

If you are going to support dual interfaces and clients other than C++, use BSTR. If all callers are C++, then WCHAR* is fine.

You have to be able to know the length of that array one way or another. In C or C++ it's typical to use null-terminated strings, and you usually use them within one process - the callee accesses the very same data that the caller prepared and null-terminated.
Not so with COM - you might want to create an out-of-proc server, or use your in-proc server in a surrogate process, and then you'll need marshalling - a middleware mechanism that transmits the data between processes or threads - to work. That mechanism will not know the size of the string unless you use one of the MIDL attributes, such as size_is, to specify the right array size. Using those attributes requires an extra parameter for each array, which complicates the interface and requires extra care when dealing with the data.
That said, in most cases you get a more fluent interface by just using BSTRs.
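For illustration, this is why the BSTR route avoids the extra parameter - the length travels with the pointer itself (a minimal sketch):

BSTR s = SysAllocString(L"eng;jpn;fre");
UINT cch = SysStringLen(s);   // reads the length prefix; no wcslen() scan, no size_is() needed
// ... pass 's' through a marshalled interface; the marshaler copies exactly cch characters ...
SysFreeString(s);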

Related

Why we want to use BSTR

I'm new to COM. According to http://msdn.microsoft.com/en-us/library/windows/desktop/ms221069(v=vs.85).aspx,
a BSTR is a string with a length prefix. I am wondering what the purpose of BSTR is. In which cases do we have to use BSTR instead of a plain string type? Could someone give an example, please?
Thanks
First of all, sometimes you have to use BSTR. For example, if you call a COM method and the callee expects a BSTR, you'd rather not pass any other string type - otherwise it could call SysStringLen() and run into undefined behavior. Use BSTR when an API description says you should use BSTR.
Also BSTR has these useful features:
(the most important thing) It is allocated on a separate heap managed by COM, so if your application consists of modules that use different runtime heaps, any module can allocate a BSTR and any other module can free it - which is not the case with other dynamically allocated data. This is very useful when, for example, you pass in an existing string and the callee has to either keep it or reallocate it - without a single shared heap and a single set of allocation functions, you'd have a hard time doing that.
(the less important thing) It's supported by the Automation marshaller, so you can easily use it in COM interfaces that must be marshalled without crafting and registering your own proxy/stub code.
(the least important, IMO) It can contain embedded null characters, unlike other typical string types - see the sketch below.
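That last point is possible precisely because of the length prefix (a quick sketch):

const OLECHAR raw[] = { L'a', L'\0', L'b' };
BSTR s = SysAllocStringLen(raw, 3);   // the length is explicit: 3 characters
// SysStringLen(s) returns 3, while wcslen(s) would stop at the embedded null and return 1
SysFreeString(s);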
BSTR is the standard string type across a wide family of APIs:
A BSTR (Basic string or binary string) is a string data type that is used by COM, Automation, and Interop functions. Use the BSTR data type in all interfaces that will be accessed from script.
You use BSTR when you have to, especially when the API you are using expects it; e.g., when a certain COM interface method requires a BSTR argument. When you have a choice, use BSTR or anything else at your discretion.

C++ Interop: embedding an array in a UDT

I have an application that involves a lot of communication between managed (C#) and unmanaged (C++) code. We are using Visual Studio 2005 (!), and we use the interop assembly generated automatically by tlbimp.
We have fairly good luck passing simple structs back and forth as function arguments. And because our objects are fairly simple, we can pack them into SAFEARRAYs using the IRecordInfo interface. Passing these arrays as arguments to COM methods seems to work properly.
We would like to be able to embed variable-length arrays in our UDTs, but this fails badly. I don't think I have been able to find a single piece of documentation showing how someone has accomplished this. Nor have I found documentation that says it can't be done.
1) Naive approach: simply declare a safearray in the unmanaged code:
struct MyUdt {
    int member1;
    BSTR member2;
    SAFEARRAY *m3;
};
The C++ compiler is happy with this, but the generated IDL confounds tlbimp.exe. It reports that it is unable to convert the signature for member m3, and the signature for member tagSAFEARRAY.rgsabound. These are only warnings, but they are meaningful: the resulting assembly is not usable.
Using LPSAFEARRAY, oddly enough, fails in different ways, but for the same reason: tlbimp just can't deal with it.
2) Trickier: Pack it into a variant:
struct MyUdt {
    int member1;
    BSTR member2;
    VARIANT m3;
};
We have code that builds safearrays of UDTs, and it never gives us any trouble. It's basically copied from MSDN. Using that code to create a safearray, then:
pVal->m3.vt = VT_ARRAY | VT_RECORD;
pVal->m3.parray = p;
This fails in odd ways. It always breaks; some variations produce an OutOfMemoryException, others fail differently. (I'm not sure if a pRecInfo pointer is required here or not, but it fails the same way whether it's present or not.)
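For reference, the array-building code we copied follows the usual pattern below (simplified, with placeholder names - LIBID_MyLib and the MyUdt GUID come from our registered type library):

IRecordInfo *pRecInfo = NULL;
GetRecordInfoFromGuids(LIBID_MyLib, 1, 0, LOCALE_USER_DEFAULT,
                       __uuidof(MyUdt), &pRecInfo);
SAFEARRAYBOUND sab = { numElements, 0 };
// SafeArrayCreateEx attaches the IRecordInfo to the array itself
SAFEARRAY *p = SafeArrayCreateEx(VT_RECORD, 1, &sab, pRecInfo);
// ... SafeArrayAccessData / copy the records in / SafeArrayUnaccessData ...
pRecInfo->Release();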
The Google search space for this is badly polluted with answers to questions that I am not asking:
How do you pass UDT/structs from unmanaged code.
How do you pass a SAFEARRAY of structs? (We're doing this fine.)
How do you use p/invoke or custom marshalling to pass UDTs.
And many answers describing how to define things from the managed side, not the unmanaged side.
And then there are a couple of Microsoft KBs describing problems with VT_RECORD in early versions of .NET. I don't think these are germane - VT_RECORD types work with VARIANT and with SAFEARRAY. (But maybe not with the UDT marshalling...)
If this won't ever work, it would be nice to at least know why.
Mark

How to receive bytes in a Managed C++ project from a COM+ project

I have a module A in Managed C++; it depends on module B in native C++, which is wrapped as COM+.
In module B, I read bytes from a file. Now I am trying to call the file-reading functionality from A, but it failed.
Dependency detail: I used tlbimp.exe to generate the interop assembly from module B; A refers to the interop.
I tried to pass an "array^", but only one char was received, which is understandable because marshaling doesn't know the array length and could not handle the whole array.
I found some recommendations about safe arrays, but could not use them successfully in my projects.
Could somebody help me on this?
Thanks a lot.
If you are going to be talking to your native object via COM, you're going to have to pass the array the COM way.
SAFEARRAY would definitely work, but you don't have to use it, and it is a fair amount of work to set up. If neither component is a scripting language or VB6, there is little value in using a SAFEARRAY.
COM can marshal the array just fine; you just have to tell it how big it is. The two most common mechanisms in COM for passing (native) arrays are "fixed-size arrays" and "conformant arrays".
Fixed-size array:
If you know at compile time the size of the array, this is the way to go. Declare your COM method as follows in your IDL:
...
const long ARRAY_SIZE = 1024;
...
HRESULT MethodAbc(MyClass array[ARRAY_SIZE]);
Marshalling will take care of passing the whole array.
Conformant Arrays:
You declare them as follows in IDL:
HRESULT MethodAbc([size_is(arraySize)] MyClass array[], long arraySize);
This tells COM that the arraySize parameter holds the count of elements.
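On the native caller side that looks roughly like this (IMyInterface and MyClass being whatever your IDL defines):

void SendAll(IMyInterface *pItf) {
    MyClass items[16];
    // ... fill items ...
    HRESULT hr = pItf->MethodAbc(items, 16);   // the marshaler copies exactly 16 elements
}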
My experience with CLI is minimal, but I don't think you can just pass a CLI handle. Among other things, I believe you need to pin the pointer so that the GC doesn't move the array during the COM call (roughly as in the sketch below). Others, please correct me if I'm wrong.
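Something like this C++/CLI sketch is what I have in mind (ReadBytes is a hypothetical method on the imported interop interface):

array<System::Byte>^ buffer = gcnew array<System::Byte>(1024);
pin_ptr<System::Byte> pinned = &buffer[0];              // pinned for the scope of 'pinned'
HRESULT hr = pItf->ReadBytes(pinned, buffer->Length);   // native side sees a stable BYTE*
// when 'pinned' goes out of scope, the buffer becomes movable again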

Why does the JVM have both `invokespecial` and `invokestatic` opcodes?

Both instructions use static rather than dynamic dispatch. It seems like the only substantial difference is that invokespecial will always have, as its first argument, an object that is an instance of the class that the dispatched method belongs to. However, invokespecial does not actually put the object there; the compiler is the one responsible for making that happen by emitting the appropriate sequence of stack operations before emitting invokespecial. So replacing invokespecial with invokestatic should not affect the way the runtime stack / heap gets manipulated -- though I expect that it will cause a VerifyError for violating the spec.
I'm curious about the possible reasons behind making two distinct instructions that do essentially the same thing. I took a look at the source of the OpenJDK interpreter, and it seems like invokespecial and invokestatic are handled almost identically. Does having two separate instructions help the JIT compiler better optimize code, or does it help the classfile verifier prove some safety properties more efficiently? Or is this just a quirk in the JVM's design?
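For concreteness, here is the kind of code I mean - with a classic javac, the two call sites below compile to different opcodes even though both are statically dispatched (per my reading of javap -c output):

class Demo {
    private int f() { return 1; }   // the call below becomes: invokespecial Demo.f:()I
    static int g() { return 2; }    // the call below becomes: invokestatic Demo.g:()I

    int call() {
        return f() + g();           // both targets are known at compile time
    }
}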
Disclaimer: It is hard to tell for sure since I never read an explicit Oracle statement about this, but I pretty much think this is the reason:
When you look at Java byte code, you could ask the same question about other instructions. Why would the verifier stop you from pushing two ints onto the stack and treating them as a single long right after? (Try it, it will stop you.) You could argue that by allowing this, you could express the same logic with a smaller instruction set. (To go further with this argument: a byte can only express so many instructions, so the Java byte code set should cut down wherever possible.)
Of course, in theory you would not need separate byte code instructions for pushing ints and longs onto the stack, and you are right that you would not need two instructions, INVOKESPECIAL and INVOKESTATIC, in order to express method invocations. A method is uniquely identified by its method descriptor (name and raw argument types), and you could not define both a static and a non-static method with an identical descriptor within the same class. And in order to validate the byte code, the verifier must check whether the target method is static anyway.
Remark: This contradicts the answer of v6ak. However, a non-static method's descriptor is not altered to include a reference to this.getClass(); the Java runtime could therefore always infer the appropriate method binding from the method descriptor for a hypothetical INVOKESMART instruction. See JVMS §4.3.3.
So much for the theory. However, the intentions expressed by the two invocation types are quite different. And remember that Java byte code is meant to be produced by tools other than javac as well. With byte code, these tools produce something that is more similar to machine code than your Java source code, but it is still rather high level: for example, byte code is still verified, and it is automatically optimized when compiled to machine code. The byte code is an abstraction that intentionally contains some redundancy in order to make its meaning more explicit. And just as the Java language uses different names for similar things to make the language more readable, the byte code instruction set contains some redundancy as well. As another benefit, verification and byte code interpretation/compilation can be faster, since a method's invocation type does not need to be inferred but is explicitly stated in the byte code. This is desirable because verification, interpretation and compilation are done at run time.
As a final anecdote, I should mention that a class's static initializer <clinit> was not flagged static before Java 5. In this context, the static invocation could also be inferred from the method's name, but this would cause even more run-time overhead.
Here are the definitions:
http://docs.oracle.com/javase/specs/jvms/se5.0/html/Instructions2.doc6.html#invokestatic
http://docs.oracle.com/javase/specs/jvms/se5.0/html/Instructions2.doc6.html#invokespecial
There are significant differences. Say we want to design an invokesmart instruction that would choose smartly between invokestatic and invokespecial:
First, it would not be a problem to distinguish between static and virtual calls, since we can't have two methods with the same name, the same parameter types and the same return type, even if one is static and the other is virtual. The JVM does not allow that (for a strange reason). Thanks to raphw for noticing that.
Now, what would invokesmart foo/Bar.baz(I)I mean? It may mean:
A static method call foo.Bar.baz that consumes an int from the operand stack and pushes another int. // (int) -> (int)
An instance method call foo.Bar.baz that consumes a foo.Bar and an int from the operand stack and pushes an int. // (foo.Bar, int) -> (int)
How would you choose between them? Both methods may exist.
We may try to solve it by requiring foo/Bar.baz(Lfoo/Bar;I) for the static call. However, we may have both public static int baz(Bar, int) and public int baz(int), as the sketch below shows.
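Both can legally coexist, so the combined descriptor alone cannot disambiguate the call:

package foo;

public class Bar {
    // static descriptor: (Lfoo/Bar;I)I
    public static int baz(Bar b, int i) { return i; }

    // instance descriptor: (I)I - but with the implicit receiver, the operand
    // stack at the call site also looks like (foo.Bar, int)
    public int baz(int i) { return i; }
}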
We may say that it does not matter and simply disallow such situations. (I don't think that is a good idea, but just imagine.) What would that mean?
If the method is static, there are probably no additional restrictions. On the other hand, if the method is not static, there are some restrictions: "Finally, if the resolved method is protected (§4.6), and it is either a member of the current class or a member of a superclass of the current class, then the class of objectref must be either the current class or a subclass of the current class."
There are some further differences, see the note about ACC_SUPER.
It would mean that all the referenced classes must be loaded before bytecode verification. I hope this is not necessary now, but I am not 100% sure.
So, it would mean very inconsistent behavior.

For a language to work with COM objects, does it need an API and compiler developed specifically for COM?

The documentation for COM says that it works with literally every language. Do you need a specific API for a given language so it can interface with COM, or can any language literally just use it out of the box? Also, do you need a special compiler? Sorry if this is a stupid question, but I have never used COM before and have been trying to find the answer. When I look at COM demos, they all seem to access the objects in a C-style syntax; are there bindings and APIs for other languages (literally all of them)?
The key thing about COM is that it is a "binary standard", which is to say that it doesn't care what language is used, so long as the bits and bytes in memory end up in the right place.
COM basically specifies that all COM objects must have a specific layout in memory: the interface pointer points to a pointer that in turn points to a table of function pointers. That table has at least three entries - pointers to the IUnknown functions (QueryInterface, AddRef, Release) - and the remainder are pointers to the other functions in the interface. COM also specifies how arguments are passed to these functions, so that the caller and callee agree on how the stack is used and on who pops off the values.
This requirement happily matches how C++ just happens to work on Windows, so most C++ classes that implement IUnknown will just happen to end up being valid COM classes. This is because Microsoft's implementation of C++ uses an object layout that matches what COM requires: the C++ object's vtable pointer is the same as COM's pointer-to-table-of-function-pointers, the C++ table of function pointers is exactly what COM requires for its table-of-function-pointers, and so on. (This isn't entirely a happy coincidence: COM was likely designed to take advantage of the most common way that C++ objects are implemented in memory, which is the technique MS's compiler uses. Note that the C++ language specification doesn't actually mandate any particular object layout, so you could have a third-party C++ compiler that implemented C++ in a way that gave you classes unusable by COM. But no compiler vendor in their right mind would do that, since they would appear broken compared to the others!)
In plain C, you can create a COM object by creating suitable structs-containing-pointers manually, as sketched below. This works because C essentially lets you specify the binary-level memory layout of structs by hand; you can create structs that you know will have exactly the layout COM expects.
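A minimal sketch of that C technique (one object with just the IUnknown slots; the refcounting and QueryInterface logic is elided):

typedef struct MyObject MyObject;

typedef struct {
    /* the first three slots must be the IUnknown methods, in this order */
    HRESULT (STDMETHODCALLTYPE *QueryInterface)(MyObject*, REFIID, void**);
    ULONG   (STDMETHODCALLTYPE *AddRef)(MyObject*);
    ULONG   (STDMETHODCALLTYPE *Release)(MyObject*);
    /* ...interface-specific methods would follow here... */
} MyObjectVtbl;

struct MyObject {
    const MyObjectVtbl *lpVtbl;   /* first member: the pointer to the table */
    LONG refCount;
};

static ULONG STDMETHODCALLTYPE MyObject_AddRef(MyObject *self) {
    return InterlockedIncrement(&self->refCount);
}
/* QueryInterface and Release are written the same way, then collected into
   a static MyObjectVtbl whose address is stored in each object's lpVtbl. */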
In other languages, especially those that don't allow the user to specify memory layout explicitly, you need support from the language to allow for COM support. All the .Net languages - C#, VB.Net, and so on - use support in the .Net runtime that understands what COM expects, and produces the appropriate wrappers as needed to allow the interop to work.
So, long story short, it's not the case that any language under the sun will automatically work with COM; it's really the case that a couple of languages - namely C and C++ - are already aligned with COM's requirements; and most other languages will need some compiler support to make it work.