What is the difference between Serialization and Marshaling? - serialization

I know that in terms of several distributed techniques (such as RPC), the term "Marshaling" is used but don't understand how it differs from Serialization. Aren't they both transforming objects into series of bits?
Related:
What is Serialization?
What is Object Marshalling?

Marshaling and serialization are loosely synonymous in the context of remote procedure call, but semantically different as a matter of intent.
In particular, marshaling is about getting parameters from here to there, while serialization is about copying structured data to or from a primitive form such as a byte stream. In this sense, serialization is one means to perform marshaling, usually implementing pass-by-value semantics.
It is also possible for an object to be marshaled by reference, in which case the data "on the wire" is simply location information for the original object. However, such an object may still be amenable to value serialization.
As #Bill mentions, there may be additional metadata such as code base location or even object implementation code.

Both do one thing in common - that is serializing an Object. Serialization is used to transfer objects or to store them. But:
Serialization: When you serialize an object, only the member data within that object is written to the byte stream; not the code that
actually implements the object.
Marshalling: Term Marshalling is used when we talk about passing Object to remote objects(RMI). In Marshalling Object is serialized(member data is serialized) + Codebase is attached.
So Serialization is a part of Marshalling.
CodeBase is information that tells the receiver of Object where the implementation of this object can be found. Any program that thinks it might ever pass an object to another program that may not have seen it before must set the codebase, so that the receiver can know where to download the code from, if it doesn't have the code available locally. The receiver will, upon deserializing the object, fetch the codebase from it and load the code from that location.

From the Marshalling (computer science) Wikipedia article:
The term "marshal" is considered to be synonymous with "serialize" in the Python standard library1, but the terms are not synonymous in the Java-related RFC 2713:
To "marshal" an object means to record its state and codebase(s) in such a way that when the marshalled object is "unmarshalled", a copy of the original object is obtained, possibly by automatically loading the class definitions of the object. You can marshal any object that is serializable or remote. Marshalling is like serialization, except marshalling also records codebases. Marshalling is different from serialization in that marshalling treats remote objects specially. (RFC 2713)
To "serialize" an object means to convert its state into a byte stream in such a way that the byte stream can be converted back into a copy of the object.
So, marshalling also saves the codebase of an object in the byte stream in addition to its state.

Basics First
Byte Stream - Stream is a sequence of data. Input stream - reads data from source. Output stream - writes data to destination.
Java Byte Streams are used to perform input/output byte by byte (8 bits at a time). A byte stream is suitable for processing raw data like binary files.
Java Character Streams are used to perform input/output 2 bytes at a time, because Characters are stored using Unicode conventions in Java with 2 bytes for each character. Character stream is useful when we process (read/write) text files.
RMI (Remote Method Invocation) - an API that provides a mechanism to create distributed application in java. The RMI allows an object to invoke methods on an object running in another JVM.
Both Serialization and Marshalling are loosely used as synonyms. Here are few differences.
Serialization - Data members of an object is written to binary form or Byte Stream (and then can be written in file/memory/database etc). No information about data-types can be retained once object data members are written to binary form.
Marshalling - Object is serialized (to byte stream in binary format) with data-type + Codebase attached and then passed Remote Object (RMI). Marshalling will transform the data-type into a predetermined naming convention so that it can be reconstructed with respect to the initial data-type.
So Serialization is a part of Marshalling.
CodeBase is information that tells the receiver of Object where the implementation of this object can be found. Any program that thinks it might ever pass an object to another program that may not have seen it before must set the codebase, so that the receiver can know where to download the code from, if it doesn't have the code available locally. The receiver will, upon deserializing the object, fetch the codebase from it and load the code from that location. (Copied from #Nasir answer)
Serialization is almost like a stupid memory-dump of the memory used by the object(s), while Marshalling stores information about custom data-types.
In a way, Serialization performs marshalling with implementation of pass-by-value because no information of data-type is passed, just the primitive form is passed to byte stream.
Serialization may have some issues related to big-endian, small-endian if the stream is going from one OS to another if the different OS have different means of representing the same data. On the other hand, marshalling is perfectly fine to migrate between OS because the result is a higher-level representation.

Marshaling refers to converting the signature and parameters of a function into a single byte array.
Specifically for the purpose of RPC.
Serialization more often refers to converting an entire object / object tree into a byte array
Marshaling will serialize object parameters in order to add them to the message and pass it across the network.
*Serialization can also be used for storage to disk.*

I think that the main difference is that Marshalling supposedly also involves the codebase. In other words, you would not be able to marshal and unmarshal an object into a state-equivalent instance of a different class.
Serialization just means that you can store the object and reobtain an equivalent state, even if it is an instance of another class.
That being said, they are typically synonyms.

Marshalling is the rule to tell compiler how the data will be represented on another environment/system;
For example;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
public string cFileName;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
public string cAlternateFileName;
as you can see two different string values represented as different value types.
Serialization will only convert object content, not representation (will stay same) and obey rules of serialization, (what to export or no). For example, private values will not be serialized, public values yes and object structure will stay same.

Here's more specific examples of both:
Serialization Example:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
typedef struct {
char value[11];
} SerializedInt32;
SerializedInt32 SerializeInt32(int32_t x)
{
SerializedInt32 result;
itoa(x, result.value, 10);
return result;
}
int32_t DeserializeInt32(SerializedInt32 x)
{
int32_t result;
result = atoi(x.value);
return result;
}
int main(int argc, char **argv)
{
int x;
SerializedInt32 data;
int32_t result;
x = -268435455;
data = SerializeInt32(x);
result = DeserializeInt32(data);
printf("x = %s.\n", data.value);
return result;
}
In serialization, data is flattened in a way that can be stored and unflattened later.
Marshalling Demo:
(MarshalDemoLib.cpp)
#include <iostream>
#include <string>
extern "C"
__declspec(dllexport)
void *StdCoutStdString(void *s)
{
std::string *str = (std::string *)s;
std::cout << *str;
}
extern "C"
__declspec(dllexport)
void *MarshalCStringToStdString(char *s)
{
std::string *str(new std::string(s));
std::cout << "string was successfully constructed.\n";
return str;
}
extern "C"
__declspec(dllexport)
void DestroyStdString(void *s)
{
std::string *str((std::string *)s);
delete str;
std::cout << "string was successfully destroyed.\n";
}
(MarshalDemo.c)
#include <Windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
int main(int argc, char **argv)
{
void *myStdString;
LoadLibrary("MarshalDemoLib");
myStdString = ((void *(*)(char *))GetProcAddress (
GetModuleHandleA("MarshalDemoLib"),
"MarshalCStringToStdString"
))("Hello, World!\n");
((void (*)(void *))GetProcAddress (
GetModuleHandleA("MarshalDemoLib"),
"StdCoutStdString"
))(myStdString);
((void (*)(void *))GetProcAddress (
GetModuleHandleA("MarshalDemoLib"),
"DestroyStdString"
))(myStdString);
}
In marshaling, data does not necessarily need to be flattened, but it needs to be transformed to another alternative representation. all casting is marshaling, but not all marshaling is casting.
Marshaling doesn't require dynamic allocation to be involved, it can also just be transformation between structs. For example, you might have a pair, but the function expects the pair's first and second elements to be other way around; you casting/memcpy one pair to another won't do the job because fst and snd will get flipped.
#include <stdio.h>
typedef struct {
int fst;
int snd;
} pair1;
typedef struct {
int snd;
int fst;
} pair2;
void pair2_dump(pair2 p)
{
printf("%d %d\n", p.fst, p.snd);
}
pair2 marshal_pair1_to_pair2(pair1 p)
{
pair2 result;
result.fst = p.fst;
result.snd = p.snd;
return result;
}
pair1 given = {3, 7};
int main(int argc, char **argv)
{
pair2_dump(marshal_pair1_to_pair2(given));
return 0;
}
The concept of marshaling becomes especially important when you start dealing with tagged unions of many types. For example, you might find it difficult to get a JavaScript engine to print a "c string" for you, but you can ask it to print a wrapped c string for you. Or if you want to print a string from JavaScript runtime in a Lua or Python runtime. They are all strings, but often won't get along without marshaling.
An annoyance I had recently was that JScript arrays marshal to C# as "__ComObject", and has no documented way to play with this object. I can find the address of where it is, but I really don't know anything else about it, so the only way to really figure it out is to poke at it in any way possible and hopefully find useful information about it. So it becomes easier to create a new object with a friendlier interface like Scripting.Dictionary, copy the data from the JScript array object into it, and pass that object to C# instead of JScript's default array.
(test.js)
var x = new ActiveXObject('Dmitry.YetAnotherTestObject.YetAnotherTestObject');
x.send([1, 2, 3, 4]);
(YetAnotherTestObject.cs)
using System;
using System.Runtime.InteropServices;
namespace Dmitry.YetAnotherTestObject
{
[Guid("C612BD9B-74E0-4176-AAB8-C53EB24C2B29"), ComVisible(true)]
public class YetAnotherTestObject
{
public void send(object x)
{
System.Console.WriteLine(x.GetType().Name);
}
}
}
above prints "__ComObject", which is somewhat of a black box from the point of view of C#.
Another interesting concept is that you might have the understanding how to write code, and a computer that knows how to execute instructions, so as a programmer, you are effectively marshaling the concept of what you want the computer to do from your brain to the program image. If we had good enough marshallers, we could just think of what we want to do/change, and the program would change that way without typing on the keyboard. So, if you could have a way to store all the physical changes in your brain for the few seconds where you really want to write a semicolon, you could marshal that data into a signal to print a semicolon, but that's an extreme.

Marshalling is usually between relatively closely associated processes; serialization does not necessarily have that expectation. So when marshalling data between processes, for example, you may wish to merely send a REFERENCE to potentially expensive data to recover, whereas with serialization, you would wish to save it all, to properly recreate the object(s) when deserialized.

My understanding of marshalling is different to the other answers.
Serialization:
To Produce or rehydrate a wire-format version of an object graph utilizing a convention.
Marshalling:
To Produce or rehydrate a wire-format version of an object graph by utilizing a mapping file, so that the results can be customized. The tool may start by adhering to a convention, but the important difference is the ability to customize results.
Contract First Development:
Marshalling is important within the context of contract first development.
Its possible to make changes to an internal object graph, while keeping the external interface stable over time. This way all of the service subscribers won't have to be modified for every trivial change.
Its possible to map the results across different languages. For example from the property name convention of one language ('property_name') to another ('propertyName').

Marshaling uses Serialization process actually but the major difference is that it in Serialization only data members and object itself get serialized not signatures but in Marshalling Object + code base(its implementation) will also get transformed into bytes.
Marshalling is the process to convert java object to xml objects using JAXB so that it can be used in web services.

Serialisation vs Marshalling
Problem: Object belongs to some process(VM) and it's lifetime is the same
Serialisation - transform object state into stream of bytes(JSON, XML...) for saving, sharing, transforming...
Marshalling - contains Serialisation + codebase. Usually it used by Remote procedure call(RPC) -> Java Remote Method Invocation(Java RMI) where you are able to invoke a object's method which is hosted on remote Java processes.
codebase - is a place or URL to class definition where it can be downloaded by ClassLoader. CLASSPATH[About] is as a local codebase
JVM -> Class Loader -> load class definition
java -Djava.rmi.server.codebase="<some_URL>" -jar <some.jar>
Very simple diagram for RMI
Serialisation - state
Marshalling - state + class definition
Official doc

Think of them as synonyms, both have a producer that sends stuff over to a consumer... In the end fields of instances are written into a byte stream and the other end foes the reverse ands up with the same instances.
NB - java RMI also contains support for transporting classes that are missing from the recipient...

Related

Accepting managed struct in C++/CLI both with "hat" operator and without. What is the difference?

I've got a C++/CLI layer that I've been using successfully for a long time. But I just discovered something that makes me think I need to relearn some stuff.
When my C++/CLI functions receive an instance of any managed class, they use the "hat" operator ('^') and when they receive an instance of a managed struct, they do not. I thought this was how I was supposed to write it.
To illustrate as blandly as I can
using Point = System::Windows::Point;
public ref class CppCliClass
{
String^ ReturnText(String^ text) { return text; } // Hat operator for class
Point ReturnStruct(Point pt) { return pt; } // No hat operator for struct
};
I thought this was required. It certainly works. But just today I discovered that CancellationToken is a struct, not a class. My code accepts it with a hat. I thought it was a class when I wrote it. And this code works just fine. My cancellations are honored in the C++/CLI layer.
void DoSomethingWithCancellation(CancellationToken^ token)
{
// Code that uses the token. It works just fine
}
So apparently I can choose either method.
But then what is the difference between passing in a struct by value (as I've done with every other struct type I use, like Point) and by reference (as I'm doing with CancellationToken?). Is there a difference?
^ for reference types and without for value types matches C#, but C++/CLI does give you more flexibility:
Reference type without ^ is called "stack semantics" and automatically tries to call IDisposable::Dispose on the object at the end of the variable's lifetime. It's like a C# using block, except more user-friendly. In particular:
The syntax can be used whether the type implements IDisposable or not. In C#, you can only write a using block if the type can be proved, at compile time, to implement IDisposable. C++/CLI scoped resource management works fine in generic and polymorphic cases, where some of the objects do and some do not implement IDisposable.
The syntax can be used for class members, and automatically implements IDisposable on the containing class. C# using blocks only work on local scopes.
Value types used with ^ are boxed, but with the exact type tracked statically. You'll get errors if a boxed value of a different type is passed in.

Can I have a string object store its data within the structure?

I'm looking for a quick way to serialize custom structures consisting of basic value types and strings.
Using C++CLI to pin the pointer of the structure instance and destination array and then memcpy the data over is working quite well for all the value types. However, if I include any reference types such as string then all I get is the reference address.
Expected as much since otherwise it would be impossible for the structure to have a fixed.. structure. I figured that maybe, if I make the string fixed size, it might place it inside the structure though. Adding < VBFixedString(256) > to the string declaration did not achieve that.
Is there anything else that would place the actual data inside the structure?
Pinning a managed object and memcpy-ing the content will never give you what you want. Any managed object, be it String, a character array, or anything else will show up as a reference, and you'll just get a memory location.
If I read between the lines, it sounds like you need to call some C or C++ (not C++/CLI) code, and pass it a C struct that looks similar to this:
struct UnmanagedFoo
{
int a_number;
char a_string[256];
};
If that's the case, then I'd solve this by setting up the automatic marshaling to handle this for you. Here's how you'd define that struct so that it marshals properly. (I'm using C# syntax here, but it should be an easy conversion to VB.net syntax.)
[StructLayout(LayoutKind.Sequential, CharSet=CharSet.Ansi)]
public struct ManagedFoo
{
public int a_number;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst=256)]
public string a_string;
}
Explanation:
StructLayout(LayoutKind.Sequential) specifies that the fields should be in the declared order. The default LayoutKind, Auto, allows the fields to be re-ordered if the compiler wants.
CharSet=CharSet.Ansi specifies the type of strings to marshal. You can specify CharSet.Ansi to get char strings on the C++ side, or CharSet.Unicode to get wchar_t strings in C++.
MarshalAs(UnmanagedType.ByValTStr) specifies a string inline to the struct, which is what you were asking about. There are several other string types, with different semantics, see the UnmanagedType page on MSDN for descriptions.
SizeConst=256 specifies the size of the character array. Note that this specifies the number of characters (or when doing arrays, number of array elements), not the number of bytes.
Now, these marshal attributes are an instruction to the built-in marshaler in .Net, which you can call directly from your VB.Net code. To use it, call Marshal.StructureToPtr to go from the .Net object to unmanaged memory, and Marshal.PtrToStructure to go from unmanaged memory to a .Net object. MSDN has some good examples of calling those two methods, take a look at the linked pages.
Wait, what about C++/CLI? Yes, you could use C++/CLI to marshal from the .Net object to a C struct. If your structs get too complex to represent with the MarshalAs attribute, it's highly appropriate to do that. In that case, here's what you do: Declare your .Net struct like I listed above, without the MarshalAs or StructLayout. Also declare the C struct, plain and ordinary, also as listed above. When you need to switch from one to the other, copy things field by field, not a big memcpy. Yes, all the fields that are basic types (integers, doubles, etc.) will be a repetitive output.a_number = input.a_number, but that's the proper way to do it.

dlang inheritance design for types passed between threads

I'm writing a multithreaded program in the D programming language, but am pretty new to the language. There is a restriction on types passed between threads using the Tid.send() and receive[Only]() APIs in the std.concurrency package that they must be value types or must be constant to avoid race conditions between the sender and receiver threads. I have a simple struct Message type that I have been passing by value:
enum MessageType {
PrepareRequest,
PrepareResponse,
AcceptRequest,
Accepted
}
struct Message {
MessageType type;
SysTime timestamp;
uint node;
ulong value;
}
However, some MessageTypes don't have all the fields, and it's annoying to use a switch statement and remember which types have which fields when I could use polymorphism to do this work automatically. Is using an immutable class hierarchy recommended here, or is the approach I'm already using the best way to go, and why?
Edit
Also, if I should use immutable classes, what's the recommended way to create immutable objects of a user-defined class? A static method on the class they come from that casts the return value to immutable?
As a rule of a thumb, if you have a polymorphic type hierarchy, classes are the tool to use. And if mutation is out of the question by design, immutable classes should do the trick efficiently.
Great presentation from DConf2013 by Ali has been published recently : http://youtu.be/mPr2UspS0fE . It goes through topic of usage of const and immutable in D in great detail. Among the other good stuff it suggests to use
auto var = new immutable(ClassType)(...); syntax for creating immutable classes. All initialization goes to constructor then and no special hacks are needed.

boost::serialization polymorphic type initialization

I have a base class and 4 derived classes. I store all my derived classes in a vector of base class pointer type. During first initialization I create each derived type differently using their constructors. Basically they each have different param types in their ctors. (I had to provide a protected default ctor to make BOOST_CLASS_EXPORT compile but that's a different story). I don't/can't save all the members (filled in ctor) of these derived classes.
Now, when I load objects from the disk using boost::serialize, these members (that are not serialized and specific to each derived type) are destroyed. And, I cannot think of a way to re-initialize these derived types since I only store the base class pointers.
What exactly I need is being able to load my derived types (pointers) partially, without deleting all their content..
Is there a way to overcome this, a magic boost define or function call perhaps? Otherwise, polymorphism with boost::serialize is not possible at all.. I should be missing something and hope I could define my problem good enough.
You shouldn't need to create a default constructor just for serialization. You can instead have boost save/load the data needed by a non-default constructor, and use that to construct new objects when loading.
That way, whatever your constructors do to ensure the validity of data members can also happen during serialization, and the serialization library never has to manipulate the data members of your objects directly. This should prevent data erasure.
For example, if your class can be constructed using a name and a size, you could overload the functions as follows:
template <class Archive>
inline void save_construct_data(Archive & ar, const my_class * t, const unsigned int) {
ar << my_class->name();
ar << my_class->size();
}
template<class Archive>
inline void load_construct_data(Archive & ar, my_class * t, const unsigned int) {
std::string name;
int size;
ar >> name;
ar >> size;
::new(t)my_class(name, size); // placement 'new' using your regular constructor
}
Check out the docs here.

How do I pass an array of structs (containing std:string or BSTR) from ATL to C#. SafeArray? Variant?

I have an ATL COM object that I am using from C#. The interface currently looks like:
interface ICHASCom : IDispatch{
[id(1), helpstring("method Start")] HRESULT Start([in] BSTR name, [out,retval] VARIANT_BOOL* result);
...
[id(4), helpstring("method GetCount")] HRESULT GetCount([out,retval] LONG* numPorts);
...
[id(7), helpstring("method EnableLogging")] HRESULT EnableLogging([in] VARIANT_BOOL enableLogging);
};
That is, it's a very simple interface. I also have some events that I send back too.
Now, I would like to add something to the interface. In the ATL I have some results, which are currently structs and look like
struct REPORT_LINE
{
string creationDate;
string Id;
string summary;
};
All the members of the struct are std::string. I have an array of these that I need to get back to the C#. What's the best way to do this?
I suspect someone is going to say, "hey, you can't just send std::string over COM like that. If so, fine, but what's the best way to modidfy the struct? Change the std::string to BSTR? And then how do I,
1) Set up the IDL to pass an array of structs (structs with BSTR or std::string)
2) If I must use SAFEARRAYS, how do I fill the SAFEARRAYS with the structs.
I'm not familiar with COM except for use with simple types.
a user defined structure is incompatible with the automation interface. You can probably work out a nested array or two dimensional safe array of BSTRs, but a more maintainable solution would be wrapping the structure as an automation object with 3 properties, then wrap the array as a collection that has an enumerator.
Neither IDL nor Automation define byte alignment for a struct. So you can have compatibility problems if your COM server has different struct alignment with the client. e.g. VB has a 4-byte alignment, while the #import in Visual C++ default to a 8-byte alignment. If you have a slightest chance in the future to use the interface in scripting, avoid using structs.
Suggested reading:
"Passing Structures through IDispatch" by Don Box, Microsoft
Systems Journal, June 1996.
*Design Principles for Collection and Enumerator Interfaces
*ATL internals: working with ATL 8 By Chris Tavares, Kirk Fertitta page 392