Is there a way to deserialize raw byte[] back into thrift object without knowing its thrift type? - cross-platform

I'm running a project that requires:
inter-communication between different programming languages (mostly Java and C++);
serialization/deserialization in both binary and JSON formats;
an IDL to generate class code for the different languages.
Thrift matches these criteria perfectly, although we don't need its RPC functions. We will be sending/receiving the serialized Thrift data via MQ. Serializing the object is very straightforward. However, when it comes to deserializing, we cannot do something like this:
byte[] data = recv();
Object object = TDeserializer.deserialize(data);
if (object instanceof TypeA) {
TypeA a = (TypeA) object;
} else if (object instanceof TypeB) {
TypeB b = (TypeB) object;
}
It seems we have to tell Thrift exactly which struct it needs to deserialize into, like this:
byte[] data = recv();
TypeA a = new TypeA();
new TDeserializer().deserialize(a, data);
Just wondering if there's a way to deserialize raw data into thrift object without knowing its exact type.
Thanks!!

A serialized Thrift message doesn't itself contain type information, so the deserializer must know the message's data type. However, it's possible to wrap all the necessary data types into a union.
Thrift code:
union Message {
1: TypeA a;
2: TypeB b;
}
Deserialization code:
byte[] data = recv();
Message msg = new Message();
new TDeserializer().deserialize(msg, data);
// find out the message type with msg.getSetField()
If you need to add new message types, just add another field into union. If you don't touch old field IDs, you will retain backward compatibility:
union Message {
1: TypeA a;
2: TypeB b;
3: TypeC c; // OK
}
You will be able to receive messages from old producers (they will never send TypeC messages) and send TypeA/TypeB messages to old consumers. If you send a TypeC message to a consumer that's not aware of field #3, it will get an exception.
The big advantage of this approach is that the type information is very compact. If you use TCompactProtocol, the type info will take only 1 extra byte in most cases (if the field IDs in Message are less than 127).
Be careful: if you change field IDs, you will lose backward compatibility. For example:
union Message {
1: TypeA a;
2: TypeC c; // Wrong
3: TypeB b; // Wrong
4: TypeD d; // OK
}
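The union works because its field ID acts as a type tag in front of the payload. Below is a minimal Java sketch of that idea using only the standard library, not Thrift itself: a hand-rolled one-byte tag plays the role of the union field ID, and the names TagDispatchSketch, TAG_A, TAG_B, wrap and describe are made up for illustration. With real Thrift you would deserialize into the generated Message class and branch on msg.getSetField() instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: a hand-rolled type tag standing in for a Thrift union field ID.
final class TagDispatchSketch {
    // Hypothetical tags, mirroring field IDs 1 and 2 of the Message union.
    static final byte TAG_A = 1;
    static final byte TAG_B = 2;

    // Sender side: prefix the payload with its tag.
    static byte[] wrap(byte tag, byte[] payload) {
        byte[] out = new byte[payload.length + 1];
        out[0] = tag;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }

    // Receiver side: read the tag first, then decide how to decode the rest.
    static String describe(byte[] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("empty message");
        }
        byte[] payload = Arrays.copyOfRange(data, 1, data.length);
        String text = new String(payload, StandardCharsets.UTF_8);
        switch (data[0]) {
            case TAG_A: return "TypeA:" + text;
            case TAG_B: return "TypeB:" + text;
            // A consumer that only knows tags 1 and 2 cannot decode a newer tag.
            default: throw new IllegalArgumentException("unknown tag " + data[0]);
        }
    }

    public static void main(String[] args) {
        byte[] msg = wrap(TAG_B, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(describe(msg));
    }
}
```

As with the union, adding a new tag is safe for old producers, while reusing an existing tag for a different type breaks every consumer that still expects the old meaning.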

Related

Need of serialization

I'm new to the concept of serialization; please help me understand it.
What exactly does serialization mean? I have read the definition, but could not understand it in detail.
How are basic types (int, string) serialized?
If we don't use serialization in our code, how will data be transmitted?
Is there any implicit serialization involved when accessing a database from front-end Java/C# code, for example when inserting into or deleting from the database?
Serialization just takes an object and translates it into something simpler. Imagine that you had an object in C# like so:
class Employee
{
public int age;
public string fullname;
}
public static void Main()
{
var john = new Employee();
john.age = 21;
john.fullname = "John Smith";
var matt = new Employee();
matt.age = 44;
matt.fullname = "Matt Rogers";
...
This is C# friendly. But if you wanted to save that information in a text file in CSV format, you would end up with something like this:
age,fullname
21,John Smith
44,Matt Rogers
When you write a CSV, you are basically serializing information into a different format - in this case a CSV file. You can serialize your object to XML, JSON, database table(s), memory or something else. Here's an example from Udemy regarding serialization.
If you don't serialize, confusion will be transmitted. Perhaps your object's ToString() will be implicitly called before transmission, and whatever it returns gets transmitted. Therefore it is vital to convert your data to something that is receiver friendly.
There's always some serialization happening. When you execute a query that populates a DataTable, for example, serialization has occurred.
Concept :
Serialization is the process of converting an object into a series of bytes.
The objects we use in an application are usually complex, yet all of them can be represented as a series of bytes that can be stored in a file or database or transferred over a network.
In Java, you can make a class serializable just by having it implement the Serializable interface.
For a class to be serialized successfully, two conditions must be met:
The class must implement the java.io.Serializable interface.
All of the fields in the class must be serializable. If a field is not serializable, it must be marked transient.
Once the program has serialized the object, the bytes can be stored in a file (by convention with a .ser extension) and later used for deserialization.
Serialization associates a serialVersionUID with the serialized class, and it has to match on deserialization.
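The two conditions above can be sketched in Java as follows. This is a minimal illustration rather than a recommended design: the Employee class, its fields and the sessionToken value are hypothetical, and the bytes are kept in memory instead of a .ser file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializationSketch {
    // Hypothetical class used only for this example.
    static final class Employee implements Serializable {
        // Declared explicitly; must match between writer and reader.
        private static final long serialVersionUID = 1L;
        final String fullName;
        final int age;
        // transient: excluded from the byte stream.
        transient String sessionToken;

        Employee(String fullName, int age, String sessionToken) {
            this.fullName = fullName;
            this.age = age;
            this.sessionToken = sessionToken;
        }
    }

    // Object -> bytes.
    static byte[] serialize(Employee employee) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(employee);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes -> a new, equivalent object.
    static Employee deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Employee) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Employee copy = deserialize(serialize(new Employee("John Smith", 21, "secret")));
        // The transient field comes back as null; the other fields survive.
        System.out.println(copy.fullName + ", " + copy.age + ", " + copy.sessionToken);
    }
}
```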

Protobuf concatenation of serialized messages into one file

I have some data serialized with Google protobuf in a series of files, and wonder if there is a shortcut for concatenating these smaller files into one larger protobuf without reading each and every protobuf, grouping the objects, and writing them out again.
Is there a cheap way to join files together? I.e. do I have to serialize each individual file?
You can combine protocol buffers messages by simple concatenation. It appears that you want the result to form an array, so you'll need to serialize each individual file as an array itself:
message MyItem {
...
}
message MyCollection {
repeated MyItem items = 1;
}
Now if you serialize each file as a MyCollection and then concatenate them (just put the raw binary data together), the resulting file can be read as one large collection.
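Why this works can be seen at the wire-format level: each element of a repeated field is written as its own tag-plus-length record, so two encoded collections placed back to back are just a longer run of the same records. The Java sketch below hand-encodes that layout with the standard library only; it does not use the protobuf library, it treats each item as an opaque string rather than a real MyItem message, and the names ConcatSketch, encode, decode and concat are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ConcatSketch {
    // Tag byte for field number 1 with wire type 2 (length-delimited): (1 << 3) | 2.
    static final int ITEMS_TAG = 0x0A;

    // Base-128 varint, least significant group first.
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode a collection: one (tag, length, payload) record per item.
    static byte[] encode(List<String> items) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String item : items) {
            byte[] payload = item.getBytes(StandardCharsets.UTF_8);
            out.write(ITEMS_TAG);
            writeVarint(out, payload.length);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    // Decode every record in the buffer, wherever it originally came from.
    static List<String> decode(byte[] data) {
        List<String> items = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            if ((data[pos++] & 0xFF) != ITEMS_TAG) {
                throw new IllegalArgumentException("unexpected field");
            }
            int length = 0;
            int shift = 0;
            int b;
            do {
                b = data[pos++] & 0xFF;
                length |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            items.add(new String(data, pos, length, StandardCharsets.UTF_8));
            pos += length;
        }
        return items;
    }

    // Plain byte-level concatenation, as you would do with two files.
    static byte[] concat(byte[] first, byte[] second) {
        byte[] out = Arrays.copyOf(first, first.length + second.length);
        System.arraycopy(second, 0, out, first.length, second.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] fileA = encode(Arrays.asList("a", "b"));
        byte[] fileB = encode(Arrays.asList("c"));
        System.out.println(decode(concat(fileA, fileB)));
    }
}
```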
In addition to jpa's answer, it may be relevant to add that the data does not need to be in the exact same container when serialized for it to be compatible on deserialization.
Consider the following messages:
message FileData{
required uint32 versionNumber = 1;
repeated Data initialData = 2;
}
message MoreData{
repeated Data data = 2;
}
It is possible to serialize those different messages into one single data container and deserialize it as one single FileData message, as long as the FileData is serialized before the zero or more MoreData messages and both FileData and MoreData use the same index for the repeated field.

Do I need to translate enum values across a WCF service?

The scenario is as follows: I implemented a WCF service (let's call it X) which has its own data objects.
Service X uses another WCF service (Y), which has its own set of data objects. Service X needs to pass some data it receives from service Y to its own clients.
As far as I know, it is considered a "best practice" to translate the objects received from service Y into data objects of service X.
What is the best practice when it comes to enum values? Do I need to map each enum value, or is there another way?
Generally the idea is to isolate users of your service from changes in your implementation. Therefore, you do not expose your implementation types on the wire. Imagine the situation where you decide to rename an enum value. If the service consumer does not update their implementation, you will have introduced a breaking change, as the service user will be sending you the old enum value, which will not deserialize correctly.
In addition, you may find that not all of the enum values are applicable to users of your service (maybe some are only used internally).
So, yes, you should translate enum values just like other types.
If you give your enums explicit numeric values, you can translate between them using casts:
class Program
{
static void Main(string[] args)
{
Internal i = Internal.Too;
External e = (External) i;
Console.WriteLine(e);
}
}
enum Internal
{
One = 1,
Too = 2
}
[DataContract]
enum External
{
[EnumMember]
One = 1,
[EnumMember]
Two = 2
}
However, you would have to be careful that the two enums do not get out of sync.
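One way to guard against the two enums drifting apart is to translate through an explicit lookup that fails loudly on an unmapped value, instead of relying on a bare cast. The sketch below shows the idea in Java rather than C#; the enum names, the DEBUG_ONLY member and the numeric codes are all hypothetical.

```java
final class EnumTranslationSketch {
    // Implementation-side enum; DEBUG_ONLY is deliberately not exposed to clients.
    enum Internal {
        ONE(1), TWO(2), DEBUG_ONLY(99);

        final int code;

        Internal(int code) {
            this.code = code;
        }
    }

    // Contract-side enum: only the values that clients are allowed to see.
    enum External {
        ONE(1), TWO(2);

        final int code;

        External(int code) {
            this.code = code;
        }

        // Look up by numeric code and reject anything that has no counterpart.
        static External fromCode(int code) {
            for (External candidate : values()) {
                if (candidate.code == code) {
                    return candidate;
                }
            }
            throw new IllegalArgumentException("no External value for code " + code);
        }
    }

    // Translate at the service boundary.
    static External toExternal(Internal value) {
        return External.fromCode(value.code);
    }

    public static void main(String[] args) {
        System.out.println(toExternal(Internal.TWO));
    }
}
```

With this shape, an internal value that was never mapped surfaces as an exception at the boundary rather than as an undefined value on the wire.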

Strangest LINQ to SQL case I have ever seen

OK, so this is the strangest issue in .net programming I have ever seen. It seems that object fields are serialized in .net web services in order of field initialization.
It all started with Flex not accepting a SOAP response from a .NET web service. I found out that it was because the order of the serialized fields did not match the order of the fields in the declared serializable class.
It had something to do with generic lists and LINQ to SQL but I can't find out what. This one is really hard to reproduce.
Example to get the idea:
[Serializable]
public class SomeSample
{
public int A;
public int B;
public int C;
}
I was querying some data tables within asmx web service using linq and returning list of SomeSample objects:
var r = (from ...... select new SomeSample { A = 1, C = 3 }).ToList();
The list was then iterated once more and the B field was assigned a value (e.g. 2).
However the returned soap envelope contained following excerpt:
<A>1</A><C>3</C><B>2</B>
Please notice the order of serialization. If I initially initialized all fields:
var r = (from ...... select new SomeSample { A = 1, B = 2, C = 3 }).ToList();
object was serialized in correct order.
I must add, that in both cases the debugger shows exactly the same content of "r" variable.
Am I losing my mind or is this normal behavior?
Thanks in advance.
I think you should not rely on serialization order. Actually it doesn't matter for correct deserialization.
Yes it does. :-( At least for elements for SOAP deserializer in Flex SDK 3.6.0.x.
What I meant in my previous post is that, regardless of whether it is OK according to the SOAP/WSDL specification, the Flex SDK expects <sequence> child elements to be provided in the order defined in the WSDL.

What is the difference between Serialization and Marshaling?

I know that in terms of several distributed techniques (such as RPC), the term "Marshaling" is used but don't understand how it differs from Serialization. Aren't they both transforming objects into series of bits?
Related:
What is Serialization?
What is Object Marshalling?
Marshaling and serialization are loosely synonymous in the context of remote procedure call, but semantically different as a matter of intent.
In particular, marshaling is about getting parameters from here to there, while serialization is about copying structured data to or from a primitive form such as a byte stream. In this sense, serialization is one means to perform marshaling, usually implementing pass-by-value semantics.
It is also possible for an object to be marshaled by reference, in which case the data "on the wire" is simply location information for the original object. However, such an object may still be amenable to value serialization.
As @Bill mentions, there may be additional metadata such as code base location or even object implementation code.
Both do one thing in common - that is serializing an Object. Serialization is used to transfer objects or to store them. But:
Serialization: When you serialize an object, only the member data within that object is written to the byte stream, not the code that actually implements the object.
Marshalling: The term marshalling is used when we talk about passing objects to remote objects (RMI). In marshalling, the object is serialized (its member data is serialized) and the codebase is attached.
So Serialization is a part of Marshalling.
CodeBase is information that tells the receiver of Object where the implementation of this object can be found. Any program that thinks it might ever pass an object to another program that may not have seen it before must set the codebase, so that the receiver can know where to download the code from, if it doesn't have the code available locally. The receiver will, upon deserializing the object, fetch the codebase from it and load the code from that location.
From the Marshalling (computer science) Wikipedia article:
The term "marshal" is considered to be synonymous with "serialize" in the Python standard library, but the terms are not synonymous in the Java-related RFC 2713:
To "marshal" an object means to record its state and codebase(s) in such a way that when the marshalled object is "unmarshalled", a copy of the original object is obtained, possibly by automatically loading the class definitions of the object. You can marshal any object that is serializable or remote. Marshalling is like serialization, except marshalling also records codebases. Marshalling is different from serialization in that marshalling treats remote objects specially. (RFC 2713)
To "serialize" an object means to convert its state into a byte stream in such a way that the byte stream can be converted back into a copy of the object.
So, marshalling also saves the codebase of an object in the byte stream in addition to its state.
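In Java, this distinction is visible in the standard library: java.rmi.MarshalledObject wraps an object using the same semantics RMI uses for parameters, which includes annotating classes with a codebase URL when one is available. The sketch below only round-trips a local list, so no codebase is actually needed; the class name MarshalSketch and the helper roundTrip are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.rmi.MarshalledObject;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MarshalSketch {
    // Marshal the value into a MarshalledObject, then unmarshal a fresh copy.
    static <T> T roundTrip(T value) {
        try {
            MarshalledObject<T> marshalled = new MarshalledObject<>(value);
            return marshalled.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> original = new ArrayList<>(Arrays.asList("a", "b"));
        List<String> copy = roundTrip(original);
        // The copy has equal state but is a distinct object.
        System.out.println(copy.equals(original) + " " + (copy != original));
    }
}
```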
Basics First
Byte Stream - Stream is a sequence of data. Input stream - reads data from source. Output stream - writes data to destination.
Java Byte Streams are used to perform input/output byte by byte (8 bits at a time). A byte stream is suitable for processing raw data like binary files.
Java Character Streams are used to perform input/output 2 bytes at a time, because Characters are stored using Unicode conventions in Java with 2 bytes for each character. Character stream is useful when we process (read/write) text files.
RMI (Remote Method Invocation) - an API that provides a mechanism to create distributed application in java. The RMI allows an object to invoke methods on an object running in another JVM.
Serialization and marshalling are loosely used as synonyms. Here are a few differences.
Serialization - The data members of an object are written in binary form to a byte stream (which can then be written to a file, memory, a database, etc.). No information about data types is retained once the object's data members are written in binary form.
Marshalling - The object is serialized (to a byte stream in binary format) with data type and codebase attached and then passed to a remote object (RMI). Marshalling transforms the data type into a predetermined naming convention so that it can be reconstructed with respect to the initial data type.
So Serialization is a part of Marshalling.
CodeBase is information that tells the receiver of Object where the implementation of this object can be found. Any program that thinks it might ever pass an object to another program that may not have seen it before must set the codebase, so that the receiver can know where to download the code from, if it doesn't have the code available locally. The receiver will, upon deserializing the object, fetch the codebase from it and load the code from that location. (Copied from @Nasir's answer)
Serialization is almost like a stupid memory-dump of the memory used by the object(s), while Marshalling stores information about custom data-types.
In a way, Serialization performs marshalling with implementation of pass-by-value because no information of data-type is passed, just the primitive form is passed to byte stream.
Serialization may run into big-endian versus little-endian issues when the stream goes from one OS to another, if the two systems represent the same data differently. Marshalling, on the other hand, migrates between systems without trouble because the result is a higher-level representation.
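The byte-order concern can be made concrete with a small Java sketch: the same int produces different byte sequences depending on the order chosen, and reading with the wrong order silently yields a different number. The class and method names are made up for illustration. Note that this applies to raw, memory-dump style encodings; formats that fix a byte order in their specification avoid the problem.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderSketch {
    // Write a 32-bit int using the given byte order.
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(Integer.BYTES).order(order).putInt(value).array();
    }

    // Read a 32-bit int, interpreting the bytes in the given order.
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] big = encode(1, ByteOrder.BIG_ENDIAN);
        // Matching order recovers the value; mismatched order does not.
        System.out.println(decode(big, ByteOrder.BIG_ENDIAN));
        System.out.println(decode(big, ByteOrder.LITTLE_ENDIAN));
    }
}
```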
Marshaling refers to converting the signature and parameters of a function into a single byte array.
Specifically for the purpose of RPC.
Serialization more often refers to converting an entire object / object tree into a byte array
Marshaling will serialize object parameters in order to add them to the message and pass it across the network.
*Serialization can also be used for storage to disk.*
I think that the main difference is that Marshalling supposedly also involves the codebase. In other words, you would not be able to marshal and unmarshal an object into a state-equivalent instance of a different class.
Serialization just means that you can store the object and reobtain an equivalent state, even if it is an instance of another class.
That being said, they are typically synonyms.
Marshalling is a rule that tells the compiler how the data will be represented in another environment/system.
For example:
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
public string cFileName;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
public string cAlternateFileName;
As you can see, two different string values are represented as different value types.
Serialization only converts the object's content, not its representation (which stays the same), and obeys the rules of serialization (what to export and what not). For example, private values will not be serialized, public values will, and the object structure stays the same.
Here's more specific examples of both:
Serialization Example:
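#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
typedef struct {
    char value[12]; /* "-2147483648" is 11 characters plus the terminating NUL */
} SerializedInt32;
/* Serialize: flatten the integer into its decimal text form. */
SerializedInt32 SerializeInt32(int32_t x)
{
    SerializedInt32 result;
    /* snprintf is used because itoa is not part of standard C */
    snprintf(result.value, sizeof result.value, "%" PRId32, x);
    return result;
}
/* Deserialize: parse the decimal text back into an integer. */
int32_t DeserializeInt32(SerializedInt32 x)
{
    return (int32_t)strtol(x.value, NULL, 10);
}
int main(int argc, char **argv)
{
    int32_t x;
    SerializedInt32 data;
    int32_t result;
    x = -268435455;
    data = SerializeInt32(x);
    result = DeserializeInt32(data);
    printf("x = %s.\n", data.value);
    /* exit status 0 when the round trip preserved the value */
    return result == x ? 0 : 1;
}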
In serialization, data is flattened in a way that can be stored and unflattened later.
Marshalling Demo:
(MarshalDemoLib.cpp)
#include <iostream>
#include <string>
extern "C"
__declspec(dllexport)
/* Returns nothing, matching the void (*)(void *) cast used by the caller. */
void StdCoutStdString(void *s)
{
    std::string *str = (std::string *)s;
    std::cout << *str;
}
extern "C"
__declspec(dllexport)
void *MarshalCStringToStdString(char *s)
{
std::string *str(new std::string(s));
std::cout << "string was successfully constructed.\n";
return str;
}
extern "C"
__declspec(dllexport)
void DestroyStdString(void *s)
{
std::string *str((std::string *)s);
delete str;
std::cout << "string was successfully destroyed.\n";
}
(MarshalDemo.c)
#include <Windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
int main(int argc, char **argv)
{
void *myStdString;
LoadLibraryA("MarshalDemoLib");
myStdString = ((void *(*)(char *))GetProcAddress (
GetModuleHandleA("MarshalDemoLib"),
"MarshalCStringToStdString"
))("Hello, World!\n");
((void (*)(void *))GetProcAddress (
GetModuleHandleA("MarshalDemoLib"),
"StdCoutStdString"
))(myStdString);
((void (*)(void *))GetProcAddress (
GetModuleHandleA("MarshalDemoLib"),
"DestroyStdString"
))(myStdString);
}
In marshaling, data does not necessarily need to be flattened, but it needs to be transformed into an alternative representation. All casting is marshaling, but not all marshaling is casting.
Marshaling doesn't require dynamic allocation; it can also just be a transformation between structs. For example, you might have a pair, but the function expects the pair's first and second elements the other way around; casting or memcpy-ing one pair to the other won't do the job, because fst and snd will get flipped.
#include <stdio.h>
typedef struct {
int fst;
int snd;
} pair1;
typedef struct {
int snd;
int fst;
} pair2;
void pair2_dump(pair2 p)
{
printf("%d %d\n", p.fst, p.snd);
}
pair2 marshal_pair1_to_pair2(pair1 p)
{
pair2 result;
result.fst = p.fst;
result.snd = p.snd;
return result;
}
pair1 given = {3, 7};
int main(int argc, char **argv)
{
pair2_dump(marshal_pair1_to_pair2(given));
return 0;
}
The concept of marshaling becomes especially important when you start dealing with tagged unions of many types. For example, you might find it difficult to get a JavaScript engine to print a "c string" for you, but you can ask it to print a wrapped c string for you. Or if you want to print a string from JavaScript runtime in a Lua or Python runtime. They are all strings, but often won't get along without marshaling.
An annoyance I had recently was that JScript arrays marshal to C# as "__ComObject", and has no documented way to play with this object. I can find the address of where it is, but I really don't know anything else about it, so the only way to really figure it out is to poke at it in any way possible and hopefully find useful information about it. So it becomes easier to create a new object with a friendlier interface like Scripting.Dictionary, copy the data from the JScript array object into it, and pass that object to C# instead of JScript's default array.
(test.js)
var x = new ActiveXObject('Dmitry.YetAnotherTestObject.YetAnotherTestObject');
x.send([1, 2, 3, 4]);
(YetAnotherTestObject.cs)
using System;
using System.Runtime.InteropServices;
namespace Dmitry.YetAnotherTestObject
{
[Guid("C612BD9B-74E0-4176-AAB8-C53EB24C2B29"), ComVisible(true)]
public class YetAnotherTestObject
{
public void send(object x)
{
System.Console.WriteLine(x.GetType().Name);
}
}
}
The above prints "__ComObject", which is somewhat of a black box from the point of view of C#.
Another interesting concept is that you might have the understanding how to write code, and a computer that knows how to execute instructions, so as a programmer, you are effectively marshaling the concept of what you want the computer to do from your brain to the program image. If we had good enough marshallers, we could just think of what we want to do/change, and the program would change that way without typing on the keyboard. So, if you could have a way to store all the physical changes in your brain for the few seconds where you really want to write a semicolon, you could marshal that data into a signal to print a semicolon, but that's an extreme.
Marshalling is usually between relatively closely associated processes; serialization does not necessarily have that expectation. So when marshalling data between processes, for example, you may wish to merely send a REFERENCE to potentially expensive data to recover, whereas with serialization, you would wish to save it all, to properly recreate the object(s) when deserialized.
My understanding of marshalling is different to the other answers.
Serialization:
To Produce or rehydrate a wire-format version of an object graph utilizing a convention.
Marshalling:
To Produce or rehydrate a wire-format version of an object graph by utilizing a mapping file, so that the results can be customized. The tool may start by adhering to a convention, but the important difference is the ability to customize results.
Contract First Development:
Marshalling is important within the context of contract first development.
It's possible to make changes to an internal object graph while keeping the external interface stable over time. This way, the service subscribers won't all have to be modified for every trivial change.
It's possible to map the results across different languages, for example from the property name convention of one language ('property_name') to that of another ('propertyName').
Marshalling actually uses the serialization process, but the major difference is that in serialization only the data members and the object itself get serialized, not signatures, whereas in marshalling the object plus its codebase (its implementation) is also transformed into bytes.
Marshalling is the process of converting Java objects to XML objects using JAXB so that they can be used in web services.
Serialisation vs Marshalling
Problem: an object belongs to some process (VM), and its lifetime is the same as that process's.
Serialisation - transforms object state into a stream of bytes (JSON, XML...) for saving, sharing, transforming...
Marshalling - consists of serialisation + codebase. It is usually used by remote procedure call (RPC) mechanisms such as Java Remote Method Invocation (Java RMI), where you are able to invoke a method of an object hosted in a remote Java process.
codebase - a place or URL to the class definition, from which it can be downloaded by the ClassLoader. CLASSPATH acts as a local codebase.
JVM -> Class Loader -> load class definition
java -Djava.rmi.server.codebase="<some_URL>" -jar <some.jar>
Very simple diagram for RMI
Serialisation - state
Marshalling - state + class definition
Official doc
Think of them as synonyms: both have a producer that sends stuff over to a consumer... In the end, the fields of instances are written into a byte stream, and the other end does the reverse and ends up with the same instances.
NB - java RMI also contains support for transporting classes that are missing from the recipient...