Can Protocol Buffer be partially serialized? - serialization

Originally, the program saves its data to a file in its own custom format. The data is defined as follows:
struct Data{
DWORD m_Location;
BYTE m_StableCount;
BYTE extra[3]; /* nice 4 byte divisible value */
// the following data is not stored in the file
DWORD m_Uid;
WORD m_Address;
};
The fields before m_Uid are stored in the file; the others are NOT.
Now I want to convert Data into a protocol buffer message. As far as I know, all fields defined in a message are serialized. So I would have to split Data into two parts: one containing the saved fields, the other containing the remaining fields.
Here is my question: what if I declare all fields of Data in one message and serialize only some of them with protocol buffers? Does any API support that?
Thanks in advance.

This largely depends on what library you are using. A lot of protocol buffers implementations work as code-gen from the schema, and you have to use the generated DTO - so you would already need to push the data into a different object model. That is an implementation detail, though - it isn't a protocol requirement. For example, protobuf-net allows your existing model to be used, and makes it possible to ignore/include values both generally, and specifically (i.e. it allows per-instance conditional serialization, using the standard conventions of the .NET world for such things). However, I'm assuming that your question relates specifically to non-.NET code, in which case the challenge would be to find a C/C++ library that allows for this approach.
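As a rough illustration of the "split into two parts" idea from the question, here is a minimal sketch in plain C++. It does not use any protocol buffers library; PersistedData stands in for whatever message type would hold the saved fields, and encodePersisted/decodePersisted are hypothetical helpers invented for this example. The three padding bytes from the original struct are left out on the assumption that they carry no information.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// The fields that are written to the file (the part before m_Uid).
struct PersistedData {
    std::uint32_t location = 0;    // m_Location (DWORD)
    std::uint8_t stableCount = 0;  // m_StableCount (BYTE)
};

// The full in-memory record: persisted part plus runtime-only fields.
struct Data {
    PersistedData saved;
    std::uint32_t uid = 0;      // m_Uid, never written
    std::uint16_t address = 0;  // m_Address, never written
};

// Hypothetical helper: writes only the persisted part, little-endian.
std::vector<std::uint8_t> encodePersisted(const Data& d) {
    std::vector<std::uint8_t> out;
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>((d.saved.location >> shift) & 0xFFu));
    }
    out.push_back(d.saved.stableCount);
    return out;
}

// Hypothetical helper: restores the persisted part; runtime fields keep defaults.
Data decodePersisted(const std::vector<std::uint8_t>& in) {
    if (in.size() != 5) throw std::invalid_argument("expected 5 bytes");
    Data d;
    for (std::size_t i = 0; i < 4; ++i) {
        d.saved.location |= static_cast<std::uint32_t>(in[i]) << (8 * i);
    }
    d.saved.stableCount = in[4];
    return d;
}
```

With a code-generating library, the generated message would play the role of PersistedData, and the rest of Data would stay in your own type.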

Related

Code design: Who's responsible for changing object data?

Assuming I have some kind of data structure to work on (for example images) which I want to pre- and postprocess in different ways to make further processing steps easier. What's the best way to implement this responsibility with an OOP language like C++?
Further assuming I have a lot of different processing algorithms with inherent complexity I very likely want to encapsulate them in dedicated classes. This means though that the algorithm implementations externally have to set some kind of info in my data to indicate it having been processed. And that also doesn't look like clean design to me because having been processed seems like an info associated with the data and thus something the data object itself should determine and set on its own.
It also looks like a very common source of error in complex applications: Someone implements another processing algorithm, forgets to set the flags in the data appropriately, something in completely different parts of the application won't work as expected and someone will have lots of fun spotting the error.
Can someone outline a general structure for a good, fail-safe way to implement something like this?
To make sure I understand what you are asking, here are my assumptions based on my reading of the question:
The data is some kind of binary format (presumably an image but as you say it could be anything) that can be represented as an array of bytes
There are a number of processing steps (I'll refer to them as transformations) that can be applied to the data
Some transformations depend on others such that, for example, you would like to avoid applying a transformation if its prerequisite has not been applied. You would like it to be robust, so that attempting to apply an illegal transformation will be detected and prevented.
And the question is how to do this in an object-oriented way that avoids future bugs as the complexity of the program increases.
One way is to have the image data object, which encapsulates both the binary data and a record of the transformations that have been applied to it, be responsible for executing the transformation through a Transformation object delegate; and the Transformation objects implement both the processing algorithm and the knowledge of whether it can be applied based on previous transformations.
So you might define the following (excuse my Java-like naming style; it's been a long time since I've done C++):
An enumerated type called TransformationType
An abstract class called Transformer, with the following methods:
A method called 'getType' which returns a TransformationType
A method called 'canTransform' that accepts a list of TransformationType and returns a boolean. The list indicates transformations that have already been applied to the data, and the boolean indicates whether it is OK to execute this transformation.
A method called 'transform' that accepts an array of bytes and returns an array of (presumably modified) bytes
A class called BinaryData, containing a byte array and a list of TransformationType. This class implements the method 'void transform(Transformer t)' to do the following:
Query the transformer's 'canTransform' method, passing the list of transformation types; either throw an exception or return if canTransform returns false
Replace the byte array with the results of invoking t.transform(data)
Add the transformer's type to the list
I think this accomplishes what you want - the image transformation algorithms are defined polymorphically in classes, but the actual application of the transformations is still 'controlled' by the data object. Hence we do not have to trust external code to do the right thing wrt setting / checking flags, etc.
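The structure described above could be sketched in C++ roughly as follows. The names Transformer, BinaryData, TransformationType, getType, canTransform and transform come from the description; the two concrete transformers (Clamp and Invert) and the rule that Invert requires Clamp are made up purely for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

enum class TransformationType { Clamp, Invert };

using Bytes = std::vector<std::uint8_t>;
using History = std::vector<TransformationType>;

// Abstract transformation: knows its type, its preconditions and its algorithm.
class Transformer {
public:
    virtual ~Transformer() = default;
    virtual TransformationType getType() const = 0;
    virtual bool canTransform(const History& applied) const = 0;
    virtual Bytes transform(const Bytes& input) const = 0;
};

// The data object owns the bytes and the record of what has been applied.
class BinaryData {
public:
    explicit BinaryData(Bytes bytes) : bytes_(std::move(bytes)) {}

    void transform(const Transformer& t) {
        if (!t.canTransform(history_)) {
            throw std::logic_error("transformation not allowed in current state");
        }
        bytes_ = t.transform(bytes_);
        history_.push_back(t.getType());  // only the data object records this
    }

    const Bytes& bytes() const { return bytes_; }
    const History& history() const { return history_; }

private:
    Bytes bytes_;
    History history_;
};

// Illustrative transformer: caps every byte at 200; has no prerequisites.
class Clamp : public Transformer {
public:
    TransformationType getType() const override { return TransformationType::Clamp; }
    bool canTransform(const History&) const override { return true; }
    Bytes transform(const Bytes& input) const override {
        Bytes out(input);
        for (auto& b : out) b = std::min<std::uint8_t>(b, 200);
        return out;
    }
};

// Illustrative transformer: inverts every byte; requires a prior Clamp.
class Invert : public Transformer {
public:
    TransformationType getType() const override { return TransformationType::Invert; }
    bool canTransform(const History& applied) const override {
        return std::find(applied.begin(), applied.end(),
                         TransformationType::Clamp) != applied.end();
    }
    Bytes transform(const Bytes& input) const override {
        Bytes out(input);
        for (auto& b : out) b = static_cast<std::uint8_t>(255 - b);
        return out;
    }
};
```

A caller would write something like img.transform(Clamp{}) followed by img.transform(Invert{}); calling Invert first throws, and the history is left unchanged.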

Why do we use serialization?

Why do we need to use serialization?
If we want to send an object or piece of data through a network we can use streams of bytes. If we want to save some data to the disk, we can again use the binary mode along with the byte streams and save it.
So what's the advantage of using serialization?
Technically on the low-level, your serialized object will also end up as a stream of bytes on your cable or your filesystem...
So you can also think of it as a standardized and already available way of converting your objects to a stream of bytes. Storing/transferring objects is a very common requirement, and there is little point in reinventing this wheel in every application.
As others have mentioned, you also know that this object->stream_of_bytes implementation is quite robust, tested, and generally architecture-independent.
This does not mean it is the only acceptable way to save or transfer an object: in some cases, you'll have to implement your own methods, for example to avoid saving unnecessary/private members (for example for security or performance reasons). But if you are in a simple case, you can make your life easier by using the serialization/deserialization of your framework, language or VM instead of having to implement it by yourself.
Hope this helps.
Quoting from Designing Data Intensive Applications book:
Programs usually work with data in (at least) two different
representations:
In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for
efficient access and manipulation by the CPU (typically using
pointers).
When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes
(for example, a JSON document). Since a pointer wouldn’t make sense to
any other process, this sequence-of-bytes representation looks quite
different from the data structures that are normally used in memory.
Thus, we need some kind of translation between the two
representations. The translation from the in-memory representation to
a byte sequence is called encoding (also known as serialization or
marshalling), and the reverse is called decoding (parsing,
deserialization, unmarshalling).
Among other reasons: to be compatible between architectures. An integer doesn't have the same number of bytes from one architecture to another, and sometimes from one compiler to another.
Plus what you're talking about is still serialization. Binary Serialization. You're putting all the bytes of your object together in order to store them and be able to reconvert them as an object later. This is serializing.
More info on wikipedia
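To make the point about integer sizes concrete, here is a small sketch of what an architecture-independent encoding can look like: the integer is given a fixed width (std::uint32_t) and written byte by byte in a fixed order, instead of copying whatever the in-memory layout happens to be. The function names encodeU32 and decodeU32 are just illustrative, and little-endian order is an arbitrary choice for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Encode a 32-bit value as exactly four bytes, least significant byte first,
// regardless of the byte order of the machine running the code.
std::array<std::uint8_t, 4> encodeU32(std::uint32_t value) {
    std::array<std::uint8_t, 4> bytes{};
    for (std::size_t i = 0; i < 4; ++i) {
        bytes[i] = static_cast<std::uint8_t>((value >> (8 * i)) & 0xFFu);
    }
    return bytes;
}

// Rebuild the value from the same fixed byte order.
std::uint32_t decodeU32(const std::array<std::uint8_t, 4>& bytes) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        value |= static_cast<std::uint32_t>(bytes[i]) << (8 * i);
    }
    return value;
}
```

Because the shifts operate on values rather than on memory, the output bytes are the same on big-endian and little-endian machines.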
Serialization is the process of converting an object into a stream so that it can be saved in a physical file (such as XML) or in a database. The main purpose of serialization in C# is to persist an object and save it in a specified storage medium such as a stream, a physical file or a database.
In general, serialization is a method to persist an object's state, but I suggest you read this wiki page; it is pretty detailed and correct in my opinion:
http://en.wikipedia.org/wiki/Serialization
In serialization, the point is not turning an object into bits and bytes, objects ARE bits and bytes already. Serialization is the process of making the object's "state" persistent. Notice the word "state", which means the values of the instance variables of the entire object graph (the target object and all the objects it references either directly or indirectly) WITHOUT methods and other extra runtime stuff stuck to them (and of course plus a little more info that JVM needs for restoration of these objects, such as their class types).
So this is the main reason it is necessary: storing all the bytes of the objects would be expensive and, for all intents and purposes, unnecessary.

Is "serialisation without duplication" possible in c++0x?

One of the big uses of code generation in c++ is to support message serialisation. Typically, you want to support specifying message contents and layout in the same step and produce code for that message type that can give you objects capable of being serialised to/from communication streams. In the past, this has usually resulted in code that looks like:
class MyMessage : public SerialisableObject
{
// message members
int myNumber_;
std::string myString_;
std::vector<MyOtherSerialisableObject> aBunchOfThingsIWantToSerialise_;
public:
// ctor, dtor, accessors, mutators, then:
virtual void Serialise(SerialisationStream & stream)
{
stream & myNumber_;
stream & myString_;
stream & aBunchOfThingsIWantToSerialise_;
}
};
The problem with using this kind of design is that it violates an important rule of good architecture: you should not have to specify the intent of a design twice. Duplication of intent, like duplicated code and other common development duplication, leaves room for one place in the code to diverge from the other, causing errors.
In the above, the duplication is the list of members. Potential errors include adding a member to the class but forgetting to add it to the serialisation list, serialising a member twice (possibly by not using the same order as the member declaration or possibly due to a misspelling of a similar member, among other ways), or serialising something that is not a member (which might produce a compiler error, unless name lookup finds something at a different scope than the object that matches lookup rules). That kind of mistake is the same reason we no longer try to match every heap allocation with a delete (instead using smart pointers) or every file open with a close (using RAII ctor/dtor mechanisms) - we don't want to have to match up our intent in multiple places because there are times we - or another engineer less familiar with the intent - make mistakes.
Generally, therefore, this has been one of the things that code generation could take care of. You might create a file MyMessage.cg to specify both layout and members in one step
serialisable MyMessage
{
int myNumber_;
std::string myString_;
std::vector<MyOtherSerialisableObject> aBunchOfThingsIWantToSerialise_;
};
that would be run through a code generation utility and produce the code.
I was wondering if it was possible yet to do this in c++0x without external code generation. Are there any new language mechanisms that make it possible to specify a class as serialisable once, so that the names and layout of its members are used to lay out the message during serialisation?
To be clear, I know that there are tricks with boost tuples and fusion that can come close to this kind of behavior even in the pre-c++0x language. Those usages, though, being based on indexing into the tuple rather than by-member-name access, have all been brittle to changing the layout, as other places in the code that access the messages would then also need to be reordered. Some kind of by-member-name access is necessary to not have to duplicate the layout specification in places in the code that use the messages.
Also, I know it might be nice to take this up to the next level and ask for specifying when some of the members shouldn't be serialised. Other languages that offer serialisation built in often offer some kind of attribute to do this, so
int myNonSerialisedNumber_ [[noserialise]];
might seem natural. However, I personally think it is bad design to have serialisable objects where not everything is serialised, since the lifetime of messages is in the transport to/from the communications layer, separate from other data lifetimes. Also, you could have an object which has a purely serialisable object as one of its members, so such functionality doesn't buy anything the language doesn't already offer.
Is this possible? Or did the standards committee leave out this kind of introspective capability? I don't need it to look like the code gen file above - any simple method for compile-time specification of layout and members in a single step would solve this common problem.
This is both possible and practical in C++11 – in fact it was possible back in C++03, the syntax was just a little too unwieldy. I wrote a small library based around the same idea - see the following:
www.github.com/molw5/framework
Sample syntax:
class Object : serializable <Object,
value <NAME("Field 1"), int>,
value <NAME("Field 2"), float>,
value <NAME("Field 3"), double>>
{
};
Most of the underlying code could be reproduced, in principle, in C++03 – some of the implementation details without variadic templates would have been...tricky, but I believe it would have been possible to recover the core functionality. What you could not reproduce in C++03 was the NAME macro above, and the syntax relies fairly heavily on it. The macro provides the machinery necessary to generate a unique typename from a string, that is, the following:
NAME("Field 1")
expands to
type_string <'F', 'i', 'e', 'l', 'd', ' ', '1'>
through the use of some common macros and constexpr (for character extraction). Back in C++03 something similar to the type_string above would need to be entered manually.
C++, of any form, supports neither introspection nor reflection (to the extent that they are different).
One nice thing about doing serialization manually (ie: without introspection or reflection) is that you can provide object versioning. You can support older forms of the serialization, and simply create reasonable defaults for the data that wasn't in the old versions. Or if a new version removes some data, you can simply serialize and discard it.
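A minimal sketch of that versioning idea, assuming a made-up Settings record and a simple whitespace-separated text format (neither comes from the answers above): the writer always emits the current version number first, and the reader uses it to decide which fields to expect, filling in defaults for anything an older version did not store.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>

// Hypothetical record: 'brightness' was added in version 2 of the format.
struct Settings {
    int volume = 0;
    int brightness = 50;  // default used when reading version 1 data
};

const int kCurrentVersion = 2;

// Always write the newest layout, prefixed with its version number.
void save(std::ostream& out, const Settings& s) {
    out << kCurrentVersion << ' ' << s.volume << ' ' << s.brightness << '\n';
}

// Read any supported version; older records get defaults for missing fields.
Settings load(std::istream& in) {
    int version = 0;
    Settings s;
    if (!(in >> version) || version < 1 || version > kCurrentVersion) {
        throw std::runtime_error("unsupported or missing version");
    }
    if (!(in >> s.volume)) {
        throw std::runtime_error("truncated record");
    }
    if (version >= 2 && !(in >> s.brightness)) {
        throw std::runtime_error("truncated record");
    }
    return s;
}
```

Reading a version 1 record such as "1 7" yields volume 7 with the default brightness, while records written by save round-trip both fields.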
It seems to me that what you need is Boost.Serialization.

What does Serializing a graph mean?

I have seen the expression "Graph Serialization" in many places. What does it mean? And what does serialization mean in general, and when is it used or in which domains is it mentioned?
Serialization is the process of turning a data set into binary data for transmission or storage. On the iPhone for example, we do this:
NSString *myStringToSerialize = @"I'm going to be bits!";
NSData *data = [myStringToSerialize dataUsingEncoding: NSUnicodeStringEncoding];
The data object is now a binary representation of myStringToSerialize that we can do something with (POST it to a web server, save it to a file, email it, etc.).
Graph Serialization is when you take the graph structure and write it to bits so that you can send it somewhere and read it again.
We normally serialize for two reasons:
1) Serialization provides:
A method of persisting objects which is more convenient than writing their properties to a text file on disk, and re-assembling them by reading this back in.
A method of issuing remote procedure calls, e.g., as in SOAP
A method for distributing objects, especially in software componentry such as COM, CORBA, etc.
A method for detecting changes in time-varying data.
2) Serialization allows us to transfer objects between programming languages and various systems that would not be interoperable without serialization.
Serialization is used to flatten a complex structure in something that can be easily transmitted or stored. Every application uses objects that can represent a functional structure (List, Tree, Graph).
But problems arise when you have to use them outside your application. How, for instance, will you save your fabulous customer list once you have edited it? How can you provide a temperature graph through a web service? The answer is to put them in a linear structure, such as an array of bytes, a string or a database field.
For example, an XML file is the result of serializing a tree.
Graph serialization is about serializing graphs. The big issue with this type of content is that graphs are harder to traverse. Unlike trees, graphs can contain cycles, so they are harder to represent in a linear way.
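One common way to get around the cycle problem is to give every node an index and write edges as indices instead of following pointers. The sketch below assumes a made-up Node type, a simple text format, and labels without whitespace; serializeGraph and deserializeGraph are illustrative names, not part of any library.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

struct Node {
    std::string label;
    std::vector<Node*> edges;  // non-owning links; may form cycles
};

using Graph = std::vector<std::unique_ptr<Node>>;

// Write the node count, then one line per node: label, edge count, target indices.
std::string serializeGraph(const Graph& nodes) {
    std::map<const Node*, std::size_t> index;
    for (std::size_t i = 0; i < nodes.size(); ++i) index[nodes[i].get()] = i;

    std::ostringstream out;
    out << nodes.size() << '\n';
    for (const auto& n : nodes) {
        out << n->label << ' ' << n->edges.size();
        for (const Node* e : n->edges) out << ' ' << index.at(e);
        out << '\n';
    }
    return out.str();
}

// Create all nodes first, then resolve indices back into pointers.
Graph deserializeGraph(const std::string& text) {
    std::istringstream in(text);
    std::size_t count = 0;
    in >> count;

    Graph nodes;
    for (std::size_t i = 0; i < count; ++i) {
        nodes.push_back(std::unique_ptr<Node>(new Node()));
    }
    for (std::size_t i = 0; i < count; ++i) {
        std::size_t edgeCount = 0;
        in >> nodes[i]->label >> edgeCount;
        for (std::size_t j = 0; j < edgeCount; ++j) {
            std::size_t target = 0;
            in >> target;
            nodes[i]->edges.push_back(nodes.at(target).get());
        }
    }
    return nodes;
}
```

Because edges are stored as indices, a cycle such as a -> b -> a is written once per edge and never causes infinite recursion.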

Serialization vs. Archiving?

The iOS docs differentiate between "serializing" and "archiving." Is this a general distinction (i.e., holds in other languages) or is it specific to Objective-C? Also, what is the difference between these two?
This is a case of one being the other some (but not all) of the time.
Wikipedia has this to say about serialization:
"Serialization is the process of converting a data structure or object into a sequence of bits so that it can be stored in a file or memory buffer, or transmitted across a network connection link to be "resurrected" later in the same or another computer environment"
So, archiving may only be serialization, but it could also be the combination of serialization and compression, for example. Or perhaps it adds some kind of header info. So serialization is a form of archive, but an archive is not necessarily a serialization.
This isn't really specific to iOS - these terms are thrown around all over. Their specific meaning in the context of iOS could be quite specific, though.
I was actually trying to find the difference from an iOS perspective. Adding the following for people who are interested:
Purpose:
Archiving is used to store object graphs. A complete data model can be archived and restored easily. The way nib files work can be considered an example of archiving.
Serialization is used for storing an arbitrary hierarchy of objects.
The way plist files work can be considered an example of serialization.
Differences (excerpts from the Archives programming guide):
"The archive preserves the identity of every object in the graph and all the relationships it has with all the other objects in the graph."
Every object encoded within the context of rootObject invocation is tracked. If the coder is asked to encode an object more than once, the coder encodes a reference to the first encoding instead of encoding the object again.
"The serialization only preserves the values of the objects and their position in the hierarchy. Multiple references to the same value object might result in multiple objects when deserialized. The mutability of the objects is not maintained."
Implementation differences:
Any object that implements the NSCoding protocol can be archived, whereas only instances of NSArray, NSDictionary, NSString, NSDate, NSNumber, and NSData (and some of their subclasses) can be serialized. Array and dictionary objects must also contain only objects of these few classes.
When to Use:
Property lists (serialization) should be used for data that consists primarily of strings and numbers. They are very inefficient when used with large blocks of binary data.
Archiving is worthwhile for objects other than plist objects, or for storing large blocks of data.
Generally speaking, serialization is concerned with converting your program data types into architecture-independent byte streams. Archiving is specialized serialization in that you can store type and other relationship-based information that allows you to unserialize/unmarshal easily. So archival can be thought of as a specialization and subset of serialization. For Objective-C:
Serialization converts Objective-C
types to and from an
architecture-independent byte stream.
In contrast to archiving, basic
serialization does not record the data
type of the values nor the
relationships between them; only the
values themselves are recorded. It is
your responsibility to deserialize the
data in the proper order. Several
convenience classes, however, do
provide the ability to serialize
property lists, recording their
structure along with their values.
With C++ boost serialization --
http://www.boost.org/doc/libs/1_45_0/libs/serialization/doc/index.html
Here, we use the term "serialization"
to mean the reversible deconstruction
of an arbitrary set of C++ data
structures to a sequence of bytes.
Such a system can be used to
reconstitute an equivalent structure
in another program context. Depending
on the context, this might used
implement object persistence, remote
parameter passing or other facility.
In this system we use the term
"archive" to refer to a specific
rendering of this stream of bytes.
This could be a file of binary data,
text data, XML, or some other created
by the user of this library.