Should a BOM (byte order mark) be added for empty strings (UTF-16 and UTF-32)?

Excluding UTF-8, is there a general understanding, or an unspoken convention, that if a string is empty the encoder can (or should) safely omit the BOM?
It seems like it would be a waste for empty strings, especially when sending to a server.
Encoding type and byte order would be irrelevant in such a case.
Is there an RFC that specifically addresses BOM for empty strings?
Thank you.

A BOM is typically used only when there is no other external information about the string's encoding. That makes sense for text files, where the data has to be self-describing, but less so for transmission protocols, which usually carry the encoding some other way: the Content-Type header in HTTP, the <meta> tag in HTML, an encoding fixed by the protocol spec or its extensions, etc.
For simply storing a string in memory, a BOM is useless if you are tracking the string properly. Also, depending on the particular string type you are actually using, an empty string may or may not be implemented as a NULL pointer, so you might not be able to include a BOM anyway.
And no, there is no RFC about general BOM usage.
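As a rough illustration of the point above, here is a minimal Python sketch of an encoder that skips the BOM for empty strings. The helper name encode_utf16 and the choice of little-endian output are assumptions made for the example, and it presumes the receiver learns the encoding some other way.

```python
import codecs


def encode_utf16(text: str, with_bom: bool = True) -> bytes:
    """Encode text as little-endian UTF-16, optionally prefixed with a BOM.

    Hypothetical helper: an empty string is encoded as zero bytes, on the
    assumption that the encoding is known to the receiver by other means.
    """
    if not text:
        return b""
    body = text.encode("utf-16-le")  # explicit byte order, no BOM added
    return codecs.BOM_UTF16_LE + body if with_bom else body
```

For comparison, Python's generic "utf-16" codec writes a BOM even for an empty string, so "".encode("utf-16") is two bytes long; picking "utf-16-le" or "utf-16-be" explicitly is what lets the caller decide.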

Related

Serialization of data for protocol implementation

I am to implement a communication protocol. The data structures used in the protocol are defined as bytes per field in each message:
bytes 1-2 -> stx bytes
bytes 3 -> msg type
bytes 4-5 -> size of payload
bytes 6-... -> payload bytes (unsigned bytes)
bytes ... - ...+1 -> checksum from byte 3 - ...
bytes ...+2 -> end byte
The example above has a variable payload size, but some messages are fixed size.
I have checked a serialization library, namely "protocol buffers", for this purpose, but I concluded that protobuf is not compliant, as the variable-length integer (varint) encoding it uses changes the bytes that are serialized.
Similar libraries exist, but I am not sure if they can be used for this purpose (FlatBuffers, Cap'n Proto).
So, is there a framework to define the interface structures and generate appropriate code (data structures + parser + serializer, with support for multiple languages if possible) for the defined interface?
Or what is the best approach you would suggest for this purpose?
Defining the messages used in a protocol by defining what each byte means is, well, old fashioned. Having said that, an awful lot of current protocols are defined that way.
If you can, it's better to start off with a schema for the protocol (e.g. a .proto file for Google Protocol Buffers, or an .asn file for ASN.1, etc.; there are many to choose from), defining the messages you want to exchange, and then use your chosen serialisation technology's tools (e.g. protoc for GPB, asn1c for ASN.1, etc.) to generate code.
That schema would be defining the content of the "payload" bytes in your example, and you'd leave it up to GPB or whatever to work out how to convey message type, size and length for you. Different serialisation technologies have different capabilities in this regard. In GPB you'd use a oneof structure to incorporate all the different types of content you want to send into a single structure, but GPB doesn't demarcate between different messages on the wire (you have to add that yourself, perhaps by sending messages using ZeroMQ). Several ASN.1 wire formats do demarcate between different messages, saving you the bother (useful on a raw stream connection). XML is also demarcated. I don't think Cap'n Proto does demarcation.
If you're stuck with the protocol as defined byte-by-byte, exactly as you've shown, it's going to be difficult to find a serialisation technology that usefully matches. You'd likely be condemned to writing all the code yourself.
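If you do end up writing it yourself, the code is not necessarily large. Below is a minimal Python sketch of an encoder and parser for the variable-size frame in the question. Several details are not given in the question and are assumptions here: the STX and end byte values, big-endian field order, and a checksum computed as the byte sum modulo 2**16.

```python
import struct
from typing import Tuple

STX = b"\x02\x02"  # assumed start bytes; the question does not give the values
ETX = b"\x03"      # assumed end byte


def checksum16(data: bytes) -> int:
    """Assumed checksum: sum of all bytes modulo 2**16."""
    return sum(data) & 0xFFFF


def encode_frame(msg_type: int, payload: bytes) -> bytes:
    # Bytes 3.. of the frame: message type, payload size, payload.
    body = struct.pack(">BH", msg_type, len(payload)) + payload
    return STX + body + struct.pack(">H", checksum16(body)) + ETX


def decode_frame(frame: bytes) -> Tuple[int, bytes]:
    # 2 (stx) + 1 (type) + 2 (size) + 2 (checksum) + 1 (end) = 8 bytes overhead.
    if len(frame) < 8 or frame[:2] != STX or frame[-1:] != ETX:
        raise ValueError("bad framing")
    msg_type, size = struct.unpack_from(">BH", frame, 2)
    if len(frame) != 8 + size:
        raise ValueError("length mismatch")
    body = frame[2:5 + size]
    (received,) = struct.unpack_from(">H", frame, 5 + size)
    if received != checksum16(body):
        raise ValueError("checksum mismatch")
    return msg_type, body[3:]
```

Calling decode_frame(encode_frame(1, b"\x10\x20")) returns (1, b"\x10\x20"). Fixed-size messages could use a single struct.Struct per message type instead.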

Can Protocol Buffer be partially serialized?

Originally, the program saves the data to file by its own defined behavior. First, the data is defined as following:
struct Data{
DWORD m_Location;
BYTE m_StableCount;
BYTE extra[3]; /* nice 4 byte divisible value */
// the following data is not stored in the file
DWORD m_Uid;
WORD m_Address;
};
The fields before m_Uid are stored in the file; the others are NOT.
Now, I want to convert Data into a protocol buffer message. As far as I know, all fields defined in a message can be serialized. So I have to split Data into two parts: one including all the saved fields, the other including the remaining fields.
Here is my question: what if I declare all fields of Data in one message, and serialize only some of the fields with protocol buffers? Does any API support this?
Thanks in advance.
This largely depends on what library you are using. A lot of protocol buffers implementations work as code-gen from the schema, and you have to use the generated DTO - so you would already need to push the data into a different object model. That is an implementation detail, though - it isn't a protocol requirement. For example, protobuf-net allows your existing model to be used, and makes it possible to ignore/include values both generally, and specifically (i.e. it allows per-instance conditional serialization, using the standard conventions of the .NET world for such things). However, I'm assuming that your question relates specifically to non-.NET code, in which case the challenge would be to find a C/C++ library that allows for this approach.
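Independent of any protobuf library, the idea of keeping one in-memory model while writing only a subset of its fields can be sketched in Python. This is not a protobuf API; the field names mirror the struct in the question, and the little-endian layout and the save/load helpers are assumptions made for illustration.

```python
import struct
from dataclasses import dataclass


@dataclass
class Data:
    location: int      # m_Location (DWORD), stored
    stable_count: int  # m_StableCount (BYTE), stored
    uid: int = 0       # m_Uid, runtime only, never written
    address: int = 0   # m_Address, runtime only, never written


# DWORD + BYTE + 3 padding bytes; little-endian is an assumption.
_STORED = struct.Struct("<IB3x")


def save(d: Data) -> bytes:
    """Serialize only the persisted fields."""
    return _STORED.pack(d.location, d.stable_count)


def load(raw: bytes) -> Data:
    """Rebuild a Data; the runtime-only fields take their defaults."""
    location, stable_count = _STORED.unpack(raw)
    return Data(location, stable_count)
```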

Why do we use serialization?

Why do we need to use serialization?
If we want to send an object or piece of data through a network we can use streams of bytes. If we want to save some data to the disk, we can again use the binary mode along with the byte streams and save it.
So what's the advantage of using serialization?
Technically on the low-level, your serialized object will also end up as a stream of bytes on your cable or your filesystem...
So you can also think of it as a standardized and already available way of converting your objects to a stream of bytes. Storing/transferring objects is a very common requirement, and there is little point in reinventing this wheel in every application.
As others have mentioned, you also know that this object->stream_of_bytes implementation is quite robust, tested, and generally architecture-independent.
This does not mean it is the only acceptable way to save or transfer an object: in some cases, you'll have to implement your own methods, for example to avoid saving unnecessary/private members (for example for security or performance reasons). But if you are in a simple case, you can make your life easier by using the serialization/deserialization of your framework, language or VM instead of having to implement it by yourself.
Hope this helps.
Quoting from the book Designing Data-Intensive Applications:
Programs usually work with data in (at least) two different representations:
In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers).
When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t make sense to any other process, this sequence-of-bytes representation looks quite different from the data structures that are normally used in memory.
Thus, we need some kind of translation between the two representations. The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).
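The two representations in the quote can be shown with a small Python sketch. The function names encode and decode are illustrative only, and JSON is just one possible byte format.

```python
import json


def encode(record: dict) -> bytes:
    """In-memory structure -> self-contained byte sequence (JSON, UTF-8)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def decode(raw: bytes) -> dict:
    """Byte sequence -> a fresh in-memory structure."""
    return json.loads(raw.decode("utf-8"))
```

The decoded dict is equal to the original but is a different object in memory: only the values travel, not the pointers.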
Among other reasons: to be compatible between architectures. An integer doesn't have the same number of bytes from one architecture to another, and sometimes from one compiler to another.
Plus what you're talking about is still serialization. Binary Serialization. You're putting all the bytes of your object together in order to store them and be able to reconvert them as an object later. This is serializing.
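A minimal sketch of what architecture-independent binary serialization of an integer can look like in Python: the width and byte order are fixed by the format string rather than by the machine. The helper names and the choice of 4-byte big-endian are assumptions for the example.

```python
import struct


def encode_u32(value: int) -> bytes:
    """Always 4 bytes, big-endian, whatever the host CPU or compiler."""
    return struct.pack(">I", value)


def decode_u32(raw: bytes) -> int:
    (value,) = struct.unpack(">I", raw)
    return value
```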
More info on wikipedia
Serialization is the process of converting an object into a stream so that it can be saved in a physical file (such as XML) or in a database. The main purpose of serialization in C# is to persist an object and save it in a specified storage medium such as a stream, a physical file or a database.
In general, serialization is a method to persist an object's state, but I suggest you read this wiki page; it is pretty detailed and correct in my opinion:
http://en.wikipedia.org/wiki/Serialization
In serialization, the point is not turning an object into bits and bytes, objects ARE bits and bytes already. Serialization is the process of making the object's "state" persistent. Notice the word "state", which means the values of the instance variables of the entire object graph (the target object and all the objects it references either directly or indirectly) WITHOUT methods and other extra runtime stuff stuck to them (and of course plus a little more info that JVM needs for restoration of these objects, such as their class types).
So this is the main reason it is necessary: storing every byte of the objects would be expensive and, for all intents and purposes, unnecessary.

Is it possible to hook into the protobuf-net serializer to add some custom logic?

This may be overkill, but I am trying to reduce the network consumption of a client/server protocol, by having both sides keep copies of previously transferred URIs, so as to use 2-4 byte placeholders instead of the full URIs on subsequent chatter.
Problem is I think it will be quite expensive to reflect through all the complex objects being transferred to locate the URIs that need processing, whereas the serializer is already visiting all these fields and probably using a mechanism much faster than reflection.
Can this be done in protobuf-net?
If this is part of a single call to Serialize/Deserialize (i.e. your data has the same uri repeated at multiple locations), then you can already do this, simply by telling it to treat those strings as references (it has special handling of strings, so two different references of the same string contents count as equal):
[ProtoMember(7, AsReference=true)]
public string Uri {get;set;}
During serialization, the first time it spots a new string value (decorated with AsReference=true) it will generate a unique token to represent the string; all subsequent usages of that same string will serialize only the token.
If this is in separate calls to Serialize/Deserialize, then no: you would have to do it manually. I can think of some ways of doing it, but I think this would be better handled outside of the serialization layer.
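One way the manual, outside-the-serializer approach could look is sketched below in Python (rather than C#, to keep it short). The UriTable class is hypothetical, and it assumes messages are delivered in order over a reliable connection so both sides build identical tables.

```python
class UriTable:
    """Maps URIs to small integer tokens; each side of the connection keeps one."""

    def __init__(self):
        self._by_uri = {}    # uri -> token
        self._by_token = []  # token -> uri

    def _register(self, uri):
        self._by_uri[uri] = len(self._by_token)
        self._by_token.append(uri)

    def encode(self, uri):
        """Return ("ref", token) for a known URI, else register it and return ("new", uri)."""
        token = self._by_uri.get(uri)
        if token is not None:
            return ("ref", token)
        self._register(uri)
        return ("new", uri)

    def decode(self, item):
        """Inverse of encode; registers new URIs in the same order as the sender."""
        kind, value = item
        if kind == "new":
            self._register(value)
            return value
        return self._by_token[value]
```

The sender would run each URI through encode before handing the message to the serializer, and the receiver would run decode after deserializing.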
Could you customise the objects whose URIs you want to tokenise, so that they inherit from a base class or implement an interface you can check to see whether a particular object is a tokenizer?
If so, you might be able to use the before-serialization / after-deserialization callbacks (in protobuf-net, methods marked [ProtoBeforeSerialization] / [ProtoAfterDeserialization]) to make your transformations.

Sending data between server and client in Twisted

I'm trying to transport data between server and client implemented with twisted. As far as I know, using self.transport.write([data]) will work only if data is a string. Is there any other way I can send an object of other type? Thank you!
Sockets carry bytes. That's the only kind of thing they carry. Any two endpoints of a TCP connection can only convey bytes to each other.
Bytes aren't the most useful data structure for every form of communication. So on top of this byte transport, we invent schemes for formatting and interpreting bytes. These are protocols.
Twisted represents protocols as classes, almost always subclasses of twisted.internet.protocol.Protocol, which implement a particular scheme.
These classes have methods for turning something which isn't pure bytes into something which is pure bytes. For example, twisted.protocols.basic.NetstringReceiver is an implementation of the netstring protocol. It turns a particular number of bytes into bytes which represent both the number of bytes and the bytes themselves. This is a rather subtle protocol, since it's not instantly obvious that the byte count is information that needs to be conveyed as well.
These classes also interpret bytes received from the network, in their dataReceived method, according to the protocol they implement, and turn the resulting information into something more structured. NetstringReceiver uses the length information to accept exactly the right number of bytes from the network and then deliver them to its stringReceived callback as a single Python str instance.
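To make the netstring format concrete, here is a plain-Python sketch of the encoding and of length-based decoding from a receive buffer. It is not Twisted's implementation, and the function names are made up for the example.

```python
def encode_netstring(payload: bytes) -> bytes:
    """b"hello" becomes b"5:hello,": length, colon, payload, comma."""
    return str(len(payload)).encode("ascii") + b":" + payload + b","


def decode_netstrings(buffer: bytes):
    """Return (list of complete payloads, unconsumed remainder of the buffer)."""
    messages = []
    while True:
        colon = buffer.find(b":")
        if colon == -1:
            break  # length prefix not complete yet
        length = int(buffer[:colon].decode("ascii"))
        end = colon + 1 + length
        if len(buffer) < end + 1:
            break  # payload or trailing comma not received yet
        if buffer[end:end + 1] != b",":
            raise ValueError("missing trailing comma")
        messages.append(buffer[colon + 1:end])
        buffer = buffer[end + 1:]
    return messages, buffer
```

The remainder matters because TCP delivers bytes in arbitrary chunks: a partial message stays in the buffer until the rest arrives.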
Other protocols do more than NetstringReceiver. For example, twisted.protocols.ftp includes an implementation of the FTP protocol. FTP is a protocol geared towards passing file listings and files over a socket (or several sockets, actually). twisted.mail.pop3 implements POP3, a protocol for transferring email over sockets.
There are lots and lots of different protocols because there are lots and lots of different things you might want to do. Depending on exactly what you're trying to do, there are probably different ways to convert to and from bytes to make things easier or faster or more robust (and so on). So there's no single protocol that is ideal for the general case. That includes the case of "sending an object", since objects can take many different forms, and there may be many different reasons you want to send them, and many different ways you might want to handle things like mutation of an object you'd previously sent, and so on.
You probably want to spend a little time thinking about what kind of communication you need. This should suggest certain things about the protocol you'll select to do the communication.
For example, if you want to be able to call methods on Python objects that exist on the other side of a connection, then Twisted Spread might be interesting.
If you want something cross-language instead, and only need to convey simple types, like integers, strings, and lists, then XML-RPC (Twisted How-To) might be a better fit.
If you need a protocol that's more space efficient than XML-RPC and supports serialization of more complicated types, then AMP might be more appropriate.
And the list goes on. :)
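As one concrete example of the "simple types, cross-language" case, the sketch below uses the marshalling functions from Python's standard xmlrpc.client module (not Twisted's XML-RPC support) to turn a call with an integer, a string and a list into bytes and back. The wrapper names encode_call and decode_call are illustrative.

```python
import xmlrpc.client


def encode_call(method: str, *args) -> bytes:
    """Marshal a method name and simple-typed arguments into XML-RPC bytes."""
    return xmlrpc.client.dumps(args, methodname=method).encode("utf-8")


def decode_call(raw: bytes):
    """Unmarshal XML-RPC bytes back into (method name, argument tuple)."""
    params, method = xmlrpc.client.loads(raw.decode("utf-8"))
    return method, params
```

The resulting bytes are what you would hand to transport.write on one side and parse on the other.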