What is a protobuf message? - tensorflow

I'm learning how to use tf.records and in the official tutorial they mention you can print a tf.train.Example message (which is a primitive of the protobuf protocol if I get it right).
I understand that tf.records are used to serialize the data, and that they use the protobuf protocol in this case. I also understand that using tf.train.Feature, tf.train.Features and tf.train.Example one can convert the data into the right format.
My question is what does it mean to print a messege in this context? (the tutorial shows how to print an tf.train.Example message)

A message is classically thought of as a collection of bytes that are conveyed from one process/thread to another process/thread. Typically (but not necessarily), the collection of bytes means something to the sender and receiver, e.g. it's an object that has been serialised somehow (perhaps using Google Protocol Buffers). So, an object can become a message by serialising it and placing the bytes into an array that one might term a "message".
It's not necessarily the case the processes handling the collection of bytes will deserialise them. For example, a process that is simply going to pass them onwards down another connection need not actually deserialise them, if it already knows where the bytes are supposed to be sent.
The means by which a message is conveyed is typically some sort of queue / pipe / socket / stream / etc. Where it gets interesting is that most data transports of this sort are stream connections; whatever bytes you push in one end comes out the other. So, then, how to use those for sending messages?
The answer is that there has to be some way of demarcating between messages. There's lots of ways of doing that, but these days it makes far more sense to use something like ZeroMQ, which takes care of all that for you (and more besides). ZeroMQ is a library / protocol that allows a program to transfer a collection of bytes from one process/thread to another via stream connections, and ensure that the receiving program gets the collection in one nice and complete buffer. Those bytes could be objects serialised by Google Protocol Buffer, or serialised in some other way (there's lots). HTTP is also used as a way of moving objects around, e.g. a page of HTML.
So the pattern is object -> serialisation -> message buffer -> some sort of byte transport that demarcates one message from another -> message buffer -> deserialisation -> object.
An advantage of serialisations like Protocol Buffers is that the sender and receiver need not be written in the same language, or share anything at all except for the .proto file. Other approaches to serialisation often involves marking up class definitions in the program source code, which then makes it difficult to deserialise data in another language.
Also in languages like C/C++ one might get away with simply copying the bytes at the object's address from one place to another. This can be a total disaster if the destination is a different machine; endianness etc. can matter a lot. There are serialisation standards that get close to this, specifically Cap'n Proto (see this).
There are variations. Within a process, "passing a message" can simply mean passing ownership of an object around. Ownership can be by convention, i.e. if I've just written the object pointer to a message queue, I won't mutate the object anymore. I think in Rust it's even expressed by the language syntax, in that once object ownership has been given up the language won't let you mutate the object (worked out at compile time, part of what makes Rust so good). The net result looks like message transfer, but in fact all that's happened is a pointer (typically, 64bits) has been copied from A to B, not the entire data in the object. This is a lot faster.
EDIT
So, How Does a Message Transport Protocol Work?
It's worth digging into how something like ZeroMQ works. For it to be able to pass whole application messages across a stream connection, it needs operate some sort of protocol. That protocol is itself going to involve objects (Protocol Data Units) being "serialised" (well, converted to an agreed wire format), pushed through the stream connection, deserialised, and understood by the ZeroMQ library that's on the receiving end. And, when gets on down to it, ZeroMQ is using TCP (over a network), and that too is a protocol built on IP. And that goes on down to Ethernet frames.
So, there's protocols running atop protocols, running atop other protocols (in fact, this is the Layer Model of how computer interconnectedness works).
Why That Matters, and What Can Go Wrong
It's useful to bearing this protocol layering in mind. Sometimes, one might have a requirement to (for example), take very strong measures against buffer overflows, perhaps to prevent remote exploitation. That might be a reason to pick a serialisation technology that helps guard against such things - e.g. Protocol Buffers. However, when picking such a technology, one has to realise that the requirement is met provided that all of the protocol layerings are equally robust. There's no point using, say, Protocol Buffers and declaring oneself safe against buffer overflows, if the OS's IP stack is broken and exploitable.
This is well illustrated by the Heartbleed bug in OpenSSL (see here). This was caused effectively by a weakly specified protocol (see RFC6520); it's defined in English language, and requires the programmer to read this, code up the protocol by hand, and pay attention to all the strictures written in the document. The associated RFC5426 even says:
This document deals with the formatting of data in an external
representation. The following very basic and somewhat casually
defined presentation syntax will be used. The syntax draws from
several sources in its structure. Although it resembles the
programming language "C" in its syntax and XDR [XDR] in both its
syntax and intent, it would be risky to draw too many parallels. The
purpose of this presentation language is to document TLS only; it has
no general application beyond that particular goal.
The Heartbleed bug in OpenSSL was a result of the coding up of the English language spec being done wrong, and given that highlighted statement perhaps it's no great surprise. Applications that were using OpenSSL were wide, wide open to exploitation, even thought the applications themselves (e.g. Web servers) were very well written implementations of, say, HTTPS.
Now, had the designers of TLS chosen to use a decent and strict serialisation technology - perhaps even Google Protocol Buffers (plus some message demarcation) - to define the PDUs in TLS, it would have been far more likely that Heartbleed wouldn't have happened. Specifically, the payload_length field in a request / response would have been taken care of inside Google Protocol Buffers, thereby removing responsibility for handling the length of the payload from the developer.
What's interesting is to compare protocol specifications as written in RFCs with those that tend to be found in the world of telephony (regulated by the International Telephony Union). The ITU's specifications and tools are very "comprehensive" (that ought to be an acceptably neutral way of describing them). A lot of telephony uses ASN.1, which is not disimilar to (and substantially pre-dates) Google Protocol Buffers, but allows for very strict definitions of messages, requires pretty comprehensive tools to do it right, but is bang up to date (it even has JSON as a wire format these days).
"But", one points out, "what if the ASN.1 tools (or Google Protocol Buffers) has a bug?". Well indeed that is a problem, and that has indeed happened to ASN.1 (from one of the commercial ASN.1 tools vendors, can't rememeber which). But the point is that if there's one library that is widely used for defining lots of interfaces, then there's a greater chance of bugs being identified (I myself have found and reported bugs in commercial ASN.1 tools). Whereas if a messaging protocol is defined using, say, English language, there's only ever going to be a very few set of eyes on how well the developer has coded up the meaning of that English language.
Not Everyone Has Got the Message
What I find disappointing is that, across a large portion of the software world, there's still resistance to using tools like Google Protocol Buffers, ASN.1. There's also projects that, having identified the need for such things, go and invent their own.
One such example is dBus - which to be fair is pretty good. However they did go an invent their own serialisation technology for specifying dBus messages; I'm not sure what they gained over using something mature and off-the-shelf.
Google themselves, when they first announced Google Protocol Buffers to the world, were asked "Why didn't you use ASN.1?", and the Googler on the stage had to admit to never having heard of it. So, Googlers in Google hadn't used Google to Google for "binary serialisation technologies"; they'd just gone ahead and wrote their own, and GPB is missing a ton of useful features. Oh, the irony. They'd not even have had to write a toolset from scratch; they could have simply adopted and improved on one of the open source ASN.1 implementations.
Transliteration Problem
This fragmentation and proliferation causes problems. Say, for example, in your project you want to be able to transfer some of your messages into a dBus service on Linux. To do that, you've got a .proto defining your messages, which is great for communicating in/out of Tensor Flow, but fundamentally useless for dBus, which speaks its own format. You'd end up having something like
MyProtoMsg ipMsg;
MyEquivalentDBusMsg opMsg;
opMsg.field1 = ipMsg.field1;
opMsg.field2 = ipMsg.field2;
opMsg.field3 = ipMsg.field3;
and so on. Very laborious, very unmaintainable, and needlessly consumes resources. The other option would be simply to wrap up your GPB encoded messages in a byte array in a dBus message, but one feels that's missing the point (it bypasses any opportunity for dBus to assert that messages it's passing are correctly formed and within specifications).
If the world agreed on the One True Serialisation technology then the flexibility in object / message exchange would be fantastic.

Related

Should I move from REST-HTTP to Rabbitmq-RPC for synchronous call?

I read lots, many people suggested does not use AQMP-RPC for synchronous call. My response data size is 4MB, so, REST-HTTP taking too much time to send data from server to client. So, we decided to move RPC.
Can someone please suggest, should I move from REST-HTTP to AQMP-RPC or any other RPC methods like Apache Avro, Thrift or Google Protocol Buffer for sending bigger data.
You could do worse than take a look at Cap'n Proto. It's an interesting take on serialisastion, in that it endeavours to remove the need for it at all whilst still making things sane in application code. It's written by one of the guys who did Google Protocol Buffers v2. They're doing a sneaky thing with RPC too, allowing some time saving if the result of one RPC call is merely the input to a subsequent RPC call.
GPB aren't too bad either, ASN.1, etc. Anything (apart from Cap'n Proto) that has a binary wire format is probably going to be about the same - they have to marshal bits and bytes to and from a local representations. Avro of course includes its own schema with messages - pity if that's bigger than the message that's being sent.
Anything binary is probably way better than anything text (JSON, XML, etc).

Possible to share information between an add-on to an existing program and a standalone application? [duplicate]

I'm looking at building a Cocoa application on the Mac with a back-end daemon process (really just a mostly-headless Cocoa app, probably), along with 0 or more "client" applications running locally (although if possible I'd like to support remote clients as well; the remote clients would only ever be other Macs or iPhone OS devices).
The data being communicated will be fairly trivial, mostly just text and commands (which I guess can be represented as text anyway), and maybe the occasional small file (an image possibly).
I've looked at a few methods for doing this but I'm not sure which is "best" for the task at hand. Things I've considered:
Reading and writing to a file (…yes), very basic but not very scalable.
Pure sockets (I have no experience with sockets but I seem to think I can use them to send data locally and over a network. Though it seems cumbersome if doing everything in Cocoa
Distributed Objects: seems rather inelegant for a task like this
NSConnection: I can't really figure out what this class even does, but I've read of it in some IPC search results
I'm sure there are things I'm missing, but I was surprised to find a lack of resources on this topic.
I am currently looking into the same questions. For me the possibility of adding Windows clients later makes the situation more complicated; in your case the answer seems to be simpler.
About the options you have considered:
Control files: While it is possible to communicate via control files, you have to keep in mind that the files need to be communicated via a network file system among the machines involved. So the network file system serves as an abstraction of the actual network infrastructure, but does not offer the full power and flexibility the network normally has. Implementation: Practically, you will need to have at least two files for each pair of client/servers: a file the server uses to send a request to the client(s) and a file for the responses. If each process can communicate both ways, you need to duplicate this. Furthermore, both the client(s) and the server(s) work on a "pull" basis, i.e., they need to revisit the control files frequently and see if something new has been delivered.
The advantage of this solution is that it minimizes the need for learning new techniques. The big disadvantage is that it has huge demands on the program logic; a lot of things need to be taken care of by you (Will the files be written in one piece or can it happen that any party picks up inconsistent files? How frequently should checks be implemented? Do I need to worry about the file system, like caching, etc? Can I add encryption later without toying around with things outside of my program code? ...)
If portability was an issue (which, as far as I understood from your question is not the case) then this solution would be easy to port to different systems and even different programming languages. However, I don't know of any network files ystem for iPhone OS, but I am not familiar with this.
Sockets: The programming interface is certainly different; depending on your experience with socket programming it may mean that you have more work learning it first and debugging it later. Implementation: Practically, you will need a similar logic as before, i.e., client(s) and server(s) communicating via the network. A definite plus of this approach is that the processes can work on a "push" basis, i.e., they can listen on a socket until a message arrives which is superior to checking control files regularly. Network corruption and inconsistencies are also not your concern. Furthermore, you (may) have more control over the way the connections are established rather than relying on things outside of your program's control (again, this is important if you decide to add encryption later on).
The advantage is that a lot of things are taken off your shoulders that would bother an implementation in 1. The disadvantage is that you still need to change your program logic substantially in order to make sure that you send and receive the correct information (file types etc.).
In my experience portability (i.e., ease of transitioning to different systems and even programming languages) is very good since anything even remotely compatible to POSIX works.
[EDIT: In particular, as soon as you communicate binary numbers endianess becomes an issue and you have to take care of this problem manually - this is a common (!) special case of the "correct information" issue I mentioned above. It will bite you e.g. when you have a PowerPC talking to an Intel Mac. This special case disappears with the solution 3.+4. together will all of the other "correct information" issues.]
+4. Distributed objects: The NSProxy class cluster is used to implement distributed objects. NSConnection is responsible for setting up remote connections as a prerequisite for sending information around, so once you understand how to use this system, you also understand distributed objects. ;^)
The idea is that your high-level program logic does not need to be changed (i.e., your objects communicate via messages and receive results and the messages together with the return types are identical to what you are used to from your local implementation) without having to bother about the particulars of the network infrastructure. Well, at least in theory. Implementation: I am also working on this right now, so my understanding is still limited. As far as I understand, you do need to setup a certain structure, i.e., you still have to decide which processes (local and/or remote) can receive which messages; this is what NSConnection does. At this point, you implicitly define a client/server architecture, but you do not need to worry about the problems mentioned in 2.
There is an introduction with two explicit examples at the Gnustep project server; it illustrates how the technology works and is a good starting point for experimenting:
http://www.gnustep.org/resources/documentation/Developer/Base/ProgrammingManual/manual_7.html
Unfortunately, the disadvantages are a total loss of compatibility (although you will still do fine with the setup you mentioned of Macs and iPhone/iPad only) with other systems and loss of portability to other languages. Gnustep with Objective-C is at best code-compatible, but there is no way to communicate between Gnustep and Cocoa, see my edit to question number 2 here: CORBA on Mac OS X (Cocoa)
[EDIT: I just came across another piece of information that I was unaware of. While I have checked that NSProxy is available on the iPhone, I did not check whether the other parts of the distributed objects mechanism are. According to this link: http://www.cocoabuilder.com/archive/cocoa/224358-big-picture-relationships-between-nsconnection-nsinputstream-nsoutputstream-etc.html (search the page for the phrase "iPhone OS") they are not. This would exclude this solution if you demand to use iPhone/iPad at this moment.]
So to conclude, there is a trade-off between effort of learning (and implementing and debugging) new technologies on the one hand and hand-coding lower-level communication logic on the other. While the distributed object approach takes most load of your shoulders and incurs the smallest changes in program logic, it is the hardest to learn and also (unfortunately) the least portable.
Disclaimer: Distributed Objects are not available on iPhone.
Why do you find distributed objects inelegant? They sounds like a good match here:
transparent marshalling of fundamental types and Objective-C classes
it doesn't really matter wether clients are local or remote
not much additional work for Cocoa-based applications
The documentation might make it sound like more work then it actually is, but all you basically have to do is to use protocols cleanly and export, or respectively connect to, the servers root object.
The rest should happen automagically behind the scenes for you in the given scenario.
We are using ThoMoNetworking and it works fine and is fast to setup. Basically it allows you to send NSCoding compliant objects in the local network, but of course also works if client and server are on he same machine. As a wrapper around the foundation classes it takes care of pairing, reconnections, etc..

Transmitting SOAP messages over Named Pipes in WCF

I fear I may be displaying my ignorance with this question, but here goes...
I would like to use WCF to implement interprocess communication between a .NET app and a third-party app written in Qt. The Qt app has a plugin architecture that, if I choose to, can be used to bootstrap some .NET classes to handle WCF cleanly at both ends, but I'd rather keep the codebase native and therefore I'm thinking of ways to make sure that whatever I send down the wire with WCF, I can reassemble at the other end using classes available in Qt.
Qt has a SOAP message class, so I figured the preferable solution - and the one that's closest to the one we've hacked together already - is to send SOAP messages and pick them up off a QLocalSocket. Question is, is it possible to force WCF to encode messages as SOAP over a NetNamedPipeBinding, and if so, is it wise to do so?
I'm feeling rather wary at this point that my question might not make complete sense due to my shaky understanding of the technology involved. If this is the case, please take the time to explain why instead of just saying 'no'.
edit #1: I figure an update is warranted, as I've investigated some and should report my findings.
Firstly, I have found that Qt is a pig. The QtSoapMessage class I mentioned, it turns out, doesn't exist in the current version, and is available only as an after-market source package that you have to compile yourself. It took me many hours of googling to find out why this wasn't working. The Qt documentation is utterly dreadful, Qt Creator is counterintuitive in the extreme, and I've all but lost patience with it so haven't pursued this idea any further as yet. Furthermore, it isn't obvious how exactly I am to pass the socket data into the soap message constructor, which takes a QDomDocument, whereas the API for reading XML from a socket uses a QXmlStreamReader or somesuch. There doesn't seem to be any conversion between them.
You actually have a different problem to the one you think you have.
WCF will by default exchange SOAP messages over the NetNamedPipeBinding.
However, the message exchange is layered over some Microsoft proprietary protocols for transaction flow, message framing and encoding, which means that if on the Qt side you pick up a byte stream directly from a QLocalSocket you will have a lot of work to do to implement these underlying protocols before you will be able to get at the SOAP infoset itself.
It is possible to configure the NetNamedPipe binding to remove some of these protocol layers, but not all of them - the framing protocol will always be there, for example.
You might like to read my blog for a lot more detail on this.

Ruminations on highly-scalable and modular distributed server side architectures

Mine is not really a question, it's more of a call for opinions - and perhaps this isn't even the right place to post it. Nevertheless, the community here is very informed, and there's no harm in trying...
I was thinking about ways to create a highly scalable and, above all, highly modular back-end architecture. For example, an entire back-end ecosystem for a large site that had the potential for future-proof evolution into a massive site.
This would entail a very high degree of separation of concerns, to the extent that not only could (say) the underling DB be replaced (ie from Oracle to MySQL) but the actual type of database could be replaced (ed SQL to KV, or vice versa).
I envision a situation where each sub-system exposes its own API within the back-end ecosystem. In this way, the API could remain constant, whilst the implementation could change (even radically) over time.
The system must be heterogeneous in that it's not tied to a specific language. It must be able to accommodate modules or entire sub-systems using different languages.
It then occurred to me that what I was imagining was simply the architecture of the web itself.
So here is my discussion point: apart from the overhead of using (mainly) text-based protocols is there any overriding reason why a complex back-end architecture should not be implemented in the manner I describe, or is there some strong rationale I'm missing for using communication protocols such as Twisted, AMQP, Thrift, etc?
UPDATE: Following a comment from #meagar, I should perhaps reformulate the question: are the clear advantages of using a very simple, flexible and well-understood architecture (ie all functionality exposed as a series RESTful APIs) enough to compensate the obvious performance hit incurred when using this architecture in a back-end context?
[code]the actual type of database could be replaced (ed SQL to KV, or vice versa).[/code]
And anyone who wrote a join between two tables will be sad. If you want the "ability" to switch to KV, then you should not expose an API richer than what KV can support.
The answer to your question depends on what it is you're trying to accomplish. You want to keep each module within reasonable reins. Use proper physical layering of code, use defined interfaces with side-effect contracts, use test cases for each success and failure case of each interface. That way, you can depend on things like "when user enters blah page, a user-blah fact is generated so that all registered fact listeners will be invoked." This allows you to extend the system without having direct calls from point A to point B, while still having some kind of control over widely disparate dependencies. (I hate code bases where you can't find-all to find all possible references to a symbol!)
However, the fact that we put lots of code and classes into a single system is because calling between systems is often very, very expensive. You want to think in terms of code modules making requests of each other where you can. The difference in timing between a function call and a REST call is something like one to a million (maybe you can get it as low as one to ten thousand, if you only count cycles, not wallclock time -- but I'm not so sure). Also, anything that goes on a wire in a datacenter may potentially suffer from packet loss, because there is no such thing as a 100% loss-free data center, no matter how hard you try. Packet loss means random latency spikes in the response time for your application.

A smart UDP protocol analyzer?

Is there a "smart" UDP protocol analyzer that can help me reverse engineer a message based protocol?
I'm using Wireshark to do the sniffing, but if there's a tool that can detect regularities in the protocol (repeated strings, bits of the protocol that are CRC/Checksum or length, ...) and aid the process that would help.
You are asking for a universal inference engine. The best way to try to recover the protocol (assuming you are in a jurisdiction that permits this) is to understand the underlying message transfer from the beginning of a session, and then trying to manually simulate the behaviour of each party through a sequence of ping-pong message trials. This way you develop an understanding of the message structures and their functioning.
Using the UDP frame boundaries is a good place to start looking for structure.
If you have no documentation, you will find that even if you gain a good understanding of the protocol, expect to be surprised many times during the project.
If you can, have your existing systems carry out exactly the scenario you need to use, and then simply replicate the same sequence with payload (and any checksum) changes only. This way you can possibly achieve the requirement without a comprehensive understanding of the protocol.
For an example of the effort in doing this you could look at a historical review of the Samba project at A bit of history and a bit of fun.