Is it common to have a RESTful endpoint returning Protobuf strings?

Instead of having a gRPC server (say, due to platform restrictions), you have a REST endpoint that returns data.SerializeToString() as the payload. Of course, any clients of this endpoint would have the appropriate proto files for each response, so they can ParseFromString(data) and be on their way. Reasons for doing this include the benefits of Protobufs.
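For illustration, here is a minimal sketch of that setup in Python using Flask (Flask, the endpoint path, and the field names are my own assumptions; google.protobuf's stock Struct type stands in for whatever message your real .proto would define):

from flask import Flask, Response
from google.protobuf import struct_pb2

app = Flask(__name__)

@app.route("/report")
def report():
    # Build the message and return its wire bytes as the HTTP body
    msg = struct_pb2.Struct()
    msg.update({"status": "ok", "load": 0.42})
    return Response(msg.SerializeToString(), mimetype="application/x-protobuf")

# Client side: anyone holding the same .proto just parses the body, e.g.
#   body = requests.get("http://host:5000/report").content
#   parsed = struct_pb2.Struct(); parsed.ParseFromString(body)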

Improved understanding of the question: is it common to use PBs for purposes other than gRPC transport?
Yes, it is totally common and reasonable. PBs are really nothing more than a data serialization format; gRPC just uses them as its message interchange format (a natural choice, as both are Google creations). Let the answer be the description from Google itself:
Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data.
Google's basic tutorial saves it to disk. Do anything you would do with any other binary blob (JPEG, MP3, ...).
BUT! If serialization speed is really critical for you, don't assume anything. Today's JSON libs may perform unexpectedly well - it depends on your specific platform and dominant message characteristics. Do your own performance tests. If JSON's inferiority is confirmed, there are again libs with faster serialization than PB. To name a couple: Google's less popular PB sibling FlatBuffers, and something called Simple Binary Encoding, which was developed for high-frequency trading... speaks for itself.
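In that spirit, here is a rough timing sketch (my own, not from the answer): it assumes the protobuf package is installed and uses the stock Struct type as a stand-in, so a generated class from your own .proto would be more representative, but it shows the kind of measurement worth doing on your real data:

import json, timeit
from google.protobuf import struct_pb2

record = {"user": "alice", "score": 87.5, "tags": "a,b,c"}
pb = struct_pb2.Struct()
pb.update(record)

def pb_roundtrip():
    # serialize then parse back, the usual request/response round trip
    out = struct_pb2.Struct()
    out.ParseFromString(pb.SerializeToString())

def json_roundtrip():
    json.loads(json.dumps(record))

print("protobuf:", timeit.timeit(pb_roundtrip, number=50000))
print("json    :", timeit.timeit(json_roundtrip, number=50000))
print("sizes   :", len(pb.SerializeToString()), "vs", len(json.dumps(record)))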

Related

What is a protobuf message?

I'm learning how to use tf.records and in the official tutorial they mention you can print a tf.train.Example message (which is a primitive of the protobuf protocol if I get it right).
I understand that tf.records are used to serialize the data, and that they use the protobuf protocol in this case. I also understand that using tf.train.Feature, tf.train.Features and tf.train.Example one can convert the data into the right format.
My question is: what does it mean to print a message in this context? (The tutorial shows how to print a tf.train.Example message.)
A message is classically thought of as a collection of bytes that are conveyed from one process/thread to another process/thread. Typically (but not necessarily), the collection of bytes means something to the sender and receiver, e.g. it's an object that has been serialised somehow (perhaps using Google Protocol Buffers). So, an object can become a message by serialising it and placing the bytes into an array that one might term a "message".
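To tie this back to the original question, a small sketch (assuming TensorFlow is installed): "printing" a tf.train.Example just renders its fields in protobuf's human-readable text format, while SerializeToString() produces the bytes that would actually travel as a message:

import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "height": tf.train.Feature(float_list=tf.train.FloatList(value=[1.80])),
    "name":   tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"bob"])),
}))

print(example)                         # text form of the fields, for humans
payload = example.SerializeToString()  # the bytes that form the "message"
print(len(payload), "bytes on the wire")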
It's not necessarily the case that the processes handling the collection of bytes will deserialise them. For example, a process that is simply going to pass them onwards down another connection need not actually deserialise them, if it already knows where the bytes are supposed to be sent.
The means by which a message is conveyed is typically some sort of queue / pipe / socket / stream / etc. Where it gets interesting is that most data transports of this sort are stream connections; whatever bytes you push in one end come out the other. So, then, how do you use those for sending messages?
The answer is that there has to be some way of demarcating between messages. There's lots of ways of doing that, but these days it makes far more sense to use something like ZeroMQ, which takes care of all that for you (and more besides). ZeroMQ is a library / protocol that allows a program to transfer a collection of bytes from one process/thread to another via stream connections, and ensure that the receiving program gets the collection in one nice and complete buffer. Those bytes could be objects serialised by Google Protocol Buffer, or serialised in some other way (there's lots). HTTP is also used as a way of moving objects around, e.g. a page of HTML.
So the pattern is object -> serialisation -> message buffer -> some sort of byte transport that demarcates one message from another -> message buffer -> deserialisation -> object.
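A minimal sketch of that pattern in Python, assuming pyzmq and protobuf are installed (the stock Struct type stands in for a message from your own .proto):

import zmq
from google.protobuf import struct_pb2

ctx = zmq.Context.instance()
rx = ctx.socket(zmq.PAIR); rx.bind("inproc://demo")
tx = ctx.socket(zmq.PAIR); tx.connect("inproc://demo")

obj = struct_pb2.Struct()
obj.update({"reading": 21.5})
tx.send(obj.SerializeToString())   # object -> serialisation -> message

frame = rx.recv()                  # one whole message; ZeroMQ did the demarcation
back = struct_pb2.Struct()
back.ParseFromString(frame)        # message -> deserialisation -> object
print(back["reading"])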
An advantage of serialisations like Protocol Buffers is that the sender and receiver need not be written in the same language, or share anything at all except for the .proto file. Other approaches to serialisation often involve marking up class definitions in the program source code, which then makes it difficult to deserialise data in another language.
Also in languages like C/C++ one might get away with simply copying the bytes at the object's address from one place to another. This can be a total disaster if the destination is a different machine; endianness etc. can matter a lot. There are serialisation standards that get close to this, specifically Cap'n Proto (see this).
There are variations. Within a process, "passing a message" can simply mean passing ownership of an object around. Ownership can be by convention, i.e. if I've just written the object pointer to a message queue, I won't mutate the object anymore. I think in Rust it's even expressed by the language syntax, in that once object ownership has been given up the language won't let you mutate the object (worked out at compile time, part of what makes Rust so good). The net result looks like message transfer, but in fact all that's happened is a pointer (typically 64 bits) has been copied from A to B, not the entire data in the object. This is a lot faster.
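A tiny in-process illustration of that point (my own sketch, in Python rather than Rust): putting an object on a queue hands over a reference, not a copy, so the "transfer" costs one pointer-sized write no matter how big the object is:

import queue, threading

q = queue.Queue()
payload = bytearray(10_000_000)   # ~10 MB object

def consumer():
    received = q.get()
    print("same object?", received is payload)   # True: only the reference moved

t = threading.Thread(target=consumer)
t.start()
q.put(payload)   # by convention, the producer stops mutating it from here on
t.join()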
EDIT
So, How Does a Message Transport Protocol Work?
It's worth digging into how something like ZeroMQ works. For it to be able to pass whole application messages across a stream connection, it needs to operate some sort of protocol. That protocol is itself going to involve objects (Protocol Data Units) being "serialised" (well, converted to an agreed wire format), pushed through the stream connection, deserialised, and understood by the ZeroMQ library that's on the receiving end. And, when it gets on down to it, ZeroMQ is using TCP (over a network), and that too is a protocol built on IP. And that goes on down to Ethernet frames.
So, there are protocols running atop protocols, running atop other protocols (in fact, this is the layer model of how computer interconnectedness works).
Why That Matters, and What Can Go Wrong
It's useful to bear this protocol layering in mind. Sometimes, one might have a requirement to (for example) take very strong measures against buffer overflows, perhaps to prevent remote exploitation. That might be a reason to pick a serialisation technology that helps guard against such things - e.g. Protocol Buffers. However, when picking such a technology, one has to realise that the requirement is met only provided that all of the protocol layerings are equally robust. There's no point using, say, Protocol Buffers and declaring oneself safe against buffer overflows if the OS's IP stack is broken and exploitable.
This is well illustrated by the Heartbleed bug in OpenSSL (see here). This was caused effectively by a weakly specified protocol (see RFC6520); it's defined in English, and requires the programmer to read it, code up the protocol by hand, and pay attention to all the strictures written in the document. The associated RFC5246 even says:
This document deals with the formatting of data in an external representation. The following very basic and somewhat casually defined presentation syntax will be used. The syntax draws from several sources in its structure. Although it resembles the programming language "C" in its syntax and XDR [XDR] in both its syntax and intent, it would be risky to draw too many parallels. The purpose of this presentation language is to document TLS only; it has no general application beyond that particular goal.
The Heartbleed bug in OpenSSL was a result of the coding up of the English-language spec being done wrong, and given that highlighted statement, perhaps it's no great surprise. Applications that were using OpenSSL were wide, wide open to exploitation, even though the applications themselves (e.g. Web servers) were very well written implementations of, say, HTTPS.
Now, had the designers of TLS chosen to use a decent and strict serialisation technology - perhaps even Google Protocol Buffers (plus some message demarcation) - to define the PDUs in TLS, it would have been far more likely that Heartbleed wouldn't have happened. Specifically, the payload_length field in a request / response would have been taken care of inside Google Protocol Buffers, thereby removing responsibility for handling the length of the payload from the developer.
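A small sketch of that point using Python and the stock BytesValue wrapper message (my own illustration, not a TLS implementation): the wire format length-prefixes the payload and the parser enforces that length, so application code never gets to read past the buffer:

from google.protobuf import wrappers_pb2
from google.protobuf.message import DecodeError

msg = wrappers_pb2.BytesValue(value=b"heartbeat payload")
wire = msg.SerializeToString()

try:
    # Hand the parser fewer bytes than the declared payload length
    wrappers_pb2.BytesValue().ParseFromString(wire[:-5])
except DecodeError as err:
    print("parser refused the truncated input:", err)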
What's interesting is to compare protocol specifications as written in RFCs with those that tend to be found in the world of telephony (regulated by the International Telecommunication Union). The ITU's specifications and tools are very "comprehensive" (that ought to be an acceptably neutral way of describing them). A lot of telephony uses ASN.1, which is not dissimilar to (and substantially pre-dates) Google Protocol Buffers, but allows for very strict definitions of messages; it requires pretty comprehensive tools to do it right, but is bang up to date (it even has JSON as a wire format these days).
"But", one points out, "what if the ASN.1 tools (or Google Protocol Buffers) has a bug?". Well indeed that is a problem, and that has indeed happened to ASN.1 (from one of the commercial ASN.1 tools vendors, can't rememeber which). But the point is that if there's one library that is widely used for defining lots of interfaces, then there's a greater chance of bugs being identified (I myself have found and reported bugs in commercial ASN.1 tools). Whereas if a messaging protocol is defined using, say, English language, there's only ever going to be a very few set of eyes on how well the developer has coded up the meaning of that English language.
Not Everyone Has Got the Message
What I find disappointing is that, across a large portion of the software world, there's still resistance to using tools like Google Protocol Buffers or ASN.1. There are also projects that, having identified the need for such things, go and invent their own.
One such example is dBus - which, to be fair, is pretty good. However, they did go and invent their own serialisation technology for specifying dBus messages; I'm not sure what they gained over using something mature and off-the-shelf.
Google themselves, when they first announced Google Protocol Buffers to the world, were asked "Why didn't you use ASN.1?", and the Googler on the stage had to admit to never having heard of it. So, Googlers in Google hadn't used Google to Google for "binary serialisation technologies"; they'd just gone ahead and written their own, and GPB is missing a ton of useful features. Oh, the irony. They'd not even have had to write a toolset from scratch; they could have simply adopted and improved on one of the open source ASN.1 implementations.
Transliteration Problem
This fragmentation and proliferation causes problems. Say, for example, in your project you want to be able to transfer some of your messages into a dBus service on Linux. To do that, you've got a .proto defining your messages, which is great for communicating in and out of TensorFlow, but fundamentally useless for dBus, which speaks its own format. You'd end up having something like
MyProtoMsg ipMsg;
MyEquivalentDBusMsg opMsg;
opMsg.field1 = ipMsg.field1;
opMsg.field2 = ipMsg.field2;
opMsg.field3 = ipMsg.field3;
and so on. Very laborious, very unmaintainable, and needlessly consumes resources. The other option would be simply to wrap up your GPB encoded messages in a byte array in a dBus message, but one feels that's missing the point (it bypasses any opportunity for dBus to assert that messages it's passing are correctly formed and within specifications).
If the world agreed on the One True Serialisation technology then the flexibility in object / message exchange would be fantastic.

Should I move from REST-HTTP to Rabbitmq-RPC for synchronous call?

I have read a lot; many people suggest not using AMQP-RPC for synchronous calls. My response data size is 4 MB, so REST-HTTP is taking too much time to send data from server to client. So we decided to move to RPC.
Can someone please suggest whether I should move from REST-HTTP to AMQP-RPC, or to any other RPC method like Apache Avro, Thrift or Google Protocol Buffers, for sending bigger data?
You could do worse than take a look at Cap'n Proto. It's an interesting take on serialisation, in that it endeavours to remove the need for it at all whilst still making things sane in application code. It's written by one of the guys who did Google Protocol Buffers v2. They're doing a sneaky thing with RPC too, allowing some time saving if the result of one RPC call is merely the input to a subsequent RPC call.
GPB isn't too bad either, nor is ASN.1, etc. Anything (apart from Cap'n Proto) that has a binary wire format is probably going to be about the same - they all have to marshal bits and bytes to and from a local representation. Avro of course includes its own schema with messages - a pity if that's bigger than the message that's being sent.
Anything binary is probably way better than anything text (JSON, XML, etc).

Binary Serialization vs. use of WCF

I am wondering if there are any performance overhead issues to consider when using WCF vs. Binary Serialization done manually. I am building an n-tier site and wish to implement asynchronous behavior across tiers. I plan on passing data in binary form to lessen bandwidth. WCF seems to be a good shortcut to building your own tools, but I am wondering if there are any points to be aware of when making the choice between use of the WCF vs. System.IO Namespace and building your own light weight library.
There is a binary formatter for WCF, though it's not entirely binary; it produces SOAP messages whose content is formatted using the .NET Binary Format for XML, which is a highly compacted form of XML. (Examples of what this looks like are found on this samples page.)
Alternatively, you can implement your own custom message formatter, as long as the formatter is available on both the client and server side, to format however you want. (I think you'll still have some overhead from WCF, but not much.)
My personal opinion: no amount of overhead savings you might get from defining a custom binary format, and writing all of the serialization/deserialization code to implement it manually, will ever compensate for the time and effort you will spend trying to implement and debug such a mechanism.

GWT Data Serialization

I'm looking for the algorithm that Google's Web Toolkit uses to serialize data posted to the server during an AJAX request. I'm looking to duplicate it in another language so that I can tie in another of my projects with a GWT project.
Any help is much appreciated!
The GWT-RPC serialization is heavily tied to Java. It even sends Java class names over the wire.
I suggest you use something like JSON to communicate with the server. This way, you can use any programming language with the GWT server.
Update: There are no definitive references to the GWT-RPC format, and a mailing list post explains that decision:
The GWT RPC format is intentionally opaque JSON. This makes it somewhere between difficult and impossible to add a non-GWT agent to the RPC discussion. There isn't really a nice work-around for creating a non-Java server-side implementation but, because your RemoteServiceServlet implementation just has to implement your synchronous RPC interface, it's quite possible for non-GWT clients to talk to the same server-side business logic, just without using the RPC protocol.
and the little detail which surfaced was
The wire format is plain text. It's actually JSON. It's just unreadable JSON because the assumption is that both the producing and consuming code is auto-generated and can make all kinds of assumptions about the structure of the text.
I've written a design document explaining the GWT-RPC wire format. Hopefully you'll find it useful.

Biggest differences of Thrift vs Protocol Buffers? [closed]

What are the biggest pros and cons of Apache Thrift vs Google's Protocol Buffers?
They both offer many of the same features; however, there are some differences:
Thrift supports 'exceptions'
Protocol Buffers have much better documentation/examples
Thrift has a builtin Set type
Protocol Buffers allow "extensions" - you can extend an external proto to add extra fields, while still allowing external code to operate on the values. There is no way to do this in Thrift
I find Protocol Buffers much easier to read
Basically, they are fairly equivalent (with Protocol Buffers slightly more efficient from what I have read).
Another important difference is the set of languages supported by default.
Protocol Buffers: Java, Android Java, C++, Python, Ruby, C#, Go, Objective-C, Node.js
Thrift: Java, C++, Python, Ruby, C#, Go, Objective-C, JavaScript, Node.js, Erlang, PHP, Perl, Haskell, Smalltalk, OCaml, Delphi, D, Haxe
Both could be extended to other platforms, but these are the language bindings available out of the box.
RPC is another key difference. Thrift generates code to implement RPC clients and servers, whereas Protocol Buffers seems mostly designed as a data-interchange format alone.
Protobuf serialized objects are about 30% smaller than Thrift.
Most actions you may want to perform on protobuf objects (create, serialize, deserialize) are much slower than Thrift unless you turn on option optimize_for = SPEED.
Thrift has richer data structures (Map, Set)
Protobuf API looks cleaner, though the generated classes are all packed as inner classes which is not so nice.
Thrift enums are not real Java Enums, i.e. they are just ints. Protobuf has real Java enums.
For a closer look at the differences, check out the source code diffs at this open source project.
As I've said in the "Thrift vs Protocol buffers" topic:
Referring to the Thrift vs Protobuf vs JSON comparison:
Thrift supports out of the box AS3, C++, C#, D, Delphi, Go, Graphviz, Haxe, Haskell, Java, Javascript, Node.js, OCaml, Smalltalk, Typescript, Perl, PHP, Python, Ruby, ...
C++, Python, Java - in-box support in Protobuf
Protobuf support for other languages (including Lua, Matlab, Ruby, Perl, R, Php, OCaml, Mercury, Erlang, Go, D, Lisp) is available as Third Party Addons (btw. Here is SWI-Prolog support).
Protobuf has much better documentation and plenty of examples.
Thrift comes with a good tutorial
Protobuf objects are smaller
Protobuf is faster when using "optimize_for = SPEED" configuration
Thrift has an integrated RPC implementation, while for Protobuf RPC solutions are separate but available (like ZeroC Ice).
Protobuf is released under BSD-style license
Thrift is released under Apache 2 license
Additionally, there are plenty of interesting additional tools available for those solutions, which might tip the decision. Here are examples for Protobuf: Protobuf-wireshark, protobufeditor.
Protocol Buffers seems to have a more compact representation, but that's only an impression I get from reading the Thrift whitepaper. In their own words:
We decided against some extreme storage optimizations (i.e. packing small integers into ASCII or using a 7-bit continuation format) for the sake of simplicity and clarity in the code. These alterations can easily be made if and when we encounter a performance-critical use case that demands them.
Also, it may just be my impression, but Protocol Buffers seems to have some thicker abstractions around struct versioning. Thrift does have some versioning support, but it takes a bit of effort to make it happen.
I was able to get better performance with a text-based protocol compared to protobuf on Python. However, there is no type checking or other fancy UTF-8 conversion, etc., which protobuf offers.
So, if serialization/deserialization is all you need, then you can probably use something else.
http://dhruvbird.blogspot.com/2010/05/protocol-buffers-vs-http.html
One obvious thing not yet mentioned, which can be either a pro or a con (and is the same for both), is that they are binary protocols. This allows for a more compact representation and possibly better performance (pros), but with reduced readability (or rather, debuggability) - a con.
Also, both have a bit less tool support than standard formats like XML (and maybe even JSON).
(EDIT) Here's an interesting comparison that tackles both size and performance differences, and includes numbers for some other formats (XML, JSON) as well.
I think most of these points have missed the basic fact that Thrift is an RPC framework, which happens to have the ability to serialize data using a variety of methods (binary, XML, etc).
Protocol Buffers are designed purely for serialization; it's not a framework like Thrift.
ProtocolBuffers is FASTER.
There is a nice benchmark here:
https://github.com/eishay/jvm-serializers/wiki (last updated 2016, but there are forks that contain faster serializers as of 2020, e.g. ActiveJ created a fork to demonstrate their speed on the JVM: https://github.com/activej/jvm-serializers).
You might also want to look into Avro, which can be faster. There are two libraries for Avro in .NET:
Apache.Avro
Chr.Avro - written by engineers at C.H. Robinson, a supply chain logistics company
By the way, the fastest I've ever seen is Cap'n Proto; a C# implementation can be found in the GitHub repository of Marc Gravell.
And according to the wiki the Thrift runtime doesn't run on Windows.
For one, protobuf isn't a full RPC implementation. It requires something like gRPC to go with it.
gRPC is very slow compared to Thrift:
http://szelei.me/rpc-benchmark-part1/
I think the basic data encoding is different.
Protocol Buffers use variable-length integers (varints): a variable-length encoding that turns a fixed-length number into a variable-length one to save space.
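A worked sketch of the varint idea (my own, following the documented wire format): seven payload bits per byte, least-significant group first, with the top bit set while more bytes follow:

def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit: more bytes coming
        else:
            out.append(byte)
            return bytes(out)

print(encode_varint(1).hex())     # '01'   - one byte instead of four
print(encode_varint(300).hex())   # 'ac02' - two bytes instead of four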
Thrift offers different types of serialization formats (called "protocols").
In fact, Thrift has two different JSON encodings, and no less than three different binary encoding methods.
In conclusion, these two libraries are quite different. Thrift is like a one-stop shop, giving you an entire integrated RPC framework and many options (with cross-language support), while Protocol Buffers is more inclined to "just do one thing and do it well".
There are some excellent points here, and I'm going to add another one in case someone's path crosses here.
Thrift gives you the option to choose between the thrift-binary and thrift-compact (de)serializer: thrift-binary has excellent performance but a bigger packet size, while thrift-compact gives you good compression but needs more processing power. This is handy because you can always switch between these two modes as easily as changing a line of code (heck, even make it configurable). So if you are not sure how much your application should be optimized for packet size or for processing power, thrift can be an interesting choice.
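As a hedged sketch of that switch in Python (assuming the Apache Thrift library is installed; mypkg.ttypes.Measurement is a made-up name standing in for a struct generated from your own .thrift file):

from thrift.TSerialization import serialize, deserialize
from thrift.protocol import TBinaryProtocol, TCompactProtocol
from mypkg.ttypes import Measurement   # hypothetical generated struct

m = Measurement(sensor="s1", value=21.5)

# Swapping serializers really is a one-line change of the protocol factory
binary_bytes = serialize(m, TBinaryProtocol.TBinaryProtocolFactory())
compact_bytes = serialize(m, TCompactProtocol.TCompactProtocolFactory())
print(len(binary_bytes), len(compact_bytes))   # compact is usually smaller

restored = deserialize(Measurement(), compact_bytes, TCompactProtocol.TCompactProtocolFactory())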
PS: See this excellent benchmark project by thekvs which compares many serializers including thrift-binary, thrift-compact, and protobuf: https://github.com/thekvs/cpp-serializers
PS: There is another serializer named YAS which gives this option too, but it is schema-less; see the link above.
It's also important to note that not all supported languages perform consistently with Thrift or Protobuf. At this point it's a matter of the module's implementation in addition to the underlying serialization. Take care to check benchmarks for whatever language you plan to use.