A smart UDP protocol analyzer? - udp

Is there a "smart" UDP protocol analyzer that can help me reverse engineer a message based protocol?
I'm using Wireshark to do the sniffing, but if there's a tool that can detect regularities in the protocol (repeated strings, bits of the protocol that are CRC/Checksum or length, ...) and aid the process that would help.

You are asking for a universal inference engine. The best way to try to recover the protocol (assuming you are in a jurisdiction that permits this) is to understand the underlying message transfer from the beginning of a session, and then trying to manually simulate the behaviour of each party through a sequence of ping-pong message trials. This way you develop an understanding of the message structures and their functioning.
Using the UDP frame boundaries is a good place to start looking for structure.
If you have no documentation, you will find that even if you gain a good understanding of the protocol, expect to be surprised many times during the project.
If you can, have your existing systems carry out exactly the scenario you need to use, and then simply replicate the same sequence with payload (and any checksum) changes only. This way you can possibly achieve the requirement without a comprehensive understanding of the protocol.
For an example of the effort in doing this you could look at a historical review of the Samba project at A bit of history and a bit of fun.

Related

What is a protobuf message?

I'm learning how to use tf.records and in the official tutorial they mention you can print a tf.train.Example message (which is a primitive of the protobuf protocol if I get it right).
I understand that tf.records are used to serialize the data, and that they use the protobuf protocol in this case. I also understand that using tf.train.Feature, tf.train.Features and tf.train.Example one can convert the data into the right format.
My question is what does it mean to print a messege in this context? (the tutorial shows how to print an tf.train.Example message)
A message is classically thought of as a collection of bytes that are conveyed from one process/thread to another process/thread. Typically (but not necessarily), the collection of bytes means something to the sender and receiver, e.g. it's an object that has been serialised somehow (perhaps using Google Protocol Buffers). So, an object can become a message by serialising it and placing the bytes into an array that one might term a "message".
It's not necessarily the case the processes handling the collection of bytes will deserialise them. For example, a process that is simply going to pass them onwards down another connection need not actually deserialise them, if it already knows where the bytes are supposed to be sent.
The means by which a message is conveyed is typically some sort of queue / pipe / socket / stream / etc. Where it gets interesting is that most data transports of this sort are stream connections; whatever bytes you push in one end comes out the other. So, then, how to use those for sending messages?
The answer is that there has to be some way of demarcating between messages. There's lots of ways of doing that, but these days it makes far more sense to use something like ZeroMQ, which takes care of all that for you (and more besides). ZeroMQ is a library / protocol that allows a program to transfer a collection of bytes from one process/thread to another via stream connections, and ensure that the receiving program gets the collection in one nice and complete buffer. Those bytes could be objects serialised by Google Protocol Buffer, or serialised in some other way (there's lots). HTTP is also used as a way of moving objects around, e.g. a page of HTML.
So the pattern is object -> serialisation -> message buffer -> some sort of byte transport that demarcates one message from another -> message buffer -> deserialisation -> object.
An advantage of serialisations like Protocol Buffers is that the sender and receiver need not be written in the same language, or share anything at all except for the .proto file. Other approaches to serialisation often involves marking up class definitions in the program source code, which then makes it difficult to deserialise data in another language.
Also in languages like C/C++ one might get away with simply copying the bytes at the object's address from one place to another. This can be a total disaster if the destination is a different machine; endianness etc. can matter a lot. There are serialisation standards that get close to this, specifically Cap'n Proto (see this).
There are variations. Within a process, "passing a message" can simply mean passing ownership of an object around. Ownership can be by convention, i.e. if I've just written the object pointer to a message queue, I won't mutate the object anymore. I think in Rust it's even expressed by the language syntax, in that once object ownership has been given up the language won't let you mutate the object (worked out at compile time, part of what makes Rust so good). The net result looks like message transfer, but in fact all that's happened is a pointer (typically, 64bits) has been copied from A to B, not the entire data in the object. This is a lot faster.
EDIT
So, How Does a Message Transport Protocol Work?
It's worth digging into how something like ZeroMQ works. For it to be able to pass whole application messages across a stream connection, it needs operate some sort of protocol. That protocol is itself going to involve objects (Protocol Data Units) being "serialised" (well, converted to an agreed wire format), pushed through the stream connection, deserialised, and understood by the ZeroMQ library that's on the receiving end. And, when gets on down to it, ZeroMQ is using TCP (over a network), and that too is a protocol built on IP. And that goes on down to Ethernet frames.
So, there's protocols running atop protocols, running atop other protocols (in fact, this is the Layer Model of how computer interconnectedness works).
Why That Matters, and What Can Go Wrong
It's useful to bearing this protocol layering in mind. Sometimes, one might have a requirement to (for example), take very strong measures against buffer overflows, perhaps to prevent remote exploitation. That might be a reason to pick a serialisation technology that helps guard against such things - e.g. Protocol Buffers. However, when picking such a technology, one has to realise that the requirement is met provided that all of the protocol layerings are equally robust. There's no point using, say, Protocol Buffers and declaring oneself safe against buffer overflows, if the OS's IP stack is broken and exploitable.
This is well illustrated by the Heartbleed bug in OpenSSL (see here). This was caused effectively by a weakly specified protocol (see RFC6520); it's defined in English language, and requires the programmer to read this, code up the protocol by hand, and pay attention to all the strictures written in the document. The associated RFC5426 even says:
This document deals with the formatting of data in an external
representation. The following very basic and somewhat casually
defined presentation syntax will be used. The syntax draws from
several sources in its structure. Although it resembles the
programming language "C" in its syntax and XDR [XDR] in both its
syntax and intent, it would be risky to draw too many parallels. The
purpose of this presentation language is to document TLS only; it has
no general application beyond that particular goal.
The Heartbleed bug in OpenSSL was a result of the coding up of the English language spec being done wrong, and given that highlighted statement perhaps it's no great surprise. Applications that were using OpenSSL were wide, wide open to exploitation, even thought the applications themselves (e.g. Web servers) were very well written implementations of, say, HTTPS.
Now, had the designers of TLS chosen to use a decent and strict serialisation technology - perhaps even Google Protocol Buffers (plus some message demarcation) - to define the PDUs in TLS, it would have been far more likely that Heartbleed wouldn't have happened. Specifically, the payload_length field in a request / response would have been taken care of inside Google Protocol Buffers, thereby removing responsibility for handling the length of the payload from the developer.
What's interesting is to compare protocol specifications as written in RFCs with those that tend to be found in the world of telephony (regulated by the International Telephony Union). The ITU's specifications and tools are very "comprehensive" (that ought to be an acceptably neutral way of describing them). A lot of telephony uses ASN.1, which is not disimilar to (and substantially pre-dates) Google Protocol Buffers, but allows for very strict definitions of messages, requires pretty comprehensive tools to do it right, but is bang up to date (it even has JSON as a wire format these days).
"But", one points out, "what if the ASN.1 tools (or Google Protocol Buffers) has a bug?". Well indeed that is a problem, and that has indeed happened to ASN.1 (from one of the commercial ASN.1 tools vendors, can't rememeber which). But the point is that if there's one library that is widely used for defining lots of interfaces, then there's a greater chance of bugs being identified (I myself have found and reported bugs in commercial ASN.1 tools). Whereas if a messaging protocol is defined using, say, English language, there's only ever going to be a very few set of eyes on how well the developer has coded up the meaning of that English language.
Not Everyone Has Got the Message
What I find disappointing is that, across a large portion of the software world, there's still resistance to using tools like Google Protocol Buffers, ASN.1. There's also projects that, having identified the need for such things, go and invent their own.
One such example is dBus - which to be fair is pretty good. However they did go an invent their own serialisation technology for specifying dBus messages; I'm not sure what they gained over using something mature and off-the-shelf.
Google themselves, when they first announced Google Protocol Buffers to the world, were asked "Why didn't you use ASN.1?", and the Googler on the stage had to admit to never having heard of it. So, Googlers in Google hadn't used Google to Google for "binary serialisation technologies"; they'd just gone ahead and wrote their own, and GPB is missing a ton of useful features. Oh, the irony. They'd not even have had to write a toolset from scratch; they could have simply adopted and improved on one of the open source ASN.1 implementations.
Transliteration Problem
This fragmentation and proliferation causes problems. Say, for example, in your project you want to be able to transfer some of your messages into a dBus service on Linux. To do that, you've got a .proto defining your messages, which is great for communicating in/out of Tensor Flow, but fundamentally useless for dBus, which speaks its own format. You'd end up having something like
MyProtoMsg ipMsg;
MyEquivalentDBusMsg opMsg;
opMsg.field1 = ipMsg.field1;
opMsg.field2 = ipMsg.field2;
opMsg.field3 = ipMsg.field3;
and so on. Very laborious, very unmaintainable, and needlessly consumes resources. The other option would be simply to wrap up your GPB encoded messages in a byte array in a dBus message, but one feels that's missing the point (it bypasses any opportunity for dBus to assert that messages it's passing are correctly formed and within specifications).
If the world agreed on the One True Serialisation technology then the flexibility in object / message exchange would be fantastic.

Possible to share information between an add-on to an existing program and a standalone application? [duplicate]

I'm looking at building a Cocoa application on the Mac with a back-end daemon process (really just a mostly-headless Cocoa app, probably), along with 0 or more "client" applications running locally (although if possible I'd like to support remote clients as well; the remote clients would only ever be other Macs or iPhone OS devices).
The data being communicated will be fairly trivial, mostly just text and commands (which I guess can be represented as text anyway), and maybe the occasional small file (an image possibly).
I've looked at a few methods for doing this but I'm not sure which is "best" for the task at hand. Things I've considered:
Reading and writing to a file (…yes), very basic but not very scalable.
Pure sockets (I have no experience with sockets but I seem to think I can use them to send data locally and over a network. Though it seems cumbersome if doing everything in Cocoa
Distributed Objects: seems rather inelegant for a task like this
NSConnection: I can't really figure out what this class even does, but I've read of it in some IPC search results
I'm sure there are things I'm missing, but I was surprised to find a lack of resources on this topic.
I am currently looking into the same questions. For me the possibility of adding Windows clients later makes the situation more complicated; in your case the answer seems to be simpler.
About the options you have considered:
Control files: While it is possible to communicate via control files, you have to keep in mind that the files need to be communicated via a network file system among the machines involved. So the network file system serves as an abstraction of the actual network infrastructure, but does not offer the full power and flexibility the network normally has. Implementation: Practically, you will need to have at least two files for each pair of client/servers: a file the server uses to send a request to the client(s) and a file for the responses. If each process can communicate both ways, you need to duplicate this. Furthermore, both the client(s) and the server(s) work on a "pull" basis, i.e., they need to revisit the control files frequently and see if something new has been delivered.
The advantage of this solution is that it minimizes the need for learning new techniques. The big disadvantage is that it has huge demands on the program logic; a lot of things need to be taken care of by you (Will the files be written in one piece or can it happen that any party picks up inconsistent files? How frequently should checks be implemented? Do I need to worry about the file system, like caching, etc? Can I add encryption later without toying around with things outside of my program code? ...)
If portability was an issue (which, as far as I understood from your question is not the case) then this solution would be easy to port to different systems and even different programming languages. However, I don't know of any network files ystem for iPhone OS, but I am not familiar with this.
Sockets: The programming interface is certainly different; depending on your experience with socket programming it may mean that you have more work learning it first and debugging it later. Implementation: Practically, you will need a similar logic as before, i.e., client(s) and server(s) communicating via the network. A definite plus of this approach is that the processes can work on a "push" basis, i.e., they can listen on a socket until a message arrives which is superior to checking control files regularly. Network corruption and inconsistencies are also not your concern. Furthermore, you (may) have more control over the way the connections are established rather than relying on things outside of your program's control (again, this is important if you decide to add encryption later on).
The advantage is that a lot of things are taken off your shoulders that would bother an implementation in 1. The disadvantage is that you still need to change your program logic substantially in order to make sure that you send and receive the correct information (file types etc.).
In my experience portability (i.e., ease of transitioning to different systems and even programming languages) is very good since anything even remotely compatible to POSIX works.
[EDIT: In particular, as soon as you communicate binary numbers endianess becomes an issue and you have to take care of this problem manually - this is a common (!) special case of the "correct information" issue I mentioned above. It will bite you e.g. when you have a PowerPC talking to an Intel Mac. This special case disappears with the solution 3.+4. together will all of the other "correct information" issues.]
+4. Distributed objects: The NSProxy class cluster is used to implement distributed objects. NSConnection is responsible for setting up remote connections as a prerequisite for sending information around, so once you understand how to use this system, you also understand distributed objects. ;^)
The idea is that your high-level program logic does not need to be changed (i.e., your objects communicate via messages and receive results and the messages together with the return types are identical to what you are used to from your local implementation) without having to bother about the particulars of the network infrastructure. Well, at least in theory. Implementation: I am also working on this right now, so my understanding is still limited. As far as I understand, you do need to setup a certain structure, i.e., you still have to decide which processes (local and/or remote) can receive which messages; this is what NSConnection does. At this point, you implicitly define a client/server architecture, but you do not need to worry about the problems mentioned in 2.
There is an introduction with two explicit examples at the Gnustep project server; it illustrates how the technology works and is a good starting point for experimenting:
http://www.gnustep.org/resources/documentation/Developer/Base/ProgrammingManual/manual_7.html
Unfortunately, the disadvantages are a total loss of compatibility (although you will still do fine with the setup you mentioned of Macs and iPhone/iPad only) with other systems and loss of portability to other languages. Gnustep with Objective-C is at best code-compatible, but there is no way to communicate between Gnustep and Cocoa, see my edit to question number 2 here: CORBA on Mac OS X (Cocoa)
[EDIT: I just came across another piece of information that I was unaware of. While I have checked that NSProxy is available on the iPhone, I did not check whether the other parts of the distributed objects mechanism are. According to this link: http://www.cocoabuilder.com/archive/cocoa/224358-big-picture-relationships-between-nsconnection-nsinputstream-nsoutputstream-etc.html (search the page for the phrase "iPhone OS") they are not. This would exclude this solution if you demand to use iPhone/iPad at this moment.]
So to conclude, there is a trade-off between effort of learning (and implementing and debugging) new technologies on the one hand and hand-coding lower-level communication logic on the other. While the distributed object approach takes most load of your shoulders and incurs the smallest changes in program logic, it is the hardest to learn and also (unfortunately) the least portable.
Disclaimer: Distributed Objects are not available on iPhone.
Why do you find distributed objects inelegant? They sounds like a good match here:
transparent marshalling of fundamental types and Objective-C classes
it doesn't really matter wether clients are local or remote
not much additional work for Cocoa-based applications
The documentation might make it sound like more work then it actually is, but all you basically have to do is to use protocols cleanly and export, or respectively connect to, the servers root object.
The rest should happen automagically behind the scenes for you in the given scenario.
We are using ThoMoNetworking and it works fine and is fast to setup. Basically it allows you to send NSCoding compliant objects in the local network, but of course also works if client and server are on he same machine. As a wrapper around the foundation classes it takes care of pairing, reconnections, etc..

Decent Packet Tutorials for OBJ-C?

I want to learn more about sending and receiving packets with Obj C. I want to learn more about packet IDs and types. Any ideas on some easy tutorials apart from the Apple Documentation?
Thanks guys!
Note: I do have basic knowlage of how to send packets. But I need to learn more about types and IDs. At the moment Im failing to make a login with login.minecraft.net.
Holy layer violation, batman!
I think what you really want to learn about is HTTP communication under iOS, not TCP/IP packets. In this day and age, there is no reason to dip down to TCP/UDP unless you are inventing a new protocol. For every protocol that exists with any popularity, there is a library that encapsulates it.
In this case, it looks like Minecraft is built on HTTP. Which is no surprise; unless your game needs realtime interaction, HTTP is a ubiquitous protocol that routes through anything.
Thus, I'd suggest you start with one of the HTTP programming guides.

Why use AMQP/ZeroMQ/RabbitMQ

as opposed to writing your own library.
We're working on a project here that will be a self-dividing server pool, if one section grows too heavy, the manager would divide it and put it on another machine as a separate process. It would also alert all connected clients this affects to connect to the new server.
I am curious about using ZeroMQ for inter-server and inter-process communication. My partner would prefer to roll his own. I'm looking to the community to answer this question.
I'm a fairly novice programmer myself and just learned about messaging queues. As i've googled and read, it seems everyone is using messaging queues for all sorts of things, but why? What makes them better than writing your own library? Why are they so common and why are there so many?
what makes them better than writing your own library?
When rolling out the first version of your app, probably nothing: your needs are well defined and you will develop a messaging system that will fit your needs: small feature list, small source code etc.
Those tools are very useful after the first release, when you actually have to extend your application and add more features to it.
Let me give you a few use cases:
your app will have to talk to a big endian machine (sparc/powerpc) from a little endian machine (x86, intel/amd). Your messaging system had some endian ordering assumption: go and fix it
you designed your app so it is not a binary protocol/messaging system and now it is very slow because you spend most of your time parsing it (the number of messages increased and parsing became a bottleneck): adapt it so it can transport binary/fixed encoding
at the beginning you had 3 machine inside a lan, no noticeable delays everything gets to every machine. your client/boss/pointy-haired-devil-boss shows up and tell you that you will install the app on WAN you do not manage - and then you start having connection failures, bad latency etc. you need to store message and retry sending them later on: go back to the code and plug this stuff in (and enjoy)
messages sent need to have replies, but not all of them: you send some parameters in and expect a spreadsheet as a result instead of just sending and acknowledges, go back to code and plug this stuff in (and enjoy.)
some messages are critical and there reception/sending needs proper backup/persistence/. Why you ask ? auditing purposes
And many other use cases that I forgot ...
You can implement it yourself, but do not spend much time doing so: you will probably replace it later on anyway.
That's very much like asking: why use a database when you can write your own?
The answer is that using a tool that has been around for a while and is well understood in lots of different use cases, pays off more and more over time and as your requirements evolve. This is especially true if more than one developer is involved in a project. Do you want to become support staff for a queueing system if you change to a new project? Using a tool prevents that from happening. It becomes someone else's problem.
Case in point: persistence. Writing a tool to store one message on disk is easy. Writing a persistor that scales and performs well and stably, in many different use cases, and is manageable, and cheap to support, is hard. If you want to see someone complaining about how hard it is then look at this: http://www.lshift.net/blog/2009/12/07/rabbitmq-at-the-skills-matter-functional-programming-exchange
Anyway, I hope this helps. By all means write your own tool. Many many people have done so. Whatever solves your problem, is good.
I'm considering using ZeroMQ myself - hence I stumbled across this question.
Let's assume for the moment that you have the ability to implement a message queuing system that meets all of your requirements. Why would you adopt ZeroMQ (or other third party library) over the roll-your-own approach? Simple - cost.
Let's assume for a moment that ZeroMQ already meets all of your requirements. All that needs to be done is integrating it into your build, read some doco and then start using it. That's got to be far less effort than rolling your own. Plus, the maintenance burden has been shifted to another company. Since ZeroMQ is free, it's like you've just grown your development team to include (part of) the ZeroMQ team.
If you ran a Software Development business, then I think that you would balance the cost/risk of using third party libraries against rolling your own, and in this case, using ZeroMQ would win hands down.
Perhaps you (or rather, your partner) suffer, as so many developers do, from the "Not Invented Here" syndrome? If so, adjust your attitude and reassess the use of ZeroMQ. Personally, I much prefer the benefits of Proudly Found Elsewhere attitude. I'm hoping I can proud of finding ZeroMQ... time will tell.
EDIT: I came across this video from the ZeroMQ developers that talks about why you should use ZeroMQ.
what makes them better than writing your own library?
Message queuing systems are transactional, which is conceptually easy to use as a client, but hard to get right as an implementor, especially considering persistent queues. You might think you can get away with writing a quick messaging library, but without transactions and persistence, you'd not have the full benefits of a messaging system.
Persistence in this context means that the messaging middleware keeps unhandled messages in permanent storage (on disk) in case the server goes down; after a restart, the messages can be handled and no retransmit is necessary (the sender does not even know there was a problem). Transactional means that you can read messages from different queues and write messages to different queues in a transactional manner, meaning that either all reads and writes succeed or (if one or more fail) none succeeds. This is not really much different from the transactionality known from interfacing with databases and has the same benefits (it simplifies error handling; without transactions, you would have to assure that each individual read/write succeeds, and if one or more fail, you have to roll back those changes that did succeed).
Before writing your own library, read the 0MQ Guide here: http://zguide.zeromq.org/page:all
Chances are that you will either decide to install RabbitMQ, or else you will make your library on top of ZeroMQ since they have already done all the hard parts.
If you have a little time give it a try and roll out your own implemntation! The learnings of this excercise will convince you about the wisdom of using an already tested library.

Serializing objects for asynchronous messaging

I'm considering using AMQP (using qpid) to enable a mixture of Python and Java services communicate with each other. Basic text messaging seems simple enough but, as with every other messaging technology I've investigated, that's where it seems to stop. Except for building instant messaging applications, I would have thought sending strings wasn't a particularly useful thing to do yet example after example demonstrates sending unformatted text around.
My instinct then is to use XML (de-)serialization or something similar (JSON, YAML, Protocol Buffers etc.) which has good library support in both languages. Is this a best practice and, if so, which (de-)serialization protocol would people recommend? Or am I missing the point somewhere and should be quite content sending small bits of text?
Owen, may I offer a few words about RabbitMQ.
AMQP is a binary protocol and you can certainly do much more than send strings around! Which Python client do you plan to use? We recommend Barry Pederson's client for most uses: http://barryp.org/software/py-amqplib/ You are most welcome to come to the RabbitMQ list and ask any questions you like about anything in relation to your post and the comments :-)
As James points out, JSON is goodness. RabbitMQ supports JSON-RPC over HTTP connecting to an AMQP back end. People also use RabbitMQ with Orbited for comet type apps.
In addition we are fans of, and support XMPP, and STOMP too which James invented. STOMP is handy for a certain class of messaging apps and RabbitMQ supports it for both direct and topic based routing. We've found it a fine way to interop with ActiveMQ, preferring it to JMS in that scenario.
I hope you find the right server for your use cases, and recommend you try out different combinations, for best results.
Cheers,
alexis
For what it's worth, I've been having a good experience using AMQP + Protocol Buffers.
If the sender is serializing the messages, you will probably need to include an id in the header so that you know how to de-serialize on the receiving side. However, this isn't too much trouble to accomplish.
XML or JSON are probably the easiest. Protocol buffers is cool but I'd treat it as an optimisation to think of later on if you really need to (as its a bit harder to use being essentially a binary wire format).
BTW you might want to look at Stomp rather than AMQP; its got way more client libraries and supported message brokers. e.g. Apache ActiveMQ which is way more popular than qpid or rabbitmq - or indeed any JMS provider would work just fine.