What are the delimiters for protobuf messages? I'm working with serialized messages, and I would like to know whether each message begins with $$__$$ and ends with the same marker.
For top-level messages (i.e. separate calls to serialize): there literally isn't one. Unless you add your own framing, messages actively bleed into each other, as the deserializer will (by default) just read to the end of the stream. So: if you have blindly concatenated multiple objects without your own framing protocol, you now have problems.
For the internals of messages, there are two ways of encoding sub-objects: length prefixes and groups. Groups are largely deprecated, and the length-prefixed encoding is ambiguous in that the same wire-type marker also describes strings, blobs (bytes), and "packed arrays". You probably don't want to try to handle that.
So: it sounds like you need to add your own framing protocol, in which case the answer will be: whatever your framing protocol defines. Just remember that protobuf is binary, so you cannot rely on any byte sequence as a sentinel/terminator. You should ideally use a length-prefix approach instead.
(In addition to the existing answers:)
A common framing method for protocol buffers is to prepend a varint-encoded length before the actual protobuf message.
The implementation is already part of the protobuf library, e.g.:
for Java: MessageLite.writeDelimitedTo(), Parser.parseDelimitedFrom()
for C++: the functions in the header google/protobuf/util/delimited_message_util.h (e.g. SerializeDelimitedToFileDescriptor())
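For example, here is a minimal round trip in Java with those helpers. MyMessage is a stand-in for whatever message type your .proto generates, and its id field is made up:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    public class DelimitedExample {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();

            // writeDelimitedTo() prepends a varint length before each message.
            MyMessage.newBuilder().setId(1).build().writeDelimitedTo(out);
            MyMessage.newBuilder().setId(2).build().writeDelimitedTo(out);

            ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());

            // parseDelimitedFrom() reads one length prefix, then exactly that
            // many bytes; it returns null on a clean end of stream.
            MyMessage msg;
            while ((msg = MyMessage.parseDelimitedFrom(in)) != null) {
                System.out.println(msg.getId());
            }
        }
    }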
Good luck with your project!
EDIT: The official reference states:
If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)
I'm not gonna post any real code here, because it's part of a complex code base, but I'll ask anyway in case somebody finds this issue familiar:
I create a boost::asio::io_service and run it in a boost::thread.
Then I use boost::asio::ip::udp::socket::async_receive_from() to wait for an incoming packet.
The call looks like this:
udpSocket.async_receive_from(
    inDataBuffer,
    udpEndpoint,
    boost::bind( &Node::handleReceiveFrom, this,
        boost::asio::placeholders::error,
        boost::asio::placeholders::bytes_transferred
    )
);
The signature of handleReceiveFrom() is this:
void Node::handleReceiveFrom( const boost::system::error_code& errc, size_t bytesRecvd )
Inside handleReceiveFrom(), I access the inDataBuffer passed to async_receive_from() and read bytesRecvd bytes from it. But sometimes when the packets arrive really fast, the bytesRecvd value refers to the size of the packet before the one that's actually found in the inDataBuffer.
Precisely, the packet whose size is found in bytesRecvd never actually appears in inDataBuffer, at least not as far as handleReceiveFrom() can see; instead, the data of the next packet is in inDataBuffer by the time handleReceiveFrom() gets a chance to look at it.
I thought the problem was that I was somehow calling async_receive_from() from two different threads but after some testing, that doesn't seem to be the case.
Apart from that, I'm at a loss what could be going on here.
I would very much appreciate any thought on this!
I didn't find the root cause of the problem, but I solved it by using the synchronous Asio receive function instead of the asynchronous one. In other words, I ditched boost::asio::ip::udp::socket::async_receive_from() and replaced it with boost::asio::ip::udp::socket::receive_from() run in a separate thread.
This way, I actually implemented what I thought asio was doing when I called async_receive_from(). But apparently, there's something inside that routine that works in a different way than I'd expect.
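Since the real code isn't posted, here is the shape of that workaround as a minimal sketch: a blocking receive loop on its own dedicated thread. It's written in Java with java.net.DatagramSocket purely for illustration; the port and buffer size are arbitrary:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;

    public class BlockingUdpReceiver {
        public static void main(String[] args) throws Exception {
            DatagramSocket socket = new DatagramSocket(9000); // arbitrary port

            Thread receiver = new Thread(() -> {
                byte[] buf = new byte[1500];
                try {
                    while (true) {
                        DatagramPacket packet = new DatagramPacket(buf, buf.length);
                        socket.receive(packet); // blocks until a datagram arrives
                        // packet.getLength() always describes the bytes now in buf,
                        // because nothing touches the buffer between receives.
                        handle(buf, packet.getLength());
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            receiver.start();
        }

        static void handle(byte[] data, int length) {
            System.out.println("received " + length + " bytes");
        }
    }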
Still, this solution works like a charm, so I'm declaring my previous question closed.
I'm using dexlib2 to programmatically instrument some methods in a dex file, for example, if I find some instructions like this:
invoke-virtual {v8, v9, v10}, Ljava/lang/Class;->getMethod(Ljava/lang/String;[Ljava/lang/Class;)Ljava/lang/reflect/Method;
I'd like to insert an instruction before it, so that at runtime I can know the exact arguments of Class.getMethod().
However, I've run into the question of how to allocate registers for my inserted monitoring instructions.
I know of two ways, but each has its problems:
I can use DexRewriter to increase the registerCount of the method (e.g. from .registers 6 to .registers 9), so that I get three extra registers to use. But first, this is restricted by the 16-register limit; second, when I increase the registerCount, the parameters are passed in the last registers, and therefore I have to rewrite every instruction in the method that uses a parameter, which is tiring.
Or I can reuse registers. That way I have to analyze the liveness of every register, but dexlib2 does not seem to have an existing API for constructing a CFG or def-use chains, which means I would have to write it myself.
Besides, I doubt whether I can get enough available registers this way.
So, am I understanding this problem right? Are there any existing tools/algorithms for this? Or any advice on how to do it better?
Thanks.
A few points:
You're not limited to 16 registers in the method. Most instructions can only address the first 16 registers, but there are move instructions you can use to swap values out with higher registers.
If you can get away with not having to allocate any new registers, your life will be much easier. One approach is to create a new static method with your instrumented logic, and then add a call to that static method with the appropriate values from the target method (see the first sketch after this list).
One approach I've seen used is to increase the register count, and then add a series of move instructions at the beginning of the method to move all the parameter registers back down to the same registers they were in before you incremented the register count (the second sketch after this list). This makes it so that you don't have to rewrite all the existing instructions, and it guarantees that the new registers at the end of the range are unused. The main annoyance with this approach is that when the new registers are v16 or higher, you'll have to do some swaps before and after the point where they're used to get the value down into a low register, and then restore whatever was in that register afterward.
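A sketch of the static-method approach. The hook class and all its names are hypothetical, and it assumes an Android runtime for android.util.Log; the point is that the injected invoke-static reuses the exact registers (here v8, v9, v10) that already feed Class.getMethod(), so no new registers are needed:

    import android.util.Log;

    public final class GetMethodHook {
        private GetMethodHook() {}

        // Target of the injected instruction, e.g.:
        // invoke-static {v8, v9, v10},
        //     Lcom/example/GetMethodHook;->log(Ljava/lang/Class;Ljava/lang/String;[Ljava/lang/Class;)V
        public static void log(Class<?> cls, String name, Class<?>[] paramTypes) {
            StringBuilder sb = new StringBuilder(cls.getName()).append("#").append(name).append("(");
            if (paramTypes != null) {
                for (int i = 0; i < paramTypes.length; i++) {
                    if (i > 0) sb.append(", ");
                    sb.append(paramTypes[i].getName());
                }
            }
            Log.d("GetMethodHook", sb.append(")").toString());
        }
    }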
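And a sketch of the move-parameters-back-down approach with dexlib2, assuming I have the API details right. It rebuilds a method implementation with extra registers and prepends the moves; only reference-type parameters are handled, and try-block/debug-item address fixups are ignored for brevity:

    import java.util.ArrayList;
    import java.util.List;

    import org.jf.dexlib2.Opcode;
    import org.jf.dexlib2.iface.MethodImplementation;
    import org.jf.dexlib2.iface.instruction.Instruction;
    import org.jf.dexlib2.immutable.ImmutableMethodImplementation;
    import org.jf.dexlib2.immutable.instruction.ImmutableInstruction22x;

    public final class RegisterGrower {
        // Rebuilds impl with `extra` fresh registers and a prologue that moves
        // each parameter back down to the register it occupied before the
        // count was raised, so existing instructions stay valid.
        public static MethodImplementation grow(MethodImplementation impl,
                                                int paramRegisters, int extra) {
            int oldCount = impl.getRegisterCount();
            int newCount = oldCount + extra;

            List<Instruction> instructions = new ArrayList<>();
            for (int i = 0; i < paramRegisters; i++) {
                int oldPos = oldCount - paramRegisters + i; // where the code expects it
                int newPos = newCount - paramRegisters + i; // where the VM now puts it
                // move-object/from16 vAA, vBBBB; use MOVE_FROM16 / MOVE_WIDE_FROM16
                // for primitive / wide parameters instead.
                instructions.add(new ImmutableInstruction22x(
                        Opcode.MOVE_OBJECT_FROM16, oldPos, newPos));
            }
            for (Instruction insn : impl.getInstructions()) {
                instructions.add(insn);
            }
            return new ImmutableMethodImplementation(
                    newCount, instructions, impl.getTryBlocks(), impl.getDebugItems());
        }
    }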
You can use code like this:
if (it.opcode == Opcode.INVOKE_VIRTUAL || it.opcode == Opcode.INVOKE_STATIC) {
    logger.warn("${it.opcode.name} ${(it as DexBackedInstruction35c).reference}")
}
The format of Opcode.INVOKE_VIRTUAL is Format35c, so the concrete type of the instruction is DexBackedInstruction35c.
At some point in my program I need the initial class file bytes (the bytes describing the class before any transformations were applied). The methods I evaluated so far are:
Using the corresponding classloader to get the resource and simply loading the byte array again. This won't work for dynamically generated classes, though (ASM, proxies, etc.).
Storing a reference to the initial class file bytes in a ClassFileTransformer (roughly sketched after this list). While this works, it means I need to proactively store all byte arrays for all classes in case I need some of them later on. Not cool.
Pretty much the same as above, but using JVMTI's ClassFileLoadHook. The issue is the same as with the ClassFileTransformer, though.
I checked what happens when Instrumentation.retransformClasses is called. In the end this comes down to a native method that needs the instanceKlassHandles to get the class file bytes. So that's nothing I can really access either (at least I wouldn't know how).
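To make the second option concrete, this is roughly what I mean; a minimal sketch, with a hypothetical cache map, registered via Instrumentation.addTransformer():

    import java.lang.instrument.ClassFileTransformer;
    import java.security.ProtectionDomain;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class ByteCachingTransformer implements ClassFileTransformer {
        // Keeps the first buffer seen for every class: exactly the
        // "store everything upfront" cost I'd like to avoid.
        private final Map<String, byte[]> initialBytes = new ConcurrentHashMap<>();

        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain pd, byte[] classfileBuffer) {
            if (className != null) {
                initialBytes.putIfAbsent(className, classfileBuffer.clone());
            }
            return null; // null = leave the class unchanged
        }

        public byte[] get(String internalName) {
            return initialBytes.get(internalName);
        }
    }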
Any other ideas for how I could get the initial class file bytes without storing a reference to the bytes for all classes upfront?
Does anyone know why Spring Integration (AMQP 1.3.5) requires the correlation-id to be a byte array? Rabbit's AMQP-Client 3.3.5 takes a String for the correlation-id in the AMQP.BasicProperties class. Doesn't Spring need to convert the byte array to this String at some point? We're finding that the correlation-id in the message Rabbit sends is still a byte array, and is never converted to a String. Any insight?
Good question, I have no insight; it was before my time on the project and it's a day one issue.
Spring AMQP converts the byte[] to a String in DefaultMessagePropertiesConverter (outbound), invoked by the RabbitTemplate, using UTF-8 by default. The resulting String is added to the BasicProperties.
On the listener container (inbound) side, UTF-8 is used unconditionally.
The rabbit client converts to byte[] when writing to the wire (in ValueWriter.writeShortStr()), unconditionally using charset UTF-8.
So, unless you change the charset in the RabbitTemplate (which would be bad), it's effectively a no-op (just an unnecessary conversion if you already have a String).
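To illustrate, this is what the 1.x sending side looks like; a minimal sketch, with a made-up correlation id and payload:

    import java.nio.charset.StandardCharsets;

    import org.springframework.amqp.core.Message;
    import org.springframework.amqp.core.MessageProperties;

    public class CorrelationIdExample {
        public static Message build() {
            MessageProperties props = new MessageProperties();
            // The 1.x abstraction wants byte[]; DefaultMessagePropertiesConverter
            // turns it back into a String (UTF-8) before it reaches BasicProperties.
            props.setCorrelationId("order-42".getBytes(StandardCharsets.UTF_8));
            return new Message("payload".getBytes(StandardCharsets.UTF_8), props);
        }
    }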
I can only speculate that, since MessageProperties is an abstraction (and not tied to RabbitMQ), some other client had it as a byte[] when the abstraction was being designed.
Since we only have a RabbitMQ implementation, I wouldn't be averse to adding an optimization to the abstraction to avoid the unnecessary conversion.
Feel free to open an Improvement JIRA Issue and we'll take a look at it for the upcoming 1.5 release.
I am working with IJVM and trying to use the GOTO instruction with a local variable in place of a static offset (or label). It won't work. I suppose it is simply treating the variable name as a label and trying to branch to it, but no such label exists. Is there any way I can force it to read the contents of the variable (which contains an offset), or is there some other solution?
Thanks in advance.
For security reasons, JVM bytecode doesn't let you jump to arbitrary instructions based on the contents of a variable. This restriction makes it possible for the JVM to verify various security properties of the bytecode by statically enumerating all control paths through a particular method. If you were able to jump anywhere, the static analyzer couldn't prove that all necessary program invariants held.
If you do need to jump to an arbitrary index, consider looking into the tableswitch or lookupswitch instructions, which let you enumerate the possible destinations in advance. It's not exactly what you're looking for, but to the best of my knowledge the sort of arbitrary jump you're trying to make isn't possible in JVM bytecode.
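For instance, the usual way to emulate a computed jump is to enumerate the targets in a switch; javac compiles a dense switch like the one below into a single tableswitch (a quick Java illustration, with made-up handler names):

    public class ComputedJump {
        // A dense switch compiles to one tableswitch instruction, which is the
        // closest the JVM gets to "goto the offset stored in a variable".
        static String dispatch(int target) {
            switch (target) {
                case 0: return "handler A";
                case 1: return "handler B";
                case 2: return "handler C";
                default: return "no such target";
            }
        }

        public static void main(String[] args) {
            System.out.println(dispatch(1)); // prints "handler B"
        }
    }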
Hope this helps!
The GOTO instruction is implemented in MIC1. It interprets the 2 bytes after the opcode as an offset to the PC at the start of the instruction.
I think the assignment must be asking you to write a new GOTO in MIC1 that interprets the byte after the opcode as the index of a local variable containing the branch offset.