Protobuf concatenation of serialized messages into one file

I have some data serialized with Google protobuf in a series of files, and I wonder if there is a shortcut way of concatenating these smaller files into one larger protobuf file without having to read each and every protobuf, group the objects, and write them out again.
Is there a cheap way to join the files together? I.e., do I have to deserialize each individual file?

You can combine protocol buffer messages by simple concatenation. It appears that you want the result to form an array, so you'll need to serialize each individual file as an array itself:
message MyItem {
  ...
}
message MyCollection {
  repeated MyItem items = 1;
}
Now if you serialize each file as a MyCollection and then concatenate them (just put the raw binary data together), the resulting file can be read as one large collection itself.
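As a rough sketch, assuming MyCollection is the Java class generated from the schema above (the file handling and function name here are illustrative, not from the original post), the join needs no parsing at all, because the protobuf wire format treats concatenated messages as one merged message and simply appends the repeated items:
import java.io.File
import java.io.FileOutputStream

// Append the raw serialized bytes of each MyCollection file into one output file.
fun concatCollections(inputs: List<File>, output: File) {
    FileOutputStream(output).use { out ->
        inputs.forEach { out.write(it.readBytes()) }
    }
    // Any protobuf parser, e.g. the generated MyCollection.parseFrom(output.readBytes()),
    // now sees a single MyCollection whose items field contains all the entries.
}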

In addition to jpa's answer, it might be relevant to say that the data does not need to be serialized from exactly the same container type for it to be compatible on deserialization.
Consider the following messages:
message FileData {
  required uint32 versionNumber = 1;
  repeated Data initialData = 2;
}
message MoreData {
  repeated Data data = 2;
}
It is possible to serialize those different messages into one single data container and deserialize it as one single FileData message, as long as the FileData is serialized before zero or more MoreData messages, and both FileData and MoreData use the same field number for the repeated field.
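A small sketch of that case, assuming FileData, MoreData and Data are the Java classes generated from the messages above (this helper is hypothetical, not from the original post):
import java.io.ByteArrayOutputStream

// Concatenate one FileData followed by zero or more MoreData blobs and
// read the result back as a single FileData.
fun combine(first: FileData, extras: List<MoreData>): FileData {
    val out = ByteArrayOutputStream()
    first.writeTo(out)                  // FileData must come first
    extras.forEach { it.writeTo(out) }  // zero or more MoreData appended
    // Field number 2 is repeated in both messages, so parsing merges all Data entries.
    return FileData.parseFrom(out.toByteArray())
}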

Related

Kafka, Avro and Schema Registry

I have a Kafka consumer configured with schema polling from the topic. What I would like to do is create another Avro schema on top of the current one and hydrate data using it; basically, I don't need 50% of the information and need to write some logic to change a couple of fields. That's just an example:
val consumer: KafkaConsumer<String, GenericRecord> = createConsumer(props)
while (true) {
    consumer.poll(Duration.ofSeconds(10)).forEach {
        println(it.value())
    }
}
The event returned from the stream is pretty complex, so I've modelled a smaller CustomObj as a .avsc file and compiled it to Java. When trying to run the code with CustomObj I get "Error deserializing key/value for partition". All I want to do is consume an event and then deserialize it into a much smaller object with just the selected fields.
return KafkaConsumer<String, CustomObj>(props)
This didn't work, and I'm not sure how I can deserialize it into CustomObj from the GenericRecord. Let me just add that I don't have any access to the stream or its config; I can only consume from it.
In Avro, your reader schema needs to be compatible with the writer schema. By giving the smaller object, you're providing a different reader schema.
It's not possible to directly deserialize to a subset of the input data, so you must parse the larger object and map it to the smaller one (which isn't what deserialization does).
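A sketch of that parse-and-map step in Kotlin, with hypothetical field names ("name", "age") standing in for whatever the writer schema actually contains:
import java.time.Duration
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.consumer.KafkaConsumer

// The smaller shape you care about, as a plain class rather than an Avro-generated one.
data class CustomObj(val name: String, val age: Int)

fun consume(consumer: KafkaConsumer<String, GenericRecord>) {
    while (true) {
        consumer.poll(Duration.ofSeconds(10)).forEach { record ->
            val value = record.value()
            // Deserialize with the full (writer-compatible) schema, then project the fields you need.
            val smaller = CustomObj(
                name = value.get("name").toString(),
                age = value.get("age") as Int
            )
            println(smaller)
        }
    }
}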

How can I serialize tuples as lists in F#?

I have a library that sends me results that include tuples. I need to process some of the data, serialize it and then it goes on its way to another system.
The tuples are ALWAYS made of 2 values, but they are extremely wasteful when serialized:
(3, 4)
will serialize as:
{"Item1":3,"Item2":4}
whereas
[3; 4]
will serialize as:
[3,4]
I would like to avoid rebuilding the whole data structure and copying all the data just to change this part.
Is there a way, at the serializer level, to convert the tuples into lists?
The next process's parser can easily be changed to accommodate a list instead of tuples, so that seems like the best scenario.
The ugly option would be to fix the serialized string with a regex, but I would really like to avoid doing that.
You can override the default serialization behaviour by specifying your own JsonConverter. The following example shows a formatter that writes int * int tuples as two-element JSON arrays.
open Newtonsoft.Json
type IntIntConverter() =
  inherit JsonConverter<int * int>()
  override x.WriteJson(writer:JsonWriter, (a:int, b:int), serializer:JsonSerializer) =
    writer.WriteStartArray()
    writer.WriteValue(a)
    writer.WriteValue(b)
    writer.WriteEndArray()
  override x.ReadJson(reader, objectType, existingValue, hasExistingValue, serializer) =
    (0, 0)

let sample = [ (1,2); (3,4) ]
let json = JsonConvert.SerializeObject(sample, Formatting.None, IntIntConverter())
The result of running this will be [[1,2],[3,4]]. Note that I have not implemented the ReadJson method, so you cannot yet parse the tuples. This will involve some extra work, but you can look at existing JsonConverters to see how this should be done.
Also note that this is for a specific tuple type containing two integers. If you need to support other tuples, you will probably need to provide several variants of the converter.

storing non-root table of flatbuffers object for later deserialization

Consider the following flatbuffers schema (from this stack overflow question):
table Foo {
  ...
}
table Bar {
  value:[Foo];
}
root_type Bar;
Assume the number of Foos in a typical object is significant, so we want to avoid modifying the schema to make Foo the root_type.
Scenario:
A C++ client serializes a proper flatbuffers object and posts it to another component (nodejs backend) that partially deserializes the object and stores the binary representing every Foo in a database as separate documents:
const buf = new flatbuffers.ByteBuffer(req.body)
const bar = fbs.Bar.getRootAsBar(buf)
for (let i = 0; i < bar.valueLength(); i++) {
  const foo = bar.value(i)
  let item = {
    'raw': foo.bb.bytes_ // <-- primary suspect
  }
  // ... store `item` as an individual entity (mongodb doc)
}
Later, a third component fetches the binary data stored in "raw" key of the mongodb documents and tries to deserialize it into a Foo object:
auto mongoCol = db.collection("results");
auto mongoResult = mongoCol.find_one(
    bsoncxx::builder::stream::document{}
    << "_id" << oid << bsoncxx::builder::stream::finalize);
// ...check that mongoResult is not null
const auto result = mongoResult->view();
const auto& binary = result["raw"].get_binary();
std::string content((const char*)binary.bytes, binary.size);
const auto& foo = flatbuffers::GetRoot<fbs::Foo>(content.c_str());
The problem:
The pointer returned as foo does not point to the expected data, and any operation on foo potentially leads to a segfault or access violation.
Suspicions:
I speculate that the root cause is that the binary stored in the database uses offsets relative to the original message, so it is essentially invalid as a standalone buffer, and the offsets would need to be readjusted before inserting it into the database. But I do not see any FlatBuffers API function for readjusting the offsets.
One less likely root cause may be that the final deserialization code is incomplete and we have to readjust the offsets there.
The reason I suspect it is related to offsets is that this same code works just fine if we make a compromise and post smaller flatbuffers objects with one Foo element in every Bar vector (and change the backend code to store bar.bb.bytes in raw instead).
Question:
In any way, is it even possible to grab part of a larger properly constructed flatbuffers binary file that you know represents your desired table and deserialize it on its own?
You can't simply copy a sub-table out of a larger FlatBuffer byte-wise, since this data is not necessarily contiguous. The best workaround is to instead make Bar store a [FooBuffer], where table FooBuffer { buf:[byte] (nested_flatbuffer: Foo) }. When you construct one of these, you construct each Foo into its own FlatBufferBuilder and then store the resulting bytes in the parent. When you later need to store the Foos separately, this then becomes an easy copy.

Kotlin convert String to Iterable

I have an iterable of People that I save as a string after converting it from JSON. I want to know how I would convert the string back to a list.
// Save data
val peopleString = myList.toString()
// String saved is
[People(name=john, age=23), People(name=mary, age=21), People(name=george, age=11)]
Now is it possible to convert peopleString back to a list?
val peopleList: List<People> = peopleString.?
In short, no... kind of.
Your output is not JSON, and toString() is the wrong function to use if you wanted JSON. The output of toString() is not a proper serialization format that can be understood and used to rebuild the original data structure.
Converting a data structure into some format so that it can be transmitted and later rebuilt is known as serialization. Kotlin has a serializer which can serialize objects into a number of different formats, including JSON: https://github.com/Kotlin/kotlinx.serialization#quick-example.
It's not as easy to use as toString(), but that's to be expected, as toString's purpose is very different from serialization.
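A minimal sketch of that with kotlinx.serialization, assuming People has the name/age fields shown above and that the serialization compiler plugin plus kotlinx-serialization-json are on the classpath:
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

@Serializable
data class People(val name: String, val age: Int)

fun main() {
    val myList = listOf(People("john", 23), People("mary", 21), People("george", 11))
    // Produce a real JSON string instead of toString() output.
    val peopleString = Json.encodeToString(myList)
    // Later, rebuild the list from the stored string.
    val peopleList: List<People> = Json.decodeFromString(peopleString)
    println(peopleList)
}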

Encoding cyclic data structures (eg directed graphs) using protocol buffers

I have a graph data structure that I'd like to encode with protocol buffers. There are cyclic connections between the graph vertices. Is there a standard/common way to encode such structures in protobuf? One approach that comes to mind is to add an "id" field to each vertex, and use those ids instead of pointers. E.g.:
message Vertex {
  required int32 id = 1;
  required string label = 2;
  repeated int32 outgoing_edges = 3; // values should be id's of other nodes
}
message Graph {
  repeated Vertex vertices = 1;
}
Then I could write classes that wrap the protobuf-generated classes, and automatically convert these identifiers to real pointers on deserialization (and back to ids on serialization). Is this the best approach? If so, then does anyone know of existing projects that use/document this approach? If not, then what approach would you recommend?
If you need cross-platform support, then using a DTO as you propose in the question and mapping it to/from a separate graph-based model in your own code is probably your best approach.
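A rough Kotlin sketch of that mapping, using plain data classes as stand-ins for the protobuf-generated Vertex/Graph types (all names here are illustrative):
// Stand-ins for the generated protobuf messages from the schema above.
data class VertexDto(val id: Int, val label: String, val outgoingEdges: List<Int>)
data class GraphDto(val vertices: List<VertexDto>)

// In-memory model that uses real object references instead of ids.
class GraphNode(val id: Int, val label: String) {
    val outgoing = mutableListOf<GraphNode>()
}

fun toObjectGraph(dto: GraphDto): Collection<GraphNode> {
    // First pass: create one node per vertex, indexed by id.
    val byId = dto.vertices.associate { it.id to GraphNode(it.id, it.label) }
    // Second pass: resolve id lists into object references; cycles are handled naturally.
    for (v in dto.vertices) {
        val node = byId.getValue(v.id)
        v.outgoingEdges.forEach { node.outgoing.add(byId.getValue(it)) }
    }
    return byId.values
}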
As a side note, in protobuf-net (C# / .NET) I've added support for this, which adds a layer of abstraction silently. Basically, the following works:
[ProtoContract]
class Vertex {
    ...
    [ProtoMember(3, AsReference = true)]
    public List<Vertex> OutgoingEdges { get; set; }
}