storing non-root table of flatbuffers object for later deserialization - flatbuffers

Consider the following flatbuffers schema (from this stack overflow question):
table Foo {
...
}
table Bar {
value:[Foo];
}
root_type Bar;
Assume the number of Foos in a typical object is significant, so we want to avoid modifying the schema to make Foo the root_type.
Scenario:
A C++ client serializes a proper FlatBuffers object and posts it to another component (a Node.js backend) that partially deserializes the object and stores the binary representing each Foo in a database as a separate document:
const buf = new flatbuffers.ByteBuffer(req.body)
const bar = fbs.Bar.getRootAsBar(buf)
for (let i = 0; i < bar.valueLength(); i++) {
  const foo = bar.value(i)
  let item = {
    'raw': foo.bb.bytes_ // <-- primary suspect
  }
  // ... store `item` as an individual entity (mongodb doc)
}
Later, a third component fetches the binary data stored in "raw" key of the mongodb documents and tries to deserialize it into a Foo object:
auto mongoCol = db.collection("results");
auto mongoResult = mongoCol.find_one(
    bsoncxx::builder::stream::document{}
    << "_id" << oid << bsoncxx::builder::stream::finalize);
// ...check that mongoResult is not null
const auto result = mongoResult->view();
const auto& binary = result["raw"].get_binary();
std::string content((const char*)binary.bytes, binary.size);
const auto& foo = flatbuffers::GetRoot<fbs::Foo>(content.c_str());
The problem:
The pointer returned as foo does not point to the expected data, and any operation on foo potentially leads to a segfault or access violation.
Suspicions:
I speculate that the root cause is that the binary stored in the database uses offsets relative to the original message, so on its own it is essentially invalid, and the offsets would have to be readjusted before inserting into the database. But I do not see any FlatBuffers API function for readjusting offsets.
A less likely root cause may be that the final deserialization code is incomplete and the offsets have to be readjusted there.
The reason I suspect it is related to offsets is that this same code works just fine if we make a compromise and post smaller FlatBuffers objects with one Foo element in every Bar vector (and change the backend code to store bar.bb.bytes in raw instead).
Question:
Put another way: is it even possible to grab part of a larger, properly constructed FlatBuffers binary that you know represents your desired table and deserialize it on its own?

You can't simply copy a sub-table out of a larger FlatBuffer byte-wise, since its data is not necessarily contiguous. The best workaround is to instead make Bar store a [FooBuffer], where table FooBuffer { buf:[ubyte] (nested_flatbuffer: "Foo") }. When you construct one of these, you build each Foo into its own FlatBufferBuilder and then store the resulting bytes in the parent. Then, when you need to store Foos separately, this becomes an easy copy.
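For illustration, a minimal sketch of that schema change and of the construction step, assuming flatc-generated C++ headers (file and type names here are invented):
table FooBuffer {
  buf:[ubyte] (nested_flatbuffer: "Foo");
}
table Bar {
  value:[FooBuffer];
}
root_type Bar;
// Sketch only: build each Foo in its own FlatBufferBuilder, then copy its
// finished bytes into the parent buffer as a FooBuffer.
#include "flatbuffers/flatbuffers.h"
#include "bar_generated.h"  // hypothetical flatc output for the schema above

flatbuffers::Offset<fbs::FooBuffer> MakeFooBuffer(
    flatbuffers::FlatBufferBuilder &parent) {
  flatbuffers::FlatBufferBuilder nested;
  auto foo = fbs::CreateFoo(nested /*, ...Foo's fields... */);
  nested.Finish(foo);  // `nested` now holds a complete, self-contained FlatBuffer

  // Copy the nested buffer into the parent as the [ubyte] field.
  auto bytes = parent.CreateVector(nested.GetBufferPointer(), nested.GetSize());
  return fbs::CreateFooBuffer(parent, bytes);
}
Each stored buf is then a complete FlatBuffer in its own right, so the backend can persist the bytes of each FooBuffer.buf directly and the third component can call flatbuffers::GetRoot<fbs::Foo>() on them without any offset adjustment.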

Related

Queries on schema and JSON data conversion

We already have the FlatBuffers library embedded in our software for simple schemas with JSON output data generation.
Update: we generate the header files with the flatc compiler against the schema and integrate these files into our code along with the FlatBuffers library for further serialization/deserialization.
Now we also need the following schema tree to be supported.
namespace SampleNS;
/// user defined key value pairs to add custom metadata
/// key namespacing is the responsibility of the user
table KeyValue {
  key:string (key, required);
  value:string (required);
}
enum SchemaVersion:byte {
  V1,
  V2
}
table Sometable {
  value1:ubyte;
  value2:ushort (key);
}
table ComponentData {
  inputs: [Sometable];
  outputs: [Sometable];
}
table Node {
  name:string (key);
  /// IO definition
  data:ComponentData;
  /// nested child
  child:[Components];
}
table Components {
  type:ubyte;
  index:ubyte;
  nodes:[Node];
}
table GroupMasterData {
  schemaversion:SchemaVersion = V1;
  metainfo:[KeyValue];
  /// List of expected components in the system
  components:[Components];
}
root_type GroupMasterData;
As shown above, table Components is nested recursively. The intention is that components may have children with the same fields.
I have a few queries:
1. flatc didn't give me any error during schema compilation for such recursive nested tables. But is this supported during field access for such tables?
2. I tried to generate a sample JSON data file based on the above, but I could not see the field for schemaversion. I learned that FlatBuffers doesn't serialize default values, so I removed the default value that I had assigned in the schema. But it still doesn't get written into the JSON data file. On this I also learned that we can forcefully write it using the force_defaults option. I don't know where this is to be put: in the attribute or elsewhere?
3. Can I create a struct of an enum field?
4. Is there any API to set FlatBuffers options that we otherwise pass as compiler arguments? If not, maybe we have to tinker with the FlatBuffers library code. Please suggest.
Method 1:
In our serialization method, we do this:
flatbuffers::Parser* parser = new flatbuffers::Parser();
parser->opts.output_default_scalars_in_json = true;
Is this the right method or should I use any other API?
Yes, trees (and even DAG) structures are fully supported. The type definition is recursive, but the data will eventually have leaf nodes with an empty vector of children, presumably.
The integer value for V1 is 0, and that is also the default value for all fields with no explicit default assigned. Use --defaults-json to see this field when converting. Note that an explicit version field in a schema is an anti-pattern, since schemas are naturally evolvable without breaking backwards compatibility.
You can put enum fields in structs, yes. Is that what you mean?
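On the options question (and "Method 1"): setting parser.opts programmatically is a reasonable way to do this when the library is embedded rather than invoking flatc. A minimal sketch, assuming the C++ library's flatbuffers/idl.h (the JSON-generation entry point has changed name and signature across releases, so check your version):
#include <string>
#include "flatbuffers/idl.h"

std::string BufferToJson(const std::string& schema_text, const uint8_t* flatbuf) {
  flatbuffers::Parser parser;
  // Programmatic counterpart of the --defaults-json behaviour.
  parser.opts.output_default_scalars_in_json = true;
  if (!parser.Parse(schema_text.c_str())) {
    return "";  // schema failed to parse; parser.error_ holds the message
  }
  std::string json;
  flatbuffers::GenerateText(parser, flatbuf, &json);  // buffer -> JSON text
  return json;
}
There is no need for new/delete as in "Method 1"; a stack-allocated Parser works fine.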

What is the best practice for iterating record keys and values in ReasonML?

I'm new to ReasonML, but I have read through most of the official documentation. I could go through casual trial and error for this, but since I need to write ReasonML code right now, I'd like to know the best practice for iterating the keys and values of Reason record types.
I fully agree with @Shawn that you should use a more appropriate data structure. A list of tuples, for example, is a nice and easy way to pass in a user-defined set of homogeneous key/value pairs:
fooOnThis([
("test1", ["a", "b", "c"]),
("test2", ["c"]),
])
If you need heterogeneous data I would suggest using a variant to specify the data type:
type data =
  | String(string)
  | KvPairs(list((string, data)));

fooOnThis([
  ("test1", [String("a"), String("b"), String("c")]),
  ("test2", [String("c"), KvPairs([("innerTest", String("d"))])]),
])
Alternatively you can use objects instead of records, which seems like what you actually want.
For the record, a record requires a pre-defined record type:
type record = {
foo: int,
bar: string,
};
and this is how you construct them:
let value = {
foo: 42,
bar: "baz",
};
Objects on the other hand are structurally typed, meaning they don't require a pre-defined type, and you construct them slightly differently:
let value
: {. "foo": int, "bar": string }
= {"foo": 42, "bar": "baz"};
Notice that the keys are strings.
With objects you can use Js.Obj.keys to get the keys:
let keys = Js.Obj.keys(value); // returns [|"foo", "bar"|]
The problem now is getting the values. There is no Js.Obj API for getting the values or entries because it would either be unsound or very impractical. To demonstrate that, let's try making it ourselves.
We can easily write our own binding to Object.entries:
[@bs.val] external entries: Js.t({..}) => array((string, _)) = "Object.entries";
entries here is a function that takes any object and returns an array of tuples with string keys and values of a type that will be inferred based on how we use them. This is neither safe, because we don't know what the actual value types are, nor particularly practical, as it will be homogeneously typed. For example:
let fields = entries({"foo": 42, "bar": "baz"});
// This will infer the value's type as an `int`
switch (fields) {
| [|("foo", value), _|] => value + 2
| _ => 0
};
// This will infer the value's type as a `string`, and yield a type error
// because `fields` can't be typed to hold both `int`s and `string`s
switch (fields) {
| [|("foo", value), _|] => value ++ "2"
| _ => ""
};
You can use either of these switch expressions (with unexpected results and possible crashes at runtime), but not both together as there is no unboxed string | int type to be inferred in Reason.
To get around this we can make the value an abstract type and use Js.Types.classify to safely get the actual underlying data type, akin to using typeof in JavaScript:
type value;
[@bs.val] external entries: Js.t({..}) => array((string, value)) = "Object.entries";
let fields = entries({"foo": 42, "bar": "baz"});
switch (fields) {
| [|("foo", value), _|] =>
  switch (Js.Types.classify(value)) {
  | JSString(str) => str
  | JSNumber(number) => Js.Float.toString(number)
  | _ => "unknown"
  }
| _ => "unknown"
};
This is completely safe but, as you can see, not very practical.
Finally, we can actually modify this slightly to use it safely with records as well, by relying on the fact that records are represented internally as JavaScript objects. All we need to do is not restrict entries to objects:
[@bs.val] external entries: 'a => array((string, value)) = "Object.entries";
let fields = entries({foo: 42, bar: 24}); // returns [|("foo", 42), ("bar", 24)|]
This is still safe because all values are objects in JavaScript and we don't make any assumptions about the type of the values. If we try to use this with a primitive type we'll just get an empty array, and if we try to use it with an array we'll get the indexes as keys.
But because records need to be pre-defined this isn't going to be very useful. So all this said, I still suggest going with the list of tuples.
Note: This uses ReasonML syntax since that's what you asked for, but refers to the ReScript documentation, which uses the slightly different ReScript syntax, since the BuckleScript documentation has been taken down. (Yeah, it's a mess right now, I know. Hopefully it'll improve eventually.)
Maybe I am not understanding the question or the use case, but as far as I know there is no way to iterate over the key/value pairs of a record. You may want to use a different data model:
hash table https://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
Js.Dict (if you're working in bucklescript/ReScript) https://rescript-lang.org/docs/manual/latest/api/js/dict
a list of tuples
With a record all keys and value types are known so you can just write code to handle each one, no iteration needed.

Protobuf concatenation of serialized messages into one file

I have some data serialized with Google protobuf in a series of files, but I wonder if there is a shortcut way of concatenating these smaller files into one larger protobuf without worrying about reading each and every protobuf, grouping the objects, and outputting them again.
Is there a cheap way to join the files together? I.e., do I have to serialize each individual file?
You can combine protocol buffers messages by simple concatenation. It appears that you want the result to form an array, so you'll need to serialize each individual file as an array itself:
message MyItem {
  ...
}
message MyCollection {
  repeated MyItem items = 1;
}
Now if you serialize each file as a MyCollection and then concatenate them (just put the raw binary data together), the resulting file can be read as one large collection itself.
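A hedged C++ sketch of this, assuming protoc-generated code for the messages above (the header name is illustrative) and that each input file contains one serialized MyCollection:
#include <fstream>
#include <initializer_list>
#include <iterator>
#include <string>
#include "my_collection.pb.h"  // hypothetical generated header

int main() {
  std::string combined;
  for (const char* path : {"part1.bin", "part2.bin", "part3.bin"}) {
    std::ifstream in(path, std::ios::binary);
    combined.append(std::istreambuf_iterator<char>(in),
                    std::istreambuf_iterator<char>());  // raw byte concatenation
  }

  // Parsing the concatenation merges the parts: the repeated `items` fields
  // from every file end up in a single collection.
  MyCollection all;
  all.ParseFromString(combined);
  // all.items_size() is now the total number of items across all parts.
  return 0;
}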
In addition to jpa's answer, it might be relevant to say that the data does not need to be in exactly the same container when it is serialized for it to be compatible on deserialization.
Consider the following messages:
message FileData {
  required uint32 versionNumber = 1;
  repeated Data initialData = 2;
}
message MoreData {
  repeated Data data = 2;
}
It is possible to serialize those different messages into one single data container and deserialize it as one single FileData message, as long as the FileData is serialized before zero or more MoreData messages and both FileData and MoreData use the same field number for the repeated field.
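For example (a sketch, assuming protoc-generated C++ for the messages above and a Data message, defined elsewhere in the same .proto, that has no required fields):
#include <string>
#include "file_data.pb.h"  // hypothetical generated header

void RoundTrip() {
  std::string blob, tail;
  FileData header;
  header.set_versionnumber(1);      // the required field must be set to serialize
  header.SerializeToString(&blob);

  MoreData extra;
  extra.add_data();                 // one default-valued Data element, for illustration
  extra.SerializeToString(&tail);
  blob += tail;                     // FileData first, then zero or more MoreData

  FileData combined;
  combined.ParseFromString(blob);   // combined.initialdata() now also contains the
                                    // elements from `extra`, since both use field 2
}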

Encoding cyclic data structures (eg directed graphs) using protocol buffers

I have a graph data structure that I'd like to encode with protocol buffers. There are cyclic connections between the graph vertices. Is there a standard/common way to encode such structures in protobuf? One approach that comes to mind is to add an "id" field to each vertex, and use those ids instead of pointers. E.g.:
message Vertex {
  required int32 id = 1;
  required string label = 2;
  repeated int32 outgoing_edges = 3; // values should be id's of other nodes
}
message Graph {
  repeated Vertex vertices = 1;
}
Then I could write classes that wrap the protobuf-generated classes, and automatically convert these identifiers to real pointers on deserialization (and back to ids on serialization). Is this the best approach? If so, then does anyone know of existing projects that use/document this approach? If not, then what approach would you recommend?
If you need cross-platform support, then using a DTO as you propose in the question and mapping it to/from a separate graph-based model in your own code is probably your best approach.
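For instance, a sketch of that mapping in C++, assuming protoc-generated classes for the Vertex/Graph schema above (the header name is invented); ids are resolved to pointers in a second pass, which handles cycles naturally:
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>
#include "graph.pb.h"  // hypothetical generated header

struct GraphNode {
  int32_t id;
  std::string label;
  std::vector<GraphNode*> outgoing;  // real pointers instead of ids
};

std::unordered_map<int32_t, GraphNode> BuildGraph(const Graph& proto) {
  std::unordered_map<int32_t, GraphNode> nodes;
  // First pass: create all nodes so every id has an address.
  for (const Vertex& v : proto.vertices()) {
    nodes[v.id()] = GraphNode{v.id(), v.label(), {}};
  }
  // Second pass: resolve edge ids to pointers (cycles are fine here).
  for (const Vertex& v : proto.vertices()) {
    for (int32_t target : v.outgoing_edges()) {
      nodes[v.id()].outgoing.push_back(&nodes.at(target));
    }
  }
  return nodes;  // the caller owns the map; the GraphNode pointers point into it
}
Serialization is the reverse: walk your in-memory graph, emit one Vertex per node, and write each edge as the target node's id.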
As a side note, in protobuf-net (c# / .net) I've added support for this which adds a layer of abstraction silently. Basically, the following works:
[ProtoContract]
class Vertex {
  ...
  [ProtoMember(3, AsReference = true)]
  public List<Vertex> OutgoingEdges {get;set;}
}

Reference Semantics in Google Protocol Buffers

I have a slightly peculiar program which deals with cases very similar to this
(in C#-like pseudo code):
class CDataSet
{
  int m_nID;
  string m_sTag;
  float m_fValue;
  void PrintData()
  {
    //Blah Blah
  }
};
class CDataItem
{
  int m_nID;
  string m_sTag;
  CDataSet m_refData;
  CDataSet m_refParent;
  void Print()
  {
    if(null == m_refData)
    {
      m_refParent.PrintData();
    }
    else
    {
      m_refData.PrintData();
    }
  }
};
Members m_refData and m_refParent are initialized to null and used as follows:
m_refData -> Used when a new data set is added
m_refParent -> Used to point to an existing data set.
A new data set is added only if the field m_nID doesn't match an existing one.
Currently this code is managing around 500 objects with around 21 fields per object, and the format of choice as of now is XML, which at 100k+ lines and 5MB+ is very unwieldy.
I am planning to modify the whole shebang to use ProtoBuf, but currently I'm not sure how I can handle the reference semantics. Any thoughts would be much appreciated.
Out of the box, protocol buffers does not have any reference semantics. You would need to cross-reference them manually, typically using an artificial key. Essentially, on the DTO layer you would add a key to CDataSet (that you simply invent, perhaps just an increasing integer), store the key instead of the item in m_refData/m_refParent, and run the fixup manually during serialization/deserialization. You could also just store the index into the set of CDataSet, but that may make insertion etc. more difficult. Up to you; since this is serialization, you could argue that you won't insert (etc.) outside of the initial population and hence the raw index is fine and reliable.
This is, however, a very common scenario - so as an implementation-specific feature I've added optional (opt-in) reference tracking to my implementation (protobuf-net), which essentially automates the above under the covers (so you don't need to change your objects or expose the key outside of the binary stream).
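As an illustration of the manual-key approach (message and field names below are invented), the DTO could look something like this, with the pointer fixup done in your own load/save code:
message CDataSetDto {
  required int32 key = 1;         // artificial key, e.g. an increasing integer
  required int32 id = 2;
  required string tag = 3;
  required float value = 4;
}
message CDataItemDto {
  required int32 id = 1;
  required string tag = 2;
  optional int32 data_key = 3;    // key of the set referenced by m_refData, if any
  optional int32 parent_key = 4;  // key of the set referenced by m_refParent, if any
}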