Encoding cyclic data structures (eg directed graphs) using protocol buffers - serialization

I have a graph data structure that I'd like to encode with protocol buffers. There are cyclic connections between the graph vertices. Is there a standard/common way to encode such structures in protobuf? One approach that comes to mind is to add an "id" field to each vertex, and use those ids instead of pointers. E.g.:
message Vertex {
  required int32 id = 1;
  required string label = 2;
  repeated int32 outgoing_edges = 3; // values should be ids of other vertices
}
message Graph {
  repeated Vertex vertices = 1;
}
Then I could write classes that wrap the protobuf-generated classes, and automatically convert these identifiers to real pointers on deserialization (and back to ids on serialization). Is this the best approach? If so, then does anyone know of existing projects that use/document this approach? If not, then what approach would you recommend?

If you need cross-platform support, then using a DTO as you propose in the question, and mapping it to/from a separate graph-based model in your own code, is probably your best approach.
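A minimal sketch of that mapping, under stated assumptions: VertexDto and GraphDto are hand-written stand-ins for whatever classes your protobuf tooling generates, and GraphNode and GraphMapper are hypothetical names for the in-memory model and the conversion code. It resolves ids in two passes, so cycles need no special handling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical DTOs mirroring the Vertex/Graph messages from the question.
public class VertexDto
{
    public int Id { get; set; }
    public string Label { get; set; }
    public List<int> OutgoingEdges { get; } = new List<int>();
}

public class GraphDto
{
    public List<VertexDto> Vertices { get; } = new List<VertexDto>();
}

// Hypothetical in-memory model that uses real object references.
public class GraphNode
{
    public string Label { get; set; }
    public List<GraphNode> Outgoing { get; } = new List<GraphNode>();
}

public static class GraphMapper
{
    // Pass 1 creates every node; pass 2 turns ids into references.
    public static List<GraphNode> ToModel(GraphDto dto)
    {
        var byId = dto.Vertices.ToDictionary(v => v.Id, v => new GraphNode { Label = v.Label });
        foreach (var v in dto.Vertices)
            foreach (var targetId in v.OutgoingEdges)
                byId[v.Id].Outgoing.Add(byId[targetId]);
        return dto.Vertices.Select(v => byId[v.Id]).ToList();
    }

    // Ids are invented here: each node's position in the list.
    // Assumes every edge target is itself contained in 'nodes'.
    public static GraphDto ToDto(IReadOnlyList<GraphNode> nodes)
    {
        var idOf = new Dictionary<GraphNode, int>();
        for (int i = 0; i < nodes.Count; i++) idOf[nodes[i]] = i;

        var dto = new GraphDto();
        foreach (var node in nodes)
        {
            var v = new VertexDto { Id = idOf[node], Label = node.Label };
            v.OutgoingEdges.AddRange(node.Outgoing.Select(t => idOf[t]));
            dto.Vertices.Add(v);
        }
        return dto;
    }
}
```

Only the DTO side is ever serialized, so any protobuf implementation on any platform can read the result.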
As a side note, in protobuf-net (C# / .NET) I've added support for this, which silently adds a layer of abstraction. Basically, the following works:
[ProtoContract]
class Vertex {
    ...
    [ProtoMember(3, AsReference = true)]
    public List<Vertex> OutgoingEdges {get;set;}
}

Common return type for all ANTLR visitor methods

I'm writing a parser for an old proprietary report specification with ANTLR and I'm currently trying to implement a visitor of the generated parse tree extending the autogenerated abstract visitor class.
I have little experience with both ANTLR (which I learned only recently) and the visitor pattern in general, but if I understood it correctly, the visitor should encapsulate one single operation on the whole data structure (in this case the parse tree), thus sharing the same return type between each Visit*() method.
Taking an example from The Definitive ANTLR 4 Reference book by Terence Parr, to visit a parse tree generated by a grammar that parses a sequence of arithmetic expressions, it feels natural to choose the int return type, as each node of the tree is actually part of the arithmetic operation that contributes to the final result of the calculator.
Considering my current situation, I don't have a common type: my grammar parses the whole document, which is actually split into different sections with different responsibilities (variable declarations, print options, actual text for the rows, etc.), and I can't find a common type for the results of visiting so many different nodes, besides object of course.
I tried to think of some possible solutions:
I first tried implementing a stateless visitor using object as the common type, but the amount of type casts needed sounds like a big red flag to me. I was considering the usage of JSON, but I think the problem remains, potentially adding some extra overhead in the serialization process.
I was also thinking about splitting the visitor into several smaller visitors, each with a specific purpose (get all the variables, get all the rows, etc.), but with this solution each visitor would implement only a small subset of the methods of the autogenerated interface (as it is meant to support the visit of the whole tree), because each visiting operation would probably focus only on a specific subtree. Is that normal?
Another possibility could be to redesign the data structure so that it could be used at every level of the tree or, better, define a generic specification of the nodes that can be used later to build the data structure. This solution sounds good, but I think it is difficult to apply in this domain.
A final option could be to switch to a stateful visitor, which encapsulates one or more builders for the different sections that each Visit*() method could use to build the data structure step by step. This solution seems clean and doable, but I have difficulty working out how to scope the result of each visit operation in the parent scope when needed.
What solution is generally used to visit complex ANTLR parse trees?
ANTLR4 parse trees are often complex because of recursion, e.g.
I would define the class ParsedDocumentModel, whose properties would be added or modified as your project evolves (which is normal, no program is set in stone).
Assuming your grammar is called Parser in the file Parser.g4, here is sample C# code:
public class ParsedDocumentModel {
    public string Title { get; set; }
    //other properties ...
}
public class ParserVisitor : ParserBaseVisitor<ParsedDocumentModel>
{
    public override ParsedDocumentModel VisitNounz(NounzContext context)
    {
        var res = "unknown";
        var s = context.GetText();
        if (s == "products")
            res = "<<products>>"; //for example
        var model = new ParsedDocumentModel();
        model.Title = res; //add more info...
        return model;
    }
}
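For comparison, the stateful-visitor option from the question could look roughly like the sketch below. It is deliberately independent of ANTLR: TreeNode, DocumentNode, VariableNode, RowNode, ReportModel and ReportBuildingVisitor are all hypothetical stand-ins, and a real implementation would override the generated Visit*() methods instead of switching on node type. The point it illustrates is that when each visit writes into a shared model, the Visit methods no longer need a meaningful common return type.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for parse-tree nodes; a real ANTLR tree would
// supply generated context classes and a generated base visitor instead.
public abstract class TreeNode
{
    public List<TreeNode> Children { get; } = new List<TreeNode>();
}
public sealed class DocumentNode : TreeNode { }
public sealed class VariableNode : TreeNode { public string Name { get; set; } }
public sealed class RowNode : TreeNode { public string Text { get; set; } }

// The model that the visitor fills in while walking the tree.
public sealed class ReportModel
{
    public List<string> Variables { get; } = new List<string>();
    public List<string> Rows { get; } = new List<string>();
}

// Stateful visitor: Visit returns nothing and writes into Model,
// so the document sections do not need a shared return type.
public sealed class ReportBuildingVisitor
{
    public ReportModel Model { get; } = new ReportModel();

    public void Visit(TreeNode node)
    {
        switch (node)
        {
            case VariableNode v:
                Model.Variables.Add(v.Name);
                break;
            case RowNode r:
                Model.Rows.Add(r.Text);
                break;
        }
        // Descend into children regardless of the node type.
        foreach (var child in node.Children)
            Visit(child);
    }
}
```

If a section needs its own scope, one possible extension is to keep a stack of section builders inside the visitor and push/pop around the recursive call.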

How to convert existing POCO classes in C# to google Protobuf standard POCO

I have POCO classes, and I use Newtonsoft.Json for serialization. Now I want to migrate to Google Protocol Buffers. Is there any way I can migrate all my classes (not manually) so that I can use Protocol Buffers for serialization and deserialization?
Do you just want it to work? The absolute simplest way to do this would be to use protobuf-net and add [ProtoContract(ImplicitFields = ImplicitFields.AllPublic)]. What this does is tell protobuf-net to make up the field numbers, which it does by taking all the public members, sorting them alphabetically, and just counting upwards. Then you can use your type with ProtoBuf.Serializer and it should behave in the way you expect.
This is simple, but it isn't very robust. If you add, remove or rename members it can all get out of sync. The problem here is that the protocol buffers format doesn't include names - just field numbers, and it is much harder to guarantee numbers over time. If your type is likely to change, you probably want to define field numbers explicitly. For example:
[ProtoContract]
public class Foo {
    [ProtoMember(1)]
    public int Id {get;set;}
    [ProtoMember(2)]
    public List<string> Names {get;} = new List<string>();
}
One other thing to watch out for would be non-zero default values. By default protobuf-net assumes certain things about implicit default values. If you are routinely using non-zero default values without doing it very carefully, protobuf-net may misunderstand you. You can turn that off globally if you desire:
RuntimeTypeModel.Default.UseImplicitZeroDefaults = false;
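To make the fragility of implicit numbering concrete, here is a small sketch that only imitates the rule described above (sort the public member names alphabetically, then count upwards). It is not protobuf-net's code; ImplicitNumbering.Assign is a hypothetical helper, and the ordinal string comparison is an assumption made for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ImplicitNumbering
{
    // Imitates "sort alphabetically and count upwards" for a set of member names.
    public static Dictionary<string, int> Assign(IEnumerable<string> memberNames)
    {
        return memberNames
            .OrderBy(name => name, StringComparer.Ordinal)
            .Select((name, index) => new { name, number = index + 1 })
            .ToDictionary(x => x.name, x => x.number);
    }
}

public static class Demo
{
    public static void Main()
    {
        var v1 = ImplicitNumbering.Assign(new[] { "Id", "Names" });
        var v2 = ImplicitNumbering.Assign(new[] { "Id", "Names", "Email" });

        // Adding one member moves the numbers of the existing ones,
        // so data written with v1 no longer lines up with v2.
        Console.WriteLine("v1: Id=" + v1["Id"] + ", Names=" + v1["Names"]);
        Console.WriteLine("v2: Id=" + v2["Id"] + ", Names=" + v2["Names"] + ", Email=" + v2["Email"]);
    }
}
```

With explicit [ProtoMember(n)] numbers, adding Email would simply take the next unused number and leave Id and Names where they were.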

Is protobuf-net suited for serializing arbitrary object/domain models?

I have been exploring the CQRS/DDD-principles and patterns for a while now and have started implementing a sample project where I have split my storage-model into a WriteModel and a ReadModel. The WriteModel will use a simple NoSQL-like database where aggregates are stored in a key-value style, with value being just a serialized version of the aggregate.
I am now looking at ProtoBuf-Net for serializing and deserializing my domain model aggregates in and out of storage. Other than this post I haven't found any guidance or tips for using ProtoBuf-Net in this area. The point is that the (ideal) requirement for serialization and deserialization of aggregates is that the domain model should have as little knowledge as possible about this infrastructural concern, which implies the following:
No attributes on the classes
No constructors, getters, setters or any other piece of code just for the sake of serialization.
Ability to use any (custom) type possible and have it serialized/deserialized.
Thus far I have implemented just the serialization of the first versions of my aggregates which works perfectly fine. I use the RuntimeTypeModel.Default-instance to configure the MetaModel at runtime and have UseConstructor = false everywhere, which enables me to completely separate the serialization mechanics from my domain-assembly. I have even implemented a custom post-deserialization mechanism that enables me to just-in-time initialize fields after ProtoBuf-Net has deserialized it into a valid instance. So suppose I have class AggregateA like so:
[Version(1)]
public sealed class AggregateA
{
    private readonly int _x;
    private readonly string _y;
    ...
}
Then in my serialization-library I have code something along the following lines:
var metaType = RuntimeTypeModel.Default.Add(typeof(AggregateA), false);
metaType.UseConstructor = false;
metaType.AddField(1, "_x");
metaType.AddField(2, "_y");
...
However, I realize that up to this point I have only implemented the basic scenario, and I am now starting to think about how to approach versioning of my model. I am particularly interested in larger refactoring scenarios, where type A has been split into types A1 and A2, for example:
[Version(2)]
public sealed class AggregateA1
{
    private readonly int _x;
    ...
}
[Version(2)]
public sealed class AggregateA2
{
    private readonly string _y;
    ...
}
Suppose I have a serialized bunch of instances of AggregateA, but now my domain model knows only AggregateA1 and AggregateA2, how would you handle this scenario with ProtoBuf-Net?
A second question deals with point 3: is ProtoBuf-Net capable of handling arbitrary types if you're willing to put in some extra configuration-effort? I've read about exceptions raised when using the DateTimeOffset-type, which makes me think not all types can be serialized by the framework out-of-the-box, but can I serialize these types by registering them in the RuntimeTypeModel? Should I even want to go there? Or better to forget about serializing common .NET types other than the simple ones?
protobuf-net is intended to work with predictable known models. It is true that everything can be configured at runtime, but I have not put any thought as to how to handle your A1/A2 scenario, precisely because that is not a supported scenario (in my defense, I can't see that working nicely with most serializers). Thinking off the top of my head, if you have the configuration/mapping data somewhere, then you could simply deserialize twice; i.e. as long as we still tell it that AggregateA1._x maps to 1 and AggregateA2._y maps to 2, you can do:
object a1 = model.Deserialize(source, null, typeof(AggregateA1));
source.Position = 0; // rewind
object a2 = model.Deserialize(source, null, typeof(AggregateA2));
However, more complex tweaks would require additional thought.
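The reason deserializing twice can work is that the protocol buffers wire format tags every value with its field number, and a reader skips fields it does not know. The sketch below illustrates this with a hand-rolled, deliberately tiny encoder/reader; TinyWire is a hypothetical helper, not part of protobuf-net. It handles only varint and length-delimited string fields and assumes non-negative integers.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TinyWire
{
    // Writes field 1 as a varint and field 2 as a length-delimited UTF-8 string.
    public static byte[] Encode(int x, string y)
    {
        var ms = new MemoryStream();
        WriteVarint(ms, (1UL << 3) | 0UL);   // tag: field 1, wire type 0 (varint)
        WriteVarint(ms, (ulong)x);           // assumes x >= 0
        byte[] text = Encoding.UTF8.GetBytes(y);
        WriteVarint(ms, (2UL << 3) | 2UL);   // tag: field 2, wire type 2 (length-delimited)
        WriteVarint(ms, (ulong)text.Length);
        ms.Write(text, 0, text.Length);
        return ms.ToArray();
    }

    // Returns only the fields listed in 'known'; all other fields are read past
    // and dropped, which is what lets two types consume the same payload.
    public static Dictionary<int, object> Read(byte[] data, params int[] known)
    {
        var result = new Dictionary<int, object>();
        int pos = 0;
        while (pos < data.Length)
        {
            ulong tag = ReadVarint(data, ref pos);
            int field = (int)(tag >> 3);
            int wireType = (int)(tag & 7);
            object value;
            if (wireType == 0)
            {
                value = ReadVarint(data, ref pos);
            }
            else if (wireType == 2)
            {
                int length = (int)ReadVarint(data, ref pos);
                value = Encoding.UTF8.GetString(data, pos, length);
                pos += length;
            }
            else
            {
                throw new NotSupportedException("wire type " + wireType);
            }
            if (Array.IndexOf(known, field) >= 0) result[field] = value;
        }
        return result;
    }

    static void WriteVarint(Stream s, ulong v)
    {
        while (v >= 0x80) { s.WriteByte((byte)((v & 0x7F) | 0x80)); v >>= 7; }
        s.WriteByte((byte)v);
    }

    static ulong ReadVarint(byte[] d, ref int pos)
    {
        ulong v = 0;
        int shift = 0;
        while (true)
        {
            byte b = d[pos++];
            v |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return v;
            shift += 7;
        }
    }
}
```

Reading the same bytes once with known = 1 and once with known = 2 plays the role of the two Deserialize calls above: the first pass sees only _x, the second only _y.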
Re "arbitrary types"... define "arbitrary" ;p In particular, there is support for "surrogate" types which can be useful for some transformations - but without a very specific "problem statement" it is hard to answer completely.
Summary:
protobuf-net has an intended usage, which includes both serialization-aware (attributed, etc) and non-aware scenarios (runtime configuration, etc) - but it also works for a range of more bespoke scenarios (letting you drop to the raw reader/writer API if you want to). It does not and cannot guarantee to be a direct fit for every serialization scenario imaginable, and how well it behaves will depend on how far from that scenario you are.

Reference Semantics in Google Protocol Buffers

I have a slightly peculiar program which deals with cases very similar to this
(in C#-like pseudo code):
class CDataSet
{
    int m_nID;
    string m_sTag;
    float m_fValue;
    void PrintData()
    {
        //Blah Blah
    }
};
class CDataItem
{
    int m_nID;
    string m_sTag;
    CDataSet m_refData;
    CDataSet m_refParent;
    void Print()
    {
        if(null == m_refData)
        {
            m_refParent.PrintData();
        }
        else
        {
            m_refData.PrintData();
        }
    }
};
Members m_refData and m_refParent are initialized to null and used as follows:
m_refData -> Used when a new data set is added
m_refParent -> Used to point to an existing data set.
A new data set is added only if the field m_nID doesn't match an existing one.
Currently this code is managing around 500 objects with around 21 fields per object and the format of choice as of now is XML, which at 100k+ lines and 5MB+ is very unwieldy.
I am planning to modify the whole shebang to use ProtoBuf, but currently I'm not sure how to handle the reference semantics. Any thoughts would be much appreciated.
Out of the box, protocol buffers does not have any reference semantics. You would need to cross-reference them manually, typically using an artificial key. Essentially, on the DTO layer you would add a key to CDataSet (one that you simply invent, perhaps just an increasing integer), store the key instead of the item in m_refData/m_refParent, and run the fixup manually during serialization/deserialization. You can also just store the index into the set of CDataSet, but that may make insertion etc. more difficult. Up to you; since this is serialization, you could argue that you won't insert (etc.) outside of initial population and hence the raw index is fine and reliable.
This is, however, a very common scenario - so as an implementation-specific feature I've added optional (opt-in) reference tracking to my implementation (protobuf-net), which essentially automates the above under the covers (so you don't need to change your objects or expose the key outside of the binary stream).
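For the manual route, a rough sketch of the index-based variant might look like this. The types are simplified, hypothetical stand-ins for the question's classes: DataSet and DataItem form the object model, DataItemDto is the serialized shape, the two reference fields are collapsed into one, and every item is assumed to reference a non-null data set.

```csharp
using System;
using System.Collections.Generic;

// Simplified, hypothetical stand-ins for CDataSet and CDataItem.
public class DataSet { public int Id; public string Tag; public float Value; }
public class DataItem { public int Id; public string Tag; public DataSet Data; }

// Hypothetical DTO: stores an index into the list of data sets
// instead of an object reference.
public class DataItemDto { public int Id; public string Tag; public int DataKey; }

public static class ReferenceFixup
{
    // Before serialization: each distinct DataSet is appended to 'sets' once,
    // and its position becomes the key stored in the DTO.
    public static List<DataItemDto> ToDtos(IEnumerable<DataItem> items, List<DataSet> sets)
    {
        var keyOf = new Dictionary<DataSet, int>();
        var dtos = new List<DataItemDto>();
        foreach (var item in items)
        {
            if (!keyOf.TryGetValue(item.Data, out int key))
            {
                key = sets.Count;
                keyOf[item.Data] = key;
                sets.Add(item.Data);
            }
            dtos.Add(new DataItemDto { Id = item.Id, Tag = item.Tag, DataKey = key });
        }
        return dtos;
    }

    // After deserialization: keys are turned back into shared references.
    public static List<DataItem> FromDtos(IEnumerable<DataItemDto> dtos, IReadOnlyList<DataSet> sets)
    {
        var items = new List<DataItem>();
        foreach (var dto in dtos)
            items.Add(new DataItem { Id = dto.Id, Tag = dto.Tag, Data = sets[dto.DataKey] });
        return items;
    }
}
```

The list of data sets and the list of DTOs are what would actually be written to the stream; two items that shared a data set before serialization share one again afterwards.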

An alternative way to use Azure Table Storage?

I'd like to use an entity like this for table storage:
public class MyEntity
{
    public String Text { get; private set; }
    public Int32 SomeValue { get; private set; }
    public MyEntity(String text, Int32 someValue)
    {
        Text = text;
        SomeValue = someValue;
    }
}
But it's not possible, because ATS needs:
A parameterless constructor
All properties to be public and read/write
To inherit from TableServiceEntity
The first two are things I don't want to do. Why should I want anybody to be able to change data that should be read-only, or to create objects of this kind in an inconsistent way (what are constructors for, then?), or even worse, to alter the PartitionKey or the RowKey? Why are we still constrained by these deserialization requirements?
I don't like developing software that way. How can I use the table storage library in a way that lets me serialize and deserialize the objects myself? I think that as long as the objects inherit from TableServiceEntity it shouldn't be a problem.
So far I have managed to save an object, but I don't know how to retrieve it:
Message m = new Message("message XXXXXXXXXXXXX");
CloudTableClient tableClient = account.CreateCloudTableClient();
tableClient.CreateTableIfNotExist("Messages");
TableServiceContext tcontext = new TableServiceContext(account.TableEndpoint.AbsoluteUri, account.Credentials);
var list = tableClient.ListTables().ToArray();
tcontext.AddObject("Messages", m);
tcontext.SaveChanges();
Is there any way to avoid those deserialization requirements or get the raw object?
Cheers.
If you want to use the Storage Client Library, then yes, there are restrictions on what you can and can't do with your objects that you want to store. Point 1 is correct. I'd expand point 2 to say "All properties that you want to store must be public and read/write" (for integer properties you can get away with having read only properties and it won't try to save them) but you don't actually have to inherit from TableServiceEntity.
TableServiceEntity is just a very light class that has the properties PartitionKey, RowKey, Timestamp and is decorated with the DataServiceKey attribute (take a look with Reflector). All of these things you can do to a class that you create yourself and doesn't inherit from TableServiceEntity (note that the casing of these properties is important).
If this still doesn't give you enough control over how you build your classes, you can always ignore the Storage Client Library and just use the REST API directly. This will give you the ability to serialize and deserialize the XML any way you like. You will lose all of the nice things that come with using the library, like the ability to create queries in LINQ.
The constraints around that ADO.NET wrapper for the Table Storage are indeed somewhat painful. You can also adopt a Fat Entity approach as implemented in Lokad.Cloud. This will give you much more flexibility concerning the serialization of your entities.
Just don't use inheritance.
If you want to use your own POCOs, create your class as you want it and create a separate table entity wrapper/container class that holds the PartitionKey and RowKey and carries your class as a serialized byte array.
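A rough sketch of that wrapper idea, under stated assumptions: EntityEnvelope and MyEntityMapper are hypothetical names, the envelope is a plain class rather than a real table entity type, and BinaryWriter/BinaryReader stand in for whichever serializer you prefer.

```csharp
using System;
using System.IO;

// The domain class stays immutable and has no parameterless constructor.
public sealed class MyEntity
{
    public string Text { get; }
    public int SomeValue { get; }

    public MyEntity(string text, int someValue)
    {
        Text = text;
        SomeValue = someValue;
    }
}

// Hypothetical storage-facing wrapper: this is the only type that needs
// public read/write properties and a parameterless constructor.
public class EntityEnvelope
{
    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public byte[] Payload { get; set; }
}

public static class MyEntityMapper
{
    // Serializes the entity into the envelope's byte array.
    public static EntityEnvelope Wrap(MyEntity entity, string partitionKey, string rowKey)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(entity.Text);
            writer.Write(entity.SomeValue);
            writer.Flush();
            return new EntityEnvelope
            {
                PartitionKey = partitionKey,
                RowKey = rowKey,
                Payload = stream.ToArray()
            };
        }
    }

    // Rebuilds the entity through its own constructor, in the order written.
    public static MyEntity Unwrap(EntityEnvelope envelope)
    {
        using (var reader = new BinaryReader(new MemoryStream(envelope.Payload)))
        {
            string text = reader.ReadString();
            int someValue = reader.ReadInt32();
            return new MyEntity(text, someValue);
        }
    }
}
```

The trade-off is that the payload is opaque to the table service, so you cannot filter on Text or SomeValue in a query; only the keys remain queryable.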
You can use composition to achieve what you want.
Create your Table Entities as you need to for storage and create your POCOs as wrappers on those providing the API you want the rest of your application code to see.
You can even mix in some interfaces for better code.
How about generating the POCO wrappers at runtime using System.Reflection.Emit? http://blog.kloud.com.au/2012/09/30/a-better-dynamic-tableserviceentity/