Difference between using Flink addDefaultKryoSerializer and registerTypeWithKryoSerializer - serialization

A few questions regarding serialization in Flink. In order to ensure that Flink state evolves safely and in a backwards-compatible way, I decided to represent all non-primitive state types as Protobuf messages.
Question 1: are there any downsides to representing state with Protobufs? It looks like POJO serialization might be faster / more efficient, but that performance hit is okay for my use case. The tradeoff leans towards safety, since there's better tooling available to ensure that only compatible changes are made to Protobuf message schemas.
Question 2: is it sufficient to just use the interface type (e.g. Message.class) when specifying serializers? Or should I be adding/registering each concrete message type? Just the interface seems to work, but I'm not sure if there are gotchas here that I should consider.
Question 3: does it matter how I specify ProtobufSerializer with Flink, addDefaultKryoSerializer or registerTypeWithKryoSerializer? I understand that this mostly impacts Kryo's reference IDs, but I also understand that the IDs are re-calculated within an object graph. So do the IDs really matter in that case? Are there any situations where the same object graph would see classes in different orders?
Code sample of the various options described above:
javaEnv.addDefaultKryoSerializer(Message.class, ProtobufSerializer.class);        // Option 1: default serializer for Message and its subtypes
javaEnv.registerTypeWithKryoSerializer(Message.class, ProtobufSerializer.class);  // Option 2: registers the interface type itself
javaEnv.addDefaultKryoSerializer(FooPb.class, ProtobufSerializer.class);          // Option 3: default serializer for one concrete type
javaEnv.registerTypeWithKryoSerializer(FooPb.class, ProtobufSerializer.class);    // Option 4: explicit registration of one concrete type
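For reference, a sketch of how the two calls are often combined (BarPb is a hypothetical second generated type; the comments describe Kryo's documented behaviour, which Flink delegates to, so treat them as an assumption about your particular Flink version):

// A default serializer applies to Message and to every subtype that is not
// registered otherwise, so this single call covers all generated classes.
javaEnv.addDefaultKryoSerializer(Message.class, ProtobufSerializer.class);
// An explicit registration binds the exact class and assigns it a Kryo
// registration ID derived from registration order, so keep this list stable
// (append-only) across job versions.
javaEnv.registerTypeWithKryoSerializer(FooPb.class, ProtobufSerializer.class);
javaEnv.registerTypeWithKryoSerializer(BarPb.class, ProtobufSerializer.class); // BarPb is hypothetical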

Related

Functional data-structures, OO notions of dispatched equality and comparison, StructuralEquality, and referential transparency

I have a very CPU intensive F# program that depends on persistent data-structures - about 40% of the total CPU time is spent in the Map module. So I thought I'd try out the PersistentHashMap in FSharpX collections. (BTW, this is already a big improvement over the previous version of F# in VS2013 where the same program spent 70% of its time in Map. I also notice that running programs with the debugger attached doesn't have the huge penalty it did before - good work guys...) There is also a hot-spot where I'm re-sorting all the time, where instead I should be adding to a Heap, so I thought I'd give that a go as well.
Two issues became immediately apparent:
(1) Swapping out one for the other from an interface perspective proved harder than it seems it should be - i.e., making a shim that let me switch from a Map to a PersistentMap, preserving both the needed module-based let-bound functions and the types necessary to use each map. I know that having full HM type inference (and no type classes) is orthogonal to LSP-style referential transparency for the most part - but maybe I was missing some way to do this better with a minimal amount of code.
(2) The biggest problem (which I'd like to focus on here) is the reliance of the F# functional data structures on OO-style dispatched equality and comparison via the IComparable (when 't : comparison), etc., family of interfaces.
Even for OO programs, ISTM that the idea of dispatching equality and comparison is a bad one -- an object "knows" how to perform its own domain-specific tasks, but it doesn't "know", for the most part, what notion of equality is going to be necessary at various points in the program for various purposes -- so equality/comparison should not be part of the object's interface; when these concepts are needed, they should always be mentioned explicitly. For example, there should never be a .Sort(), only a .SortWith(...). One could argue that even something as basic as structural equality in F# should be explicit: a.StructEq(b) or a ~= b - otherwise you always get object.Equals. But even stipulating that doing things this way is best for a multi-paradigm language that's a first-class .NET citizen, it seems like there should at least be the option of using passed-in comparison and equality functions, but this is not the case.
This means that: (a) type constraints are enforced even if you don't want them, causing ripples of broken inferred typing (and hundreds of wavy red lines with it being unclear where the actual "problem" is) and (b), that by implementing a notion of equality or comparison that makes one container type happy in one part of your program (and in my case I want to use the same container and item type with two different notions of ordering in two different places), it is likely to silently break (or cause inefficiency, if one subsumes the other) in other parts of the code that depended on the default/previous implementation.
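For what it's worth, the passed-in style being asked for here is what Java's Comparator gives you; a minimal sketch (Order is a hypothetical item type) showing one type ordered two different ways at two different call sites, with nothing dispatched from the object itself:

import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical item type; note it implements no comparison interface at all.
record Order(int id, double amount) {}

public class ComparatorDemo {
    public static void main(String[] args) {
        // The ordering is passed in at the call site, not dispatched from
        // Order, so the same item type can be ordered differently in
        // different places without wrapper objects.
        TreeSet<Order> byId = new TreeSet<>(Comparator.comparingInt(Order::id));
        TreeSet<Order> byAmount = new TreeSet<>(Comparator.comparingDouble(Order::amount));

        Order a = new Order(1, 99.0);
        Order b = new Order(2, 5.0);
        byId.add(a);
        byId.add(b);      // iterates a, b (ordered by id)
        byAmount.add(a);
        byAmount.add(b);  // iterates b, a (ordered by amount)
    }
}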
The only way around this that I could think of is wrapping each item in an adapter object using a new ... with object expression - but I really don't want to create so much garbage just to get the code to work.
So, ISTM that we could have a "pure" version of each persistent data structure that could be loaded if desired (even basics like List, etc.) that does not depend on dispatched equality/comparison/hashing and does not impose type constraints - all such needs would be met via functions passed in at the time of the call. (Dispatched eq/cmp would be used only for interop with BCL collections that don't accept delegates.) Then we could have a [<EqCmpHashThrowNotImplemented>] attribute, and I could be sure that there were no default operations happening at all, and I would feel better about the efficiency and predictability of my code. (This would also let one change from a Record to a Class or vice versa without worrying about any changes in behavior due to default implementations.) Again, this would be optional, but enabled with a simple import. (Which does mean that each base core collection type would have to be broken out into its own module, which isn't really a bad idea anyway.)
If I've overlooked a better way to do things or there are some patterns people are using here, I'd be interested.

Implementing similar UseCases looks like code duplication

I have the following case. A user can export several object types (transaction, invoice, etc.) to an external accounting system.
The export algorithm has these steps:
fetch objects by some filter
export objects one by one to the accounting system (web service method per object type)
register the fact that the given document was exported, so it won't be exported again
prepare a summary for the user (number of exported documents, error messages, etc.)
The algorithm is the same for all object types but there are some important differences which must be handled:
different types
different target web service method, different object to DTO mappings
different filters per object type
I've considered a few solutions:
don't treat the export algorithm as code duplication and implement an algorithm per object type. Export of any data to any external system may be described by such an algorithm - does that mean we should always have one general class to export anything to anywhere? :)
move the differences to strategies (one strategy interface to create an abstraction for all the differences) - I even implemented it (see the sketch after this list).
use generics - unfortunately I'm coding in PHP and it's not possible
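For illustration, a minimal sketch of the strategy option in Java (the idea is language-neutral, and all names here are hypothetical): the shared algorithm lives in one place, and the per-type differences hide behind one interface.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// One strategy interface abstracts all the per-type differences.
interface ExportStrategy<T> {
    List<T> fetchPending();                   // per-type filter
    void exportOne(T item) throws Exception;  // per-type endpoint and DTO mapping
    String idOf(T item);                      // per-type identity for the "exported" registry
}

final class ExportSummary {
    int exported;
    final List<String> errors = new ArrayList<>();
}

final class Exporter {
    private final Set<String> exportedIds = new HashSet<>(); // stand-in for a real registry

    <T> ExportSummary run(ExportStrategy<T> strategy) {
        ExportSummary summary = new ExportSummary();
        for (T item : strategy.fetchPending()) {
            if (exportedIds.contains(strategy.idOf(item))) continue; // already exported
            try {
                strategy.exportOne(item);
                exportedIds.add(strategy.idOf(item)); // register so it won't be exported again
                summary.exported++;
            } catch (Exception e) {
                summary.errors.add(strategy.idOf(item) + ": " + e.getMessage());
            }
        }
        return summary;
    }
}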
The question:
Is creating a separate export algorithm per object type a code duplication?
Maybe all of them should be treated as separate Use Cases?
If it's duplication, what techniques should I consider to avoid it?
Description of my first implementation:
In the first approach I defined an Exportable abstraction, but I was not happy with it. Each object has a completely different payload.
The Exportable interface defined only one method, getId, which was used to register that an object had been exported (and, thanks to that, wouldn't be exported again).
For this purpose the abstraction was fine, but the problem was merely moved to the exportService, which had to check the concrete instance to choose the DTO mapper and endpoint. So the exportService broke SOLID.
None of the things you have described above are domain-specific logic (and in fact you don't even mention the problem domain in your question), so I don't think it really falls under domain-driven design. Because it's not domain-specific logic I wouldn't worry too much about code duplication, especially considering that the solution doesn't seem obvious.
Keep it simple and just write out each use case separately. If you find that there's common code that's easily refactored, do so after you get everything working smoothly. Don't overthink it or add patterns before they are obviously necessary.

Deciding extent of coupling

I have a component which exposes an API with some 10 functions in all. I can think of two ways to achieve this:
Expose all of this functionality as separate functions.
Expose only one function which takes an XML document as input. Based on the request_Type specified and the parameters passed in the XML, I internally call one of the respective functions.
Q1. Will the second design be more loosely coupled than the first ?
I always read about how I should try to make my components loosely coupled; should I really go to this extent to achieve loose coupling?
Q2. Which one of these would be a better design in terms of OOP and why?
Edit:
If I am exposing this API over D-Bus for others to use, will type checking still be a consideration when comparing the two approaches? From what I understand, type checking is done at compile time, but when the function is exposed over some IPC mechanism, does the issue of type checking still come into the picture?
The two alternatives you propose do not differ in the (obviously quite large) number of "functions" you want to offer from your API. However, the second seems to have many disadvantages, because you are losing all strong type checking, it will become much harder to document the functionality, etc. (The only advantage I see is that you don't need to change your API when you add functionality. But there is the disadvantage that users will not be able to discover API changes like deleted functions until run time.)
What is more relevant to this question is the Single Responsibility Principle (http://en.wikipedia.org/wiki/Single_responsibility_principle). As you are talking about OOP, you should not expose your tens of functions within one class but split them among different classes, each with a single responsibility. Defining good "responsibilities" and roles requires some practice, but following some basic guidelines will help you get started quickly. See Are there any rules for OOP? for a good starting point.
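To make the contrast concrete, a Java-flavoured sketch with hypothetical names: the split version gives callers type-checked, self-documenting operations, while the single entry point pushes every mistake to run time.

// Split by responsibility: small, type-checked, discoverable interfaces.
interface PlaybackControl {
    void play(String trackId);
    void stop();
}

interface PlaylistManagement {
    void addTrack(String playlistId, String trackId);
}

// The single-entry-point alternative: every call has the same shape, and a
// bad request_Type or missing parameter only surfaces when the XML is parsed.
interface XmlEndpoint {
    String handle(String requestXml);
}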
Reply to the question edit
I haven't used D-Bus, so this might be totally wrong. But from a quick look at the tutorial I read
Each object supports one or more interfaces. Think of an interface as a named group of methods and signals, just as it is in GLib or Qt or Java. Interfaces define the type of an object instance.
DBus identifies interfaces with a simple namespaced string, something like org.freedesktop.Introspectable. Most bindings will map these interface names directly to the appropriate programming language construct, for example to Java interfaces or C++ pure virtual classes.
As far as I understand, D-Bus has the concept of different objects which provide interfaces consisting of several methods. This means (to me) that my answer above still applies. The "D-Bus native" way of specifying your API would be to exhibit interfaces, and I don't see any reason why good OOP design guidelines shouldn't be valid here. As D-Bus seems to map these even to native language constructs, this is even more likely.
Of course, nobody keeps you from building your own API description language in XML. However, things like that are some kind of abuse of the underlying techniques. You should have good reasons for doing such things.

Is "serialisation without duplication" possible in c++0x?

One of the big uses of code generation in c++ is to support message serialisation. Typically, you want to support specifying message contents and layout in the same step and produce code for that message type that can give you objects capable of being serialised to/from communication streams. In the past, this has usually resulted in code that looks like:
class MyMessage : public SerialisableObject
{
    // message members
    int myNumber_;
    std::string myString_;
    std::vector<MyOtherSerialisableObject> aBunchOfThingsIWantToSerialise_;
public:
    // ctor, dtor, accessors, mutators, then:
    virtual void Serialise(SerialisationStream & stream)
    {
        stream & myNumber_;
        stream & myString_;
        stream & aBunchOfThingsIWantToSerialise_;
    }
};
The problem with using this kind of design is that it violates an important rule of good architecture: you should not have to specify the intent of a design twice. Duplication of intent, like duplicated code and other common development duplication, leaves room for one place in the code to diverge from the other, causing errors.
In the above, the duplication is the list of members. Potential errors include adding a member to the class but forgetting to add it to the serialisation list, serialising a member twice (possibly by not using the same order as the member declaration, or possibly due to a misspelling of a similar member, among other ways), or serialising something that is not a member (which might produce a compiler error, unless name lookup finds something at a different scope than the object that matches lookup rules). That kind of mistake is the same reason we no longer try to match every heap allocation with a delete (instead using smart pointers) or every file open with a close (instead using RAII ctor/dtor mechanisms) - we don't want to have to match up our intent in multiple places, because there are times when we - or another engineer less familiar with the intent - make mistakes.
Generally, therefore, this has been one of the things that code generation could take care of. You might create a file MyMessage.cg to specify both layout and members in one step
serialisable MyMessage
{
    int myNumber_;
    std::string myString_;
    std::vector<MyOtherSerialisableObject> aBunchOfThingsIWantToSerialise_;
};
that would be run through a code generation utility and produce the code.
I was wondering if it is possible yet to do this in c++0x without external code generation. Are there any new language mechanisms that make it possible to specify a class as serialisable once, such that the names and layout of its members are used to lay out the message during serialisation?
To be clear, I know that there are tricks with boost tuples and fusion that can come close to this kind of behavior even in the pre-c++0x language. Those usages, though, being based on indexing into the tuple rather than by-member-name access, have all been brittle to changing the layout, as other places in the code that access the messages would then also need to be reordered. Some kind of by-member-name access is necessary to not have to duplicate the layout specification in places in the code that use the messages.
Also, I know it might be nice to take this up to the next level and ask for specifying when some of the members shouldn't be serialised. Other languages that offer serialisation built in often offer some kind of attribute to do this, so
int myNonSerialisedNumber_ [[noserialise]];
might seem natural. However, I personally think it is bad design to have serialisable objects where not everything is serialised, since the lifetime of messages is in the transport to/from the communications layer, separate from other data lifetimes. Also, you could have an object which holds a purely serialisable object as one of its members, so such functionality doesn't buy anything the language doesn't already offer.
Is this possible? Or did the standards committee leave out this kind of introspective capability? I don't need it to look like the code-gen file above - any simple method for compile-time specification of layout and members in a single step would solve this common problem.
This is both possible and practical in C++11 – in fact it was possible back in C++03, the syntax was just a little too unwieldy. I wrote a small library based around the same idea - see the following:
www.github.com/molw5/framework
Sample syntax:
class Object : serializable <Object,
    value <NAME("Field 1"), int>,
    value <NAME("Field 2"), float>,
    value <NAME("Field 3"), double>>
{
};
Most of the underlying code could be reproduced, in principle, in C++03 - some of the implementation details without variadic templates would have been... tricky, but I believe it would have been possible to recover the core functionality. What you could not reproduce in C++03 is the NAME macro above, and the syntax relies fairly heavily on it. The macro provides the machinery necessary to generate a unique typename from a string; that is, the following:
NAME("Field 1")
expands to
type_string <'F', 'i', 'e', 'l', 'd', ' ', '1'>
through the use of some common macros and constexpr (for character extraction). Back in C++03 something similar to the type_string above would need to be entered manually.
C++, of any form, supports neither introspection nor reflection (to the extent that they are different).
One nice thing about doing serialization manually (ie: without introspection or reflection) is that you can provide object versioning. You can support older forms of the serialization, and simply create reasonable defaults for the data that wasn't in the old versions. Or if a new version removes some data, you can simply serialize and discard it.
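That versioning idea is language-agnostic; here is a minimal hand-rolled sketch in Java for brevity (the names and the wire format are hypothetical): a version number is written first, and readers supply defaults for fields that older versions didn't carry.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class Profile {
    int id;
    String nickname; // added in format version 2

    void serialise(DataOutputStream out) throws IOException {
        out.writeInt(2); // current format version, written first
        out.writeInt(id);
        out.writeUTF(nickname);
    }

    static Profile deserialise(DataInputStream in) throws IOException {
        int version = in.readInt();
        Profile p = new Profile();
        p.id = in.readInt();
        // Reasonable default for data that wasn't in the old version.
        p.nickname = (version >= 2) ? in.readUTF() : "";
        return p;
    }
}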
It seems to me that what you need is Boost.Serialization.

Object persistence terminology: 'repository' vs. 'store' vs. 'context' vs. 'retriever' vs. (...)

I'm not sure how to name data store classes when designing a program's data access layer (DAL).
(By data store class, I mean a class that is responsible to read a persisted object into memory, or to persist an in-memory object.)
It seems reasonable to name a data store class according to two things:
what kinds of objects it handles;
whether it loads and/or persists such objects.
⇒ A class that loads Banana objects might be called e.g. BananaSource.
I don't know how to go about the second point (i.e. the Source bit in the example). I've seen different nouns apparently used for just that purpose:
repository: this sounds very general. Does this denote something read-/write-accessible?
store: this sounds like something that potentially allows write access.
context: sounds very abstract. I've seen this with LINQ and object-relational mappers (ORMs).
P.S. (several months later): This is probably appropriate for containers that contain "active" or otherwise supervised objects (the Unit of Work pattern comes to mind).
retriever: sounds like something read-only.
source & sink: probably not appropriate for object persistence; a better fit with data streams?
reader / writer: quite clear in its intention, but sounds too technical to me.
Are these names arbitrary, or are there widely accepted meanings / semantic differences behind each? More specifically, I wonder:
What names would be appropriate for read-only data stores?
What names would be appropriate for write-only data stores?
What names would be appropriate for mostly read-only data stores that are occasionally updated?
What names would be appropriate for mostly write-only data stores that are occasionally read?
Does one name fit all scenarios equally well?
As no one has yet answered the question, I'll post what I have decided in the meantime.
Just for the record, I have pretty much decided on calling most data store classes repositories. First, it appears to be the most neutral, non-technical term from the list I suggested, and it seems to be well in line with the Repository pattern.
Generally, "repository" seems to fit well where data retrieval/persistence interfaces are something similar to the following:
public interface IRepository<TResource, TId>
{
    int Count { get; }
    TResource GetById(TId id);
    IEnumerable<TResource> GetManyBySomeCriteria(...);
    TId Add(TResource resource);
    void Remove(TId id);
    void Remove(TResource resource);
    ...
}
Another term I have decided on using is provider, which I'll be preferring over "repository" whenever objects are generated on-the-fly instead of being retrieved from a persistence store, or when access to a persistence store happens in a purely read-only manner. (Factory would also be appropriate, but sounds more technical, and I have decided against technical terms for most uses.)
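A "provider" in that sense might look like the following Java-flavoured sketch (the names are hypothetical): read-only by design, and deliberately silent about whether the value is fetched from a store or computed on the fly.

// No Add/Remove members: a provider only hands objects out.
interface ExchangeRateProvider {
    /** May read from a persistence store, a cache, or compute the rate live. */
    double rateFor(String currencyPair);
}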
P.S.: Some time has gone by since writing this answer, and I've had several opportunities at work to review someone else's code. One term I've thus added to my vocabulary is Service, which I am reserving for SOA scenarios: I might publish a FooService that is backed by a private Foo repository or provider. The "service" is basically just a thin public-facing layer above these that takes care of things like authentication, authorization, or aggregating / batching DTOs for proper "chunkiness" of service responses.
Well, to add something to your conclusion:
A repository is meant to care about only one entity and follows certain patterns, like the one you showed.
A store is allowed to do a bit more, also working with other entities.
A reader/writer is separated to semantically expose, and allow injecting, only the reading or only the writing functionality into other classes. It comes from the CQRS pattern (see the sketch after this list).
A context is more or less bound to an ORM mapper, as you mentioned, and is usually used under the hood of a repository or store; some use it directly instead of building a repository on top, but it's harder to abstract.
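A minimal Java sketch of that reader/writer split (hypothetical names, reusing the Banana example from the question): consumers declare which side they need, so a report class can be handed the reader alone.

record Banana(long id, String name) {}

interface BananaReader {
    Banana byId(long id);
}

interface BananaWriter {
    void save(Banana banana);
}

// A reporting class needs only read access, and its constructor says so;
// it cannot accidentally write, and tests can stub just the reader.
final class BananaReport {
    private final BananaReader reader;

    BananaReport(BananaReader reader) {
        this.reader = reader;
    }

    String describe(long id) {
        return reader.byId(id).name();
    }
}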