How to store complex objects into hadoop Hbase? - serialization

I have complex objects with collection fields which needed to be stored to Hadoop. I don't want to go through whole object tree and explicitly store each field. So I just think about serialization of complex fields and store it as one big piece. And than desirialize it when reading object. So what is the best way to do it? I though about using some kind serilization for that but I hope that Hadoop has means to handle this situation.
Sample object's class to store:
class ComplexClass {
<simple fields>
List<AnotherComplexClassWithCollectionFields> collection;
}

HBase only deals with byte arrays, so you can serialize your object in any way you see fit.
The standard Hadoop way of serializing objects is to implement the org.apache.hadoop.io.Writable interface. Then you can serialize your object into a byte array using org.apache.hadoop.io.WritableUtils.toByteArray(Writable ... writable).
Also, there are other serialization frameworks that people in the Hadoop community use, like Avro, Protocol Buffers, and Thrift. All have their specific use cases, so do your research. If you're doing something simple, implementing Hadoop's Writable should be good enough.

Related

Decoding and encoding strings for kotlinx.serialization.properties

I'm currently struggling with the experimental KXS-properties serialization backend, mainly because of two reasons:
I can't find any documentation for it (I think there is none)
KXS-properties only includes a serializer / deserializer, but no encoder / decoder
The endpoint provided by the framework is essentially Map<String, Any>, but the map is flat and the keys already have the usual dot-separated properties syntax. So the step that I have to take is to encode the map to a single string that is printable to a .properties file AND decode a single string from a .properties file into the map. I'm generally following the Properties Format Spec from https://docs.oracle.com/javase/10/docs/api/java/util/Properties.html#load(java.io.Reader), it's not as easy as one might think.
The problem is that I can't use java.util.Properties right away because KXS is multiplatform and it would kinda kill the purpose of it when I'd restrict it to JVM because I use java.util.Properties. If I were to use it, the solution would be pretty simple, like this: https://gist.github.com/RaphaelTarita/748e02c06574b20c25ab96c87235096d
So I'm trying to implement my own encoder / decoder, following the rough structure of kotlinx.serialization.json.Json.kt. Although it's pretty tedious, it went well so far, but now I've stumbled upon a new problem:
As far as I know (I am not sure because there is no documentation), the map only contains primitives (or primitive-equivalents, as Kotlin does not really have primitives). I suspect this because when you write your own KSerializers for the KXS frontend, you can specify to encode to any primitive by invoking the encodeXXX() functions of the Encoder interface. Now the problem is: When I try to decode to the map that should contain primitives, how do I even know which primitives are expected by the model class?
I've once written my own serializer / deserializer in Java to learn about the topic, but in that implementation, the backend was a lot more tightly coupled to the frontend, so that I could query the expected primitive type from the model class in the backend. But in my situation, I don't have access to the model class and I have no clue how to retrieve the expected types.
As you can see, I've tried multiple approaches, but none of them worked right away. If you can help me to get any of these to work, that would be very much appreciated
Thank you!
The way it works in kotlinx.serialization is that there are serializers that describe classes and structures etc. as well as code that writes/read properties as well as the struct. It is then the job of the format to map those operations to/from a data format.
The intended purpose of kotlinx.serialization.Properties is to support serializing a Kotlin class to/from a java.util.Properties like structure. It is fairly simple in setup in that every nested property is serialized by prepending the property name to the name (the dotted properties syntax).
Unfortunately it is indeed the case that this deserializing from this format requires knowing the types expected. It doesn't just read from string. However, it is possible to determine the structure. You can use the descriptor property of the serializer to introspect the expectations.
From my perspective this format is a bit more simple than it should be. It is a good example of a custom format though. A key distinction between formats is whether they are intended to just provide a storage format, or whether the output is intended (be able to) to represent a well designed api. The latter ones need to be more complex.

Hazelcast : NotSerializableException

I am using a map with Hazelcast :
//When I do :
map.put(gen.newId(), myObject);
myObject is a very complex object and does not implements Serializable.
I thought that putting the config like below was enough for not having to implement serializable :
<map name="myMap">
<in-memory-format>OBJECT</in-memory-format>
</map>
The Hazelcast doc says :
http://docs.hazelcast.org/docs/3.5/manual/html/entryprocessor.html
"When it is stored as an object (OBJECT format), then the entry processor is applied directly on the object. In that case, no serialization or deserialization is performed"
Thanks for any suggestion.
Unfortunately the object will always be deserialized when the map.put is called, no matter the in memory format. This is because normally there are backups and they need to receive a copy as well. So in this case your only way out is to make your object 'serializable'. You can use Java serialization, but you can also rely on something like Kryo to deal with complex object graphs.
I think you could also use more efficient hazelcast specific solutions. Here is a comparison table of the solutions. Portable has been working for me but is a pain in the ass to implement and maintain for big objects. DataSerializable is a more easy solution and looks a lot like Parcelable from Android.

Write/read class objects to/from file, D-Lang

I'm trying to write/read a class object from/to a file.
I'm new to D and I just want to play a little bit around with it.
Is there a Class/Function to write/read an object to/from a file?
I'm looking for something similar to the ObjectOutputStream сlass in Java.
Or do I have to serialize (concatenate) the object's variables as strings in the file?
I have a Movie class and a MovieManager class, which contains a dynamic movie-array.
A Movie object contains just a few strings and integer values.
Extending answer, provided in comment, it is worth explicitly stating, that D does not provide "one true way" of reading/writing objects to/from files, as there can't be a single optimal one. Different considerations about speed, resulting file format, handling references and similar corner cases may results in different serialization strategies.
That being said, most likely proper serialization library is needed, and, by lucky chance, one of most mature D solutions ("Orange" by Jacob Carlborg https://github.com/jacob-carlborg/orange) is being reviewed right now as a candidate for inclusion into standard library as a std.serialization: newsgroup thread. It may be your best bet.
The library Unmanaged provides a serialization system. You also have Orange
which is less restrictive, as Unmanaged serialization only works if the object to serialize is an ancestor of one of the framework base class.But...Unmanaged works on the "accessor" principle. The data serialized are get via a method and the data deserialized are set via a method, which allows to update some stuffs when the deserializer recall for example...

Why do we use serialization?

Why do we need to use serialization?
If we want to send an object or piece of data through a network we can use streams of bytes. If we want to save some data to the disk, we can again use the binary mode along with the byte streams and save it.
So what's the advantage of using serialization?
Technically on the low-level, your serialized object will also end up as a stream of bytes on your cable or your filesystem...
So you can also think of it as a standardized and already available way of converting your objects to a stream of bytes. Storing/transferring object is a very common requirement, and it has less or little meaning to reinvent this wheel in every application.
As other have mentioned, you also know that this object->stream_of_bytes implementation is quite robust, tested, and generally architecture-independent.
This does not mean it is the only acceptable way to save or transfer an object: in some cases, you'll have to implement your own methods, for example to avoid saving unnecessary/private members (for example for security or performance reasons). But if you are in a simple case, you can make your life easier by using the serialization/deserialization of your framework, language or VM instead of having to implement it by yourself.
Hope this helps.
Quoting from Designing Data Intensive Applications book:
Programs usually work with data in (at least) two different
representations:
In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for
efficient access and manipulation by the CPU (typically using
pointers).
When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes
(for example, a JSON document). Since a pointer wouldn’t make sense to
any other process, this sequence-of-bytes representation looks quite
different from the data structures that are normally used in memory.
Thus, we need some kind of translation between the two
representations. The translation from the in-memory representation to
a byte sequence is called encoding (also known as serialization or
marshalling), and the reverse is called decoding (parsing,
deserialization, unmarshalling).
Among other reasons to be compatible between architecture. An integer doesn't have the same number of bytes from one architecture to another, and sometimes from one compiler to another.
Plus what you're talking about is still serialization. Binary Serialization. You're putting all the bytes of your object together in order to store them and be able to reconvert them as an object later. This is serializing.
More info on wikipedia
Serialization is the process of converting an object into a stream so that it can be saved in any physical file like (XML) or can be saved in Database. The main purpose of Serialization in C# is to persist an object and save it in any specified storage medium like stream, physical file or DataBase.
In General, serialization is a method to persist an object's state, but i suggest you to read this wiki page, it is pretty detailed and correct in my opinion:
http://en.wikipedia.org/wiki/Serialization
In serialization, the point is not turning an object into bits and bytes, objects ARE bits and bytes already. Serialization is the process of making the object's "state" persistent. Notice the word "state", which means the values of the instance variables of the entire object graph (the target object and all the objects it references either directly or indirectly) WITHOUT methods and other extra runtime stuff stuck to them (and of course plus a little more info that JVM needs for restoration of these objects, such as their class types).
So this is the main reason of its necessity: Storing the whole bytes of objects would be expensive, and for all intents and purposes, unnecessary.

Serialization vs. Archiving?

The iOS docs differentiate between "serializing" and "archiving." Is this a general distinction (i.e., holds in other languages) or is it specific to Objective-C? Also, what is the difference between these two?
This is a case of one being the other some (but not all) of the time.
Wikipedia has this to say about serialization:
"Serialization is the process of converting a data structure or object into a sequence of bits so that it can be stored in a file or memory buffer, or transmitted across a network connection link to be "resurrected" later in the same or another computer environment"
So, archiving may only be serialization, but it could also be the combination of serialization and compresssion, for example. Or perhaps it adds some kind of header info. So serialization is a form of archive, but an archive is not necessarily a serialization.
This isn't really specific to iOS - these terms are thrown around all over. Their specific meaning in the context of iOS could be quite specific, though.
I was actually trying to look for their difference from IOS perspective. Adding the following for people interested :
Purpose:
Archiving is used to store object graphs. complete data model can be archived and restored easily. The way Nib files work can be considered as example for archiving.
Serialization is used for storing arbitrary hierarchy of objects.
The wat plist files work can be considered as example fo serializations.
Differences(excerpts from Archives programing guide):
"The archive preserves the identity of every object in the graph and all the relationships it has with all the other objects in the graph."
Every object encoded within the context of rootObject invocation is tracked. If the coder is asked to encode an object more than once, the coder encodes a reference to the first encoding instead of encoding the object again.
"The serialization only preserves the values of the objects and their position in the hierarchy. Multiple references to the same value object might result in multiple objects when deserialized. The mutability of the objects is not maintained."
Implementation differences:
Any object that implements NSCoding protocol can be archived where as Only instances of NSArray, NSDictionary, NSString, NSDate, NSNumber, and NSData (and some of their subclasses) can be serialized. The contents of array and dictionary objects must also contain only objects of these few classes.
When to Use:
property lists(serialization) should be used for data that consists primarily of strings and numbers. They are very inefficient when used with large blocks of binary data.
It is worthy to Archive objects other than plist objects or storing large blocks of data.
Generally speaking, Serialization is concerned with converting your program data types into architecture independent byte streams. Archiving is specialized serialization in that you could store type and other relationship based information that allow you to unserialize/unmarshall easily. So archival can be thought of as a specialization and subset of Serialization. For Objective-C
Serialization converts Objective-C
types to and from an
architecture-independent byte stream.
In contrast to archiving, basic
serialization does not record the data
type of the values nor the
relationships between them; only the
values themselves are recorded. It is
your responsibility to deserialize the
data in the proper order. Several
convenience classes, however, do
provide the ability to serialize
property lists, recording their
structure along with their values.
With C++ boost serialization --
http://www.boost.org/doc/libs/1_45_0/libs/serialization/doc/index.html
Here, we use the term "serialization"
to mean the reversible deconstruction
of an arbitrary set of C++ data
structures to a sequence of bytes.
Such a system can be used to
reconstitute an equivalent structure
in another program context. Depending
on the context, this might used
implement object persistence, remote
parameter passing or other facility.
In this system we use the term
"archive" to refer to a specific
rendering of this stream of bytes.
This could be a file of binary data,
text data, XML, or some other created
by the user of this library.