Kryo registration issue when upgrading to Spark 2.0 - serialization

I am upgrading an application from Spark 1.6.2 to Spark 2.0.2. The issue is not strictly Spark-related. Spark 1.6.2 includes Kryo 2.21. Spark 2.0.2 includes Kryo 3.0.3.
The application stores some data serialized with Kryo on HDFS. To save space, Kryo registration is enforced. When a class is registered with Kryo, it gets a sequential ID and this ID is used to represent the class in the wire format instead of the full class name. When we register a new class, we always put it at the end, so it gets an unused ID. We also never delete a class from registration. (If a class is deleted, we register a placeholder in its place to reserve the ID.) This way the IDs are stable and one version of the application can read the data written by a previous version.
It turns out Kryo uses the same registration mechanism to register primitive classes in its constructor. In Kryo 2.21 it registers 9 primitive classes, so the first user-registered class gets ID 9. But Kryo 2.22 and later register 10 primitive classes. (void was added.) This means the user-registered classes start from ID 10.
How can we still load the old data after upgrading to Spark 2.0.2?
(It would be great if our first user-registered class were a deprecated class. But it is not. It is scala.Tuple2[_, _].)

There is actually a Kryo.register(Class type, int id) method that can be used to explicitly specify an ID. The comment for the id parameter says:
id: Must be >= 0. Smaller IDs are serialized more efficiently. IDs 0-8 are used by default for primitive types and String, but these IDs can be repurposed.
The comment is wrong since 2.22: ID 9 is now also used by default. But indeed it can be repurposed!
kryo.register(classOf[Tuple2[_, _]], 9)
The normal sequential registration works for the rest of the classes. The explicit ID is only necessary for the first class.
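To see why pinning just the first ID works, here is a toy model of the registration table (plain Java maps, not the Kryo API): the built-in registrations occupy the low IDs, sequential registration continues from there, and an explicit registration can repurpose a built-in slot.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of Kryo's registration table (NOT the real Kryo API):
// each registered class gets the next free integer ID, and Kryo's
// constructor claims the first few IDs for its built-in types.
class Registry {
    private final Map<String, Integer> ids = new LinkedHashMap<>();
    private int nextId;

    Registry(int builtinCount) {
        this.nextId = builtinCount; // IDs 0..builtinCount-1 are taken by Kryo
    }

    // Sequential registration, as in kryo.register(clazz)
    int register(String className) {
        return ids.computeIfAbsent(className, c -> nextId++);
    }

    // Explicit-ID registration, as in kryo.register(clazz, id)
    int register(String className, int id) {
        ids.put(className, id);
        nextId = Math.max(nextId, id + 1);
        return id;
    }
}

class KryoIdDemo {
    public static void main(String[] args) {
        // Kryo 2.21: 9 built-ins, so the first user class lands on ID 9.
        Registry old = new Registry(9);
        int oldTuple2 = old.register("scala.Tuple2");

        // Kryo 3.0.3: 10 built-ins; naive sequential registration would
        // shift Tuple2 (and everything after it) to ID 10.
        Registry shifted = new Registry(10);
        int shiftedTuple2 = shifted.register("scala.Tuple2");

        // Fix: pin the first class to ID 9 (repurposing void's slot);
        // sequential registration then continues at 10, as before.
        Registry fixed = new Registry(10);
        int fixedTuple2 = fixed.register("scala.Tuple2", 9);
        int next = fixed.register("com.example.Foo");

        System.out.println(oldTuple2 + " " + shiftedTuple2 + " "
                + fixedTuple2 + " " + next); // 9 10 9 10
    }
}
```

With real Kryo the same shape applies: one `kryo.register(classOf[Tuple2[_, _]], 9)` call, then plain `kryo.register` for everything after it.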


Byte Buddy LocationStrategy types

I saw that the default LocationStrategy is STRONG, which keeps a strong reference to the class loader when creating a ClassFileLocator. Does this mean Byte Buddy can prevent a class loader from being garbage collected (e.g., when undeploying a webapp from a servlet container), or is there another mechanism to evict those references?
Also in this regard: the documentation for the WEAK strategy says a ClassFileLocator will "stop working" after the corresponding class loader is garbage collected. What are the implications? How would a locator for a garbage-collected class loader be used?
You are right about your assertion. With a strong type locator, all TypeDescriptions will reference the class loader, as dependent types are resolved lazily. This means, for example, that if you look up a type's field type, that type is only loaded the first time you actually use it, which might never happen.
Typically, those type descriptions are not stored beyond the lifetime of a class being loaded. Since a class loader can never be garbage collected while one of its classes is loading, referencing the class loader strongly does not pose a problem. However, once you want to cache type descriptions across multiple class loadings (which can make a lot of sense, since some applications load thousands of classes through the same class loader), this might become a problem: a class loader could otherwise be garbage collected while the cache still references a type description that holds on to it.
In that case, reusing the type descriptions becomes problematic, since no lazily referenced classes can be resolved after the class loader has been garbage collected. Note that a type description might be resolved using a specific class loader while the class is defined by a parent of that class loader, which is why this can be an issue.
Typically, however, if you maintain a cache of type descriptions per class loader, this should not be a problem.
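One common shape for such a per-class-loader cache is a map keyed weakly by class loader, so that collecting a loader also frees its cached descriptions. A minimal sketch, with hypothetical names (this is not the Byte Buddy API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of a cache keyed weakly by class loader. Caveat: if a cached
// value itself strongly references the loader (as real type descriptions
// do), hold it via a Weak/SoftReference instead, or the value -> key
// chain will defeat the weak key and the loader can never be collected.
class TypeDescriptionCache {
    private final Map<ClassLoader, Map<String, Object>> caches = new WeakHashMap<>();

    Object find(ClassLoader loader, String typeName) {
        Map<String, Object> cache = caches.get(loader);
        return cache == null ? null : cache.get(typeName);
    }

    void put(ClassLoader loader, String typeName, Object description) {
        caches.computeIfAbsent(loader, l -> new HashMap<>()).put(typeName, description);
    }
}

class CacheDemo {
    public static void main(String[] args) {
        TypeDescriptionCache cache = new TypeDescriptionCache();
        ClassLoader cl = ClassLoader.getSystemClassLoader();
        cache.put(cl, "com.example.Foo", "description of Foo");
        System.out.println(cache.find(cl, "com.example.Foo"));
    }
}
```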

How can deserialization of polymorphic trait objects be added in Rust if at all?

I'm trying to solve the problem of serializing and deserializing Box<SomeTrait>. I know that in the case of a closed type hierarchy, the recommended way is to use an enum and there are no issues with their serialization, but in my case using enums is an inappropriate solution.
At first I tried to use Serde as it is the de-facto Rust serialization mechanism. Serde is capable of serializing Box<X> but not in the case when X is a trait. The Serialize trait can’t be implemented for trait objects because it has generic methods. This particular issue can be solved by using erased-serde so serialization of Box<SomeTrait> can work.
The main problem is deserialization. To deserialize a polymorphic type, you need some type marker in the serialized data. This marker has to be deserialized first and then used to dynamically look up the function that will return a Box<SomeTrait>.
std::any::TypeId could be used as the marker type, but the main problem is how to dynamically get the deserialization function. I do not consider the option of manually registering a function for each polymorphic type during application initialization.
I know two possible ways to do it:
Languages that have runtime reflection, like C#, can use it to get the deserialization method.
In C++, the cereal library uses the magic of static objects to register deserializers in a static map at library initialization time.
But neither of these options is available in Rust. How can deserialization of polymorphic objects be added in Rust if at all?
This has been implemented by dtolnay in the typetag crate.
The concept is quite clever and is explained in the README:
How does it work?
We use the inventory crate to produce a registry of impls of your trait, which is built on the ctor crate to hook up initialization functions that insert into the registry. The first Box<dyn Trait> deserialization will perform the work of iterating the registry and building a map of tags to deserialization functions. Subsequent deserializations find the right deserialization function in that map. The erased-serde crate is also involved, to do this all in a way that does not break object safety.
To summarize, every implementation of the trait declared as [de]serializable is registered at compile-time, and this is resolved at runtime in case of [de]serialization of a trait object.
All your libraries could provide a registration routine, guarded by std::sync::Once, that registers some identifier into a common static mut, but obviously your program must call them all.
I've no idea if TypeId yields consistent values across recompiles with different dependencies.
A library to do this should be possible. To create such a library, we would create a bidirectional mapping from TypeId to type name before using the library, and then use that for serialization/deserialization with a type marker. It would be possible to have a function for registering types that are not owned by your package, and to provide a macro annotation that automatically does this for types declared in your package.
If there's a way to access a type ID in a macro, that would be a good way to instrument the mapping between TypeId and type name at compile time rather than runtime.
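The registry that typetag and cereal build is, at its core, just a map from a tag to a deserialization function. A minimal sketch of that shape (in Java for illustration, since the pattern is language-agnostic; all names and the "tag:payload" wire format are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a tag -> deserializer registry: each concrete type registers
// a function that turns its serialized payload back into the common
// interface; deserialization reads the tag first, then dispatches.
interface Shape { double area(); }

class ShapeRegistry {
    private static final Map<String, Function<String, Shape>> DESERIALIZERS = new HashMap<>();

    static void register(String tag, Function<String, Shape> f) {
        DESERIALIZERS.put(tag, f);
    }

    // Hypothetical wire format for the sketch: "tag:payload"
    static Shape deserialize(String wire) {
        int sep = wire.indexOf(':');
        Function<String, Shape> f = DESERIALIZERS.get(wire.substring(0, sep));
        if (f == null) throw new IllegalArgumentException("unknown tag: " + wire);
        return f.apply(wire.substring(sep + 1));
    }
}

class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

class RegistryDemo {
    public static void main(String[] args) {
        // In typetag/cereal this registration happens automatically at
        // startup; here it is done by hand.
        ShapeRegistry.register("circle", s -> new Circle(Double.parseDouble(s)));
        Shape s = ShapeRegistry.deserialize("circle:2.0");
        System.out.println(s.area());
    }
}
```

What typetag adds on top of this is doing the registration at link/startup time (via inventory/ctor) so no manual init call is needed.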

What's the package of a Java array class?

This question is for JVM specification advocates. According to the JVMS, Java SE 7 edition, section 5.3.3, last paragraph:
If the component type is a reference type, the accessibility of the array class is determined by the accessibility of its component type. Otherwise, the accessibility of the array class is public.
Thus an array class can have package visibility. Logically I would expect that, if foo.baz.MyClass has package visibility, then an array of MyClass is visible only to package foo.baz. But I can't find anything in the specification supporting this view. Section 5.3 says that the run-time package, that should be used to determine visibility constraints, is built of the binary name of the package plus the defining classloader. But the binary name comes from the classfile, and array classes do not have classfiles. Similar issue for the primitive classes (e.g., Boolean.TYPE), which apparently have public visibility, but I cannot find information about them anywhere.
Can you spot a point in the JVMS where the package of array/primitive classes is clearly defined (or a reason why there is none)?
Isn't that exactly what the quote from the specification is saying?
If the component type is a reference type, the accessibility of the array class is determined by the accessibility of its component type.
If you have a class some.pkg.SomeClass and want to use it as some.pkg.SomeClass[] the accessibility is determined by the accessibility of its component type. In this case the component type is some.pkg.SomeClass.
The other case is for primitive types, and you can't add new primitive types to Java.
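You can observe this directly via reflection: an array class has no Package object of its own, and the Class.getModifiers documentation states that an array's public, private and protected modifiers are the same as those of its component type.

```java
import java.lang.reflect.Modifier;

class PkgPrivate {} // package-private component type, for demonstration

class ArrayPackageDemo {
    public static void main(String[] args) {
        // Array classes (like primitives) have no Package:
        System.out.println(String[].class.getPackage());                      // null
        // Accessibility is copied from the component type:
        System.out.println(Modifier.isPublic(String[].class.getModifiers())); // true
        System.out.println(Modifier.isPublic(PkgPrivate[].class.getModifiers())); // false
        System.out.println(int[].class.getComponentType());                   // int
    }
}
```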

Different SUID on different classes

Assume I have completely different classes with different class names. Should I use a different serialVersionUID in each class?
If so, why is it necessary?
How exactly does the JRE handle the deserialization?
No, you don't need different SUIDs (all classes can use 1L for it).
When an object is serialized, the class identifier (package.name.ClassName) and the SUID are both part of the header, to identify the class the object belongs to and to ensure that there is no incompatibility between the writing side and the reading side.
But when you change a class structurally (add/remove a field), you should set a new SUID for that class (during debugging you can let the JVM create a new one at runtime based on the .class file).
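A quick demonstration that unrelated classes can share the same SUID: serialization compares the UID per class, alongside the class name, so these two round-trip without conflict.

```java
import java.io.*;

// Two unrelated classes sharing serialVersionUID = 1L.
class Login implements Serializable {
    private static final long serialVersionUID = 1L;
    String user = "alice";
}

class Invoice implements Serializable {
    private static final long serialVersionUID = 1L;
    int amount = 42;
}

class SuidDemo {
    // Serialize to a byte array and read it back.
    static Object roundTrip(Object o) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Login l = (Login) roundTrip(new Login());
        Invoice i = (Invoice) roundTrip(new Invoice());
        System.out.println(l.user + " " + i.amount); // alice 42
    }
}
```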

Why is setting the classloader necessary with Scala RemoteActors?

When using Scala RemoteActors I was getting a ClassNotFoundException that referred to scala.actors.remote.NetKernel. I copied someone else's example and added RemoteActor.classLoader = getClass.getClassLoader to my Actor and now everything works. Why is this necessary?
Remote Actors use Java serialization to send messages back and forth. Inside the actors library, you'll find a custom object input stream ( https://lampsvn.epfl.ch/trac/scala/browser/scala/trunk/src/actors/scala/actors/remote/JavaSerializer.scala ) that is used to serialize objects to/from a socket. There's also some routing code and other magic.
The ClassLoader used for remoting is rather important; I'd recommend reading up on Java RMI if you're unfamiliar with it. The ClassLoader that Scala picks when serializing/deserializing actors is the one located on RemoteActor, which defaults to null.
This means that by default, you will be unhappy without specifying a ClassLoader ;).
If you were in an environment that controls classloaders, such as OSGi, you'd want to make sure you set this value to a classloader that has access to all classes used by all serialized actors.
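The mechanism underneath is plain Java serialization: ObjectInputStream calls resolveClass for every class descriptor it reads, and the default lookup may use a loader that cannot see your application classes. A custom stream like Scala's JavaSerializer pins the loader explicitly; here is a sketch of the idea (not the actual Scala source):

```java
import java.io.*;

// ObjectInputStream subclass that resolves every class through an
// explicitly supplied ClassLoader instead of the default lookup.
class LoaderAwareObjectInputStream extends ObjectInputStream {
    private final ClassLoader loader;

    LoaderAwareObjectInputStream(InputStream in, ClassLoader loader) throws IOException {
        super(in);
        this.loader = loader;
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        try {
            // Force resolution through the loader we were given.
            return Class.forName(desc.getName(), false, loader);
        } catch (ClassNotFoundException e) {
            return super.resolveClass(desc); // fall back (primitives, etc.)
        }
    }
}

class ResolveClassDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject("hello");
        }
        try (ObjectInputStream in = new LoaderAwareObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()),
                ClassLoader.getSystemClassLoader())) {
            System.out.println(in.readObject()); // hello
        }
    }
}
```

Setting RemoteActor.classLoader is what feeds the equivalent of that `loader` argument to Scala's deserialization path.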
Hope that helps!