How to let Spark serialize an object using Kryo?

I'd like to pass an object from the driver node to other nodes where an RDD resides, so that each partition of the RDD can access that object, as shown in the following snippet.
object HelloSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Testing HelloSpark")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "xt.HelloKryoRegistrator")

    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 20, 4)

    val bytes = new ImmutableBytesWritable(Bytes.toBytes("This is a test"))

    rdd.map(x => x.toString + "-" + Bytes.toString(bytes.get) + " !")
       .collect()
       .foreach(println)

    sc.stop
  }
}

// My registrator
class HelloKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) = {
    kryo.register(classOf[ImmutableBytesWritable], new HelloSerializer())
  }
}

// My serializer
class HelloSerializer extends Serializer[ImmutableBytesWritable] {
  override def write(kryo: Kryo, output: Output, obj: ImmutableBytesWritable): Unit = {
    output.writeInt(obj.getLength)
    output.writeInt(obj.getOffset)
    output.writeBytes(obj.get(), obj.getOffset, obj.getLength)
  }

  override def read(kryo: Kryo, input: Input, t: Class[ImmutableBytesWritable]): ImmutableBytesWritable = {
    val length = input.readInt()
    val offset = input.readInt()
    val bytes  = new Array[Byte](length)
    input.read(bytes, offset, length)

    new ImmutableBytesWritable(bytes)
  }
}
In the snippet above, I tried to serialize ImmutableBytesWritable with Kryo in Spark, so I did the following:
Configure the SparkConf instance passed to the Spark context, i.e., set "spark.serializer" to "org.apache.spark.serializer.KryoSerializer" and set "spark.kryo.registrator" to "xt.HelloKryoRegistrator";
Write a custom Kryo registrator class in which I register the class ImmutableBytesWritable;
Write a serializer for ImmutableBytesWritable.
However, when I submit my Spark application in yarn-client mode, the following exception is thrown:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.map(RDD.scala:270)
at xt.HelloSpark$.main(HelloSpark.scala:23)
at xt.HelloSpark.main(HelloSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:325)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.hadoop.hbase.io.ImmutableBytesWritable
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 12 more
It seems that ImmutableBytesWritable can't be serialized by Kryo. So what is the correct way to let Spark serialize an object using Kryo? Can Kryo serialize any type?

This is happening because you're using ImmutableBytesWritable in your closure. Spark doesn't support closure serialization with Kryo yet (it only uses Kryo for the objects inside RDDs), so the closure still goes through Java serialization. This related question can help you solve your problem:
Spark - Task not serializable: How to work with complex map closures that call outside classes/objects?
You simply need to serialize the objects before passing them into the closure, and deserialize them afterwards. This approach just works, even if your classes aren't Serializable, because it uses Kryo behind the scenes. All you need is some curry. ;)
Here's an example sketch:
def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])
             (foo: Foo): Bar = {
  kryoWrapper.value.apply(foo)
}

// Wrap a function object (something extending Foo => Bar), not the raw value:
val mapper = genMapper(KryoSerializationWrapper(TheRealFunction)) _
rdd.flatMap(mapper).collectAsMap()

object TheRealFunction extends (Foo => Bar) {
  def apply(foo: Foo): Bar = {
    // this is the real function
  }
}
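Adapted to the snippet in the question, the idea is to wrap a function that carries the ImmutableBytesWritable rather than the writable itself. Below is a rough, untested sketch along those lines; KryoSerializationWrapper is the helper class from the linked answer, and BytesSuffixer is just an illustrative name, not an existing API:

// Illustrative sketch only. The closure below captures nothing but the
// (Serializable) wrapper; the wrapper Kryo-serializes the function object,
// including the non-Serializable ImmutableBytesWritable it holds.
class BytesSuffixer(msg: String) extends (Int => String) {
  val bytes = new ImmutableBytesWritable(Bytes.toBytes(msg))
  def apply(x: Int): String = x.toString + "-" + Bytes.toString(bytes.get) + " !"
}

def genMapper(wrapper: KryoSerializationWrapper[Int => String])(x: Int): String =
  wrapper.value.apply(x)

val mapper = genMapper(KryoSerializationWrapper(new BytesSuffixer("This is a test"): Int => String)) _
rdd.map(mapper).collect().foreach(println)

Whether the wrapped writable actually goes through your custom HelloSerializer depends on how the wrapper obtains its Kryo instance; if it uses Spark's configured KryoSerializer, your registrator will be picked up.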

Related

Polymorphic serialization of sealed hierarchies with generic type parameters

Using Kotlin serialization, I would like to serialize and deserialize (to JSON) a generic data class whose type parameter comes from a sealed hierarchy. However, I get a runtime exception.
To reproduce the issue:
import kotlinx.serialization.*
import kotlinx.serialization.json.Json
import kotlin.test.Test
import kotlin.test.assertEquals

/// The sealed hierarchy uses a generic type parameter:
@Serializable
sealed interface Coded {
    val description: String
}

@Serializable
@SerialName("CodeOA")
object CodeOA : Coded {
    override val description: String = "Code Object OA"
}

@Serializable
@SerialName("CodeOB")
object CodeOB : Coded {
    override val description: String = "Code Object OB"
}

/// Simplified class hierarchy
@Serializable
sealed interface NumberedData {
    val number: Int
}

@Serializable
@SerialName("CodedData")
data class CodedData<out C : Coded>(
    override val number: Int,
    val info: String,
    val code: C
) : NumberedData

internal class GenericSerializerTest {
    @Test
    fun `polymorphically serialize and deserialize a CodedData instance`() {
        val codedData: NumberedData = CodedData(
            number = 42,
            info = "Some test",
            code = CodeOB
        )
        val codedDataJson = Json.encodeToString(codedData)
        val codedDataDeserialized = Json.decodeFromString<NumberedData>(codedDataJson)
        assertEquals(codedData, codedDataDeserialized)
    }
}
Running the test results in the following runtime exception:
kotlinx.serialization.SerializationException: Class 'CodeOB' is not registered for polymorphic serialization in the scope of 'Coded'.
Mark the base class as 'sealed' or register the serializer explicitly.
This error message does not make sense to me, as both hierarchies are sealed and marked as @Serializable.
I don't understand the root cause of the problem - do I need to explicitly register one of the plugin-generated serializers? Or do I need to roll my own serializer? Why would that be the case?
I am using Kotlin 1.7.20 with kotlinx.serialization 1.4.1
Disclaimer: I do not consider my solution to be very satisfying, but I cannot find a better way for now.
KotlinX serialization documentation about sealed classes states (emphasis mine):
you must ensure that the compile-time type of the serialized object is a polymorphic one, not a concrete one.
In the example from the docs, we see that serializing the child class instead of the parent class prevents it from being deserialized using the parent (polymorphic) type.
In your case you have nested polymorphic types, so this is even more complicated, I think. I've tried multiple things to make serialization and deserialization work, and the only way I've found is to:
Remove the generic parameter on CodedData (to make sure the code attribute is interpreted polymorphically):
@Serializable
@SerialName("CodedData")
data class CodedData(
    override val number: Int,
    val info: String,
    val code: Coded
) : NumberedData
Cast the CodedData object to NumberedData when encoding, to ensure polymorphic serialization is triggered:
Json.encodeToString<NumberedData>(codedData)
Tested using a little main program based on your own unit test:
fun main() {
    val codedData = CodedData(
        number = 42,
        info = "Some test",
        code = CodeOB
    )

    val json = Json.encodeToString<NumberedData>(codedData)
    println(
        """
        ENCODED:
        --------
        $json
        """.trimIndent()
    )

    val decoded = Json.decodeFromString<NumberedData>(json)
    println(
        """
        DECODED:
        --------
        $decoded
        """.trimIndent()
    )
}
It prints:
ENCODED:
--------
{"type":"CodedData","number":42,"info":"Some test","code":{"type":"CodeOB"}}
DECODED:
--------
CodedData(number=42, info=Some test, code=CodeOB(description = Code Object OB))

java.lang.ExceptionInInitializerError in self-referential objects?

I'm having problems with the JVM compiler.
I'm trying to write a factory method for classes. The factory method has an init() block that helps to define behaviour for the new object. While this method compiles for the JVM, I encounter a problem when running it:
java.lang.ExceptionInInitializerError
Caused by: java.lang.IllegalArgumentException: Parameter specified as non-null is null: method type.ProblematicKt.nullable, parameter $this$nullable
Apparently, the object isn't yet defined when I attempt to run the problematicInit() block. How do I fix this?
It seems to be a JVM problem; it works when compiling to JavaScript. My understanding was that getProblematic would be hoisted, but that what's inside the scope would be deferred until it is actually run later, after the factory method has completed.
interface ProblematicBuilderScope {
    fun problematicInit(getX: () -> ProblematicInterface)
}

fun getProblematic() = X

class Problematic(...): ProblematicInterface

// Factory method with init() block
val X = Problematic.factory(...) {
    problematicInit { getProblematic() }
}

fun factory(init: ProblematicBuilderScope.() -> Unit): Problematic {
    val newObject = Problematic(...)
    val scope = ProblematicBuilderScope(newObject)
    scope.init()
    return newObject
}
Here is a cleaner, simpler way to achieve the same builder implementation:
interface ProblematicInterface

class Problematic() : ProblematicInterface

fun buildProblematic(init: Problematic.() -> Unit): Problematic {
    val newObject = Problematic()
    init(newObject)
    return newObject
}

val x = buildProblematic {
    // the receiver type inside this closure is Problematic
}

Moshi Retrofit2 Kotlin Class Not Found Exception

I'm trying to learn how to use Retrofit2 and Moshi with Kotlin. However, I seem to be having trouble getting my code to run.
I define the following data classes/models which map to the json response I receive from the api I am hitting:
@JsonClass(generateAdapter = true)
data class Catalogs(
    val languages: List<LanguageCatalog>
)

@JsonClass(generateAdapter = true)
data class LanguageCatalog(
    val direction: String,
    val identifier: String,
    val title: String,
    val resources: List<ResourceCatalog>
)

@JsonClass(generateAdapter = true)
data class Project(
    val identifier: String,
    val sort: Int,
    val title: String,
    val versification: String?
)

@JsonClass(generateAdapter = true)
data class ResourceCatalog(
    val identifier: String,
    val modified: String,
    val projects: List<Project>,
    val title: String,
    val version: String
)
Then I have an interface which defines the behavior for the API:
interface Door43Service {
    @GET("v3/catalog.json")
    fun getFormat(): Observable<Catalogs>

    companion object {
        fun create(): Door43Service {
            val retrofit = Retrofit.Builder()
                .addCallAdapterFactory(RxJava2CallAdapterFactory.create())
                .addConverterFactory(MoshiConverterFactory.create())
                .baseUrl("https://api.door32.org/")
                .build()
            return retrofit.create(Door43Service::class.java)
        }
    }
}
Lastly, I implemented everything inside a main function to get the json data from the api:
val door43Service by lazy {
    Door43Service.create()
}

var disposable: Disposable? = null

fun main(args: Array<String>) {
    door43Service.getFormat()
        .subscribe(
            { result -> println(result.languages) },
            { error -> println(error.message) }
        )
}
The data that gets returned from the api is pretty long, but an example of it can be found at http://api-info.readthedocs.io/en/latest/door43.html
My issue is I am getting the following error in my stack:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to create converter for class model.Catalogs
for method Door43Service.getFormat
at retrofit2.ServiceMethod$Builder.methodError(ServiceMethod.java:755)
at retrofit2.ServiceMethod$Builder.createResponseConverter(ServiceMethod.java:741)
at retrofit2.ServiceMethod$Builder.build(ServiceMethod.java:172)
at retrofit2.Retrofit.loadServiceMethod(Retrofit.java:170)
at retrofit2.Retrofit$1.invoke(Retrofit.java:147)
at com.sun.proxy.$Proxy0.getFormat(Unknown Source)
at MainKt.main(main.kt:11)
Caused by: java.lang.RuntimeException: Failed to find the generated JsonAdapter class for class model.Catalogs
at com.squareup.moshi.StandardJsonAdapters.generatedAdapter(StandardJsonAdapters.java:249)
at com.squareup.moshi.StandardJsonAdapters$1.create(StandardJsonAdapters.java:62)
at com.squareup.moshi.Moshi.adapter(Moshi.java:130)
at retrofit2.converter.moshi.MoshiConverterFactory.responseBodyConverter(MoshiConverterFactory.java:91)
at retrofit2.Retrofit.nextResponseBodyConverter(Retrofit.java:330)
at retrofit2.Retrofit.responseBodyConverter(Retrofit.java:313)
at retrofit2.ServiceMethod$Builder.createResponseConverter(ServiceMethod.java:739)
... 5 more
Caused by: java.lang.ClassNotFoundException: model.CatalogsJsonAdapter
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at com.squareup.moshi.StandardJsonAdapters.generatedAdapter(StandardJsonAdapters.java:236)
At first glance, my understanding is that the compiler thinks I haven't defined an adapter for my Catalogs class, but I thought that was supposed to be covered by the @JsonClass(generateAdapter = true) annotation. Is there anything I'm missing? Why can't my program generate the adapter for my Catalogs class?
So I got things working. All I had to do was rebuild the project by running gradle build in the terminal (I was originally running it in IntelliJ, but that didn't seem to actually run the build). The key issue was that, since the build wasn't running, the line in my Gradle build script that says
kapt "com.squareup.moshi:moshi-kotlin-codegen:$moshi_version"
wasn't being run. This line pulls in Moshi's annotation processor via kapt, which is what generates the JsonAdapter classes for types annotated with @JsonClass. Without it, those adapters are never generated. This was the root cause of my issue. I am keeping this post up in case anyone runs into the same issue.

Obtain class from nested type parameters in Kotlin

I have a val built like this
val qs = hashMapOf<KProperty1<ProfileModel.PersonalInfo, *>, Question>()
How can I obtain the class of ProfileModel.PersonalInfo from this variable?
In other words, what expression (involving qs, of course) should replace Any() so that this test passes?
@Test
fun obtaionResultTypeFromQuestionList() {
    val resultType = Any()
    assertEquals(ProfileModel.PersonalInfo::class, resultType)
}
Thank you for your attention
There is no straightforward way to get such information due to Java type erasure.
In short, all information about generics (in your case) is unavailable at runtime, and HashMap<String, String> becomes just HashMap.
But if you make some changes at the JVM level, like defining a new class, the information about the actual type parameters is kept. This gives you the ability to do hacks like this:
val toResolve = object : HashMap<KProperty1<ProfileModel.PersonalInfo, *>, Question>() {
    init {
        // fill your data here
    }
}

val parameterized = toResolve::class.java.genericSuperclass as ParameterizedType
val property = parameterized.actualTypeArguments[0] as ParameterizedType

print(property.actualTypeArguments[0])
prints ProfileModel.PersonalInfo.
Explanation:
We define a new anonymous class, which affects the JVM level, not only the runtime, so the information about the generic arguments is kept.
We get the generic superclass of our new anonymous class instance, which results in HashMap< ... , ... >.
We take the first type passed inside HashMap's generic brackets, which gives us KProperty1< ... , ... >.
We repeat the previous step with KProperty1.
Kotlin is subject to JVM type erasure just as Java is. You can make the code a bit nicer by moving the creation of the hash map into a separate function:
inline fun <reified K, reified V> genericHashMapOf(
    vararg pairs: Pair<K, V>
): HashMap<K, V> = object : HashMap<K, V>() {
    init {
        putAll(pairs)
    }
}
...
val hashMap = genericHashMapOf(something to something)

How Spark handles objects

To test serialization exceptions in Spark, I wrote a task in two ways.
First way:
package examples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object dd {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf
    val sc = new SparkContext(sparkConf)

    val data = List(1, 2, 3, 4, 5)
    val rdd = sc.makeRDD(data)

    val result = rdd.map(elem => {
      funcs.func_1(elem)
    })

    println(result.count())
  }
}

object funcs {
  def func_1(i: Int): Int = {
    i + 1
  }
}
This way, Spark works pretty well.
But when I change it to the following way, it does not work and throws a NotSerializableException.
Second way:
package examples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object dd {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf
    val sc = new SparkContext(sparkConf)

    val data = List(1, 2, 3, 4, 5)
    val rdd = sc.makeRDD(data)

    val handler = funcs
    val result = rdd.map(elem => {
      handler.func_1(elem)
    })

    println(result.count())
  }
}

object funcs {
  def func_1(i: Int): Int = {
    i + 1
  }
}
I know the reason I get the "task is not serializable" error is that I am trying to send an unserializable object, funcs, from the driver node to the worker nodes in the second example. For the second example, if I make object funcs extend Serializable, the error goes away.
But in my view, because funcs is an object rather than a class, it is a singleton and is supposed to be serialized and shipped from the driver to the workers rather than instantiated on a worker node itself. In this scenario, although the way object funcs is used differs, I would guess the unserializable object funcs is shipped from the driver node to the worker nodes in both of these two examples.
My question is why the first example runs successfully while the second one fails with the 'task not serializable' exception.
When you run code in an RDD closure (map, filter, etc...), everything necessary to execute that code will be packaged up, serialized, and sent to the executors to be run. Any objects that are referenced (or whose fields are referenced) will be serialized in this task, and this is where you'll sometimes get a NotSerializableException.
Your use case is a little more complicated, though, and involves the Scala compiler. Typically, calling a function on a Scala object is the equivalent of calling a Java static method. That object never really exists as a value; it's basically like writing the code inline. However, if you assign an object to a variable, then you're actually creating a reference to that object in memory, the object behaves more like a class, and it can have serialization issues.
scala> object A {
  def foo() {
    println("bar baz")
  }
}
defined module A

scala> A.foo() // static method
bar baz

scala> val a = A // now we're actually assigning a memory location
a: A.type = A$@7e0babb1

scala> a.foo() // dereferences a before calling foo
bar baz
In order for Spark to distribute a given operation, the function used in the operation needs to be serialized. Before serialization, these functions pass through a complex process appropriately called "ClosureCleaner".
The intention is to "cut off" closures from their context in order to reduce the size of the object graph that needs to be serialized, and to reduce the risk of serialization issues in the process. In other words, it ensures that only the code needed to execute the function is serialized and sent for deserialization and execution "at the other side".
During that process, the closure is also checked for serializability, to proactively detect serialization issues at runtime (SparkContext#clean).
That code is dense and complex, so it's hard to find the exact code path leading to this case.
Intuitively, what's happening is that when the ClosureCleaner finds:
val result = rdd.map { elem =>
  funcs.func_1(elem)
}
It determines that the inner members of the closure come from an object that can be recreated and that there are no further references, so the cleaned closure only contains {elem => funcs.func_1(elem)}, which can be serialized by the JavaSerializer.
Instead, when the closure cleaner evaluates:
val handler = funcs
val result = rdd.map(elem => {
  handler.func_1(elem)
})
It finds that the closure has a reference to $outer (handler), hence it inspects the outer scope and adds the handler instance to the cleaned closure. We could imagine the resulting cleaned closure to be something of this shape (for illustrative purposes only):
{ elem =>
  val handler = funcs
  handler.func_1(elem)
}
When this closure is tested for serializability, it fails to serialize. Per JVM serialization rules, an object is serializable only if all of its members are recursively serializable. In this case, handler references a non-serializable object, so the check fails.
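As the question itself notes, one way to make the second example pass is to mark the funcs object Serializable; the other is simply not to assign it to a local variable, as in the first example. A minimal sketch of the first option, keeping the rest of the question's code unchanged:

// Marking the singleton Serializable lets the captured handler reference be
// serialized along with the cleaned closure.
object funcs extends Serializable {
  def func_1(i: Int): Int = i + 1
}

// This variant no longer throws the "task not serializable" exception:
val handler = funcs
val result = rdd.map(elem => handler.func_1(elem))
println(result.count())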