How spark handles object - serialization

To test the Serialization exception in spark I wrote a task in 2 ways.
First way:
package examples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object dd {
def main(args: Array[String]):Unit = {
val sparkConf = new SparkConf
val sc = new SparkContext(sparkConf)
val data = List(1,2,3,4,5)
val rdd = sc.makeRDD(data)
val result = rdd.map(elem => {
funcs.func_1(elem)
})
println(result.count())
}
}
object funcs{
def func_1(i:Int): Int = {
i + 1
}
}
This version runs fine.
However, when I change it as follows, it fails with a NotSerializableException.
Second way:
package examples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object dd {
def main(args: Array[String]):Unit = {
val sparkConf = new SparkConf
val sc = new SparkContext(sparkConf)
val data = List(1,2,3,4,5)
val rdd = sc.makeRDD(data)
val handler = funcs
val result = rdd.map(elem => {
handler.func_1(elem)
})
println(result.count())
}
}
object funcs{
def func_1(i:Int): Int = {
i + 1
}
}
I know the reason I get the "task is not serializable" error is that, in the second example, I am trying to send the unserializable object funcs from the driver node to a worker node. If I make object funcs extend Serializable, the error goes away.
But in my view, because funcs is an object rather than a class, it is a singleton and is supposed to be serialized and shipped from the driver to the workers instead of being instantiated within a worker node itself. In this scenario, although the way object funcs is used differs, I would guess that the unserializable object funcs is shipped from the driver node to the worker node in both examples.
My question is: why does the first example run successfully while the second one fails with the "task not serializable" exception?

When you run code in an RDD closure (map, filter, etc...), everything necessary to execute that code will be packaged up, serialized, and sent to the executors to be run. Any objects that are referenced (or whose fields are referenced) will be serialized in this task, and this is where you'll sometimes get a NotSerializableException.
Your use case is a little more complicated, though, and involves the scala compiler. Typically, calling a function on a scala object is the equivalent of calling a java static method. That object never really exists -- it's basically like writing the code inline. However, if you assign an object to a variable, then you're actually creating a reference to that object in memory, and the object behaves more like a class, and can have serialization issues.
scala> object A {
def foo() {
println("bar baz")
}
}
defined module A
scala> A.foo() // static method
bar baz
scala> val a = A // now we're actually assigning a memory location
a: A.type = A$@7e0babb1
scala> a.foo() // dereferences a before calling foo
bar baz

In order for Spark to distribute a given operation, the function used in the operation needs to be serialized. Before serialization, these functions pass through a complex process appropriately called "ClosureCleaner".
The intention is to "cut off" closures from their context in order to reduce the size of the object graph that needs to be serialized and to reduce the risk of serialization issues in the process. In other words, it ensures that only the code needed to execute the function is serialized and sent for deserialization and execution "at the other side".
During that process, the closure is also checked for serializability, so that serialization issues are detected proactively rather than later at runtime (SparkContext#clean).
That code is dense and complex so it's hard to find the right code path leading to this case.
Intuitively, what's happening is that when the ClosureCleaner finds:
val result = rdd.map{elem =>
funcs.func_1(elem)
}
It evaluates the inner members of the closure to be from an object that can be recreated and there are no further references, so the cleaned closure only contains {elem => funcs.func_1(elem)} which can be serialized by the JavaSerializer.
Instead, when the closure cleaner evaluates:
val handler = funcs
val result = rdd.map(elem => {
handler.func_1(elem)
})
It finds that the closure has a reference to $outer (handler), hence it inspects the outer scope and adds the variable instance to the cleaned closure. We could imagine the resulting cleaned closure to be something of this shape (this is for illustrative purposes only):
{elem =>
val handler = funcs
handler.func_1(elem)
}
When the closure is tested for serialization, it fails to serialize. Per JVM serialization rules, an object is serializable if recursively all its members are serializable. In this case handler references a non-serializable object and the check fails.
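As a rough analogy only, the same distinction can be reproduced with plain JDK serialization. This is a minimal sketch in Java, not Spark's ClosureCleaner and not the bytecode scalac emits; Funcs, SerFunction, canSerialize and the two demo methods are hypothetical names made up for illustration. A lambda that only calls a static method captures nothing and serializes, while a lambda that captures a reference to a non-serializable instance fails:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class ClosureCaptureDemo {
    // Hypothetical stand-in for the non-serializable `funcs` object.
    public static class Funcs {
        public int func1(int i) { return i + 1; }
    }

    // Stand-in for calling a method "statically", with no instance involved.
    public static int staticFunc1(int i) { return i + 1; }

    // A function type whose lambdas are serializable (assumption: plays the role of the task closure).
    public interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    // Returns true if Java serialization accepts the object graph rooted at o.
    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // The lambda captures nothing: it only names a static method.
    public static boolean staticCallSerializes() {
        SerFunction<Integer, Integer> f = elem -> staticFunc1(elem);
        return canSerialize(f);
    }

    // The lambda captures `handler`, so the Funcs instance becomes part of the graph.
    public static boolean capturedInstanceSerializes() {
        Funcs handler = new Funcs();
        SerFunction<Integer, Integer> f = elem -> handler.func1(elem);
        return canSerialize(f);
    }

    public static void main(String[] args) {
        System.out.println(staticCallSerializes());       // true
        System.out.println(capturedInstanceSerializes()); // false
    }
}
```

Making Funcs implement Serializable turns the second result into true, which matches the observation in the question that extending Serializable makes the error go away.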

Using a Jackson attribute to accumulate state as a byproduct of serialization

Here's my scenario:
I have a deep compositional tree of POJOs from various classes. I need to write a utility that can dynamically process this tree without having a baked in understanding of the class/composition structure
Some properties in my POJOs are annotated with a custom annotation @PIIData("phone-number") that declares that the property may contain PII, and optionally what kind of PII (e.g. phone number)
As a byproduct of serializing the root object, I'd like to accumulate a registry of PII locations based on their JSON path
Desired data structure:
path                                type
household.primaryEmail              email-address
household.members[0].cellNumber     phone-number
household.members[0].firstName      first-name
household.members[1].cellNumber     phone-number
I don't care about the specific pathing/location language used (JSON Pointer, Json Path).
I could achieve this with some reflection and maintenance of my own path, but it feels like something I should be able to do with Jackson since it's already doing the traversal. I'm pretty sure that using Jackson's attributes feature is the right way to attach my object that will accumulate the data structure. However, I can't figure out a way to get at the path at runtime. Here's my current Scala attempt (hackily?) built on top of a filter that is applied to all objects through a mixin:
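import com.fasterxml.jackson.annotation.JsonFilter
import com.fasterxml.jackson.core.JsonGenerator
import com.fasterxml.jackson.databind.{ObjectMapper, SerializerProvider}
import com.fasterxml.jackson.databind.ser.{BeanPropertyWriter, PropertyWriter}
import com.fasterxml.jackson.databind.ser.impl.{SimpleBeanPropertyFilter, SimpleFilterProvider}
import scala.collection.mutable
object Test {
@JsonFilter("pii")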
class PiiMixin {
}
class PiiAccumulator {
val state = mutable.ArrayBuffer[String]()
def accumulate(test: String): Unit = state += test
}
def main(args: Array[String]): Unit = {
val filter = new SimpleBeanPropertyFilter() {
override def serializeAsField(pojo: Any, jgen: JsonGenerator, provider: SerializerProvider, writer: PropertyWriter): Unit = {
if (writer.getAnnotation(classOf[PIIData]) != null) {
provider.getAttribute("pii-accumulator").asInstanceOf[PiiAccumulator].accumulate(writer.getFullName.toString)
}
super.serializeAsField(pojo, jgen, provider, writer)
}
override def include(writer: BeanPropertyWriter): Boolean = true
override def include(writer: PropertyWriter): Boolean = true
}
val provider = new SimpleFilterProvider().addFilter("pii", filter)
val mapper = new ObjectMapper()
mapper.addMixIn(classOf[Object], classOf[PiiMixin])
val accum = new PiiAccumulator()
mapper.writer(provider)
.withAttributes("pii-accumulator", accum)
.writeValueAsString(null) // Pass in any arbitrary object here
}
}
This code has enabled me to dynamically buffer up a list of property names that contain PII, but I can't figure out how to get their locations within the resulting JSON doc. Perhaps the Jackson architecture somehow precludes knowing that at runtime. Is there some other place I can hook in to do something like this, perhaps while converting to a JsonNode?
Thanks!

Okay, found it. You can access the recursive path/location during serialization via JsonGenerator.getOutputContext.pathAsPointer(). So by changing my code above to the following:
if (writer.getAnnotation(classOf[PIIData]) != null) {
provider.getAttribute("pii-accumulator").asInstanceOf[PiiAccumulator]
.accumulate(jgen.getOutputContext.pathAsPointer().toString + "/" + writer.getName)
}
I'm able to dynamically buffer a list of special locations in the resulting JSON document for further dynamic processing.

Kotlin arrow-kt, functional way to map a collection of either to an either of a collection

I've been using Kotlin Arrow quite a bit recently, and I've run into a specific use case that has me stuck.
Let's say I have a collection of some object that I want to convert to another datatype using a convert function. Let's also say that this convert function has an ability to fail-- but instead of throwing an exception, it will just return an Either, where Either.Left() is a failure and Either.Right() is the mapped object. What is the best way to handle this use case? Some sample code below:
val list: Collection<Object> // some collection
val eithers: List<Either<ConvertError, NewObject>> = list.map { convert(it) } // through some logic, convert each object in the collection
val desired: Either<ConvertError, Collection<NewObject>> = eithers.map { ??? }
fun convert(o: Object) : Either<ConvertError, NewObject> { ... }
Essentially, I'd like to call a mapping function on a collection of data, and if any of the mappings respond with a failure, I'd like to have an Either.Left() containing the error. And then otherwise, I'd like the Either.Right() to contain all of the mapped objects.
Any ideas for a clean way to do this? Ideally, I'd like to make a chain of function calls, but have the ability to percolate an error up through the function calls.

You can use Arrow's computation blocks to unwrap Either inside map like so:
import arrow.core.Either
import arrow.core.computations.either
val list: List<Object> // some collection
val eithers: List<Either<ConvertError, NewObject>> = list.map { convert(it) } // through some logic, convert each object in the collection
val desired: Either<ConvertError, Collection<NewObject>> = either.eager {
eithers.map { it.bind() } // unwrap each Right, or leave the block on the first Left
}
fun convert(o: Object) : Either<ConvertError, NewObject> { ... }
Here bind() will either unwrap Either into NewObject in the case Either is Right, or it will exit the either.eager block in case it finds Left with ConvertError. Here we're using the eager { } variant since we're assigning it to a val immediately. The main suspend fun either { } block supports suspend functions inside but is itself also a suspend function.
This is an alternative to the traverse operator.
The traverse operation will be simplified in Arrow 0.12.0 to the following:
import arrow.core.traverseEither
list.traverseEither(::convert)
The traverse operator is also available in Arrow Fx Coroutines with support for traversing in parallel, and some powerful derivatives of this operation.
import arrow.fx.coroutines.parTraverseEither
list.parTraverseEither(Dispatchers.IO, ::convert)

This is a frequent one, what you're looking for is called traverse. It's like map, except it collects the results following the aggregation rules of the content.
So, list.k().traverse(Either.applicative()) { convert(it) } will return an Either.Left if any of the operations returns Left, and a Right wrapping the list of converted values otherwise.

How about arrow.core.IterableKt#sequenceEither?
val desired: Either<ConvertError, Collection<NewObject>> = eithers.sequenceEither()
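To make the semantics of traverse concrete, here is a minimal sketch in plain Java that does not use Arrow at all. The Either class, traverse and parse below are hypothetical illustrations written for this example, not Arrow's API. The loop stops at the first Left and returns it, and otherwise collects every Right value into one list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class TraverseSketch {
    // Minimal, hypothetical Either: exactly one of left/right is meaningful.
    public static final class Either<L, R> {
        public final L left;
        public final R right;
        public final boolean isRight;

        private Either(L left, R right, boolean isRight) {
            this.left = left;
            this.right = right;
            this.isRight = isRight;
        }

        public static <L, R> Either<L, R> left(L value) { return new Either<>(value, null, false); }
        public static <L, R> Either<L, R> right(R value) { return new Either<>(null, value, true); }
    }

    // Applies f to each item; short-circuits on the first Left, else collects all Right values.
    public static <A, L, B> Either<L, List<B>> traverse(List<A> items, Function<A, Either<L, B>> f) {
        List<B> out = new ArrayList<>();
        for (A item : items) {
            Either<L, B> result = f.apply(item);
            if (!result.isRight) {
                return Either.left(result.left);
            }
            out.add(result.right);
        }
        return Either.right(out);
    }

    // Hypothetical convert function: succeeds for integers, fails with a message otherwise.
    public static Either<String, Integer> parse(String s) {
        try {
            return Either.right(Integer.parseInt(s));
        } catch (NumberFormatException e) {
            return Either.left("not a number: " + s);
        }
    }

    public static void main(String[] args) {
        Either<String, List<Integer>> ok = traverse(Arrays.asList("1", "2", "3"), TraverseSketch::parse);
        System.out.println(ok.isRight + " " + ok.right);   // true [1, 2, 3]

        Either<String, List<Integer>> bad = traverse(Arrays.asList("1", "x", "3"), TraverseSketch::parse);
        System.out.println(bad.isRight + " " + bad.left);  // false not a number: x
    }
}
```

Sequencing a list that already holds Either values is the same loop with the identity function in place of the convert step.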

Variables not initialized properly when initializing it in an overridden abstract function called from constructor or init block

I hit a problem with some Kotlin code and I found out it was related to calling a method that assigns some variables from an init block (or a secondary constructor for that matter, either reproduces the problem).
MCVE:
abstract class Shader(/*Input arguments omitted for the sake of an MCVE*/){
init{
//Shader loading and attaching, not relevant
bindAttribs()//One of the abstract methods. In my actual program, this uses OpenGL to bind attributes
//GLSL program validation
getUniforms()//Same as the previous one: abstract method using GL calls to get uniforms. This gets locations so an integer is set (the problem)
}
abstract fun getUniforms();//This is the one causing problems
abstract fun bindAttribs();//This would too if primitives or non-lateinit vars are set
}
abstract class BoilerplateShader() : Shader(){
var loc_projectionMatrix: Int = 404//404 is an initial value. This can be anything though
var loc_transformationMatrix: Int = 404
var loc_viewMatrix: Int = 404
override fun getUniforms(){
//These would be grabbed by using glGetUniformLocations, but it's reproducible with static values as well
loc_projectionMatrix = 0
loc_transformationMatrix = 1
loc_viewMatrix = 2
println(loc_projectionMatrix.toString() + ", " + loc_transformationMatrix + ", " + loc_viewMatrix)
}
//debug method, only used to show the values
fun dump(){
println(loc_projectionMatrix.toString() + ", " + loc_transformationMatrix + ", " + loc_viewMatrix)
}
}
class TextureShader() : BoilerplateShader(){
override fun bindAttribs() {
//This doesn't cause a problem even though it's called from the init block, as nothing is assigned
//bindAttrib(0, "a_position");
//bindAttrib(1, "a_texCoord0");
}
}
//Other repetitive shaders, omitted for brevity
Then doing:
val tx = TextureShader()
tx.dump()
prints:
0, 1, 2
404, 404, 404
The print statements are called in order from getUniforms to the dump call at the end. It's assigned fine in the getUniforms method, but when calling them just a few milliseconds later, they're suddenly set to the default value of (in this case) 404. This value can be anything though, but I use 404 because that's a value I know I won't use for testing in this particular MCVE.
I'm using a system that relies heavily on abstract classes, but calling some of these methods (getUniforms is extremely important) is a must. If I add an init block in either BoilerplateShader or TextureShader with a call to getUniforms, it works fine. Doing a workaround with an init function (not an init block) called after object creation:
fun init(){
bindAttribs();
getUniforms();
}
works fine. But that would require whoever creates the instance to call it manually:
val ts = TextureShader();
ts.init();
ts.dump()
which isn't an option. Writing the equivalent of the problematic Kotlin code in Java works as expected (considerably shortened, but still reproducible):
abstract class Shader{
public Shader(){
getUniforms();
}
public abstract void getUniforms();
}
abstract class BoilerplateShader extends Shader{
int loc_projectionMatrix;//When this is initialized, it produces the same issue as Kotlin. But Java doesn't require the vars to be initialized when they're declared globally, so it doesn't cause a problem
public void getUniforms(){
loc_projectionMatrix = 1;
System.out.println(loc_projectionMatrix);
}
//and a dump method or any kind of basic print statement to print it after object creation
}
class TextureShader extends BoilerplateShader {
public TextureShader(){
super();
}
}
and printing the value of the variable after initialization of both the variable and the class prints 0, as expected.
Trying to reproduce the same thing with an object produces the same result as with numbers when the var isn't lateinit. So this:
var test: String = ""
prints:
0, 1, 2, test
404, 404, 404,
The last line is exactly as printed: the value of test is set to an empty String by default, so it shows up as empty.
But if the var is declared as a lateinit var:
lateinit var test: String
it prints:
0, 1, 2, test
404, 404, 404, test
I can't declare primitives with lateinit. And since it's called outside a constructor, it either needs to be initialized or be declared as lateinit.
So, is it possible to initialize primitives from an overridden abstract method without creating a function to call it?
Edit:
A comment suggested a factory method, but that's not going to work because of the abstraction. Since the attempted goal is to call the methods from the base class (Shader), and since abstract classes can't be initialized, factory methods won't work without creating a manual implementation in each class, which is overkill. And if the constructor is private to get it to work (avoid initialization outside factory methods), extending won't work (<init> is private in Shader).
So the constructors are forced to be public (whether the Shader class has a primary or secondary constructor, the child classes have to have a primary to initialize it) meaning the shaders can be created while bypassing the factory method. And, abstraction causes problems again, the factory method (having to be abstract) would be manually implemented in each child class, once again resulting in initialization and manually calling the init() method.
The question is still whether or not it's possible to make sure the non-lateinit and primitives are initialized when calling an abstract method from the constructor. Creating factory methods would be a perfect solution had there not been abstraction involved.

Note: The best idea is to avoid assigning objects/primitives in abstract functions called from the abstract class's constructor, but there are cases where it's useful. Avoid it if possible.
The only workaround I found for this is using by lazy, since there are primitives involved and I can convert the assignments into lazy blocks.
lateinit would have made it slightly easier, so creating object wrappers could of course be an option, but using by lazy works in my case.
Anyway, what's happening here is that the value assigned to the Int during the superclass constructor is later overwritten by the property's own initializer, which runs only after the superclass constructor has finished. Pseudocode:
var x /* = 0 */
constructor() : super.constructor()//x is not initialized yet
super.constructor(){
overridden function();
}
abstract function()
overridden function() {
x = 4;
}
// The assignment of `= 0` takes place after the construction of the parent, setting x to 0 and overriding the value set in the constructor
With lateinit, the problem is removed:
lateinit var x: Integer//x exists, but doesn't get a value. It's assigned later
constructor() : super.constructor()
super.constructor(){
overridden function()
}
abstract function()
overridden function(){
x = Integer(4);//using an object here since Kotlin doesn't support lateinit with primitives
}
//x, being lateinit and now initialized, doesn't get re-initialized by the declaration. x = 4 instead of 0, as in the first example
When I wrote the question, I thought Java worked differently. This was because I didn't initialize the variables there either (effectively, making them lateinit). When the class then is fully initialized, int x; doesn't get assigned a value. If it was declared as int x = 1234;, the same problem in Java occurs as here.
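The Java behaviour described above can be checked with a small, self-contained sketch. The class and field names are hypothetical and only mirror the shader example. A field with an initializer is reset after the superclass constructor returns, while a field without one keeps the value assigned by the overridden method:

```java
public class InitOrderDemo {
    public abstract static class Base {
        public Base() {
            // Runs before any subclass field initializer has executed.
            assignLocations();
        }

        public abstract void assignLocations();
    }

    public static class WithInitializer extends Base {
        // This initializer runs after Base() returns and overwrites the 1 set below.
        public int loc = 404;

        @Override
        public void assignLocations() { loc = 1; }
    }

    public static class WithoutInitializer extends Base {
        // No initializer, so nothing overwrites the value set during Base().
        public int loc;

        @Override
        public void assignLocations() { loc = 1; }
    }

    public static void main(String[] args) {
        System.out.println(new WithInitializer().loc);    // 404
        System.out.println(new WithoutInitializer().loc); // 1
    }
}
```

Kotlin always requires an initializer (or lateinit) on a non-abstract property, which is why the Kotlin version behaves like WithInitializer.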
Now, the problem goes back to lateinit and primitives; primitives cannot be lateinit. A fairly basic solution is using a data class:
data class IntWrapper(var value: Int)
Since the value of data classes can be unpacked:
var (value) = intWrapperInstance//doing "var value = ..." sets value to the intWrapperInstance. With the parentheses it works the same way as unpacking the values of a pair or triple, just with a single value.
Now, since there's an instance with an object (not a primitive), lateinit can be used. However, this isn't particularly efficient since it involves another object being created.
The only remaining option: by lazy.
Wherever it's possible to create initialization as a function, this is the best option. The code in the question was a simplified version of OpenGL shaders (more specifically, the locations for uniforms). Meaning this particular code is fairly easy to convert to a by lazy block:
val projectionMatrixLocation by lazy{
glGetUniformLocation(program, "projectionMatrix")
}
Depending on the case though, this might not be feasible. Especially since by lazy requires a val, which means it isn't possible to change it afterwards. This depends on the usage though, since it isn't a problem if it isn't going to change.

Why the variable can't be initialized correctly in inline function as in java?

We know a lambda body is evaluated lazily, because if we never call the lambda, the code in its body is never executed.
We also know that in many languages with first-class functions, such as JavaScript, Ruby, and Groovy, a variable can be used in a function/lambda even if it is not yet initialized. For example, the Groovy code below works fine:
def foo
def lambda = { foo }
foo = "bar"
println(lambda())
// ^--- return "bar"
We also know we can access an uninitialized variable if the catch-block has initialized the variable when an Exception is raised in try-block in Java, for example:
// v--- m is not initialized yet
int m;
try{ throw new RuntimeException(); } catch(Exception ex){ m = 2;}
System.out.println(m);// println 2
If the lambda is lazy, why can't Kotlin use an uninitialized variable in a lambda? I know Kotlin is a null-safe language, so the compiler analyzes the code from top to bottom, including the lambda body, to make sure the variable is initialized. So the lambda body is not "lazy" at compile time. For example:
var a:Int
val lambda = { a }// lambda is never invoked
// ^--- a compile error thrown: variable is not initialized yet
a = 2
Q: But why does the code below also fail to compile? I don't understand it. A captured variable must be effectively final in Java, and if you want to change its value you must use an ObjectRef instead. This test also contradicts my previous conclusion that the lambda body is not lazy at compile time. For example:
var a:Int
run{ a = 2 }// a is initialized & inlined to callsite function
// v--- a compile error thrown: variable is not initialized yet
println(a)
So the only explanation I can think of is that the compiler can't be sure whether the element field in the ObjectRef is initialized or not, but @hotkey has rejected that idea. Why?
Q: Why don't Kotlin inline functions work even if I initialize the variable in the catch block, as in Java? For example:
var a: Int
try {
run { a = 2 }
} catch(ex: Throwable) {
a = 3
}
// v--- Error: `a` is not initialized
println(a)
But @hotkey has already mentioned in his answer that you should use a try-catch expression in Kotlin to initialize a variable. For example:
var a: Int = try {
run { 2 }
} catch(ex: Throwable) {
3
}
// v--- println 2
println(a);
Q: If that is the case, why don't I just call run directly? For example:
val a = run{2};
println(a);//println 2
However, the equivalent try-catch assignment works fine in Java. For example:
int a;
try {
a = 2;
} catch (Throwable ex) {
a = 3;
}
System.out.println(a); // println 2

Q: But why does the code below also fail to compile?
Because code can change. At the point where the lambda is defined the variable is not initialized, so if the code were changed and the lambda invoked directly afterwards, it would be invalid. The Kotlin compiler wants to make sure there is absolutely no way the uninitialized variable can be accessed before it is initialized, even by proxy.
Q: Why don't Kotlin inline functions work even if I initialize the variable in the catch block, as in Java?
Because run is not special and the compiler can't know when the body is executed. If you consider the possibility of run not being executed, then the compiler cannot guarantee that the variable will be initialized.
The changed example uses the try-catch expression to essentially execute a = run { 2 }, which is different from run { a = 2 } because a result is guaranteed by the return type.
Q: If that is the case, why don't I just call run directly?
That is essentially what happens. Regarding the final Java code, the fact is that Java does not follow exactly the same rules as Kotlin, and the same is true in reverse. Just because something is possible in Java does not mean it will be valid Kotlin.

You could make the variable lazy with the following...
val a: Int by lazy { 3 }
Obviously, you could use a function in place of the 3. But this allows the compiler to continue and guarantees that a is initialized before use.
Edit
Though the question seems to be "why can't it be done". I am in the same mind frame, that I don't see why not (within reason). I think the compiler has enough information to figure out that a lambda declaration is not a reference to any of the closure variables. So, I think it could show a different error when the lambda is used and the variables it references have not been initialized.
That said, here is what I would do if the compiler writers were to disagree with my assessment (or take too long to get around to the feature).
The following example shows a way to do a lazy local variable initialization (for version 1.1 and later)
import kotlin.reflect.*
//...
var a:Int by object {
private var backing : Int? = null
operator fun getValue(thisRef: Any?, property: KProperty<*>): Int =
backing ?: throw Exception("variable has not been initialized")
operator fun setValue(thisRef: Any?, property: KProperty<*>, value: Int) {
backing = value
}
}
var lambda = { a }
// ...
a = 3
println("a = ${lambda()}")
I used an anonymous object to show the guts of what's going on (and because lazy caused a compiler error). The object could be turned into function like lazy.
Now we are potentially back to a runtime exception if the programmer forgets to initialize the variable before it is referenced. But Kotlin did try at least to help us avoid that.

How to make a builder for a Kotlin data class with many immutable properties

I have a Kotlin data class that I am constructing with many immutable properties, which are being fetched from separate SQL queries. If I want to construct the data class using the builder pattern, how do I do this without making those properties mutable?
For example, instead of constructing via
var data = MyData(val1, val2, val3)
I want to use
builder.someVal(val1)
// compute val2
builder.someOtherVal(val2)
// ...
var data = builder.build()
while still using Kotlin's data class feature and immutable properties.

I agree with the data copy block in Grzegorz's answer, but it's essentially the same syntax as creating data classes with constructors. If you want to use that method and keep everything legible, you'll likely be computing everything beforehand and passing the values all together in the end.
To have something more like a builder, you may consider the following:
Let's say your data class is
data class Data(val text: String, val number: Int, val time: Long)
You can create a mutable builder version like so, with a build method to create the data class:
class Builder {
var text = "hello"
var number = 2
var time = System.currentTimeMillis()
internal fun build()
= Data(text, number, time)
}
Along with a builder method like so:
fun createData(action: Builder.() -> Unit): Data {
val builder = Builder()
builder.action()
return builder.build()
}
Action is a function from which you can modify the values directly, and createData will build it into a data class for you directly afterwards.
This way, you can create a data class with:
val data: Data = createData {
//execute stuff here
text = "new text"
//calculate number
number = -1
//calculate time
time = 222L
}
There are no setter methods per se, but you can directly assign the mutable variables with your new values and call other methods within the builder.
You can also make use of kotlin's get and set by specifying your own functions for each variable so it can do more than set the field.
There's also no need for returning the current builder class, as you always have access to its variables.
Additional note: If you care, createData can be shortened to this:
fun createData(action: Builder.() -> Unit): Data = with(Builder()) { action(); build() }
"With a new builder, apply our action and build"

I don't think Kotlin has native builders. You can always compute all values and create the object at the end.
If you still want to use a builder you will have to implement it by yourself. Check this question

There is no need for creating custom builders in Kotlin - in order to achieve builder-like semantics, you can leverage copy method - it's perfect for situations where you want to get object's copy with a small alteration.
data class MyData(val val1: String? = null, val val2: String? = null, val val3: String? = null)
val temp = MyData()
.copy(val1 = "1")
.copy(val2 = "2")
.copy(val3 = "3")
Or:
val empty = MyData()
val with1 = empty.copy(val1 = "1")
val with2 = with1.copy(val2 = "2")
val with3 = with2.copy(val3 = "3")
Since you want everything to be immutable, copying must happen at every stage.
Also, it's fine to have mutable properties in the builder as long as the result produced by it is immutable.

It's possible to mechanize the creation of the builder classes with annotation processors.
I just created ephemient/builder-generator to demonstrate this.
Note that currently, kapt works fine for generated Java code, but there are some issues with generated Kotlin code (see KT-14070). For these purposes this isn't an issue, as long as the nullability annotations are copied through from the original Kotlin classes to the generated Java builders (so that Kotlin code using the generated Java code sees nullable/non-nullable types instead of just platform types).
