Spark could not filter correctly? - serialization

I have run into a weird problem: the result is not correct.
I have a class called A, and it has a value called keyword.
I want to filter the RDD[A] down to the elements whose keyword is in a given set.
Spark environment:
version: 1.3.1
execution env: yarn-client
Here is the code:
class A ...
case class C(words: Set[String]) extends Serializable {
  def run(data: RDD[A])(implicit sc: SparkContext) = {
    data.collect { case x: A => x }.filter(y => words.contains(y.keyword)).foreach(println)
  }
}
// in main function
val data:RDD[A] = ....
val c = C(Set("abc"))
c.run(data)
The code above prints nothing. However, if I first bring the RDD[A] to the local machine, it prints something! E.g.
data.take(1000).collect { case x: A => x }.filter(y => words.contains(y.keyword)).foreach(println)
How could this happen?
Let me ask another related question: should I make case class C extend Serializable? I don't think it is necessary.

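The reason is simple. When you bring the data to your machine first (as data.take(1000) does), it is transferred over the network to the machine you are using (let's call it the client, or driver, of the Spark environment), and println then writes to your console. So far, everything behaves as expected. On an RDD, however, collect { case x: A => x } with a partial function is a transformation that returns another RDD; it is not the collect() action that returns the data to the driver. The foreach(println) therefore runs on a distributed RDD, and println is executed on the worker machines that hold the data. The function really is executed, but you won't see any output on the client console unless the client is also a worker: everything is printed to the stdout of the respective worker node. To print on the client, bring the filtered data back first, for example with collect() (no arguments) or take(n), and then call foreach(println) on the result.
No, you don't need to declare it: a Scala case class is already Serializable. What Spark serializes is the closure passed to filter, which needs words: Set[String]; because words is a field, the C instance that holds it is shipped along with the closure.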

How to convert Flux<Item> to List<Item> by blocking

Background
I have a legacy application where I need to return a List<Item>
There are many different Service classes each belonging to an ItemType.
Each service class calls a few different backend APIs and collects the responses to create a SubType of the Item.
So we can say that each service class implementation returns an Item.
All backend API access code uses WebClient, which returns a Mono of some type, and I can zip all the Monos within the service to create an Item.
The user should be able to look up many different types of items in one call. This requires many backend calls.
So for performance's sake I wanted to make all of this asynchronous using Reactor, and I introduced Spring Reactive code.
Problem
If my endpoint had to return Flux<Item>, this code would work fine.
But this is service code that is used by other legacy callers.
So eventually I want to return a List<Item>, but when I try to convert my Flux into a List I get an error:
"message": "block()/blockFirst()/blockLast() are blocking,
which is not supported in thread reactor-http-nio-3",
Here is the service, which is calling a few other service classes.
Flux<Item> itemFlux = Flux.fromIterable(searchRequestByItemType.entrySet())
        .flatMap(e ->
                getService(e.getKey()).searchItems(e.getValue()))
        .subscribeOn(Schedulers.boundedElastic());
// block() returns the collected list itself, not a Mono
List<Item> itemList = itemFlux
        .collectList()
        .block(); // This line throws the error
Here is what the above service is calling
default Flux<Item> searchItems(List<SingleItemSearchRequest> requests) {
    return Flux.fromIterable(requests)
            .flatMap(this::searchItem)
            .subscribeOn(Schedulers.boundedElastic());
}
Here is the single-item search used by the method above:
public Mono<Item> searchItem(SingleItemSearchRequest sisr) {
    return Mono.zip(backendApi.getItemANameApi(sisr.getItemIdentifiers().getItemId()),
                    sisr.isAddXXXDetails()
                            ? backendApi.getItemAXXXApi(sisr.getItemIdentifiers().getItemId())
                            : Mono.empty(),
                    sisr.isAddYYYDetails()
                            ? backendApi.getItemAYYYApi(sisr.getItemIdentifiers().getItemId())
                            : Mono.empty())
            .map(tuple3 -> Item.builder()
                    .name(tuple3.getT1())
                    .xxxDetails(tuple3.getT2())
                    .yyyDetails(tuple3.getT3())
                    .build()
            );
}
Sample project to replicate the problem:
https://github.com/mps-learning/spring-reactive-example
I’m new to spring reactor, feel free to pinpoint ALL errors in the code.
UPDATE
As per the Bonus suggestion from Patrick Hooijer, I updated the Mono.zip entries so that each always emits some default value.
@Override
public Mono<Item> searchItem(SingleItemSearchRequest sisr) {
    System.out.println("\t\tInside " + supportedItem() + " searchItem with thread " + Thread.currentThread().toString());
    // TODO: how to make these XXX and YYY calls conditional in a clear way?
    return Mono.zip(getNameDetails(sisr).defaultIfEmpty("Default Name"),
                    getXXXDetails(sisr).defaultIfEmpty("Default XXX Details"),
                    getYYYDetails(sisr).defaultIfEmpty("Default YYY Details"))
            .map(tuple3 -> Item.builder()
                    .name(tuple3.getT1())
                    .xxxDetails(tuple3.getT2())
                    .yyyDetails(tuple3.getT3())
                    .build()
            );
}
private Mono<String> getNameDetails(SingleItemSearchRequest sisr) {
    return mockBackendApi.getItemCNameApi(sisr.getItemIdentifiers().getItemId());
}
private Mono<String> getYYYDetails(SingleItemSearchRequest sisr) {
    return sisr.isAddYYYDetails()
            ? mockBackendApi.getItemCYYYApi(sisr.getItemIdentifiers().getItemId())
            : Mono.empty();
}
private Mono<String> getXXXDetails(SingleItemSearchRequest sisr) {
    return sisr.isAddXXXDetails()
            ? mockBackendApi.getItemCXXXApi(sisr.getItemIdentifiers().getItemId())
            : Mono.empty();
}
Edit: The answer below does not solve the issue, but it contains useful information about thread switching. It does not work because .block() is not a problem on non-blocking Schedulers when it is used to switch to synchronous code.
This is because the block operator inherited the reactor-http-nio-3 Thread from backendApi.getItemANameApi (or one of the other calls in Mono.zip), which is non-blocking.
Most operators continue working on the Thread on which the previous operator executed, because the Thread is linked to the emitted item. There are two groups of operators where the Thread of the output item differs from that of the input:
flatMap, concatMap, zip, etc: Operators that emit items from other Publishers will keep the Thread link they received from this inner Publisher, not from the input.
Time based operators like delayElements, interval, buffer(Duration), etc. will schedule their tasks on the provided Scheduler, or Schedulers.parallel() if none provided. The emitted items will then be linked to the Thread the task was scheduled on.
In your case, Mono.zip emits items from backendApi.getItemANameApi linked to reactor-http-nio-3, which gets propagated downstream, goes outside both the flatMap in searchItems and in itemFlux, until it reaches your block operator.
You can solve this by placing a .publishOn(Schedulers.boundedElastic()), either in searchItem, searchItems or itemFlux. This will cause the item to switch to a Thread in the provided Scheduler.
Bonus: Since you requested to pinpoint errors: Your Mono.zip will not work if sisr.isAddXXXDetails() is false, as Mono.zip discards any element it could not zip. Since you return a Mono.empty() in that case, no items can be zipped and it will return an empty Mono.
If we have only spring-boot-starter-webflux defined as an application dependency, then Spring Boot spins up a Netty server.
One is not expected to block() in a reactive application using a non-blocking server.
However, once we add the spring-boot-starter-web dependency, Spring Boot spins up a Tomcat server even when spring-boot-starter-webflux is present. Tomcat uses a thread-per-request model, in which blocking calls are expected.
So to solve my problem, all I had to do was add the spring-boot-starter-web dependency in pom.xml. After that, the application starts in Tomcat.
With Tomcat, .collectList().block() works in the Controller class to return the List<Item>.
Whereas with the Netty server I could return only Flux<Item>, not List<Item>, which is expected.

Neo4j 3.5's embedded database does not seem to persist data

I am trying to build a small command line tool that will store data in a Neo4j graph. To do this I have started experimenting with Neo4j 3.5's embedded databases. After putting together the following example, I have found that either the nodes I am creating are not being saved to the database or the way I create the database is overwriting my previous run.
The Example:
fun main() {
    // Spin up the database
    val graphDBFactory = GraphDatabaseFactory()
    val graphDB = graphDBFactory.newEmbeddedDatabase(File("src/main/resources/neo4j"))
    registerShutdownHook(graphDB)
    val tx = graphDB.beginTx()
    graphDB.createNode(Label.label("firstNode"))
    graphDB.createNode(Label.label("secondNode"))
    val result = graphDB.execute("MATCH (a) RETURN COUNT(a)")
    println(result.resultAsString())
    tx.success()
}
private fun registerShutdownHook(graphDb: GraphDatabaseService) {
    // Registers a shutdown hook for the Neo4j instance so that it
    // shuts down nicely when the VM exits (even if you "Ctrl-C" the
    // running application).
    Runtime.getRuntime().addShutdownHook(object : Thread() {
        override fun run() {
            graphDb.shutdown()
        }
    })
}
I would expect that every time I run main the resulting query count will increase by 2.
That is currently not the case and I can find nothing in the docs that references a different method of opening an already created embedded database. Am I trying to use the embedded database incorrectly or am I missing something? Any help or info would be appreciated.
build Info:
Kotlin jvm 1.4.21
Neo4j-community-3.5.35
Transactions in Neo4j 3.x follow a three-stage model:
1. create
2. success / failure
3. close
You missed the third stage, which is the one that actually commits or rolls back.
Since Transaction is AutoCloseable, you can use Kotlin's use to make sure it is closed.
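The three stages can be illustrated with a toy model. The sketch below is plain Java and does not use Neo4j at all: ToyStore and ToyTransaction are hypothetical stand-ins, assumed only to mimic the create / success / close lifecycle described above. It shows that marking success without closing leaves nothing committed, while try-with-resources (the Java counterpart of Kotlin's use) closes the transaction and so commits.

```java
import java.util.ArrayList;
import java.util.List;

class ToyTransactionDemo {
    /** Hypothetical in-memory store; a stand-in, not a Neo4j class. */
    static final class ToyStore {
        private final List<String> committed = new ArrayList<>();

        ToyTransaction beginTx() {
            return new ToyTransaction(this);
        }

        int size() {
            return committed.size();
        }
    }

    /** Hypothetical transaction that mimics the create / success / close stages. */
    static final class ToyTransaction implements AutoCloseable {
        private final ToyStore store;
        private final List<String> pending = new ArrayList<>();
        private boolean markedSuccessful = false;

        ToyTransaction(ToyStore store) {
            this.store = store;
        }

        void createNode(String label) {
            pending.add(label); // buffered until close()
        }

        void success() {
            markedSuccessful = true; // only a mark; nothing is written yet
        }

        @Override
        public void close() {
            if (markedSuccessful) {
                store.committed.addAll(pending); // commit
            }
            pending.clear(); // otherwise the pending work is rolled back
        }
    }

    /** Mirrors the question: success() is called, close() never is. */
    static int withoutClose(ToyStore store) {
        ToyTransaction tx = store.beginTx();
        tx.createNode("firstNode");
        tx.createNode("secondNode");
        tx.success();
        return store.size();
    }

    /** try-with-resources guarantees close(), which performs the commit. */
    static int withClose(ToyStore store) {
        try (ToyTransaction tx = store.beginTx()) {
            tx.createNode("firstNode");
            tx.createNode("secondNode");
            tx.success();
        }
        return store.size();
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        System.out.println(withoutClose(store)); // prints 0
        System.out.println(withClose(store));    // prints 2
    }
}
```

In the question's Kotlin code, the analogous change would presumably be to run the node creation and tx.success() inside graphDB.beginTx().use { tx -> ... }, so that close() is always called.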

Quarkus: execute parallel unis

In a Quarkus / Kotlin application, I want to start multiple database requests concurrently. I am new to Quarkus and I am not sure if I am doing things right:
val uni1 = Uni.createFrom().item(repo1).onItem().apply { it.request() }
val uni2 = Uni.createFrom().item(repo2).onItem().apply { it.request() }
return Uni.combine().all()
    .unis(uni1, uni2)
    .asTuple()
    .onItem()
    .apply { tuple ->
        Result(tuple.item1, tuple.item2)
    }
    .await()
    .indefinitely()
Will the request() calls really be made in parallel? Is this the right way to do it in Quarkus?
Yes, your code is right.
Uni.combine().all() runs all the passed Unis concurrently. You will get the tuple (containing the individual results) once all the Unis have completed (emitted a result).
In your code, you can remove the tuple step and use combinedWith instead.
Finally, note that await().indefinitely() blocks the caller thread, and blocks it forever if one of the Unis never completes (for whatever reason). I strongly recommend using await().atMost(...) instead.
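The difference between an unbounded and a bounded wait can be sketched without Mutiny. The example below is a plain-JDK analog using CompletableFuture, not the Mutiny API: BoundedWaitDemo, Result, slow and fetchBoth are hypothetical names made up for illustration. It starts two requests concurrently, combines their results, and gives up after a timeout instead of blocking the caller forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class BoundedWaitDemo {
    /** Hypothetical holder for the two combined results. */
    static final class Result {
        final String first;
        final String second;

        Result(String first, String second) {
            this.first = first;
            this.second = second;
        }
    }

    /** Hypothetical request that takes `millis` milliseconds to answer. */
    static Supplier<String> slow(String value, long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return value;
        };
    }

    /** Runs both requests concurrently and waits at most `timeoutMillis` for the pair. */
    static Result fetchBoth(Supplier<String> request1, Supplier<String> request2, long timeoutMillis) {
        CompletableFuture<String> f1 = CompletableFuture.supplyAsync(request1);
        CompletableFuture<String> f2 = CompletableFuture.supplyAsync(request2);
        try {
            // Bounded wait: the caller is released after the timeout at the latest
            return f1.thenCombine(f2, Result::new).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("requests did not complete in time", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a request failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        Result ok = fetchBoth(slow("a", 50), slow("b", 50), 1000);
        System.out.println(ok.first + " " + ok.second);
        try {
            fetchBoth(slow("x", 500), slow("y", 10), 100);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In Mutiny itself the bounded form is await().atMost(...), which takes a Duration; the timeout value is a design choice that depends on how long the slowest request may reasonably take.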

Lumen - seeder in Unit tests

I'm trying to implement unit tests in my company's project, and I'm running into some weird trouble trying to use a separate set of data in my database.
As I want tests to be performed in a confined environment, I'm looking for the easiest way to input data in a dedicated database. Long story short, to this extent, I decided to use a MySQL dump of inserted data.
This is basically my seeder code:
public function run()
{
    \Illuminate\Support\Facades\DB::unprepared(file_get_contents(__DIR__ . '/data1.sql'));
}
Now here's the problem.
In my unit test, I can call the seeder, but:
If I call the seeder in setUpBeforeClass(), it works. However, that doesn't fit my needs, as I want to be able to load different sets of data for different tests.
If I call the seeder within a test, the data is never inserted in the database (either with or without the transaction trait).
If I use DB::insert, without a raw SQL file, instead of ::raw, ::unprepared or ::statement, it works. But my inserts are too complicated for that.
Here are a few things I tried, all with the same result:
DB::raw(file_get_contents(__DIR__.'/database/data1.sql'));
DB::statement(file_get_contents(__DIR__ . '/database/data1.sql'));
$seeder = new CheckTestSeeder();
$seeder->run();
\Illuminate\Support\Facades\Artisan::call('db:seed', ['--class' => 'CheckTestSeeder']);
$this->seeInDatabase('jackpot.progressive', [
'name_progressive' => 'aaa'
]);
Any pointers on how to proceed and why I have different behaviors if I do that in the setUpBeforeClass() and within the test would be appreciated!
You can use the Illuminate\Foundation\Testing\RefreshDatabase trait. If you need something more, you can override the refreshTestDatabase method of the RefreshDatabase trait:
protected function refreshTestDatabase()
{
    // parent:: reaches the trait's version only when a parent test class uses the trait
    parent::refreshTestDatabase();
    \Illuminate\Support\Facades\Artisan::call('db:seed', ['--class' => 'CheckTestSeeder']);
}

How can I encapsulate the session/transaction acquisition into the lazy-init of relations in Squeryl?

I am trying to implement a One-To-Many relation using Squeryl, and following the instructions on their site.
The documentation gives the following example:
object SchoolDb extends Schema {
  val courses = table[Course]
  val subjects = table[Subject]
  val subjectToCourses =
    oneToManyRelation(subjects, courses).
      via((s, c) => s.id === c.subjectId)
}
class Course(val subjectId: Long) extends SchoolDb2Object {
  lazy val subject: ManyToOne[Subject] = SchoolDb.subjectToCourses.right(this)
}
class Subject(val name: String) extends SchoolDb2Object {
  lazy val courses: OneToMany[Course] = SchoolDb.subjectToCourses.left(this)
}
I find that any call to Course.subject or Subject.courses needs to be wrapped in a transaction. However, one of my goals in using an ORM is to hide these details from callers. As such, I don't want the calling code to have to wrap access to these fields in a transaction.
It seems that if I modify the example to wrap the lazy init function in a transaction, like so:
class Subject(val name: String) extends SchoolDb2Object {
  lazy val courses: OneToMany[Course] = {
    inTransaction {
      SchoolDb.subjectToCourses.left(this)
    }
  }
}
I get the following exception:
Exception in thread "main" java.lang.RuntimeException: no session is bound to current thread, a session must be created via Session.create
and bound to the thread via 'work' or 'bindToCurrentThread'
at scala.Predef$.error(Predef.scala:58)
at org.squeryl.Session$$anonfun$currentSession$1.apply(Session.scala:111)
at org.squeryl.Session$$anonfun$currentSession$1.apply(Session.scala:111)
at scala.Option.getOrElse(Option.scala:104)
at org.squeryl.Session$.currentSession(Session.scala:110)
at org.squeryl.dsl.AbstractQuery.org$squeryl$dsl$AbstractQuery$$_dbAdapter(AbstractQuery.scala:116)
at org.squeryl.dsl.AbstractQuery$$anon$1.<init>(AbstractQuery.scala:120)
at org.squeryl.dsl.AbstractQuery.iterator(AbstractQuery.scala:118)
at org.squeryl.dsl.DelegateQuery.iterator(DelegateQuery.scala:9)
But, like I said, if I wrap the caller in a transaction, then everything works.
So, how can I encapsulate the fact that this object is backed by a database in the object itself?
I assume you get this error in calls on the courses object?
I don't know very much about how Squeryl works, but I believe that the OneToMany[Course] is a live object. That means that calls on the courses object need a session, since any call may lazily go to the database to fetch data.
How you organise this depends on the type of application. In a web application it often makes sense to add a filter (the first point of entry) that starts and stops the transaction. In a GUI client, say a Swing application, a good solution is to start the transaction at the point where you receive the user interaction. That way you get transactions that are not too long and that also span the calls you expect to be performed atomically (either fully or not at all).
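The idea of opening the transaction at the entry point, rather than around the creation of the relation, can be sketched with a toy model. The code below is plain Java and does not use Squeryl: SessionScopeDemo, LazyCourses, Subject and handleRequest are hypothetical names, and the thread-bound flag is only assumed to behave like the session described in the exception message. It shows that building the relation inside a transaction does not help when the relation is read later, while wrapping the entry point does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

class SessionScopeDemo {
    // Hypothetical stand-in for a session bound to the current thread; not Squeryl's API
    private static final ThreadLocal<Boolean> SESSION_BOUND = ThreadLocal.withInitial(() -> false);

    /** Binds a session for the duration of `work`, then restores the previous state. */
    static <T> T inTransaction(Supplier<T> work) {
        boolean previouslyBound = SESSION_BOUND.get();
        SESSION_BOUND.set(true);
        try {
            return work.get();
        } finally {
            SESSION_BOUND.set(previouslyBound);
        }
    }

    /** A "live" relation: it needs the session when it is read, not when it is built. */
    static final class LazyCourses {
        List<String> fetch() {
            if (!SESSION_BOUND.get()) {
                throw new IllegalStateException("no session is bound to current thread");
            }
            return Arrays.asList("algebra", "geometry");
        }
    }

    static final class Subject {
        // Building the relation inside a transaction, as the question tried;
        // the transaction is already over by the time fetch() is called
        final LazyCourses courses = inTransaction(LazyCourses::new);
    }

    /** Entry point, playing the role of a web filter or a UI event handler. */
    static List<String> handleRequest(Subject subject) {
        return inTransaction(() -> subject.courses.fetch());
    }

    public static void main(String[] args) {
        Subject subject = new Subject();
        try {
            subject.courses.fetch();
        } catch (IllegalStateException e) {
            System.out.println("outside a transaction: " + e.getMessage());
        }
        System.out.println("inside a transaction: " + handleRequest(subject));
    }
}
```

The caller of handleRequest never mentions a transaction, which is the encapsulation the question asks for, but it is achieved one level up from the domain object rather than inside it.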