Parallel processing Flux::groupBy hangs - spring-webflux

I'm using Reactor 3.4.18 and have a question about Flux.groupBy. I generated 1000 integers and split them into 100 groups. I expected each group to be processed on a separate thread, but the test hangs after a few integers are processed.
@Test
void shouldGroupByKeyAndProcessInParallel() {
    final Scheduler scheduler = Schedulers.newParallel("group", 1000);
    StepVerifier.create(Flux.fromStream(IntStream.range(0, 1000).boxed())
            .groupBy(integer -> integer % 100)
            .flatMap(groupedFlux -> groupedFlux
                    .subscribeOn(scheduler) // this line doesn't help
                    .doOnNext(integer -> log.info("processing {}:{}", groupedFlux.key(), integer)),
                2)
        )
        .expectNextCount(1000)
        .verifyComplete();
}
test execution logs:
10:47:58.670 [main] DEBUG reactor.util.Loggers - Using Slf4j logging framework
10:47:58.846 [group-1] INFO com.huawei.hwclouds.coney.spike.FluxGroupByTest - processing 0:0
10:47:58.866 [group-1] INFO com.huawei.hwclouds.coney.spike.FluxGroupByTest - processing 1:1
10:47:58.867 [group-1] INFO com.huawei.hwclouds.coney.spike.FluxGroupByTest - processing 0:100
10:47:58.867 [group-1] INFO com.huawei.hwclouds.coney.spike.FluxGroupByTest - processing 1:101
10:47:58.867 [group-1] INFO com.huawei.hwclouds.coney.spike.FluxGroupByTest - processing 0:200
10:47:58.867 [group-1] INFO com.huawei.hwclouds.coney.spike.FluxGroupByTest - processing 1:201
-------- start hanging ----------
I changed the flatMap concurrency to 2 to speed up the reproduction. I expected that a lower flatMap concurrency would only slow down the overall processing, not cause a hang.
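For reference, here is a sketch of the direction the Flux.groupBy javadoc points to, not a verified fix: groupBy can hang when the number of open groups exceeds what the downstream consumes concurrently, for example a flatMap whose concurrency is lower than the group cardinality. Raising the concurrency to at least 100, and using publishOn (assumed here instead of subscribeOn) to move each group onto the scheduler, should let every group be drained:
// Sketch only: flatMap concurrency >= number of groups (100), as the
// groupBy javadoc requires for all open groups to be consumed downstream.
@Test
void shouldGroupByKeyAndProcessInParallelSketch() {
    final Scheduler scheduler = Schedulers.newParallel("group", 100);
    StepVerifier.create(Flux.fromStream(IntStream.range(0, 1000).boxed())
            .groupBy(integer -> integer % 100)
            .flatMap(groupedFlux -> groupedFlux
                    .publishOn(scheduler) // shift each group's processing onto the scheduler
                    .doOnNext(integer -> log.info("processing {}:{}", groupedFlux.key(), integer)),
                100) // concurrency covers all 100 open groups
        )
        .expectNextCount(1000)
        .verifyComplete();
}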

Related

Calling Karate feature file returns response object including multiple copies of previous response object of parent scenario

I am investigating an exponential increase in Java heap size when executing complex scenarios, especially ones with multiple reusable scenarios. This is my attempt to troubleshoot the issue with a simple example and a possible explanation of the JVM heap usage.
Environment: Karate 1.1.0.RC4 | JDK 14 | Maven 3.6.3
Example: Download the project, extract it, and execute the Maven command as per the README
Observation: As per the following example, if we call the same scenario multiple times, the response object grows exponentially, since it includes the response from the previously called scenario along with copies of the global variables.
#unexpected
Scenario: Not over-writing nested variable
* def response = call read('classpath:examples/library.feature#getLibraryData')
* string out = response
* def resp1 = response.randomTag
* karate.log('FIRST RESPONSE SIZE = ', out.length)
* def response = call read('classpath:examples/library.feature#getLibraryData')
* string out = response
* def resp2 = response.randomTag
* karate.log('SECOND RESPONSE SIZE = ', out.length)
Output:
10:26:23.863 [main] INFO c.intuit.karate.core.FeatureRuntime - scenario called at line: 9 by tag: getLibraryData
10:26:23.875 [main] INFO c.intuit.karate.core.FeatureRuntime - scenario called at line: 14 by tag: libraryData
10:26:23.885 [main] INFO com.intuit.karate - FIRST RESPONSE SIZE = 331
10:26:23.885 [main] INFO c.intuit.karate.core.FeatureRuntime - scenario called at line: 9 by tag: getLibraryData
10:26:23.894 [main] INFO c.intuit.karate.core.FeatureRuntime - scenario called at line: 14 by tag: libraryData
10:26:23.974 [main] INFO com.intuit.karate - SECOND RESPONSE SIZE = 1783
10:26:23.974 [main] INFO c.intuit.karate.core.FeatureRuntime - scenario called at line: 9 by tag: getLibraryData
10:26:23.974 [main] INFO c.intuit.karate.core.FeatureRuntime - scenario called at line: 14 by tag: libraryData
10:26:23.988 [main] INFO com.intuit.karate - THIRD RESPONSE SIZE = 8009
Do we really need to include the response and global variables in the response of a called feature file (non-shared scope)?
When we read a large JSON file and call multiple reusable scenario files, a copy of the read JSON data gets added to the response object each time. Is there a way to avoid this behavior?
Is there a better way to script complex tests using reusable scenarios without ending up with multiple copies of the same variables?
Okay, can you look at this issue:
https://github.com/intuit/karate/issues/1675
I agree we can optimize the response and global variables. Would be great if you can contribute code.

Subscribe does not print any log when using publishOn in Project Reactor

I've got a very simple stream based on the book Hands-On Reactive Programming in Spring 5.
Flux.just(1, 2, 3).publishOn(Schedulers.elastic())
    .concatMap(i -> Flux.range(0, i).publishOn(Schedulers.elastic()))
    .subscribe(log::info);
However, there's no console output at all. But if I add doOnNext after just:
Flux.just(1, 2, 3).doOnNext(log::debug).publishOn(Schedulers.elastic())
    .concatMap(i -> Flux.range(0, i).publishOn(Schedulers.elastic()))
    .subscribe(log::info);
then I get both the debug and the info output. May I know why?
Edit 1:
Here is the console output of the following stream:
Flux.just(1, 2, 3).doOnNext(log::debug)
    .publishOn(Schedulers.elastic()).doOnNext(log::warn)
    .concatMap(i -> Flux.range(0, i).publishOn(Schedulers.elastic()))
    .subscribe(log::info);
And output:
[main] INFO ReactiveTest - 1
[main] INFO ReactiveTest - 2
[elastic-2] WARN ReactiveTest - 1
[main] INFO ReactiveTest - 3
[elastic-2] DEBUG ReactiveTest - 0
[elastic-2] WARN ReactiveTest - 2
[elastic-2] DEBUG ReactiveTest - 0
[elastic-2] DEBUG ReactiveTest - 1
[elastic-2] WARN ReactiveTest - 3
[elastic-2] DEBUG ReactiveTest - 0
[elastic-2] DEBUG ReactiveTest - 1
[elastic-2] DEBUG ReactiveTest - 2
I think the log messages prove that the function passed to subscribe is called on the same thread as the function in concatMap.
Your first program is probably terminating right after you call subscribe. From the docs of subscribe:
Keep in mind that since the sequence can be asynchronous, this will immediately
return control to the calling thread. This can give the impression the consumer is
not invoked when executing in a main thread or a unit test for instance.
In the second program, doOnNext is invoked in the middle of the processing, so the pipeline has time to output all the results before the program exits. If you run the program many times, you will see that it is sometimes not able to output the second log.
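A minimal sketch of that point, assuming the goal is only to observe the output from a plain main method (Schedulers.boundedElastic() is used here because elastic() is deprecated in recent Reactor versions): replacing subscribe() with blockLast() keeps the calling thread alive until the sequence completes, so the race with JVM shutdown disappears.
import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

public class ReactiveBlockingSketch {
    public static void main(String[] args) {
        Flux.just(1, 2, 3)
                // boundedElastic() replaces the deprecated elastic() scheduler
                .publishOn(Schedulers.boundedElastic())
                .concatMap(i -> Flux.range(0, i).publishOn(Schedulers.boundedElastic()))
                .doOnNext(i -> System.out.println(Thread.currentThread().getName() + " - " + i))
                // blockLast() waits for completion; subscribe() returns immediately
                // and lets the main thread (and the JVM) exit before output appears
                .blockLast();
    }
}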

Failure importing data from BigQuery to GCS

Dear support at Google,
We recently noticed that many of the GAP site import jobs extracting and uploading data from Google BigQuery to Google Cloud Storage (GCS) have been failing since April 4th. Our upload jobs ran fine before April 4th but have been failing ever since. After investigating, we believe this is an issue/error on the BigQuery side, not with our job. The details of the error info from the BigQuery API when uploading data are shown below:
216769 [main] INFO  org.mortbay.log  - Dataset : 130288123
217495 [main] INFO  org.mortbay.log  - Job is PENDING waiting 10000 milliseconds...
227753 [main] INFO  org.mortbay.log  - Job is PENDING waiting 10000 milliseconds...
237995 [main] INFO  org.mortbay.log  - Job is PENDING waiting 10000 milliseconds...
Heart beat
248208 [main] INFO  org.mortbay.log  - Job is PENDING waiting 10000 milliseconds..
258413 [main] INFO  org.mortbay.log  - Job is PENDING waiting 10000 milliseconds...
268531 [main] INFO  org.mortbay.log  - Job is RUNNING waiting 10000 milliseconds...
Heart beat
278675 [main] INFO  org.mortbay.log  - An internal error has occurred
278675 [main] INFO  org.mortbay.log  - ErrorProto : null
 
As per the log, it is an internal error with ErrorProto: null.
 
Our google account: ea.eadp#gmail.com
 
Our Google BigQuery projects:
Origin-BQ              origin-bq-1
Pulse-web             lithe-creek-712
The import failure occurs on the following datasets:
 
In Pulse-web, lithe-creek-712:
101983605
130288123
48135564
56570684
57740926
64736126
64951872
72220498
72845162
73148296
77517207
86821637
 
 
Please look into this and let us know if you have any updates.
Thank you very much; we look forward to hearing back from you.
 
Thanks

Spark execution occasionally gets stuck at mapPartitions at Exchange.scala:44

I am running a Spark job on a two-node standalone cluster (v1.0.1).
Spark execution often gets stuck at the task mapPartitions at Exchange.scala:44.
This happens at the final stage of my job in a call to saveAsTextFile (as I expect from Spark's lazy execution).
It is hard to diagnose the problem because I never experience it in local mode with local IO paths, and occasionally the job on the cluster does complete as expected with the correct output (same output as with local mode).
This seems possibly related to reading a ~170 MB file from S3 immediately beforehand, as I see the following logging in the console:
DEBUG NativeS3FileSystem - getFileStatus returning 'file' for key '[PATH_REMOVED].avro'
INFO FileInputFormat - Total input paths to process : 1
DEBUG FileInputFormat - Total # of splits: 3
...
INFO DAGScheduler - Submitting 3 missing tasks from Stage 32 (MapPartitionsRDD[96] at mapPartitions at Exchange.scala:44)
DEBUG DAGScheduler - New pending tasks: Set(ShuffleMapTask(32, 0), ShuffleMapTask(32, 1), ShuffleMapTask(32, 2))
The last logging I see before the task apparently hangs/gets stuck is:
INFO NativeS3FileSystem: INFO NativeS3FileSystem: Opening key '[PATH_REMOVED].avro' for reading at position '67108864'
Has anyone else experienced non-deterministic problems related to reading from S3 in Spark?

Kitchen getting killed

I am using Pentaho Data Integration for ETL. I am running the job on an Ubuntu server as a shell script. It runs for some time, after which it gets killed without throwing any error. Please help me figure out what the problem is, and tell me if I am missing anything.
LOG:
INFO 14-03 11:46:52,369 - set_var - Dispatching started for transformation [set_variable]
INFO 14-03 11:46:52,370 - Get rows from result - Finished processing (I=0, O=0, R=1, W=1, U=0, E=
INFO 14-03 11:46:52,370 - Set var - Setting environment variables...
INFO 14-03 11:46:52,371 - Set var - Set variable BOOK_PATH to value [...........]
INFO 14-03 11:46:52,371 - Set var - Set variable FOLDER_NAME to value [...........]
INFO 14-03 11:46:52,375 - Set var - Finished after 1 rows.
INFO 14-03 11:46:52,375 - Set var - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)
INFO 14-03 11:46:52,377 - validate - Starting entry [file]
INFO 14-03 11:46:52,378 - file - Loading transformation from XML file
INFO 14-03 11:46:52,386 - file - Dispatching started for transformation [file][file:///c:/check/file.txt]
INFO 14-03 11:46:52,390 - path - Optimization level set to 9.
INFO 14-03 11:46:52,391 - filename - Finished processing (I=0, O=0, R=0, W=13, U=0, E=0)
INFO 14-03 11:46:52,403 - path - Finished processing (I=0, O=0, R=13, W=13, U=0, E=0)
INFO 14-03 11:46:52,407 - filenames - Finished processing (I=0, O=14, R=13, W=13, U=0, E=0)
INFO 14-03 11:46:52,409 - validate - Starting entry [Check_database]
INFO 14-03 11:46:52,410 - Check_database - Loading transformation from XML file[file:///c:/check/missing.ktr]
INFO 14-03 11:46:52,418 - count - Dispatching started for transformation [count]
INFO 14-03 11:46:52,432 - count - Finished reading query, closing connection.
INFO 14-03 11:46:52,433 - Set var - Setting environment variables...
INFO 14-03 11:46:52,433 - count - Finished processing (I=1, O=0, R=0, W=1, U=0, E=0)
INFO 14-03 11:46:52,433 - Set var - Set variable Count to value [0]
INFO 14-03 11:46:52,436 - Set var - Finished after 1 rows.
INFO 14-03 11:46:52,436 - Set var - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)
Killed
Most likely your machine is unable to meet the memory demand of your transformation and is therefore silently killing the process. You can modify the PDI memory allocation by tweaking the PENTAHO_DI_JAVA_OPTIONS option in your spoon.sh file. I had the same issue; below is what I did:
I created an environment variable PENTAHO_DI_JAVA_OPTIONS on my system. If this variable is not set, Kitchen uses its defaults. Creating this system variable gave me control to decrease or increase the memory allocation according to the transformation's complexity (at least for testing on my local machine).
My machine had 8 GB of RAM, and all the running processes, including Kitchen, had already used it up, so I reduced the PDI memory demand with export PENTAHO_DI_JAVA_OPTIONS="-Xms1g -Xmx3g", i.e. a minimum of 1 GB and a maximum of 3 GB.
I might be wrong, but it worked for me, even though the transformation then threw a GC memory error. At least it was not killed silently.
I did not have to apply the above setting on a dedicated standalone server, since no process other than PDI was running on it.
I was using PDI 8.3 on macOS Big Sur when the process got killed silently.
Hope this helps someone :).
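As a small aside (my own addition, not part of the answer above): one hedged way to confirm that the -Xms/-Xmx values from PENTAHO_DI_JAVA_OPTIONS actually reach a JVM is to run a tiny program with the same options and print the heap limits it reports.
// Hypothetical helper, not part of PDI: run it with the same flags you put in
// PENTAHO_DI_JAVA_OPTIONS, e.g. java -Xms1g -Xmx3g HeapCheck
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        System.out.println("max heap:   " + (rt.maxMemory() / mb) + " MB");
        System.out.println("total heap: " + (rt.totalMemory() / mb) + " MB");
        System.out.println("free heap:  " + (rt.freeMemory() / mb) + " MB");
    }
}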
Most likely you are running out of memory. Check your machine's resources while running the ETL.