java.io.IOException: Error closing multipart upload - amazon-s3

I am working on PySpark code that processes terabytes of data and writes to S3. After processing the data, I get the error below:
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.io.IOException: Error closing multipart upload
at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.doMultiPartUpload(MultipartUploadOutputStream.java:441)
at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.close(MultipartUploadOutputStream.java:421)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:74)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:108)
at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
I tried setting the configurations below, but I still get the same error:
self._spark_session.conf.set("spark.hadoop.fs.s3a.multipart.threshold", 2097152000)
self._spark_session.conf.set("spark.hadoop.fs.s3a.multipart.size", 104857600)
self._spark_session.conf.set("spark.hadoop.fs.s3a.connection.maximum", 500)
self._spark_session.conf.set("spark.hadoop.fs.s3a.connection.timeout", 600000)
self._spark_session.conf.set("spark.hadoop.fs.s3.maxRetries", 50)
Can someone please help me resolve this issue?

Related

serialization issue with kafka connect and redis sink

I have created a table in ksqlDB with both the key and the value formats explicitly set to Avro, like below.
create table my_table (   -- table name was omitted in the original statement; placeholder name used here
    region varchar(10) primary key,
    male_count integer,
    female_count integer
) with (
    kafka_topic='test',
    key_format='avro',
    value_format='avro',
    partitions=1,
    replicas=1
);
Now I want the data from ksqlDB to be sunk to Redis using Kafka Connect, but I am running into the record conversion issue below.
org.apache.kafka.connect.errors.ConnectException: failed to convert record
at io.github.jaredpetersen.kafkaconnectredis.sink.RedisSinkTask.put(RedisSinkTask.java:101)
at io.github.jaredpetersen.kafkaconnectredis.sink.RedisSinkTask.put(RedisSinkTask.java:90)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:601)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:333)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:234)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:203)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:188)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.kafka.connect.errors.ConnectException: unsupported command schema io.confluent.ksql.avro_schemas.KsqlDataSourceSchema
at io.github.jaredpetersen.kafkaconnectredis.sink.writer.RecordConverter.convert(RecordConverter.java:57)
at io.github.jaredpetersen.kafkaconnectredis.sink.RedisSinkTask.put(RedisSinkTask.java:98)
... 12 more
{"type":"log", "host":"ckaf-kc-csf-kafka-connect-redis-55cb89b7bb-v2rtz", "level":"ERROR", "neid":"kafka-connect-455e22730a28462c969de25d9f54451e", "system":"kafka-connect", "time":"2022-10-18T10:16:57.224Z", "timezone":"UTC", "log":{"message":"task-thread-kafka-connect-redis-18-0 - org.apache.kafka.connect.runtime.WorkerSinkTask - WorkerSinkTask{id=kafka-connect-redis-18-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: failed to convert record"}}
org.apache.kafka.connect.errors.ConnectException: failed to convert record
at io.github.jaredpetersen.kafkaconnectredis.sink.RedisSinkTask.put(RedisSinkTask.java:101)
at io.github.jaredpetersen.kafkaconnectredis.sink.RedisSinkTask.put(RedisSinkTask.java:90)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:601)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:333)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:234)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:203)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:188)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.kafka.connect.errors.ConnectException: unsupported command schema io.confluent.ksql.avro_schemas.KsqlDataSourceSchema
at io.github.jaredpetersen.kafkaconnectredis.sink.writer.RecordConverter.convert(RecordConverter.java:57)
at io.github.jaredpetersen.kafkaconnectredis.sink.RedisSinkTask.put(RedisSinkTask.java:98)
... 12 more
{"type":"log", "host":"ckaf-kc-csf-kafka-connect-redis-55cb89b7bb-v2rtz", "level":"ERROR", "neid":"kafka-connect-455e22730a28462c969de25d9f54451e", "system":"kafka-connect", "time":"2022-10-18T10:16:57.224Z", "timezone":"UTC", "log":{"message":"task-thread-kafka-connect-redis-18-0 - org.apache.kafka.connect.runtime.WorkerTask - WorkerSinkTask{id=kafka-connect-redis-18-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted"}}
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:631)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:333)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:234)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:203)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:188)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.kafka.connect.errors.ConnectException: failed to convert record
at io.github.jaredpetersen.kafkaconnectredis.sink.RedisSinkTask.put(RedisSinkTask.java:101)
at io.github.jaredpetersen.kafkaconnectredis.sink.RedisSinkTask.put(RedisSinkTask.java:90)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:601)
... 10 more
Caused by: org.apache.kafka.connect.errors.ConnectException: unsupported command schema io.confluent.ksql.avro_schemas.KsqlDataSourceSchema
at io.github.jaredpetersen.kafkaconnectredis.sink.writer.RecordConverter.convert(RecordConverter.java:57)
at io.github.jaredpetersen.kafkaconnectredis.sink.RedisSinkTask.put(RedisSinkTask.java:98)
============================
The Redis sink connector properties are shown below:
"name": "kafka-connect-redis-103",
"config": {
"connector.class": "io.github.jaredpetersen.kafkaconnectredis.sink.RedisSinkConnector",
"tasks.max": "1",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "our service url",
"key.converter.enhanced.avro.schema.support": "true",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "our service url",
"value.converter.enhanced.avro.schema.support": "true",
"topics": "test",
"redis.client.mode": "Cluster",
"redis.uri": "our service url",
"redis.operation.timeout.ms": 10000,
"redis.password": "ciredis",
"redis.database":3
}
What is wrong here? I have specified the Avro converter properties, yet I still get the record conversion error.
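In case it helps with diagnosing this, below is a minimal sketch of how the schemas that ksqlDB actually registered for the topic could be inspected through the Schema Registry REST API (the registry URL is a placeholder for our service URL):
import requests

SCHEMA_REGISTRY_URL = "http://schema-registry:8081"  # placeholder for our service url

# ksqlDB registers the value schema under the subject "<topic>-value"
# and, with key_format='avro', the key schema under "<topic>-key".
for subject in ("test-value", "test-key"):
    resp = requests.get(f"{SCHEMA_REGISTRY_URL}/subjects/{subject}/versions/latest")
    resp.raise_for_status()
    print(subject, "->", resp.json()["schema"])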
Help here would be appreciated.

Alfresco LDAP batch sync

I am trying to connect my Alfresco instance to our LDAP server to authenticate users.
My configuration:
# LDAP Authentication
authentication.chain=alfrescoNtlm1:alfrescoNtlm,ldap1:ldap
ldap.authentication.active=true
ldap.authentication.java.naming.provider.url=ldap://myurl:389
ldap.authentication.userNameFormat=dc=example,dc=com
ldap.authentication.java.naming.security.authentication=simple
ldap.synchronization.java.naming.security.principal=cn\=myCN,ou\=admin,dc\=example,dc\=com
ldap.synchronization.java.naming.security.credentials=secret
ldap.authentication.allowGuestLogin=false
ldap.synchronization.userSearchBase=ou\=users,dc\=example,dc\=com
ldap.synchronization.groupSearchBase=dc\=example,dc\=com
ldap.synchronization.attributeBatchSize=200
ldap.synchronization.queryBatchSize=200
The problem is that I hit the size limit of the LDAP server every time; it doesn't seem like the batch size is used. I cannot raise the size limit of the LDAP server. Is there a way to process the user data in batches?
Alfresco throws the following error:
2021-04-01 13:28:54,863 ERROR [org.alfresco.repo.security.sync.ChainingUserRegistrySynchronizer] [localhost-startStop-1] Synchronization aborted due to error
org.alfresco.error.AlfrescoRuntimeException: 03010018 Error during LDAP Search. Reason:[LDAP: error code 4 - Sizelimit Exceeded]
at org.alfresco.repo.security.sync.ldap.LDAPUserRegistry.processQuery(LDAPUserRegistry.java:1335)
at org.alfresco.repo.security.sync.ldap.LDAPUserRegistry.access$14(LDAPUserRegistry.java:1287)
at org.alfresco.repo.security.sync.ldap.LDAPUserRegistry$PersonCollection.<init>(LDAPUserRegistry.java:1524)
at org.alfresco.repo.security.sync.ldap.LDAPUserRegistry.getPersons(LDAPUserRegistry.java:573)
at org.alfresco.repo.security.sync.ChainingUserRegistrySynchronizer.syncWithPlugin(ChainingUserRegistrySynchronizer.java:1775)
at org.alfresco.repo.security.sync.ChainingUserRegistrySynchronizer.synchronizeInternal(ChainingUserRegistrySynchronizer.java:739)
at org.alfresco.repo.security.sync.ChainingUserRegistrySynchronizer.access$16(ChainingUserRegistrySynchronizer.java:474)
at org.alfresco.repo.security.sync.ChainingUserRegistrySynchronizer$7.doWork(ChainingUserRegistrySynchronizer.java:2138)
at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:555)
at org.alfresco.repo.security.sync.ChainingUserRegistrySynchronizer.onBootstrap(ChainingUserRegistrySynchronizer.java:2132)
at org.springframework.extensions.surf.util.AbstractLifecycleBean.onApplicationEvent(AbstractLifecycleBean.java:56)
at org.alfresco.repo.security.sync.ChainingUserRegistrySynchronizer.onApplicationEvent(ChainingUserRegistrySynchronizer.java:2495)
at org.springframework.context.event.SimpleApplicationEventMulticaster.doInvokeListener(SimpleApplicationEventMulticaster.java:172)
at org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(SimpleApplicationEventMulticaster.java:165)
at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:139)
at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:127)
at org.alfresco.repo.management.subsystems.ChildApplicationContextFactory$ChildApplicationContext.publishEvent(ChildApplicationContextFactory.java:569)
at org.springframework.context.support.AbstractApplicationContext.finishRefresh(AbstractApplicationContext.java:887)
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:552)
at org.alfresco.repo.management.subsystems.ChildApplicationContextFactory$ApplicationContextState.start(ChildApplicationContextFactory.java:824)
at org.alfresco.repo.management.subsystems.AbstractPropertyBackedBean.start(AbstractPropertyBackedBean.java:1098)
at org.alfresco.repo.management.subsystems.AbstractPropertyBackedBean.onApplicationEvent(AbstractPropertyBackedBean.java:637)
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEventInternal(SafeApplicationEventMulticaster.java:221)
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEvent(SafeApplicationEventMulticaster.java:186)
at org.alfresco.repo.management.SafeApplicationEventMulticaster.multicastEvent(SafeApplicationEventMulticaster.java:206)
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:399)
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:353)
at org.springframework.context.support.AbstractApplicationContext.finishRefresh(AbstractApplicationContext.java:887)
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:552)
at org.springframework.web.context.ContextLoader.configureAndRefreshWebApplicationContext(ContextLoader.java:409)
at org.springframework.web.context.ContextLoader.initWebApplicationContext(ContextLoader.java:291)
at org.springframework.web.context.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:103)
at org.alfresco.web.app.ContextLoaderListener.contextInitialized(ContextLoaderListener.java:70)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4753)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5215)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
at org.apache.catalina.core.ContainerBase.access$000(ContainerBase.java:129)
at org.apache.catalina.core.ContainerBase$PrivilegedAddChild.run(ContainerBase.java:150)
at org.apache.catalina.core.ContainerBase$PrivilegedAddChild.run(ContainerBase.java:140)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:726)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1141)
at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1875)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.naming.SizeLimitExceededException: [LDAP: error code 4 - Sizelimit Exceeded]; remaining name 'ou=users,dc=example,dc=com'
at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3206)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3100)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2891)
at com.sun.jndi.ldap.AbstractLdapNamingEnumeration.getNextBatch(AbstractLdapNamingEnumeration.java:148)
at com.sun.jndi.ldap.AbstractLdapNamingEnumeration.hasMoreImpl(AbstractLdapNamingEnumeration.java:217)
at com.sun.jndi.ldap.AbstractLdapNamingEnumeration.hasMore(AbstractLdapNamingEnumeration.java:189)
at org.alfresco.repo.security.sync.ldap.LDAPUserRegistry.processQuery(LDAPUserRegistry.java:1316)
... 49 more
Thanks for any help.
You should be able to configure "ldap.synchronization.queryBatchSize=1000" (or some other batch size) in alfresco-global.properties. Are you sure you're editing the effective alfresco-global.properties?
Additionally, if you set the logger "org.alfresco.repo.security.sync.ldap.LDAPUserRegistry" to debug, you should be able to see the batch size reflected in the log as:
Return result limit:
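For example, the two settings mentioned above would look roughly like this (a sketch; the log4j category name follows the usual Alfresco log4j.properties convention, so verify it against your installation):
# alfresco-global.properties
ldap.synchronization.queryBatchSize=1000

# log4j.properties (or a custom log4j override) - enable debug logging for the LDAP user registry
log4j.logger.org.alfresco.repo.security.sync.ldap.LDAPUserRegistry=debug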

Using Pandas UDF with Large Broadcast object

I am trying to use a GROUPED_MAP Pandas UDF to do some processing. The code is below:
from pyspark.sql.functions import pandas_udf, PandasUDFType, spark_partition_id

models_broadcast = self.spark_session.sparkContext.broadcast(models)

@pandas_udf("id string, score string", PandasUDFType.GROUPED_MAP)
def _segment_partition_score(segment_partition_pd):
    my_models = models_broadcast.value  # if I comment out this line, the code runs through
    segment_partition_pd["score"] = "aaa"
    segment_score_pd = segment_partition_pd[['id', 'score']]
    return segment_score_pd

model_score = my_data_df.withColumn("pid", spark_partition_id()).groupby('pid').apply(_segment_partition_score)
model_score.show(100)
Here is the error message; it simply states "java.net.SocketException: Connection reset":
Py4JJavaError: An error occurred while calling o1525.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 31.0 failed 4 times, most recent failure: Lost task 2.3 in stage 31.0 (TID 2466, 10.139.64.43, executor 10): java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:181)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:144)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:494)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:159)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:158)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:537)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:543)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2362)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2350)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2349)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2349)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1102)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1102)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1102)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2582)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2529)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2517)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2280)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:270)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:280)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:80)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:86)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:508)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:57)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectResult(Dataset.scala:2905)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3517)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2634)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2634)
at org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3501)
at org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3496)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1$$anonfun$apply$1.apply(SQLExecution.scala:112)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:217)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:98)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:74)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:169)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3496)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2634)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2848)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:279)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:316)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:181)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:144)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:494)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:159)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:158)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:537)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:543)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
The broadcast object is a pickled object of around ~100 MB; it is a trained scikit-learn machine learning model.
Note that in the UDF _segment_partition_score I don't actually use the broadcast object; I only access it inside the UDF for debugging purposes. Even so, the PySpark program crashes.
If I don't access the broadcast object in the Pandas UDF, the program runs through and generates results.
The dataframe "my_data_df" is very small (700 MB in Parquet). Given the existing cluster size (64 GB * 20 workers), I am sure it should have no problem handling it.
I highly suspect the problem is serialization of the broadcast object, either its size or the type of serializer, but I have no idea what I should tune.
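One alternative I am considering, in case the broadcast serialization itself is the problem, is to drop the broadcast and load the model lazily on the executor side instead (a rough sketch only; the model path and the use of joblib are placeholders, not what my code currently does):
from pyspark.sql.functions import pandas_udf, PandasUDFType
import joblib  # assumes the model was saved with joblib.dump to storage every worker can read

MODEL_PATH = "/mnt/models/segment_model.pkl"  # placeholder path
_model_cache = {}

def _get_model():
    # Load the model on first use and cache it in a module-level dict,
    # instead of shipping a ~100 MB broadcast from the driver.
    if "model" not in _model_cache:
        _model_cache["model"] = joblib.load(MODEL_PATH)
    return _model_cache["model"]

@pandas_udf("id string, score string", PandasUDFType.GROUPED_MAP)
def _segment_partition_score(segment_partition_pd):
    my_model = _get_model()  # accessed only for the same debugging purpose as before
    segment_partition_pd["score"] = "aaa"
    return segment_partition_pd[['id', 'score']]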
Could anyone point me in the right direction?
Thanks in advance.

repast.simphony.ui.GUIScheduleRunner error message

I'm a new Repast user learning to run the mesoFON model. I get the error message below. What is the problem?
I'm using Eclipse IDE 2018-09.
FATAL [Thread-5] 11:34:18,767 repast.simphony.ui.GUIScheduleRunner -
RunTimeException when running the schedule
Current tick (1.0)
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at repast.simphony.engine.schedule.DynamicTargetAction.execute(DynamicTargetAction.java:72)
at repast.simphony.engine.schedule.DefaultAction.execute(DefaultAction.java:38)
at repast.simphony.engine.schedule.ScheduleGroup.executeList(ScheduleGroup.java:205)
at repast.simphony.engine.schedule.ScheduleGroup.execute(ScheduleGroup.java:231)
at repast.simphony.engine.schedule.Schedule.execute(Schedule.java:352)
at repast.simphony.ui.GUIScheduleRunner$ScheduleLoopRunnable.run(GUIScheduleRunner.java:52)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.reflect.InvocationTargetException
at meso_FON.application.Environment$$FastClassByCGLIB$$fd509841.invoke(<generated>)
at net.sf.cglib.reflect.FastMethod.invoke(FastMethod.java:53)
at repast.simphony.engine.schedule.DynamicTargetAction.execute(DynamicTargetAction.java:69)
... 6 more
Caused by: java.lang.IllegalArgumentException: Comparison method violates its general contract!
at java.base/java.util.TimSort.mergeLo(TimSort.java:781)
at java.base/java.util.TimSort.mergeAt(TimSort.java:518)
at java.base/java.util.TimSort.mergeCollapse(TimSort.java:448)
at java.base/java.util.TimSort.sort(TimSort.java:245)
at java.base/java.util.Arrays.sort(Arrays.java:1515)
at java.base/java.util.ArrayList.sort(ArrayList.java:1749)
at java.base/java.util.Collections.sort(Collections.java:177)
at org.khelekore.prtree.MinMaxNodeGetter.<init>(MinMaxNodeGetter.java:29)
at org.khelekore.prtree.LeafBuilder.getMM(LeafBuilder.java:69)
at org.khelekore.prtree.LeafBuilder.buildLeafs(LeafBuilder.java:34)
at org.khelekore.prtree.PRTree.load(PRTree.java:65)
at meso_FON.application.Environment.getPRTree(Environment.java:423)
at meso_FON.application.Environment.queryPRTree(Environment.java:234)
... 9 more
It appears that there is an issue with a mesoFON-specific method call. I'd suggest reaching out to the mesoFON model developers directly to see if they can help.

WSO2 ESB not accepting large json data

I am using WSO2 ESB in my Java application for integration.
When I send very large JSON data, it fails with the error below, which I receive in the ESB:
ERROR - NativeWorkerPool Uncaught exception
java.lang.ClassFormatError: Invalid method Code length 82129 in class file org/mozilla/javascript/gen/c330
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at org.mozilla.javascript.DefiningClassLoader.defineClass(DefiningClassLoader.java:62)
at org.mozilla.javascript.optimizer.Codegen.defineClass(Codegen.java:126)
at org.mozilla.javascript.optimizer.Codegen.createScriptObject(Codegen.java:81)
at org.mozilla.javascript.Context.compileImpl(Context.java:2361)
at org.mozilla.javascript.Context.compileReader(Context.java:1310)
at org.mozilla.javascript.Context.compileReader(Context.java:1282)
at org.mozilla.javascript.Context.evaluateReader(Context.java:1224)
at com.sun.phobos.script.javascript.RhinoScriptEngine.eval(RhinoScriptEngine.java:172)
at javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:249)
at org.apache.synapse.mediators.bsf.ScriptMediator.processJSONPayload(ScriptMediator.java:322)
at org.apache.synapse.mediators.bsf.ScriptMediator.mediateForInlineScript(ScriptMediator.java:294)
at org.apache.synapse.mediators.bsf.ScriptMediator.invokeScript(ScriptMediator.java:239)
at org.apache.synapse.mediators.bsf.ScriptMediator.mediate(ScriptMediator.java:207)
at org.apache.synapse.mediators.AbstractListMediator.mediate(AbstractListMediator.java:81)
at org.apache.synapse.mediators.AbstractListMediator.mediate(AbstractListMediator.java:48)
at org.apache.synapse.mediators.filters.FilterMediator.mediate(FilterMediator.java:160)
at org.apache.synapse.mediators.AbstractListMediator.mediate(AbstractListMediator.java:81)
at org.apache.synapse.mediators.AbstractListMediator.mediate(AbstractListMediator.java:48)
at org.apache.synapse.mediators.base.SequenceMediator.mediate(SequenceMediator.java:149)
at org.apache.synapse.mediators.base.SequenceMediator.mediate(SequenceMediator.java:214)
at org.apache.synapse.mediators.AbstractListMediator.mediate(AbstractListMediator.java:81)
at org.apache.synapse.mediators.AbstractListMediator.mediate(AbstractListMediator.java:48)
at org.apache.synapse.config.xml.AnonymousListMediator.mediate(AnonymousListMediator.java:30)
at org.apache.synapse.mediators.filters.FilterMediator.mediate(FilterMediator.java:197)
at org.apache.synapse.mediators.AbstractListMediator.mediate(AbstractListMediator.java:81)
at org.apache.synapse.mediators.AbstractListMediator.mediate(AbstractListMediator.java:48)
at org.apache.synapse.mediators.base.SequenceMediator.mediate(SequenceMediator.java:149)
at org.apache.synapse.rest.Resource.process(Resource.java:297)
at org.apache.synapse.rest.API.process(API.java:378)
at org.apache.synapse.rest.RESTRequestHandler.dispatchToAPI(RESTRequestHandler.java:97)
at org.apache.synapse.rest.RESTRequestHandler.process(RESTRequestHandler.java:65)
at org.apache.synapse.core.axis2.Axis2SynapseEnvironment.injectMessage(Axis2SynapseEnvironment.java:266)
at org.apache.synapse.core.axis2.SynapseMessageReceiver.receive(SynapseMessageReceiver.java:83)
at org.apache.axis2.engine.AxisEngine.receive(AxisEngine.java:180)
at org.apache.synapse.transport.passthru.ServerWorker.processNonEntityEnclosingRESTHandler(ServerWorker.java:317)
at org.apache.synapse.transport.passthru.ServerWorker.processEntityEnclosingRequest(ServerWorker.java:363)
at org.apache.synapse.transport.passthru.ServerWorker.run(ServerWorker.java:142)
at org.apache.axis2.transport.base.threads.NativeWorkerPool$1.run(NativeWorkerPool.java:172)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am not sure what causes this error. Please help me resolve it.
When processing large JSON payloads, the Script mediator converts the payload into a JavaScript (Rhino) object, and the bytecode Rhino generates for a single method must stay below the JVM limit of 65536 bytes; very large JSON exceeds that limit, so try reducing the size of the JSON payload.