I am running the application behind a load balancer and using a shared Azure Redis Cache service with a connection pool of 50, but I am getting the following timeout exception very frequently. Can anyone please guide me on what to do?
RedisTimeoutException
Timeout performing EXISTS (60000ms), next: GET vstfs:///Classification/TeamProject/b81283a0-baf4-46df-a616-fbdd9d387034##CheckOutFiles, inst: 6, qu: 0, qs: 0, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, serverEndpoint: mr4devops.redis.cache.windows.net:6380, mc: 1/1/0, mgr: 10 of 10 available, clientName: inteGREAT.Web.UI.v2_IN_1(SE.Redis-v2.6.66.47313), IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=151,Free=32616,Min=4,Max=32767), v: 2.6.66.47313 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server, T defaultValue) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 1867
at StackExchange.Redis.RedisDatabase.KeyExists(RedisKey key, CommandFlags flags) in /_/src/StackExchange.Redis/RedisDatabase.cs:line 811
As per my understanding, you are facing a timeout error when running the application behind a load balancer using a shared Azure Redis Cache with a connection pool of 50.
The answer is in the link included in the exception message: the number of busy worker threads (WORKER: Busy=151) is greater than the minimum (Min=4), so you will need to increase the minimum worker thread count.
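In a .NET app this is typically done once at startup; a minimal sketch (the 200/200 values are illustrative, not a recommendation; measure and tune for your workload):
using System.Threading;

// Raise the floor below which the CLR thread pool creates new threads on
// demand instead of throttling; returns false if the values are rejected.
ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200);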
Hope that helps
Related
My FaunaDB Docker dev node recently started timing out in response to any request, with error messages like Read timed out. (Timed out waiting for appliedTimestamp for key 6(323942845125755392) to reach 2022-02-25T13:10:03.913Z).
My guess is that this has something to do with a desynchronization between the Fauna instance's clock and the system clock. How can it be fixed?
I have a Symfony 4 application using the Symfony Messenger component (version 4.3.2) to dispatch messages.
For asynchronous message handling, some Redis transports are configured, and they work fine. But then I decided that one of them should retry a few times when message handling fails. I configured a retry strategy and the transport actually started retrying on failure, but it seems to ignore the delay configuration (the delay, multiplier, and max_delay keys): all the retry attempts are made without any delay, within one second or a similarly short timespan, which is really undesirable in this use case.
My Messenger configuration (config/packages/messenger.yaml) looks like this:
framework:
    messenger:
        default_bus: messenger.bus.default
        transports:
            transport_without_retry:
                dsn: '%env(REDIS_DSN)%/without_retry'
                retry_strategy:
                    max_retries: 0
            transport_with_retry:
                dsn: '%env(REDIS_DSN)%/with_retry'
                retry_strategy:
                    max_retries: 5
                    delay: 10000 # 10 seconds
                    multiplier: 3
                    max_delay: 3600000
        routing:
            'App\Message\RetryWorthMessage': transport_with_retry
I tried replacing Redis with Doctrine (as the implementation of the retrying transport) and voilà, the delays started working as expected. I therefore suspect that the Redis transport implementation doesn't support delayed retry. But I read the docs carefully, searched related GitHub issues, and still didn't find a definite answer.
So my question is: does the Redis transport support delayed retry? If it does, how do I make it work?
It turned out that the Redis transport does support delayed retry, but only since Messenger version 4.4.
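If you are on 4.3.x, upgrading the component should be enough to get the delay keys honored, assuming the rest of your Symfony packages allow the 4.4 constraint:
composer require symfony/messenger:^4.4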
I'm working on an app which uploads some files to an S3 bucket and, at a later point, reads those files back from the bucket and pushes them to my database.
I'm using Flink 1.4.2 and the fs.s3a API for reading and writing files in the S3 bucket.
Uploading files to the S3 bucket works fine without any problem, but when the second phase of my app starts, reading those uploaded files from S3, it throws the following error:
Caused by: java.io.InterruptedIOException: Reopen at position 0 on s3a://myfilepath/a/b/d/4: org.apache.flink.fs.s3hadoop.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:125)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:155)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:281)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:364)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.flink.fs.s3hadoop.shaded.org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:702)
at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
at org.apache.flink.api.common.io.GenericCsvInputFormat.open(GenericCsvInputFormat.java:301)
at org.apache.flink.api.java.io.CsvInputFormat.open(CsvInputFormat.java:53)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:160)
at org.apache.flink.api.java.io.PojoCsvInputFormat.open(PojoCsvInputFormat.java:37)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
I was able to control this error by increasing the max connections parameter for the s3a API.
As of now, I have around 1000 files in the S3 bucket that my app pushes and pulls, and my max connections setting is 3000. I'm using Flink's parallelism to upload/download these files from the S3 bucket. My task manager count is 14.
This is an intermittent failure; I'm having success cases for this scenario as well.
My questions are:
Why am I getting an intermittent failure? If the max connections value I set were too low, then my app should throw this error every time I run it.
Is there any way to calculate the optimal number of max connections required for my app to work without facing the connection pool timeout error? Or is this error related to something else that I'm not aware of?
Thanks in advance
Some comments, based on my experience with processing lots of files from S3 via Flink (batch) workflows:
When you are reading the files, Flink will calculate "splits" based on the number of files and each file's size. Each split is read separately, so the theoretical max number of simultaneous connections isn't based on the number of files, but on a combination of file count and file sizes.
The connection pool used by the HTTP client releases connections after some amount of time, as being able to reuse an existing connection is a win (server/client handshake doesn't have to happen). So that introduces a degree of randomness into how many available connections are in the pool.
The size of the connection pool doesn't impact memory much, so I typically set it pretty high (e.g. 4096 for a recent workflow).
When using AWS connection code, the setting to bump is fs.s3.maxConnections, which isn't the same property as the pure Hadoop s3a setting (fs.s3a.connection.maximum).
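For the Hadoop s3a path, a minimal sketch of the corresponding core-site.xml entry (the property name is Hadoop's; the 4096 value just mirrors the suggestion above, and where this config file lives depends on your Flink deployment):
<configuration>
  <property>
    <!-- Maximum number of simultaneous connections the s3a HTTP pool may open. -->
    <name>fs.s3a.connection.maximum</name>
    <value>4096</value>
  </property>
</configuration>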
I'm running a Spark Streaming application on YARN in cluster mode, and I'm trying to implement a graceful shutdown so that when the application is killed it will finish executing the current micro batch before stopping.
Following some tutorials, I have configured spark.streaming.stopGracefullyOnShutdown to true and I've added the following code to my application:
sys.ShutdownHookThread {
log.info("Gracefully stopping Spark Streaming Application")
ssc.stop(true, true)
log.info("Application stopped")
}
However, when I kill the application with
yarn application -kill application_1454432703118_3558
the micro batch being executed at that moment is not completed.
In the driver I see the first line of log printed ("Gracefully stopping Spark Streaming Application") but not the last one ("Application stopped").
ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
INFO streaming.MySparkJob: Gracefully stopping Spark Streaming Application
INFO scheduler.JobGenerator: Stopping JobGenerator gracefully
INFO scheduler.JobGenerator: Waiting for all received blocks to be consumed for job generation
INFO scheduler.JobGenerator: Waited for all received blocks to be consumed for job generation
INFO streaming.StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook
In the executors log I see the following error:
ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.6.21:49767 disassociated! Shutting down.
INFO storage.DiskBlockManager: Shutdown hook called
WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.6.21:49767] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
INFO util.ShutdownHookManager: Shutdown hook called
I think the problem is related to how YARN sends the kill signal to the application. Any idea on how I can make the application stop gracefully?
You should go to the executors page to see where your driver is running (on which node), then SSH to that node and do the following:
ps -ef | grep 'app_name'
(replace app_name with your class name/app name). It will list a couple of processes; some will be children of the others. Pick the id of the parent-most process and send it a SIGTERM:
kill pid
After some time you'll see that your application has terminated gracefully.
Also, with this approach you don't need to add those shutdown hooks.
Use the spark.streaming.stopGracefullyOnShutdown config to help shut down gracefully.
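For example, the setting can be passed at submit time; a minimal sketch with placeholder class and jar names:
# The built-in hook then drains the in-flight micro batch on SIGTERM.
spark-submit \
  --conf spark.streaming.stopGracefullyOnShutdown=true \
  --class com.example.MySparkJob \
  my-streaming-job.jar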
You can stop a Spark Streaming application by invoking ssc.stop when a customized condition is triggered, instead of using awaitTermination. As the following Python sketch shows (the marker-file path is illustrative):
import os
import time

ssc.start()
# Poll until the external stop condition holds; here, a marker file appearing.
while not os.path.exists("/tmp/stop-streaming"):
    time.sleep(10)
# Stop the StreamingContext and the SparkContext, draining the current batch.
ssc.stop(True, True)
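Since the question's application is in Scala, the same idea there might look like this sketch (again, the marker path is just a placeholder):
ssc.start()
// Poll for an external stop marker instead of blocking in awaitTermination.
while (!new java.io.File("/tmp/stop-streaming").exists()) {
  Thread.sleep(10000)
}
// Stop the StreamingContext and the underlying SparkContext gracefully,
// letting the in-flight micro batch finish first.
ssc.stop(stopSparkContext = true, stopGracefully = true)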
I just installed Openfire 3.9.3 (Ubuntu 12.04), and it works great in Pidgin, but when trying to log in via the Candy client, it fails and Firebug returns the following:
500 Task
org.jivesoftware.openfire.http.HttpSessionManager$HttpPacketSender@58b901
rejected from
java.util.concurrent.ThreadPoolExecutor@8aeebf[Terminated, pool size =
0, active threads = 0, queued tasks = 0, completed tasks = 0]
The only attempt at a solution I could find is at:
openfire error when try to connect via http-bind
I logged into hxxp://127.0.0.1:9090/server-properties.jsp, but there were no xmpp.httpbind.worker.threads or xmpp.client.processing properties listed. So I added both, assigned each a value of 16, and then restarted the Openfire service. But there was absolutely no change.
Openfire warning log:
java.util.concurrent.RejectedExecutionException: Task org.jivesoftware.openfire.http.HttpSessionManager$HttpPacketSender@3ea82c
rejected from
java.util.concurrent.ThreadPoolExecutor@8aeebf[Terminated, pool size =
0, active threads = 0, queued tasks = 0, completed tasks = 0]
Any suggestions please?