Jedis getResource() is taking a lot of time - redis

I am trying to use Redis Sentinel to get/set keys in Redis. I was trying to stress test my setup with about 2000 concurrent requests.
I used Sentinel to put a single key in Redis and then executed 1000 concurrent get requests against Redis.
But the underlying Jedis client used by Sentinel blocks on getResource() (pool size is 500), and the overall average response time I am achieving is around 500 ms, while my target was about 10 ms.
Here is a sample from a jvisualvm snapshot:
redis.clients.jedis.JedisSentinelPool.getResource() 98.02227 4.0845232601E7 ms 4779
redis.clients.jedis.BinaryJedis.get() 1.6894469 703981.381 ms 141
org.apache.catalina.core.ApplicationFilterChain.doFilter() 0.12820946 53424.035 ms 6875
org.springframework.core.serializer.support.DeserializingConverter.convert() 0.046286926 19287.457 ms 4
redis.clients.jedis.JedisSentinelPool.returnResource() 0.04444578 18520.263 ms 4
org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept() 0.035538 14808.45 ms 11430
Can anyone help me debug this issue further?

From JedisSentinelPool implementation of getResource() from Jedis sources (2.6.2):
@Override
public Jedis getResource() {
    while (true) {
        Jedis jedis = super.getResource();
        jedis.setDataSource(this);
        // get a reference because it can change concurrently
        final HostAndPort master = currentHostMaster;
        final HostAndPort connection = new HostAndPort(jedis.getClient().getHost(), jedis.getClient()
                .getPort());
        if (master.equals(connection)) {
            // connected to the correct master
            return jedis;
        } else {
            returnBrokenResource(jedis);
        }
    }
}
Note the while(true) and the returnBrokenResource(jedis): the method takes a Jedis resource from the pool, checks whether it is actually connected to the current master, and retries if it is not. It is a dirty check and also a blocking call.
The super.getResource() call refers to the traditional JedisPool implementation, which is based on Apache Commons Pool (2.0). It does a lot of work to get an object from the pool, and I think it even repairs failed connections, for instance. With a lot of contention on your pool, as is probably the case in your stress test, it can take a long time to get a resource from the pool, only to find that it is not connected to the correct master, so you end up calling it again, adding contention, slowing down resource acquisition further, and so on.
You should check all the Jedis instances in your pool to see if there are a lot of 'bad' connections.
Maybe you should give up using a common pool for your stress test (only create Jedis instances manually connected to the correct node, and close them cleanly), or set up multiple pools to mitigate the cost of checking "dirty" unchecked Jedis resources.
Also, with a pool of 500 Jedis instances you can't emulate 1000 concurrent queries; you need at least 1000.
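To make the pool-size point concrete, here is a minimal sketch that uses only the JDK and no Jedis at all. The Semaphore is just a stand-in for the connection pool, and the class and method names are made up for illustration. It counts how many callers hold a "connection" at the same moment; with fewer permits than callers, the extra callers simply wait in the borrow step, which is the time that shows up under getResource() in a profiler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolContentionSketch {

    /** Returns the highest number of callers that held a pooled resource at once. */
    static int maxConcurrentHolders(int poolSize, int callers) {
        Semaphore pool = new Semaphore(poolSize);      // stand-in for the connection pool
        AtomicInteger inUse = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService clients = Executors.newFixedThreadPool(callers);
        CountDownLatch done = new CountDownLatch(callers);

        for (int i = 0; i < callers; i++) {
            clients.execute(() -> {
                try {
                    pool.acquire();                    // blocks while the pool is exhausted
                    try {
                        int now = inUse.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(5);               // pretend to run a command
                    } finally {
                        inUse.decrementAndGet();
                        pool.release();                // give the resource back
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }

        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            clients.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 20 callers, 5 pooled resources: the peak never exceeds 5
        System.out.println(maxConcurrentHolders(5, 20));
    }
}
```

The same reasoning applies to the real pool: the number of requests actually in flight against Redis is capped by the pool size, regardless of how many client threads the stress test starts.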

Related

Infinispan clustered lock performance does not improve with more nodes?

I have a piece of code that is essentially executing the following with Infinispan in embedded mode, using version 13.0.0 of the -core and -clustered-lock modules:
@Inject
lateinit var lockManager: ClusteredLockManager

private fun getLock(lockName: String): ClusteredLock {
    lockManager.defineLock(lockName)
    return lockManager.get(lockName)
}

fun createSession(sessionId: String) {
    tryLockCounter.increment()
    logger.debugf("Trying to start session %s. trying to acquire lock", sessionId)
    Future.fromCompletionStage(getLock(sessionId).lock()).map {
        acquiredLockCounter.increment()
        logger.debugf("Starting session %s. Got lock", sessionId)
    }.onFailure {
        logger.errorf(it, "Failed to start session %s", sessionId)
    }
}
I take this piece of code and deploy it to Kubernetes. I then run it in six pods distributed over six nodes in the same region. The code exposes createSession with random GUIDs through an API. This API is called and creates sessions in chunks of 500, using a k8s service in front of the pods, which means the load gets balanced over the pods. I notice that the execution time to acquire a lock grows linearly with the number of sessions. In the beginning it's around 10 ms; when there are about 20_000 sessions it takes about 100 ms, and the trend continues in a stable fashion.
I then take the same code and run it, but this time with twelve pods on twelve nodes. To my surprise, I see that the performance characteristics are almost identical to when I had six pods. I've been digging into the code but still haven't figured out why this is. Is there a good reason why Infinispan doesn't seem to perform better with more nodes here?
For completeness, the configuration of the locks is as follows:
val global = GlobalConfigurationBuilder.defaultClusteredBuilder()
global.addModule(ClusteredLockManagerConfigurationBuilder::class.java)
.reliability(Reliability.AVAILABLE)
.numOwner(1)
Looking at the code, the clustered lock module uses a DIST_SYNC cache, which should spread the load of the cache across the different nodes.
UPDATE:
The two counters in the code above are simply micrometer counters. It is through them and prometheus that I can see how the lock creation starts to slow down.
It's correctly observed that there's one lock created per session id; this is by design. Our use case is that we want to ensure that a session is running in at least one place. Without going too deep into detail, this can be achieved by ensuring that at least two pods are trying to acquire the same lock. The Infinispan library is great in that it tells us directly when the lock holder dies, without any additional chattiness between pods, which means that we have a "cheap" way of ensuring that execution of the session continues when one pod is removed.
After digging deeper into the code I found the following in CacheNotifierImpl in the core library:
private CompletionStage<Void> doNotifyModified(K key, V value, Metadata metadata, V previousValue,
        Metadata previousMetadata, boolean pre, InvocationContext ctx, FlagAffectedCommand command) {
    if (clusteringDependentLogic.running().commitType(command, ctx, extractSegment(command, key), false).isLocal()
            && (command == null || !command.hasAnyFlag(FlagBitSets.PUT_FOR_STATE_TRANSFER))) {
        EventImpl<K, V> e = EventImpl.createEvent(cache.wired(), CACHE_ENTRY_MODIFIED);
        boolean isLocalNodePrimaryOwner = isLocalNodePrimaryOwner(key);
        Object batchIdentifier = ctx.isInTxScope() ? null : Thread.currentThread();
        try {
            AggregateCompletionStage<Void> aggregateCompletionStage = null;
            for (CacheEntryListenerInvocation<K, V> listener : cacheEntryModifiedListeners) {
                // Need a wrapper per invocation since converter could modify the entry in it
                configureEvent(listener, e, key, value, metadata, pre, ctx, command, previousValue, previousMetadata);
                aggregateCompletionStage = composeStageIfNeeded(aggregateCompletionStage,
                        listener.invoke(new EventWrapper<>(key, e), isLocalNodePrimaryOwner));
            }
The lock library uses a clustered listener on the entry-modified event, and this listener uses a filter so that it is only notified when the key for its lock is modified. It seems to me the core library still has to check this condition on every registered listener, and that list becomes very large as the number of sessions grows. I suspect this to be the reason. If it is, it would be really great if the core library supported some kind of key filter, so that it could look these listeners up in a hash map instead of iterating over the whole list of listeners.
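To illustrate the idea of a key-indexed listener lookup, here is a small JDK-only sketch. It is not Infinispan's API or implementation; the class and method names are hypothetical. Listeners are stored in a map keyed by the entry they care about, so a modification touches only the listeners for that key instead of evaluating a filter on every registered listener.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class KeyedListenerSketch {

    // Listeners grouped by the key they are interested in (hypothetical structure)
    private final Map<String, List<Consumer<String>>> listenersByKey = new HashMap<>();

    void register(String key, Consumer<String> listener) {
        listenersByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(listener);
    }

    /** Notifies only the listeners registered for this key; returns how many were invoked. */
    int notifyModified(String key, String newValue) {
        List<Consumer<String>> listeners =
                listenersByKey.getOrDefault(key, Collections.emptyList());
        for (Consumer<String> listener : listeners) {
            listener.accept(newValue);
        }
        return listeners.size();
    }

    /** Registers one listener per lock name, then modifies a single lock. */
    static int demo(int lockCount) {
        KeyedListenerSketch notifier = new KeyedListenerSketch();
        for (int i = 0; i < lockCount; i++) {
            notifier.register("session-" + i, value -> { /* react to lock state change */ });
        }
        // Only the listener for "session-42" is visited, independent of lockCount
        return notifier.notifyModified("session-42", "ACQUIRED");
    }

    public static void main(String[] args) {
        System.out.println(demo(10_000));
    }
}
```

With a flat list, each modification costs one filter evaluation per registered listener, which matches the linear growth described above; with the map, the cost depends only on the number of listeners for the modified key.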
I believe you are creating a clustered lock per session id. Is this what you need? What is the acquiredLockCounter? We are about to deprecate the "lock" method in favour of "tryLock" with a timeout, since the lock method will block forever if the clustered lock is never acquired. Do you ever unlock the clustered lock in another piece of code? If you could share a complete reproducer, that would be very helpful for us. Thanks!

Why is the Apache Ignite Cache.replace-K-V-V API call performing slowly?

We are running an Ignite cluster with 12 nodes on Ignite 2.7.0, on OpenJDK 1.8 on RHEL.
We are seeing heavy CPU time spent in https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteCache.html#replace-K-V-V-
One of our processes is slow, and when we drilled into it by profiling the JVM, the main culprit (taking ~78% of total time) seems to be the Ignite cache.replace(K,V,V) API call.
Of the 77.9% spent in replace, 39% is taken by GridCacheAdapter.equalVal and 38.5% by GridCacheAdapter.put.
The cache is PARTITIONED and ATOMIC, with readThrough, writeThrough and writeBehindEnabled set to true.
I am attaching the profiling snapshot of one node (the profiling results on the other nodes are similar). Can someone please check and suggest what could be the cause, or whether there is a known performance issue with this Ignite version related to the cache.replace(K,V,V) API?
JVM profiling snapshot of one node
I guess that it can be related to the following issue:
https://issues.apache.org/jira/browse/IGNITE-5003
The problem there relates to operations on a key being issued before the previous batch of updates containing that key has been stored in the database.
As far as I can see, the fix should be included in Ignite 2.8.
Update:
I tested the putAll operation. From the next two pictures you can see that putAll is waiting on GridCacheWriteBehindStore.write (in two different threads), which contains updateCache:
public void write(Entry<? extends K, ? extends V> entry) {
    try {
        if (log.isDebugEnabled())
            log.debug(S.toString("Store put",
                "key", entry.getKey(), true,
                "val", entry.getValue(), true));
        updateCache(entry.getKey(), entry, StoreOperation.PUT);
    }
So the issue linked above can affect your put operations (and replace as well).

Bursts of RedisTimeoutException using StackExchange.Redis

I'm trying to track down intermittent "bursts" of timeouts using the StackExchange Redis library. Here's a bit about our setup: Our API is written in C# and runs on Windows 2008 and IIS. We have 4 API servers in production, and we have 4 Redis machines (running the latest Linux LTS), each with 2 instances of Redis (one master on port 7000, one slave on port 7001). I've looked at pretty much every aspect of the Redis servers and they look fantastic. No errors in the logs, CPU and network are great, and everything on the server side of things seems fantastic. I can tail -f the Redis logs while this is happening and don't see anything out of the ordinary (such as rewriting AOF files or anything). I don't think the problem is with Redis.
Here's what I know so far:
We see these timeout exceptions several times an hour. Usually between 40-50 timeouts in a minute, sometimes up to 80-90. Then, they'll go away for several minutes. There were about 5,000 of these events in the past 24 hours, and they happen in bursts from a single API client.
These timeouts only happen against Redis master nodes, never against slave nodes. However, they happen with various Redis commands such as GETs and SETs.
When a burst of these timeouts happen, the calls are coming from a single API server but happen talking to various Redis nodes. For example, API3 might have a bunch of timeouts trying to call Cache1, Cache2 and Cache3. This is strong evidence that the issue is related to the API servers, not the Redis servers.
The Redis master nodes have 108 connected clients. I log current connections, and this number remains stable. There are no big spikes in connections, and it doesn't look like there's any bad code creating too many connections or not sharing ConnectionMultiplexer instances (I have one and it's static).
The Redis slave nodes have 58 connected clients, and this also looks completely stable as well.
We're using StackExchange.Redis version 1.2.6
Redis is using AOF mode, and size on disk is about 195MB
Here's an example timeout exception. Most look pretty much the same as this:
Type=StackExchange.Redis.RedisTimeoutException,Message=Timeout
performing GET limeade:allActivities, inst: 1, mgr: ExecuteSelect,
err: never, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 0, ar: 0,
clientName: LIMEADEAPI4, serverEndpoint: 10.xx.xx.11:7000,
keyHashSlot: 1295, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER:
(Busy=9,Free=32758,Min=2,Max=32767) (Please take a look at this
article for some common client-side issues that can cause timeouts:
http://stackexchange.github.io/StackExchange.Redis/Timeouts),StackTrace=
at
StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message
message, ResultProcessor1 processor, ServerEndPoint server) at
StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message
message, ResultProcessor1 processor, ServerEndPoint server) at
StackExchange.Redis.RedisBase.ExecuteSync[T](Message message,
ResultProcessor1 processor, ServerEndPoint server) at
StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags
flags) at Limeade.Caching.Providers.RedisCacheProvider1.Get[T](K
cacheKey, CacheItemVersion& cacheItemVersion) in ...
I've done a bit of research on tracking down these timeout exceptions, but what's rather surprising is that the numbers are all zeros. Nothing in the queue, nothing waiting to be processed, and I have tons of threads free and not doing anything. Everything looks great.
Anyone have any ideas on how to fix this? The problem is these bursts of cache timeouts cause our database to be hit more, and in certain circumstances this is a bad thing. I'm happy to add any more info that anyone would find helpful.
Update: Connection Code
The code to connect to Redis is part of a fairly complex system that supports various cache environments and configuration, but I can probably boil it down to the basics. First, there's a CacheFactory class:
public class CacheFactory : ICacheFactory
{
    private static readonly ILogger log = LoggerManager.GetLogger(typeof(CacheFactory));
    private static readonly ICacheProvider<CacheKey> cache;

    static CacheFactory()
    {
        ICacheFactory<CacheKey> configuredFactory = CacheFactorySection.Current?.CreateConfiguredFactory<CacheKey>();
        if (configuredFactory == null)
        {
            // Some error handling, not important
        }
        cache = configuredFactory.GetDefaultCache();
    }
    // ...
}
The ICacheProvider is what implements a way to talk to a certain cache system, which can be configured. In this case, the configuredFactory is a RedisCacheFactory which looks like this:
public class RedisCacheFactory<T> : ICacheFactory<T> where T : CacheKey, ICacheKeyRepository
{
    private RedisCacheProvider<T> provider;
    private readonly RedisConfiguration configuration;

    public RedisCacheFactory(RedisConfiguration config)
    {
        this.configuration = config;
    }

    public ICacheProvider<T> GetDefaultCache()
    {
        return provider ?? (provider = new RedisCacheProvider<T>(configuration));
    }
}
The GetDefaultCache method is called once, in the static constructor, and returns a RedisCacheProvider. This class is what actually connects to Redis:
public class RedisCacheProvider<K> : ICacheProvider<K> where K : CacheKey, ICacheKeyRepository
{
    private readonly ConnectionMultiplexer redisConnection;
    private readonly IDatabase db;
    private readonly RedisCacheSerializer serializer;
    private static readonly ILog log = Logging.RedisCacheProviderLog<K>();
    private readonly CacheMonitor<K> cacheMonitor;
    private readonly TimeSpan defaultTTL;
    private int connectionErrors;

    public RedisCacheProvider(RedisConfiguration options)
    {
        redisConnection = ConnectionMultiplexer.Connect(options.EnvironmentOverride ?? options.Connection);
        db = redisConnection.GetDatabase();
        serializer = new RedisCacheSerializer(options.SerializationBinding);
        cacheMonitor = new CacheMonitor<K>();
        defaultTTL = options.DefaultTTL;
        IEnumerable<string> hosts = options.Connection.EndPoints.Select(e => (e as DnsEndPoint)?.Host);
        log.InfoFormat("Created Redis ConnectionMultiplexer connection. Hosts=({0})", String.Join(",", hosts));
    }
    // ...
}
The constructor creates a ConnectionMultiplexer based on the configured Redis endpoints (which are in some config file). I also log every time I create a connection. We don't see an excessive number of these log statements, and the connections to Redis remain stable.
In global.asax, try adding:
protected void Application_Start(object sender, EventArgs e)
{
    ThreadPool.SetMinThreads(200, 200);
}
For us, this reduced errors from ~50-100 daily to zero. I believe there is no general rule for what numbers to set, as it's system-dependent (200 works for us), so it might require some experimenting on your end.
I also believe this has improved the performance of the site.

Execute multiple downloads and wait for all to complete

I am currently working on an API service that allows 1 or more users to download 1 or more items from an S3 bucket and return the contents to the user. While the downloading is fine, the time taken to download several files is pretty much 100-150 ms * the number of files.
I have tried a few approaches to speeding this up: parallelStream() instead of stream() (which, considering the number of simultaneous downloads, is at serious risk of running out of threads), as well as CompletableFutures, and even creating an ExecutorService, doing the downloads, then shutting down the pool. Typically I would only want a few concurrent tasks, e.g. 5 at the same time per request, to try to cut down on the number of active threads.
I have tried integrating Spring @Cacheable to store the downloaded files in Redis (the files are read-only). While this certainly cuts down response times (several ms to retrieve files compared to 100-150 ms), the benefits are only there once the file has been previously retrieved.
What is the best way to handle waiting on multiple async tasks to finish then getting the results, also considering I don't want (or don't think I could) have hundreds of threads opening http connections and downloading all at once?
You're right to be concerned about tying up the common fork/join pool used by default in parallel streams, since I believe it is used for other things like sort operations outside of the Stream api. Rather than saturating the common fork/join pool with an I/O-bound parallel stream, you can create your own fork/join pool for the Stream. See this question to find out how to create an ad hoc ForkJoinPool with the size you want and run a parallel stream in it.
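As a rough sketch of the dedicated fork/join pool approach (JDK only): the download method below is a hypothetical stand-in for the real S3 call, and the class name is made up. Note that running a parallel stream on the pool from which its terminal operation is submitted is an observed behaviour of the JDK implementation rather than something the Stream API documents as a guarantee, so treat it as an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

public class CustomPoolStreamSketch {

    // Hypothetical stand-in for an I/O-bound S3 download
    static String download(String key) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "contents-of-" + key;
    }

    /** Runs the parallel stream inside a dedicated pool instead of the common pool. */
    static List<String> downloadAll(List<String> keys, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            Callable<List<String>> task = () -> keys.parallelStream()
                    .map(CustomPoolStreamSketch::download)
                    .collect(Collectors.toList());   // keeps the encounter order of keys
            return pool.submit(task).get();          // blocks until every download is done
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();                         // the ad hoc pool is not reused
        }
    }

    public static void main(String[] args) {
        System.out.println(downloadAll(Arrays.asList("a", "b", "c"), 2));
    }
}
```

Creating a pool per request is simple but not free; if requests are frequent, a single shared pool sized for downloads may be the better trade-off.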
You could also create an ExecutorService with a fixed-size thread pool that would also be independent of the common fork/join pool, and would throttle the requests, using only the threads in the pool. It also lets you specify the number of threads to dedicate:
ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS_FOR_DOWNLOADS);
try {
    List<CompletableFuture<Path>> downloadTasks = s3Paths
            .stream()
            .map(s3Path -> CompletableFuture.supplyAsync(() -> mys3Downloader.downloadAndGetPath(s3Path), executor))
            .collect(Collectors.toList());
    // at this point, all requests are enqueued, and threads will be assigned as they become available
    executor.shutdown(); // stops accepting requests, does not interrupt threads,
                         // items in queue will still get threads when available
    // wait for all downloads to complete
    CompletableFuture.allOf(downloadTasks.toArray(new CompletableFuture[downloadTasks.size()])).join();
    // at this point, all downloads are finished,
    // so it's safe to shut down executor completely
} catch (CompletionException e) {
    // join() reports a failed download as an unchecked CompletionException
    e.printStackTrace();
} finally {
    executor.shutdownNow(); // important to call this when you're done with the executor.
}
Following @Hank D's lead, you can encapsulate the creation of the executor service to ensure that you do, indeed, call ExecutorService::shutdownNow after using said executor. Because shutdownNow interrupts running tasks and discards queued ones, the function passed in has to wait for its tasks before it returns:
private static <VALUE> VALUE execute(
        final int nThreads,
        final Function<ExecutorService, VALUE> function
) {
    final ExecutorService executorService = Executors.newFixedThreadPool(nThreads);
    try {
        return function.apply(executorService);
    } finally {
        executorService.shutdownNow(); // important to call this when you're done with the executor service.
    }
}

public static void main(final String... arguments) {
    // define variables
    final List<Path> downloadedPaths = execute(
            MAX_THREADS_FOR_DOWNLOADS,
            executor -> {
                final List<CompletableFuture<Path>> downloadTasks = s3Paths
                        .stream()
                        .map(s3Path -> CompletableFuture.supplyAsync(
                                () -> mys3Downloader.downloadAndGetPath(s3Path),
                                executor
                        ))
                        .collect(Collectors.toList());
                // join() blocks until each download has finished
                return downloadTasks.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList());
            }
    );
    // use downloadedPaths
}

Ridiculously slow simultaneous publish/consume rate with RabbitMQ

I'm evaluating RabbitMQ and while the general impression (of AMQP as such, and also RabbitMQ) is positive, I'm not very impressed by the result.
I'm attempting to publish and consume messages simultaneously and have achieved very poor message rates. I have a durable direct exchange, which is bound to a durable queue and I publish persistent messages to that exchange. The average size of the message body is about 1000 bytes.
My publishing happens roughly as follows:
AMQP.BasicProperties.Builder bldr = new AMQP.BasicProperties.Builder();
ConnectionFactory factory = new ConnectionFactory();
factory.setUsername("guest");
factory.setPassword("guest");
factory.setVirtualHost("/");
factory.setHost("my-host");
factory.setPort(5672);
Connection conn = null;
Channel channel = null;
ObjectMapper mapper = new ObjectMapper(); //com.fasterxml.jackson.databind.ObjectMapper
try {
    conn = factory.newConnection();
    channel = conn.createChannel();
    channel.confirmSelect();
} catch (IOException e) {}
for (Message m : messageList) { //the size of messageList happens to be 9945
    try {
        channel.basicPublish("exchange", "", bldr.deliveryMode(2).contentType("application/json").build(), mapper.writeValueAsBytes(m));
    } catch (Exception e) {}
}
try {
    channel.waitForConfirms();
    channel.close();
    conn.close();
} catch (Exception e1) {}
And consuming messages from the bound queue happens as so:
AMQP.BasicProperties.Builder bldr = new AMQP.BasicProperties.Builder();
ConnectionFactory factory = new ConnectionFactory();
factory.setUsername("guest");
factory.setPassword("guest");
factory.setVirtualHost("/");
factory.setHost("my-host");
factory.setPort(5672);
Connection conn = null;
Channel channel = null;
try {
    conn = factory.newConnection();
    channel = conn.createChannel();
    channel.basicQos(100);
    while (true) {
        GetResponse r = channel.basicGet("rawDataQueue", false);
        if (r != null)
            channel.basicAck(r.getEnvelope().getDeliveryTag(), false);
    }
} catch (IOException e) {}
The problem is that when the message publisher (or several of them) and consumer (or several of them) run simultaneously then the publisher(s) appear to run at full throttle and the RabbitMQ management web interface shows a publishing rate of, say, ~2...3K messages per second, but a consumption rate of 0.5...3 per consumer. When the publisher(s) finish then I get a consumption rate of, say, 300...600 messages per consumer. When not setting the QOS prefetch value for the Java client, then a little less, when setting it to 100 or 250, then a bit more.
When experimenting with throttling the consumers somewhat, I have managed to achieve simultaneous numbers like ~400 published and ~50 consumed messages per second which is marginally better but only marginally.
A RabbitMQ blog entry claims that queues are fastest when they're empty, which may very well be true, but slowing the consumption rate to a crawl when there are a few thousand persistent messages sitting in the queue is still rather unacceptable.
Higher QOS prefetching values may help a bit but are IMHO not a solution as such.
What, if anything, can be done to achieve reasonable throughput rates (2 consumed messages per consumer per second is not reasonable in any circumstance)? This is only a simple one direct exchange - one binding - one queue situation, should I expect more performance degradation with more complicated configurations? When searching around the internet there have also been suggestions to drop durability, but I'm afraid in my case that is not an option. I'd be very happy if somebody would point out that I'm stupid and that there is an evident and straightforward solution of some kind :)
Have you tried with the autoAck option? That should improve your performance. It is much faster than getting the messages one by one and ack'ing them. Increasing the prefetch count should make it even better too.
Also, what is the size of the messages you are sending and consuming including headers? Are you experiencing any flow-control in the broker?
Another question, are you creating a connection and channel every time you send/get a message? If so, that's wrong. You should be creating a connection once, and use a channel per thread (probably in a thread-local fashion) to send and receive messages. You can have multiple channels per connection. There is no official documentation about this, but if you read articles and forums this seems to be the best performance practice.
Last thing, have you considered using the basicConsume instead of basicGet? It should also make it faster.
Based on my experience, I have been able to run a cluster sending and consuming at rates around 20000 messages per second with non-persistent messages. I guess that if you are using durable and persistent messages the performance would decrease a little, but not 10x.
If sleep is used, the operating system may not schedule your process again until the next time slot. This can cause a significant drop in performance.