One task per node only in Apache Ignite

I'm relatively new to Apache Ignite. I'm using Ignite compute to distribute tasks to nodes. My goal is a task dispatcher that produces tasks and submits these only to nodes that are "free". One node can only do one task at a time. If all nodes have a task running, the dispatcher shall wait for the next node to become available and then submit the next task.
I can implement this with a queue and async Callables; however, I wonder if there is a built-in Ignite class that does something like this. I'm not sure whether ComputeTaskSplitAdapter is what I need to look at, as I don't fully understand its purpose.
Any help appreciated.
Server nodes can join and leave the cluster while tasks are distributed.
Tasks can take different amounts of time on the nodes, and as soon as a server finishes a task it shall get the next one.
Here's my node code:
JobStealingCollisionSpi spi = new JobStealingCollisionSpi();
spi.setActiveJobsThreshold(1);
IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setCollisionSpi(spi);
Ignition.start(cfg);
And this is my job distribution code (for testing):
JobStealingCollisionSpi spi = new JobStealingCollisionSpi();
spi.setActiveJobsThreshold(1);
IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setCollisionSpi(spi);
Ignition.setClientMode(true);
Ignite ignite = Ignition.start(cfg);
for (int i = 0; i < 10; i++)
{
    ignite.compute().runAsync(new IgniteRunnable()
    {
        @Override
        public void run()
        {
            System.out.print("Sleeping...");
            try
            {
                Thread.sleep(10000);
            } catch (InterruptedException e)
            {
                e.printStackTrace();
            }
            System.out.println("Done.");
        }
    });
}

Yes, Apache Ignite has direct support for it. Please take a look at the One-at-a-Time section in the Job Scheduling documentation: https://apacheignite.readme.io/docs/job-scheduling#section-one-at-a-time
Note that every server has its own waiting queue and servers will move to the next job in their queue immediately after they are done with a previous job.
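A minimal configuration sketch of that mode (assuming Ignite 2.x, where the one-at-a-time behaviour is configured with FifoQueueCollisionSpi and a parallel-jobs limit of 1) could look like this:
// Sketch: limit each server node to one concurrently executing job;
// additional jobs wait in the node's FIFO queue until the running job completes.
FifoQueueCollisionSpi colSpi = new FifoQueueCollisionSpi();
colSpi.setParallelJobsNumber(1);

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setCollisionSpi(colSpi);

Ignition.start(cfg);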
If you would like even more aggressive scheduling, then you can take a look at Job-Stealing scheduling here: https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/collision/jobstealing/JobStealingCollisionSpi.html
With Job Stealing enabled, servers will steal jobs from the job queues of other servers once their own queue becomes empty. Most of the parameters are configurable.
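A rough configuration sketch (hedged: per the JobStealingCollisionSpi javadoc, the collision SPI is meant to be paired with JobStealingFailoverSpi so that stolen jobs are re-routed to the node that stole them):
// Sketch: at most one active job per node, with job stealing between nodes.
JobStealingCollisionSpi stealingSpi = new JobStealingCollisionSpi();
stealingSpi.setActiveJobsThreshold(1);

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setCollisionSpi(stealingSpi);
cfg.setFailoverSpi(new JobStealingFailoverSpi()); // re-routes stolen jobs to the stealing node

Ignition.start(cfg);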

Related

Bursts of RedisTimeoutException using StackExchange.Redis

I'm trying to track down intermittent "bursts" of timeouts using the StackExchange.Redis library. Here's a bit about our setup: our API is written in C# and runs on Windows 2008 and IIS. We have 4 API servers in production, and we have 4 Redis machines (running the latest Linux LTS), each with 2 instances of Redis (one master on port 7000, one slave on port 7001). I've looked at pretty much every aspect of the Redis servers and they look fantastic: no errors in the logs, CPU and network are great, and everything on the server side seems fantastic. I can tail -f the Redis logs while this is happening and don't see anything out of the ordinary (such as rewriting AOF files). I don't think the problem is with Redis.
Here's what I know so far:
We see these timeout exceptions several times an hour. Usually between 40-50 timeouts in a minute, sometimes up to 80-90. Then, they'll go away for several minutes. There were about 5,000 of these events in the past 24 hours, and they happen in bursts from a single API client.
These timeouts only happen against Redis master nodes, never against slave nodes. However, they happen with various Redis commands such as GETs and SETs.
When a burst of these timeouts happen, the calls are coming from a single API server but happen talking to various Redis nodes. For example, API3 might have a bunch of timeouts trying to call Cache1, Cache2 and Cache3. This is strong evidence that the issue is related to the API servers, not the Redis servers.
The Redis master nodes have 108 connected clients. I log current connections, and this number remains stable. There are no big spikes in connections, and it doesn't look like there's any bad code creating too many connections or not sharing ConnectionMultiplexer instances (I have one and it's static)
The Redis slave nodes have 58 connected clients, and this also looks completely stable as well.
We're using StackExchange.Redis version 1.2.6
Redis is using AOF mode, and size on disk is about 195MB
Here's an example timeout exception. Most look pretty much the same as this:
Type=StackExchange.Redis.RedisTimeoutException,
Message=Timeout performing GET limeade:allActivities, inst: 1, mgr: ExecuteSelect,
err: never, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 0, ar: 0,
clientName: LIMEADEAPI4, serverEndpoint: 10.xx.xx.11:7000, keyHashSlot: 1295,
IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=9,Free=32758,Min=2,Max=32767)
(Please take a look at this article for some common client-side issues that can cause timeouts:
http://stackexchange.github.io/StackExchange.Redis/Timeouts),
StackTrace=
  at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
  at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
  at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
  at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags)
  at Limeade.Caching.Providers.RedisCacheProvider`1.Get[T](K cacheKey, CacheItemVersion& cacheItemVersion) in ...
I've done a bit of research on tracking down these timeout exceptions, but what's rather surprising is that the numbers are all zeros: nothing in the queue, nothing waiting to be processed, and I have tons of threads free and not doing anything. Everything looks great.
Anyone have any ideas on how to fix this? The problem is these bursts of cache timeouts cause our database to be hit more, and in certain circumstances this is a bad thing. I'm happy to add any more info that anyone would find helpful.
Update: Connection Code
The code to connect to Redis is part of a fairly complex system that supports various cache environments and configuration, but I can probably boil it down to the basics. First, there's a CacheFactory class:
public class CacheFactory : ICacheFactory
{
    private static readonly ILogger log = LoggerManager.GetLogger(typeof(CacheFactory));
    private static readonly ICacheProvider<CacheKey> cache;

    static CacheFactory()
    {
        ICacheFactory<CacheKey> configuredFactory = CacheFactorySection.Current?.CreateConfiguredFactory<CacheKey>();
        if (configuredFactory == null)
        {
            // Some error handling, not important
        }
        cache = configuredFactory.GetDefaultCache();
    }

    // ...
}
The ICacheProvider is what implements a way to talk to a certain cache system, which can be configured. In this case, the configuredFactory is a RedisCacheFactory which looks like this:
public class RedisCacheFactory<T> : ICacheFactory<T> where T : CacheKey, ICacheKeyRepository
{
    private RedisCacheProvider<T> provider;
    private readonly RedisConfiguration configuration;

    public RedisCacheFactory(RedisConfiguration config)
    {
        this.configuration = config;
    }

    public ICacheProvider<T> GetDefaultCache()
    {
        return provider ?? (provider = new RedisCacheProvider<T>(configuration));
    }
}
The GetDefaultCache method is called once, in the static constructor, and returns a RedisCacheProvider. This class is what actually connects to Redis:
public class RedisCacheProvider<K> : ICacheProvider<K> where K : CacheKey, ICacheKeyRepository
{
    private readonly ConnectionMultiplexer redisConnection;
    private readonly IDatabase db;
    private readonly RedisCacheSerializer serializer;
    private static readonly ILog log = Logging.RedisCacheProviderLog<K>();
    private readonly CacheMonitor<K> cacheMonitor;
    private readonly TimeSpan defaultTTL;
    private int connectionErrors;

    public RedisCacheProvider(RedisConfiguration options)
    {
        redisConnection = ConnectionMultiplexer.Connect(options.EnvironmentOverride ?? options.Connection);
        db = redisConnection.GetDatabase();
        serializer = new RedisCacheSerializer(options.SerializationBinding);
        cacheMonitor = new CacheMonitor<K>();
        defaultTTL = options.DefaultTTL;
        IEnumerable<string> hosts = options.Connection.EndPoints.Select(e => (e as DnsEndPoint)?.Host);
        log.InfoFormat("Created Redis ConnectionMultiplexer connection. Hosts=({0})", String.Join(",", hosts));
    }

    // ...
}
The constructor creates a ConnectionMultiplexer based on the configured Redis endpoints (which are in some config file). I also log every time I create a connection. We don't see any excessive number of these log statements, and the connections to Redis remain stable.
In global.asax, try adding:
protected void Application_Start(object sender, EventArgs e)
{
    ThreadPool.SetMinThreads(200, 200);
}
For us, this reduced errors from ~50-100 daily to zero. I believe there is no general rule for what numbers to set, as it's system-dependent (200 works for us), so it might require some experimenting on your end.
I also believe this has improved the performance of the site.

Ignite Transactional From DataStream Receiver not supported?

I am using IgniteDataStreamer and would like to know if it is possible to use transactions from closures.
Unfortunately, when different IgniteDataStreamer threads update the same record in the cache (in the receive() method of a StreamReceiver), Ignite does not throw any TransactionOptimisticException, even though the CacheConfiguration atomicityMode is TRANSACTIONAL.
try (Transaction t = ignite.transactions().txStart(TransactionConcurrency.OPTIMISTIC, TransactionIsolation.SERIALIZABLE)) {
    try {
        cache.putAll(update);
        t.commit();
    } catch (TransactionOptimisticException toe) {
        LOG.error("TransactionOptimisticException Could not put all the profiles", toe);
    }
}
Data streamer is not transactional. To execute updates in a single transaction, they must be initiated on the same node and by the same thread. For more details and examples read here: https://apacheignite.readme.io/docs/transactions
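To illustrate that point, here is a minimal retry sketch (assuming the same ignite, cache, update map and LOG as in the question) where the whole batch is initiated by a single thread on a single node, so optimistic conflicts are detected and can be retried:
// Sketch: run the transactional batch from one client thread instead of inside
// the data streamer's receive() callback, retrying on optimistic conflicts.
boolean done = false;
while (!done) {
    try (Transaction tx = ignite.transactions().txStart(
            TransactionConcurrency.OPTIMISTIC, TransactionIsolation.SERIALIZABLE)) {
        cache.putAll(update);
        tx.commit();
        done = true;
    } catch (TransactionOptimisticException toe) {
        // a conflicting optimistic transaction was detected; retry the whole batch
        LOG.error("Could not put all the profiles, retrying", toe);
    }
}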

Apache Ignite - Distributed Queue and Executors

I am planning to use Apache Ignite Distributed Queue.
I am using Ignite with a Spring Boot application. So, on bootup, I will be adding 20 names to a queue. But, since there are 3 servers in the cluster, the same 20 names get added 3 times. However, I want to add them only once to the queue.
Ignite ignite = Ignition.ignite();
IgniteQueue<String> queue = ignite.queue(
    "queueName", // Queue name.
    0,           // Queue capacity. 0 for unbounded queue.
    null         // Collection configuration.
);
Distributed executors will be able to poll from the queue and run the task. Here, the executor is expected to poll, run the task and then add the same name back to the queue. I am trying to achieve round robin here.
Only one executor should be running the same task at any point of time, even though there are multiple servers in the cluster.
Any suggestions for this?
You can launch an Ignite cluster singleton service (https://apacheignite.readme.io/docs/cluster-singletons) which will fill the queue with data. Alternatively, you can add the data only from the coordinator node (the oldest node in the cluster): ignite.cluster().forOldest().node().isLocal()
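For example, a rough sketch of the coordinator-node approach (the names list is hypothetical; the isLocal() check is the one mentioned above):
// Sketch: only the current coordinator (oldest node) seeds the queue, so the
// 20 names are added once even though all 3 servers run the same bootup code.
if (ignite.cluster().forOldest().node().isLocal()) {
    IgniteQueue<String> queue = ignite.queue("queueName", 0, new CollectionConfiguration());
    for (String name : names) // 'names' is the hypothetical list of 20 names
        queue.add(name);
}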
I fixed bootup time duplicate cache loading issue this way:
final IgniteAtomicLong cacheLoadCnt = ignite.atomicLong(cacheName + "Cnt", 0, true);
if (cacheLoadCnt.get() == 0) {
    loadCache();
    cacheLoadCnt.addAndGet(1);
}
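On the consumer side described in the question, a rough polling sketch (hedged: process() is a hypothetical task handler; each poll() removes the name on exactly one node, so the same task never runs on two servers at once):
// Sketch: each server runs this loop against the shared distributed queue.
IgniteQueue<String> queue = ignite.queue("queueName", 0, null);
while (!Thread.currentThread().isInterrupted()) {
    String name = queue.poll(); // non-blocking; null when the queue is currently empty
    if (name == null)
        continue;               // in real code, back off or use a blocking take() instead
    process(name);              // hypothetical task handler
    queue.add(name);            // re-add the name for round robin, as described in the question
}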

Execute multiple downloads and wait for all to complete

I am currently working on an API service that allows 1 or more users to download 1 or more items from an S3 bucket and return the contents to the user. While the downloading is fine, the time taken to download several files is pretty much 100-150 ms * the number of files.
I have tried a few approaches to speeding this up - parallelStream() instead of stream() (which, considering the amount of simultaneous downloads, is at serious risk of running out of threads), as well as CompletableFutures, and even creating an ExecutorService, doing the downloads and then shutting down the pool. Typically I would only want a few concurrent tasks, e.g. 5 at a time per request, to try and cut down on the number of active threads.
I have tried integrating Spring @Cacheable to store the downloaded files in Redis (the files are read-only) - while this certainly cuts down response times (several ms to retrieve files compared to 100-150 ms), the benefit is only there once the file has been previously retrieved.
What is the best way to handle waiting on multiple async tasks to finish then getting the results, also considering I don't want (or don't think I could) have hundreds of threads opening http connections and downloading all at once?
You're right to be concerned about tying up the common fork/join pool used by default in parallel streams, since I believe it is used for other things like sort operations outside of the Stream API. Rather than saturating the common fork/join pool with an I/O-bound parallel stream, you can create your own fork/join pool for the stream. See this question to find out how to create an ad hoc ForkJoinPool with the size you want and run a parallel stream in it.
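A hedged sketch of that approach (the pool size of 5 and the s3Paths / mys3Downloader names are assumptions carried over from the question):
// Sketch: run the I/O-bound parallel stream inside a dedicated ForkJoinPool so it
// does not compete with the common pool used by other parallel operations.
ForkJoinPool downloadPool = new ForkJoinPool(5); // at most 5 concurrent downloads
try {
    List<Path> files = downloadPool.submit(() ->
            s3Paths.parallelStream()
                   .map(s3Path -> mys3Downloader.downloadAndGetPath(s3Path))
                   .collect(Collectors.toList()))
        .get(); // blocks until the whole stream pipeline finishes
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
} finally {
    downloadPool.shutdown();
}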
You could also create an ExecutorService with a fixed-size thread pool that would also be independent of the common fork/join pool, and would throttle the requests, using only the threads in the pool. It also lets you specify the number of threads to dedicate:
ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS_FOR_DOWNLOADS);
try {
    List<CompletableFuture<Path>> downloadTasks = s3Paths
        .stream()
        .map(s3Path -> CompletableFuture.supplyAsync(() -> mys3Downloader.downloadAndGetPath(s3Path), executor))
        .collect(Collectors.toList());

    // at this point, all requests are enqueued, and threads will be assigned as they become available
    executor.shutdown(); // stops accepting requests, does not interrupt threads,
                         // items in queue will still get threads when available

    // wait for all downloads to complete
    CompletableFuture.allOf(downloadTasks.toArray(new CompletableFuture[downloadTasks.size()])).get();

    // at this point, all downloads are finished,
    // so it's safe to shut down executor completely
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
} finally {
    executor.shutdownNow(); // important to call this when you're done with the executor
}
Following @Hank D's lead, you can encapsulate the creation of the executor service to ensure that you do, indeed, call ExecutorService::shutdownNow after using said executor:
private static <VALUE> VALUE execute(
    final int nThreads,
    final Function<ExecutorService, VALUE> function
) {
    ExecutorService executorService = Executors.newFixedThreadPool(nThreads);
    try {
        return function.apply(executorService);
    } finally {
        executorService.shutdownNow(); // important to call this when you're done with the executor service
    }
}
public static void main(final String... arguments) {
    // define variables
    final List<CompletableFuture<Path>> downloadTasks = execute(
        MAX_THREADS_FOR_DOWNLOADS,
        executor -> s3Paths
            .stream()
            .map(s3Path -> CompletableFuture.supplyAsync(
                () -> mys3Downloader.downloadAndGetPath(s3Path),
                executor
            ))
            .collect(Collectors.toList())
    );
    // use downloadTasks
}

Jedis getResource() is taking a lot of time

I am trying to use Sentinel Redis to get/set keys from Redis. I was trying to stress test my setup with about 2000 concurrent requests.
I used Sentinel to put a single key on Redis and then I executed 1000 concurrent GET requests against Redis.
But the underlying Jedis used by my Sentinel setup blocks on the getResource() call (the pool size is 500), and the overall average response time that I am achieving is around 500 ms, while my target was about 10 ms.
I am attaching a sample of the jvisualvm snapshot here:
redis.clients.jedis.JedisSentinelPool.getResource() 98.02227 4.0845232601E7 ms 4779
redis.clients.jedis.BinaryJedis.get() 1.6894469 703981.381 ms 141
org.apache.catalina.core.ApplicationFilterChain.doFilter() 0.12820946 53424.035 ms 6875
org.springframework.core.serializer.support.DeserializingConverter.convert() 0.046286926 19287.457 ms 4
redis.clients.jedis.JedisSentinelPool.returnResource() 0.04444578 18520.263 ms 4
org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept() 0.035538 14808.45 ms 11430
Can anyone help me debug this issue further?
Here is the JedisSentinelPool implementation of getResource() from the Jedis sources (2.6.2):
@Override
public Jedis getResource() {
    while (true) {
        Jedis jedis = super.getResource();
        jedis.setDataSource(this);

        // get a reference because it can change concurrently
        final HostAndPort master = currentHostMaster;
        final HostAndPort connection = new HostAndPort(jedis.getClient().getHost(), jedis.getClient().getPort());

        if (master.equals(connection)) {
            // connected to the correct master
            return jedis;
        } else {
            returnBrokenResource(jedis);
        }
    }
}
Note the while (true) and the returnBrokenResource(jedis): this means that it grabs an arbitrary Jedis resource from the pool, checks whether it is connected to the correct master, and retries if it is not the right one. It is a dirty check and also a blocking call.
The super.getResource() call refers to the traditional JedisPool implementation, which is actually based on Apache Commons Pool (2.0). It does a lot to get an object from the pool, and I think it even repairs failed connections. With a lot of contention on your pool, as is likely in your stress test, it can take a long time to get a resource from the pool, just to see that it is not connected to the correct master, so you end up calling it again, adding contention, slowing down resource acquisition, and so on.
You should check all the Jedis instances in your pool to see if there are a lot of 'bad' connections.
Maybe you should give up using a common pool for your stress test (only create Jedis instances manually connected to the correct node, and close them nicely), or set up multiple pools to mitigate the cost of looking for "dirty" unchecked Jedis resources.
Also, with a pool of 500 Jedis instances you cannot serve 1000 concurrent queries; you need at least 1000 instances in the pool.
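For example, a hedged sketch of sizing the pool to match the concurrency (the master name, sentinel addresses and key are placeholders):
// Sketch: size the underlying Commons Pool so it can hand out at least as many
// connections as there are concurrent requests, reducing blocking in getResource().
GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
poolConfig.setMaxTotal(1000); // at least one connection per concurrent query
poolConfig.setMaxIdle(1000);

Set<String> sentinels = new HashSet<>(Arrays.asList("sentinel1:26379", "sentinel2:26379"));
JedisSentinelPool pool = new JedisSentinelPool("mymaster", sentinels, poolConfig);

Jedis jedis = pool.getResource();
try {
    String value = jedis.get("some-key");
} finally {
    jedis.close(); // returns the connection to the pool
}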