Ignite: Failed to wait for initial partition map exchange - ignite

I have a very simple spring component:
@Component
public abstract class IgniteComponent {

    protected final Ignite ignite;

    /**
     * Start the ignite
     */
    public IgniteComponent() {
        this.ignite = Ignition.getOrStart(new IgniteConfiguration());
    }

    /**
     * Get the ignite
     *
     * @return The ignite
     */
    public Ignite getIgnite() {
        return this.ignite;
    }
}
When I use this component in unit tests locally everything works fine.
But when I run my unit tests on a Bamboo agent, I always get the following:
24-Jul-2018 13:36:38 2018-07-24 11:36:38.888 WARN 7259 --- [ Test worker] .i.p.c.GridCachePartitionExchangeManager : Failed to wait for initial partition map exchange. Possible reasons are:
24-Jul-2018 13:36:38 ^-- Transactions in deadlock.
24-Jul-2018 13:36:38 ^-- Long running transactions (ignore if this is the case).
24-Jul-2018 13:36:38 ^-- Unreleased explicit locks.
I can't find any reason for this. The Ignite version I'm working with is:
dependencySet (group: 'org.apache.ignite', version: '2.2.0') {
    entry 'ignite-core'
    entry 'ignite-spring'
}
What is usually the cause of this issue?

The multicast IP finder is used by default. If you run Ignite on a shared agent, it will try to join any other nodes present there, with unexpected results. Try disabling multicast (for example, by using the static VM IP finder), or provide the whole log of your instance.
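For illustration, here is a minimal sketch of the static-IP-finder approach suggested above, assuming the test node only needs to see itself on the build agent (the localhost port range is an example value):

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class IsolatedIgniteNode {

    public static Ignite start() {
        // Static IP finder: only look for nodes on localhost instead of multicasting,
        // so the test node cannot accidentally join other nodes on a shared agent.
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Collections.singletonList("127.0.0.1:47500..47509"));

        TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
        discoverySpi.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoverySpi);

        return Ignition.getOrStart(cfg);
    }
}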

Related

Why Apache Ignite Cache.replace-K-V-V api call performing slow?

We are running an Ignite cluster with 12 nodes on Ignite 2.7.0, OpenJDK 1.8, on the RHEL platform.
We are seeing heavy CPU time spent in https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteCache.html#replace-K-V-V-
We are seeing slowness in one of our processes, and when we drilled into it by profiling the JVM, the main culprit (taking ~78% of total time) appears to be the Ignite cache.replace(K, V, V) API call.
Of the 77.9% spent in replace, 39% is taken by GridCacheAdapter.equalVal and 38.5% by GridCacheAdapter.put.
The cache is PARTITIONED and ATOMIC with readThrough, writeThrough and writeBehindEnabled set to true.
Attaching the profiling snapshot of one node (the profiling results on the other nodes are similar). Can someone please check and suggest what could be the cause, or whether there is a known performance issue in this Ignite version related to the cache.replace(K, V, V) API?
[JVM profiling snapshot of one node]
I guess it can be related to the following issue:
https://issues.apache.org/jira/browse/IGNITE-5003
The problem there is that operations on the same key block until the previous batch of updates (the one containing that key) has been stored in the database.
As far as I can see, the fix should be included in Ignite 2.8.
Update:
I tested the putAll operation. The following two pictures show that putAll waits on GridCacheWriteBehindStore.write (in two different threads), which contains updateCache:
public void write(Entry<? extends K, ? extends V> entry) {
    try {
        if (log.isDebugEnabled())
            log.debug(S.toString("Store put",
                "key", entry.getKey(), true,
                "val", entry.getValue(), true));

        updateCache(entry.getKey(), entry, StoreOperation.PUT);
    }
    // catch block omitted from this excerpt
The issue above can affect your put operations (and replace as well).
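For context, here is a minimal sketch (not taken from the question) of a cache configuration matching the setup described above: PARTITIONED, ATOMIC, read-through/write-through with write-behind enabled. MyCacheStore, the cache name, and the tuning values are hypothetical placeholders; the write-behind flush settings control how often batches like the one discussed above are pushed to the database.

import javax.cache.Cache;
import javax.cache.configuration.FactoryBuilder;

import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;

public class WriteBehindCacheConfig {

    /** Placeholder store: a real implementation would read and write the database. */
    public static class MyCacheStore extends CacheStoreAdapter<Long, String> {
        @Override public String load(Long key) { return null; }
        @Override public void write(Cache.Entry<? extends Long, ? extends String> e) { /* no-op */ }
        @Override public void delete(Object key) { /* no-op */ }
    }

    public static CacheConfiguration<Long, String> build() {
        CacheConfiguration<Long, String> cfg = new CacheConfiguration<>("myCache"); // hypothetical name

        cfg.setCacheMode(CacheMode.PARTITIONED);
        cfg.setAtomicityMode(CacheAtomicityMode.ATOMIC);

        // Read/write-through against the underlying store.
        cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(MyCacheStore.class));
        cfg.setReadThrough(true);
        cfg.setWriteThrough(true);

        // Write-behind: updates are batched and flushed to the store asynchronously.
        cfg.setWriteBehindEnabled(true);
        cfg.setWriteBehindFlushFrequency(1_000); // flush at least once per second (example value)
        cfg.setWriteBehindBatchSize(512);        // example batch size

        return cfg;
    }
}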

Bursts of RedisTimeoutException using StackExchange.Redis

I'm trying to track down intermittent "bursts" of timeouts using the StackExchange.Redis library. Here's a bit about our setup: our API is written in C# and runs on Windows 2008 and IIS. We have 4 API servers in production, and we have 4 Redis machines (running the latest Linux LTS), each with 2 instances of Redis (one master on port 7000, one slave on port 7001). I've looked at pretty much every aspect of the Redis servers and they look fantastic: no errors in the logs, CPU and network are fine, and everything on the server side seems healthy. I can tail -f the Redis logs while this is happening and don't see anything out of the ordinary (such as rewriting AOF files). I don't think the problem is with Redis.
Here's what I know so far:
We see these timeout exceptions several times an hour. Usually between 40-50 timeouts in a minute, sometimes up to 80-90. Then, they'll go away for several minutes. There were about 5,000 of these events in the past 24 hours, and they happen in bursts from a single API client.
These timeouts only happen against Redis master nodes, never against slave nodes. However, they happen with various Redis commands such as GETs and SETs.
When a burst of these timeouts happen, the calls are coming from a single API server but happen talking to various Redis nodes. For example, API3 might have a bunch of timeouts trying to call Cache1, Cache2 and Cache3. This is strong evidence that the issue is related to the API servers, not the Redis servers.
The Redis master nodes have 108 connected clients. I log current connections, and this number remains stable. There are no big spikes in connections, and it doesn't look like there's any bad code creating too many connections or not sharing ConnectionMultiplexer instances (I have one and it's static)
The Redis slave nodes have 58 connected clients, and this also looks completely stable as well.
We're using StackExchange.Redis version 1.2.6
Redis is using AOF mode, and size on disk is about 195MB
Here's an example timeout exception. Most look pretty much the same as this:
Type=StackExchange.Redis.RedisTimeoutException, Message=Timeout performing GET limeade:allActivities, inst: 1, mgr: ExecuteSelect, err: never, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, clientName: LIMEADEAPI4, serverEndpoint: 10.xx.xx.11:7000, keyHashSlot: 1295, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=9,Free=32758,Min=2,Max=32767) (Please take a look at this article for some common client-side issues that can cause timeouts: http://stackexchange.github.io/StackExchange.Redis/Timeouts),
StackTrace=
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags)
at Limeade.Caching.Providers.RedisCacheProvider`1.Get[T](K cacheKey, CacheItemVersion& cacheItemVersion) in ...
I've done a bit of research on tracking down these timeout exceptions, but what's rather surprising is that all the numbers are zeros. Nothing in the queue, nothing waiting to be processed, and tons of free threads doing nothing. Everything looks great.
Anyone have any ideas on how to fix this? The problem is these bursts of cache timeouts cause our database to be hit more, and in certain circumstances this is a bad thing. I'm happy to add any more info that anyone would find helpful.
Update: Connection Code
The code to connect to Redis is part of a fairly complex system that supports various cache environments and configuration, but I can probably boil it down to the basics. First, there's a CacheFactory class:
public class CacheFactory : ICacheFactory
{
    private static readonly ILogger log = LoggerManager.GetLogger(typeof(CacheFactory));

    private static readonly ICacheProvider<CacheKey> cache;

    static CacheFactory()
    {
        ICacheFactory<CacheKey> configuredFactory = CacheFactorySection.Current?.CreateConfiguredFactory<CacheKey>();

        if (configuredFactory == null)
        {
            // Some error handling, not important
        }

        cache = configuredFactory.GetDefaultCache();
    }

    // ...
}
The ICacheProvider is what implements a way to talk to a certain cache system, which can be configured. In this case, the configuredFactory is a RedisCacheFactory which looks like this:
public class RedisCacheFactory<T> : ICacheFactory<T> where T : CacheKey, ICacheKeyRepository
{
    private RedisCacheProvider<T> provider;

    private readonly RedisConfiguration configuration;

    public RedisCacheFactory(RedisConfiguration config)
    {
        this.configuration = config;
    }

    public ICacheProvider<T> GetDefaultCache()
    {
        return provider ?? (provider = new RedisCacheProvider<T>(configuration));
    }
}
The GetDefaultCache method is called once, in the static constructor, and returns a RedisCacheProvider. This class is what actually connects to Redis:
public class RedisCacheProvider<K> : ICacheProvider<K> where K : CacheKey, ICacheKeyRepository
{
    private readonly ConnectionMultiplexer redisConnection;
    private readonly IDatabase db;
    private readonly RedisCacheSerializer serializer;
    private static readonly ILog log = Logging.RedisCacheProviderLog<K>();
    private readonly CacheMonitor<K> cacheMonitor;
    private readonly TimeSpan defaultTTL;
    private int connectionErrors;

    public RedisCacheProvider(RedisConfiguration options)
    {
        redisConnection = ConnectionMultiplexer.Connect(options.EnvironmentOverride ?? options.Connection);
        db = redisConnection.GetDatabase();
        serializer = new RedisCacheSerializer(options.SerializationBinding);
        cacheMonitor = new CacheMonitor<K>();
        defaultTTL = options.DefaultTTL;

        IEnumerable<string> hosts = options.Connection.EndPoints.Select(e => (e as DnsEndPoint)?.Host);
        log.InfoFormat("Created Redis ConnectionMultiplexer connection. Hosts=({0})", String.Join(",", hosts));
    }

    // ...
}
The constructor creates a ConnectionMultiplexer based on the configured Redis endpoints (which are in some config file). I also log every time I create a connection. We don't see any excessive number of these log statements, and the connections to Redis remains stable.
In global.asax, try adding:
protected void Application_Start(object sender, EventArgs e)
{
    ThreadPool.SetMinThreads(200, 200);
}
For us, this reduced errors from ~50-100 daily to zero. I believe there is no general rule for what numbers to set, as it's system dependent (200 works for us), so it might require some experimenting on your end.
I also believe this has improved the performance of the site.

One task per node only in Apache Ignite

I'm relatively new to Apache Ignite. I'm using Ignite compute to distribute tasks to nodes. My goal is a task dispatcher that produces tasks and submits these only to nodes that are "free". One node can only do one task at a time. If all nodes have a task running, the dispatcher shall wait for the next node to become available and then submit the next task.
I can implement this with a queue and async Callables, but I wonder whether Ignite has a built-in class that does something like this. I'm not sure whether ComputeTaskSplitAdapter is what I need to look at; I don't fully understand its purpose.
Any help appreciated.
Server nodes can join and leave the cluster while tasks are distributed.
Tasks can take different amounts of time on the nodes, and as soon as a server finishes a task it shall get the next one.
Here's my node code:
JobStealingCollisionSpi spi = new JobStealingCollisionSpi();
spi.setActiveJobsThreshold(1);
IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setCollisionSpi(spi);
Ignition.start(cfg);
And this is my job distribution code (for testing):
JobStealingCollisionSpi spi = new JobStealingCollisionSpi();
spi.setActiveJobsThreshold(1);

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setCollisionSpi(spi);

Ignition.setClientMode(true);
Ignite ignite = Ignition.start(cfg);

for (int i = 0; i < 10; i++)
{
    ignite.compute().runAsync(new IgniteRunnable()
    {
        @Override
        public void run()
        {
            System.out.print("Sleeping...");

            try
            {
                Thread.sleep(10000);
            } catch (InterruptedException e)
            {
                e.printStackTrace();
            }

            System.out.println("Done.");
        }
    });
}
Yes, Apache Ignite has direct support for it. Please take a look at the One-at-a-Time section in the Job Scheduling documentation: https://apacheignite.readme.io/docs/job-scheduling#section-one-at-a-time
Note that every server has its own waiting queue and servers will move to the next job in their queue immediately after they are done with a previous job.
If you would like even more aggressive scheduling, then you can take a look at Job-Stealing scheduling here: https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/collision/jobstealing/JobStealingCollisionSpi.html
With job stealing enabled, servers will steal jobs from the job queues of other servers once their own queues become empty. Most of the parameters are configurable.
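For illustration, here is a minimal sketch of the one-at-a-time setup described in the linked documentation, using the standard FifoQueueCollisionSpi on a server node (an alternative to the JobStealingCollisionSpi configuration shown in the question):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.collision.fifoqueue.FifoQueueCollisionSpi;

public class OneJobAtATimeServer {

    public static void main(String[] args) {
        // FIFO collision SPI: each server executes at most one job at a time
        // and keeps the rest queued until the active job completes.
        FifoQueueCollisionSpi colSpi = new FifoQueueCollisionSpi();
        colSpi.setParallelJobsNumber(1);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCollisionSpi(colSpi);

        Ignition.start(cfg);
    }
}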

How to initiate warming when a cache is created?

I'd like to use an Ignite cluster to warm a PARTITIONED cache from an existing database. The existing database is not partitioned and is expensive to scan, so I'd like to perform a single scan when the cache is created by the cluster. Once the job completes, the result would be a cache containing all data from the existing database, partitioned and evenly distributed across the cluster.
How do you implement a job that runs when a cache is created by Ignite?
Ignite integrates with underlying stores via CacheStore [1] implementations. Refer to [2] for details about your particular use case.
[1] https://apacheignite.readme.io/docs/persistent-store
[2] https://apacheignite.readme.io/docs/data-loading
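For illustration, here is a minimal sketch of the data-loading approach from [2]: one node scans the existing database a single time and feeds the rows into the partitioned cache through an IgniteDataStreamer, which distributes the entries across the cluster. The cache name, JDBC URL, and query below are hypothetical placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;

public class CacheWarmer {

    /** Runs on a single node; the streamer takes care of partitioning the entries. */
    public static void warm(Ignite ignite) throws Exception {
        // Assumes the cache named "warmCache" has already been created,
        // e.g. via ignite.getOrCreateCache("warmCache").
        try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("warmCache");
             Connection conn = DriverManager.getConnection("jdbc:postgresql://legacy-db/legacy");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, payload FROM legacy_table")) {

            // Single scan of the existing database.
            while (rs.next())
                streamer.addData(rs.getLong("id"), rs.getString("payload"));
        }
    }
}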
You can create a Service that runs once on cluster start and then cancels itself. It can use a cache to store state, so it will not run if it's deployed in the cluster a second time.
The following abstract Service runs executeOnce once per cluster the first time it's deployed after cluster start:
abstract class ExecuteOnceService extends Service {
  val ExecuteOnceCacheName = "_execute_once_service"

  val config = new CacheConfiguration[String, java.lang.Boolean](ExecuteOnceCacheName)
    .setCacheMode(CacheMode.PARTITIONED)
    .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)

  @IgniteInstanceResource
  var ignite: Ignite = _

  override def execute(ctx: ServiceContext): Unit = {
    val cache = ignite.getOrCreateCache(config)
    val executed = cache.getAndPutIfAbsent(ctx.name(), java.lang.Boolean.TRUE)

    if (executed != java.lang.Boolean.TRUE) executeOnce(ctx)

    ignite.services().cancel(ctx.name())
  }

  def executeOnce(ctx: ServiceContext): Unit
}

Execute multiple downloads and wait for all to complete

I am currently working on an API service that allows 1 or more users to download 1 or more items from an S3 bucket and return the contents to the user. While the downloading is fine, the time taken to download several files is pretty much 100-150 ms * the number of files.
I have tried a few approaches to speeding this up: parallelStream() instead of stream() (which, considering the number of simultaneous downloads, carries a serious risk of running out of threads), as well as CompletableFutures, and even creating an ExecutorService, doing the downloads and then shutting down the pool. Typically I would only want a few concurrent tasks per request, e.g. 5 at a time, to cut down on the number of active threads.
I have tried integrating Spring #Cacheable to store the downloaded files to Redis (the files are readonly) - while this certainly cuts down response times (several ms to retrieve files compared to 100-150 ms), the benefits are only there once the file has been previously retrieved.
What is the best way to handle waiting on multiple async tasks to finish then getting the results, also considering I don't want (or don't think I could) have hundreds of threads opening http connections and downloading all at once?
You're right to be concerned about tying up the common fork/join pool used by default in parallel streams, since I believe it is used for other things like sort operations outside of the Stream api. Rather than saturating the common fork/join pool with an I/O-bound parallel stream, you can create your own fork/join pool for the Stream. See this question to find out how to create an ad hoc ForkJoinPool with the size you want and run a parallel stream in it.
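For illustration, here is a minimal sketch of that approach, assuming a hypothetical downloadAndGetPath helper; submitting the terminal operation from inside the custom pool is the usual trick for keeping an I/O-bound parallel stream off the common pool:

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

public class CustomPoolDownloader {

    // Hypothetical stand-in for the real S3 download.
    private static Path downloadAndGetPath(String s3Path) {
        return Paths.get(s3Path);
    }

    static List<Path> downloadAll(List<String> s3Paths, int maxThreads) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(maxThreads);
        try {
            // The parallel stream runs on the custom pool's workers because the
            // terminal operation is submitted from inside that pool.
            return pool.submit(() ->
                s3Paths.parallelStream()
                       .map(CustomPoolDownloader::downloadAndGetPath)
                       .collect(Collectors.toList())
            ).get();
        } finally {
            pool.shutdown();
        }
    }
}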
You could also create an ExecutorService with a fixed-size thread pool that would also be independent of the common fork/join pool, and would throttle the requests, using only the threads in the pool. It also lets you specify the number of threads to dedicate:
ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS_FOR_DOWNLOADS);

try {
    List<CompletableFuture<Path>> downloadTasks = s3Paths
        .stream()
        .map(s3Path -> CompletableFuture.supplyAsync(() -> mys3Downloader.downloadAndGetPath(s3Path), executor))
        .collect(Collectors.toList());

    // at this point, all requests are enqueued, and threads will be assigned as they become available

    executor.shutdown(); // stops accepting new tasks, does not interrupt threads;
                         // items in the queue will still get threads when available

    // wait for all downloads to complete
    CompletableFuture.allOf(downloadTasks.toArray(new CompletableFuture[downloadTasks.size()])).join();

    // at this point, all downloads are finished,
    // so it's safe to shut down the executor completely
} catch (CompletionException e) {
    // join() wraps a failed download in an unchecked CompletionException
    e.printStackTrace();
} finally {
    executor.shutdownNow(); // important to call this when you're done with the executor
}
Following @Hank D's lead, you can encapsulate the creation of the executor service to ensure that you do, indeed, call ExecutorService::shutdownNow after using said executor:
private static <VALUE> VALUE execute(
    final int nThreads,
    final Function<ExecutorService, VALUE> function
) {
    final ExecutorService executorService = Executors.newFixedThreadPool(nThreads);

    try {
        return function.apply(executorService);
    } finally {
        // important to call this when you're done with the executor service;
        // shutdownNow() interrupts tasks that are still running, so the supplied
        // function must finish its work (e.g. join its futures) before returning
        executorService.shutdownNow();
    }
}

public static void main(final String... arguments) {
    // define variables

    final List<CompletableFuture<Path>> downloadTasks = execute(
        MAX_THREADS_FOR_DOWNLOADS,
        executor -> {
            final List<CompletableFuture<Path>> tasks = s3Paths
                .stream()
                .map(s3Path -> CompletableFuture.supplyAsync(
                    () -> mys3Downloader.downloadAndGetPath(s3Path),
                    executor
                ))
                .collect(Collectors.toList());

            // wait here: the executor is shut down as soon as execute(...) returns
            CompletableFuture.allOf(tasks.toArray(new CompletableFuture[tasks.size()])).join();

            return tasks;
        }
    );

    // use downloadTasks
}