Understanding Infinispan behavior during rebalancing and in async mode

I am new to Infinispan. Even after going through the Infinispan User Guide and searching around, I am not able to figure out the behavior of Infinispan in the cases below:
1) Does it lock HotRod client reads while rebalancing is taking place?
2) How does Infinispan behave in REPL mode with async replication and a near cache on the HotRod client side?
(I found that if the near cache is disabled the client can get the data, but not with the near cache enabled. Does this have anything to do with how the near cache is updated?)
Server code:
GlobalConfigurationBuilder globalConfig = GlobalConfigurationBuilder.defaultClusteredBuilder();
globalConfig.transport().clusterName("infiniReplicatedCluster").globalJmxStatistics().enable().allowDuplicateDomains(Boolean.TRUE);
ConfigurationBuilder configBuilder = new ConfigurationBuilder();
EmbeddedCacheManager embeddedCacheManager = new DefaultCacheManager(globalConfig.build());
configBuilder.dataContainer().compatibility().enable().clustering().cacheMode(CacheMode.REPL_ASYNC)
.async().replQueueInterval(120, TimeUnit.SECONDS).useReplQueue(true).hash();
embeddedCacheManager.defineConfiguration("TestCache", configBuilder.build());
Cache<String, TopologyData> cache = embeddedCacheManager.getCache("TestCache");
cache.put("00000", new TopologyData());
HotRodServerConfiguration build = new HotRodServerConfigurationBuilder().build();
HotRodServer server = new HotRodServer();
server.start(build, embeddedCacheManager);
Client code:
ConfigurationBuilder remoteBuilder = new ConfigurationBuilder();
remoteBuilder.nearCache().mode(NearCacheMode.EAGER).maxEntries(100);
RemoteCacheManager remoteCacheManager = new RemoteCacheManager(remoteBuilder.build());
remoteCache = remoteCacheManager.getCache("TestCache");
System.out.println(remoteCache.get(fetchKey));
With the above code, the scenarios and results are listed below (all runs were repeated multiple times with the same result):
- Without near cache, 1 key --> got the value as expected
- With near cache (LAZY/EAGER), 1 key --> null
- In the same run, the same key read twice with near cache (LAZY/EAGER) --> null (first time), expected value (second time)
Clarification needed: is there sample code to re-verify the HotRod client's load-balancing (RoundRobin) behavior in DIST mode? (I was able to verify it with REPL mode, and it works as claimed.)

State transfer in Infinispan is non-blocking
Not sure I understand completely: you mean to say that, when enabling near caching on a Hot Rod client, reading from an async REPL cache doesn't work? Does it just hang? Do you have code that you can share?
A clarification: Hot Rod goes to the primary owner in both DIST and REPL mode (REPL is simply a special DIST mode where the number of owners equals the size of the cluster), hashed according to the key, and only falls back to Round Robin if the primary is not responding.
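For reference, a minimal sketch of a DIST configuration on the embedded side, illustrating the point above (the cache name and numOwners value are illustrative, not taken from the question; it reuses the embeddedCacheManager from the server code):
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;

// Hedged sketch: a distributed cache in embedded mode. With numOwners equal to the
// cluster size this behaves like REPL, as the comment above describes.
ConfigurationBuilder distBuilder = new ConfigurationBuilder();
distBuilder.clustering()
        .cacheMode(CacheMode.DIST_SYNC)
        .hash().numOwners(2);   // each entry is kept on 2 of the cluster's nodes
embeddedCacheManager.defineConfiguration("TestDistCache", distBuilder.build());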

Related

Infinispan clustered lock performance does not improve with more nodes?

I have a piece of code that is essentially executing the following with Infinispan in embedded mode, using version 13.0.0 of the -core and -clustered-lock modules:
@Inject
lateinit var lockManager: ClusteredLockManager

private fun getLock(lockName: String): ClusteredLock {
    lockManager.defineLock(lockName)
    return lockManager.get(lockName)
}

fun createSession(sessionId: String) {
    tryLockCounter.increment()
    logger.debugf("Trying to start session %s. trying to acquire lock", sessionId)
    Future.fromCompletionStage(getLock(sessionId).lock()).map {
        acquiredLockCounter.increment()
        logger.debugf("Starting session %s. Got lock", sessionId)
    }.onFailure {
        logger.errorf(it, "Failed to start session %s", sessionId)
    }
}
I take this piece of code and deploy it to Kubernetes. I then run it in six pods distributed over six nodes in the same region. The code exposes createSession with random GUIDs through an API. This API is called and creates sessions in chunks of 500, using a k8s service in front of the pods, which means the load gets balanced over the pods. I notice that the execution time to acquire a lock grows linearly with the number of sessions. In the beginning it's around 10 ms; when there are about 20,000 sessions it takes about 100 ms, and the trend continues in a stable fashion.
I then take the same code and run it, but this time with twelve pods on twelve nodes. To my surprise, the performance characteristics are almost identical to when I had six pods. I've been digging into the code but still haven't figured out why, and I'm wondering if there's a good reason why Infinispan doesn't seem to perform better with more nodes here.
For completeness, the configuration of the locks is as follows:
val global = GlobalConfigurationBuilder.defaultClusteredBuilder()
global.addModule(ClusteredLockManagerConfigurationBuilder::class.java)
    .reliability(Reliability.AVAILABLE)
    .numOwner(1)
Looking at the code, the clustered locks use a DIST_SYNC cache, which should spread the load across the different nodes.
UPDATE:
The two counters in the code above are simply Micrometer counters. It is through them and Prometheus that I can see how lock acquisition starts to slow down.
It is correctly observed that there is one lock created per session id; this is by design and what we'd like. Our use case is that we want to ensure that a session is running in at least one place. Without going too deep into detail, this can be achieved by ensuring that at least two pods are trying to acquire the same lock. The Infinispan library is great in that it tells us directly when the lock holder dies without any additional chattiness between pods, which means we have a "cheap" way of ensuring that execution of the session continues when one pod is removed.
After digging deeper into the code I found the following in CacheNotifierImpl in the core library:
private CompletionStage<Void> doNotifyModified(K key, V value, Metadata metadata, V previousValue,
        Metadata previousMetadata, boolean pre, InvocationContext ctx, FlagAffectedCommand command) {
    if (clusteringDependentLogic.running().commitType(command, ctx, extractSegment(command, key), false).isLocal()
            && (command == null || !command.hasAnyFlag(FlagBitSets.PUT_FOR_STATE_TRANSFER))) {
        EventImpl<K, V> e = EventImpl.createEvent(cache.wired(), CACHE_ENTRY_MODIFIED);
        boolean isLocalNodePrimaryOwner = isLocalNodePrimaryOwner(key);
        Object batchIdentifier = ctx.isInTxScope() ? null : Thread.currentThread();
        try {
            AggregateCompletionStage<Void> aggregateCompletionStage = null;
            for (CacheEntryListenerInvocation<K, V> listener : cacheEntryModifiedListeners) {
                // Need a wrapper per invocation since converter could modify the entry in it
                configureEvent(listener, e, key, value, metadata, pre, ctx, command, previousValue, previousMetadata);
                aggregateCompletionStage = composeStageIfNeeded(aggregateCompletionStage,
                        listener.invoke(new EventWrapper<>(key, e), isLocalNodePrimaryOwner));
            }
The lock library uses a clustered listener on the entry-modified event, and that listener uses a filter so that it is only notified when the key for the lock is modified. But it seems to me the core library still has to check this condition on every registered listener, and that list of course becomes very large as the number of sessions grows. I suspect this is the reason, and if it is, it would be really awesome if the core library supported a kind of key filter so that it could use a hashmap for these listeners instead of going through the whole list of listeners.
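To make that suggestion concrete, the idea is roughly the following (this is not Infinispan's actual API, just a sketch of the data structure: listeners interested in a single key are registered in a map keyed by that key, so dispatch costs one lookup instead of a scan over every listener):
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Sketch only: per-key listener registration so that notification cost does not grow
// with the total number of registered listeners. The event type E is a placeholder.
class KeyedListeners<K, E> {
    private final Map<K, List<Consumer<E>>> byKey = new ConcurrentHashMap<>();

    void register(K key, Consumer<E> listener) {
        byKey.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(listener);
    }

    void notifyModified(K key, E event) {
        // only the listeners registered for this key are invoked
        List<Consumer<E>> listeners = byKey.get(key);
        if (listeners != null) {
            listeners.forEach(l -> l.accept(event));
        }
    }
}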
I believe you are creating a clustered lock per session id. Is this what you need? What is acquiredLockCounter? We are about to deprecate the "lock" method in favour of "tryLock" with a timeout, since the lock method will block forever if the clustered lock is never acquired. Do you ever unlock the clustered lock in another piece of code? If you could share a complete reproducer of the code, it would be very helpful for us. Thanks!
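As a side note, a minimal sketch of what the tryLock suggestion could look like (plain Java rather than the question's Kotlin; the timeout value and the handling of the boolean result are illustrative assumptions):
import java.util.concurrent.TimeUnit;
import org.infinispan.lock.api.ClusteredLock;
import org.infinispan.lock.api.ClusteredLockManager;

class SessionLocks {
    private final ClusteredLockManager lockManager;

    SessionLocks(ClusteredLockManager lockManager) {
        this.lockManager = lockManager;
    }

    void createSession(String sessionId) {
        lockManager.defineLock(sessionId);               // no-op if the lock is already defined
        ClusteredLock lock = lockManager.get(sessionId);
        // tryLock completes with false instead of blocking forever when the lock
        // cannot be acquired within the timeout
        lock.tryLock(10, TimeUnit.SECONDS).thenAccept(acquired -> {
            if (acquired) {
                // start the session; call lock.unlock() when the session ends
            } else {
                // lock not acquired in time: retry, or assume another pod owns the session
            }
        });
    }
}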

using .net StackExchange.Redis with "wait" isn't working as expected

I am doing a read/write test with a Redis cluster (servers): 1 master + 2 slaves. The following is the key WRITE code:
var trans = redisDatabase.CreateTransaction();
Task<bool> setResult = trans.StringSetAsync(key, serializedValue, TimeSpan.FromSeconds(10));
Task<RedisResult> waitResult = trans.ExecuteAsync("wait", 3, 10000);
trans.Execute();
trans.WaitAll(setResult, waitResult);
Using the following as the connection string:
[server1 ip]:6379,[server2 ip]:6379,[server3 ip]:6379,ssl=False,abortConnect=False
I am running 100 threads, each doing 1000 loops of the following steps:
generate a GUID as the key and a random value of 1024 bytes
write the key (using the code above)
retrieve the key using "var stringValue = redisDatabase.StringGet(key, CommandFlags.PreferSlave);"
compare the two values and print an error if they differ.
Running this test a few times generates several errors. I am trying to understand why, as the "wait" operation (with 10 seconds!) should have guaranteed the write to all slaves before returning.
Any idea?
WAIT isn't supported by SE.Redis, as explained by its prolific author in Stackexchange.redis lacks the "WAIT" support.
What about improving the consistency guarantees by adding some "check, write, read" iterations?
1. SET a new key/value pair (on the master node).
2. Read it (set CommandFlags to DemandReplica).
3. Not there yet? Wait and try X times.
4. a) Still not there? SET again and go back to (3), or give up.
   b) There? You're "done".
It won't be perfect, but it should reduce the probability of losing a SET.
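A rough sketch of that loop (written here in Java with Jedis rather than StackExchange.Redis; the separate master/replica connections, retry counts and sleep interval are illustrative assumptions):
import redis.clients.jedis.Jedis;

// Hedged sketch of the "check, write, read" iteration described above.
class ReadBackWriter {
    static boolean writeWithReadBack(Jedis master, Jedis replica, String key, String value)
            throws InterruptedException {
        for (int attempt = 0; attempt < 3; attempt++) {      // 4a) SET again and retry
            master.set(key, value);                          // 1) write on the master
            for (int i = 0; i < 5; i++) {                    // 3) poll a few times
                if (value.equals(replica.get(key))) {        // 2) read it back from a replica
                    return true;                             // 4b) the replica has caught up
                }
                Thread.sleep(50);                            // wait before re-checking
            }
        }
        return false;                                        // give up; the caller decides what to do
    }
}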

Inconsistent behavior of Quartz2 scheduler in Apache Camel

I have an Apache Camel project that uses Quartz2 as the scheduler. The requirement is to make it a cluster. The code is deployed to WebLogic 12c, and Quartz is configured as per many samples, with clustering enabled.
This is my properties file (without the datasource)
org.quartz.scheduler.instanceName = MyScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 10
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.OracleDelegate
org.quartz.jobStore.useProperties=true
org.quartz.JobBuilder.requestRecovery=true
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
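Not shown in the question is how this properties file reaches the Camel quartz2 component; a minimal sketch, assuming the file is on the classpath as quartz.properties and the component is configured programmatically, might look like this:
import org.apache.camel.CamelContext;
import org.apache.camel.component.quartz2.QuartzComponent;
import org.apache.camel.impl.DefaultCamelContext;

// Hedged sketch: point the quartz2 component at the clustered scheduler properties above.
CamelContext context = new DefaultCamelContext();
QuartzComponent quartz = new QuartzComponent();
quartz.setPropertiesFile("quartz.properties");   // the clustered configuration shown above
context.addComponent("quartz2", quartz);         // routes can then use quartz2:// endpoints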
When I deploy and start both nodes, I see that the QRTZ_SCHEDULER_STATE table has an extra entry for one of the nodes:
MyScheduler-routerContext server_node21567108546690
MyScheduler-routerContext-1 server_node11565896495100
MyScheduler-routerContext-1 server_node11567108547295
And I am guessing that because of this, one node is invoked only once in a while, while the other node gets called all the time (so occasionally both nodes are invoked at the same time).
I have tried a clean restart of the WebLogic nodes, but the issue is still there.
This is what my route(s) look like:
from("quartz2://provRegGroup/createUsersTrigger?cron={{create_users_cron}}&job.name=createUsersJob")
    .routeId("createUsersRB")
    .log("**** starting check for create users");
// where
// create_users_cron=0+0,5,10,15,20,25,30,35,40,45,50,55+*+*+*+?
// expecting one node to be called by the scheduler at a time..
I figured out what caused the issue. Apparently there were orphan WebLogic processes running on one (or even both) of the nodes - why this was allowed to become such a mess is a question for our tech archs. ps was showing two WebLogic servers running on a node: one that I had started recently and one that had been there for, say, a month.
Expecting this would never happen in a production environment, I assume the issue has been resolved.

Sync Framework 2.1 taking too much time to sync for first time

I am using Sync Framework 2.1 to sync my database, and the sync is unidirectional only, i.e. from a source to a destination database. I am facing a problem while syncing the database for the first time. I have divided the tables into different groups so that there is less overhead while syncing. One table has 900k records in it. While syncing that table, the SyncOrchestrator.Synchronize() method never returns. Network usage and disk I/O go high. I waited two days for the sync process to complete, but nothing happened. I also checked the SQL database using "sp_who2" and the process is in suspended mode. I also ran some queries found online, and they show that table_selectchanges takes too much time.
I have used the following code to sync my database:
//setup the connections
var serverConn = new SqlConnection(sourceConnectionString);
var clientConn = new SqlConnection(destinationConnectionString);
// create the sync orchestrator
var syncOrchestrator = new SyncOrchestrator();
//setup providers
var localProvider = new SqlSyncProvider(scopeName, clientConn);
var remoteProvider = new SqlSyncProvider(scopeName, serverConn);
localProvider.CommandTimeout = 0;
remoteProvider.CommandTimeout = 0;
localProvider.ObjectSchema = schemaName;
remoteProvider.ObjectSchema = schemaName;
syncOrchestrator.LocalProvider = localProvider;
syncOrchestrator.RemoteProvider = remoteProvider;
// set the direction of sync session Download
syncOrchestrator.Direction = SyncDirectionOrder.Download;
// execute the synchronization process
syncOrchestrator.Synchronize();
We had this problem. We removed all the foreign key constraints from the database for the first sync. This reduced the initial sync from over 4 hours to about 10 minutes. After the initial sync completed we replaced the foreign key constraints.

Data is not properly stored to hsqldb when using pooled data source by dbcp

I'm using hsqldb to create cached tables and indexed tables.
The data being stored arrives at a pretty high frequency, so I need to use a connection pool.
Also, because there is a lot of data, I do not call CHECKPOINT on every commit, but rather expect the data to be flushed after 50,000 rows are inserted.
The thing is that I can see the .data file growing, but when I connect with an HSQLDB client I don't see the tables or the data.
So I ran 2 simple tests: one inserted a single row and one inserted 60,000 rows into a new table. In both cases I couldn't see the result in any HSQLDB client.
(Note that I use shutdown=true.)
When I add a CHECKPOINT after each commit, it solves the problem.
Also, if I specify in the connection string to use the log, it solves the problem (I don't want the log in production though). Not using a pooled connection also solved the problem, and lastly, so did using a pooled data source and explicitly closing it before shutdown.
So I guess that some connections in the connection pool are not being closed, preventing the db from committing the changes and making them available to the client. But then, why couldn't I see the result even with 60,000 rows?
I would also expect the pool to be closed automatically...
What am I doing wrong? What is happening behind the scene?
The code to get the data source looks like this:
Class.forName("org.hsqldb.jdbcDriver");
String url = "jdbc:hsqldb:" + m_dbRoot + dbName + "/db" + ";hsqldb.log_data=false;shutdown=true;hsqldb.nio_data_file=false";
ConnectionFactory connectionFactory = new DriverManagerConnectionFactory(url, user, password);
GenericObjectPool connectionPool = new GenericObjectPool();
KeyedObjectPoolFactory stmtPool = new GenericKeyedObjectPoolFactory(null);
new PoolableConnectionFactory(connectionFactory, connectionPool, stmtPool, null, false, true);
DataSource ds = new PoolingDataSource(connectionPool);
And I'm using this pooled data source to create a table:
Connection c = m_dataSource.getConnection();
Statement st = c.createStatement();
String script = String.format("CREATE CACHED TABLE IF NOT EXISTS %s (id %s NOT NULL, entity %s NOT NULL, PRIMARY KEY (id));", m_tableName, m_idGenerator.getIdType(), TABLE_ENTITY_TYPE);
st.execute(script);
st.close();
c.close();
And to insert rows:
Connection c = m_dataSource.getConnection();
c.setAutoCommit(false);
PreparedStatement stmt = c.prepareStatement(m_sqlInsert);
stmt.setObject(1, id);
stmt.setBinaryStream(2, Serializer.Helper.serialize(m_serializer, entity));
stmt.executeUpdate();
stmt.close();
c.commit();
c.close();
So the above seems to add data, but it cannot be seen.
When I explicitly called
connectionPool.close();
then and only then could I see the result.
I also tried to use JDBCDataSource and it worked as well.
So what is going on? And what is the right way to do this?
Your method of accessing the database from outside your application process is simply wrong.
Only one Java process at a time is supposed to connect to a file: database.
In order to achieve your aim, launch an HSQLDB server within your application, using exactly the same JDBC URL. Then connect to this server from the external client.
See the Guide:
http://www.hsqldb.org/doc/2.0/guide/listeners-chapt.html#lsc_app_start
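As a minimal sketch of that advice (paths, database name and port are placeholders; the extra URL properties are taken from the question's connection string), starting the server inside the application could look like this:
import org.hsqldb.server.Server;

// Hedged sketch: run an HSQLDB Server in-process so an external client can connect
// while the application itself keeps using the same file: database.
Server server = new Server();
server.setDatabasePath(0, "file:/path/to/dbRoot/myDb/db;hsqldb.log_data=false;hsqldb.nio_data_file=false");
server.setDatabaseName(0, "mydb");
server.setPort(9001);
server.start();
// external clients can then connect with: jdbc:hsqldb:hsql://localhost:9001/mydb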
Update: The OP commented that the external client was used after the application had stopped. Because you have turned the log off with hsqldb.log_data=false, nothing is persisted permanently. You need to perform an explicit CHECKPOINT or SHUTDOWN when your application completes its work. You cannot rely on shutdown=true at all, even without connection pooling.
See the Guide:
http://www.hsqldb.org/doc/2.0/guide/deployment-chapt.html#dec_bulk_operations
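As a minimal sketch of the explicit CHECKPOINT/SHUTDOWN advice above (using the m_dataSource from the question; whether you issue CHECKPOINT or SHUTDOWN depends on whether the application is completely finished with the database):
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hedged sketch: persist pending changes explicitly before the application exits,
// since the log is disabled and shutdown=true cannot be relied on.
try (Connection c = m_dataSource.getConnection();
     Statement st = c.createStatement()) {
    st.execute("CHECKPOINT");       // flush in-memory changes to the database files
    // or: st.execute("SHUTDOWN");  // close the database cleanly when completely done
} catch (SQLException e) {
    // handle or log the failure appropriately
}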