GemFire queries gradually slow down after running for a few hours - gemfire

We are performance testing our Spring Boot based application, which uses Spring Data GemFire to connect to Pivotal GemFire.
During these tests we have observed that OQL queries slow down gradually over time. We are already monitoring heap memory, CPU, threads, number of queries, transaction counts and connection counts, but none of these show a gradual increase. All of these parameters stay consistent throughout the performance test, yet query performance degrades gradually over time. Apart from the parameters we are already monitoring, I am not sure what else could cause this slowness and should be monitored. Can you please advise?
Version details:
Spring Data GemFire - 1.8.10.RELEASE
Spring Boot - 1.4.6.RELEASE
Pivotal GemFire - 8.2.13
WebSphere Application Server - 8.5.9
Other details
Cluster size: 3 nodes, each with a 60 GB heap
Region type: both replicated and partitioned regions appear in the slow queries
Indexes used: yes
Overflow configured: yes, to a disk store
Persistent regions: yes
Eviction: at 80% of the heap, but heap usage stays below 70% most of the time
Indexes on the data sets: depending on usage, from 1 index up to around 6; however, whenever we look at the queries in the logs, only one index is ever shown as used
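
In case it helps narrow this down, here is a minimal sketch of turning on GemFire statistics archiving and verbose query logging on the servers, so per-query, index and disk-store metrics can be compared between the start and the end of a run. The property names are standard GemFire properties; the archive file name is just a placeholder, and with Spring Data GemFire the same properties can be passed to the cache factory bean instead of a plain CacheFactory:

    import java.util.Properties;

    import com.gemstone.gemfire.cache.Cache;
    import com.gemstone.gemfire.cache.CacheFactory;

    public class StatsEnabledServer {

        public static void main(String[] args) {
            // Verbose query logging: each OQL execution is logged with the index
            // used and the execution time (system property, off by default).
            System.setProperty("gemfire.Query.VERBOSE", "true");

            Properties props = new Properties();
            // Archive statistics so query, index, eviction and disk-store metrics
            // can be compared across the duration of the performance test.
            props.setProperty("statistic-sampling-enabled", "true");
            props.setProperty("statistic-sample-rate", "1000");              // in ms
            props.setProperty("statistic-archive-file", "server-stats.gfs"); // placeholder path
            props.setProperty("archive-file-size-limit", "100");             // MB per archive

            Cache cache = new CacheFactory(props).create();
            // ... region / cache-server setup as usual ...
        }
    }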

Related

Replication across AZ

We have a 6-node cluster in which 3 server nodes are spread across 3 availability zones, and each zone also has a client node. Everything runs in a Kubernetes-based service.
Important configurations:
storeKeepBinary = true
cacheMode = Partitioned
atomicityMode = Atomic (about 5-8 of the 25 caches have this as TRANSACTIONAL)
backups = 1
readFromBackup = false
no persistence for these tests
When we run it locally on physical boxes, we get decent throughput. However, when we deploy it in the cloud in an AZ-based setup on k8s, we see a steep drop. We only get performance comparable to the on-prem cluster tests when we keep a single cache node without any backups (backups = 0).
I get that different hardware and network latency in the cloud come into play. While I investigate all of that, I want to understand whether there are some obvious behavioral issues under the covers in Ignite, so I am trying to understand a few things outlined below.
Why should cache get calls be slower? The data is partitioned, so the lookup should be by key, and since we have turned off readFromBackup it should always go to the primary partition. So adding more cache servers should not change the latency of get calls.
Similarly for inserts/puts: other than the caches whose atomicity mode is TRANSACTIONAL, everything else should behave the same when we go from one cache node to three.
Are there any other areas anyone can suggest that I could look at, in configuration or elsewhere?
TIA
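
Edit: for reference, here is a rough sketch of the cache configuration described above, as it would look programmatically. The cache name and the node bootstrap are illustrative, not the actual deployment code:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheAtomicityMode;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class CacheSetup {

        public static void main(String[] args) {
            CacheConfiguration<String, byte[]> cfg = new CacheConfiguration<>("ordersCache"); // name is illustrative
            cfg.setCacheMode(CacheMode.PARTITIONED);
            cfg.setAtomicityMode(CacheAtomicityMode.ATOMIC); // TRANSACTIONAL for the 5-8 transactional caches
            cfg.setBackups(1);
            cfg.setReadFromBackup(false); // gets always go to the primary partition
            cfg.setStoreKeepBinary(true);

            IgniteConfiguration igniteCfg = new IgniteConfiguration();
            igniteCfg.setCacheConfiguration(cfg);

            Ignite ignite = Ignition.start(igniteCfg);
            // With backups = 1, every write is also propagated to the backup owner,
            // which adds network traffic per put compared to backups = 0.
        }
    }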

Query execution slow when scaling DTUs in Azure SQL Database

I am doing a POC of a real-time scenario for a SaaS product that has to handle a high volume of messages; the load reaches its peak within a few seconds (send/process), and on the listener side each message is processed and the computed data is stored in an Azure SQL Database (separate elastic pool, 100 eDTU, Standard tier). To mimic this, I am sending and processing messages in parallel from a few nodes and threads. In this setup I see some slowness in database operations during the first few seconds; once the DTU usage reaches its maximum level, query execution is normal.
Is this expected behavior?
What happens if a query executes while the DTUs are being scaled?
How to avoid this?
When you scale the service tier of an Azure SQL Database up or down, open transactions are rolled back, server logins may be disconnected, query plans may change because the number of threads available for queries changes, and the data cache and plan cache are cleared.
Since the data cache is empty, the first time you run a query it has to do a lot of physical IO, memory allocation rises, and it is slow. If you look at the slow queries, they will likely show PAGEIOLATCH_SH and MEMORY_ALLOCATION_EXT waits, which correspond to pages being pulled from disk into the buffer pool. The second time you run the query, the data is already in the data cache and it runs faster.
If the database faces high DTU usage for an extended period of time, throttling may kick in and you may see connection timeouts and poor query performance.
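
To see the cold-cache effect for yourself, here is a small sketch that times the same query twice right after a scale operation; the first run should show the physical-IO penalty and the second should hit the data cache. The connection string, credentials and the query are placeholders, and the Microsoft JDBC driver is assumed to be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ColdCacheCheck {

        public static void main(String[] args) throws Exception {
            // Placeholder connection string - fill in your own server, database and credentials.
            String url = "jdbc:sqlserver://<server>.database.windows.net:1433;"
                    + "database=<db>;user=<user>;password=<pwd>;encrypt=true";

            try (Connection con = DriverManager.getConnection(url);
                 Statement st = con.createStatement()) {

                String query = "SELECT COUNT(*) FROM dbo.Messages"; // placeholder query

                for (int run = 1; run <= 2; run++) {
                    long start = System.nanoTime();
                    try (ResultSet rs = st.executeQuery(query)) {
                        while (rs.next()) { /* consume the result */ }
                    }
                    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                    // Run 1 is expected to be slower (pages pulled from disk into the
                    // buffer pool); run 2 should be served from the data cache.
                    System.out.println("Run " + run + ": " + elapsedMs + " ms");
                }
            }
        }
    }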

Matillion: How to identify performance bottleneck

We're running Matillion (v1.54) on an AWS EC2 instance (CentOS), based on Tomcat 8.5.
We have developed a few ETL jobs by now, and their execution takes quite a lot of time (that is, up to hours). We would like to speed up the execution of our jobs, and I wonder how to identify the bottleneck.
What confuses me is that neither the m5.2xlarge EC2 instance (8 vCPU, 32 GB RAM) nor the database (Snowflake) gets very busy; both seem to be mostly idle (in terms of CPU and RAM usage as shown by top).
Our environment is configured to use up to 16 parallel connections.
We also added JVM options -Xms20g -Xmx30g to /etc/sysconfig/tomcat8 to make sure the JVM gets enough RAM allocated.
Our Matillion jobs do transformations and loads into a lot of tables, most of which can (and should) be done in parallel. Still, we see that most of the tasks are processed in sequence.
How can we enhance this?
By default there is only one JDBC connection to Snowflake, so your transformation jobs might be getting forced serial for that reason.
You could try bumping up the number of concurrent connections in the Edit Environment dialog.
There is more information here about concurrent connections.
If you do that, a couple of things to avoid are:
Transactions (begin, commit, etc.) will force transformation jobs to run serially again.
If you have a parameterized transformation job, only one instance of it can ever be running at a time. There is more information on that subject here.
Because the Matillion server is just generating SQL statements and running them in Snowflake, the Matillion server is not likely to be the bottleneck. You should make sure that your orchestration jobs are submitting everything to Snowflake at the same time and there are no dependencies (unless required) built into your flow.
Steps that are chained one after another in the job will be executed in sequence, while steps on independent branches will be executed in parallel (and will depend on the Snowflake warehouse size to scale).
Also - try the Alter Warehouse Component with a higher concurrency level
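
If you prefer to verify the effect outside Matillion, here is a rough sketch of what a higher warehouse concurrency level amounts to in plain Snowflake SQL, run over JDBC. The account, credentials, warehouse name and the concurrency value are all placeholders, and it is an assumption that the Alter Warehouse component issues an equivalent statement:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class WarehouseConcurrency {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("user", "<user>");       // placeholder credentials
            props.put("password", "<password>");
            props.put("warehouse", "ETL_WH");  // placeholder warehouse name

            // Placeholder account; requires the Snowflake JDBC driver on the classpath.
            String url = "jdbc:snowflake://<account>.snowflakecomputing.com/";

            try (Connection con = DriverManager.getConnection(url, props);
                 Statement st = con.createStatement()) {
                // Raise the number of statements the warehouse will run concurrently
                // (Snowflake's default MAX_CONCURRENCY_LEVEL is 8).
                st.execute("ALTER WAREHOUSE ETL_WH SET MAX_CONCURRENCY_LEVEL = 16");
            }
        }
    }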

Infinispan Distributed Cache Issues

I am new to Infinispan. We have set up an Infinispan cluster so that we can make use of the distributed cache for our CPU- and memory-intensive task. We are using UDP as the communication medium and Infinispan MapReduce for distributed processing. The problem we are facing is with throughput: when we run the program on a single-node machine, the whole run completes in around 11 minutes, i.e. processing a few hundred thousand records and finally emitting around 400,000 records.
However, when we use a cluster for the same dataset, we see a throughput of only 200 records per second being updated between the nodes in the cluster, which impacts the overall processing time. I am not sure which configuration is hurting the throughput so badly, but I am sure it is something to do with the configuration of either JGroups or Infinispan.
How can this be improved?
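
To make the knobs that usually dominate inter-node throughput explicit, here is a minimal sketch of an embedded Infinispan setup with the JGroups stack and the distribution settings spelled out. The JGroups file name and cache name are placeholders, and the choice of DIST_ASYNC with numOwners = 2 is something to experiment with, not a recommendation:

    import org.infinispan.Cache;
    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.configuration.global.GlobalConfiguration;
    import org.infinispan.configuration.global.GlobalConfigurationBuilder;
    import org.infinispan.manager.DefaultCacheManager;

    public class DistCacheSetup {

        public static void main(String[] args) {
            // Transport: point Infinispan at an explicit JGroups stack so UDP
            // bundling and buffer settings can be tuned in one place.
            GlobalConfiguration global = new GlobalConfigurationBuilder()
                    .transport().defaultTransport()
                    .addProperty("configurationFile", "jgroups-udp.xml") // placeholder file
                    .build();

            // Distribution: numOwners controls how many nodes receive each write;
            // async mode avoids blocking the writer on the replication round trip.
            Configuration dist = new ConfigurationBuilder()
                    .clustering().cacheMode(CacheMode.DIST_ASYNC)
                    .hash().numOwners(2)
                    .build();

            DefaultCacheManager manager = new DefaultCacheManager(global);
            manager.defineConfiguration("recordsCache", dist); // placeholder name
            Cache<String, String> cache = manager.getCache("recordsCache");
        }
    }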

Is Infinispan an improvement of JBoss Cache?

According to this link from the JBoss documentation, I understood that Infinispan is a better product than JBoss Cache and a kind of improvement of it, which is why they recommend migrating from JBoss Cache to Infinispan, and it is supported by JBoss as well. Am I right in what I understood? If not, what are the differences?
One more question: talking about replication and distribution, can either of them be better than the other depending on the need?
Thank you
Question:
Talking about replication and distribution, can any one of them be better than the other according to the need?
Answer:
I am quoting directly from Clustering modes - Infinispan:
Distributed:
Number of copies represents the tradeoff between performance and durability of data
The more copies you maintain, the lower performance will be, but also the lower the risk of losing data due to server outages
Uses a consistent hash algorithm to determine where in the cluster entries should be stored
No need to replicate data to every node, which would take more time than simply locating entries via the hash
Best suited when the number of nodes is high
Best suited when the amount of data stored in the cache is large
Replicated:
Entries added to any of these cache instances will be replicated to all other cache instances in the cluster
This clustered mode provides a quick and easy way to share state across a cluster
Replication practically only performs well in small clusters (under 10 servers), because the number of replication messages that need to happen grows as the cluster size increases (see the configuration sketch after these lists)
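
To make the difference concrete, here is a minimal sketch of how the two modes (plus a local cache) are selected in an embedded Infinispan configuration; the cache names are purely illustrative:

    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.configuration.global.GlobalConfigurationBuilder;
    import org.infinispan.manager.DefaultCacheManager;

    public class ClusteringModes {

        public static void main(String[] args) {
            DefaultCacheManager manager = new DefaultCacheManager(
                    new GlobalConfigurationBuilder().transport().defaultTransport().build());

            // Distributed: each entry is stored on numOwners nodes only, placed by
            // consistent hashing - scales with cluster size and data volume.
            Configuration distributed = new ConfigurationBuilder()
                    .clustering().cacheMode(CacheMode.DIST_SYNC)
                    .hash().numOwners(2)
                    .build();
            manager.defineConfiguration("largeDataCache", distributed);    // illustrative name

            // Replicated: every entry is copied to every node - easy reads on any
            // node, but write cost grows with the number of members.
            Configuration replicated = new ConfigurationBuilder()
                    .clustering().cacheMode(CacheMode.REPL_SYNC)
                    .build();
            manager.defineConfiguration("smallReferenceCache", replicated); // illustrative name

            // Local: per-node cache, no clustering - for data that is not shared.
            Configuration local = new ConfigurationBuilder()
                    .clustering().cacheMode(CacheMode.LOCAL)
                    .build();
            manager.defineConfiguration("perNodeCache", local);             // illustrative name
        }
    }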
Practical Experience:
We are using an Infinispan cache in my live application, running on a JBoss server with 8 nodes. Initially I used a replicated cache, but it took much longer to respond due to the large size of the data. We eventually switched back to distributed mode, and now it is working fine.
Use a replicated or distributed cache only for data specific to a user session. If the data is common regardless of the user, then prefer a local cache, which is created separately on each node.