What is the best strategy to select which parameters to use for Geode server and locator startup script - gemfire

Our company uses Geode services for some of our applications, and we also use Geode member group configurations to maintain different regions.
We are migrating our applications from Geode 1.6 to the latest version, 1.12.
We have seen dramatic performance decrease after the upgrade if we use the older parameters for the server and locator startup scripts, and things work fine when we remove those parameters.
We now plan to review the parameters we used earlier, along with those currently available, to determine the optimal configuration for the server and locator and get the best out of the new Geode version.
Does anyone have best practices or recommendations to follow for this task?
Below are the configurations for the Geode locator and server startup scripts for old and new versions.
Locator startup command
---Old configuration (works great with Geode 1.6 but not with any version after Geode 1.8)
gfsh start locator --locators=$locators_str --name=${EC2_HOSTNAME}.aws.compnaynamedigital.net --initial-heap=2G --max-heap=2G --dir=/opt/compnayname/geode/locator --J=-Dlog4j.configurationFile=/opt/compnayname/geode/log4j2-locator.xml --J=-DCLUSTER=${ECS_CLUSTER} --J='-javaagent:/opt/compnayname/geode/jmxtrans-agent-1.2.6.jar=/opt/compnayname/geode/jmxtrans-agent-locator.xml' --J=-Dgemfire.distributed-system-id=${DISTRIBUTED_SYSTEM_ID} --J=-Dgemfire.member-timeout=30000 --J=-Dgemfire.max-num-reconnect-tries=0 --J=-Dgemfire.jmx-manager=true --J=-Dgemfire.jmx-manager-start=true --J=-Dgemfire.jmx-manager-port=1099 --J=-Dgemfire.http-service-port=0 --J=-Dgemfire.log-level=info --J=-Dgemfire.log-file-size-limit=10 --J=-Dgemfire.log-disk-space-limit=10 --J=-Dgemfire.disable-auto-reconnect=true
---New configuration (works great with all versions)
gfsh start locator --locators=$locators_str --name=${EC2_HOSTNAME}.aws.compnaynamedigital.net --J=-Xmx2048m --dir=/opt/compnayname/geode/locator --J=-Dlog4j.configurationFile=/opt/compnayname/geode/log4j2-locator.xml --J='-javaagent:/opt/compnayname/geode/jmxtrans-agent-1.2.6.jar=/opt/compnayname/geode/jmxtrans-agent-locator.xml'
Server Startup command
---Old configuration (works great with Geode 1.6 but not with any version after Geode 1.8)
gfsh start server --locators=$locators_str --name=${EC2_HOSTNAME}.aws.compnaynamedigital.net --initial-heap=${GEODE_INIT_HEAP} --max-heap=${GEODE_MAX_HEAP} --group=${SERVER_GROUP} --dir=/opt/compnayname/geode/server --classpath=/opt/compnayname/geode/services-geode.jar --J=-Dlog4j.configurationFile=/opt/compnayname/geode/log4j2-server.xml --J=-DCLUSTER=${ECS_CLUSTER} --J='-javaagent:/opt/compnayname/geode/jmxtrans-agent-1.2.6.jar=/opt/compnayname/geode/jmxtrans-agent-server.xml' --J=-Dgemfire.distributed-system-id=${DISTRIBUTED_SYSTEM_ID} --J=-Dgemfire.member-timeout=30000 --J=-Dgemfire.max-num-reconnect-tries=0 --J=-Dgemfire.socket-buffer-size=16777215 --J=-Dgemfire.off-heap-memory-size=${GEODE_OFF_HEAP} --J=-XX:+UseParNewGC --J=-XX:+UseConcMarkSweepGC --J=-XX:CMSInitiatingOccupancyFraction=60 --eviction-heap-percentage=70 --critical-heap-percentage=90 --J=-Dgemfire.http-service-port=0 --J=-Dgemfire.log-level=info --J=-Dgemfire.log-file-size-limit=10 --J=-Dgemfire.log-disk-space-limit=10 --J=-Dgemfire.disable-auto-reconnect=true ${ADDTL_GEODE_SERVER_OPTS}
---New configuration (works great with all versions)
gfsh start server --locators=$locators_str --name=${EC2_HOSTNAME}.aws.compnaynamedigital.net --J=-Xmx${GEODE_MAX_HEAP} --group=${SERVER_GROUP} --dir=/opt/compnayname/geode/server --classpath=/opt/compnayname/geode/services-geode.jar --J=-Dlog4j.configurationFile=/opt/compnayname/geode/log4j2-server.xml --J='-javaagent:/opt/compnayname/geode/jmxtrans-agent-1.2.6.jar=/opt/compnayname/geode/jmxtrans-agent-server.xml'
Test Environment Details
We are using the exact same environment (AWS) for testing the old and new configurations, and we run the same test to measure the response time. We are using 3 Geode locators and 3 Geode servers for the different member groups.
The only difference is the Geode version.
The test is a count operation: we have written a count function that executes on Geode regions and counts the records in the downloaded data, which consists of data sketches (https://datasketches.apache.org/). With the old configuration, this count operation on the same data in the same testing environment is drastically slower on any Geode version beyond 1.8.
Another surprising thing is that if I use the old configuration on my local laptop (which serves as both locator and server) with any Geode version greater than 1.8 (including the latest), I do not see this issue. Somehow these extra settings cause slowness in the distributed AWS environment.
Please let me know if more information is required and I will be glad to provide more details.
Any information on this will be appreciated.

The main difference I see is the inclusion of --J='-javaagent:/opt/xyz/geode/jmxtrans-agent-1.2.6.jar=/opt/xyz/geode/jmxtrans-agent-server.xml' in the startup parameters. This seems to be a third-party Java agent that exposes several JVM metrics through JMX.
Do you know whether the agent itself modifies the bytecode? I've seen negative effects of that approach for Geode applications in the past (not performance related, though). Have you tried upgrading the agent to the latest available version (1.2.10)? As a side note, Geode already exposes a lot of metrics and information through JMX out of the box; is there any reason why you're relying on yet another external tool for this?
We have seen dramatic performance decrease after the upgrade
How are you measuring performance? Where do you see the degradation? Are you executing exactly the same workload on exactly the same machines, where the only difference is the Geode version? There are several factors in play here.
That said, diagnosing and troubleshooting performance degradations can be a long and tough process, so my suggestion would be to open a Geode JIRA ticket with all the relevant information and artefacts.
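When collecting that information, it may help to list exactly which options differ between the old and new startup commands, so they can be re-added one group at a time while measuring. The following is only a minimal Python sketch, assuming heavily shortened versions of the server commands from the question; the member name `server1` and the `4G` heap size are placeholders, not values from the original post.

```python
import shlex


def option_names(command):
    """Return the set of option names in a gfsh command line, ignoring values."""
    names = set()
    for token in shlex.split(command):
        if token.startswith("--J="):
            # JVM arguments: keep the text before the first '=' of the JVM argument,
            # e.g. --J=-Dgemfire.member-timeout=30000 -> --J=-Dgemfire.member-timeout
            names.add("--J=" + token[len("--J="):].split("=", 1)[0])
        elif token.startswith("--"):
            names.add(token.split("=", 1)[0])
    return names


# Placeholder commands, abbreviated from the question for illustration only.
OLD = (
    "gfsh start server --name=server1 --initial-heap=4G --max-heap=4G "
    "--J=-Dgemfire.member-timeout=30000 "
    "--J=-Dgemfire.socket-buffer-size=16777215 "
    "--J=-XX:+UseConcMarkSweepGC --eviction-heap-percentage=70"
)
NEW = "gfsh start server --name=server1 --J=-Xmx4G"

# Options present in the old command but absent from the new one.
removed = sorted(option_names(OLD) - option_names(NEW))
for name in removed:
    print(name)
```

Each printed option is a candidate to restore on its own (or in small groups) in the AWS environment, which narrows down which setting triggers the slowdown.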
Cheers.

Related

Apache geode gemfire pulse

We are using Spring data gemfire, we are planning to migrate to Apache geode latest version. With VMware GemFire we had to explicitly set the path of the GemFire installation for Pulse to work properly. If we use the Apache Geode jar, will we be able to get Pulse up and running without specifying the installation location?
We are not using gfsh in our project, and we want to ensure that we have minimal dependency on the installed distribution when we upgrade GemFire.
You don't need to set the GEODE_HOME environment variable when using spring-boot-data-geode, you just need to make sure the correct dependencies are within the classpath of your application (see here for more details).
I've written a very basic example showing how to start a Locator with the Pulse application embedded; you can find it here.
As a side note, and regarding the following:
We are using Spring data gemfire, we are planning to migrate to Apache geode latest version.
In order to avoid weird and hard-to-fix runtime issues, please always make sure to use a combination of versions fully supported in the Spring Boot for Apache Geode and VMware Tanzu GemFire Version Compatibility Matrix.
After going through various answers and documentation, I was able to start Pulse with the help of the following article.
Start Gemfire Pulse

To virtualize or not to virtualize a bare metal server for a kubernetes deployment

I'd like to deploy kubernetes on a large physical server (24 cores) and I'm uncertain as to a number of things.
What are the pros and cons of creating virtual machines for the k8s cluster rather than running on bare metal?
I have the following considerations:
Creating VMs will allow for workload isolation. New VMs for experiments can be created and assigned to devs.
On the other hand, with k8s running on bare metal a new NAMESPACE can be created for each developer for experimentation and they can run their code in it. After all their code should be running in docker containers.
Security:
Having VMs would limit the amount of access given to future maintainers, limiting the amount of damage that could be done. On the other hand, the primary task for any future maintainers would be adding/deleting nodes, and they would require bare-metal access to do that.
Authentication:
At the moment devs would only touch the server when their code runs through the CI pipeline and their running deployments are deployed. But what about viewing logs? Could we set up tiered kubectl authentication to allow devs to access only the namespaces that have been assigned to them? (I believe this should be possible with the k8s namespace authorization plugin.)
A number of vms already exist on the server. Would this be an issue?
128 cores and doubts.... That is a lot of cores for a single server.
For kubernetes however this is not relevant:
Kubernetes can use different sized servers and utilize them to the maximum. However if you combine the master server processes and the node/worker processes on a single server, you might create unwanted resource issues. You can manage those with namespaces, as you already mention.
What we do is use continuous integration with namespaces in a single dev/QA Kubernetes environment, in which each change gets its own namespace (so we run a great many namespaces), and we run full environment deployments in those namespaces. A bunch of shell scripts are used to manage this. This works with a large server like yours as well as with smaller (or virtual) boxes. For you, the main benefit of virtualization could be splitting the large box into smaller ones so that you can also use it for purposes other than just Kubernetes (Kubernetes runs a lot, but not MS Windows, desktops, kernel modules for VPN purposes, etc.).
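The per-namespace access asked about in the question (letting a developer view logs only in their own namespace) is usually expressed with RBAC Role and RoleBinding objects. Below is a minimal Python sketch that prints such a manifest as JSON; it assumes RBAC authorization is enabled on the cluster, and the namespace `dev-alice`, the user `alice`, and the role name `log-viewer` are hypothetical.

```python
import json

RBAC_GROUP = "rbac.authorization.k8s.io"


def namespace_viewer(namespace, user):
    """Build a Role and RoleBinding letting one user read pods and pod logs in one namespace."""
    role = {
        "apiVersion": RBAC_GROUP + "/v1",
        "kind": "Role",
        "metadata": {"name": "log-viewer", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }
    binding = {
        "apiVersion": RBAC_GROUP + "/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "log-viewer-" + user, "namespace": namespace},
        "subjects": [{"kind": "User", "name": user, "apiGroup": RBAC_GROUP}],
        "roleRef": {"kind": "Role", "name": "log-viewer", "apiGroup": RBAC_GROUP},
    }
    # A v1 List lets both objects be applied from a single document.
    return {"apiVersion": "v1", "kind": "List", "items": [role, binding]}


manifest = namespace_viewer("dev-alice", "alice")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped into `kubectl apply -f -`. How the user name maps to a real identity depends on the authentication method the cluster uses, which this sketch does not cover.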
I would separate dev and prod into different VMs. I once had a web app inside Docker that used too many threads, so the Docker daemon on the host crashed. Luckily it was limited to one host. You can protect against this by setting limits, but it's a risk: one mistake in dev could bring down prod as well.
I think the answer is "it depends!", which is not really an answer. Personally, I would split up the machine using VMs and deploy that way. You get better flexibility in how much of the server's resources you carve out, and you can easily create new environments and then destroy them.
Even if these VMs are really big, I think it's still easier to manage, especially given that you have existing VMs on the machine.
That said, there's no technical reason you can't run a single-node server, but you may run into problems with downtime during upgrades (if that's an issue), and if that server needs to be patched or rebooted, your entire cluster is down.
I would look at your environment's needs for HA and uptime, as well as how you are going to deploy VMs (if you go that route), and decide what works best for you.

Better jmeter report

Currently I use the JMeter Aggregate Report or Summary Report for submitting reports, but they expect something more. What else can I provide? Are there any plugins for capturing server resource usage during a load test?
Reporting: since JMeter 3.0 there is an HTML Reporting Dashboard, which can be generated at the end of the test run (or afterwards from the results file). It contains exhaustive overview information. If you need to find the cause of a bottleneck, a memory leak, or similar, consider the extra graphs available via the JMeter Plugins project.
The same JMeter Plugins project provides PerfMon - client-server application which is able to collect over 70 different metrics and plot them via JMeter Listener. See How to Monitor Your Server Health & Performance During a JMeter Load Test guide for detailed setup and usage instructions.
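For the HTML Reporting Dashboard mentioned above, a non-GUI run can produce the report in one step. This is a small Python sketch that assembles the command line; the file names `plan.jmx`, `results.jtl`, and `report` are hypothetical, and it assumes `jmeter` is on the PATH.

```python
import subprocess


def dashboard_command(test_plan, results_file, report_dir):
    """Build a JMeter command line that runs a plan and generates the HTML dashboard."""
    return [
        "jmeter",
        "-n",                # non-GUI mode
        "-t", test_plan,     # test plan to run
        "-l", results_file,  # file to log sample results to
        "-e",                # generate the dashboard after the load test
        "-o", report_dir,    # output folder for the dashboard
    ]


def run_with_dashboard(test_plan, results_file, report_dir):
    """Run the test; raises CalledProcessError if JMeter exits with an error."""
    subprocess.run(dashboard_command(test_plan, results_file, report_dir), check=True)


print(" ".join(dashboard_command("plan.jmx", "results.jtl", "report")))
```

Calling `run_with_dashboard("plan.jmx", "results.jtl", "report")` would start the test. The output folder should be empty or not exist yet, otherwise JMeter is likely to refuse to write the report.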
There are quite a few plug-ins available that can help you analyze the results better. You can refer to https://jmeter-plugins.org/ for the same.
Most popularly used ones are:
Response Times Over Time
Response Times Percentiles
Transactions per Second
Response Latencies Over Time
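If only the numbers behind a graph such as Response Times Percentiles are needed, they can also be computed directly from the results file. The following is a rough Python sketch, assuming the results were saved in CSV format with a header row containing at least the `label` and `elapsed` columns; the sample data is made up for illustration.

```python
import csv
import io
import math
from collections import defaultdict


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def response_time_percentiles(jtl_file, pcts=(50, 90, 99)):
    """Map each sampler label to {percentile: elapsed ms} from a CSV results file object."""
    by_label = defaultdict(list)
    for row in csv.DictReader(jtl_file):
        by_label[row["label"]].append(int(row["elapsed"]))
    result = {}
    for label, values in by_label.items():
        values.sort()
        result[label] = {p: percentile(values, p) for p in pcts}
    return result


# Made-up sample rows; a real results file has more columns.
SAMPLE = """timeStamp,elapsed,label,success
1,100,login,true
2,200,login,true
3,300,login,true
4,400,login,false
"""

result = response_time_percentiles(io.StringIO(SAMPLE))
print(result)
```

For a real run, pass an open file instead of the in-memory sample, for example `response_time_percentiles(open("results.jtl", newline=""))`, where `results.jtl` is a hypothetical file name.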
For server usage, you can use the following, which comes with JMeter Plugins:
PerfMon Metrics Collector and Server Agent
Alternatively, on a Unix-based system, use the sar command (from the sysstat package) or vmstat to capture system utilization data while the test is running, and then use kSar to plot graphs from the data collected with sar. https://sourceforge.net/projects/ksar/
On a Windows-based system, use Perfmon to capture the data, and then plot the graphs using PAL. https://pal.codeplex.com/
In this case, I would suggest using Grafana. It shows real-time results, and best of all, it can be configured to suit your needs.
So how do you use it? It is not that hard.
If you're using a Mac or Linux (any flavour), things are easy. If you're using Windows, I would suggest using a virtual machine, because Windows blocks traffic after some requests, and that causes a lot of headaches.
In my case, I used a virtual machine, set up Ubuntu inside it, and then configured Grafana.
To work with Grafana, you need these two things installed:
Grafana itself
InfluxDB for the backend
Links for both are below:
https://grafana.com/grafana/download?platform=linux
https://portal.influxdata.com/downloads/
Once both are installed and set up, you need to use the Backend Listener to push results to the Graphite client (installed automatically along with InfluxDB).
I know it is a bit confusing, but once you understand it, you and your client will love the detailed reports.
Remember, Grafana is all about configuration.
Let me know if anything about this is unclear.
Happy to help. :)

Set up distributed index using Hibernate Search and Lucene

Our application is using Hibernate Search for indexing some of its data. The application is running on two JBoss EAP 6.2 application servers for load distribution and failover. We need changes made on one machine to be immediately visible on the other. The index is a central part of the application and needs to be consistent with the database data. Completely rebuilding it takes a long time so it is important that it remains intact even in the case of a server crash. Also, the index is expected to grow too large to keep all of it in memory.
Our current solution is to use the standard filesystem directory with a shared filesystem (NFS) and the JGroups backend to ensure that only one server writes to a given index at any time. This works more or less, but sometimes we have problems with index updates taking very long (up to 20 seconds) or failing completely. Due to some other reasons we need to migrate away from the currently used file system, so we are evaluating alternatives for the current setup.
One thing we tried is the Infinispan directory with a file cache store for persistence, but we had some problems there regarding OutOfMemoryErrors (see also my post in the Infinispan forums https://developer.jboss.org/thread/253732). Also, performance was still not acceptable in our first tests (about 3 seconds for an index update with two clustered servers set up on my developer machine), though that may be due to configuration issues.
I think this is not such an uncommon requirement, but I couldn't find much information on best practices to implement it.
Who has experiences with similar setups? Does the Infinispan directory work for you? Can anybody suggest a working configuration or how to proceed to arrive at one? What alternatives have you tried and which work?
You need to be careful about which versions are being used. The Infinispan version bundled with JBoss EAP is not intended for storing the Lucene index (i.e. it has not been tested as extensively for that as for other purposes).
When JBoss EAP 6.2 was released, the bundled Infinispan was considered good to go for the internal needs of the application server, but as you might have discovered, the feature of index storage was having at least some performance issues.
In recent developments of Infinispan we applied many improvements to the index storage feature, fixing some bugs and getting very significant performance improvements out of it. I hope you would be willing to try Infinispan 7.2.0.Beta1.
All of these improvements are also being backported to JBoss Data Grid; version 6.5 will make them available as a supported product. Note that storing a Hibernate Search index wasn't supported before; it is going to be a new feature of JDG 6.5.
Modules from JDG 6.5 will be compatible with JBoss EAP; you'll just have to make sure you use the Infinispan build provided by JDG and not the one meant for internal use by EAP.
Performance improvements are still being worked on. It's much better already - especially compared to that older version - but we won't stop working on that yet so if you could try latest bleeding edge versions of Infinispan 7.2.x (another release is scheduled for tomorrow), I'd highly appreciate your feedback to keep pushing it.

How to monitor JVM without installing JDK

I want to monitor JVM performance in my production environment. I have installed only the JRE, not the JDK, so I can't use jstat, jconsole, etc. to monitor JVM performance.
Can somebody please help me understand how I can monitor JVM performance in this scenario?
Is there any way to achieve this?
(Please note that I don't want to monitor it remotely through JMX or something else. I would like to install a local agent on each machine that sends the metrics to a server at an interval of 1 minute.)
Thanks,
KS
If you manage to get JMX up and running on your VM (from the comment), you can then use jmxterm or jmxfetch to push these JMX metrics into a metrics system (like graphite or Datadog).
If you have enough patience and time to write, you can probably have a look at JVMTI. You can write your code in C/C++ and run it along your Java Process and you can gather information about the JVM without affecting it.
Another simple, if naive, way is to start your VM with a javaagent written in Java, but JVMTI is even better than that. The most crucial difference between a javaagent and a JVMTI agent comes from how they are loaded. Java agents are loaded inside the heap and are governed by the same JVM, whereas JVMTI agents are not governed by the JVM's rules and are thus not affected by JVM internals such as the GC or runtime error handling.
You can even give Java Mission Control a try if you're using JDK7 or above :)
Jolokia is a Java agent you can use to expose JMX over HTTP. Run jmx2graphite to get those metrics into Graphite. The link includes instructions for installing Graphite (10 minutes).
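To illustrate the idea of a local poller that ships metrics once a minute, here is a hand-rolled Python sketch that reads heap usage from a Jolokia agent on the same machine and forwards it in Graphite's plaintext format. It is a stand-in for what jmx2graphite automates, not a replacement for it. The port 8778 (assumed here to be the Jolokia JVM agent's default), the Graphite host `graphite.example.com`, and the metric prefix `jvm.heap` are assumptions.

```python
import json
import socket
import time
import urllib.request

# Assumed local Jolokia endpoint; adjust the port to match the agent's configuration.
JOLOKIA_URL = "http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage"
# Hypothetical Graphite host; 2003 is Graphite's plaintext protocol port.
GRAPHITE_ADDR = ("graphite.example.com", 2003)


def to_graphite_lines(response, prefix):
    """Turn a Jolokia read response into 'path value timestamp' lines."""
    timestamp = int(response["timestamp"])
    return [
        "{}.{} {} {}".format(prefix, key, value, timestamp)
        for key, value in sorted(response["value"].items())
    ]


def poll_once():
    """Read heap usage from the local Jolokia agent and send it to Graphite."""
    with urllib.request.urlopen(JOLOKIA_URL, timeout=5) as resp:
        payload = json.load(resp)
    lines = to_graphite_lines(payload, "jvm.heap")
    with socket.create_connection(GRAPHITE_ADDR, timeout=5) as sock:
        sock.sendall(("\n".join(lines) + "\n").encode("ascii"))


def run_forever(interval_seconds=60):
    """Poll at a fixed interval (one minute by default, as asked in the question)."""
    while True:
        poll_once()
        time.sleep(interval_seconds)
```

Calling `run_forever()` starts the loop. Error handling (for example, Graphite being unreachable) is left out to keep the sketch short.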