Apache Ignite Cache Read Issue (all affinity nodes left the grid)

Ignite Version: 2.5
Ignite Cluster Size: 10 nodes
One of our Spark jobs writes data to an Ignite cache every hour, about 530 million records per hour. Another Spark job reads the cache, but when it tries to, we get the error "Failed to execute the query (all affinity nodes left the grid)".
Any pointers will be helpful.

If you are using "embedded mode" deployment, the server nodes are started when the jobs run and stopped when the jobs finish. If you do not have enough backups, you will lose data when this happens. Any chance this may be your problem? Be sure to connect to the Ignite cluster with client=true.
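A minimal sketch of joining the grid as a client rather than an embedded server, using the standard Ignite Java API (discovery configuration is omitted and assumed to point at the existing cluster):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientConnect {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // Join the grid as a client: no partitions are stored in this process,
        // so cache data survives when the Spark job exits.
        cfg.setClientMode(true);

        try (Ignite ignite = Ignition.start(cfg)) {
            // Read/write caches as usual; the data lives on the server nodes.
        }
    }
}
```

With this setup the long-running server nodes own the data, and Spark jobs come and go as clients without triggering rebalancing or data loss.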

Related

Ignite Partition Loss Immediately After Cache Creation

I am using Ignite 2.8.1 in a 4-node cluster with persistence enabled. I attempted a rolling restart of the cluster, and I believe during that process the cluster ended up with partition loss, seemingly all on one node. I am using the READ_ONLY_SAFE policy. From that point on, even though all the nodes came back up, roughly 1 in 8 caches I created would immediately have partition loss: we would create the cache, query it a second later, and the queries would fail with "Failed to execute query because cache partition has been lost". How can partitions be lost immediately after creation if no cluster event, such as a node leaving, has happened?
Partitions for newly created caches may be lost if the cluster has nodes absent from the baseline or already in a "lost partition" state.
This is so that affinity collocation keeps working: since two caches with the same affinity configuration must have their partitions collocated on the same nodes, there is nowhere to put the "extra" partitions of a newly created cache.
You need to reset lost partitions first.
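Once all baseline nodes are back online, the lost-partition state has to be cleared explicitly. A sketch using the `Ignite#resetLostPartitions` Java API from a client node (the cache name `myCache` and the default client config are placeholders):

```java
import java.util.Collections;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ResetLostPartitions {
    public static void main(String[] args) {
        Ignition.setClientMode(true);             // join as a client, not a server
        try (Ignite ignite = Ignition.start()) {  // assumes discovery finds the cluster
            // Clear the LOST state so queries against the cache succeed again.
            ignite.resetLostPartitions(Collections.singleton("myCache"));
        }
    }
}
```

The same reset can also be done from the command line via the control script that ships with Ignite.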

AWS EMR Presto Cluster Terminated abruptly Error: All slaves in the job flow were terminated due to Spot

I am having trouble with AWS EMR PrestoDB.
I launched a cluster with the master node as coordinator and core nodes as workers. The core nodes were spot instances, but the master node was on-demand. Five weeks after the cluster launch, I got this error message:
Terminated with errors: All slaves in the job flow were terminated due to Spot
Does it mean that if all slaves are terminated, the cluster itself terminates?
I checked the spot pricing history, and the price never reached the max price I set.
What have I already done?
I checked the logs dumped to S3 and didn't find any information about the cause of the termination. They just said
Failed to visit ... <many directories>
I am answering my own question. As per the Presto community, there must be at least one master node up and running in an AWS EMR Presto cluster. Since it got terminated, the whole cluster got terminated.
To avoid data loss from spot pricing/interruption, the data needs to be backed up by snapshots, frequent copies to S3, or by leaving the EBS volume behind.
Ref: https://aws.amazon.com/premiumsupport/knowledge-center/spot-instance-terminate/
Your cluster should still be up, just without task nodes. Under Cluster -> Details -> Hardware you can add task nodes back.
Similar scenario : AWS EMR Error : All slaves in the job flow were terminated
For using Spot you might want to use the instance termination notice and also set up a max price:
https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/

Ignite performing slow on cluster

We are using the Ignite JDBC thin driver to store 1 million records in a table in the Ignite cache.
Inserting 1 million records takes 60 seconds on a single node, whereas on a cluster of 2 nodes it takes 5 minutes, and the time grows exponentially as the number of nodes increases.
I've attached the Ignite log file showing where the time was spent on the cluster, along with the configuration file for the cluster.
The log and configuration files are here
Is there any additional configuration required to bring the insert time down on a cluster?
Please make sure that you always test a networked configuration.
You should avoid testing a "client and server on the same machine" configuration, because it cannot be compared with "two server nodes on different machines". And it certainly should not be compared with "two server nodes on the same machine" :)
I've heard that the thin JDBC driver is not yet optimized for fast INSERTs. Please try the client-node JDBC driver with batching (via PreparedStatement.addBatch()).
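A sketch of batched inserts through the client-node JDBC driver; the `jdbc:ignite:cfg://` URL, config path, table, and batch size are placeholders to adapt to your setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsert {
    public static void main(String[] args) throws Exception {
        // Client-node JDBC driver: starts an Ignite client node from the config file.
        Class.forName("org.apache.ignite.IgniteJdbcDriver");
        String url = "jdbc:ignite:cfg://file:///path/to/ignite-client.xml";

        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO person (id, name) VALUES (?, ?)")) {
            for (int i = 0; i < 1_000_000; i++) {
                ps.setInt(1, i);
                ps.setString(2, "name-" + i);
                ps.addBatch();             // accumulate rows instead of one round trip each
                if (i % 10_000 == 0) {
                    ps.executeBatch();     // flush every 10k rows
                }
            }
            ps.executeBatch();             // flush the remainder
        }
    }
}
```

Batching amortizes the per-statement network round trip, which is usually what makes row-at-a-time inserts slow down as more nodes (and network hops) are added.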

Ignite C++, Server-client cluster load balancing performance issue

I have 2 nodes on which I'm trying to run 4 Ignite servers (2 on each node) and 16 Ignite clients (8 on each node). I am using replicated cache mode. I can see that the load on the cluster is not distributed evenly across all servers.
My intention in having 2 servers per node is to split the load of the 8 local clients across the local servers, with the servers working in write-behind mode to replicate the data to all servers.
But I notice that only one server is taking the load, running at 200% CPU, while the other 3 servers sit at around 20% CPU. How can I set up the cluster to distribute the client load evenly across all servers? Thanks in advance.
I'm generating load by inserting the same value 1 million times and then reading it back using the same key.
Here is your problem. The same key is always stored on the same Ignite node, according to the affinity function (see https://apacheignite.readme.io/docs/data-grid), so only one node takes the read and write load.
You should use a wide range of keys instead.
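The effect is easy to see even with a simplified stand-in for Ignite's affinity function. The hash-mod partitioner below is an illustration, not Ignite's actual RendezvousAffinityFunction, but it shows the same behavior: one key always lands in one partition, while a wide key range spreads across many:

```java
import java.util.HashSet;
import java.util.Set;

public class AffinityDemo {
    static final int PARTITIONS = 1024;

    // Simplified stand-in for an affinity function: key -> partition.
    static int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    public static void main(String[] args) {
        // Same key a million times -> always the same partition (one hot node).
        Set<Integer> hot = new HashSet<>();
        for (int i = 0; i < 1_000_000; i++) {
            hot.add(partitionFor("the-one-key"));
        }

        // A wide key range -> load spread over many partitions (and nodes).
        Set<Integer> spread = new HashSet<>();
        for (int i = 0; i < 1_000_000; i++) {
            spread.add(partitionFor("key-" + i));
        }

        System.out.println("partitions touched with one key:   " + hot.size()); // always 1
        System.out.println("partitions touched with many keys: " + spread.size()); // close to PARTITIONS
    }
}
```

In a real cluster each partition is owned by some server node, so touching one partition means hammering one node while the rest idle.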

Aerospike cluster rebalancing causing errors

When adding a new node to an Aerospike cluster, a rebalance happens to populate the new node. For large data sets this takes time, and some requests to the new node fail until the rebalance is complete. The only solution I could figure out is to retry the request until it gets the data.
Is there a better way?
I don't think it is possible to keep the node out of the cluster for requests until it is done replicating, because it is also the master for some of the partitions.
If you are performing batch reads, there is an improvement in 3.6.0. While the cluster is in flux, if the client directs a read transaction to Node_A but the partition containing the record has moved to Node_B, Node_A proxies the request to Node_B.
Is that what you are doing?
You should not be in a position where the client cannot connect to the cluster, or it cannot complete a transaction.
I know that SO frowns on this, but can you provide more detail about the failures? What kinds of transactions are you performing? What versions are you using?
I hope this helps,
-DM
Requests shouldn't be failing; the new node will proxy to the node that currently has the data.
Prior to Aerospike 3.6.0, batch read requests were the exception. I suspect this is your problem.
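The retry approach mentioned in the question can be written as a small generic helper. The names and backoff values below are illustrative, not from the Aerospike client, which has its own retry settings on its transaction policies:

```java
import java.util.concurrent.Callable;

public class Retry {
    /** Runs op, retrying up to maxAttempts times (>= 1) with linear backoff. */
    static <T> T withRetry(Callable<T> op, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;                                   // remember the failure
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);  // back off before retrying
                }
            }
        }
        throw last;                                         // exhausted all attempts
    }

    public static void main(String[] args) throws Exception {
        // Toy operation that fails twice (e.g. during a rebalance), then succeeds.
        int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("partition unavailable");
            return "record";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints: record after 3 attempts
    }
}
```

In practice, prefer the client library's built-in retry policy over hand-rolled loops when it exists; a wrapper like this is mainly useful for operations the client does not retry for you.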