Modify large graph DSE DataStax

I want to go over every node and calculate the number of connections (the degree) of that node.
Is there a way to iterate over the nodes in a distributed manner? I have around 50 million nodes.
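For reference, this is roughly the computation I have in mind, written as a plain TinkerPop traversal in Java (the host, port, and traversal source name are placeholders for whatever Gremlin endpoint the graph exposes); it counts the degree of each vertex, but as a single sequential traversal rather than a distributed job:

import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.structure.T;

public class DegreeCount {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port/traversal-source name.
        GraphTraversalSource g = traversal()
                .withRemote(DriverRemoteConnection.using("dse-host", 8182, "g"));
        // For every vertex, emit its id and the number of incident edges (its degree).
        // This runs as one sequential traversal; with ~50M vertices I would like to run
        // the equivalent computation in a distributed fashion.
        g.V().project("id", "degree")
             .by(T.id)
             .by(__.bothE().count())
             .forEachRemaining(System.out::println);
        g.close();
    }
}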
Thanks
Cristi

Related

Scaled-up Ignite nodes are not participating in the existing topology

The Ignite cluster was running with 3 replicas (nodes) and persistent data storage of 80 GB, with one backup per cache. Each node has 3 CPU cores and 6 GB of RAM.
Half of the caches are created in PARTITIONED mode and are collocated; one of these caches is loaded with about 70% of the data. The remaining caches are REPLICATED.
We scaled the Ignite cluster up to 5 nodes with 10 GB of RAM each. All nodes were up, but the newly created nodes were not participating in the existing topology.
We observed that one node was using much more CPU and RAM while the other nodes were using very little.
Can you help us resolve this?
When you add nodes to a cluster with persistence enabled, you need to adjust the "baseline topology." There are multiple ways, but the easiest is to use the control.sh script:
./control.sh --baseline add consistentId1,consistentId2 --yes
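If you would rather do it from code, here is a minimal sketch using Ignite's Java API (discovery settings are left at defaults and would need to match your cluster); it resets the baseline to the current set of server nodes:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class BaselineUpdate {
    public static void main(String[] args) {
        // Join as a client so this JVM itself does not become part of the baseline.
        IgniteConfiguration cfg = new IgniteConfiguration().setClientMode(true);
        try (Ignite ignite = Ignition.start(cfg)) {
            // Ensure the cluster is active (no-op if it already is).
            ignite.cluster().active(true);
            // Make the current set of server nodes the new baseline topology so the
            // freshly added nodes start owning cache partitions.
            ignite.cluster().setBaselineTopology(ignite.cluster().forServers().nodes());
        }
    }
}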

Apache Ignite Cache Read Issue (All affinity nodes left the grid)

Ignite Version: 2.5
Ignite Cluster Size: 10 nodes
One of our Spark jobs writes data to an Ignite cache every hour; the total is about 530 million records per hour. Another Spark job reads the cache, but when it tries to read, we get the error "Failed to execute the query (all affinity nodes left the grid)".
Any pointers will be helpful.
If you are using an "embedded mode" deployment, it means that server nodes are started when jobs run and stopped when jobs finish. If you do not have enough backups, you will lose data when this happens. Any chance this is your problem? Be sure to connect to the Ignite cluster with client=true.
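As an illustration of the client=true part, a minimal sketch (the cache name and key are placeholders): the reading job joins the topology as a client node, so starting and stopping it never removes a data-holding node from the grid:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientReadSketch {
    public static void main(String[] args) {
        // Join as a client: this JVM participates in the topology but holds no
        // cache partitions, so stopping it cannot make affinity nodes "leave the grid".
        IgniteConfiguration cfg = new IgniteConfiguration().setClientMode(true);
        try (Ignite ignite = Ignition.start(cfg)) {
            IgniteCache<String, String> cache = ignite.cache("hourlyRecords"); // placeholder cache name
            System.out.println(cache.get("some-key"));
        }
    }
}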

Redis sizing for Production

I have a few queries regarding a Redis Cluster setup:
1. Does Redis support cross-site replication? When we start a Redis cluster, can we decide which instance will be the slave of each master?
2. I need to store around 11 billion keys, with full persistence and fault tolerance. How many master-slave pairs should I start with? I have high TPS requirements for both reads and writes.
Please advise.
Your data is important to you and you don't want to lose any of it during failover/DR scenarios, so you need to create a Redis Cluster.
A Redis Cluster needs at least 3 master nodes, which share the 16384 hash slots between them.
Create as many slave nodes as you need to replicate data from the master nodes (choose the right number of slaves for your own fault-tolerance requirements).
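For illustration, a small sketch using the Jedis Java client (the seed node address and key are placeholders): once the cluster is up, the client discovers the slot map from any node and routes each key to the master that owns its hash slot, with slaves taking over on failover:

import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class RedisClusterClientSketch {
    public static void main(String[] args) {
        // Placeholder seed node; Jedis discovers the remaining masters/slaves from it.
        try (JedisCluster cluster = new JedisCluster(new HostAndPort("10.0.0.1", 6379))) {
            // The key is hashed to one of the 16384 slots and routed to the owning master.
            cluster.set("user:42", "some-value");
            System.out.println(cluster.get("user:42"));
        }
    }
}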

Hadoop 2.6.4 and Big File

I am a new user of Apache Hadoop, and there is one thing I do not understand. I have a simple cluster (3 nodes). Every node has about 30 GB of free space. When I look at the Hadoop Overview page I see DFS Remaining: 90.96 GB. I set the replication factor to 1.
Then I create a 50 GB file and try to upload it to HDFS, but I run out of space. Why? Can't I upload a file that is larger than the free space of a single node in the cluster?
According to Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
This logic makes sense as it decreases the network chatter between the different nodes.
I think it depends on whether the client is itself a Hadoop node or not. If the client is a Hadoop node, then all the blocks will be placed on that same node; this doesn't give any better read/write throughput in spite of having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each block, so the blocks are spread across the nodes of the cluster, which does provide better read/write throughput.
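To make that concrete, a hedged sketch using the HDFS Java API (the NameNode address and paths are placeholders): if this runs on one of the DataNodes, the first, and with replication factor 1 the only, copy of every block is written to that same node, which is why a 50 GB file can exhaust a 30 GB disk even though the cluster reports about 90 GB free:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            // With dfs.replication = 1 and the client running on a DataNode,
            // every block of this file lands on that same local node.
            fs.copyFromLocalFile(new Path("/local/bigfile.bin"),
                                 new Path("/user/test/bigfile.bin"));
        }
    }
}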

Some questions about Couchbase replica details

I have several questions about the replica functionality in Couchbase that I hope can be answered. First of all, I want to give my own understanding of Couchbase: if there are 10 nodes in my cluster and I set the number of replicas to 3 for each bucket (actually I find that the maximum value is 3, and I couldn't find any way to make it larger than 3), does that mean that each item in a bucket can only be copied to three other nodes (I guess the three nodes are chosen at random, but can they be selected manually?) out of the 10 nodes in total? Furthermore, if some of the 10 nodes go down, will that cause data loss?
I summarize my questions as follows:
1. The maximum number of replicas in Couchbase is 3, right or wrong? If wrong, how can it be made larger than 3?
2. I guess the three nodes are chosen at random, but can they be selected manually?
3. If my understanding is right, there will be data loss when some nodes go down. How can we avoid data loss under that condition?
The maximum number of replicas in Couchbase is 3, right or wrong? If wrong, how can it be made larger than 3?
The maximum number of replicas you can have is 3. We run in production with 1 replica, but it all comes down to how large your cluster is and the performance impact: the more replicas you have, the more inter-node communication and transfer will occur.
With 3 replicas, each node's data is replicated to 3 other nodes, which means you could handle 3 node failures in your cluster. That could happen, but it is unlikely; what's more likely is that a single machine dies, and Couchbase can then automatically fail over and promote a replica held on another node to serve requests/updates.
Couchbase's system is nice because it means you can scale up and down by failing a node over and rebalancing automatically.
I guess the three nodes are chosen at random, but can they be selected manually?
You have no say over which nodes replicas are held on. In fact, I think it's a great thing that all of Couchbase's sharding and replication is taken out of the developer's hands; it's all an automatic process.
If my understanding is right, there will be data loss when some nodes go down. How can we avoid data loss under that condition?
As I said before, if one node goes down then a replica is promoted; with 3 backups you'd need 3 nodes to fail before you noticed anything happening. In a production environment you should already have a warning system for each individual node, be it New Relic, Nagios, etc., that would report if a server dies. If there was a catastrophic problem and you lost more than 4 nodes, then yes, you would have data loss.
Bear in mind that automatic failover in Couchbase isn't instantaneous, but it's still pretty quick. If you need downtime across the cluster, say for server maintenance that requires a restart, it is possible to fail a node over, remove it from the cluster, perform your operations and tasks on it, then add it back into the cluster and rebalance. Repeat those steps for as many nodes as you need; I've personally done that when I forgot to set specific Linux settings that needed a system restart.
Overall, to avoid data loss: monitor your servers, make (daily/hourly) backups of the data in your cluster, and have machines that are well provisioned for your workload.
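If you also want individual writes to be safe against a node dying before replication catches up, the Java SDK (2.x) can block until a mutation has reached a replica. A minimal sketch, with the cluster address, bucket name, and document contents as placeholders:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.ReplicateTo;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class DurableWriteSketch {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("10.0.0.1"); // placeholder node address
        Bucket bucket = cluster.openBucket("myBucket");        // placeholder bucket name
        JsonDocument doc = JsonDocument.create("user::42",
                JsonObject.create().put("name", "alice"));
        // Block until the mutation has been copied to at least one replica,
        // so a single node failure right after the write cannot lose it.
        bucket.upsert(doc, ReplicateTo.ONE);
        cluster.disconnect();
    }
}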
Hope that helps!