DataStax consistency

We've installed DataStax on five nodes, with search enabled on all five and a replication factor of 3. After adding 590 rows to a table, a select from node 1 returns all 590 rows, but selecting from the other nodes returns anywhere between 570 and 585 rows.
I tried using CONSISTENCY QUORUM in cqlsh, but nothing changed, and solr_query is not supported at CONSISTENCY QUORUM.
Is there a way to ensure that all data written to Cassandra is retrieved exactly as it was written?

As LHWizard mentioned, if you use consistency levels such that (nodes_written + nodes_read) > RF, you will ensure immediate consistency.
In your case, you can try using a CONSISTENCY ALL on your read so that all nodes are checked before returning (this will be immediately consistent even with write CL of ONE). This should actually trigger a read repair on the inconsistent nodes and the missing data will be streamed to those nodes.
You're right that solr queries can only be read at CL ONE. If you need higher consistency requirements, you will need to raise the CL for the writes to achieve what you need.
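For completeness, here is a minimal sketch of the same CONSISTENCY ALL read from application code, assuming the DataStax Java driver 3.x; the contact point, keyspace, and table names are placeholders.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ReadAtAllExample {
    public static void main(String[] args) {
        // Contact point, keyspace, and table names are placeholders.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Read at CL ALL: every replica is consulted, and any replica found
            // out of date is repaired before the result is returned.
            SimpleStatement read = new SimpleStatement("SELECT * FROM my_ks.my_table");
            read.setConsistencyLevel(ConsistencyLevel.ALL);
            ResultSet rs = session.execute(read);
            System.out.println("Rows read at CL ALL: " + rs.all().size());
        }
    }
}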

Efficient combination of a large number of affinity calls in Apache Ignite

We would like to compute over a large, partitionable dataset of 'products' in Ignite (100,000+ products, each linked to a large amount of extra data in different caches). We have several use cases:
1) Launch a compute job, limited to a large number (hundreds) of products, with a strong focus on responsiveness (<200 ms). We can use the product ID as an affinity key to collocate all extra data with the products, but affinityRun only allows a single key to be specified, which would mean launching hundreds of compute jobs. Ideally we would do an affinityRun on the entire set of product IDs at once and let Ignite distribute the compute job to the relevant nodes, but we are struggling to find a way to do this. (The compute job would then use only local queries on those compute nodes.)
2) Launch a compute job over the entire space of products in an efficient manner. We could launch the compute job on each compute node and use local queries, but that would no longer give us the benefits of falling back to backup partitions in case a primary partition is unavailable. This is an extreme case of problem number 1, just with a huge (all) number of product IDs as input.
We've been brainstorming about this for a while now, but it seems like we're missing something. Any ideas?
There is a version of affinityRun that takes a partition number as a parameter. Distribute your task per partition, and each node on the receiving end will process the data residing in that partition (just run a scan query for the partition). In case of failure, you simply restart the process for that partition and can filter out already-processed items with custom logic.
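As a rough sketch of that per-partition approach (the cache name "products" is a placeholder and retry/filter logic is omitted), it could look like this:
import java.util.Collections;
import javax.cache.Cache;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;

public class PerPartitionCompute {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // "products" is a placeholder cache name.
        int partitions = ignite.affinity("products").partitions();

        for (int part = 0; part < partitions; part++) {
            final int p = part;
            // The closure runs on the node that currently owns partition p; on
            // failure, only this partition's job needs to be resubmitted.
            ignite.compute().affinityRun(Collections.singleton("products"), p, () -> {
                IgniteCache<Integer, Object> cache = Ignition.localIgnite().cache("products");
                ScanQuery<Integer, Object> scan = new ScanQuery<Integer, Object>().setPartition(p);
                scan.setLocal(true); // scan only the local copy of partition p
                try (QueryCursor<Cache.Entry<Integer, Object>> cur = cache.query(scan)) {
                    for (Cache.Entry<Integer, Object> e : cur) {
                        // process e.getKey() / e.getValue() here
                    }
                }
            });
        }
    }
}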
An affinity job is simply one that executes on the data node where the key/value resides.
There are several ways to send a job to a particular node, not only via an affinity key. For example, you can target a node by its consistent ID, and in 2.4.10 (if I remember correctly) they added a way to query backups explicitly.
Regarding your scenario, I can think of the solution below:
SqlFieldsQuery query = new SqlFieldsQuery("select productID from CacheTable").setLocal(true);
You can prepare an affinity job with the SQL above, which selects all products from that node only, then iterate over them and run all further queries locally to gather the product information. Send that job to the required node, do your computation, reduce the results, and return them to the client.
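As a loose illustration of that idea, a sketch using affinityCall might look like the following; the cache name CacheTable and the affinity key 42 are assumptions for the example:
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class LocalSqlAffinityJob {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // 42 is a placeholder affinity key; the job lands on the node owning it.
        List<List<?>> rows = ignite.compute().affinityCall("CacheTable", 42, () -> {
            IgniteCache<Object, Object> cache = Ignition.localIgnite().cache("CacheTable");
            SqlFieldsQuery q = new SqlFieldsQuery("select productID from CacheTable").setLocal(true);
            // Runs only against data hosted on this node; do the heavy
            // per-product work here before returning a reduced result.
            return cache.query(q).getAll();
        });

        System.out.println("Rows returned from the owning node: " + rows.size());
    }
}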

Aerospike cluster behavior in different consistency mode?

I want to understand the behavior of Aerospike in its different consistency modes.
Consider an Aerospike cluster running with 3 nodes and a replication factor of 3.
AP mode is simple, and it says:
Aerospike will allow reads and writes in every sub-cluster.
and that the maximum number of nodes which can go down is < 3 (the replication factor).
For Aerospike strong consistency it says:
Note that the only successful writes are those made on replication-factor number of nodes. Every other write is unsuccessful
Does this really mean that no writes are allowed if the number of available nodes is less than the replication factor?
And then the same document says:
All writes are committed to every replica before the system returns success to the client. In case one of the replica writes fails, the master will ensure that the write is completed to the appropriate number of replicas within the cluster (or sub cluster in case the system has been compromised.)
What does "appropriate number of replicas" mean?
So if I lose one node from my 3-node cluster with strong consistency and replication factor 3, will I not be able to write data?
For Aerospike strong consistency it says:
Note that the only successful writes are those made on replication-factor number of nodes. Every other write is unsuccessful.
Does this really mean that no writes are allowed if the number of available nodes is less than the replication factor?
Yes, if there are fewer than replication-factor nodes then it is impossible to meet the user-specified replication factor.
All writes are committed to every replica before the system returns success to the client. In case one of the replica writes fails, the master will ensure that the write is completed to the appropriate number of replicas within the cluster (or sub-cluster in case the system has been compromised).
What does "appropriate number of replicas" mean?
It means replication-factor nodes must receive the write. When a node fails, a new node can be promoted to replica status until either the node returns or an operator registers a new roster (cluster membership list).
So if I lose one node from my 3-node cluster with strong consistency and replication factor 3, will I not be able to write data?
Yes, so having all nodes as replicas wouldn't be a very useful configuration. Replication factor 3 allows up to 2 nodes to be down, but only if the remaining nodes are able to satisfy the replication factor, so for replication factor 3 you would probably want to run with a minimum of 5 nodes.
You are correct, with 3 nodes and RF 3, losing one node means the cluster will not be able to successfully take write transactions since it wouldn't be able to write the required number of copies (3 in this case).
"Appropriate number of replicas" means a number of replicas that matches the configured replication factor.
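To make the failure mode concrete, here is a hypothetical sketch using the Aerospike Java client. It assumes a namespace (called sc_ns here) configured with strong-consistency true and replication-factor 3 on the 3-node cluster described above; with one node down, the put is expected to fail:
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

public class ScWriteExample {
    public static void main(String[] args) {
        // Host, namespace, set, and bin names are placeholders.
        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            Key key = new Key("sc_ns", "demo", "user-1");
            try {
                client.put(null, key, new Bin("name", "alice"));
                System.out.println("write committed on replication-factor nodes");
            } catch (AerospikeException e) {
                // With 3 nodes, RF 3, and one node down, the partition cannot
                // reach the required number of replicas, so the write is rejected.
                System.out.println("write rejected: " + e.getMessage());
            }
        }
    }
}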

How to Let Spark Handle Bigger Data Sets?

I have a very complex query that needs to join 9 or more tables with some 'group by' expressions. Most of these tables have almost the same number of rows. The tables also have some columns that can be used as the 'key' to partition them.
Previously the app ran fine, but now the data set is 3-4 times as large as before. My tests showed that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data and stalls (no matter how I adjust the memory, partitions, executors, etc.). The actual data is probably only dozens of GBs.
I would think that if the partitioning works properly, Spark shouldn't shuffle so much and the joins should be done on each node. It is puzzling why Spark is not 'smart' enough to do so.
I could split the data set (by the 'key' I mentioned above) into many data sets that could be processed independently, but the burden would then be on me... which defeats the very reason for using Spark. What other approaches could help?
I use Spark 2.0 over Hadoop YARN.
My tests showed that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data
When joining datasets, if the size of one side is below a certain configurable threshold, Spark broadcasts that entire table to each executor so that the join can be performed locally everywhere. Your observation is consistent with this. You can also provide the broadcast hint explicitly to Spark, like so: df1.join(broadcast(df2)).
Other than that, can you please provide more specifics about your problem?
(Some time ago I was also grappling with joins and shuffles for one of our jobs that had to handle a couple of TBs. We were using RDDs, not the Dataset API. I wrote up my findings separately; they may be of some use as you try to reason about the underlying data shuffle.)
Update: according to the documentation, spark.sql.autoBroadcastJoinThreshold is the configurable property; its default value is 10 MB. It does the following:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
So apparently, this is supported only for the Hive tables.
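For reference, a sketch of both knobs with the Spark 2.x Java API might look like the following; the table names, join key, and threshold value are made up for illustration:
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BroadcastJoinExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate();

        // Raise the automatic broadcast threshold (default ~10 MB) so that
        // smaller tables are shipped to executors instead of shuffled.
        spark.conf().set("spark.sql.autoBroadcastJoinThreshold", String.valueOf(256L * 1024 * 1024));

        // Table names and the join key are placeholders.
        Dataset<Row> df1 = spark.table("big_fact_table");
        Dataset<Row> df2 = spark.table("smaller_table");

        // Explicit broadcast hint: df2 is replicated to every executor and the
        // join is performed locally, avoiding a shuffle of df1.
        Dataset<Row> joined = df1.join(broadcast(df2), "product_key");

        joined.show();
        spark.stop();
    }
}
Note that the explicit broadcast() hint does not depend on the Hive statistics mentioned in the quoted documentation, which is why it is often the more practical route here.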

Best way to model millions of exist checks in Aerospike?

Having outgrown Redis for some data structures, I'm looking at other solutions with good disk/SSD performance. I recently discovered Aerospike, which seems to excel in an SSD environment.
The most memory-hungry structures are about 100,000 Redis sets, each of which can contain up to 10,000 strings. Each string is between 10 and 30 characters.
These sets are mostly used for exists / uniqueness checks.
What would be the best way to model these? I generally see 2 options:
* model a Redis set as an Aerospike lset
* model each value in a set separately.
Besides this choice, the 100,000 Redis sets also act as a partitioning of the keys. For reasons of locality it would probably make sense to have a similar sort of partitioning/namespacing in Aerospike. However, I'm pretty sure the notion of a 'namespace' in Aerospike isn't intended for this sort of key partitioning. What would be a correct way (if any) to do this in Aerospike, or is it not needed?
Aerospike does its own partitioning for load-balancing and high-availability needs. A namespace is synonymous with a database in the traditional sense, NOT with a partition of the data. Data in a namespace is partitioned and stored across the cluster; as a user, you need not worry about the placement of the data.
I would map a Redis set to an Aerospike lset (one to one). Aerospike should take care of data locality for the data in a given lset.
Yes, you should not be worrying about the locality of the data as Aerospike does auto-sharding. This ensures equal balancing of data distribution and read/write load across all nodes of the cluster.
Putting the data in an lset has its advantages: it gives you functionality similar to Redis, so you do not need to write your own. But at the same time it has its disadvantages, so you should choose based on your requirements. All operations on a single lset are serialized; if you expect reads/writes to a set to be parallelized, an lset may not be the right fit for you. Also, the exists check on an lset actually reads the full record and returns true/false, whereas Aerospike has an exists API for normal keys that returns true/false based on the in-memory index, which is much faster.
For this use case, you may not be able to segregate them into Aerospike 'sets': you need 100,000 sets, but as of now Aerospike only supports 1024 sets.
Let me add a third option to your list. You can model the key itself to create virtual sets for you as below:
If your actual key is key1 and you want it to go to set1, you can form the mashed key set1_key1.
When you want to check whether key7 exists in set5, check for the existence of set5_key7.
If you go with this model, you are exploiting Aerospike's data distribution and load balancing to the fullest, and the exists check will be the fastest option as there will be no I/O.
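A minimal sketch of that third option with the Aerospike Java client might look like this; the host, namespace, and bin names are placeholders:
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

public class VirtualSetExists {
    public static void main(String[] args) {
        // Host and namespace are placeholders; the "virtual set" lives in the key itself.
        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            // Add key1 to virtual set1 by writing record "set1_key1".
            client.put(null, new Key("test", null, "set1_key1"), new Bin("x", 1));

            // Membership check for key7 in set5: a primary-index lookup, no device I/O.
            boolean member = client.exists(null, new Key("test", null, "set5_key7"));
            System.out.println("set5 contains key7: " + member);
        }
    }
}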

Are Neo4J node ids optimized for access?

I am building a large graph database using neo4j.
I have my own external indexes which give me identifiers for relevant nodes that I use for further neo4j graph traversal. In other words I already have my start node ids when I get to query the database.
My question is: can node lookups be faster if I use neo4j/lucene indexes to access relevant nodes?
Or are queries such as:
START n=node({ids})
already optimized for node access and nothing can be gained by using:
START n=node:nodeIndexName(key={value})
?
Thanks,
Yes. Neo4j is optimized for lookups by node ID: at the persistence level each node occupies a fixed-size record in the node store, so accessing node 100 is essentially accessing record 100 directly.
I will warn you, though, that Neo4j makes no guarantees about a node ID once you delete the node: Neo4j reclaims IDs. So if over the course of your DB's life you delete and add multiple nodes, your external entries may still be "valid" but not point to what you'd expect.
EDIT: Also, why not just use Lucene to perform your lookups? Of course accessing the node by ID is faster, but that is what Lucene does under the covers when you do a lookup: key:name, value:frank will return node id 5123, and Neo4j will return the node that corresponds to that ID.
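As a small illustration of the ID-based access path, here is a sketch using the embedded Java API (Neo4j 3.x assumed; the store path and node id are placeholders):
import java.io.File;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class NodeIdLookup {
    public static void main(String[] args) {
        // The store path and node id are placeholders.
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase(new File("/path/to/graph.db"));

        try (Transaction tx = db.beginTx()) {
            // Direct record lookup by id: no index (Lucene or otherwise) is consulted.
            Node start = db.getNodeById(5123L);
            System.out.println("degree of start node: " + start.getDegree());
            tx.success();
        } finally {
            db.shutdown();
        }
    }
}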