How does Apache Ignite partition the spatial data?

I ran a geospatial query on Apache Ignite successfully, but I didn't understand how its partitioning works. How does Apache Ignite partition spatial data among nodes when we use the PARTITIONED CacheMode? Does it use a partitioning technique like a grid or a quad-tree? I saw that it creates 1024 partitions for each data set. How can I change the number of partitions? I already read the documentation but didn't find anything about this. Any suggestions or document links will be appreciated.

Partitioning is performed on a per-key basis using Rendezvous hashing; Apache Ignite is key-value based.
You can change its properties by specifying an AffinityFunction in the CacheConfiguration.
Partitioning is usually wholly unrelated to spatial indexing, since the spatial index is a secondary index.
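For example, here is a minimal Java sketch (the cache name and partition count are arbitrary) of swapping in a RendezvousAffinityFunction with a custom partition count instead of the default 1024:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

public class PartitionCountExample {
    public static void main(String[] args) {
        CacheConfiguration<Long, String> cacheCfg = new CacheConfiguration<>("geoCache");
        cacheCfg.setCacheMode(CacheMode.PARTITIONED);

        // Rendezvous affinity with 512 partitions instead of the default 1024;
        // the second constructor argument is the partition count.
        cacheCfg.setAffinity(new RendezvousAffinityFunction(false, 512));

        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache(cacheCfg);
        }
    }
}

The same can be done in XML by setting the affinity property on the cache configuration bean.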

Related

Apache Ignite analogue of Spark vector UDF and distributed compute in general

I have been using Spark for some time now with success in Python; however, we have a product written in C# that would greatly benefit from distributed and parallel execution. I did some research and tried out the new C# API for Spark, but it is a little restrictive at the moment.
As for Ignite, on the surface it seems like a decent alternative: it has good .NET support, clustering ability, and the ability to distribute compute across the grid.
However, I was wondering if it really can replace Spark in our use case. What we need is a distributed way to perform data-frame-type operations. In particular, a lot of our Python code was implemented using Pandas UDFs, and we let Spark worry about the data transfer and merging of results.
If I wanted to use Ignite, where our data is really more like a table (typically CSV sourced) rather than key/value based, is there an efficient way to represent that data across the grid and send computations to the cluster that execute on an arbitrary subset of the data, in the same way Spark does? Especially in the sense that the results of the calculations just become 1..n more columns in the data frame, without having to collect all the results back to the main program.
You can load your structured data (CSV) into Ignite using its SQL implementation:
https://apacheignite-sql.readme.io/docs/overview
It provides distributed SQL queries over this data along with index support. Spark also lets you work with structured data using SQL, but there are no indexes; indexes will help you significantly increase the performance of your SQL operations.
If you already have a working solution based on Spark data frames, then you can keep the same logic but use Ignite's integration with Spark instead:
https://apacheignite-fs.readme.io/docs/ignite-data-frame
In this case, you can have all the data stored in Ignite SQL tables and run SQL requests and other operations via Spark.
Here is an example of how to load CSV data into Ignite using Spark data frames and how it can be configured:
https://www.gridgain.com/resources/blog/how-debug-data-loading-spark-ignite
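A rough Java sketch of that approach, assuming the ignite-spark module is on the classpath (the table name, primary key column, config path and CSV path are placeholders):

import org.apache.ignite.spark.IgniteDataFrameSettings;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CsvToIgnite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("csv-to-ignite").getOrCreate();

        // Read the CSV into a regular Spark DataFrame.
        Dataset<Row> df = spark.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/path/to/data.csv");

        // Write it into an Ignite SQL table through the Ignite data frame integration.
        df.write()
            .format(IgniteDataFrameSettings.FORMAT_IGNITE())
            .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), "/path/to/ignite-config.xml")
            .option(IgniteDataFrameSettings.OPTION_TABLE(), "my_table")
            .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id")
            .mode(SaveMode.Append)
            .save();
    }
}

Once the data is in an Ignite SQL table, computations can be expressed as distributed SQL against the cluster instead of collecting everything back to the driver.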

Understanding distributed join of Apache Ignite

We are exploring using Apache Ignite in our project. Basically, we have dozens of Oracle tables. We want to load each table into an Ignite cache and then join these caches. There are many joins between our tables (so there will be many distributed joins between caches).
The uncertain thing is that it could be really hard to collocate our data using the affinity-collocation feature, as described here:
https://apacheignite.readme.io/docs/affinity-collocation
So I would ask: if our data in the caches is not collocated, does Ignite's distributed join support this (we are using Ignite 1.7.0)? I would imagine there will be a lot of data movement when doing the join (very similar to SQL on Hadoop, like Hive or Spark SQL).
I am also wondering how the performance of a non-collocated distributed join compares to Spark SQL.
I would add that using the distributed non-collocated mode for SQL queries doesn't mean that the data will be blindly moved around all the time. The engine will try its best to optimize the execution, and it may even result in no data movement at all. However, it depends on the type of query and how the data is spread out across the cluster.
In any case, my recommendation is to collocate as much data as you can, so that you can rely on the more performant collocated mode and fall back to the non-collocated mode for the rest of the scenarios.
I do believe that the performance of non-collocated Ignite queries will still be better than that of the Spark SQL engine, simply because Ignite allows you to index the data while Spark doesn't, which is essential.
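For reference, non-collocated mode is an explicit opt-in per query in the Java API; a small sketch (the cache and table names are made up) might look like this:

import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class NonCollocatedJoinExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Long, Object> orders = ignite.cache("Orders");

            // Join across caches whose data is NOT collocated by affinity key.
            SqlFieldsQuery qry = new SqlFieldsQuery(
                "SELECT o.id, c.name FROM Order o JOIN \"Customers\".Customer c ON o.customerId = c.id")
                .setDistributedJoins(true); // enable non-collocated joins for this query

            List<List<?>> rows = orders.query(qry).getAll();
            rows.forEach(System.out::println);
        }
    }
}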
You are right, non-collocated joins cause a lot of data movement. http://apacheignite.gridgain.org/docs/sql-queries#distributed-joins
Ignite tries to reduce unnecessary data movement using all available means: affinity collocation, replicated caches, near caches, indexes, and in-memory data storage.
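To illustrate the affinity-collocation part, one common pattern is a composite key class whose affinity field is annotated (the class and field names here are hypothetical):

import org.apache.ignite.cache.affinity.AffinityKeyMapped;

public class OrderKey {
    private long orderId;

    // All orders of a given customer land in the same partition as that customer,
    // so an Orders/Customers join on customerId can run in collocated mode.
    @AffinityKeyMapped
    private long customerId;

    public OrderKey(long orderId, long customerId) {
        this.orderId = orderId;
        this.customerId = customerId;
    }
}

The built-in AffinityKey class can be used instead if you don't want to define a custom key.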
Also, if you already use Spark, you can try backing it with Ignite to improve performance.
http://insidebigdata.com/2016/06/20/apache-ignite-and-apache-spark-complementary-in-memory-computing-solutions/

Type of Indices supported by Apache Ignite

In the documentation for Apache Ignite, it states that it provides indexing functionality for RDDs.
Also, in the link below we can find methods to create indices:
http://apacheignite.gridgain.org/docs/sql-queries
Is there any documentation on what kind of indices it supports underneath (B-trees, R-trees)?
Ignite indexes are either based on SnapTreeMap [1] or on ConcurrentSkipListMap. The former is used for indexes stored in off-heap memory, while the latter is for on-heap.
[1] https://github.com/nbronson/snaptree
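For completeness, the usual way to get such an index is to annotate the queryable fields and register the type on the cache; a small sketch (the Person class is just an illustration):

import java.io.Serializable;
import org.apache.ignite.cache.query.annotations.QuerySqlField;

public class Person implements Serializable {
    // Indexed field: Ignite maintains a sorted index over it.
    @QuerySqlField(index = true)
    private String name;

    // Visible to SQL but not indexed.
    @QuerySqlField
    private int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

The type then has to be registered via CacheConfiguration.setIndexedTypes(Long.class, Person.class) (or a QueryEntity) for the index to be built.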

RavenDb Sharding Hilo storage pattern

My understanding was that RavenDb was designed so that if one shard goes down, the other shards can operate without problems.
But recently I was implementing a ShardingResolutionStrategy and found the MetadataShardIdFor method. It is the method where, for each document type, we can specify which shard to use for storage.
So if I get it right, if the shard where the Hilo for a specific document type is stored goes down, we cannot create new documents of this type on the other shards (at least auto-generated ids will not work). Or maybe I am wrong, and the Hilo is replicated between shards in some magical way?
Sharding is designed to be independent, but in order to create consistent ids, we need to be able to create them from a consistent store.
Because of that, we separate the notion of splitting data across multiple nodes from HA.
The typical scenario is that the metadata shard is independent and runs with a replicated database that is shared across all shard nodes. In this fashion, if you lose the metadata shard, you just switch over.
This takes advantage of the fact that RavenDB sharding and replication are orthogonal.

What is a 'Partition' in Apache Helix

I am learning Apache Helix. I came across the keyword 'Partitions'.
According to the definition mentioned here, http://helix.apache.org/Concepts.html, each subtask (of a main task) is referred to as a partition in Helix.
When I went through the Distributed Lock Manager recipe, I found that partitions are nothing but instances of a resource (increasing numPartitions increases the number of locks):
final int numPartitions = 12;
admin.addResource(clusterName, lockGroupName, numPartitions, "OnlineOffline",
    RebalanceMode.FULL_AUTO.toString());
Can someone explain with a simple example what exactly a partition in Apache Helix is?
I think you're right that a partition is essentially an instance of a resource. As is the case in other distributed systems, partitions are used to achieve parallelism. A resource with only one instance can only run on one machine. Partitions simply provide the construct necessary to split a single resource among many machines by, well, partitioning the resource.
This is a pattern that is found in a large portion of distributed systems. The difference, though, is while e.g. distributed databases explicitly define partitions essentially as a subset of some larger data set that can fit on a single node, Helix is more generic in that partitions don't have a definite meaning or use case, but many potential meanings and potential use cases.
One of these use cases in a system with which I'm very familiar is Apache Kafka's topic partitions. In Kafka, each topic - essentially a distributed log - is broken into a number of partitions. While the topic data can be spread across many nodes in the cluster, each partition is constrained to a single log on a single node. Kafka provides scalability by adding new partitions to new nodes. When messages are produced to a Kafka topic, internally they're hashed to some specific partition on some specific node. When messages are consumed from a topic, the consumer switches between partitions - and thus nodes - as it consumes from the topic.
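As a toy illustration of that key-to-partition mapping (not Kafka's actual partitioner, just the general idea):

import java.util.Arrays;

public class PartitionDemo {
    // Map a message key to one of numPartitions partitions by hashing it.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 12;
        for (String key : Arrays.asList("user-1", "user-2", "user-3")) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}

Every key deterministically maps to one partition, so the data (or work) for that key always lands on whichever node currently hosts that partition.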
This pattern generally applies to many scalability problems and is found in almost any HA distributed database (e.g. DynamoDB, Hazelcast), map/reduce system (e.g. Hadoop, Spark), and other data- or task-driven systems.
The LinkedIn blog post about Helix actually gives a bunch of useful examples of the relationships between resources and partitions as well.