Understanding distributed joins in Apache Ignite

We are exploring using Apache Ignite in our project. Basically, we have dozens of Oracle tables, and we want to load each table into an Ignite cache and then join between these caches. There are many joins between our tables (so there will be many distributed joins between caches).
The uncertain thing is that it could be really hard to collocate our data using the affinity-collocation feature described here:
https://apacheignite.readme.io/docs/affinity-collocation
So, if our data in the caches is not collocated, does Ignite's distributed join support this (we are using Ignite 1.7.0)? I would imagine there will be a lot of data movement during the join (very similar to SQL on Hadoop, like Hive or Spark SQL).
Also, I am wondering how a non-collocated distributed join compares to Spark SQL in terms of performance.

I would add that using the distributed non-collocated mode for SQL queries doesn't mean that data will be blindly moved all the time. The engine will do its best to optimize the execution, and it may even result in no data movement at all. However, it depends on the type of query and on how the data is spread out across the cluster.
In any case, my recommendation is to collocate as much data as you can, so that you can rely on the most performant collocated mode and fall back to the non-collocated mode for the remaining scenarios.
I do believe that the performance of non-collocated Ignite queries will still be better than the performance of the Spark SQL engine, simply because Ignite allows you to index the data while Spark doesn't, which is essential.
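For illustration, here is a minimal Java sketch of running such a query with non-collocated distributed joins enabled (available since Ignite 1.7). The cache, table and column names are hypothetical, not taken from the question:

```java
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class NonCollocatedJoinSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Assumes "Order" and "Customer" caches with SQL-enabled types exist.
        IgniteCache<Long, Object> orders = ignite.cache("Order");

        // setDistributedJoins(true) lets the engine move rows between nodes
        // when the joined data is not collocated; expect extra network traffic.
        SqlFieldsQuery qry = new SqlFieldsQuery(
                "SELECT o.id, c.name " +
                "FROM \"Order\".Order o " +
                "JOIN \"Customer\".Customer c ON o.customerId = c.id")
                .setDistributedJoins(true);

        List<List<?>> rows = orders.query(qry).getAll();
        rows.forEach(System.out::println);
    }
}
```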

You are right, non-collocated joins cause a lot of data movement. http://apacheignite.gridgain.org/docs/sql-queries#distributed-joins
Ignite tries to reduce unnecessary data movement using all available means: affinity collocation, replicated caches, near caches, indexes, and in-memory data storage.
Also, if you already use Spark, you can try backing it with Ignite to improve performance.
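As a rough illustration of affinity collocation (the class and field names below are made up, not from the question): store each order on the same node as its customer, so a join on customerId stays local.

```java
import org.apache.ignite.cache.affinity.AffinityKeyMapped;

// Composite key for an "Order" cache: entries sharing the same customerId are
// mapped to the same partition (and hence the same node) as that customer.
public class OrderKey {
    private long orderId;

    @AffinityKeyMapped
    private long customerId;

    public OrderKey(long orderId, long customerId) {
        this.orderId = orderId;
        this.customerId = customerId;
    }

    // equals() and hashCode() over both fields are required for cache keys
    // and are omitted here for brevity.
}
```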
http://insidebigdata.com/2016/06/20/apache-ignite-and-apache-spark-complementary-in-memory-computing-solutions/

Related

Apache Ignite analogue of Spark vector UDF and distributed compute in general

I have been using Spark for some time now with success in Python, however we have a product written in C# that would greatly benefit from distributed and parallel execution. I did some research and tried out the new C# API for Spark, but it is a little restrictive at the moment.
In regards to Ignite, on the surface it seems like a decent alternative. It's got good .NET support, clustering ability, and the ability to distribute compute across the grid.
However, I was wondering if it really can replace Spark in our use case. What we need is a distributed way to perform data-frame-type operations. In particular, a lot of our Python code was implemented using Pandas UDFs, and we let Spark worry about the data transfer and merging of results.
If I wanted to use Ignite, where our data is really more like a table (typically CSV sourced) rather than key/value based, is there an efficient way to represent that data across the grid and send computations to the cluster that execute on an arbitrary subset of the data, in the same way Spark does? In particular, can the results of the calculations just become 1..n more columns in the dataframe without having to collect all the results back to the main program?
You can load your structured data (CSV) into Ignite using its SQL implementation:
https://apacheignite-sql.readme.io/docs/overview
It will give you distributed SQL queries over this data, with index support. Spark also lets you work with structured data using SQL, but there are no indexes. Indexes will help you significantly increase the performance of your SQL operations.
In case you already have a solution that works with Spark data frames, you can also keep the same logic but use Ignite's integration with Spark instead:
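As a hedged sketch of that first option, the DDL can be issued from Java through any cache, following the Ignite 2.x SQL docs linked above (table, column and index names here are illustrative):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.configuration.CacheConfiguration;

public class IgniteSqlTableSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Any cache can serve as the entry point for DDL statements.
        IgniteCache<?, ?> cache = ignite.getOrCreateCache(
                new CacheConfiguration<>("ddl").setSqlSchema("PUBLIC"));

        cache.query(new SqlFieldsQuery(
                "CREATE TABLE IF NOT EXISTS person (" +
                "  id LONG PRIMARY KEY, name VARCHAR, city_id LONG) " +
                "WITH \"template=partitioned\"")).getAll();

        // The index is what Spark SQL cannot give you out of the box.
        cache.query(new SqlFieldsQuery(
                "CREATE INDEX IF NOT EXISTS idx_person_city ON person (city_id)")).getAll();
    }
}
```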
https://apacheignite-fs.readme.io/docs/ignite-data-frame
In this case, you can have all data stored in Ignite SQL tables and run SQL queries and other operations through Spark.
Here you can see an example of how to load CSV data into Ignite using Spark DataFrames and how it can be configured:
https://www.gridgain.com/resources/blog/how-debug-data-loading-spark-ignite
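A rough Java sketch of that second option, writing a CSV-backed DataFrame into an Ignite SQL table through the ignite-spark module (the paths, table name and primary-key column are placeholders; check the option names against the Ignite docs for your version):

```java
import org.apache.ignite.spark.IgniteDataFrameSettings;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CsvToIgniteSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-to-ignite")
                .master("local[*]")
                .getOrCreate();

        // Read the CSV as a regular Spark DataFrame.
        Dataset<Row> csv = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/data/person.csv");

        // Persist it as an Ignite SQL table; Ignite creates the table if needed.
        csv.write()
                .format(IgniteDataFrameSettings.FORMAT_IGNITE())
                .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), "ignite-config.xml")
                .option(IgniteDataFrameSettings.OPTION_TABLE(), "person")
                .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id")
                .mode(SaveMode.Overwrite)
                .save();
    }
}
```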

Hazelcast vs Redis vs S3

I am currently evaluating the fastest possible caching solution we can use among the technologies in question. We know that Redis and Hazelcast are caching solutions by their very intent and definition, and there is a clear Stack Overflow comparison (redis vs hazelcast). But there is also AWS S3, which may not be a caching solution but is nevertheless a storage and retrieval service, and it supports SQL as well, which in my opinion makes it a qualifier in the race. Considering this, are there any thoughts on comparing the three based on speed, volumes of data, etc.?
Hazelcast also provides SQL-like capabilities: you can run queries to fetch data as a result set. Technology-wise, Hazelcast/Redis and S3 are fundamentally different, since the latter is a disk-bound data store, and those are proven to be significantly slower than their in-memory counterparts.
To put things in a logical perspective: S3, or any other disk-bound data store, cannot match the performance of accessing data from an in-memory data store.
However, it is also a common practice to run Hazelcast on top of a disk-bound data store to get a performance boost. In that type of architecture, your applications basically interact only with Hazelcast. You can then use Hazelcast tools to keep the cached data in sync with the underlying database.
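To illustrate the SQL-like querying mentioned above, here is a minimal Hazelcast sketch (package names follow Hazelcast 4.x; the map name and value type are made up):

```java
import java.io.Serializable;
import java.util.Collection;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import com.hazelcast.query.Predicates;

public class HazelcastQuerySketch {
    // Hypothetical value type; it must be serializable to live in the grid.
    public static class Employee implements Serializable {
        public long id;
        public int salary;

        @Override
        public String toString() {
            return id + ":" + salary;
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        IMap<Long, Employee> employees = hz.getMap("employees");

        // The predicate is evaluated on the members that own the data,
        // so only matching entries travel back as the result set.
        Collection<Employee> highEarners =
                employees.values(Predicates.sql("salary > 100000"));

        highEarners.forEach(System.out::println);
    }
}
```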

Why Presto is faster than Spark SQL [closed]

Why is Presto faster than Spark SQL?
Besides, what is the difference between Presto and Spark SQL in computing architecture and memory management?
In general, it is hard to say whether Presto is definitely faster or slower than Spark SQL. It really depends on the type of query you're executing, the environment, and the engine tuning parameters. However, from what I see in the industry (Uber and Netflix, for example), Presto is used for ad-hoc SQL analytics whereas Spark is used for ETL/ML pipelines.
One possible explanation is that there is not much overhead in scheduling a query for Presto: the Presto coordinator is always up and waiting for queries. Spark, on the other hand, takes a lazy approach; it takes time for the driver to negotiate resources with the cluster manager, copy jars, and start processing.
Another one is that the Presto architecture is quite straightforward. It has a coordinator that does SQL parsing, planning and scheduling, and a set of workers that execute the physical plan.
Spark core, on the other hand, has many more layers in between. Besides the stages that Presto has, Spark SQL has to cope with the resiliency built into RDDs, and do resource management and negotiation for its jobs.
Please also note that Spark SQL has a cost-based optimizer that performs better on complex queries, while Presto (0.199) has a legacy rule-based optimizer. There is an ongoing effort to bring a CBO to Presto, which might potentially beat Spark SQL performance.
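For reference, a minimal sketch of what turning on Spark's cost-based optimizer looks like (Spark 2.2+; the table and column names are placeholders). The CBO only helps once statistics have been collected:

```java
import org.apache.spark.sql.SparkSession;

public class SparkCboSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cbo-demo")
                .master("local[*]")
                .config("spark.sql.cbo.enabled", "true")
                .config("spark.sql.cbo.joinReorder.enabled", "true")
                .enableHiveSupport()
                .getOrCreate();

        // The optimizer relies on statistics gathered ahead of time.
        spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS");
        spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS customer_id, amount");

        // explain(true) shows the plan chosen using those statistics.
        spark.sql("SELECT c.name, SUM(o.amount) " +
                  "FROM orders o JOIN customers c ON o.customer_id = c.id " +
                  "GROUP BY c.name").explain(true);
    }
}
```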
I think the key difference is that the architecture of Presto is very similar to an MPP SQL engine. That means it is highly optimized just for SQL query execution, whereas Spark is a general-purpose execution framework that is able to run multiple different workloads such as ETL, machine learning, etc.
In addition, one trade-off Presto makes to achieve lower latency for SQL queries is to not care about mid-query fault tolerance. If one of the Presto worker nodes experiences a failure (say, shuts down), in most cases queries that are in progress will abort and need to be restarted. Spark, on the other hand, supports mid-query fault tolerance and can recover from such a situation, but in order to do that it needs to do some extra bookkeeping and essentially "plan for failure". That overhead results in slower performance when your cluster does not experience any faults.
Position:
Presto emphasizes querying, whereas Spark emphasizes computation.
Memory storage:
Both store data and compute in memory, but Spark will spill data to disk when it cannot get enough memory, whereas Presto will run out of memory (OOM).
Tasks, resources:
Spark commits tasks and applies for resources in real time at each stage (this strategy can result in slightly slower processing compared to Presto); Presto applies for all required resources and commits all tasks at once.
Data processing:
In Spark, data needs to be fully processed before being passed to the next stage. Presto uses a pipelined, page-based processing model: as soon as a page is finished, it can be sent to the next task (this approach greatly reduces the end-to-end response time of many queries).
Data fault tolerance:
If Spark fails or loses data, it will recompute it based on lineage. But in Presto this results in query failure.

Apache Cassandra and Spark

I am an experienced RDBMS developer and admin, but I am new to Apache Cassandra and Spark. I learned Cassandra's CQL, and the documentation says that CQL does not support joins and sub-queries because they would be too inefficient given Cassandra's distributed data nature.
So I concluded that in a distributed data environment, joins and sub-queries are not supported because they would hurt performance badly.
But then I learned Spark, which also works with distributed data, yet Spark supports all SQL features including joins and sub-queries, even though Spark is not a database system and thus does not even have indexes... So, my question is: how does Spark support joins and sub-queries on distributed data, and does it do so efficiently?
Thanks in advance.
Spark does the "hard work" required to do a join on distributed data. It performs large shuffles to align data on keys before actually performing joins. This basically means that any join requires a very large amount of data movement unless the original data sources are partitioned based on the keys used for joining.
C* does not allow generic joins like this because of the cost involved; it is geared towards OLTP workloads, and requiring a full data shuffle is inherently OLAP.
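A minimal sketch of the shuffle behaviour described above, plus the broadcast hint Spark offers to avoid it when one side of the join is small (the paths and column name are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.broadcast;

public class SparkJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("join-demo")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> orders = spark.read().parquet("/data/orders");
        Dataset<Row> customers = spark.read().parquet("/data/customers");

        // Plain join: Spark shuffles both datasets so that rows with the same
        // customer_id end up in the same partition before joining.
        Dataset<Row> shuffled = orders.join(customers, "customer_id");

        // Broadcast hint: the small table is copied to every executor and the
        // large one stays where it is, avoiding the full shuffle.
        Dataset<Row> broadcasted = orders.join(broadcast(customers), "customer_id");

        shuffled.explain();
        broadcasted.explain();
    }
}
```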
Apache Spark has the concept of an RDD (Resilient Distributed Dataset), which is created in memory.
It is basically the fundamental data structure in Spark.
Joins and queries are performed on these RDDs, and because they are processed in memory, that is the reason it is very efficient.
Please go through the docs below to get some idea of resilient distributed datasets:
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds

How many Vertica databases can run on a host at the same time?

I know that in Oracle I can have multiple homes running on the same host.
Can this be done in Vertica too? I am running the CE version of Vertica and it seems I cannot do this!
Vertica doesn't allow multiple databases within a single instance to be active; it makes sense that it also wouldn't allow multiple instances of Vertica, resulting in multiple databases, to be active at the same time.
EDIT: Reasons I say it makes sense: Vertica can be resource intensive. It is designed to deal with a lot of data. Having multiple "Verticas" fighting for disk, CPU, and bandwidth is going to negatively impact performance for all of them.
Also remember, Vertica is a distributed database, unlike Oracle. Therefore you have an instance on each of your cluster nodes, and the cluster has access to larger disk storage. Distributed databases are best used when the data of various apps stays in a single cluster, as each instance takes up a lot of CPU for data compression, read/write-optimized stores, and delivering performance.
You cannot run multiple databases in a Vertica cluster. That said, there are no set limits on the number of schemas you can have in a single database. Considering how many users and how much data can be handled by a single Vertica cluster (with one database), I would need a very compelling reason why multiple databases are necessary. It is rather like complaining that a house doesn't have three kitchens, one each for breakfast, lunch and dinner.