Why Presto is faster than Spark SQL [closed] - apache-spark-sql

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
Why is Presto faster than Spark SQL?
Besides what is the difference between Presto and Spark SQL in computing architectures and memory management?

In general, it is hard to say if Presto is definitely faster or slower than Spark SQL. It really depends on the type of query you’re executing, environment and engine tuning parameters. However, what I see in the industry(Uber, Neflix examples) Presto is used as ad-hock SQL analytics whereas Spark for ETL/ML pipelines. 
One possible explanation, there is no much overhead for scheduling a query for Presto. Presto coordinator is always up and waits for query. On the other hand, Spark is doing lazy approach. It takes time for the driver to negotiate with the cluster manager the resources, copy jars and start processing.
Another one that Presto architecture quite straightforward. It has a coordinator that does SQL parsing, planning, scheduling and a set of workers that execute a physical plan.
On the other hand, Spark core has much more layers in between. Besides stages that Presto has, Spark SQL has to cope with a resiliency build into RDD, do resource management and negotiation for the jobs.
Please also note that Spark SQL has Cost-Based-Optimizer that performs better on complex queries. While Presto(0.199) has a legacy ruled based optimizer. There is ongoing effort to bring CBO to Presto which might potentially beat Spark SQL performance.

I think the key difference is that the architecture of Presto is very similar to an MPP SQL engine. That means is highly optimized just for SQL query execution vs Spark being a general purpose execution framework that is able to run multiple different workloads such as ETL, Machine Learning etc.
In addition, one trade-off Presto makes to achieve lower latency for SQL queries is to not care about the mid-query fault tolerance. If one of the Presto worker nodes experiences a failure (say, shuts down) in most cases queries that are in progress will abort and need to be restarted. Spark on the other hand supports mid-query fault-tolerance and can recover from such a situation but in order to do that, it needs to do some extra bookkeeping and essentially "plan for failure". That overhead results in slower performance when your cluster does not experience any faults.

Position:
Presto emphasis on query, however spark emphasis on calculation.
Memory storage:
Both are memory store and calculations, spark will write the data to disk when it cannot get enough memory, but presto lead to OOM.
Tasks, resources:
The spark commits tasks and applies for resources in real time at each stages(this strategy can result in a slightly slower processing speed compared to presto); Presto applies for all required resources and commits all tasks once.
Data processing:
In spark, data needs to be fully processed before passing to the next stage. Presto is a batch (page) pipeline processing mode.. As long as the page is finished, it can be sent to the next task(This approach greatly reduces the end-to-end response time of various queries).
Data fault tolerance:
If spark fails or loses data, it will be recalculated based on kinship. But presto will result in query failure.

Related

What are some workloads where it makes sense to use MapReduce over SQL and vice versa?

It seems like all queries expressed in SQL can be converted into MapReduce jobs. This is in essence what Spark SQL does. SparkSQL takes in SQL, converts it to a MapReduce job then executes the MapReduce job on Spark's runtime.
All questions which can be answered by SQL can be answered by MapReduce jobs. Can all MapReduce jobs also be written as SQL (maybe with custom user defined functions)? When does it make sense to use MapReduce over SQL or vice versa?
SQL is useful when you have structured data (e.g. tables, with clearly defined columns and, usually, data types). Using SQL with that structure you can select columns, join them, etc.
With MapReduce you can do that (Spark SQL will help you do that) but you can also do much more. A typical example is a word count app that counts the words in text files. Text files do not have any predefined structure that you can use to query them using SQL. Take into account that kind of applications are usually coded using Spark core (i.e. RDD) instead of Spark SQL, since Spark SQL needs also a structure.
Another maybe more real use case is processing large amounts of log files using MapReduce (again, log files does not have a relational structure such as the one required by SQL).
SQL and MapReduce also have their own advantages. To data analysist , they don't need to learn how to write the MapReduce Program. And from the perspective of developer, writing MapReduce program leave enough room to tuning the program ,like add random prefix to skewing data.
And In the long run, with the development of SQL interpreters, use SQL over MapReduce/Spark RDD.

Understanding distributed join of Apache Ignite

We are exploring to use Apache Ignite in our project. Basically, we have dozens of oracle tables.And we want to load each table into Ignite Cache ,and then do join between these caches. There are many joins between our tables(so there will be many distributed join between caches).
The uncertain thing it that it could be really hard to collocate our data using the affinity-collocation feature... as described here:
https://apacheignite.readme.io/docs/affinity-collocation
So, I would ask if our data in cache is not collocated, then does Ignite distributed join support this(we are using Ignite 1.7.0)? I would imagine there will be many data movement when doing the join(This would be very similar to SQL on Hadoop, like Hive or Spark SQL)
Also, I am wondering the performance between non-collocation distributed join and spark sql.
I would add that if you use distributed non-collocated mode for SQL queries then it doesn't mean that the data will be silly moved all the time. The engine will try all its best to optimize the execution and, even, it may result in no data movement at all. However, it depends on a type of query and how data is spread our across the cluster.
In any case, my recommendation will be to collocate as much data as you can so that you can rely on the most performant collocated mode and fallback to non-collocated mode for the rest of the scenarios.
I do believe that the performance of non-collocated Ignite queries will be still better than the performance of Spark SQL engine simply because Ignite allows you to index the data while Spark doesn't which is essential.
You are right, non-collocated joins causes many data movement. http://apacheignite.gridgain.org/docs/sql-queries#distributed-joins
Ignite tries to reduce unnecessary data movement using all available ways. Affinity-collocation, Replicated caches, Near Caches, Indices, In-memory data storage.
Also, if you already use Spark, you can try to back it by Ignite to improve performance.
http://insidebigdata.com/2016/06/20/apache-ignite-and-apache-spark-complementary-in-memory-computing-solutions/

Apache Cassandra and Spark

I am an experienced RDBMD's developer and admin. But I am new to Apache Cassandra and Spark. I learned Cassandra's CQL, and the documentation says that CQL does not support joins and sub-queries because it would be too inefficient in Cassandra because of its distributed data nature.
So, I concluded that in distributed data env., joins and sub-queries are not supported because they will affect performance badly.
But then I learned Spark, which also works with distributed data, but Spark supports all SQL features including joins and sub-queries. Even though Spark is not database system and thus does not even have indexes... So, my question is how Spark does support joins and sub-queries on distributed data?, and does it do it efficiently?.
Thanks in advance.
Spark does the "hard work" required to do a join on distributed data. It performs large shuffles to align data on keys before actually performing joins. This basically means that any join requires a very large amount of data movement unless the original data sources are partitioned based on the keys used for joining.
C* does not allow for generic joins like this because of the cost involved, it is geared towards OLTP workloads and requiring a full data shuffle is inherently OLAP.
Apache spark has a concept of RDD(Resilient Distributed DataSet)which gets created in memory.
Its basically a fundamental data structure in spark.
Joins, queries are performed on this RDDs and as it operates in memory ,that`s the reason it is very efficient.
Please go through the docs below for getting some idea on Resilient Dataset
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds

Pros & cons of BigQuery vs. Amazon Redshift [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
Comparing Google BigQuery vs. Amazon Redshift shows that both can answer same set of requirements, differ mostly by cost plans. It seems that Redshift is more complex to configure (defining keys and optimization work) vs. Google BigQuery that perhaps has an issue with joining tables.
Is there a pros & cons list of Google BigQuery vs. Amazon Redshift?
I posted this comparison on reddit. Quickly enough a long term RedShift practitioner came to comment on my statements. Please see https://www.reddit.com/r/bigdata/comments/3jnam1/whats_your_preference_for_running_jobs_in_the_aws/cur518e for the full conversation.
Sizing your cluster:
Redshift will ask you to choose a number of CPUs, RAM, HD, etc. and to turn them on.
BigQuery doesn't care. Use it whenever you want, no provisioning needed.
Hourly costs when doing nothing:
Redshift will ask you to pay per hour of each of these servers running, even when you are doing nothing.
When idle BigQuery only charges you $0.02 per month per GB stored. 2 cents per month per GB, that's it.
Speed of queries:
Redshift performance is limited by the amount of CPUs you are paying for
BigQuery transparently brings in as many resources as needed to run your query in seconds.
Indexing:
Redshift will ask you to index (correction: distribute) your data under certain criteria, and you'll only be able to run fast queries based on this index.
BigQuery has no indexes. Every operation is fast.
Vacuuming:
Redshift requires periodic maintenance and 'vacuum' operations that last hours. You are paying for each of these server hours.
BigQuery does not. Forget about 'vacuuming'.
Data partitioning and distributing:
Redshift requires you to think about how to distribute data within your servers to keep performance up - optimization that works only for certain queries.
BigQuery does not. Just run whatever query you want.
Streaming live data:
Impossible(?) with Redshift.
BigQuery easily handles ingesting up to 100,000 rows per second per table.
Growing your cluster:
If you have more data, or more concurrent users scaling up will be painful with Redshift.
BigQuery will just work.
Multi zone:
You want a multi-zone Redshift for availability and data integrity? Painful.
BigQuery is multi-zoned by default.
To try BigQuery you don't need a credit card or any setup time. Just try it (quick instructions to try BigQuery).
When you are ready to put your own data into BigQuery, just copy your JSON new-line separated logs from to Google Cloud Storage and import them.
See this in depth guide to data warehouse pricing on the cloud:
Understanding Cloud Pricing Part 3.2 - More Data Warehouses
Amazon Redshift is a standard SQL database (based on Postgres) with MPP features that allow it to scale. These features also require you to conform your data model somewhat to get the best performance. It supports a large amount of the SQL standard and most tools that can speak to Postgres can use it unchanged.
BigQuery is not a database, in the sense that there it doesn't use standard SQL and doesn't provide JDBC/ODBC connectivity. It's a unique service with it's own API and interfaces. It provides limited support for SQL queries but most users interact with via custom code (Java, Python, etc.). Some 3rd party tools have added support for BigQuery but existing tools will not work without modification.
tl;dr - Redshift is better for interacting with existing tools and using complex SQL. BigQuery is better for custom coded interactions and teams who dislike SQL.
UPDATE 2017-04-17 - Here's a much more up to date summary of the cost and speed differences (wrapped in a sales pitch so YMMV). TL;DR - Redshift is usually faster and will be cheaper if you query the data somewhat regularly. http://blog.panoply.io/a-full-comparison-of-redshift-and-bigquery
UPDATE - Since I keep getting down votes on this (🤷‍♂️) here's an up-to-date response to the items in the other answer:
Sizing your cluster:
Redshift allows you to tailor your costs to your usage. If you want the fastest possible queries choose SSD nodes and if you want the lowest possible cost per GB choose HDD nodes. Start small and add nodes whenever you want.
Hourly costs when doing nothing:
Redshift keeps your cluster ready for queries, can respond in milliseconds (result cache) and it provides a simple, predictable monthly bill.
For example, even if some script accidentally runs 10,000 giant queries over the weekend your Redshift bill will not increase at all.
Speed of queries:
Redshift performance is absolutely best in class and gets faster all the time. 3-5x faster in the last 6 months.
Indexing:
Redshift has no indexes. It allows you to define sort keys to optimize performance from fast to insanely fast.
Vacuuming:
Redshift now automatically runs routine maintenance such as ANALYZE and VACUUM DELETE when your cluster has free resource.
Data partitioning and distributing:
Redshift never requires distribution. It allows you to define distribution keys which can make even huge joins very fast.
{Ask competitors about join performance…}
Streaming live data:
Redshift has 2 choices
Stream real time data into Redshift using Amazon Kinesis Firehose.
Skip ingestion altogether by querying your real time instantly on S3 as soon as it land (and at high speeds) using Redshift Spectrum external tables.
Growing your cluster:
Redshift can elastically resize most clusters in a few minutes.
Multi zone:
Redshift seamlessly replaces any failed hardware and continuously backs up your data, including across regions if desired.

Optimization techniques for large databases [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
What optimization techniques do you use on extremely large databases? If our estimations are correct, our application will have billions of records stored in the db (MS SQL Server 2005), mostly logs that will be used for statistics. The data contains numbers (mostly integer) and text (error message texts, URLs) alike.
I am interested in ANY kind of tips, hacks, solutions.
The question is a little big vague, but here are a few tips:
Use appropriate hardware for your databases. I'd opt for 64-bit OS as well.
Have dedicated machines for the DBs. Use fast disks configured for optimal performance. The more disks you can span over, the better the performance.
Optimize the DB for the type of queries that will be performed. What happens more SELECTs or INSERTs?
Does the load happens for the entire day, or for just few hours? Can you postpone some of the things to be run for the night?
Have incremental backups.
If you'll consider Oracle instead of SQL Server, you could use features such as Grid and Table Partitioning, which might boost performance considerably.
Consider having some load-balancing solution between the DB servers.
Pre-design the schemes and tables, so queries will be performed as fast as possible. Consider the appropriate indexes as well.
You're gonna have to be more specific about the way you're going to store those logs. Are they LOBs in the DB? Simple text records?
I don't use it myself but I have read that one can use Hadoop in combination with hbase for distributed storage and distributed analysing of data like logs.
duncan's link has a good set of tips. Here are a few more tips:
If you do not need to query against totally up-to-date data (i.e. if data up to the last hour or close of business yesterday is acceptable), consider building a separate data mart for the analytics. This allows you to optimise this for fast analytic queries.
The SQL Server query optimiser has a star transformation operator. If the query optimiser recongises this type of query it can select what slice of data you want by filtering based on the dimension tables before it touches the fact table. This reduces the amount of I/O needed for the query.
For VLDB applications involving large table scans, consider direct attach storage with as many controllers as possible rather than a SAN. You can get more bandwidth cheaper. However, if your data set is less than (say) 1TB or so it probably won't make a great deal of difference.
A 64-bit server with lots of RAM is good for caching if you have locality of reference in your query accesses. However, a table scan has no locality of reference so once it gets significantly bigger than the RAM on your server extra memory doesn't help so much.
If you partition your fact tables, consider putting each partition on a sepaarate disk array - or at least a separate SAS or SCSI channel if you have SAS arrays with port replication. Note that this will only make a difference if you routinely do queries across multiple partitions.