Is it true that, generally, Tez performs better for smaller queries (where results are expected interactively, in minutes rather than hours) and MR performs better as an execution engine for batch queries (taking hours)? Or can we say that, irrespective of query type, Tez is always the best choice?
Tez improves processing for both small-scale (low-latency) and large-scale (high-throughput) workloads. The more complex the query, the more it benefits from Tez. For simple queries consisting of a single map step there will most probably be no difference at all, because there is nothing to optimize. Tez represents a query as a DAG (directed acyclic graph) within a single job, eliminates unnecessary steps such as writing intermediate results to durable storage and sorting the output of each Map, and also enables container reuse.

Tez is always the best choice: for simple queries it will work no worse than MR, and it will work much better for complex queries. Also consider this: MR and Tez each require tuning a different set of configuration parameters, and there are a lot of Tez-specific and a lot of MR-specific ones. Choose Tez and you will simplify your life even in cases where there is nothing to optimize. Also, Hive-on-MR has been deprecated in Hive 2 releases.
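For reference, the engine is a per-session setting, so you can run the same query on either engine while comparing them. A minimal sketch, assuming the PyHive client against HiveServer2 (host, port, table and column names are placeholders):

from pyhive import hive   # assumed client library, not part of Hive itself

# Connect to HiveServer2 (placeholder host/port).
conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()

# Pick the execution engine for this session; "mr" would select MapReduce
# on Hive 1.x, while Hive 2 deprecates Hive-on-MR.
cur.execute("SET hive.execution.engine=tez")

# A simple aggregation; table and column names are illustrative.
cur.execute("SELECT dt, COUNT(*) AS events FROM clickstream GROUP BY dt")
print(cur.fetchall())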
A Spark newbie here.
I recently started playing around with Spark on my local machine on two cores by using the command:
pyspark --master local[2]
I have a 393 MB text file which has almost a million rows. I wanted to perform some data manipulation operations. I am using the built-in DataFrame functions of PySpark to perform simple operations like groupBy, sum, max, stddev.
However, when I do the exact same operations in pandas on the exact same dataset, pandas seems to defeat pyspark by a huge margin in terms of latency.
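For context, a minimal sketch of the kind of side-by-side comparison being described; the file path, delimiter and column names are hypothetical:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

# PySpark: builds a query plan, then executes it on the JVM when show() is called.
sdf = spark.read.csv("data.txt", sep="\t", header=True, inferSchema=True)
sdf.groupBy("category").agg(
    F.sum("amount"), F.max("amount"), F.stddev("amount")
).show()

# pandas: the same aggregation runs entirely in-process on one core.
pdf = pd.read_csv("data.txt", sep="\t")
print(pdf.groupby("category")["amount"].agg(["sum", "max", "std"]))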
I was wondering what could be a possible reason for this. I have a couple of thoughts.
Do the built-in functions perform serialization/deserialization inefficiently? If so, what are the alternatives?
Is the dataset so small that it cannot outrun the overhead cost of the underlying JVM on which Spark runs?
Thanks for looking. Much appreciated.
Because:
Apache Spark is a complex framework designed to distribute processing across hundreds of nodes, while ensuring correctness and fault tolerance. Each of these properties has significant cost.
Because purely in-memory in-core processing (Pandas) is orders of magnitude faster than disk and network (even local) I/O (Spark).
Because parallelism (and distributed processing) adds significant overhead, and even with an optimal (embarrassingly parallel) workload it does not guarantee any performance improvement.
Because local mode is not designed for performance. It is used for testing.
Last but not least: 2 cores working on 393 MB is not enough to see any performance improvements, and a single node doesn't provide any opportunity for distribution.
See also: Spark: Inconsistent performance number in scaling number of cores; Why is pyspark so much slower in finding the max of a column?; Why does my Spark run slower than pure Python? Performance comparison.
You can go on like this for a long time...
Why is Presto faster than Spark SQL?
Besides that, what is the difference between Presto and Spark SQL in terms of computing architecture and memory management?
In general, it is hard to say whether Presto is definitely faster or slower than Spark SQL. It really depends on the type of query you're executing, the environment and the engine tuning parameters. However, from what I see in the industry (Uber, Netflix examples), Presto is used for ad-hoc SQL analytics whereas Spark is used for ETL/ML pipelines.
One possible explanation: there is not much overhead in scheduling a query for Presto. The Presto coordinator is always up and waiting for queries. Spark, on the other hand, takes a lazy approach: it takes time for the driver to negotiate resources with the cluster manager, copy jars and start processing.
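To illustrate the lazy side of this, a minimal sketch (file path and column name are made up): nothing runs until an action is called, and that first action pays the scheduling and startup cost.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.read.parquet("events.parquet")      # reads only metadata, no full scan
errors = df.filter(df["status"] == "error")    # still just a logical plan
print(errors.count())                          # only this action triggers execution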
Another one is that the Presto architecture is quite straightforward. It has a coordinator that does SQL parsing, planning and scheduling, and a set of workers that execute the physical plan.
On the other hand, Spark core has many more layers in between. Besides the stages that Presto also has, Spark SQL has to cope with the resiliency built into RDDs, and do resource management and negotiation for its jobs.
Please also note that Spark SQL has a Cost-Based Optimizer that performs better on complex queries, while Presto (0.199) has a legacy rule-based optimizer. There is an ongoing effort to bring a CBO to Presto, which might potentially beat Spark SQL performance.
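For completeness, Spark's CBO is disabled by default and relies on collected statistics; a sketch of turning it on (a Hive-backed catalog is assumed, and the table and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Enable the cost-based optimizer and collect the statistics it relies on.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS user_id, country")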
I think the key difference is that the architecture of Presto is very similar to an MPP SQL engine. That means it is highly optimized just for SQL query execution, whereas Spark is a general-purpose execution framework that is able to run multiple different workloads such as ETL, machine learning etc.
In addition, one trade-off Presto makes to achieve lower latency for SQL queries is not to provide mid-query fault tolerance. If one of the Presto worker nodes experiences a failure (say, it shuts down), in most cases queries that are in progress will abort and need to be restarted. Spark, on the other hand, supports mid-query fault tolerance and can recover from such a situation, but in order to do that it needs to do some extra bookkeeping and essentially "plan for failure". That overhead results in slower performance when your cluster does not experience any faults.
Position:
Presto emphasizes querying, whereas Spark emphasizes computation.
Memory storage:
Both store and compute in memory; Spark will spill data to disk when it cannot get enough memory, whereas Presto will fail with an out-of-memory error.
Tasks, resources:
Spark commits tasks and applies for resources in real time at each stage (this strategy can result in slightly slower processing compared to Presto); Presto applies for all required resources and commits all tasks at once.
Data processing:
In Spark, data needs to be fully processed before being passed to the next stage. Presto uses a pipelined, page-based processing model: as soon as a page is finished, it can be sent to the next task (this approach greatly reduces the end-to-end response time of many queries).
Data fault tolerance:
If Spark fails or loses data, it will recompute it based on lineage. Presto, however, will simply fail the query.
We have setup a 3 node performance cluster with 16G RAM and 8 Cores each. Our use case is to write 1 million rows to a single table with 101 columns which is currently taking 57-58 mins for the write operation. What should be our first steps towards optimizing the write performance on our cluster?
The first thing I would do is look at the application that is performing the writes:
What language is the application written in and what driver is it using? Some drivers can offer better inherent performance than others, i.e. the Python, Ruby, and Node.js drivers may only make use of one thread, so running multiple instances of your application (1 per core) may be something to consider. Your question is tagged 'spark-cassandra-connector', so that possibly indicates you are using that, which uses the DataStax Java driver, which should perform well as a single instance.
Are your writes asynchronous or are you writing data one row at a time? How many writes does it execute concurrently? Too many concurrent writes could cause pressure in Cassandra, but too few concurrent writes could reduce throughput. If you are using the Spark connector, are you using saveToCassandra/saveAsCassandraTable or something else?
Are you using batching? If you are, how many rows are you inserting/updating per batch? Too many rows could put a lot of pressure on Cassandra. Additionally, are all of your inserts/updates going to the same partition within a batch? If they aren't in the same partition, you should consider not batching them up.
Spark Connector Specific: You can tune the write settings, like batch size, batch level (i.e. by partition or by replica set), write throughput in MB per core, etc. You can see all these settings here.
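As a concrete illustration of that last point, a hedged sketch of setting a few of the connector's write options through the DataFrame API; the source path and keyspace/table names are placeholders, and the exact option keys can vary between connector versions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("rows_to_load.parquet")   # placeholder source of the 1M rows

(df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="perf_test", table="wide_table")
    .option("spark.cassandra.output.concurrent.writes", "10")          # batches in flight per task
    .option("spark.cassandra.output.batch.size.rows", "200")           # rows per unlogged batch
    .option("spark.cassandra.output.batch.grouping.key", "partition")  # batch only within a partition
    .mode("append")
    .save())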
The second thing I would do is look at metrics on the Cassandra side on each individual node.
What do the garbage collection metrics look like? You can enable GC logging by uncommenting lines in conf/cassandra-env.sh (as shown here); see also "Are Your Garbage Collection Logs Speaking to You?". You may need to tune your GC settings, although if you are using an 8 GB heap the defaults are usually pretty good.
Do your CPU and disk utilization indicate that your systems are under heavy load? Your hardware or configuration could be constraining your capacity (see Selecting hardware for enterprise implementations).
Commands like nodetool cfhistograms and nodetool proxyhistograms will help you understand how long your requests are taking (proxyhistograms), and cfhistograms (latencies in particular) could give you insight into any other possible disparities between how long it takes to process a request vs. perform the mutation operations.
I am aware that multiple connections use multiple CPU cores in PostgreSQL and hence run in parallel. But when I execute a long-running query, say 30 seconds (let's assume that this cannot be optimized further), the I/O is blocked and it does not run any other query from the same client/connection.
Is this by design, or can it be improved?
So I am assuming that the best way to run long-running queries is to get a new connection, or not to run any other query on the same connection until that query is complete?
It is a design limitation.
PostgreSQL uses one process per connection, and has one session per process. Each process is single-threaded and makes heavy use of globals inherited via fork() from the postmaster. Shared memory is managed explicitly.
This has some big advantages in ease of development, debugging and maintenance, and makes the system more robust in the face of errors. However, it makes it significantly harder to add parallelization on a query level.
There's ongoing work to add parallel query support, but at present the system is really limited to using one CPU core per query. It can benefit from parallel I/O in some areas, like bitmap index scans (via effective_io_concurrency), but not in others.
There are some IMO pretty hacky workarounds like PL/Proxy, but mostly you have to deal with parallelization yourself on the client side if it's needed. This is rapidly becoming one of the more significant limitations impacting PostgreSQL. Applications can split up large queries into multiple smaller queries that each affect a subset of the data, then combine the results client-side (or in an unlogged table that then gets further processed), i.e. a map/reduce-style pattern. If a mix of big, long-running queries and low-latency OLTP queries is needed, multiple connections are required, and the app should usually use an internal connection pool.
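A minimal sketch of that client-side map/reduce pattern, using psycopg2 with one connection per worker; the DSN, table, column and id ranges are made up:

from concurrent.futures import ThreadPoolExecutor
import psycopg2

DSN = "dbname=test user=app"    # placeholder connection string
RANGES = [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 3_000_000)]

def partial_sum(bounds):
    lo, hi = bounds
    # One connection per worker, so each range runs in its own backend process.
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE id >= %s AND id < %s",
            (lo, hi),
        )
        return cur.fetchone()[0]

with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
    total = sum(pool.map(partial_sum, RANGES))    # "reduce" step happens client-side
print(total)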
I'm a student and I'm doing my database assignment.
I want to use indexing and query optimization for my database optimization strategy.
The problem is: how can I prove that my strategy makes an improvement? My lecturer said that query optimization can be proved by calculation. Does anyone have more ideas? What should I calculate?
What about indexing? I need evidence to prove it works. How?
In terms of evidence of optimization, you have to have instrumented code for your test cases (e.g. so you can take timings accurately) and re-runnable test cases; a minimal timing harness is sketched after the list below. The ideal situation for a re-runnable set of test cases is to also be able to reset to a baseline database, so you can guarantee that the starting conditions of the data are the same for each test run.
You also need to understand for each test case other more subtle factors:
Are you running against a cold procedure cache or a warm procedure cache?
Are you running against a cold data cache or a warm data cache?
For larger datasets, are you using the exact same table, i.e. have no page splits occurred since?
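A minimal sketch of such an instrumented, re-runnable test case, assuming PostgreSQL and psycopg2 purely for illustration (the DSN and query are placeholders; the same pattern works with any DB-API driver):

import statistics
import time
import psycopg2

DSN = "dbname=assignment user=student"                    # placeholder connection string
QUERY = "SELECT * FROM orders WHERE customer_id = 42"     # placeholder test query

def run_once():
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        start = time.perf_counter()
        cur.execute(QUERY)
        cur.fetchall()
        return time.perf_counter() - start

# Restore your baseline database here (e.g. reload a dump) so every test run
# starts from identical data, then time several runs: the first run reflects a
# cold cache, later runs a warm cache.
timings = [run_once() for _ in range(5)]
print("runs:  ", [f"{t:.4f}s" for t in timings])
print("median:", f"{statistics.median(timings):.4f}s")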
I would think a before and after explain plan would go a long way towards proving an improvement.
See SQL Server Performance HERE.
Which DBMS are you using?
I suggest you take a look at what tracing options your DBMS product provides. For example, in Oracle you can use SQL Trace and parse the output using tkprof to provide you with the figures you'll need to prove that your database optimization strategy shows an improvement.