I am trying to figure out how and where execution happens when Spark SQL
queries tables registered in the Hive Metastore.
Spark SQL runs on the Spark engine, as we already know, but how does it interact with the Hive Metastore in these scenarios (internal and external tables):
Spark SQL uses the Spark distributed SQL engine, but Hive uses the MapReduce or Tez engine. Does Spark copy the data into memory and run the query there, or does it use the Thrift server just to interact with the Hive Metastore? (Reference: Distributed SQL Engine.) I am trying to understand where the computation happens and whether it can be optimized to reduce execution time.
I know the result is a DataFrame, which adds overhead, so understanding this execution path and the best practices would give me a better idea of where I can change the execution behaviour.
Spark SQL uses the Spark distributed SQL engine, and here Hive also uses the Spark engine. Does Spark use the Thrift server to interact with the Hive Metastore, is any data copied, and where does the query execute?
Note: All these questions arise because a direct Hive query using Beeline or Hue is fast, but the same query in Spark SQL is very slow. I know that schema inference and converting the query results into a DataFrame take place, but where does all of this execution happen, and how can I avoid bottlenecks? I hope not to get an answer that just says to read the Parquet files directly into Spark and go from there; my main point is interacting with the Hive Metastore (internal and external tables).
On top of all this: converting HiveQL queries directly to PySpark does improve performance, but my question is not about that approach ...
It seems like all queries expressed in SQL can be converted into MapReduce jobs. This is in essence what Spark SQL does. SparkSQL takes in SQL, converts it to a MapReduce job then executes the MapReduce job on Spark's runtime.
All questions which can be answered by SQL can be answered by MapReduce jobs. Can all MapReduce jobs also be written as SQL (maybe with custom user defined functions)? When does it make sense to use MapReduce over SQL or vice versa?
SQL is useful when you have structured data (e.g. tables, with clearly defined columns and, usually, data types). Using SQL with that structure you can select columns, join them, etc.
With MapReduce you can do that (Spark SQL will help you do that), but you can also do much more. A typical example is a word count app that counts the words in text files. Text files do not have any predefined structure that you can query using SQL. Keep in mind that this kind of application is usually coded using Spark core (i.e. RDDs) instead of Spark SQL, since Spark SQL also needs a structure.
Another, maybe more realistic, use case is processing large amounts of log files using MapReduce (again, log files do not have a relational structure such as the one required by SQL).
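To make the word count example concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It runs on a single machine and only imitates the structure of a MapReduce job; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration and are not part of Hadoop or Spark.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    print(word_count(sample))
```

Note that the input is just lines of text: no columns or types are needed, which is the point being made about unstructured data.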
SQL and MapReduce each have their own advantages. Data analysts do not need to learn how to write a MapReduce program. From a developer's perspective, writing a MapReduce program leaves more room for tuning, such as adding a random prefix to the keys of skewed data.
And in the long run, as SQL interpreters keep improving, prefer SQL over MapReduce/Spark RDD.
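The "random prefix" trick for skewed data can be sketched roughly as a two-stage aggregation. The code below is plain Python rather than a real MapReduce or Spark job, and `salted_count`, `num_salts`, and `seed` are hypothetical names chosen for this illustration.

```python
import random
from collections import Counter


def salted_count(keys, num_salts=4, seed=0):
    """Count keys in two stages so that one hot key is split across several groups."""
    rng = random.Random(seed)

    # Stage 1: prefix every key with a random salt. A hot key now lands in up to
    # num_salts different groups instead of a single overloaded one.
    partial = Counter((rng.randrange(num_salts), key) for key in keys)

    # Stage 2: strip the salt and merge the partial counts per original key.
    final = Counter()
    for (_salt, key), count in partial.items():
        final[key] += count
    return partial, final


if __name__ == "__main__":
    keys = ["hot"] * 1000 + ["cold"] * 3
    partial, final = salted_count(keys)
    print(len(partial), dict(final))
```

In a distributed job each salted group would go to a different reducer, so the first stage spreads the heavy key's work and the second stage only has to merge a handful of partial results.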
I have terabytes of data stored in Parquet format for an analytics use case. There are multiple big tables that need to be joined, and the queries are heavy. The system is expected to be highly scalable. I am currently evaluating Spark SQL, Hive and Presto SQL. In theory, all of them seem to meet the requirements. Could you please shed some light on the differences and on what should be considered for this use case? Tableau will be used for visualization on top of this.
I have been using Spark for some time now with success in Python however we have a product written in C# that would greatly benefit from distributed and parallel execution. I did some research and tried out the new C# API for Spark but this is a little restrictive at the moment.
With regard to Ignite, on the surface it seems like a decent alternative. It's got good .NET support, it has clustering ability and the ability to distribute compute across the grid.
However, I was wondering if it really can be used to replace Spark in our use case - what we need is a distributed way in which to perform data frame type operations. In particular a lot of our code in Python was implemented using Pandas UDF and we let Spark worry about the data transfer and merging of results.
If I wanted to use Ignite, where our data is really more like a table (typically CSV sourced) rather than key/value based, is there an efficient way to represent that data across the grid and send computations to the cluster that execute on an arbitrary subset of the data in the same way Spark does, especially in the sense that the results of the calculations just become 1..n more columns in the dataframe without having to collect all the results back to the main program?
You can load your structured data (CSV) to Ignite using its SQL implementation:
https://apacheignite-sql.readme.io/docs/overview
This gives you distributed SQL queries over the data, with index support. Spark also lets you work with structured data using SQL, but it has no indexes. Indexes can significantly improve the performance of your SQL operations.
If you already have a working solution built on Spark DataFrames, you can keep the same logic and use Ignite's integration with Spark instead:
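As a rough illustration of why an index matters for a filtered query, the sketch below uses SQLite from the Python standard library as a stand-in. It is not Ignite and says nothing about Ignite's actual planner; the `trades` table, its columns, and the index name are invented for the example.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the textual detail of SQLite's EXPLAIN QUERY PLAN for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(row[-1]) for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER, symbol TEXT, price REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [(i, f"SYM{i % 100}", float(i)) for i in range(10_000)],
)

sql = "SELECT price FROM trades WHERE symbol = 'SYM7'"

# Without an index the engine has to scan every row of the table.
plan_without_index = query_plan(conn, sql)

# With an index on the filter column it can look up the matching rows directly.
conn.execute("CREATE INDEX idx_trades_symbol ON trades (symbol)")
plan_with_index = query_plan(conn, sql)

if __name__ == "__main__":
    print(plan_without_index)
    print(plan_with_index)
```

The same query goes from a full table scan to an index lookup, which is the kind of gain the answer is referring to.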
https://apacheignite-fs.readme.io/docs/ignite-data-frame
In this case, you can have all data stored in Ignite SQL tables and do SQL requests and other operations using Spark.
Here is an example of how to load CSV data into Ignite using Spark DataFrames and how it can be configured:
https://www.gridgain.com/resources/blog/how-debug-data-loading-spark-ignite
I want to know the difference between Hive and MapReduce,
and whether there is any comparison between them.
Does Hive also use MapReduce in some part?
Hive and MapReduce have completely different purposes; comparing them is like comparing apples and oranges.
MapReduce is a software framework for writing applications which process big amounts of data on large clusters in parallel.
Hive is a database for processing large datasets residing in a distributed file system using SQL. Hive on MapReduce translates SQL queries into a series of MapReduce jobs, while Hive on Tez runs them as DAGs of tasks on the Tez execution engine.
MapReduce is a general-purpose framework (a set of libraries and tools); you can use it to write your own MapReduce application in Java, Python, Scala, R.
Hive, by contrast, is a SQL database: it has rich SQL and data warehousing features and a cost-based optimizer for building an optimal query plan.
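To give a feel for what "translating SQL into map and reduce steps" means, here is a hand-written sketch of one query expressed that way in plain Python. This is not how Hive's compiler actually generates its plans; the `ROWS` data and the `mapper`, `reducer`, and `run_query` names are hypothetical. The query being imitated is `SELECT dept, SUM(salary) FROM employees WHERE salary > 0 GROUP BY dept`.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of an "employees" table: (name, dept, salary)
ROWS = [
    ("ann", "eng", 120),
    ("bob", "eng", 100),
    ("cat", "ops", 90),
    ("dan", "ops", 0),
]


def mapper(row):
    """Apply the WHERE filter and emit (GROUP BY key, value to aggregate)."""
    _name, dept, salary = row
    if salary > 0:
        yield dept, salary


def reducer(dept, salaries):
    """Apply the aggregate function, here SUM(salary), to one group."""
    return dept, sum(salaries)


def run_query(rows):
    """Map every row, sort by key (the shuffle), then reduce each group."""
    emitted = [pair for row in rows for pair in mapper(row)]
    emitted.sort(key=itemgetter(0))
    return [
        reducer(dept, [salary for _dept, salary in group])
        for dept, group in groupby(emitted, key=itemgetter(0))
    ]


if __name__ == "__main__":
    print(run_query(ROWS))
```

The filter and projection end up in the map step, the `GROUP BY` column becomes the shuffle key, and the aggregate becomes the reduce step.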
Why is Presto faster than Spark SQL?
Besides, what is the difference between Presto and Spark SQL in computing architecture and memory management?
In general, it is hard to say whether Presto is definitely faster or slower than Spark SQL. It really depends on the type of query you're executing, the environment, and the engine tuning parameters. However, from what I see in the industry (Uber and Netflix, for example), Presto is used for ad hoc SQL analytics, whereas Spark is used for ETL/ML pipelines.
One possible explanation is that there is not much overhead in scheduling a query in Presto. The Presto coordinator is always up and waiting for queries. Spark, on the other hand, takes a lazy approach: it takes time for the driver to negotiate resources with the cluster manager, copy jars, and start processing.
Another is that the Presto architecture is quite straightforward. It has a coordinator that does SQL parsing, planning, and scheduling, and a set of workers that execute the physical plan.
Spark core, on the other hand, has many more layers in between. Besides the stages that Presto also has, Spark SQL has to cope with the resiliency built into RDDs and do resource management and negotiation for its jobs.
Please also note that Spark SQL has a cost-based optimizer that performs better on complex queries, while Presto (0.199) has a legacy rule-based optimizer. There is an ongoing effort to bring a CBO to Presto, which might potentially beat Spark SQL performance.
I think the key difference is that the architecture of Presto is very similar to an MPP SQL engine. That means it is highly optimized just for SQL query execution, whereas Spark is a general-purpose execution framework that is able to run many different workloads such as ETL, machine learning, etc.
In addition, one trade-off Presto makes to achieve lower latency for SQL queries is to not care about the mid-query fault tolerance. If one of the Presto worker nodes experiences a failure (say, shuts down) in most cases queries that are in progress will abort and need to be restarted. Spark on the other hand supports mid-query fault-tolerance and can recover from such a situation but in order to do that, it needs to do some extra bookkeeping and essentially "plan for failure". That overhead results in slower performance when your cluster does not experience any faults.
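The "extra bookkeeping" can be pictured with a toy model in which every derived partition remembers its parent and the function that produced it, so a lost result can be rebuilt. This is a single-process sketch, not Spark's RDD implementation; the `Partition` class and its methods are invented for illustration.

```python
class Partition:
    """One partition of a dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # raw input, only set on a root partition
        self.parent = parent        # partition this one was derived from
        self.transform = transform  # function applied to each parent element
        self.cached = None          # in-memory result, may be lost
        self.compute_calls = 0      # how many times the result had to be built

    def compute(self):
        """Return the data, rebuilding it from the lineage if it is not in memory."""
        if self.cached is None:
            self.compute_calls += 1
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = [self.transform(x) for x in self.parent.compute()]
        return self.cached

    def lose(self):
        """Simulate a worker failure that drops the in-memory result."""
        self.cached = None


if __name__ == "__main__":
    root = Partition(source=[1, 2, 3])
    doubled = Partition(parent=root, transform=lambda x: x * 2)
    print(doubled.compute())   # first computation
    doubled.lose()             # simulated failure
    print(doubled.compute())   # rebuilt from the recorded lineage
```

An engine without this bookkeeping has nothing to rebuild from, so its only option after a failure is to abort and rerun the query, which is the trade-off described above.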
Position:
Presto emphasizes querying, whereas Spark emphasizes computation.
Memory storage:
Both store and compute data in memory. Spark writes data to disk when it cannot get enough memory, whereas Presto runs into an OOM error.
Tasks, resources:
Spark submits tasks and requests resources in real time at each stage (this strategy can make processing slightly slower than in Presto); Presto requests all required resources and submits all tasks at once.
Data processing:
In Spark, data needs to be fully processed before it is passed to the next stage. Presto uses a batch (page) pipeline processing mode: as soon as a page is finished, it can be sent to the next task (this approach greatly reduces the end-to-end response time of various queries).
Data fault tolerance:
If Spark fails or loses data, the data is recomputed based on its lineage. In Presto, the query fails.
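The stage-at-a-time versus page-pipeline contrast from the "Data processing" point can be sketched in plain Python. This only imitates the ordering of work as described above and is not how either engine is implemented; `staged`, `pipelined`, `page_size`, and the `log` list are hypothetical names used for illustration.

```python
def staged(rows, log):
    """Stage-at-a-time: the scan finishes for all rows before the filter starts."""
    scanned = []
    for row in rows:
        log.append(("scan", row))
        scanned.append(row)

    result = []
    for row in scanned:
        log.append(("filter", row))
        if row % 2 == 0:
            result.append(row)
    return result


def pipelined(rows, log, page_size=2):
    """Page-at-a-time: each page is handed downstream as soon as it is produced."""

    def scan():
        for start in range(0, len(rows), page_size):
            page = rows[start:start + page_size]
            log.append(("scan", tuple(page)))
            yield page

    def keep_even(pages):
        for page in pages:
            log.append(("filter", tuple(page)))
            yield [row for row in page if row % 2 == 0]

    return [row for page in keep_even(scan()) for row in page]


if __name__ == "__main__":
    staged_log, pipelined_log = [], []
    print(staged([1, 2, 3, 4], staged_log), [op for op, _ in staged_log])
    print(pipelined([1, 2, 3, 4], pipelined_log), [op for op, _ in pipelined_log])
```

Both functions return the same rows, but in the pipelined version the first filtered page is available after only one page has been scanned, which is where the lower end-to-end latency comes from.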