I want to know the difference between hive and map reduce
And if there any comparision between them.
Does hive also show some part of map reduce
Hive and MapReduce have completely different purpose, they are like oranges and apples.
MapReduce is a software framework for writing applications which process big amounts of data on large clusters in parallel.
Hive is a database for processing large datasets residing in the distributed file system using SQL. Hive on Tez and Hive on MapReduce translates SQL queries into series of mapReduce jobs (Tez execution engine uses DAGs).
MapReduce is general purpose purpose framework (a set of libraries and tools), you can use it to write your own MapReduce application in Java, Python, Scala, R.
And Hive is SQL database, it has reach SQL and data warehousing features and cost-based optimizer for building optimal query plan.
Related
I try to figure out how and where the execution happens with sparkSQL
query Hive Metastore.
SparkSQl run's on the "spark" engine as we already know but how does interact with hive metastore in these scenarios (internal and external tables):
SparkSQl uses "spark" distributed SQL engine but hive uses "map-reduce or tez" engine, does spark copy data into memory and run query or does use thrift server to just interact with Hive metastore ? reference Distributed SQL engine, I try to understand where does happens computation and if can be optimized in a way to reduce execution time.
I do know result is a dataframe so that is an overhead so understanding this execution and best practices maybe would have better idea where I can change execution behaviour.
SparkSQl uses "spark" distributed SQL engine but hive uses "spark" distributed engine, does spark use thrift server to interact with Hive metastore, does happens any copy and where does execution of the query happens?
Note: All this questions is due the fact that direct hive query using beeline or hue is fast but in SparkSQL is very slow. I do know inferring schema and copy results from query converting into a dataframe takes place but where all this execution happens and how to avoid bottlenecks, hope not to get answer to just read directly parquet files into spark and go from there, my main point is interacting with HIVE metastore (internal and external tables).
On top of all: converting HIVEQL queries to Pyspark directly does improve performance but my question is not related to this approach ...
It seems like all queries expressed in SQL can be converted into MapReduce jobs. This is in essence what Spark SQL does. SparkSQL takes in SQL, converts it to a MapReduce job then executes the MapReduce job on Spark's runtime.
All questions which can be answered by SQL can be answered by MapReduce jobs. Can all MapReduce jobs also be written as SQL (maybe with custom user defined functions)? When does it make sense to use MapReduce over SQL or vice versa?
SQL is useful when you have structured data (e.g. tables, with clearly defined columns and, usually, data types). Using SQL with that structure you can select columns, join them, etc.
With MapReduce you can do that (Spark SQL will help you do that) but you can also do much more. A typical example is a word count app that counts the words in text files. Text files do not have any predefined structure that you can use to query them using SQL. Take into account that kind of applications are usually coded using Spark core (i.e. RDD) instead of Spark SQL, since Spark SQL needs also a structure.
Another maybe more real use case is processing large amounts of log files using MapReduce (again, log files does not have a relational structure such as the one required by SQL).
SQL and MapReduce also have their own advantages. To data analysist , they don't need to learn how to write the MapReduce Program. And from the perspective of developer, writing MapReduce program leave enough room to tuning the program ,like add random prefix to skewing data.
And In the long run, with the development of SQL interpreters, use SQL over MapReduce/Spark RDD.
We have large volumes (10 to 400 billion) of raw data in BigQuery tables. We have a requirement to process this data to convert and create the data in the form of star schema tables (probably a different dataset in bigquery) which can then be accessed by atscale.
Need pros and cons between two options below:
1. Write complex SQL within BigQuery which reads data form source dataset and then loads to target dataset (used by Atscale).
2. Use PySpark or MapReduce with BigQuery connectors from Dataproc and then load the data to BigQuery target dataset.
The complexity of our transformations involve joining multiple tables at different granularity, using analytics functions to get the required information, etc.
Presently this logic is implemented in vertica using multiple temp tables for faster processing and we want to re-write this processing logic in GCP (Big Query or Data Proc)
I went successfully with option 1: Big Query is very capable to run the very complex transformation with SQL, on top of that you can also run them incrementally with time range decorators. Note that it takes a lot of time and resources to take data back and forth to BigQuery. When running BigQuery SQL data never leaves BigQuery in the first place and you already have all raw logs there. So as long your problem can be solved by a series of SQL I believe this is the best way to go.
We moved out Vertica reporting cluster, rewriting successfully ETL last year, with option 1.
Around a year ago, I've written POC comparing DataFlow and series of BigQuery SQL jobs orchestrated by potens.io workflow allowing SQL parallelization at scale.
I took a good month to write DataFlow in Java with 200+ data points and complex transformation with terrible debugging capability at a time.
And a week to do the same using a series of SQL with potens.io utilizing
Cloud Function for Windowed Tables and parallelization with clustering transient tables.
I know there's been bunch improvement in CloudDataFlow since then, but at a time
the DataFlow did fine only at a million scale and never-completed at billions record input (main reason shuffle cardinality went little under billions of records, with each records having 200+ columns). And the SQL approach produced all required aggregation under 2 hours for a dozen billion. Debugging and easiest of troubleshooting with potens.io helped a lot too.
Both BigQuery and DataProc can handle huge amounts of complex data.
I think that you should consider two points:
Which transformation would you like to do in your data?
Both tools can make complex transformations but you have to consider that PySpark will provide you a full programming language processing capability while BigQuery will provide you SQL transformations and some scripting structures. If only SQL and simple scripting structures can handle your problem, BigQuery is an option. If you need some complex scripts to transform your data or if you think you'll need to build some extra features involving transformations in the future, PySpark may be a better option. You can find the BigQuery scripting reference here
Pricing
BigQuery and DataProc have different pricing systems. While in BigQuery you'd need to concern about how much data you would process in your queries, in DataProc you have to concern about your cluster's size and VM's configuration, how much time your cluster would be running and some other configurations. You can find the pricing reference for BigQuery here and for DataProc here. Also, you can simulate the pricing in the Google Cloud Platform Pricing Calculator
I suggest that you create a simple POC for your project in both tools to see which one has the best cost benefit for you.
I hope these information help you.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
Why is Presto faster than Spark SQL?
Besides what is the difference between Presto and Spark SQL in computing architectures and memory management?
In general, it is hard to say if Presto is definitely faster or slower than Spark SQL. It really depends on the type of query you’re executing, environment and engine tuning parameters. However, what I see in the industry(Uber, Neflix examples) Presto is used as ad-hock SQL analytics whereas Spark for ETL/ML pipelines.
One possible explanation, there is no much overhead for scheduling a query for Presto. Presto coordinator is always up and waits for query. On the other hand, Spark is doing lazy approach. It takes time for the driver to negotiate with the cluster manager the resources, copy jars and start processing.
Another one that Presto architecture quite straightforward. It has a coordinator that does SQL parsing, planning, scheduling and a set of workers that execute a physical plan.
On the other hand, Spark core has much more layers in between. Besides stages that Presto has, Spark SQL has to cope with a resiliency build into RDD, do resource management and negotiation for the jobs.
Please also note that Spark SQL has Cost-Based-Optimizer that performs better on complex queries. While Presto(0.199) has a legacy ruled based optimizer. There is ongoing effort to bring CBO to Presto which might potentially beat Spark SQL performance.
I think the key difference is that the architecture of Presto is very similar to an MPP SQL engine. That means is highly optimized just for SQL query execution vs Spark being a general purpose execution framework that is able to run multiple different workloads such as ETL, Machine Learning etc.
In addition, one trade-off Presto makes to achieve lower latency for SQL queries is to not care about the mid-query fault tolerance. If one of the Presto worker nodes experiences a failure (say, shuts down) in most cases queries that are in progress will abort and need to be restarted. Spark on the other hand supports mid-query fault-tolerance and can recover from such a situation but in order to do that, it needs to do some extra bookkeeping and essentially "plan for failure". That overhead results in slower performance when your cluster does not experience any faults.
Position:
Presto emphasis on query, however spark emphasis on calculation.
Memory storage:
Both are memory store and calculations, spark will write the data to disk when it cannot get enough memory, but presto lead to OOM.
Tasks, resources:
The spark commits tasks and applies for resources in real time at each stages(this strategy can result in a slightly slower processing speed compared to presto); Presto applies for all required resources and commits all tasks once.
Data processing:
In spark, data needs to be fully processed before passing to the next stage. Presto is a batch (page) pipeline processing mode.. As long as the page is finished, it can be sent to the next task(This approach greatly reduces the end-to-end response time of various queries).
Data fault tolerance:
If spark fails or loses data, it will be recalculated based on kinship. But presto will result in query failure.
I am trying to profile a pig query but haven't got any thing useful so far.
I am trying to measure CPU, disk I/O, RAM usage.
Can anyone guide me on this ?
Things tried so far
Starfish - Works with Hadoop job but NOT with Pig
- Does not support pig query
Hprof - Works with Hadoop job but NOT with Pig query.
- Generates profile file only for Hadoop job
Both Hadoop and pig jobs are executed in the same cluster.
Thanks for reading !!
You could get some latency data using JXInsight/Opus (which is free) and marking or tagging the cluster before executing the query and then taking a snapshot following completion of the job.
http://www.jinspired.com/site/jxinsight-opus-1-0
We will be coming out with JXInsight/Opus for X editions for various big data platforms including Cassandra, Hadoop, Pig,....
If you need more power and more meters (cpu, io,...) you can then always look at the JXInsight/OpenCore product.
You can use hprof or other profiling tools on the MR job generated by pig . See https://cwiki.apache.org/confluence/display/PIG/HowToProfile