Hive accepts SQL-like queries. Under the hood, how is a query executed? Is it the same as how an RDBMS executes a SQL query?
Hive query processing has significant similarities to, and differences from, query processing in a standard RDBMS.
Some key similarities:
Support for a SQL grammar. Though not full ANSI SQL-92, it covers a fair subset.
Query Parser
Query Optimizer
Execution planner
Some key differences:
Support for HDFS loading and features
Hive specific functions such as explode, regexp_*, split
Accepting and processing Hadoop/cluster configuration and tuning parameters
Managing input/output formats such as those for HDFS, S3, Avro, etc.
Creation of a DAG of Map/Reduce stages/jobs
Coordination with the JobTracker to manage the Map/Reduce job lifecycle: submission and monitoring
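To make the DAG point concrete: a query like `SELECT dt, count(*) FROM orders GROUP BY dt` compiles into a map stage that emits one key per row and a reduce stage that aggregates per key. A minimal plain-Python sketch of those two stages (the table and data are made up for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Illustrative input rows for: SELECT dt, count(*) FROM orders GROUP BY dt
rows = [{"dt": "2020-01-01"}, {"dt": "2020-01-02"}, {"dt": "2020-01-01"}]

# Map stage: emit (group key, 1) for every input row.
mapped = [(row["dt"], 1) for row in rows]

# Shuffle: the framework sorts/groups intermediate pairs by key.
mapped.sort(key=itemgetter(0))

# Reduce stage: sum the counts for each key.
result = {k: sum(v for _, v in grp) for k, grp in groupby(mapped, key=itemgetter(0))}
print(result)  # {'2020-01-01': 2, '2020-01-02': 1}
```

A real Hive plan may chain several such stages (and use combiners), but each stage has this map/shuffle/reduce shape.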
I'm trying to compare the performance of SELECT vs. CTAS.
The reason CTAS is faster for bigger data is because of the data format and its ability to write query results in a distributed manner into multiple Parquet files.
All Athena query results are written to S3 and then read from there (I may be wrong). Is there a way to write the query result of a regular SELECT into a single file, i.e. without bucketing or partitioning?
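For reference, an Athena CTAS statement that writes Parquet looks roughly like the one assembled below (the table name, bucket, and query are hypothetical; this sketch only builds the statement text rather than submitting it):

```python
# Sketch: assembling an Athena CTAS statement (names and bucket are hypothetical).
# CTAS lets Athena write the result set in parallel as multiple Parquet files,
# which is the behaviour described above.
table = "orders_summary"
location = "s3://my-bucket/ctas-output/"  # hypothetical bucket
select = "SELECT dt, count(*) AS n FROM orders GROUP BY dt"

ctas = (
    f"CREATE TABLE {table}\n"
    f"WITH (format = 'PARQUET', external_location = '{location}')\n"
    f"AS {select}"
)
print(ctas)
```

One commonly cited workaround to force a single output file is bucketing with `bucket_count = 1`, but that conflicts with the "no bucketing" constraint in the question above.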
What is the command to execute DML statements like INSERT, UPDATE, and DELETE in Google BigQuery?
I tried using bq query "select query", but it works only for SELECT statements.
Note that BigQuery really excels at being a secondary database used for performing fast analytical queries on big data that is static, such as recorded data analysis, logs, and audit history.
If you instead require regular data updates, it is highly recommended to use a separate master database such as the Datastore to perform fast entity operations and updates. You would then persist your data from your master database to your secondary BigQuery database for further analysis.
To access the Data Manipulation Language (DML) functionality, you must tell the bq command line to use full standard SQL by passing --use_legacy_sql=false, instead of BigQuery's original default legacy SQL.
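A sketch of what that invocation looks like (the dataset, table, and statement are hypothetical). The command is assembled rather than executed here, since actually running it requires an authenticated bq install:

```python
import shlex

# Sketch: a bq invocation for a DML statement (dataset/table hypothetical).
# --use_legacy_sql=false switches the CLI to standard SQL, which DML requires.
dml = "UPDATE mydataset.mytable SET status = 'done' WHERE id = 1"
cmd = ["bq", "query", "--use_legacy_sql=false", dml]
print(shlex.join(cmd))
```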
I am new to Spark, and I am trying to access a Hive table from Spark:
1) Created a HiveContext from the SparkContext
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
val hivetable= hc.sql("Select * from test_db.Table")
My question is: I got the table into Spark.
1) Why do we need to register the table?
2) We can perform SQL operations directly, so why do we need DataFrame functions
like join, select, filter, etc.?
What makes the difference between SQL queries and DataFrame operations?
3) What is Spark optimization? How does it work?
You don't need to register a temporary table if you are accessing a Hive table using Spark's HiveContext. Registering a DataFrame as a temporary table allows you to run SQL queries over its data. Suppose you are reading data from a file at some location and you want to run SQL queries over it. You would then create a DataFrame from the Row RDD, register a temporary table over that DataFrame, and use Spark's SQLContext in your code to perform SQL queries over the data.
Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to the personal preferences of the developer.
Arguably, DataFrame queries are much easier to construct programmatically and
provide a minimal level of type safety.
Plain SQL queries can be significantly more concise and easier to understand.
They are also portable and can be used without any modification in every supported language. With HiveContext, they can also be used to expose functionality that can be inaccessible in other ways (for example, UDFs without Spark wrappers).
Reference: Spark sql queries vs dataframe functions
Here is a good reference on the performance comparison between Spark RDDs vs DataFrames vs SparkSQL.
Apparently I don't have an answer for that one, so I'll leave it to you to do some research on the net and find a solution :)
It is my understanding that Spark SQL reads HDFS files directly - no need for M/R here. Specifically, none of the Map/Reduce-based Hadoop InputFormats/OutputFormats are employed (except in special cases like HBase).
So then, are there any built-in dependencies on a functioning Hive server? Or is it only required to have
a) Spark Standalone
b) HDFS and
c) Hive metastore server running
i.e. YARN/MRv1 are not required?
The Hadoop-related I/O formats for accessing Hive files seem to include:
TextInput/Output Format
ParquetFileInput/Output Format
Can Spark SQL/Catalyst read Hive tables stored in those formats with only the Hive metastore server running?
Yes.
The Spark SQL Readme says:
Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
This is implemented by depending on Hive libraries for reading the data. But the processing happens inside Spark. So no need for MapReduce or YARN.
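Concretely, the only Hive-side service Spark needs is the metastore. One common way to point Spark at it is a hive-site.xml on Spark's classpath containing just the metastore URI (the host name below is hypothetical; 9083 is the metastore's default port):

```xml
<!-- hive-site.xml visible to Spark: only the metastore URI is required -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```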
I've run Hive on elastic mapreduce in interactive mode:
./elastic-mapreduce --create --hive-interactive
and in script mode:
./elastic-mapreduce --create --hive-script --arg s3://mybucket/myfile.q
I'd like to have an application (preferably in PHP, R, or Python) on my own server be able to spin up an elastic mapreduce cluster and run several Hive commands while getting their output in a parsable form.
I know that spinning up a cluster can take some time, so my application might have to do that in a separate step and wait for the cluster to become ready. But is there any way to do something like the following (a somewhat concrete hypothetical example)?
create Hive table customer_orders
run Hive query "SELECT dt, count(*) FROM customer_orders GROUP BY dt"
wait for result
parse result in PHP
run Hive query "SELECT MAX(id) FROM customer_orders"
wait for result
parse result in PHP
...
Does anyone have any recommendations on how I might do this?
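In Python, this flow can be scripted with boto3 (the elastic-mapreduce CLI used above predates it): spin up the cluster once, then submit each Hive query as a step and poll until it finishes. The API call itself needs AWS credentials, so it is left commented out; the step definition below is the shape boto3's EMR client expects (the cluster id, bucket, and file names are hypothetical):

```python
# Sketch: submitting a Hive query file as an EMR step with boto3
# (cluster id, bucket, and file names are hypothetical).
step = {
    "Name": "customer_orders report",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",  # EMR 4.x+ utility for running commands
        "Args": ["hive", "-f", "s3://mybucket/myfile.q"],
    },
}

# Requires AWS credentials, so shown but not executed here:
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
print(step["HadoopJarStep"]["Args"])
```

To get results back in parsable form, a common pattern is to have each query `INSERT OVERWRITE DIRECTORY` to an S3 path, then read that path from your application once the step completes.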
You may use mrjob. It lets you write MapReduce jobs in Python 2.5+ and run them on several platforms.
An alternative is HiPy, an awesome project which should perhaps be enough for all your needs. The purpose of HiPy is to support programmatic construction of Hive queries in Python and easier management of queries, including queries with transform scripts.
HiPy enables grouping query construction, transform scripts and
post-processing together in a single script. This assists in
traceability, documentation and re-usability of scripts. Everything
appears in one place and Python comments can be used to document the
script.
Hive queries are constructed by composing a handful of Python objects,
representing things such as Columns, Tables and Select statements.
During this process, HiPy keeps track of the schema of the resulting
query output.
Transform scripts can be included in the main body of the Python
script. HiPy will take care of providing the code of the script to
Hive as well as of serialization and de-serialization of data to/from
Python data types. If any of the data columns contain JSON, HiPy takes
care of converting that to/from Python data types too.
Check out the Documentation for details!
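This is not HiPy's actual API, but the general idea of composing a query from Python objects representing tables and select statements can be sketched in a few lines:

```python
# Sketch of programmatic query construction (generic illustration,
# NOT HiPy's real API; table and column names are made up).
class Table:
    def __init__(self, name, *columns):
        self.name, self.columns = name, columns

def select(table, columns, where=None):
    sql = f"SELECT {', '.join(columns)} FROM {table.name}"
    return sql + (f" WHERE {where}" if where else "")

orders = Table("customer_orders", "id", "dt")
q = select(orders, ["dt", "count(*)"], where="dt >= '2020-01-01'")
print(q)  # SELECT dt, count(*) FROM customer_orders WHERE dt >= '2020-01-01'
```

Because the query is an ordinary Python value, it can be parameterized, reused, and documented with normal Python tooling, which is the benefit the HiPy description above is pointing at.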