Where to create a staging data table in a Big Data environment?

I currently have Hadoop 2, Pig, Hive, and HBase.
I have input data that I have loaded into HDFS.
I want to create staging data in this environment.
My question is:
In which Big Data component (Pig, Hive, or HBase) should I create the staging table? Data will arrive in it based on a condition, and later we might want to run MapReduce jobs with complex logic on it.
Please assist.

Hive: if you have an OLAP-style workload and don't need real-time reads/writes.
HBase: if you have an OLTP-style workload and need real-time or streaming reads/writes. Some batch or OLAP processing can be done with MapReduce, and SQL-like querying is possible with Apache Phoenix.
You can run MapReduce jobs on both Hive and HBase.

Anywhere you want. Pig is not an option, as it does not have a metastore. Use Hive if you want SQL-like queries; use HBase depending on your access patterns.
When you run a Hive query on top of the data, it is converted into MapReduce jobs.
If you create the table in Hive, use Hive queries and not MR. If you are writing MR jobs, then use Pig; you will not benefit from creating a Hive table on top of the data.

Related

How to createOrReplaceTempView in Delta Lake?

I want to use Delta Lake tables in my Hive Metastore on Azure Data Lake Gen2 as basis for my company's lakehouse.
Previously, I used "regular" Hive catalog tables. I would load data from Parquet into a Spark DataFrame and create a temp view using df.createOrReplaceTempView("TableName"), so that I could use spark.sql or the %%sql magic on TableName to do ETL. When I was done, I would write my tables to the Hive Metastore.
However, what if I don't want to perform this saveAsTable operation and write to my Data Lake? What would be the best way to perform ETL with SQL?
I know I can persist Delta tables in the Hive Metastore in a multitude of ways, for instance by creating a managed catalog table through df.write.format("delta").saveAsTable("LakeHouseDB.TableName").
I also know that I can create a DeltaTable object through DeltaTable(spark, table_path_data_lake), but then I can only use the Python API and not SQL.
Does there exist some equivalent of createOrReplaceTempView(), or is there a better way to achieve ETL with SQL without 'writing' to the data lake first?
Regarding "what if I don't want to perform this saveAsTable operation and write to my Data Lake?":
That is not possible with Delta Lake, since it relies heavily on a transaction log (_delta_log) stored under the data directory of a Delta table.
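For data that already exists as a Delta table at some path, SQL access does not require saveAsTable: the path can be read into a DataFrame and registered as a temp view, which itself writes nothing. The following is a minimal sketch under that assumption. The helper name register_delta_view and the example path are hypothetical, and the spark argument is assumed to be a SparkSession with Delta Lake support configured.

```python
def register_delta_view(spark, delta_path, view_name):
    """Expose an existing Delta table directory as a session-scoped temp view.

    spark:      an active SparkSession with Delta Lake support (assumed).
    delta_path: directory that already contains the data and its _delta_log.
    view_name:  name to use in later spark.sql / %%sql queries.
    """
    # Reading goes through the transaction log, so the path must already
    # hold a Delta table; nothing is written by this function.
    df = spark.read.format("delta").load(delta_path)
    df.createOrReplaceTempView(view_name)
    return df


# Hypothetical usage (path and table name are placeholders):
# register_delta_view(spark, "abfss://container@account.dfs.core.windows.net/tables/TableName", "TableName")
# spark.sql("SELECT COUNT(*) FROM TableName").show()
```

The view lives only for the current Spark session, so results of the ETL still need to be written somewhere if they are to persist.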

Hbase table export to Hive

Hello :) I am preparing to move all the data of one HBase table to Hive. The table is very large (500 terabytes).
From my search, there is HBase export, but it only supports moving data between HBase and HBase (the files dropped in HDFS are not plain text, so Hive cannot read them directly).
Also, Hive's HBase handler cannot be used, because HBase is on a remote cluster and there are various security policies.
It would be nice if INSERT INTO syntax were supported as it is for Hive to Hive, but I am looking for another way. Is there a good way to separate each column of the HBase table by commas and drop the result into HDFS?
You can try the ExportSnapshot tool to move data from HBase to HDFS on another cluster, e.g.:
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://yourserver:8020/hbase_root_dir -mappers 16
Check this out for more details.
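If the export needs to be scripted, the command above can be assembled programmatically. This is only a sketch: the function name build_export_snapshot_cmd is hypothetical, it uses just the three flags shown in the command (-snapshot, -copy-to, -mappers), and it assumes the snapshot already exists and that the hbase CLI is on the PATH of the host that eventually runs it.

```python
import shlex


def build_export_snapshot_cmd(snapshot, copy_to, mappers=16):
    """Return the argument list for HBase's ExportSnapshot tool."""
    if mappers < 1:
        raise ValueError("mappers must be a positive integer")
    return [
        "hbase",
        "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", copy_to,
        "-mappers", str(mappers),
    ]


if __name__ == "__main__":
    cmd = build_export_snapshot_cmd(
        "MySnapshot", "hdfs://yourserver:8020/hbase_root_dir"
    )
    # Print the command for review. To actually run it, pass `cmd` to
    # subprocess.run(cmd, check=True) on a host with the hbase CLI installed.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if the snapshot name or target URI ever contains special characters.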

Can Spark SQL be executed against Hive tables without any Map/Reduce (/Yarn) running?

It is my understanding that Spark SQL reads HDFS files directly, with no need for M/R. Specifically, none of the Map/Reduce-based Hadoop Input/OutputFormats are employed (except in special cases like HBase).
So are there any built-in dependencies on a functioning Hive server? Or is it only required to have
a) Spark Standalone
b) HDFS and
c) Hive metastore server running
i.e. YARN/MRv1 are not required?
The Hadoop-related I/O formats for accessing Hive files seem to include:
TextInput/Output Format
ParquetFileInput/Output Format
Can Spark SQL/Catalyst read Hive tables stored in those formats with only the Hive Metastore server running?
Yes.
The Spark SQL Readme says:
Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
This is implemented by depending on the Hive libraries for reading the data, but the processing happens inside Spark, so there is no need for MapReduce or YARN.
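As an illustration of that setup, here is a minimal sketch of pointing Spark at a Hive Metastore while running on a Standalone master. Note the swap: it uses the newer SparkSession builder API (Spark 2.0+) rather than the HiveContext named in the README quote. The function name and both example URLs are hypothetical; the builder argument is expected to be pyspark.sql.SparkSession.builder.

```python
def build_hive_enabled_session(builder, master_url, metastore_uri):
    """Configure a SparkSession that reads table metadata from a Hive Metastore.

    builder:       normally pyspark.sql.SparkSession.builder
    master_url:    Spark Standalone master, e.g. "spark://master-host:7077"
    metastore_uri: Metastore Thrift endpoint, e.g. "thrift://metastore-host:9083"
    (both example values are placeholders, not real hosts)
    """
    return (
        builder
        .master(master_url)                            # Standalone master, no YARN
        .config("hive.metastore.uris", metastore_uri)  # where table metadata lives
        .enableHiveSupport()                           # use Hive SerDes and catalog
        .getOrCreate()
    )


# Hypothetical usage:
# from pyspark.sql import SparkSession
# spark = build_hive_enabled_session(
#     SparkSession.builder, "spark://master-host:7077", "thrift://metastore-host:9083"
# )
# spark.sql("SHOW TABLES").show()
```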

Datawarehouse in Hive

I have a requirement to build a data warehouse in Hive and use HBase to serve real-time access,
so I would like to know what the architecture for this would be.
Can I first load the data into HBase, access it through a REST service, create an external table in Hive, and run Hive queries on it?
Will Hive be distributed, i.e. do I need to install Hive on all nodes of my cluster, or will it be central?
In answer to your questions:
Hive will be distributed.
For best performance, I would consider installing Hive on every node of the cluster. Hive translates HiveQL into MapReduce jobs - the jobs will be performed where the data is. If that's not possible, the data will have to move to the job. For the sake of response time, you'll want Hive on every node.
To create a Hive table that references data stored in HBase, you can check out the Hive - HBase Integration wiki. Here's a quick example:
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
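In the statement above, the entries in "hbase.columns.mapping" line up positionally with the Hive columns (key with :key, value with cf1:val). The sketch below makes that pairing explicit by generating the same DDL from a list of columns. The helper name hbase_backed_table_ddl is hypothetical, and requiring exactly one :key entry is a simplifying assumption of this sketch.

```python
def hbase_backed_table_ddl(hive_table, hbase_table, columns):
    """Build a CREATE TABLE statement for an HBase-backed Hive table.

    columns: ordered list of (hive_name, hive_type, hbase_mapping) tuples.
    The i-th mapping entry describes the i-th Hive column.
    """
    mappings = [mapping for _, _, mapping in columns]
    if mappings.count(":key") != 1:
        raise ValueError("exactly one column must map to ':key'")
    col_defs = ", ".join(f"{name} {typ}" for name, typ, _ in columns)
    mapping_str = ",".join(mappings)
    return (
        f"CREATE TABLE {hive_table}({col_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f'WITH SERDEPROPERTIES ("hbase.columns.mapping" = "{mapping_str}")\n'
        f'TBLPROPERTIES ("hbase.table.name" = "{hbase_table}");'
    )


if __name__ == "__main__":
    # Reproduces the example statement from the answer.
    print(hbase_backed_table_ddl(
        "hbase_table_1",
        "xyz",
        [("key", "int", ":key"), ("value", "string", "cf1:val")],
    ))
```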

Using Hive with Pig

My Hive query has multiple outer joins and takes very long to execute. I was wondering if it would make sense to break it into multiple smaller queries and use Pig to do the transformations.
Is there a way I could query Hive tables or read Hive table data within a Pig script?
Thanks
The goal of the Howl project is to allow Pig and Hive to share a single metadata repository. Once Howl is mature, you'll be able to run PigLatin and HiveQL queries over the same tables. For now, you can try to work with the data as it is stored in HDFS.
Note that Howl has been renamed to HCatalog.