I want to insert JSON into a Hive database.
I am trying to transform the JSON to SQL using the ConvertJSONToSQL NiFi processor. How can I include the PARTITION (....) clause in my query?
Can I do this, or should I use the ReplaceText processor to build the query?
What version of Hive are you using? There are Hive 1.2 and Hive 3 versions of PutHiveStreaming and PutHive3Streaming (respectively) that let you put the data directly into Hive without having to issue HiveQL statements. For external Hive tables in ORC format, there are also ConvertAvroToORC (for Hive 1.2) and PutORC (for Hive 3) processors.
Assuming those don't work for your use case, you may also consider ConvertRecord with a FreeFormTextRecordSetWriter that generates the HiveQL, including the PARTITION clause. That gives a lot more flexibility than trying to patch a SQL statement to turn it into HiveQL for a partitioned table.
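As a rough sketch (the table name, columns, and partition key here are hypothetical), the writer's Text property could hold a template like the following, where ${dt}, ${id}, and ${name} are record field references that NiFi fills in per record; the generated statements can then be routed to a PutHiveQL processor:

-- emitted once per record; ${...} are record field references, not literals
INSERT INTO my_table PARTITION (dt='${dt}')
VALUES ('${id}', '${name}')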
EDIT: I forgot to mention that the Hive 3 NAR/components are not included with the NiFi release due to space reasons. You can find the Hive 3 NAR for NiFi 1.11.4 here.
Related
I have a flow in NiFi in which I use the ExecuteSQL processor to get a merge of sub-partitions named dt from a Hive table. For example: my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1 and dt=1000 under sikid=2.
What I ran was: select * from my_table where dt=1000.
Unfortunately, what I got back from the ExecuteSQL processor is corrupted data, including rows with dt=NULL, while the original table does not contain a single row with dt=NULL.
The DBCPConnectionPool is configured to use the HiveJDBC4 jar.
Later I tried using the jar matching the CDH release, which didn't fix it either.
The ExecuteSQL processor is configured as such:
Normalize Table/Column Names: true
Use Avro Logical Types: false
Hive version: 1.1.0
CDH: 5.7.1
Any ideas what's happening? Thanks!
EDIT:
Apparently the returned data includes extra rows, a few thousand of them, which is quite weird.
Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place where there was an incompatibility with the Apache Hive JDBC driver: it didn't support getTableName(), so it doesn't work with ExecuteSQL, and even if it did, when the column names are retrieved from the ResultSetMetaData, they have the table names prepended with a period (.) separator. This is some of the custom code in HiveJdbcCommon (used by SelectHiveQL) versus JdbcCommon (used by ExecuteSQL).
If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify the auth information on the JDBC URL rather than in a hive-site.xml file, for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I'll do my best to help out on that front and get you past this.
Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.
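For reference, the property can be set per session before running the query (it can also go in hive-site.xml to apply globally):

-- work around the extra/corrupted rows by changing the result file format
SET hive.query.result.fileformat=SequenceFile;
SELECT * FROM my_table WHERE dt=1000;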
I currently have Hadoop 2, Pig, Hive, and HBase.
I have input data, which I have loaded into HDFS.
I want to create staging data in this environment.
My question is:
In which big data component (Pig/Hive/HBase) should I create the staging table that will receive data based on a condition? Later, we might want to run MapReduce jobs with complex logic on it.
Please assist.
Hive: if you have an OLAP kind of workload and don't need realtime read/write.
HBase: if you have an OLTP kind of workload and need realtime/streaming read/write. Some batch or OLAP processing can be done using MapReduce, and SQL-like querying is possible using Apache Phoenix (see the sketch below).
You can run MapReduce jobs on both Hive and HBase.
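To illustrate the Phoenix option (the table and columns here are hypothetical), it gives you SQL directly over HBase from the sqlline client:

-- Phoenix table backed by HBase; UPSERT is Phoenix's insert/update statement
CREATE TABLE IF NOT EXISTS events (
  id BIGINT NOT NULL PRIMARY KEY,
  event_type VARCHAR,
  payload VARCHAR
);
UPSERT INTO events (id, event_type, payload) VALUES (1, 'purchase', '{}');
SELECT * FROM events WHERE event_type = 'purchase';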
Anywhere you want. Pig is not an option, as it does not have a metastore. Use Hive if you want SQL-like queries (see the sketch after this answer), or HBase depending on your access patterns.
When you run a Hive query on top of the data, it is converted into MR.
If you create the table in Hive, use Hive queries rather than hand-written MR; if you intend to write MR directly, use Pig instead, since you gain little from a Hive table on top of data you only process with MR.
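If you go the Hive route, a minimal sketch (table names, columns, and the filter condition are hypothetical) is an external staging table over the data already in HDFS, plus a conditional insert into a downstream table:

-- external staging table pointing at the data loaded into HDFS
CREATE EXTERNAL TABLE staging_input (
  id BIGINT,
  event_type STRING,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hadoop/input';

-- keep only the rows that satisfy the condition
-- (assumes curated_input already exists with the same schema)
INSERT OVERWRITE TABLE curated_input
SELECT id, event_type, payload
FROM staging_input
WHERE event_type = 'purchase';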
It is my understanding that Spark SQL reads HDFS files directly; there is no need for M/R here. Specifically, none of the MapReduce-based Hadoop Input/OutputFormats are employed (except in special cases like HBase).
So are there any built-in dependencies on a functioning Hive server? Or is it only required to have
a) Spark standalone,
b) HDFS, and
c) the Hive metastore server running?
i.e., YARN/MRv1 are not required?
The Hadoop-related I/O formats for accessing Hive files seem to include:
TextInput/OutputFormat
ParquetFileInput/OutputFormat
Can Spark SQL/Catalyst read Hive tables stored in those formats with only the Hive metastore server running?
Yes.
The Spark SQL Readme says:
Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
This is implemented by depending on Hive libraries for reading the data, but the processing happens inside Spark. So there is no need for MapReduce or YARN.
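As a small illustration (the table name is hypothetical), with only HDFS and the metastore service running, a query like the following can be issued through HiveContext or the spark-sql shell, and it executes entirely as Spark tasks:

-- assumes hive-site.xml on the Spark classpath points at the metastore
SELECT dt, count(*)
FROM my_partitioned_table
GROUP BY dt;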
What is the best (least expensive) equivalent of the SQL Server UPDATE ... SET command in Hive?
For example, consider the case in which I want to convert the following query:
UPDATE TABLE employee
SET visaEligibility = 'YES'
WHERE experienceMonths > 36
to the equivalent Hive query.
I'm assuming you have a table without partitions, in which case you should be able to run the following command:
INSERT OVERWRITE TABLE employee SELECT employeeId, employeeName, experienceMonths, salary, CASE WHEN experienceMonths > 36 THEN 'YES' ELSE visaEligibility END AS visaEligibility FROM employee;
There are other ways, but they are much more convoluted; I think the way Bejoy described is the most efficient.
(source: Bejoy KS blog)
Note that if you have to do this on a partitioned table (which is likely if you have a lot of data), you would probably need to overwrite each affected partition when doing this.
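A hedged sketch of the same approach for a partitioned table (assuming a partition column dt; the value shown is hypothetical), rewriting one partition at a time:

-- rewrite a single partition in place; only that partition's data is touched
INSERT OVERWRITE TABLE employee PARTITION (dt='2015-01-01')
SELECT employeeId, employeeName, experienceMonths, salary,
  CASE WHEN experienceMonths > 36 THEN 'YES' ELSE visaEligibility END AS visaEligibility
FROM employee
WHERE dt='2015-01-01';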
You can create an external table and use INSERT OVERWRITE LOCAL DIRECTORY, and if you want to change column values, you can use CASE WHEN, IF, or other conditional operations. Then copy the output file back to the HDFS location.
You can upgrade your Hive to 0.14.0.
Starting from 0.14.0, Hive supports the UPDATE operation.
To use it, we need to create Hive tables that support the ACID output format and set additional properties in hive-site.xml.
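A sketch of that setup (the table, bucket count, and column names are illustrative; ACID also requires transaction-manager settings such as hive.txn.manager in hive-site.xml):

-- ACID tables must be bucketed, stored as ORC, and flagged transactional
CREATE TABLE employee_acid (
  employeeId INT,
  employeeName STRING,
  experienceMonths INT,
  salary DOUBLE,
  visaEligibility STRING
)
CLUSTERED BY (employeeId) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- with ACID enabled, the SQL-Server-style UPDATE works directly
UPDATE employee_acid
SET visaEligibility = 'YES'
WHERE experienceMonths > 36;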
How to do CRUD operations in Hive
My Hive query has multiple outer joins and takes a very long time to execute. I was wondering if it would make sense to break it into multiple smaller queries and use Pig to perform the transformations.
Is there a way I could query hive tables or read hive table data within a pig script?
Thanks
The goal of the Howl project is to allow Pig and Hive to share a single metadata repository. Once Howl is mature, you'll be able to run Pig Latin and HiveQL queries over the same tables. For now, you can try to work with the data as it is stored in HDFS.
Note that Howl has been renamed to HCatalog.