NiFi - How to put data into a Hive database?

I am building a NiFi flow to get JSON elements from a Kafka topic and write them into a Hive table.
However, there is very little to no documentation about the processors and how to use them.
What I plan to do is the following:
kafka consume --> ReplaceText --> PutHiveQL
Consuming the Kafka topic works fine; I receive a JSON string.
I would like to extract the JSON data (with ReplaceText) and put it into the Hive table (PutHiveQL).
However, I have absolutely no idea how to do this. Documentation is not helping and there is no precise example of processor usage (or I could not find one).
Is my theoretical solution valid?
How do I extract the JSON data, build an HQL query and send it to my local Hive database?

Basically, you want to transform your record from Kafka into an HQL request and then send that request to the PutHiveQL processor.
I am not sure the Kafka record -> HQL transformation can be done with ReplaceText alone (it seems a little hard/tricky). In general I use a custom Groovy script processor to do this.
Edit
Global overview:
EvaluateJsonPath
This extracts the timestamp and uuid properties from my JSON flowfile and puts them as attributes on the flowfile.
ReplaceText
This sets the flowfile content to an empty string and replaces it with the Replacement Value property, in which I build the query.
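To make that overview concrete, here is a minimal sketch of the two processor configurations. The incoming JSON shape, the attribute names and the table my_table are assumptions, not taken from the question:

    EvaluateJsonPath
        Destination: flowfile-attribute
        uuid:        $.uuid
        timestamp:   $.timestamp

    ReplaceText
        Evaluation Mode:      Entire text
        Replacement Strategy: Always Replace
        Replacement Value:    INSERT INTO my_table (uuid, ts) VALUES ('${uuid}', '${timestamp}')

PutHiveQL then executes the flowfile content as HiveQL against the configured Hive connection pool. Note that quoting attribute values straight into SQL like this is naive; it is only meant to show where the query gets built.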

You can directly inject the streaming data using the PutHiveStreaming processor.
Create an ORC table with a structure matching the flow and pass the flow to the PutHive3Streaming processor; it works.
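As a rough sketch (table and column names are placeholders, and the exact requirements depend on your Hive version): PutHiveStreaming generally expects an ORC, bucketed, transactional table, while PutHive3Streaming takes a Record Reader (e.g. JsonTreeReader), so the JSON coming off Kafka can be streamed in without hand-building HQL.

    CREATE TABLE my_stream_table (uuid STRING, ts TIMESTAMP, payload STRING)
    CLUSTERED BY (uuid) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');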

Related

Saving output from parsing a JSON file and passing it to BigQueryInsertJobOperator

I need some advice on solving this requirement for auditing purposes. I am using an Airflow (Composer) Dataflow Java operator job which spits out a JSON file after job completion with status and error message details (into the Airflow data folder). I want to extract the status and error message from the JSON file via some operator and then pass the variables to the next pipeline job, BigQueryInsertJobOperator, which calls the stored proc, passes status and error message as input parameters, and finally writes them into a BQ dataset table.
Thanks
You need to use XCom and Jinja templating. When you return metadata from the operator, the data is stored in XCom and you can retrieve it using Jinja templating or Python code in a PythonOperator (or Python code in your custom operator).
These are two very good articles from Marc Lamberti (who also has really nice courses on Airflow) describing how templating and Jinja can be leveraged in Airflow: https://marclamberti.com/blog/templates-macros-apache-airflow/ and this one describes XCom: https://marclamberti.com/blog/airflow-xcom/
By combining the two you can get what you want.
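A minimal sketch of how that can fit together, assuming a PythonOperator that parses the JSON file and a stored procedure named my_dataset.audit_proc (the file path, task ids and field names below are all placeholders, not your actual names):

    import json
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


    def parse_dataflow_result():
        # Assumed location of the JSON file written by the Dataflow job
        with open("/home/airflow/gcs/data/dataflow_result.json") as f:
            result = json.load(f)
        # The returned dict is pushed to XCom automatically
        return {"status": result["status"], "error_message": result.get("errorMessage", "")}


    with DAG(
        dag_id="audit_dataflow_job",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        parse_result = PythonOperator(
            task_id="parse_result",
            python_callable=parse_dataflow_result,
        )

        # 'configuration' is a templated field, so Jinja can pull the XCom values
        write_audit = BigQueryInsertJobOperator(
            task_id="write_audit",
            configuration={
                "query": {
                    "query": (
                        "CALL my_dataset.audit_proc("
                        "'{{ ti.xcom_pull(task_ids='parse_result')['status'] }}', "
                        "'{{ ti.xcom_pull(task_ids='parse_result')['error_message'] }}')"
                    ),
                    "useLegacySql": False,
                }
            },
        )

        parse_result >> write_audit

The CALL parameters end up as quoted SQL string literals; adjust the statement to whatever signature your stored procedure actually has.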

ExecuteSQL processor returns corrupted data

I have a flow in NiFi in which I use the ExecuteSQL processor to get a merge of all sub-partitions named dt from a Hive table. For example: my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1, and dt=1000 under sikid=2.
What I did is select * from my_table where dt=1000.
Unfortunately, what I've got in return from the ExecuteSQL processor is corrupted data, including rows that have dt=NULL while the original table does not have even one row with dt=NULL.
The DBCPConnectionPool is configured to use HiveJDBC4 jar.
Later I tried using the jar matching the CDH release, but that didn't fix it either.
The ExecuteSQL processor is configured as such:
Normalize Table/Column Names: true
Use Avro Logical Types: false
Hive version: 1.1.0
CDH: 5.7.1
Any ideas what's happening? Thanks!
EDIT:
Apparently my returned data includes extra rows... a few thousand of them... which is quite weird.
Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place where there was an incompatibility with the Apache Hive JDBC driver: it didn't support getTableName(), so it doesn't work with ExecuteSQL, and even if it did, when the column names are retrieved from the ResultSetMetaData they have the table name prepended with a period (.) separator. This is some of the custom code that is in HiveJdbcCommon (used by SelectHiveQL) vs JdbcCommon (used by ExecuteSQL).
If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify auth information on the JDBC URL rather than in a hive-site.xml file for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I will do my best to help out on that front and get you past this.
Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.
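For reference, a sketch of where that property can go (host, port and database below are placeholders, and the Simba driver has its own syntax for passing server-side properties):

    -- per session:
    SET hive.query.result.fileformat=SequenceFile;

    -- or, with the Apache Hive JDBC driver, as a Hive conf parameter on the connection URL:
    jdbc:hive2://myhost:10000/default?hive.query.result.fileformat=SequenceFile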

Cloud Dataflow: Using Google-provided PubSub to BigQuery template with multiple JSON entries

I am using the Google-provided template for PubSub to BigQuery with no customizations. I am trying to put multiple entries (rows) into a single JSON payload on the queue and then have the Dataflow template insert all entries (rows) into the BigQuery table. I have tried providing a newline-delimited JSON payload, as is required when loading data into BigQuery via the console. However, I am only able to get the first entry to insert into the table.
Does the default Dataflow template only take a single entry (row)?
Currently the Google-provided template only accepts a single JSON record as payload within the Cloud Pub/Sub message and will not detect any newline delimited JSON. Look for this to change in the near future as additional supported formats are added to the template.
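So, until that lands, the workaround is to split the payload upstream and publish one message per row. A minimal sketch with the google-cloud-pubsub client (project, topic and row contents are placeholders):

    import json
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names

    rows = [
        {"name": "alice", "value": 1},
        {"name": "bob", "value": 2},
    ]

    # One Pub/Sub message per row, so the template sees one JSON record per message.
    futures = [publisher.publish(topic_path, json.dumps(row).encode("utf-8")) for row in rows]
    for future in futures:
        future.result()  # block until each publish succeeds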

Streamsets stream selector

I have a queue in JSON format in RabbitMQ and I would like to get some data that fits certain conditions in StreamSets (using a Stream Selector) and then save a certain value in a new database (JDBC Producer). How do I write the specific value after the conditions and send it to the database?
From your pipeline diagram, it looks like you're trying to set the field values you need for the database. You should be able to do this in the JDBC Producer itself - configure the Field to Column Mapping.
Actually, you can use the default StreamSets component named Expression Evaluator for this.
You can add or modify any fields with it.
Link to the documentation:
streamsets expression evaluator
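As a rough sketch (the field names and condition below are made up, not from the question):

    Stream Selector, condition for stream 1:
        ${record:value('/status') == 'ACTIVE'}

    Expression Evaluator, field expression:
        Output Field:     /db_value
        Field Expression: ${record:value('/amount') * 2}

Then map /db_value to the target column in the JDBC Producer's Field to Column Mapping.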

Parsing Salesforce query payload

I am querying some data from Salesforce using a Mule flow after subscribing to one of the Push Topics. After the data is queried, I can see the payload using #[message.payload.next()], but when I try to retrieve the 'StageName' field using these expressions: payload[0].StageName, message.payload.StageName, payload['StageName'], it does not work. I can see in the log that the printed value is a Map, but retrieving the field is not working.
payload[0].StageName works fine in a Mule 3.3.2 environment but not in my Mule 3.7.3; I'd appreciate it if any of you could help.
The data returned by the Salesforce query is of type ConsumerIterator. Just use a Set Payload with the value #[org.apache.commons.collections.IteratorUtils.toList(payload)] after the Salesforce query connector to convert the payload into an ArrayList.
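For illustration, a minimal Mule 3 sketch of where that goes (the doc:name values are placeholders; the Salesforce query connector sits just above these elements):

    <!-- placed right after the Salesforce query connector -->
    <set-payload value="#[org.apache.commons.collections.IteratorUtils.toList(payload)]"
                 doc:name="Iterator to List"/>

    <!-- payload is now an ArrayList, so this resolves -->
    <logger message="#[payload[0].StageName]" level="INFO" doc:name="Log StageName"/>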