I have a flow in NiFi in which I use the ExecuteSQL processor to fetch a merge of all sub-partitions named dt from a Hive table. For example, my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1 and dt=1000 under sikid=2.
What I did is select * from my_table where dt=1000.
Unfortunately, what I got back from the ExecuteSQL processor is corrupted data, including rows with dt=NULL, even though the original table does not contain a single row with dt=NULL.
The DBCPConnectionPool is configured to use the HiveJDBC4 jar.
Later I tried the jar that matches the CDH release, but that didn't fix it either.
The ExecuteSQL processor is configured as follows:
Normalize Table/Column Names: true
Use Avro Logical Types: false
Hive version: 1.1.0
CDH: 5.7.1
Any ideas what's happening? Thanks!
Apparently my returned data also includes extra rows, a few thousand of them, which is quite weird.

Does HiveJDBC4 (I assume it is the Simba Hive driver) parse the table name off the column names? This was one place where the Apache Hive JDBC driver was incompatible: it didn't support getTableName(), so it doesn't work with ExecuteSQL. Even if it did, the column names retrieved from the ResultSetMetaData had the table names prepended with a period (.) separator. Handling this is some of the custom code in HiveJdbcCommon (used by SelectHiveQL), as opposed to JdbcCommon (used by ExecuteSQL).
If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify auth information on the JDBC URL rather than in a hive-site.xml file for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I will do my best to help out on that front and get you past this.

Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.


ConvertJsonToSQL for Hive Insert query

I want to insert JSON into a Hive database.
I am trying to transform the JSON to SQL using the ConvertJsonToSQL NiFi processor. How can I add the PARTITION (....) part to my query?
Can I do this, or should I use the ReplaceText processor to build the query?
What version of Hive are you using? There are Hive 1.2 and Hive 3 versions of PutHiveStreaming and PutHive3Streaming (respectively) that let you put the data directly into Hive without having to issue HiveQL statements. For external Hive tables in ORC format, there are also ConvertAvroToORC (for Hive 1.2) and PutORC (for Hive 3) processors.
Assuming those don't work for your use case, you may also consider ConvertRecord with a FreeFormTextRecordSetWriter that generates the HiveQL with the PARTITION statement and such. It gives a lot more flexibility than trying to patch a SQL statement to turn it into HiveQL for a partitioned table.
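To make the target of that approach concrete, here is a minimal Python sketch of the kind of statement you would want the writer (or any other templating step) to produce for each record. It is not NiFi code. The table name events, the partition column dt and the helper names are hypothetical, and it assumes a flat JSON record whose non-partition keys are in the same order as the table's columns, and a Hive version that accepts INSERT ... VALUES.

```python
import json

def quote_literal(value):
    """Render a Python value as a HiveQL literal (strings are single-quoted and escaped)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + text + "'"

def build_partitioned_insert(table, record, partition_cols):
    """Build an INSERT ... PARTITION (...) VALUES (...) statement for one flat record.

    Assumption: the non-partition keys of `record` appear in table column order,
    because VALUES is positional.
    """
    partition_spec = ", ".join(
        f"{col}={quote_literal(record[col])}" for col in partition_cols
    )
    data_cols = [col for col in record if col not in partition_cols]
    values = ", ".join(quote_literal(record[col]) for col in data_cols)
    return f"INSERT INTO TABLE {table} PARTITION ({partition_spec}) VALUES ({values})"

# Hypothetical record and table, for illustration only.
record = json.loads('{"id": 7, "name": "O\'Brien", "dt": "2020-05-01"}')
statement = build_partitioned_insert("events", record, ["dt"])
print(statement)
```

The partition columns are pulled out of the record and moved into the PARTITION clause, which is the part a plain SQL INSERT generated by ConvertJsonToSQL does not have.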
EDIT: I forgot to mention that the Hive 3 NAR/components are not included with the NiFi release due to space reasons. You can find the Hive 3 NAR for NiFi 1.11.4 here.

AWS Glue: Is it possible to pull only specific data from a database?

I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest rows, from the past 24 hours. There is a column that specifies the creation date of each row. Is it possible to transform just these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some built-in transforms in AWS Glue that are used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below for details:
You haven't mentioned the type of database you are trying to connect to. In any case, for JDBC connections Spark has the query option, in which you can pass a regular SQL query to fetch only the rows you need.
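As a rough sketch of that idea in Python: build the filter query with a cutoff computed in the script and pass it through the JDBC query option. The table orders, the column created_at, the URL, the credentials and the helper names are all made up for illustration. The timestamp literal format and the time zone of the cutoff are assumptions you would need to adapt to your database, and the query option is only available from Spark 2.4 on (older versions can pass a parenthesised subquery with an alias via dbtable instead).

```python
from datetime import datetime, timedelta

def recent_rows_query(table, created_col, now, hours=24):
    """Return a SQL query selecting rows created within the last `hours` hours."""
    cutoff = now - timedelta(hours=hours)
    literal = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} WHERE {created_col} >= '{literal}'"

def jdbc_read_options(url, user, password, query):
    """Collect the options for a Spark JDBC read that pushes `query` down to the database."""
    return {"url": url, "user": user, "password": password, "query": query}

# Fixed "now" so the example is reproducible; a real job would use datetime.now().
now = datetime(2020, 5, 2, 12, 0, 0)
query = recent_rows_query("orders", "created_at", now)
options = jdbc_read_options(
    "jdbc:postgresql://db.example.com:5432/shop",  # hypothetical endpoint
    "etl_user",
    "secret",
    query,
)
print(query)

# With a SparkSession available (as in a Glue job), the read and the CSV export
# would then look roughly like this:
# df = spark.read.format("jdbc").options(**options).load()
# df.write.option("header", "true").csv("s3://example-bucket/orders-last-24h/")
```

Because the filter is part of the query sent to the database, only the matching rows travel over the JDBC connection, so the whole table is never copied.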

Cannot load jdbc driver class org.apache.hive.jdbc.hivedriver in Kylo

I am trying to create a Data Ingest feed, but all the jobs are failing. I checked NiFi and there are error marks saying that "org.apache.hive.jdbc.hivedriver" was not found. I checked the NiFi logs and found the following error:
So where exactly do I need to put the hivedriver jar?
Based on the comments, this seems to be the solution, as mentioned by @Greg Hart:
Have you tried using a Data Transformation feed? The Data Ingest
template is for loading data into Hive, but it looks like you're using
it to move data from one Hive table into another.

Autodetect BigQuery schema within Dataflow?

Is it possible to use the equivalent of --autodetect in DataFlow?
i.e. can we load data into a BQ table without specifying a schema, equivalent to how we can load data from a CSV with --autodetect?
(potentially related question)
If you are using protocol buffers as objects in your PCollections (which should be performing very well on the Dataflow back-end) you might be able to use a util I wrote in the past. It will parse the schema of the protobuffer into a BigQuery schema at runtime, based on inspection of the protobuffer descriptor.
I quickly uploaded it to GitHub, it's WIP, but you might be able to use it or be inspired to write something similar using Java Reflection (I might do it myself at some point).
You can use the util as follows:
TableSchema schema = ProtobufUtils.makeTableSchema(ProtobufClass.getDescriptor());
where the create disposition will create the table with the schema specified and the ProtobufClass is the class generated using your Protobuf schema and the proto compiler.
I'm not sure about reading from BQ, but for writes I think that something like this will work on the latest java SDK.
.apply("WriteBigQuery", BigQueryIO.Write
Note: BigQuery Table must be of the form: <project_name>:<dataset_name>.<table_name>.

Spark HiveContext does not retrieve newly inserted records from Hive Table

I am using Spark 1.4. HiveContext is used to connect to Hive. I did the following:
val hx = new HiveContext(sc)
import hx.implicits._
hx.sql("select * from tab").show
// it is fine, result was shown as expected
Then I inserted a few records into tab from the beeline console:
hx.sql("select * from tab").show
// still old records, no newly inserted records
My question is: why didn't the HiveContext retrieve the newly inserted records?
hiveContext.refreshTable(tableName: String) - this will refresh only metadata of the table (not the actual data)
Notes from the official documentation (credits: https://spark.apache.org):
refreshTable(tableName: String): Unit
Invalidate and refresh all the cached the metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache
To retrieve newly inserted records, uncache the table first and then cache it again, using uncacheTable(String tableName) and cacheTable(String tableName).
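Put together, the sequence suggested above could be sketched as follows. This is written in Python against a stand-in object so that it runs without Spark. The helper read_fresh and the FakeContext class are hypothetical, and the only assumption about the real context is that it exposes refreshTable, uncacheTable, cacheTable and sql with the shapes quoted in this answer (check this against your Spark version).

```python
def read_fresh(ctx, table):
    """Refresh metadata, rebuild the cache, then rerun the query for `table`.

    Assumes the table was cached earlier; on some Spark versions uncaching a
    table that is not cached raises an error.
    """
    ctx.refreshTable(table)   # invalidate cached metadata (file and block locations)
    ctx.uncacheTable(table)   # drop the stale cached data
    ctx.cacheTable(table)     # mark the table for caching again
    return ctx.sql(f"select * from {table}")

class FakeContext:
    """Stand-in for a HiveContext; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def refreshTable(self, name):
        self.calls.append(("refreshTable", name))

    def uncacheTable(self, name):
        self.calls.append(("uncacheTable", name))

    def cacheTable(self, name):
        self.calls.append(("cacheTable", name))

    def sql(self, query):
        self.calls.append(("sql", query))
        return query

ctx = FakeContext()
result = read_fresh(ctx, "tab")
print(ctx.calls)
```

The point of the ordering is that the query is issued only after both the metadata and the cached data have been invalidated, and that a new result is built instead of reusing the one obtained before the insert.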
If the target table is partitioned, you need to insert with the PARTITION clause. If you leave out the partition, the data will not be visible.
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...) SELECT col1,col2,.... FROM tablename2
In a slightly different case, I have an RDD coming from a Spark SQL statement via HiveContext. The solution that worked for me, after some experiments, was to regenerate the RDD itself.
It does not matter whether you are using the DDL by Spark SQL or sending SQL statements directly via hiveContext.sql.
I have seen people using a "count trick" to force the recomputation of a dataset, but in my attempts I couldn't get to see the new data this way.
In any case, caching, refreshing and the like did not work for me. If somebody has a proper pattern for this, please share.