Cannot load jdbc driver class org.apache.hive.jdbc.hivedriver in Kylo - hive

I am trying to create a Data Ingest feed, but all the jobs are failing. I checked NiFi and there are error markers saying that "org.apache.hive.jdbc.hivedriver" was not found. I checked the NiFi logs and found the following error:
So where exactly do I need to put the hivedriver jar?

Based on the comments, this seems to be the solution, as mentioned by @Greg Hart:
Have you tried using a Data Transformation feed? The Data Ingest
template is for loading data into Hive, but it looks like you're using
it to move data from one Hive table into another.

Related

Committing hudi files manually

I am using Spark 3.x with Apache Hudi 0.8.0.
While trying to create a Presto table with the hudi-hive-sync tool, I get the error below.
Got runtime exception when hive syncing
java.lang.IllegalArgumentException: Could not find any data file written for commit [20220116033425__commit__COMPLETED], could not get schema for table
But I checked the data for all partition keys using a Zeppelin notebook, and all of it is present.
As I understand it, I need to commit the files manually. How do I do that?

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue, with a source table that looks like this:
There are two columns, SLA_Processing_start_time and SLA_Processing_end_time, that have the datatype TIME.
Somehow, while the data is written to the staging area, the values are changed to something like 0:08:00:00.0000000,0:17:00:00.0000000, which causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any idea why 08:00:00 becomes 0:08:00:00.0000000 and how to avoid it?
Finally, I was able to recreate your case in my environment.
I get the same error: a leading zero appears in front of the time (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good. I even had problems using it with an AWS environment and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write SELECT using the CAST / CONVERT function.
Recommended solution:
Use the Copy data activity to insert your data on BlobStorage / ADLS (this activity did it anyway) preferably in the parquet file format and a self-designed structure (Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity that loads the data from those files into a table. You can use a regular query, or write a stored procedure and call it.
Thanks to this, you will have more control over what is happening and you will build a DataLake solution for your organization.
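The simple solution above can be sketched as a small helper that builds the source query for the copy activity's "query" option. This is only an illustration under assumptions: the table name dbo.SLA_Config, the Id column, and the helper build_source_query are hypothetical, and DATETIME2(0) is just one possible target type. On the Snowflake side the value would then still need to be converted back to a time.

```python
def build_source_query(table, columns, time_columns):
    """Build a SELECT that casts the given TIME columns to DATETIME2(0).

    All arguments are illustrative inputs; nothing here is an ADF API.
    """
    select_items = []
    for col in columns:
        if col in time_columns:
            # CAST keeps the time of day; SQL Server fills in a default date.
            select_items.append(f"CAST([{col}] AS DATETIME2(0)) AS [{col}]")
        else:
            select_items.append(f"[{col}]")
    return f"SELECT {', '.join(select_items)} FROM {table}"


query = build_source_query(
    "dbo.SLA_Config",  # hypothetical table name
    ["Id", "SLA_Processing_start_time", "SLA_Processing_end_time"],
    {"SLA_Processing_start_time", "SLA_Processing_end_time"},
)
print(query)
```

The resulting string is what would be pasted into the source "query" field, so the staged file never contains a raw TIME value.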
My own solution is pretty close to the accepted answer, but I still believe that there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file that is created on a direct-to-Snowflake copy, I ended up writing a plain file to the blob storage and reading it again to load it into Snowflake.
So instead of having it all in one step, I manually split it into two actions:
One action that takes the data from Azure SQL and saves it as a plain text file on the blob storage.
And a second action that reads the file and loads it into Snowflake.
This works, and it is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.

Azure Data Factory error on Sink "UserErrorSchemaMappingCannotInferSinkColumnType"

I am using Azure Data Factory to read data from Application Insights via the REST API by passing a Kusto query, and I am trying to write the results to an Azure SQL database.
Unfortunately when I execute my pipeline I get the following error:
UserErrorSchemaMappingCannotInferSinkColumnType,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Data type of column '$['Month']' can't be inferred from 1st row of data, please specify its data type in mappings of copy activity or structure of DataSet.,Source=Microsoft.DataTransfer.Common
It seems like an error in the mapping, but from the mapping tab I am unable to specify the data type of the columns:
Can you provide me a hint?
Update: I use the Copy data activity with the following REST source:
As I understand it, the copy activity runs without errors, but the data is not inserted.
We're glad to hear that you have resolved the issue. I am posting your solution as an answer to close this question:
In the end I managed to solve my issue following this blog: https://www.ben-morris.com/using-azure-data-factory-with-the-application-insights-rest-api/
This can be beneficial to other community members.

When I run snowflake stage query I get aws error

I've created an S3-linked stage on Snowflake called csv_stage with my AWS credentials, and the creation was successful.
Now I'm trying to query the stage like below:
select t.$1, t.$2 from @sandbox_ra.public.csv_stage/my_file.csv t
However the error I'm getting is
Failure using stage area. Cause: [The AWS Access Key Id you provided is not valid.]
Any idea why? Do I have to pass something in the query itself?
Thanks for your help!
Ultimately, let's say my S3 location has 3 different CSV files. I would like to load each one of them individually into a different Snowflake table. What's the best way to go about doing this?
Regarding the last part of your question: You can load multiple files with one COPY INTO command by using the file names or a regex pattern. But as you have 3 different files for 3 different tables, you also have to use three different COPY INTO commands.
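As a rough sketch of the "one COPY INTO per table" idea, the statements could be generated from a file-to-table mapping. The helper build_copy_statements and the file and table names are hypothetical; only the stage name comes from the question, and the CSV file format options are an assumption that would need adjusting to the real files.

```python
def build_copy_statements(stage, file_to_table):
    """Return one COPY INTO statement per (file, table) pair."""
    statements = []
    for file_name, table in file_to_table.items():
        statements.append(
            f"COPY INTO {table} "
            f"FROM @{stage} "
            f"FILES = ('{file_name}') "  # restrict the load to a single file
            "FILE_FORMAT = (TYPE = CSV)"
        )
    return statements


# Hypothetical file and table names, for illustration only.
statements = build_copy_statements(
    "sandbox_ra.public.csv_stage",
    {
        "customers.csv": "customers",
        "orders.csv": "orders",
        "products.csv": "products",
    },
)
for statement in statements:
    print(statement)
```

If several files belong to the same table, the FILES list can name more than one file, or a PATTERN regex can be used in its place.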
Regarding querying your stage you can find some more hints in these questions:
Missing List-permissions on AWS - Snowflake - Failure using stage area. Cause: [The AWS Access Key Id you provided is not valid.] and
https://community.snowflake.com/s/question/0D50Z00008EKjkpSAD/failure-using-stage-area-cause-access-denied-status-code-403-error-code-accessdeniedhow-to-resolve-this-error
https://aws.amazon.com/de/premiumsupport/knowledge-center/access-key-does-not-exist/
I found out that the AWS credentials I provided were wrong. After fixing that, the query worked.
This approach works for importing data from a public S3 bucket into a Snowflake table:
COPY INTO SNOW_SCHEMA.table_name FROM 's3://test-public/new/solution/file.csv'

ExecuteSQL processor returns corrupted data

I have a flow in NiFi in which I use the ExecuteSQL processor to get a merge of all sub-partitions named dt from a Hive table. For example: my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1 and dt=1000 under sikid=2.
What I did is select * from my_table where dt=1000.
Unfortunately, what I've got in return from the ExecuteSQL processor is corrupted data, including rows that have dt=NULL while the original table does not have even one row with dt=NULL.
The DBCPConnectionPool is configured to use HiveJDBC4 jar.
Later I tried the jar that matches the CDH release, but that didn't fix it either.
The ExecuteSQL processor is configured as such:
Normalize Table/Column Names: true
Use Avro Logical Types: false
Hive version: 1.1.0
CDH: 5.7.1
Any ideas what's happening? Thanks!
EDIT:
Apparently my returned data includes extra rows... a few thousand of them.. which is quite weird.
Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place there was an incompatibility with the Apache Hive JDBC driver, it didn't support getTableName() so doesn't work with ExecuteSQL, and even if it did, when the column names are retrieved from the ResultSetMetaData, they had the table names prepended with a period . separator. This is some of the custom code that is in HiveJdbcCommon (used by SelectHiveQL) vs JdbcCommon (used by ExecuteSQL).
If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify auth information on the JDBC URL rather than in a hive-site.xml file for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I will do my best to help out on that front and get you past this.
Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.
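The answer does not say where the property was set. One possibility, assuming the Apache Hive JDBC driver's URL layout (jdbc:hive2://host:port/db;session_vars?hive_conf_list#hive_vars), is to append it to the connection URL used by the connection pool. The helper with_hive_conf and the host name below are hypothetical, and other drivers (such as the Simba-based HiveJDBC4) may expect the setting to be passed differently.

```python
def with_hive_conf(jdbc_url, conf):
    """Append Hive configuration properties to a jdbc:hive2 URL.

    Assumes the URL has no '#hive_var_list' part; properties go after '?'
    and are separated by ';'.
    """
    conf_list = ";".join(f"{key}={value}" for key, value in conf.items())
    # Start the conf list with '?', or extend an existing one with ';'.
    separator = ";" if "?" in jdbc_url else "?"
    return f"{jdbc_url}{separator}{conf_list}"


url = with_hive_conf(
    "jdbc:hive2://hive-host.example.com:10000/default",  # hypothetical host
    {"hive.query.result.fileformat": "SequenceFile"},
)
print(url)
```

Setting the same property in the Hive configuration on the server side would be an alternative that does not depend on the driver's URL syntax.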