I'm creating a job that takes a JSON input from a table, and I'm trying to do an ETL metadata injection. I tried with v5.4 and v6.0, but neither is working.
Is there any workaround for my scenario?
Related
I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue with a source table that looks like this:
There are two columns, SLA_Processing_start_time and SLA_Processing_end_time, that have the datatype TIME.
Somehow, while writing the data to the staging area, the data is changed to something like 0:08:00:00.0000000, 0:17:00:00.0000000, which causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas as to why 08:00:00 becomes 0:08:00:00.0000000 and how to avoid it?
Finally, I was able to recreate your case in my environment.
I have the same error: a leading zero appears ahead of the time value (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good; I even had problems using it with an AWS environment and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write a SELECT using the CAST / CONVERT function.
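A minimal sketch of that source-side query, assuming the column names from the question; the table name dbo.SlaTable is a placeholder, and the SQL is shown as a Python string only for illustration:

# Query for the Copy activity's "query" option (hypothetical table name).
# Casting TIME to DATETIME avoids the 0:08:00:00.0000000 representation in the staged file.
source_query = """
SELECT
    CAST(SLA_Processing_start_time AS DATETIME) AS SLA_Processing_start_time,
    CAST(SLA_Processing_end_time   AS DATETIME) AS SLA_Processing_end_time
    -- , remaining columns ...
FROM dbo.SlaTable
"""

On the Snowflake side you can then cast the loaded values back to TIME (e.g. with TO_TIME or CAST(... AS TIME)).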
Recommended solution:
Use the Copy data activity to land your data on BlobStorage / ADLS (the built-in connector does this anyway), preferably in the Parquet file format and with a self-designed folder structure (see Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity and load the data from those files into a table; you can use a regular query (sketched below) or write a stored procedure and call it.
This way you will have more control over what is happening, and you will be building a data lake solution for your organization.
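A rough sketch of the Snowflake side of that setup, assuming SAS-token authentication; all object names and the URL are placeholders, and the statements are shown as Python strings only for illustration (a Lookup activity or a stored procedure would run them):

# Permanent stage over the Blob / ADLS location the Copy data activity writes to.
create_stage_sql = """
CREATE STAGE IF NOT EXISTS my_adls_stage
  URL = 'azure://<account>.blob.core.windows.net/<container>/path/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
  FILE_FORMAT = (TYPE = PARQUET);
"""

# Load the staged Parquet files into the target table.
copy_sql = """
COPY INTO my_schema.my_table
FROM @my_adls_stage
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
"""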
My own solution is pretty close to the accepted answer, but I still believe that there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file that is created on a direct-to-Snowflake copy, I ended up writing a plain file into blob storage and reading it again to load into Snowflake.
So instead of having it all in one step, I manually split it up into two actions:
One action that takes the data from Azure SQL and saves it as a plain text file on blob storage.
And then a second action that reads the file and loads it into Snowflake.
This works, and it is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.
I need some advice on solving this requirement for auditing purposes. I am using an Airflow (Cloud Composer) Dataflow Java operator job which writes out a JSON file with status and error message details after job completion (into the Airflow data folder). I want to extract the status and error message from the JSON file via some operator and then pass those variables to the next pipeline job, a BigQueryInsertJobOperator, which calls a stored procedure, passes the status and error message as input parameters, and finally writes them into a BigQuery dataset table.
Thanks
You need to use XCom and Jinja templating. When you return metadata from an operator, the data is stored in XCom and you can retrieve it using Jinja templating or Python code in a PythonOperator (or Python code in your custom operator).
These are two very good articles from Marc Lamberti (who also has really nice courses on Airflow) describing how templating and Jinja can be leveraged in Airflow, https://marclamberti.com/blog/templates-macros-apache-airflow/, and this one describes XCom: https://marclamberti.com/blog/airflow-xcom/
By combining the two you can get what you want.
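A minimal sketch of that combination, assuming the Dataflow job drops a file like dataflow_result.json with "status" and "errorMessage" keys into the Composer data folder; the DAG ID, task IDs, file path, dataset and stored procedure names are all placeholders:

import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


def read_job_result(**_):
    # Read the JSON file the Dataflow job wrote into the Composer data folder.
    with open("/home/airflow/gcs/data/dataflow_result.json") as f:
        result = json.load(f)
    # Whatever is returned here is stored in XCom under the key "return_value".
    return {"status": result["status"], "error": result.get("errorMessage", "")}


with DAG("audit_pipeline", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    extract_status = PythonOperator(task_id="extract_status", python_callable=read_job_result)

    write_audit = BigQueryInsertJobOperator(
        task_id="write_audit",
        configuration={
            "query": {
                # Jinja pulls the dict returned above out of XCom at render time.
                "query": (
                    "CALL my_dataset.audit_proc("
                    "'{{ ti.xcom_pull(task_ids='extract_status')['status'] }}', "
                    "'{{ ti.xcom_pull(task_ids='extract_status')['error'] }}')"
                ),
                "useLegacySql": False,
            }
        },
    )

    extract_status >> write_audit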
My actual (not properly working) setup has two pipelines:
Get API data to lake: for each row in a metadata table in SQL, call the REST API and copy the reply (JSON files) to the Blob data lake.
Copy data from the lake to SQL: for each file, auto-create a table in SQL.
The result is the correct number of tables in SQL. Only the content of the tables is not what I hoped for: they all contain one column named odata.metadata and one entry, the link to the metadata.
If I manually remove the metadata from the JSON in the datalake and then run the second pipeline, the SQL table is what I want to have.
Have:
{ "odata.metadata":"https://test.com",
"value":[
{
"Key":"12345",
"Title":"Name",
"Status":"Test"
}]}
Want:
[
  {
    "Key": "12345",
    "Title": "Name",
    "Status": "Test"
  }
]
I tried adding $.['value'] in the API call. The result then was no odata.metadata line, but the array started with {value: , which resulted in an error when copying to SQL.
I also tried using mapping (in the sink) to SQL. That gives the wanted result for the dataset I manually specified the mapping for, but it only goes well for datasets with the same number of columns in the array. I don't want to do the mapping manually for 170 calls...
Does anyone know how to handle this in ADF? For now I feel like the only solution is to add a Python step in the pipeline, but I hope for a somewhat standard ADF way to do this!
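For reference, the clean-up such a Python step would have to do is just unwrapping the value array; a minimal sketch with placeholder file paths:

import json

# Read the raw API reply and keep only the "value" array.
with open("raw/reply.json") as f:
    reply = json.load(f)

# Write the cleaned file that the copy-to-SQL pipeline should pick up.
with open("clean/reply.json", "w") as f:
    json.dump(reply["value"], f)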
You can add another pipeline with a data flow to remove that content from the JSON file before copying the data to SQL, using the flatten formatter.
Before flattening the JSON file:
This is what I see when the JSON data is copied to the SQL database without flattening:
After flattening the JSON file:
I added a pipeline with a data flow to flatten the JSON file and remove the 'odata.metadata' content from the array.
Source preview:
Flatten formatter:
Select the required object from the Input array
After selecting the value object from the input array, you can see only the values under value in the flatten formatter preview.
Sink preview:
File generated after flattening.
Copy the generated file as Input to SQL.
Note: If your input file schema is not constant, you can enable Allow schema drift to allow schema changes.
Reference: Schema drift in mapping data flow
In Pentaho Data Integration I am using metadata injection within a stream of a transformation. How can I get the result of the metadata injection back into my stream in order to continue transforming the data outside of the metadata injection? Copy rows to result does not seem to work here like it does with a transformation within a transformation.
Found it myself: in the Options tab you can select the step within the template to read the data from, and below that you can set the fields.
metadata injection options
I am building a NiFi flow to get JSON elements from Kafka and write them into a Hive table.
However, there is very little to no documentation about the processors and how to use them.
What I plan to do is the following:
kafka consume --> ReplaceText --> PutHiveQL
Consuming the Kafka topic works great; I receive a JSON string.
I would like to extract the JSON data (with ReplaceText) and put it into the Hive table (PutHiveQL).
However, I have absolutely no idea how to do this. The documentation is not helping and there is no precise example of processor usage (or I could not find one).
Is my theoretical solution valid?
How do I extract the JSON data, build an HQL query and send it to my local Hive database?
Basically you want to transform your record from Kafka into an HQL request and then send the request to the PutHiveQL processor.
I am not sure that the Kafka record -> HQL transformation can be done just by replacing text (it seems a little bit hard/tricky). In general I use a custom Groovy script processor to do this.
Edit
Global overview:
EvaluateJsonPath
This extracts the properties timestamp and uuid from my JSON flowfile and puts them as attributes of the flowfile.
ReplaceText
This sets the flowfile content to an empty string and replaces it with the Replacement Value property, in which I build the query.
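A hedged sketch of that configuration, written out as Python literals only for illustration; the JSON paths, attribute names and table name are assumptions, not taken from a working flow:

# EvaluateJsonPath (Destination = flowfile-attribute): dynamic properties mapping
# attribute names to JSON paths in the flowfile content.
evaluate_json_path_properties = {
    "uuid": "$.uuid",
    "timestamp": "$.timestamp",
}

# ReplaceText (Replacement Strategy = Always Replace): the flowfile content is replaced
# by an HQL statement built from those attributes with NiFi Expression Language.
replace_text_replacement_value = (
    "INSERT INTO my_table (uuid, event_time) "
    "VALUES ('${uuid}', '${timestamp}')"
)

# PutHiveQL then executes the HQL statement carried in the flowfile content.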
You can directly inject the streaming data using the PutHiveStreaming processor.
Create an ORC table with a structure matching the flow and pass the flow to the PutHive3Streaming processor; it works.
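A hedged sketch of the kind of table this approach expects: a transactional ORC table whose columns match the records in the flow; all names are placeholders, and the DDL is shown as a Python string only for illustration:

# Hive DDL for a transactional, bucketed ORC table (placeholder names and columns).
create_table_hql = """
CREATE TABLE my_db.events (
    uuid STRING,
    event_time TIMESTAMP,
    message STRING
)
CLUSTERED BY (uuid) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
"""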