In Pentaho Data Integration I am using metadata injection within the stream of a transformation. How can I get the result of the metadata injection back into my stream so that I can continue transforming the data outside of the metadata injection? Copy rows to result does not seem to work here the way it does with a transformation within a transformation.
Found it myself. In the Options tab you can select the step within the template transformation to read the data from, and below that you can set the fields.
(Screenshot: metadata injection Options tab)
Related
I have about 100 tables to which we replicate data, e.g. from an Oracle database.
I would like to quickly check that the data replicated to the tables in DB2 is the same as in the source system.
Does anyone have a way to do this? I could create 100 transformations, but that's monotonous and time-consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data with a Table input step (sql_db2, sql_source, table_name) and write it to Copy rows to result. Next I read a single record and put it into a loop.
But here I ran into a problem: I don't know how to dynamically compare the data for the tables, because each table has different columns.
I don't know if this is even possible.
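A minimal sketch of such a driver table (all names here are hypothetical, and the checks are only illustrative):

```sql
-- Hypothetical driver table: one row per table to compare,
-- holding the queries to run against each side.
CREATE TABLE compare_queries (
    table_name  VARCHAR(128),   -- table being checked
    sql_source  VARCHAR(4000),  -- query to run against the Oracle source
    sql_db2     VARCHAR(4000)   -- query to run against the DB2 replica
);

INSERT INTO compare_queries VALUES (
    'CUSTOMERS',
    'SELECT COUNT(*) AS row_cnt FROM src_schema.customers',
    'SELECT COUNT(*) AS row_cnt FROM rep_schema.customers'
);
```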
You can inject metadata (in this case your metadata would be the column and table names) into a lot of steps in Pentaho. You create one transformation to collect the metadata and inject it into another, template transformation that contains only the steps and some basic configuration; the bulk of the information about the columns affected by the different steps lives in the transformation doing the injecting.
Check the official Pentaho documentation on Metadata Injection (MDI) and the basic metadata injection sample that ships with your PDI installation.
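As a hedged sketch of what the injected piece could look like (the step layout and names are assumptions, not taken from the samples): the template transformation holds a Table input step whose SQL is filled in per table, and the injecting transformation reads the driver rows and pushes the table name (and, if needed, the column list) into it. A column-agnostic first pass can simply compare row counts on both sides:

```sql
-- Hedged example of a query the injecting transformation could generate per table;
-- ${TABLE_NAME} stands for the value taken from the driver row (as a PDI variable
-- or as injected step metadata).
SELECT COUNT(*) AS row_cnt
FROM   ${TABLE_NAME}
```

Running the same generated query against Oracle and DB2 and merging the two streams lets one transformation flag any table whose counts differ; deeper per-column comparisons can be generated the same way from the injected column names.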
I'm testing data flows in my Azure Data Factory. I created a Data Flow with the following details:
Source dataset linked service: a CSV files dataset from Blob storage
Sink linked service: an Azure SQL Database with a pre-created table
My CSV files are quite simple, as they contain only 2 columns (PARENT, CHILD). So my table in the SQL DB also has only 2 columns.
In the sink settings of my data flow I have allowed inserts and left the other options at their defaults.
I have also mapped the 2 input and output columns, as per the screenshot.
The pipeline with the data flow ran successfully. When I checked the result, I could see that 5732 rows were processed. Is this the correct way to check? This is the first time I have tried this functionality in Azure Data Factory.
But when I click on the Data preview tab, the values are all NULL.
And when I checked the table in my Azure SQL DB where I tried to insert the data from the CSV files in Blob storage (selecting the top 1000 rows), I don't see any data.
Could you please let me know what I configured incorrectly on my Data Flow? Thank you very much in advance.
Here is the screenshot of the ADF data flow source data. It does see the data on the right side, as those values are not NULL, but the left side is all NULLs. I imagine the right side is the data from the source CSV on the blob, right? And the left side is the sink destination, since the table is empty for now?
And here is the screenshot of the sink's Inspect input. I think this is correct, as it reads the 2 columns (Parent, Child) correctly, isn't it?
After adding Map Drifted to map "Parent" => "parent" and "Child" => "child", I get this error message after running the pipeline.
When checking the sink data preview, I get this error message. It seems like there is an incorrect mapping?
I changed the MapDrifted1 expression to toString(byName('Parent1')) and likewise for Child1, as suggested.
The data flow executed successfully; however, I still get NULL data in the sink SQL table.
Can you copy/paste the script behind your data flow design graph? Go to the ADF UI, open the data flow, then click the Script button on top right.
In your Source transformation, click on Data Preview to see the data. Make sure you are seeing your data, not NULLs. Also, look at the Inspect on the INPUT for your Sink, to see if ADF is reading additional columns.
In the source CSV file the data contains white spaces. How can I remove those without using any transformation tool, just Azure Data Factory? I tried a ForEach activity around the Copy activity, but the ForEach items are a JSON array and string functions don't apply to it. Also, Data Factory does not support custom functions and expressions. Is there any way to remove the white spaces from the source, or during the copy to the sink? Source and sink are both Azure Files.
Whether or not all the CSV data contains white spaces: as far as I know Data Factory, and in my experience, it's impossible to achieve that data conversion with the Copy activity alone! With Data Flow or other tools it is very easy.
There isn't a way to achieve this using ADF only or directly.
HTH.
The most performant way to achieve this would be to temporarily stage the data in Azure SQL or Cosmos DB and then trim each column with an explicit SELECT statement as the source of the subsequent Copy activity moving the data to your sink file.
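As a hedged sketch (the staging table and column names are made up), the trim step on the Azure SQL side could be an explicit SELECT used as the source of the follow-up Copy activity:

```sql
-- Hedged sketch: trim the staged columns before copying them on to the sink.
-- staging.csv_data, col1 and col2 are hypothetical names.
-- TRIM() is available in Azure SQL Database / SQL Server 2017+; otherwise use LTRIM(RTRIM(...)).
SELECT TRIM(col1) AS col1,
       TRIM(col2) AS col2
FROM   staging.csv_data;
```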
I am building a NiFi flow to get JSON elements from a Kafka topic and write them into a Hive table.
However, there is very little to no documentation about the processors and how to use them.
What I plan to do is the following:
kafka consume --> ReplaceText --> PutHiveQL
Consuming the Kafka topic works great: I receive a JSON string.
I would like to extract the JSON data (with ReplaceText) and put it into the Hive table (PutHiveQL).
However, I have absolutely no idea how to do this. Documentation is not helping and there is no precise example of processor usage (or I could not find one).
Is my theoretical solution valid?
How do I extract the JSON data, build an HQL query, and send it to my local Hive database?
Basically you want to transform your record from Kafka into an HQL statement and then send that statement to the PutHiveQL processor.
I am not sure the Kafka record -> HQL transformation can be done with ReplaceText alone (it seems a little bit hard/tricky). In general I use a custom Groovy script processor to do this.
Edit
Global overview:
EvaluateJsonPath
This extracts the timestamp and uuid properties from my JSON flowfile and puts them as attributes on the flowfile.
ReplaceText
This sets the flowfile content to an empty string and replaces it with the Replacement Value property, in which I build the query.
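As a hedged illustration (the table and column names are made up), the Replacement Value could use NiFi Expression Language to build the HQL statement from the attributes extracted by EvaluateJsonPath; PutHiveQL then executes the flowfile content as HiveQL:

```sql
-- Hedged sketch of a ReplaceText Replacement Value; my_events and its columns are hypothetical.
-- ${timestamp} and ${uuid} are the flowfile attributes set by EvaluateJsonPath.
INSERT INTO my_events (event_ts, event_uuid)
VALUES ('${timestamp}', '${uuid}')
```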
You can directly inject the streaming data using the PutHiveStreaming processor.
Create an ORC table with a structure matching the flow and pass the flow to the PutHive3Streaming processor; it works.
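As a hedged sketch (the table and columns are placeholders), such a table could be declared like this; Hive streaming expects an ORC, ACID-enabled table, and on older Hive versions (pre-Hive 3) it also needs to be bucketed:

```sql
-- Hedged sketch: an ORC, transactional table whose columns match the incoming flow.
-- my_events and its columns are hypothetical.
CREATE TABLE my_events (
    event_ts   STRING,
    event_uuid STRING
)
-- CLUSTERED BY (event_uuid) INTO 4 BUCKETS  -- required for streaming on Hive 1.x/2.x
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```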
I'm creating a job that takes the input of a JSON input step from a table, and I'm trying to do an ETL metadata injection. I tried with v5.4 and v6.0 but neither is working.
Is there any workaround for my scenario?