I have a queue of JSON messages in RabbitMQ and I would like to pick out the records that match certain conditions in StreamSets (using a Stream Selector) and then save a specific value to a new database (JDBC Producer). How do I write that specific value after the conditions and send it to the database?
From your pipeline diagram, it looks like you're trying to set the field values you need for the database. You should be able to do this in the JDBC Producer itself - configure the Field to Column Mapping.
Actually, you can use the standard StreamSets component named Expression Evaluator for this.
You can add or modify any fields with it.
Link to the documentation:
streamsets expression evaluator
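For example, a hypothetical sketch (the field names /status and /price and the values are assumptions, not from your pipeline): a Stream Selector condition and an Expression Evaluator output field could look like this:
Stream Selector condition: ${record:value('/status') == 'ACTIVE'}
Expression Evaluator, Output Field: /amount_with_tax
Expression: ${record:value('/price') * 1.21}
The matching records then flow on to the JDBC Producer, where /amount_with_tax is mapped to the target column.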
I'm using Azure Data Factory and am looking for the complement to the "Lookup" activity. Basically I want to be able to write a single line to a file.
Here's the setup:
Read from a CSV file in blob store using a Lookup activity
Connect the output of that to a For Each
Within the For Each, take each record (a line from the file read by the Lookup activity) and write it to a distinct file, named dynamically.
Any clues on how to accomplish that?
Use a Data Flow: use a Derived Column transformation to create a filename column, then use that column in the sink. Details on how to implement dynamic filenames in ADF are described here: https://kromerbigdata.com/2019/04/05/dynamic-file-names-in-adf-with-mapping-data-flows/
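As a rough sketch (the column and file names here are placeholders, not from your pipeline), the Derived Column expression could be something like:
fileName = concat('output_', toString(currentUTC(), 'yyyyMMddHHmmss'), '.csv')
In the sink settings, set the file name option to name files from column data and pick the fileName column.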
Data Flow would probably be better for this, but as a quick hack, you can do the following to read the text file line by line in a pipeline:
Define your source dataset to output a line as a single column. Normally I would use "NoDelimiter" for this, but that isn't supported by Lookup. As a workaround, define it with an incorrect Column Delimiter (like | or \t for a CSV file). You should also go to the Schema tab, and CLEAR the schema. This will generate a column in the output named "Prop_0".
In the foreach activity, set the Items to the Lookup's "output.value" and check "Sequential".
Inside the foreach, you can use item().Prop_0 to grab the text of the line.
To the best of my understanding, creating a blob isn't directly supported by pipelines [hence my suggestion above to look into Data Flow]. It is, however, very simple to do in Logic Apps. If I was tackling this problem, I would create a logic app with an HTTP Request Received trigger, then call it from ADF with a Web activity and send the text line and dynamic file name in the payload.
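As a sketch only (the Logic App endpoint and property names are invented for illustration), the Web activity inside the ForEach could POST a body like:
{
    "fileName": "@{concat(pipeline().parameters.outputPrefix, '_', guid(), '.txt')}",
    "content": "@{item().Prop_0}"
}
The Logic App would then take fileName and content from the request and create the blob.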
I am building a NiFi flow to get JSON elements from a Kafka topic and write them into a Hive table.
However, there is very little to no documentation about the processors and how to use them.
What I plan to do is the following:
ConsumeKafka --> ReplaceText --> PutHiveQL
Consuming the Kafka topic works fine: I receive a JSON string.
I would like to extract the JSON data (with ReplaceText) and put it into the Hive table (PutHiveQL).
However, I have absolutely no idea how to do this. The documentation is not helping and there is no precise example of processor usage (or I could not find one).
Is my theoretical solution valid?
How do I extract the JSON data, build an HQL query and send it to my local Hive database?
Basically, you want to transform your record from Kafka into an HQL request and then send that request to the PutHiveQL processor.
I am not sure the Kafka record -> HQL transformation can be done with ReplaceText alone (it seems a little bit hard/tricky). In general I use a custom Groovy script processor to do this.
Edit
Global overview:
EvaluateJsonPath
This extracts the timestamp and uuid properties from my JSON flowfile and puts them into attributes of the flowfile.
ReplaceText
This replaces the entire flowfile content with the Replacement Value property, in which I build the query.
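As an illustration (the table name and attribute names below are assumptions, not taken from the original flow), the two processors could be configured roughly like this:
EvaluateJsonPath (Destination = flowfile-attribute):
    json.uuid = $.uuid
    json.timestamp = $.timestamp
ReplaceText (Replacement Strategy = Always Replace):
    Replacement Value = INSERT INTO my_table (uuid, event_ts) VALUES ('${json.uuid}', '${json.timestamp}')
PutHiveQL then executes the statement it finds in the flowfile content.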
You can inject the streaming data directly using the PutHiveStreaming processor.
Create an ORC table with a structure matching the flow and pass the flow to the PutHive3Streaming processor; it works.
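For reference, a minimal sketch of such a table (the names and column types are assumptions):
CREATE TABLE my_table (uuid STRING, event_ts TIMESTAMP)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
Hive streaming requires a transactional ORC table, so make sure ACID support is enabled on your Hive instance.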
I am using Dataflow's WriteToBigQuery with CREATE_IF_NEEDED, and thus have to specify the schema.
I define the schema in the beginning of my code (outside the actual pipeline), but since I need the flag --save_main_session, I get the same error as here, which explains that the schema cannot be passed along with the pipeline since a BigQuery schema definition is not pickleable.
The solution mentioned on the page is not an option for me (disable the --save_main_session flag), and thus the other option to specify the schema is through a string.
However, I need to set some fields to REQUIRED. Is there a way to do this with the string schema definition?
As you can see from bigquery.py the conversion from a string schema to a TableSchema is quite straightforward and does indeed set the mode to NULLABLE. Perhaps you can create the TableSchema with REQUIRED fields based on this code snippet.
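For example, a minimal sketch with the Beam Python SDK (the field names are placeholders) that builds a TableSchema with REQUIRED fields:
from apache_beam.io.gcp.internal.clients import bigquery

table_schema = bigquery.TableSchema()

id_field = bigquery.TableFieldSchema()
id_field.name = 'id'
id_field.type = 'STRING'
id_field.mode = 'REQUIRED'
table_schema.fields.append(id_field)

score_field = bigquery.TableFieldSchema()
score_field.name = 'score'
score_field.type = 'FLOAT'
score_field.mode = 'NULLABLE'
table_schema.fields.append(score_field)

# Pass the TableSchema object instead of the schema string, e.g.:
# beam.io.WriteToBigQuery(..., schema=table_schema,
#                         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
Whether this plays nicely with --save_main_session may depend on where you build the object (for instance inside the function that constructs the pipeline), so treat it as a starting point rather than a guaranteed fix.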
I need, at runtime, to change which connection is used by a table input step.
I have 3 connections defined: STG, DWH, DM.
I want to choose at runtime between them.
I can't create a new connection with parameters for server name, database name, etc. I must use the existing connections.
I wish I could write a variable such as ${my_connection} in the connection field of the step, but that field cannot be edited.
Any suggestion?
Instead of using a variable in the connection selector of the step, use variables in the Host Name and Database Name fields of the connection configuration.
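For instance (the variable names are just examples), the shared connection could be defined with:
Host Name: ${DB_HOST}
Database Name: ${DB_NAME}
Port Number: ${DB_PORT}
and the values supplied per run through kettle.properties or as parameters of the job/transformation.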
EDIT:
You can pass a variable for the KTR to capture and test it using a Switch/Case step that calls a Transformation Executor. In that KTR you'll have your Table Input and a Copy Rows to Result step, whose results will be captured after the Transformation Executor. You'll need 3 different KTRs, each with the Table Input step that will execute the row passed by the Switch/Case step.
If I'm not clear or you need further explanation, I can perhaps produce an example.
Is it possible to write an output parameter to a dataset?
I have a Get Metadata activity that stores the file name of an Azure Blob dataset, and I would like to write that value into another Azure Blob dataset as an additional column via a Copy activity.
Thanks
If you are looking to get the output of the previous operation as an input to the next operation, you could probably go ahead in the following manner.
I am assuming that the attribute you are getting is childItems; the values for this can be obtained in the next step using the following expression:
@activity('Name_of_activity').output.childItems
This would return an Array of your subfolders.
The ADF documentation on expressions and functions should help you with the expression syntax.
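As a sketch (the activity and column names are placeholders), the file name from Get Metadata can be surfaced as an additional column in the Copy activity source:
Additional columns:
    Name: source_file_name
    Value: @activity('Get Metadata1').output.itemName
Or, when iterating over childItems with a ForEach (Items = @activity('Get Metadata1').output.childItems), use @item().name inside the loop.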