Executing "Select *" query for multiple tables using Apache NiFi - sql

Requirement: From an UI I am getting selected list of tables of a database. Data from these tables are to extracted and stored in file location. We are expected to use NiFi Rest APIs as there is a requirement for custom UI. So we are invoking NiFi processors using REST APIs.
Issue: With ExecuteSQL processor I can execute one SQL query at a time. Since the query is same for all of the tables (select * ...) I can pass in the table name as an attribute to the processor. But I am getting the table names from a REST API call so its delimited string.
the issue is how to invoke the ExecuteSQL in a loop for each table name received in a delimited String.
Kindly let me know if additional information is required.

ExecuteSQL can't loop on a single flow file. Instead you can use ReplaceText to put the delimited string into the body of the flow file, then use SplitText to split on the delimiter. You will get multiple flow files, each with the body containing one of the fields. Then you can use ReplaceText again to match the entire text, and replace with the SELECT statement around the group:
SELECT * from $1
Then you can send all flow files to ExecuteSQL, it will execute each SELECT one at a time (or concurrently if you set multiple concurrent tasks.

Related

Azure Data Factory 2 : How to split a file into multiple output files

I'm using Azure Data Factory and am looking for the complement to the "Lookup" activity. Basically I want to be able to write a single line to a file.
Here's the setup:
Read from a CSV file in blob store using a Lookup activity
Connect the output of that to a For Each
within the For Each, take each record (a line from the file read by the Lookup activity) and write it to a distinct file, named dynamically.
Any clues on how to accomplish that?
Use Data flow, use the derived column activity to create a filename column. Use the filename column in sink. Details on how to implement dynamic filenames in ADF is describe here: https://kromerbigdata.com/2019/04/05/dynamic-file-names-in-adf-with-mapping-data-flows/
Data Flow would probably be better for this, but as a quick hack, you can do the following to read the text file line by line in a pipeline:
Define your source dataset to output a line as a single column. Normally I would use "NoDelimiter" for this, but that isn't supported by Lookup. As a workaround, define it with an incorrect Column Delimiter (like | or \t for a CSV file). You should also go to the Schema tab, and CLEAR the schema. This will generate a column in the output named "Prop_0".
In the foreach activity, set the Items to the Lookup's "output.value" and check "Sequential".
Inside the foreach, you can use item().Prop_0 to grab the text of the line:
To the best of my understanding, creating a blob isn't directly supported by pipelines [hence my suggestion above to look into Data Flow]. It is, however, very simple to do in Logic Apps. If I was tackling this problem, I would create a logic app with an HTTP Request Received trigger, then call it from ADF with a Web activity and send the text line and dynamic file name in the payload.

SSIS: Excel data source - if column not exists use other column

I am using select statement in excel source to select just specific columns data from excel for import.
But I am wondering, is it possible to select data such way when I select for example column with name: Column_1, but if this column is not exists in excel then it will try to select column with name Column_2? Currently if Column_1 is missing, then data flow task fails.
Use a Script task and write .net code to read the excel file and then perform the check for the Column_1 availability in the file. If the column does not present then use Column_2 as input. Script Task in SSIS can act as a source.
SSIS is metadata based and will not support dynamic metadata, however you can use Script Component as #nitin-raj suggested to handle all known source columns. There is a good post below on how it can be done.
Dynamic File Connections
If you have many such files that can have varying columns then it is better to create a custom component.However, you cannot have dynamic metadata even with custom component, the set of columns should be known upfront to SSIS.
If the list of columns keep changing and you cannot know in advance what are expected columns then you are better off handling the entire thing in C#/VB.Net using Script Task of control flow
As a best practice, because SSIS meta data is static, any data quality and formatting issues in source files should be corrected before ssis data flow task runs.
I have seen this situation before and there is a very simple fix. In the beginning of your ssis package, using a file task to create copy of the source excel file and then run a c# script or execute a powershell to rename the columns so that if column 1 does not exist, it is either added at the appropriate spot in excel file or in case the column name is wrong is it corrected.
As a result of this, you will not need to refresh your ssis meta data every time it fails. This is a standard data standardization practice.
The easiest way is to add two data flow tasks, one data flow for each Excel source select statement and use precedence constraints to execute the second data flow when the first one fails.
The disadvantage of this approach is that if the first data flow task fails for another reason, it will also try to execute the second one. You will need some advanced error handling to check if the error is thrown due to missing columns or not.
But if have a similar situation, I will use a Script Task to check if the column exists and build the SQL command dynamically. Note that this SQL command must always return the same metadata (you must use aliases).
Helpful links
Overview of SSIS Precedence Constraints
Working with Precedence Constraints in SQL Server Integration Services
Precedence Constraints

Variable values stored outside of SSIS

This is merely a SSIS question for advanced programmers. I have a sql table that holds clientid, clientname, Filename, Ftplocationfolderpath, filelocationfolderpath
This table holds a unique record for each of my clients. As my client list grows I add a new row in my sql table for that client.
My question is this: Can I use the values in my sql table and somehow reference each of them in my SSIS package variables based on client id?
The reason for the sql table is that sometimes we get request to change the delivery or file name of a file we send externally. We would like to be able to change those things dynamically on the fly within the sql table instead of having to export the package each time and manually change then re-import the package. Each client has it's own SSIS package
let me know if this is feasible..I'd appreciate any insight
Yes, it is possible. There are two ways to approach this and it depends on how the job runs. First is if you are running for a single client for a single job run or if you are running for multiple clients for a single job run.
Either way, you will use the Execute SQL Task to retrieve data from the database and assign it to your variables.
You are running for a single client. This is fairly straightforward. In the Result Set, select the option for Single Row and map the single row's result to the package variables and go about your processing.
You are running for multiple clients. In the Result Set, select Full Result Set and assign the result to a single package variable that is of type Object - give it a meaningful name like ObjectRs. You will then add a ForEachLoop Enumerator:
Type: Foreach ADO Enumerator
ADO object source variable: Select the ObjectRs.
Enumerator Mode: Rows in all the tables (ADO.NET dataset only)
In Variable mappings, map all of the columns in their sequential order to the package variables. This effectively transforms the package into a series of single transactions that are looped.
Yes.
I assume that you run your package once per client or use some loop.
At the beginning of the "per client" code read all required values from the database into SSIS varaibles and the use these variables to define what you need. You should not hardcode client specific information in the package.

"UNLOAD" data tables from AWS Redshift and make them readable as CSV

I am currently trying to move several data tables in my current AWS instance's redshift database to a new database in a different AWS instance (for background my company has acquired a new one and we need to consolidate to on instance of AWS).
I am using the UNLOAD command below on a table and I plan on making that table a csv then uploading that file to the destination AWS' S3 and using the COPY command to finish moving the table.
unload ('select * from table1')
to 's3://destination_folder'
CREDENTIALS 'aws_access_key_id=XXXXXXXXXXXXX;aws_secret_access_key=XXXXXXXXX'
ADDQUOTES
DELIMITER AS ','
PARALLEL OFF;
My issue is that when I change the file type to .csv and open the file I get inconsistencies with the data. there are areas where many rows are skipped and on some rows when the expected columns end I get additional columns with the value "f" for unknown reasons. Any help on how I could achieve this transfer would be greatly appreciated.
EDIT 1: It looks like fields with quotes are having the quotes removed. Additionally fields with commas are having the commas separated away. I've identified some fields with quotes and commas and they are throwing everything off. Would the addquotes clause I have apply to the entire field regardless of whether there are quotes and commas within the field?
Default document will have extension as txt and with quotes. Try to open it with Excel and then save as csv file.
refer https://help.xero.com/Q_ConvertTXT

SSIS Script Component - only to change variables

I have a series of task that are very similar:
SELECT a,b FROM c
Lookup in another table and change value in column b.
Save new value back to c and if not match, send the result on to an error table.
That part is pretty straight forward and illustrated here:
Source ==> Lookup =match=> SQL Update command
=No match=> SQL Save Error command
(Hope you understand what I mean - but it works!)
I now have to repeat this a number of times, where my source-sql changes. So what I want to do is to insert a Script Component in front of the Source and set my User::Sql variable like:
Variables.Sql = "SELECT d, e FROM f"
All of the above is contained in a Data Flow. When I have created one I can then copy that one and only change the Sql variable in the script and then it should all work.
My problem is: When I insert the Script Command it asks me if it is a Source, Destination or Transscript script. And by only setting the variable it does not produce any rows for output and cannot connect to my Source.
Anyone know how to make that work?
(I have simplified the above. I actually want to update multiple variables and use those in my Source, Lookup and Error update as well - therefore it is not more simple just to change the SQL script in the initial Source! But being able to do the above, I will be able to achieve what I want :-))
You should set your variable containing the SQL query in the control flow, before you execute the dataflow.
Then you need to use that variable as an expression in your Dataflow. You can parametrize the query used in the lookup or any other parameters of your dataflow.
If your dataflows really have always the same structure, you could even generate a list of queries and call your dataflow task in a loop, preventing the duplication of the same tasks.