Azure Data Factory CDS Connector Random Records - azure-data-factory-2

I have an online implementation of Dynamics 365 where I need to use Azure Data factory to do the following:
1 - pull some records based on a fetchxml criteria
2 - select random 100 records out of these
3 - insert them into a different entity in D365.
I am using the ADF CDS connector which only supports copy activity (does not support data flows as yet)
What I am hoping I can do is the following:
Task 1 - copy all records into a CSV file, adding an extra column that contains a random ID
Issue here: when I do this and use the rand() function, all the numbers returned are the same.
The same issue happens if I try to use #guid(): all values come back the same.
Question 1 - Is there a reason why rand() and guid() are returning the same values for all records, and is there a way to work around it?
Question 2 - Is there another way that I can't think of to achieve what I am trying to do: pick x random records from a dataset?

Is there a reason why rand() and guid() are returning the same values
for all records and is there a way to work around it?
This is because ADF evaluates this expression first, just one time, and then uses its result as the column value for every row. So you get the same data. As a workaround, you need to copy the records into a CSV file first, then use the CSV file as the source in a Data Flow and use a Derived Column to add the randID column. You could also create an Azure Function that adds the randID column and invoke it from ADF.
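For the Azure Function route (or any external step), the core logic, read the exported CSV and keep 100 randomly chosen rows, could look like the following Python sketch. The file paths, function name, and `k` parameter are placeholders; in a real Azure Function you would read and write blob storage instead of local files.

```python
import csv
import random

def sample_records(in_path, out_path, k=100, seed=None):
    """Read the exported CSV and write k randomly chosen rows to a new file."""
    rng = random.Random(seed)
    with open(in_path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fields = reader.fieldnames
    # random.sample picks k distinct rows; guard against files shorter than k
    chosen = rng.sample(rows, min(k, len(rows)))
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(chosen)
```

The sampled file can then feed a second copy activity that inserts the rows into the target entity.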

Related

Pentaho step - Use SQL functions to add a column in data before dumping it into DB

I am fairly new to Pentaho, and while working on it, I have stumbled across a problem. Below is how my flow is:
Read input from a file. Let's say this has 5 columns.
Make some modifications to existing columns. (Filter, modify and all).
Add a new column, which will be equal to an SQL function of the current row data. Example, it can be sum(id, id+1)
Dump to the database.
Steps 1, 2, and 4 are already in place and working fine. It's Step 3 where I am stuck. I've tried Execute SQL, but that is only for modifying DDL and doesn't return data. Table Input needs the data to be in a table already, which isn't the case for me.
I have a workaround: I can insert all rows into the DB and then fire an update query, but I was hoping there is a better way to do this.
You can add a Formula step, and in the formula column you can specify what you want to achieve. For example, take another column + 1 and save it in a new field, or replace the existing field's value.
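The Formula step's behavior, computing a new field for every row from that row's existing fields, amounts to a simple row-wise transform. A sketch, with made-up field names:

```python
def add_derived_column(rows, new_field, fn):
    """Mimic a Formula step: compute new_field for each row from its other fields."""
    for row in rows:
        row[new_field] = fn(row)
    return rows

# Hypothetical rows with an "id" field; derive "id_plus_one" per row
rows = [{"id": 1}, {"id": 2}]
add_derived_column(rows, "id_plus_one", lambda r: r["id"] + 1)
```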

Add new column to existing table Pentaho

I have a table input and I need to add the calculation to it i.e. add a new column. I have tried:
to do the calculation and then feed it back. Obviously, it appended the new data to the old data.
to do the calculation and then feed it back, but truncate the table first. As the process got stuck at some point, I assume I was truncating the table while the data was still being extracted from it.
to use a Stream Lookup and then feed back. Of course, it also appended the data on top of the existing data.
to use a Stream Lookup where I pull the data from the Table Input, do the calculation, and at the same time pull the data from the same table and do a lookup based on the unique combination of date and id, then use the Update step.
As it has been running for a while, I am positive this is not the right option either, but I have exhausted my options.
It seems that you need to update the table your data came from with this new field. Use the Update step with fields A and B as keys.
Actually, once you connect the hop, the result of the 1st step is automatically carried forward to the next step. So let's say you have a Table Input step and then add a Calculator step where you create the 3rd column. After writing the logic, right-click on the Calculator step and click Preview; you will get the result with all 3 columns.
I'd say your issue is not ONLY in the Pentaho implementation; there are some things you can do before reaching data staging in Pentaho.
'Workin Hard' is correct when he says you shouldn't use the same table: leave the input untouched and just upload / insert the new values into a new table. It doesn't have to be a new table EVERY time; instead of truncating the original, you truncate the staging (output) table.
How many 'new columns' will you need? Will every iteration of this run create a new column in the output, or will you always have a 'C' column which is always A+B or some other calculation? I'm sorry, but this isn't clear. If it's the latter, you don't need Pentaho for the transformation: updating the 'C' column with a math function of A and B can be done directly in most relational DBMSs with a simple UPDATE clause. Yes, it can be done in Pentaho, but you're adding a lot of overhead and processing time.
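The in-database approach suggested above, recomputing a derived C column with one UPDATE instead of an ETL round-trip, can be illustrated like this (table and column names are hypothetical; SQLite stands in for whatever DBMS you use):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO t (a, b) VALUES (?, ?)", [(1, 2), (3, 4)])

# A single UPDATE recomputes C for every row; no ETL round-trip needed
conn.execute("UPDATE t SET c = a + b")
```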

Pentaho compare values from table to a number from REST api

I need to make a dimension for a datawarehouse using pentaho.
I need to compare a number in a table with the number I get from a REST call.
If the number is not in the table, I need to set it to a default (999). I was thinking of using a Table Input step with a select statement, and a JavaScript step that sets the result to 999 if it is null. The problem is that if there is no result, nothing is passed through. How can this be done? Another idea was to get all values from that table and somehow convert them so I can read the ids as an array in JavaScript. I'm very new to Pentaho DI; I've done some research but couldn't find what I was looking for. Does anyone know how to solve this? If you need more information, or want to see my transformation, let me know!
Steps something like this:
Load number from api
Get Numbers from table
A) If number not in table -> set number to value 999
B) If number is in table -> do nothing
Continue with transformation with that number
I have this atm:
But the problem is if the number is not in the table, it returns nothing. I was trying to check in javascript if number = null or 0 then set it to 999.
Thanks in advance!
Replace the rain-type Table Input with a Stream Lookup.
You read the main input with a REST step and the dimension table with a Table Input, then add a Stream Lookup in which you specify the dimension Table Input as the lookup step. In this step you can also specify a default value of 999.
The Stream Lookup works like this: for each row coming in from the main stream, the step looks up whether it exists in the reference step and adds the reference fields to the row. So there is always one, and exactly one, row passing through.
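The Stream Lookup behavior described above, every main-stream row passes through and a missing key gets the default, can be sketched in Python (field names and the 999 default follow the question; the function name is made up):

```python
def lookup_with_default(main_rows, reference_keys, key, default=999):
    """For each incoming row, look the key up in the reference set;
    rows whose key is missing get the default instead of being dropped."""
    ref = set(reference_keys)
    out = []
    for row in main_rows:
        value = row[key] if row[key] in ref else default
        out.append({**row, key: value})
    return out
```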

SSIS - Only Load Certain Records - Skip the remaining

I have a Flat File that I'm loading into SQL and that Flat file has 2 different RecordTypes and 2 Different File Layouts based on the RecordType.
So I may have
000010203201501011 (RecordType 1)
00002XXYYABCDEFGH2 (RecordType 2)
So I want to immediately check for records of RecordType1 and then send those records through [Derived Column] & [Data Conversion] & [Loading to SQL].
And I want to ignore all records of RecordType2.
I tried a Conditional Split, but it seems like the records of RecordType2 are still trying to go through the [Derived Column] & [Data Conversion] steps.
It gives me a DataConversion error on the RecordType2 Records.
I have the Conditional Split set up as RecordType == 1 to go thru the process i have set up.
I guess Conditional Split isn't set up to be used this way?
Where in my process can i tell it to check for RecordType1 and only send records past that point that are RecordType=1?
It makes perfect sense you are having data type errors for Record Type 2 rows since you probably have defined columns along with their data types based on Record Type 1 records. I see three options to achieve what you want to do:
1. Have a Script Task in the control flow copy only Record Type 1 records to a fresh file that would be used by the data flow you already have (Pro: you do not need to touch the data flow; Con: the file is read twice), OR
2. In the existing data flow: instead of getting all the columns from the data source, read every line coming from the file as one big-fat column, then a Derived Column to get RecordType, then a Conditional Split, then a Derived Column to re-create all the columns you had defined in the data source, OR
3. Ideal if you have another package processing Record Type 2 rows: dump the file into a database table in the staging area, then replace the data source in your Data Flow with an OLE DB Source (or whatever you use) and obtain+filter the records with something like: SELECT substring(rowdata,1,5) AS RecordType, substring(rowdata,6,...) AS Column2, .... FROM STG.FileData WHERE substring(rowdata,1,5) = '00001'. If using this approach it would be better to have a dedicated column for RecordType.
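The filtering in option 1, a script that copies only Record Type 1 lines to a new file, can be sketched like this. The file names are placeholders, and the 5-character record-type prefix follows the example rows above:

```python
def copy_record_type(in_path, out_path, record_type="00001"):
    """Keep only lines whose first 5 characters match the wanted record type."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line[:5] == record_type:
                dst.write(line)
```

The data flow then reads the filtered file, so only one layout ever reaches the Derived Column and Data Conversion steps.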

Splitting data by column value into an indefinite number of tables using an ETL tool

I'm trying to split a table into multiple tables based on the value of a given column using Talend Open Studio. Let's say this column can contain any of the integer values of 1, 2, 3, etc. then according to this value, these rows should go to table_1, table_2, table_3 etc.
It would be best if I could solve this when the number of different values in that column is not known in advance, but for now we can assume that all these output tables exists already. The bottom line is that the number of different values and therefore the number of different tables are high enough that setting up the individual filters manually is not an option.
Is it possible to solve this using Talend Open Studio or any similar open-source ETL tool like Pentaho Kettle?
Of course, I could just write a simple script myself, but I would prefer to use a proper ETL tool since the complete ETL process is quite complex.
In PDI or Pentaho Kettle you could do this with partitioning. (A right click option on the step IIRC) Partitioning in PDI is designed for exactly this sort of problem.
Yes, it's possible to split the data into different tables on the basis of a single column, but for that you need to create the tables dynamically:
tFileInputDelimited -> tFlowToIterate -> tFixedFlowInput, and then you can use globalMap() to get the column values and use them to separate the data into different tables, i.e. use globalMap(column used to separate data) in the table name.
The first solution that came to my mind was using the replicator to transport the current row to three filters, which act as guards and only let through rows with 1, 2, or 3 in the given column. pic: http://i.imgur.com/FmvwU.png
But you could also build the table name dynamically, if that is what you want, pic: http://i.imgur.com/8LR7Q.png
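The dynamic-table-name approach, route each row to table_&lt;value&gt; and create tables on first use, so the set of values need not be known in advance, can be sketched like this (SQLite and the two-column schema are stand-ins for illustration):

```python
import sqlite3

def split_by_column(rows, column):
    """Insert each row into table_<value of column>, creating tables on demand."""
    conn = sqlite3.connect(":memory:")
    for row in rows:
        # Table name is built from the row's value, as in the dynamic Talend approach
        table = "table_%d" % row[column]
        conn.execute("CREATE TABLE IF NOT EXISTS %s (id INTEGER, val TEXT)" % table)
        conn.execute("INSERT INTO %s VALUES (?, ?)" % table, (row["id"], row["val"]))
    return conn

rows = [{"id": 1, "val": "a", "grp": 1},
        {"id": 2, "val": "b", "grp": 2},
        {"id": 3, "val": "c", "grp": 1}]
conn = split_by_column(rows, "grp")
```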