Create a historical sequence in Talend Big Data jobs

I have a requirement to create a sequence in Talend.
Basically, records come in from a source file.
For each source row I want to generate a unique number.
Here is where it gets complicated:
when a new file comes in the next day, Talend should pick up the last generated number and increment it by 1.
For example:
today the last generated sequence number is 100;
tomorrow Talend should continue the sequence from 100, i.e. 101, 102, 103, 104...
This means Talend has to keep a history of the previously generated last sequence number.
Thanks

So, in such a case you have to persist this last sequence value somewhere: in the target database (if any) or in a dedicated file.
If the records are stored in a database, you can also get the max value from the corresponding field with an appropriate SELECT.
Once you have the desired value, store it in a global variable, then reuse this variable to initialize the sequence with something like:
Numeric.sequence("yourSequence", (Integer)globalMap.get("yourGlobal"), 1)
Hope this helps.
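Outside of Talend, the persistence logic this answer describes can be sketched in plain Java along these lines. The state-file name and the in-memory row source are made up for illustration; in a real job, file components and the Numeric.sequence call above would play these roles:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class SequencePersistence {
        // Hypothetical location of the persisted last sequence value.
        private static final Path STATE_FILE = Path.of("last_sequence.txt");

        // Read the last generated number; default to 0 on the very first run.
        static int loadLastSequence() throws IOException {
            if (!Files.exists(STATE_FILE)) {
                return 0;
            }
            return Integer.parseInt(Files.readString(STATE_FILE).trim());
        }

        // Persist the last generated number for the next day's run.
        static void saveLastSequence(int value) throws IOException {
            Files.writeString(STATE_FILE, Integer.toString(value));
        }

        public static void main(String[] args) throws IOException {
            int seq = loadLastSequence();                 // e.g. 100 from yesterday
            List<String> rows = List.of("a", "b", "c");   // stand-in for today's file
            for (String row : rows) {
                seq++;                                    // 101, 102, 103, ...
                System.out.println(seq + "," + row);
            }
            saveLastSequence(seq);                        // remembered for tomorrow
        }
    }

The key point is only that the last value survives between runs; whether it lives in a file, as here, or in a database table is an implementation detail.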

Related

Add new column to existing table in Pentaho

I have a table input and I need to add a calculation to it, i.e. add a new column. I have tried:
doing the calculation and then feeding it back. Obviously, this stuck the new data onto the old data.
doing the calculation and then feeding it back but truncating the table. As the process got stuck at some point, I assume I was truncating the table while the data was still being extracted from it.
using a stream lookup and then feeding back. Of course, this also stuck the data on top of the existing data.
using a stream lookup where I pull the data from the table input, do the calculation, and at the same time pull the data from the same table and do a lookup based on the unique combination of date and id, then use the 'Update' step.
As it has been running for a while, I am positive this is not the option either, but I have exhausted my options.
It seems that you need to update the table your data came from with this new field. Use the Update step with fields A and B as keys.
Actually, once you connect the hop, the result of the first step is automatically carried forward to the next step. So let's say you have a Table Input step and then you add a Calculator step where you create the third column. After writing the logic, right-click on the Calculator step and click Preview; you will get the result with all three columns.
I'd say your issue is not ONLY in the Pentaho implementation; there are some things you can do before reaching data staging in Pentaho.
'Workin Hard' is correct when he says you shouldn't use the same table: leave the input untouched and just upload/insert the new values into a new table. It doesn't have to be a new table EVERY TIME; instead of truncating the original, you truncate the staging table (the output table).
How many 'new columns' will you need? Will every iteration of this run create a new column in the output? Or will you always have a 'C' column which is always A+B or some other calculation? I'm sorry, but this isn't clear. If it is the latter, you don't need Pentaho for the transformation: updating the 'C' column from A+B can be done directly in most relational DBMSs with a simple UPDATE statement, as sketched below. Yes, it can be done in Pentaho, but you're adding a lot of overhead and processing time.
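For the "C is always A+B" case, a direct UPDATE really is all it takes. A minimal JDBC sketch, assuming a PostgreSQL database and made-up table and column names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class UpdateDerivedColumn {
        public static void main(String[] args) throws SQLException {
            // Connection string, credentials, table and column names are placeholders.
            String url = "jdbc:postgresql://localhost:5432/mydb";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement()) {
                // Derive column c from a and b in place; no ETL round trip needed.
                int updated = stmt.executeUpdate("UPDATE my_table SET c = a + b");
                System.out.println("Rows updated: " + updated);
            }
        }
    }

Run once after each load, this replaces the whole feed-back loop in the transformation.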

Is it possible to store a variable (int) in SQL to then take effect once a certain date has been reached?

Is it possible to store a variable in a SQL table which will only take effect once a certain date is reached? The variable is the number of days to be added to a date to create a "TargetDate"; this variable can be changed by user input but must have an "EffectiveDate".
You can certainly create a configuration table to store config data in. In your case, one of the items would have 'EFCTV_BUFFER' as the value in the key column and (for example) '5' as the number of days in the value column. Then you can reference that key to select the buffer-days value and add it to whatever date you want.
This allows you to modify it at any time, as requested.
You would reference this table in an insert/update trigger on the table where you store your dates. I would suggest having two dates so that the calculation is only done once, unless you need the calculation to be dynamic based upon the current 'EFCTV_BUFFER' configuration value.
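A minimal sketch of how the lookup could work, assuming a SQL Server database and hypothetical config table/column names (cfg_key, cfg_value, effective_date):

    import java.sql.Connection;
    import java.sql.Date;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.time.LocalDate;

    public class ConfigBuffer {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:sqlserver://localhost;databaseName=mydb"; // placeholder
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement ps = conn.prepareStatement(
                     // Pick the most recent buffer whose effective date has been reached.
                     "SELECT TOP 1 cfg_value FROM config "
                     + "WHERE cfg_key = 'EFCTV_BUFFER' AND effective_date <= ? "
                     + "ORDER BY effective_date DESC")) {
                ps.setDate(1, Date.valueOf(LocalDate.now()));
                try (ResultSet rs = ps.executeQuery()) {
                    // Fall back to 0 days if no buffer is effective yet.
                    int bufferDays = rs.next() ? rs.getInt("cfg_value") : 0;
                    LocalDate targetDate = LocalDate.now().plusDays(bufferDays);
                    System.out.println("TargetDate = " + targetDate);
                }
            }
        }
    }

Because each row carries its own effective_date and the query takes the most recent one that has passed, a newly entered buffer value "takes effect" automatically on its date, with no further intervention.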

Pentaho: compare values from a table to a number from a REST API

I need to build a dimension for a data warehouse using Pentaho.
I need to compare a number in a table with the number I get from a REST call.
If the number is not in the table, I need to set it to a default (999). I was thinking of using a Table Input step with a select statement, and a JavaScript step that sets the result to 999 if it is null. The problem is that if there is no result, nothing is passed through at all. How can this be done? Another idea was to get all values from that table and somehow convert them into something I can read as an array in JavaScript. I'm very new to Pentaho DI; I've done some research but couldn't find what I was looking for. Does anyone know how to solve this? If you need more information, or want to see my transformation, let me know!
The steps would be something like this:
Load number from api
Get Numbers from table
A) If number not in table -> set number to value 999
B) If number is in table -> do nothing
Continue with transformation with that number
But the problem with what I have at the moment is that if the number is not in the table, it returns nothing. I was trying to check in JavaScript whether the number is null or 0 and then set it to 999.
Thanks in advance!
Replace the rain-type Table Input with a Stream Lookup.
You read the main input with a REST step and the dimension table with a Table Input, then add a Stream Lookup step in which you specify the Table Input as the lookup step. In this step you can also specify a default value of 999.
The Stream Lookup works like this: for each row coming in from the main stream, the step looks up whether it exists in the reference step and adds the reference fields to the row. So there is always one and exactly one row passing through.
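Conceptually, a Stream Lookup with a default value behaves like a keyed map lookup with a fallback. A minimal Java sketch with made-up values:

    import java.util.List;
    import java.util.Map;

    public class StreamLookupDefault {
        public static void main(String[] args) {
            // Reference data as loaded from the dimension table (made-up values).
            Map<Integer, Integer> dimension = Map.of(1, 1, 2, 2, 7, 7);

            // Numbers arriving from the REST call (the main stream).
            List<Integer> mainStream = List.of(1, 7, 42);

            for (int number : mainStream) {
                // Exactly one row passes through per input row:
                // the matched value, or the default 999 when there is no match.
                int resolved = dimension.getOrDefault(number, 999);
                System.out.println(number + " -> " + resolved);
            }
        }
    }

This is why the "nothing is passed through" problem disappears: the main stream drives the row count, and the lookup only decorates each row.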

Issue with creating a dimension table with a last-update-time column in Pentaho Data Integration

I am creating a dimension table with a last-updated-time column (from the Get System Info step) in Pentaho Data Integration (PDI). It works fine except that it inserts new rows even when there are no changes in a row. The reason is that the lookup is also performed on the last-updated-time field, which it should not be. When I remove this field from the key fields of the Dimension Lookup/Update step, it works as expected, but the values in the last-updated-time field come out empty. Thanks in advance for any solution/suggestion.
I expect you are talking about SCD II (slowly changing dimension of type 2) here, and that you want to store the date when a row is inserted into the SCD table.
Instead of obtaining the date from a Get System Info step, you can use the 'Date of last insert (without stream field as source)' type of dimension update in the Fields tab of the Dimension Lookup/Update step, which stores a datetime automatically in the defined table column.
Additional hint: if you need to store the maximum value of some date from a source system table that is relevant for loading new/changed data, store that maximum right after the Dimension Lookup/Update step into a separate table, and use it when loading updated data at the beginning of the ETL transformation.
I think it is better to use the components below:
Step 1: Using a Table Input step, get the max value from the target system and pass it to the next step.
Step 2: Take one more Table Input step, write the source query, and bind the previous value in the where clause (as ?) - see the sketch below.
Step 3: Then perform the usual operations at the target level.
I think you get the above steps.
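A plain JDBC sketch of the three steps above, with placeholder connection strings and table/column names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.sql.Timestamp;

    public class IncrementalLoad {
        public static void main(String[] args) throws SQLException {
            // Connection strings and table/column names are placeholders.
            try (Connection target = DriverManager.getConnection("jdbc:postgresql://localhost/dwh", "user", "pw");
                 Connection source = DriverManager.getConnection("jdbc:postgresql://localhost/src", "user", "pw")) {

                // Step 1: highest value already loaded into the target dimension.
                Timestamp lastLoaded;
                try (Statement st = target.createStatement();
                     ResultSet rs = st.executeQuery("SELECT MAX(last_updated) FROM dim_customer")) {
                    rs.next();
                    lastLoaded = rs.getTimestamp(1);
                }
                if (lastLoaded == null) {
                    // First run: fall back to a sentinel so everything is loaded.
                    lastLoaded = Timestamp.valueOf("1900-01-01 00:00:00");
                }

                // Step 2: bind that value as the ? in the source query.
                try (PreparedStatement ps = source.prepareStatement(
                         "SELECT * FROM customer WHERE last_updated > ?")) {
                    ps.setTimestamp(1, lastLoaded);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // Step 3: the usual insert/update against the target goes here.
                        }
                    }
                }
            }
        }
    }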

Derived date calculation

I am currently loading data into a SQL Server database using SSIS. The plan is for this to happen each week, but the day it happens may differ depending on when the data is pushed through.
I use SSIS to grab data from an Excel worksheet and enter each row into the database (about 150 rows per week). The only common denominator between all the rows is the date. I want to add a date to each of the rows on the day it gets pushed through. Because the push date may differ, I can't use the current date; I want to use a week on from the previous date entered for those rows.
But because there are about 150 rows I don't know how to achieve this. It would be nice if I could set this up in SQL Server so that every time a new set of rows is entered, it adds 7 days to the date of the previous set of rows. But I would also be happy to do this in SSIS.
Does anyone have any clue how to achieve this? Alternatively, I don't mind doing it in C# either.
Here's one way to do what you want:
Create a column for tracking the data entry date in your target table.
Add an Execute SQL Task before the Data Flow Task. This task will retrieve the latest data entry date + 7 days. The query should be something like:
select dateadd(day,7,max(trackdate)) from targettable
Assign the SQL result to a package variable.
Add a Derived Column Transformation between your Source and Destination components in the Data Flow Task. Create a dummy column to hold the tracking date and assign the variable to it.
When you map the Excel columns to the table in the Data Flow Task, map the dummy column created earlier to the tracking-date column. Now when you write the data to the DB, your tracking column will have the desired date.
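The same flow can be prototyped outside SSIS, for example in plain Java with JDBC. Here, targettable and trackdate come from the query above; everything else (connection string, payload column, sample rows) is a placeholder:

    import java.sql.Connection;
    import java.sql.Date;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.time.LocalDate;
    import java.util.List;

    public class WeeklyTrackingDate {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:sqlserver://localhost;databaseName=mydb"; // placeholder
            try (Connection conn = DriverManager.getConnection(url, "user", "pw")) {

                // Equivalent of the Execute SQL Task: last tracking date + 7 days.
                LocalDate nextDate;
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT DATEADD(day, 7, MAX(trackdate)) FROM targettable")) {
                    rs.next();
                    Date d = rs.getDate(1);
                    // Very first load: fall back to today if the table is empty.
                    nextDate = (d == null) ? LocalDate.now() : d.toLocalDate();
                }

                // Equivalent of the Derived Column: stamp every incoming row.
                try (PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO targettable (payload, trackdate) VALUES (?, ?)")) {
                    for (String row : List.of("row1", "row2")) { // stand-in for the Excel rows
                        ps.setString(1, row);
                        ps.setDate(2, Date.valueOf(nextDate));
                        ps.addBatch();
                    }
                    ps.executeBatch();
                }
            }
        }
    }

All 150-odd rows of a weekly batch get the same stamp because the date is computed once, before the rows are written, which matches the requirement that the date is the batch's common denominator.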