I am trying to measure the duration of a Dataflow pipeline that pulls messages from Pub/Sub and loads them into a BigQuery table. I cannot find a way to get the last modified time of a row in a BigQuery table, although there is a last modified datetime for the table itself.
Does anyone know how to set a last modified datetime on each row of a BigQuery table?
You should include the current timestamp in the application that creates the output data structure. That would be the event time in some sense (you can add more granularity by adding event times on the client or on the server depending on how your events originate).
Then you possibly want to record the time before processing (right after the message is read from Pub/Sub). Then you want to record the time right before you write into BigQuery.
You can do both of these with a DoFn as an extra step, or include them as the first action in the first transformation and the last action in the last transformation of your pipeline.
Add these new columns to the table schema of the output BigQuery table.
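Once those columns exist, the pipeline duration can be measured per row directly in BigQuery. A minimal Standard SQL sketch, assuming hypothetical column names event_time, read_time and write_time and an illustrative table name:

-- BigQuery Standard SQL; column and table names are assumptions, not your schema.
SELECT
  TIMESTAMP_DIFF(write_time, read_time, MILLISECOND) AS pipeline_latency_ms,
  TIMESTAMP_DIFF(write_time, event_time, MILLISECOND) AS end_to_end_latency_ms
FROM `my_project.my_dataset.my_table`;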
Related
I am trying to copy the past 6 months of data from ADLS to my Azure SQL DB using a Data Factory pipeline.
When I enter the start time and end time in the source filter like below, it copies data starting from the current date and going back to older dates (descending order), but my requirement is to copy data from the oldest date to the current date (ascending order). Please suggest how to get the data in ascending date order.
There is no option to sort the source files in the Copy data activity.
You can store the file names with their last modified dates in a SQL table, and pull the list sorted by date to process the files in order (see the T-SQL sketch after these steps).
Get the list of item names using the Get Metadata activity.
Pass the child items to a ForEach activity.
Inside the ForEach, get the last modified date of each item using another Get Metadata activity.
Pass the item name and the Get Metadata output date to a Stored procedure activity to insert them into the SQL table.
After the first ForEach, add a Lookup activity to get the file names ordered by date.
Pass the Lookup activity output to another ForEach activity and copy each file's data to the sink.
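A minimal T-SQL sketch of the pieces above; the table, procedure and column names are illustrative, not something ADF requires:

-- Table that the Stored procedure activity fills inside the first ForEach.
CREATE TABLE dbo.FileLoadList (
    FileName     NVARCHAR(500) NOT NULL,
    LastModified DATETIME2     NOT NULL
);
GO

CREATE PROCEDURE dbo.InsertFileEntry
    @FileName     NVARCHAR(500),
    @LastModified DATETIME2
AS
BEGIN
    INSERT INTO dbo.FileLoadList (FileName, LastModified)
    VALUES (@FileName, @LastModified);
END
GO

-- Query used by the Lookup activity: oldest files first.
SELECT FileName
FROM dbo.FileLoadList
ORDER BY LastModified ASC;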
I have a Table input and I need to add a calculation to it, i.e. add a new column. I have tried:
to do the calculation and then feed it back. Obviously, it appended the new data to the old data.
to do the calculation and then feed it back but truncate the table. As the process got stuck at some point, I assume I was truncating the table while the data was still being extracted from it.
to use a Stream lookup and then feed it back. Of course, it also appended the data on top of the existing data.
to use a Stream lookup where I pull the data from the Table input, do the calculation, and at the same time pull the data from the same table and do a lookup based on the unique combination of date and id, then use the Update step.
As it has been running for a while, I am fairly sure this is not the right option, but I have exhausted my options.
It seems that you need to update the table your data came from with this new field. Use the Update step with fields A and B as keys.
Actually, once you connect the hop, the result of the first step is automatically carried forward to the next step. So let's say you have a Table input step and then you add a Calculator step where you create a third column. After writing the logic, right-click the Calculator step and click Preview; you will see the result with all three columns.
I'd say your issue is not ONLY in the Pentaho implementation; there are some things you can do before reaching data staging in Pentaho.
'Workin Hard' is correct when he says you shouldn't use the same table: leave the input untouched and just upload / insert the new values into a new table. It doesn't have to be a new table EVERY TIME; instead of truncating the original, you truncate the staging (output) table.
How many 'new columns' will you need? Will every iteration of this run create a new column in the output? Or will you always have a 'C' column which is always A+B or some other calculation? I'm sorry, but this isn't clear. If it's the latter, you don't need Pentaho for the transformation: updating the 'C' column with a calculation over A and B can be done directly in most relational DBMSs with a simple UPDATE clause. Yes, it can be done in Pentaho, but you're adding a lot of overhead and processing time.
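A minimal sketch of that UPDATE, assuming a table named my_table with numeric columns A and B and a derived column C that should hold A + B:

UPDATE my_table
SET C = A + B
WHERE C IS NULL OR C <> A + B;  -- only touch rows that still need the value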
I need to synchronize some data from one database to another using a Kettle/Spoon transformation. The logic is: select the latest date that already exists in the destination DB, then select from the source DB the data after that date. What transformation steps do I need to do this?
Thank you.
There can be many solutions:
If you have timestamp columns in both the source and destination tables, then you can take two Table input steps. In the first one, just select the max last updated timestamp and use it as a variable in the next Table input, where it acts as a filter on the source data. You can do something like this:
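Here is a rough sketch of the two queries; the table and column names are assumptions, and the variable would typically be filled from the first query via a Set Variables step in a preceding transformation:

-- First Table input: get the high-water mark from the destination table.
SELECT MAX(last_updated) AS max_last_updated FROM target_table;

-- Second Table input: pull only newer rows from the source
-- (tick "Replace variables in script?" on this step).
SELECT *
FROM source_table
WHERE last_updated > '${MAX_LAST_UPDATED}';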
If you just want the new data to be reflected in the destination table and you don't care much about the timestamp, I would suggest using the Insert/Update step for output. It brings all the data into the stream; if it finds a match, it won't insert anything, if it doesn't find a match, it inserts the new row, and if it finds any modifications to an existing row in the destination table, it updates it accordingly.
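To make those semantics concrete, the Insert/Update step behaves roughly like the following SQL (table, key and column names are illustrative only):

MERGE INTO dest_table AS d
USING source_table AS s
    ON d.id = s.id
WHEN MATCHED AND (d.col1 <> s.col1 OR d.col2 <> s.col2) THEN
    UPDATE SET d.col1 = s.col1, d.col2 = s.col2
WHEN NOT MATCHED THEN
    INSERT (id, col1, col2) VALUES (s.id, s.col1, s.col2);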
I am creating a dimension table with a last updated time (from the Get System Info step) in Pentaho Data Integration (PDI). It works fine except that it inserts new rows even when there are no changes in a row; the reason is that the lookup is also performed on the last updated time field, which it should not be. When I remove this field from the key fields of the Dimension lookup/update step, it works as expected, but the last updated time field stays empty. Thanks in advance for any solution/suggestion.
I expect you are talking about SCD II (slowly changing dimension of type 2) here, and you want to store the date when a row is inserted into the SCD table.
Instead of obtaining the date from the Get System Info step, you can use the 'Date of last insert (without stream field as source)' type of dimension update in the Fields tab of the Dimension lookup/update step, which stores a datetime automatically in the defined table column.
Additional hint: if you need to store the maximum value of some date from a source system table which is relevant for loading new/changed data, store that maximum right after the Dimension lookup/update step into a separate table and use it when loading updated data at the beginning of the ETL transformation.
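A small sketch of what that separate table could look like; both the control table and the dimension/column names here are hypothetical:

-- Control table holding the high-water mark per loaded table.
CREATE TABLE etl_control (
    table_name      VARCHAR(100) PRIMARY KEY,
    max_source_date TIMESTAMP
);

-- Run right after the Dimension lookup/update step, e.g. via an Execute SQL step.
UPDATE etl_control
SET max_source_date = (SELECT MAX(source_updated_at) FROM dim_customer)
WHERE table_name = 'dim_customer';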
I think it is better to use the below steps:
Step 1: Using a Table input step, get the max value from the target system and pass it to the next step.
Step 2: Take one more Table input step, write the source query, and use the previous value in the where clause (as a ? parameter).
Step 3: Then perform the usual operations at the target level.
I think you get the idea from the above steps.
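A rough sketch of the two queries from steps 1 and 2; table and column names are illustrative, and in the second Table input you would set 'Insert data from step' to the first step so the ? placeholder receives the max value:

-- Step 1: Table input against the target system.
SELECT MAX(updated_at) AS max_updated_at FROM target_table;

-- Step 2: Table input against the source system.
SELECT *
FROM source_table
WHERE updated_at > ?;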
I am currently entering data into a SQL Server database using SSIS. The plan is for it to do this each week, but the day it happens may differ depending on when the data is pushed through.
I use SSIS to grab data from an Excel worksheet and enter each row into the database (about 150 rows per week). The only common denominator between all the rows is the date. I want to add a date to each of the rows on the day it gets pushed through. Because the push date may differ, I can't use the current date; I want to use a week on from the previous date entered for those rows.
But because there are about 150 rows, I don't know how to achieve this. It would be nice if I could set this up in SQL Server so that every time a new set of rows is entered, it adds 7 days to the date of the previous set of rows. But I would also be happy to do this in SSIS.
Does anyone have any clue how to achieve this? Alternatively, I don't mind doing this in C# either.
Here's one way to do what you want:
Create a column for tracking the data entry date in your target table.
Add an Execute SQL Task before the Data Flow Task. This task will retrieve the latest data entry date + 7 days. The query should be something like:
select dateadd(day,7,max(trackdate)) from targettable
Assign the SQL result to a package variable.
Add a Derived Column Transformation between your Source and Destination components in the Data Flow Task. Create a dummy column to hold the tracking date and assign the variable to it.
When you map the Excel columns to the table in the Data Flow Task, map the dummy column created earlier to the tracking date column. Now when you write the data to the DB, your tracking column will have the desired date.
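A slightly more defensive version of the Execute SQL Task query, keeping the trackdate / targettable names from above; the GETDATE() fallback for the very first load (when the table is still empty) is an assumption about what you would want:

SELECT ISNULL(DATEADD(DAY, 7, MAX(trackdate)), GETDATE()) AS NextTrackDate
FROM targettable;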