kettle etl passing variable - pentaho

In my transformation, I created a variable (the current time formatted as yyyy-mm-dd HH24mmss) in a Modified Java Script step. I then use a Set Variables step to set that field as a variable, with the scope "Valid in the root job".
The question is: how do I use that variable in another transformation (in the same job)? I tried the Get Variables step, but it seems to list only system variables. What I want to do is output the date to a file in the second transformation. There are more transformations in between, which is why I can't do the output in the first transformation.
Or is it possible to create a variable in the job, set its value (the current date in yyyy-mm-dd HH24mmss), and then use it in the transformations?
EDIT:
The answer works, but the date is not in my expected format (yyyy-mm-dd HH24mmss), and it's not clear what format the date is in. E.g. if I try to format it in a Modified Java Script step and call getFullYear on it, I get TypeError: Cannot find function getFullYear in object Wed May 25 17:44:04 BST 2016. But if I just output it to a file, the date comes out as yyyy/mm/dd hh:mm:ss.
So I found another way to do it: use a Table Input step to generate the date in the desired format and set the variable from that; the rest is the same.
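For reference, the Table Input route can be a one-row query that returns the already-formatted string and feeds it straight into the Set Variables step. A sketch, assuming a MySQL connection (use TO_CHAR or your database's equivalent elsewhere):
-- One row, one column: the current time pre-formatted as yyyy-MM-dd HHmmss
SELECT DATE_FORMAT(NOW(), '%Y-%m-%d %H%i%s') AS formatted_now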

In your first transformation use the Get System Info step to inject the current date/time into your data flow and run it into a Set Variables step that sets the variable defined in your Job.
The variable you're using may not appear in the drop-down list when you press Ctrl-Space. This is because the variable is allocated by the job at run time and isn't available at design time. Just type '${VariableName}' into the field at design time. When you run from a job that contains a variable of that name, it should work.
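If you stay with the Get System Info approach and need the variable in a specific format, one option is to format the value in a Modified Java Script step placed before the Set Variables step; the value coming out of Get System Info is a java.util.Date rather than a JavaScript Date, which is why getFullYear isn't found on it. A sketch, where the field name sysdate is an assumption:
// Format the incoming java.util.Date field into the desired string form.
var sdf = new java.text.SimpleDateFormat("yyyy-MM-dd HHmmss");
var formatted_now = sdf.format(sysdate);
Add formatted_now as an output field of the script step and point the Set Variables step at it instead of the raw date field.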

Data Factory expression substring? Is there a function similar to right?

Please help,
How could I extract 2019-04-02 out of the following string with Azure data flow expression?
ABC_DATASET-2019-04-02T02:10:03.5249248Z.parquet
The first part of the string, received as a ChildItem from a GetMetaData activity, is dynamic. So in this case it is ABC_DATASET that is dynamic.
Kind regards,
D
There are several ways to approach this problem, and they are really dependent on the format of the string value. Each of these approaches uses Derived Column to either create a new column or replace the existing column's value in the Data Flow.
Static format
If the format is always the same, meaning the length of the sections is always the same, then substring is simplest:
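As a sketch (the incoming column name fileName is an assumption): ABC_DATASET- is 12 characters and the date is 10 characters, so the expression is simply:
substring(fileName, 13, 10)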
This parses the date (2019-04-02) out of the example string above.
Useful reminder: substring and array indexes in Data Flow are 1-based.
Dynamic format
If the format of the base string is dynamic, things get a tad trickier. For this answer, I will assume that the basic format of {variabledata}-{timestamp}.parquet is consistent, so we can use the hyphen as the base delimiter.
Derived Column has support for local variables, which is really useful when solving problems like this one. Let's start by creating a local variable to convert the string into an array based on the hyphen. That introduces a small wrinkle, because the string contains multiple hyphens thanks to the timestamp data, but we'll deal with it shortly. Inside the Derived Column Expression Builder, select "Locals":
On the right side, click "New" to create a local variable. We'll name it and define it using a split expression:
Press "OK" to save the local and go back to the Derived Column. Next, create another local variable for the yyyy portion of the date:
The cool part of this is I am now referencing the local variable array that I created in the previous step. I'll follow this pattern to create a local variable for MM too:
I'll do this one more time for the dd portion, but this time I have to do a bit more to get rid of all the extraneous data at the end of the string. Substring again turns out to be a good solution:
Now that I have the components I need isolated as variables, we just reconstruct them using string interpolation in the Derived Column:
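Gathering the walkthrough into one sketch (the column name fileName is an assumption, the dynamic prefix is assumed to contain no hyphens of its own, and locals are referenced with a leading colon in the expression builder):
Locals:
  strArray = split(fileName, '-')
  yyyy = :strArray[2]
  MM = :strArray[3]
  dd = substring(:strArray[4], 1, 2)
Derived column:
  "{:yyyy}-{:MM}-{:dd}"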
Back in the data preview, we can see the results: the new column holds the reconstructed date (2019-04-02 for the example row).
Where else to go from here
If these solutions don't address your problem, then you have to get creative. Here are some other functions that may help:
regexSplit
left
right
dropLeft
dropRight

Bad date format change from string to date in Bigquery

I've been struggling with some datasets I want to use that have a problem with the date format.
BigQuery could not load the files and returned the following error:
Could not parse '4/12/2016 2:47:30 AM' as TIMESTAMP for field date (position 1) starting at location 21 with message 'Invalid time zone: AM'
I have been able to upload the file manually, but only as strings, and now I would like to set the fields back to the proper format. However, I just could not find a way to change the date column from string to a proper DATETIME format.
I would love to know if this is possible, as the file is just too long to be formatted in Excel or Sheets (as I have done with the smaller files from this dataset).
now would like to set the fields back to the proper format ... from string to proper DateTime format
Use parse_datetime('%m/%d/%Y %r', string_col) to parse datetime out of string
If applied to the sample string in your question, you get 2016-04-12T02:47:30.
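Put into a full query (the table reference is hypothetical; the error message suggests the string column is named date):
SELECT
  PARSE_DATETIME('%m/%d/%Y %r', date) AS date
FROM `my_project.my_dataset.my_table`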
As @Mikhail Berlyant rightly said, the parse_datetime('%m/%d/%Y %r', string_col) function will convert your badly formatted dates to a standard ISO 8601 format accepted by Google BigQuery. The best option is then to save these query results to a new table in your BigQuery project.
I had a similar issue.
Below is an image of my table, which I uploaded with all columns in STRING format.
Next, I applied the following settings to the query below.
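The query itself was along these lines (the column names here are illustrative guesses based on the table being described; the destination table is configured in the query settings rather than in the SQL):
SELECT
  Id,
  PARSE_DATETIME('%m/%d/%Y %r', Time) AS Time,
  Value
FROM my_dataset.heartrateSeconds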
The settings below stored the query output in a new table called heartrateSeconds_clean in the same dataset.
The 'Write if empty' option is a good way to avoid overwriting the existing raw data or arbitrarily writing output to a temporary table, unless you are sure you want to do so. Save the settings and run your query.
As seen below, the output schema of the new table is automatically updated
Below is the new preview of the resulting table
NB: I did not apply an ORDER BY clause to the Results hence the data is not ordered by any specific column in both versions of the same table.
This dataset has over 2M rows.

MemSQL pipeline from S3 inserting NULLs into DATE type columns

The MemSQL pipeline is supposed to dump data from S3 into a columnstore table. The source files are in ORC format; they are then converted to Parquet.
The files have certain columns with DATE datatype (yyyy-mm-dd).
The pipeline runs fine but inserts NULL into all the Date type columns.
The DATE values may be getting written to Parquet as int64 with a timestamp logical type annotation (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp). MemSQL doesn't currently automatically convert these to a format compatible with e.g. DATETIME or TIMESTAMP, but rather attempts to assign to the destination column as if by assigning an integer literal with the raw underlying value. This gives NULL rather than an error for MySQL compatibility reasons, though set global data_conversion_compatibility_level="7.0" will make it an error.
You can investigate by temporarily giving the problem column TEXT type and looking at the resulting value. If it's an integer string, the issue is as described above and you can use the SET clause of CREATE PIPELINE to transform the value to a compatible format via something like CREATE PIPELINE P AS LOAD DATA .... INTO TABLE T(#col_tmp <- parquet_field_name) SET col = timestampadd(microsecond, #col_tmp, from_unixtime(0));.
The value will be a count of some time unit since the Unix epoch in some time zone. The unit and time zone depend on the writer, but should become clear if you know which time it's supposed to represent. Once you know that, modify the expression above to correct for units and perhaps call convert_tz as necessary.
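As a quick standalone sanity check of that conversion before editing the pipeline:
-- The literal below is made up, standing in for a raw value read from the
-- temporarily TEXT-typed column; swap MICROSECOND for whatever unit the
-- writer used, and wrap in CONVERT_TZ if a time zone correction is needed.
SELECT TIMESTAMPADD(MICROSECOND, 1461898050000000, FROM_UNIXTIME(0)) AS as_datetime;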
Yes, it's a pain. We'll be making it automatic.

Why does selecting PostgreSQL interval using Knex.js returns a JSON or JavaScript object rather than a string?

I have a PostgreSQL table with a column of type interval which stores a time duration in ISO 8601 format, i.e. P1D equals "1 day".
The problem I am having is that when selecting this data from the database using Knex.js, the data is converted from the string P1D into a JSON object {"days":1}. If I execute the same basic select query in the command-line interface I get the string P1D back, and I have the option to set the style of output with SET intervalStyle = iso_8601.
As best I can tell this is being done by a dependency of Knex.js called "node-pg-types", which in turn uses "postgres-interval". In Bookshelf.js you can set a data processor, and when using the "pg" module directly you can set different type behaviours; however, it's not clear at all how to modify the behaviour of Knex.js in this regard, even though Bookshelf.js can do this and is built on Knex.js.
In short my question is how do I make Knex.js output ISO 8601 style intervals on interval columns rather than a JSON object?
It turns out, after jumping from one module to another in my research, that Knex.js does indeed use "node-pg-types" to format the interval columns, and that in turn uses "postgres-interval"; neither module documents this well at all.
Looking into "postgres-interval", it was evident that the data returned is a JavaScript object that gets encoded into what looks like JSON. However, reading the documentation on this module, it actually has functions you can call to get the data in any format:
https://github.com/bendrucker/postgres-interval
interval.toPostgres() -> string
Returns an interval string. This allows the interval object to be passed into prepared statements.
interval.toISO() -> string
Returns an ISO 8601 compliant string.
So the answer is to append .toISO() to your code.
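For example, a minimal sketch of applying that to rows returned by Knex (the tasks table and duration column are assumptions):
const knex = require('knex')({ client: 'pg', connection: process.env.DATABASE_URL });

async function getDurations() {
  // duration is a PostgreSQL interval column; Knex (via node-pg-types and
  // postgres-interval) hands it back as an interval object.
  const rows = await knex('tasks').select('id', 'duration');
  // Convert each interval object to an ISO 8601 string, e.g. 'P1D'.
  return rows.map(row => ({ id: row.id, duration: row.duration.toISO() }));
}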
I will notify the developer that this particular behaviour is not well documented, so they can look to improve awareness of how Knex.js passes off some of the work to other modules which themselves pass work off. However, I wrote this self-answered question so no one else has to spend countless hours trying to figure this out.

Pentaho Data Integration - Pass dynamic value for 'Add sequence' as Start

Can we pass a dynamic value (the max value of another table's column) as the "Start at value" in the Add sequence step?
Please guide me.
Yes, but as the step is written you'll have to be sneaky about it.
Create two transforms and wrap them up in a job. In the first transform, query the database to get the value you want, then store it in a variable. Then in the second transform, which the job executes after the first, use variable substitution on the "Start at value" field of the Add sequence step to sub in the value you extracted in the earlier transform.
Note that you can't do this all in one transform because there is no way to ensure that the variable will be set before the Add Sequence step (although it might seem like Wait steps would make this possible, I've tried it in the past and was unsuccessful and so had to go with the methods described above).
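For example, the first transform's Table Input could run something like the query below (table and column names are placeholders), feed its single row into a Set Variables step that defines MAX_ID, and the second transform would then put ${MAX_ID} into the "Start at value" field of the Add sequence step:
-- One row, one column: one past the current maximum key (1 if the table is empty)
SELECT COALESCE(MAX(id), 0) + 1 AS max_id FROM target_table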