Kettle: Calculate length of column - pentaho

I have the following Kettle transformation:
The output is:
2017/06/02 14:51:10 - Write to log.0 - ------------> Linenr 1------------------------------
2017/06/02 14:51:10 - Write to log.0 - Text = This is line 1
2017/06/02 14:51:10 - Write to log.0 - Length = 11
2017/06/02 14:51:10 - Write to log.0 - Copy = [B#709B5D90
2017/06/02 14:51:10 - Write to log.0 -
2017/06/02 14:51:10 - Write to log.0 - ====================
2017/06/02 14:51:10 - Write to log.0 -
2017/06/02 14:51:10 - Write to log.0 - ------------> Linenr 2------------------------------
2017/06/02 14:51:10 - Write to log.0 - Text = This is line 2 and is longer
2017/06/02 14:51:10 - Write to log.0 - Length = 11
2017/06/02 14:51:10 - Write to log.0 - Copy = [B#7E5CADF3
2017/06/02 14:51:10 - Write to log.0 -
2017/06/02 14:51:10 - Write to log.0 - ====================
2017/06/02 14:51:10 - Write to log.0 -
2017/06/02 14:51:10 - Write to log.0 - ------------> Linenr 3------------------------------
2017/06/02 14:51:10 - Write to log.0 - Text = This is line 3 and is much longer
2017/06/02 14:51:10 - Write to log.0 - Length = 11
2017/06/02 14:51:10 - Write to log.0 - Copy = [B#7A6336E0
It seems Kettle refers to the column "Text" by its hash code instead of its value. What am I doing wrong?

The values you see are not hash codes but references to the raw, unconverted data. This happens when the input step has Lazy conversion enabled. The Calculator step should trigger the conversion to strings, but for some reason it is skipped in this case.
Uncheck Lazy conversion in the CSV input step to fix it.
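As an illustration of where strings like `[B#709B5D90` come from (a standalone Java sketch, not Kettle code): with lazy conversion on, the field still holds a raw Java `byte[]`, and printing a `byte[]` directly yields its JVM type tag `[B` plus an identity hash rather than the text it contains:

```java
public class LazyBytesDemo {
    public static void main(String[] args) {
        // With lazy conversion enabled, Kettle keeps the field as raw bytes
        // until a step forces conversion. Printing a Java byte[] directly
        // gives "[B@<identity hash>", not the characters it contains.
        byte[] raw = "This is line 1".getBytes();
        System.out.println(raw.toString());   // e.g. [B@709b5d90 (hash varies per run)
        // Forcing the conversion recovers the text:
        System.out.println(new String(raw));  // This is line 1
    }
}
```

This is why disabling lazy conversion (or anything that forces the bytes through `new String(...)`) makes the real value appear in the log.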

Related

How do I add a line break to an external text log file from a pentaho transform?

I'm using Pentaho PDI (Spoon). I have a transformation that compares two database tables (from a query selecting the year and quarters within those tables). I then pass the rows to a Merge rows (diff) step and on to a Filter rows step that tests whether flagfield is identical; the matches are logged by one Text file output step and the non-matches by another...
My issue is that my external log file gets appended and looks like this:
412542 - 21 - 4 - deleted - DOMAIN1
461623 - 22 - 1 - deleted - DOMAIN1
^failuresDOMAIN1 - 238388 - 12 - 4 - identical
DOMAIN1- 223016 - 13 - 1 - identical
DOMAIN1- 171764 - 13 - 2 - identical
DOMAIN1- 185569 - 13 - 3 - identical
DOMAIN1- 232247 - 13 - 4 - identical
DOMAIN1- 260057 - 14 - 1 - identical
^successes
I want this output:
412542 - 21 - 4 - deleted - DOMAIN1
461623 - 22 - 1 - deleted - DOMAIN1
^failures
DOMAIN1 - 238388 - 12 - 4 - identical
DOMAIN1- 223016 - 13 - 1 - identical
DOMAIN1- 171764 - 13 - 2 - identical
DOMAIN1- 185569 - 13 - 3 - identical
DOMAIN1- 232247 - 13 - 4 - identical
DOMAIN1- 260057 - 14 - 1 - identical
^successes
notice the line breaks between the successes and failures
Adding a Data grid step with a "line_break" string that is simply a newline, and passing that to each Text file output step as an extra logged column, helps, but I can't seem to sequence the transformation steps because they run in parallel...

How can I convert this format of S. tuberosum gene sequence ID - Soltu.DM.10G013850.1 - to Entrez ID?

How can I convert this format of S. tuberosum gene sequence ID - Soltu.DM.10G013850.1 - to Entrez ID? I have a problem with annotation due to inadequate gene IDs.

Delimiter to separate long string

In AX Dynamics 365, GL financial dimensions are stored in the following manner:
41110-GTC-R-West-JD-WJED014-R0101-1410-WJED014-SAL--
11410------R0102-----
I want to use a delimiter function with '-' to separate the dimensions. Can we create a function to separate all the dimensions?
I need the output as follows:
41110 GTC R West JD WJED014 R0101 1410 WJED014 SAL
11410 - - - - - R0102 - - -
Thanks in advance
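The question doesn't name a language, so as a hedged sketch, here is how the split could look in plain Java, using the two sample strings above (the rule "print '-' for an empty dimension" is inferred from the desired output and may need adjusting):

```java
public class DimensionSplit {
    public static void main(String[] args) {
        String[] rows = {
            "41110-GTC-R-West-JD-WJED014-R0101-1410-WJED014-SAL--",
            "11410------R0102-----"
        };
        for (String row : rows) {
            // limit -1 keeps trailing empty segments (the "--" at the end)
            String[] parts = row.split("-", -1);
            StringBuilder out = new StringBuilder();
            for (String p : parts) {
                // show empty dimensions as "-", per the desired output
                out.append(p.isEmpty() ? "-" : p).append(' ');
            }
            System.out.println(out.toString().trim());
        }
    }
}
```

Drop the `isEmpty()` mapping if empty dimensions should be omitted rather than shown as a dash; the same `split` call with limit `-1` is the key piece either way, since the default limit silently discards trailing empty fields.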

How to convert fields into bags and tuples in PIG?

I have a dataset which has comma-separated values:
10,4,21,9,50,9,4,50
50,78,47,7,4,7,4,50
68,25,43,13,11,68,10,9
I want to convert this into Bags and tuples as shown below:
({(10),(4),(21),(9),(50)},{(9),(4),(50)})
({(50),(78),(47),(7),(4)},{(7),(4),(50)})
({(68),(25),(43),(13),(11)},{(68),(10),(9)})
I have tried the below command but it does not show any data.
grunt> dataset = load '/user/dataset' Using PigStorage(',') As (bag1:bag{t1:tuple(p1:int, p2:int, p3:int, p4:int, p5:int)}, bag2:bag{t2:tuple(p6:int, p7:int, p8:int)});
grunt> dump dataset;
Below is the output of dump:
2015-09-11 05:26:31,057 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 8 time(s).
2015-09-11 05:26:31,057 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-09-11 05:26:31,058 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-09-11 05:26:31,058 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-09-11 05:26:31,063 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-09-11 05:26:31,063 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(,)
(,)
(,)
(,)
Please help: how can I convert the dataset into bags and tuples?
Got the solution.
I used the following commands:
grunt> dataset = load '/user/dataset' Using PigStorage(',') As (p1:int, p2:int, p3:int, p4:int, p5:int, p6:int, p7:int, p8:int);
grunt> dataset2 = Foreach dataset Generate TOBAG(p1, p2, p3, p4, p5) as bag1, TOBAG(p6, p7, p8) as bag2;
grunt> dump dataset2;
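To make the TOBAG step concrete, here is a hedged standalone Java sketch (not Pig itself) of the same regrouping: split each comma-separated line, put the first five fields in one bag and the last three in another, and print them the way Pig renders bags of single-field tuples:

```java
public class BagDemo {
    public static void main(String[] args) {
        String[] rows = {
            "10,4,21,9,50,9,4,50",
            "50,78,47,7,4,7,4,50",
            "68,25,43,13,11,68,10,9"
        };
        for (String row : rows) {
            String[] f = row.split(",");
            // first five fields -> bag1, last three -> bag2
            System.out.println("(" + bag(f, 0, 5) + "," + bag(f, 5, 8) + ")");
        }
    }

    // Format fields[from..to) the way Pig prints a bag of single-field tuples.
    static String bag(String[] fields, int from, int to) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = from; i < to; i++) {
            sb.append("(").append(fields[i]).append(")");
            if (i < to - 1) sb.append(",");
        }
        return sb.append("}").toString();
    }
}
```

The first row prints `({(10),(4),(21),(9),(50)},{(9),(4),(50)})`, matching Pig's output. This also shows why the original schema failed: the file holds eight flat int fields, so Pig cannot parse a line directly into two bags; the fields must be loaded flat and then regrouped, which is exactly what TOBAG does.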

Printing particularities of pgf77 (FORTRAN 77 behaviour?)

I compile and run this simple FORTRAN 77 program:
program test
write(6,*) '- - - - - - - - - - - - - - - - - - - - - - - - - - ',
& '- - - - - - - - - - - - - - - - - - - - - - - - - -'
write(6,'(2G15.5)') 0.1,0.0
end
with gfortran or f95 the output is:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.10000 0.0000
with pgf77 it is:
- - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - -
0.10000 0.00000E+00
and with g77 or ifort:
- - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - -
0.10000 0.0000
A couple of questions arise:
Why is 0.0 printed with four decimal places instead of five, as
requested in the format G15.5? Is this spec-compliant? And why
does pgf77 write it differently?
I guess the line break in the - - - - - - line with the last three
compilers is due to some limitation in the output line length... Is
there any way of increasing this limit, or otherwise force
single-line writes, at compile time?
By the way, the desired output is
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.10000 0.00000
which matches none of the above.
Exactly what the G edit descriptor causes to be printed is a little complicated, but for the value 0.0 the standard (10.7.5.2.2 in the 2008 edition) states that the compiler should print a representation with d-1 (i.e. 4 in your example) digits in the fractional part of the number. So most of your compilers are behaving correctly; I think that pgf77 is incorrect, but it may be working to an earlier standard with a different requirement.
The simplest fix for this is probably to use an f edit descriptor instead, (2F15.5).
As for the printing of the lines of hyphens, your use of * as the format, which requests list-directed output, surrenders precise control of the output to the compiler. My opinion is that it is a little perverse of a compiler to print the two parts of the expression on two lines, but it is not non-standard behaviour.
If you want the hyphens printed all on one line, take control of the output format: write(6,'(2A24)') or something similar ought to do it (I didn't count the hyphens for you, just guessed that there are 24 in each part of the output). If that doesn't appeal to you, simply write one string with all the hyphens in; that will probably get written on one line even using list-directed output.