I am using PDI/Kettle. I know it is possible to add new columns by specifying them in the fields table. Is it possible to remove deprecated input columns coming from the previous step in the Modified Java Script step in Spoon?
You can use the Select / Rename values step to remove any field from the record stream.
Do it on the step's second tab, Remove, where you define the fields to remove.
@Hello-lad: reading your question, it looks like you want to know specifically whether you can discard an input column coming from a previous step inside a Modified Java Script step. The real purpose of that step, however, is to create columns derived from values already in the Pentaho stream, not to eliminate unwanted fields from it; for that you use the Select / Rename values step (as mzy indicated).
Currently we have a Scala DataFrame whose output shows the id value first (although it is chronologically added to the DataFrame last). The other columns appear dynamically, based on the .pivot() function and the data.
When I query the data in the %sql interpreter, the order changes, so the CSV file I download also has the id column last, which does not work for me. I can't simply write the selection with the id column placed first manually, because I can't control the other columns due to the pivot. Is there any other way to make a specific column come first?
The Scala paragraph is:
resultMean.registerTempTable("mean")
The sql paragraph is:
%sql
select *
from mean
For anyone reading this in the future: the reason for this behavior was misuse of the DataFrame. In the Scala paragraph, .show() was applied to one DataFrame while the export to the temp table was done on another one. If you run into the same issue, double-check that you are applying your methods to the same object.
I am working on IBM i Series VR7, running SQL (DB2) from CLLE.
I have an SQL procedure in a TXT file, containing the command below to create a table in QTEMP:
create table qtemp.FILE1 as (
select
Field1,Field2,Field3,.....Field10 from FILE2 ) with data;
I am calling the above procedure from CLLE using the command below:
RUNSQLSTM SRCFILE(MyLib/MySrc) SRCMBR(Proc_txt) COMMIT(*NONE)
And then I run the command below to generate the spool file:
RUNQRY QRYFILE((FILE1)) OUTTYPE(*PRINTER) OUTFORM(*DETAIL) FORMSIZE(60 132)
FORMTYPE(*STD) COPIES(1) LINESPACE(1)
The issue I am facing is that I get two blank spaces between columns when the table is created with the create table command. When that table is converted into a spool file by the RUNQRY command above, the fields on the right are truncated, because my report width is 132 by default and I cannot change it.
If the spacing between columns in the created table can be reduced to one blank, my issue will be resolved.
The SQL is the IBM i Series default and the database is DB2; I don't know much about their versions.
Edit 2: Another issue I had was a field appearing on a second line of the report. As per the requirement, one field had to sit in a second row beneath another field; for example, I needed Field10 under Field5. I have fixed that too; read my answer below.
I hope this helps someone in need, though I really doubt it.
Edit 1: I have updated the question as requested. Any help would be much appreciated. Thanks.
The short answer is that yes, you can define the report to have one space between columns, but you have to define a Query/400 object to do that. Unfortunately this is not a good place to write a Query/400 tutorial, but I can get you started.
Type WRKQRY and press Enter.
Then put the cursor on the query name field and press F4. You are now in the tool. You need to create a new query and define everything about it in this tool. Play around with it and see if that helps you.
I was able to get what I needed. As others have suggested, I finally used WRKQRY to control the column spacing. Reducing the column spacing to 1 let me fit the columns I needed within the 132-character width.
Another issue I had was a field appearing on a second line of the report. As per the requirement, one field had to sit in a second row beneath another field; for example, I needed Field10 under Field5. So I used the line-wrapping feature available in WRKQRY.
How I did it:
Create a WRKQRY object and select the file needed.
Sequence the field you need on the second line so that it comes last.
Go to Select Output Type and Output Form, set Line Wrapping to Y, and set the wrapping width equal to your report width. Leave the other fields as required.
This way each record will have the 10th field on the next row, if it has data. You can add as many fields as you like.
You may have to add some white space to the field for proper alignment. I would suggest creating a new field using the concatenation (||) operator available in WRKQRY.
Thanks everyone for helping.
I have a table input and I need to add a calculation to it, i.e. add a new column. I have tried:
doing the calculation and then feeding the result back. Obviously, this appended the new data to the old data;
doing the calculation and then feeding it back while truncating the table. The process got stuck at some point, so I assume I was truncating the table while data was still being extracted from it;
using a Stream lookup and then feeding back. Of course, this also stacked the data on top of the existing data;
using a Stream lookup where I pull the data from the table input, do the calculation, and at the same time pull the data from the same table and do a lookup based on the unique combination of date and id, then use the Update step.
As the last option has been running for a while, I am fairly sure it is not the answer either, but I have exhausted my options.
It seems that you need to update the table your data came from with this new field. Use the Update step with fields A and B as keys.
Actually, once you connect the hop, the result of the first step is automatically carried forward to the next step. So let's say you have a Table input step and then add a Calculator step where you create the third column: after writing the logic, right-click the Calculator step and click Preview, and you will get the result with all three columns.
I'd say your issue is not ONLY in the Pentaho implementation; there are things you can do before reaching the data staging in Pentaho.
'Workin Hard' is correct when he says you shouldn't use the same table: leave the input untouched and insert the new values into another table. It doesn't have to be a new table every time; instead of truncating the original, you truncate the staging (output) table.
How many new columns will you need? Will every run create a new column in the output, or will you always have a C column that is always A + B or some other calculation? I'm sorry, but this isn't clear. If it is the latter, you don't need Pentaho for the transformation: updating the C column with a formula based on A and B can be done directly in most relational DBMSs with a simple UPDATE statement (for example, something along the lines of UPDATE yourtable SET c = a + b). Yes, it can be done in Pentaho, but you are adding a lot of overhead and processing time.
I'm new to PDI and Kettle, and what I thought was a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see if a particular record exists (i.e. vendor). I would like to get the name of the vendor from reading a flat file (.CSV).
My first hurdle is selecting only the vendor name from the 8 fields in the CSV.
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, or how to pass the desired value to that variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV also has 8 fields, one of which is also vendor.
My best effort was a dynamic query using:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?
The best practice is a Stream lookup. For each record in the main flow (VendorRatings), look up the vendor details (the lookup fields) in the reference file (the CSV), based on its identifier (possibly its number, its name, or firstname + lastname).
First "hurdle" : Once the path of the csv file defined, press the Get field button.
It will take the first line as header to know the field names and explore the first 100 (customizable) record to determine the field types.
If the name is not on the first line, uncheck the Header row present, press the Get field button, and then change the name on the panel.
If there is more than one header row or other complexities, use the Text file input.
The same is valid for the lookup step: use the Get lookup field button and delete the fields you do not need.
Given that:
there is at most one vendor rating per vendor, and
you have to do something if there is no match,
I suggest the following flow:
Read the CSV and, for each row, look up in the table (i.e. the lookup table is the SQL table rather than the CSV file), and set a default value when there is no match. I suggest something really visible, like "--- NO MATCH ---".
Then, in case of no match, a Filter rows step redirects the flow to the alternative action (here: insert into the SQL table). The two flows are then merged back into the downstream flow.
I am working on a transformation step for Pentaho Kettle. It selects several input columns and, based on those, adds two new columns during the transformation. I am unable to understand (based on code from other plugins) how I can add the two new columns so that 1) steps downstream are aware of these columns and 2) I can push the transformed data into these columns.
Thanks in advance.
You might need to override meta.getStepFields() to add new ValueMetaInterface objects to the RowMetaInterface passed in. This is the standard way to add columns at runtime; however, the row's metadata (i.e. the list of ValueMetaInterface objects) must be the same from row to row, or else the next step in your transformation will complain.
Often when writing data-driven custom plugins, you consume as many rows as you need (using getRow()) in order to figure out what the outgoing row format/metadata will be; then you can construct a RowMetaInterface (usually by using meta.getStepFields()) that will be passed into the putRow() call. If you intend to pass through the incoming fields, do something like:
RowMetaInterface outputRowMeta = getInputRowMeta().clone();
If you're creating new rows use this:
RowMetaInterface outputRowMeta = new RowMeta();
Either way when you call meta.getStepFields(outputRowMeta, ...) it should populate outputRowMeta with the appropriate fields, by adding/changing/removing ValueMetaInterface objects from outputRowMeta.
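To make the two halves concrete, here is a minimal sketch of how they typically fit together (on the step's meta class the hook is usually the getFields() method, which TransMeta.getStepFields() ends up calling). This is not the asker's plugin: the column names newColA and newColB are placeholders, the literal values are examples only, and the exact getFields() signature and ValueMeta classes vary across PDI versions (newer versions add Repository and IMetaStore parameters), so treat it as an illustration rather than a drop-in implementation.

// In the plugin's StepMeta class: declare the two new output columns so that
// Spoon and downstream steps know about them (imports come from kettle-core).
import org.pentaho.di.core.row.RowMetaInterface;
import org.pentaho.di.core.row.RowDataUtil;
import org.pentaho.di.core.row.ValueMetaInterface;
import org.pentaho.di.core.row.value.ValueMetaString;
import org.pentaho.di.core.row.value.ValueMetaNumber;

public void getFields( RowMetaInterface inputRowMeta, String origin, RowMetaInterface[] info,
    StepMeta nextStep, VariableSpace space ) throws KettleStepException {
  // On older Kettle versions use new ValueMeta( "newColA", ValueMetaInterface.TYPE_STRING ) instead.
  ValueMetaInterface colA = new ValueMetaString( "newColA" );   // placeholder name
  colA.setOrigin( origin );
  inputRowMeta.addValueMeta( colA );                            // appended after the incoming fields

  ValueMetaInterface colB = new ValueMetaNumber( "newColB" );   // placeholder name
  colB.setOrigin( origin );
  inputRowMeta.addValueMeta( colB );
}

// In the plugin's step class, inside processRow() (meta/data are the usual casts of the
// processRow() arguments; outputRowMeta is assumed to be a field on your StepData class).
Object[] inputRow = getRow();                        // null means end of stream
if ( inputRow == null ) { setOutputDone(); return false; }
if ( first ) {
  first = false;
  data.outputRowMeta = getInputRowMeta().clone();
  // Let the meta class append the two ValueMetas defined above.
  meta.getFields( data.outputRowMeta, getStepname(), null, null, this );
}
Object[] outputRow = RowDataUtil.resizeArray( inputRow, data.outputRowMeta.size() );
outputRow[ getInputRowMeta().size() ]     = "derived value";        // goes into newColA
outputRow[ getInputRowMeta().size() + 1 ] = Double.valueOf( 42.0 ); // goes into newColB
putRow( data.outputRowMeta, outputRow );

The essential point is that the metadata appended in getFields() and the values appended before putRow() must line up one to one; if they drift apart, the next step will see a row layout that does not match the data.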
I've got a blog post using Groovy to add/replace fields in the incoming rows here:
http://funpdi.blogspot.com/2014/10/flatten-json-to-key-value-pairs-in-pdi.html
Not sure if that is similar to your use case or not. If you have more questions, feel free to find me on IRC at ##pentaho (my nick is usually mburgess_pdi)
If I have understood your question correctly, I think you are trying to create an output file with dynamic columns. You can do this by checking the fast data dump option in the Text file output step. While doing so, do not define any column names in the Fields tab.
Hope it helps :)