map columns in Pentaho spoon - pentaho

I have input.csv file with 3 columns viz, name,age,address. And, I have an output.csv file with 5 columns viz, Person name,Person age,Person address,Person salary,Person pass criteria.
I need to map my input.csv to output.csv. Please help me out with this. I tried Select values step, but it does not work.

You can do this in 4 steps.
1) Using CSV file input step you can get the name,age,address fields
2) Using Select values step you can rename name,age,address fields to Person name,Person age,Person address.
3) Using Add constants step you can add the additional fields, Person salary,Person pass criteria.
4) Using Text file output step you can output to a csv file. Here as the extension, type csv and separator as ,.

Related

How to replace strings in text with id from second text?

I've got two CSV files. The first file contains organism family names and connection weight information but I need to change the format of the file to load it into different programs like Gephi. I have created a second file where each family has an ID value. I haven't found a good example on this site on how to change the family names in the first file to the ids from the second file. Example of my files:
$ cat edge_file.csv
Source,Target,Weight,Type,From,To
Argasidae,Alcaligenaceae,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
Argasidae,Burkholderiaceae,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
Argasidae,Methylophilaceae,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
Argasidae,Oxalobacteraceae,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
Argasidae,Rhodocyclaceae,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
Argasidae,Sphingomonadaceae,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
Argasidae,Zoogloeaceae,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
Argasidae,Agaricaceae,0.190482976,undirected,A_Argasidae,F_Agaricaceae
Argasidae,Bulleribasidiaceae,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
Argasidae,Camptobasidiaceae,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
Argasidae,Chrysozymaceae,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
Argasidae,Cryptococcaceae,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
$ cat id_file.csv
Id,Family
1,Argasidae
2,Buthidae
3,Alcaligenaceae
4,Burkholderiaceae
5,Methylophilaceae
6,Oxalobacteraceae
7,Rhodocyclaceae
8,Oppiidae
9,Sphingomonadaceae
10,Zoogloeaceae
11,Agaricaceae
12,Bulleribasidiaceae
13,Camptobasidiaceae
14,Chrysozymaceae
15,Cryptococcaceae
I basically want the edge_file.csv output to turn into the output below, where Source and Target have changed from family names to ids instead.
Source,Target,Weight,Type,From,To
1,3,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
1,4,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
1,5,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
1,6,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
1,7,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
1,9,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
1,10,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
1,11,0.190482976,undirected,A_Argasidae,F_Agaricaceae
1,12,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
1,13,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
1,14,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
1,15,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
I haven't been able to figure it out with awk since I'm new to it, but I tried some variations from other examples here such as (just testing it out for the "Source" column):
awk 'NR==FNR{a[$1]=$1;next}{$1=a[$1];}1' edge_file.csv id_file.csv
Everything just prints out blank. My understanding is that I should create an array for the Source and Target columns in the edge_file.csv, and then replace it with the first column from the id_file.csv, which is the Id column. Can't get the syntax to work even for just one column.
You're close. This oneliner should help:
awk -F, -v OFS=',' 'NR==FNR{a[$2]=$1;next}{$1=a[$1];$2=a[$2]}1' id_file.csv edge_file.csv

Python - Compare two csv file - based on Column

I am trying to compare two CSV files, most of the time it will have same data but order of data will not be the same. Eg
csv file1
AAA,111,A1A1
BBB,222,B2B2
CCC,333,C3C3
CSV File2
CCC,333,C3C3
BBB,212,B2B2
AAA,111,A1A1
so I want to use third column as Primary key to compare other values. Report the difference. Is this possible to do it in Robotframework or Panda?
If you are making use of robotframework you need to do the following,
install robotframework-csvlib
Use Built-in Collections
Input from your question
csv file1
AAA,111,A1A1
BBB,222,B2B2
CCC,333,C3C3
csv file2
CCC,333,C3C3
BBB,212,B2B2
AAA,111,A1A1
My Solution
In the below approach, we are first reading csv into list of lists for both csv files and then comparing all the list of list items by making use of Collections KW List Should Contain Sub List, here, notice that we are passing an argument "values=True" which compares the value as well.
Code that compares 2 csv files
*** Settings ***
Library CSVLib
Library Collections
*** Test Cases ***
Test CSV
${list1}= read csv as list csv1.csv
log to console ${list1}
${list2}= read csv as list csv2.csv
log to console ${list2}
List Should Contain Sub List ${list1} ${list2} values=True
OUTPUT
(rf1) C:\Users\kgurupra>robot s1.robot
==============================================================================
S1
==============================================================================
Test CSV .[['C1,C2,C3'], ['AAA,111,A1A1'], ['BBB,222,B2B2'], ['CCC,333,C3C3']]
..[['C1,C2,C3'], ['CCC,333,C3C3'], ['BBB,212,B2B2'], ['AAA,111,A1A1']]
Test CSV | FAIL |
Following values were not found from first list: ['BBB,212,B2B2']
------------------------------------------------------------------------------
S1 | FAIL |
1 critical test, 0 passed, 1 failed
1 test total, 0 passed, 1 failed
==============================================================================
Output: C:\Users\kgurupra\output.xml
Log: C:\Users\kgurupra\log.html
Report: C:\Users\kgurupra\report.html
Assuming you've imported your CSV files as pandas DataFrames you can do the following to merge the two while retaining fundamental differences:
df = csv1.merge(csv2, on='<insert name primary key column here>',how='outer')
Adding the suffixes option allows you to more clearly differentiate between identically named columns from each file:
df = csv1.merge(csv2, on='<insert name>',how='outer',suffixes=['_csv1','_csv2'])
After that it depends on what kind of differences you are looking to spot but perhaps a starting point is:
df['difference_1'] = df['column1_csv1'] == df['column1_csv2']
this will create a boolean column which indicates True if observations are the same and False otherwise.
But there are nearly endless options for comparison.

Load CSV file in PIG

In PIG, When we load a CSV file using LOAD statement without mentioning schema & with default PIGSTORAGE (\t), what happens? Will the Load work fine and can we dump the data? Else will it throw error since the file has ',' and the pigstorage is '/t'? Please advice
When you load a csv file without defining a schema using PigStorage('\t'), since there are no tabs in each line of the input file, the whole line will be treated as one tuple. You will not be able to access the individual words in the line.
Example:
Input file:
john,smith,nyu,NY
jim,young,osu,OH
robert,cernera,mu,NJ
a = LOAD 'input' USING PigStorage('\t');
dump a;
OUTPUT:
(john,smith,nyu,NY)
(jim,young,osu,OH)
(robert,cernera,mu,NJ)
b = foreach a generate $0, $1, $2;
dump b;
(john,smith,nyu,NY,,)
(jim,young,osu,OH,,)
(robert,cernera,mu,NJ,,)
Ideally, b should have been:
(john,smith,nyu)
(jim,young,osu)
(robert,cernera,mu)
if the delimiter was a comma. But since the delimiter was a tab and a tab does not exist in the input records, the whole line was treated as one field. Pig doe snot complain if a field is null- It just outputs nothing when there is a null. Hence you see only the commas when you dump b.
Hope that was useful.

Using a Database column Value in output file name in Pentaho kettle

I wish to use a database column value in the output file name.
example:
select max(id) from process;
suppose the result of above query is 111
-- wish to use this value in the output file name as shown below.
output file name: file_111
how can i achieve this in pentaho kettle?
Please advice.
Depending on the type of file you want to create, you can simply create a column in your stream that contains the file name and then use the Accept file name from field-function that some output steps provide. The text file output for example does have this function, the XML output unfortunately doesn't.
to create the file name itself you can e.g. use the javascript step, or use the concat fields step together with the Add constants step.
Please follow the below steps:
Step 1: Table input :- select max(id) as max_id from process;
step 2: Modified Java Script Value:- put bellow code in this step.
eg:- var dummy= 'C:/Users/Venkatesh/Desktop/file_'+ max_id ;
in same step in the bottom ADD Field Name is dummy, Type is string and
Replace value 'Fieldname' or 'Rename to' is N
step 3: Text file output:-
select the **Add filenames to result**
**file name field** => dummy
Finally execute and see the result..

Split JSON file using apache PIG

I have a JSON input file that needs to be split into multiple files based on a keyword and the output should also retain the same JSON format.
Example:
The keyword here is the value of the object EVT.NAME. Depeneding on the value it should route it to the output.
Input has three different values (KEYPRESS,TUNE,TRICK), so 3 different output files should be created.
Input:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672866844,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TUNE","ETS":1402672867117,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402672868600,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672868888,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402673179313,"VALUE":{"KEY":"FAST_FORWARD"}},"HOST":"XXX"}
Output1:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672866844,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672868888,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
Output 2:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TUNE","ETS":1402672867117,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
Output 3:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402672868600,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402673179313,"VALUE":{"KEY":"FAST_FORWARD"}},"HOST":"XXX"}
You can use JsonLoader and JsonStorage. See this article - http://joshualande.com/read-write-json-apache-pig.
table = LOAD 'file.json'
USING JsonLoader('KEYPRESS:chararray, TUNE:chararray, TRICK:chararray');