How to take in a sample DAT file and break each record by its header in Mule?

I am trying to take a sample DAT file into a Mule application.
SEC.DAT
D1030325 ADFSA 12321.00 XXXX
A1354610 AEWTF 94332.00 AAAA
V1030325 ADFSA 12321.00 XXXX
I am fairly new to the platform and have been somewhat lost on how to structure the flow in this context, but my goal is to break each record out by its beginning value.
Example:
Where D, A, and V are the conditions.
Expected outputs:
SEC1.DAT
D1030325 ADFSA 12321.00 XXXX
SEC2.DAT
A1354610 AEWTF 94332.00 AAAA
SEC3.DAT
V1039325 AOFSA 12321.00 XXYF

This is about a Mule application processing a file; the rest of the platform has no impact on this particular question.
Assuming it is a fixed-length file, but without using DataWeave's fixed-length feature, you can treat the input file as a single string (i.e. format text/plain).
First you can separate it into records using a DataWeave transform that splits it by end-of-line characters:
%dw 2.0
output application/java
---
payload splitBy "\n"
Following that, use a <foreach> to loop over each record. Inside the body of the <foreach>, use a <choice> router to select which output file to write to, based on the first character of the record. Example of a condition: payload[0] == "A". Then inside that branch of the <choice> just append to a list of records, or write directly to the file (with append).
Note that writing directly will not work as expected if you add any kind of concurrency because overlapping writes may corrupt the output.
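A minimal sketch of that flow, assuming Mule 4 with the File connector (the flow name, the File_Config global config, the file paths and the per-letter conditions are placeholders, and the trigger/source is omitted):
<flow name="split-dat-file">
    <!-- read the DAT file as plain text, then split it into one record per line -->
    <file:read config-ref="File_Config" path="SEC.DAT" outputMimeType="text/plain"/>
    <ee:transform>
        <ee:message>
            <ee:set-payload><![CDATA[%dw 2.0
output application/java
---
payload splitBy "\n"]]></ee:set-payload>
        </ee:message>
    </ee:transform>
    <foreach collection="#[payload]">
        <choice>
            <when expression="#[payload[0] == 'D']">
                <!-- append this record to the 'D' output file -->
                <file:write config-ref="File_Config" path="SEC1.DAT" mode="APPEND">
                    <file:content>#[payload ++ '\n']</file:content>
                </file:write>
            </when>
            <when expression="#[payload[0] == 'A']">
                <file:write config-ref="File_Config" path="SEC2.DAT" mode="APPEND">
                    <file:content>#[payload ++ '\n']</file:content>
                </file:write>
            </when>
            <otherwise>
                <!-- 'V' and anything else -->
                <file:write config-ref="File_Config" path="SEC3.DAT" mode="APPEND">
                    <file:content>#[payload ++ '\n']</file:content>
                </file:write>
            </otherwise>
        </choice>
    </foreach>
</flow>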

How to solve parse issues when a CSV has a field content escaped with double quotes

The input is received from a Salesforce Bulk API query.
INPUT
"RecordTypeId","Name","Description"
"AAA","Talent 2022 - Skills Renewal - ABC","DF - 14/03 - Monty affirmed that the ""mastercard approach"" would best fit in this situation. I will connect (abc, def, ghi) and the confirm booking tomorrow (15/03)"
SCRIPT:
%dw 2.0
output application/csv separator=",", ignoreEmptyLine=false, quoteValues=true, quoteHeader=true, lineSeparator="\r\n"
---
payload
OUTPUT:
"RecordTypeId","Name","Description"
"AAA","Talent 2022 - Skills Renewal - ABC","DF - 14/03 - Monty affirmed that the , def, ghi) and the confirm booking tomorrow (15/03)"
Expected OUTPUT:
The Description column has " and , in it, and therefore some of the description content is getting lost and some is getting shifted into different columns. I need the entire Description value in one column.
The escape character has to be set to a double quote (") for DataWeave to recognize that "" is an escaped quote and not the end of a string. You cannot use replace or any string operation because they are executed after the input is parsed.
You need to configure the reader properties in the source of that payload, for example in the SFTP or HTTP listener, or in whatever connector or operation reads the CSV. There you can add the outputMimeType attribute and set the input type and its properties. Note that because the flow is in an XML file you need to be mindful of XML escaping to use double quotes, and you also need to escape the double quote as DataWeave expects it, with a backslash (\).
Example:
outputMimeType='application/csv; escape="\""'
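For instance, placed on a hypothetical File connector read operation (the config name and path are placeholders), it would look like this:
<file:read config-ref="File_Config" path="input.csv"
           outputMimeType='application/csv; escape="\""' />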
It looks like your payload is using " as the escape character. By default DataWeave expects \ as the escape character for CSV, so you will need to specify the escape character explicitly while reading your input, after which DataWeave should be able to read the complete Description as a single value.
For example, the DataWeave below shows how you can use the input directive to read your CSV correctly. I do not know exactly what your expected output is, so I am just giving an example that writes the value of Description as text.
%dw 2.0
input payload application/csv escape='"'
output text
---
payload[0].Description
The output of this is:
DF - 14/03 - Monty affirmed that the "mastercard approach" would best fit in this situation. I will connect (abc, def, ghi) and the confirm booking tomorrow (15/03)
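If instead you need to write the CSV back out with the full Description kept in a single quoted column, a minimal sketch under the same assumption (escape and quoteValues are standard CSV reader/writer properties) would be:
%dw 2.0
input payload application/csv escape='"'
output application/csv quoteValues=true, escape='"'
---
payload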

replacing fasta headers gives mismatch

This is probably a simple issue, but I cannot seem to solve it.
I want to replace the headers of a FASTA file. This file is a subset of a larger file, but the headers were adjusted in the process. I want to add the original headers back since they include essential information.
I selected the headers from the subset (subset.fasta) using grep, and used these to match and extract the headers from the original file, giving 'correct.headers'. There are the same number of headers, in the same order, so this should be OK.
I found the code below, which should do what I want according to its description. I've only started learning awk, though, so I can't really check it myself.
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' correct.headers subset.fasta > subset.correct.fasta
(source: Replace fasta headers using sed command)
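For reference, here is the same one-liner spread out with comments describing what each rule does:
awk '
    NR == FNR { o[n++] = $0; next }    # first file (correct.headers): store every line in array o
    /^>/ && i < n { $0 = ">" o[i++] }  # second file: on each header line, substitute the next stored header, prefixed with ">"
    1                                  # print every line, modified or not
' correct.headers subset.fasta > subset.correct.fasta
Note that if the lines in correct.headers already start with ">" (as shown further below), this exact command would also prepend a second ">" to every replaced header.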
However, there are some 100 more output lines than expected, and there's a shift starting after a couple of million lines.
My workflow was like this:
I had a subsetted fasta-file (created by a program extracting certain taxa) where the headers were missing info:
>header_1
read_1
>header_2
read_2
...
>header_n
read_n
I extracted the headers from this subsetted file using grep, giving the subset headers file:
>header_1
>header_2
...
>header_n
I matched the first part of the header to extract the original headers from the non-subsetted file using grep:
>header_1 info1
>header_2 info2
...
>header_n info_n
giving the same number of headers, matching order, etc.
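A quick sanity check of that assumption (using the filenames from above) is to compare the two counts, which should be equal:
grep -c '^>' subset.fasta   # number of header lines in the subset
wc -l < correct.headers     # number of replacement headers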
I then used this file to replace the headers in the subset with the original ones using the above awk line, but this gives a mismatch at a certain point and adds additional lines.
result
>header_1 info1
read_1
>header_2 info2
read_2
...
>header_x info_x
read_n
Where/why does it go wrong?
Thanks!

How to split a CSV file into groups using Pentaho?

I am new to Pentaho and am trying to read a CSV file (which I already did) and create blocks of data based on an identifier.
Eg
1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
1|N|O|P
4|Q|R|S|T
5|U|V|W
I need to split and group this as such:
(each block starts when the first column is equal to '1')
Block a)
1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
Block b)
1|N|O|P
4|Q|R|S|T
5|U|V|W
Eg
a |1|A|B|C
a |2|D|E|F
a |8|G|H|I|J|K
a |4|L|M
b |1|N|O|P
b |4|Q|R|S|T
b |5|U|V|W
How can this be achieved using Pentaho? Thanks.
I found a similar question, but the answers don't really help my case:
Pentaho Kettle split CSV into multiple records
I think I got the answer.
I created the transformation in this zip that can transform your "csv" file into rows almost like you described, but I don't know what you intend to do next, so maybe you can give us more details. =)
I'll explain what I did:
1) First, we grab the row's full text with a Text input step.
When you look at the configuration of the Text input step, you'll see I used ';' as the separator while your input file uses '|', so I'm not splitting columns on the '|' but loading the whole line into one column. Grabbing the row's full text, nothing else.
2) Next we apply a regex eval to separate the ID from the rest of our string.
^(\d+)\|(.*)
Which means: at the beginning of the text I expect one or more digits, followed by a pipe and anything after that. Capture the digits at the beginning of the string into one column and everything after the pipe into another column.
That gives you an output with two columns: the ID from the first capture group and the rest of the line from the second.
3) Now what you need is to add a 'sequence' that only goes up when row_id equals 1, which I did in the Modified Java Script Value step with the following code:
var sequence;

//if it's the first row, set sequence to 1
if (sequence == null) {
    sequence = 1;
} else {
    //if it's not the first row, check if the row_id is equal to 1 (string)
    if (row_id == '1') {
        // increment the sequence
        sequence++;
    } else {
        //nothing
    }
}
And that will give you the output with the group/sequence column added, which seems to be what you expected.
Hope it helps =)

How to get output headers on dynamic table input in pentaho kettle

I've got a simple kettle transformation which just does Table Input -> Text File Output
The table input however is SELECT * FROM ${tableName}
(with the table coming from a job parameter)
The Text file output just has the filename options and separator set.
The output data rows are written OK, but the header checkbox does nothing and I cannot work out how to generate a header.
I guess it is because I am not explicitly mapping fields in the output stage.
How can I introduce a header to my output?
Thx
It turns out that enabling "Append" disables "Header".
See the comment here: http://wiki.pentaho.com/display/EAI/Text+File+Output?focusedCommentId=21104316#comment-21104316

inserting characters in a file, Jython

I have written a simple program to read the first 4 characters, get the integer value from them, read that many characters, and write 'xxxx' after them. Although the program runs, the only issue is that instead of inserting the characters, it is replacing them.
file = open('C:/40_60.txt','r+')
i=0
while 1:
    char = int(file.read(4))
    if not char: break
    print file.read(char)
    file.write('xxxx')
print 'done'
file.close()
I am having an issue with writing the data.
Considering this is my sample data:
00146456135451354500107589030015001555854640020
and the expected output is:
001464561354513545xxxx00107589030015001555854640020
but my program above is actually giving me this output:
001464561354513545xxxx7589030015001555854640020
i.e. xxxx overwrites 0010.
Please suggest.
Files do not support an "insert" operation. To get the effect you want, you need to rewrite the whole file. In your case, open a new file for writing; output everything you read and, in addition, output your 'xxxx'.
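A minimal sketch of that approach, in Python 2 / Jython syntax to match the question (the output path is a placeholder, and it assumes every record in the file is length-prefixed):
src = open('C:/40_60.txt', 'r')
dst = open('C:/40_60_out.txt', 'w')
while 1:
    length_field = src.read(4)                 # the 4-character length prefix
    if not length_field:
        break                                  # end of input
    record = src.read(int(length_field))       # read that many characters
    dst.write(length_field + record + 'xxxx')  # copy the record and append the marker
src.close()
dst.close()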