how do I filter out errant integer data in pentaho data integration

I have a fixed position input.txt file like this:
4033667 70040118401401
4033671 70040/8401901 < not int because of "/"
4033669 70040118401301
4033673 70060118401101
I'm using a Text file input step to pull the data in, and I'd like to load the data into a database as ints and have errant data go to a log file.
I've tried using the Filter Rows step and the Data Validator step, but I can't seem to get either to work. I've even tried bringing the field in as a string and then converting it to an integer with the Select/Rename values step, changing the data type in the meta-data section.
A typical error I keep running into is "String : couldn't convert String to Integer"
Any suggestions?
Thanks!

So I ended up using...
Text file input > Filter Rows (regex \d+) > Select values (to cast string to int) > Table output
...and the error log comes off the false branch of the regex filter.
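For reference, the row-level logic those steps implement looks roughly like this Python sketch (the file name, the fixed positions, and the error-log handling are assumptions based on the sample above, not actual PDI configuration):

import re

good, bad = [], []
with open('input.txt') as f:
    for line in f:
        key, value = line[:7], line[8:22].strip()  # fixed positions from the sample
        # The Filter Rows regex condition has to match the whole value,
        # so \d+ only passes fields made up entirely of digits.
        if re.fullmatch(r'\d+', value):
            good.append((int(key), int(value)))  # Select values: cast string to int
        else:
            bad.append(line.rstrip('\n'))        # false branch of the filter

with open('errors.log', 'w') as log:   # rows in 'good' would go to Table output
    log.write('\n'.join(bad))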

I understand your problem.
Let's keep it simple.

Related

Octave dlmread won't read date format

I have a csv file, the one from https://www.kaggle.com/jolasa/waves-measuring-buoys-data-mooloolaba/downloads/waves-measuring-buoys-data-mooloolaba.zip/1. The first column has dates which I'm trying to read with this command:
matrix = dlmread ('waves-measuring-buoys-data/WavesMooloolabaJan2017toJun2019.csv',',',1,0);
(If referring to the file on Kaggle, note that I slightly modified the directory and file names for ease of reading.)
Then when I check a date by printing matrix(2,1), I get 1 instead of 01/01/2017 00:00.
How do I get the correct format?
dlmread (like csvread) only handles numeric input.
Use csv2cell from the io package instead to obtain your data as strings, and then perform any necessary string operations and conversions accordingly.
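If it helps to see the same read-as-text-then-convert idea outside Octave, here is a Python sketch (the date format is inferred from the 01/01/2017 00:00 example, and the remaining columns are assumed numeric):

import csv
from datetime import datetime

with open('waves-measuring-buoys-data/WavesMooloolabaJan2017toJun2019.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    rows = []
    for row in reader:
        # the first column is a date string; parse it explicitly
        stamp = datetime.strptime(row[0], '%d/%m/%Y %H:%M')
        rows.append([stamp] + [float(x) for x in row[1:]])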

How to get output headers on dynamic table input in pentaho kettle

I've got a simple kettle transformation which just does Table Input -> Text File Output
The table input however is SELECT * FROM ${tableName}
(with the table coming from a job parameter)
The Text file output just has the filename options and separator set.
The output data rows are written OK, but the header checkbox does nothing and I cannot work out how to generate a header.
I guess it is because I am not explicitly mapping fields in the output stage.
How can I introduce a header to my output?
Thx
It turns out that enabling "append" disables "header".
See the comment here: http://wiki.pentaho.com/display/EAI/Text+File+Output?focusedCommentId=21104316#comment-21104316

Pentaho Spoon - Validate Fixed Width Input File Format

I'm trying to process a fixed-width input file in Pentaho and validate the format. The file will be a mixture of strings, numbers and dates. However, when attempting to process a number field that has an incorrect character present (which I had expected would throw an error), it just reads the first part of the number and ignores the bad char.
I can recreate this issue with a very simple input file containing a single field, for which I specify the expected number format along with start position and length. On running the transformation I would have expected the 'Q' to cause an error; instead, the step just reads the first two digits "67" and pads the rest to match the specified format.
If the input file is formatted correctly it runs perfectly well, but I need it to throw an error otherwise. Any suggestions would be awesome. Thanks!
Just an FYI in case someone stumbles across this question after hitting the same issues as I did.
I was able to construct a workaround by reading all values in the "Text File Input" step as strings, and then using a "Data Validator" step equipped with regex evaluation to ensure numbers were correctly formatted before parsing them to a number type with a following "Select Values" step.
It takes a bit longer to do this for every field, but it was the most robust solution I could come up with.
Thanks
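For illustration, that validate-then-cast flow amounts to something like this Python sketch (the file name, field positions, and digits-only rule are assumptions; the regex would be whatever number format you expect):

import re

with open('input.txt') as f, open('errors.log', 'w') as log:
    for line in f:
        field = line[0:8].strip()      # read the fixed-width slice as a string
        if re.match(r'\d+$', field):   # Data Validator: regex evaluation
            value = int(field)         # Select Values: cast to a number type
        else:
            log.write(line)            # reject rows with bad chars like 'Q'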

inserting character in file, jython

I have written a simple program that reads the first 4 characters, takes them as an integer, reads that many characters, and writes 'xxxx' after them. Although the program is working, the only issue is that instead of inserting the characters, it's replacing them.
file = open('C:/40_60.txt', 'r+')
i = 0
while 1:
    char = int(file.read(4))   # 4-digit length prefix
    if not char: break
    print file.read(char)      # read the record itself
    file.write('xxxx')         # writes at the current position, so this
                               # overwrites the next 4 characters instead of inserting
print 'done'
file.close()
I am having an issue with writing the data.
Considering this is my sample data:
00146456135451354500107589030015001555854640020
the expected output is
001464561354513545xxxx00107589030015001555854640020
but my program above actually gives me this output:
001464561354513545xxxx7589030015001555854640020
i.e. 'xxxx' overwrites '0010'.
Please suggest.
Files do not support an "insert" operation. To get the effect you want, you need to rewrite the whole file. In your case, open a new file for writing; output everything you read and, in addition, output your 'xxxx'.
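A minimal sketch of that rewrite approach, staying close to the original program (the output file name is made up; the syntax is Jython/Python 2 compatible):

src = open('C:/40_60.txt', 'r')
dst = open('C:/40_60_new.txt', 'w')
while True:
    header = src.read(4)                  # 4-digit length prefix
    if not header:
        break                             # end of input
    dst.write(header)
    dst.write(src.read(int(header)))      # copy the record unchanged
    dst.write('xxxx')                     # insert the marker after it
src.close()
dst.close()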

how to import flat file source to database using sql

I currently want to import my data from a flat file into the database.
The flat file is a .txt file in which I save a list of URLs. Example:
http://www.mimi.com/Hotels-g303188-Rurrenabaque-Hotels.html
I'm using the SQL Server Import and Export Wizard to do it, but at execution time it fails with an error saying:
Error 0xc02020a1: Data Flow Task 1: Data conversion failed. The data conversion for column "Column 0" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".
Can anyone help?
You get this error because the text is too long for the column you've chosen to put it in.
Text was truncated or
You might want to check the size of the database column vis-a-vis your input data. Is the longest URL shorter than the column width?
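A quick way to check that from the flat file itself (the file name is assumed), shown as a small Python sketch:

# print the length of the longest URL so you can size the column to fit
with open('urls.txt') as f:
    longest = max((line.rstrip('\n') for line in f), key=len)
print(len(longest), longest)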
one or more characters had no match in the target code page.".
Check if your input file has any special characters. An easy way to check this would be to save your file in ANSI (Notepad > Save As > Encoding = ANSI). Note - you'd still have to select the right code page so that the import interprets your input text correctly.
Here's a very nice link that has some background on what code pages are - http://www.joelonsoftware.com/articles/Unicode.html
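If you'd rather convert the file programmatically than through Notepad, the equivalent is re-encoding it, as in this sketch (file names assumed, and "ANSI" taken here to mean the Windows-1252 code page):

# read the file as UTF-8 and rewrite it in the target code page;
# characters with no match are replaced instead of failing the import
with open('urls.txt', encoding='utf-8') as src:
    text = src.read()
with open('urls_ansi.txt', 'w', encoding='cp1252', errors='replace') as dst:
    dst.write(text)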
Note you can also change the target column data type (to text stream, for example) in the Datasource -> Advanced section.