I am trying to read a text file, replace all occurrences of a "search term" with a "replace term" using a regular expression, and write the result to a new file.
I am relatively new to Pentaho Kettle and am not sure which transform or set of steps best suits this use case. Most transforms read files line by line or field by field, so I am not sure how to read a whole text file and run a find-and-replace on it.
Thanks for your time and attention.
Step 1 - Input -> Text file input : file type: Fixed, with just one String field that is big enough
Step 2 - Transform -> Replace in String : Use regexp: Y, Find: "search term", Replace: "replace term"
Step 3 - Output -> Text file output
Not sure it's the best, but it works. Hope it helps.
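For comparison, the same read, regex-replace, write flow can be sketched outside Kettle in a few lines of Python. This is only an illustration of the three steps above, not a Kettle feature; the function name `replace_in_file` and the UTF-8 default are assumptions.

```python
import re
from pathlib import Path


def replace_in_file(src, dst, pattern, replacement, encoding="utf-8"):
    """Read src, replace every regex match of pattern, write the result to dst.

    Returns the number of replacements made. (Hypothetical helper.)
    """
    text = Path(src).read_text(encoding=encoding)           # Step 1: read the file
    new_text, count = re.subn(pattern, replacement, text)   # Step 2: regex replace
    Path(dst).write_text(new_text, encoding=encoding)       # Step 3: write the output
    return count
```

Called as `replace_in_file("in.txt", "out.txt", r"search term", "replace term")`, it treats the whole file as one string, so a pattern can also span line breaks.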
I have a text file that uses ^ (cap) and , (comma) as delimiters, and after cleaning it I need to load it into SQL. I have tried my best to clean the source file,
but the file is still not cleaned as expected.
Please see the picture below for how I tried to correct the source file.
The file is still not cleaned as expected. Please see the uncleaned file below.
You have a variety of issues here.
You have identified the header row delimiter as a comma. A row delimiter is the, usually invisible, delimiter that marks the end of a row's worth of data. Traditionally, this is an operating-system-specific value: a Carriage Return (CR), a Line Feed (LF), or a Carriage Return/Line Feed pair.
Your source data is not a comma delimited file with caret/circumflex/cap text delimiters. You have a comma-space delimited file, which SSIS doesn't support in the editor. However, you can hand edit the dtsx file, as I outlined in How to read a flatfile with lowercase thorn as the delimiter, to specify that it should use comma space: ColumnDelimiter="_x002C__x0020_"
Given a truncated version of your source data
ListCode, CAS, Name
^216^, ^^, ^Coal Dust^
^216^, ^7782-24-5^, ^Graphite (Natural)^
^216^, ^^, ^Inert or Nuisance Dust^
and the comma (0x2C) space (0x20) delimiter edited into the raw dtsx connection manager, I was able to pull data as I believe you are expecting.
You might also run into additional issues given your selection of code pages and not checking the Unicode button, but I can't generate matching source data from an image to check that.
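To sanity-check the layout outside SSIS, the truncated sample can be parsed with a short Python script. This is a sketch using Python's standard `csv` module, not part of the SSIS solution; the helper name `parse_caret_file` is an assumption.

```python
import csv
import io


def parse_caret_file(text):
    """Parse comma-space delimited text whose fields are wrapped in ^ qualifiers."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=",",
        quotechar="^",          # the caret acts as the text qualifier
        skipinitialspace=True,  # swallow the space that follows each comma
    )
    return [row for row in reader if row]  # drop blank lines


sample = (
    "ListCode, CAS, Name\n"
    "^216^, ^^, ^Coal Dust^\n"
    "^216^, ^7782-24-5^, ^Graphite (Natural)^\n"
    "^216^, ^^, ^Inert or Nuisance Dust^\n"
)

for row in parse_caret_file(sample):
    print(row)
```

Each data row comes back as three fields with the carets removed and the empty CAS values preserved as empty strings.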
Just replace every ^, ^ with ^,^
Your source should then look like this:
CAS, SubName, ListCode, Type, CountryCode, ListName
^1000413-72-8^,^fasiglifam^,^447^,^Chemical Inventory^,^EU^,^ECICS Custom Tariff Codes^
^1000413-72-8^,^fasiglifam^,^0^,^^,^NN^,^SPHERA Global Substance List^
Then edit your connection manager with the details below:
[![enter image description here][2]][2]
It will work.
[2]: https://i.stack.imgur.com/0x89k.png
I'm using a SQL FileTable, and for instance I have a saved text file named "SOS.txt" which contains the following text:
For god's sake, save us right now please. We can't survive.
Now or never!
Now I want to find all files that contain the word save, so I execute the following query:
SELECT * FROM FileTableExample
WHERE CONTAINS(file_stream, 'save')
and here's the result:
stream file => 0x616C692053617665207573207269676874206E6F772E0D0A4E6F77206F72206E6576657221
As you can see, I get the correct result: the third column of the result identifies the file named SOS.txt. I have the stream_id and file_stream, but what I'm looking for is a way to show the matched text together with its surroundings in a human-readable format.
Something like this:
Name         | Excerpt
-------------+----------------------
SOS.txt      | ..sake, save us..
Is there any way?
Update:
After searching the net I found this article, which is useful, but it doesn't cover full-text search in a FileTable structure.
Based on this article, I converted file stream to string:
SELECT CONVERT(varchar(MAX), file_stream) AS Excerpt, *
from FileTableExample
where contains(file_stream, 'save')
It works if the file is plain text like SOS.txt, but if it's a .docx or .pptx file, the conversion does not give you anything useful.
Use this: CAST(file_stream AS varchar(max))
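Once the stream has been converted to text, the excerpt itself could be cut out on the client side. The following is a minimal sketch in Python, assuming the application has already fetched the file_stream bytes and that they hold plain text; the function name `excerpt`, the `radius` parameter and the UTF-8 default are all assumptions, and this does not help with .docx or .pptx content.

```python
from typing import Optional


def excerpt(data: bytes, term: str, radius: int = 6,
            encoding: str = "utf-8") -> Optional[str]:
    """Return the first match of term with `radius` characters of context on each side."""
    text = data.decode(encoding, errors="replace")
    pos = text.lower().find(term.lower())  # case-insensitive search
    if pos < 0:
        return None
    start = max(0, pos - radius)
    end = min(len(text), pos + len(term) + radius)
    return ".." + text[start:end].strip() + ".."


stream = b"For god's sake, save us right now please. We can't survive."
print(excerpt(stream, "save"))
```

The result can then be shown next to the file name returned by the CONTAINS query.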
I am doing a simple text file input in Kettle (Pentaho PDI 8.1.0). The file is a .csv file and contains several accented characters like "á".
In the settings of the Text file input step I set the encoding to ISO-8859-1. When I click the "Show file content" button, everything is displayed correctly.
But when I press "Preview rows" to see the data separated into columns, all accented characters come out wrong and are replaced with ?, so Mária becomes M�ria.
By "wrong" I do not mean that Kettle fails to run the transformation, but that the data are not correct.
Any ideas?
Your file is obviously not encoded in ISO-8859-1.
The Encoding field in the Content tab of 'Text file input' is used by the "Preview rows" button but not by the "Show file content" button.
Try another encoding.
Try encoding cp866; hope it helps. You could also try latin-1.
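Rather than guessing in the step dialog, one way to try several encodings at once is to decode a few raw bytes of the file outside Kettle and see which result looks right. This is a sketch in Python, not a Kettle feature; the function name `decode_candidates` and the default list of encodings are assumptions.

```python
def decode_candidates(raw, encodings=("utf-8", "iso-8859-1", "cp1252", "cp866")):
    """Decode raw bytes with each candidate encoding; skip encodings that fail."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results


# Example: bytes as they would be stored in a UTF-8 encoded file
raw = "Mária".encode("utf-8")
for name, text in decode_candidates(raw).items():
    print(name, repr(text))
```

With a real file, `raw` would come from `open(path, "rb").read()`. The encoding whose output shows the accented characters correctly is the one to enter in the Encoding field.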
Pentaho -
Design : Text file output
Requirement :
- Read values from DB and create a csv file.
- I want to remove the CR & LF from the last line in the generated file.
This empty last line causes problems when the file is parsed, so I want to get rid of it.
Sample example here :
Test.ktr :
https://ufile.io/ug06w
This produces output.csv in which last line contains CRLF (contains 3 lines - blank line at the end of file)
input.csv
https://ufile.io/lj0tj
(To simulate values coming from database, contains 2 lines)
Put some logic between the Table input and the CSV output, for example a Filter step that removes empty lines.
I cannot tell you more unless you tell me more about your specific case.
I was able to solve this using the Shell script component. After generating the file, I added a post-processing step that removes the empty line at the end of the file.
There could be other solutions, but this fulfilled my requirement.
Thank you.
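The post-processing idea can be sketched as follows. This is a Python illustration of "remove the trailing CR/LF after the file is written", not the shell script that was actually used; the function name `strip_final_newline` is an assumption.

```python
from pathlib import Path


def strip_final_newline(path):
    """Remove all trailing CR/LF bytes from the file; return how many were removed."""
    data = Path(path).read_bytes()
    trimmed = data.rstrip(b"\r\n")  # strips any run of CR and LF at the end
    if trimmed != data:
        Path(path).write_bytes(trimmed)
    return len(data) - len(trimmed)
```

Working on bytes keeps the rest of the file untouched, whatever its encoding. Note that this removes every trailing line break, so the last data line ends without a terminator.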
I have lots of data in an .rtf file (usernames and passwords). How can I fetch that data into a table? I'm using sqlite3.
I created "userDatabase.sql" and, in it, a table "usersList" with the fields "username" and "password". I want to get the data from the "list.rtf" file into my table "usersList". Please help me.
Thanks in advance.
Praveena.
I would write a little parser. Re-save the .rtf as a .txt file and assume it looks like this:
user1:pass1
user2:pass2
user5:pass5
Now do this (in your code):
open the .txt file (NSString -stringWithContentsOfFile:usedEncoding:error:)
read line by line
for each line, fetch user and password (NSArray -componentsSeparatedByString)
store user/password into your DB
Best,
Christian
Edit: for parsing Excel sheets I recommend exporting as a CSV file and then doing the same
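The steps listed above can be sketched as follows. The answer itself targets Cocoa (NSString, NSArray); this illustration uses Python and its standard `sqlite3` module instead, and assumes the user:pass layout shown above. The function name `load_users` is hypothetical; the table and column names are taken from the question.

```python
import sqlite3


def load_users(conn, lines):
    """Insert one row per 'user:password' line into usersList; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS usersList (username TEXT, password TEXT)"
    )
    rows = []
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        user, password = line.split(":", 1)  # split on the first colon only
        rows.append((user, password))
    conn.executemany(
        "INSERT INTO usersList (username, password) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

With a real file this would be called as `load_users(sqlite3.connect("userDatabase.sql"), open("list.txt", encoding="utf-8"))`. Splitting on the first colon only keeps passwords that themselves contain a colon intact.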
Parsing RTF files is mostly trivial. They're actually text, not binary (unlike doc, pdf, etc.).
Last I used it, I remember the file format wasn't too difficult either.
Example:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Calibri;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 Username Password\par
Username2 Password2\par
UsernameN PasswordN\par
}
Do a regular expression match to get the last { ... } part. Be sure to match { not \{.
Next, parse the text as you want, but keep in mind that:
everything starting with a \ is escaped; I would write a little function to unescape the text
the special identifier \par is for a new line
there are other special identifiers, such as \b which toggles bolding text
the color change identifier, \cfN changes the text color according to the color table defined in the file header. You would want to ignore this identifier since we're talking about plain text.
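As a rough illustration of these points, here is a crude Python sketch that works on the small sample above. It is not a full RTF parser: it simply drops nested groups (font table, generator), turns \par into a line break and strips the remaining control words. The function name `rtf_to_lines` is an assumption, and escaped characters such as \{ or \'e1 are not handled.

```python
import re


def rtf_to_lines(rtf):
    """Very rough RTF-to-text conversion for simple documents like the sample above."""
    body = rtf.strip()
    # Drop the outermost pair of braces
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    # Remove nested groups, innermost first, until nothing changes
    previous = None
    while previous != body:
        previous = body
        body = re.sub(r"(?<!\\)\{[^{}]*\}", "", body)
    # \par marks a new line (the \b keeps \pard from matching)
    body = re.sub(r"\\par\b ?", "\n", body)
    # Strip the remaining control words such as \fs22 or \b
    body = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", body)
    return [line.strip() for line in body.splitlines() if line.strip()]
```

Each returned line then holds one "Username Password" pair that can be split on whitespace and stored.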