Regexpression for getting a file - pentaho

I have to get a file through PDI based on the filename and i want to select file with name matching pattern eligible_for_push which has to be at the end.The file can be .txt or .csv
Please Help
Thanks

There are two part to your query:
1. Finding all files ending with "eligible_for_push":
You cannot use regex to find this sort of pattern (at least i am not aware of). So as an alternate do the following:
Search all the files in the path using "Get Filename" steps. Use modified Javascript to find out the file ending with the above pattern. Check the JS file below.
2. Files can be ".txt" or ".csv":
You can use the below regex/wildcard to find choose between either .txt or .csv
.*\.txt|.*\.csv
Note : Use this code once you have filtered out the files ending with "eligible_for_push". The above JS ignore all the file patterns. After that use the second step to sort out all the .txt or .csv files.
Hope it helps :)

Related

Not able to filter files using pathGlobFilter

We are trying to read file from directory based on pattern from azure blob srorage.We are using
pathGlobFilter option to select files. The directory contains following files
Sales_51820_14529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
We need to process only those files which does not have "T" in file name .We need to process only these two files
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
But we are not able to read only these two files.
Here is the code,
df = spark.read.format("csv").schema(structSchema).options(header=False,inferSchema=True,sep='|',pathGlobFilter= "Sales_\d{5} _ \d{8}_[a-z0-9]+.csv$").load("wasbs://abc#xxxxx.blob.core.windows.net/abc/2022/02/11/"
Regards,
Rajib
Glob is not a standard regular expression, there is differences between them.
For example glob doesn't match the number of times.
For details, see:here
Back to this question, a relatively stupid way, looking forward to the perfect solution of the giant.
pathGlobFilter="Sales_[0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[a-z0-9]*.csv"

Input data from file from directory where filename updates each month

I'm having trouble finding the answer to this, but if it already exists, please share with me. I'm trying input a file that where the file ending is ambiguous within a directory. For example the directory would be 'ThisPC\Desktop\Folder' and I'm looking for file 'SomeData_April2021'. I need it to search for the term 'SomeData' within the folder, as the ending is updated for each month. There are also other files in the folder, so I can't just specify the directory and have it upload everything.
Hopefully this is clear! If not, I'm happy to clarify.
Figured it out. Had to type (?i).+SomeData.. into the wildcard expression field

In kettle use text file input read csv file from a tar.gz file but it didn't worked. Where it might be wrong?

I have a csv file that is tared and zipped. So I have test.tar.gz.
I would like, through text file input, read csv file.
I try this tar:gz:file://C:/test/test.tar.gz!/test.tar! use wildcard like ".*\.csv".
But it sometime can't read success.
It throws Exception
org.apache.commons.vfs.FileNotFolderException:
Could not list the contents of
"tar:gz:file:///C:/test/test.tar.gz!/test.tar!/"
because it is not a folder.
I use windows8.1, pdi 5.2
Where it might be wrong?
For a compressed file csv reading, "Text File Input" step in Pentaho Kettle only supports the first files inside the compressed folder(either in Zip/GZip file). Check the Pentaho Wiki in the compression section.
Now for your issue, try removing the wildcard entry since only the first file inside the zip/gzip file will be read. (as explained above)
I have placed a sample code containing both reading zip and gzip files. Check it here.
Hope it helps :)

Parse M3U file locations to fully qualified paths

I would like to parse the file location information in an M3U playlist into fully qualified paths. The possible formats in M3U files seem to be:
c:\mydir\songs\tune.mp3
\songs\tune.mp3
..\songs\tune.mp3
For the first example, just leave it alone. For the second add the directory that the playlist resides in so it would become c:\playlists\songs\tune.mp3 and the same for the third case so it would also become: c:\playlists\songs\tune.mp3.
I'm using vb under VS2008 and I can't find a way to recognise each of the potential location formats in the M3U file. System.IO.Path offers no solution that I can find. I've searched extensively for terms like "convert relative path to absolute" but no luck.
Any advice appreciated.
Thanks.
Write a batch script that just reads the m3u file line by line, and then just parse each line looking for ":" , and for "..", and edit the string as needed. You can then just write the "converted" strings to another file...

How to find any "txt" file at particular location in system?

I have a robot to find a file of the given name at a particular location in a system but now I want to find all the text files at that particular location. I have tried to use "*.txt", but it didn't worked out. Is there a way to do that?
file.exists ♥environment⟦USERPROFILE⟧\Documents\t.txt errormessage ‴Sorry, I could not find a file‴
dialog ‴File exists‴
You can use the directory command. The pattern arguments allows you to filter out files of a particular extension.
directory path ♥environment⟦USERPROFILE⟧\Desktop pattern *.txt result ♥files
dialog ♥files⟦count⟧
The above code should let you know how many files of the given extension exist in the given directory.
You could take values from the returned list and use it with file.exists command.