Not able to filter files using pathGlobFilter - dataframe

We are trying to read files from a directory in Azure Blob Storage based on a filename pattern. We are using the
pathGlobFilter option to select the files. The directory contains the following files:
Sales_51820_14529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
We need to process only those files which do not have "T" in the file name. That is, we need to process only these two files:
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
But we are not able to read only these two files.
Here is the code,
df = spark.read.format("csv").schema(structSchema).options(header=False,inferSchema=True,sep='|',pathGlobFilter= "Sales_\d{5} _ \d{8}_[a-z0-9]+.csv$").load("wasbs://abc@xxxxx.blob.core.windows.net/abc/2022/02/11/")
Regards,
Rajib

A glob is not a regular expression; the two syntaxes differ.
For example, glob has no repetition counts such as {5} or +, and no \d shorthand.
For details, see: here
Back to this question: below is a rather crude workaround. I look forward to a cleaner solution from someone more experienced.
pathGlobFilter="Sales_[0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[a-z0-9]*.csv"
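
As a rough local check of the pattern above, the following sketch runs it against the four filenames from the question using Python's fnmatch module. This assumes that fnmatch treats [0-9], [a-z0-9] and * the same way the glob matcher behind pathGlobFilter does for this particular pattern; it is not a substitute for testing against the real storage account.

```python
from fnmatch import fnmatchcase

# The four filenames listed in the question.
files = [
    "Sales_51820_14529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv",
    "Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv",
    "Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv",
    "Sales_61820_17529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv",
]

# Glob has no {5} or {8}, so each digit class is simply repeated.
pattern = "Sales_" + "[0-9]" * 5 + "_" + "[0-9]" * 8 + "_[a-z0-9]*.csv"

# fnmatchcase is case-sensitive, so an uppercase "T" after the second
# underscore does not match the lowercase class [a-z0-9].
selected = [name for name in files if fnmatchcase(name, pattern)]

for name in selected:
    print(name)
```

The "T" files are rejected only because the character right after the second number block must be a lowercase letter or a digit.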

Related

Pandas, Glob, use wildcard to stand for end of filename

I am programming (Pandas) around a problem where certain generated files are saved with a date attached to the file. For example: file-name_20220814.csv.
However, these files change each time they are generated, creating a new ending to the file. What is the best way to use a wildcard to stand for these file date endings?
Glob? How would I do that in the following code:
df1 = pd.read_csv('files/file-name_20220816.csv')
Answer provided by @mitoRibo:
pd.read_csv(glob('files/file-name_*csv')[0])
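
Note that glob returns matches in no guaranteed order, so [0] may not be the newest file when several dated files exist. A minimal sketch of one way to pick the latest, assuming the date suffix is always YYYYMMDD so that the names sort chronologically (the helper name latest_dated_file is hypothetical):

```python
import glob
import os


def latest_dated_file(directory, stem="file-name_"):
    """Return the matching CSV path whose name sorts last.

    Assumes the suffix is a zero-padded YYYYMMDD date, so that
    lexicographic order equals chronological order.
    """
    pattern = os.path.join(directory, stem + "*.csv")
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"no files match {pattern}")
    return max(matches)
```

The returned path can then be passed to pd.read_csv, for example pd.read_csv(latest_dated_file('files')).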

Mule 4: SFTP List files that contain a variable

I have an SFTP directory that contains several files in this format
19328D_T001045863113302101909_20220721_103898.txt
1932A8_T001045863113302101909_20220721_103802.txt
The part starting with T I have saved as a dynamic variable vars.transaction (e.g. vars.transaction == "T001045863113302101909"). I want to check whether any files in this directory contain my vars.transaction in the filename.
So I think I need to use the SFTP List connector, edit inline and use a filename pattern. But as there are characters before and after the transaction part, I am not sure what to put in the filename pattern. Something like [#vars.transaction]
Thanks in advance
You can use the wildcard * along with your variable, like *#[vars.transaction]*. That will match all the files that have the value of vars.transaction anywhere in their name.
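
To illustrate what such a pattern selects, here is a small Python sketch (not Mule code) that applies the same wildcard shape to the two filenames from the question plus one made-up non-matching name. The third filename is hypothetical, and the sketch assumes the transaction value contains no wildcard characters of its own.

```python
from fnmatch import fnmatchcase

transaction = "T001045863113302101909"

names = [
    "19328D_T001045863113302101909_20220721_103898.txt",
    "1932A8_T001045863113302101909_20220721_103802.txt",
    "19328D_T999999999999999999999_20220721_103898.txt",  # hypothetical
]

# A leading and a trailing * let anything appear before and after the value.
pattern = f"*{transaction}*"

matches = [name for name in names if fnmatchcase(name, pattern)]
print(matches)
```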

How to fix 'File name too long' errors when using Snakemake

When using Snakemake, I store the values for my variables as part of the filenames (e.g. "processed/count_{project}.tsv"). Recently, I've started using R formulas with many covariates as a variable. Now I get an error because the filename is too long for the operating system. Has anyone else run into this issue, and does anyone have suggestions? Is there a canonical Snakemake approach for this problem?
Personally, I don't think it is a good idea to store information into the filename.
Rather, I would create a temporary file in tabular or YAML format linking the file in question to its covariates or other data, then read this file in R or elsewhere to extract the relevant information.
Another idea is to encode the values as directories in the path instead, since a full path is allowed to be longer than a single filename.
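
A minimal sketch of the lookup-file idea, under some assumptions: a short hash of the formula is used as the filename part (the hashing is an addition here, not something the answer specifies), and JSON is used instead of YAML only because it is in the standard library. The names short_key and register and the example formula are hypothetical.

```python
import hashlib
import json


def short_key(formula, length=12):
    """Derive a short, stable identifier from a long formula string."""
    return hashlib.sha1(formula.encode("utf-8")).hexdigest()[:length]


def register(formula, table):
    """Record key -> formula in the lookup table and return a short path."""
    key = short_key(formula)
    table[key] = formula
    return f"processed/count_{key}.tsv"


lookup = {}
formula = "y ~ age + sex + batch + site + smoking + bmi"  # hypothetical
path = register(formula, lookup)

# The table can be written next to the outputs and read back from R or
# Python to recover the full formula for a given short key.
serialized = json.dumps(lookup, indent=2)
print(path)
```

A truncated hash can in principle collide, so a longer key or a collision check may be worth adding when many formulas are involved.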

Regexpression for getting a file

I have to get a file through PDI based on the filename, and I want to select files whose names end with the pattern eligible_for_push. The file can be .txt or .csv.
Please Help
Thanks
There are two parts to your query:
1. Finding all files ending with "eligible_for_push":
You cannot use a regex to find this sort of pattern (at least, I am not aware of a way). So as an alternative, do the following:
Search all the files in the path using the "Get Filename" step. Use a Modified JavaScript step to pick out the files ending with the above pattern. Check the JS file below.
2. Files can be ".txt" or ".csv":
You can use the regex/wildcard below to choose between either .txt or .csv:
.*\.txt|.*\.csv
Note: Use this regex only after you have filtered out the files ending with "eligible_for_push". The JS above ignores the file extension, so use the second step afterwards to keep only the .txt or .csv files.
Hope it helps :)
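
For comparison, the two conditions can also be written as one regular expression. The sketch below checks that in Python, assuming "at the end" means directly before the extension; whether PDI's wildcard field accepts this exact expression is an assumption that would need to be verified there. The example filenames are hypothetical.

```python
import re

# Name must end with "eligible_for_push" followed by .txt or .csv.
PATTERN = re.compile(r".*eligible_for_push\.(txt|csv)")


def is_eligible(filename):
    """True when the whole filename matches the pattern."""
    return PATTERN.fullmatch(filename) is not None


for name in (
    "orders_eligible_for_push.csv",   # hypothetical
    "orders_eligible_for_push.txt",   # hypothetical
    "eligible_for_push_orders.csv",   # hypothetical, pattern not at the end
    "orders_eligible_for_push.xml",   # hypothetical, wrong extension
):
    print(name, is_eligible(name))
```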

Parse M3U file locations to fully qualified paths

I would like to parse the file location information in an M3U playlist into fully qualified paths. The possible formats in M3U files seem to be:
c:\mydir\songs\tune.mp3
\songs\tune.mp3
..\songs\tune.mp3
For the first example, just leave it alone. For the second add the directory that the playlist resides in so it would become c:\playlists\songs\tune.mp3 and the same for the third case so it would also become: c:\playlists\songs\tune.mp3.
I'm using vb under VS2008 and I can't find a way to recognise each of the potential location formats in the M3U file. System.IO.Path offers no solution that I can find. I've searched extensively for terms like "convert relative path to absolute" but no luck.
Any advice appreciated.
Thanks.
Write a batch script that just reads the m3u file line by line, and then just parse each line looking for ":" , and for "..", and edit the string as needed. You can then just write the "converted" strings to another file...
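
The line-by-line approach can be sketched as follows, in Python rather than VB or batch. It follows the outcome stated in the question: entries with a drive letter are left alone, and for the others any leading backslash or ".." component is dropped before the playlist directory is prepended. That deliberately differs from normal path resolution, where ".." would move up one directory. The function name qualify is hypothetical.

```python
import ntpath  # Windows path rules, usable on any platform


def qualify(entry, playlist_dir):
    """Turn one M3U entry into a fully qualified Windows path."""
    entry = entry.strip()
    drive, _ = ntpath.splitdrive(entry)
    if drive:
        # Already fully qualified, e.g. c:\mydir\songs\tune.mp3
        return entry
    # Drop empty, "." and ".." components so that both \songs\tune.mp3
    # and ..\songs\tune.mp3 end up under the playlist directory.
    parts = [
        part
        for part in entry.replace("/", "\\").split("\\")
        if part not in ("", ".", "..")
    ]
    return ntpath.join(playlist_dir, *parts)


for line in (r"c:\mydir\songs\tune.mp3", r"\songs\tune.mp3", r"..\songs\tune.mp3"):
    print(qualify(line, r"c:\playlists"))
```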