Renaming files by comparing them to a JSON list? - vb.net

I have a big folder with many files in it that I need to rename by comparing them to a list in a JSON output.
Here is a sample of the JSON output:
[
{"title":"The Little Hours","year":"2017","imdbid":"tt5666304","scid":"10080"},
{"title":"CarGo","year":"2017","imdbid":"tt6680792","scid":"10079"},
{"title":"My Little Pony: The Movie","year":"2017","imdbid":"tt4131800","scid":"10078"},
{"title":"Amityville: The Awakening","year":"2017","imdbid":"tt1935897","scid":"10077"}
]
The actual list has over 10,000 entries. In my folder, the file names match the scid values, for example 10080.mp4. In the JSON list, scid 10080 corresponds to the title The Little Hours.
I know how to loop through the entire folder to read the names of the file, but I am stuck on reading this JSON file.
My pseudocode is something like:
For Each file In folder
    If file.Name = json.scid Then
        Rename(file, json.title) ' the title belonging to that scid
    End If
Next
I've Googled around, but the other JSON examples do not look similar to the JSON output I have.
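The lookup described above can be sketched as follows. This is a minimal illustration in Python rather than VB.NET, and it assumes the JSON file holds one array of objects shaped like the sample. The function names `build_rename_plan` and `rename_all` are hypothetical, and replacing characters such as ":" with "_" is an assumption made because some titles contain characters that Windows does not allow in file names.

```python
import json
import re
from pathlib import Path


def build_rename_plan(folder, entries):
    """Return (old_path, new_path) pairs for files whose stem matches an scid."""
    # Index the titles by scid once, so each file needs only a dict lookup.
    titles = {entry["scid"]: entry["title"] for entry in entries}
    plan = []
    for path in sorted(Path(folder).iterdir()):
        title = titles.get(path.stem)  # "10080.mp4" -> stem "10080"
        if path.is_file() and title is not None:
            # Assumption: replace characters that are invalid in Windows file names.
            safe_title = re.sub(r'[<>:"/\\|?*]', "_", title)
            plan.append((path, path.with_name(safe_title + path.suffix)))
    return plan


def rename_all(folder, json_path):
    """Rename every matching file and return how many files were renamed."""
    entries = json.loads(Path(json_path).read_text(encoding="utf-8"))
    renamed = 0
    for old_path, new_path in build_rename_plan(folder, entries):
        if new_path.exists():
            continue  # skip rather than overwrite an existing file
        old_path.rename(new_path)
        renamed += 1
    return renamed
```

Building the dictionary first avoids scanning all 10,000 entries for every file. Files whose name has no matching scid are left untouched.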

Related

Not able to filter files using pathGlobFilter

We are trying to read files from a directory in Azure Blob Storage based on a pattern. We are using
the pathGlobFilter option to select files. The directory contains the following files:
Sales_51820_14529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
We need to process only those files which do not have "T" in the file name, that is, only these two files:
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
But we are not able to read only these two files.
Here is the code,
df = spark.read.format("csv").schema(structSchema).options(header=False,inferSchema=True,sep='|',pathGlobFilter= "Sales_\d{5} _ \d{8}_[a-z0-9]+.csv$").load("wasbs://abc#xxxxx.blob.core.windows.net/abc/2022/02/11/")
Regards,
Rajib
A glob pattern is not a regular expression; the two syntaxes differ.
For example, glob has no repetition quantifiers such as {5}, and it does not support \d.
Back to the question: a rather crude workaround is to spell out every digit explicitly. A more elegant solution is welcome.
pathGlobFilter="Sales_[0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[a-z0-9]*.csv"
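To see why this pattern keeps only the two wanted files, it can be tried against the four file names from the question. The sketch below uses Python's `fnmatch` module, which is not the glob implementation Spark uses. It assumes that the only constructs involved here, `[0-9]`, `[a-z0-9]` and `*`, behave the same way in both. `select_files` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Same pattern as above, built by repetition because glob has no {n} quantifier.
PATTERN = "Sales_" + "[0-9]" * 5 + "_" + "[0-9]" * 8 + "_[a-z0-9]*.csv"


def select_files(names, pattern=PATTERN):
    """Return the names that match the glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


names = [
    "Sales_51820_14529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv",
    "Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv",
    "Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv",
    "Sales_61820_17529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv",
]
selected = select_files(names)
```

The "T" files are rejected because the character after the second number block must be a lowercase letter or a digit, and an uppercase "T" is neither.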

How to read multiple text files into a dataframe in pyspark

I have a few txt files in a directory (I have only the path and not the names of the files) that contain JSON data, and I need to read all of them into a dataframe.
I tried this:
df=sc.wholeTextFiles("path/*")
But I can't even display the data, and my main goal is to perform queries in different ways on the data.
Instead of wholeTextFiles (which gives key-value pairs, with the file name as key and the file content as value),
try read.json with your directory name; Spark will read all the files in the directory into a dataframe.
df=spark.read.json("<directory_path>/*")
df.show()
From docs:
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system URI.
Each file is read as a single record and returned in a key-value pair,
where the key is the path of each file, the value is the content of
each file.
Note: Small files are preferred, as each file will be loaded fully in
memory.

Appending csv files in directory into a pandas dataframe

I have written a scraper which downloads daily flight prices, stores them as pandas data frames and saves them as csv files in a given folder. I am now trying to combine these csv files into a single pandas data frame for data analysis using append, but the end result is an empty data frame.
Specifically, individual csv files are loaded correctly into pandas, but the append seems to fail (and several methods found on stackoverflow posts don't seem to work). Code is below, any pointers? Thanks!
directory = os.path.join("C:\\Testfolder\\")
for root,dirs,files in os.walk(directory):
    for file in files:
        daily_flight_df = (pd.read_csv(directory+file,sep=";")) #loads csv into dataframe - works correctly
        cons_flight_df.append(daily_flight_df) #appends daily flight prices into a pandas with consolidated flight prices - does not seem to work
print(cons_flight_df) #currently prints out an empty data frame
cons_flight_df.to_csv('C:\\Testfolder\\test.csv') #currently returns empty csv file
In pandas, the append method does not work in place: it returns a new data frame and leaves the original unchanged. You need to assign the result back.
cons_flight_df = cons_flight_df.append(daily_flight_df)
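Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same consolidation is usually written with `pd.concat`. The following is a minimal sketch under that assumption. `load_folder` is a hypothetical helper name, and the ";" separator is taken from the question.

```python
from pathlib import Path

import pandas as pd


def load_folder(directory, sep=";"):
    """Read every csv file in a directory and concatenate them into one data frame."""
    # Collect the individual frames first, then concatenate once at the end.
    frames = [pd.read_csv(path, sep=sep) for path in sorted(Path(directory).glob("*.csv"))]
    if not frames:
        return pd.DataFrame()  # pd.concat raises on an empty list
    return pd.concat(frames, ignore_index=True)
```

Concatenating once is also cheaper than growing a data frame inside the loop, because each append copies all the rows collected so far. If the consolidated file is written into the same folder, a later run will read it back in as input, so a separate output folder may be preferable.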

Importing a *random* csv file from a folder into pandas

I have a folder with several csv files, with file names between 100 and 400 (Eg. 142.csv, 278.csv etc). Not all the numbers between 100-400 are associated with a file, for example there is no 143.csv. I want to write a loop that imports 5 random files into separate dataframes in pandas instead of manually searching and typing out the file names over and over. Any ideas to get me started with this?
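You can use glob to list all the csv files in the directory and then pick five of them at random.
import glob
import numpy as np
import pandas as pd
files = glob.glob('*.csv')
# replace=False so that the same file cannot be picked twice
random_files = np.random.choice(files, 5, replace=False)
dataframes = []
for fp in random_files:
    dataframes.append(pd.read_csv(fp))
This chooses 5 random files from the directory and reads each one into a separate dataframe.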
Hope this answers your question.
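The file selection can also be done with the standard library alone. The sketch below assumes only that the files end in ".csv"; `pick_random_csvs` is a hypothetical helper name. `random.sample` draws without replacement, so no file is returned twice.

```python
import random
from pathlib import Path


def pick_random_csvs(folder, k=5, seed=None):
    """Return up to k distinct csv paths chosen at random from a folder."""
    files = sorted(Path(folder).glob("*.csv"))
    rng = random.Random(seed)  # pass a seed to make the choice reproducible
    # Cap k so that a folder with fewer than k files does not raise ValueError.
    return rng.sample(files, min(k, len(files)))
```

Each returned path can then be passed to `pd.read_csv` to build the separate dataframes.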

Regexpression for getting a file

I have to get a file through PDI based on the file name, and I want to select the file whose name matches the pattern eligible_for_push, which has to be at the end of the name. The file can be .txt or .csv.
Please Help
Thanks
There are two parts to your question:
1. Finding all files ending with "eligible_for_push":
You cannot use a regex to find this sort of pattern (at least not that I am aware of). As an alternative, do the following:
Search all the files in the path using the "Get Filename" step. Use a Modified JavaScript step to find the files ending with the above pattern. Check the JS file below.
2. Files can be ".txt" or ".csv":
You can use the regex/wildcard below to choose between .txt and .csv:
.*\.txt|.*\.csv
Note: Use this expression only after you have selected the files ending with "eligible_for_push". The JS step above does not look at the file type, so apply the second step afterwards to keep only the .txt or .csv files.
Hope it helps :)
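As a possible alternative to the two-step approach, both conditions can be written as one regular expression, assuming "at the end" means directly before the extension. The sketch below checks that expression in Python. Whether the same expression works unchanged in the PDI file-name field has not been verified here, and `is_eligible` is a hypothetical helper name.

```python
import re

# Any prefix, then the literal "eligible_for_push", then ".txt" or ".csv".
ELIGIBLE = re.compile(r".*eligible_for_push\.(txt|csv)")


def is_eligible(filename):
    """Return True if the whole file name matches the pattern."""
    return ELIGIBLE.fullmatch(filename) is not None
```

With this expression, a name such as report_eligible_for_push.csv is accepted, while eligible_for_push_old.csv is rejected because extra text follows the required ending.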