Parse CSV to Extract Filenames and Rename Files (Python) - pandas

I'm looking to extract filenames from a comma-separated CSV, rename the files they refer to with sequential numbers, then update the CSV with the new paths.
I am able to extract the first column:
import pandas as pd
my_data = pd.read_csv('test.csv', sep=',', header=0, usecols=[0])
And then the list of entries that I need:
values = list(x for x in my_data["full path"])
From there I want to use each path to rename the file sequentially (1.msg, 2.msg, 3.msg), then go back and update the CSV with the "new" path.
My CSV looks like:
full path, name, data1, data2
\path\to\a\file.msg,data,moredata,evenmoredata
Existing file path:
\path\to\a\file.msg
New file path:
\path\to\a\1.msg
Any help is appreciated.

You can directly modify the dataframe and the files by iterating through the dataframe itself. Once you have edited the desired rows, you persist the dataframe by rewriting it to a csv file (the same file if you want to overwrite it). I assume here that file_path is the name of the column containing the file path: change it accordingly.
Explanations are in the code comments:
import os
import pandas as pd

# I'm assuming everything is correct up to the data reading
df = pd.read_csv('test.csv', sep=',', header=0, usecols=[0])

# You can iterate through the index of the dataframe itself.
# Were it inconsistent, you can use a custom one (here `k`).
# Start at 1 so the first file becomes `1.msg`, as in the question.
k = 1
for index, row in df.iterrows():
    # Extract the current file path, e.g. `/path/to/file.msg`
    fp = row['file_path']
    # Extract the filename, e.g. `file.msg`
    fn = os.path.basename(fp)
    # Extract the dir path, e.g. `/path/to`
    dir_path = os.path.dirname(fp)
    # Split the name from the extension
    name, ext = os.path.splitext(fn)
    # Reconstruct the new file path
    new_path = os.path.join(dir_path, str(k) + ext)
    # The try block avoids crashing on prohibited access to the file or its absence
    try:
        os.rename(fp, new_path)
    except OSError:
        # Here you can enrich the code to handle different specific exceptions
        print(f'Error: file {fp} cannot be renamed')
    else:
        # If the file was correctly renamed, you can modify the dataframe.
        # Put this code outside the try-except-else block to modify the dataframe in any case.
        # NOTE: the index here MUST be that of the dataframe, not your custom one.
        df.at[index, 'file_path'] = new_path
        k = k + 1

# Overwrite the csv. Adjust accordingly!
df.to_csv('test.csv', index=False)
Disclaimer: I couldn't try the code above. In case of any slip, I'll correct it as soon as possible.
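One caveat worth noting: since the read above uses usecols=[0], writing the dataframe back would drop the other columns (name, data1, data2). A minimal sketch that preserves them, assuming the column is named "full path" as in the sample CSV:
import pandas as pd

# Read every column so nothing is lost when writing back
df = pd.read_csv('test.csv', sep=',', header=0)

# ... rename the files and update the paths as above, but using
# df.at[index, 'full path'] = new_path

# Write back without the pandas index column
df.to_csv('test.csv', index=False)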

Related

combine all csv in various subfolders in pandas or in powershell/terminal and create a pandas dataframe

I have individual csv files within subfolders of subfolders: a year folder contains month folders, each month folder contains day folders, and each day folder contains the individual csv. I would like to combine all of the individual csv files into one and create a pandas df.
In the tree diagram, it looks like this:
I tried this approach below but nothing was created:
import pandas as pd
import glob
path = r'~/root/up/to/the/folder/2022'
alldata = glob.glob(path + "each*.csv")
alldata.head()
I initially had it just looking for "each*.csv" files but realized something is missing in between in order to get the individual csv within each folder. Maybe a for loop will work, like looping through each folder within each subfolder, but that is where I am stuck right now.
The answer to this: Combining separate daily CSVs in pandas shows files that are in the same folder.
I tried to make sense of this answer: batch file to concatenate all csv files in a subfolder for all subfolders, but it just won't click for me.
I also tried the following as suggested in Python importing csv files within subfolders
import os
import pandas as pd

path = '<Insert Path>'
file_extension = '.csv'
csv_file_list = []

for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(file_extension):
            file_path = os.path.join(root, name)
            csv_file_list.append(file_path)

dfs = [pd.read_csv(f) for f in csv_file_list]
but nothing is showing; I think there is something wrong with the path, as shown in the tree above.
Or maybe there is a following step I need to do, because when I ran dfs.head() it says AttributeError: 'list' object has no attribute 'head'
The following should work:
from pathlib import Path
import pandas as pd
csv_folder = Path('.') # path to your folder, e.g. to `2022`
df = pd.concat(pd.read_csv(p) for p in csv_folder.glob('**/*.csv'))
Alternatively, if you prefer, you can also use glob.glob('**/*.csv', recursive=True) instead of the Path.glob method.
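For reference, a minimal sketch of that glob-based variant (the '2022' folder name is taken from the question):
import glob
import pandas as pd

# recursive=True is required for '**' to match nested subfolders
files = glob.glob('2022/**/*.csv', recursive=True)
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)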

how to read data from multiple folder from adls to databricks dataframe

The file path format is data/year/week number/day number/hour/data_hour.parquet:
data/2022/05/01/00/data_00.parquet
data/2022/05/01/01/data_01.parquet
data/2022/05/01/02/data_02.parquet
data/2022/05/01/03/data_03.parquet
data/2022/05/01/04/data_04.parquet
data/2022/05/01/05/data_05.parquet
data/2022/05/01/06/data_06.parquet
data/2022/05/01/07/data_07.parquet
How can I read all these files one by one in a Databricks notebook and store them in a dataframe?
import pandas as pd

# Get all the files under the folder
data = dbutils.fs.ls(file)
df = pd.DataFrame(data)

# Create the list of file paths
path_list = df.path.tolist()

for i in path_list:
    df = spark.read.load(path=f'{i}*', format='parquet')
I am only able to read the last file; the other files are skipped.
The last line of your code does not load the data incrementally. Instead, it overwrites the df variable with the data from each path on every iteration.
Removing the for loop and trying the code below would give you an idea of how file masking with asterisks works. Note that the path should be a full path (I'm not sure whether the data folder is your root folder or not).
df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
This is what I have applied from the same answer I shared with you in the comment.
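If you do need to iterate path by path, you have to accumulate the results instead of overwriting df on each pass. A rough sketch, assuming the paths come from dbutils.fs.ls as in the question (the folder below is just the example path from the question):
from functools import reduce

# dbutils.fs.ls returns FileInfo objects with a .path attribute
paths = [f.path for f in dbutils.fs.ls('/data/2022/05/01/')]

# Read each file, then union them into a single dataframe
dfs = [spark.read.load(path=p, format='parquet') for p in paths]
df = reduce(lambda a, b: a.unionByName(b), dfs)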

Trying to combine all csv files in directory into one csv file

My directory structure is as follows:
>Pandas-Data-Science
>Demo
>SalesAnalysis
>Sales_Data
>Sales_April_2019.csv
>Sales_August_2019.csv
....
>Sales_December_2019.csv
So Demo is a new python file I made and I want to take all the csv files from Sales_Data and create one csv file in Demo.
I was able to make a csv file for any particular csv file from Sales_Data
df = pd.read_csv('./SalesAnalysis/Sales_Data/Sales_August_2019.csv')
So I figured that if I just get each file name and iterate through them, I can concatenate them all into an empty csv file:
import os
import pandas as pd

df = pd.DataFrame(list())
df.to_csv('one_file.csv')

files = [f for f in os.listdir('./SalesAnalysis/Sales_Data')]
for f in files:
    current = pd.read_csv("./SalesAnalysis/Sales_Data/" + f)
So my thinking was that current would let me create a single csv file, since f prints out the exact string required, i.e. Sales_August_2019.csv.
However, I get an error with current that says: No such file or directory: './SalesAnalysis/Sales_Data/Sales_April_2019.csv',
when clearly I was able to read a csv file with the exact same string earlier. So why does my code not work?
This is probably a problem with your current working directory not being what you expect. I prefer to do these operations with absolute paths, which makes debugging easier:
from pathlib import Path
import pandas as pd

path = Path('./SalesAnalysis/Sales_Data').resolve()
current = [pd.read_csv(file) for file in path.glob('*.csv')]
demo = pd.concat(current)
You can set a breakpoint to find out what path actually is.
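Alternatively, a quick way to see which directory your relative paths are resolved against is to print the working directory first:
from pathlib import Path
print(Path.cwd())  # the directory that relative paths are resolved against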
try this:
import os
import glob
import pandas as pd

# change the working directory to the folder with the csv files
os.chdir(os.path.expanduser("/mydir"))

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# export to csv
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')

How to iterate over files extract file name and pass to pandas logic

I have a folder called "before_manipulation".
It contains 3 CSV files with the names File_A.CSV, File_B.CSV, File_C.CSV.
Current path: c:/users/before_manipulation [File_A.CSV, File_B.CSV, File_C.CSV]
I have a data manipulation that I need to do in each of the files, and after the manipulation I need to save them with the same file names in another directory.
Targeted path: C:/users/after_manipulation [File_A.CSV, File_B.CSV, File_C.CSV]
I have the logic to do the data manipulation when there is only a single file with a pandas dataframe. When I have multiple files, how do I read each file, capture its name, and pass it to my logic?
Pseudocode of how I am working when there is one file:
import pandas as pd
df = pd.read_csv('c:/users/before_manipulation/file_A.csv')
... do logic/manipulation
df.to_csv('c:/users/after_manipulation/file_A.csv')
any help is appreciated.
You can use os.listdir(<path>) to return a list of the files contained within a directory. If you do not pass a value for <path>, it will return the listing of the current working directory.
With the list from os.listdir you can iterate over it, passing each captured filename to the function you already have for the data manipulation. Then, on the save, you can use the captured filename to save into your desired directory.
In summary, the code would look something like this:
import os
import pandas as pd

in_dir = r'c:/users/before_manipulation/'
out_dir = r'c:/users/after_manipulation/'

files_to_run = os.listdir(in_dir)

for file in files_to_run:
    print('Running {}'.format(in_dir + file))
    df = pd.read_csv(in_dir + file)
    # ...do your logic here to return the changed df you want to save
    # ...
    df.to_csv(out_dir + file)
For this to work, every file in the directory would need to have the same shape, and you would need to want to apply the same logic to each file.
If that is not the case, you will need something like a dictionary to store the different manipulations you need to do based on the file name, and call those when appropriate.
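A minimal sketch of that dispatch idea (the handler functions here are hypothetical placeholders for your per-file logic):
import os
import pandas as pd

def manipulate_a(df):
    # hypothetical logic for File_A.CSV
    return df

def manipulate_b(df):
    # hypothetical logic for File_B.CSV
    return df

handlers = {'File_A.CSV': manipulate_a, 'File_B.CSV': manipulate_b}

in_dir = r'c:/users/before_manipulation/'
out_dir = r'c:/users/after_manipulation/'

for file in os.listdir(in_dir):
    handler = handlers.get(file)
    if handler is not None:
        df = handler(pd.read_csv(in_dir + file))
        df.to_csv(out_dir + file, index=False)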
Assuming you have some logic that works for one file, I'd just put that logic into a function and run it in a for loop.
You'd end up with something like this:
directory = r'c:/users/before_manipulation'
files = ['file_A.CSV', 'File_B.CSV', 'File_C.CSV']

for file in files:
    somefunction(directory + '/' + file)
If you need more info on functions I'd check this out: https://www.w3schools.com/python/python_functions.asp
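For illustration, somefunction might look like the sketch below; the output path is an assumption based on the directories named in the question:
import pandas as pd

def somefunction(path):
    df = pd.read_csv(path)
    # ... your manipulation logic here ...
    out_path = path.replace('before_manipulation', 'after_manipulation')
    df.to_csv(out_path, index=False)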
Using pathlib:
from pathlib import Path
import pandas as pd

your_dir = '\\your_input_path'   # folder containing the original csv files
new_dir = '\\your_path'

files = [file for file in Path(your_dir).glob('*.csv')]

for file in files:
    df = pd.read_csv(file)
    # .. your logic
    df.to_csv(f'{new_dir}\\{file.name}', index=False)

open all csv files with different structure at the same time in pandas

I have a text file and I am using pandas to process it. I am trying to skip the first few rows when opening the file. I don't have a fixed number of lines to remove across the different files (I am writing a script), so I can't hardcode the skiprows argument, but I do know which line should be the first line of the new dataframe. All the files have this line in common, and I want to remove every line before it in all of them.
small example:
ID,Sample01
Owner,Administrator
ID,identifier
IL21R,6
CD84,21
KLRC2,9
TNFRSF11A,18
and here is the expected output:
ID,identifier
IL21R,6
CD84,21
KLRC2,9
TNFRSF11A,18
the common line among all files is:
ID,identifier
The following code works for the small example, since there are only 2 lines above the common line. How can I change the code so it also works for files with more than 2 lines above the common line?
df = pd.read_csv(filename, skiprows=2, sep=',')
Logic: read the file into a list and find the index of ID,identifier, then use that value as skiprows.
import pandas as pd

with open(r'file_path') as f:
    total = f.readlines()

skip_value = total.index('ID,identifier\n')

df = pd.read_csv(r'file_path', skiprows=skip_value, sep=',')
print(df)
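If the files may contain Windows line endings or trailing spaces, matching the stripped line is a bit more robust; a small variant of the same idea:
import pandas as pd

with open(r'file_path') as f:
    # strip() tolerates '\r\n' line endings and stray whitespace
    skip_value = next(i for i, line in enumerate(f) if line.strip() == 'ID,identifier')

df = pd.read_csv(r'file_path', skiprows=skip_value, sep=',')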