How to read data from multiple folders in ADLS into a Databricks DataFrame

The file path format is data/year/weeknumber/day/data_hour.parquet:
data/2022/05/01/00/data_00.parquet
data/2022/05/01/01/data_01.parquet
data/2022/05/01/02/data_02.parquet
data/2022/05/01/03/data_03.parquet
data/2022/05/01/04/data_04.parquet
data/2022/05/01/05/data_05.parquet
data/2022/05/01/06/data_06.parquet
data/2022/05/01/07/data_07.parquet
How do I read all these files one by one in a Databricks notebook and store them in a DataFrame?
import pandas as pd
# Get all the files under the folder
data = dbutils.fs.ls(file)
df = pd.DataFrame(data)
# Create the list of file paths
paths = df.path.tolist()
for p in paths:
    df = spark.read.load(path=f'{p}*', format='parquet')
I can only read the last file; the other files are skipped.

The last line of your code does not load data incrementally. Instead, it overwrites the df variable with the data from each path on every iteration, so only the last file survives.
Remove the for loop and try the code below to see how file masking with asterisks works. Note that the path should be a full path (I'm not sure whether the data folder is your root folder or not):
df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
This is what I have applied from the same answer I shared with you in the comment.
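The same wildcard idea can be checked locally with Python's built-in glob module. This is only a sketch: the temporary directory and file names below are placeholders mirroring the year/week/day/hour layout from the question, not the real ADLS paths.

```python
import glob
import os
import tempfile

# Recreate a miniature version of the year/week/day/hour layout
# (placeholder files, not the real ADLS data).
root = tempfile.mkdtemp()
for day in ("01", "02"):
    for hour in ("00", "01"):
        d = os.path.join(root, "data", "2022", "05", day, hour)
        os.makedirs(d)
        open(os.path.join(d, f"data_{hour}.parquet"), "w").close()

# One pattern with asterisks matches every file at once, which is
# what spark.read.load does with the masked path in the answer above.
pattern = os.path.join(root, "data", "2022", "05", "*", "*", "*.parquet")
matches = sorted(glob.glob(pattern))
print(len(matches))  # 4: every day/hour file matched by a single pattern
```

Each `*` stands for exactly one path component here, which is why the pattern needs one asterisk per folder level plus the file mask.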

Related

Combine all CSVs in various subfolders (in pandas or in PowerShell/terminal) and create a pandas DataFrame

I have individual CSV files within subfolders of subfolders: year folders contain month folders, within each month folder are day folders, and within each day folder is an individual CSV. I would like to combine all of the individual CSVs into one and create a pandas DataFrame.
I tried the approach below, but nothing was created:
import pandas as pd
import glob
path = r'~/root/up/to/the/folder/2022'
alldata = glob.glob(path + "each*.csv")
alldata.head()
I initially had it just looking for "each*.csv" files, but realized something is missing in between in order to reach the individual CSVs within each folder. Maybe a for loop would work: loop through each folder within each subfolder, but that is where I am stuck right now.
The answer to Combining separate daily CSVs in pandas covers files that are in the same folder.
I tried to make sense of this answer: batch file to concatenate all csv files in a subfolder for all subfolders, but it just won't click for me.
I also tried the following, as suggested in Python importing csv files within subfolders:
import os
import pandas as pd
path = '<Insert Path>'
file_extension = '.csv'
csv_file_list = []
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(file_extension):
            file_path = os.path.join(root, name)
            csv_file_list.append(file_path)
dfs = [pd.read_csv(f) for f in csv_file_list]
but nothing is showing; I think there is something wrong with the path.
Or maybe there is a following step I need to do, because when I ran dfs.head() it said AttributeError: 'list' object has no attribute 'head'.
The following should work:
from pathlib import Path
import pandas as pd
csv_folder = Path('.') # path to your folder, e.g. to `2022`
df = pd.concat(pd.read_csv(p) for p in csv_folder.glob('**/*.csv'))
Alternatively, you can use glob.glob('**/*.csv', recursive=True) instead of the Path.glob method.
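As a quick check, the recursive pattern can be exercised on a throwaway directory tree. The folder and file names below are placeholders standing in for the year/month/day layout in the question:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a tiny year/month/day tree with one CSV per day folder
# (placeholder names standing in for the real layout).
root = Path(tempfile.mkdtemp())
for month in ("01", "02"):
    day_dir = root / "2022" / month / "01"
    day_dir.mkdir(parents=True)
    (day_dir / "each_day.csv").write_text("a,b\n1,2\n")

# '**/*.csv' descends into every subfolder, however deeply nested
df = pd.concat(pd.read_csv(p) for p in root.glob("**/*.csv"))
print(len(df))  # one row per CSV file found
```

The key difference from the question's attempt is the `**` component, which makes the glob recursive instead of matching only direct children.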

Parse CSV to Extract Filenames and Rename Files (Python)

I'm looking to extract filenames from a comma-separated CSV, rename the files they refer to with sequential numbers, and then update the CSV accordingly.
I am able to extract the first column:
import pandas as pd
my_data = pd.read_csv('test.csv', sep=',', header=0, usecols=[0])
And then the list of entries that I need:
values = list(x for x in my_data["full path"])
From there I want to use that path to rename each file sequentially as per its path(1.msg, 2.msg, 3.msg), then go back and update the CSV with the "new" path.
My CSV looks like:
full path, name, data1, data2
\path\to\a\file.msg,data,moredata,evenmoredata
Existing file path:
\path\to\a\file.msg
New file path:
\path\to\a\1.msg
Any help is appreciated.
You can directly modify the DataFrame and the files by iterating through the DataFrame itself. Once you have edited the desired rows, you persist the DataFrame by rewriting it to a CSV file (the same one, if you want to overwrite it). I assume here that file_path is the name of the column containing the file path; change it accordingly.
Explanations come with the code comments:
import os
import pandas as pd
# I'm assuming everything is correct up to the data reading
df = pd.read_csv('test.csv', sep=',', header=0, usecols=[0])
# You can iterate through the index of the dataframe itself. Were it inconsistent, you can use a custom counter (here `k`)
k = 0
for index, row in df.iterrows():
    # Extract the current file path, e.g. `/path/to/file.msg`
    fp = row['file_path']
    # Extract the filename, e.g. `file.msg`
    fn = os.path.basename(fp)
    # Extract the dir path, e.g. `/path/to`
    dir_path = os.path.dirname(fp)
    # Split the name from the extension
    name, ext = os.path.splitext(fn)
    # Reconstruct the new file path
    new_path = os.path.join(dir_path, str(k) + ext)
    # Wrap the rename in try/except to avoid crashing on prohibited access or a missing file
    try:
        os.rename(fp, new_path)
    except OSError:
        # Here you can enrich the code to handle different specific exceptions
        print(f'Error: file {fp} cannot be renamed')
    else:
        # If the file was renamed correctly, update the dataframe. Move this line
        # outside the try-except-else block to update the dataframe in any case.
        # NOTE: the index here MUST be that of the dataframe, not your custom one.
        df.at[index, 'file_path'] = new_path
    k = k + 1
# Overwrite the dataframe. Adjust accordingly!
df.to_csv('test.csv', index=False)
Disclaimer: I couldn't try the code above; in case of any slip, I'll correct it as soon as possible.
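A minimal, self-contained version of the same rename-and-update round trip can be sketched with the standard library alone (the csv module in place of pandas). The file names and the temporary directory are throwaway placeholders:

```python
import csv
import os
import tempfile

# Create two placeholder .msg files and a CSV pointing at them
root = tempfile.mkdtemp()
paths = []
for name in ("alpha.msg", "beta.msg"):
    p = os.path.join(root, name)
    open(p, "w").close()
    paths.append(p)

csv_path = os.path.join(root, "test.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["full path"])
    writer.writerows([p] for p in paths)

# Rename each file sequentially (1.msg, 2.msg, ...) and collect the new paths
with open(csv_path) as f:
    rows = list(csv.DictReader(f))
new_paths = []
for k, row in enumerate(rows, start=1):
    fp = row["full path"]
    _, ext = os.path.splitext(fp)
    new_path = os.path.join(os.path.dirname(fp), f"{k}{ext}")
    os.rename(fp, new_path)
    new_paths.append(new_path)

# Write the updated paths back to the CSV
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["full path"])
    writer.writerows([p] for p in new_paths)

print(sorted(os.listdir(root)))  # ['1.msg', '2.msg', 'test.csv']
```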

How to export data frame back to the original csv file that I imported it from?

I have a list of DataFrames named "dfs" which holds 139 DataFrames. I originally imported CSV files into Python and have deleted the first few rows from each DataFrame. Now I wish to save these new files back to their original locations. How can I do that? My new data is saved in another list named final. Also, please tell me if I can make my code more efficient, as I am new to Python.
dfs = [pd.read_csv(filename) for filename in filenames]
final=[]
for i in range(139):
    a = dfs[i].iloc[604:, ]
    final.append(a)
Not sure if I've understood correctly, but if you want to write each DataFrame back to the CSV it was made from, you can zip final with the original filenames and go the opposite way:
for df, filename in zip(final, filenames):
    df.to_csv(filename)

Save the file in a different folder using python

I have a pandas DataFrame and I would like to save it as a text file in another folder. What have I tried so far?
import pandas as pd
df.to_csv(path = './output/filename.txt')
This does not save the file and gives me an error. How do I save the dataframe (df) into the folder called output?
The first argument of to_csv() is named path_or_buf, not path; either change the keyword or just remove it:
df.to_csv('./output/filename.txt')
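If the error persists, another common cause is that the output folder does not exist yet; to_csv does not create missing directories. A small sketch, with a temporary directory standing in for your project folder:

```python
import os
import tempfile

import pandas as pd

base = tempfile.mkdtemp()  # stands in for your project folder
out_dir = os.path.join(base, "output")

# Create the folder first; to_csv raises OSError if it is missing
os.makedirs(out_dir, exist_ok=True)

df = pd.DataFrame({"a": [1, 2, 3]})
out_path = os.path.join(out_dir, "filename.txt")
df.to_csv(out_path, index=False)

print(os.path.exists(out_path))  # True
```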

How to iterate over files, extract the file name, and pass it to pandas logic

I have a folder called "before_manipulation".
It contains 3 CSV files named File_A.CSV, File_B.CSV, and File_C.CSV.
Current path: C:/users/before_manipulation [File_A.CSV, File_B.CSV, File_C.CSV]
I have a data manipulation that I need to do in each of the files, and after the manipulation I need to save them under the same file names in another directory.
Targeted path: C:/users/after_manipulation [File_A.CSV, File_B.CSV, File_C.CSV]
I have the logic for the data manipulation when there is only a single file in a pandas DataFrame. When I have multiple files, how do I read each file and its name and pass it to my logic?
Pseudo code of how I am working if there were only one file:
import pandas as pd
df = pd.read_csv('c:/users/before_manipulation/file_A.csv')
# ... do logic/manipulation
df.to_csv('c:/users/after_manipulation/file_A.csv')
any help is appreciated.
You can use os.listdir(path) to return a list of the files contained within a directory. If you do not pass a path, it returns the listing of the current working directory.
With the list from os.listdir you can iterate over it, passing each captured filename to the function you already have for the data manipulation. Then, when saving, you can reuse the captured filename to write into your desired directory.
In summary the code would look something like this.
import os
import pandas as pd
in_dir = r'c:/users/before_manipulation/'
out_dir = r'c:/users/after_manipulation/'
files_to_run = os.listdir(in_dir)
for file in files_to_run:
    print('Running {}'.format(in_dir + file))
    df = pd.read_csv(in_dir + file)
    # ...do your logic here to return the changed df you want to save
    df.to_csv(out_dir + file)
For this to work, the files in the directory would need to share the same shape, and you would need to apply the same logic to each file.
If that is not the case, you will need something like a dictionary mapping file names to the different manipulations, and to call those when appropriate.
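The dictionary idea can be sketched like this; the function names and per-file logic below are purely illustrative:

```python
import pandas as pd

# Per-file manipulations keyed by file name (illustrative logic)
def double_a(df):
    df["a"] = df["a"] * 2
    return df

def drop_first_row(df):
    return df.iloc[1:]

manipulations = {
    "File_A.CSV": double_a,
    "File_B.CSV": drop_first_row,
}

df = pd.DataFrame({"a": [1, 2, 3]})
# Look up the right manipulation for the current file name and apply it
result = manipulations["File_A.CSV"](df.copy())
print(result["a"].tolist())  # [2, 4, 6]
```

Inside the os.listdir loop you would replace the fixed key with the current file variable, falling back to a default function (or dict.get) for file names without a dedicated entry.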
Assuming you have some logic that works for one file, I'd just put that logic into a function and run it on a for loop.
You'd end up with something like this:
directory = r'c:/users/before_manipulation'
files = ['File_A.CSV', 'File_B.CSV', 'File_C.CSV']
for file in files:
    somefunction(directory + '/' + file)
If you need more info on functions I'd check this out: https://www.w3schools.com/python/python_functions.asp
Using pathlib:
from pathlib import Path
import pandas as pd

new_dir = '\\your_path'
files = Path(your_dir).glob('*.csv')
for file in files:
    df = pd.read_csv(file)
    # .. your logic
    df.to_csv(f'{new_dir}\\{file.name}', index=False)