I have a zip file that, when unlocked, yields a .xlsx file. When I use ZipFile's extractall function in my Lambda function, it stores the output in a directory called '/tmp'. Below is the code I have written.
from zipfile import ZipFile

def extract_zip(input_zip):
    with ZipFile(input_zip) as zf:
        zf.extractall(path='/tmp/', pwd=b'password')
My question is: how can I access the .xlsx file in the '/tmp' directory? Essentially, all I want to do is convert it into a data frame:
df = pd.read_excel('tmp')
You can simply read the file using Pandas as you would normally do:
df = pd.read_excel('/tmp/filename.xlsx')
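If the name of the extracted workbook isn't known ahead of time, you can glob the extraction directory first and hand the match to pandas. A minimal sketch (the directory and file name here are stand-ins created just for the demonstration):

```python
import glob
import os
import tempfile

# Simulate an extraction directory containing one workbook
extract_dir = tempfile.mkdtemp()
open(os.path.join(extract_dir, 'report.xlsx'), 'w').close()

# Find every .xlsx file the archive produced
matches = glob.glob(os.path.join(extract_dir, '*.xlsx'))
print(matches)
# df = pd.read_excel(matches[0])  # then load the first match
```

In the Lambda case you would glob '/tmp/*.xlsx' instead of the temp directory.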
The file path format is data/year/week number/day number/data_hour.parquet:
data/2022/05/01/00/data_00.parquet
data/2022/05/01/01/data_01.parquet
data/2022/05/01/02/data_02.parquet
data/2022/05/01/03/data_03.parquet
data/2022/05/01/04/data_04.parquet
data/2022/05/01/05/data_05.parquet
data/2022/05/01/06/data_06.parquet
data/2022/05/01/07/data_07.parquet
How can I read all these files one by one in a Databricks notebook and store them in a data frame?
import pandas as pd

# Get all the files under the folder
data = dbutils.fs.ls(file)
df = pd.DataFrame(data)

# Create the list of file paths
paths = df.path.tolist()

for i in paths:
    df = spark.read.load(path=f'{i}*', format='parquet')
I am only able to read the last file; the other files are skipped.
The last line of your code cannot load data incrementally. Instead, it overwrites the df variable with the data from each path on every iteration.
Removing the for loop and trying the code below should give you an idea of how file masking with asterisks works. Note that the path should be a full path (I'm not sure whether the data folder is your root folder or not).
df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
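To see how the asterisk masking resolves, here is a small standard-library sketch that builds a throwaway copy of the directory layout from the question and expands the same wildcard pattern with glob (nothing Spark-specific; Spark applies the same masking when listing the paths):

```python
import glob
import os
import tempfile

# Build a throwaway copy of the directory layout from the question
root = tempfile.mkdtemp()
for day in ('01', '02'):
    for hour in ('00', '01'):
        d = os.path.join(root, 'data', '2022', '05', day, hour)
        os.makedirs(d)
        open(os.path.join(d, f'data_{hour}.parquet'), 'w').close()

# One asterisk per directory level, exactly as in the Spark path mask
matched = sorted(glob.glob(os.path.join(root, 'data', '2022', '05', '*', '*', '*.parquet')))
print(len(matched))  # → 4
```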
This is what I applied from the answer I shared with you in the comment.
My directory structure is as follows:
>Pandas-Data-Science
>Demo
>SalesAnalysis
>Sales_Data
>Sales_April_2019.csv
>Sales_August_2019.csv
....
>Sales_December_2019.csv
So Demo is a new Python file I made, and I want to take all the CSV files from Sales_Data and create one CSV file in Demo.
I was able to read any particular CSV file from Sales_Data:
df = pd.read_csv('./SalesAnalysis/Sales_Data/Sales_August_2019.csv')
So I figured that if I just get the file names and iterate through them, I can concatenate them all into an initially empty CSV file:
import os
import pandas as pd

df = pd.DataFrame(list())
df.to_csv('one_file.csv')

files = [f for f in os.listdir('./SalesAnalysis/Sales_Data')]
for f in files:
    current = pd.read_csv('./SalesAnalysis/Sales_Data/' + f)
So my thinking was that current would read each CSV file, since f prints out the exact string required, i.e. Sales_August_2019.csv.
However, I get an error with current that says: No such file or directory: './SalesAnalysis/Sales_Data/Sales_April_2019.csv',
when clearly I was able to read a CSV file with that exact same string. So why does my code not work?
This is probably a problem with your current working directory not being what you expect. I prefer to do these operations with absolute paths, which makes debugging easier:
from pathlib import Path
path = Path('./SalesAnalysis/Sales_Data').resolve()
current = [pd.read_csv(file) for file in path.glob('*.csv')]
demo = pd.concat(current)
You can set a breakpoint to find out what path is exactly.
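To see what resolve() actually does, here is a tiny sketch (the working directory is a throwaway temp directory, and the relative path mirrors the question; the path does not need to exist for resolve() to work on Python 3.6+):

```python
import os
import tempfile
from pathlib import Path

# From a known working directory, resolve a relative path to an absolute one
os.chdir(tempfile.mkdtemp())
p = Path('./SalesAnalysis/Sales_Data').resolve()
print(p.is_absolute())  # → True
```

Printing (or breakpointing on) p immediately shows which directory the relative path is being resolved against.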
try this:

import glob
import os
import pandas as pd

# Change into the directory that contains the CSV files
os.chdir(os.path.expanduser("/mydir"))

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# Combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# Export to csv
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')
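A self-contained version of the same pattern, writing two throwaway CSV files into a temp directory first so the glob has something to match (file names and data are made up for the demonstration):

```python
import glob
import os
import tempfile

import pandas as pd

work_dir = tempfile.mkdtemp()

# Create two small CSV files to combine (placeholder data)
pd.DataFrame({'a': [1, 2]}).to_csv(os.path.join(work_dir, 'x.csv'), index=False)
pd.DataFrame({'a': [3, 4]}).to_csv(os.path.join(work_dir, 'y.csv'), index=False)

all_filenames = sorted(glob.glob(os.path.join(work_dir, '*.csv')))
combined = pd.concat([pd.read_csv(f) for f in all_filenames])
print(len(combined))  # → 4
```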
I have a folder called "before_manipulation".
It contains 3 CSV files with the names File_A.CSV, File_B.CSV, and File_C.CSV.
Current_path : c:/users/before_manipulation [file_A.CSV, File_B.CSV,File_C.CSV]
I need to apply the same data manipulation to each of the files, and after the manipulation I need to save them with the same file names in another directory.
Targeted_path : C:/users/after_manipulation [file_A.CSV, File_B.CSV,File_C.CSV]
I have the logic for the data manipulation when there is only a single file in a Pandas dataframe. When there are multiple files, how do I read each file along with its name and pass it to my logic?
Pseudo-code of how I would work if there were one file:
import pandas as pd

df = pd.read_csv('c:/users/before_manipulation/file_A.csv')
# ... do logic/manipulation
df.to_csv('c:/users/after_manipulation/file_A.csv')
Any help is appreciated.
You can use os.listdir(<path>) to return a list of the files contained within a directory. If you do not pass a value for <path>, it returns the listing of the current working directory.
With the list from os.listdir you can iterate over it, passing each captured filename to the function you already have for the data manipulation. Then, on save, you can use the captured filename to write into your desired directory.
In summary, the code would look something like this:
import os
import pandas as pd

in_dir = r'c:/users/before_manipulation/'
out_dir = r'c:/users/after_manipulation/'

files_to_run = os.listdir(in_dir)
for file in files_to_run:
    print('Running {}'.format(in_dir + file))
    df = pd.read_csv(in_dir + file)
    # ... do your logic here to return the changed df you want to save
    df.to_csv(out_dir + file)
For this to work, every file in the directory would need to have the same shape, and you would need to want to apply the same logic to each file.
If that is not the case, you will need something like a dictionary that maps each file name to the manipulation it needs, so you can look up and call the right one.
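A minimal sketch of that dictionary idea (the function names and the per-file logic here are hypothetical placeholders, standing in for whatever each file actually needs):

```python
# Hypothetical dispatch table: pick a manipulation based on the file name
def double_values(rows):
    return [r * 2 for r in rows]

def drop_negatives(rows):
    return [r for r in rows if r >= 0]

manipulations = {
    'file_A.CSV': double_values,
    'File_B.CSV': drop_negatives,
}

# Look up the right manipulation for a given file and call it
result = manipulations['file_A.CSV']([1, -2, 3])
print(result)  # → [2, -4, 6]
```

In the real code the values would be functions taking and returning a dataframe, called inside the directory loop.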
Assuming you have some logic that works for one file, I'd just put that logic into a function and run it in a for loop.
You'd end up with something like this:
directory = r'c:/users/before_manipulation'
files = ['file_A.CSV', 'File_B.CSV', 'File_C.CSV']

for file in files:
    somefunction(directory + '/' + file)
If you need more info on functions I'd check this out: https://www.w3schools.com/python/python_functions.asp
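Filling in a stand-in body for somefunction makes the loop above runnable end to end (the body here is a placeholder that just extracts the file name; the real one would hold the per-file pandas logic):

```python
import os

def somefunction(path):
    # Placeholder for the single-file logic from the question
    return os.path.basename(path)

directory = 'c:/users/before_manipulation'
files = ['file_A.CSV', 'File_B.CSV', 'File_C.CSV']

processed = [somefunction(directory + '/' + f) for f in files]
print(processed)
```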
Using pathlib:
from pathlib import Path
import pandas as pd

your_dir = '\\your_source_path'  # placeholder: directory holding the CSV files
new_dir = '\\your_path'

files = [file for file in Path(your_dir).glob('*.csv')]
for file in files:
    df = pd.read_csv(file)
    # ... your logic
    df.to_csv(f'{new_dir}\\{file.name}', index=False)
I have a CSV file that I need to work on in my Jupyter notebook. Even though I am able to view the contents of the file using the code in the picture,
when I try to convert the data into a data frame I get a "No columns to parse from file" error.
I have no headers. My CSV file looks like this, and I have saved it in UTF-8 format.
Try using pandas to read the CSV file:
df = pd.read_csv("BON3_NC_CUISINES.csv")
print(df)
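Since the file has no header row, it may also help to pass header=None so pandas doesn't consume the first data row as column names. A minimal sketch with made-up contents standing in for the real file:

```python
from io import StringIO

import pandas as pd

# Stand-in for the headerless file (the real contents are unknown)
raw = StringIO('1,Thai\n2,Italian\n3,French\n')

# header=None tells pandas not to treat the first row as column names
df = pd.read_csv(raw, header=None)
print(df.shape)  # → (3, 2)
```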
I'm trying to use input() to get a file name from the user and then create a file with that name.
The only examples I've seen are print(input()).
I'm new to Python but trying to write a functional program.
Thanks.
This is a nice beginning for you:
def create_file():
    fn = input('Enter file name: ').strip()
    try:
        file = open(fn, 'r')
    except IOError:
        file = open(fn, 'w')
In Python you can use the open() function to create a file (assuming it will be a text file).
The documentation for it is located here
Using the value read by input() to create the file, you could do it like so:
filename = input('Enter file name: ')
file = open(filename, 'w+')
This will give you a file object which you can write lines to using the write() function.
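A runnable sketch of the whole flow (the filename here is a stand-in for the value input() would return, since input() needs a user at the keyboard):

```python
import os
import tempfile

# Stand-in for the value input() would return
filename = os.path.join(tempfile.mkdtemp(), 'notes.txt')

file = open(filename, 'w+')    # 'w+' creates the file and allows read-back
file.write('first line\n')
file.seek(0)                   # rewind before reading what was written
content = file.read()
file.close()
print(content)
```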