Import multiple files in pandas

I am trying to import multiple files in pandas. I have created 3 files in the folder, ['File1.xlsx', 'File2.xlsx', 'File3.xlsx'], as read by files = os.listdir(cwd):
import os
import pandas as pd

cwd = os.path.abspath(r'C:\Users\abc\OneDrive\Import Multiple files')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file), ignore_index=True)
df.head()
# df.to_excel('total_sales.xlsx')
print(files)
Upon running the code, I am getting the error (even though the file does exist in the folder)
FileNotFoundError: [Errno 2] No such file or directory: 'File1.xlsx'
Ideally, I want code where I define the files in a LIST and then read them in a loop using the path and the file LIST.

I think the following should work
import os
import pandas as pd

cwd = os.path.abspath(r'C:\Users\abc\OneDrive\Import Multiple files')
paths = [os.path.join(cwd, path) for path in os.listdir(cwd) if path.endswith('.xlsx')]
df = pd.concat((pd.read_excel(path) for path in paths), ignore_index=True)
df.head()
The idea is to build a list of full paths, then read them all in and concatenate them into a single DataFrame in one step. (Note that ignore_index is an argument of pd.concat, not of pd.read_excel.)
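An equivalent sketch using the glob module, which is not in the original answer but builds the full paths directly from a pattern (assuming the same folder layout):

import glob
import os
import pandas as pd

cwd = r'C:\Users\abc\OneDrive\Import Multiple files'
# glob returns full paths because the pattern includes the directory
paths = glob.glob(os.path.join(cwd, '*.xlsx'))
df = pd.concat((pd.read_excel(path) for path in paths), ignore_index=True)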

Related

Pandas - xls to xlsx converter

I want Python to take ANY .xls file from a given location and save it as .xlsx with the original file name. How can I do that, so that any time I paste a file into the location it gets converted to .xlsx with the original file name?
import pandas as pd
import os

for filename in os.listdir('./'):
    if filename.endswith('.xls'):
        df = pd.read_excel(filename)
        df.to_excel(??)
Your code seems to be perfectly fine. In case you are only missing the correct way to write the file under the given name, here you go.
import pandas as pd
import os

for filename in os.listdir('./'):
    if filename.endswith('.xls'):
        df = pd.read_excel(filename)
        df.to_excel(f"{os.path.splitext(filename)[0]}.xlsx")
A possible extension to convert any file that gets pasted inside the folder can be implemented with an infinite loop, for instance:
import pandas as pd
import os
import time

while True:
    files = os.listdir('./')
    for filename in files:
        out_name = f"{os.path.splitext(filename)[0]}.xlsx"
        if filename.endswith('.xls') and out_name not in files:
            df = pd.read_excel(filename)
            df.to_excel(out_name)
    time.sleep(10)
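One caveat worth checking on your setup (an assumption about your environment, not part of the original answer): depending on the pandas version, reading legacy .xls files requires the xlrd package, while writing .xlsx goes through openpyxl, so it can help to name the engines explicitly:

df = pd.read_excel(filename, engine='xlrd')            # .xls is read via xlrd
df.to_excel(out_name, index=False, engine='openpyxl')  # .xlsx is written via openpyxl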

Pandas file reader error FileNotFoundError: [WinError 3]

I have the following
import os
import pandas as pd

path = 'C:/PanelComplete/FileForPeter/'
for folder in os.listdir(path):
    for file in os.listdir(folder):
        df = pd.read_csv(path+folder+'/'+file, engine='python')
        df1 = df.groupby('codprg').size().reset_index(name='counts')
        df1.to_csv(spath1+folder+'.csv', index=False, encoding='utf-8')
It causes the following problem:
FileNotFoundError: [WinError 3] The system cannot find the path specified: '20180101'
even though the path is right, as shown in a screenshot (not included here). This question has been asked repeatedly, but my case is different.
The problem is in the second for loop: you are passing the folder name only instead of the full path (path+folder), hence you are not correctly addressing your desired directory. This should work:
import os
import pandas as pd

path = 'C:/PanelComplete/FileForPeter/'
for folder in os.listdir(path):
    for file in os.listdir(path + folder):  # full path, not just the folder name
        df = pd.read_csv(path + folder + '/' + file, engine='python')
        df1 = df.groupby('codprg').size().reset_index(name='counts')
        # spath1 comes from the question and must be defined as the output folder
        df1.to_csv(spath1 + folder + '.csv', index=False, encoding='utf-8')
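A slightly more robust sketch of the same fix using os.path.join instead of string concatenation; the column name codprg comes from the question, while spath1 is a hypothetical output folder you would adjust:

import os
import pandas as pd

path = 'C:/PanelComplete/FileForPeter/'
spath1 = 'C:/PanelComplete/counts/'  # hypothetical output folder
for folder in os.listdir(path):
    folder_path = os.path.join(path, folder)
    for file in os.listdir(folder_path):
        df = pd.read_csv(os.path.join(folder_path, file), engine='python')
        df1 = df.groupby('codprg').size().reset_index(name='counts')
        df1.to_csv(os.path.join(spath1, folder + '.csv'), index=False, encoding='utf-8')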

Read json files from tar.gz folders and convert to pandas dataframe [duplicate]

(Closed as a duplicate of: JSON to pandas DataFrame.)
I have never worked with JSON files, and my problem is that I have several tar.gz folders containing different JSON files. From each zipped folder I need to read only the AA*.json files, append them, and convert the result to a pandas DataFrame. I tried it this way:
import os, re
import pandas as pd
import tarfile
import json
from pandas.io.json import json_normalize

cd = "my_path"
dfList = []
for root, dirs, files in os.walk(cd):
    with tarfile.open("dirs", "r:*") as tar:
        for fname in files:
            if re.match("AA_*.json$", fname):
                data = json.load(fname)
                frame = pd.DataFrame.from_dict(json_normilized(data), orient='columns')
                dfList.append(frame)
df = pd.concat(dfList)
I get the error
FileNotFoundError: [Errno 2] No such file or directory: 'dirs'
import pandas as pd
data = pd.read_json('filepath/filename')
data
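The duplicate answer above covers a plain JSON file. For the tar.gz part of the question, a minimal sketch could look like this, assuming the archives live under my_path and the relevant members match AA*.json (pd.json_normalize replaces pandas.io.json.json_normalize on recent pandas):

import os
import re
import json
import tarfile
import pandas as pd

cd = "my_path"  # root folder containing the .tar.gz archives
frames = []
for root, dirs, files in os.walk(cd):
    for fname in files:
        if fname.endswith(".tar.gz"):
            # open each archive by its full path, not a literal string like "dirs"
            with tarfile.open(os.path.join(root, fname), "r:*") as tar:
                for member in tar.getmembers():
                    if re.match(r"AA.*\.json$", os.path.basename(member.name)):
                        fh = tar.extractfile(member)
                        if fh is not None:
                            frames.append(pd.json_normalize(json.load(fh)))
df = pd.concat(frames, ignore_index=True)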

Reading CSV files from Google Cloud Storage using pandas

I am trying to read a bunch of CSV files from Google Cloud Storage into pandas dataframes as explained in Read csv from Google Cloud storage to pandas dataframe
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename+'.csv', encoding='utf-8')
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)
It shows the following error message while importing gcsfs. The packages 'dask' and 'gcsfs' have already been installed on my machine; however, I cannot get rid of the following error.
File "C:\Program Files\Anaconda3\lib\site-packages\gcsfs\dask_link.py", line
121, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
It seems there is some error or conflict between the gcsfs and dask packages. In fact, the dask library is not needed for your code to work. The minimal configuration for your code to run is to install these libraries (I am posting their latest versions at the time of writing):
google-cloud-storage==1.14.0
gcsfs==0.2.1
pandas==0.24.1
Also, the filename already contains the .csv extension, so change the read_csv call to this:
temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
With these changes I ran your code and it works. I suggest you create a virtual env, install the libraries in it, and run the code there:
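A minimal sketch of that setup, assuming a Unix-like shell (the env name gcs-env is arbitrary; on Windows the activate step differs):

python3 -m venv gcs-env
source gcs-env/bin/activate
pip install google-cloud-storage==1.14.0 gcsfs==0.2.1 pandas==0.24.1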
This has been tested and seen to work elsewhere, whether reading directly from GCS or via Dask. You may wish to import gcsfs and dask yourself and check whether _filesystems is there and what it contains:
In [1]: import dask.bytes.core
In [2]: dask.bytes.core._filesystems
Out[2]: {'file': dask.bytes.local.LocalFileSystem}
In [3]: import gcsfs
In [4]: dask.bytes.core._filesystems
Out[4]:
{'file': dask.bytes.local.LocalFileSystem,
'gcs': gcsfs.dask_link.DaskGCSFileSystem,
'gs': gcsfs.dask_link.DaskGCSFileSystem}
As of https://github.com/dask/gcsfs/pull/129, gcsfs behaves better if it is unable to register itself with Dask, so updating may solve your problem.
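For instance (assuming pip manages your environment):

pip install --upgrade gcsfs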
A few things to point out in the code above:
bucket_name and prefix need to be defined.
The iteration over the filenames should append each dataframe as it is read; otherwise only the last one gets concatenated.
from google.cloud import storage
import pandas as pd

storage_client = storage.Client()
buckets_list = list(storage_client.list_buckets())
bucket_name = 'my_bucket'
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs()
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
    print(filename, temp.head())
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)

How can I import a data set (AD_Data.xlsx) in a Jupyter notebook? The data has an xlsx extension

I tried all the possible options, like:
import pandas as pd
df = pd.read_csv('AD_Data')
data = pd.ExcelFile("AD_Data")
xl_file = pd.ExcelFile(AD_Data)
dfs = {sheet_name: xl_file.parse(AD_Data) for sheet_name in xl_file.AD_Data}
dfs = pd.read_excel(AD_Data, sheetname=None)
None of them are helping
The error I am getting is:
FileNotFoundError: File b'adData' does not exist
The notebook and the data are in the same folder. I tried keeping them in different folders too; it did not help.
I can import any other file, like text, convert it to a DataFrame, and work on it in the same notebook and from the same data folder.
pd.read_excel (Python 3.6.4) works fine with .xlsx on Windows. Add the file ending .xlsx, or make sure the file is in the same folder as the script.
dfs = pd.read_excel(r'C:\users\ilja\Desktop\Mappe1.xlsx', sheet_name=None)
print(dfs)
# OrderedDict([('Tabelle1',    1  5
#               0  2  6
#               1  3  7)])
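Applied to the original file name, assuming the notebook and AD_Data.xlsx share a folder (sheet_name=None returns a dict with one DataFrame per sheet):

import pandas as pd

# note the explicit .xlsx extension that the original attempts were missing
dfs = pd.read_excel('AD_Data.xlsx', sheet_name=None)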