Writing pandas dataframe to excel in dbfs azure databricks: OSError: [Errno 95] Operation not supported - pandas

I am trying to write a pandas dataframe to the local file system in Azure Databricks:
import pandas as pd
url = 'https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-March-2019-quarter/Download-data/business-price-indexes-march-2019-quarter-csv.csv'
data = pd.read_csv(url)
with pd.ExcelWriter(r'/dbfs/tmp/export.xlsx', engine="openpyxl") as writer:
    data.to_excel(writer)
Then I get the following error message:
OSError: [Errno 95] Operation not supported
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
in
      3 data = pd.read_csv(url)
      4 with pd.ExcelWriter(r'/dbfs/tmp/export.xlsx', engine="openpyxl") as writer:
----> 5     data.to_excel(writer)

/databricks/python/lib/python3.8/site-packages/pandas/io/excel/_base.py in __exit__(self, exc_type, exc_value, traceback)
    892
    893     def __exit__(self, exc_type, exc_value, traceback):
--> 894         self.close()
    895
    896     def close(self):

/databricks/python/lib/python3.8/site-packages/pandas/io/excel/_base.py in close(self)
    896     def close(self):
    897         """synonym for save, to make it more file-like"""
--> 898         content = self.save()
    899         self.handles.close()
    900         return content
I read in this post some limitations for mounted file systems: Pandas: Write to Excel not working in Databricks
But if I got it right, the solution is to write to the local workspace file system, which is exactly what is not working for me.
My user is a workspace admin and I am using a standard cluster with Databricks Runtime 10.4.
I also verified that I can write a csv file to the same location using to_csv.
What could be missing?

Databricks has a limitation in that it does not allow random write operations into DBFS, as indicated in the SO thread you are referring to.
So, a workaround for this would be to write the file to the local file system (file:/) and then move it to the required location inside DBFS. You can use the following code:
import pandas as pd
url = 'https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-March-2019-quarter/Download-data/business-price-indexes-march-2019-quarter-csv.csv'
data = pd.read_csv(url)
with pd.ExcelWriter(r'export.xlsx', engine="openpyxl") as writer:
    # file will be written to /databricks/driver/ i.e., the local file system
    data.to_excel(writer)
dbutils.fs.ls("/databricks/driver/") indicates that the path you want to use to list the files is dbfs:/databricks/driver/ (absolute path) which does not exist.
/databricks/driver/ belongs to the local file system (DBFS is a part of this). The absolute path of /databricks/driver/ is file:/databricks/driver/. You can list the contents of this path by using either of the following:
import os
print(os.listdir("/databricks/driver/"))
#OR
dbutils.fs.ls("file:/databricks/driver/")
So, use the file located in this path and move (or copy) it to your destination using the shutil library, as follows:
from shutil import move
move('/databricks/driver/export.xlsx','/dbfs/tmp/export.xlsx')
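Alternatively, the same move can be done with Databricks utilities instead of shutil (a small sketch of the same idea; note the file:/ prefix for the local source path):
# dbutils is available in Databricks notebooks; move the locally written file into DBFS
dbutils.fs.mv("file:/databricks/driver/export.xlsx", "dbfs:/tmp/export.xlsx")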

Related

Python Script that used to work, is now getting automatically killed in Ubuntu

I was once able to run the below Python script on my Ubuntu machine without the memory errors I was getting on Windows.
import pandas as pd
import numpy as np

# create a pandas dataframe for each input file
dfs1 = pd.read_csv('s1.csv', encoding='utf-8', names=list(range(0,107)), dtype='string', na_filter=False)
dfs2 = pd.read_csv('s2.csv', encoding='utf-8', names=list(range(0,107)), dtype='string', na_filter=False)
dfr  = pd.read_csv('r.csv' , encoding='utf-8', names=list(range(0,107)), dtype='string', na_filter=False)

# combine them into one dataframe
dfs12r = pd.concat([dfs1, dfs2, dfr], ignore_index=True)  # without ignore_index the line numbers are not adjusted

# build the word list (bag of words)
wordlist = []
for line in range(8052):
    for row in range(106):
        # print(line, row, dfs12r[row][line])
        if dfs12r[row][line] not in wordlist:
            wordlist.append(dfs12r[row][line])
wordlist.sort()
# print(wordlist)
print(len(wordlist))  # 12350

dfBOW = pd.DataFrame(np.zeros((len(dfs12r.index), len(wordlist))), dtype='int')

# create the dictionary
wordDict = dict.fromkeys(wordlist, 'default')
counter = 0
for word in wordlist:
    wordDict[word] = counter
    counter += 1
# print(wordDict)

# scan every word from dfs12r and +1 the respective cell in dfBOW
for line in range(8052):
    for row in range(107):
        dfBOW[wordDict[dfs12r[row][line]]][line] += 1
Unfortunately, probably after some automatic Ubuntu updates, I am now getting the simple message "Killed" after trying to run the process, without any further explanation.
Through simple print statements I know that the script is interrupted inside the for loop at the end.
I understand that I should be able to make the script more memory efficient, but I am also hoping for guidance on how to get Ubuntu to run the same script again like it used to. (Through the top command I can see that all of my memory, including swap, is being used while inside this loop.)
Could paging have been disabled somehow after the updates? Any advice is welcome.
I still have 16GB of RAM and use Ubuntu 20.04 (the specs are the same as before the script stopped working). I use dual boot on the same SSD.
Below is the error I am getting from the same script on Windows:
Traceback (most recent call last):
File "D:\sharedfiles\Organised\WorkSpace\ptixiaki\github\ptixiaki\code\makingthedata\2.1 Approach (Same as 2 but turning all words to lowercase)\2.1_CSVtoDataframe\CSVtoBOW.py", line 60, in <module>
dfBOW[wordDict[dfs12r[row][line]]][line]+=1
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\series.py", line 1143, in __setitem__
self._maybe_update_cacher()
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\series.py", line 1279, in _maybe_update_cacher
ref._maybe_cache_changed(cacher[0], self, inplace=inplace)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\frame.py", line 3950, in _maybe_cache_changed
self._mgr.iset(loc, arraylike, inplace=inplace)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\internals\managers.py", line 1141, in iset
blk.delete(blk_locs)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\internals\blocks.py", line 388, in delete
self.values = np.delete(self.values, loc, 0) # type: ignore[arg-type]
File "<__array_function__ internals>", line 5, in delete
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\numpy\lib\function_base.py", line 4555, in delete
new = arr[tuple(slobj)]
MemoryError: Unable to allocate 501. MiB for an array with shape (12234, 10736) and data type int32
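As a side note on the memory-efficiency point above, one way to make the counting step much lighter (a minimal sketch, not from the original post, assuming dfs12r is the combined string DataFrame built earlier) is to accumulate the counts in a scipy sparse matrix instead of a dense integer DataFrame:
import numpy as np
from scipy.sparse import lil_matrix

values = dfs12r.to_numpy()                       # 2-D array of strings, one row per line
wordlist = sorted(set(values.ravel()))
word_index = {word: i for i, word in enumerate(wordlist)}

# one row per line, one column per word; cells that stay zero take no memory
bow = lil_matrix((values.shape[0], len(wordlist)), dtype=np.int32)
for line in range(values.shape[0]):
    for word in values[line]:
        bow[line, word_index[word]] += 1
bow = bow.tocsr()                                # CSR format is faster for downstream arithmetic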

move files within Google Colab directory

I want to move files within a Google Colab directory. What do I need to do or change?
The dataframe of files and labels looks like this:
file label
0 img_44733.jpg 1
1 img_72999.jpg 1
2 img_25094.jpg 1
3 img_69092.jpg 1
4 img_92629.jpg 1
All files are stored in /content/train/, in separate folders 0 or 1 depending on their label (as seen in the attached picture).
I pick some filenames to use as a validation dataset using sklearn and store them in x_test, then try to move those files from their original directory /content/train/ into the /content/validation/ folder (again separated based on their label) using the code below.
import os.path
from os import path
from os import mkdir
import shutil

filter_list = x_test['file']
tmp = data[data.file.isin(filter_list)].index
for validator in tmp:
    src_name = os.path.join('/content/train/'+str(data.loc[validator]['label'])+"/"+data.loc[validator]['file'])
    trg_name = os.path.join('/content/validation/'+str(data.loc[validator]['label'])+"/"+data.loc[validator]['file'])
    shutil.move(src_name, trg_name)
data is the original dataframe storing filenames and labels (shown above).
I get the error below:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/usr/lib/python3.7/shutil.py in move(src, dst, copy_function)
565 try:
--> 566 os.rename(src, real_dst)
567 except OSError:
FileNotFoundError: [Errno 2] No such file or directory: '/content/train/1/img_92629.jpg' -> '/content/validation/1/img_92629.jpg'
During handling of the above exception, another exception occurred:
FileNotFoundError Traceback (most recent call last)
3 frames
/usr/lib/python3.7/shutil.py in copyfile(src, dst, follow_symlinks)
119 else:
120 with open(src, 'rb') as fsrc:
--> 121 with open(dst, 'wb') as fdst:
122 copyfileobj(fsrc, fdst)
123 return dst
FileNotFoundError: [Errno 2] No such file or directory: '/content/validation/1/img_92629.jpg'
I've tried using these as the source directory input for shutil:
/content/train/
/train/
content/train/
not using the os.path
and these shell commands:
!mv /content/train/
!mv content/train/
!mv /train/
!mv train/
What needs to be changed?
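For what it's worth, the second traceback fails when opening the destination path, so one likely cause (an assumption, since the Colab directory tree is not shown) is that the /content/validation/<label> folders do not exist yet; shutil.move will not create them. A minimal sketch that creates them first, reusing data and tmp from the question:
import os
import shutil

for validator in tmp:
    label = str(data.loc[validator]['label'])
    fname = data.loc[validator]['file']
    src_name = os.path.join('/content/train', label, fname)
    trg_dir = os.path.join('/content/validation', label)
    os.makedirs(trg_dir, exist_ok=True)   # create the destination folder if it is missing
    shutil.move(src_name, os.path.join(trg_dir, fname))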

Using pandas' read_hdf to load data on Google Drive fails with ValueError

I have uploaded an HDF file to Google Drive and wish to load it in Colab. The file was created from a dataframe with DataFrame.to_hdf() and it can be loaded successfully locally with pd.read_hdf(). However, when I try to mount my Google Drive and read the data in Colab, it fails with a ValueError.
Here is the code I am using to read the data:
from google.colab import drive
drive.mount('/content/drive')
data = pd.read_hdf('/content/drive/My Drive/Ryhmäytyminen/data/data.h5', 'students')
And this is the full error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-cfe913c26e60> in <module>()
----> 1 data = pd.read_hdf('/content/drive/My Drive/Ryhmäytyminen/data/data.h5', 'students')
7 frames
/usr/local/lib/python3.6/dist-packages/tables/vlarray.py in read(self, start, stop, step)
819 listarr = []
820 else:
--> 821 listarr = self._read_array(start, stop, step)
822
823 atom = self.atom
tables/hdf5extension.pyx in tables.hdf5extension.VLArray._read_array()
ValueError: cannot set WRITEABLE flag to True of this array
Reading some JSON data was successful, so the problem probably is not with mounting. Any ideas what is wrong or how to debug this problem?
Thank you!
Try navigating first to the directory where your HDF file is stored:
cd /content/drive/My Drive/Ryhmäytyminen/data
From here you should be able to load the HDF file directly:
data = pd.read_hdf('data.h5', 'students')
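Put together, the same steps can be done in plain Python instead of the cd magic (a sketch, assuming the path from the question):
import os
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

# work from the folder that contains the HDF file, then load it by name
os.chdir('/content/drive/My Drive/Ryhmäytyminen/data')
data = pd.read_hdf('data.h5', 'students')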

What is OSError: [Errno 95] Operation not supported for pandas to_csv on colab?

My input is:
test=pd.read_csv("/gdrive/My Drive/data-kaggle/sample_submission.csv")
test.head()
It ran as expected.
But, for
test.to_csv('submitV1.csv', header=False)
The full error message that I got was:
OSError                                   Traceback (most recent call last)
<ipython-input-5-fde243a009c0> in <module>()
      9 from google.colab import files
     10 print(test)'''
---> 11 test.to_csv('submitV1.csv', header=False)
     12 files.download('/gdrive/My Drive/data-kaggle/submission/submitV1.csv')

2 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   3018                                  doublequote=doublequote,
   3019                                  escapechar=escapechar, decimal=decimal)
-> 3020             formatter.save()
   3021
   3022         if path_or_buf is None:

/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py in save(self)
    155             f, handles = _get_handle(self.path_or_buf, self.mode,
    156                                      encoding=self.encoding,
--> 157                                      compression=self.compression)
    158             close = True
    159

/usr/local/lib/python3.6/dist-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
    422     elif encoding:
    423         # Python 3 and encoding
--> 424         f = open(path_or_buf, mode, encoding=encoding, newline="")
    425     elif is_text:
    426         # Python 3 and no explicit encoding

OSError: [Errno 95] Operation not supported: 'submitV1.csv'
Additional information about the error:
Before running this command, if I run
df = pd.DataFrame()
df.to_csv("file.csv")
files.download("file.csv")
it runs properly, but the same code produces the "Operation not supported" error if I try to run it after converting the test data frame to a csv file.
I am also getting the message "A Google Drive timeout has occurred (most recently at 13:02:43). More info." just before running the command.
You are currently in the directory in which you don't have write permissions.
Check your current directory with pwd. It might be gdrive or some directory inside it; that's why you are unable to save there.
Now change the current working directory to some other directory where you have permission to write. cd ~ will work fine; it will change the directory to /root.
Now you can use:
test.to_csv('submitV1.csv', header=False)
It will save 'submitV1.csv' to /root.
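The same directory change can also be done from Python if you prefer not to use the cd magic (a small sketch of the suggestion above):
import os

os.chdir(os.path.expanduser('~'))          # equivalent to `cd ~`; the home directory is /root on Colab
test.to_csv('submitV1.csv', header=False)  # written to /root/submitV1.csv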

Dask array from_npy_stack misses info file

Action
Trying to create a Dask array from a stack of .npy files not written by Dask.
Problem
Dask's from_npy_stack() expects an info file, which is normally created by the to_npy_stack() function when writing an .npy stack with Dask.
Attempts
I found this PR (https://github.com/dask/dask/pull/686) with a description of how the info file is created:
def to_npy_info(dirname, dtype, chunks, axis):
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump({'chunks': chunks, 'dtype': x.dtype, 'axis': axis}, f)
Question
How do I go about loading .npy stacks that are created outside of Dask?
Example
from pathlib import Path
import numpy as np
import dask.array as da
data_dir = Path('/home/tom/data/')
for i in range(3):
    data = np.zeros((2,2))
    np.save(data_dir.joinpath('{}.npy'.format(i)), data)
data = da.from_npy_stack('/home/tom/data')
Resulting in the following error:
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-94-54315c368240> in <module>()
9 np.save(data_dir.joinpath('{}.npy'.format(i)), data)
10
---> 11 data = da.from_npy_stack('/home/tom/data/')
/home/tom/vue/env/local/lib/python2.7/site-packages/dask/array/core.pyc in from_npy_stack(dirname, mmap_mode)
3722 Read data in memory map mode
3723 """
-> 3724 with open(os.path.join(dirname, 'info'), 'rb') as f:
3725 info = pickle.load(f)
3726
IOError: [Errno 2] No such file or directory: '/home/tom/data/info'
The function from_npy_stack is short and simple. I agree that it probably ought to take the metadata as an optional argument for cases such as yours, but you could simply use the lines of code that come after loading the "info" file, assuming you have the right values. Some of these values, i.e. the dtype and the shape of each array for making chunks, could presumably be obtained by looking at the first of the data files:
name = 'from-npy-stack-%s' % dirname
keys = list(product([name], *[range(len(c)) for c in chunks]))
values = [(np.load, os.path.join(dirname, '%d.npy' % i), mmap_mode)
          for i in range(len(chunks[axis]))]
dsk = dict(zip(keys, values))
out = Array(dsk, name, chunks, dtype)
Also, note that we are constructing the names of the files here, but you might want to get those by doing a listdir or glob.
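Another option that avoids copying dask internals is to write the missing info file yourself and then call from_npy_stack as usual. The sketch below is an assumption-laden example: it assumes the files are named 0.npy, 1.npy, ... as in your example, all have the same shape and dtype, and should be concatenated along axis 0:
import os
import pickle
import numpy as np
import dask.array as da

dirname = '/home/tom/data'

# infer dtype and per-file shape from the first file
first = np.load(os.path.join(dirname, '0.npy'))
n_files = len([f for f in os.listdir(dirname) if f.endswith('.npy')])

# one chunk per file along axis 0, the whole extent on the remaining axes
axis = 0
chunks = ((first.shape[0],) * n_files,) + tuple((s,) for s in first.shape[1:])

# write the 'info' file that from_npy_stack expects
with open(os.path.join(dirname, 'info'), 'wb') as f:
    pickle.dump({'chunks': chunks, 'dtype': first.dtype, 'axis': axis}, f)

data = da.from_npy_stack(dirname)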