Using pandas' read_hdf to load data on Google Drive fails with ValueError - pandas

I have uploaded an HDF file to Google Drive and wish to load it in Colab. The file was created from a dataframe with DataFrame.to_hdf(), and it loads successfully locally with pd.read_hdf(). However, when I mount my Google Drive and try to read the data in Colab, it fails with a ValueError.
Here is the code I am using to read the data:
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')
data = pd.read_hdf('/content/drive/My Drive/Ryhmäytyminen/data/data.h5', 'students')
And this is the full error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-cfe913c26e60> in <module>()
----> 1 data = pd.read_hdf('/content/drive/My Drive/Ryhmäytyminen/data/data.h5', 'students')
7 frames
/usr/local/lib/python3.6/dist-packages/tables/vlarray.py in read(self, start, stop, step)
819 listarr = []
820 else:
--> 821 listarr = self._read_array(start, stop, step)
822
823 atom = self.atom
tables/hdf5extension.pyx in tables.hdf5extension.VLArray._read_array()
ValueError: cannot set WRITEABLE flag to True of this array
Reading some JSON data was successful, so the problem is probably not with the mounting. Any ideas what is wrong or how to debug this?
Thank you!

Try first navigating to the directory where your HDF file is stored:
cd /content/drive/My Drive/Ryhmäytyminen/data
From here you should be able to load the HDF file directly:
data = pd.read_hdf('data.h5', 'students')
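For reference, a minimal end-to-end sketch of the same approach (assuming the Drive mount succeeds and the path from the question is correct); os.chdir is used here instead of the cd command so everything stays in Python:
import os
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')
# change into the folder that holds the HDF file, then read it with a relative path
os.chdir('/content/drive/My Drive/Ryhmäytyminen/data')
data = pd.read_hdf('data.h5', 'students')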

Related

Writing pandas dataframe to excel in dbfs azure databricks: OSError: [Errno 95] Operation not supported

I am trying to write a pandas dataframe to the local file system in Azure Databricks:
import pandas as pd

url = 'https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-March-2019-quarter/Download-data/business-price-indexes-march-2019-quarter-csv.csv'
data = pd.read_csv(url)
with pd.ExcelWriter(r'/dbfs/tmp/export.xlsx', engine="openpyxl") as writer:
    data.to_excel(writer)
Then I get the following error message:
OSError: [Errno 95] Operation not supported
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
in <module>
      3 data = pd.read_csv(url)
      4 with pd.ExcelWriter(r'/dbfs/tmp/export.xlsx', engine="openpyxl") as writer:
----> 5     data.to_excel(writer)

/databricks/python/lib/python3.8/site-packages/pandas/io/excel/_base.py in __exit__(self, exc_type, exc_value, traceback)
    892
    893     def __exit__(self, exc_type, exc_value, traceback):
--> 894         self.close()
    895
    896     def close(self):

/databricks/python/lib/python3.8/site-packages/pandas/io/excel/_base.py in close(self)
    896     def close(self):
    897         """synonym for save, to make it more file-like"""
--> 898         content = self.save()
    899         self.handles.close()
    900         return content
I read about some limitations of mounted file systems in this post: Pandas: Write to Excel not working in Databricks
But if I understood it correctly, the solution is to write to the local workspace file system, which is exactly what is not working for me.
My user is a workspace admin and I am using a standard cluster with the 10.4 runtime.
I also verified that I can write a CSV file to the same location using to_csv.
What could be missing?
Databricks has a limitation in that it does not allow random write operations into DBFS, as indicated in the SO thread you are referring to.
So, a workaround is to write the file to the local file system (file:/) first and then move it to the required location inside DBFS. You can use the following code:
import pandas as pd

url = 'https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-March-2019-quarter/Download-data/business-price-indexes-march-2019-quarter-csv.csv'
data = pd.read_csv(url)
with pd.ExcelWriter(r'export.xlsx', engine="openpyxl") as writer:
    # the file will be written to /databricks/driver/, i.e. the local file system
    data.to_excel(writer)
When you call dbutils.fs.ls("/databricks/driver/"), the path is interpreted as dbfs:/databricks/driver/ (an absolute DBFS path), which does not exist.
/databricks/driver/ belongs to the driver's local file system (DBFS is mounted into it at /dbfs). The absolute path of /databricks/driver/ is file:/databricks/driver/. You can list the contents of this path using either of the following:
import os

print(os.listdir("/databricks/driver/"))
# OR
dbutils.fs.ls("file:/databricks/driver/")
So, use the file located in this path and move (or copy) it to your destination with the shutil library, as follows:
from shutil import move

move('/databricks/driver/export.xlsx', '/dbfs/tmp/export.xlsx')
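Alternatively (a sketch, not part of the original answer), the same move can be done with dbutils.fs.mv, spelling out the two file systems with the file:/ and dbfs:/ prefixes:
# move the workbook from the driver's local file system into DBFS
dbutils.fs.mv("file:/databricks/driver/export.xlsx", "dbfs:/tmp/export.xlsx")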

move files within Google Colab directory

I want to move files within a Google Colab directory. What do I need to do or change?
            file  label
0  img_44733.jpg      1
1  img_72999.jpg      1
2  img_25094.jpg      1
3  img_69092.jpg      1
4  img_92629.jpg      1
All files are stored in /content/train/, in a separate folder 0 or 1 depending on their label (as seen in the attached picture).
I picked some filenames to use as a validation dataset with sklearn and stored them in x_test, and I am trying to move those files from their original directory /content/train/ into the /content/validation/ folder (again separated by label) using the code below.
import os.path
from os import path
from os import mkdir
import shutil

filter_list = x_test['file']
tmp = data[data.file.isin(filter_list)].index
for validator in tmp:
    src_name = os.path.join('/content/train/' + str(data.loc[validator]['label']) + "/" + data.loc[validator]['file'])
    trg_name = os.path.join('/content/validation/' + str(data.loc[validator]['label']) + "/" + data.loc[validator]['file'])
    shutil.move(src_name, trg_name)
data is the original dataframe storing filename and label (shown above).
I get the error below:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/usr/lib/python3.7/shutil.py in move(src, dst, copy_function)
565 try:
--> 566 os.rename(src, real_dst)
567 except OSError:
FileNotFoundError: [Errno 2] No such file or directory: '/content/train/1/img_92629.jpg' -> '/content/validation/1/img_92629.jpg'
During handling of the above exception, another exception occurred:
FileNotFoundError Traceback (most recent call last)
3 frames
/usr/lib/python3.7/shutil.py in copyfile(src, dst, follow_symlinks)
119 else:
120 with open(src, 'rb') as fsrc:
--> 121 with open(dst, 'wb') as fdst:
122 copyfileobj(fsrc, fdst)
123 return dst
FileNotFoundError: [Errno 2] No such file or directory: '/content/validation/1/img_92629.jpg'
I've tried these variations for the source directory input (with shutil):
/content/train/
/train/
content/train/
not using the os.path
and these shell commands:
!mv /content/train/
!mv content/train/
!mv /train/
!mv train/
What needs to be changed?
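Judging from the traceback, the destination folders (/content/validation/0 and /content/validation/1) do not exist yet, so shutil.move cannot create the target file. A minimal sketch of a possible fix (reusing data, tmp and the paths from the question) that creates each destination folder before moving:
import os
import shutil

for validator in tmp:
    label = str(data.loc[validator]['label'])
    dst_dir = os.path.join('/content/validation', label)
    os.makedirs(dst_dir, exist_ok=True)  # create /content/validation/0 or /1 if missing
    src_name = os.path.join('/content/train', label, data.loc[validator]['file'])
    shutil.move(src_name, os.path.join(dst_dir, data.loc[validator]['file']))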

Load dataset from Roboflow in colab

I'm trying to retrieve a Roboflow project dataset in Google Colab. It works for two of the dataset versions, but not for the latest one I have created (same project, version 5).
Does anyone know what goes wrong?
Snippet:
from roboflow import Roboflow
rf = Roboflow(api_key="keyremoved")
project = rf.workspace().project("project name")
dataset = project.version(5).download("yolov5")
loading Roboflow workspace...
loading Roboflow project...
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-22-7f073ab2bc86> in <module>()
7 rf = Roboflow(api_key="keyremoved")
8 project = rf.workspace().project("projectname")
----> 9 dataset = project.version(5).download("yolov5")
10
11
/usr/local/lib/python3.7/dist-packages/roboflow/core/version.py in download(self, model_format, location)
76 link = resp.json()['export']['link']
77 else:
---> 78 raise RuntimeError(resp.json())
79
80 def bar_progress(current, total, width=80):
RuntimeError: {'error': {'message': 'Unsupported get request. Export with ID `idremoved` does not exist or cannot be loaded due to missing permissions.', 'type': 'GraphMethodException', 'hint': 'You can find the API docs at https://docs.roboflow.com'}}
Roboflow plans can limit the number of images plus augmentations that you can export. Please check your account details and limits, and contact Roboflow support if you need more help.
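As a quick check (a sketch that reuses only the calls from the question; the version numbers are assumptions), you can download a version that previously worked next to version 5 to confirm the failure is specific to that export:
from roboflow import Roboflow

rf = Roboflow(api_key="keyremoved")
project = rf.workspace().project("project name")

for v in (4, 5):  # 4 stands in for one of the versions that worked
    try:
        dataset = project.version(v).download("yolov5")
        print(f"version {v}: download OK")
    except RuntimeError as err:
        print(f"version {v}: failed -> {err}")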

ValueError: unknown url type with DeepLab demo.ipynb

I am running the DeepLab demo.ipynb notebook in Google Colab. The demo-provided images work well, but when I try to add my own image I receive the error "ValueError: unknown url type: '/content/harshu-06032019.png'", even though I can see that the file has been uploaded to Colab.
Any help on why I am getting this error is appreciated.
I also tried putting the file into Google Drive and granting Colab access by mounting the Drive, but that does not work either.
If the file is uploaded to Google Drive, I instead get the error "Cannot retrieve image. Please check url".
This is the traceback from the code provided by DeepLabv3+:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-ba1edc5ae51a> in <module>()
24
25 image_url = IMAGE_URL or _SAMPLE_URL % SAMPLE_IMAGE
---> 26 run_visualization(image_url)
5 frames
/usr/lib/python3.6/urllib/request.py in _parse(self)
382 self.type, rest = splittype(self._full_url)
383 if self.type is None:
--> 384 raise ValueError("unknown url type: %r" % self.full_url)
385 self.host, self.selector = splithost(rest)
386 if self.host:
ValueError: unknown url type: '/content/harshu-06032019.png'
To fix this "ValueError: unknown url type: '/content/harshu-06032019.png'", follow these steps:
Remove content from the URL path
The path to your image should only be harshu-06032019.png
run !ls to see if the image file is present
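For background (not part of the original answer): the demo's run_visualization fetches the image through urllib.request, which only accepts URLs with a scheme such as http:// or file://, so a bare local path raises "unknown url type". A minimal sketch of a workaround, assuming the file really is at /content/harshu-06032019.png, is to use a file:// URL:
from urllib import request

# a file:// URL makes urllib treat the local path as a valid URL
with request.urlopen('file:///content/harshu-06032019.png') as f:
    image_bytes = f.read()
print(len(image_bytes), 'bytes read')
In the demo this would mean setting IMAGE_URL = 'file:///content/harshu-06032019.png' before calling run_visualization.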

Getting an EOFError when calling pd.read_pickle

I have a quick question regarding a pandas DataFrame and the pd.read_pickle() function. Basically, I have a large but simple DataFrame (333 MB). When I run pd.read_pickle on the pickled dataframe, I get an EOFError.
Is there any way around this issue? What might be causing it?
I saw the same EOFError when I created a pickle using:
df.to_pickle('path.pkl', compression='bz2')
and then tried to read with:
pandas.read_pickle('path.pkl')
I fixed the issue by supplying the compression on read:
pandas.read_pickle('path.pkl', compression='bz2')
According to the Pandas docs:
compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’
string representing the compression to use in the output file. By default,
infers from the file extension in specified path.
Thus, simply changing the path from 'path.pkl' to 'path.bz2' also fixed the problem.
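A minimal round-trip sketch of the two fixes (the paths and the dataframe are illustrative):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# option 1: keep the .pkl name and pass the compression explicitly on both sides
df.to_pickle('path.pkl', compression='bz2')
df_back = pd.read_pickle('path.pkl', compression='bz2')

# option 2: use an extension that the default compression='infer' understands
df.to_pickle('path.bz2')
df_back = pd.read_pickle('path.bz2')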
I can confirm the valuable comment of greg_data:
When I encountered this error I worked out that it was due to the initial pickling not having completed correctly. The pickle file was created, but not finished correctly. Seems to me this is the only possible source of the EOFError in pickle, that the pickle is malformed, i.e. not finished.
My error during pickling was:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-40-263240bbee7e> in <module>()
----> 1 main()
<ipython-input-38-9b3c6d782a2a> in main()
43 with open("/content/drive/MyDrive/{}.file".format(tm.id), "wb") as f:
---> 44 pickle.dump(tm, f, pickle.HIGHEST_PROTOCOL)
45
46 print('Coherence:', get_coherence(tm, token_lists, 'c_v'))
TypeError: can't pickle weakref objects
And when reading that pickle file, which was obviously not finished during pickling, the reported error occurred:
pd.read_pickle(r'/content/drive/MyDrive/TEST_2021_06_01_10_23_02.file')
Error:
---------------------------------------------------------------------------
EOFError Traceback (most recent call last)
<ipython-input-41-460bdd0a2779> in <module>()
----> 1 object = pd.read_pickle(r'/content/drive/MyDrive/TEST_2021_06_01_10_23_02.file')
/usr/local/lib/python3.7/dist-packages/pandas/io/pickle.py in read_pickle(filepath_or_buffer, compression)
180 # We want to silence any warnings about, e.g. moved modules.
181 warnings.simplefilter("ignore", Warning)
--> 182 return pickle.load(f)
183 except excs_to_catch:
184 # e.g.
EOFError: Ran out of input