numpy memmap runtime error... 64-bit system with a 2 GB limit? - numpy

I'm trying to create a large file with numpy memmap
big_file = np.memmap(fnamemm, dtype=np.float32, mode='w+', shape=(np.prod(dims[1:]), len_im), order='F')
The system is Windows 10 64-bit, running a 64-bit Python:
In [2]: sys.maxsize
Out[2]: 9223372036854775807
with plenty of virtual memory (a maximum of 120000 MB).
However, every time I try to create a file whose resulting size would exceed 2 GB, I get a runtime error:
In [29]: big_file = np.memmap(fnamemm, dtype=np.int16, mode='w+', shape=(np.prod(dims[1:]), len_im), order=order)
C:\Users\nuria\AppData\Local\Continuum\anaconda3\envs\caiman\lib\site-packages\numpy\core\memmap.py:247: RuntimeWarning: overflow encountered in long_scalars
bytes = long(offset + size*_dbytes)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-29-66578da2d3f6> in <module>()
----> 1 big_file = np.memmap(fnamemm, dtype=np.int16, mode='w+', shape=(np.prod(dims[1:]), len_im), order=order)
~\AppData\Local\Continuum\anaconda3\envs\caiman\lib\site-packages\numpy\core\memmap.py in __new__(subtype, filename, dtype, mode, offset, shape, order)
248
249 if mode == 'w+' or (mode == 'r+' and flen < bytes):
--> 250 fid.seek(bytes - 1, 0)
251 fid.write(b'\0')
252 fid.flush()
OSError: [Errno 22] Invalid argument
This error does not happen when the file size is under 2 GB...
I have reproduced the same problem on another machine running Windows 7, also 64-bit.
Have I forgotten something? Why is memmap acting as if I had a 32-bit system?
EDIT: The error is not exactly a runtime error. The variable "bytes" gets a RuntimeWarning (overflow) while computing the length of the file, which I guess produces a bad argument that then raises the Errno 22.
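For reference, a rough sketch of what seems to be happening (the dimensions below are made up): the default numpy integer on Windows has historically been a 32-bit C long, so the byte count derived from the shape wraps around once it exceeds 2 GB.
import numpy as np
# Hypothetical element count (e.g. 512 * 512 * 5000); forcing int32 mimics numpy's
# default integer on Windows, which is a 32-bit C long.
n_elements = np.int32(512 * 512 * 5000)   # 1310720000, still fits in an int32
n_bytes = n_elements * np.int32(2)        # 2 bytes per int16 element -> exceeds 2**31 - 1
print(n_bytes)                            # wraps to a negative value, with an overflow RuntimeWarning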

I had a similar error, and it turned out to be because one of the shape=(A, B) arguments was an int32 instead of an int64. Try the following:
len_im64 = np.array(len_im, dtype='int64')
big_file = np.memmap(fnamemm, dtype=np.float32, mode='w+', shape=(np.prod(dims[1:]).astype('int64'), len_im64), order='F')
It fixed it for me.

Even though the system is 64-bit, the problem may be that the application is built for a 32-bit target. Check whether your interpreter is running in 32-bit or 64-bit mode.
For such applications you have to make them large-address aware. Then 32-bit applications can access up to 4 GB of memory on 64-bit machines.
How to do that? This discussion explains it:
https://github.com/pyinstaller/pyinstaller/issues/1288
Note: if your application is already built for a 64-bit target, ignore this and say so in a comment; I will delete this answer.

Related

TypeError: can’t convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first (fastai)

I am following the code here:
https://www.kaggle.com/tanlikesmath/diabetic-retinopathy-with-resnet50-oversampling
However, during the metrics calculation, I am getting the following error:
File "main.py", line 50, in <module>
learn.fit_one_cycle(4,max_lr = 2e-3)
...
File "main.py", line 39, in quadratic_kappa
return torch.tensor(cohen_kappa_score(torch.argmax(y_hat,1), y, weights='quadratic'),device='cuda:0')
...
File "/pfs/work7/workspace/scratch/ul_dco32-conda-0/conda/envs/resnet50/lib/python3.8/site-packages/torch/tensor.py", line 486, in __array__
return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
Here are the metrics and the model:
def quadratic_kappa(y_hat, y):
    return torch.tensor(cohen_kappa_score(torch.argmax(y_hat,1), y, weights='quadratic'),device='cuda:0')
learn = cnn_learner(data, models.resnet50, metrics = [accuracy,quadratic_kappa])
learn.fit_one_cycle(4,max_lr = 2e-3)
As said in the discussion https://discuss.pytorch.org/t/typeerror-can-t-convert-cuda-tensor-to-numpy-use-tensor-cpu-to-copy-the-tensor-to-host-memory-first/32850/6, I have to bring the data back to the CPU. But I am slightly lost as to how to do it.
I tried adding .cpu() all over the metrics but could not solve it so far.
I'm assuming that both y and y_hat are CUDA tensors; that means you need to bring them both to the CPU for cohen_kappa_score, not just one:
def quadratic_kappa(y_hat, y):
    return torch.tensor(cohen_kappa_score(torch.argmax(y_hat.cpu(),1), y.cpu(), weights='quadratic'),device='cuda:0')
    # note the added .cpu() calls on both y_hat and y
Calling .cpu() on a tensor that is already on the CPU has no effect, so it's safe to use in any case.
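More generally, anything passed to a scikit-learn function has to live in host memory. A minimal standalone sketch (the tensors here are stand-ins for y_hat and y):
import torch
from sklearn.metrics import cohen_kappa_score

# Stand-in tensors; in the fastai metric they come from the learner.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
y_hat = torch.randn(8, 5, device=device)
y = torch.randint(0, 5, (8,), device=device)

# .detach() drops the autograd graph, .cpu() moves the data to host memory,
# after which scikit-learn (which works on numpy arrays) can consume the values.
preds = torch.argmax(y_hat, dim=1).detach().cpu()
score = cohen_kappa_score(preds, y.cpu(), weights='quadratic')
print(score)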
I went from a CPU to a GPU version and received this error. It was due to passing metrics=[mean_absolute_error,mean_squared_error] to the Learner object (in my case tabular_learner).
Removing the metrics parameter solved the issue temporarily for me.

What is OSError: [Errno 95] Operation not supported for pandas to_csv on colab?

My input is:
test=pd.read_csv("/gdrive/My Drive/data-kaggle/sample_submission.csv")
test.head()
It ran as expected.
But, for
test.to_csv('submitV1.csv', header=False)
The full error message that I got was:
OSError Traceback (most recent call last)
<ipython-input-5-fde243a009c0> in <module>()
9 from google.colab import files
10 print(test)'''
---> 11 test.to_csv('submitV1.csv', header=False)
12 files.download('/gdrive/My Drive/data-kaggle/submission/submitV1.csv')
2 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
3018 doublequote=doublequote,
3019 escapechar=escapechar, decimal=decimal)
-> 3020 formatter.save()
3021
3022 if path_or_buf is None:
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py in save(self)
155 f, handles = _get_handle(self.path_or_buf, self.mode,
156 encoding=self.encoding,
--> 157 compression=self.compression)
158 close = True
159
/usr/local/lib/python3.6/dist-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
422 elif encoding:
423 # Python 3 and encoding
--> 424 f = open(path_or_buf, mode, encoding=encoding, newline="")
425 elif is_text:
426 # Python 3 and no explicit encoding
OSError: [Errno 95] Operation not supported: 'submitV1.csv'
Additional Information about the error:
Before running this command, if I run
df=pd.DataFrame()
df.to_csv("file.csv")
files.download("file.csv")
It runs properly, but the same code produces the "operation not supported" error if I run it after trying to convert the test data frame to a CSV file.
I also get the message "A Google Drive timeout has occurred (most recently at 13:02:43)" just before running the command.
You are currently in a directory in which you don't have write permissions.
Check your current directory with pwd. It is probably gdrive or some directory inside it, which is why you are unable to save there.
Now change the current working directory to some other directory where you have write permissions. cd ~ will work fine; it will change the directory to /root.
Now you can use:
test.to_csv('submitV1.csv', header=False)
It will save 'submitV1.csv' to /root
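A minimal sketch of the same steps from a Colab cell (test is the DataFrame from the question; the Drive path is the one used above):
import os

print(os.getcwd())                    # check where the notebook is currently writing
os.chdir(os.path.expanduser('~'))     # equivalent of cd ~ ; /root is writable
test.to_csv('submitV1.csv', header=False)

# Alternatively, write straight into the mounted Drive folder:
# test.to_csv('/gdrive/My Drive/data-kaggle/submission/submitV1.csv', header=False)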

geopandas cannot read a geojson properly

I am trying the following:
After downloading http://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_050_00_20m.json
In [2]: import geopandas
In [3]: geopandas.read_file('./gz_2010_us_050_00_20m.json')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-83a1d4a0fc1f> in <module>
----> 1 geopandas.read_file('./gz_2010_us_050_00_20m.json')
~/miniconda3/envs/ml3/lib/python3.6/site-packages/geopandas/io/file.py in read_file(filename, **kwargs)
24 else:
25 f_filt = f
---> 26 gdf = GeoDataFrame.from_features(f_filt, crs=crs)
27
28 # re-order with column order from metadata, with geometry last
~/miniconda3/envs/ml3/lib/python3.6/site-packages/geopandas/geodataframe.py in from_features(cls, features, crs)
207
208 rows = []
--> 209 for f in features_lst:
210 if hasattr(f, "__geo_interface__"):
211 f = f.__geo_interface__
fiona/ogrext.pyx in fiona.ogrext.Iterator.__next__()
fiona/ogrext.pyx in fiona.ogrext.FeatureBuilder.build()
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
On the page http://eric.clst.org/tech/usgeojson/, which lists 4 geojson files under the 20m column, the above file corresponds to the US Counties row, and it is the only one of the 4 that cannot be read. The error message isn't very informative; I wonder what the reason is, please?
If your error message looks anything like "Polygons and MultiPolygons should follow the right-hand rule", it means the coordinate order of those geometries does not follow the right-hand rule (exterior rings counter-clockwise, holes clockwise).
Here's an online tool to "fix" your objects, with a short explanation:
https://mapster.me/right-hand-rule-geojson-fixer/
Possibly an answer for people arriving at this page: I received the same error, and it was thrown due to encoding issues.
Try re-encoding the initial file as UTF-8, or try opening the file with the encoding you think was applied to it. This fixed my error.
More info here
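A minimal sketch of both approaches (the latin-1 encoding is only a guess; substitute whatever the file actually uses):
import geopandas

# Option 1: pass the encoding through to fiona, which geopandas uses to read the file.
gdf = geopandas.read_file('./gz_2010_us_050_00_20m.json', encoding='latin-1')

# Option 2: rewrite the file as UTF-8 first, then read it normally.
with open('./gz_2010_us_050_00_20m.json', encoding='latin-1') as src:
    data = src.read()
with open('./gz_2010_us_050_00_20m_utf8.json', 'w', encoding='utf-8') as dst:
    dst.write(data)
gdf = geopandas.read_file('./gz_2010_us_050_00_20m_utf8.json')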

pandas memory error on large RAM machine but not on smaller RAM machine: same code, same data

I run the following on two of my machines:
import os, sqlite3
import pandas as pd
from feat_transform import filter_anevexp
db_path = r'C:\Users\timregan\Desktop\anondb_280718.sqlite3'
db = sqlite3.connect(db_path)
anevexp_df = filter_anevexp(db, 0)
On my laptop (with 8GB of RAM) this runs without issue (although the call out to filter_anevexp takes a few minutes). On my desktop (with 128GB of RAM) it fails in pandas with a memory error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\timregan\source\MentalHealth\code\preprocessing\feat_transform.py", line 171, in filter_anevexp
anevexp_df = anevexp_df[anevexp_df["user_id"].isin(df)].copy()
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 2724, in _getitem_array
return self._take(indexer, axis=0)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 2789, in _take
verify=True)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4539, in take
axis=axis, allow_dups=True)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4425, in reindex_indexer
for blk in self.blocks]
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4425, in <listcomp>
for blk in self.blocks]
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 1258, in take_nd
allow_fill=True, fill_value=fill_value)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\algorithms.py", line 1655, in take_nd
out = np.empty(out_shape, dtype=dtype)
MemoryError
Is there anything special I need to do to prevent errors (e.g. addressing errors) on machines with lots of memory?
N.B. I have not included the code of the filter_anevexp function because I am not interested in advice on how to reduce its memory footprint. I am interested in understanding why the same code running on the same data fails with a memory error on a 128GB RAM machine while it succeeds on an 8GB RAM machine.
You are using a 32-bit Python on your desktop (note the Python37-32 folder in the traceback); this means the Python process can only address a few GB of RAM, regardless of how much is installed. Reinstall Python 3.7 as the 64-bit build instead of the 32-bit one you are currently using.
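A quick way to confirm which build is running on each machine (standard library only):
import sys, platform
# On the 32-bit interpreter this prints '32bit' and False; on the 64-bit one, '64bit' and True.
print(platform.architecture()[0])
print(sys.maxsize > 2**32)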

Getting an EOF error when calling pd.read_pickle

Had a quick question regarding a pandas DataFrame and the pd.read_pickle() function. Basically, I have a large but simple DataFrame (333 MB). When I run pd.read_pickle on the pickle file, I am getting an EOFError.
Is there any way around this issue? What might be causing this?
I saw the same EOFError when I created a pickle using:
df.to_pickle('path.pkl', compression='bz2')
and then tried to read with:
pandas.read_pickle('path.pkl')
I fixed the issue by supplying the compression on read:
pandas.read_pickle('path.pkl', compression='bz2')
According to the Pandas docs:
compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’
string representing the compression to use in the output file. By default,
infers from the file extension in specified path.
Thus, simply changing the path from 'path.pkl' to 'path.bz2' also fixed the problem.
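A minimal, self-contained sketch of the round trip (file names are just examples):
import pandas as pd

df = pd.DataFrame({'a': range(5)})

# Write and read with an explicit, matching compression...
df.to_pickle('example.pkl', compression='bz2')
df2 = pd.read_pickle('example.pkl', compression='bz2')

# ...or use a .bz2 extension so the default compression='infer' works on both sides.
df.to_pickle('example.bz2')
df3 = pd.read_pickle('example.bz2')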
I can confirm the valuable comment of greg_data:
When I encountered this error I worked out that it was due to the initial pickling not having completed correctly. The pickle file was created, but not finished correctly. Seems to me this is the only possible source of the EOFError in pickle, that the pickle is malformed, i.e. not finished.
My error during pickling was:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-40-263240bbee7e> in <module>()
----> 1 main()
<ipython-input-38-9b3c6d782a2a> in main()
43 with open("/content/drive/MyDrive/{}.file".format(tm.id), "wb") as f:
---> 44 pickle.dump(tm, f, pickle.HIGHEST_PROTOCOL)
45
46 print('Coherence:', get_coherence(tm, token_lists, 'c_v'))
TypeError: can't pickle weakref objects
And when reading that pickle file that was obviously not finished during pickling, the reported error occurred:
pd.read_pickle(r'/content/drive/MyDrive/TEST_2021_06_01_10_23_02.file')
Error:
---------------------------------------------------------------------------
EOFError Traceback (most recent call last)
<ipython-input-41-460bdd0a2779> in <module>()
----> 1 object = pd.read_pickle(r'/content/drive/MyDrive/TEST_2021_06_01_10_23_02.file')
/usr/local/lib/python3.7/dist-packages/pandas/io/pickle.py in read_pickle(filepath_or_buffer, compression)
180 # We want to silence any warnings about, e.g. moved modules.
181 warnings.simplefilter("ignore", Warning)
--> 182 return pickle.load(f)
183 except excs_to_catch:
184 # e.g.
EOFError: Ran out of input