Numpy Load OverflowError: length too large

I have an algorithm that runs through a dataset and creates a scipy sparse matrix, which in turn is saved using:
numpy.savez
with the file opened as:
open(file, 'wb')
The matrix can take up a considerable amount of disk space (about 20 GB for a 30-day run).
Afterwards, those matrices are loaded into other applications like so:
file = open(path_to_file, 'rb')
matrix = load(file)
data = matrix['arr_0']
ind = matrix['arr_1']
indptr = matrix['arr_2']
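Presumably the three arrays are then reassembled into the sparse matrix along these lines (a sketch only; the CSR format is inferred from the variable names, and the shape argument is hypothetical):
from scipy.sparse import csr_matrix

# data, ind and indptr as loaded above; n_rows and n_cols stand in for the
# original matrix dimensions, which are not given in the question
matrix = csr_matrix((data, ind, indptr), shape=(n_rows, n_cols))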
For a 10-day dataset this worked fine.
For the 30-day dataset the matrix was also successfully created and saved,
but when trying to load it I got the error:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/ubuntu/recsys/Scripts/Neighborhood/s3_CRM_neighborhood.py", line 76, in <module>
data = matrix['arr_0']
File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 241, in __getitem__
return format.read_array(value)
File "/usr/lib/python2.7/dist-packages/numpy/lib/format.py", line 458, in read_array
data = fp.read(int(count * dtype.itemsize))
OverflowError: length too large
If I could successfully create and save the matrices, shouldn't I also be able to load the result? Is there some overhead that is killing the loading? Is it possible to work around this issue?
Thanks in advance,

From the release notes of the just-published NumPy 1.8, release candidate 1:
IO compatibility with large files
Large NPZ files >2GB can be loaded on 64-bit systems.
So it seems you have hit a known bug that has just been fixed.
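Once you are on NumPy >= 1.8 with a 64-bit Python, the same savez/load round trip should work for archives larger than 2 GB. A quick way to confirm what you are running, plus the round trip in miniature (the small test array and file name are just stand-ins):
import numpy as np

print(np.__version__)             # needs to be >= 1.8 for NPZ archives larger than 2 GB
print(np.intp(0).itemsize * 8)    # 64 on a 64-bit build

data = np.arange(10)
with open('test.npz', 'wb') as f:
    np.savez(f, data)

with open('test.npz', 'rb') as f:
    arrs = np.load(f)
    loaded = arrs['arr_0']        # arrays must be read while the file is still open

print(np.array_equal(data, loaded))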

Related

Python matplotlib/pandas fails with "OverflowError: value too large to convert to npy_uint32"

I am trying to run this variant-calling pipeline with 144 samples, so the resulting files are quite big. I managed to get it almost to the end, but the last rule (plots_stats) fails with OverflowError: value too large to convert to npy_uint32. This is a Python script that plots from a gzipped tsv file. I guess I just have too many rows in my calls.tsv.gz to be handled. The complete error log is:
Traceback (most recent call last):
File "/[PATH]/workflow_var_calling/.snakemake/scripts/tmp10j_ba31.plot-depths.py", line 16, in <module>
sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False)
File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/series.py", line 2899, in unstack
return unstack(self, level, fill_value)
File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 501, in unstack
constructor=obj._constructor_expanddim)
File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 116, in __init__
self.index = index.remove_unused_levels()
File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1494, in remove_unused_levels
uniques = algos.unique(lab)
File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/algorithms.py", line 367, in unique
table = htable(len(values))
File "pandas/_libs/hashtable_class_helper.pxi", line 937, in pandas._libs.hashtable.Int64HashTable.__cinit__
OverflowError: value too large to convert to npy_uint32
any ideas?
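No answer is recorded here, but the traceback suggests the cause: the hash table pandas builds while reshaping is sized with an unsigned 32-bit integer (npy_uint32), so once the stacked intermediate has more than about 4.29 billion entries its constructor overflows. A hedged diagnostic sketch to check how many rows are going into the reshape (the chunk size and the use of read_csv here are assumptions):
import pandas as pd

n_rows = 0
# count rows of the gzipped tsv in manageable chunks instead of loading it whole
for chunk in pd.read_csv("calls.tsv.gz", sep="\t", compression="gzip", chunksize=10**6):
    n_rows += len(chunk)

# the stacked/unstacked intermediate is roughly n_rows times the number of
# sample columns; if that product approaches 2**32 (~4.29e9), the hash table
# construction shown in the traceback above will overflow
print(n_rows)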

Running the Temporal Fusion Transformer default dataset: shape error

I ran the default Temporal Fusion Transformer code from GitHub in Google Colab.
After cloning, when I ran step 2, the training test would not run:
python3 -m script_train_fixed_params volatility outputs yes
The problem is the shape error below.
Computing best validation loss
Computing test loss
/usr/local/lib/python3.7/dist-packages/keras/engine/training_v1.py:2079: UserWarning: `Model.state_updates` will be removed in a future version. This property should not be used in TensorFlow 2.0, as `updates` are applied automatically.
updates=self.state_updates,
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/drive/MyDrive/tft_tf2/script_train_fixed_params.py", line 239, in <module>
use_testing_mode=True) # Change to false to use original default params
File "/content/drive/MyDrive/tft_tf2/script_train_fixed_params.py", line 156, in main
targets = data_formatter.format_predictions(output_map["targets"])
File "/content/drive/MyDrive/tft_tf2/data_formatters/volatility.py", line 183, in format_predictions
output[col] = self._target_scaler.inverse_transform(predictions[col])
File "/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_data.py", line 1022, in inverse_transform
force_all_finite="allow-nan",
File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 773, in check_array
"if it contains a single sample.".format(array)
ValueError: Expected 2D array, got 1D array instead:
array=[-1.43120418 1.58885804 0.28558148 ... -1.50945972 -0.16713021
-0.57365613].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I've tried to modify the code that shapes the prediction dataframe (data_formatters/volatility.py, line 183, in format_predictions), because I guessed that's where the problem arises, but I couldn't get it to work.
You have to change line 183 in volatility.py to
output[col] = self._target_scaler.inverse_transform(predictions[col].values.reshape(-1, 1))
and line 216 in electricity.py to
sliced_copy[col] = target_scaler.inverse_transform(sliced_copy[col].values.reshape(-1, 1))
Afterwards the electricity example works fine, and I guess the same applies to volatility.
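The underlying reason is that scikit-learn scalers expect a 2D array of shape (n_samples, n_features), which is why the single column has to be reshaped with .values.reshape(-1, 1). A minimal sketch of the difference (StandardScaler and the toy data here are just for illustration):
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(np.arange(10, dtype=float).reshape(-1, 1))

col = np.array([0.5, -1.2, 2.0])                      # 1D, like predictions[col].values
# scaler.inverse_transform(col)                       # ValueError: Expected 2D array, got 1D array
print(scaler.inverse_transform(col.reshape(-1, 1)))   # OK: many samples, one feature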

nibabel.filebasedimages.ImageFileError for large label files

I am getting the following error when loading a large (14.4 MB) labeled NIfTI file.
Traceback (most recent call last):
File "/home/miran045/reine097/projects2/lab2im/lab2im/dcan/reproduce_load_error.py", line 7, in <module>
img = nib.load(file_path)
File "/home/miran045/reine097/.local/lib/python3.7/site-packages/nibabel/loadsave.py", line 55, in load
raise ImageFileError(f'Cannot work out file type of "{filename}"')
nibabel.filebasedimages.ImageFileError: Cannot work out file type of "/home/feczk001/shared/data/nnUNet/nnUNet_raw_data_base/nnUNet_raw_data/Task509_Paper/labelsTr1/1mo_sub-375518.nii.gz"
Here is the code:
import nibabel as nib
print(nib.__version__)
file_path = '/home/feczk001/shared/data/nnUNet/nnUNet_raw_data_base/nnUNet_raw_data/Task509_Paper/labelsTr1/1mo_sub' \
'-375518.nii.gz'
img = nib.load(file_path)
print(img.shape)
This does not happen with smaller files of the same kind (on the order of KB). I can open this file in FreeSurfer FreeView without error and it looks fine. This is happening with NiBabel version 3.2.1.
The file is probably not actually gzip-compressed, despite the .gz extension. Try gzipping it.
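A quick way to check is to look at the first two bytes: a real gzip stream starts with the magic bytes 0x1f 0x8b. A sketch under that assumption (file_path as defined in the question; fixed_path is a hypothetical output name):
import gzip, shutil

with open(file_path, 'rb') as f:
    is_gzip = f.read(2) == b'\x1f\x8b'   # real gzip streams start with these magic bytes
print(is_gzip)

if not is_gzip:
    # the file is presumably a plain .nii that was only renamed; compress it for real
    fixed_path = file_path.replace('.nii.gz', '_regz.nii.gz')
    with open(file_path, 'rb') as src, gzip.open(fixed_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)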

pandas memory error on large RAM machine but not on smaller RAM machine: same code, same data

I run the following on two of my machines:
import os, sqlite3
import pandas as pd
from feat_transform import filter_anevexp
db_path = r'C:\Users\timregan\Desktop\anondb_280718.sqlite3'
db = sqlite3.connect(db_path)
anevexp_df = filter_anevexp(db, 0)
On my laptop (with 8GB of RAM) this runs without issue (although the call out to filter_anevexp takes a few minutes). On my desktop (with 128GB of RAM) it fails in pandas with a memory error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\timregan\source\MentalHealth\code\preprocessing\feat_transform.py", line 171, in filter_anevexp
anevexp_df = anevexp_df[anevexp_df["user_id"].isin(df)].copy()
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 2724, in _getitem_array
return self._take(indexer, axis=0)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 2789, in _take
verify=True)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4539, in take
axis=axis, allow_dups=True)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4425, in reindex_indexer
for blk in self.blocks]
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4425, in <listcomp>
for blk in self.blocks]
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 1258, in take_nd
allow_fill=True, fill_value=fill_value)
File "C:\Users\timregan\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\algorithms.py", line 1655, in take_nd
out = np.empty(out_shape, dtype=dtype)
MemoryError
Is there anything special I need to do to prevent errors (e.g. addressing errors) on machines with lots of memory?
N.B. I have not included the code of the filter_anevexp function because I am not interested in advice on how to reduce its memory footprint. I am interested in understanding why the same code, running on the same data, fails with a memory error on a 128GB RAM machine while it succeeds on an 8GB RAM machine.
You are using a 32-bit Python build on the failing machine (note the Python37-32 folder in the traceback paths), which means the process can only address at most 4 GB of RAM, and usually less in practice. Reinstall Python 3.7 as the 64-bit build instead of the 32-bit one you are currently using.
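A quick way to confirm which build is running on each machine (nothing machine-specific assumed here):
import platform
import struct
import sys

print(platform.architecture()[0])   # '32bit' or '64bit'
print(struct.calcsize('P') * 8)     # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)          # True only on a 64-bit build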

matplotlib savefig IO error

I am trying to use the matplotlib.pyplot.savefig() function to save some figures.
I am saving them to a directory, but I keep getting this error:
matplotlib.pyplot.savefig(savepath,dpi=dpi,size=size)
File "C:\Anaconda\lib\site-packages\matplotlib\pyplot.py", line 577, in savefig
res = fig.savefig(*args, **kwargs)
File "C:\Anaconda\lib\site-packages\matplotlib\figure.py", line 1476, in savefig
self.canvas.print_figure(*args, **kwargs)
File "C:\Anaconda\lib\site-packages\matplotlib\backends\backend_qt5agg.py", line 161, in print_figure
FigureCanvasAgg.print_figure(self, *args, **kwargs)
File "C:\Anaconda\lib\site-packages\matplotlib\backend_bases.py", line 2211, in print_figure
**kwargs)
File "C:\Anaconda\lib\site-packages\matplotlib\backends\backend_agg.py", line 526, in print_png
filename_or_obj = open(filename_or_obj, 'wb')
IOError: [Errno 2] No such file or directory:
Of course the file doesn't exist, since I am trying to save it now.
The directory does exist; I have checked repeatedly.
I am completely baffled, as this worked perfectly fine two days ago, but with no changes to the code it does not now. EDIT: I recently updated the Anaconda Python distribution I was using, from 32-bit Anaconda 2.0 to 64-bit Anaconda 2.3, both for Python 2.7.
Does anyone have any clue?
Thank you for reading my desperate plea for assistance!
EDIT:
I am also getting what I believe to be the same error when saving txt files in Python now:
f = open(fname, 'w')
IOError: [Errno 2] No such file or directory: 'D:\\DropBox\\Dropbox\\abc\\Time resolved spectroscopy data\\LiHoF4\\High resolution 1cm\\power spectra\\Si\\RT\\25ns\\CUT POWER SPECTRUM LiHoF pumping 5G5 449.8nm DC si detector 1cm resolution 25ns data aquisition 2000 points 5AVG RT 450nmlongpassfilter.0.dpt_fitting_output.txt'
Could this be to do with the long file path?
I don't understand why there would be a problem with a file not existing when opening it to write to.
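The long path is a plausible culprit: on Windows the classic MAX_PATH limit is 260 characters, and opening a path longer than that can fail with exactly this kind of "No such file or directory" error even though the directory exists. A hedged diagnostic sketch (fname is the path from the edit above; the short path is hypothetical):
import os

print(len(fname))                             # the classic Windows MAX_PATH limit is 260 characters
print(os.path.isdir(os.path.dirname(fname)))  # confirm the target directory really exists

# test the length theory by writing the same data under a short, known-good path
short_name = r'C:\temp\test_output.txt'       # hypothetical short path
with open(short_name, 'w') as f:
    f.write('test')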