I have a large collection of NumPy arrays saved on disk. I would like to read them efficiently and concurrently with the training. I can't load them all into memory at once - the data set is too large.
Additionally, it would be nice to apply some user-defined transforms on the fly, and to be able to read them from C++, not just Python.
I believe CNTK does not have this capability now, am I correct?
Currently, we don't have a built-in NumPy reader. However, you have multiple options:
Read the NumPy data in batches and feed it to the trainer. Here is an example that reads images into a NumPy array and feeds them to the trainer:
https://github.com/Microsoft/FERPlus
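For illustration, a minimal sketch of that batch-feeding approach (this is not a CNTK reader; it assumes your features and labels are stored as one .npy file per batch, and that trainer, input_var and label_var already exist from your model setup):
import glob
import numpy as np

feature_files = sorted(glob.glob("features_*.npy"))  # hypothetical file layout
label_files = sorted(glob.glob("labels_*.npy"))

for f_path, l_path in zip(feature_files, label_files):
    features = np.load(f_path).astype(np.float32)  # only one batch in memory at a time
    labels = np.load(l_path).astype(np.float32)
    # apply any user-defined transform on the fly here before feeding the batch
    trainer.train_minibatch({input_var: features, label_var: labels})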
What is the data inside your NumPy array? Can you convert it to a format readable by one of the CNTK readers?
I've tried looking for this and haven't had any meaningful results.
I have a Keras model that has multiple inputs, and my data was getting too large for my pandas approach, so I preprocessed it and saved it as a parquet file. I'm not sure how to open it with Keras.
I looked up tf.datasets but I still cannot figure out how to read a parquet file that I can pass to my model.
Does anyone know how to open parquet files? I can't seem to figure out how to do this in TensorFlow and can't find anything related to it in Keras.
You can probably keep your pandas approach, but you would have to break your data down into chunks.
If you have already broken it down to create your parquet file, you should be able to use the same method to have only a subset of your data opened in pandas at a time.
If you need to extract the data from your parquet file here's a link on how to create chunks of data for a pandas dataframe:
How to read a CSV file subset by subset with Pandas?
Once you have a chunk of data, you can call model.fit on that chunk, then move on to the next chunk and call model.fit again.
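A minimal sketch of that idea, reading the parquet file chunk by chunk with pyarrow (the path, chunk size and column names below are placeholders, and model is assumed to be your compiled Keras model; for a multi-input model, pass a list or dict of arrays instead of x):
import pyarrow.parquet as pq

BATCH_ROWS = 10_000  # hypothetical chunk size; tune to your memory budget

pf = pq.ParquetFile("data.parquet")            # hypothetical path
for batch in pf.iter_batches(batch_size=BATCH_ROWS):
    chunk = batch.to_pandas()                  # only this chunk lives in memory
    x = chunk[["feat_a", "feat_b"]].values     # hypothetical feature columns
    y = chunk["label"].values                  # hypothetical label column
    model.fit(x, y, epochs=1, verbose=0)       # fit on this chunk, then continue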
You can look into TensorFlow I/O, which is a collection of file systems and file formats that are not available in TensorFlow's built-in support. There you can find functionality such as tfio.IODataset.from_parquet, and also tfio.IOTensor.from_parquet, to work with the parquet file format.
!pip install tensorflow_io -U -q

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_io as tfio

# build a small example DataFrame and write it to parquet
df = pd.DataFrame({"data": tf.random.normal([20], 0, 1, tf.float32),
                   "label": np.random.randint(2, size=(20))})
df.to_parquet("df.parquet")
pd.read_parquet('/content/df.parquet')[:2]
       data  label
0  0.721347      1
1 -1.215225      1
ds = tfio.IODataset.from_parquet('/content/df.parquet')
ds
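To feed this into model.fit, you can map the column structure into (features, label) tuples and batch it. A rough sketch (check ds.element_spec first: depending on the tensorflow-io version the columns may be keyed as strings like 'data' or as bytes like b'data'):
# adjust the keys to whatever ds.element_spec reports
train_ds = ds.map(lambda row: (row['data'], row['label'])).batch(4)
# model.fit(train_ds, epochs=...)   # assumes `model` is your compiled Keras model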
FYI, I think you should also consider using the feather format rather than the parquet file format. AFAIK, parquet files can be really heavy to load and can slow down your training pipeline, whereas feather is comparatively fast (very fast).
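For reference, a quick sketch of the feather round-trip with the df built above (requires pyarrow; the file name is a placeholder):
df.to_feather("df.feather")          # write the same DataFrame as feather
df2 = pd.read_feather("df.feather")  # read it back into pandas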
I have a large h5 file with a 5-dimensional NumPy array in HDFS. The file size is ~130Gb. I am facing memory issues while loading the file: the process gets killed with an OOM error even though the machine has 256Gb of RAM. How can I write the file in chunks and load it back in chunks? I looked around and found that h5py provides a method to chunk the dataset like so, but how do I load the data back in chunks? Also, will it work if the file resides in HDFS?
dset = f.create_dataset("Images2", (100,480,640), 'f', chunks=True)
The idea is to load the file in batches to reduce I/O time as well as avoid memory issues. Any help would be much appreciated.
Two similar (but different) h5py I/O concepts are mentioned in the answer and comments above:
HDF5 Chunking is used to enable chunked I/O for improved performance. Chunking may not help if you get an OOM error when you try to read a large dataset with insufficient memory.
NumPy style Slicing is used to read a slice of the data from the drive to memory (or write a slice of data to the drive). Slicing is the key to avoid OOM errors when reading very large files.
Also, when creating very large datasets, you generally need to make it resizable. You can allocate an initial size, then use the ".resize()" method to increase the size on disk.
I wrote a simple example that shows how to use both slicing and chunking. It loads 100 images at a time into a resizeable dataset. It then closes the file and reopens (read-only) to read 100 images at a time into a NumPy array.
Effective chunking requires appropriate size/shape and is based on your array shape and I/O needs. I set the chunk size/shape in my example to match the size of 100 image array I was writing/reading.
This example should get you started. You will need to modify it to use a 5-d array/dataset.
import numpy as np
import h5py

# Write: create a resizable, chunked dataset and add 100 images at a time
with h5py.File('SO_64645940.h5', 'w') as h5w:
    img_ds = h5w.create_dataset('Images', shape=(100, 480, 640), dtype='f',
                                maxshape=(None, 480, 640), chunks=(10, 480, 640))
    next_img_row = 0
    arr = np.random.random(100*480*640).reshape(100, 480, 640)
    for cnt in range(1, 10):
        # print(cnt, img_ds.len(), next_img_row)
        if img_ds.len() == next_img_row:
            # grow the dataset before writing past its current end
            img_ds.resize(100*cnt, axis=0)
            print('new ds size=', img_ds.len())
        h5w['Images'][next_img_row:next_img_row+100] = arr
        next_img_row += 100

# Read: reopen read-only and pull back 100 images at a time with slicing
with h5py.File('SO_64645940.h5', 'r') as h5r:
    for cnt in range(9):  # 900 images were written above
        print('get slice#', str(cnt))
        img_arr = h5r['Images'][cnt*100:(cnt+1)*100]
Chunking in HDF5 means that the data is not stored contiguously, but in chunks.
See information here: https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage
--> So this doesn't help you with your problem.
The solution might be that you build a function yourself to load the data chunkwise.
For example, I did it this way to get the data in chunks:
import itertools

def get_chunked(data, chunk_size=100):
    # yield the data one chunk (a block of consecutive rows) at a time
    for i in give_chunk(len(data), chunk_size):
        chunked_array = data[i]
        yield chunked_array

def give_chunk(length, chunk_size):
    # yield lists of consecutive indices, chunk_size at a time
    it = iter(range(length))
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        yield chunk
For writing the data to HDF5 you can create the dataset first and then write the data chunk-wise with slicing; see the h5py documentation: https://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data
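As a rough sketch of that write path, reusing give_chunk() from above (the file name, dataset name and array shape are placeholders; your real data would be the 5-d array):
import h5py
import numpy as np

data = np.random.random((1000, 8, 8)).astype('f4')    # stand-in for your array

with h5py.File('chunked_write.h5', 'w') as f:
    dset = f.create_dataset('data', shape=data.shape, dtype=data.dtype,
                            chunks=True)               # let h5py pick a chunk shape
    for idx in give_chunk(len(data), chunk_size=100):
        # write one contiguous slice at a time instead of the whole array
        dset[idx[0]:idx[-1] + 1] = data[idx[0]:idx[-1] + 1]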
I can really recommend this book for basic knowledge about HDF5: https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/
I have data in a mat file (observations and features) and I want to load it into a 2D NumPy array. I don't want to convert it to CSV first and then load the CSV into NumPy.
Use scipy's loadmat (API-docs).
The docs should be sufficient to get you going, but make sure to read the notes.
There is also the io-tutorial with some examples.
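A minimal sketch, assuming the file is called 'data.mat' and the variable of interest inside it is named 'X' (both are placeholders for your own names):
from scipy.io import loadmat

mat = loadmat('data.mat')   # returns a dict mapping variable names to ndarrays
X = mat['X']                # 2-D NumPy array: observations x features
print(X.shape, X.dtype)
One note worth reading: loadmat does not handle MATLAB v7.3 files, which are HDF5-based; those need to be opened with h5py instead.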
Is it possible to convert a Breeze DenseMatrix to a NumPy array using Spark?
I have a Breeze DenseMatrix here that I want to convert to a NumPy array.
Here is a way that works correctly but is slow / inefficient (it creates multiple copies). I used the Zeppelin spark and pyspark interpreters (I guess Toree should also work):
in spark:
%spark
import breeze.linalg._
import breeze.numerics._
z.put("matrix", DenseMatrix.eye[Double](4));
z.get("matrix")
then in python:
%pyspark
import numpy as np
def breeze2numpy(breeze_matrix):
    # copy the column-major data buffer out of the JVM into a Python list,
    # then reshape with Fortran ordering to match Breeze's layout
    data = list(breeze_matrix.copy().data())
    return np.array(data).reshape(breeze_matrix.rows(), breeze_matrix.cols(), order='F')
breeze2numpy(z.z.get("matrix"))
This works but will be impractical for big datasets (because of the multiple copies involved via a Python list). It would be nice to have a zero-copy method using Python's buffer protocol, like there is for C++ Eigen matrix --> numpy array.
I am trying to create a 78TB HDF5 dataset by filling it in a 2d block-partition manner. This is very slow when the block I'm writing spans rows that haven't ever been written to, because HDF5 goes in, allocates the disk space, and fills in the missing entries with zero.
Instead, I would like h5py to allocate the disk space for my dataset as soon as it's created, and never fill it. This is possible with the C API according to Table 16 in the HDF5 Dataset documentation, but how can I do this with h5py, preferably with the high-level interface?
I believe that you want to set the fill time to "never", with the H5Pset_fill_time() routine, but I don't know the h5py way to do that.
As Quincey suggested, you can use the low-level h5py API to create the dataset with the FILL_TIME_NEVER property and then convert it back to a high-level Dataset object:
# create the rows dataset using the low-level api so I can force it to not do zero-filling, then convert to a high level object
spaceid = h5py.h5s.create_simple((numRows, numCols))
plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
plist.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
plist.set_chunk((rowchunk, colchunk))
datasetid = h5py.h5d.create(fout.id, "rows", h5py.h5t.NATIVE_DOUBLE, spaceid, plist)
rows = h5py.Dataset(datasetid)
Try specifying a chunk shape that matches your write pattern. For example if you are writing in blocks of 1024x1024, it would look like this:
import h5py
import numpy as np
f = h5py.File('mybigdset.h5', 'w')
dset = f.create_dataset('dset', (78*1024*1024, 1024*1024), dtype='f4', chunks=(1024,1024))
arr = np.random.rand(1024,1024)
dset[0:1024, 0:1024] = arr
f.close()
Thankfully, this didn't use 78TB of disk; the file size was just 4MB, since with chunked storage HDF5 only allocates space for the chunks that have actually been written.