Faster numpy.random.shuffle with a length limit? - numpy

I am using numpy.random.shuffle to shuffle a list of data. The list is long, so I want to randomly sample some of the data to work with.
I implemented this using the following code:
# data_list is a numpy array of shape (num_data,)
index = np.arange(data_list.size)
np.random.shuffle(index)
index = index[:len_limit]
data = data_list[index]
But since index is big, the shuffle is slow.
Any advice to improve the performance?

This is a common problem. I use the following:
Drawing with replacement
idxs = np.random.randint(0, high=len(data), size=(N,))
result = data[idxs]
Drawing without replacement
import random
idxs = random.sample(range(len(data)), N)  # on Python 2, use xrange instead of range
result = data[idxs]
where data is your original dataset and N is the number of desired samples. Either should be faster than shuffling, as long as N << len(data).
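If you want to confirm the speed difference on your own machine, a rough timing sketch along these lines could work (the array size and N below are arbitrary placeholders, not values from the question):
import timeit
import numpy as np

data = np.arange(10_000_000)  # hypothetical large dataset
N = 1000

def by_shuffle():
    # shuffle a full index array, then keep the first N entries
    idx = np.arange(data.size)
    np.random.shuffle(idx)
    return data[idx[:N]]

def by_randint():
    # draw N indices directly (with replacement)
    return data[np.random.randint(0, data.size, size=N)]

print(timeit.timeit(by_shuffle, number=10))
print(timeit.timeit(by_randint, number=10))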

Try np.random.choice, with replace=False.
Example (using the same variables as in the question):
data = np.random.choice(data_list, len_limit, replace=False)
You'll need numpy version 1.7.0 or later.
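On NumPy 1.17 and later, the same thing can also be written with the Generator API; a minimal sketch, reusing the question's variable names:
import numpy as np

rng = np.random.default_rng()
# sample len_limit elements from data_list without replacement
data = rng.choice(data_list, size=len_limit, replace=False)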

Related

A more efficient way of creating an NxM array in Python

In Python, I need to create an NxM matrix in which the (i, j) entry has the value i^2 + j^2.
I'm currently constructing it with two for loops, but the array is quite big, the computation takes a long time, and I need to do it several times. Is there a more efficient way of constructing such a matrix, perhaps using NumPy?
You can use broadcasting in NumPy; see the official documentation for details. For example:
import numpy as np
N = 3; M = 4  # whatever values you'd like
a = (np.arange(N)**2).reshape((-1, 1))  # reshape into a column vector
b = np.arange(M)**2
print(a + b)  # broadcasting applied
Instead of np.arange(), you can use np.array([...some array...]) if you need custom values.
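If you prefer not to reshape by hand, np.add.outer does the same broadcasting in a single call; a short sketch with the same N and M:
import numpy as np

N, M = 3, 4
result = np.add.outer(np.arange(N)**2, np.arange(M)**2)  # result[i, j] == i**2 + j**2
print(result)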

Numpy iterating over rows

I have the (possibly mistaken) impression that for loops should be avoided in NumPy for speed reasons. For example:
import numpy
a = numpy.array([[2,0,1,3],[0,2,3,1]])
targets = numpy.array([[1,1,1,1,1,1,1]])
output = numpy.zeros((2,1))
for i in range(2):
    output[i] = numpy.mean(targets[a[i]])
Is this a good way to take the mean over selected positions of each row? It feels like there might be a way to index the array first and then apply mean directly.
I think you are looking for this:
targets[a].mean(1)
Note that in your example, targets needs to be 1-D, not 2-D. Otherwise, your loop raises an out-of-bounds index error, because the index is interpreted as a row index rather than a column index.
numpy actually handles this for you: targets[a] works "row-wise", and then using np.mean(targets[a], axis=1), as suggested by @hpaulj in the comments, does exactly what you want:
import numpy
a = numpy.array([[2,0,1,3],[0,2,3,1]])
targets = numpy.arange(1,6) # To make the results differ
output = numpy.mean(targets[a], axis=1) # the i-th row of targets[a] is targets[a[i]]
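If you want to double-check that the vectorized version matches the original loop, a quick comparison (continuing the snippet above) might look like this:
# recompute the result with the original loop and compare
output_loop = numpy.zeros(2)
for i in range(2):
    output_loop[i] = numpy.mean(targets[a[i]])
print(numpy.allclose(output, output_loop))  # True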

How to check the presence of a given numpy array in a larger-shape numpy array?

I guess the title of my question might not be very clear...
I have a small array, say a = [[0,0,0],[0,0,1],[0,1,1]]. Then I have a bigger array of a higher dimension, say b = [[[2,2,2],[2,0,1],[2,1,1]],[[0,0,0],[3,3,1],[3,1,1]],[...]].
I'd like to check whether any of the rows of a can be found in b. In this case, I'd find that the first row of a, [0,0,0], is indeed in b, and I'd like to retrieve the corresponding index in b.
I'd like to do that without looping, since from the little I understand about numpy arrays, they are not meant to be iterated over in the classic way. In other words, I need this to be very fast, because my actual arrays are quite big.
Any idea?
Thanks a lot!
Arnaud.
I don't know of a direct way, but here's a function that works around the problem:
import numpy as np
def find_indices(val, arr):
    # First take a mean at the lowest level of each array,
    # then compare these to eliminate the majority of entries
    mb = np.mean(arr, axis=2); ma = np.mean(val)
    Y = np.argwhere(mb == ma)
    indices = []
    # Then run a quick loop on the remaining elements to
    # eliminate arrays that don't match the order
    for i in range(len(Y)):
        idx = (Y[i, 0], Y[i, 1])
        if np.array_equal(val, arr[idx]):
            indices.append(idx)
    return indices
# Sample arrays
a = np.array([[0,0,0],[0,0,1],[0,1,1]])
b = np.array([[[6,5,4],[0,0,1],[2,3,3]],
              [[2,5,4],[6,5,4],[0,0,0]],
              [[2,0,2],[3,5,4],[5,4,6]],
              [[6,5,4],[0,0,0],[2,5,3]]])
print(find_indices(a[0], b))
# [(1, 2), (3, 1)]
print(find_indices(a[1], b))
# [(0, 1)]
The idea is to take the mean of each sub-array and compare it with the mean of the input; np.argwhere() is the key here. That removes most of the unwanted candidates, but I did need a loop over the remainder to discard sub-arrays whose values match the mean but not the actual order (this shouldn't be too memory-consuming). You'll probably want to customise it further, but I hope this helps.
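If the sub-arrays you search for always lie along the last axis, a fully vectorized alternative is to broadcast the comparison and reduce over that axis; a sketch using the same a and b as above:
matches = (b == a[0]).all(axis=-1)      # boolean array of shape b.shape[:-1]
indices = np.argwhere(matches)          # positions where a[0] occurs
print([tuple(idx) for idx in indices])  # [(1, 2), (3, 1)]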

How to merge very large numpy arrays?

I will have many NumPy arrays stored in npz files, saved with the savez_compressed function.
I am splitting the information across many arrays because otherwise the functions I am using crash due to memory issues. The data is not sparse.
I will need to join all that information into one single array (to be able to process it with some routines), and store it on disk (to process it many times with different parameters).
The arrays won't fit into RAM plus swap.
How can I merge them into a single array and save it to disk?
I suspect that I should use mmap_mode, but I do not understand exactly how. I also imagine there could be performance issues if I do not reserve contiguous disk space first.
I have read this post but I still cannot figure out how to do it.
EDIT
Clarification: I have written many functions to process similar data, and some of them require an array as an argument. In some cases I could pass them only part of this large array by slicing, but it is still important to have all the information in one such array.
This is because the arrays contain time-ordered information (from physical simulations). Among the arguments of the functions, the user can set the initial and final time to process, as well as the size of the processing chunk (which matters because it affects performance, but the allowed chunk size depends on the computational resources). Because of this, I cannot store the data as separate chunks.
The way in which this particular array (the one I am trying to create) is built is not important, as long as it works.
You should be able to load the data chunk by chunk into an np.memmap array:
import numpy as np

data_files = ['file1.npz', 'file2.npz', ...]

# If you do not know the final size beforehand you need to
# go through the chunks once first to check their sizes
rows = 0
cols = None
dtype = None
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        rows += chunk.shape[0]
        cols = chunk.shape[1]
        dtype = chunk.dtype

# Once the size is known, create the memmap and write the chunks
merged = np.memmap('merged.buffer', dtype=dtype, mode='w+', shape=(rows, cols))
idx = 0
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        merged[idx:idx + len(chunk)] = chunk
        idx += len(chunk)
However, as pointed out in the comments, working along a dimension that is not the fastest-varying one will be very slow.
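Once all chunks are written, the merged buffer can later be reopened read-only without loading it into RAM; a minimal sketch, assuming the same dtype and shape that were used when writing:
# reopen the merged buffer read-only; data is paged in from disk on access
merged = np.memmap('merged.buffer', dtype=dtype, mode='r', shape=(rows, cols))
print(merged[:10].mean())  # only the accessed rows are read from disk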
This is an example of how to write 90 GB of easily compressible data to disk. The most important points are covered here: https://stackoverflow.com/a/48405220/4045774
The write/read speed should be in the range of 300-500 MB/s on a normal HDD.
Example
import numpy as np
import tables  # registers blosc
import h5py as h5
import h5py_cache as h5c
import time

def read_the_arrays():
    # Easily compressible data
    # A lot smaller than your actual array; I do not have that much RAM
    return np.arange(10*int(15E3)).reshape(10, int(15E3))

def writing(hdf5_path):
    # As we are writing whole chunks here this isn't really needed, but if you forget to set
    # a large enough chunk cache size when not writing or reading whole chunks, the performance
    # will be extremely bad (chunks can only be read or written as a whole)
    f = h5c.File(hdf5_path, 'w', chunk_cache_mem_size=1024**2*1000)  # 1000 MB cache size
    dset = f.create_dataset("your_data", shape=(int(15E5), int(15E3)), dtype=np.float32,
                            chunks=(10000, 100), compression=32001,
                            compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    # Let's write to the dataset
    for i in range(0, int(15E5), 10):
        dset[i:i+10, :] = read_the_arrays()
    f.close()

def reading(hdf5_path):
    f = h5c.File(hdf5_path, 'r', chunk_cache_mem_size=1024**2*1000)  # 1000 MB cache size
    dset = f["your_data"]
    # Read chunks
    for i in range(0, int(15E3), 10):
        data = np.copy(dset[:, i:i+10])
    f.close()
hdf5_path='Test.h5'
t1=time.time()
writing(hdf5_path)
print(time.time()-t1)
t1=time.time()
reading(hdf5_path)
print(time.time()-t1)

How to create sparse boolean mask in Pandas?

I have the following code for mask filtering of df:
for i, y in enumerate(cols):
    dfm = df[y].str.contains(s)
    mask = dfm if i == 0 else np.column_stack((mask, dfm))
df is not sparse, but the resulting filter mask is sparse.
Storing the mask as a full boolean array consumes a lot of memory for a large dataframe (50 million rows * 100 columns).
So, since the mask is very sparse (0.1% is True), I am wondering if there is a way to use a sparse boolean mask instead of a dense array mask in order to reduce the memory load...
I could not find a solution, even though Pandas already has sparse arrays;
it is not clear how to use them for storing and applying the mask,
i.e.
mask_sparse = pd.SparseArray(mask)
EDIT 2: Clarification of the question:
Can we get the filter result mask directly into a sparse array,
without manipulating the full dense array?
You can create sparse dataframes easily, but there is one major gotcha!
Consider the following dataframe df and its memory footprint:
# 10,000 cells with 1% ones and 99% zeros
df = pd.DataFrame(np.random.choice((0, 1), size=(10000, 1000), p=(.99, .01)))
df.memory_usage().sum()
80000080
Let's try to sparsify
df_sparse = df.to_sparse()
df_sparse.memory_usage().sum()
80000080
Hmm, that didn't do anything. That's because we need to specify the value that acts as the majority placeholder. Let's see:
df_sparse_2 = df.to_sparse(1)
df_sparse_2.memory_usage().sum()
79196744
And
df_sparse_3 = df.to_sparse(0)
df_sparse_3.memory_usage().sum()
803416
That's better. Make sure to specify the placeholder (fill) value.
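Note that DataFrame.to_sparse was removed in pandas 1.0; on recent versions the same idea is expressed through a sparse dtype. A sketch, assuming a random boolean mask with a sparsity similar to the question's:
import numpy as np
import pandas as pd

# a mostly-False boolean mask, roughly 0.1% True
mask = pd.DataFrame(np.random.choice((False, True), size=(10000, 1000), p=(.999, .001)))
sparse_mask = mask.astype(pd.SparseDtype(bool, fill_value=False))
print(mask.memory_usage().sum())         # dense footprint
print(sparse_mask.memory_usage().sum())  # much smaller: only the True positions are stored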