Weird copy-on-write behavior and rss memory usage of numpy.zeros() - numpy

I know np.zeros() does not take up RAM at allocation time. But even if I tried to modify some entries of np.zeros(), I found that the memory usage, as measured by psutil.Process().memory_info().rss, was still showing as zero., why? Shouldn't the memory usage increase at once like a page size when I accessed the array data? What happened inside?
Here is how I measure the memory usage of np.zeros():
>>> N = 10**8
>>> process = psutil.Process()
>>>
>>> start = process.memory_info()
>>> a = np.zeros(N, dtype=np.float64) # should take up around 8e8 bytes
>>> end = process.memory_info()
>>> print(f"""rss {end.rss-start.rss},
... vms {end.vms-start.vms:.2e},
... data {end.data-start.data:.2e}""")
Output
rss 0,
vms 8.00e+08,
data 8.00e+08
Initially, this numpy array a took up 0 rss because it called calloc() (See this SO post for more detail about calloc()) and did not touch the allocated memory. No problem.
However, when I wrote to the array, I expected the process triggered a page fault and began to take up some amount of rss memory. I wrote the first n=10**3 entries, rss did not change, still 0:
>>> n = 10**3
>>> a = np.zeros(N, dtype=np.float64)
>>> start = process.memory_info()
>>> for i in range(33790):
... a[i] = 1.
>>> end = process.memory_info()
>>> print(f"rss {end.rss-start.rss}")
rss 0
Until some point, the rss began to change:
>>> n = 10**4 # use a bigger n now
>>> a = np.zeros(N, dtype=np.float64)
>>> start = process.memory_info()
>>> for i in range(33790):
... a[i] = 1.
>>> end = process.memory_info()
>>> print(f"rss {end.rss-start.rss}")
rss 536576
(I also noticed that the threshold of n making rss be non-zero changed when new variables being added to the current environment. Don't know why.)
Finally, when the whole np.zeros() was modified, the rss got close to 8e8 (but not exactly):
>>> n = N # the whole array size
>>> a = np.zeros(N, dtype=np.float64)
>>> start = process.memory_info()
>>> for i in range(33790):
... a[i] = 1.
>>> end = process.memory_info()
>>> print(f"rss {end.rss-start.rss}")
rss 7.9992e+08
Shouldn't the memory usage rss increase like a page size as soon as I accessed the array data?
Shouldn't the memory usage rss be equal to or above 8e8 bytes when I modified the whole array?
Could anyone clarify on these things?
(Optional) I am confused about how memory usage measurements like rss work, the right time to use it. Could anyone provide some good practices of measuring memory usage or references on it?
Thank you for sharing your knowledge.

Related

Show class probabilities from Numpy array

I've had a look through and I don't think stack has an answer for this, I am fairly new at this though any help is appreciated.
I'm using an AWS Sagemaker endpoint to return a png mask and I'm trying to display the probability as a whole of each class.
So first stab does this:
np.set_printoptions(threshold=np.inf)
pred_map = np.argmax(mask, axis=0)
non_zero_mask = pred_map[pred_map != 0]) # get everything but background
# print(np.bincount(pred_map[pred_map != 0]).argmax()) # Ignore this line as it just shows the most probable
num_classes = 6
plt.imshow(pred_map, vmin=0, vmax=num_classes-1, cmap='jet')
plt.show()
As you can see I'm removing the background pixels, now I need to show class 1,2,3,4,5 have X probability based on the number of pixels they occupy - I'm unsure if I'll reinvent the wheel by simply taking the total number of elements from the original mask then looping and counting each pixel/class number etc - are there inbuilt methods for this please?
Update:
So after typing this out had a little think and reworded some of searches and came across this.
unique_elements, counts_elements = np.unique(pred_map[pred_map != 0], return_counts=True)
print(np.asarray((unique_elements, counts_elements)))
#[[ 2 3]
#[87430 2131]]
So then I'd just calculate the % based on this or is there a better way? For example I'd do
87430 / 89561(total number of pixels in the mask) * 100
Giving 2 in this case a 97% probability.
Update for Joe's comment below:
rec = Record()
recordio = mx.recordio.MXRecordIO(results_file, 'r')
protobuf = rec.ParseFromString(recordio.read())
values = list(rec.features["target"].float32_tensor.values)
shape = list(rec.features["shape"].int32_tensor.values)
shape = np.squeeze(shape)
mask = np.reshape(np.array(values), shape)
mask = np.squeeze(mask, axis=0)
My first thought was to use np.digitize and write a nice solution.
But then I realized how you can hack it in 10 lines:
import numpy as np
import matplotlib.pyplot as plt
size = (10, 10)
x = np.random.randint(0, 7, size) # your classes, seven excluded.
# empty array, filled with mask and number of occurrences.
x_filled = np.zeros_like(x)
for i in range(1, 7):
mask = x == i
count_mask = np.count_nonzero(mask)
x_filled[mask] = count_mask
print(x_filled)
plt.imshow(x_filled)
plt.colorbar()
plt.show()
I am not sure about the axis convention with imshow
at the moment, you might have to flip the y axis so up is up.
SageMaker does not provide in-built methods for this.

Efficient way of storing 1TB of random data with Zarr

I'd like to store 1TB of random data backed by a zarr on disk array. Currently, I am doing something like the following:
import numpy as np
import zarr
from numcodecs import Blosc
compressor = Blosc(cname='lz4', clevel=5, shuffle=Blosc.BITSHUFFLE)
store = zarr.DirectoryStore('TB1.zarr')
root = zarr.group(store)
TB1 = root.zeros('data',
shape=(1_000_000, 1_000_000),
chunks=(20_000, 5_000),
compressor=compressor,
dtype='|i2')
for i in range(1_000_000):
TB1[i, :1_000_000] = np.random.randint(0, 3, size=1_000_000, dtype='|i2')
This is going to take some time -- I know things could probably be improved if I wasn't always generating 1_000_000 random numbers and instead reusing the array but I'd like some more randomness for now. Is there a better way to go about building this random dataset ?
Update 1
Using bigger numpy blocks speeds things up a bit:
for i in range(0, 1_000_000, 100_000):
TB1[i:i+100_000, :1_000_000] = np.random.randint(0, 3, size=(100_000, 1_000_000), dtype='|i2')
I'd recommend using Dask Array which will enable parallel computation of random numbers and storage, e.g.:
import zarr
from numcodecs import Blosc
import dask.array as da
shape = 1_000_000, 1_000_000
dtype = 'i2'
chunks = 20_000, 5_000
compressor = Blosc(cname='lz4', clevel=5, shuffle=Blosc.BITSHUFFLE)
# set up zarr array to store data
store = zarr.DirectoryStore('TB1.zarr')
root = zarr.group(store)
TB1 = root.zeros('data',
shape=shape,
chunks=chunks,
compressor=compressor,
dtype=dtype)
# set up a dask array with random numbers
d = da.random.randint(0, 3, size=shape, dtype=dtype, chunks=chunks)
# compute and store the random numbers
d.store(TB1, lock=False)
By default Dask will compute using all available local cores, but can also be configured to run on a cluster via the Distributed package.

Slicing a 300MB CuPy array is ~5x slower than NumPy

My code involves slicing into 432x432x400 arrays a total of ~10 million times to generate batches of data for neural network training. As these are fairly large arrays (92 million data points/300MB), I was hoping to speed this up using CuPy (and maybe even speed training up by generating data on the same GPU as training), but found it actually made the code about 5x slower.
Is this expected behaviour due to CuPy overheads or am I missing something?
Code to reproduce:
import cupy as cp
import numpy as np
import timeit
cp_arr = cp.zeros((432, 432, 400), dtype=cp.float32)
np_arr = np.zeros((432, 432, 400), dtype=np.float32)
# numbers below are representative of my code
cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120]'
np_code = 'arr2 = np_arr[100:120, 100:120, 100:120]'
timeit.timeit(cp_code, number=8192*4, globals=globals()) # prints 0.122
timeit.timeit(np_code, number=8192*4, globals=globals()) # prints 0.027
Setup:
GPU: NVIDIA Quadro P4000
CuPy Version: 7.3.0
OS: CentOS Linux 7
CUDA Version: 10.1
cuDNN Version: 7.6.5
I also confirmed that the slicing is about 5x times slower in cupy, while there's a more precise way to measure the time (see e.g. https://github.com/cupy/cupy/pull/2740).
The size of the array does not matter because slice operations do not copy the data but create views. The result with the following is similar.
cp_arr = cp.zeros((4, 4, 4), dtype=cp.float32)
cp_code = 'arr2 = cp_arr[1:3, 1:3, 1:3]'
It is natural that "take slice then send it to GPU" is faster because it reduces the bytes to be transferred. Consider doing so if the first preprocess is the slicing.
Slicing in NumPy and CuPy is not actually copying the data anywhere, but simply returning a new array where the data is the same but with the its pointer being offset to the first element of the new slice and an adjusted shape. Note below how both the original array and the slice have the same strides:
In [1]: import cupy as cp
In [2]: a = cp.zeros((432, 432, 400), dtype=cp.float32)
In [3]: b = a[100:120, 100:120, 100:120]
In [4]: a.strides
Out[4]: (691200, 1600, 4)
In [5]: b.strides
Out[5]: (691200, 1600, 4)
The same above could be verified by replacing CuPy with NumPy.
If you want to time the actual slicing operation, the most reliable way of doing this would be to add a .copy() to each operation, thus enforcing the memory accessing/copying:
cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120].copy()' # 0.771 seconds
np_code = 'arr2 = np_arr[100:120, 100:120, 100:120].copy()' # 0.154 seconds
Unfortunately, for the case above the memory pattern is bad for GPUs as the small chunks won't be able to saturate memory channels, thus it's still slower than NumPy. However, CuPy can be much faster if the chunks are able to get close to memory channel saturation, for example:
cp_code = 'arr2 = cp_arr[:, 100:120, 100:120].copy()' # 0.786 seconds
np_code = 'arr2 = np_arr[:, 100:120, 100:120].copy()' # 2.911 seconds

random.multivariate_normal on a dask array?

I've been struggling to find a way to get this calc that works for a dask workflow.
I have code that uses np.random.mulivariate_normal function and while many of these types are available to us on dask array it seems this one it not. Sooo.... I attempted to create my own based on an example provided in the dask documentation.
Here is my attempt which is giving errors that I am having difficulty understanding. I also provided random input variables to make it easy to replicate:
import numpy as np
from dask.distributed import Client
import dask.array as da
def mvn(mu, sigma, n, blocksize):
chunks = ((blocksize,) * (n // blocksize),
(blocksize,) * (n // blocksize))
name = 'mvn' # unique identifier
dsk = {(name, i, j): (np.random.multivariate_normal(mu,sigma, blocksize))
if i == j else
(np.zeros, (blocksize, blocksize))
for i in range(n // blocksize)
for j in range(n // blocksize)}
dtype = np.random.multivariate_normal(0).dtype # take dtype default from numpy
return da.Array(dsk, name, chunks, dtype)
n = 10000
A = da.random.normal(0, 1, size=(n,n), chunks=(1000, 1000))
sigma = da.dot(A,A.transpose())
mu = 4.0*da.ones(n, chunks = 1000)
R = da.numpy.random.mvn(mu, sigma, n, chunks=(100))
Any suggestions or am I so far off the mark here that I should abandon all hope? Thanks!
If you have a cluster to run this on, you can use my answer from this post, copied here for refrence:
An work arround for now, is to use a cholesky decomposition. Note that any covariance matrix C can be expressed as C=G*G'. It then follows that x = G'*y is correlated as specified in C if y is standard normal (see this excellent post on StackExchange Mathematic). In code:
Numpy
n_dim =4
size = 100000
A = np.random.randn(n_dim, n_dim)
covm = A.dot(A.T)
x= np.random.multivariate_normal(size=size, mean=np.zeros(len(covm)),cov=covm)
## verify numpys covariance is correct
np.cov(x, rowvar=False)
covm
Dask
## create covariance matrix
A = da.random.standard_normal(size=(n_dim, n_dim),chunks=(2,2))
covm = A.dot(A.T)
## get cholesky decomp
L = da.linalg.cholesky(covm, lower=True)
## drawn standard normal
sn= da.random.standard_normal(size=(size, n_dim),chunks=(100,100))
## correct for correlation
x =L.dot(sn.T)
x.shape
## verify
covm.compute()
da.cov(x, rowvar=True).compute()
This answer can be fleshed out, but I imagine you would have an easier time using dask's delayed, da.from_delayed and da.*stack.
One immediate problem I see with what you have: with np.random.multivariate_normal(mu,sigma, blocksize) you are directly calling the function, instead of making the spec. You probably wanted (np.random.multivariate_normal, mu,sigma, blocksize). This shows that working with raw dask dictionaries can be tricky!

How to merge very large numpy arrays?

I will have many Numpy arrays stored in npz files, which are being saved using savez_compressed function.
I am splitting the information in many arrays because, if not, the functions I am using crash due to memory issues. The data is not sparse.
I will need to joint all that info in one unique array (to be able to process it with some routines), and store it into disk (to process it many times with diffente parameters).
Arrays won't fit into RAM+swap memory.
How to merge them into an unique array and save it to a disk?
I suspect that I should use mmap_mode, but I do not realize exactly how. Also, I imagine that can be some performance issues if I do not reserve contiguous disk space at first.
I have read this post but I still cannot realize how to do it.
EDIT
Clarification: I have made many functions to process similar data, some of them require an array as argument. In some cases I could pass them only part of this large array by using slicing. But it is still important to have all the info. in such an array.
This is because of the following: The arrays contain information (from physical simulations) time ordered. Among the argument of the functions, the user can set the initial and last time to process. Also, he/she can set the size of the processing chunk (which is important because this affect to the performance but allowed chunk size depend on the computational resources). Because of this, I cannot store the data as separated chunks.
The way in which this particular array (the one I am trying to create) is built is not important while it works.
You should be able to load chunk by chunk on a np.memap array:
import numpy as np
data_files = ['file1.npz', 'file2.npz2', ...]
# If you do not know the final size beforehand you need to
# go through the chunks once first to check their sizes
rows = 0
cols = None
dtype = None
for data_file in data_files:
with np.load(data_file) as data:
chunk = data['array']
rows += chunk.shape[0]
cols = chunk.shape[1]
dtype = chunk.dtype
# Once the size is know create memmap and write chunks
merged = np.memmap('merged.buffer', dtype=dtype, mode='w+', shape=(rows, cols))
idx = 0
for data_file in data_files:
with np.load(data_file) as data:
chunk = data['array']
merged[idx:idx + len(chunk)] = chunk
idx += len(chunk)
However, as pointed out in the comments working across a dimension which is not the fastest one will be very slow.
This would be an example how to write a 90GB of easily compressible data to disk. The most important points are mentioned here https://stackoverflow.com/a/48405220/4045774
The write/read speed should be in the range of (300 MB/s,500MB/s) on a nomal HDD.
Example
import numpy as np
import tables #register blosc
import h5py as h5
import h5py_cache as h5c
import time
def read_the_arrays():
#Easily compressable data
#A lot smaller than your actual array, I do not have that much RAM
return np.arange(10*int(15E3)).reshape(10,int(15E3))
def writing(hdf5_path):
# As we are writing whole chunks here this isn't realy needed,
# if you forget to set a large enough chunk-cache-size when not writing or reading
# whole chunks, the performance will be extremely bad. (chunks can only be read or written as a whole)
f = h5c.File(hdf5_path, 'w',chunk_cache_mem_size=1024**2*1000) #1000 MB cache size
dset = f.create_dataset("your_data", shape=(int(15E5),int(15E3)),dtype=np.float32,chunks=(10000,100),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
#Lets write to the dataset
for i in range(0,int(15E5),10):
dset[i:i+10,:]=read_the_arrays()
f.close()
def reading(hdf5_path):
f = h5c.File(hdf5_path, 'r',chunk_cache_mem_size=1024**2*1000) #1000 MB cache size
dset = f["your_data"]
#Read chunks
for i in range(0,int(15E3),10):
data=np.copy(dset[:,i:i+10])
f.close()
hdf5_path='Test.h5'
t1=time.time()
writing(hdf5_path)
print(time.time()-t1)
t1=time.time()
reading(hdf5_path)
print(time.time()-t1)