Numpy.dot hangs my program, I assume it is a memory problem - numpy

I have two numpy arrays: A with shape (512,) and B with shape (3000, 512). One of my functions calls
C = np.dot(B, A) and my program hangs without printing any error.
I am on Python 3.7.3 and numpy 1.16.2.
The same code runs fine if I call c = np.dot(B, A) manually with suitable input, or if the length of B is around 50.
I don't know what the difference is between the two ways of calling it.

I found the answer. It was caused by the process memory limit. My program was already taking 20 GB of RAM while running, and when numpy needed more memory for its work the system hung without any error or warning. When I called the function manually, it ran in another process and got enough RAM to do its work.
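If you hit something like this, one way to confirm that it really is memory pressure is to log process and system memory right before the call. A minimal sketch, assuming the psutil package is available (it is not part of the original setup):

import numpy as np
import psutil

def checked_dot(B, A):
    proc = psutil.Process()
    # Log how much memory this process already holds and how much the system
    # has left before asking numpy to allocate the result.
    print(f"process RSS: {proc.memory_info().rss / 1e9:.1f} GB, "
          f"system available: {psutil.virtual_memory().available / 1e9:.1f} GB")
    return np.dot(B, A)

A = np.random.rand(512)
B = np.random.rand(3000, 512)
C = checked_dot(B, A)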

When A comes first, you need to use the transpose of B. Interestingly, I did not have to change the shape of A. That does not seem consistent to me, but it works.
import numpy as np
A = np.array([i for i in range(512)])
B = np.random.rand(3000, 512)
C1 = B.dot(A)        # shape (3000,)
B = B.transpose()    # 512 rows, 3,000 columns
C2 = A.dot(B)        # shape (3000,)
C2 = C2.transpose()  # a no-op for a 1-D array, still shape (3000,)
print(np.all(np.equal(C1, C2)))  # verify that the result is the same

Related

How to set up a batched matrix multiplication in Numba with np.dot() using contiguous arrays

I am trying to speed up a batched matrix multiplication problem with numba, but it keeps telling me that it would be faster with contiguous arrays.
Note: I'm using numba version 0.55.1, and numpy version 1.21.5
Here's the problem:
import numpy as np
import numba as nb

@nb.njit(parallel=True)  # (decorator assumed; the function must be jit-compiled for the warning below to appear)
def numbaFastMatMult(mat, vec):
    result = np.zeros_like(vec)
    for n in nb.prange(vec.shape[0]):
        result[n, :] = np.dot(vec[n, :], mat[n, :, :])
    return result

D, N = 10, 1000
mat = np.random.normal(0, 1, (N, D, D))
vec = np.random.normal(0, 1, (N, D))
result = numbaFastMatMult(mat, vec)

n = 0  # pick one batch index to inspect
print(mat.data.contiguous)
print(vec.data.contiguous)
print(mat[n, :, :].data.contiguous)
print(vec[n, :].data.contiguous)
Clearly all the relevant data is contiguous (run the above code snippet and see the results of the print() calls).
But, when I run this code, I get the following warning:
NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 1d, C), array(float64, 2d, A))
result[n,:] = np.dot(vec[n,:], mat[n,:,:])
Two extra comments:
This is just a toy problem for replication. I'm actually using something with many more data points, so hoping this will speed up.
I think the "right" way to solve this is with np.tensordot. However, I want to understand what's going on for future reference (see the pure-NumPy sketch below). For example, this discussion addresses a similar issue, but as far as I can tell, doesn't directly address why the warning shows up.
I've tried adding a decorator:
nb.float64[:,::1](nb.float64[:,:,::1],nb.float64[:,::1]),
I've tried reordering the arrays so the batch index is first (n in the above code)
I've tried printing whether the "mat" variable is contiguous from inside the function
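For reference, here is a pure-NumPy version of the same batched product, useful for sanity-checking the kernel (a sketch; it uses np.einsum rather than np.tensordot, since a plain tensordot would also contract over the batch axis):

import numpy as np

D, N = 10, 1000
mat = np.random.normal(0, 1, (N, D, D))
vec = np.random.normal(0, 1, (N, D))

# reference[n, k] = sum_d vec[n, d] * mat[n, d, k], i.e. np.dot(vec[n], mat[n]) for each n
reference = np.einsum('nd,ndk->nk', vec, mat)

# Equivalent formulation with the batched matmul operator
reference2 = (vec[:, None, :] @ mat)[:, 0, :]
assert np.allclose(reference, reference2)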
I'll leave this up, but I figured it out:
Outside of a numba function:
mat[n,:,:].data.contiguous==True
but inside numba, mat[n,:,:] is no longer contiguous.
Changing my code above to np.dot(vec[n], mat[n]) removed the warning.
I'm making this the "correct" answer since it solved my problem. However, according to max9111's response, this behavior may be a bug!
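For completeness, a minimal sketch of the kernel with that change applied (same function as in the question, only the indexing differs):

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def numbaFastMatMult(mat, vec):
    result = np.zeros_like(vec)
    for n in nb.prange(vec.shape[0]):
        # mat[n] keeps its C-contiguous type inside numba, whereas
        # mat[n, :, :] is typed with the generic 'A' layout and triggers the warning.
        result[n, :] = np.dot(vec[n], mat[n])
    return result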

Why does pandas read_csv return TextFileReader even if iterator = False?

Environment:
python : 3.7.10.final.0,
python-bits : 64,
OS : Windows, 64GB mem,
OS-release : 10,
Version : 10.0.19041,
pandas : 1.2.4
I have a very simple read statement after a simple print statement and conditional to check that I do want to read the whole file...
%timeit csvDataFrame = pd.read_csv(fc.selected, sep=",", header='infer', na_values ='?', skiprows=csvSkipRows, dtype=desiDtypesDict, comment = '#', iterator=False)
where fc.selected is the full path to a CSV of ~2 million rows (524MB on disk); csvSkipRows = [] and desiDtypesDict is a dict of types for the columns in the dataset.
CSV reading works fine if I also add e.g. chunksize=10 and iterate on the result, so I am confident about the arguments. But when I try to read the whole file in at once:
%timeit tells me that it took ~6s/loop and that there were 7 loops
csvDataFrame is not a dataframe but a TextFileReader
even if iterator = False
In this environment, Python generally has no problem accessing huge amounts of memory so:
Why is this happening, and how do I read the whole CSV in at once?
Why is it happening? %timeit has interacted with pandas oddly*
How do you read the CSV all at once? Omit the %timeit
A fresh kernel and a minor modification later, we identified that the problem was due to putting %timeit before the read_csv.
Remove it and the total time is the same, but a single dataframe is now returned: 2684511 rows × 31 columns, as expected and desired.
Recommendation: always test with minimal code before inferring that there really is a problem.
* Exactly what the interaction is I don't know... maybe somebody else does. (A likely explanation: %timeit runs the statement in its own temporary scope, so the assignment inside it never reaches the notebook namespace, and csvDataFrame keeps whatever value it had from an earlier chunked run.)
The full cell code was:
# Read all at once.
print(readCSVAllAtOnce)
if readCSVAllAtOnce:
    %timeit csvDataFrame = pd.read_csv(fc.selected, sep=",", header='infer', na_values ='?', skiprows=csvSkipRows, dtype=desiDtypesDict, comment = '#', iterator=False, chunksize=None)
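If you want both the timing and the resulting dataframe, one option is to time the call yourself instead of using %timeit (a minimal sketch; fc.selected, csvSkipRows and desiDtypesDict are the objects from the question):

import time
import pandas as pd

start = time.perf_counter()
csvDataFrame = pd.read_csv(fc.selected, sep=",", header='infer', na_values='?',
                           skiprows=csvSkipRows, dtype=desiDtypesDict,
                           comment='#', iterator=False)
print(f"read_csv took {time.perf_counter() - start:.1f} s")
print(csvDataFrame.shape)  # the assignment persists, unlike inside %timeit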

pandas apply for performance

I have a pandas apply function that runs inference over a 10k-row CSV of strings
account messages
0 th_account Forgot to tell you Evan went to sleep a little...
1 th_account Hey I heard your buying a house I m getting ri...
2 th_account They re releasing a 16 MacBook
3 th_account 5 cups of coffee today I may break the record
4 th_account Apple Store Items in order W544414717 were del...
The function takes about 17 seconds to run.
I'm working on a text classifier and was wondering if there is a quicker way to write it
def _predict(messages):
    results = []
    for message in messages:
        message = vectorizer.transform([message])
        message = message.toarray()
        results.append(model.predict(message))
    return results

df["pred"] = _predict(df.messages.values)
the vectorizer is a TfidfVectorizer and model is a GaussianNB model from sklearn.
I need to loop through every message in the csv and perform a prediction to be shown in a new column.
You can try the built-in apply function in pandas. It tidies up the explicit Python loop, but under the hood it still calls your function once per row, so it will still be slow.
def _predict(message):
    """message is one entry of the messages column.
    Each entry yields one prediction.
    """
    message = vectorizer.transform([message])
    message = message.toarray()
    return model.predict(message)

df["pred"] = df["messages"].apply(_predict)
You can run the following code on a small sample to estimate the time:
df["messages"].head().apply(_predict)
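A much larger speedup usually comes from not predicting row by row at all: both TfidfVectorizer.transform and GaussianNB.predict accept a whole batch, so you can transform and predict the entire column at once (a sketch; it assumes the fitted vectorizer and model from the question, and that the dense 10k-row matrix fits in memory, since GaussianNB needs a dense array):

# Vectorize every message in one call, then predict the whole batch at once.
X = vectorizer.transform(df["messages"].tolist())  # sparse matrix, shape (n_rows, n_features)
df["pred"] = model.predict(X.toarray())            # GaussianNB requires a dense array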

How to merge very large numpy arrays?

I will have many Numpy arrays stored in npz files, which are saved using the savez_compressed function.
I am splitting the information in many arrays because, if not, the functions I am using crash due to memory issues. The data is not sparse.
I will need to join all that info into one single array (to be able to process it with some routines), and store it on disk (to process it many times with different parameters).
Arrays won't fit into RAM+swap memory.
How can I merge them into a single array and save it to disk?
I suspect that I should use mmap_mode, but I do not see exactly how. Also, I imagine there could be some performance issues if I do not reserve contiguous disk space first.
I have read this post but I still cannot figure out how to do it.
EDIT
Clarification: I have written many functions to process similar data, and some of them require an array as an argument. In some cases I could pass them only part of this large array by using slicing, but it is still important to have all the info in such an array.
This is because of the following: the arrays contain time-ordered information (from physical simulations). Among the arguments of the functions, the user can set the initial and final time to process. They can also set the size of the processing chunk (which is important because it affects performance, but the allowed chunk size depends on the computational resources). Because of this, I cannot store the data as separate chunks.
The way in which this particular array (the one I am trying to create) is built is not important, as long as it works.
You should be able to load them chunk by chunk into a np.memmap array:
import numpy as np

data_files = ['file1.npz', 'file2.npz', ...]

# If you do not know the final size beforehand you need to
# go through the chunks once first to check their sizes
rows = 0
cols = None
dtype = None
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        rows += chunk.shape[0]
        cols = chunk.shape[1]
        dtype = chunk.dtype

# Once the size is known, create the memmap and write the chunks
merged = np.memmap('merged.buffer', dtype=dtype, mode='w+', shape=(rows, cols))
idx = 0
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        merged[idx:idx + len(chunk)] = chunk
        idx += len(chunk)
However, as pointed out in the comments, working across a dimension which is not the fastest one will be very slow.
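Once the file is written, later runs can reopen it without loading everything into RAM (a small sketch as a follow-up to the snippet above; np.memmap does not store the dtype and shape, so you must pass the same dtype, rows and cols again):

import numpy as np

# Reopen the merged array read-only; only the slices you touch are paged in.
merged = np.memmap('merged.buffer', dtype=dtype, mode='r', shape=(rows, cols))
window = merged[1000:2000]  # process one time window without reading the whole file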
This would be an example of how to write 90 GB of easily compressible data to disk. The most important points are mentioned here: https://stackoverflow.com/a/48405220/4045774
The write/read speed should be in the range of 300-500 MB/s on a normal HDD.
Example
import numpy as np
import tables  # register blosc
import h5py as h5
import h5py_cache as h5c
import time

def read_the_arrays():
    # Easily compressible data
    # A lot smaller than your actual array, I do not have that much RAM
    return np.arange(10 * int(15E3)).reshape(10, int(15E3))

def writing(hdf5_path):
    # As we are writing whole chunks here this isn't really needed;
    # if you forget to set a large enough chunk-cache-size when not writing or reading
    # whole chunks, the performance will be extremely bad. (chunks can only be read or written as a whole)
    f = h5c.File(hdf5_path, 'w', chunk_cache_mem_size=1024**2 * 1000)  # 1000 MB cache size
    dset = f.create_dataset("your_data", shape=(int(15E5), int(15E3)), dtype=np.float32,
                            chunks=(10000, 100), compression=32001,
                            compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    # Let's write to the dataset
    for i in range(0, int(15E5), 10):
        dset[i:i + 10, :] = read_the_arrays()
    f.close()

def reading(hdf5_path):
    f = h5c.File(hdf5_path, 'r', chunk_cache_mem_size=1024**2 * 1000)  # 1000 MB cache size
    dset = f["your_data"]
    # Read chunks
    for i in range(0, int(15E3), 10):
        data = np.copy(dset[:, i:i + 10])
    f.close()

hdf5_path = 'Test.h5'
t1 = time.time()
writing(hdf5_path)
print(time.time() - t1)
t1 = time.time()
reading(hdf5_path)
print(time.time() - t1)

Tensorflow on shared GPUs: how to automatically select the one that is unused

I have access through ssh to a cluster of n GPUs. Tensorflow automatically gave them names gpu:0,...,gpu:(n-1).
Others have access too, and sometimes they take random GPUs.
I did not place any tf.device() explicitly because that is cumbersome, and even if I selected GPU number j, someone might already be on GPU number j, which would be problematic.
I would like to go through the GPUs' usage, find the first one that is unused, and use only that one.
I guess someone could parse the output of nvidia-smi with bash, get a variable i, and feed that variable i to the tensorflow script as the number of the GPU to use.
I have never seen any example of this. I imagine it is a pretty common problem. What would be the simplest way to do that? Is a pure tensorflow solution available?
I'm not aware of a pure-TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. However, the GPU memory pool is shared by all TensorFlow sessions within a process, so the Session config would be the wrong place to add it, and there's no mechanism for process-global config (but there should be, to also be able to configure the process-global Eigen threadpool). So you need to do it at the process level by using the CUDA_VISIBLE_DEVICES environment variable.
Something like this:
import subprocess, re

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""
    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # | 0 8734 C python 11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    rows = gpu_output.split("\n")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in rows:
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns the GPU with the least allocated memory."""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu
You can then put it in utils.py and set the GPU in your TensorFlow script before the first tensorflow import, i.e.
import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
An implementation along the lines of Yaroslav Bulatov's solution is available on https://github.com/bamos/setGPU.