MaybeEncodingError when returning large arrays with pool.starmap - python-multiprocessing

I'm getting a MaybeEncodingError when using multiprocessing once the returned list becomes too large.
I have a function that takes an image as input and produces a list of arrays as a result. Naturally, data parallelization works magic in this case. I wrote a wrapper to run it in parallel, but I start receiving the MaybeEncodingError once memory use passes some threshold, at the point where the results need to be merged with pool.join(). I decided to go with Pool because the order of processing is very important.
from multiprocessing import Pool
def pool_starmap(function, input_list_tuple, processes=5):
    with Pool(processes=processes) as pool:
        results = pool.starmap(function, input_list_tuple)
        pool.close()
        pool.join()
    return results
results = pool_starmap(function, input_list_tuple, processes=2)
I run the code on the server with 80 cores and 512GB of RAM. The whole code rarely exceeds 30GB and yet it breaks only for pool. How do I allow for returning the list of large arrays with multiprocessing but also preserving the order of execution?
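One possible workaround, sketched here under the assumption that the error comes from the size of the pickled return value travelling back to the parent process: let each worker write its result to disk and return only the file path. The helper names (run_and_dump, pool_starmap_to_disk, load_results) and the file naming scheme are hypothetical and not part of multiprocessing. Because starmap returns results in input order, the list of paths keeps that order too.

```python
import os
import pickle
from multiprocessing import Pool


def run_and_dump(function, args, out_dir, index):
    """Run function(*args) inside the worker and pickle the result to a file."""
    result = function(*args)
    path = os.path.join(out_dir, f"result_{index:06d}.pkl")
    with open(path, "wb") as fh:
        pickle.dump(result, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return path  # only this short string is sent back through the pool


def pool_starmap_to_disk(function, input_list_tuple, out_dir, processes=5):
    """Like pool.starmap, but returns result file paths in input order."""
    tasks = [
        (function, args, out_dir, index)
        for index, args in enumerate(input_list_tuple)
    ]
    with Pool(processes=processes) as pool:
        return pool.starmap(run_and_dump, tasks)


def load_results(paths):
    """Yield the stored results one at a time, in the order of paths."""
    for path in paths:
        with open(path, "rb") as fh:
            yield pickle.load(fh)
```

A call such as pool_starmap_to_disk(function, input_list_tuple, out_dir, processes=2) would replace the original wrapper, provided function is defined at module level so it can be pickled. A lighter option that may be worth trying first is passing chunksize=1 to pool.starmap, so that each result is sent back as its own message rather than grouped with others.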

Related

Unmanaged memory jamming cluster during dask's merge_asof method

I am trying to merge large dataframes using dask.dataframe.multi.merge_asof, but I am running into issues with accumulating unmanaged memory on the cluster.
I have boiled the problem down to the following. It essentially just creates some sample data with timestamps using dask, converts it to pandas dataframes, and then runs dask's merge_asof implementation. It quickly exceeds the memory of the workers with unmanaged memory, and eventually the cluster just gets stuck (it does not crash, it just stops doing anything).
It looks like this:
from distributed import Client, LocalCluster
import pandas as pd
import dask
import dask.dataframe
client = Client(LocalCluster(n_workers=40))
# make sample datasets, and convert them to pandas dataframes
left = dask.datasets.timeseries(start='2021-01-01', end='2022-12-31', partition_freq="3d").reset_index().compute()
right = dask.datasets.timeseries(start='2021-01-02', end='2023-01-01', partition_freq="3d").reset_index().compute()
left = dask.dataframe.from_pandas(left, npartitions=250)
right = dask.dataframe.from_pandas(right, npartitions=250)
# Alternative to above block (no detour via pandas)
#left = dask.datasets.timeseries(start='2021-01-01', end='2022-12-31', partition_freq="3d").reset_index()
#right = dask.datasets.timeseries(start='2021-01-02', end='2023-01-01', partition_freq="3d").reset_index()
# (dask crashes on datetime, so convert to int first)
left['timestamp'] = left['timestamp'].values.astype('int64')
right['timestamp'] = right['timestamp'].values.astype('int64')
dask.dataframe.multi.merge_asof(
    left=left.sort_values(by='timestamp'),
    right=right.sort_values(by='timestamp'),
    tolerance=pd.to_timedelta(10000000, unit='seconds').delta,
    direction='forward',
    left_on='timestamp',
    right_on='timestamp',
).compute()
Note that I deliberately call compute() on the sample data to get pandas dataframes, since in the final use case I also need to start from pandas dataframes, so I need to "model" that step as well.
Interestingly, not making this detour works much better. If I comment out the first data creation block and use the one labelled as an alternative instead, the merge succeeds (there is still some unmanaged memory, but not enough to stop the cluster).
While I suspect that occupying machine memory by the local pandas dataframe may increase the unmanaged memory (?) the data size is far from taking all of the machine's RAM (200 GiB). So my questions are
Why does the cluster hang despite memory being available on the machine?
Why does going through intermediate pandas dataframes affect memory management during merge_asof?
I have tried various things with regard to garbage collection, thinking there might be some pandas data sticking around in memory while merge_asof is executed, but to no avail. I have also tried garbage collection on the workers, as well as automatic and manual memory trimming, as described in this blog post and this video, neither of which showed any effect worth mentioning.
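For readers unfamiliar with the operation, here is a minimal pure-Python sketch of what a forward as-of match with a tolerance does on sorted integer keys. It is an illustration only, not dask's or pandas' implementation; the function name asof_forward is hypothetical, and treating the tolerance as inclusive is an assumption. With int64 nanosecond timestamps, the tolerance of 10,000,000 seconds in the question corresponds to 10**16.

```python
from bisect import bisect_left


def asof_forward(left_keys, right_keys, tolerance):
    """For each left key, pick the first right key >= it, if within tolerance.

    Both lists must be sorted in ascending order. Unmatched keys map to None.
    """
    matches = []
    for key in left_keys:
        pos = bisect_left(right_keys, key)  # first right key that is >= key
        if pos < len(right_keys) and right_keys[pos] - key <= tolerance:
            matches.append(right_keys[pos])
        else:
            matches.append(None)
    return matches


# Tolerance from the question, expressed in integer nanoseconds.
tolerance_ns = 10_000_000 * 10**9
print(asof_forward([1, 5, 10], [2, 3, 7, 20], tolerance=3))
```

The requirement that both sides be sorted on the key is why the question calls sort_values before the merge.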

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for awhile I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple Python function with a Dask-delayed decorator on it. In my real use case this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes in when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from?? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory
Please help me understand this behavior and, hopefully, implement a more memory-safe approach
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB)
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask
@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )
    # Simulate a "long" computation
    time.sleep(5)
    return df.iloc[[-1]]  # return the last row
if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)
    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))
    # compute
    r = dask.compute(results)[0]
    print(pd.concat(r, ignore_index=True))
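The numbers in the question can be checked with some quick arithmetic. The sketch below assumes 8-byte values in both columns, reads memory_limit="4GB" as decimal gigabytes (which matches the 3.73 GiB in the warning), and uses what I believe are the default worker memory fractions in distributed (target 0.60, spill 0.70, pause 0.80, terminate 0.95). Treat those fractions as an assumption and check your own configuration.

```python
GIB = 2**30


def frame_bytes(n_rows, n_cols, itemsize=8):
    """Bytes held by n_cols columns of n_rows 8-byte values (index excluded)."""
    return int(n_rows) * n_cols * itemsize


frame = frame_bytes(1.5e8, 2)  # the DataFrame built in do_pandas_thing
limit = 4 * 10**9              # memory_limit="4GB", decimal gigabytes

# Assumed default fractions of the limit at which a worker reacts.
thresholds = {"target": 0.60, "spill": 0.70, "pause": 0.80, "terminate": 0.95}

print(f"frame: {frame / GIB:.3f} GiB, limit: {limit / GIB:.3f} GiB")
for name, fraction in thresholds.items():
    print(f"{name:>9}: {fraction * limit / GIB:.2f} GiB")
```

Under these assumptions the finished dataframe alone sits at about 60% of the limit, so any temporary copy made while it is being built (for example the two np.arange arrays before pandas takes them over) could plausibly push the process past the higher thresholds.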
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but it's a good idea to have workloads that are a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source

Partial calls to MPI_file_read_at_all in Fortran

I need to read a very big binary Fortran file in parallel with MPI.
The problem is that the file is so big that a single core cannot even hold total_file_size/ncpu in memory.
Therefore each CPU reads its part of the file sequentially, one chunk at a time.
I have the following pseudo code:
ir_start = start reading position for that cpu
ir_stop = end reading position for that cpu
CALL MPI_FILE_OPEN(world_comm, filint, MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)
size = size of the chunk to read
DO ir = ir_start, ir_stop
   offset = data_size * (ir - 1)
   CALL MPI_FILE_READ_AT(fh, offset, data, size, MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr)
ENDDO
This works well.
However, I was told that MPI_FILE_READ_AT_ALL (the collective version) would be much faster.
The problem is that the collective version only works if all the cores read the same number of chunks (i.e. ir_stop-ir_start is the same on every core).
This is not always the case.
As a result, for some numbers of cores the code just hangs, because it is waiting for the last CPU, which never enters the final loop iteration.
Is there a clever way to make this work with MPI_FILE_READ_AT_ALL for an arbitrary number of cores?
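A common pattern for this situation is to make every rank execute the same number of loop iterations, with ranks that have run out of chunks still taking part in the collective call but requesting zero elements. The sketch below shows only that bookkeeping in plain Python, with no MPI calls; the function name plan_collective_reads and the contiguous split of chunks across ranks are assumptions for illustration. In the Fortran code, the shared loop count could be agreed on with MPI_ALLREDUCE and MPI_MAX, and this relies on the MPI library accepting a count of zero in MPI_FILE_READ_AT_ALL.

```python
def plan_collective_reads(n_chunks, n_ranks, chunk_size):
    """Return one list of (chunk_index, count) per rank, all of equal length.

    Chunk indices are 1-based, like ir in the pseudocode. A count of 0 marks
    an iteration where the rank joins the collective call but reads nothing.
    """
    base, extra = divmod(n_chunks, n_ranks)
    n_iter = base + (1 if extra else 0)  # loop count shared by every rank
    plans = []
    first = 1
    for rank in range(n_ranks):
        owned = base + (1 if rank < extra else 0)
        steps = []
        for step in range(n_iter):
            if step < owned:
                steps.append((first + step, chunk_size))
            else:
                steps.append((1, 0))  # any valid offset works, nothing is read
        plans.append(steps)
        first += owned
    return plans


for rank, steps in enumerate(plan_collective_reads(7, 3, 100)):
    print(rank, steps)
```

Each rank would then loop over its own list and call the collective read once per entry, so no rank is left waiting for a partner that never arrives.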

Memory mapped file for numpy arrays

I need to read in parts of a huge numpy array stored in a memory mapped file, process the data and repeat for another part of the array. The whole numpy array takes up around 50 GB and my machine has 8 GB of RAM.
I initially created the memory mapped file using numpy.memmap by reading in a lot of smaller files and processing their data and then writing the processed data to the memmap file. During the creation of the memmap file, I had no memory issues (I was using memmap.flush() periodically). Here's how I create the memory mapped file:
mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))
for i1 in np.arange(numFiles):
    auxData = load_data_from(file[i1])
    mmapData[i1, :] = auxData
    mmapData.flush()  # Do this every 10 iterations or so
However, when I try to access small portions (<10 MB) of the memmap file, it floods my whole RAM as soon as the memmap object is created. The machine slows down drastically and I can't do anything. Here's how I try to read in the data from the memory mapped file:
mmapData = np.memmap(mmapFile, mode='r',shape=(large_no1,large_no2))
aux1 = mmapData[5, 1:int(1e7)]
I thought using mmap or numpy.memmap should allow me to access parts of massive arrays without trying to load the whole thing to memory. What am I missing?
Am I using the wrong tool to access parts of a large numpy array (> 20 GB) stored on disk?
Could it be that you're looking at virtual, rather than physical memory consumption, and the slowdown is coming from something else?
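To illustrate the expected behaviour, here is a small sketch using only the standard library's mmap module, which numpy.memmap is built on. It assumes a row-major matrix of single bytes (numpy.memmap's default dtype is uint8); the helper name read_row_slice is hypothetical. Slicing the map copies only the requested byte range into a Python object, while the rest of the file stays on disk until its pages are touched.

```python
import mmap
import os
import tempfile


def read_row_slice(path, n_cols, row, start, stop):
    """Return bytes [start, stop) of one row of a row-major uint8 matrix file."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = row * n_cols  # byte position where the row begins
            return mm[offset + start:offset + stop]  # copies this slice only


# Build a tiny 8 x 16 example file in which every byte equals its row number.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fh:
    for row in range(8):
        fh.write(bytes([row]) * 16)

print(read_row_slice(demo_path, n_cols=16, row=5, start=1, stop=4))
os.remove(demo_path)
```

If memory still fills up with the numpy version, it may be worth checking that the shape and dtype passed when reading match those used when the file was written, and whether the figure being watched is resident or virtual memory.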

Parallelism in (I)Python with large blocks of data

I've been toiling with threads and processes for a while now to try to speed up my very parallel job in IPython. I'm not sure how much detail about the function I'm calling is useful, so here's a bash but ask if you need more.
My function's call signature looks like
def intersplit_array(ob,er,nl,m,mi,t,ti,dmax,n0=6,steps=50):
Basically, ob, er and nl are parameters for observed values, and m, mi, t, ti and dmax are parameters that represent models against which the observations will be compared. (n0 and steps are fixed numerical parameters for the function.) The function loops through all the models in m and, using associated information in mi, t, ti and dmax, calculates a probability that this model matches. Note that m is quite big: it's a list of about 700 000 22x3 NumPy arrays. mi and dmax are of similar sizes. If relevant, my normal IPython instance uses about 25% of system memory in top: 4GB of my 16GB of RAM.
I've tried to parallelize this in two ways. First, I tried to use the parallel_map function given over at the SciPy Cookbook. I made the call
P = parallel_map(lambda i: intersplit_array(ob,er,nl,m[i+1],mi[i:i+2],t[i+1],ti[i:i+2],dmax[i+1]), range(1,len(m)-1))
which runs, and provides the correct answer. Without the parallel_ part, this is just the result of applying the function one by one to each element. But this is slower than using a single core. I guess this is related to the Global Interpreter Lock?
Second, I tried to use a Pool from multiprocessing. I initialized a pool with
p = multiprocessing.Pool(6)
and then tried to call my function with
P = p.map(lambda i: intersplit_array(ob,er,nl,m[i+1],mi[i:i+2],t[i+1],ti[i:i+2],dmax[i+1]), range(1,len(m)-1))
First, I get an error.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
Having a look in top, I then see all the extra ipython processes, each of which is apparently taking up 25% of RAM (which can't be so, because I've still got 4GB free) and using 0% CPU. I presume it isn't doing anything. I can't use IPython, either. I tried Ctrl-C for a while, but gave up once I got past the 300th pool worker.
Does it work non-interactively?
multiprocessing doesn't play well interactively, because of the way it splits processes. This is also why you had trouble killing it because it spawned so many processes. You would have to keep track of the master process to cancel it.
From the documentation:
Note
Functionality within this package requires that the __main__ module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples will not work in the interactive interpreter.
...
If you try this it will actually output full tracebacks interleaved in a semi-random fashion, and then you may have to stop the master process somehow.
The best solution is probably to just run it as a script from the command line. Alternatively, IPython has its own system for parallel computing, but I've never used it.
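The PicklingError in the traceback is consistent with the lambda itself: Pool.map has to pickle the callable to send it to the workers, and a lambda has no importable name for pickle to refer to. A minimal sketch of the usual workaround follows, assuming Python 3 and a script run from the command line. score_model is a hypothetical stand-in for intersplit_array, and is_picklable and parallel_scores are illustrative helpers.

```python
import pickle
from functools import partial
from multiprocessing import Pool


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def score_model(scale, i):
    """Hypothetical stand-in for the real per-model computation."""
    return scale * i


def parallel_scores(scale, indices, processes=2):
    """Map a named module-level function over indices with fixed arguments."""
    worker = partial(score_model, scale)  # refers to a named function
    with Pool(processes) as pool:
        return pool.map(worker, indices)
```

In a script, parallel_scores would be called under an if __name__ == "__main__": guard, in line with the documentation note quoted above. Note that everything bound with partial is pickled and shipped to the workers, so for data as large as m it may be cheaper to let workers inherit it as a module-level global on platforms that fork.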