Unaccountable Dask memory usage - pandas

I am digging into Dask and (mostly) feel comfortable with it. However I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple python function with a Dask-delayed decorator on it. In my real use-case scenario this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs, and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes in when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message "distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB", and shortly thereafter the script exits prematurely. Where is this memory usage coming from?? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory.
Please help me understand this behavior and, hopefully, implement a more memory-safe approach
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB)
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask
@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )

    # Simulate a "long" computation
    time.sleep(5)

    return df.iloc[[-1]]  # return the last row


if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)

    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))

    # compute
    r = dask.compute(results)[0]

    print(pd.concat(r, ignore_index=True))

I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but as a general rule it's a good idea to keep each workload to a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source
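For the specific example in the question, one way to keep the per-task peak smaller (a sketch, assuming the "black box" can drop its large intermediate before returning) is to copy out the row you need and delete the big frame before the long computation, so the worker's measured memory falls back toward the size of the result:

import gc
import time

import numpy as np
import pandas as pd
import dask


@dask.delayed
def do_pandas_thing(id):
    # Build the large intermediate frame, as in the question.
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})

    # Keep only the row we actually need, as an explicit copy so the
    # result does not reference the big frame's buffers.
    last_row = df.iloc[[-1]].copy()

    # Drop the large frame before the "long" computation so the worker
    # is not holding ~2.2 GB per task while it sleeps.
    del df
    gc.collect()

    time.sleep(5)
    return last_row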

Related

MaybeEncodingError when returning large arrays with pool.starmap

I'm experiencing a MaybeEncodingError when using multiprocessing when the return list becomes too large.
I have a function that takes an image as input and produces a list of arrays as a result. Naturally, data parallelization works magic in this case. I made a wrapper for running this in parallel; however, I start receiving the MaybeEncodingError after reaching some threshold in memory, when the results need to be merged with pool.join(). I decided to go with Pool since the order of processing is very important.
from multiprocessing import Pool

def pool_starmap(function, input_list_tuple, processes=5):
    with Pool(processes=processes) as pool:
        results = pool.starmap(function, input_list_tuple)
        pool.close()
        pool.join()
    return results

results = pool_starmap(function, input_list_tuple, processes=2)
I run the code on a server with 80 cores and 512 GB of RAM. The whole code rarely exceeds 30 GB, and yet it breaks only with Pool. How do I allow for returning a list of large arrays with multiprocessing while also preserving the order of execution?
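One common workaround (a sketch, not from the original post; process_image and the file names are hypothetical) is to have each worker write its arrays to disk and return only the file path. pool.starmap returns results in the same order as its inputs, so ordering is preserved while the large payloads never pass through result pickling:

import os
import tempfile
from multiprocessing import Pool

import numpy as np


def process_image(image_path, out_dir):
    # Hypothetical stand-in for the real image function: produce a list
    # of arrays, save them to disk, and return only the file path.
    arrays = [np.random.rand(500, 500) for _ in range(3)]
    out_path = os.path.join(out_dir, os.path.basename(image_path) + ".npz")
    np.savez(out_path, *arrays)
    return out_path


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    inputs = [(f"img_{i}.png", out_dir) for i in range(10)]
    with Pool(processes=2) as pool:
        # starmap preserves input order, so paths[i] corresponds to inputs[i].
        paths = pool.starmap(process_image, inputs)
    # Load results one at a time, only when they are needed.
    first_result = np.load(paths[0])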

Unmanaged memory jamming cluster during dask's merge_asof method

I am trying to merge large dataframes using dask.dataframe.multi.merge_asof, but I am running into issues with accumulating unmanaged memory on the cluster.
I have boiled the problem down to the following. It essentially just creates some sample data with timestamps using dask, converts it to pandas dataframes, and then runs dask's merge_asof implementation. It quickly exceeds the memory of the workers with unmanaged memory, and eventually the cluster just gets stuck (it does not crash, it just stops doing anything).
It looks like this:
from distributed import Client, LocalCluster
import pandas as pd
import dask
import dask.dataframe

client = Client(LocalCluster(n_workers=40))

# make sample datasets, and convert them to pandas dataframes
left = dask.datasets.timeseries(start='2021-01-01', end='2022-12-31', partition_freq="3d").reset_index().compute()
right = dask.datasets.timeseries(start='2021-01-02', end='2023-01-01', partition_freq="3d").reset_index().compute()
left = dask.dataframe.from_pandas(left, npartitions=250)
right = dask.dataframe.from_pandas(right, npartitions=250)

# Alternative to above block (no detour via pandas)
#left = dask.datasets.timeseries(start='2021-01-01', end='2022-12-31', partition_freq="3d").reset_index()
#right = dask.datasets.timeseries(start='2021-01-02', end='2023-01-01', partition_freq="3d").reset_index()

# (dask crashes on datetime, so convert to int first)
left['timestamp'] = left['timestamp'].values.astype('int64')
right['timestamp'] = right['timestamp'].values.astype('int64')

dask.dataframe.multi.merge_asof(
    left=left.sort_values(by='timestamp'),
    right=right.sort_values(by='timestamp'),
    tolerance=pd.to_timedelta(10000000, unit='seconds').delta,
    direction='forward',
    left_on='timestamp',
    right_on='timestamp',
).compute()
Note that I deliberately call compute() on the sample data to get pandas dataframes, since in the final use case I also need to start from pandas dataframes, so I need to "model" that step as well.
Interestingly, not making this detour works much better. So if I comment out the first data-creation block and use the one labelled as an alternative instead, then the merge is successful (still some unmanaged memory, but not enough to stop the cluster).
While I suspect that occupying machine memory with the local pandas dataframes may increase the unmanaged memory (?), the data size is far from taking all of the machine's RAM (200 GiB). So my questions are:
Why does the cluster hang despite memory being available on the machine?
Why does the detour via pandas dataframes affect memory management during merge_asof?
I have tried various things with regard to garbage collection, thinking there might be some pandas data sticking around in memory while merge_asof is executed, but to no avail. I have also tried garbage collection on the workers, as well as automatic and manual memory trimming, as described in this blog post and this video, neither of which showed any effect worth mentioning.
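For reference, the manual trimming mentioned above looks roughly like this (a sketch of the glibc malloc_trim approach from the Dask documentation; Linux-only, and it assumes the allocator is actually holding reclaimable free pages):

import ctypes


def trim_memory() -> int:
    # Ask glibc to return freed heap pages to the OS (Linux only).
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


# `client` is the cluster client created in the snippet above;
# Client.run executes the function on every worker.
client.run(trim_memory)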

Pandas v 0.25 groupby with many columns gives memory error

After updating to pandas v0.25.2, a script doing a groupby over many columns on a large dataframe no longer works. I get a memory error:
MemoryError: Unable to allocate array with shape (some huge number...,) and data type int64
Doing a bit of research, I found issue #14942 reported on GitHub for an earlier version:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=3000000),
    'int_id': np.random.randint(0, 255, size=3000000),
    'other_id': np.random.randint(0, 10000, size=3000000),
    'foo': 0
})
df['cat'] = df.cat.astype(str).astype('category')

# killed after 6 minutes of 100% cpu and 90G maximum main memory usage
grouped = df.groupby(['cat', 'int_id', 'other_id']).count()
Running this code (on version 0.25.2) also gives a memory error. Am I doing something wrong (has the syntax changed in pandas v0.25?), or has this issue, which is marked as resolved, returned?
Use observed=True to fix it and prevent the groupby from expanding all possible combinations of the factor variables:
df.groupby(['cat', 'int_id', 'other_id'], observed=True).count()
There is a related GitHub Issue: PERF: groupby with many empty groups memory blowup.
While the proposed solution addresses the issue, it is likely that another problem will arise when dealing with larger datasets. pandas groupby is slow and memory hungry; it may need 5-10x the memory of the dataset. A more effective solution is to use a tool that is an order of magnitude faster and less memory hungry, and that integrates seamlessly with pandas by reading directly from the dataframe's memory. There is no data round trip, and typically no need for extensive data chunking.
My new tool of choice for quick data aggregation is https://duckdb.org. It takes your existing dataframe df and queries it directly, without even importing it into the database. Here is an example of the final result using your dataframe generation code; notice that the total time was 0.45 sec. I'm not sure why pandas doesn't use DuckDB for the groupby under the hood.
The db object is created using this small wrapper class, which lets you simply type db = DuckDB() and be ready to explore the data in any project. You can expand it further, or even simplify it using %sql magic (see the DuckDB documentation). By the way, sql() returns a dataframe, so you can also do db.sql(...).pivot_table(...); it is that simple.
import duckdb

class DuckDB:
    def __init__(self, db=None):
        self.db_loc = db or ':memory:'
        self.db = duckdb.connect(self.db_loc)

    def sql(self, sql=""):
        return self.db.execute(sql).fetchdf()

    def __del__(self):
        self.db.close()
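A usage sketch (the SQL is my own illustration of the groupby from the question, and it assumes DuckDB can pick up the in-scope df dataframe by its variable name; if it cannot, register it explicitly with db.db.register('df', df)):

db = DuckDB()

# A SQL version of the question's groupby/count; only observed
# combinations appear in the result, so there is no categorical blow-up.
counts = db.sql("""
    SELECT cat, int_id, other_id, COUNT(*) AS n
    FROM df
    GROUP BY cat, int_id, other_id
""")
print(counts.head())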
Note: DuckDB is good but not perfect; still, it turned out to be far more stable than Dask or even PySpark, with a much simpler setup. For larger datasets you may need a real database, but for datasets that fit in memory it is great. Regarding memory usage: if you have a larger dataset, make sure to limit DuckDB's memory with pragmas, as it will otherwise eat it all in no time. The limit simply spills the extra onto disk without you having to deal with data chunking. Also, do not treat it as a persistent database. Treat it as an in-memory database: if you need to keep some results, export them to parquet instead of saving the database file, because the storage format is not stable between releases and you will have to export to parquet anyway to move from one version to the next.
I expanded this dataframe to 300 million records, so in total it had around 1.2 billion values, or around 9 GB. It still completed your groupby and other summary stats on a 32 GB machine, with 18 GB still free.

How to use CUDA pinned "zero-copy" memory for a memory mapped file?

Objective/Problem
In Python, I am looking for a fast way to read/write data from a memory mapped file to a GPU.
In a previous SO post [ Cupy OutOfMemoryError when trying to cupy.load larger dimension .npy files in memory map mode, but np.load works fine ], it is mentioned that this is possible using CUDA pinned "zero-copy" memory. Furthermore, it seems that this method was developed by the author of [ cuda - Zero-copy memory, memory-mapped file ], though that person was working in C++.
My previous attempts have been with Cupy, but I am open to any cuda methods.
What I have tried so far
I mentioned how I tried to use Cupy, which allows you to open numpy files in memory-mapped mode.
import os
import numpy as np
import cupy
# Create .npy files.
for i in range(4):
    numpyMemmap = np.memmap( 'reg.memmap'+str(i), dtype='float32', mode='w+', shape=( 2200000 , 512))
    np.save( 'reg.memmap'+str(i) , numpyMemmap )
    del numpyMemmap
    os.remove( 'reg.memmap'+str(i) )

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    NPYmemmap.append( np.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' ) )
del NPYmemmap

# Eventually results in memory error.
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' ) )
Result of what I have tried
My attempt resulted in an OutOfMemoryError:
It was mentioned that
it appears that cupy.load will require that the entire file fit first in host memory, then in device memory.
And it was also mentioned that
CuPy can't handle mmap memory. So, CuPy uses GPU memory directly in default.
https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.MemoryPool.html#cupy.cuda.MemoryPool.malloc
You can change default memory allocator if you want to use Unified Memory.
I tried using
cupy.cuda.set_allocator(cupy.cuda.MemoryPool(cupy.cuda.memory.malloc_managed).malloc)
But this didn't seem to make a difference. At the time of the error, my CPU RAM was at ~16 gigs, but my GPU RAM was at 0.32 gigs. I am using Google Colab, where my CPU RAM is 25 gigs and GPU RAM is 12 gigs. So it looks like after the entire file was loaded into host memory, it checked whether it could fit in device memory, and when it saw that it only had 12 of the required 16 gigs, it threw an error (my best guess).
So, now I am trying to figure out a way to use pinned 'zero-copy' memory to handle a memory mapped file which would feed data to the GPU.
If it is important, the type of data I am trying to transfer is floating point arrays. Normally, for read-only data, binary files are loaded into GPU memory, but I am working with data that I am trying to both read and write at every step.
It appears to me that currently, cupy doesn't offer a pinned allocator that can be used in place of the usual device memory allocator, i.e. could be used as the backing for cupy.ndarray. If this is important to you, you might consider filing a cupy issue.
However, it seems like it may be possible to create one. This should be considered experimental code. And there are some issues associated with its use.
The basic idea is that we will replace cupy's default device memory allocator with our own, using cupy.cuda.set_allocator as was already suggested to you. We will need to provide our own replacement for the BaseMemory class that is used as the repository for cupy.cuda.memory.MemoryPointer. The key difference here is that we will use a pinned memory allocator instead of a device allocator. This is the gist of the PMemory class below.
A few other things to be aware of:
after doing what you need with pinned memory (allocations) you should probably revert the cupy allocator to its default value. Unfortunately, unlike cupy.cuda.set_allocator, I did not find a corresponding cupy.cuda.get_allocator, which strikes me as a deficiency in cupy, something that also seems worthy of filing a cupy issue to me. However for this demonstration we will just revert to the None choice, which uses one of the default device memory allocators (not the pool allocator, however).
by providing this minimalistic pinned memory allocator, we are still suggesting to cupy that this is ordinary device memory. That means it's not directly accessible from the host code (it is, actually, but cupy doesn't know that). Therefore, various operations (such as cupy.load) will create unneeded host allocations, and unneeded copy operations. I think to address this would require much more than just this small change I am suggesting. But at least for your test case, this additional overhead may be manageable. It appears that you want to load data from disk once, and then leave it there. For that type of activity, this should be manageable, especially since you are breaking it up into chunks. As we will see, handling four 5GB chunks will be too much for 25GB of host memory. We will need host memory allocation for the four 5GB chunks (which are actually pinned) and we will also need additional space for one additional 5GB "overhead" buffer. So 25GB is not enough for that. But for demonstration purposes, if we reduce your buffer sizes to 4GB (5x4GB = 20GB) I think it may fit within your 25GB host RAM size.
Ordinary device memory associated with cupy's default device memory allocator, has an association with a particular device. pinned memory need not have such an association, however our trivial replacement of BaseMemory with a lookalike class means that we are suggesting to cupy that this "device" memory, like all other ordinary device memory, has a specific device association. In a single device setting such as yours, this distinction is meaningless. However, this isn't suitable for robust multi-device use of pinned memory. For that, again the suggestion would be a more robust change to cupy, perhaps by filing an issue.
Here's an example:
import os
import numpy as np
import cupy
class PMemory(cupy.cuda.memory.BaseMemory):
    def __init__(self, size):
        self.size = size
        self.device_id = cupy.cuda.device.get_device_id()
        self.ptr = 0
        if size > 0:
            self.ptr = cupy.cuda.runtime.hostAlloc(size, 0)
    def __del__(self):
        if self.ptr:
            cupy.cuda.runtime.freeHost(self.ptr)

def my_pinned_allocator(bsize):
    return cupy.cuda.memory.MemoryPointer(PMemory(bsize),0)

cupy.cuda.set_allocator(my_pinned_allocator)

# Create 4 .npy files, ~4GB each
for i in range(4):
    print(i)
    numpyMemmap = np.memmap( 'reg.memmap'+str(i), dtype='float32', mode='w+', shape=( 10000000 , 100))
    np.save( 'reg.memmap'+str(i) , numpyMemmap )
    del numpyMemmap
    os.remove( 'reg.memmap'+str(i) )

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    print(i)
    NPYmemmap.append( np.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' ) )
del NPYmemmap

# allocate pinned memory storage
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' ) )

cupy.cuda.set_allocator(None)
I haven't tested this in a setup with 25GB of host memory with these file sizes. But I have tested it with other file sizes that exceed the device memory of my GPU, and it seems to work.
Again, experimental code, not thoroughly tested, your mileage may vary, would be better to attain this functionality via filing of cupy github issues. And, as I've mentioned previously, this sort of "device memory" will be generally much slower to access from device code than ordinary cupy device memory.
Finally, this is not really a "memory mapped file" as all the file contents will be loaded into host memory, and furthermore, this methodology "uses up" host memory. If you have 20GB of files to access, you will need more than 20GB of host memory. As long as you have those files "loaded", 20GB of host memory will be in use.
UPDATE: cupy provides support for pinned allocators now, see here. This answer should only be used for historical reference.

Memory is not released when taking a small slice of a DataFrame

Summary
adataframe is a DataFrame with 800k rows. Naturally, it consumes a bit of memory. When I do this:
adataframe = adataframe.tail(144)
memory is not released.
You could argue that the memory is actually released but merely appears to be in use: that it's marked free and will be reused by Python. However, if I attempt to create a new 800k-row DataFrame and again keep only a small slice, memory usage grows. If I do it again, it grows again, ad infinitum.
I'm using Debian Jessie's Python 3.4.2 with Pandas 0.18.1 and numpy 1.11.1.
Demonstration with minimal program
With the following program I create a dictionary
data = {
    0: a_DataFrame_loaded_from_a_CSV,_only_the_last_144_rows,
    1: same_thing,
    # ...
    9: same_thing,
}
and I monitor memory usage while I'm creating the dictionary. Here it is:
#!/usr/bin/env python3
from resource import getrusage, RUSAGE_SELF

import pandas as pd

def print_memory_usage():
    print(getrusage(RUSAGE_SELF).ru_maxrss)

def read_dataframe_from_csv(f):
    result = pd.read_csv(f, parse_dates=[0],
                         names=('date', 'value', 'flags'),
                         usecols=('date', 'value', 'flags'),
                         index_col=0, header=None,
                         converters={'flags': lambda x: x})
    result = result.tail(144)
    return result

print_memory_usage()
data = {}
for i in range(10):
    with open('data.csv') as f:
        data[i] = read_dataframe_from_csv(f)
    print_memory_usage()
Results
If data.csv only contains a few rows (e.g. 144, in which case the slicing is redundant), memory usage grows very slowly. But if data.csv contains 800k rows, the results are similar to these:
52968
153388
178972
199760
225312
244620
263656
288300
309436
330568
349660
(Adding gc.collect() before print_memory_usage() does not make any significant difference.)
What can I do about it?
As @Alex noted, slicing a dataframe only gives you a view of the original frame but does not delete it; you need to use .copy() for that. However, even when I used .copy(), memory usage grew and grew and grew, albeit at a slower rate.
I suspect that this has to do with how Python, numpy and pandas use memory. A dataframe is not a single object in memory; it contains pointers to other objects (especially, in this particular case, to strings, which is the "flags" column). When the dataframe is freed, and these objects are freed, the reclaimed free memory space can be fragmented. Later, when a huge new dataframe is created, it might not be able to use the fragmented space, and new space might need to be allocated. The details depend on many little things, such as the Python, numpy and pandas versions, and the particulars of each case.
Rather than investigating these little details, I decided that reading a huge time series and then slicing it is a no go, and that I must read only the part I need right from the start. I like some of the code I created for that, namely the textbisect module and the FilePart class.
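For reference, the .copy() variant mentioned above is a one-line change to the reading function from the question; it detaches the 144-row slice from the full frame's buffers, although as noted it only slowed the growth in my case:

def read_dataframe_from_csv(f):
    result = pd.read_csv(f, parse_dates=[0],
                         names=('date', 'value', 'flags'),
                         usecols=('date', 'value', 'flags'),
                         index_col=0, header=None,
                         converters={'flags': lambda x: x})
    # Copy the 144-row slice so it no longer references the buffers of
    # the full 800k-row frame, which can then be garbage collected.
    result = result.tail(144).copy()
    return result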
You could argue that the memory is actually released but merely appears to be in use: that it's marked free and will be reused by Python.
Correct, that is how maxrss works (it measures peak memory usage). See here.
So the question, then, is why the garbage collector is not cleaning up the original DataFrames after they have been subsetted.
I suspect it is because subsetting returns a DataFrame that acts as a proxy to the original one (so values don't need to be copied). This would result in a relatively fast subset operation but also memory leaks like the one you found and weird speed characteristics when setting values.
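A quick way to see this proxy behaviour (a sketch with a made-up single-column frame; the exact outcome can vary with pandas version and copy-on-write settings) is to check whether the slice shares its buffers with the parent, and that an explicit .copy() is what breaks the link:

import numpy as np
import pandas as pd

big = pd.DataFrame({"value": np.arange(800000)})
sliced = big.tail(144)          # still backed by big's buffers
copied = big.tail(144).copy()   # owns its own buffers

print(np.shares_memory(sliced["value"].values, big["value"].values))  # True
print(np.shares_memory(copied["value"].values, big["value"].values))  # False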