keep getting "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time..." - dask-distributed

I keep getting "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time recently" warning message after I finished DASK code. I am using DASK doing a large seismic data computing. After the computing, I will write the computed data into disk. The writing to disk part takes much longer than computing. Before I wrote the data to the disk, I call client.close(), which I assume that I am done with DASK. But "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time recently" keep coming. When I doing the computing, I got this warning message 3-4 times. But when I write the data to the disk, I got the warning every 1 sec. How can I get ride of this annoying warning? Thanks.

The same thing was happening to me in Colab, where I start the session with
client = Client(n_workers=40, threads_per_worker=2)
I terminated all my Colab sessions, then reinstalled and imported all the Dask libraries:
!pip install dask
!pip install cloudpickle
!pip install 'dask[dataframe]'
!pip install 'dask[complete]'
from dask.distributed import Client
import dask.dataframe as dd
import dask.multiprocessing
Now everything is working fine and I am not facing any issues.
I don't know how this solved my issue :D

I had been struggling with this warning too. I would get many of these warnings and then the workers would die. I was getting them because I had some custom Python functions for aggregating my data that handled large Python objects (dicts). It makes sense that so much time was being spent on garbage collection if I was creating these large objects.
I refactored my code so that more computation was done in parallel before the results were aggregated, and the warnings went away.
I looked at the progress chart on the status page of the Dask dashboard to see which tasks were taking a long time to process (Dask tries to name tasks after the function in your code that called them, which can help, but they're not always that descriptive). From there I could figure out which part of my code I needed to optimise.

You can disable garbage collection in Python:
gc.disable()
I found that it was easier to manage Dask worker memory through periodic use of the Dask client restart: Client.restart()
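A minimal sketch of both workarounds, assuming a local cluster; whether disabling the garbage collector is acceptable depends on your workload:
import gc
from dask.distributed import Client

gc.disable()      # silences the GC-time warnings, at the cost of no automatic collection
client = Client()

# ... run one batch of computations ...

client.restart()  # clears all worker state and reclaims memory between batches
Note that restart() kills any in-flight tasks, so it only makes sense between independent batches of work.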

Just create a separate process to run the Dask cluster and have it return the scheduler's IP address, then create the client using that address.
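A minimal sketch of that approach, assuming a LocalCluster; the helper names (run_cluster, address_queue) are illustrative, not part of any Dask API:
import multiprocessing
import time
from dask.distributed import Client, LocalCluster

def run_cluster(address_queue):
    # start the cluster in its own process and report the scheduler address back
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    address_queue.put(cluster.scheduler_address)  # e.g. 'tcp://127.0.0.1:8786'
    while True:
        time.sleep(1)  # keep the cluster process alive

if __name__ == '__main__':
    address_queue = multiprocessing.Queue()
    multiprocessing.Process(target=run_cluster, args=(address_queue,), daemon=True).start()
    client = Client(address_queue.get())  # connect to the cluster by its address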

Related

Your notebook tried to allocate more memory than is available. It has restarted

I was getting started with TalkingData AdTracking, my first entry in a Kaggle competition. The first line was pd.read_csv() and I got this error:
Your notebook tried to allocate more memory than is available. It has restarted
I thought my code was running in the cloud and that I didn't have to worry about memory requirements. Can someone help me with that?
Yes, Kaggle has a memory limit: it's 8 GB per kernel, or 20 minutes of running time. It takes a lot of server juice to host such a thing.
There are various solutions to this problem, notably loading and processing the dataset in chunks (see the sketch below).
There is this as well, but I have no experience with it.
And of course you can use another cloud platform such as AWS, GCP, Oracle, etc.
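A minimal sketch of the chunked approach, assuming the competition's train.csv and an 'ip' column; the aggregation itself is only illustrative:
import pandas as pd

partial_counts = []
for chunk in pd.read_csv('train.csv', chunksize=1_000_000):  # read 1M rows at a time
    # only one chunk is held in memory while it is being processed
    partial_counts.append(chunk.groupby('ip').size())

clicks_per_ip = pd.concat(partial_counts).groupby(level=0).sum()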

Dask distributed apparently not releasing memory on task completion

I'm trying to execute a custom Dask graph on a distributed system, and it does not seem to be releasing the memory of finished tasks. Am I doing something wrong?
I've tried changing the number of processes and using a local cluster, but it doesn't seem to make a difference.
from dask.distributed import Client
from dask import get
import pandas as pd

client = Client()

def get_head(df):
    return df.head()

# 50 independent tasks, each reading the whole file into memory
process_big_file_tasks = {f'process-big-file-{i}': (pd.read_csv, '/home/ubuntu/huge_file.csv') for i in range(50)}
# one small task per big task, keeping only the head of each DataFrame
return_fragment_tasks = {f'return-fragment-{i}': (get_head, previous_task) for i, previous_task in enumerate(process_big_file_tasks)}

dsk = {
    **process_big_file_tasks,
    **return_fragment_tasks,
    'concat': (pd.concat, list(return_fragment_tasks)),
}

client.get(dsk, 'concat')
Since the tasks are independent of each other (or at least the ones that consume the most memory), the memory of each one should be released when it finishes.
How do you determine that it isn't releasing memory? I recommend looking at Dask's dashboard to see the structure of the computation, including what has and hasn't been released. This YouTube video may be helpful:
https://www.youtube.com/watch?v=N_GqzcuGLCY
In particular, I encourage you to watch the Graph tab while running your computation.
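If you're not sure where to find the dashboard, the client created in the question exposes its URL (dashboard_link is a standard attribute of distributed.Client):
print(client.dashboard_link)  # e.g. http://127.0.0.1:8787/status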

Is it possible to use Datalab with multiprocessing as a way to scale Pandas transformations?

I am trying to use Google Cloud Datalab to scale up data transformations in Pandas.
On my machine, everything works fine with small files (keeping the first 100,000 rows of my file), but working with the full 8 GB input CSV file leads to a MemoryError.
I thought that a Datalab VM would help me. I first tried to use a highmem VM with up to 120 GB of memory.
There, I keep getting the error: The kernel appears to have died. It will restart automatically.
I found something here:
https://serverfault.com/questions/900052/datalab-crashing-despite-high-memory-and-cpu
But I am not using TensorFlow, so it didn't help much.
So I tried a different approach: chunk processing and parallelising across more cores. It works well on my machine (4 cores, 12 GB RAM), but still requires hours of computation.
So I wanted to use a Datalab VM with 32 cores to speed things up, but after 5 hours the first threads still hadn't finished, while on my local machine 10 had already completed.
So, very simply:
Is it possible to use Datalab as a way to scale Pandas transformations?
Why do I get worse results with a theoretically much better VM than my local machine?
Some code:
import pandas as pd
import numpy as np
from OOS_Case.create_features_v2 import process
from multiprocessing.dummy import Pool as ThreadPool  # thread pool, not separate processes

df_pb = pd.read_csv('---.csv')

# split the DataFrame into slices to be processed in parallel
list_df = []
for i in range(-):
    df = df_pb.loc[---]
    list_df.append(df)

pool = ThreadPool(4)
pool.map(process, list_df)
All the operations in my process function are pure Pandas and Numpy operations
Thanks for any tip, alternative or best practice advice you could give me !
One year later, I have learned some useful best practices:
Use Google AI Platform and build a VM with the required number of CPUs.
Use Dask for multithreading (see the sketch below).
With Pandas, there is a possibility to parallelise .apply() with pandarallel.
It seems that GCP Datalab does not support multithreading:
Each kernel is single threaded. Unless you are running multiple notebooks at the same time, multiple cores may not provide significant benefit.
More information can be found here.
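A minimal sketch of the Dask option mentioned above, reusing the placeholder file name and the process function from the question, and assuming process maps a DataFrame to a DataFrame:
import dask.dataframe as dd
from OOS_Case.create_features_v2 import process

ddf = dd.read_csv('---.csv', blocksize='256MB')  # lazily split the CSV into partitions
ddf = ddf.map_partitions(process)                # apply the pure Pandas/NumPy function per partition
result = ddf.compute(scheduler='threads')        # run the partitions in parallel threads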

Slow pywinauto Import

When I import almost any module, it loads seemingly instantly, or at least fast enough to be unnoticeable.
However, there is an issue with pywinauto. When I try to import it, it takes a huge amount of time (~1 min), which is highly annoying for the users.
I am wondering if there is any way to speed up the time it takes to load the module.
If you use backend='uia', it looks impossible, because importing comtypes and loading UIAutomationCore.dll take most of the time, and they are functionally required.
But if you only need the default backend='win32' (i.e. if you create the Application() object without the backend parameter), you can run pip uninstall -y comtypes. Only the Win32 backend will be available after that, but the import should work much faster.
More details about these two backends can be found in the Getting Started Guide.
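For reference, the default backend is whatever Application() uses when no backend argument is given, so something like the following keeps working after comtypes is removed (notepad.exe is just an example target):
from pywinauto import Application

app = Application().start('notepad.exe')  # backend='win32' is the default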

numpy: How do you "break" a numpy operation?

I accidentally tried to make a 200,000 x 200,000 array in numpy. Control-C does not seem to break the operation. Is there any way to stop the array creation without simply killing the python session?
Unfortunately, no. Python (and MATLAB, the only other analysis package I use) does not detect the user interrupt until the current operation (not the whole run) has finished.
The reason this doesn't work is that large portions of numpy are written in C.
When Python begins executing compiled code, Python's signal handling is effectively paused until the compiled code finishes.
This is bad news for your interactive Python session, but there isn't much you can do aside from waiting for the inevitable MemoryError, or killing the session.
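For scale, a rough back-of-the-envelope calculation for the array in the question, assuming the default float64 dtype (8 bytes per element):
n = 200_000
print(n * n * 8 / 1024**3)  # ~298 GiB, far more than a typical workstation's RAM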