Dask distributed apparently not releasing memory on task completion - pandas

I'm trying to execute a custom Dask graph on a distributed system, but it does not seem to release the memory of finished tasks. Am I doing something wrong?
I've tried changing the number of processes and using a local cluster, but it doesn't seem to make a difference.
from dask.distributed import Client
from dask import get
import pandas as pd

client = Client()

def get_head(df):
    return df.head()

# 50 independent tasks that each load the same large CSV
process_big_file_tasks = {f'process-big-file-{i}': (pd.read_csv, '/home/ubuntu/huge_file.csv') for i in range(50)}
# each fragment task keeps only the head of the corresponding DataFrame
return_fragment_tasks = {f'return-fragment-{i}': (get_head, previous_task) for i, previous_task in enumerate(process_big_file_tasks)}

dsk = {
    **process_big_file_tasks,
    **return_fragment_tasks,
    'concat': (pd.concat, list(return_fragment_tasks)),
}

client.get(dsk, 'concat')
Since the tasks are independent of each other (or at least the ones that consume the most memory), each one's memory should be released as soon as it finishes.

How do you determine that it isn't releasing memory? I recommend looking at Dask's dashboard to see the structure of the computation, including what has and hasn't been released. This YouTube video may be helpful:
https://www.youtube.com/watch?v=N_GqzcuGLCY
In particular, I encourage you to watch the Graph tab while running your computation.
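If you want a rough programmatic check alongside the dashboard, you can also ask the scheduler how much memory each worker is currently holding. This is only a sketch; the exact layout of the info dict may vary between distributed versions.
from dask.distributed import Client

client = Client()
info = client.scheduler_info()
for address, worker in info["workers"].items():
    # resident memory per worker, in bytes (field names may differ across versions)
    print(address, worker["metrics"]["memory"])
print("Dashboard:", client.dashboard_link)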

Related

keep getting "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time..."

I keep getting "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time recently" warning message after I finished DASK code. I am using DASK doing a large seismic data computing. After the computing, I will write the computed data into disk. The writing to disk part takes much longer than computing. Before I wrote the data to the disk, I call client.close(), which I assume that I am done with DASK. But "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time recently" keep coming. When I doing the computing, I got this warning message 3-4 times. But when I write the data to the disk, I got the warning every 1 sec. How can I get ride of this annoying warning? Thanks.
The same was happening for me in Colab, where we start the session with
client = Client(n_workers=40, threads_per_worker=2)
I terminated all my Colab sessions, then reinstalled and re-imported all the Dask libraries:
!pip install dask
!pip install cloudpickle
!pip install 'dask[dataframe]'
!pip install 'dask[complete]'
from dask.distributed import Client
import dask.dataframe as dd
import dask.multiprocessing
Now everything is working fine and I'm not facing any issues.
I don't know how this solved my issue :D
I had been struggling with this warning too. I would get many of these warnings and then the workers would die. I was getting them because I had some custom Python functions for aggregating my data that were handling large Python objects (dicts). It makes sense that so much time was being spent on garbage collection if I was creating such large objects.
I refactored my code so that more computation was done in parallel before the results were aggregated, and the warnings went away.
I looked at the progress chart on the status page of the Dask dashboard to see which tasks were taking a long time to process (Dask tries to name tasks after the function in your code that called them, which can help, but they're not always that descriptive). From there I could figure out which part of my code I needed to optimise.
You can disable garbage collection in Python:
import gc
gc.disable()
I found that it was easier to manage Dask worker memory through periodic use of the client restart: Client.restart()
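A minimal sketch of that pattern, assuming your work naturally splits into batches (batches and process below are placeholders for your own data and function):
from dask.distributed import Client

client = Client()
for batch in batches:                        # placeholder: your own units of work
    futures = client.map(process, batch)     # placeholder: your own function
    results = client.gather(futures)
    # ... write results somewhere ...
    client.restart()                         # clears worker state and reclaims memory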
Just create a separate process to run the Dask cluster and return its address, then create the client using that address.
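A hedged sketch of that idea: start a LocalCluster in a child process, hand the scheduler address back to the parent, and connect a Client to it from there.
from multiprocessing import Process, Queue
import time
from dask.distributed import Client, LocalCluster

def run_cluster(q):
    cluster = LocalCluster(n_workers=4)      # scheduler + workers live in this process
    q.put(cluster.scheduler_address)         # e.g. 'tcp://127.0.0.1:...'
    while True:                              # keep the cluster process alive
        time.sleep(1)

if __name__ == "__main__":
    q = Queue()
    Process(target=run_cluster, args=(q,), daemon=True).start()
    address = q.get()
    client = Client(address)                 # connect from the main process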

Is it possible to use Datalab with multiprocessing as a way to scale Pandas transformations?

I'm trying to use Google Cloud Datalab to scale up data transformations in Pandas.
On my machine, everything works fine with small files (keeping the first 100000 rows of my file), but working with the full 8 GB input CSV file leads to a MemoryError.
I thought that a Datalab VM would help me. I first tried a highmem VM, going up to 120 GB of memory.
There, I keep getting the error: The kernel appears to have died. It will restart automatically.
I found something here:
https://serverfault.com/questions/900052/datalab-crashing-despite-high-memory-and-cpu
But I am not using TensorFlow, so it didn't help much.
So I tried a different approach: chunked processing, parallelised over more cores. It works well on my machine (4 cores, 12 GB RAM), but still requires hours of computation.
So I wanted to use a Datalab VM with 32 cores to speed things up, but after 5 hours the first threads still hadn't finished, while on my local machine 10 had already completed.
So, very simply:
Is it possible to use Datalab as a way to scale Pandas transformations?
Why do I get worse results with a theoretically much better VM than my local machine?
Some code:
import pandas as pd
import numpy as np
from OOS_Case.create_features_v2 import process
from multiprocessing.dummy import Pool as ThreadPool

df_pb = pd.read_csv('---.csv')

list_df = []
for i in range(-):
    df = df_pb.loc[---]
    list_df.append(df)

pool = ThreadPool(4)
pool.map(process, list_df)
All the operations in my process function are pure Pandas and Numpy operations
Thanks for any tip, alternative or best practice advice you could give me !
One year later, I have learned some useful best practices:
Use Google AI Platform, and select to build a VM with the required number of CPUs.
Use Dask for multithreading.
With Pandas, it is possible to parallelise .apply() with pandarallel (a minimal sketch follows below).
It seems that GCP Datalab does not support multithreading:
Each kernel is single threaded. Unless you are running multiple notebooks at the same time, multiple cores may not provide significant benefit.
You can find more information here.
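For the pandarallel point above, a minimal sketch (the file name and the process_row function are placeholders, not part of the original question):
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)    # spawns one worker per available core

df = pd.read_csv("input.csv")                # placeholder file name

def process_row(row):
    return row.sum()                         # stand-in for a real transformation

result = df.parallel_apply(process_row, axis=1)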

Real-time camera input to Julia-lang

TL;DR: How can I achieve low-latency, low-CPU-impact webcam acquisition in Julia?
Edit: I also posted this on the Julia devs forum.
I am new to Julia. I am interested in processing the video feed from a connected webcam and seeing what kind of performance I can get out of Julia.
I am working on Linux, Ubuntu 16.04.
The only way I have found to get webcam input through video4linux is VideoIO, which is working on my system. The video has an unacceptable lag, however, of up to 4 seconds. I assume this is caused by the buffering of frames in the driver and/or libav (or is it ffmpeg, I don't know). With any camera API worth its name, I should be able to access the latest frame acquired... or at least set the size of the queue that I'm popping frames from. There seems to be no such option in VideoIO, or maybe I am missing it.
It really is important for me to be able to showcase Julia as a high-performance language to non-techies... so this lag will ruin the demo I am hoping to put together.
Edit: here is some of the code I have:
module myViewCam
export myView
import VideoIO, ImageView;

function myView()
    camera = VideoIO.opencamera();
    buf = VideoIO.read(camera);
    guidict = ImageView.imshow(buf);
    while !eof(camera)
        VideoIO.read!(camera, buf);
        ImageView.imshow(guidict["gui"]["canvas"], buf);
        sleep(0.00001);
    end
end

end
Assuming the above is the content of myViewCam.jl, at the Julia prompt (the "REPL") I type:
include("myViewCam.jl");
myViewCam.myView();
Note that this is a fix for the function VideoIO.viewcam(), which does not seem to work out of the box.
On my system, this brings the Julia thread up to about 100% CPU usage; at the beginning of the video stream there is about 4 seconds of lag, but this evens out over time until it lands at about 0.5 seconds of lag. There obviously is some queue that frames are popped from.
Also see the Video4Linux wrapper in Julia, which works well with Images.jl:
https://github.com/Affie/Video4Linux.jl
It's not registered yet, but has been around for a while. It is possible to make this process multithreaded in Julia using SharedArrays.jl, or likely the new composable threading model introduced in Julia 1.3.
PS: this vendor-specific camera interface package exists too: https://github.com/JuliaCameras/RealSense.jl

How can I stream data directly into tensorflow as opposed to reading files on disc?

Every TensorFlow tutorial I've been able to find so far works by first loading the training/validation/test images into memory and then processing them. Does anyone have a guide or recommendations for streaming images and labels as input into TensorFlow? I have a lot of images stored on a different server, and I would like to stream those images into TensorFlow rather than saving them directly on my machine.
Thank you!
TensorFlow does have Queues, which support streaming so that you don't have to load the full dataset into memory. But yes, by default they only support reading from files on the same server. The real problem is that you want to load data into memory from some other server. I can think of the following ways to do this:
Expose your images using a REST service. Write your own queueing mechanism in Python, read the data (using urllib or similar), and feed it to TensorFlow placeholders (see the sketch after this answer).
Instead of using Python queues (as above), you can use TensorFlow queues as well (see this answer), although it's slightly more complicated. The advantage is that TensorFlow queues can use multiple cores, giving you better performance compared to normal Python multi-threaded queues.
Use a network mount to fool your OS into believing the data is on the same machine.
Also, remember that when using this sort of distributed setup you will always incur network overhead (the time taken for images to be transferred from Server 1 to Server 2), which can slow down your training by a lot. To counteract this, you would have to build a multi-threaded queueing mechanism with fetch-execute overlap, which is a lot of effort. An easier option, IMO, is to just copy the data onto your training machine.
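The first option above (a REST service plus a Python-side queue feeding placeholders) might look roughly like this sketch, written against the old TF1-style graph/session API; image_urls, decode_image and train_op are placeholders, not a real API:
import queue, threading, urllib.request
import numpy as np
import tensorflow as tf                      # assumes a TF1.x graph/session API

batch_queue = queue.Queue(maxsize=10)

def fetch_loop(image_urls):
    for url in image_urls:
        raw = urllib.request.urlopen(url).read()
        img = decode_image(raw)              # hypothetical helper: bytes -> HxWx3 float array
        batch_queue.put(img)

images_ph = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
# ... build the rest of the graph on top of images_ph ...

threading.Thread(target=fetch_loop, args=(image_urls,), daemon=True).start()
with tf.Session() as sess:
    batch = np.stack([batch_queue.get() for _ in range(32)])
    sess.run(train_op, feed_dict={images_ph: batch})   # train_op comes from your own graph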
You can use the socket module in Python to transfer a batch of images and labels from your server to your host. Your graph needs to be defined to take a placeholder as input, and the placeholder must be compatible with your batch size.
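A hedged sketch of that socket idea: the server pickles a batch of (images, labels) into a length-prefixed frame and the training host unpickles it (connection setup is omitted, and pickle should only be used over connections you trust):
import pickle, socket, struct

def send_batch(sock, images, labels):
    payload = pickle.dumps((images, labels))
    sock.sendall(struct.pack("!I", len(payload)) + payload)   # length-prefixed frame

def recv_exactly(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf += chunk
    return buf

def recv_batch(sock):
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return pickle.loads(recv_exactly(sock, length))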

numpy: How do you "break" a numpy operation?

I accidentally tried to make a 200,000 x 200,000 array in numpy. Control-C does not seem to break the operation. Is there any way to stop the array creation without simply killing the python session?
Unfortunately, no. Python (and MATLAB, the only other analysis package I use) does not detect the user interrupt until the current operation (not the whole run) has finished.
The reason this doesn't work is that large portions of NumPy are written in C.
When Python starts executing a compiled function, Python-level signal handling is effectively paused until the compiled code finishes.
This is bad news for your interactive Python session, but there isn't much you can do aside from waiting for the inevitable MemoryError or killing the session.
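As a small illustration of the point about compiled code and signals (the array size is only for demonstration; an allocation this large may also get the process killed by the OS rather than fail cleanly):
import numpy as np

try:
    a = np.ones((200_000, 200_000))   # ~320 GB of float64
except MemoryError:
    print("NumPy gave up before exhausting the machine")
except KeyboardInterrupt:
    # Ctrl-C is only acted on once the C-level call returns control to Python
    print("interrupted between Python-level operations")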