Is it possible to use Datalab with multiprocessing as a way to scale Pandas transformations?

I am trying to use Google Cloud Datalab to scale up data transformations in Pandas.
On my machine everything works fine with small files (keeping only the first 100,000 rows of my file), but working with the full 8 GB input CSV file leads to a MemoryError.
I thought that a Datalab VM would help me, so I first tried a highmem VM with up to 120 GB of memory.
There, I keep getting the error: The kernel appears to have died. It will restart automatically.
I found something here:
https://serverfault.com/questions/900052/datalab-crashing-despite-high-memory-and-cpu
but I am not using TensorFlow, so it didn't help much.
So I tried a different approach: chunk processing and parallelising across more cores. It works well on my machine (4 cores, 12 GB RAM), but it still requires hours of computation.
So I wanted to use a Datalab VM with 32 cores to speed things up, but after 5 hours the first threads still had not finished, whereas on my local machine 10 had already completed.
So, very simply:
Is it possible to use Datalab as a way to scale Pandas transformations?
Why do I get worse results with a theoretically much better VM than with my local machine?
Some code:
import pandas as pd
import numpy as np
from OOS_Case.create_features_v2 import process
from multiprocessing.dummy import Pool as ThreadPool  # note: multiprocessing.dummy gives a thread pool, not a process pool

df_pb = pd.read_csv('---.csv')

# Split the frame into chunks (the chunking bounds are elided in the original post)
list_df = []
for i in range(-):
    df = df_pb.loc[---]
    list_df.append(df)

pool = ThreadPool(4)
pool.map(process, list_df)
All the operations in my process function are pure Pandas and Numpy operations
Thanks for any tip, alternative, or best-practice advice you could give me!

One year later, I have learned some useful best practices:
Use Google AI Platform and select a VM with the required number of CPUs.
Use Dask for multithreading.
With Pandas, it is possible to parallelise .apply() with pandarallel (a short sketch follows).
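The following is a minimal sketch of the pandarallel route; the input file name, column name, and transformation are made up for illustration and are not from the original question.
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)   # uses all available cores by default

df = pd.read_csv('input.csv')               # hypothetical input file

def process_row(row):
    # Stand-in for the real per-row transformation
    return row['value'] * 2

df['processed'] = df.parallel_apply(process_row, axis=1)   # drop-in replacement for .apply()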

It seems that GCP Datalab does not support multithreading:
Each kernel is single threaded. Unless you are running multiple notebooks at the same time, multiple cores may not provide significant benefit.
You can find more information here.

Related

How many CPU cores does Google Colab assign when I keep n_jobs=8, and is there any way to check that?

I am running regression tasks in Google Colab with GridSearchCV. In the parameters I set n_jobs=8; when I set it to -1 (to use all possible cores) it uses only 2 cores, so I am assuming there is a limit on the server end when n_jobs=-1. I would like to know how to check how many cores are actually being used.
If you use the code below, you will see that the multiprocessing module in Google Colab reports 2 cores:
import multiprocessing
cores = multiprocessing.cpu_count() # Count the number of cores in a computer
cores
That is a question I had too. I put n_jobs=100 in Colab and got:
[Parallel(n_jobs=100)]: Using backend LokyBackend with 100 concurrent workers.
This is surprising because Google Colab only gives you 2 processors. However, you can always use your own CPU/GPU with Colab.

Your notebook tried to allocate more memory than is available. It has restarted

I was getting started with TalkingData AdTracking, my first entry in a Kaggle competition. The first line was pd.read_csv() and I got this error:
Your notebook tried to allocate more memory than is available. It has restarted
I thought my code was running in the cloud and that I wouldn't have to worry about memory requirements. Can someone help me with that?
Yes, Kaggle has a memory limit: it's 8 GB per kernel, or 20 minutes of running time. It takes a lot of server juice to host such a thing.
There are various solutions to this problem, notably loading and processing the dataset in chunks (see the sketch after this answer).
There is this as well. But I have no experience with it.
And of course you can use another cloud platform such as AWS, GCP, Oracle, etc.
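As a minimal sketch of the chunked approach (the file name, column names, and dtypes below are assumptions, not taken from the competition data): read the CSV in fixed-size pieces, reduce each piece, and keep only the small aggregated results in memory.
import pandas as pd

chunks = pd.read_csv(
    'train.csv',                               # hypothetical file name
    usecols=['user_id', 'channel', 'target'],  # load only the columns you need
    dtype={'user_id': 'int32', 'channel': 'int16', 'target': 'int8'},  # smaller dtypes save memory
    chunksize=1_000_000,                       # rows per chunk
)

partial_sums = []
for chunk in chunks:
    # Do the per-chunk work here; keep only small intermediate results
    partial_sums.append(chunk.groupby('channel')['target'].sum())

result = pd.concat(partial_sums).groupby(level=0).sum()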

keep getting "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time..."

I keep getting "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time recently" warning message after I finished DASK code. I am using DASK doing a large seismic data computing. After the computing, I will write the computed data into disk. The writing to disk part takes much longer than computing. Before I wrote the data to the disk, I call client.close(), which I assume that I am done with DASK. But "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time recently" keep coming. When I doing the computing, I got this warning message 3-4 times. But when I write the data to the disk, I got the warning every 1 sec. How can I get ride of this annoying warning? Thanks.
The same thing was happening to me in Colab, where I start the session with
client = Client(n_workers=40, threads_per_worker=2)
I terminated all my Colab sessions, then installed and imported all the Dask libraries:
!pip install dask
!pip install cloudpickle
!pip install 'dask[dataframe]'
!pip install 'dask[complete]'
from dask.distributed import Client
import dask.dataframe as dd
import dask.multiprocessing
Now everything is working fine and I am not facing any issues.
I don't know how this solved my issue :D
I had been struggling with this warning too. I would get many of these warnings and then the workers would die. I was getting them because I had some custom Python functions for aggregating my data that handled large Python objects (dicts), so it makes sense that a lot of time was being spent on garbage collection.
I refactored my code so that more of the computation was done in parallel before the results were aggregated, and the warnings went away.
I looked at the progress chart on the status page of the Dask dashboard to see which tasks were taking a long time to process (Dask tries to name tasks after the function in your code that called them, which can help, but they're not always very descriptive). From there I could figure out which part of my code I needed to optimise.
You can disable garbage collection in Python:
import gc
gc.disable()
I found that it was easier to manage Dask worker memory through periodic use of the Dask client restart: Client.restart()
Just create a separate process to run the Dask cluster and return its address, then create the client using that address (a rough sketch follows).
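A rough sketch of that last suggestion (the worker counts and helper names are my own, not from the answer): start a LocalCluster inside a child process, hand back only the scheduler address, and connect a Client to it; Client.restart() from the previous suggestion can then be called between heavy stages.
import multiprocessing

from dask.distributed import Client, LocalCluster

def run_cluster(address_queue):
    # Start the cluster in this child process and report its scheduler address
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    address_queue.put(cluster.scheduler_address)
    multiprocessing.Event().wait()   # keep the process (and the cluster) alive

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=run_cluster, args=(queue,))
    proc.start()
    scheduler_address = queue.get()

    client = Client(scheduler_address)   # connect from the main process
    # ... submit work; client.restart() between stages reclaims worker memory ...
    client.close()
    proc.terminate()                     # a real setup would shut the cluster down cleanly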

Dask distributed apparently not releasing memory on task completion

I'm trying to execute a custom Dask graph on a distributed system; the thing is that it does not seem to release the memory of finished tasks. Am I doing something wrong?
I've tried changing the number of processes and using a local cluster, but it doesn't seem to make a difference.
from dask.distributed import Client
from dask import get
import pandas as pd

client = Client()

def get_head(df):
    return df.head()

process_big_file_tasks = {f'process-big-file-{i}': (pd.read_csv, '/home/ubuntu/huge_file.csv') for i in range(50)}
return_fragment_tasks = {f'return-fragment-{i}': (get_head, previous_task) for i, previous_task in enumerate(process_big_file_tasks)}

dsk = {
    **process_big_file_tasks,
    **return_fragment_tasks,
    'concat': (pd.concat, list(return_fragment_tasks))
}

client.get(dsk, 'concat')
Since the tasks are independent of each other (or at least the ones that consume the most memory), when each one finishes, its memory should be released.
How do you determine that it isn't releasing memory? I recommend looking at Dask's dashboard to see the structure of the computation, including what has and has not been released. This YouTube video may be helpful:
https://www.youtube.com/watch?v=N_GqzcuGLCY
In particular, I encourage you to watch the Graph tab while running your computation.
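For reference, this is how to find that dashboard for the client created in the question (dashboard_link is a standard attribute of dask.distributed.Client):
from dask.distributed import Client

client = Client()
print(client.dashboard_link)   # typically something like http://127.0.0.1:8787/status
# Open that URL in a browser and watch the Status and Graph tabs while
# client.get(dsk, 'concat') runs to see which tasks are still held in memory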

How can I stream data directly into tensorflow as opposed to reading files on disc?

Every TensorFlow tutorial I've been able to find so far works by first loading the training/validation/test images into memory and then processing them. Does anyone have a guide or recommendations for streaming images and labels as input into TensorFlow? I have a lot of images stored on a different server, and I would like to stream those images into TensorFlow instead of saving them directly on my machine.
Thank you!
TensorFlow does have Queues, which support streaming so you don't have to load the full dataset in memory. However, by default they only support reading from files on the same server. The real problem you have is that you want to load into memory data that lives on some other server. I can think of the following ways to do this:
Expose your images through a REST service. Write your own queueing mechanism in Python, read the data (using urllib or something similar), and feed it to TensorFlow placeholders.
Instead of using Python queues (as above), you can use TensorFlow queues as well (see this answer), although it's slightly more complicated. The advantage is that TensorFlow queues can use multiple cores, giving you better performance than normal Python multi-threaded queues.
Use a network mount to fool your OS into believing the data is on the same machine.
Also, remember that with this sort of distributed setup you will always incur network overhead (the time taken for images to be transferred from server 1 to server 2), which can slow down your training a lot. To counteract this, you would have to build a multi-threaded queueing mechanism with fetch-execute overlap, which is a lot of effort. An easier option, in my opinion, is to just copy the data onto your training machine.
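A rough sketch of the fetch-execute overlap mentioned above; the image URLs, image size, and the training step are hypothetical, and the point is only that a background thread keeps downloading while the main thread consumes batches.
import io
import queue
import threading
import urllib.request

import numpy as np
from PIL import Image

IMAGE_URLS = ['http://image-server.example.com/img/%d.jpg' % i for i in range(1000)]  # hypothetical
BATCH_SIZE = 32
prefetch_queue = queue.Queue(maxsize=4)   # bounds how many batches wait in memory

def fetch_worker():
    # Download, decode, and batch images; block whenever the queue is full
    batch = []
    for url in IMAGE_URLS:
        with urllib.request.urlopen(url) as resp:
            img = Image.open(io.BytesIO(resp.read())).convert('RGB').resize((224, 224))
        batch.append(np.asarray(img, dtype=np.float32) / 255.0)
        if len(batch) == BATCH_SIZE:
            prefetch_queue.put(np.stack(batch))
            batch = []
    prefetch_queue.put(None)              # sentinel: no more data

threading.Thread(target=fetch_worker, daemon=True).start()

while True:
    images_batch = prefetch_queue.get()
    if images_batch is None:
        break
    # The training step would go here, e.g. feeding images_batch into a
    # tf.placeholder via sess.run(train_op, feed_dict={images: images_batch})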
You can use the sockets package in Python to transfer a batch of images and labels from your server to your host. Your graph needs to be defined to take a placeholder as input, and the placeholder must be compatible with your batch size.
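A rough sketch of that approach, assuming (my assumption, not part of the answer) that the server sends one length-prefixed, pickled (images, labels) NumPy batch per connection; the TensorFlow 1.x placeholder setup matches what the answer describes, and the host, port, and model are hypothetical.
import pickle
import socket
import struct

import tensorflow as tf

BATCH_SIZE = 32   # the placeholder shapes must match what the server sends

def receive_batch(host='image-server.example.com', port=9000):   # hypothetical endpoint
    # Read an 8-byte length prefix, then the pickled (images, labels) payload.
    # Only do this with servers you control: unpickling untrusted data is unsafe.
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile('rb')
        (length,) = struct.unpack('!Q', stream.read(8))
        return pickle.loads(stream.read(length))

images = tf.placeholder(tf.float32, shape=[BATCH_SIZE, 224, 224, 3])
labels = tf.placeholder(tf.int64, shape=[BATCH_SIZE])
logits = tf.layers.dense(tf.layers.flatten(images), 10)   # stand-in model
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_images, batch_labels = receive_batch()
    print(sess.run(loss, feed_dict={images: batch_images, labels: batch_labels}))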