I'm hitting what appears to be a deadlock when trying to make use of multiprocessing with pytorch. The equivalent numpy code works like I expect it to.
I've made a simplified version of my code: a pool of 4 workers executing an array-wide broadcast operation 10,000 times (so roughly 2,500 per worker). The array in question is 100,000 x 3, and the broadcast operation is subtraction of all rows by a single 1 x 3 row array. The large array is a shared/global array, and the row array is different at each iteration.
The code works exactly as expected using numpy, with the pooled workers showing a 4x speedup over the equivalent for loop.
The code in pytorch, however, hits a deadlock (I assume): none of the workers complete the array broadcast operation even once.
The numpy code below prints the following:
Finished for loop over my_subtractor: took 8.1504 seconds.
Finished pool over my_subtractor: took 2.2247 seconds.
The pytorch code, on the other hand, prints this then stalls:
Finished for loop over my_subtractor: took 3.1082 seconds.
BLA
BLA
BLA
BLA
"BLA" print statements are just to show that each worker is stuck in -- apparently -- a deadlock state. There are exactly 4 of these: one per worker entering -- and getting stuck in -- an iteration.
If you feel ambitious enough to reproduce this, note that the code doesn't work on Windows because it isn't wrapped in an if __name__ == '__main__': guard (I read somewhere that you need this because of the way Windows launches processes). You will also need to create an empty file called my_globals.py.
Here is the numpy code:
from time import time
import numpy as np
import my_globals
from multiprocessing import Pool as ThreadPool
# shared memory by virtue of being global
my_globals.minuend = np.random.rand(100000,3)
# array to be iterated over in for loop / pool of workers
subtrahends = np.random.rand(10000,3)
# function called at each iteration (broadcast operation)
def my_subtractor(subtrahend):
    my_globals.minuend - subtrahend
    return 0
# launch for loop
ts = time()
for idx, subtrahend in enumerate(subtrahends):
my_subtractor(subtrahend)
te = time()
print('Finished for loop over my_subtractor: took %2.4f seconds.' % (te - ts))
# launch equivalent pool of workers
ts = time()
pool = ThreadPool(4)
pool.map(my_subtractor, subtrahends)
pool.close()
pool.join()
te = time()
print('Finished pool over my_subtractor: took %2.4f seconds.' % (te - ts))
Here is the equivalent pytorch code:
from time import time
import torch
import my_globals
from torch.multiprocessing import Pool as ThreadPool
# necessary on my system because it has low limits for number of file descriptors; not recommended for most systems,
# see: https://pytorch.org/docs/stable/multiprocessing.html#file-descriptor-file-descriptor
torch.multiprocessing.set_sharing_strategy('file_system')
# shared memory by virtue of being global
my_globals.minuend = torch.rand(100000,3)
# array to be iterated over in for loop / pool of workers
subtrahends = torch.rand(10000,3)
# function called at each iteration (broadcast operation)
def my_subtractor(subtrahend, verbose=True):
    if verbose:
        print("BLA")  # -- prints for every worker in the pool (so 4 times total)
    my_globals.minuend - subtrahend
    if verbose:
        print("ALB")  # -- doesn't print for any worker
    return 0
# launch for loop
ts = time()
for idx, subtrahend in enumerate(subtrahends):
my_subtractor(subtrahend, verbose=False)
te = time()
print('Finished for loop over my_subtractor: took %2.4f seconds.' % (te - ts))
# launch equivalent pool of workers
ts = time()
pool = ThreadPool(4)
pool.map(my_subtractor, subtrahends)
pool.close()
pool.join()
te = time()
print('Finished pool over my_subtractor: took %2.4f seconds.' % (te - ts))
You can try setting the OMP_NUM_THREADS=1 environment variable as a quick fix. It helped me with a DataLoader+OpenCV deadlock.
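For example, you can either export the variable in the shell before launching the script (OMP_NUM_THREADS=1 python script.py), or, as a sketch, set it at the very top of the script, before torch loads its OpenMP runtime:

import os
os.environ["OMP_NUM_THREADS"] = "1"  # must happen before importing torch

import torch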
Related
I've been stuck on the problem of speeding up a groupby & apply. Here is the code:
dat = dat.groupby(['glass_id','label','step'])['equip'].apply(lambda x: '_'.join(sorted(list(x)))).reset_index()
which takes a long time as the data size grows.
I've tried rewriting the groupby & apply as a for loop, which didn't work;
then I tried to use unique(), but that still failed to speed up the running time.
I would like an updated version of the code with a shorter run time, and would really appreciate a solution to this problem.
I think you can consider using multiprocessing.
Check the following example:
import multiprocessing
import numpy as np
import pandas as pd

# The function which you use in conjunction with multiprocessing
def loop_many(sub_df):
    grouped_by_KEY_SEQ_and_count = sub_df.groupby(['KEY_SEQ']).agg('count')
    return grouped_by_KEY_SEQ_and_count

# You will use 6 processes (which is configurable) to process the dataframe in parallel
NUMBER_OF_PROCESSES = 6
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)

# Split the dataframe (pre_sale here stands for your input dataframe) into 6 sub-dataframes
df_split = np.array_split(pre_sale, NUMBER_OF_PROCESSES)

# Process the split sub-dataframes with loop_many() on multiple processes
processed_sub_dataframes = pool.map(loop_many, df_split)

# Close the multiprocessing pool
pool.close()
pool.join()

concatenated_sub_dataframes = pd.concat(processed_sub_dataframes).reset_index()
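One caveat: np.array_split divides the dataframe by position, so rows of a single group may end up in different sub-dataframes. If that can happen with your keys, aggregate once more after pd.concat (e.g. repeat the groupby on concatenated_sub_dataframes) to merge the partial results.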
import time
from multiprocessing import Pool, RawArray, sharedctypes
from ctypes import c_int

def init_worker(X):
    print(f"{X}")

def worker_func(i):
    print(f"{X}")
    time.sleep(i)  # Some heavy computations
    return

# We need this check for Windows to prevent infinitely spawning new child
# processes.
if __name__ == '__main__':
    X = sharedctypes.RawValue(c_int)
    X = 3
    with Pool(processes=4, initializer=init_worker, initargs=(X)) as pool:
        pool.map(worker_func, [1, 2, 3, 4])
    print(X)
I am simply trying to print the value of X in each subprocess. This is a toy program to check whether I can share a value and update it using multiple processes.
This program spawns an infinite number of processes because there is no comma after the X in initargs=(X); it should be initargs=(X,). Without the trailing comma, (X) is just X rather than a one-element tuple, so unpacking the initializer arguments fails with an error in every freshly spawned worker.
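For reference, here is a corrected sketch of the toy program. Besides the missing comma, note that X = 3 rebinds the name to a plain int instead of updating the shared value, and that the initializer should store its argument somewhere the workers can see it (here, a module-level global):

import time
from multiprocessing import Pool, sharedctypes
from ctypes import c_int

def init_worker(shared_x):
    global X
    X = shared_x  # make the shared value visible to worker_func

def worker_func(i):
    print(X.value)  # each worker sees the value set by the parent
    time.sleep(i)

if __name__ == '__main__':
    X = sharedctypes.RawValue(c_int)
    X.value = 3  # update the shared value instead of rebinding the name
    with Pool(processes=4, initializer=init_worker, initargs=(X,)) as pool:
        pool.map(worker_func, [1, 2, 3, 4])
    print(X.value)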
I need to process a huge file of around 30GB containing hundreds of millions of rows. More precisely, I want to perform the three following steps:
Reading the file by chunks: given the size of the file, I don't have the memory to read the file in one go;
Computing stuff on the chunks before aggregating each of them to a more manageable size;
Concatenating the aggregated chunks into a final dataset containing the results of my analyses.
So far, I have coded two threads:
One thread in charge of reading the file by chunks and storing the chunks in a Queue (step 1);
One thread in charge of performing the analyses (step 2) on the chunks;
Here is the spirit of my code so far with dummy data:
import queue
from queue import Empty
import threading
import concurrent.futures
import os
import random
import pandas as pd
import time

def process_chunk(df):
    return df.groupby(["Category"])["Value"].sum().reset_index(drop=False)

def producer(queue, event):
    print("Producer: Reading the file by chunks")
    reader = pd.read_table(full_path, sep=";", chunksize=10000, names=["Row", "Category", "Value"])
    for index, chunk in enumerate(reader):
        print(f"Producer: Adding chunk #{index} to the queue")
        queue.put((index, chunk))
        time.sleep(0.2)
    print("Producer: Finished putting chunks")
    event.set()
    print("Producer: Event set")

def consumer(queue, event, result_list):
    # The consumer stops iff the queue is empty AND the event is set
    # <=> The consumer keeps going iff the queue is not empty OR the event is not set
    while not queue.empty() or not event.is_set():
        try:
            index, chunk = queue.get(timeout=1)
        except Empty:  # the parameter shadows the queue module, so the exception class is imported directly
            continue
        print(f"Consumer: Retrieved chunk #{index}")
        print(f"Consumer: Queue size {queue.qsize()}")
        result_list.append(process_chunk(chunk))
        time.sleep(0.1)
    print("Consumer: Finished retrieving chunks")
if __name__=="__main__":
# Record the execution time
start = time.perf_counter()
# Generate a fake file in the current directory if necessary
path = os.path.dirname(os.path.realpath(__file__))
filename = "fake_file.txt"
full_path = os.path.join(path, filename)
if not os.path.exists(full_path):
print("Main: Generate a dummy dataset")
with open(full_path, "w", encoding="utf-8") as f:
for i in range(100000):
value = random.randint(1,101)
category = i%2
f.write(f"{i+1};{value};{category}\n")
# Defining a queue that will store the chunks of the file read by the Producer
queue = queue.Queue(maxsize=5)
# Defining an event that will be set by the Producer when he is done
event = threading.Event()
# Defining a list storing the chunks processed by the Consumer
result_list = list()
# Launch the threads Producer and Consumer
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
executor.submit(producer, queue, event)
executor.submit(consumer, queue, event, result_list)
# Display that the program is finished
print("Main: Consumer & Producer have finished!")
print(f"Main: Number of processed chunks = {len(result_list)}")
print(f"Main: Execution time = {time.perf_counter()-start} seconds")
I know that each iteration of step 1 takes more time than each iteration of step 2, i.e. the Consumer will always be waiting for the Producer.
How can I speed up the process of reading my file by chunks (step 1)?
I tried:
df.groupby('name').agg('count').compute(num_workers=1)
df.groupby('name').agg('count').compute(num_workers=4)
They take the same time; why does num_workers not work?
Thanks
By default, Dask works with multi-threaded tasks, which means it uses a single processor on your computer. (Note that using Dask is nevertheless interesting if you have data that can't fit in memory.)
If you want to use several processors to compute your operation, you have to use a different scheduler:
import time
from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

df = dd.read_csv("data.csv")

def group(num_workers):
    start = time.time()
    res = df.groupby("name").agg("count").compute(num_workers=num_workers)
    end = time.time()
    return res, end - start

# default multi-threaded scheduler
print(group(4))

# start a local cluster of worker processes and route Dask operations through it
clust = LocalCluster()
clt = Client(clust, set_as_default=True)
print(group(4))
Here, I create a local cluster using 4 parallel processes (because I have a quad-core machine) and then set a default scheduling client that will use this local cluster to perform the Dask operations. With a two-column CSV file of 1.5 GB, the standard groupby takes around 35 seconds on my laptop, whereas the multiprocess one only takes around 22 seconds.
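If you want to pin the number of processes explicitly rather than relying on the default (one worker per core), LocalCluster takes an n_workers argument; for example (the values here are just an illustration):

clust = LocalCluster(n_workers=4, threads_per_worker=1)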
I have a pandas DataFrame that has millions of rows, and I have to do row-wise operations. Since I have a multicore CPU, I would like to speed up that process using multiprocessing. The way I would like to do this is to just split the dataframe into equally sized dataframes and process each of them within a separate process. So far so good...
The problem is that my code is written in OOP style, and I get pickle errors when using a multiprocessing Pool. What I do is pass a reference to a class function self.X to the pool. I also use class attributes within X (read access only). I really don't want to switch back to a functional programming style... Hence, is it possible to do multiprocessing in an OOP environment?
It should be possible as long as all elements in your class (that you pass to the sub-processes) are picklable. That is the only thing you have to make sure of. If any elements in your class are not, then you cannot pass it to a Pool. Even if you only pass self.x, everything else like self.y has to be picklable.
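As an illustration, here is a minimal sketch (with a hypothetical RowProcessor class) of passing a bound method to a Pool. In Python 3, a bound method such as proc.process pickles fine as long as the instance itself is picklable:

import multiprocessing as mp

class RowProcessor:
    def __init__(self, factor):
        self.factor = factor  # plain attributes like this pickle without trouble

    def process(self, value):
        # each worker operates on its own pickled copy of the instance,
        # so read-only access to self.factor is safe
        return value * self.factor

if __name__ == "__main__":
    proc = RowProcessor(factor=2)
    with mp.Pool(processes=4) as pool:
        results = pool.map(proc.process, range(10))
    print(results)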
I do my pandas DataFrame processing like this:
import pandas as pd
import multiprocessing as mp
import numpy as np
import time

def worker(in_queue, out_queue):
    for row in iter(in_queue.get, 'STOP'):
        value = (row[1] * row[2] / row[3]) + row[4]
        time.sleep(0.1)
        out_queue.put((row[0], value))

if __name__ == "__main__":
    # fill a DataFrame
    df = pd.DataFrame(np.random.randn(100000, 4), columns=list('ABCD'))

    in_queue = mp.Queue()
    out_queue = mp.Queue()

    # set up workers
    numProc = 2
    process = [mp.Process(target=worker, args=(in_queue, out_queue))
               for x in range(numProc)]

    # run processes
    for p in process:
        p.start()

    # iterator over rows
    it = df.itertuples()

    # fill the queue and get data;
    # the code fills the queue until a new element is available in the output,
    # and filling blocks if no slot is available in the in_queue
    for i in range(len(df)):
        while out_queue.empty():
            # fill the queue
            try:
                row = next(it)
                # row = (index, A, B, C, D) tuple
                in_queue.put((row[0], row[1], row[2], row[3], row[4]), block=True)
            except StopIteration:
                break
        row_data = out_queue.get()
        df.loc[row_data[0], "Result"] = row_data[1]

    # signal the processes to stop
    for p in process:
        in_queue.put('STOP')

    # wait for processes to finish
    for p in process:
        p.join()
This way I do not have to pass big chunks of DataFrames and I do not have to think about picklable elements in my class.