Why is my multiprocessing program spawning processes infinitely?

import time
from multiprocessing import Pool, RawArray, sharedctypes
from ctypes import c_int

def init_worker(X):
    print(f"{X}")

def worker_func(i):
    print(f"{X}")
    time.sleep(i)  # Some heavy computations
    return

# We need this check for Windows to prevent infinitely spawning new child
# processes.
if __name__ == '__main__':
    X = sharedctypes.RawValue(c_int)
    X = 3
    with Pool(processes=4, initializer=init_worker, initargs=(X)) as pool:
        pool.map(worker_func, [1, 2, 3, 4])
    print(X)
I am simply trying to print the value of X in each subprocess. This is a toy program to check whether I can share a value and update it using multiple processes.

This program spawns an infinite number of processes because there is no comma after X in initargs=(X); it should be initargs=(X,). Without the trailing comma, (X) is not a one-element tuple but simply X itself, so the pool cannot unpack it as the initializer's argument list. Every worker then fails during initialization, and the pool keeps replacing the dead workers, which shows up as processes being spawned endlessly.
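For reference, here is a minimal corrected sketch of the same toy program (assuming the goal is only to make the shared value visible in every worker): the trailing comma turns initargs into a tuple, the initializer stores the shared value in a module-level global so worker_func can see it, and the integer is written through .value instead of rebinding the name X.

import time
from multiprocessing import Pool, sharedctypes
from ctypes import c_int

X = None  # set by init_worker in every worker process

def init_worker(shared_x):
    global X
    X = shared_x  # keep a reference to the shared value

def worker_func(i):
    print(X.value)  # read the shared value inside the child
    time.sleep(i)   # some heavy computations

if __name__ == '__main__':
    X = sharedctypes.RawValue(c_int)
    X.value = 3  # update through .value, not X = 3
    with Pool(processes=4, initializer=init_worker, initargs=(X,)) as pool:
        pool.map(worker_func, [1, 2, 3, 4])
    print(X.value)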

Related

Enhancing performance of pandas groupby&apply

These days I have been stuck on the problem of speeding up a groupby & apply. Here is the code:
dat = dat.groupby(['glass_id','label','step'])['equip'].apply(lambda x:'_'.join(sorted(list(x)))).reset_index()
which takes a long time as the data size grows.
I have tried rewriting the groupby & apply as a plain for loop, which did not help;
then I tried using unique(), but that still failed to reduce the running time.
I would like an updated version of the code with a shorter run time, and I would be very grateful for a solution to this problem.

I think you can consider using multiprocessing.
Check the following example:
import multiprocessing
import numpy as np
import pandas as pd

# The function which you use in conjunction with multiprocessing
def loop_many(sub_df):
    grouped_by_KEY_SEQ_and_count = sub_df.groupby(['KEY_SEQ']).agg('count')
    return grouped_by_KEY_SEQ_and_count

# You will use 6 processes (which is configurable) to process the dataframe in parallel
NUMBER_OF_PROCESSES = 6
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
# Split the dataframe into 6 sub-dataframes
df_split = np.array_split(pre_sale, NUMBER_OF_PROCESSES)
# Process the split sub-dataframes with loop_many() on multiple processes
processed_sub_dataframes = pool.map(loop_many, df_split)
# Close the multiprocessing pool
pool.close()
pool.join()
concatenated_sub_dataframes = pd.concat(processed_sub_dataframes).reset_index()
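Adapted to the question's groupby, a sketch could look like the following (join_equips is a hypothetical helper name, and dat is assumed to be the question's dataframe with the columns glass_id, label, step and equip). The chunks are split along glass_id values rather than by row position, so that no group is cut across two chunks; otherwise the per-chunk results would be incomplete.

import multiprocessing
import numpy as np
import pandas as pd

def join_equips(sub_df):
    # The question's per-group operation, applied to one chunk
    return (sub_df.groupby(['glass_id', 'label', 'step'])['equip']
                  .apply(lambda x: '_'.join(sorted(x)))
                  .reset_index())

if __name__ == '__main__':
    NUMBER_OF_PROCESSES = 6
    # Split on glass_id values so every group stays within a single chunk
    id_chunks = np.array_split(dat['glass_id'].unique(), NUMBER_OF_PROCESSES)
    chunks = [dat[dat['glass_id'].isin(ids)] for ids in id_chunks]
    with multiprocessing.Pool(processes=NUMBER_OF_PROCESSES) as pool:
        results = pool.map(join_equips, chunks)
    dat = pd.concat(results, ignore_index=True)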

Pyinstaller, Multiprocessing, and Pandas - No such file/directory [duplicate]

Python v3.5, Windows 10
I'm using multiple processes and trying to capture user input. From everything I've searched, there are odd things that happen when using input() with multiple processes. After more than 8 hours of trying, nothing I implemented worked; I'm positive I am doing it wrong, but I can't for the life of me figure it out.
The following is a very stripped-down program that demonstrates the issue. It works fine when I run it within PyCharm, but when I use PyInstaller to create a single executable it fails: the program gets stuck in a loop, constantly asking the user to enter something.
I am pretty sure it has to do with how Windows takes in standard input, from things I've read. I've also tried passing the user-input variables as Queue() items to the functions, but I hit the same issue. I read that you should put input() in the main Python process, so I did that under if __name__ == '__main__':
from multiprocessing import Process
import time

def func_1(duration_1):
    while duration_1 >= 0:
        time.sleep(1)
        print('Duration_1: %d %s' % (duration_1, 's'))
        duration_1 -= 1

def func_2(duration_2):
    while duration_2 >= 0:
        time.sleep(1)
        print('Duration_2: %d %s' % (duration_2, 's'))
        duration_2 -= 1

if __name__ == '__main__':
    # func_1 user input
    while True:
        duration_1 = input('Enter a positive integer.')
        if duration_1.isdigit():
            duration_1 = int(duration_1)
            break
        else:
            print('**Only positive integers accepted**')
            continue
    # func_2 user input
    while True:
        duration_2 = input('Enter a positive integer.')
        if duration_2.isdigit():
            duration_2 = int(duration_2)
            break
        else:
            print('**Only positive integers accepted**')
            continue
    p1 = Process(target=func_1, args=(duration_1,))
    p2 = Process(target=func_2, args=(duration_2,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
You need to use multiprocessing.freeze_support() when you produce a Windows executable with PyInstaller.
Straight out from the docs:
multiprocessing.freeze_support()
Add support for when a program which uses multiprocessing has been frozen to produce a Windows executable. (Has been tested with py2exe, PyInstaller and cx_Freeze.)
One needs to call this function straight after the if __name__ == '__main__' line of the main module. For example:
from multiprocessing import Process, freeze_support

def f():
    print('hello world!')

if __name__ == '__main__':
    freeze_support()
    Process(target=f).start()
If the freeze_support() line is omitted then trying to run the frozen executable will raise RuntimeError.
Calling freeze_support() has no effect when invoked on any operating system other than Windows. In addition, if the module is being run normally by the Python interpreter on Windows (the program has not been frozen), then freeze_support() has no effect.
In your example you also have unnecessary code duplication you should tackle.
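Putting both points together, your program might end up looking roughly like this sketch (countdown and ask_positive_int are illustrative names, not part of your original code; they fold the two duplicated input loops and the two near-identical worker functions into one each, and add the freeze_support() call):

from multiprocessing import Process, freeze_support
import time

def countdown(name, duration):
    while duration >= 0:
        time.sleep(1)
        print('%s: %d s' % (name, duration))
        duration -= 1

def ask_positive_int(prompt):
    # Keep input() in the main process, as you already do
    while True:
        value = input(prompt)
        if value.isdigit():
            return int(value)
        print('**Only positive integers accepted**')

if __name__ == '__main__':
    freeze_support()  # needed when frozen into a Windows executable
    duration_1 = ask_positive_int('Enter a positive integer.')
    duration_2 = ask_positive_int('Enter a positive integer.')
    p1 = Process(target=countdown, args=('Duration_1', duration_1))
    p2 = Process(target=countdown, args=('Duration_2', duration_2))
    p1.start()
    p2.start()
    p1.join()
    p2.join()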

Passing a Queue with concurrent.futures regardless of executor type

Working up from threads to processes, I have switched to concurrent.futures and would like to gain/retain the flexibility of switching between a ThreadPoolExecutor and a ProcessPoolExecutor for various scenarios. However, despite the promise of a unified facade, I am having a hard time passing multiprocessing Queue objects as arguments to submit() when I switch to using a ProcessPoolExecutor:
import multiprocessing as mp
import concurrent.futures

def foo(q):
    q.put('hello')

if __name__ == '__main__':
    executor = concurrent.futures.ProcessPoolExecutor()
    q = mp.Queue()
    p = executor.submit(foo, q)
    p.result()
    print(q.get())
bumps into the following exception coming from multiprocessing's code:
RuntimeError: Queue objects should only be shared between processes through inheritance
which I believe means it doesn't like receiving the queue as an argument, but rather expects the child process to (not in any OOP sense) "inherit" it when multiprocessing forks, rather than have it passed in.
The twist is that with bare-bones multiprocessing, i.e. when not going through the facade that concurrent.futures provides, there seems to be no such limitation, as the following code works seamlessly:
import multiprocessing as mp

def foo(q):
    q.put('hello')

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=foo, args=(q,))
    p.start()
    p.join()
    print(q.get())
I wonder what I am missing here: how can I make the ProcessPoolExecutor accept the queue as an argument with concurrent.futures, the same way it does with a ThreadPoolExecutor or with multiprocessing used directly as shown right above?
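One workaround that is commonly used for this (a sketch, not part of the original post): a queue created via multiprocessing.Manager can be passed as a normal argument, because the workers receive a picklable proxy to the managed queue rather than the queue object itself.

import multiprocessing as mp
import concurrent.futures

def foo(q):
    q.put('hello')

if __name__ == '__main__':
    with mp.Manager() as manager:
        q = manager.Queue()  # proxy object, safe to pickle and send to workers
        with concurrent.futures.ProcessPoolExecutor() as executor:
            future = executor.submit(foo, q)
            future.result()  # re-raises any exception from the worker
        print(q.get())       # prints 'hello'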

multiprocessing code gets stuck

I am using python 2.7 on windows 7 and I am currently trying to learn parallel processing.
I downloaded the multiprocessing 2.6.2.1 python package and installed it using pip.
When I try to run the following very simple code, the program seems to get stuck: even after one hour it has not finished executing, despite the code being extremely simple.
What am I missing? Thank you very much.
from multiprocessing import Pool

def f(x):
    return x*x

array = [1, 2, 3, 4, 5]
p = Pool()
result = p.map(f, array)
p.close()
p.join()
print result
The issue here is the way multiprocessing works. Think of it as python opening a new instance and importing all the modules all over again. You'll want to use the if __name__ == '__main__' convention. The following works fine:
import multiprocessing

def f(x):
    return x * x

def main():
    p = multiprocessing.Pool(multiprocessing.cpu_count())
    result = p.imap(f, xrange(1, 6))
    print list(result)

if __name__ == '__main__':
    main()
I have changed a few other parts of the code too so you can see other ways to achieve the same thing, but ultimately you only need to stop the code executing over and over as python re-imports the code you are running.
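For comparison, a minimal sketch that keeps your original snippet as it is and only adds the guard (Python 2 syntax, matching the question):

from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    array = [1, 2, 3, 4, 5]
    p = Pool()
    result = p.map(f, array)
    p.close()
    p.join()
    print result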

Multiprocessing with class functions and class attributes

I have a pandas DataFrame that has millions of rows, and I have to do row-wise operations. Since I have a multicore CPU, I would like to speed up that process using multiprocessing. The way I would like to do this is to just split up the dataframe into equally sized dataframes and process each of them within a separate process. So far so good...
The problem is that my code is written in OOP style and I get pickle errors when using a multiprocessing Pool. What I do is pass a reference to a class method self.X to the pool. I further use class attributes within X (read access only). I really don't want to switch back to a functional programming style... Hence, is it possible to do multiprocessing in an OOP environment?

It should be possible as long as all elements in your class (that you pass to the sub-processes) are picklable. That is the only thing you have to make sure of. If there are any elements in your class that are not, then you cannot pass it to a Pool. Even if you only pass self.x, everything else like self.y has to be picklable.
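To illustrate that point (a sketch, not from the original answer): on Python 3, a bound method such as self.X can be pickled as long as the instance itself is picklable, so it can be handed to Pool.map directly. RowProcessor and its attributes below are hypothetical names used only for the example.

import multiprocessing as mp
import numpy as np
import pandas as pd

class RowProcessor:
    def __init__(self, factor):
        self.factor = factor  # read-only attribute used inside the workers

    def process_chunk(self, chunk):
        # row-wise work on one sub-dataframe
        return chunk['A'] * self.factor + chunk['B']

    def run(self, df, n_procs=4):
        chunks = np.array_split(df, n_procs)
        with mp.Pool(n_procs) as pool:
            # self.process_chunk is pickled together with self,
            # so self.factor travels to the workers as well
            results = pool.map(self.process_chunk, chunks)
        return pd.concat(results)

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randn(100000, 2), columns=list('AB'))
    out = RowProcessor(factor=2.0).run(df)
    print(out.head())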
I do my pandas DataFrame processing like this:
import pandas as pd
import multiprocessing as mp
import numpy as np
import time

def worker(in_queue, out_queue):
    for row in iter(in_queue.get, 'STOP'):
        value = (row[1] * row[2] / row[3]) + row[4]
        time.sleep(0.1)
        out_queue.put((row[0], value))

if __name__ == "__main__":
    # fill a DataFrame
    df = pd.DataFrame(np.random.randn(100000, 4), columns=list('ABCD'))

    in_queue = mp.Queue()
    out_queue = mp.Queue()

    # set up workers
    numProc = 2
    process = [mp.Process(target=worker,
                          args=(in_queue, out_queue)) for x in range(numProc)]

    # run processes
    for p in process:
        p.start()

    # iterator over rows
    it = df.itertuples()

    # fill queue and get data
    # code fills the queue until a new element is available in the output
    # fill blocks if no slot is available in the in_queue
    for i in range(len(df)):
        while out_queue.empty():
            # fill the queue
            try:
                row = next(it)
                in_queue.put((row[0], row[1], row[2], row[3], row[4]),
                             block=True)  # row = (index, A, B, C, D) tuple
            except StopIteration:
                break
        row_data = out_queue.get()
        df.loc[row_data[0], "Result"] = row_data[1]

    # signal the processes to stop
    for p in process:
        in_queue.put('STOP')

    # wait for processes to finish
    for p in process:
        p.join()
This way I do not have to pass big chunks of DataFrames and I do not have to think about picklable elements in my class.