multiprocessing code gets stuck - python-multiprocessing

I am using Python 2.7 on Windows 7 and am currently trying to learn parallel processing.
I downloaded the multiprocessing 2.6.2.1 Python package and installed it using pip.
When I try to run the following very simple code, the program seems to get stuck: even after one hour it has not exited, although the code is trivial.
What am I missing? Thank you very much.
from multiprocessing import Pool
def f(x):
    return x*x
array = [1, 2, 3, 4, 5]
p = Pool()
result = p.map(f, array)
p.close()
p.join()
print result

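The issue here is the way multiprocessing works on Windows. Think of it as Python opening a new instance and importing all the modules all over again: Windows has no fork(), so each worker process re-imports your script, and any unguarded top-level code (including the Pool creation) runs again in every child. You'll want to use the if __name__ == '__main__' convention. The following works fine: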
import multiprocessing
def f(x):
    return x * x
def main():
    p = multiprocessing.Pool(multiprocessing.cpu_count())
    result = p.imap(f, xrange(1, 6))
    print list(result)
if __name__ == '__main__':
    main()
I have changed a few other parts of the code too so you can see other ways to achieve the same thing, but ultimately you only need to stop the code from executing over and over as Python re-imports the script you are running.
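For readers on Python 3, the same guard pattern can be sketched as follows. This is a minimal illustration rather than the answerer's code: the names square and squares and the workers parameter are introduced here, and the pool is used as a context manager, which the Python 3 standard library supports.

```python
from multiprocessing import Pool


def square(x):
    """Worker function; must live at module level so child processes can import it."""
    return x * x


def squares(values, workers=2):
    """Map square over values using a small pool of worker processes."""
    # Leaving the with-block shuts the pool down once map() has returned.
    with Pool(workers) as pool:
        return pool.map(square, values)


if __name__ == '__main__':
    # Only the parent process runs this; re-imports in the children skip it.
    print(squares([1, 2, 3, 4, 5]))
```

Without the guard, the children would re-run the pool creation on import, which is the hang described in the question.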

Related

How to run multiple functions synchronous in Jupyter Notebook?

I am trying to run multiple functions at the same time in a Jupyter Notebook.
I have two web scraping functions that use Selenium and run indefinitely, each continuously producing an updated DataFrame. Another function merges the two DataFrames and does some calculations.
Because the data changes constantly and the calculations on the two DataFrames need to happen within the same second (both DataFrames update every 5 seconds), I wonder how I can run all the functions at the same time.
As my code is mainly web scraping, I use the simplified outline below to describe my goal and hopefully make it more readable. I already tried using 'multiprocessing', but it just does not do anything in the notebook.
import time
def FirstWebScraping():
    while True:
        time.sleep(5)
        # getting all data for DataFrame
def SecondWebScraping():
    while True:
        time.sleep(5)
        # getting all data for DataFrame
def Calculations():
    while True:
        # merging DataFrame from First- and SecondWebScraping
        # doing calculations
        # running this function infinite and looking for specific values
        pass
# Goal
def run_all_at_the_same_time():
    FirstWebScraping()
    SecondWebScraping()
    Calculations()
Even though threading does not offer the same benefits as multiprocessing, it worked for me together with Selenium. I put a waiting time at the beginning of the Calculations function, and from there all three functions looped indefinitely.
from threading import Thread
if __name__ == '__main__':
    Thread(target=FirstWebScraping).start()
    Thread(target=SecondWebScraping).start()
    Thread(target=Calculations).start()
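Instead of a fixed waiting time before the calculations, the threads could signal when their first result is available. The following is only a sketch under stated assumptions: scrape is a hypothetical stand-in for one scraping pass (no Selenium involved), plain integers stand in for the DataFrames, and the names latest, first_ready and second_ready are introduced here for illustration.

```python
import threading

latest = {}                        # most recent result per scraper
lock = threading.Lock()            # protects access to `latest`
first_ready = threading.Event()    # set once the first scraper has data
second_ready = threading.Event()   # set once the second scraper has data


def scrape(name, ready, value):
    """Hypothetical stand-in for one scraping pass: store a result, then signal."""
    with lock:
        latest[name] = value
    ready.set()


def calculate():
    """Wait until both scrapers have produced data, then combine the results."""
    first_ready.wait(timeout=5)
    second_ready.wait(timeout=5)
    with lock:
        return latest["first"] + latest["second"]


def run_once():
    """Start both scrapers in threads and run one calculation."""
    threads = [
        threading.Thread(target=scrape, args=("first", first_ready, 1)),
        threading.Thread(target=scrape, args=("second", second_ready, 2)),
    ]
    for t in threads:
        t.start()
    result = calculate()
    for t in threads:
        t.join()
    return result


if __name__ == '__main__':
    print(run_once())
```

In the real loop, each scraper would update its entry every 5 seconds and the calculation would read both entries under the same lock, so it always sees a consistent pair.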
You can run multiprocessing in Jupyter if you follow two rules:
1. Put the worker functions in a separate module.
2. Protect the main process-only code with if __name__ == '__main__':
Assuming your three functions are moved to worker.py:
import multiprocessing as mp
import worker
if __name__ == '__main__':
    mp.Process(target=worker.FirstWebScraping).start()
    mp.Process(target=worker.SecondWebScraping).start()
    mp.Process(target=worker.Calculations).start()

Pyinstaller, Multiprocessing, and Pandas - No such file/directory [duplicate]

Python v3.5, Windows 10
I'm using multiple processes and trying to capture user input. From everything I have found while searching, odd things happen when input() is used with multiple processes. After 8+ hours of trying, nothing I implemented has worked. I'm positive I am doing it wrong, but I can't for the life of me figure it out.
The following is a very stripped-down program that demonstrates the issue. It works fine when I run it within PyCharm, but when I use PyInstaller to create a single executable it fails: the program is stuck in a loop, constantly asking the user to enter something.
From what I've read, I am pretty sure it has to do with how Windows handles standard input. I've also tried passing the user input variables to the functions as Queue() items, with the same result. I read that input() should be called in the main Python process, so I did that under if __name__ == '__main__':
from multiprocessing import Process
import time
def func_1(duration_1):
    while duration_1 >= 0:
        time.sleep(1)
        print('Duration_1: %d %s' % (duration_1, 's'))
        duration_1 -= 1
def func_2(duration_2):
    while duration_2 >= 0:
        time.sleep(1)
        print('Duration_2: %d %s' % (duration_2, 's'))
        duration_2 -= 1
if __name__ == '__main__':
    # func_1 user input
    while True:
        duration_1 = input('Enter a positive integer.')
        if duration_1.isdigit():
            duration_1 = int(duration_1)
            break
        else:
            print('**Only positive integers accepted**')
            continue
    # func_2 user input
    while True:
        duration_2 = input('Enter a positive integer.')
        if duration_2.isdigit():
            duration_2 = int(duration_2)
            break
        else:
            print('**Only positive integers accepted**')
            continue
    p1 = Process(target=func_1, args=(duration_1,))
    p2 = Process(target=func_2, args=(duration_2,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
You need to use multiprocessing.freeze_support() when you produce a Windows executable with PyInstaller.
Straight out from the docs:
multiprocessing.freeze_support()
Add support for when a program which uses multiprocessing has been frozen to produce a Windows executable. (Has been tested with py2exe, PyInstaller and cx_Freeze.)
One needs to call this function straight after the if __name__ == '__main__' line of the main module. For example:
from multiprocessing import Process, freeze_support
def f():
    print('hello world!')
if __name__ == '__main__':
    freeze_support()
    Process(target=f).start()
If the freeze_support() line is omitted then trying to run the frozen executable will raise RuntimeError.
Calling freeze_support() has no effect when invoked on any operating system other than Windows. In addition, if the module is being run normally by the Python interpreter on Windows (the program has not been frozen), then freeze_support() has no effect.
In your example you also have unnecessary code duplication you should tackle.
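The duplicated input loops could be folded into one helper. This is a minimal sketch, not part of the original answer: the name read_duration and the injectable read parameter are introduced here for illustration. Like the original loops, it accepts 0 as well as positive integers.

```python
def read_duration(prompt, read=input):
    """Keep asking until the user enters a non-negative integer, then return it."""
    while True:
        text = read(prompt)
        try:
            value = int(text)
        except ValueError:
            value = -1  # treat non-numeric input like any other invalid entry
        if value >= 0:
            return value
        print('**Only positive integers accepted**')


# In the question's program, inside the if __name__ == '__main__': block
# and after freeze_support(), the two loops would shrink to:
#     duration_1 = read_duration('Enter a positive integer.')
#     duration_2 = read_duration('Enter a positive integer.')
```

Passing read as a parameter also makes the helper easy to exercise without a console, for example by feeding it a list of canned answers.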

Passing a Queue with concurrent.futures regardless of executor type

Working up from threads to processes, I have switched to concurrent.futures, and would like to gain/retain flexibility in switching between a ThreadPoolExecutor and a ProcessPoolExecutor for various scenarios. However, despite the promise of a unified facade, I am having a hard time passing multiprocessing Queue objects as arguments on the futures.submit() when I switch to using a ProcessPoolExecutor:
import multiprocessing as mp
import concurrent.futures
def foo(q):
    q.put('hello')
if __name__ == '__main__':
    executor = concurrent.futures.ProcessPoolExecutor()
    q = mp.Queue()
    p = executor.submit(foo, q)
    p.result()
    print(q.get())
bumps into the following exception coming from multiprocessing's code:
RuntimeError: Queue objects should only be shared between processes through inheritance
which I believe means it doesn't like receiving the queue as an argument, but rather expects the child process to "inherit" it (not in any OOP sense) when multiprocessing forks.
The twist is that with bare-bones multiprocessing, that is, when not going through the concurrent.futures facade, there seems to be no such limitation, as the following code works seamlessly:
import multiprocessing as mp
def foo(q):
    q.put('hello')
if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=foo, args=(q,))
    p.start()
    p.join()
    print(q.get())
I wonder what I am missing here: how can I make the ProcessPoolExecutor accept the queue as an argument when using concurrent.futures, the same way it does with the ThreadPoolExecutor or with multiprocessing used directly as shown above?
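One commonly used workaround, which does not come from the question itself, is a Manager-backed queue. A plain mp.Queue can be handed over only while a child process is being started, whereas executor.submit() pickles its arguments for workers that already exist. The proxy returned by Manager().Queue() can be pickled, so it can be submitted like any other argument. The sketch below assumes a helper named run, introduced here for illustration:

```python
import concurrent.futures
import multiprocessing as mp
import queue


def foo(q):
    q.put('hello')


def run(executor_cls, q):
    """Submit foo(q) to the given executor type and return what it put on q."""
    with executor_cls(max_workers=1) as executor:
        executor.submit(foo, q).result()
    return q.get(timeout=5)


if __name__ == '__main__':
    with mp.Manager() as manager:
        # The manager's queue proxy survives pickling, so a process pool accepts it.
        print(run(concurrent.futures.ProcessPoolExecutor, manager.Queue()))
    # Threads share memory, so an ordinary queue.Queue is enough here.
    print(run(concurrent.futures.ThreadPoolExecutor, queue.Queue()))
```

The trade-off is that every operation on a manager queue goes through the manager's server process, so it is slower than a plain mp.Queue.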

Parallelizing apply function in pandas taking longer than expected

I have a simple cleaner function which removes special characters from a dataframe (along with other preprocessing). My dataset is huge and I want to use multiprocessing to improve performance. My idea was to break the dataset into chunks and run the cleaner function on each of them in parallel.
I tried the dask library and also Python's multiprocessing module. However, the application seems to get stuck and takes longer than running on a single core.
This is my code:
import numpy as np
import pandas as pd
from multiprocessing import Pool
def parallelize_dataframe(df, func):
    # num_partitions and num_cores are not shown in the question
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
def process_columns(data):
    for i in data.columns:
        data[i] = data[i].apply(cleaner_func)
    return data
mydf2 = parallelize_dataframe(mydf, process_columns)
I can see from the resource monitor that all cores are being used, but as I said before, the application is stuck.
P.S.
I ran this on Windows Server 2012 (where the issue happens). Running the same code in a Unix environment, I was actually able to see some benefit from the multiprocessing library.
Thanks in advance.

How to Reload a Python3 C extension module?

I wrote a C extension (mycext.c) for Python 3.2. The extension relies on constant data stored in a C header (myconst.h). The header file is generated by a Python script. In the same script, I make use of the recently compiled module. The workflow in the Python3 myscript (not shown completely) is as follows:
configure_C_header_constants()
write_constants_to_C_header() # write myconst.h
os.system('python3 setup.py install --user') # compile mycext
import mycext
mycext.do_stuff()
This works perfectly fine the first time in a Python session. If I repeat the procedure in the same session (for example, in two different test cases of a unittest), the first compiled version of mycext is always (re)loaded.
How do I effectively reload an extension module with the latest compiled version?
You can reload modules in Python 3.x by using the imp.reload() function (importlib.reload() since Python 3.4; the imp module was removed in Python 3.12). This function used to be a built-in in Python 2.x. Be sure to read the documentation -- there are a few caveats!
Python's import mechanism will never dlclose() a shared library. Once loaded, the library will stay until the process terminates.
Your options (sorted by decreasing usefulness):
1. Move the module import to a subprocess, and call the subprocess again after recompiling, i.e. you have a Python script do_stuff.py that simply does
    import mycext
    mycext.do_stuff()
and you call this script using
    subprocess.call([sys.executable, "do_stuff.py"])
2. Turn the compile-time constants in your header into variables that can be changed from Python, eliminating the need to reload the module.
3. Manually dlclose() the library after deleting all references to the module (a bit fragile since you don't hold all the references yourself).
4. Roll your own import mechanism.
Here is an example of how this can be done. I wrote a minimal Python C extension mini.so, only exporting an integer called version.
>>> import ctypes
>>> libdl = ctypes.CDLL("libdl.so")
>>> libdl.dlclose.argtypes = [ctypes.c_void_p]
>>> so = ctypes.PyDLL("./mini.so")
>>> so.PyInit_mini.argtypes = []
>>> so.PyInit_mini.restype = ctypes.py_object
>>> mini = so.PyInit_mini()
>>> mini.version
1
>>> del mini
>>> libdl.dlclose(so._handle)
0
>>> del so
At this point, I incremented the version number in mini.c and recompiled.
>>> so = ctypes.PyDLL("./mini.so")
>>> so.PyInit_mini.argtypes = []
>>> so.PyInit_mini.restype = ctypes.py_object
>>> mini = so.PyInit_mini()
>>> mini.version
2
You can see that the new version of the module is used.
For reference and experimenting, here's mini.c:
#include <Python.h>
static struct PyModuleDef minimodule = {
    PyModuleDef_HEAD_INIT, "mini", NULL, -1, NULL
};
PyMODINIT_FUNC
PyInit_mini(void)
{
    PyObject *m = PyModule_Create(&minimodule);
    PyModule_AddObject(m, "version", PyLong_FromLong(1));
    return m;
}
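The first option listed above (moving the import into a subprocess) can be sketched without a separate do_stuff.py file by passing the code to a fresh interpreter with -c. This is an illustrative sketch: the helper name run_in_fresh_interpreter is introduced here, and the standard-library math module stands in for the real extension so that the snippet runs anywhere.

```python
import subprocess
import sys


def run_in_fresh_interpreter(code):
    """Run a code snippet in a new Python process and return its standard output."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


if __name__ == '__main__':
    # With the real extension this would be "import mycext; mycext.do_stuff()".
    print(run_in_fresh_interpreter("import math; print(math.sqrt(16))"))
```

Because every call starts a new process, the shared library is loaded from disk each time, so a recompiled extension is picked up without any dlclose() handling.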
There is another way: give the module a new name, import it under that name, and rebind your references to it.
Update: I have now created a Python library around this approach:
https://github.com/bergkvist/creload
https://pypi.org/project/creload/
Rather than using the subprocess module in Python, you can use multiprocessing. This allows the child process to inherit all of the memory from the parent (on UNIX-systems).
For this reason, you also need to be careful not to import the C extension module into the parent.
If you return a value that depends on the C extension, it might also force the C extension to become imported in the parent as it receives the return-value of the function.
import multiprocessing as mp
import sys
def subprocess_call(fn, *args, **kwargs):
    """Executes a function in a forked subprocess"""
    ctx = mp.get_context('fork')
    q = ctx.Queue(1)
    is_error = ctx.Value('b', False)
    def target():
        try:
            q.put(fn(*args, **kwargs))
        except BaseException as e:
            is_error.value = True
            q.put(e)
    ctx.Process(target=target).start()
    result = q.get()
    if is_error.value:
        raise result
    return result
def my_c_extension_add(x, y):
    assert 'my_c_extension' not in sys.modules.keys()
    # ^ Sanity check, to make sure you didn't import it in the parent process
    import my_c_extension
    return my_c_extension.add(x, y)
print(subprocess_call(my_c_extension_add, 3, 4))
If you want to extract this into a decorator for a more natural feel, you can do:
class subprocess:
    """Decorate a function to hint that it should be run in a forked subprocess"""
    def __init__(self, fn):
        self.fn = fn
    def __call__(self, *args, **kwargs):
        return subprocess_call(self.fn, *args, **kwargs)
@subprocess
def my_c_extension_add(x, y):
    assert 'my_c_extension' not in sys.modules.keys()
    # ^ Sanity check, to make sure you didn't import it in the parent process
    import my_c_extension
    return my_c_extension.add(x, y)
print(my_c_extension_add(3, 4))
This can be useful if you are working in a Jupyter notebook, and you want to rerun some function without rerunning all your existing cells.
Notes
This answer might only be relevant on Linux/macOS where you have a fork() system call:
Python multiprocessing linux windows difference
https://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/