How do you "break" a numpy operation?

I accidentally tried to make a 200,000 x 200,000 array in numpy. Control-C does not seem to break the operation. Is there any way to stop the array creation without simply killing the python session?

Unfortunately, no. Python (and MATLAB, the only other analysis package I use) does not detect the user interrupt until the current operation (not the whole run) is finished.

The reason this doesn't work is that numpy has large portions written in C.
When Python begins executing a compiled function, the Python signal handling is effectively paused until the execution of the compiled code is finished.
This is bad news for your interactive Python session, but there isn't much you can do aside from waiting for the inevitable MemoryError, or killing the session.
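As a rough illustration (this guard is my own suggestion, not something numpy provides), you can estimate the allocation size before the call ever reaches numpy's C code, so you never end up stuck inside a call that Ctrl-C can't interrupt:
import numpy as np

shape = (200000, 200000)
nbytes = shape[0] * shape[1] * np.dtype(np.float64).itemsize
print("requested: %.0f GB" % (nbytes / 1e9))     # ~320 GB for float64

limit = 8e9                                      # arbitrary safety threshold for this sketch
if nbytes > limit:
    print("refusing to allocate %.0f GB" % (nbytes / 1e9))
else:
    arr = np.zeros(shape)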

keep getting "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time..."

I keep getting the "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time recently" warning message after my Dask code finishes. I am using Dask for a large seismic data computation. After the computation, I write the computed data to disk; the writing takes much longer than the computing. Before writing the data to disk, I call client.close(), which I assumed meant I was done with Dask. But the "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time recently" warning keeps coming. During the computation I got this warning 3-4 times, but while writing the data to disk I get the warning every second. How can I get rid of this annoying warning? Thanks.
The same thing was happening for me in Colab, where I start the session with:
client = Client(n_workers=40, threads_per_worker=2)
I terminated all my Colab sessions and then reinstalled and imported all the Dask libraries:
!pip install dask
!pip install cloudpickle
!pip install 'dask[dataframe]'
!pip install 'dask[complete]'
from dask.distributed import Client
import dask.dataframe as dd
import dask.multiprocessing
Now everything is working fine and I am not facing any issues.
I don't know how this solved my issue :D
I had been struggling with this warning too. I would get many of these warnings and then the workers would die. I was getting them because I had some custom Python functions for aggregating my data together that were handling large Python objects (dicts). It makes sense that so much time was being spent on garbage collection if I was creating these large objects.
I refactored my code so that more computation was done in parallel before the results were aggregated together, and the warnings went away.
I looked at the progress chart on the status page of the Dask dashboard to see which tasks were taking a long time to process (Dask tries to name the tasks after the function in your code that called them, which can help, but they're not always that descriptive). From there I could figure out which part of my code I needed to optimise.
You can disable garbage collection in Python:
import gc
gc.disable()
I found that it was easier to manage Dask worker memory through periodic use of the Dask client restart: Client.restart()
Just create a separate process to run the Dask cluster and return its IP address, then create the client using that address.
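A minimal sketch of that idea, assuming dask.distributed's LocalCluster; the helper name run_cluster and the worker counts are illustrative, not from the original answer:
import multiprocessing as mp
import time

def run_cluster(address_queue):
    from dask.distributed import LocalCluster
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    address_queue.put(cluster.scheduler_address)   # e.g. "tcp://127.0.0.1:8786"
    while True:                                    # keep the cluster process alive
        time.sleep(1)

if __name__ == "__main__":
    from dask.distributed import Client
    q = mp.Queue()
    proc = mp.Process(target=run_cluster, args=(q,), daemon=True)
    proc.start()
    address = q.get()         # scheduler address published by the cluster process
    client = Client(address)  # connect from the main process using that address
    # ... submit work, write results to disk ...
    client.close()
    proc.terminate()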

Does OpenCL guarantee buffer memory integrity with multiple command queues?

Simplified problem
I have two host threads, each with its own command queue to the same GPU device. Both queues are out-of-order with the execution order explicitly managed using wait events (simplified example doesn't need this, but actual application does).
ThreadA is a lightweight processing pipeline that runs in real-time as new data is acquired. ThreadB is a heavyweight slower processing pipeline that uses the same input data but processes it asynchronously at a slower rate. I'm using a double buffer to keep the pipelines separate but allow ThreadB to work on the same input data written to device by ThreadA.
ThreadA's loop:
1. Pull an image from the network as data becomes available
2. Write the image to device cl_mem BufferA using clEnqueueWriteBuffer(CommandQueueA)
3. Invoke image processing KernelA using clEnqueueNDRangeKernel(CommandQueueA) once the write is complete (the kernel outputs results to cl_mem OutputA)
4. Read the processed result from OutputA using clEnqueueReadBuffer(CommandQueueA)
ThreadB's loop:
1. Wait until enough time has elapsed (this pipeline works at a slower rate)
2. Copy BufferA to BufferB using clEnqueueCopyBuffer(CommandQueueB) (the double-buffer swap)
3. Invoke the slower image processing KernelB using clEnqueueNDRangeKernel(CommandQueueB) once the copy is complete (the kernel outputs results to cl_mem OutputB)
4. Read the processed result from OutputB using clEnqueueReadBuffer(CommandQueueB)
My Questions
There's a potential race condition between ThreadA's step 2 and ThreadB's step 2. I don't care which is executed first, I just want to make sure I don't copy BufferA to BufferB while BufferA is being written to.
Does OpenCL provide any implicit guarantees that this won't happen?
If not, if I instead on ThreadB step 2 use clEnqueueCopyBuffer(CommandQueueA) so that both the write and copy operations are in the same command queue, does that guarantee that they can't run simultaneously even though the queue allows out-of-order execution?
If not, is there a better solution than adding the WriteBuffer's event in ThreadA to the waitlist of the CopyBuffer command in ThreadB?
It seems like any of these should work, but I can't find where in the OpenCL spec it says this is fine. Please cite the OpenCL spec in your answers if possible.
Does OpenCL provide any implicit guarantees that this won't happen?
No, there is no implicit synchronization unless you use a single in-order command queue.
If not, if I instead on ThreadB step 2 use clEnqueueCopyBuffer(CommandQueueA), so that both the write and copy operations are in the same command queue, does that guarantee that they can't run simultaneously even though the queue allows out-of-order execution?
No. Regardless of a queue's type (in-order vs. out-of-order), the OpenCL runtime does not track memory dependencies between commands. The user is responsible for specifying events in a wait list whenever a dependency between commands exists.
The following quote could serve as proof of that:
Section 3.2.1, Execution Model: Context and Command Queues: "Out-of-order Execution: Commands are issued in order, but do not wait to complete before following commands execute. Any order constraints are enforced by the programmer through explicit synchronization commands."
It is not a direct answer to your question, but I assume that if any guarantees were provided, they would be mentioned in this section.
If not, is there a better solution than adding the WriteBuffer's event in ThreadA to the waitlist of the CopyBuffer command in ThreadB?
If you can use a single in-order queue, that would probably be more efficient than a cross-queue event, at least for some implementations.
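For illustration, here is a sketch of the explicit wait-list approach from question 3, written with pyopencl rather than the C API; the buffer sizes, flags, and placeholder host image are assumptions, and the kernel launches are omitted:
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
props = cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE  # requires device support
queue_a = cl.CommandQueue(ctx, properties=props)   # ThreadA's queue
queue_b = cl.CommandQueue(ctx, properties=props)   # ThreadB's queue

nbytes = 1024 * 1024
buffer_a = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, nbytes)
buffer_b = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, nbytes)
host_image = np.zeros(nbytes, dtype=np.uint8)

# ThreadA, step 2: non-blocking write; keep the returned event.
write_evt = cl.enqueue_copy(queue_a, buffer_a, host_image, is_blocking=False)

# ThreadB, step 2: the device-to-device copy explicitly waits on the write event,
# so BufferA is never copied while it is still being written.
copy_evt = cl.enqueue_copy(queue_b, buffer_b, buffer_a, wait_for=[write_evt])
copy_evt.wait()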

TensorFlow operations with GPU support

Is there a way (or maybe a list?) to determine all the tensorflow operations which offer GPU support?
Right now it is a trial-and-error process for me: I try to place a group of operations on the GPU. If it works, sweet. If not, I try somewhere else.
This is the only thing relevant (but not helpful) I have found so far: https://github.com/tensorflow/tensorflow/issues/2502
Unfortunately, it seems there's no built-in way to get an exact list or even check this for a specific op. As mentioned in the issue above, the dispatching is done in the native C++ code: a particular operation can be assigned to the GPU if a corresponding kernel has been registered for DEVICE_GPU.
I think the easiest way is to grep the TensorFlow source tree, e.g. grep "REGISTER_KERNEL_BUILDER" -r tensorflow, to get a list of the ops with registered kernels.
But remember that even with a REGISTER_KERNEL_BUILDER specification, there's no guarantee an op will be performed on a GPU. For example, 32-bit integer Add is assigned to the CPU regardless of the existing kernel.
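As a practical check (my own suggestion, using the TF1-style session API that the linked issue refers to), you can at least see where TensorFlow actually places each op by enabling device placement logging:
import tensorflow as tf

with tf.device('/device:GPU:0'):
    a = tf.constant([1, 2, 3], dtype=tf.int32, name='a')
    b = tf.constant([4, 5, 6], dtype=tf.int32, name='b')
    c = tf.add(a, b, name='add_int32')    # per the answer above, int32 Add may still land on the CPU

config = tf.ConfigProto(log_device_placement=True,   # log each op's chosen device
                        allow_soft_placement=True)    # fall back to CPU instead of erroring
with tf.Session(config=config) as sess:
    print(sess.run(c))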

Python's support for multi-threading

I heard that Python still has the global interpreter lock (GIL) issue, and that as a result threads in Python do not actually execute in parallel.
What are the possible solutions to overcome this problem?
I am using python 2.7.3
For understanding Python's GIL, I would recommend this presentation: http://www.dabeaz.com/python/UnderstandingGIL.pdf
From the Python wiki:
The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations. Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend a lot of time inside the GIL, interpreting CPython bytecode, that the GIL becomes a bottleneck.
There are discussions about eliminating the GIL, but as far as I know it hasn't happened yet. If you really want true multi-threading for your custom code, you can also switch to Java.
Do see if that helps.
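The effect described in the quote is easy to reproduce with a countdown test similar to the one in the linked slides: two CPU-bound threads take roughly as long as running the work twice in a row, because only one thread can execute bytecode at a time. A small sketch (timings vary by machine):
import threading
import time

def count(n):
    while n > 0:
        n -= 1

N = 10000000

start = time.time()
count(N)
count(N)                                # sequential baseline
print("serial:  %.2fs" % (time.time() - start))

start = time.time()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start()
t2.start()
t1.join()
t2.join()
print("threads: %.2fs" % (time.time() - start))   # roughly the same as serial, not half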

How do I mitigate CUDA's very long initialization delay?

Initializing CUDA in a newly-created process can take quite some time: half a second or more on many of today's server-grade machines. As @RobertCrovella explains, CUDA initialization usually includes the establishment of a Unified Memory model, which involves harmonizing the device and host memory maps. This can take quite a long time on machines with a lot of memory, and there might be other factors contributing to this long delay.
This effect becomes quite annoying when you want to run a sequence of CUDA-utilizing processes which do not use complicated virtual memory mappings: they each have to sit through the same long wait, despite the fact that "essentially" they could just re-use whatever initialization CUDA performed the last time (perhaps with a bit of cleanup code).
Now, obviously, if you somehow rewrote the code for all those processes to execute within a single process, that would save you those long initialization costs. But isn't there a simpler approach? What about:
Passing the same state information / CUDA context between processes?
Telling CUDA to ignore most host memory altogether?
Making the Unified Memory harmonization more lazy than it is now, so that it only happens to the extent that it's actually necessary?
Starting CUDA with Unified Memory disabled?
Keeping some daemon process on the side and latching on to its already-initialized CUDA state?
What you are asking about already exists. It is called MPS (Multi-Process Service), and it basically keeps a single GPU context alive at all times with a daemon process that emulates the driver API. The initial target application is MPI, but it does basically what you envisage.
Read more here:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
http://on-demand.gputechconf.com/gtc/2015/presentation/S5584-Priyanka-Sah.pdf
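To gauge how much of the delay MPS saves on a given machine, one can time the per-process initialization with and without the MPS daemon running. A rough sketch using pycuda (my choice of library; the question itself is library-agnostic):
import time
import pycuda.driver as cuda

start = time.time()
cuda.init()                            # driver initialization
ctx = cuda.Device(0).make_context()    # first context creation pays the big one-time cost
print("CUDA init + context: %.2fs" % (time.time() - start))
ctx.pop()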