Profiling JAX code: What is redzone_checker and why does it take so much time?

I have found this post but am still unclear on what the redzone_checker kernel is doing and why. Specifically, should it be taking > 90% of my application's runtime? TensorBoard reports that it is taking the vast majority of the runtime of my JAX code, and I'd like to know:
Is it actually the case that this kernel is taking way too much time, or is this a side effect of profiling JAX with TensorBoard (i.e., the output is misleading in some way)?
Is there a way to reduce the amount of time taken by the redzone_checker kernel? Is that even a good idea?
Thanks in advance for any insights.

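Make sure you warm up before profiling: call the jitted function at least once before you start the trace.
What you are seeing may be JIT compilation time, which is only paid on the first call.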
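A minimal sketch of what "warm up first" means in practice. It uses only the standard library; `time_with_warmup` and `fake_jitted` are hypothetical names, and `fake_jitted` is just a stand-in that pays a one-time cost on its first call the way a jitted function would.

```python
import time


def time_with_warmup(fn, *args, sync=lambda out: out, repeats=5):
    """Time the first call separately from the calls that follow it."""
    start = time.perf_counter()
    sync(fn(*args))  # first call: includes any one-time setup or compilation
    first = time.perf_counter() - start

    steady = []
    for _ in range(repeats):
        start = time.perf_counter()
        sync(fn(*args))  # wait for the result before stopping the clock
        steady.append(time.perf_counter() - start)
    return first, min(steady)


# Stand-in for a jitted function (assumption: not real JAX code).
_compiled = {}


def fake_jitted(x):
    if "done" not in _compiled:
        time.sleep(0.05)  # pretend this is compilation
        _compiled["done"] = True
    return x * 2


first, steady = time_with_warmup(fake_jitted, 21)
print(f"first call: {first:.4f}s, steady state: {steady:.6f}s")
```

With real JAX code you would pass the jitted function and something like `sync=lambda out: out.block_until_ready()`, since dispatch is asynchronous, and start the profiler trace only after the first call has finished. If the kernel still dominates the steady-state numbers, then it is not just a warm-up effect.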

Related

GUROBI only uses single core to setup problem with cvxpy (python)

I have a large MILP that I build with cvxpy and want to solve with GUROBI. When I call the solve() function of cvxpy, it takes a really, really long time to set up and does not start solving for hours. While it is doing that, only 1 core of my cluster is being used, at 100%. I would like to use multiple cores to build the model so that building it does not take so long. Running grbprobe also shows that Gurobi knows about the other cores, and for solving the problem it uses multiple cores.
I have tried running with different flags, i.e. turning presolve off and on or setting the number of Threads to be used (this did not seem to have any effect, even for the solving).
I have also reduced the number of constraints in the problem, and it starts solving much faster, which means that this is definitely not a problem of the model itself.
The problem in its normal state should have 2200 constraints. I reduced it to 150, and it took a couple of seconds until it started to search for a solution.
The problem is that I don't see anything, since it takes so long to get to the "set username parameters" flag, and I don't get any information on what the computer is doing in the meantime.
Is there a way to tell GUROBI or CVXPY that it can use more CPUs for the build-up?
Is there another way to solve this problem?
Sorry. The first part of the solve (cvxpy model generation, setup, presolving, scaling, solving the root, preprocessing) is almost completely serial. The parallel part is when it really starts working on the branch-and-bound tree. For many problems, the parallel part is by far the most expensive, but not for all.
This is not only the case for Gurobi. Other high-end solvers have the same behavior.
There are options to do less presolving and preprocessing. That may get you into the B&B sooner. However, it is usually better not to touch these options.
Running things with verbose=True may give you more information. If you have more detailed questions, you may want to share the log.

How to check the root cause of CUDA out of memory issue in the middle of training?

I'm running roberta on huggingface language_modeling.py. After doing 400 steps I suddenly get a CUDA out of memory issue. Don't know how to deal with it. Can you please help? Thanks
This can have multiple reasons. If you only get it after a few iterations, it might be that you don't free the computational graphs. Do you use loss.backward(retain_graph=True) or something similar?
Also, when you're running inference, be sure to use
with torch.no_grad():
    model.forward(...)
Otherwise the computational graphs are saved there as well and potentially never freed since you never call backward() on them.
My problem was that I didn't check the size of my GPU memory against the sizes of my samples. I had a lot of pretty small samples and, after many iterations, a large one. My bad.
Thank you, and remember to check these things if it happens to you too.
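A quick way to check for this up front is to look at the size distribution of the samples before training. This is only a sketch: `summarize_lengths` and the `samples` list are hypothetical, and it assumes each sample is a plain list of token ids.

```python
def summarize_lengths(samples):
    """Return the index and length of the longest sample, plus the median length."""
    lengths = [len(s) for s in samples]
    longest_idx = max(range(len(lengths)), key=lengths.__getitem__)
    ordered = sorted(lengths)
    median = ordered[len(ordered) // 2]
    return longest_idx, lengths[longest_idx], median


# Hypothetical tokenized dataset: many short samples and one very long one.
samples = [[0] * 32 for _ in range(400)] + [[0] * 4096] + [[0] * 32 for _ in range(10)]

idx, longest, median = summarize_lengths(samples)
print(f"longest sample is #{idx} with {longest} tokens (median {median})")
```

If one sample is far larger than the median, running that sample through a single training step first (or truncating it) should surface the out-of-memory error at the start instead of hundreds of steps in.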

Need help understanding Kernel Transport speed on GPU (numba, cupy, cuda)

While GPUs speed up math calculations, there is a high fixed overhead for moving a kernel out to the GPU for execution.
I'm using cupy and numba. The first time I execute a function call that uses cupy's GPU version of numpy, it is quite slow. But the second time it is fast.
I've realized I don't understand how the kernel, or GPU code, gets out to the GPU to run. Operationally I want to understand this better so that I can know when the things I do will accidentally create a slow step due to some kernel transfer. So I need some sort of rules or rules of thumb to understand the concept.
For example, if I multiply two cupy arrays that are stashed on the GPU already I might write C= A*B
At some point the cupy overload of * multiplication has to be coded out on the GPU, and it automagically also gets wrapped in loops that break it down into blocks and threads. So presumably this code is some kernel that gets transported out to the GPU. I'm guessing that the next time I call C*D, the GPU no longer needs to be taught what * means, and so it will be fast.
But at some point I would imagine the GPU needs to clear out old code so * or other operations not being used at that moment might get flushed from memory, and so later on when the call for A*B happens again there's going to be a penalty in time to recompile it out on the GPU.
Or so I imagine. If I'm right how do I know when these kernels are going to stick around or disappear?
If I'm wrong and this isn't how it works, or there is some other slow step (I'm assuming the data is already transported to arrays on the GPU), then what is this slow step, and how does one organize things so as to pay it as little as possible?
I'm trying to avoid writing explicit numba thread management kernels as one does in CUDA C++, and just use the standard numba @njit, @vectorize, @stencil decorators. Likewise in cupy I want to just work at the level of the numpy syntax, not dive into thread management.
I've read a lot of docs on this but they just refer to overheads of kernels, not when these get paid and how one controls that so I'm confused.
I don't have a full answer to this yet. But so far the biggest clue I've gotten has come from reading up on the currently undocumented function @cupy.fuse(), which makes it clearer than the @numba.jit documents where the kernel launch costs are paid. I have not yet found the connection to Contexts, as recommended by @talonmies.
see https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
The key example is this
import cupy

c = cupy.arange(4)
#@cupy.fuse()
def foo(x):
    return x+x+x+x+x
foo(.) will be three times slower with @cupy.fuse() commented out because each "+" involves a kernel load and a kernel free. Fusion merges all the adds into a single kernel, so the launch and free are paid once. For arrays of fewer than 1 million elements on a typical 2018 GPU, the add() is so fast that the launch and free are the dominant times.
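To make the counting concrete, here is a toy model in plain Python. It is not CuPy and does not touch a GPU: `CountingArray` and `fused` are hypothetical stand-ins that simply count one "launch" per elementwise operation, to show why the unfused version pays the overhead several times.

```python
class CountingArray:
    """Toy stand-in for a GPU array: counts one 'launch' per elementwise op."""

    launches = 0

    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        CountingArray.launches += 1  # each '+' is its own kernel launch
        return CountingArray(a + b for a, b in zip(self.data, other.data))


def fused(fn, x):
    """Apply fn to every element in one pass: a single 'launch'."""
    CountingArray.launches += 1
    return CountingArray(fn(v) for v in x.data)


def foo(x):
    return x + x + x + x + x


x = CountingArray(range(4))

CountingArray.launches = 0
unfused_result = foo(x)  # four separate adds, each with a temporary array
unfused_launches = CountingArray.launches

CountingArray.launches = 0
fused_result = fused(foo, x)  # foo runs on plain numbers inside one pass
fused_launches = CountingArray.launches

print(unfused_launches, fused_launches)
```

The expression x+x+x+x+x contains four additions, so the unfused path counts four launches against one for the fused path, and both produce the same values. When the per-element work is tiny, that launch count is roughly what the run time tracks.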
I wish I could find some documentation on @fuse. For example, does it unroll internal functions the way @jit does? Could I achieve that by stacking @jit and @fuse?
I'm still however largely in the dark about when the costs are getting paid in numba.

Let the GPU handle recursive algorithm

I have a complex recursive algorithm that, in its PHP implementation, takes about 15 minutes to complete in the CLI. I was thinking about porting it to Objective-C and wanted to know how I can make use of the GPU for the calculations. Is there a way to designate threads to be executed by the GPU?
Thanks
Yes, it's possible to use the GPU for calculations, although depending on the task it may not be advantageous. Without posted code, it's anyone's guess what the most efficient approach for your implementation might be. I would recommend reading the "Concurrency Programming Guide"; it's an excellent starting point for understanding the appropriate ways to handle concurrent threading within Objective-C.

tips for efficient and optimized Cocoa applications

I am developing a cocoa application (Mac) and wanted to know what are your tips, best practices, ... for an efficient Cocoa application, which starts in less than 1 second and which is very responsive.
I've installed twitter for Mac and was amazed by its speed. Is it using special tricks?
Thanks in advance for your ideas :)
Three things that can help reduce startup time and improve overall performance are:
Defer loading resources until they're actually needed.
Profile your app to identify the parts that have the highest cost (whether you measure that in execution time, memory, or something else). Then work to reduce the cost of those operations or figure out a way to do them less or at a different time.
Take advantage of the hardware. Most machines these days have at least two processing cores and advanced graphics processors; use GCD, Quartz, Core Animation, and other technologies to take advantage of the available power.
I don't think there are really any "tricks" per se. You just profile your code with Instruments, and eliminate the slow areas. It's the same as optimising any code; don't block the main thread with disk reads/writes, use lazy loading where appropriate, etc.
A lot of it may simply be tightly written, good-quality code. These sorts of apps don't tend to rely on clunky frameworks etc.
Do only what you need to do and only when you need to do it.