I am using a Raspberry Pi 3 and basically stumbled over a little tripwire.
I have a very big and complicated program that uses a lot of memory and puts a heavy load on the CPU. I assumed that if I started the same program again while the first instance was still running, it would use the same amount of memory again and, above all, double the CPU load. It turned out that it neither uses more memory nor increases the CPU load.
To find out whether this behaviour came from my program, I wrote a tiny C++ program with high memory usage, here it is:
#include <iostream>
using namespace std;

int main()
{
    for (int i = 0; i < 100; i++) {
        float a[100][100][100];
        for (int i2 = 0; i2 < 99; ++i2) {
            for (int i3 = 0; i3 < 99; ++i3) {
                for (int i4 = 0; i4 < 99; ++i4) {
                    a[i2][i3][i4] = i2 * i3 * i4;
                    cout << a[i2][i3][i4] << endl;
                }
            }
        }
    }
    return 0;
}
When I started the code in one terminal, the CPU load was at about 30 % of the maximum. Strangely, when I started it in a second terminal at the same time, the CPU load did not change. I concluded that this behaviour couldn't come from my program.
Now I want to know:
Is there a "lock" that ensures that a certain type of process does not grill your cores?
Why don't two identical processes double the CPU load?
Well, I found out that there is a "lock" that makes sure a single process doesn't take all the memory and drive the CPU load up to 100%. It seems that the more processes there are, the higher the CPU load, but it doesn't grow linearly.
Additionally, the code I wrote to reproduce the behaviour only has high memory usage; the 30% load came from the cout calls into the standard library. Multiple processes can print at the same time without increasing the CPU load, but it slows the printing down.
When I found that out, I got suspicious about the program's speed. I used the analytics in my C++ IDE to measure the duration of my original program, and with two instances running it was indeed a bit more than two times slower.
That seems to be the answer I was looking for, but I think it is not really applicable to a large audience since the architecture of the Raspberry Pi is quite particular. I don't know how this works on other systems.
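For anyone who wants to check this without IDE analytics, here is a minimal sketch (not part of my original program; work() just stands in for the real workload) that times one run with std::chrono — start it in one terminal, then again with a second instance running, and compare the printed times:

#include <chrono>
#include <iostream>

// Stand-in for the real workload; replace with the actual program's work.
void work()
{
    volatile double sum = 0;
    for (long i = 0; i < 100000000; ++i)
        sum = sum + i * 0.5;
}

int main()
{
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms" << std::endl;
    return 0;
}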
BTW: I could have guessed that there is some kind of lock. I mean, if you started 10 processes that each took 15% of the maximum CPU load, you would end up with 150% CPU usage. IMPOSSIBLE!
Here's a low-level question. How CPU intensive is getting the system time?
What is the source of the time? I know there is a hardware clock on the BIOS chip, but I'm thinking that getting data from outside the CPU and RAM needs some hardware synchronization, which may delay the read, so I'm guessing the CPU may have its own clock. Feel free to correct me if I'm wrong in any way.
Does getting the time incur a heavy system call, and does it depend in any way on the programming language used?
I have just tested it using a C++ program:
#include <ctime>

clock_t started = clock();
clock_t endClock = started + CLOCKS_PER_SEC;
long itera = 0;
for (; clock() < endClock; itera++)
{
}
I get about 23 million iterations per second (Windows 7, 32bit, Visual Studio 2015, 2.6 GHz CPU). In terms of your question, I would not call this intensive.
In debug mode, I measured 18 million iterations per second.
In case the time is transformed into a localized timestamp, complicated calendar calculations (timezone, daylight saving time, ...) might significantly slow down the loop.
It is not easy to tell what happens inside the clock() call. On my system it calls QueryPerformanceCounter, which in turn relies on other system functions, as explained here.
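To get a feel for the calendar cost mentioned above, the same measuring loop can be reused with a local-time conversion added in the body (a variation I sketched for illustration, not measured):

#include <ctime>

clock_t started = clock();
clock_t endClock = started + CLOCKS_PER_SEC;
long itera = 0;
for (; clock() < endClock; itera++)
{
    std::time_t now = std::time(nullptr);
    std::tm* local = std::localtime(&now);   // timezone/DST conversion
    (void)local;
}
// compare this iteration count with the plain clock() loop above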
Tuning
To reduce the time-measurement overhead even further, you can measure only every 10th, 100th, ... iteration.
The following measures once in 1024 iterations:
for (; (itera & 0x03FF) || (clock() < endClock); itera++)
{
}
This brings the loops-per-second count up to some 500 million.
Tuning with Timer Thread
The following yields a further improvement of some 10%, paid for with additional complexity:
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> processing{true};

// launch a timer thread to clear the processing flag after 1 s
std::thread t([&processing]() {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    processing = false;
});

for (; (itera & 0x03FF) || processing; itera++)
{
}
t.join();
An extra thread is started which sleeps for one second and then clears the control flag. The main thread executes the loop until the timer thread signals the end of processing.
I need to do many comparisons in an OpenCL program. Right now I do it like this:
int memcmp(__global unsigned char* a, __global unsigned char* b, int size)
{
    for (int i = 0; i < size; i++) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}
How can I make it faster? Maybe by using vectors like uchar4, or something else? Thanks!
I guess that your kernel processes "size" elements per thread. I think your code can improve if your accesses are more coalesced. Thanks to the L1 caches of current GPUs this is not a huge problem, but it can still mean a noticeable performance penalty. For example, say you have 4 threads (work-items) and size = 128, so the buffers hold 512 uchars. Thread #0 accesses a[0] and b[0], but it brings a[0]...a[63] into the cache, and the same for b. Thread #1, which belongs to the same warp (aka wavefront), accesses a[128] and b[128], so it brings a[128]...a[191] into the cache, and so on. After thread #3, the whole buffer is in the cache. This is not a big problem here given the small size of this domain.
However, if consecutive threads access consecutive elements, only one "cache line" is needed at any time for your 4 threads (the accesses are coalesced). The effect will be even more noticeable with more threads per block. Please try it and tell me your conclusions. Thank you.
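To make that concrete, here is a rough sketch of the interleaved pattern (untested; it assumes all work-items cooperate on one a/b pair and that you combine the per-work-item results afterwards, for example with a reduction):

int memcmp_coalesced(__global const unsigned char* a,
                     __global const unsigned char* b,
                     int size)
{
    int gid = get_global_id(0);     // this work-item's index
    int gsz = get_global_size(0);   // total number of work-items
    // work-item 0 reads bytes 0, gsz, 2*gsz, ..., work-item 1 reads 1, gsz+1, ...,
    // so neighbouring work-items in a warp always touch neighbouring bytes
    for (int i = gid; i < size; i += gsz) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}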
See: http://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf Section 3.1.2.1
It is a bit old, but the concepts still apply.
PS: After this, I would also try uchar4, as you mentioned, and loop unrolling.
Have you ever calculated the MIPS of an LPC1788 board? Recently I calculated a result with the following code running in ROM:
#include <stdint.h>

volatile uint32_t tick;            /* incremented by the SysTick interrupt, presumably every 1 ms */

void SysTick_Handler()
{
    tick++;
}

unsigned long loops_per_ms;
extern void __delay(int n);

int calculate_mips()
{
    int prec = 8;
    unsigned long ji;
    unsigned long loop;

    /* coarse search: keep doubling the loop count until __delay() outlasts one tick */
    loops_per_ms = 1 << 12;
    while (loops_per_ms) {
        ji = tick;
        while (ji == tick) ;       /* wait for a tick edge */
        ji = tick;
        __delay(loops_per_ms);
        if (ji != tick)
            break;
        loops_per_ms <<= 1;
    }
    loops_per_ms >>= 1;

    /* refine the estimate bit by bit to 8 bits of precision */
    loop = loops_per_ms >> 1;
    while (prec--) {
        loops_per_ms |= loop;
        ji = tick;
        while (ji == tick) ;
        ji = tick;
        __delay(loops_per_ms);
        if (ji != tick)
            loops_per_ms &= ~loop;
        loop >>= 1;
    }

    /* delay.s runs 2 instructions per iteration, so with a 1 ms tick:
       MIPS = loops_per_ms * 2 * 1000 / 1e6 = loops_per_ms / 500 */
    return loops_per_ms / 500;
}
delay.s:
PUBLIC __delay
SECTION .text:CODE:REORDER(2)
THUMB
__delay
subs r0, r0, #1
bhi __delay
mov pc, lr
END
With the IAR IDE I get loops_per_ms = 39936, which gives 79 MIPS, while with Keil I get loops_per_ms = 29952, which means 59 MIPS.
The MCU clock is set to 120 MHz; by the datasheet the MIPS should be 1.25 × 120 = 150. I think running the code from ROM slows it down.
Does anybody have comments or other results?
You cannot measure MIPS that way. You have no control over how many instructions the compiler will use to implement a particular piece of high-level source code, and it will vary with the optimisation level.
The core will achieve 1.25 MIPS per MHz, but that may be reduced by a number of factors. For example, on Cortex-M the on-chip flash and on-chip RAM use separate buses, so optimal performance is achieved when data is in RAM and code is in flash. If an instruction in flash needs to fetch data from flash, throughput drops because the instruction fetch and the data fetch must happen sequentially, whereas a data fetch from RAM can occur in parallel. If you ran the code from RAM you would really notice a slowdown, since all data and instruction fetches would then be sequential. Most Cortex-M parts employ a flash accelerator of some sort to compensate for the slower flash memory and achieve zero-wait-state code execution in most cases, though it is possible to write code perversely enough to defeat that benefit. Other causes of reduced MIPS are bus latency caused by DMA operations and peripheral wait states.
The simplest and most accurate method of measuring MIPS for your particular application (which for the reasons mentioned above may vary from the optimal) is to use a trace-capable debugger, which can capture every instruction executed over a period.
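If you do not have a trace probe, a rough alternative on a Cortex-M3 such as the LPC1788 is the DWT cycle counter. A sketch using the standard CMSIS names follows; the device header name is an assumption for your toolchain, and CYCCNT is optional silicon, so verify that it actually counts on your part:

#include "LPC177x_8x.h"   /* CMSIS device header -- adjust to your toolchain */

extern void __delay(int n);

uint32_t cycles_for_delay(int n)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the DWT unit */
    DWT->CYCCNT = 0;                                  /* reset the cycle counter */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting */
    __delay(n);
    return DWT->CYCCNT;                               /* elapsed core cycles */
}

Dividing the measured cycles by the number of instructions you know were executed (two per __delay iteration here) gives the effective cycles per instruction for that loop, and comparing runs from flash and from RAM shows what the wait states and bus contention cost.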
I have two ways to program the same functionality.
Method 1:
void doTheWork(int action)
{
    for (int i = 0; i < 1000000000; ++i)
    {
        doAction(action);
    }
}
Method 2:
void doTheWork(int action)
{
    switch (action)
    {
    case 1:
        for (int i = 0; i < 1000000000; ++i)
        {
            doAction<1>();
        }
        break;
    case 2:
        for (int i = 0; i < 1000000000; ++i)
        {
            doAction<2>();
        }
        break;
    //-----------------------------------------------
    //... (there are 1000000 cases here)
    //-----------------------------------------------
    case 1000000:
        for (int i = 0; i < 1000000000; ++i)
        {
            doAction<1000000>();
        }
        break;
    }
}
Let's assume that the function doAction(int action) and the function template<int Action> doAction() consist of about 10 lines of code that will get inlined at compile time. Calling doAction(#) is equivalent to doAction<#>() in functionality, but the non-templated doAction(int value) is somewhat slower than template<int Value> doAction(), since some nice optimizations can be done in the code when the argument value is known at compile time.
So my question is: do all the millions of lines of code fill the CPU's L1 cache (and more) in the case of the templated function (and thus degrade performance considerably), or do only the lines of the doAction<#>() instance inside the loop currently being run get cached?
It depends on the actual code size - 10 lines of code can be little or much - and of course on the actual machine.
However, Method 2 violently violates this decade's rule of thumb: instructions are cheap, memory access is not.
Scalability limit
Your optimizations are usually linear - you might shave off 10, 20, maybe even 30% of execution time. Hitting a cache limit is highly nonlinear - as in "running into a brick wall" nonlinear.
As soon as your code size significantly exceeds the 2nd/3rd level cache's size, Method 2 will lose big time, as the following estimation of a high end consumer system shows:
DDR3-1333 with 10667MB/s peak memory bandwidth,
Intel Core i7 Extreme with ~75000 MIPS
gives you 10667MB / 75000M = 0.14 bytes per instruction for break even - anything larger, and main memory can't keep up with the CPU.
Typical x86 instruction sizes are 2..3 bytes executing in 1..2 cycles (now, granted, this isn't necessarily the same instructions, as x86 instructions are split up. Still...)
Typical x64 instruction lengths are even larger.
How much does your cache help?
I found the following number (from a different source, so it's hard to compare):
i7 Nehalem L2 cache (256K, >200GB/s bandwidth) which could almost keep up with x86 instructions, but probably not with x64.
In addition, your L2 cache will kick in completely only if
you have perfect prediction of the next instructions, or you don't have a first-run penalty and it fits into the cache completely
there's no significant amount of data being processed
there's no significant other code in your "inner loop"
there's no other thread executing on this core
Given that, you can lose much earlier, especially on a CPU/board with smaller caches.
The L1 instruction cache only contains instructions that were fetched recently or in anticipation of near-future execution. As such, the second method cannot fill the L1 cache simply because the code exists. Your execution path will cause it to load only the template-instantiated version for the loop currently being run. As you move to the next loop, it will generally evict the least recently used (LRU) cache lines and replace them with what you are executing next.
In other words, due to the looping nature of both your methods, the L1 cache will perform admirably in both cases and won't be the bottleneck.
I have a kernel which uses 17 registers; reducing it to 16 would bring me to 100% occupancy. My question is: are there methods that can be used to reduce the number of registers used, excluding completely rewriting my algorithms in a different manner? I have always kind of assumed the compiler is a lot smarter than I am, so for example I often use extra variables for clarity's sake alone. Am I wrong in this thinking?
Please note: I do know about the --max_registers (or whatever the syntax is) flag, but the use of local memory would be more detrimental than a 25% lower occupancy (I should test this)
Occupancy can be a little misleading and 100% occupancy should not be your primary target. If you can get fully coalesced accesses to global memory then on a high end GPU 50% occupancy will be sufficient to hide the latency to global memory (for floats, even lower for doubles). Check out the Advanced CUDA C presentation from GTC last year for more information on this topic.
In your case, you should measure performance both with and without maxrregcount set to 16. The latency to local memory should be hidden as a result of having sufficient threads, assuming you don't randomly access local arrays (which would result in non-coalesced accesses).
To answer your specific question about reducing registers: post the code for more detailed answers! Understanding how compilers work in general may help, but remember that nvcc is an optimising compiler with a large parameter space, so minimising register count has to be balanced with overall performance.
It's really hard to say; the nvcc compiler is not very smart, in my opinion.
You can try the obvious things, for example using short instead of int, passing and using variables by reference (e.g. &variable), unrolling loops, and using templates (as in C++). If you have divisions or transcendental functions applied in sequence, try turning them into a loop. Try to get rid of conditionals, possibly replacing them with redundant computations.
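For instance, the template suggestion could look like this (a made-up sketch; scale and STRIDE are purely illustrative names):

template <int STRIDE>
__global__ void scale(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * STRIDE;   // STRIDE is a compile-time constant, so the
                                   // compiler can fold it into the code instead
                                   // of holding it in a register like a normal
                                   // kernel argument
}

// launched for a fixed stride, e.g.: scale<4><<<blocks, threads>>>(d_out, d_in, n);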
If you post some code, maybe you will get specific answers.
Utilizing shared memory as a cache may lead to lower register usage and prevent registers from spilling to local memory...
Suppose the kernel calculates some values, and these calculated values are used by all of the threads:
__global__ void kernel(...) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int id0 = blockDim.x * blockIdx.x;

    int reg  = id0 * ...;
    int reg0 = reg * a / x + y;
    ...
    int val = reg + reg0 + 2 * idx;
    output[idx] = val > 10;
}
So, instead of keeping reg and reg0 as registers and letting them possibly spill out to local memory (global memory), we may use shared memory.
__global__ void kernel(...) {
    __shared__ int cache[10];
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    if (threadIdx.x == 0) {
        int id0 = blockDim.x * blockIdx.x;
        cache[0] = id0 * ...;
        cache[1] = cache[0] * a / x + y;
    }
    __syncthreads();
    ...
    int val = cache[0] + cache[1] + 2 * idx;
    output[idx] = val > 10;
}
Take a look at this paper for further information.
It is not generally a good approach to minimize register pressure. The compiler does a good job optimizing the overall projected kernel performance, and it takes into account lots of factors, including register usage.
How can reducing register usage lead to slower speed?
Most probably the compiler had to spill data that no longer fit in registers into "local" memory, which is essentially the same as global memory, and thus very slow.
For optimization purposes I would recommend using keywords like const, volatile and so on where appropriate, to help the compiler in the optimization phase.
Anyway, it is usually not tiny issues like registers that make CUDA kernels run slow. I'd recommend optimizing the work with global memory and its access patterns, caching in texture memory if possible, and the transactions over PCIe.
The instruction-count increase when lowering register usage has a simple explanation. The compiler uses registers to store the results of operations that are used more than once in your code, to avoid recalculating those values. When forced to use fewer registers, it decides to recalculate the values that would otherwise have been kept in registers.
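As a made-up illustration of that trade-off:

// Illustrative device function (names invented for this example).
__device__ void example(float x, float y, float* a, float* b)
{
    float t = x * x + y;   // with registers to spare, t stays live in a register
    *a = t * 2.0f;
    // ... other work ...
    *b = t + 1.0f;         // reused from the register
    // Under a tight register budget the compiler may instead re-evaluate
    // x * x + y at the second use: more instructions, one register freed.
}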