DynamicExpresso Eval memory leak - dynamic-expresso

Is there a solution to avoid memory leak in a simple expression evaluation like this ?
inter.SetVariable("tick", tick++);
if (inter.Eval<bool>("(tick%2)==1"))
{
odd++;
if ((odd % 100) == 0)
System.GC.Collect();
}
else
even++;
I'm running this code periodically in a WinForm application on a Linux machine with Mono (5.0.1.1) and the memory usage continuously increase.
Tested on Windows the Process.WorkingSet64 was increasing with a lower rate than Linux.
The GC.GetTotalMemory is always stable.

If possible, it is better to use the Parse method and then just Invoke the expression multiple times.
Something like:
// One time only
string expression = "(tick%2)==1";
Lambda parsedExpression = interpreter.Parse(expression, new Parameter("tick", typeof(int)));
// Call invoke for each cycle...
var result = parsedExpression.Invoke(tick++);
But from my previous tests I don't have see any Memory Leak, are you sure that is this the problem?

Related

Megamorphic virtual call - trying to optimize it

we know that there are some techniques that make virtual calls not so expensive in JVM like Inline Cache or Polymorphic Inline Cache.
Let's consider the following situation:
Base is an interface.
public void f(Base[] b) {
for(int i = 0; i < b.length; i++) {
b[i].m();
}
}
I see from my profiler that calling virtual (interface) method m is relatively expensive.
f is on the hot path and it was compiled to machine code (C2) but I see that call to m is a real virtual call. It means that it was not optimised by JVM.
The question is, how to deal with a such situation? Obviously, I cannot make the method m not virtual here because it requires a serious redesign.
Can I do anything or I have to accept it? I was thinking how to "force" or "convince" a JVM to
use polymorphic inline cache here - the number of different types in b` is quite low - between 4-5 types.
to unroll this loop - length of b is also relatively small. After an unroll it is possible that Inline Cache will be helpful here.
Thanks in advance for any advices.
Regards,
HotSpot JVM can inline up to two different targets of a virtual call, for more receivers there will be a call via vtable/itable [1].
To force inlining of more receivers, you may try to devirtualize the call manually, e.g.
if (b.getClass() == X.class) {
((X) b).m();
} else if (b.getClass() == Y.class) {
((Y) b).m();
} ...
During execution of profiled code (in the interpreter or C1), JVM collects receiver type statistics per call site. This statistics is then used in the optimizing compiler (C2). There is just one call site in your example, so the statistics will be aggregated throughout the entire execution.
However, for example, if b[0] always has just two receivers X or Y, and b[1] always has another two receivers Z or W, JIT compiler may benefit from splitting the code into multiple call sites, i.e. manual unrolling:
int len = b.length;
if (len > 0) b[0].m();
if (len > 1) b[1].m();
if (len > 2) b[2].m();
...
This will split the type profile, so that b[0].m() and b[1].m() can be optimized individually.
These are low level tricks relying on the particular JVM implementation. In general, I would not recommend them for production code, since these optimizations are fragile, but they definitely make the source code harder to read. After all, megamorphic calls are not that bad [2].
[1] https://shipilev.net/blog/2015/black-magic-method-dispatch/
[2] https://shipilev.net/jvm/anatomy-quarks/16-megamorphic-virtual-calls/

Make interpreter execute faster

I've created an interprter for a simple language. It is AST based (to be more exact, an irregular heterogeneous AST) with visitors executing and evaluating nodes. However I've noticed that it is extremely slow compared to "real" interpreters. For testing I've ran this code:
i = 3
j = 3
has = false
while i < 10000
j = 3
has = false
while j <= i / 2
if i % j == 0 then
has = true
end
j = j+2
end
if has == false then
puts i
end
i = i+2
end
In both ruby and my interpreter (just finding primes primitively). Ruby finished under 0.63 second, and my interpreter was over 15 seconds.
I develop the interpreter in C++ and in Visual Studio, so I've used the profiler to see what takes the most time: the evaluation methods.
50% of the execution time was to call the abstract evaluation method, which then casts the passed expression and calls the proper eval method. Something like this:
Value * eval (Exp * exp)
{
switch (exp->type)
{
case EXP_ADDITION:
eval ((AdditionExp*) exp);
break;
...
}
}
I could put the eval methods into the Exp nodes themselves, but I want to keep the nodes clean (Terence Parr saied something about reusability in his book).
Also at evaluation I always reconstruct the Value object, which stores the result of the evaluated expression. Actually Value is abstract, and it has derived value classes for different types (That's why I work with pointers, to avoid object slicing at returning). I think this could be another reason of slowness.
How could I make my interpreter as optimized as possible? Should I create bytecodes out of the AST and then interpret bytecodes instead? (As far as I know, they could be much faster)
Here is the source if it helps understanding my problem: src
Note: I haven't done any error handling yet, so an illegal statement or an error will simply freeze the program. (Also sorry for the stupid "error messages" :))
The syntax is pretty simple, the currently executed file is in OTZ1core/testfiles/test.txt (which is the prime finder).
I appreciate any help I can get, I'm really beginner at compilers and interpreters.
One possibility for a speed-up would be to use a function table instead of the switch with dynamic retyping. Your call to the typed-eval is going through at least one, and possibly several, levels of indirection. If you distinguish the typed functions instead by name and give them identical signatures, then pointers to the various functions can be packed into an array and indexed by the type member.
value (*evaltab[])(Exp *) = { // the order of functions must match
Exp_Add, // the order type values
//...
};
Then the whole switch becomes:
evaltab[exp->type](exp);
1 indirection, 1 function call. Fast.

cuda filter with ouput of this block is the input of the next block

Working on a filter following, I am having a problem of doing these pieces of codes for processing an image in GPU:
for(int h=0; h<height; h++) {
for(int w=1; w<width; w++) {
image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
}
}
If I define:
dim3 threads_perblock(32, 32)
then each block I have: 32 threads can be communicated. The threads of this block can not communicate with the threads from other blocks.
Within a thread_block, I can translate that pieces of code using shared_memory however, for edge (I would say): image[0,31] and image[0,32] in different threadblocks. The image[0,31] should get value from image[0,32] to calculate its value. But they are in different threadblocks.
so that is the problem.
How would I solve this?
Thanks in advance.
If image is in global memory then there is no problem - you don't need to use shared memory and you can just access pixels directly from image without any problem.
However if you have already done some processing prior to this, and a block of image is already in shared memory, then you have a problem, since you need to do neighbourhood operations which are outside the range of your block. You can do one of the following - either:
write shared memory back to global memory so that it is accessible to neighbouring blocks (disadvantage: performance, synchronization between blocks can be tricky)
or:
process additional edge pixels per block with an overlap (1 pixel in this case) so that you have additional pixels in each block to handle the edge cases, e.g. work with a 34x34 block size but store only the 32x32 central output pixels (disadvantage: requires additional logic within kernel, branches may result in warp divergence, not all threads in block are fully used)
Unfortunately neighbourhood operations can be really tricky in CUDA and there is always a down-side whatever method you use to handle edge cases.
You can just use a busy spin (no joke). Just make the thread processing a[32] execute:
while(!variable);
before starting to compute and the thread processing a[31] do
variable = 1;
when it finishes. It's up to you to generalize this. I know this is considered "rogue programming" in CUDA, but it seems the only way to achieve what you want. I had a very similar problem and it worked for me. Your performance might suffer though...
Be careful however, that
dim3 threads_perblock(32, 32)
means you have 32 x 32 = 1024 threads per block.

Does this code fill the CPU cache?

I have two ways to program the same functionality.
Method 1:
doTheWork(int action)
{
for(int i = 0 i < 1000000000; ++i)
{
doAction(action);
}
}
Method 2:
doTheWork(int action)
{
switch(action)
{
case 1:
for(int i = 0 i < 1000000000; ++i)
{
doAction<1>();
}
break;
case 2:
for(int i = 0 i < 1000000000; ++i)
{
doAction<2>();
}
break;
//-----------------------------------------------
//... (there are 1000000 cases here)
//-----------------------------------------------
case 1000000:
for(int i = 0 i < 1000000000; ++i)
{
doAction<1000000>();
}
break;
}
}
Let's assume that the function doAction(int action) and the function template<int Action> doAction() consist of about 10 lines of code that will get inlined at compile-time. Calling doAction(#) is equiavalent to doAction<#>() in functionality, but the non-templated doAction(int value) is somewhat slower than template<int Value> doAction(), since some nice optimizations can be done in the code when the argument value is known at compile time.
So my question is, do all the millions of lines of code fill the CPU L1 cache (and more) in the case of the templated function (and thus degrade performance considerably), or does only the lines of doAction<#>() inside the loop currently being run get cached?
It depends on the actual code size - 10 lines of code can be little or much - and of course on the actual machine.
However, Method 2 violently violates this decades rule of thumb: instructions are cheap, memory access is not.
Scalability limit
Your optimizations are usually linear - you might shave off 10, 20 maybe even 30% of execution time. Hitting a cache limit is highly nonlinear - as in "running into a brick wall" nonlinear.
As soon as your code size significantly exceeds the 2nd/3rd level cache's size, Method 2 will lose big time, as the following estimation of a high end consumer system shows:
DDR3-1333 with 10667MB/s peak memory bandwidth,
Intel Core i7 Extreme with ~75000 MIPS
gives you 10667MB / 75000M = 0.14 bytes per instruction for break even - anything larger, and main memory can't keep up with the CPU.
Typical x86 instruction sizes are 2..3 bytes executing in 1..2 cycles (now, granted, this isn't necessarily the same instructions, as x86 instructions are split up. Still...)
Typical x64 instruction lengths are even larger.
How much does your cache help?
I found the following number (different source, so it's hard to compare):
i7 Nehalem L2 cache (256K, >200GB/s bandwidth) which could almost keep up with x86 instructions, but probably not with x64.
In addition, your L2 cache will kick in completely only if
you have perfect prediciton of the next instructions or you don't have first-run penalty and it fits the cache completely
there's no significant amount of data being processed
there's no significant other code in your "inner loop"
there's no thread executing on this core
Given that, you can lose much earlier, especially on a CPU/board with smaller caches.
The L1 instruction cache will only contain instructions which were fetched recently or in anticipation of near future execution. As such, the second method cannot fill the L1 cache simply because the code is there. Your execution path will cause it to load the template instantiated version that represents the current loop being run. As you move to the next loop, it will generally invalidate the least recently used (LRU) cache line and replace it with what you are executing next.
In other words, due to the looping nature of both your methods, the L1 cache will perform admirably in both cases and won't be the bottleneck.

Keeping time using timer interrupts an embedded microcontroller

This question is about programming small microcontrollers without an OS. In particular, I'm interested in PICs at the moment, but the question is general.
I've seen several times the following pattern for keeping time:
Timer interrupt code (say the timer fires every second):
...
if (sec_counter > 0)
sec_counter--;
...
Mainline code (non-interrupt):
sec_counter = 500; // 500 seconds
while (sec_counter)
{
// .. do stuff
}
The mainline code may repeat, set the counter to various values (not just seconds) and so on.
It seems to me there's a race condition here when the assignment to sec_counter in the mainline code isn't atomic. For example, in PIC18 the assignment is translated to 4 ASM statements (loading each byte at the time and selecting the right byte from the memory bank before that). If the interrupt code comes in the middle of this, the final value may be corrupted.
Curiously, if the value assigned is less than 256, the assignment is atomic, so there's no problem.
Am I right about this problem?
What patterns do you use to implement such behavior correctly? I see several options:
Disable interrupts before each assignment to sec_counter and enable after - this isn't pretty
Don't use an interrupt, but a separate timer which is started and then polled. This is clean, but uses up a whole timer (in the previous case the 1-sec firing timer can be used for other purposes as well).
Any other ideas?
The PIC architecture is as atomic as it gets. It ensures that all read-modify-write operations to a memory file are 'atomic'. Although it takes 4-clocks to perform the entire read-modify-write, all 4-clocks are consumed in a single instruction and the next instruction uses the next 4-clock cycle. It is the way that the pipeline works. In 8-clocks, two instructions are in the pipeline.
If the value is larger than 8-bit, it becomes an issue as the PIC is an 8-bit machine and larger operands are handled in multiple instructions. That will introduce atomic issues.
You definitely need to disable the interrupt before setting the counter. Ugly as it may be, it is necessary. It is a good practice to ALWAYS disable the interrupt before configuring hardware registers or software variables affecting the ISR method. If you are writing in C, you should consider all operations as non-atomic. If you find that you have to look at the generated assembly too many times, then it may be better to abandon C and program in assembly. In my experience, this is rarely the case.
Regarding the issue discussed, this is what I suggest:
ISR:
if (countDownFlag)
{
sec_counter--;
}
and setting the counter:
// make sure the countdown isn't running
sec_counter = 500;
countDownFlag = true;
...
// Countdown finished
countDownFlag = false;
You need an extra variable and is better to wrap everything in a function:
void startCountDown(int startValue)
{
sec_counter = 500;
countDownFlag = true;
}
This way you abstract the starting method (and hide ugliness if needed). For example you can easily change it to start a hardware timer without affecting the callers of the method.
Write the value then check that it is the value required would seem to be the simplest alternative.
do {
sec_counter = value;
} while (sec_counter != value);
BTW you should make the variable volatile if using C.
If you need to read the value then you can read it twice.
do {
value = sec_counter;
} while (value != sec_counter);
Because accesses to the sec_counter variable are not atomic, there's really no way to avoid disabling interrupts before accessing this variable in your mainline code and restoring interrupt state after the access if you want deterministic behavior. This would probably be a better choice than dedicating a HW timer for this task (unless you have a surplus of timers, in which case you might as well use one).
If you download Microchip's free TCP/IP Stack there are routines in there that use a timer overflow to keep track of elapsed time. Specifically "tick.c" and "tick.h". Just copy those files over to your project.
Inside those files you can see how they do it.
It's not so curious about the less than 256 moves being atomic - moving an 8 bit value is one opcode so that's as atomic as you get.
The best solution on such a microcontroller as the PIC is to disable interrupts before you change the timer value. You can even check the value of the interrupt flag when you change the variable in the main loop and handle it if you want. Make it a function that changes the value of the variable and you could even call it from the ISR as well.
Well, what does the comparison assembly code look like?
Taken to account that it counts downwards, and that it's just a zero compare, it should be safe if it first checks the MSB, then the LSB. There could be corruption, but it doesn't really matter if it comes in the middle between 0x100 and 0xff and the corrupted compare value is 0x1ff.
The way you are using your timer now, it won't count whole seconds anyway, because you might change it in the middle of a cycle.
So, if you don't care about it. The best way, in my opinion, would be to read the value, and then just compare the difference. It takes a couple of OPs more, but has no multi-threading problems.(Since the timer has priority)
If you are more strict about the time value, I would automatically disable the timer once it counts down to 0, and clear the internal counter of the timer and activate once you need it.
Move the code portion that would be on the main() to a proper function, and have it conditionally called by the ISR.
Also, to avoid any sort of delaying or missing ticks, choose this timer ISR to be a high-prio interrupt (the PIC18 has two levels).
One approach is to have an interrupt keep a byte variable, and have something else which gets called at least once every 256 times the counter is hit; do something like:
// ub==unsigned char; ui==unsigned int; ul==unsigned long
ub now_ctr; // This one is hit by the interrupt
ub prev_ctr;
ul big_ctr;
void poll_counter(void)
{
ub delta_ctr;
delta_ctr = (ub)(now_ctr-prev_ctr);
big_ctr += delta_ctr;
prev_ctr += delta_ctr;
}
A slight variation, if you don't mind forcing the interrupt's counter to stay in sync with the LSB of your big counter:
ul big_ctr;
void poll_counter(void)
{
big_ctr += (ub)(now_ctr - big_ctr);
}
No one addressed the issue of reading multibyte hardware registers (for example a timer.
The timer could roll over and increment its second byte while you're reading it.
Say it's 0x0001ffff and you read it. You might get 0x0010ffff, or 0x00010000.
The 16 bit peripheral register is volatile to your code.
For any volatile "variables", I use the double read technique.
do {
t = timer;
} while (t != timer);