I am using an AVR ATmega328 controller.
I need to check the status of two bits. I have two approaches, but I am not sure which one is more efficient. In the first case in the code below, I read the port register PIND twice (two PIND accesses are made). Is this an issue, or should I go with the second if statement?
#define SW1 2
#define SW2 5
//check if both are high
if((PIND & (1<<SW1)) && (PIND & (1<<SW2)))
//Or
if((PIND & ((1<<SW1)|(1<<SW2))) == ((1<<SW1)|(1<<SW2)))
I suppose PIND is a peripheral register (what do you mean by "command"?).
The first check will read it once or twice, depending on the result of the first test. This can cause a glitch if the inputs change between the first and second read, because the two bits are then not sampled at the same instant.
Presuming SW? are switches: in general, it is better to read such a register once (possibly into a variable) and then test the bits. That ensures you get a snapshot of all input bits at the same time.
Also, the second version is not necessarily slower, as it saves a branch (IIRC AVR has no conditional instructions other than branch/jump).
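A minimal sketch of that snapshot approach, reusing the question's SW1/SW2 defines:

#include <avr/io.h>
#include <stdint.h>

#define SW1 2
#define SW2 5
#define SW_MASK ((1 << SW1) | (1 << SW2))

/* Returns nonzero when both switches read high in the same snapshot. */
static uint8_t both_switches_high(void)
{
    uint8_t pins = PIND;               /* read the port exactly once */
    return (pins & SW_MASK) == SW_MASK;
}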
The first reads PIND twice, so it is less efficient and a potential source of error: the time between the reads may be significant if interrupts or thread context switches intervene.
The second evaluates (1<<SW1)|(1<<SW2) twice, but in this case it is a compile-time constant, so it should have no impact on performance. To be honest, you are probably sweating the small stuff by worrying about the code at this level. If you really need to know, take a look at the code the compiler generates (in the debugger disassembly, or by getting the compiler to output an assembly listing), but it will take a long time to write code if you intend to second-guess every line like this.
It is hard to say which of the two will be more efficient without seeing the assembler output. However, I would guess that the difference in performance is very small.
In case 2 you avoid reading the register twice, but you perform two additional operations.
If I were to make a guess, I would say option 1 is likely to be faster, since PIND is an AVR register and the cost of reading it is low.
Having said that, I would suggest using the first option, as it is more readable even if it turns out not to be faster than option 2.
I've recently picked up Michael Abrash's The Zen of Assembly Language (from 1990) and was reading a section on how instruction prefetching is not always advantageous such as the case when branching occurs (a jump). This is because all of the instructions that were prefetched are no longer the ones to be executed and so more instructions must be fetched.
This reminded me of an optimization from another old book, Tricks of the Game Programming Gurus by Andre LaMothe in which he suggests that when setting up your conditional statements, you put the most frequently (or expected) path first.
For example:
if (booleanThatIsMostLikelyToBeTrue)
{
// ...expected code
// also the code that would've been prefetched
}
else
{
// ...exceptional or less likely code
}
My questions are:
1) Was LaMothe's optimization suggested with this in mind? (I no longer have the book)
2) Is this type of optimization still a worthwhile programming habit on modern machines? (Maybe prefetching is handled differently than it used to be?)
You want to set up your code to branch as little as possible, and to branch backwards when it does. A more reliable way to structure that if is to always do the common thing and then test for the exception:
Do A;
if( test ) Do B;
Of course, this has to be arranged so that anything A does is reversed by B if B occurs.
The point of Zen programming is to try to eliminate the if statements altogether. So, for example, instead of looping 10 times (which requires an exit-condition test), you just write the same statement 10 times -- voila, no if statement. Another example: if you are looping over a list, you use a sentinel to exit the loop instead of testing an index value.
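A rough sketch of the sentinel idea, here as a linear search that plants the key in a spare slot so the scan never needs a separate bounds test (the spare slot at buf[n] is an assumption of this sketch):

/* Sentinel search: one comparison per iteration instead of two. */
int find(int *buf, int n, int key)
{
    int saved = buf[n];       /* remember whatever lives in the spare slot */
    buf[n] = key;             /* plant the sentinel: the scan cannot run past it */

    int i = 0;
    while (buf[i] != key)     /* no separate index check needed */
        i++;

    buf[n] = saved;           /* restore the spare slot */
    return (i < n) ? i : -1;  /* -1 means only the sentinel matched */
}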
If you are working in C, it can be difficult to coax the compiler into doing what you want. Putting something first or second in an if statement will have no effect on the compiled result. Note that it is critical to use the right compiler options; for example, using /O2 (optimize for speed) in Visual C++ makes a HUGE difference in the compiled efficiency.
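For what it's worth, GCC and Clang do let you state the expected outcome of a test explicitly rather than relying on statement order; a small sketch using __builtin_expect (the likely/unlikely macro names are just a common convention, not something from the answer above):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int value)
{
    if (unlikely(value < 0))   /* hint: this branch is rarely taken */
        return -1;             /* exceptional path */

    return value * 2;          /* common path, laid out as fall-through */
}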
These kinds of optimizations can often be useful. However, it's typically something that you do after you've written the program, profiled it, and determined that the program would benefit from doing these micro-optimizations.
Also, many modern compilers have profile-guided optimization, which can relieve you of having to contort your code for performance purposes.
When I get a kernel that uses too many registers, there are basically three things I can do:
leave the kernel as it is, which results in low occupancy
tell the compiler to use a lower number of registers, spilling the rest, which causes worse performance
rewrite the kernel
For option 3, I'd like to know which part of the kernel needs the maximum number of registers. Is there any tool or technique that would allow me to identify this part? Reading through the PTX code (I develop on NVidia) is not helpful: the registers have various high numbers and, to be honest, the best I can do is identify which part of the assembly code maps to which part of the C code.
Just commenting out some code is not much of a way to go -- for example, I noticed that if I just put the code into a loop, the number of registers rises dramatically, not only by one for the loop control variable. I personally suspect the NVidia compiler of imperfect variable-liveness analysis, but of course I cannot do much about that :-)
If you're running on NVidia hardware, you can pass the -cl-nv-verbose compile option to clBuildProgram and then call clGetProgramInfo with CL_PROGRAM_BINARIES to get human-readable text about the compile. In there it will say the number of registers it uses. Note that NVidia caches compiles, and it only produces that register info when the kernel source actually changes, so you may want to inject some superfluous change into the source code to force it to do the full analysis.
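A host-side sketch of that recipe, assuming the program is built for a single device and skipping error checking (depending on the driver, the register count may also show up in the build log, which clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG would return):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

void dump_nv_compile_info(cl_program program, cl_device_id device)
{
    /* Build with NVidia's verbose flag so register usage gets reported. */
    clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);

    /* For NVidia the returned "binary" is human-readable PTX-style text. */
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);

    unsigned char *text = malloc(size);
    unsigned char *binaries[1] = { text };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binaries), binaries, NULL);

    fwrite(text, 1, size, stdout);   /* look for the register usage lines */
    free(text);
}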
If you're running on AMD hardware, just set the environment variable GPU_DUMP_DEVICE_KERNEL=1. It will produce a text file of the IL during the compile. I'm not sure it explicitly says the number of registers used, but it is the equivalent of the NVidia technique above.
When looking at that output (at least on nvidia), you'll find that it seems to use an infinite number of registers (if you go by the register numbers). In reality, it does a flow analysis and actually reuses registers in a way that is not at all obvious when looking at the IL.
This is a tough question in any language, and there's probably not one correct answer. But here are some things to think about:
Look for the code in the "deepest" scope you can find, keeping in mind that most functions are probably inlined by your OpenCL compiler. Count the variables used in this scope, and walk up the containing scopes. In each containing scope, count variables that are used both before and after the inner scope. These are potentially live while the inner scope executes. This process could help you account for the live registers at a particular part of the program.
Try to take variables that "span" deep scopes and make them not span the scope if possible. For example, if you have something like this:
int i = func1();
int j = func2(); // perhaps lots of live registers here
int k = func3(i,j);
you could try to reorder the first two lines if func2 has lots of live registers. That would remove i from the set of live registers while func2 is running. This is a trivial pattern, of course, but hopefully it's illustrative.
Think about getting rid of variables that just keep around the results of simple computations. You might be able to recompute these when you need them. For example, if you have something like int i = get_local_id(0) you might be able to just use get_local_id(0) wherever you would have used i.
Think about getting rid of variables that are keeping around values stored in memory.
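A hypothetical kernel fragment to illustrate: the value already lives in global memory, so re-reading it at each use trades an extra load for a register, and whether that is a win depends on how memory-bound the kernel is.

__kernel void scale_add(__global const float *data, __global float *out)
{
    int gid = get_global_id(0);

    /* Register-heavy form:
     *   float cached = data[gid];
     *   out[gid] = cached * 2.0f + cached;
     */

    /* Register-lean form: reload from memory instead of caching. */
    out[gid] = data[gid] * 2.0f + data[gid];
}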
Without good tools for this kind of thing, it ends up being more art than science. But hopefully some of this is helpful.
I have implemented FIFO semaphores, but now I need a way to test/prove that they are working properly. A simple test would be to create some threads that wait on a semaphore and then print a message with a number; if the numbers are in order, it should be FIFO. But this is not good enough to prove it, because that order could have occurred by chance. Thus, I need a better way of testing it.
If necessary, locks or condition variables can be used too.
What you describe with your sentence "but this is not good enough to prove it because that order could have occurred by chance" is a well-known dilemma.
1) Even if you have a specification, you cannot ensure that the specification matches your intention. To illustrate this, I will take an example from "The Limits of Correctness". Consider a specification for a factorization function:
Compute A and B such that A * B = C
But that is not enough, as you could have an implementation that returns A=1 and B=C. Adding A,B != 1 can still lead to A=-1 and B=-C, so the only correct specification must state A,B > 1. That's just to illustrate how complicated it can be to write a specification that matches the real intention.
2) Even having proved an algorithm, still doesn't mean the implementation is correct in practice. This is best illustrated with this quote from Donald Knuth:
Beware of bugs in the above code; I have only proved it correct, not tried it.
3) Testing can only reveal the presence of bug, not their absence. This quote goes back to Dijkstra:
Testing can be used to show the presence of bugs but never to show their absence.
Conclusion: you are doomed and you will never be 100% sure that your code is correct with respect to its intent! But things aren't that bad. Having high confidence in the code is usually enough. For instance, if using multiple threads is still not enough for you, you can also use fuzzing so as to randomize the test execution even more. If your tests always pass, well, you can be pretty confident that your code is good.
because that order could have occurred by chance.
You can run the test a few times, e.g. 10, and check that the order was correct each time. This makes it very unlikely that the ordering happened by chance.
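A rough sketch of that repeated-ordering test in C with pthreads; fifo_sem_t, fifo_sem_init, fifo_sem_wait and fifo_sem_post are placeholders for the semaphore under test, and the arrival order is forced only probabilistically with a short sleep, which is exactly why the test needs to be repeated:

#include <pthread.h>
#include <unistd.h>

#define NTHREADS 8

static fifo_sem_t sem;                  /* semaphore under test, initial count 0 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int wake_order[NTHREADS];
static int wakes;

static void *worker(void *arg)
{
    int id = (int)(long)arg;

    fifo_sem_wait(&sem);                /* block until released */

    pthread_mutex_lock(&lock);
    wake_order[wakes++] = id;           /* record the release order */
    pthread_mutex_unlock(&lock);

    fifo_sem_post(&sem);                /* hand over to the next waiter */
    return NULL;
}

/* Returns 1 if the release order matched the arrival order, 0 otherwise. */
int run_once(void)
{
    pthread_t t[NTHREADS];
    wakes = 0;
    fifo_sem_init(&sem, 0);             /* nobody may proceed yet */

    for (long i = 0; i < NTHREADS; i++) {
        pthread_create(&t[i], NULL, worker, (void *)i);
        usleep(10000);                  /* let thread i block before i+1 starts */
    }

    fifo_sem_post(&sem);                /* release the chain */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    for (int i = 0; i < NTHREADS; i++)  /* FIFO: release order == arrival order */
        if (wake_order[i] != i)
            return 0;
    return 1;
}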
P.S. Multiple threads in a unit test are usually avoided.
I need to optimize code to make room for some new code. I do not have the space for all the changes. I cannot use code bank switching (80C31 with 64k).
You haven't really given a lot to go on here, but there are two main levels of optimizations you can consider:
Micro-Optimizations:
eg. XOR A instead of MOV A,0
Adam has covered some of these nicely earlier.
Macro-Optimizations:
Look at the structure of your program, the data structures and algorithms used, the tasks performed, and think VERY hard about how these could be rearranged or even removed. Are there whole chunks of code that actually aren't used? Is your code full of debug output statements that the user never sees? Are there functions specific to a single customer that you could leave out of a general release?
To get a good handle on that, you'll need to work out WHERE your memory is being used up. The Linker map is a good place to start with this. Macro-optimizations are where the BIG wins can be made.
As an aside, you could - seriously - try rewriting parts of your code with a good optimizing C compiler. You may be amazed at how tight the code can be. A true assembler hotshot may be able to improve on it, but it can easily be better than most coders. I used the IAR one about 20 years ago, and it blew my socks off.
With assembly language, you'll have to optimize by hand. Here are a few techniques:
Note: IANA8051P (I am not an 8051 programmer, but I have done lots of assembly on other 8-bit chips).
Go through the code looking for any duplicated bits, no matter how small, and make them functions.
Learn some of the more unusual instructions and see if you can use them to optimize, eg. A nice trick is to use XOR A to clear the accumulator instead of MOV A,0 - it saves a byte.
Another neat trick: if you call a function just before returning, jump to it instead. E.g. instead of:
CALL otherfunc
RET
Just do:
JMP otherfunc
Always make sure you are doing relative jumps and branches wherever possible, they use less memory than absolute jumps.
That's all I can think of off the top of my head for the moment.
Sorry I am coming to this late, but I once had exactly the same problem, and it became a repeated problem that kept coming back to me. In my case the project was a telephone, on an 8051-family processor, and I had totally maxed out the ROM (code) memory. It kept coming back to me because management kept requesting new features, so each new feature became a two-step process: 1) optimize old stuff to make room; 2) implement the new feature, using up the room I just made.
There are two approaches to optimization: tactical and strategic. Tactical optimizations save a few bytes at a time with a micro-optimization idea. I think you need strategic optimizations, which involve a more radical rethinking of how you are doing things.
Something I remember that worked for me and could work for you:
Look at the essence of what your code has to do and try to distill out some really strong, flexible primitive operations. Then rebuild your top-level code so that it does nothing low level at all except call the primitives. Ideally use a table-based approach: your table contains things like input state, event, output state, primitives... In other words, when an event happens, look up the cell in the table for that event in the current state. That cell tells you what new state to change to (optionally) and what primitive(s) (if any) to execute. You might need multiple sets of states/events/tables/primitives for different layers/subsystems.
One of the many benefits of this approach is that you can think of it as building a custom language for your particular problem, in which you can very efficiently (i.e. with minimal extra code) create new functionality simply by modifying the table.
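A sketch of that table-driven idea, with made-up states, events and primitives for a phone-like device (the real tables come entirely from your own design):

#include <stdint.h>

typedef enum { ST_IDLE, ST_RINGING, ST_CALL, NUM_STATES } state_t;
typedef enum { EV_INCOMING, EV_OFFHOOK, EV_ONHOOK, NUM_EVENTS } event_t;

typedef void (*primitive_t)(void);

/* Small, reusable primitives; empty stubs here. */
static void start_ringer(void)     {}
static void connect_audio(void)    {}
static void stop_ringer(void)      {}
static void disconnect_audio(void) {}
static void no_op(void)            {}

typedef struct {
    uint8_t     next_state;
    primitive_t action;
} cell_t;

/* One row per state, one column per event: the "custom language" of the device. */
static const cell_t table[NUM_STATES][NUM_EVENTS] = {
    /* ST_IDLE    */ { { ST_RINGING, start_ringer }, { ST_IDLE, no_op         }, { ST_IDLE, no_op            } },
    /* ST_RINGING */ { { ST_RINGING, no_op        }, { ST_CALL, connect_audio }, { ST_IDLE, stop_ringer      } },
    /* ST_CALL    */ { { ST_CALL,    no_op        }, { ST_CALL, no_op         }, { ST_IDLE, disconnect_audio } },
};

static uint8_t current_state = ST_IDLE;

/* All top-level logic collapses into one lookup plus one primitive call. */
void dispatch(event_t ev)
{
    const cell_t *cell = &table[current_state][ev];
    cell->action();
    current_state = cell->next_state;
}

Adding a new feature then mostly means editing the table rather than writing new control flow.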
Sorry I am months late and you probably didn't have time to do something this radical anyway. For all I know you were already using a similar approach! But my answer might help someone else someday who knows.
In the whacked-out department, you could also consider compressing part of your code and only keeping the part that is actively used decompressed at any particular point in time. I have a hard time believing that the code required for the compress/decompress system would be a small enough portion of the 8051's tiny memory to make this worthwhile, but it has worked wonders on slightly larger systems.
Yet another approach is to turn to a byte-code format or the kind of table-driven code that some state machine tools output -- having a machine understand what your app is doing and generating a completely incomprehensible implementation can be a great way to save room :)
Finally, if the code is indeed compiled in C, I would suggest compiling with a range of different options to see what happens. Also, I wrote a piece on compact C coding for the ESC back in 2001 that is still pretty current. See that text for other tricks for small machines.
1) Where possible, keep your variables in idata rather than xdata.
2) Look at your jump statements -- make use of SJMP and AJMP where you can.
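A tiny sketch of point 1, assuming a Keil-style C51 compiler with memory-type specifiers (idata lives in internal RAM and is reached with short direct/indirect instructions, while xdata needs MOVX through DPTR and costs more code bytes):

unsigned char idata loop_count;        /* small, hot variables: keep them internal */
unsigned char xdata log_buffer[128];   /* large buffers: push them to external RAM */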
I assume you know it won't fit because you wrote/compiled it and got the "out of memory" error. :) It appears the answers address your question pretty accurately, short of giving code examples.
I would, however, recommend a few additional thoughts:
1) Make sure all the code is really being used -- a code coverage test? An unused sub is a big win. This is a tough step; if you're the original author, it may be easier -- (well, maybe) :)
2) Check the level of "verification" and initialization -- sometimes we have a tendency to be overzealous in ensuring we have initialized variables/memory, and rightly so -- how many times have we been bitten by it? Not saying don't initialize (duh), but if we're doing a memory move, the destination doesn't need to be zeroed first. This dovetails with 1.
3) Evaluate the new features -- can an existing sub be enhanced to cover both functions, or perhaps an existing feature be replaced?
4) Break up big code if a piece of the big code can save creating new little code.
or perhaps there's an argument for hardware version 2.0 on the table now ... :)
Besides the already-mentioned (more or less) obvious optimizations, here is a really weird (and almost impossible to achieve) one: code reuse. And by code reuse I don't mean the normal kind, but a) reusing your code as data, or b) reusing your code as other code. Maybe you can create a LUT (or whatever static data) that can be represented by the hex opcodes of your assembly (here you have to consider Harvard vs. von Neumann architecture).
The other is to reuse code by giving it a different meaning when you address it differently. Here is an example to make clear what I mean. If the bytes of your code look like this: AABCCCDDEEFFGGHH at address X, where each letter stands for one opcode, imagine you now jump to X+1. Maybe you get completely different functionality, where the bytes, regrouped as shown by the spaces, form the new opcodes: ABC CCD DE EF GH.
But beware: this is not only tricky to achieve (maybe it's impossible), it is a horror to maintain. So unless you are writing a demo (or something similarly exotic), I would recommend using the other ways to save memory already mentioned.
I'd like to gather metrics on specific routines of my code to see where I can best optimize. Let's take a simple example and say that I have a "Class" database with multiple "Students." Let's say the current code calls the database for every student instead of grabbing them all at once in a batch. I'd like to see how long each trip to the database takes for every student row.
This is in C#, but I think it applies everywhere. Usually when I get curious about a specific routine's performance, I'll create a DateTime object before it runs, run the routine, and then create another DateTime object after the call and take the difference in milliseconds between the two to see how long it runs. Usually I just output this in the page's trace... so it's a bit lo-fi. Any best practices for this? I thought about being able to put the web app into some "diagnostic" mode and doing verbose logging/event-log writing with whatever I'm after, but I wanted to see if the stackoverflow hive mind has a better idea.
For database queries, you have two small problems with caching: the data cache and the statement cache.
If you run the query once, the statement is parsed, prepared, bound and executed. Data is fetched from files into cache.
When you execute the query a second time, the cache is used, and performance is often much, much better.
Which is the "real" performance number? First one or second one? Some folks say "worst case" is the real number, and we have to optimize that. Others say "typical case" and run the query twice, ignoring the first one. Others says "average" and run in 30 times, averaging them all. Other say "typical average", run the 31 times and average the last 30.
I suggest that the "last 30 of 31" is the most meaningful DB performance number. Don't sweat the things you can't control (parse, prepare, bind times). Sweat the stuff you can control -- data structures, I/O loading, indexes, etc.
I use this method on occasion and find it to be fairly accurate. The problem is that in large applications with a fairly hefty amount of debugging logs, it can be a pain to search through the logs for this information. So I use external tools (I program in Java primarily, and use JProbe) which allow me to see average and total times for my methods, how much time is spent exclusively by a particular method (as opposed to the cumulative time spent by the method and any method it calls), as well as memory and resource allocations.
These tools can assist you in measuring the performance of entire applications, and if you are doing a significant amount of development in an area where performance is important, you may want to research the tools available and learn how to use one.
Sometimes the approach you take will give you the best look at your application's performance.
One thing I can recommend is to use System.Diagnostics.Stopwatch instead of DateTime:
DateTime is accurate only to about 16 ms, whereas Stopwatch is accurate to the CPU tick.
But I recommend complementing it with custom performance counters for production, and running the app under a profiler during development.
There are some profilers available but, frankly, I think your approach is better. The profiler approach is overkill. Maybe the use of profilers is worth the trouble if you absolutely have no clue where the bottleneck is. I would rather spend a little time analyzing the problem up front and putting in a few strategic print statements than figure out how to instrument the app for profiling and then pore over gargantuan reports where every executable line of code is timed.
If you're working with .NET, then I'd recommend checking out the Stopwatch class. The times you get back from that are going to be much more accurate than an equivalent sample using DateTime.
I'd also recommend checking out ANTS Profiler for scenarios in which performance is exceptionally important.
It is worth considering investing in a good commercial profiler, particularly if you ever expect to have to do this a second time.
The one I use, JProfiler, works in the Java world and can attach to an already-running application, so no special instrumentation is required (at least with the more recent JVMs).
It very rapidly builds a sorted list of hotspots in your code, showing which methods your code is spending most of its time inside. It filters pretty intelligently by default, and allows you to tune the filtering further if required, meaning that you can ignore the detail of third party libraries, while picking out those of your methods which are taking all the time.
In addition, you get lots of other useful reports on what your code is doing. It paid for the cost of the licence in the time I saved the first time I used it; I didn't have to add lots of logging statements and construct a mechanism to analyse the output: the developers of the profiler had already done all of that for me.
I'm not associated with ej-technologies in any way other than being a very happy customer.
I use this method and I think it's very accurate.
I think you have a good approach. I recommend that you produce "machine friendly" records in the log file(s) so that you can parse them more easily. Something like CSV or other-delimited records that are consistently structured.