Effect of a large number of functions in an application - oop

If I have a large number of functions in my application, do they affect the execution speed of the application?
For example: I have 10000 functions in my application, but each time I run it only 1 or 2 of them will actually be called. It is not known beforehand which function(s) will be called; it depends on user-given input.
Does the execution speed change if I have a large number of functions?

The speed shouldn't be significantly affected in your case. The number of procedures defined is much less important than the computational complexity of each procedure called.
Think about it. A 2.5GHz processor can theoretically perform more than 10 billion floating-point operations per second (FLOPS). The time required to load a fixed number of procedures into memory, even a million lines of code, will remain constant and fairly trivial, but if one of your procedures is complex enough, the number of operations can increase massively over comparatively few iterations.

The 9,998 functions that are never called but still present (since they are referenced) do not affect performance, unless all of the code has to be parsed on every run.
I'm thinking the size of the case analysis might affect performance. If you have 10,000 functions and only use about 2 each time, then you have about 5,000 possible outcomes, and that means a lot of tests if the dispatch is linear, or about 13 if it is binary.
I'd start by profiling the code to find the bottlenecks.
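To make the dispatch point concrete, here is a minimal C++ sketch (the command names and handlers are invented for the example): selecting one function out of thousands via a hash map costs roughly the same no matter how many entries the table has, so the unused functions cost memory, not time.

#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical handlers; a real application might have thousands of these.
void handleAdd()  { std::cout << "add selected\n"; }
void handleSave() { std::cout << "save selected\n"; }

int main()
{
    // Dispatch table: average O(1) lookup, independent of the number of entries.
    std::unordered_map<std::string, std::function<void()>> dispatch = {
        {"add",  handleAdd},
        {"save", handleSave},
        // ... thousands more entries would not noticeably slow down a single lookup
    };

    std::string command;
    if (std::cin >> command) {
        auto it = dispatch.find(command);
        if (it != dispatch.end())
            it->second();                     // only the selected function runs
        else
            std::cout << "unknown command\n";
    }
    return 0;
}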

Related

In CUDA programming, are atomic functions faster than reducing after calculating the intermediate results?

Atomic functions (such as atomic_add) are widely used for counting or performing summation/aggregation in CUDA programming. However, I cannot find information about the speed of atomic functions compared with ordinary global memory reads/writes.
Consider the following task, where we want to calculate a floating-point array with 256K elements. Each element is the sum of 1000 intermediate variables, which are calculated first. One approach is to use atomic_add; another approach is to use a 256K*1000 temporary array for the intermediate results and then reduce this array (by taking the sum).
Is the first approach using atomic function faster than the second?
In your specific case, even without you providing a concrete program, one does not need to know anything about the difference in latency or in bandwidth between atomic and non-atomic operations to rule out both your approaches: They are both quite inefficient.
You should have single blocks handling single output variables (or a small number of output variables), so that the sum of each element's 1,000 intermediate variables is not performed via global memory. You may want to read the "classic" presentation by Mark Harris:
Optimizing Parallel Reduction in CUDA
to get the basics. There have been improvements over this in recent years, due to newer hardware capabilities. For a more recent actual implementation, see the CUB library's block reduction primitive.
Also relevant: CUDA: how to sum all elements of an array into one number within the GPU?
If you implement it this way, each output element is written only once. And even if the computation of the 1,000 intermediates somehow needs to be distributed among multiple blocks, for whatever reason you have not shared in the question, you should still distribute it over a small number of blocks rather than 1,000, so that the global-memory writes of the result take up a small enough fraction of the total computation time that it is not worth bothering with anything other than an atomic addition.
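For illustration only, here is a minimal CUDA C++ sketch of that pattern, assuming one block per output element and a power-of-two block size; computeIntermediate is a placeholder standing in for whatever actually produces the intermediate values in the real program:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the real computation of one intermediate value.
__device__ float computeIntermediate(int outIdx, int i)
{
    return static_cast<float>(outIdx % 7 + i % 13);   // arbitrary stand-in
}

// One block per output element: each thread accumulates part of the 1000
// intermediates in a register, the block reduces the partial sums in shared
// memory, and a single non-atomic write stores the result. The intermediates
// never touch global memory.
__global__ void sumIntermediates(float* out, int numIntermediates)
{
    extern __shared__ float partial[];
    const int outIdx = blockIdx.x;

    float acc = 0.0f;
    for (int i = threadIdx.x; i < numIntermediates; i += blockDim.x)
        acc += computeIntermediate(outIdx, i);
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Shared-memory tree reduction (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[outIdx] = partial[0];
}

int main()
{
    const int numOutputs = 256 * 1024;     // 256K output elements, as in the question
    const int numIntermediates = 1000;
    const int threads = 256;

    float* d_out = nullptr;
    cudaMalloc(&d_out, numOutputs * sizeof(float));
    sumIntermediates<<<numOutputs, threads, threads * sizeof(float)>>>(d_out, numIntermediates);
    cudaDeviceSynchronize();

    float first = 0.0f;
    cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("out[0] = %f\n", first);
    cudaFree(d_out);
    return 0;
}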

Is faster code also more power efficient?

Assume I have a CPU running at a constant rate, pulling an equal amount of energy per instruction. I also have two functionally identical programs, which result in the same output, except one has been optimized to execute only 100 instructions, while the other program executes 200 instructions. Is the 100 instruction program necessarily faster than the 200 instruction program? Does a program with fewer instructions draw less power than a program with more instructions?
Things are much more complex than this.
For example, execution speed is in many cases dominated by memory access. As a practical example, some code could process the pixels of an image first by rows and then by columns... different code could be more complex but process rows and columns in a single pass.
The second version could execute more instructions because of the more complex housekeeping of the data, but I wouldn't be surprised if it were faster because of how memory is organized: reading an image one column at a time is going to thrash the cache, and it's very possible that, despite being simple, the code working that way is a LOT slower than the more complex code doing the processing in a memory-friendly way. The simpler code may end up stalled much of the time, waiting for cache lines to be filled from or flushed to external memory.
This is just an example, but what happens inside a powerful modern CPU when code is executed is a very complex process: instructions are decoded into micro-instructions, registers are renamed, parts of the code are executed speculatively based on what the branch predictors guess even before the program counter really reaches a certain instruction, and so on. Today, in many cases the only way to know for sure whether something is faster or slower is to try it with real data and measure.
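To make the memory point concrete, here is a small plain C++ sketch (the image size is arbitrary): both loops perform exactly the same number of additions, but the row-major pass reads memory sequentially while the column-major pass strides across it, which on a cached machine is typically much slower.

#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    const int N = 4096;                                // arbitrary square "image"
    std::vector<float> img(static_cast<size_t>(N) * N, 1.0f);

    auto t0 = std::chrono::steady_clock::now();
    float rowSum = 0.0f;
    for (int y = 0; y < N; ++y)                        // row by row: sequential access
        for (int x = 0; x < N; ++x)
            rowSum += img[static_cast<size_t>(y) * N + x];
    auto t1 = std::chrono::steady_clock::now();

    float colSum = 0.0f;
    for (int x = 0; x < N; ++x)                        // column by column: strided, cache-unfriendly
        for (int y = 0; y < N; ++y)
            colSum += img[static_cast<size_t>(y) * N + x];
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "row-major:    " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "column-major: " << std::chrono::duration<double>(t2 - t1).count() << " s\n"
              << "(sums: " << rowSum << " " << colSum << ")\n";
    return 0;
}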
Is the 100 instruction program necessarily faster than the 200 instruction program?
No. Firstly, on some architectures (such as x86) different instructions can take a different number of cycles. Secondly, there are effects, such as cache misses, page faults and branch mispredictions, that complicate the picture further.
From this it follows that the answer to your headline question is "not necessarily".
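One way to see the branch-misprediction effect for yourself is the classic sorted-versus-unsorted experiment, sketched below in C++ (the data size and threshold are arbitrary; depending on the compiler and optimization level the branch may be turned into a conditional move, which hides the effect):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Sums all values >= 128. The instruction count is identical for sorted and
// unsorted input; only the predictability of the branch changes.
static long long sumAbove(const std::vector<int>& v)
{
    long long sum = 0;
    for (int x : v)
        if (x >= 128)                       // hard to predict on random data
            sum += x;
    return sum;
}

int main()
{
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<int> data(1 << 24);         // arbitrary size
    for (int& x : data) x = dist(rng);

    auto time = [](const std::vector<int>& v, const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = sumAbove(v);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << label << ": sum=" << s << ", "
                  << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    };

    time(data, "unsorted");                 // many branch mispredictions
    std::sort(data.begin(), data.end());
    time(data, "sorted");                   // the branch becomes predictable
    return 0;
}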
Further reading:
I found a paper from 2017 comparing the energy usage, speed, and memory consumption of various programming languages. There is a clear positive correlation: faster languages also tend to use less energy.

Does optimizing code in TI-BASIC actually make a difference?

I know in TI-BASIC, the convention is to optimize obsessively and to save as many bits as possible (which is pretty fun, I admit).
For example,
DelVar Z
Prompt X
If X=0
Then
Disp "X is zero"
End //28 bytes
would be cleaned up as
DelVar ZPrompt X
If not(X
"X is zero //20 bytes
But does optimizing code this way actually make a difference? Does it noticeably run faster or save memory?
Yes. Optimizing your TI-Basic code makes a difference, and that difference is much larger than you would find for most programming languages.
In my opinion, the most important optimization to TI-Basic programs is size (making them as small as possible). This is important to me since I have dozens of programs on my calculator, which only has 24 kB of user-accessible RAM. In this case, it isn't really necessary to spend lots of time trying to save a few bytes of space; instead, I simply advise learning the shortest and most efficient ways to do things, so that when you write programs, they will naturally tend to be small.
Additionally, TI-Basic programs should be optimized for speed. Examples off of the top of my head include the quirk with the unclosed For( loop, calculating a value once instead of calculating it in every iteration of a loop (if possible), and using quickly-accessed variables such as Ans and the finance variables whenever the variable must be accessed a large number of times (e.g. 1000+).
A third possible optimization is for run-time memory usage. Every loop, function call, etc. has an overhead that must be stored in the memory stack in order to return to the original location, calculate values, etc. during the program's execution. It is important to avoid memory leaks (such as breaking out of a loop with Goto).
It is up to you to decide how you balance these optimizations. I prefer to:
First and foremost, guarantee that there are no memory leaks or incorrectly nested loops in my program.
Take advantage of any size optimizations that have little or no impact on the program's speed.
Consider speed optimizations, and decide if the added speed is worth the increase in program size.
TI-BASIC is an interpreted language, which usually means there is a huge overhead on every single operation.
The way an interpreted language works is that instead of compiling the program into code that runs on the CPU directly, each operation is a function call into the interpreter, which looks at what needs to be done and then calls functions to complete those sub-tasks. In most cases the overhead is a factor of two or so in speed, and often also in stack memory usage; non-stack memory usage, however, is usually about the same.
In your example above you are doing the exact same number of operations, which should mean the two versions run equally fast. What you should optimize are things like turning i = i + 1, which is 4 operations, into i++, which is 2 operations (just as an example; TI-BASIC doesn't support the ++ operator).
This does not mean that all operations take the exact same time: internally, an operation may call hundreds of other functions, or it may be as simple as updating a single variable. The programmers of the interpreter may also have implemented various peephole optimizations for very specific language constructs; e.g. for(int i = 0; i < count; i++) could either be implemented as a collection of expensive interpreter functions that behave as if i were a generic value, or it could be optimized into a compiled loop that just updates the variable i and re-evaluates count.
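To illustrate the kind of per-operation overhead being described, here is a toy C++ sketch of how a naive interpreter dispatches work; the instruction set is invented, and a real interpreter (TI-BASIC's included) is far more involved:

#include <iostream>
#include <vector>

// Toy instruction set, for illustration only.
enum class Op { LoadConst, Add, Print, Halt };

struct Instr { Op op; double value; };

int main()
{
    // Program equivalent to: print(1 + 2)
    std::vector<Instr> program = {
        {Op::LoadConst, 1.0}, {Op::LoadConst, 2.0},
        {Op::Add, 0.0}, {Op::Print, 0.0}, {Op::Halt, 0.0}
    };

    std::vector<double> stack;
    for (std::size_t pc = 0; pc < program.size(); ++pc) {
        // Every single operation pays for this dispatch plus the stack
        // pushes and pops - work that compiled code would not do.
        switch (program[pc].op) {
            case Op::LoadConst:
                stack.push_back(program[pc].value);
                break;
            case Op::Add: {
                double b = stack.back(); stack.pop_back();
                double a = stack.back(); stack.pop_back();
                stack.push_back(a + b);
                break;
            }
            case Op::Print:
                std::cout << stack.back() << "\n";
                break;
            case Op::Halt:
                return 0;
        }
    }
    return 0;
}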
Now, not all interpreted languages are doomed to this pale existence. For example, JavaScript used to be one, but these days all the major JS engines JIT-compile the code to run directly on the CPU.
UPDATE: Clarified that not all operations are created equal.
Absolutely, it makes a difference. I wrote a full-scale color RPG for the TI-84+CSE, and let me tell you, without optimizing any of my code the game would flat out not run. At present, on the CSE, Sorcery of Uvutu can only run if every other program is archived and everything else is moved out of RAM. The programs and data storage alone take up 20 KB of RAM, just 1 KB under all of the available user memory. With all the variables in use, free memory approaches dangerously low levels. At some points in development, because of poor optimizations, I couldn't even start the game without getting an out-of-memory error. I had plans to implement various extra things, but due to space and speed concerns it was impossible to do so. And that's only the space side of things.
In the speed department, the game became, and still is, slow in the overworld. Walking around the overworld is painfully slow compared to other games, and that's because of what I have to do in that code: check for collisions, check if the user is moving to a new map, check if they pressed a key that should elicit a response, check if a battle should happen, and more. I was only able to make slight optimizations to the walking speed, but even then I could clearly tell I had made improvements. It was still pretty awfully slow (at least compared to every other port I've made), but I made it a little more tolerable.
In summary, from my own experience crafting a large project, I can say that in TI-Basic, optimizing code does make a difference. Other answers mentioned this, but TI-Basic is an interpreted language. That means the code isn't compiled into faster, lower-level code; instead, what you put in the program is read as it executes, interpreted by the interpreter, which calls the subroutines and other machinery needed to execute the commands, and then moves on to read the next line. As a result of that, and the fact that the TI-84+ series CPU, the Zilog Z80, was designed in 1976, you get a rather slow interpreter, especially for this day and age. So the fewer commands you run, and the more you take advantage of system quirks such as Ans being the fastest variable that can also hold the most types of data (integers/floats, strings, lists, matrices, etc.), the better the performance you're going to get.
Sources: My own experiences, documented here: https://codewalr.us/index.php?topic=778.msg27190#msg27190
TI-84+CSE RAM numbers came from here: https://education.ti.com/en/products/calculators/graphing-calculators/ti-84-plus-c-se?category=specifications
Information about the Z80 came from here: http://segaretro.org/Zilog_Z80
It depends. If it's just a basic math program, then no. For big games, then YES. The TI-84 has only 3.5MB of space available and pairs an ancient Z80 processor with a whopping 128KB of RAM. TI-BASIC is also quite slow, as it's interpreted (look it up for further information), so if you want to make fast-running games, then YES, optimization is very important.

Control of parallelization

I am running a custom processor on a rowset that does not seem to run in parallel. The underlying ~1GB text file is first read into a table that is partitioned via round robin. The 'Extract' runs on 200 vertices, but then (under the 'Aggregate' node) the processing [which does various complex computations] happens on only 2 vertices, even though the parallelism parameter is much higher than that. Is there a special hint that needs to be used to tell the compiler to use more vertices? Is there a function or property that needs to be overridden to set the parallelism at this phase as well?
Sorry for the late reply. But it is vacation time :).
It is good to see that the extract phase is fully scaled out.
Without seeing the script or the generated plan it is a bit difficult to say why you only see 2 vertices in some places. There are a couple of reasons why that may be the case:
you don't have enough data to scale out to more.
your aggregation needs more data and thus the plan has less parallelism.
your operation is intrinsically less parallel.
the optimizer's data cardinality estimation is off and it chooses too little parallelism. We have some ability to hint, but I would rather see the job first.
Note that custom processors often block the optimizer from pushing optimizations through in the script (using the READ ONLY option for example helps) and can throw off the cardinality estimations.
If you send me the script, the job graph and a link to the job (to mrys at Microsoft), the team and I will look into it next week, after the holidays are over.

basic operations cpu time cost

I was wondering how to optimize loops for systems with very limited resources. Let's say we have a basic for loop, like this one (written in JavaScript):
for(var i = someArr.length - 1; i > -1; i--)
{
someArr[i]
}
I honestly don't know: isn't != cheaper than > ?
I would be grateful for any resources covering computing cost in context of basic operators, like the aforementioned, >>, ~, !, and so on.
Performance on a modern CPU is far from trivial. Here are a couple of things that complicate it:
Computers are fast. Your CPU can execute upwards of 6 billion instructions per second. So even the slowest instruction can be executed millions of times per second, meaning that it only really matters if you use it very often.
Modern CPUs have hundreds of instructions in flight simultaneously. They are pipelined, meaning that while one instruction is being read, another is reading from registers, a third is executing, and a fourth is writing back to a register. Modern CPUs have 15-20 such stages. On top of this, they can execute 3-4 instructions at the same time at each of these stages. And they can reorder these instructions: if the multiplication unit is being used by another instruction, perhaps an addition instruction can be found to execute instead, for example. So even if you have some slow instructions mixed in, their cost can be hidden very well most of the time, by executing other instructions while waiting for the slow one to finish.
Memory is hundreds of times slower than the CPU. The instructions being executed don't really matter if their cost is dwarfed by retrieval of data from memory. And even this isn't reliable, because the CPU has its own onboard caches to attempt to hide this cost.
So the short answer is "don't try to outsmart the compiler". If you are able to choose between two equivalent expressions, the compiler is probably able to do the same, and will pick the most efficient one. The cost of an instruction varies, depending on all the above factors. Which other instructions are executing, what data is in the CPU's cache, which precise CPU model is the code running on, and so on. Code that is super efficient in one case may be very inefficient in other cases. The compiler will try to pick the most generally efficient instructions, and schedule them as well as possible. Unless you know more than the compiler about this, you're unlikely to be able to do a better job of it.
Don't try such microoptimizations unless you really know what you're doing. As the above shows, low-level performance is a ridiculously complex subject, and it's very easy to write "optimizations" that result in far slower code. Or which just sacrifice readability on something that makes no difference at all.
Further, most of your code simply doesn't have a measurable impact on performance.
People generally love quoting (or misquoting) Knuth on this subject:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
People often interpret this as "don't bother trying to optimize your code". If you actually read the full quote, some much more interesting consequences should become clear:
Most of the time, we should forget about microoptimizations. Most code is executed so rarely that optimizations won't matter. Keeping in mind the number of instructions a CPU can execute per second, it is obvious that a block of code has to be executed very often for optimizations in it to have any effect. So about 97% of the time, your optimizations will be a waste of time. But he also says that sometimes (3% of the time), your optimizations will matter. And obviously, looking for those 3% is a bit like looking for a needle in a haystack. If you just decide to "optimize your code" in general, you're going to waste your time on the first 97%. Instead, you need to first locate the 3% that actually need optimizing. In other words, run your code through a profiler, and let it tell you which code takes up the most CPU time. Then you know where to optimize. And then your optimizations are no longer premature.
It is extraordinarily unlikely that such micro-optimizations will make a noticeable difference to your code in any but the most extreme (real time embedded systems?) circumstances. Your time would probably be better served worrying about making your code readable and maintainable.
When in doubt, always begin by asking Donald Knuth:
http://shreevatsa.wordpress.com/2008/05/16/premature-optimization-is-the-root-of-all-evil/
Or, for a slightly less high-brow take on micro-optimization:
http://www.codinghorror.com/blog/archives/000185.html
Most comparisons have the same cost, because the processor simply performs the comparison and then makes its decision based on the flags that comparison generated, so the particular comparison operator doesn't matter at all. However, some architectures try to accelerate this process for particular values, such as comparisons against 0.
As far as I know, bitwise operations are the cheapest operations, slightly faster than addition and subtraction. Multiplication and division are a little more expensive, and comparison is the most expensive operation.
That's like asking for a fish, when I would rather teach you to fish.
There are simple ways to see for yourself how long things take. My favorite is to just copy the code 10 times and then wrap it in a loop that runs 10^8 times: 10 copies times 10^8 iterations is 10^9 executions, so if I run it and look at my watch, the number of seconds it takes translates directly to nanoseconds per execution.
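As a rough C++ sketch of that idea (the operation being timed, a simple increment, is only a stand-in):

#include <chrono>
#include <iostream>

int main()
{
    volatile long long x = 0;              // volatile so the compiler can't optimize the work away
    auto start = std::chrono::steady_clock::now();
    for (long long i = 0; i < 100000000LL; ++i) {   // 10^8 iterations
        // the operation under test, copied 10 times
        x = x + 1; x = x + 1; x = x + 1; x = x + 1; x = x + 1;
        x = x + 1; x = x + 1; x = x + 1; x = x + 1; x = x + 1;
    }
    auto end = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(end - start).count();
    // 10 copies * 10^8 iterations = 10^9 executions, so seconds ~= ns per operation.
    std::cout << seconds << " s elapsed, so roughly " << seconds << " ns per operation\n";
    std::cout << "(final value: " << x << ")\n";
    return 0;
}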
Saying don't do premature optimization is a "don't be". If you want a "do be" you could try a proactive performance tuning technique like this.
BTW my favorite way of coding your loop is:
for (i = N; --i >= 0;){...}
Premature optimization can be dangerous. The best approach is to write your application without worrying about it, and then find the slow points and optimize those. If you are really worried about this, use a lower-level language: an interpreted language like JavaScript will cost you some processing power compared to a lower-level language like C.
In this particular case, > vs != is probably not a performance issue. HOWEVER, > is generally the safer choice, because it prevents cases where a later modification of the code sends the loop off into the weeds and into an infinite loop.