Index register in cpu (Computer org. and arc.) - indexing

Can index register have negative value?
For example: at start Xr is 0, and then we need to decrement it? What will be the value of Xr?

A register can hold any value that fits in the number of bits it has. What makes the value negative or not is the way you treat it. The question you should be asking yourself is - does your basic CPU support signed arithmetic operations, and how does it encode them.
The common method you must have learned by now is 2's complement, so if a register holds a value of 0xffff (let's say it's 16 bit wide), then it can be interpreted as (2^16 minus one), or as (-1) - depending on who's reading it and how.
The term indexing is therefore a little ambiguous, as the value used will be determined by the type of operation - if you mean to use it within some computation - it depends if the operations you use are signed or not. If you mean to use it directly for memory addressing (if your CPU supports it) - again, it depends on what the CPU supports for this form on indexing.

Related

OpenCL (AMD GCN) global memory access pattern for vectorized data: strided vs. contiguous

I'm going to improve OCL kernel performance and want to clarify how memory transactions work and what memory access pattern is really better (and why).
The kernel is fed with vectors of 8 integers which are defined as array: int v[8], that means, before doing any computation entire vector must be loaded into GPRs. So, I believe the bottleneck of this code is initial data load.
First, I consider some theory basics.
Target HW is Radeon RX 480/580, that has 256 bit GDDR5 memory bus, on which burst read/write transaction has 8 words granularity, hence, one memory transaction reads 2048 bits or 256 bytes. That, I believe, what CL_DEVICE_MEM_BASE_ADDR_ALIGN refers to:
Alignment (bits) of base address: 2048.
Thus, my first question: what is the physical sense of 128-byte cacheline? Does it keep the portion of data fetched by single burst read but not really requested? What happens with the rest if we requested, say, 32 or 64 bytes - thus, the leftover exceeds the cache line size? (I suppose, it will be just discarded - then, which part: head, tail...?)
Now back to my kernel, I think that cache does not play a significant role in my case because one burst reads 64 integers -> one memory transaction can theoretically feed 8 work items at once, there is no extra data to read, and memory is always coalesced.
But still, I can place my data with two different access patterns:
1) contiguous
a[i] = v[get_global_id(0) * get_global_size(0) + i];
(wich actually perfomed as)
*(int8*)a = *(int8*)v;
2) interleaved
a[i] = v[get_global_id(0) + i * get_global_size(0)];
I expect in my case contiguous would be faster because as said above one memory transaction can completely stuff 8 work items with data. However, I do not know, how the scheduler in compute unit physically works: does it need all data to be ready for all SIMD lanes or just first portion for 4 parallel SIMD elements would be enough? Nevertheless, I suppose it is smart enough to fully provide with data at least one CU first, as soon as CU's may execute command flows independently.
While in second case we need to perform 8 * global_size / 64 transactions to get a complete vector.
So, my second question: is my assumption right?
Now, the practice.
Actually, I split entire task in two kernels because one part has less register pressure than another and therefore can employ more work items. So first I played with pattern how the data stored in transition between kernels (using vload8/vstore8 or casting to int8 give the same result) and the result was somewhat strange: kernel that reads data in contiguous way works about 10% faster (both in CodeXL and by OS time measuring), but the kernel that stores data contiguously performs surprisingly slower. The overall time for two kernels then is roughly the same. In my thoughts both must behave at least the same way - either be slower or faster, but these inverse results seemed unexplainable.
And my third question is: can anyone explain such a result? Or may be I am doing something wrong? (Or completely wrong?)
Well, not really answered all my question but some information found in vastness of internet put things together more clear way, at least for me (unlike abovementioned AMD Optimization Guide, which seems unclear and sometimes confusing):
«the hardware performs some coalescing, but it's complicated...
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.»
So, for my case it does not realy matter how the data is read/stored while it is always contiguous, but the order the parts of vectors are loaded may affect the consequent command flow (re)scheduling by compiler.
I also can imagine that newer GCN architecture can do cached/coalesced writes, that is why my results are different from those prompted by that Optimization Guide.
Have a look at chapter 2.1 in the AMD OpenCL Optimization Guide. It focuses mostly on older generation cards but the GCN architecture did not completely change, therefore should still apply to your device (polaris).
In general AMD cards have multiple memory controllers to which in every clock cycle memory requests are distributed. If you for example access your values in column-major instead of row-major logic your performance will be worse because the requests are sent to the same memory controller. (by column major I mean a column of your matrix is accessed together by all the work-items executed in the current clock cycle, this is what you refer to as coalesced vs interleaved). If you access one row of elements (meaning coalesced) in a single clock cycle (meaning all work-items access values within the same row), those requests should be distributed to different memory controllers rather than the same.
Regarding alignment and cache line sizes, I'm wondering if this really helps improving the performance. If I were in your situation I would try to have a look whether I can optimize the algorithm itself or if I access the values often and it would make sense to copy them to the local memory. But than again it is hard to tell without any knowledge about what your kernels execute.
Best Regards,
Michael

Least memory intensive way of storing small integers in VBA

Occasionally I will store the state of some system as an integer. I often find myself using small values for these states (say 1-10), since the system is relatively simple.
In general, what's the best declaration for a variable which stores small positive integers - where best is defined as fastest read/write time & smallest memory consumption? Small is here defined as 1-10, although a complete list of integer storing methods and their ranges would be useful.
Originally I used Integer as on the face of it, it uses less memory. But I have since learned that that is not the case, as it is silently converted to Long
I then used Long for the above reason, and in the knowledge that it uses less memory than Double
I have since discovered Byte and switched to that, since it stores smaller integers (0-255 or 256, I never remember which), and I guess uses less memory from it's minute name. But I don't really trust VBA and wonder if there's any internal type conversions done here too.
Boolean I thought was only 0 or 1, but I've read that any non-zero number is converted to True, does this mean it can also store numbers?
Originally I used Integer as on the face of it, it uses less memory. But I have since learned that that is not the case, as it is silently converted to Long
That's right there is no advantage in using Integer over Long because of that conversion, but Integer might be necessary when communicating with old 16 bit APIs.
Also read "Why Use Integer Instead of Long?"
I then used Long for the above reason, and in the knowledge that it uses less memory than Double
You would not decide between Long or Double because one uses less memory. You decide between them because …
you need floating point numbers (Double)
or you don't accept floating point numbers. (Long)
Deciding on memory usage in this specific case is just a very bad idea because these types are fundamentally different.
I have since discovered Byte and switched to that, since it stores smaller integers (0-255 or 256, I never remember which), and I guess uses less memory from it's minute name. But I don't really trust VBA and wonder if there's any internal type conversions done here too.
I don't see any case where you use Office/Excel and run into any memory issues by using Long instead of Byte to iterate from 1 to 10. If you need to limit it to 255 (some old APIs, whatever) then you might use Byte. If there is no need for that I would use Long just to be flexible and not run into any coding issues because you need to remember which counters are only Byte and which are Long.
E.g. If I use i for iterating I would expect Long. I see no advantage in using Byte for that case.
Stay as simple as possible. Don't do strange things one would not expect only because you can. Avoiding future coding issues is worth more than one (or three) byte of memory usage. Sometimes it is worthier to write good human readable and maintainable code than faster code especially if you can't notice the differences (which you really can't in this case). Bad readable code always results in errors or vulnerabilities sooner or later.
Boolean I thought was only 0 or 1, but I've read that any non-zero number is converted to True, does this mean it can also store numbers?
No that's wrong. Boolean is -1 for True and 0 for False. But note that if you cast e.g. a Long into Boolean which is not 0 then it will automatically cast and result in True.
But Boolean in VBA is clearly defined as:
0 = False
-1 = True
The smallest chunk of memory that can be addressed is a byte (8 bits).
I cannot guarantee that VBA Bytes are stored as bytes in all cases, but using this type you are on the safest side.
By the way, the largest byte value is 11111111b, i.e 255d. The value 256d is 100000000b which requires 9 bits.
Also note that using Bytes every possible time might be unproductive as it can have a cost in terms of running time, if numerical conversions are required, while the spared memory space may be insignificant.
Except for very special applications, this kind of micro-optimization is of no use.

Why does the Java API use int instead of short or byte?

Why does the Java API use int, when short or even byte would be sufficient?
Example: The DAY_OF_WEEK field in class Calendar uses int.
If the difference is too minimal, then why do those datatypes (short, int) exist at all?
Some of the reasons have already been pointed out. For example, the fact that "...(Almost) All operations on byte, short will promote these primitives to int". However, the obvious next question would be: WHY are these types promoted to int?
So to go one level deeper: The answer may simply be related to the Java Virtual Machine Instruction Set. As summarized in the Table in the Java Virtual Machine Specification, all integral arithmetic operations, like adding, dividing and others, are only available for the type int and the type long, and not for the smaller types.
(An aside: The smaller types (byte and short) are basically only intended for arrays. An array like new byte[1000] will take 1000 bytes, and an array like new int[1000] will take 4000 bytes)
Now, of course, one could say that "...the obvious next question would be: WHY are these instructions only offered for int (and long)?".
One reason is mentioned in the JVM Spec mentioned above:
If each typed instruction supported all of the Java Virtual Machine's run-time data types, there would be more instructions than could be represented in a byte
Additionally, the Java Virtual Machine can be considered as an abstraction of a real processor. And introducing dedicated Arithmetic Logic Unit for smaller types would not be worth the effort: It would need additional transistors, but it still could only execute one addition in one clock cycle. The dominant architecture when the JVM was designed was 32bits, just right for a 32bit int. (The operations that involve a 64bit long value are implemented as a special case).
(Note: The last paragraph is a bit oversimplified, considering possible vectorization etc., but should give the basic idea without diving too deep into processor design topics)
EDIT: A short addendum, focussing on the example from the question, but in an more general sense: One could also ask whether it would not be beneficial to store fields using the smaller types. For example, one might think that memory could be saved by storing Calendar.DAY_OF_WEEK as a byte. But here, the Java Class File Format comes into play: All the Fields in a Class File occupy at least one "slot", which has the size of one int (32 bits). (The "wide" fields, double and long, occupy two slots). So explicitly declaring a field as short or byte would not save any memory either.
(Almost) All operations on byte, short will promote them to int, for example, you cannot write:
short x = 1;
short y = 2;
short z = x + y; //error
Arithmetics are easier and straightforward when using int, no need to cast.
In terms of space, it makes a very little difference. byte and short would complicate things, I don't think this micro optimization worth it since we are talking about a fixed amount of variables.
byte is relevant and useful when you program for embedded devices or dealing with files/networks. Also these primitives are limited, what if the calculations might exceed their limits in the future? Try to think about an extension for Calendar class that might evolve bigger numbers.
Also note that in a 64-bit processors, locals will be saved in registers and won't use any resources, so using int, short and other primitives won't make any difference at all. Moreover, many Java implementations align variables* (and objects).
* byte and short occupy the same space as int if they are local variables, class variables or even instance variables. Why? Because in (most) computer systems, variables addresses are aligned, so for example if you use a single byte, you'll actually end up with two bytes - one for the variable itself and another for the padding.
On the other hand, in arrays, byte take 1 byte, short take 2 bytes and int take four bytes, because in arrays only the start and maybe the end of it has to be aligned. This will make a difference in case you want to use, for example, System.arraycopy(), then you'll really note a performance difference.
Because arithmetic operations are easier when using integers compared to shorts. Assume that the constants were indeed modeled by short values. Then you would have to use the API in this manner:
short month = Calendar.JUNE;
month = month + (short) 1; // is july
Notice the explicit casting. Short values are implicitly promoted to int values when they are used in arithmetic operations. (On the operand stack, shorts are even expressed as ints.) This would be quite cumbersome to use which is why int values are often preferred for constants.
Compared to that, the gain in storage efficiency is minimal because there only exists a fixed number of such constants. We are talking about 40 constants. Changing their storage from int to short would safe you 40 * 16 bit = 80 byte. See this answer for further reference.
The design complexity of a virtual machine is a function of how many kinds of operations it can perform. It's easier to having four implementations of an instruction like "multiply"--one each for 32-bit integer, 64-bit integer, 32-bit floating-point, and 64-bit floating-point--than to have, in addition to the above, versions for the smaller numerical types as well. A more interesting design question is why there should be four types, rather than fewer (performing all integer computations with 64-bit integers and/or doing all floating-point computations with 64-bit floating-point values). The reason for using 32-bit integers is that Java was expected to run on many platforms where 32-bit types could be acted upon just as quickly as 16-bit or 8-bit types, but operations on 64-bit types would be noticeably slower. Even on platforms where 16-bit types would be faster to work with, the extra cost of working with 32-bit quantities would be offset by the simplicity afforded by only having 32-bit types.
As for performing floating-point computations on 32-bit values, the advantages are a bit less clear. There are some platforms where a computation like float a=b+c+d; could be performed most quickly by converting all operands to a higher-precision type, adding them, and then converting the result back to a 32-bit floating-point number for storage. There are other platforms where it would be more efficient to perform all computations using 32-bit floating-point values. The creators of Java decided that all platforms should be required to do things the same way, and that they should favor the hardware platforms for which 32-bit floating-point computations are faster than longer ones, even though this severely degraded PC both the speed and precision of floating-point math on a typical PC, as well as on many machines without floating-point units. Note, btw, that depending upon the values of b, c, and d, using higher-precision intermediate computations when computing expressions like the aforementioned float a=b+c+d; will sometimes yield results which are significantly more accurate than would be achieved of all intermediate operands were computed at float precision, but will sometimes yield a value which is a tiny bit less accurate. In any case, Sun decided everything should be done the same way, and they opted for using minimal-precision float values.
Note that the primary advantages of smaller data types become apparent when large numbers of them are stored together in an array; even if there were no advantage to having individual variables of types smaller than 64-bits, it's worthwhile to have arrays which can store smaller values more compactly; having a local variable be a byte rather than an long saves seven bytes; having an array of 1,000,000 numbers hold each number as a byte rather than a long waves 7,000,000 bytes. Since each array type only needs to support a few operations (most notably read one item, store one item, copy a range of items within an array, or copy a range of items from one array to another), the added complexity of having more array types is not as severe as the complexity of having more types of directly-usable discrete numerical values.
If you used the philosophy where integral constants are stored in the smallest type that they fit in, then Java would have a serious problem: whenever programmers write code using integral constants, they have to pay careful attention to their code to check if the type of the constants matter, and if so look up the type in the documentation and/or do whatever type conversions are needed.
So now that we've outlined a serious problem, what benefits could you hope to achieve with that philosophy? I would be unsurprised if the only runtime-observable effect of that change would be what type you get when you look the constant up via reflection. (and, of course, whatever errors are introduced by lazy/unwitting programmers not correctly accounting for the types of the constants)
Weighing the pros and the cons is very easy: it's a bad philosophy.
Actually, there'd be a small advantage. If you have a
class MyTimeAndDayOfWeek {
byte dayOfWeek;
byte hour;
byte minute;
byte second;
}
then on a typical JVM it needs as much space as a class containing a single int. The memory consumption gets rounded to a next multiple of 8 or 16 bytes (IIRC, that's configurable), so the cases when there are real saving are rather rare.
This class would be slightly easier to use if the corresponding Calendar methods returned a byte. But there are no such Calendar methods, only get(int) which must returns an int because of other fields. Each operation on smaller types promotes to int, so you need a lot of casting.
Most probably, you'll either give up and switch to an int or write setters like
void setDayOfWeek(int dayOfWeek) {
this.dayOfWeek = checkedCastToByte(dayOfWeek);
}
Then the type of DAY_OF_WEEK doesn't matter, anyway.
Using variables smaller than the bus size of the CPU means more cycles are necessary. For example when updating a single byte in memory, a 64-bit CPU needs to read a whole 64-bit word, modify only the changed part, then write back the result.
Also, using a smaller data type requires overhead when the variable is stored in a register, since the behavior of the smaller data type to be accounted for explicitly. Since the whole register is used anyways, there is nothing to be gained by using a smaller data type for method parameters and local variables.
Nevertheless, these data types might be useful for representing data structures that require specific widths, such as network packets, or for saving space in large arrays, sacrificing speed.

Testing whether some code works for ALL input numbers?

I've got an algorithm using a single (positive integer) number as an input to produce an output. And I've got the reverse function which should do the exact opposite, going back from the output to the same integer number. This should be a unique one-to-one reversible mapping.
I've tested this for some integers, but I want to be 100% sure that it works for all of them, up to a known limit.
The problem is that if I just test every integer, it takes an unreasonably long time to run. If I use 64-bit integers, that's a lot of numbers to check if I want to check them all. On the other hand, if I only test every 10th or 100th number, I'm not going to be 100% sure at the end. There might be some awkward weird constellation in one of the 90% or 99% which I didn't test.
Are there any general ways to identify edge cases so that just those "interesting" or "risky" numbers are checked? Or should I just pick numbers at random? Or test in increasing increments?
Or to put the question another way, how can I approach this so that I gain 100% confidence that every case will be properly handled?
The approach for this is generally checking every step of the computation for potential flaws. Concerning integer math, that is overflows, underflows and rounding errors from division, basically that the mathematical result can't be represented accurately. In addition, all operations derived from this suffer similar problems.
The process of auditing then looks at single steps in turn. For example, if you want to allocate memory for N integers, you need N times the size of an integer in bytes and this multiplication can overflow. You now determine those values where the multiplication overflows and create according tests that exercise these. Note that for the example of allocating memory, proper handling typically means that the function does not allocate memory but fail.
The principle behind this is that you determine the ranges for every operation where the outcome is somehow different (like e.g. where it overflows) and then make sure via tests that both variants work. This reduces the number of tests from all possible input values to just those where you expect a significant difference.

Exponents in Genetic Programming

I want to have real-valued exponents (not just integers) for the terminal variables.
For example, lets say I want to evolve a function y = x^3.5 + x^2.2 + 6. How should I proceed? I haven't seen any GP implementations which can do this.
I tried using the power function, but sometimes the initial solutions have so many exponents that the evaluated value exceeds 'double' bounds!
Any suggestion would be appreciated. Thanks in advance.
DEAP (in Python) implements it. In fact there is an example for that. By adding the math.pow from Python in the primitive set you can acheive what you want.
pset.addPrimitive(math.pow, 2)
But using the pow operator you risk getting something like x^(x^(x^(x))), which is probably not desired. You shall add a restriction (by a mean that I not sure) on where in your tree the pow is allowed (just before a leaf or something like that).
OpenBeagle (in C++) also allows it but you will need to develop your own primitive using the pow from <math.h>, you can use as an example the Sin or Cos primitive.
If only some of the initial population are suffering from the overflow problem then just penalise them with a poor fitness score and they will probably be removed from the population within a few generations.
But, if the problem is that virtually all individuals suffer from this problem, then you will have to add some constraints. The simplest thing to do would be to constrain the exponent child of the power function to be a real literal - which would mean powers would not be allowed to be nested. It depends on whether this is sufficient for your needs though. There are a few ways to add constraints like these (or more complex ones) - try looking in to Constrained Syntactic Structures and grammar guided GP.
A few other simple thoughts: can you use a data-type with a larger range? Also, you could reduce the maximum depth parameter, so that there will be less room for nested exponents. Of course that's only possible to an extent, and it depends on the complexity of the function.
Integers have a different binary representation than reals, so you have to use a slightly different bitstring representation and recombination/mutation operator.
For an excellent demonstration, see slide 24 of www.cs.vu.nl/~gusz/ecbook/slides/Genetic_Algorithms.ppt or check out the Eiben/Smith book "Introduction to Evolutionary Computing Genetic Algorithms." This describes how to map a bit string to a real number. You can then create a representation where x only lies within an interval [y,z]. In this case, choose y and z to be the of less magnitude than the capacity of the data type you are using (e.g. 10^308 for a double) so you don't run into the overflow issue you describe.
You have to consider that with real-valued exponents and a negative base you will not obtain a real, but a complex number. For example, the Math.Pow implementation in .NET says that you get NaN if you attempt to calculate the power of a negative base to a non-integer exponent. You have to make sure all your x values are positive. I think that's the problem that you're seeing when you "exceed double bounds".
Btw, you can try the HeuristicLab GP implementation. It is very flexible with a configurable grammar.