How to use coalesced memory access - gpu

I have 'N' threads to perform simultaneously on device which they need M*N float from the global memory. What is the correct way to access the global memory coalesced? In this matter, how the shared memory can help?

Usually, a good coalesced access can be achieved when the neighbouring threads access neighbouring cells in memory. So, if tid holds the index of your thread, then accessing:
arr[tid] --- gives perfect coalescence
arr[tid+5] --- is almost perfect, probably misaligned
arr[tid*4] --- is not that good anymore, because of the gaps
arr[random(0..N)] --- horrible!
I am talking from the perspective of a CUDA programmer, but similar rules apply elsewhere as well, even in a simple CPU programming, although the impact is not that big there.
"But I have so many arrays everyone has about 2 or 3 times longer than the number of my threads and using the pattern like "arr[tid*4]" is inevitable. What may be a cure for this?"
If the offset is a multiple of some higher 2-power (e.g. 16*x or 32*x) it is not a problem. So, if you have to process a rather long array in a for-loop, you can do something like this:
for (size_t base=0; i<arraySize; i+=numberOfThreads)
process(arr[base+threadIndex])
(the above asumes that array size is a multiple of the number of threads)
So, if the number of threads is a multiple of 32, the memory access will be good.
Note again: I am talking from the perspective of a CUDA programmer. For different GPUs/environment you might need less or more threads for perfect memory access coalescence, but similar rules should apply.
Is "32" related to the warp size which access parallel to the global memory?
Although not directly, there is some connection. Global memory is divided into segments of 32, 64 and 128 bytes which are accessed by half-warps. The more segments you access for a given memory-fetch instruction, the longer it goes. You can read more into details in the "CUDA Programming Guide", there is a whole chapter on this topic: "5.3. Maximise Memory Throughput".
In addition, I heard a little about shared memory to localize the memory access. Is this preferred for memory coalescing or have its own difficulties?
Shared memory is much faster as it lies on-chip, but its size is limited. The memory is not segmented like global, you can access is almost-randomly at no penality cost. However, there are memory bank lines of width 4 bytes (size of 32-bit int). The address of memory that each thread access should be different modulo 16 (or 32, depending on the GPU). So, address [tid*4] will be much slower than [tid*5], because the first one access only banks 0, 4, 8, 12 and the latter 0, 5, 10, 15, 4, 9, 14, ... (bank id = address modulo 16).
Again, you can read more in the CUDA Programming Guide.

Related

OpenCL (AMD GCN) global memory access pattern for vectorized data: strided vs. contiguous

I'm going to improve OCL kernel performance and want to clarify how memory transactions work and what memory access pattern is really better (and why).
The kernel is fed with vectors of 8 integers which are defined as array: int v[8], that means, before doing any computation entire vector must be loaded into GPRs. So, I believe the bottleneck of this code is initial data load.
First, I consider some theory basics.
Target HW is Radeon RX 480/580, that has 256 bit GDDR5 memory bus, on which burst read/write transaction has 8 words granularity, hence, one memory transaction reads 2048 bits or 256 bytes. That, I believe, what CL_DEVICE_MEM_BASE_ADDR_ALIGN refers to:
Alignment (bits) of base address: 2048.
Thus, my first question: what is the physical sense of 128-byte cacheline? Does it keep the portion of data fetched by single burst read but not really requested? What happens with the rest if we requested, say, 32 or 64 bytes - thus, the leftover exceeds the cache line size? (I suppose, it will be just discarded - then, which part: head, tail...?)
Now back to my kernel, I think that cache does not play a significant role in my case because one burst reads 64 integers -> one memory transaction can theoretically feed 8 work items at once, there is no extra data to read, and memory is always coalesced.
But still, I can place my data with two different access patterns:
1) contiguous
a[i] = v[get_global_id(0) * get_global_size(0) + i];
(wich actually perfomed as)
*(int8*)a = *(int8*)v;
2) interleaved
a[i] = v[get_global_id(0) + i * get_global_size(0)];
I expect in my case contiguous would be faster because as said above one memory transaction can completely stuff 8 work items with data. However, I do not know, how the scheduler in compute unit physically works: does it need all data to be ready for all SIMD lanes or just first portion for 4 parallel SIMD elements would be enough? Nevertheless, I suppose it is smart enough to fully provide with data at least one CU first, as soon as CU's may execute command flows independently.
While in second case we need to perform 8 * global_size / 64 transactions to get a complete vector.
So, my second question: is my assumption right?
Now, the practice.
Actually, I split entire task in two kernels because one part has less register pressure than another and therefore can employ more work items. So first I played with pattern how the data stored in transition between kernels (using vload8/vstore8 or casting to int8 give the same result) and the result was somewhat strange: kernel that reads data in contiguous way works about 10% faster (both in CodeXL and by OS time measuring), but the kernel that stores data contiguously performs surprisingly slower. The overall time for two kernels then is roughly the same. In my thoughts both must behave at least the same way - either be slower or faster, but these inverse results seemed unexplainable.
And my third question is: can anyone explain such a result? Or may be I am doing something wrong? (Or completely wrong?)
Well, not really answered all my question but some information found in vastness of internet put things together more clear way, at least for me (unlike abovementioned AMD Optimization Guide, which seems unclear and sometimes confusing):
«the hardware performs some coalescing, but it's complicated...
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.»
So, for my case it does not realy matter how the data is read/stored while it is always contiguous, but the order the parts of vectors are loaded may affect the consequent command flow (re)scheduling by compiler.
I also can imagine that newer GCN architecture can do cached/coalesced writes, that is why my results are different from those prompted by that Optimization Guide.
Have a look at chapter 2.1 in the AMD OpenCL Optimization Guide. It focuses mostly on older generation cards but the GCN architecture did not completely change, therefore should still apply to your device (polaris).
In general AMD cards have multiple memory controllers to which in every clock cycle memory requests are distributed. If you for example access your values in column-major instead of row-major logic your performance will be worse because the requests are sent to the same memory controller. (by column major I mean a column of your matrix is accessed together by all the work-items executed in the current clock cycle, this is what you refer to as coalesced vs interleaved). If you access one row of elements (meaning coalesced) in a single clock cycle (meaning all work-items access values within the same row), those requests should be distributed to different memory controllers rather than the same.
Regarding alignment and cache line sizes, I'm wondering if this really helps improving the performance. If I were in your situation I would try to have a look whether I can optimize the algorithm itself or if I access the values often and it would make sense to copy them to the local memory. But than again it is hard to tell without any knowledge about what your kernels execute.
Best Regards,
Michael

Why does the Java API use int instead of short or byte?

Why does the Java API use int, when short or even byte would be sufficient?
Example: The DAY_OF_WEEK field in class Calendar uses int.
If the difference is too minimal, then why do those datatypes (short, int) exist at all?
Some of the reasons have already been pointed out. For example, the fact that "...(Almost) All operations on byte, short will promote these primitives to int". However, the obvious next question would be: WHY are these types promoted to int?
So to go one level deeper: The answer may simply be related to the Java Virtual Machine Instruction Set. As summarized in the Table in the Java Virtual Machine Specification, all integral arithmetic operations, like adding, dividing and others, are only available for the type int and the type long, and not for the smaller types.
(An aside: The smaller types (byte and short) are basically only intended for arrays. An array like new byte[1000] will take 1000 bytes, and an array like new int[1000] will take 4000 bytes)
Now, of course, one could say that "...the obvious next question would be: WHY are these instructions only offered for int (and long)?".
One reason is mentioned in the JVM Spec mentioned above:
If each typed instruction supported all of the Java Virtual Machine's run-time data types, there would be more instructions than could be represented in a byte
Additionally, the Java Virtual Machine can be considered as an abstraction of a real processor. And introducing dedicated Arithmetic Logic Unit for smaller types would not be worth the effort: It would need additional transistors, but it still could only execute one addition in one clock cycle. The dominant architecture when the JVM was designed was 32bits, just right for a 32bit int. (The operations that involve a 64bit long value are implemented as a special case).
(Note: The last paragraph is a bit oversimplified, considering possible vectorization etc., but should give the basic idea without diving too deep into processor design topics)
EDIT: A short addendum, focussing on the example from the question, but in an more general sense: One could also ask whether it would not be beneficial to store fields using the smaller types. For example, one might think that memory could be saved by storing Calendar.DAY_OF_WEEK as a byte. But here, the Java Class File Format comes into play: All the Fields in a Class File occupy at least one "slot", which has the size of one int (32 bits). (The "wide" fields, double and long, occupy two slots). So explicitly declaring a field as short or byte would not save any memory either.
(Almost) All operations on byte, short will promote them to int, for example, you cannot write:
short x = 1;
short y = 2;
short z = x + y; //error
Arithmetics are easier and straightforward when using int, no need to cast.
In terms of space, it makes a very little difference. byte and short would complicate things, I don't think this micro optimization worth it since we are talking about a fixed amount of variables.
byte is relevant and useful when you program for embedded devices or dealing with files/networks. Also these primitives are limited, what if the calculations might exceed their limits in the future? Try to think about an extension for Calendar class that might evolve bigger numbers.
Also note that in a 64-bit processors, locals will be saved in registers and won't use any resources, so using int, short and other primitives won't make any difference at all. Moreover, many Java implementations align variables* (and objects).
* byte and short occupy the same space as int if they are local variables, class variables or even instance variables. Why? Because in (most) computer systems, variables addresses are aligned, so for example if you use a single byte, you'll actually end up with two bytes - one for the variable itself and another for the padding.
On the other hand, in arrays, byte take 1 byte, short take 2 bytes and int take four bytes, because in arrays only the start and maybe the end of it has to be aligned. This will make a difference in case you want to use, for example, System.arraycopy(), then you'll really note a performance difference.
Because arithmetic operations are easier when using integers compared to shorts. Assume that the constants were indeed modeled by short values. Then you would have to use the API in this manner:
short month = Calendar.JUNE;
month = month + (short) 1; // is july
Notice the explicit casting. Short values are implicitly promoted to int values when they are used in arithmetic operations. (On the operand stack, shorts are even expressed as ints.) This would be quite cumbersome to use which is why int values are often preferred for constants.
Compared to that, the gain in storage efficiency is minimal because there only exists a fixed number of such constants. We are talking about 40 constants. Changing their storage from int to short would safe you 40 * 16 bit = 80 byte. See this answer for further reference.
The design complexity of a virtual machine is a function of how many kinds of operations it can perform. It's easier to having four implementations of an instruction like "multiply"--one each for 32-bit integer, 64-bit integer, 32-bit floating-point, and 64-bit floating-point--than to have, in addition to the above, versions for the smaller numerical types as well. A more interesting design question is why there should be four types, rather than fewer (performing all integer computations with 64-bit integers and/or doing all floating-point computations with 64-bit floating-point values). The reason for using 32-bit integers is that Java was expected to run on many platforms where 32-bit types could be acted upon just as quickly as 16-bit or 8-bit types, but operations on 64-bit types would be noticeably slower. Even on platforms where 16-bit types would be faster to work with, the extra cost of working with 32-bit quantities would be offset by the simplicity afforded by only having 32-bit types.
As for performing floating-point computations on 32-bit values, the advantages are a bit less clear. There are some platforms where a computation like float a=b+c+d; could be performed most quickly by converting all operands to a higher-precision type, adding them, and then converting the result back to a 32-bit floating-point number for storage. There are other platforms where it would be more efficient to perform all computations using 32-bit floating-point values. The creators of Java decided that all platforms should be required to do things the same way, and that they should favor the hardware platforms for which 32-bit floating-point computations are faster than longer ones, even though this severely degraded PC both the speed and precision of floating-point math on a typical PC, as well as on many machines without floating-point units. Note, btw, that depending upon the values of b, c, and d, using higher-precision intermediate computations when computing expressions like the aforementioned float a=b+c+d; will sometimes yield results which are significantly more accurate than would be achieved of all intermediate operands were computed at float precision, but will sometimes yield a value which is a tiny bit less accurate. In any case, Sun decided everything should be done the same way, and they opted for using minimal-precision float values.
Note that the primary advantages of smaller data types become apparent when large numbers of them are stored together in an array; even if there were no advantage to having individual variables of types smaller than 64-bits, it's worthwhile to have arrays which can store smaller values more compactly; having a local variable be a byte rather than an long saves seven bytes; having an array of 1,000,000 numbers hold each number as a byte rather than a long waves 7,000,000 bytes. Since each array type only needs to support a few operations (most notably read one item, store one item, copy a range of items within an array, or copy a range of items from one array to another), the added complexity of having more array types is not as severe as the complexity of having more types of directly-usable discrete numerical values.
If you used the philosophy where integral constants are stored in the smallest type that they fit in, then Java would have a serious problem: whenever programmers write code using integral constants, they have to pay careful attention to their code to check if the type of the constants matter, and if so look up the type in the documentation and/or do whatever type conversions are needed.
So now that we've outlined a serious problem, what benefits could you hope to achieve with that philosophy? I would be unsurprised if the only runtime-observable effect of that change would be what type you get when you look the constant up via reflection. (and, of course, whatever errors are introduced by lazy/unwitting programmers not correctly accounting for the types of the constants)
Weighing the pros and the cons is very easy: it's a bad philosophy.
Actually, there'd be a small advantage. If you have a
class MyTimeAndDayOfWeek {
byte dayOfWeek;
byte hour;
byte minute;
byte second;
}
then on a typical JVM it needs as much space as a class containing a single int. The memory consumption gets rounded to a next multiple of 8 or 16 bytes (IIRC, that's configurable), so the cases when there are real saving are rather rare.
This class would be slightly easier to use if the corresponding Calendar methods returned a byte. But there are no such Calendar methods, only get(int) which must returns an int because of other fields. Each operation on smaller types promotes to int, so you need a lot of casting.
Most probably, you'll either give up and switch to an int or write setters like
void setDayOfWeek(int dayOfWeek) {
this.dayOfWeek = checkedCastToByte(dayOfWeek);
}
Then the type of DAY_OF_WEEK doesn't matter, anyway.
Using variables smaller than the bus size of the CPU means more cycles are necessary. For example when updating a single byte in memory, a 64-bit CPU needs to read a whole 64-bit word, modify only the changed part, then write back the result.
Also, using a smaller data type requires overhead when the variable is stored in a register, since the behavior of the smaller data type to be accounted for explicitly. Since the whole register is used anyways, there is nothing to be gained by using a smaller data type for method parameters and local variables.
Nevertheless, these data types might be useful for representing data structures that require specific widths, such as network packets, or for saving space in large arrays, sacrificing speed.

How to calculate capacity of a memory by given a range of address?

I have another exercice that I couldn't resolve,
A central memory composed by two memory module(RAM).the total address range attributed to the central memory is:
FROM 0000 0000H TO 3FFF FFFFH
1/Give the total capacity of the central memory (Megabyte and Gegabyte)
2/Give the capacity of each memory module(RAM)
3/Give the first and last address of each memory module(RAM)
Sorry for the bad translation the exercice is an french.
Well, 1 is easy. The range from 0000 0000H to 3FFF FFFFH contains 4000 0000H addresses. (Just like 0 to 3 is four addresses, 0, 1, 2, and 3.) 4000 0000H is 1,073,741,824 decimal, or 1GB. 1,024 MB.
2 is no problem. If two memory modules give 1GB, then each module must be 512MB.
3 is impossible. We don't know if the memory modules are consecutive or interleaved. But if we assume they're consecutive, which I imagine is what the exercise wants us to do, then the first one must be 0000 0000H to 1FFF FFFFH and the second one must be 2000 0000H to 3FFF FFFFH.
Note that mapping memory modules consecutively is generally considered dumb. It means that in the typical case where the memory module bandwidth is the limiting factor, if an application is only using the first half of memory, it's only using one of the two modules, wasting half the available memory bandwidth. (Though, in the less common case where the memory is as fast or faster than the CPU or its memory bus, it doesn't matter.)

Difference between blocks and sectors

With reference to this article, there is a line that reads:
Because there are limits to the number of blocks, or drive addresses, that an operating system can address. By defining a block as several sectors, an OS can work with bigger hard drives without increasing the number of block addresses.
What does it mean? What is meant by "operating system can address"? And the subsequent maths isn't clear either. How can 64*512 be less than 64*4?
Look at it this way. Every block that's used in your operating system's file system to store data requires a certain amount of metadata to be stored along with the actual file data you're writing. e.g: timestamps (created, modified), filename, ownership/permission bits. For files that span multiple blocks, you also have to store the IDs of each of those blocks and the order they're chained together, etc.
Determining block size in an OS is a case of tradeoffs. Every file must occupy at least one block, even if the file is 0 bytes long, so there's something for the file's metadata to be attached to. Unless you can guarantee that your files will ALWAYS be some multiple of the block size in size (e.g. in a 4k block OS, all files are 4k), there will be a certain amount of wastage for the files that don't exactly fit within that block.
Small block sizes are good when you need to store many small files. On the other hand, more blocks = more metadata, so you end up wasting a chunk of your storage system on overhead, tracking the location of all the files.
On the flip side, large blocks mean less metadata, but also mean greater wastage when you're storing small files. e.g. a 1 byte file stored in a 4k block wastes 3.99k of that block.
Each of those blocks must be given an ID number by the OS, so it can be uniquely identified. An OS which uses an 8 bit ID field can track only 256 blocks, and therefore, by extension, only 256 files. But if each of those blocks is actually 1 megabyte in size, then you can store up to 256 megabytes of data.
The article you link to has a typo/logical flaw: they meant 512 BYTES, not 512k, so 64*512 bytes is smaller than 64*4k, aka 64*4096 bytes. Most hard drives shipped with 512 byte sector/block sizes.
However, as discussed earlier, small blocks mean more metadata. With drive sizes now in the 3+ terabyte range, with 512 byte blocks, you had to have metadata storage for 3TB/512 bytes = 6.44 billion blocks. That's one major waste of space. So now they ship drives with 4k blocks, 8 times larger, so you only need metadata storage for 805 million blocks. The total number of possible files has been cut by a factor of 8, but the reduced amount of metadata means you can actually store a larger amount of useable data.
Incidentally, 6.4 billion blocks is larger than what can be addressed directly by a 32bit system. 2^32 has an upper limit of ~4.2 billion, so older 32bit machines could not use the entirety of a 3TB drive. Hence switching to larger block sizes. 32bit boxes can easily handle 805 million blocks.

What makes Apple's PowerPC memcpy so fast?

I've written several copy functions in search of a good memory strategy on PowerPC. Using the Altivec or fp registers with cache hints (dcb*) doubles the performance over a simple byte copy loop for large data. Initially pleased with that, I threw in a regular memcpy to see how it compared... 10x faster than my best! I have no intention of rewriting memcpy, but I do hope to learn from it and accelerate several simple image filters that spend most of their time moving pixels to and from memory.
Shark analysis reveals that their inner loop uses dcbt to prefetch, with 4 vector reads, then 4 vector writes. After tweaking my best function to also haul 64 bytes per iteration, the performance advantage of memcpy is still embarrassing. I'm using dcbz to free up bandwidth, Apple uses nothing, but both codes tend to hesitate on stores.
prefetch
dcbt future
dcbt distant future
load stuff
lvx image
lvx image + 16
lvx image + 32
lvx image + 48
image += 64
prepare to store
dcbz filtered
dcbz filtered + 32
store stuff
stvxl filtered
stvxl filtered + 16
stvxl filtered + 32
stvxl filtered + 48
filtered += 64
repeat
Does anyone have some ideas on why very similar code has such a dramatic performance gap? I'd love to marinate the real image filters in whatever secret sauce memcpy is using!
Additional info: All data is vector aligned. I'm making filtered copies of the image, not replacing the original. The code runs on PowerPC G4, G5, and Cell PPU. The Cell SPU version is already insanely fast.
Shark analysis reveals that their inner loop uses dcbt to prefetch, with 4 vector reads, then 4 vector writes. After tweaking my best function to also haul 64 bytes per iteration
I may be stating the obvious, but since you don't mention the following at all in your question, it may be worth pointing it out:
I would bet that Apple's choice of 4 vectors reads followed by 4 vector writes has as much to do with the G5's pipeline and its management of out-of-order instruction execution in "dispatch groups" as it has with a magical 64-byte perfect line size. Did you notice the line skips in Nick Bastin's linked bcopy.s? These mean that the developer thought about how the instruction stream would be consumed by the G5. If you want to reproduce the same performance, it's not enough to read data 64 bytes at a time, you must make sure your instruction groups are well filled (basically, I remember that instructions can be grouped by up to five independent ones, with the first four being non-jump instructions and the fifth only being allowed to be a jump. The details are more complicated).
EDIT: you may also be interested by the following paragraph on the same page:
The dcbz instruction still zeros aligned 32 byte segments of memory as per the G4 and G3. However, since that is not a full cacheline on a G5 it will not have the performance benefits that you were likely hoping for. There is a dcbzl instruction newly introduced for the G5 that zeros a full 128-byte cacheline.
I don't know exactly what you're doing, since I can't see your code, but Apple's secret sauce is here.
Maybe it's because of CPU caching. Try to run CacheGrind:
Cachegrind is a cache profiler. It
performs detailed simulation of the
I1, D1 and L2 caches in your CPU and
so can accurately pinpoint the sources
of cache misses in your code. It
identifies the number of cache misses,
memory references and instructions
executed for each line of source code,
with per-function, per-module and
whole-program summaries. It is useful
with programs written in any language.
Cachegrind runs programs about
20--100x slower than normal.
Still not an answer, but did you verify that memcpy is actually moving the data? Maybe it was just remapped copy-on-write. You would still see the inner memcpy loop in Shark as part of the first and last pages are truly copied.
As mentioned in another answer, "dcbz", as defined by Apple on the G5, only operates on 32-bytes, so you will lose performance with this instruction on a G5 which has 128 byte cachelines. You need to use "dcbzl" to prevent the destination cacheline from being fetched from memory (and effectively reducing your useful read memory bandwidth by half).