2D Chunked Grid Fast Access - optimization

I am working on a program that stores some data in cells (small structs) and processes each one individually. The processing step accesses the 4 neighbors of the cell (2D). I also need them partitioned in chunks because the cells might be distributed randomly trough a very large surface, and having a large grid with mostly empty cells would be a waste. I also use the chunks for some other optimizations (skipping processing of chunks based on some conditions).
I currently have a hashmap of "chunk positions" to chunks (which are the actual fixed size grids). The position is calculated based on the chunk size (like Minecraft). The issue is that, when processing the cells in every chunk, I lose a lot of time doing a lookup to get the chunk of the neighbor. Most of the time, the neighbor is in the same chunk we are processing, so I did a check to prevent looking up a chunk if the neighbor is in the same chunk we are processing.
Is there a better solution to this?

This lacks some details, but hopefully you can employ a solution such as this:
Process the interior of a chunk (ie excluding the edges) separately. During this phase, the neighbours are for sure in the same chunk, so you can do this with zero chunk-lookups. The difference between this and doing a check to see whether a chunk-lookup is necessary, is that there is not even a check. The check is implicit in the loop bounds.
For edges, you can do a few chunk lookups and reuse the result across the edge.
This approach gets worse with smaller chunk sizes, or if you need access to neighbours further than 1 step away. It breaks down entirely in case of random access to cells. If you need to maintain a strict ordering for the processing of cells, this approach can still be used with minor modifications by rearranging it (there wouldn't be strict "process the interior" phase, but you would still have a nice inner loop with zero chunk-lookups).
Such techniques are common in general in cases where the boundary has different behaviour than the interior.

Related

OpenCL (AMD GCN) global memory access pattern for vectorized data: strided vs. contiguous

I'm going to improve OCL kernel performance and want to clarify how memory transactions work and what memory access pattern is really better (and why).
The kernel is fed with vectors of 8 integers which are defined as array: int v[8], that means, before doing any computation entire vector must be loaded into GPRs. So, I believe the bottleneck of this code is initial data load.
First, I consider some theory basics.
Target HW is Radeon RX 480/580, that has 256 bit GDDR5 memory bus, on which burst read/write transaction has 8 words granularity, hence, one memory transaction reads 2048 bits or 256 bytes. That, I believe, what CL_DEVICE_MEM_BASE_ADDR_ALIGN refers to:
Alignment (bits) of base address: 2048.
Thus, my first question: what is the physical sense of 128-byte cacheline? Does it keep the portion of data fetched by single burst read but not really requested? What happens with the rest if we requested, say, 32 or 64 bytes - thus, the leftover exceeds the cache line size? (I suppose, it will be just discarded - then, which part: head, tail...?)
Now back to my kernel, I think that cache does not play a significant role in my case because one burst reads 64 integers -> one memory transaction can theoretically feed 8 work items at once, there is no extra data to read, and memory is always coalesced.
But still, I can place my data with two different access patterns:
1) contiguous
a[i] = v[get_global_id(0) * get_global_size(0) + i];
(wich actually perfomed as)
*(int8*)a = *(int8*)v;
2) interleaved
a[i] = v[get_global_id(0) + i * get_global_size(0)];
I expect in my case contiguous would be faster because as said above one memory transaction can completely stuff 8 work items with data. However, I do not know, how the scheduler in compute unit physically works: does it need all data to be ready for all SIMD lanes or just first portion for 4 parallel SIMD elements would be enough? Nevertheless, I suppose it is smart enough to fully provide with data at least one CU first, as soon as CU's may execute command flows independently.
While in second case we need to perform 8 * global_size / 64 transactions to get a complete vector.
So, my second question: is my assumption right?
Now, the practice.
Actually, I split entire task in two kernels because one part has less register pressure than another and therefore can employ more work items. So first I played with pattern how the data stored in transition between kernels (using vload8/vstore8 or casting to int8 give the same result) and the result was somewhat strange: kernel that reads data in contiguous way works about 10% faster (both in CodeXL and by OS time measuring), but the kernel that stores data contiguously performs surprisingly slower. The overall time for two kernels then is roughly the same. In my thoughts both must behave at least the same way - either be slower or faster, but these inverse results seemed unexplainable.
And my third question is: can anyone explain such a result? Or may be I am doing something wrong? (Or completely wrong?)
Well, not really answered all my question but some information found in vastness of internet put things together more clear way, at least for me (unlike abovementioned AMD Optimization Guide, which seems unclear and sometimes confusing):
«the hardware performs some coalescing, but it's complicated...
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.»
So, for my case it does not realy matter how the data is read/stored while it is always contiguous, but the order the parts of vectors are loaded may affect the consequent command flow (re)scheduling by compiler.
I also can imagine that newer GCN architecture can do cached/coalesced writes, that is why my results are different from those prompted by that Optimization Guide.
Have a look at chapter 2.1 in the AMD OpenCL Optimization Guide. It focuses mostly on older generation cards but the GCN architecture did not completely change, therefore should still apply to your device (polaris).
In general AMD cards have multiple memory controllers to which in every clock cycle memory requests are distributed. If you for example access your values in column-major instead of row-major logic your performance will be worse because the requests are sent to the same memory controller. (by column major I mean a column of your matrix is accessed together by all the work-items executed in the current clock cycle, this is what you refer to as coalesced vs interleaved). If you access one row of elements (meaning coalesced) in a single clock cycle (meaning all work-items access values within the same row), those requests should be distributed to different memory controllers rather than the same.
Regarding alignment and cache line sizes, I'm wondering if this really helps improving the performance. If I were in your situation I would try to have a look whether I can optimize the algorithm itself or if I access the values often and it would make sense to copy them to the local memory. But than again it is hard to tell without any knowledge about what your kernels execute.
Best Regards,
Michael

Is there a more efficient search factor than midpoint for binary search?

The naive binary search is a very efficient algorithm: you take the midpoint of your high and low points in a sorted array and adjust your high or low point accordingly. Then you recalculate your endpoint and iterate until you find your target value (or you don't, of course.)
Now, quite clearly, if you don't use the midpoint, you introduce some risk to the system. Let's say you shift your search target away from the midpoint and you create two sides - I'll call them a big side and small side. (It doesn't matter whether the shift is toward high or low, because it would be symmetrical.) The risk is that if you miss, your search space is bigger than it would be: you've got to search the big side which is bigger. But the reward is that if you hit your search space is smaller.
It occurs to me that the number of spaces being risked vs rewarded is the same, and (without patterns, which I'm assuming there are none) the likelihood of an element being higher and lower than the midpoint is equal. So the risk is that it falls between the new target and the midpoint.
Now because the number of spaces affects the search space, and the search space is measured logrithmically, it seems to me if I used, let's say 1/4 and 3/4 for our search spaces, I've cut the log of the small space in half, where the large space has only gone up in by about .6 or .7.
So with all this in mind: is there a more efficient way of performing a binary search than just using the midpoint?
Let's agree that the search key is equally likely to be at position in the array—otherwise, we'd want to design an algorithm based on our special knowledge of the location. So all we can choose is where to split the array each time. If we choose a number 0 < x < 1 and split the array there, the chance that it's on the left is x and the chance that it's on the right is 1-x. In the first case we shorten the array by a factor of x and in the second by a factor of 1-x. If we did this many times we'd have a product of many of these factors, and so the 'right' average to use here is the geometric mean. In that sense, the average decrease per step is x with weight x and 1-x with weight 1-x, for a total of x^x * (1-x)^(1-x).
So when is this minimized? If this were the math stackexchange, we'd take derivatives (with the product rule, chain rule, and exponent rule), set them to zero, and solve. But this is stackoverflow, so instead we graph it:
You can see that the further you get from 1/2, the worse you get. For a better understanding I recommend information theory or calculus which have interesting and complementary perspectives on this.

Using multiple threads for faster execution

Approximate program behavior:
I have a map image with data associated with the map indicated by RGB index. The data has been populated into an MS Access database. I imported the information in the database into my program as an array and sorted them to go in the order I want the program to run.
I want the program to find the nearest pixel that has a different color from the incumbent pixel being compared. (Colors are stored as string attributes of object Pixel)
First question: Should I use integers to represent my colors instead of string? Would this make the comparison function run significantly faster?
In order to find the nearest pixel of different color, the program begins with all 8 adjacent pixels around the incumbent. If a nonMatch is not found, it then continues onto the next "degree", and in this fashion, it spirals out from the incumbent pixel until it hits a nonMatch. When found, the color of the nonMatch is saved as an attribute of incumbent. After I find the nonMatch for each of the Pixels, the data is re-inserted into the database
The program accomplishes what I want in the manner I've written it, but it is very very slow. After 24 hours, I am only about 3% through with execution.
Question Two: Does my program behavior sound about right? Is this algorithm you would use if you had to accomplish this task?
Question Three: Would it be appropriate for me to use threads in order to finish execution of the program faster? How exactly does that work? (I am brand new to threads, but know a little of the syntax)
Question Four: Would it be more "intelligent" for my program to find the nonMatch for each pixel and insert it into the database immediately after finding it? (I'm making a guess that this would be good in multi-threading, because while one record is accessing the database (to insert), another record is accessing the array of pixels (shared global variable in program).
Question Five: If threading is a good idea, I'm guessing I would split the records up into more manageable chunks (i.e. quarters), and have each thread run the same functions for their specified number of records? Am I close at all?
Please let me know if I can clarify or provide code samples, I just figured that this is more of a conceptual topic so do not want to overburden the post.
1.) Yes, integers compare much faster than strings. Additionally the y use much less memory
2.) I would adapt the algorithm in this way:
E.g.: #1: Let's say, for pixel(87,23) you found the nearest nonMatch to be (88,24) at degree=1 - you can immediately invert the relation and record, that the nearest nonMatch to (88,24) is (87,23). On degree=1 you finished 2 pixels with 1 search.
E.g. #2: Let's say, for pixel (17,18) you found the nearest nonMatch to be (17,20) at degree=2. You can immediately record, that all pixels, that border on both (16,19), (17,19) and (18,19) have the nearest noMatch (17,20) at degree=1, and that one of them is the nearest noMatch to (17,20). On degree=2 (or higher), you finished 5 pixels with 1 search.
3.) Using threads is a two-sided sword: You can do searches in parallel, but you need locking if you write to your array. So this depends on how many CPU cores you can throw at the problem. If this is 3 or more, threads will surely speed up the search.
4.) The results from 2.) make it necessary to mark a pixel as "done" in your array, as you might have finished up to 5 pixels with 1 search. I recommend you put those into a queue and use a dedicated thread to write the queue back into the database: MS Access can't handle concurrent updates, so a single database writer thread looks like a good idea.
5.) I recommend you NOT chunk up the array: You will run into problems with pixels on the edges of a chunk having their nearest nonMatch in a different chunk. Instead if you use e.g. 4 Threads, let them run 1.) From NW corner E, then S 2.) From SE Corner W, then N 3.) From NE Corner S, then W 4. From SW Corner N, then E
Yes. Using a integer would make it much faster
You can reuse the work you have done for previous pixel. Eg. If (a,b) is the nearest non-equal pixel of (x,y), it is likely that points around (x,y) might also have (a,b) as the nearest non-equal pixel
You can use different threads to work on different pixels instead of dividing searching for one pixel
IMHO, steps 1&2 should make your program much faster and you might not need multi-threading.
Yes, I'd convert colour strings to Integers for speed, or even Color structures if you intend to display them on the screen.
Don't work directly with the database if you can avoid it. Copy the necessary data out of the database into an array before you start, and copy your results back when you're finished.

combining open(Filename, {delayed_write, Size, Delay}) with index

How can I combine a open(Filename, {delayed_write, Size, Delay}) with an index on where to write this data to?
I want to wait until I receive a certain amount of data and then write it to a position in the file.
Also is {read_ahead, Size} the opposite of {delayed_write, Size, Delay}? I would like to read a certain amount of data to send it.
Thanks
read_ahead is kind of the opposite of delayed_write in the sense that reading is the opposite of writing.
If you want to read and send bigger chunks of memory you don't need read_ahead, just read big chunks and send them (not many os calls to save here).
From the file:open/2 manpage on read_ahead:
If read/2 calls are for sizes not significantly less than, or even greater than Size
bytes, no performance gain can be expected.
You don't need to specify and index when opening. Just use pwrite/3 or a combination of position/2 and write/2.
But writing in different positions of the file might just reduce the gain of delayed_write since (also manpage of file:open/2):
The buffered
data is also flushed before some other file
operation than write/2 is executed.
If you have chunks of data for several positions collect them in a list of {Location, Bytes} and from time to time write them with file:pwrite/2 all in one go. This can map to very efficient writev(2) system call that writes several chunks in one go.

Circular Buffer in Flash

I need to store items of varying length in a circular queue in a flash chip. Each item will have its encapsulation so I can figure out how big it is and where the next item begins. When there are enough items in the buffer, it will wrap to the beginning.
What is a good way to store a circular queue in a flash chip?
There is a possibility of tens of thousands of items I would like to store. So starting at the beginning and reading to the end of the buffer is not ideal because it will take time to search to the end.
Also, because it is circular, I need to be able to distinguish the first item from the last.
The last problem is that this is stored in flash, so erasing each block is both time consuming and can only be done a set number of times for each block.
First, block management:
Put a smaller header at the start of each block. The main thing you need to keep track of the "oldest" and "newest" is a block number, which simply increments modulo k. k must be greater than your total number of blocks. Ideally, make k less than your MAX value (e.g. 0xFFFF) so you can easily tell what is an erased block.
At start-up, your code reads the headers of each block in turn, and locates the first and last blocks in the sequence that is ni+1 = (ni + 1) MODULO k. Take care not to get confused by erased blocks (block number is e.g. 0xFFFF) or data that is somehow corrupted (e.g. incomplete erase).
Within each block
Each block initially starts empty (each byte is 0xFF). Each record is simply written one after the other. If you have fixed-size records, then you can access it with a simple index. If you have variable-size records, then to read it you have to scan from the start of the block, linked-list style.
If you want to have variable-size records, but avoid linear scan, then you could have a well defined header on each record. E.g. use 0 as a record delimiter, and COBS-encode (or COBS/R-encode) each record. Or use a byte of your choice as a delimiter, and 'escape' that byte if it occurs in each record (similar to the PPP protocol).
At start-up, once you know your latest block, you can do a linear scan for the latest record. Or if you have fixed-size records or record delimiters, you could do a binary search.
Erase scheduling
For some Flash memory chips, erasing a block can take significant time--e.g. 5 seconds. Consider scheduling an erase as a background task a bit "ahead of time". E.g. when the current block is x% full, then start erasing the next block.
Record numbering
You may want to number records. The way I've done it in the past is to put, in the header of each block, the record number of the first record. Then the software has to keep count of the numbers of each record within the block.
Checksum or CRC
If you want to detect corrupted data (e.g. incomplete writes or erases due to unexpected power failure), then you can add a checksum or CRC to each record, and perhaps to the block header. Note the block header CRC would only cover the header itself, not the records, since it could not be re-written when each new record is written.
Keep a separate block that contains a pointer to the start of the first record and the end of the last record. You can also keep more information like the total number of records, etc.
Until you initially run out of space, adding records is as simple as writing them to the end of the buffer and updating the tail pointer.
As you need to reclaim space, delete enough records so that you can fit your current record. Update the head pointer as you delete records.
You'll need to keep track of how much extra space has been freed. If you keep a pointer to end of the last record, the next time you need to add a record, you can compare that with the pointer to the first record to determine if you need to delete any more records.
Also, if this is NAND, you or the flash controller will need to do deblocking and wear-leveling, but that should all be at a lower layer than allocating space for the circular buffer.
I think I get it now. It seems like your largest issue will be, having filled the available space for recording, what happens next? The new data should overwrite the oldest data, which is I believe what you mean by a circular buffer. But since the data is not fixed length you may overwrite more than one record.
I'm assuming that the amount of variability in length is high enough that padding everything out to a fixed length isn't an option.
Your write segment needs to keep track of the address that represents the start of the next record to write. If you know the size of a block to write ahead of time, you can tell if you are going to end up at the end of the logical buffer and start over at '0'. I wouldn't split a record up with some at the end and some at the beginning.
A separate register can track the beginning; this is the oldest data that hasn't been overwritten yet. If you went to read out the data this is where you would start.
The data writer then would check, given the write-start address and the length of data its about to commit, if it should bump the read register, which would examine the first block and see the length, and advance to the next record, until there is enough room to write whatever the data is. There will be a gap of junk data that lives between the end of the written data and the start of the oldest data, probably. But this way, you can just be writing an address or two as overhead, and not rearranging blocks.
At least, that's probably what I would do. HTH
The "circular" in a flash can be done on basis of block size, which means that you must declare how much blocks of the flash you allocate for this buffer.
The actual size of the buffer will be at each particular time between n-1 (n is the number of blocks) and n.
Each block should start with an header that contains sequential number or timestamp that could be used to determine which block is older than the other.
Each Item encapsulated with an header and a footer. the default header contains whatever you want but according to this header you must know the size of the item. The default footer is 0xFFFFFFFF. This value indicates a null termination.
In your RAM you must save a pointer to the oldest block and the latest block and pointer to the oldest item and latest item. On power up you go over all blocks find the relevant blocks and load this members.
When you want to store a new item, you check if the latest block contain enough space for this item. If it does you save the item at the end of the previous item and the change the previous footer to point to this item. If it does not contain enough space you need to erase the oldest block. Before you erase this block change the oldest block members (RAM) to point on the next block and the oldest item to point on the first item in this block.
Then you can save the new item in this block and change the footer of the latest item to point this item.
I know that the explanation may sounds complicated but the process is very simple and if you write it correct you can make it even power fail safe (always keep in you mind the order of the writes).
Pay attention that the circularity of the buffer is not saved in the flash but the flash only contains a blocks with items that you can decide according to the blocks headers and items headers what is the order of these items
I see three options:
option1: is to pad everything out to the same size, this is simple, store a pointer to the head and tail of the buffer so you know where to write and where to start reading from, use the size of each object to get an offset to the next, this means you need to transverse the buffer as you would a linked list, aka its slow if you need item 5000.
option2: is to store only pointers to the real data in the circular buffer, that way when you loop around you don't have to deal with size mis-matchs. if you store the real data in a circular buffer and don't pad it out you could run into a situations where your over witting multiple items with 1 new data object, i assume this is not ok.
store the actual data elsewhere in flash, most flash will have some sort of wear leveling built in, if so you don't need to worry about overwriting the same location multiple times, the IC will figure out where to actually store it on chip, just write to to the next available free space.
this means you need to pick a maximum size for the circular buffer how you do this depends on the data variability. If the size of the data just change much, say by only a few bytes, then you should just pad it out and use option 1. If the size changes wildly and unpredictably, choose the largest size it could be and figure out how many objects of that size would fit in your flash, use that as the max number of entries in the buffer. This means you waste a bunch of space.
option 3: if the object can really be any size, your at the point where you should just use a file system, name the files in order and loop back when your full keeping in mind if your new entry is large you may have to delete multiple old entries to fit it in. This is really just an extension of option 2 as option2 is in many ways a simple file system.