Fast memmove for x86 and +1 shift (for Move-to-front transform) - optimization

For fast MTF ( http://en.wikipedia.org/wiki/Move-to-front_transform ) i need faster version of moving a char from inside the array into the front of it:
char mtfSymbol[256], front;
char i;
for(;;) { \\ a very big loop
...
i=get_i(); \\ i is in 0..256 but more likely to be smaller.
front=mtfSymbol[i];
memmove(mtfSymbol+1, mtfSymbol, i);
mtfSymbol[0]=front;
}
cachegrind shows, that for memmove there are lot of branch mispredictions here.
For the other version of code (not a memmove in the first example, but this one)
do
{
mtfSymbol[i] = mtfSymbol[i-1];
} while ( --i );
there are lot of byte reads/writes, conditional branches and branch mispredictions
i is not very big, as it is MTF used for "good" input - a text file after a BWT ( Burrows–Wheeler transform )
Compiler is GCC.

If you pre-allocate your buffer bigger than you're going to need it, and put your initial array somewhere in the middle (or at the end, if you're never going to have to extend it that way) then you can append items (up to a limit) by changing the address of the start of the array rather than by moving all the elements.
You'll obviously need to keep track of how far back you've moved, so you can re-allocate if you do fall off the start of your existing allocation, but this should still be quicker than moving all of your array entries around.

You can also use a dedicated data structure rather than an array to speed up the forward transform.
A fast implementation can be built with a list of linked lists to avoid the array element moves altogether.
See http://code.google.com/p/kanzi/source/browse/java/src/kanzi/transform/MTFT.java
For inverse transform, it turns out that arrays are as fast as linked lists.

Related

Binary search start or end is target

Why is it that when I see example code for binary search there is never an if statement to check if the start of the array or end is the target?
import java.util.Arrays;
public class App {
public static int binary_search(int[] arr, int left, int right, int target) {
if (left > right) {
return -1;
}
int mid = (left + right) / 2;
if (target == arr[mid]) {
return mid;
}
if (target < arr[mid]) {
return binary_search(arr, left, mid - 1, target);
}
return binary_search(arr, mid + 1, right, target);
}
public static void main(String[] args) {
int[] arr = { 3, 2, 4, -1, 0, 1, 10, 20, 9, 7 };
Arrays.sort(arr);
for (int i = 0; i < arr.length; i++) {
System.out.println("Index: " + i + " value: " + arr[i]);
}
System.out.println(binary_search(arr, arr[0], arr.length - 1, -1));
}
}
in this example if the target was -1 or 20 the search would enter recursion. But it added an if statement to check if target is mid, so why not add two more statements also checking if its left or right?
EDIT:
As pointed out in the comments, I may have misinterpreted the initial question. The answer below assumes that OP meant having the start/end checks as part of each step of the recursion, as opposed to checking once before the recursion even starts.
Since I don't know for sure which interpretation was intended, I'm leaving this post here for now.
Original post:
You seem to be under the impression that "they added an extra check for mid, so surely they should also add an extra check for start and end".
The check "Is mid the target?" is in fact not a mere optimization they added. Recursively checking "mid" is the whole point of a binary search.
When you have a sorted array of elements, a binary search works like this:
Compare the middle element to the target
If the middle element is smaller, throw away the first half
If the middle element is larger, throw away the second half
Otherwise, we found it!
Repeat until we either find the target or there are no more elements.
The act of checking the middle is fundamental to determining which half of the array to continue searching through.
Now, let's say we also add a check for start and end. What does this gain us? Well, if at any point the target happens to be at the very start or end of a segment, we skip a few steps and end slightly sooner. Is this a likely event?
For small toy examples with a few elements, yeah, maybe.
For a massive real-world dataset with billions of entries? Hm, let's think about it. For the sake of simplicity, we assume that we know the target is in the array.
We start with the whole array. Is the first element the target? The odds of that is one in a billion. Pretty unlikely. Is the last element the target? The odds of that is also one in a billion. Pretty unlikely too. You've wasted two extra comparisons to speed up an extremely unlikely case.
We limit ourselves to, say, the first half. We do the same thing again. Is the first element the target? Probably not since the odds are one in half a billion.
...and so on.
The bigger the dataset, the more useless the start/end "optimization" becomes. In fact, in terms of (maximally optimized) comparisons, each step of the algorithm has three comparisons instead of the usual one. VERY roughly estimated, that suggests that the algorithm on average becomes three times slower.
Even for smaller datasets, it is of dubious use since it basically becomes a quasi-linear search instead of a binary search. Yes, the odds are higher, but on average, we can expect a larger amount of comparisons before we reach our target.
The whole point of a binary search is to reach the target with as few wasted comparisons as possible. Adding more unlikely-to-succeed comparisons is typically not the way to improve that.
Edit:
The implementation as posted by OP may also confuse the issue slightly. The implementation chooses to make two comparisons between target and mid. A more optimal implementation would instead make a single three-way comparison (i.e. determine ">", "=" or "<" as a single step instead of two separate ones). This is, for instance, how Java's compareTo or C++'s <=> normally works.
BambooleanLogic's answer is correct and comprehensive. I was curious about how much slower this 'optimization' made binary search, so I wrote a short script to test the change in how many comparisons are performed on average:
Given an array of integers 0, ... , N
do a binary search for every integer in the array,
and count the total number of array accesses made.
To be fair to the optimization, I made it so that after checking arr[left] against target, we increase left by 1, and similarly for right, so that every comparison is as useful as possible. You can try this yourself at Try it online
Results:
Binary search on size 10: Standard 29 Optimized 43 Ratio 1.4828
Binary search on size 100: Standard 580 Optimized 1180 Ratio 2.0345
Binary search on size 1000: Standard 8987 Optimized 21247 Ratio 2.3642
Binary search on size 10000: Standard 123631 Optimized 311205 Ratio 2.5172
Binary search on size 100000: Standard 1568946 Optimized 4108630 Ratio 2.6187
Binary search on size 1000000: Standard 18951445 Optimized 51068017 Ratio 2.6947
Binary search on size 10000000: Standard 223222809 Optimized 610154319 Ratio 2.7334
so the total comparisons does seem to tend to triple the standard number, implying the optimization becomes increasingly unhelpful for larger arrays. I'd be curious whether the limiting ratio is exactly 3.
To add some extra check for start and end along with the mid value is not impressive.
In any algorithm design the main concerned is moving around it's complexity either it is time complexity or space complexity. Most of the time the time complexity is taken as more important aspect.
To learn more about Binary Search Algorithm in different use case like -
If Array is not containing any repeated
If Array has repeated element in this case -
a) return leftmost index/value
b) return rightmost index/value
and many more point

Specifying push constant block offset in HLSL

I am trying to write a Vulkan renderer, I use glslangValidator with HLSL for shaders and am trying to implement push constants.
[[vk::push_constant]]
cbuffer cbFragment {
float4 imageColor;
float4 aaaa;
};
[[vk::push_constant]]
cbuffer cbMatrices {
float4 bbbb;
};
The annotation "[[vk::push_constant]]" works, I use spirv_reflect for reflection and both push constants show up and they work as intended.
The problem I'm having is that they seemingly overlap, if I assign "bbbb" a value, "imageColor" is affected in exactly the same way and vice versa. In the reflection data both push constant blocks have the offset 0, which explains the issue. However, I seem to be completely unable to change the offset of either of the push constants.
[[vk::offset(x)]] does not work at all, it neither affects the individual member offsets nor the offset of the push constants. The only offset that works at all is HLSL's built in "packoffset", which only applies to the buffer members. And although it might actually be a solution to just offset the members of one of the push constants to be outside the range of the other, I hardly believe that can be a sensible solution as it's also causing the validation layer to fail because offsetting the individual member simply increases the size of the push constant unnecessarily and the overlap itself is still present.
I would greatly appreciate any help on this matter and am willing to provide any necessary clarification, thank you very much!
Push constants live in a single chunk of contiguous memory. The compiler doesn't try to append multiple blocks into that memory; like with the GLSL syntax, it's intended to just have one block containing all the push constant data.
This is consistent with other places where the compiler has to pack variables in a block: it only packs within a block, not across multiple blocks. Two separate non-pushconstant cbuffers would refer to two distinct buffers in memory, with contents that begin at offset zero within their individual buffer. There's only one "push constant buffer", hence you should only decorate one cbuffer with vk::push_constant.

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy with the following code (ignoring all sorts of wrappers and concept checks etc) with the simple loop:
for (; __first != __last; ++__result, ++__first)
*__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is sizeof(T) is 1 or 2, and we have have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements up to the multiple of 32B, and the last elements (after the last such multiple,) differently.
Slack due to output address mis-alignment - The output is written in transactions of upto 128B (or is it just 32B?); we thus have to handle the first elements up to the multiple of this number, and the last elements (after the last such multiple,) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode

Trie Implementation Question

I'm implementing a trie for predictive text entry in VB.NET - basically autocompletion as far as the use of the trie is concerned. I've made my trie a recursive data structure based on the generic dictionary class.
It's basically:
class WordTree Inherits Dictionary(of Char, WordTree)
Each letter in a word (all upper cased) is used as a key to a new WordTrie. A null character on a leaf indicates the termination of a word. To find a word starting with a prefix I walk the trie as far as my prefix goes then collect all children words.
My question is basically on the implementation of the trie itself. I'm using the dictionary hash function to branch my tree. I could use a list and do a linear search over the list, or do something else. What's the smooth move here? Is this a reasonable way to do my branching?
Thanks.
Update:
Just to clarify, I'm basically asking if the dictionary branching approach is obviously inferior to some other alternative. The application in which I'm using this data structure only uses upper case letters, so maybe the array approach is the best. I might use the same data structure for a more complex typeahead situation in the future (more characters). In that case, it sounds like the dictionary is the right approach - up to the point where I need to use something more complex in general.
If it's just the 26 letters, as a 26 entry array. Then lookup is by index. It probably uses less space than the Dictionary if the bucket-list is longer than 26.
If you are worried about space, you can use bitmap compression on the valid byte transitions, assuming the 26char limit.
class State // could be struct or whatever
{
int valid; // can handle 32 transitions -- each bit set is valid
vector<State> transitions;
State getNextState( int ch )
{
int index;
int mask = ( 1 << ( toupper( ch ) - 'A' )) -1;
int bitsToCount = valid & mask;
for( index = 0; bitsToCount ; bitsToCount >>= 1)
{
index += bitsToCount & 1;
}
transitions.at( index );
}
};
There are other ways to do the bit counting Here, the index into the vector is the number of set bits in the valid bitset. the other alternative is the direct indexed array of states;
class State
{
State transitions[ 26 ]; // use the char as the index.
State getNextState( int ch )
{
return transitions[ ch ];
}
};
A good data structure that's efficient in space and potentially gives sub-linear prefix lookups is the ternary search tree. Peter Kankowski has a fantastic article about it. He uses C, but it's straightforward code once you understand the data structure. As he mentioned, this is the structure ispell uses for spelling correction.
I have done this (a trie implementation) in C with 8 bit chars, and simply used the array version (as alluded to by the "26 chars" answer).
HOWEVER, I am guessing that you want full unicode support (since a .NET char is unicode, among other reasons). Assuming you have to have support for unicode, the hash/map/dictionary lookup is probably your best bet, as a 64K entry array in each node won't really work very well.
About the only hack up I could think of on this is to store entire strings (suffixes or possibly "in-fixes") on branches that do not yet split, depending on how sparse the tree, er, trie, is. That adds a lot of logic to detect the multi-char strings, though, and to split them up when an alternate path is introduced.
What is the read vs update pattern?
---- update jul 2013 ---
If .NET strings have a function like java to get the bytes for a string (as UTF-8), then having an array in each node to represent the current position's byte value is probably a good way to go. You could even make the arrays variable size, with first/last bounds indicators in each node, since MANY nodes will have only lower case ASCII letters anyway, or only upper case letters or the digits 0-9 in some cases.
I've found burst trie's to be very space efficient. I wrote my own burst trie in Scala that also re-uses some ideas that I found in GWT's trie implementation. I used it in Stripe's Capture the Flag contest on a problem that was multi-node with a small amount of RAM.

How do you pre-size an array in Lua?

I've got a Lua program that seems to be slower than it ought to be. I suspect the issue is that I'm adding values to an associative array one at a time and the table has to allocate new memory each time.
There did seem to be a table.setn function, but it fails under Lua 5.1.3:
stdin:1: 'setn' is obsolete
stack traceback:
[C]: in function 'setn'
stdin:1: in main chunk
[C]: ?
I gather from the Google searching I've done that this function was depreciated in Lua 5.1, but I can't find what (if anything) replaced the functionality.
Do you know how to pre-size a table in Lua?
Alternatively, is there some other way to avoid memory allocation when you add an object to a table?
Let me focus more on your question:
adding values to an associative array
one at a time
Tables in Lua are associative, but using them in an array form (1..N) is optimized. They have double faces, internally.
So.. If you indeed are adding values associatively, follow the rules above.
If you are using indices 1..N, you can force a one-time size readjust by setting t[100000]= something. This should work until the limit of optimized array size, specified within Lua sources (2^26 = 67108864). After that, everything is associative.
p.s. The old 'setn' method handled the array part only, so it's no use for associative usage (ignore those answers).
p.p.s. Have you studied general tips for keeping Lua performance high? i.e. know table creation and rather reuse a table than create a new one, use of 'local print=print' and such to avoid global accesses.
static int new_sized_table( lua_State *L )
{
int asize = lua_tointeger( L, 1 );
int hsize = lua_tointeger( L, 2 );
lua_createtable( L, asize, hsize );
return( 1 );
}
...
lua_pushcfunction( L, new_sized_table );
lua_setglobal( L, "sized_table" );
Then, in Lua,
array = function(size) return sized_table(size,0) end
a = array(10)
As a quick hack to get this running you can add the C to lua.c.
I don't think you can - it's not an array, it's an associative array, like a perl hash or an awk array.
http://www.lua.org/manual/5.1/manual.html#2.5.5
I don't think you can preset its size meaningfully from the Lua side.
If you're allocating the array on the C side, though, the
void lua_createtable (lua_State *L, int narr, int nrec);
may be what you need.
Creates a new empty table and pushes
it onto the stack. The new table has
space pre-allocated for narr array
elements and nrec non-array elements.
This pre-allocation is useful when you
know exactly how many elements the
table will have. Otherwise you can use
the function lua_newtable.
There is still an internal luaL_setn and you can compile Lua so that
it is exposed as table.setn. But it looks like that it won't help
because the code doesn't seem to do any pre-extending.
(Also the setn as commented above the setn is related to the array part
of a Lua table, and you said that your are using the table as an associative
array)
The good part is that even if you add the elements one by one, Lua does not
increase the array that way. Instead it uses a more reasonable strategy. You still
get multiple allocations for a larger array but the performance is better than
getting a new allocation each time.
Although this doesn't answer your main question, it answers your second question:
Alternatively, is there some other way to avoid memory allocation when you add an object to a table?
If your running Lua in a custom application, as I can guess since your doing C coding, I suggest you replace the allocator with Loki's small value allocator, it reduced my memory allocations 100+ fold. This improved performance by avoiding round trips to the Kernel, and made me a much happier programmer :)
Anyways I tried other allocators, but they were more general, and provide guarantee's that don't benefit Lua applications (such as thread safety, and large object allocation, etc...), also writing your own small-object allocator can be a good week of programming and debugging to get just right, and after searching for an available solution Loki's allocator wasthe easiest and fastest I found for this problem.
If you declare your table in code with a specific amount of items, like so:
local tab = { 0, 1, 2, 3, 4, 5, ... , n }
then Lua will create the table with memory already allocated for at least n items.
However, Lua uses the 2x incremental memory allocation technique, so adding an item to a table should rarely force a reallocation.