Find first bit set and unset it atomically - optimization

I'm looking for instructions (x86 machine) or optimization for the following line of code:
lock()
int x = ffs(words); // find first bit that set
long words = unset(x, words); // unset the bit "x" in "words"
unlock()
I don't know how to do this without locking.

If words is an unsigned type, then the non-atomic computation can be done as follows:
words &= ~-words;
Or, equivalently
words &= words - 1;
That works because in -words (assuming 2's-complement, which unsigned types must use), the least-significant 1 and all the 0s to the right of it are unchanged, while all the bits to the left of the least significant 1 are inverted. So words & ~-words will be identical to words except for the least significant 1.
However, in order to do that atomically, you'll either need to use a lock as indicated in the question, or you'll need to do a spin-loop around an atomic compare-and-exchange.

Related

Binary search start or end is target

Why is it that when I see example code for binary search there is never an if statement to check if the start of the array or end is the target?
import java.util.Arrays;
public class App {
public static int binary_search(int[] arr, int left, int right, int target) {
if (left > right) {
return -1;
}
int mid = (left + right) / 2;
if (target == arr[mid]) {
return mid;
}
if (target < arr[mid]) {
return binary_search(arr, left, mid - 1, target);
}
return binary_search(arr, mid + 1, right, target);
}
public static void main(String[] args) {
int[] arr = { 3, 2, 4, -1, 0, 1, 10, 20, 9, 7 };
Arrays.sort(arr);
for (int i = 0; i < arr.length; i++) {
System.out.println("Index: " + i + " value: " + arr[i]);
}
System.out.println(binary_search(arr, arr[0], arr.length - 1, -1));
}
}
in this example if the target was -1 or 20 the search would enter recursion. But it added an if statement to check if target is mid, so why not add two more statements also checking if its left or right?
EDIT:
As pointed out in the comments, I may have misinterpreted the initial question. The answer below assumes that OP meant having the start/end checks as part of each step of the recursion, as opposed to checking once before the recursion even starts.
Since I don't know for sure which interpretation was intended, I'm leaving this post here for now.
Original post:
You seem to be under the impression that "they added an extra check for mid, so surely they should also add an extra check for start and end".
The check "Is mid the target?" is in fact not a mere optimization they added. Recursively checking "mid" is the whole point of a binary search.
When you have a sorted array of elements, a binary search works like this:
Compare the middle element to the target
If the middle element is smaller, throw away the first half
If the middle element is larger, throw away the second half
Otherwise, we found it!
Repeat until we either find the target or there are no more elements.
The act of checking the middle is fundamental to determining which half of the array to continue searching through.
Now, let's say we also add a check for start and end. What does this gain us? Well, if at any point the target happens to be at the very start or end of a segment, we skip a few steps and end slightly sooner. Is this a likely event?
For small toy examples with a few elements, yeah, maybe.
For a massive real-world dataset with billions of entries? Hm, let's think about it. For the sake of simplicity, we assume that we know the target is in the array.
We start with the whole array. Is the first element the target? The odds of that is one in a billion. Pretty unlikely. Is the last element the target? The odds of that is also one in a billion. Pretty unlikely too. You've wasted two extra comparisons to speed up an extremely unlikely case.
We limit ourselves to, say, the first half. We do the same thing again. Is the first element the target? Probably not since the odds are one in half a billion.
...and so on.
The bigger the dataset, the more useless the start/end "optimization" becomes. In fact, in terms of (maximally optimized) comparisons, each step of the algorithm has three comparisons instead of the usual one. VERY roughly estimated, that suggests that the algorithm on average becomes three times slower.
Even for smaller datasets, it is of dubious use since it basically becomes a quasi-linear search instead of a binary search. Yes, the odds are higher, but on average, we can expect a larger amount of comparisons before we reach our target.
The whole point of a binary search is to reach the target with as few wasted comparisons as possible. Adding more unlikely-to-succeed comparisons is typically not the way to improve that.
Edit:
The implementation as posted by OP may also confuse the issue slightly. The implementation chooses to make two comparisons between target and mid. A more optimal implementation would instead make a single three-way comparison (i.e. determine ">", "=" or "<" as a single step instead of two separate ones). This is, for instance, how Java's compareTo or C++'s <=> normally works.
BambooleanLogic's answer is correct and comprehensive. I was curious about how much slower this 'optimization' made binary search, so I wrote a short script to test the change in how many comparisons are performed on average:
Given an array of integers 0, ... , N
do a binary search for every integer in the array,
and count the total number of array accesses made.
To be fair to the optimization, I made it so that after checking arr[left] against target, we increase left by 1, and similarly for right, so that every comparison is as useful as possible. You can try this yourself at Try it online
Results:
Binary search on size 10: Standard 29 Optimized 43 Ratio 1.4828
Binary search on size 100: Standard 580 Optimized 1180 Ratio 2.0345
Binary search on size 1000: Standard 8987 Optimized 21247 Ratio 2.3642
Binary search on size 10000: Standard 123631 Optimized 311205 Ratio 2.5172
Binary search on size 100000: Standard 1568946 Optimized 4108630 Ratio 2.6187
Binary search on size 1000000: Standard 18951445 Optimized 51068017 Ratio 2.6947
Binary search on size 10000000: Standard 223222809 Optimized 610154319 Ratio 2.7334
so the total comparisons does seem to tend to triple the standard number, implying the optimization becomes increasingly unhelpful for larger arrays. I'd be curious whether the limiting ratio is exactly 3.
To add some extra check for start and end along with the mid value is not impressive.
In any algorithm design the main concerned is moving around it's complexity either it is time complexity or space complexity. Most of the time the time complexity is taken as more important aspect.
To learn more about Binary Search Algorithm in different use case like -
If Array is not containing any repeated
If Array has repeated element in this case -
a) return leftmost index/value
b) return rightmost index/value
and many more point

Chessprogramming: how to get a single move out of a bitboard attack-mask most efficently

How can I get efficiently a single move out of an attack mask, that looks like this:
....1...
1...1...
.1..1..1
..1.1.1.
...111..
11111111
..1.11..
.1..1.1.
for a queen.
What I've done in the past, is to get the square-indices of every single possible move from the queen by counting the trailing zeros (bitScanForward)
and after I generated the new move i removed this square from the attack mask and continued with the next attack-square. Is there any technic to get the single attack bits directly?
I think what you are describing is already the most efficient way. Looping over the bitboard until it is zero and pick one move at a time.
To sketch the idea with some code, it could look like this:
using Bitboard = uint64_t; // 64 bit unsigned integer
pMoves createAllMoves(Bitboard mask, int from_sq, Move* pMoves) {
while(moves != 0) {
int to_sq = findAndClearSetBit(mask);
*pMoves++ = createMove(from_sq, to_sq);
}
return pMoves;
}
The findAndClearSetBit function can choose any set bit, but commonly on today's hardware, finding the least significant bit is most efficient. If you are using GCC or Clang, you can use __builtin_ctzll which should be optimized to the specific hardware:
int findAndClearSetBit(Bitboard& mask) {
int sq = __builtin_ctzll(mask); // find least significant bit
mask &= mask - 1; // clear least significant bit
return sq;
}
If I am not mistaken, your existing function bitScanForward is already an implementation to find the least significant bit. So, you can use it to get a portable version.

Why do we do unsigned right shift or signed right shift? [duplicate]

I understand what the unsigned right shift operator ">>>" in Java does, but why do we need it, and why do we not need a corresponding unsigned left shift operator?
The >>> operator lets you treat int and long as 32- and 64-bit unsigned integral types, which are missing from the Java language.
This is useful when you shift something that does not represent a numeric value. For example, you could represent a black and white bit map image using 32-bit ints, where each int encodes 32 pixels on the screen. If you need to scroll the image to the right, you would prefer the bits on the left of an int to become zeros, so that you could easily put the bits from the adjacent ints:
int shiftBy = 3;
int[] imageRow = ...
int shiftCarry = 0;
// The last shiftBy bits are set to 1, the remaining ones are zero
int mask = (1 << shiftBy)-1;
for (int i = 0 ; i != imageRow.length ; i++) {
// Cut out the shiftBits bits on the right
int nextCarry = imageRow & mask;
// Do the shift, and move in the carry into the freed upper bits
imageRow[i] = (imageRow[i] >>> shiftBy) | (carry << (32-shiftBy));
// Prepare the carry for the next iteration of the loop
carry = nextCarry;
}
The code above does not pay attention to the content of the upper three bits, because >>> operator makes them
There is no corresponding << operator because left-shift operations on signed and unsigned data types are identical.
>>> is also the safe and efficient way of finding the rounded mean of two (large) integers:
int mid = (low + high) >>> 1;
If integers high and low are close to the the largest machine integer, the above will be correct but
int mid = (low + high) / 2;
can get a wrong result because of overflow.
Here's an example use, fixing a bug in a naive binary search.
Basically this has to do with sign (numberic shifts) or unsigned shifts (normally pixel related stuff).
Since the left shift, doesn't deal with the sign bit anyhow, it's the same thing (<<< and <<)...
Either way I have yet to meet anyone that needed to use the >>>, but I'm sure they are out there doing amazing things.
As you have just seen, the >> operator automatically fills the
high-order bit with its previous contents each time a shift occurs.
This preserves the sign of the value. However, sometimes this is
undesirable. For example, if you are shifting something that does not
represent a numeric value, you may not want sign extension to take
place. This situation is common when you are working with pixel-based
values and graphics. In these cases you will generally want to shift a
zero into the high-order bit no matter what its initial value was.
This is known as an unsigned shift. To accomplish this, you will use
java’s unsigned, shift-right operator,>>>, which always shifts zeros
into the high-order bit.
Further reading:
http://henkelmann.eu/2011/02/01/java_the_unsigned_right_shift_operator
http://www.java-samples.com/showtutorial.php?tutorialid=60
The signed right-shift operator is useful if one has an int that represents a number and one wishes to divide it by a power of two, rounding toward negative infinity. This can be nice when doing things like scaling coordinates for display; not only is it faster than division, but coordinates which differ by the scale factor before scaling will differ by one pixel afterward. If instead of using shifting one uses division, that won't work. When scaling by a factor of two, for example, -1 and +1 differ by two, and should thus differ by one afterward, but -1/2=0 and 1/2=0. If instead one uses signed right-shift, things work out nicely: -1>>1=-1 and 1>>1=0, properly yielding values one pixel apart.
The unsigned operator is useful either in cases where either the input is expected to have exactly one bit set and one will want the result to do so as well, or in cases where one will be using a loop to output all the bits in a word and wants it to terminate cleanly. For example:
void processBitsLsbFirst(int n, BitProcessor whatever)
{
while(n != 0)
{
whatever.processBit(n & 1);
n >>>= 1;
}
}
If the code were to use a signed right-shift operation and were passed a negative value, it would output 1's indefinitely. With the unsigned-right-shift operator, however, the most significant bit ends up being interpreted just like any other.
The unsigned right-shift operator may also be useful when a computation would, arithmetically, yield a positive number between 0 and 4,294,967,295 and one wishes to divide that number by a power of two. For example, when computing the sum of two int values which are known to be positive, one may use (n1+n2)>>>1 without having to promote the operands to long. Also, if one wishes to divide a positive int value by something like pi without using floating-point math, one may compute ((value*5468522205L) >>> 34) [(1L<<34)/pi is 5468522204.61, which rounded up yields 5468522205]. For dividends over 1686629712, the computation of value*5468522205L would yield a "negative" value, but since the arithmetically-correct value is known to be positive, using the unsigned right-shift would allow the correct positive number to be used.
A normal right shift >> of a negative number will keep it negative. I.e. the sign bit will be retained.
An unsigned right shift >>> will shift the sign bit too, replacing it with a zero bit.
There is no need to have the equivalent left shift because there is only one sign bit and it is the leftmost bit so it only interferes when shifting right.
Essentially, the difference is that one preserves the sign bit, the other shifts in zeros to replace the sign bit.
For positive numbers they act identically.
For an example of using both >> and >>> see BigInteger shiftRight.
In the Java domain most typical applications the way to avoid overflows is to use casting or Big Integer, such as int to long in the previous examples.
int hiint = 2147483647;
System.out.println("mean hiint+hiint/2 = " + ( (((long)hiint+(long)hiint)))/2);
System.out.println("mean hiint*2/2 = " + ( (((long)hiint*(long)2)))/2);
BigInteger bhiint = BigInteger.valueOf(2147483647);
System.out.println("mean bhiint+bhiint/2 = " + (bhiint.add(bhiint).divide(BigInteger.valueOf(2))));

Trie Implementation Question

I'm implementing a trie for predictive text entry in VB.NET - basically autocompletion as far as the use of the trie is concerned. I've made my trie a recursive data structure based on the generic dictionary class.
It's basically:
class WordTree Inherits Dictionary(of Char, WordTree)
Each letter in a word (all upper cased) is used as a key to a new WordTrie. A null character on a leaf indicates the termination of a word. To find a word starting with a prefix I walk the trie as far as my prefix goes then collect all children words.
My question is basically on the implementation of the trie itself. I'm using the dictionary hash function to branch my tree. I could use a list and do a linear search over the list, or do something else. What's the smooth move here? Is this a reasonable way to do my branching?
Thanks.
Update:
Just to clarify, I'm basically asking if the dictionary branching approach is obviously inferior to some other alternative. The application in which I'm using this data structure only uses upper case letters, so maybe the array approach is the best. I might use the same data structure for a more complex typeahead situation in the future (more characters). In that case, it sounds like the dictionary is the right approach - up to the point where I need to use something more complex in general.
If it's just the 26 letters, as a 26 entry array. Then lookup is by index. It probably uses less space than the Dictionary if the bucket-list is longer than 26.
If you are worried about space, you can use bitmap compression on the valid byte transitions, assuming the 26char limit.
class State // could be struct or whatever
{
int valid; // can handle 32 transitions -- each bit set is valid
vector<State> transitions;
State getNextState( int ch )
{
int index;
int mask = ( 1 << ( toupper( ch ) - 'A' )) -1;
int bitsToCount = valid & mask;
for( index = 0; bitsToCount ; bitsToCount >>= 1)
{
index += bitsToCount & 1;
}
transitions.at( index );
}
};
There are other ways to do the bit counting Here, the index into the vector is the number of set bits in the valid bitset. the other alternative is the direct indexed array of states;
class State
{
State transitions[ 26 ]; // use the char as the index.
State getNextState( int ch )
{
return transitions[ ch ];
}
};
A good data structure that's efficient in space and potentially gives sub-linear prefix lookups is the ternary search tree. Peter Kankowski has a fantastic article about it. He uses C, but it's straightforward code once you understand the data structure. As he mentioned, this is the structure ispell uses for spelling correction.
I have done this (a trie implementation) in C with 8 bit chars, and simply used the array version (as alluded to by the "26 chars" answer).
HOWEVER, I am guessing that you want full unicode support (since a .NET char is unicode, among other reasons). Assuming you have to have support for unicode, the hash/map/dictionary lookup is probably your best bet, as a 64K entry array in each node won't really work very well.
About the only hack up I could think of on this is to store entire strings (suffixes or possibly "in-fixes") on branches that do not yet split, depending on how sparse the tree, er, trie, is. That adds a lot of logic to detect the multi-char strings, though, and to split them up when an alternate path is introduced.
What is the read vs update pattern?
---- update jul 2013 ---
If .NET strings have a function like java to get the bytes for a string (as UTF-8), then having an array in each node to represent the current position's byte value is probably a good way to go. You could even make the arrays variable size, with first/last bounds indicators in each node, since MANY nodes will have only lower case ASCII letters anyway, or only upper case letters or the digits 0-9 in some cases.
I've found burst trie's to be very space efficient. I wrote my own burst trie in Scala that also re-uses some ideas that I found in GWT's trie implementation. I used it in Stripe's Capture the Flag contest on a problem that was multi-node with a small amount of RAM.

Is there a practical limit to the size of bit masks?

There's a common way to store multiple values in one variable, by using a bitmask. For example, if a user has read, write and execute privileges on an item, that can be converted to a single number by saying read = 4 (2^2), write = 2 (2^1), execute = 1 (2^0) and then add them together to get 7.
I use this technique in several web applications, where I'd usually store the variable into a field and give it a type of MEDIUMINT or whatever, depending on the number of different values.
What I'm interested in, is whether or not there is a practical limit to the number of values you can store like this? For example, if the number was over 64, you couldn't use (64 bit) integers any more. If this was the case, what would you use? How would it affect your program logic (ie: could you still use bitwise comparisons)?
I know that once you start getting really large sets of values, a different method would be the optimal solution, but I'm interested in the boundaries of this method.
Off the top of my head, I'd write a set_bit and get_bit function that could take an array of bytes and a bit offset in the array, and use some bit-twiddling to set/get the appropriate bit in the array. Something like this (in C, but hopefully you get the idea):
// sets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// result is 0 on success, non-zero on failure (offset out-of-bounds)
int set_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//set the right bit
bytes[offset >> 3] |= (1 << (offset & 0x7));
return 0; //success
}
//gets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// returns (-1) on error, 0 if bit is "off", positive number if "on"
int get_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//get the right bit
return (bytes[offset >> 3] & (1 << (offset & 0x7));
}
I've used bit masks in filesystem code where the bit mask is many times bigger than a machine word. think of it like an "array of booleans";
(journalling masks in flash memory if you want to know)
many compilers know how to do this for you. Adda bit of OO code to have types that operate senibly and then your code starts looking like it's intent, not some bit-banging.
My 2 cents.
With a 64-bit integer, you can store values up to 2^64-1, 64 is only 2^6. So yes, there is a limit, but if you need more than 64-its worth of flags, I'd be very interested to know what they were all doing :)
How many states so you need to potentially think about? If you have 64 potential states, the number of combinations they can exist in is the full size of a 64-bit integer.
If you need to worry about 128 flags, then a pair of bit vectors would suffice (2^64 * 2).
Addition: in Programming Pearls, there is an extended discussion of using a bit array of length 10^7, implemented in integers (for holding used 800 numbers) - it's very fast, and very appropriate for the task described in that chapter.
Some languages ( I believe perl does, not sure ) permit bitwise arithmetic on strings. Giving you a much greater effective range. ( (strlen * 8bit chars ) combinations )
However, I wouldn't use a single value for superimposition of more than one /type/ of data. The basic r/w/x triplet of 3-bit ints would probably be the upper "practical" limit, not for space efficiency reasons, but for practical development reasons.
( Php uses this system to control its error-messages, and I have already found that its a bit over-the-top when you have to define values where php's constants are not resident and you have to generate the integer by hand, and to be honest, if chmod didn't support the 'ugo+rwx' style syntax I'd never want to use it because i can never remember the magic numbers )
The instant you have to crack open a constants table to debug code you know you've gone too far.
Old thread, but it's worth mentioning that there are cases requiring bloated bit masks, e.g., molecular fingerprints, which are often generated as 1024-bit arrays which we have packed in 32 bigint fields (SQL Server not supporting UInt32). Bit wise operations work fine - until your table starts to grow and you realize the sluggishness of separate function calls. The binary data type would work, were it not for T-SQL's ban on bitwise operators having two binary operands.
For example .NET uses array of integers as an internal storage for their BitArray class.
Practically there's no other way around.
That being said, in SQL you will need more than one column (or use the BLOBS) to store all the states.
You tagged this question SQL, so I think you need to consult with the documentation for your database to find the size of an integer. Then subtract one bit for the sign, just to be safe.
Edit: Your comment says you're using MySQL. The documentation for MySQL 5.0 Numeric Types states that the maximum size of a NUMERIC is 64 or 65 digits. That's 212 bits for 64 digits.
Remember that your language of choice has to be able to work with those digits, so you may be limited to a 64-bit integer anyway.