Trie Implementation Question - trie

I'm implementing a trie for predictive text entry in VB.NET - basically autocompletion as far as the use of the trie is concerned. I've made my trie a recursive data structure based on the generic dictionary class.
It's basically:
class WordTree Inherits Dictionary(of Char, WordTree)
Each letter in a word (all upper cased) is used as a key to a new WordTrie. A null character on a leaf indicates the termination of a word. To find a word starting with a prefix I walk the trie as far as my prefix goes then collect all children words.
My question is basically on the implementation of the trie itself. I'm using the dictionary hash function to branch my tree. I could use a list and do a linear search over the list, or do something else. What's the smooth move here? Is this a reasonable way to do my branching?
Thanks.
Update:
Just to clarify, I'm basically asking if the dictionary branching approach is obviously inferior to some other alternative. The application in which I'm using this data structure only uses upper case letters, so maybe the array approach is the best. I might use the same data structure for a more complex typeahead situation in the future (more characters). In that case, it sounds like the dictionary is the right approach - up to the point where I need to use something more complex in general.

If it's just the 26 letters, as a 26 entry array. Then lookup is by index. It probably uses less space than the Dictionary if the bucket-list is longer than 26.

If you are worried about space, you can use bitmap compression on the valid byte transitions, assuming the 26char limit.
class State // could be struct or whatever
{
int valid; // can handle 32 transitions -- each bit set is valid
vector<State> transitions;
State getNextState( int ch )
{
int index;
int mask = ( 1 << ( toupper( ch ) - 'A' )) -1;
int bitsToCount = valid & mask;
for( index = 0; bitsToCount ; bitsToCount >>= 1)
{
index += bitsToCount & 1;
}
transitions.at( index );
}
};
There are other ways to do the bit counting Here, the index into the vector is the number of set bits in the valid bitset. the other alternative is the direct indexed array of states;
class State
{
State transitions[ 26 ]; // use the char as the index.
State getNextState( int ch )
{
return transitions[ ch ];
}
};

A good data structure that's efficient in space and potentially gives sub-linear prefix lookups is the ternary search tree. Peter Kankowski has a fantastic article about it. He uses C, but it's straightforward code once you understand the data structure. As he mentioned, this is the structure ispell uses for spelling correction.

I have done this (a trie implementation) in C with 8 bit chars, and simply used the array version (as alluded to by the "26 chars" answer).
HOWEVER, I am guessing that you want full unicode support (since a .NET char is unicode, among other reasons). Assuming you have to have support for unicode, the hash/map/dictionary lookup is probably your best bet, as a 64K entry array in each node won't really work very well.
About the only hack up I could think of on this is to store entire strings (suffixes or possibly "in-fixes") on branches that do not yet split, depending on how sparse the tree, er, trie, is. That adds a lot of logic to detect the multi-char strings, though, and to split them up when an alternate path is introduced.
What is the read vs update pattern?
---- update jul 2013 ---
If .NET strings have a function like java to get the bytes for a string (as UTF-8), then having an array in each node to represent the current position's byte value is probably a good way to go. You could even make the arrays variable size, with first/last bounds indicators in each node, since MANY nodes will have only lower case ASCII letters anyway, or only upper case letters or the digits 0-9 in some cases.

I've found burst trie's to be very space efficient. I wrote my own burst trie in Scala that also re-uses some ideas that I found in GWT's trie implementation. I used it in Stripe's Capture the Flag contest on a problem that was multi-node with a small amount of RAM.

Related

Binary search start or end is target

Why is it that when I see example code for binary search there is never an if statement to check if the start of the array or end is the target?
import java.util.Arrays;
public class App {
public static int binary_search(int[] arr, int left, int right, int target) {
if (left > right) {
return -1;
}
int mid = (left + right) / 2;
if (target == arr[mid]) {
return mid;
}
if (target < arr[mid]) {
return binary_search(arr, left, mid - 1, target);
}
return binary_search(arr, mid + 1, right, target);
}
public static void main(String[] args) {
int[] arr = { 3, 2, 4, -1, 0, 1, 10, 20, 9, 7 };
Arrays.sort(arr);
for (int i = 0; i < arr.length; i++) {
System.out.println("Index: " + i + " value: " + arr[i]);
}
System.out.println(binary_search(arr, arr[0], arr.length - 1, -1));
}
}
in this example if the target was -1 or 20 the search would enter recursion. But it added an if statement to check if target is mid, so why not add two more statements also checking if its left or right?
EDIT:
As pointed out in the comments, I may have misinterpreted the initial question. The answer below assumes that OP meant having the start/end checks as part of each step of the recursion, as opposed to checking once before the recursion even starts.
Since I don't know for sure which interpretation was intended, I'm leaving this post here for now.
Original post:
You seem to be under the impression that "they added an extra check for mid, so surely they should also add an extra check for start and end".
The check "Is mid the target?" is in fact not a mere optimization they added. Recursively checking "mid" is the whole point of a binary search.
When you have a sorted array of elements, a binary search works like this:
Compare the middle element to the target
If the middle element is smaller, throw away the first half
If the middle element is larger, throw away the second half
Otherwise, we found it!
Repeat until we either find the target or there are no more elements.
The act of checking the middle is fundamental to determining which half of the array to continue searching through.
Now, let's say we also add a check for start and end. What does this gain us? Well, if at any point the target happens to be at the very start or end of a segment, we skip a few steps and end slightly sooner. Is this a likely event?
For small toy examples with a few elements, yeah, maybe.
For a massive real-world dataset with billions of entries? Hm, let's think about it. For the sake of simplicity, we assume that we know the target is in the array.
We start with the whole array. Is the first element the target? The odds of that is one in a billion. Pretty unlikely. Is the last element the target? The odds of that is also one in a billion. Pretty unlikely too. You've wasted two extra comparisons to speed up an extremely unlikely case.
We limit ourselves to, say, the first half. We do the same thing again. Is the first element the target? Probably not since the odds are one in half a billion.
...and so on.
The bigger the dataset, the more useless the start/end "optimization" becomes. In fact, in terms of (maximally optimized) comparisons, each step of the algorithm has three comparisons instead of the usual one. VERY roughly estimated, that suggests that the algorithm on average becomes three times slower.
Even for smaller datasets, it is of dubious use since it basically becomes a quasi-linear search instead of a binary search. Yes, the odds are higher, but on average, we can expect a larger amount of comparisons before we reach our target.
The whole point of a binary search is to reach the target with as few wasted comparisons as possible. Adding more unlikely-to-succeed comparisons is typically not the way to improve that.
Edit:
The implementation as posted by OP may also confuse the issue slightly. The implementation chooses to make two comparisons between target and mid. A more optimal implementation would instead make a single three-way comparison (i.e. determine ">", "=" or "<" as a single step instead of two separate ones). This is, for instance, how Java's compareTo or C++'s <=> normally works.
BambooleanLogic's answer is correct and comprehensive. I was curious about how much slower this 'optimization' made binary search, so I wrote a short script to test the change in how many comparisons are performed on average:
Given an array of integers 0, ... , N
do a binary search for every integer in the array,
and count the total number of array accesses made.
To be fair to the optimization, I made it so that after checking arr[left] against target, we increase left by 1, and similarly for right, so that every comparison is as useful as possible. You can try this yourself at Try it online
Results:
Binary search on size 10: Standard 29 Optimized 43 Ratio 1.4828
Binary search on size 100: Standard 580 Optimized 1180 Ratio 2.0345
Binary search on size 1000: Standard 8987 Optimized 21247 Ratio 2.3642
Binary search on size 10000: Standard 123631 Optimized 311205 Ratio 2.5172
Binary search on size 100000: Standard 1568946 Optimized 4108630 Ratio 2.6187
Binary search on size 1000000: Standard 18951445 Optimized 51068017 Ratio 2.6947
Binary search on size 10000000: Standard 223222809 Optimized 610154319 Ratio 2.7334
so the total comparisons does seem to tend to triple the standard number, implying the optimization becomes increasingly unhelpful for larger arrays. I'd be curious whether the limiting ratio is exactly 3.
To add some extra check for start and end along with the mid value is not impressive.
In any algorithm design the main concerned is moving around it's complexity either it is time complexity or space complexity. Most of the time the time complexity is taken as more important aspect.
To learn more about Binary Search Algorithm in different use case like -
If Array is not containing any repeated
If Array has repeated element in this case -
a) return leftmost index/value
b) return rightmost index/value
and many more point

how to optimize search difference between array / list of object

Premesis:
I am using ActionScript with two arraycollections containing objects with values to be matched...
I need a solution for this (if in the framework there is a library that does it better) otherwise any suggestions are appreciated...
Let's assume I have two lists of elements A and B (no duplicate values) and I need to compare them and remove all the elements present in both, so at the end I should have
in A all the elements that are in A but not in B
in B all the elements that are in B but not in A
now I do something like that:
for (var i:int = 0 ; i < a.length ;)
{
var isFound:Boolean = false;
for (var j:int = 0 ; j < b.length ;)
{
if (a.getItemAt(i).nome == b.getItemAt(j).nome)
{
isFound = true;
a.removeItemAt(i);
b.removeItemAt(j);
break;
}
j++;
}
if (!isFound)
i++;
}
I cycle both the arrays and if I found a match I remove the items from both of the arrays (and don't increase the loop value so the for cycle progress in a correct way)
I was wondering if (and I'm sure there is) there is a better (and less CPU consuming) way to do it...
If you must use a list, and you don't need the abilities of arraycollection, I suggest simply converting it to using AS3 Vectors. The performance increase according to this (http://www.mikechambers.com/blog/2008/09/24/actioscript-3-vector-array-performance-comparison/) are 60% compared to Arrays. I believe Arrays are already 3x faster than ArrayCollections from some article I once read. Unfortunately, this solution is still O(n^2) in time.
As an aside, the reason why Vectors are faster than ArrayCollections is because you provide type-hinting to the VM. The VM knows exactly how large each object is in the collection and performs optimizations based on that.
Another optimization on the vectors is to sort the data first by nome before doing the comparisons. You add another check to break out of the loop if the nome of list b simply wouldn't be found further down in list A due to the ordering.
If you want to do MUCH faster than that, use an associative array (object in as3). Of course, this may require more refactoring effort. I am assuming object.nome is a unique string/id for the objects. Simply assign that the value of nome as the key in objectA and objectB. By doing it this way, you might not need to loop through each element in each list to do the comparison.

Fast memmove for x86 and +1 shift (for Move-to-front transform)

For fast MTF ( http://en.wikipedia.org/wiki/Move-to-front_transform ) i need faster version of moving a char from inside the array into the front of it:
char mtfSymbol[256], front;
char i;
for(;;) { \\ a very big loop
...
i=get_i(); \\ i is in 0..256 but more likely to be smaller.
front=mtfSymbol[i];
memmove(mtfSymbol+1, mtfSymbol, i);
mtfSymbol[0]=front;
}
cachegrind shows, that for memmove there are lot of branch mispredictions here.
For the other version of code (not a memmove in the first example, but this one)
do
{
mtfSymbol[i] = mtfSymbol[i-1];
} while ( --i );
there are lot of byte reads/writes, conditional branches and branch mispredictions
i is not very big, as it is MTF used for "good" input - a text file after a BWT ( Burrows–Wheeler transform )
Compiler is GCC.
If you pre-allocate your buffer bigger than you're going to need it, and put your initial array somewhere in the middle (or at the end, if you're never going to have to extend it that way) then you can append items (up to a limit) by changing the address of the start of the array rather than by moving all the elements.
You'll obviously need to keep track of how far back you've moved, so you can re-allocate if you do fall off the start of your existing allocation, but this should still be quicker than moving all of your array entries around.
You can also use a dedicated data structure rather than an array to speed up the forward transform.
A fast implementation can be built with a list of linked lists to avoid the array element moves altogether.
See http://code.google.com/p/kanzi/source/browse/java/src/kanzi/transform/MTFT.java
For inverse transform, it turns out that arrays are as fast as linked lists.

Is there a practical limit to the size of bit masks?

There's a common way to store multiple values in one variable, by using a bitmask. For example, if a user has read, write and execute privileges on an item, that can be converted to a single number by saying read = 4 (2^2), write = 2 (2^1), execute = 1 (2^0) and then add them together to get 7.
I use this technique in several web applications, where I'd usually store the variable into a field and give it a type of MEDIUMINT or whatever, depending on the number of different values.
What I'm interested in, is whether or not there is a practical limit to the number of values you can store like this? For example, if the number was over 64, you couldn't use (64 bit) integers any more. If this was the case, what would you use? How would it affect your program logic (ie: could you still use bitwise comparisons)?
I know that once you start getting really large sets of values, a different method would be the optimal solution, but I'm interested in the boundaries of this method.
Off the top of my head, I'd write a set_bit and get_bit function that could take an array of bytes and a bit offset in the array, and use some bit-twiddling to set/get the appropriate bit in the array. Something like this (in C, but hopefully you get the idea):
// sets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// result is 0 on success, non-zero on failure (offset out-of-bounds)
int set_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//set the right bit
bytes[offset >> 3] |= (1 << (offset & 0x7));
return 0; //success
}
//gets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// returns (-1) on error, 0 if bit is "off", positive number if "on"
int get_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//get the right bit
return (bytes[offset >> 3] & (1 << (offset & 0x7));
}
I've used bit masks in filesystem code where the bit mask is many times bigger than a machine word. think of it like an "array of booleans";
(journalling masks in flash memory if you want to know)
many compilers know how to do this for you. Adda bit of OO code to have types that operate senibly and then your code starts looking like it's intent, not some bit-banging.
My 2 cents.
With a 64-bit integer, you can store values up to 2^64-1, 64 is only 2^6. So yes, there is a limit, but if you need more than 64-its worth of flags, I'd be very interested to know what they were all doing :)
How many states so you need to potentially think about? If you have 64 potential states, the number of combinations they can exist in is the full size of a 64-bit integer.
If you need to worry about 128 flags, then a pair of bit vectors would suffice (2^64 * 2).
Addition: in Programming Pearls, there is an extended discussion of using a bit array of length 10^7, implemented in integers (for holding used 800 numbers) - it's very fast, and very appropriate for the task described in that chapter.
Some languages ( I believe perl does, not sure ) permit bitwise arithmetic on strings. Giving you a much greater effective range. ( (strlen * 8bit chars ) combinations )
However, I wouldn't use a single value for superimposition of more than one /type/ of data. The basic r/w/x triplet of 3-bit ints would probably be the upper "practical" limit, not for space efficiency reasons, but for practical development reasons.
( Php uses this system to control its error-messages, and I have already found that its a bit over-the-top when you have to define values where php's constants are not resident and you have to generate the integer by hand, and to be honest, if chmod didn't support the 'ugo+rwx' style syntax I'd never want to use it because i can never remember the magic numbers )
The instant you have to crack open a constants table to debug code you know you've gone too far.
Old thread, but it's worth mentioning that there are cases requiring bloated bit masks, e.g., molecular fingerprints, which are often generated as 1024-bit arrays which we have packed in 32 bigint fields (SQL Server not supporting UInt32). Bit wise operations work fine - until your table starts to grow and you realize the sluggishness of separate function calls. The binary data type would work, were it not for T-SQL's ban on bitwise operators having two binary operands.
For example .NET uses array of integers as an internal storage for their BitArray class.
Practically there's no other way around.
That being said, in SQL you will need more than one column (or use the BLOBS) to store all the states.
You tagged this question SQL, so I think you need to consult with the documentation for your database to find the size of an integer. Then subtract one bit for the sign, just to be safe.
Edit: Your comment says you're using MySQL. The documentation for MySQL 5.0 Numeric Types states that the maximum size of a NUMERIC is 64 or 65 digits. That's 212 bits for 64 digits.
Remember that your language of choice has to be able to work with those digits, so you may be limited to a 64-bit integer anyway.

How do you pre-size an array in Lua?

I've got a Lua program that seems to be slower than it ought to be. I suspect the issue is that I'm adding values to an associative array one at a time and the table has to allocate new memory each time.
There did seem to be a table.setn function, but it fails under Lua 5.1.3:
stdin:1: 'setn' is obsolete
stack traceback:
[C]: in function 'setn'
stdin:1: in main chunk
[C]: ?
I gather from the Google searching I've done that this function was depreciated in Lua 5.1, but I can't find what (if anything) replaced the functionality.
Do you know how to pre-size a table in Lua?
Alternatively, is there some other way to avoid memory allocation when you add an object to a table?
Let me focus more on your question:
adding values to an associative array
one at a time
Tables in Lua are associative, but using them in an array form (1..N) is optimized. They have double faces, internally.
So.. If you indeed are adding values associatively, follow the rules above.
If you are using indices 1..N, you can force a one-time size readjust by setting t[100000]= something. This should work until the limit of optimized array size, specified within Lua sources (2^26 = 67108864). After that, everything is associative.
p.s. The old 'setn' method handled the array part only, so it's no use for associative usage (ignore those answers).
p.p.s. Have you studied general tips for keeping Lua performance high? i.e. know table creation and rather reuse a table than create a new one, use of 'local print=print' and such to avoid global accesses.
static int new_sized_table( lua_State *L )
{
int asize = lua_tointeger( L, 1 );
int hsize = lua_tointeger( L, 2 );
lua_createtable( L, asize, hsize );
return( 1 );
}
...
lua_pushcfunction( L, new_sized_table );
lua_setglobal( L, "sized_table" );
Then, in Lua,
array = function(size) return sized_table(size,0) end
a = array(10)
As a quick hack to get this running you can add the C to lua.c.
I don't think you can - it's not an array, it's an associative array, like a perl hash or an awk array.
http://www.lua.org/manual/5.1/manual.html#2.5.5
I don't think you can preset its size meaningfully from the Lua side.
If you're allocating the array on the C side, though, the
void lua_createtable (lua_State *L, int narr, int nrec);
may be what you need.
Creates a new empty table and pushes
it onto the stack. The new table has
space pre-allocated for narr array
elements and nrec non-array elements.
This pre-allocation is useful when you
know exactly how many elements the
table will have. Otherwise you can use
the function lua_newtable.
There is still an internal luaL_setn and you can compile Lua so that
it is exposed as table.setn. But it looks like that it won't help
because the code doesn't seem to do any pre-extending.
(Also the setn as commented above the setn is related to the array part
of a Lua table, and you said that your are using the table as an associative
array)
The good part is that even if you add the elements one by one, Lua does not
increase the array that way. Instead it uses a more reasonable strategy. You still
get multiple allocations for a larger array but the performance is better than
getting a new allocation each time.
Although this doesn't answer your main question, it answers your second question:
Alternatively, is there some other way to avoid memory allocation when you add an object to a table?
If your running Lua in a custom application, as I can guess since your doing C coding, I suggest you replace the allocator with Loki's small value allocator, it reduced my memory allocations 100+ fold. This improved performance by avoiding round trips to the Kernel, and made me a much happier programmer :)
Anyways I tried other allocators, but they were more general, and provide guarantee's that don't benefit Lua applications (such as thread safety, and large object allocation, etc...), also writing your own small-object allocator can be a good week of programming and debugging to get just right, and after searching for an available solution Loki's allocator wasthe easiest and fastest I found for this problem.
If you declare your table in code with a specific amount of items, like so:
local tab = { 0, 1, 2, 3, 4, 5, ... , n }
then Lua will create the table with memory already allocated for at least n items.
However, Lua uses the 2x incremental memory allocation technique, so adding an item to a table should rarely force a reallocation.