Is it possible to reference a byte? - Kotlin

In C, it is possible to create an array and have a pointer pointing to a specific byte of that array, like this:
char array[] = "This is not a question.";
char *ptr = strchr(array, ' '); // points to the first space
This is extremely useful both for performance and for reducing memory usage when parsing; sometimes I create data structures that just point to different bytes of the same buffer. I wonder whether it is convenient and possible to do the same in Kotlin.

The equivalent in Java and Kotlin is simply to store an index into the array (or String).
Remember that the JVM has very powerful dynamic compilation and optimisation, so while in C that would be less efficient, on the JVM it generally won't be. (The difference generally wouldn't be significant in most applications, anyway.)
Also note that Kotlin uses Unicode, so a character is not the same as a byte. A Char is an unsigned 16-bit value (a UTF-16 code unit); characters outside the Basic Multilingual Plane are stored as a surrogate pair.
So the equivalent would be:
val string = "This is not a question."
val i = string.indexOf(' ') // = 4, index of the first space
or
val array = byteArrayOf(1, 2, 3, 4, 5)
val i2 = array.indexOf(3) // = 2, index of the first occurrence of 3

V8 elements kinds optimization

After reading this article: https://v8.dev/blog/elements-kinds, I'm wondering whether null and object are considered the same type by V8 in terms of internal optimizations.
e.g.
[{}, null, {}] vs [{}, {}, {}]
Yes. The only types considered for elements kinds are "small integer", "double", and "anything". null is not an integer or a double, so it's "anything".
Note that elements kinds are tracked per array, not per element. An array's elements kind is the most generic elements kind required for any of its elements:
[1, 2, 3] // "integer" elements (stored as integers internally)
[1, 2, 3.5] // "double" elements (stored as doubles: [1.0, 2.0, 3.5])
[1, 2, {}] // "anything" elements
[1, 2, null] // "anything" elements
[1, 2, "3"] // "anything" elements
The reason is that the benefit of tracking elements kinds in the first place is that some checks can be avoided. That has significant impact (in relative terms) for operations that are otherwise cheap. For example, if you wanted to sum up an array's elements, which are all integers:
for (let i = 0; i < array.length; i++) result += array[i];
Adding integers is really fast (one instruction plus an overflow check), so checking for every element "is this element an integer (so I can do an integer addition)?" (another instruction plus a conditional jump) adds a relatively large overhead. Knowing up front that every element in this array is an integer lets the engine skip those checks inside the loop.
If the array contained strings instead and you wanted to concatenate them all, string concatenation is a much slower operation (you have to allocate a new string object for the result, and then decide whether you want to copy the characters or just refer to the input strings), so the overhead added by an additional "is this element a string (so I can do a string concatenation)?" check is probably barely measurable. So tracking "strings" as an elements kind wouldn't provide much benefit, but would add complexity to the implementation and probably a small performance cost in some situations, so V8 doesn't do it. Similarly, if you knew up front "this array contains only null", there isn't anything obvious that you could speed up with that knowledge.
Also: as a JavaScript developer, don't worry about elements kinds. See that blog post as a (hopefully interesting) story about the lengths to which V8 goes to squeeze every last bit of performance out of your code; don't specifically contort your code to make better use of it (or spend time worrying about it). The difference is usually small, and in the cases where it does matter, it'll probably happen without you having to think about it.

Make Realloc behave like Calloc

How can I force Realloc to behave like calloc?
For instance:
I have the following structs:
typedef struct bucket0 {
    int hashID;
    Registry registry;
} Bucket;

typedef struct table0 {
    int tSize;
    int tElements;
    Bucket** content;
} Table;
and I have the following code in order to grow the table:
int grow(Table* table){
    Bucket** tempPtr;
    //grow will add 1 to the number of available buckets, and double it.
    table->tSize++;    //add 1
    table->tSize *= 2; //double it
    if(!table->content){
        //table will be generated for the first time
        table->content = (Bucket**)(calloc(sizeof(Bucket*), table->tSize));
    } else {
        //realloc content
        tempPtr = (Bucket**)realloc(table->content, sizeof(Bucket)*table->tSize);
        if(tempPtr){
            table->content = tempPtr;
            return 0;
        }else{
            return 1000; //table could not grow
        }
    }
}
When I execute it, the table grows properly, and MOST of the "Buckets" in it are initialized as a NULL ptr. However, not all of them are.
How can I make realloc behave like calloc, in the sense that when it creates new "buckets" they are initialized to NULL?
Strictly speaking, you shouldn't be relying on calloc (or memset, for that matter) to set pointers to null. C doesn't guarantee that null pointers are represented by all-zero bytes in memory.
Quoting from the comp.lang.C FAQ question 7.31:
Don't rely on calloc's zero fill too much (see below); usually, it's best to initialize data structures yourself, on a field-by-field basis, especially if there are pointer fields.
calloc's zero fill is all-bits-zero, and is therefore guaranteed to yield the value 0 for all integral types (including '\0' for character types). But it does not guarantee useful null pointer values (see section 5 of this list) or floating-point zero values.
It's safer to initialize the individual structure fields yourself. You can create a static const one as a template, with its content initialized to NULL, and then memcpy it to each element of your dynamically-allocated array.
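For the array of Bucket pointers in the question, that field-by-field initialization boils down to setting each newly added slot to NULL yourself after the realloc. A rough sketch (grow_nulled is a made-up helper name, and it assumes the question's Table/Bucket definitions and growth rule):

#include <stdlib.h>

/* Sketch only: grow_nulled() is a hypothetical helper, not the question's code.
 * It grows the table, then explicitly sets each newly added slot to NULL
 * instead of relying on all-bits-zero from calloc/memset. */
static int grow_nulled(Table* table)
{
    size_t oldSize = table->content ? (size_t)table->tSize : 0;
    size_t newSize = ((size_t)table->tSize + 1) * 2;   /* add 1, then double */

    Bucket** tmp = realloc(table->content, newSize * sizeof *tmp);
    if (!tmp)
        return 1000;   /* table could not grow */

    for (size_t i = oldSize; i < newSize; i++)
        tmp[i] = NULL; /* the "field-by-field" init for an array of pointers */

    table->content = tmp;
    table->tSize = (int)newSize;
    return 0;
}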

Dealing with Int64 value with Booksleeve

I have a question about Marc Gravell's Booksleeve library.
I tried to understand how Booksleeve deals with Int64 values (I actually have billions of long values in Redis).
I used reflection to understand the overloads that set a long value.
// BookSleeve.RedisMessage
protected static void WriteUnified(Stream stream, long value)
{
    if (value >= 0L && value <= 99L)
    {
        int i = (int)value;
        if (i <= 9)
        {
            stream.Write(RedisMessage.oneByteIntegerPrefix, 0, RedisMessage.oneByteIntegerPrefix.Length);
            stream.WriteByte((byte)(48 + i));
        }
        else
        {
            stream.Write(RedisMessage.twoByteIntegerPrefix, 0, RedisMessage.twoByteIntegerPrefix.Length);
            stream.WriteByte((byte)(48 + i / 10));
            stream.WriteByte((byte)(48 + i % 10));
        }
    }
    else
    {
        byte[] bytes = Encoding.ASCII.GetBytes(value.ToString());
        stream.WriteByte(36);
        RedisMessage.WriteRaw(stream, (long)bytes.Length);
        stream.Write(bytes, 0, bytes.Length);
    }
    stream.Write(RedisMessage.Crlf, 0, 2);
}
I don't understand why, for an Int64 with more than two digits, the long is encoded as ASCII.
Why not use byte[]? I know that I can use the byte[] overloads to do this, but I just want to understand this implementation so I can optimize mine. There may be a relationship with how Redis stores values.
Thanks in advance, Marc :)
P.S.: I'm still very enthusiastic about your next major version, so that I can use long keys instead of string keys.
It writes it in ASCII because that is what the redis protocol demands.
If you look carefully, it is always encoded as ASCII - but for the most common cases (0-9, 10-99) I've special-cased it, as these are very simple results:
x => $1\r\nX\r\n
xy => $2\r\nXY\r\n
where x and y are the first two digits of a number in the range 0-99, and X and Y are those digits (as numbers) offset by 48 ('0') - so decimal 17 becomes the byte sequence (in hex):
24-32-0D-0A-31-37-0D-0A
Of course, that could also be achieved simply by writing each digit sequentially, offsetting the digit value by 48 ('0'), and handling the negative sign - I guess the answer there is simply "because I coded it the simple but obviously correct way". Consider the value -123, which is encoded as $4\r\n-123\r\n (hey, don't look at me - I didn't design the protocol). It is slightly awkward because it needs to calculate the buffer length first, then write that buffer length, then write the value - remembering to write in the order 100s, 10s, 1s (which is much harder than writing the other way around).
Perfectly willing to revisit it - simply: it works.
Of course, it becomes trivial if you have a scratch buffer available - you just write it in the simple order, then reverse the portion of the scratch buffer. I'll check to see if one is available (and if not, it wouldn't be unreasonable to add one).
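To illustrate that scratch-buffer idea with a language-agnostic sketch (plain C here, not Booksleeve's actual C# code): produce the digits least-significant-first into a small buffer, then emit the $<len>\r\n<digits>\r\n frame in the right order:

#include <stdio.h>

/* Hypothetical sketch: format one RESP bulk string for a signed 64-bit value.
 * Digits are produced least-significant-first into a scratch buffer,
 * then written out in the correct order. */
static void write_bulk_int64(FILE *stream, long long value)
{
    char scratch[24];                /* plenty for -9223372036854775808 */
    int len = 0;
    unsigned long long magnitude = (value < 0)
        ? (unsigned long long)(-(value + 1)) + 1   /* avoids overflow on LLONG_MIN */
        : (unsigned long long)value;

    do {                             /* digits, least significant first */
        scratch[len++] = (char)('0' + (magnitude % 10));
        magnitude /= 10;
    } while (magnitude);
    if (value < 0)
        scratch[len++] = '-';

    fprintf(stream, "$%d\r\n", len); /* length header */
    for (int i = len - 1; i >= 0; i--)
        fputc(scratch[i], stream);   /* "reverse the scratch buffer" on output */
    fputs("\r\n", stream);
}

For 17 this produces $2\r\n17\r\n (the 24-32-0D-0A-31-37-0D-0A byte sequence above), and for -123 it produces $4\r\n-123\r\n.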
I should also clarify: there is also the integer type, which would encode -123 as :-123\r\n - however, from memory there are a lot of places this simply does not work.

Trie Implementation Question

I'm implementing a trie for predictive text entry in VB.NET - basically autocompletion as far as the use of the trie is concerned. I've made my trie a recursive data structure based on the generic dictionary class.
It's basically:
Class WordTree : Inherits Dictionary(Of Char, WordTree)
Each letter in a word (all upper-cased) is used as a key to a new WordTree. A null character on a leaf indicates the termination of a word. To find the words starting with a prefix, I walk the trie as far as the prefix goes, then collect all the children's words.
My question is basically on the implementation of the trie itself. I'm using the dictionary hash function to branch my tree. I could use a list and do a linear search over the list, or do something else. What's the smooth move here? Is this a reasonable way to do my branching?
Thanks.
Update:
Just to clarify, I'm basically asking if the dictionary branching approach is obviously inferior to some other alternative. The application in which I'm using this data structure only uses upper case letters, so maybe the array approach is the best. I might use the same data structure for a more complex typeahead situation in the future (more characters). In that case, it sounds like the dictionary is the right approach - up to the point where I need to use something more complex in general.
If it's just the 26 letters, use a 26-entry array. Then lookup is by index. It probably uses less space than the Dictionary if the bucket list is longer than 26.
If you are worried about space, you can use bitmap compression on the valid byte transitions, assuming the 26-character limit.
#include <cctype>
#include <vector>

class State // could be a struct or whatever
{
    int valid; // can handle 32 transitions -- each bit set is valid
    std::vector<State> transitions; // one entry per set bit in `valid`

    State getNextState( int ch )
    {
        int index;
        int mask = ( 1 << ( toupper( ch ) - 'A' )) - 1;
        int bitsToCount = valid & mask; // bits set below this char's bit
        for( index = 0; bitsToCount ; bitsToCount >>= 1 )
        {
            index += bitsToCount & 1;
        }
        return transitions.at( index );
    }
};
There are other ways to do the bit counting (see the popcount sketch after the next snippet). Here, the index into the vector is the number of set bits below the character's bit in the valid bitset. The other alternative is a directly indexed array of states:
class State
{
    State* transitions[ 26 ]; // use the letter (A-Z) as the index

    State* getNextState( int ch )
    {
        return transitions[ toupper( ch ) - 'A' ];
    }
};
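As for the "other ways to do the bit counting": on GCC or Clang, the shift-and-add loop above can be replaced with a popcount intrinsic. A small sketch, assuming __builtin_popcount is available:

#include <ctype.h>

/* Index of a transition = number of set bits in `valid` below
 * the bit for this character (GCC/Clang builtin assumed). */
static int transition_index(unsigned valid, int ch)
{
    unsigned mask = (1u << (toupper(ch) - 'A')) - 1u;
    return __builtin_popcount(valid & mask);
}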
A good data structure that's efficient in space and potentially gives sub-linear prefix lookups is the ternary search tree. Peter Kankowski has a fantastic article about it. He uses C, but it's straightforward code once you understand the data structure. As he mentioned, this is the structure ispell uses for spelling correction.
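To give a feel for the structure, here is a minimal node definition and lookup in C - my own sketch in the spirit of Bentley and Sedgewick's version, not code from that article:

/* Minimal ternary search tree node and lookup (sketch only). */
typedef struct TstNode {
    char splitChar;               /* character stored at this node */
    struct TstNode *lo, *eq, *hi; /* less-than / equal / greater-than children */
} TstNode;

/* Returns non-zero if the NUL-terminated `word` is in the tree. */
static int tst_contains(const TstNode *node, const char *word)
{
    while (node) {
        if (*word < node->splitChar)
            node = node->lo;
        else if (*word > node->splitChar)
            node = node->hi;
        else {
            if (*word == '\0')
                return 1;         /* the terminating NUL marks a stored word */
            ++word;
            node = node->eq;
        }
    }
    return 0;
}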
I have done this (a trie implementation) in C with 8-bit chars, and simply used the array version (as alluded to by the "26 chars" answer).
HOWEVER, I am guessing that you want full Unicode support (since a .NET char is Unicode, among other reasons). Assuming you need Unicode support, the hash/map/dictionary lookup is probably your best bet, as a 64K-entry array in each node won't really work very well.
About the only hack I could think of here is to store entire strings (suffixes or possibly "in-fixes") on branches that do not yet split, depending on how sparse the tree, er, trie, is. That adds a lot of logic to detect the multi-char strings, though, and to split them up when an alternate path is introduced.
What is the read vs update pattern?
---- update July 2013 ----
If .NET strings have a function like Java's to get the bytes of a string (as UTF-8), then having an array in each node to represent the current position's byte value is probably a good way to go. You could even make the arrays variable size, with first/last bounds indicators in each node, since MANY nodes will have only lower-case ASCII letters anyway, or only upper-case letters or the digits 0-9 in some cases.
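Roughly what such a bounded, variable-size child array might look like (a sketch in C; the field names are made up):

/* Sketch: a node whose child array covers only the byte range actually
 * used at this position (first..last inclusive), rather than all 256. */
typedef struct Node {
    unsigned char first, last;  /* bounds of the populated byte range */
    struct Node **children;     /* (last - first + 1) entries; child for byte b is children[b - first] */
    int isWord;                 /* non-zero if a word ends at this node */
} Node;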
I've found burst tries to be very space-efficient. I wrote my own burst trie in Scala that also reuses some ideas from GWT's trie implementation. I used it in Stripe's Capture the Flag contest on a problem that was multi-node with a small amount of RAM.

Is there a practical limit to the size of bit masks?

There's a common way to store multiple values in one variable, by using a bitmask. For example, if a user has read, write and execute privileges on an item, that can be converted to a single number by saying read = 4 (2^2), write = 2 (2^1), execute = 1 (2^0) and then add them together to get 7.
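In C, for example, that might look like this (the macro names are just for illustration):

#include <stdio.h>

/* Illustration: each permission gets its own bit. */
#define EXECUTE 1   /* 2^0 */
#define WRITE   2   /* 2^1 */
#define READ    4   /* 2^2 */

int main(void)
{
    int perms = READ | WRITE | EXECUTE;   /* 4 + 2 + 1 = 7 */
    if (perms & WRITE)
        printf("user can write\n");
    perms &= ~EXECUTE;                    /* revoke execute -> 6 */
    printf("perms = %d\n", perms);
    return 0;
}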
I use this technique in several web applications, where I'd usually store the variable into a field and give it a type of MEDIUMINT or whatever, depending on the number of different values.
What I'm interested in is whether or not there is a practical limit to the number of values you can store like this. For example, if the number of flags grew beyond 64, you couldn't use a (64-bit) integer any more. If that were the case, what would you use? How would it affect your program logic (i.e., could you still use bitwise comparisons)?
I know that once you start getting really large sets of values, a different method would be the optimal solution, but I'm interested in the boundaries of this method.
Off the top of my head, I'd write a set_bit and get_bit function that could take an array of bytes and a bit offset in the array, and use some bit-twiddling to set/get the appropriate bit in the array. Something like this (in C, but hopefully you get the idea):
// sets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// result is 0 on success, non-zero on failure (offset out-of-bounds)
int set_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
    // make sure offset is valid (offset is unsigned, so only the upper bound needs checking)
    if(offset >= (num_bytes << 3)) { return -1; }

    // set the right bit
    bytes[offset >> 3] |= (1 << (offset & 0x7));
    return 0; // success
}
//gets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// returns (-1) on error, 0 if bit is "off", positive number if "on"
int get_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
    // make sure offset is valid (offset is unsigned, so only the upper bound needs checking)
    if(offset >= (num_bytes << 3)) { return -1; }

    // get the right bit
    return (bytes[offset >> 3] & (1 << (offset & 0x7)));
}
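Hypothetical usage of those two helpers, with a mask far wider than any machine word:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char flags[32];                        /* 32 bytes = 256 individual flags */
    memset(flags, 0, sizeof flags);

    set_bit(flags, sizeof flags, 200);     /* well past the 64-bit limit */
    printf("bit 200: %d\n", get_bit(flags, sizeof flags, 200) ? 1 : 0);
    printf("bit 201: %d\n", get_bit(flags, sizeof flags, 201) ? 1 : 0);
    return 0;
}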
I've used bit masks in filesystem code where the mask is many times bigger than a machine word. Think of it like an "array of booleans".
(Journalling masks in flash memory, if you want to know.)
Many compilers know how to do this for you. Add a bit of OO code to get types that operate sensibly, and then your code starts expressing its intent rather than bit-banging.
My 2 cents.
With a 64-bit integer, you can store values up to 2^64-1, and 64 is only 2^6. So yes, there is a limit, but if you need more than 64 bits' worth of flags, I'd be very interested to know what they were all doing :)
How many states do you potentially need to think about? If you have 64 potential states, the number of combinations they can exist in is the full range of a 64-bit integer.
If you need to worry about 128 flags, then a pair of bit vectors would suffice (2 x 64 bits).
Addition: in Programming Pearls, there is an extended discussion of using a bit array of length 10^7, implemented in integers (for holding the set of in-use 800 numbers) - it's very fast, and very appropriate for the task described in that chapter.
Some languages (Perl, I believe, though I'm not sure) permit bitwise arithmetic on strings, giving you a much greater effective range: (string length x 8) bits' worth of combinations.
However, I wouldn't use a single value for superimposing more than one type of data. The basic r/w/x triplet of 3-bit ints is probably the upper "practical" limit, not for space-efficiency reasons, but for practical development reasons.
(PHP uses this system to control its error messages, and I have already found it a bit over-the-top when you have to define values where PHP's constants are not available and you have to generate the integer by hand; to be honest, if chmod didn't support the 'ugo+rwx' style syntax, I'd never want to use it, because I can never remember the magic numbers.)
The instant you have to crack open a constants table to debug code, you know you've gone too far.
Old thread, but it's worth mentioning that there are cases requiring bloated bit masks, e.g., molecular fingerprints, which are often generated as 1024-bit arrays that we have packed into 32 bigint fields (SQL Server not supporting UInt32). Bitwise operations work fine - until your table starts to grow and you notice the sluggishness of separate function calls. The binary data type would work, were it not for T-SQL's ban on bitwise operators having two binary operands.
For example, .NET uses an array of integers as the internal storage for its BitArray class.
Practically, there's no other way around it.
That being said, in SQL you will need more than one column (or use a BLOB) to store all the states.
You tagged this question SQL, so I think you need to consult the documentation for your database to find the size of an integer. Then subtract one bit for the sign, just to be safe.
Edit: Your comment says you're using MySQL. The documentation for MySQL 5.0 Numeric Types states that the maximum size of a NUMERIC is 64 or 65 digits. That's about 212 bits for 64 digits (64 x log2 10 is roughly 212.6).
Remember that your language of choice has to be able to work with those digits, so you may be limited to a 64-bit integer anyway.