I am looking for an efficient way to implement a key enumerator for leveldb that iterates by key prefix. The keys are byte arrays (and the db uses the default byte-array comparator, so all keys with a given prefix are stored/retrieved sequentially), and I would like my iterator to take a key prefix and return only the data whose keys have that prefix.
Do I have to use or inherit the default db iterator, seek to the first key in the range (of course I need to know what it is), and then verify and return every slice that starts with the prefix (by overriding MoveNext or something)? Or is there a more efficient way of implementing this?
Let me know if anybody has solved this already and can share the code or the general idea. I am trying this from C++/CLI, but an implementation in any language would help.
The comparator is used to determine whether keys are different, so overloading it wouldn't help: as you're scanning the database you have to be able to compare the full keys, not just the prefixes. Overriding the iterator is not necessary either: the keys are ordered in leveldb, so if you encounter a key with a different prefix, you know you are already out of the range. You can just use the iterator as you normally would, and as long as your keys are evaluated properly you should get the right results:
void ScanRecordRange(const leveldb::Slice& startSlice, const leveldb::Slice& endSlice)
{
    // Get a database iterator
    std::shared_ptr<leveldb::Iterator> dbIter(_database->NewIterator(leveldb::ReadOptions()));
    // Possible optimization suggested by Google engineers
    // for critical loops: reduces memory thrash.
    for (dbIter->Seek(startSlice);
         dbIter->Valid() && _options.comparator->Compare(dbIter->key(), endSlice) <= 0;
         dbIter->Next())
    {
        // Read the record
        if (!dbIter->value().empty())
        {
            leveldb::Slice keySlice(dbIter->key());
            // leveldb iterators expose value(), not data()
            leveldb::Slice dataSlice(dbIter->value());
            // TODO: process the key/data
        }
    }
}
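If you only know the prefix rather than an explicit end key, you can seek to the prefix itself and stop at the first key that no longer matches; leveldb::Slice provides starts_with for exactly this check. A minimal sketch under the same assumptions as the code above (_database member, default comparator):
void ScanRecordsWithPrefix(const leveldb::Slice& prefix)
{
    std::shared_ptr<leveldb::Iterator> dbIter(_database->NewIterator(leveldb::ReadOptions()));
    // Keys sharing a prefix are contiguous under the default comparator,
    // so iteration can stop at the first non-matching key.
    for (dbIter->Seek(prefix);
         dbIter->Valid() && dbIter->key().starts_with(prefix);
         dbIter->Next())
    {
        leveldb::Slice keySlice(dbIter->key());
        leveldb::Slice dataSlice(dbIter->value());
        // TODO: process the key/data
    }
}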
I have a prefix iterator in my LevelDB wrapper. It is used via the range returned by the startsWith method:
int count = 0;
for (auto& en : ldb.startsWith(std::string("prefix"))) { ++count; }
I need a key-value store (e.g. a Map or a custom class) which only allows keys from a previously defined set, e.g. only the keys ["apple", "orange"]. Is there anything like this built into Kotlin? Otherwise, how could one do this? Maybe like the following code?
class KeyValueStore(val allowedKeys: List<String>) {
    private val map = mutableMapOf<String, Any>()

    fun add(key: String, value: Any) {
        if (!allowedKeys.contains(key))
            throw Exception("key $key not allowed")
        map[key] = value
    }
    // code for reading keys, like get(key: String) and getKeys()
}
The best solution for your problem would be to use an enum, which provides exactly the functionality that you're looking for. According to the docs, you can declare an enum like so:
enum class AllowedKeys {
    APPLE, ORANGE
}
Then you can declare the keys using your enum.
Since the keys are known at compile time, you could simply use an enum instead of String as the keys of a regular Map:
enum class Fruit {
    APPLE, ORANGE
}
val fruitMap = mutableMapOf<Fruit, String>()
Instead of Any, use whatever type you need for your values; otherwise the map is not convenient to use.
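Building on the fruitMap above, the compiler then guarantees that only the enum constants can be used as keys (the String values here are just for illustration):
fruitMap[Fruit.APPLE] = "red"      // compiles: the key is an enum constant
// fruitMap["banana"] = "yellow"   // does not compile: a String is not a Fruit
println(fruitMap[Fruit.APPLE])     // prints "red"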
If the types of the values depend on the key (a heterogeneous map), then I would first seriously consider using a regular class with your "keys" as properties. You can access the list of properties via reflection if necessary.
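A minimal sketch of that idea (the class and property names here are invented):
class FruitBasket {
    var apple: Int = 0
    var orange: Int = 0
}

// With kotlin-reflect on the classpath, the "keys" can be listed at runtime:
// FruitBasket::class.memberProperties.map { it.name }  // [apple, orange]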
Another option is to define a generic key class, so the get function returns a type that depends on the type parameter of the key (see how CoroutineContext works in Kotlin coroutines).
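A sketch of that generic-key approach (all names invented; CoroutineContext.Key in the standard library follows the same pattern):
class Key<T>(val name: String)

class TypedMap {
    private val map = mutableMapOf<Key<*>, Any?>()

    operator fun <T> set(key: Key<T>, value: T) {
        map[key] = value
    }

    @Suppress("UNCHECKED_CAST")
    operator fun <T> get(key: Key<T>): T? = map[key] as T?
}

val COUNT = Key<Int>("count")
val LABEL = Key<String>("label")

fun demo() {
    val store = TypedMap()
    store[COUNT] = 3           // the value type is tied to the key type
    store[LABEL] = "fruit"
    val n: Int? = store[COUNT] // typed read, no cast at the call site
}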
For reference, it's possible to do this if you don't know the set of keys until runtime. But it involves writing quite a bit of code; I don't think there's an easy way.
(I wrote my own Map class for this. We needed a massive number of these maps in memory, each with the same 2 or 3 keys, so I ended up writing a Map implementation pretty much from scratch: it used a passed-in array of keys, so all maps could share the same key array, and a private array of values of the same size. The code was quite long but pretty simple. Most operations meant scanning the list of keys to find the right index, so the theoretical performance was dire; but since the list was always extremely short, it performed really well in practice. And it saved GBs of memory compared to using HashMap. I don't think I have the code any more, and it'd be far too long to post here, but I hope the idea is interesting.)
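Not the original code, but a rough sketch of the shared-key-array idea (names invented, reduced to get/set):
class SharedKeyMap<V>(private val keys: Array<String>) {
    // The keys array is shared between all maps; only the values are per-instance.
    private val values = arrayOfNulls<Any>(keys.size)

    @Suppress("UNCHECKED_CAST")
    operator fun get(key: String): V? {
        val i = keys.indexOf(key) // linear scan: fine when there are only 2-3 keys
        return if (i >= 0) values[i] as V? else null
    }

    operator fun set(key: String, value: V) {
        val i = keys.indexOf(key)
        require(i >= 0) { "key $key not allowed" }
        values[i] = value
    }
}
Thousands of instances can then share a single arrayOf("id", "name"), each allocating only its own values array.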
I have a list of Longs in Kotlin and I want to turn them into strings for UI purposes, perhaps with a prefix or otherwise altered. For example, adding "$" at the front or the word "dollars" at the end.
I know I can simply iterate over them all like:
val myNewStrings = ArrayList<String>()
longValues.forEach { myNewStrings.add("$it dollars") }
I guess I'm just getting nitpicky, but I feel like there is a way to inline this or change the original long list without creating a new string list?
EDIT/UPDATE: Sorry for the initial confusion of my terms. I meant writing the code in one line, not inlining a function. I knew it was possible but couldn't remember Kotlin's map function at the time of writing. Thank you all for the useful information though; I learned a lot.
You are looking for map; it takes a lambda and creates a list based on the lambda's results:
val myNewStrings = longValues.map { "$it dollars" }
map is an extension function with two generic types: the first is the element type being iterated over, and the second is the type being returned. The lambda passed as an argument is actually transform: (T) -> R, so it has to be a function that receives a T (the source type) and returns an R (the result). The lambda doesn't need an explicit return because its last expression is the return value by default.
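For reference, map's declaration in the Kotlin standard library is:
public inline fun <T, R> Iterable<T>.map(transform: (T) -> R): List<R>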
You can use the map function on List. It creates a new list in which the function has been applied to every element.
Like this:
val myNewStrings = longValues.map { "$it dollars" }
In Kotlin, inline is a keyword that refers to the compiler substituting a function call with the contents of the function directly. I don't think that's what you're asking about here. Maybe you meant you want to write the code on one line.
You might want to read over the Collections documentation, specifically the Mapping section.
The mapping transformation creates a collection from the results of a function on the elements of another collection. The basic mapping function is map(). It applies the given lambda function to each subsequent element and returns the list of the lambda results. The order of results is the same as the original order of elements.
val numbers = setOf(1, 2, 3)
println(numbers.map { it * 3 }) // prints [3, 6, 9]
For your example, this would look as the others said:
val myNewStrings = longValues.map { "$it dollars" }
I feel like there is a way to inline this or change the original long list without creating a new string list?
No. You have Longs, and you want Strings. The only way is to create new Strings. You could avoid creating a new list by making the original a MutableList<Any> and editing it in place, but that would be overkill and make the code overly complex, harder to follow, and more error-prone.
Like people have said, unless there's a performance issue here (like a billion strings where you're only using a handful) just creating the list you want is probably the way to go. You have a few options though!
Sequences are lazily evaluated: when there's a long chain of operations, they run the whole chain on each item in turn instead of creating an intermediate full list for every operation in the chain. That can mean less memory use, and more efficiency if you only need certain items or want to stop early. They have overhead though, so you need to be sure they're worth it; for your use-case (turning one list into another) there are no intermediate lists to avoid, and I'm guessing you're using the whole thing. Probably better to just make the String list, once, and then use it.
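If you did want the lazy version, it looks like this (same result, just evaluated element by element):
val myNewStrings = longValues.asSequence()
    .map { "$it dollars" }
    .toList()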
Your other option is to make a function that takes a Long and makes a String (whatever function you're passing to map, basically) and use it when you need it. If you have a very large number of Longs and you really don't want every possible String version in memory, just generate them whenever you display them. You could make it an extension function or property if you like, so you can just go:
fun Long.display() = "$this dollars"
val Long.dollaridoos: String get() = "$this dollars"
print(number.display())
print(number.dollaridoos)
or make a wrapper object holding your list and giving access to a stringified version of the values. Whatever's your jam.
Also, the map approach is more efficient than creating an ArrayList and adding to it, because map can allocate a list with the correct capacity from the get-go. Arbitrarily adding to an unsized list keeps growing it: when it gets too big, everything is copied to another (larger) array, until that one fills up, and then it happens again...
I want to reuse the Term object instead of creating a new one every time I call this method:
public long getDF(String term) throws Exception {
    return indexReader.docFreq(new Term("content", term));
}
I read in the documentation that I can use this constructor of Term to reuse it:
public Term(String fld)
Constructs a Term with the given field and empty text. This serves two purposes: 1) reuse of a Term with the same field. 2) pattern for a query.
However, I don't know what the next step is, as there are no setters in the Term documentation, nor a reset() method.
Any hint on how to achieve this?
Constructing Terms is cheap. You probably shouldn't worry too much about trying to reuse them. If you are seeing real performance issues, you should run a profiler. I'd guess constructing terms is not the real problem, and attempting to reuse clauses like this will just complicate things for no discernible benefit.
That said, you could reuse it by getting the Term's BytesRef from Term.bytes(), and directly modifying the underlying byte array.
String text = "text";
Term term = new Term("field");
BytesRef bytes = term.bytes();
bytes.bytes = new byte[UnicodeUtil.maxUTF8Length(text.length())];
bytes.length = UnicodeUtil.UTF16toUTF8(text, 0, text.length(), bytes.bytes);
Be careful that you aren't changing the value of a Term that is still in use. For instance, attempting to add two clauses to a BooleanQuery like this will, of course, not work.
When developing a component that uses Redis, I've found it a good pattern to prefix all keys used by that component so that it does not interfere with other components.
Examples:
A component managing users might use keys prefixed by user: and a component managing a log might use keys prefixed by log:.
In a multi-tenancy system I want each customer to use a separate key space in Redis to ensure that their data do not interfere. The prefix would then be something like customer:<id>: for all keys related to a specific customer.
Using Redis is still new stuff for me. My first idea for this partitioning pattern was to use separate database identifiers for each partition. However, that seems to be a bad idea because the number of databases is limited and it seems to be a feature that is about to be deprecated.
An alternative would be to give each component an IDatabase instance and a RedisKey that it should use to prefix all keys. (I'm using StackExchange.Redis.)
I've been looking for an IDatabase wrapper that automatically prefixes all keys, so that components can use the IDatabase interface as-is without having to worry about their keyspace. I didn't find anything, though.
So my question is: What is a recommended way to work with partitioned key spaces on top of StackExchange Redis?
I'm now thinking about implementing my own IDatabase wrapper that would prefix all keys. I think most methods would just forward their calls to the inner IDatabase instance. However, some methods would require a bit more work: for example SORT and RANDOMKEY.
I've now created an IDatabase wrapper that provides key space partitioning.
The wrapper is created via an extension method on IDatabase:
ConnectionMultiplexer multiplexer = ConnectionMultiplexer.Connect("localhost");
IDatabase fullDatabase = multiplexer.GetDatabase();
IDatabase partitioned = fullDatabase.GetKeyspacePartition("my-partition");
Almost all of the methods in the partitioned wrapper have the same structure:
public bool SetAdd(RedisKey key, RedisValue value, CommandFlags flags = CommandFlags.None)
{
    return this.Inner.SetAdd(this.ToInner(key), value, flags);
}
They simply forward the invocation to the inner database and prepend the key space prefix to any RedisKey arguments before passing them on.
The CreateBatch and CreateTransaction methods simply create wrappers for those interfaces, but with the same base wrapper class (as most of the methods to wrap are defined by IDatabaseAsync).
The KeyRandomAsync and KeyRandom methods are not supported; invocations will throw a NotSupportedException. This is not a concern for me, and to quote @Marc Gravell:
I can't think of any sane way of achieving that, but I suspect NotSupportedException("RANDOMKEY is not supported when a key-prefix is specified") is entirely reasonable (this isn't a commonly used command anyway)
I have not yet implemented ScriptEvaluate and ScriptEvaluateAsync because it is unclear to me how I should handle the RedisResult return value. The input parameters to these methods accept RedisKey which should be prefixed, but the script itself could return keys and in that case I think it would make (most) sense to unprefix those keys. For the time being, those methods will throw a NotImplementedException...
The sort methods (Sort, SortAsync, SortAndStore and SortAndStoreAsync) have special handling for the by and get parameters. These are prefixed as normal unless they have one of the special values: nosort for by and # for get.
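That special-casing looks roughly like the following sketch (the helper names and the keyPrefix field are invented here, not the actual wrapper code):
// Assumes the wrapper stores its prefix as a string field named keyPrefix.
internal RedisValue ToInnerSortBy(RedisValue by)
{
    // "nosort" is a directive rather than a key pattern: leave it unprefixed
    return by == "nosort" ? by : (RedisValue)(keyPrefix + (string)by);
}

internal RedisValue ToInnerSortGet(RedisValue get)
{
    // "#" means "the element itself" rather than a key pattern: leave it alone
    return get == "#" ? get : (RedisValue)(keyPrefix + (string)get);
}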
Finally, to allow prefixing ITransaction.AddCondition I had to use a bit of reflection:
internal static class ConditionHelper
{
    public static Condition Rewrite(this Condition outer, Func<RedisKey, RedisKey> rewriteFunc)
    {
        ThrowIf.ArgNull(outer, "outer");
        ThrowIf.ArgNull(rewriteFunc, "rewriteFunc");

        Type conditionType = outer.GetType();
        object inner = FormatterServices.GetUninitializedObject(conditionType);

        foreach (FieldInfo field in conditionType.GetFields(BindingFlags.NonPublic | BindingFlags.Instance))
        {
            if (field.FieldType == typeof(RedisKey))
            {
                field.SetValue(inner, rewriteFunc((RedisKey)field.GetValue(outer)));
            }
            else
            {
                field.SetValue(inner, field.GetValue(outer));
            }
        }

        return (Condition)inner;
    }
}
This helper is used by the wrapper like this:
internal Condition ToInner(Condition outer)
{
    if (outer == null)
    {
        return outer;
    }
    else
    {
        return outer.Rewrite(this.ToInner);
    }
}
There are several other ToInner methods for different kinds of parameters that contain RedisKey, but they all more or less end up calling:
internal RedisKey ToInner(RedisKey outer)
{
    return this.Prefix + outer;
}
I have now created a pull request for this:
https://github.com/StackExchange/StackExchange.Redis/pull/92
The extension method is now called WithKeyPrefix, and the reflection hack for rewriting conditions is no longer needed as the new code has access to the internals of the Condition classes.
Intriguing suggestion. Note that redis already offers a simple isolation mechanism by way of database numbers, for example:
// note: default database is 0
var logdb = muxer.GetDatabase(1);
var userdb = muxer.GetDatabase(2);
StackExchange.Redis will handle all the work of issuing commands to the correct database, i.e. commands issued via logdb will be issued against database 1.
Advantages:
inbuilt
works with all clients
provides complete keyspace isolation
doesn't require additional per-key space for the prefixes
works with KEYS, SCAN, FLUSHDB, RANDOMKEY, SORT, etc
you get high-level per-db keyspace metrics via INFO
Disadvantages:
not supported on redis-cluster
not supported via intermediaries like twemproxy
Note:
the number of databases is a configuration option; IIRC it defaults to 16 (numbers 0-15), but can be tweaked in your configuration file via:
databases 400 # moar databases!!!
This is actually how we (Stack Overflow) use redis with multi-tenancy; database 0 is "global", 1 is "stackoverflow", etc. It should also be clear that if required, it is then a fairly simple thing to migrate an entire database to a different node using SCAN and MIGRATE (or more likely: SCAN, DUMP, PTTL and RESTORE - to avoid blocking).
Since database partitioning is not supported in redis-cluster, there may be a valid scenario here, but it should also be noted that redis nodes are easy to spin up, so another valid option is simply: use different redis groups for each (different port numbers, etc) - which would also have the advantage of allowing genuine concurrency between nodes (CPU isolation).
However, what you propose is not unreasonable; there is actually "prior art" here, again largely linked to how we (Stack Overflow) use redis: while databases work fine for isolating keys, redis currently provides no isolation for channels (pub/sub). Because of this, StackExchange.Redis actually includes a ChannelPrefix option on ConfigurationOptions that allows you to specify a prefix that is automatically added during PUBLISH and removed when receiving notifications. So if your ChannelPrefix is foo: and you publish an event bar, the actual event is published to the channel foo:bar; likewise, any callback you have only sees bar. It could be that this is something that is viable for databases too, but to emphasize: at the moment this configuration option is at the multiplexer level, not the individual ISubscriber. To be comparable to the scenario you present, it would need to be at the IDatabase level.
Possible, but a decent amount of work. If possible, I would recommend investigating the option of simply using database numbers...
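For reference, the multiplexer-level channel prefix mentioned above is configured like this (endpoint and prefix values are illustrative):
var options = new ConfigurationOptions
{
    EndPoints = { "localhost:6379" },
    ChannelPrefix = "foo:" // prepended on PUBLISH, stripped from incoming notifications
};
var muxer = ConnectionMultiplexer.Connect(options);
var sub = muxer.GetSubscriber();
sub.Publish("bar", "hello"); // actually published to the "foo:bar" channel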
I have a slightly peculiar program which deals with cases very similar to this (in C#-like pseudo code):
class CDataSet
{
    int m_nID;
    string m_sTag;
    float m_fValue;

    void PrintData()
    {
        // Blah Blah
    }
}

class CDataItem
{
    int m_nID;
    string m_sTag;
    CDataSet m_refData;
    CDataSet m_refParent;

    void Print()
    {
        if (null == m_refData)
        {
            m_refParent.PrintData();
        }
        else
        {
            m_refData.PrintData();
        }
    }
}
Members m_refData and m_refParent are initialized to null and used as follows:
m_refData -> Used when a new data set is added
m_refParent -> Used to point to an existing data set.
A new data set is added only if the field m_nID doesn't match an existing one.
Currently this code manages around 500 objects with around 21 fields per object, and the format of choice so far is XML, which at 100k+ lines and 5MB+ is very unwieldy.
I am planning to modify the whole shebang to use ProtoBuf, but I'm currently not sure how to handle the reference semantics. Any thoughts would be much appreciated.
Out of the box, protocol buffers has no reference semantics. You would need to cross-reference the objects manually, typically using an artificial key. Essentially, on the DTO layer you would add a key to CDataSet (that you simply invent, perhaps just an increasing integer), store the key instead of the item in m_refData/m_refParent, and run a fixup manually during serialization/deserialization. You could also just store the index into the set of CDataSet, but that may make insertion etc. more difficult. Up to you; since this is serialization, you could argue that you won't insert (etc.) outside of the initial population, and hence the raw index is fine and reliable.
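A minimal protobuf-net sketch of that manual-key approach (the DTO names and the nullable-key convention are illustrative):
[ProtoContract]
class DataSetDto
{
    [ProtoMember(1)] public int Key;     // invented artificial key, e.g. an increasing integer
    [ProtoMember(2)] public int Id;
    [ProtoMember(3)] public string Tag;
    [ProtoMember(4)] public float Value;
}

[ProtoContract]
class DataItemDto
{
    [ProtoMember(1)] public int Id;
    [ProtoMember(2)] public string Tag;
    [ProtoMember(3)] public int? DataKey;   // key of the owned data set, or null
    [ProtoMember(4)] public int? ParentKey; // key of the parent data set, or null
}
After deserialization, a fixup pass would resolve DataKey/ParentKey back into object references via a Dictionary<int, CDataSet>.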
This is, however, a very common scenario, so as an implementation-specific feature I've added optional (opt-in) reference tracking to my implementation (protobuf-net), which essentially automates the above under the covers (so you don't need to change your objects or expose the key outside of the binary stream).
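With protobuf-net v2, that opt-in is the AsReference flag on the member attribute; applied to the classes above it would look roughly like this:
[ProtoContract]
class CDataSet
{
    [ProtoMember(1)] public int m_nID;
    [ProtoMember(2)] public string m_sTag;
    [ProtoMember(3)] public float m_fValue;
}

[ProtoContract]
class CDataItem
{
    [ProtoMember(1)] public int m_nID;
    [ProtoMember(2)] public string m_sTag;
    [ProtoMember(3, AsReference = true)] public CDataSet m_refData;
    [ProtoMember(4, AsReference = true)] public CDataSet m_refParent;
}
With AsReference = true, protobuf-net serializes each CDataSet once and encodes further occurrences as references to it, so shared instances come back shared after deserialization.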