How does Apache Ignite's indexing work? I haven't found those technical details in the documentation.
Is it using a B-tree?
Where is the index stored?
How is it stored?
What performance (in big-O terms) does the index provide for lookups once it is built?
How fast does it build, and when does it build?
Ignite can store arbitrary serializable Java objects. How does it deal with composite objects when I want to index a field of a sub-sub-object?
Ignite Cache is a key-value store. Can I have different classes (i.e. different types of objects) as values? In other words, is Ignite Cache schemaless? If so, how does this fit with my SQL queries?
Ignite Cache is a key-value store. How do the keys come into play when I SQL-query for my values? What exactly am I querying?
The keys can be arbitrary serializable Java objects as well. Can I query the keys, or only the values?
This information is not covered in much detail in the docs because it is mostly implementation detail and can change from version to version. After all, the source code is available if you are interested in the details.
To be specific I'm talking about Ignite 1.5 which is about to be released.
Before 1.5 the default data structure was a snap-tree (a variant of an AVL tree); in 1.5 a skip-list option was added as well, and it is now the default.
In the Java heap or in off-heap memory, depending on configuration.
Reliably :) I don't understand this question.
O(log N) on update and lookup.
The index is updated on each transaction commit (or on each cache update, in the case of an atomic cache); there is no separate build phase. You can expect your indexes to be in a correct state after each update.
Ignite has two options (since 1.5): either store objects in a binary format, which allows reading individual field values without deserializing the whole object, or keep the whole object deserialized and use reflection.
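To make the nested-field part concrete, here is a minimal Java sketch, assuming Ignite's annotation-based query configuration: @QuerySqlField marks indexed fields (including fields of a nested object, which Ignite flattens into SQL columns), and withKeepBinary() reads a single field without deserializing the value. The class and field names are made up for illustration.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.binary.BinaryObject;
    import org.apache.ignite.cache.query.annotations.QuerySqlField;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class NestedIndexSketch {
        static class Address {
            @QuerySqlField(index = true)   // index a field of a sub-object
            String city;
        }

        static class Person {
            @QuerySqlField(index = true)
            String name;
            Address address;               // nested object
        }

        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                CacheConfiguration<Integer, Person> cfg =
                        new CacheConfiguration<>("people");
                cfg.setIndexedTypes(Integer.class, Person.class);
                IgniteCache<Integer, Person> cache = ignite.getOrCreateCache(cfg);

                Person p = new Person();
                p.name = "Alice";
                p.address = new Address();
                p.address.city = "Oslo";
                cache.put(1, p);

                // Binary mode: read one field without deserializing the value.
                BinaryObject bo = cache.<Integer, BinaryObject>withKeepBinary().get(1);
                BinaryObject addr = bo.field("address");
                System.out.println((String) addr.field("city"));
            }
        }
    }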
etc.
Have fun!
Hello! I've recently been diving into Cloudflare Workers, especially Durable Objects. I managed to make a simple request that put a JS object under an assigned key. Let's say the key is key0, and the put object value is {"fieldA": "val0", "fieldB": "val1"}. In this case, how can I update the field-value of fieldA without removing fieldB? I've tried simply executing put("key0", {"fieldA": "newVal0"}), but it keeps removing {"fieldB": "val1"}.
Of course that is common behaviour for JS operations, but I cannot find anything like ["key0"]["fieldA"] = "newVal0" in the docs (maybe I'm missing something).
Hope this question reaches the gurus in the community! Thanks in advance [:
EDIT after the answers:
In theory, it would be wonderful if Cloudflare Durable Objects supported working with stored values just like normal JS objects. Such a Workers feature feels like a killer app among cloud DB services, since the average CPU time is quite fast and Cloudflare also has very low pricing compared to the other big players. If it happens, I would be eager to migrate everything to the Cloudflare platform [:
Durable Objects' KV storage only supports get and put operations -- it doesn't have any sort of "update". So, you have two options:
get() the key, modify the value, and then write the modified version back (see the sketch after this list). This may sound inefficient, but keep in mind that commonly-accessed keys will likely be in the in-memory cache. In fact, this get/modify/put implemented in your JavaScript is probably about as fast as any modification operation that Durable Objects itself could possibly implement built-in. That said, you probably don't want to use this approach with large objects, since the whole object has to be written to disk again after every update.
Split your object across multiple keys. E.g. instead of having the key foo map to {"fieldA": "val0", "fieldB": "val1"}, you could have separate keys foo:fieldA and foo:fieldB. Note that you can fetch all the keys at once using storage.list({prefix: "foo:"}). This approach is not as convenient but allows each field to be written separately to disk.
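A minimal JavaScript sketch of option 1, the read-modify-write pattern, inside a Durable Object class. The class name and the updateField helper are made-up placeholders; the key and field names follow the question.

    export class KvObject {
      constructor(state, env) {
        this.state = state;
      }

      async updateField(key, field, value) {
        // Read the whole stored object (served from the in-memory
        // cache when the key is hot), or start from an empty one.
        const obj = (await this.state.storage.get(key)) || {};
        // Modify just the one field in plain JavaScript...
        obj[field] = value;
        // ...and write the whole object back.
        await this.state.storage.put(key, obj);
        return obj;
      }
    }

Calling updateField("key0", "fieldA", "newVal0") leaves fieldB untouched. Option 2 would instead put key0:fieldA and key0:fieldB as separate keys and reassemble them with storage.list({prefix: "key0:"}).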
get and put deal with whole JS objects, so if you want to change part of the object you should get it, update it using normal JS, and then put the entire object back.
We are trying to find an in-memory database with index support that we can use for our application.
We are looking at Aerospike, Apache Ignite, Geode, Voltdb.
There is not much to distinguish them, and every one claims to be fast and to have great community support.
Out of these, Aerospike and VoltDB are C/C++ based and Apache Ignite and Geode are java based.
Considering there is little to choose between these databases in terms of performance, and that it is tough to test which one will work best for us in production, I was trying to find out whether the performance of an in-memory database also depends on whether it is Java-based or C/C++-based. Given that garbage collection issues are quite frequent and GC is tough to tune properly for your use case (which may change over time), is it true that the Java-based databases will be at a disadvantage?
Thanks
You can't really conclude that one DB is faster than another just because it is written in language X rather than language Y. A database is a very complex product with many features. Some queries may be faster in one DB, other queries in another.
The only way to find out is to test your specific use case.
For an in-memory DB that maintains consistency the way Geode does (i.e. performs synchronous replication to other nodes before releasing the client thread), your network is going to be a bigger concern than the HotSpot compiler. Still, here are two points of input to get you to the point where language is irrelevant:
1) If you are doing lots of creates/updates relative to reads: use off-heap memory on the server. This minimizes GCs.
2) Use Geode's serialization mapping between C/C++ and Java objects to avoid JNI. Specifically, use the DataSerializer: http://gemfire.docs.pivotal.io/geode/developing/data_serialization/gemfire_data_serialization.html
If you plan to use queries extensively rather than gets/puts, use the PdxSerializer (a hedged sketch follows below): http://gemfire.docs.pivotal.io/geode/developing/data_serialization/use_pdx_serializer.html
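For reference, here is roughly what a PDX-serializable class looks like, assuming a current Apache Geode version with the org.apache.geode.pdx package (the linked GemFire docs use the older package names); the Person class is made up for illustration.

    import org.apache.geode.pdx.PdxReader;
    import org.apache.geode.pdx.PdxSerializable;
    import org.apache.geode.pdx.PdxWriter;

    // PDX stores the object field-by-field, so queries can read single
    // fields server-side without full deserialization, and C++ clients
    // can read the same data without JNI.
    public class Person implements PdxSerializable {
        private String name;
        private int age;

        public Person() {
            // PDX deserialization requires a zero-arg constructor.
        }

        @Override
        public void toData(PdxWriter writer) {
            writer.writeString("name", name);
            writer.writeInt("age", age);
        }

        @Override
        public void fromData(PdxReader reader) {
            name = reader.readString("name");
            age = reader.readInt("age");
        }
    }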
I guess I'm going to be the contrarian.
All else being equal, compiled code is faster than the JVM, and there is simply no garbage collection to have to employ tactics to avoid.
Having been written in C/C++, eXtremeDB (my company's product) is able to avoid C run-time memory management altogether. Managing the memory area entirely within the database software enables the use of highly efficient, purpose-specific memory managers and eliminates the potential for memory leaks (from the whole-system point of view; e.g. if 200GB is set aside for the in-memory database, it will never exceed 200GB). eXtremeDB is not unique in this regard; other in-memory DBMSs written in C/C++ are also able to avoid the C run-time malloc/free or C++ new/delete. So please don't ding me for making a product pitch; I'm not. I'm pointing out a capability that is possible with a C/C++ implementation that may not be available with a JVM.
The other answerers are correct that a poor implementation of a SQL execution plan for a given query can overwhelm any advantage of compiled code over the JVM, but at some point you've got to have confidence that your DBMS vendor knows what they are doing (and is interested in improving the product if a plan is demonstrably inefficient or wrong). If you're not using SQL, then the quality of the SQL optimizer is not part of the equation, and it really comes down to how well the database system's index methods are written, and to the availability of different index types for different search requirements (e.g. a hash index will generally be better than a b-tree for exact-match lookup, but a hash index can't support partial-key (wildcard) search or ordered retrieval).
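To make that index trade-off concrete, here is a tiny standalone Java sketch using the JDK's HashMap and TreeMap as stand-ins for a hash index and a b-tree index; no DBMS is involved, this is purely illustrative.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class IndexKinds {
        public static void main(String[] args) {
            // Hash index analogue: O(1) expected exact-match lookup,
            // but no ordering, so no range or prefix scans.
            Map<String, String> hash = new HashMap<>();
            hash.put("smith", "row-1");
            System.out.println(hash.get("smith")); // exact match only

            // B-tree analogue: O(log N) lookup, plus ordered retrieval.
            NavigableMap<String, String> tree = new TreeMap<>();
            tree.put("smith", "row-1");
            tree.put("smythe", "row-2");
            // Every key in ["smi", "smz"): the partial-key scan a
            // hash index cannot serve.
            System.out.println(tree.subMap("smi", "smz"));
        }
    }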
There are some public (independent, audited) benchmarks you can look to. We have participated in a few STAC-M3 benchmarks, though only one other DBMS has done the same (the DBMSs you listed, specifically, have not).
So, I've come to a place where I want to segment the data I store in Redis into separate databases, as I sometimes need to use the KEYS command on one specific kind of data and want to separate it to make that faster.
If I segment into multiple databases, everything is still single threaded, and I still only get to use one core. If I just launch another instance of Redis on the same box, I get to use an extra core. On top of that, I can't name Redis databases, or give them any sort of more logical identifier. So, with all of that said, why/when would I ever want to use multiple Redis databases instead of just spinning up an extra instance of Redis for each extra database I want? And relatedly, why doesn't Redis try to utilize an extra core for each extra database I add? What's the advantage of being single threaded across databases?
You don't want to use multiple databases in a single Redis instance. As you noted, multiple instances let you take advantage of multiple cores. If you use database selection you will have to refactor when upgrading. Monitoring and managing multiple instances is neither difficult nor painful.
Indeed, you would get far better metrics on each db by segregation based on instance. Each instance would have stats reflecting that segment of data, which can allow for better tuning and more responsive and accurate monitoring. Use a recent version and separate your data by instance.
As Jonaton said, don't use the KEYS command. You'll find far better performance if you simply create a key index: whenever you add a key, add the key name to a set. The KEYS command is not terribly useful once you scale up, since it takes significant time to return.
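A minimal sketch of that key-index pattern using the Jedis client; the key names here are invented, and keeping the index in sync on deletes is your job.

    import redis.clients.jedis.Jedis;

    public class KeyIndexExample {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Write the value and record its key in an index set,
                // instead of relying on KEYS later.
                jedis.set("user:1001", "{\"name\":\"alice\"}");
                jedis.sadd("index:user-keys", "user:1001");

                // Fetch just this kind of data via the index set,
                // instead of a server-blocking scan over all keys.
                for (String key : jedis.smembers("index:user-keys")) {
                    System.out.println(key + " -> " + jedis.get(key));
                }

                // Keep the index in sync when deleting.
                jedis.del("user:1001");
                jedis.srem("index:user-keys", "user:1001");
            }
        }
    }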
Let the access pattern determine how to structure your data, rather than storing it the way you think works and then working out how to access and mince it later. You will see far better performance, and you will often find the data-consuming code much cleaner and simpler.
Regarding single-threading, consider that Redis is designed for speed and atomicity. Sure, actions modifying data in one DB need not wait on another DB, but what if that action is saving to the dump file, or processing transactions on slaves? At that point you start getting into the weeds of concurrent programming.
By using multiple instances you turn multi threading complexity into a simpler message passing style system.
In principle, Redis databases on the same instance are no different from schemas in RDBMS database instances.
So, with all of that said, why/when would I ever want to use multiple Redis databases instead of just spinning up an extra instance of Redis for each extra database I want?
There's one clear advantage of using redis databases in the same redis instance, and that's management. If you spin up a separate instance for each application, and let's say you've got 3 apps, that's 3 separate redis instances, each of which will likely need a slave for HA in production, so that's 6 total instances. From a management standpoint, this gets messy real quick because you need to monitor all of them, do upgrades/patches, etc. If you don't plan on overloading redis with high I/O, a single instance with a slave is simpler and easier to manage provided it meets your SLA.
Even Salvatore Sanfilippo (creator of Redis) thinks it's a bad idea to use multiple DBs in Redis. See his comment here:
https://groups.google.com/d/topic/redis-db/vS5wX8X4Cjg/discussion
I understand how this can be useful, but unfortunately I consider Redis multiple database errors my worst decision in Redis design at all... without any kind of real gain, it makes the internals a lot more complex. The reality is that databases don't scale well for a number of reason, like active expire of keys and VM. If the DB selection can be performed with a string I can see this feature being used as a scalable O(1) dictionary layer, that instead it is not. With DB numbers, with a default of a few DBs, we are communication better what this feature is and how can be used I think. I hope that at some point we can drop the multiple DBs support at all, but I think it is probably too late as there is a number of people relying on this feature for their work.
I don't really know any benefits of having multiple databases on a single instance. I guess it's useful if multiple services use the same database server(s), so you can avoid key collisions.
I would not recommend building around using the KEYS command, since it's O(n) and that doesn't scale well. What are you using it for that you can accomplish in another way? Maybe redis isn't the best match for you if functionality like KEYS is vital.
I think they mention the benefits of a single-threaded server in their FAQ, but the main thing is simplicity: you don't have to bother with concurrency in any real way. Every action is blocking, so no two things can alter the database at the same time. Ideally you would have one (or more) instances per core on each server and use a consistent hashing algorithm (or a proxy) to divide the keys among them. Of course, you'll lose some functionality: pipelining will only work for things on the same server, sorts become harder, etc.
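As a rough illustration of dividing keys among instances, here is a toy consistent-hash ring in Java; the node addresses are placeholders, and a production setup would normally use a proxy or client library instead.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.List;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Picks which Redis instance owns a key; each node gets several
    // virtual positions on the ring so keys spread evenly.
    public class Ring {
        private final NavigableMap<Long, String> ring = new TreeMap<>();

        public Ring(List<String> nodes, int vnodesPerNode) throws Exception {
            for (String node : nodes)
                for (int i = 0; i < vnodesPerNode; i++)
                    ring.put(hash(node + "#" + i), node);
        }

        public String nodeFor(String key) throws Exception {
            // First ring position at or after the key's hash,
            // wrapping around to the start of the ring.
            Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
            return (e != null ? e : ring.firstEntry()).getValue();
        }

        private static long hash(String s) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        }

        public static void main(String[] args) throws Exception {
            Ring ring = new Ring(List.of("redis-a:6379", "redis-b:6379"), 64);
            System.out.println(ring.nodeFor("user:1001"));
        }
    }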
Redis databases can be used in the rare cases of deploying a new version of the application, where the new version requires working with different entities.
I know this question is years old, but there's another reason multiple databases may be useful.
If you use a "cloud Redis" from your favourite cloud provider, you probably have a minimum memory size and will pay for what you allocate. If however your dataset is smaller than that, then you'll be wasting a bit of the allocation, and so wasting a bit of money.
Using databases you could use the same Redis cloud-instance to provide service for (say) dev, UAT and production, or multiple instances of your application, or whatever else - thus using more of the allocated memory and so being a little more cost-effective.
A use-case I'm looking at has several instances of an application which use 200-300K each, yet the minimum allocation on my cloud provider is 1M. We can consolidate 10 instances onto a single Redis without really making a dent in any limits, and so save about 90% of the Redis hosting cost. I appreciate there are limitations and issues with this approach, but thought it worth mentioning.
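A small Jedis sketch of that consolidation idea; the host name and the database-number-per-environment mapping are made-up placeholders.

    import redis.clients.jedis.Jedis;

    public class SharedInstance {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("my-cloud-redis.example.com", 6379)) {
                jedis.select(1);                   // dev keyspace
                jedis.set("feature:flag", "on");
                jedis.select(2);                   // UAT keyspace, same instance
                System.out.println(jedis.get("feature:flag")); // null: separate db
            }
        }
    }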
I am using Redis to implement a blacklist of email addresses, and I have different TTL values for different levels of blacklisting, so having different DBs on the same instance helps me a lot.
Using multiple databases in a single instance may be useful in the following scenario:
Different copies of the same database could be used for production, development, or testing with real-time data. People may use replication to clone a Redis instance for the same purpose. However, the former approach is easier for existing running programs, which can simply select the right database to switch to the intended mode.
Our motivation has not been mentioned above. We use multiple databases because we routinely need to delete a large set of a certain type of data, and FLUSHDB makes that easy. For example, we can clear all cached web pages, using FLUSHDB on database 0, without affecting all of our other use of Redis.
There is some discussion here but I have not found definitive information about the performance of this vs scan and delete:
https://github.com/StackExchange/StackExchange.Redis/issues/873
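A minimal Jedis sketch of the pattern; the "cached pages live in db 0" layout is our own convention here, not anything Redis defines.

    import redis.clients.jedis.Jedis;

    public class FlushPages {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.select(0);   // the db that holds only cached web pages
                jedis.flushDB();   // drops every key in db 0, nothing else
            }
        }
    }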
I decided to use the Infinispan distributed grid to extend my application to support clustering, but I encountered a limitation when using this kind of shared resource.
How can I retrieve all the values or keys in the distributed cache? I'm asking because in the documentation all of the collection methods are not recommended for use in production (meaning keySet()).
Right now I have a local bucket/cache with key/value pairs, but in order to process the values I need to retrieve the keys and iterate through the set.
Set set = cache.keySet();
With a large number of entries in the local cache, keySet() returns a copy, and this is a heavy load on memory.
I tried to use the query feature, but it involves network calls when I want to find the values, and I don't need that. Also, the query feature does not support complex filters.
Do you know which is the best approach when using infinispan in production?
As this is an experimental phase, I'm using the latest Infinispan version.
Thanks a lot.
Map/Reduce functionality allows you to iterate over all the stored entries, and it also migrates the logic to where the data is, so it doesn't add much of a burden.
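For context, a hedged sketch of that Map/Reduce API, counting entries across the grid. The package and interface names are from memory of the Infinispan 6/7-era API; later versions removed it in favour of distributed streams (cache.keySet().stream() and friends).

    import java.io.Serializable;
    import java.util.Iterator;
    import java.util.Map;

    import org.infinispan.Cache;
    import org.infinispan.distexec.mapreduce.Collector;
    import org.infinispan.distexec.mapreduce.MapReduceTask;
    import org.infinispan.distexec.mapreduce.Mapper;
    import org.infinispan.distexec.mapreduce.Reducer;

    public class EntryCount {
        // Runs on each node, next to its share of the data.
        static class CountMapper
                implements Mapper<String, String, String, Integer>, Serializable {
            @Override
            public void map(String key, String value,
                            Collector<String, Integer> collector) {
                collector.emit("entries", 1);
            }
        }

        static class SumReducer
                implements Reducer<String, Integer>, Serializable {
            @Override
            public Integer reduce(String reducedKey, Iterator<Integer> iter) {
                int sum = 0;
                while (iter.hasNext()) sum += iter.next();
                return sum;
            }
        }

        public static Map<String, Integer> countEntries(Cache<String, String> cache) {
            return new MapReduceTask<String, String, String, Integer>(cache)
                    .mappedWith(new CountMapper())
                    .reducedWith(new SumReducer())
                    .execute();
        }
    }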
We are using keySet() in production for informational purposes only. Performance does not seem to be a big issue under low data loads, but you should of course use such methods with great care, because they can have a large performance impact depending on how you are using the cache. Remote cache queries seem like a pretty handy feature to me.
We are setting up a JBoss cluster, and we are building our own distributed cache solution on top of JBoss Cache (we can't use it as a 2nd-level cache for the ORM layer in our case). We want to use invalidation rather than replication as the cache mode. As far as I can see after (very) little testing, both solutions seem to work: objects are put into the cache, and objects seem to be evicted when they are updated on any of the servers.
This leads me to believe that PojoCache with AOP instrumentation is only needed when using replication, so that you can replicate only updated field values and not whole objects. Am I correct here, or are there other advantages to using PojoCache over TreeCache in our scenario? And if PojoCache has advantages, do we still need AOP instrumentation, and do we still need to annotate our entities with @PojoCacheable (yes, we are using JBoss Cache 1.4.1), given that we are not using replication?
Regards
Jonas Heineson
PojoCache has the ability, through AOP, to:
replicate only changed fields and not whole objects. This makes a difference if, e.g., your person object contains a huge image of the person and you only change the password.
detect changes, and thus automatically put them on the list to be replicated.
TreeCache (plain) does not need AOP, but consequently cannot replicate individual fields or detect what has changed, so you need to trigger replication yourself.
If you don't replicate, those points are probably irrelevant.
IIRC, you don't need the @PojoCacheable annotation for PojoCache; without it, you need to specify the classes to be enhanced in a different way.
I have the feeling that if you are not replicating, the plain TreeCache will be enough.
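For completeness, plain TreeCache usage looks roughly like this. This is a hedged sketch from memory of the JBoss Cache 1.x tutorial; the config file name is a placeholder for your invalidation-mode XML.

    import org.jboss.cache.PropertyConfigurator;
    import org.jboss.cache.TreeCache;

    public class TreeCacheSketch {
        public static void main(String[] args) throws Exception {
            TreeCache cache = new TreeCache();
            // Placeholder config; set the cache mode (e.g. invalidation) here.
            new PropertyConfigurator().configure(cache, "cache-service.xml");
            cache.startService();

            // Plain TreeCache: whole values stored under an FQN-style path.
            cache.put("/app/users/42", "name", "Jonas");
            System.out.println(cache.get("/app/users/42", "name"));

            cache.stopService();
        }
    }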