Infinispan keySet() not suitable for production

I decided to use the Infinispan distributed data grid to extend my application to run in a cluster, but I ran into a limitation with this kind of shared resource.
How can I retrieve all the values or keys in the distributed cache? I'm asking because the documentation says that none of the collection methods (keySet() among them) are recommended for production use.
Right now I have a local bucket/cache of key/value pairs, but in order to process the values I need to retrieve the keys and iterate through the set.
Set set = cache.keySet();
When the local cache holds a large number of entries, keySet() returns a copy, which puts a heavy load on memory.
I tried to use the query feature but there are some network calls if I want to find the values and I don't need that. Also the query feature does not support complex filters.
Do you know which is the best approach when using infinispan in production?
As this is an experimental phase, I'm using the latest Infinispan version.
Thanks a lot.

Map/Reduce functionality allows you to iterate over all the stored entries, and it moves the processing logic to where the data is, so it doesn't add much overhead.
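To make the idea concrete, here is a minimal plain-Java sketch of the pattern. It is not the Infinispan API: the class name, the list-of-maps "cluster" and the example job are all assumptions made for illustration. The point it shows is that each partition is processed locally and only a small partial result travels back, so no full copy of the key set is ever built.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Plain-Java model of the map/reduce idea; this is NOT the Infinispan API.
class MapReduceSketch {

    // Each element of 'nodes' stands for the entries held by one cluster node
    // (a hypothetical layout used only for this illustration).
    static <K, V, R> R mapReduce(List<Map<K, V>> nodes,
                                 Function<Map<K, V>, R> mapper,
                                 BinaryOperator<R> reducer,
                                 R identity) {
        R result = identity;
        for (Map<K, V> partition : nodes) {
            // The mapper sees only the local partition and returns a small
            // partial result instead of the entries themselves.
            R partial = mapper.apply(partition);
            result = reducer.apply(result, partial);
        }
        return result;
    }

    // Example job: total length of all values, computed without collecting keys.
    static int totalValueLength(List<Map<String, String>> nodes) {
        return MapReduceSketch.<String, String, Integer>mapReduce(
                nodes,
                partition -> partition.values().stream().mapToInt(String::length).sum(),
                Integer::sum,
                0);
    }
}
```

In a real grid the mapper would be shipped to each node rather than called in a loop; the loop here only stands in for that distribution step.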

We use keySet() in production for informational purposes only. Performance does not seem to be a big issue under low data loads, but of course you should use such methods with great care, because they can have a large performance impact depending on how you use the cache. Remote cache queries seem a pretty handy feature to me.

Related

Apache Ignite: How does the indexing work?

How does Apache Ignite's indexing work? I haven't found those technical details in the documentation.
Is it using a B-tree?
Where is the index stored?
How is it stored?
What performance (in Big-O notation) does the index provide after build in usage?
How fast does it build, when does it build?
Ignite can store arbitrary serializable Java objects. How does it deal with composites when I want to index a field of a sub-sub-object?
Ignite Cache is a key-value store. Am I able to have different classes (=types as objects) as values? In other words, is Ignite Cache Schemaless? If yes, how does this fit with my SQL-queries?
Ignite Cache is a key-value store. How do the keys come into play if I SQL-query for my values? What am I querying for?
The keys can be arbitrary, serializable Java objects - am I able to query for the keys or only the values?
This information is not covered much in the docs because it is mostly implementation detail and can change from version to version. After all, the source code is available if you are interested in the details.
To be specific I'm talking about Ignite 1.5 which is about to be released.
Before 1.5 the default data structure was a snap-tree (a variant of the AVL tree); since 1.5 a skip-list option has been added as well, and it is the default now.
In the Java heap or in off-heap memory, depending on the configuration.
Reliably :) I don't understand this question.
O(log N) on update and lookup.
The index is updated on each transaction commit (or on each cache update in the case of an atomic cache); there is no separate build phase. You can expect your indexes to be in a correct state after each update.
Ignite has two options (since 1.5): either store objects in a binary format, which allows individual field values to be read, or keep the whole object deserialized and use reflection.
etc.
Have fun!
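As a rough illustration of the points above (a sorted index, O(log N) lookup and update, no separate build phase), here is a plain-Java sketch. None of it is Ignite code: the class, its methods and the use of ConcurrentSkipListMap are assumptions chosen only to show the shape of the idea. The field extractor is an arbitrary function, so it could just as well reach into a nested object (something like p -> p.getAddress().getCity(), with hypothetical getters).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.Function;

// Plain-Java model of an index kept in step with a key-value cache.
// None of these names come from Ignite; they are illustrative only.
class IndexedCacheSketch<K, V, F extends Comparable<F>> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    // Sorted index: field value -> keys of the entries that have that value.
    private final ConcurrentSkipListMap<F, Set<K>> index = new ConcurrentSkipListMap<>();
    private final Function<V, F> fieldExtractor;

    IndexedCacheSketch(Function<V, F> fieldExtractor) {
        this.fieldExtractor = fieldExtractor;
    }

    // The index is maintained on every update, so there is no build phase.
    synchronized void put(K key, V value) {
        V old = cache.put(key, value);
        if (old != null) {
            // Remove the key from the bucket of its previous field value.
            F oldField = fieldExtractor.apply(old);
            Set<K> keys = index.get(oldField);
            if (keys != null) {
                keys.remove(key);
                if (keys.isEmpty()) {
                    index.remove(oldField);
                }
            }
        }
        F newField = fieldExtractor.apply(value);
        index.computeIfAbsent(newField, f -> ConcurrentHashMap.<K>newKeySet()).add(key);
    }

    // Range query served from the sorted index instead of a full scan.
    synchronized List<V> range(F fromInclusive, F toInclusive) {
        List<V> out = new ArrayList<>();
        for (Set<K> keys : index.subMap(fromInclusive, true, toInclusive, true).values()) {
            for (K k : keys) {
                out.add(cache.get(k));
            }
        }
        return out;
    }
}
```

Note that the query is on a field of the value while the cache key only identifies the entry, which is one way to picture how keys and SQL-style queries relate.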

What's the Point of Multiple Redis Databases?

So, I've come to a place where I wanted to segment the data I store in redis into separate databases as I sometimes need to make use of the keys command on one specific kind of data, and wanted to separate it to make that faster.
If I segment into multiple databases, everything is still single threaded, and I still only get to use one core. If I just launch another instance of Redis on the same box, I get to use an extra core. On top of that, I can't name Redis databases, or give them any sort of more logical identifier. So, with all of that said, why/when would I ever want to use multiple Redis databases instead of just spinning up an extra instance of Redis for each extra database I want? And relatedly, why doesn't Redis try to utilize an extra core for each extra database I add? What's the advantage of being single threaded across databases?
You don't want to use multiple databases in a single Redis instance. As you noted, multiple instances let you take advantage of multiple cores. If you use database selection you will have to refactor when upgrading. Monitoring and managing multiple instances is neither difficult nor painful.
Indeed, you would get far better metrics on each db by segregation based on instance. Each instance would have stats reflecting that segment of data, which can allow for better tuning and more responsive and accurate monitoring. Use a recent version and separate your data by instance.
As Jonaton said, don't use the keys command. You'll find far better performance if you simply create a key index. Whenever adding a key, add the key name to a set. The keys command is not terribly useful once you scale up since it will take significant time to return.
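A minimal sketch of that key-index bookkeeping, using an in-memory stand-in rather than a real Redis client (the class and method names are made up for illustration). With an actual client the same steps would map onto SET plus SADD on write, DEL plus SREM on delete, and SMEMBERS to list the keys.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory stand-in for a Redis keyspace; it shows the bookkeeping only.
class KeyIndexSketch {
    private final Map<String, String> store = new HashMap<>();
    // One index set per kind of data: kind -> names of the keys of that kind.
    private final Map<String, Set<String>> indexes = new HashMap<>();

    // Write the value and record the key name in the index for its kind.
    void put(String kind, String key, String value) {
        store.put(key, value);
        indexes.computeIfAbsent(kind, k -> new HashSet<>()).add(key);
    }

    // Remove the value and keep the index in step with the keyspace.
    void delete(String kind, String key) {
        store.remove(key);
        Set<String> keys = indexes.get(kind);
        if (keys != null) {
            keys.remove(key);
        }
    }

    // Listing the keys of one kind reads the index set; it never scans all keys.
    Set<String> keysOf(String kind) {
        return new HashSet<>(indexes.getOrDefault(kind, Collections.emptySet()));
    }
}
```

The cost is one extra write per key, in exchange for never having to walk the whole keyspace.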
Let the access pattern determine how to structure your data rather than store it the way you think works and then working around how to access and mince it later. You will see far better performance and find the data consuming code often is much cleaner and simpler.
Regarding single threaded, consider that redis is designed for speed and atomicity. Sure actions modifying data in one db need not wait on another db, but what if that action is saving to the dump file, or processing transactions on slaves? At that point you start getting into the weeds of concurrency programming.
By using multiple instances you turn multi threading complexity into a simpler message passing style system.
In principle, Redis databases on the same instance are no different than schemas in RDBMS database instances.
So, with all of that said, why/when would I ever want to use multiple
Redis databases instead of just spinning up an extra instance of Redis
for each extra database I want?
There's one clear advantage of using redis databases in the same redis instance, and that's management. If you spin up a separate instance for each application, and let's say you've got 3 apps, that's 3 separate redis instances, each of which will likely need a slave for HA in production, so that's 6 total instances. From a management standpoint, this gets messy real quick because you need to monitor all of them, do upgrades/patches, etc. If you don't plan on overloading redis with high I/O, a single instance with a slave is simpler and easier to manage provided it meets your SLA.
Even Salvatore Sanfilippo (creator of Redis) thinks it's a bad idea to use multiple DBs in Redis. See his comment here:
https://groups.google.com/d/topic/redis-db/vS5wX8X4Cjg/discussion
I understand how this can be useful, but unfortunately I consider
Redis multiple database errors my worst decision in Redis design at
all... without any kind of real gain, it makes the internals a lot
more complex. The reality is that databases don't scale well for a
number of reason, like active expire of keys and VM. If the DB
selection can be performed with a string I can see this feature being
used as a scalable O(1) dictionary layer, that instead it is not.
With DB numbers, with a default of a few DBs, we are communication
better what this feature is and how can be used I think. I hope that
at some point we can drop the multiple DBs support at all, but I think
it is probably too late as there is a number of people relying on this
feature for their work.
I don't really know any benefits of having multiple databases on a single instance. I guess it's useful if multiple services use the same database server(s), so you can avoid key collisions.
I would not recommend building around using the KEYS command, since it's O(n) and that doesn't scale well. What are you using it for that you can accomplish in another way? Maybe redis isn't the best match for you if functionality like KEYS is vital.
I think they mention the benefits of a single threaded server in their FAQ, but the main thing is simplicity - you don't have to bother with concurrency in any real way. Every action is blocking, so no two things can alter the database at the same time. Ideally you would have one (or more) instances per core of each server, and use a consistent hashing algorithm (or a proxy) to divide the keys among them. Of course, you'll lose some functionality - piping will only work for things on the same server, sorts become harder etc.
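For anyone unfamiliar with consistent hashing, here is a small sketch of a hash ring that maps keys to instance names. Everything in it is an assumption for illustration: the class name, the instance names, the number of virtual nodes, and the use of MD5 purely as a way to spread points around the ring. A production client library or proxy would normally do this for you.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Minimal consistent-hash ring mapping keys to (hypothetical) instance names.
class HashRingSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    HashRingSketch(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Each instance gets several points on the ring to even out the load.
    void addInstance(String name) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(name + "#" + i), name);
        }
    }

    void removeInstance(String name) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(name + "#" + i));
        }
    }

    // A key belongs to the first point at or after its hash, wrapping around.
    String instanceFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no instances on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // First 8 bytes of an MD5 digest, used only to spread points evenly.
    private static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The useful property is that adding or removing one instance only moves the keys that belonged to that instance's points; keys owned by the other instances stay where they are.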
Redis databases can be used in the rare cases of deploying a new version of the application, where the new version requires working with different entities.
I know this question is years old, but there's another reason multiple databases may be useful.
If you use a "cloud Redis" from your favourite cloud provider, you probably have a minimum memory size and will pay for what you allocate. If however your dataset is smaller than that, then you'll be wasting a bit of the allocation, and so wasting a bit of money.
Using databases you could use the same Redis cloud-instance to provide service for (say) dev, UAT and production, or multiple instances of your application, or whatever else - thus using more of the allocated memory and so being a little more cost-effective.
A use-case I'm looking at has several instances of an application which use 200-300K each, yet the minimum allocation on my cloud provider is 1M. We can consolidate 10 instances onto a single Redis without really making a dent in any limits, and so save about 90% of the Redis hosting cost. I appreciate there are limitations and issues with this approach, but thought it worth mentioning.
I am using Redis to implement a blacklist of email addresses, and I have different TTL values for different levels of blacklisting, so having different DBs on the same instance helps me a lot.
Using multiple databases in a single instance may be useful in the following scenario:
Different copies of the same database could be used for production, development or testing using real-time data. People may use replica to clone a redis instance to achieve the same purpose. However, the former approach is easier for existing running programs to just select the right database to switch to the intended mode.
Our motivation has not been mentioned above. We use multiple databases because we routinely need to delete a large set of a certain type of data, and FLUSHDB makes that easy. For example, we can clear all cached web pages, using FLUSHDB on database 0, without affecting all of our other use of Redis.
There is some discussion here but I have not found definitive information about the performance of this vs scan and delete:
https://github.com/StackExchange/StackExchange.Redis/issues/873
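To picture the difference between the two approaches, here is a plain-Java model. It says nothing about how Redis implements FLUSHDB internally or how fast either approach is in practice; the class and its methods are hypothetical and only contrast "drop one keyspace" with "visit keys and delete the matching ones".

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Model of two ways to clear one kind of data; not Redis code.
class FlushVsScanSketch {
    // Option 1: one keyspace per numbered database.
    private final Map<Integer, Map<String, String>> databases = new HashMap<>();
    // Option 2: a single shared keyspace where the kind is a key prefix.
    private final Map<String, String> shared = new HashMap<>();

    void putInDb(int db, String key, String value) {
        databases.computeIfAbsent(db, d -> new HashMap<>()).put(key, value);
    }

    void putShared(String key, String value) {
        shared.put(key, value);
    }

    // Like FLUSHDB: drops one database's keyspace without touching the others.
    void flushDb(int db) {
        databases.remove(db);
    }

    // Like scan-and-delete: every key is inspected, matching ones are removed.
    // Returns how many keys had to be visited.
    int deleteByPrefix(String prefix) {
        int visited = 0;
        Iterator<String> it = shared.keySet().iterator();
        while (it.hasNext()) {
            visited++;
            if (it.next().startsWith(prefix)) {
                it.remove();
            }
        }
        return visited;
    }

    int sizeOfDb(int db) {
        Map<String, String> keyspace = databases.get(db);
        return keyspace == null ? 0 : keyspace.size();
    }

    int sharedSize() {
        return shared.size();
    }
}
```

In this model the prefix approach has to look at unrelated keys too, which is the part that a separate database avoids.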

How to go from a full SQL querying to something like a NoSQL?

In one of my processes I have an SQL query that takes 10-20% of the total execution time. This query filters my database and loads a list of PricingGrid objects.
So I want to improve this performance.
So far I have come up with 2 solutions:
Use a NoSQL solution, AFAIK these are good solutions for improving reading process.
But the migration seems hard and needs a lot of work (like import the data from sql server to nosql in a regular basis)
I don't have any knowledge of NoSQL; I don't even know which product I should use (the first I'd try is RavenDB, because I follow Ayende and it's built by the .NET community).
I might have some stuff to change in my model to make my object ok for a nosql database
Load all my PricingGrid object in memory (in a static IEnumerable)
This might be a problem when my server won't have enough memory to load everything
I might reinvent the wheel (indexes...) invented by the NoSQL providers
I think I'm not the first one wondering this, so what would be the best solution ? Is there any tools that could help me ?
.net 3.5, SQL Server 2005, windows server 2005
Migrating your data from SQL is only the first step.
Moving to a document store (like RavenDB or MongoDB) also means that you need to:
Denormalize your data
Perform schema validation in your code
Handle concurrency of complex operations in your code since you no longer have transactions (at least not the same way)
Perform rollbacks in the event of partial commits (changes)
Depending on your updates, reads and network model you might also need to handle conflicts
You provided very limited information but it sounds like your needs include a single database server and that your data fits well in the relational model.
In such a case I would vote against a NoSQL solution, it is more likely that you can speed up your queries with database optimizations and still retain all the added value of a RDBMS.
Non-relational databases are tools for a specific job (no matter how they sell them), if you need them it is usually because your data doesn't fit well in the relational model or if you have a need to distribute your data over multiple machines (size or availability). For instance, I use MongoDB for a write-intensive high throughput job management application. It is centralized and the data is very transient so the "cost" of having low durability is acceptable. This doesn't sound like the case for you.
If you prefer to use a NoSQL solution, perhaps you should try Memcached+MySQL (InnoDB). This gives you the speed benefits of an in-memory cache (in the form of a memcached daemon plugin) with the underlying protection and capabilities of an RDBMS (MySQL). It should also ease data migration and somewhat reduce the amount of changes required in your code.
I myself have never used it, I find that I either need NoSQL for the reasons I stated above or that I can optimize the RDBMS using stored procedures, indexes and table views in a way which is sufficient for my needs.
Asaf has provided great information in regards to the usage of NoSQL and when it is most appropriate. Given that your main concern was performance, I would tend to agree with his opinion - it would take you much more time and effort to adopt a completely new (and very different) data persistence platform than it would to trick out your SQL Server cluster. That said, my answer is mainly to address the "how" part of your question.
Addressing misunderstandings:
Denormalizing Data - You do not need to manually denormalize your existing data. This will be done for you when it is migrated over. More than anything you need to simply think about your data in a different fashion - root aggregates, entity and value types, etc.
Concurrency/Transactions - Transactions are possible in both Mongo and Raven, they are simply done in a different fashion. One of the inherent ways Raven does this is by using an ORM-like "unit of work" pattern with its RavenSession objects. Yes, your data validation needs to be done in code, but you already should be doing it there anyway. In my experience this is an over-hyped con.
How:
Install Raven or Mongo on a primary server, run it as a service.
Create or extend an existing application that uses the database you intend to port. This application needs all the model classes/libraries that your SQL database provides persistence for.
a. In your "data layer" you likely have a repository class somewhere. Extract an interface from this, and use it to build another repository class for your Raven/Mongo persistence. Both DBs have plenty of good documentation for using their APIs to push/pull/update changes in the document graphs. It's pretty damn simple.
b. Load your SQL data into C# objects in memory. Pull back your top-level objects (just the entities) and load their inner collections and related data in memory. Your repository is probably already doing this (e.g. when fetching an Order object, ensure that not only its properties but also associated collections like Items are loaded in memory).
c. Instantiate your Raven/Mongo repository and push the data to it. Primary entities become "top level documents" or "root aggregates" serialized in JSON, and their collections' data nested within. Save changes and close the repository. Note: You may break this step down into as many little pieces as your data deems necessary.
Once your data is migrated, play around with it and ensure you are satisfied. You may want to modify your application Models a little to adjust the way they are persisted to Raven/Mongo - for instance you may want to make both Orders and Items top-level documents and simply use reference values (much like relationships in RDBMS systems). Watch out here though, as doing so sort-of goes against the principle and performance behind NoSQL, as now you have to tap the DB twice to get the Order and the Items.
If satisfied, shard/replicate your mongo/raven servers across your remaining available server boxes.
Obviously there are tons of little details I did not explain, but that is the general process, and much of it depends on the applications already consuming the database and may be tricky if more than one app/system talks to it.
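The repository part of the process can be sketched roughly as follows. The thread is about .NET, but the sketch is in Java and uses in-memory stand-ins for both stores, so every name here (Order, OrderRepository, InMemoryOrderRepository, MigrationSketch) is hypothetical; a real implementation would put the SQL access and the Raven/Mongo client calls behind the same interface.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical root aggregate: an order with its items nested inside it.
class Order {
    final String id;
    final List<String> items;

    Order(String id, List<String> items) {
        this.id = id;
        this.items = new ArrayList<>(items);
    }
}

// The interface extracted from the existing repository class.
interface OrderRepository {
    List<Order> findAll();
    void save(Order order);
}

// Stand-in used for both sides here; real code would have one implementation
// backed by SQL and another backed by the document store.
class InMemoryOrderRepository implements OrderRepository {
    private final Map<String, Order> byId = new LinkedHashMap<>();

    @Override
    public List<Order> findAll() {
        return new ArrayList<>(byId.values());
    }

    @Override
    public void save(Order order) {
        byId.put(order.id, order);
    }
}

class MigrationSketch {
    // Load fully populated aggregates from the source and push each one to the
    // target as a single top-level document. Returns the number migrated.
    static int migrate(OrderRepository source, OrderRepository target) {
        int count = 0;
        for (Order order : source.findAll()) {
            target.save(order);
            count++;
        }
        return count;
    }
}
```

For a large data set the loop would be run in batches rather than over one findAll() call, as the note in step c suggests.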
Lastly, just to reiterate what Asaf said... learn as much as you can about NoSQL and its best use-cases. It is an amazing tool, but not a golden solution for all data persistence. In your case try to really find the bottlenecks in your current solution and see if they are solvable. As one of my systems guys says, "technology for technology's sake is bullshit".

JBoss TreeCache vs PojoCache when using invalidation rather than replication

We are setting up a JBoss cluster and building our own distributed cache solution on top of JBoss Cache (we can't use it as a 2nd-level cache for the ORM layer in our case). We want to use invalidation rather than replication as the cache mode. As far as I can see after (very) little testing, both solutions seem to work: objects are put into the cache, and objects seem to be evicted when they are updated on any of the servers.
This leads me to believe that PojoCache with AOP instrumentation is only needed when using replication, so that you can replicate only updated field values and not whole objects. Am I correct here, or are there any other advantages to using PojoCache over TreeCache in our scenario? And if PojoCache has advantages, do we still need AOP instrumentation and to annotate our entities with @PojoCacheable (yes, we are using JBCache 1.4.1), since we are not using replication?
Regards
Jonas Heineson
PoJoCache has the ability through AOP to:
only replicate changed fields and not whole objects. This makes a difference if, e.g., your person object contains a huge image of the person and you only change the password
detect changes and thus can automatically put them on the list to be replicated.
TreeCache (plain) does not need AOP, but can thus not replicate individual fields or detect what has changed so that you need to trigger replication yourself.
If you don't replicate, those points are probably irrelevant.
IIRC, you don't need the @PojoCacheable annotation for PojoCache; without it, you need to specify the classes to be enhanced in a different way.
I have the feeling that if you are not replicating, the plain TreeCache will be enough.
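To show what field-level change tracking buys you when replication is on, here is a plain-Java model. It is not JBoss Cache code and uses no AOP: the explicit set() call stands in for what the instrumentation would intercept, and all names are made up for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of field-level change tracking; not JBoss Cache code.
class TrackedObjectSketch {
    private final Map<String, Object> fields = new HashMap<>();
    // Fields changed since the last drain, i.e. the "list to be replicated".
    private final Map<String, Object> dirty = new LinkedHashMap<>();

    // Stands in for an intercepted field write: the change is detected here.
    void set(String field, Object value) {
        Object old = fields.put(field, value);
        boolean changed = (old == null) ? value != null : !old.equals(value);
        if (changed) {
            dirty.put(field, value);
        }
    }

    // Field-level replication would send only this delta, then reset it.
    Map<String, Object> drainChanges() {
        Map<String, Object> delta = new LinkedHashMap<>(dirty);
        dirty.clear();
        return delta;
    }

    // Whole-object replication would have to send everything, every time.
    Map<String, Object> snapshot() {
        return new HashMap<>(fields);
    }
}
```

With the person example from above, changing the password yields a one-entry delta while the snapshot still carries the image. With invalidation only, neither payload is sent, which is why the tracking matters less in that mode.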

Highest Performance Database Storage Mechanism

I need ideas to implement a (really) high performance in-memory Database/Storage Mechanism. In the range of storing 20,000+ objects, with each object updated every 5 or so seconds. I would like a FOSS solution.
What is my best option? What are your experiences?
I am working primarily in Java, but I need the datastore to have good performance so the datastore solution need not be java centric.
I also need to be able to query these objects, and I need to be able to restore all of the objects on program startup.
SQLite is an open-source self-contained database that supports in-memory databases (just connect to :memory:). It has bindings for many popular programming languages. It's a traditional SQL-based relational database, but you don't run a separate server – just use it as a library in your program. It's pretty quick. Whether it's quick enough, I don't know, but it may be worth an experiment.
Java driver.
Are you updating 20K objects every 5 seconds, or updating one of the 20K every 5 seconds?
What kind of objects? Why is a traditional RDBMS not sufficient?
Check out HSQLDB and Prevayler. Prevayler is a paradigm shift from traditional RDBMS - one which I have used (the paradigm, that is, not specifically Prevayler) in a number of projects and found it to have real merit.
Depends exactly how you need to query it, but have you looked into memcached?
http://www.danga.com/memcached/
Other options could include MySQL MEMORY Tables, the APC Cache if you're using PHP.
Some more detail about the project/requirements would be helpful.
An in-memory storage ?
1) a simple C 'malloc' array where all your structures would be indexed.
2) berkeleyDB: http://www.oracle.com/technology/products/berkeley-db/index.html. It is fast because you build your own indexes (secondary database) and there is no SQL expression to be evaluated.
Look at some of the products listed here: http://en.wikipedia.org/wiki/In-memory_database
What level of durability do you need? 20,000 updates every 5 seconds will probably be difficult for most IO hardware in terms of number of transactions if you write the data back to disc for every one.
If you can afford to lose some updates, you could probably flush it to disc every 100ms with no problem with fairly cheap hardware if your database and OS support doing that.
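A rough sketch of that trade-off in Java: updates go to memory immediately, and everything changed since the last flush is handed to a persistence step in one batch. The class, the sink callback and the 100 ms period passed in by the caller are all assumptions for illustration; what the sink actually does (file, database, etc.) is left open.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// In-memory store with batched write-behind; an illustrative sketch only.
class WriteBehindSketch<K, V> {
    private final Map<K, V> memory = new HashMap<>();
    private Map<K, V> dirty = new HashMap<>();
    // Hypothetical persistence step that receives one batch per flush.
    private final Consumer<Map<K, V>> sink;

    WriteBehindSketch(Consumer<Map<K, V>> sink) {
        this.sink = sink;
    }

    synchronized void put(K key, V value) {
        memory.put(key, value);
        // Repeated updates to the same key collapse into a single write.
        dirty.put(key, value);
    }

    synchronized V get(K key) {
        return memory.get(key);
    }

    // Hands everything changed since the last flush to the sink in one batch
    // and returns the number of entries written.
    int flush() {
        Map<K, V> batch;
        synchronized (this) {
            if (dirty.isEmpty()) {
                return 0;
            }
            batch = dirty;
            dirty = new HashMap<>();
        }
        sink.accept(batch);
        return batch.size();
    }

    // Flush on a fixed period, e.g. 100 ms. Updates made after the most
    // recent flush are lost if the process crashes.
    ScheduledExecutorService startFlushing(long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::flush, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

The caller is responsible for shutting down the returned scheduler and for one last flush() on a clean exit.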
If it's really an in-memory database that you don't want to flush to disc often, that sounds pretty trivial. I've heard that H2 is pretty good, but SQLite may work as well. A properly tuned MySQL instance could also do it (But may be more convoluted)
Oracle TimesTen In-Memory Database. See: http://www.informationweek.com/whitepaper/Business-Intelligence/Datamarts-Data-Warehouses/oracle-timesten-in-memory-databas-wp1228511232361
Chronicle Map is a pure Java key-value store.
It has really high performance, sustaining 1 million writes/second from a single thread. It's a myth that a fast database couldn't be written in Java.
Seamlessly stores and loads any serializable Java objects, provides a simple Map interface
LGPLv3
Since you don't have many "tables" a full-blown SQL database could be an overkill solution, indexes & queries could be implemented with a handful of distinct key-value stores which are updated manually by vanilla Java code. Chronicle Map provides mechanisms to make such updates concurrently isolated from each other, if you need it.