I've got a small real-time chat application and would now like to store message history in Redis instead of MySQL, simply because it is much, much faster.
However, I would like my users to be able to search the message history. How could I achieve that in Redis?
After some Googling I found that I would have to create a index of all words in Redis, but this seems to be a bit overkill.
Would a better approach be to sync data back to MySQL and have users search a table there instead?
I really would like to use Redis for the history part as my tests have shown it is a lot faster in my case.
Any ideas of another approach?
I'm going to go out on a limb and say, Redis is the wrong choice for your particular use case. Check the stackoverflow post on the Use Cases of Redis for some insight.
You say that redis is much much faster, but how can you say that if we have no solution to compare with? Redis commands will be much faster than the equivalent SQL ones, but as soon as you start creating off-purpose data structures you're killing what Redis is good at.
Some more reasons:
1. Unstructured content
If you have a fixed structure to search that would be somewhat plausible to consider, for example, you only allow for user/data search. Your keys could look like
user:message_time
Do you a proper search on free form text, you'll probably better off with something that is good at analyzing metadata and finding out what you need, probably something similar to elasticsearch (not an expert myself).
2. Redis is not a decent archive
For a chat app I would imagine using Redis as a cache for recent conversations, but in the end you don't want Redis to be storing all your messages and metadata for searching and indexing, do you? If your replication is not pristine you might lose all your data, in the end Redis is an In-Memory Database.
3. Key scan nightmare
I cannot even imagine how this can be done using Redis, maybe some other users do. What kind of scan would you do on the keys? You would probably need to do multiple queries to get something decent.
Redis isn't really the best tool for this but it's possible, check out https://github.com/tj/reds for an example.
Related
Yugabyte seems to support Redis, Cassandra and SQL queries. Do they work with each other? For example, can I write data with Cassandra API and later perform SQL queries against them?
These APIs do not work with each other as is, meaning you would not be able to query YCQL data from YSQL. This is because the data types are all not always present in the other APIs, and they often have different semantics.
That said, we get asked this a lot and the plan is to enable this scenario using a foreign data wrapper. So, in effect, you would be able to "import" the YCQL table into the YSQL side and use it there. Note that PostgreSQL already has a bunch of these wrappers (for example, see this generic list of PG FDWs here - it has entries for Cassandra and Redis). The idea is to re-use/enhance these and get them to work out of the box.
If you're interested, please open a GitHub issue and we can continue there. Would love to understand your use-case better to make sure we are able to address it and work with you closely on this.
I am new to redis and just start to learn it today. The official website does a good job about what the data types are and how to set them. That part is not hard to understand. But the problem is without queries, data becomes meaningless. I really failed to find any good documentation on how to do queries/searches in the official site.
When googling, I found this question Redis strings vs Redis hashes to represent JSON and people are all ignoring queries. I just don't get it at all. Many people suggest to store JSON as a string value to the key. This looks very crazy to me. How can they query JSON keys later? For example, for a user object to store in either key-value data type or hashes, how to query users whose age is greater than 30? That should be a very basic and simplest query for a database.
Thank you very much for your help. I am very confused.
EDITED:
After long time googling, I figured out a basic concept: redis can only query keys, and value are not searchable. Thus to search values, I have to create keys which contain the value. This answers my second question.
But the first and my primary question is where to find query tutorial in redis official site. Since redis is very different from sql db, the question might be changed to where to find data modeling and query tutorial in redis official site. It seems to do queries, I have to create some kind of special keys first. That makes query tutorial become modeling tutorial in the end.
Btw, for those who are new to redis and confused like me, you can read this article Storing and Querying Objects. Even if it has a little bug in it, it clarifies many thing about how to use redis for query. These kind of information really should go into redis official doc.
I really failed to find any good documentation on how to do queries/searches in the official site.
Check the command manual for how to query different data structures.
How can they query JSON keys later
JSON is NOT a built-in data structure for Redis. If you want to query JSON data you need to build the index with Redis' built-in data structures by yourself, or you might want to try RedisJSON, which is a Redis module for processing JSON data.
I'm in the planning phase of developing a very tag heavy website. Everything will essentially be associated with tags and the entire site would be based on searching these tags.
Now, I've been thinking a lot about going the nosql route here, since from what I read and understand, it makes the most sense for something like this.
Would it be best to go with this database system? Would it makes sense to go with the relational database system? Should I think incorporating something like SOLR?
What would the ideal setup be?
UPDATE:
Ideally they would be user generated, but we all know how that would turn out with giving users that much power. So, let’s change up the requirements and say that users WILL NOT have the power to create tags.
Searching on tags based on text matches is something that would probably be useful and needed. If the tag is “garage sale”, the search for “sale” should also pick this up, at a lower relevance for sure.
I can’t imagine the usage being so much that scaling would be an issue.
Thanks
I would spend a bit of time thinking about these tags. For example, are these tags going to be user generated or will you provide a few tags and let users select which ones they want?
Will you need to search on tags based on text matches? For example if a tag is "garage sale" do you want to search for "sale" to also pick this up? Maybe at a lower relevance?
Also, what kind of usage are you looking at? One good thing about Solr is that it's super easy to scale and synchronize data, it is easy to deploy multiple nodes, shard collections and replicate data to other nodes, something that traditional databases struggle with.
Another thing to keep in mind is that most of the time, Solr is not the official "repository of record", most of the time the data gets fed to it from a DB somewhere, but all reading activities are done from Solr.
See this answer for a SQL solution. Offhand I can't think of any advantage to using most NoSQL databases (i.e. key-value, columnar, or document) as the SQL solution will be more compact and ought to give good performance; a graph database may be appropriate if you're doing a lot of navigational type queries on your tags, but it doesn't sound like that's the case.
Use of Solr (or ElasticSearch or whatever) is orthogonal to your primary database; it may be appropriate to incorporate a search tool if users are typing inexact tags for search, but I recommend integrating a stemming library or something along those lines before turning to a full blown search tool.
So, I've come to a place where I wanted to segment the data I store in redis into separate databases as I sometimes need to make use of the keys command on one specific kind of data, and wanted to separate it to make that faster.
If I segment into multiple databases, everything is still single threaded, and I still only get to use one core. If I just launch another instance of Redis on the same box, I get to use an extra core. On top of that, I can't name Redis databases, or give them any sort of more logical identifier. So, with all of that said, why/when would I ever want to use multiple Redis databases instead of just spinning up an extra instance of Redis for each extra database I want? And relatedly, why doesn't Redis try to utilize an extra core for each extra database I add? What's the advantage of being single threaded across databases?
You don't want to use multiple databases in a single redis instance. As you noted, multiple instances lets you take advantage of multiple cores. If you use database selection you will have to refactor when upgrading. Monitoring and managing multiple instances is not difficult nor painful.
Indeed, you would get far better metrics on each db by segregation based on instance. Each instance would have stats reflecting that segment of data, which can allow for better tuning and more responsive and accurate monitoring. Use a recent version and separate your data by instance.
As Jonaton said, don't use the keys command. You'll find far better performance if you simply create a key index. Whenever adding a key, add the key name to a set. The keys command is not terribly useful once you scale up since it will take significant time to return.
Let the access pattern determine how to structure your data rather than store it the way you think works and then working around how to access and mince it later. You will see far better performance and find the data consuming code often is much cleaner and simpler.
Regarding single threaded, consider that redis is designed for speed and atomicity. Sure actions modifying data in one db need not wait on another db, but what if that action is saving to the dump file, or processing transactions on slaves? At that point you start getting into the weeds of concurrency programming.
By using multiple instances you turn multi threading complexity into a simpler message passing style system.
In principle, Redis databases on the same instance are no different than schemas in RDBMS database instances.
So, with all of that said, why/when would I ever want to use multiple
Redis databases instead of just spinning up an extra instance of Redis
for each extra database I want?
There's one clear advantage of using redis databases in the same redis instance, and that's management. If you spin up a separate instance for each application, and let's say you've got 3 apps, that's 3 separate redis instances, each of which will likely need a slave for HA in production, so that's 6 total instances. From a management standpoint, this gets messy real quick because you need to monitor all of them, do upgrades/patches, etc. If you don't plan on overloading redis with high I/O, a single instance with a slave is simpler and easier to manage provided it meets your SLA.
Even Salvatore Sanfilippo (creator of Redis) thinks it's a bad idea to use multiple DBs in Redis. See his comment here:
https://groups.google.com/d/topic/redis-db/vS5wX8X4Cjg/discussion
I understand how this can be useful, but unfortunately I consider
Redis multiple database errors my worst decision in Redis design at
all... without any kind of real gain, it makes the internals a lot
more complex. The reality is that databases don't scale well for a
number of reason, like active expire of keys and VM. If the DB
selection can be performed with a string I can see this feature being
used as a scalable O(1) dictionary layer, that instead it is not.
With DB numbers, with a default of a few DBs, we are communication
better what this feature is and how can be used I think. I hope that
at some point we can drop the multiple DBs support at all, but I think
it is probably too late as there is a number of people relying on this
feature for their work.
I don't really know any benefits of having multiple databases on a single instance. I guess it's useful if multiple services use the same database server(s), so you can avoid key collisions.
I would not recommend building around using the KEYS command, since it's O(n) and that doesn't scale well. What are you using it for that you can accomplish in another way? Maybe redis isn't the best match for you if functionality like KEYS is vital.
I think they mention the benefits of a single threaded server in their FAQ, but the main thing is simplicity - you don't have to bother with concurrency in any real way. Every action is blocking, so no two things can alter the database at the same time. Ideally you would have one (or more) instances per core of each server, and use a consistent hashing algorithm (or a proxy) to divide the keys among them. Of course, you'll loose some functionality - piping will only work for things on the same server, sorts become harder etc.
Redis databases can be used in the rare cases of deploying a new version of the application, where the new version requires working with different entities.
I know this question is years old, but there's another reason multiple databases may be useful.
If you use a "cloud Redis" from your favourite cloud provider, you probably have a minimum memory size and will pay for what you allocate. If however your dataset is smaller than that, then you'll be wasting a bit of the allocation, and so wasting a bit of money.
Using databases you could use the same Redis cloud-instance to provide service for (say) dev, UAT and production, or multiple instances of your application, or whatever else - thus using more of the allocated memory and so being a little more cost-effective.
A use-case I'm looking at has several instances of an application which use 200-300K each, yet the minimum allocation on my cloud provider is 1M. We can consolidate 10 instances onto a single Redis without really making a dent in any limits, and so save about 90% of the Redis hosting cost. I appreciate there are limitations and issues with this approach, but thought it worth mentioning.
I am using redis for implementing a blacklist of email addresses , and i have different TTL values for different levels of blacklisting , so having different DBs on same instance helps me a lot .
Using multiple databases in a single instance may be useful in the following scenario:
Different copies of the same database could be used for production, development or testing using real-time data. People may use replica to clone a redis instance to achieve the same purpose. However, the former approach is easier for existing running programs to just select the right database to switch to the intended mode.
Our motivation has not been mentioned above. We use multiple databases because we routinely need to delete a large set of a certain type of data, and FLUSHDB makes that easy. For example, we can clear all cached web pages, using FLUSHDB on database 0, without affecting all of our other use of Redis.
There is some discussion here but I have not found definitive information about the performance of this vs scan and delete:
https://github.com/StackExchange/StackExchange.Redis/issues/873
I have an interesting situation where I'm near the end of an evaluation period for a RavenDB prototype for use with a project at our company. The reason it's interesting is that 99.99% of the time, I believe it fits Raven's sweet spot; it repeatedly queries for new data, often, and in small batches (< 1000 documents at a time).
However, we do have an initial load period, where we need to load two days' worth of data, which can be 3 million (or more) records in some cases.
A diagram might help:
It's the Transfer Service that is responsible for getting the correct data out of three production databases and storing it in RavenDB. The WCF service will query this data and make it available to its clients.
Once we do the initial load of millions of records/documents into RavenDB, we'll rarely have to do that again.
As an initial load test, on a machine with 4GB RAM and two processors, it took just over 23 minutes to read the initial data. In this case, it was only about 1.28 million records. I eliminated all async operations from this initial load, because I wanted each read to not be interfered with by other read operations. I found the best results this way.
I know it's not recommended, but to accomplish all this, I had to change settings that aren't recommended to be changed:
I had to increase the timeout:
documentStore.JsonRequestFactory.ConfigureRequest += (e, x) => ((HttpWebRequest)x.Request).Timeout = ravenTimeoutInMilliseconds;
In the Raven.Server.exe.config, I had to increase the page size (to int.MaxValue):
<add key="Raven/MaxPageSize" value="2147483647"/>
And in my retrieval methods, I had to use Take(int.MaxValue):
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
Remember this is all for that one-time, initial load. After that, it's many queries, quickly, and often. I should also note that each document is self-contained in RavenDB. There are no relationships to manage.
Knowing all this, is RavenDB a good fit?
A good fit for what?
Full text search? Yes. Background aggregations (map/reduce ones)? Yes. Easy replication and sharding, say scaling? Yes...
Ad-hoc reporting? No. Support for probably thousands of third party tools? No...
If you're talking about performance, you probably want to look at Orens latest post on that. His numbers are quite similar to your ones: http://ayende.com/blog/154913/ravendb-amp-freedb-an-optimization-story
From what I understand of your question, you need to "prep" the WCF web-service. To do this you read 1.2M docs from RavenDB (in about 23 mins) and hold them in memory, so the WCF service can then serve queries from them, is this right? Or am I missing something?
Why not get the WCF service to send it's queries to Raven one-at-a-time? I.e. for each query it gets from a Client, ask RavenDB to do the query for it?
From what you've told us in the other answers comments, I believe the only good way to serve the wcf clients fast enough, is to actually store everything in memory, so just the way you do it now.
The question, if RavenDB is a good fit for that situation depends on whether your data model benefits in others way from the document oriented nature. So, in case you have dynamic data that would require some kind of EAV in a relational databases and lots of joins, then RavenDB will probably be a very good solution. However, if you just need something you can throw flat data in, then I would go with a relational database here. In terms of licensing costs and ease of use, you might also want to take a look at PostgreSql, as this is a really awesome database that comes completely free.