Using large strings as Redis keys with a list of values

We have an Antlr parsing API which returns the given columns being accessed in a query. Only ~1/100 queries are unique so we are looking at using Redis as a caching layer to get a drastic speed increase. Some queries are thousands of lines long and take a full second or more to parse. Our estimated volume is in the hundreds of millions so we cannot afford to parse any duplicates.
Looking at Redis (and using the Python redis client), should I hash each query text with something like MD5, use that as the key, and use RPUSH to store the columns for that query as a list?
Or is hashing beforehand in that manner a waste of time? I'm also looking at Redis' own hashing functions like HMSET, but it does not look like there is a great way to store a list as a value for a key.

Your basic idea is a good one, but if you're just using this for caching there's little point in dealing with Redis lists. Those are used when you want to operate on the data within Redis itself (inserting new elements into the list, etc.). Instead you can just use regular GET and SET.
Specifically, use the hash as the key and some encoded form of the data (JSON, or whatever you like) as the value. It's possible you could skip the hashing step (Redis allows keys up to 512MB), but if queries are "thousands of lines long" that would eat up your cache memory and make serialization and transmission significantly slower.
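A minimal sketch of that approach, assuming a client object that behaves like `redis.Redis` from the Python redis client (only `get` and `set` are used). SHA-256 is swapped in for MD5 here. The `colcache:` prefix, the function names, and the `parse_columns` callback are hypothetical, not part of the original setup.

```python
import hashlib
import json


def cache_key(query_text: str) -> str:
    """Derive a short, fixed-length key from an arbitrarily long query."""
    digest = hashlib.sha256(query_text.encode("utf-8")).hexdigest()
    return "colcache:" + digest  # hypothetical key prefix


def get_columns(client, query_text, parse_columns):
    """Return the columns for a query, parsing only on a cache miss.

    `client` is assumed to behave like redis.Redis (get/set);
    `parse_columns` stands in for the slow parsing call.
    """
    key = cache_key(query_text)
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)  # value was stored as a JSON-encoded list
    columns = parse_columns(query_text)
    client.set(key, json.dumps(columns))
    return columns
```

With a real server this would be called as `get_columns(redis.Redis(), sql_text, parser)`; the key stays 73 characters long no matter how many lines the query has.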

Related

Why is RedisJSON slower than normal Redis commands?

I have a few objects which I'm storing in Redis by stringifying them and also storing as direct objects using RedisJSON.
But one thing I observed is that reading the objects using JSON.GET is slower (almost 3 times slower) compared to REDIS.GET.
But obviously I need to parse the data that is read back using REDIS.GET.
Could somebody explain why there is such a huge difference in performance?

Does splitting data between databases on the same instance increase the performance of Redis searches?

My question is:
I have a single instance of Redis with multiple databases (one for each service).
If more than one service used the same database, would the prefix search be slower? (That is, having the data of all the services in one place and having to go through all of it, as opposed to only going through the selected database.)
Partitioning in Redis serves two main goals:
It allows for much larger databases, using the sum of the memory of many computers. Without partitioning you are limited to the amount of memory a single computer can support.
It allows scaling the computational power to multiple cores and multiple computers, and the network bandwidth to multiple computers and network adapters.
Ref
Integration through database should be avoided, read more here: https://martinfowler.com/bliki/IntegrationDatabase.html
If you use the KEYS command for it, then read this from the documentation:
Warning: consider KEYS as a command that should only be used in production environments with extreme care. It may ruin performance when it is executed against large databases. This command is intended for debugging and special operations, such as changing your keyspace layout. Don't use KEYS in your regular application code. If you're looking for a way to find keys in a subset of your keyspace, consider using SCAN or sets.
Redis doesn't support a prefix index, but you can use a sorted set to do prefix search to some degree.
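One way to sketch the sorted-set approach: store every key name as a member with the same score, so that Redis orders the members lexicographically, then ask for a lexicographic range with ZRANGEBYLEX. This assumes a client that behaves like `redis.Redis`; the index name `key_index` and both helper names are hypothetical, and the upper bound assumes the names are UTF-8 text (the byte 0xff never occurs in UTF-8).

```python
INDEX = "key_index"  # hypothetical name of the sorted set holding key names


def index_key(client, name: str) -> None:
    """Register a key name; equal scores make Redis sort members by bytes."""
    client.zadd(INDEX, {name: 0})


def keys_with_prefix(client, prefix: str):
    """Return the indexed names that start with `prefix` via ZRANGEBYLEX."""
    raw = prefix.encode("utf-8")
    low = b"[" + raw             # inclusive lower bound: the prefix itself
    high = b"[" + raw + b"\xff"  # inclusive upper bound: prefix + highest byte
    return client.zrangebylex(INDEX, low, high)
```

The index has to be kept in step with the real keys (add on write, ZREM on delete), which is the cost of getting a range lookup instead of a scan over the whole keyspace.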
Redis is an in-memory store, so all reads are pretty fast as long as you model the data the right way.
Better to query multiple times than to do integration through the database.
By the way, if you have multiple services owning the same data, then you should probably model your services differently.
In general, I would avoid key prefix searching. Redis isn't a standard database and key searches are slow.
Since Redis is a key/value store, it's optimized as such.
To take advantage of Redis you want to hit the desired key directly.
I expect key search time increases with the total number of keys, so splitting the database would potentially reduce the key search time.
However, if you're doing key searches, I would put the desired keys into a key list and just do a direct lookup there.
desired_prefix = ["desired_prefix-a", "desired_prefix-b", ...]
lpush "prefix_x_keys" "x-a"
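In Python, the key-list idea might look like the following sketch, assuming a client that behaves like `redis.Redis`. The helper names and the `prefix_<name>_keys` naming are hypothetical. It keeps the list (LPUSH) used above; a Redis set (SADD) would avoid duplicate entries if the same key is written twice.

```python
def set_with_index(client, prefix: str, suffix: str, value: str) -> None:
    """Write a value and record its key in a per-prefix list."""
    key = f"{prefix}-{suffix}"
    client.set(key, value)
    client.lpush(f"prefix_{prefix}_keys", key)


def get_all_for_prefix(client, prefix: str):
    """Fetch every value for a prefix by direct lookup, with no key search."""
    keys = client.lrange(f"prefix_{prefix}_keys", 0, -1)
    if not keys:
        return {}  # MGET needs at least one key
    return dict(zip(keys, client.mget(keys)))
```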
If you run multiple databases on a single instance, they will all compete for the same resources and memory, plus they will also run their core process, which will not help you.
You can think of it like this: two OSes running on the same machine will never match the performance of a single OS utilizing all resources of the machine.
What you can do to increase performance is make more tables or use the partition concept. This avoids putting too much data in a single table, so the search will work faster.

Why Spark SQL considers the support of indexes unimportant?

Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?
Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type to take forever instead of returning immediately.
I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering then why the Spark SQL team considers indexes unimportant to a degree that it's off their road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
Indexing input data
The fundamental reason why indexing over external data sources is not in the Spark scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using, it cannot reliably monitor changes and as a consequence cannot maintain indices.
If a data source supports indexing, it can be indirectly utilized by Spark through mechanisms like predicate pushdown.
Indexing Distributed Data Structures:
Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing and maintaining indices. This is a common pattern used by different in-memory columnar systems.
That being said, some forms of indexed structures do exist in the Spark ecosystem. Most notably, Databricks provides Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random access support.
Of course this raises a question: if you require efficient random access, why not use a system which is designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".
So what is an index, anyway?
Back in the days when permanent storage was crazy expensive (instead of essentially free) relational database systems were all about minimizing usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read the entire table to find what you want, this is very costly, because you have to scan so many tables.
But this is not the only way to do things. If you didn't need to have fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.
40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?
Do the math and think about it -- it means if you are reading less than 300KB per disk head move, you are throttling the throughput of your drive.
Seriously. Think about that a second.
The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.
If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.
Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. 6 total disk head movements (I'll assume the index is cached in memory, so no disk movement). That means I can read 1.8 MB of linear / de-normalized customer records and be just as efficient.
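The arithmetic behind those figures can be checked quickly. The sketch below uses only the round numbers quoted above (350 head movements per second, about 100 MB per second of sequential reads); the variable names are just for illustration.

```python
SEEKS_PER_SECOND = 350               # head movements per second, per disk
SEQUENTIAL_BYTES_PER_SECOND = 100e6  # about 100 MB/s of sequential reading

# Data the drive could have streamed in the time one head movement takes.
break_even_bytes = SEQUENTIAL_BYTES_PER_SECOND / SEEKS_PER_SECOND
print(f"break-even read per seek: {break_even_bytes / 1e3:.0f} KB")  # 286 KB

# The text rounds this up to 300 KB. Six seeks for a normalized customer
# record then cost as much as one sequential read of about 1.8 MB.
rounded_break_even = 300e3
equivalent_sequential_bytes = 6 * rounded_break_even
print(f"6 seeks ~ {equivalent_sequential_bytes / 1e6:.1f} MB sequential")  # 1.8 MB
```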
And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.
What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
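As a toy illustration of that idea (plain Python, not Spark code), the sketch below groups records by customer so that a history lookup touches a single partition instead of the whole data set. The partition count, the record shape, and the function names are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 8  # hypothetical partition count


def partition_for(customer_id: str) -> int:
    """Stable mapping from customer to partition (crc32 is deterministic)."""
    return zlib.crc32(customer_id.encode("utf-8")) % NUM_PARTITIONS


def build_partitions(records):
    """Group (customer_id, event) pairs so each customer lives in one partition."""
    partitions = defaultdict(list)
    for customer_id, event in records:
        partitions[partition_for(customer_id)].append((customer_id, event))
    return partitions


def customer_history(partitions, customer_id: str):
    """Read one partition sequentially instead of scanning everything."""
    bucket = partitions.get(partition_for(customer_id), [])
    return [event for cid, event in bucket if cid == customer_id]
```

The lookup still scans, but only within the one partition that can contain the customer, which is the trade the answer argues for.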
Tell me again why you want indexes.
Indexes vs ___ ?
Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?
Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.
If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records stored in a way that optimizes those searches. Like I said, use the disk like your in-memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache so all you have to do is read it. In fact don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
If you think that sounds crazy, consider this -- if you cache it in memory you're probably going to cache it twice. It's likely your OS / drive controller uses main memory as cache. Don't bother caching the data because someone else is already!
But I digress...
Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.

SCAN vs KEYS performance in Redis

A number of sources, including the official Redis documentation, note that using the KEYS command is a bad idea in production environments due to possible blocking. If the approximate size of the dataset is known, does SCAN have any advantage over KEYS?
For example, consider a database with at most 100 keys of the form data:number:X where X is an integer. If I want to retrieve all of these, I might use the command KEYS data:number:*. Is this going to be significantly slower than using SCAN 0 MATCH data:number:* COUNT 100? Or are the two commands essentially equivalent in this circumstance? Would it be accurate to say that SCAN is preferable to KEYS because it protects against the scenario where an unexpectedly large set would be returned?
You shouldn't care about the execution time of the current command alone but about its impact on all other commands, since Redis processes commands using a single thread (i.e. while a command is being executed, all others need to wait until it finishes).
While keys or scan might provide you similar or identical performance executed alone in your case, some milliseconds blocking Redis will significantly decrease overall I/O.
This is the main reason to use KEYS for development purposes and SCAN in production environments.
OP said:
"While keys or scan might provide you similar or identical performance executed alone in your case, some milliseconds blocking Redis will significantly decrease overall I/O." - This sentence seems to indicate that one command blocks Redis, and the other doesn't, which can't be the case. If I am guaranteed 100 results from my call to KEYS, in what way is it worse than SCAN? Why do you feel that one command is more prone to blocking?
There should be a good difference when you can paginate the search. Being forced to get 100 keys in a single pass is not the same as being able to paginate and get 100 keys 10 at a time (or 50 and 50). This very small interruption can let other commands sent by the application layer be processed by Redis. See what the official Redis documentation says about this:
Since these commands allow for incremental iteration, returning only a small number of elements per call, they can be used in production without the downside of commands like KEYS or SMEMBERS that may block the server for a long time (even several seconds) when called against big collections of keys or elements.
The answer is in the SCAN documentation:
Since these commands allow for incremental iteration, returning only a small number of elements per call, they can be used in production without the downside of commands like KEYS or SMEMBERS that may block the server for a long time (even several seconds) when called against big collections of keys or elements.
So ask for small chunks of data rather than getting the whole of it.
Also, as Matías Fidemraizer pointed out, Redis is single threaded and KEYS is a blocking call, so it blocks any incoming requests until the execution of KEYS is done.
Whether your data is small or not, it never hurts to apply best practices.
There is no performance difference between KEYS and SCAN other than pagination (COUNT), where the amount of bytes transferred (I/O) from Redis to the client is controlled by the page size.
The COUNT option itself has its own behavior: sometimes a call returns no data while the scan cursor is still nonzero, so you will get data in later iterations. So COUNT should be a reasonable amount, say 200 up to some maximum, to avoid many round trips. I think this value depends on the total number of keys in your db.
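A minimal sketch of that cursor loop with the Python redis client, assuming a client that behaves like `redis.Redis` (whose `scan` call returns the next cursor and a batch of keys). The function name and the default of 100 are hypothetical.

```python
def scan_keys(client, pattern: str, count: int = 100):
    """Collect keys matching `pattern` with SCAN, one small batch per call."""
    cursor = 0
    found = set()  # SCAN may report the same key more than once
    while True:
        # Each call occupies Redis only briefly, so other clients' commands
        # can run between calls. COUNT is a hint: a batch may be empty even
        # though the iteration is not finished yet.
        cursor, batch = client.scan(cursor=cursor, match=pattern, count=count)
        found.update(batch)
        if cursor == 0:  # a returned cursor of 0 means the iteration is done
            break
    return found
```

For the example in the question this would be called as `scan_keys(client, "data:number:*", count=100)`.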
There is no difference between using SCAN within Lua and using KEYS: though there is no I/O involved, both still block other calls until the entire big collection has been iterated. I haven't tried this; it is only my guess.

Implementing a Triplestore Atom

I'm trying to implement my own triplestore on top of a SQL database (yes I know there are finished projects out there) and I'm trying to decide on the best way to implement a symbolic "atom".
In a naive design, we might implement a triplestore in SQL by creating a single "triple" table with three varchar columns called subject, predicate, object. To save space I was going to create an "atom" table, that would store the unique text used in any subject/predicate/object field, and change those fields to foreign keys linking back to the atoms that contain their text.
However, I see a couple ways to implement the Atom table.
Store the text as a varchar.
Pros: Simple to index and enforce uniqueness of the text.
Cons: It could not store arbitrarily large text.
Store the text as a text blob, as well as a hash of the text to use when querying and enforcing uniqueness.
Pros: Could store arbitrarily large text.
Cons: A little more complicated. Possible, albeit rare, hash collisions depending on the hashing algorithm (md5, sha, etc).
Which is the better approach in terms of performance, long-term reliability, and ability to store any type of data? If I use a hash, is there a valid concern about collisions? Even if collisions are rare, it would only have to happen once to corrupt the triplestore.
Don't waste any time trying to optimize this until you can prove that it's a bottleneck and is the most important thing to fix.
"To save space..." don't. Space is almost free. Unless you have over a terabyte of data, you don't have much to worry about. You can easily waste more time thinking about storage than the storage is worth.
The varchar solution will work and scale fine. The idea of a "string pool" or "atom table" is actually a good one because you'll have lots of references to the same underlying object. Why repeat the varchar? Why not just repeat an index number?
"Arbitrarily large text" is a strange requirement. Why bother?
A blob will generally be slower. The hash collision -- while little more than a theoretical concern -- is something you handle two ways. First, use a hash with more than 32 bits. Second, a collision won't corrupt anything unless you (foolishly) fail to check the actual blobs to see if they're actually the same. If you want to avoid comparing the entire blob to confirm that there are no collisions, keep two hashes by different algorithms.
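A rough sketch of the "check the actual blobs" variant, using Python's built-in sqlite3 module purely for illustration. The table layout, the function names, and the choice of SHA-256 are assumptions, not something taken from the question. The hash column is indexed but not unique, and the stored text is compared before an existing atom is reused, so a collision produces two rows rather than corrupting anything.

```python
import hashlib
import sqlite3


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def create_atom_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE atom ("
        "id INTEGER PRIMARY KEY, hash TEXT NOT NULL, text TEXT NOT NULL)"
    )
    # Non-unique index: two different texts may (in theory) share a hash.
    conn.execute("CREATE INDEX atom_hash ON atom (hash)")


def intern_atom(conn: sqlite3.Connection, text: str, hash_fn=sha256_hex) -> int:
    """Return the id of the atom for `text`, inserting it if it is new."""
    digest = hash_fn(text)
    rows = conn.execute(
        "SELECT id, text FROM atom WHERE hash = ?", (digest,)
    ).fetchall()
    for atom_id, stored in rows:
        if stored == text:  # compare the real text, not just the hash
            return atom_id
    cur = conn.execute(
        "INSERT INTO atom (hash, text) VALUES (?, ?)", (digest, text)
    )
    return cur.lastrowid
```

The `hash_fn` parameter exists only so that a collision can be forced in a test; the triple table would then store three integer ids referencing `atom.id`.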