I need to match tens of thousands of 4 byte strings with about one or more boolean values. I don't mind using up a whole word for the booleans if it means faster retrieval. However I have such tight constraints for my data I imagine there is some, albeit minor, optimizations that can be made if these are reported to the storage engine in advance. Does Redis have any way to take advantage of this?
Here is a sample of my data:
"DENL": false
"NLES": false
"NLUS": true
"USNL": true
"AEGB": true
"ITAE": true
"ITFR": false
The keys are the concatination of two ISO 3166-1 alpha-2 codes. As such they are guaranteed to be 4 uppercase English letters.
The data structures I have considered using are:
Hashes to map the 4 byte keys to a string representing the booleans
A separate set for each boolean value
And since my data only contains uppercase English letters and there are only 456976 possible combinations of those (which comes out to 56KB per bit stored per key):
One or more strings that are accessed with bitwise operations (GETBIT, BITFIELD) using a function to convert the key string to a bit index.
I think that sets are probably the most elegant solution and a binary string over all possible combinations will be the most efficient. I would like to know wheter there is there some kind of middle ground? Like a set with fixed length strings as members. I would expect a datatype optimized for fixed length strings would be able to provide faster searching than one optimized for variable length strings.
It is slightly better to use the 4-letter country-code-combination as a simple key, with an empty value.
The set data-type is really a hash map where the keys are the element and are added to the hash map with NULL value. I wouldn't use a set as this implies to hashes and two lookups into a hash map: the first for the set key in the database and the second for the hash internal to the set for the element.
Use the existence of the key as either "need customs declaration" or "does not need a customs declaration" as Tomasz says.
Using simple keys allows you to use the SET command with NX/XX conditions, which may be handy in your logic:
NX -- Only set the key if it does not already exist.
XX -- Only set the key if it already exists.
Use EXISTS command instead of GET as it is slightly faster (no type checking, no value fetching).
Another advantage of simple keys vs sets is to get the value of multiple keys at once using MGET:
> MGET DENL NLES NLUS
1) ""
2) ""
3) (nil)
To be able to do complex queries, assuming these are rare and not optimized for performance, you can use SSCAN (if you go with sets) or KEYS (if you go with simple keys). However, if you go with simple keys you better use a dedicated database, see SELECT.
To query for those with NL on the left side you would use:
KEYS NL??
There are a couple of optimizations you could try:
use a set and treat all values as either "need customs declaration" or "does not need a customs declaration" - depending which one has fewer values; then with SISMEMBER you can check if your key is in that set which gives you the correct answer,
have a look at introduction to Redis data types, chapter "Bitmaps" - if you pre-define all of your keys in some array you can use SETBIT and GETBIT operations to store the flag "needs customs declaration" for given bit number (index in array).
Related
I have a dozen of REDIS Keys of the type SET, say
PUBSUB_USER_SET-1-1668985588478915880,
PUBSUB_USER_SET-2-1668985588478915880,
PUBSUB_USER_SET-3-1668988644477632747,
.
.
.
.
PUBSUB_USER_SET-10-1668983464477632083
The set contains a userId and the problem statement is to check if the user is present in any of the set or not
The solution I tried is to get all the keys and append with a delimiter (, comma) and pass it as an argument to lua script wherein with gmatch operator I split the keys and run sismember operation until there is a hit.
local vals = KEYS[1]
for match in (vals..","):gmatch("(.-)"..",") do
local exist = redis.call('sismember', match, KEYS[2])
if (exist == 1) then
return 1
end
end
return 0
Now as and when the number of keys grows to PUBSUB_USER_SET-20 or PUBSUB_USER_SET-30 I see an increase in latency and in throughput.
Is this the better way to do or Is it better to batch LUA scripts where in instead of passing 30keys as arguments I pass in batches of 10keys and return as soon as the user is present or is there any better way to do this?
I would propose a different solution instead of storing keys randomly in a set. You should store keys in one set and you should query that set to check whether a key is there or not.
Lets say we've N sets numbered s-0,s-1,s-2,...,s-19
You should put your keys in one of these sets based on their hash key, which means you need to query only one set instead of checking all these sets. You can use any hashing algorithm.
To make it further interesting you can try consistent hashing.
You can use redis pipeline with batching(10 keys per iteration) to improve the performance
I'm considering redis for my next project (in-memory, fast) but now I have the problem of figuring out how and if at all it could actually achieve my goal. The goal is to store "large" (millions) amount of fixed-length bit strings and then searching over the database with a input (query) bit string. Search means to return everything which fulfills below condition:
query & value = query
eg. if all bits set in the query are also set in the value return that key eg. bloom-filter albeit in my domain of work it isn't usually called like that.
I found the module RedisBloom but I already have my bloom filter (bit strings) available from external program and would simply like to use RedisBloom for storage of them and searching (exists command). therefore in my case the "Add" command should take the input as is and not hash it again.
Is that possible? And if not other suggestions?
Nope, that isn't possible as RedisBloom is a "black box" in that sense - it manages its own data structures.
It would have to return the portion necessary to uniquely identify the row even if a select statement didn't return all rows, of course, to be of any use. And I'm not sure how it would work if the uuid column were not part of a pk/index and was repeated.
Does this exist?
I think you would have to decide what constitutes uniquely identifiable by assuming that a number of places from the right make it uniquely identifiable. I think this is folly but the way you would do that is something like this:
SELECT RIGHT(uuid_column_name::text, 7) as your_truncated_uuid FROM table_with_uuid_column;
That takes the 7 places from the right of the text value of the uuid column.
No, there is not. A UUID is a hex representation of a 120 bit random number, at least the v4 variant. It's not even guaranteed to be unique though it likely is.
You have a few options to implement this:
shave off characters and hope you don't introduce a collision. For instance, if you make d8366842-8c1d-4a31-a4c0-f1765b8ab108 d8366842, you have 16**8 possible combinations, or 4,294,967,296. how likely is your dataset to have a collision with 4.2 billion (2**32) possibilities? Perhaps you can add 8c1d back in to make it 16**12 or 28,147,497,6710,656 possibilities.
process and hash each row looking for collisions and recursively increase the frame of characters until no collisions are found, or hash every possible permutation.
That all said, another idea is to use ints and not uuids and then to use http://hashids.org/ which has a plugin for PostgreSQL. This is the method YouTube uses afaik.
How can I query my sorted set to get all keys containing some characters?
"Starts with" works fine but I need "contains".
I am using below query for "start with" which works fine
zrangebylex zset [2110 "[2110\xff" LIMIT 0 10
Is there any way we can do \xff query \xff ?
No. The lexicographical range for Redis' Sorted Sets can only be used for prefix searches.
Note that by using another Sorted Set that stores the reverse of the values you can also perform a suffix search on the values. However, even combining these two approaches will not provide the functionality you need.
Alternatively, you could perform a prefix search and then filter the results using a Lua script. Depending on your queries and data, this may or may not be an effective approach.
You could, also, consider implementing a full text indexing mechanism on top of Redis but that would be an overkill in most cases and besides, there are existing tested technologies that already do that.
But you can use ZSCAN with a glob-style pattern, for example to get all the strings which contains the characters "s" and/or "a":
ZSCAN key 0 MATCH *[sa]*
From the ZRANGEBYLEX original documentation (also look zzlCompareElements function realization in source code):
The elements are considered to be ordered from lower to higher strings as compared byte-by-byte using the memcmp() C function. Longer strings are considered greater than shorter strings if the common part is identical.
From memcmp documentation:
memcmp compares the first num bytes of the block of memory pointed by ptr1 to the first num bytes pointed by ptr2, returning zero if they all match or a value different from zero representing which is greater if they do not.
So you cant use zrangebylex with contains query. And I'm afraid there is not any "lite" workaround. "Lite" - without redis sourfce patching.
Using keys I can query the keys as you can see below:
redis> set popo "pepe"
OK
redis> set coco "kansas"
OK
redis> set cool "rock"
OK
redis> set cool2 "punk"
OK
redis> keys *co*
1) "cool2"
2) "coco"
3) "cool"
redis> keys *ol*
1) "cool2"
2) "cool"
Is there any way to get the values instead of the keys? Something like: mget (keys *ol*)
NOTICE: As others have mentioned, along with myself in the comments on the original question, in production environments KEYS should be avoided. If you're just running queries on your own box and hacking something together, go for it. Otherwise, question if REDIS makes sense for your particular application, and if you really need to do this - if so, impose limits and avoid large blocking calls, such as KEYS. (For help with this, see 2015 Edit, below.)
My laptop isn't readily available right now to test this, but from what I can tell there isn't any native commands that would allow you to use a pattern in that way. If you want to do it all within redis, you might have to use EVAL to chain the commands:
eval "return redis.call('MGET', unpack(redis.call('KEYS', KEYS[1])))" 1 "*co*"
(Replacing the *co* at the end with whatever pattern you're searching for.)
http://redis.io/commands/eval
Note: This runs the string as a Lua script - I haven't dove much into it, so I don't know if it sanitizes the input in any way. Before you use it (especially if you intend to with any user input) test injecting further redis.call functions in and see if it evaluates those too. If it does, then be careful about it.
Edit: Actually, this should be safe because neither redis nor it's lua evaluation allows escaping the containing string: http://redis.io/topics/security
2015 Edit: Since my original post, REDIS has released 2.8, which includes the SCAN command, which is a better fit for this type of functionality. It will not work for this exact question, which requests a one-liner command, but it's better for all reasonable constraints / environments.
Details about SCAN can be read at http://redis.io/commands/scan .
To use this, essentially you iterate over your data set using something like scan ${cursor} MATCH ${query} COUNT ${maxPageSize} (e.g. scan 0 MATCH *co* COUNT 500). Here, cursor should always be initialized as 0.
This returns two things: first is a new cursor value that you can use to get the next set of elements, and second is a collection of elements matching your query. You just keep updating cursor, calling this query until cursor is 0 again (meaning you've iterated over everything), and push the found elements into a collection.
I know SCAN sounds like a lot more work, but I implore you, please use a solution like this instead of KEYS for anything important.