Does ZRANGEBYLEX support contains query? - redis

How can I query my sorted set to get all keys containing some characters?
"Starts with" works fine but I need "contains".
I am using below query for "start with" which works fine
zrangebylex zset [2110 "[2110\xff" LIMIT 0 10
Is there any way we can do a "contains" query?

No. The lexicographical range for Redis' Sorted Sets can only be used for prefix searches.
Note that by using another Sorted Set that stores the reverse of the values you can also perform a suffix search on the values. However, even combining these two approaches will not provide the functionality you need.
Alternatively, you could perform a prefix search and then filter the results using a Lua script. Depending on your queries and data, this may or may not be an effective approach.
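For illustration, here is a minimal sketch of that idea as a Lua script called from Jedis (the key name zset and the [2110 bound come from the question; "abc" is a hypothetical substring to filter on, and (2111 is an exclusive upper bound equivalent to the [2110\xff trick for this numeric prefix):

import redis.clients.jedis.Jedis;

public class PrefixThenFilter {
    // Run the prefix range server-side, then keep only members that
    // contain the needle, all inside one EVAL round trip.
    private static final String SCRIPT =
        "local hits = redis.call('ZRANGEBYLEX', KEYS[1], ARGV[1], ARGV[2]) " +
        "local out = {} " +
        "for _, member in ipairs(hits) do " +
        "  if string.find(member, ARGV[3], 1, true) then " + // plain-text find
        "    table.insert(out, member) " +
        "  end " +
        "end " +
        "return out";

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            Object matches = jedis.eval(SCRIPT, 1, "zset", "[2110", "(2111", "abc");
            System.out.println(matches);
        }
    }
}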
You could also consider implementing a full-text indexing mechanism on top of Redis, but that would be overkill in most cases; besides, there are existing, tested technologies that already do that.

But you can use ZSCAN with a glob-style pattern, for example to get all the strings that contain the characters "s" and/or "a":
ZSCAN key 0 MATCH *[sa]*
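A minimal sketch of driving that scan from Jedis (3.x-style API; the key name zset is hypothetical). ZSCAN is cursor-based and MATCH is applied per page, so you have to loop until the cursor returns to 0, and individual pages may come back empty:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;
import redis.clients.jedis.Tuple;

public class ZscanContains {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            ScanParams params = new ScanParams().match("*[sa]*");
            String cursor = ScanParams.SCAN_POINTER_START; // "0"
            do {
                ScanResult<Tuple> page = jedis.zscan("zset", cursor, params);
                for (Tuple t : page.getResult()) {
                    System.out.println(t.getElement()); // matching member
                }
                cursor = page.getCursor();
            } while (!cursor.equals(ScanParams.SCAN_POINTER_START));
        }
    }
}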

From the original ZRANGEBYLEX documentation (see also the implementation of the zzlCompareElements function in the source code):
The elements are considered to be ordered from lower to higher strings as compared byte-by-byte using the memcmp() C function. Longer strings are considered greater than shorter strings if the common part is identical.
From memcmp documentation:
memcmp compares the first num bytes of the block of memory pointed by ptr1 to the first num bytes pointed by ptr2, returning zero if they all match or a value different from zero representing which is greater if they do not.
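To make the consequence concrete, here is the same ordering rule mirrored in Java (a sketch for illustration, not Redis' actual code): members are compared as unsigned bytes, and when the common part is identical the shorter member sorts first. Only a shared prefix pins down a contiguous range, which is why "contains" cannot be expressed as a lexicographical range.

import java.nio.charset.StandardCharsets;

public class MemcmpOrder {
    static int compare(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF); // unsigned, like memcmp
            if (diff != 0) return diff;
        }
        return x.length - y.length; // identical common part: longer is greater
    }

    public static void main(String[] args) {
        System.out.println(compare("2110", "2110a") < 0); // true: prefix sorts first
        System.out.println(compare("21x0", "2110") > 0);  // true: differs at third byte
    }
}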
So you can't use ZRANGEBYLEX for a "contains" query, and I'm afraid there is no "lite" workaround, where "lite" means anything short of patching the Redis source.

Related

Fixed length data structures in redis

I need to match tens of thousands of 4-byte strings with one or more boolean values each. I don't mind using up a whole word for the booleans if it means faster retrieval. However, my data has such tight constraints that I imagine some, albeit minor, optimizations could be made if the storage engine knew about them in advance. Does Redis have any way to take advantage of this?
Here is a sample of my data:
"DENL": false
"NLES": false
"NLUS": true
"USNL": true
"AEGB": true
"ITAE": true
"ITFR": false
The keys are the concatenation of two ISO 3166-1 alpha-2 codes. As such, they are guaranteed to be 4 uppercase English letters.
The data structures I have considered using are:
Hashes to map the 4 byte keys to a string representing the booleans
A separate set for each boolean value
And since my data only contains uppercase English letters, and there are only 456,976 possible combinations of those (which comes out to about 56 KB per boolean stored across all keys):
One or more strings that are accessed with bitwise operations (GETBIT, BITFIELD) using a function to convert the key string to a bit index.
I think that sets are probably the most elegant solution, and a binary string over all possible combinations would be the most efficient. I would like to know whether there is some kind of middle ground, like a set with fixed-length strings as members. I would expect a data type optimized for fixed-length strings to provide faster searching than one optimized for variable-length strings.
It is slightly better to use the 4-letter country-code-combination as a simple key, with an empty value.
The set data type is really a hash map where the elements are the keys, added to the hash map with a NULL value. I wouldn't use a set, as this implies two hashes and two lookups: the first for the set's key in the database, and the second, internal to the set, for the element.
Use the existence of the key to mean either "needs a customs declaration" or "does not need a customs declaration", as Tomasz says.
Using simple keys allows you to use the SET command with NX/XX conditions, which may be handy in your logic:
NX -- Only set the key if it does not already exist.
XX -- Only set the key if it already exists.
Use the EXISTS command instead of GET, as it is slightly faster (no type checking, no value fetching).
Another advantage of simple keys vs sets is to get the value of multiple keys at once using MGET:
> MGET DENL NLES NLUS
1) ""
2) ""
3) (nil)
To be able to do complex queries, assuming these are rare and not optimized for performance, you can use SSCAN (if you go with sets) or KEYS (if you go with simple keys). However, if you go with simple keys, you had better use a dedicated database; see SELECT.
To query for those with NL on the left side you would use:
KEYS NL??
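A minimal sketch of the plain-key approach with Jedis (3.x-style API; key names taken from the question's sample data):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;
import java.util.List;

public class CustomsKeys {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Create the key only if it does not exist yet (NX).
            jedis.set("NLUS", "", SetParams.setParams().nx());
            // Existence of the key is the boolean; no GET needed.
            boolean flagged = jedis.exists("NLUS");
            // Batch several lookups in one round trip; nil means not flagged.
            List<String> values = jedis.mget("DENL", "NLES", "NLUS");
            System.out.println(flagged + " " + values);
        }
    }
}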
There are a couple of optimizations you could try:
use a set and let membership mean either "needs a customs declaration" or "does not need a customs declaration", depending on which has fewer values; then SISMEMBER can check whether your key is in that set, which gives you the correct answer,
have a look at the introduction to Redis data types, chapter "Bitmaps": if you pre-define all of your keys in some array, you can use the SETBIT and GETBIT operations to store the "needs customs declaration" flag at a given bit number (its index in the array), as sketched below.
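And a sketch of that bitmap idea in Java with Jedis (the key name customs:declaration is hypothetical): each 4-uppercase-letter code maps to a base-26 number in [0, 456976), which serves as the bit offset, so all flags fit in one string of about 56 KB.

import redis.clients.jedis.Jedis;

public class CustomsBitmap {
    // Map a 4-uppercase-letter code to a bit index: one base-26 digit per letter.
    static long bitIndex(String code) {
        long idx = 0;
        for (int i = 0; i < 4; i++) {
            idx = idx * 26 + (code.charAt(i) - 'A');
        }
        return idx; // in [0, 26^4) = [0, 456976)
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.setbit("customs:declaration", bitIndex("NLUS"), true);
            boolean flagged = jedis.getbit("customs:declaration", bitIndex("NLUS"));
            System.out.println(flagged); // true
        }
    }
}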

RedisBloom: Option to add items (bit strings) as is with no hashing?

I'm considering Redis for my next project (in-memory, fast), but now I have the problem of figuring out how, and if at all, it could actually achieve my goal. The goal is to store a "large" (millions) number of fixed-length bit strings and then search the database with an input (query) bit string, returning everything that fulfills the condition below:
query & value = query
i.e. if all bits set in the query are also set in the value, return that key. This is essentially a Bloom filter, although in my domain of work it isn't usually called that.
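In code, the condition is just a mask test; a minimal sketch over 64-bit blocks (my real bit strings are longer, so picture this applied block by block):

public class MaskTest {
    static boolean contains(long query, long value) {
        return (query & value) == query; // every bit set in query is also set in value
    }

    public static void main(String[] args) {
        System.out.println(contains(0b0101L, 0b1101L)); // true
        System.out.println(contains(0b0101L, 0b1001L)); // false: bit 2 missing
    }
}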
I found the RedisBloom module, but I already have my Bloom filters (bit strings) available from an external program and would simply like to use RedisBloom to store and search them (the EXISTS command). Therefore, in my case, the "Add" command should take the input as-is and not hash it again.
Is that possible? And if not other suggestions?
Nope, that isn't possible as RedisBloom is a "black box" in that sense - it manages its own data structures.

Is there a Postgres feature or built-in function that limits the display of uuids to only that needed to make them uniquely identifiable?

It would have to return the portion necessary to uniquely identify the row even if a select statement didn't return all rows, of course, to be of any use. And I'm not sure how it would work if the uuid column were not part of a pk/index and was repeated.
Does this exist?
I think you would have to decide what constitutes "uniquely identifiable" by assuming that some number of places from the right is enough. I think this is folly, but the way you would do it is something like this:
SELECT RIGHT(uuid_column_name::text, 7) as your_truncated_uuid FROM table_with_uuid_column;
That takes the 7 places from the right of the text value of the uuid column.
No, there is not. A UUID is a hex representation of a 128-bit value, 122 bits of which are random in the v4 variant. It's not even guaranteed to be unique, though it likely is.
You have a few options to implement this:
shave off characters and hope you don't introduce a collision. For instance, if you shorten d8366842-8c1d-4a31-a4c0-f1765b8ab108 to d8366842, you have 16**8 possible combinations, or 4,294,967,296. How likely is your dataset to have a collision with 4.2 billion (2**32) possibilities? (See the estimate sketched after this list.) Perhaps you can add 8c1d back in to make it 16**12, or 281,474,976,710,656 possibilities.
process and hash each row looking for collisions, recursively widening the window of characters until no collisions are found, or hash every possible permutation.
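To put a number on the collision question in the first option, here is a quick birthday-bound estimate, p ≈ 1 - exp(-n(n-1)/2N), with a hypothetical row count:

public class TruncationCollision {
    public static void main(String[] args) {
        double n = 100_000;         // hypothetical number of rows
        double N = Math.pow(16, 8); // keeping 8 hex chars => 2^32 buckets
        double p = 1 - Math.exp(-n * (n - 1) / (2 * N));
        System.out.printf("p(collision) ~= %.3f%n", p); // ~0.69 for 100k rows
    }
}

So even at 100,000 rows, an 8-character prefix is more likely than not to collide; the window has to grow with the table.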
That all said, another idea is to use ints instead of UUIDs and then use http://hashids.org/, which has a plugin for PostgreSQL. This is the method YouTube uses, AFAIK.

How to get all hashes in foo:* using a single id counter instead of a set/array

Introduction
My domain has articles, which have a title and text. Each article has revisions (like the SVN concept), so every time it is changed/edited, those changes are stored as a revision. A revision is composed of the changes and a description of those changes.
I want to be able to obtain all revisions descriptions at once.
What's the problem?
I'm certain that I would store each revision as a hash at articles:revisions:<id>, storing the changes and the description in it.
What I'm not certain of is how do I get all of the descriptions at once.
I have many options to do this, but none of them convinces me.
Store the revision ids for an article as a set, and use SORT articles:revisions:idSet BY NOSORT GET articles:revisions:*->description. This means that I would store a set for each article. If every article had 50 revisions, and we had 10,000 articles, we would have 500,000 ids stored.
Is this the best way? Isn't this eating up too much RAM?
I have other ideas in mind, but I don't consider them good either.
Iterate from 0 to the last revision's id, doing an HGET for each id using MULTI
Create the idSet for a specific article if it doesn't exist when it is requested, and expire it after some time.
Isn't there a way for Redis to do SORT array BY NOSORT GET, with array being an ad-hoc array of the form [0, MAX]?
Seems like you have a good solution.
As long as you keep those id numbers below 10,000 and your sets under 512 elements (set-max-intset-entries), the sets are stored as compact intsets and your memory consumption will be much lower than you think.
Here's a good explanation of it.
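For completeness, here is that approach driven from Jedis (3.x-style API; the set key name follows the question, and the GET pattern dereferences each id into its revision hash):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.SortingParams;
import java.util.List;

public class AllDescriptions {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            SortingParams params = new SortingParams()
                .nosort()                                  // skip the sorting step
                .get("articles:revisions:*->description"); // '*' is replaced by each id
            List<String> descriptions = jedis.sort("articles:revisions:idSet", params);
            System.out.println(descriptions);
        }
    }
}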
This can be solved more efficiently with a trie or DAWG than with what Redis provides. I don't know your application or other details of your search problem (e.g. construction time, unsuccessful searches, update performance).
If you search much more often than you update/insert into your lookup storage, I'd suggest you have a look at the DAWGDIC library [1] and construct "search paths" (similar to what you already described) using a string format that can be search-completed later:
articleID:revisionID:"changeDescription":"change"
Example (I assume you have one description per revision, and n changes. This isn't clear to me from your question):
1:2:"Some changes":"Added two sentences here, removed one sentence there"
1:2:"Some changes":"Fixed article title"
2:4:"Advertisement changes":"Added this, removed that"
Note: Even though you construct these strings with duplicate prefixes, the DAWG will store them in a very space-efficient way (simply put, it appends the right side of the string to the data structure and creates a shortcut for the common prefix; see also [2] for a comparison of trie data structures).
To list changes of article 1, revision 2, set the common prefix for your lookup:
completer.Start(index, "1:2");
Now you can simply call completer.Next() to look up the next record that shares the same prefix, and completer.value() to get the record's value. In our example we'll get:
1:2:"Some changes":"Added two sentences here, removed one sentence there"
1:2:"Some changes":"Fixed article title"
Of course you need to parse the strings yourself into your data object.
Maybe that's not what you're looking for, and it may be overkill. But it can be very space-efficient and fast to search, if it meets your requirements.
[1] https://code.google.com/p/dawgdic/
[2] http://kmike.ru/python-data-structures/

search lucene NumericField for maximum value

I know there is a NumericRangeQuery in Lucene, but is it possible to have Lucene simply return the maximum value stored in a NumericField? I can use a RangeQuery over the entire known range and then sort, but this is extremely cumbersome, and it may return a huge number of results if there are a lot of records.
The second parameter of IndexSearcher.search(Query query, int n, Sort sort) lets you specify the number of top hits (in your case 1), which, if you sort correctly, returns only the desired result. There are other overloads that achieve the same thing.
Can't argue about the cumbersomeness though :)
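A sketch of that suggestion (Lucene 5/6-era API; the field name price and the index path are hypothetical, and newer Lucene versions require doc values on the field for sorting): match all documents, sort descending on the numeric field, and keep only the top hit.

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MaxNumericValue {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Sort byPriceDesc = new Sort(
                new SortField("price", SortField.Type.LONG, true)); // true = descending
            TopDocs top = searcher.search(new MatchAllDocsQuery(), 1, byPriceDesc);
            if (top.scoreDocs.length > 0) {
                // Assumes the field was also stored, so the value can be read back.
                System.out.println(searcher.doc(top.scoreDocs[0].doc).get("price"));
            }
        }
    }
}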
You could enumerate the terms of your index with a TermEnum. Unfortunately, I don't think they're sorted in a way that makes finding the maximum instantaneous, but at least you won't have to run an actual search to find it. You will need NumericUtils to convert from Lucene's internal representation to a normal number.
This thread contains an example.