Phrase matching for keys in redis - redis

I have the following keys in redis :
"542 136 mountain road"
"542 136 mountainview road"
"542136 mountain road"
"542 136 mountain"
"136 mountain road"
"136 mountain"
I would like to get the keys which contains the phrase 136 mountain.
With the glob-style pattern, I am currently making 4 queries so as to satisfy all the possible cases .
scan 0 MATCH '*[\ ]136 mountain[\ ]*'
scan 0 MATCH '*[\ ]136 mountain'
scan 0 MATCH '136 mountain[\ ]*'
scan 0 MATCH '136 mountain'
These four queries in total would return 4 results :
"542 136 mountain road"
"542 136 mountain"
"136 mountain road"
"136 mountain"
Please do share your inputs if there is any better way of changing the pattern string so that all the 4 results can be obtained in a single query.

I don't believe this can be achieved with a glob-style pattern.
I would also note that even if there was a pattern that matched the phrases presented, due to the nature of how SCAN works you would need to iterate through the entire dataset (making separate calls) to get the results you are looking for. Then you would need to consider the fact your data may be changing during the iteration period.
From the docs
It is important to note that the MATCH filter is applied after
elements are retrieved from the collection, just before returning data
to the client. This means that if the pattern matches very little
elements inside the collection, SCAN will likely return no elements in
most iterations.
Reference: https://redis.io/commands/scan#the-match-option
--
Option 1
Use SCAN to iterate through the entire dataset and further filter the data at the application level.
Option 2
Depending on what type of guarantees you're looking for and how much data you have, you could use KEYS. This is generally not a recommended approach, but it's an option to consider.
Example:
KEYS '*136 mountain*'
Much like the SCAN approach, you will have a larger response than what you are looking for and will need to use your language of choice to further filter the result.
Option 3
Index your data by doing some preprocessing at the application level. If the key matches your desired pattern add it to a SET / Sorted Set.
Option 4
Write a Lua script.

Related

Optimizing a vb.net code that uses a very large string list, over 400 000 entries

I use a static list of unique strings T (french dictionary), with 402 325 entries. My code is playing Scrabble and uses a specialized construction called a DAGGAD to construct playable words and verifies that the words are actually in the list. Is there a way faster than list.indexof(T) to find if a word exists ? I looked at HashSet.Contains(T) but it does not use an index that I can use to retrieve a word. For example, for a given turn of play there could be thousands of valid solutions: I actually store only the index of the list, but with a HashSet I would not be able to do that and will have to store all words, which increases memory usage. In most cases solutions are found in one or two seconds, but in some cases (i.e. with blanks) it takes up to 15 seconds, and I need to reduce that if at all possible with VB !
As Craig suggested, using List.BinarySearch(T) on a sorted list of T improves the speed by around 10 fold. A Scrabble play with a blank letter is now taking no more than 1 or 2 seconds compared to 15 to 20 seconds when I was using IndexOf.

Does ZRANGEBYLEX support contains query?

How can I query my sorted set to get all keys containing some characters?
"Starts with" works fine but I need "contains".
I am using below query for "start with" which works fine
zrangebylex zset [2110 "[2110\xff" LIMIT 0 10
Is there any way we can do \xff query \xff ?
No. The lexicographical range for Redis' Sorted Sets can only be used for prefix searches.
Note that by using another Sorted Set that stores the reverse of the values you can also perform a suffix search on the values. However, even combining these two approaches will not provide the functionality you need.
Alternatively, you could perform a prefix search and then filter the results using a Lua script. Depending on your queries and data, this may or may not be an effective approach.
You could, also, consider implementing a full text indexing mechanism on top of Redis but that would be an overkill in most cases and besides, there are existing tested technologies that already do that.
But you can use ZSCAN with a glob-style pattern, for example to get all the strings which contains the characters "s" and/or "a":
ZSCAN key 0 MATCH *[sa]*
From the ZRANGEBYLEX original documentation (also look zzlCompareElements function realization in source code):
The elements are considered to be ordered from lower to higher strings as compared byte-by-byte using the memcmp() C function. Longer strings are considered greater than shorter strings if the common part is identical.
From memcmp documentation:
memcmp compares the first num bytes of the block of memory pointed by ptr1 to the first num bytes pointed by ptr2, returning zero if they all match or a value different from zero representing which is greater if they do not.
So you cant use zrangebylex with contains query. And I'm afraid there is not any "lite" workaround. "Lite" - without redis sourfce patching.

How to build a simple inverted index?

I wanna build a simple indexing function of search engine without any API, such as Lucene. In the inverted index, I just need to record basic information of each word, e.g. docID, position, and freqence.
Now, I have several questions:
What kind of data structure is often used for building inverted index? Multidimensional list?
After building the index, how to write it into files? What kind of format in the file? Like a table? Like drawing a index table on paper?
You can see a very simple implementation of inverted index and search in TinySearchEngine.
For your first question, if you want to build a simple (in memory) inverted index the straightforward data structure is a Hash map like this:
val invertedIndex = new collection.mutable.HashMap[String, List[Posting]]
or a Java-esque:
HashMap<String, List<Posting>> invertedIndex = new HashMap<String, List<Postring>>();
The hash maps each term/word/token to a list of Postings. A Posting is just an object that represents an occurrence of a word inside a document:
case class Posting(docId:Int, var termFrequency:Int)
Indexing a new document is just a matter of tokenizing it (separating in tokens/words) and for each token insert a new Posting in the correct List of the hash map. Of course, if a Posting already exists for that term in that specific docId, you increase the termFrequency. There are other ways of doing this. For in memory inverted indexes this is OK, but for on-disk indexes you'd probably want to insert Postings once with the correct termFrequency instead of updating it every time.
Regarding your second question, there are normally two cases:
(1) you have an (almost) immutable index. You index all your data once and if you have new data you can just reindex. There is no need to real-time or indexing many times in an hour, for example.
(2) new documents arrive all the time, and you need to search the newly arrived documents as soon as possible.
For case (1), you can have at least 2 files:
1 - The Inverted Index file. It lists for each term all Postings (docId/termFrequency pairs). Here represented in plain text, but normally stored as binary data.
Term1<docId1,termFreq><docId2,termFreq><docId3,termFreq><docId4,termFreq><docId5,termFreq><docId6,termFreq><docId7,termFreq>
Term2<docId3,termFreq><docId5,termFreq><docId9,termFreq><docId10,termFreq><docId11,termFreq>
Term3<docId1,termFreq><docId3,termFreq><docId10,termFreq>
Term4<docId5,termFreq><docId7,termFreq><docId10,termFreq><docId12,termFreq>
...
TermN<docId5,termFreq><docId7,termFreq>
2- The offset file. Stores for each term the offset to find its inverted list in the inverted index file. Here I'm representing the offset in characters but you'll normally store binary data, so the offset will be in bytes. This file can be loaded to memory at startup time. When you need to lookup a term inverted list, you lookup its offset and read the inverted list from the file.
Term1 -> 0
Term2 -> 126
Term3 -> 222
....
Along with this 2 files you can (and generally will) have file(s) to store each term's IDF and each document's norm.
For case (2), I'll try to briefly explain how Lucene (and consequently Solr and ElasticSearch) do it.
The file format can be the same as explained above. The main difference is when you index new documents in systems like Lucene instead of rebuilding the index from scratch they just create a new one with only the new documents. So every time you have to index something, you do it in a new separated index.
To perform a query in this "splitted" index you can run the query against each different index (in parallel) and merge the results together before returning to the user.
Lucene calls this "little" indexes segments.
The obvious concern here is that you'll get a lot of little segments very quick. To avoid this, you'll need a policy for merging segments and creating larger segments. For example, if you have more than N segments you can decide to merge all segments smaller than 10 KBs together.

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset(200 million entries) in a flat file. Data is of the form - a 10 digit phone number followed by 5-6 binary fields.
Every week I will be getting a Delta files which will only contain changes to the data.
Problem : Given a list of items i need to figure out whether each item(which will be the 10 digit number) is present in the dataset.
The approach I have planned :
Will parse the dataset and put it a DB(To be done at the start of the
week) like MySQL or Postgres. The reason i want to have RDBMS in the
first step is I want to have full time series data.
Then generate some kind of Key Value store out of this database with
the latest valid data which supports operation to find out whether
each item is present in the dataset or not(Thinking some kind of a
NOSQL db, like Redis here optimised for search. Should have
persistence and be distributed). This datastructure will be read-only.
Query this key value store to find out whether each item is present
(if possible match a list of values all at once instead of matching
one item at a time). Want this to be blazing fast. Will be using this functionality as the back-end to a REST API
Sidenote: Language of my preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use the Redis SINTER which performs set intersection.
You might benefit from using a grid structure by distributing number ranges over some hash function such as the first digit of the phone number (there are probably better ones, you have to experiment), this would e.g. reduce the size per node, when using an optimal hash, to near 20 million entries when using 10 nodes.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.

Compound Queries with Redis

For learning purposes I'm trying to write a simple structured document store in Redis. In my example application I'm indexing millions of documents that look a little like the following.
<book id="1234">
<title>Quick Brown Fox</title>
<year>1999</year>
<isbn>309815</isbn>
<author>Fred</author>
</book>
I'm writing a little query language that allows me to say YEAR = 1999 AND TITLE="Quick Brown Fox" (again, just for my learning, I don't care that I'm reinventing the wheel!) and this should return the ID's of the matching documents (1234 in this case). The AND and OR expressions can be arbitrarily nested.
For each document I'm generating keys as follows
BOOK_TITLE.QUICK_BROWN_FOX = 1234
BOOK_YEAR.1999 = 1234
I'm using SADD to plop these documents in a series of sets in the form KEYNAME.VALUE = { REFS }.
When I do the querying, I parse the expression into an AST. A simple expression such as YEAR=1999 maps directly to a SMEMBERS command which gets me the set of matching documents back. However, I'm not sure how to most efficiently perform the AND and OR parts.
Given a query such as:
(TITLE=Dental Surgery OR TITLE=DIY Appendectomy)
AND
(YEAR = 1999 AND AUTHOR = FOO)
I currently make the following requests to Redis to answer these queries.
-- Stage one generates the intermediate results and returns RANDOM_GENERATED_KEY3
SUNIONSTORE RANDOMLY_GENERATED_KEY1 BOOK_TITLE.DENTAL_SURGERY BOOK_TITLE.DIY_APPENDECTOMY
SINTERSTORE RANDOMLY_GENERATED_KEY2 BOOK_YEAR.1999 BOOK_YEAR.1998
SINTERSTORE RANDOMLY_GENERATED_KEY3 RANDOMLY_GENERATED_KEY1 RANDOMLY_GENERATED_KEY2
-- Retrieving the top level results just requires the last key generated
SMEMBERS RANDOMLY_GENERATED_KEY3
When I encounter an AND I use SINTERSTORE based on the two child keys (and similarly for OR I use SUNIONSTORE). I randomly generate a key to store the results in (and set a short TTL so I don't fill Redis up with cruft). By the end of this series of commands the return value is a key that I can use to retrieve the results with SMEMBERS. The reason I've used the store functions is that I don't want to transport all the matching document references back to the server, so I use temporary keys to store the result on the Redis instance and then only bring back the matching results at the end.
My question is simply, is this the best way to make use of Redis as a document store?
I'm using a similar approach with sorted sets to implement full text indexing. The overall approach is good, though there are a couple of fairly simple improvements you could make.
Rather than using randomly generated keys, you can use the query (or a short form thereof) as the key. That lets you reuse the sets that have already been calculated, which could significantly improve performance if you have queries across two large sets that are commonly combined in similar ways.
Handling title as a complete string will result in a very large number of single member sets. It may be better to index individual words in the title and filter the final results for an exact match if you really need it.