Search complexity of redisearch FT.SEARCH? - redis

Their documentation (linked below) says it is O(n), without specifying what n is. If it is the number of documents in the index, then search can be extremely slow. That doesn't make any sense, or does it?
https://oss.redislabs.com/redisearch/Commands.html#complexity_6

n is the number of results in the result set. Finding all the documents that contain a specific term is basically O(1); a scan over those documents is then needed to load their data from Redis hashes and return them.
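The two steps described above can be sketched in plain Python. This is only an illustration of the cost argument, not how RediSearch is implemented: `documents` is a hypothetical stand-in for the Redis hashes and `inverted_index` for the search index.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins: `documents` plays the role of the Redis
# hashes, `inverted_index` the role of the search index.
documents = {
    "doc:1": {"title": "redis search basics"},
    "doc:2": {"title": "redis lists"},
    "doc:3": {"title": "lucene search"},
}

inverted_index = defaultdict(set)
for doc_id, fields in documents.items():
    for term in fields["title"].split():
        inverted_index[term].add(doc_id)


def search(term):
    # Step 1: a single dictionary lookup finds the posting set for the term,
    # regardless of how many documents the index holds.
    matching_ids = inverted_index.get(term, set())
    # Step 2: each matching document is fetched once, so this part is linear
    # in the size of the result set (the "n" in O(n)).
    return [documents[doc_id] for doc_id in sorted(matching_ids)]


results = search("redis")
```

With this toy data, `search("redis")` touches only the two matching documents, however many other documents the index contains.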

Related

Index versus Sequential search performance?

Say I have a database that holds information about books and their dates of publishing. (two attributes, bookName and publicationDate).
Say that the attribute publicationDate has a Hash Index.
If I wanted to display every book that was published in 2010 I would enter this query : select bookName from Books where publicationDate=2010.
In my lecture, it is explained that if there is a big volume of data and that the publication dates are very diverse, the more optimized way is to use the Hash index in order to keep only the books published in 2010.
However, if the vast majority of the books that are in the database were published in 2010 it is better to search the database sequentially in terms of performance.
I really don't understand why? What are the situations where using an index is more optimized and why?
It is surprising that you are learning about hash indexes without understanding this concept. Hash indexing is a pretty advanced database concept; most databases don't even support them.
The example is also quite misleading. 2010 is not a DATE; it is a YEAR. This is important because a hash index only works on equality comparisons. So the natural way to get a year of data from dates:
where publicationDate >= date '2010-01-01' and
publicationDate < date '2011-01-01'
could not use a hash index because the comparisons are not equality comparisons.
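The difference between the two kinds of lookup can be sketched in Python. This is an illustration under assumptions, not database code: a dict stands in for the hash index, a sorted list searched with `bisect` stands in for an ordered (B-tree-like) index, and the book data is made up.

```python
import bisect
from collections import defaultdict
from datetime import date

# Made-up sample rows: (bookName, publicationDate).
books = [
    ("Book A", date(2009, 12, 31)),
    ("Book B", date(2010, 1, 1)),
    ("Book C", date(2010, 6, 15)),
    ("Book D", date(2011, 1, 1)),
]

# Hash-style index: exact key -> book names. It can only answer "key == x".
hash_index = defaultdict(list)
for name, published in books:
    hash_index[published].append(name)

# Ordered index (stand-in for a B-tree): keys are kept sorted, so both ends
# of a range can be located by binary search.
ordered = sorted(books, key=lambda book: book[1])
keys = [published for _, published in ordered]


def published_between(start, end):
    """Return names with start <= publicationDate < end."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return [name for name, _ in ordered[lo:hi]]


exact = hash_index.get(date(2010, 6, 15), [])
in_2010 = published_between(date(2010, 1, 1), date(2011, 1, 1))
```

The hash lookup needs the exact date; the range query has no single key to hash, which is why it needs the ordered structure.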
Indexes can be used for several purposes:
To quickly determine which rows match filtering conditions so fewer data pages need to be read.
To identify rows with common key values for aggregations.
To match rows between tables for joins.
To support unique constraints (via unique indexes).
And for b-tree indexes, to support order by.
Your question is about the first purpose, which is to reduce the number of data pages being read. Reading a data page is non-trivial work, because it needs to be fetched from disk. A sequential scan reads all data pages, regardless of whether or not they are needed.
If only one row matches the index conditions, then only one page needs to be read. That is a big win on performance. However, if every page has a row that matches the condition, then you are reading all the pages anyway. The index seems less useful.
And using an index is not free. The index itself needs to be loaded into memory. The keys need to be hashed and processed during the lookup operation. All of this overhead is unnecessary if you just scan the pages (although there is other overhead for the key comparisons for filtering).
Using an index has a performance cost. If the percentage of matches is a small fraction of the whole table, this cost is more than made up for by not having to scan the whole table. But if there's a large percentage of matches, it's faster to simply read the table.
There is the cost of reading the index. A small, frequently used index might be in memory, but a large or infrequently used one might be on disk. That means slow disk access to search the index and get the matching row numbers. If the query matches a small number of rows this overhead is a win over searching the whole table. If the query matches a large number of rows, this overhead is a waste; you're going to have to read the whole table anyway.
Then there is an IO cost. With disks it's much, much faster to read and write sequentially than randomly. We're talking 10 to 100 times faster.
A spinning disk has a physical part, the head, which must move around to read different parts of the disk. The time it takes to move is known as "seek time". When you skip around between rows in a table, possibly out of order, this is random access and induces seek time. In contrast, reading the whole table is likely to be one long continuous read; the head does not have to jump around, so there is no seek time.
SSDs are much, much faster because there are no physical parts to move, but they're still much faster for sequential access than random.
In addition, random access has more overhead between the operating system and the disk; it requires more instructions.
So if the database decides a query is going to match most of the rows of a table, it can decide that it's faster to read them sequentially and weed out the non-matches than to look up rows via the index and use slower random access.
Consider a bank of post office boxes, each numbered in a big grid. It's pretty fast to look up each box by number, but it's much faster to start at a box and open them in sequence. And we have an index of who owns which box and where they live.
You need to get the mail for South Northport. You look up in the index which boxes belong to someone from South Northport, see there's only a few of them, and grab the mail individually. That's an indexed query and random access. It's fast because there's only a few mailboxes to check.
Now I ask you to get the mail for everyone but South Northport. You could use the index in reverse: get the list of boxes for South Northport, subtract those from the list of every box, and then individually get the mail for each box. But this would be slow, random access. Instead, since you're going to have to open nearly every box anyway, it is faster to check every box in sequence and skip the ones whose mail is for South Northport.
More formally, the indexed vs table scan performance is something like this.
# Indexed query
C[index] + (C[random] * M)
# Full table scan
(C[sequential] + C[match]) * N
Where C are various constant costs (or near enough constant), M is the number of matching rows, and N is the number of rows in the table.
We know C[sequential] is 10 to 100 times smaller than C[random]. Because disk access is so much slower than CPU or memory operations, C[match] (the cost of checking if a row matches) will be relatively small compared to C[sequential]. More formally...
C[random] >> C[sequential] >> C[match]
Using that we can assume that C[sequential] + C[match] is C[sequential].
# Indexed query
C[index] + (C[random] * M)
# Full table scan
C[sequential] * N
When M << N the indexed query wins. As M approaches N, the full table scan wins.
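To make the comparison concrete, here is a small Python sketch of the two formulas. The constants are made-up illustrative values (with C[random] 50 times C[sequential], inside the 10 to 100 range mentioned above), not measurements from any real database.

```python
def indexed_cost(c_index, c_random, matching_rows):
    # C[index] + (C[random] * M)
    return c_index + c_random * matching_rows


def full_scan_cost(c_sequential, total_rows):
    # C[sequential] * N
    return c_sequential * total_rows


def break_even_matches(c_index, c_random, c_sequential, total_rows):
    # The M at which both costs are equal; below it the index is cheaper.
    return (c_sequential * total_rows - c_index) / c_random


# Illustrative, assumed costs in arbitrary units.
C_INDEX = 100.0
C_RANDOM = 50.0
C_SEQUENTIAL = 1.0
N = 1_000_000

scan = full_scan_cost(C_SEQUENTIAL, N)
few_matches = indexed_cost(C_INDEX, C_RANDOM, 1_000)
many_matches = indexed_cost(C_INDEX, C_RANDOM, 900_000)
threshold = break_even_matches(C_INDEX, C_RANDOM, C_SEQUENTIAL, N)
```

Under these assumed constants the index stops paying off once roughly 2% of the rows match, which is why a query matching most of the table is better served by a scan.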
Note that the cost of using the index isn't really constant. C[index] is things like loading the index, looking up a key, and reading the row IDs. This can be quite variable depending on the size of the index, type of index, and whether it is on disk (cold) or in memory (hot). This is why the first few queries are often rather slow when you've first started a database server.
In the real world it's more complicated than that. In reality rows are broken up into data pages and databases have many tricks to optimize queries and disk access. But, generally, if you're matching most of the rows a full table scan will beat an indexed lookup.
Hash indexes are of limited use these days. A hash index is a simple key/value mapping and can only be used for equality checks. Most databases use a B-Tree as their standard index. B-Trees are a little more costly, but can handle a broader range of operations including equality, ranges, comparisons, and prefix searches such as like 'foo%'.
The Postgres Index Types documentation is a pretty good high-level run-down of the various advantages and disadvantages of the different types of indexes.

What is indexing? Why don't we use hashing for everything?

Going over some interview info about data structures etc.
So, as I understand, arrays are O(1) for indexing, which I believe means finding the specific element contained at space x in the array. Just want to confirm this as I am second guessing myself.
Also, hash maps are O(1) for indexing, searching, insertion and deletion. Does that not kind of make any data structure question pointless, since a hash map will always be the best solution?
Thanks
Well, indexing is not only about arrays.
According to this definition, indexing is creating tables (indexes) that point to the location of folders, files and records. Depending on the purpose, indexing identifies the location of resources based on file names, key data fields in a database record, text within a file or unique attributes in a graphics or video file.
For your second question hash maps are not absolute or best data structures for various reasons, mainly:
Collisions
Hash function calculation time
Extra memory used
There are also lots of data structure questions where hash maps are not superior:
Data structure for finding the k-th minimum element while supporting updates (a hash map would amount to brute force because it does not keep elements sorted, so we need something like a balanced binary search tree)
Data structure for finding whether a word is in a dictionary (sure, a hash map works, but a trie can be faster and use less memory)
Data structure for finding the minimum element in any range of an array with updates (once again a hash map is just too slow for this; we need something like a segment tree)
...
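As a rough illustration of the first item, here is a Python sketch of the k-th minimum problem. Note the assumption: a sorted list maintained with `bisect` stands in for the balanced binary search tree, so inserts and removals here are O(n) because elements must shift (a real balanced tree would make them O(log n)). The class and function names are made up for this example.

```python
import bisect


class SortedBag:
    """Keeps values in sorted order so the k-th smallest is a direct index."""

    def __init__(self):
        self._items = []

    def add(self, value):
        # Binary search for the position, then insert (shifting is O(n)).
        bisect.insort(self._items, value)

    def remove(self, value):
        i = bisect.bisect_left(self._items, value)
        if i == len(self._items) or self._items[i] != value:
            raise KeyError(value)
        self._items.pop(i)

    def kth_smallest(self, k):
        # k is 1-based; the order is already maintained, so no sorting here.
        return self._items[k - 1]


def kth_smallest_from_set(values, k):
    # The hash-based alternative keeps no order, so every query must sort.
    return sorted(values)[k - 1]


bag = SortedBag()
for value in (5, 1, 9, 3):
    bag.add(value)
second = bag.kth_smallest(2)
bag.remove(1)
new_first = bag.kth_smallest(1)
```

The hash set answers membership in O(1) but has to re-sort for every k-th minimum query, which is the "brute force" behaviour the answer refers to.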

How to define a primary key field in a Lucene document to get the best lookup performance?

When creating a document in my Lucene index (v7.2), I add a uid field to it which contains a unique id/key (string):
doc.add(new StringField("uid",uid,Field.Store.YES))
To retrieve that document later on, I create a TermQuery for the given unique id and search for it with an IndexSearcher:
searcher.search(new TermQuery(new Term("uid",uid)),1)
Being a Lucene "novice", I would like to know the following:
How should I improve this approach to get the best lookup performance?
Would it, for example, make a difference if I store the unique id as
a byte array instead of as a string? Or are there some special codecs or filters that can be used?
What is the time complexity of looking up a document by its unique id? Since the index contains at least one unique term for each document, the lookup times will increase linearly with the number of documents (O(n)), right?
Theory
There is a blog post about the Lucene term index and lookup performance. It covers the complexity of looking up a document by id in detail. The post is quite old, but nothing has changed since then.
Here are some highlights related to your question:
Lucene is a search engine where the minimum element of retrieval is a text term, so this means: binary, number and string fields are represented as strings in the BlockTree terms dictionary.
In general, the complexity of lookup depends on the term length: Lucene uses an in-memory prefix-trie index structure to perform a term lookup. Due to restrictions of real-world hardware and software implementations (in order to avoid superfluous disk reads and memory overflow for extremely large tries), Lucene uses a BlockTree structure. This means it stores the prefix trie in small chunks on disk and loads only one chunk at a time. This is why it's so important to generate keys in an easy-to-read order. So let's arrange the factors according to the degree of their influence:
term's length - more chunks to load
term's pattern - to avoid superfluous reads
terms count - to reduce chunks count
Algorithms and Complexity
Let term be a single string and let term dictionary be a large set of terms. If we have a term dictionary, and we need to know whether a single term is inside the dictionary, the trie (and minimal deterministic acyclic finite state automaton (DAFSA) as a subclass) is the data structure that can help us. On your question: “Why use tries if a hash lookup can do the same?”, here are a few reasons:
A trie can find a string in O(L) time, where L is the length of the term. In the worst case this is a bit faster than a hash table (which needs a linear scan when hashes collide, plus a sophisticated hashing algorithm such as MurmurHash3); in the ideal case it is comparable to a hash table.
The hash tables can only find terms of a dictionary that exactly match with the single term that we are looking for; whereas the trie allows us to find terms that have a single different character, a prefix in common, a character missing, etc.
The trie can provide an alphabetical ordering of the entries by key, so we can enumerate all terms in alphabetical order.
The trie (and especially DAFSA) provides a very compact representation of terms with deduplication.
Here is an example of DAFSA for 3 terms: bath, bat and batch:
For a key lookup, notice that descending a single level in the automaton (or trie) takes constant time, and each descent consumes a single character of the term, so finding a term in an automaton (trie) takes O(L) time.
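The O(L) argument can be seen in a minimal Python trie holding the three terms from the example. This is a plain, uncompressed trie written for illustration; it is neither a DAFSA nor Lucene's BlockTree, and the class names are made up.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_term = False  # True if a term ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def contains(self, term):
        # One constant-time step per character: O(L) for a term of length L.
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_term

    def with_prefix(self, prefix):
        # Walk down to the prefix node, then collect every term below it.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_term:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


trie = Trie()
for term in ("bath", "bat", "batch"):
    trie.insert(term)
```

`trie.contains("bat")` follows three edges whatever the dictionary size, and `trie.with_prefix("bat")` shows the prefix enumeration that a hash table cannot offer.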

Does lucene traverse the whole inverted index when searching?

In web search engine, the inverted index is usually very large, so the search engine will quit searching when getting enough results. Since traversing to the tail of a long inverted index is time consuming.
How does Lucene handle this case?
For example, if an inverted index of term 'A' consists of 10000 documents, when searching 'A' for 10 results, will Lucene go through all these 10000 documents then return 10 results, or return 10 results when retrieved enough results even if it does not reach the end of inverted index?
Lucene will indeed visit all 10k matches, compute the score for each of those matches and put them in a heap in order to compute the top k hits.
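The visit-everything-but-keep-only-k behaviour can be sketched with Python's `heapq`. This only illustrates the technique and is not Lucene's collector code; the postings list and the scoring function are made up.

```python
import heapq


def top_k(postings, score, k):
    """Visit every posting, keeping only the k best (score, doc_id) pairs."""
    heap = []  # min-heap: heap[0] is the weakest hit currently kept
    visited = 0
    for doc_id in postings:
        visited += 1
        entry = (score(doc_id), doc_id)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Better than the weakest kept hit: replace it.
            heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True), visited


def toy_score(doc_id):
    # Arbitrary, made-up scoring function for the demonstration.
    return (doc_id * 7919) % 10_007


postings = range(10_000)  # the 10000 documents matching term 'A'
hits, visited = top_k(postings, toy_score, 10)
```

All 10000 postings are scored, but the heap never holds more than 10 entries, so memory stays proportional to k rather than to the number of matches.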
The lucene/misc module has a SortingMergePolicy which allows you to sort merged segments based on a certain field (on a web index, this could be the page rank for instance). This way, if you want to sort documents based on this field at search time (or more generally if the sort order is strongly correlated to the value of this field), you can stop collecting documents per segment as soon as you collected enough matches.
This is currently a very expert feature, but we have plans to make it easier to use, see https://issues.apache.org/jira/browse/LUCENE-6766.
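The early-termination idea can be sketched as follows, assuming a single segment whose postings were sorted by the ranking field at index time. This is a simplified illustration, not the SortingMergePolicy API; `page_rank` and the function name are hypothetical.

```python
def top_k_presorted(sorted_postings, k):
    """Collect hits from postings already ordered best-first, stopping at k."""
    hits = []
    visited = 0
    for doc_id in sorted_postings:
        visited += 1
        hits.append(doc_id)
        if len(hits) == k:
            break  # everything after this point ranks lower
    return hits, visited


# Hypothetical static ranks for 100 matching documents.
page_rank = {doc_id: (doc_id * 37) % 101 for doc_id in range(100)}

# Done once, at index/merge time: order the postings by rank, best first.
sorted_postings = sorted(page_rank, key=page_rank.get, reverse=True)

hits, visited = top_k_presorted(sorted_postings, 10)
```

Because the sort cost is paid when the index is built, the query only has to visit k postings instead of all matches.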

LIST alternative in redis

Redis.io
The main features of Redis Lists from the point of view of time
complexity is the support for constant time insertion and deletion of
elements near the head and tail, even with many millions of inserted
items. Accessing elements is very fast near the extremes of the list
but is slow if you try accessing the middle of a very big list, as it
is an O(N) operation.
What is the alternative to LIST when the data set is very large and writes are less frequent than reads?
This is something I'd definitely benchmark before doing, but if you're really hitting a performance issue accessing items in the middle of the list, there are a couple of alternatives that really depend on your use case.
Don't make a list so big, age out/trim pieces that don't matter any more.
Memoize hot sections of the list. If a particular paginated range is being requested much more often than others, make that its own list. Check if it exists already, and if it doesn't, create a subset of your list in the paginated range.
Bucket your list from the beginning into "manageable sizes" (for whatever your definition of manageable is). If a list is purely additive (no removal from the list), you could use the modulus index of an item as part of the key so that your list is stored in smaller buckets. Ex: key = "your_key_name_" + index % 100000
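The bucketing idea can be sketched in Python, with a dict of lists standing in for the Redis lists. It assumes items are appended strictly in index order and never removed; under that assumption, item `index` lands at position `index // NUM_BUCKETS` inside its bucket. The helper names and the tiny bucket count are chosen for the demonstration only.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # tiny for the demo; the example key above uses 100000

# Stand-in for Redis: key -> list. A real setup would push to one Redis list
# per key instead of appending to a Python list.
buckets = defaultdict(list)


def bucket_key(index):
    return "your_key_name_" + str(index % NUM_BUCKETS)


def append(index, value):
    # Assumes items arrive in index order (0, 1, 2, ...) and are never removed.
    buckets[bucket_key(index)].append(value)


def get(index):
    # Within its bucket, item `index` sits at position index // NUM_BUCKETS.
    return buckets[bucket_key(index)][index // NUM_BUCKETS]


for i in range(10):
    append(i, f"item-{i}")

seventh = get(7)
```

Each bucket holds only about 1/NUM_BUCKETS of the items, so a lookup in the middle of the logical list walks a much shorter physical list.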