In a web search engine, the inverted index is usually very large, so the engine will often stop searching once it has enough results, since traversing to the tail of a long posting list is time-consuming.
How does Lucene handle this case?
For example, if the inverted index of term 'A' contains 10,000 documents, and I search 'A' for 10 results, will Lucene go through all 10,000 documents and then return 10 results, or will it return as soon as it has retrieved 10 matches, even if it has not reached the end of the inverted index?
Lucene will indeed visit all 10k matches, compute the score for each of them, and put them in a heap in order to compute the top-k hits.
The lucene/misc module has a SortingMergePolicy which allows you to sort merged segments based on a certain field (in a web index, this could be the PageRank, for instance). This way, if you want to sort documents based on this field at search time (or, more generally, if the sort order is strongly correlated with the value of this field), you can stop collecting documents per segment as soon as you have collected enough matches.
This is currently a very expert feature, but we have plans to make it easier to use, see https://issues.apache.org/jira/browse/LUCENE-6766.
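To make this concrete, here is a rough sketch of wiring up SortingMergePolicy, assuming the Lucene 4.10-era misc-module APIs (package names and signatures moved around between versions, so treat this as illustrative; "pagerank" is a hypothetical field):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SortedSegmentsSketch {
    public static void main(String[] args) throws Exception {
        // Keep each merged segment sorted by a hypothetical "pagerank" field,
        // descending, so the most valuable documents come first per segment.
        Sort sort = new Sort(new SortField("pagerank", SortField.Type.LONG, true));

        IndexWriterConfig config =
            new IndexWriterConfig(Version.LUCENE_4_10_0, new StandardAnalyzer());
        config.setMergePolicy(new SortingMergePolicy(new TieredMergePolicy(), sort));

        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, config);
        // ... add documents; at search time, the misc module's
        // EarlyTerminatingSortingCollector can stop collecting per segment
        // once enough matches have been gathered in this sort order.
        writer.close();
    }
}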
The documentation below says it is O(n), without specifying what n is. If it is the number of documents in the index, then search could be extremely slow. That doesn't make sense, or does it?
https://oss.redislabs.com/redisearch/Commands.html#complexity_6
n is the number of results in the result set. Finding all the documents that contain a specific term is basically O(1); then a scan over all those documents is needed to load their data from Redis hashes and return them.
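A toy sketch of the idea (not RediSearch's actual implementation): the term lookup itself is a single hash access, and the per-result work comes from walking the posting list afterwards.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TinyInvertedIndex {
    // term -> IDs of the documents containing it (the "posting list")
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    // O(1): hash lookup of the posting list for the term.
    // The O(n) part is then loading the n matching documents' data.
    public List<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(1, "redis", "search");
        idx.add(2, "redis");
        System.out.println(idx.search("redis")); // [1, 2]
    }
}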
Say I have a database that holds information about books and their dates of publishing. (two attributes, bookName and publicationDate).
Say that the attribute publicationDate has a Hash Index.
If I wanted to display every book that was published in 2010, I would enter this query: select bookName from Books where publicationDate = 2010.
In my lecture, it is explained that if there is a large volume of data and the publication dates are very diverse, the more efficient approach is to use the hash index in order to fetch only the books published in 2010.
However, if the vast majority of the books in the database were published in 2010, it is better, performance-wise, to scan the database sequentially.
I really don't understand why. In which situations is using an index the more efficient choice, and why?
It is surprising that you are learning about hash indexes without understanding this concept. Hash indexes are a pretty advanced database feature; most databases don't even support them.
The example is quite misleading, though: 2010 is not a DATE; it is a YEAR. This matters because a hash index only works for equality comparisons. So the natural way to get a year of data from dates:
where publicationDate >= date '2010-01-01' and
publicationDate < date '2011-01-01'
could not use a hash index because the comparisons are not equality comparisons.
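A hash index behaves much like an in-memory hash map: equality lookups are one cheap probe, but the keys are stored in no useful order, so a range predicate degenerates into a scan. A toy illustration in Java (an analogy, not any database's actual implementation):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashVsRange {
    public static void main(String[] args) {
        // A hash "index" from publication year to book names.
        Map<Integer, List<String>> byYear = new HashMap<>();
        byYear.computeIfAbsent(2010, y -> new ArrayList<>()).add("Book A");
        byYear.computeIfAbsent(2011, y -> new ArrayList<>()).add("Book B");

        // Equality: a single O(1) hash lookup.
        System.out.println(byYear.get(2010)); // [Book A]

        // Range (e.g. year >= 2010 and year < 2012): the hash gives no
        // ordering, so every key must be enumerated -- effectively a scan.
        List<String> inRange = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byYear.entrySet()) {
            if (e.getKey() >= 2010 && e.getKey() < 2012) {
                inRange.addAll(e.getValue());
            }
        }
        System.out.println(inRange); // [Book A, Book B]
    }
}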
Indexes can be used for several purposes:
To quickly determine which rows match filtering conditions so fewer data pages need to be read.
To identify rows with common key values for aggregations.
To match rows between tables for joins.
To support unique constraints (via unique indexes).
And for b-tree indexes, to support order by.
Your question is about the first purpose: reducing the number of data pages that need to be read. Reading a data page is non-trivial work, because it needs to be fetched from disk. A sequential scan reads all data pages, regardless of whether or not they are needed.
If only one row matches the index conditions, then only one page needs to be read. That is a big win on performance. However, if every page has a row that matches the condition, then you are reading all the pages anyway. The index seems less useful.
And using an index is not free. The index itself needs to be loaded into memory. The keys need to be hashed and processed during the lookup operation. All of this overhead is unnecessary if you just scan the pages (although there is other overhead for the key comparisons for filtering).
Using an index has a performance cost. If the percentage of matches is a small fraction of the whole table, this cost is more than made up for by not having to scan the whole table. But if there's a large percentage of matches, it's faster to simply read the table.
There is the cost of reading the index. A small, frequently used index might be in memory, but a large or infrequently used one might be on disk. That means slow disk access to search the index and get the matching row numbers. If the query matches a small number of rows this overhead is a win over searching the whole table. If the query matches a large number of rows, this overhead is a waste; you're going to have to read the whole table anyway.
Then there is the I/O cost. With disks, it's much, much faster to read and write sequentially than randomly. We're talking 10 to 100 times faster.
A spinning disk has a physical part, the head, which must move around to read different parts of the disk. The time it takes to move is known as "seek time". When you skip around between rows in a table, possibly out of order, that is random access, and it incurs seek time. In contrast, reading the whole table is likely to be one long continuous read; the head does not have to jump around, so there is no seek time.
SSDs are much, much faster, since there are no physical parts to move, but sequential access still beats random access by a wide margin.
In addition, random access has more overhead between the operating system and the disk; it requires more instructions.
So if the database decides a query is going to match most of the rows of a table, it can decide that it's faster to read them sequentially and weed out the non-matches than to look up rows via the index using slower random access.
Consider a bank of post office boxes, each numbered in a big grid. It's pretty fast to look up each box by number, but it's much faster to start at a box and open them in sequence. And we have an index of who owns which box and where they live.
You need to get the mail for South Northport. You look up in the index which boxes belong to someone from South Northport, see there's only a few of them, and grab the mail individually. That's an indexed query and random access. It's fast because there's only a few mailboxes to check.
Now I ask you to get the mail for everyone but South Northport. You could use the index in reverse: get the list of boxes for South Northport, subtract those from the list of every box, and then individually get the mail for each box. But this would be slow, random access. Instead, since you're going to have to open nearly every box anyway, it is faster to check every box in sequence and see if it's mail for South Northport.
More formally, the indexed vs table scan performance is something like this.
# Indexed query
C[index] + (C[random] * M)
# Full table scan
(C[sequential] + C[match]) * N
Where C are various constant costs (or near enough constant), M is the number of matching rows, and N is the number of rows in the table.
We know C[sequential] is 10 to 100 times faster than C[random]. Because disk access is so much slower than CPU or memory operations, C[match] (the cost of checking if a row matches) will be relatively small compared to C[sequential]. More formally...
C[random] >> C[sequential] >> C[match]
Using that, we can approximate C[sequential] + C[match] as just C[sequential].
# Indexed query
C[index] + (C[random] * M)
# Full table scan
C[sequential] * N
When M << N the indexed query wins. As M approaches N, the full table scan wins.
Note that the cost of using the index isn't really constant. C[index] is things like loading the index, looking up a key, and reading the row IDs. This can be quite variable depending on the size of the index, type of index, and whether it is on disk (cold) or in memory (hot). This is why the first few queries are often rather slow when you've first started a database server.
In the real world it's more complicated than that. In reality rows are broken up into data pages and databases have many tricks to optimize queries and disk access. But, generally, if you're matching most of the rows a full table scan will beat an indexed lookup.
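To make the crossover concrete, here is a toy calculation with made-up constants (C[index] = 50, C[random] = 10, C[sequential] = 1, in arbitrary I/O units); the numbers are invented, only the shape matters:

public class IndexCrossover {
    public static void main(String[] args) {
        double cIndex = 50;      // fixed cost of consulting the index (made up)
        double cRandom = 10;     // cost per random page read (made up)
        double cSequential = 1;  // cost per sequential page read (made up)
        long n = 1_000_000;      // rows in the table

        double scanCost = cSequential * n; // full table scan: 1,000,000 units

        // The indexed cost grows linearly with the number of matches M.
        for (long m : new long[] {100, 10_000, 100_000, 200_000}) {
            double indexedCost = cIndex + cRandom * m;
            System.out.printf("M=%,d indexed=%,.0f scan=%,.0f -> %s%n",
                m, indexedCost, scanCost,
                indexedCost < scanCost ? "index wins" : "scan wins");
        }
        // With these constants the break-even is near M = N/10: once roughly
        // a tenth of the rows match, the sequential scan pulls ahead.
    }
}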
Hash indexes are of limited use these days. A hash index is a simple key/value structure and can only be used for equality checks. Most databases use a b-tree as their standard index. B-trees are a little more costly, but can handle a broader range of operations, including equality, ranges, comparisons, and prefix searches such as like 'foo%'.
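The difference is easy to see with Java's two standard map types: HashMap is analogous to a hash index (equality only), while TreeMap keeps its keys sorted, roughly like a b-tree, which is what makes ranges and prefix searches cheap:

import java.util.TreeMap;

public class OrderedIndexSketch {
    public static void main(String[] args) {
        // A TreeMap keeps keys sorted, so besides equality it supports
        // ranges and prefix searches -- roughly what a b-tree index offers.
        TreeMap<String, Integer> index = new TreeMap<>();
        index.put("foo", 1);
        index.put("foobar", 2);
        index.put("bar", 3);

        // Equality, same as a hash index:
        System.out.println(index.get("foo")); // 1

        // Range scan: all keys in [a, g)
        System.out.println(index.subMap("a", "g")); // {bar=3, foo=1, foobar=2}

        // Prefix search, like "foo%": keys in ["foo", "foo" + max char)
        System.out.println(index.subMap("foo", "foo\uffff")); // {foo=1, foobar=2}
    }
}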
The Postgres Index Types documentation is a pretty good high-level run-down of the advantages and disadvantages of the various index types.
In Lucene spatial 4, I'm wondering how the geohash index works behind the scenes. I understand the concept of the geohash, which basically takes a point's two coordinates (lat, lon) and creates a single "string" hash.
Is the index just a "string" index (r-tree or quad-tree) or something along these lines (such as just indexing a last name), or is there something special about it?
For prefix-type searches, do all of the n-grams of the hash get indexed? For example, if a geohash is drgt2abc, does it get indexed as d, dr, drg, drgt, etc.?
Is there a default number of n-grams that we might want indexed?
With this type of indexing, will search queries against 100 thousand records versus 100 million records have similar performance for spatial queries (such as box/polygon or distance), or can I expect a general/typical slow degradation of the index as lots of records are added?
Thanks
The best online explanation is my video: Lucene / Solr 4 Spatial deep dive
Is the index just a "string" index (r-tree or quad-tree) or something along these lines (such as just indexing a last name), or is there something special about it?
Lucene, fundamentally, has just one index used for text, numbers, and now spatial. You could say it's a string index. It's a sorted list of bytes/strings. From a higher level view, using spatial in this way is the family of "Tries" AKA "PrefixTrees" in computer science.
For prefix-type searches, do all of the n-grams of the hash get indexed? For example, if a geohash is drgt2abc, does it get indexed as d, dr, drg, drgt, etc.?
Yes.
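Conceptually, each indexed point contributes a term for every prefix of its geohash, something like this toy illustration (not Lucene's actual code):

import java.util.ArrayList;
import java.util.List;

public class GeohashPrefixes {
    public static void main(String[] args) {
        String geohash = "drgt2abc";
        List<String> terms = new ArrayList<>();
        // Each prefix corresponds to a grid cell; shorter prefix = coarser cell.
        for (int len = 1; len <= geohash.length(); len++) {
            terms.add(geohash.substring(0, len));
        }
        System.out.println(terms);
        // [d, dr, drg, drgt, drgt2, drgt2a, drgt2ab, drgt2abc]
    }
}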
Is there a default number of n-grams that we might want indexed?
You specify it conveniently in terms of the precision requirements you have, and it will look up how long the hash needs to be. Or you can specify the length directly.
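In the Lucene 4 spatial API, the configuration looks roughly like this sketch (assuming the lucene-spatial module of that era; "location" is a hypothetical field name, and 11 levels is just an example precision):

import com.spatial4j.core.context.SpatialContext;
import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
import org.apache.lucene.spatial.prefix.tree.SpatialPrefixTree;

public class SpatialConfigSketch {
    public static void main(String[] args) {
        SpatialContext ctx = SpatialContext.GEO;
        // Precision by geohash length: 11 levels is roughly meter-scale.
        int maxLevels = 11; // or derive it from a distance-error requirement
        SpatialPrefixTree grid = new GeohashPrefixTree(ctx, maxLevels);
        RecursivePrefixTreeStrategy strategy =
            new RecursivePrefixTreeStrategy(grid, "location");
        // strategy.createIndexableFields(shape) then produces the prefix terms.
    }
}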
With this type of indexing, will search queries against 100 thousand records versus 100 million records have similar performance for spatial queries (such as box/polygon or distance), or can I expect a general/typical slow degradation of the index as lots of records are added?
Indeed, this type of indexing (and, more specifically, the clever recursive search tree algorithm that uses it) means that you'll have scalable search performance. 100M is a ton of documents for one filter to match, so it's of course going to be slower than one that matches only 100k docs, but it's definitely sub-linear. And by next year it'll be even faster, due to work happening this summer on a new PrefixTree encoding, plus a spatial benchmark in progress which will allow me to make further tuning optimizations I have planned.
I am using Lucene to index the records from my database. I have a million records in a table called "Documents". The records can be accessed by particular users only. A realistic scenario is that a single user can access a maximum of 100 records in the Documents table. Which of the following is the better practice for this scenario?
Indexing all 1 million records in the Documents table as a single index, with the user information as one of the fields in that index, OR
Creating user-specific indexes
Sounds like you'd end up with a lot of indices in the second scenario, and if you want to search them concurrently, Lucene will have to keep a lot of files open, so you might easily hit your OS limit on the number of open files. If you decide to open/close them on demand, you might not benefit from caching, and your searches might be slow because of cold indices (or you pre-warm them, but again with a lot of processing overhead). I'd go with the first approach; Lucene can easily handle 1M documents in a single index.
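A sketch of the first approach with hypothetical field names ("owner", "body"), assuming a recent Lucene API (BooleanQuery.Builder and Occur.FILTER are Lucene 5+; older versions used a mutable BooleanQuery and Filter classes instead):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PerUserFiltering {
    // At index time: one index for all records, with an "owner" field.
    static Document makeDoc(String ownerId, String body) {
        Document doc = new Document();
        doc.add(new StringField("owner", ownerId, Field.Store.NO));
        doc.add(new TextField("body", body, Field.Store.YES));
        return doc;
    }

    // At search time: restrict any query to the current user's documents.
    static Query restrictToUser(Query userQuery, String ownerId) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.add(userQuery, BooleanClause.Occur.MUST);
        builder.add(new TermQuery(new Term("owner", ownerId)),
                    BooleanClause.Occur.FILTER); // must match, doesn't affect scoring
        return builder.build();
    }
}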
I'm doing an index report on my MS SQL 2008 database (Right click database -> Reports -> Index Usage Statistics)
It tells me that one of my indexes uses:
88 user seeks
0 user scans
6,134,141 user updates
Can someone explain to me:
What the difference between user seeks and user scans are?
How should I determine when to keep an index or drop it depending on the user seeks + user scans vs user updates?
I think in this case the cost of maintaining the index is not worth it.
Here is a good article that goes over seeks and scans (and indexes in general). It will probably do a better job than any SO post.
It can be a bit of an art form determining whether you need an index or not. If those 88 seeks take absolutely essential reporting queries from a runtime of 3 hours down to 30 seconds, then keep the index. I think the first step would be to figure out which queries are using it, how much the index helps those queries, and how important those queries are.
Snippet from the article (definitely give it a read, though):
Scans
An index scan is a complete read of all of the leaf pages in the index. When an index scan is done on the clustered index, it’s a table scan in all but name.
When an index scan is done by the query processor, it is always a full read of all of the leaf pages in the index, regardless of whether all of the rows are returned. It is never a partial scan.
A scan does not only involve reading the leaf level of the index; the higher-level pages are also read as part of the index scan.
Seeks
An index seek is an operation where SQL uses the b-tree structure to locate either a specific value or the beginning of a range of values. For an index seek to be possible, there must be a SARGable predicate specified in the query and a matching (or partially matching) index. A matching index is one where the query predicate uses a left-based subset of the index columns. This is examined in much greater detail in part 3 of this series.
The seek operation is evaluated starting at the root page. Using the rows in the root page, the query processor will locate which page in the next lower level of the index contains the first row being searched for. It will then read that page. If that is the leaf level of the index, the seek ends there. If it is not the leaf, the query processor again identifies which page in the next lower level contains the specified value. This process continues until the leaf level is reached.
Once the query processor has located the leaf page containing either the specified key value or the beginning of the specified range of key values, it reads along the leaf pages until all rows that match the predicate have been returned.
One important point to note up front: the index usage statistics are reset every time the database is started. So it's hard to evaluate your 88 seeks without knowing when you last restarted; 88 seeks in the last hour is quite different from 88 seeks in the last month.
A user seek is looking for a particular row or set of rows in the index that match the criteria of your query. A user scan is reading all rows in the index. For obvious reasons, seek operations are preferable over scan operations.
I'm not aware of any general guidelines that say "when the seek/update ratio is X, drop the index." Look at your index in terms of these General Design Guidelines and benchmark before and after performance of your queries to determine the impact of dropping the index.