I found the following in ML documentation:
a range index lets the server map values to fragments, and fragments to values... The former capability is used to support "range predicates"... The latter is used to support fast order by operations.
Can anyone please explain this to me? Some sort of diagram depicting how this mapping is maintained would be very helpful.
Yes, do read Jason's excellent paper for all the detail of the inner workings of MarkLogic.
A simple summary of range indexes is this: a range index is a sorted term list. Term lists are an inverted index of values stored in documents. For word indexes, for example, a list of terms (a term list) is created that contains all the words in all the documents. Each term in the list is a word, say "humdinger", with an associated set of fragment IDs where that word occurs. When you do a word search for "humdinger", ML checks the term lists to find out which fragments that word occurs in. Easy. A more complex search is simply the set intersection of the fragment-ID sets from all the matched terms across all applicable term lists.
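To make that concrete, here is a toy sketch of a term list as a map from terms to fragment-ID sets, with a multi-term search expressed as a set intersection. This is only an illustration of the idea, not MarkLogic's actual data structures or names:

```java
import java.util.*;

// Toy inverted index: each term maps to the set of fragment IDs containing it.
public class TermListSketch {
    private final Map<String, Set<Long>> termLists = new HashMap<>();

    void add(String term, long fragmentId) {
        termLists.computeIfAbsent(term, t -> new HashSet<>()).add(fragmentId);
    }

    // Single-term lookup: just return the term list for that term.
    Set<Long> lookup(String term) {
        return termLists.getOrDefault(term, Collections.emptySet());
    }

    // Multi-term search: intersect the fragment-ID sets of all matched terms.
    Set<Long> lookupAll(String... terms) {
        Set<Long> result = null;
        for (String term : terms) {
            Set<Long> hits = lookup(term);
            if (result == null) {
                result = new HashSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```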
Most "regular" indexes are not sorted, they're organized as hashes to make matching terms efficient. They produce a set of results, but not ordered (relevance ordering is applied after). A range index on the other hand, is a term list that's sorted by the values of its terms. A range index therefore represents the range of unique values that occur in all instances of an element or attribute in the database.
Because range index term lists are ordered, when you get matches in a search you not only know which fragments they occur in, you also know the sorted order of the possible values for that field. MarkLogic's XQuery is optimized to recognize when you've supplied an "order by" clause that refers to an element or attribute which is range indexed. This lets it sort not by comparing the matched documents, but by iterating down the sorted term list and fetching matched documents in that order. This makes it much faster because the documents themselves need not be touched to determine their sort order.
But wait, there's more. If you're paginating through search results, taking only a slice of the matching results, then fast sorting by a range indexed field helps you there as well. If you're careful not to access any other part of the document (other than the range index element) before applying the page-window selection predicate, then the documents outside that window will never need to be fetched. The combination of pre-sorted selection and fast skip ahead is really the only way you can efficiently step through large, sorted result sets.
Range indexes have one more useful feature. You can access their values as lexicons, enumerating the unique values that occur in a given element or attribute across your entire database, but without ever actually looking inside any documents. This comes in handy for things like auto-suggest and getting counts for facets.
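Continuing the same toy sketch (again, just an illustration and not the server's real structures), a sorted term list is what makes order-by, pagination windows, and lexicon enumeration cheap, because everything happens on the index without fetching documents:

```java
import java.util.*;

// Toy range index: a term list kept sorted by value, mapping each unique
// value to the fragment IDs in which it occurs.
public class RangeIndexSketch {
    private final NavigableMap<String, Set<Long>> sorted = new TreeMap<>();

    void add(String value, long fragmentId) {
        sorted.computeIfAbsent(value, v -> new HashSet<>()).add(fragmentId);
    }

    // "Order by" without touching documents: walk the sorted values and emit
    // matched fragment IDs in value order.
    List<Long> orderedFragments(Set<Long> matches) {
        List<Long> ordered = new ArrayList<>();
        for (Set<Long> frags : sorted.values()) {
            for (long id : frags) {
                if (matches.contains(id)) ordered.add(id);
            }
        }
        return ordered;
    }

    // Pagination: only the slice [offset, offset + limit) ever needs fetching.
    List<Long> page(Set<Long> matches, int offset, int limit) {
        List<Long> ordered = orderedFragments(matches);
        int from = Math.min(offset, ordered.size());
        int to = Math.min(offset + limit, ordered.size());
        return ordered.subList(from, to);
    }

    // Lexicon: the unique values themselves, with no document access at all.
    Set<String> lexicon() {
        return sorted.navigableKeySet();
    }
}
```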
I hope that clarifies what range indexes are.
Take a look at Jason Hunter's writeup in Inside MarkLogic Server. There's a whole section on range indexes.
Related
What are the performance implications in postgres of using an array to store values as compared to creating another table to store the values with a has-many relationship?
I have one table that needs to be able to store anywhere from about 1-100 different string values in either an array column or a separate table. These values will need to be frequently searched for exact matches, so lookup performance is critical. Would the array solution be faster, or would it be faster to use joins to lookup the values in the separate table?
These values will need to be frequently searched
Searched how? This is crucial.
Prefix pattern match only? Infix/suffix pattern matches too? Fuzzy string search / similarity matching? Stemming and normalization for root words, de-pluralization? Synonym search? Is the data character sequences or natural language text? One language, or multiple different languages?
Hand-waving around "searched" makes any answer that ignores that part pretty much invalid.
so lookup performance is critical. Would the array solution be faster, or would it be faster to use joins to lookup the values in the separate table?
Impossible to be strictly sure without proper info on the data you're searching.
Searching text fields is much more flexible, giving you many options you don't have with an array search. It also generally reduces the amount of data that must be read.
In general, I strongly second Clodaldo: Design it right. Optimize later, if you need to.
According to the official PostgreSQL reference documentation (https://www.postgresql.org/docs/current/arrays.html#ARRAYS-SEARCHING), searching for specific elements in a table is expected to perform better than in an array:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
The reason for the worse search performance on array elements compared to tables could be that arrays are internally stored as strings, as stated here:
https://www.postgresql.org/message-id/op.swbsduk5v14azh%40oren-mazors-computer.local
the array is actually stored as a string by postgres. a string that happens to have lots of brackets in it.
I could not, however, corroborate this statement in any official PostgreSQL documentation, and I also do not have any evidence that handling well-structured strings is necessarily less performant than handling tables.
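If you want to measure the two designs against your own data, a small JDBC sketch like the following can drive both lookup shapes. The table and column names (items, item_values, tags) and the connection details are assumptions for illustration only:

```java
import java.sql.*;

// Compares the two lookup shapes. Adjust table/column names to your schema.
public class ArrayVsTableLookup {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret")) {

            // Design A: separate child table, one row per value; an exact-match
            // lookup can use a plain btree index on item_values(value).
            String tableLookup =
                "SELECT i.* FROM items i " +
                "JOIN item_values v ON v.item_id = i.id " +
                "WHERE v.value = ?";

            // Design B: text[] column; = ANY() has to inspect the array of each
            // candidate row unless a GIN index on tags can be used.
            String arrayLookup =
                "SELECT * FROM items WHERE ? = ANY(tags)";

            for (String sql : new String[] { tableLookup, arrayLookup }) {
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setString(1, "humdinger");
                    try (ResultSet rs = ps.executeQuery()) {
                        int n = 0;
                        while (rs.next()) n++;
                        System.out.println(sql + " -> " + n + " rows");
                    }
                }
            }
        }
    }
}
```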
I'm trying to find out if there's a way to search in Lucene for all documents where there is at least one word that does not match a particular word.
E.g., I want to find all documents where there is at least one word besides "test"; i.e., "test" may or may not be present, but there should be at least one word other than "test". Is there a way to do this in Lucene?
thanks,
Purushotham
Lucene could do this, but this wouldn't be a good idea.
The performance of query execution is bound by two factors:
the time to intersect the query with the term dictionary,
the time to retrieve the docs for every matching term.
Performant queries are the ones which can be quickly intersected with the term dictionary and match only a few terms, so that the second step doesn't take too long. For example, to prevent overly complex boolean queries, Lucene limits the number of clauses to 1024 by default.
With a TermQuery, intersecting the term dictionary requires (by default) O(log(n)) operations (where n is the size of the term dictionary) in memory and then one random access on disk plus the streaming of at most 16 terms. Another example is this blog entry from Lucene committer Mike McCandless which describes how FuzzyQuery performance improved when a brute-force implementation of the first step was replaced by something more clever.
However, the query you are describing would require examining every single term in the term dictionary and dismissing only those documents that appear in the "test" document set alone!
You should give more details about your use-case so that people can think about a more efficient solution to your problem.
If you need a query with a single negative condition, then use a BooleanQuery with a MatchAllDocsQuery clause and a TermQuery clause with Occur.MUST_NOT. There is no way to additionally enforce the existential constraint ("must contain at least one term that is not excluded"). You'll have to check that separately, once you retrieve Lucene's results. Depending on the ratio of favorable results to all the results returned from Lucene, this kind of solution can range from perfectly fine to a performance disaster.
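As a sketch, using the classic (pre-5.x) BooleanQuery API and an assumed field name of "body":

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Sketch only: expresses the single negative condition described above.
public class NotTestQuery {
    public static Query build() {
        BooleanQuery query = new BooleanQuery();
        query.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("body", "test")),
                  BooleanClause.Occur.MUST_NOT);
        // This only excludes documents containing "test"; the "at least one
        // other term" check still has to be done on the results you get back.
        return query;
    }
}
```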
I know the title might suggest it is a duplicate but I haven't been able to find the answer to this specific issue:
I have to filter search results based on a date range. The date is stored (but not indexed) on each document. When using a Filter, I noticed the filter is called with all the documents in the index.
This means the filter will get slower as the index grows (currently only ~300,000 documents in it) as it has to iterate through every single document.
I can't use a RangeQuery since the date is not indexed.
How can I apply the filter AFTERWARDS, only on the documents that are the results of the query, to make it more efficient?
I prefer to do it before I am handed the results, so as not to mess up the scores and collectors I have.
Not quite sure if this will help, but I had a similar problem to yours and came up with the following (+ notes):
I think you're really going to have to index the date field. Nothing else makes any sense in terms of querying/filtering etc.
In Lucene.net v2.9, range querying where there are lots of terms seems to have become terribly slow compared to v2.4
I fixed my speed issues when using date fields by switching to using a numeric field and numeric field queries. This actually gave me quite a speed boost over my Lucene.net v2.4 baseline.
Wrapping your query in a caching wrapper filter means you can hang onto the document bit set for the filter. This will also dramatically speed up subsequent queries using the same filter.
A filter doesn't play a part in the scoring for a set of query results.
Joining your cached filter to the rest of your query (where I guess you've got your custom scores and collectors) means it should meet the final part of your criteria
So, to summarise: index your date fields as numeric fields; build your queries as numeric range queries; transform these into cached filter wrappers and hang onto them.
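Here's a rough sketch of those three steps against the classic Lucene Java API (the Lucene.net names differ only slightly); the "date" field name and the yyyyMMdd-as-long encoding are just example choices:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.*;

public class DateRangeFilterSketch {

    // At index time: store the date as a numeric field (e.g. yyyyMMdd as a long).
    static void addDate(Document doc, long yyyymmdd) {
        doc.add(new NumericField("date", Field.Store.YES, true)
                .setLongValue(yyyymmdd));
    }

    // At search time: a numeric range query wrapped in a caching filter.
    // Hang on to the returned filter and reuse it across searches.
    static Filter dateFilter(long from, long to) {
        Query range = NumericRangeQuery.newLongRange("date", from, to, true, true);
        return new CachingWrapperFilter(new QueryWrapperFilter(range));
    }

    // Applying it: the filter restricts the documents the query sees,
    // but it does not affect scoring.
    static TopDocs search(IndexSearcher searcher, Query userQuery, Filter cached)
            throws java.io.IOException {
        return searcher.search(userQuery, cached, 20);
    }
}
```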
I think you'll see some spectacular speedups over your current index usage.
Good luck!
p.s.
I would never second guess what'll be fast or slow when using Lucene. I've always been surprised in both directions!
First, to filter on a field, it has to be indexed.
Second, using a Filter is considered to be the best way to restrict the set of document to search on. One reason for this is that you can cache the filter results to be used for other queries. And the filter data structure is pretty efficient: it is a bit set of documents matching the filter.
But if you insist on not using filters, I think the only way is to use a boolean query to do the filtering.
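For completeness, the boolean-query alternative looks roughly like this (classic pre-5.x API); note that, unlike a cached Filter, the extra clause is re-evaluated on every search and can influence scoring:

```java
import org.apache.lucene.search.*;

// Sketch: restrict a query by ANDing in the restriction as a MUST clause,
// instead of passing a Filter to the searcher.
public class BooleanFilteringSketch {
    static Query restrict(Query userQuery, Query restriction) {
        BooleanQuery combined = new BooleanQuery();
        combined.add(userQuery, BooleanClause.Occur.MUST);
        combined.add(restriction, BooleanClause.Occur.MUST);
        return combined;
    }
}
```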
I have to index different kinds of data (text documents, forum messages, user profile data, etc.) that should be searched together (i.e., a single search would return results for the different kinds of data).
What are the advantages and disadvantages of having multiple indexes, one for each type of data?
And the advantages and disadvantages of having a single index for all kinds of data?
Thank you.
If you want to search all types of document with one search, it's better to keep all types in one index. In that index you can define additional field types and choose which fields to tokenize or store term vectors for.
It takes time to introduce each IndexSearcher to a directory that contains its indexes.
If you want to search each type separately, it's better to give each type its own index.
A single index is more uniform in structure than multiple indexes.
On the other hand, you can balance your load with multiple indexes.
Not necessarily answering your direct questions, but... ;)
I'd go with one index and add a Keyword (indexed, stored) field for the type; it'll let you filter if needed, as well as tell the difference between the results you receive back.
(and maybe in the vein of your questions... using separate indexes will allow each corpus to have its own relevance scoring; I don't know whether excessively repeated terms in one corpus would throw off the relevance of documents in others?)
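A rough sketch of that approach with the classic Lucene field API; the "type"/"body" field names and the "forum" value are examples only:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// One index for everything, with an indexed+stored "type" keyword field.
public class TypedDocsSketch {

    static Document forumMessage(String body) {
        Document doc = new Document();
        doc.add(new Field("type", "forum",
                Field.Store.YES, Field.Index.NOT_ANALYZED)); // keyword-style field
        doc.add(new Field("body", body,
                Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }

    // Restrict a search to one type when needed; otherwise search everything
    // and read doc.get("type") on each hit to tell the results apart.
    static Query onlyType(Query userQuery, String type) {
        BooleanQuery q = new BooleanQuery();
        q.add(userQuery, BooleanClause.Occur.MUST);
        q.add(new TermQuery(new Term("type", type)), BooleanClause.Occur.MUST);
        return q;
    }
}
```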
You should think logically about what each dataset contains and design your indexes by subject matter or other criteria (such as geography, business unit, etc.). As a general rule, your index architecture is similar to how you would design your databases (you likely wouldn't combine an accounting database with a personnel database, for example, even if technically feasible).
As #llama pointed out, creating a single uber-index affects relevance scores and raises security/access issues, among other things, and causes a whole new set of headaches.
In summary: think of a logical partitioning structure depending on your business need. Would be hard to explain without further background.
I agree that each kind of data should have its own index, so that all the index options can be set accordingly: analyzers for the fields, what is stored for the fields, term vectors, and similar. It also lets you use different dynamics for how IndexReaders/Writers are reopened/committed for the different kinds of data.
One obvious disadvantage is the need to handle several indexes instead of one. To make that easier, and because I always use more than one index, I created a small library to handle it: Multi Index Lucene Manager.
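For reference, plain Lucene itself can also present several indexes as one logical index through MultiReader; this is a sketch with the classic API and made-up paths, not the linked library's own interface:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import java.io.File;
import java.io.IOException;

// Searches several per-type indexes with one search.
public class MultiIndexSearch {
    public static IndexSearcher open() throws IOException {
        IndexReader docs  = IndexReader.open(FSDirectory.open(new File("idx/docs")));
        IndexReader forum = IndexReader.open(FSDirectory.open(new File("idx/forum")));
        IndexReader users = IndexReader.open(FSDirectory.open(new File("idx/users")));
        // MultiReader presents the separate indexes as one logical index.
        return new IndexSearcher(new MultiReader(docs, forum, users));
    }
}
```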
Is it possible to compare data from multiple Lucene indexes? I would like to get documents that have the same value in similar fields (like first name, last name) across two indexes. Does Lucene support queries that can do this?
Well, partly. You can build identical document schemas across indexes, and at least get the set of hits correctly. However, as the Lucene Similarity documentation shows, the idf (inverse document frequency) factor in the Lucene scoring depends both on the index size and the number of documents having the search term in the index. Both these factors are index-dependent. Therefore the same match from different indexes may get different scores depending on these factors.
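A small illustration using the classic DefaultSimilarity idf formula (idf = 1 + ln(numDocs / (docFreq + 1))): the same term with the same document frequency gets a different idf, and therefore a different score contribution, in indexes of different sizes.

```java
// Why the same match can score differently in two indexes: the classic
// Lucene DefaultSimilarity idf term depends on per-index statistics.
public class IdfExample {
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Same term, same document frequency, but different index sizes:
        System.out.println(idf(10, 1_000));     // smaller index
        System.out.println(idf(10, 1_000_000)); // larger index -> larger idf
    }
}
```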