combine lucene indexing and traversal in neo4j to give a single resultset

Is there any way to combine Lucene indexing and traversal in Neo4j to search for users indexed by their name, but with the results returned in minimum-depth-first order (i.e. a breadth-first traversal)?
For example: search for all users with name "John*", but give matches close to a particular user node higher priority than the others.
That is, if the particular node is X, the output should be in the following order:
X--JohnG
X------JohnM
X------------------JohnY
and so on...
I am not sure whether I should use an evaluator to filter on names, since there may be thousands of nodes, so doing that without indexing does not sound very efficient.
Thanks for any help!

I do not believe this is possible. I do not see anywhere in the REST traversal framework where you can define the start node by index lookup, only by node ID. What you'd have to do is use the REST framework to perform the index lookup to get the node ID, then perform the traversal from that node.
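If you are using the embedded Java API rather than REST, the same two-step idea can be written directly: query the Lucene index first, then run a breadth-first traversal that only includes nodes from the hit set. A minimal sketch against the Neo4j 3.x embedded API, assuming a legacy node index named "users", a KNOWS relationship type, and graphDb plus the start node x already in scope (all of these names are assumptions, not from the question); it must run inside a transaction:

import java.util.HashSet;
import java.util.Set;

import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.traversal.Evaluation;
import org.neo4j.graphdb.traversal.TraversalDescription;

// 1) Lucene side: collect every node whose indexed name matches "John*".
Index<Node> users = graphDb.index().forNodes("users");
Set<Node> hits = new HashSet<>();
for (Node hit : users.query("name:John*")) {
    hits.add(hit);
}

// 2) Graph side: a breadth-first traversal from x emits the closest
//    matches first, which yields the minimum-depth ordering for free.
TraversalDescription td = graphDb.traversalDescription()
        .breadthFirst()
        .relationships(RelationshipType.withName("KNOWS"))
        .evaluator(path -> hits.contains(path.endNode())
                ? Evaluation.INCLUDE_AND_CONTINUE
                : Evaluation.EXCLUDE_AND_CONTINUE);

for (Node john : td.traverse(x).nodes()) {
    System.out.println(john.getProperty("name"));
}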

Related

Search a large number of IDs in MongoDB

Thanks for looking at my query. I have 20k+ unique identification IDs provided by a client, and I want to look up all of these IDs in MongoDB using a single query. I tried using $in, but it does not seem feasible to put all 20k IDs into one $in and search. Is there a better way of achieving this?
If the id field is indexed, an $in query should be very fast, but I don't think it is a good idea to perform a query with 20k IDs at one time, as it may consume quite a lot of resources such as memory. You can split the IDs into groups of a reasonable size and run the queries separately, and you can still run those queries in parallel at the application level.
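A minimal sketch of that batching approach with the MongoDB Java driver; the collection name items, the field name clientId, the batch size, and the loadClientIds() helper are all hypothetical:

import static com.mongodb.client.model.Filters.in;

import java.util.ArrayList;
import java.util.List;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoCollection<Document> items = client.getDatabase("mydb").getCollection("items");

List<String> ids = loadClientIds(); // hypothetical helper returning the 20k+ IDs
int batchSize = 1000;               // a tuning knob, not a magic number

List<Document> results = new ArrayList<>();
for (int from = 0; from < ids.size(); from += batchSize) {
    List<String> batch = ids.subList(from, Math.min(from + batchSize, ids.size()));
    items.find(in("clientId", batch)).into(results); // one bounded $in per batch
}

Each batch is an independent query, so the batches can also be submitted to an ExecutorService to run in parallel, as suggested above.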
Consider importing your 20k+ IDs into their own collection (say, using mongoimport). Then perform a $lookup from your root collection to that search collection. Depending on whether the $lookup result is an empty array or not, you can proceed with the original operation that required $in.
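A sketch of that $lookup approach with the same Java driver, assuming the imported IDs live in a collection named searchIds keyed by _id, and reusing the items collection from the snippet above (all names are assumptions):

import static com.mongodb.client.model.Aggregates.lookup;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Filters.ne;

import java.util.Arrays;
import java.util.Collections;

// Join each root document's clientId against the imported searchIds
// collection, then keep only the documents that found at least one match.
items.aggregate(Arrays.asList(
        lookup("searchIds", "clientId", "_id", "matched"),
        match(ne("matched", Collections.emptyList()))
)).forEach(doc -> System.out.println(doc.toJson()));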

How does Neo4j indexing (using lucene) work under the hood?

A few questions about Lucene indexes in Neo4j and how they're used during queries and traversals. Given the way relationships are stored on disk (a linked list), it seems to me that any graph traversal would require sequentially visiting all the relationships of a node; I'm not sure how an index could be used in this case. More specifically:
1) When node properties are indexed, how would that be used for a query such as "all my female friends of friends" (gender is indexed)? The only way I see an index being used is by first finding all friends of friends, and then submitting a query to Lucene to get all the females. Is it faster than just doing the comparison in memory, though?
2) When relationship properties are indexed: since the relationships are stored in a linked list, it's impossible to get a subset of the relationships for a node without sequentially walking the list. I suppose we could always index relationships using node IDs, but that seems silly - we'd end up storing adjacency lists in both Lucene and Neo4j.
Indexes are not used for traversals.
They are only used to find your starting points in the graph.
Depending on the relationship types and directions, you only traverse a subset of the relationships of a node.
For your query 1, you don't need an index on gender, as it would return about 50% of the people in your graph. But you would use an index for the initial user lookup ("me"):
CREATE INDEX ON :User(name);

MATCH (m:User {name:"Me"})-[:FRIEND]->()-[:FRIEND]->(other:User)
WHERE other.gender = "female"
RETURN other;
2) Yes, you are right.
You can do that, but it is only necessary if you have a lot of relationships (millions) on a node and want to access only a tiny slice of them.
So if that's your use case, a relationship index might help.
Relationships are actually indexed with both node IDs and a relationship property.
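A sketch of what that looks like with the legacy embedded Java API; the index name "friendships", the FRIEND relationship type, and the since property are made up for illustration, and graphDb, me and other are assumed to be in scope inside a transaction:

import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.index.RelationshipIndex;

RelationshipIndex friendships = graphDb.index().forRelationships("friendships");

// Index the relationship under one of its properties as it is created.
Relationship r = me.createRelationshipTo(other, RelationshipType.withName("FRIEND"));
r.setProperty("since", 2010);
friendships.add(r, "since", 2010);

// Later: fetch only the FRIEND relationships with since=2010 that start
// at `me`, without walking me's full relationship chain. The third and
// fourth arguments are the optional start and end node constraints.
for (Relationship hit : friendships.get("since", 2010, me, null)) {
    System.out.println(hit.getEndNode().getProperty("name"));
}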

Understanding range indexes in Marklogic

I found the following in ML documentation:
a range index lets the server map values to fragments, and fragments to values... The former capability is used to support "range predicates"... The latter is used to support fast order by operations.
Can anyone please explain this to me? Some sort of diagram depicting how this mapping is maintained would be very helpful.
Yes, do read Jason's excellent paper for all the detail of the inner workings of MarkLogic.
A simple summary of range indexes is this: A range index is a sorted term list. Term lists are an inverted index of values stored in documents. For word indexes, for example, a list of terms (a term list) is created that contains all the words in all the documents. Each term in the list is a word, say "humdinger", and an associated set of fragment IDs where that word occurs. When you do a word search for "humdinger", ML checks the term lists to find out which fragments that word occurs in. Easy. A more complex search is simply the set intersections of all the matched terms from all applicable term lists.
Most "regular" indexes are not sorted, they're organized as hashes to make matching terms efficient. They produce a set of results, but not ordered (relevance ordering is applied after). A range index on the other hand, is a term list that's sorted by the values of its terms. A range index therefore represents the range of unique values that occur in all instances of an element or attribute in the database.
Because range index term lists are ordered, when you get matches in a search you not only know which fragments they occur in, you also know the sorted order of the possible values for that field. MarkLogic's XQuery is optimized to recognize when you've supplied an "order by" clause that refers to a element or attribute which is range indexed. This lets it sort not by comparing the matched documents, but by iterating down the sorted term list and fetching matched documents in that order. This makes it much faster because the documents themselves need not be touched to determine their sort order.
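A toy model of that difference, purely for intuition (this is nothing like MarkLogic's actual on-disk structures): if a regular term list behaves like a hash map from value to fragment IDs, a range index behaves like a sorted map, which makes both range predicates and order-by cheap:

import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class RangeIndexToy {
    public static void main(String[] args) {
        // A range index over a numeric "year" element: each distinct value
        // maps to the fragment IDs containing it, and the values stay sorted.
        NavigableMap<Integer, Set<Integer>> rangeIndex = new TreeMap<>();
        rangeIndex.computeIfAbsent(1999, k -> new TreeSet<>()).add(3);
        rangeIndex.computeIfAbsent(2001, k -> new TreeSet<>()).add(7);
        rangeIndex.computeIfAbsent(2005, k -> new TreeSet<>()).add(9);

        // A range predicate such as year >= 2000 is one slice of the map...
        System.out.println(rangeIndex.tailMap(2000, true)); // {2001=[7], 2005=[9]}

        // ...and "order by year" is just an in-order walk: fragment IDs come
        // out pre-sorted without any document ever being fetched.
        for (Map.Entry<Integer, Set<Integer>> e : rangeIndex.entrySet()) {
            System.out.println("year " + e.getKey() + " -> fragments " + e.getValue());
        }
    }
}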
But wait, there's more. If you're paginating through search results, taking only a slice of the matching results, then fast sorting by a range indexed field helps you there as well. If you're careful not to access any other part of the document (other than the range index element) before applying the page-window selection predicate, then the documents outside that window will never need to be fetched. The combination of pre-sorted selection and fast skip ahead is really the only way you can efficiently step through large, sorted result sets.
Range indexes have one more useful feature. You can access their values as lexicons, enumerating the unique values that occur in a given element or attribute across your entire database, but without ever actually looking inside any documents. This comes in handy for things like auto-suggest and getting counts for facets.
I hope that clarifies what range indexes are.
Take a look at Jason Hunter's writeup in Inside MarkLogic Server. There's a whole section on range indexes.

Lucene: Query at least

I'm trying to find out if there's a way in Lucene to find all documents where there is at least one word that does not match a particular word.
E.g. I want to find all documents where there is at least one word besides "test"; i.e. "test" may or may not be present, but there should be at least one word other than "test". Is there a way to do this in Lucene?
thanks,
Purushotham
Lucene could do this, but this wouldn't be a good idea.
The performance of query execution is bound to two factors:
the time to intersect the query with the term dictionary,
the time to retrieve the docs for every matching term.
Performant queries are the ones which can be quickly intersected with the term dictionary, and match only a few terms so that the second step doesn't take too long. For example, in order to prohibit too complex boolean queries, Lucene limits the number of clauses to 1024 by default.
With a TermQuery, intersecting the term dictionary requires (by default) O(log(n)) operations (where n is the size of the term dictionary) in memory and then one random access on disk plus the streaming of at most 16 terms. Another example is this blog entry from Lucene committer Mike McCandless which describes how FuzzyQuery performance improved when a brute-force implementation of the first step was replaced by something more clever.
However, the query you are describing would require examining every single term of the term dictionary, dismissing only the documents that appear in the "test" document set and nowhere else!
You should give more details about your use-case so that people can think about a more efficient solution to your problem.
If you need a query with a single negative condition, use a BooleanQuery combining a MatchAllDocsQuery with a TermQuery whose occur flag is MUST_NOT. There is no way to additionally enforce the existential constraint ("must contain at least one term that is not excluded"); you'll have to check that separately, once you retrieve Lucene's results. Depending on the ratio of favorable results to all the results returned from Lucene, this kind of solution can range from perfectly fine to a performance disaster.
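A sketch of that negative query against recent Lucene versions (5.3+, where BooleanQuery.Builder exists), assuming the text is indexed in a field named "body":

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Matches every document that does NOT contain the term "test". The
// existential part ("at least one other word") cannot be expressed here
// and has to be checked on the retrieved documents, as noted above.
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new MatchAllDocsQuery(), Occur.MUST);
builder.add(new TermQuery(new Term("body", "test")), Occur.MUST_NOT);
Query query = builder.build();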

Determining search results quality in Lucene

I have been researching score normalization in Lucene for a few days (I now know this can't be done), using the mailing list, wiki, blog posts, etc. I'm going to describe my problem because I'm not sure that score normalization is what our project needs.
Background:
In our project, we are using Solr on top of Lucene with custom RequestHandlers and SearchComponents. For a given query, we need to detect when it returns poor results so that we can trigger different actions.
Assumptions:
An immutable index (once indexed, it is not updated) and the same query typology (the dismax query parser with the same field boosting, without boost functions or boost queries).
Problem:
We know that score normalization is not implementable. But is there any way to determine (given the TF/IDF and field-boosting assumptions above) when the quality of the search-result matches is poor?
Example: we've got one index with science papers and another with medcare centre info. When a user queries the first index and gets poor results (inferring this from the scores?), we want to query the second index and merge the results using some threshold (a score threshold?).
Thanks in advance
You're right that normalization of scores across different queries doesn't make sense, because nearly all similarity measures are based on term frequency, which is of course local to a query.
However, I think it is viable to compare the scores in the very special case you are describing, if only you override the default similarity to use an IDF calculated jointly for both indexes. For instance, you could achieve that easily by keeping all the documents in one index and adding an extra (and hidden to the users) 'type' field. Then you could compare the absolute values returned by these queries.
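A sketch of that single-index layout in plain Lucene; the field names body and type and the two type values are assumptions for illustration:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Both corpora go into one index, so the term statistics (IDF) are shared
// and absolute scores become comparable across the two document types.
Document paper = new Document();
paper.add(new TextField("body", "...science paper text...", Store.NO));
paper.add(new StringField("type", "science_paper", Store.YES));

Document centre = new Document();
centre.add(new TextField("body", "...medcare centre info...", Store.NO));
centre.add(new StringField("type", "medcare_centre", Store.YES));

// At query time, add a filter on "type" to search one corpus at a time,
// then compare the raw scores of the two filtered result lists.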
Generally, it could be possible to detect low-quality results by looking at some features, for example a very small number of results or an odd distribution of scores, but I don't think that actually solves your problem. It looks more similar to the issue of merging isolated search results, which is discussed, for example, in this paper.