When to use labels and relation types indexing from Neo4j 4.3? - indexing

I've already asked it on the Neo4j forum, hoping for more success here.
In NODES 2021, label and type indices, available from the 4.3 server, were presented. However, I haven't clear when I should or should not use this kind of indices.
Naively, I tend to think that in most cases, there will be searches like MATCH ( n: ) - [r:] -> (...). So, I'd say it's good to have them in most cases.
But maybe I'm wrong? Maybe they benefit more certain types of queries only?
Thanks in advance for your answer.

Eventually, I got an answer on the Neo4j forum.
Essentially, label/type indices are useful for filters on label/types like: WHERE label = x OR label = y.

Related

Neo4j: how do I use an index in a relationship between two nodes?

I'm debugging the code of an api and I found a cypher instruction that takes 6 minutes to return the data.
I ran the neo4j code in smaller chunks and found that this snippet is causing the problem: MATCH(copart:CopartOperadora) WHERE NOT (copart)-[:FROM_TO]->(:Coexistence)
I'm new to neo4j so I still haven't figured out how I can optimize this instruction.
Thanks to everyone who contributed.
Optimizations of this kind, usually depend on the schema, of your graph database, without that it's very hard to provide any insights. But you can try this:
MATCH (copart:CopartOperadora)-[:FROM_TO]->(:Coexistence)
WITH collect(id(copart)) AS connectedNodesIds
MATCH (copart:CopartOperadora) WHERE id(copart) NOT IN connectedNodesIds
We can't create any index as such, unfortunately. But if the relationship FROM_TO is only present from CopartOperadora to Coexistence nodes. Then you can remove the node label for Coexistence, all together, which will be optimal. Something like this:
MATCH(copart:CopartOperadora) WHERE NOT (copart)-[:FROM_TO]->()

Range of possible values for alpha, gamma and eta params of HLDA's Mallet implementation

I'm trying to run the hlda algorytmm and producing a descriptive hierarchy of the input documents. The problem is I'm running diverse parameters configs and trying to understand how it works in an "empirical way", because I can not match the ones that are being used in the original papers (I understand it's a different team). E.g. alpha in Mallet seems to be eta in the paper, but I'm not very sure. Besides, I can not know the boundaries for each of them. I mean, the range of possible values for each parameter.
In the source code, there is some help:
double alpha; // smoothing on topic distributions
double gamma; // "imaginary" customers at the next
double eta; // smoothing on word distributions.
First, I used the default values: alpha=10.0; gamma=1.0; eta = 0.1;
Then, I tryed running the algorythm by changing the values and interpret the results, but I can't understand the meaning of them. E.g. I think changing gamma (in Mallet) has an effect on the customers decition: to start a new node in the tree or to be placed in an existing one. So, if I set gamma = 0.5, less nodes should be produced, because 0.5 is half the probability of the default one, right? But the results with gamma=1 give me 87 nodes, and with gamma=0.5, it returns 98! And then, I'm asking me something new: is that a probability? I was trying to find the range of possible values in these two papers, but I didn't find them:
Hierarchical Topic Models andthe Nested Chinese Restaurant Process
The Nested Chinese Restaurant Process and BayesianNonparametric Inference of Topic Hierarchies
I know I could be missing something, because I don't have the a good background on this, but that's why I'm asking here, maybe someone already had this problem and can help me understanding those limits.
Thanks in advance!
It may be helpful to run multiple times with each hyperparameter setting. I suspect that gamma does not have a big influence on the final number of topics, and that what you are seeing could just be typical variability in the sampling process.
In my experience the parameter that has by far the strongest influence on the number of topics is actually eta, the topic-word smoothing.

How to traverse & query faster TitanGraph with Cassandra as a backend?

I have millions of nodes stored in Titan 1.0.0 with Cassandra 2.2.4. I want to retrieve graph from Cassandra and query or traverse it in fast way.
If I build index in a code,
mgmt.buildIndex("nameSearchIndex", Vertex.class).addKey(namep, Mapping.TEXT.asParameter()).buildMixedIndex("search");
mgmt.buildIndex("addressSearchIndex", Vertex.class).addKey(addressp, Mapping.TEXT.asParameter()).buildMixedIndex("search");
Still the querying seems to be slower.
When I use
g.traversal().V().count()
it still gives warning - please use indexes, when I have already build indexes in code. Is there any specific configuration to forcefully activate indexes? How to query graph with using indexes?
g.traversal().V().has("Name","Jason")
Does this query uses indexes? if not then how do I make use of indexes to query faster?
Can Spark be used for fast traversal? How to use SparkComputerGraph for the same? I am not able to find the configurations for CassnadraInputFormat with Spark.
Thanks.
There are lots of questions bundled up in this question.
The indexes you're making are mixed indexes, which are implemented by an external indexing system, such as Solr or ElasticSearch. These would help if you are looking for a vertex with a certain name, such as your .has("Name", "Jason) example.
To find out if an index is being used, I suggest looking into the profile() step in Gremlin. You can read about it here.
Spark is meant to be used for traversals that need to potentially load a graph that is bigger than one machine can hold. What use case is .V().count() important for?
This answer was cross-posted on the Titan mailing list.
Indexing is useful for doing fast traversals, but ultimately "fast queries" depends on many factors, including your graph model/volume/shape and the types of questions you are trying to answer.
Read Chapter 8 "Indexing for better Performance" in the Titan docs, and digest the differences between the different types: Composite, Mixed, and Vertex-centric.
Based on the example query you posted, and as Daniel noted, it looks to me like an exact match type of query, so I would start with a Composite index. You can cut and paste this to try it out in the Titan Console.
graph = TitanFactory.open('inmemory')
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.SINGLE).make()
nameIndex = mgmt.buildIndex('nameIndex',Vertex.class).addKey(name).buildCompositeIndex()
mgmt.commit()
graph.addVertex('name','jason')
g = graph.traversal()
g.V().has('name','jason') // no warning should appear
If after reading the Composite vs Mixed Index section you decide that a Mixed index (backed by Elasticsearch, Solr, or Lucene) is what you really need, read Chapter 20 "Index Parameters and Full-Text Search", and digest the differences between the mappings TEXT, STRING, and TEXTSTRING.
Here's an example that uses a STRING mixed index
graph = TitanFactory.build().set('storage.backend','inmemory').set('index.search.backend','elasticsearch').open()
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.SINGLE).make()
nameIndex = mgmt.buildIndex('nameIndex',Vertex.class).addKey(name, Mapping.STRING.asParameter()).buildMixedIndex("search")
mgmt.commit()
graph.addVertex('name','jason')
g = graph.traversal()
g.V().has('name','jason') // no warning should appear

Cplex/OPL local search

I have a model implemented in OPL. I want to use this model to implement a local search in java. I want to initialize solutions with some heuristics and give these initial solutions to cplex find a better solution based on the model, but also I want to limit the search to a specific neighborhood. Any idea about how to do it?
Also, how can I limit the range of all variables? And what's the best: implement these heuristics and local search in own opl or in java or even C++?
Thanks in advance!
Just to add some related observations:
Re Ram's point 3: We have had a lot of success with approach b. In particular it is simple to add constraints to fix the some of the variables to values from a known solution, and then re-solve for the rest of the variables in the problem. More generally, you can add constraints to limit the values to be similar to a previous solution, like:
var >= previousValue - 1
var <= previousValue + 2
This is no use for binary variables of course, but for general integer or continuous variables can work well. This approach can be generalised for collections of variables:
sum(i in indexSet) var[i] >= (sum(i in indexSet) value[i])) - 2
sum(i in indexSet) var[i] <= (sum(i in indexSet) value[i])) + 2
This can work well for sets of binary variables. For an array of 100 binary variables of which maybe 10 had the value 1, we would be looking for a solution where at least 8 have the value 1, but not more than 12. Another variant is to limit something like the Hamming distance (assume that the vars are all binary here):
dvar int changed[indexSet] in 0..1;
forall(i in indexSet)
if (previousValue[i] <= 0.5)
changed[i] == (var[i] >= 0.5) // was zero before
else
changed[i] == (var[i] <= 0.5) // was one before
sum(i in indexSet) changed[i] <= 2;
Here we would be saying that out of an array of e.g. 100 binary variables, only a maximum of two would be allowed to have a different value from the previous solution.
Of course you can combine these ideas. For example, add simple constraints to fix a large part of the problem to previous values, while leaving some other variables to be re-solved, and then add constraints on some of the remaining free variables to limit the new solution to be near to the previous one. You will notice of course that these schemes get more complex to implement and maintain as we try to be more clever.
To make the local search work well you will need to think carefully about how you construct your local neighbourhoods - too small and there will be too little opportunity to make the improvements you seek, while if they are too large they take too long to solve, so you don't get to make so many improvement steps.
A related point is that each neighbourhood needs to be reasonably internally connected. We have done some experiments where we fixed the values of maybe 99% of the variables in a model and solved for the remaining 1%. When the 1% was clustered together in the model (e.g. all the allocation variables for a subset of resources) we got good results, while in comparison we got nowhere by just choosing 1% of the variables at random from anywhere in the model.
An often overlooked idea is to invert these same limits on the model, as a way of forcing some changes into the solution to achieve a degree of diversification. So you could add a constraint to force a specific value to be different from a previous solution, or ensure that at least two out of an array of 100 binary variables have a different value from the previous solution. We have used this approach to get a sort-of tabu search with a hybrid matheuristic model.
Finally, we have mainly done this in C++ and C#, but it would work perfectly well from Java. Not tried it much from OPL, but it should be fine too. The key for us was being able to traverse the problem structure and use problem knowledge to choose the sets of variables we freeze or relax - we just found that easier and faster to code in a language like C#, but then the modelling stuff is more difficult to write and maintain. We are maybe a bit "old-school" and like to have detailed fine-grained control of what we are doing, and find we need to create many more arrays and index sets in OPL to achieve what we want, while we can achieve the same effect with more intelligent loops etc without creating so many data structures in a language like C#.
Those are several questions. So here are some pointers and suggestions:
In Cplex, you give your model an initial solution with the use of IloOplCplexVectors()
Here's a good example in IBM's documentation of how to alter CPLEX's solution.
Within OPL, you can do the same. You basically set a series of values for your variables, and hand those over to CPLEX. (See this example.)
Limiting the search to a specific neighborhood: There is no easy way to respond without knowing the details. But there are two ways that people do this:
a. change the objective to favor that 'neighborhood' and make other areas unattractive.
b. Add constraints that weed out other neighborhoods from the search space.
Regarding limiting the range of variables in OPL, you can do it directly:
dvar int supply in minQty..maxQty;
Or for a whole array of decision variables, you can do something along the lines of:
range CreditsAllowed = 3..12;
dvar int credits[student] in CreditsAllowed;
Hope this helps you move forward.

Improving lucene spellcheck

I have a lucene index, the documents are in around 20 different languages, and all are in the same index, I have a field 'lng' which I use to filter the results in only one language.
Based on this index I implemented spell-checker, the issue is that I get suggestions from all languages, which are irrelevant (if I am searching in English, suggestions in German are not what I need). My first idea was to create a different spell-check index for each language and than select index based on the language of the query, but I do not like this, is it possible to add additional column in spell-check index and use this, or is there some better way to do this?
Another question is how could I improve suggestions for 2 or more Terms in search query, currently I just do it for the first, which can be strongly improved to use them in combination, but I could not find any samples, or implementations which could help me solve this issue.
thanks
almir
As far as I know, it's not possible to add a 'language' field to the spellchecker index. I think that you need to define several search SpellCheckers to achieve this.
EDIT: As it turned out in the comments that the language of the query is entered by the user as well, then my answer is limited to: define multiple spellcheckers. As for the second question that you added, I think that it was discussed before, for example here.
However, even if it would be possible, it doesn't solve the biggest problem, which is the detection of query language. It is highly non-trivial task for very short messages that can include acronyms, proper nouns and slang terms. Simple n-gram based methods can be inaccurate (as e.g. the language detector from Tika). So I think that the most challenging part is how to use certainty scores from both language detector and spellchecker and what threshold should be chosen to provide meaningful corrections (e.g. language detector prefers German, but spellchecker has a good match in Danish...).
If you look at the source of SpellChecker.SuggestSimilar you can see:
BooleanQuery query = new BooleanQuery();
String[] grams;
String key;
for (int ng = GetMin(lengthWord); ng <= GetMax(lengthWord); ng++)
{
<...>
if (bStart > 0)
{
Add(query, "start" + ng, grams[0], bStart); // matches start of word
}
<...>
I.E. the suggestion search is just a bunch of OR'd boolean queries. You can certainly modify this code here with something like:
query.Add(new BooleanClause(new TermQuery(new Term("Language", "German")),
BooleanClause.Occur.MUST));
which will only look for suggestions in German. There is no way to do this without modifying your code though, apart from having multiple spellcheckers.
To deal with multiple terms, use QueryTermExtractor to get an array of your terms. Do spellcheck for each, and cartesian join. You may want to run a query on each combo and then sort based on the frequency they occur (like how the single-word spellchecker works).
After implement two different search features in two different sites with both lucene and sphinx, I can say that sphinx is the clear winner.
Consider using http://sphinxsearch.com/ instead of lucene. It's used by craigslist, among others.
They have a feature called morphology preprocessors:
# a list of morphology preprocessors to apply
# optional, default is empty
#
# builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
# 'soundex', and 'metaphone'; additional preprocessors available from
# libstemmer are 'libstemmer_XXX', where XXX is algorithm code
# (see libstemmer_c/libstemmer/modules.txt)
#
# morphology = stem_en, stem_ru, soundex
# morphology = libstemmer_german
# morphology = libstemmer_sv
morphology = none
There are many stemmers available, and as you can see, german is among them.
UPDATE:
Elaboration on why I feel that sphinx has been the clear winner for me.
Speed: Sphinx is stupid fast. Both indexing and in the serving search queries.
Relevance: Though it's hard to quantify this, I felt that I was able to get more relevant results with sphinx compared to my lucene implementation.
Dependence on the filesystem: With lucene, I was unable to break the dependence on the filesystem. And while their are workarounds, like creating a ram disk, I felt it was easier to just select the "run only in memory" option of sphinx. This has implications for websites with more than one webserver, adding dynamic data to the index, reindexing, etc.
Yes, these are just points of an opinion. However, they are an opinion from someone that has tried both systems.
Hope that helps...