Obtain all keys of a Neo4j index - indexing

I have a Neo4j database whose content is generated dynamically from a big dataset.
All “entry point” nodes are indexed in a named index (IndexManager.forNodes(…)). I can therefore look up a particular “entry point” node.
However, I would now like to enumerate all those specific nodes, but I can't know on which key they were indexed.
Is there any way to enumerate all keys of a Neo4j Index?
If not, what would be the best way to store those keys, a data type that is eminently non-graph-oriented?
UPDATE (thanks for asking details :) ): the list would be more than 2 million entries. The main use case would be to never update it after an initialization step, but other use cases might need it, so it has to be somewhat scalable.
Also, I would really prefer avoiding killing my current resilience abilities, so storing all keys at once, as opposed to adding them incrementally, would be a last-resort solution.

I would either use a different data store to supplement Neo4j (I like Redis) or try @MattiasPersson's suggestion and store the list on a node.

Is it just one list of keys or is it a list per node? You could store such a list on a specific node, say the reference node.

Instead of using a different storage system, which increases complexity, you could try again with
Lucene indices. Normally Lucene handles this easily, especially now that MatchAllDocsQuery has improved. One problem, though, is that Neo4j uses a very old Lucene version.
Alternatively, add a special "reference" field to every node, just for this key-traversal case, linking to the next node, so that you can easily get ALL properties :)

If you want to get all Nodes, which were indexed in a particular index, you can just do:
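// indexManager is the IndexManager instance (e.g. from graphDb.index());
// INDEX_NAME is the name of your index.
IndexHits<Node> hits = indexManager.forNodes(INDEX_NAME).query("*:*");
try {
    while (hits.hasNext()) {
        Node n = hits.next();
        // ...process the node...
    }
} finally {
    // Always close the hits to release the underlying index resources.
    hits.close();
}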

Related

Performance of Lucene queries in Ignite

I have a very simple object as keys in my cache and I want to be able to iterate on the key/value pairs where a string matches a field in my keys.
Here is how the field is declared in the class
@AffinityKeyMapped @QueryTextField String crawlQueueID;
I run many queries and expect a small amount of documents to match. The queries take a relatively large amount of time, which is surprising given that there are maybe only 100K pairs locally in the cache. My queries are local, I want to hit only the K/V stored in the local node.
According to the profiler I am using, 80% of the CPU is spent here
GridLuceneIndex.java:285 org.apache.lucene.search.IndexSearcher.search(Query, int)
Knowing Lucene's performance, I am really surprised. Any suggestions?
BTW I want to sort the results based on a numerical field in the value object. Can this be done via annotations?
I could have one cache per value of the field I am querying against but given that there are potentially hundreds of thousands or even millions of different values, that would probably be too many caches for Ignite to handle.
EDIT
Looking at the code that handles the Lucene indexing and querying, the index gets reloaded for every query. Given that I do hundreds of them in a row, we probably don't benefit from any caching or optimisation of the index structure in Lucene.
Additionally, there is a range query running as a filter to check for the TTL. FilterQueries are faster but on a fresh indexreader, there would not be much caching either. Of course, if no TTL is needed for a given table, this should not be required.
Judging by the documentation on SQL indexing:
"Ignite automatically creates indexes for each primary key and affinity key field."
the indexing is done on the key alone. In my case, the value I want to use for sorting is in the value object, so that would not work.

Performance difference in Couchbase's get by Key and select by index

As we are doing benchmark tests on our Couchbase DB, we tried to compare search for item by their id / key and search for items by a query that uses secondary index.
Following this article about indexing and performance in Couchbase, we thought the performance of the two would be the same.
However, in our tests we found that the search by key/id was sometimes much faster than the search that uses the secondary index.
E.g. ~3 ms to search using the index and ~0.3 ms to search by the key (a factor of 10).
This difference is not consistent, though: the search by key varies from 0.3 ms to 15 ms.
We are wondering:
Should search by key perform better than search by secondary index?
Should there be such a time difference between key searches?
The results you get are consistent with what I would expect. Couchbase works as a key-value store when you do any operation using the id. A key-value store is roughly a big distributed hashmap, and with this data structure you can get very good performance on get/save/delete by id.
Whenever you store a new document, Couchbase hashes the key and assigns it to a virtual bucket (something similar to a shard). When you need to get this document back, it uses the same algorithm to find out which virtual bucket the document is in. Because the SDK has the cluster map and knows exactly which node holds which shards, your application requests the document directly from the node that owns it.
On the other hand, when you query the database, Couchbase has to run a map/reduce internally to find out where the document is located, which is why operations by id are faster.
As for your question about results ranging from 0.3 ms to 15 ms, it is hard to tell without debugging your environment. However, a number of factors could contribute to it, for example whether the document is cached, whether the node is undersized, etc.
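The key-to-virtual-bucket routing described above can be pictured with a small sketch. This is a simplified illustration, not Couchbase's exact algorithm: the class name, the bucket count of 1024, the plain modulo mapping, and the round-robin "cluster map" are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Simplified illustration of key -> virtual bucket -> node routing.
// NUM_VBUCKETS, the modulo mapping and the node names are assumptions
// for this sketch, not Couchbase's exact implementation.
final class VBucketSketch {
    static final int NUM_VBUCKETS = 1024;

    // Hash the document key and reduce the hash to a vBucket number.
    static int vbucketFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % NUM_VBUCKETS);
    }

    // Hypothetical cluster map: vBuckets are spread round-robin over the nodes.
    static String nodeFor(String key, String[] nodes) {
        return nodes[vbucketFor(key) % nodes.length];
    }

    public static void main(String[] args) {
        String[] nodes = {"node-a", "node-b", "node-c"};
        String key = "user::000000000001";
        System.out.println(key + " -> vBucket " + vbucketFor(key)
                + " on " + nodeFor(key, nodes));
    }
}
```

Because the mapping is a pure function of the key, the client can compute the owning node locally, with no lookup on the server side before the request is sent.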
To add to @deniswrosa's answer, the secondary index will always be slower, because first the index must be traversed based on your query to find the document key, and then a key lookup is performed. Doing just the key lookup is faster if you already have the key. The amount of work to traverse the index can vary depending on how selective the index is, whether the entire index is in memory, etc. Memory-optimized indexes can ensure that the whole index is in memory, if you have enough memory to support that.
Of course even a simple key lookup can be slower if the document in question is not in the cache, and needs to be brought in to memory from storage.
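To make the two access paths concrete, here is a minimal in-memory sketch. It is only a model: the class and method names are hypothetical, a HashMap stands in for the key-value store and a TreeMap for the secondary index. It shows why an index query does strictly more work than a direct get.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of the two lookup paths; all names are hypothetical.
final class LookupPathsSketch {
    // Key-value store: document key -> document body.
    private final Map<String, String> docsByKey = new HashMap<>();
    // Secondary index: indexed field value (email) -> document key.
    private final TreeMap<String, String> keyByEmail = new TreeMap<>();

    void put(String key, String email, String body) {
        docsByKey.put(key, body);
        keyByEmail.put(email, key);
    }

    // One step: direct key lookup.
    String getByKey(String key) {
        return docsByKey.get(key);
    }

    // Two steps: traverse the index to find the key, then do the key lookup.
    String getByEmail(String email) {
        String key = keyByEmail.get(email);
        return key == null ? null : docsByKey.get(key);
    }

    public static void main(String[] args) {
        LookupPathsSketch store = new LookupPathsSketch();
        store.put("user::000000000001", "benjamin1@couchbase.com", "{\"userId\":\"000000000001\"}");
        System.out.println(store.getByKey("user::000000000001"));
        System.out.println(store.getByEmail("benjamin1@couchbase.com"));
    }
}
```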
It is possible to achieve sub-millisecond secondary lookups at scale, but it requires some tuning of your query, your index, and possibly some of Couchbase's system parameters. Consider the following simple example:
Sample document in userBucket:
"user::000000000001" : {
    "email" : "benjamin1@couchbase.com",
    "userId" : "000000000001"
}
This query:
SELECT userId
FROM userBucket
WHERE
email = "benjamin1@couchbase.com"
AND userId IS NOT NULL
;
...should be able to achieve sub-millisecond performance with a properly tuned secondary index:
CREATE INDEX idx01 ON userBucket(email, userId);
Since the index is covering the query completely there is no need for the Query engine to FETCH the document from the K/V store. However "SELECT * ..." will always cause the Query service to FETCH the document and thus will be slower than a simple k/v GET("user::000000000001").
For the best latencies, make sure to review your query plan (using EXPLAIN syntax) and make sure your query is not FETCHing. https://docs.couchbase.com/server/6.0/n1ql/n1ql-language-reference/explain.html
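The difference between a covered query and one that has to FETCH can be sketched in memory as follows. This is a model under stated assumptions, not the Query service's implementation: the class, its fields and the fetch counter are hypothetical, and the index entry simply stores the userId alongside the document key.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of a covering index; all names are hypothetical.
final class CoveringIndexSketch {
    // Key-value store: document key -> full document.
    private final Map<String, Map<String, String>> documents = new HashMap<>();
    // Index on (email, userId): email -> {userId, documentKey}.
    private final Map<String, String[]> index = new HashMap<>();
    // Counts how often the full document had to be fetched.
    int fetches = 0;

    void put(String key, String email, String userId) {
        Map<String, String> doc = new HashMap<>();
        doc.put("email", email);
        doc.put("userId", userId);
        documents.put(key, doc);
        index.put(email, new String[] {userId, key});
    }

    // Like "SELECT userId ...": answered from the index entry alone, no fetch.
    String selectUserId(String email) {
        String[] entry = index.get(email);
        return entry == null ? null : entry[0];
    }

    // Like "SELECT * ...": the index only yields the key, so the document is fetched.
    Map<String, String> selectAll(String email) {
        String[] entry = index.get(email);
        if (entry == null) {
            return null;
        }
        fetches++;
        return documents.get(entry[1]);
    }

    public static void main(String[] args) {
        CoveringIndexSketch db = new CoveringIndexSketch();
        db.put("user::000000000001", "benjamin1@couchbase.com", "000000000001");
        System.out.println(db.selectUserId("benjamin1@couchbase.com") + ", fetches=" + db.fetches);
        System.out.println(db.selectAll("benjamin1@couchbase.com") + ", fetches=" + db.fetches);
    }
}
```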

Neo4j - Find node by ID - How to get the ID for querying?

I want to be able to find a specific node by its ID for performance reasons (IDs are more efficient than indexes).
In order to execute the following example:
MATCH (s)
WHERE ID(s) = 65110
RETURN s
I will need the ID of the node (65110 in this case)
But how do I get it? Since the ID is auto-generated, it's impossible to find the ID without querying the graph, which kind of defeats the purpose, since I will already have the node.
Am I missing something?
TL;DR: use an indexed property for lookups unless you absolutely need to optimise and can measure the difference.
Typically you use an index lookup as an entry point to the graph, that is, to obtain the node that provides the start of an edge traversal. While the pointer-like nature of Neo4j node IDs means they are theoretically faster, index lookups are also very efficient so you should not discount them on performance grounds unless you are sure it will make a measurable difference.
You should also consider that Neo4j node IDs are not stable. If you delete a node it is possible for the same ID to be re-used in future. For this reason they should really be considered an internal implementation detail and not one that should be relied on as part of your application's external interface.
That said, I have an application that stores Neo4j IDs in a Solr index for looking up nodes in bulk, but this index is considered volatile and the nodes also contain an indexed, application-generated UUID property (with a unique constraint) that serves as their main "primary key".
Further reading and discussion: https://github.com/neo4j/neo4j/issues/258
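The ID-reuse caveat above can be illustrated with a toy allocator. This is not Neo4j's actual ID generator; the class and its free-list behaviour are assumptions chosen only to show how a stored ID can silently end up pointing at a different node.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy store that recycles freed IDs; a hypothetical model, not Neo4j's ID generator.
final class IdReuseSketch {
    private final Map<Long, String> nodes = new HashMap<>();
    private final Deque<Long> freedIds = new ArrayDeque<>();
    private long nextId = 0;

    // Prefer a previously freed ID; otherwise allocate a fresh one.
    long create(String name) {
        long id = freedIds.isEmpty() ? nextId++ : freedIds.pop();
        nodes.put(id, name);
        return id;
    }

    // Deleting a node makes its ID available for reuse.
    void delete(long id) {
        if (nodes.remove(id) != null) {
            freedIds.push(id);
        }
    }

    String get(long id) {
        return nodes.get(id);
    }

    public static void main(String[] args) {
        IdReuseSketch store = new IdReuseSketch();
        long alice = store.create("alice");
        store.create("bob");
        store.delete(alice);
        store.create("carol");
        // An external system that remembered alice's ID now gets a different node.
        System.out.println(store.get(alice));
    }
}
```

An application-generated, uniquely constrained property does not have this problem, because its value is never handed to another node.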

Using the document store as a cache

I've set up a basic implementation of ElasticSearch, storing a couple of fields in the document and I'm able to execute queries.
var searchResult = client.Search<SearchTest>(s => s
        .Size(1000)
        .Fields(f => f.ID)
        .Query(q => q.QueryString(d => d.Query(query))))
    .Documents
    .Select(item => item.ID)
    .ToList();
var products = this.DbContext.Products
    .Where(item =>
        searchResult.Contains(item.ProductId)
        && ...
    )
    .Select(item => ...);
// subsequent queries here
Right now, I simply return the index, which I use in database queries to retrieve a whole lot of information. The information stored in the documents is also retrieved. Now I'm wondering, should I skip retrieving this from the database, and use the data in the document store? Or should I use it for nothing else but searching?
Some context: searching in a product database, some information is always the same, some information (like price calculation) depends on which customer is searching.
There isn't really a hard and fast answer to this question. I like to pull enough information from the index to populate a list of search results, but retrieve the full contents of the document from other, external sources (e.g. a database). Entirely subjectively, this seems to be the more common use of Lucene, from what I've seen.
Storage strategy, as far as I know, should not have a direct impact on search performance, but keeping data stored for each document to a minimum will improve performance retrieving documents from the index (ie, for that list of results mentioned before).
I'm also sometimes hesitant to make Lucene the system of record. It seems to be much easier to find yourself with a broken/corrupt index than a database. I like having the option available to trash and rebuild it.
I see you already accepted an answer, but I'd like to offer a second approach.
Elasticsearch excels at storing Documents (json) and so retrieving complete object graphs can be a very fast and powerful approach to overcome the impedance mismatch and N+1 sensitive database queries.
To me the best approach would be for searchResults to already be the list of definitive IEnumerable<Product> without having to do N database queries afterwards.
Elasticsearch (unlike raw lucene or even Solr) has a special field that stores the original json graph called _source so the overhead of loading your whole document is very minimal.
This comes at the cost of having to basically write your data twice, once to the database and once to elasticsearch on every mutation. Depending on your architecture this may or may not be achievable.
I agree with @femtoRgon that being able to reindex from an external data source is a good idea, but the Elasticsearch developers are working very hard to get proper backup and restore going for 1.0. This will greatly reduce the need for the second data store.
BTW not sure if you are aware but specifying .Fields() will already force Elasticsearch to only load up the specified fields instead of the whole graph from the special _source field.

Are Neo4J node ids optimized for access?

I am building a large graph database using neo4j.
I have my own external indexes which give me identifiers for relevant nodes that I use for further neo4j graph traversal. In other words I already have my start node ids when I get to query the database.
My question is: can node lookups be faster if I use neo4j/lucene indexes to access relevant nodes?
Or are queries such as:
START n=node({ids})
already optimized for node access and nothing can be gained by using:
START n=node:nodeIndexName(key={value})
?
Thanks,
Yes. Neo4j is optimized for access by node ID: at the persistence level every node is a block, so accessing node 100 is like accessing block 100.
I will warn you, though, that Neo4j makes no guarantee about a node ID once you delete the node. Neo4j reclaims IDs. So if over the course of your DB's life you delete and add multiple nodes, your external entries may still be "valid" but not point at what you'd expect.
//EDIT: Also, why not just use Lucene to perform your lookups? Of course accessing the Node ID is faster, but that's what Lucene does under the cover when you do a lookup, so key:name, value:frank will return node id 5123 and neo4j will return the node that corresponds to that ID.
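The "node 100 is block 100" point above comes down to offset arithmetic over fixed-size records. The sketch below is only a model of that idea: the record size of 16 bytes, the class name and the single long payload are hypothetical and do not reflect Neo4j's actual store format.

```java
import java.nio.ByteBuffer;

// Fixed-size record store: the ID alone determines where a record lives.
// RECORD_SIZE and the payload layout are hypothetical, not Neo4j's format.
final class FixedRecordStoreSketch {
    static final int RECORD_SIZE = 16;
    private final ByteBuffer store;

    FixedRecordStoreSketch(int capacity) {
        store = ByteBuffer.allocate(capacity * RECORD_SIZE);
    }

    // No search is needed: the byte offset is computed directly from the ID.
    static int offsetOf(int id) {
        return id * RECORD_SIZE;
    }

    void write(int id, long payload) {
        store.putLong(offsetOf(id), payload);
    }

    long read(int id) {
        return store.getLong(offsetOf(id));
    }

    public static void main(String[] args) {
        FixedRecordStoreSketch nodes = new FixedRecordStoreSketch(1000);
        nodes.write(100, 5123L);
        System.out.println("offset=" + offsetOf(100) + " payload=" + nodes.read(100));
    }
}
```

An index lookup ends with the same computation; it just has to resolve the key to an ID first.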