I am building a large graph database using neo4j.
I have my own external indexes which give me identifiers for relevant nodes that I use for further neo4j graph traversal. In other words I already have my start node ids when I get to query the database.
My question is: can node lookups be faster if I use neo4j/lucene indexes to access relevant nodes?
Or are queries such as:
START n=node({ids})
already optimized for node access and nothing can be gained by using:
START n=node:nodeIndexName(key={value})
?
Thanks,
Yes, lookups by node ID are already optimized. At the persistence level every node is stored as a fixed-size block, so accessing node 100 is like accessing block 100.
I will warn you, though, that Neo4j makes no guarantee about a node ID once the node is deleted. Neo4j reclaims IDs. So if in the course of your DB's life you delete and add multiple nodes, your external entries may still be "valid" but not point at what you'd expect.
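To make the reuse problem concrete, here is a toy Python model. It is not Neo4j's actual allocator; `ToyNodeStore` and its methods are hypothetical names used only to show how a reclaimed ID can leave an external index entry pointing at a different node.

```python
class ToyNodeStore:
    """Toy store that reclaims deleted node IDs (an assumption, not Neo4j's real allocator)."""

    def __init__(self):
        self.nodes = {}      # node id -> properties
        self.free_ids = []   # ids released by deletes, available for reuse
        self.next_id = 0

    def create(self, props):
        # Prefer a reclaimed id over a fresh one, as an id-reusing store would.
        if self.free_ids:
            node_id = self.free_ids.pop()
        else:
            node_id = self.next_id
            self.next_id += 1
        self.nodes[node_id] = props
        return node_id

    def delete(self, node_id):
        del self.nodes[node_id]
        self.free_ids.append(node_id)


store = ToyNodeStore()
external_index = {}                      # your own index: name -> node id

frank_id = store.create({"name": "frank"})
external_index["frank"] = frank_id

store.delete(frank_id)                   # the external index is not told about this
alice_id = store.create({"name": "alice"})

# The stale entry still resolves, but to a different node.
stale = store.nodes[external_index["frank"]]
print(stale["name"])
```

If you keep node IDs externally, you need some way to invalidate or re-check entries whenever nodes are deleted.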
//EDIT: Also, why not just use Lucene to perform your lookups? Accessing by node ID is of course faster, but that is what happens under the covers when you do an index lookup: key:name, value:frank resolves to node ID 5123, and Neo4j then returns the node that corresponds to that ID.
While benchmarking our Couchbase DB, we compared looking up items by their ID/key with querying for them through a secondary index.
Following this article about indexing and performance in Couchbase, we expected the two to perform the same.
However, in our tests we found that the search by key/ID was sometimes much faster than the search that uses the secondary index.
E.g. ~3 ms to search using the index and ~0.3 ms to search by the key (a factor of 10).
The point is that this difference is not consistent. The search by key varies from 0.3 ms to 15 ms.
We are wondering:
Should search by key perform better than search by secondary index?
Should there be such a large time difference between key searches?
The results you get are consistent with what I would expect. Couchbase works as a key-value store when you do any operation using the ID. A key-value store is roughly a big distributed hashmap, and with this data structure you can get very good performance on get/save/delete by ID.
Whenever you store a new document, Couchbase hashes the key and assigns it to a Virtual Bucket (something similar to a shard). When you need to get this document back, it uses the same algorithm to find out which virtual bucket the document is in. Because the SDK has the cluster map and knows exactly which node holds which shards, your application requests the document directly from the node that owns it.
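As a rough illustration of that routing, here is a small Python sketch. The hash function, the partition count `NUM_VBUCKETS`, the three node names and the helper names are all assumptions made for the example; it does not try to reproduce Couchbase's exact algorithm, only the idea that the client can compute the owner of a key locally.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed partition count for this sketch


def vbucket_for(key: str) -> int:
    """Map a document key to a virtual bucket with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


# Hypothetical cluster map: which node owns which virtual bucket.
cluster_map = {vb: f"node{vb % 3}" for vb in range(NUM_VBUCKETS)}


def node_for(key: str) -> str:
    """Resolve the owning node locally, without asking the cluster."""
    return cluster_map[vbucket_for(key)]


key = "user::000000000001"
print(key, "-> vbucket", vbucket_for(key), "on", node_for(key))
```

Because the mapping is a pure function of the key plus the cluster map, a get by key needs a single network hop to the right node.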
On the other hand, when you query the database, Couchbase has to run a map/reduce internally to find out where the document is located, which is why operations by ID are faster.
As for the results varying from 0.3 ms to 15 ms, it is hard to tell without debugging your environment. However, a number of factors could contribute to it, for example whether the document is cached or not, whether the node is undersized, etc.
To add to @deniswrosa's answer, the secondary index will always be slower, because first the index must be traversed based on your query to find the document key, and then a key lookup is performed. Doing just the key lookup is faster if you already have the key. The amount of work to traverse the index can vary depending on how selective the index is, whether the entire index is in memory, etc. Memory-optimized indexes can ensure that the whole index is in memory, if you have enough memory to support that.
Of course even a simple key lookup can be slower if the document in question is not in the cache, and needs to be brought in to memory from storage.
It is possible to achieve sub-millisecond secondary lookups at scale, but it requires some tuning of your query, your index, and possibly some of Couchbase's system parameters. Consider the following simple example:
Sample document in userBucket:
"user::000000000001" : {
    "email" : "benjamin1@couchbase.com",
    "userId" : "000000000001"
}
This query:
SELECT userId
FROM userBucket
WHERE email = "benjamin1@couchbase.com"
  AND userId IS NOT NULL;
...should be able to achieve sub-millisecond performance with a properly tuned secondary index:
CREATE INDEX idx01 ON userBucket(email, userId);
Since the index is covering the query completely there is no need for the Query engine to FETCH the document from the K/V store. However "SELECT * ..." will always cause the Query service to FETCH the document and thus will be slower than a simple k/v GET("user::000000000001").
For the best latencies, make sure to review your query plan (using EXPLAIN syntax) and make sure your query is not FETCHing. https://docs.couchbase.com/server/6.0/n1ql/n1ql-language-reference/explain.html
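The difference between a covered query and one that has to FETCH can be sketched with a toy in-memory model. This is not how the Couchbase Query or Index services are implemented; `kv_store`, `index`, `kv_get` and the sample emails are hypothetical stand-ins chosen for illustration.

```python
# Toy key-value store: document key -> document (sample data is made up).
kv_store = {
    "user::000000000001": {"email": "a@example.com", "userId": "000000000001"},
    "user::000000000002": {"email": "b@example.com", "userId": "000000000002"},
}

fetches = 0  # counts how often a document is pulled from the K/V store


def kv_get(key):
    global fetches
    fetches += 1
    return kv_store[key]


# Toy secondary index on (email, userId): email -> [(userId, document key)].
index = {}
for doc_key, doc in kv_store.items():
    index.setdefault(doc["email"], []).append((doc["userId"], doc_key))


def select_user_id(email):
    """Covered query: answered from the index entries alone, no fetch."""
    return [user_id for user_id, _ in index.get(email, [])]


def select_star(email):
    """Non-covered query: the index yields keys, then each document is fetched."""
    return [kv_get(doc_key) for _, doc_key in index.get(email, [])]


covered = select_user_id("a@example.com")
fetches_after_covered = fetches
full = select_star("a@example.com")
fetches_after_star = fetches
print(covered, fetches_after_covered, fetches_after_star)
```

In this model the covered query never calls `kv_get`, while the `SELECT *` equivalent pays for the index traversal and then one fetch per matching document.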
In the Aerospike documentation, it is mentioned that Aerospike has 4096 logical partitions and that each key is hashed and eventually mapped to one of those partitions, which determines on which node the data for that key is stored.
However if we have two keys "A" and "AB" and we want to store them in the same node, is there a way?
In Redis it can be achieved by making the keys as "A" and "{A}B" that will make sure that the key "{A}B" will go to a node where "A" is hashed and stored.
In Apache Ignite, same can be done using "AffinityKey".
Does a similar idea exist in Aerospike?
Thanks
Aerospike was designed as a distributed database. Redis was designed to run on a single node, and lacks concepts such as data distribution, clustering, replication, failover, at least natively. I'm aware that you can use various application-side shenanigans to make it into an ad-hoc cluster.
Don't worry about the implementation details of Aerospike's data distribution. Those happen automatically between the client and cluster, and don't require you to do anything on the application side. Instead, think about your access patterns.
First, your Aerospike cluster will make sure the data is evenly distributed. Because work is directly proportional to data, you should make sure the nodes are homogeneous. You can then expect multi-node operations to wrap up in roughly the same amount of time on each node.
You can create a secondary index on the fields that you'll be querying often to enhance the speed of the query. Release 3.12 adds predicate filtering, allowing you to create more complex query predicates on top of the initial secondary index based filter (also see the Java client's PredExp class).
If you don't want to use secondary indexes (there are several valid reasons), you can create your own lookup using external records. In a set called country-school you can have a record for each country (keys such as 'india', 'luxembourg') with the value being a list containing the IDs of the schools in that country. You can get the list with a single get (or a batch-get if it's several records, such as india-1, india-2, ... , india-9999), then use the results to compose a batch-get operation for the schools. Batch reads return results in the order you asked for them, so you can get a large batch, check whether the last element is null, and if not get another batch.
('ns1', 'country-school', 'us-california') => [ 1, 2, 3, 5, 8, 11, .. ]
Similarly, you can create permutations such as country-state-city (for example, us-california-oakland) with smaller lists. This costs some extra space, but gives you faster (key-value based) retrieval without spending memory on secondary indexes.
('ns1', 'country-school', 'us-california-oakland') => [ 1, 5, 42, .. ]
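The lookup-record pattern above can be sketched in Python with a plain dict standing in for the database. This does not use the Aerospike client API; `records`, `batch_get`, `school_ids` and the sample data are hypothetical, and the sketch assumes the chunk records (india-1, india-2, ...) are numbered contiguously.

```python
# In-memory stand-in for the database: (namespace, set, key) -> value.
records = {
    ("ns1", "country-school", "india-1"): [1, 2, 3],
    ("ns1", "country-school", "india-2"): [5, 8],
    ("ns1", "school", 1): {"name": "School 1"},
    ("ns1", "school", 2): {"name": "School 2"},
    ("ns1", "school", 3): {"name": "School 3"},
    ("ns1", "school", 5): {"name": "School 5"},
    ("ns1", "school", 8): {"name": "School 8"},
}


def batch_get(keys):
    """Return values in the order requested; None for keys that do not exist."""
    return [records.get(k) for k in keys]


def school_ids(country, batch_size=4):
    """Collect school IDs from the chunked lookup records of one country."""
    ids, start = [], 1
    while True:
        keys = [("ns1", "country-school", f"{country}-{i}")
                for i in range(start, start + batch_size)]
        chunks = batch_get(keys)
        for chunk in chunks:
            if chunk is not None:
                ids.extend(chunk)
        if chunks[-1] is None:   # last element missing: no further chunks exist
            return ids
        start += batch_size      # batch was full, so ask for the next one


ids = school_ids("india")
schools = batch_get([("ns1", "school", i) for i in ids])
print(ids, [s["name"] for s in schools])
```

Two round trips (one for the lookup records, one for the schools) replace a secondary-index query, at the cost of maintaining the lists yourself.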
I want to be able to find a specific node by its ID for performance reasons (IDs are more efficient than indexes).
In order to execute the following example:
MATCH (s)
WHERE ID(s) = 65110
RETURN s
I will need the ID of the node (65110 in this case).
But how do I get it? Since the ID is auto-generated, it's impossible to find the ID without querying the graph, which kind of defeats the purpose, since I will already have the node.
Am I missing something?
TL;DR: use an indexed property for lookups unless you absolutely need to optimise and can measure the difference.
Typically you use an index lookup as an entry point to the graph, that is, to obtain the node that provides the start of an edge traversal. While the pointer-like nature of Neo4j node IDs means they are theoretically faster, index lookups are also very efficient so you should not discount them on performance grounds unless you are sure it will make a measurable difference.
You should also consider that Neo4j node IDs are not stable. If you delete a node it is possible for the same ID to be re-used in future. For this reason they should really be considered an internal implementation detail and not one that should be relied on as part of your application's external interface.
That said, I have an application that stores Neo4j IDs in a Solr index for looking up nodes in bulk, but this index is considered volatile and the nodes also contain an indexed, application-generated UUID property (with a unique constraint) that serves as their main "primary key".
Further reading and discussion: https://github.com/neo4j/neo4j/issues/258
I have some very basic conceptual questions related to functioning of neo4j.
1. The first question is about the import tool. I am importing around 150 million nodes and a similar number of relationships. When I do an upload, the output on the command terminal prints the number of nodes uploaded and then "prepare node index". What is this node index? Where is it actually used? I see that the created index information is present in the graph_db=>schema=>label. What is this index and where is it actually used? Running a cypher query with does not show that index is being used anywhere.
2. The second question is about the heap memory size of neo4j. What I understood that while running cypher queries, results are stored in heap. Once the heap is full, a garbage collection happens. What if I run a cypher statement that produces results that can not be kept in heap i.e. the result of query is bigger than the heap size. Would neo4j switch to disk? or would it produce an error.
Thanks for clearing these questions in advance.
Best,
What is this node index? Where is it actually used?
The index is just that: a database index. A database index is what's used to help you look up nodes really quickly. Say you put 1 million :Person nodes into a database, then 1 million :Location nodes. When you MATCH (p:Person { last_name: "Smith" }) you want the database to search through only the :Person nodes, and not all 2 million. The index is what makes that happen.
Read up on indexes in neo4j
What is this index and where is it actually used?
The index by label is basically a searchable collection of nodes categorized by label (in this case :Person and :Location) that the database engine uses to speed lookups. This is a greatly simplified answer, but basically accurate. This is a very good thing, you definitely want it. Performance of getting data out of the database would be quite bad without it.
Indexes are all about trading computation time and storage for better performance. Basically, the database pre-orders all of the nodes in a certain way (which costs you up-front computation time, and also a small amount of storage on disk) in exchange for having a nice data structure in place that makes queries very fast. Generally in database terms, you'll find that if you do a lot of read-only queries (fetching data) you really, really want indexes. If your workload is mostly just adding stuff (not lookups), they're not as good.
Running a cypher query with does not show that index is being used anywhere.
Yes, it's invisible, but when you search for something in Cypher using a label, neo4j is exploiting that index. It may be invisible but it's being used to optimize your query.
What I understood that while running cypher queries, results are stored in heap
Well, that's only partially true; in some sense everything in Java is stored in the heap. But results stream back from the database. If you issue a query that produces 1 million results, it is not the case that all 1 million go into the heap immediately. They get pulled in blocks at a time (I don't know how many at a time, the db engine handles that). At any given time, what's in the heap is the set you need right now, not everything.
What if I run a cypher statement that produces results that can not be kept in heap i.e. the result of query is bigger than the heap size
See earlier answer. You can do this without problem, because the entire set generally isn't in the heap. In database terms, we'd say you get a "cursor" back that lets you iterate through results. You do not get a huge result set back. The gotcha here is that if you have 1 million results, you can iterate through them only once. Need to run through them a second time? Avoid doing that, or issue the query again.
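The one-pass behaviour of a cursor is easy to see with a Python generator, used here purely as an analogy; `run_query` is a hypothetical stand-in and not a Neo4j driver call.

```python
def run_query(n):
    """Stand-in for a streaming result: yields rows one at a time."""
    for i in range(n):
        yield {"row": i}   # only the current row needs to be held in memory


cursor = run_query(1_000_000)

first_pass = sum(1 for _ in cursor)    # walks the whole result and consumes it
second_pass = sum(1 for _ in cursor)   # the cursor is exhausted, nothing is left

print(first_pass, second_pass)
```

To go through the rows again you would call `run_query` again, which corresponds to re-issuing the query.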
Would neo4j switch to disk?
No - if/when any swapping to disk happened, in any case that would be an operating system decision dealing with your main memory. It's possible it would happen, but that wouldn't have much to do with neo4j.
or would it produce an error
Nope, neo4j doesn't care how big your result set is. With the "cursor" concept, you can get 1 result or 10 billion results, both will work.
I have a Neo4j database whose content is generated dynamically from a big dataset.
All “entry points” nodes are indexed on a named index (IndexManager.forNodes(…)). I can therefore look up a particular “entry point” node.
However, I would now like to enumerate all those specific nodes, but I can't know on which key they were indexed.
Is there any way to enumerate all keys of a Neo4j Index?
If not, what would be the best way to store those keys, a data type that is eminently non-graph-oriented?
UPDATE (thanks for asking details :) ): the list would be more than 2 million entries. The main use case would be to never update it after an initialization step, but other use cases might need it, so it has to be somewhat scalable.
Also, I would really prefer avoiding killing my current resilience abilities, so storing all keys at once, as opposed to adding them incrementally, would be a last-resort solution.
I would either use a different data store to supplement Neo4j (I like Redis) or try @MattiasPersson's suggestion and store the list on a node.
Is it just one list of keys or is it a list per node? You could store such a list on a specific node, say the reference node.
Instead of using a different storage, which increases complexity, you could try again with:
Lucene indices. Normally Lucene is able to handle this easily, especially now that the MatchAllDocsQuery is better. One problem, though, is that the Neo4j guys are using a very old Lucene version.
A special "reference" field in every node, especially for this key-traversal case, linking to the next node, where you easily get ALL properties :)
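If you want to get all nodes that were indexed in a particular index, you can just do:
// graphDb is your GraphDatabaseService; indexName is the name you passed to forNodes(...)
IndexHits<Node> hits = graphDb.index().forNodes(indexName).query("*:*");
try {
    while (hits.hasNext()) {
        Node n = hits.next();
        // ...process the node...
    }
} finally {
    hits.close(); // always release the index hits when done
}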