Can an Elasticsearch snapshot that ended in the 'success' state be missing index documents?

I have a snapshot that ended in the 'success' state,
but does that necessarily mean that all the documents inside the snapshotted indices were indeed captured?
The documentation says it means all shards were available and were snapshotted,
which implies that all indices were snapshotted. But what about the documents inside them?
Is there any validation of those, such as a document count check per index?

An Elasticsearch team member confirms:
Yes, "success" means that all the data from the index was successfully snapshotted.
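One way to double-check this yourself is to inspect the response of the snapshot API (GET _snapshot/&lt;repository&gt;/&lt;snapshot&gt;) and confirm that no shards failed. This is a minimal sketch of that check in Python; the sample payload below is illustrative, not taken from a live cluster.

```python
# Sketch: verify that a snapshot reported every shard as successful.
# The dict mirrors the shape of GET _snapshot/<repo>/<name> responses.

def snapshot_is_complete(snapshot_info):
    """Return True if the snapshot succeeded and no shard failed."""
    shards = snapshot_info.get("shards", {})
    return (
        snapshot_info.get("state") == "SUCCESS"
        and shards.get("failed", 1) == 0
        and shards.get("successful") == shards.get("total")
    )

# Illustrative response payload (assumed shape, not live data):
sample_response = {
    "snapshots": [
        {
            "snapshot": "nightly-1",
            "state": "SUCCESS",
            "shards": {"total": 5, "successful": 5, "failed": 0},
        }
    ]
}

for snap in sample_response["snapshots"]:
    print(snap["snapshot"], snapshot_is_complete(snap))  # nightly-1 True
```

A snapshot in the PARTIAL state, by contrast, would show a non-zero "failed" shard count and this check would return False.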

Related

Couchbase - what is an asynchronously maintained index?

I'm reading up on Couchbase and came across the passage below. I understand indexes at a high level, but that's about it, and the second sentence is just as unintelligible to me.
The primary index, like every other index in Couchbase, is maintained asynchronously. You set the recency of the data by setting the consistency level for your query.
It means that when a document is stored in Couchbase, it is put into a queue to be indexed. The write operation does not wait for indexing to finish. Imagine a situation where your application:
Writes some document A like {'type': 'invoice', 'foo':'bar', ... etc ... }
Immediately executes a N1QL query SELECT * FROM mybucket WHERE type = 'invoice'
An overly simplistic explanation: document A is queued for indexing after step 1. In step 2, the N1QL query is NotBounded (the SDKs/server default) and may not return document A, because it hasn't been indexed yet. If your situation calls for it, you can specify RequestPlus (or AtPlus) instead of NotBounded, which makes the query wait for indexing to catch up before it executes.
An example in C# of using RequestPlus:
var request = QueryRequest.Create("SELECT * FROM mybucket WHERE type = 'invoice'");
request.ScanConsistency(ScanConsistency.RequestPlus);
This does have performance implications, though: RequestPlus gives the worst performance, NotBounded the best, and AtPlus falls somewhere in between.
Note that if you use key/value access instead of N1QL, then you don't have to worry about this. You will always be able to access a document by its key directly, regardless of indexing.
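To make the race condition concrete, here is a toy Python model of the behavior described above. It is an illustration only, not the real Couchbase SDK: writes land in the key/value store immediately, while the index lags behind until an indexer step runs.

```python
# Toy model of asynchronous indexing (illustration, NOT the Couchbase SDK):
# upserts hit the KV store right away; the index is updated separately.
# A NotBounded query reads the index as-is; a RequestPlus-style query
# first waits for the indexer to catch up.

class ToyBucket:
    def __init__(self):
        self.kv = {}          # key -> document (always current)
        self.index = {}       # type -> set of keys (lags behind writes)
        self.pending = []     # keys written but not yet indexed

    def upsert(self, key, doc):
        self.kv[key] = doc        # the write returns immediately
        self.pending.append(key)  # indexing happens later

    def _run_indexer(self):
        for key in self.pending:
            self.index.setdefault(self.kv[key]["type"], set()).add(key)
        self.pending.clear()

    def query_by_type(self, doc_type, request_plus=False):
        if request_plus:
            self._run_indexer()   # RequestPlus: wait for indexing
        return sorted(self.index.get(doc_type, ()))

bucket = ToyBucket()
bucket.upsert("inv1", {"type": "invoice", "foo": "bar"})
print(bucket.query_by_type("invoice"))                     # [] - not indexed yet
print(bucket.query_by_type("invoice", request_plus=True))  # ['inv1']
```

The first query misses the freshly written document, exactly like a NotBounded N1QL query can; the second waits for indexing and sees it. Key/value reads (bucket.kv["inv1"] here) would always see the document.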
For more information:
Request specification (docs)
Blog post about AtPlus

Using index in DSE graph

I'm trying to get the list of persons in a DataStax graph that share an address with other persons, where the number of persons at the address is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
On the first run I got this error message:
Could not find an index to answer query clause and graph.allow_scan is
disabled: ((label = person))
I've enabled graph.allow_scan=true and now it's working, but I'm wondering how I can create an index so this query runs without enabling allow_scan=true.
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true: under the covers it turns on ALLOW FILTERING on the underlying CQL queries. This causes full cluster scans and will inevitably time out with any real amount of data in the system. You should never enable this in any sort of production environment.
I am not sure that indexing alone solves your problem.
The best way to do this would be to reify addresses as nodes and look for address nodes with an in-degree between 3 and 5.
You can then use an index on the textual fields of your address nodes.
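The reified-address idea can be sketched in plain Python: with addresses as nodes, each person-to-address edge contributes to the address's in-degree, and the answer is just the addresses whose in-degree falls in the requested range. The edge list below is made up for illustration, and the bounds are taken inclusively as the question words them (note that Gremlin's inside(3,5) predicate is exclusive on both ends).

```python
# Sketch: find addresses shared by 3-5 persons via in-degree counting.
# The has_address edge list is invented for the example.
from collections import Counter

edges = [  # (person, address) pairs, i.e. has_address edges
    ("p1", "a1"), ("p2", "a1"), ("p3", "a1"), ("p4", "a1"),
    ("p5", "a2"), ("p6", "a2"),
]

indegree = Counter(addr for _, addr in edges)          # address -> #persons
shared = {addr for addr, n in indegree.items() if 3 <= n <= 5}
people = sorted(p for p, addr in edges if addr in shared)
print(shared, people)  # {'a1'} ['p1', 'p2', 'p3', 'p4']
```

In the graph itself, this corresponds to grouping persons by their address vertex rather than de-duplicating address property values per person.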

Not in List Operation - AeroSpike

Does Aerospike support a "not in" operation on lists? I couldn't see it in the documentation or find any reference to it. Could somebody confirm whether that's the case?
Your understanding is correct: Aerospike does not support a "not in" operation on lists. Such an operation would have to search by value, and there are currently no list read operations by value.
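A common workaround is to read the list bin and apply the "not in" filter client-side. This is a minimal sketch of that filtering step; the record dict stands in for what a client get call would return, and no real Aerospike client is used here.

```python
# Client-side "not in" filter for a list bin (sketch only; the bins dict
# is a stand-in for a record fetched from Aerospike).

def value_not_in_list(record_bins, bin_name, value):
    """True if `value` does not appear in the list stored in `bin_name`."""
    return value not in record_bins.get(bin_name, [])

bins = {"tags": ["red", "green", "blue"]}  # stand-in for a fetched record
print(value_not_in_list(bins, "tags", "yellow"))  # True
print(value_not_in_list(bins, "tags", "red"))     # False
```

The cost is that the whole list travels over the network; for large lists you would want to rethink the data model instead.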

How to return all newest nodes from neo4j?

Is it possible to query neo4j for the newest nodes? In this case, the indexed property "timestamp" records time in milliseconds on every node.
All of the Cypher examples I have found concern graph-style queries: "start at node n and follow relationships". What is the general best approach for returning result sets sorted on one field? Is this even possible in a graph database such as neo4j?
In the embedded Java API it is possible to add sorting using Lucene constructs.
http://docs.neo4j.org/chunked/milestone/indexing-lucene-extras.html#indexing-lucene-query-objects
http://blog.richeton.com/2009/05/12/lucene-sort-tips/
In the server mode you can pass an ?order parameter to the lucene lookup query.
http://docs.neo4j.org/chunked/milestone/rest-api-indexes.html#rest-api-find-node-by-query
Depending on how you indexed your data (not numerically, as there are issues with the Lucene query syntax parser and numeric searches), in Cypher you can do:
start n=node:myindex('time: [1 to 1000]') return n order by n.time asc
There are also more graphy ways of doing this, e.g. by linking the events with a NEXT relationship and returning the head and the next n elements from that list:
http://docs.neo4j.org/chunked/milestone/cypher-cookbook-newsfeed.html
or to create a tree structure for time:
http://docs.neo4j.org/chunked/milestone/cypher-cookbook-path-tree.html
Yes, it is possible, and there are a few different ways to do so.
You could either use a timestamp property and a classic index and sort your result set by that property, or create an in-graph time-based index, as described for example in Peter's blog post:
http://blog.neo4j.org/2012/02/modeling-multilevel-index-in-neoj4.html

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on neo4j and PHP with about 200k nodes, where every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity type, such as users and companies, which holds the nodes with that property. So the users index holds 130K nodes, and the companies index the rest.
With Cypher we query like this:
START u=node:users('id:*')
RETURN count(u)
And the results are
Returned 1 row. Query took 4080 ms.
The server is configured with the defaults plus a few small tweaks, but 4 seconds is too slow for our needs. Consider that the database will grow by about 20K nodes a month, so we need this query to perform very well.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache these results, but I want to know whether it's possible to tweak this.
Thanks a lot, and sorry for my poor English.
Finally, using Gremlin instead of Cypher, I found the solution:
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the Lucene index to get an "approximate" row count.
Thanks again to all.
Mmh,
this is really about the performance of that Lucene index. If you just need this single count most of the time, why not keep an integer with the total count on some node somewhere, update it together with the index insertions, and, for good measure, run the query above every night to correct it?
You could instead keep a property on a specific node up to date with the number of such nodes, with updates guarded by write locks:
Transaction tx = db.beginTx();
try {
    ...
    tx.acquireWriteLock( countingNode );
    countingNode.setProperty( "user_count",
        ((Integer) countingNode.getProperty( "user_count" )) + 1 );
    tx.success();
} finally {
    tx.finish();
}
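The write-lock pattern in the Java snippet above is generic: concurrent writers mutate the shared counter only while holding a lock, so increments are never lost. A minimal stand-in in Python (not the Neo4j API) shows the same idea:

```python
# Generic sketch of the counter-under-write-lock pattern: four threads
# each add 1000 users; the lock plays the role of tx.acquireWriteLock().
import threading

lock = threading.Lock()
counts = {"user_count": 0}

def add_users(n):
    for _ in range(n):
        with lock:  # analogous to acquiring the write lock on countingNode
            counts["user_count"] += 1

threads = [threading.Thread(target=add_users, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counts["user_count"])  # 4000
```

Without the lock, two writers could read the same old value and overwrite each other's increment, which is exactly what the write lock on the counting node prevents.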
If you want the best performance, don't model your entity categories as properties on the node. Instead, do it like this:
company1-[:IS_ENTITY]->companyentity
Or if you are using 2.0
company1:COMPANY
The second option would also allow your index to be updated automatically in a separate background thread, by the way; imo one of the best new features of 2.0.
The first method should also prove more efficient, since making a "hop" generally takes less time than reading a property from a node. It does, however, require you to create a separate index for the entities.
Your queries would look like this:
v2.0
MATCH (company:COMPANY)
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITY]->entity
RETURN count(company)