Couchbase performance issues when scanning bucket for certain documents - Getting timeout exception - indexing

We have a Couchbase Server Community Edition 5.1.1 (build 5723).
In our Cars bucket we have the Car Make and the Cars it manufactured.
The connection between the two is the Id of the Car Make that we save as another field in the Car document (like a foreign key in a MySQL table).
The bucket has only 330,000 documents.
Queries are taking a lot of time - dozens of seconds for a very simple query such as
select * from cars where model="Camry" <-- we expect about 50,000 results for that
We perform the queries in 2 ways:
The Couchbase UI
A Spring Boot app, which constantly gets a TimeoutException after 7.5 seconds
We thought the issue was a missing index on the bucket.
So we added an index:
CREATE INDEX cars_idx ON cars(makeName, modelName, makeId, _class) USING GSI;
We can see that index when running
SELECT * FROM system:indexes
What are we missing here? Are these reasonable response times for such queries in a NoSQL DB?

Try
CREATE INDEX model_idx ON cars(model);
Your index does not cover the model field.
And you should have an index for the Spring Data Couchbase "_class" property:
CREATE INDEX `type_idx` ON `cars`(`_class`)
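As a quick sanity check, you can also run EXPLAIN on the slow statement and confirm that the new index actually appears in the plan. A minimal sketch with the Couchbase Java SDK 2.x (the bucket reference is assumed; field and index names mirror the question):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryResult;

// Print the query plan; the chosen index should be listed rather than a primary scan.
N1qlQueryResult plan = bucket.query(
        N1qlQuery.simple("EXPLAIN SELECT * FROM cars WHERE model = \"Camry\""));
plan.allRows().forEach(row -> System.out.println(row.value()));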

So, this is how we solved the issue:
Using this link and paralen's answer, we created several indexes that sped up the queries.
We altered our code to use pagination when we know the returned result set will be big, and came up with something like this:
int pageNumber = 0;
Slice<Car> slice;
do {
    Pageable pageable = PageRequest.of(pageNumber++, SLICE_SIZE, Sort.by("id"));
    slice = carsRepository.findAllByModelName("Camry", pageable);
    List<Car> cars = slice.getContent();
    // process the current slice of cars here
} while (slice.hasNext());
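For reference, the loop assumes a pageable, Slice-returning repository method roughly like the one below (interface and entity names are illustrative). Spring Data Couchbase also appends a _class predicate to such derived queries, which is why the _class index from the answer above gets used as well:

import org.springframework.data.couchbase.repository.CouchbaseRepository;
import org.springframework.data.domain.Pageable;
import org.springframework.data.domain.Slice;

// Illustrative repository backing the pagination loop above.
public interface CarsRepository extends CouchbaseRepository<Car, String> {
    Slice<Car> findAllByModelName(String modelName, Pageable pageable);
}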

Related

Using index in DSE graph

I'm trying to get the list of persons in a DataStax graph that share the same address with other persons, where the number of persons is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
On the first run I noticed this error message:
Could not find an index to answer query clause and graph.allow_scan is
disabled: ((label = person))
I've enabled graph.allow_scan=true and now it's working.
I'm wondering how I can create an index so that I can run this query without enabling allow_scan=true?
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true, as under the covers it turns on ALLOW FILTERING on the CQL queries. This will cause a lot of cluster scans and will inevitably time out with any real amount of data in the system. You should never enable this in any sort of production environment.
I am not sure that indexing is the solution for your problem.
The best way to do this would be to reify addresses as nodes and look for nodes with an indegree between 3 and 5.
You can use an index on the textual fields of your address nodes.
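A rough sketch of that reified-address approach using the TinkerPop Java API (labels and edge names follow the question; the exact DSE Graph syntax may differ, and the inside(3,5) predicate simply mirrors the original query):

import java.util.List;
import java.util.Map;
import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.inE;

// Addresses are vertices, so "shared" addresses are simply address vertices
// whose in-degree over has_address edges falls in the requested range.
public final class SharedAddresses {
    static List<Map<String, Object>> find(GraphTraversalSource g) {
        return g.V().hasLabel("address")
                .where(inE("has_address").count().is(P.inside(3, 5)))
                .project("address", "persons")
                .by()
                .by(inE("has_address").outV().fold())
                .limit(20)
                .toList();
    }
}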

Changing index in Neo4j affects the search in cypher

I am using Neo4j 2.0.
I have a label PROD. All nodes having label PROD have a property name. I have created an index on name like this :
CREATE INDEX ON :PROD(name)
After creating the index, if I change the value of the property name from "old" to "new", the following query works fine on a small dataset used for testing but not on our production data with 700 nodes having label PROD (where in total there are around a million nodes with other labels).
MATCH (n:PROD) WHERE n.name="new" RETURN n;
I have also created a legacy index on the same field, and after dropping and re-indexing the node on modification, it works perfectly fine on both test and production datasets.
Is there a way to ensure the indices are updated? What am I doing wrong? Why does the above query fail on the large dataset?
You can use the :schema command in Neo4j browser or schema in neo4j-shell. The output should indicate which indexes have been created and which of them are already online.
Additionally there's a programmatic way to wait until index population has finished.
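That programmatic wait looks roughly like this with the embedded Java API (a sketch, assuming a GraphDatabaseService instance called graphDb):

import java.util.concurrent.TimeUnit;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

// Block until all schema indexes (including :PROD(name)) have finished
// populating before issuing the lookup query.
try (Transaction tx = graphDb.beginTx()) {
    graphDb.schema().awaitIndexesOnline(10, TimeUnit.SECONDS);
    tx.success();
}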
Consider upgrading to 2.1.3; indexing has improved since 2.0.

Neo4j schema indexing is very slow

I'm trying to implement indexes in Neo4j in order to speed up some queries.
One of the queries is to find some people based on a string, using WHERE people.key IN [<10 string values here>].
We have around 180k nodes with the label People in the database and the query takes 41s to return the result.
So I created a schema index on this property, ran the query again and nothing changed. Out of curiosity, I decided to select by ID:
match (people:People)
where ID(people) IN [789806,908117,934851,934857,935125,935174,935177,935183,935581,935586,935587,935588,935634,935636,935637,935638,935639]
return ID(people)
It took 92ms. Perfect! So I tried to create a new property called test, index it and set it to the same value as the node id. Then I ran the following query:
match (people:People)
where people.test IN [789806,908117,934851,934857,935125,935174,935177,935183,935581,935586,935587,935588,935634,935636,935637,935638,935639]
return ID(people)
It again took 41s (41,000ms). Am I missing something? I really don't understand... Is it some performance configuration?
FYI, we use Neo4j 2.0.0 on Debian.
Thanks guys!
As far as I remember, the IN operator did not use an existing index in 2.0.x. So try to upgrade to 2.1.3 and retry.
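If upgrading is not an option right away, a possible workaround on 2.0.x is to issue one indexed equality lookup per value instead of a single IN, since each lookup does go through the schema index. A sketch with the embedded Java API (assumes a GraphDatabaseService called graphDb; the key values are placeholders):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

// One schema-index lookup per key value, collected into a single result list.
List<Node> result = new ArrayList<>();
try (Transaction tx = graphDb.beginTx()) {
    for (String key : Arrays.asList("value1", "value2")) {
        for (Node n : graphDb.findNodesByLabelAndProperty(
                DynamicLabel.label("People"), "key", key)) {
            result.add(n);
        }
    }
    tx.success();
}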

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on Neo4j and PHP with about 200k nodes, in which every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity, like users and companies, which holds the nodes of that type. So the users index holds 130K nodes, and the rest are in companies.
With Cypher we query like this:
START u=node:users('id:*')
RETURN count(u)
And the results are
Returned 1 row. Query took 4080 ms
The server is configured with the defaults plus a few little tweaks, but 4 seconds is too slow for our needs. Keep in mind that the database will grow by 20K nodes in a month, so we need this query to perform very well.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if it is possible to tweak this.
Thanks a lot, and sorry for my poor English.
Finally, using Gremlin instead of Cypher, I found the solution.
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the Lucene index to get an "approximate" row count.
Thanks again to all.
Mmh, this is really about the performance of that Lucene index. If you mostly need just this single query, why not keep an integer with the total count on some node somewhere, update it together with the index insertions, and for good measure refresh it with the query above every night?
You could instead keep a property on a specific node up to date with the number of such nodes, where updates are guarded by write locks:
Transaction tx = db.beginTx();
try {
    ...
    tx.acquireWriteLock( countingNode );
    countingNode.setProperty( "user_count",
        ((Integer) countingNode.getProperty( "user_count" )) + 1 );
    tx.success();
} finally {
    tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. Instead, do it like this:
company1-[:IS_ENTITY]->companyentity
Or if you are using 2.0:
company1:COMPANY
By the way, the second would also allow you to automatically update your index in a separate background thread; imo one of the best new features of 2.0.
The first method should also prove more efficient, since making a "hop" in general takes less time than reading a property from a node. It does, however, require you to create a separate index for the entities.
Your queries would look like this:
v2.0
MATCH (company:COMPANY)
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITY]->entity
RETURN count(company)
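For comparison, the v2.0 count can also be issued through the embedded Java API (a sketch assuming a GraphDatabaseService called graphDb; in 2.0/2.1 this is still a label scan, so it avoids Cypher overhead rather than the scan itself):

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.tooling.GlobalGraphOperations;

// Count every node carrying the COMPANY label.
long count = 0;
try (Transaction tx = graphDb.beginTx()) {
    for (Node n : GlobalGraphOperations.at(graphDb)
            .getAllNodesWithLabel(DynamicLabel.label("COMPANY"))) {
        count++;
    }
    tx.success();
}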

Need for long and dynamic select query/view sqlite

I need to generate a long SELECT query with potentially thousands of WHERE conditions like (table1.a = ? OR table1.a = ? OR ...) AND (table2.b = ? OR table2.b = ? ...) AND ...
I initially started building a class to make this more bearable, but have since stopped to wonder if this will work well. This query is going to be hammering a table of potentially 10s of millions of rows joined with 2 more tables with thousands of rows.
A number of concerns are stemming from this:
1.) I wanted to use these statements to generate a temp view so I could easily transfer over the existing code base. The point here is that I want to filter the data I have down for analysis based on parameters selected in a GUI, so how poorly will a view do in this scenario?
2.) Can sqlite even parse a query with thousands of binds?
3.) Isn't there a framework that can make generating this query easier other than with string concatenation?
4.) Is the better solution to dump all of the WHERE variables into hash sets in memory and then just write a wrapper for my DB query object that calls next() until a row is encountered that satisfies all my conditions? My concern here is that the application generates graphs procedurally on scrolls, so waiting to draw while calling query.next() 100,000 times might cause an annoying delay. Ideally I don't want to wait more than 30ms at a time for the next row that satisfies everything.
edit:
New issue: it came to my attention that sqlite3 is limited to 999 bind values (host parameters) at compile time.
So it seems as if the only way to accomplish what I had originally intended is to:
1.) Generate the entire query via string concatenation (my biggest concern being that I don't know how slow parsing all the data inside sqlite3 will be)
or
2.) Do the blanket query method (select * from * where index > ? limit ?) and call next() until I hit valid data in my compiled code (including updating the index variable and re-querying repeatedly)
I did end up writing a wrapper around the QSqlQuery object that uses index > variable and a limit to allow "walking" the table.
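For what it's worth, the same "walking" approach sketched in plain JDBC terms (the question uses QSqlQuery; the table name, rowid column and page size here are illustrative): page through the table with index > ? LIMIT ? and apply the in-memory hash-set filters to each row.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Keyset-style walk: fetch one page at a time, remember the last rowid seen,
// and stop when a page comes back short.
public final class TableWalker {
    public static void walk(Connection conn, int pageSize) throws SQLException {
        long lastRowId = 0;
        String sql = "SELECT rowid, * FROM table1 WHERE rowid > ? ORDER BY rowid LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            while (true) {
                ps.setLong(1, lastRowId);
                ps.setInt(2, pageSize);
                int seen = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        seen++;
                        lastRowId = rs.getLong("rowid");
                        // apply the in-memory filters (hash-set lookups) here
                    }
                }
                if (seen < pageSize) {
                    break; // reached the end of the table
                }
            }
        }
    }
}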
Consider dumping the joined results without filters (denormalized) into a flat file and indexing it with FastBit, a bitmap index engine.