I am using RediSearch and storing data in an index as documents. How can I get all the documents from the index?
This is half of an answer but perhaps it will spark a solution for you.
I can't speak to the Spring Data Redis part—not my jam—but in RediSearch itself you can get all the documents in the index by simply passing * as your query:
> FT.SEARCH my:index *
We have a Couchbase server version Community Edition 5.1.1 build 5723
In our Cars bucket we have the Car Make and the Cars it manufactured.
The connection between the two is the Id of the Car Make that we save as another field in the Car document (like a foreign key in a MySQL table).
The bucket has only 330,000 documents.
Queries take a long time: dozens of seconds for a very simple query such as
select * from cars where model="Camry" <-- we expect to have about 50,000 results for that
We perform the queries in 2 ways:
The UI of the Couchbase
A Spring Boot app that constantly gets a TimeOutException after 7.5 seconds
We thought the issue was a missing index on the bucket.
So we added an index:
CREATE INDEX cars_idx ON cars(makeName, modelName, makeId, _class) USING GSI;
We can see that index when running
SELECT * FROM system:indexes
What are we missing here? Are these reasonable response times for such queries in a NoSQL DB?
Try
CREATE INDEX model_idx ON cars(model);
Your index does not cover the model field.
You should also have an index on the Spring Data Couchbase "_class" property:
CREATE INDEX `type_idx` ON `cars`(`_class`)
So, this is how we solved the issue:
Using this link and @paralen's answer, we created several indexes that sped up the queries.
We altered our code to use pagination when we know the returned result set will be big, and came up with something like this:
int pageNumber = 0;
Slice<Car> slice; // Car is assumed to be the entity type of the repository
do {
    // Fetch one slice at a time, sorted by id so the page order is stable.
    Pageable pageable = PageRequest.of(pageNumber++, SLICE_SIZE, Sort.by("id"));
    slice = carsRepository.findAllByModelName("Camry", pageable);
    List<Car> cars = slice.getContent();
    // Process this slice's cars here.
} while (slice.hasNext());
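The loop shape can be tried without Couchbase or Spring. The following is only a plain-Java sketch: `fetchSlice` is a hypothetical stand-in for the repository call, `SLICE_SIZE` is an arbitrary value, and "a short slice means there is no next page" is a simplification of what `Slice.hasNext()` reports.

```java
import java.util.ArrayList;
import java.util.List;

public class SlicePagingSketch {
    static final int SLICE_SIZE = 3; // assumed page size, for illustration only

    // Hypothetical stand-in for a repository call: returns one page of the data.
    public static List<String> fetchSlice(List<String> all, int pageNumber, int size) {
        int from = Math.min(pageNumber * size, all.size());
        int to = Math.min(from + size, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    // Collects every element by requesting consecutive pages.
    public static List<String> fetchAll(List<String> all) {
        List<String> result = new ArrayList<>();
        int pageNumber = 0;
        List<String> slice;
        do {
            slice = fetchSlice(all, pageNumber++, SLICE_SIZE);
            result.addAll(slice);
        } while (slice.size() == SLICE_SIZE); // a short slice means no next page
        return result;
    }

    public static void main(String[] args) {
        List<String> cars = List.of("c1", "c2", "c3", "c4", "c5", "c6", "c7");
        System.out.println(fetchAll(cars));
    }
}
```

The important parts are that the page number advances on every iteration and that the slice variable is declared outside the loop body so the loop condition can see it.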
I'm trying to get the list of persons in a DataStax graph who share an address with other persons, where the number of persons at that address is between 3 and 5.
This is the query:
g.V().hasLabel('person').match(__.as('p').out('has_address').as('a').dedup().count().as('nr'),__.as('p').out('has_address').as('a')).select('nr').is(inside(3,5)).select('p','a','nr').range(0,20)
On the first run I got this error message:
Could not find an index to answer query clause and graph.allow_scan is
disabled: ((label = person))
I've enabled graph.allow_scan=true and now it works.
How can I create an index so that this query runs without enabling allow_scan=true?
Thanks
You can create an index by adding it to the schema using a command like this:
schema.vertexLabel('person').index('address').materialized().by('has_address').add()
Full documentation on adding indexes is available here: https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/using/createIndexes.html
You should not enable graph.allow_scan=true as under the covers it is turning on ALLOW FILTERING on the CQL queries. This will cause a lot of cluster scans and will inevitably time out with any real amount of data in the system. You should never enable this in any sort of production environment.
I am not sure that indexing is the solution for your problem.
The best way to do this would be to reify addresses as nodes and look for nodes with an indegree between 3 and 5.
You can use an index on the textual fields of your address nodes.
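To make the in-degree idea concrete outside of any graph database, here is a minimal plain-Java sketch. It assumes the `has_address` edges are available as (person, address) pairs; the class and method names are hypothetical, and treating the bounds as inclusive is an assumption of this sketch (Gremlin's `inside(3,5)` excludes both ends).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SharedAddressSketch {
    // Groups persons by address and keeps only the addresses whose number of
    // residents (the address node's in-degree) lies in [min, max].
    // Inclusive bounds are an assumption of this sketch.
    public static Map<String, List<String>> sharedAddresses(
            List<String[]> hasAddressEdges, int min, int max) {
        Map<String, List<String>> residents = new TreeMap<>();
        for (String[] edge : hasAddressEdges) {
            // edge[0] = person, edge[1] = address
            residents.computeIfAbsent(edge[1], k -> new ArrayList<>()).add(edge[0]);
        }
        residents.values().removeIf(p -> p.size() < min || p.size() > max);
        return residents;
    }

    public static void main(String[] args) {
        List<String[]> edges = List.of(
            new String[] {"ann", "a1"}, new String[] {"bob", "a1"},
            new String[] {"cy", "a1"}, new String[] {"dee", "a2"});
        System.out.println(sharedAddresses(edges, 3, 5));
    }
}
```

Once addresses are nodes of their own, the question becomes a count per address node rather than a match across all person vertices.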
I'm trying to implement indexes in Neo4j in order to speed up some queries.
One of the queries is to find some people based on a string using where people.key IN [<10 string values here>].
We have around 180k nodes with the label People in the database and the query takes 41s to return the result.
So I created a schema index on this property and ran the query again, but nothing changed. Out of curiosity, I decided to select by ID:
match (people:People)
where ID(people) IN [789806,908117,934851,934857,935125,935174,935177,935183,935581,935586,935587,935588,935634,935636,935637,935638,935639]
return ID(people)
It took 92 ms. Perfect! So I created a new property called test, indexed it, and set it to the same value as the node ID. Then I ran the following query:
match (people:People)
where people.test IN [789806,908117,934851,934857,935125,935174,935177,935183,935581,935586,935587,935588,935634,935636,935637,935638,935639]
return ID(people)
It again took 41 s (41,000 ms). Am I missing something? I really don't understand. Is it some performance configuration?
FYI, we use Neo4j 2.0.0 on Debian.
Thanks guys !
As far as I remember the IN operation was not using an existing index in 2.0.x. So try to upgrade to 2.1.3 and retry.
Is it possible to query neo4j for the newest nodes? In this case, the indexed property "timestamp" records time in milliseconds on every node.
All of the Cypher examples I have found concern graph-type queries: "start at node n and follow relationships". What is the general best approach for returning result sets sorted on one field? Is this even possible in a graph database such as Neo4j?
In the embedded Java API it is possible to add sorting using Lucene constructs.
http://docs.neo4j.org/chunked/milestone/indexing-lucene-extras.html#indexing-lucene-query-objects
http://blog.richeton.com/2009/05/12/lucene-sort-tips/
In the server mode you can pass an ?order parameter to the lucene lookup query.
http://docs.neo4j.org/chunked/milestone/rest-api-indexes.html#rest-api-find-node-by-query
Depending on how you indexed your data (not numerically as there are issues with the lucene query syntax parser and numeric searches :( ), in cypher you can do:
start n=node:myindex('time:[1 TO 1000]') return n order by n.time asc
There are also more graphy ways of doing that, e.g. by linking the events with a NEXT relationship and returning the head and next n elements from this list
http://docs.neo4j.org/chunked/milestone/cypher-cookbook-newsfeed.html
or to create a tree structure for time:
http://docs.neo4j.org/chunked/milestone/cypher-cookbook-path-tree.html
Yes, it is possible, and there are some different ways to do so.
You could either use a timestamp property with a classic index and sort your result set by that property, or you could create an in-graph time-based index, for example as described in Peter's blog post:
http://blog.neo4j.org/2012/02/modeling-multilevel-index-in-neoj4.html
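As a rough illustration of why an ordered index on the timestamp makes "newest first" cheap, here is a plain-Java sketch. It does not use the Neo4j API; a `TreeMap` stands in for the ordered index, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class NewestFirstSketch {
    // Ordered index: timestamp in milliseconds -> ids of nodes with that timestamp.
    private final TreeMap<Long, List<String>> byTimestamp = new TreeMap<>();

    public void add(String nodeId, long timestampMillis) {
        byTimestamp.computeIfAbsent(timestampMillis, k -> new ArrayList<>()).add(nodeId);
    }

    // Returns up to `limit` node ids, newest first, by walking the index backwards.
    public List<String> newest(int limit) {
        List<String> result = new ArrayList<>();
        for (List<String> ids : byTimestamp.descendingMap().values()) {
            for (String id : ids) {
                if (result.size() == limit) return result;
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NewestFirstSketch index = new NewestFirstSketch();
        index.add("n1", 1000L);
        index.add("n2", 3000L);
        index.add("n3", 2000L);
        System.out.println(index.newest(2));
    }
}
```

Because the index is already ordered by time, fetching the newest few entries only touches the tail of the index instead of sorting every node.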
I'm searching articles in PubMed via Lucene.
Each of the 20,000,000 articles has an abstract with ~250 words and an ID.
At the moment I store the results of my searches, each of which takes multiple seconds, in a TopDocs object.
Searches can find thousands of articles.
I'm just interested in the ID of the article.
Does Lucene load the abstracts internally into the TopDocs?
If so can I prevent that behavior through FieldSelectors or do FieldSelectors only work with IndexReader and don't work with IndexSearcher?
No, Lucene does not load the values of fields into TopDocs. TopDocs only contains the doc number and score for each one of the matching documents.
If you're having performance issues, here's another SO question that can help you:
Optimizing Lucene performance
Lucene, by default, does not load any stored fields. If you want to retrieve only the ID field, and if you can afford to load up all the IDs in memory, then you can load all values as follows and reuse them.
String[] allIDs = FieldCache.DEFAULT.getStrings(indexReader, "IDFieldName");
For more on FieldCache, see the answer to: Best way to retrieve certain field of all documents returned by a Lucene search
You're on the right lines.
Try using a SetBasedFieldSelector when you retrieve the document from the index.
As another poster noted, iterating through the hits will return a ScoreDoc object. This will give you the document Id that can be used to retrieve the document using the IndexReader associated with the IndexSearcher.
If IO is a problem because of loading fields you aren't interested in, you should be in for a pleasant surprise.
Hope this helps,