MongoDB $elemMatch Index - optimization

I have a Mongo query that looks like the following:
{'schedule': {'$elemMatch': {'time': {'$gt': start, '$lte': end}}}}
It searches a collection for items whose "schedule" field (a list of objects) contains an element whose "time" field is between start and end.
I am unclear on how Mongo's indexes handle this situation, and wonder what the best practice would be if this query is going to run regularly.

This syntax should work, but $elemMatch is not needed here. $elemMatch is used when there is more than one field in an object (especially one inside an array) that you want to match on. With a single field there is no need for $elemMatch.
You should be doing this:
{"schedule.time" : { $gt: <start>, $lte: <end> }}
This makes it much clearer that you are matching on a single field, even though that field appears in multiple array elements. An index on "schedule.time" can be used for this query to make it more efficient.
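For example, in the mongo shell (the collection name events is just a placeholder):
db.events.createIndex({ "schedule.time": 1 })
With this index in place, the range query on "schedule.time" above can use the index to narrow down candidate documents instead of scanning the whole collection.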

Related

ArangoDB: query values in an index

I have a collection of baseball players that is structured something like the following:
{
"name": "john doe",
"education": [
"Community College",
"University"
]
}
If I want to get a list of all schools in the education arrays I can do something like the following:
FOR school IN UNIQUE((
    FOR player IN players
        COLLECT schools = player.education
        RETURN schools
)[**])
    FILTER school != null
    FILTER LOWER(school) LIKE CONCAT('%', @name, '%')
    LIMIT 10
    RETURN school
But in order to do this it has to touch every document in the collection. I built an index on players.education[*] which would have all the schools in it. Is there any way I could directly query the index for the keys (school names) instead of having to touch every record in the collection each time I need to run the query?
There are two things to consider:
The FILTER school != null statement requires a non-sparse hash index (sparse indexing leaves out null values)
Using LOWER(school) and LIKE will always touch every document - no index will help (it has to access the document to get the value to make it lowercase, etc.)
Keep in mind that most indexes work in one of two ways ("fulltext" is the outlier):
Exact match ("hash")
Numerical gt/lt evaluation ("skiplist")
To accomplish what you're after, you need to create an index on a string property that you can exactly match (case-sensitive). If you can't reliably match the case between your attribute value and search string, then I would recommend either transforming the value in the document or creating a lower-cased copy of that attribute and indexing that.
Here are the ArangoDB docs regarding index types. The manual has a section on index basics and usage, but I like the HTTP docs better.
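As a sketch of that last suggestion, suppose you store a lower-cased copy of the array in an attribute called education_lower (the attribute name is an assumption, not part of the original schema). In arangosh:
db.players.ensureIndex({ type: "hash", fields: ["education_lower[*]"], sparse: false });
An exact-match query can then use the array index, since LOWER() is only applied to the bind parameter, never to document values:
FOR player IN players
    FILTER LOWER(@name) IN player.education_lower
    RETURN player.name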

Schema index for words in a string

I have a large amount of nodes which have the property text containing a string.
I want to find all nodes whose text contains a given string (exact match). This can be done using the CONTAINS operator.
MATCH (n)
WHERE n.text CONTAINS 'keyword'
RETURN n
Edit: I am looking for all nodes n where n.text contains the substring 'keyword'. E.g. n.text = 'This is a keyword'.
To speed this up I want to create an index for each word. Is this possible using the new Schema Indexes?
(Alternatively this could be done using a legacy index and adding each node to this index but I would prefer using a schema index)
Absolutely. Given that you are looking for an exact match, you can use a schema index. Judging from your question you probably know this, but to create the index you will need to assign your nodes a label and then create the index on that label.
CREATE INDEX ON :MyLabel(text)
Then at query time the Cypher execution engine will automatically use this index for the following query:
MATCH (n:MyLabel { text : 'keyword' })
RETURN n
This will use the schema index to look up the node with label MyLabel and property text with value keyword. Note that this is an exact match of the complete value of the property.
To force Neo4j to use a particular index you can use index hints
MATCH (n:MyLabel)
USING INDEX n:MyLabel(text)
WHERE n.text = 'keyword'
RETURN n
EDIT
On re-reading your question, I think you are not actually looking for a full exact match, but rather an exact match on the keyword parameter within the text field. If so, then no, you cannot yet use schema indexes for that. Quoting "Use index with STARTS WITH" in the Neo4j manual:
The similar operators ENDS WITH and CONTAINS cannot currently be solved using indexes.
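By contrast, a prefix search can be index-backed (in Neo4j versions whose planner supports STARTS WITH with indexes), so this form avoids the full scan:
MATCH (n:MyLabel)
WHERE n.text STARTS WITH 'keyword'
RETURN n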
If I understand your question correctly, a legacy index would accomplish exactly what you're looking to do. If you don't want to have to maintain the index for each node you create/delete/update, you can use auto indexing (http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/).
If you're looking to only use schema indexing, another approach would be to store each keyword as a separate node. Then you could use a schema index for finding relevant keyword nodes which then map to the node they exist on. Just a thought.
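A sketch of that modeling (the Keyword label, its value property, and the HAS_KEYWORD relationship are illustrative names, not from the question):
CREATE INDEX ON :Keyword(value)

MATCH (k:Keyword { value: 'keyword' })<-[:HAS_KEYWORD]-(n)
RETURN n
You would need to create the Keyword nodes and HAS_KEYWORD relationships yourself whenever you create or update a node's text.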

Get multiple persistent entries by keylist

I use SQLite with persistent in Haskell.
I have a list of keys i.e. [PostId].
Now I want to get all entries for those keys, using select options such as
[Desc PostCrtDate, OffsetBy from, LimitTo (to - from + 1)].
Is there an alternative to selectList that takes a list of keys instead of (or in addition to) the "normal" conditions of an SQL query?
It seems horribly inefficient to use mapM get keyList and then do sorting/offsetting/limiting, especially with a big database.
I am open to using esqueleto if necessary but I would rather not introduce another dependency.
Thanks!
I'm on a mobile right now and therefore may get the syntax wrong, but it's something like (the function is selectList; persistent has no separate selectWhere):
selectList [PostId <-. keyList] []
That operator is the "in" operator, checking whether a value is in a list.
Note that this will not give any errors if some of the keys are not found, you'd need to check for that manually.
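A fuller sketch (assuming the Post entity and its PostCrtDate field from the question, with the usual persistent TH definitions in scope):
import Database.Persist
import Database.Persist.Sql (SqlPersistM)

getPostsByKeys :: [PostId] -> Int -> Int -> SqlPersistM [Entity Post]
getPostsByKeys keyList from to =
    selectList
        [PostId <-. keyList]  -- only rows whose key is in the list
        [Desc PostCrtDate, OffsetBy from, LimitTo (to - from + 1)]  -- sort and page in SQL
This pushes the filtering, sorting, offset, and limit into a single SQL query instead of doing mapM get and post-processing in Haskell.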

How to extract property of a collection in the root document

I'm using RavenDB and I'm having trouble extracting a particular value using the Lucene Query.
Here is the JSON in my document:
{
"customer" : "my customer"
"locations": [
{
"name": "vel arcu. Curabitur",
"settings": {
"enabled": true
}
}
]
}
Here is my query:
var list = session.Advanced.LuceneQuery<ExpandoObject>()
.SelectFields<ExpandoObject>("customer", "locations;settings.enabled", "locations;name")
.ToList();
The list is populated and contains a bunch of ExpandoObjects with customer properties but I can't for the life of me get the location -> name or location -> settings -> enabled to come back.
Is the ";" or "." incorrect usage?
It seems that you have misunderstood the concept of indexes and queries in RavenDB. When you load a document in RavenDB you always load the whole document, including everything it contains. So in your case, if you load a customer, you already have the locations collection and all its children loaded. That means you can use standard LINQ-to-objects to extract all these values; no need for anything special like indexes or Lucene here.
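For example (the Customer, Location, and Settings classes and the document id are assumptions based on the JSON above):
var customer = session.Load<Customer>("customers/1");
var enabledLocationNames = customer.Locations
    .Where(l => l.Settings.Enabled)
    .Select(l => l.Name)
    .ToList();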
If you want to do this extraction on the database side, so that you can query on those properties, then you need an index. Indexes are written using linq, but it's important to understand that they run on the server and just extract some data to populate the lucene index from. But here again, in most cases you don't even have to write the indexes yourself because RavenDB can create them automatically for you.
In no case do you need to write Lucene queries like the one in your question, because in RavenDB Lucene queries are always executed against a pre-built index, and those indexes are generally flat. But again, chances are you don't need to do anything with Lucene to get what you want.
I hope that makes sense for you. If not, please update your question and tell us more about what you actually want to do.
Technically, you can use the comma operator "," to nest into collections.
That should work, but it isn't recommended. You can just load your whole object and use it; that is easier and faster.
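If you do want to try it, the projection would look something like the asker's query with ";" swapped for "," (whether this works may depend on your RavenDB version):
var list = session.Advanced.LuceneQuery<ExpandoObject>()
    .SelectFields<ExpandoObject>("customer", "locations,name", "locations,settings.enabled")
    .ToList();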

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly ran into a "Too Many Clauses" exception when trying to get the hits from a boolean query that consisted of a term query and a wildcard query.
I searched around the net, and the resources I found suggest increasing BooleanQuery.SetMaxClauseCount().
This sounds fishy to me. To what should I raise it? How can I be sure this new magic number will be sufficient for my query? How far can I increase this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem..
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
The paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
At query time, the number of characters in your term decides which field to search. This means that you will perform a prefix query only for terms with more than 3 characters, which greatly decreases the internal result count, preventing the infamous TooManyClauses exception. Apparently this also speeds up the searching process.
You can easily automate a process that breaks the terms down and fills the documents with values according to this naming scheme during indexing, as sketched below.
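A sketch of the indexing side (this assumes the Lucene 4+ field API; on the query side you would run a TermQuery against the paintCode<len>n field for terms of up to 3 characters and a PrefixQuery against paintCode otherwise):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

// Index the full code plus fixed-length prefix fields up to length 3.
Document doc = new Document();
String paintCode = "a4c2d3";
doc.add(new StringField("paintCode", paintCode, Field.Store.YES));
for (int len = 1; len <= 3 && len <= paintCode.length(); len++) {
    doc.add(new StringField("paintCode" + len + "n",
                            paintCode.substring(0, len), Field.Store.NO));
}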
Some issues may arise if you have multiple tokens in each field. You can find more details in the article linked above.