I have a collection of baseball players that is structured something like the following:
{
  "name": "john doe",
  "education": [
    "Community College",
    "University"
  ]
}
If I want to get a list of all schools in the education arrays I can do something like the following:
FOR school IN UNIQUE((
    FOR player IN players
        COLLECT schools = player.education
        RETURN schools
)[**])
    FILTER school != null
    FILTER LOWER(school) LIKE CONCAT('%', @name, '%')
    LIMIT 10
    RETURN school
But in order to do this it has to touch every document in the collection. I built an index on players.education[*] which would have all the schools in it. Is there any way I could directly query the index for the keys (school names) instead of having to touch every record in the collection each time I need to run the query?
There are two things to consider:
The FILTER school != null statement requires a non-sparse hash index (sparse indexing leaves out null values)
Using LOWER(school) and LIKE will always touch every document; no index will help, because the document has to be accessed to get the value before it can be lowercased and compared
Keep in mind that most indexes work in one of two ways ("fulltext" is the outlier):
Exact match ("hash")
Numerical gt/lt (range) evaluation ("skiplist")
To accomplish what you're after, you need to create an index on a string property that you can exactly match (case-sensitive). If you can't reliably match the case between your attribute value and search string, then I would recommend either transforming the value in the document or creating a lower-cased copy of that attribute and indexing that.
Here are the ArangoDB docs regarding index types. The manual has a section on index basics and usage, but I like the HTTP docs better.
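If you go the lowercased-copy route, a rough sketch of creating the non-sparse hash index over that attribute through the HTTP API might look like this (Python with requests; the education_lower attribute name and the server URL are assumptions):

import requests

# Create a non-sparse hash index over the (hypothetical) lowercased copy
# of the education array; [*] indexes the individual array values.
resp = requests.post(
    "http://localhost:8529/_api/index",
    params={"collection": "players"},
    json={"type": "hash", "fields": ["education_lower[*]"], "sparse": False},
)
print(resp.json())  # the new index's id on success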
Related
I'm reading "Tsql Fundamental" by Ben Itzik.
The author briefly mentioned that we shouldn't manipulate the filtered column if we want to use index efficiently. But he didn't really go into detail as to why this is the case.
So could someone please kindly explain the reason behind it?
The author briefly mentioned that we shouldn't manipulate the filtered column if we want to use index efficiently
What the author mentions is called sargability (the predicate is a valid "Search ARGument").
Assume this statement:
select * from t1 where name = 'abc'
If you have an index on the filtered column (name), then the query is sargable.
But not the one below:
select * from t1 where len(name) = 3
When SQL Server is presented with this query, the only way it can filter the data is to scan the entire table and then apply the predicate to each row.
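To see this in action without SQL Server, here is a quick stand-in demo using Python's stdlib sqlite3 (the table and index names are made up, and the plan output text varies by SQLite version):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (name TEXT)")
con.execute("CREATE INDEX idx_t1_name ON t1 (name)")

# Sargable: the bare column can be sought in the index.
for row in con.execute("EXPLAIN QUERY PLAN SELECT * FROM t1 WHERE name = 'abc'"):
    print(row)  # e.g. SEARCH t1 USING COVERING INDEX idx_t1_name (name=?)

# Not sargable: the function applied to the column forces a full scan.
for row in con.execute("EXPLAIN QUERY PLAN SELECT * FROM t1 WHERE length(name) = 3"):
    print(row)  # e.g. SCAN t1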
Think of an index as being like a telephone directory (hopefully that's still a familiar enough concept) where everyone is listed by their surnames followed by their addresses.
This index is useful if you want to locate someone's phone number and you know their surname (and maybe their address).
But what if you want to locate everyone who (to steal TheGameiswar's example) has a 3 letter surname - is the index useful to you? It may be slightly more useful than having to go and visit every house in town [1], but it's not nearly so efficient as being able to just jump to the appropriate surnames. You have to search the entire book.
Similarly, if you want to locate everyone who lives on a particular street, the index isn't so useful [2] - you have to search through the entire book to make sure you've found everyone. Or to locate everyone whose surname ends with Son, etc.
[1] This is the analogy for when a database may choose to perform an index scan to satisfy a query simply because the index is smaller and so is easier than a full table scan.
[2] This is the analogy for a query that isn't attempting to filter on the left-most column in the index.
The WHERE clause in a SQL query uses predicates to filter rows. A predicate is an expression that determines whether a condition applied to a database object is true or false. Example: "Salary > 5000".
Relational databases use predicates as a core element in filtering data. These predicates should be written in a certain form, known as "search arguments", in order for the query optimizer to effectively use the indexes on the attributes used in the WHERE clause.
A predicate in the form "column - operator - value" or "value - operator - column" is considered an appropriate search argument. Example: Salary = 1000 or Salary > 5000. As you can see, the column name should appear ALONE on one side of the expression and the constant or calculated value on the other side to form a valid search argument. The moment a built-in function like MAX, MIN, DATEADD or DATEDIFF is applied to the column name, the expression is no longer treated as a search argument, and the query optimizer won't use the indexes on that column.
I hope this is clear.
I define a Document object for my product entity which has several fields: Title, Brand, Category, Size, Color, Material.
Now I want to support users doing an AND search across multiple fields. Any document whose fields, taken together, contain all of the search words should be returned.
For example, when a user enters "gucci shirt red" I want to return all documents that have fields matching all 3 tokens "gucci", "shirt" AND "red". So all the documents below should be returned:
1. Documents whose Title contains all 3 words, for example Title = "Gucci Modern Shirt Red"
2. Documents with Title = "Gucci classical shirt" AND Color = "red"
3. Documents with Category = "mens shirt" AND Brand = "gucci" AND Color = "red"
4. etc.
I know that Lucene supports the + operator, which marks a term as a MUST in the search query. For example, I can translate the above keywords into the query "+gucci +shirt +red", so I'm sure the documents of example (1) above will definitely be returned. But does it work for cases (2) and (3) above?
When doing these types of queries I like to create a master BooleanQuery and add several sub-queries that work together to give the best result (a sketch follows below):
TermQuery (exact match): someone types in the exact match of the title
PhraseQuery (use slop): so if you have "Gucci Modern Shirt Red" and someone types in "Gucci Shirt" (notice the one-word gap) it would still match
FuzzyQuery (slow on large (> 50 million records) / non-memory indexes): to account for potential misspellings
Boolean sub-query: with all of the terms separated and OR'ed. Queries matching 1 out of 4 words will have a low score, while 3 out of 4 words will score higher.
QueryParser (as mentioned above, with potential field boosts)
Other: e.g. synonym search on phrases etc.
I would OR all of these query types together and then filter the results using a Collector minimum score.
The reason I like the master BooleanQuery approach is that you can have a setting where the user chooses the "type" of query, perhaps ranging from simple to advanced; it is easy to add or remove query types quickly on the fly, and the query can be built easily, giving predictable results. When boosting records or tweaking similarity you are working inside Lucene's internal scoring algorithm, and the results are sometimes not clear.
Performance: I have done queries like this using Lucene 3.0.x on indexes with > 100M records NOT IN MEMORY and it works pretty quickly, giving sub-second responses. FuzzyQuery does slow things down, but as stated before that can be made into an advanced search option (or "Search again with...").
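Here is a minimal sketch of that master BooleanQuery, assuming a Lucene 3.x/4.x-era mutable BooleanQuery API via PyLucene; the field name "title" and the example terms are just illustrations:

import lucene
lucene.initVM()

from org.apache.lucene.index import Term
from org.apache.lucene.search import (BooleanClause, BooleanQuery,
                                      FuzzyQuery, PhraseQuery, TermQuery)

master = BooleanQuery()

# Terms separated and OR'ed: the more words matched, the higher the score.
for word in ["gucci", "shirt", "red"]:
    master.add(TermQuery(Term("title", word)), BooleanClause.Occur.SHOULD)

# Phrase with slop, so "gucci shirt" still matches "Gucci Modern Shirt".
phrase = PhraseQuery()
phrase.setSlop(1)
phrase.add(Term("title", "gucci"))
phrase.add(Term("title", "shirt"))
master.add(phrase, BooleanClause.Occur.SHOULD)

# Fuzzy matching to absorb misspellings (slow on large indexes).
master.add(FuzzyQuery(Term("title", "guci")), BooleanClause.Occur.SHOULD)

# searcher.search(master, 10) then ranks documents that satisfy more of the
# sub-queries higher; a Collector can enforce a minimum score.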
No. When a term is not given an explicit field in the query, it goes to the default field, which would appear to be "title" in your case. You would need a query more like:
+shirt +color:red +brand:gucci
for instance.
Or, one common approach is to set up a catch-all field, in which all (or a large subset) of the searchable data is mashed together, allowing you to search everything in a very loose fashion on that field, in which case you would just use something like:
all:(+shirt +gucci +red)
Or, if you made that field your default field instead:
+shirt +gucci +red
As you indicated.
You could use MultiFieldQueryParser. Add Title, Color, Brand etc. to it.
If you search for "gucci shirt red", the above parser would produce a query like
+((Title:gucci Color:gucci Brand:gucci) (Title:shirt Color:shirt Brand:shirt) (Title:red Color:red Brand:red))
This should solve the problem.
Also, if you want, say, products with Brand = "gucci" to be shown first for the above query, you could apply a boost to that field.
I would like to get some feedback and suggestions regarding two approaches I'm considering to implementing searchable indexes using Redis sorted sets.
Situation and objective
We currently have some key-value tables we're storing in Cassandra, and which we would like to have indexes for. For example, one table would contain records of people, and the Cassandra table would have id as its primary key, and the serialized object as the value. The object would have fields such as first_name, last_name, last_updated, and others.
What we want is to be able to do searches such as "last_name = 'Smith' AND first_name > 'Joel'" , "last_name < 'Aaronson'" , "last_name = 'Smith' AND first_name = 'Winston'" and so on. The searches should yield the ids of matches so we can then retrieve the objects from Cassandra. I'm thinking the above searches could be done with a single index, sorted lexicographically by last_name, first_name, and last_updated. If we need some searches using a different order (e.g. "first_name = 'Zeus'") we can have a similar index that would allow those (e.g. first_name, last_updated).
We are looking at using Redis for this, because we need to be able to handle a large number of writes per minute. I've read up on some common ways Redis sorted sets are used, and come up with two possible implementations:
Option 1: a single sorted set per index
For our index by last_name, first_name, last_updated, we would have a sorted set in Redis under the key indexes:people:last_name:first_name:last_updated , which would contain strings with the format last_name:first_name:last_updated:id . For example:
smith:joel:1372761839.444:0azbjZRHTQ6U8enBw6BJBw
(For the separator I might use '::' rather than ':' or something else to work better with the lexicographic ordering, but let's ignore that for now)
The items would all be given score 0 so that the sorted set is sorted lexicographically by the strings themselves. If I then want to do a query like "last_name = 'smith' AND first_name < 'bob'", I would need to get all the items that start with 'smith:' and come before 'smith:bob'.
As far as I can tell, there are the following drawbacks to this approach:
There is no Redis function to select a range based on the string value. This feature, called ZRANGEBYLEX, has been proposed by Salvatore Sanfilippo at https://github.com/antirez/redis/issues/324 , but is not implemented, so I would have to find the endpoints using binary searches and get the range myself (perhaps using Lua, or at the application-level with Python which is the language we're using to access Redis).
If we want to include a time-to-live for index entries, it seems the simplest way to do it would be to have a regularly scheduled task that goes through the whole index and removes expired items.
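For what it's worth, here is how Option 1 looks with redis-py, following the key/member formats above; note that ZRANGEBYLEX did eventually ship (Redis 2.8.9), so on a modern server the binary-search workaround is unnecessary (the second entry is made up for illustration):

import redis

r = redis.Redis()
key = "indexes:people:last_name:first_name:last_updated"

# Every member gets score 0, so the set orders purely lexicographically.
r.zadd(key, {"smith:joel:1372761839.444:0azbjZRHTQ6U8enBw6BJBw": 0})
r.zadd(key, {"smith:alice:1372761840.100:hypotheticalId1234567": 0})

# last_name = 'smith' AND first_name < 'bob': everything from "smith:"
# (inclusive) up to "smith:bob" (exclusive).
for member in r.zrangebylex(key, "[smith:", "(smith:bob"):
    person_id = member.decode().rsplit(":", 1)[-1]
    print(person_id)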
Option 2: small sorted sets, sorted by last_updated
This approach would be similar, except we would have many, smaller, sorted sets, with each having a time-like value such as last_updated for the scores. For example, for the same last_name, first_name, last_updated index, we would have a sorted set for each last_name, first_name combination. For example, the key might be indexes:people:last_name=smith:first_name=joel , and it would have an entry for each person we have called Joel Smith. Each entry would have as its name the id and its score the last_updated value. E.g.:
value: 0azbjZRHTQ6U8enBw6BJBw ; score: 1372761839.444
The main advantages to this are (a) searches where we know all the fields except last_updated would be very easy, and (b) implementing a time-to-live would be very easy using ZREMRANGEBYSCORE.
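A sketch of Option 2 with redis-py, following the key format above (the cutoff is illustrative):

import time
import redis

r = redis.Redis()
key = "indexes:people:last_name=smith:first_name=joel"

# Member is the id; score is the last_updated timestamp.
r.zadd(key, {"0azbjZRHTQ6U8enBw6BJBw": 1372761839.444})

# (a) A range search on last_updated is a single call:
ids = r.zrangebyscore(key, 1372000000, 1373000000)

# (b) Time-to-live: drop all entries older than 30 days.
cutoff = time.time() - 30 * 24 * 3600
r.zremrangebyscore(key, "-inf", cutoff)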
The drawback, which seems very large to me is:
There seems to be a lot more complexity in managing and searching this way. For example, we would need the index to keep track of all its keys (in case, for example, we want to clean up at some point), and to do this in a hierarchical manner. A search such as "last_name < 'smith'" would require first looking at the list of all the last names to find those that come before smith, then for each of those looking at all the first names it contains, then for each of those getting all the items from its sorted set. In other words, a lot of components to build up and worry about.
Wrapping up
So it seems to me the first option would be better, in spite of its drawbacks. I would very much appreciate any feedback regarding these two or other possible solutions (even if they're that we should use something other than Redis).
I strongly discourage the use of Redis for this. You'll be storing a ton of extra pointer data, and if you ever decide you want to do more complicated queries, like SELECT ... WHERE first_name LIKE 'jon%', you're going to run into trouble. You'll also need to engineer extra, very big indexes that cross multiple columns in case you want to search on two fields at the same time. You'll essentially need to keep hacking away and re-engineering a search framework. You'd be much better off using Elasticsearch or Solr, or any of the other frameworks already built to do what you're trying to do. Redis is awesome and has lots of good uses. This is not one of them.
Warning aside, to answer your actual question: I think you'd be best served using a variant of your first solution. Use a single sorted set per index, but just convert your letters to numbers. Convert your letters to some decimal value. You can use the ASCII value, or just assign each letter to a 1-26 value in lexicographic order, assuming you're using English. Standardize, so that each letter takes up the same numeric length (so, if 26 is your biggest number, 1 would be written "01"). Then just append these together with a decimal point in front and use that as your score per index (i.e. "hat" would be ".080120"). This will let you have a properly ordered 1-to-1 mapping between words and these numbers. When you search, convert from letters to numbers, and then you'll be able to use all of Redis' nice sorted set functions like ZRANGEBYSCORE without needing to rewrite them. Redis' functions are written very, very optimally, so you're much better off using them whenever possible instead of writing your own.
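A minimal sketch of that letter-to-number conversion, assuming lowercase a-z mapped to 01-26 at a fixed two-digit width:

import redis

def word_to_score(word):
    # "hat" -> "08" + "01" + "20" -> 0.080120
    # Caveat: double precision only preserves roughly the first 8 letters.
    digits = "".join("%02d" % (ord(c) - ord("a") + 1) for c in word.lower())
    return float("0." + digits)

r = redis.Redis()
r.zadd("indexes:people:last_name", {"0azbjZRHTQ6U8enBw6BJBw": word_to_score("smith")})

# last_name < 'smith' becomes an ordinary numeric range query:
ids = r.zrangebyscore("indexes:people:last_name", 0, "(%s" % word_to_score("smith"))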
You could use my project python-stdnet for that; it does all the indexing for you. For example:
class Person(odm.StdModel):
    first_name = odm.SymbolField()
    last_name = odm.SymbolField()
    last_update = odm.DateTimeField()
Once the model is registered with a Redis backend, you can do this:
qs = models.person.filter(first_name='john', last_name='smith')
as well as
qs = models.person.filter(first_name=('john','carl'), last_name=('smith','wood'))
and much more
The filtering is fast as all ids are already in sets.
You can check out redblade; it can maintain indexes for you automatically, and it's written in Node.js.
//define schema
redblade.schema('article', {
    "_id"      : "id"
  , "poster"   : "index('user_article')"
  , "keywords" : "keywords('articlekeys', return +new Date() / 60000 | 0)"
  , "title"    : ""
  , "content"  : ""
})

//insert an article
redblade.insert('article', {
    _id      : '1234567890'
  , poster   : 'airjd'
  , keywords : 'Information Technology,JavaScript,NoSQL'
  , title    : 'Test SLIDE title'
  , content  : 'Test SLIDE content'
}, function(err) {
})

//select by index field or keywords
redblade.select('article', { poster: 'airjd' }, function(err, articles) {
  console.log(articles[0])
})

redblade.select('article', { keywords: 'NoSQL' }, function(err, articles) {
  console.log(articles[0])
})
I have a mongo query that looks like the following:
{'schedule': {'$elemMatch': {'time': {'$gt': start, '$lte': end}}}}
This searches a collection for items whose "schedule" field (a list of sub-documents) contains an element whose "time" field is between start and end.
I am unclear as to how mongo's indexes handle this situation, and wonder what the best practice would be if this query were going to run regularly.
This syntax works, but $elemMatch isn't actually needed here. $elemMatch is for when there is more than one field in an object (especially in an array element) that you want to match against. With a single field there is no need for it.
You should be doing this:
{"schedule.time" : { $gt: <start>, $lte: <end> }}
It makes it much clearer that you are querying a single field, even though in this case it appears in multiple array elements. An index on "schedule.time" can be used to make this query more efficient.
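For completeness, a minimal pymongo sketch of the suggested query plus the supporting index (the database/collection names and time bounds are hypothetical):

from datetime import datetime
from pymongo import ASCENDING, MongoClient

coll = MongoClient().mydb.items

# A multikey index on the embedded field supports the range query.
coll.create_index([("schedule.time", ASCENDING)])

start = datetime(2013, 7, 1)
end = datetime(2013, 7, 2)
cursor = coll.find({"schedule.time": {"$gt": start, "$lte": end}})

One caveat worth knowing: on arrays the two forms are not quite equivalent. Without $elemMatch, the $gt and $lte bounds may be satisfied by different array elements, whereas $elemMatch requires a single element to satisfy both.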
In my Lucene documents I have a field "company" where the company name is tokenized.
I need the tokenization for a certain part of my application.
But for this query, I need to be able to create a PrefixQuery over the whole company field.
Example:
"My Brand" is tokenized into the terms [my] and [brand].
"brahmin farm" is tokenized into the terms [brahmin] and [farm].
Regularly querying for "bra" would return both documents because they both have a term starting with bra.
The result I want though, would only return the last entry because the first term starts with bra.
Any suggestions?
Create another indexed field, where the company name is not tokenized. When necessary, search on that field rather than the tokenized company name field.
If you want fast searches, you need to have index entries that point directly at the records of interest. There might be something you can do with the proximity data to filter records, but it will be slow. I see the problem as: how can a "starts with" query over a complete field be performed efficiently?
You might be able to minimize the increase in index size by creating (for each current field) a "first term" field and "remaining terms" field. This would eliminate duplication of the first term in two fields. For "normal" queries, you look for query terms in either of these fields. For "startswith" queries, you search only the "first term" field. But this seems like more trouble than it's worth.
Use a SpanQuery to restrict the search to the first term position: a PrefixQuery wrapped in a SpanMultiTermQueryWrapper, wrapped in a SpanPositionRangeQuery:
<SpanPositionRangeQuery: spanPosRange(SpanMultiTermQueryWrapper(company:bra*), 0, 1)>
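Constructing that query might look roughly like this (PyLucene, assuming Lucene 4.x-era span packages):

import lucene
lucene.initVM()

from org.apache.lucene.index import Term
from org.apache.lucene.search import PrefixQuery
from org.apache.lucene.search.spans import (SpanMultiTermQueryWrapper,
                                            SpanPositionRangeQuery)

# Match the prefix only when it occurs at position 0, i.e. the first term.
prefix = PrefixQuery(Term("company", "bra"))
first_term_only = SpanPositionRangeQuery(SpanMultiTermQueryWrapper(prefix), 0, 1)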