I would like to get some feedback and suggestions regarding two approaches I'm considering for implementing searchable indexes using Redis sorted sets.
Situation and objective
We currently have some key-value tables we're storing in Cassandra, which we would like to have indexes for. For example, one table would contain records of people, and the Cassandra table would have id as its primary key and the serialized object as the value. The object would have fields such as first_name, last_name, last_updated, and others.
What we want is to be able to do searches such as "last_name = 'Smith' AND first_name > 'Joel'", "last_name < 'Aaronson'", "last_name = 'Smith' AND first_name = 'Winston'", and so on. The searches should yield the ids of matches so we can then retrieve the objects from Cassandra. I'm thinking the above searches could be done with a single index, sorted lexicographically by last_name, first_name, and last_updated. If we need some searches using a different order (e.g. "first_name = 'Zeus'") we can have a similar index that would allow those (e.g. first_name, last_updated).
We are looking at using Redis for this, because we need to be able to handle a large number of writes per minute. I've read up on some common ways Redis sorted sets are used, and come up with two possible implementations:
Option 1: a single sorted set per index
For our index by last_name, first_name, last_updated, we would have a sorted set in Redis under the key indexes:people:last_name:first_name:last_updated, which would contain strings with the format last_name:first_name:last_updated:id. For example:
smith:joel:1372761839.444:0azbjZRHTQ6U8enBw6BJBw
(For the separator I might use '::' rather than ':' or something else to work better with the lexicographic ordering, but let's ignore that for now)
The items would all be given score 0 so that the sorted set will just be sorted lexicographically by the strings themselves. If I then want to do a query like "last_name = 'smith' AND first_name < 'bob'", I would need to get all the items that start with 'smith:' and come lexicographically before 'smith:bob'.
As far as I can tell, there are the following drawbacks to this approach:
There is no Redis function to select a range based on the string value. This feature, called ZRANGEBYLEX, has been proposed by Salvatore Sanfilippo at https://github.com/antirez/redis/issues/324, but is not implemented, so I would have to find the endpoints using binary searches and get the range myself (perhaps using Lua, or at the application level with Python, which is the language we're using to access Redis); see the sketch after this list.
If we want to include a time-to-live for index entries, it seems the simplest way to do it would be a regularly scheduled task that goes through the whole index and removes expired items (also covered in the sketch below).
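For the first drawback, one workaround is to locate the endpoints by rank using temporary sentinel members and then slice the set between them. Here is a minimal redis-py sketch of that idea, together with the scheduled sweep from the second drawback. It assumes the key layout from the example above, and the sentinel steps would need to run inside a Lua script (EVAL) to be atomic under concurrent writes:

import time
import redis

r = redis.Redis(decode_responses=True)

INDEX_KEY = "indexes:people:last_name:first_name:last_updated"

def lex_range(key, lo, hi):
    """Members lexicographically strictly between lo and hi.

    Inserts lo and hi as temporary score-0 sentinels, reads their
    ranks, slices the set, and removes the sentinels. Racy under
    concurrent writes, and assumes the sentinels never collide with
    real members; wrapping the same steps in a Lua script fixes that.
    """
    r.zadd(key, {lo: 0, hi: 0})
    start, stop = r.zrank(key, lo), r.zrank(key, hi)
    members = r.zrange(key, start + 1, stop - 1)
    r.zrem(key, lo, hi)
    return members

def sweep_expired(key, ttl_seconds):
    """Scheduled cleanup: parse the embedded last_updated field out of
    each member and remove entries older than the TTL."""
    cutoff = time.time() - ttl_seconds
    for member, _score in r.zscan_iter(key):
        # Member layout: last_name:first_name:last_updated:id
        last_updated = float(member.split(":")[2])
        if last_updated < cutoff:
            r.zrem(key, member)

# "last_name = 'smith' AND first_name < 'bob'":
matches = lex_range(INDEX_KEY, "smith:", "smith:bob")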
Option 2: small sorted sets, sorted by last_updated
This approach would be similar, except we would have many smaller sorted sets, each using a time-like value such as last_updated for the scores. For the same last_name, first_name, last_updated index, we would have a sorted set for each last_name, first_name combination. For example, the key might be indexes:people:last_name=smith:first_name=joel, and it would have an entry for each person we have called Joel Smith. Each entry would have the id as its member and the last_updated value as its score. E.g.:
value: 0azbjZRHTQ6U8enBw6BJBw ; score: 1372761839.444
The main advantages to this are (a) searches where we know all the fields except last_updated would be very easy, and (b) implementing a time-to-live would be very easy, using ZREMRANGEBYSCORE (see the sketch below).
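As a sketch of (b) with redis-py, reusing the Joel Smith key from the example (the 30-day TTL is an arbitrary assumption):

import time
import redis

r = redis.Redis()

# The per-(last_name, first_name) key from the example above.
key = "indexes:people:last_name=smith:first_name=joel"

# Index Joel Smith: the member is the id, the score is last_updated.
r.zadd(key, {"0azbjZRHTQ6U8enBw6BJBw": 1372761839.444})

# Time-to-live: drop every entry older than 30 days in one call.
THIRTY_DAYS = 30 * 24 * 3600
r.zremrangebyscore(key, "-inf", time.time() - THIRTY_DAYS)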
The drawback, which seems very large to me, is:
There seems to be a lot more complexity in managing and searching this way. For example, we would need the index to keep track of all of its keys (in case, say, we want to clean up at some point), and to do so in a hierarchical manner. A search such as "last_name < 'smith'" would require first looking at the list of all last names to find those that come before 'smith', then for each of those looking at all the first names it contains, then for each of those getting all the items from its sorted set. In other words, a lot of components to build up and worry about.
Wrapping up
So it seems to me the first option would be better, in spite of its drawbacks. I would very much appreciate any feedback regarding these two or other possible solutions (even suggestions that we should use something other than Redis).
I strongly discourage the use of Redis for this. You'll be storing a ton of extra pointer data, and if you ever decide you want to do more complicated queries, like SELECT WHERE first_name LIKE 'jon%', you're going to run into trouble. You'll also need to engineer extra, very big indexes that cross multiple columns in case you want to search on two fields at the same time. You'll essentially need to keep hacking away at and reengineering a search framework. You'd be much better off using Elasticsearch or Solr, or any of the other frameworks already built to do what you're trying to do. Redis is awesome and has lots of good uses. This is not one of them.
Warning aside, to answer your actual question: I think you'd be best served using a variant of your first solution. Use a single sorted set per index, but convert your letters to numbers. You can use the ASCII value, or just assign each letter a 1-26 value in lexicographic order, assuming you're using English. Standardize so that each letter takes up the same numeric length (so, if 26 is your biggest number, 1 would be written "01"). Then append these together with a decimal point in front and use that as your score per index entry (i.e. "hat" would be ".080120"). This gives you a properly ordered 1-to-1 mapping between words and these numbers. When you search, convert from letters to numbers, and then you'll be able to use all of Redis's nice sorted set functions like ZRANGEBYSCORE without needing to rewrite them. Redis's functions are written very, very optimally, so you're much better off using them whenever possible instead of writing your own.
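A minimal Python sketch of this encoding, assuming lowercase ASCII words; note the precision caveat in the comments:

def word_to_score(word):
    """Two decimal digits per letter (a=01 ... z=26), behind a decimal
    point, so numeric order of scores matches lexicographic word order."""
    digits = "".join("%02d" % (ord(ch) - ord("a") + 1) for ch in word.lower())
    return float("0." + digits)

word_to_score("hat")   # 0.08012  ("h"=08, "a"=01, "t"=20)
# Caveat: a float64 score carries only ~15-16 significant decimal
# digits, so only about the first eight letters of a word survive
# intact; longer words can collide at the same score.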
You could use my project python-stdnet for that; it does all the indexing for you. For example:
from stdnet import odm

class Person(odm.StdModel):
    first_name = odm.SymbolField()
    last_name = odm.SymbolField()
    last_update = odm.DateTimeField()
Once the model is registered with a Redis backend, you can do this:
qs = models.person.filter(first_name='john', last_name='smith')
as well as
qs = models.person.filter(first_name=('john','carl'), last_name=('smith','wood'))
and much more
The filtering is fast as all ids are already in sets.
You can check out redblade; it maintains indexes automatically for you, and it's written in Node.js.
//define schema
redblade.schema('article', {
    "_id"      : "id",
    "poster"   : "index('user_article')",
    "keywords" : "keywords('articlekeys', return +new Date() / 60000 | 0)",
    "title"    : "",
    "content"  : ""
})
//insert an article
redblade.insert('article', {
    _id      : '1234567890',
    poster   : 'airjd',
    keywords : 'Information Technology,JavaScript,NoSQL',
    title    : 'Test SLIDE title',
    content  : 'Test SLIDE content'
}, function(err) {
})
//select by index field or keywords
redblade.select('article', { poster: 'airjd' }, function(err, articles) {
    console.log(articles[0])
})
redblade.select('article', { keywords: 'NoSQL' }, function(err, articles) {
    console.log(articles[0])
})
Related
I have a collection of baseball players that is structured something like the following:
{
    "name": "john doe",
    "education": [
        "Community College",
        "University"
    ]
}
If I want to get a list of all schools in the education arrays I can do something like the following:
FOR school IN UNIQUE((
    FOR player IN players
        COLLECT schools = player.education
        RETURN schools
)[**])
    FILTER school != null
    FILTER LOWER(school) LIKE CONCAT('%', @name, '%')
    LIMIT 10
    RETURN school
But in order to do this it has to touch every document in the collection. I built an index on players.education[*] which would have all the schools in it. Is there any way I could directly query the index for the keys (school names) instead of having to touch every record in the collection each time I need to run the query?
There are two things to consider:
The FILTER school != null statement requires a non-sparse hash index (sparse indexing leaves out null values)
Using LOWER(school) and LIKE will always touch every document - no index will help (it has to access the document to get the value to make it lowercase, etc.)
Keep in mind that most indexes work in one of two ways ("fulltext" is the outlier):
Exact match ("hash")
Numeric gt/lt evaluation ("skiplist")
To accomplish what you're after, you need to create an index on a string property that you can exactly match (case-sensitive). If you can't reliably match the case between your attribute value and search string, then I would recommend either transforming the value in the document or creating a lower-cased copy of that attribute and indexing that.
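A sketch of the lower-cased-copy approach using the python-arango driver; the education_lower attribute name and the connection details are assumptions for illustration:

from arango import ArangoClient

# Connection details are placeholders.
db = ArangoClient().db("mydb", username="root", password="")

# Maintain a lower-cased copy of each player's education list...
db.aql.execute("""
    FOR p IN players
        UPDATE p WITH {
            education_lower: (
                FOR s IN NOT_NULL(p.education, [])
                    RETURN LOWER(s)
            )
        } IN players
""")

# ...and index the copy so exact (now case-insensitive) matches can
# use the hash index instead of touching every document.
db.collection("players").add_hash_index(fields=["education_lower[*]"], sparse=False)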
Here are the ArangoDB docs regarding index types. The manual has a section on index basics and usage, but I like the HTTP docs better.
Or maybe the question should be: What's the best way to represent a string as a number, such that sorting the numeric representations gives the same result as sorting the strings? I devised a way that can sort up to 9 characters per string, but it seems like there should be a much better way.
Up front: I don't think Redis's lexicographical commands will work here. (See the following example.)
Example: Suppose I want to presort all of the names linked to some ID so that I can use ZINTERSTORE to quickly get an ordered list of IDs based on their names (without using Redis's SORT command). Ideally I would have the IDs as the zset's members, with the numeric representation of each name as the zset's scores.
Does that make sense? Or am I going about it wrong?
You're trying to use an order-preserving hash function to generate a score for each id. While it appears you've written one, you've already found out that the score's precision allows you to use only the first 9 characters (it would be interesting to see your function, btw).
Instead of this approach, here's a simpler one that would be easier IMO - use set members of the form <name>:<id> and set the score to 0. You'll be able to use lexicographical ordering this way and use something like split(':') to get the id from the set's members.
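A quick redis-py sketch of that scheme (the key name is made up):

import redis

r = redis.Redis(decode_responses=True)

# All members get score 0, so the set orders itself lexicographically.
r.zadd("people:by_name", {"smith:42": 0, "jones:7": 0, "smith:13": 0})

# Members come back sorted by name (then id); split the id back out.
ids = [m.rsplit(":", 1)[1] for m in r.zrange("people:by_name", 0, -1)]
# ids == ['7', '13', '42']   (jones:7, smith:13, smith:42)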
After doing some research and testing with Couchbase, I am getting some good results.
However, it seems strange that views must be created ahead of time and are not very flexible.
Basically, if I have a view like this..
function(doc, meta) {
    emit([doc.name, doc.location, doc.gender, doc.birthYear, doc.birthMonth], null);
}
And I want to query by different keys, such as name = "John" and gender = "M".
It doesn't seem I can do startKey = ["John", {}, "M"], endKey = ["John", {}, "M", {}].
Similarly, what if I just want to filter the above by gender and birth month?
It seems I have to manually create an individual view for every possible type of query, which, with lots of data points, is less than optimal.
I haven't seen any questions addressing this. Also, I looked into passing args to map or reduce to do any of it dynamically, but that can't be done. I'd be stuck pulling ALL records across all group levels and then having to manually sort/aggregate the data.
Can this be done?
Thank you
As of Couchbase version 4.x you have the N1QL query language. You can specify filter criteria to select your JSON objects without having any views in place.
So, as per your example, you should be able to issue a query like this:
SELECT *
FROM your_bucket_name
WHERE name = 'John' AND gender = 'M'
Here is an N1QL tutorial to get a feel for it.
Yet another way is to use the Couchbase integration with Elasticsearch and execute the search query in the Elasticsearch engine, which will return all the keys it found based on your search criteria.
http://www.couchbase.com/communities/n1ql
N1QL is a richer querying language for Couchbase data, and is not as limited as views.
I was going through all the existing question posts but couldn't find anything very relevant.
I have a file with millions of records containing a person's first name, last name, address1, address2, country code, and date of birth. I would like to check my list of customers against this file on a daily basis (both my customer list and the file get updated daily).
For first name and last name I would like a fuzzy match (maybe Lucene FuzzyQuery / Levenshtein distance with a 90% match), and for the remaining fields, country and date of birth, I want an exact match.
I am new to Lucene, but judging by the number of posts, it looks like it's possible.
My questions are:
How should I index my input file? I need to build an index on the combination of FN, LN, country, and DOB and use it for searching.
How can I use Lucene's fuzzy query here?
Is there any other way I can implement the same?
Rushik, here are a few ideas:
Consider using Solr. It is much easier to get started with than bare Lucene.
Build a Lucene/Solr index of the file. It appears that a document per customer is enough, if you use a multi-valued field or two different fields for addresses.
Do you have a unique id per person? To use Solr, you need one. In Lucene, you can get away without using a unique id.
Store the country code as a "keyword". If you only require exact match for date of birth, you may do the same. For range queries, you will need another representation.
I assume your customer list is smaller than the file. A possible policy would be to index the changes in the file daily (here a unique id is really handy - otherwise you need to delete by query, which may miss the mark). Then you can optimize the index, and after that run a search for your updated customer list.
What you describe is a BooleanQuery whose clauses are fuzzy queries for the first and last names and term queries for the other fields. You can create the query programmatically or using the query parser (see the sketch after this list).
Consider using soundex for names as described here.
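To make the BooleanQuery point concrete, here is a sketch using PyLucene with the Lucene 3.x API; the field names, example values, and the 0.9 minimum similarity are all assumptions:

from lucene import BooleanClause, BooleanQuery, FuzzyQuery, Term, TermQuery

query = BooleanQuery()
# Fuzzy clauses for the names (0.9 minimum similarity)...
query.add(FuzzyQuery(Term("first_name", "john"), 0.9), BooleanClause.Occur.MUST)
query.add(FuzzyQuery(Term("last_name", "smith"), 0.9), BooleanClause.Occur.MUST)
# ...and exact term clauses for the other fields.
query.add(TermQuery(Term("country", "US")), BooleanClause.Occur.MUST)
query.add(TermQuery(Term("dob", "19800101")), BooleanClause.Occur.MUST)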
Some academic papers on this subject are well worth reading (google for the free PDFs):
A Comparison of Personal Name Matching: Techniques and Practical Issues (2006)
Overview of Record Linkage and Current Research Directions (2006)
A Parallel Open Source Data Linkage System (2004)
You should also consider the following libraries/frameworks:
Duke: https://github.com/larsga/Duke
Febrl: http://sourceforge.net/projects/febrl/
(Answered for future visitors.)
AJAX autocomplete is fairly simple to implement. However, I wonder how to handle smart tag suggestion like this on SO.
To clarify the difference between autocomplete and suggestion:
autocomplete: foo [foobar, foobaz]
suggestion: foo [barfoo, foobar, foobaz], or even better, with 'did you mean' feature: [barfoo, foobar, foobaz, fobar, fobaz]
I suppose I need some full-text search in tags (all letters indexed, not just words). It would be no problem to do this with regex or other patterns for a limited number of tags (even client-side).
But how do I implement this feature for a large number of tags?
Is there any particular reason (besides URL) the tags on SO are dash separated? What about Unicode characters in tags?
I store the tags in the table with the following columns: id, tagname.
My SQL query returns objects with the following fields: id, tagname, count.
(I use Doctrine ORM and pgsql as default db driver.)
I would go with SELECTING them from the database by REGEXP at every keypress. I did this on my sites and there was no performance problem (I do not have a heavily loaded server, though). If you do not like this idea, I would cache all the 1-5 letter combinations that users enter and refresh them on a daily basis in a separate table. If this table is indexed, then you have a very fast implementation.
To elaborate more on the second approach:
Briefly: 1. Make a table SEARCHTABLE representing a 1-n relationship between keywords (limit them to 3-4 letters) and the primary IDs of tags. 2. INDEX both fields. 3. Every time the user makes a search, look at SEARCHTABLE, and if the combination is there, use it - very fast, as everything is indexed. If not, do the regexp search and put all results into SEARCHTABLE (see the sketch below).
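A compact sketch of steps 1-3, using the stdlib sqlite3 module for illustration (the question's stack is PostgreSQL/Doctrine, and all table and column names are made up):

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, tagname TEXT);
    CREATE TABLE searchtable (keyword TEXT, tag_id INTEGER);
    CREATE INDEX idx_search_keyword ON searchtable (keyword);
    CREATE INDEX idx_search_tag ON searchtable (tag_id);
""")

def suggest(prefix):
    """Indexed lookup in SEARCHTABLE; on a miss, fall back to the slow
    substring scan and cache the result for next time."""
    rows = db.execute(
        "SELECT t.id, t.tagname FROM searchtable s "
        "JOIN tags t ON t.id = s.tag_id WHERE s.keyword = ?",
        (prefix,)).fetchall()
    if rows:
        return rows
    rows = [(tag_id, name)
            for tag_id, name in db.execute("SELECT id, tagname FROM tags")
            if prefix in name]
    db.executemany("INSERT INTO searchtable VALUES (?, ?)",
                   [(prefix, tag_id) for tag_id, _ in rows])
    return rows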
Notes:
You should invalidate the table if you add tags, but this should happen much less often than searches. When invalidating the table you do not necessarily need to TRUNCATE it; you can easily rebuild it, taking all keywords into account.
If you want to speed it up, you can "pregenerate" all two- or even three-letter searches.
If you care enough, you should use the information from the n-1 letter keywords to generate the n letter keyword results; it speeds things up tremendously. Imagine that the user has typed "mo" and you have shown them the appropriate results from SEARCHTABLE. Then, when she types "n", giving "mon", you only need to search through the already-selected items to generate the new response (see the sketch below).
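A tiny sketch of that refinement (the sample data is made up): the hits for "mon" are a subset of the hits for "mo", so the previous result list can be filtered instead of hitting the database again.

hits_for_mo = ["monday", "money", "monsoon", "mouse", "motor"]  # hits for "mo"

def narrow(previous_hits, query):
    """Substring match within the already-fetched suggestions."""
    return [tag for tag in previous_hits if query in tag]

print(narrow(hits_for_mo, "mon"))  # ['monday', 'money', 'monsoon']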
Hope it is more comprehensive now.