How to implement autocomplete on a massive dataset - optimization

I'm trying to implement something like Google Suggest on a website I am building, and I'm curious how to go about doing it on a very large dataset. Sure, if you've got 1,000 items you cache the items and just loop through them. But how do you go about it when you have a million items? Further, suppose that the items are not one word. Specifically, I have been really impressed by Pandora.com. For example, if you search for "wet" it brings back "Wet Sand", but it also brings back "Toad The Wet Sprocket". And their autocomplete is FAST. My first idea was to group the items by their first two letters, so you would have something like:
Dictionary<string,List<string>>
where the key is the first two letters. That's OK, but what if I want to do something similar to Pandora and allow the user to see results that match the middle of the string? With my idea, "Wet" would never match "Toad The Wet Sprocket" because it would be in the "TO" bucket instead of the "WE" bucket. So perhaps you could split the string up so that "Toad The Wet Sprocket" goes in the "TO", "WE" and "SP" buckets (stripping out the word "THE"), but when you're talking about a million entries, each of which may contain several words, it seems like you'd quickly start using up a lot of memory. OK, that was a long question. Thoughts?

As I pointed out in "How to implement incremental search on a list", you should use structures like a trie or Patricia trie for searching patterns in large texts.
And for discovering patterns in the middle of some text there is one simple solution. I am not sure if it is the most efficient solution, but I usually do it as follows.
When I insert some new text into the trie, I insert it, then remove the first character and insert the remainder, then remove the next character and insert again, and so on until the whole text is consumed. Then you can discover every substring of every inserted text with a single search from the root. The resulting structure is a suffix trie; its compressed form is called a suffix tree, and there are a lot of optimizations available.
And it is incredibly fast. To find all texts that contain a given sequence of n characters you have to inspect at most n nodes and perform a search on the list of children for every node. Depending on the implementation (array, list, binary tree, skip list) of the child node collection, you might be able to identify the required child node in as few as 5 search steps (a binary search over 26 letters), assuming case-insensitive Latin letters only. Interpolation search might be helpful for large alphabets and for nodes with many children, such as those usually found near the root.
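The insert-every-suffix idea described above can be sketched as follows. This is a minimal, uncompressed illustration, not an optimized suffix tree: the class name SuffixTrie and its methods are made up for this example, children are kept in a plain dict, and every node stores the set of items that pass through it, which costs memory roughly quadratic in the item length.

```python
class SuffixTrie:
    """Naive suffix trie: every suffix of every item is inserted from the root."""

    def __init__(self):
        self.children = {}   # character -> child node
        self.items = set()   # original items with a suffix passing through this node

    def insert(self, text):
        key = text.lower()
        # Insert the whole text, then the text minus its first character, and so on.
        for start in range(len(key)):
            node = self
            for ch in key[start:]:
                node = node.children.setdefault(ch, SuffixTrie())
                node.items.add(text)

    def search(self, fragment):
        # One walk from the root: at most len(fragment) nodes are inspected.
        node = self
        for ch in fragment.lower():
            node = node.children.get(ch)
            if node is None:
                return set()
        return node.items


trie = SuffixTrie()
for title in ["Wet Sand", "Toad The Wet Sprocket", "Dry Land"]:
    trie.insert(title)

print(trie.search("wet"))
```

For a million multi-word items the per-node sets would have to be replaced by something more compact (item ids, or lists stored only at selected nodes), which is where the suffix tree optimizations mentioned above come in.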

Don't try to implement this yourself (unless you're just curious). Use something like Lucene or Endeca - it will save you time and hair.

Not algorithmically related to what you are asking, but make sure you have a delay of 200 ms or more after the keypress(es), so that the user has stopped typing before you issue the asynchronous request. That way you will reduce redundant HTTP requests to the server.

I would use something along the lines of a trie, and have the value of each leaf node be a list of the possibilities that contain the word represented by the leaf node. You could sort them in order of likelihood, or dynamically sort/filter them based on other words the user has entered into the search box, etc. It will execute very quickly and in a reasonable amount of RAM.
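A rough sketch of this word-level trie, under a few assumptions that are not in the answer: the names WordTrie, add and lookup are invented here, the list of titles is kept on every node rather than only on leaves (so that a partial word such as "we" already returns results), and "the" is treated as a stop word as in the question.

```python
class WordTrie:
    """Trie over individual words; each node lists the titles reachable through it."""

    def __init__(self):
        self.children = {}   # character -> child node
        self.titles = []     # titles containing a word that starts with this prefix

    def add(self, title, stop_words=frozenset({"the"})):
        for word in title.lower().split():
            if word in stop_words:
                continue
            node = self
            for ch in word:
                node = node.children.setdefault(ch, WordTrie())
                # Avoid listing a title twice when two of its words share a prefix.
                if not node.titles or node.titles[-1] != title:
                    node.titles.append(title)

    def lookup(self, prefix, limit=10):
        node = self
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        return node.titles[:limit]


index = WordTrie()
for title in ["Wet Sand", "Toad The Wet Sprocket", "West End Girls"]:
    index.add(title)

print(index.lookup("wet"))
```

Titles come back in insertion order here; sorting each list by likelihood, as suggested above, would be a matter of ordering the titles before they are added or re-sorting the lists after the index is built.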

You keep the items on the server side (perhaps in a DB, if the dataset is really large and complex) and you send AJAX calls from the client's browser that return the results using json/xml. You can do this in response to the user typing, or with a timer.

If you don't want a trie and you want stuff from the middle of the string, you generally want to run some sort of edit distance function (Levenshtein distance), which will give you a number indicating how well two strings match up. It's not a particularly efficient algorithm (the running time grows with the product of the two string lengths), but that doesn't matter too much for things like words, as they're relatively short. If you're running comparisons on, say, 8000-character strings, it'll probably take a few seconds. I know most languages have an implementation, or you can find code/pseudocode for it pretty easily on the internet.
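For reference, a compact version of the standard dynamic-programming Levenshtein distance looks like this. The helper closest and its parameters are assumptions added for illustration; it simply ranks candidates by distance to the query, which means every candidate is compared on every call.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest(query, candidates, limit=5):
    """Hypothetical helper: candidates ordered by edit distance to the query."""
    q = query.lower()
    return sorted(candidates, key=lambda c: levenshtein(q, c.lower()))[:limit]


print(closest("sproket", ["Wet Sand", "Sprocket", "Socket"]))
```

Only two rows of the distance table are kept at a time, so memory stays proportional to the shorter string.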

I've built AutoCompleteAPI for this scenario exactly.
Sign up to get a private index, then,
Upload your documents.
Example upload using curl on document "New York":
curl -X PUT -H "Content-Type: application/json" -H "Authorization: [YourSecretKey]" -d '{
"key": "New York",
"input": "New York"
}' "http://suggest.autocompleteapi.com/[YourAccountKey]/[FieldName]"
After indexing all documents, to get autocomplete suggestions, use:
http://suggest.autocompleteapi.com/[YourAccountKey]/[FieldName]?prefix=new
You can use any client autocomplete library to show these results to the user.

Related

RedisBloom: Option to add items (bit strings) as is with no hashing?

I'm considering Redis for my next project (in-memory, fast), but now I have the problem of figuring out how, and if at all, it could actually achieve my goal. The goal is to store a "large" number (millions) of fixed-length bit strings and then search the database with an input (query) bit string. Search means returning everything that fulfills the condition below:
query & value = query
i.e. if all bits set in the query are also set in the value, return that key. In other words, a Bloom filter, although in my domain of work it isn't usually called that.
I found the module RedisBloom, but I already have my Bloom filters (bit strings) available from an external program and would simply like to use RedisBloom for storing and searching them (the exists command). Therefore, in my case the "Add" command should take the input as is and not hash it again.
Is that possible? And if not other suggestions?
Nope, that isn't possible as RedisBloom is a "black box" in that sense - it manages its own data structures.
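As one possible alternative outside RedisBloom, the condition from the question (query & value = query) can be tested directly if the bit strings are held as integers. This is only a sketch under that assumption: matching_keys and the example store are hypothetical, and it is a plain linear scan over all values, not a Redis or RedisBloom API.

```python
def matching_keys(store, query):
    """Return the keys whose value has every bit of the query set."""
    return [key for key, value in store.items() if query & value == query]


# Hypothetical data: precomputed fixed-length bit strings keyed by name.
store = {
    "a": 0b1011,
    "b": 0b0110,
    "c": 0b1111,
}

print(matching_keys(store, 0b0011))
```

Python integers have arbitrary width, so the same test works for bit strings of any fixed length; for millions of entries the scan would need to be batched or indexed.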

Dictionary API used for stressed syllables

This might end up being a very general question, but hopefully it will be useful to others as well.
I want to be able to request a word that has x syllables with the stress on the y-th syllable. I've found plenty of APIs that return both of these, such as Wordnik, but I'm not sure how to approach the search aspect. The URL to get the syllables is
GET /word.json/{word}/hyphenation
but I won't know the word ahead of time to make this request. They also have this:
GET /words.json/randomWords
which returns a list of random words.
Is there a way to achieve what I want with this API without asking for random words over and over and checking if they meet my needs? That just seems like it would be really slow and push me over my usage limits.
Do I need to build my own data structure with the words and syllables to query locally?
I doubt you'll find this kind of specialized query on any of the big dictionary APIs. You'll need to download an English dictionary and create your own data structure to do this kind of thing.
The Moby Project has a hyphenated dictionary with about 185,000 words in it. There are many other dictionary projects available. A good place to start looking is http://www.dicts.info/dictionaries.php.
Once you've downloaded the dictionary, you'll need to preprocess it to build your data structure. You should be able to construct a dictionary or hash map that is indexed by (syllables, emphasis), and whose data member is a list of words. So you'd have an entry like (4, 2) (4-syllable word with emphasis on the 2nd syllable), and a list of all such words.
To query it, then, you'd just pack the query into a structure and look up that key in the hash map. Then pick a random word from the resulting list.
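The lookup structure described above might look like this in outline. The function names and the sample entries are assumptions; in practice the (word, syllable count, stressed syllable) tuples would come from preprocessing whichever dictionary you download.

```python
import random
from collections import defaultdict


def build_index(entries):
    """Group words under a (syllable_count, stressed_syllable) key."""
    index = defaultdict(list)
    for word, syllables, stressed in entries:
        index[(syllables, stressed)].append(word)
    return index


def random_word(index, syllables, stressed, rng=random):
    """Pick a random word for the key, or None if no word matches."""
    words = index.get((syllables, stressed))
    return rng.choice(words) if words else None


# Hypothetical preprocessed entries: (word, syllables, stressed syllable).
entries = [
    ("banana", 3, 2),
    ("computer", 3, 2),
    ("elephant", 3, 1),
    ("photography", 4, 2),
]
index = build_index(entries)
print(random_word(index, 3, 2))
```

Building the index is a single pass over the dictionary, and each query is one hash lookup plus a random pick.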

How to get all hashes in foo:* using a single id counter instead of a set/array

Introduction
My domain has articles, which have a title and text. Each article has revisions (like the SVN concept), so every time it is changed/edited, those changes will be stored as a revision. A revision is composed of changes and the description of those changes.
I want to be able to obtain all revision descriptions at once.
What's the problem?
I'm certain that I would store the revision as a hash in articles:revisions:<id> storing the changes, and the description in it.
What I'm not certain of is how do I get all of the descriptions at once.
I have many options to do this, but none of them convinces me.
Store the revision ids for an article as a set, and use SORT articles:revisions:idSet BY NOSORT GET articles:revisions:*->description. This means that I would store a set for each article. If every article had 50 revisions, and we had 10,000 articles, we would have 500,000 ids stored.
Is this the best way? Isn't this eating up too much RAM?
I have other ideas in mind, but I don't consider them good either.
Iterate from 0 to the last revision's id, doing an HGET for each id using MULTI.
Create the idSet for a specific article if it doesn't exist when it is requested, and expire it after some time.
Isn't there a way for redis to do a SORT array BY NOSORT GET, with array being an adhoc array in the form of [0, MAX]?
Seems like you have a good solution.
As long as you keep those id numbers below 10,000 and your sets at fewer than 512 elements (set-max-intset-entries), your memory consumption will be much lower than you think.
Here's a good explanation of it.
This can be solved in an optimized way using a TRIE or DAWG better than what Redis provides. I don't know your application or other info on your search problem (e.g. construction time, unsuccessful searches, update performance).
If you search much more often than you need to update / insert into your lookup storage, I'd suggest you have a look at DAWGDIC [1] as a library, and construct "search paths" (similar to what you already described) using a string format that can be search-completed later:
articleID:revisionID:"changeDescription":"change"
Example (I assume you have one description per revision, and n changes. This isn't clear to me from your question):
1:2:"Some changes":"Added two sentences here, removed one sentence there"
1:2:"Some changes":"Fixed article title"
2:4:"Advertisement changes":"Added this, removed that"
Note: Even though you construct these strings with duplicate prefixes, the DAWG will store them in a very space efficient way (simply put, it will append the right side of the string to the data structure and create a shortcut for the common prefix, see also [2] for a comparison of TRIE data structures).
To list changes of article 1, revision 2, set the common prefix for your lookup:
completer.Start(index, "1:2");
Now you can simply call completer.Next() to look up the next record that shares the same prefix, and completer.value() to get the record's value. In our example we'll get:
1:2:"Some changes":"Added two sentences here, removed one sentence there"
1:2:"Some changes":"Fixed article title"
Of course you need to parse the strings yourself into your data object.
Maybe that's not what you're looking for and overkill. But it can be a very space and search performance efficient way, if it meets your requirements.
[1] https://code.google.com/p/dawgdic/
[2] http://kmike.ru/python-data-structures/

Large results set from Oracle SELECT

I have a simple, pasted below, statement called against an Oracle database. This result set contains names of businesses but it has 24,000 results and these are displayed in a drop down list.
I am looking for ideas on ways to reduce the result set to speed up the data returned to the user interface, maybe something like Google's search or a completely different idea. I am open to whatever thoughts and any direction is welcome.
SELECT BusinessName FROM MyTable ORDER BY BusinessName;
Idea:
SELECT BusinessName FROM MyTable WHERE BusinessName LIKE 'A%';
I know all about how LIKE clauses are not wise to use, but like I said, this is a LARGE result set. Maybe something along the lines of a BINARY search?
The last query can perform horribly. String comparisons inside the database can be very slow, and depending on the number of "hits" it can be a huge drag on performance. If that doesn't concern you that's fine. This is especially true if the company data isn't normalized into its own db table.
As long as the user knows the company he's looking up, an effective mechanism would be an existing component from a popular JavaScript library that provides a search text field with a dynamic dropdown showing matching results. But you might want to use '%A%' if users might look for part of a name. For example, if I'm looking for IBM Rational, LLC, do I want it to show up in the results when I search for "Rational"?
Either way, watch your performance and, if it makes sense, cache that data in the company lookup service that sits on the server in front of the DB. Also, make sure you don't respond to every keystroke, but have a timeout of 500 ms or so, to allow the user to type multiple characters before going to the server and searching. Also, I would NOT recommend bringing all of the company names to the client. We're always looking to reduce the size and frequency of round trips to the server from the browser page. Waiting for 24k company names to come down to the client when the form loads (or even behind the scenes) seems less efficient to me than shorter, quicker, very specific queries that perform sufficiently well. Again, test it and identify the performance characteristics that fit your use case best.
These are techniques I've used on projects with large data, like searching for a user from a base of 100,000+ users. Our code was a custom Dojo widget (dijit). I'm not seeing how to do it directly with the dijit code, but jQuery UI provides the autocomplete widget.
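Also limit the rows returned by this query when it is driven by a text field, so that the drop-down only provides a subset of all the matches, forcing the user to further refine the query. Note that Oracle has no LIMIT clause; on Oracle 12c and later the equivalent is FETCH FIRST:
SELECT BusinessName FROM MyTable ORDER BY BusinessName FETCH FIRST 10 ROWS ONLY;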
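If the names are cached in a lookup service in front of the database, as suggested earlier in this answer, the binary search idea from the question can be applied to a sorted in-memory list. The following is a sketch under that assumption; build_entries, prefix_matches and the sample names are hypothetical, and it only handles prefix matches, not matches in the middle of a name.

```python
import bisect


def build_entries(names):
    """Sorted (lowercased name, original name) pairs for case-insensitive lookup."""
    return sorted((name.lower(), name) for name in names)


def prefix_matches(entries, prefix, limit=10):
    """Return up to `limit` names starting with `prefix`, found by binary search."""
    key = prefix.lower()
    # A 1-tuple sorts before any 2-tuple with the same first element,
    # so this finds the first entry whose lowercased name is >= key.
    start = bisect.bisect_left(entries, (key,))
    results = []
    for lowered, original in entries[start:start + limit]:
        if not lowered.startswith(key):
            break
        results.append(original)
    return results


entries = build_entries(
    ["IBM Rational, LLC", "Acme Corp", "Acorn Ltd", "Apple Inc", "Zeta"]
)
print(prefix_matches(entries, "ac"))
```

The limit plays the same role as the row limit in the SQL query: the user sees only the first few matches and keeps typing to narrow them down.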

How do I implement these strong password requirements?

No keyboard patterns. i.e. keys that are adjacent vertically or horizontally on a keyboard. For example, 'ZXCVBN123' should be rejected.
No commonly used words and no words written backwards or disguised with special characters. For example 'Universe1' and 'Un1ver$e' should be rejected.
Well, first you need to define exactly what you want. What are keyboard patterns? Is 'jk' a keyboard pattern, or just 'jkl'? What's the shortest pattern there is? Is 'gy' a pattern? First you need to define what a pattern really is.
Then you should make a list of all the available patterns (There aren't all that many. You have 36 starting points and 4 directions to go from each starting point). When you get a password, try to locate each of the patterns in it. Note that if you decide the shortest pattern is 3 letters long, you don't need to search for 4-letter patterns, all 4-letter patterns already contain 3-letter patterns.
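One way to generate and check such a pattern list is sketched below. It assumes a US QWERTY layout reduced to letters and digits, treats keys as vertically adjacent when they share a column index in the ROWS strings (a simplification of the physical stagger), and uses a minimum pattern length of 3. All names here are made up for the example.

```python
ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]


def keyboard_runs(min_len=3):
    """All horizontal and vertical key runs of exactly min_len, in both directions."""
    lines = list(ROWS)
    # Build the vertical lines, e.g. column 0 is "1qaz".
    for col in range(len(ROWS[0])):
        lines.append("".join(row[col] for row in ROWS if col < len(row)))
    runs = set()
    for line in lines:
        for i in range(len(line) - min_len + 1):
            run = line[i:i + min_len]
            runs.add(run)
            runs.add(run[::-1])
    return runs


RUNS = keyboard_runs()


def has_keyboard_pattern(password):
    """True if the password contains any run; longer runs contain a short one."""
    lowered = password.lower()
    return any(run in lowered for run in RUNS)


print(has_keyboard_pattern("ZXCVBN123"))
```

Only the shortest runs are stored, which relies on the observation above that every longer pattern already contains a 3-key pattern.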
As for words, that's easier, but first you need to make a list of all disallowed transformations ($->S, 1->i, etc...). Once you get a word, apply all the transformations and get yourself a 'normalized' word. Compare the normalized password against a dictionary of all legal words twice - the second time reverse the password.
You will probably need to do something a little more complicated than that, because you need to ignore numbers at the end of the word - sometimes. 1ncredible can be a substitute for 'incredible', although ncredible is not a word.
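The normalization steps from this answer can be sketched roughly as follows. The substitution table, the function names and the tiny dictionary are assumptions chosen for illustration; a real check would use a full word list and a table agreed on for the policy. Trailing digits are handled by trying the password both with and without them, which covers cases like 'Universe1' and '1ncredible'.

```python
# Hypothetical table of disguises to undo; extend as the policy requires.
SUBSTITUTIONS = str.maketrans({
    "$": "s", "5": "s", "1": "i", "0": "o",
    "3": "e", "4": "a", "@": "a", "7": "t",
})


def candidate_words(password):
    """Normalized forms of the password, forwards and reversed."""
    lowered = password.lower()
    stripped = lowered.rstrip("0123456789")   # the same password without trailing digits
    candidates = set()
    for base in {lowered, stripped}:
        if not base:
            continue
        normalized = base.translate(SUBSTITUTIONS)
        candidates.add(normalized)
        candidates.add(normalized[::-1])      # catches words written backwards
    return candidates


def is_disguised_word(password, dictionary):
    """True if any normalized form of the password is a dictionary word."""
    return not candidate_words(password).isdisjoint(dictionary)


dictionary = {"universe", "incredible"}
print(is_disguised_word("Un1ver$e", dictionary))
```

This compares the whole password against the dictionary; rejecting passwords that merely contain a word would need a substring search over the normalized forms instead.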
If you inspect the code of http://howsecureismypassword.net you can see that the password is compared to a large array of usual passwords.
On the page there is a reference to the page http://xato.net/passwords/more-top-worst-passwords/ which lists the top 10,000 most common passwords.
One approach would be to download that list and check the users' passwords against it, or at least against the top 100 of them.