Designing a system that detects typos and offers suggestions - spell checking

This was asked in an interview.
I think this can be done by constructing a trie of all valid words; suggestions can then be made from a possible valid path near the input that would otherwise be rejected as incorrect.
Say the user types "apfle": the system would detect that after "ap" a possible valid path was "app", which would eventually lead to "apple".
Is there any better solution than this? Perhaps the one implemented by spell checkers.
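For concreteness, here is a minimal sketch of that trie idea in Python. The tiny dictionary and the edit-distance bound are illustrative assumptions, not a production spell checker:

    # Sketch of the trie-based suggester described above. The dictionary
    # and the max_edits bound are illustrative.

    class TrieNode:
        def __init__(self):
            self.children = {}   # char -> TrieNode
            self.word = None     # full word stored at terminal nodes

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def suggest(root, query, max_edits=1):
        """Walk the trie carrying one Levenshtein DP row per node and
        collect dictionary words within max_edits of the query."""
        results = []
        first_row = list(range(len(query) + 1))

        def walk(node, ch, prev_row):
            # Next DP row for the trie edge labelled `ch`.
            row = [prev_row[0] + 1]
            for i in range(1, len(query) + 1):
                row.append(min(row[i - 1] + 1,        # insertion
                               prev_row[i] + 1,       # deletion
                               prev_row[i - 1] + (query[i - 1] != ch)))
            if node.word is not None and row[-1] <= max_edits:
                results.append((row[-1], node.word))
            if min(row) <= max_edits:                 # prune hopeless subtrees
                for next_ch, child in node.children.items():
                    walk(child, next_ch, row)

        for ch, child in root.children.items():
            walk(child, ch, first_row)
        return sorted(results)

    root = TrieNode()
    for w in ["apple", "apply", "ample", "maple"]:
        insert(root, w)

    print(suggest(root, "apfle"))  # -> [(1, 'apple')]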

See:
How does the Google "Did you mean?" Algorithm work?
How do I approximate "Did you mean?" without using Google?
How to write a spelling corrector
Youtube Video: Search 101

Within typical search engines you will find a lot of analyzer machinery, which addresses the same underlying problem. A very popular one is the n-gram analyzer.
Perhaps this helps.
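To make the n-gram idea concrete, here is a toy sketch: split each word into overlapping character bigrams and rank dictionary words by overlap with the misspelled input. The padding character and the Jaccard measure are my own simplifications; real analyzers are more sophisticated:

    # Toy n-gram matcher (illustrative only; real analyzers do much more).

    def ngrams(word, n=2):
        word = f"${word}$"              # pad so word edges form grams too
        return {word[i:i + n] for i in range(len(word) - n + 1)}

    def similarity(a, b, n=2):
        """Jaccard overlap of the two words' n-gram sets."""
        ga, gb = ngrams(a, n), ngrams(b, n)
        return len(ga & gb) / len(ga | gb)

    dictionary = ["apple", "apply", "banana", "grape"]
    print(max(dictionary, key=lambda w: similarity("apfle", w)))  # -> apple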

reCAPTCHA accepting one word out of two

I am a bit confused about how reCAPTCHA works. I have implemented it using Ruby on Rails.
Sometimes it returns true even if I specify only one word out of the two, while at other times it fails.
I am really confused and unable to understand the behaviour of reCAPTCHA.
Only one of the reCAPTCHA words is "known" by the system - it relies on the user performing the CAPTCHA to tell the system what the other word is, because that word is not machine-readable.
That is the "point" of reCAPTCHA, or its added benefit: it is not only performing a human test, it is also massively crowd-sourcing the transcription of text where automated OCR has failed.
reCAPTCHA shows two words: one that a computer scanner has scanned and recognized, and one that the scanner cannot recognize. reCAPTCHA checks the word it knows the answer to and saves the response for the unknown word. These responses to unknown words are compiled and analyzed so that each word is eventually "solved" by humans rather than by the computer scanner.
Here's more info, in their own words:
"But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct."
source - http://www.google.com/recaptcha/learnmore
reCAPTCHA uses two words, one of which is known and one of which is unknown (the unknown word is the one the program is trying to help decipher - it's probably scanned out of an old book somewhere!). So really, all the service is looking for is the right answer to the KNOWN word. If that's the word you put in, it will succeed even if you don't put in anything for the unknown word. If you put in only the other word (the unknown one), it will fail.
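To make that mechanism concrete, here is a toy model of the server-side logic these answers describe. All the names here are hypothetical; this is not reCAPTCHA's actual code or API:

    # Toy model of the known/unknown word scheme (hypothetical names).

    unknown_word_votes = {}  # image id -> transcriptions collected so far

    def check_captcha(known_answer, user_known, unknown_image_id, user_unknown):
        # Pass/fail is decided ONLY by the word the system already knows.
        if user_known.strip().lower() != known_answer.lower():
            return False
        # The unknown word's answer is merely recorded; once enough users
        # agree on it, the scanned word counts as solved by humans.
        unknown_word_votes.setdefault(unknown_image_id, []).append(user_unknown)
        return True

    print(check_captcha("morning", "morning", "scan-42", "mornlng"))  # True
    print(check_captcha("morning", "m0rning", "scan-42", "morning"))  # False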
I think that's the main point of reCAPTCHA. It helps developers distinguish humans from robots, and it also helps digitize books.
There are always two words. One is easier to read; if you can read that word, fine, you're human.
The second word is a scan from a book where automatic OCR (recognition) is not sure about the word. So users are helping to read this word so that books can be digitized better.

What tools exist to find frequencies of searches

I'm new to seo, so please excuse what may be a very basic question.
I want to count (or estimate) the number of times that a given search phrase has been searched within a particular time period. Are there any APIs out there for this? Does Google (or any other relevant search engine) release this information?
Any helpful links are greatly appreciated.
I'll be using Java, though I doubt that makes much difference.
I'm not aware of any API for it, but you can use Google Insights for Search; it also presents the volume for every search.
It looks like the WordTracker API might be the best tool for programmatically finding search data.
http://www.wordtracker.com/api/
Check out this improvised API to Google Trends, which helps you export data.

How does Google Know you are Cloaking?

I can't seem to find any information on how Google determines whether you are cloaking your content. How, from a technical standpoint, do you think they determine this? Are they sending in agents other than Googlebot and comparing what they see to the Googlebot results? Do they have a team of human beings comparing? Or can they somehow tell that you have checked the user agent and executed a different code path because you saw "googlebot" in the name?
This is in relation to this question on legitimate URL cloaking for SEO. If the textual content is exactly the same but the rendering is different (1995-style HTML vs. Ajax vs. Flash), is there really a problem with cloaking?
Thanks for your input on this one.
As far as I know, how Google prepares search engine results is secret and constantly changing. Spoofing different user-agents is easy, so they might do that. They also might, in the case of Javascript, actually render partial or entire pages. "Do they have a team of human beings comparing?" This is doubtful. A lot has been written on Google's crawling strategies including this, but if humans are involved, they're only called in for specific cases. I even doubt this: any person-power spent is probably spent by tweaking the crawling engine.
Google looks at your site while presenting user agents other than Googlebot.
See the Google Chrome comic book, page 11, where it describes (better than in layman's terms) how a Google tool can take a schematic of a web page. They could be using this or similar technology for Google search indexing and cloak detection - at least, that would be another good use for it.
Google does hire contractors (indirectly, through an outside agency, for very low pay) to manually review documents returned as search results and judge their relevance to the search terms, quality of translations, etc. I highly doubt that this is their only tool for detecting cloaking, but it is one of them.
In reality, many of Google's algorithms are trivially reversed and are far from rocket science. In the case of so-called "cloaking detection", all of the previous guesses are on the money (apart from, somewhat ironically, John K lol). If you don't believe me, set up some test sites (inputs) and some 'cloaking test cases' (further inputs), submit your sites to uncle Google (processing) and test your non-assumptions via pseudo-advanced human-based cognitive correlationary quantum perceptions (<-- btw, I made that up for entertainment value (and now I'm nesting parentheses to really mess with your mind :)) AKA "checking Google results to see if you are banned yet" (outputs). Loop until enlightenment == True (noob!) lol
A very simple test would be to compare the file size of a web page as Googlebot saw it against the file size of the same page fetched under a normal-looking user agent.
This would detect most suspect candidates for closer examination.
They could fetch your page with tools like curl, constructing one hash of the page retrieved without the Googlebot user agent and another hash of the page retrieved with it. The two hashes need to be similar; they have algorithms to compare the hashes and determine whether or not it's cloaking.
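A rough sketch of that kind of comparison (the URL, the user-agent strings, and the similarity threshold are made up; Google's real pipeline is not public):

    # Fetch a page under two User-Agent headers and compare the responses,
    # as the answers above speculate. All specifics here are illustrative.
    import difflib
    import urllib.request

    def fetch(url, user_agent):
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def looks_cloaked(url, threshold=0.9):
        as_bot = fetch(url, "Mozilla/5.0 (compatible; Googlebot/2.1; "
                            "+http://www.google.com/bot.html)")
        as_user = fetch(url, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
        # A ratio near 1.0 means the two responses are nearly identical.
        ratio = difflib.SequenceMatcher(None, as_bot, as_user).ratio()
        return ratio < threshold

    print(looks_cloaked("http://example.com/"))  # hypothetical target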

Implementing CAPTCHA after 50% of Article

We are planning to put a large number of business research reports and articles from our intranet onto the Internet. However, we don't want others to copy the content and host it on their own sites.
I read about protection by CAPTCHA and was wondering if this is possible. Readers should be able to read 50% of an article for free, after which a CAPTCHA should be entered to read the rest of the article. [In this way we make life a little harder for the copycats.]
Any pointers on how to implement this? The content is in HTML, and we have programming experience in Perl and PHP. We can hire others if required.
Additionally, search engines will only crawl half of each article, and I am wondering whether the site will be penalized for the rest of an article not being crawlable, since the crawler won't be able to crack the CAPTCHA.
Thanks.
There's a really good Captcha service provided by Recaptcha - http://recaptcha.net/
There is a PHP class that you can use to do all the hard work.
It's important to bear in mind that search engines aren't able to solve a CAPTCHA, so they will only index the first half of each report. As long as this half contains most of the relevant keywords, it shouldn't cause a massive problem. Don't make the mistake of "detecting" a search engine and showing it different content than a normal user sees, as the major search engines consider this spamming.
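For the server-side check itself, verification is a single POST to reCAPTCHA's verify endpoint. A minimal sketch in Python (note that the endpoint shown is the current siteverify API, which postdates this question; the secret key is a placeholder):

    # Minimal reCAPTCHA server-side check against the current siteverify
    # endpoint. SECRET_KEY is a placeholder for your own key.
    import json
    import urllib.parse
    import urllib.request

    SECRET_KEY = "your-secret-key"

    def recaptcha_passed(user_response, remote_ip=None):
        params = {"secret": SECRET_KEY, "response": user_response}
        if remote_ip:
            params["remoteip"] = remote_ip
        data = urllib.parse.urlencode(params).encode()
        url = "https://www.google.com/recaptcha/api/siteverify"
        with urllib.request.urlopen(url, data) as resp:
            return json.load(resp).get("success", False)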
An alternative solution would be to use a service like Copyscape (http://www.copyscape.com/) to protect your content.
I know this is not what you're asking, but please take into account that CAPTCHAs are universally broken and will not protect your content. You said the first half is free; does that mean you intend to charge for the other half? A CAPTCHA won't help you here at all...
But even if you're just trying to prevent automated scraping, CAPTCHA still won't do the trick. Check out my answer to another captcha question... Or you can go straight to the ppt I presented at OWASP last year.
Readers should be able to read 50% of an article for free, after which a CAPTCHA should be entered to read the rest of the article
Have your PHP programmer output 50% of the article. At the bottom, add a CAPTCHA. If the user enters the correct CAPTCHA, output 100% of the article.
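A sketch of that flow, in Python rather than PHP for brevity (the session dict and the paragraph-based split point are placeholders; in PHP you would use $_SESSION and your CAPTCHA library's verify call):

    # 50%-then-CAPTCHA gate. Session handling and the captcha check are
    # placeholders for whatever your framework provides.

    def render_article(paragraphs, session, captcha_ok=False):
        """paragraphs: the article body split into HTML paragraph strings."""
        if captcha_ok:
            session["unlocked"] = True       # remember the solved CAPTCHA
        if session.get("unlocked"):
            return "\n".join(paragraphs)     # full article
        visible = paragraphs[: max(1, len(paragraphs) // 2)]
        return ("\n".join(visible)
                + "\n<p>Solve the CAPTCHA below to read the rest.</p>")

    article = ["<p>Intro.</p>", "<p>Findings.</p>",
               "<p>Analysis.</p>", "<p>Conclusion.</p>"]
    session = {}
    print(render_article(article, session))                   # first half
    print(render_article(article, session, captcha_ok=True))  # full text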
Any pointers on how to implement this? The content is in HTML, and we have programming experience in Perl and PHP. We can hire others if required.
As a PHP programmer, I use http://www.phpcaptcha.org to implement captcha.
Additionally, search engines will only crawl half of each article, and I am wondering whether the site will be penalized for the rest of an article not being crawlable, since the crawler won't be able to crack the CAPTCHA.
No, it won't penalize you, but that particular section will not be shown in the search results.
As already mentioned, reCAPTCHA is a good way to go.
Have a look at Captcha::reCAPTCHA on CPAN which, according to its CPAN reviews, "Works out of the box".
If you want a CAPTCHA, there are plenty of modules that do this on CPAN ;-)
Hope that helps.

How to describe full text search to a client?

I implemented a full text search ("searching in tags") using SQL Server 2005.
I want to describe to the client what I did: what does full text search mean, in simple examples?
My client is not a programmer, but a good internet user.
I find that when describing something to clients, it helps to use a metaphor or a very concrete, domain-specific example.
As a metaphor you could say that full text search is like Google for your site. It looks at everything and anything to try and help you, whereas what we had before was more like using the Find feature in Windows XP. It works, but it works well only if you already know a lot about what you are searching for. And isn't Google better than Find? :)
Or just give them an example of something they couldn't do before that they can do now! Experience and results always convey the message more than words. Show them how you made their lives easier and they will immediately understand.
Best of luck.
"It now finds stuff much much faster."
Everything else is technical details not interesting to a user.
"Finds information using non exact matches"
Give some examples.
"finds stuff faster, and works more like google".
Gotta toss in a comparison to a search engine--hell, I can't even really describe what "full text search" means. "Full Text Search" is a technical term, really.
I can describe what I dislike about search: requiring me to type in boolean junk like "cat AND dog", or not offering me alternative queries. In other words, maybe think about what was wrong with the old way and why this one is better.