What is the logic behind Google spell-checking?

When I search for a word or something in Google and there is a spelling mistake in that word or sentence, Google comes back with the correct spelling or a corrected sentence. Can anyone explain how exactly this is done? I would be happy if it could be explained in terms of programming rather than in terms of databases and all that. Thank you.

It's a combination of string comparison (against a dictionary), stemming, and popularity matching of words based on Google's large body of user statistics.
EDIT: there's a Wikipedia page that may help you understand how computer spell checking works.
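To make the string-comparison and popularity parts concrete, here is a minimal sketch in Java; the dictionary, the toy frequency counts, and the simple edit-distance cutoff are all illustrative assumptions, not what Google actually does:

```java
import java.util.*;

// Minimal "did you mean" sketch: among dictionary words within a small edit
// distance of the misspelling, suggest the one with the highest frequency.
// The word list and counts are toy assumptions; a real engine uses query logs.
public class SpellSuggest {
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // word -> how often users type/search it (made-up numbers)
        Map<String, Integer> frequency = Map.of(
            "receive", 5000, "recipe", 3000, "relieve", 800);

        String typed = "recieve";
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Integer> e : frequency.entrySet()) {
            int dist = editDistance(typed, e.getKey());
            if (dist > 2) continue;                               // only consider close words
            double score = e.getValue() / (double) (dist + 1);    // popularity vs. distance
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        System.out.println("Did you mean: " + best);              // prints "receive"
    }
}
```

The real system works on query logs at a much larger scale and presumably also uses the context of the rest of the query, but the basic trade-off (how close the string is versus how often people actually type the word) is the same idea.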

Related

Looking for a program/script to collect sentences from news articles

So I'm currently working on a research paper on media bias (or lack thereof) towards 2020 presidential candidates.
For this, I'm looking for a way to make a huge database of sentences that mention these politicians by name or (if possible) with a pronoun. Right now I'd like to only focus on 5-7 of the biggest American news outlets (WaPo, NYT, FOX, etc.).
I want to collect all of these sentences into an Excel sheet, including a timestamp of when the article was released and a link to the article itself. I actually don't know if that's feasible or whether such a program or script already exists.
Do you think there's a way to solve this, does it already exist, and if not, can a rookie programmer write a script for this?
Thank you for all your help in advance!
You'd probably just need to create your own web scraper. You could have a Set of names that you're looking for, and if one of those names exists on the page, then you can use some heuristics to get the sentence it's in. You'll probably also need some site-specific logic for getting the timestamp from the article. I'd say it wouldn't be too bad since you're targeting only a few news outlets, but probably a bit challenging for a rookie programmer.
Also, I recommend checking out something like https://www.webscraper.io/
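For what it's worth, here is a rough sketch of that scraper idea using jsoup; the article URL, the name list, and the naive sentence splitting are placeholders, and every outlet will need its own selectors for body text and timestamps:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.Set;

// Fetch one article, split its text into sentences, and keep the sentences
// that mention one of the tracked names. URL and names are placeholders.
public class SentenceCollector {
    public static void main(String[] args) throws Exception {
        Set<String> names = Set.of("Biden", "Sanders", "Warren");
        String url = "https://example.com/some-article";   // hypothetical article URL

        Document page = Jsoup.connect(url).get();
        String bodyText = page.body().text();

        // Naive sentence split; a real pipeline would use an NLP sentence splitter.
        for (String sentence : bodyText.split("(?<=[.!?])\\s+")) {
            for (String name : names) {
                if (sentence.contains(name)) {
                    // In practice, append a CSV/Excel row with timestamp + url here.
                    System.out.println(url + "\t" + sentence.trim());
                    break;
                }
            }
        }
    }
}
```

From there it's mostly a loop over the article URLs you collect from each outlet's section pages or RSS feeds, plus writing the rows out to a spreadsheet.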

find indexed terms in non-indexed document/string

Sorry if I'm using the wrong terminology here, I'm new to Lucene :D
Assume that I have indexed all titles of the English Wikipedia in Lucene.
Let's say I'm visiting a news website. Within the article I want to convert all phrases (that match a title in the Wikipedia) into a link to the Wikipedia page.
To clarify: I don't want to put the news article into the Lucene index, but rather use the indexed WP titles to find matches within a given string (the article). We also don't want to bother with the JS/HTML stuff, just focus on Lucene for now.
I'd also like to match greedily: i.e. if the text contains "Stack Overflow", I'd like to link to SO rather than to "Stack" and "Overflow". But if I can get shorter matches as well, that would be neat, too. (I kind of want to do both, but I'll settle for either one if having both is difficult.)
naive solution: I can see that I'd be able to query for single words iteratively and whenever I hit an index, try to find the current word plus the next word until I miss. Then convert the last match into a link and continue after that, until I'm through the complete document.
But, that seems really awkward and I have a suspicion that Lucene might have some functionality that could support me here (or at least I hope so :D), but I have no clue what I'd be looking for.
Lucene's inverted index should make this a pretty fast operation, so I guess this might perform reasonably well, even with the naive approach.
Any pointers? I'm stuck :3
There is such a thing, it's called the Tagger Handler:
Given a dictionary (a Solr index) with a name-like field, you can post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired. It’s used for named entity recognition (NER).
It seems a bit fiddly to set up, but it's exactly what I wanted :D
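For anyone following along, using the Tagger Handler boils down to POSTing the raw article text to the handler's endpoint and reading back the matched titles with their offsets. A rough sketch with Java's built-in HTTP client; the core name, the /tag path, and the field list are assumptions that depend on how you configured the handler:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// POST plain article text to the Solr Tagger Handler; the response lists every
// matched Wikipedia title with its start/end offsets in the text.
public class TagArticle {
    public static void main(String[] args) throws Exception {
        String articleText = "Stack Overflow was mentioned in this news article...";

        // overlaps=LONGEST_DOMINANT_RIGHT keeps the longest (greedy) match.
        String endpoint = "http://localhost:8983/solr/wikipedia/tag"
                + "?overlaps=LONGEST_DOMINANT_RIGHT&fl=id,title&wt=json";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Content-Type", "text/plain")
                .POST(HttpRequest.BodyPublishers.ofString(articleText))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with tag offsets and matched docs
    }
}
```

The overlaps=LONGEST_DOMINANT_RIGHT option is what gives the greedy behaviour asked about above: "Stack Overflow" wins over "Stack" and "Overflow".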

Question Answering with Lucene

For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for this kind of system. As an example, another approach that comes to mind is to index large chunks of the corpus as documents along with token positions, then process the vicinity of the found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
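As a very small sketch of that feedback loop (this is just add-one smoothing of vote counts per keyword/document pair, a simplification of a full Bayesian classifier; the class and storage are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Track user feedback per (keyword, docId) pair and derive a boost factor.
// Add-one smoothing so unseen pairs get a neutral 0.5.
public class FeedbackScores {
    private final Map<String, int[]> votes = new HashMap<>(); // [helpful, unhelpful]

    private String key(String keyword, String docId) {
        return keyword + "|" + docId;
    }

    public void recordVote(String keyword, String docId, boolean helpful) {
        int[] counts = votes.computeIfAbsent(key(keyword, docId), k -> new int[2]);
        counts[helpful ? 0 : 1]++;
    }

    /** Estimated P(helpful | keyword, doc); usable as a multiplier on the Lucene score. */
    public double helpfulness(String keyword, String docId) {
        int[] counts = votes.getOrDefault(key(keyword, docId), new int[2]);
        return (counts[0] + 1.0) / (counts[0] + counts[1] + 2.0);
    }
}
```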
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding text as payloads. That means you'll end up storing each sentence three times (as its own document and as the before/after payload of its neighbours), and you'll have to manually dig into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
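A minimal sketch of that ID idea against a recent Lucene version; the field names and index path are placeholder assumptions:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Index each sentence as its own document with a sequence number, so a hit on
// sentence N can later be displayed together with sentences N-1 and N+1.
public class SentenceIndexer {
    public static void index(String[] sentences) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("qa-index"));          // placeholder path
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            for (int i = 0; i < sentences.length; i++) {
                Document doc = new Document();
                doc.add(new TextField("text", sentences[i], Field.Store.YES)); // searchable
                doc.add(new IntPoint("seq", i));      // queryable, to fetch neighbours
                doc.add(new StoredField("seq", i));   // retrievable, to show in results
                writer.addDocument(doc);
            }
        }
    }
}
```

At query time you search the text field as usual and, for each hit, run two extra exact queries on seq (e.g. IntPoint.newExactQuery("seq", n - 1)) to pull in the neighbouring sentences for context.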
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of Lucene, which does text indexing, search and retrieval, I think using something like Apache Mahout would help with this. Mahout treats text as knowledge, and doing that makes answering the question better than just text matching. Mahout is a machine learning and data mining framework which fits this domain better. Just a very high-level thought.
--Sai

Should I use proper punctuation for single sentence alert/notification popups?

Is it necessary to use a period for single-sentence notification boxes? Even though it's considered proper grammar to do so, it just looks ugly and feels too formal.
Here are two screenies for comparison (first includes period, second doesn't).
http://wordofjohn.com/files/stack_alert_1.png
http://wordofjohn.com/files/stack_alert_2.png
Can't go wrong with correct grammar
Good grammar shows your customers that you took the time to make good software, even where others might not have.
This way they can expect the best out of you and your company.
If you are using a full sentence to tell the user what to do, then I think proper grammar is important, although I always stay away from exclamation points; I find them annoying.
It is more a preference than anything, but I like to maintain the best grammar possible in any situation.
In both instances you capitalized the first word in the sentence, so I would say go with proper grammar, but it really is a preference.
I'd vote No.
These alerts are like signposts or road signs; they need to present a brief but important message as succinctly as possible.
My reasoning, extended: I think it's subjective, and so I doubt anyone's going to have a bad user experience because of the presence or absence of a full stop (period). A question mark might be confusing if it were left out, but a full stop is kind of implicit.
If you use periods at the end of your sentences, then users will know that the string hasn't been truncated (well, OK, they won't know that it hasn't been truncated, but it's a good indicator). Plus, as others have said, it shows you went to the trouble to get it right.
I can't remember - what do MS/Apple do?
Let me explain my preference with an analogy.
I used to work at a bookstore where they sold Bibles. Some of them were Cambridge calfskin leather bound deluxe editions that came in special boxes for over US$100.00 each. Some of them were mass market paperback throw-away versions for US$1.99 each. The cheap ones often had glaring grammatical and spelling errors. I don't think this was a coincidence.
Regardless of where my software is going to be used or what it is for, I try to do my best to make sure it gets put (metaphorically) on the high-quality, expensive rack. Every time. Even at the risk of sounding "too formal".
If you are using the string as a normal resource, you (or someone else in your project) could use the text in another context, which would mean you need to keep track of which resources contain a period or not.

Code related web searches

Is there a way to search the web which does NOT remove punctuation? For example, I want to search for window.window->window (Yes, I actually do, this is a structure in mozilla plugins). I figure that this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Google Code Search ("window.window->window" doesn't seem to get any relevant results out of this request, though).
There are similar tools all over the internet, like Codase or Koders, but I'm not sure they let you search for exactly this string. Anyway, they might be useful to you, so I think they're worth mentioning.
edit: It is very unlikely you'll find a general-purpose search engine that will allow you to search for something like "window.window->window", because most search engines do some processing on the document before storing it. For instance, they might represent it internally as vectors of words (a vector space model) and use that to do the search, not the actual original string. And creating such a vector involves first cutting the document up according to punctuation and other critters. This is a very complex and interesting subject which I can't tell you much more about; my bad memory did a pretty good job, considering I studied it at school!
BTW, they might do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and its friends are doing, but it can give you a hint about what happens to your query.
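If you're curious, the basic tf-idf weight is easy to compute; here's a toy sketch (tiny made-up corpus, simplest possible weighting). It also shows why the punctuation is gone: by the time anything is scored, each document is just a bag of tokens.

```java
import java.util.Arrays;
import java.util.List;

// Toy tf-idf: term frequency in one document times the inverse document
// frequency of the term across the corpus.
public class TfIdf {
    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        double tf = doc.stream().filter(term::equals).count() / (double) doc.size();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log((double) corpus.size() / (1 + docsWithTerm));
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("the", "browser", "object"),
            Arrays.asList("the", "plugin", "api"),
            Arrays.asList("window", "window", "window"));
        System.out.println(tfIdf("window", corpus.get(2), corpus)); // ~0.41: rare term, high weight
        System.out.println(tfIdf("the", corpus.get(0), corpus));    // 0.0: appears everywhere
    }
}
```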
There is no way to do that by itself in the main Google engine, as you discovered. However, if you are looking for information about Mozilla, then the best bet would be to structure your query something more like this:
"window.window->window" +Mozilla
OR +XUL
+ Another search string related to what you are trying to do.
SymbolHound is a web search engine that does not remove punctuation from queries. It has an option to search source code repositories (like the now-discontinued Google Code Search), but it also has the option to search the Internet for special characters (primarily programming-related sites such as Stack Overflow).
try it here: http://www.symbolhound.com
-Tom (co-founder)