How can I do spell checking with Membase? - spell-checking

I have a Membase server and a list of words that serves as my dictionary. I also have some input words that I want to spell-check against it, but I can't come up with a good algorithm. Any ideas?

Disclaimer: these are links I bookmarked a while ago from a similar SO discussion; you may be able to find it by searching.
Check out the following links. The first is an article on how to write a spelling corrector; the others are about making it faster:
http://norvig.com/spell-correct.html
http://en.wikipedia.org/wiki/Bloom_filter
http://theyougen.blogspot.ca/2010/02/faster-spelling-corrector.html
Hope it helps.
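As a rough illustration of the approach in the first link, here is a minimal sketch that generates every string one edit away from the input and keeps those found in the dictionary. The names `edits1`, `suggest`, and `DICTIONARY` are chosen for this example, and the in-memory set is only a stand-in: with Membase, the membership test would presumably become a key lookup per candidate, which is an assumption about how the word list is stored.

```python
import string


def edits1(word):
    """Return every string one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, dictionary):
    """Return the word itself if known, else known words one edit away."""
    word = word.lower()
    if word in dictionary:
        return [word]
    # Set intersection stands in for one dictionary lookup per candidate.
    return sorted(edits1(word) & dictionary)


# Hypothetical word list; in practice this would be loaded from the store.
DICTIONARY = {"spell", "spelling", "check", "server", "word"}

print(suggest("spel", DICTIONARY))
print(suggest("chekc", DICTIONARY))
```

A word of length n produces 54n + 25 candidate strings before deduplication, so each check costs at most a few hundred lookups for typical words. The Bloom filter link is about cheaply rejecting most candidates before they reach the store.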

Related

Looking for a program/script to collect sentences from news articles

So I'm currently working on a research paper on media bias (or lack thereof) towards 2020 presidential candidates.
For this, I'm looking for a way to make a huge database of sentences that mention these politicians by name or (if possible) with a pronoun. Right now I'd like to only focus on 5-7 of the biggest American news outlets (WaPo, NYT, FOX, etc.).
I want to collect all of these sentences into an Excel sheet, including a timestamp of when the article was released and a link to the article itself. I actually don't know if that's feasible or whether such a program/script already exists.
Do you think there's a way to solve this, does it already exist, and if not, can a rookie programmer write a script for this?
Thank you for all your help in advance!
You'd probably just need to create your own web scraper. You could have a Set of names that you're looking for, and if the name exists on the page then you can have some heuristics to get the sentence it's in. You'll probably have to have some specific stuff for getting the timestamp from the article. I'd say it wouldn't be too bad since you're targeting only a few news outlets, but probably a bit challenging for a rookie programmer.
Also, I recommend checking out something like https://www.webscraper.io/
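A minimal sketch of the sentence-matching heuristic, assuming the article text, timestamp, and URL have already been extracted by the scraper. The names `NAMES`, `sentences_mentioning`, and `write_rows` are made up for this example, the sentence splitter is deliberately naive, and pronouns are not handled.

```python
import csv
import re

# Hypothetical set of names to look for.
NAMES = {"Biden", "Trump", "Sanders"}


def sentences_mentioning(text, names):
    """Return the sentences in `text` that contain any of `names`."""
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    # Abbreviations such as "Sen. Sanders" will be split incorrectly.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    alternatives = "|".join(re.escape(n) for n in sorted(names))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return [s for s in sentences if pattern.search(s)]


def write_rows(fileobj, sentences, timestamp, url):
    """Write one CSV row per sentence; Excel can open the resulting file."""
    writer = csv.writer(fileobj)
    for sentence in sentences:
        writer.writerow([timestamp, url, sentence])


article = "Biden spoke in Iowa. The weather was cold. Critics said Trump was absent!"
for sentence in sentences_mentioning(article, NAMES):
    print(sentence)
```

Fetching the pages and pulling out the publication timestamp is the site-specific part and is not shown here.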

Is there an online source of complete documentation on Pandas objects/classes (besides reading its code)?

Today, I was looking for a long time on the page http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html , trying to find something very simple: an attribute or method that would retrieve the index from the DataFrame. It doesn't show any. I scanned the page many times, as well as doing text searches through it.
Then, of course, I came to Stack Overflow and got the answer almost immediately: DataFrame.index is the attribute.
Obviously, in the future, I want to be able to trust the documentation and not waste time like today. So my question is: Is there an online source of complete documentation on Pandas objects/classes (besides reading its code) that doesn't omit any attributes/methods/etc? Thanks.
Read 10 minutes to pandas. The third section makes use of the .index attribute.
See also the tutorials section of the docs.
IMO both the online documentation and the docstring help are very good for specific methods. That said, it's difficult to beat Google/Stack Overflow for finding the answer to a specific question...

How to use regular expression in T-SQL?

For example:
where strword match with {%{J{GC * GC}J} or strword={%{J{GC * GC}J}
I typically try to avoid posting answers that are simply links to somewhere else. But in this case it's a fairly big answer as you have to involve CLR in your approach.
So - in this case I'm going to make an exception and just give you a link. I feel better about it since it's an official Microsoft doc and they are pretty good about not moving things around.
Here's the walkthrough from Microsoft on using RegEx with SQL Server.
It has good sample code and is extensive in its coverage. If you're OK with adding CLR to your solution then it should give you exactly what you need.
Update: Turns out Microsoft did in fact change that link. Another walkthrough by a respected company (Red Gate) can be found here: https://www.red-gate.com/simple-talk/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/

Code related web searches

Is there a way to search the web which does NOT remove punctuation? For example, I want to search for window.window->window (Yes, I actually do, this is a structure in mozilla plugins). I figure that this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Google Code Search (I tried "window.window->window", but it doesn't seem to return any relevant results for this query).
There are similar tools all over the internet, like Codase or Koders, but I'm not sure they let you search for exactly this string. Anyway, they might be useful to you, so I think they're worth mentioning.
edit: It is very unlikely you'll find a general-purpose search engine that will let you search for something like "window.window->window", because most search engines do some processing on a document before storing it. For instance, they might represent it internally as a vector of words (a vector space model) and use that to do the search, not the actual original string. Creating such a vector involves first splitting the document on punctuation and other critters. This is a very complex and interesting subject which I can't tell you much more about. My bad memory has done a pretty good job, considering I studied this back at school!
BTW they might do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and friends are doing but can give you a hint about what happens to your query.
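To make the point about preprocessing concrete, here is a small sketch of a word tokenizer and a textbook tf-idf weighting. It is an illustration under simple assumptions (lowercasing, splitting on anything that is not a letter or digit, unsmoothed idf), not a description of what any real search engine does; `tokenize` and `tf_idf` are names made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Keep only runs of letters/digits; '.', '->' and friends are discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# After tokenizing, the code snippet and the plain words are identical.
print(tokenize("window.window->window"))
print(tokenize("window window window"))
```

Once both the documents and the query have gone through a step like `tokenize`, the index has no way to tell the code snippet from three occurrences of an ordinary word.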
There is no way to do that in the main Google engine by itself, as you discovered. However, if you are looking for information about Mozilla, then your best bet would be to structure your query more like this:
"window.window->window" +Mozilla
OR +XUL
+ Another search string related to what you are trying to do.
SymbolHound is a web search engine that does not remove punctuation from queries. There is an option to search source code repositories (like the now-discontinued Google Code Search), but it can also search the Internet for special characters (primarily on programming-related sites such as Stack Overflow).
Try it here: http://www.symbolhound.com
-Tom (co-founder)

What is the logic behind Google spellcheck

When I search for a word or phrase in Google and it contains a spelling mistake, Google comes back with the correct spelling or a corrected sentence. Can anyone explain how exactly this is done? I would be happy if someone could explain it in terms of programming rather than in terms of databases and the like. Thank you.
A combination of string comparison (against a dictionary), stemming, and popularity matching of words based on its large body of user statistics.
EDIT: there's a Wikipedia page that may help you understand how computer spell checking works.
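As a rough sketch of how string comparison and popularity can be combined (this is not Google's actual algorithm, which is not described here), the code below ranks known words first by Levenshtein edit distance to the query and then by how often each word has been seen. The `POPULARITY` counts, the function names, and the `max_distance` cutoff are all invented for illustration, and stemming is left out.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = cur
    return prev[-1]


def did_you_mean(query, popularity, max_distance=2):
    """Return the closest known word, breaking ties by popularity."""
    if query in popularity:
        return query
    candidates = [
        (edit_distance(query, word), -count, word)
        for word, count in popularity.items()
    ]
    candidates = [c for c in candidates if c[0] <= max_distance]
    return min(candidates)[2] if candidates else None


# Hypothetical counts of how often each word was searched for.
POPULARITY = {"google": 900, "goggle": 40, "googol": 5}

print(did_you_mean("gogle", POPULARITY))
```

Here "google" and "goggle" are both one edit away from "gogle", so the popularity count decides between them.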