Issue with QnA Maker when a grammatical word is used as a question in the KB - qnamaker

I created a KB for acronyms. One of the acronyms is WAS, but whenever I input "was" it returns "No good match found in KB," even though that acronym is in the KB. Can anyone suggest a solution? Is it because "was" is a grammatical word?

Related

Finding non-existing words with spaCy?

I am new to spaCy. I have a German text in which I want to find all the words that are not in the dictionary (using the de_core_news_lg pipeline). Reading spaCy's documentation, the only thing I found that looked promising was Token.has_vector. When I check all the tokens in the Doc object I get by running nlp(TEXT), I find that, indeed, the tokens for which has_vector is False seem to be either typos or rare words unlikely to be in the dictionary.
So my hypothesis is that has_vector being False is equivalent to the word not being found in the dictionary. Am I correct? Is there a better way to find words that are not in the dictionary?
spaCy does not include functionality for checking if a word is in the dictionary or not.
If you've loaded a pipeline with vectors, you can use has_vector to check if a word vector is present for a given token. This is kind of similar to checking if a word is in the dictionary, but it depends on the vectors - for most languages the vectors just include any word that appeared at least a certain number of times in a training corpus, so common typos or other strange things will be present, while some words may be randomly missing.
If you want to detect "real" words in some way it's best to source your own list.
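A minimal sketch of that check, assuming the de_core_news_lg pipeline is installed (the sample sentence is invented, and note that has_vector is an attribute, not a method):

    import spacy

    # Assumes the pipeline is installed:
    #   python -m spacy download de_core_news_lg
    nlp = spacy.load("de_core_news_lg")

    doc = nlp("Das ist ein Beispeil mit einem Tippfehler.")  # made-up sample

    # has_vector is an attribute, not a method; it reports whether the
    # pipeline's vector table contains a vector for this token.
    unknown = [tok.text for tok in doc if tok.is_alpha and not tok.has_vector]
    print(unknown)  # tokens without a vector -- likely typos or rare words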

Synonyms in QnA Maker

How can I make QnA Maker recognize synonyms in the questions asked, so that it returns the correct answer? For example, the words disbursement and distribution mean the same thing in my work. Is there a way to ensure that QnA Maker will understand both?
All you need to do is include the phrases as individual questions for that particular "answer". In the QnA Maker portal, you enter an initial phrase (aka question) and then enter additional phrases by clicking "Add alternative phrases". Both of these phrases are then associated with the one provided answer. Hope this helps!
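For illustration, a QnA pair with synonym phrasings boils down to one answer attached to several questions, roughly like this (a sketch only; the field names mimic the qnaList format used by the QnA Maker APIs, and the content is made up):

    # Hypothetical sketch of one QnA pair with synonym phrasings.
    qna_pair = {
        "id": 1,
        "answer": "Funds are paid out quarterly.",
        "questions": [
            "When is the disbursement made?",
            "When is the distribution made?",  # synonym as its own question
        ],
    }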

Question Answering with Lucene

For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). The only structure is that sentence and paragraph boundaries are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in a context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for that kind of systems. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
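As a rough sketch of that feedback loop (the feedback data below is made up), a toy naive Bayes ranking of helpful vs unhelpful for a keyword could look like this:

    from collections import Counter, defaultdict
    import math

    # Toy feedback data (made up): keyword paired with a user rating.
    feedback = [("payload", "helpful"), ("payload", "helpful"),
                ("payload", "unhelpful"), ("index", "unhelpful")]

    label_counts = Counter(lbl for _, lbl in feedback)
    term_counts = defaultdict(Counter)  # label -> keyword counts
    for term, lbl in feedback:
        term_counts[lbl][term] += 1
    vocab = {term for term, _ in feedback}

    def score(term, label):
        """log P(label) + log P(term | label), with add-one smoothing."""
        prior = label_counts[label] / len(feedback)
        likelihood = ((term_counts[label][term] + 1) /
                      (sum(term_counts[label].values()) + len(vocab)))
        return math.log(prior) + math.log(likelihood)

    # Rank the labels for a keyword; use this to boost/demote documents.
    print(max(label_counts, key=lambda l: score("payload", l)))  # helpful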
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding text as payloads. That means you'd store each sentence three times (as the preceding, current, and following sentence), and you'd have to dig into the payload manually.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
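Here is a minimal sketch of that ID scheme in plain Python rather than Lucene, with a toy corpus and a deliberately naive sentence splitter:

    import re
    from collections import defaultdict

    # Each sentence is a "document" keyed by a sequential ID, so a hit on
    # sentence i can be shown with sentences i-1 and i+1 as context.
    corpus = "First sentence. Second sentence about Lucene. Third sentence."
    sentences = re.split(r"(?<=[.!?])\s+", corpus)  # naive splitter

    index = defaultdict(set)  # term -> sentence IDs
    for sid, sent in enumerate(sentences):
        for term in re.findall(r"\w+", sent.lower()):
            index[term].add(sid)

    def search(keyword):
        """Yield each matching sentence with one neighbour on each side."""
        for sid in sorted(index.get(keyword.lower(), ())):
            lo, hi = max(sid - 1, 0), min(sid + 2, len(sentences))
            yield " ".join(sentences[lo:hi])

    for hit in search("Lucene"):
        print(hit)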
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of Lucene, which does text indexing, search, and retrieval, I think something like Apache Mahout would help here. Mahout treats text as knowledge rather than just strings, which makes answering questions better than plain text matching. Mahout is a machine learning and data mining framework, which fits this domain better. Just a very high-level thought.
--Sai

Code related web searches

Is there a way to search the web which does NOT remove punctuation? For example, I want to search for window.window->window (Yes, I actually do, this is a structure in mozilla plugins). I figure that this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Google Code Search ("window.window->window", but it doesn't seem to return any relevant results for this query).
There are similar tools all over the internet, like Codase or Koders, but I'm not sure they let you search for exactly this string. They might still be useful to you, so I think they're worth mentioning.
edit: It is very unlikely you'll find a general-purpose search engine that lets you search for something like "window.window->window", because most search engines do some processing on a document before storing it. For instance, they might represent it internally as vectors of words (a vector space model) and search over that rather than the original string. Creating such a vector involves first cutting the document up on punctuation and other such characters. This is a very complex and interesting subject which I can't tell you much more about; my memory did a pretty good job here, since I studied it at school!
BTW, they probably do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and friends are doing, but it can give you a hint about what happens to your query.
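To make that concrete, here is a made-up sketch showing how tokenization flattens "window.window->window" into plain "window" tokens, after which tf-idf can no longer tell the documents apart:

    import math
    import re

    # Toy documents (made up): tokenization strips punctuation, so the
    # query "window.window->window" is reduced to three "window" tokens.
    docs = ["window.window->window is a Mozilla plugin structure",
            "open a new browser window",
            "the window object in JavaScript"]
    tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
    print(tokenized[0][:3])  # ['window', 'window', 'window'] -- punctuation gone

    def tf_idf(term, doc_tokens):
        tf = doc_tokens.count(term) / len(doc_tokens)
        df = sum(term in toks for toks in tokenized)  # document frequency
        return tf * math.log(len(tokenized) / df)

    # "window" occurs in every document, so idf = log(3/3) = 0: the term
    # cannot discriminate at all once the punctuation has been discarded.
    print([round(tf_idf("window", toks), 3) for toks in tokenized])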
There is no way to do that by itself in the main Google engine, as you discovered -- however, if you are looking for information about Mozilla, then the best bet would be to structure your query something more like this:
"window.window->window" +Mozilla
OR +XUL
+ Another search string related to what you are
trying to do.
SymbolHound is a web search engine that does not remove punctuation from queries. It has an option to search source code repositories (like the now-discontinued Google Code Search), but it can also search the web for special characters, primarily on programming-related sites such as Stack Overflow.
try it here: http://www.symbolhound.com
-Tom (co-founder)

What is the logic behind Google spellcheck?

When I search for a word or something in Google and there is a spelling mistake in that word or sentence, Google comes back with the correct spelling or a corrected sentence. Can anyone explain how exactly this is done? I would be happier if it could be explained in terms of programming rather than databases and all that. Thank you.
It's a combination of string comparison against a dictionary (e.g., edit distance), stemming, and popularity matching based on Google's large store of user query statistics.
EDIT: there's a Wikipedia page that may help you understand how computer spell checking works.
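A minimal sketch of the dictionary-plus-popularity idea (the tiny dictionary and counts below are invented; a real engine draws on enormous query logs):

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    # word -> popularity (how often users typed/accepted it); made up.
    dictionary = {"google": 1000, "goggle": 50, "goods": 200}

    def suggest(word):
        """Prefer the closest word; break ties by popularity."""
        return min(dictionary,
                   key=lambda w: (edit_distance(word, w), -dictionary[w]))

    print(suggest("gogle"))  # -> "google"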