I have a SentimentAttribute class which extends AttributeImpl. I am also currently writing a SentenceSentimentTaggingFilter class which should:
take an InputStream (consisting of text),
tokenize it into sentences,
assign a sentiment to each sentence, i.e., by adding a SentimentAttribute to it.
The problem I currently have is that Lucene only seems to offer functionality for tokenizing text into individual tokens (e.g., single words), but nothing for splitting text into sentences.
What is the best way to integrate this with the regular EnglishAnalyzer I'm also using during indexing? I would like to avoid running the EnglishAnalyzer and my analysis in parallel, and instead hook my analysis in between the EnglishAnalyzer's processing steps (assuming that this is the fastest / most efficient way).
Thanks a lot in advance :)
I'm actually doing something very similar, but in an earlier version of Lucene (3.0.2). You may want to look at the following class:
org.apache.lucene.wordnet.AnalyzerUtil
You've probably found a way to do this by now, but I hope it helps anyway.
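For later readers: if all you need is the sentence-splitting step, the plain JDK can do it before the text ever reaches the Lucene token stream. A minimal sketch using java.text.BreakIterator (standard Java, nothing Lucene-specific; you would then run each sentence through the analyzer and attach your SentimentAttribute per sentence):

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    public class SentenceSplitter {
        // Splits text into sentences using the JDK's built-in boundary analysis.
        public static List<String> split(String text) {
            List<String> sentences = new ArrayList<String>();
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            it.setText(text);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String sentence = text.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
            }
            return sentences;
        }
    }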
Related
I'm new to Elasticsearch. The official documentation only covers the basics and doesn't contain specific examples, and since it seems a little disorganized to me, I can't figure out how to get started on my goal.
I have crawled a lot of torrents, and they are published in many different languages.
I see that Elasticsearch has analysis for dealing with input text, but I don't understand the workflow: from what I've tried, Elasticsearch does not run every analyzer over the input data.
It seems I have to assign a specific analyzer to process a given text.
Take a text like "no game no life 游戏人生 ノーゲーム・ノーライフ", which contains three languages. How can I know which three analyzers I have to use? It also seems too heavy to run every analyzer over this text.
I have seen an article, Three Principles for Multilingual Indexing in Elasticsearch, that talks about this. However, as a beginner and a non-native English speaker, I find it hard to understand without an example.
Please give me some guidance.
Thank you.
I would probably create two fields (or one per expected language) and apply a different, language-specific analyzer to each of them. Then, when you search, you would search across both fields.
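In Lucene terms (which Elasticsearch is built on), this per-field idea is what PerFieldAnalyzerWrapper does. A minimal sketch, assuming hypothetical field names title_en and title_cjk and a recent Lucene version (older versions also take a Version argument in the constructors):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class MultiLanguageAnalyzer {
        public static Analyzer build() {
            Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
            perField.put("title_en", new EnglishAnalyzer());  // stemmed English copy
            perField.put("title_cjk", new CJKAnalyzer());     // bigrams for Chinese/Japanese/Korean
            // Any other field falls back to the standard analyzer.
            return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
        }
    }

In Elasticsearch you would express the same thing in the mapping, giving each sub-field its own built-in analyzer (e.g. english and cjk), and then query across both fields at search time.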
Using Lucene, I want to compare a document in the index with the rest of the documents. I found that an easy way would be to submit the document itself as a query. The problem is that I need to combine the terms with OR and, the most difficult part, boost each term by its term frequency.
I think that if I replace all the whitespace in the document with ' OR ', Lucene will parse and interpret it. But is there a more sophisticated way to deal with this problem?
And what is the easiest way to boost the terms by their respective frequencies?
It looks like you are trying to re-implement Lucene's MoreLikeThis.
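For reference, a minimal sketch of how MoreLikeThis is typically used (the index path, field name "body", and doc id 42 are placeholders; in Lucene 3.x the class lives in org.apache.lucene.search.similar instead of org.apache.lucene.queries.mlt):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class SimilarDocs {
        public static void main(String[] args) throws Exception {
            IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(new StandardAnalyzer());
            mlt.setFieldNames(new String[] { "body" }); // placeholder field name
            mlt.setMinTermFreq(1);  // keep even rare terms from the source doc
            mlt.setMinDocFreq(1);
            Query query = mlt.like(42); // 42 = internal id of the document to compare
            TopDocs similar = new IndexSearcher(reader).search(query, 10);
            System.out.println(similar.totalHits + " similar documents");
        }
    }

It builds exactly the kind of query you describe: the document's interesting terms OR'ed together, weighted by their frequencies.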
I would like to eliminate from the search query the words/phrases that bring no meaning to the query (we could call them stop phrases). Example:
"How to .."
"Where can I find .."
"What is the meaning of .."
etc.
Where to find / how to compute a list of 'common phrases' for English and for French?
How to implement it in Solr (Is there anything more advanced than the stopwords feature?)
I think that you shouldn't try to completely get rid of these phrases, because they reveal the intent of the searcher. You can instead try to leverage them by using a natural-language question-answering system like Ephyra. There is even a project aimed at integrating it with Lucene. I haven't used it myself, but maybe evaluating it is at least worth a try.
If you are determined to remove them, then I think you need to write a custom QueryParser that filters the query, delegating the further processing to a parser of your choice.
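A minimal sketch of that pre-processing step, with a hypothetical stop-phrase list (in practice you would build it from query logs, per language), applied before the query reaches a normal QueryParser:

    import java.util.Arrays;
    import java.util.List;

    public class StopPhraseStripper {
        // Hypothetical list; extend it per language (English, French, ...).
        private static final List<String> STOP_PHRASES = Arrays.asList(
                "how to", "where can i find", "what is the meaning of");

        public static String strip(String query) {
            String cleaned = query.toLowerCase().trim();
            for (String phrase : STOP_PHRASES) {
                if (cleaned.startsWith(phrase)) {
                    // Only the leading phrase carries no content; keep the rest.
                    return cleaned.substring(phrase.length()).trim();
                }
            }
            return cleaned;
        }
    }

For example, strip("What is the meaning of inode") returns "inode", which is what you would actually parse and search.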
For a toy project, I want to build an automated question-answering system with Lucene, and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in a context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for that kind of systems. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
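As a toy illustration of that idea (all names hypothetical, not a library API): keep helpful/unhelpful counts per keyword-document pair and fold a smoothed ratio into the ranking.

    import java.util.HashMap;
    import java.util.Map;

    public class FeedbackScorer {
        // (keyword, docId) -> {helpful votes, unhelpful votes}
        private final Map<String, int[]> votes = new HashMap<String, int[]>();

        public void rate(String keyword, String docId, boolean helpful) {
            String key = keyword + "\u0000" + docId;
            int[] v = votes.get(key);
            if (v == null) {
                v = new int[2];
                votes.put(key, v);
            }
            v[helpful ? 0 : 1]++;
        }

        // Laplace-smoothed estimate that this doc is helpful for this keyword;
        // usable as a multiplier on the raw retrieval score.
        public double helpfulness(String keyword, String docId) {
            int[] v = votes.get(keyword + "\u0000" + docId);
            if (v == null) return 0.5; // no feedback yet
            return (v[0] + 1.0) / (v[0] + v[1] + 2.0);
        }
    }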
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding text as payloads. That means you'll need to store each sentence three times (as the before, current, and after text), and you'll have to dig into the payloads manually.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
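A minimal sketch of that ID scheme (field names are placeholders; this uses a recent Lucene API, older versions would use NumericField instead of IntPoint):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;

    public class SentenceIndex {
        // Indexing: one document per sentence, with its position as an ID.
        static void addSentence(IndexWriter writer, int id, String sentence) throws Exception {
            Document doc = new Document();
            doc.add(new IntPoint("id", id));    // searchable
            doc.add(new StoredField("id", id)); // retrievable
            doc.add(new TextField("text", sentence, Field.Store.YES));
            writer.addDocument(doc);
        }

        // Display: fetch the hit plus its neighbours for context.
        static void printContext(IndexSearcher searcher, int id) throws Exception {
            for (int i = id - 1; i <= id + 1; i++) {
                TopDocs hit = searcher.search(IntPoint.newExactQuery("id", i), 1);
                if (hit.scoreDocs.length > 0) {
                    System.out.println(searcher.doc(hit.scoreDocs[0].doc).get("text"));
                }
            }
        }
    }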
The bigger question, though, is: how should you break the text up into documents? Identifying semantically related areas seems difficult, so splitting by sentence/paragraph might be the only way to go. A better approach, if you can manage it, would be to find the text that serves as the header of each section and then put everything in that section into one document.
You might also want to use the corpus's own (back-of-book) index, if it has one. The terms listed there could be boosted, as they are presumably more important.
Instead of Lucene, which does text indexing, search, and retrieval, I think using something like Apache Mahout would help with this. Mahout treats text as knowledge, and doing that makes answering the question better than plain text matching. Mahout is a machine-learning and data-mining framework that fits this domain better. Just a very high-level thought.
--Sai
Is there a way to search the web that does NOT remove punctuation? For example, I want to search for window.window->window (yes, I actually do; this is a structure in Mozilla plugins). I figure this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Google Code Search ("window.window->window", but it doesn't seem to get any relevant result for this query).
There are similar tools all over the internet, like Codase or Koders, but I'm not sure they let you search for exactly this string. They might still be useful to you, so I think they're worth mentioning.
Edit: it is very unlikely you'll find a general-purpose search engine that will let you search for something like "window.window->window", because most search engines do some processing on the document before storing it. For instance, they might represent it internally as vectors of words (a vector space model) and use those to do the search, not the actual original string. Creating such a vector involves first cutting the document up according to punctuation and other critters. This is a very complex and interesting subject which I can't tell you much more about; my bad memory did a pretty good job, considering I studied it at school!
BTW, they might do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and its friends are doing, but it can give you a hint about what happens to your query.
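To make that "cutting" concrete, here is roughly what happens to your query, using Lucene's StandardAnalyzer as a stand-in for whatever a web engine actually runs:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenizeDemo {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            TokenStream ts = analyzer.tokenStream("f", "window.window->window");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // the "->" is gone by this point
            }
            ts.end();
            ts.close();
        }
    }

Once only these word tokens are stored, the original punctuated string is unrecoverable, which is why the big engines can't match it.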
There is no way to do that by itself in the main Google engine, as you discovered. However, if you are looking for information about Mozilla, your best bet would be to structure your query something more like this:
"window.window->window" +Mozilla
OR +XUL
+ Another search string related to what you are
trying to do.
SymbolHound is a web search engine that does not remove punctuation from queries. It has an option to search source-code repositories (like the now-discontinued Google Code Search), but it can also search the web for special characters (primarily on programming-related sites such as Stack Overflow).
try it here: http://www.symbolhound.com
-Tom (co-founder)