Best way to index text consisting of multilingual words in Elasticsearch - indexing

I'm new to Elasticsearch. The documentation on the official site only covers the basics and does not contain specific examples. Since it seems a little disorganized to me, I can't figure out how to get started on my goal.
I have crawled a lot of torrents, and they are published in many different languages.
I see that Elasticsearch has analysis to deal with input text, but I don't understand the workflow. From what I have tried, Elasticsearch does not run all analyzers over the input data.
It seems I should assign an analyzer to process a text.
Take a text such as "no game no life 游戏人生 ノーゲーム・ノーライフ", which contains three languages. How can I know which three analyzers I have to use? It also seems too heavy to run every analyzer over this text.
I have seen an article, Three Principles for Multilingual Indexing in Elasticsearch, that talks about this. However, as a beginner and a non-native English speaker, I find it hard to understand without an example.
Please give me some guidance.
Thank you.

I would probably create two fields (or more, one per expected language) and apply a different language-specific analyzer to each of them. Then, when you search, you would search across both fields.
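To make that concrete, here is a rough sketch using the Elasticsearch low-level Java REST client. This is only an illustration, not a definitive setup: the index name torrents, the field name title with en and cjk sub-fields, and the choice of the built-in english and cjk analyzers are all assumptions, and the mapping format assumes a recent (7.x+) Elasticsearch. For better Chinese or Japanese handling you would typically install analysis plugins such as smartcn or kuromoji and reference their analyzers instead.

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class MultilingualIndexSketch {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

        // Create the index with one sub-field per expected language,
        // each analyzed by a different language analyzer.
        Request createIndex = new Request("PUT", "/torrents");
        createIndex.setJsonEntity(
            "{ \"mappings\": { \"properties\": { \"title\": {"
          + "    \"type\": \"text\","
          + "    \"fields\": {"
          + "      \"en\":  { \"type\": \"text\", \"analyzer\": \"english\" },"
          + "      \"cjk\": { \"type\": \"text\", \"analyzer\": \"cjk\" }"
          + "    } } } } }");
        client.performRequest(createIndex);

        // Index a multilingual title; every sub-field analyzes the same source text.
        Request indexDoc = new Request("POST", "/torrents/_doc");
        indexDoc.setJsonEntity("{ \"title\": \"no game no life 游戏人生 ノーゲーム・ノーライフ\" }");
        client.performRequest(indexDoc);

        // Search all language variants of the field at once.
        Request search = new Request("GET", "/torrents/_search");
        search.setJsonEntity(
            "{ \"query\": { \"multi_match\": {"
          + "    \"query\": \"游戏人生\","
          + "    \"fields\": [\"title\", \"title.en\", \"title.cjk\"] } } }");
        client.performRequest(search);

        client.close();
    }
}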

Related

SpaCy different Language Models

I'm making some progress :) developing my little OCR project.
I was wondering if my idea is possible in this case!
After extracting the text from images (OCR), I use NLP (spaCy) to identify two entities (LOC and PER). I write them to a dictionary and later to JSON data. That works well.
Now I'm wondering if I can improve my identified entities.
One way I can imagine is to use the right language model for each text.
I have various texts in German, English, Spanish and French.
At the moment I'm using the
But now I have no idea how to fit langdetect into this.
Have a great week!
Greets
Here is a link that you might find useful when it comes to detecting a language (there are multiple options, including langdetect) - How to detect language
You can create a dictionary with the languages you plan to detect and match it with langdetect's output. I guess you have the rest sorted out.

Suggestion/Help regarding text mining

I need to do text mining on a single document using the MapReduce concept.
A few of my friends suggested that I use Apache Lucene.
But after going through a few documents about Apache Lucene, I found that it is useful only when we need to index documents.
Can anyone suggest any better methods?
Thank you in advance
Lucene is a framework for document indexing and retrieval. Of course, one can work with the indexed data for things like keyword search, document similarity, etc.
If you are interested in text mining, have a look at OpenNLP and LingPipe. They have hundreds of libraries for text mining and natural language processing.
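For context, here is a minimal sketch of the kind of indexing and keyword retrieval Lucene provides out of the box; the field name content, the sample text, and the in-memory directory are just illustrative choices, and details vary a little between Lucene versions.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneKeywordSearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index a single document.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "the quick brown fox jumps over the lazy dog", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Run a keyword query against the indexed field.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("quick fox");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("matching documents: " + hits.totalHits);
        }
    }
}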

Lucene to bring cheeseburger when searching for burger

I would like that, if a Lucene document contains the word cheeseburger and a user searches for burger, this document comes up. I see that I will probably need a custom analyzer to break this compound word into cheese and burger. However, breaking words may also bring up irrelevant results.
For example: if, when indexing production, we also index product and ion, then when the user searches for ion, documents containing production will come up, which is not relevant.
So a simple word breaker won't cut it. I need a way of knowing that cheeseburger is associated with burger and cheese, but that production is not associated with ion.
Is there a more intelligent process to achieve this?
Does this have a name, the way stemming is the name for reducing words to their root form?
Depending on how accurate you want your synonymy to be, you might need to look into approaches such as Latent Semantic Analysis (LSA) and its variants, such as LDA. A simpler approach would be to use an ontology such as WordNet to augment your searches. A WordNet Lucene index is available. However, if your scenario includes domain-specific vocabulary, then you might need to generate a "mapping" ontology.
You should look at DictionaryCompoundWordTokenFilter, which uses a brute-force algorithm to split compound nouns based on a dictionary.
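As a rough sketch (exact constructors and package names shift a bit between Lucene versions, and the dictionary contents here are just an example), a custom analyzer could chain that filter in so that cheeseburger is indexed as cheeseburger, cheese and burger:

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CompoundWordAnalyzer extends Analyzer {
    // A hand-maintained dictionary of word parts we are willing to split on.
    private static final CharArraySet DICTIONARY =
        new CharArraySet(Arrays.asList("cheese", "burger"), true);

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Emits the original token plus any dictionary words found inside it,
        // so "cheeseburger" also yields "cheese" and "burger".
        stream = new DictionaryCompoundWordTokenFilter(stream, DICTIONARY);
        return new TokenStreamComponents(source, stream);
    }
}

Because only words in the dictionary are ever split out, production is never broken into product and ion unless you add those entries yourself.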
In most cases you can simply use wildcard queries with a leading wildcard, e.g. *burger. You only have to enable support for leading wildcards on your query parser:
parser = new QueryParser(LuceneVersion.getVersion(), searchedAttributes, analyzer);
parser.setAllowLeadingWildcard(true);
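// Illustrative follow-up: with leading wildcards enabled, a query such as *burger will also match cheeseburger.
Query query = parser.parse("*burger");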
Take care:
Leading wildcards might slow your search down.
If you need a more specific solution, I would suggest going with stemming. It is really a matter of finding the right analyzer.
There are stemming implementations for several languages e.g. the SnowballAnalyzer (http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/snowball/SnowballAnalyzer.html).
Best regards,
Chris
Getting associations by looking at the word is not going to scale to other words. For example, you cannot know "whopper" is associated with burger and "big-mac" is associated with cheese just by looking at the words. To make the search aware of the associations, you probably need a database of associations like "A is a B" or "A contains B". (As Mikos has mentioned, I think WordNet provides such a database.) Then, when you see B in a query, you translate the query so that it also searches for A.
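As a hedged sketch of that idea in plain Lucene (the word pairs are only examples, and there are other ways to plug the map in), such an association table can be expressed as a SynonymMap applied to the query-side analyzer, so that a query for burger is also run against cheeseburger:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class AssociationQueryAnalyzer extends Analyzer {
    private final SynonymMap associations;

    public AssociationQueryAnalyzer() throws Exception {
        // "A contains B" associations: a query for burger or cheese
        // should also search for cheeseburger (production/ion is simply not listed).
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("burger"), new CharsRef("cheeseburger"), true);
        builder.add(new CharsRef("cheese"), new CharsRef("cheeseburger"), true);
        this.associations = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Expands each query token with its associated terms while keeping the original.
        stream = new SynonymGraphFilter(stream, associations, true);
        return new TokenStreamComponents(source, stream);
    }
}

Passing such an analyzer to the QueryParser used at search time leaves the index itself unchanged; expanding at index time instead is also possible.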
I think the underlying question is -- how big is the collection you are indexing? If you are indexing some collection where all of the synonyms and related words are already known, then the index can just include the synonyms and related words directly, like 'cheeseburger' including the related words 'cheese' and 'burger'. (An approach successfully used in the LOINC standard medical terms Lucene index.)
If you are trying to solve the general problem for a whole human language (English, Chinese, etc.) then you have to move to some kind of semantic analysis as mentioned above.
It might be useful to talk with the subject matter experts of the area you are indexing to see how they search for terms -- what synonyms/related words do they use, do they have defined lists of synonyms/related words, do they need/use stemming, etc. This should give you some idea as to which approach (direct synonym/related-word inclusion or semantic analysis) you need to pursue.

Question Answering with Lucene

For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in a context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for that kind of systems. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
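A minimal sketch of that feedback loop, assuming you keep helpful/unhelpful counts per (search term, document) pair; the class and field names are hypothetical, and the Laplace-smoothed estimate below is a simple stand-in for a fuller Bayesian classifier:

import java.util.HashMap;
import java.util.Map;

// Hypothetical feedback store: not part of Lucene, just a sketch of the idea.
public class FeedbackScorer {
    private final Map<String, int[]> counts = new HashMap<>(); // key -> {helpful, unhelpful}

    private static String key(String term, String docId) {
        return term + "|" + docId;
    }

    // Record a user's 'helpful' / 'unhelpful' vote for a document returned for a term.
    public void recordVote(String term, String docId, boolean helpful) {
        int[] c = counts.computeIfAbsent(key(term, docId), k -> new int[2]);
        c[helpful ? 0 : 1]++;
    }

    // Laplace-smoothed estimate of P(helpful | term, doc); 0.5 when there is no history.
    public double helpfulness(String term, String docId) {
        int[] c = counts.getOrDefault(key(term, docId), new int[2]);
        return (c[0] + 1.0) / (c[0] + c[1] + 2.0);
    }
}

The resulting probability could then be folded into the ranking, for example by multiplying it into each candidate answer's Lucene score.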
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding text as payloads. That means you'll need to store each sentence three times (before, during and after), and you'll have to manually dig into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
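A sketch of that scheme (the field names sentence_id and text are assumptions): store the running sentence number in its own stored field so that the neighbouring sentences can be looked up by ID at display time.

import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class SentenceIndexer {
    // Index each sentence as its own document, tagged with its position in the corpus.
    public static void indexSentences(IndexWriter writer, List<String> sentences) throws Exception {
        for (int i = 0; i < sentences.size(); i++) {
            Document doc = new Document();
            doc.add(new StringField("sentence_id", Integer.toString(i), Field.Store.YES));
            doc.add(new TextField("text", sentences.get(i), Field.Store.YES));
            writer.addDocument(doc);
        }
    }
    // At display time, look up sentence_id - 1 and sentence_id + 1 (e.g. with a TermQuery
    // on the sentence_id field) to show the hit in context.
}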
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of Lucene, which does text indexing, search and retrieval, I think using something like Apache Mahout would help with this. Mahout treats text as knowledge, and doing that makes answering the question better than just text matching. Mahout is a machine learning and data mining framework which fits this domain better. Just a very high-level thought.
--Sai

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is breaking down content by word count and/or creating a complex set of search rules to give one piece of content priority over another. This might be more trouble than it's worth, but I was wondering if anybody knows of a way, or a product out there, that would be able to help me.
To further support Ivan's answer above, Lucene is the way to go. You haven't mentioned what platform you're on, so I'll point out that you can use a .NET port of this too.
If you do use Lucene there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters you can just dump all of your text into the index and allow the engine to just search on it. However, I'd recommend adding fixed fields to your index which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, let's say you have a field for the website. Then you can partition your index by restricting a search to those documents that have that website in that field.
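As a rough sketch of such a partitioned search in Lucene (the field name website, the method shape, and the FILTER clause choice are illustrative assumptions):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class PartitionedSearchSketch {
    // Restrict a full-text query to documents whose "website" field matches one site.
    public static TopDocs searchOneSite(IndexSearcher searcher, Query contentQuery, String site)
            throws Exception {
        Query partition = new TermQuery(new Term("website", site));
        BooleanQuery query = new BooleanQuery.Builder()
            .add(contentQuery, BooleanClause.Occur.MUST)  // the actual search terms
            .add(partition, BooleanClause.Occur.FILTER)   // partition constraint: no effect on scoring
            .build();
        return searcher.search(query, 10);
    }
}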
The other option is to extract points of interest from your document and allow searches on those without searching the entire index entry. Your mileage may vary with this, as the Lucene engine is very well written, so it may simply allow you to collect your searches into more logical units, which helps you with your solution.
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server then the full text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or online for specifics.