Identification of an important document - text-mining

I have a set of text documents about Java. I have to identify, programmatically, the most important document in the set, i.e., the one an expert would pick.
E.g., given 10 books on Java, the system should identify Java: The Complete Reference as the most important or most relevant document (based on its similarity to the Wikipedia page about Java).
One method would be to take a reference document and compute the similarity between it and each document in the set (as in the example above), then report the document with the maximum similarity as the most important one.
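For concreteness, a minimal sketch of this reference-document approach, assuming a plain bag-of-words representation and cosine similarity (TF-IDF weighting would likely work better):

import java.util.*;

public class ReferenceSimilarity {
    // Build a bag-of-words term-frequency map from lowercased text.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer bf = b.get(e.getKey());
            if (bf != null) dot += e.getValue() * (double) bf;
            normA += (double) e.getValue() * e.getValue();
        }
        for (int f : b.values()) normB += (double) f * f;
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // The document most similar to the reference wins.
    static int mostImportant(String reference, List<String> docs) {
        Map<String, Integer> ref = termFreqs(reference);
        int best = -1;
        double bestSim = -1;
        for (int i = 0; i < docs.size(); i++) {
            double sim = cosine(ref, termFreqs(docs.get(i)));
            if (sim > bestSim) { bestSim = sim; best = i; }
        }
        return best;
    }
}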
I want to identify other, more efficient methods of doing this.
Please suggest other methods for finding the most relevant document (in an unsupervised way, if possible).

I think another mechanism would be to keep a dictionary of keywords with a ranking map associated with each document.
For example, in the case of the Java: The Complete Reference book, there would be a dictionary of keywords, each with a rank:
Java - 10
J2EE - 5
J2SDK - 10
Java5 - 10
etc.
Note: if your documents are dynamic streams and their names are also dynamic, I am not sure how to handle that.
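A minimal sketch of this keyword-weighting idea, using the hypothetical weights from the example above:

import java.util.*;

public class KeywordScorer {
    // Hypothetical keyword weights, mirroring the example list above.
    static final Map<String, Integer> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("java", 10);
        WEIGHTS.put("j2ee", 5);
        WEIGHTS.put("j2sdk", 10);
        WEIGHTS.put("java5", 10);
    }

    // Score a document by summing the weight of every keyword occurrence;
    // the highest-scoring document is taken as the most relevant.
    static int score(String text) {
        int score = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            score += WEIGHTS.getOrDefault(token, 0);
        }
        return score;
    }
}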

Related

Different CloudSearch relevance scores for equivalent matches

I'm new to AWS CloudSearch and have set up my first domain. It only has one basic text index field.
I've tried a number of simple searches and – more often than not – I get different relevance scores across documents where it seems they should be the same. Even searching for one simple word, which matches exactly once in a number of documents, often produces different scores.
Is this supposed to happen? If so, why?
This is normal. Document length is one factor that will affect this. Think about it: finding your query in a 5-word document indicates a better match than finding it in a 1000-word document.
The current version of CloudSearch uses Solr/Lucene, an Apache project, so you can dig into the internals to your heart's content if you'd like to learn more. See Lucene's Similarity class documentation, which discusses the underlying scoring algorithm.
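As a rough illustration (not CloudSearch's exact implementation), classic Lucene scoring in the style of DefaultSimilarity combines term frequency, inverse document frequency, and a length norm, which is why the same single match scores higher in a shorter field; a simplified sketch:

public class LengthNormDemo {
    // Simplified single-term score in the style of classic Lucene TF-IDF,
    // ignoring boosts and query-level normalization factors.
    static double score(int termFreq, int docFreq, int numDocs, int fieldLength) {
        double tf = Math.sqrt(termFreq);                              // more occurrences help, sublinearly
        double idf = 1 + Math.log((double) numDocs / (docFreq + 1)); // rarer terms weigh more
        double lengthNorm = 1.0 / Math.sqrt(fieldLength);             // shorter fields score higher
        return tf * idf * idf * lengthNorm;                           // idf appears squared in Lucene's practical formula
    }

    public static void main(String[] args) {
        // Same single match (tf = 1) in a 5-word field vs a 1000-word field:
        System.out.println(score(1, 3, 10, 5));    // higher score
        System.out.println(score(1, 3, 10, 1000)); // lower score
    }
}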
As your app matures, you may want to look into custom ranking of your results. CloudSearch provides this capability, as well as a tool for comparing the results according to different rankers. You aren't able to customize the base document relevance score, but you can boost it according to different fields, etc.

Tagging documents with predefined labels

I am working with a large number of documents and have a set of predefined categories/tags (which could be phrases) that will be present in the text of the documents in either exact or inexact form.
I want to assign each document exactly one tag: whichever tag is closest to its text.
Please give me some directions as to what I should do to address this problem.
You can look at the Lucene search engine, which can tag documents while indexing. The Northernlight search engine used to perform a task similar to the one you mention as part of its search methodology; you can look at its implementation to get an idea.
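As a crude baseline that needs no search engine, you could score each tag by how many of its tokens appear in the document and pick the best one; a minimal sketch (real handling of inexact forms would need stemming or fuzzy matching on top of this):

import java.util.*;

public class TagAssigner {
    // Assign the tag whose tokens best overlap the document's tokens.
    // Assumes every tag contains at least one token.
    static String closestTag(String docText, List<String> tags) {
        Set<String> docTokens = new HashSet<>(Arrays.asList(docText.toLowerCase().split("\\W+")));
        String best = null;
        double bestScore = -1;
        for (String tag : tags) {
            String[] tagTokens = tag.toLowerCase().split("\\W+");
            int hits = 0;
            for (String t : tagTokens) {
                if (docTokens.contains(t)) hits++;
            }
            double score = (double) hits / tagTokens.length; // fraction of tag tokens present
            if (score > bestScore) { bestScore = score; best = tag; }
        }
        return best;
    }
}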

What is the easiest way to implement terms association mining in Solr?

Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a co-occurrence matrix of docs X terms and find the terms that occur in the same documents most often. In my previous projects I implemented this directly in Lucene by iterating over TermDocs (obtained by calling IndexReader.termDocs(Term)), but I can't see anything similar in Solr.
So, my needs are:
To retrieve the most associated terms within a particular field.
To retrieve the term that is closest to a specified one within a particular field.
I will rate answers in the following way:
Ideally, I would like to find a Solr component that directly covers the specified needs, that is, something that returns associated terms directly.
If this is not possible, I'm looking for a way to get the co-occurrence matrix information for a specified field.
If this is not an option either, I would like to know the most straightforward way to 1) get all terms and 2) get the ids (numbers) of the documents these terms occur in.
You can export a Lucene (or Solr) index to Mahout and then use Latent Dirichlet Allocation. If LDA is not close enough to LSA for your needs, you can just take the correlation matrix from Mahout and then use Mahout to compute its singular value decomposition.
I don't know of any LSA components for Solr.
Since there are still no answers to my question, I have to write up my own thoughts and accept them. Nevertheless, if someone proposes a better solution, I'll happily accept it instead of mine.
I'll go with the co-occurrence matrix, since it is the most fundamental part of association mining. In general, Solr provides all the functions needed to build this matrix in some way, though they are not as efficient as direct access with Lucene. To construct the matrix we need:
All terms, or at least the most frequent ones, because rare terms, by their nature, won't affect the result of association mining.
The documents where these terms occur, again, at least the top documents.
Both of these tasks are easily done with standard Solr components.
To retrieve terms, the TermsComponent or faceted search may be used. We can get either only the top terms (the default) or all terms (by setting the maximum number of terms to return; see the documentation of the particular feature for details).
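For example, the TermsComponent can be queried over HTTP roughly like this, assuming the standard /terms request handler is configured (terms.limit=-1 asks for all terms):
http://ip:port/solr/someinstance/terms?terms.fl=field&terms.limit=-1&wt=json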
Getting the documents containing a given term is simply a search for that term. The weak point here is that we need one request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides information about the number of occurrences of the term in a found document.
With this in place, it is easy to build the co-occurrence matrix. To mine the associations, you can use other software such as Weka, or write your own implementation of, say, the Apriori algorithm.
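If you can open the underlying Lucene index directly (bypassing Solr's HTTP API), a minimal sketch of collecting the postings and counting co-occurrences, using the Lucene 3.x TermEnum/TermDocs API mentioned in the question (later Lucene versions replaced these with Terms/TermsEnum/PostingsEnum):

import org.apache.lucene.index.*;
import java.io.IOException;
import java.util.*;

public class CooccurrenceSketch {
    // For each term in the field, record the set of documents it occurs in.
    static Map<String, Set<Integer>> collectPostings(IndexReader reader, String field) throws IOException {
        Map<String, Set<Integer>> postings = new HashMap<>();
        TermEnum terms = reader.terms(new Term(field, ""));
        do {
            Term t = terms.term();
            if (t == null || !t.field().equals(field)) break; // past the end of this field
            Set<Integer> docs = new HashSet<>();
            TermDocs td = reader.termDocs(t);
            while (td.next()) docs.add(td.doc());
            postings.put(t.text(), docs);
        } while (terms.next());
        return postings;
    }

    // Co-occurrence count of two terms = number of documents containing both.
    static int cooccurrence(Map<String, Set<Integer>> postings, String a, String b) {
        Set<Integer> docsA = postings.getOrDefault(a, Collections.emptySet());
        Set<Integer> docsB = postings.getOrDefault(b, Collections.emptySet());
        int count = 0;
        for (Integer d : docsA) {
            if (docsB.contains(d)) count++;
        }
        return count;
    }
}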
You can get the count of occurrences of a given term in each found document with the following query:
http://ip:port/solr/someinstance/select?defType=func&fl=termfreq(field,xxx),*&fq={!frange l=1}termfreq(field,xxx)&indent=on&q=termfreq(field,xxx)&sort=termfreq(field,xxx) desc&wt=json

Lucene.NET: Retrieving all the Terms used in a particular Document

Is there a way to iterate through all of the terms held against a particular document in a Lucene.NET index?
Basically I want to be able to retrieve a Document from the index based on its ID and then find the frequency with which each Term is used in that Document. Does anyone know a way to do this?
I can find the number of Documents that match a particular Term but not the Terms contained within a particular Document.
Many thanks,
Tim
In Lucene Java, at least, one of the options when indexing a document is storing the term frequency vector: simply a list of all the terms in a given field of a document and how often each of those terms was used. You can get the term frequency vector at runtime by calling a method on the IndexReader with the Lucene ID of the document in question.
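A minimal sketch with Lucene Java's classic (pre-4.0) API; Lucene.NET of the same era mirrors it closely (GetTermFreqVector). Note that the field must have been indexed with term vectors enabled (e.g. Field.TermVector.YES), or the call returns null:

import org.apache.lucene.index.*;
import java.io.IOException;

public class TermVectorDump {
    // Print every term in the "content" field of one document, with its frequency.
    static void dumpTerms(IndexReader reader, int docId) throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector(docId, "content");
        if (tfv == null) return; // term vectors were not stored for this field
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + ": " + freqs[i]);
        }
    }
}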

Tool or methods for automatically creating contextual links within a large corpus of content?

Here's the basic scenario: I have a corpus of, say, 100,000 newspaper-like articles. Minimally, they will all have a well-defined title and some amount of body content.
What I want to do is find runs of text in articles that ought to link to other articles.
So, if article Foo has a run of text like "Students in 8th grade are being encouraged to read works by John-Paul Sartre" and article Bar is titled (and about) "The important works of John-Paul Sartre", I'd like to automagically create that HTML link from Foo to Bar within the text of Foo.
You should ask yourself something before adding the links: what benefit do you want users to gain from them? You probably want to increase the navigability of your site, and there may be better ways to achieve that:
Create an easier way to add links to older articles in the form used to submit new ones.
Add a "one-click search for selected text" feature.
Add wiki-like functionality that lets users propose a link for selected text.
Add links to related articles (generated through a tagging system or text mining) below each article.
Some potential problems with a fully automated link adder:
You may need a good word-sense disambiguation algorithm; placing automatic links via regex (or simple substring matching) risks confusing or even irritating the user with bad links.
Since the number of articles is large, you do not want to generate the HTML for the extra links on every request; cache it instead.
You need to decide how to handle duplicate titles, or titles that contain another title as a substring (e.g., take the longest title, link to the most recent article, or prefer an article from the same category).
TL;DR version: look for alternative solutions that provide the desired functionality to the users.
What you are looking for are text mining tools. You can find more info and links at http://en.wikipedia.org/wiki/Text_mining. You might also want to check out Lucene and its ports at http://lucene.apache.org. Using these tools, the basic idea would be to find a set of similar articles based on the article (or title) in question. You could search various properties of the article, including the title, the content, or both. A tagging system à la Delicious (or Stack Overflow) might also be helpful. Rather than pre-creating the links between articles, you'd present the relevant articles in an interface much like the Related questions list on the right-hand side of this page.
If you wanted to find and link specific text in each article, I think you'd need to do some preprocessing to select pertinent phrases to key on. Even then, I think it would be very hard not to miss things due to punctuation or misspellings, or to avoid including irrelevant links for the same reasons.
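If you do go the fully automated route despite the caveats above, a naive baseline is a longest-title-first scan of each article body; this sketch assumes a prebuilt map from exact titles to URLs and ignores word-sense disambiguation entirely:

import java.util.*;

public class NaiveLinker {
    // titleToUrl: map from article title text to that article's URL.
    // Longer titles are matched first, so a title that is a substring of
    // another title cannot steal its match. Only the first occurrence of
    // each title is linked, to avoid link clutter.
    static String addLinks(String body, Map<String, String> titleToUrl) {
        List<String> titles = new ArrayList<>(titleToUrl.keySet());
        titles.sort((a, b) -> b.length() - a.length());
        // Claim non-overlapping match spans on the original text: start -> {end, titleIndex}.
        TreeMap<Integer, int[]> spans = new TreeMap<>();
        List<String> matched = new ArrayList<>();
        for (String title : titles) {
            int idx = body.indexOf(title);
            if (idx < 0) continue;
            int end = idx + title.length();
            Map.Entry<Integer, int[]> floor = spans.floorEntry(idx);
            Map.Entry<Integer, int[]> ceil = spans.ceilingEntry(idx);
            boolean overlaps = (floor != null && floor.getValue()[0] > idx)
                    || (ceil != null && ceil.getKey() < end);
            if (overlaps) continue;
            spans.put(idx, new int[]{end, matched.size()});
            matched.add(title);
        }
        // Assemble the linked HTML from the claimed spans.
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Map.Entry<Integer, int[]> e : spans.entrySet()) {
            int start = e.getKey(), end = e.getValue()[0];
            String title = matched.get(e.getValue()[1]);
            out.append(body, pos, start)
               .append("<a href=\"").append(titleToUrl.get(title)).append("\">")
               .append(body, start, end).append("</a>");
            pos = end;
        }
        out.append(body.substring(pos));
        return out.toString();
    }
}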