Linking related topics IR - text-mining

How can I link terms (keywords, entities) that are related to each other across a set of text documents? An example is Google: when you search for a person, it shows recommendations of other people related to that person.
In this picture it figured out the spouse, a presidential candidate, and people with an equal designation.
I am using a frequency count technique: the more often two terms occur in the same document, the more likely they are to be related. But this also links unrelated terms such as page marks, verbs and page references in a text document.
How should I improve it, and is there any other easy but reliable technique?

You should look at a few techniques:
1.) Stop word filtering: it is common in text mining to filter out words that are typically not very important because they are too frequent, like "the", "a", "is" and so on. There are predefined dictionaries for this.
2.) TF/IDF: TF/IDF re-weights words by how well they separate documents, so a word that appears in almost every document counts for little.
3.) Named Entity Recognition: for your task at hand it might be sufficient to focus on just the names. Named entity recognition can extract names from documents.
4.) Latent Dirichlet Allocation: LDA finds concepts (topics) in documents. A concept is a set of words which frequently appear together.
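As a rough illustration of how the first two points change a plain frequency count, here is a minimal Python sketch. It counts in how many documents two terms co-occur, drops stop words, and multiplies the count by the IDF of both terms, so that terms found in nearly every document (such as page references) score close to zero. The tiny stop word list, the function names, the `min_count` threshold and the way the IDF values are combined are all assumptions made for the example, not an established formula.

```python
import math
from collections import Counter
from itertools import combinations

# Tiny illustrative stop word list (an assumption); real projects use a predefined dictionary.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "in", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def related_pairs(docs, min_count=2):
    """Score term pairs by document co-occurrence count, weighted by the IDF of both terms."""
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms of a pair
    for doc in docs:
        terms = sorted(set(tokenize(doc)))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    idf = {t: math.log(n_docs / df) for t, df in term_df.items()}
    # Pairs seen together in fewer than min_count documents are ignored as noise.
    return {pair: n * idf[pair[0]] * idf[pair[1]]
            for pair, n in pair_df.items() if n >= min_count}

docs = [
    "Barack Obama met Michelle Obama in Chicago. See page 4.",
    "Michelle Obama and Barack Obama visited Berlin. See page 7.",
    "Hillary Clinton spoke in Chicago. See page 9.",
    "The report is on page 12. See the index.",
]
scores = related_pairs(docs)
best_pair = max(scores, key=scores.get)
```

In this toy corpus "see" and "page" co-occur in every document, yet their pair scores zero because both terms have an IDF of zero, while the pairs among "barack", "michelle" and "obama" come out on top.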


Machine Learning text comparison model

I am creating a machine learning model that essentially returns how closely one text matches another.
For example: “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how well text 1 matches text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:
Words occurring in many documents (say, 90% of your sentences/documents contain the conjunction 'and') receive much less emphasis, which gives more weight to the more document-specific phrasing (this is the IDF part).
Ordering does not matter in Term Frequency (TF), as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation-oriented methods like the one mentioned above.
Big drawback: depending on the size of your corpus, your data may have too many dimensions (one dimension per unique word). You could use stemming/lemmatization to mitigate this problem to some degree.
You may calculate the similarity between two TF-IDF vectors using, for example, cosine similarity.
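To make the suggestion concrete, here is a minimal pure-Python sketch of TF-IDF vectors compared with cosine similarity, using the two sentences from the question. The smoothed IDF formula, the whitespace tokenizer and the third sentence in the toy corpus are assumptions chosen for the example; a real setup would compute IDF over a much larger corpus.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict: term -> weight) per text."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    # Document frequency: in how many texts each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption) so terms present in every text keep a small weight.
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    return [{term: count * idf[term] for term, count in Counter(tokens).items()}
            for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = ["the cat and a dog", "a dog and the cat", "the bird and a fish"]
vecs = tfidf_vectors(corpus)
sim_same_words = cosine(vecs[0], vecs[1])   # same words, different order
sim_other_words = cosine(vecs[0], vecs[2])  # only the common filler words overlap
```

Because the vectors ignore word order, the first two sentences get a similarity of 1 (up to floating point rounding), while the third sentence, which shares only "the", "and" and "a" with the first, scores clearly lower.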
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

evaluation of the ontology based semantic search query document rank precision recall IR

May I know how to evaluate a semantic search (ontology-based search) and rank the retrieved documents?
Semantic search can retrieve documents with a similar meaning even if they do not contain the keywords of the query. This means that I cannot use TF-IDF to compare the query and the documents and do the ranking, as the precision and recall will not be accurate.
How do I evaluate an ontology-based semantic search and do the document ranking?
You should use data sets that are used as gold standards.
Relevance is assessed relative to an information need, not a query. For example, an information need might be:
Information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
This might be translated into a query such as:
wine and red and white and heart and attack and effective
A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.
Here is a list of the most standard test collections and evaluation series.
The Cranfield collection. This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness, but is nowadays too small for anything but the most elementary pilot experiments. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.
Text Retrieval Conference (TREC) . The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there have been many tracks over a range of different test collections, but the best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations between 1992 and 1999. In total, these test collections comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and specified in detailed text passages. Individual test collections are defined over different subsets of this data. The early TRECs each consisted of 50 information needs, evaluated over different but overlapping sets of documents. TRECs 6-8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. This is probably the best subcollection to use in future work, because it is the largest and the topics are more consistent. Because the test document collections are so large, there are no exhaustive relevance judgments. Rather, NIST assessors' relevance judgments are available only for the documents that were among the top $k$ returned for some system which was entered in the TREC evaluation for which the information need was developed.
In more recent years, NIST has done evaluations on larger document collections, including the 25 million page GOV2 web page collection. From the beginning, the NIST test document collections were orders of magnitude larger than anything available to researchers previously and GOV2 is now the largest Web collection easily available for research purposes. Nevertheless, the size of GOV2 is still more than 2 orders of magnitude smaller than the current size of the document collections indexed by the large web search companies.
NII Test Collections for IR Systems (NTCIR). The NTCIR project has built various test collections of similar sizes to the TREC collections, focusing on East Asian language and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages. See: http://research.nii.ac.jp/ntcir/data/data-en.html
Cross Language Evaluation Forum (CLEF). This evaluation series has concentrated on European languages and cross-language information retrieval. See: http://www.clef-campaign.org/
Reuters-21578 and Reuters-RCV1. For text classification, the most used test collection has been the Reuters-21578 collection of 21578 newswire articles; see Chapter 13, page 13.6. More recently, Reuters released the much larger Reuters Corpus Volume 1 (RCV1), consisting of 806,791 documents; see Chapter 4, page 4.2. Its scale and rich annotation make it a better basis for future research.
20 Newsgroups. This is another widely used text classification collection, collected by Ken Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category). After the removal of duplicate articles, as it is usually used, it contains 18941 articles.
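Once you have relevance judgments for an information need (from one of the collections above or from your own assessors), the evaluation itself does not depend on how the ranking was produced, so it works the same for an ontology-based system as for a keyword-based one. The following Python sketch computes precision, recall and average precision for one ranked result list; the document IDs and judgments are made up for illustration.

```python
def precision_recall_at_k(ranked_doc_ids, relevant_ids, k):
    """Precision and recall over the top-k results for one information need."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

def average_precision(ranked_doc_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

# Hypothetical output of the semantic search for one information need.
ranking = ["d3", "d7", "d1", "d9", "d4"]
# Hypothetical gold-standard judgments: documents that address the need.
relevant = {"d3", "d1", "d8"}

precision, recall = precision_recall_at_k(ranking, relevant, k=5)
ap = average_precision(ranking, relevant)
```

Averaging `average_precision` over all information needs in the test collection gives mean average precision (MAP), a common single-number summary of ranking quality.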

Forward Index Implementation in google

I am trying to develop a search engine in my free time, modeled after Google.
I am using the original Google research paper listed here: http://infolab.stanford.edu/~backrub/google.html
However, I am having a few problems. To be exact, I am having trouble developing the forward index.
In the paper it says:
If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words.
Now there are two problems with this statement. First, who decides which words out of the huge lexicon go into the forward barrels? Do all of them go? Second is the meaning of the word "correspond". Does it mean words that actually appear in that document after the previous word, or something else?
I am really new to search engines and would really appreciate any information retrieval expert helping me with this. If moderators think that this question belongs on some other Stack Exchange site, please move it.
First Question:
The string value of every word is mapped to an integer (by a hash function). This is because integers are far easier to handle than strings. You can then define ranges (buckets or bins or whatever else you might want to call them) over these integer values, e.g.
term ids 0 to 1000 => Bin-1
term ids 1001 to 2000 => Bin-2
and so on.
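A minimal Python sketch of this binning idea, under stated assumptions: word IDs are handed out sequentially by a small `Lexicon` class instead of a hash function (so the example is deterministic), a barrel covers a fixed-width range of IDs, and a hit list is just the list of word positions in the document. All names and the record layout are hypothetical and only loosely follow the description quoted from the paper.

```python
from collections import defaultdict

class Lexicon:
    """Maps each distinct word to an integer word ID (sequential, an assumption)."""
    def __init__(self):
        self._ids = {}

    def word_id(self, word):
        # A new word gets the next unused integer.
        return self._ids.setdefault(word, len(self._ids))

def barrel_of(word_id, width):
    """Barrel number for a word ID: IDs 0..width-1 -> 0, width..2*width-1 -> 1, ..."""
    return word_id // width

def add_document(barrels, lexicon, doc_id, text, width=1000):
    """Record doc_id in every barrel that covers at least one of its words."""
    hits = defaultdict(list)  # word ID -> positions of that word in this document
    for position, word in enumerate(text.lower().split()):
        hits[lexicon.word_id(word)].append(position)
    per_barrel = defaultdict(list)
    for wid, positions in sorted(hits.items()):
        per_barrel[barrel_of(wid, width)].append((wid, positions))
    for barrel, entries in per_barrel.items():
        # One record per barrel: the docID followed by (wordID, hit list) pairs.
        barrels[barrel].append((doc_id, entries))

lexicon = Lexicon()
barrels = defaultdict(list)
# A tiny width so that one short sentence spreads over several barrels.
add_document(barrels, lexicon, 1, "the quick brown fox jumps over the lazy dog", width=3)
```

In this sketch every word of the document ends up in exactly one barrel, chosen purely by its word ID, and the "corresponding" hit list is simply the set of positions where that word occurs in that document.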
Second question:
The context information is typically not used. A word is simply a term present in a document, such as the terms "the", "quick", "brown" etc.
Since you said you are new to IR, a good way to start would be to read an introductory book on IR, e.g. the book by Manning, Raghavan and Schütze.

What's the difference between an inverted index and a plain old index?

In software engineering we create indexes all the time (e.g., in databases) but I also hear a lot of people talk about inverted indices. Is there something fundamentally different between the two? They sound like the same thing.
One common use is "...to allow fast full-text searching."
The two types denote directionality. One takes you forward through the index, and the other takes you backward (the inverse) through the index. That's it. There's no mystery to uncover here. Otherwise the two types are identical, it's just a question of what information you have, and as a result what information you're trying to find.
To address your inquiry, I don't think there's actually a way to know why the use is what it is today. The only reason it's important to define which is forward and which one is inverted is so that we can all have a conversation about them, and everyone knows which direction we're talking about. Think about the terms "left" and "right": they are relative. Which is which doesn't matter, except that everyone needs to agree which one is "left" and which one is "right" in order for the words to have meaning. If, as a culture, we decided to flip left and right, then you'd have the same issue figuring out what a "right turn" vs a "left turn" is since the agreed upon meaning had changed. However, the naming is arbitrary, so which one is which (in and of itself) doesn't matter - what matters is that we all agree on the meaning.
In your comment where you ask, "please don't just define the terms", you're missing the point, and I think you're just getting hung up on the wording when there is absolutely no difference between them.
For the benefit of future readers, I will now provide several "forward" and "inverted" index examples:
Example 1: Web search
If you're thinking that the inverse of an index is something like the inverse of a function in mathematics, where the inverse is a special thing that has a different form, then you're mistaken: that's not the case here.
In a search engine you have a list of documents (pages on web sites), where you enter some keywords and get results back.
A forward index (or just index) is the list of documents, and which words appear in them. In the web search example, Google crawls the web, building the list of documents, figuring out which words appear in each page.
The inverted index is the list of words, and the documents in which they appear. In the web search example, you provide the list of words (your search query), and Google produces the documents (search result links).
They are both indexes - it's just a question of which direction you're going. Forward is from documents->to->words, inverted is from words->to->documents.
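A small Python sketch of the two directions, using two made-up pages. The page names, texts and function names are assumptions for illustration; a real search engine stores far more than word lists.

```python
from collections import defaultdict

def build_forward_index(pages):
    """Forward index: document -> sorted list of distinct words it contains."""
    return {doc_id: sorted(set(text.lower().split())) for doc_id, text in pages.items()}

def invert(forward_index):
    """Inverted index: word -> sorted list of documents containing it."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word in forward_index[doc_id]:
            inverted[word].append(doc_id)
    return dict(inverted)

pages = {"page1": "quick brown fox", "page2": "lazy brown dog"}
forward = build_forward_index(pages)  # documents -> words
inverted = invert(forward)            # words -> documents
```

Here `forward["page1"]` lists the words on that page, while `inverted["brown"]` lists both pages, which is the lookup a keyword query needs.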
Example 2: DNS
Another example is a DNS lookup (which takes a host name, and returns an IP address) and a reverse lookup (which takes an IP address, and gives you the host name).
Example 3: A book
The index in the back of a book is actually an inverted index, as defined by the examples above - a list of words, and where to find them in the book. In a book, the table of contents is like a forward index: it's a list of documents (chapters) which the book contains, except instead of listing the words in those sections, the table of contents just gives a name/general description of what's contained in those documents (chapters).
Example 4: Your cell phone
The forward index in your cell phone is your list of contacts, and which phone numbers (cell, home, work) are associated with those contacts. The inverted index is what allows you to manually enter a phone number, and when you hit "dial" you see the person's name, rather than the number, because your phone has taken the phone number and found you the contact associated with it.
They called it inverted just because there is already a forward index. Take the example of a search engine, which is composed of two parts: the first part is the "web crawler and parser", which builds an index from document to word; the second part is the search database, which builds an index from word to document. Because the first index exists, we naturally call the second index the inverted index.
If you call the TOC (table of contents) of a book an index, then you should call the index at the end of the book an "inverted index". Or, the other way around, you can call the TOC the inverted index.
Typically, when speaking about an index, you mean some added calculations or stored results of procedures that have been done in order to speed up an application (e.g. in MySQL or another RDBMS; consult the MySQL docs). Indexing can also be related to caching etc.
An inverted index creates a file with a structure that is primarily intended for (full-text) searching.
An inverted index consists of two main files:
Vocabulary
Occurrences
The vocabulary holds the common words extracted from the text (of course after filtering out blacklisted words like pronouns). The occurrences file holds the connection between words and documents (word1 appears in doc1 and doc2, not in doc3). It is represented in the form of a matrix.
The image above shows the process of creating the two files mentioned.
If you are interested in reading further on this topic, I can recommend a great book by Ricardo Baeza-Yates, Modern Information Retrieval (around page 200, I think).
Hope it helps :-)
normalocity has already wonderfully differentiated between a forward and an inverted index, but as for the question of why one is called a forward index and the other an inverted index, maybe this is why:
Taking the example of a search engine crawling and indexing (or building an index for a book), a forward index can be built while you are crawling the web pages (or reading the book), that is, while going forward. So if you have 10 web pages to crawl (or 10 chapters in a book), you can crawl the first web page (read the first chapter), make a list of the words which appear in it, and continue this process for the other web pages (chapters). By the time you have crawled all 10 web pages (read all 10 chapters), your forward index is complete, with each web page (chapter) pointing to a list of the words it contains.
But to make an inverted index you have to crawl all 10 web pages (read all 10 chapters) and then take each word from each document's list and figure out which documents contain that word. So this is like going backward once you have crawled the web pages (read the chapters of the book). So it's called an inverted index.
This is just my speculation.
The term "Inverted Word Index" refers to the change in relationship of a single-document containing many-words, to each unique word containing (or identifying) a list of many-documents. This is effectively taking a One-to-Many Relationship (Docs to Words) and Inverting (or reversing) it such that a new "Inverted" One-to-Many Relationship now exists, which is each-unique-word relating to Many-Documents (i.e., all that contain that word). Its origin really is that simple, and the term "inverted index" was used to describe manual indexes of the same type long before computers and electronic high-speed indexing even existed (yes, admittedly, I'm an old, geezer programmer, almost old enough to have considered Grace Hopper a "sweet young lady" age appropriate for courting back when COBOL was a shiny new language). Please don't discard us geezers just yet, as we may occasionally provide a useful, and possibly even valuable, historical tid-bit or two - when our personal RAM is still working, that is. [grin]
There are many types of index, for example B-tree, R-tree, hash... For different purposes, we must choose the appropriate index.
The inverted index is a special one, usually used in full-text search engines. Using an inverted index, we can find a word's locations in a document (or a set of documents) as fast as possible. Given the limits of memory and CPU, other index types are not suited to this job.
You can read the Lucene documentation for more details. It's an open source search engine library. http://lucene.apache.org/java/docs/index.html
In inverted indexes, we have the following form:
word1 -> list of docs it occurs in (sorted order)
word2 -> list of docs it occurs in (sorted order)
This is very useful for search engine query processing, as it allows us to find the docs that a word occurs in.
You can use supervised machine learning to build this inverted index.
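To show why keeping each list of docs in sorted order helps query processing, here is a short Python sketch that answers a multi-word AND query by merging posting lists. The index contents and function names are made up for the example.

```python
def intersect(postings_a, postings_b):
    """Merge two sorted posting lists, keeping doc IDs present in both."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

def search_all(index, words):
    """Doc IDs containing every query word; a word missing from the index matches nothing."""
    if not words:
        return []
    # Start with the shortest lists so intermediate results stay small.
    postings = sorted((index.get(word, []) for word in words), key=len)
    result = postings[0]
    for other in postings[1:]:
        result = intersect(result, other)
    return result

# Hypothetical inverted index: word -> sorted list of doc IDs.
index = {"brown": [1, 2, 5, 8], "fox": [2, 3, 8, 9], "lazy": [8]}
matches = search_all(index, ["brown", "fox"])
```

Because both lists are sorted, the merge walks each list once, so the cost grows with the combined length of the lists rather than with their product.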
One more difference:
Handling updates is expensive with an inverted index in comparison with a forward index.
A forward index handles updates easily, since the changes only affect the entry for the corresponding document, whereas in an inverted index the same change has to be reflected in multiple positions across the index.
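The difference in update cost can be seen in a small Python sketch that removes one document from both structures. The dictionaries and the function name are assumptions for illustration; the function returns how many posting lists had to be modified.

```python
def remove_document(forward, inverted, doc_id):
    """Delete doc_id from both indexes; return the number of posting lists touched."""
    # Forward index: a single entry is dropped.
    words = forward.pop(doc_id, [])
    # Inverted index: every posting list that mentions the document must change.
    for word in words:
        postings = inverted[word]
        postings.remove(doc_id)
        if not postings:
            del inverted[word]
    return len(words)

forward = {1: ["brown", "fox"], 2: ["brown", "dog"]}
inverted = {"brown": [1, 2], "fox": [1], "dog": [2]}
touched = remove_document(forward, inverted, 1)
```

Removing document 1 is one deletion in the forward index but touches two posting lists in the inverted index, and the gap grows with the number of distinct words in the document.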

What formula is used for building a list of related items in a tag-based system?

There are a lot of sites out there that use 'tags' to categorize items in their system. For example, YouTube uses keywords to categorize videos, Stack Overflow uses tags to categorize questions, etc.
What formulas do these sites use (especially SO) to build a list of items related to another item based on the tags it has? I'm building a system much like the one on SO and I'd like to find a way to generate a list of 20 items or so based on the tags of one item, but also make it spread enough so that each photo generates a vastly different list, and so that clicking an item in any given related list could eventually lead you to almost every item in the database.
The technical term for an organization based on user tags is a folksonomy. A google search for that term brings up a huge amount of material on how these systems are put together. A good place to start is the Wikipedia article.
I had to solve this exact problem for a contract a few years back, and the company was nice enough to let me blog about how I did it at http://bentilly.blogspot.com/2011/02/finding-related-items.html.
You'll note that if you get a decent volume of data then you'll really, really want to do this out of the database.
Similarity between items is often represented as the dot product between the vectors representing the items. So if you have a tag-based system, each tag will define one dimension. The vector for an item then becomes 1 in dimension i if tag i is set for this item (or higher numbers if you allow multiple tagging). If you calculate the dot product of the vectors of two items, you will get the similarity of those items (N.b. the vectors have to be normalized so that their length is 1).
Note that the dimensionality will get very large (several tens of thousands of tags are common). This sounds like a show stopper for this kind of thing. But you will also note that the vectors are really sparse, and multiple dot products become one big matrix multiplication of a sparse matrix with its own transpose. Using efficient algorithms for sparse matrix multiplication, this can be done relatively fast.
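For binary tag vectors, the normalized dot product reduces to the number of shared tags divided by the square root of the product of the two tag counts. Here is a minimal Python sketch that stores each item's tags as a set (a sparse representation) and returns the most similar items; the item IDs, tags and function names are made up, and ties are broken by item ID purely for determinism.

```python
import math

def tag_similarity(tags_a, tags_b):
    """Dot product of the two normalized 0/1 tag vectors (cosine similarity)."""
    if not tags_a or not tags_b:
        return 0.0
    return len(tags_a & tags_b) / math.sqrt(len(tags_a) * len(tags_b))

def related_items(items, item_id, limit=20):
    """IDs of the items most similar to item_id, best first, skipping zero overlap."""
    target = items[item_id]
    scored = [(tag_similarity(target, tags), other)
              for other, tags in items.items() if other != item_id]
    scored = [(score, other) for score, other in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:limit]]

# Hypothetical items and their tags.
items = {
    "q1": {"python", "search", "index"},
    "q2": {"python", "index"},
    "q3": {"search"},
    "q4": {"java"},
}
related = related_items(items, "q1")
```

This loops over all items for one query; the sparse matrix multiplication mentioned above is the batch version of the same computation for all pairs at once.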
Also note, that most systems do not only rely on tags, but rather on "user behavior" (whatever that means). I.e. for Youtube user behavior would be "Watching a video", "Subscribing to a channel", "looking for similar videos as video X" or "tagging video x with tag y".
I ended up using the following code (with different names), which finds all other items with at least one tag in common, and orders the results by number of common tags, descending, and subsorts by other criteria specific to my problem:
-- #WidgetID is a placeholder for the ID of the widget whose related items are wanted.
SELECT PT.WidgetID, COUNT(*) AS CommonTags,
       PS.OtherOrderingCriteria1, PS.OtherOrderingCriteria2, PS.OtherOrderingCriteria3, PS.Date
FROM WidgetTags PT
INNER JOIN WidgetStatistics PS ON PT.WidgetID = PS.WidgetID
WHERE PT.TagID IN (SELECT PTInner.TagID FROM WidgetTags PTInner WHERE PTInner.WidgetID = #WidgetID)
AND PT.WidgetID != #WidgetID
GROUP BY PT.WidgetID, PS.OtherOrderingCriteria1, PS.OtherOrderingCriteria2, PS.OtherOrderingCriteria3, PS.Date
ORDER BY CommonTags DESC, PS.OtherOrderingCriteria1 DESC, PS.OtherOrderingCriteria2 DESC, PS.OtherOrderingCriteria3 DESC, PS.Date DESC, PT.WidgetID DESC