Evaluation of ontology-based semantic search: document ranking, precision and recall (IR) - semantic-web

How can I evaluate an ontology-based semantic search system and rank the retrieved documents?
Since semantic search can retrieve documents with a similar meaning even when they do not contain the query keywords, I cannot simply use TF-IDF to compare the query with the documents and rank them, and precision and recall computed that way would not be accurate.
How should I evaluate ontology-based semantic search and rank the documents?

You should use data sets that serve as gold standards.
Relevance is assessed relative to an information need, not a query. For example, an information need might be:
Information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
This might be translated into a query such as:
wine and red and white and heart and attack and effective
A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.
Here is a list of the most standard test collections and evaluation series.
The Cranfield collection. This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness, but is nowadays too small for anything but the most elementary pilot experiments. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.
Text Retrieval Conference (TREC) . The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there have been many tracks over a range of different test collections, but the best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations between 1992 and 1999. In total, these test collections comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and specified in detailed text passages. Individual test collections are defined over different subsets of this data. The early TRECs each consisted of 50 information needs, evaluated over different but overlapping sets of documents. TRECs 6-8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. This is probably the best subcollection to use in future work, because it is the largest and the topics are more consistent. Because the test document collections are so large, there are no exhaustive relevance judgments. Rather, NIST assessors' relevance judgments are available only for the documents that were among the top $k$ returned for some system which was entered in the TREC evaluation for which the information need was developed.
In more recent years, NIST has done evaluations on larger document collections, including the 25 million page GOV2 web page collection. From the beginning, the NIST test document collections were orders of magnitude larger than anything available to researchers previously and GOV2 is now the largest Web collection easily available for research purposes. Nevertheless, the size of GOV2 is still more than 2 orders of magnitude smaller than the current size of the document collections indexed by the large web search companies.
NII Test Collections for IR Systems ( NTCIR ). The NTCIR project has built various test collections of similar sizes to the TREC collections, focusing on East Asian language and cross-language information retrieval , where queries are made in one language over a document collection containing documents in one or more other languages. See: http://research.nii.ac.jp/ntcir/data/data-en.html
Cross Language Evaluation Forum ( CLEF ). This evaluation series has concentrated on European languages and cross-language information retrieval. See: http://www.clef-campaign.org/
Reuters-21578 and Reuters-RCV1. For text classification, the most used test collection has been the Reuters-21578 collection of 21578 newswire articles; see Chapter 13, page 13.6. More recently, Reuters released the much larger Reuters Corpus Volume 1 (RCV1), consisting of 806,791 documents; see Chapter 4, page 4.2. Its scale and rich annotation make it a better basis for future research.
20 Newsgroups . This is another widely used text classification collection, collected by Ken Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category). After the removal of duplicate articles, as it is usually used, it contains 18941 articles.
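Given such a gold standard, precision, recall and average precision are computed purely from the relevance judgments for each information need, regardless of how the ranking was produced (TF-IDF, ontology-based, or otherwise), so they apply to semantic search as well. Here is a minimal sketch in Python; the document IDs and judgments are made up for illustration.

def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall of the top-k ranked documents against gold judgments."""
    retrieved = ranked_ids[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank that holds a relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

# Hypothetical ranked output of one system for one information need,
# plus the gold relevance judgments for that need.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d9", "d5"}
print(precision_recall_at_k(ranked, relevant, k=5))   # (0.4, 0.666...)
print(average_precision(ranked, relevant))            # (1/1 + 2/4) / 3 = 0.5

Averaging the average precision over all information needs in the collection gives mean average precision (MAP), the usual single-number summary reported for TREC-style evaluations.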

Related

Machine Learning text comparison model

I am creating a machine learning model that essentially returns the correctness of one text to another.
For example: “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how closely text 1 matches text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:
Emphasis on words occurring in many documents (say, 90% of your sentences/documents contain the conjunction word 'and') is much smaller, essentially giving more weight to the more document-specific phrasing (this is the IDF part).
Ordering in Term Frequency (TF) does not matter, as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation-oriented methods.
Big drawback: depending on the size of the corpus, your data may have too many dimensions (one dimension per unique word); you can use stemming/lemmatization to mitigate this problem to some degree.
You can calculate the similarity between two TF-IDF vectors using cosine similarity, for example.
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.
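As a concrete illustration of the TF-IDF plus cosine-similarity approach described above, here is a minimal sketch assuming scikit-learn; the two sentences are the ones from the question.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["the cat and a dog", "a dog and the cat"]

# TF-IDF is a bag-of-words model, so word order is ignored; dropping English
# stop words removes "a"/"the"/"and" outright, and IDF further down-weights
# terms that appear in most documents.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(texts)

# Cosine similarity between the two sentence vectors; here both reduce to
# {cat, dog}, so the score is 1.0.
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

If you want hand-picked "significant" words to count even more, one option is to scale the corresponding columns of the TF-IDF matrix before computing the similarity.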

Linking related topics IR

How can I link terms (keywords, entities) that have some relation among them across text documents? An example is Google: when you search for a person, it shows recommendations of other people related to that person.
In this picture it figured out spouse, presidential candidate, and equal designation.
I am using a frequency-count technique: the more often two terms occur in the same document, the more likely they are to have some relation. But this also links unrelated terms like page marks, verbs and page references in a text document.
How should I improve it, and is there any other easy but reliable technique?
You should look at a few techniques:
1.) Stop-word filtering: it is common in text mining to filter out words which are typically not very important because they are too frequent, like "the", "a", "is" and so on. There are predefined dictionaries for this.
2.) TF/IDF: TF/IDF re-weights words by how well they separate documents.
3.) Named Entity Recognition: for your task at hand it might be sufficient to just focus on the names. Named entity recognition can extract names from documents.
4.) Latent Dirichlet Allocation: LDA finds concepts in documents. A concept is a set of words which frequently appear together.
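A minimal sketch of points 1 and 2 combined, assuming scikit-learn: drop stop words, then count in how many documents each remaining pair of terms co-occurs, so that "the", page references and similar noise never enter the candidate pairs. The example documents are made up; points 3 and 4 would replace the raw vocabulary with extracted entities or LDA topics.

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; in practice these are your text files.
docs = [
    "Barack Obama and Michelle Obama attended the ceremony.",
    "Michelle Obama spoke about the campaign of Barack Obama.",
    "The page reference and the verb list are unrelated noise.",
]

# Stop-word filtering (point 1): frequent function words never become terms.
# binary=True counts a term at most once per document.
vectorizer = CountVectorizer(stop_words="english", binary=True)
X = vectorizer.fit_transform(docs)              # documents x terms
terms = vectorizer.get_feature_names_out()

# Co-occurrence counts: entry (i, j) = number of documents containing both terms.
cooc = (X.T @ X).toarray()

# Show the strongest pairs (i < j skips self-pairs and duplicates).
pairs = [(cooc[i, j], terms[i], terms[j])
         for i in range(len(terms)) for j in range(i + 1, len(terms))
         if cooc[i, j] > 0]
for count, a, b in sorted(pairs, reverse=True)[:5]:
    print(count, a, b)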

Elasticsearch - higher scoring if higher frequency of term

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.
Document A has ONLY 1 term in the "tags" field, and it's "Twitter".
Document B has 100 terms in the "tags" field, but 3 of them are "Twitter".
Elastic Search gives the higher score to Document A even though Document B has a higher frequency. But the score is "diluted" because it has more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?
I know ElasticSearch/Lucene performs some normalization based on the number of terms in the document. How can I disable this normalization, so that Document B gets a higher score above?
As the other answer says, it would be interesting to see whether you get the same result on a single shard. I think you would, and that it comes down to the norms for the tags field, which are taken into account when computing the score with the default tf/idf similarity.
In fact, Lucene does take into account the term frequency, in other words the number of times the term appears within the field (1 or 3 in your case), and the inverse document frequency, in other words how frequent the term is across the index, in order to weigh it against the other terms in the query (which makes no difference in your case, since you are searching for a single term).
But there's another factor, called norms, that rewards shorter fields and also takes into account any index-time boosting, which can be per field (in the mapping) or even per document. You can verify that norms are the reason for your result by enabling the explain option in your search request and looking at the explain output.
I guess the fact that the first document contains only that tag makes it more important than the other one, which contains that tag multiple times but a lot of other tags as well. If you don't like this behaviour you can just disable norms in your mapping for the tags field. Norms are enabled by default if the field is "index":"analyzed" (the default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (that usually makes sense for tags, but depends on your data and domain) or add the "omit_norms": true option to the mapping for your tags field.
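As a hedged sketch of that last option, this is roughly what the index-creation request could look like, sent here with the Python requests library; the cluster URL, index name and field name are made up, and the exact spelling of the setting depends on your Elasticsearch version (older versions use "omit_norms": true on a "string" field, newer ones use "norms": false on a "text" field).

import requests

# Mapping that keeps the tags field searchable but drops length normalization,
# so a short tags field no longer outranks a longer one with more matches.
mapping = {
    "mappings": {
        "properties": {
            "tags": {
                "type": "text",
                "norms": False,   # older versions: "omit_norms": True
            }
        }
    }
}

# Hypothetical local cluster and index name.
requests.put("http://localhost:9200/blog_posts", json=mapping)

After reindexing with this mapping, the field-norm factor should no longer appear in the explain output for the tags field.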
Are the documents found on different shards? From the Elasticsearch documentation:
"When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first execute the query against all shards and gather the relevant term frequencies, and then, based on it, execute the query."
The solution is to specify the search type. Use dfs_query_and_fetch search type to execute an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
You can read more here.
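As an illustration, the search type is passed as a query-string parameter on the search request. Here is a hedged sketch using the Python requests library, with a made-up cluster URL and index name; note that newer Elasticsearch versions accept only dfs_query_then_fetch.

import requests

# Match query for the original example; the dfs_* search type makes
# Elasticsearch gather term statistics from all shards before scoring.
query = {"query": {"match": {"tags": "Twitter"}}}
resp = requests.get(
    "http://localhost:9200/blog_posts/_search",
    params={"search_type": "dfs_query_and_fetch"},
    json=query,
)
print(resp.json())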

How did WordNet come into being

I wonder how the hierarchical relationships between words in WordNet were obtained.
Was that done manually or via computer techniques?
If based on computer techniques, what are they?
From the FAQ:
q.1.2 Where do you get the definitions for WordNet? (short answer) Our
lexicographers write them.
Where do you get the definitions for WordNet? (long answer) From the
foreword to WordNet: An Electronic Lexical Database, pp. xviii-xix:
People sometimes ask, "Where did you get your words?" We began in 1985
with the words in Kučera and Francis's Standard Corpus of Present-Day
Edited English (familiarly known as the Brown Corpus), principally
because they provided frequencies for the different parts of speech.
We were well launched into that list when Henry Kučera warned us that,
although he and Francis owned the Brown Corpus, the syntactic tagging
data had been sold to Houghton Mifflin. We therefore dropped our plan
to use their frequency counts (in 1988 Richard Beckwith developed a
polysemy index that we use instead). We also incorporated all the
adjective pairs that Charles Osgood had used to develop the semantic
differential. And since synonyms were critically important to us, we
looked words up in various thesauruses: for example, Laurence Urdang's
little "Basic Book of Synonyms and Antonyms" (1978), Urdang's revision
of Rodale's "The Synonym Finder" (1978), and Robert Chapman's 4th
edition of "Roget's International Thesaurus" (1977) -- in such works,
one word quickly leads on to others. Late in 1986 we received a list
of words compiled by Fred Chang at the Naval Personnel Research and
Development Center, which we compared with our own list; we were
dismayed to find only 15% overlap.
So Chang's list became input. And in 1993 we obtained the list of
39,143 words that Ralph Grishman and his colleagues at New York
University included in their common lexicon, COMLEX; this time we were
dismayed that WordNet contained only 74% of the COMLEX words. But that
list, too, became input. In short, a variety of sources have
contributed; we were not well disciplined in building our vocabulary.
The fact is that the English lexicon is very large, and we were lucky
that our sponsors were patient with us as we slowly crawled up the
mountain.

What formula is used for building a list of related items in a tag-based system?

There are a lot of sites out there that use 'tags' to categorize items in their system. For example, YouTube uses keywords to categorize videos, Stack Overflow uses tags to categorize questions, etc.
What formulas do these sites use (especially SO) to build a list of items related to another item based on the tags it has? I'm building a system much like the one on SO and I'd like to find a way to generate a list of 20 items or so based on the tags of one item, but also make it spread enough so that each photo generates a vastly different list, and so that clicking an item in any given related list could eventually lead you to almost every item in the database.
The technical term for an organization based on user tags is a folksonomy. A google search for that term brings up a huge amount of material on how these systems are put together. A good place to start is the Wikipedia article.
I had to solve this exact problem for a contract a few years back, and the company was nice enough to let me blog about how I did it at http://bentilly.blogspot.com/2011/02/finding-related-items.html.
You'll note that if you get a decent volume of data then you'll really, really want to do this outside of the database.
Similarity between items is often represented as the dot product between the vectors representing the items. So if you have a tag-based system, each tag defines one dimension. The vector for an item is then 1 in dimension i if tag i is set for this item (or a higher number if you allow multiple tagging). If you calculate the dot product of the vectors of two items you get the similarity of those items (N.B. the vectors have to be normalized so that their length is 1).
Note that the dimensionality will get very large (several tens of thousands of tags are common). This sounds like a show-stopper for this kind of thing. But note also that the vectors are really sparse, and computing all the dot products becomes one big multiplication of a sparse matrix with its own transpose. Using efficient algorithms for sparse matrix multiplication, this can be done relatively fast.
Also note that most systems do not rely only on tags, but also on "user behavior" (whatever that means). E.g. for YouTube, user behavior would be "watching a video", "subscribing to a channel", "looking for videos similar to video X" or "tagging video X with tag Y".
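A small sketch of the sparse dot-product idea above, assuming numpy, scipy and scikit-learn; the item/tag matrix is made up.

import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

# Rows = items, columns = tags; a 1 means the item carries that tag.
# Hypothetical 4 items over 5 tags; a real matrix has thousands of tag
# columns but is extremely sparse.
item_tags = sparse.csr_matrix(np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
], dtype=float))

# L2-normalize each row so the dot product of two rows is their cosine similarity.
normalized = normalize(item_tags)

# One multiplication of the sparse matrix with its own transpose yields
# all pairwise item similarities at once.
similarity = normalized.dot(normalized.T)
print(similarity.toarray().round(2))

For each item you would then take its row of this matrix, drop the item itself, and keep the top 20 entries as the related list.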
I ended up using the following code (with different names), which finds all other items with at least one tag in common, and orders the results by number of common tags, descending, and subsorts by other criteria specific to my problem:
-- #WidgetID is the item whose related list we are building.
SELECT PT.WidgetID,
       COUNT(*) AS CommonTags,
       PS.OtherOrderingCriteria1,
       PS.OtherOrderingCriteria2,
       PS.OtherOrderingCriteria3,
       PS.Date
FROM WidgetTags PT
INNER JOIN WidgetStatistics PS ON PT.WidgetID = PS.WidgetID
-- Keep only rows whose tag is also a tag of the source widget...
WHERE PT.TagID IN (SELECT PTInner.TagID
                   FROM WidgetTags PTInner
                   WHERE PTInner.WidgetID = #WidgetID)
  -- ...and exclude the source widget itself.
  AND PT.WidgetID != #WidgetID
GROUP BY PT.WidgetID,
         PS.OtherOrderingCriteria1,
         PS.OtherOrderingCriteria2,
         PS.OtherOrderingCriteria3,
         PS.Date
ORDER BY CommonTags DESC,
         PS.OtherOrderingCriteria1 DESC,
         PS.OtherOrderingCriteria2 DESC,
         PS.OtherOrderingCriteria3 DESC,
         PS.Date DESC,
         PT.WidgetID DESC