How to index a WEB TREC collection? - lucene

I've built a web TREC collection by downloading and parsing HTML pages myself. Each TREC file contains a Category field. How can I build an index with Lucene in order to search that collection? The idea is that the search should return categories as results rather than documents.
Thank you!

This should be a relatively simple task since you have them in HTML format. You could index them in Lucene like this (Java-based pseudocode):
foreach(file in htmlfiles)
{
    Document d = new Document();
    // Category is stored and indexed as a single, unanalyzed token
    d.add(new Field("Category", GetCategoryName(...), Field.Store.YES, Field.Index.NOT_ANALYZED));
    // Contents is tokenized by the analyzer for full-text search
    d.add(new Field("Contents", GetContents(...), Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(d);
}
// Close the writer once, after all documents have been added
writer.close();
GetCategoryName(...) should return the category string and GetContents(...) the contents of the corresponding HTML file. It would be a good idea to strip the HTML tags from the contents before indexing. There are several ways of doing it, HtmlParser being one.
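To illustrate what "parsing out the contents from the tags" means, here is a minimal sketch using only the JDK. The class HtmlText and its stripTags method are hypothetical names, not part of Lucene or HtmlParser. The regex approach is an assumption chosen to keep the example short: it does not decode entities such as &amp; and can be fooled by malformed markup, so a real HTML parser is the safer choice for a crawled collection.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene class): a naive, regex-based tag stripper.
final class HtmlText {
    // <script> and <style> elements together with their bodies.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag, e.g. <p>, </a>, <a href="...">.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Returns the visible text of an HTML string with tags removed.
    static String stripTags(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String noTags = TAG.matcher(noScripts).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Attribute values such as the href target do not reach the output.
        System.out.println(stripTags("<p>Hello <a href=\"x.html\">world</a></p>"));
    }
}
```

The returned string is what GetContents(...) would hand to the "Contents" field, so tag names and attribute names never become searchable terms.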
When you search, search the contents field and iterate through your search results to collect your Categories.
If you want to get a list of categories with counts attached ("facets") look into faceted search. Solr is a search server built using Lucene that provides this out of the box.
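The "iterate and collect" step can be sketched without any Lucene classes. In the example below, the input list stands in for the stored "Category" values read from each hit in rank order. CategoryCollector and countByCategory are hypothetical names used only for illustration, and reading the values out of the actual search results is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns the categories of ranked hits into per-category counts.
final class CategoryCollector {
    // categoriesOfHits: the stored "Category" value of each hit, best hit first.
    // The result keeps categories in order of first appearance, so the category
    // of the best-ranked hit comes first.
    static Map<String, Integer> countByCategory(List<String> categoriesOfHits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String category : categoriesOfHits) {
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("sports", "news", "sports");
        System.out.println(countByCategory(hits));
    }
}
```

This is a simple form of what faceted search computes. Counting over only the top N hits is cheap but approximate, whereas a faceting implementation counts over every matching document.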

Related

How to add another field to an existing Lucene index?

I have a lucene index that contains documents with the following fields: num(IntField), title(TextField,stored), contents(TextField,not stored)
I want to add a field to this index. I tried this (after finding the documentId; both the reader and the writer are open, and q is the query that I used to find documentId):
Document doc = indexreader.document(documentId);
doc.add(new TextField("terms",terms,Store.YES));
writer.deleteDocuments(q);
writer.addDocument(doc);
However, when I try to query the index for the newly edited document, I can't seem to find it.
Edit: it worked perfectly before I added the field, and it still works for other documents that I haven't edited.

Sitecore Lucene search - skip html tags

I create Lucene query this way:
BooleanQuery innerQuery = new BooleanQuery();
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields.ToArray<string>(), this.SearchIndex.Analyzer);
queryParser.SetDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.Parse(QueryParser.Escape(searchExpression.ToLowerInvariant()));
if (boost.HasValue)
{
query.SetBoost(boost.Value);
}
innerQuery.Add(query, BooleanClause.Occur.SHOULD);
The problem is that when a field contains an HTML tag, for example <a href.../>, and the search expression is "href", it returns this item. Can I somehow set it to skip searching inside "<>" tags?
This is actually an issue with the crawling process (i.e. what gets stored in the index) rather than the search query.
I see you're using Sitecore 6. Take a look at this pdf:
Sitecore 6.6 Search and Indexing
It has a section explaining how to make a crawler. This should allow you to parse the content however you like, so you can omit anything that's part of an HTML tag.

Using lucene to index a webpage

I'm trying to index webpages with lucene. Therefore, I'm using doc.add(new TextField("content", webPageContent, Store.YES)) where doc is the document about to be added to the index, and webPageContent is the string of the content of the webpage parsed with JSoup.
Is this the right way to do it, i.e., will Lucene compute the frequency of each token created from webPageContent?

How to read a document in lucene which is not stored but indexed

Hello, I have 3 fields (title, content, url). I created the index and added some documents:
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
doc.add(new TextField("content", title, Field.Store.YES));
doc.add(new StringField("url", isbn, Field.Store.NO));
w.addDocument(doc);
I can read the index using the index writer, iterate over the documents, and retrieve the title and content fields. How can I retrieve the url field?
You can search using the "url" field, but you cannot retrieve (display) it.
For example, Field.Store.NO is suitable for ID-like fields that you need only for finding documents, not for displaying.
Since you chose not to store it, I don't think you can. That's exactly what the "store" option is for (allowing you to retrieve more data than just the document id).
how can i receive the field url?
You can't. Field.Store.NO means Lucene takes this value and only uses it for indexing purposes so this document can be found if you query by matching url.
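The difference between indexing and storing can be shown with a toy model. This is not Lucene code: ToyIndex and its methods are hypothetical, and the two maps are a simplified stand-in for the inverted index and the stored-fields data. Values are matched exactly, similar in spirit to an unanalyzed StringField.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (not Lucene): every field is indexed, but only some are stored.
final class ToyIndex {
    // "field:value" -> ids of documents containing that value (used for searching).
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // Per document: only the values of fields added with store == true.
    private final List<Map<String, String>> stored = new ArrayList<>();

    int addDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    void addField(int docId, String name, String value, boolean store) {
        postings.computeIfAbsent(name + ":" + value, k -> new ArrayList<>()).add(docId);
        if (store) {
            stored.get(docId).put(name, value);
        }
    }

    // Finds documents by field value, whether or not the field was stored.
    List<Integer> search(String name, String value) {
        return postings.getOrDefault(name + ":" + value, new ArrayList<>());
    }

    // Returns the original value, or null if the field was not stored.
    String getStored(int docId, String name) {
        return stored.get(docId).get(name);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int id = index.addDocument();
        index.addField(id, "title", "Lucene", true);                  // like Field.Store.YES
        index.addField(id, "url", "http://example.com/a", false);     // like Field.Store.NO
        System.out.println(index.search("url", "http://example.com/a")); // the document is found
        System.out.println(index.getStored(id, "url"));                  // but the value is gone
    }
}
```

In this model, as in the answers above, the only way to display the url later is to add the field with storing enabled and reindex the document.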

How can I read a Lucene document field tokens after they are analyzed?

If I create a document and add a field that is both stored and analyzed, how can I then read this field back as a list of tokens? I have the following:
Document doc = new Document();
doc.add(new Field("url", fileName, Store.YES, Index.NOT_ANALYZED));
doc.add(new Field("text", fileContent, Store.YES, Index.ANALYZED));
// add the document to the index
writer.addDocument(doc);
So fileContent is a String containing a lot of text. It is analyzed, that is, tokenized, when it is added to the index. However, how can I get these tokens? I can retrieve the document from the index after it is stored, and I can read the "text" field from the document, but this is returned as a string. I would like to get the tokens if possible. My 'writer' is an IndexWriter instance and it uses a StandardAnalyzer. Any pointers would be very much welcomed.
Thank you very much
Check out document.getField("name").tokenStreamValue().
EDIT: Actually this question gives you the full solution using the above TokenStream.