Sitecore Lucene search - skip HTML tags

I create a Lucene query this way:
BooleanQuery innerQuery = new BooleanQuery();
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields.ToArray<string>(), this.SearchIndex.Analyzer);
queryParser.SetDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.Parse(QueryParser.Escape(searchExpression.ToLowerInvariant()));
if (boost.HasValue)
{
query.SetBoost(boost.Value);
}
innerQuery.Add(query, BooleanClause.Occur.SHOULD);
The problem is that when a field contains an HTML tag, for example <a href.../>, and the search expression is "href", the item is returned. Can I somehow configure the search to skip anything inside "<>" tags?

This is actually an issue with the crawling process (i.e. what gets stored in the index) rather than the search query.
I see you're using Sitecore 6. Take a look at this pdf:
Sitecore 6.6 Search and Indexing
It has a section explaining how to make a crawler. This should allow you to parse the content however you like, so you can omit anything that's part of an HTML tag.
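As a minimal sketch of the idea (this is plain Java, not Sitecore's crawler API), the field text could be passed through a tag stripper before it is added to the index. The class and method names are hypothetical, and the simple state machine below does not handle a '>' inside attribute values, comments or scripts; a real crawler would more likely rely on a proper HTML parser.

```java
public class HtmlTagStripper {
    // Hypothetical helper: drops everything between '<' and '>' so that
    // tag names and attributes (e.g. "href") never reach the index.
    public static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') {
                insideTag = true;
            } else if (c == '>' && insideTag) {
                insideTag = false;
                text.append(' '); // keep the words on both sides of a tag apart
            } else if (!insideTag) {
                text.append(c);
            }
        }
        // Collapse the whitespace runs left behind by removed tags.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints: Read the docs first
        System.out.println(stripTags("Read <a href=\"/docs\">the docs</a> first"));
    }
}
```

With the tags removed at indexing time, a search for "href" would no longer match the item, and the query code can stay as it is.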

Related

Indexing a document with content using solrj in EmbeddedSolrServer

I want to query an EmbeddedSolrServer instance with a filter query, the way it is normally done in the admin panel, but programmatically in Java. I know that I can call query.setQuery("*:*");, but that is not what I want when someone searches for a specific word in a document's content. I also found solrParams.add(CommonParams.QT, "*:*");, but it does not work. I think the problem may come from parsing the PDF document when I try to index it. So, does anyone know how to index a document using EmbeddedSolrServer exactly the same way it is indexed with post.jar on the command line?
Indexing a file is as easy as:
// solrHome and defaultCoreName identify the embedded core to use
EmbeddedSolrServer server = new EmbeddedSolrServer(solrHome, defaultCoreName);
// send the file to the extracting request handler
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(fileToIndex, "application/octet-stream");
req.setParam("commit", "true");
// literal.id sets the id field of the indexed document
req.setParam("literal.id", id);
NamedList<Object> namedList = server.request(req);
server.close();

Lucene request : stemming with FrenchAnalyzer and QueryParser or TermQuery

I need to perform requests with stemming functionality.
When the search term is "invention", both of these documents must be returned:
"Ils inventèrent le feu"
"L'invention est belle"
I use Lucene 6.2.1, and my code does the following:
The index is created with an IndexWriter configured with a FrenchAnalyzer.
The searched field is a stored text field.
The request is performed with a QueryParser configured with a FrenchAnalyzer.
Currently, documents are returned correctly if the search is "invent", but not with "invention". Am I missing something to perform a stemmed request?
Thank you
OK, the described method is correct.
Actually, "inventions" is stemmed into "invention" and "inventer" is stemmed into "invent". The noun and the verb therefore end up with different stems, which is what confused me.

How to index a WEB TREC collection?

I've built a WEB TREC collection by downloading and parsing HTML pages myself. Each TREC file contains a Category field. How can I build an index with Lucene in order to search that collection? The idea is that the search should return categories as results instead of documents.
Thank you!
This should be a relatively simple task since you have them in HTML format. You could index them in Lucene like this (Java-based pseudocode):
foreach(file in htmlfiles)
{
// one Lucene document per HTML file
Document d = new Document();
// store the category as a single, unanalyzed term
d.add(new Field("Category", GetCategoryName(...), Field.Store.YES, Field.Index.NOT_ANALYZED));
d.add(new Field("Contents", GetContents(...), Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(d);
}
// close the writer once, after all files have been added
writer.close();
GetCategoryName(...) should return the category string and GetContents(...) the contents of the corresponding HTML file. It would be a good idea to strip the tags from the HTML contents first; there are several ways of doing this, HtmlParser being one.
When you search, search the contents field and iterate through your search results to collect your Categories.
If you want to get a list of categories with counts attached ("facets") look into faceted search. Solr is a search server built using Lucene that provides this out of the box.
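The "collect your categories" step could look roughly like the sketch below. It assumes the stored Category value of each hit has already been read into a list of strings (the Lucene search call itself is left out), and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CategoryCounter {
    // Counts how many hits fall into each category, keeping the order in
    // which categories first appear (so the best-ranked hit comes first).
    public static Map<String, Integer> countCategories(List<String> hitCategories) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String category : hitCategories) {
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed input: the "Category" field of each search hit, in rank order.
        List<String> hits = Arrays.asList("Sports", "Politics", "Sports", "Science");
        // prints: {Sports=2, Politics=1, Science=1}
        System.out.println(countCategories(hits));
    }
}
```

This is only practical for a limited number of hits; for counts over the whole result set, faceting does the same job inside the search engine.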

match any part of a URL in lucene

At present I am using PrefixQuery. It works, but only for prefix matches: if my URL is
http://xyz.com then it finds http://xyz.com and http://xyz.com/service/...
but it cannot find http://www.xyz.com or http://xyz.co.in. I want to search based on any part of the URL. My code is:
Term term = new Term("URL", siteUrl.toLowerCase());
Query query1 = new PrefixQuery(term);
booleanQuery.add(query1,BooleanClause.Occur.MUST);
You can use a WildcardQuery. But you need to know that it has bad performance, especially with queries with a leading wildcard (not because it has been poorly implemented but because of how Lucene internally stores its term dictionary).
Can't your use-case be solved by using a custom analyzer?
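To illustrate what a custom analyzer could do here, the sketch below splits a URL into lowercase parts in plain Java. It is not a Lucene Analyzer, and the class name and the splitting rule are assumptions. If the URL field were indexed as such tokens, a plain term query for "xyz" would match all of the example URLs without needing a leading wildcard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class UrlTokens {
    // Splits a URL on every character that is not an ASCII letter or digit,
    // so "http://www.xyz.com/service" yields [http, www, xyz, com, service].
    public static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String part : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] urls = {"http://xyz.com", "http://www.xyz.com", "http://xyz.co.in"};
        for (String url : urls) {
            System.out.println(url + " -> " + tokenize(url));
        }
    }
}
```

All three example URLs produce the token "xyz", which is what makes an exact term match possible.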

Lucene.net - how to query a path field with numeric sections?

I've created an index which indexes the event items in different sections of a website.
These items are on the website in a structure like this:
/Start/Section1/Events/2011/12/25/X-mas
/Start/Section2/Events/2012/01/01/New-years-day
These paths are stored in the field path in the index.
On the start page I need an overview of the events from all the different sections.
When I'm in a section I only need the events placed under that section.
I add a booleanquery like this:
QueryParser queryParser = new QueryParser("path", analyzer);
Query query = queryParser.Parse(startPath);
completeQuery.Add(query, BooleanClause.Occur.MUST);
"path" is a field that is added through a custom index script.
To retrieve the items for the start page, I would search my index using:
string startPath = "/Start";
This normally gives me all items where the path starts with "/Start".
To retrieve the items for section 1, I would search my index using:
string startPath = "/Start/Section1/Events";
This normally gives me all items where the path starts with "/Start/Section1/Events".
I've implemented this solution for news items and that works fine. For event items it does not.
When I search my index it returns no hits. The problem is that the last three folder names are numeric.
When I rename the folders (e.g. 2011, 12, 25) to text (two-thousand, twelve, twenty-five), it DOES return hits.
How can I get my index to return results keeping my folder names numeric?
Use a CharTokenizer for your path, and have IsTokenChar(char c) return false for the /.
This way you'll be sure each part of your path is an individual Token.
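The effect of such a tokenizer can be sketched in plain Java. This is not the Lucene.Net CharTokenizer itself, and the class name is hypothetical. Every character except '/' counts as a token character, so numeric folder names such as 2011 become tokens like any other path segment.

```java
import java.util.ArrayList;
import java.util.List;

public class PathTokens {
    // Mirrors the suggested rule: '/' is the only non-token character.
    static boolean isTokenChar(char c) {
        return c != '/';
    }

    public static List<String> tokenize(String path) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : path.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // a '/' closes the current token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // the last segment has no trailing '/'
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: [Start, Section1, Events, 2011, 12, 25, X-mas]
        System.out.println(tokenize("/Start/Section1/Events/2011/12/25/X-mas"));
    }
}
```

The query path would presumably need to go through the same splitting so that its tokens line up with the indexed ones.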