Lucene.net - how to query a path field with numeric sections? - indexing

I've created an index which indexes the event items in different sections of a website.
These items are on the website in a structure like this:
/Start/Section1/Events/2011/12/25/X-mas
/Start/Section2/Events/2012/01/01/New-years-day
These paths are stored in the field path in the index.
On the start page I need an overview of the events from all the different sections.
When I'm in a section I only need the events placed under that section.
I add a clause to a BooleanQuery like this:
QueryParser queryParser = new QueryParser("path", analyzer);
Query query = queryParser.Parse(startPath);
completeQuery.Add(query, BooleanClause.Occur.MUST);
"path" is a field that is added through a custom index script.
To retrieve the items for the start page I would search my index using:
string startPath = "/Start";
This normally gives me all items where the path starts with "/Start".
To retrieve the items for Section1 I would search my index using:
string startPath = "/Start/Section1/Events";
This normally gives me all items where the path starts with "/Start/Section1/Events".
I've implemented this solution for news items and that works fine. For event items it does not.
When I search my index it returns no hits. The problem is that the last three folder names are numeric.
When I rename the folders (e.g. 2011, 12, 25) to text (two-thousand, twelve, twenty-five) it DOES return hits.
How can I get my index to return results keeping my folder names numeric?

Use a CharTokenizer for your path, and have IsTokenChar(char c) return false for the '/' character.
This way you'll be sure each part of your path is an individual Token.
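To make the effect of that rule concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: the class PathTokenSketch and its methods are hypothetical names made up for this illustration, and it only mimics the idea of a character-based tokenizer whose token test rejects '/'. It shows that the numeric folder names come out as ordinary tokens, just like the text ones.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is NOT Lucene's CharTokenizer.
// Class and method names are hypothetical.
final class PathTokenSketch {

    // Mirrors the rule from the answer: '/' is never part of a token.
    static boolean isTokenChar(char c) {
        return c != '/';
    }

    // Collects maximal runs of token characters, the way a
    // character-based tokenizer would emit them.
    static List<String> tokenize(String path) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : path.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A '/' ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [Start, Section1, Events, 2011, 12, 25, X-mas]
        System.out.println(tokenize("/Start/Section1/Events/2011/12/25/X-mas"));
    }
}
```

In a real index the same tokenization would presumably have to be applied both when indexing the path field and when parsing the query, otherwise the tokens will not line up.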

Related

Extract portion of HTML from website?

I'm trying to use VBA in Excel to navigate a site with Internet Explorer and download an Excel file for each day.
After looking through the HTML code of the site, it looks like each day's page has a similar structure, but there's a portion of the website link that seems completely random. However, this random-looking part is fixed for a given page: it does not change each time you load that page.
The following portion of the HTML code contains the unique string:
<a href="#" onClick="showZoomIn('222698519','b1a9134c02c5db3c79e649b7adf8982d', event);return false;
The part starting with "b1a" is what is used in the website link. Is there any way to extract this part of the page and assign it as a variable that I then can use to build my website link?
Since you don't show your code, I will answer in general terms as well:
1) You get all the elements of type link (<a>) with Set allLinks = ie.document.getElementsByTagName("a"). It will be a collection of n elements containing all the links found in the document.
2) You identify the link containing the information you want. Let's imagine it's the 4th one (you can inspect the properties of each link to check which one it is, in case its position is dynamic):
Set myLink = allLinks(3) '<- 4th : index = 3 (starts from zero)
3) You get your token with a simple split function:
myToken = Split(myLink.onClick, "'")(3)
Of course you can be more concise if the position of the link containing the token is always the same, e.g. always the 4th link:
myToken = Split(ie.document.getElementsByTagName("a")(3).onClick,"'")(3)
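The split step itself can be checked in isolation. The sketch below is plain Java rather than VBA, and the class name OnClickTokenSketch and the method extractToken are hypothetical; it just applies the same "split on the single quote and take index 3" logic to the onClick text from the question, with a guard for strings that do not have the expected shape.

```java
// Plain-Java illustration of the Split(..., "'")(3) step; names are hypothetical.
final class OnClickTokenSketch {

    // Returns the text between the 3rd and 4th single quote,
    // i.e. the second quoted argument of the showZoomIn(...) call.
    static String extractToken(String onClick) {
        String[] parts = onClick.split("'");
        if (parts.length <= 3) {
            throw new IllegalArgumentException("Unexpected onClick format: " + onClick);
        }
        return parts[3];
    }

    public static void main(String[] args) {
        String onClick =
            "showZoomIn('222698519','b1a9134c02c5db3c79e649b7adf8982d', event);return false;";
        // Prints: b1a9134c02c5db3c79e649b7adf8982d
        System.out.println(extractToken(onClick));
    }
}
```

Splitting on the quote character gives the pieces "showZoomIn(", "222698519", ",", and then the token, which is why index 3 is the one to pick.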

Replace url that contains a unique id

I am trying to write SQL to replace a certain URL with a new URL wherever it occurs in a field. The problem is that part of the URL contains a unique id which I need to replace too.
E.g. the field content contains the following:
<ul><li>Identity Verification</li>
<li>Personal Data</li>
<li>Professional Check</li></ul>
when updated I would like it to become:
<ul><li>Identity Verification</li>
<li>Personal Data</li>
<li>Professional Check</li></ul>
I need to replace the first part of each URL, up to the document name, with http://domain2/, but this needs to be done for each of the URLs and they all contain unique strings. Is there any way to do this using the REPLACE function?
For example:
Update Table
set content = replace(content, url,newurl)
where content like '%http://domain1/d/d/workspace/SpacesStore/%'

How to add another field to an existing Lucene index?

I have a Lucene index that contains documents with the following fields: num (IntField), title (TextField, stored), contents (TextField, not stored).
I want to add a field to this index. I tried this (after finding the documentId; both the reader and the writer are open, and q is the query that I used to find documentId):
Document doc = indexreader.document(documentId);
doc.add(new TextField("terms",terms,Store.YES));
writer.deleteDocuments(q);
writer.addDocument(doc);
However, when I try to query the index for the newly edited document, I can't seem to find it.
Edit: it worked perfectly before I added the field, and it still works for other documents that I haven't edited.

Sitecore Lucene search - skip html tags

I create the Lucene query this way:
BooleanQuery innerQuery = new BooleanQuery();
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields.ToArray<string>(), this.SearchIndex.Analyzer);
queryParser.SetDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.Parse(QueryParser.Escape(searchExpression.ToLowerInvariant()));
if (boost.HasValue)
{
    query.SetBoost(boost.Value);
}
innerQuery.Add(query, BooleanClause.Occur.SHOULD);
The problem is that when a field contains an HTML tag, for example <a href.../>, and the search expression is "href", the query returns this item. Can I somehow make it skip anything inside "<>" tags?
This is actually an issue with the crawling process (i.e. what gets stored in the index) rather than the search query.
I see you're using Sitecore 6. Take a look at this PDF:
Sitecore 6.6 Search and Indexing
It has a section explaining how to make a crawler. This should allow you to parse the content however you like, so you can omit anything that's part of an HTML tag.
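As a rough idea of what "omit anything that's part of an HTML tag" could look like before the text reaches the index, here is a minimal plain-Java sketch. It is not Sitecore or Lucene code: the class TagStripSketch and the method stripTags are hypothetical, and the regular expression is a simplification that assumes reasonably well-formed markup (it does not handle comments, script blocks, or a '>' inside an attribute value, where a real HTML parser would be the safer choice).

```java
import java.util.regex.Pattern;

// Plain-Java illustration only; not part of any Sitecore or Lucene API.
final class TagStripSketch {

    // Anything from '<' up to the next '>' is treated as a tag (simplification).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Replaces each tag with a space so adjacent words stay separated,
    // then collapses runs of whitespace.
    static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return SPACES.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Prints: See the news page
        System.out.println(stripTags("<p>See <a href=\"/news\">the news</a> page</p>"));
    }
}
```

With the tags removed before indexing, attribute names such as href never become searchable terms, so a search for "href" would only match items whose visible text contains that word.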

Nested (Chained) query in Lucene

I have the following document structure:
Item: {ItemId (string), Flag (bool), Type ("Item")}
SubItem: {ItemId (string), Text (string), Type ("SubItem")}
I need to get all Items that have Flag=true and at least one SubItem whose Text contains the term "term".
I can easily get the list of Items where any SubItem Text contains the term by using DuplicateFilter, but how do I also filter by Flag? I tried to create a BooleanQuery, but that doesn't work well because the number of Items is large.
I strongly recommend taking a look at BlockJoinQuery in Lucene.
A very good starting point is http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html