Searching for href values with Lucene (Examine in Umbraco)? - lucene

I want to search for a href value with lucene/examine - more precise the 'locallink' value. Examine is straight out-of-the-box standard config.
I have the following snippet which does not return any results;
string searchQuery = "localLink:" + id;
UmbracoHelper helper = new UmbracoHelper(UmbracoContext.Current);
foreach (var result in helper.Search(searchQuery, false))
{
// Do something
}
Upon inspection of the index via Developer > Examine Management (in Umbraco backend), I can see that the index does contain the value I am trying to search for but under a "_Raw" property. So I guess the question is, how I can make my search, search in these fields also?

You made search with UmbracoHelper.
Try to use Examine Searcher as described in docs :
var searcher = ExamineManager.Instance.SearchProviderCollection["WebsiteSearcher"];
var searchCriteria = searcher.CreateSearchCriteria(BooleanOperation.Or);
var searchResults = searcher.Search(query);
http://our.umbraco.org/documentation/Reference/Searching/Examine/
http://umbraco.com/follow-us/blog-archive/2011/9/16/examining-examine.aspx

Related

After Lucene search, get character offsets of all matched words in document? (not just preview snippet)

I am creating a search engine for a large number of HTML documents using lucene.
I know I can use PostingsHighlighter and friends to show snippets, with bold words, similar to Google Search results, also similar to this random lucene-based example.
However, unlike these examples, I need a solution that preserves highlighted words, even after the matched document is opened by the user, similar to Google Books.
Some words are hyphenated, in the form <div> ... an inter-</div><div...>national audience ...</div> I am thinking I need to convert these to plain text first, and write some code to merge words that were hyphenated, before I send them to lucene.
Once the resulting document is opened by the user, I'm hoping that I can use lucene to get character offsets of each matched word in the document.
I will have to cross-reference the offsets in the plain text back to the original HTML, and write code to highlight <b> the words based on said offsets.
<div> ... an <b>inter-</b></div><div...><b>national</b> audience ...</div>
How can I get what I need from lucene? Surely I don't have to write my own search for this 'final inch'?
OK, I figured out something I can get started with. :)
To index:
StandardAnalyzer analyzer - new StandardAnalyzer()
Directory index = FSDirectory.open(new File("...").toPath());
IndexWriterConfig config = new IndexWriterConfig(analyzer);
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
// documents need to be read from the data source..
// only add once, or else your docs will be duplicated as you continue to use the system
writer.close();
specify offsets to store for highlighting
private static final FieldType typeOffsets;
static {
typeOffsets = new FieldType(textField.TYPE_STORED);
typeOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
}
method addDoc
void addDoc(IndexWriter writer, String title, String body) {
Document doc = new Document();
doc.add(new Field("title", body, typeOffsets));
doc.add(new Field("body", body, typeOffsets));
// you can also add an store a TextField that does not have offsets,
// like a file ID that you wouldn't search on, just need to reference original doc.
writer.addDocument(doc);
}
Perform your first search
String q = "...";
String[] fields = new String[] {"title", "body"};
QueryParser parser = new MultiFieldQueryParser(fields, analyzer)
Query query = parser.parse(q)
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10, Sort.RELEVANCE);
Get highlighted snippets with highlighter.highlightFields(fields, query, searcher, topDocs). You can iterate over the results.
When you want to highlight the end document (i.e. after the search is completed and user selected the result), use this solution (needs minor edits). It works by using NullFragmenter to turn the whole thing into one snippet.
public static String highlight(String pText, String pQuery) throws Exception
{
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(parser.parse(pQuery)));
highlighter.setTextFragmenter(new NullFragmenter());
String text = highlighter.getBestFragment(analyzer, "", pText);
if (text != null)
{
return text;
}
return pText;
}
Edit: You can actually use PostingsHighlighter for this last step instead of Highlighter, but you have to override getBreakIterator, and then override your BreakIterator so that it thinks the whole document is one sentance.
Edit: You can override getFormatter to capture the offsets, rather than trying to parse the <b> tags normally output by PostingsHighlighter.

Ravendb retrieve all documents in collection for reporting

I need to retrieve all documents from a collection to dump it to an Excel file.
Using this seems to work
var luceneQuery = Session.Advanced.LuceneQuery<Test.ReduceResult>("Test/ByTestData");
var enumerator = Session.Advanced.Stream(luceneQuery);
var obj = new List<Test.ReduceResult>();
while (enumerator.MoveNext())
{
obj.Add(enumerator.Current.Document);
}
This gives me all the results in the index.
But instead of index I want to retrieve all the documents in a collection (as the index does not contain all the information stored in the document).
How can this be done?
Change your query to be:
var luceneQuery = Session.Advanced.LuceneQuery<Test.ReduceResult>("Raven/DocumentsByEntityName")
.WhereEquals("Tag", "Customers");
This will give you all customers.

Fastest way to get the properties of umbraco documents

I am working on search functionality for a website designed through umbraco. I am using Examine to fetch the search results. Here is my code:
var Searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = Searcher.CreateSearchCriteria(BooleanOperation.Or);
var query = searchCriteria.Field("tags", searchTerm.Fuzzy(0.5f)).Compile();
var searchResults = Searcher.Search(query);
With this method i can only get the nodes in which the search term belongs.But I want to directly fetch the whole value from the property.
I want to know what is the fastest way to fetch all the values from the same property in all the nodes.
I have finally managed to get the values directly from the property.This is the code I have used:
List<string> nodesList = new List<string>();
var Searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = Searcher.CreateSearchCriteria(BooleanOperation.Or);
var query = searchCriteria.Field("tags", queryString.Fuzzy(0.5f)).Compile();
var searchResults = Searcher.Search(query);
foreach (var item in searchResults)
{
string paths = ((Examine.SearchResult)item).Fields["tags"];
nodesList.Add(paths);
}
using ((Examine.SearchResult)item).Fields["tags"] gets the property value directly.
If you want to costumize your search you need to define a new index set with the properties that you actually want to search on the /config/examineIndex.config file.
Quite well explained in this post.

Katta docId to Document

How to use FieldCache in Katta, FieldCache expects IndexReader as arguments, then how to get IndexReader from Katta API. And In katta the search method in LuceneClient.java returns Hits.
From this I can get List, from that I can able to get each hit's docId, but I need particular field value of the docId in Katta. Please give me some coding example.
I've never worked with Katta, I worked with Solr and if I had to get document by its id and I had to use only Lucene classes, I'd use org.apache.lucene.search.IndexSearcher:
// when you figure out how to get IndexReader using Katta API, you'll be able to get the searcher
IndexSearcher searcher = new IndexSearcher(indexReader);
org.apache.lucene.document.Document doc = searcher.doc(docId);
String yourFieldValue = doc.get("yourFieldName");
you can't use the FieldCache on client side, since the IndexReader is located on the server side!
But you can get field-values through the getDetails() method on LuceneClient.
final Hits hits = client.search(query, new String[] { INDEX_NAME }, 10);
for (final Hit hit : hits.getHits()) {
final MapWritable details = client.getDetails(hit, new String[] { "path" });
details.get(new Text("path"));
HTH
Johannes

Lucene.Net Search result to highlight search keywords

I use Lucene.Net to index some documents. I want to show the user a couple of lines as to why that document is in the result set. just like when you use google to search and it shows the link and followed by the link there are a few lines with the keywords highlighted.
any ideas?
When you have a result you can get the indexed text pass it along with your query through a method similar to this:
public string GeneratePreviewText(Query q, string text)
{
QueryScorer scorer = new QueryScorer(q);
Formatter formatter = new SimpleHTMLFormatter(highlightStartTag, highlightEndTag);
Highlighter highlighter = new Highlighter(formatter, scorer);
highlighter.SetTextFragmenter(new SimpleFragmenter(fragmentLength));
TokenStream stream = new StandardAnalyzer().TokenStream(new StringReader(text));
return highlighter.GetBestFragments(stream, text, fragmentCount, fragmentSeparator);
}