Lucene search getting strange facet results for a text field

I'm using Sitecore with Lucene, and I'm trying to facet on an integer field, so that I can get all of the existing values for that field. I have the following search result class with a definition for the field:
public class ContentTypeSearchResultItem : Sitecore.ContentSearch.SearchTypes.SearchResultItem
{
[Sitecore.ContentSearch.IndexField("crop_heat_units")]
public int CropHeatUnits { get; set; }
}
In my query, I have:
query = query.FacetOn(x => x.CropHeatUnits);
I have a number of other facets of type ID or IEnumerable<Guid> and these work as I expect, but the crop_heat_units string facet is giving me weird results, such as chufacet.Values[0].Name = \u0001\0\0\0\0\0\0\0\u000e\b. Some of the other values are #\b\0\0\0\0 and 8\u0010\0\0\0\0\0.
In Sitecore, the values of the Crop Heat Units field are things like "2075" and "2200".

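Each numeric value is indexed as a trie structure within Lucene, where each term is logically assigned to larger and larger predefined brackets, which are simply lower-precision representations of the value. Faceting on such a field returns these encoded terms rather than the original numbers, which is why the facet names look like binary data.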
So, the simple solution is to change int to string for your CropHeatUnits field definition and remove it from the fieldmap. Then your queries and facets should work as expected. If you want to use the CropHeatUnits values as integers then you will need to cast their string values to integer after their retrieval from Lucene.
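To illustrate the last step (casting the string facet names back to integers after retrieval), here is a minimal sketch in Python rather than C#. The function name and the sample (name, count) pairs are hypothetical and not part of the Sitecore API; they only stand in for whatever the facet result returns once the field is indexed as a string.

```python
def facet_counts_as_ints(facets):
    """Convert (name, count) pairs with string names into int-keyed pairs.

    Sorting after the conversion gives numeric order (950 before 2075),
    which a plain string sort would not.
    """
    converted = [(int(name), count) for name, count in facets]
    return sorted(converted)


# Hypothetical facet output for a string-indexed crop_heat_units field.
sample = [("2200", 3), ("2075", 5), ("950", 1)]
print(facet_counts_as_ints(sample))
```

The same idea in C# would be an int.Parse over the facet value names.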

Related

Cloudant - Lucene range search using numbers stored as text

I have a number of documents in Cloudant, that have ID field of type string. ID can be a simple string, like "aaa", "bbb" or number stored as text, e.g. "111", "222", etc. I need to be able to full text search using the above field, but I encountered some problems.
Assuming that I have two documents, having ID="aaa" and ID="111", then searching with query:
ID:aaa
ID:"aaa"
ID:[aaa TO zzz]
ID:["aaa" TO "zzz"]
returns first document, as expected
ID:111
returns nothing, but
ID:"111"
returns second document, so at least there is a way to retrieve it.
Unfortunately, when searching for range:
ID:[111 TO 999]
ID:["111" TO "999"]
I get no results, and I have no idea what to do to get around this problem. Is there any special syntax for such case?
UPDATE:
Index function:
function(doc){
if(!doc.ID) return;
index("ID", doc.ID, { index:'not_analyzed_no_norms', store:true });
}
Changing index to analyzed doesn't help. Analyzer itself is keyword, but changing to standard doesn't help either.
UPDATE 2
Just to add some more context, because I think I missed one key point. The field I'm indexing will be searched using ranges, and both the min and max values can be provided by the user. So it is possible that one of them will be a number stored as a string, while the other will be ordinary non-numeric text. For example, search for all documents where ID >= "11" and ID <= "foo".
Assuming that the database contains documents with ID "1", "5", "alpha", "beta", "gamma", this query should return "5", "alpha", "beta". Please note that "5" should actually be returned, because the string "5" is greater than the string "11".
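The expected result above follows from plain lexicographic string comparison. A small Python sketch (not Cloudant code; the function name is made up for illustration) reproduces it, assuming ASCII-only IDs so that Python's code point ordering matches the ordering of the indexed terms:

```python
def ids_in_string_range(ids, low, high):
    """Return the IDs that satisfy low <= id <= high as strings."""
    return [doc_id for doc_id in sorted(ids) if low <= doc_id <= high]


ids = ["1", "5", "alpha", "beta", "gamma"]

# "1" is excluded because it is a prefix of "11" and so sorts before it;
# "gamma" is excluded because "g" sorts after "f".
print(ids_in_string_range(ids, "11", "foo"))
```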
Our team came up with a workaround. We managed to get proper results by appending an arbitrary character, e.g. 'a', to the upper range value, and by adding a second search term that excludes documents whose ID lies between the upper range value and the upper range value + 'a'.
When searching for a range
ID:[X TO Y]
actual query would be
(ID:[X TO Ya] AND -ID:{Y TO Ya])
For example, to find documents having an ID between 23 and 758, we execute
(ID:[23 TO 758a] AND -ID:{758 TO 758a]).
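As a sketch of how the rewritten query could be generated from user input, the helper below builds the same string in Python. The function name and parameters are hypothetical, and it assumes the values contain no characters that need escaping in the query syntax.

```python
def padded_range_query(field, low, high, pad="a"):
    """Build the workaround query: an inclusive range up to high + pad,
    minus the documents in the half-open interval (high, high + pad]."""
    upper = high + pad
    inclusive = field + ":[" + low + " TO " + upper + "]"
    exclusion = "-" + field + ":{" + high + " TO " + upper + "]"
    return "(" + inclusive + " AND " + exclusion + ")"


print(padded_range_query("ID", "23", "758"))
```

Because the padded upper bound is no longer purely numeric, the range is presumably treated as a string range, which is what the workaround relies on.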
First of all, I would suggest using the keyword analyzer, so you can control tokenization during both indexing and search.
"analyzer": "keyword",
"index": "function(doc){\n if(!doc.ID) return;\n index(\"ID\", doc.ID, {store:true });\n}"
To retrieve your document with ID "111", use the following range query:
curl -X GET "http://.../facetrangetest/_design/ddoc/_search/f?q=ID:\[111%20TO%20A\]"
If you use the query q=ID:\[111%20TO%20999\], Cloudant search sees numbers on both sides of the range and interprets it as a NumericRangeQuery. Since your ID of "111" is a string, it will not be part of the results returned. Including a string in the query, as in [111%20TO%20A], makes Cloudant interpret it as a range query on strings.
You can get both docs returned like this:
q=ID:["111" TO "CCC"]
Here's a working live example:
https://rajsingh.cloudant.com/facetrangetest/_design/ddoc/_search/f?q=ID:[%22111%22%20TO%20%22CCC%22]
I found something quirky. It seems that range queries on strings only work if at least one of the range values is a string. Querying on ID:["111" TO "555"] doesn't return anything either, so maybe this is resolving to a numeric query somehow? Could be a bug.
This could also be achieved using regular expressions in queries. Something like this:
curl -X POST "https://.../facetrangetest/_design/ddoc/_search/f" -d '{"q":"ID:/<23-758>/"}' | jq .
This regular expression retrieves all documents whose ID field is between 23 and 758. The slashes / / enclose the regular expression, and the interval is enclosed in <>.

How can you boost documents by recency in RavenDB?

Is it possible to boost recent documents in a RavenDB query?
This question is exactly what I want to do but refers to native Lucene, not RavenDB.
For example, if I have a Document like this
public class Document
{
public string Title { get; set; }
public DateTime DateCreated { get; set; }
}
How can I boost documents whose dates are closer to a given date, e.g. DateTime.UtcNow?
I do not want to OrderByDescending(x => x.DateCreated) as there are other search parameters that need to affect the results.
You can boost during indexing. It's been in RavenDB for quite some time, but it's not in the documentation at all. However, there are some unit tests that illustrate it here.
Those tests show a single boost value, but it can easily be calculated from other document values instead. You have the full document available to you since this is done when the index entries are written. You should be able to combine this with the technique described in the post you referenced.
Map = docs => from doc in docs
select new
{
Title = doc.Title.Boost(doc.DateCreated.Ticks / 1000000f)
};
You could also boost the entire document instead of just the Title field, which might be useful if you have other fields in your search algorithm:
Map = docs => from doc in docs
select new
{
doc.Title
}.Boost(doc.DateCreated.Ticks / 1000000f);
You may need to experiment with the right value to use for the boost amount. There are 10,000 ticks in a millisecond, which is why I divide by such a large number.
Also, be careful that the DateTime you're working with is in UTC, or if you don't have control over where it comes from, use a DateTimeOffset instead. Why? Because you're using a calculated duration from some reference point, and you don't want the result to be ambiguous for different time zones or around daylight saving time changes.
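To get a feel for the numbers before picking a divisor, here is a rough Python sketch of the same arithmetic (not RavenDB code). It assumes naive datetimes that are already in UTC and mirrors .NET ticks, which count 100-nanosecond intervals since 0001-01-01; the function names are made up for illustration.

```python
from datetime import datetime

TICKS_PER_SECOND = 10_000_000      # one .NET tick is 100 nanoseconds
DOTNET_EPOCH = datetime(1, 1, 1)   # DateTime.MinValue


def dotnet_ticks(utc_dt):
    """Ticks since 0001-01-01 for a naive datetime assumed to be UTC."""
    delta = utc_dt - DOTNET_EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * TICKS_PER_SECOND + delta.microseconds * 10


def recency_boost(utc_dt, divisor=1_000_000):
    """Boost value computed the same way as Ticks / 1000000f above."""
    return dotnet_ticks(utc_dt) / divisor


older = recency_boost(datetime(2020, 1, 1))
newer = recency_boost(datetime(2021, 1, 1))

# Ratio between the boosts of two documents created one year apart.
print(newer / older)
```

The ratio comes out only slightly above 1 (well under 0.1% difference), because ticks are counted from year 1. That is one reason to experiment: subtracting a more recent reference date before dividing would spread the boost values further apart.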

Find all Lucene documents having a certain field

I want to find all documents in the index that have a certain field, regardless of the field's value. If at all possible using the query language, not the API.
Is there a way?
If you know the type of data stored in your field, you can try a range query. For example, if your field contains string data, a query like field:[a* TO z*] would return all documents where there is a string value in that field.
I've done some experimenting, and it seems the simplest way to achieve this is to create a QueryParser and call SetAllowLeadingWildcard( true ) and search for field:* like so:
var qp = new QueryParser( Lucene.Net.Util.Version.LUCENE_29, field, analyzer );
qp.SetAllowLeadingWildcard( true );
var query = qp.Parse( "*" );
(Note I am setting the default field of the QueryParser to field in its constructor, hence the search for just "*" in Parse()).
I cannot vouch for how efficient this method is over other methods, but being the simplest method I can find, I would expect it to be at least as efficient as field:[* TO *], and it avoids having to do hackish things like field:[0* TO z*], which may not account for all possible values, such as values starting with non-alphanumeric characters.
Another solution is using a ConstantScoreQuery with a FieldValueFilter
new ConstantScoreQuery(new FieldValueFilter("field"))

Good way to format data in a large DataTable

I have a large data.DataTable and some formatting rules to apply. I'm sure this is not a unique problem.
For example, the LASTNAME column has a value of "Jones" but my formatting rule requires it to be 10 characters, padded with spaces on the right and uppercase only. Like: "JONES     "
My initial thought is to loop through each row and generate a string. But, I wonder if I could accomplish this more efficiently with a DataView, LINQ or something else.
Can someone point me in a direction?
It really depends on how you display the results. If you display them in a grid, the easiest approach is a quick loop; there is no real performance harm in doing that over a DataTable.
If you display the records individually, you can create an extension method for string and simply call it, for example LastName.Padded():
public static class StringExtensions
{
public static string Padded(this string s)
{
return s.ToUpper().PadRight(10);
}
}
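The "quick loop" option can be sketched the same way. The Python below is only an illustration of the idea, with a list of dicts standing in for the DataTable rows and a hypothetical helper name. Note that it also truncates values longer than the width, which is an assumption on my part: PadRight in the extension method above leaves longer strings untouched.

```python
def format_last_name(value, width=10):
    """Uppercase, pad with spaces on the right, and cut to a fixed width."""
    return value.upper().ljust(width)[:width]


# Stand-in for the rows of the DataTable.
rows = [{"LASTNAME": "Jones"}, {"LASTNAME": "Featherstonehaugh"}]

# One pass over the rows, rewriting the column in place.
for row in rows:
    row["LASTNAME"] = format_last_name(row["LASTNAME"])

print([row["LASTNAME"] for row in rows])
```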

using date range in Lucene.net

I understand how Lucene.net can work for text indexing. Will I be able to efficiently search for documents based on a given date range? Or will Lucene.net just use text matching to match the dates?
Lucene.Net will just use text matching, so you'd need to format the dates correctly before adding to the index:
public static string Serialize(DateTime dateTime)
{
return dateTime.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture);
}
public static DateTime Deserialize(string str)
{
return DateTime.ParseExact(str, "yyyyMMddHHmmss", CultureInfo.InvariantCulture);
}
You can then, for example, perform a range based query to filter by date (e.g. 2006* to 2007* to include all dates in 2006 and 2007).
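The reason this works is that a fixed-width, most-significant-first format sorts the same way as text and as dates. Here is a small Python sketch of that property (not Lucene.Net code); the format string is the Python equivalent of yyyyMMddHHmmss and the sample dates are made up.

```python
from datetime import datetime

FORMAT = "%Y%m%d%H%M%S"  # same layout as yyyyMMddHHmmss


def serialize(dt):
    return dt.strftime(FORMAT)


def deserialize(text):
    return datetime.strptime(text, FORMAT)


def in_text_range(value, low, high):
    """Inclusive range check done purely on the serialized strings."""
    return low <= value <= high


dates = [
    datetime(2008, 1, 1, 0, 0, 0),
    datetime(2006, 1, 1, 0, 0, 0),
    datetime(2005, 12, 31, 23, 59, 59),
    datetime(2007, 12, 31, 23, 59, 59),
]
terms = [serialize(d) for d in dates]

# Sorting the strings gives the same order as sorting the dates.
assert sorted(terms) == [serialize(d) for d in sorted(dates)]

# Everything in 2006 and 2007, selected by text comparison only.
hits = [t for t in terms if in_text_range(t, "20060101000000", "20071231235959")]
print(sorted(hits))
```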
I ran into trouble when I converted the date to yyyyMMddHHmmssff. When I tried sorting the data, I got an exception saying the value was too big to convert. After some searching, I found that you need two columns, one in yyyyMMdd and the other in HHmmss, and then pass both columns to Sort[]. This solves the issue.