Why Term.Text returns invalid data for Numericfields in Lucene.Net even though it supposed to convert accrodingly? - lucene

I was trying to return all values in order to use them later for facets as following:
TermEnum termsEnum = reader.Terms(new Term(groupByField, string.Empty));
But as soon as I added a filed like this:
NumericField tempNumericField = new NumericField("price", Field.Store.YES, true);
Term.Text started to return wrong data for the price field.
Is there a way to return all date for both Field and NumericFields?

NumericFields are stored in an encoded form (allows for correct ordering, ranges etc).
Try using NumericUtils.PrefixCodedToInt (or the appropriate method for long etc)

Related

Lucene calculate term vectors for existing index

With Lucene.net I would like to get the term vectors as described in this stackoverflow question.
The problem is, the index is already generated with the field indexed and stored, but without term vectors.
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(false);
Theoretically, it should be possible to re-calculate the term vectors for each document and then store it in the index.
Do you know how this could be possible, without deleting the complete Lucene index?
As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.
In my scenario, I want to find the offset positions of my search term in the matched document.
I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.
Step 1: Perform whatever query you are currently performing.
For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
Step 2: For each such field, you can re-use the same analyzer to build a token stream on-the-fly. The token stream can be configured with different attributes, such as:
token attributes
offset attributes
and others (see here)
Example:
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;
const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
String? fieldName = null;
String fieldContent = "Foo Bar Baz Bar Bat";
String searchTerm = "bar";
var analyzer = new StandardAnalyzer(AppLuceneVersion);
var ts = analyzer.GetTokenStream(fieldName, fieldContent);
var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
var offsetAttr = ts.AddAttribute<IOffsetAttribute>();
try
{
ts.Reset();
Console.WriteLine("");
Console.WriteLine("Token: " + searchTerm);
while (ts.IncrementToken())
{
if (searchTerm.Equals(charTermAttr.ToString()))
{
var start = offsetAttr.StartOffset;
var end = offsetAttr.EndOffset;
Console.WriteLine(String.Format(" > offset: {0}-{1}", start, end));
}
}
ts.End();
}
catch (Exception)
{
throw;
}
The above example assumes one of the hits from step 1 was a field containing "Foo Bar Baz Bar Bat" - with a search term of bar.
The output generated is:
Token: bar
> offset: 4-7
> offset: 12-15
So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.

Add field/string length to logstash event

I'm trying to add a string length field to an index. Ideally, I'd like to use the kibana script feature as I can 'add' this field later but I keep getting a null_pointer_exception with the following code... I'm trying to sort in a visualization based on the fields length.
doc['field'].value ? doc['field'].length() : 0
Is this correct?
I thought it was because my field isn't always set (sparse data), but I added the ?:0 to combat that (which didn't work)
Any ideas?
You can define an scripted field in Kibana, of type int, language painless, and try this:
return (doc['field'].value != null? doc['field'].value.length(): 0);

How do you get Endeca to search on a particular target field rather than across all indexed fields?

We have an Endeca index configured across multiple fields of email content - subject and body. But we only want searches to be performed on the subject lines. Endeca is returning matches within the bodies too. How do you limit the search to the subject?
You can search a specific field or fields by specifying it (them) with the Ntk parameter.
Or if you wish to search a specific group of fields frequently you can set up an interface (also specified with the Ntk parameter), that includes that group of fields.
This is how you can do it using presentation API.
final ENEQuery query = new ENEQuery();
final DimValIdList dimValIdList = new DimValIdList("0");
query.setNavDescriptors(dimValIdList);
final ERecSearchList searches = new ERecSearchList();
final StringBuilder builder = new StringBuilder();
for(final String productId : productIds){
builder.append(productId);
builder.append(" ");
}
final ERecSearch eRecSearch = new ERecSearch("product.id", builder.toString().trim(), "mode matchany");
searches.add(eRecSearch);
query.setNavERecSearches(searches);
Please see this post for a complete example.
Use Search Interfaces in Developer Studio.
Refer - http://docs.oracle.com/cd/E28912_01/DeveloperStudio.612/pdf/DevStudioHelp.pdf#page=209

SQL Select Like Keywords in Any Order

I am building a Search function for a shopping cart site, which queries a SQL Server database. When the user enters "Hula Hoops" in the search box, I want results for all records containing both "Hula" and "Hoop", in any order. Furthermore, I need to search multiple columns (i.e. ProductName, Description, ShortName, MaufacturerName, etc.)
All of these product names should be returned, when searching for "Hula hoop":
Hula hoop
Hoop Hula
The Hoopity of xxhula sticks
(Bonus points if these can be ordered by relevance!)
It sounds like you're really looking for full-text search, especially since you want to weight the words.
In order to use LIKE, you'll have to use multiple expressions (one per word, per column), which means dynamic SQL. I don't know which language you're using, so I can't provide an example, but you'll have to produce a statement that's like this:
For "Hula Hoops":
where (ProductName like '%hula%' or ProductName like '%hoops%')
and (Description like '%hula%' or Description like '%hoops%')
and (ShortName like '%hula%' or ShortName like '%hoops%')
etc.
Unfortunately, that's really the only way to do it. Using Full Text Search would allow you to reduce your criteria to one per column, but you'll still have to specify the columns explicitly.
Since you're using SQL Server, I'm going to hazard a guess that this is a C# question. You'd have to do something like this (assuming you're constructing the SqlCommand or DbCommand object yourself; if you're using an ORM, all bets are off and you probably wouldn't be asking this anyway):
SqlCommand command = new SqlCommand();
int paramCount = 0;
string searchTerms = "Hula Hoops";
string commandPrefix = #"select *
from Products";
StringBuilder whereBuilder = new StringBuilder();
foreach(string term in searchTerms.Split(' '))
{
if(whereBuilder.Length == 0)
{
whereBuilder.Append(" where ");
}
else
{
whereBuilder.Append(" and ");
}
paramCount++;
SqlParameter param = new SqlParameter(string.Format("param{0}",paramCount), "%" + term + "%");
command.Parameters.Add(param);
whereBuilder.AppendFormat("(ProductName like #param{0} or Description like #param{0} or ShortName like #param{0})",paramCount);
}
command.CommandText = commandPrefix + whereBuilder.ToString();
SQL Server Full Text Search should help you out. You will basically create indexes on the columns you want to search. in the where clause of your query you will use the CONTAINS operator and pass it your search input.
you can start HERE or HERE to learn more
You might want to check out SOLR too - if you're going to be doing this type of searching. Super cool.
http://lucene.apache.org/solr/

nutch field problem

I was using something like:
Field notdirectory = new Field("notdirectory","1", Field.Store.NO, Field.Index.UN_TOKENIZED);
and queries like "notdirectory:1" can be processed quite well all the time.
But recently I've changed the "Field.Store.NO, Field.Index.UN_TOKENIZED" to index a non-numeric string:
Field stateField = new Field("state","irn_" + state, Field.Store.NO, Field.Index.UN_TOKENIZED);
and queries like "state:irn_CA" can never fetch any results any more,even though I watch through hadoop logs that "irn_CA" is added to "state" field in fact.
So I doubt for Fields that satisfy "Field.Store.NO, Field.Index.UN_TOKENIZED",only numeric Fields can searchable,but I didn't see any documents about that.
So what's the true reason for this?
I think, you are using StandardAnalyzer for parsing the input query, which will tokenize your input query "irn_CA" into two tokens - "irn" and "CA". Since the index has "irn_CA" as single token, it won't match.
Try using KeywordAnalyzer for while searching. It will generate single token for the query string and match the indexed token correctly.
I think the searcher bean forces everything to lowercase...so make the state is in lower case when adding to the index:
Field stateField = new Field("state","irn_" + state.toLowerCase(), Field.Store.NO, Field.Index.UN_TOKENIZED);
and when you query: 'state:irn_ca' instead of 'state:irn_CA'.
I also note you prefixed with 'irn_' - good call, otherwise the highlighter flags up the the query.