Indexing an integer in Lucene - lucene

I am using the following code to index an integer value:
String key = hmap.get("key");
// Check for null/blank before parsing: Integer.parseInt(null) throws NumberFormatException.
if (key != null && key.trim().length() > 0) {
    System.out.println("key == " + Integer.parseInt(key));
    doc.add(new IntField("kv", Integer.parseInt(key), IndexFieldTypes.getFieldType(INDEX_STORE_FIELD)));
}
The problem is that when 'key' is '50', the line 'key == 50' is printed correctly, but the 'doc.add' line throws the following exception:
java.lang.IllegalArgumentException: type.numericType() must be INT but got null
at org.apache.lucene.document.IntField.<init>(IntField.java:171)
Can someone figure out what is wrong?

An IntField must have a FieldType whose numeric type is FieldType.NumericType.INT. I don't have intimate knowledge of your IndexFieldTypes class, but I would guess that the type it returns for INDEX_STORE_FIELD has no numeric type set (rightly so: if it is non-null, Lucene will try to index the field as a number).
You may not need to pass a field type to IntField at all, though; you could just do something like:
doc.add(new IntField("kv", Integer.parseInt(key), Field.Store.YES));
If you do need to define a FieldType, either use a different type from the existing functionality in IndexFieldTypes, or implement logic to create an IntField from it. Or just set the numeric type after it is retrieved, like:
// Copy the shared type so other fields are not affected (and because it may be frozen).
FieldType type = new FieldType(IndexFieldTypes.getFieldType(INDEX_STORE_FIELD));
type.setNumericType(FieldType.NumericType.INT);
doc.add(new IntField("kv", Integer.parseInt(key), type));

Related

Lucene query finds [0 TO 1] for value e.g. 167

I've opened the Index with Luke and the field is there.
The field is indexed via Hibernate Search and is annotated like this:
@Field(name = "id", index = Index.YES, analyze = Analyze.NO, store = Store.NO)
Long id
The values of this field are between 109 and 185.
If I search for this field e.g:
[150 TO 180]
then nothing is found.
If I search for it with:
[0 TO 1]
then all results are returned.
It seems the field is indexed in the wrong format, correct?
How can I correct this?
Note that I also indexed it once with store = Store.YES to see the values in Luke, and they were displayed correctly.
The query parsers in Lucene will treat everything as strings. It makes sense: at query time Lucene has no idea which types have been used. This means they will create string range queries.
So if you want Lucene to create a proper numeric range query, you'll need to subclass MultiFieldQueryParser (assuming you use that) and override newRangeQuery. There you can check the field name and create a numeric range query (e.g. with LongPoint.newRangeQuery()) if it's a field you know is numeric.
I have now found out that I have to use
NumericRangeQuery.newLongRange()
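To illustrate the answer's point about string range queries: a range over plain string terms compares character by character, so it can disagree with the numeric range for the same bounds. The sketch below is plain Java with hypothetical helper names (RangeSemantics, inStringRange, inNumericRange); it does not use Lucene and does not reproduce how Lucene encodes numeric fields internally.

```java
// Plain-Java illustration (no Lucene involved) of string versus numeric range checks.
// All names here are hypothetical and exist only for this example.
final class RangeSemantics {

    /** Inclusive range test using lexicographic (character-by-character) comparison. */
    static boolean inStringRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    /** Inclusive range test using numeric comparison. */
    static boolean inNumericRange(long value, long lower, long upper) {
        return value >= lower && value <= upper;
    }
}
```

For example, inStringRange("167", "20", "1000") is false although 167 lies between 20 and 1000, while inStringRange("5", "10", "60") is true although 5 lies outside 10 to 60.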

TableRow.get("field_name") can only be cast to String in Dataflow ParDo

I am exporting a table from BQ with Dataflow, and it seems that when a row is processed in a ParDo, I can only get the String value of each field in the TableRow, regardless of the field's original data type in the BQ schema.
For example, say my table has an INTEGER-typed column "fieldA":
public void processElement(ProcessContext c) throws Exception {
    TableRow row = c.element();
    String str = (String) row.get("fieldA"); // OK
    Integer i = (Integer) row.get("fieldA"); // Throws "String cannot be cast to Integer"
}
Is it a bug, or is it only me? If it is not only me, is there any way to get around it? For integer types I could still do Integer.valueOf(String), but that gets a little hacky and error-prone when parsing Timestamp fields.
FYI, I am using BlockingDataflowPipelineRunner.
According to BigQueryTableRowIterator:
Note that integers are encoded as strings to match BigQuery's exported JSON format.
So you'll need to use Integer.parseInt. Sorry for the trouble; we should improve the documentation about the typing of values in the TableRow when reading from BigQueryIO.Read. This documentation is not very discoverable.
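A minimal sketch of the parsing step, assuming the row can be treated as a Map<String, Object> (a plain HashMap stands in for TableRow here so the example runs without the Dataflow SDK). The class and method names are hypothetical. Because BigQuery INTEGER columns are 64-bit, the sketch parses to Long rather than Integer; that choice is mine, not the answer's.

```java
import java.util.Map;

// Hypothetical helper; a Map<String, Object> stands in for a TableRow.
final class TableRowValues {

    /** Returns the field as a Long, or null when the field is absent or null. */
    static Long getLong(Map<String, Object> row, String field) {
        Object raw = row.get(field);
        if (raw == null) {
            return null;
        }
        if (raw instanceof Number) {
            // Tolerate rows that already carry a numeric value.
            return ((Number) raw).longValue();
        }
        // Integers read from BigQuery arrive as strings, so parse them.
        return Long.parseLong(raw.toString().trim());
    }
}
```

Keeping the conversion in one place means a malformed value fails with a NumberFormatException at a single, predictable point instead of a ClassCastException scattered through the ParDo.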

Lucene Query does not return results even though it should

I am currently trying to get all documents from a Lucene index (v. 4) in a RAMDirectory.
On index creation the following addDocument function is used:
public void addDocument(int id, String[] values, String[] fields) throws IOException {
    Document doc = new Document();
    doc.add(new IntField("getAll", 1, IntField.TYPE_STORED));
    doc.add(new IntField("ID", id, IntField.TYPE_STORED));
    for (int i = 0; i < fields.length; i++) {
        doc.add(new TextField(fields[i], values[i], Field.Store.NO));
    }
    writer.addDocument(doc);
}
After calling this for all documents, the writer is closed.
As you can see from the first field added to the document, I added an additional field "getAll" to make it easy to retrieve all documents. If I understood it right, the query "getAll:1" should return all documents in the index. But that's not the case.
I am using the following function for that:
public List<Integer> getDocIds(int noOfDocs) throws IOException, ParseException {
    List<Integer> result = new ArrayList<Integer>(noOfDocs);
    Query query = parser.parse("getAll:1");
    ScoreDoc[] docs = searcher.search(query, noOfDocs).scoreDocs;
    for (ScoreDoc doc : docs) {
        result.add(doc.doc);
    }
    return result;
}
noOfDocs is the number of documents that were indexed. Of course I used the same RAMDirectory when creating the IndexSearcher.
Replacing the parsed Query with a manually created TermQuery didn't help either.
The query returns no results.
Hope someone can help to find my error.
Thanks
I believe you are having trouble searching because you are using an IntField, rather than a StringField or TextField, for instance. IntField, and other numeric fields, are designed for numeric range querying, and are not indexed in their raw form. You may use a NumericRangeQuery to search for them.
Really, though, IntField should only be used, to my mind, for numeric values, and not for a string of digits, which is what you appear to have. IDs should be keyword or text fields, generally.
As far as pulling all records, you don't need to add a field to do that. Simply use a MatchAllDocsQuery.
I think first you should run Luke to verify the contents of the index.
Also, if you allow * as the first character of a query with queryParser.setAllowLeadingWildcard(true), then a query like ID:* would retrieve all documents without having to include the getAll field.

Lucene Indexing to ignore apostrophes

I have a field that might have apostrophes in it.
I want to be able to:
1. store the value as is in the index
2. search based on the value ignoring any apostrophes.
I am thinking of using:
doc.add(new Field("name", value, Store.YES, Index.NO));
doc.add(new Field("name", value.replaceAll("['‘’`]",""), Store.NO, Index.ANALYZED));
If I then do the same replacement when searching, I guess it should work: the cleaned value is used for indexing and searching, and the original value for display.
Am I missing any other considerations here?
Performing replaceAll directly on the value is bad practice in Lucene; it is much better to encapsulate your tokenization recipe in an Analyzer. Also, I don't see the benefit of adding the field twice in your use case (see Document.add).
If you want to store the original value and still be able to search without the apostrophes, simply declare your field like this:
doc.add(new Field("name", value, Store.YES, Index.ANALYZED));
Then simply hook up a custom Tokenizer that removes apostrophes (I think Lucene's StandardAnalyzer already includes this transformation).
If you are storing the field with the aim of using highlighting you should also consider using Field.TermVector.WITH_POSITIONS_OFFSETS.
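Whichever component does the stripping, the indexed text and the query text have to pass through the same transformation, which is what putting it in the analysis chain guarantees. As a rough sketch of that transformation alone, here is the character removal in plain Java, using the same character set as the question's replaceAll. The class and method names are hypothetical, and this is not Lucene API; inside Lucene the equivalent logic would live in the analyzer rather than be called by hand.

```java
import java.util.regex.Pattern;

// Hypothetical helper showing only the character normalization, not a Lucene analyzer.
final class ApostropheNormalizer {

    // Straight apostrophe, left and right single quotation marks, and backtick.
    private static final Pattern APOSTROPHES = Pattern.compile("['\u2018\u2019`]");

    /** Removes every apostrophe-like character from the text. */
    static String strip(String text) {
        return APOSTROPHES.matcher(text).replaceAll("");
    }
}
```

Applied to both sides, "O'Brien" and its typographic variant with a right single quotation mark reduce to the same string, so a query typed either way can match the same indexed text.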

Is it possible to add custom metadata to a Lucene field?

I've come to the point where I need to store some additional data about where a particular field comes from in my Lucene.Net index. Specifically, I want to attach a guid to certain fields of a document when the field is added to the document, and retrieve it again when I get the document from a search result.
Is this possible?
Edit:
Okay, let me clarify a bit by giving an example.
Let's say I have an object that I want to allow the user to tag with custom tags like "personal", "favorite", "some-project". I do this by adding multiple "tag" fields to the document, like so:
doc.Add( new Field( "tag", "personal" ) );
doc.Add( new Field( "tag", "favorite" ) );
The problem is I now need to record some metadata about each individual tag itself, specifically a guid representing where that tag came from (imagine it as a user id). Each tag could potentially have a different guid, so I can't simply create a "tag-guid" field (unless the order of the values is preserved; see edit 2 below). I don't need this metadata to be indexed (and in fact I'd prefer it not to be, to avoid getting hits on metadata), I just need to be able to retrieve it again from the document/field.
doc.GetFields( "tag" )[0].Metadata...
(I'm making up syntax here, but I hope my point is clear now.)
Edit 2:
Since this is a completely different question, I've posted a new question for this approach: Is the order of multi-valued fields in Lucene stable?
Okay let's try another approach... The key problem area is the indeterminacy of the multiple field values under the same field name (e.g. "tag"). If I could introduce or obtain some kind of determinacy here, I might be able to store the metadata in another field.
For example, if I could rely on the order of the values of the field never changing, I could use an index in the set of values to identify exactly which tag I am referring to.
Is there any guarantee that the order I add the values to a field will remain the same when I retrieve the document at a later time?
Depending on your search requirements for this index, this may be possible by storing all tags in a single delimited field and the matching guids, in the same order, in a second field. That way you control the order of the values. It would require updating both fields as the tag list changes, of course, but the overhead may be worth it.
doc.Add(new Field("tags", "{personal}|{favorite}"));
doc.Add(new Field("tagsref", "{1234}|{12345}"));
Note: using the {} allows you to qualify your search for uniqueness where similar values exist.
Example: If values were stored as "person|personal|personage" searching for "person" would return a document that has any one of person, personal or personage. By qualifying in curly brackets like so: "{person}|{personal}|{personage}", I can search for "{person}" and be sure it won't return false positives. Of course, this assumes you don't use curly brackets in your values.
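A small sketch of this encoding in plain Java, assuming (as the answer does) that values never contain curly brackets, and additionally that they never contain the '|' separator. The class and method names are hypothetical, and the answer's code is Lucene.Net while this sketch is Java; it only demonstrates the string handling, not how an analyzer would tokenize the field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the "{a}|{b}" encoding; values must not contain '{', '}' or '|'.
final class DelimitedTags {

    /** Wraps each value in braces and joins with '|', e.g. "{personal}|{favorite}". */
    static String encode(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String v : values) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append('{').append(v).append('}');
        }
        return sb.toString();
    }

    /** Splits an encoded string back into its values, preserving their order. */
    static List<String> decode(String encoded) {
        List<String> out = new ArrayList<>();
        if (encoded.isEmpty()) {
            return out;
        }
        for (String part : encoded.split("\\|")) {
            out.add(part.substring(1, part.length() - 1)); // drop the surrounding braces
        }
        return out;
    }

    /** Returns the reference stored at the same position as the tag, or null if the tag is absent. */
    static String refForTag(String tags, String refs, String tag) {
        int i = decode(tags).indexOf(tag);
        return i < 0 ? null : decode(refs).get(i);
    }
}
```

With the answer's example values, refForTag("{personal}|{favorite}", "{1234}|{12345}", "favorite") returns "12345", and "{personal}|{personage}".contains("{person}") is false, whereas the unbraced "personal|personage".contains("person") is true.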
I think you're asking about payloads.
Edit: From your use case, it sounds like you have no desire to use this metadata in your search, you just want it there. (Basically, you want to use Lucene as a database system.)
So, why can't you use a binary field?
ExtraData ed = new ExtraData { Tag = "tag", Type = "personal" };
byte[] byteData = BinaryFormatter.Serialize(ed); // this isn't the correct code, but you get the point
doc.Add(new Field("myData", byteData, Field.Store.YES));
Then you can deserialize it on retrieval.