I'm trying to index a database table with Lucene 4. I index every field of a table row into a document as a TextField (one document per row) and then search over the directory. My problem is that I need all the field names present in the directory in order to use a MultiFieldQueryParser:
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_42, !FIELDS! , analyzer);
How do I get them? I could record them while indexing, but keeping a separate log of them alongside the index wouldn't be very performant :/
Thank You
Alex
You can get the field names from AtomicReader.getFieldInfos().
That returns a FieldInfos instance; loop through its iterator and read each name from FieldInfo.name.
I don't see why it wouldn't be performant to store them somewhere ahead of time, though.
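As an illustration, a minimal sketch (Lucene 4.x, assuming an open Directory `dir` and an Analyzer `analyzer` already exist) that collects the field names from every index segment and feeds them to the parser:

```java
// Collect all field names across segments, then build the parser.
DirectoryReader reader = DirectoryReader.open(dir);
Set<String> fields = new HashSet<>();
for (AtomicReaderContext ctx : reader.leaves()) {
    // FieldInfos is Iterable<FieldInfo> in Lucene 4
    for (FieldInfo fi : ctx.reader().getFieldInfos()) {
        fields.add(fi.name);
    }
}
QueryParser parser = new MultiFieldQueryParser(
        Version.LUCENE_42,
        fields.toArray(new String[0]),
        analyzer);
```

Iterating reader.leaves() rather than a single AtomicReader ensures fields that only occur in some segments are still picked up.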
Some data is taken from the database, then indexed and stored using Lucene.
Later, more data is added to the database, and I need to index only the newly added data and append it to the existing index files.
Can you explain with a program?
What you are asking for is incremental indexing, and it is less about the indexing side and more about how you select the data (the target documents) from the database.
You need to make your SQL SELECT query flexible enough to use a column that identifies newly added or updated rows.
That column is usually a DATE column, something like LAST_ADDED_DT or LAST_UPDT_DT, so you can fetch records added or updated in the last x days, x hours, etc.
e.g. on DB2, WHERE DATE(LAST_UPDT_DT) >= CURRENT DATE - 2 DAY will give you the records updated in the last two days.
Then use the updateDocument(...) method of the Lucene IndexWriter instead of addDocument(...): updateDocument(...) adds the document if it is new and replaces it if it already exists.
So this approach takes care of updated existing rows as well as new rows.
Whether Lucene creates new files or appends to existing ones is not your concern; Lucene organizes its files according to its own settings and structure for that version.
You should open your writer in OpenMode.CREATE_OR_APPEND mode.
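A minimal sketch of such an incremental pass (Lucene 4.x; the column names, field names, and the `dir`, `analyzer`, and `stmt` variables are illustrative assumptions, not from your schema):

```java
// Incremental pass: fetch only rows touched in the last two days and
// upsert them into the index keyed by a unique "id" term.
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_42, analyzer);
cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
try (IndexWriter writer = new IndexWriter(dir, cfg)) {
    ResultSet rs = stmt.executeQuery(
        "SELECT ID, TITLE, BODY FROM DOCS "
      + "WHERE DATE(LAST_UPDT_DT) >= CURRENT DATE - 2 DAY");
    while (rs.next()) {
        String id = rs.getString("ID");
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("title", rs.getString("TITLE"), Field.Store.YES));
        doc.add(new TextField("body", rs.getString("BODY"), Field.Store.YES));
        // Adds the doc if no document matches the id term,
        // otherwise replaces the existing one.
        writer.updateDocument(new Term("id", id), doc);
    }
    writer.commit();
}
```

The unique-key term passed to updateDocument is what makes this safe to re-run: processing the same row twice just replaces the same document.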
Hope this helps!
I am using Luke to add a new field to all existing documents in the index. I can easily do this for one document using the "Reconstruct and Edit" option, but I want to do it for all documents in one go. Right now I have to do it one by one, and there are 2000 documents in the index, so this would take a lot of time.
I have indexed approximately 1000 documents in Solr. But all of them are missing a field. I need to add a field to all these documents, and this field will have the same value for all of them. I do not have access to these documents to index them again. Is there any way to do this without re-indexing all the data again?
Unless you've configured your schema to store all values, no, there is no usable way to add a field to the documents without reindexing. If all fields are stored, you can use atomic updates to add a new field to a document, so you could query Solr for the IDs of all existing documents and perform an update that way.
Otherwise you're going to have to go with the suggestion from #michielvoo and return a static value from the query string... but then you could also just append it in your application before returning it to the user (or you could add the field as a default value for the request handler in solrconfig.xml, so that you can edit and change it server-side).
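For the stored-fields case, an atomic update is just a partial document with a `set` operation posted to the /update handler. A sketch of the payload (the IDs, field name, and value here are placeholders):

```json
[
  { "id": "doc-1", "my_new_field": { "set": "shared value" } },
  { "id": "doc-2", "my_new_field": { "set": "shared value" } }
]
```

You would POST this as JSON to something like /solr/&lt;core&gt;/update?commit=true. Note this only works when every other field is stored (or a docValues-backed equivalent), because Solr reconstructs the full document internally before rewriting it.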
I have a particular SQL join such that:
select DISTINCT ... 100 columns
from ... 10 tables, some left joins
Currently I export the result of this query to XML using Toad (I'll query it straight from Java later). I use Java to parse the XML file, and I use Lucene (Java) to index it and to search the Lucene index. This works great: I get results 6-10 times faster than querying it from the database.
I need to think of a way to incrementally update this index when the data in the database changes.
Because I am joining tables (especially left joins) I'm not sure I can get a unique business key combination to do an incremental update. On the other hand, because I am using DISTINCT, I know that every single field is a unique combination. Given this information, I thought I could put the hashCode of a document as a field of the document, and call updateDocument on the IndexWriter like this:
public static void addDoc(IndexWriter w, Row row) throws IOException {
//Row is simply a java representation of a single row from the above query
Document document = new Document();
document.add(new StringField("fieldA", row.fieldA, Field.Store.YES));
...
String hashCode = String.valueOf(document.hashCode());
document.add(new StringField("HASH", hashCode, Field.Store.YES));
w.updateDocument(new Term("HASH", hashCode), document);
}
Then I realized that updateDocument was actually deleting the document with the matching hash code and adding the identical document again, so this wasn't of any use.
What is the way to approach this?
Lucene has no concept of "updating" a document in place, so an update is essentially a delete followed by an add.
You can track the progress here - https://issues.apache.org/jira/browse/LUCENE-4258
So you will need to keep the doc.hashCode() logic in your app, i.e. do not ask Lucene to index a document if you know that no values have changed (you can keep a set of hashCode values; if a document's hash is already in the set, that document has not changed). You might also want logic for tracking deletes.
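A hypothetical sketch of that application-side check (the class and method names are illustrative): track content hashes outside Lucene so unchanged rows are never handed to the IndexWriter at all.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative helper: remembers the hash of every row content it has seen.
// Note String.hashCode() can collide; a real implementation might use a
// stronger digest (e.g. SHA-256 over the concatenated column values).
class ChangeTracker {
    private final Set<Integer> seenHashes = new HashSet<>();

    // Returns true the first time a given row content is seen, i.e. when
    // the row is new or one of its values changed; false when unchanged.
    boolean isNewOrChanged(String rowContent) {
        // Set.add returns false if the hash was already present.
        return seenHashes.add(rowContent.hashCode());
    }
}
```

Only when isNewOrChanged(...) returns true would you build the Document and call updateDocument, which avoids the pointless delete + re-add for unchanged rows.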
If you increment an id on each relevant update of your source DB tables, and if you log those ids on record deletion, you should then be able to list the deleted, updated, and new records of the data being indexed. This step might be performed within a transitory table, itself extracted into the XML file used as input to Lucene.
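A hypothetical shape for that transitory change-log table (all names are illustrative; the rows would be written by triggers or application code on each insert/update/delete of the source tables):

```sql
-- Change log: one row per INSERT/UPDATE/DELETE on the source tables.
CREATE TABLE INDEX_CHANGE_LOG (
    CHANGE_ID   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    ROW_KEY     VARCHAR(200) NOT NULL,  -- key of the affected source row
    CHANGE_TYPE CHAR(1)      NOT NULL,  -- 'I', 'U' or 'D'
    CHANGED_AT  TIMESTAMP    NOT NULL
);

-- The incremental pass reads everything after the last CHANGE_ID it
-- processed, then updates or deletes the matching Lucene documents.
SELECT ROW_KEY, CHANGE_TYPE
FROM INDEX_CHANGE_LOG
WHERE CHANGE_ID > ?
ORDER BY CHANGE_ID;
```

Remembering the highest CHANGE_ID processed gives you a resumable cursor, and the 'D' rows are what let you issue the corresponding deleteDocuments calls.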
I only want my Lucene search to return the highest-scoring highlighted fragment per document. Say I have 5 documents, each containing the word "performance" three times; I still want only 5 results printed and highlighted on the results page. How can I do that? Thanks!
You get only one fragment per document by calling getBestFragment, rather than getBestFragments.
If your call to search is returning the same document more than once, you very likely have more than one copy of the same document in your index. Make sure that if you intend to create a new index, you open your IndexWriter with its OpenMode set to IndexWriterConfig.OpenMode.CREATE.
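A hedged sketch using the contrib Highlighter (assumes an open `searcher`, `analyzer`, and `query`, and a stored field named "content" - the field name is an assumption):

```java
// One highlighted fragment per hit: getBestFragment (singular)
// returns only the top-scoring fragment for each document.
QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(scorer);
TopDocs hits = searcher.search(query, 5);
for (ScoreDoc sd : hits.scoreDocs) {
    String text = searcher.doc(sd.doc).get("content");
    String fragment = highlighter.getBestFragment(analyzer, "content", text);
    System.out.println(fragment);
}
```

Since the loop runs once per ScoreDoc, five distinct documents in the index yield exactly five highlighted lines, regardless of how often the term appears inside each one.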