Approach to Incrementally Index Database Data from Multi-Table Join in Lucene with No Unique Key - sql

I have a particular SQL join such that:
select DISTINCT ... 100 columns
from ... 10 tables, some left joins
Currently I export the result of this query to XML using Toad (I'll query it straight from Java later). I use Java to parse the XML file, and I use Lucene (Java) to index it and to search the Lucene index. This works great: I get results 6-10 times faster than querying it from the database.
I need to think of a way to incrementally update this index when the data in the database changes.
Because I am joining tables (especially left joins), I'm not sure I can get a unique business-key combination to drive an incremental update. On the other hand, because I am using DISTINCT, I know that every row is a unique combination of field values. Given this, I thought I could store the hashCode of a document as a field of the document and call updateDocument on the IndexWriter like this:
public static void addDoc(IndexWriter w, Row row) throws IOException {
    // Row is simply a Java representation of a single row from the above query
    Document document = new Document();
    document.add(new StringField("fieldA", row.fieldA, Field.Store.YES));
    ...
    String hashCode = String.valueOf(document.hashCode());
    document.add(new StringField("HASH", hashCode, Field.Store.YES));
    w.updateDocument(new Term("HASH", hashCode), document);
}
Then I realized that updateDocument was actually deleting the document with the matching hash code and adding the identical document again, so this wasn't of any use.
What is the way to approach this?

Lucene has no concept of "updating" a document in place, so an update is essentially a delete followed by an add.
You can track progress on this here - https://issues.apache.org/jira/browse/LUCENE-4258
So you will need to keep the doc.hashCode() logic in your app, i.e. do not ask Lucene to index a document if you know that no values have changed (you can keep a set of hash-code values; if a document's hash is already in the set, that document has not changed). You might also want some logic for tracking deletes, as sketched below.
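One caveat first: Lucene's Document does not override hashCode(), so document.hashCode() is identity-based; you would hash the row's field values yourself. A minimal sketch of the bookkeeping, where the Row type, the field names, and how the hash set is persisted between runs are all hypothetical:

import java.io.IOException;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class HashTrackingIndexer {

    // Hashes seen on the previous run (persist/reload these between runs).
    private final Set<String> previousHashes = new HashSet<String>();
    // Hashes seen on the current run.
    private final Set<String> currentHashes = new HashSet<String>();

    // Hash the row content, not the Document object.
    private static String contentHash(Row row) {
        return String.valueOf(Objects.hash(row.fieldA /*, ...all 100 columns */));
    }

    public void indexRow(IndexWriter w, Row row) throws IOException {
        String hash = contentHash(row);
        currentHashes.add(hash);
        if (previousHashes.contains(hash)) {
            return; // row content unchanged since the last run: skip it
        }
        Document document = new Document();
        document.add(new StringField("fieldA", row.fieldA, Field.Store.YES));
        document.add(new StringField("HASH", hash, Field.Store.YES));
        w.updateDocument(new Term("HASH", hash), document);
    }

    // Call once after all rows have been fed in. Because a changed row gets a
    // new hash, its old version shows up here as a stale hash and is removed.
    public void purgeDeletedRows(IndexWriter w) throws IOException {
        previousHashes.removeAll(currentHashes);
        for (String stale : previousHashes) {
            w.deleteDocuments(new Term("HASH", stale));
        }
    }
}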

If you increment an ID on each relevant update of your source DB tables, and if you log these IDs on record deletion, you should then be able to list the deleted, updated, and new records among the data being indexed. This step might be performed within a transitory table, itself extracted into the XML file used as input to Lucene.
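To make that concrete, here is a rough sketch of reading such a change log, assuming a table CHANGE_LOG(ID, REC_ID, OP) kept up to date by triggers on the source tables (the table, columns, and operation codes are all hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ChangeLogReader {

    // Process every change logged after the given watermark.
    // OP is 'I'/'U'/'D' for insert/update/delete in this hypothetical schema.
    public static long processChanges(Connection con, long lastSeenId) throws SQLException {
        String sql = "SELECT ID, REC_ID, OP FROM CHANGE_LOG WHERE ID > ? ORDER BY ID";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, lastSeenId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    String recId = rs.getString("REC_ID");
                    if ("D".equals(rs.getString("OP"))) {
                        // delete the Lucene document keyed on recId
                    } else {
                        // re-run the big join for recId and updateDocument(...)
                    }
                    lastSeenId = rs.getLong("ID");
                }
            }
        }
        return lastSeenId; // persist this watermark for the next run
    }
}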

Related

Append new files to the already indexed files using Lucene

From the database, some data is taken, indexed, and stored using Lucene.
Later, some more data is added to the DB, and I need to index only that newly added data and append it to the existing index files.
Can you explain with a program?
What you are asking for is incremental indexing, and it is less about the indexing side and more about how you select data (the target documents) from the database.
You need to make your SQL SELECT query flexible enough to use a column that identifies newly added/updated rows.
That column is usually a DATE column, i.e. something like LAST_ADDED_DT or LAST_UPDT_DT, so you can fetch records added/updated in the last x days, x hours, etc.
E.g. on DB2, WHERE DATE(LAST_UPDT_DT) >= CURRENT DATE - 2 DAY will give you the records updated in the last two days.
Then use the updateDocument(...) method of Lucene's IndexWriter instead of addDocument(...), since updateDocument(...) will add a document if it is new and replace it if it already exists.
So this approach takes care of updated existing rows as well as new rows.
Whether Lucene creates new files or appends to existing ones is then not your headache; Lucene will organize its files as per its settings and structure for that version.
You should open your writer in OpenMode.CREATE_OR_APPEND mode.
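For example, a compact sketch of that select-and-update loop, assuming a JDBC source and a unique key column ID in the result set (the table, column names, and Lucene 4.x field types are illustrative):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IncrementalIndexer {

    // Re-index only the rows touched in the last two days (DB2-style predicate).
    public static void indexRecentRows(IndexWriter writer, Connection con) throws Exception {
        String sql = "SELECT ID, FIELD_A FROM MY_TABLE"
                   + " WHERE DATE(LAST_UPDT_DT) >= CURRENT DATE - 2 DAY";
        try (Statement st = con.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                String id = rs.getString("ID");
                Document doc = new Document();
                doc.add(new StringField("ID", id, Field.Store.YES));
                doc.add(new TextField("FIELD_A", rs.getString("FIELD_A"), Field.Store.YES));
                // Adds the document if no existing document matches the term,
                // replaces the matching document otherwise.
                writer.updateDocument(new Term("ID", id), doc);
            }
        }
        writer.commit();
    }
}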
Hope this helps !!

DocX4: How to add table in word docx using Quick part field and write to that template with a table

I would like to add a table to my docx template. I know how to do it with just a simple field, and how to use that template and write values to it. But how can I create a table and write it into the template? Assuming I have a list of Student objects, how am I going to write them into a table?
This is how you add a field name:
Quick Parts > Fields > Choose MergeField in the category and write the
desired field name
And here's how to write the value to it using docx4j:
Map<DataFieldName, String> map = new HashMap<DataFieldName, String>();
map.put(new DataFieldName("myName"),"yourName");
MailMerger.setMERGEFIELDInOutput(MailMerger.OutputField.DEFAULT);
MailMerger.performMerge(template, map, true);
template.save(new File("C:/temp/OUT_SIMPLE.docx"));
Assuming you want one row per list item, the easiest way is to iterate through the list, writing your table rows.
To generate suitable code, create a Word document with the table looking as you want it, then generate corresponding code.
Alternatively, you could use OpenDoPE content control data binding, wrapping a repeat around a table row. This is a more generic solution, but with a learning curve...
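For the per-row approach, a minimal sketch using docx4j's WML object factory (docx4j 3.x getContent() accessors; the Student type and its getters are hypothetical):

import java.util.List;

import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.ObjectFactory;
import org.docx4j.wml.P;
import org.docx4j.wml.R;
import org.docx4j.wml.Tbl;
import org.docx4j.wml.Tc;
import org.docx4j.wml.Text;
import org.docx4j.wml.Tr;

public class StudentTableWriter {

    // Append one table row per student to the main document part.
    public static void addStudentTable(WordprocessingMLPackage pkg, List<Student> students) {
        ObjectFactory factory = Context.getWmlObjectFactory();
        Tbl table = factory.createTbl();
        for (Student s : students) {
            Tr row = factory.createTr();
            row.getContent().add(cell(factory, s.getName()));
            row.getContent().add(cell(factory, s.getGrade()));
            table.getContent().add(row);
        }
        pkg.getMainDocumentPart().addObject(table);
    }

    // A table cell must contain at least one paragraph.
    private static Tc cell(ObjectFactory factory, String value) {
        Tc tc = factory.createTc();
        Text text = factory.createText();
        text.setValue(value);
        R run = factory.createR();
        run.getContent().add(text);
        P para = factory.createP();
        para.getContent().add(run);
        tc.getContent().add(para);
        return tc;
    }
}

Note this sketch appends the table at the end of the document; to position it relative to your merge fields you would locate the insertion point in the template first.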

Adding an extra field to already indexed data Solr

I have indexed approximately 1000 documents in Solr, but all of them are missing a field. I need to add a field to all these documents, and this field will have the same value for all of them. I do not have access to the original documents to index them again. Is there any way to do this without re-indexing all the data?
Unless you've configured your schema to store all values, no, there is no usable way to add a field to the documents without reindexing. If all fields are stored, you can use atomic updates to add a new field to a document, so you could query Solr for the ids of all existing documents and perform an update that way.
Otherwise you're going to have to go with the suggestion from #michielvoo and return a static value from the query string... but then you could also just append it in your application before returning it to the user (or you could add the field as a default value for the request handler in solrconfig.xml, so that you can edit and change it server side).
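If all fields are stored, the atomic-update pass could look roughly like this with a recent SolrJ client (the core URL, new field name, and value are placeholders; this assumes the schema's uniqueKey field is named id):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AddFieldToAllDocs {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        // ~1000 documents, so fetching all ids in one query is fine.
        SolrQuery q = new SolrQuery("*:*");
        q.setFields("id");
        q.setRows(1000);
        for (SolrDocument d : solr.query(q).getResults()) {
            SolrInputDocument update = new SolrInputDocument();
            update.addField("id", d.getFieldValue("id"));
            // "set" is the atomic-update modifier: adds or overwrites the field.
            Map<String, Object> op = new HashMap<String, Object>();
            op.put("set", "the-shared-value");
            update.addField("newField", op);
            solr.add(update);
        }
        solr.commit();
        solr.close();
    }
}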

Lucene - Reading all field names that are stored

I need to populate a dropdown with all field names in a Lucene index and show those values. I was able to do it successfully using
var luceneIndexReader = IndexReader.Open(@"D:\path_to\index_directory", true);
var allAvailableFieldNames = luceneIndexReader.GetFieldNames(IndexReader.FieldOption.ALL);
Only problem is I need to include only 'Stored' fields in the drop down. This list includes all 'Indexed' and/or 'Stored' fields in it. Is there a way to query/search the indexes if a field has any 'stored' values and thereby filter out this list?
The problem is that every document in the index can have a different set of fields containing stored values. Since stored fields are not kept in the inverted index (they are stored per document), you can't retrieve them from the IndexReader. You need to retrieve a specific document, e.g. Document doc = indexReader.document(1);, call Fieldable fields[] = doc.getFields();, then iterate over the fields and check field.isStored();.
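Scanning the whole index that way could look like this (Java Lucene 3.x names; in Lucene.NET the equivalents are IndexReader.IsDeleted, Document.GetFields() and IsStored). Note that document(i) only materializes stored fields anyway, so the explicit check is belt-and-braces:

import java.io.File;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class StoredFieldScanner {
    public static Set<String> storedFieldNames(String indexPath) throws Exception {
        Set<String> names = new HashSet<String>();
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexPath)), true);
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;   // skip deleted slots
                Document doc = reader.document(i);   // loads stored fields only
                for (Fieldable f : doc.getFields()) {
                    if (f.isStored()) names.add(f.name());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }
}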
Late to the party, but in the meantime you can just call FieldInfos#GetEnumerator().
See https://github.com/Shazwazza/lucenenet/blob/docs-update-jan2020/src/Lucene.Net/Index/FieldInfos.cs/#L168

Lucene 4 - get all used fields from directory

I'm trying to index a database table with Lucene 4. I index all fields of a table entry into a document as a TextField (one document per table entry) and try to search over the directory afterward.
My problem is that I need all the field names that are in the directory in order to use a MultiFieldQueryParser.
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_42, !FIELDS! , analyzer);
How do I get them? I could save them off while indexing, but keeping a separate log of them alongside the index doesn't seem very efficient :/
Thank You
Alex
You can get the field names from AtomicReader.getFieldInfos().
That will pass back a FieldInfos instance. Loop through FieldInfos.iterator() and get the field names from FieldInfo.name.
I don't see why it wouldn't be performant to store them somewhere ahead of time, though.
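For instance, a short Lucene 4.x sketch collecting the names from every leaf reader and feeding them to the parser (the reader and analyzer setup is assumed):

import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;

public class FieldNameCollector {
    public static String[] allFieldNames(DirectoryReader reader) {
        Set<String> names = new LinkedHashSet<String>();
        for (AtomicReaderContext ctx : reader.leaves()) {
            for (FieldInfo fi : ctx.reader().getFieldInfos()) {
                names.add(fi.name); // FieldInfo.name is a public field in 4.x
            }
        }
        return names.toArray(new String[names.size()]);
    }
}

// usage:
// QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_42, allFieldNames(reader), analyzer);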