Append new files to the already indexed files using Lucene

Some data is taken from the database, indexed, and stored using Lucene.
Later, more data is added to the database, and I need to index only that newly added data and append it to the existing index files.
Can you explain with a program?

What you are asking for is incremental indexing, and it is less about the indexing side and more about how you select the target documents from the database.
You need to make your SQL SELECT query flexible enough to use a column that identifies newly added or updated rows.
That column is usually a DATE column, e.g. something like LAST_ADDED_DT or LAST_UPDT_DT, so you can fetch records added or updated in the last x days, x hours, etc.
For example, on DB2, WHERE DATE(LAST_UPDT_DT) >= CURRENT DATE - 2 DAYS will give you the records updated in the last two days.
Then use the updateDocument(...) method of Lucene's IndexWriter instead of addDocument(...): updateDocument(...) will add a document if it is new and update it if it already exists.
So this approach takes care of updated existing rows as well as new rows.
Whether Lucene creates new files or appends to existing ones is then not your headache; Lucene will organize its files according to its settings and structure for that version.
You should open your writer in OpenMode.CREATE_OR_APPEND mode.
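A minimal sketch of that flow (Lucene 5+ API; the JDBC URL, table, and field names below are assumptions, not from the question):

import java.nio.file.Paths;
import java.sql.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;

public class IncrementalIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // CREATE_OR_APPEND creates the index if it is missing, otherwise appends.
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);

        try (IndexWriter writer = new IndexWriter(
                 FSDirectory.open(Paths.get("/path/to/index")), config);
             Connection conn = DriverManager.getConnection("jdbc:db2://host/db", "user", "pass");
             Statement st = conn.createStatement();
             // Fetch only rows touched in the last two days (DB2 syntax).
             ResultSet rs = st.executeQuery(
                 "SELECT ID, TITLE, BODY FROM MY_TABLE " +
                 "WHERE DATE(LAST_UPDT_DT) >= CURRENT DATE - 2 DAYS")) {
            while (rs.next()) {
                Document doc = new Document();
                doc.add(new StringField("id", rs.getString("ID"), Field.Store.YES));
                doc.add(new TextField("title", rs.getString("TITLE"), Field.Store.YES));
                doc.add(new TextField("body", rs.getString("BODY"), Field.Store.YES));
                // updateDocument adds the doc if no existing doc matches the term,
                // otherwise it deletes the old doc and adds this one.
                writer.updateDocument(new Term("id", rs.getString("ID")), doc);
            }
        }
    }
}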
Hope this helps !!

Related

How to get the column index number of a specific field name in a staged file on Snowflake?

I need to get the column number of a staged file on Snowflake.
The main idea is that I need to automate getting this field in other queries rather than using t.$3, where 3 is the position of the field; that position might change because our surveys are expandable (more or fewer questions depending on the situation).
So what I need is something like this:
SELECT COL_NUMBER FROM #my_stage/myfile.csv WHERE value = 'my_column_name'
-- Without any file format to read the header
And then this COL_NUMBER could be used as t.$"+COL_NUMBER+" inside merge queries.

In SQL how do I update a table with a similar table?

In my current database I have a table whose data is either entered manually or comes in an Excel sheet every week. Before we had the "manual entry option", the table would be dropped and replaced by the Excel version.
Now, because there is data that only exists in the original table, this can no longer be done.
I'm trying to find a way to update the original table with changes and additions from the (Excel) table while preserving all rows not in the new sheet.
I've been attempting to simply use an insert query and an update query, but I can't find a way to detect changes in a record.
Any suggestions? I can provide the current SQL if you'd find that helpful.
Based on what I have read so far, I think I can offer some suggestions:
It appears you have control of the MS Access database. I would suggest adding a field to your data table called "source". Modify the form in your Access database to store something like "m" for manual entry in the source field. When you import the Excel sheet, store an "e" for Excel in the field.
You would need to do a one time scrub of the data to mark existing records as manual entries or excel entries. There are a couple of ways you can do it through automation/queries that I can explain in detail if you want.
Once past these steps, your excel process is fairly simple. You can delete all records with source = "e" and then do a full excel import. Manual records would remain unchanged.
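A minimal sketch of that weekly refresh in Access SQL (tblData and tblExcelImport are assumed names, and the field list is abbreviated):

-- Remove the previous Excel-sourced rows; manual ("m") rows are untouched.
DELETE FROM tblData WHERE source = 'e';
-- Re-import the full sheet, tagging every row as Excel-sourced.
INSERT INTO tblData (field1, field2, source)
SELECT field1, field2, 'e'
FROM tblExcelImport;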
This concept will let you add new sources and codes and handle each differently if needed. You just need to spend some time cleaning up your old data; I think you will find it worth it in the end.
Good Luck.

Approach to Incrementally Index Database Data from Multi-Table Join in Lucene with No Unique Key

I have a particular SQL join such that:
select DISTINCT ... 100 columns
from ... 10 tables, some left joins
Currently I export the result of this query to XML using Toad (I'll query it straight from Java later). I use Java to parse the XML file, and I use Lucene (Java) to index it and to search the Lucene index. This works great: I get results 6-10 times faster than querying it from the database.
I need to think of a way to incrementally update this index when the data in the database changes.
Because I am joining tables (especially with left joins), I'm not sure I can get a unique business-key combination for an incremental update. On the other hand, because I am using DISTINCT, I know that every row is a unique combination of field values. Given this, I thought I could store the hashCode of a document as a field of the document and call updateDocument on the IndexWriter, like this:
public static void addDoc(IndexWriter w, Row row) throws IOException {
    // Row is simply a Java representation of a single row from the above query
    Document document = new Document();
    document.add(new StringField("fieldA", row.fieldA, Field.Store.YES));
    ...
    String hashCode = String.valueOf(document.hashCode());
    document.add(new StringField("HASH", hashCode, Field.Store.YES));
    w.updateDocument(new Term("HASH", hashCode), document);
}
Then I realized that updateDocument was actually deleting the document with the matching hash code and adding the identical document again, so this wasn't of any use.
What is the way to approach this?
Lucene has no concept of "updating" a document, so an update is essentially a delete + add.
You can track progress on that here: https://issues.apache.org/jira/browse/LUCENE-4258
So you will need to keep the doc.hashCode() logic in your app, i.e. do not ask Lucene to index a document if you know that none of its values have changed (you can keep a set of hashCode values; if a document's hash is already in the set, that document has not changed). You might also want logic for tracking deletes.
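A minimal sketch of that idea. Note that it hashes the row's field values itself rather than calling Document.hashCode(), which is identity-based and therefore differs on every run; the class and method names are made up:

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

public class HashGate {
    // Hashes of every row indexed so far; load/persist these yourself between runs.
    private final Set<String> knownHashes = new HashSet<>();

    // Hash built from the row's values, stable across JVM runs.
    static String rowHash(List<String> values) {
        return Integer.toHexString(String.join("\u0001", values).hashCode());
    }

    void indexIfChanged(IndexWriter w, List<String> values, Document doc) throws IOException {
        String hash = rowHash(values);
        if (knownHashes.contains(hash)) {
            return; // an identical row is already indexed, skip it
        }
        doc.add(new StringField("HASH", hash, Field.Store.YES));
        w.addDocument(doc); // a plain add: nothing in the index matches this hash
        knownHashes.add(hash);
    }
}

Deleted rows still need separate handling, for example by removing documents whose hashes no longer appear in the current query result.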
If you increment an id on each relevant update of your source DB tables, and if you log these ids on record deletion, you should then be able to list the deleted, updated, and new records of the data being indexed. This step might be performed within a transitory table, itself extracted into the XML file used as input to Lucene.
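A rough sketch of that bookkeeping in SQL (every table and column name here is an assumption):

-- Each source table carries a CHANGE_ID that a trigger or sequence bumps on
-- every insert or update; deletions go to a separate log so the indexer
-- knows which documents to remove.
CREATE TABLE DELETED_LOG (BUSINESS_KEY VARCHAR(100), CHANGE_ID BIGINT);

-- Transitory table: only rows touched since the last indexing run.
INSERT INTO INDEX_STAGING
SELECT DISTINCT ...   -- the same 100 columns
FROM ...              -- the same 10-table join
WHERE t1.CHANGE_ID > :lastIndexedChangeId;   -- repeat/combine per joined table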

Lucene 4 - get all used fields from directory

I'm trying to index a database table with Lucene 4. I index all fields of a table entry into a document as TextFields (one document per table entry) and then search over the directory afterward.
My problem is that I need all the field names present in the directory in order to use a MultiFieldQueryParser:
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_42, !FIELDS! , analyzer);
How do I get them? I could save them off while indexing, but it wouldn't be very performant to log them separately alongside the index :/
Thank You
Alex
You can get the field names from AtomicReader.getFieldInfos().
That will return a FieldInfos instance. Loop through FieldInfos.iterator() and get the field names from FieldInfo.name.
I don't see why it wouldn't be performant to store them somewhere ahead of time, though.
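A minimal sketch against the Lucene 4.2 API from the question (the class name AllFields is made up):

import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class AllFields {
    static String[] fieldNames(IndexReader reader) {
        Set<String> names = new HashSet<>();
        // A DirectoryReader is composite; each leaf is an AtomicReader whose
        // getFieldInfos() lists the fields of that segment.
        for (AtomicReaderContext leaf : reader.leaves()) {
            for (FieldInfo fi : leaf.reader().getFieldInfos()) {
                names.add(fi.name);
            }
        }
        return names.toArray(new String[0]);
    }

    static Query parse(Directory dir, String userQuery, Analyzer analyzer) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            MultiFieldQueryParser parser = new MultiFieldQueryParser(
                Version.LUCENE_42, fieldNames(reader), analyzer);
            return parser.parse(userQuery);
        }
    }
}

If I remember right, Lucene 4 also has MultiFields.getMergedFieldInfos(reader), which does this merge across segments in one call.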

Caching in VBA (Excel 2010) to connect to SQL Server 2008 Express

I have an Excel 2010 workbook. In the VBA code, one procedure fetches data from SQL Server 2008 Express.
There are repeated calls, and many times the same/filtered data is fetched (data that is already present in the Excel document).
This makes a good use case for caching (for performance improvement).
Is it possible? If yes, how?
SQL Native Client is used to connect to the database.
When you update the underlying data that you wish to cache, also store the date updated. This could be done manually or via a trigger.
When you perform your query for the main data in Excel, also query for and store in the Excel spreadsheet the date last updated.
Finally, when you perform your data-refresh operation from Excel, first query to see if the current last-updated date is the same as the stored last-updated date. If so, there is no need to refresh the data.
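A minimal sketch of the trigger variant on the SQL Server side (the table and column names are made up):

-- Bookkeeping table holding one row per cached table.
CREATE TABLE dbo.TableVersions (TableName sysname PRIMARY KEY, DateChanged datetime);

-- Touch the bookkeeping row whenever dbo.YourTable changes.
CREATE TRIGGER dbo.trg_YourTable_Touch ON dbo.YourTable
AFTER INSERT, UPDATE, DELETE
AS
    UPDATE dbo.TableVersions
    SET DateChanged = GETDATE()
    WHERE TableName = 'YourTable';

Your VBA refresh routine then compares the stored DateChanged with the value it cached on the previous run and re-queries the main data only when they differ.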
If your data has an inherent "last updated" date and there is an index of any kind with this value as its first column, then you already have your "last updated" date stored just fine: it takes only one seek to read the most recent update date, from which you can derive the same optimization.
SELECT TOP 1 DateChanged FROM dbo.YourTable ORDER BY DateChanged DESC;
Assuming the index I mentioned exists on DateChanged, you've got your "table last updated" date. This also assumes, of course, that every operation on the table faithfully updates this date when inserting or updating, and that rows are never deleted, only marked inactive (otherwise you would not know to remove a row).
Either way, whether explicitly saving a separate last-updated date or using a column implicitly, you now have a way to cache your data.
It may help to think about how browsers and web servers perform this task, which is pretty much exactly how I outlined it: the file being requested has a modified date, and that date is exchanged with the client browser first. Only if the file has a newer date than the browser's cached copy does the browser request the actual file contents.