Updating Lucene payloads without a full re-index

In Lucene, I'm using payloads to store information for each token in a document (a float value in my case). From time to time, those payloads may need to be updated. If I know the docID, termID, offset, etc., is there any way for me to update the payloads in place without having to re-index the whole document?

I'm not aware of any Lucene API that supports this; even an "update" operation is, under the hood, executed as a "delete" followed by an "add".
A workaround that requires more storage but reduces I/O and latency is to store the full source of each document, either in the Lucene index itself or in a dedicated data store on the same node as the index. Then you only need to send the updated payload information to your application to rebuild the document, but the whole document still has to be re-indexed.
See also How to set a field to keep a row unique in lucene?
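To make the delete-then-add concrete, here is a minimal sketch in Java Lucene terms; the "id" unique-key field and the assumption that the writer's analyzer attaches the new per-token payloads (e.g. via a TokenFilter setting PayloadAttribute) are mine, not part of the question:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class PayloadReindexer {
    // There is no in-place payload update, so the whole document is re-added.
    // IndexWriter.updateDocument() performs the delete-then-add atomically
    // for the document matching the given unique-key term.
    public static void reindexWithNewPayloads(IndexWriter writer, String docKey, String body)
            throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", docKey, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));  // analyzer attaches the new payloads
        writer.updateDocument(new Term("id", docKey), doc);
        writer.commit();
    }
}
```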

Related

Suitable Google Cloud data storage option for raw JSON events with auto-incrementing id

I'm looking for an appropriate google data/storage option to use as a location to stream raw, JSON events into.
The events are generated by users in response to very large email broadcasts, so throughput could be very low one moment and up to ~25,000 events per second for short periods of time. The JSON representation of these events will probably only be around 1 KB each.
I want to simply store these events as raw and unprocessed JSON strings, append-only, with a separate sequential numeric identifier for each record inserted. I'm planning to use this identifier as a way for consuming apps to be able to work through the stream sequentially (in a similar manner to the way Kafka consumers track their offset through the stream) - this will allow me to replay the event stream from points of my choosing.
I am taking advantage of Google Cloud Logging to aggregate the event stream from Compute Engine nodes, from here I can stream directly into a BigQuery table or Pub/Sub topic.
BigQuery seems more than capable of handling the streaming inserts, however it seems to have no concept of auto-incrementing id columns and also suggests that its query model is best-suited for aggregate queries rather than narrow-result sets. My requirement to query for the next highest row would clearly go against this.
The best idea I currently have is to push into Pub/Sub and have it write each event into a Cloud SQL database. That way Pub/Sub could buffer the events if Cloud SQL is unable to keep up.
My desire for an auto-identifier and possibly a datestamp column makes this feel like a 'tabular' use case, so I suspect the NoSQL options might also be inappropriate.
If anybody has a better suggestion I would love to get some input.
We know that many customers have had success using BigQuery for this purpose, but it requires some work to choose the appropriate identifiers if you want to supply your own. It's not clear to me from your example why you couldn't just use a timestamp as the identifier and use the ingestion-time partitioned table streaming ingestion option?
https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_ingestion-time_partitioned_tables
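As a rough sketch of that option with the google-cloud-bigquery Java client (dataset, table, field names and the insertId are assumptions, not part of the answer), streaming into an ingestion-time partitioned table is just a normal streaming insert with the event timestamp carried as a field:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.HashMap;
import java.util.Map;

public class EventStreamer {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("events_dataset", "raw_events");   // hypothetical names

        Map<String, Object> row = new HashMap<>();
        row.put("event_ts", System.currentTimeMillis() / 1000.0);     // ordering key instead of an auto-increment id
        row.put("payload", "{\"type\":\"open\",\"user\":\"123\"}");   // raw JSON stored as a string

        InsertAllResponse response = bigquery.insertAll(
                InsertAllRequest.newBuilder(table)
                        .addRow("some-unique-insert-id", row)          // insertId gives best-effort de-duplication
                        .build());

        if (response.hasErrors()) {
            System.err.println("Insert errors: " + response.getInsertErrors());
        }
    }
}
```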
As for Cloud Bigtable, as noted by Les in the comments:
Cloud Bigtable could definitely keep up, but isn't really designed for sequential adds with a sequential key as that creates hotspotting.
See: https://cloud.google.com/bigtable/docs/schema-design-time-series#design_your_row_key_with_your_queries_in_mind
You could again use a timestamp as the key here, although you would want to do some work, e.g. add a hash or other uniquifier, to ensure that at your 25k writes/second peak you don't overwhelm a single node (we can generally handle about 10k row modifications per second per node, and if you just use lexicographically sequential IDs like an incrementing number, all your writes would go to the same server).
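Purely as an illustration of that advice (the shard count and key layout are assumptions), a Bigtable row key could spread sequential writes like this:

```java
public class EventRowKeys {
    // A short hash prefix fans writes out across nodes; the zero-padded timestamp
    // keeps rows scannable in time order within each shard.
    public static String rowKey(String eventId, long timestampMillis) {
        int shard = Math.floorMod(eventId.hashCode(), 16);             // 16 write shards (arbitrary)
        return String.format("%02d#%013d#%s", shard, timestampMillis, eventId);
    }
}
```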
At any rate it does seem like BigQuery is probably what you want to use. You could also refer to this blog post for an example of event tracking via BigQuery:
https://medium.com/streak-developer-blog/using-google-bigquery-for-event-tracking-23316e187cbd

How to create a facet in Sitecore Content Search (Lucene) based on Real Time Data?

With Sitecore Content Search configuration is it possible to support the addition of a field which is populated with a value at search time, not index time? The population would be from an in-memory data structure for performance.
Essentially, the values need to be updated and accessed without re-indexing; examples of such real-time fields would be Facebook Likes, In Stock, or Real-Time Pricing. This data would then be used for faceting, such as items within a range of Facebook likes, in-stock versus out-of-stock, or real-time price facets.
The Content Search API does its searching on an IIndexable, so I would look into that - you'd probably have to implement this interface yourself.
More info here:
http://www.sitecore.net/learn/blogs/technical-blogs/sitecore-7-development-team/posts/2013/04/sitecore-7-search-operations-explained.aspx
If you need to search on data that is not in the index, I would question whether Sitecore search is the best option here. If the data needs to be searched in real time then maybe a database would suffice.
If the data set is large and you need real-time access then a NoSQL database such as MongoDB might be the right choice. Hope this has given you some ideas and you reach a solution.
You can leverage the Sitecore dynamic index. The idea is to query your "large" index from within your in-memory index which you'll use dynamically. The implementation is relatively easy.
More info: http://www.sitecore.net/en-gb/learn/blogs/technical-blogs/sitecore-7-development-team/posts/2013/04/sitecore-7-dynamic-indexes.aspx

Search index replication

I am developing an application that requires a CLucene index to be created in a desktop application, but replicated for (read-only) searching on iOS devices and efficiently updated when the index is updated.
Aside from simply re-downloading the entire index whenever it changes, what are my options here? CLucene does not support replication on its own, but Solr (which is built on top of Lucene) does, so it's clearly possible. Does anybody know how Solr does this and how one would approach implementing similar functionality?
If this is not possible, are there any (non-Java-based) full-text search implementations that would meet my needs better than CLucene?
Querying the desktop application is not an option - the mobile applications must be able to search offline.
A Lucene index is based on write-once, read-many segments. This means that when new documents have been committed to a Lucene index, all you need to retrieve is:
the new segments,
the merged segments (old segments which have been merged in a single segment, if any),
the segments file (which stores information about the current segments).
Once all these new files have been downloaded, the old segment files that have been merged away can be safely removed. To take the changes into account, just reopen an IndexReader.
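In Java Lucene terms (CLucene's API is similar but not identical), picking up the newly copied segments on the read-only replica is just a reader reopen; this is a minimal sketch:

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;

public class ReplicaRefresher {
    // Call this after the new segment files and the segments_N file have been synced.
    public static DirectoryReader refresh(DirectoryReader current) throws IOException {
        DirectoryReader updated = DirectoryReader.openIfChanged(current);
        if (updated == null) {
            return current;     // nothing new on disk
        }
        current.close();        // release the superseded segment files
        return updated;
    }
}
```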
Solr has a Java implementation to do this, but given how simple it is, using a synchronization tool such as rsync would do the trick too. By the way, this is how Solr replication worked before Solr 1.4, you can still find some documentation on the wiki about rsync replication.

Best way to keep index real time?

I have a Solr/Lucene index of approximately 700 GB. The documents that I need to index arrive in real time; roughly 1000 docs are submitted every 30 minutes and need to be indexed. In my scenario a script runs every 30 minutes and indexes the documents that are not yet indexed, since it is a requirement that new documents be searchable as soon as possible, but this process slows down searching.
Is this the best way to index the latest documents, or is there a better way?
First, remember that Solr is not a real-time search engine (yet). There is still work to be done.
You can use a master/slave setup, where indexing is done on the master and searching on the slave. With this, indexing does not affect search performance. After a commit is done on the master, force the slave to fetch the latest index from the master. While the new index is being replicated to the slave, it is still processing queries with the previous index.
Also, check your cache-warming settings. Remember that these might slow down searches if they are too aggressive. Also check the queries launched on the new searcher event.
You can do this with Lucene easily. Split the index into multiple parts (or, to be precise, create "smaller" parts while building the indexes). Create a searcher for each part and store a reference to them. You can create a MultiSearcher on top of these individual parts.
Now, there will be only one index that will get the new documents. At regular intervals, add documents to this index, commit and re-open this searcher.
After the last index is updated, you can create a new multi-searcher again, using the previously opened searchers.
Thus, at any point, you will be re-opening only one searcher and that will be quite fast.
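A rough sketch of that split-index idea in current Lucene terms, using MultiReader (the modern counterpart of the old MultiSearcher); the two-reader split and variable names are assumptions:

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;

public class SplitIndexSearcherFactory {
    // bigReader covers the large, rarely-changing index; liveReader covers the small
    // index that receives new documents and is the only one reopened.
    public static IndexSearcher open(DirectoryReader bigReader, DirectoryReader liveReader)
            throws IOException {
        DirectoryReader fresh = DirectoryReader.openIfChanged(liveReader);   // cheap: small index
        DirectoryReader effectiveLive = (fresh != null) ? fresh : liveReader;
        // closeSubReaders=false: the caller keeps ownership of the underlying readers.
        MultiReader combined =
                new MultiReader(new IndexReader[] {bigReader, effectiveLive}, false);
        return new IndexSearcher(combined);
    }
}
```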
Check out Zoie (http://code.google.com/p/zoie/), a wrapper around Lucene that makes it real-time - code donated by LinkedIn.
I do this with plain Lucene (not Solr), and it works really nicely. However, I'm not sure whether there is a Solr way to do it at the moment. Twitter recently went with Lucene for searching and achieves effectively real-time search by simply writing to its index on every update. Its index resides completely in memory, so updating and reading the index is of no consequence and happens instantly; a Lucene index can always be read while being written to, as long as there is only one writer at a time.
Check out this wiki page

Strategies for keeping a Lucene Index up to date with domain model changes

I was looking to get people's thoughts on keeping a Lucene index up to date as changes are made to the domain model objects of an application.
The application in question is a Java/J2EE-based web app that uses Hibernate. The way I currently have things working is that the Hibernate-mapped model objects all implement a common "Indexable" interface that can return a set of key/value pairs that are recorded in Lucene. Whenever a CRUD operation is performed involving such an object, I send it via a JMS queue to a message-driven bean that records in Lucene the primary key of the object and the key/value pairs returned from the index() method of the Indexable object that was provided.
My main worries about this scheme are that the MDB might get behind and not keep up with the indexing operations coming in, or that some sort of error/exception could stop an object from being indexed. The result is an out-of-date index for either a short or long period of time.
Basically I was just wondering what kind of strategies others had come up with for this sort of thing. Not necessarily looking for one correct answer but am imagining a list of "whiteboard" sort of ideas to get my brain thinking about alternatives.
Change the message: just provide the primary key and the current date, not the key/value pairs. Your MDB fetches the entity by primary key and calls index(). After indexing you set an "updated" value in your index to the message date. You update your index only if the message date is after the "updated" field of the index. This way you can't get behind, because you always fetch the current key/value pairs first.
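A hedged sketch of that message-handling logic; the Indexable interface and the abstract helpers stand in for the poster's own Hibernate/Lucene code and are assumptions:

```java
import java.util.Map;

public abstract class IndexUpdateListener {

    public interface Indexable {
        Map<String, String> index();
    }

    protected abstract Indexable loadEntity(long primaryKey);        // e.g. a Hibernate session lookup
    protected abstract long readIndexedTimestamp(long primaryKey);   // the "updated" field stored in Lucene
    protected abstract void writeToIndex(long primaryKey, Map<String, String> fields, long timestamp);

    // The message carries only the primary key and a timestamp.
    public void onIndexMessage(long primaryKey, long messageTimestamp) {
        if (messageTimestamp <= readIndexedTimestamp(primaryKey)) {
            return;   // the index already reflects a newer version of this entity
        }
        Indexable entity = loadEntity(primaryKey);                    // always index the current state
        writeToIndex(primaryKey, entity.index(), messageTimestamp);   // also stores the new "updated" value
    }
}
```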
As an alternative: have a look at http://www.compass-project.org.
The accepted answer is 8 years old now and very out of date.
The Compass Project has not been maintained for a long time, as its main developer moved on to create the excellent Elasticsearch.
The modern answer to this is to use Hibernate Search, which incidentally can map to either a Lucene index directly or through Elasticsearch.
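For reference, a minimal sketch of what that looks like with Hibernate Search (annotations shown in the Hibernate Search 5.x style; the entity and field names are made up): annotated entities are re-indexed automatically whenever Hibernate inserts, updates, or deletes them.

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed   // kept in sync with the Lucene (or Elasticsearch) index on every CRUD operation
public class Product {

    @Id
    private Long id;

    @Field
    private String name;

    @Field
    private String description;

    // getters and setters omitted for brevity
}
```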