Solr atomic update of custom stored and indexed metadata clears full-text index

I use bin/post to index all my files in /documents (a mounted volume). That works, and full-text search works fine.
I then do an atomic update of specific metadata that I added to the schema before posting all the docs; that works too.
But when I do a full-text search to find the document whose metadata was updated, it no longer works: the updates are there, but the full-text index for that document seems to have disappeared.
If I do a full re-index, it overwrites my added metadata for the doc, resetting it to the default value, even though the metadata field I added is both stored and indexed.
Not sure what to do. It means that every re-index will reset my added metadata... not great.

Under the hood, an atomic update reconstructs the document from its stored fields, applies the changes, and writes the result back to disk. At the Lucene level there is no "document update"; that is a higher-level concept, and it's how the search indexes stay fast in this architecture.
So your full-text field, which is not stored, does not show up in the reconstructed document and does not get indexed again in the "updated" document.
If you have such a mix of stored and non-stored fields, you have to merge your updates outside of Solr, starting from the original full content (a minimal sketch below).
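For illustration, a minimal SolrJ sketch of that merge (the core name, file path, and field names are all hypothetical): re-read the original source, rebuild the whole document, and replace it instead of sending an atomic update.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class FullReplaceUpdate {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/documents").build()) {
                String id = "/documents/report.pdf";
                // The non-stored full-text field cannot be recovered from
                // the index, so re-read it from the original file.
                String fullText = new String(Files.readAllBytes(Paths.get(id)));

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", id);
                doc.addField("content", fullText);          // indexed, not stored
                doc.addField("my_metadata_s", "approved");  // the custom metadata

                solr.add(doc);   // full replace, not an atomic update
                solr.commit();
            }
        }
    }

The other fix, if your index is small enough to afford it, is to make every field stored (or docValues), at which point atomic updates become safe.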
Alternatively, depending on your use case, if you are only returning those updated values, you could inject them with a custom SearchComponent, use ExternalFileField, or similar. The user mailing list could be a good place to ask about the various options.

Related

Creating dynamic facets using Apache Solr

I'm new to Apache Solr.
I have uploaded a few log files using Solr Cell, and I want to create facets based on the content of the log files.
For example: inside my log file I have a record for a transaction. I would like to use transactionid as a facet, and clicking it should search the uploaded log files and return results for that particular id.
Note: I need to facet on fields derived from the content of the log.
As long as a field is indexed, you can facet on it. So you can use either a schemaless configuration or dynamicField definitions to match and automatically create fields for your log records (a minimal example below).
Go through the Solr examples first; there should be enough information there.
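For illustration, a dynamicField definition in schema.xml that would catch any field name ending in _s as an indexed, facetable string (the _s suffix convention is just the stock Solr example):

    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>

A document field posted as, say, transactionid_s would then be created automatically, and you could facet on it with facet=true&facet.field=transactionid_s.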
(updated based on the comments)
If the text needs to be pre-processed and split, there are two basic avenues:
Using DataImportHandler (DIH), probably with LineEntityProcessor and RegexTransformer, to split the field into multiple fields
Using UpdateRequestProcessor chains (in solrconfig.xml), probably cloning the field several times and then using RegexReplaceProcessorFactory to extract the relevant parts (see the sketch after this list). That's even uglier than DIH, though, as there is no easy way to split one field into many.
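A rough solrconfig.xml sketch of that second avenue (the message field name and the txn= pattern are hypothetical): clone the raw log line into a new field, then use the regex to strip it down to just the transaction id.

    <updateRequestProcessorChain name="parse-log-line">
      <processor class="solr.CloneFieldUpdateProcessorFactory">
        <str name="source">message</str>
        <str name="dest">transactionid_s</str>
      </processor>
      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">transactionid_s</str>
        <str name="pattern">^.*txn=(\S+).*$</str>
        <str name="replacement">$1</str>
        <!-- interpret $1 as a capture-group reference, not a literal -->
        <bool name="literalReplacement">false</bool>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>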
Still, specifically for logs, it is better to use something like Logstash with a Solr output plugin.
+1 to Alex's answer.
Another alternative is to write a custom update processor in which you figure out which field you want to facet on and explicitly add that field to your document (a minimal sketch below).
This makes sense only if you know what kind of fields to expect, based on some pattern. If that is not the case, then dynamic fields or a schemaless config is your best bet.
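A minimal sketch of such a processor (the message field and txn= pattern are hypothetical, matching the example above); the factory would then be referenced from an updateRequestProcessorChain in solrconfig.xml:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class TransactionIdProcessorFactory extends UpdateRequestProcessorFactory {
        private static final Pattern TXN = Pattern.compile("txn=(\\S+)");

        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                SolrQueryResponse rsp, UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    Object line = doc.getFieldValue("message");
                    if (line != null) {
                        Matcher m = TXN.matcher(line.toString());
                        if (m.find()) {
                            // Add the extracted id as its own facetable field.
                            doc.addField("transactionid_s", m.group(1));
                        }
                    }
                    super.processAdd(cmd);  // hand off to the rest of the chain
                }
            };
        }
    }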

What is a good practice to entirely replace an existing Lucene index?

We use Lucene as a search engine. Our Lucene index is created on a master server, and is then deployed to slave instances.
This deployment is currently done by a script that deletes the old files and copies in the new ones.
We'd like to know whether there is a good practice for doing a "hot deployment" of a Lucene index. Do we need to stop or suspend Lucene? Do we need to inform Lucene that the index has changed?
Thanks
The first step is to open the index in append mode for writing. You can do this by creating the IndexWriter with the open mode IndexWriterConfig.OpenMode.CREATE_OR_APPEND.
Once this is done, you are ready both to update existing documents and to add new ones. To update a document, you need some kind of unique identifier for it (this could be the URL, or anything else that is guaranteed to be unique). If you want to update the document with id "Doc001", simply call Lucene's updateDocument method, passing a Term for "Doc001" as the first argument.
This way you can update an existing index without deleting and rebuilding it.
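A minimal sketch of that flow (Lucene 5+ API; the index path and field names are hypothetical):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class IncrementalUpdate {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            // Reuse the existing index if present, create it otherwise.
            cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/path/to/index")), cfg)) {
                Document doc = new Document();
                doc.add(new StringField("id", "Doc001", Field.Store.YES));
                doc.add(new TextField("body", "updated content", Field.Store.NO));
                // Deletes any document whose "id" term matches, then adds
                // the new version in a single call.
                writer.updateDocument(new Term("id", "Doc001"), doc);
                writer.commit();
            }
        }
    }

Since the slaves only read the index, another common pattern is to build the new index in a fresh directory on the master and have each slave reopen its IndexReader/IndexSearcher on the new directory once the copy completes.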

Automatic alias creation on indexing request in elasticsearch

Is it possible to automatically create an alias on an indexing request for an absent index in Elasticsearch, instead of creating the index?
For example, say an indexing request comes in for the uk_london index, with type document and the document contents. The uk_london index does not exist, so it will be created. I want to avoid this creation, to avoid ending up with a kagillion shards. What I would like instead is for ES to create an alias uk_london (with routing, filtering, etc.) and have it point to the uk index (which already exists).
I know this problem is solvable by changing the way the indexing requests come in. However, the cluster I am running receives documents from multiple systems; I do not have full control over how the documents are sent, and I would rather not touch everything that talks to the cluster unless I really have to.
I am also aware of index templates and the fact that you can automatically set up aliases on index creation through them (an example below). However, as far as I understand, this does not solve my problem, as you can't tell a template to create an alias instead of an index.
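For reference, this is the kind of alias the question is after, created by hand through the aliases API (the routing value and filter field are hypothetical); the open question is how to get ES to do this automatically instead of creating a new index:

    POST /_aliases
    {
      "actions": [
        {
          "add": {
            "index": "uk",
            "alias": "uk_london",
            "routing": "london",
            "filter": { "term": { "city": "london" } }
          }
        }
      ]
    }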

Sitecore System Lucene Index for custom queries

I have been using Sitecore query and FAST query for some sections of the website, but with growing content these queries have become slow, and I'd like to switch to Lucene queries to speed things up.
I am wondering if I can just use the system index instead of having to set up a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach, or should I just create my own index?
(I'm going to assume you're using Sitecore 6.4 to 6.6.)
As with everything... it depends. Sitecore keeps an index of all the Sitecore items in its system index, and you are welcome to use that. Sometimes, though, you may want a more specialised or restricted list of items, for example items based on a certain template, or needing a checkbox field indexed (the system index by default only indexes text fields).
Setting up your own search index is pretty easy, although it does require some fiddling with the web.config (I'd recommend adding it as a .config include file).
Create a new <index> node with its own id; that defines the name of the collection and the folder it will go into. (You can check it's working by looking for the directory in the /data/indexes directory of your installation.)
Next, tell the crawler which database to look at (most likely master if you want unpublished content to be indexed, or web for published content) and where to start the crawl from (in this example I am indexing only the news section). You can tag and boost, and say whether to IndexAllFields (otherwise it will only index fields it understands as text: rich-text, multi-line text, text, etc.).
Finally, tell the indexer which template types to include or exclude; a rough config sketch follows below.
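Such a definition might sit under <search><configuration><indexes> in the web.config roughly like this (the index id, folder, root path, and tag are all hypothetical; check the exact parameter names against your Sitecore version):

    <index id="news" type="Sitecore.Search.Index, Sitecore.Kernel">
      <param desc="name">$(id)</param>
      <param desc="folder">news</param>
      <Analyzer ref="search/analyzer" />
      <locations hint="list:AddCrawler">
        <news type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
          <Database>master</Database>
          <Root>/sitecore/content/Home/News</Root>
          <IndexAllFields>true</IndexAllFields>
          <Tags>news</Tags>
          <Boost>1.0</Boost>
          <!-- include/exclude template lists go here; the exact element
               names differ between the stock crawler and the
               Advanced Database Crawler -->
        </news>
      </locations>
    </index>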
The way the indexer works is that it subscribes to item events within Sitecore, so every time an item is changed, moved, or deleted, the index is updated automatically. Obviously, if you are indexing the web database, the items will need to have been published first.
More in-depth info on the query syntax and indexing can be found on the SDN.
The search syntax and API are much improved in 6.4/6.5, but if you want to add extra kick, my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D
You will want to implement your own index. For the same reason that you are seeing things slow down when there is a lot of content, indexes slow down when there is a lot of content in them as well.
I prefer targeted indexes built specifically to drive the functionality I need, containing only the data that is required. This allows for smaller and more efficient indexes behind your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this Lucene indexing module.
A separate index is always a wise decision; you can keep it light. In big environments the system index can grow to gigabytes.
You can also exclude the actual content from your index, since you will only be using it to perform lookups, not to display content from the index.
Finally: the system index is built for the master database, while you'll be querying the web database, possibly on a content delivery server.

CouchDB View, Map, Index, and Sequence

I think I read somewhere that when a view is requested, the "map" is only run across documents that have been added since the last time it was requested? How is this determined? I thought I saw something about a sequence number. Is this something you can get at? It's not part of the UUID trailing the _rev field, is it?
Also, is there any way to force a "recalc" of the entire view (across all records)?
The section about View Indexes in the Technical Overview is a great guide to this.
The view builder uses the database sequence ID to determine if the view group is fully up-to-date with the database. If not, the view engine examines all the database documents (in packed sequential order) changed since the last refresh. Documents are read in the order they occur in the disk file, reducing the frequency and cost of disk head seeks.
As documents are examined, their previous row values are removed from the view indexes, if they exist. If the document is selected by a view function, the function results are inserted into the view as a new row.
CouchDB first checks whether anything has changed in the entire database, using a sequence id that gets updated whenever there's a change to any document in the database. If something has changed, it goes looking for those documents and runs the map function on them.
There really shouldn't be any need to rebuild/regenerate your views, since they refresh incrementally as you modify your documents (note, though, that a view won't update until you actually use it). That said, one way (and I'm sure there's a better way) would be to remove the design document describing the view and insert it again, seeing as a design document is (almost) no different from a normal document (a minimal sketch below).
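For illustration, a minimal design document (the database, view, and field names are hypothetical); deleting it and inserting it again resets the view group, so the index is rebuilt across all documents the next time the view is queried:

    {
      "_id": "_design/reports",
      "views": {
        "by_date": {
          "map": "function (doc) { if (doc.date) { emit(doc.date, null); } }"
        }
      }
    }

A subsequent GET /dbname/_design/reports/_view/by_date would then trigger the full rebuild.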