Solr Optimization job is not deleting logically deleted documents - optimization

I've two questions:
I've tried optimization with the following command:
curl 'http://hostname:port/solr//update?optimize=true&maxSegments=N&waitSearcher=false'
But when the largest segment contains both live docs and deleted docs, the Solr optimization job does not remove those logically deleted docs. It merges the current segments down to the requested segment count, and the result has the same deleted doc count as before.
When a core already has a certain segment count, I'm not able to optimize it with that same 'maxSegments=N'. Can optimization not be performed when the resulting segment count equals the core's current segment count?
Please provide best practices for this and tell me what I'm doing wrong.
Thanks in advance!

Starting with Solr 7.5, the behaviour of segment merging changed. Merging segments is what "optimize" does, including the removal of deleted documents, so you were on the right path. But starting with 7.5, segments are merged only if certain criteria are fulfilled.
Please review this article (found in an email thread in the Solr community):
https://lucidworks.com/post/solr-optimize-merge-expungedeletes-tips/
I had the same issue. After reading the article, I set "maxSegments=1", and this made "optimize" do the desired job, since it enforces the old behaviour.
So it should work with your instance as well, if you specify "maxSegments=1" instead of "maxSegments=N".
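
If you trigger the optimize from a script rather than curl, a minimal sketch could look like the following. It only assembles the same update URL with "maxSegments=1"; the base URL and the core name ("mycore") are hypothetical placeholders, not values from the question.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_optimize_url(base_url, core, max_segments=1, wait_searcher=False):
    """Build a Solr update URL that requests an optimize (forced merge)."""
    params = urlencode({
        "optimize": "true",
        "maxSegments": max_segments,
        "waitSearcher": "true" if wait_searcher else "false",
    })
    return f"{base_url}/{core}/update?{params}"


def optimize(base_url, core):
    """Send the optimize request and return the HTTP status code."""
    with urlopen(build_optimize_url(base_url, core)) as response:
        return response.status


# Hypothetical host and core name, for illustration only.
url = build_optimize_url("http://localhost:8983/solr", "mycore")
print(url)
```

Calling optimize() sends the request; building the URL separately makes it easy to check the parameters before running a merge that can take a long time on a large core.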

Column Deletion from Apache Druid

How can we delete a column from a Druid datasource?
I removed it from the datasource spec, but I can still see it in the datasource.
Please assist if anyone is familiar with this.
Druid is not like a conventional database, where you define a structure and that structure is applied to all the data.
The data is stored in segments. Each segment contains the data which was put in this segment, together with the "structure" of that segment.
So, changing your dataSource spec makes sure that newly created segments will not include that column. However, existing segments will still contain it.
To remove this column, you need to re-index the older segments. During this re-index task, you read the data from your existing segments and apply your new dataSource spec to it. You can then write the result back to the same segment you read it from.
See this link to read data from existing data sources:
https://druid.apache.org/docs/latest/ingestion/native-batch.html#druid-input-source
In the latest version of Druid (0.17.0) this has changed. Previously it was done with an IngestSegmentFirehose.
Please make sure that you process the WHOLE segment. If you only overwrite a part of the segment, all the other data will be lost (at least, in the new version of your data).
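
As a rough sketch of such a re-index task, the helper below builds a native batch spec that reads an interval back from the same datasource through the "druid" input source and keeps only the dimensions you list. The field names follow my reading of the native batch documentation linked above, so please verify them against your Druid version. The datasource name, interval, dimension names, and the DAY segment granularity are all hypothetical; the granularity should match your existing segments so that whole segments are rewritten.

```python
import json


def build_reindex_spec(datasource, interval, dimensions, metrics_spec):
    """Build a task that re-reads `interval` of `datasource` and rewrites it
    with only the listed dimensions (the dropped column is simply omitted)."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Existing segments expose their timestamp as __time in millis.
                "timestampSpec": {"column": "__time", "format": "millis"},
                "dimensionsSpec": {"dimensions": dimensions},
                "metricsSpec": metrics_spec,
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",  # assumption: match your segments
                    "queryGranularity": "NONE",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "druid",
                    "dataSource": datasource,
                    "interval": interval,
                },
                # Overwrite the interval instead of appending to it.
                "appendToExisting": False,
            },
        },
    }


# Hypothetical datasource and columns; "obsolete_column" is left out on purpose.
spec = build_reindex_spec("events", "2020-01-01/2020-01-02", ["country", "city"], [])
print(json.dumps(spec, indent=2))
```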
Also note: after the rewrite, Druid stores your new data under a newer version. However, your "old" version still exists. If you are not aware of this, your data storage can grow very quickly.
If you are happy with your result, you should execute a KILL task. This will delete all data (from older versions) which are no longer the "active" version.
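
A kill task is a small JSON document posted to the Overlord. The sketch below shows one way to build and submit it; the Overlord URL, datasource name, and interval are hypothetical, and you should check the task fields against the documentation for your Druid version before running it, since the deletion is permanent.

```python
import json
from urllib.request import Request, urlopen


def build_kill_task(datasource, interval):
    """Build a kill task for the unused segments of `datasource` in `interval`."""
    return {"type": "kill", "dataSource": datasource, "interval": interval}


def submit_task(overlord_url, task):
    """POST a task to the Overlord task endpoint and return the parsed reply."""
    request = Request(
        f"{overlord_url}/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.loads(response.read())


# Hypothetical datasource and interval, for illustration only.
task = build_kill_task("events", "2020-01-01/2020-01-02")
print(json.dumps(task))
# submit_task("http://localhost:8090", task)  # uncomment once the values are checked
```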
If you are a PHP user, you can take a look at this package: https://github.com/level23/druid-client
We have implemented these re-index tasks together with easy querying in a class. Maybe it helps.

-Denable-debug-rules=true not giving out statistics

I'm giving the flag -Denable-debug-rules, which the documentation says should print something to a log at least every 5 minutes, according to http://graphdb.ontotext.com/documentation/standard/rules-optimisations.html
Unfortunately it's not, and I need to figure out why inferencing is taking so long.
Help?
The specific file is http://purl.obolibrary.org/obo/pr.owl and I'm using owl2-rl-optimized
Version graphdb-ee-6.3.1
An exchange with GraphDB tech support clarified that the built-in rule sets cannot be monitored. To monitor them effectively, copy the rule set into a new file and add that file as a ruleset following http://graphdb.ontotext.com/documentation/enterprise/reasoning.html#operations-on-rulesets

Newly inserted documents to RavenDb not showing up in searches

I have a standard install of RavenDb and am running into some problems after I insert a new document.
If I do a subsequent search or try to pull that document by its Id after I've inserted it, there is about a 25% chance that it's not included in the search results or that I get an error trying to retrieve it by its Id. When I open up Raven Studio I can see that the document exists, so what's the deal?
Is this because whatever index it is using to find the document hasn't been updated yet? How can I ensure that I am always querying the latest data so that this doesn't happen?
Yes, it looks like this is due to stale indexes. You can check whether there are pending index operations and use that to ensure that you are querying the latest data. This article describes how to do that:
http://ravendb.net/docs/article-page/3.0/csharp/indexes/stale-indexes

Exclude versioned documents while Querying-Raven db

I added the versioning bundle midway through my project, after having written most of my Raven queries in my data access layer. Now, because of versioning, I have lots of replicated data. Whenever I query a type of document, I see the values replicated as many times as the document has been versioned. Is there a way to stop querying the revisioned documents when I query for the current data, without rewriting all of my queries with Exclude("Revisions")? Is there a setting, something like "query on revisioned documents = false", that I can set globally? Please suggest something to overcome this.
That is the way it works, actually. It appears that you have disabled the versioning bundle, which would cause this to happen.

Is it possible to re-generate Lucene index in background?

Sometimes there is a need to re-generate a Lucene index, e.g. when something changes in the Compass mapping or in the way boosts are applied, or if something got corrupted for whatever reason.
In my case, generating the index takes about 5 to 6 hours, and clearing the index beforehand leaves the data incomplete for that interval, i.e. a search during this time returns an incomplete result.
Is there any standard way to have Lucene generate the index in the background? E.g. write the index to a temporary directory and (when indexing has finished without exceptions etc.) replace the existing index with the new one?
Of course, one could implement this "manually", but does one have to? Sounds like a common use case to me.
Best regards + Thanks for your opinion,
Peter :)
I had a similar experience; there were certain parameters to the Analyzer which would get changed from time to time; obviously if that was the case, the entire index needs to get rebuilt. (I won't go into the details, suffice to say I had the same requirement!)
I did what you suggested in your question. There were three directories, "old", "current" and "new". Queries from the live site went against "current" always. The index recreation process was:
1. Recursive delete on the "old" and "new" directories
2. Create the new index into the "new" directory (in my case takes about 6 hrs)
3. Rename "current" to "old"; and "new" to "current"
4. Recursive delete the "old" directory
An analysis of what happens when the process crashes - if it crashes in the 1st step, the next time it will just carry on. If it crashes in the 2nd step then the "new" directory will get deleted next run. The 3rd step is very fast - renaming a directory is fast and atomic. Crashing in the 4th step doesn't matter, it'll just get cleaned up next run.
The careful observer will note that in step 3, the system could crash between renaming the current directory away and moving the new directory in. This is unlikely to happen as directory rename is so fast. The system has been in production for a few years and this has never happened (yet?).
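
The rotation described above can be sketched roughly as follows. This is an illustration in Python rather than the original implementation; build_index is a hypothetical callback that stands in for whatever writes the Lucene index into the directory it is given.

```python
import os
import shutil


def rebuild_index(base_dir, build_index):
    """Rebuild an index in the background and swap it in with two renames."""
    old = os.path.join(base_dir, "old")
    current = os.path.join(base_dir, "current")
    new = os.path.join(base_dir, "new")

    # Step 1: remove leftovers from a previous (possibly crashed) run.
    shutil.rmtree(old, ignore_errors=True)
    shutil.rmtree(new, ignore_errors=True)

    # Step 2: build the complete index into "new" (the slow part).
    # Searches keep hitting "current" while this runs.
    os.makedirs(new)
    build_index(new)

    # Step 3: swap the directories. Each rename is fast, but there is a
    # short window between the two in which "current" does not exist.
    if os.path.exists(current):
        os.rename(current, old)
    os.rename(new, current)

    # Step 4: discard the previous index.
    shutil.rmtree(old, ignore_errors=True)
```

Note that a searcher that already holds the previous index open still has to be reopened against the new "current" directory; that part is outside this sketch.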
I think the usual way to do this is to use Solr's replication functionality. In your case, though, the master and slave would be on the same machine, just pointed at different directories.
We have a similar problem. Our data is indexed in Lucene, but the original source is DB and content repo.
So if an index goes out of sync (or data type changes, etc.), we simply iterate over all existing entries in the index and re-generate the data so each document gets updated. It is not really a complex thing to do.