Compass/Lucene in a clustered environment

I get the following error in a clustered environment where one node is indexing the objects and the other node is confused about the segments it has cached. The node never recovers by itself, even after a server restart. The node that is indexing is presumably merging segments and deleting old ones, which the other node is not aware of. I did not touch the invalidateCacheInterval setting, and I added the compass.engine.globalCacheIntervalInvalidation property set to 500 ms. It didn't help.
This happens while one node is searching and the other is indexing.
Can someone help me resolve this issue? Perhaps by asking Compass to reload the cache or start from scratch, without having to reindex all the objects?
org.compass.core.engine.SearchEngineException: Failed to search with query [+type:...)]; nested exception is org.apache.lucene.store.jdbc.JdbcStoreException: No entry for [_6ge.tis] table index_objects
org.apache.lucene.store.jdbc.JdbcStoreException: No entry for [_6ge.tis] table index_objects
at org.apache.lucene.store.jdbc.index.FetchOnBufferReadJdbcIndexInput$1.execute(FetchOnBufferReadJdbcIndexInput.java:68)
at org.apache.lucene.store.jdbc.support.JdbcTemplate.executeSelect(JdbcTemplate.java:112)
at org.apache.lucene.store.jdbc.index.FetchOnBufferReadJdbcIndexInput.refill(FetchOnBufferReadJdbcIndexInput.java:58)
at org.apache.lucene.store.ConfigurableBufferedIndexInput.readByte(ConfigurableBufferedIndexInput.java:27)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:64)
at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:127)
at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:158)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:250)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:752)
at org.apache.lucene.index.MultiSegmentReader.docFreq(MultiSegmentReader.java:377)
at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:86)
at org.apache.lucene.search.Similarity.idf(Similarity.java:457)
at org.apache.lucene.search.TermQuery$TermWeight.<init>(TermQuery.java:44)
at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:146)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:185)
at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:360)
at org.apache.lucene.search.Query.weight(Query.java:95)
at org.apache.lucene.search.Hits.<init>(Hits.java:85)
at org.apache.lucene.search.Searcher.search(Searcher.java:61)
at org.compass.core.lucene.engine.transaction.support.AbstractTransactionProcessor.findByQuery(AbstractTransactionProcessor.java:146)
at org.compass.core.lucene.engine.transaction.support.AbstractSearchTransactionProcessor.performFind(AbstractSearchTransactionProcessor.java:59)
at org.compass.core.lucene.engine.transaction.search.SearchTransactionProcessor.find(SearchTransactionProcessor.java:50)
at org.compass.core.lucene.engine.LuceneSearchEngine.find(LuceneSearchEngine.java:352)
at org.compass.core.lucene.engine.LuceneSearchEngineQuery.hits(LuceneSearchEngineQuery.java:188)
at org.compass.core.impl.DefaultCompassQuery.hits(DefaultCompassQuery.java:199)
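For reference, here is a sketch of how that setting can be applied programmatically, assuming the Compass 2.x configuration API (the rest of the configuration is elided; 500 is the millisecond value tried above):

import org.compass.core.Compass;
import org.compass.core.config.CompassConfiguration;

public class CompassSetup {
    public static Compass build() {
        return new CompassConfiguration()
                .configure() // loads compass.cfg.xml from the classpath
                .setSetting("compass.engine.globalCacheIntervalInvalidation", "500")
                .buildCompass();
    }
}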

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application that reads roughly 70K records from the DB into a data frame; each record has 2 fields.
After reading the data from the DB, we apply a minor mapping and load the result as a broadcast for later usage.
Now, in the local environment, there is an exception, a timeout from the RetryingBlockFetcher, while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the Spark session with "spark.master" set to local.
When I limit the number of records to 20K, it works well.
Can you please help? Maybe I need to configure something in my local environment so that the original code works properly?
Update:
I tried changing a lot of Spark-related configuration in my local environment (memory, the number of executors, timeout-related settings, and more), but nothing helped; I just got the same timeout after more time.
I realized that the data frame I'm reading from the DB has a single partition of 62K records; after repartitioning it into 2 or more partitions, the process worked correctly and I managed to map and collect as needed.
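For reference, here is a minimal Java sketch of that workaround (the original app is in Scala; the JDBC options, class name, and partition count here are illustrative):

import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class MappingLoader {
    public static void main(String[] args) {
        // Local session, as described in the question.
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("mapping-loader")
                .getOrCreate();

        // Stand-in for the data frame read from the DB in the question.
        Dataset<Row> df = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://localhost/db") // hypothetical
                .option("dbtable", "mappings")                   // hypothetical
                .load();

        // Split the single 62K-row partition before collecting, so each
        // task ships a smaller result block back to the driver.
        Map<String, Long> mapping = df.select("id", "mapping_id")
                .repartition(4) // any value >= 2 worked, per the update
                .javaRDD()
                .mapToPair(row -> new Tuple2<String, Long>(row.getString(0), row.getLong(1)))
                .collectAsMap();

        spark.stop();
    }
}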
Any idea why this solves the issue? Is there a Spark configuration that can solve this instead of repartitioning?
Thanks!

Ontotext GraphDB Repository cannot be used for queries

I am getting an error message while trying to run a SPARQL query in a particular repository.
Error :
The currently selected repository cannot be used for queries due to an error:
Page [id=7, ref=1,private=false,deprecated=false] from pso has size of 206 != 820 which is written in the index: PageIndex#244 [OPENED] ref:3 (parent=null freePages=1 privatePages=0 deprecatedPages=0 unusedPages=0)
So I tried to recreate the repository by uploading a new RDF file, but the issue still persists. Any solution? Thanks in advance.
The error indicates an inconsistency between what is written in the index (pso.index) and the actual page (pso). Is there any chance that the binary files were modified, overwritten, or partially merged? Under normal operation, you should never get this error.
The only way to hide this error is to start GraphDB with ./graphdb -Dthrow.exception.on.index.inconsistency=false. I recommend doing this only to dump the repository content into an RDF file, then drop the repository and recreate it.
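For the dump step, here is a sketch using the RDF4J API that recent GraphDB versions ship with (the repository URL and output file name are hypothetical):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.Rio;

public class DumpRepository {
    public static void main(String[] args) throws Exception {
        // Connect to the (hypothetical) repository over HTTP.
        Repository repo = new HTTPRepository("http://localhost:7200/repositories/myrepo");
        try (RepositoryConnection conn = repo.getConnection();
             OutputStream out = new FileOutputStream("dump.ttl")) {
            // Stream every statement out of the repository as Turtle.
            conn.export(Rio.createWriter(RDFFormat.TURTLE, out));
        }
        repo.shutDown();
    }
}

Since this talks to the running server, GraphDB needs to be started with the flag above before the export will succeed.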

Corrupted MD-SAL queries when trying to read flows on tables: missing flow rules

I am experiencing a glitch from OpenDaylight (using Mininet).
Essentially, I am querying flow rules on specific nodes and on specific tables. The relevant code is the following, and is run by 1 separate thread per node that I am polling:
public static final InstanceIdentifier<Nodes> NODES_IID = InstanceIdentifier
        .builder(Nodes.class).build();

public static InstanceIdentifier<Table> makeTableIId(NodeId nodeId, Short tableId) {
    return NODES_IID.child(Node.class, new NodeKey(nodeId))
            .augmentation(FlowCapableNode.class)
            .child(Table.class, new TableKey(tableId));
}
and
InstanceIdentifier<Table> tableIId = makeTableIId(nodeId, tableId);
Optional<Table> tableOptional = dataBroker.newReadOnlyTransaction()
        .read(LogicalDatastoreType.OPERATIONAL, tableIId).get();
if (!tableOptional.isPresent()) {
    continue;
}
List<Flow> flows = tableOptional.get().getFlow();
The behavior: tableOptional is present, and getFlow() returns an empty list.
The observation: there ARE flow rules installed on ALL nodes on the tables I am querying, but for some reason, some of these nodes show none of these flows on any of the tables (here, tables 3, 4, 5, and 6).
The weirdness: on one of the problematic nodes, I have four rules, installed on tables 9, 13, 17, and 22 respectively. They time out simultaneously after 150 seconds. After they disappear, the query suddenly begins to "see" the flows installed on tables 3, 4, 5, and 6, returning them for each table.
Question: How is this even possible?
EDIT I just realized that the rules whose timeout "suddenly fixes everything" were also rules that generated warnings in ODL's log (OpenFlowPlugin, to be more specific). I did not observe any obvious issue, so I'd sort of brushed it aside.
Here is the code relevant to the error:
https://pastebin.com/yJDZesXU
Here are the errors I get every time I install a rule that walks through these lines:
https://pastebin.com/c9HYLBt6
I must stress that these rules work as intended, and that printing them out reveals no evident formatting issue. Again, they appear fine when dumped.
My hypothesis is that this warning is a symptom of ODL "messing up" when trying to store the rules in MD-SAL, which ends up breaking a lot of rule-reading queries. Once the garbage that ensues is uninstalled, rule-reading queries become functional again.
This makes sense to me, but then... I haven't understood how to fix these warnings, or what these warnings were about in the first place.
EDIT 2: By commenting out the lines suspected of causing the warnings in the above pastebin:
//ipv4MatchBuilder.setIpv4SourceAddressNoMask(...);
//ipv4MatchBuilder.setIpv4SourceArbitraryBitmask(...);
the warnings disappear, AND the flows appear correctly on all tables when pinged. This confirms my hypothesis that somewhere, something goes wrong in the data store.
EDIT 3: I have found that setting any non-trivial arbitrary bitmask makes this error go away. That is, I tried setting an arbitrary bitmask that was neither null nor "255.255.255.255", and the error disappeared. The problem is that I might like having a bitmask for the source but an exact match on the destination. Even setting the bitmask to "127.255.255.255" (as I tried) is still unnerving. It really feels to me like this is an OpenFlowPlugin glitch, though.
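In code, the workaround looks roughly like this (the mask value is illustrative; the builders are the same OpenFlowPlugin classes used in EDIT 4 below):

Ipv4MatchArbitraryBitMaskBuilder ipv4MatchBuilder = new Ipv4MatchArbitraryBitMaskBuilder();
ipv4MatchBuilder.setIpv4DestinationAddressNoMask(new Ipv4Address("10.0.0.1"));
// Any mask other than null or 255.255.255.255 avoided the warning:
ipv4MatchBuilder.setIpv4DestinationArbitraryBitmask(new DottedQuad("127.255.255.255"));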
EDIT 4: Steps to reproduce the bug
Install a rule with an IPv4 arbitrary-bitmask match, with the destination IP set and the destination arbitrary bitmask either null or set to 255.255.255.255.
Ipv4MatchArbitraryBitMaskBuilder ipv4MatchBuilder = new Ipv4MatchArbitraryBitMaskBuilder();
ipv4MatchBuilder.setIpv4DestinationAddressNoMask(new Ipv4Address("10.0.0.1"));
ipv4MatchBuilder.setIpv4DestinationArbitraryBitmask(new DottedQuad("255.255.255.255"));
matchBuilder = new MatchBuilder().setEthernetMatch(ethernetMatchBuilder.build()).setLayer3Match(ipv4MatchBuilder.build());
... and so on ...
Extra optional steps: install one such rule for the destination, one such rule for the source, and install equivalent rules where the bitmask is set to something else, like 127.255.255.255.
Make a query to MD-SAL to fetch flow information from the node on which you installed the flow rule.
Now run "log:display" inside your ODL controller. You should see a warning about a malformed destination address. Additionally, the Table object you queried should contain no flows, so tableObject.getFlow() should return an empty list.

Need help in Apache Camel multicast/parallel/concurrent processing

I am trying to achieve concurrent/parallel processing for my requirement, but I have not found appropriate help in my multiple attempts in this regard.
I have 5 remote directories (which may be added or removed) that contain log files. I want to download them to my local directory every 15 minutes and perform Lucene indexing after the FTP transfer job completes, and I want to add routes dynamically.
Since all those remote machines are different endpoints with different routes, I don't have a single endpoint to kick all of these off.
Start
<parallel>
<download remote dir from: sftp1>
<download remote dir from: sftp2>
....
</parallel>
<after the above tasks complete>
<start Lucene indexing>
<end>
Repeat the above every 15 minutes.
I want to download all folders in parallel. Kindly suggest a solution if anybody has worked on a similar requirement.
I would like to know how to start/initiate these multiple routes (one per remote directory) when I don't have a starting endpoint. I would like to run all FTP operations in parallel and start indexing once they complete. Thanks for taking the time to read this post; I really appreciate your help.
I tried like this:
from("bean:foo?method=start").multicast().to("direct:a").to("direct:b");
from("direct:a").from("sftp://xxx").to("file:localdir");
from("direct:b").from("sftp://xxx").to("file:localdir");
camel-ftp supports periodic polling via the consumer.delay property
add camel-ftp consumer routes dynamically for each server, as shown in this unit test
you can then aggregate your results based on a size or timeout value to initiate the Lucene indexing, etc.
Here is a minimal sketch of such a setup, assuming the Camel 2.x Java DSL (host names, credentials, directories, and the indexer bean are all hypothetical):
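import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.aggregate.GroupedExchangeAggregationStrategy;

public class LogDownloadRoutes extends RouteBuilder {
    // Hypothetical list of remote servers; in practice these could come from
    // configuration, with routes added at runtime via camelContext.addRoutes(...).
    private static final String[] HOSTS = {"host1", "host2", "host3"};

    @Override
    public void configure() {
        for (String host : HOSTS) {
            // Poll each remote directory every 15 minutes (delay is in milliseconds).
            from("sftp://user@" + host + "/logs?password=secret&delay=900000")
                .routeId("download-" + host)
                .to("file:target/downloads")
                .to("seda:downloaded");
        }

        // Gather the downloaded files and trigger indexing once the batch has
        // gone quiet for 60 seconds; the luceneIndexer bean is a placeholder
        // for whatever starts your indexing job.
        from("seda:downloaded")
            .aggregate(constant(true), new GroupedExchangeAggregationStrategy())
                .completionTimeout(60000)
            .to("bean:luceneIndexer?method=index");
    }
}

Each sftp consumer route starts on its own, so no single kick-off endpoint is needed; the aggregator is what ties the parallel downloads to the single indexing step.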

Sitecore Lucene indexing - file not found exception using advanced database crawler

I'm having a problem with Sitecore/Lucene in our Content Management environment; we have two Content Delivery environments where this isn't a problem. I'm using the Advanced Database Crawler to index a number of items of defined templates. The index points to the master database.
The index will remain 'stable' for a few hours or so, and then I will start to see the error below in the logs, as well as when I try to open a Searcher.
ManagedPoolThread #17 16:18:47 ERROR Could not update index entry. Action: 'Saved', Item: '{9D5C2EAC-AAA0-43E1-9F8D-885B16451D1A}'
Exception: System.IO.FileNotFoundException
Message: Could not find file 'C:\website\www\data\indexes\__customSearch\_f7.cfs'.
Source: Lucene.Net
at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run()
at Sitecore.Search.Index.CreateReader()
at Sitecore.Search.Index.CreateSearcher(Boolean close)
at Sitecore.Search.IndexSearchContext.Initialize(ILuceneIndex index, Boolean close)
at Sitecore.Search.IndexDeleteContext..ctor(ILuceneIndex index)
at Sitecore.Search.Crawlers.DatabaseCrawler.DeleteItem(Item item)
at Sitecore.Search.Crawlers.DatabaseCrawler.UpdateItem(Item item)
at System.EventHandler.Invoke(Object sender, EventArgs e)
at Sitecore.Data.Managers.IndexingProvider.UpdateItem(HistoryEntry entry, Database database)
at Sitecore.Data.Managers.IndexingProvider.UpdateIndex(HistoryEntry entry, Database database)
From what I've read, this can be due to an update on the index while there is an open reader: when a merge operation happens, the reader will still have a reference to the deleted segment, or something to that effect (I'm not an expert on Lucene).
I have tried a few things with no success, including subclassing the Sitecore.Search.Index object and overriding CreateWriter(bool recreate) to change the merge scheduler/policy and tweak the merge factor. See below.
protected override IndexWriter CreateWriter(bool recreate)
{
    IndexWriter writer = base.CreateWriter(recreate);
    LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
    policy.SetMergeFactor(20);
    policy.SetMaxMergeMB(10);
    writer.SetMergePolicy(policy);
    writer.SetMergeScheduler(new SerialMergeScheduler());
    return writer;
}
When I'm reading the index I call SearchManager.GetIndex(Index).CreateSearchContext().Searcher, and when I'm done getting the documents I need I call .Close(), which I thought would have been sufficient.
I was thinking I could perhaps also override CreateSearcher(bool close) to ensure I'm opening a new reader each time, which I will try next. I don't really know enough about how Sitecore handles Lucene and its readers/writers.
I also tried playing around with the UpdateInterval value in the web config to see if that would help; alas, it didn't.
I would greatly appreciate anyone who a) knows of any situations in which this could occur, or b) has any potential advice/solutions, as I'm starting to bang my head against a rather large wall :)
We're running Sitecore 6.5 rev111123 with Lucene 2.3.
Thanks,
James.
It seems like Lucene freaks out when you try to re-index something that is already in the process of being indexed. To verify that, try the following:
Set the UpdateInterval of your index to a really high value (e.g. 8 hours).
Then stop w3wp.exe and delete the index.
After deleting the index, rebuild it in Sitecore and wait for the rebuild to finish.
Test again and see if the error occurs.
If it no longer occurs, the UpdateInterval was set too low, causing your index (which is probably still being constructed) to be overwritten with a new one (which won't be finished either), so your segments.gen file ends up containing the wrong segment information.
This .gen file tells your index reader which segments are part of your index, and it is recreated after an index rebuild.
That's why I suggest disabling the updates for a long period and rebuilding the index manually.