RandomSortField not giving the same result in Apache Solr 5.5

I have Apache Solr 5.5 running. On environments other than live, RandomSortField works fine, because no reindexing happens and the index version does not change. On live, however, the results keep changing even with the same random string.
For example:
http://localhost:8983/solr/select/?q=*:*&fl=name&sort=random_1234%20desc
Hitting this twice won't give me the same result on the live environment.
I have checked this question: "Solr: Random sort order after index version change",
but I can't find this file on my Solr instance.

In my experience, this functionality always behaves unexpectedly with SolrCloud, even when overriding with a custom implementation of this functionality. I suspect this is because of differing timestamp values of the same documents across instances / shards / replicas.
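To illustrate why the order can shift between two identical requests, here is a minimal Python sketch. It is not Solr's code: it only assumes, as the title of the linked question suggests, that the effective seed mixes the field name with the index version. The function names and the hash-based scoring are hypothetical.

```python
import hashlib


def _score(*parts):
    """Deterministic pseudo-random score derived from the given parts."""
    text = "|".join(str(p) for p in parts)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return int(digest[:16], 16)


def version_dependent_order(doc_ids, field_name, index_version):
    # Assumption: the seed includes the index version, so every commit
    # that bumps the version reshuffles the whole result list.
    return sorted(
        doc_ids,
        key=lambda doc_id: _score(field_name, index_version, doc_id),
        reverse=True,
    )


def stable_order(doc_ids, field_name):
    # Hypothetical alternative: the seed depends only on the field name
    # and the document id, so the order survives reindexing.
    return sorted(
        doc_ids,
        key=lambda doc_id: _score(field_name, doc_id),
        reverse=True,
    )
```

With the first function, the same random_1234 field yields a different order as soon as the version number changes, which matches the behaviour seen on live. The second function shows what a version-independent ordering would have to be keyed on.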

How can I get more results from anzograph

I am using anzograph with SPARQL through HTTP using RDFlib. I do not specify any limits in my query, and still I only receive 1000 solutions. The same seems to happen on the web interface.
If I fire the same query on other triple stores with the same data, I do get all results.
Moreover, if I fire this query using their command line tool on the same machine as the database, I do get all results (millions). Maybe it is using a different protocol with the local database. If I specify the hostname and port explicitly on the command line, I get 1030 results...
Is there a way to specify that I want all results from anzograph over http?
I have found the service_graph_rowset_limit setting and changed its value to 100000000 in both config/settings_standalone.conf and config/settings.conf (and restarted the database), but to no avail.
Let me start by thanking you for pointing this issue out.
You have identified a regression caused by a fix that was intended to protect the web UI from freezing on unbounded result sets, but that affected regular SPARQL endpoint users as well.
Our Anzo customers do not see this issue, as they use the internal gRPC API directly.
We have produced a fix that will be in our upcoming anzograph 2.4.0 and in our upcoming patch release 2.3.2 set of images.
Older releases will receive this fix as well (when we have a shipment vehicle).
If this is urgent for you, I can provide you with a point fix (root.war file).
What exact image are you using?
Best - Frank
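Until a fixed release is installed, one possible workaround is to page through the results explicitly. The sketch below is only an illustration: run_query is a hypothetical callable that sends a SPARQL string to the endpoint and returns a list of rows, and it assumes the endpoint honors LIMIT and OFFSET past the first 1000 rows, which this thread does not confirm. The base query needs an ORDER BY so that pages do not overlap.

```python
def fetch_all(run_query, base_query, page_size=1000):
    """Collect every solution by requesting successive LIMIT/OFFSET pages.

    run_query: hypothetical callable taking a SPARQL string and returning
               a list of result rows.
    base_query: a SELECT query that already contains an ORDER BY clause.
    """
    rows = []
    offset = 0
    while True:
        page = run_query(f"{base_query}\nLIMIT {page_size} OFFSET {offset}")
        rows.extend(page)
        # A short (or empty) page means the result set is exhausted.
        if len(page) < page_size:
            break
        offset += page_size
    return rows
```

Keeping page_size at or below the observed cap of 1000 means each individual response stays within what the endpoint currently returns.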

Liferay 6.2 Lucene replication in cluster

I'd welcome any help regarding a simple issue: I have a clustered environment and I enabled Lucene replication in the properties (lucene.replicate.write=true). Now, all the tutorials are instructing me to reindex Lucene.
Should I run it on one node? On both? Simultaneously or sequentially?
This question has been asked in Liferay Forum as well: https://www.liferay.com/community/forums/-/message_boards/view_message/69175435.
Thank you!
Basically, what I did at first was the following:
cluster.link.enabled=true
lucene.replicate.write=true
and the result was that replication was NOT WORKING.
What I tried next was to set this issue aside and continue with clustering the rest of the portal, which in the end helped Lucene as well. My steps were:
deploy cluster activation keys
deploy ehcache-cluster-web.war
portal-ext.properties:
cluster.link.enabled=true
cluster.link.autodetect.address=<COMMONLY_ACCESSIBLE_IP_AND_PORT>
lucene.commit.batch.size=1
lucene.commit.time.interval=5000
lucene.replicate.write=true
ehcache.cluster.link.replication.enabled=true
cluster.link.channel.properties.control=<PATH_TO_XML>
cluster.link.channel.properties.transport.0=<PATH_TO_XML>
portal.instance.protocol=http
portal.instance.http.port=8080
setenv.sh:
-Djava.net.preferIPv4Stack=true
-Djgroups.bind_addr=<IP_OF_THE_NODE>
edit the clusterlink_control and clusterlink_transport files as described in the Liferay tutorials
when the servers are shut down, delete the contents of data/lucene, then run a reindex from the Control Panel on one node
In the end, Lucene replication IS WORKING. What I think could be significant is the following. First, the portal.properties explanation of the lucene.commit.* keys is rather hard to comprehend. By trial and error I found out that these two keys are in an AND relation. I also found out about the portal.instance.* keys, which are used for multiple purposes in clustering and can matter if you have load balancers and/or Apache servers between the nodes and autodetection fails.
There are multiple ways to configure search clustering in Liferay. If you use the lucene.replicate.write=true way, you're looking at several reindexing runs: On every restart of a server you must reindex that server's documents, as it might have missed indexing requests when it was down.
So, short answer: Don't worry, reindex both. Sooner or later you'll do it anyways, no matter if you need only one now.

Redis issue - INCR increments by too many

I'm running PHP-FPM with Redis on AWS.
Currently I'm having a really strange issue that I can't seem to figure out.
When I INCR or HINCRBY and increment by 1 it always increments by around 20 to 30 instead.
I have tried the following:
Commented out all other redis code (no change).
Set up a single PHP page using the same code outside of the site (this works fine - increments by 1).
In the main site (that is having the issue) I put the code in the header, after the last HTML tag and other places and it behaves the same.
I have an AJAX page within the site which is invoked separately if requested and this works fine. Therefore the issue only occurs during the main site load.
I've tested redis-cli using the commands and this works fine.
I can't seem to find any logs on the AWS Redis system to read, so I'm not sure exactly what is occurring here, but it appears the command is running multiple times.
I also read the value back after it's written and the value reports correctly. So the increment seems to work - however, when I re-check Redis using a GUI tool I can see it’s increased by a much larger number.
I'm really at a loss for what to try next and was hoping someone might have some advice.
Thank you.
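One way to test the suspicion that the command runs several times per page load is to record which request triggered each increment. The sketch below is in Python and uses an in-memory stand-in rather than a real Redis client; the class, the key name, and the request URIs are all hypothetical, and the simulated page load is just one scenario that would produce extra increments. On the real site the equivalent would be to log the request URI next to every INCR call.

```python
from collections import Counter


class CountingStore:
    """In-memory stand-in for Redis that remembers who asked for each increment."""

    def __init__(self):
        self.values = Counter()   # key -> current counter value
        self.callers = Counter()  # (key, request URI) -> number of increments

    def incr(self, key, request_uri):
        self.callers[(key, request_uri)] += 1
        self.values[key] += 1
        return self.values[key]


store = CountingStore()

# Hypothetical page load: the page itself plus asset requests that end up
# being handled by the same script and therefore run the same increment.
for uri in ["/index.php", "/missing.png", "/favicon.ico", "/index.php"]:
    store.incr("page_views", uri)

# Show how many increments each request URI was responsible for.
for (key, uri), count in sorted(store.callers.items()):
    print(f"{key} incremented {count}x by {uri}")
```

If the per-URI breakdown on the real site shows many different URIs, the extra increments come from additional requests rather than from Redis itself.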

How to maintain lucene indexes in azure cloud-app

I just started playing with the Azure Library for Lucene.NET (http://code.msdn.microsoft.com/AzureDirectory). Until now, I was using my own custom code for writing Lucene indexes to the Azure blob. So, I was copying the blob to the local storage of the Azure web/worker role and reading/writing docs to the index. I was using my custom locking mechanism to make sure we don't have clashes between reads and writes to the blob. I am hoping the Azure Library will take care of these issues for me.
However, while trying out the test app, I tweaked the code to use the compound-file option, and that created a new file every time I wrote to the index. Now, my question is: if I have to maintain the index, i.e. keep a snapshot of the index file and use it if the main index gets corrupted, how do I go about doing this? Should I keep a backup of all the .cfs files that are created, or is handling only the latest one fine? Are there API calls to clean up the blob so that only the latest file is kept after each write to the index?
Thanks
Kapil
After I answered this, we ended up changing our search infrastructure and used Windows Azure Drive. We had a Worker Role, which would mount a VHD using the Block Storage, and host the Lucene.NET index on it. The code checked to make sure the VHD was mounted first and that the index directory existed. If the worker role fell over, the VHD would automatically dismount after 60 seconds, and a second worker role could pick it up.
We have since changed our infrastructure again and moved to Amazon with a Solr instance for search, but the VHD option worked well during development. It could have worked well in Test and Production, but requirements meant we needed to move to EC2.
I am using AzureDirectory for full-text indexing on Azure, and I am getting some odd results too... but hopefully this answer will be of some use to you...
Firstly, the compound-file option: from what I am reading and figuring out, the compound file is a single large file with all the index data inside. The alternative to this is having lots of smaller files (configured using the SetMaxMergeDocs(int) function of IndexWriter) written to storage. The problem with this is that once you get to lots of files (I foolishly set this to about 5000), it takes an age to download the indexes (on the Azure server it takes about a minute; on my dev box... well, it's been running for 20 minutes now and still hasn't finished...).
As for backing up indexes, I have not come up against this yet, but given we have about 5 million records currently, and that will grow, I am wondering about this too. If you are using a single compound file, maybe downloading the files to a worker role, zipping them and uploading them with today's date would work... If you have a smaller set of documents, you might get away with re-indexing the data if something goes wrong... but again, it depends on the number...
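The "zip them with today's date" idea could look roughly like the sketch below. It is written in Python only to show the shape of the step; the function name, directory layout, and archive naming are hypothetical, the upload to blob storage is left out, and it assumes nothing is writing to the index while the files are being copied.

```python
import datetime
import pathlib
import zipfile


def backup_index(index_dir, backup_dir, today=None):
    """Zip every file under index_dir into backup_dir/index-YYYY-MM-DD.zip."""
    index_dir = pathlib.Path(index_dir)
    backup_dir = pathlib.Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    today = today or datetime.date.today()
    archive = backup_dir / f"index-{today.isoformat()}.zip"

    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(index_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the index root so the archive
                # can be unpacked into any directory later.
                zf.write(path, str(path.relative_to(index_dir)))
    return archive
```

Because the archive name carries the date, running this once a day keeps one snapshot per day, and old snapshots can be pruned by name.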

Managing ajax CouchDB calls and IE's (HTA) aggressive cache

I'm having a quite annoying problem, and came up with a quite ugly hack to make it work.
I develop an HTA application using a CouchDB database (for internal company use). The problem is that there seems to be some very aggressive caching of the database queries, and it's been hard to come up with solutions.
So the updated data in the database just won't come up in the browser, which still has the previous request results in its cache, until the entire app is started anew.
Oh, and CouchDB (or its mochiweb server) doesn't allow unknown GET variables, so the usual solution of appending some sort of timestamp just won't work.
I have found some sort of solution, but it's damn ugly. The solutions are:
Only open documents with latest revision number (easy and nice, won't work on views)
Use apache as forward proxy listening to 200+ ports, and select one at random on each read query. (that's the ugly one).
HTA accepts ajax calls to other ports (maybe even on other domains, strange behaviour), so it works nicely. I just have a 1/200 chance that new data won't come up, but that's still better than 1/1; I can live with that.
So what I'm asking is: is there a better solution to this? Can I hack into the mochiweb server to modify the cache headers (and hope they're not going to be ignored)? Is there a special unknown "throwaway" key I could use in the URLs to append some random string? Or is there a way to tell HTA not to cache anything (from within the app, as this is supposed to work on hundreds of computers)?
It's still ugly, but slightly less ugly than your current Apache setup: can't you use an Apache rewrite rule that lets you set an arbitrary no_cache attribute on the URL? Apache can throw it away so CouchDB won't see it.
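To make the two halves of that idea concrete, here is a small Python sketch of what the client and the proxy would each do to the URL. It is an illustration only, not Apache configuration: the function names and the example URL are hypothetical, and in the real setup the stripping would be done by the rewrite rule.

```python
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PARAM = "no_cache"  # throwaway parameter name taken from the suggestion above


def add_cache_buster(url):
    """Client side: append a unique no_cache value so the browser sees a new URL."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((PARAM, uuid.uuid4().hex))
    return urlunsplit(parts._replace(query=urlencode(query)))


def strip_cache_buster(url):
    """Proxy side: drop no_cache again so the database never sees it."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != PARAM
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Every request the app sends then has a URL the cache has never seen, while the URL that reaches CouchDB is unchanged.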