ThinkingSphinx (sphinxd) on remote database server with delta indexes? - ruby-on-rails-3

I'm working on setting up a simple multi-tier Rails 3.1 setup -- web apps on one or more servers, postgresql database and our Sphinx search indexes on a remote server.
On a single-server setup we're using ThinkingSphinx, and delta indexes (using delayed_job), then a nightly cron to update the main index. Works great.
So: user creates indexable content; app tells delayed_job to schedule an update; delta-indexer adds new content to delta-index; searches look at both to resolve the search query properly; nightly job recreates single main index.
The ThinkingSphinx documentation says here, near the bottom:
The best approach is to have Sphinx, the database and the delayed job processing task all running on one machine.
But I am unclear on how the information needed by the delayed job process gets sent over to that single server to be run. I have read some stuff about having a shared file system (yuck -- really?). I haven't read the code yet, but maybe there's a simple way?
Here's hoping!

The delayed job worker (running on your DB/Sphinx server) references the database, within the context of your Rails app - so you'll need the app on your DB/Sphinx server as well, but just to run the DJ worker.
From the perspective of your app servers, TS will just add job records to the database as per normal.
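For what it's worth, starting that worker is nothing special - assuming the stock delayed_job setup (the daemon script below is the one the delayed_job generator creates; adjust paths and environment to your deploy), it's just:
RAILS_ENV=production script/delayed_job start
# or, to run a worker in the foreground instead:
RAILS_ENV=production rake jobs:work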
You'll also want the following settings - this one goes at the end of your config/application.rb:
ThinkingSphinx.remote_sphinx = Rails.env.production?
And add the Sphinx version to your config/sphinx.yml:
production:
  version: 2.0.1-beta
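For completeness, the nightly rebuild of the main index mentioned in the question is then just a cron entry on the same DB/Sphinx box calling ThinkingSphinx's reindex task - something like this (the path, schedule and environment are only examples):
0 3 * * * cd /var/www/myapp && RAILS_ENV=production bundle exec rake ts:index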

Related

Distributed JobRunr using single data source

I want to create a scheduler using JobRunr that will run on two different servers. This scheduler will select data from a SQL database and call an API endpoint. How can I make sure these two schedulers, running on two different servers, will not pick the same data from the database?
The main concern is that they should not call the API with duplicate data from the two servers.
As per the documentation, JobRunr pushes the job into the queue first, but I am wondering how one scheduler's queue will know that the same data has not already been picked by the other scheduler on the other server. Is there any extra locking mechanism I need to maintain?
JobRunr will run any job only once - the locking is already done in JobRunr itself.
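A rough sketch of what that looks like in practice - the class and method names here are made up, but the point is that both servers register the same recurring job id against the same SQL StorageProvider, so JobRunr de-duplicates it and each run is executed by exactly one server:
import org.jobrunr.scheduling.BackgroundJob;

public class SchedulerConfig {
    private final DataPoller dataPoller; // hypothetical bean that selects the rows and calls the API

    public SchedulerConfig(DataPoller dataPoller) {
        this.dataPoller = dataPoller;
    }

    public void registerJobs() {
        // Same id on both servers -> one logical recurring job, run once per schedule
        BackgroundJob.scheduleRecurrently("poll-and-call-api", "*/5 * * * *",
                () -> dataPoller.pollAndCallApi());
    }
}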

High availability of scheduled agents in clustered environment

One of my applications has been deemed business-critical and I'm trying to figure out a way to make my scheduled agents behave correctly in cases of failover. It doesn't need to be automatic, but an admin should be able to 'transfer' the running of the agents from one server to the other.
One solution I was considering is to store the 'active' server in a profile document and have the agents (there are four: one in Java and three in LotusScript) check whether they are currently running on the 'active' server, and if not, stop immediately.
Then there is IBM's workaround suggestion: http://www-01.ibm.com/support/docview.wss?uid=swg21098304 of making three agents, one 'core' agent which gets called by a 'main agent' running on the main server, and a 'failover agent' running on the failover server, but only if the 'main server' is available.
But that solution seems a bit clunky to me, that's going to be lots of agents that need to be set up in a fiddly fashion.
What would you recommend?
My logic is similar to yours, but I don't use profile documents (caching is a bad thing for such an important task); instead I use a central configuration document.
Agents are scheduled to run on every server.
First they read the "MasterAppServer" from the config document. If it is another server, then they try to open the database (or names.nsf, depending on what you want) on the master server. If the database can be opened, everything is OK and the agent stops its work. If it cannot be opened, then the agent assumes that the other server is down, changes the MasterAppServer field in the config document to its own server, and runs.
Of course I write a log entry in the config document whenever "MasterAppServer" changes.
That works quite well and does not need any admin intervention when one server is down.
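A rough LotusScript sketch of that check (the view and item names are examples, not taken from my real code):
Sub Initialize
    On Error Resume Next
    Dim session As New NotesSession
    Dim db As NotesDatabase
    Dim config As NotesDocument
    Dim probe As New NotesDatabase("", "")
    Dim master As String
    Set db = session.CurrentDatabase
    Set config = db.GetView("(Config)").GetFirstDocument
    master = config.GetItemValue("MasterAppServer")(0)
    If master <> db.Server Then
        ' Probe the master: if names.nsf opens, the master is alive, so this agent stops
        If probe.Open(master, "names.nsf") Then Exit Sub
        ' Master seems down: take over and log the change in the config document
        Call config.ReplaceItemValue("MasterAppServer", db.Server)
        Call config.Save(True, False)
    End If
    ' ...the agent's real work goes here...
End Sub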

Running multiple Kettle transformation on single JVM

We want to use pan.sh to execute multiple Kettle transformations. After exploring the script, I found that it internally calls the spoon.sh script which runs PDI. Now the problem is that every time a new transformation starts, it creates a separate JVM for its execution (invoked via a .bat file); however, I want to group them so they use a single JVM, to overcome the memory constraints that the multiple JVMs are putting on the batch server.
Could somebody guide me on how I can achieve this, or share documentation/resources with me.
Thanks for the good work.
Use Carte. This is exactly what it is for. You can start up a server (on the local box if you like) and then submit your jobs to it. One JVM, one heap, shared resources.
The benefit of that is scalability: when your box becomes too busy, just add another one, also running Carte, and start sending some of the jobs to that other server.
There's an old but still current blog here:
http://diethardsteiner.blogspot.co.uk/2011/01/pentaho-data-integration-remote.html
As well as documentation on the Pentaho website.
Starting the server is as simple as:
carte.sh <hostname> <port>
There is also a status page, which you can use to query your Carte servers, so if you have a cluster of servers you can pick a quiet one to send your job to.
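The status page is served by Carte itself; by default it lives at a URL of this form (the default credentials are cluster/cluster unless you have changed kettle.pwd):
http://<hostname>:<port>/kettle/status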

In Endeca I want to have dgraph backups saved on dgraph server automatically after baseline update.

How can I have the backups for my 3 dgraphs saved automatically on the dgraph server and not on the ITL server? By default, the backup of the dgidx output gets saved on the ITL server. I want it saved on the dgraph server, i.e. the MDEX host. Please help.
I don't believe there is an out-of-the-box option for backing up the deployed dgidx output on the target server. Have you gone through the documentation? I would also question whether it is a good idea. Say you are deploying and two of the three servers go through successfully but the third one fails: you now need to roll back only two of the machines, and your central EAC will not know which ones to roll back and which ones to keep. However, by keeping it all at a central point (i.e. on the ITL server), in the event of a rollback you will always push the same backup out to all three servers.
Assuming that you are trying to speed up the deployment of very large indices (Endeca copies the entire dgidx output to each MDEX), you can always look at the performance tuning guide.
You should be able to do this in any number of ways:
In any baseline update, dgidx_output is automatically copied to each dgraph server. You should be able to add a copy or archive job as a pre-shutdown task for your dgraph.
You could also create a custom copy job for each dgraph server that would run at the end or beginning of a baseline update.
Or it could be decoupled from your baseline update entirely.
To the point radimpe makes: making copies on the dgraph servers is not that hard; rather, it's the rollback process you really need to consider. You need to set that up and ensure it uses whatever backup copies you've made, whether they are local to the ITL machine or on the dgraph servers.
Also know that dgidx_output will not include any partial-update information added since the index was created. Partial update info will only be available in the dgraph_input on the dgraph servers. Accordingly, if you incorporate partial updates, you should archive the dgraph input and make that available for any rollback job.
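As a rough illustration of that kind of archive step (the paths here are placeholders for your own deployment layout), a job running on each MDEX host could simply tar up the current dgraph input:
tar -czf /backups/dgraph_input_$(date +%Y%m%d).tar.gz /apps/MyApp/data/dgraphs/Dgraph01/dgraph_input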
You can create a DGRAPH post-startup task and assign it in the dgraph definitions. It will be executed on each MDEX startup:
<dgraph id="Dgraph01" host-id="LiveMDEXHost01" port="10000"
        pre-shutdown-script="DgraphPreShutdownScript"
        post-startup-script="DgraphPostStartupScript">

<script id="DgraphPostStartupScript">
  <bean-shell-script>
    <![CDATA[
      ...code to backup here
    ]]>
  </bean-shell-script>
</script>

How to flush out suspended WCF workflows from the instancestore?

We have identified the need to flush out several different workflows that have been suspended/persisted for a long time (i.e. hung instances). This is so that our test environment can be flushed clean before acceptance tests are re-run.
The dirty solution is to use a SQL script to remove records from the InstancesTable and other related tables in the database.
What's the proper solution?
These are WCF workflows.
Test rig is running XP.
Using AppFabric you can use the UI, or I assume PowerShell commands, to delete individual instances. For development and test purposes I normally just recreate the database by running the SqlWorkflowInstanceStoreSchema.sql script again.
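For example, something along these lines against the test instance store - the server and database names are placeholders, and the scripts ship with .NET 4.0 (the companion SqlWorkflowInstanceStoreLogic.sql is normally run as well):
sqlcmd -S .\SQLEXPRESS -d InstanceStore -i "C:\Windows\Microsoft.NET\Framework\v4.0.30319\SQL\en\SqlWorkflowInstanceStoreSchema.sql"
sqlcmd -S .\SQLEXPRESS -d InstanceStore -i "C:\Windows\Microsoft.NET\Framework\v4.0.30319\SQL\en\SqlWorkflowInstanceStoreLogic.sql"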
Found a way to do it (thanks to Pablo Rotondo on MSDN):
http://www.funkymule.com/post/2010/04/28/how-to-resume-suspended-workflows-in-net-40.aspx