Julia on a PBS cluster: what should I give to addprocs()? (ssh)

I'm trying to set up a cluster across machines on a PBS-managed cluster. I'm perfectly able to compute within one node by saying julia -p 12 (after having reserved one node with 12 CPUs).
I understand that to use several machines, I have to add them to the master process with addprocs. I was able to do that on a different cluster (SGE), but on this one something is going wrong.
You can see everything I'm doing, including submit scripts etc, on this branch of a github repo.
To get a list of machines, I parse the PBS_NODEFILE, which, for a submit script with the option
#PBS -l nodes=2:ppn=12 # give me 2 nodes with 12 processors each
looks something like this:
red0004
red0004
...
red0004
red0347
...
red0347
I parse this file with bind_pe_procs() in sge.jl in the repo and give a vector of machine names to addprocs. When I submit this, the job fails with an SSH error; I put the output up in a gist, and I don't know what it means.
Does this have to do with a system setting, i.e. do I have to talk to the sysadmin about SSH between machines? What are the right questions to ask?
I am also unsure about what exactly to give to addprocs(). I don't want to add the master process (I don't want worker 1 SSHing into itself?), so I exclude ENV["HOST"] = node001 from my list. But what about all the processors with the same name, node002? Do I list all of those,
machines = [ "red0347" for i=1:12 ]
or just once,
machines = ["red0347"]
in addprocs(machines)?
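For context, a minimal sketch of what I'm doing (the filtering logic here is illustrative, not exactly what bind_pe_procs() does):
# PBS_NODEFILE lists one hostname per reserved core.
# Note: addprocs SSHes into each host, so passwordless SSH from the
# master node (e.g. `ssh red0347`) must work without a prompt.
nodes    = map(chomp, readlines(ENV["PBS_NODEFILE"]))
master   = gethostname()
machines = filter(n -> n != master, nodes)   # e.g. 12 copies of "red0347"
# addprocs starts one worker per entry: the repeated list asks for 12
# workers on red0347, while unique(machines) would start just one there.
addprocs(machines)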
thanks!


BG95 Can't Activate - AT+QIACT=1 returning error

I'm trying to get a BG95 to activate on Hologram.
Here are my commands:
AT+QCFG="band",F,180A,180A OK
AT+QCFG="iotopmode",2 OK
AT+QCFG="nwscanseq",020301 OK
AT+QCFG="nwscanmode",0 OK
AT+QCFG="snrscan",0 OK
AT+QICSGP=1,1,"hologram","","",1 OK
AT+QIACT=1 ERROR
At first I thought it was antenna/signal related, so I ran AT+CSQ and got this:
+CSQ: 11,99
This tells me I have a good signal, I believe.
Next I tried AT+QNWINFO and get this:
+QNWINFO: "eMTC","311480","LTE BAND 13",5230
In my mind this is saying it's connected to a network.
After that I tried to activate again and got this:
AT+QIACT=1
ERROR
The weird thing is that it activated just fine about a week ago with pure AT commands. I did try to use an Arduino library with it (WisLTEBG96TCPIP), which may have changed a setting. I've done a factory reset, but it still won't activate.
Another strange thing is the Hologram dashboard. Every once in a while it will show the SIM as connected, even though I can't activate.
I have tried with 2 different SIM cards and get the same activation error.
Any help would be greatly appreciated!
Verizon has cut off all non-ODI products. If your hardware has not been Verizon ODI 'certified', it will no longer be allowed to connect to their network; I have 5 new pet rocks thanks to them. The solution is to purchase new modems from vendors that have been through the Verizon ODI program, or to switch carriers.
I had the same problem before. After a lot of emailing with the network operator, I found out that there isn't an LTE Cat-M1 (eMTC) network in my area; I tested in another area successfully.
Also, before sending the AT+QCFG commands try AT+CFUN=0,
and after the AT+QCFG commands try AT+CFUN=1.
Before sending AT+QIACT, try the 'AT+CEREG?' command several times and tell me what it returns.
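For reference, a hedged example of that sequence (the responses shown are illustrative of a successful registration; in +CEREG: <n>,<stat>, a <stat> of 1 means registered on the home network and 5 means registered while roaming):
AT+CFUN=0
OK
AT+QCFG="iotopmode",2
OK
AT+CFUN=1
OK
AT+CEREG?
+CEREG: 0,5
OK
AT+QIACT=1
OK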

Inconsistent behavior of Quartz2 scheduler in Apache Camel

I have an Apache Camel project that uses Quartz2 as the scheduler. The requirement is to make it a cluster. The code is deployed to WebLogic 12c, and Quartz is configured with clustering enabled, as in many samples.
This is my properties file (without the datasource)
org.quartz.scheduler.instanceName = MyScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 10
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.OracleDelegate
org.quartz.jobStore.useProperties=true
org.quartz.JobBuilder.requestRecovery=true
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
When I deploy and start both nodes, I see that the QRTZ_SCHEDULER_STATE table has an extra entry for one of the nodes:
MyScheduler-routerContext server_node21567108546690
MyScheduler-routerContext-1 server_node11565896495100
MyScheduler-routerContext-1 server_node11567108547295
And I am guessing that, because of this, one node is invoked only once in a while whereas the other node gets called all the time (so occasionally both nodes are invoked at the same time).
I have tried a clean restart of the WebLogic nodes, but the issue is still there.
This is what my route(s) look like:
from("quartz2://provRegGroup/createUsersTrigger?cron={{create_users_cron}}&job.name=createUsersJob")
.routeId("createUsersRB")
.log("**** starting check for create users");
//where
//create_users_cron=0+0,5,10,15,20,25,30,35,40,45,50,55+*+*+*+?
//expecting one node being called by the scheduler at a time..
I figured out what caused the issue. Apparently there were orphan WebLogic processes running on one (or even both) of the nodes; why things were in such a mess would be a question for our tech archs. ps was showing two WebLogic servers running on a node: one that I had started recently and one that had been there for, say, a month.
Expecting that this would never happen in a production environment, I assume the issue has been resolved.
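For anyone chasing the same symptom, a hedged way to spot the duplicates (the managed-server name here is illustrative):
# two java processes carrying the same -Dweblogic.Name is the red flag
ps -ef | grep 'weblogic.Name=server_node1'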

Why are my flows not connecting?

Just starting out with noflo, I'm baffled as to why I can't get a simple flow working. I started today, installing noflo and the core components following the example pages, and the canonical "Hello World" example
Read(filesystem/ReadFile) OUT -> IN Display(core/Output)
'package.json' -> IN Read
works... so far so good. Then I wanted to change it slightly, adding "noflo-rss" to the mix and changing the example to
Read(rss/FetchFeed) OUT -> IN Display(core/Output)
'http://xkcd.com/rss.xml' -> IN Read
Running it like this:
$ /node_modules/.bin/noflo-nodejs --graph graphs/rss.fbp --batch --register=false --debug
... but no cigar: it just sits there with no output at all.
If I stick a console.log into the source code of FetchFeed.coffee
parser.on 'readable', ->
  while item = @read()
    console.log 'ITEM', item  # hack: log each item here
    out.send item
then I do see the output and the content of the RSS feed.
Question: Why does out.send in rss/FetchFeed not feed the data to the core/Output for it to print? What dark magic makes the first example work, but not the second?
When running with --batch, the process will exit once the network has stopped, as determined by there being zero open connections between nodes.
The problem is that rss/FetchFeed does not open a connection on its outport, so the connection count drops to zero and the process exits.
One workaround is to run without --batch. Another one I just submitted as a pull request (needs review).
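That is, the same invocation as in the question minus the --batch flag, which keeps the network (and the process) alive:
$ /node_modules/.bin/noflo-nodejs --graph graphs/rss.fbp --register=false --debug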

Authentication failures in cassandra when 1 of 16 nodes is down

I have a Cassandra cluster running:
Cassandra 2.0.11.83 | DSE 4.6.0 | CQL spec 3.1.1 | Thrift protocol 19.39.0
The cluster has 18 nodes, split among 3 datacenters, 6 in each. My system_auth keyspace has the following replication defined:
replication = {
'class': 'NetworkTopologyStrategy',
'DC1': '4',
'DC2': '4',
'DC3': '4'}
and my authenticator/authorizer are set to:
authenticator: org.apache.cassandra.auth.PasswordAuthenticator
authorizer: org.apache.cassandra.auth.CassandraAuthorizer
This morning I brought down one of the nodes in DC1 for maintenance. Within a few seconds to a minute, client applications started logging exceptions like this:
"User my_application_user has no MODIFY permission on or any of its parents"
Running 'LIST ALL PERMISSIONS OF my_application_user' on one of the other nodes shows that user to have SELECT and MODIFY on the keyspace xxxxx, so I am rather confused. Do I have a setup issue? Is this a bug of some sort?
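For reference, the exact check run in cqlsh (output shape illustrative for this Cassandra version):
LIST ALL PERMISSIONS OF my_application_user;
 username            | resource         | permission
---------------------+------------------+------------
 my_application_user | <keyspace xxxxx> | SELECT
 my_application_user | <keyspace xxxxx> | MODIFY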
Re-posting this as the answer, as BrianC suggested above.
So this is resolved... Here's the sequence of events that seems to have fixed it:
Add 18 more nodes
Run cleanup on original nodes (this was part of the original plan)
Run a scrub on 1 table, since it was throwing exceptions on cleanup
Run a repair on the system_auth KS on the original troubled node
Wait for repair service to complete a full pass on all keyspaces
Decom original 18 nodes.
Honestly, I don't know which step fixed it. The system_auth repair makes the most sense, but what doesn't make sense is that the repair service had already completed many passes before, so why it worked now I don't know. I hope this at least helps someone.
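For completeness, the system_auth repair in step 4 above is just a keyspace-scoped nodetool repair, run on the troubled node:
nodetool repair system_auth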

Need help in Apache Camel multicast/parallel/concurrent processing

I am trying to achieve concurrent/parallel processing for my requirement, but I have not gotten appropriate help despite multiple attempts.
I have 5 remote directories (which may be added or removed) containing log files. I want to download them to my local directory every 15 minutes, and then perform Lucene indexing once the FTP transfer job completes. I also want to add routes dynamically.
Since all those remote machines are different endpoints on different routes, I don't have any single endpoint to kick all of this off.
Start
<parallel>
  <download remote dir from: sftp1>
  <download remote dir from: sftp2>
  ...
</parallel>
<after above tasks complete>
<start Lucene indexing>
<end>
Repeat the above every 15 minutes.
I want to download all the folders in parallel. Kindly suggest a solution if anybody has worked on a similar requirement.
I would like to know how to start/initiate these multiple routes (one per remote directory) when I don't have a starting endpoint. I would like to run all the FTP operations in parallel and, once they complete, start the indexing. Thanks for taking the time to read this post; I really appreciate your help.
I tried something like this:
from("bean:foo?method=start").multicast().to("direct:a").to("direct:b");
from("direct:a").from("sftp:xxx").to("file://localdir");
from("direct:b").from("sftp:xxx").to("file://localdir");
camel-ftp supports periodic polling via the consumer.delay property
you can add camel-ftp consumer routes dynamically, one per server, as shown in this unit test
you can then aggregate the results based on a size or timeout value to initiate the Lucene indexing, etc.
Roughly, something like this (the endpoint URIs, the polling delay, and the luceneIndexer bean are illustrative assumptions, not taken from your setup):
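import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.aggregate.GroupedExchangeAggregationStrategy;
public class FtpIndexRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // one polling consumer route per remote directory; delay=900000 ms = 15 minutes
        String[] feeds = {
            "sftp://user@host1/logs?password=secret&delay=900000",
            "sftp://user@host2/logs?password=secret&delay=900000"
        };
        for (String feed : feeds) {
            from(feed)
                .to("file://localdir")
                .setHeader("batch", constant("log-download"))
                .to("direct:collect");
        }
        // gather the downloads and kick off indexing once enough have arrived;
        // completionSize here naively assumes one file per host per poll, so
        // tune completionSize/completionTimeout for your real traffic
        from("direct:collect")
            .aggregate(header("batch"), new GroupedExchangeAggregationStrategy())
                .completionSize(feeds.length)
                .completionTimeout(60000)
            .to("bean:luceneIndexer?method=index"); // hypothetical indexing bean
    }
}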