SLURM: Should there be a different gres.conf for each node?

SLURM: Should there be a different gres.conf for each node? - gpu

When configuring a slurm cluster you need to have a copy of the configuration file slurm.conf on all nodes. These copies are identical. In the situation where you need to use GPUs in your cluster you have an additional configuration file that you need to have on all nodes. This is the gres.conf. My question is - will this file be different on each node depending on the configuration on that node or will it be identical on all nodes (like slurm.conf?). Assume that the nodes have different configurations of gpus in them and are not identical.

Since Slurm version 14.3.0, the gres.conf accepts a NodeName parameter so that the same file can be setup on all nodes.
From the NEWS file:
gres.conf - Add "NodeName" specification so that a single gres.conf file
can be used for a heterogeneous cluster.
It will thus look something like this:
NodeName=node001 Name=gpu File=/dev/nvidia0
NodeName=node002 Name=gpu File=/dev/nvidia[0-1]
...
Before that, the gres.conf file had to be distinct for each node.

Related

Fluentbit Cloudwatch templating with EKS and Fargate

I've an EKS cluster purely on Fargate and I'm trying to setup the logging to Cloudwatch.
I've a lot of [OUTPUT] sections that can be unified using some variables. I'd like to unify the logs of each deployment to a single log_stream and separate the log_stream by environment (name_space). Using a couple of variable I'd need just to write a single [OUTPUT] section.
For what I understand the new Fluentbit plugin: cloudwatch_logs doesn't support templating, but the old plugin cloudwatch does.
I've tried to setup a section like in the documentation example:
[OUTPUT]
Name cloudwatch
Match *container_name*
region us-east-1
log_group_name /eks/$(kubernetes['namespace_name'])
log_stream_name test_stream
auto_create_group on
This generates a log_group called fluentbit-default that according to the README.md is the fallback name in case the variables are not parsed.
The old plugin cloudwatch is supported (but not mentioned in AWS documentation) because if I replace the variable $(kubernetes['namespace_name']) with any string it works perfectly.
Fluentbit in Fargate manages automatically the INPUT section so I don't really know which variables are sent to the OUTPUT section, I suppose the variable kubernetes is not there or it has a different name or a different array structure.
So my questions are:
Is there a way to get the list of the variables (or input) that Fargate + Fluentbit are generating?
Get I solve that in a different way? (I don't want to write more than 30 different OUTPUT one for each service/log_stream_name. It would be also difficult to maintain it)
Thanks!

After few days of tests, I've realised that you need to enable the kubernetes filter to receive the kubernetes variables to the cloudwatch plugin.
This is the result, and now I can generate log_group depending on the environment label and log_stream depending of the container-namespace names.
filters.conf: |
[FILTER]
Name kubernetes
Match *
Merge_Log Off
Buffer_Size 0
Kube_Meta_Cache_TTL 300s
output.conf: |
[OUTPUT]
Name cloudwatch
Match *
region eu-west-2
log_group_name /aws/eks/cluster/$(kubernetes['labels']['app.environment'])
log_stream_name $(kubernetes['namespace_name'])-$(kubernetes['container_name'])
default_log_group_name /aws/eks/cluster/others
auto_create_group true
log_key log
Please note that the app.environment is not a "standard" value, I've added it to all my deployments. The default_log_group_name is necessary in case that value is not present.
Please note also that if you use log_retention_days and new_log_group_tags the system is not going to work. To be honest log_retention_days it never worked for me also using the new cloudwatch_logs plugin either.

Where to put common variables for groups in Ansible

We have some scripts to help us set up VPCs with up to 6 VMs in AWS. Now I want to log in to each of these machines. For security reasons we can only access one of them via SSH and then tunnel/proxy through that to the other machines. So in our inventory we have the IP address of the SSH host (we call it Redcarpet) and some other hosts like Elasticsearch, Mongodb and Worker:
#inventory/hosts
[redcarpet]
57.44.113.25
[services]
10.0.1.2
[worker]
10.0.5.77
10.0.1.200
[elasticsearch]
10.0.5.30
[mongodb]
10.0.1.5
Now I need to tell each of the groups, EXCEPT redcarpet to use certain SSH settings. If these were applicable to all groups, I would put them in inventory/group_vars/all.yml, but now I will have to put them in:
inventory/group_vars/services.yml
inventory/group_vars/worker.yml
inventory/group_vars/elasticsearch.yml
inventory/group_vars/mongodb.yml
Which leads to duplication. Therefore I would like to use an include or include_vars to include one or two variables from a common file (e.g. inventory/common.yml). However, when I try to do this in any of the group_var files above, it does not pick up the variables. What is the best practice to use with variables that are common to multiple groups?

If you want to go with the group_vars approach, I would suggest you add another group, and add the dependent groups as children to that group.
#inventory/hosts
[redcarpet]
57.44.113.25
[services]
10.0.1.2
[worker]
10.0.5.77
10.0.1.200
[elasticsearch]
10.0.5.30
[mongodb]
10.0.1.5
[redcarpet_deps:children]
mongodb
elasticsearch
worker
services
And now you can have a group_vars file called redcarpet_deps.yml and they should pickup the vars from there.

How to set Neo4J config keys in gremlin-scala?

When running a Neo4J database server standalone (on Ubuntu 14.04), configuration options are set for the global installation in etc/neo4j/neo4j.conf or possibly $NEO4J_HOME/conf/neo4j.conf.
However, when instantiating a Neo4j database from Java or Scala using Apache's Neo4jGraph class (org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph), there is no global installation, and the constructor does not (as far as I can tell) look for any configuration files.
In particular, when running the test suite for my application, I end up with many simultaneous instances of Neo4jGraph, which ends up throwing a java.net.BindException: Address already in use because all of these instances are trying to communicate over a small range of ports for online backup, which I don't actually need. These channels are set with config options dbms.backup.address (default value: 127.0.0.1:6362-6372) and dbms.backup.enabled (default value: true).
My problem would be solved by setting dbms.backup.enabled to false, or expanding the port range.
Things that have not worked:
Creating /etc/neo4j/neo4j.conf containing the line dbms.backup.enabled=false.
Creating the same file in my project's src/main/resources directory.
Creating the same file in src/main/resources/neo4j.
Manually setting the configuration property inside the Scala code:
val db = new Neo4jGraph(dataDirectory)
db.configuration.addProperty("dbms.backup.enabled",false)
or
db.configuration.addProperty("neo4j.conf.dbms.backup.enabled",false)
or
db.configuration.addProperty("gremlin.neo4j.conf.dbms.backup.enabled",false)
How should I go about setting this property?

Neo4jGraph configuration through TinkerPop is accomplished by a pass-through of configuration keys. In TinkerPop 3.x, that would mean that all Neo4j keys prefixed with gremlin.neo4j.conf that are provided via Configuration object to Neo4jGraph.open() or GraphFactory.open() will be passed down directly to the Neo4j instance. You can see examples of this here in the TinkerPop documentation on high availability configuration.
In TinkerPop 2.x, the same approach was taken however the key prefix was instead blueprints.neo4j.conf.* as discussed here.

Manipulating db.configuration after the database connection had already been opened was definitely futile.
stephen mallette's answer was on the right track, but this particular configuration doesn't appear to pass through in the way his linked example does. There is a naming mismatch between the configuration keys expected in neo4j.conf and those expected in org.neo4j.backup.OnlineBackupKernelExtension. Instead of dbms.backup.address and dbms.backup.enabled, that class looks for config keys online_backup_server and online_backup_enabled.
I was not able to get these keys passed down to the underlying Neo4jGraphAPI instance correctly. What I had to do, instead, was the following:
import org.neo4j.tinkerpop.api.impl.Neo4jFactoryImpl
import scala.collection.JavaConverters._
val factory = new Neo4jFactoryImpl()
val config = Map(
"online_backup_enabled" -> "true",
"online_backup_server" -> "0.0.0.0:6350-6359"
).asJava
val db = Neo4jGraph.open(factory.newGraphDatabase(dataDirectory,config))
With this initialization, the instance correctly listened for backups on port 6350; changing "true" to "false" disabled backup listening.

Using Neo4j 3.0.0 the following disables port listening for me (Java code)
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph;
BaseConfiguration conf = new BaseConfiguration();
conf.setProperty(Neo4jGraph.CONFIG_DIRECTORY, "/path/to/db");
conf.setProperty(Neo4jGraph.CONFIG_CONF + "." + "dbms.backup.enabled", "false");
graph = Neo4jGraph.open(config);

Akka.NET - Cluster and ActorSelection path

I have an akka.net cluster and I want to send a message to actors that are both local and remote, and that all have the path "/user/foobar" (at least locally). Should I use ActorSelection, and what should the path look like in order to target both matching local and remote actors?

It's unclear from the question whether you mean you want to send a message locally within one node in your cluster, or across multiple nodes.
If you just want to send it in one node, you can use an ActorSelection and just send it to whatever the desired actor path is (e.g. /user/*/processingActor). If you want to message across the cluster itself, you'll need to set up a cluster-aware Group router.
See the docs here for router configuration, which is where you'll define the routees.
In a nutshell, you'll be doing something like this:
# inside akka.actor.deployment HOCON
/some-group-router {
router = round-robin-group
routees.paths = ["/user/*/processingActor",]
nr-of-instances=3
cluster {
enabled=on
use-role=targetRoleName
allow-local-routees=on
}
}

How to do parellel bootstrap on different machine using chef?

Suppose I have written five different recipes and I have five different target nodes.
How can I fire a bootstrap command that will address all 5 nodes at ones and it will execute the specified recipes(one for one) in parallel on each target node.

There is way to bootstrap multiple cookbooks on multiple machines at a time using spiceweasel . In this first install the spiceweasel package on the node . U need to create a yaml file and load ur data in the file .. like node ip , cookbook name .
go through this link , it will give you an idea
https://github.com/mattray/spiceweasel#cookbooks

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas