gpu/nvidia isolation in dc/os - gpu

I installed DC/OS 1.9 on three VMs of my own. None of the nodes has GPU resources, and the slave and slave-public nodes started up successfully. The log on one slave shows the following:
Jun 15 04:43:28 localhost.localdomain mesos-agent[31752]: E0615 04:43:28.488627 31752 containerizer.cpp:335] Cannot create the Nvidia GPU isolator: NVML is not available
Jun 15 04:43:28 localhost.localdomain mesos-agent[31752]: 2017-06-15 04:43:28,494:31752(0x7f9291dd8700):ZOO_INFO#log_env#726: Client environment:zookeeper.version=zookeeper C client 3.4.8
.....
Jun 15 04:43:28 localhost.localdomain mesos-agent[31752]: I0615 04:43:28.495215 31752 slave.cpp:211] Mesos agent started on (1)#192.168.3.72:5051
In another test environment of mine, running Mesos 1.0.1, I started a Mesos slave (on a node that also has no GPU resources) with "cgroups/devices,gpu/nvidia" isolation, but it failed to start. The logs show:
Jun 15 09:29:39 w-388965952-ClusterTest-sysadmin linker-start-agent.sh[25300]: Failed to create a containerizer: Could not create MesosContainerizer: Failed to create isolator 'gpu/nvidia': Cannot create the Nvidia GPU isolator: NVML is not available
Jun 15 09:29:39 w-388965952-ClusterTest-sysadmin systemd[1]: dcos-mesos-slave.service: main process exited, code=exited, status=1/FAILURE
Jun 15 09:29:39 w-388965952-ClusterTest-sysadmin systemd[1]: Unit dcos-mesos-slave.service entered failed state.
Jun 15 09:29:39 w-388965952-ClusterTest-sysadmin systemd[1]: dcos-mesos-slave.service failed.
I want to know: can a node with no GPU resources start mesos-slave with gpu/nvidia isolation? If yes, how?

The behavior here for DC/OS is slightly different than in vanilla Mesos.
With vanilla Mesos, the agent will refuse to start if you enable the gpu/nvidia isolator but NVML is not installed.
With DC/OS, the agent will emit a warning message if NVML is not installed (the gpu/nvidia isolator is always enabled).
Note: the dependency is on the NVML library, not on actual GPU resources. If NVML is installed but no GPUs are found on the box, the agent still starts with the gpu/nvidia isolator enabled.
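Since the deciding factor is whether NVML is present, a quick pre-check on the node can tell you which behavior to expect. The following is only a minimal sketch: it assumes the library can be found under the base name "nvidia-ml" through the platform's normal library search, which is not necessarily the same lookup the Mesos agent performs. The function name nvml_available is made up for this example.

```python
import ctypes
import ctypes.util


def nvml_available(find_library=ctypes.util.find_library, load=ctypes.CDLL):
    """Return True if an NVML shared library can be located and loaded.

    Assumption: the library is discoverable under the base name
    "nvidia-ml" (libnvidia-ml.so on Linux). This is an illustrative
    check, not the lookup the Mesos agent itself uses.
    """
    path = find_library("nvidia-ml")
    if path is None:
        # Nothing found on the library search path.
        return False
    try:
        load(path)
    except OSError:
        # Found a candidate but it could not be loaded.
        return False
    return True


if __name__ == "__main__":
    print("NVML available:", nvml_available())
```

If this reports False on a vanilla Mesos node, enabling gpu/nvidia would be expected to stop the agent from starting, as described above.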

OpenThread CodeLab: Cannot start daemon

I am currently trying to get the OpenThread code lab simulation (found here) working, but unfortunately I am stuck at the chapter "Manage the network with OpenThread Daemon". All the other chapters worked like a charm, but I cannot start the daemon in the last one.
As I am completely new to this topic there might be something obvious I just don't see. Does someone know how to fix this?
OS: macOS Monterey 12.6.3
I created the daemon with ./script/cmake-build posix -DOT_DAEMON=ON and tried running it with sudo ./build/posix/src/posix/ot-daemon -v 'spinel+hdlc+forkpty:///build/simulation/examples/apps/ncp/ot-rcp?forkpty-arg=2'.
The response is always the same:
Feb 18 11:20:08 ./build/posix/src/posix/ot-daemon[9798] <Info>: Running OPENTHREAD/thread-reference-20200818-2319-gafbb2d579; POSIX; Feb 18 2023 10:39:37
Feb 18 11:20:08 ./build/posix/src/posix/ot-daemon[9798] <Info>: Thread version: 4
Feb 18 11:20:08 ./build/posix/src/posix/ot-daemon[9798] <Critical>: 49d.18:22:23.835 [C] Platform------: Init() at hdlc_interface.cpp:151: InvalidArgument

Plain vanilla Apache Ignite cluster fails setting state back to ACTIVE

I've got a plain vanilla install of ignite 2.14, with the binaries downloaded from https://ignite.apache.org/download.cgi (exact link https://dlcdn.apache.org/ignite/2.14.0/apache-ignite-2.14.0-bin.zip). I'm on Windows 10, IGNITE_HOME is not set in the PATH (this is optional), and Ignite's using this java runtime:
OpenJDK Runtime Environment 1.8.0_201-2-redhat-b09 Oracle Corporation
OpenJDK 64-Bit Server VM 25.201-b09
I start an ignite node using the default configuration provided in the downloaded zip apache-ignite-2.14.0-bin.zip :
ignite.bat ..\config\default-config.xml
This starts fine. Following the instructions at https://ignite.apache.org/docs/latest/tools/control-script I can check the state and see I've got a single node cluster in state ACTIVE (the default-config.xml must not have native persistence enabled, so the cluster goes to ACTIVE state automatically).
I can then set the state to INACTIVE like so:
control.bat --set-state INACTIVE
This works fine. However if I set the state to active again like so:
control.bat --set-state ACTIVE
I get the error pasted below and the cluster stays in the INACTIVE state. I first came across this error when using Ignite in embedded server mode, but I can still reproduce it with a fresh out-of-the-box ignite install (not using embedded). I'm surprised that a plain vanilla install just calling a couple of basic commands falls over like this. Any idea what's happening?
This is the error:
C:\temp\apache-ignite-2.14.0-bin\bin>control.bat --set-state ACTIVE
Nov 17, 2022 4:27:17 PM org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection
INFO: Client TCP connection established: /127.0.0.1:11211
Nov 17, 2022 4:27:17 PM org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection close
INFO: Client TCP connection closed: /127.0.0.1:11211
Nov 17, 2022 4:27:17 PM org.apache.ignite.internal.client.util.GridClientUtils shutdownNow
WARNING: Runnable tasks outlived thread pool executor service [owner=GridClientConnectionManager, tasks=[java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#6d7b4f4c]]
Control utility [ver. 2.14.0#20220929-sha1:951e8deb]
2022 Copyright(C) Apache Software Foundation
User: info
Time: 2022-11-17T16:27:16.344
null
suppressed:
Command [SET-STATE] finished with code: 4
Error stack trace:
class org.apache.ignite.internal.client.GridClientException: null
suppressed:
at org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection.handleClientResponse(GridClientNioTcpConnection.java:632)
at org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection.handleResponse(GridClientNioTcpConnection.java:563)
at org.apache.ignite.internal.client.impl.connection.GridClientConnectionManagerAdapter$NioListener.onMessage(GridClientConnectionManagerAdapter.java:691)
at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:116)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:3734)
at org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:175)
at org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1211)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2508)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2273)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1910)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
at java.lang.Thread.run(Thread.java:748)
Control utility has completed execution at: 2022-11-17T16:27:17.642
Execution time: 1298 ms
Press any key to continue . . .
It's a known issue, which is, unfortunately, not fixed yet.
As a workaround, you can execute the command with the autoconfirmation flag --yes, as shown below:
control.bat --set-state ACTIVE --yes

org.openqa.selenium.remote.ProtocolHandshake createSession INFORMATION: Attempting bi-dialect session with Selenium Grid

I set up a local Selenium Grid to test something. The build runs normally when connecting to another grid, but when using the local grid the build just stops at this point:
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running xxx.xxxxxxxxxxxx.xxx.xxxxxxxxxxx.XXXXXXXXXXXX
Sep 17, 2018 3:13:49 PM org.openqa.selenium.remote.ProtocolHandshake createSession
INFORMATION: Attempting bi-dialect session, assuming Postel's Law holds true on the remote end
There is no error message at all, and neither -X nor -Dwebdriver.server.session.timeout=7200 made any difference.
It just hangs there and I get no further output.
This error message...
org.openqa.selenium.remote.ProtocolHandshake createSession
INFORMATION: Attempting bi-dialect session, assuming Postel's Law holds true on the remote end
As per the discussion "Attempting bi-dialect session, assuming Postel's Law holds true on the remote end thread 'webdriver dispatcher' panicked at 'index out of bounds: the len is 0 but the index is 0", this issue was reproducible with Selenium Client v3.0.0-beta3 (released on 2016-09-01 14:57:03 -0700) with GeckoDriver.
Simon in a comment mentioned that:
The root cause was a ClassCastException. We now catch that exception, log the thing that we were trying to parse and continue with other attempts to complete the handshake. The fix was available in Selenium Client v3.0.0-beta4.
Solution
Upgrade the JDK to a recent level (JDK 8u181).
Upgrade Selenium to the current level (version 3.14.0).
Upgrade GeckoDriver to v0.20.1.
Ensure GeckoDriver is present in the specified location.
Ensure GeckoDriver has executable permission for non-root users.
Upgrade Firefox to v61.0.2.
Clean your project workspace through your IDE and rebuild your project with the required dependencies only.
If your base Web Client version is too old, uninstall it through Revo Uninstaller and install a recent GA release of the Web Client.
Reboot the system.
Execute your test as a non-root user.

Problems with CATALINA_PID and ARTIFACTORY_PID while upgrading Artifactory to the latest version

While upgrading my Artifactory server (free OSS version) from version 5.2.0 to the latest, 5.4.5, I was hit by an ARTIFACTORY_PID problem.
After migrating from 5.3.2 to 5.4.0, the Artifactory server refused to start, complaining:
PID file /var/opt/jfrog/run/artifactory.pid not readable (yet?) after start.
The only way around it that I found was to remove the line export CATALINA_PID=$ARTIFACTORY_PID from Tomcat's setenv.sh.
Note that the upgrade from 5.2.0 to 5.3.2 went smoothly.
However, after upgrading from 5.4.0 to the latest 5.4.5, this trick no longer works. Now I get an error:
Job for artifactory.service failed because a configured resource limit was exceeded. See "systemctl status artifactory.service" and "journalctl -xe" for details.
And when executing service artifactory status, I get:
● artifactory.service - Setup Systemd script for Artifactory in Tomcat Servlet Engine
Loaded: loaded (/usr/lib/systemd/system/artifactory.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: resources) since Tue 2017-07-25 09:40:10 CEST; 4s ago
Process: 31912 ExecStart=/opt/jfrog/artifactory/bin/artifactoryManage.sh start (code=exited, status=0/SUCCESS)
Jul 25 09:40:10 linux systemd[1]: Failed to start Setup Systemd script for Artifactory in Tomcat Servlet Engine.
Jul 25 09:40:10 linux systemd[1]: Unit artifactory.service entered failed state.
Jul 25 09:40:10 linux systemd[1]: artifactory.service failed.
In fact, Artifactory is now running and reports version 5.4.5, but I am not happy about all the errors above.
I also don't quite understand the purpose of CATALINA_PID and/or ARTIFACTORY_PID. Why was Tomcat failing on startup because of this file? What was wrong with the permissions? As far as I know, I took no extra actions beforehand.
The only difference is that it was previously installed from an officially downloaded rpm, whereas now I am using the official remote yum repo.
If I create an empty /var/opt/jfrog/run/artifactory.pid file while Artifactory is running, it gets deleted. Who is deleting this file and why? Is this standard Tomcat behavior?
OS: CentOS 7, up to date.
In my case (in a slow virtual machine) the error message from the command artifactoryManage.sh start was:
ERROR: Artifactory Tomcat server did not start in 60 seconds. Please check the logs
The log file (/var/opt/jfrog/artifactory/logs/artifactory.log) showed that the only problem was slowness:
### Artifactory successfully started (64.802 seconds) ###
The problem was solved by adding a longer timeout to the service definition at /etc/systemd/system/artifactory.service:
[Service]
Environment=START_TMO=120
After editing the service definition, systemctl daemon-reload was needed.
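To check whether slowness is the cause on your machine, you can compare the startup time reported in artifactory.log with the timeout in effect. This is a rough sketch that assumes the log line has exactly the format quoted above; the function names startup_seconds and exceeds_timeout are made up for this example, and the 60-second default is taken from the error message above.

```python
import re

# Matches the line format quoted above, e.g.
# "### Artifactory successfully started (64.802 seconds) ###"
STARTED_RE = re.compile(
    r"Artifactory successfully started \((\d+(?:\.\d+)?) seconds\)"
)


def startup_seconds(log_text):
    """Return the most recent startup duration in the log text, or None."""
    matches = STARTED_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def exceeds_timeout(log_text, timeout_seconds=60):
    """Return True if the last recorded startup took longer than the timeout."""
    seconds = startup_seconds(log_text)
    return seconds is not None and seconds > timeout_seconds


if __name__ == "__main__":
    line = "### Artifactory successfully started (64.802 seconds) ###"
    print(exceeds_timeout(line, 60))   # startup was slower than 60 seconds
    print(exceeds_timeout(line, 120))  # fits within START_TMO=120
```

In practice you would pass the contents of the log file instead of a single line.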
Run this script:
/opt/jfrog/artifactory/bin/artifactoryManage.sh start
It will show you the exact error.
In my case the Java version was out of date, so I updated to Java 1.8.

Running Jboss 7.1.1 on Fedora 20 as service

I have encountered a problem running JBoss as a service on Fedora. Here is the output of systemctl status jboss-as.service:
jboss-as.service - SYSV: JBoss AS Standalone
Loaded: loaded (/etc/rc.d/init.d/jboss-as)
Active: failed (Result: resources) since Thu 2014-01-16 09:31:54 CET; 46min ago
Process: 501 ExecStart=/etc/rc.d/init.d/jboss-as start (code=exited, status=0/SUCCESS)
Jan 16 09:31:22 servername.domain systemd[1]: Starting SYSV: JBoss AS Standalone...
Jan 16 09:31:23 servername.domain jboss-as[501]: Starting jboss-as: chown: missing operand after ‘/var/run/jboss-as’
Jan 16 09:31:23 servername.domain jboss-as[501]: Try 'chown --help' for more information.
Jan 16 09:31:54 servername.domain jboss-as[501]: [ OK ]
Jan 16 09:31:54 servername.domain systemd[1]: PID file /var/run/jboss-as/jboss-as-standalone.pid not readable (yet?) after start.
Jan 16 09:31:54 servername.domain systemd[1]: Failed to start SYSV: JBoss AS Standalone.
Jan 16 09:31:54 servername.domain systemd[1]: Unit jboss-as.service entered failed state.
First, I tried to find a solution for the "chown: missing operand after ..." problem and found something here, but it did not help. I also looked for an answer to the PID file problem, but the file does not even exist in /var/run/jboss-as/.
This is because the startup script uses the variable $JBOSS_USER but it is not defined inside the script.
Add the following line to the file /etc/jboss-as/jboss-as.conf:
JBOSS_USER=root
(replace root with another dedicated Linux user, e.g. jboss-as)
It looks like the service startup script expects to be able to write to the /var/run/jboss-as directory but doesn't have permissions to do so.
In your place I'd ensure that this directory is owned by the user that runs JBoss and that it is writable.
Check that there aren't other errors (particularly missing or incorrect paths) in your /etc/rc.d/init.d/jboss-as file (I assume you copied it from the JBoss install folder to create a startup script).
I had the same issue until I fixed a completely unrelated link in that script, then it went away.
On CentOS 7, if you copy jboss-as-standalone.sh straight into /etc/rc.d/init.d/, ensure that the JBOSS_CONF and JBOSS_HOME paths are correct.
For me, the problem was with systemd: when I set up the service, I specified the wrong PID file.
Example:
In the service it was
/var/run/jboss-as/jboss-as-standalone.pid
But in the script it was
/var/run/jboss-as/jboss-as.pid
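A mismatch like this is easy to miss by eye, so here is a small sketch that compares the two paths. It assumes the systemd unit declares the file with the PIDFile= directive and that the init script assigns it to a shell variable. The variable name JBOSS_PIDFILE is an assumption here, so adjust it to whatever your script uses. All function names are made up for this example.

```python
import re


def unit_pid_file(unit_text):
    """Return the path given by PIDFile= in a systemd unit, or None."""
    match = re.search(r"^\s*PIDFile\s*=\s*(\S+)", unit_text, re.MULTILINE)
    return match.group(1) if match else None


def script_pid_file(script_text, variable="JBOSS_PIDFILE"):
    """Return the path assigned to `variable` in a shell script, or None.

    Assumption: the script uses a plain VAR=/path assignment, optionally
    quoted and optionally prefixed with "export".
    """
    pattern = (
        r"^\s*(?:export\s+)?"
        + re.escape(variable)
        + r"""=["']?([^"'\s]+)"""
    )
    match = re.search(pattern, script_text, re.MULTILINE)
    return match.group(1) if match else None


def pid_files_match(unit_text, script_text, variable="JBOSS_PIDFILE"):
    """Return True only if both files name the same PID file path."""
    unit_path = unit_pid_file(unit_text)
    script_path = script_pid_file(script_text, variable)
    return unit_path is not None and unit_path == script_path


if __name__ == "__main__":
    unit = "[Service]\nPIDFile=/var/run/jboss-as/jboss-as-standalone.pid\n"
    script = "JBOSS_PIDFILE=/var/run/jboss-as/jboss-as.pid\n"
    print(pid_files_match(unit, script))  # the two paths differ
```

If the paths differ, systemd waits for a file the script never writes, which would explain the "PID file ... not readable (yet?) after start" message.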