GCE instance terminated without detail or clarification - virtual-machine

We got a weird error event last night where a GCE VM instance got terminated by the system for reasons that are unclear. Is there any way to identify why a running instance got terminated?
Portion of the REST response from the GCE service:
{
  "kind": "compute#operation",
  "operationType": "compute.instances.guestTerminate",
  "status": "DONE",
  "statusMessage": "Instance terminated by guest OS shutdown.",
  "user": "system"
}

In this case, it appears that your instance's guest OS itself requested the shutdown or halt, so the instance was terminated as a result. This could be due to any number of reasons, e.g., someone ran a command such as shutdown -h now inside the VM.
The record of what happened may still be on the instance's persistent disk; however, if the VM was set to automatically delete the boot disk on instance termination, that record is likely gone by now.
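As a rough illustration of how to read such an operation record, here is a minimal Python sketch. The function name describe_termination is hypothetical, and the dict only mirrors the fields from the response excerpt in the question; it does not call the real Compute Engine API.

```python
def describe_termination(operation):
    """Summarize who or what ended the instance, based on an operation record."""
    op_type = operation.get("operationType", "")
    user = operation.get("user", "unknown")
    message = operation.get("statusMessage", "")
    # guestTerminate means the shutdown was requested from inside the VM.
    if op_type == "compute.instances.guestTerminate":
        return f"guest OS shutdown (reported by {user}): {message}"
    return f"{op_type or 'unknown operation'} (by {user}): {message}"


# Fields copied from the response excerpt in the question.
operation = {
    "kind": "compute#operation",
    "operationType": "compute.instances.guestTerminate",
    "status": "DONE",
    "statusMessage": "Instance terminated by guest OS shutdown.",
    "user": "system",
}
print(describe_termination(operation))
```

A guest-initiated result only says the request came from inside the VM; the actual cause still has to be found in the guest's own logs, if the disk survived.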

Related

Azure Container Instance is immediately killed on Startup

I am trying to run an azure container instance but it appears to be getting killed off the second I run it. This works fine in 2 other resource groups but not my production resource group where I see the following:
In events I see 'Successfully pulled image
selenium/standalone-chrome:latest' with count 1 and then 'Started
container' and then 'Killing container' with count 31. The times for
started and killed are the same.
In logs, it just says 'No logs available'
The metrics for CPU and memory on the container never show any change from zero.
I looked at this article, but the proposed solution didn't work: Azure Container Group Instance. I have tried adding both an empty directory volume and 2 GB of RAM as advised here: https://github.com/SeleniumHQ/docker-selenium but nothing works.
This is the code I am using to create the container:
containerGroup = await azure.ContainerGroups.Define(containerName)
    .WithRegion("West Europe")
    .WithExistingResourceGroup(configuration.ContainerResourceGroup)
    .WithLinux()
    .WithPublicImageRegistryOnly()
    .WithEmptyDirectoryVolume("devshm")
    .DefineContainerInstance(containerName)
        .WithImage("selenium/standalone-chrome")
        .WithExternalTcpPorts(4444)
        .WithVolumeMountSetting("devshm", "/dev/shm")
        .WithMemorySizeInGB(2)
        .Attach()
    .WithDnsPrefix(configuration.AppServiceName + "container")
    .WithRestartPolicy(ContainerGroupRestartPolicy.OnFailure)
    .CreateAsync(cancellationToken);
How do I debug what is going wrong?
What is wrong with the container?
In case this helps someone: I renamed the "containerName" parameter in the above example from myinstance to myinstance1 and changed the region from West Europe to UK South. This fixed the issue. I can only think that Azure caches instances somehow to reduce startup times and that the cached image I was using was poisoned somehow.
One issue could be the restart policy. Have a look at the restart policy section of Microsoft's ACI troubleshooting page. Under the header "Container continually exits and restarts (no long-running process)", it says:
Container groups default to a restart policy of Always, so containers
in the container group always restart after they run to completion.
You may need to change this to OnFailure or Never if you intend to run
task-based containers. If you specify OnFailure and still see
continual restarts, there might be an issue with the application or
script executed in your container.
In your case, you may need to adjust the code as follows, using WithStartingCommandLine to give the container a long-running process:
containerGroup = await azure.ContainerGroups.Define(containerName)
    .WithRegion("West Europe")
    .WithExistingResourceGroup(configuration.ContainerResourceGroup)
    .WithLinux()
    .WithPublicImageRegistryOnly()
    .WithEmptyDirectoryVolume("devshm")
    .DefineContainerInstance(containerName)
        .WithImage("selenium/standalone-chrome")
        .WithExternalTcpPorts(4444)
        .WithVolumeMountSetting("devshm", "/dev/shm")
        .WithMemorySizeInGB(2)
        .WithStartingCommandLine("tail")
        .WithStartingCommandLine("-f")
        .WithStartingCommandLine("/dev/null")
        .Attach()
    .WithDnsPrefix(configuration.AppServiceName + "container")
    .WithRestartPolicy(ContainerGroupRestartPolicy.OnFailure)
    .CreateAsync(cancellationToken);
This link is helpful for this issue.
--command-line
linux => "tail -f /dev/null"
windows => "ping -t localhost"
# .yml
command: tail -f /dev/null
This keeps your Azure container instance running, because Azure now has a long-running process to connect to and analyze.
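To make the per-OS keep-alive commands above easier to reuse, here is a minimal Python sketch that returns them as argument lists, one element per argument, which matches how the C# example passes "tail", "-f" and "/dev/null" separately. The helper name keep_alive_command is hypothetical and not part of any Azure SDK.

```python
def keep_alive_command(os_type):
    """Return an argv list for a process that never exits on its own."""
    commands = {
        "linux": ["tail", "-f", "/dev/null"],
        "windows": ["ping", "-t", "localhost"],
    }
    try:
        return commands[os_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported OS type: {os_type!r}") from None


# Join the list only when a single command-line string is required.
print(" ".join(keep_alive_command("linux")))
```

Keep in mind that this only keeps the container alive for inspection; it replaces the image's normal startup command, so the application itself will not be running.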

Splunk 7.2.9.1 Universal forwarder on SUSE Linux12.4 not communicating and forwarding logs to Indexer after certain period of time

I have noticed that the Splunk 7.2.9.1 universal forwarder on SUSE Linux 12.4 stops communicating with the deployment server and stops forwarding logs to the indexer after a certain period of time. The "splunkd" process appears to be running while the issue persists.
I have to restart the universal forwarder for it to resume communicating with the deployment server and forwarding logs, but it stops again after a certain period of time.
I cannot see any specific logs in splunkd.log while this issue occurs.
However, I noticed the message below in watchdog.log:
06-16-2020 11:51:09.055 +0200 ERROR Watchdog - No response received from IMonitoredThread=0x7f24365fdcd0 within 8000 ms. Looks like thread name='Shutdown' is busy !? Starting to trace with 8000 ms interval.
Can somebody help me understand what is causing this issue?
This appears to be a Known Issue. From the 7.2.9.1 release notes:
Universal Forwarders stop sending data repeatedly throughout the day
Workaround: In limits.conf, try changing file_tracking_db_threshold_mb
in the [inputproc] stanza to a lower value.
I did not find a version where this is not listed as a known problem.
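If you want to script the workaround, here is a minimal Python sketch that sets file_tracking_db_threshold_mb in the [inputproc] stanza of a limits.conf text. The function name is hypothetical, the numbers in the usage line are placeholders rather than recommended values, and it assumes the file parses as plain INI (every setting under a stanza header). Comments in the file are not preserved by this approach.

```python
import configparser
import io


def set_file_tracking_threshold(conf_text, new_value_mb):
    """Return limits.conf text with the [inputproc] threshold set to new_value_mb."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(conf_text)
    if not parser.has_section("inputproc"):
        parser.add_section("inputproc")
    parser.set("inputproc", "file_tracking_db_threshold_mb", str(new_value_mb))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Placeholder values only: 500 and 100 are not taken from the release notes.
example = "[inputproc]\nfile_tracking_db_threshold_mb = 500\n"
print(set_file_tracking_threshold(example, 100))
```

The forwarder has to be restarted after the change for the new value to take effect.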

Hyper-v Machine shuts down unexpectedly, Hyper-V log shows an Error:"DM operation add for the virtual machine"

This is causing me big production problems, please help me.
One virtual machine shut down unexpectedly.
The Hyper-V log shows:
DM operation add for the virtual machine 'XXXXName' failed with error: Unspecified error (0x80004005) (Virtual machine ID 7EDDD39A-F963-4FAA-8854-6179B7611AC3).
Could this DM error happen even if nobody touched Hyper-V? Please tell me how to fix this problem.
DM is "dynamic memory". Your virtual machine tried to add memory and failed. A human didn't trigger this, you've just run out of memory.
https://social.technet.microsoft.com/Forums/windowsserver/en-US/f60b2767-dbfc-41d0-8019-24039aab187d/dynamic-memory-issue-dm-add-fails-with-0x800705aa?forum=winserverhyperv
Edit: the reason it failed, the error code, is explained here. It's possible there was a permission problem trying to allocate more memory. But still, essentially, you ran out of memory.
https://appuals.com/solved-how-to-fix-error-0x80004005/

RDO unable to boot VM with disk size specified

I have a packstack-allinone setup of the Juno release on my RHEL 7.1 trial.
I am facing a problem launching a VM (for example, cirros) when a disk size is specified in the flavor. With a 0 GB disk size the VMs launch, but not with larger flavor disk sizes.
I also observe that when I do this, the openstack-nova-compute service goes down: nova-manage service list shows nova-compute as XXX, so I have to restart the service every time I try this scenario. The compute logs don't show any error; it just gets stuck at "Creating image".
Is there a filesystem setting that I have missed configuring? I am new to this, so please help.
PS: I run all commands as the "root" user.
The problem was with ESXi. ESXi needs to be version 5.5 to support RHEL 7.x. Since mine was 5.1, it only supported RHEL 6.x.
After upgrading ESXi from 5.1 to 5.5, it worked fine.

Redis - can't [BG]SAVE after one failure

I have a message in the Redis log saying
BeginForkOperation: system error caught. error code=0x00000000, message=Forked process did not respond in a timely mannager.: unknown error
Can't save in background: fork: Invalid argument
This message appeared in the log more than a week ago, and since then, each time I try to run "BGSAVE" or "SAVE", it throws the "err background save already in progress" error.
I can't see another redis-server process besides the main process in the task manager, nor in the output of the "client list" command.
I'm using the Redis on windows project (constraints:)
Any ideas how to tell Redis that there is no background save process and force it to save the data to disk before it crashes?