Terminated ACI not disappearing - azure-container-instances

I'm working on a new container image that runs my worker process to drain an Azure queue. Once the queue is empty my app exits, and I'd like the ACI to be de-allocated and removed as well. What I am seeing is that the ACI sticks around. It is in a "Terminated" state with a restart count of 0, as I would expect (seen in the Azure Portal), but why is it not removed/deleted from the ACI list entirely?
I am using the Azure CLI to create these instances and am specifying the "never" restart policy. Here is my command line (minus the image-specific details):
az container create --cpu 4 --memory 14 --restart-policy never --os-type windows --location eastus
I am of course also wondering where billing stops. Once I see the terminated state I am hoping that billing has stopped, though this is unclear to me. I can of course manually delete the ACI and it is gone immediately; should exiting the app do the same?

If your container is in the terminated state, you are no longer being billed. The resource itself, though, remains until you delete it, in case you want to query the logs, events, or details of the container after termination. If you wish to delete existing container groups, writing some code on Azure Functions is a good route, so you can define when something should be deleted.
Check out this base example of such a concept.
https://github.com/dgkanatsios/AzureContainerInstancesManagement/tree/master/functions/ACIDelete
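If you go the Azure Functions route, the cleanup logic itself is small. Here is a minimal sketch of the idea using the @azure/arm-containerinstance JavaScript SDK (this is not the linked sample's code; the subscription ID and resource group are placeholders, and method names can differ slightly between SDK versions):

    import { ContainerInstanceManagementClient } from "@azure/arm-containerinstance";
    import { DefaultAzureCredential } from "@azure/identity";

    const subscriptionId = "<subscription-id>"; // placeholder
    const resourceGroup = "<resource-group>";   // placeholder

    // Delete every container group in the resource group whose containers have all terminated.
    export async function deleteTerminatedGroups(): Promise<void> {
      const client = new ContainerInstanceManagementClient(
        new DefaultAzureCredential(),
        subscriptionId
      );

      for await (const group of client.containerGroups.listByResourceGroup(resourceGroup)) {
        // Fetch the full group so the per-container instance view (current state) is populated.
        const full = await client.containerGroups.get(resourceGroup, group.name!);
        const allTerminated = (full.containers ?? []).every(
          (c) => c.instanceView?.currentState?.state === "Terminated"
        );
        if (allTerminated) {
          // Only delete once you no longer need the logs/events of the finished run.
          await client.containerGroups.beginDeleteAndWait(resourceGroup, full.name!);
        }
      }
    }

Run something like this on a timer trigger and terminated groups get cleaned up on whatever schedule you define.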

Related

ACI container group disappears after successful exit

I am using ACI to run a single-shot container that reads a storage blob, computes for anything from a few seconds to a few hours depending on the blob contents, then writes results to another storage blob. Containers are spawned as needed using node.js, and I periodically check for terminated containers to retrieve their exit codes, after which I delete them.
This normally works fine, but sometimes, when the computation completes very quickly and the container exits normally, Azure appears to delete the container on its own. This means that I cannot retrieve the exit code, which is inconvenient. The container is really gone, and appears neither in the list returned by the JavaScript ContainerGroups.listByResourceGroup function nor in the output of the Azure CLI "az container list" command.
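For reference, the periodic check is roughly the following (a sketch using the current @azure/arm-containerinstance SDK; the client, resource group, and group name come from my spawning code), and the get call is what comes up empty once Azure has already removed the group:

    import { ContainerInstanceManagementClient } from "@azure/arm-containerinstance";

    // Rough sketch: if the single container in the group has terminated, record its
    // exit code and delete the group; returns undefined if it is still running.
    async function collectExitCode(
      client: ContainerInstanceManagementClient,
      resourceGroup: string,
      groupName: string
    ): Promise<number | undefined> {
      // This throws a not-found error when the group has already disappeared,
      // which is exactly the case described above.
      const group = await client.containerGroups.get(resourceGroup, groupName);
      const state = group.containers?.[0]?.instanceView?.currentState;

      if (state?.state !== "Terminated") return undefined;

      const exitCode = state.exitCode;
      await client.containerGroups.beginDeleteAndWait(resourceGroup, groupName);
      return exitCode;
    }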
Is this a known problem, and if so, is there a workaround? I guess I could just have my container sleep for a while before starting its computation, but without understanding the cause of the problem, I don't know how long to sleep for.

Get notification if ECS service launches a new task, if autoscaling is triggered

We have used ECS for our production setups. As per my understanding of ECS, while creating a cluster of type EC2, we specify the number of instances to be launched. When we create a service, and if autoscaling is enabled, we specify the minimum and maximum number of tasks that can be created.
While creating these tasks, if there is no space left on the existing instances, ECS launches a new instance to place these tasks.
I would like to know whether we can trigger a notification whenever a new EC2 instance gets added to the ECS cluster when autoscaling is triggered.
If yes, please help me with links or steps for the same.
Thanks.
This should be doable; see here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ASGettingNotifications.html
There are simple ways to test it, such as manually increasing the desired capacity, and there are other notification types you can subscribe to as well.
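For example, here is a minimal sketch of wiring that up programmatically with the AWS SDK for JavaScript v3; the region, Auto Scaling group name, and SNS topic ARN are placeholders for whatever backs your ECS cluster:

    import {
      AutoScalingClient,
      PutNotificationConfigurationCommand,
    } from "@aws-sdk/client-auto-scaling";

    // Send an SNS notification whenever the Auto Scaling group behind the ECS
    // cluster launches (or fails to launch) a new EC2 instance.
    async function subscribeToScaleOutEvents(): Promise<void> {
      const client = new AutoScalingClient({ region: "us-east-1" });
      await client.send(
        new PutNotificationConfigurationCommand({
          AutoScalingGroupName: "ecs-cluster-asg", // placeholder
          TopicARN: "arn:aws:sns:us-east-1:123456789012:ecs-scale-events", // placeholder
          NotificationTypes: [
            "autoscaling:EC2_INSTANCE_LAUNCH",
            "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
          ],
        })
      );
    }

You can then subscribe email, SMS, or a Lambda function to that SNS topic to actually receive the notification.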

Two Web Jobs Appear to be locking each other

Could someone help me discover what is going on with our App Services? We have two App Services, each connected to a Blob Storage container, with WebJobs that are triggered when an item is placed in the container they are listening to.
App One            App Two            (under the same subscription)
   |                  |
WebJobs (9)        WebJobs (9)
   |                  |
Container One      Container Two      (under the same storage account)
This represents our environments: App One is our dev environment and App Two is our test environment. Each item that is placed into one of the containers triggers a WebJob in its App Service. There is also an archive container under the storage account for each App Service, where a copy of the blob is archived.
The situation we are in is that we seem to be unable to run both WebJobs at the same time (one of the nine in each). We can only get a trigger activating in one WebJob when the WebJob in the other App Service is stopped. They appear to be locking each other out, but I was under the impression that the structure we have would keep all of that separate and the locks would not interfere with each other. The information I can find says that reading a blob gets a lock on the blob and updating a blob gets a lock on the container. If that is correct, then why do they appear to be locking each other out?
Any advice on what may be causing this, or how to move forward in troubleshooting it, will be greatly appreciated.
This problem seems to be related to your WebJobs function logic. If both WebJobs access the same resource at the same time, they will influence each other, and that can cause this problem. Please have a look at the conflict section.

Akka.net Cluster Debugging

The title is a bit misleading, so let me explain further.
I have a non-thread-safe DLL that I have no choice but to use as part of my back-end servers. I can't use it directly in my servers, as the threading issues it has cause it to crash. So, I created an Akka.NET cluster of N nodes, each of which hosts a single actor. All of my API calls that originally went to that bad DLL are now routed as messages to these nodes through a round-robin group. As each node only has a single, single-threaded actor, I get safe access, but as I have N of them running I get parallelism, of a sort.
In production, I have things configured with auto-down = false and default timings on heartbeats and so on. This works perfectly. I can fire up new nodes as needed and they get added to the group, and I can remove them with Cluster.Leave and that is happy as well.
My issue is with debugging. In our development environment we keep a cluster of 20 nodes, each exposing a single actor as described above that wraps this DLL. We also have a set of nodes that act as seed nodes and do nothing else.
When our application is run it joins the cluster. This allows it to direct requests through the round-robin router to the nodes we keep up in our cluster. When doing development, testing, and debugging of the app, if I configure things to use auto-down = false, we end up with problems whenever a test run crashes or we stop the application without going through proper cluster-leaving logic, such as when we terminate the app with the stop button in the debugger. Without auto-down, this leaves us with a missing member of the cluster that causes the leader to disallow additions to the cluster. This means that the next time I run the app to debug, I can't join the cluster and am stuck.
It seems that I have to have auto-down set to get debugging to work. If it is set, then when I crash my app the node is removed from the cluster 5 seconds later. When I next fire up my app, the cluster is back in a happy state and I can join just fine. The problem with this is that if I am debugging the application and pause it for any amount of time, it is almost immediately seen as unreachable and then, 5 seconds later, is thrown out of the cluster. Basically, I can't debug with these settings.
So, I set failure-detector.acceptable-heartbeat-pause = 600s to give me more time to pause the app while debugging. I will get shut down after 10 minutes, but I don't often sit in the debugger for that long, so it's an acceptable trade-off. The issue with this, of course, is that when I crash the app or stop it in the debugger, the cluster thinks it exists for the next 10 minutes. No one tries to talk to these nodes directly, so in theory that isn't a huge issue, but I keep running into cases where the test I just ran got itself elected as role leader. So the role leader is now dead, but the cluster doesn't know it yet. This seems to prevent me from joining anything new to the cluster until my 10 minutes are up. When I try to leave the cluster nicely, my dead node gets stuck in the exiting state and doesn't get removed for 10 minutes. And I don't always get notified of the removal either, forcing me to set a timeout on leaving that will cause it to give up.
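For reference, the two settings I am juggling live under the cluster section of the HOCON config, roughly like this (key names can vary a little between Akka.NET versions, and 600s is just the value mentioned above):

    akka {
      cluster {
        # production: automatic downing disabled
        # (what I referred to above as auto-down = false; newer versions spell it this way)
        auto-down-unreachable-after = off

        failure-detector {
          # debugging workaround: tolerate long pauses in the debugger
          # before the node is marked unreachable
          acceptable-heartbeat-pause = 600s
        }
      }
    }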
There doesn't seem to be any way to say "never let me be the leader". When I have run the app with no role set for the cluster, it seems to often get itself elected as the cluster leader, causing the same problem as when the role leader is dead but not yet known to be, but at a larger level.
So, I don't really see any way around this, but maybe someone has some tricks to pull this off. I want to be able to debug my cluster member without it being thrown out of the cluster, but I also don't want the cluster to think that leader nodes are around when they aren't, preventing me from rejoining during my next attempt.
Any ideas?

Monitoring glassfish session failover?

On a two-instance, single-node test cluster I wanted to get a list of which sessions are active on which instance, and then stop/kill an instance and get some information about the failover process. I want to see it happening.
I've read that it's considered a reasonable strategy to have multiple instances on a single node for "don't put all your eggs in one basket" reasons, so if an instance went bad I can see a need to figure out the session-to-instance mapping.
I've read all the docs I can think of reading but have not seen anything that does what I want. I am at a disadvantage because, since running the create-cluster command from asadmin, the admin console simply won't load (it tries to, but after 10 minutes it still has not loaded the login page).
Any suggestions? Is JMS something to look at here? I'm running GlassFish 3.1.2.
Thanks.