Spinnaker Pipeline is failing after starting it manually - config

Spinnaker on GKE:
The pipeline fails after being started manually.
The first issue was that .spin/config did not get created; I created it manually since it was missing from the steps at https://cloud.google.com/solutions/continuous-delivery-spinnaker-kubernetes-engine (what I put into it is shown at the end of this question).
Then, when I started the pipeline manually, it gave me an error in the production stage:
Exception ( Wait For Manifest To Stabilize )
WaitForManifestStableTask of stage Deploy Production Backend timed out after 30 minutes 4 seconds. pausedDuration: 0 seconds, elapsedTime: 30 minutes 4 seconds,timeoutValue: 30 minutes
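For reference, this is roughly what I put into ~/.spin/config by hand. It is only a minimal sketch; the Gate endpoint is a placeholder and depends on how spin-gate is exposed in your cluster (8084 is Gate's default port):
# ~/.spin/config -- minimal sketch; adjust the endpoint to wherever Gate is reachable
gate:
  endpoint: http://localhost:8084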

Many times this is caused by Kubernetes being unable to schedule the pods. I just had the same problem this morning, and it was due to a lack of resources. That would be the first thing I'd check. You can find the pod in the "Server Group" under the "Infrastructure" tab and view the details of why it's in the state it's in.
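If you prefer to check from the command line, the usual kubectl calls surface the same scheduling information; <namespace> and <pod-name> below are placeholders for your deployment:
# list the pods in the namespace the pipeline deploys to
kubectl get pods -n <namespace>
# a Pending pod's events usually state why it cannot be scheduled (e.g. Insufficient cpu/memory)
kubectl describe pod <pod-name> -n <namespace>
# recent events in the namespace, oldest first
kubectl get events -n <namespace> --sort-by=.lastTimestamp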

I find this Spinnaker GitHub page insightful. I wanted to add it as a comment, but I don't have enough reputation yet.

Related

EKS can only run 2 Fargate pods concurrently

This is my situation:
I followed this guide to build my cluster on EKS (with a Fargate profile): https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html. But when I install my Helm chart, only 2 pods run properly; the others stay in Pending status with this message: "Your AWS account has reached the limit on the number of Fargate pods it can run concurrently."
Could anyone help me fix this issue? Thanks.
Your region has not yet been activated. You have to launch an EC2 instance and terminate it to get past this error.
https://forums.aws.amazon.com/thread.jspa?threadID=292840
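The equivalent AWS CLI calls look roughly like this; the AMI ID and instance ID are placeholders, and any small instance type in the affected region should do:
# launch a throwaway instance in the region you want to activate
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type t2.micro --count 1
# terminate it again once it has started
aws ec2 terminate-instances --instance-ids i-xxxxxxxx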
I've found the solution: grant the IAM user full EC2 access (e.g. the AmazonEC2FullAccess managed policy).
I have found that even though the AWS Default Quota for Fargate tasks is high, the Applied Quota for new accounts is 2, which is stunningly low. I am in the same situation, and I have requested a quota increase.
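If you would rather check and raise the quota from the CLI than from the console, the Service Quotas API covers it. The quota code below is a placeholder; take the real one from the list output:
# list the Fargate quotas that apply to your account and region
aws service-quotas list-service-quotas --service-code fargate
# request an increase for the relevant quota (replace L-XXXXXXXX with the code from the list, and 50 with the value you need)
aws service-quotas request-service-quota-increase --service-code fargate --quota-code L-XXXXXXXX --desired-value 50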

Intermittent problems starting Azure App Services: "500.37 ANCM Failed to Start Within Startup Time Limit"

Our app services are experiencing a problem where they cannot be restarted by the hosting environment (ANCM).
The user is getting the following screen in that case:
HTTP Error 500.37
Our production subscription consists of up to 8 different app services, and the problem can randomly hit one or several of them.
The problem can occur several times a week, or just once a month.
The bootstrapping procedure of our app services is not time-consuming.
The last occurrence of the problem left these entries in the event log:
Failed to gracefully shutdown application 'MACHINE/WEBROOT/APPHOST/XXXXXXXXX'.
followed by:
Application '/LM/W3SVC/815681839/ROOT' with physical root 'D:\home\site\wwwroot' failed to load coreclr. Exception message: Managed server didn't initialize after 120000 ms
In most cases the problem can be resolved by manually stopping and starting the app service. In some cases we had to do that twice.
We are not able to reproduce that behavior locally.
The App Service Plan is S2 and we actually use just one instance.
The documentation for HTTP error 500.37 recommends:
"You may need to stagger the startup process of multiple apps."
But there is no hint of how to do that.
How can we ensure that our app services are restarted without errors?
HTTP Error 500.37 - ANCM Failed to Start Within Startup Time Limit
You can try the following approaches:
Approach 1: If possible, try to move one app into a new App Service with a separate App Service plan, then check whether it starts as expected.
Please note that creating and using a separate App Service plan incurs additional charges.
Approach 2: Increase the startupTimeLimit attribute of the aspNetCore element.
For more information about the startupTimeLimit attribute, please check: https://learn.microsoft.com/en-us/aspnet/core/host-and-deploy/aspnet-core-module?view=aspnetcore-3.1#attributes-of-the-aspnetcore-element
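As an illustration of Approach 2, a minimal web.config sketch with a raised limit could look like the following; processPath and arguments are placeholders for your deployment, and 240 is just an example value (the default is 120 seconds):
<configuration>
  <system.webServer>
    <handlers>
      <add name="aspNetCore" path="*" verb="*" modules="AspNetCoreModuleV2" resourceType="Unspecified" />
    </handlers>
    <!-- startupTimeLimit is in seconds; raise it if startup legitimately takes longer -->
    <aspNetCore processPath="dotnet"
                arguments=".\MyApp.dll"
                hostingModel="inprocess"
                startupTimeLimit="240" />
  </system.webServer>
</configuration>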

High availability: jobs not getting submitted immediately after NameNode failover

We have an application configured for high availability.
Of the 2 nodes, one of them (say NN1) is made active and the other one's (say NN2) NameNode process is killed. So now NN1 is active.
Now we submit a MapReduce job, and the logs keep saying:
"Application submission is not finished, submitted application application_someid is still in NEW_SAVING".
This happens for about 17 minutes, and then the job executes successfully.
So that means the failover has happened and NN1 is active. But why does it take so long?
The YARN NodeManager logs say:
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: . Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Can somebody please explain why this is happening?
Thanks in advance.
I don't know the cause of this problem, but restarting the YARN service helped me solve it.
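For what it's worth, the retry pattern in the NodeManager log (maxRetries=10, sleepTime=1000 ms) corresponds to the IPC client settings sketched below in core-site.xml. The values shown are the defaults, included only to indicate where that behaviour is configured; whether tuning them is appropriate depends on your cluster:
<!-- core-site.xml: IPC client retry settings behind RetryUpToMaximumCountWithFixedSleep -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>10</value>  <!-- default; matches "Already tried 9 time(s) ... maxRetries=10" -->
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>1000</value>  <!-- milliseconds between attempts; matches sleepTime=1000 -->
</property>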

How to run delayed_job's queued tasks on Heroku?

I'm currently using the delayed_job gem to queue and run background tasks in my application. Locally, I can just run rake jobs:work to process the queued tasks. However, when I deploy my app to Heroku, I don't want to keep invoking the rake command myself; I want it to be run automatically. Is there a way to do this without paying for a worker on Heroku?
I use cron without problems (with Django). All you need to do is configure as a task the same command that you would otherwise run via heroku run.
Remember that cron time counts as worker time, so be sure that the command ends.
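With delayed_job specifically there is already a rake task that processes the queue and then exits, which fits the "be sure that the command ends" advice; a scheduled task can simply run:
bundle exec rake jobs:workoff   # works off all queued jobs, then exits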
No, you can't do it without a worker.
The earlier point saying you need a worker is right; however, you do have free worker hours. There are 750 free hours per month (http://www.heroku.com/pricing#1-0). Given that a 31-day month is 744 hours, you have at least 6 free worker hours to use each month.
If you use the workless gem (https://github.com/lostboy/workless), it will spin up the worker only when needed (i.e. when jobs are waiting in delayed_job), then shut it down again. It works perfectly for my app, and 6 hours of background processing time a month is more than enough for my requirements.
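A sketch of the pieces involved, assuming delayed_job with the ActiveRecord backend (the workless-specific configuration lives in its README and is omitted here):
# Gemfile (versions omitted)
gem 'delayed_job_active_record'
gem 'workless'

# Procfile -- the worker process that gets scaled up and down on demand
worker: bundle exec rake jobs:work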

RUN#Cloud consistently throws me out during a heavy operation

I'm using a large app instance to run a basic Java web application (GWT + Spring). There's an expensive operation within my application (a report) which takes a long time to execute.
I've tried running it with the CloudBees SDK on my local machine with settings similar to those on the cloud, and it seems to work just fine. It runs in about 3-4 minutes.
On the cloud, it seems to take longer. The problem isn't that it takes long; what happens is that CloudBees terminates the session after 5 minutes and gives me an error in my browser saying 'Unable to connect to server. Please contact your administrator'. A report which doesn't take as long runs just fine. My application has a session timeout of 30 minutes, so that isn't the problem either.
What could possibly be going wrong? Is it something to do with CloudBees?
This may be due to proxy buffering of your request through the routing layer (revproxy), so it most likely isn't a session timeout but the HTTP connection getting cut.
You can set proxyBuffering=false via the bees CLI (e.g. when you deploy the app); this will ensure longer-running connections can work.
Ideally, however, you could change the app slightly so that it returns a token to the browser, which you can then poll for completion status; even with a connection that can last that long, going over the internet may provide a poor experience compared to running locally.
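The exact bees invocation depends on the SDK version; from memory it is along these lines, but treat the command form as an assumption and check the docs for your SDK:
# assumed syntax -- sets the runtime parameter on the existing app
bees app:update -a ACCOUNT/APPNAME proxyBuffering=false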