Google Container Engine clusters down after upgrading Kubernetes version - load-balancing

I've upgraded a couple of Google Container Engine clusters from Kubernetes version 1.2.4 to 1.3.5, and according to the UI the upgrade process is finished. However, hitting the corresponding Web servers (which are behind L7 load balancers AKA GLBCs) results in error 502 even though it's been about half an hour since the upgrade process finished.
How do I fix/troubleshoot this? I figure it's GLBC that's having trouble, as it's typically returning 502 while transitioning, but I have no idea how to troubleshoot it.

It seems the ingresses of my different clusters were clashing with each other after the Kubernetes upgrade, apparently because the ingress UIDs no longer worked; at least that's the current theory. Reinstalling the L7 ingresses for all clusters with reconfigured UIDs fixed it.

Ignite error upgrading the setup in Kubernetes

While upgrading the Ignite deployment in Kubernetes (EKS) to address the Log4j vulnerability, I got the error below:
[ignite-1] Caused by: class org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (54b55de4-7742-4e82-9212-7158bf51b4a9) is not compatible with BaselineTopology in the cluster. Joining node BlT id (4) is greater than cluster BlT id (3). New BaselineTopology was set on joining node with set-baseline command. Consider cleaning persistent storage of the node and adding it to the cluster again.
The setup is a 3-node cluster with native persistence enabled (PVC). This has happened many times in our use of Apache Ignite, even though we followed the official guide.
I cannot clean the storage because the pod keeps restarting: by the time I get a shell in the pod, it has crashed and restarted.
This might be caused by the wrong startup order. Starting the nodes manually in reverse order may resolve it, but I'm not sure whether that is possible in K8s. Another possible cause is baseline auto-adjustment, which might change your baseline unexpectedly; I suggest turning it off if it's enabled.
One workaround for cleaning the DB of a failing pod (quite tricky) is to replace the Ignite image with a simple image such as plain Debian or Alpine (just to get CLI access) while keeping the same PVC attached; once you have fixed the persistence issue, set the Ignite image back. The other is to access the underlying PV directly, if possible, and do the surgery in place.

Mirror canary deployment in RTL?

I am new to canary deployments. We are going to start doing canary deployments via Istio.
I was assuming this would just be a deployment mechanism, probably with some Istio routing testing in a pre-prod env, while in earlier test envs we'd ring-fence to the version being tested, as we do today.
It's been suggested that the canary concept be applied to all test environments, so that we effectively run all versions we expect to canary test in prod in the Route To Live.
Wondering what approach others are taking?
Mirroring
As mentioned here
Using Istio, you can use traffic mirroring to duplicate traffic to another service. You can incorporate a traffic mirroring rule as part of a canary deployment pipeline, allowing you to analyze a service's behavior before sending live traffic to it.
If you're looking for best practices, I would recommend starting with this tutorial on Medium, because it explains the topic very well.
How Traffic Mirroring Works
Traffic mirroring works using the steps below:
You deploy a new version of the application and switch on traffic mirroring.
The old version responds to requests like before but also sends an asynchronous copy to the new version.
The new version processes the traffic but does not respond to the user.
The operations team monitor the new version and report any issues to the development team.
As the application processes live traffic, it helps the team uncover issues that they would typically not find in a pre-production environment. You can use monitoring tools, such as Prometheus and Grafana, for recording and monitoring your test results.
Additionally there is an example with nginx that perfectly shows how it should work.
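The mirroring steps above can be sketched in plain Python. This is only an illustration of the idea, not Istio's implementation: `Service` and `MirroringProxy` are hypothetical names, and the proxy stands in for whatever component duplicates the traffic. The primary version answers the user, while the copy sent to the new version is fire-and-forget and its response is discarded.

```python
from concurrent.futures import ThreadPoolExecutor


class Service:
    """A stand-in for one deployed version of an application (hypothetical)."""

    def __init__(self, version):
        self.version = version
        self.seen = []  # requests this version has processed

    def handle(self, request):
        self.seen.append(request)
        return f"{self.version} handled {request}"


class MirroringProxy:
    """Serves from the primary and sends an asynchronous copy to the mirror."""

    def __init__(self, primary, mirror):
        self.primary = primary
        self.mirror = mirror
        self._pool = ThreadPoolExecutor(max_workers=2)

    def handle(self, request):
        # The copy is submitted asynchronously; its result never reaches the user.
        self._pool.submit(self.mirror.handle, request)
        return self.primary.handle(request)

    def close(self):
        # Wait for in-flight mirrored requests before shutting down.
        self._pool.shutdown(wait=True)


v1, v2 = Service("v1"), Service("v2")
proxy = MirroringProxy(primary=v1, mirror=v2)
responses = [proxy.handle(f"GET /item/{i}") for i in range(3)]
proxy.close()
```

After the run, every response in `responses` comes from `v1`, while `v2.seen` holds the same requests, which is what the operations team would then inspect.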
Canary deployment
As mentioned here
One of the benefits of the Istio project is that it provides the control needed to deploy canary services. The idea behind canary deployment (or rollout) is to introduce a new version of a service by first testing it using a small percentage of user traffic, and then if all goes well, increase, possibly gradually in increments, the percentage while simultaneously phasing out the old version. If anything goes wrong along the way, we abort and rollback to the previous version. In its simplest form, the traffic sent to the canary version is a randomly selected percentage of requests, but in more sophisticated schemes it can be based on the region, user, or other properties of the request.
Depending on your level of expertise in this area, you may wonder why Istio’s support for canary deployment is even needed, given that platforms like Kubernetes already provide a way to do version rollout and canary deployment. Problem solved, right? Well, not exactly. Although doing a rollout this way works in simple cases, it’s very limited, especially in large scale cloud environments receiving lots of (and especially varying amounts of) traffic, where autoscaling is needed.
Here are the differences between a Kubernetes (k8s) canary deployment and an Istio canary deployment.
k8s
As an example, let’s say we have a deployed service, helloworld version v1, for which we would like to test (or simply rollout) a new version, v2. Using Kubernetes, you can rollout a new version of the helloworld service by simply updating the image in the service’s corresponding Deployment and letting the rollout happen automatically. If we take particular care to ensure that there are enough v1 replicas running when we start and pause the rollout after only one or two v2 replicas have been started, we can keep the canary’s effect on the system very small. We can then observe the effect before deciding to proceed or, if necessary, rollback. Best of all, we can even attach a horizontal pod autoscaler to the Deployment and it will keep the replica ratios consistent if, during the rollout process, it also needs to scale replicas up or down to handle traffic load.
Although fine for what it does, this approach is only useful when we have a properly tested version that we want to deploy, i.e., more of a blue/green, a.k.a. red/black, kind of upgrade than a “dip your feet in the water” kind of canary deployment. In fact, for the latter (for example, testing a canary version that may not even be ready or intended for wider exposure), the canary deployment in Kubernetes would be done using two Deployments with common pod labels. In this case, we can’t use autoscaling anymore because it’s now being done by two independent autoscalers, one for each Deployment, so the replica ratios (percentages) may vary from the desired ratio, depending purely on load.
Whether we use one deployment or two, canary management using deployment features of container orchestration platforms like Docker, Mesos/Marathon, or Kubernetes has a fundamental problem: the use of instance scaling to manage the traffic; traffic version distribution and replica deployment are not independent in these systems. All replica pods, regardless of version, are treated the same in the kube-proxy round-robin pool, so the only way to manage the amount of traffic that a particular version receives is by controlling the replica ratio. Maintaining canary traffic at small percentages requires many replicas (e.g., 1% would require a minimum of 100 replicas). Even if we ignore this problem, the deployment approach is still very limited in that it only supports the simple (random percentage) canary approach. If, instead, we wanted to limit the visibility of the canary to requests based on some specific criteria, we still need another solution.
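To make the replica arithmetic concrete, here is a small sketch that assumes every replica receives an equal slice of requests, as in the round-robin pool described above. The function name `min_replicas_for_share` is made up for this illustration.

```python
from fractions import Fraction


def min_replicas_for_share(canary_percent):
    """Return the smallest (canary_replicas, total_replicas) that gives exactly
    this traffic share when all replicas receive equal traffic."""
    share = Fraction(str(canary_percent)) / 100
    if not 0 < share < 1:
        raise ValueError("canary_percent must be strictly between 0 and 100")
    # A Fraction is always stored in lowest terms, so the denominator is the
    # smallest total replica count that can express the share exactly.
    return share.numerator, share.denominator


one_percent = min_replicas_for_share(1)      # (1, 100)
quarter = min_replicas_for_share(25)         # (1, 4)
fine_grained = min_replicas_for_share(2.5)   # (1, 40)
```

A 1% canary therefore needs at least 100 replicas in total, matching the figure quoted above, and finer percentages only make this worse.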
istio
With Istio, traffic routing and replica deployment are two completely independent functions. The number of pods implementing services are free to scale up and down based on traffic load, completely orthogonal to the control of version traffic routing. This makes managing a canary version in the presence of autoscaling a much simpler problem. Autoscalers may, in fact, respond to load variations resulting from traffic routing changes, but they are nevertheless functioning independently and no differently than when loads change for other reasons.
Istio’s routing rules also provide other important advantages; you can easily control fine-grained traffic percentages (e.g., route 1% of traffic without requiring 100 pods) and you can control traffic using other criteria (e.g., route traffic for specific users to the canary version). To illustrate, let’s look at deploying the helloworld service and see how simple the problem becomes.
There is an example.
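The independence of routing weights and replica counts can be illustrated with a short simulation. This is a sketch under stated assumptions, not how Istio is implemented: `route`, `replicas`, and `weights` are hypothetical names, and the router simply picks a version per request according to the configured percentages.

```python
import random


def route(weights, rng):
    """Pick a version for one request according to routing weights."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


# Hypothetical setup: replica counts and routing weights are set independently.
replicas = {"v1": 3, "v2": 1}   # scaled by load, not by the desired traffic share
weights = {"v1": 99, "v2": 1}   # a 1% canary without needing 100 pods

rng = random.Random(0)          # seeded so the run is repeatable
total = 100_000
canary_hits = sum(route(weights, rng) == "v2" for _ in range(total))
canary_share = canary_hits / total
```

Here `canary_share` stays close to 0.01 even though the canary accounts for one in four replicas, because the replica counts never enter the routing decision.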
Here are additional resources on traffic mirroring and traffic shifting in Istio that you may want to check:
https://istio.io/latest/docs/tasks/traffic-management/mirroring/
https://itnext.io/use-istio-traffic-mirroring-for-quicker-debugging-a341d95d63f8
https://dev.to/peterj/mirroring-traffic-with-istio-service-mesh-2cm4
https://livebook.manning.com/book/istio-in-action/chapter-5/v-7/130
https://istio.io/latest/docs/tasks/traffic-management/traffic-shifting/#apply-weight-based-routing

Should all pods using a Redis cache be constrained to the same node as the Redis cache itself?

We are running one of our services in a newly created Kubernetes cluster. Because of that, we have now switched it from the previous "in-memory" cache to a Redis cache.
Preliminary tests on our application, which exposes an API, show that we experience timeouts from our application to the Redis cache. I have no idea why, and the issue pops up very irregularly.
So I'm thinking maybe the reason for these timeouts is actually network related. Is it a good idea to add affinity so that we always run the Redis cache on the same nodes as the application, to prevent network issues?
The issues have not arisen during "very high load" situations so it's concerning me a bit.
This is an opinion question so I'll answer in an opinionated way:
Like you mentioned, I would try to put the Redis and application pods on the same node; that would rule out network issues between nodes. You can accomplish that with Kubernetes pod affinity. You can also try nodeSelector, which pins your Redis and application pods to a specific node.
Another way to do this is to taint the nodes where you want to run your workloads and then add a toleration to the Redis and application pods. Note that a toleration only allows the pods onto the tainted nodes and keeps other pods off them; it does not by itself force the pods there, so it is usually combined with a nodeSelector or affinity.
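As a rough sketch of the pod affinity option, the snippet below builds the relevant part of a pod spec as a Python dict and prints it as JSON, which kubectl accepts in place of YAML. The `app: redis` label, the container name, and the image are assumptions made for illustration; substitute whatever labels your Redis pods actually carry.

```python
import json


def with_redis_affinity(pod_spec, redis_labels):
    """Return a copy of pod_spec that must be scheduled on a node which
    already runs a pod matching redis_labels."""
    spec = dict(pod_spec)
    spec["affinity"] = {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": dict(redis_labels)},
                    # Co-locate at node granularity.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
    return spec


# Hypothetical application pod spec and Redis label.
app_spec = {"containers": [{"name": "api", "image": "example/api:1.0"}]}
patched = with_redis_affinity(app_spec, {"app": "redis"})
print(json.dumps(patched, indent=2))
```

Because the rule is a hard requirement, the application pods stay unscheduled if no node runs a matching Redis pod; `preferredDuringSchedulingIgnoredDuringExecution` is the softer variant.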
Hope it helps!

Deploying ASP.NET Core application to ElasticBeanstalk without temporary HTTP 404

Currently, ElasticBeanstalk supports ASP.NET Core applications only on Windows platforms (when using the web role), and on a Windows-based platform you can't have Immutable updates or even RollingWithAdditionalBatch, for whatever reason. If the application is running with a single instance, you end up with the situation that the only running instance is being updated. (Possible reasons for running a single instance: saving cost because it is just a small backend service, or it might be a service that requires a lot of RAM in comparison to CPU time, so it makes more sense to run one larger instance vs. multiple smaller instances.)
As a result, during deployment of a new application version, for a period of up to 30 seconds, you first get HTTP 503, then HTTP 404, later HTTP 502 Bad Gateway, before the new application version actually becomes available. Obviously this is much worse compared to e.g. using WebDeploy on a single server in a "classic" environment.
Possible workarounds I can think of:
Blue/Green deployments: slow (because it depends on DNS changes), and it seems like it is more suitable for "supervised" deployments, not for automated deploy pipelines.
Modify the autoscaling group to enforce 2 active instances before deployment (so that EB can do its normal Rolling update thing), then change back. However it is far from ideal to mess with resources created and managed by EB (like the autoscaling group), and it requires a fairly complex script (you need to wait for the second instance to become active, need to wait for rolling deployment etc.).
I can't believe that these are the only options. Any other ideas? The minimal viable workaround for me would be to at least get rid of the temporary 404s, because they could seriously mislead API clients (or think of the SEO effect in the case of a website if a search engine spider gets a 404 for every URL). As long as it is a 5xx, at least everybody knows it is just a temporary error.
Finally, in Feb 2019, AWS released Elastic Beanstalk Windows Server platform v2, which supports Immutable and Rolling with additional batch deployments and platform updates (as their Linux-based stacks have supported for ages):
https://docs.aws.amazon.com/elasticbeanstalk/latest/relnotes/release-2019-02-21-windows-v2.html
This solves the problem even for environments (normally) running just one instance.

Using Kubernetes or Apache mesos

We have a product which is described in some docker files, which can create the necessary docker containers. Some docker containers will just run some basic apps, while other containers will run clusters (hadoop).
Now the question is which cluster manager I should use.
Kubernetes or Apache mesos or both?
I read Kubernetes is good for 100% containerized environments, while Apache Mesos is better for environments which are a bit containerized and a bit not-containerized. But Apache Mesos is better for running hadoop in docker (?).
Our environment is composed of only docker containers, some running a Hadoop cluster and some running apps.
What will be the best?
Both functionally do the same thing, orchestrating Docker containers, but they do it in different ways, and what you can easily achieve with one might prove difficult with the other, and vice versa.
Mesos has a higher complexity and learning curve, in my opinion. Kubernetes is relatively simpler and easier to grasp. You can literally spawn your own Kube master and minions by running one command and specifying the provider: Vagrant, AWS, etc. Kubernetes can also be integrated into Mesos, so you could try both.
For the Hadoop-specific use case you mention, Mesos might have an edge, as it might integrate better with the Apache ecosystem; Mesos and Spark were created by the same minds.
Final thoughts: start with Kube, progressively exploring how to make it work for your use case. Then, after you have a good grasp on it, do the same with Mesos. You might end up liking pieces of each and you can have them coexist, or find that Kube is enough for what you need.