Does setting "desired size: 0" prevent cluster-autoscaler from scaling up a managed node group? - amazon-eks

I have an AWS managed node group that is behaving unexpectedly when I set both the desired size and the minimum size to 0. I would expect the managed node group to start with no nodes, and that once I try to schedule a pod using a nodeSelector with the label eks.amazonaws.com/nodegroup: my-node-group-name, the cluster-autoscaler would set the desired size of the managed node group to 1 and a node would be booted.
However, the cluster-autoscaler logs indicate that the pending pod does not trigger a scale-up because it supposedly would not be schedulable: pod didn't trigger scale-up (it wouldn't fit if a new node is added). When I manually set the desired size to 1 on the managed node group, however, the pod is scheduled successfully, so I know the nodeSelector works fine.
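For context, the pending pod's spec looks roughly like this (a minimal sketch; only the node group label is taken from the question, everything else is a placeholder):
apiVersion: v1
kind: Pod
metadata:
  name: nodegroup-test
spec:
  nodeSelector:
    eks.amazonaws.com/nodegroup: my-node-group-name
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]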
I thought this might be a labelling issue, as described here: , but I have the labels on my managed node group set up to be auto-discoverable, and the cluster-autoscaler is configured for auto-discovery:
spec:
  containers:
  - command:
    - ./cluster-autoscaler
    - --cloud-provider=aws
    - --namespace=kube-system
    - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster-name
    - --balance-similar-node-groups=true
    - --expander=least-waste
    - --logtostderr=true
    - --skip-nodes-with-local-storage=false
    - --skip-nodes-with-system-pods=false
    - --stderrthreshold=info
    - --v=4
I have also set the corresponding tags on the autoscaling group:
Key                                        Value               Tag new instances
eks:cluster-name                           my-cluster-name     Yes
eks:nodegroup-name                         my-node-group-name  Yes
k8s.io/cluster-autoscaler/enabled          true                Yes
k8s.io/cluster-autoscaler/my-cluster-name  owned               Yes
kubernetes.io/cluster/my-cluster-name      owned               Yes
Am I missing something? Or is this expected behavior for setting desired size to 0?

Ugh, it turns out this is an AWS incompatibility with the cluster-autoscaler that is not called out anywhere. You can scale your managed node group down to zero, but without a workaround you can't scale it back up.
To scale a node group up from 0, the cluster-autoscaler constructs a template node based on the node group's specification, in this case the AWS autoscaling group. For the cluster-autoscaler to know which labels to put on that template node when checking whether the pending pod could be scheduled, you need to add a specific tag to the autoscaling group.
Sadly, AWS does not add this tag to the autoscaling group for you, and it also does not propagate tags from the managed node group to the autoscaling group. The only way to make this work is to add the tag to the autoscaling group yourself after the managed node group has created it. The issue is tracked here.
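Concretely, the tag the cluster-autoscaler looks for has the form k8s.io/cluster-autoscaler/node-template/label/<label-name>. A rough sketch of adding it with the AWS CLI (the autoscaling group name eks-my-node-group-asg is a placeholder; the label matches the nodeSelector above):
# tag the ASG so the autoscaler's template node carries the nodegroup label
aws autoscaling create-or-update-tags --tags \
  "ResourceId=eks-my-node-group-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/eks.amazonaws.com/nodegroup,Value=my-node-group-name,PropagateAtLaunch=true"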

EKS now supports this with Cluster Autoscaler. https://realz.medium.com/reduce-amazon-eks-cost-by-scaling-node-groups-to-zero-41dce9db50ef

Related

KEDA - how does redis listLength scale pods (in detail)

The docs for listLength in the redis trigger are quite confusing. From reading around and experimenting, it seems that listLength is the threshold over which a new pod should be created, i.e. if listLength is 16, a pod will be spun up when there are 16 or more items in the given list.
How would I guarantee that a pod (up to the maxReplicaCount threshold) will be spun up for every item in the list? The problem I'm having is that some number of pods gets spun up (which can be below maxReplicaCount even when there are more elements in the list); over time the pods die and I end up with a couple of pods struggling with a long list, which makes this too unpredictable to use in production.
What I really want is: if I have a list length of 20, 20 pods are spun up; if I have a maxReplicaCount of 20, any further list elements are consumed by the existing pods, but the 20 pods that exist do not spin down until the list length is below one per pod.
From my observation of the behavior of KEDA's redis scaler with our app, the maxReplicaCount, cooldownPeriod, and listLength parameters are what matter for your question.
listLength is a trigger threshold, so if you want 20 pods working on a list of length 20, set listLength to 1 so that each queued task triggers a replica (though I am not sure this is guaranteed, because listLength is an average value according to the docs).
With listLength=1 things can get busy when a job is short-lived, because a pod is created and terminated (after cooldownPeriod) for every enqueued item. It can be reasonable when one job takes a long time to process. (In our app's case, I set listLength=5 to let one pod process several queued jobs in its event loop; another pod is created when the list length reaches 5.)
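For reference, a minimal KEDA 2.x-style ScaledObject showing where those three knobs live (the deployment name, Redis address, and list name are placeholders, not from the question):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker                        # deployment to scale (placeholder)
  cooldownPeriod: 300                   # seconds to wait before scaling back down
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc:6379   # placeholder
      listName: jobs                    # placeholder
      listLength: "1"                   # target average list items per replica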

Inconsistent behavior of Quartz2 scheduler in Apache Camel

I have an Apache Camel project that is using Quartz2 as the scheduler. The requirement is to make it a cluster. The code is deployed to WebLogic 12c, and Quartz is configured as per many samples with clustering enabled.
This is my properties file (without the datasource):
org.quartz.scheduler.instanceName = MyScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 10
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.OracleDelegate
org.quartz.jobStore.useProperties=true
org.quartz.JobBuilder.requestRecovery=true
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
When I deploy and start both nodes, I see that the QRTZ_SCHEDULER_STATE table has an extra entry for one of the nodes:
MyScheduler-routerContext server_node21567108546690
MyScheduler-routerContext-1 server_node11565896495100
MyScheduler-routerContext-1 server_node11567108547295
And I am guessing that because of this, one node is invoked only once in a while, while the other node gets invoked all the time (so occasionally both nodes are invoked at the same time).
I have tried a clean restart of the WebLogic nodes, but the issue is still there.
This is how my route(s) look:
from("quartz2://provRegGroup/createUsersTrigger?cron={{create_users_cron}}&job.name=createUsersJob")
.routeId("createUsersRB")
.log("**** starting check for create users");
//where
//create_users_cron=0+0,5,10,15,20,25,30,35,40,45,50,55+*+*+*+?
//expecting one node being called by the scheduler at a time..
I figured out what caused the issue. Apparently there were orphan WebLogic processes running on one (or even both) of the nodes; why things got into such a mess is a question for our tech architects. ps showed two WebLogic servers running on a node: one that I had started recently and one that had been there for about a month.
Assuming this would never happen in a production environment, I consider the issue resolved.
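For anyone chasing the same symptom, a quick generic check for stray server JVMs on a node (not specific to this setup) is something like:
# list running WebLogic server JVMs; the [w] keeps grep from matching itself
ps -ef | grep '[w]eblogic.Server'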

Redis cluster cannot add nodes

There are two Redis servers, and I have run three Redis instances on each server.
When I executed cluster meet [ip] [port] to add the cluster nodes, I found I could only add the nodes running on the same server. Every time I run this command, it always echoes "OK", but when I use cluster nodes to check the node list, it always shows something like this:
172.18.0.155:7010> cluster meet 172.18.0.156 7020
OK
172.18.0.155:7010> cluster nodes
ad829d8b297c79f644f48609f17985c5586b4941 127.0.0.1:7010@17010 myself,master - 0 1540538312000 1 connected
87a8017cfb498e47b6b48f0ad69fc066c466a9c2 172.18.0.156:7020@17020 handshake - 1540538308677 0 0 disconnected
fdf5879554741759aab14eba701dc185b605ac16 127.0.0.1:7012@17012 master - 0 1540538313000 0 connected
ec7b3ecba7a175ddb81f254821243dd469a7f961 127.0.0.1:7011@17011 master - 0 1540538314288 2 connected
You can see that the node's status is disconnected, and you will find it disappears from the list if you check again about 5 seconds later.
Has anybody met this problem before? I have no idea how to solve it. Please help me. Thanks a lot.
I have solved the problem. I found I had made a mistake in the bind configuration. Once I set bind to just the one IP that communicates with the other nodes, the cluster nodes could be added normally.
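In other words, something along these lines in each instance's redis.conf (the address and port are the example values from the question):
# bind only the address that the other cluster nodes can reach,
# not 127.0.0.1, so the node is announced with a routable IP
bind 172.18.0.155
port 7010
cluster-enabled yes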

Why flink container vcore size is always 1

I am running Flink on YARN (more precisely, in an AWS EMR YARN cluster).
I read in the Flink documentation and source code that, by default, for each task manager container, Flink will request a number of vcores equal to the number of slots per task manager when requesting resources from YARN.
And I also confirmed it from the source code:
// Resource requirements for worker containers
int taskManagerSlots = taskManagerParameters.numSlots();
int vcores = config.getInteger(ConfigConstants.YARN_VCORES,
        Math.max(taskManagerSlots, 1));
Resource capability = Resource.newInstance(containerMemorySizeMB,
        vcores);
resourceManagerClient.addContainerRequest(
        new AMRMClient.ContainerRequest(capability, null, null, priority));
When I use -yn 1 -ys 3 to start Flink, I assume YARN will allocate 3 vcores for the only task manager container, but when I check the number of vcores for each container in the YARN resource manager web UI, I always see that the number of vcores is 1. I also see vcores = 1 in the YARN resource manager logs.
I debugged the Flink source code down to the lines pasted above, and I saw that the value of vcores was 3.
This really confuses me; can anyone help clarify it? Thanks.
An answer from Kien Truong
Hi,
You have to enable CPU scheduling in YARN; otherwise it always shows that only 1 CPU is allocated for each container, regardless of how many vcores Flink tries to allocate. You should add (or edit) the following property in capacity-scheduler.xml:
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <!-- <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> -->
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
Also, the task manager memory is, for example, 1400MB, but Flink reserves some amount for off-heap memory, so the actual heap size is smaller.
This is controlled by 2 settings:
containerized.heap-cutoff-min: default 600MB
containerized.heap-cutoff-ratio: default 15% of TM's memory
That's why your TM's heap size is limited to ~800MB (1400 - 600).
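Spelled out with the defaults quoted above:
cutoff = max(heap-cutoff-min, heap-cutoff-ratio * TM memory)
       = max(600MB, 0.15 * 1400MB) = max(600MB, 210MB) = 600MB
heap  ≈ 1400MB - 600MB = 800MB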
@yinhua:
Use the command ./bin/yarn-session.sh to start a session; you need to add the -s argument.
-s,--slots Number of slots per TaskManager
details:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/yarn_setup.html
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/cli.html#usage
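For example, something like the following (the memory size is only illustrative):
# 1 TaskManager with 3 slots; with the DominantResourceCalculator enabled,
# the container should be granted 3 vcores
./bin/yarn-session.sh -n 1 -s 3 -tm 1400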
I finally got the answer.
It's because YARN uses the "DefaultResourceCalculator" allocation strategy, so only memory is counted by the YARN RM; even though Flink requested 3 vcores, YARN simply ignores the CPU core count.

julia on PBS cluster: what to give to addprocs()?

I'm trying to set up a cluster across machines on a PBS-managed cluster. I'm perfectly able to compute within one node by saying julia -p 12 (after having reserved one node with 12 CPUs).
I understand that to use several machines, I have to add them to the master process with addprocs. I was able to do that on a different cluster (SGE), but on this one something is going wrong.
You can see everything I'm doing, including submit scripts etc, on this branch of a github repo.
To get a list of machines, I parse the PBS_NODEFILE, which for a submit script with the option
#PBS -l nodes=2:ppn=12 # give me 2 nodes with 12 processors each
looks something like this:
red0004
red0004
...
red0004
red0347
...
red0347
I parse this file with bind_pe_procs() in sge.jl in the repo and give a vector of machine names to addprocs. When I submit this I get an error; I put up a gist with the resulting SSH error. I don't know what it means.
Does this have to do with a system setting, i.e. do I have to talk to the sysadmin about SSH between machines? What are the right questions to ask?
I am also unsure about what exactly I have to give to addprocs(). I don't want to add the master process (I don't want worker 1 SSHing into itself?), so I exclude ENV["HOST"] = node001 from my list. But what about all the processors with the same name node002? Do I list each of them,
machines = ["red0347" for i=1:12]
or just once,
machines = ["red0347"]
in addprocs(machines)?
Thanks!
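For what it's worth, a minimal Julia sketch of the kind of parsing described above (the function name is hypothetical, not the repo's bind_pe_procs; addprocs also accepts ("host", n) tuples, which sidesteps the repeated-name question):
# count how many times each host appears in the PBS node file,
# skipping the host the master process runs on
function machines_from_nodefile(nodefile, master)
    counts = Dict{String,Int}()
    for line in eachline(nodefile)
        host = String(strip(line))
        (isempty(host) || host == master) && continue
        counts[host] = get(counts, host, 0) + 1
    end
    # one ("host", n) tuple per machine: n workers started over SSH on that host
    return [(host, n) for (host, n) in counts]
end

machines = machines_from_nodefile(ENV["PBS_NODEFILE"], gethostname())
addprocs(machines)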