OpenShift Origin: Node not ready

I appear to have some problem with my installation of OpenShift Origin.
When I get endpoints for the router, I get the following:
oc get endpoints --namespace=default --selector=router
NAME ENDPOINTS AGE
router-west <none> 21m
Obviously the router should have at least one endpoint.
I'm trying to follow the troubleshooting guide at https://docs.openshift.com/enterprise/3.1/admin_guide/sdn_troubleshooting.html#debugging-the-router, but it does not provide assistance for the situation where the router has no endpoints.
When I get my list of nodes, I get:
oc get nodes
NAME LABELS STATUS AGE
openshift.hughestech.space kubernetes.io/hostname=openshift.mydomain.com NotReady 38d
When I describe the node, I get the following:
oc describe node openshift.mydomain.com
Name: openshift.mydomain.com
Labels: kubernetes.io/hostname=openshift.mydomain.com
CreationTimestamp: Sat, 06 Feb 2016 21:44:23 +0100
Phase:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
──── ────── ───────────────── ────────────────── ────── ───────
Ready Unknown Fri, 04 Mar 2016 18:50:39 +0100 Fri, 04 Mar 2016 18:51:21 +0100 NodeStatusUnknown Kubelet stopped posting node status.
Addresses: 88.198.37.183,88.198.37.183
Capacity:
memory: 24515560Ki
pods: 40
cpu: 8
System Info:
Machine ID: bafaea4f3c4c4cf6a632047c1d14db1a
System UUID: 00000000-0000-0000-0000-002421DDE3D7
Boot ID: f9febe14-ec61-41d5-b7c3-db2e42f9b452
Kernel Version: 3.10.0-327.4.5.el7.x86_64
OS Image: Red Hat Enterprise Linux
Container Runtime Version: docker://1.8.2-el7
Kubelet Version: v1.1.0-origin-1107-g4c8e6f4
Kube-Proxy Version: v1.1.0-origin-1107-g4c8e6f4
ExternalID: openshift.mydomain.com
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
───────── ──── ──────────── ────────── ─────────────── ─────────────
Allocated resources:
(Total limits may be over 100%, i.e., overcommitted. More info: http://releases.k8s.io/HEAD/docs/user-guide/compute-resources.md)
CPU Requests CPU Limits Memory Requests Memory Limits
──────────── ────────── ─────────────── ─────────────
0 (0%) 0 (0%) 0 (0%) 0 (0%)
No events.
Where have I gone wrong? What do I need to do?
Thanks

Restart the node service and see if that makes a difference in oc get nodes output.
systemctl restart origin-node
Unless your node is Ready, you cannot have a running router pod, and without a running pod the service has no endpoints.
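If the node still shows NotReady after the restart, the node service logs usually say why (a rough check, assuming the unit is named origin-node as above):
systemctl status origin-node
journalctl -u origin-node --no-pager | tail -n 50
Once oc get nodes reports the node as Ready, the router pod should get scheduled and the endpoints should show up in the oc get endpoints output from the question.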

Related

Unable to ssh VM after hardware configuration change

I followed the recommendation to reduce the size of my VM (number of CPUs from 4 to 2 and memory from 16 GB to 8 GB). After updating the configuration and restarting the VM, I was no longer able to access it via SSH.
The VM has an external IP.
The troubleshooting diagnostics run with gcloud do not show any error or issue in the logs, and everything is fine regarding the firewall configuration.
I tried to create a new VM under the same project as the original VM; I cannot access it with SSH either. If I create a new project and a new VM instance under that new project, then I can SSH into it, so the problem seems to be related to the project itself.
I tried to access the VM via the serial port and I am getting these errors:
Mar 8 20:31:11 myvm systemd[1]: Started Google OSConfig Agent.
Mar 8 20:32:11 myvm OSConfigAgent[1173]: 2022-03-08T20:32:11.5643Z OSConfigAgent Critical main.go:100: Error parsing metadata, agent cannot start: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=0&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Mar 8 20:32:11 myvm systemd[1]: google-osconfig-agent.service: Main process exited, code=exited, status=1/FAILURE
Mar 8 20:32:11 myvm systemd[1]: google-osconfig-agent.service: Failed with result 'exit-code'.
Mar 8 20:32:12 myvm systemd[1]: google-osconfig-agent.service: Service hold-off time over, scheduling restart.
Mar 8 20:32:12 myvm systemd[1]: google-osconfig-agent.service: Scheduled restart job, restart counter is at 4.
I am stuck and would appreciate your support. Any idea or suggestion?
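One thing worth checking from the serial console (a rough sketch; the interface name, e.g. eth0 or ens4, depends on the image) is whether the guest still has a configured NIC and a default route, since "network is unreachable" for 169.254.169.254 usually means the routing table is empty:
ip addr
ip route
sudo dhclient -v eth0    # adjust the interface name; tries to re-acquire a DHCP lease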

Apache Server Many requests stuck in "R" Reading Request

Below is the apache2ctl status output, with almost no users online.
For over 5 years we (a cloud ERP supplier) have been deploying instances on Google Cloud running Apache with mod_perl.
This week our largest server became slow and unresponsive. No idle workers were available. It turned out that increasing both MaxRequestWorkers and ServerLimit from 150 to 400 in mpm_prefork.conf got our server back fast.
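For reference, the change amounts roughly to the following in mpm_prefork.conf (typically /etc/apache2/mods-available/mpm_prefork.conf on Ubuntu; other prefork directives were left untouched). Note that ServerLimit is only read at full startup, so it requires a full restart (e.g. sudo service apache2 restart) rather than a graceful reload:
<IfModule mpm_prefork_module>
    ServerLimit           400
    MaxRequestWorkers     400
</IfModule>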
I'm wondering why so many requests stay in "R" Reading Request: at least 10 times more than there actually should be.
We did further checking, and DoS does not seem to be the issue: on other servers, in different clouds such as AWS or Alibaba, we notice the same ratio of about 10 between requests actually being processed (R/W/K) and requests that stay in Reading mode.
What could cause this?
sudo /usr/sbin/apache2ctl status
Apache Server Status for localhost (via 127.0.0.1)
Server Version: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.29 OpenSSL/1.0.1f
mod_perl/2.0.8 Perl/v5.18.2
Server MPM: prefork
Server Built: Apr 3 2019 18:04:25
Current Time: Saturday, 29-Feb-2020 10:15:35 CET
Restart Time: Thursday, 27-Feb-2020 09:45:48 CET
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 2 days 29 minutes 47 seconds
Server load: 0.75 0.77 0.75
Total accesses: 1581181 - Total Traffic: 8.6 GB
CPU Usage: u30.32 s9.64 cu0 cs0 - .0229% CPU load
9.06 requests/sec - 51.5 kB/second - 5.7 kB/request
96 requests currently being processed, 9 idle workers
RRKRRRK_RKRKKRRRRRK_RRRRKRCK_RRRC_CKK_KCRKCRK_RCR__CKKCCRCRRRRRR
RRRRR.RRRKRRRKRRR_RR..R.K.RCRKR.CKK.RRKKR.W.RRKR.....RR.........
................................................................
................................................................
................................................................
................................................................
................
Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process

Kubernetes dashboard authentication on atomic host

I am a total newbie when it comes to Kubernetes/Atomic Host, so my question may be really trivial or already well discussed, but unfortunately I couldn't find any clues on how to achieve my goal, which is why I am here.
I have set up a Kubernetes cluster on Atomic Hosts (right now I have just one master and one node). I am working on virtual machines in a cloud network.
[root@master ~]# kubectl get node
NAME STATUS AGE
192.168.2.3 Ready 9d
After a lot of fuss I managed to set up the Kubernetes dashboard UI on my master.
[root@master ~]# kubectl describe pod --namespace=kube-system
Name: kubernetes-dashboard-3791223240-8jvs8
Namespace: kube-system
Node: 192.168.2.3/192.168.2.3
Start Time: Thu, 07 Sep 2017 10:37:31 +0200
Labels: k8s-app=kubernetes-dashboard
pod-template-hash=3791223240
Status: Running
IP: 172.16.43.2
Controllers: ReplicaSet/kubernetes-dashboard-3791223240
Containers:
kubernetes-dashboard:
Container ID: docker://8fddde282e41d25c59f51a5a4687c73e79e37828c4f7e960c1bf4a612966420b
Image: gcr.io/google_containers/kubernetes-dashboard-amd64:v1.6.3
Image ID: docker-pullable://gcr.io/google_containers/kubernetes-dashboard-amd64@sha256:2c4421ed80358a0ee97b44357b6cd6dc09be6ccc27dfe9d50c9bfc39a760e5fe
Port: 9090/TCP
Args:
--apiserver-host=http://192.168.2.2:8080
Limits:
cpu: 100m
memory: 300Mi
Requests:
cpu: 100m
memory: 100Mi
State: Running
Started: Fri, 08 Sep 2017 10:54:46 +0200
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 07 Sep 2017 10:37:32 +0200
Finished: Fri, 08 Sep 2017 10:54:44 +0200
Ready: True
Restart Count: 1
Liveness: http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
Volume Mounts: <none>
Environment Variables: <none>
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
No volumes.
QoS Class: Burstable
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1d 32m 3 {kubelet 192.168.2.3} Warning MissingClusterDNS kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to DNSDefault policy.
1d 32m 2 {kubelet 192.168.2.3} spec.containers{kubernetes-dashboard} Normal Pulled Container image "gcr.io/google_containers/kubernetes-dashboard-amd64:v1.6.3" already present on machine
32m 32m 1 {kubelet 192.168.2.3} spec.containers{kubernetes-dashboard} Normal Created Created container with docker id 8fddde282e41; Security:[seccomp=unconfined]
32m 32m 1 {kubelet 192.168.2.3} spec.containers{kubernetes-dashboard} Normal Started Started container with docker id 8fddde282e41
Also:
[root@master ~]# kubectl cluster-info
Kubernetes master is running at http://localhost:8080
kubernetes-dashboard is running at http://localhost:8080/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard
Now, when I tried connecting to the dashboard (via a browser on a Windows virtual machine in the same cloud network) using the address:
https://192.168.218.2:6443/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard
I get "Unauthorized". I believe this proves that the dashboard is indeed running at this address, but I need to set up some way of accessing it?
What I want to achieve in the long term:
I want to enable connecting to the dashboard with a login/password from outside the cloud network (later, when I learn a bit more, I will think about authenticating with certificates or something safer than a password). For now, connecting to the dashboard at all would do.
I know there are threads about authentication, but most of them mention something like:
Basic authentication is enabled by passing the
--basic-auth-file=SOMEFILE option to API server
And this is the part I cannot cope with: I have no idea how to pass options to the API server.
On the Atomic Host, the api-server, kube-controller-manager and kube-scheduler are running in containers, so I get into the api-server container with the command:
docker exec -it kube-apiserver.service bash
I have seen a few times that I should edit a .json file in the /etc/kubernetes/manifest directory, but unfortunately there is no such file (or even such a directory).
I apologize if my problem is too trivial or not described well enough, but I'm new to both the IT world and Stack Overflow.
I would love to provide more info, but I am afraid I would end up including lots of useless information, so I decided to wait for your instructions in that regard.
Check out the wiki pages of the Kubernetes dashboard; they describe how to get access to the dashboard and how to authenticate to it. For quick access you can run:
kubectl proxy
And then go to the following address:
http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy
You'll see two options: one is uploading your ~/.kube/config file, and the other is using a token. You can get a token by running the following command:
kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep service-account-token | head -n 1 | awk '{print $1}')
Now just copy and paste the long token string into the dashboard prompt and you're done.
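To address the --basic-auth-file part of the question directly: that flag goes on the kube-apiserver command line, and the file it points to is plain CSV with one user per line in the form password,username,uid (an optional fourth column can list group names). How the flag reaches the command line depends on the packaging; as a sketch only, on installs that keep their settings in /etc/kubernetes/apiserver (common for the RPM/system-container packaging on Atomic Host; the file names and example user below are assumptions, adjust to your setup):
echo 'mysecretpassword,admin,1000' > /etc/kubernetes/basic_auth.csv
# append the flag to KUBE_API_ARGS in /etc/kubernetes/apiserver:
#   KUBE_API_ARGS="--basic-auth-file=/etc/kubernetes/basic_auth.csv"
systemctl restart kube-apiserver.service
Keep in mind that basic auth sends the password with every request, so it only makes sense behind TLS.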

IBM WebSphere Portal V8.5 wcm library syndication

I have a WebSphere Portal Version 8.5 cluster on AIX 7.1 with multiple Virtual Portals, working with managed pages. Each Virtual Portal has its own libraries, and there is one shared library for all VPs that is syndicated to each VP.
I successfully created the syndication pair between the syndicator (the WAS base portal) and the subscriber (the Virtual Portal) and tested the connection between them, and all is good (which makes sense, since the VPs are local on the same server). However, when trying to syndicate the library content, it stays in Queued status, and in SystemOut.log I see the following errors:
[4/25/17 9:33:53:201 IDT] 00004163 PackageConsum E Unexpected exception thrown while updating subscription: [IceId: Current State: ], exception: com.ibm.workplace.wcm.services.WCMServiceRuntimeException: code: 400
com.ibm.workplace.wcm.services.WCMServiceRuntimeException: code: 400
at com.aptrix.syndication.business.subscriber.CatalogRetrieverTask.getSourceCatalog(CatalogRetrieverTask.java:330)
at com.aptrix.syndication.business.subscriber.CatalogRetrieverTask.process(CatalogRetrieverTask.java:144)
at com.aptrix.syndication.business.subscriber.PackageConsumerTask.processPackage(PackageConsumerTask.java:513)
at com.aptrix.syndication.business.subscriber.PackageConsumerTask.processUpdate(PackageConsumerTask.java:267)
at com.aptrix.syndication.business.subscriber.PackageConsumerTask$1.run(PackageConsumerTask.java:183)
at com.ibm.wps.ac.impl.UnrestrictedAccessImpl.run(UnrestrictedAccessImpl.java:84)
at com.ibm.wps.command.ac.ExecuteUnrestrictedCommand.execute(ExecuteUnrestrictedCommand.java:90)
at com.aptrix.syndication.business.subscriber.PackageConsumerTask.doManagedWork(PackageConsumerTask.java:195)
at com.aptrix.syndication.business.ManagedTask.runWork(ManagedTask.java:62)
at com.ibm.workplace.wcm.services.workmanager.AbstractWcmWork.runImpl(AbstractWcmWork.java:162)
at com.ibm.workplace.wcm.services.workmanager.AbstractWcmSystemWork.access$001(AbstractWcmSystemWork.java:40)
at com.ibm.workplace.wcm.services.workmanager.AbstractWcmSystemWork$1.run(AbstractWcmSystemWork.java:92)
at com.ibm.wps.ac.impl.UnrestrictedAccessImpl.run(UnrestrictedAccessImpl.java:84)
at com.ibm.wps.command.ac.ExecuteUnrestrictedCommand.execute(ExecuteUnrestrictedCommand.java:90)
at com.ibm.workplace.wcm.services.repository.PACServiceImpl.runAsPrivileged(PACServiceImpl.java:1878)
at com.ibm.workplace.wcm.services.workmanager.AbstractWcmSystemWork.runImpl(AbstractWcmSystemWork.java:87)
at com.ibm.workplace.wcm.services.workmanager.AbstractWcmWork.run(AbstractWcmWork.java:146)
at com.ibm.wps.services.workmanager.impl.WasWorkWrapper.run(WasWorkWrapper.java:44)
at com.ibm.ws.asynchbeans.J2EEContext$RunProxy.run(J2EEContext.java:271)
at java.security.AccessController.doPrivileged(AccessController.java:274)
at com.ibm.ws.asynchbeans.J2EEContext.run(J2EEContext.java:797)
at com.ibm.ws.asynchbeans.WorkWithExecutionContextImpl.go(WorkWithExecutionContextImpl.java:222)
at com.ibm.ws.asynchbeans.ABWorkItemImpl.run(ABWorkItemImpl.java:206)
at java.lang.Thread.run(Thread.java:804)
[4/25/17 9:33:53:222 IDT] 00004163 SyndicationEx W Unsuccessful request to send summary: 400
com.aptrix.deployment.wizard.SyndicatorCommunicationException: Unsuccessful request to send summary: 400
at com.ibm.workplace.wcm.api.syndication.SyndicationExtensionsServiceImpl.sendSummaryToSyndicator(SyndicationExtensionsServiceImpl.java:293)
at com.ibm.workplace.wcm.api.syndication.SyndicationExtensionsServiceImpl.processSubscriberCompleting(SyndicationExtensionsServiceImpl.java:246)
at com.aptrix.syndication.business.subscriber.SubscriberTaskManager.processFailedUpdate(SubscriberTaskManager.java:405)
at com.aptrix.syndication.business.subscriber.PackageConsumerTask.processUpdate(PackageConsumerTask.java:400)
at com.aptrix.syndication.business.subscriber.PackageConsumerTask$1.run(PackageConsumerTask.java:183)
at com.ibm.wps.ac.impl.UnrestrictedAccessImpl.run(UnrestrictedAccessImpl.java:84)
at com.ibm.wps.command.ac.ExecuteUnrestrictedCommand.execute(ExecuteUnrestrictedCommand.java:90)
at com.aptrix.syndication.business.subscriber.PackageConsumerTask.doManagedWork(PackageConsumerTask.java:195)
at com.aptrix.syndication.business.ManagedTask.runWork(ManagedTask.java:62)
at com.ibm.workplace.wcm.services.workmanager.AbstractWcmWork.runImpl(AbstractWcmWork.java:162)
at com.ibm.workplace.wcm.services.workmanager.AbstractWcmSystemWork.access$001(AbstractWcmSystemWork.java:40)
at com.ibm.workplace.wcm.services.workmanager.AbstractWcmSystemWork$1.run(AbstractWcmSystemWork.java:92)
at com.ibm.wps.ac.impl.UnrestrictedAccessImpl.run(UnrestrictedAccessImpl.java:84)
at com.ibm.wps.command.ac.ExecuteUnrestrictedCommand.execute(ExecuteUnrestrictedCommand.java:90)
at com.ibm.workplace.wcm.services.repository.PACServiceImpl.runAsPrivileged(PACServiceImpl.java:1878)
at com.ibm.workplace.wcm.services.workmanager.AbstractWcmSystemWork.runImpl(AbstractWcmSystemWork.java:87)
at com.ibm.workplace.wcm.services.workmanager.AbstractWcmWork.run(AbstractWcmWork.java:146)
at com.ibm.wps.services.workmanager.impl.WasWorkWrapper.run(WasWorkWrapper.java:44)
at com.ibm.ws.asynchbeans.J2EEContext$RunProxy.run(J2EEContext.java:271)
at java.security.AccessController.doPrivileged(AccessController.java:274)
at com.ibm.ws.asynchbeans.J2EEContext.run(J2EEContext.java:797)
at com.ibm.ws.asynchbeans.WorkWithExecutionContextImpl.go(WorkWithExecutionContextImpl.java:222)
at com.ibm.ws.asynchbeans.ABWorkItemImpl.run(ABWorkItemImpl.java:206)
at java.lang.Thread.run(Thread.java:804)
[4/25/17 9:33:53:227 IDT] 00004163 syndication I Syndication Summary - Subscriber
Syndicator: IntShared_Syn, URL=http://'Was_Server':10039/wps/wcm/connect?MOD=Synd
Subscriber: IntShared_Sub, URL=http://'Was_Server':10039/wps/wcm/connect/'VP_URL_Context'?MOD=Subs
Status: FAILED
Failure Detail: Update failed on subscriber
Unexpected exception thrown while updating subscription: [IceId: Current State: ], exception: com.ibm.workplace.wcm.services.WCMServiceRuntimeException: code: 400
Update Type: REBUILD
Start Date: Tue Apr 25 09:33:53 IDT 2017
Finished Date: Tue Apr 25 09:33:53 IDT 2017
Duration:
Total: 0
Total Failed: 0
[4/25/17 9:33:54:613 IDT] 00000136 syndication I Syndication Summary - Syndicator
Syndicator: IntShared_Syn, URL=http://'Was_Server':10039/wps/wcm/connect?MOD=Synd
Subscriber: IntShared_Sub, URL=http://'VP_HostName':10039/wps/wcm/connect?MOD=Subs
Status: FAILED
Failure Detail: Terminated without confirmation
Returned non-confirmed response: Not confirmed. Unable to contact subscriber. Check the subscriber to ensure it is active and error free. Also review your network connections and your syndication configuration to ensure the subscriber details are correct.
Update Type: REBUILD
Start Date: Tue Apr 25 09:33:53 IDT 2017
Finished Date: Tue Apr 25 09:33:54 IDT 2017
Duration: 1 second
Total: 0
Total Failed: 0
WCM syndication requires HTTP Basic Authentication to be configured and working.
So I needed to make sure that Trust Association is enabled in the WAS console under Security -> Global security -> Web and SIP security -> Trust association.
I confirmed that the box that says "Enable trust association" is checked.
I also ensured that the interceptor com.ibm.portal.auth.tai.HTTPBasicAuthTAI exists and that its configuration is correct.
The cause of the error was that the urlBlackList and urlWhiteList fields used the variable ${WpsContextRootPath}, which I found out is not set anywhere, so I changed it to /wps instead. The fields are now as follows:
urlBlackList = /wps/myportal*
urlWhiteList = /wps/mycontenthandler*
After restarting the server and retrying syndication, it works!
You may also follow the directions in this link:
https://developer.ibm.com/answers/questions/206675/why-do-i-see-occasionally-see-a-popup-box-with-a-t.html
But note that setting these parameters disabled the servlet for viewing all items in the libraries...
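A quick sanity check for the basic-authentication requirement (hypothetical admin credentials; the URL is the syndicator URL from the syndication summary above):
curl -u wpsadmin:password -I "http://'Was_Server':10039/wps/wcm/connect?MOD=Synd"
If this comes back 401/403, the TAI/basic-authentication setup is still the place to look.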
You can try using the IP address instead of the hostname, or try adding the VP context to the syndicator/subscriber URLs.

How to solve race condition in etcd leader election?

While testing a CoreOS cluster with three nodes, after successfully adding and removing a few additional nodes, I encountered the following problem, presumably due to a race condition during the etcd leader election process.
Checking the new leader gives:
$ curl -L http://127.0.0.1:4001/v2/stats/leader
{"errorCode":300,"message":"Raft Internal Error","index":629006}
Journalctl for each machine in the cluster gives:
$ journalctl -r -u etcd
-- Logs begin at Wed 2014-11-12 15:09:01 UTC, end at Mon 2014-11-24 10:47:34 UTC. --
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.307 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: term #5221 started.
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.306 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
Nov 24 10:47:33 node-1 etcd[56576]: [etcd] Nov 24 10:47:33.098 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: term #5219 started.
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
Nov 24 10:47:31 node-1 etcd[56576]: [etcd] Nov 24 10:47:31.962 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
And listing the machines with fleet fails:
$ fleetctl list-machines
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
Listing the machines in the cluster gives:
$ curl -L http://127.0.0.1:7001/v2/admin/machines
[{"name":"","state":"follower","clientURL":"http://100.72.62.35:4001","peerURL":"http://100.72.62.35:7001"},
{"name":"555cca74216644fea48990673b3d539c","state":"follower","clientURL":"http://100.72.62.59:4001","peerURL":"http://100.72.62.59:7001"},
{"name":"965d12d38a4a4b2c807bd232fb7b0db7","state":"follower","clientURL":"http://100.72.20.153:4001","peerURL":"http://100.72.20.153:7001"},
{"name":"a1b566dedb194c259f7eb2ffde5595b1","state":"follower","clientURL":"http://100.72.62.2:4001","peerURL":"http://100.72.62.2:7001"},
{"name":"a45efba827754b5f93c38b751a0ae273","state":"follower","clientURL":"http://100.72.62.31:4001","peerURL":"http://100.72.62.31:7001"},
{"name":"d041738235a9483cb814d37ca7fa4b6d","state":"follower","clientURL":"http://100.72.20.18:4001","peerURL":"http://100.72.20.18:7001"}]
but only three machines are currently running. I tried to add additional machines to reach the quorum, to no avail.
I'm running the following version:
$ etcdctl -v
etcdctl version 0.4.6
for which, as mentioned at https://coreos.com/docs/distributed-configuration/etcd-api/#cluster-config, the leader module to force a leader has been removed. The ugly part is that, since there is no quorum, I'm not able to remove the machines that are not currently running from the member list, using for example:
$ curl -L -XDELETE http://127.0.0.1:7001/v2/admin/machines/2abbf47a9e644bc69652a986d796d7a6
which has no effect. Is there any way to save the cluster?
In my understanding, you can save the cluster, but it isn't worth it.
The cluster is not accepting new machines because it needs a quorum to add them, and there is no quorum among the existing machines. The same goes for removing machines and deleting keys.
If you can bring up enough of the machines listed as cluster members and have them successfully rejoin as members, you will have a quorum and can save the cluster.
From what I can see, you have six machines listed as cluster members. You need to have at least four running for the existing cluster to operate.
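The arithmetic behind that: an etcd (Raft) cluster needs a strict majority of its configured members, i.e. floor(n/2) + 1. A quick check for the six listed members:
n=6; echo $(( n / 2 + 1 ))   # majority needed: 4
With only three of six members up you are one short of a majority, which is why membership changes (and key deletions) are rejected.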