KFServing pod "error: container storage-initializer is not valid" - tensorflow-serving

I am new to KFServing and Kubeflow.
I was following https://github.com/kubeflow/kfserving/tree/master/docs/samples/v1alpha2/tensorflow to deploy a simple inference service.
However, when looking at the logs, I am unable to find the container storage-initializer. The only containers my predict service pod has are kfserving and queue-proxy.
I am currently on Kubeflow 1.2 and Kubernetes 1.17 on IBM Cloud.
[Error message screenshot]

storage-initializer is an init container, so if you describe the pod you won't find it in the containers section of the pod spec, but in the initContainers section.
$ kubectl get pod flowers-sample-predictor-default-00002-deployment-58bb9557sf7g2 -o json | jq .status.initContainerStatuses
[
  {
    "containerID": "docker://e40e5f86401b3715118b873fec4ae6c3ef57765ffbb5c9ab48757234c4f53b6f",
    "image": "gcr.io/kfserving/storage-initializer:v0.5.0",
    "imageID": "docker-pullable://gcr.io/kfserving/storage-initializer@sha256:1d396c0c50892f5562a1c24d925691ec786e5d48e08200f3f9bb17bb48da40ae",
    "lastState": {},
    "name": "storage-initializer",
    "ready": true,
    "restartCount": 0,
    "state": {
      "terminated": {
        "containerID": "docker://e40e5f86401b3715118b873fec4ae6c3ef57765ffbb5c9ab48757234c4f53b6f",
        "exitCode": 0,
        "finishedAt": "2021-02-27T20:13:25Z",
        "reason": "Completed",
        "startedAt": "2021-02-27T20:13:11Z"
      }
    }
  }
]
I'm not familiar with the model label you are using; can you retry using the app label or the pod name directly?
$ kubectl logs -l app=flowers-sample-predictor-default-00002 -c storage-initializer
[I 210227 20:13:12 initializer-entrypoint:13] Initializing, args: src_uri [gs://kfserving-samples/models/tensorflow/flowers] dest_path[ [/mnt/models]
[I 210227 20:13:12 storage:43] Copying contents of gs://kfserving-samples/models/tensorflow/flowers to local
[W 210227 20:13:15 _metadata:104] Compute Engine Metadata server unavailable onattempt 1 of 3. Reason: timed out
[W 210227 20:13:15 _metadata:104] Compute Engine Metadata server unavailable onattempt 2 of 3. Reason: [Errno 113] No route to host
[W 210227 20:13:18 _metadata:104] Compute Engine Metadata server unavailable onattempt 3 of 3. Reason: timed out
[W 210227 20:13:18 _default:250] Authentication failed using Compute Engine authentication due to unavailable metadata server.
[I 210227 20:13:19 storage:127] Downloading: /mnt/models/0001/saved_model.pb
[I 210227 20:13:19 storage:127] Downloading: /mnt/models/0001/variables/variables.data-00000-of-00001
[I 210227 20:13:25 storage:127] Downloading: /mnt/models/0001/variables/variables.index
[I 210227 20:13:25 storage:76] Successfully copied gs://kfserving-samples/models/tensorflow/flowers to /mnt/models
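If the label selector doesn't match anything in your cluster, the same logs can also be fetched by pod name directly (using the pod name from the output above), e.g.:
$ kubectl logs flowers-sample-predictor-default-00002-deployment-58bb9557sf7g2 -c storage-initializer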

Related

Amazon Cloudwatch only receiving mem_used_percent and nothing else, despite numerous other metrics specified in config

I am trying to get CloudWatch running properly on my Lightsail instance, which I appear to have achieved with only partial success.
I ran the wizard using sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard, which produced a config file outlining numerous metrics including cpu, memory and disk usage, as outlined here. The service loads and starts the config file, and doesn't complain about invalid json (this did happen a few times, but I fixed it).
I can stop the service with sudo amazon-cloudwatch-agent-ctl -a stop
I then reload the config with sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -s -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
And verify the service is running: sudo amazon-cloudwatch-agent-ctl -a status
Which outputs this:
{
  "status": "running",
  "starttime": "2022-01-10T21:53:12+00:00",
  "configstatus": "configured",
  "cwoc_status": "stopped",
  "cwoc_starttime": "",
  "cwoc_configstatus": "not configured",
  "version": "1.247349.0b251399"
}
Logging into my CloudWatch console, I can see the data being received, and the single line appearing on the graph there corresponds to the times that I started and stopped the service-- so it's definitely doing something. And yet... the only metric that appears on that graph is mem_used_percent... why? Why only this one metric? Where is the rest of my data pertaining to cpu, etc? What am I doing wrong?
Here is my config.json, which as I said, is being loaded by the service without issue.
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "metrics": {
    "append_dimensions": {
      "ImageID": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "cpu": {
        "resources": [
          "*"
        ],
        "measurement": [
          "cpu_usage_active"
        ],
        "metrics_collection_interval": 60,
        "totalcpu": false
      },
      "disk": {
        "measurement": [
          "free",
          "total",
          "used",
          "used_percent"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_active",
          "mem_available",
          "mem_available_percent",
          "mem_free",
          "mem_total",
          "mem_used",
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      },
      "netstat": {
        "measurement": [
          "tcp_established",
          "udp_socket"
        ]
      }
    }
  }
}
Any help greatly appreciated here. TIA.
You likely haven't fetched the configuration yet.
Check the logfile, i.e. /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log, to see which inputs are loaded:
2022-05-18T10:18:57Z I! Loaded inputs: mem disk
To fetch the configuration, do as follows (you'll need to adapt this to your environment - this is for systemd, on-premise, without SSM):
sudo amazon-cloudwatch-agent-ctl -a fetch-config -m onPremise -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
sudo systemctl restart amazon-cloudwatch-agent.service
After that, the log shows:
2022-05-18T11:45:05Z I! Loaded inputs: mem net netstat swap cpu disk diskio
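Once the right inputs are loaded, you can also double-check that the metrics actually reach CloudWatch by listing the agent's namespace (CWAgent is the default namespace; adjust if you changed it):
aws cloudwatch list-metrics --namespace CWAgent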
Maybe you face the same issue as I did. In my case, two configuration JSON files were merged:
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_config.json
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
These files are then translated to
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml.
When I checked that file, only the mem definition from /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_config.json had been taken. Thus, I deleted the file and restarted the service.
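In other words, something like this, using the paths above:
sudo rm /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_config.json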
sudo systemctl restart amazon-cloudwatch-agent
After the restart, the toml file contained what I expected and the metrics were in place.

SSL Error: Handshake failed with fatal error - Querying fabric-sdk-rest server on a Fabric Network with TLS enabled

I started a multi-host Fabric network using Docker Swarm, made up of 1 CA server, 1 orderer, 2 peers (both in Org1, one on PC1 and one on PC2) and 2 CouchDB instances (one for each peer), with fabric-sdk-rest running on PC2.
If I disable TLS in the Fabric network, everything works fine. But if I enable TLS, the SDK cannot connect to the peers and the queries fail.
Here I show the configuration of the network and the fabric-sdk-rest:
(crypto-config.yaml)
OrdererOrgs:
  - Name: Orderer
    Domain: example.com
    Specs:
      - Hostname: orderer
PeerOrgs:
  - Name: Org1
    Domain: org1.example.com
    Template:
      Count: 2
    Users:
      Count: 0
(datasources.json)
{
  "db": {
    "name": "db",
    "connector": "memory"
  },
  "fabricDataSource": {
    "name": "fabricDataSource",
    "connector": "fabric",
    "keyStoreFile": "/tmp/fabricSDKStore",
    "fabricUser": {
      "username": "Admin@org1.example.com",
      "mspid": "Org1MSP",
      "cryptoContent": {
        "privateKey": "$HOME/mynetwork/crypto-config/peerOrganizations/org1.example.com/users/Admin@org1.example.com/msp/keystore/KEY_sk",
        "signedCert": "$HOME/mynetwork/crypto-config/peerOrganizations/org1.example.com/users/Admin@org1.example.com/msp/signcerts/Admin@org1.example.com-cert.pem"
      }
    },
    "COMMENT_orgs": "Referenced by peers to avoid having to configure the same file location multiple times. Change CACertFile locations for your fabric",
    "orgs": [
      { "name": "org1", "CACertFile": "$HOME/mynetwork/crypto-config/peerOrganizations/org1.example.com/ca/ca.org1.example.com-cert.pem" }
    ],
    "COMMENT_peers": "Configured array is for use with the fabric-sample when running it in a local docker set up. eventURL and publicCertFile not currently used.",
    "peers": [
      { "requestURL": "grpcs://peer1.org1.example.com:7051", "eventURL": "grpcs://peer1.org1.example.com:7053", "orgIndex": "0", "publicCertFile": "$HOME/mynetwork/crypto-config/peerOrganizations/org1.example.com/peers/peer1.org1.example.com/msp/signcerts/peer1.org1.example.com-cert.pem", "hostname": "peer1" }
    ],
    "COMMENT_peers_secure": "UNUSED. This is a copy of the above with grpcs URLs. Replace peers content with this if grpcs urls are needed.",
    "peers-secure": [
      { "requestURL": "grpcs://peer1.org1.example.com:7051", "eventURL": "grpcs://peer1.org1.example.com:7053", "orgIndex": "0", "publicCertFile": "$HOME/mynetwork/crypto-config/peerOrganizations/org1.example.com/peers/peer1.org1.example.com/msp/signcerts/peer1.org1.example.com-cert.pem", "hostname": "peer1" }
    ],
    "orderers": [
      { "url": "grpcs://orderer.example.com:7050", "CACertFile": "$HOME/mynetwork/crypto-config/ordererOrganizations/example.com/ca/ca.example.com-cert.pem", "publicCertFile": "$HOME/mynetwork/crypto-config/ordererOrganizations/example.com/orderers/orderer.example.com/msp/signcerts/orderer.example.com-cert.pem", "hostname": "orderer" }
    ],
    "COMMENT_orderers_secure": "UNUSED. This is a copy of the above with grpcs URLs. Replace orderers content with this if grpcs urls are needed.",
    "orderers-secure": [
      { "url": "grpcs://orderer.example.com:7050", "CACertFile": "$HOME/mynetwork/crypto-config/ordererOrganizations/example.com/ca/ca.example.com-cert.pem", "publicCertFile": "$HOME/mynetwork/crypto-config/ordererOrganizations/example.com/orderers/orderer.example.com/msp/signcerts/orderer.example.com-cert.pem", "hostname": "orderer" }
    ],
    "COMMENT_channels": "fabric-sdk-node Client class requires channel information to be configured during bootstrap.",
    "channels": [
      { "name": "mychannel", "peersIndex": [0], "orderersIndex": [0] }
    ],
    "channels-first-network": [
      { "name": "mychannel", "peersIndex": [0,1,2,3], "orderersIndex": [0] }
    ]
  }
}
Once the Hyperledger Fabric SDK REST server is started at https://0.0.0.0:3000, when I try to make the GET channels query from the explorer, I get the following error:
error: [fabricconnector.js]: Failed to queryChannels: Error: 14 UNAVAILABLE: Connect Failed
Error not handled for the GET request /api/fabric/1_0/channels: Error: 14 UNAVAILABLE: Connect Failed
at Object.exports.createStatusError ($HOME/mynetwork/fabric-sdk-rest/packages/loopback-connector-fabric/node_modules/grpc/src/common.js:87:15)
at Object.onReceiveStatus ($HOME/mynetwork/fabric-sdk-rest/packages/loopback-connector-fabric/node_modules/grpc/src/client_interceptors.js:1214:28)
at InterceptingListener._callNext ($HOME/mynetwork/fabric-sdk-rest/packages/loopback-connector-fabric/node_modules/grpc/src/client_interceptors.js:590:42)
at InterceptingListener.onReceiveStatus ($HOME/mynetwork/fabric-sdk-rest/packages/loopback-connector-fabric/node_modules/grpc/src/client_interceptors.js:640:8)
at callback ($HOME/mynetwork/fabric-sdk-rest/packages/loopback-connector-fabric/node_modules/grpc/src/client_interceptors.js:867:24)
E0510 10:51:04.780559355 12247 ssl_transport_security.cc:989] Handshake failed with fatal error SSL_ERROR_SSL: error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed.
Has anyone ever seen this error? Can anyone help me get through this, please?

Chaincode container can't connect to the local peer due to certificate signed by unknown authority

First of all I'd like to mention that my setup works like a charm when TLS is not enabled. It works even in Docker Swarm on AWS.
The problem starts when I enable TLS. When I deploy my .bna file via Composer, my newly created chaincode container produces the following logs:
2017-08-23 13:14:16.389 UTC [Composer] Info -> INFO 001 Setting the Composer pool size to 8
2017-08-23 13:14:16.402 UTC [shim] userChaincodeStreamGetter -> ERRO 002 Error trying to connect to local peer: x509: certificate signed by unknown authority
Error starting chaincode: Error trying to connect to local peer: x509: certificate signed by unknown authority
The funny thing is that this works when deploying the .bna via the Composer playground (with TLS still enabled in my fabric)...
Below is my connection profile:
{
  "name": "test",
  "description": "test",
  "type": "hlfv1",
  "orderers": [
    {
      "url": "grpcs://orderer.company.com:7050",
      "cert": "-----BEGIN CERTIFICATE-----blabla1\n-----END CERTIFICATE-----\n"
    }
  ],
  "channel": "channelname",
  "mspID": "CompanyMSP",
  "ca": {
    "url": "https://ca.company.com:7054",
    "name": "ca-company",
    "trustedRoots": [
      "-----BEGIN CERTIFICATE-----\nblabla2\n-----END CERTIFICATE-----\n"
    ],
    "verify": true
  },
  "peers": [
    {
      "requestURL": "grpcs://peer0.company.com:7051",
      "eventURL": "grpcs://peer0.company.com:7053",
      "cert": "-----BEGIN CERTIFICATE-----\nbalbla3\n-----END CERTIFICATE-----\n"
    }
  ],
  "keyValStore": "/home/composer/.composer-credentials",
  "timeout": 300
}
My certs have been generated by the cryptogen tool, hence:
orderers.0.cert contains value of crypto-config/ordererOrganizations/company.com/orderers/orderer.company.com/msp/tlscacerts/tlsca.company.com-cert.pem
peers.0.cert contains value of crypto-config/peerOrganizations/company.com/peers/peer0.company.com/msp/tlscacerts/tlsca.company.com-cert.pem
ca.trustedRoots.0 contains crypto-config/peerOrganizations/company.com/peers/peer0.company.com/tls/ca.crt
I've got the feeling that my trustedRoots certificate is wrong...
UPDATE
When I run docker inspect chaincode_container, I can see that it is missing the env variable CORE_PEER_TLS_ROOTCERT_FILE=/etc/hyperledger/fabric/peer.crt, while the chaincode container deployed via the playground does have it...
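For reference, this is a quick way to dump a container's environment and check for that variable (container name as above):
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' chaincode_container | grep CORE_PEER_TLS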
When the chaincode image is built, the TLS certificate that it uses to build the trusted roots is the rootcert from:
# TLS Settings
# Note that peer-chaincode connections through chaincodeListenAddress is
# not mutual TLS auth. See comments on chaincodeListenAddress for more info
tls:
  enabled: false
  cert:
    file: tls/server.crt
  key:
    file: tls/server.key
  rootcert:
    file: tls/ca.crt
The TLS certificate that the peer uses to run the gRPC service is the cert one.
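For reference, these YAML keys map to the peer's usual environment-variable overrides; a sketch with the conventional container paths (adjust to your deployment):
CORE_PEER_TLS_ENABLED=true
CORE_PEER_TLS_CERT_FILE=/etc/hyperledger/fabric/tls/server.crt
CORE_PEER_TLS_KEY_FILE=/etc/hyperledger/fabric/tls/server.key
CORE_PEER_TLS_ROOTCERT_FILE=/etc/hyperledger/fabric/tls/ca.crt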
By the way - you're using the release branch code, not the one in master - is that correct?

Meld error with Datastax Enterprise

Provisioning a DSE cluster with Lifecycle Manager fails consistently. The master node (also the one OpsCenter is running on) installed correctly. Each of the other nodes fails the install (also config) task. I have double-checked the SSH credentials and ports. Any ideas on how to investigate further and fix the issue would be great.
Please excuse the length - I'm trying to provide all of the relevant info.
Ubuntu 14.04.4,
JRE: 1.8.0.91,
DSE 5.0.0
job events:
...
"results": [
{
"event-subtype": "start",
"event-type": "milestone",
"message": "job started...",
...
},
{
"event-subtype": "invocation",
"event-type": "shell-command",
"message": "Invoked command: if [ -x $(which yum) ] && [ -f /etc/redhat-release -o -f /etc/SuSE-release ]; then echo -n yum; elif [ -x $(which apt-get) ]; then echo -n apt; fi"
...
},
{
"event-subtype": "uploaded-facts",
"event-type": "milestone",
"message": "Uploaded facts to OpsCenter server",
...
},
{
"event-subtype": "meld-error",
"event-type": "error",
"message": "Unexpected error executing meld",
...
},
{
"event-subtype": "MeldError",
"event-type": "error",
"message": "Meld failed on: name=\"NODE-2\" ssh-management-address=\"<IP>\" node-id=\"<node-id>\" job-id=\"<job-id>\" stdout=\"\r\n\" stderr=\"\"",
...
}
]
opscenterd.log
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:16,848 [opscenterd] INFO: Install job started for node name="NODE-2" ssh-management-address="<IP>" node-id="<node-id>" (async-thread-macro-53)
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:16,850 [opscenterd] INFO: using ssh-private-key (async-thread-macro-53)
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:18,478 [opscenterd] INFO: Received milestone from node name="NODE-2" ssh-management-address="<IP>" node-id="<node-id>" message="Uploaded facts to OpsCenter server" job-id="a630c081-6ac1-4b00-ac08-18fef320e0d5" (MainThread)
/var/log/opscenter/opscenterd.log:2016-07-02 16:34:18,675 [opscenterd] ERROR: Received error from node event-subtype="meld-error" job-id="a630c081-6ac1-4b00-ac08-18fef320e0d5" name="NODE-2" traceback="Traceback (most recent call last):
/var/log/opscenter/opscenterd.log: File \"meld.py\", line 3313, in run
/var/log/opscenter/opscenterd.log- rc = engine.go()
/var/log/opscenter/opscenterd.log: File \"meld.py\", line 2991, in go
/var/log/opscenter/opscenterd.log- self.file_manager.get_config_files()
/var/log/opscenter/opscenterd.log: File \"meld.py\", line 1280, in get_config_files
/var/log/opscenter/opscenterd.log- {\"accept\": \"application/json\"})
/var/log/opscenter/opscenterd.log: File \"meld.py\", line 598, in get
/var/log/opscenter/opscenterd.log- return json.loads(response.read())
/var/log/opscenter/opscenterd.log- File \"/usr/lib/python2.7/socket.py\", line 351, in read
/var/log/opscenter/opscenterd.log- data = self._sock.recv(rbufsize)
/var/log/opscenter/opscenterd.log- File \"/usr/lib/python2.7/httplib.py\", line 549, in read
/var/log/opscenter/opscenterd.log- return self._read_chunked(amt)
/var/log/opscenter/opscenterd.log- File \"/usr/lib/python2.7/httplib.py\", line 609, in _read_chunked
/var/log/opscenter/opscenterd.log- value.append(self._safe_read(amt))
/var/log/opscenter/opscenterd.log- File \"/usr/lib/python2.7/httplib.py\", line 666, in _safe_read
/var/log/opscenter/opscenterd.log- raise IncompleteRead(''.join(s), amt)
/var/log/opscenter/opscenterd.log:IncompleteRead: IncompleteRead(4153 bytes read, 4039 more expected)" ssh-management-address="<IP>" node-id="<node-id>" event-type="error" message="Unexpected error executing meld" (MainThread)
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:18,892 [opscenterd] ERROR: Install job a630c081-6ac1-4b00-ac08-18fef320e0d5 failed! (async-thread-macro-54)
/var/log/opscenter/opscenterd.log:2016-07-02 16:34:19,105 [opscenterd] ERROR: Meld failed on: name="NODE-2" ssh-management-address="<IP>" node-id="<node-id>" job-id="a630c081-6ac1-4b00-ac08-18fef320e0d5" stdout="
/var/log/opscenter/opscenterd.log-" stderr="" (async-thread-macro-53)
Thank you
EDIT: I captured the HTTP traffic between NODE-2 and the master. The error occurs while transferring the config files: one of them is not transferred completely for some reason. The JSON looks reasonable until some gibberish appears.
{"filename": "dse.yaml", "contents": {"internode_messaging_options": {"client_worker_threads": 16, "port": 8609, "server_worker_threads": 16, "server_acceptor_thread
Yvatv+~UK{.kMI4^QOrqQTDX_3"DPm,v!"H&M$!1M7
LRYCs{l>-df;cj
W6C9dq
The config files are valid and do work on the master node. Only the replication fails.
OpsCenter LCM developer here. Your issue is caused by OPSC-8851 in the LCM known issues list: http://docs.datastax.com/en/opscenter/6.0/opsc/release_notes/opscReleaseNotes600.html
This is only triggered under certain network conditions and was discovered too close to release to get fixed in 6.0.0. It's a high priority, though, and will be fixed in a subsequent release soon. Unfortunately, I don't think there's anything you can do to work around this in the field. If you're a DataStax customer, you could contact support and potentially get a patch now to work around the issue... otherwise the only thing I can suggest is to watch the upcoming release notes.
Edit: I should also note that in our tests the issue is intermittent. LCM is designed so you can rerun failed jobs safely (aka it's idempotent) so in all but the most extreme cases you can also work around this just by rerunning your job.
You can specify the private IP for Listen Address and 0.0.0.0 for broadcast address and LCM should be able to provision appropriately.

flocker-docker-plugin doesn't work

I have two CentOS 7.1 nodes and I'm trying to get Flocker running on them. I've followed the installation steps exactly; however, when it comes to running the following command to test whether flocker-docker-plugin works:
docker run -v apples:/data --volume-driver flocker busybox sh -c "echo hello > /data/file.txt"
I get the error:
Error response from daemon: Error looking up volume plugin flocker: Plugin not found
The flocker-docker-plugin logs show the following:
{"request_body": null, "url": "https://foo.bar.com:4523/v1/state/nodes/by_era/b72bb203-b174-4241-a03a-6171cbc10f30", "timestamp": 1451201332.659948, "action_status": "started", "task_uuid": "1ae63069-286c-44fa-9dd2-6751ca0efe63", "action_type": "flocker:apiclient:http_request", "method": "GET", "task_level": [1]}
{"task_uuid": "1ae63069-286c-44fa-9dd2-6751ca0efe63", "error": false, "timestamp": 1451201332.749499, "message": "Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x2fc7710>", "message_type": "twisted:log", "task_level": [3]}
{"exception": "twisted.web._newclient.ResponseNeverReceived", "task_level": [4], "action_type": "flocker:apiclient:http_request", "reason": "[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]", "timestamp": 1451201333.050012, "task_uuid": "1ae63069-286c-44fa-9dd2-6751ca0efe63", "action_status": "failed"}
{"task_uuid": "c8d28668-f21b-4863-bf20-6c30f54c3d25", "error": true, "timestamp": 1451201333.05045, "message": "Unhandled Error\nTraceback (most recent call last):\nFailure: twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]\n", "message_type": "twisted:log", "task_level": [1]}
{"task_uuid": "36f1ddd5-c5fa-4438-85d7-131e7752f8d3", "error": true, "timestamp": 1451201333.050727, "message": "main function encountered error\nTraceback (most recent call last):\nFailure: twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]\n", "message_type": "twisted:log", "task_level": [1]}
{"task_uuid": "08bd8f13-0a8e-43f4-8b80-b4cf5b317f00", "error": false, "timestamp": 1451201333.051034, "message": "Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x2fc7710>", "message_type": "twisted:log", "task_level": [1]}
{"task_uuid": "8f0fd1ef-19ca-4033-b16f-6d42e33eda1a", "error": false, "timestamp": 1451201333.052711, "message": "Main loop terminated.", "message_type": "twisted:log", "task_level": [1]}
flocker-docker-plugin.service: main process exited, code=exited, status=1/FAILURE
Unit flocker-docker-plugin.service entered failed state.
flocker-docker-plugin.service failed.
flocker-docker-plugin.service holdoff time over, scheduling restart.
Started Flocker Docker Plugin.
Starting Flocker Docker Plugin...
Also running uft-flocker-volumes --control-service=foo.bar.net list-nodes returns:
jonathan@ubuntu:~/Flocker/sc-test-cluster$ uft-flocker-volumes --control-service=foo.bar.com list-nodes
Unhandled Error
Traceback (most recent call last):
Failure: twisted.web._newclient.ResponseNeverReceived
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
Update:
I tried downgrading to docker 1.8.2 and tried rerunning the command, didn't work, same error.
Output of ls /etc/flocker:
[root@sc-test2 jonathan]# ls /etc/flocker
agent.yml cluster.crt node.crt node.key plugin.crt plugin.key
[root@sc-test1 jonathan]# ls /etc/flocker
agent.yml control-service.crt node.crt plugin.crt
cluster.crt control-service.key node.key plugin.key
Update: 1/1/2016
I set the following environment variables as per the kubernetes docs http://kubernetes.io/v1.1/examples/flocker/.
export FLOCKER_CONTROL_SERVICE_HOST=foo.bar.com
export FLOCKER_CONTROL_SERVICE_CA_FILE=/etc/flocker/cluster.crt
export FLOCKER_CONTROL_SERVICE_CLIENT_CERT_FILE=/etc/flocker/node.crt
export FLOCKER_CONTROL_SERVICE_CLIENT_KEY_FILE=/etc/flocker/node.key
export FLOCKER_CONTROL_SERVICE_PORT=4523
And I got a different error when running the command
jonathan@ubuntu:~/Flocker/sc-test-cluster$ uft-flocker-volumes --control-service=sc-test1.cloudapp.net list-nodes
wget: error getting response: Connection reset by peer
===========================================================================
Unable to establish network connectivity from inside a container.
If you see an error message above, that may give you a clue how to fix it.
If you run docker in a VM, restarting the VM often helps, especially if
you have changed network (and/or DNS servers) since starting the VM.
If you are using docker-machine (e.g. as part of docker toolbox), you can
run the following command (or similar) to do that:
docker-machine restart default && eval $(docker-machine env default)
To ignore this check, and proceed anyway (e.g. if you know you are offline)
set IGNORE_NETWORK_CHECK=1
===========================================================================
So I set the flag to see what happens:
jonathan@ubuntu:~/Flocker/sc-test-cluster$ export IGNORE_NETWORK_CHECK=1
And bam! The same error :(
jonathan@ubuntu:~/Flocker/sc-test-cluster$ uft-flocker-volumes --control-service=sc-test1.cloudapp.net list-nodes
Unhandled Error
Traceback (most recent call last):
Failure: twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
Is the wget error a helpful clue as to what may be happening?
Error response from daemon: Error looking up volume plugin flocker: Plugin not found
This might be because the control-service hostname in agent.yml is not configured correctly. Make sure it's the host of the control-service node, not the agent node itself.
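For example, a minimal agent.yml sketch (the hostname and backend here are placeholders; agents normally reach the control service on port 4524):
version: 1
control-service:
  hostname: "foo.bar.com"  # must be the control-service node, not this agent
  port: 4524
dataset:
  backend: "aws"  # placeholder; use the backend your cluster is configured with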