My TF Serving instance reads its models from Ceph via its S3 endpoint. Our internal endpoint has changed (and it also switched from HTTP to HTTPS).
Starting TF Serving with the new endpoint doesn't cause any issues, i.e. it connects fine and reads the models. But when a new model is available, TF Serving suddenly crashes.
This doesn't happen if I keep using our old endpoint.
I added logging to see whether there are issues with the S3 file system, i.e. I set:
TF_CPP_VMODULE: "s3_file_system=2,file_system_storage_path_source=2"
AWS_LOG_LEVEL: "trace"
But they don't seem to be helpful.
This is part of healthy logs:
...
2022-09-13 09:27:06.873782: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:84] Returning connection handle 0x7fd33800ba30
2022-09-13 09:27:06.873792: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:84] Obtained connection handle 0x7fd33800ba30
2022-09-13 09:27:06.910123: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:84] HTTP/1.1 200 OK
2022-09-13 09:27:06.910509: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:84] Transfer-Encoding: chunked
...
And when the issue occurs, the logs look like this:
...
2022-09-13 02:19:31.167630: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:84] Returning connection handle 0x560c7d611e00
2022-09-13 02:19:31.167638: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:84] Obtained connection handle 0x560c7d611e00 <--- same as before
2022-09-13 02:19:31.697364: I external/org_tensorflow/tensorflow/core/platform/s3/s3_file_system.cc:712] Stat on path: s3://tfserving-config/model/prod/empty.config <--- here the container restarted
...
Tested on TensorFlow Serving Docker images with versions 2.5.3 and 2.6.5.
Are there ways to debug this further or to get more helpful logs?
Setting TF_CPP_MAX_VLOG_LEVEL produces far too much output to be useful.
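For reference, a minimal sketch of how the relevant S3 settings can be passed to the stock tensorflow/serving image; the endpoint, credentials, bucket and config path are placeholders, and S3_ENDPOINT, S3_USE_HTTPS and S3_VERIFY_SSL are the variables that change together with the endpoint:
# run TF Serving against an S3-compatible endpoint with the debug logging above
docker run --rm -p 8501:8501 \
  -e AWS_ACCESS_KEY_ID=<key> \
  -e AWS_SECRET_ACCESS_KEY=<secret> \
  -e S3_ENDPOINT=<new-ceph-endpoint> \
  -e S3_USE_HTTPS=1 \
  -e S3_VERIFY_SSL=1 \
  -e TF_CPP_VMODULE="s3_file_system=2,file_system_storage_path_source=2" \
  -e AWS_LOG_LEVEL=trace \
  tensorflow/serving:2.6.5 \
  --model_config_file=s3://<config-bucket>/<models>.config \
  --model_config_file_poll_wait_seconds=60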
Related
I have a hyperledger fabric network v2.2.0 deployed with 2 peer orgs and an orderer org in a kubernetes cluster. Each org has its own CA server. The CA pod keeps on restarting sometimes. In order to know whether the service of the CA server is reachable or not, I am trying to use the healthz API on port 9443.
I have used the livenessProbe condition in the CA deployment like so:
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 9443
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
After configuring this liveness probe, the pod keeps on restarting with the event Liveness probe failed: HTTP probe failed with status code: 400. Why might this be happening?
HTTP 400 code:
The HTTP 400 Bad Request response status code indicates that the server cannot or will not process the request due to something that is perceived to be a client error (for example, malformed request syntax, invalid request message framing, or deceptive request routing).
This indicates that Kubernetes is sending the request in a way Hyperledger is rejecting, but without more information it is hard to say where the problem is. Some quick checks to start with:
Send some GET requests directly to the Hyperledger /healthz resource yourself. What do you get? You should get back either a 200 "OK" if everything is functioning, or a 503 "Service Unavailable" with details of which nodes are down (docs).
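For example (the host and port are placeholders; the second form applies if TLS is enabled on the operations endpoint):
curl -v http://<ca-host>:9443/healthz
curl -vk https://<ca-host>:9443/healthz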
kubectl describe pod liveness-request. You should see a few lines towards the bottom describing the state of the liveness probe in more detail:
Restart Count: 0
.
.
.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned example-dc/liveness-request to dcpoz-d-sou-k8swor3
Normal Pulling 4m45s kubelet, dcpoz-d-sou-k8swor3 Pulling image "nginx"
Normal Pulled 4m42s kubelet, dcpoz-d-sou-k8swor3 Successfully pulled image "nginx"
Normal Created 4m42s kubelet, dcpoz-d-sou-k8swor3 Created container liveness
Normal Started 4m42s kubelet, dcpoz-d-sou-k8swor3 Started container liveness
Some other things to investigate:
httpGet options that might be helpful:
scheme – Protocol type HTTP or HTTPS
httpHeaders – Custom headers to set in the request
Have you configured the operations service?
You may need a valid client certificate (if TLS is enabled, and clientAuthRequired is set to true).
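If the operations listener has TLS enabled, a plain-HTTP probe against it is a plausible source of the 400, since a Go HTTP server answers a plaintext request on a TLS port with a 400. A minimal sketch of the probe with the scheme switched, assuming operations TLS is enabled on the CA:
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 9443
    scheme: HTTPS   # use HTTPS if the operations endpoint serves TLS
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
If clientAuthRequired is true, an httpGet probe cannot present a client certificate, so the usual workaround is an exec probe that runs curl with the client key and certificate mounted from a secret.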
I'm trying to run Jibri as part of a Jitsi Meet installation (all on one server) behind a reverse SSL proxy. Jitsi works out of the box, but as soon as Jibri tries to log in to the session to record it, the corresponding Chrome session times out. Here's an excerpt from the jibri log:
2021-04-04 09:09:42.546 FINE: [890] org.jitsi.jibri.selenium.pageobjects.CallPage.visit() Visiting url https://example.com/room#config.iAmRecorder=true&config.externalConnectUrl=null&config.startWithAudioMuted=true&config.startWithVideoMuted=true&interfaceConfig.APP_NAME="Jibri"&config.analytics.disabled=true&config.p2p.enabled=false&config.prejoinPageEnabled=false&config.requireDisplayName=false
2021-04-04 09:09:42.633 FINE: [890] org.jitsi.jibri.selenium.pageobjects.CallPage.apply() Not joined yet: APP is not defined
...
2021-04-04 09:10:12.945 INFO: [890] org.jitsi.jibri.selenium.JibriSelenium.onSeleniumStateChange() Transitioning from state Starting up to Error: FailedToJoinCall SESSION Failed to join the call
2021-04-04 09:10:12.947 INFO: [890] org.jitsi.jibri.service.impl.FileRecordingJibriService.onServiceStateChange() File recording service transitioning from state Starting up to Error: FailedToJoinCall SESSION Failed to join the call
The reverse proxy is configured to watch out for this login string on port 443 (normal SSL traffic per the URL above) and forward this to the Jitsi instance. Prosody accepts the request on its http-bind interface but then the invocation times out.
As the web server logs are inconclusive: where, or in which logs, can I check what happens next? I can see Jicofo picking up the invocation (Jicofo 2021-04-04 09:09:42.130 INFO: [461] org.jitsi.jicofo.recording.jibri.JibriSession.log() Updating status from JIBRI: <iq to='focus#auth.example.com/focus647288887711795' from='jibribrewery#internal.auth.example.com/jibri-nickname' id='5iurC-49012' type='result'><jibri xmlns='http://jitsi.org/protocol/jibri' status='pending'/></iq> for room#conference.example.com), but I don't know what happens after that.
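For reference, these are the logs I know to check so far (default paths from the Debian packages; they may differ on other installs):
# Jibri itself, plus the Chrome/Selenium console output
tail -f /var/log/jitsi/jibri/log.0.txt /var/log/jitsi/jibri/browser.0.txt
# Jicofo and the videobridge
tail -f /var/log/jitsi/jicofo.log /var/log/jitsi/jvb.log
# Prosody, including the http-bind (BOSH) traffic Jibri uses to join
tail -f /var/log/prosody/prosody.log /var/log/prosody/prosody.err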
More than happy to provide more info as required.
I was using our Kubernetes cluster; I don't think I have changed anything recently since deployment, but I am encountering this error.
Error from kubectl logs run with verbosity:
01:49:42.691510 30028 round_trippers.go:444] Response Headers:
I0514 01:49:42.691526 30028 round_trippers.go:447] Content-Length: 12
I0514 01:49:42.691537 30028 round_trippers.go:447] Content-Type: text/plain; charset=utf-8
I0514 01:49:42.691545 30028 round_trippers.go:447] Date: Tue, 14 May 2019 08:49:42 GMT
F0514 01:49:42.691976 30028 helpers.go:119] error: unable to upgrade connection:
Unauthorized
The kubelet is running with the options below:
/usr/local/bin/kubelet --logtostderr=true --v=2 --address=0.0.0.0 --node-ip=1******
--hostname-override=***** --allow-privileged=true --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --authentication-token-webhook --enforce-node-allocatable= --client-ca-file=/etc/kubernetes/ssl/ca.crt --pod-manifest-path=/etc/kubernetes/manifests --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.1 --node-status-update-frequency=10s --cgroup-driver=cgroupfs --max-pods=110 --anonymous-auth=false --read-only-port=0 --fail-swap-on=True --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice --cluster-dns=10.233.0.3 --cluster-domain=cluster.local --resolv-conf=/etc/resolv.conf --kube-reserved cpu=200m,memory=512M --node-labels=node-role.kubernetes.io/master=,node-role.kubernetes.io/node= --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin
The API server is running with the options below:
kube-apiserver --allow-privileged=true --apiserver-count=2 --authorization-mode=Node,RBAC --bind-address=0.0.0.0 --endpoint-reconciler-type=lease --insecure-port=0 --kubelet-preferred-address-types=InternalDNS,InternalIP,Hostname,ExternalDNS,ExternalIP --runtime-config=admissionregistration.k8s.io/v1alpha1 --service-node-port-range=30000-32767 --storage-backend=etcd3 --advertise-address=******* --client-ca-file=/etc/kubernetes/ssl/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.pem --etcd-certfile=/etc/kubernetes/ssl/etcd/node-bg-kub-dev-1.pem --etcd-keyfile=/etc/kubernetes/ssl/etcd/node-bg-kub-dev-1-key.pem --etcd-servers=https://*******:2379,https://********:2379,https://*****:2379 --kubelet-client-certificate=/etc/kubernetes/ssl/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/ssl/apiserver-kubelet-client.key --proxy-client-cert-file=/etc/kubernetes/ssl/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/ssl/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/ssl/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-key-file=/etc/kubernetes/ssl/sa.pub --service-cluster-ip-range=10.233.0.0/18 --tls-cert-file=/etc/kubernetes/ssl/apiserver.crt --tls-private-key-file=/etc/kubernetes/ssl/apiserver.key
I think you messed up your cert files or changed something in your RBAC profiles.
You can have a look at the great guide by Kelsey Hightower called kubernetes-the-hard-way.
It shows how to set up a whole cluster from the beginning, without any automation tools like kubeadm.
Part 04-certificate-authority - Provisioning a CA and Generating TLS Certificates - has examples of the certs being used in Kubernetes.
The Kubelet Client Certificates
Kubernetes uses a special-purpose authorization mode called Node Authorizer, that specifically authorizes API requests made by Kubelets. In order to be authorized by the Node Authorizer, Kubelets must use a credential that identifies them as being in the system:nodes group, with a username of system:node:<nodeName>. In this section you will create a certificate for each Kubernetes worker node that meets the Node Authorizer requirements.
Once the certs are generated for the workers and uploaded, you need to generate a kubeconfig for each worker.
The kubelet Kubernetes Configuration File
When generating kubeconfig files for Kubelets the client certificate matching the Kubelet's node name must be used. This will ensure Kubelets are properly authorized by the Kubernetes Node Authorizer.
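A quick way to check that a worker's kubelet credential actually carries that identity is to inspect the client certificate referenced by its kubeconfig; the /etc/kubernetes/kubelet.conf path is taken from the kubelet flags above, and if the kubeconfig points at a certificate file instead of embedding it, run openssl x509 -noout -subject -in <that file> instead:
# Expect: CN = system:node:<nodeName>, O = system:nodes
grep client-certificate-data /etc/kubernetes/kubelet.conf \
  | awk '{print $2}' | base64 -d \
  | openssl x509 -noout -subject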
This case might also be helpful: "kubectl exec" results in "error: unable to upgrade connection: Unauthorized".
I got this issue fixed.
Actually, /etc/kubernetes/ssl/ca.crt is the same on both of my masters, but on the worker nodes /etc/kubernetes/ssl/ca.crt is totally different. So I just copied /etc/kubernetes/ssl/ca.crt from a master to my worker nodes and restarted kubelet on the workers, which fixed my issue.
But I am not sure these were the right changes for the fix.
I assume --client-ca-file=/etc/kubernetes/ssl/ca.crt should be the same for every kubelet, whether it runs on a master or a worker.
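A way to confirm that kind of mismatch before copying files around is to compare the CA fingerprints across nodes and to check that the API server's kubelet client certificate verifies against the worker's CA; the paths are the ones from the flags above:
# On every master and worker - the fingerprints should be identical
openssl x509 -noout -fingerprint -sha256 -in /etc/kubernetes/ssl/ca.crt
# On a master - the cert the API server presents to kubelets should verify
# against the CA the kubelets trust via --client-ca-file
openssl verify -CAfile /etc/kubernetes/ssl/ca.crt /etc/kubernetes/ssl/apiserver-kubelet-client.crt
# After replacing ca.crt on a worker, restart the kubelet
systemctl restart kubelet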
I created a Python API on OpenShift Online with the Python image. If you request all the data, it takes more than 30 seconds to respond, and the server gives a 504 Gateway Timeout HTTP response. How do you configure how long a response may take? I created an annotation on the route, which seems to set the proxy timeout:
haproxy.router.openshift.io/timeout: 600s
The problem remains, but now I have logging. It looks like the message comes from mod_wsgi.
I want to try to change the configuration of httpd (the mod_wsgi-express process) from request-timeout 60 to request-timeout 600. Where do you configure this? I am using the base image https://github.com/sclorg/s2i-python-container/tree/master/2.7
Logging:
Timeout when reading response headers from daemon process 'localhost:8080':/tmp/mod_wsgi-localhost:8080:1000430000/htdocs
Does someone know how to fix this error on OpenShift Online?
In addition to altering the HAProxy timeout on my app's route:
haproxy.router.openshift.io/timeout: 600s
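For reference, a way this annotation can be applied with the oc CLI (the route name is a placeholder):
oc annotate route <my-route> haproxy.router.openshift.io/timeout=600s --overwrite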
I also altered the request-timeout and socket-timeout in my Python application's app.sh, so the mod_wsgi-express server is configured with a higher timeout:
ARGS="$ARGS --request-timeout 600"
ARGS="$ARGS --socket-timeout 600"
My application now waits 10 minutes before cancelling a request.
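For context, a minimal sketch of what such an app.sh could look like; the wsgi.py entry point and the port are assumptions about the application, while --request-timeout and --socket-timeout are standard mod_wsgi-express options:
#!/bin/bash
# app.sh - custom start script picked up by the s2i-python-container image
ARGS=""
ARGS="$ARGS --port 8080"             # port the container is expected to listen on (assumption)
ARGS="$ARGS --request-timeout 600"   # allow responses to take up to 10 minutes
ARGS="$ARGS --socket-timeout 600"
# wsgi.py is a placeholder for the application's actual WSGI entry point
exec mod_wsgi-express start-server wsgi.py $ARGS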
Currently we have an Apache 2.2.3 server with mod_ssl 2.2.3 running Django, with users authenticating via an x509 certificate.
So far the system runs perfectly, except for a single user who, when trying to upload a file, receives a 400 Bad Request error. The contents of the ssl_error_log regarding this operation are:
[<date>] [error] [client <client ip>] request failed: error reading the headers, referer: <referrer url>
The contents of the ssl_access_log are:
<client ip> - - [<date>] "POST <target page> HTTP/1.1" 400 321
Also, the user's browser is Firefox as far as I know.
I am completely unable to reproduce this bug and so far none of the other users have experienced it. Could you point out some reasons for this to happen?
I've experienced connectivity that stops the upstream after a certain amount of bytes has been sent. The cutoff was pretty low: enough to request some simple pages, but not to deal with AJAX requests, much less upload files. As far as I recall, this connectivity problem occurred only when tethering (from one specific Android phone; I didn't test other phones).
So if the upstream gets interrupted and the upload stalls, it makes sense that Apache would return this error, according to this post: "Apache waits a time equal to the Timeout directive (defaults to 5 minutes if not defined) for a response from the client. It is likely Apache is waiting for the CRLF that indicates the end of the headers, yet it is never received."
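To rule the timeout in or out, the relevant knobs are Apache's Timeout directive and a temporarily raised LogLevel; a minimal httpd.conf sketch with illustrative values:
# httpd.conf - illustrative values, not a recommendation
Timeout 300      # seconds Apache waits for the client; 300 is the 2.2 default
LogLevel debug   # temporarily raise verbosity to see why header parsing fails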