TensorFlow Serving Model Server continuously re-adding the same models when polling S3 for config file

I am using TensorFlow Serving to run models that are stored in an S3 bucket. I am also keeping the model config file in a separate S3 bucket. My use case is that, in order to dynamically add models without needing to restart the server, I poll this config file for changes periodically.
To do this I have used the following setup:
tensorflow/serving:1.15.0 image deployed into a Kubernetes cluster using Helm.
In the Helm chart for the deployment, the following lines define the run command and the args used for polling S3 for the config:
command:
- "/usr/bin/tensorflow_model_server"
args:
- --model_config_file={path to config file}
- --model_config_file_poll_wait_seconds=60
The Helm chart also sets environment variables for AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, S3_ENDPOINT, and AWS_LOG_LEVEL=3.
The model config file has the following contents:
model_config_list: {
  config: {
    name: "mlp",
    base_path: "s3://bucketname/mlp",
    model_platform: "tensorflow",
    model_version_policy: {
      specific: {
        versions: 20200130
      }
    }
  }
}
Everything seems to be working as expected, with TensorFlow Serving loading the correct models and running them properly. The issue I am seeing is that every time the server polls S3 for the config file it re-adds the models, even when they are unchanged.
This results in regular log entries like the following:
2020-02-06 07:07:01.930476: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2020-02-06 07:07:01.930495: I tensorflow_serving/model_servers/server_core.cc:573] (Re-)adding model: mlp
2020-02-06 08:07:01.965518: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2020-02-06 08:07:01.965548: I tensorflow_serving/model_servers/server_core.cc:573] (Re-)adding model: mlp
2020-02-06 09:07:01.967228: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2020-02-06 09:07:01.967259: I tensorflow_serving/model_servers/server_core.cc:573] (Re-)adding model: mlp
My concern is that this could impact the performance of the model if the polling frequency is too high. I am wondering, though, whether anything actually changes on each poll or whether this is just additional logging.

This all looks normal.
Looking at the code (server_core.cc), these messages appear to be logged when the model config file is read, not when the models are loaded, which necessarily happens after the config file has been read. Although I don't follow the code in full detail, I think we can conclude that these messages are not emitted when loading the models themselves, but at an earlier point in the workflow.
It only tries to figure out which models are new after these messages have been displayed.
You can see the comparison happening here.
There is one strange thing about your question: judging by the timestamps of your messages, your model config file is being read every 60 minutes rather than every 60 seconds.
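If you want to confirm that nothing is actually being reloaded, one option is to query the model status over gRPC between polls and check that the same version stays AVAILABLE. A rough sketch, assuming the default gRPC port 8500 is reachable and the tensorflow-serving-api package is installed (the host name below is a placeholder):

import grpc
from tensorflow_serving.apis import get_model_status_pb2, model_service_pb2_grpc

# Placeholder address: point this at the serving pod/service exposing the gRPC port.
channel = grpc.insecure_channel("tf-serving.example.internal:8500")
stub = model_service_pb2_grpc.ModelServiceStub(channel)

request = get_model_status_pb2.GetModelStatusRequest()
request.model_spec.name = "mlp"

# Each entry reports a version and its state (e.g. AVAILABLE). If the same
# version stays AVAILABLE across polls, the model is not being reloaded.
for version_status in stub.GetModelStatus(request).model_version_status:
    print(version_status.version, version_status.state)

If the reported version and state do not change from one poll to the next, the repeated "(Re-)adding model" lines are just the config file being re-read, not the model being reloaded.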

Related

Dash: how to use progress bar with background callback with celery and redis

I am developing a Dash application for a machine learning project. I am using docker-compose with the services below:
redis: # built-in redis image
db: # built-in postgres image
pgadmin: # built-in pgadmin image
web: # my custom dash app image. I have used gunicorn to start it in docker-compose file
celery: # my custom dash app image. I have used celery to start it in docker-compose file
proxy: # built-in nginx image
Use cases:
Upload dataset: I'm using dcc.Upload() and the file is uploaded successfully, e.g. the user uploads data.csv.
Create train, test, dev dataset files: I'm using dbc.Button(); the user clicks it and all three files are created successfully (using sklearn), e.g. we get train.csv, test.csv and dev.csv.
Model training: I'm using dbc.Button(); the user clicks it and the model is generated and stored successfully, e.g. model.joblib.
The setup above is running fine.
Now I'm trying to use background callbacks with Celery and Redis for the use cases listed above.
Implemented so far:
From the Dash website, I have built the Example 4: Progress Bar demo (https://dash.plotly.com/background-callbacks), which runs successfully.
Issues:
How do I update the progress bar values, i.e. what value should I pass?
Reading the Celery documentation, I found that a task exposes results such as state and status. How do I use those results (state, status, etc.) to drive the progress bar in a Dash callback?
Please suggest.
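For reference, this is roughly the skeleton I am working from, adapted from that Example 4 (the Redis URLs, component IDs and the dummy loop are placeholders; the question is how to drive set_progress from the Celery task state instead of a hard-coded loop):

import time
import dash
from dash import html, Input, Output, CeleryManager
from celery import Celery

# Placeholder broker/backend URLs matching the redis service in docker-compose.
celery_app = Celery(__name__, broker="redis://redis:6379/0", backend="redis://redis:6379/1")
background_callback_manager = CeleryManager(celery_app)

app = dash.Dash(__name__, background_callback_manager=background_callback_manager)
app.layout = html.Div([
    html.Button("Train model", id="train-button"),
    html.Progress(id="progress-bar"),
    html.Div(id="train-result"),
])

@dash.callback(
    Output("train-result", "children"),
    Input("train-button", "n_clicks"),
    background=True,
    progress=[Output("progress-bar", "value"), Output("progress-bar", "max")],
    prevent_initial_call=True,
)
def train(set_progress, n_clicks):
    total = 10
    for i in range(total):
        time.sleep(1)  # stand-in for a real training step
        set_progress((str(i + 1), str(total)))  # (value, max) for the Progress element
    return "Training finished"

if __name__ == "__main__":
    app.run(debug=True)

The Celery worker is started separately against the same module, as in the docs example.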

Named processes grafana dashboard not working

I created this dashboard by importing its ID.
Then, in order to have the necessary metrics, I used this chart to install this exporter in my EKS cluster:
helm repo add prometheus-process-exporter-charts https://raw.githubusercontent.com/mumoshu/prometheus-process-exporter/master/docs
helm install --generate-name prometheus-process-exporter-charts/prometheus-process-exporter
All the prometheus-process-exporter pods are up and running, but the only log line they produce is:
2022/11/23 18:26:55 Reading metrics from /host/proc based on "/var/process-exporter/config.yml"
I was expecting all default processes to be listed in the dashboard automatically as soon as I deployed the exporter, but the dashboard still says "No data".
Do you have any ideas on why this is happening? Did I miss any step in configuring this exporter?

Is TensorFlow continuously polling an S3 filesystem during training or when using TensorBoard?

I'm trying to use TensorBoard on my local machine to read TensorFlow logs stored on S3. Everything works, but TensorBoard continuously writes the following errors to the console. According to this, the reason is that when the TensorFlow S3 client checks whether a directory exists, it first runs Stat on it, since S3 has no direct way to check whether a directory exists. It then checks whether a key with that name exists and fails with these error messages.
While this may be desired behavior for model serving, where it is used to look for updated models and can be stopped with file_system_poll_wait_seconds, I don't know how to stop it for training. In fact, the same thing happens during training if you save checkpoints and logs to S3.
Suppressing these errors by increasing the log level is not an option, because TensorFlow still continuously polls S3 and you pay for these useless requests.
I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2020-11-23 11:41:02.502274: E tensorflow/core/platform/s3/aws_logging.cc:60] HTTP response code: 404
Exception name:
Error message: No response body.
6 response headers:
connection : close
content-type : application/xml
date : Mon, 23 Nov 2020 10:41:01 GMT
server : AmazonS3
x-amz-id-2 : ...
x-amz-request-id : ...
2020-11-23 11:41:02.502364: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2020-11-23 11:41:02.502699: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2020-11-23 11:41:03.327409: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2020-11-23 11:41:03.491773: E tensorflow/core/platform/s3/aws_logging.cc:60] HTTP response code: 404
Any ideas?
I was wrong. TF just writes logs to S3, and while the errors are related to the linked issue, this is normal behavior. The extra cost is minimal because AWS doesn't charge for data transfer between services in the same region, only for the operations. The same applies when using TensorBoard with S3. For anyone interested in these topics, I made a repository here.
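To illustrate what generates these requests, here is a minimal sketch of writing summaries straight to an S3 path (the bucket and prefix are placeholders; this assumes a TensorFlow build with S3 filesystem support and AWS credentials in the environment). Every write and flush goes through TF's S3 filesystem layer, which first probes keys and directories with Stat-style requests, and those probes are what show up as 404s in the log:

import tensorflow as tf

# Placeholder bucket/prefix; requires S3 filesystem support and AWS credentials
# (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION) in the environment.
writer = tf.summary.create_file_writer("s3://my-bucket/tensorboard-logs/run1")

with writer.as_default():
    for step in range(100):
        # Each scalar write goes through the S3 filesystem plugin, which
        # probes for existing keys/directories before writing.
        tf.summary.scalar("loss", 1.0 / (step + 1), step=step)
    writer.flush()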

ECS Fargate - No Space left on Device

I deployed my ASP.NET Core application on AWS Fargate and everything was working fine. I am using the awslogs driver and logs were correctly sent to CloudWatch. But after a few days of working correctly, I am now seeing only one kind of log, as shown below:
So no application logs are showing up because there is no space left. If I update the ECS service, logging starts working again, which suggests the disk has been cleaned up.
This link suggests that the awslogs driver does not take up disk space and sends logs to CloudWatch instead:
https://docs.aws.amazon.com/AmazonECS/latest/userguide/task_cannot_pull_image.html
Has anyone else faced this issue and knows how to resolve it?
You need to set the "LibraryLogFileName" parameter in your AWS Logging configuration to null.
So in the appsettings.json file of a .NET Core application, it would look like this:
"AWS.Logging": {
"Region": "eu-west-1",
"LogGroup": "my-log-group",
"LogLevel": {
"Default": "Information",
"Microsoft": "Warning",
"Microsoft.Hosting.Lifetime": "Information"
},
"LibraryLogFileName": null
}
It depends on how you have logging configured in your application. The awslogs driver just grabs all output sent to the console and saves it to CloudWatch; .NET doesn't necessarily know about this and will keep writing log files like it otherwise would.
Likely .NET is still writing logs to whatever location it otherwise would be.
Advice for how to troubleshoot and resolve:
First, run the application locally and check if log files are being saved anywhere
Second, optionally run a container test to see if log files are being saved there too
Make sure you have docker installed on your machine
Download the container image from ECR which fargate is running.
docker pull {Image URI from ECR}
Run this locally
Do some task you know will generate some logs
Use docker exec -it to connect up to your container
Check if log files are being written to the location you identified when you did the first test
Finally, once you have identified that logs are being written to files somewhere, pick one of these options:
Add some flag which can be optionally specified to disable logging to a file. Use this when running your application inside of the container.
Implement some logic to clean up log files periodically or once they reach a certain size. (Keep in mind ECS containers have up to 20GB local storage)
Disable all file logging (not a good option in my opinion)
Best of luck!

curl_json plugin not sending data (using it to send load balancer metrics)

I've set up the curl_json plugin to collect and send load balancer metrics to my RabbitMQ, to be graphed in Graphite.
The thing is, it's not sending any data, while collectd is working just fine (and great) with other plugins like memory, cpu, df root, network, etc. I've tried to troubleshoot following this suggestion: https://serverfault.com/questions/499378/collectd-stores-nan-instead-of-correct-value-in-ubuntu-12-04, but no issues come up.
Here's my collectd.conf: https://gist.github.com/Mariano-gon/8732467
Here are the last lines of collectd.log when I start it: https://gist.github.com/Mariano-gon/8732488
The request is made against the Rackspace API where my load balancer is located, and if I run the curl manually I get a perfectly normal JSON response.
Here's a snippet of it: https://gist.github.com/Mariano-gon/8732518
Finally, when started, collectd does not create any new folders besides network, df, memory, cpu, etc. (all plugins that are working correctly and sending data).
I hope this info helps; any comments will be really appreciated.
Thanks!
No answers; I also tried the collectd mailing list and IRC. Closing.