Our jobs service test suite expects a Redis database to be available in order to run its test cases. We're running into an issue where the jobs service sometimes fails to connect to Redis and sometimes doesn't.
We've followed the Codeship guide to the letter, yet the connection still succeeds on some runs and fails on others. I've tried switching Redis versions, but that does not seem to have solved the issue.
Sounds like it would be appropriate to implement a Docker healthcheck on your service.
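For instance, a minimal docker-compose-style sketch of that idea, where the service names, the image tag, and the healthcheck timings are assumptions rather than anything from the original setup:

services:
  redis:
    image: redis:4-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]   # healthy once Redis answers PONG
      interval: 5s
      timeout: 3s
      retries: 5
  jobs-tests:
    build: .
    depends_on:
      redis:
        condition: service_healthy          # don't start the test run until Redis is ready

With something like that in place, the test container only starts once Redis is actually accepting connections, which removes the race that makes the suite pass or fail depending on startup order.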
We are running one of our services in a newly created Kubernetes cluster, and because of that we have switched it from the previous in-memory cache to a Redis cache.
Preliminary tests on our application, which exposes an API, show that we experience timeouts from the application to the Redis cache. I have no idea why, and the issue pops up very irregularly.
So I'm thinking the reason for these timeouts may actually be network related. Is it a good idea to put in affinity so we always run the Redis cache on the same nodes as the application, to prevent network issues?
The issues have not arisen during "very high load" situations, which is what concerns me a bit.
This is an opinion question, so I'll answer in an opinionated way:
Like you mentioned, I would try to put the Redis and application pods on the same node; that would rule out inter-node networking issues. You can accomplish that with Kubernetes pod affinity. You can also try a nodeSelector, which lets you pin both the Redis and application pods to a specific node.
Another way to do this is to taint the nodes where you want to run these workloads and then add a matching toleration to the Redis and application pods.
Hope it helps!
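For reference, a minimal sketch of the pod affinity option, assuming the Redis pods carry a label like app: redis-cache and the application deployment is called api-app (both names are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-app
  template:
    metadata:
      labels:
        app: api-app
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: redis-cache                  # must match the Redis pods' labels
              topologyKey: kubernetes.io/hostname   # "same node" = same hostname
      containers:
        - name: api-app
          image: api-app:latest                     # placeholder image

The nodeSelector and taint/toleration variants achieve the same co-location by pinning both workloads to a specific node instead of expressing the relationship between the pods.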
We recently started using Prometheus in our production environment. Before, we had only 30-40 nodes for each service and those servers did not change very often, so we simply listed them in prometheus.yml. Now the file has become too long to keep as a single file and the targets change much more frequently than before. My question is: should I use file_sd_config to move those server lists out of the yml file and change those config files separately, or use Consul for service discovery (which also seems much easier for handling changes)?
I have installed a 3-node Consul cluster in the data center, and as far as I can tell, if I switch to Consul to solve this problem I also need to install the Consul client on each server (node) and define its service info there. Is that correct, or does anyone have better advice?
Thanks
I totally advocate the use of a service discovery system. It may be a bit hard to deploy at first, but it will surely be worth it in the future.
That said, Prometheus comes with a lot of service discovery integrations, so it's possible that you don't need a Consul cluster at all. If your servers are in a cloud provider like AWS, GCP, Azure, OpenStack, etc., Prometheus is able to autodiscover the instances.
If you keep going with Consul, the answer is yes, the agent must be running on every node. You can also register services and nodes via the API, but it's easier to deploy the agent.
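As a rough illustration of both options in prometheus.yml (the job names, file path, and Consul address are placeholders):

scrape_configs:
  # Option 1: keep the target lists in separate files that can be edited
  # or regenerated without touching prometheus.yml.
  - job_name: my-service
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/my-service-*.json
        refresh_interval: 1m

  # Option 2: discover targets that the Consul agents have registered.
  - job_name: consul-services
    consul_sd_configs:
      - server: consul.example.internal:8500
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job

A file_sd target file is just a list of {"targets": [...], "labels": {...}} entries in JSON or YAML, so whatever provisions the servers can rewrite it and Prometheus will pick up the change without a restart.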
I'm trying to test out Airflow on Kubernetes. The scheduler, worker, queue, and webserver are each in their own deployment, and I am using the Celery executor to run my tasks.
Everything is working fine except that the scheduler is not able to queue up jobs. Airflow runs my tasks fine when I execute them manually from the web UI or CLI, but I am trying to get the scheduler to do it.
My configuration is almost the same as it is on a single server:
sql_alchemy_conn = postgresql+psycopg2://username:password@localhost/db
broker_url = amqp://user:password@$RABBITMQ_SERVICE_HOST:5672/vhost
celery_result_backend = amqp://user:password@$RABBITMQ_SERVICE_HOST:5672/vhost
I believe that with this configuration I should be able to make it run, but for some reason only the workers are able to see the DAGs and their state, not the scheduler, even though the scheduler logs its heartbeats just fine. Is there anything else I should debug or look at?
First, you are using Postgres as the database for Airflow, aren't you? Did you deploy a pod and a service for Postgres? If so, have you verified that your config file contains:
sql_alchemy_conn = postgresql+psycopg2://username:password@serviceNamePostgres/db
You can use this GitHub repo. I used it 3 weeks ago for a first test and it worked pretty well.
Its entrypoint is useful for verifying that RabbitMQ and Postgres are configured correctly.
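In other words, every connection string in airflow.cfg should point at the Kubernetes service names rather than localhost, and the same configuration has to be used by the scheduler, the workers, and the webserver. A sketch, where the service names postgres and rabbitmq are assumptions about how the services are named in the cluster:

sql_alchemy_conn = postgresql+psycopg2://username:password@postgres:5432/db
broker_url = amqp://user:password@rabbitmq:5672/vhost
celery_result_backend = amqp://user:password@rabbitmq:5672/vhost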
I'm running Celery on my laptop, with RabbitMQ as the broker and Redis as the backend. I just used all the default settings and ran celery -A tasks worker --loglevel=info, and it all worked. The workers get jobs done and I can fetch the execution results by calling result.get(). My question is why it works even though I never started the RabbitMQ and Redis servers at all. I did not set up accounts on the servers either. In many tutorials, the first step is to run the broker and backend servers before starting Celery.
I'm new to these tools and do not quite understand how they work behind the scenes. Any input would be greatly appreciated. Thanks in advance.
Never mind. I just realized that Redis and RabbitMQ start automatically after installation or at startup. They must be running for Celery to work.
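If in doubt, a couple of quick checks before starting the worker will confirm they are up (default local installs assumed):

redis-cli ping          # should print PONG
rabbitmqctl status      # should print the node status (may need sudo)
celery -A tasks worker --loglevel=info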
The Redis cache offered by Cloud Foundry has a small capacity, i.e. 16 MB.
I know Redis has a FLUSHALL command that deletes all the keys in the cache. How do I do the same thing in Cloud Foundry?
You can recreate and rebind the service whenever you wish, unless you have some specific configuration that cannot be migrated. (I assume services provisioned on CF.com are all created the same way.)
Another option is to send FLUSHALL through the Redis tunnel, provided you have vmc and the caldecott gem installed as well as a Redis client locally. Would you mind sharing the error you get when you cannot connect to it?
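Roughly, with the legacy vmc CLI and the caldecott gem installed, the tunnel flow looks like this (the service name is a placeholder, and the port and password are whatever vmc reports when it opens the tunnel):

vmc tunnel my-redis-service                  # opens a local tunnel to the bound Redis service
redis-cli -p 10000 -a <password> FLUSHALL    # wipe every key through the tunnel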