Two Logstash instances on same Docker container - redis

I am wondering whether two Logstash processes with separate configurations can be run in a single Docker container.
My setup has one Logstash process using a file input and sending events to Redis; from there a second Logstash process picks the events up and passes them on to a custom HTTP process. So: Logstash --> Redis --> Logstash --> HTTP. I was hoping to keep the two Logstash instances and Redis in the same Docker container. I am still new to Docker and would highly appreciate any input or feedback.

This would be more complicated than it needs to be. In the Docker world it is much simpler to run three containers doing three things than to run one container that does them all. It is possible, though.
You need to run an init process in your container to control the multiple processes, and launch it as your container's entry point. The init will have to know how to launch the processes you are interested in, both the Logstash instances and Redis. phusion/baseimage provides an image with a good init system, but its launch scripts are based on runit and can be hard to pick up.
If you would rather run a single process per container, you can use a docker-compose file to launch all three containers and link them together.
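If you go the compose route, a minimal sketch might look like the following (the image tags, config file names and the ./logs mount are placeholders for your own setup; the shipper pipeline would use a file input and a redis output pointing at the redis service, and the indexer pipeline a redis input and an http output):

    version: "3"
    services:
      redis:
        image: redis:6
      logstash-shipper:
        image: docker.elastic.co/logstash/logstash:7.17.9
        volumes:
          - ./shipper.conf:/usr/share/logstash/pipeline/logstash.conf
          - ./logs:/logs:ro
        depends_on:
          - redis
      logstash-indexer:
        image: docker.elastic.co/logstash/logstash:7.17.9
        volumes:
          - ./indexer.conf:/usr/share/logstash/pipeline/logstash.conf
        depends_on:
          - redis

With this layout both Logstash configs can simply reference the host name "redis", and a single docker-compose up starts all three pieces.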

Related

Google cloud kubernetes cluster newbie question

I am a GKE newbie. I created a GKE cluster with a very simple setup: it has only one GPU node, and everything else was left at the defaults. After the cluster was up, I was able to list the nodes and SSH into them. But I have two questions here.
I tried to install the NVIDIA driver using this command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
It output:
kubectl apply --filename https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
daemonset.apps/nvidia-driver-installer configured
But 'nvidia-smi' cannot be found at all. Should I do something else to make it work?
On the worker node there was no .kube directory and no 'config' file. I had to copy it from the master node to the worker node to make things work, and since the config file on the master node updates automatically, I have to copy it again and again. Did I miss some steps when creating the cluster, or how can I resolve this problem?
I would appreciate it if someone could shed some light on this. It has been driving me crazy after working on it for several days.
Tons of thanks.
Alex.
For the DaemonSet to work, your worker node needs a label such as cloud.google.com/gke-accelerator (see this line). The DaemonSet checks for this label before scheduling any driver-installer pods on a node. I'm guessing the default node pool you created did not have this label. You can find more details in the GKE docs here.
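As a quick check, and to fix it by adding a GPU node pool that GKE labels automatically (the cluster name, zone and accelerator type below are placeholders for your own values):

    # check whether any node carries the accelerator label
    kubectl get nodes --show-labels | grep gke-accelerator

    # if not, add a GPU node pool; GKE applies the label for you
    gcloud container node-pools create gpu-pool \
        --cluster my-cluster --zone us-central1-a \
        --accelerator type=nvidia-tesla-t4,count=1 \
        --num-nodes 1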
The worker nodes, by design, are just that: worker nodes. They do not need privileged access to the Kubernetes API, so they don't need any kubeconfig files. Communication between worker nodes and the API is strictly controlled through the kubelet binary running on each node, so you will never find kubeconfig files on a worker node. You should never put them there either: if a node gets compromised, the keys in that file can be used to damage the API server. Instead, make it a habit either to use the master nodes for kubectl commands or, better yet, to keep the kubeconfig on your local machine, keep it safe, and issue commands to your cluster remotely.
After all, all you need is access to an endpoint for your Kubernetes API server, and it shouldn't matter where you access it from, as long as the endpoint is reachable. So there is no need whatsoever to have a kubeconfig on the worker nodes :)
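On GKE the usual way to do that is to pull the credentials down to your workstation (cluster name and zone are placeholders):

    gcloud container clusters get-credentials my-cluster --zone us-central1-a
    kubectl get nodes    # now talks to the cluster from your local machine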

Is there a way of using dask jobqueue over ssh

Dask-jobqueue seems to be a very nice solution for distributing jobs to PBS/SLURM-managed clusters. However, if I'm understanding its use correctly, you must create an instance of PBSCluster/SLURMCluster on the head/login node. Then, on that same node, you create a client instance through which you can start submitting jobs.
What I'd like to do is let jobs originate on a remote machine, be sent over SSH to the cluster head node, and then get submitted to dask-jobqueue. I see that Dask has support for sending jobs over SSH via distributed.deploy.ssh.SSHCluster, but this seems to be designed for immediate execution after SSH, as opposed to taking the further step of submitting to the job queue.
To summarize, I'd like a workflow where jobs go remote --ssh--> cluster-head --slurm/jobqueue--> cluster-node. Is this possible with existing tools?
I am currently looking into this. My idea is to set up an SSH tunnel with paramiko and then use Pyro5 to communicate with the cluster object from my local machine.
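As an alternative that avoids Pyro5 entirely, you could pin the Dask scheduler to a known port on the head node and connect a Client through a plain SSH tunnel. A minimal sketch, assuming dask-jobqueue is installed on the head node; the SLURM partition and resource values are placeholders:

    # head_node.py -- run this on the cluster head/login node
    import time
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(
        queue="normal",                 # placeholder partition name
        cores=8,
        memory="16GB",
        walltime="01:00:00",
        scheduler_options={"port": 8786, "dashboard_address": ":8787"},
    )
    cluster.scale(jobs=2)               # ask SLURM for two worker jobs
    while True:
        time.sleep(60)                  # keep the scheduler process alive

    # remote.py -- run on your own machine after opening the tunnel:
    #   ssh -N -L 8786:localhost:8786 user@cluster-head
    from dask.distributed import Client

    client = Client("tcp://localhost:8786")
    print(client.submit(lambda x: x + 1, 41).result())   # runs on a SLURM worker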

How to distribute spiders across the cluster using Scrapyd and ScrapydWeb?

I am working on a crawling project using Scrapy, and I need to distribute my spiders across different nodes in a cluster to make the process faster. I am using ScrapydWeb to manage it, and I have already configured two machines, one of them with ScrapydWeb up and both with Scrapyd up. The web app recognizes both and I can run my spider properly. The problem is that the crawling only runs in parallel (the same content is fetched by both machines), whereas my purpose was to distribute the work so as to minimize the crawling time.
Could anybody help me? Thank you in advance.
I don't think Scrapyd and ScrapydWeb offer the possibility of splitting a single spider's work across different servers; they just run the same spider in full on each. If you want to distribute the crawling you can do one of the following:
Run 1 spider only on 1 server
If you need actual distributed crawling (where the same spider runs across different machines without multiple machines parsing the same URL), you can look into Scrapy-Cluster
You can write custom code where you have one process generating the URLs to scrape, put the found URLs in a queue (using Redis, for example), and have multiple servers pop URLs from this queue to fetch and parse the pages (a rough sketch follows below)
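A minimal sketch of that last option, assuming a reachable Redis instance; the queue name "crawl:todo" and the extract_links helper are made up for illustration:

    import redis
    import requests

    r = redis.Redis(host="redis-host", port=6379)    # placeholder Redis host

    def enqueue(urls):
        """Producer: push URLs that should be crawled."""
        for url in urls:
            r.lpush("crawl:todo", url)

    def worker():
        """Consumer: run one of these loops on every crawl server."""
        while True:
            _, url = r.brpop("crawl:todo")           # blocks until a URL is available
            response = requests.get(url.decode())
            # parse `response` here and push any newly discovered links back:
            # enqueue(extract_links(response))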
I used Scrapy Cluster to solve the problem and I'm sharing my experience:
Docker installation was hard for me to control and debug, so I tried the Cluster Quick-start and it worked better.
I have five machines available in my cluster, and I used one to host Apache Kafka as well as ZooKeeper. I also used one for the Redis DB. It's important to make sure those machines accept external access from the ones you are going to use for spidering.
Once these three components were properly installed and running, I installed Scrapy Cluster's requirements in a Python 3.6 environment. Then I configured a local settings file with the IP addresses of the hosts and made sure all online and offline tests passed.
With everything set up, I was able to run the first spider (the official documentation provides an example). The idea is that you create instances of your spider (you can, for example, use tmux to open 10 different terminal windows and run one instance in each). When you feed Apache Kafka a URL to be crawled, it's sent to a queue in Redis, where your instances periodically look for a new page to crawl.
If your spider generates more URLs from the one you passed initially, they go back to Redis and may be crawled by other instances. That's where you can see the power of this distribution.
Once a page is crawled, the result is sent to a Kafka topic.
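For reference, feeding a URL into the cluster and tailing the crawled output looked roughly like this in my setup (the commands follow the quick-start; treat the topic name and JSON fields as examples rather than exact syntax for your version):

    # feed a seed URL through the Kafka monitor
    python kafka_monitor.py feed '{"url": "http://example.com", "appid": "testapp", "crawlid": "abc123"}'

    # watch results arriving on the crawled-items topic
    python kafkadump.py dump -t demo.crawled_firehose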
The official documentation is extensive, and you can find more details there on installation and setup.

Clone RabbitMQ admin users, etc. on replacement server

We have a couple of crusty AWS hosts running a RabbitMQ implementation in a cluster. We need to upgrade the hardware, and therefore we developed a Chef cookbook to spawn replacement servers.
One thing that we would rather not recreate by hand is the admin users, the queues, etc.
What is the best method to get that stuff from the old hosts to the new ones? I believe it's everything that lives in the /var/lib/rabbitmq/mnesia directory.
Is it wise to copy the files from one host to another?
Is there a programmatic means to do this?
Can it be coded into our Chef cookbook?
You can definitely export and import configuration via the command line: https://www.rabbitmq.com/management-cli.html
I'm not sure about the admin users, though.
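A quick sketch of what that looks like with rabbitmqadmin (the file name is arbitrary; exported definitions normally include users with their password hashes, plus vhosts, permissions, queues, exchanges and bindings, which should cover the admin-user concern):

    # on one of the old hosts
    rabbitmqadmin export rabbit.definitions.json

    # on a node of the new cluster
    rabbitmqadmin import rabbit.definitions.json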
If you create new RabbitMQ nodes on your new hardware and join them to the existing cluster, the new nodes will get all the users. This is easy to try:
run a docker container with the rabbitmq image (with the management plugin) and create a user
run another container and add that node to the cluster of the first one
kill rabbitmq on the first one, or delete the docker container, and you will see that you still have the newly created user on the 2nd (now the only remaining) node
I suggested Docker since it's faster to create a test cluster this way, but if you already have a cluster you could use that for testing if you prefer.
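A rough transcript of that experiment (container names, credentials and the Erlang cookie value are made up; the image tag may differ for you):

    # first node
    docker run -d --name rmq1 --hostname rmq1 \
        -e RABBITMQ_ERLANG_COOKIE=secretcookie rabbitmq:3-management
    docker exec rmq1 rabbitmqctl add_user admin s3cret
    docker exec rmq1 rabbitmqctl set_user_tags admin administrator

    # second node, joined to the first
    docker run -d --name rmq2 --hostname rmq2 --link rmq1 \
        -e RABBITMQ_ERLANG_COOKIE=secretcookie rabbitmq:3-management
    docker exec rmq2 rabbitmqctl stop_app
    docker exec rmq2 rabbitmqctl join_cluster rabbit@rmq1
    docker exec rmq2 rabbitmqctl start_app

    # remove the first node; the user survives on rmq2
    docker rm -f rmq1
    docker exec rmq2 rabbitmqctl list_users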
For the queues and exchanges, I don't want to quote almost everything found on the RabbitMQ high-availability doc page, so I will just say that you have to pay attention to the following:
exclusive queues, because they are gone once the client connection is gone
queue mirroring (if you have it set up; if not, it would be wise to consider it, if not outright necessary)
I would do the migration gradually, waiting for the queues to empty and then killing off the nodes on the old hardware. It may be doable in a big-bang fashion, but that seems riskier. If you have a running system, then set up queue mirroring and try to find an appropriate moment to do a manual sync - but be careful, this has a huge impact on broker performance.
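For reference, a mirror-everything policy and a manual sync can be set up like this (the policy name, vhost and queue name are placeholders for your own):

    # mirror all queues on the default vhost across all nodes
    rabbitmqctl set_policy -p / ha-all "^" '{"ha-mode":"all"}'

    # explicitly synchronise a lagging mirror (heavy on the broker, as noted above)
    rabbitmqctl sync_queue my_queue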
Additionally there is the Shovel plugin (I have to point out that I did not use or even explore it), but that may be another way to go since (quoting from the link):
In essence, a shovel is a simple pump. Each shovel:
connects to the source broker and the destination broker,
consumes messages from the queue,
re-publishes each message to the destination broker (using, by default, the original exchange name and routing_key).

Fuse Fabric8 Clustering

I am a noob with fabric8. I have a question about clustering with Docker images.
I have pulled the Docker image fabric8/fabric8. I just want the containers I launch to automatically fall into the same cluster, without using fabric:create and fabric:join.
Say I launch 3 containers of fabric8/fabric8; they should fall under the same cluster without manual configuration.
Please share some links or references. I'm lost.
Thanks in advance
In fabric8 v1 the idea was that you create a fabric using the fabric:create command and then spin up Docker containers, using the Docker container provider, in pretty much the same way as you would with child containers (either using the container-create-docker command or using hawtio and selecting docker as the container type).
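For illustration only, a Fabric8 v1 Karaf console session following that flow might look roughly like this (I'm recalling the v1 shell; the options, profile and container name are assumptions and may differ between versions):

    fabric:create --wait-for-provisioning
    fabric:container-create-docker --profile default node2
    fabric:container-list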