Resque Workers from other hosts registered and active on my system - ruby-on-rails-3

The Rails application I'm currently working on is hosted on Amazon EC2 servers. It uses Resque for running background jobs, and there are 2 such instances (a would-be production and a stage). I've also mounted the Resque monitoring web app at the /resque route (on stage only).
Here is my question:
Why are there workers from multiple hosts registered within my stage system, and how can I avoid this?
Some additional details:
I see workers from apparently 3 different machines, but I managed to identify only 2 of them: the stage (obviously) and the production. The third has a different address format (it starts with domU) and I have no clue what it could be.

It looks like you're sharing a single Redis server across multiple Resque environments.
The safest fix is to use separate Redis servers, separate Redis databases, or separate namespaces. The redis-namespace gem can be used with Resque to isolate each environment's queues and worker data.
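As a rough sketch of the namespace option, an initializer along these lines should keep each environment's keys apart. The file name, the Redis address and the use of RAILS_ENV for the suffix are my assumptions, not something from the question:

```ruby
# config/initializers/resque.rb (hypothetical file name and values)
require 'resque'

# Point Resque at the shared Redis server (host and port are placeholders).
Resque.redis = 'localhost:6379'

# Give each environment its own key prefix, e.g. "resque:stage" vs.
# "resque:production", so workers and queues no longer overlap.
Resque.redis.namespace = "resque:#{ENV.fetch('RAILS_ENV', 'development')}"
```

Workers that registered before the change stay under the old keys, so the stale entries still have to be cleaned up once.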

I can't really help you with what the unknown one is, but I had something similar happen when moving hosts and having DNS names change. The only way I found to clear out the old ones was to stop all workers on the machine, fire up IRB, require 'resque' and look at Resque.workers. This lists all the workers Resque knows about, which in your case will include about 20 bogus ones. You can then do:
Resque.workers.each { |worker| worker.unregister_worker }
This should prune all the not-really-there workers and get you back to a proper display of the real workers.
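If the Redis server is still shared, unregistering everything also drops the entries of workers that are really running on the other environment. A more selective variant, sketched below, filters by host name first. It assumes worker ids look like "hostname:pid:queues"; the host names and ids are made up for illustration:

```ruby
# Return the worker ids whose host is not in the list of known live hosts.
# Assumes ids of the form "hostname:pid:queue1,queue2".
def stale_worker_ids(worker_ids, live_hosts)
  worker_ids.reject do |id|
    host = id.split(':', 2).first
    live_hosts.include?(host)
  end
end

# Made-up ids standing in for Resque.workers.map(&:to_s).
ids = [
  'stage-host:1234:*',
  'prod-host:4321:mailers,default',
  'domU-12-31-39-00-00-01:999:*'
]

stale = stale_worker_ids(ids, ['stage-host', 'prod-host'])
puts stale

# With Resque loaded, the matching workers could then be pruned with:
#   Resque.workers.select { |w| stale.include?(w.to_s) }.each(&:unregister_worker)
```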

Related

How to distribute spiders across the cluster using Scrapyd and ScrapydWeb?

I am working on a crawling project using Scrapy, and I need to distribute my spiders across different nodes in a cluster to make the process faster. I am using ScrapydWeb to manage it and I have already configured two machines, one of them with ScrapydWeb up and both with Scrapyd up. The web app recognizes both and I can run my spider properly. The problem is that the crawl just runs in parallel (the content is being fetched by both machines), whereas my goal was to run it in a distributed way to minimize the crawling time.
Could anybody help me? Thank you in advance.
I don't think Scrapyd and ScrapydWeb offer a way to spread a single spider across different servers, other than fully running the same spider on each. If you want to distribute the crawling, you can either:
Run 1 spider only on 1 server
If you need actual distributed crawling (where the same spider runs across different machines without multiple machines parsing the same url), you can look into Scrapy-Cluster
Write custom code in which one process generates the URLs to scrape and puts them in a queue (using Redis, for example), while multiple servers pop URLs from this queue to fetch and parse the pages
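The queue idea in the last option can be sketched roughly like this. It is only an in-process illustration: a thread-safe Queue stands in for the shared Redis list, threads stand in for the servers, and the links table replaces real fetching and parsing. All names and URLs are made up:

```ruby
require 'set'

# Hypothetical link graph standing in for "fetch the page and extract its links".
LINKS = {
  'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
  'http://example.com/a' => ['http://example.com/b'],
  'http://example.com/b' => []
}.freeze

def crawl(seed_urls, links, worker_count: 3)
  queue = Queue.new     # in-process stand-in for a shared Redis list
  lock = Mutex.new
  seen = Set.new        # every URL ever enqueued, so no page is fetched twice
  crawled_by = {}       # url => id of the worker that processed it
  pending = 0           # URLs enqueued but not yet fully processed

  # Enqueue a URL unless it was seen before (caller must hold the lock).
  enqueue = lambda do |url|
    next unless seen.add?(url)
    pending += 1
    queue << url
  end

  lock.synchronize { seed_urls.each { |url| enqueue.call(url) } }
  worker_count.times { queue << :stop } if pending.zero?

  workers = Array.new(worker_count) do |id|
    Thread.new do
      while (url = queue.pop) != :stop
        found = links.fetch(url, [])   # "fetch and parse" the page
        lock.synchronize do
          crawled_by[url] = id
          found.each { |link| enqueue.call(link) }
          pending -= 1
          # Nothing left anywhere: tell every worker to shut down.
          worker_count.times { queue << :stop } if pending.zero?
        end
      end
    end
  end

  workers.each(&:join)
  crawled_by
end

result = crawl(['http://example.com/'], LINKS)
puts result.keys.sort
```

Each URL is handed to exactly one worker, which is the difference from running the same spider in full on every machine.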
I used Scrapy Cluster to solve the problem and I'm sharing my experience:
Docker installation was hard for me to control and debug, so I tried the Cluster Quick-start and it worked better.
I have five machines available in my cluster and I used one to host Apache Kafka as well as ZooKeeper. I also had one for Redis. It's important to make sure those machines are reachable from the ones you are going to use for spidering.
Once these three components were properly installed and running, I installed Scrapy Cluster's requirements in a python3.6 environment. Then, I configured a local settings file with the IP address for the hosts and made sure all online and offline tests passed.
With everything set up, I was able to run the first spider (the official documentation provides an example). The idea is that you create instances of your spider (you can, for example, use tmux to open 10 different terminal windows and run one instance in each). When you feed Apache Kafka a URL to be crawled, it's sent to a queue in Redis, where your instances periodically look for a new page to crawl.
If your spider generates more URLs from the one you passed initially, they return to Redis to be possibly crawled by other instances. And that's where you can see the power of this distribution.
Once a page is crawled, the result is sent to a Kafka topic.
The official documentation is huge and you can find more details on the installation and setup.

Distributed Rabbitmq within a spring-cloud environment

I am trying to set up a distributed system based on the current spring-cloud release (meaning mostly Netflix OSS) using the following components:
1 or more cloud config servers
1 or more Eureka servers
1 or more services using Eureka and Config Server clients
The setup above is easy enough to get going. However, once you start looking into making configuration changes in the Config servers automatically trigger changes in the values of the actual clients, things get more complicated.
It is my understanding that for such a feature to work, one should add spring-cloud-bus clients to the services, which in turn use RabbitMQ servers (currently the only supported implementation; the actual RabbitMQ binaries, not a spring-boot app like the Eureka or Config servers) to propagate change events from the Config server to the clients automatically.
It seems counterintuitive to set up such a system and then have to hardcode the addresses of the RabbitMQ servers in the clients (even if the number of RabbitMQ servers stays more or less static).
How is one supposed to register RabbitMQ server instances in the Eureka service discovery server(s) so that clients can find them without any prior knowledge of their location at startup?
I cannot seem to find any documentation on how this is done, given that RabbitMQ is not a spring-cloud component. In fact, very little documentation seems to exist on how RabbitMQ, Eureka and spring-cloud-bus should be set up together.
I know that I am on a VERY old question, but I think it is worth a comment for people who read this in the future.
Most cloud services, let's take AWS as an example, have an Elastic IP solution, so you can reserve IPs for the RabbitMQ servers, and those IPs always belong to RabbitMQ no matter whether the instances change. You can re-attach an Elastic IP to a different instance.
It works nearly the same with an Elastic Load Balancer, which keeps a stable endpoint (its DNS name), so you could point your microservices at that address using Spring Cloud Config Server and scale the RabbitMQ instances without having to worry about configuration changes.

What are the most effective tools to manage multiple apache httpd instances?

We have many Apache instances all over our intranet. Some instances run on the same machine. Some instances run on different machines.
I need a tool that can manage these instances from one central location.
Get CPU stats
Get Connection stats
Stop/start Apache instances
Get access to error log
I looked at webmin, but the documentation isn't too clear how it works. Without installing it I'd have trouble getting it to go.
Any recommendations?
I've never used it myself, but I've seen people with monitoring requirements be very happy with Cacti. Besides general health monitoring like CPU stats it has an extremely simple Apache stats plugin that might do what you need:
"Script to get the requests per second and the requests currently being processed from an Apache webserver."
Maybe you can put something together with that.
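If you end up rolling something yourself, the same two numbers can be read from Apache's mod_status page. This is only a sketch, assuming mod_status is enabled and its machine-readable page (/server-status?auto) is reachable from the central machine; the sample text below is illustrative, not captured from a real server:

```ruby
# Parse the machine-readable status text served by mod_status.
# Each line has the form "Key: value".
def parse_server_status(text)
  text.each_line.with_object({}) do |line, stats|
    key, value = line.chomp.split(': ', 2)
    stats[key] = value if value
  end
end

# Illustrative sample only.
sample = <<~STATUS
  Total Accesses: 1200
  Uptime: 600
  ReqPerSec: 2
  BusyWorkers: 3
  IdleWorkers: 7
STATUS

stats = parse_server_status(sample)
puts "requests/sec: #{stats['ReqPerSec']}, busy workers: #{stats['BusyWorkers']}"
```

The text itself could be fetched per instance with Net::HTTP.get(URI('http://some-host/server-status?auto')), where some-host is a placeholder for each Apache instance.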

New Relic API - difference between instances and hosts?

Referring to https://github.com/newrelic/newrelic_api for the New Relic API, I was wondering what was the difference between hosts and instances.
Basically, I get what an application is and what a server is (obviously). I would assume instances are instances of the application, i.e. if my app were running on Heroku, each instance would correspond to a dyno running my app. But then what is a host? And what's the difference between host and instance?
Thanks,
-Billy
UPDATE
Thanks for the answer!
So if I got this right, in the general case, the mapping between applications and instances is 1-to-n, i.e. each app can have 1 or more instances. Also, the mapping between instances and hosts is n-to-m, i.e. each instance can be running on at most one host (at any given time), but instances are distributed among available hosts. Similarly, hosts are distributed among servers (say, m-to-s). Is that it? (Apologies if this sounds like I'm stating the obvious, but I'm unfamiliar with the terminology they are using over at New Relic.)
If the above is correct, how can I get the instances - hosts and hosts - servers mappings from the API? I can see how to get the applications - instances and applications - hosts, but what about the other two?
Thanks again for your help!
A host (server) can run many instances of an application. Each process that responds to requests (e.g., a Unicorn worker) is an instance from the New Relic perspective. The host/instance distinction is roughly equivalent to the difference between an IP address and a port.
If you're using Heroku, New Relic treats the entire dyno grid as a single host/server, and each dyno as an instance.
Re: the updated question
A host is a machine or VM that applications run on, and each one can run N instances of the application.
A "server", for the purposes of the NR API, is an OS+hardware that's monitored by New Relic Server Monitoring. The NR application monitoring agent can also be running on a server monitored by the Server Monitoring agent. In that case, both the host and the server should report the same name to New Relic ("server01.example.com").
There isn't a way to get the instance-host or host-server mappings explicitly from the New Relic API. But in the case of server-host, the mapping is that they share the same name. You can probably infer the instance-host mapping from the instance names, too, since they will almost always contain the host name (and possibly also the port number).
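Inferring the instance-host mapping from the names could look roughly like this. The "host:port" name format is an assumption for illustration; check what your instances actually report before relying on it:

```ruby
# Group instance names by the host part, assuming names of the form "host:port".
def instances_by_host(instance_names)
  instance_names.group_by { |name| name.split(':', 2).first }
end

# Made-up instance names.
names = [
  'server01.example.com:8080',
  'server01.example.com:8081',
  'server02.example.com:8080'
]

mapping = instances_by_host(names)
mapping.each { |host, instances| puts "#{host}: #{instances.length} instance(s)" }
```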

Communicating between two processes on heroku (what port to use)

I have a Procfile like so:
web: bundle exec rails server -p $PORT
em: script/eventmachine
The em process fires up an eventmachine with start_server (port ENV['PORT']) and my web process occasionally needs to communicate with it.
My question is how does the web process know what port to communicate with it on? If I understand heroku correctly it assigns you a random port when the process starts up (and it can change if the ps is killed or restarted). Thanks!
According to Heroku documentation,
Two processes running on the same dyno can communicate over TCP/IP using whatever ports they want.
Two processes running on different dynos cannot communicate over TCP/IP at all. They need to use memcached, or the database, or one of the Heroku plugins, to communicate.
Processes are isolated and cannot communicate directly with each other.
http://www.12factor.net/processes
There are, however, a few other ways. One is to use a backing service such as Redis, or Postgres to act as an intermediary - another is to use FIFO to communicate.
http://en.wikipedia.org/wiki/FIFO
It is a good thing that your processes are isolated and share-nothing, but you do need to architect your application slightly differently to accommodate this.
I'm reading this while on my commute to work. So I haven't tried anything with it (sorry) but this looks relevant and potentially awesome.
https://blog.heroku.com/archives/2013/5/2/new_dyno_networking_model