What are the most effective tools to manage multiple apache httpd instances? - apache

We have many Apache instances all over our intranet. Some instances run on the same machine. Some instances run on different machines.
I need a tool that can manage these instances from one central location.
Get CPU stats
Get Connection stats
Stop/start Apache instances
Get access to error log
I looked at webmin, but the documentation isn't too clear how it works. Without installing it I'd have trouble getting it to go.
Any recommendations?

I've never used it myself, but I've seen people with monitoring requirements be very happy with Cacti. Besides general health monitoring like CPU stats it has an extremely simple Apache stats plugin that might do what you need:
Script to get the requests per second and the requests currently being processed from
an Apache webserver.
maybe you can put something together with that.

Related

How to distribute spiders across the cluster using Scrapyd and ScrapydWeb?

I am working in a crawling project, using Scrapy, and I need to distribute my spiders across different nodes in a cluster to make the process faster. I am using ScrapydWeb to manage it and I have already configured two machines, one of them with ScrapydWeb up and both with Scrapyd up. The Web App recognizes both and I can run my spider properly. The problem is that the crawling is running just in parallel (the content is being fetched by both machines), and my purpose was to do it in a distributed way to minimize the crawling time.
Could anybody help me? Thank you in advance.
I don't think Scrapyd & ScrapydWeb offer the possibility of running a spiders across different servers other than just fully running the same spider. If you want to distribute the crawling you can either:
Run 1 spider only on 1 server
If you need actual distributed crawling (where the same spider runs across different machines without multiple machines parsing the same url), you can look into Scrapy-Cluster
You can write custom code where you have 1 process generating the urls to scrape on one side, put the found urls in a queue (using Redis f.e.), and have multiple servers popping urls from this queue to fetch & parse the page
I used Scrapy Cluster to solve the problem and I'm sharing my experience:
Docker installation was hard for me to control and debug, so I tried the Cluster Quick-start and it worked better.
I have five machines available in my cluster and I used one to host the Apache Kafka, as well as the Zookeeper. I also had one for Redis DB. It's important to make sure those machines are available for external access from the ones you are going to use for spidering.
Once these three components were properly installed and running, I installed Scrapy Cluster's requirements in a python3.6 environment. Then, I configured a local settings file with the IP address for the hosts and made sure all online and offline tests passed.
Everything set up, I was able to run the first spider (the official documentation provides an example). The idea is that you create instances for your spider (you can, for example, use tmux to open 10 different terminal windows and run one instance at each). When you feed Apache Kafka with a URL to be crawled, it's sent to a queue at Redis, to where your instances will periodically look for a new page to crawl.
If your spider generates more URLs from the one you passed initially, they return to Redis to be possibly crawled by other instances. And that's where you can see the power of this distribution.
Once a page is crawled, the result is sent to a Kafka topic.
The official documentation is huge and you can find more details on the installation and setup.

Resque Workers from other hosts registered and active on my system

The Rails application I'm currently working on is hosted at Amazon EC2 servers. It's using Resque for running background jobs, and there are 2 such instances (would-be production and a stage). Also I've mounted Resque monitoring web app to the /resque route (on stage only).
Here is my question:
Why there are workers from multiple hosts registered within my stage system and how can I avoid this?
Some additional details:
I see workers from apparently 3 different machines, but only 2 of them I managed to identify - the stage(obviously) and the production. The third has another address format(starts with domU) and haven't any clue what it could be.
It looks like you're sharing a single Redis server across multiple resque server environments.
The best way to do this safely is to use separate Redis servers or separate Redis databases or namespaces. The Redis-namespace gem can be used with Resque to isolate each environments Resque queues and worker data.
I can't really help you with what the unknown one is, but I had something similar happen when moving hosts and having dns names change. The only way I found to clear out the old ones was to stop all workers on the machine, fire up IRB, require 'resque' and look at Resque.workers. This will list all the workers resque knows about, which in your case will include about 20 bogus ones. You can then do:
Resque.workers.each do {|worker| worker.unregister_worker}
This should prune all the not-really-there workers and get you back to a proper display of the real workers.

mass-restarting httpd on lots of EC2 instances

I am running a variable number of EC2 instances (CentOS 64) that contain an apache web server that caches a bunch of code in production mode.
Now every time I make some changes to the code (generally on a weekly basis) I have to log into each one of them instances and do a "su" then "service httpd restart"
Is there a way to automate this so that I can run a single command on one of the instances it would connect to all others and restart it? Getting really time consuming especially when the application has spawned some 20-30 instances on its own (happens on some days when we get high traffic)
Thanks!
Dancer's shell, dsh, is provided specifically to do this. No 'scripting' required. As #tix3 suggests, you should probably also convince sudo on those machines (configure /etc/sudoers using visudo) to configure them to accept your restart command.

Apache and the c10k

How is Apache in respect to handling the c10k problem under normal conditions ?
Say while running very small scripts with little data, or do I need to scale out if I use Apache?
In the background heavy lifting is done by a few servers running specialized software that processes the requests but I'd like to use Apache as a front. Is this a viable plan?
I consider Apache to be more of an origin server - running something like mod_php or mod_perl to generate the content and being smart about routing to the appropriate system.
If you are getting thousands of concurrent hits to the front of your site, with a mix of types of data (static and dynamic) being returned, you may find it useful to put a more optimised system in front of it though.
The classic post-optimisation problem with Apache isn't generating the dynamic content (or at least, that can be optimised for early in the process), but simply waiting for a slow client to be able to receive the bytes that are being sent. It can therefore be a significant advantage to put a reverse proxy, in the form of Squid or Nginx, in front of the servers to take over the 'spoon-feeding' of the slow network clients, while allowing the content production to happen at full speed, and at local network speeds - 100Mb/sec or even gigabit speeds - if it even has to traverse a network at all.
I'm assuming you've probably seen this data, but if not, it might give you some idea.
Guys, imagine that you are running web server with 10K connections (simultaneous). How could it be?
You've got many many connections per second
Dynamic content
Are you sure that your CPU can handle that many PHP sessions for example? I guess no, so why are you thinking about C10K problem? :D
Static content - small files
And still soo many connections? On single server? Probably you've got problems with networking/throughput too or you are future competitor of Google. Use lighttpd which addresses C10K problem and is stable - fly light. Using Apache for only static files for large sites is obvious.
Your clients are downloading large files for a large time - static content
ISO images, archives etc
If you are doing it via web server - FTP may be more appropriate.
Video streaming
Use lighttpd or specialized software. And still... What about other resources?
I am using Linux Virtual Server as load balancer in front of apache servers (with specific patches for LVS-NAT) and I am happy :) This string is an answer you want to hear.

Can my requirements be met with JMX?

I am completely new to JMX. I have a specific requirement and wanted to know if it is possible to accomplish within the scope of JMX.
Requirements:
I have a set of resources which include many weblogic instances, jBoss instances and Tomcat instances running across many servers. Now I need a one stop solution, UI to monitor these resources, check their current status and if they are down, I need to start and stop them from that webpage.
Is this possible using JMX?
You could use nagios combined with check_jmx to monitor (create statistics)
and may trigger a restart of a resource. (I'm not sure if can trigger a restart direct via JMX)
Check out Jopr, http://www.jboss.org/jopr/
jmx4perl comes with a full featured Nagios Plugin check_jmx4perl for access JMX information. It comes with a set of preconfigured check for various resources, currently for JBoss, Tomcat and Jetty (more are in the pipeline).