How to distribute spiders across the cluster using Scrapyd and ScrapydWeb?

I am working on a crawling project using Scrapy, and I need to distribute my spiders across different nodes in a cluster to make the process faster. I am using ScrapydWeb to manage it, and I have already configured two machines, one of them with ScrapydWeb up and both with Scrapyd up. The web app recognizes both and I can run my spider properly. The problem is that the crawl just runs in parallel (the same content is being fetched by both machines), whereas my purpose was to distribute it in order to minimize the crawling time.
Could anybody help me? Thank you in advance.

I don't think Scrapyd & ScrapydWeb offer the possibility of splitting one spider's work across different servers; they just run the same full spider on each. If you want to distribute the crawling you can either:
Run 1 spider only on 1 server
If you need actual distributed crawling (where the same spider runs across different machines without multiple machines parsing the same URL), you can look into Scrapy-Cluster
You can write custom code where you have one process generating the URLs to scrape on one side, put the found URLs in a queue (using Redis, for example), and have multiple servers pop URLs from this queue to fetch and parse the pages (see the sketch after this list)
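As a rough illustration of that last option, here is a minimal sketch of a shared Redis work queue, assuming the redis and requests Python packages; the host, queue name, and parsing logic are placeholders, not a definitive implementation:

    import redis
    import requests

    r = redis.Redis(host="redis-host", port=6379)  # placeholder: one shared Redis instance

    def enqueue(urls):
        # Producer side: push URLs to be crawled onto a shared list.
        for url in urls:
            r.lpush("crawl:queue", url)

    def worker():
        # Worker side: run this loop on every server that should fetch pages.
        while True:
            item = r.brpop("crawl:queue", timeout=30)
            if item is None:
                break  # queue stayed empty for 30 seconds; stop this worker
            _, url = item
            page = requests.get(url.decode("utf-8"), timeout=10)
            # ...parse page.text here and enqueue() any newly discovered URLs...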

I used Scrapy Cluster to solve the problem and I'm sharing my experience:
The Docker installation was hard for me to control and debug, so I tried the Cluster Quick-start instead, and it worked better.
I have five machines available in my cluster. I used one to host Apache Kafka as well as ZooKeeper, and another one for the Redis DB. It's important to make sure those machines accept external connections from the ones you are going to use for spidering.
Once these three components were properly installed and running, I installed Scrapy Cluster's requirements in a Python 3.6 environment. Then I configured a local settings file with the IP addresses of the hosts and made sure all offline and online tests passed.
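For illustration, the local settings file looked roughly like the sketch below. The setting names (REDIS_HOST, KAFKA_HOSTS, ZOOKEEPER_HOSTS) follow the Scrapy Cluster examples but may vary between components and versions, and the addresses are placeholders:

    # localsettings.py -- rough sketch; IPs and ports are placeholders
    REDIS_HOST = '10.0.0.2'            # machine hosting the Redis DB
    REDIS_PORT = 6379
    KAFKA_HOSTS = '10.0.0.1:9092'      # machine hosting Apache Kafka
    ZOOKEEPER_HOSTS = '10.0.0.1:2181'  # ZooKeeper on the same machine as Kafka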
With everything set up, I was able to run the first spider (the official documentation provides an example). The idea is that you create instances of your spider (you can, for example, use tmux to open 10 different terminal windows and run one instance in each). When you feed Apache Kafka a URL to be crawled, it's sent to a queue in Redis, from which your instances periodically pick up a new page to crawl.
If your spider generates more URLs from the one you passed initially, they go back to Redis and may be crawled by other instances. That's where you can see the power of this distribution.
Once a page is crawled, the result is sent to a Kafka topic.
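In case it helps, feeding a crawl request into Kafka can be done with a small script like the sketch below, assuming the kafka-python package; the topic and field names ("demo.incoming", "url", "appid", "crawlid") follow the Scrapy Cluster defaults as I remember them, so check your own settings before relying on them:

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka-host:9092",  # placeholder: the machine hosting Kafka
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # Send one crawl request; the spider instances will pick it up via Redis.
    producer.send("demo.incoming", {
        "url": "http://example.com",
        "appid": "testapp",
        "crawlid": "abc1234",
    })
    producer.flush()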
The official documentation is huge and you can find more details on the installation and setup.

Related

Two Logstash instances on same Docker container

I am wondering if there is a way two Logstash processes with separate configurations can be run in a single Docker container.
My setup has a Logstash process using a file as input and sending events to Redis, and from there to a second Logstash process and on to a custom HTTP process. So: Logstash --> Redis --> Logstash --> HTTP. I was hoping to keep the two Logstash instances and Redis in the same Docker container. I am still new to Docker and would highly appreciate any input or feedback on this.
This would be more complicated than it needs to be. It is much simpler in the Docker world to run three containers to do three things than to run one container that does them all. It is possible, though:
You need to run an init process in your container to control multiple processes, and launch that as your container's entry point. The init will have to know how to launch the processes you are interested in, both Logstash and Redis. phusion/baseimage provides an image with a good init system, but its launch scripts are based on runit and can be hard to pick up.
If you would rather run a single process per container, you can use a docker-compose file to launch all three containers and link them together.

Apache HTTP load balancing based on URL pattern

I have an Apache web server in front of 2 Tomcats, which are connected to the same MySQL backend database.
I need to load-balance the incoming requests between the two Tomcats based on a URL parameter named "projectid". For example, all even project IDs could be served by Tomcat 1 and odd ones by Tomcat 2.
This is required because a user may start jobs in a project on Tomcat 1 that Tomcat 2 won't be aware of, since these jobs are currently not stored in the database.
Is there a way to achieve this using mod-proxy-load-balancing?
I'm not aware of such a load-balancing algorithm already being present. However, keep in mind that the most common load-balancing outcome (especially when you have server-side state, as you obviously do) is a sticky session: you're only balancing the initial request. After that, all requests are typically directed to the same server.
I typically recommend against distributing the session data, as it adds an often unnecessary performance hit to each request, negating the improved performance you can get from clustering. This may differ in actual installations, though; it's just a first rule of thumb.
You might be able to create your own load-balancing algorithm with mod-proxy-load-balancing (you'll need to configure the algorithm in the config file), but I believe your time is better spent fixing your implementation, or implementing business-specific logic to check all cluster machines for running jobs.

Resque Workers from other hosts registered and active on my system

The Rails application I'm currently working on is hosted on Amazon EC2 servers. It uses Resque for running background jobs, and there are 2 such instances (a would-be production and a staging one). I've also mounted the Resque monitoring web app at the /resque route (on staging only).
Here is my question:
Why are there workers from multiple hosts registered in my staging system, and how can I avoid this?
Some additional details:
I see workers from apparently 3 different machines, but I only managed to identify 2 of them: the staging one (obviously) and the production one. The third has a different address format (it starts with domU) and I have no clue what it could be.
It looks like you're sharing a single Redis server across multiple Resque environments.
The best way to do this safely is to use separate Redis servers, or separate Redis databases or namespaces. The redis-namespace gem can be used with Resque to isolate each environment's Resque queues and worker data.
I can't really help you with what the unknown one is, but I had something similar happen when moving hosts and having DNS names change. The only way I found to clear out the old ones was to stop all workers on the machine, fire up IRB, require 'resque', and look at Resque.workers. This will list all the workers Resque knows about, which in your case will include about 20 bogus ones. You can then do:
Resque.workers.each { |worker| worker.unregister_worker }
This should prune all the not-really-there workers and get you back to a proper display of the real workers.

What are the most effective tools to manage multiple apache httpd instances?

We have many Apache instances all over our intranet. Some instances run on the same machine. Some instances run on different machines.
I need a tool that can manage these instances from one central location.
Get CPU stats
Get Connection stats
Stop/start Apache instances
Get access to error log
I looked at Webmin, but the documentation isn't too clear on how it works, and I'd have trouble getting it going without installing it.
Any recommendations?
I've never used it myself, but I've seen people with monitoring requirements be very happy with Cacti. Besides general health monitoring like CPU stats it has an extremely simple Apache stats plugin that might do what you need:
Script to get the requests per second and the requests currently being processed from an Apache webserver.
Maybe you can put something together with that.
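For reference, the numbers that script exposes (requests per second, requests being processed) presumably come from Apache's mod_status; a minimal Python sketch of the same idea, assuming mod_status is enabled and /server-status?auto is reachable (the URL is a placeholder), could look like:

    import requests

    STATUS_URL = "http://apache-host/server-status?auto"  # placeholder; requires mod_status

    def apache_stats():
        # The ?auto view returns plain "Key: value" lines.
        text = requests.get(STATUS_URL, timeout=5).text
        stats = dict(line.split(": ", 1) for line in text.splitlines() if ": " in line)
        return {
            "requests_per_sec": float(stats.get("ReqPerSec", 0)),
            "busy_workers": int(stats.get("BusyWorkers", 0)),
            "idle_workers": int(stats.get("IdleWorkers", 0)),
        }

    print(apache_stats())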

Can my requirements be met with JMX?

I am completely new to JMX. I have a specific requirement and wanted to know if it is possible to accomplish it within the scope of JMX.
Requirements:
I have a set of resources that includes many WebLogic, JBoss, and Tomcat instances running across many servers. I need a one-stop solution: a UI to monitor these resources and check their current status, and, if they are down, to start and stop them from that web page.
Is this possible using JMX?
You could use Nagios combined with check_jmx to monitor (create statistics) and perhaps trigger a restart of a resource. (I'm not sure if a restart can be triggered directly via JMX.)
Check out Jopr, http://www.jboss.org/jopr/
jmx4perl comes with a full-featured Nagios plugin, check_jmx4perl, for accessing JMX information. It comes with a set of preconfigured checks for various resources, currently for JBoss, Tomcat, and Jetty (more are in the pipeline).