I wrote a simple spider with scrapy-redis to build a distributed crawler. I found that when I start two spiders and then kill them both, only the 'dupefilter' key is left in Redis. When I restart the two spiders, they do not work at all. How can I restart the spiders after they have been accidentally killed or have crashed?
If you set the setting SCHEDULER_PERSIST to False, the dupefilter will be cleared when the spider finishes.
However, that won't be the case if the spider is killed (e.g. by hitting Ctrl+C twice).
You can add a flag to your spider to clear the dupefilter (or even the queue), for example:
if self.clear_all:
    self.crawler.engine.slot.scheduler.df.clear()
    self.crawler.engine.slot.scheduler.queue.clear()
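A minimal sketch of how such a flag could be wired up, assuming the scrapy-redis scheduler is configured (SCHEDULER = "scrapy_redis.scheduler.Scheduler", whose df and queue attributes both have a clear() method) and a -a clear_all=1 spider argument; the argument name and the spider_opened hook are my own choices, used because the engine slot and scheduler only exist once the spider has been opened:

import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):  # hypothetical spider name
    name = 'myspider'
    start_urls = ['https://example.com']

    def __init__(self, clear_all=False, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a clear_all=1 on the command line arrives here as the string '1'
        self.clear_all = bool(clear_all)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self, spider):
        # By now the scheduler is attached to the engine slot, so the
        # dupefilter (df) and the pending-request queue can be wiped.
        if self.clear_all:
            self.crawler.engine.slot.scheduler.df.clear()
            self.crawler.engine.slot.scheduler.queue.clear()

    def parse(self, response):
        pass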
CONCURRENT_REQUESTS = 50
CONCURRENT_REQUESTS_PER_DOMAIN = 50
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0
After checking How to increase Scrapy crawling speed?, my scraper is still slow and takes about 25 hours to scrape 12,000 pages (Google, Amazon). I use Crawlera. Is there more I can do to increase speed, and when CONCURRENT_REQUESTS = 50, does this mean I have 50 thread-like requests?
#How to run several instances of a spider
Your spider can take arguments from the terminal, like this: scrapy crawl spider -a arg=value.
Let's imagine you want to start 10 instances, because I guess you start with 10 urls (quote: input is usually 10 urls). The commands could look like this:
scrapy crawl spider -a arg=url1 &
scrapy crawl spider -a arg=url2 &
...
scrapy crawl spider -a arg=url10
Where & indicates that you launch the next command without waiting for the end of the previous one. As far as I know the syntax is the same on Windows and Ubuntu for this particular need.
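As an alternative to shell job control (this is not part of the original answer, just another way to get several instances), you can also start all the instances from one Python script with Scrapy's CrawlerProcess, passing each url as the same arg argument; a rough sketch, assuming the spider class shown in the next section lives at myproject.spiders.example (hypothetical path):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.example import SpiderExample  # hypothetical import path

urls = ['url1', 'url2', 'url3']  # placeholder urls, one per instance

process = CrawlerProcess(get_project_settings())
for url in urls:
    # each call schedules one spider instance; 'arg' matches the -a argument name
    process.crawl(SpiderExample, arg=url)
process.start()  # blocks until every crawl has finished

Note that all instances then share one process and one settings object, so the per-instance proxy trick shown further down still needs the separate-command approach.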
##Spider source code
To be able to launch it as I showed you, the spider can look like this:
import scrapy

class SpiderExample(scrapy.Spider):
    name = 'spider'

    def __init__(self, arg=None, *args, **kwargs):  # every argument here can be passed from the terminal with -a
        super().__init__(*args, **kwargs)
        self.arg = arg  # or self.start_urls = [arg], because it can answer your problem
        # ... any instructions you want, to initialize the variables you need in the process
        # by calling them with self.correspondingVariable in any method of the spider.

    def parse(self, response):  # will start with start_urls
        pass  # ... any instructions you want in the current parsing method
#To avoid being banned
As far as I read, you use Crawlera. Personally I have never used it; I never needed to pay for a service for this.
##One IP for each spider
The goal here is clear. As I told you in a comment, I use Tor and Polipo. Tor needs an HTTP proxy like Polipo or Privoxy to run in a scrapy spider correctly. Tor is tunneled through the HTTP proxy, and in the end the proxy works with a Tor IP. Where Crawlera can be interesting is that Tor's IPs are well known to some websites with a lot of traffic (and therefore a lot of robots going through them too...). These websites can ban Tor's IPs because they detect robot behaviour coming from the same IP.
Well, I don't know how Crawlera works, so I don't know how you can open several ports and use several IPs with Crawlera. Look into it yourself. In my case with Polipo I can run several instances tunneled (Polipo listens on the corresponding Tor SOCKS port) over several Tor circuits that I launch myself. Each Polipo instance has its own listening port. Then for each spider I can run the following:
scrapy crawl spider -a arg=url1 -s HTTP_PROXY=http://127.0.0.1:30001 &
scrapy crawl spider -a arg=url2 -s HTTP_PROXY=http://127.0.0.1:30002 &
...
scrapy crawl spider -a arg=url10 -s HTTP_PROXY=http://127.0.0.1:30010 &
Here, each port will listen with a different IP, so for the website these are different users. Then your spider can be more polite (look at the settings options) and your whole project can be faster. So there is no need to go through the roof by setting CONCURRENT_REQUESTS or CONCURRENT_REQUESTS_PER_DOMAIN to 300; it will make the website turn the wheel and generate unnecessary events like DEBUG: Retrying <GET https://www.website.com/page3000> (failed 5 times): 500 Internal Server Error.
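One caveat: Scrapy does not read an HTTP_PROXY setting by itself (its built-in HttpProxyMiddleware looks at environment variables and at request.meta['proxy']). A small downloader middleware, sketched below under the assumption that you keep the HTTP_PROXY setting name used in the commands above, can map that setting onto every request; the class name and module path are mine:

# myproject/middlewares.py (hypothetical path)
class SettingsProxyMiddleware:
    """Copies a custom HTTP_PROXY setting into request.meta['proxy']."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # picks up e.g. -s HTTP_PROXY=http://127.0.0.1:30001 from the command line
        return cls(crawler.settings.get('HTTP_PROXY'))

    def process_request(self, request, spider):
        if self.proxy_url:
            request.meta['proxy'] = self.proxy_url
        return None  # let the request continue through the middleware chain

Enable it in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SettingsProxyMiddleware': 100}.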
My personal preference is to set a different log file for each spider. It avoids exploding the amount of lines in the terminal, and lets me read the events of each process in a more comfortable text file. It is easy to add to the command: -s LOG_FILE=thingy1.log. It will easily show you if some urls were not scraped as you wanted.
##Random user agent
When I read that Crawlera is a smart solution because it uses the right user agent to avoid being banned... I was surprised, because you can actually do that yourself, like here. The most important aspect when you do it yourself is to choose popular user agents, so that you are overlooked among the large number of users of that same agent. Lists are available on some websites. Also, be careful to take desktop user agents and not those of other devices like mobiles, because the rendered page, I mean the source code, is not necessarily the same and you can lose the information you want to scrape.
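A minimal sketch of such a random user-agent downloader middleware, with placeholder user-agent strings you would replace with popular desktop ones from a real list (the class name and the strings are mine):

import random

class RandomUserAgentMiddleware:
    """Sets a random desktop user agent on every outgoing request."""

    # Placeholder strings; swap in current, popular desktop agents.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Example/1.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Example/1.0',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None

Enable it through DOWNLOADER_MIDDLEWARES in settings.py, the same way as any other downloader middleware.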
The main con of my solution is that it consumes your computer's resources. So your choice of the number of instances will depend on your computer's capacities (RAM, CPU...), and on your router's capacities too. Personally I am still using ADSL and, as I told you, 6000 requests were done in 20-30 minutes... But my solution does not consume more bandwidth than setting a crazy amount for CONCURRENT_REQUESTS.
There are many things that can affect your crawler's speed, but the most important ones are the CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN settings. Try setting them to some absurdly high number like 300 and go from there.
see: https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests
Other than that, ensure that you have AUTOTHROTTLE_ENABLED set to False and DOWNLOAD_DELAY set to 0.
Even if scrapy is limited by its internal behaviour, you can always fire up multiple instances of your crawler and scale your speed that way. A common atomic approach is to put urls/ids into a redis or rabbitmq queue and pop->crawl them in multiple scrapy instances.
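A rough sketch of that pattern with plain redis-py, assuming a Redis server on localhost and a list key named crawl:start_urls (both names are mine): one small script seeds the shared list, and every crawler instance pops urls from it until the list is empty, so the work splits itself across however many instances you start.

import redis

r = redis.Redis(host='localhost', port=6379)

# Producer side: seed the shared queue once.
for url in ['https://example.com/page1', 'https://example.com/page2']:
    r.lpush('crawl:start_urls', url)

# Consumer side: each instance pops until the list is empty.
# (With scrapy-redis, a RedisSpider does this popping for you via its redis_key.)
while True:
    url = r.rpop('crawl:start_urls')
    if url is None:
        break
    print('would schedule', url.decode())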
I have several projects running in scrapyd and they all use the same pipeline, so how can I add this pipeline to every scheduled spider as a default, without adding anything to the curl request, only having a flag in the default_scrapyd.conf file?
I am running Apache Nutch 1.12 in local mode.
I needed to edit the seed file to remove a sub domain and add a few new domains, and I want to restart the crawl from the start.
The problem is that whenever I restart the crawl, it re-starts from where I stopped it, which is in the middle of the sub domain I removed.
I stopped the crawl by killing the java process (kill -9) - I tried creating a .STOP file in the bin directory but that did not work, so I used kill.
Now whenever I restart the crawl I can see from the output that it is restarting where the job was stopped. I googled and came across stopping the hadoop job, but I don't have any hadoop files on my server - the only reference to hadoop is the jar files in the apache nutch directory.
How can I restart the crawl from the very start and not from where it was last stopped? Effectively I want to start a fresh crawl.
Many thanks
To start from scratch, just specify a different crawl dir or delete the existing one.
Removing entries from the seed list will not affect the content of the crawldb or the segments. What you could do to remove a domain without restarting from zero would be to add a pattern to the url filters so that the URLs get deleted from the crawldb during the update step or at least not selected during the generation step.
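For example, assuming the sub domain you removed were sub.example.com (a placeholder name), an exclude rule near the top of conf/regex-urlfilter.txt would look like this:

-^https?://sub\.example\.com/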
Problem Statement:
I use Ansible for spawning slave instances, SSHing into them, doing some tasks, and terminating them.
Suppose the playbook spawns 3 instances. While SSHing into the slave instances, if one's SSH fails, does Ansible go ahead with the ones that had a successful SSH, or does it fail the task altogether?
If not, then is there any way I can do it?
PS: I did explore the ssh_connection retries option. But here, by failed SSH, I mean an SSH that failed even after the retries.
By default Ansible will run your playbook for all specified hosts. If any of them fails, it will still continue running the playbook for the rest of the hosts, and in the end will create a playbook.retry file with the names of the failed hosts, which you can then re-run using:
ansible-playbook playbook.yml --limit @playbook.retry
(assuming your playbook's name is playbook.yml). Note that the re-run will run the whole playbook from the start, even if some of its tasks succeeded before, hence you should always try to make playbooks resilient to re-runs. Also note that even if you have multiple plays in your playbook, all referring to the same host, the first time the host fails Ansible will not try that host in subsequent plays at all.
There are some ways to change the default behaviour however:
You can, for example, abort the play for some tasks using any_errors_fatal: true, meaning a failure there will make Ansible stop execution on all hosts (this assumes you are using the default, linear strategy; with the free strategy other hosts might be at a different stage, meaning they might abort earlier / later than you'd expect).
Also, since Ansible 2.2 you can reset unreachable hosts between plays, meaning that even if a host failed in one of the plays, Ansible will still retry to run subsequent plays on it (the previous plays will still be marked as failed). You have to add meta: clear_host_errors to the play where you want to retry all of the previously unreachable hosts, as in the sketch below.
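A small playbook sketch showing where both keywords go (hosts and task names are placeholders): any_errors_fatal sits at the play level, and meta: clear_host_errors goes in as the first task of the play that should pick the previously unreachable hosts back up.

- hosts: slaves
  any_errors_fatal: true        # a failure on any host aborts this play for all hosts
  tasks:
    - name: do some critical setup
      ping:

- hosts: slaves
  tasks:
    - meta: clear_host_errors   # bring previously unreachable hosts back into play
    - name: continue with the remaining work
      ping: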
Is there a way to shut down a scrapy crawl from the pipeline? My pipeline processes urls and adds them to a list. When the list reaches a specified amount, I want to shut down the crawler. I know of raise CloseSpider(), but it only seems to work when called from the spider.
Yes, it is possible. You can call crawler.engine.close_spider(spider, reason).
There is an example in the CloseSpider extension
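A minimal sketch of such a pipeline, assuming you want to stop after max_urls items and that each item carries a url field (the attribute names, the limit, and the reason string are mine):

class UrlCollectorPipeline:
    """Collects urls from items and closes the spider once a limit is reached."""

    def __init__(self, max_urls=100):
        self.max_urls = max_urls
        self.urls = []

    def process_item(self, item, spider):
        self.urls.append(item.get('url'))
        if len(self.urls) >= self.max_urls:
            # spider.crawler exposes the running engine from inside a pipeline
            spider.crawler.engine.close_spider(spider, reason='url limit reached')
        return item

The shutdown is graceful, so a few in-flight items may still pass through the pipeline after the call.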