Is there a way to shut down a Scrapy crawl from the pipeline? My pipeline processes URLs and adds them to a list. When the list reaches a specified size I want to shut down the crawler. I know about raise CloseSpider(), but it only seems to work when called from the spider.
Yes, it is possible. You can call crawler.engine.close_spider(spider, reason).
There is an example in the CloseSpider extension.
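For completeness, here is a minimal sketch of how that call could be used from a pipeline; the class name, the item's url field, and the MAX_URLS threshold are illustrative, not taken from your project:

class URLCollectorPipeline:
    """Collects URLs and closes the spider once a threshold is reached."""

    MAX_URLS = 100  # illustrative threshold

    def __init__(self, crawler):
        self.crawler = crawler
        self.urls = []

    @classmethod
    def from_crawler(cls, crawler):
        # Gives the pipeline access to the crawler (and thus the engine)
        return cls(crawler)

    def process_item(self, item, spider):
        self.urls.append(item["url"])
        if len(self.urls) >= self.MAX_URLS:
            # Same call the CloseSpider extension uses internally
            self.crawler.engine.close_spider(spider, reason="url_limit_reached")
        return item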
CONCURRENT_REQUESTS = 50
CONCURRENT_REQUESTS_PER_DOMAIN = 50
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0
After checking How to increase Scrapy crawling speed?, my scraper is still slow and takes about 25 hours to scrape 12000 pages (Google, Amazon). I use Crawlera. Is there more I can do to increase speed, and when CONCURRENT_REQUESTS = 50, does this mean I have 50 thread-like requests?
# How to run several instances of a spider
Your spider can take arguments from the terminal, like this: scrapy crawl spider -a arg=value.
Let's imagine you want to start 10 instances, because I guess you start with 10 URLs (quote: input is usually 10 urls). The commands could look like this:
scrapy crawl spider -a arg=url1 &
scrapy crawl spider -a arg=url2 &
...
scrapy crawl spider -a arg=url10
Here & means that the next command is launched without waiting for the previous one to finish. As far as I know, it's the same syntax on Windows and Ubuntu for this particular need.
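If you prefer to drive this from Python instead of the shell, a rough equivalent could look like this (the URL list is a placeholder, and it assumes scrapy is on the PATH and the script runs from the project directory):

import subprocess

# Placeholder start URLs; launch one scrapy process per URL without waiting.
urls = ["url1", "url2", "url10"]
procs = [
    subprocess.Popen(["scrapy", "crawl", "spider", "-a", f"arg={url}"])
    for url in urls
]
for proc in procs:
    proc.wait()  # block until every crawl has finished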
## Spider source code
To be able to launch it as shown above, the spider can look like this:
import scrapy

class spiderExample(scrapy.Spider):
    name = "spider"

    def __init__(self, arg=None, *args, **kwargs):  # anything passed with -a arrives here
        super().__init__(*args, **kwargs)
        self.arg = arg  # or self.start_urls = [arg], which answers your problem
        # ...initialize any variables you need in the process; they are then
        # available as self.correspondingVariable in any method of the spider

    def parse(self, response):  # called for the responses of start_urls
        ...  # whatever parsing you want to do here
# To avoid being banned
From what I read, you use Crawlera. Personally, I have never used it; I have never needed a paid service for this.
## One IP for each spider
The goal here is clear. As I told you in the comments, I use Tor and Polipo. Tor needs an HTTP proxy like Polipo or Privoxy to work correctly in a Scrapy spider: Tor is tunneled through the HTTP proxy, and in the end the proxy sends requests out with a Tor IP. Where Crawlera can be interesting is that Tor's exit IPs are well known to some high-traffic websites (which therefore see a lot of robots coming through them too...). Those websites can ban a Tor IP when they detect robot behaviour coming from the same address.
Well, I don't know how Crawlera works, so I don't know how you can open several ports and use several IPs with it; look into that yourself. In my case, with Polipo I can run several instances, each tunneled through a Tor circuit I launched myself (each Polipo listens on the corresponding Tor SOCKS port), and each Polipo instance has its own listening port. Then, for each spider, I can run the following:
scrapy crawl spider -a arg=url1 -s HTTP_PROXY=http://127.0.0.1:30001 &
scrapy crawl spider -a arg=url2 -s HTTP_PROXY=http://127.0.0.1:30002 &
...
scrapy crawl spider -a arg=url10 -s HTTP_PROXY=http://127.0.0.1:30010 &
Here, each port listens with a different IP, so to the website these look like different users. Your spiders can then be more polite (look at the settings options) and your whole project gets faster. So there is no need to go through the roof by setting CONCURRENT_REQUESTS or CONCURRENT_REQUESTS_PER_DOMAIN to 300; that just hammers the website and generates unnecessary events like DEBUG: Retrying <GET https://www.website.com/page3000> (failed 5 times): 500 Internal Server Error.
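One caveat: as far as I know, stock Scrapy picks up proxies from request.meta["proxy"] or the standard proxy environment variables rather than from a custom HTTP_PROXY setting, so if you pass -s HTTP_PROXY=... as above, a tiny downloader middleware along these lines (the class name is illustrative) can bridge the setting to each request:

class SettingProxyMiddleware:
    """Reads an HTTP_PROXY value from the settings and applies it per request."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get("HTTP_PROXY"))

    def process_request(self, request, spider):
        if self.proxy_url:
            # Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"]
            request.meta["proxy"] = self.proxy_url

Enable it in DOWNLOADER_MIDDLEWARES in settings.py (or via another -s flag).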
As a personal preference, I like to set a different log file for each spider. It avoids flooding the terminal with lines and lets me read each process's events in a more comfortable text file. It is easy to add to the command: -s LOG_FILE=thingy1.log. It also makes it easy to see whether some URLs were not scraped as you wanted.
## Random user agent
When I read that Crawlera is a smart solution because it uses the right user agent to avoid being banned, I was surprised, because you can actually do that yourself, like here. The most important aspect when you do it yourself is to choose popular user agents, so that you are overlooked among the large number of users of that same agent. Lists are available on various websites. Also, be careful to take desktop user agents and not those of other devices like mobiles, because the rendered page (I mean the source code) is not necessarily the same and you can lose the information you want to scrape.
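For illustration, here is a minimal random user-agent downloader middleware; the user-agent strings are truncated placeholders, so replace them with real, popular ones from such a list:

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",            # placeholder string
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",       # placeholder string
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Overwrite the User-Agent header of every outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

Enable it in DOWNLOADER_MIDDLEWARES in your settings.py.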
The main con of my solution is that it consumes your computer's resources. So your choice of the number of instances will depend on your computer's capacity (RAM, CPU...), and your router's capacity too. Personally, I am still on ADSL and, as I told you, 6000 requests were done in 20-30 minutes... But my solution does not consume more bandwidth than setting a crazy value for CONCURRENT_REQUESTS.
There are many things that can affect your crawler's speed, but the most important are the CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN settings. Try setting them to some absurdly high number like 300 and go from there.
see: https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests
Other than that, ensure that you have AUTOTHROTTLE_ENABLED set to False and DOWNLOAD_DELAY set to 0.
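Put together, the relevant part of settings.py would look something like this; the numbers are starting points from this thread, not universal recommendations:

# settings.py
CONCURRENT_REQUESTS = 300
CONCURRENT_REQUESTS_PER_DOMAIN = 300
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0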
Even if Scrapy is limited by its internal behaviour, you can always fire up multiple instances of your crawler and scale your speed that way. A common atomic approach is to put URLs/IDs into a Redis or RabbitMQ queue and pop->crawl them in multiple Scrapy instances.
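As a rough sketch of that queue-based approach (the Redis key and spider name are just illustrative), each Scrapy instance could pop URLs from a shared list until it is empty:

import redis
import scrapy

class QueueSpider(scrapy.Spider):
    name = "queue_spider"

    def start_requests(self):
        client = redis.Redis(host="localhost", port=6379)
        while True:
            url = client.lpop("crawl:start_urls")  # shared queue, illustrative key name
            if url is None:
                break  # queue drained, this instance is done
            yield scrapy.Request(url.decode(), callback=self.parse)

    def parse(self, response):
        ...  # your extraction logic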
Is there any way to inject new URLs to crawl from the command line, without stopping the topology and editing the proper files? I want to do that with Elasticsearch as the indexer.
It depends on what you use as a backend for storing the status of the URLs. If the URLs are stored in Elasticsearch in the status index, you won't need to restart the crawl topology. You can use the injector topology separately in local mode to inject the new URLs into the status index.
This is also the case with the SOLR or SQL modules, but not with MemorySpout + MemoryStatusUpdater, as that lives only within the JVM and nowhere else.
Which spout do you use?
I have several projects running in scrapyd and they all use the same pipeline. How can I add this pipeline to every scheduled spider as a default, without adding anything to the curl request, only by having a flag in the default_scrapyd.conf file?
I am running Apache Nutch 1.12 in local mode.
I needed to edit the seed file to remove a sub-domain and add a few new domains, and I want to restart the crawl from the start.
The problem is that whenever I restart the crawl, it restarts from where I stopped it, which is in the middle of the sub-domain I removed.
I stopped the crawl by killing the Java process (kill -9). I tried creating a .STOP file in the bin directory, but that did not work, so I used kill.
Now whenever I restart the crawl, I can see from the output that it is restarting where the job was stopped. I googled and came across advice about stopping the Hadoop job, but I don't have any Hadoop files on my server; the only references to Hadoop are the JAR files in the Apache Nutch directory.
How can I restart the crawl from the very start and not from where it was last stopped? Effectively, I want to start a fresh crawl.
Many thanks
To start from scratch, just specify a different crawl dir or delete the existing one.
Removing entries from the seed list will not affect the content of the crawldb or the segments. What you could do to remove a domain without restarting from zero is add a pattern to the URL filters so that its URLs get deleted from the crawldb during the update step, or at least not selected during the generate step.
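For illustration, such a deny pattern in conf/regex-urlfilter.txt could look like this (the domain is a placeholder); keep it above the final +. accept-everything rule:

# skip everything from the removed sub-domain
-^https?://removed\.example\.com/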
I wrote a simple spider with scrapy-redis to make distributed spiders. I found that when I start two spiders and then kill them both, the Redis queue is left with only the ‘dupefilter’ key. When I restart the two spiders, they do not work at all. So how can I restart the spiders if they were accidentally killed or crashed?
If you set the SCHEDULER_PERSIST setting to False, the dupefilter will be cleared when the spider finishes.
However, that won't be the case if the spider is killed (e.g. by hitting Ctrl+C twice).
You can add a flag to your spider to clear the dupefilter (or even the queue), for example:
# e.g. somewhere in your spider once the engine is running; clear_all could be a flag passed with -a
if self.clear_all:
    self.crawler.engine.slot.scheduler.df.clear()
    self.crawler.engine.slot.scheduler.queue.clear()
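Assuming the flag is exposed as a spider argument, a hypothetical invocation would be scrapy crawl myspider -a clear_all=1; keep in mind that values passed with -a arrive as strings, so convert them before testing the flag.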