Scrapy: How to stop CrawlSpider after 100 requests

I would like to limit the number of pages CrawlSpider visits on a website.
How can I stop the Scrapy CrawlSpider after 100 requests?

I believe you can use the CloseSpider extension for that, together with the CLOSESPIDER_PAGECOUNT setting. According to the docs:
... specifies the maximum number of responses to crawl. If the spider
crawls more than that, the spider will be closed with the reason
closespider_pagecount
All you would need to do is set this in your settings.py:
CLOSESPIDER_PAGECOUNT = 100
If this doesn't suit your needs, another approach could be writing your own extension that uses Scrapy's stats module to keep track of the number of requests.
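For illustration, here is one way such an extension could look; the class name, the RESPONSE_LIMIT setting name, and the close reason string are hypothetical, not something from the answer or from Scrapy itself:

# hypothetical extension module, e.g. myproject/extensions.py
from scrapy import signals
from scrapy.exceptions import NotConfigured

class ResponseLimit:
    """Close the spider once a given number of responses has been received."""

    def __init__(self, crawler, limit):
        self.crawler = crawler
        self.limit = limit
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        limit = crawler.settings.getint('RESPONSE_LIMIT', 0)  # hypothetical setting name
        if not limit:
            raise NotConfigured
        ext = cls(crawler, limit)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        return ext

    def response_received(self, response, request, spider):
        self.count += 1
        self.crawler.stats.inc_value('responselimit/responses_seen')  # also track it in the stats
        if self.count >= self.limit:
            self.crawler.engine.close_spider(spider, 'response_limit_reached')

It would then be enabled through the EXTENSIONS setting, with RESPONSE_LIMIT = 100 in settings.py.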

Related

How to scrape the entire Web with Scrapy (all domains are allowed) on a domain per domain basis?

I am new to Scrapy framework and I would like to create a spider (or multiple spiders if this is preferable) that would scrape the entire Web on a domain per domain basis. In other words, I would like the spider to process all the pages under a specific domain looking for specific keywords in each page, then output word counts for each keyword for that domain, and then move to the next domain.
According to the documentation on the official website, Scrapy crawls in depth-first order:
By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases.
Is it easy to implement a domain per domain crawl with the Scrapy framework? What is the best way to achieve this?
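As a side note, the same part of the documentation quoted above also explains how to switch Scrapy to breadth-first order purely via settings; a minimal sketch of those documented values:

# settings.py - switch the scheduler queues to FIFO so pending requests are crawled breadth-first
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'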

How to crawl websites without getting blocked?

I crawl websites very often, at a rate of hundreds of requests per hour.
How can I make the crawler's behavior more human-like?
How can I avoid being flagged by bot detection?
I am currently crawling the site with Selenium and Chrome.
Kindly suggest.
Well, you will have to pause the script between loops.
import time
time.sleep(1)   # pause for 1 second
time.sleep(N)   # pause for N seconds between loop iterations
So, it could hypothetically work like this.
import json
import time
from string import ascii_lowercase

import pandas as pd
import requests

alldata = []
for c in ascii_lowercase:
    # query the endpoint once per letter of the alphabet
    response = requests.get('https://reservia.viarail.ca/GetStations.aspx?q=' + c)
    json_data = response.text.encode('utf-8', 'ignore')
    df = pd.DataFrame(json.loads(json_data), columns=['sc', 'sn', 'pv'])  # etc.
    time.sleep(3)  # pause between requests so the crawl looks less aggressive
    alldata.append(df)
Or, look for an API to grab data from the URL you are targeting. You didn't post an actual URL, so it's impossible to say for sure if an API is exposed or not.
There are a lot of ways that sites can detect that you are trying to crawl them. The easiest is probably your IP address: if you make requests too quickly from the same IP, you might get blocked. You can introduce (random) delays into your script to try to appear slower.
To keep going as fast as possible, you will have to use different IP addresses. There are many proxy and VPN services that you can use to accomplish this.
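As a rough sketch of the proxy idea (the proxy URL below is a placeholder, not something from the answer), with the requests library it would look like:

import requests

# placeholder proxy endpoint - substitute the address your proxy/VPN provider gives you
proxies = {
    'http': 'http://user:pass@proxy.example.com:8080',
    'https': 'http://user:pass@proxy.example.com:8080',
}
response = requests.get('https://example.com', proxies=proxies, timeout=30)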

Which is the better way to use Scrapy to crawl 1000 sites?

I'd like to hear the differences between three approaches to using Scrapy to crawl 1000 sites.
For example, say I want to scrape 1000 photo sites. They almost all have the same structure: one kind of photo list page and another kind of big photo page, but the HTML of these list and photo description pages will not all be the same.
Another example: I want to scrape 1000 WordPress blogs, only the blogs' articles.
The first approach is exploring the entire 1000 sites using one Scrapy project.
The second is having all 1000 sites under the same Scrapy project, all items in items.py, with each site having its own spider.
The third is similar to the second, but having one spider for all the sites instead of separating them.
What are the differences, and which do you think is the right approach? Is there any other, better approach I've missed?
I had 90 sites to pull from, so it wasn't a great option to create one crawler per site. The idea was to be able to run in parallel. I also split the work so that pages with similar formats were handled in one place.
So I ended up with 2 crawlers:
Crawler 1 - URL extractor. This would extract all detail-page URLs from the top-level listing pages into a file (or files).
Crawler 2 - Fetch details.
This would read from the URL file and extract item details.
This allowed me to fetch the URLs first and estimate the number of threads I might need for the second crawler.
Since each crawler was working on specific page format, there were quite a few functions I could reuse.
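A minimal sketch of that split in Scrapy, with entirely hypothetical spider names, selectors, and file names, could look like this:

import json
import scrapy

class UrlExtractorSpider(scrapy.Spider):
    """Crawler 1: collect detail-page URLs from the listing pages."""
    name = 'url_extractor'                              # hypothetical name
    start_urls = ['https://example.com/listing']        # placeholder listing page

    def parse(self, response):
        for href in response.css('a.detail::attr(href)').getall():   # placeholder selector
            yield {'url': response.urljoin(href)}       # export with: scrapy crawl url_extractor -o urls.jl

class DetailSpider(scrapy.Spider):
    """Crawler 2: read the URL file and extract item details."""
    name = 'fetch_details'                              # hypothetical name

    def start_requests(self):
        with open('urls.jl') as f:                      # file produced by crawler 1
            for line in f:
                yield scrapy.Request(json.loads(line)['url'], callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}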

scrapy CrawlSpider: crawl policy / queue questions

I started with Scrapy a few days ago and learned about scraping particular sites, i.e. the dmoz.org example; so far it's fine and I like it. As I want to learn about search engine development, I aim to build a crawler (plus storage, an indexer, etc.) for a large number of websites of any "color" and content.
So far I have also tried depth-first-order and breadth-first-order crawling.
At the moment I use just one Rule; I set some paths and some domains to skip.
Rule(SgmlLinkExtractor(deny=path_deny_base, deny_domains=deny_domains),
     callback='save_page', follow=True),
I have one pipeline, a MySQL storage to store the URL, body, and headers of the downloaded pages, done via a PageItem with these fields.
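For context, such an item would just be a plain Item declaration; a sketch based on the fields the question mentions (the actual class is not shown in the question):

import scrapy

class PageItem(scrapy.Item):
    # fields described in the question: URL, body and headers of a downloaded page
    url = scrapy.Field()
    body = scrapy.Field()
    headers = scrapy.Field()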
My questions for now are:
Is it fine to use an Item for simply storing pages?
How does the spider check the database to see whether a page has already been crawled (in the last six months, say)? Is that built in somehow?
Is there something like a blacklist for useless domains, i.e. placeholder domains, link farms, etc.?
There are many other issues, like storage, but I guess I'll stop here. Just one more general search engine question:
Is there a way to obtain crawl result data from other professional crawlers? Of course it would have to be done by shipping hard disks, otherwise the data volume would be about the same as if I crawled it myself (compression left aside).
I will try to answer only two of your questions:
Is it fine to use an Item for simply storing pages?
AFAIK, Scrapy doesn't care what you put into an Item's fields. Only your pipeline will be dealing with them.
How does the spider check the database to see whether a page has already been crawled (in the last six months, say)? Is that built in somehow?
Scrapy has duplicate-filtering middleware, but it only filters duplicates within the current session. You have to manually prevent Scrapy from re-crawling sites you crawled six months ago.
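One manual way to do that (a sketch; load_recent_urls is a placeholder for a query against your own MySQL table, not something Scrapy provides) is to load the recently crawled URLs up front and skip them in the spider:

import scrapy

def load_recent_urls():
    """Placeholder: query your own storage for URLs crawled in the last six months."""
    return set()

class ArchiveSpider(scrapy.Spider):
    name = 'archive'                                    # hypothetical spider name
    start_urls = ['https://example.com/']               # placeholder start page

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen = load_recent_urls()                  # URLs already crawled recently

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            if url not in self.seen:                    # skip pages crawled in the last six months
                yield scrapy.Request(url, callback=self.parse)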
As for questions 3 and 4, I didn't understand them.

Best web graph crawler for speed?

For the past month I've been using Scrapy for a web crawling project I've begun.
This project involves pulling down the full document content of all web pages in a single domain name that are reachable from the home page. Writing this using Scrapy was quite easy, but it simply runs too slowly. In 2-3 days I can only pull down 100,000 pages.
I've realized that my initial notion, that Scrapy isn't meant for this type of crawl, is proving to be true.
I've begun to focus my sights on Nutch and Methabot in hopes of better performance. The only data that I need to store during the crawl is the full content of the web page and preferably all the links on the page (but even that can be done in post-processing).
I'm looking for a crawler that is fast and employs many parallel requests.
This may be the fault of the server, not Scrapy. The server may not be as fast as you want, or it (or the webmaster) may detect crawling and limit the speed for this connection/cookie.
Do you use a proxy? This may slow down crawling too.
This may also be Scrapy wisdom: if you crawl too intensively, you may get banned by the server. For my hand-written C++ crawler I artificially set a limit of 1 request per second, and even that speed is enough for 1 thread (1 req * 60 secs * 60 minutes * 24 hours = 86,400 requests/day). If you are interested, you may write an email to whalebot.helmsman {AT} gmail.com.
Scrapy allows you to determine the number of concurrent requests and the delay between the requests in its settings.
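For example (the numbers below are illustrative, not values from the answer):

# settings.py
CONCURRENT_REQUESTS = 32             # total concurrent requests the downloader will perform
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # concurrent requests per target domain
DOWNLOAD_DELAY = 0.25                # seconds to wait between requests to the same site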
Do you know where the bottleneck is? As whalebot.helmsman pointed out, the limit may not be in Scrapy itself, but in the server you're crawling.
You should start by finding out whether the bottleneck is the network or CPU.