I'm using a Scrapy CrawlSpider to crawl a website. However, the current code redirects me and does not crawl from the URL I want.
URL:
http://www.example.com/book/diff/
where diff can be anything except /.
To add on, I only want to crawl URLs that match that pattern.
Here is my current code:
name = "testing"
allowed_domains = ['example.com']
start_urls = [
'http://www.example.com/book/',
]
rules = (Rule(LinkExtractor(allow=(r'^http://www.example.com/book/[^/]*/$')),
              callback='parse_page', follow=True),)
Try loosening the allow pattern so that it only requires the /book/ prefix:
rules = (Rule(LinkExtractor(allow=(r'^http://www.example.com/book/')),
              callback='parse_page', follow=True),)
This should be enough.
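For context, here is a minimal runnable sketch that places that rule in a complete CrawlSpider (the spider name, callback name and example.com URLs come from the question; the body of parse_page is my own assumption):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BookSpider(CrawlSpider):
    name = "testing"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/book/']

    # allow patterns are regexes tried against the absolute URL; swap in the
    # stricter r'^http://www\.example\.com/book/[^/]+/$' to keep exactly one
    # non-empty path segment after /book/
    rules = (
        Rule(LinkExtractor(allow=(r'^http://www.example.com/book/',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}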
I am writing a scraper with Scrapy within a larger project, and I'm trying to keep it as minimal as possible (without creating a whole Scrapy project). This code downloads a single URL correctly:
import scrapy
from scrapy.crawler import CrawlerProcess

class WebsiteSpider(scrapy.Spider):
    """
    https://docs.scrapy.org/en/latest/
    """
    custom_settings = {'DOWNLOAD_DELAY': 1, 'DEPTH_LIMIT': 3}
    name = 'my_website_scraper'

    def parse(self, response):
        html = response.body
        url = response.url
        # process page here

process = CrawlerProcess()
process.crawl(WebsiteSpider, start_urls=['https://www.bbc.co.uk/'])
process.start()
How can I enrich this code to keep scraping the links found in the start URLs (with a maximum depth of, for example, 3)?
Try this.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain

class WebsiteSpider(Spider):
    name = 'bbc.co.uk'
    allowed_domains = ['.bbc.co.uk']
    start_urls = ['https://www.bbc.co.uk/']
    # refresh_urls = True  # For debugging. If refresh_urls = True, start_urls will be crawled again.

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lstA = doc.listA(url=url["url"])  # Get link data for subsequent crawling
        data = [{"title": doc.title.text}]  # Get the target data
        return {"Urls": lstA, "Data": data}  # Return the data to the framework

SimplifiedMain.startThread(WebsiteSpider())  # Start crawling
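If you would rather stay with plain Scrapy instead of simplified_scrapy, one possible sketch (my own, not part of the answer above) is to follow the links found on each page from parse and let DEPTH_LIMIT cap the recursion; the selectors and settings here are assumptions:

import scrapy
from scrapy.crawler import CrawlerProcess

class WebsiteSpider(scrapy.Spider):
    name = 'my_website_scraper'
    custom_settings = {'DOWNLOAD_DELAY': 1, 'DEPTH_LIMIT': 3}

    def parse(self, response):
        # process the page here, e.g. collect the title
        yield {'url': response.url, 'title': response.css('title::text').get()}
        # follow every link on the page; DEPTH_LIMIT stops the recursion at depth 3
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

process = CrawlerProcess()
process.crawl(WebsiteSpider, start_urls=['https://www.bbc.co.uk/'])
process.start()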
I'm looking to adapt this tutorial (https://medium.com/better-programming/a-gentle-introduction-to-using-scrapy-to-crawl-airbnb-listings-58c6cf9f9808) to scrape this site of tiny home listings: https://tinyhouselistings.com/.
The tutorial uses the request URL to get a very complete and clean JSON file, but does so for the first page only. It seems that looping through the 121 pages of my tinyhouselistings request URL should be pretty straightforward, but I have not been able to get anything to work. The tutorial does not loop through the pages of the request URL, but rather uses Scrapy Splash, run within a Docker container, to get all the listings. I am willing to try that, but I just feel like it should be possible to loop through this request URL.
This outputs only the first page of the tinyhouselistings request URL for my project:
import scrapy

class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    allowed_domains = ['tinyhouselistings.com']
    start_urls = ['http://www.tinyhouselistings.com']

    def start_requests(self):
        url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page=1'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        _file = "tiny_listings.json"
        with open(_file, 'wb') as f:
            f.write(response.body)
I've tried this:
class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    allowed_domains = ['tinyhouselistings.com']
    start_urls = ['']

    def start_requests(self):
        url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page='
        for page in range(1, 121):
            self.start_urls.append(url + str(page))
        yield scrapy.Request(url=start_urls, callback=self.parse)
But I'm not sure how to then pass start_urls to parse so that each response gets written to the JSON file at the end of the script.
Any help would be much appreciated!
Remove allowed_domains = ['tinyhouselistings.com'], because requests to thl-prod.global.ssl.fastly.net will otherwise be filtered out by Scrapy's offsite middleware.
Since you are using the start_requests method, you do not need start_urls; use one or the other.
import json
import scrapy

class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    listings_url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page={}'

    def start_requests(self):
        page = 1
        yield scrapy.Request(url=self.listings_url.format(page),
                             meta={"page": page},
                             callback=self.parse)

    def parse(self, response):
        resp = json.loads(response.body)
        for ad in resp["listings"]:
            yield ad

        page = int(response.meta['page']) + 1
        # <= so the last page is also fetched
        if page <= int(resp['meta']['pagination']['page_count']):
            yield scrapy.Request(url=self.listings_url.format(page),
                                 meta={"page": page},
                                 callback=self.parse)
From the terminal, run the spider as follows to save the scraped data to a JSON file:
scrapy crawl tinyhouselistings -o output_file.json
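If you would rather not pass -o every time, recent Scrapy versions (2.1+) also let the spider declare its own output feed via the FEEDS setting; a sketch of that, assuming the spider from the answer above (the file name is just an example):

import scrapy

class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    # equivalent to running "scrapy crawl tinyhouselistings -o tiny_listings.json"
    custom_settings = {'FEEDS': {'tiny_listings.json': {'format': 'json'}}}
    # ... start_requests / parse exactly as in the answer above ...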
I have the following code in a scrapy spider:
from scrapy import Spider, Request

class ContactSpider(Spider):
    name = "contact"
    # allowed_domains = ["http://www.domain.com/"]
    start_urls = [
        "http://web.domain.com/DECORATION"
    ]
    BASE_URL = "http://web.domain.com"

    def parse(self, response):
        links = response.selector.xpath('//*[contains(@class, "MAIN")]/a/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield Request(absolute_url, headers=headers, callback=self.second)
I'm surprised there is not a simpler way in Scrapy to follow links rather than building each absolute_url. Is there a better way to do this?
For absolute URLs you can use urlparse.urljoin; Response already has a shortcut for that via response.urljoin(link). So your code can easily be replaced by:
def parse(self, response):
    links = response.selector.xpath('//*[contains(@class, "MAIN")]/a/@href').extract()
    for link in links:
        yield Request(response.urljoin(link), headers=headers, callback=self.second)
You can also use Scrapy's LinkExtractor, which extracts links according to some rules and manages all of the joining automatically.
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # restrict_xpaths should select the region that contains the links,
    # not the @href attributes themselves
    le = LinkExtractor(restrict_xpaths='//*[contains(@class, "MAIN")]')
    links = le.extract_links(response)
    for link in links:
        yield Request(link.url, headers=headers, callback=self.second)
For a more automated crawling experience, Scrapy has CrawlSpider, which uses a set of rules to extract and follow links on each page. You can read more about it here: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
The docs have some examples of it as well.
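For completeness, here is a small CrawlSpider sketch in the spirit of those docs, reusing the XPath region from the question (the spider name and the fields yielded in parse_item are my own assumptions):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ContactCrawlSpider(CrawlSpider):
    name = 'contact_crawl'
    allowed_domains = ['web.domain.com']
    start_urls = ['http://web.domain.com/DECORATION']

    # restrict link extraction to the "MAIN" region; CrawlSpider joins and
    # follows the extracted links automatically
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class, "MAIN")]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url}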
I need to download a web page that makes intensive use of AJAX. Currently, I am using Scrapy with AJAX crawling enabled. After I write out the response and open it in a browser, there are still some requests initiated. I am not sure whether I am right that the rendered response only includes the first-level requests. So, how could we let Scrapy include all sub-requests in one response?
In this case, 72 requests are sent when opening the page online, versus 23 requests when opening the saved copy offline.
Really appreciate it!
Here are the screenshots of the requests sent before and after the download:
[screenshot: requests sent before download]
[screenshot: requests sent after download]
Here is the code:
class SeedinvestSpider(CrawlSpider):
    name = "seedinvest"
    allowed_domains = ["seedinvest.com"]
    start_urls = (
        'https://www.seedinvest.com/caplinked/bridge',
    )

    def parse_start_url(self, response):
        item = SeedinvestDownloadItem()
        item['url'] = response.url
        item['html'] = response.body
        yield item
The code is as follows:
class SeedinvestSpider(CrawlSpider):
    name = "seedinvest"
    allowed_domains = ["seedinvest.com"]
    start_urls = (
        'https://www.seedinvest.com/startmart/pre.seed',
    )

    def parse_start_url(self, response):
        item = SeedinvestDownloadItem()
        item['url'] = response.url
        item['html'] = response.body
        yield item
I've got a Scrapy spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = "spidermaster"
    allowed_domains = ["www.test.com"]
    start_urls = ["http://www.test.com/"]

    rules = [Rule(SgmlLinkExtractor(allow=()),
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
            ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
What I'm trying to do is crawl the whole website except what is under a specific path.
For example, I want to crawl all of the test website except www.test.com/too_much_links.
Thanks in advance
I usually do it in this way:
ignore = ['too_much_links', 'many_links']

rules = [Rule(SgmlLinkExtractor(allow=(), deny=ignore), follow=True),
         Rule(SgmlLinkExtractor(allow=(), deny=ignore), callback='parse_item'),
        ]
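Note that the deny entries are regular expressions matched against the absolute URL, so 'too_much_links' excludes any URL containing that substring. SgmlLinkExtractor has been removed from newer Scrapy versions; here is a sketch of the same idea with the current LinkExtractor (the deny patterns are the ones from this answer, the rest mirrors the question's spider):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'spidermaster'
    allowed_domains = ['www.test.com']
    start_urls = ['http://www.test.com/']

    # deny patterns are regexes matched against the absolute URL
    ignore = ['too_much_links', 'many_links']
    rules = [
        Rule(LinkExtractor(deny=ignore), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)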