This is my first spider code. When I executed this code in my cmd. log shows that the urls are not even getting crawled and there were not DEBUG message in them.
Can't be able to find any solution to this problem anywhere.. I am not able to understand what is wrong. can somebody help me with this.
My code:
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes_spider"
def start_request(self):
urls = ["http://quotes.toscrape.com/page/1/",
"http://quotes.toscrape.com/page/2/",
"http://quotes.toscrape.com/page/3/"
]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
page = response.url.split("/")[-2]
filename = 'quotes-%s.html'% page
with open(filename, 'wb') as f:
f.write(response.body)
self.log('Saved file %s' % filename)
Log:
2021-06-19 23:19:01 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: my_scrapy)
2021-06-19 23:19:01 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect
1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.0 (tags/v3.9.0:9cf6752, Oct 5
2020, 15:34:40) [MSC v.1927 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021),
cryptography 3.4.7, Platform Windows-10-10.0.19041-SP0
2021-06-19 23:19:01 [scrapy.utils.log] DEBUG: Using reactor:
twisted.internet.selectreactor.SelectReactor
2021-06-19 23:19:01 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_scrapy',
'NEWSPIDER_MODULE': 'my_scrapy.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['my_scrapy.spiders']}
2021-06-19 23:19:01 [scrapy.extensions.telnet] INFO: Telnet Password: 1a9440bbf933d074
2021-06-19 23:19:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-06-19 23:19:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-06-19 23:19:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-06-19 23:19:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-06-19 23:19:02 [scrapy.core.engine] INFO: Spider opened
2021-06-19 23:19:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2021-06-19 23:19:02 [scrapy.extensions.telnet] INFO: Telnet console listening on
127.0.0.1:6023
2021-06-19 23:19:02 [scrapy.core.engine] INFO: Closing spider (finished)
2021-06-19 23:19:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.008228,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 6, 19, 17, 49, 2, 99933),
'log_count/INFO': 10,
'start_time': datetime.datetime(2021, 6, 19, 17, 49, 2, 91705)}
2021-06-19 23:19:02 [scrapy.core.engine] INFO: Spider closed (finished)
Note: As I do not have 50 reputation to comment that's why I am answering here.
The problem is in function naming, your function should be def start_requests(self) instead of def start_request(self).
The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs. But, in your case it never gets into that function due to which the requests are never made for those URLs.
Your code after small change
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes_spider"
def start_requests(self):
urls = ["http://quotes.toscrape.com/page/1/",
"http://quotes.toscrape.com/page/2/",
"http://quotes.toscrape.com/page/3/"
]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
page = response.url.split("/")[-2]
filename = 'quotes-%s.html'% page
with open(filename, 'wb') as f:
f.write(response.body)
self.log('Saved file %s' % filename)
Related
I am trying to learn scrapy and splash to scrape efficiently from the web. I have installed scrapy, scrapy-splash and have splash running in a docker container. My code is below
import scrapy
from scrapy.crawler import CrawlerProcess
from pathlib import Path
from scrapy_splash import SplashRequest
max_price = "110000"
min_price = "65000"
region_code = "5E430"
class QuotesSpider(scrapy.Spider):
name = "propertySearch"
def start_requests(self):
url = "http://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%" + region_code + "&minBedrooms=2&maxPrice=" + max_price + "&minPrice=" + min_price + "&propertyTypes=detached" + \
"%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords="
yield SplashRequest(url=url, callback=self.parse, args={'wait': 3})
def parse(self, response):
work_path = "C:/Users/Cristi/Desktop/Scrapy_ROI_work_area/"
no_of_pages = response.xpath('//span[#class = "pagination-pageInfo"]')
with open(Path(work_path, "test.txt"), 'wb') as f:
f.write(response.body)
# with open(Path(work_path, "extract.txt"), 'wb') as g:
# g.write(no_of_pages)
self.log('Saved file test.txt')
process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()
When I run it the console outputs the below if it is of any use.
2020-06-07 00:06:32 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-06-07 00:06:32 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:20:19) [MSC v.1925 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-06-07 00:06:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-06-07 00:06:32 [scrapy.crawler] INFO: Overridden settings:
{}
2020-06-07 00:06:32 [scrapy.extensions.telnet] INFO: Telnet Password: 162ccfed8b528ac9
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-06-07 00:06:32 [scrapy.core.engine] INFO: Spider opened
2020-06-07 00:06:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-07 00:06:32 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-07 00:06:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=> from <GET http://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=>
2020-06-07 00:06:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=> (referer: None)
2020-06-07 00:06:33 [propertySearch] DEBUG: Saved file test.txt
2020-06-07 00:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-07 00:06:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 994,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 77255,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 0.429054,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 6, 6, 23, 6, 33, 185514),
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 6, 6, 23, 6, 32, 756460)}
2020-06-07 00:06:33 [scrapy.core.engine] INFO: Spider closed (finished)
The text file that it outputs only contains the bare HTML with none of the data that JavaScript would bring in.
I read the documentation closely and found that first I needed to follow the scrapy project set-up process from here closely and then found very useful info here about how to run my script from my IDE. It's working now.
When I execute the following code I am able to extract only one link instead of all the links in that specific page of website.
from scrapy import Spider
from scrapy.http import Request
class BooksSpider(Spider):
name = 'books'
allowed_domains = ['books.toscrape.com/']
start_urls= ["http://books.toscrape.com"]
def parse(self, response):
books = response.xpath("//h3/a/#href").extract()
for book in books:
absolute_url = response.urljoin(book)
yield Request(absolute_url), callback=self.parse_page)
def parse_page(self, response):
pass
This is the output which extracts only the 1st link of the website "books.toscrape.com". Can anyone help me to understand what is mistake here or is this due to some system error. This is too much frustrating now as all loops and everything else is fine. There is some issue with YIELD I guess and how can I handle this error.:
2020-05-26 12:09:23 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrap_book)
2020-05-26 12:09:23 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 22:45:29) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-05-26 12:09:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-26 12:09:23 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrap_book',
'NEWSPIDER_MODULE': 'scrap_book.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrap_book.spiders']}
2020-05-26 12:09:23 [scrapy.extensions.telnet] INFO: Telnet Password: 7b1edefe67af4658
2020-05-26 12:09:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-05-26 12:09:25 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-26 12:09:25 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-26 12:09:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-26 12:09:26 [scrapy.core.engine] INFO: Spider opened
2020-05-26 12:09:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-26 12:09:26 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-05-26 12:09:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://books.toscrape.com/robots.txt> (referer: None)
2020-05-26 12:09:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com> (referer: None)
2020-05-26 12:09:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'books.toscrape.com': <GET http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/ind 'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 5, 26, 6, 24, 26, 117907)}
2020-05-26 12:09:28 [scrapy.core.engine] INFO: Spider closed (finished)
As Scrapy says links are filtered because of your allowed_domains:
DEBUG: Filtered offsite request to 'books.toscrape.com'
Change your code to allowed_domains = ['books.toscrape.com'] and it should work fine.
In addition, in the code you posted there is an error in the yield because there is one bracket too much right behind absolute_url. It should be: yield Request(absolute_url, callback=self.parse_page)
I'm trying to get the latitude and longitude of different cities. The name of cities are stored in a JSON file. Here is my code:
import scrapy
import json
with open('C:/Users/coppe/tutorial/cities.json') as json_file:
cities = json.load(json_file)
class communes_spider(scrapy.Spider):
name = "geo"
start_urls = ['https://www.latlong.net/']
def parse(self, response):
for city in cities:
return scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)
def get_geo(self, response):
yield {response.css('span.coordinatetxt::text').get()}
The objective is to loop through the JSON file and for each city send a resquest to a form from the url "https://www.latlong.net/". However nothing is prompting from this request. Is this a bad way to make loop ? Should I treat the JSON file inside the class ?
Log:
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17763-SP0
2019-04-01 16:27:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-01 16:27:17 [scrapy.core.engine] INFO: Spider opened
2019-04-01 16:27:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-01 16:27:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/robots.txt> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.latlong.net/> (referer: https://www.latlong.net/)
2019-04-01 16:27:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.latlong.net/>
{'latlong': '0,0'}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-01 16:27:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 874,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 29252,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 1, 14, 27, 18, 923987),
'item_scraped_count': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 4, 1, 14, 27, 17, 773592)}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Spider closed (finished)
Your parse method should be a generator, so you need to use yield instead of return on the for loop, otherwise you'll finish the loop on the first iteration. Furthermore, get_get method is returning a set, but it must return Request, BaseItem, dict or None.
I suggest changing the code as follow:
import scrapy
import json
with open('C:/Users/coppe/tutorial/cities.json') as json_file:
cities = json.load(json_file)
class communes_spider(scrapy.Spider):
name = "geo"
start_urls = ['https://www.latlong.net/']
def parse(self, response):
for city in cities:
yield scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)
def get_geo(self, response):
yield {'coord': response.css('span.coordinatetxt::text').get()}
https://www.geeksforgeeks.org/use-yield-keyword-instead-return-keyword-python/
I have written a code that passes through links within a web page to extract data and move to the next page. It is the about link from each author in quotes.toscrape.com.
import scrapy
class TestSpider(scrapy.Spider):
name = 'test'
allowed_domains = ['quotes.toscrape.com']
start_urls = ['http://quotes.toscrape.com',]
def parse(self, response):
linkto = response.css('div.quote > span > a::attr(href)').extract()
for links in linkto:
links = response.urljoin(links)
yield scrapy.Request(url=links, callback = scrapy.parse_about)
nextp = response.css('li.next > a::attr(href)').extract()
if nextp:
nextp = response.urljoin(nextp)
yield scrapy.Request(url=nextp, callback=self.parse)
def parse_about(self, response):
yield {
'date_of_birth': response.css('span.author-born-date::text').extract(),
'author': response.css('h3.author-title::text').extract(),
}
I executed in the command prompt:
scrapy crawl test -o test.csv
but the results I got:
019-03-20 16:36:03 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: quotestoscrape)
2019-03-20 16:36:03 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 17.5.0, Python 2.7.15 |Anaconda, Inc.| (default, Nov 13 2018, 17:33:26) [MSC v.1500 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1 11 Sep 2018), cryptography 2.5, Platform Windows-10-10.0.17134
2019-03-20 16:36:03 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotestoscrape.spiders', 'SPIDER_MODULES': ['quotestoscrape.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'quotestoscrape'}
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-20 16:36:03 [scrapy.core.engine] INFO: Spider opened
2019-03-20 16:36:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-20 16:36:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-03-20 16:36:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-03-20 16:36:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-03-20 16:36:04 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com> (referer: None)
Traceback (most recent call last):
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kenny\quotestoscrape\quotestoscrape\spiders\QuoteTestSpider.py", line 13, in parse
yield scrapy.Request(url=links, callback = scrapy.parse_about)
AttributeError: 'module' object has no attribute 'parse_about'
2019-03-20 16:36:04 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-20 16:36:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 3, 20, 21, 36, 4, 41000),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2019, 3, 20, 21, 36, 3, 468000)}
2019-03-20 16:36:04 [scrapy.core.engine] INFO: Spider closed (finished)
And my csv file I moved it to is empty:
enter image description here
Please let me know what I am doing wrong
According to your log method parse_about is not called because you are trying to call scrapy.parse_about instead of spider's self.parse_about:
....
for links in linkto:
links = response.urljoin(links)
yield scrapy.Request(url=links, callback = self.parse_about)
As your application doesn't scrape any data -> It creates empty csv file as result.
Trying to figure out how scrapy works and using it to find information on forums.
items.py
import scrapy
class BodybuildingItem(scrapy.Item):
# define the fields for your item here like:
title = scrapy.Field()
pass
spider.py
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from bodybuilding.items import BodybuildingItem
class BodyBuildingSpider(BaseSpider):
name = "bodybuilding"
allowed_domains = ["forum.bodybuilding.nl"]
start_urls = [
"https://forum.bodybuilding.nl/fora/supplementen.22/"
]
def parse(self, response):
responseSelector = Selector(response)
for sel in responseSelector.css('li.past.line.event-item'):
item = BodybuildingItem()
item['title'] = sel.css('a.data-previewUrl::text').extract()
yield item
The forum I'm trying to get the post titles from in this example is this: https://forum.bodybuilding.nl/fora/supplementen.22/
However I keep getting no results:
class BodyBuildingSpider(BaseSpider): 2017-10-07 00:42:28
[scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: bodybuilding)
2017-10-07 00:42:28 [scrapy.utils.log] INFO: Overridden settings:
{'NEWSPIDER_MODULE': 'bodybuilding.spiders', 'SPIDER_MODULES':
['bodybuilding.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME':
'bodybuilding'} 2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled
extensions: ['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats'] 2017-10-07 00:42:28
[scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-10-07
00:42:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-10-07 00:42:28
[scrapy.middleware] INFO: Enabled item pipelines: [] 2017-10-07
00:42:28 [scrapy.core.engine] INFO: Spider opened 2017-10-07 00:42:28
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min) 2017-10-07 00:42:28
[scrapy.core.engine] DEBUG: Crawled (404) https://forum.bodybuilding.nl/robots.txt> (referer: None) 2017-10-07
00:42:29 [scrapy.core.engine] DEBUG: Crawled (200) https://forum.bodybuilding.nl/fora/supplementen.22/> (referer: None)
2017-10-07 00:42:29 [scrapy.core.engine] INFO: Closing spider
(finished) 2017-10-07 00:42:29 [scrapy.statscollectors] INFO: Dumping
Scrapy stats: {'downloader/request_bytes': 469,
'downloader/request_count': 2, 'downloader/request_method_count/GET':
2, 'downloader/response_bytes': 22878, 'downloader/response_count':
2, 'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1, 'finish_reason':
'finished', 'finish_time': datetime.datetime(2017, 10, 6, 22, 42, 29,
223305), 'log_count/DEBUG': 2, 'log_count/INFO': 7, 'memusage/max':
31735808, 'memusage/startup': 31735808, 'response_received_count':
2, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 10, 6, 22, 42, 28, 816043)}
2017-10-07 00:42:29 [scrapy.core.engine] INFO: Spider closed
(finished)
I have been following the guide here: http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html
Update 1:
As someone told me I needed to update my code to the new standards, which I did but it didnt change the outcome:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from bodybuilding.items import BodybuildingItem
class BodyBuildingSpider(BaseSpider):
name = "bodybuilding"
allowed_domains = ["forum.bodybuilding.nl"]
start_urls = [
"https://forum.bodybuilding.nl/fora/supplementen.22/"
]
def parse(self, response):
for sel in response.css('li.past.line.event-item'):
item = BodybuildingItem()
yield {'title': title.css('a.data-previewUrl::text').extract_first()}
yield item
Last update with fix
After some good help I finally got it working with this spider:
import scrapy
class BlogSpider(scrapy.Spider):
name = 'bodybuilding'
start_urls = ['https://forum.bodybuilding.nl/fora/supplementen.22/']
def parse(self, response):
for title in response.css('h3.title'):
yield {'title': title.css('a::text').extract_first()}
next_page_url = response.xpath("//a[text()='Volgende >']/#href").extract_first()
if next_page_url:
yield response.follow(next_page_url, callback=self.parse)
You should use response.css('li.past.line.event-item') and there is no need for responseSelector = Selector(response).
Also the CSS you are using li.past.line.event-item, is no more valid, so you need update those first based on the latest web page
To get the next page URL you can use
>>> response.css("a.text::attr(href)").extract_first()
'fora/supplementen.22/page-2'
And then use response.follow to follow this relative url
Edit-2: Next Page processing correction
The previous edit didn't work because on the next page it matches the previous page url, so you need to use below
next_page_url = response.xpath("//a[text()='Volgende >']/#href").extract_first()
if next_page_url:
yield response.follow(next_page_url, callback=self.parse)
Edit-1: Next Page processing
next_page_url = response.css("a.text::attr(href)").extract_first()
if next_page_url:
yield response.follow(next_page_url, callback=self.parse)