When I execute the following code, I am able to extract only one link instead of all the links on that specific page of the website.
from scrapy import Spider
from scrapy.http import Request

class BooksSpider(Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com/']
    start_urls = ["http://books.toscrape.com"]

    def parse(self, response):
        books = response.xpath("//h3/a/@href").extract()
        for book in books:
            absolute_url = response.urljoin(book)
            yield Request(absolute_url), callback=self.parse_page)

    def parse_page(self, response):
        pass
This is the output, which shows only the first link of "books.toscrape.com" being extracted. Can anyone help me understand what the mistake is here, or whether this is due to some system error? This is very frustrating, as the loop and everything else look fine. I suspect there is some issue with the yield, but how can I handle this error?
2020-05-26 12:09:23 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrap_book)
2020-05-26 12:09:23 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 22:45:29) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-05-26 12:09:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-26 12:09:23 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrap_book',
'NEWSPIDER_MODULE': 'scrap_book.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrap_book.spiders']}
2020-05-26 12:09:23 [scrapy.extensions.telnet] INFO: Telnet Password: 7b1edefe67af4658
2020-05-26 12:09:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-05-26 12:09:25 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-26 12:09:25 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-26 12:09:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-26 12:09:26 [scrapy.core.engine] INFO: Spider opened
2020-05-26 12:09:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-26 12:09:26 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-05-26 12:09:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://books.toscrape.com/robots.txt> (referer: None)
2020-05-26 12:09:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com> (referer: None)
2020-05-26 12:09:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'books.toscrape.com': <GET http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/ind
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 5, 26, 6, 24, 26, 117907)}
2020-05-26 12:09:28 [scrapy.core.engine] INFO: Spider closed (finished)
As Scrapy says, the links are filtered because of your allowed_domains:
DEBUG: Filtered offsite request to 'books.toscrape.com'
Change your code to allowed_domains = ['books.toscrape.com'] and it should work fine.
In addition, in the code you posted there is an error in the yield: there is one closing bracket too many right after absolute_url. It should be: yield Request(absolute_url, callback=self.parse_page)
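For reference, a sketch of the spider with both fixes applied (assuming the same XPath from the question and default project settings):

from scrapy import Spider
from scrapy.http import Request

class BooksSpider(Spider):
    name = 'books'
    # no trailing slash, otherwise the offsite middleware filters the requests
    allowed_domains = ['books.toscrape.com']
    start_urls = ["http://books.toscrape.com"]

    def parse(self, response):
        books = response.xpath("//h3/a/@href").extract()
        for book in books:
            absolute_url = response.urljoin(book)
            # single closing bracket; callback is passed to the Request
            yield Request(absolute_url, callback=self.parse_page)

    def parse_page(self, response):
        pass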
I tried to scrape this website using scrapy and scrapy-selenium as an exercise. I am trying to get names, prices, etc. My XPath expression seems fine in the dev tools in Chrome, but it isn't working in my script. I don't know what I am doing wrong. Can you please explain why my XPath expression is not working?
import scrapy
from scrapy_selenium import SeleniumRequest
from scrapy.selector import Selector
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

class ComputerdealsSpider(scrapy.Spider):
    name = 'computerdeals'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://slickdeals.net/computer-deals',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        products = response.xpath("//ul[@class='bp-p-filterGrid_items']/li")
        for product in products:
            yield {
                'price': product.xpath(".//div/span[@class='bp-c-card_subtitle']/text()").get(),
            }
OUTPUT
2022-11-20 13:59:59 [scrapy.utils.log] INFO: Scrapy 2.7.0 started (bot: silkdeals)
2022-11-20 13:59:59 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform Windows-10-10.0.19044-SP0
2022-11-20 13:59:59 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'silkdeals',
'NEWSPIDER_MODULE': 'silkdeals.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['silkdeals.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-11-20 13:59:59 [asyncio] DEBUG: Using selector: SelectSelector
2022-11-20 13:59:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-11-20 13:59:59 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-11-20 13:59:59 [scrapy.extensions.telnet] INFO: Telnet Password: d3adcd8a4caad669
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-11-20 13:59:59 [scrapy.middleware] WARNING: Disabled SeleniumMiddleware: SELENIUM_DRIVER_NAME and SELENIUM_DRIVER_EXECUTABLE_PATH must be set
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-11-20 13:59:59 [scrapy.core.engine] INFO: Spider opened
2022-11-20 13:59:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-20 13:59:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-20 14:00:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://slickdeals.net/robots.txt> (referer: None)
2022-11-20 14:00:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://slickdeals.net/computer-deals/> from <GET https://slickdeals.net/computer-deals>
2022-11-20 14:00:01 [filelock] DEBUG: Attempting to acquire lock 2668401413376 on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [filelock] DEBUG: Lock 2668401413376 acquired on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [filelock] DEBUG: Attempting to release lock 2668401413376 on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [filelock] DEBUG: Lock 2668401413376 released on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://slickdeals.net/computer-deals/> (referer: None)
2022-11-20 14:00:01 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-20 14:00:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 681,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 96185,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 2.098319,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 20, 11, 0, 1, 689826),
'httpcompression/response_bytes': 617590,
'httpcompression/response_count': 2,
'log_count/DEBUG': 10,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2022, 11, 20, 10, 59, 59, 591507)}
2022-11-20 14:00:01 [scrapy.core.engine] INFO: Spider closed (finished)
This is my first spider. When I executed this code in my cmd, the log showed that the URLs are not even getting crawled, and there were no DEBUG messages for them.
I can't find any solution to this problem anywhere, and I am not able to understand what is wrong. Can somebody help me with this?
My code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"

    def start_request(self):
        urls = ["http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/"
                ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Log:
2021-06-19 23:19:01 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: my_scrapy)
2021-06-19 23:19:01 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect
1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.0 (tags/v3.9.0:9cf6752, Oct 5
2020, 15:34:40) [MSC v.1927 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021),
cryptography 3.4.7, Platform Windows-10-10.0.19041-SP0
2021-06-19 23:19:01 [scrapy.utils.log] DEBUG: Using reactor:
twisted.internet.selectreactor.SelectReactor
2021-06-19 23:19:01 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_scrapy',
'NEWSPIDER_MODULE': 'my_scrapy.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['my_scrapy.spiders']}
2021-06-19 23:19:01 [scrapy.extensions.telnet] INFO: Telnet Password: 1a9440bbf933d074
2021-06-19 23:19:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-06-19 23:19:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-06-19 23:19:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-06-19 23:19:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-06-19 23:19:02 [scrapy.core.engine] INFO: Spider opened
2021-06-19 23:19:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2021-06-19 23:19:02 [scrapy.extensions.telnet] INFO: Telnet console listening on
127.0.0.1:6023
2021-06-19 23:19:02 [scrapy.core.engine] INFO: Closing spider (finished)
2021-06-19 23:19:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.008228,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 6, 19, 17, 49, 2, 99933),
'log_count/INFO': 10,
'start_time': datetime.datetime(2021, 6, 19, 17, 49, 2, 91705)}
2021-06-19 23:19:02 [scrapy.core.engine] INFO: Spider closed (finished)
Note: As I do not have 50 reputation to comment, I am answering here.
The problem is in the function naming: your function should be def start_requests(self) instead of def start_request(self).
The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each of the URLs. But in your case, because of the misspelled name, Scrapy never gets into that function, so the requests for those URLs are never made.
Your code after the small change:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"

    def start_requests(self):
        urls = ["http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/"
                ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
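For context, the start_requests that Scrapy provides when you do not override it is roughly equivalent to the sketch below (simplified, not the exact library source), which is why the override has to be spelled exactly start_requests:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://quotes.toscrape.com/page/1/"]

    # If you do NOT define start_requests yourself, Scrapy effectively runs
    # something like this for you (simplified sketch):
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)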
I am trying to learn scrapy and splash to scrape efficiently from the web. I have installed scrapy and scrapy-splash, and have splash running in a Docker container. My code is below:
import scrapy
from scrapy.crawler import CrawlerProcess
from pathlib import Path
from scrapy_splash import SplashRequest

max_price = "110000"
min_price = "65000"
region_code = "5E430"

class QuotesSpider(scrapy.Spider):
    name = "propertySearch"

    def start_requests(self):
        url = "http://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%" + region_code + "&minBedrooms=2&maxPrice=" + max_price + "&minPrice=" + min_price + "&propertyTypes=detached" + \
              "%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords="
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 3})

    def parse(self, response):
        work_path = "C:/Users/Cristi/Desktop/Scrapy_ROI_work_area/"
        no_of_pages = response.xpath('//span[@class = "pagination-pageInfo"]')
        with open(Path(work_path, "test.txt"), 'wb') as f:
            f.write(response.body)
        # with open(Path(work_path, "extract.txt"), 'wb') as g:
        #     g.write(no_of_pages)
        self.log('Saved file test.txt')

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()
When I run it, the console outputs the following, in case it is of any use:
2020-06-07 00:06:32 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-06-07 00:06:32 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:20:19) [MSC v.1925 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-06-07 00:06:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-06-07 00:06:32 [scrapy.crawler] INFO: Overridden settings:
{}
2020-06-07 00:06:32 [scrapy.extensions.telnet] INFO: Telnet Password: 162ccfed8b528ac9
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-06-07 00:06:32 [scrapy.core.engine] INFO: Spider opened
2020-06-07 00:06:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-07 00:06:32 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-07 00:06:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=> from <GET http://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=>
2020-06-07 00:06:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=> (referer: None)
2020-06-07 00:06:33 [propertySearch] DEBUG: Saved file test.txt
2020-06-07 00:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-07 00:06:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 994,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 77255,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 0.429054,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 6, 6, 23, 6, 33, 185514),
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 6, 6, 23, 6, 32, 756460)}
2020-06-07 00:06:33 [scrapy.core.engine] INFO: Spider closed (finished)
The text file that it outputs only contains the bare HTML with none of the data that JavaScript would bring in.
I read the documentation closely and found that, first, I needed to follow the Scrapy project set-up process from here closely, and then I found very useful info here about how to run my script from my IDE. It's working now.
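For readers hitting the same issue, the detail that usually matters is that a bare CrawlerProcess() ignores the project's settings.py, so the scrapy-splash configuration (SPLASH_URL, the downloader middlewares, the dupefilter) never gets applied. A minimal sketch of running the spider from a script inside a properly set-up Scrapy project, assuming that configuration lives in settings.py:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() loads settings.py, so the scrapy-splash
# settings defined there are actually used by the crawler
process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)
process.start()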
I'm trying to get the latitude and longitude of different cities. The names of the cities are stored in a JSON file. Here is my code:
import scrapy
import json

with open('C:/Users/coppe/tutorial/cities.json') as json_file:
    cities = json.load(json_file)

class communes_spider(scrapy.Spider):
    name = "geo"
    start_urls = ['https://www.latlong.net/']

    def parse(self, response):
        for city in cities:
            return scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)

    def get_geo(self, response):
        yield {response.css('span.coordinatetxt::text').get()}
The objective is to loop through the JSON file and, for each city, send a request to the form at the url "https://www.latlong.net/". However, nothing comes out of this request. Is this a bad way to write the loop? Should I handle the JSON file inside the class?
Log:
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17763-SP0
2019-04-01 16:27:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-01 16:27:17 [scrapy.core.engine] INFO: Spider opened
2019-04-01 16:27:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-01 16:27:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/robots.txt> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.latlong.net/> (referer: https://www.latlong.net/)
2019-04-01 16:27:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.latlong.net/>
{'latlong': '0,0'}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-01 16:27:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 874,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 29252,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 1, 14, 27, 18, 923987),
'item_scraped_count': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 4, 1, 14, 27, 17, 773592)}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Spider closed (finished)
Your parse method should be a generator, so you need to use yield instead of return in the for loop; otherwise you finish the loop on the first iteration. Furthermore, the get_geo method is returning a set, but it must return a Request, BaseItem, dict or None.
I suggest changing the code as follows:
import scrapy
import json

with open('C:/Users/coppe/tutorial/cities.json') as json_file:
    cities = json.load(json_file)

class communes_spider(scrapy.Spider):
    name = "geo"
    start_urls = ['https://www.latlong.net/']

    def parse(self, response):
        for city in cities:
            yield scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)

    def get_geo(self, response):
        yield {'coord': response.css('span.coordinatetxt::text').get()}
https://www.geeksforgeeks.org/use-yield-keyword-instead-return-keyword-python/
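To illustrate the yield-vs-return point from that article outside of Scrapy, a minimal standalone sketch (the function names are made up for the example):

def first_only(items):
    for item in items:
        return item  # leaves the function on the first iteration

def all_of_them(items):
    for item in items:
        yield item  # produces one value per iteration, so the loop keeps going

print(first_only([1, 2, 3]))          # 1
print(list(all_of_them([1, 2, 3])))   # [1, 2, 3]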
I have written code that follows links within a web page to extract data and move to the next page. It follows the "about" link for each author on quotes.toscrape.com.
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com',]

    def parse(self, response):
        linkto = response.css('div.quote > span > a::attr(href)').extract()
        for links in linkto:
            links = response.urljoin(links)
            yield scrapy.Request(url=links, callback = scrapy.parse_about)
        nextp = response.css('li.next > a::attr(href)').extract()
        if nextp:
            nextp = response.urljoin(nextp)
            yield scrapy.Request(url=nextp, callback=self.parse)

    def parse_about(self, response):
        yield {
            'date_of_birth': response.css('span.author-born-date::text').extract(),
            'author': response.css('h3.author-title::text').extract(),
        }
I executed in the command prompt:
scrapy crawl test -o test.csv
but the results I got:
2019-03-20 16:36:03 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: quotestoscrape)
2019-03-20 16:36:03 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 17.5.0, Python 2.7.15 |Anaconda, Inc.| (default, Nov 13 2018, 17:33:26) [MSC v.1500 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1 11 Sep 2018), cryptography 2.5, Platform Windows-10-10.0.17134
2019-03-20 16:36:03 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotestoscrape.spiders', 'SPIDER_MODULES': ['quotestoscrape.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'quotestoscrape'}
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-20 16:36:03 [scrapy.core.engine] INFO: Spider opened
2019-03-20 16:36:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-20 16:36:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-03-20 16:36:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-03-20 16:36:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-03-20 16:36:04 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com> (referer: None)
Traceback (most recent call last):
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kenny\quotestoscrape\quotestoscrape\spiders\QuoteTestSpider.py", line 13, in parse
yield scrapy.Request(url=links, callback = scrapy.parse_about)
AttributeError: 'module' object has no attribute 'parse_about'
2019-03-20 16:36:04 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-20 16:36:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 3, 20, 21, 36, 4, 41000),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2019, 3, 20, 21, 36, 3, 468000)}
2019-03-20 16:36:04 [scrapy.core.engine] INFO: Spider closed (finished)
And the csv file I moved the output to is empty.
Please let me know what I am doing wrong.
According to your log, the parse_about method is not called because you are trying to call scrapy.parse_about instead of the spider's self.parse_about:
....
    for links in linkto:
        links = response.urljoin(links)
        yield scrapy.Request(url=links, callback = self.parse_about)
Since your spider doesn't scrape any data, it creates an empty csv file as a result.