Scrapy Pagination - Works for 2 pages but not after that - scrapy

I'm crawling cdw.com website. For a given URL, there are around 17 pages. The script that I have written is able t fetch data from Page 1 and Page 2. Spider closes on its own after giving result of first 2 pages. Please let me know, how can I fetch data for remaining 15 pages.
TIA.
import scrapy
from cdwfinal.items import CdwfinalItem
from scrapy.selector import Selector
import datetime
import pandas as pd
import time
class CdwSpider(scrapy.Spider):
name = 'cdw'
allowed_domains = ['www.cdw.com']
start_urls = ['http://www.cdw.com/']
base_url = 'http://www.cdw.com'
def start_requests(self):
yield scrapy.Request(url = 'https://www.cdw.com/search/?key=axiom' , callback=self.parse )
def parse(self, response):
item=[]
hxs = Selector(response)
item = CdwfinalItem()
abc = hxs.xpath('//*[#id="main"]//*[#class="grid-row"]')
for i in range(len(abc)):
try:
item['mpn'] = hxs.xpath("//div[contains(#class,'search-results')]/div[contains(#class,'search-result')]["+ str(i+1) +"]//*[#class='mfg-code']/text()").extract()
except:
item['mpn'] = 'NA'
try:
item['part_no'] = hxs.xpath("//div[contains(#class,'search-results')]/div[contains(#class,'search-result')]["+ str(i+1) +"]//*[#class='cdw-code']/text()").extract()
except:
item['part_no'] = 'NA'
yield item
next_page = hxs.xpath('//*[#id="main"]//*[#class="no-hover" and #aria-label="Next Page"]').extract()
if next_page:
new_page_href = hxs.xpath('//*[#id="main"]//*[#class="no-hover" and #aria-label="Next Page"]/#href').extract_first()
new_page_url = response.urljoin(new_page_href)
yield scrapy.Request(new_page_url, callback=self.parse, meta={"searchword": '123'})
LOG:
2023-02-11 15:39:55 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36
2023-02-11 15:39:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdw.com/search/?key=axiom&pcurrent=3> (referer: https://www.cdw.com/search/?key=axiom&pcurrent=2) ['cached']
2023-02-11 15:39:55 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-11 15:39:55 [scrapy.extensions.feedexport] INFO: Stored csv feed (48 items) in: Test5.csv
2023-02-11 15:39:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2178,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 68059,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'elapsed_time_seconds': 1.30903,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 2, 11, 10, 9, 55, 327740),
'httpcache/hit': 3,
'httpcompression/response_bytes': 384267,
'httpcompression/response_count': 3,
'item_scraped_count': 48,
'log_count/DEBUG': 62,
'log_count/INFO': 11,
'log_count/WARNING': 45,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2023, 2, 11, 10, 9, 54, 18710)}

Your next_page selector is failing to extract the information for the next page. In general your selectors are more complicated then they need to be, for example you should be using relative xpath expressions in your for loop .
Here is an example that replicates the same behaviour as your spider except using much simpler selectors, and successfully extracts the results from all of the pages.
import scrapy
class CdwSpider(scrapy.Spider):
name = 'cdw'
allowed_domains = ['www.cdw.com']
custom_settings = {
"USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}
def start_requests(self):
yield scrapy.Request(url='https://www.cdw.com/search/?key=axiom' , callback=self.parse )
def parse(self, response):
for row in response.xpath('//div[#class="grid-row"]'):
mpn = row.xpath(".//span[#class='mfg-code']/text()").get()
cdw = row.xpath('.//span[#class="cdw-code"]/text()').get()
yield {"mpn": mpn, "part_no": cdw}
current = response.css("div.search-pagination-active")
next_page = current.xpath('./following-sibling::a/#href').get()
if next_page:
new_page_url = response.urljoin(next_page)
yield scrapy.Request(new_page_url, callback=self.parse)
EDIT
The only non-default setting I am using is setting a the user agent.
I have made adjustments in the example above to reflect that.
Partial output:
2023-02-11 22:10:58 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-11 22:10:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 106555,
'downloader/request_count': 41,
'downloader/request_method_count/GET': 41,
'downloader/response_bytes': 1099256,
'downloader/response_count': 41,
'downloader/response_status_count/200': 41,
'elapsed_time_seconds': 22.968986,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 2, 12, 6, 10, 58, 700080),
'httpcompression/response_bytes': 7962149,
'httpcompression/response_count': 41,
'item_scraped_count': 984,
'log_count/DEBUG': 1028,
'log_count/INFO': 10,
'request_depth_max': 40,
'response_received_count': 41,
'scheduler/dequeued': 41,
'scheduler/dequeued/memory': 41,
'scheduler/enqueued': 41,
'scheduler/enqueued/memory': 41,
'start_time': datetime.datetime(2023, 2, 12, 6, 10, 35, 731094)}
2023-02-11 22:10:58 [scrapy.core.engine] INFO: Spider closed (finished)

Related

Scrapy not returned all items from pagination

I want to scrape all monitor item from the site https://www.startech.com.bd. But
The problem arise when I run my spider it returns only 60 result.
Here is my code, which doesn't work right:
import scrapy
import time
class StartechSpider(scrapy.Spider):
name = 'startech'
allowed_domains = ['startech.com.bd']
start_urls = ['https://www.startech.com.bd/monitor/']
def parse(self, response):
monitors = response.xpath("//div[#class='p-item']")
for monitor in monitors:
item = monitor.xpath(".//h4[#class = 'p-item-name']/a/text()").get()
price = monitor.xpath(".//div[#class = 'p-item-price']/span/text()").get()
yield{
'item' : item,
'price' : price
}
next_page = response.xpath("//ul[#class = 'pagination']/li/a/#href").get()
print (next_page)
if next_page:
yield response.follow(next_page, callback = self.parse)
Any help is much appreciated!
//ul[#class = 'pagination']/li/a/#href selects 10 items/pages at once but you have to select unique meaning only the next page.The following xpath expression grab the right pagination.
Code:
next_page = response.xpath("//a[contains(text(), 'NEXT')]/#href").get()
print (next_page)
if next_page:
yield response.follow(next_page, callback = self.parse)
Output:
2022-11-26 01:45:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.startech.com.bd/monitor?page=19> (referer: https://www.startech.com.bd/monitor?page=18)
2022-11-26 01:45:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.startech.com.bd/monitor?page=19>
{'item': 'HP E27q G4 27 Inch 2K QHD IPS Monitor', 'price': '41,000৳'}
None
2022-11-26 01:45:06 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-26 01:45:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6702,
'downloader/request_count': 19,
'downloader/request_method_count/GET': 19,
'downloader/response_bytes': 546195,
'downloader/response_count': 19,
'downloader/response_status_count/200': 19,
'elapsed_time_seconds': 9.939978,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 25, 19, 45, 6, 915772),
'httpcompression/response_bytes': 6200506,
'httpcompression/response_count': 19,
'item_scraped_count': 361,

Scrapy Pausing and resuming crawls, results directory

I have finished a scraping project using resume mode. but I don't know where the results are.
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
I look at https://docs.scrapy.org/en/latest/topics/jobs.html, but it does not indicate anything about it
¿Where is the file with the results?
2020-09-10 23:31:31 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-10 23:31:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/scrapy.core.downloader.handlers.http11.TunnelError': 22,
'bans/error/twisted.internet.error.ConnectionRefusedError': 2,
'bans/error/twisted.internet.error.TimeoutError': 6891,
'bans/error/twisted.web._newclient.ResponseNeverReceived': 8424,
'bans/status/500': 9598,
'bans/status/503': 56,
'downloader/exception_count': 15339,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 22,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 6891,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8424,
'downloader/request_bytes': 9530,
'downloader/request_count': 172,
'downloader/request_method_count/GET': 172,
'downloader/response_bytes': 1848,
'downloader/response_count': 170,
'downloader/response_status_count/200': 169,
'downloader/response_status_count/500': 9,
'downloader/response_status_count/503': 56,
'elapsed_time_seconds': 1717,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 9, 11, 2, 31, 31, 32),
'httperror/response_ignored_count': 67,
'httperror/response_ignored_status_count/500': 67,
'item_scraped_count': 120,
'log_count/DEBUG': 357,
'log_count/ERROR': 119,
'log_count/INFO': 1764,
'log_count/WARNING': 240,
'proxies/dead': 1,
'proxies/good': 1,
'proxies/mean_backoff': 0.0,
'proxies/reanimated': 0,
'proxies/unchecked': 0,
'response_received_count': 169,
'retry/count': 1019,
'retry/max_reached': 93,
'retry/reason_count/500 Internal Server Error': 867,
'retry/reason_count/twisted.internet.error.TimeoutError': 80,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 72,
'scheduler/dequeued': 1722,
'scheduler/dequeued/disk': 1722,
'scheduler/enqueued': 1722,
'scheduler/enqueued/disk': 1722,
'start_time': datetime.datetime(2015, 9, 9, 2, 48, 56, 908)}
2020-09-10 23:31:31 [scrapy.core.engine] INFO: Spider closed (finished)
(Face python 3.8) D:\Selenium\Face python 3.8\TORBUSCADORDELINKS\TORBUSCADORDELINKS\spiders>
'retry/reason_count/500 Internal Server Error': 867,
'retry/reason_count/twisted.internet.error.TimeoutError': 80,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 72,
'scheduler/dequeued': 1722673,
'scheduler/dequeued/disk': 1722,
'scheduler/enqueued': 1722,
'scheduler/enqueued/disk': 1722,
'start_time': datetime.datetime(2020, 9, 9, 2, 48, 56, 908)}
2020-09-10 23:31:31 [scrapy.core.engine] INFO: Spider closed (finished)
Your command,
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
does not indicate an output file path.
Because of that, your results are nowhere.
Use the -o command-line switch to specify an output path.
See also the Scrapy tutorial, which covers this. Or run scrapy crawl --help.

Scrapy: How to extract an attribute value from a <span>

Looking at Twitter: www.twitter.com/twitter
You will see that the amount of followers are shown as 57.9M but if you hover over that value you will see the exact amount of followers.
This appears in the source as:
<span class="ProfileNav-value" data-count="57939946" data-is-compact="true">57.9M</span>
When I inspect this span on Chrome I use:
(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[3]
I am trying to extract just the attribute "data-count" using the above:
def parseTwitter(self, response):
company_name=response.meta['company_name']
l=ItemLoader(item=TwitterItem(), response=response)
l.add_value('company_name', company_name)
l.add_xpath('twitter_tweets', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[1]/text()")
l.add_xpath('twitter_following', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[2]/text()")
l.add_xpath('twitter_followers', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[3]/text()")
...but I'm not getting anything back:
2018-10-18 10:22:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-18 10:22:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/ADP> (referer: None)
2018-10-18 10:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/Workday> (referer: None)
2018-10-18 10:22:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/OracleHCM> (referer: None)
2018-10-18 10:22:16 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-18 10:22:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 892,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 199199,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 10, 18, 10, 22, 16, 833691),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'memusage/max': 52334592,
'memusage/startup': 52334592,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2018, 10, 18, 10, 22, 7, 269320)}
SOLUTION:
As per pwinz suggestion below, I was trying to do a text value extract ".text()" from the attribute where simply #-ing the attribute should give you the value. My final - working - solution is:
def parseTwitter(self, response):
company_name=response.meta['company_name']
print('### ### ### Inside PARSE TWITTER ### ### ###')
l=ItemLoader(item=TwitterItem(), response=response)
l.add_value('company_name', company_name)
l.add_xpath('twitter_tweets', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[1]")
l.add_xpath('twitter_following', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[2]")
l.add_xpath('twitter_followers', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[3]")
yield l.load_item()
Its because data is manipulated with Javascript but Scrapy only downloads HTML but does not executes any JS/AJAX code.
When scraping with Scrapy, always disable Javascript in browser and then find what you want to scrape, and if its available, just use your selector/xpath, otherwise, inspect JS/AJAX calls on webspage to understand how it is loading data
So, to scrape number of follower
You can use following CSS Selector
.ProfileNav-item.ProfileNav-item--followers a
Scrapy code
item = {}
item["followers"] = response.css(".ProfileNav-item.ProfileNav-item--followers a").extract_first()
yield item
With respect to other answers, dynamic content is not the issue here. You are trying to get the text() from the data-count attribute. You should just be able to get the data from the #data-count.
Try this pattern:
l.add_xpath('twitter_tweets', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav
-value']/#data-count)[1]")
It worked for me.

Scrapy Crawlspider is only crawling the first 5 pages in a category

Currently, my crawlspider will only crawl roughly 20,000 products of the over 6.5M available. It seems that each category is being scraped but only the first 5 pages of each category are being scraped. I believe that is is something with my linkextractor but I am not sure.
CrawlSpider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
class DigikeyItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
partnumber = scrapy.Field()
manufacturer = scrapy.Field()
description = scrapy.Field()
quanity= scrapy.Field()
minimumquanity = scrapy.Field()
price = scrapy.Field()
class DigikeySpider(CrawlSpider):
name = 'digikey'
allowed_domains = ['digikey.com']
start_urls = ['https://www.digikey.com/products/en']
rules = (
Rule(LinkExtractor(allow=('products', )),callback='parse_item'),
)
def parse_item(self, response):
for row in response.css('table#productTable tbody tr'):
item = DigikeyItem()
item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()
item['manufacturer'] = row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first()
item['description'] = row.css('.tr-description::text').extract_first()
item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first()
item['price'] = row.css('.tr-unitPrice::text').extract_first()
item['minimumquanity'] = row.css('.tr-minQty::text').extract_first()
yield item
Setting:
BOT_NAME = 'digikey'
SPIDER_MODULES = ['digikey.spiders']
NEWSPIDER_MODULE = 'digikey.spiders'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
Output:
2017-11-01 10:53:11 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-01 10:53:11 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
'downloader/request_bytes': 1198612,
'downloader/request_count': 988,
'downloader/request_method_count/GET': 988,
'downloader/response_bytes': 23932614,
'downloader/response_count': 982,
'downloader/response_status_count/200': 982,
'dupefilter/filtered': 46,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 11, 1, 17, 53, 11, 421641),
'item_scraped_count': 21783,
'log_count/DEBUG': 22773,
'log_count/ERROR': 2,
'log_count/INFO': 10,
'request_depth_max': 1,
'response_received_count': 982,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4,
'scheduler/dequeued': 988,
'scheduler/dequeued/memory': 988,
'scheduler/enqueued': 988,
'scheduler/enqueued/memory': 988,
'start_time': datetime.datetime(2017, 11, 1, 17, 49, 38, 427669)}
2017-11-01 10:53:11 [scrapy.core.engine] INFO: Spider closed (finished)
PS C:\Users\dalla_000\digikey>
With this particular site, it might make sense to do a two-stage crawl:
get a list of all the category URLs
iterate over all the URLs
One approach could be to use two spiders and a message queue. The first spider might look like this:
import scrapy
from bs4 import BeautifulSoup
import re
import math
import urllib
from kafka import KafkaClient, SimpleProducer
ITEMS_PER_PAGE = 500
class CreateXxxxxxxUrlListSpider(scrapy.Spider):
kafka = KafkaClient('10.0.1.12:9092')
producer = SimpleProducer(kafka)
name = "create_xxxxxxx_url_list"
allowed_domains = ["xxxxxxx.com"]
start_urls = [
"http://www.xxxxxxx.com/product-search/en?stock=1"
]
def parse(self, response):
soup = BeautifulSoup(response.body)
catfilterlinks = soup.find_all('a', {'class':'catfilterlink'})
for catfilterlink in catfilterlinks:
location = catfilterlink['href'].split("?")[0]
items = re.match(".*\(([0-9]+) items\).*", catfilterlink.next_sibling).group(1)
for page in range(int(math.ceil(float(items) / ITEMS_PER_PAGE))):
if page == 0:
url = "http://www.xxxxxxx.com" + location + "?" + urllib.urlencode({"stock":1})
self.producer.send_messages("xxxxxxx_search_page_urls", url)
else:
url = "http://www.xxxxxxx.com" + location + "/page/" + str(page + 1) + "?" + urllib.urlencode({"stock":1})
self.producer.send_messages("xxxxxxx_search_page_urls", url)
The first spider gets a list of all the pages to crawl and then writes them to a message queue (e.g. Kafka).
The second spider would consume URLs from from the Kafka topic and crawl them. It might look something like this:
from scrapy_kafka.spiders import ListeningKafkaSpider
from .. items import PageHtml
from calendar import timegm
import time
class CrawlXxxxxxxUrlsSpider(ListeningKafkaSpider):
name = 'crawl_xxxxxxx_urls_spider'
allowed_domains = ["xxxxxxx.com"]
topic = "xxxxxxx_search_page_urls"
def parse(self, response):
item = PageHtml()
item['url'] = response.url
item['html'] = response.body_as_unicode()
item['ts'] = timegm(time.gmtime())
return item
# .... or whatever

scrapy:Why no use of parse_item function

Here is my Spider:
import scrapy
import urlparse
from scrapy.http import Request
class BasicSpider(scrapy.Spider):
name = "basic2"
allowed_domains = ["cnblogs"]
start_urls = (
'http://www.cnblogs.com/kylinlin/',
)
def parse(self, response):
next_site = response.xpath(".//*[#id='nav_next_page']/a/#href")
for url in next_site.extract():
yield Request(urlparse.urljoin(response.url,url))
item_selector = response.xpath(".//*[#class='postTitle']/a/#href")
for url in item_selector.extract():
yield Request(url=urlparse.urljoin(response.url, url),
callback=self.parse_item)
def parse_item(self, response):
print "+=====================>>test"
Here is the output:
2016-08-12 14:46:20 [scrapy] INFO: Spider opened
2016-08-12 14:46:20 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-12 14:46:20 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-08-12 14:46:20 [scrapy] DEBUG: Crawled (200) http://www.cnblogs.com/robots.txt> (referer: None)
2016-08-12 14:46:20 [scrapy] DEBUG: Crawled (200) http://www.cnblogs.com/kylinlin/> (referer: None)
2016-08-12 14:46:20 [scrapy] DEBUG: Filtered offsite request to 'www.cnblogs.com': http://www.cnblogs.com/kylinlin/default.html?page=2>
2016-08-12 14:46:20 [scrapy] INFO: Closing spider (finished)
2016-08-12 14:46:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 445,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 5113,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 12, 6, 46, 20, 420000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 11,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 8, 12, 6, 46, 20, 131000)}
2016-08-12 14:46:20 [scrapy] INFO: Spider closed (finished)
Why crawled pages are 0?
I cannot understand why there are no output like "+=====================>>test".
Could someone help me out?
2016-08-12 14:46:20 [scrapy] DEBUG: Filtered offsite request to 'www.cnblogs.com': http://www.cnblogs.com/kylinlin/default.html?page=2>
and your's is set to:
allowed_domains = ["cnblogs"]
which is not even a domain. It should be:
allowed_domains = ["cnblogs.com"]