Request callback issue with infinite crawler - scrapy

I am writing a Scrapy spider whose purpose is to make requests to a remote server in order to hydrate a cache. It's an infinite crawler because I need to issue requests at regular intervals. The initial spider, which generated requests and hit the server, worked fine, but once I made it run infinitely I stopped getting responses. I even tried to debug in the process_response middleware, but execution never reaches it. Here is a sketch of the code I am implementing:
def generate_requests(self, payloads):
    for payload in payloads:
        if payload:
            print(f'making request with payload {payload}')
            yield Request(url=Config.HOTEL_CACHE_AVAILABILITIES_URL, method='POST', headers=Config.HEADERS,
                          callback=self.parse, body=json.dumps(payload), dont_filter=True, priority=1)

def start_requests(self):
    crawler_config = CrawlerConfig()
    while True:
        if not self.city_scheduler:
            for location in crawler_config.locations:
                city_name = location.city_name
                ttl = crawler_config.get_city_ttl(city_name)
                payloads = crawler_config.payloads.generate_payloads(location)
                self.city_scheduler[location.city_name] = (datetime.now() + timedelta(minutes=ttl)).strftime("%Y-%m-%dT%H:%M:%S")
                yield from self.generate_requests(payloads)

It seems Scrapy has some odd behavior with a while loop in start_requests (you can check a similar enhancement request on the Scrapy repo).
Moving the while-loop logic into your parse method will solve this issue (see the sketch below).
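A minimal, untested sketch of that suggestion, reusing CrawlerConfig, generate_requests and city_scheduler from the question: yield only the first batch of requests from start_requests, then re-schedule follow-up batches from parse once responses come back, instead of looping inside start_requests. The schedule_cities helper name is hypothetical.

from datetime import datetime, timedelta

def start_requests(self):
    self.crawler_config = CrawlerConfig()
    self.city_scheduler = {}
    yield from self.schedule_cities()  # first batch only

def schedule_cities(self):
    # Hypothetical helper: build one batch of requests and note when each city is due again.
    for location in self.crawler_config.locations:
        ttl = self.crawler_config.get_city_ttl(location.city_name)
        payloads = self.crawler_config.payloads.generate_payloads(location)
        self.city_scheduler[location.city_name] = datetime.now() + timedelta(minutes=ttl)
        yield from self.generate_requests(payloads)

def parse(self, response):
    # ... hydrate the cache from the response here ...
    # Re-schedule from the callback rather than from a while loop in start_requests.
    if all(due <= datetime.now() for due in self.city_scheduler.values()):
        self.city_scheduler.clear()
        yield from self.schedule_cities()

Note that if all requests finish before any TTL expires, the spider will go idle and close; keeping it alive between intervals needs something like the spider_idle signal discussed further below.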

Related

How to get around being blocked with "Scrapy"

Background:
I am planning on buying a car, and want to monitor the prices.
I'd like to use Scrapy to do this for me. However, the site blocks my code from doing this.
MWE/Code:
#!/usr/bin/python3
# from bs4 import BeautifulSoup
import scrapy  # adding scrapy to our file

urls = ['https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/']

class HeadphoneSpider(scrapy.Spider):  # our class inherits from scrapy.Spider
    name = "headphones"

    def start_requests(self):
        urls = ['https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/']  # list to enter our urls
        # urls = ['https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)  # we will explain the callback soon

    def parse(self, response):
        img_urls = response.css('img::attr(src)').extract()
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")

def main():
    scraper()
Output:
...some stuff above it
2020-01-10 00:37:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/>: HTTP status code is not handled or not allowed
..some more stuff underneath
Question:
I just don't know how to circumvent this "not handled or not allowed" response so that I can parse the prices, kms, etc. It would make my life so much easier. How can I get past this block? FWIW, I also tried it with BeautifulSoup, which didn't work.
There are multiple ways to avoid being blocked by a site while scraping it:
Set ROBOTSTXT_OBEY = False.
Increase DOWNLOAD_DELAY between your requests, e.g. to 3-4 seconds, depending on the site's behavior.
Set CONCURRENT_REQUESTS to 1.
Use a proxy, or a pool of proxies, via a custom proxy middleware.
Carry the site's cookies in your requests so the site does not flag the traffic as bot behavior.
You can try these solutions one at a time; a minimal settings sketch for the first points follows below.
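For the settings-based points, a minimal sketch of what the relevant lines in settings.py might look like (the values are illustrative, not prescribed by the answer):

# settings.py -- illustrative values only
ROBOTSTXT_OBEY = False       # stop obeying robots.txt
DOWNLOAD_DELAY = 3           # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 1      # only one request in flight at a time
COOKIES_ENABLED = True       # keep the site's cookies across requests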

Scrapy spider_idle signal - need to add requests with parse item callback

In my Scrapy spider I have overridden the start_requests() method, in order to retrieve some additional urls from a database, that represent items potentially missed in the crawl (orphaned items). This should happen at the end of the crawling process. Something like (pseudo code):
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True)

    # attempt to crawl orphaned items
    db = MySQLdb.connect(host=self.settings['AWS_RDS_HOST'],
                         port=self.settings['AWS_RDS_PORT'],
                         user=self.settings['AWS_RDS_USER'],
                         passwd=self.settings['AWS_RDS_PASSWD'],
                         db=self.settings['AWS_RDS_DB'],
                         cursorclass=MySQLdb.cursors.DictCursor,
                         use_unicode=True,
                         charset="utf8",)
    c = db.cursor()
    c.execute("""SELECT p.url FROM products p LEFT JOIN product_data pd ON p.id = pd.product_id AND pd.scrape_date = CURDATE() WHERE p.website_id = %s AND pd.id IS NULL""", (self.website_id,))
    while True:
        url = c.fetchone()
        if url is None:
            break
        # record orphaned product
        self.crawler.stats.inc_value('orphaned_count')
        yield Request(url['url'], callback=self.parse_item)
    db.close()
Unfortunately, it appears as though the crawler queues up these orphaned items during the rest of the crawl - so, in effect, too many are regarded as orphaned (because the crawler has not yet retrieved these items in the normal crawl, when the database query is executed).
I need this orphaned process to happen at the end of the crawl - so I believe I need to use the spider_idle signal.
However, my understanding is that I can't simply yield requests in my spider_idle method; instead, I should use self.crawler.engine.crawl?
I need requests to be processed by my spider's parse_item() method (and for my configured middleware, extensions and pipelines to be obeyed). How can I achieve this?
The idle method connected to the spider_idle signal (let's say it is called idle_method) receives the spider as an argument, so you could do something like:
def idle_method(self, spider):
    self.crawler.engine.crawl(Request(url=myurl, callback=spider.parse_item), spider)
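A hedged sketch of how the signal hookup and the orphan query from the question might fit together. The from_crawler override, signals.spider_idle and DontCloseSpider are standard Scrapy pieces, but the class, method and helper names here are illustrative, and get_orphaned_urls stands in for the MySQL query shown in the question:

from scrapy import signals, Request, Spider
from scrapy.exceptions import DontCloseSpider

class OrphanAwareSpider(Spider):
    name = 'orphan_aware'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Run the orphan pass only once the normal crawl has gone idle.
        crawler.signals.connect(spider.idle_method, signal=signals.spider_idle)
        return spider

    def idle_method(self, spider):
        if getattr(self, 'orphans_scheduled', False):
            return  # the orphan pass has already run; let the spider close
        self.orphans_scheduled = True
        for url in self.get_orphaned_urls():  # hypothetical helper wrapping the SQL from the question
            self.crawler.stats.inc_value('orphaned_count')
            self.crawler.engine.crawl(Request(url, callback=self.parse_item), spider)
        # Keep the spider alive so the newly scheduled requests get processed.
        raise DontCloseSpider

    def parse_item(self, response):
        pass  # existing item parsing logic from the question goes here

    def get_orphaned_urls(self):
        return []  # placeholder for the database query shown in the question

Because the requests are scheduled through the engine, they still pass through the configured middlewares, extensions and pipelines, and they are handled by parse_item as required.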

How to make Scrapy execute callbacks before the start_requests method finishes?

I have a large file of relative urls that I want to scrape with Scrapy, and I've written some code to read this file line-by-line and build requests for my spider to parse. Below is some sample code.
spider:
def start_requests(self):
    with open(self._file) as infile:
        for line in infile:
            inlist = line.replace("\n", "").split(",")
            item = MyItem(data=inlist[0])
            request = scrapy.Request(
                url="http://foo.org/{0}".format(item["data"]),
                callback=self.parse_some_page
            )
            request.meta["item"] = item
            yield request

def parse_some_page(self, response):
    ...
    request = scrapy.Request(
        url="http://foo.org/bar",
        callback=self.parse_some_page2
    )
    yield request
This works fine, but with a large input file, I'm seeing that parse_some_page2 isn't invoked until start_requests finishes yielding all the initial requests. Is there some way I can make Scrapy start invoking the callbacks earlier? Ultimately, I don't want to wait for a million requests before I start seeing items flow through the pipeline.
I came up with 2 solutions. 1) Run spiders in separate processes if there are too many large sites. 2) Use deferreds and callbacks via Twisted (please don't run away, it won't be too scary). I'll discuss how to use the 2nd method because the first one can simply be googled.
Every function that executes yield request will "block" until a result is available. So your parse_some_page() function yields a Request and will not go on to the next URL until a response is returned. I did manage to find some sites (mostly foreign government sites) that take a while to fetch, which hopefully simulates the situation you're experiencing. Here is a quick and easy example:
# spider/stackoverflow_spider.py
from twisted.internet import defer
import scrapy

class StackOverflow(scrapy.Spider):
    name = 'stackoverflow'

    def start_requests(self):
        urls = [
            'http://www.gob.cl/en/',
            'http://www.thaigov.go.th/en.html',
            'https://www.yahoo.com',
            'https://www.stackoverflow.com',
            'https://swapi.co/',
        ]
        for index, url in enumerate(urls):
            # create callback chain after a response is returned
            deferred = defer.Deferred()
            deferred.addCallback(self.parse_some_page)
            deferred.addCallback(self.write_to_disk, url=url, filenumber=index+1)
            # add callbacks and errorbacks as needed
            yield scrapy.Request(
                url=url,
                callback=deferred.callback)  # this func will start the callback chain AFTER a response is returned

    def parse_some_page(self, response):
        print('[1] Parsing %s' % (response.url))
        return response.body  # this will be passed to the next callback

    def write_to_disk(self, content, url, filenumber):
        print('[2] Writing %s content to disk' % (url))
        filename = '%d.html' % filenumber
        with open(filename, 'wb') as f:
            f.write(content)
        # return what you want to pass to the next callback function
        # or raise an error and start the Errbacks chain
I've changed things slightly to be a bit easier to read and run. The first thing to take note of in start_requests() is that Deferred objects are created and callback functions are being chained (via addCallback()) within the urls loop. Now take a look at the callback parameter for scrapy.Request:
yield scrapy.Request(
    url=url,
    callback=deferred.callback)
What this snippet will do is start the callback chain immediately after scrapy.Response becomes available from the request. In Twisted, Deferreds start running callback chains only after Deferred.callback(result) is executed with a value.
After a response is provided, the parse_some_page() function will run with the Response as an argument. What you will do is extract whatever you need here and pass it to the next callback (i.e. write_to_disk() in my example). You can add more callbacks to the Deferred in the loop if necessary.
So the difference between this answer and what you did originally is that you used yield to wait for all the responses first, then execute callbacks, whereas my method uses Deferred.callback() as the callback for each request, such that each response is processed immediately.
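If the Deferred mechanics are unfamiliar, here is a tiny standalone Twisted illustration (not part of the answer's spider) of the point above: nothing in the chain runs until .callback(result) is fired, and then the callbacks run in order, each receiving the previous one's return value.

from twisted.internet import defer

def step_one(result):
    print('step one got:', result)
    return result.upper()

def step_two(result):
    print('step two got:', result)

d = defer.Deferred()
d.addCallback(step_one)
d.addCallback(step_two)

print('nothing has run yet')
d.callback('hello')  # only now does the chain execute, in order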
Hopefully this helps (and/or works).
References
Twisted Deferred Reference
Explanation of parse(): Briefly summarizes how yield/return affects parsing.
Non-Blocking Recipes (Klein): A blog post I wrote a while back on async callbacks in Klein/Twisted. Might be helpful to newbies.
PS
I have no clue if this will actually work for you since I couldn't find a site that is too large to parse. Also, I'm brand-spankin' new at Scrapy :D but I have years of Twisted under my belt.

Scrapy: Changing media pipeline download priorities: How to delay media files downloads at the very end of the crawl?

http://doc.scrapy.org/en/latest/topics/media-pipeline.html
When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains "locked" at that particular pipeline stage until the files have finished downloading (or fail for some reason).
I want to do the exact opposite: Scrape all HTML urls first, then, download all media files at once. How can I do that?
Not an answer, but if you're curious to know how this behavior is implemented, check the MediaPipeline source code, especially the process_item method:
def process_item(self, item, spider):
    info = self.spiderinfo
    requests = arg_to_iter(self.get_media_requests(item, info))
    dlist = [self._process_request(r, info) for r in requests]
    dfd = DeferredList(dlist, consumeErrors=1)
    return dfd.addCallback(self.item_completed, item, info)
You see that a bunch of requests are queued, to be processed (request sent + response downloaded) BEFORE item_completed is eventually called, returning the original item + the downloaded media info.
In the nominal case, requests generated by the MediaPipeline subclasses will be sent for download immediately by using crawler.engine.download directly:
(...)
else:
    request.meta['handle_httpstatus_all'] = True
    dfd = self.crawler.engine.download(request, info.spider)
    dfd.addCallbacks(
        callback=self.media_downloaded, callbackArgs=(request, info),
        errback=self.media_failed, errbackArgs=(request, info))
    return dfd

Persist items using a POST request within a Pipeline

I want to persist items within a pipeline by posting them to a URL.
I am using this code within the Pipeline
class XPipeline(object):
    def process_item(self, item, spider):
        log.msg('in SpotifylistPipeline', level=log.DEBUG)
        yield FormRequest(url="http://www.example.com/additem",
                          formdata={'title': item['title'], 'link': item['link'], 'description': item['description']})
but it seems it's not making the http request.
Is it possible to make http request from pipelines? If not, do I have to do it in the Spider?
Do I need to specify a callback function? If so, which one?
If I can make the http call, can I check the response (JSON) and return the item if everything went ok, or discard the item if it didn't get saved?
As a final thing, is there a diagram that explains the flow that Scrapy follows from beginning to end? I am getting slightly lost about what gets called when. For instance, if pipelines returned items to spiders, what would spiders do with those items? What comes after a pipeline call?
Many thanks in advance
Migsy
You can inherit your pipeline from scrapy.contrib.pipeline.media.MediaPipeline and yield Requests in get_media_requests. Responses are passed into the media_downloaded callback.
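A rough, untested sketch of that idea. The import path is the old scrapy.contrib one from the answer (newer Scrapy uses scrapy.pipelines.media), the URL and form fields come from the question, and the class name plus the item_completed handling are illustrative assumptions:

from scrapy import FormRequest
from scrapy.contrib.pipeline.media import MediaPipeline  # scrapy.pipelines.media in newer Scrapy
from scrapy.exceptions import DropItem

class PostItemPipeline(MediaPipeline):
    """Sketch: POST each item and keep it only if the server accepts it."""

    def get_media_requests(self, item, info):
        # One POST per item; the response is delivered to media_downloaded().
        yield FormRequest(url="http://www.example.com/additem",
                          formdata={'title': item['title'],
                                    'link': item['link'],
                                    'description': item['description']})

    def media_downloaded(self, response, request, info):
        # Whatever is returned here ends up in the results passed to item_completed().
        return response.status

    def item_completed(self, results, item, info):
        # results is a list of (success, value) tuples, one per request yielded above.
        for success, status in results:
            if not success or status != 200:
                raise DropItem("Failed to post item with title %s." % item['title'])
        return item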
Quote:
This method is called for every item pipeline component and must either return an Item (or any descendant class) object or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.
So, only the spider can yield a request with a callback.
Pipelines are used for processing items.
You'd better describe what you want to achieve.
is there a diagram that explains the flow that Scrapy follows from beginning to end
Architecture overview
For instance, if Pipelines returned items to Spiders
Pipelines do not return items to spiders. The items returned are passed to the next pipeline.
This could be done easily by using the requests library. If you don't want to use another library then look into urllib2.
import requests
from scrapy.exceptions import DropItem

class XPipeline(object):
    def process_item(self, item, spider):
        r = requests.post("http://www.example.com/additem",
                          data={'title': item['title'], 'link': item['link'], 'description': item['description']})
        if r.status_code == 200:
            return item
        else:
            raise DropItem("Failed to post item with title %s." % item['title'])