Can I use a Scrapy POST request without a callback? - scrapy

I need to update the location on a site that uses a radio button. This can be done with a simple POST request. The problem is that the output of this request is
window.location='http://store.intcomex.com/en-XCL/Products/Categories?r=True';
Since it is not a valid URL, Scrapy redirects it to a PageNotFound page and closes the spider.
2017-09-17 09:57:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://store.intcomex.com/en-XCL/ServiceClient/PageNotFound> from <POST https://store.intcomex.com/en-XCL//User/SetNewLocation>
Here is my code:
def after_login(self, response):
    # inspect_response(response, self)
    url = "https://store.intcomex.com/en-XCL//User/SetNewLocation"
    data = {"id": "xclf1"}
    yield scrapy.FormRequest(url, formdata=data, callback=self.location)

def location(self, response):
    yield scrapy.Request(url='http://store.intcomex.com/en-XCL/Products/Categories?r=True', callback=self.categories)
The question is: how can I redirect Scrapy to a valid URL after executing the POST request that changes the location? Is there some argument that indicates the target URL, or can I execute the request without a callback and yield the correct URL on the next line?
Thanks.
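One possible approach (a sketch, not from the original thread): the POST response carries a JavaScript window.location snippet rather than a real HTTP redirect, so the target URL can be pulled out of the body with a regular expression and yielded as the next request instead of letting the redirect middleware follow the broken Location.

```python
import re

def extract_js_redirect(body_text):
    """Pull the target URL out of a window.location='...' snippet, if present."""
    m = re.search(r"window\.location\s*=\s*'([^']+)'", body_text)
    return m.group(1) if m else None

# Inside the spider, the callback could then look like (hypothetical sketch):
# def location(self, response):
#     target = extract_js_redirect(response.text)
#     if target:
#         yield scrapy.Request(target, callback=self.categories)
```

Combined with meta={'dont_redirect': True} on the FormRequest, this keeps Scrapy from following the bogus 302 in the first place.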

Related

scrapy.Request not going through

The crawling process seems to ignore and/or not execute the line yield scrapy.Request(property_file, callback=self.parse_property). The first scrapy.Request in def start_requests goes through and is executed properly, but not the one in def parse_navpage, as seen here.
import scrapy

class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    allowed_domains = ['zoopla.co.uk']

    def start_requests(self):
        # Read source from file
        navpage_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)

    def parse_navpage(self, response):
        listings = response.xpath("//div[starts-with(@data-testid, 'search-result_listing_')]")
        for listing in listings:
            listing_url = listing.xpath(
                "//a[@data-testid='listing-details-link']/@href").getall()  # List of property urls
            break
        print(listing_url)  # Works
        property_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        print("BEFORE YIELD")
        yield scrapy.Request(property_file, callback=self.parse_property)  # Not going through
        print("AFTER YIELD")

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)
        print("PARSE PROPERTY AFTER URL")
Running scrapy crawl scrape_zoopla on the command line returns:
2022-09-10 20:38:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html> (referer: None)
BEFORE YIELD
AFTER YIELD
2022-09-10 20:38:24 [scrapy.core.engine] INFO: Closing spider (finished)
Both scrapy.Requests requested local files, and only the first one worked. The files exist and properly display the pages; if one of them did not, the crawler would return a "No such file or directory" error and likely be interrupted. It seems the crawler just passed right over the second request without executing it, and returned no error. What is the error here?
This is a total shot in the dark, but you could try sending both requests from your start_requests method. I honestly don't see why this would work, but it might be worth a shot.
import scrapy

class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    allowed_domains = ['zoopla.co.uk']

    def start_requests(self):
        # Read source from file
        navpage_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        property_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)
        yield scrapy.Request(property_file, callback=self.parse_property)

    def parse_navpage(self, response):
        listings = response.xpath("//div[starts-with(@data-testid, 'search-result_listing_')]")
        for listing in listings:
            listing_url = listing.xpath(
                "//a[@data-testid='listing-details-link']/@href").getall()  # List of property urls
            break
        print(listing_url)  # Works

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)
        print("PARSE PROPERTY AFTER URL")
Update
It just dawned on me why this is happening: you have the allowed_domains attribute set, but the request you are making is to your local file system, which naturally is not going to match the allowed domain.
Scrapy assumes that all of the initial URLs sent from start_requests are permitted and therefore doesn't do any verification for those, but all requests yielded from subsequent parse methods are checked against the allowed_domains attribute.
Just remove that line from the top of your spider class and your original structure should work fine.
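To see why the second request is dropped, the domain check performed by Scrapy's offsite middleware can be sketched roughly like this (a simplification for illustration, not Scrapy's actual code):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Rough sketch of the check the offsite middleware performs."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

# A file:// URL has no hostname, so it can never match the allowed list
# and the request yielded from parse_navpage is silently filtered out.
```

Since no exception is raised for filtered requests, this also explains why the spider finished without any error.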

How to callback on 301 redirect without crawling in scrapy?

I am scraping a search result page where in some cases a 301 redirect will be triggered. In that case I do not want to crawl that page, but I need to call a different callback function, passing the redirect URL string to it.
I believe it should be possible to do this via the rules, but I could not figure out how:
class GetbidSpider(CrawlSpider):
    handle_httpstatus_list = [301]
    rules = (
        Rule(
            LinkExtractor(
                allow=['^https://www\.testrule*$'],
            ),
            follow=False,
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        self.logger.info('Parsing %s', response.url)
        print(response.status)
        print(response.headers[b'Location'])
The logfile only shows:
DEBUG: Crawled (301) <GET https:...
But the parsing info never gets printed, indicating the function is never entered.
How can I achieve this?
I really can't understand why my suggestions don't work for you. This is tested code:
import scrapy

class RedirectSpider(scrapy.Spider):
    name = 'redirect_spider'

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.moneycontrol.com/india/stockpricequote/pesticidesagrochemicals/piindustries/PII',
            meta={'handle_httpstatus_list': [301]},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.status)
        print(response.headers[b'Location'])

Scrapy skipped one page during parsing

Here is my spider, which I run from a script to parse the content of my local DokuWiki:
import hashlib
import re

import keyring
import scrapy
from scrapy.crawler import CrawlerProcess

DEBUG = True
if DEBUG:
    f_debug = open('debug.log', 'w')

md5s = []

class DokuWikiMd5Spider(scrapy.Spider):
    name = 'dokuwikispider'
    start_urls = ['https://dokuwiki.mjcc.lasil.ru/doku.php']
    visited = []
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
    }

    @staticmethod
    def get_page_name(url):
        url = url.replace("https://dokuwiki.mjcc.lasil.ru/doku.php?", '')
        if 'id=start&do=search' in url:
            # because credentials are in the URL, here we cut only the page name
            # https://dokuwiki.mjcc.lasil.ru/doku.php?id=start&do=search&id=%D0%BF%D0%BE%D1%81%D1%82%D0%B0%D0%B2%D1%89%D0%B8%D0%BA%D0%B8_%D0%B8_%D0%BA%D0%BE%D0%BD%D1%82%D0%B0%D0%BA%D1%82%D1%8B&q=&p=PASSWORD&u=admin
            m = re.findall('id=([^&]+)', url)
            return m[1]
        else:
            m = re.search('id=([^&]+)', url)
            return m.group(1)

    def parse(self, response):
        password = keyring.get_password('dokuwiki', 'admin')
        return scrapy.FormRequest.from_response(
            response,
            formdata={'u': 'admin', 'p': password},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeeded before going on
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        # continue scraping with authenticated session...
        if DEBUG:
            f_debug.write("parsing: {}\n".format(response.url))
        text = response.text
        # cut everything except page content, so the comparison does not depend on wiki settings
        m = re.findall('.*(<!-- wikipage start -->.*<!-- wikipage stop -->).*', text, re.DOTALL)
        text = m[0]  # findall with a single group returns strings, so take the first match
        # with open(r'F:\TEMP\test.html','w') as f:
        #     f.write(text)
        md5 = hashlib.md5()
        md5.update(text.encode('utf-8'))
        md5s.append({'url': self.get_page_name(response.url), 'md5': md5.hexdigest()})
        yield {'url': self.get_page_name(response.url), 'md5': md5.hexdigest()}
        for next_page in response.xpath('//a/@href'):
            next_url = next_page.extract()
            if DEBUG:
                f_debug.write("\t?next page: {}\n".format(next_url))
            if 'doku.php?id=' in next_url:
                # to process every page name only one time
                next_page_name = self.get_page_name(next_url)
                if next_page_name not in self.visited:
                    if DEBUG:
                        f_debug.write("\t\t!\n")
                    self.visited.append(next_page_name)
                    yield response.follow("https://dokuwiki.mjcc.lasil.ru/{}&u=admin&p={}".format(next_url, keyring.get_password('dokuwiki', 'admin')), self.after_login)

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(DokuWikiMd5Spider)
process.start()  # the script will block here until the crawling is finished
So in the debug messages I see that the spider crawled the page 'wiki_backup':
2019-01-28 19:49:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dokuwiki.mjcc.lasil.ru//doku.php?id=wiki_backup&u=admin&p=PASSWORD> (referer: https://dokuwiki.mjcc.lasil.ru//doku.php?id=%D1%81%D0%BE%D0%B7%D0%B4%D0%B0%D0%BD%D0%B8%D0%B5_%D0%B8_%D0%BF%D1%80%D0%BE%D0%B2%D0%B5%D1%80%D0%BA%D0%B0_%D0%B1%D1%8D%D0%BA%D0%B0%D0%BF%D0%BE%D0%B2&u=admin&p=PASSWORD)
And I can see its content in the crawl method, as shown in the screenshot.
But that page wasn't parsed even once, as you can see in debug.log:
root#F91_Moin20:/home/ishayahu # cat debug.log | grep wiki_backup
?next page: /doku.php?id=wiki_backup
The problem was in the way the spider checks whether authentication failed. It (as in the tutorial) searches for the words "authentication failed", but because I had the same words in the page content, the spider thought there was an authentication error and stopped processing the page.
There should be another way to check whether authentication really failed.
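One hedged alternative (an assumption about DokuWiki's markup, not taken from the thread): instead of searching the whole body for a phrase that can legitimately occur in page content, check for a structural marker that only appears on the login screen, such as the login form's element id. The exact marker below is a guess and should be verified against the real login page HTML.

```python
def login_failed(body: bytes) -> bool:
    """Heuristic: DokuWiki renders a login form when authentication fails.

    Checking for a form id is more robust than searching the whole page for
    the phrase "authentication failed".  The id used here is an assumption.
    """
    return b'id="dw__login"' in body
```

In after_login the check `if b"authentication failed" in response.body` would then become `if login_failed(response.body)`.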

Scrapy how to remove a url from httpcache or prevent adding to cache

I am using the latest Scrapy version, v1.3.
I crawl a webpage page by page, by following the URLs in the pagination. On some pages the website detects that I am using a bot and returns an error inside the HTML. Since it is a successful request, the page gets cached, and when I run the crawl again I get the same error.
What I need is a way to prevent that page from getting into the cache. Or, if I cannot do that, I need to remove it from the cache after I notice the error in the parse method; then I can retry and get the correct one.
I have a partial solution: I yield all requests with the "dont_cache": False parameter in meta, so I make sure they use the cache. Where I detect the error and retry the request, I put dont_filter=True along with "dont_cache": True to make sure I get a fresh copy of the erroneous URL.
def parse(self, response):
    page = response.meta["page"] + 1
    html = Selector(response)
    counttext = html.css('h2#s-result-count::text').extract_first()
    if counttext is None:
        page = page - 1
        yield Request(url=response.url, callback=self.parse, meta={"page": page, "dont_cache": True}, dont_filter=True)
I also tried a custom retry middleware, where I managed to get it working before the cache, but I couldn't read response.body successfully. I suspect it is compressed somehow, as it is binary data.
class CustomRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        with open('debug.txt', 'wb') as outfile:
            outfile.write(response.body)
        html = Selector(text=response.body)
        url = response.url
        counttext = html.css('h2#s-result-count::text').extract_first()
        if counttext is None:
            log.msg("Automated process error: %s" % url, level=log.INFO)
            reason = 'Automated process error %d' % response.status
            return self._retry(request, reason, spider) or response
        return response
Any suggestion is appreciated.
Thanks
Mehmet
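One possible explanation for the unreadable response.body in the middleware above (an inference, not stated in the thread): depending on its position in DOWNLOADER_MIDDLEWARES, a custom middleware's process_response may run before HttpCompressionMiddleware has decompressed the response, so it sees raw gzip bytes. A standard-library sketch for detecting and decompressing such a body:

```python
import gzip

def maybe_decompress(body: bytes) -> bytes:
    """Decompress a gzip-encoded body; pass plain bodies through unchanged."""
    if body[:2] == b'\x1f\x8b':  # gzip magic number
        return gzip.decompress(body)
    return body
```

Calling maybe_decompress(response.body) before feeding it to Selector would then make the HTML parseable either way.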
The middleware responsible for request/response caching is HttpCacheMiddleware. Under the hood it is driven by cache policies - special classes which decide which requests and responses should or shouldn't be cached. You can implement your own cache policy class and use it with the setting
HTTPCACHE_POLICY = 'my.custom.cache.Class'
More information in docs: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
Source code of built-in policies: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/httpcache.py#L18
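For completeness, the policy setting only takes effect when the cache itself is enabled; a minimal settings.py fragment (the module path and ignored code are placeholders):

```python
# settings.py - minimal HTTP cache configuration (paths are placeholders)
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'myproject.policies.CustomPolicy'
HTTPCACHE_IGNORE_HTTP_CODES = [503]  # example: never cache throttling errors
```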
Thanks to mizhgun, I managed to develop a solution using custom policies.
Here is what I did,
from scrapy.utils.httpobj import urlparse_cached

class CustomPolicy(object):
    def __init__(self, settings):
        self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_request(self, request):
        return urlparse_cached(request).scheme not in self.ignore_schemes

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes

    def is_cached_response_fresh(self, response, request):
        if "refresh_cache" in request.meta:
            return False
        return True

    def is_cached_response_valid(self, cachedresponse, response, request):
        if "refresh_cache" in request.meta:
            return False
        return True
And when I catch the error (after caching has occurred, of course):
def parse(self, response):
    html = Selector(response)
    counttext = html.css('selector').extract_first()
    if counttext is None:
        yield Request(url=response.url, callback=self.parse, meta={"refresh_cache": True}, dont_filter=True)
When you add refresh_cache to meta, it can be caught in the custom policy class.
Don't forget to add dont_filter, otherwise the second request will be filtered as a duplicate.

How to pass response to a spider without fetching a web page?

The Scrapy documentation specifically mentions that I should use downloader middleware if I want to pass a response to a spider without actually fetching the web page. However, I can't find any documentation or examples of how to achieve this.
I am interested in passing only the URL to the request callback, populating an item's file_urls field with the URL (and certain permutations thereof), and using the FilesPipeline to handle the actual download.
How can I write a downloader middleware class that passes the URL to the spider while avoiding downloading the web page?
You can return a Response object in the downloader middleware's process_request() method. This method is called for every request your spider yields. Something like:
from scrapy.http import Response

class NoDownloadMiddleware(object):
    def process_request(self, request, spider):
        # only process marked requests
        if not request.meta.get('only_download'):
            return
        # now make the Response object however you wish
        response = Response(request.url)
        return response
and in your spider:
def parse(self, response):
    yield Request(some_url, meta={'only_download': True})
and in your settings.py activate the middleware:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.NoDownloadMiddleware': 543,
}