How to get the redirected url in the process_request() of scrapy RedirectMiddleware?

For example:
the URL http://www.wandoujia.com/search?key=saber
is redirected to the new URL http://www.wandoujia.com/search/3161097853842468421.
I want to get the new URL in the process_request() of scrapy's RedirectMiddleware.
Following is my code:
import logging
import re

from scrapy import Request


class RedirectMiddleware(object):
    def process_request(self, request, spider):
        new_url = request.url
        logging.debug('new_url = ' + new_url)
        logging.debug('****************************')
        patterns = spider.request_pattern
        logging.debug(patterns)
        for pattern in patterns:
            obj = re.match(pattern, new_url)
            if obj:
                return Request(new_url)
PS: request.url here is still the old URL; I want to get the new URL instead.

Try replacing the default middleware with something like this (the method you're looking for is process_response, because it is the response that "contains" the redirection):
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


class CustomRedirectMiddleware(RedirectMiddleware):
    def process_response(self, request, response, spider):
        redirected = super(CustomRedirectMiddleware, self).process_response(
            request, response, spider)
        if isinstance(redirected, request.__class__):
            print("Original url: <{}>".format(request.url))
            print("Redirected url: <{}>".format(redirected.url))
            return redirected
        return response
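To make Scrapy pick up this subclass instead of the stock middleware, register it in the downloader middleware settings and disable the built-in one. A minimal sketch, assuming the class lives in a hypothetical myproject/middlewares.py:

# settings.py -- sketch; 'myproject.middlewares' is an assumed module path
DOWNLOADER_MIDDLEWARES = {
    # turn off the built-in redirect middleware (registered at priority 600 by default)
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    # plug the subclass in at the same priority so redirects are still followed
    'myproject.middlewares.CustomRedirectMiddleware': 600,
}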

Related

Can I use a scrapy POST request without a callback?

I need to update the location on a site that uses a radio button. This can be done with a simple POST request. The problem is that the output of this request is
window.location='http://store.intcomex.com/en-XCL/Products/Categories?r=True';
Since this is not a valid URL, Scrapy redirects it to PageNotFound and closes the spider.
2017-09-17 09:57:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://store.intcomex.com/en-XCL/ServiceClient/PageNotFound> from <POST https://store.intcomex.com/en-XCL//User/SetNewLocation>
Here is my code:
def after_login(self, response):
    # inspect_response(response, self)
    url = "https://store.intcomex.com/en-XCL//User/SetNewLocation"
    data = {"id": "xclf1"}
    yield scrapy.FormRequest(url, formdata=data, callback=self.location)
    # inspect_response(response, self)

def location(self, response):
    yield scrapy.Request(url='http://store.intcomex.com/en-XCL/Products/Categories?r=True', callback=self.categories)
The question is: how can I redirect Scrapy to a valid URL after executing the POST request that changes the location? Is there some argument that indicates the target URL, or can I execute the request without a callback and yield the correct URL on the next line?
Thanks.
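One possible direction (a sketch only, not a confirmed answer): use the dont_redirect and handle_httpstatus_list request meta keys so the RedirectMiddleware leaves the bogus 302 alone, then yield the known-good URL yourself. self.categories is assumed to be the existing category parser:

import scrapy

def after_login(self, response):
    # dont_redirect keeps RedirectMiddleware from following the broken Location header;
    # handle_httpstatus_list lets the 302 response reach our callback instead of being dropped
    yield scrapy.FormRequest(
        "https://store.intcomex.com/en-XCL//User/SetNewLocation",
        formdata={"id": "xclf1"},
        meta={"dont_redirect": True, "handle_httpstatus_list": [302]},
        callback=self.location,
    )

def location(self, response):
    # assuming the session now carries the new location, request the real page directly
    yield scrapy.Request(
        "http://store.intcomex.com/en-XCL/Products/Categories?r=True",
        callback=self.categories,
    )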

scrapy downloader middleware fails to schedule a request from process_response

I want to push a request whose response code is 423 back to the scheduler, so I created a downloader middleware:
class MyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 423:
            return request
        else:
            return response
but it does not work; the request never even makes it back into the scheduler.
Thank you for your help!
Your new request is probably getting filtered out by scrapy's dupefilter.
You can try adding the dont_filter=True parameter:
def process_response(self, request, response, spider):
    if response.status == 423:
        request = request.replace(dont_filter=True)
        return request
    else:
        return response
Alternatively, you can add these two settings to your Scrapy settings:
RETRY_HTTP_CODES = [423]
RETRY_TIMES = 10
and Scrapy's built-in RetryMiddleware will manage the retries for you.
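For placement, these go in the project's settings.py; a minimal sketch (the values are just the ones suggested above):

# settings.py -- sketch; RetryMiddleware is enabled by default,
# so no DOWNLOADER_MIDDLEWARES change is needed for this to take effect
RETRY_ENABLED = True        # the default
RETRY_HTTP_CODES = [423]    # also retry 423 responses
RETRY_TIMES = 10            # give up after 10 extra attempts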

Scrapy: how to remove a URL from the httpcache or prevent it from being added to the cache

I am using the latest Scrapy version, v1.3.
I crawl a website page by page, by following the URLs in its pagination. On some pages the website detects that I am a bot and returns an error in the HTML. Since it is still a successful request, the page gets cached, and when I run the crawl again I get the same error.
What I need is a way to prevent that page from getting into the cache. Or, if I cannot do that, I need to remove it from the cache after I notice the error in the parse method; then I can retry and get the correct page.
I have a partial solution: I yield all requests with "dont_cache": False in meta so I make sure they use the cache. Where I detect the error and retry the request, I set dont_filter=True along with "dont_cache": True to make sure I get a fresh copy of the erroneous URL.
def parse(self, response):
    page = response.meta["page"] + 1
    html = Selector(response)
    counttext = html.css('h2#s-result-count::text').extract_first()
    if counttext is None:
        page = page - 1
        yield Request(url=response.url, callback=self.parse,
                      meta={"page": page, "dont_cache": True}, dont_filter=True)
I also tried a custom retry middleware, where I managed to get it to run before the cache, but I couldn't read response.body successfully. I suspect it is compressed somehow, as it is binary data.
class CustomRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        with open('debug.txt', 'wb') as outfile:
            outfile.write(response.body)
        html = Selector(text=response.body)
        url = response.url
        counttext = html.css('h2#s-result-count::text').extract_first()
        if counttext is None:
            log.msg("Automated process error: %s" % url, level=log.INFO)
            reason = 'Automated process error %d' % response.status
            return self._retry(request, reason, spider) or response
        return response
Any suggestion is appreciated.
Thanks
Mehmet
The middleware responsible for request/response caching is HttpCacheMiddleware. Under the hood it is driven by cache policies: special classes which decide which requests and responses should or shouldn't be cached. You can implement your own cache policy class and use it with the setting
HTTPCACHE_POLICY = 'my.custom.cache.Class'
More information in docs: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
Source code of built-in policies: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/httpcache.py#L18
Thanks to mizhgun, I managed to develop a solution using custom policies.
Here is what I did:
from scrapy.utils.httpobj import urlparse_cached


class CustomPolicy(object):
    def __init__(self, settings):
        self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_request(self, request):
        return urlparse_cached(request).scheme not in self.ignore_schemes

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes

    def is_cached_response_fresh(self, response, request):
        if "refresh_cache" in request.meta:
            return False
        return True

    def is_cached_response_valid(self, cachedresponse, response, request):
        if "refresh_cache" in request.meta:
            return False
        return True
And when I catch the error (after caching has occurred, of course):
def parse(self, response):
    html = Selector(response)
    counttext = html.css('selector').extract_first()
    if counttext is None:
        yield Request(url=response.url, callback=self.parse,
                      meta={"refresh_cache": True}, dont_filter=True)
When you add refresh_cache to meta, it can be caught in the custom policy class.
Don't forget to add dont_filter; otherwise the second request will be filtered out as a duplicate.
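To wire the policy up, the cache settings need to point at it; a minimal sketch, assuming the class lives in a hypothetical myproject/policies.py and using example values for the ignore lists:

# settings.py -- sketch; 'myproject.policies' is an assumed module path
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'myproject.policies.CustomPolicy'
# read by CustomPolicy.__init__ above (example values)
HTTPCACHE_IGNORE_HTTP_CODES = [403, 503]
HTTPCACHE_IGNORE_SCHEMES = ['file']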

How to pass a response to a spider without fetching a web page?

The scrapy documentation specifically mentions that I should use a downloader middleware if I want to pass a response to a spider without actually fetching the web page. However, I can't find any documentation or examples of how to achieve this.
I am interested in passing only the URL to the request callback, populating an item's file_urls field with the URL (and certain permutations thereof), and using the FilesPipeline to handle the actual download.
How can I write a downloader middleware class that passes the URL to the spider while avoiding downloading the web page?
You can return a Response object from the downloader middleware's process_request() method. This method is called for every request your spider yields.
Something like:
from scrapy.http import Response


class NoDownloadMiddleware(object):
    def process_request(self, request, spider):
        # only process marked requests
        if not request.meta.get('only_download'):
            return
        # now make a Response object however you wish
        response = Response(request.url)
        return response
and in your spider:
def parse(self, response):
    yield Request(some_url, meta={'only_download': True})
and in your settings.py activate the middleware:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.NoDownloadMiddleware': 543,
}
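Since the goal in the question is to feed the never-downloaded URLs into the FilesPipeline, the callback can then populate file_urls and let the pipeline do the real downloading. A rough sketch under those assumptions; the URL permutation and FILES_STORE path are placeholders:

# spider callback: build the item from the URL alone, no page body needed
def parse(self, response):
    yield {
        'file_urls': [response.url, response.url + '?size=large'],  # hypothetical permutation
    }

# settings.py: enable the built-in FilesPipeline and tell it where to store files
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/path/to/downloads'  # placeholder path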

how to use get method in groovyx.net.http.RESTClient properly

I'm trying to get a JSON file via the get method of RESTClient.
Right now I'm trying:
def url = 'http://urlurlurl'
def username = 'username'
def password = 'password'
def restClient = new RESTClient(url)
restClient.auth.basic(username, password)
render restClient
When I look at what I get from restClient, it just prints
'groovyx.net.http.RESTClient#65333e2e'
which is hard to understand.
Given that the URL is an endpoint of an API GET method and returns JSON, how can I retrieve the JSON so I can parse it and use it?
I'm also trying this:
def url = 'http://urlurlurl'
def username = 'username'
def password = 'password'
def restClient = new RESTClient(url)
restClient.auth.basic(username, password)
//Adding get method
def jsonData = restClient.get(/* what value should I put in here?? */)
This gives me a forbidden error that says:
Error 500: Internal Server Error
URI: JsonRender
Class: groovyx.net.http.HttpResponseException
Message: Forbidden
Any suggestions? Examples that use the get method in RESTClient would be nice.
The url should be the base URL for your API. For example, suppose we want to search for some data from an API whose complete URL is http://localhost:9200/user/app/_search. Then the base URL is http://localhost:9200/ and the path to the API is user/app/_search. Now the request looks like this:
def client = new RESTClient( 'http://localhost:9200/' )
def resp = client.get( path : 'user/app/_search')
log.debug (resp.getContentAsString())
Hope this will work out.
Thanks,