Fetch pages with scrapy behind Google Authentication - scrapy

I'm trying to log into a website that uses Google credentials. This fails in my scrapy spider:
def parse(self, response):
    return scrapy.FormRequest.from_response(
        response,
        formdata={'email': self.var.user, 'password': self.var.password},
        callback=self.after_login)
Any tips?

After further inspection I managed to solve this; it turned out to be a simple issue:
The fields are Email and Passwd, in that order.
Break the login into two requests: the first for the email, the second for the password.
The code that works is as follows:
def parse(self, response):
    """
    Insert the email. Next, go to the password page.
    """
    return scrapy.FormRequest.from_response(
        response,
        formdata={'Email': self.var.user},
        callback=self.log_password)

def log_password(self, response):
    """
    Enter the password to complete the log in.
    """
    return scrapy.FormRequest.from_response(
        response,
        formdata={'Passwd': self.var.password},
        callback=self.after_login)
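For reference, a minimal sketch of the after_login callback used above, assuming you check the resulting page for some failure marker before continuing; the marker text, the follow-up URL and parse_protected are placeholders, not part of the original post:
def after_login(self, response):
    # Hypothetical failure check; the actual marker depends on the login page.
    if b"Wrong password" in response.body:
        self.logger.error("Login failed")
        return
    # Authenticated: continue with the pages that required the Google login.
    yield scrapy.Request("https://example.com/protected", callback=self.parse_protected)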

Related

Response works in Scrapy Shell, but doesn't work in code

I'm new to Scrapy. I wrote my first spider for this site, https://book24.ru/knigi-bestsellery/?section_id=1592, and it works fine:
import scrapy

class BookSpider(scrapy.Spider):
    name = 'book24'
    start_urls = ['https://book24.ru/knigi-bestsellery/']

    def parse(self, response):
        for link in response.css('div.product-card__image-holder a::attr(href)'):
            yield response.follow(link, callback=self.parse_book)
        for i in range(1, 5):
            next_page = f'https://book24.ru/knigi-bestsellery/page-{i}/'
            yield response.follow(next_page, callback=self.parse)
            print(i)

    def parse_book(self, response):
        yield {
            'name': response.css('h1.product-detail-page__title::text').get(),
            'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
        }
Now I'm trying to write a spider for just one page:
import scrapy

class BookSpider(scrapy.Spider):
    name = 'book'
    start_urls = ['https://book24.ru/product/transhumanism-inc-6015821/']

    def parse_book(self, response):
        yield {
            'name': response.css('h1.product-detail-page__title::text').get(),
            'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
        }
And it doesn't work; I get an empty file after running this command in the terminal:
scrapy crawl book -O book.csv
I don't know why.
I'd be grateful for any help!
You were getting this error:
raise NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')
NotImplementedError: BookSpider.parse callback is not defined
According to the documentation:
parse(): a method that will be called to handle the response
downloaded for each of the requests made. The response parameter is an
instance of TextResponse that holds the page content and has further
helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped
data as dicts and also finding new URLs to follow and creating new
requests (Request) from them.
Just rename your def parse_book(self, response): to def parse(self, response):.
It will work fine.
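For clarity, a minimal sketch of the corrected single-page spider, with only the callback renamed (same selectors as in the question):
import scrapy

class BookSpider(scrapy.Spider):
    name = 'book'
    start_urls = ['https://book24.ru/product/transhumanism-inc-6015821/']

    # Scrapy uses parse() as the default callback for responses to the requests
    # generated from start_urls, so this method must be named parse
    # (or the callback must be set explicitly on the request).
    def parse(self, response):
        yield {
            'name': response.css('h1.product-detail-page__title::text').get(),
            'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
        }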

Scrapy won't let me login to asp.net page (ASPX)

Hi, I'm having trouble getting my Scrapy spider to log into an ASPX (ASP.NET) website.
The script is supposed to crawl the site for product information (it's a supplier's website, so we are allowed to do this), but for whatever reason it is not able to log in. There is a username field, a password field, and an image button, but when the script runs the login simply doesn't work and we are redirected back to the main page. I believe it has something to do with the page being ASP.NET and that I apparently need to pass more information, but I've honestly tried everything and I'm at a loss for what to do next!
What am I doing wrong?
import scrapy

class LeedaB2BSpider(scrapy.Spider):
    name = 'leedab2b'
    start_urls = [
        'https://www.leedab2b.co.uk/customerlogin.aspx'
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response=response,
            formdata={'ctl00$ContentPlaceHolder1$tbUsername': 'emailaddress#gmail.com', 'ctl00$ContentPlaceHolder1$tbPassword': 'yourpassword'},
            clickdata={'id': 'ctl00_ContentPlaceHolder1_lbcustomerloginbutton'},
            callback=self.after_login)

    def after_login(self, response):
        self.logger.info("you are at %s" % response.url)
FormRequest.from_response doesn't seem to send __EVENTTARGET and __EVENTARGUMENT in formdata; try adding them manually:
formdata={
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$lbcustomerloginbutton',
    '__EVENTARGUMENT': '',
    'ctl00$ContentPlaceHolder1$tbUsername': 'emailaddress#gmail.com',
    'ctl00$ContentPlaceHolder1$tbPassword': 'yourpassword'
}
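A sketch of how that might look dropped into the spider's parse method, assuming the control names and button id from the question are correct. from_response still pre-fills __VIEWSTATE and the other hidden ASP.NET fields from the form; whether clickdata is still needed depends on how the page wires its postback, so treat this as a starting point rather than a guaranteed fix:
def parse(self, response):
    return scrapy.FormRequest.from_response(
        response=response,
        formdata={
            # Simulate the ASP.NET postback that the login button would trigger.
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$lbcustomerloginbutton',
            '__EVENTARGUMENT': '',
            'ctl00$ContentPlaceHolder1$tbUsername': 'emailaddress#gmail.com',
            'ctl00$ContentPlaceHolder1$tbPassword': 'yourpassword',
        },
        callback=self.after_login)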

Web Crawler not printing pages correctly

Good morning!
I've developed a very simple spider with Scrapy just to get used to FormRequest. I'm trying to send a request to this page: https://www.caramigo.eu/, which should lead me to a page like this one: https://www.caramigo.eu/be/fr/recherche?address=Belgique%2C+Li%C3%A8ge&date_debut=16-03-2019&date_fin=17-03-2019. The issue is that my spider does not render the page correctly (the car images and info do not appear at all), and therefore I can't collect any data from it. Here is my spider:
import scrapy

class CarSpider(scrapy.Spider):
    name = "caramigo"

    def start_requests(self):
        urls = [
            'https://www.caramigo.eu/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.search_line)

    def search_line(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'address': 'Belgique, Liège', 'date_debut': '16-03-2019', 'date_fin': '17-03-2019'},
            callback=self.parse
        )

    def parse(self, response):
        filename = 'caramigo.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Sorry if the syntax is not correct, I'm pretty new to coding.
Thank you in advance!
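One way to see exactly what the spider received (a debugging aid rather than an answer) is Scrapy's open_in_browser helper, which opens the downloaded response in a browser so you can check whether the car listings are present in the raw HTML at all:
from scrapy.utils.response import open_in_browser

def parse(self, response):
    # Opens the response Scrapy actually downloaded in your browser, so you can
    # check whether the listings are in the HTML or loaded later by JavaScript.
    open_in_browser(response)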

open link authentication using scrapy

Hello, I'm having trouble using Scrapy.
I want to scrape some data from clinicalkey.com.
I have an ID and password for my hospital, and my hospital has access to clinicalkey.com, so if I log into my hospital's library page I can also use clinicalkey.com without further authentication.
But my Scrapy script doesn't work, and I can't figure out why.
My script is here:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.FormRequest(loginsite, formdata={'id': 'Myid', 'password': 'MyPassword'}, callback=self.after_login)

    def after_login(self, response):
        yield scrapy.Request(clinicalkeysite, callback=self.parse_detail)

    def parse_detail(self, response):
        blahblah
When I look at the final response, it contains a "You need to log in" message.
This site uses a JSON body for authentication.
Try something like this:
import json

body = json.dumps({"username": yourname, "password": yourpassword, "remember_me": True, "product": "CK_US"})
yield scrapy.FormRequest(loginsite, body=body, callback=self.after_login)
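A slightly fuller sketch of that idea, assuming the endpoint expects a JSON payload; the Content-Type header is an assumption, and loginsite, self.username and self.password stand in for the real endpoint and credentials. A plain scrapy.Request works here because the body is built by hand:
import json
import scrapy

def start_requests(self):
    payload = {
        "username": self.username,   # placeholder credential attributes
        "password": self.password,
        "remember_me": True,
        "product": "CK_US",
    }
    yield scrapy.Request(
        loginsite,                   # assumed login endpoint URL
        method="POST",
        body=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        callback=self.after_login,
    )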

Scrapy how to remove a url from httpcache or prevent adding to cache

I am using the latest Scrapy version, v1.3.
I crawl a website page by page, following the URLs in its pagination. On some pages the website detects that I am a bot and returns an error in the HTML. Since it is a successful request, the page gets cached, and when I run the crawl again I get the same error.
What I need is a way to prevent that page from getting into the cache. Or, if I cannot do that, I need to remove it from the cache after I notice the error in the parse method, so I can retry and get the correct page.
I have a partial solution: I yield all requests with the "dont_cache": False parameter in meta so I make sure they use the cache. Where I detect the error and retry the request, I put dont_filter=True along with "dont_cache": True to make sure I get a fresh copy of the erroneous URL.
def parse(self, response):
    page = response.meta["page"] + 1
    html = Selector(response)
    counttext = html.css('h2#s-result-count::text').extract_first()
    if counttext is None:
        page = page - 1
        yield Request(url=response.url, callback=self.parse, meta={"page": page, "dont_cache": True}, dont_filter=True)
I also tried a custom retry middleware, where I managed to get it running before the cache, but I couldn't read response.body successfully. I suspect it is compressed somehow, as it is binary data.
class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        with open('debug.txt', 'wb') as outfile:
            outfile.write(response.body)
        html = Selector(text=response.body)
        url = response.url
        counttext = html.css('h2#s-result-count::text').extract_first()
        if counttext is None:
            log.msg("Automated process error: %s" % url, level=log.INFO)
            reason = 'Automated process error %d' % response.status
            return self._retry(request, reason, spider) or response
        return response
Any suggestions are appreciated.
Thanks
Mehmet
The middleware responsible for request/response caching is HttpCacheMiddleware. Under the hood it is driven by cache policies: special classes which decide which requests and responses should or shouldn't be cached. You can implement your own cache policy class and use it with the setting
HTTPCACHE_POLICY = 'my.custom.cache.Class'
More information in docs: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
Source code of built-in policies: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/httpcache.py#L18
Thanks to mizhgun, I managed to develop a solution using custom policies.
Here is what I did:
from scrapy.utils.httpobj import urlparse_cached

class CustomPolicy(object):

    def __init__(self, settings):
        self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_request(self, request):
        return urlparse_cached(request).scheme not in self.ignore_schemes

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes

    def is_cached_response_fresh(self, response, request):
        if "refresh_cache" in request.meta:
            return False
        return True

    def is_cached_response_valid(self, cachedresponse, response, request):
        if "refresh_cache" in request.meta:
            return False
        return True
And when I catch the error (after caching has occurred, of course):
def parse(self, response):
    html = Selector(response)
    counttext = html.css('selector').extract_first()
    if counttext is None:
        yield Request(url=response.url, callback=self.parse, meta={"refresh_cache": True}, dont_filter=True)
When you add refresh_cache to meta, it can be caught in the custom policy class.
Don't forget to add dont_filter=True, otherwise the second request will be filtered out as a duplicate.
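To wire the policy in, a sketch of the relevant settings.py entries; the module path myproject.policies is an assumption, so point HTTPCACHE_POLICY at wherever CustomPolicy actually lives:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'myproject.policies.CustomPolicy'  # assumed module path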