open link authentication using scrapy - scrapy

hello I have a trouble using scrapy
I want to scrap some data from clinicalkey.com
I have a id, password for my hospital and my hospital has authority of clinicalkey.com
so if I log in to my hospital's library page, I also can use clincalkey.com without authentication
But My scrapy script didn't work. I can't findout why this is not working
My script here
import scrapy
class MySpider(scrapy.Spider):
name = 'myspider'
def start_requests(self):
yield scrapy.FormRequest(loginsite, formdata={'id': 'Myid', 'password': 'MyPassword'}, callback=self.after_login)
def after_login(self, response):
yield scrapy.Request(clinicalkeysite, callback=self.parse_datail)
def parse_detail(self, response):
blahblah
When I see the final response, It has message about "You need login"..

This site use json body form for authenification.
Try something like this:
body = '{"username":"{}","password":"{}","remember_me":true,"product":"CK_US"}'.format(yourname, yourpassword)
yield scrapy.FormRequest(loginsite, body=body, callback=self.after_login)

Related

Response works in Scrapy Shell, but doesn't work in code

I'm new in Scrapy. I wrote my first spider for this site https://book24.ru/knigi-bestsellery/?section_id=1592 and it works fine
import scrapy
class BookSpider(scrapy.Spider):
name = 'book24'
start_urls = ['https://book24.ru/knigi-bestsellery/']
def parse(self, response):
for link in response.css('div.product-card__image-holder a::attr(href)'):
yield response.follow(link, callback=self.parse_book)
for i in range (1, 5):
next_page = f'https://book24.ru/knigi-bestsellery/page-{i}/'
yield response.follow(next_page, callback=self.parse)
print(i)
def parse_book(self, response):
yield{
'name': response.css('h1.product-detail-page__title::text').get(),
'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
}
Now I try to write a spider only for one page
import scrapy
class BookSpider(scrapy.Spider):
name = 'book'
start_urls = ['https://book24.ru/product/transhumanism-inc-6015821/']
def parse_book(self, response):
yield{
'name': response.css('h1.product-detail-page__title::text').get(),
'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
}
And it doesn't work, I get an empty file after this command in terminal.
scrapy crawl book -O book.csv
I don't know why.
Will be grateful for the help!
You were getting raise
NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')
NotImplementedError: BookSpider.parse callback is not defined
according the document
parse(): a method that will be called to handle the response
downloaded for each of the requests made. The response parameter is an
instance of TextResponse that holds the page content and has further
helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped
data as dicts and also finding new URLs to follow and creating new
requests (Request) from them.
just rename your def parse_book(self, response): to def parse(self, response):
Its work fine.

Looping through pages of Web Page's Request URL with Scrapy

I'm looking to adapt this tutorial, (https://medium.com/better-programming/a-gentle-introduction-to-using-scrapy-to-crawl-airbnb-listings-58c6cf9f9808) to scraping this site of tiny home listings: https://tinyhouselistings.com/.
The tutorial uses the request URL, to get a very complete and clean JSON file, but does so for the first page only. It seems that looping through the 121 pages of my tinyhouselistings request url should be pretty straight-forward but I have not been able to get anything to work. The tutorial does not loop through the pages of the request url, but rather uses scrapy splash, run within a Docker container to get all the listings. I am willing to try that, but I just feel like it should be possible to loop through this request url.
This outputs only the first page only of the tinyhouselistings request url for my project:
import scrapy
class TinyhouselistingsSpider(scrapy.Spider):
name = 'tinyhouselistings'
allowed_domains = ['tinyhouselistings.com']
start_urls = ['http://www.tinyhouselistings.com']
def start_requests(self):
url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page=1'
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
_file = "tiny_listings.json"
with open(_file, 'wb') as f:
f.write(response.body)
I've tried this:
class TinyhouselistingsSpider(scrapy.Spider):
name = 'tinyhouselistings'
allowed_domains = ['tinyhouselistings.com']
start_urls = ['']
def start_requests(self):
url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page='
for page in range(1, 121):
self.start_urls.append(url + str(page))
yield scrapy.Request(url=start_urls, callback=self.parse)
But I'm not sure how to then pass start_urls to parse so as to write the response to the json being written at the end of the script.
Any help would be much appreciated!
Remove allowed_domains = ['tinyhouselistings.com'] because the url thl-prod.global.ssl.fastly.net will be filtered out by Scrapy
Since you are using start_requests method so you do not need start_urls, you can only have either of them
import json
class TinyhouselistingsSpider(scrapy.Spider):
name = 'tinyhouselistings'
listings_url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page={}'
def start_requests(self):
page = 1
yield scrapy.Request(url=self.listings_url.format(page),
meta={"page": page},
callback=self.parse)
def parse(self, response):
resp = json.loads(response.body)
for ad in resp["listings"]:
yield ad
page = int(response.meta['page']) + 1
if page < int(listings['meta']['pagination']['page_count'])
yield scrapy.Request(url=self.listings_url.format(page),
meta={"page": page},
callback=self.parse)
From terminal, run spider using to save scraped data to a JSON file
scrapy crawl tinyhouselistings -o output_file.json

How to use python requests with scrapy?

I am trying to use requests to fetch a page then pass the response object to a parser, but I ran into a problem:
def start_requests(self):
yield self.parse(requests.get(url))
def parse(self, response):
#pass
builtins.AttributeError: 'generator' object has no attribute 'dont_filter'
You first need to download the page's resopnse and then convert that string to HtmlResponse object
from scrapy.http import HtmlResponse
resp = requests.get(url)
response = HtmlResponse(url="", body=resp.text, encoding='utf-8')
what you need to do is
get the page with python requests and save it to variable different then Scrapy response.
r = requests.get(url)
replace scrapy response body with your python requests text.
response = response.replace(body = r.text)
thats it. Now you have Scrapy response object with all data available from python requests.
yields return a generator so it iterates over it before the request get's the data you can remove the yield and it should work. I have tested it with sample URL
def start_requests(self):
self.parse(requests.get(url))
def parse(self, response):
#pass

Web Crawler not printing pages correctly

Good morning !
I've developed a very simple spider with Scrapy just to get used with FormRequest. I'm trying to send a request to this page: https://www.caramigo.eu/ which should lead me to a page like this one: https://www.caramigo.eu/be/fr/recherche?address=Belgique%2C+Li%C3%A8ge&date_debut=16-03-2019&date_fin=17-03-2019. The issue is that my spider does not prompt the page correctly (the cars images and info do not appear at all) and therefore I can't collect any data from it. Here is my spider:
import scrapy
class CarSpider(scrapy.Spider):
name = "caramigo"
def start_requests(self):
urls = [
'https://www.caramigo.eu/'
]
for url in urls:
yield scrapy.Request(url=url, callback=self.search_line)
def search_line(self, response):
return scrapy.FormRequest.from_response(
response,
formdata={'address': 'Belgique, Liège', 'date_debut': '16-03-2019', 'date_fin': '17-03-2019'},
callback=self.parse
)
def parse(self, response):
filename = 'caramigo.html'
with open(filename, 'wb') as f:
f.write(response.body)
self.log('Saved file %s' % filename)
Sorry if the syntax is not correct, I'm pretty new to coding.
Thank you in advance !

Fetch pages with scrapy behind Google Authentication

I'm trying to log into a website that uses Google credentials. This fails in my scrapy spider:
def parse(self, response):
return scrapy.FormRequest.from_response(
response,
formdata={'email': self.var.user, 'password': self.var.password},
callback=self.after_login)
Any tips?
After further inspection I managed to solve this, seems to be, a simple issue:
The fields are Email and Passwd, in that order.
Break the log in into two request, the first for email, second for password.
The code that works, as follows:
def parse(self, response):
"""
Insert the email. Next, go to the password page.
"""
return scrapy.FormRequest.from_response(
response,
formdata={'Email': self.var.user},
callback=self.log_password)
def log_password(self, response):
"""
Enter the password to complete the log in.
"""
return scrapy.FormRequest.from_response(
response,
formdata={'Passwd': self.var.password},
callback=self.after_login)