Scrapy won't let me log in to an ASP.NET page (ASPX) - scrapy

Hi, I'm having trouble getting my Scrapy spider to log in to an ASPX (ASP.NET) website.
The script is supposed to crawl a supplier's website for product information (so we are allowed to do this), but for whatever reason it can't log in using the code below. There is a username field, a password field and an image button, yet when the script runs the login simply doesn't happen and we are redirected back to the main page. I believe it has something to do with the page being ASP.NET, and apparently I need to pass more information, but I've honestly tried everything and I'm at a loss for what to do next.
What am I doing wrong?
import scrapy

class LeedaB2BSpider(scrapy.Spider):
    name = 'leedab2b'

    start_urls = [
        'https://www.leedab2b.co.uk/customerlogin.aspx'
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response=response,
            formdata={'ctl00$ContentPlaceHolder1$tbUsername': 'emailaddress#gmail.com',
                      'ctl00$ContentPlaceHolder1$tbPassword': 'yourpassword'},
            clickdata={'id': 'ctl00_ContentPlaceHolder1_lbcustomerloginbutton'},
            callback=self.after_login)

    def after_login(self, response):
        self.logger.info("you are at %s" % response.url)

FormRequest.from_response doesn't seem to send __EVENTTARGET and __EVENTARGUMENT in the form data; try adding them manually:

formdata={
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$lbcustomerloginbutton',
    '__EVENTARGUMENT': '',
    'ctl00$ContentPlaceHolder1$tbUsername': 'emailaddress#gmail.com',
    'ctl00$ContentPlaceHolder1$tbPassword': 'yourpassword'
}
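Putting it together, a minimal sketch of the parse callback with those hidden fields merged in (the field names are taken from the question; whether dont_click is also needed once __EVENTTARGET is set by hand is an assumption worth testing):

import scrapy

class LeedaB2BSpider(scrapy.Spider):
    name = 'leedab2b'
    start_urls = ['https://www.leedab2b.co.uk/customerlogin.aspx']

    def parse(self, response):
        # from_response copies __VIEWSTATE and the other hidden ASP.NET fields
        # from the form; __EVENTTARGET/__EVENTARGUMENT are added by hand because
        # the LinkButton normally sets them via JavaScript before posting back.
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$lbcustomerloginbutton',
                '__EVENTARGUMENT': '',
                'ctl00$ContentPlaceHolder1$tbUsername': 'emailaddress#gmail.com',
                'ctl00$ContentPlaceHolder1$tbPassword': 'yourpassword',
            },
            # dont_click=True may also be worth trying so Scrapy doesn't simulate
            # a submit-button click on top of the manual postback fields.
            callback=self.after_login)

    def after_login(self, response):
        self.logger.info("you are at %s", response.url)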

Related

Looping through pages of Web Page's Request URL with Scrapy

I'm looking to adapt this tutorial (https://medium.com/better-programming/a-gentle-introduction-to-using-scrapy-to-crawl-airbnb-listings-58c6cf9f9808) to scrape this site of tiny home listings: https://tinyhouselistings.com/.
The tutorial uses the request URL to get a very complete and clean JSON file, but does so for the first page only. It seems that looping through the 121 pages of my tinyhouselistings request URL should be pretty straightforward, but I haven't been able to get anything to work. The tutorial doesn't loop through the pages of the request URL; instead it uses scrapy-splash, run within a Docker container, to get all the listings. I'm willing to try that, but I feel it should be possible to loop through this request URL.
This outputs only the first page of the tinyhouselistings request URL for my project:
import scrapy

class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    allowed_domains = ['tinyhouselistings.com']
    start_urls = ['http://www.tinyhouselistings.com']

    def start_requests(self):
        url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page=1'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        _file = "tiny_listings.json"
        with open(_file, 'wb') as f:
            f.write(response.body)
I've tried this:
class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    allowed_domains = ['tinyhouselistings.com']
    start_urls = ['']

    def start_requests(self):
        url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page='
        for page in range(1, 121):
            self.start_urls.append(url + str(page))
        yield scrapy.Request(url=start_urls, callback=self.parse)
But I'm not sure how to then pass start_urls to parse so that each response gets written to the JSON file at the end of the script.
Any help would be much appreciated!
Remove allowed_domains = ['tinyhouselistings.com'], because requests to thl-prod.global.ssl.fastly.net will otherwise be filtered out by Scrapy.
Since you are using the start_requests method you do not need start_urls; you only need one or the other.
import json

import scrapy

class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    listings_url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page={}'

    def start_requests(self):
        page = 1
        yield scrapy.Request(url=self.listings_url.format(page),
                             meta={"page": page},
                             callback=self.parse)

    def parse(self, response):
        resp = json.loads(response.body)

        # Yield each listing on the current page as an item.
        for ad in resp["listings"]:
            yield ad

        # Follow the pagination until the last page reported by the API.
        page = int(response.meta['page']) + 1
        if page < int(resp['meta']['pagination']['page_count']):
            yield scrapy.Request(url=self.listings_url.format(page),
                                 meta={"page": page},
                                 callback=self.parse)
From the terminal, run the spider as follows to save the scraped data to a JSON file:
scrapy crawl tinyhouselistings -o output_file.json
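If the total page count is known from the first response, another option is to schedule all the remaining pages at once instead of chaining them page by page. A sketch of an alternative parse method, meant as a drop-in replacement in the spider above and assuming the same JSON structure:

    def parse(self, response):
        resp = json.loads(response.body)
        for ad in resp["listings"]:
            yield ad
        # Only the first page fans out the other requests, so every page
        # is requested exactly once.
        if response.meta["page"] == 1:
            page_count = int(resp["meta"]["pagination"]["page_count"])
            for page in range(2, page_count + 1):
                yield scrapy.Request(url=self.listings_url.format(page),
                                     meta={"page": page},
                                     callback=self.parse)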

Code Error with Scrapy Tutorial

I am trying to learn Scrapy and am going through the basic tutorial. I am using Anaconda Navigator, working in an environment with Scrapy installed. I have entered the code but keep getting an error.
Here is the code:
import scrapy

class FirstSpider(scrapy.Spider):
    name = "FirstSpider"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Requests(url=url, callback = self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = "quotes-%.html" % page
        with open(filename, "wb") as f:
            f.write(response.body)
        self.log("saved file %s")% filename
The code runs for a bit and says it crawled 0 pages, then DEBUGs the Telnet console and puts out this error: "[scrapy.core.engine] ERROR: Error while obtaining start requests."
The code then runs some more and, after "yield scrapy.Requests(url=url, callback = self.parse)", puts out another error: "AttributeError: Module 'scrapy' has no attribute 'Requests'".
I have re-written the code and looked for answers. Please help. Thanks!
You have a typo here:
yield scrapy.Requests(url=url, callback = self.parse)
It's Request and not Requests.
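Beyond that typo, the spider has two smaller string-formatting bugs (the bare % in the filename and the misplaced % in the log call). A corrected sketch of the same spider:

import scrapy

class FirstSpider(scrapy.Spider):
    name = "FirstSpider"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            # scrapy.Request (singular) is the class name.
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = "quotes-%s.html" % page  # %s, not a bare %
        with open(filename, "wb") as f:
            f.write(response.body)
        self.log("saved file %s" % filename)  # apply % inside the call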

open link authentication using scrapy

Hello, I am having trouble using Scrapy.
I want to scrape some data from clinicalkey.com.
I have an ID and password for my hospital, and my hospital has access to clinicalkey.com, so if I log in to my hospital's library page I can also use clinicalkey.com without further authentication.
But my Scrapy script doesn't work, and I can't figure out why.
My script is here:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.FormRequest(loginsite, formdata={'id': 'Myid', 'password': 'MyPassword'}, callback=self.after_login)

    def after_login(self, response):
        yield scrapy.Request(clinicalkeysite, callback=self.parse_detail)

    def parse_detail(self, response):
        blahblah
When I look at the final response, it contains a message saying that I need to log in.
This site uses a JSON body for authentication.
Try something like this (the literal braces are doubled so that str.format does not treat them as placeholders):

body = '{{"username":"{}","password":"{}","remember_me":true,"product":"CK_US"}}'.format(yourname, yourpassword)
yield scrapy.FormRequest(loginsite, body=body, callback=self.after_login)
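A slightly fuller sketch of that idea, building the body with json.dumps and sending an explicit JSON Content-Type header; the spider name, login URL and credential values below are placeholders rather than values taken from the real site:

import json
import scrapy

class ClinicalKeySpider(scrapy.Spider):
    name = 'clinicalkey'

    def start_requests(self):
        # Placeholder URL; the real login endpoint has to be taken from the
        # browser's network tab while logging in manually.
        loginsite = 'https://example.com/login'
        body = json.dumps({
            'username': 'Myid',
            'password': 'MyPassword',
            'remember_me': True,
            'product': 'CK_US',
        })
        yield scrapy.Request(
            loginsite,
            method='POST',
            body=body,
            headers={'Content-Type': 'application/json'},
            callback=self.after_login)

    def after_login(self, response):
        self.logger.info('login response status: %s', response.status)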

Flask app-setting up page dependency for LDAP login

I have two webpages, /login and /index. I was able to set up a dependency for entering the /index page once the login is authenticated against LDAP, using the following view functions, and it works perfectly:
@app.route('/index', methods=['GET', 'POST'])
def index():
    return render_template('index.html')

@app.route('/')
@app.route('/login', methods=['GET', 'POST'])
def login():
    error = None
    if request.method == 'POST':
        s = Server('appauth.corp.domain.com:636', use_ssl=True, get_info=ALL)
        c = Connection(s, user=request.form['username'], password=request.form['password'], check_names=True, lazy=False, raise_exceptions=False)
        c.open()
        c.bind()
        if request.form['username'] not in users and (c.bind() != True) is True:
            error = 'Invalid credentials. Please try again'
        else:
            return redirect(url_for('index'))
    return render_template('login.html', error=error)
However, I am able to access the /index page by bypassing the login page. I found that there is a way to set up a @login_required decorator using flask_login, but that would involve setting up a local database with SQLAlchemy. Is there an easier way of doing this, as I would otherwise need to modify my login logic?
On the login page I am calling index.html, which I am using as my base.html:
{% extends 'index.html' %}
I implemented this successfully using sessions. This is exactly what was needed here:
http://flask.pocoo.org/docs/0.11/quickstart/#sessions
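For reference, a minimal sketch of that session-based approach: set a flag in Flask's session after a successful LDAP bind and check it before rendering /index. The secret key value and the ldap_bind_ok helper are placeholders standing in for the ldap3 Server/Connection code from the question:

from flask import Flask, session, redirect, url_for, render_template, request

app = Flask(__name__)
app.secret_key = 'replace-with-a-random-secret'  # required for sessions

@app.route('/index', methods=['GET', 'POST'])
def index():
    # Redirect back to the login page unless the session says we are logged in.
    if not session.get('logged_in'):
        return redirect(url_for('login'))
    return render_template('index.html')

@app.route('/')
@app.route('/login', methods=['GET', 'POST'])
def login():
    error = None
    if request.method == 'POST':
        if ldap_bind_ok(request.form['username'], request.form['password']):
            session['logged_in'] = True
            return redirect(url_for('index'))
        error = 'Invalid credentials. Please try again'
    return render_template('login.html', error=error)

def ldap_bind_ok(username, password):
    # Placeholder for the ldap3 Server/Connection bind from the question.
    ...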

Fetch pages with scrapy behind Google Authentication

I'm trying to log into a website that uses Google credentials. This fails in my scrapy spider:
def parse(self, response):
    return scrapy.FormRequest.from_response(
        response,
        formdata={'email': self.var.user, 'password': self.var.password},
        callback=self.after_login)
Any tips?
After further inspection I managed to solve this; it turned out to be a simple issue:
The fields are Email and Passwd, in that order.
Break the login into two requests: the first for the email, the second for the password.
The code that works is as follows:
def parse(self, response):
    """
    Insert the email. Next, go to the password page.
    """
    return scrapy.FormRequest.from_response(
        response,
        formdata={'Email': self.var.user},
        callback=self.log_password)

def log_password(self, response):
    """
    Enter the password to complete the log in.
    """
    return scrapy.FormRequest.from_response(
        response,
        formdata={'Passwd': self.var.password},
        callback=self.after_login)
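For completeness, a small after_login sketch that checks whether the login actually worked before crawling on; the marker text it looks for is an assumption and would need adjusting to the real page:

def after_login(self, response):
    # A failed login usually lands back on a sign-in page; checking the
    # response body is a cheap sanity check. The marker below is illustrative.
    if b"Sign in" in response.body:
        self.logger.error("Login appears to have failed: %s", response.url)
        return
    self.logger.info("Logged in, now at %s", response.url)
    # ...continue crawling authenticated pages from here...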