I would like to login on splash based on the header information obtained by login authentication with scrapy.
I have two questions regarding this.
question 1.
Is it possible to pass header information obtained by logging in with scrpay directly to splash?
question 2.
Does it make sense with splash to recognize that it is not necessary to perform login authentication?
This is my base code.
import scrapy
from scrapy_splash import SplashRequest
def test(self, response):
self.logger.info("headers:-------"+ str(response.headers))
return SplashRequest(
"url",
self.test2,
endpoint='execute',
args={
'lua_source': """Lua script""",
'wait': 5,})
thank you for reading my question.
Related
I am trying to implement an API which returns a token upon a post request with valid email and password. I am using the rest framework for token authentication. To do this, I customized the ObtainAuthToken view from rest_framework to use my own serializer which works with an email instead of a username. The endpoint works flawlessly but I don't get the browsable API when I visit the endpoint in my browser. I just get a blank page with the following line:
{"detail":"Method \"GET\" not allowed."}
views.py
from rest_framework.authtoken.views import ObtainAuthToken
from .serializers import UserTokenSerializer
class CreateUserToken(ObtainAuthToken):
serializer_class = UserTokenSerializer
What is wrong with my view? I think I'm missing something.
I had to set renderer_classes like so:
from .serializers import UserTokenSerializer
from rest_framework.authtoken.views import ObtainAuthToken
from rest_framework.settings import api_settings
class CreateUserToken(ObtainAuthToken):
serializer_class = UserTokenSerializer
# To get the browsable API;
renderer_classes = api_settings.DEFAULT_RENDERER_CLASSES
I'm new to Scrapy and I'm trying to get a log in working, starting in the shell. This is the site I'm trying to log into:
https://www.acdd.com/customer/account/login/
First I did
from scrapy.http import FormRequest
and then I did
token = response.xpath('//*[#name="form_key"]/#value').extract_first() to get the token and the output looks correct. I then did
FormRequest.from_response(response,formdata={'form_key': token,'login[customerid]': '12345','login[username]': 'myaddress#email.com','login[password]': 'mysecret'})
It outputs
<GET https://www.acdd.com/catalogsearch/result/?q=&login%5Bcustomerid%5D=12345&login%5Busername%5D=myaddress%40email.com&login%5Bpassword%5D=mysecret&form_key=abcdef12345>
If I do view(response) it just shows the login page and not the user page like it should. I've been following tutorials and examples but I think maybe there is just something different about this site than the simple examples I've used. I logged in with Firefox and looked in the developer tools to see what form data it POST and I have all the elements. It also looks like while the form is on https://www.acdd.com/customer/account/login/, it actually posts to https://www.acdd.com/customer/account/login/Post. I've tried to just post to that page in the shell but there are no form elements. This is outside the basic examples I've worked with. Any help is appreciated.
You didn't select target form and Scrapy uses the first one on the page (search form):
FormRequest.from_response(
response=response,
formid="login-form",
formdata={
'login[customerid]': '12345',
'login[username]': 'myaddress#email.com',
'login[password]': 'mysecret',
'send': "",
}
)
Also you don't need form_key here because Scrapy will get it from a form for you.
UPDATE Try to add send key.
I am trying to scrape flyers from flipp.com/weekly_ads using Scrapy. Before I can scrape the flyers, I need to input my area code, and search for local flyers (on the site, this is done by clicking a button).
I am trying to input a value, and simualate "clicking a button" using Scrapy.
Initially, I thought that I would be able to use a FormRequest.from_response to search for the form, and input my area code as a value. However, the button is written in javascript, meaning that the form cannot be found.
So, I tried to find the HTTP call via Inspect Element > Developer Tools > Network > XHR to see if any of the calls would load the equivalent flipp page with the new, inputted area code (my area code).
Now, I am very new to Scrapy, and HTTP requests/responses, so I am unsure if the link I found is the correct one (as in, the response with the new area code), or not.
This is the request I found:
https://gateflipp.flippback.com/bf/flipp/data?locale=en-us&postal_code=90210&sid=10775773055673477
I used an arbitrary postal code for the request (90210).
I suspect this is the incorrect request, but in the case that I am wrong, and this is correct:
How do I navigate to - flipp.com/weekly_ads/groceries from this request, while maintaining the new area code?
If this is incorrect:
How do I input a value for a javascript button, and get the result using Scrapy?
import scrapy
import requests
import json
class flippSpider(scrapy.Spider):
name = "flippSpider"
postal_code = "M1T2R8"
start_urls = ["https://flipp.com/weekly_ads"]
def parse(self, response): #Input value and simulate button click
return Request() #Find http call to simulate button click with correct field/value parameters
def parse_formrequest(self, response):
yield scrapy.Request("https://flipp.com/weekly_ads/groceries", callback= self.parse_groceries)
def parse_groceries(self, response):
flyers = []
flyer_names = response.css("class.flyer-name").extract()
for flyer_name in flyer_names:
flyer = FlippspiderItem()
flyer["name"] = flyer_name
flyers.append(flyer)
self.log(flyer["name"])
print(flyer_name)
return flyers
I expected to find the actual javascript button request within the XHR links but the one I found seems to be incorrect.
Edit: I do not want to use Selenium, it's slow, and I do not want a browser to pop up during execution of the spider.
I suspect this is the incorrect request, but in the case that I am wrong, and this is correct:
That is the correct URL to get the data powering that website; the things you see on screen when you go to flipp.com/weekly_ads/groceries is just packaging that data in HTML
How do I navigate to - flipp.com/weekly_ads/groceries from this request, while maintaining the new area code?
I am pretty sure you are asking the wrong question. You don't need to -- and in fact navigating to flipp.com/weekly_ads/groceries will 100% not do what you want anyway. You can observe that when you click on "Groceries", the content changes but the browser does not navigate to any new page, nor does it make a new XHR request. Thus, everything that you need is in that JSON. What is happening is they are using the flyers.*.categories that contains "Groceries" to narrow down the 129 flyers that are returned to just those related to Groceries.
As for "maintaining the new area code," it's a similar "wrong question" because every piece of data that is returned by that XHR is scoped to the postal code in question. Thus, you don't need to re-submit anything, and nor would I expect any data that comes back from your postal_code=90210 request to contain 30309 (or whatever) data.
Believe it or not, you're actually in a great place: you don't need to deal with complicated CSS or XPath queries to liberate the data from its HTML prison: they are kind enough to provide you with an API to their data. You just need to deal with unpacking the content from their structure into your own.
I have a spider setup using link extractor rules. The spider crawls and scrapes the items that I expect, although it will only follow the 'Next' pagination button to the 3rd page, where the spider then finishes without any errors, there are a total of 50 pages to crawl via the 'Next' pagination. Here is my code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
class MySpider(CrawlSpider):
name = 'my_spider'
start_urls = [some_base_url]
rules = (
Rule(LinkExtractor(allow='//div[#data-test="productGridContainer"]//a[contains(#data-test, "product-title")]'), callback='parse_item'),
Rule(LinkExtractor(restrict_xpaths='//div[#data-test="productGridContainer"]//a[contains(#data-test, "next")]'), follow=True)
)
def parse_item(self, response):
# inspect_response(response, self)
...items are being scraped
return scraped_info
It feels like I may be missing a setting or something as the code functions as expected for the first 3 iterations. My settings file does not override the DEPTH_LIMIT default of 0. Any help is greatly appreciated, thank you.
EDIT 1
It appears it may not have anything to do with my code as if I start with a different product page I can get up to 8 pages scraped before the spider exits. Not sure if its the site I am crawling or how to troubleshoot?
EDIT 2
Troubleshooting it some more it appears that my 'Next' link disappears from the web page. When I start on the first page the pagination element is present to go to the next page. When I view the response for the next page there are no products and no next link element so the spider thinks it it done. I have tried enabling cookies to see if the site is requiring a cookie in order to paginate. That doesnt have any affect. Could it be a timing thing?
EDIT 3
I have adjusted the download delay and concurrent request values to see if that makes a difference. I get the same results whether I pull the page of data in 1 second or 30 minutes. I am assuming 30 minutes is slow enough as I can manually do it faster than that.
EDIT 4
Tried to enable cookies along with the cookie middleware in debug mode to see if that would make a difference. The cookies are fetched and sent with the requests but I get the same behavior when trying to navigate pages.
To check if the site denies too many requests in short time you can add the following code to your spider (e.g. before your rulesstatement) and play around with the value. See also the excellent documentation.
custom_settings = {
'DOWNLOAD_DELAY': 0.4
}
I am trying to build a scraper using scrapy and I plan to use deltafetch to enable incremental refresh but I need to parse javascript based pages which is why I need to use splash as well.
In the settings.py file, we need to add
SPIDER_MIDDLEWARES = {'scrapylib.deltafetch.DeltaFetch': 100,}
for enabling deltafetch whereas, we need to add
SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,} for splash
I wanted to know how would both of them work together if both of them use some kind of spider middleware.
Is there some way in which I could use both of them?
For other answers see here and here. Essentially you can use the request meta parameter to manually set the deltafetch_key for the requests you are making. In this way you can request the same page with Splash even after you've successfully scraped items from that page with Scrapy and vice versa. Hope that helps!
from scrapy_splash import SplashRequest
from scrapy.utils.request import request_fingerprint
(your spider code here)
yield scrapy.Request(url, meta={'deltafetch_key': request_fingerprint(response.request)})