Scrapy Link Extractor Rules

Scrapy Link Extractor Rules - scrapy

I have a spider setup using link extractor rules. The spider crawls and scrapes the items that I expect, although it will only follow the 'Next' pagination button to the 3rd page, where the spider then finishes without any errors, there are a total of 50 pages to crawl via the 'Next' pagination. Here is my code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
class MySpider(CrawlSpider):
name = 'my_spider'
start_urls = [some_base_url]
rules = (
Rule(LinkExtractor(allow='//div[#data-test="productGridContainer"]//a[contains(#data-test, "product-title")]'), callback='parse_item'),
Rule(LinkExtractor(restrict_xpaths='//div[#data-test="productGridContainer"]//a[contains(#data-test, "next")]'), follow=True)
)
def parse_item(self, response):
# inspect_response(response, self)
...items are being scraped
return scraped_info
It feels like I may be missing a setting or something as the code functions as expected for the first 3 iterations. My settings file does not override the DEPTH_LIMIT default of 0. Any help is greatly appreciated, thank you.
EDIT 1
It appears it may not have anything to do with my code as if I start with a different product page I can get up to 8 pages scraped before the spider exits. Not sure if its the site I am crawling or how to troubleshoot?
EDIT 2
Troubleshooting it some more it appears that my 'Next' link disappears from the web page. When I start on the first page the pagination element is present to go to the next page. When I view the response for the next page there are no products and no next link element so the spider thinks it it done. I have tried enabling cookies to see if the site is requiring a cookie in order to paginate. That doesnt have any affect. Could it be a timing thing?
EDIT 3
I have adjusted the download delay and concurrent request values to see if that makes a difference. I get the same results whether I pull the page of data in 1 second or 30 minutes. I am assuming 30 minutes is slow enough as I can manually do it faster than that.
EDIT 4
Tried to enable cookies along with the cookie middleware in debug mode to see if that would make a difference. The cookies are fetched and sent with the requests but I get the same behavior when trying to navigate pages.

To check if the site denies too many requests in short time you can add the following code to your spider (e.g. before your rulesstatement) and play around with the value. See also the excellent documentation.
custom_settings = {
'DOWNLOAD_DELAY': 0.4
}

Related

Trying to log into site with scrapy and response shows login page

I'm new to Scrapy and I'm trying to get a log in working, starting in the shell. This is the site I'm trying to log into:
https://www.acdd.com/customer/account/login/
First I did
from scrapy.http import FormRequest
and then I did
token = response.xpath('//*[#name="form_key"]/#value').extract_first() to get the token and the output looks correct. I then did
FormRequest.from_response(response,formdata={'form_key': token,'login[customerid]': '12345','login[username]': 'myaddress#email.com','login[password]': 'mysecret'})
It outputs
<GET https://www.acdd.com/catalogsearch/result/?q=&login%5Bcustomerid%5D=12345&login%5Busername%5D=myaddress%40email.com&login%5Bpassword%5D=mysecret&form_key=abcdef12345>
If I do view(response) it just shows the login page and not the user page like it should. I've been following tutorials and examples but I think maybe there is just something different about this site than the simple examples I've used. I logged in with Firefox and looked in the developer tools to see what form data it POST and I have all the elements. It also looks like while the form is on https://www.acdd.com/customer/account/login/, it actually posts to https://www.acdd.com/customer/account/login/Post. I've tried to just post to that page in the shell but there are no form elements. This is outside the basic examples I've worked with. Any help is appreciated.

You didn't select target form and Scrapy uses the first one on the page (search form):
FormRequest.from_response(
response=response,
formid="login-form",
formdata={
'login[customerid]': '12345',
'login[username]': 'myaddress#email.com',
'login[password]': 'mysecret',
'send': "",
}
)
Also you don't need form_key here because Scrapy will get it from a form for you.
UPDATE Try to add send key.

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy, it's an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird, looks like corrupted html code
This is how the Network tab looks
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've succesfully done this with normal pagination and rules guided by xpaths.

https://www.bahiablancapropiedades.com/buscar/resultados/0
This is XHR url.
While scrolling the page it will appear the 8 records per request.
So do one thing get all records XPath. these records divide by 8. it will appear the count of XHR requests.
do below process. your issue will solve. I get the same issue as me. I applied below logic. it will resolve.
pagination_count = xpath of presented number
value = int(pagination_count) / 8
for pagination_value in value:
url = https://www.bahiablancapropiedades.com/buscar/resultados/+[pagination_value]
pass this url to your scrapy funciton.

It is not corrupted HTML, it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()

Scrapy: find HTTP call from button click

I am trying to scrape flyers from flipp.com/weekly_ads using Scrapy. Before I can scrape the flyers, I need to input my area code, and search for local flyers (on the site, this is done by clicking a button).
I am trying to input a value, and simualate "clicking a button" using Scrapy.
Initially, I thought that I would be able to use a FormRequest.from_response to search for the form, and input my area code as a value. However, the button is written in javascript, meaning that the form cannot be found.
So, I tried to find the HTTP call via Inspect Element > Developer Tools > Network > XHR to see if any of the calls would load the equivalent flipp page with the new, inputted area code (my area code).
Now, I am very new to Scrapy, and HTTP requests/responses, so I am unsure if the link I found is the correct one (as in, the response with the new area code), or not.
This is the request I found:
https://gateflipp.flippback.com/bf/flipp/data?locale=en-us&postal_code=90210&sid=10775773055673477
I used an arbitrary postal code for the request (90210).
I suspect this is the incorrect request, but in the case that I am wrong, and this is correct:
How do I navigate to - flipp.com/weekly_ads/groceries from this request, while maintaining the new area code?
If this is incorrect:
How do I input a value for a javascript button, and get the result using Scrapy?
import scrapy
import requests
import json
class flippSpider(scrapy.Spider):
name = "flippSpider"
postal_code = "M1T2R8"
start_urls = ["https://flipp.com/weekly_ads"]
def parse(self, response): #Input value and simulate button click
return Request() #Find http call to simulate button click with correct field/value parameters
def parse_formrequest(self, response):
yield scrapy.Request("https://flipp.com/weekly_ads/groceries", callback= self.parse_groceries)
def parse_groceries(self, response):
flyers = []
flyer_names = response.css("class.flyer-name").extract()
for flyer_name in flyer_names:
flyer = FlippspiderItem()
flyer["name"] = flyer_name
flyers.append(flyer)
self.log(flyer["name"])
print(flyer_name)
return flyers
I expected to find the actual javascript button request within the XHR links but the one I found seems to be incorrect.
Edit: I do not want to use Selenium, it's slow, and I do not want a browser to pop up during execution of the spider.

I suspect this is the incorrect request, but in the case that I am wrong, and this is correct:
That is the correct URL to get the data powering that website; the things you see on screen when you go to flipp.com/weekly_ads/groceries is just packaging that data in HTML
How do I navigate to - flipp.com/weekly_ads/groceries from this request, while maintaining the new area code?
I am pretty sure you are asking the wrong question. You don't need to -- and in fact navigating to flipp.com/weekly_ads/groceries will 100% not do what you want anyway. You can observe that when you click on "Groceries", the content changes but the browser does not navigate to any new page, nor does it make a new XHR request. Thus, everything that you need is in that JSON. What is happening is they are using the flyers.*.categories that contains "Groceries" to narrow down the 129 flyers that are returned to just those related to Groceries.
As for "maintaining the new area code," it's a similar "wrong question" because every piece of data that is returned by that XHR is scoped to the postal code in question. Thus, you don't need to re-submit anything, and nor would I expect any data that comes back from your postal_code=90210 request to contain 30309 (or whatever) data.
Believe it or not, you're actually in a great place: you don't need to deal with complicated CSS or XPath queries to liberate the data from its HTML prison: they are kind enough to provide you with an API to their data. You just need to deal with unpacking the content from their structure into your own.

Getting links from a infinite scroll page

I'm trying to get links from a page with LinkExtractor on a page with infinite scroll. Doing this with
rules = (
Rule(LinkExtractor(allow=".*?(\/nl\/agenda\/).*"), callback='parse_item', follow=True),
)
works. However, this gets called without JavaScript, thus the images are not loading within the page (and their url, which I need). When changing the LinkExtractor to;
rules = (
Rule(LinkExtractor(allow=".*?(\/nl\/agenda\/).*"), callback='parse_item', follow=True, process_links='process_links'),
)
with;
def process_links(self, links):
for link in links:
link.url = "http://localhost:8050/render.html?" + urlencode({ 'url' : link.url })
return links
It only goes to the urls it loads when loading up the page (but it needs to get ALL the links which you can get with scrolling). For some reason it also loads some weird localhost URLs like so;
http://localhost:8050/render.html?url=http%3A%2F%2Flocalhost%3A8050%2Fnl%2Fagenda%2xxxxxx
Which I have no clue why it does that.
Is there a way to execute JavaScript when using the LinkExtractor and Splash, so I can scroll and get all the links before the LinkExtractor gets the links? Only executing JavaScript when following up links from the LinkExtractor would also be enough, but I wouldn't know where to begin to do that.

Link extractor works on the current content not the content that render dynamically. And yes, as you say, for that, you are using splash but splash is used to render JavaScript code while virtual scrolling is never handled in splash, virtual scrolling is more like a network call to fetch new data and append it to the existing HTML. so when you scroll, find a call and then hit that call to get the desired data.

How can deltafetch & splash be used together in Scrapy (python)

I am trying to build a scraper using scrapy and I plan to use deltafetch to enable incremental refresh but I need to parse javascript based pages which is why I need to use splash as well.
In the settings.py file, we need to add
SPIDER_MIDDLEWARES = {'scrapylib.deltafetch.DeltaFetch': 100,}
for enabling deltafetch whereas, we need to add
SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,} for splash
I wanted to know how would both of them work together if both of them use some kind of spider middleware.
Is there some way in which I could use both of them?

For other answers see here and here. Essentially you can use the request meta parameter to manually set the deltafetch_key for the requests you are making. In this way you can request the same page with Splash even after you've successfully scraped items from that page with Scrapy and vice versa. Hope that helps!
from scrapy_splash import SplashRequest
from scrapy.utils.request import request_fingerprint
(your spider code here)
yield scrapy.Request(url, meta={'deltafetch_key': request_fingerprint(response.request)})

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas