I'm trying to get links with LinkExtractor from a page with infinite scroll. Doing this with
rules = (
    Rule(LinkExtractor(allow=".*?(\/nl\/agenda\/).*"), callback='parse_item', follow=True),
)
works. However, these requests are made without JavaScript, so the images within the page are not loaded (and I need their URLs). When changing the LinkExtractor to:
rules = (
    Rule(LinkExtractor(allow=".*?(\/nl\/agenda\/).*"), callback='parse_item', follow=True, process_links='process_links'),
)
with:
from urllib.parse import urlencode  # needed for the query-string encoding below

def process_links(self, links):
    for link in links:
        link.url = "http://localhost:8050/render.html?" + urlencode({'url': link.url})
    return links
It only visits the URLs that are present when the page first loads (but it needs to get ALL the links, which only appear after scrolling). For some reason it also requests some odd localhost URLs like this:
http://localhost:8050/render.html?url=http%3A%2F%2Flocalhost%3A8050%2Fnl%2Fagenda%2xxxxxx
and I have no clue why it does that.
Is there a way to execute JavaScript when using the LinkExtractor and Splash, so I can scroll and get all the links before the LinkExtractor gets the links? Only executing JavaScript when following up links from the LinkExtractor would also be enough, but I wouldn't know where to begin to do that.
The link extractor works on the content that is initially returned, not on content that is rendered dynamically. And yes, as you say, you are using Splash for that, but Splash is used to render JavaScript; virtual scrolling is not something it handles for you. Virtual scrolling is really just a network call that fetches new data and appends it to the existing HTML. So scroll the page in your browser, find that call in the Network tab, and then hit that call directly to get the data you want.
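For example, if scrolling fires a paginated endpoint, you can request it straight from the spider. A minimal sketch, assuming a hypothetical endpoint and page parameter (substitute whatever you actually see in the Network tab):

import scrapy


class AgendaScrollSpider(scrapy.Spider):
    name = "agenda_scroll"

    def start_requests(self):
        # Hypothetical endpoint: replace it with the URL that shows up in the
        # browser's Network tab while you scroll the agenda page.
        for page in range(1, 6):  # the page range is an assumption
            yield scrapy.Request(
                f"https://example.com/nl/agenda/?page={page}",
                callback=self.parse_item,
            )

    def parse_item(self, response):
        # Parse each "scroll" fragment here, e.g. the image URLs that were
        # missing without JavaScript.
        for src in response.css("img::attr(src)").getall():
            yield {"image_url": response.urljoin(src)}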
So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy: it's an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
https://www.bahiablancapropiedades.com/buscar/resultados/0
This is the XHR URL. While scrolling the page, each request returns 8 records. So grab the total record count with an XPath, divide it by 8, and that gives you the number of XHR requests to make. I ran into the same issue myself and the logic below resolved it:
# The XPath is a placeholder for wherever the total record count is shown on the page
pagination_count = response.xpath('//span[@class="total-results"]/text()').get()
pages = int(pagination_count) // 8  # the site returns 8 records per XHR request
for pagination_value in range(pages + 1):
    url = "https://www.bahiablancapropiedades.com/buscar/resultados/" + str(pagination_value)
    yield scrapy.Request(url, callback=self.parse_page)  # pass each URL to your Scrapy parsing function
It is not corrupted HTML, it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements, you need to pull the HTML out of the JSON response and create your own parsel Selector (this works the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
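The same approach works inside a spider. A minimal sketch, assuming you iterate over the numbered result pages (the page range and what you extract per listing are assumptions):

import json

import scrapy
from parsel import Selector


class BahiaSpider(scrapy.Spider):
    name = "bahia"

    def start_requests(self):
        # The page range is an assumption; in practice derive it from the
        # total-results count shown on the site.
        for page in range(0, 10):
            url = f"https://www.bahiablancapropiedades.com/buscar/resultados/{page}"
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        data = json.loads(response.text)
        sel = Selector(text=data["view"])  # 'view' holds the escaped HTML fragment
        for href in sel.css("a::attr(href)").getall():
            # Follow each listing and scrape its details in parse_item.
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # Extract whatever fields you need from each property page here.
        pass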
I am trying to scrape flyers from flipp.com/weekly_ads using Scrapy. Before I can scrape the flyers, I need to input my area code, and search for local flyers (on the site, this is done by clicking a button).
I am trying to input a value and simulate "clicking a button" using Scrapy.
Initially, I thought that I would be able to use a FormRequest.from_response to search for the form, and input my area code as a value. However, the button is written in javascript, meaning that the form cannot be found.
So, I tried to find the HTTP call via Inspect Element > Developer Tools > Network > XHR to see if any of the calls would load the equivalent flipp page with the new, inputted area code (my area code).
Now, I am very new to Scrapy, and HTTP requests/responses, so I am unsure if the link I found is the correct one (as in, the response with the new area code), or not.
This is the request I found:
https://gateflipp.flippback.com/bf/flipp/data?locale=en-us&postal_code=90210&sid=10775773055673477
I used an arbitrary postal code for the request (90210).
I suspect this is the incorrect request, but in the case that I am wrong, and this is correct:
How do I navigate to - flipp.com/weekly_ads/groceries from this request, while maintaining the new area code?
If this is incorrect:
How do I input a value for a javascript button, and get the result using Scrapy?
import scrapy
import requests
import json

class flippSpider(scrapy.Spider):
    name = "flippSpider"
    postal_code = "M1T2R8"
    start_urls = ["https://flipp.com/weekly_ads"]

    def parse(self, response):  # Input value and simulate button click
        return Request()  # Find http call to simulate button click with correct field/value parameters

    def parse_formrequest(self, response):
        yield scrapy.Request("https://flipp.com/weekly_ads/groceries", callback=self.parse_groceries)

    def parse_groceries(self, response):
        flyers = []
        flyer_names = response.css(".flyer-name::text").extract()  # '.flyer-name' selects by class; '::text' grabs the text node
        for flyer_name in flyer_names:
            flyer = FlippspiderItem()  # assumes FlippspiderItem is defined in your project's items.py
            flyer["name"] = flyer_name
            flyers.append(flyer)
            self.log(flyer["name"])
            print(flyer_name)
        return flyers
I expected to find the actual javascript button request within the XHR links but the one I found seems to be incorrect.
Edit: I do not want to use Selenium, it's slow, and I do not want a browser to pop up during execution of the spider.
I suspect this is the incorrect request, but in the case that I am wrong, and this is correct:
That is the correct URL to get the data powering that website; the things you see on screen when you go to flipp.com/weekly_ads/groceries are just that data packaged in HTML.
How do I navigate to - flipp.com/weekly_ads/groceries from this request, while maintaining the new area code?
I am pretty sure you are asking the wrong question. You don't need to, and in fact navigating to flipp.com/weekly_ads/groceries will 100% not do what you want anyway. You can observe that when you click on "Groceries", the content changes but the browser does not navigate to any new page, nor does it make a new XHR request. Thus, everything that you need is in that JSON. What is happening is that they use the flyers.*.categories values containing "Groceries" to narrow the 129 flyers that are returned down to just those related to groceries.
As for "maintaining the new area code," it's a similar "wrong question" because every piece of data returned by that XHR is already scoped to the postal code in question. Thus, you don't need to re-submit anything, nor would I expect any data that comes back from your postal_code=90210 request to contain 30309 (or whatever) data.
Believe it or not, you're actually in a great place: you don't need to deal with complicated CSS or XPath queries to liberate the data from its HTML prison: they are kind enough to provide you with an API to their data. You just need to deal with unpacking the content from their structure into your own.
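A minimal sketch of that unpacking, assuming the JSON exposes a flyers list whose entries carry categories and a name field (the name key is an assumption; reuse the sid query string from the request you actually captured):

import json

import scrapy


class FlippApiSpider(scrapy.Spider):
    name = "flipp_api"
    postal_code = "M1T2R8"

    def start_requests(self):
        # Reuse the XHR you captured in the Network tab; the sid below is the
        # one from that captured request, not a value you have to construct.
        url = (
            "https://gateflipp.flippback.com/bf/flipp/data"
            f"?locale=en-us&postal_code={self.postal_code}&sid=10775773055673477"
        )
        yield scrapy.Request(url, callback=self.parse_data)

    def parse_data(self, response):
        data = json.loads(response.text)
        for flyer in data.get("flyers", []):
            # Keep only flyers tagged with the Groceries category, mirroring
            # what the site does client-side.
            if "Groceries" in flyer.get("categories", []):
                # "name" is an assumed field; inspect the JSON for the real keys.
                yield {"name": flyer.get("name")}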
I have a spider set up using link extractor rules. The spider crawls and scrapes the items that I expect, but it only follows the 'Next' pagination button to the 3rd page, where the spider then finishes without any errors; there are a total of 50 pages to crawl via the 'Next' pagination. Here is my code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = [some_base_url]

    rules = (
        Rule(LinkExtractor(allow='//div[@data-test="productGridContainer"]//a[contains(@data-test, "product-title")]'), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths='//div[@data-test="productGridContainer"]//a[contains(@data-test, "next")]'), follow=True)
    )

    def parse_item(self, response):
        # inspect_response(response, self)
        # ...items are being scraped
        return scraped_info
It feels like I may be missing a setting or something as the code functions as expected for the first 3 iterations. My settings file does not override the DEPTH_LIMIT default of 0. Any help is greatly appreciated, thank you.
EDIT 1
It appears this may not have anything to do with my code: if I start with a different product page I can get up to 8 pages scraped before the spider exits. I'm not sure whether it's the site I am crawling or how to troubleshoot it.
EDIT 2
Troubleshooting it some more, it appears that my 'Next' link disappears from the web page. When I start on the first page the pagination element is present to go to the next page, but when I view the response for the next page there are no products and no next link element, so the spider thinks it is done. I have tried enabling cookies to see if the site requires a cookie in order to paginate; that doesn't have any effect. Could it be a timing thing?
EDIT 3
I have adjusted the download delay and concurrent request values to see if that makes a difference. I get the same results whether I pull the page of data in 1 second or 30 minutes. I am assuming 30 minutes is slow enough as I can manually do it faster than that.
EDIT 4
Tried to enable cookies along with the cookie middleware in debug mode to see if that would make a difference. The cookies are fetched and sent with the requests but I get the same behavior when trying to navigate pages.
To check whether the site denies too many requests in a short time, you can add the following code to your spider (e.g. before your rules statement) and play around with the value. See also the excellent documentation.
custom_settings = {
    'DOWNLOAD_DELAY': 0.4
}
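Dropped into the spider from the question, it would sit roughly like this (RANDOMIZE_DOWNLOAD_DELAY is optional; it is a standard Scrapy setting that varies the delay a little):

class MySpider(CrawlSpider):
    name = 'my_spider'

    # Per-spider settings override the project-wide values in settings.py.
    custom_settings = {
        'DOWNLOAD_DELAY': 0.4,             # seconds to wait between requests
        'RANDOMIZE_DOWNLOAD_DELAY': True,  # vary the delay to look less robotic
    }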
I am traversing this URL. On it, the javascript:ctrl.set_pageReload(1) function makes an AJAX call which then loads the page data. How can I write my Rule(LinkExtractor()) to traverse this, or is there some other way?
What is AJAX? It's simply a request to a URL with the GET or POST method.
You can check it in the Inspect Element view.
Click on the button you are talking about, then see where the AJAX request is going.
Also, instead of scraping URLs via Rule(LinkExtractor(), remove start_urls and the def parse() method and do this:
def start_requests(self):
    # URL_HERE is the AJAX endpoint you found in the Network tab
    yield Request(url=URL_HERE, callback=self.parse_detail_page)
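If the set_pageReload(1) call turns out to hit a paginated endpoint, a minimal sketch could look like this (the endpoint, parameter name, and page range are all assumptions; use whatever the Network tab shows):

import scrapy
from scrapy import Request


class PageReloadSpider(scrapy.Spider):
    name = "page_reload"

    def start_requests(self):
        # Hypothetical endpoint standing in for the real AJAX target.
        for page in range(1, 11):
            yield Request(
                url=f"https://example.com/listing/ajax?page={page}",
                callback=self.parse_detail_page,
            )

    def parse_detail_page(self, response):
        # The AJAX response is usually an HTML fragment or JSON; parse it here.
        pass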
I have done 'everything' that the manual says:
included all the files, added it properly to the HTML structure, loaded the images, and so on.
Please, you can view the live problem here.
You are using ajaxpage to load the div with id=slider for certain categories on page.php. However, Nivo slider looks for that div after the primary page has loaded (via the $(window).load function).
Somehow, you need to attach the .load function call to the page being loaded by ajaxpage.
You might try adding the $(window).load call to the bottom of each page whenever you are setting up a Nivo slideshow.
If you were using jQuery's ajax library/module calls, you might be able to attach the load action to fire when the ajax has loaded.
I looked at the source of one of your page.php pages with Nivo and I think that you set up the HTML correctly, but the Nivo module just has not been started, partly because of the way the page is pulling in the content using ajax.
This is a clever way of doing things, but just needs a different trick to starting Nivo.