Scrapy: find HTTP call from button click

I am trying to scrape flyers from flipp.com/weekly_ads using Scrapy. Before I can scrape the flyers, I need to input my area code, and search for local flyers (on the site, this is done by clicking a button).
I am trying to input a value and simulate "clicking a button" using Scrapy.
Initially, I thought I would be able to use FormRequest.from_response to find the form and submit my area code as a value. However, the button is rendered in JavaScript, so the form cannot be found.
So, I tried to find the HTTP call via Inspect Element > Developer Tools > Network > XHR, to see whether any of the calls would load the equivalent Flipp page with the newly entered area code (my area code).
I am very new to Scrapy and to HTTP requests/responses, so I am unsure whether the request I found is the correct one (that is, the one whose response reflects the new area code) or not.
This is the request I found:
https://gateflipp.flippback.com/bf/flipp/data?locale=en-us&postal_code=90210&sid=10775773055673477
I used an arbitrary postal code for the request (90210).
I suspect this is the incorrect request, but in the case that I am wrong, and this is correct:
How do I navigate to flipp.com/weekly_ads/groceries from this request, while maintaining the new area code?
If this is incorrect:
How do I input a value for a JavaScript-driven button and get the result using Scrapy?
import scrapy
from scrapy import Request

from ..items import FlippspiderItem  # assuming the item class lives in the project's items.py

class flippSpider(scrapy.Spider):
    name = "flippSpider"
    postal_code = "M1T2R8"
    start_urls = ["https://flipp.com/weekly_ads"]

    def parse(self, response):
        # Input the postal code and simulate the button click.
        # TODO: find the HTTP call to simulate the click with the correct field/value parameters.
        return Request()

    def parse_formrequest(self, response):
        yield scrapy.Request("https://flipp.com/weekly_ads/groceries", callback=self.parse_groceries)

    def parse_groceries(self, response):
        flyers = []
        # ".flyer-name" selects elements by class; the original "class.flyer-name" selector matches nothing.
        flyer_names = response.css(".flyer-name::text").extract()
        for flyer_name in flyer_names:
            flyer = FlippspiderItem()
            flyer["name"] = flyer_name
            flyers.append(flyer)
            self.log(flyer["name"])
            print(flyer_name)
        return flyers
I expected to find the actual JavaScript button request among the XHR calls, but the one I found seems to be incorrect.
Edit: I do not want to use Selenium; it is slow, and I do not want a browser to pop up during execution of the spider.

I suspect this is the incorrect request, but in the case that I am wrong, and this is correct:
That is the correct URL to get the data powering that website; the things you see on screen when you go to flipp.com/weekly_ads/groceries are just that data packaged in HTML.
How do I navigate to flipp.com/weekly_ads/groceries from this request, while maintaining the new area code?
I am pretty sure you are asking the wrong question. You don't need to, and in fact navigating to flipp.com/weekly_ads/groceries will 100% not do what you want anyway. You can observe that when you click on "Groceries", the content changes but the browser does not navigate to any new page, nor does it make a new XHR request. Thus, everything that you need is in that JSON. What is happening is they are using the flyers.*.categories field, which contains "Groceries", to narrow down the 129 flyers that are returned to just those related to Groceries.
As for "maintaining the new area code," it's a similar "wrong question" because every piece of data returned by that XHR is scoped to the postal code in question. Thus, you don't need to re-submit anything, nor would I expect any data that comes back from your postal_code=90210 request to contain 30309 (or whatever) data.
Believe it or not, you're actually in a great place: you don't need to deal with complicated CSS or XPath queries to liberate the data from its HTML prison: they are kind enough to provide you with an API to their data. You just need to deal with unpacking the content from their structure into your own.
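To make that concrete, here is a minimal sketch of a spider that hits the data endpoint directly and filters by category in the callback. The sid value is copied from the question, and the "flyers", "categories" and "name" field names are assumptions based on the description above; inspect the real JSON to confirm them.
import json

import scrapy

class FlippApiSpider(scrapy.Spider):
    name = "flippApiSpider"
    postal_code = "M1T2R8"

    def start_requests(self):
        # Data endpoint observed in the Network tab; sid is whatever value the site generated for the session.
        url = ("https://gateflipp.flippback.com/bf/flipp/data"
               "?locale=en-us&postal_code=%s&sid=10775773055673477" % self.postal_code)
        yield scrapy.Request(url, callback=self.parse_data)

    def parse_data(self, response):
        data = json.loads(response.text)
        # "flyers", "categories" and "name" are assumed field names - verify them in the response.
        for flyer in data.get("flyers", []):
            if "Groceries" in flyer.get("categories", []):
                yield {"name": flyer.get("name")}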

Related

(Karate) How to intercept the XHR request response code?

I am testing login functionality on a third-party website. I have this URL, example.com/login. When I copy and paste it into the browser (Chrome), the page sometimes loads, but sometimes it does not (an empty, blank white page).
The problem is that I have to run a script on this page to click one of the elements (all the elements are embedded inside a #shadow-root). If the page loads, there is no problem; the script is evaluated successfully. But sometimes the page does not load, an XHR request returns a 404, and as a result my * eval(script("script")) step returns "js eval failed...".
So the solution I found is to refresh the page, and to do that I am considering capturing the XHR request's response. If the status code is 404, refresh the page; if not, continue with the following steps.
Now, I think this may work, but I do not know how to implement Karate's HTTP request interception. And first of all, is that even doable?
I have looked into the documentation here, but could not understand the examples.
https://github.com/karatelabs/karate/tree/master/karate-netty
Meanwhile, if there is another way of refreshing the page conditionally, I will be more than happy to hear about it. Thanks anyone in advance.
First, using JavaScript you should be able to handle shadow roots: https://stackoverflow.com/a/60618233/143475
The above answer links to advanced examples of executing JS in the context of the current page. I suggest you do some research into that and try to get the help of someone who knows JS, the DOM, and HTML well; you should be able to find a way to know whether the XHR has been made successfully or not, e.g. based on whether some element on the page has changed.
Finally here is how you can do interception: https://stackoverflow.com/a/61372471/143475

Scraping links with selenium

I was working on scraping links to articles on a website. Normally when the site is loaded it lists only 5 articles, and then it requires clicking a load-more button to display the rest of the article list.
The HTML source has only the links to the first five articles.
I used Selenium with Python to automate clicking the load-more button so that the web page is completely loaded with all the article listings.
The question is now: how can I extract the links to all those articles?
After loading the site completely with Selenium, I tried to get the HTML source with driver.page_source and printed it, but it still has only the links to the first 5 articles.
I want to get the links to all the articles that were loaded into the web page after clicking the load-more button.
Could someone please help with a solution?
Maybe the links take some time to show up and your code is reading driver.page_source before the source code is updated. You can select the links with Selenium after an explicit wait so that you can make sure the links that are dynamically added to the web page are fully loaded. It is difficult to boil down exactly what you need without a link to your source, but (in Python) it should be something similar to:
from selenium.webdriver.support.ui import WebDriverWait

def condition(driver):
    """If the selector defined in the function retrieves 10 or more results, return the results.
    Else, return None.
    """
    selector = 'a.my_class'  # Selects all <a> tags with the class "my_class"
    els = driver.find_elements_by_css_selector(selector)
    if len(els) >= 10:
        return els

# Making an assignment only when the condition returns a truthy value when called (waiting until 2 min):
links_elements = WebDriverWait(driver, timeout=120).until(condition)

# Getting the href attribute of the links
links_href = [link.get_attribute('href') for link in links_elements]
In this code, you are:
Constantly looking for the elements you want until there are 10 or more of them. You can do this by CSS selector (as in the example), XPath, or another method. This gives you a list of Selenium objects as soon as the wait condition returns a truthy value, up to a certain timeout. See more on explicit waits in the documentation. You should make the appropriate condition for your case; expecting a certain number of links may not be a good fit if you are not sure how many links there will be in the end (an alternative condition is sketched after this list).
Extracting what you want from the Selenium object. For that, use the appropriate method over the elements in the list you got from the step above.
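If you cannot predict how many links there will be, one alternative is to wait for the load-more button to vanish and then collect whatever links are present. This is only a sketch: the button.load_more selector is a placeholder you would replace with the real one, and it assumes the button disappears once everything is loaded.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# "button.load_more" is a placeholder selector for the load-more button on your page.
wait = WebDriverWait(driver, timeout=120)
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, 'button.load_more')))

# Once the button is gone, all dynamically added articles should be in the DOM.
links_href = [el.get_attribute('href')
              for el in driver.find_elements_by_css_selector('a.my_class')]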

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy: it's an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
This is how the Network tab looks
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
https://www.bahiablancapropiedades.com/buscar/resultados/0
This is the XHR URL.
While scrolling the page, 8 records appear per XHR request. So do one thing: get the total number of records with an XPath and divide it by 8; that gives you the number of XHR requests to make.
Follow the process below and your issue will be solved. I ran into the same issue myself and applied the logic below, and it resolved the problem.
# Placeholder XPath: extract the total number of records presented on the page.
pagination_count = response.xpath('...').get()
value = int(pagination_count) // 8
for pagination_value in range(value):
    url = "https://www.bahiablancapropiedades.com/buscar/resultados/" + str(pagination_value)
    # pass this url to your Scrapy callback, e.g. yield scrapy.Request(url, callback=self.parse_page)
It is not corrupted HTML; it is escaped to prevent it from breaking the JSON. Some websites will return plain JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
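If you want to do the same thing inside a spider rather than the shell, a rough sketch could walk the numbered result pages like this. The generic a::attr(href) selector and the "keep paging while links come back" stop condition are assumptions you would adapt to the real site.
import json

import scrapy
from scrapy.selector import Selector

class BahiaSpider(scrapy.Spider):
    name = "bahia"
    start_urls = ["https://www.bahiablancapropiedades.com/buscar/resultados/0"]

    def parse(self, response):
        page = response.meta.get("page", 0)
        json_data = json.loads(response.text)
        sel = Selector(text=json_data["view"])  # "view" holds the escaped HTML fragment
        links = sel.css("a::attr(href)").getall()
        for href in links:
            yield {"link": href}
        # Keep requesting the next numbered page while results keep coming back.
        if links:
            yield scrapy.Request(
                "https://www.bahiablancapropiedades.com/buscar/resultados/%d" % (page + 1),
                callback=self.parse,
                meta={"page": page + 1},
            )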

Scrapy keeps returning an empty value

I'm scraping http://www.germandeli.com/Meats/Sausages
I would like to extract the link for every product (or item) from the page. I use scrapy shell to test, but it keeps returning the empty value [].
Here is the code I use:
response.xpath('*//h2[@class="item-cell-name"]/a/@href')
Any helps would be greatly appreciated.
Well, unfortunately the item content is rendered through JS. But luckily the URL sends an AJAX request to fetch a JSON of the items, which makes it much easier for us to parse. You can check the XHR tab in the Google Chrome console to imitate the request with the required headers.
This URL returns the list of products. The limit and offset parameters in the URL can be played around with to fetch the next set of data. Also, to parse the JSON content you can use json.loads from the standard library.
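As a rough sketch of that idea (the endpoint URL, the limit/offset parameter names, the page size, and the JSON field names below are placeholders based on the description above, not verified against the site):
import json

import scrapy

class GermandeliSpider(scrapy.Spider):
    name = "germandeli"
    # Hypothetical AJAX endpoint seen in the XHR tab; replace with the real one and its headers.
    api_url = "https://www.germandeli.com/api/items?limit=%d&offset=%d"
    page_size = 24

    def start_requests(self):
        yield scrapy.Request(self.api_url % (self.page_size, 0),
                             callback=self.parse_api, cb_kwargs={"offset": 0})

    def parse_api(self, response, offset):
        data = json.loads(response.text)
        items = data.get("items", [])  # field name is a guess - check the real JSON
        for item in items:
            yield {"url": item.get("url")}
        # Keep paging while the endpoint keeps returning a full page of items.
        if len(items) == self.page_size:
            next_offset = offset + self.page_size
            yield scrapy.Request(self.api_url % (self.page_size, next_offset),
                                 callback=self.parse_api, cb_kwargs={"offset": next_offset})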

PrestaShop - Reload CMS page with additional parameters

Situation: I needed to add a form with the POST method to a CMS page. I created a custom hook and a module displaying the form successfully. Then I need to react to user input errors, e.g. when the user doesn't enter an email address I need to detect it, display the whole page again together with the form, and clearly state the "errors" in the user input.
Problem: The problem is displaying the WHOLE page again with the related information (e.g. about the errors). In the module's PHP file, when I add this kind of code,
return $this->display(__FILE__, 'modulename.tpl');
it (naturally) displays ONLY the form, not the whole CMS page with the form.
In case of this code,
Tools::redirectLink('cms.php?id_cms=7');
I can't transfer any information by either the GET or the POST method.
$_POST['test'] = 1;
Tools::redirectLink('cms.php?id_cms=7&test');
I tried assigning to Smarty variables too
$smarty->assign('test', '1');
(I need to use it in the .tpl file where the form itself is created) but there is no way to get it to work.
{if isset($test)}...,
{if isset($smarty.post.test)}...,
{if isset($_POST['test'])}... {* neither of these conditionals end up as true *}
Even assigning a GET parameter to the URL has no impact, because there is link rewriting to some kind of friendly URL, I guess, no matter whether I include the extra argument or not. ([SHOPNAME]/cms.php?id_cms=7&test -> [SHOPNAME]/content/7-cmspage-name)
My question is: is there a way to "redirect" or "reload" the current page (or possibly any page generally) in PrestaShop together with my own data included?
I explained the whole case, so I'm open to hearing a better overall solution than mine (maybe I'm thinking about the case in the wrong way entirely). That would be another possible answer.
The simplest method would be to use JavaScript to validate the form; you can use jQuery to highlight the fields that are in error, providing visual feedback on how the submission failed. In effect, you don't allow the user to submit the form (and thus leave the page) until you're happy that the action will succeed. I assume that you will then redirect to another page once a successful submission has been received.
There are lots of articles and how-tos available for using JavaScript, and indeed jQuery, for form validation. If you want to keep the site lean and mean, then you can provide an override for the CMS controller and only enqueue the script for the specific page(s) you want to use form validation on.
If the validation is complex, then you might be best off using AJAX and just reloading the form section of your page via a call to your module. Hooks aren't great for this kind of thing, so you might want to consider using an alternative method to inject your code onto the CMS page. I've written a few articles on this alternative approach, which can be found on my PrestaShop blog.