PhantomJS: view dependencies such as .js, not just the final HTML

In PhantomJS, is there a way to view a page's dependencies? For example, a page causes a JS script to load. Instead of just viewing the final browser HTML result, I'd like to see the JS script itself.
What I want to see has Content-Type: application/json

It is possible.
In PhantomJS you have a whole set of page callbacks at your disposal.
Hooking into onResourceRequested and onResourceReceived will give you the desired information every time the page tries to load anything.
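For example, a minimal sketch (the target URL is a placeholder). Note that these callbacks expose each resource's URL, status and Content-Type, but not its body - if you need the actual JSON/JS content, you have to fetch the reported URL yourself in a second step.
// minimal sketch: log every resource the page pulls in and flag the JSON ones
var page = require('webpage').create();

page.onResourceRequested = function (requestData) {
  console.log('Requested: ' + requestData.url);
};

page.onResourceReceived = function (response) {
  // each resource fires twice ('start' and 'end'); act on the completed one
  if (response.stage === 'end' && /application\/json/.test(response.contentType || '')) {
    console.log('JSON resource: ' + response.status + ' ' + response.url);
  }
};

page.open('http://example.com/', function (status) {
  console.log('Page load: ' + status);
  phantom.exit();
});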

(Karate) How to intercept the XHR request response code?

I am testing login functionality on a 3rd-party website. I have this URL, example.com/login. When I copy and paste it into the browser (Chrome), the page sometimes loads, but sometimes it does not (an empty blank white page).
The problem is that I have to run a script on this page to click one of the elements (all the elements are embedded inside #shadow-root). If the page loads, no problem, the script is evaluated successfully. But sometimes the page does not load and returns a 404 in response to an XHR request, and as a result my * eval(script(...)) step returns "js eval failed...".
So I found that the solution is to refresh the page, and to do that I am considering capturing the XHR response. If the status code is 404, refresh the page; if not, continue with the following steps.
Now, I think this may work, but I do not know how to implement Karate's HTTP request interception. And first of all, is that even doable?
I have looked into the documentation here, but could not understand the examples.
https://github.com/karatelabs/karate/tree/master/karate-netty
Meanwhile, if there is another way of refreshing the page conditionally, I will be more than happy to hear about it. Thanks to anyone in advance.
First, using JavaScript you should be able to handle shadow roots: https://stackoverflow.com/a/60618233/143475
And the above answer links to advanced examples of executing JS in the context of the current page. I suggest you do some research into that, and try to take the help of someone who knows JS, the DOM and HTML well - you should be able to find a way to know whether the XHR has been made successfully or not, e.g. based on whether some element on the page has changed.
Finally here is how you can do interception: https://stackoverflow.com/a/61372471/143475
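And for the "check an element, refresh if the page came up blank" idea, a rough sketch in Karate - the URL and the '#app-root' locator are placeholders, and the exact keywords should be checked against your Karate version:
* driver 'https://example.com/login'
# run JS inside the page to check whether the app actually rendered
# ('#app-root' is a placeholder for any element that only exists on a good load)
* def loaded = script("document.querySelector('#app-root') != null")
# if the first load came back blank (e.g. the XHR 404'd), reload the page
* if (!loaded) refresh()
# then wait for the element before continuing with the click / login steps
* retry(5, 2000).waitFor('#app-root')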

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy: it's an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
This is how the Network tab looks.
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
https://www.bahiablancapropiedades.com/buscar/resultados/0
This is the XHR URL. While scrolling, the page loads 8 records per request.
So get the total number of records via XPath and divide it by 8; that gives you the number of XHR requests (pages) to fetch.
I had the same issue, and the logic below resolved it:
pagination_count = response.xpath('...').get()  # the XPath of the displayed record count (left elided, as in the original)
value = int(pagination_count) // 8              # 8 records per request
for pagination_value in range(value):
    url = 'https://www.bahiablancapropiedades.com/buscar/resultados/' + str(pagination_value)
    yield scrapy.Request(url, callback=self.parse)  # pass each URL to your Scrapy callback
It is not corrupted HTML; it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
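Building on that, a rough sketch of a spider that pulls the HTML out of the JSON and follows each listing link - the CSS selectors, item fields and page range are placeholders that need adjusting to the actual markup:
import json

import parsel
import scrapy


class LotesSpider(scrapy.Spider):
    name = 'lotes'
    # page numbers are 0-based; adjust the range to the real page count
    start_urls = [
        'https://www.bahiablancapropiedades.com/buscar/resultados/%d' % n for n in range(5)
    ]

    def parse(self, response):
        json_data = json.loads(response.text)
        sel = parsel.Selector(json_data['view'])  # 'view' contains the HTML
        for href in sel.css('a::attr(href)').getall():
            # follow each listing and scrape its detail page
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),  # placeholder field
        }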

Is there any way to archive and recover an entire page (with all HTML, CSS, images, JS, ...) using Selenium ChromeDriver on Ubuntu?

I'm looking for a way to archive the entire state of a webpage.
What I actually want is to somehow save everything the page renders (not as a screenshot, but as the rendered DOM) as we see it in the browser, and restore it in a local environment without network access.
I really don't need to preserve the functionality that interacts with other computers; only the view of the page needs to be archived.
What I tried in order to archive youtube.com's home page was:
1. Using Beautiful Soup to get the immediate HTML source
2. Using Python Selenium and ChromeDriver to get the dynamically loaded HTML source
3. Method 2, plus downloading all referenced .css, .js and image files linked from the HTML into a local directory
4. Pressing Ctrl+S in Chrome, which downloads the HTML source and several files (.js, .css, .jpg, ...)
But none of them worked correctly.
At first the 4th method seemed to work, but I soon found out that it downloads the initial HTML source, not the dynamically loaded one.
Is there any known way to do this kind of thing (archiving the currently rendered state of the page)?
Thanks in advance.
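For what it's worth, here is a rough sketch of approach 2 plus a DevTools snapshot, assuming Selenium 4 with a Chromium-based driver. Page.captureSnapshot is a Chrome DevTools Protocol call that serialises the currently rendered page (DOM plus resources) into a single MHTML file that opens offline; the URL and the fixed sleep are placeholders.
import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/')
time.sleep(5)  # crude wait for dynamically loaded content; tune or replace with explicit waits

# 1) the rendered DOM as plain HTML (no images/CSS bundled with it)
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)

# 2) a self-contained MHTML snapshot of the current state of the page
snapshot = driver.execute_cdp_cmd('Page.captureSnapshot', {'format': 'mhtml'})
with open('page.mhtml', 'w', encoding='utf-8', newline='') as f:
    f.write(snapshot['data'])

driver.quit()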

phantomjs: how to generate a perfect html screenshot?

When I use Ctrl+S to save a page completely in my browser and then open it, it more or less resembles the original website.
However, when I render a site in PhantomJS and look at the generated HTML, it looks very different from the screenshot it produces.
How does PhantomJS produce a good, accurate image screenshot but not a good HTML "screenshot"? For example, when I take a screenshot of futureshop.ca and look at the HTML generated by PhantomJS, it looks like a completely different website. How do we resolve this?
For example, take this CGI proxy:
https://ultraproxy.us/perl/nph-proxy.cgi/en/00/http/www.bestbuy.ca/ (uses perl cgi)
vs.
http://prerender.herokuapp.com/http://www.bestbuy.ca (using phantomjs)
How does a Perl CGI script produce a more accurate-looking page? Is there a way to get PhantomJS to do the same? PhantomJS has the advantage of handling AJAX-loaded content, while the Perl CGI wouldn't.
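For what it's worth, prerender-style HTML is usually produced by just waiting for AJAX activity to settle and then dumping page.content; a minimal sketch follows (the URL and the fixed delay are placeholders). The remaining visual difference is usually that relative resource URLs in the dumped HTML no longer resolve, which the proxy fixes by rewriting them - injecting a <base> tag pointing at the original site is a common workaround.
var page = require('webpage').create();

page.open('http://www.bestbuy.ca/', function (status) {
  if (status !== 'success') {
    console.log('load failed');
    phantom.exit(1);
    return;
  }
  // give AJAX-loaded content some time to arrive before serialising the DOM
  window.setTimeout(function () {
    // page.content is the DOM *after* scripts have run - this is what
    // prerender-style services return, unlike the raw page source
    console.log(page.content);
    phantom.exit();
  }, 5000);
});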

scrape the response which would be loaded from an ajax event

I am using the Scrapy tool to scrape content from a website, and I need help from you guys on how to scrape the response which is dynamically loaded from AJAX.
When content is loaded from AJAX, the URL does not change - it remains the same - but the content changes, and it is on that event that I need to crawl.
Thank you,
G.kavirajan
yield FormRequest('http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php',
                  formdata={'type': 'new', 'ajax': '1'},
                  callback=self.your_callback_method)
Below are the URLs that you can easily catch using Fiddler or Firebug:
This is for the featured tab: http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php?ajax=1&type=random
This is for the new tab: http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php?ajax=1&type=new
You can request these URLs directly to get the results you require. Although the website uses a POST request to get the data for these URLs, I tried passing the parameters in a GET request and that also works properly.
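A rough sketch of that GET variant (the item selectors are placeholders; since the endpoint returns an HTML fragment, normal Scrapy selectors work on the response):
import scrapy


class PrestashopAjaxSpider(scrapy.Spider):
    name = 'prestashop_ajax'
    start_urls = [
        'http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php?ajax=1&type=new',
        'http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php?ajax=1&type=random',
    ]

    def parse(self, response):
        # iterate over the products in the returned HTML fragment
        for product in response.css('li'):
            yield {
                'name': product.css('a::attr(title)').get(),  # placeholder selectors
                'link': product.css('a::attr(href)').get(),
            }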