Phantomas does not render a page whose URL contains a hash mark - phantomjs

I'm testing a web page with Phantomas, but I ran into a problem when the URL contains a hash mark, such as 'http://bookstore2.shuqireader.com/route.php?sq_pg_param=bsbc&ver=151011#!/bid/3379630/'.
The screenshot Phantomas takes of this page is completely blank, but the page renders perfectly when I use PhantomJS alone.
I installed Phantomas with 'npm install' and ran:
phantomas http://bookstore2.shuqireader.com/route.php?sq_pg_param=bsbc&ver=151011#!/bid/3379630/ --screenshot=saveimg.png
The resulting saveimg.png is completely blank.
var webPage = require('webpage');
var page = webPage.create();
page.customHeaders = {
  "User-Agent": "Mozilla/5.0 (Linux; U; Android 4.0; en-us; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"
};
page.open('http://bookstore2.shuqireader.com/route.php?sq_pg_param=bsbc&ver=151011#!/bid/3379630/', function (status) {
  // Only render when the page actually loaded
  if (status == "success") {
    page.render('saveimg.png');
  }
  phantom.exit();
});
With plain PhantomJS, saveimg.png comes out fine.
Is this a bug in Phantomas?

Not the answer you want, but in general all these NPM wrappers around PhantomJS suck for various reasons: the authors usually handle only their own specific use case, so the packages fail anyone with slightly different needs.
Usually they fail for performance reasons (no problem if you are OK with maxing out a CPU on every request), but as you can see, sometimes you'll be caught by situations the authors didn't code for.
You are much better off just launching phantomjs.exe as a child process yourself. Another alternative is to use the API at http://api.PhantomJsCloud.com (disclosure: I made it).
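To illustrate the child-process route, here is a minimal Node.js sketch. It assumes a phantomjs binary on the PATH and a hypothetical render.js script that reads the page URL and the output file from its arguments; neither name comes from the question, and the helper names are made up for illustration. Because execFile does not go through a shell, characters such as & and # in the URL reach PhantomJS untouched.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: argument list for a PhantomJS script that expects
// the page URL and the screenshot path as its two arguments.
function buildPhantomArgs(scriptPath, url, outFile) {
  return [scriptPath, url, outFile];
}

// Runs PhantomJS as a child process and resolves with its stdout.
// The 'phantomjs' binary name and the 30 s timeout are assumptions.
function renderWithPhantom(scriptPath, url, outFile, binary = 'phantomjs') {
  return new Promise((resolve, reject) => {
    execFile(
      binary,
      buildPhantomArgs(scriptPath, url, outFile),
      { timeout: 30000 },
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

A call might look like renderWithPhantom('render.js', pageUrl, 'saveimg.png'), with the returned promise rejecting if PhantomJS exits with an error or exceeds the timeout.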

Related

Scrapy - Javascript rendering

I would like to get some data from here:
https://www.drivy.com/location-voiture/liege/mitsubishi-colt-359699?address=Gare+de+Li%C3%A8ge-Guillemins&city_display_name=&country_scope=BE&distance=200&end_date=2019-05-27&end_time=06%3A00&latitude=50.6251&longitude=5.5659&start_date=2019-05-26&start_time=06%3A00
I'm searching for the ID of the owner of the car. This ID is within the a element of class car_owner_section. For the page above it is the number in the href attribute, like this: "/users/1228276". The issue is that this link is apparently rendered by JavaScript, and I absolutely want to avoid scrapy-splash. Does anyone have an idea how to find this ID? It should be somewhere in a JSON response, I guess, but I've searched for days now and found nothing.
I tested it in the Scrapy shell, and the response contains the link you are looking for, without using Splash. You might want to check your settings:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

PhantomJS process keeps running in background after calling program.kill()

I'm using phantomjs and webdriverio to fetch and render a webpage that's loaded by javascript, then save it to be parsed later by cheerio.
Here's the code for that:
import phantomjs from 'phantomjs-prebuilt'
const webdriverio = require('webdriverio')
const wdOpts = {
  desiredCapabilities: {
    browserName: 'phantomjs'
  }
}
async parse (parseUrl) {
  return phantomjs.run('--webdriver=4444').then(program => {
    return webdriverio.remote(wdOpts)
      .init()
      .url(parseUrl)
      .waitForExist('.main-ios', 100000)
      .pause(5000)
      .getHTML('html', true)
      .then((html) => {
        program.kill()
        return html
      })
  })
}
Even though I call program.kill(), I notice that phantomjs is still in the list of processes, and it uses quite a bit of RAM and CPU.
I'm wondering why the process doesn't terminate.
.close() just closes the window. There is a known bug, if it is the last window it stays open.
.quit() should do it, but there are issues associated with that as well.
PhantomJS bug report: https://github.com/detro/ghostdriver/issues/162
Someone posted a decent workaround at the bottom of this thread:
https://github.com/SeleniumHQ/selenium/issues/767#issuecomment-140367536
The fix sends a SIGTERM to end the process (it's in Python, but might be useful):
import signal

# assume browser = webdriver.PhantomJS()
browser.service.process.send_signal(signal.SIGTERM)  # stop the PhantomJS process itself
browser.quit()
I like to just open a Docker container with my automation and run it in there. Docker cleans it up for me; however, that is probably out of scope for what you want to do. I would recommend the SIGTERM+quit method above.
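The same SIGTERM idea can be sketched on the Node.js side. This is only a sketch: it assumes the program object is a standard ChildProcess (as returned by child_process.spawn), and terminate and graceMs are hypothetical names. It sends SIGTERM, waits for the exit event, and escalates to SIGKILL if the process is still alive after the grace period.

```javascript
// Sends SIGTERM to a child process and escalates to SIGKILL if it has not
// exited within graceMs. Resolves with the signal that ended the process
// (or null if it had already exited normally).
function terminate(child, graceMs = 2000) {
  return new Promise((resolve) => {
    if (child.exitCode !== null || child.signalCode !== null) {
      resolve(child.signalCode); // already gone, nothing to kill
      return;
    }
    // Fallback: force-kill if the polite request is ignored.
    const timer = setTimeout(() => child.kill('SIGKILL'), graceMs);
    child.once('exit', (code, signal) => {
      clearTimeout(timer);
      resolve(signal);
    });
    child.kill('SIGTERM');
  });
}
```

Calling it from a finally handler rather than only from the success path means the process is also cleaned up when a step such as waitForExist rejects.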
PhantomJS is a two-component product. There is the JavaScript that runs on the client side (whether web or another script) as part of your code, and there is the part that runs as a server-side application (the command-line call).
It has been my experience with PhantomJS that when an error is encountered, the PhantomJS server side "hangs": it keeps running but is unresponsive. If you can update the call to your script to log its output, you may be able to see what error the PhantomJS application is encountering.
phantomjs /path/to/script/ > /path/to/log/file 2>&1
Hope this Helps! If you'd like me to clarify anything, or elaborate I'm happy to update my answer, just let me know in a comment, Thanks!

Phantomjs no render

I'm working on a small project that uses PhantomJS to take screenshots.
I use the standard script (http://phantomjs.org/screen-capture.html) and everything works perfectly,
but if I change the URL to https://fr.wikipedia.org/w/index.php?title=Pomme&direction=next&oldid=46779461, the screenshot no longer works.
I don't understand why it fails...
var page = require('webpage').create();
page.open('https://fr.wikipedia.org/w/index.php?title=Pomme&direction=next&oldid=46779461', function() {
  page.render('github2.png');
  phantom.exit();
});
I reproduced the issue and it seems like only that single page doesn't work.
Looks like that lemon-loving troll added so many "OOOOO"s that it broke the layout to the point that phantomjs is suiciding and refusing to cooperate.
That's my conclusion. My advice for today is to point your scripts to literally anything else.

Mobile Site not giving correct Data - Beautiful Soup

I'm trying to get product details from the following website.
Baby Shampoo
Specifically the TCIN:# and product details.
But this information is not showing up in the page when I parse it.
A simple line like:
spans = soup.find_all("span", {"class" : "list-value"})
is turning up no results, and when I go even more basic:
print(soup.prettify())
I see the page print out but none of the details are in the page. I am not seeing any iframes on the page, and can't figure out why the data is not showing.
I even attempted to adjust my headers in the request:
headers = { 'User-Agent': 'Mozilla/5.0 (Linux; <Android Version>; <Build Tag etc.>) AppleWebKit/<WebKit Rev> (KHTML, like Gecko) Chrome/<Chrome Rev> Mobile Safari/<WebKit Rev>'}
and also:
headers = { 'User-Agent': 'Mozilla/5.0'}
but neither of these changes the results. Any ideas what could be happening, and where this data could be located?
Thanks,
Mike
If you look at the network requests in Chrome Developer Tools or Firefox Firebug, you can see all the HTTP GET and POST requests the page makes, and then you have to find out which one contains the information you need. Make sure the Network panel is enabled and Preserve log is checked before loading the page in the browser. In your case, the information is fetched by this GET request: http://tws.target.com/productservice/services/item_service/v1/by_itemid?id=13197674&callback=browseCallback
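The callback=browseCallback parameter suggests the endpoint answers with JSONP, meaning JSON wrapped in a function call. Assuming that is the case, a small sketch for unwrapping such a body could look like this. parseJsonp is a hypothetical helper, and the field name in the usage note is invented for illustration; the real payload layout has to be read from the actual response.

```javascript
// Strips a JSONP wrapper such as browseCallback({...}); and parses the
// JSON inside. Throws if the body is not wrapped in the expected callback.
function parseJsonp(body, callbackName) {
  const text = body.trim();
  const prefix = callbackName + '(';
  if (!text.startsWith(prefix)) {
    throw new Error('Response is not wrapped in ' + callbackName + '(...)');
  }
  const end = text.lastIndexOf(')');
  if (end < prefix.length) {
    throw new Error('Missing closing parenthesis in JSONP response');
  }
  return JSON.parse(text.slice(prefix.length, end));
}
```

For a made-up body such as browseCallback({"tcin":"13197674"}); the call parseJsonp(body, 'browseCallback') returns an object whose tcin property holds the string.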

Reliably getting favicons in Chrome extensions, chrome://favicon?

I'm using the chrome://favicon/ in my Google Chrome extension to get the favicon for RSS feeds. What I do is get the base path of linked page, and append it to chrome://favicon/http://<domainpath>.
It's working really unreliably. A lot of the time it's reporting the standard "no-favicon"-icon, even when the page really has a favicon. There is almost 0 documentation regarding the chrome://favicon mechanism, so it's difficult to understand how it actually works. Is it just a cache of links that have been visited? Is it possible to detect if there was an icon or not?
From some simple testing, it's just a cache of favicons for pages you have visited. So if I subscribe to dribbble.com's RSS feed, it won't show a favicon in my extension. Then if I visit chrome://favicon/http://dribbble.com/, it won't return the right icon. Then I open dribbble.com in another tab, and it shows its icon in the tab; when I then reload the chrome://favicon/http://dribbble.com/ tab, it returns the correct favicon. Then I open my extension's popup and it still shows the standard icon. But if I then restart Chrome, it gets the correct icon everywhere.
Now that's just from some basic research, and it doesn't get me any closer to a solution. So my question is: is chrome://favicon/ the right tool for what I'm doing? Is there any documentation for it? And what is its intended behavior?
I've seen this problem as well and it's really obnoxious.
From what I can tell, Chrome populates the chrome://favicon/ cache after you visit a URL (omitting the #hash part of the URL if any). It appears to usually populate this cache sometime after a page is completely loaded. If you try to access chrome://favicon/http://yoururl.com before the associated page is completely loaded you will often get back the default 'globe icon'. Subsequently refreshing the page you're displaying the icon(s) on will then fix them.
So, if you can, possibly just refreshing the page you're displaying the icons on just prior to displaying it to the user may serve as a fix.
In my use case, I am actually opening tabs which I want to obtain the favicons from. So far the most reliable approach I have found to obtain them looks roughly like this:
chrome.webNavigation.onCompleted.addListener(onCompleted);
function onCompleted(details)
{
  if (details.frameId > 0)
  {
    // we don't care about activity occurring within a subframe of a tab
    return;
  }
  chrome.tabs.get(details.tabId, function(tab) {
    var url = tab.url ? tab.url.replace(/#.*$/, '') : ''; // drop #hash
    var favicon;
    var delay;
    if (tab.favIconUrl && tab.favIconUrl != ''
        && tab.favIconUrl.indexOf('chrome://favicon/') == -1) {
      // favicon appears to be a normal url
      favicon = tab.favIconUrl;
      delay = 0;
    }
    else {
      // couldn't obtain favicon as a normal url, try chrome://favicon/url
      favicon = 'chrome://favicon/' + url;
      delay = 100; // larger values will probably be more reliable
    }
    setTimeout(function() {
      /// set favicon wherever it needs to be set here
      console.log('delay', delay, 'tabId', tab.id, 'favicon', favicon);
    }, delay);
  });
}
This approach returns the correct favicon about 95% of the time for new URLs, using delay=100. Increasing the delay if you can accept it will increase the reliability (I'm using 1500ms for my use case and it misses <1% of the time on new URLs; this reliability worsens when many tabs are being opened simultaneously). Obviously this is a pretty imprecise way of making it work but it is the best method I've figured out so far.
Another possible approach is to instead pull favicons from http://www.google.com/s2/favicons?domain=somedomain.com. I don't like this approach very much as it requires accessing the external network, relies on a service that has no guarantee of being up, and is itself somewhat unreliable; I have seen it inconsistently return the "globe" icon for a www.domain.com URL yet return the proper icon for just domain.com.
Hope this helps in some way.
As of Oct 2020, it appears chrome extensions using manifest version 3 are no longer able to access chrome://favicon/* urls. I haven't found the 'dedicated API' the message refers to.
Manifest v3 and higher extensions will not have access to the
chrome://favicon host; instead, we'll provide a dedicated API
permission and different URL. This results in being able to
tighten our permissions around the chrome:-scheme.
In order to use chrome://favicon/some-site in an extension, manifest.json needs to be updated:
"permissions": ["chrome://favicon/"],
"content_security_policy": "img-src chrome://favicon;"
Tested on Chrome Version 63.0.3239.132 (Official Build) (64-bit)
The chrome://favicon URL is deprecated in favor of the new favicon API in Manifest V3.
// manifest.json
{
  "permissions": ["favicon"]
}
// utils.js
function getFaviconUrl(url) {
  return `chrome-extension://${chrome.runtime.id}/_favicon/?pageUrl=${encodeURIComponent(url)}&size=32`;
}
Source: https://groups.google.com/a/chromium.org/g/chromium-extensions/c/qS1rVpQVl8o/m/qmg1M13wBAAJ
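For an extension that has to support both manifest versions, the two URL formats mentioned in this thread could be combined roughly as follows. This is a sketch under assumptions: faviconUrlFor is a hypothetical helper, and the manifest version and extension ID are passed in as parameters so the function can run outside Chrome. Inside an extension they would come from chrome.runtime.getManifest().manifest_version and chrome.runtime.id.

```javascript
// Builds a favicon URL for a page, depending on the manifest version.
// manifestVersion and extensionId are plain parameters (an assumption made
// here so the function has no dependency on the chrome.* APIs).
function faviconUrlFor(pageUrl, manifestVersion, extensionId, size = 32) {
  if (manifestVersion >= 3) {
    // Manifest V3: the _favicon endpoint, requires the "favicon" permission.
    return `chrome-extension://${extensionId}/_favicon/?pageUrl=${encodeURIComponent(pageUrl)}&size=${size}`;
  }
  // Manifest V2: chrome://favicon/, with the #hash part dropped as an
  // earlier answer in this thread does.
  return 'chrome://favicon/' + pageUrl.replace(/#.*$/, '');
}
```

Keeping the choice in one function means the rest of the extension only ever deals with a finished image URL.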
I inspected the website-icon in Chrome history page and found this simpler method.
You can get the favicon url by --
favIconURL = "chrome://favicon/size/16@1x/" + tab.url;
Don't forget to add "permissions" and "content_security_policy" to manifest.json. (https://stackoverflow.com/a/48304708/9586876)
In the latest version of Chrome that I tested (Version 78.0.3904.87 (Official Build) (64-bit)), adding just img-src chrome://favicon; as content_security_policy still shows 2 warnings:
'content_security_policy': CSP directive 'script-src' must be specified (either explicitly, or implicitly via 'default-src') and must whitelist only secure resources.
And:
'content_security_policy': CSP directive 'object-src' must be specified (either explicitly, or implicitly via 'default-src') and must whitelist only secure resources.
To get rid of them use:
"permissions": ["chrome://favicon/"],
"content_security_policy": "script-src 'self'; object-src 'self'; img-src chrome://favicon;"
Now you can use chrome://favicon/http://example.com without getting any errors or warnings.