I'm trying to use fetchText() to print the URL of a Google search result to the terminal. Here's an image of exactly what I'm trying to print.
It only prints a blank line, though! I don't see what I'm doing wrong.
Code:
phantom.casperPath = "/usr/local/Cellar/casperjs/1.0.3/libexec/";
phantom.injectJs(phantom.casperPath + '/bootstrap.js');
var utils = require('utils');
var casper = require('casper').create();
casper.start('https://www.google.com/search?q=amazon+shoes');
casper.wait(3000, function () {
    this.echo(this.fetchText('#rso > div:nth-child(1) > li:nth-child(1) > div > div > div > div.f.kv._TD > cite'));
}).run();
Google serves a different page depending on the user agent string, so you need to set one when creating the instance (the string below is just an example):
var casper = require("casper").create({
    pageSettings: {
        userAgent: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
    }
});
or with the dedicated function:
casper.userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36");
Sometimes it is also necessary to set the viewport to something desktop-like, because PhantomJS's default viewport is 400x300 and Google might render a different page based on the viewport.
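Both settings can be passed together when the instance is created. This is a minimal sketch, assuming CasperJS's `pageSettings` and `viewportSize` creation options; the helper name `buildCasperOptions` and the 1280x800 size are made up for illustration:

```javascript
// Hypothetical helper: collects the desktop-like settings in one options object.
function buildCasperOptions(userAgent, width, height) {
    return {
        pageSettings: { userAgent: userAgent },         // same option as in the example above
        viewportSize: { width: width, height: height }  // replaces the 400x300 default
    };
}

var options = buildCasperOptions(
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    1280, 800 // assumed desktop-like size, pick whatever suits the target page
);

// Inside a CasperJS script the object would be used like this:
// var casper = require("casper").create(options);
console.log(JSON.stringify(options.viewportSize));
```

On an already created instance, `casper.viewport(1280, 800)` should have the same effect for the viewport part.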
I am using a WebView2 control to load a webpage, but after I log in, the page keeps loading instead of proceeding. It seems the site blocks generic browsers that are not well configured. I would like to add a CookieContainer so that cookies are used, add headers stating that decompression is supported and which decompression methods are handled, and set the user agent on the WebView2 control, the same way this answer does for HttpRequest.
Looking online I only found some code that I tried to put together, but it's C# and I'm trying to convert it to VB.NET; no online tool has succeeded in converting it yet.
Private Sub webView2_NavigationStarting(sender As Object, e As Microsoft.Web.WebView2.Core.CoreWebView2NavigationStartingEventArgs) Handles webView2.NavigationStarting
    webView2.AddScriptToExecuteOnDocumentCreated("
        window.WebView2.addEventListener('beforenavigate', function(event) {
            event.preventDefault();
            var xhr = new XMLHttpRequest();
            xhr.open(event.detail.verb, event.detail.uri, true);
            xhr.setRequestHeader('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36');
            xhr.setRequestHeader('Cache-Control', 'no-cache');
            xhr.setRequestHeader('Accept-Encoding', 'gzip, deflate');
            xhr.onreadystatechange = function() {
                if (xhr.readyState === XMLHttpRequest.DONE) {
                    window.WebView2.injectWebResource(event.detail.id, xhr.responseText);
                }
            };
            xhr.send();
        });
    ")
End Sub
Am I using the right methods?
edit1:
I've managed to add the UserAgent
Private Sub WebView21_NavigationStarting(sender As Object, args As Microsoft.Web.WebView2.Core.CoreWebView2NavigationStartingEventArgs) Handles WebView21.NavigationStarting
    Dim userAgent As String = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
    Dim script As String = $"window.navigator.userAgent = '{userAgent}';"
    WebView21.CoreWebView2.AddScriptToExecuteOnDocumentCreatedAsync(script)
End Sub
but it still doesn't proceed after the login.
I'm trying to scrape the odds comparison pages on www.racingpost.com.
Example from racingpost -> these pages only work until the race is over, so if you can't see it anymore, pick a race that is still to come :)
I scraped this site for some info using different spiders, but it seems the bookmakers' odds are not rendered by Splash; at least I can't see the odds in my local Splash instance or in the returned HTML.
I tried:
Increasing the wait time up to 20 seconds
Deactivating private mode
Scrolling down
But it is still not rendering.
How do I scrape these odds?
I tried some solutions from answers here on stackoverflow, the last code I tried was this one:
import scrapy
from scrapy_splash import SplashRequest


class DailyoddSpider(scrapy.Spider):
    name = 'dailyodd'
    allowed_domains = ['www.racingpost.com']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    # Lua script run by Splash: load the page, wait, return the rendered HTML
    script = '''
        function main(splash, args)
            splash.private_mode_enabled = false
            url = args.url
            assert(splash:go(url))
            assert(splash:wait(5))
            return splash:html()
        end
    '''

    def start_requests(self):
        yield SplashRequest(
            url="https://www.racingpost.com/racecards/394/southwell-aw/2022-03-05/804308/odds-comparison",
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': self.script},
        )
Unfortunately, the events do not work for some WebKit browsers on mobile devices.
There seems to be a problem with the touch-event.
OpenLayers versions: 6.4.3 and 6.5.0
Browsers it does not work with:
Miui Browser 71:
Mozilla/5.0 (Linux; U; Android 10; de-de; Redmi Note 8 Pro Build/QP1A.190711.020) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/71.0.3578.141 Mobile Safari/537.36 XiaoMi/MiuiBrowser/12.8.3-gn
Miui Browser 79:
Mozilla/5.0 (Linux; U; Android 10; de-de; Redmi Note 8 Pro Build/QP1A.190711.020) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.147 Mobile Safari/537.36 XiaoMi/MiuiBrowser/12.10.8-gn
Safari 12:
Mozilla/5.0 (iPad; CPU OS 12_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Mobile/15E148 Safari/604.1
You can try it on this website: https://wp-osm-plugin.hyumika.com/map-with-one-html-popup-marker-in-wordpress/ (you cannot click on the marker, drag or pinch)
With Chrome and Firefox it works, but with the WebKit browsers, I get no console output:
a_MapObj.on('singleclick', function(e) {
    console.log('singleclick');
});
a_MapObj.on('click', function(e) {
    console.log('click');
});
a_MapObj.on('dblclick', function(e) {
    console.log('dblclick');
});
a_MapObj.on('error', function(e) {
    console.log('error :' + e);
});
a_MapObj.on('moveend', function(e) {
    console.log('moveend');
});
a_MapObj.on('movestart', function(e) {
    console.log('movestart');
});
a_MapObj.on('pointermove', function(e) {
    console.log('pointermove');
});
a_MapObj.on('pointerdrag', function(e) {
    console.log('pointerdrag');
});
Could you please help me to fix this?
Thanks a lot & regards,
Mark
If I go to the following web page in Chrome, it loads fine: https://www.cruisemapper.com/?poi=39
However, when I run the following PhantomJS script, which simply goes to the same URL and outputs the entire DOM string to the console, I get a 403 Forbidden message:
var page = require('webpage').create(),
    url = 'https://www.cruisemapper.com/?poi=39';
page.open(url, function (status) {
    if (status === 'success') {
        console.log(page.evaluate(function () {
            return document.documentElement.outerHTML;
        }));
        phantom.exit();
    }
});
Here's the exact output to the console:
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /
on this server.<br>
</p>
</body></html>
I thought that if I added some sort of user agent string, it might work. As such, I added the following above the console.log line:
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36';
But that didn't work. So then I tried the following instead:
page.customHeaders = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
};
But that didn't work either. Does anyone have any advice on how I can possibly hit up the URL above and not get a 403 Forbidden message? Thank you.
Your code works fine for me (I'd suggest viewport size emulation though, see the code). If you still get a 403, try changing your IP; it's possible the site has flagged you by now (you probably visited that page many times).
var page = require('webpage').create(),
    url = 'https://www.cruisemapper.com/?poi=39';
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36';
page.viewportSize = { width: 1440, height: 900 }; // <-- otherwise it's 400x300 by default
// It's good to watch for errors on the page
page.onError = function (msg, trace) {
    console.log(msg);
    trace.forEach(function (item) {
        console.log(' ', item.file, ':', item.line);
    });
};
page.open(url, function (status) {
    console.log(status);
    page.render("page.png"); // Also useful to check if you get what you expect
    if (status === 'success') {
        console.log(page.evaluate(function () {
            return document.documentElement.outerHTML;
        }));
    }
    phantom.exit(); // exit on failure too, otherwise the script never terminates
});
I am using phantomjs with mink:
default:
    extensions:
        Behat\MinkExtension\Extension:
            goutte: ~
            selenium2:
                browser: phantomjs
                wd_host: http://localhost:8643/wd/hub
                capabilities:
                    webStorageEnabled: true
But I need to masquerade as the latest Chrome. I have tried this:
/**
 * @BeforeStep
 */
public function masqueradeAsLatestChrome(StepEvent $event)
{
    $this->getSession()->setRequestHeader('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36');
}
But I get the exception:
[Behat\Mink\Exception\UnsupportedDriverActionException]
Exception has been thrown in "beforeStep" hook, defined in FeatureContext::masqueradeAsLatestChrome()
Request header is not supported by Behat\Mink\Driver\Selenium2Driver
The version of Chrome isn't critical, but the web application must think it's talking to a very recent version of Chrome.
Selenium does not provide this capability, as it is not something a
user can do. It's recommended you use a proxy to inject additional
headers to the requests generated by the browser.
https://code.google.com/p/selenium/issues/detail?id=2047#c1
Sadly… However, PhantomJS does provide an interface for setting the headers. Your best shot would be to send a direct command to it using its REST API. There's also a cool PHP wrapper library that would make it 200 times easier.
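As a rough sketch of what such a direct command could look like, the snippet below only builds the HTTP request and does not send it. It assumes GhostDriver's `/session/<id>/phantom/execute` endpoint, where the submitted script runs with `this` bound to the current page; the helper name `buildUserAgentCommand` and the `SESSION_ID` placeholder are made up for illustration:

```javascript
// Builds (but does not send) the request for GhostDriver's
// "execute PhantomJS script" command. Assumptions: the endpoint path
// /session/<id>/phantom/execute, and that `this` in the script is the current page.
function buildUserAgentCommand(wdHost, sessionId, userAgent) {
    // The script text is evaluated inside PhantomJS, not in this process.
    var script = "this.settings.userAgent = " + JSON.stringify(userAgent) + ";";
    return {
        method: "POST",
        url: wdHost.replace(/\/+$/, "") + "/session/" + sessionId + "/phantom/execute",
        body: JSON.stringify({ script: script, args: [] })
    };
}

var command = buildUserAgentCommand(
    "http://localhost:8643/wd/hub", // wd_host from the question's config
    "SESSION_ID",                   // placeholder: use the real WebDriver session id
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"
);
console.log(command.method, command.url);
```

The resulting object can be handed to any HTTP client. The new user agent would only apply to pages loaded after the command has run.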
You should use the new Behat/Mink driver made by Juan Francisco Calderón Zumba
https://github.com/jcalderonzumba
Here is a direct link to the driver https://github.com/jcalderonzumba/MinkPhantomJSDriver
This driver allows you to specify the request headers that you need
(It works with Behat 3.0 but I think it requires at least PHP 5.4)
You can pass additional settings to Selenium2Driver via extra_capabilities, so in your case:
default:
    extensions:
        Behat\MinkExtension\Extension:
            goutte: ~
            selenium2:
                browser: phantomjs
                wd_host: http://localhost:8643/wd/hub
                capabilities:
                    webStorageEnabled: true
                    extra_capabilities:
                        phantomjs.page.settings.userAgent: "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"
For https://github.com/facebook/php-webdriver you can set the user agent like this:
$capabilities = [
    WebDriverCapabilityType::BROWSER_NAME => 'phantomjs',
    WebDriverCapabilityType::PLATFORM => 'ANY',
    WebDriverCapabilityType::ACCEPT_SSL_CERTS => false,
    WebDriverCapabilityType::JAVASCRIPT_ENABLED => true,
    'phantomjs.page.settings.userAgent' => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
];
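Both configurations above rely on the same capability key, `phantomjs.page.settings.userAgent`. As far as I know, GhostDriver maps any `phantomjs.page.settings.<name>` capability onto the matching `page.settings` entry, so the pattern can be sketched in a language-neutral way. The helper name `phantomPageCapabilities` is hypothetical, and `loadImages` is only there to show a second setting:

```javascript
// Hypothetical helper: prefixes every page setting with "phantomjs.page.settings."
// so it can be sent as a WebDriver capability to PhantomJS/GhostDriver.
function phantomPageCapabilities(pageSettings) {
    var capabilities = { browserName: "phantomjs" };
    Object.keys(pageSettings).forEach(function (name) {
        capabilities["phantomjs.page.settings." + name] = pageSettings[name];
    });
    return capabilities;
}

var caps = phantomPageCapabilities({
    userAgent: "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36",
    loadImages: false // illustrative extra setting, not part of the original answers
});
console.log(Object.keys(caps).join("\n"));
```

The resulting object has the same shape as the capability maps shown above, whichever client library ends up sending it.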