I'm a Japanese chess (shogi) player and I would like to plot the popularity of a strategy as a function of time. To do that, I'm using an online game database; here is the page for the first strategy, called Yagura:
https://shogidb2.com/strategy/%E7%9F%A2%E5%80%89/page/1
What I would like to do is store the year that appears at the beginning of each game entry, so that I can collect the years and then count them. On this page it is "2017". But I can't get the text out of the page. I also tried to find the links to the individual game pages to get the data from there, but the links don't appear either.
Here is my code. Any tips are welcome, I'm starting to go crazy ^^
import requests
from bs4 import BeautifulSoup
def downloadString(url, params=None, cookies=None):
    # Send a browser-like User-Agent with the request
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    data = requests.get(url, params=params, headers=headers, cookies=cookies)
    return data.text
url = "https://shogidb2.com/strategy/%E7%9F%A2%E5%80%89"
html_doc = downloadString(url)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html_doc, "html.parser")
links = []
for link in soup.find_all("a"):
    print(link.get("href"))
The problem is that the website is built with ReactJS, which renders the content in the browser through its virtual DOM. BeautifulSoup, on the other hand, only parses the HTML that the server returns. Since the elements you are looking for are not in that HTML, it will not find them. The best solution is to use CasperJS (http://casperjs.org/).
The only reason I'm suggesting CasperJS is that it is much simpler to use than Python scraping tools like Selenium. If you really want to stay in Python, Selenium should work for you, but it's hard to configure the first time.
Install PhantomJS and CasperJS with npm install -g phantomjs casperjs.
PS: PhantomJS is just a dependency of CasperJS.
// scrape.js
var casper = require('casper').create();
var links;
function getLinks() {
    // Runs in the page context: collect the href of every <a> element
    var anchors = document.querySelectorAll('a');
    return Array.prototype.map.call(anchors, function (e) {
        return e.getAttribute('href');
    });
}
// Open the strategy page
casper.start('https://shogidb2.com/strategy/%E7%9F%A2%E5%80%89');
casper.then(function () {
    links = this.evaluate(getLinks) || [];
});
casper.run(function () {
    for (var i = 0; i < links.length; i++) {
        console.log(links[i]);
    }
    // Exit explicitly once the links have been printed
    this.exit();
});
To run the script: casperjs scrape.js
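Once the text of each game entry can be read from the rendered page, the counting step described in the question could be sketched roughly as below. This is only an illustration: countByYear is a hypothetical helper, the sample strings are made up, and the assumption that each entry contains a four-digit year is not confirmed against the site. It is written in ES5 style so it could also be pasted into a CasperJS script.

```javascript
// Count how many entries fall in each year.
// Assumption: every entry is a string containing a four-digit year
// (19xx or 20xx); the real format used by the site is not confirmed here.
function countByYear(entries) {
    var counts = {};
    for (var i = 0; i < entries.length; i++) {
        var match = /\b(19|20)\d{2}\b/.exec(entries[i]);
        if (!match) {
            continue; // skip entries without a recognizable year
        }
        var year = match[0];
        counts[year] = (counts[year] || 0) + 1;
    }
    return counts;
}

// Hypothetical sample data, not taken from the site
var sample = ['2017-05-03', '2017-06-11', '2016-12-01', 'no date'];
console.log(countByYear(sample));
```

The resulting object maps each year to its number of games, which can then be fed to whatever plotting tool you prefer.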
At a high level, does anyone know how to enter Immersive Reader mode in Microsoft Edge through Selenium when it is available for a given webpage?
My aim is to load a page, enter Immersive Reader, and save the page's source code to disk. I'm starting Edge in Docker and connecting to it programmatically from a Node.js script.
I've tried driver.actions().sendKeys(KEY.F9), but that doesn't work since I'm targeting the browser and not a DOM element.
Many thanks for all your help.
New
Just run
driver.get('read://' + url)
and the site is loaded in immersive reader mode if available.
Old
To interact with the browser UI you have to use pyautogui (pip install pyautogui) and then run this code while the browser window is in focus:
import pyautogui
pyautogui.press('F9')
It is also useful, for example, for saving a PDF by interacting with the dialog that appears when you press CTRL+S.
Here's a bit of code for anyone else who might stumble across this.
Credit to @sound wave for helping me get there!
const { Builder } = require('selenium-webdriver');
const fs = require('fs');
(async () => {
    const driver = await new Builder().forBrowser('MicrosoftEdge').usingServer('http://localhost:4444').build();
    await driver.get('read://https://www.bbc.co.uk/news/entertainment-arts-64302120'); // this URL needs to be Immersive Reader supported
    await driver.switchTo().frame(0);
    const pagesource = await driver.getPageSource();
    fs.writeFile('test.html', pagesource, err => {
        if (err) {
            console.log(err);
        }
    });
    const title = (await driver.getTitle()).trim();
    console.log(title);
    await driver.quit();
})().catch((e) => console.error(e));
I've searched related topics. The most relevant one I found is "Setting user-agent in browsers with testcafe",
but it doesn't provide any real answers.
My goal is to run the tests while spoofing a different OS: I'm on Linux, which the app I'm testing doesn't support, so it shows a couple of warnings that I would like to get rid of while the tests are running.
We tried Cypress, where you just add the user agent string to a config file and that's it. But I haven't found a straightforward way of doing it in TestCafe without a CLI parameter.
Is there a way to spoof an OS or userAgent in testcafe?
You can modify the user-agent header using the RequestHooks mechanism.
I prepared an example to demonstrate this approach:
import { RequestHook } from 'testcafe';
class UserAgentRequestHook extends RequestHook {
    onRequest (e) {
        e.requestOptions.headers['user-agent'] = 'Mozilla/5.0 (Android 4.4; Tablet; rv:41.0) Gecko/41.0 Firefox/41.0';
    }
    onResponse (e) {
    }
}
const hook = new UserAgentRequestHook();
fixture `f`
    .page `https://www.whatismybrowser.com/detect/what-is-my-user-agent/`;
test.requestHooks(hook)(`test`, async t => {
    await t.debug();
});
Please note that TestCafe uses the user agent internally, so an incorrect UA value can lead to unpredictable results.
const puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
const page = await browser.newPage();
This one works
await page.goto('https://example.com');
This doesn't work (without the protocol, i.e. http/https):
await page.goto('www.example.com');
It throws this error:
Protocol error (Page.navigate): Cannot navigate to invalid URL
Why doesn't it append the protocol like it does when we open in Google Chrome?
The Google Chrome Omnibox (address bar) has built-in functionality to handle multiple complexities, such as adding a missing protocol, autocomplete, etc.
Puppeteer provides an API to control Chrome or Chromium over the DevTools Protocol, so much of this functionality is currently out of the scope of Puppeteer.
The Puppeteer documentation for the function page.goto() explicitly states:
The url should include scheme, e.g. https://.
This is because page.goto() utilizes Page.navigate from the Chrome DevTools Protocol.
The Chromium source code shows that navigation via Page.navigate is explicitly checked for validity, and if the URL is not valid, it will return the error, "Cannot navigate to invalid URL."
You can easily create a function in Node.js that prepends a protocol to URLs that lack one, and that could be a workaround for your issue.
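A minimal sketch of such a workaround is shown below. The helper name withScheme and the choice of https as the default are assumptions made for illustration; neither is part of Puppeteer.

```javascript
// Prepend a default scheme when the URL does not already start with one.
// `withScheme` is a hypothetical helper, not a Puppeteer API.
function withScheme(url, defaultScheme = 'https') {
  const trimmed = url.trim();
  // A scheme starts with a letter, may contain letters, digits, "+", "-"
  // or ".", and is followed here by "://".
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed)) {
    return trimmed; // already has a scheme, leave it untouched
  }
  return `${defaultScheme}://${trimmed}`;
}

console.log(withScheme('www.example.com'));    // https://www.example.com
console.log(withScheme('http://example.com')); // http://example.com
```

It would then be used as await page.goto(withScheme('www.example.com')). Note that this sketch only recognizes URLs that contain "://", so values such as about:blank would need to be special-cased.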
I got the same error when I was passing the URLs as an array of arrays:
const urls = [["https://www.example1.com"], ["https://www.example2.com"]]
Flattening it solved it for me:
const flatUrls = [].concat(...urls)
for (const url of flatUrls) {
    await page.goto(url)
}
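On recent Node.js versions the same flattening can also be done with Array.prototype.flat(). The sketch below additionally checks that every entry really is a string before it is handed to page.goto(); flattenUrls is a hypothetical helper name used only for illustration.

```javascript
// Flatten one level of nesting and verify that every entry is a string.
// `flattenUrls` is a hypothetical helper, not part of Puppeteer.
function flattenUrls(urls) {
  const flat = urls.flat(); // [['a'], ['b']] becomes ['a', 'b']
  for (const url of flat) {
    if (typeof url !== 'string') {
      throw new TypeError('Expected a URL string, got: ' + JSON.stringify(url));
    }
  }
  return flat;
}

const nested = [['https://www.example1.com'], ['https://www.example2.com']];
console.log(flattenUrls(nested));
```

Failing early with a TypeError makes a leftover nested array easier to spot than the generic "Cannot navigate to invalid URL" message.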
This has been an annoying problem for days now. As I start to try to write acceptance tests for my Ember app, when I use the visit() function, the URL is changed in the browser's address bar, so when I change a bit of code and the liveReload happens, it navigates off my test page to whatever page I had told it to visit in the tests.
To troubleshoot, I ember new'd a new app, created a /home route and template, and created an acceptance test for it, and it passed fine, without changing the URL in the address bar. I've compared the code in tests/helpers and it's the same, as is tests/index.html.
I've searched all over without coming across an answer. It's been hard enough for me to grok testing, and problems like this are tangential but very irritating. If anyone has a tip as to why this is happening, I'd be extremely grateful for a fix.
As an example, here's my one acceptance test. It passes, but the URL actually changes:
import Ember from 'ember';
import { module, test } from 'qunit';
import startApp from 'star/tests/helpers/start-app';
var application;
module('Acceptance: AddMilestone', {
    beforeEach: function() {
        application = startApp();
    },
    afterEach: function() {
        Ember.run(application, 'destroy');
    }
});
test('Adding milestones', function(assert) {
    visit('/projects/1234567/details');
    andThen(function() {
        assert.equal(currentPath(), 'project.details');
    });
});
Look in config/environment.js for a block similar to this:
if (environment === 'test') {
    // Testem prefers this...
    ENV.baseURL = '/';
    ENV.locationType = 'none';
    // keep test console output quieter
    ENV.APP.LOG_ACTIVE_GENERATION = false;
    ENV.APP.LOG_VIEW_LOOKUPS = false;
    ENV.APP.rootElement = '#ember-testing';
}
Is ENV.locationType set to 'none' for your test environment?
If it isn't, set it there; if it is, check whether you are changing the locationType elsewhere in your app. Setting it to 'none' leaves the address bar alone.
I have a simple service that sets cookies in angular, but there's no obvious way to test that they've been set in an end-to-end test.
The code to test is as simple as
var splashApp = angular.module('splashApp', ['ngCookies']);
splashApp.controller('FooterController', function ($location, $cookies) {
$cookies.some_cookie = $location.absUrl();
});
But I can't find any docs on how to test. Here's what I have found:
How to access cookies in AngularJS?
http://docs.angularjs.org/api/ngCookies.$cookies
http://docs.angularjs.org/api/ngCookies.$cookieStore
I've also tried
angular.scenario.dsl('cookies', function() {
    var chain = {};
    chain.get = function(name) {
        return this.addFutureAction('get cookies', function($window, $document, done) {
            var injector = $window.angular.element($window.document.body).inheritedData('$injector');
            var cookies = injector.get('$cookies');
            done(null, cookies);
        });
    };
    return function() {
        return chain;
    };
});
But this returns only the cookies for the parent browser, not the page I want to test.
Any examples on how to do this?
It seems like you need to use PhantomJS.
PhantomJS is a headless WebKit scriptable with a JavaScript API. It
has fast and native support for various web standards: DOM handling,
CSS selector, JSON, Canvas, and SVG. --PhantomJS website
It has support for custom cookies in its API.
In terms of testing, this is probably your best choice.
You might also want to look at CasperJS to help test page navigation.