Python Scrapy Splash doesn't render website, stuck at loading screen

I would like to render the following website with Scrapy Splash.
https://m.mobilebet.com/en/sports/football/england-premier-league/
Unfortunately, Splash always gets stuck at the loading screen.
I have already tried using a long wait time (up to 60 seconds) with no results. My Splash version is 3.3.1, and obeying robots.txt has been set to false.
Thanks!

There's not quite enough info to answer, but I've got a good guess.
You see, the major difference between Splash and your browser is the user agent string. You have one that looks like a person. Splash generally doesn't.
This kind of infinite loading is a technique sites use to mitigate repetitive, automated traffic. You'll often trip it when developing locally without a proxy, and it's maddening to develop against because it's inconsistent.
Your requests are just getting dropped; you'll probably see a 403 after 5-10 minutes.
I think it's likely you can solve this issue with the method mentioned in this answer: Scrapy+Splash return 403 for any site.
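Here is a minimal sketch of that idea, assuming a standard scrapy-splash setup; the spider name, the selector, and the exact User-Agent string are placeholders, and setting USER_AGENT in settings is another commonly suggested route. The point is simply to send a browser-like User-Agent so the site doesn't hand its loading screen to the headless renderer.

import scrapy
from scrapy_splash import SplashRequest

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"
)

class MobilebetSpider(scrapy.Spider):
    name = "mobilebet"  # placeholder name
    custom_settings = {"ROBOTSTXT_OBEY": False, "USER_AGENT": BROWSER_UA}

    def start_requests(self):
        url = "https://m.mobilebet.com/en/sports/football/england-premier-league/"
        yield SplashRequest(
            url,
            callback=self.parse,
            args={"wait": 10},  # give the JS app time to render
            # passing the header on the request is one suggested way to get
            # a browser-like User-Agent through to the render endpoint
            headers={"User-Agent": BROWSER_UA},
        )

    def parse(self, response):
        # the selector below is a placeholder; adjust it to the rendered markup
        for row in response.css("div.event"):
            yield {"text": " ".join(row.css("::text").getall())}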

I don't think it'll be possible - this website needs JS to be rendered. So you'll need to use something like Selenium to scrape information from it.
Also, perhaps what you are looking for is an API for that information - scraping it from a website can be very inefficient. Try googling "sports REST API" and look for one with a Python SDK.
Ok, so it seems Splash is supposed to render the JS for you. But I wouldn't rely on it too much - those websites change constantly and are developed against the latest browsers. Your best bet is to use Selenium with the Chromium driver (though using an API is still preferable).
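Purely as an illustration, a rough Selenium sketch along those lines might look like this (it assumes chromedriver is installed and on PATH; the 30-second timeout and the wait target are arbitrary placeholders to adjust):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://m.mobilebet.com/en/sports/football/england-premier-league/")
    # wait until the JS app has rendered something; the selector is a placeholder
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "body"))
    )
    print(driver.page_source[:500])
finally:
    driver.quit()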

Related

Why is this website so slow on mobile [shopify]

I'm trying to help a client with their slow-ish website https://www.dp-tools.de. If I use Google PageSpeed for mobile I can see that it takes 7 seconds to become interactive, but nothing in the reported problems really tells me why it is actually that slow. I also tried Chrome Lighthouse and couldn't really see all that much.
Is there another way of checking or maybe anyone here sees why it is so slow?
Open up the client's Shopify Admin and examine the theme. Does it have any strange slow code in it? Examine the installed Apps that directly affect the theme. Are any of them crap, old, broken junk?
The best way to debug a slow theme is to just start hacking out any junk the client may have added to it. A lot of themes are so bad that, for example, they load jQuery three times. Likely you have one bad apple in there: a call that is blocking and takes way too long to time out or respond. Developer tools can point these out in your console. Mobile browsers come with a developer console you can inspect too, right?

Handling SEO on Isomorphic React

I'm using React & Node.js to build universal apps. I'm also using react-helmet as the library to handle the page title, meta, description, etc.
But I have some trouble when content is loaded dynamically using AJAX: the Google crawler cannot fetch my site correctly because the content only appears after those calls complete. Any suggestion to tackle this problem?
Thank you!
I had a similar situation, though with Django as the backend; I don't think it matters which backend you use.
First, let me get to the basics: the Google bots don't actually wait for your AJAX calls to complete. If you want to test this, register your page with Google Webmaster Tools and try "Fetch as Google" - you will see how your page looks to the bots (mine was just an empty page with a loading icon). Since the calls don't complete, there is no data and the page is empty, which is bad for SEO, as bots read text.
So what you need to do is try server-side rendering. You can do this in two ways: either use prerender.io or create templates on the backend that are served when the page is requested for the first time, after which your single-page app kicks in.
Prerender is a paid service, but internally it uses phantom.js, which you can also use directly. It did not work out really well for me, so I went with creating templates on the backend: when a bot or a user hits the page for the first time (or on first entry), the page is served from the backend, otherwise from the front end. A rough sketch of that idea is below.
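One common variant of this only pre-renders for crawlers; here is a hedged Django sketch of that variant (the view name, template names, and bot list are all illustrative, not taken from my actual project):

from django.shortcuts import render

BOT_SIGNATURES = ("googlebot", "bingbot", "yandex", "duckduckbot")

def index(request):
    ua = request.META.get("HTTP_USER_AGENT", "").lower()
    if any(bot in ua for bot in BOT_SIGNATURES):
        # fetch the same data server-side that the SPA would normally load via AJAX
        articles = []  # placeholder for a real query
        return render(request, "prerendered_index.html", {"articles": articles})
    # regular browsers get the SPA shell and load content via AJAX as before
    return render(request, "spa_shell.html")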
Feel free to ask in case in any questions :)

Reliably detecting PhantomJS-based spam bots

Is there any way to consistently detect PhantomJS/CasperJS? I've been dealing with a spat of malicious spambots built with it and have been able to mostly block them based on certain behaviours, but I'm curious if there's a rock-solid way to know if CasperJS is in use, as dealing with constant adaptations gets slightly annoying.
I don't believe in using Captchas. They are a negative user experience and ReCaptcha has never worked to block spam on my MediaWiki installations. As our site has no user registrations (anonymous discussion board), we'd need to have a Captcha entry for every post. We get several thousand legitimate posts a day and a Captcha would see that number divebomb.
I very much share your take on CAPTCHAs. I'll list what I have been able to detect so far for my own detection script, which has similar goals. It's only partial, as there are many more headless browsers.
It is fairly safe to use exposed window properties to detect (or assume) these particular headless browsers:
window._phantom (or window.callPhantom) //phantomjs
window.__phantomas //PhantomJS-based web perf metrics + monitoring tool
window.Buffer //nodejs
window.emit //couchjs
window.spawn //rhino
The above is gathered from the JSLint docs and from testing with PhantomJS.
Browser automation drivers (used by BrowserStack or other web-capture services for snapshots):
window.webdriver //selenium
window.domAutomation (or window.domAutomationController) //chromium based automation driver
These properties are not always exposed, and I am looking into other, more robust ways to detect such bots, which I'll probably release as a full-blown script when done. But that mainly answers your question.
Here is another fairly sound method to detect JS capable headless browsers more broadly:
if (window.outerWidth === 0 && window.outerHeight === 0) { /* headless browser */ }
This should work well because in headless browsers the properties are 0 by default even if a virtual viewport size is set; by default they can't report the size of a browser window that doesn't exist. In particular, PhantomJS doesn't support outerWidth or outerHeight.
ADDENDUM: There is, however, a Chrome/Blink bug with the outer/inner dimensions: Chromium does not report them when a page loads in a hidden tab, such as one restored from a previous session. Safari doesn't seem to have that issue.
Update: It turns out iOS Safari 8+ has a bug that leaves outerWidth & outerHeight at 0, and a Sailfish webview can too. So while it's a signal, it can't be used alone without being mindful of these bugs. Hence the warning: please don't use this raw snippet unless you really know what you are doing.
PS: If you know of other headless browser properties not listed here, please share in comments.
There is no rock-solid way: PhantomJS and Selenium are just software being used to control browser software instead of a user controlling it.
With PhantomJS 1.x in particular, I believe there is some JavaScript you can use to crash the browser by exploiting a bug in the version of WebKit being used (it is equivalent to Chrome 13, so very few genuine users should be affected). (I remember this being mentioned on the PhantomJS mailing list a few months back, but I don't know if the exact JS to use was described.) More generally, you could combine user-agent matching with feature detection. E.g. if a browser claims to be "Chrome 23" but does not have a feature that Chrome 23 has (and that Chrome 13 did not have), then get suspicious.
As a user, I hate CAPTCHAs too. But they are quite effective in that they increase the cost for the spammer: he has to write more software or hire humans to read them. (That is why I think easy CAPTCHAs are good enough: the ones that annoy users are those where you have no idea what it says and have to keep pressing reload to get something you recognize.)
One approach (which I believe Google uses) is to show the CAPTCHA conditionally. E.g. users who are logged-in never get shown it. Users who have already done one post this session are not shown it again. Users from IP addresses in a whitelist (which could be built from previous legitimate posts) are not shown them. Or conversely just show them to users from a blacklist of IP ranges.
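As a purely illustrative sketch of that decision logic (none of these names, attributes, or rules come from Google; they are placeholders):

def needs_captcha(user, session, client_ip, whitelist, blacklist):
    # logged-in users are never shown a CAPTCHA
    if user.is_authenticated:
        return False
    # users who already passed one this session are not shown it again
    if session.get("captcha_passed"):
        return False
    # known-good IPs (built from previous legitimate posts) skip it
    if client_ip in whitelist:
        return False
    # known-bad ranges always get it
    if client_ip in blacklist:
        return True
    # everyone else gets an easy CAPTCHA on their first post
    return True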
I know none of those approaches are perfect, sorry.
You could detect Phantom on the client side by checking the window.callPhantom property. The minimal script on the client side is:
var isPhantom = !!window.callPhantom;
Here is a gist with proof of concept that this works.
A spammer could try to delete this property with page.evaluate, and then it depends on who is faster. After the detection runs, you reload the page with the post form, including a CAPTCHA or not depending on your detection result.
The problem is that you incur a redirect that might annoy your users. This will be necessary with every client-side detection technique, and any of them can be subverted and changed with onResourceRequested.
Generally, I don't think this is possible, because you can only detect on the client and send the result to the server. Adding the CAPTCHA combined with the detection step in a single page load does not really add anything, as it could be removed just as easily with phantomjs/casperjs. Defense based on the user agent also doesn't make sense, since it can be easily changed in phantomjs/casperjs.

Taking screenshots of a page while it is loading using Selenium WebDriver

I have started using Selenium WebDrivers to automate some performance testing. I found out that we could take screenshots of a page after the page has completed loading using WebDrivers: http://seleniumhq.org/docs/04_webdriver_advanced.html#taking-a-screenshot. However, I want to be able to take screenshots while the page is loading to analyze its loading time and pattern, much like what webpagetest does (http://www.webpagetest.org/). Is there an API that I could use to accomplish this task using WebDrivers?
I am using the FirefoxWebDriver and the Java client. I'd appreciate any help or tips.
Thanks!
Since I found out that the RemoteWebDriver's get calls are blocking, and even the getScreenshot calls are blocking, I decided to run java.awt.Robot in a separate thread and capture screenshots while the WebDriver loads the page.
The only caveat is that the browser instance opened up by the WebDriver has to be in the front of the screen to take snapshots correctly. I am exploring if Robot can take snapshots on an Xvfb display, which would be just awesome and would work for my purposes.
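My setup is the Java client with java.awt.Robot, but purely as an illustration of the same idea, a rough Python sketch (selenium plus Pillow's ImageGrab running in a background thread; the URL, interval, and filenames are placeholders, and the same caveat about the browser window having to be frontmost applies) could look like this:

import threading
import time
from selenium import webdriver
from PIL import ImageGrab

def capture_loop(stop_event, interval=0.5):
    # grab the whole screen repeatedly until told to stop
    count = 0
    while not stop_event.is_set():
        ImageGrab.grab().save("loading_%03d.png" % count)
        count += 1
        time.sleep(interval)

stop_event = threading.Event()
grabber = threading.Thread(target=capture_loop, args=(stop_event,))
grabber.start()

driver = webdriver.Firefox()
driver.get("https://www.example.com/")  # blocks until the page finishes loading

stop_event.set()
grabber.join()
driver.quit()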

Injected scripts not firing for Google results pages

As a newbie to Safari Extensions I have what I am sure is a terribly trivial question ...
Here goes: I'm building an extension to work with some search engines. To cut a long story short I have boiled my issue down to its simplest form. I have an injected script (an end script). This fires as planned on the Google homepage. But when I enter a query to Google the script does not fire when the subsequent results page loads.
For example, to keep things really basic for testing I created a simple script that just writes to the console; I've set the access level to All so that it fires for all pages. I can see the console message when I open the Google homepage but I don't see it when the subsequent results pages load.
For all intents and purposes it seems as if the transition from Google homepage to results page is not a normal one (that is, not a conventional page load) and does not cause injected scripts to fire. I've only seen this problem on Google so I assume it is something to do with their page loading mechanism. I've tried it with Google Instant on and off and both produce the same behaviour.
It's one of those problems that seems so basic as to be stupefying! Please help.