Is there a stealthy headless browser automation tool similar to Puppeteer for Python? - selenium

I am aware of the Pyppeteer library and Pyppeteer Stealth, but the website that I am trying to scrape detects Pyppeteer Stealth (a Python port of Puppeteer) and blocks it. The original Puppeteer Stealth running on Node.js works fine on that website, but I would much rather write this scraper in Python, since I am much more familiar with it.
Which other stealthy and up-to-date headless browser automation tools are available?
All I need it for is grabbing the HTML content and parsing it with Beautiful Soup. Unfortunately, the requests and requests-html libraries do not work on this website either.

If you don't care much about the automation part, I would recommend looking into Scrapy, plus Scrapy Splash if you need JavaScript to be rendered (which I assume is why you wanted Pyppeteer in the first place). Combine it with some basic tactics to avoid being flagged as a bot, such as user-agent rotation and proxy rotation.
I am currently using the same tactic to build a scraper for similarweb.com.
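As a rough illustration of the user-agent and proxy rotation mentioned above, here is a minimal sketch using only the standard library. The user-agent strings, the proxy URLs, and the helper name rotating_request_kwargs are all made up for the example. The dictionaries it yields are shaped like the headers and proxies keyword arguments that requests.get accepts, but nothing here sends a request.

```python
import itertools
import random

# Placeholder values for illustration only; substitute real ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
]


def rotating_request_kwargs(user_agents, proxies, rng=None):
    """Yield keyword arguments for one HTTP request after another.

    Proxies are used round-robin; the user agent is picked at random.
    """
    rng = rng or random.Random()
    proxy_cycle = itertools.cycle(proxies)
    while True:
        proxy = next(proxy_cycle)
        yield {
            "headers": {"User-Agent": rng.choice(user_agents)},
            "proxies": {"http": proxy, "https": proxy},
        }


# A seeded generator keeps the example reproducible.
kwargs_source = rotating_request_kwargs(USER_AGENTS, PROXIES, random.Random(0))
first = next(kwargs_source)
second = next(kwargs_source)
third = next(kwargs_source)
```

Each request would then take the next dictionary, for example requests.get(url, **next(kwargs_source)). Scrapy users would normally do the same thing in a downloader middleware instead.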

Related

Difference between running a browser in headless vs browser [duplicate]

The main difference is execution with a GUI versus without one (headless).
I am looking for the differences between the various headless browsers, but unfortunately I could not find any comparison. Going through them one by one only added to the confusion. It would be great if someone could share a short summary of the differences to make things clear.
Browser
A Browser is an application program that provides a way to look at and interact with all the information on the World Wide Web. Technically a Browser, alternatively referred to as a Web Browser or Internet Browser, is a client program that uses HTTP (Hypertext Transfer Protocol) to make requests to Web servers throughout the Internet on behalf of the Browser User.
Headless Browser
A Headless Browser is also a Web Browser, but one without a graphical user interface (GUI). It is controlled programmatically, which makes it widely useful for automation, testing, and other purposes.
Why use Headless Browsers?
There are both advantages and disadvantages to using Headless Browsers. A headless browser is not much help for browsing the Web, but it is excellent for automating tasks and tests.
Advantages of Headless Browsers
There are a lot of advantages to using Headless Browsers. Some of them are as follows:
A definite advantage of Headless Browsers is that they are typically faster than real browsers. The speed-up comes from not starting a browser GUI, which skips much of the time a real browser spends drawing the rendered page on screen.
Performance-wise, you can typically see 2x to 15x faster performance when using a headless browser.
When scraping websites you don’t necessarily want to open a website manually. You can access the website headlessly and just scrape the HTML; you don’t need to render a full browser to do that.
A lot of developers use a Headless Browser for unit testing code changes for their websites and mobile apps. Being able to do all this from a command line, without having to manually refresh or start a browser, saves them lots of time and effort.
When You Might NOT Want to Use a Headless Browser
There can be a number of reasons why you may opt to use a Real Browser instead of a Headless Browser. A few instances:
You need to mimic real users.
You need to visually see the test run.
If you need to do lots of debugging, headless debugging can be difficult.
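One way to get both behaviours is to make headless mode a switch, so that the same script runs headless by default and with a visible window when you need to watch or debug it. The following is only a sketch, assuming Selenium 4 with Chrome and a matching driver available. The function names and the SHOW_BROWSER environment variable are invented for the example.

```python
import os


def chrome_arguments(headless):
    """Return the Chrome command-line arguments for the chosen mode."""
    args = ["--window-size=1280,800"]
    if headless:
        args.append("--headless")
    return args


def make_driver(headless=None):
    """Start Chrome, headless unless SHOW_BROWSER=1 is set (hypothetical variable)."""
    if headless is None:
        headless = os.environ.get("SHOW_BROWSER") != "1"

    # Imported here so chrome_arguments() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling make_driver(headless=False) opens a normal browser window for a debugging session, while a plain make_driver() on a build server stays headless.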
Which headless browsers are better?
As you rightly pointed out, ...the main difference is in the execution on GUI bases and non GUI bases(Headless)..., so from a testing perspective a lot depends on the Browser Engine used under the hood by any particular browser. For example, here are some of the Browser Engines which fully render web pages or run JavaScript in a virtual DOM.
Chromium Embedded Framework: CEF is an open source project based on the Google Chromium project, with JavaScript support and a BSD license.
Erik: Erik is a Headless Browser on top of Kanna and WebKit, with Swift support and an MIT license.
jBrowserDriver: jBrowserDriver is a Selenium-compatible Headless Browser which is WebKit-based and works with Selenium Server through its Java bindings, under the Apache License v2.0.
PhantomJS: PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG. It can be driven from JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP and R (via Selenium), and is released under the BSD 3-Clause license.
Splash: Splash is a JavaScript rendering service with an HTTP API. It is a lightweight browser implemented in Python using Twisted and Qt, usable from almost any language, and released under the BSD 3-Clause license.
You can find a related discussion in Which drivers support “no-browser”/“headless” testing?

Is it possible to use Selenium from within a web app?

I am building a website in Django that will scrape data from another site, so that people can visit my site, set custom data filters and view the scraped data in a friendly format.
The problem is that the requests and Beautiful Soup modules will not be enough for the scraping, since I will also need some automation (loading JavaScript or clicking buttons).
Since Selenium requires a webdriver to be downloaded and put on the path, is it possible to use it from within a web app? For example, by hosting the webdriver somewhere?
I am also open to solutions other than Selenium, if there are any.
I think what you want is a Selenium Grid server.
https://www.seleniumhq.org/docs/07_selenium_grid.jsp
Basically, you host it on a remote server, then connect to it to spin up web drivers remotely and use them in your code as needed. It also comes with a handy interface for checking on current browser instances, and even for taking screenshots or executing scripts from the web UI.
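From the web app's side, connecting to such a grid might look roughly like the sketch below. It assumes Selenium 4, a grid listening on the default port 4444, and Chrome nodes registered with it. The function names and the host name in the usage note are invented for the example, and older Selenium 3 grids expect /wd/hub appended to the URL.

```python
def grid_endpoint(host, port=4444):
    """Build the URL of the grid; 4444 is Selenium Grid's default port."""
    return f"http://{host}:{port}"


def fetch_html(url, host, port=4444):
    """Load a page in a remote browser and return the rendered HTML."""
    # Imported here so grid_endpoint() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Remote(
        command_executor=grid_endpoint(host, port),
        options=options,
    )
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always release the remote browser, even if loading fails.
        driver.quit()
```

A Django view could then call something like fetch_html("https://example.com", "grid.internal") and pass the result on to Beautiful Soup, so no webdriver binary has to live on the web server itself.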

Automatically click website buttons like selenium

I have a project that requires automating a process on a website (login, click buttons, make decisions, etc.).
Ordinarily I would use something like curl to do the automation and not worry about the UI at all. However, this site uses ASPX and redirects and is just a mess, so I need to write something like a Selenium test to do it.
Selenium seems like a bit of a hack though, so I was wondering if there is any alternative or tool that may be better than Selenium at walking the DOM and "clicking" elements?
Guidance or examples appreciated.
A non-programmatic way would be to use Selenium IDE. Basically, you record the events via a Firefox extension and can replay them easily. I understand this is not fully automated, since it requires manual playback.
However, one thing I really like is that I can use this extension to record my events and then generate scripts to automate playback via Selenium Remote Control drivers.
Selenium IDE is an integrated development environment for Selenium
scripts. It is implemented as a Firefox extension, and allows you to
record, edit, and debug tests. Selenium IDE includes the entire
Selenium Core, allowing you to easily and quickly record and play back
tests in the actual environment that they will run in.
Yes, I know you think Selenium is a hack, but it is actually pretty good!

Does anyone know browser emulators?

For over 2 years I have tested web applications with the Selenium framework. I know the best setup is to run the tests on a VM.
The only downside is that testing is very slow. Why?
The browser only gets so much memory if you run several instances.
The site could be very slow.
Connections can be very slow.
It would be great if there were a framework that emulated the browser (engine/core) correctly and provided an API for navigating the page and getting results.
I don't mean simulating just one browser in different versions (like IE). I mean simulating all popular browsers in their newest versions.
Does anyone know a framework/tool that can do it?
Thank you.
You can try PhantomJS for example.
From their page:
PhantomJS is a headless WebKit with JavaScript API. It has fast and
native support for various web standards: DOM handling, CSS selector,
JSON, Canvas, and SVG.
You can use it in combination with Jasmine (as well as several other frameworks) for testing.
However, the selection of available engines is limited to WebKit. I doubt that Selenium will be easy to replace. By the way, it looks like Selenium will probably become a W3C standard over the next few years.
You can also run Selenium with Xvfb. I use it to run tests on a remote server, and it works very well.

Automated browsing of complicated web pages

I have a project that will involve heavy automation of complicated web pages.
I realize there are Mechanize and Beautiful Soup, but don't these break when dealing with large amounts of DOM scripting and other weird stuff you find on complicated web pages?
I think I want essentially a barebones running instance of WebKit that allows me to either do "GUI scripting" or access the DOM. Ideas?
Try Sahi with PhantomJS. Sahi is a browser automation tool, and PhantomJS is a headless WebKit browser. You can find set-up instructions here: http://sahi.co.in/w/sahi-headless-execution-with-phantomjs
Disclaimer: We created the Sahi product.
What platform are you working on? And what language do you intend to use?
Adobe AIR lets you embed WebKit inside an AIR application and interact with the page JavaScript (there is two-way communication between the page JS and the AIR runtime).
Otherwise, if you are not bound to WebKit, you could take Mozilla Chromeless for a spin.
My apologies if none of this does what you need. I can't quite figure out what exactly you are trying to do (page scraping? submitting forms?).
For testing/scraping I would try:
Selenium
EnvJS
Windmill
Watir
Sahi
WebTest