How to bypass Captcha while Web Scraping - selenium

I am trying to scrape the car details from this site using Selenium: https://www.autoscout24.ch/de/autos/alle-marken?vehtyp=10
Approximately every 30 pages I have to verify that I am not a robot,
even though I have included in my code:
driver.implicitly_wait(20)
Is there any way to overcome this?

A CAPTCHA exists for exactly this reason. Adding waits to a Selenium script has no bearing on whether it appears. Its purpose is to keep bots and automated systems from crawling the web page.
Unless you can disable it, I don't think automating your way around it is the right approach. You may find tutorials on the web for getting past it, but they are very patchy and do not cover all use cases.

Two options come to mind for solving your issue; which one you choose depends on what you need.
Option 1 is cheaper and probably easier: make your script wait when the CAPTCHA is detected and play a sound when it is shown, so you can solve it manually. Once the CAPTCHA has been dealt with, let the script continue doing its thing.
Option 2 is to use a CAPTCHA-solving service. You would need to pay a little, but you would not need to do anything manually.
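A rough sketch of option 1 in Python, assuming you supply your own check for whether the CAPTCHA is currently on screen. The names wait_for_manual_solve, captcha_present and terminal_bell are made up for illustration and are not part of Selenium.

```python
import sys
import time


def terminal_bell():
    """Ring the terminal bell; swap in any sound you prefer."""
    sys.stdout.write("\a")
    sys.stdout.flush()


def wait_for_manual_solve(captcha_present, notify=terminal_bell,
                          poll_seconds=2.0, timeout_seconds=300.0,
                          sleep=time.sleep):
    """Pause until a human has solved the CAPTCHA.

    captcha_present is a zero-argument callable (hypothetical and
    site-specific) that returns True while the CAPTCHA is on screen.
    Returns True if the CAPTCHA went away or was never shown,
    False if the timeout expired first.
    """
    if not captcha_present():
        return True  # nothing to solve, keep scraping
    notify()  # alert the operator once
    waited = 0.0
    while waited < timeout_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        if not captcha_present():
            return True
    return False
```

With Selenium, captcha_present could be a lambda that checks whether driver.find_elements(...) returns anything for the CAPTCHA element; the selector depends on the site and has to be worked out by inspecting the page. Call the function after each page load and stop the run if it returns False.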

I'm not a robot
The "I'm not a robot" checkbox, commonly known as reCAPTCHA v2, is one of the security measures used to implement challenge-response authentication. CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) mainly helps protect applications and systems from spam and password decryption by asking the visitor to complete a simple test that proves a human, not a computer, is trying to access a password-protected account. In short, CAPTCHA is implemented to help prevent unauthorized account entry.
So neither wait mechanism, implicit or explicit, will help you avoid the CAPTCHA.
Solution
An ideal approach would be to disable the CAPTCHA for the AUT (Application Under Test) in the testing / staging environment and enable it only in the production environment.
References
You can find a couple of relevant detailed discussions in:
How does reCAPTCHA 3 know I'm using Selenium/chromedriver?
How can I bypass the Google CAPTCHA with Selenium and Python?

Related

Selenium - Avoid getting CAPTCHAs

I'm trying to scrape a login-only, bot-sensitive website. After logging in, when I perform a simple selenium function like driver.find_element_by_id('button').click(), the website displays a message along the lines of We think you are a bot. Please complete the CAPTCHA below to continue.
Is there any way for me to make selenium more human-like so I don't trigger CAPTCHAs?
Hopefully not.
You are scraping, i.e. you are developing a bot, and if you try to avoid being identified as a bot, it will just be a question of time until the captcha gets improved to detect your strategy.
Don't do it. The captcha is there for a reason, which is to detect and lock out bots!
Better check if the page you want to scrape supports an API that allows computer-to-computer communication. If there is one, use it. If there is none, suggest one, but depending on whether the web page owner wants to support your goals, or not, he might say "no".

How do I whitelist my Google account to prevent it ever triggering captcha?

We have a number of Google Sheets accounts specifically dedicated for use by test automation. Our tests use Selenium to automate the Google auth flow and then the rest of the test flow.
Starting Friday (6 Oct 2017) we are seeing Google sometimes insert captchas into the auth flow. We don't see any consistency in which tests, or which test machines, get captchas and which don't. In some runs almost every test encounters captchas; in other runs only a few do. We never see captchas when manually executing the test scenarios, and manually solving the captchas as the tests run does not prevent future captchas.
We've seen this sporadically in the past, and it has always gone away on its own. This time it appears to be sticking around.
Given that the whole point of these test accounts is to be used by bots, and the whole point of captcha is to prove the user is not a bot, we looked through the settings for the Google accounts for something along the lines of "Never captcha this account" and didn't see any likely candidates. Our searches of StackOverflow and the web for variants of "[google-oauth] [recaptcha] whitelisting" and the like haven't turned up anything beyond "The whole point of captcha is to not be automatable, duh", which we already knew and which doesn't help us get our tests running.
Is there a way to whitelist these accounts to never trigger captcha?
Here's Google's official answer:
"There isn't. If you're using a gsuite domain for your test accounts, however, you can run your own identity provider to handle the auth. Not entirely sure if that avoids it 100% of the time though."

Reliably detecting PhantomJS-based spam bots

Is there any way to consistently detect PhantomJS/CasperJS? I've been dealing with a spate of malicious spambots built with it and have been able to mostly block them based on certain behaviours, but I'm curious if there's a rock-solid way to know if CasperJS is in use, as dealing with constant adaptations gets slightly annoying.
I don't believe in using Captchas. They are a negative user experience and ReCaptcha has never worked to block spam on my MediaWiki installations. As our site has no user registrations (anonymous discussion board), we'd need to have a Captcha entry for every post. We get several thousand legitimate posts a day and a Captcha would see that number divebomb.
I very much share your take on CAPTCHA. I'll list what I have been able to detect so far, for my own detection script, with similar goals. It's only partial, as there are many more headless browsers.
Exposed window properties that are fairly safe to use to detect (or assume) these particular headless browsers:
window._phantom (or window.callPhantom) //phantomjs
window.__phantomas //PhantomJS-based web perf metrics + monitoring tool
window.Buffer //nodejs
window.emit //couchjs
window.spawn //rhino
The above is gathered from jslint doc and testing with phantom js.
Browser automation drivers (used by BrowserStack or other web capture services for snapshot):
window.webdriver //selenium
window.domAutomation (or window.domAutomationController) //chromium based automation driver
The properties are not always exposed and I am looking into other more robust ways to detect such bots, which I'll probably release as full blown script when done. But that mainly answers your question.
Here is another fairly sound method to detect JS capable headless browsers more broadly:
if (window.outerWidth === 0 && window.outerHeight === 0) { /* headless browser */ }
This should work well because these properties are 0 by default even when a headless browser sets a virtual viewport size; by default it can't report the size of a browser window that doesn't exist. In particular, PhantomJS doesn't support outerWidth or outerHeight.
ADDENDUM: There is however a Chrome/Blink bug with outer/inner dimensions. Chromium does not report those dimensions when a page loads in a hidden tab, such as when restored from a previous session. Safari doesn't seem to have that issue.
Update: Turns out iOS Safari 8+ has a bug with outerWidth & outerHeight at 0, and a Sailfish webview can too. So while it's a signal, it can't be used alone without being mindful of these bugs. Hence, warning: Please don't use this raw snippet unless you really know what you are doing.
PS: If you know of other headless browser properties not listed here, please share in comments.
There is no rock-solid way: PhantomJS, and Selenium, are just software being used to control browser software, instead of a user controlling it.
With PhantomJS 1.x, in particular, I believe there is some JavaScript you can use to crash the browser that exploits a bug in the version of WebKit being used (it is equivalent to Chrome 13, so very few genuine users should be affected). (I remember this being mentioned on the Phantom mailing list a few months back, but I don't know if the exact JS to use was described.) More generally you could use a combination of user-agent matching up with feature detection. E.g. if a browser claims to be "Chrome 23" but does not have a feature that Chrome 23 has (and that Chrome 13 did not have), then get suspicious.
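A rough server-side sketch of that user-agent cross-check in Python, assuming the page's JavaScript has already reported which features it detected. The table and the feature name are placeholders I made up, not real browser compatibility data, and the function names are illustrative.

```python
import re

# Hypothetical table: features that each claimed browser version must have.
# The single entry is a placeholder; fill it from real compatibility data.
EXPECTED_FEATURES = {
    ("Chrome", 23): {"feature_added_after_13"},
}


def claimed_browser(user_agent):
    """Extract (name, major version) from a Chrome-style User-Agent string."""
    match = re.search(r"\b(Chrome)/(\d+)", user_agent)
    return (match.group(1), int(match.group(2))) if match else None


def looks_suspicious(user_agent, reported_features):
    """True if the client lacks a feature its claimed version should have."""
    expected = EXPECTED_FEATURES.get(claimed_browser(user_agent), set())
    return not expected.issubset(reported_features)
```

A client with no entry in the table is never flagged, so this only raises suspicion; it is one signal among several and does not prove anything on its own.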
As a user, I hate CAPTCHAs too. But they are quite effective in that they increase the cost for the spammer: he has to write more software or hire humans to read them. (That is why I think easy CAPTCHAs are good enough: the ones that annoy users are those where you have no idea what it says and have to keep pressing reload to get something you recognize.)
One approach (which I believe Google uses) is to show the CAPTCHA conditionally. E.g. users who are logged-in never get shown it. Users who have already done one post this session are not shown it again. Users from IP addresses in a whitelist (which could be built from previous legitimate posts) are not shown them. Or conversely just show them to users from a blacklist of IP ranges.
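The conditional rules above could be sketched roughly like this in Python. The function name, the parameters and the order in which the rules are applied (blacklist first, then the trust checks) are my own assumptions; the networks are passed in as CIDR strings.

```python
import ipaddress


def should_show_captcha(logged_in, posts_this_session, client_ip,
                        trusted_networks=(), blocked_networks=()):
    """Decide whether to challenge this request with a CAPTCHA."""
    ip = ipaddress.ip_address(client_ip)
    # Addresses in blacklisted ranges are always challenged.
    if any(ip in ipaddress.ip_network(net) for net in blocked_networks):
        return True
    # Logged-in users and users who already posted this session are trusted.
    if logged_in or posts_this_session > 0:
        return False
    # So are addresses that match the whitelist of earlier legitimate posters.
    if any(ip in ipaddress.ip_network(net) for net in trusted_networks):
        return False
    return True
```

For example, should_show_captcha(False, 0, "192.0.2.10", trusted_networks=["192.0.2.0/24"]) skips the CAPTCHA, while the same anonymous first-time visitor from an unlisted address gets one.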
I know none of those approaches are perfect, sorry.
You could detect PhantomJS on the client side by checking the window.callPhantom property. The minimal client-side script is:
var isPhantom = !!window.callPhantom;
Here is a gist with proof of concept that this works.
A spammer could try to delete this property with page.evaluate and then it depends on who is faster. After you tried the detection you do a reload with the post form and a CAPTCHA or not depending on your detection result.
The problem is that you incur a redirect that might annoy your users. This will be necessary with every detection technique on the client, and any of them can be subverted and changed with onResourceRequested.
Generally, I don't think that this is possible, because you can only detect on the client and send the result to the server. Adding the CAPTCHA combined with the detection step with only one page load does not really add anything as it could be removed just as easily with phantomjs/casperjs. Defense based on user agent also doesn't make sense since it can be easily changed in phantomjs/casperjs.

Smart card, PIN, Secure HTTP, Login and Downloading and manipulating the source html - need a suitable coding language

I am now motivated to explore a coding language so that I can make the best solution possible.
But I am not sure of the capabilities of all coding languages, so I am asking for advice.
I want to automate some of the daily processes I do at the office. There is an external database on the internet that we use. We access it with a smart card and secured http.
In short, these are the actions that I do each time I restart the browser or a session ends:
Open a Secured HTTP. /....jsp
After being prompted I choose an installed certificate
A smart card is called and I enter a PIN. /charismatics smart security interface/
The page asks me to log in with a username and password.
I open the desired link.
I extract the data from the opened webpage manually.
Is it possible to have all these action automated by code?
THANK YOU FOR ANY SUPPORT
If you get a PIN screen from the charismatics smart card security interface instead of from the operating system, then it may be very hard to automate this. Your program is unlikely to get access to the PIN popup window.
If you get the PIN prompt from a CSP (as you mentioned in the comments) then it may be possible to automate the PIN login. The PIN is normally used to set up the SSL/TLS connection, so having it open in the browser won't help you much, unless you program the browser itself.
If you are bound to CSPs it may be best to keep to C#/.NET. There are of course bindings for other runtimes, but it is better to have as much control as possible.
You may want to take a look at topics such as parsing HTML, because that's something you certainly need to do. Life becomes a lot harder if the web-pages are filled in using JavaScript, so you may check for that first.
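To give an idea of what parsing HTML involves, here is a minimal sketch in Python using only the standard library. It assumes the data you want sits in a plain HTML table sent by the server; the class and function names are made up for illustration, and it will not see anything that JavaScript fills in later.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table contents as a list of rows of cell strings."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Feeding it the page source gives you a list of rows, each a list of cell strings, which you can then write to a file or a spreadsheet.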
Now if you want to manually choose a link you may want to render the page in your own application and handle the download yourself.
This is certainly not a task I would recommend when starting off on an unknown programming language. I would find this a tricky task - there are a lot of ifs left with this description.

How would you go about making an application that automatically retrieves your bank account balance twice a day?

I'm building a utility that will hopefully keep my wife in tune with how much money we have available.
I need a simple secure way of logging into my bank account and retrieving the balance.
Something like mechanize is the only method I can think of. I'm not even sure if that would work given the properly authenticated https that banks use.
Any ideas?
Write a Perl script using LWP::UserAgent. It supports HTTPS connections. The only issue might be if the site requires JavaScript.
Web Client Programming with Perl has a few examples to get you started if you're not too familiar with Perl.
If you really want to go there, get these extensions for Firefox: Live HTTP Headers, Firebug, FireCookie, and HttpFox. Also download cURL and a scripting language that can run cURL command-line tasks (or a scripting language like PHP or Perl that has access to cURL libraries directly).
I've started down this road for some idempotent GET tasks like getting PDFs of the S&P reports (of the stocks I track) from my online brokerage, and downloading the check images for my bank account. Both tasks are repetitive, slow ways of downloading data to my computer, and the financial institutions don't provide any way of making them easier.
Here's why you shouldn't: (as a shortcut I'm going to call the archetypal large bank, brokerage, or other financial institution "BloatBank")
BloatBank is not likely to make public their API for accessing this kind of information. So it can change any time and all your hard work will be for naught. Whenever they change their mechanism, you'll have to adapt.
If BloatBank finds out you've been using automatic scripting to try to access your account information, they may ban you because you've violated their terms of service.
You might screw up, and the interaction between the hodgepodge of scripts on BloatBank's server, and your scripts that access your account, might cause a Bad Thing like closing your account. Testing this kind of script is tremendously difficult because you don't have any documentation about how their online service works, and you don't have a test account you can mess with.
(a variant of the above) You think you're safe because you're issuing GET requests. But BloatBank is just a crazy bank that doesn't know anything about REST, so there are some GET requests that can mess up your account.
If someone else does use your script to maliciously sniff your online password or mess with your account, any liability coverage from BloatBank may disappear because you've opened a security hole.
Why don't you teach your wife how to login to the bank herself? Or use Quicken (or Mint, etc) and teach her how to use the auto-download feature?
Have you checked out Watir? It is fantastic for automating web-browser actions. And since it's written in Ruby, you can take the results and store them in a DB (or email them to yourself) if needed.
If you are open to AIR, I'd say build an AIR app. I have worked with mechanize and I think it's cool. AIR gives you similar features with a richer GUI (see HTMLLoader and DOM manipulation of webpage).
If I were you, I'd simply pull the page and manipulate the DOM to suit my visual needs.
Please, if you find this easy to do for your bank please post your bank's name. If I have the same one I'll be closing my account.
More to your question. The process of loading a web page inside of your code rather than in a browser can be a black art, especially if there is any JavaScript involved. Your best bet would probably be embedding the IE Web Browser control in your app and then simulating key strokes and mouse clicks to arrive at your balance page. Then scrape the HTML for the balance.
I could try paying for Quicken and letting it do the balance downloading. Then I'd just need to find a way to get the number out of the software automatically.
This way I'm not violating any terms of service and I'm also reducing security risk since all "hacking" goes on locally.