Selenium - Avoid getting CAPTCHAs

I'm trying to scrape a login-only, bot-sensitive website. After logging in, when I perform a simple Selenium call such as driver.find_element_by_id('button').click(), the website displays a message along the lines of "We think you are a bot. Please complete the CAPTCHA below to continue."
Is there any way for me to make Selenium more human-like so that I don't trigger CAPTCHAs?

Hopefully not.
You are scraping, i.e. you are developing a bot, and if you try to avoid being identified as one, it is only a matter of time until the CAPTCHA is improved to detect your strategy.
Don't do it. The CAPTCHA is there for a reason: to detect and lock out bots.
Better to check whether the site you want to scrape offers an API for machine-to-machine communication. If there is one, use it. If there is none, suggest one, but depending on whether the site owner wants to support your goals, the answer might be "no".

Related

How to bypass Captcha while Web Scraping

I am trying to scrape the car details from this site using Selenium: https://www.autoscout24.ch/de/autos/alle-marken?vehtyp=10
Approximately every 30 pages I have to verify that I am not a robot,
even though I have included in my code:
driver.implicitly_wait(20)
Is there any way to overcome this?
That is exactly what a CAPTCHA is meant to do. Adding waits to a Selenium script has no bearing on whether it appears. The purpose of a CAPTCHA is to detect bots and automated systems and keep them from crawling the page.
Unless you can disable it, I don't think automating it is the right approach. You may find tutorials on the web for getting around it, but they are patchy and do not cover every use case.
Two options come to mind; which one you choose depends on what you need.
Option 1 is cheaper and probably easier: make your script pause when the CAPTCHA is detected and play a sound so that you can solve it manually. Once the CAPTCHA has been dealt with, let the script continue doing its thing.
Option 2 is to use a CAPTCHA-solving service. You would need to pay a little, but you would not need to do anything manually.
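Option 1 could be sketched roughly as follows. This is only an illustration under assumptions: the function name, the polling interval, the timeout, and the CSS selector used to spot the CAPTCHA are all made up for this example, and the real selector depends on the site. The driver only needs a Selenium-style find_elements(by, value) method.

```python
import time


def wait_for_manual_captcha(driver, captcha_selector, poll_seconds=2.0,
                            timeout_seconds=600.0, notify=None, sleep=time.sleep):
    """Pause until no element matching captcha_selector is on the page.

    Returns False if no CAPTCHA was present, True if one was seen and later
    cleared. Raises TimeoutError if it is still there after timeout_seconds.
    """
    def captcha_present():
        # "css selector" is the string value behind Selenium's By.CSS_SELECTOR.
        return len(driver.find_elements("css selector", captcha_selector)) > 0

    if not captcha_present():
        return False

    if notify is None:
        # Default alert: terminal bell plus a message. Swap in any sound player.
        def notify():
            print("\aCAPTCHA detected, please solve it in the browser window.")
    notify()

    waited = 0.0
    while captcha_present():
        if waited >= timeout_seconds:
            raise TimeoutError("CAPTCHA was not solved in time")
        sleep(poll_seconds)
        waited += poll_seconds
    return True


# Hypothetical usage between page loads (selector is an assumption):
#   wait_for_manual_captcha(driver, "iframe[src*='recaptcha']")
```

A human still solves the challenge; the script only waits and then resumes, which is the whole point of this option.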
I'm not a robot
The "I'm not a robot" checkbox, commonly known as reCAPTCHA v2, is one of the security measures used to implement challenge-response authentication. CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) mainly helps protect applications and systems from spam and password decryption by asking the visitor to complete a simple test that proves a human, not a computer, is trying to access a password-protected account. In short, CAPTCHA is implemented to help prevent unauthorized account entry.
So neither wait mechanism, implicit or explicit, will help you avoid the CAPTCHA.
Solution
An ideal approach would be to disable the CAPTCHA for the AUT (Application Under Test) within the testing / staging environment and enable it only in the production environment.
References
You can find a couple of relevant detailed discussions in:
How does reCAPTCHA 3 know I'm using Selenium/chromedriver?
How can I bypass the Google CAPTCHA with Selenium and Python?

Scrapy. How to navigate, select and submit form

I am trying to make a bot that simulates some human behaviors. I found instructions on using Scrapy to log in to a page like nike.com.br, but once I need to select some buttons and submit some forms, I can't work out how to do it.
Can anyone help me with this?
For example, after the login I need to choose the size of the product and click "add to cart". Is there some way to do that using Scrapy?
It's hard to answer your question because it's too generic, and the solution will probably differ from page to page.
Generally speaking, you need to check what the page does when you click to submit the form. Most likely it sends a POST request, so you will need to mimic that POST request with Scrapy (check FormRequest).
The same logic applies to adding an item to the cart.
I think the best way to approach this is to use the browser's network tool. The Scrapy docs have a few tips about using it for a similar purpose.
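To illustrate what "mimic that POST request" means, here is a minimal standard-library sketch that builds (but does not send) the kind of request a browser issues on form submit. The URL and the field names are hypothetical; the real ones have to be copied from the network tool. In Scrapy, passing the same dictionary as formdata to FormRequest should produce an equivalent URL-encoded POST.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, fields, extra_headers=None):
    """Build (without sending) a POST request equivalent to a form submit."""
    # Browsers send simple forms as URL-encoded key=value pairs in the body.
    body = urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    headers.update(extra_headers or {})
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical endpoint and field names, for illustration only.
request = build_form_post(
    "https://shop.example/cart/add",
    {"sku": "ABC-123", "size": "42", "quantity": "1"},
)
```

Comparing request.data with the payload shown in the network tool is a quick way to check that the mimicked request matches what the page really sends.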

Logic for parking payment

I want to create an app for faster payment of parking.
This question is more about the logic of my app and which tools I need to create it.
At the moment I use a parking place every day and pay for it through a web page.
I do it like this:
Log in to the page.
Click on the menu, which redirects me to www.parkingexample.page/payments
There is a search box where I enter my car plate number. If my car is found, it shows how much I need to pay, and a "Pay" button appears.
I click the "Pay" button and it's all done.
So my goal is to create an app that, when started, automatically connects to the page and searches for my plate. If the plate is found and a payment is due, the app shows just one "Pay" button.
I think I should do it as described below, but as I haven't created any web app before (I'm 100% a back-end developer), I'm asking whether my thought process is correct.
Also, I don't want to use a WebView, as I don't think it's necessary.
When I start my app, it sends a POST request to the page to log in.
Then I send a GET request to www.parkingexample.page/payments with params = 'mycarplatenumber'.
Somehow I need to click the Pay button on the page when it appears, so I think that is probably another POST request, but at this point I'm not sure.
So the question is: is my logic valid, or can it be done some other way?
UPDATE. ADDED SCREENSHOTS
First screenshot: this is the menu after I log in, with the search bar where I need to enter my car plate.
Second screenshot: this is where my car was found (I entered the plate number and clicked search).
The page is now updated with the sum I have to pay, and there is a "PAID" button in the bottom right corner that I need to click.
And that's all I need.
To validate whether your suggested sequence is correct, I would start by capturing a typical browser session between yourself and your parking provider with something like Fiddler. Then I would use an HTTP client library of your choice (for C# it would be something like HttpClient) and emulate the same flow with the correct headers, query parameters and so on.
Looking at your screenshots, the application appears to be ASP.NET Web Forms, which can be a bit painful to emulate because of the way its state management works: you will likely need to decode the view state object (to ensure you're passing it back correctly) and locate all the dynamic field ids that it uses for postbacks. This is, however, very doable.
If you discover that the above is too hard to emulate (or there's JavaScript involved), it might be easier to explore Remote Selenium WebDriver coupled with a headless browser like PhantomJS. You'd then have PhantomJS interact with the page on your server, and you'd drive it from your mobile app. Basically, you'd reduce the complexity of your parking provider's page to a well-documented API.
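As a rough sketch of the postback part, the snippet below (Python, standard library only) collects the hidden inputs from a fetched page and merges them with the fields you want to submit. __VIEWSTATE and __EVENTVALIDATION are the hidden fields Web Forms pages normally carry; the control names such as ctl00$Main$txtPlate are invented for this example and must be read from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_body(page_html, overrides):
    """Return a URL-encoded body: the page's hidden fields plus your own values."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)   # state the server expects to get back
    fields.update(overrides)       # the values a user would have typed/clicked
    return urlencode(fields)


# Made-up page fragment and control names, for illustration only.
sample_html = """
<form method="post" action="./payments">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbC" />
  <input type="text" name="ctl00$Main$txtPlate" />
</form>
"""
body = build_postback_body(
    sample_html,
    {"ctl00$Main$txtPlate": "AB1234", "ctl00$Main$btnSearch": "Search"},
)
```

The hidden values change with every response, so they should be re-read from the most recent page before each postback rather than cached.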
Hopefully that gives you a starting point
In your application, all you need are the service calls plus the security part that logs the user in each time to check for a payment.
So it can be a simple Spring Boot application, where you use the security module for the login. You can keep it simple: for example, you don't need a database, just a redirect to your page, and if you are not familiar with a front-end framework, you can use basic HTML/CSS pages for the client side.
Another important point: design your application before you start coding, because it is very important to understand all the ideas behind it.
Enjoy building it!

How Can I login Twitch via Selenium

I am trying to build a Twitch bot, something like a stay-hydrated bot, but the Twitch login flow has a Google CAPTCHA and my bot can't log in. I tried logging in manually, but even when I choose the pictures correctly, the Google CAPTCHA says "it was wrong, try again".
Well, I think you're going to have trouble making a robot do something you can't do yourself. I'd wager that your failing the reCAPTCHA is most likely user error.
Twitch Developer Documentation
You're not going to get past Google's CAPTCHA with an automation script, as it is designed to stop exactly that. It's also a really crummy way of creating a view bot, which I'm assuming is what you are actually trying to do, because if you were trying to make a chat bot, you'd know that Twitch has built-in support for that, so there is no need to get around a reCAPTCHA.

Reliably detecting PhantomJS-based spam bots

Is there any way to consistently detect PhantomJS/CasperJS? I've been dealing with a spate of malicious spambots built with it and have been able to mostly block them based on certain behaviours, but I'm curious whether there's a rock-solid way to know if CasperJS is in use, as dealing with constant adaptations gets slightly annoying.
I don't believe in using Captchas. They are a negative user experience and ReCaptcha has never worked to block spam on my MediaWiki installations. As our site has no user registrations (anonymous discussion board), we'd need to have a Captcha entry for every post. We get several thousand legitimate posts a day and a Captcha would see that number divebomb.
I very much share your take on CAPTCHA. I'll list what I have been able to detect so far for my own detection script, which has similar goals. The list is only partial, as there are many more headless browsers.
It is fairly safe to use these exposed window properties to detect (or at least assume) the particular headless browser:
window._phantom (or window.callPhantom) //phantomjs
window.__phantomas //PhantomJS-based web perf metrics + monitoring tool
window.Buffer //nodejs
window.emit //couchjs
window.spawn //rhino
The above is gathered from jslint doc and testing with phantom js.
Browser automation drivers (used by BrowserStack or other web capture services for snapshot):
window.webdriver //selenium
window.domAutomation (or window.domAutomationController) //chromium based automation driver
The properties are not always exposed and I am looking into other more robust ways to detect such bots, which I'll probably release as full blown script when done. But that mainly answers your question.
Here is another fairly sound method to detect JS capable headless browsers more broadly:
if (window.outerWidth === 0 && window.outerHeight === 0) {
  // headless browser
}
This should work well because the properties are 0 by default even if the headless browser sets a virtual viewport size, and by default it can't report the size of a browser window that doesn't exist. In particular, PhantomJS doesn't support outerWidth or outerHeight.
ADDENDUM: There is, however, a Chrome/Blink bug with the outer/inner dimensions. Chromium does not report those dimensions when a page loads in a hidden tab, such as when restored from a previous session. Safari doesn't seem to have that issue.
Update: Turns out iOS Safari 8+ has a bug with outerWidth & outerHeight at 0, and a Sailfish webview can too. So while it's a signal, it can't be used alone without being mindful of these bugs. Hence, warning: Please don't use this raw snippet unless you really know what you are doing.
PS: If you know of other headless browser properties not listed here, please share in comments.
There is no rock-solid way: PhantomJS, and Selenium, are just software being used to control browser software, instead of a user controlling it.
With PhantomJS 1.x, in particular, I believe there is some JavaScript you can use to crash the browser that exploits a bug in the version of WebKit being used (it is equivalent to Chrome 13, so very few genuine users should be affected). (I remember this being mentioned on the Phantom mailing list a few months back, but I don't know if the exact JS to use was described.) More generally you could use a combination of user-agent matching up with feature detection. E.g. if a browser claims to be "Chrome 23" but does not have a feature that Chrome 23 has (and that Chrome 13 did not have), then get suspicious.
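The "user-agent versus feature detection" idea could look something like this on the server side (Python sketch). It assumes the page has already run some client-side feature tests and reported the names of the features it found. The feature names and version numbers in the table are placeholders, not real browser data, and would have to be filled in from actual release notes.

```python
import re

# PLACEHOLDER data: feature name -> first Chrome major version assumed to ship it.
# These entries are invented for illustration and are not real browser facts.
FEATURE_MIN_CHROME = {
    "feature_added_in_20": 20,
    "feature_added_in_23": 23,
}


def claimed_chrome_major(user_agent):
    """Return the Chrome major version the User-Agent claims, or None."""
    match = re.search(r"Chrome/(\d+)", user_agent)
    return int(match.group(1)) if match else None


def missing_expected_features(user_agent, reported_features):
    """List features the claimed version should have but the client lacks."""
    major = claimed_chrome_major(user_agent)
    if major is None:
        return []  # not claiming to be Chrome; this check does not apply
    return sorted(
        name
        for name, min_version in FEATURE_MIN_CHROME.items()
        if major >= min_version and name not in reported_features
    )
```

A non-empty result is a reason to get suspicious, not proof: the reported feature list comes from the client and can be forged like anything else it sends.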
As a user, I hate CAPTCHAs too. But they are quite effective in that they increase the cost for the spammer: he has to write more software or hire humans to read them. (That is why I think easy CAPTCHAs are good enough: the ones that annoy users are those where you have no idea what it says and have to keep pressing reload to get something you recognize.)
One approach (which I believe Google uses) is to show the CAPTCHA conditionally. E.g. users who are logged-in never get shown it. Users who have already done one post this session are not shown it again. Users from IP addresses in a whitelist (which could be built from previous legitimate posts) are not shown them. Or conversely just show them to users from a blacklist of IP ranges.
I know none of those approaches are perfect, sorry.
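For what it's worth, the conditional-display rules described above fit in a few lines. This Python sketch is only one possible reading of them: the function name, the parameters, and the choice that a supplied blacklist switches to "challenge only listed ranges" are all assumptions made for the example.

```python
import ipaddress


def should_show_captcha(logged_in, posts_this_session, client_ip,
                        trusted_networks=(), suspicious_networks=()):
    """Decide whether this request should be challenged with a CAPTCHA."""
    # Logged-in users and users who already posted this session are skipped.
    if logged_in or posts_this_session > 0:
        return False

    ip = ipaddress.ip_address(client_ip)

    # Addresses in the whitelist (e.g. built from earlier legitimate posts).
    if any(ip in ipaddress.ip_network(net) for net in trusted_networks):
        return False

    # Converse mode: if a blacklist is supplied, challenge only those ranges.
    if suspicious_networks:
        return any(ip in ipaddress.ip_network(net) for net in suspicious_networks)

    # Otherwise an unknown anonymous visitor gets the CAPTCHA.
    return True
```

For example, should_show_captcha(False, 0, "203.0.113.5", trusted_networks=["203.0.113.0/24"]) returns False, while the same call without the trusted range returns True.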
You could detect PhantomJS on the client side by checking the window.callPhantom property. The minimal client-side script is:
var isPhantom = !!window.callPhantom;
Here is a gist with proof of concept that this works.
A spammer could try to delete this property with page.evaluate, and then it comes down to who is faster. After running the detection, you reload with the post form, with or without a CAPTCHA depending on the detection result.
The problem is that you incur a redirect that might annoy your users. This will be necessary with every client-side detection technique, and each of them can be subverted and changed with onResourceRequested.
Generally, I don't think that this is possible, because you can only detect on the client and send the result to the server. Adding the CAPTCHA combined with the detection step with only one page load does not really add anything, as it could be removed just as easily with PhantomJS/CasperJS. A defense based on the user agent also doesn't make sense, since it can easily be changed in PhantomJS/CasperJS.