After several requests, my scraping code gets blocked by the target site with a reCAPTCHA. I use https://github.com/gocolly/twocaptcha to bypass the captcha with the Selenium Chrome driver. The bypass works while I'm driving the Selenium Chrome driver, but when I run my scraping code again it is still blocked.
My questions:
Why is my code still blocked when the reCAPTCHA has already been bypassed with the Selenium Chrome driver?
How can I get past this reCAPTCHA block?
CAPTCHA, short for Completely Automated Public Turing test to tell Computers and Humans Apart, is explicitly designed to prevent automation, so do not try! There are two primary strategies to get around CAPTCHA checks:
Disable CAPTCHAs in your test environment
Add a hook to allow tests to bypass the CAPTCHA
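For the second strategy, here is a minimal sketch of what such a hook could look like, assuming you control the application under test and it is a Python/Flask app (the endpoint, header name, and environment variables are made up for illustration). Note that both strategies assume you control the site being tested; they don't apply to scraping a third-party site.

    import os
    import requests
    from flask import Flask, request, abort

    app = Flask(__name__)

    def verify_captcha(token):
        # Ask Google's reCAPTCHA verification endpoint whether the token is valid.
        resp = requests.post(
            "https://www.google.com/recaptcha/api/siteverify",
            data={"secret": os.environ["RECAPTCHA_SECRET"], "response": token},
            timeout=5,
        )
        return resp.json().get("success", False)

    @app.route("/login", methods=["POST"])
    def login():
        # Hypothetical hook: in the test environment only, a shared secret header
        # lets automated clients skip CAPTCHA verification entirely.
        test_bypass = (
            os.environ.get("APP_ENV") == "test"
            and request.headers.get("X-Captcha-Bypass") == os.environ.get("CAPTCHA_BYPASS_TOKEN")
        )
        if not test_bypass and not verify_captcha(request.form.get("g-recaptcha-response")):
            abort(403)
        return "ok"

Your Selenium tests would then send the X-Captcha-Bypass header, while real users still go through the normal CAPTCHA flow.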
With Robot Framework and its Selenium Library, I need to open specifically Firefox, get to a repository on GitLab and download a certain file. Please don't question the tool choice, I was asked to do this with Robot on Firefox and I have to do it with Robot on Firefox. Nothing crazy on the surface actually, but I found out that GitLab runs a "check" on the browser and apparently Selenium gets stuck.
I've searched for solutions, but they all apply to Selenium with Java, Python, etc., and most of them are about Chrome. Only a handful of unclear ones mention Firefox, and none cover Robot with Selenium. I've tried to adapt some of them, such as the following:
    SeleniumLibrary.Open Browser    browser=firefox
    ...    ff_profile_dir=set_preference("dom.webdriver.enabled","false");set_preference("useAutomationExtension","false")
    SeleniumLibrary.Go To    ${UrlGitLab}
But it doesn't work. It's still stuck on the page
Checking your browser before accessing gitlab.com.
This process is automatic. Your browser will redirect to your
requested content shortly.
Please allow up to 5 seconds…
It looks like it tries to reload (?) a couple of times, but it won't go further.
Is there a solution, or a workaround that doesn't completely ditch Robot + Selenium with Firefox?
Do you know of any web apps / online tests / online firewalls that try to detect whether the user is using Selenium, Puppeteer, PhantomJS, or any other headless browser?
I've created my Puppeteer online crawler. I've changed many different things, such as the window.navigator object (user-agent, navigator.webdriver, etc.).
Now I want to make sure that it is undetectable.
There is a headless browser detection test which tests for the following:
Does the User-Agent contain the string "HeadlessChrome"?
Is navigator.webdriver set?
Is window.chrome unset?
Does the browser skip asking for permissions (like notifications)?
Are browser plugins unavailable?
Is navigator.languages unset?
If your browser answers any of these questions with yes, then you fail the test. For more information on the test, check out this post, which is a reply to a post called "Detecting Chrome headless, new techniques".
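If you want to check your own setup against these points before running the real test, here is a minimal sketch (Python + Selenium with Chrome; the same properties can be read from Puppeteer via page.evaluate) that reads the relevant properties from the loaded page:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # New headless mode (Chrome 109+); use --headless on older versions.
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")

    # Read the same properties the detection test inspects.
    report = driver.execute_script("""
        return {
            userAgent: navigator.userAgent,
            webdriver: navigator.webdriver,
            hasChrome: typeof window.chrome !== 'undefined',
            plugins:   navigator.plugins.length,
            languages: navigator.languages,
        };
    """)
    print(report)
    driver.quit()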
The author of the latter post also published another test (code), which claims to be able to detect bots and crawlers. It performs various tests on browser attributes and generates a fingerprint of your browser.
Other "soft" tests done by websites, might include the mouse movement, scrolling behavior, IP address, etc. I doubt you will find many tests regarding these information as this is basically a cat-and-mouse game.
I can handle the alert with the Firefox web driver (Selenium, Python), but when I switch the driver to PhantomJS it doesn't handle it. Is it possible to handle it?
The program must run on a server, so what should I do?
In short, no, this is not currently possible. The PhantomJS ghostdriver does not implement the primitives for handling alerts/prompts. See https://github.com/detro/ghostdriver/issues/20
This is also unlikely to change since development of PhantomJS has been discontinued.
The program must run on a server, so what should I do?
Your best option is to use the headless versions of Chrome or Firefox, which can run in a headless environment like PhantomJS just fine. Both Chrome (chromedriver) and Firefox (geckodriver) implement the necessary primitives for handling alerts/prompts.
Another option is to use a virtual display program (e.g. Xvfb), which lets you run a headed browser in a headless environment.
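As an example, here is a minimal sketch (Python + Selenium, assuming geckodriver is on PATH) that handles a JavaScript alert in headless Firefox, which ghostdriver could not do:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = Options()
    options.add_argument("--headless")   # run Firefox without a display
    driver = webdriver.Firefox(options=options)

    driver.get("https://example.com")
    # Trigger a JavaScript alert, then handle it via the WebDriver alert API.
    driver.execute_script("setTimeout(function () { alert('hello'); }, 0);")
    WebDriverWait(driver, 5).until(EC.alert_is_present())
    alert = driver.switch_to.alert
    print(alert.text)                    # -> "hello"
    alert.accept()
    driver.quit()

For the Xvfb route, the same driver code works unchanged; packages such as pyvirtualdisplay can start and stop the virtual display around it.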
I'd like to crawl a set of random websites received from a URL generator, using Selenium's ChromeDriver with Crawljax to do static code analysis on the captured DOM states.
Is this potentially unsafe for the machine doing the crawling?
My concern is that one of the randomly generated sites is malicious and that execution of JavaScript from ChromeDriver (which is used to capture the new DOM states) infects the machine running the test somehow. Should I be running this in some kind of sandboxed environment?
--edit--
If it matters, the crawler is implemented entirely in Java.
Simple answer: no. Only if you're afraid of cookies, and even if you are, your machine isn't.
It's hard to say it's very secure; you should be aware that there is no absolute security on a network. Recently, a Chrome RCE was published; details:
SSD Advisory – Chrome Turbofan Remote Code Execution – SecuriTeam Blogs
This may also affect Selenium's ChromeDriver.
But you can harden your system, for example by switching your firewall to whitelist mode so that only your Python script and Selenium are allowed to access the internet on ports 80 and 443.
Even if your system is pwned by an RCE, the malicious code still can't reach the internet unless it injects itself into your Python process (which I think is very hard to do from a JavaScript payload in a browser RCE).
Another option: install a HIPS. If your Python script tries to do anything other than crawl web pages (such as starting another process) or reads/writes other files, you will know about it and can decide what to do.
In my opinion, do your crawling in a VM, tighten the firewall (Windows Firewall or Linux iptables), and shut down unneeded services on Windows. That's enough.
In a word, it's difficult to find the balance between security and convenience, and do not believe your system is unbreakable.
What kinds of pros and cons are involved in headless Selenium test execution? I would like recommendations on running tests in a real browser vs. headless browsers.
With a real browser you can see what's actually going on, inspect elements, and test JavaScript on the go.
With a headless browser you can let it run in the background.
But they are both very similar. One you can see... the other you can't.
I traditionally develop using Selenium with a real browser to see what's going on, and if your code is written against the WebDriver interface you can just switch browsers whenever you want, even to go headless.
In C# you have RemoteWebDriver, which is what you want to use if you want to be able to use different browsers.
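The same idea in Python, as a minimal sketch: write your test against the generic driver and decide headed vs. headless in one place (the helper function and flag below are made up for illustration).

    from selenium import webdriver

    def make_driver(headless: bool = False) -> webdriver.Remote:
        """Return a driver; the tests depend only on the common WebDriver interface."""
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument("--headless=new")
        return webdriver.Chrome(options=options)

    driver = make_driver(headless=True)   # flip to False to watch the run locally
    driver.get("https://example.com")
    print(driver.title)
    driver.quit()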