Selenium HtmlUnitDriver Web Scrape Got Captcha Page From EC2 Server - selenium

I wrote a simple web scraper for expedia.com. Using the Java Selenium HtmlUnitDriver, I can successfully scrape data from the site when I run it locally.
However, when I deploy it to an EC2 server, it always gets the page Expedia serves when it detects a bot, which displays a captcha to prove that a human is accessing the site.
I think it might have something to do with the IP addresses of EC2 servers being blacklisted by expedia.com somehow?
I've also tried scraping other websites that don't care / don't run a human test.
Any idea how to get around this?
Things I tried that were still detected as a bot:
-Changing the user agent to the one my local browser uses
-Setting a proxy
Update:
Actually setting a proxy server gives me a different error:
Current URL is https://www.expedia.com/things-to-do/search?location=Paris&pageNumber=1
The htmlString:
<!--?xml version="1.0" encoding="ISO-8859-1"?-->
<html>
<head>
<title>
500 Internal Server Error
</title>
</head>
<body>
<h1> Internal Server Error </h1>
<p> The server encountered an internal error or misconfiguration and was unable to complete your request. </p>
<p> Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error. </p>
<p> More information about this error may be available in the server error log. </p>
<hr>
<address> Apache/2.4.18 (Ubuntu) Server at www.expedia.com Port 443 </address>
</body>
</html>

Have you covered these points?
-Which user agent are you using? Make sure it is the same one you would send when browsing as a human.
-Are you inserting waits in your navigation? Clicking or navigating the instant a page loads does not look like regular browsing.
-Which driver are you using? With chromedriver there is a trick: rename the internal variable "cdc_" to something else, such as "aaa_", so that any JavaScript served by the site that looks for this variable (cdc_) fails to find it.
-There is more to look into if you really need to avoid detection by the server:
  -Is there a honeypot in place?
  -Is your IP (the EC2 IP) already blocked? You could route your traffic through a VPN tunnel.
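As a rough illustration of the "insert waits" point, here is a minimal sketch of a randomized pause between navigation steps. The question uses Java, where Thread.sleep plays the same role; Python is used here only to keep the sketch short. The function name human_pause and the default bounds are assumptions made for this example, not values known to satisfy any particular site.

```python
import random
import time


def human_pause(min_seconds=1.5, max_seconds=4.0, sleep=time.sleep):
    """Sleep for a random interval and return the delay that was used.

    The bounds are illustrative assumptions; tune them for your own case.
    The sleep argument exists so the pause can be stubbed out in tests.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling human_pause() after each page load and before each click spaces the actions out irregularly instead of firing them back to back.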
Interesting articles:
https://www.kdnuggets.com/2018/02/web-scraping-tutorial-python.html
https://antoinevastel.com/bot%20detection/2017/08/05/detect-chrome-headless.html
https://intoli.com/blog/making-chrome-headless-undetectable/

Related

JBoss Data Virt Access Using SSL

I have Data Virt running via the standalone.sh script, and can log in with my username and password. My next task is configuring it so that it automatically runs whenever the instance is up and running (without having to execute standalone.sh), and uses SSL (port 443) rather than my username and password to log me in. I added the vault.keystore, dv_keystore.jks, and dv_truststore.jks files, and modified both standalone.sh and standalone.xml, according to the JBoss and other online documentation, to account for using these files. I start the standalone.sh script, which runs without any errors. When I browse to:
http://<IP>:8443/dashboard
after starting standalone.sh, I get the following error:
This page can't be displayed
Turn on TLS 1.0, TLS 1.1, and TLS 1.2 in Advanced settings and try connecting to https://:8443 again. If this error persists, it is possible that this site uses an unsupported protocol or cipher suite such as RC4, which is not considered secure. Please contact your site administrator.
The settings Use TLS-1.0-ON, Use TLS-1.1-ON, and Use TLS-1.2-ON are all checked in the Browser properties.
By contrast, when I browse to
http://<IP>:8443/dashboard
when standalone.sh is not running, I get the following:
This page can't be displayed
- Make sure the web address https://:8443 is correct.
- Look for the page with your search engine.
- Refresh the page in a few minutes.
It appears the browser is sensing something going on when standalone.sh is running, but something is not allowing the browser to access the dashboard.
What am I missing here?
Have you validated any other ssl access? Is it just an issue with the dashboard application?
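One way to check SSL access independently of the browser and of the dashboard application is to attempt a bare TLS handshake against the port. The following is a minimal sketch, assuming Python is available on a machine that can reach the server; the function name probe_tls is hypothetical, and certificate verification is switched off on the assumption that the keystore holds a self-signed certificate.

```python
import socket
import ssl


def probe_tls(host, port=8443, timeout=5.0):
    """Attempt a TLS handshake and report whether it succeeded.

    Returns (True, "<protocol> <cipher>") on success, or (False, reason).
    Verification is disabled here (assumption: self-signed certificate),
    so this only tests the handshake, not trust in the certificate.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, f"{tls.version()} {tls.cipher()[0]}"
    except ssl.SSLError as exc:
        # The port answered, but the TLS negotiation itself failed.
        return False, f"TLS handshake failed: {exc}"
    except OSError as exc:
        # Nothing listening, connection refused, or the attempt timed out.
        return False, f"TCP connection failed: {exc}"
```

A "TLS handshake failed" result while standalone.sh is running would point at the protocol or keystore configuration rather than at the dashboard itself.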

How to locally run my cloudflare worker serverless function, during development?

I managed to deploy my first cloudflare worker using serverless framework according to
https://serverless.com/framework/docs/providers/cloudflare/guide/
and it is working when I hit the cloud.
During development, I would like to be able to test on http://localhost:8080/*
What is the simplest way to bring up a local HTTP server and handle my requests using the function specified in serverless.yml?
I looked into https://github.com/serverless/examples/tree/master/google-node-simple-http-endpoint
but there is no "start" script.
There seem to be no examples for cloudflare on https://github.com/serverless/
At present, there is no way to run the real Cloudflare Workers runtime locally. The Workers team knows that developers need this, but it will take some work to separate the core Workers runtime from the rest of Cloudflare's software stack, which is otherwise too complex to run locally.
In the meantime, there are a couple options you can try instead:
Third-party emulator
Cloudworker is an emulator for Cloudflare Workers that runs locally on top of Node.js. It was built by engineers at Dollar Shave Club, a company that uses Workers, not by Cloudflare. Since it is an entirely independent implementation of the Workers environment, there are likely to be small differences between how it behaves and the "real thing". However, it's good enough to get some work done.
Preview Service API
The preview seen on cloudflareworkers.com can be accessed via API. With some curl commands, you can upload your code to cloudflareworkers.com and run tests on it. This isn't really "local", but if you're always connected to the internet anyway, it's almost the same. You don't need any special credentials to use this API, so you can write some scripts that use it to run unit tests, etc.
Upload a script called worker.js by POSTing it to https://cloudflareworkers.com/script:
SCRIPT_ID=$(curl -sX POST https://cloudflareworkers.com/script \
  -H "Content-Type: text/javascript" --data-binary @worker.js | \
  jq -r .id)
Now $SCRIPT_ID will be a 32-digit hex number identifying your script. Note that the ID is based on a hash, so if you upload the exact same script twice, you get the same ID.
Next, generate a random session ID (32 hex digits):
SESSION_ID=$(head -c 16 /dev/urandom | xxd -p)
It's important that this session ID be cryptographically random, because anyone with the ID will be able to connect devtools to your preview and debug it.
Let's also define two pieces of configuration:
PREVIEW_HOST=example.com
HTTPS=1
These specify that when your worker runs, the preview should act like it is running on https://example.com. The URL and Host header of incoming requests will be rewritten to this protocol and hostname. Set HTTPS=1 if the URLs should be HTTPS, or HTTPS=0 if not.
Now you can send a request to your worker like:
curl https://00000000000000000000000000000000.cloudflareworkers.com \
-H "Cookie: __ew_fiddle_preview=$SCRIPT_ID$SESSION_ID$HTTPS$PREVIEW_HOST"
(The 32 zeros can be any hex digits. When using the preview in the browser, these are randomly-generated to prevent cookies and cached content from interfering across sessions. When using curl, though, this doesn't matter, so all-zero is fine.)
You can change this curl line to include a path in the URL, use a different method (like -X POST), add headers, etc. As long as the hostname and cookie are as shown, it will go to your preview worker.
Finally, you can connect the devtools console for debugging in Chrome (currently only works in Chrome unfortunately):
google-chrome "https://cloudflareworkers.com/devtools/inspector.html?wss=cloudflareworkers.com/inspect/$SESSION_ID&v8only=true"
Note that the above API is not officially documented at present and could change in the future, but changes should be relatively easy to figure out by opening cloudflareworkers.com in a browser and looking at the requests it makes.
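If you would rather drive these steps from a test script than from the shell, the cookie can be assembled programmatically. The sketch below only mirrors the layout of the curl example above (script ID, session ID, HTTPS flag, preview host, concatenated in that order); the function names are hypothetical, and since the API is undocumented the format may change.

```python
import re
import secrets

# Both IDs are described above as 32 hex digits.
HEX32 = re.compile(r"[0-9a-fA-F]{32}")


def new_session_id():
    """Return 32 hex digits drawn from a cryptographically secure source."""
    return secrets.token_hex(16)


def preview_cookie(script_id, session_id, preview_host, https=True):
    """Build the cookie string used after "Cookie: " in the curl example."""
    for name, value in (("script_id", script_id), ("session_id", session_id)):
        if not HEX32.fullmatch(value):
            raise ValueError(f"{name} must be exactly 32 hex digits")
    flag = "1" if https else "0"
    return f"__ew_fiddle_preview={script_id}{session_id}{flag}{preview_host}"
```

The returned string goes after "Cookie: " in the request header, in place of the hand-assembled value in the curl command.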
You may also be able to test locally by loading the Cloudflare worker as a service worker.
Note:
-Serve the page from a local web server. Service workers need a secure context (https:, or http://localhost), so they won't load from file: URLs.
-Your browser needs to support service workers, so you can't use IE.
-Mock any Cloudflare-specific features, such as KV.
<!doctype html>
<html>
<head>
  <meta charset="utf-8">
</head>
<body>
  <!-- Service worker registration -->
  <script>
    if ('serviceWorker' in navigator) {
      // Register the ServiceWorker
      navigator.serviceWorker.register('/service-worker.js')
        .then(function (reg) {
          // Registration succeeded
          console.log('[registerServiceWorker] Registration succeeded. Scope is ' + reg.scope)
          window.location.reload(true)
        })
        .catch(function (error) {
          // Registration failed
          console.log('[registerServiceWorker] Registration failed with ' + error)
        })
    } else {
      console.log('[registerServiceWorker] Service workers aren\'t supported')
    }
  </script>
</body>
</html>
Dollar Shave Club created Cloudworker. It is not actively maintained, but it is a way to run Cloudflare Workers locally.
You can read about it on the Cloudflare blog in a guest post by the original maintainer of Cloudworker.

The web address incorrect error

I am new to Windows Server 2012 and to IIS 8. I am trying to host a sample page under a domain name in IIS 8, and I get the following error:
Make sure the web address http://www.mytest.com:1024 is correct
How can I fix this? I created a new website called MyTest, assigned the IP address and port number to that site, and set the domain to www.mytest.com.
Sorry for my bad English and thanks in advance.
The code in my HTML page is
<html>
<body>
Welcome
</body>
</html>
Try opening your page with a URL like
http://localhost:1024

Weblogic 10, session replication

I am working on session replication with two server instances in a cluster.
The session ID is not replicated to the second server, so that server always creates a new session, and my open application errors out and closes. How do I handle failover of a server instance so that the user does not notice when an instance goes down? Here are the settings I am using in weblogic.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<weblogic-web-app xmlns="http://www.bea.com/ns/weblogic/weblogic-web-app">
<session-descriptor>
<session-param>
<param-name>URLRewritingEnabled</param-name>
<param-value>true</param-value>
</session-param>
<session-param>
<param-name>PersistentStoreType</param-name>
<param-value>replicated</param-value>
</session-param>
</session-descriptor>
<context-root>#CONTEXT_ROOT#</context-root>
</weblogic-web-app>
Now that you know that going to the app server directly does not appear to alleviate your session ID issue, you need to do some deeper debugging:
Install Firebug in Firefox (https://getfirebug.com/)
Go to your website in Firefox
Turn on Firebug in Firefox (and make sure that Firebug's Net tab, which might be grayed out, is enabled)
Log in to your website
Look at the Net tab in Firebug and expand the plus sign for the request.
Look at the Request Headers section. Do you see anything in the cookie field that looks like the JSESSIONID? If so, does the JSESSIONID stay the same or does it change when you navigate to other pages on your site?
I'm attaching a screenshot of using Firebug to look at the cookie that gets set and re-sent on every request when you have logged in to the WebLogic admin console, for comparison (rather than ADMINCONSOLESESSION, you'd see JSESSIONID as the cookie key).
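If you prefer to compare the values mechanically instead of by eye, you can copy the Cookie request header from a few requests in the Net tab and check them with a short script. This is a minimal sketch; the function names are hypothetical, and it assumes the headers have been pasted in as plain strings.

```python
def parse_cookie_header(header):
    """Split a Cookie request header value into a {name: value} dict."""
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies


def session_id(header, name="JSESSIONID"):
    """Return the session cookie value from one header, or None if absent."""
    return parse_cookie_header(header).get(name)


def session_is_stable(headers, name="JSESSIONID"):
    """True only if every header carries the same, non-missing session ID."""
    ids = [session_id(h, name) for h in headers]
    return bool(ids) and None not in ids and len(set(ids)) == 1
```

If session_is_stable returns False for requests made before and after the failover, the browser is being handed a new session rather than the replicated one.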

A server or scripting issue

I am receiving an error over HTTPS:
HTTP Status 404 - /cas/login
type Status report
message /cas/login
description The requested resource (/cas/login) is not available.
Apache Tomcat/7.0.23
When I open the same link over HTTP, it returns output in what looks like an unknown language:
<html>
<head></head>
<body>
<pre>???�?? </pre>
</body>
</html>
Why is this? Is it a server issue or a script issue? I suspect a server issue. If so, please share a remedy.
"The requested resource (/cas/login) is not available." generally indicates that CAS was not deployed correctly or that Tomcat couldn't load it properly. Could you please examine the contents of the catalina or cas.log files and report back?