Spider returns different results from local machine and Scrapy Cloud (phantomjs + selenium + crawlera) - selenium

Hello!
A question for anyone who uses Scrapinghub, shub-image, selenium + phantomjs, and crawlera.
(My English isn't great, sorry.)
I need to scrape a site that relies heavily on JavaScript, so I use scrapy + selenium. It also has to run on Scrapy Cloud.
I've written a spider that uses scrapy + selenium + phantomjs and run it on my local machine. Everything works.
Then I deployed the project to Scrapy Cloud using shub-image. The deployment succeeds, but the result of
webdriver.page_source differs: it's correct locally, while on the cloud I get HTML containing a 403 message (even though the HTTP response itself is 200).
Then I decided to use my Crawlera account. I added it with:
service_args = [
    '--proxy="proxy.crawlera.com:8010"',
    '--proxy-type=https',
    '--proxy-auth="apikey"',
]
For Windows (local):
self.driver = webdriver.PhantomJS(executable_path=r'D:\programms\phantomjs-2.1.1-windows\bin\phantomjs.exe',service_args=service_args)
For the Docker instance:
self.driver = webdriver.PhantomJS(executable_path=r'/usr/bin/phantomjs', service_args=service_args, desired_capabilities=dcap)
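(dcap isn't defined in the snippet above; presumably it's a PhantomJS DesiredCapabilities dict. A minimal sketch of what it might contain, assuming the common user-agent override:)
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Assumed definition of `dcap` (not shown in the original post): a copy of the
# default PhantomJS capabilities with the user agent overridden, since many
# sites block the default PhantomJS user agent.
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"
)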
Again, locally everything is ok; on the cloud it is not.
I've checked the Crawlera info; it's ok, and requests are sent from both (local and cloud).
Note again: the same proxies (Crawlera) are used in both cases.
Response on Windows (local): HTTP 200, HTML with the correct content.
Response on Scrapy Cloud (Docker instance): HTTP 200, HTML containing a 403 (Forbidden) message.
I don't understand what's wrong. I think it might be differences between the PhantomJS versions (Windows vs. Linux).
Any ideas?
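(A diagnostic sketch, not part of the original post: one way to test the version hypothesis is to log what each environment actually runs and compare the output.)
# Run this in both environments (local and cloud) and diff the results.
print(self.driver.capabilities.get('version'))                   # PhantomJS version
print(self.driver.capabilities.get('platform'))                  # reported OS
print(self.driver.execute_script('return navigator.userAgent'))  # UA the site sees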

Related

TestCafe EC2 Network logs

We are "successfully" running our gherkin-testcafe build on ec2 headless against chromium. The final issue we are dealing with is that at a certain point in the test a CTA button is showing ...loading instead of Add to Bag, presumably because a service call that gets the status of the product, out of stock, in stock, no longer carry, etc. is failing. The tests work locally of course and we have the luxury of debugging locally opening chrome's dev env and inspecting the network calls etc. But all we can do on the ec2 is take a video and see where it fails. Is there a way to view the logs of all the calls being made by testcafe's proxy browser so we can confirm which one is failing and why? We are using. const rlogger = RequestLogger(/.*/, {
logRequestHeaders: true,
logResponseHeaders: true
});
to log our headers but not getting very explicit reasons why calls are not working.
TestCafe uses the debug module for its internal logging. So, to view the TestCafe proxy logs, set the DEBUG environment variable in the following manner:
export DEBUG='hammerhead:*'

How to locally run my cloudflare worker serverless function, during development?

I managed to deploy my first Cloudflare Worker using the Serverless framework, following
https://serverless.com/framework/docs/providers/cloudflare/guide/
and it works when I hit the cloud.
During development, I would like to be able to test on http://localhost:8080/*
What is the simplest way to bring up a local http server and handle my requests using function specified in serverless.yml?
I looked into https://github.com/serverless/examples/tree/master/google-node-simple-http-endpoint,
but there is no "start" script.
There seem to be no examples for Cloudflare on https://github.com/serverless/
At present, there is no way to run the real Cloudflare Workers runtime locally. The Workers team knows that developers need this, but it will take some work to separate the core Workers runtime from the rest of Cloudflare's software stack, which is otherwise too complex to run locally.
In the meantime, there are a couple options you can try instead:
Third-party emulator
Cloudworker is an emulator for Cloudflare Workers that runs locally on top of node.js. It was built by engineers at Dollar Shave Club, a company that uses Workers, not by Cloudflare. Since it's an entire independent implementation of the Workers environment, there are likely to be small differences between how it behaves vs. the "real thing". However, it's good enough to get some work done.
Preview Service API
The preview seen on cloudflareworkers.com can be accessed via API. With some curl commands, you can upload your code to cloudflareworkers.com and run tests on it. This isn't really "local", but if you're always connected to the internet anyway, it's almost the same. You don't need any special credentials to use this API, so you can write some scripts that use it to run unit tests, etc.
Upload a script called worker.js by POSTing it to https://cloudflareworkers.com/script:
SCRIPT_ID=$(curl -sX POST https://cloudflareworkers.com/script \
  -H "Content-Type: text/javascript" --data-binary @worker.js | \
  jq -r .id)
Now $SCRIPT_ID will be a 32-digit hex number identifying your script. Note that the ID is based on a hash, so if you upload the exact same script twice, you get the same ID.
Next, generate a random session ID (32 hex digits):
SESSION_ID=$(head -c 16 /dev/urandom | xxd -p)
It's important that this session ID be cryptographically random, because anyone with the ID will be able to connect devtools to your preview and debug it.
Let's also define two pieces of configuration:
PREVIEW_HOST=example.com
HTTPS=1
These specify that when your worker runs, the preview should act like it is running on https://example.com. The URL and Host header of incoming requests will be rewritten to this protocol and hostname. Set HTTPS=1 if the URLs should be HTTPS, or HTTPS=0 if not.
Now you can send a request to your worker like:
curl https://00000000000000000000000000000000.cloudflareworkers.com \
-H "Cookie: __ew_fiddle_preview=$SCRIPT_ID$SESSION_ID$HTTPS$PREVIEW_HOST"
(The 32 zeros can be any hex digits. When using the preview in the browser, these are randomly-generated to prevent cookies and cached content from interfering across sessions. When using curl, though, this doesn't matter, so all-zero is fine.)
You can change this curl line to include a path in the URL, use a different method (like -X POST), add headers, etc. As long as the hostname and cookie are as shown, it will go to your preview worker.
Finally, you can connect the devtools console for debugging (currently this only works in Chrome, unfortunately); quote the URL so the shell doesn't treat the & as a background operator:
google-chrome "https://cloudflareworkers.com/devtools/inspector.html?wss=cloudflareworkers.com/inspect/$SESSION_ID&v8only=true"
Note that the above API is not officially documented at present and could change in the future, but changes should be relatively easy to figure out by opening cloudflareworkers.com in a browser and looking at the requests it makes.
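If you'd rather script this flow than chain curl commands, a rough Python equivalent might look like the sketch below (my own assumption, using the third-party requests package against the same unofficial API, so the same caveats apply):
import os
import requests

# Upload worker.js and read back the hash-based script ID.
with open("worker.js", "rb") as f:
    script_id = requests.post(
        "https://cloudflareworkers.com/script",
        headers={"Content-Type": "text/javascript"},
        data=f.read(),
    ).json()["id"]

session_id = os.urandom(16).hex()   # must be cryptographically random
preview_host = "example.com"        # the preview acts as if running on this host
https = "1"                         # "1" for https URLs, "0" for http

# Any 32 hex digits work as the subdomain; the cookie selects the preview.
cookie = "__ew_fiddle_preview=" + script_id + session_id + https + preview_host
resp = requests.get(
    "https://" + "0" * 32 + ".cloudflareworkers.com/",
    headers={"Cookie": cookie},
)
print(resp.status_code, resp.text[:200])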
You may also be able to test locally by loading the Cloudflare worker as a service worker.
Note:
Use a local web server with https:. Workers won't load using file: or http: protocols.
Your browser will need to support workers, so you can't use IE.
Mock any Cloudflare-specific features, such as KV.
<!doctype html>
<html>
<head>
  <meta charset="utf-8">
</head>
<body>
  <!-- Service worker registration -->
  <script>
    if ('serviceWorker' in navigator) {
      // Register the ServiceWorker
      navigator.serviceWorker.register('/service-worker.js')
        .then(function (reg) {
          // Registration succeeded
          console.log('[registerServiceWorker] Registration succeeded. Scope is ' + reg.scope)
          // Reload once so the worker starts controlling the page; checking
          // for an existing controller avoids an endless reload loop.
          if (!navigator.serviceWorker.controller) {
            window.location.reload()
          }
        })
        .catch(function (error) {
          // Registration failed
          console.log('[registerServiceWorker] Registration failed with ' + error)
        })
    } else {
      console.log('[registerServiceWorker] Service workers aren\'t supported')
    }
  </script>
</body>
</html>
Dollar Shave Club created Cloudworker. It is not actively maintained, but it is a way to run Cloudflare Workers locally.
You can read about it on the Cloudflare blog in a guest post by the original maintainer of Cloudworker.

Testing site with IP addr whitelist using BrowserStack automate + cloud hosted CI

I have a test system (various web pages / web applications) that is hosted in an environment accessible only via machines with IP addresses that are whitelisted. I control the whitelist.
Our CI system is cloud hosted (Gitlab), so VMs are spun up dynamically as needed to run automated integration tests as a part of the build pipeline.
The tests in question use BrowserStack automation to run Selenium-based tests, which means the source IP addresses of the BrowserStack-driven requests that hit the test environment are dynamic, as BS is cloud hosted. The IP addresses of our test runner machines that invoke the BrowserStack automation are dynamic as well.
The whole system worked fine before the introduction of IP whitelisting on the test environment. Since whitelisting was enabled, the BrowserStack tests can no longer access the environment URLs (because the dynamic IPs cannot be whitelisted).
I have been trying to get the CI-driven tests working again using the BS "Local Testing" feature, outlined here: https://www.browserstack.com/local-testing.
I have set up a dedicated Linux VM with a static IP address (cloud hosted). I have installed and am running the BrowserStackLocal binary using our BS key. It starts up fine and says it has connected to BrowserStack via a web socket. My understanding is that this should route all http(s) requests coming from my CI / BrowserStack-driven tests through that stand-alone machine (via the BS cloud), so that its static IP address is the source of the requests seen at the test environment. That IP address is whitelisted.
This is the command that is running on the dedicated / static IP machine:
BrowserStackLocal.exe --{access key} --verbose 3
I have also tried the below, but it made no apparent difference:
BrowserStackLocal.exe --{access key} --force-local --verbose 3
However, this does not seem to work, either through "live" testing (if I try to access the test env directly through BrowserStack) or through BS Automate. In both cases the http(s) requests all time out and cannot reach our test environment URLs. Also, even with --verbose 3 logging enabled on the BrowserStackLocal.exe process, I never see any request being logged on the stand-alone / static IP machine when I try to run the tests in various ways.
So I am wondering: is this the correct way to solve this problem? Am I misunderstanding how to do this? Do I need to run BrowserStackLocal.exe on the same CI runner machine that is invoking the BS automation? That would be problematic, as these have dynamic IPs as well (currently).
Thanks in advance for any help!
EDIT/UPDATE: I managed to get this to work!! (Sort of: it's just a bit slow.) If I run the following command on my existing dedicated / static IP server:
BrowserStackLocal.exe --key {mykey} --force-local --verbose 3
then from another machine (like my dev laptop) I hit the BS web driver server at http://hub-cloud.browserstack.com/wd/hub, loaded http://www.whatsmyip.org/ to see what IP address came back, and it did (eventually) come back with my static-IP machine's address! The problem, though, is that it was quite slow: 20-30 secs for that one site hit, so I am still looking at alternative solutions. Note that for this to work your test code must set the "local" BrowserStack capability flag to 'true', e.g. for Node.js:
// Input capabilities
var capabilities = {
    'browserstack.local': 'true'
}
UPDATE 2: Turning down the --verbose logging level on the local binary (or leaving that flag off completely) seemed to improve things; I am getting 5-10 sec response times now for each request. That might have to do, but this does work as described.
SOLUTION: I managed to get this to work; it's just a bit slow. If I run the following command on my existing dedicated / static IP server (note: verbose logging seems to slow things down further, so no --verbose flag is used now):
BrowserStackLocal.exe --key {mykey} --force-local
then from another machine (like my dev laptop) I hit the BS web driver server at http://hub-cloud.browserstack.com/wd/hub, loaded http://www.whatsmyip.org/ to see what IP address came back, and it did come back with my static-IP machine's address. Note that for this to work your test code must set the "local" BrowserStack capability flag to 'true', e.g. for Node.js:
// Input capabilities
var capabilities = {
    'browserstack.local': 'true'
}
So while it's a little slow, that might have to do. But this does work as described.
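For anyone driving this from Python rather than Node.js, a rough Selenium-3-style equivalent would be something like the sketch below (my assumption; the credentials and browser choice are placeholders):
from selenium import webdriver

# Placeholder credentials and capabilities; "browserstack.local" is the flag
# that routes traffic through the BrowserStackLocal binary.
caps = {
    "browserstack.user": "YOUR_USERNAME",
    "browserstack.key": "YOUR_ACCESS_KEY",
    "browser": "Chrome",
    "browserstack.local": "true",
}
driver = webdriver.Remote(
    command_executor="http://hub-cloud.browserstack.com/wd/hub",
    desired_capabilities=caps,
)
driver.get("http://www.whatsmyip.org/")   # should report the static IP
driver.quit()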

How to send multiple identical HTTP POST message instantly

I have people cheating in my game by sending multiple HTTP POST messages to the game server (PHP + Apache) to gain items. I have fixed that loophole, but I want to test whether my fix is correct.
I have tried some Chrome plugins for sending POST messages, but I can't imitate sending them in the same instant, for example 5 identical POST messages to the same IP, all sent within 100 ms of each other.
I have a CentOS and a Windows machine; I would appreciate any script or program recommendation.
If you have Python installed (your CentOS machine will have it), the following will do what you're after using only built-ins (this is Python 2; httplib was renamed http.client in Python 3). You'll just need to tweak the conn.request line to pass in any body or headers your server requires.
from threading import Thread
import httplib  # Python 2's built-in HTTP client

REQUESTS = 5

def doRequest():
    conn = httplib.HTTPConnection("www.google.com")
    conn.request("POST", '/some/url')

# Start all threads first so the requests fire nearly simultaneously,
# then join them so the process doesn't exit before they finish.
threads = [Thread(target=doRequest) for _ in range(REQUESTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
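As noted above, you would tweak the conn.request line to carry whatever body and headers your endpoint expects, for example (the form field name here is hypothetical):
from urllib import urlencode  # Python 2; use urllib.parse.urlencode on Python 3

body = urlencode({"item_id": "123"})  # hypothetical form field
headers = {"Content-Type": "application/x-www-form-urlencoded"}
conn.request("POST", "/some/url", body, headers)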

$_POST undefined from remote server POST

I am writing a Drupal 7 module that listens for HTTP POST messages sent by a 3rd-party remote application. For testing, I am sending messages using the Firefox Poster extension.
If I POST the message, the following code fails to place any value in my local vars (I get 'undefined index'):
$transId = urldecode($_POST['c2s_transaction_id']);
However, if I send the message using GET, the vars get populated fine with the following code:
$transId = urldecode($_REQUEST['c2s_transaction_id']);
This is true on both my local WAMP setup and on a shared hosting package.
I have never worked with HTTP POST messages before and have no idea where the problem might be. Could it be Drupal, the web server, or my code? Can anyone suggest how I might resolve this?
Many thanks,
Polly
Drupal removes $_POST/$_GET in the system; just use $_REQUEST instead.