I am trying to crawl booking.com from a VM and I don't get the same response as the one from my local machine. The query is the following:
scrapy shell --set="ROBOTSTXT_OBEY=False" -s USER_AGENT="Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0" "https://www.booking.com/hotel/fr/le-transat-bleu.fr.html?aid=304142;label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaE2IAQGYAQ3CAQp3aW5kb3dzIDEwyAEM2AEB6AEB-AELkgIBeagCAw;sid=746d95cb38d6de7fbb5a878954481e7b;all_sr_blocks=33843609_122840412_1_2_0;checkin=2019-03-17;checkout=2019-03-18;dest_id=-1424668;dest_type=city;dist=0;group_adults=1;group_children=0;hapos=1;highlighted_blocks=33843609_122840412_1_2_0;hpos=1;req_adults=1;req_children=0;room1=A%2C;sb_price_type=total;sr_order=popularity;srepoch=1550502677;srpvid=26936aca347f0334;type=total;ucfs=1&#hotelTmpl"
When I run the query from my local machine, I get a response with the same URL as the one in the query, while from the VM I get the generic response:
https://www.booking.com/hotel/fr/le-transat-bleu.fr.html
I must mention that before adding the USER_AGENT part I was getting the same generic answer even on my local machine.
Also, if I use Links, a command-line browser, from the VM, I get the correct response. Hence the problem does not seem to come from the public IP of the VM I use.
I suspect that there is another piece of information that booking.com might be using, on top of the USER_AGENT and the robots.txt file, to prevent the crawling of certain pages, but I don't know which one.
Local Request Headers
{b'Accept': b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*; q=0.8', b'Accept-Language': b'en', b'User-Agent': b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0', b'Accept-Encoding': b'gzip,deflate'}
VM Request Headers
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*; q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'], b'Accept-Encoding': [b'gzip,deflate'], b'Cookie': [b'bkng=11UmFuZG9tSVYkc2RlIyh9Yaa29%2F3xUOLbXpFeYC4TUhBTLg%2BWRWQhTWxLpR01uuU40DSTIBsY%2F5OusQaibxVABBhdPCiYlEsnGLdmcDyD%2BtWFGVlewF8Fo59TLNV6vs0R1Ypha9MOkYUl6wASmexLrJie%2F3imTygdbEEsnB0sv0m%2B%2FJ1C6Cm42FEFBT222yQ7']}
VM Request without cookies
scrapy shell --set="COOKIES_ENABLED=False" --set="ROBOTSTXT_OBEY=False" -s USER_AGENT="Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0" "https://www.booking.com/hotel/fr/le-transat-bleu.fr.html?aid=304142;label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaE2IAQGYAQ3CAQp3aW5kb3dzIDEwyAEM2AEB6AEB-AELkgIBeagCAw;sid=746d95cb38d6de7fbb5a878954481e7b;all_sr_blocks=33843609_122840412_1_2_0;checkin=2019-03-17;checkout=2019-03-18;dest_id=-1424668;dest_type=city;dist=0;group_adults=1;group_children=0;hapos=1;highlighted_blocks=33843609_122840412_1_2_0;hpos=1;req_adults=1;req_children=0;room1=A%2C;sb_price_type=total;sr_order=popularity;srepoch=1550502677;srpvid=26936aca347f0334;type=total;ucfs=1&#hotelTmpl"
VM Request Headers without cookies
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'], b'Accept-Encoding': [b'gzip,deflate']}
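For anyone reproducing this, here is a quick way to check inside scrapy shell whether the generic page comes from a server-side redirect and what was actually sent (these are the standard shell objects, nothing booking-specific):
# Run inside `scrapy shell <url>` once the page has been fetched
print(response.status)                          # final status code
print(response.url)                             # final URL after any redirects
print(response.request.url)                     # URL of the last request actually made
print(response.request.headers)                 # headers that were actually sent
print(response.headers.getlist(b"Set-Cookie"))  # cookies the server handed back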
Related
I know how to use a random (fake) user agent in Scrapy, but after I run Scrapy I can see only one random user agent in the terminal. So I guessed that maybe settings.py runs only once when I start Scrapy. If Scrapy really works like this and sends 1000 requests to some web page to collect 1000 items, it will just send the same user agent every time, and surely that makes it easy to get banned.
Can you tell me how I can send a random user agent with each request Scrapy makes to a website?
I used this library in my Scrapy project to set a Faker-generated user agent in settings.py:
https://pypi.org/project/Faker/
from faker import Faker
fake = Faker()
Faker.seed(fake.random_number())
fake_user_agent = fake.chrome()
USER_AGENT = fake_user_agent
This is what I wrote in settings.py. Can it work well?
If you are setting USER_AGENT in your settings.py like in your question then you will just get a single (random) user agent for your entire crawl.
You have a few options if you want to set a fake user agent for each request.
Option 1: Explicitly set User-Agent per request
This approach involves setting the user-agent in the headers of your Request directly. In your spider code you can import Faker as you do above, but then call e.g. fake.chrome() on every Request. For example:
# At the top of your file
from scrapy import Request
from faker import Faker

# This can be a global or class variable
fake = Faker()

...

# When you make a Request, generate a fresh user-agent for it
yield Request(url, headers={"User-Agent": fake.chrome()})
Option 2: Write a middleware to do this automatically
I won't go into this, because you might as well use one that already exists.
Option 3: Use an existing middleware to do this automatically (such as scrapy-fake-useragent)
If you have lots of requests in your code, option 1 isn't so nice, so you can use a middleware to do this for you. Once you've installed scrapy-fake-useragent you can set it up in your settings file as described in its documentation:
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}
FAKEUSERAGENT_PROVIDERS = [
'scrapy_fake_useragent.providers.FakeUserAgentProvider',
'scrapy_fake_useragent.providers.FakerProvider',
'scrapy_fake_useragent.providers.FixedUserAgentProvider',
]
Using this you'll get a new user-agent per Request, and if a Request fails you'll also get a new random user-agent. One of the key parts of the setup is FAKEUSERAGENT_PROVIDERS, which tells the middleware where to get the user-agent from. The providers are tried in the order they are defined, so the second one is only used if the first fails for some reason (i.e. if getting the user-agent fails, not if the Request fails). Note that if you want to use Faker as the primary provider, you should put it first in the list:
FAKEUSERAGENT_PROVIDERS = [
'scrapy_fake_useragent.providers.FakerProvider',
'scrapy_fake_useragent.providers.FakeUserAgentProvider',
'scrapy_fake_useragent.providers.FixedUserAgentProvider',
]
There are other configuration options (such as always using a random Chrome-like user-agent) listed in the scrapy-fake-useragent docs.
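For example, from memory of the scrapy-fake-useragent README (treat these setting names as something to double-check against the docs rather than as gospel):
# settings.py -- setting names recalled from the scrapy-fake-useragent README
FAKE_USERAGENT_RANDOM_UA_TYPE = "chrome"  # ask fake-useragent for Chrome-like UAs
FAKER_RANDOM_UA_TYPE = "chrome"           # ask the Faker provider for Chrome-like UAs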
Example spider
Here is an example spider. For convenience I set the settings inside the spider, but you can put these into your settings.py file.
# fake_user_agents.py
from scrapy import Spider


class FakesSpider(Spider):
    name = "fakes"
    start_urls = ["http://quotes.toscrape.com/"]

    custom_settings = dict(
        DOWNLOADER_MIDDLEWARES={
            "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
            "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
            "scrapy_fake_useragent.middleware.RetryUserAgentMiddleware": 401,
        },
        FAKEUSERAGENT_PROVIDERS=[
            "scrapy_fake_useragent.providers.FakerProvider",
            "scrapy_fake_useragent.providers.FakeUserAgentProvider",
            "scrapy_fake_useragent.providers.FixedUserAgentProvider",
        ],
    )

    def parse(self, response):
        # Print out the user-agent of the request to check they are random
        print(response.request.headers.get("User-Agent"))

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Then if I run this with scrapy runspider fake_user_agents.py --nolog, the output is:
b'Mozilla/5.0 (Macintosh; PPC Mac OS X 10 11_0) AppleWebKit/533.1 (KHTML, like Gecko) Chrome/59.0.811.0 Safari/533.1'
b'Opera/8.18.(Windows NT 6.2; tt-RU) Presto/2.9.169 Version/11.00'
b'Opera/8.40.(X11; Linux i686; ka-GE) Presto/2.9.176 Version/11.00'
b'Opera/9.42.(X11; Linux x86_64; sw-KE) Presto/2.9.180 Version/12.00'
b'Mozilla/5.0 (Macintosh; PPC Mac OS X 10 5_1 rv:6.0; cy-GB) AppleWebKit/533.45.2 (KHTML, like Gecko) Version/5.0.3 Safari/533.45.2'
b'Opera/8.17.(X11; Linux x86_64; crh-UA) Presto/2.9.161 Version/11.00'
b'Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 5.1; Trident/3.1)'
b'Mozilla/5.0 (Android 3.1; Mobile; rv:55.0) Gecko/55.0 Firefox/55.0'
b'Mozilla/5.0 (compatible; MSIE 9.0; Windows CE; Trident/5.0)'
b'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10 11_9; rv:1.9.4.20) Gecko/2019-07-26 10:00:35 Firefox/9.0'
I am trying to test an API hosted on AWS API Gateway and always get the following error:
Error: socket hang up
Request Headers
clientId: system
Authorization: //Correct Auth Token
User-Agent: PostmanRuntime/7.26.8
Accept: */*
Host: //API Host URL
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
However, when I pass in an invalid Auth Token (like TEST), I actually receive a 403 error as expected. I can also see logs in CloudWatch confirming the call reached the authorizer.
CloudWatch Logs
The same API works perfectly fine for other people.
I have tried almost every resolution I found online related to this issue: I turned off 'SSL Certificate Verification' in Postman and kept my proxy settings the same as my colleagues'. I also tried to hit the API after disconnecting from the VPN, but nothing worked for me.
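For reference, the equivalent request replayed outside Postman with a plain Python client would look roughly like this (HOST, PATH and TOKEN are placeholders, not the real values):
# Rough equivalent of the Postman call, for replaying outside Postman.
# HOST, PATH and TOKEN are placeholders -- not the real values.
import requests

resp = requests.get(
    "https://HOST/PATH",
    headers={
        "clientId": "system",
        "Authorization": "TOKEN",
        "Accept": "*/*",
    },
    timeout=30,
)
print(resp.status_code)
print(resp.headers)
If this also hangs, the problem is unlikely to be Postman-specific.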
Could anyone please help me with this?
Thanks in advance.
Description:
I upgraded the docker-selenium version to 3.141.59-zinc (from 3.141.59-europium), and the acceptance tests started failing because header info (set through the proxy server) is not found at the server side. If I change the image from zinc back to europium, everything works fine.
Log trace with 3.141.59-europium:
Remote address of request printed at server side: 127.0.0.1
Headers: {accept-language=en-US,en;q=0.9, host=localhost:39868, upgrade-insecure-requests=1, user=123456789, accept-encoding=gzip, deflate, br, user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36,
accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8, via=1.1 browsermobproxy}
Log trace with 3.141.59-zinc :
Remote address of request printed at server side: 0:0:0:0:0:0:0:1
Headers: {sec-fetch-mode=navigate, sec-fetch-site=none, accept-language=en-US,en;q=0.9, host=localhost:42365, upgrade-insecure-requests=1, connection=keep-alive, sec-fetch-user=?1, accept-encoding=gzip, deflate, br, user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36, accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9}
To Reproduce
1. Create a Proxy object with host and port.
2. Set the proxy in the webdriver capabilities:
   DesiredCapabilities cap = DesiredCapabilities.chrome();
   cap.setCapability(CapabilityType.PROXY, proxy);
3. Set the proxy header:
   proxyServer.addHeader("user", "123456789");
4. Access the application:
   driver.get("http://localhost:/welcome")
5. Check for the proxy header "user"; it should be 123456789.
Expected behaviour
I am setting a header with user=123456789, which is not getting passed when using webdriver 3.141.59-zinc. If I manually call the URL using URLConnection with the proxy, it works, so there is no issue in the proxy server itself.
Also, if I use the IP address instead of localhost, it works fine (the proxy header is available in the request at the server). So I guess the new version, 3.141.59-zinc, is ignoring the proxy for localhost. I also tried setting noProxy to null/"" but it did not work.
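The user-agent strings above show that Chrome jumped from 71 (europium) to 79 (zinc), so this may be Chrome itself bypassing the proxy for loopback addresses rather than a Selenium change. A sketch of a possible workaround, written in Python for brevity and relying on Chromium's <-loopback> proxy-bypass rule (treat the switch and the placeholder addresses as assumptions to verify), would be:
from selenium import webdriver

options = webdriver.ChromeOptions()
# Route traffic through the proxy (placeholder host/port)
options.add_argument("--proxy-server=http://PROXY_HOST:PROXY_PORT")
# "<-loopback>" removes Chrome's implicit bypass of localhost/127.0.0.1,
# so requests to http://localhost/... should also go through the proxy
options.add_argument("--proxy-bypass-list=<-loopback>")

driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",  # standalone-chrome container
    options=options,
)
driver.get("http://localhost:8080/welcome")  # placeholder port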
Environment
OS: Oracle Linux Server release 7.5
Docker-Selenium image version: 3.141.59-zinc
Docker version: 17.06.2-ol
Note: Using standalone chrome in headless mode
I am working on a project that involves two embedded devices, let's call them A and B. Device A is the controller and B is being controlled. My goal is to make an emulator for device B, i.e., something that acts like B so that A thinks it's controlling B, but in reality it is controlling my own emulator. I cannot control or change A.
Control occurs via the controller issuing GET requests that invoke various CGI scripts, so the plan is to install Apache on "my" device, set up CGI, and replicate the various scripts. I am running Apache 2.4.18 on Ubuntu 16.04.5 and have configured Apache2 so it successfully runs the various scripts depending on the URL. As an example, one of the scripts is called 'man_session' and a typical URL issued by device A looks like this: http://192.168.0.14/cgi-bin/man_session?command=get&page=122
I have built a C/C++ program named 'man_session' and have successfully configured Apache to invoke my script when this URL is submitted. I can see this in the Apache log:
192.168.0.2 - - [24/Jan/2019:14:38:38 +0000] "GET /cgi-bin/man_session?command=get&page=122 HTTP/1.1" 200 206 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
Also, my script writes to stderr and I can see the output in the log file:
[Thu Jan 24 14:46:10.850123 2019] [cgi:error] [pid 23346:tid 4071617584] [client 192.168.0.2:62339] AH01215: Received man_session command 'command=get&page=122': /home/pi/cgi-bin/man_session
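For context, 'man_session' just follows the standard CGI contract; a minimal stand-in (sketched in Python here for brevity, my real program is C/C++) looks like this:
#!/usr/bin/env python3
# Minimal illustration of the CGI contract used by 'man_session' (not the real
# C/C++ program): Apache passes the query string in the QUERY_STRING environment
# variable, stderr goes to the error log, and the response is headers, a blank
# line, then the body on stdout.
import os
import sys

query = os.environ.get("QUERY_STRING", "")
print("Received man_session command '%s'" % query, file=sys.stderr)

sys.stdout.write("Content-Type: text/plain\r\n\r\n")
sys.stdout.write("command echo: %s\n" % query)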
So far so good. The problem I am having is that the script does not get invoked when device A makes the request, only when I make the request via a browser (both Chrome and Internet Explorer work) or curl. The browsers run on my Windows PC and curl runs on the embedded device "B" itself.
When I turn on device A, I can see the URL activity in the log, but the script does not get invoked. Below is a log entry showing the same URL which nevertheless does not invoke the 'man_session' script. It shows a status code of 400, which according to the HTTP specification indicates an error "due to malformed syntax". Other differences are the missing referrer and user-agent information and HTTP/1.0 vs HTTP/1.1, but I don't see why these would matter.
192.168.0.9 - - [24/Jan/2019:14:38:12 +0000] "GET /cgi-bin/man_session?command=get&page=7 HTTP/1.0" 400 0 "-" "-"
Note that device A is 192.168.0.9 and my PC is 192.168.0.2. What am I missing here? Why doesn't the above request invoke the script the way it does when issued from the browser? Is there any place where I can get more information about why the 400 code occurs in this case?
After a lot of back and forth, I finally figured out the issue. Steps taken:
Increased the log level to debug (instead of the default 'warn') in apache2.conf.
This caused the following error message to show up in the log:
[Sat Jan 26 02:47:56.974353 2019] [core:debug] [pid 15603:tid 4109366320] vhost.c(794): [client 192.168.0.9:61001] AH02415: [strict] Invalid host name '192.168.000.014'
After a bit of research, I added the following line to the apache2.conf file:
HttpProtocolOptions Unsafe
This fixed it and the scripts are now called as expected. (Apache's strict checking was rejecting the zero-padded host name '192.168.000.014' that device A sends, which is why it answered 400 before the script was ever invoked.)
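For reference, the two apache2.conf changes together amount to the following (a sketch; where exactly these lines go depends on your configuration layout):
# apache2.conf
# Temporarily raise verbosity so messages like AH02415 show up in the error log
LogLevel debug
# Accept the host name that the strict check rejects ('192.168.000.014')
HttpProtocolOptions Unsafe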
Platforms:
Tested on iPhone iOS 10, macOS Sierra v10.12.6
Safari v10.1.2 (Safari v10.1.1 and below don't seem to have this problem, and neither do Chrome nor Firefox)
Description:
We're having a problem saving a photo through Google Cloud Storage. From the web inspector, we see that we're making an OPTIONS request to http://storage.googleapis.com/..., but we receive an empty response. In other browsers, or in other versions of Safari, we don't see an OPTIONS request at all, only the POST request. We've verified that our CORS configuration on the Google Cloud Storage bucket allows our origin.
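For reference, the bucket CORS configuration referred to above is the usual JSON document applied with gsutil cors set; it looks roughly like this (the origin and bucket name below are placeholders, not our real values):
[
  {
    "origin": ["https://www.example.com"],
    "method": ["GET", "POST"],
    "responseHeader": ["Content-Type"],
    "maxAgeSeconds": 3600
  }
]
It is applied with: gsutil cors set cors.json gs://example-bucket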
Our request headers for the OPTIONS request look like this:
Access-Control-Request-Headers:
Referer: <referrer>
Origin: <origin>
Accept: */*
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8
Access-Control-Request-Method: POST
From what we can see in the Safari devtools, there are no response headers and the response body is empty. Safari devtools also shows no status code, but when we used Charles we saw a 200 status code, with the response body and response headers empty there as well.
These are the errors in the console:
http://storage.googleapis.com/... Failed to load resource: Origin
<origin> is not allowed by Access-Control-Allow-Origin
XMLHttpRequest cannot load http://storage.googleapis.com/... Origin
<origin> is not allowed by Access-Control-Allow-Origin.
Is there an issue with the latest version of Safari and CORS?