How can I use a random user agent whenever I send a request? - scrapy

I know how to use a random (fake) user agent in Scrapy, but after I run Scrapy I only ever see one random user agent in the terminal. So I guessed that settings.py is only evaluated once when Scrapy starts. If Scrapy really works like this and sends 1000 requests to some web page to collect 1000 items, it will send the same user agent every time, which surely makes it easy to get banned.
Can you tell me how I can send a random user agent each time Scrapy sends a request to a website?
I used this library in my Scrapy project: https://pypi.org/project/Faker/
Then I set Faker as the user agent in settings.py:
from faker import Faker
fake = Faker()
Faker.seed(fake.random_number())
fake_user_agent = fake.chrome()
USER_AGENT = fake_user_agent
That is what I wrote in settings.py. Will it work?

If you are setting USER_AGENT in your settings.py like in your question then you will just get a single (random) user agent for your entire crawl.
You have a few options if you want to set a fake user agent for each request.
Option 1: Explicitly set User-Agent per request
This approach sets the user agent directly in the headers of each Request. In your spider code you can import Faker as you do above, but then call e.g. fake.chrome() for every Request. For example:
# At the top of your file
from faker import Faker
from scrapy import Request

# This can be a global or class variable
fake = Faker()
...
# When you make a Request
yield Request(url, headers={"User-Agent": fake.chrome()})
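To make that concrete, here is a minimal sketch of option 1 as a complete spider (the spider name and start URL are placeholders, not from your project); every Request gets a fresh fake.chrome() value, including follow-up pages:
from faker import Faker
from scrapy import Spider, Request

fake = Faker()


class RandomUASpider(Spider):
    # Hypothetical spider name and start URL, used only for illustration
    name = "random_ua_example"
    start_urls = ["http://quotes.toscrape.com/"]

    def start_requests(self):
        # Generate a new fake Chrome user agent for every initial request
        for url in self.start_urls:
            yield Request(url, headers={"User-Agent": fake.chrome()})

    def parse(self, response):
        # Log the user agent that was actually sent, to verify it rotates
        self.logger.info(response.request.headers.get("User-Agent"))
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(
                next_page, self.parse, headers={"User-Agent": fake.chrome()}
            )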
Option 2: Write a middleware to do this automatically
I won't go into this, because you might as well use one that already exists (see option 3).
Option 3: Use an existing middleware to do this automatically (such as scrapy-fake-useragent)
If you have lots of requests in your code, option 1 isn't so nice, so you can use a middleware to do this for you. Once you've installed scrapy-fake-useragent you can set it up in your settings file as described in its documentation:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}
FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',
    'scrapy_fake_useragent.providers.FakerProvider',
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',
]
Using this you'll get a new user agent per Request, and if a Request fails you'll also get a new random user agent. One of the key parts of setting this up is FAKEUSERAGENT_PROVIDERS, which tells the middleware where to get the user agent from. The providers are tried in the order they are defined, so the second one is only used if the first fails for some reason (that is, if generating a user agent fails, not if the Request fails). Note that if you want to use Faker as the primary provider, you should put it first in the list:
FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakerProvider',
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',
]
There are other configuration options (such as only generating random Chrome-like user agents), listed in the scrapy-fake-useragent docs.
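For example, restricting both providers to Chrome-style user agents looks roughly like this in settings.py (I am quoting these setting names from memory, so verify them against the scrapy-fake-useragent docs):
# Roughly: only generate Chrome-like user agents
# (setting names quoted from memory -- check the scrapy-fake-useragent docs)
FAKE_USERAGENT_RANDOM_UA_TYPE = 'chrome'  # used by FakeUserAgentProvider
FAKER_RANDOM_UA_TYPE = 'chrome'           # used by FakerProvider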
Example spider
Here is an example spider. For convenience I set the settings inside the spider, but you can put these into your settings.py file.
# fake_user_agents.py
from scrapy import Spider


class FakesSpider(Spider):
    name = "fakes"
    start_urls = ["http://quotes.toscrape.com/"]

    custom_settings = dict(
        DOWNLOADER_MIDDLEWARES={
            "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
            "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
            "scrapy_fake_useragent.middleware.RetryUserAgentMiddleware": 401,
        },
        FAKEUSERAGENT_PROVIDERS=[
            "scrapy_fake_useragent.providers.FakerProvider",
            "scrapy_fake_useragent.providers.FakeUserAgentProvider",
            "scrapy_fake_useragent.providers.FixedUserAgentProvider",
        ],
    )

    def parse(self, response):
        # Print out the user-agent of the request to check they are random
        print(response.request.headers.get("User-Agent"))

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Then if I run this with scrapy runspider fake_user_agents.py --nolog, the output is:
b'Mozilla/5.0 (Macintosh; PPC Mac OS X 10 11_0) AppleWebKit/533.1 (KHTML, like Gecko) Chrome/59.0.811.0 Safari/533.1'
b'Opera/8.18.(Windows NT 6.2; tt-RU) Presto/2.9.169 Version/11.00'
b'Opera/8.40.(X11; Linux i686; ka-GE) Presto/2.9.176 Version/11.00'
b'Opera/9.42.(X11; Linux x86_64; sw-KE) Presto/2.9.180 Version/12.00'
b'Mozilla/5.0 (Macintosh; PPC Mac OS X 10 5_1 rv:6.0; cy-GB) AppleWebKit/533.45.2 (KHTML, like Gecko) Version/5.0.3 Safari/533.45.2'
b'Opera/8.17.(X11; Linux x86_64; crh-UA) Presto/2.9.161 Version/11.00'
b'Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 5.1; Trident/3.1)'
b'Mozilla/5.0 (Android 3.1; Mobile; rv:55.0) Gecko/55.0 Firefox/55.0'
b'Mozilla/5.0 (compatible; MSIE 9.0; Windows CE; Trident/5.0)'
b'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10 11_9; rv:1.9.4.20) Gecko/2019-07-26 10:00:35 Firefox/9.0'

Related

Scrapy generic response from VM

I am trying to crawl booking from a VM and I don't get the same response like the one from my local machine. The query is the following:
scrapy shell --set="ROBOTSTXT_OBEY=False" -s USER_AGENT="Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0" "https://www.booking.com/hotel/fr/le-transat-bleu.fr.html?aid=304142;label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaE2IAQGYAQ3CAQp3aW5kb3dzIDEwyAEM2AEB6AEB-AELkgIBeagCAw;sid=746d95cb38d6de7fbb5a878954481e7b;all_sr_blocks=33843609_122840412_1_2_0;checkin=2019-03-17;checkout=2019-03-18;dest_id=-1424668;dest_type=city;dist=0;group_adults=1;group_children=0;hapos=1;highlighted_blocks=33843609_122840412_1_2_0;hpos=1;req_adults=1;req_children=0;room1=A%2C;sb_price_type=total;sr_order=popularity;srepoch=1550502677;srpvid=26936aca347f0334;type=total;ucfs=1&#hotelTmpl"
When I run the query from my local machine, I get a response for the same URL as the one in the query, while from the VM I get the generic response:
https://www.booking.com/hotel/fr/le-transat-bleu.fr.html
I should mention that before adding the USER_AGENT part I was getting the generic response even on my local machine.
Also, if I use Links, a command-line browser from the VM, I get the correct response. Hence it does not seem to come from the public IP of the VM I use.
I suspect that booking.com might be using some other information, on top of the USER_AGENT and the robots.txt file, to prevent crawling of certain pages, but I don't know which one.
Local Request Headers
{b'Accept': b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*; q=0.8', b'Accept-Language': b'en', b'User-Agent': b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0', b'Accept-Encoding': b'gzip,deflate'}
VM Request Headers
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*; q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'], b'Accept-Encoding': [b'gzip,deflate'], b'Cookie': [b'bkng=11UmFuZG9tSVYkc2RlIyh9Yaa29%2F3xUOLbXpFeYC4TUhBTLg%2BWRWQhTWxLpR01uuU40DSTIBsY%2F5OusQaibxVABBhdPCiYlEsnGLdmcDyD%2BtWFGVlewF8Fo59TLNV6vs0R1Ypha9MOkYUl6wASmexLrJie%2F3imTygdbEEsnB0sv0m%2B%2FJ1C6Cm42FEFBT222yQ7']}
VM Request without cookies
scrapy shell --set="COOKIES_ENABLED=False" --set="ROBOTSTXT_OBEY=False" -s USER_AGENT="Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0" "https://www.booking.com/hotel/fr/le-transat-bleu.fr.html?aid=304142;label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaE2IAQGYAQ3CAQp3aW5kb3dzIDEwyAEM2AEB6AEB-AELkgIBeagCAw;sid=746d95cb38d6de7fbb5a878954481e7b;all_sr_blocks=33843609_122840412_1_2_0;checkin=2019-03-17;checkout=2019-03-18;dest_id=-1424668;dest_type=city;dist=0;group_adults=1;group_children=0;hapos=1;highlighted_blocks=33843609_122840412_1_2_0;hpos=1;req_adults=1;req_children=0;room1=A%2C;sb_price_type=total;sr_order=popularity;srepoch=1550502677;srpvid=26936aca347f0334;type=total;ucfs=1&#hotelTmpl"
VM Request Headers without cookies
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'], b'Accept-Encoding': [b'gzip,deflate']}
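Not an answer, but for anyone debugging a similar difference: a settings.py sketch that pins the VM's request profile to the one that works locally (cookies off, same user agent and Accept headers as above) is an easy first experiment. These are all standard Scrapy settings; the values are simply copied from the headers shown above.
# settings.py sketch -- reproduce the working local request profile on the VM
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False   # drop the bkng cookie shown in the VM headers
USER_AGENT = "Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0"
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}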

CORS Fails with Safari 10.1.2 and Google Storage

Platforms:
Tested on iPhone iOS 10, macOS Sierra v10.12.6
Safari v10.1.2 (Safari v10.1.1 and below don't seem to have this problem, and neither do Chrome nor Firefox)
Description:
We're having a problem saving a photo through Google Cloud Storage. From the web inspector, we see that we're making an OPTIONS request to http://storage.googleapis.com/..., but we receive an empty response. In other browsers, or in other versions of Safari, we don't see an OPTIONS request at all, only the POST request. We've verified that our CORS configuration on the Google Cloud Storage bucket allows our origin.
Our request headers for the OPTIONS request look like this:
Access-Control-Request-Headers:
Referer: <referrer>
Origin: <origin>
Accept: */*
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8
Access-Control-Request-Method: POST
From what we can see in the Safari devtools, there are no response headers and the response is empty. There is also no visible status code in the devtools, but when we inspected the traffic with Charles we saw a 200 status code, again with an empty response and no response headers.
These are the errors in the console:
http://storage.googleapis.com/... Failed to load resource: Origin
<origin> is not allowed by Access-Control-Allow-Origin
XMLHttpRequest cannot load http://storage.googleapis.com/... Origin
<origin> is not allowed by Access-Control-Allow-Origin.
Is there an issue with the latest version of Safari and CORS?
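For reference, this is roughly how the bucket's CORS policy can be set and inspected with the Python client (the bucket name and origin are placeholders); it shows the shape of the configuration we verified rather than a fix for the Safari behaviour:
# Sketch: update and print the bucket CORS rules with google-cloud-storage.
# The bucket name and origin below are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-upload-bucket")

bucket.cors = [
    {
        "origin": ["https://example.com"],
        "method": ["POST", "OPTIONS"],
        "responseHeader": ["Content-Type"],
        "maxAgeSeconds": 3600,
    }
]
bucket.patch()  # push the updated rules to the bucket
print(bucket.cors)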

Scrapy gets blocked even with Selenium; Selenium on its own doesn't?

I am trying to scrape data off a website. Scrapy on its own didn't work (I get HTTP 403), which led me to believe there are some UI-based countermeasures (e.g. checking for resolution).
Then I tried Selenium; a very basic script clicking its way through the website works just fine. Here's the relevant excerpt of what works:
driver.get(start_url)
try:
    link_next = driver.wait.until(EC.presence_of_element_located(
        (By.XPATH, '//a[contains(.,"Next")]')))
    link_next.click()
Now, in order to store the data, I'm still going to need Scrapy. So I wrote a script combining Scrapy and Selenium.
class MyClass(CrawlSpider):
    ...
    start_urls = [
        "domainiwanttocrawl.com?page=1",
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.wait = WebDriverWait(self.driver, 2)

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                link_next = self.driver.wait.until(EC.presence_of_element_located((By.XPATH, '//a[contains(.,"Next")]')))
                self.driver.wait = WebDriverWait(self.driver, 2)
                link_next.click()

                item = MyItem()
                item['source_url'] = response.url
                item['myitem'] = ...
                return item
            except:
                break

        self.driver.close()
But this will also just result in HTTP 403. If I add something like self.driver.get(url) to the __init__ method, that will work, but nothing beyond that.
So in essence: the Selenium get function continues to work, whereas whatever Scrapy does under the hood with what it finds in start_urls gets blocked. But I don't know how to "kickstart" the crawling without the start_urls. It seems that somehow Scrapy and Selenium aren't actually integrated yet.
Any idea why and what I can do?
Scrapy is a pretty awesome scraping framework, you get a ton of stuff for free. And, if it is getting 403s straight out of the gate, then it's basically completely incapacitated.
Selenium doesn't hit the 403 and you get a normal response. That's awesome, but not because Selenium is the answer; Scrapy is still dead-in-the-water and it's the work-horse, here.
The fact that Selenium works means you can most likely get Scrapy working with a few simple measures. Exactly what it will take is not clear (there isn't enough detail in your question), but the link below is a great place to start.
Scrapy docs - Avoid getting banned
Putting some time into figuring out how to get Scrapy past the 403 is the route I recommend. Selenium is great and all, but Scrapy is the juggernaut when it comes to web-scraping. With any luck it won't take much.
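As a starting point, the most common first measures from that docs page translate into settings along these lines (the values are illustrative defaults, not tuned for any particular site):
# settings.py sketch -- typical first steps for getting past a 403
# (all standard Scrapy settings; the values are only examples)
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
)
COOKIES_ENABLED = False             # some sites fingerprint scrapers via cookies
DOWNLOAD_DELAY = 2                  # slow down between requests
AUTOTHROTTLE_ENABLED = True         # back off automatically under load
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # be gentle with a single domain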
Here is a util that might help: agents.py. It can be used to get a random user agent from a list of popular user agents (circa 2014).
>>> for _ in range(5):
... print agents.get_agent()
...
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53
Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
Below is a basic way to integrate get_agent with Scrapy. (It's not tested, but should point you in the right direction).
import scrapy
from scrapy.http import Request

from agents import get_agent

EXAMPLE_URL = 'http://www.example.com'


def get_request(url):
    headers = {
        'User-Agent': get_agent(),
        'Referer': 'https://www.google.com/'
    }
    return Request(url, headers=headers)


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield get_request(EXAMPLE_URL)
Edit
Regarding user agents, it looks like scrapy-fake-useragent might achieve the same thing a bit more easily.

Change the default USER-AGENT and REFERRER value in wget

When using wget on the console, I usually want to download the version my Firefox would get, for example:
wget --header="Accept: text/html" --user-agent="Mozilla/5.0 ..." --referrer connect.wso2.com http://dist.wso2.org/products/carbon/4.2.0/wso2carbon-4.2.0.zip
How can I change the default behaviour of wget so that just running wget uses the same user agent and headers my current Firefox is using?
(It would also be nice to add the base URL of the downloaded site as the referer.)
Create an alias like so:
alias wget='wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" --referer connect.wso2.com'
You can use type to show how your new wget alias will be interpreted when used as a command name.
type wget
wget is aliased to `wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" --referer connect.wso2.com'
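If you want the defaults to survive across shells without relying on an alias, the same options can also go into ~/.wgetrc (option names per the GNU wget manual; the referer value is just the example from the question):
# ~/.wgetrc -- persistent defaults for wget
header = Accept: text/html
user_agent = Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0
referer = connect.wso2.com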

Why does yandex return 405, when google return 200 Ok?

I have the following problem with the site http://huti.ru. When trying to add any of its pages in http://webmaster.yandex.ru/addurl.xml (Yandex is a Russian search engine), it reports "The server returns a status code http 405 (expected code 200)." What can cause such different behaviour for browsers and the Yandex crawler? (Google indexes the site normally.)
Environment: Tomcat, Java 6
Your server does not allow HEAD requests. It seems that the robot first tries a HEAD before the actual GET.
As http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html states, HEAD should be identical to GET, except that it never returns a message body, only the response headers for a particular request.
Note: I did a simple
HEAD / HTTP/1.0
request. Same with HTTP/1.1 + Host: huti.ru.
Check your server logs for the actual content of the response to the Yandex request.
HTTP 405 is Method Not Allowed, and is usually returned if the user agent has used an HTTP verb not supported for the particular resource.
For example, using Fiddler, I issued several requests to http://huti.ru, and I got a 200 response for HEAD, GET, and POST, but I got 405 for TRACE. It's conceivable that Yandex issues either TRACE or OPTIONS before making a request for the actual page, as a form of ping to determine if the page exists.
Note: #smilingthax mentioned that your server returns 405 on HEAD. However, issuing the following request from Fiddler worked for me:
HEAD http://huti.ru/ HTTP/1.1
Host: huti.ru
Proxy-Connection: keep-alive
Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.23 Safari/534.10
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Thus, your problem might be specific to HEAD requests with particular headers.
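If you don't have Fiddler handy, the same method check can be reproduced with a few lines of Python using the requests library; it simply reports what the server answers for each verb:
# Quick check of which HTTP methods the server accepts (needs the requests library)
import requests

for method in ("HEAD", "GET", "POST", "OPTIONS", "TRACE"):
    response = requests.request(method, "http://huti.ru/", timeout=10)
    # A 405 here points at the method the crawler is likely sending
    print(method, response.status_code, response.headers.get("Allow"))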
I think that 405 means that the page has already been indexed.