Amazon detects Scrapy instantly. How do I prevent the captcha? - scrapy

I am trying to scrape a single page from Amazon using Scrapy 2.4.1 in the shell. Without any prior scraping, Amazon instantly asks for a captcha.
My only countermeasure is setting a different user agent, and I have never scraped the page before:
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
Get one page:
>>> fetch('https://www.amazon.de/Eastpak-Provider-Rucksack-Noir-Black/dp/B0815FZ3C6/')
>>> view(response)
Results in a captcha question.
I also tried it with headers:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
>>> req = Request("https://www.amazon.de/Eastpak-Provider-Rucksack-Noir-Black/dp/B0815FZ3C6/", headers=headers)
>>> fetch(req)
This also results in a captcha, even though the main page can be scraped this way.
How does Amazon detect that this is a bot, and how can I prevent that?
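One approach worth trying before anything heavier is to send a complete, consistent set of browser-like headers project-wide instead of a lone User-Agent. A minimal sketch of the relevant settings (values are illustrative, and this may well not be enough, since the detection likely looks at more than headers):

# settings.py - illustrative browser-like defaults; no guarantee this avoids the captcha
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) "
              "Gecko/20100101 Firefox/66.0")

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "de-DE,de;q=0.9,en;q=0.8",  # match the .de storefront
    "Upgrade-Insecure-Requests": "1",
}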

Related

How to save a WhatsApp Web session using headless chromedriver?

WhatsApp Web in headless Chrome only works correctly when I set this user agent:
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36")
Then it works, but it still asks for the QR code even though I already used
options.add_argument(r"user-data-dir
When I run without headless mode, chromedriver recognizes the user data dir, but in headless mode it does not. What's the solution?
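A commonly suggested combination, sketched here under the assumption that a profile directory holding the WhatsApp Web session already exists (the path is a placeholder, and --headless=new is only available in newer Chrome builds):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Newer Chrome ships a headless mode much closer to regular Chrome (assumption:
# your Chrome version supports it; older builds only accept --headless).
options.add_argument("--headless=new")
# Reuse the profile that already holds the WhatsApp Web session (placeholder path)
options.add_argument(r"user-data-dir=C:\path\to\profile")
# User agent without the HeadlessChrome token
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                     "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36")

driver = webdriver.Chrome(options=options)
driver.get("https://web.whatsapp.com/")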

Selenium Select Frame takes a very long time to complete

Environment:
Chrome Driver 92.0.4515
I found that selenium.select_frame takes nearly 3 minutes to switch from the main window to the frame.
The issue only occurs with headless Chrome; normal Chrome still works fine.
Any solutions will be highly appreciated! Thank you!
My webdriver's arguments:
options.add_argument(f"--window-size={width},{height}")
options.add_argument("--headless")
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36 Edg/84.0.522.59")
options.add_argument("--no-sandbox")
options.add_argument("--allow-running-insecure-content")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-extensions")
options.add_argument("--start-maximized")

Scrapy gets blocked even with Selenium; Selenium on its own doesn't?

I am trying to scrape data off a website. Scrapy on its own didn't work (I got HTTP 403), which led me to believe there are some UI-based countermeasures (e.g., checking the screen resolution).
Then I tried Selenium; a very basic script clicking its way through the website works just fine. Here's the relevant excerpt of what works:
driver.get(start_url)
try:
    link_next = driver.wait.until(EC.presence_of_element_located(
        (By.XPATH, '//a[contains(.,"Next")]')))
    link_next.click()
Now, in order to store the data, I'm still going to need Scrapy. So I wrote a script combining Scrapy and Selenium.
class MyClass(CrawlSpider):
    ...
    start_urls = [
        "domainiwanttocrawl.com?page=1",
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.wait = WebDriverWait(self.driver, 2)

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            try:
                link_next = self.driver.wait.until(EC.presence_of_element_located(
                    (By.XPATH, '//a[contains(.,"Next")]')))
                self.driver.wait = WebDriverWait(self.driver, 2)
                link_next.click()
                item = MyItem()
                item['source_url'] = response.url
                item['myitem'] = ...
                return item
            except:
                break
        self.driver.close()
But this will also just result in HTTP 403. If I add something like self.driver.get(url) to the __init__ method, that will work, but nothing beyond that.
So in essence: the Selenium get function continues to work, whereas whatever Scrapy does under the hood with what it finds in start_urls gets blocked. But I don't know how to "kickstart" the crawling without the start_urls. It seems that somehow Scrapy and Selenium aren't actually integrated yet.
Any idea why and what I can do?
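One pattern that keeps Scrapy in the loop even when its own downloader gets the 403 is to let Selenium fetch the page and wrap the rendered HTML in an HtmlResponse, so Scrapy's selectors and pipelines still work. A rough sketch of the idea (the domain and XPath are taken from the question; the item fields are placeholders):

import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumBackedSpider(scrapy.Spider):
    name = 'selenium_backed'
    start_urls = ['http://domainiwanttocrawl.com?page=1']
    handle_httpstatus_list = [403]  # let the blocked response reach parse()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Re-fetch with Selenium (which is not blocked) and hand the rendered
        # HTML back to Scrapy as a response object.
        self.driver.get(response.url)
        rendered = HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
        )
        for href in rendered.xpath('//a[contains(.,"Next")]/@href').getall():
            yield {'next_link': href}  # placeholder item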
Scrapy is a pretty awesome scraping framework; you get a ton of stuff for free. But if it is getting 403s straight out of the gate, then it's basically completely incapacitated.
Selenium doesn't hit the 403, and you get a normal response. That's great, but not because Selenium is the answer; Scrapy is still dead in the water, and it's the workhorse here.
The fact that Selenium works means you can most likely get Scrapy working with a few simple measures. Exactly what it will take is not clear (there isn't enough detail in your question), but the link below is a great place to start.
Scrapy docs - Avoid getting banned
Putting some time into figuring out how to get Scrapy past the 403 is the route I recommend. Selenium is great and all, but Scrapy is the juggernaut when it comes to web-scraping. With any luck it won't take much.
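To make that concrete, much of the advice on that page maps onto a handful of settings. A hedged sketch with illustrative values:

# settings.py - illustrative values based on the "avoid getting banned" tips
DOWNLOAD_DELAY = 3                  # slow down; don't hammer the site
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay (0.5x to 1.5x)
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
COOKIES_ENABLED = False             # some sites track crawlers via cookies
AUTOTHROTTLE_ENABLED = True         # back off automatically under load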
Here is a util that might help: agents.py. It can be used to get a random user agent from a list of popular user agents (circa 2014).
>>> for _ in range(5):
...     print(agents.get_agent())
...
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53
Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
Below is a basic way to integrate get_agent with Scrapy. (It's not tested, but should point you in the right direction).
import scrapy
from scrapy.http import Request

from agents import get_agent

EXAMPLE_URL = 'http://www.example.com'

def get_request(url):
    headers = {
        'User-Agent': get_agent(),
        'Referer': 'https://www.google.com/'
    }
    return Request(url, headers=headers)

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield get_request(EXAMPLE_URL)
Edit
Regarding user agents: it looks like this might achieve the same thing, but a bit more easily: scrapy-fake-useragent
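If you go that route, wiring it up is (per the project's README at the time) roughly a matter of swapping out the stock user-agent middleware; verify the middleware paths against the current docs:

# settings.py - sketch; middleware paths follow scrapy-fake-useragent's README
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}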

chromedriver works but "phantomjs unable to locate item using css selector"

I'm designing some end-to-end testing for my job, and I've got it up and running using nightwatch.js through chromedriver. However, we're looking to have this run on our servers, so I wanted to be able to run it using PhantomJS. Although the test performs without incident using chromedriver, PhantomJS yields the following error: "phantomjs unable to locate item using css selector"
Any ideas? I've scoured the internet for a solution, to no avail.
First, check decates' comment here: https://github.com/nightwatchjs/nightwatch/issues/243#issuecomment-94287511
See how, depending on the user-agent info passed from your browser to the site, the site returns different XHTML data? So if you want to use PhantomJS but are okay with it spoofing a different browser via the user agent, you can configure PhantomJS's user-agent capability like this (spoofing Mac Chrome):
"desiredCapabilities": {
"browserName": "phantomjs",
"phantomjs.cli.args" : ["--ignore-ssl-errors=true"],
"phantomjs.page.settings.userAgent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
}
Then your tests should act the same as your other browser. Using any browser you like, you can check the user-agent string that it sends here: http://www.httpuseragent.org/. Here are some other examples:
// Mac Chrome 46
"phantomjs.page.settings.userAgent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
// Windows Chrome 46
"phantomjs.page.settings.userAgent" : "Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
// Mac Firefox 42.0
"phantomjs.page.settings.userAgent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:42.0) Gecko/20100101 Firefox/42.0"
// Windows Firefox 42.0
"phantomjs.page.settings.userAgent" : "Mozilla/5.0 (Windows NT 6.3; rv:42.0) Gecko/20100101 Firefox/42.0"
// PhantomJS 2.0
"phantomjs.page.settings.userAgent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.0.0 Safari/538.1"
I sometimes see this effect in different browsers, not only PhantomJS. The reason seemed to be that elements are not yet loaded at evaluation time in one browser (while they are already loaded in another). You can debug it by checking screenshots at the point of failure.
The solution for me was using waitForElementPresent/waitForElementVisible.

Safari doesn't set content-length when using xmlhttprequest

I have a JavaScript object that I am trying to POST to the server with XMLHttpRequest() using JSON.stringify(). My code works fine in all major browsers except Safari (5.1.2). My analysis shows that Safari is in fact sending the data: I can see the message in the Safari Developer Tools, and I see the bytes received in the IIS logs, which looks accurate (48 KB), but the WCF function doesn't get the object data. Looking into the WCF logs, I see that the Content-Length is 0 for Safari but has a value for Chrome. Does anyone have any insight into this issue?
SAFARI:
<httprequest>
    <Method>POST</Method>
    <QueryString></QueryString>
    <WebHeaders>
        <Connection>keep-alive</Connection>
        <Content-Length>0</Content-Length>
        <Content-Type>application/json</Content-Type>
        <Accept>*/*</Accept>
        <Accept-Encoding>gzip, deflate</Accept-Encoding>
        <Accept-Language>en-US</Accept-Language>
        <Authorization>Negotiate TlRMTVNTUA</Authorization>
        <Cookie>ASP.NET_SessionId=2ynxibj2jovjo345nckpsskm</Cookie>
        <Host>localhost</Host>
        <Referer>http://localhost/xRMS.Net/PROFILE/Employee.htm?winid=flENz4mLt1TRBTvL&theme=ThemeDevelopment.css</Referer>
        <User-Agent>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.52.7 (KHTML, like Gecko) Version/5.1.2 Safari/534.52.7</User-Agent>
        <Origin>http://localhost</Origin>
    </WebHeaders>
</httprequest>
CHROME:
<httprequest>
    <Method>POST</Method>
    <QueryString></QueryString>
    <WebHeaders>
        <Connection>keep-alive</Connection>
        <Content-Length>48822</Content-Length>
        <Content-Type>application/json</Content-Type>
        <Accept>*/*</Accept>
        <Accept-Charset>ISO-8859-1,utf-8;q=0.7,*;q=0.3</Accept-Charset>
        <Accept-Encoding>gzip,deflate,sdch</Accept-Encoding>
        <Accept-Language>en-US,en;q=0.8</Accept-Language>
        <Cookie>ASP.NET_SessionId=gapksa2mmuh3wcrntz32mipw</Cookie>
        <Host>localhost</Host>
        <Referer>http://localhost/xRMS.Net/PROFILE/Employee.htm?winid=FTWLXL4b8aTWaaaM&theme=ThemeDevelopment.css</Referer>
        <User-Agent>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2</User-Agent>
        <Origin>http://localhost</Origin>
    </WebHeaders>
</httprequest>