I wrote a simple test to validate an HTTPS proxy in Scrapy, but it didn't work:
import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        if response.status == 200:
            print(response.text)
and the middlewares file looks like this:
class DynamicProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'https://183.159.88.182:8010'
and the settings file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'requestTest.middlewares.DynamicProxyDownloaderMiddleware': 100
}
When I use the requests library, the HTTPS proxy works, but after switching to Scrapy it doesn't, which confuses me. Does anybody know why?
The log file shows:
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.baidu.com/> (failed 1 times): TCP connection timed out: 10060
The proxy address is https://183.159.88.182:8010.
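For comparison, here is a minimal sketch of the equivalent check with the requests library (the proxy address is the one from the question; the timeout value is my own addition). If this succeeds from the same machine while the Scrapy run still times out, the proxy itself is reachable and the problem lies in the Scrapy configuration; if it also hangs, the proxy is simply unreachable.

import requests

# Same proxy as in the middleware above, routed for both target schemes.
proxies = {
    'http': 'https://183.159.88.182:8010',
    'https': 'https://183.159.88.182:8010',
}

# Timeout is an arbitrary value chosen for this sanity check.
resp = requests.get('http://www.baidu.com/', proxies=proxies, timeout=10)
print(resp.status_code)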
A related question:
My middleware settings:
from w3lib.http import basic_auth_header


class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "111.11.11.111:1111"
        request.headers['Proxy - Authorization'] = basic_auth_header('login', 'password')
My settings:
DOWNLOADER_MIDDLEWARES = {
    'my_project.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
After launching, I get an error:
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy 217.29.53.106:51725 [{'status': 407, 'reason': b'Proxy Authentication Required'}]
What is the reason, and how do I fix it? (I use valid HTTPS proxies.)
Try changing the header name to Proxy-Authorization:
request.headers['Proxy-Authorization'] = basic_auth_header('login', 'password')
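For reference, here is a minimal sketch of the middleware with that fix applied. The address and credentials are the placeholders from the question; note that the value of request.meta['proxy'] should normally carry an explicit scheme as well.

from w3lib.http import basic_auth_header


class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Proxy URL with an explicit scheme.
        request.meta['proxy'] = "http://111.11.11.111:1111"
        # Header name without spaces around the hyphen.
        request.headers['Proxy-Authorization'] = basic_auth_header('login', 'password')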
proxy = {
    'http': 'http://{user}:{password}@{host}:{port}',
    'https': 'https://{user}:{password}@{host}:{port}',
}
yield scrapy.Request(url=url, meta={'proxy': proxy['https']})

Doesn't this work?
I am making a Scrapy request to get all the data from a site. I am trying to get the response of the full request, but I am not getting any result. The code is attached below. Thanks for your help.
import scrapy
from scrapy import Request


class FilminSpider(scrapy.Spider):
    name = 'filmin'
    allowed_domains = ['filmin.es']
    start_urls = ['https://www.filmin.es/wapi/catalog/browse?type=film&page=2&limit=60']

    def get_all_movies_data(self):
        url = 'https://www.filmin.es/wapi/catalog/browse?type=film&page=2&limit=60'
        headers = {"x-requested-with": "XMLHttpRequest"}
        request = Request(url=url, method='GET', dont_filter=True,
                          headers=headers)

    def parse(self, response):
        return response.request
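For what it's worth, here is a minimal sketch of how that request could actually be scheduled and its response consumed; the start_requests wiring and the assumption that the endpoint returns a JSON object are mine, not part of the original post.

import json

import scrapy


class FilminSpider(scrapy.Spider):
    name = 'filmin'
    allowed_domains = ['filmin.es']

    def start_requests(self):
        url = 'https://www.filmin.es/wapi/catalog/browse?type=film&page=2&limit=60'
        headers = {"x-requested-with": "XMLHttpRequest"}
        # The request has to be yielded (or returned) for Scrapy to schedule it.
        yield scrapy.Request(url=url, method='GET', dont_filter=True,
                             headers=headers, callback=self.parse)

    def parse(self, response):
        # Assuming the endpoint returns a JSON object, yield it as an item.
        yield json.loads(response.text)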
Please have a look at the example below; it may help:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError


class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from the HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
After running scrapy crawl test I get this in the stats: 'dupefilter/filtered': 288. How can I store the filtered requests in a .txt file (or any other format) so I can view them later?
To achieve this you need to do two things:

1. Set the DUPEFILTER_DEBUG setting to True - it will add all filtered requests to the log.
2. Set LOG_FILE to save the log to a txt file.
One possible way to do it is by setting the custom_settings spider attribute:
....

class SomeSpider(scrapy.Spider):
    ....
    custom_settings = {
        "DUPEFILTER_DEBUG": True,
        "LOG_FILE": "log.txt",
    }
    ....

    def parse(self, response):
        ....
You will have log lines like this:
2019-12-21 20:34:07 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/page/4/> (referer: http://quotes.toscrape.com/page/3/)
UPDATE
To save only dupefilter logs:
....
from logging import FileHandler


class SomeSpider(scrapy.Spider):
    ....
    custom_settings = {
        "DUPEFILTER_DEBUG": True,
        # "LOG_FILE": "log.txt",  # - optional
    }
    ....

    def start_requests(self):
        # Adding a file handler to the dupefilter logger:
        dupefilter_log_filename = "df_log.txt"
        self.crawler.engine.slot.scheduler.df.logger.addHandler(
            FileHandler(dupefilter_log_filename, delay=False, encoding="utf-8"))

    def parse(self, response):
        ....
Additional info:
Scrapy logging documentation
Python logging module documentation
Here is my spider, which I run from a script to parse the content of my local DokuWiki:
import hashlib
import re

import keyring
import scrapy
from scrapy.crawler import CrawlerProcess

DEBUG = True
if DEBUG:
    f_debug = open('debug.log', 'w')

md5s = []


class DokuWikiMd5Spider(scrapy.Spider):
    name = 'dokuwikispider'
    start_urls = ['https://dokuwiki.mjcc.lasil.ru/doku.php']
    visited = []
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
    }

    @staticmethod
    def get_page_name(url):
        url = url.replace("https://dokuwiki.mjcc.lasil.ru/doku.php?", '')
        if 'id=start&do=search' in url:
            # because credentials are in the URL, here we cut only the page name
            # https://dokuwiki.mjcc.lasil.ru/doku.php?id=start&do=search&id=%D0%BF%D0%BE%D1%81%D1%82%D0%B0%D0%B2%D1%89%D0%B8%D0%BA%D0%B8_%D0%B8_%D0%BA%D0%BE%D0%BD%D1%82%D0%B0%D0%BA%D1%82%D1%8B&q=&p=PASSWORD&u=admin
            m = re.findall('id=([^&]+)', url)
            return m[1]
        else:
            m = re.search('id=([^&]+)', url)
            return m.group(1)

    def parse(self, response):
        password = keyring.get_password('dokuwiki', 'admin')
        return scrapy.FormRequest.from_response(
            response,
            formdata={'u': 'admin', 'p': password},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeeded before going on
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        # continue scraping with the authenticated session...
        if DEBUG:
            f_debug.write("parsing: {}\n".format(response.url))
        text = response.text
        # cut everything except page content, not to depend on wiki settings when comparing
        m = re.findall('.*(<!-- wikipage start -->.*<!-- wikipage stop -->).*', text, re.DOTALL)
        text = m[0][0]
        # with open(r'F:\TEMP\test.html','w') as f:
        #     f.write(text)
        md5 = hashlib.md5()
        md5.update(text.encode('utf-8'))
        md5s.append({'url': self.get_page_name(response.url), 'md5': md5.hexdigest()})
        yield {'url': self.get_page_name(response.url), 'md5': md5.hexdigest()}

        for next_page in response.xpath('//a/@href'):
            next_url = next_page.extract()
            if DEBUG:
                f_debug.write("\t?next page: {}\n".format(next_url))
            if 'doku.php?id=' in next_url:
                # to process every page name only one time
                next_page_name = self.get_page_name(next_url)
                if next_page_name not in self.visited:
                    if DEBUG:
                        f_debug.write("\t\t!\n")
                    self.visited.append(next_page_name)
                    yield response.follow(
                        "https://dokuwiki.mjcc.lasil.ru/{}&u=admin&p={}".format(
                            next_url, keyring.get_password('dokuwiki', 'admin')),
                        self.after_login)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(DokuWikiMd5Spider)
process.start()  # the script will block here until the crawling is finished
So in the debug messages I see that the spider crawled the page 'wiki_backup':
2019-01-28 19:49:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dokuwiki.mjcc.lasil.ru//doku.php?id=wiki_backup&u=admin&p=PASSWORD> (referer: https://dokuwiki.mjcc.lasil.ru//doku.php?id=%D1%81%D0%BE%D0%B7%D0%B4%D0%B0%D0%BD%D0%B8%D0%B5_%D0%B8_%D0%BF%D1%80%D0%BE%D0%B2%D0%B5%D1%80%D0%BA%D0%B0_%D0%B1%D1%8D%D0%BA%D0%B0%D0%BF%D0%BE%D0%B2&u=admin&p=PASSWORD)
And I can see its content when the page is crawled, as you can see in the screenshot.
But that page was never parsed, as you can see in debug.log:
root@F91_Moin20:/home/ishayahu # cat debug.log | grep wiki_backup
?next page: /doku.php?id=wiki_backup
The problem was in the way the spider checks whether authentication failed. It (as in the tutorial) searches for the words "authentication failed", but because I had the same words in the page content, the spider thought there was an authentication error and stopped processing the page.
There should be a better way to check whether authentication really failed.
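One possibility (a sketch only; the exact markers depend on the wiki template, so these XPaths are assumptions rather than something verified against this site) is to look for structural evidence of a logged-in session, such as a logout link, instead of searching the page text:

    def after_login(self, response):
        # A logged-in DokuWiki page normally exposes a logout action (do=logout),
        # while a failed login renders the login form again (password field "p",
        # matching the formdata used above).
        logged_in = bool(response.xpath('//a[contains(@href, "do=logout")]'))
        login_form = bool(response.xpath('//form[.//input[@name="p"]]'))
        if not logged_in or login_form:
            self.logger.error("Login failed")
            return
        # ... continue with the page-hashing logic shown above ...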
I need to update the location on a site that uses radio buttons. This can be done with a simple POST request. The problem is that the output of this request is
window.location='http://store.intcomex.com/en-XCL/Products/Categories?r=True';
Since it is not a valid URL, Scrapy redirects it to PageNotFound and closes the spider.
2017-09-17 09:57:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://store.intcomex.com/en-XCL/ServiceClient/PageNotFound> from <POST https://store.intcomex.com/en-XCL//User/SetNewLocation>
Here is my code:
def after_login(self, response):
    # inspect_response(response, self)
    url = "https://store.intcomex.com/en-XCL//User/SetNewLocation"
    data = {"id": "xclf1"}
    yield scrapy.FormRequest(url, formdata=data, callback=self.location)
    # inspect_response(response, self)

def location(self, response):
    yield scrapy.Request(url='http://store.intcomex.com/en-XCL/Products/Categories?r=True',
                         callback=self.categories)
The question is: how can I redirect Scrapy to the valid URL after executing the POST request that changes the location? Is there some argument that indicates the target URL, or can I execute it without a callback and yield the correct URL on the next line?
Thanks.
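One approach that might work (a sketch; it assumes the body of the POST response really contains the window.location line shown above, which is worth confirming with inspect_response) is to stop the redirect middleware from following the 302, pull the URL out of the JavaScript snippet, and yield a normal request to it. The self.categories callback is the one from the question.

import re

import scrapy

# Inside the spider class:

def after_login(self, response):
    url = "https://store.intcomex.com/en-XCL//User/SetNewLocation"
    data = {"id": "xclf1"}
    # Keep Scrapy from following the 302 to PageNotFound so the body stays readable.
    yield scrapy.FormRequest(url, formdata=data, callback=self.location,
                             meta={'dont_redirect': True,
                                   'handle_httpstatus_list': [302]})

def location(self, response):
    # Extract the target URL from the window.location JavaScript snippet.
    m = re.search(r"window\.location\s*=\s*'([^']+)'", response.text)
    if m:
        yield scrapy.Request(url=m.group(1), callback=self.categories)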