How can I use rotating proxies with Scrapy?

I wrote this in settings.py after running pip install scrapy-rotating-proxies:
ROTATING_PROXY_LIST = ['http://209.50.52.162:9050']
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620
}
Then when I run the spider with scrapy crawl test, it shows this:
2021-05-03 15:03:32 [rotating_proxies.middlewares] WARNING: No proxies available; marking all proxies as unchecked
2021-05-03 15:03:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-05-03 15:03:50 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 0, reanimated: 1, mean backoff time: 0s)
2021-05-03 15:03:53 [rotating_proxies.expire] DEBUG: Proxy <http://209.50.52.162:9050> is DEAD
2021-05-03 15:03:53 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.google.com> with another proxy (failed 3 times, max retries: 5)
How can I solve this issue?

Notice the message in your logs:
DEBUG: Proxy <http://209.50.52.162:9050> is DEAD
You need to add more proxies as shown in the documentation:
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
You can get a list of proxies from many sites.
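With a single free proxy that is already dead, the middleware has nothing left to rotate through. Here is a minimal settings.py sketch with several entries; the proxy addresses below are placeholders you would replace with proxies that actually work:

# settings.py -- sketch only; replace the placeholder proxies with working ones
ROTATING_PROXY_LIST = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8031',
    'http://proxy3.example.com:3128',
]
# The package also documents ROTATING_PROXY_LIST_PATH for loading one proxy per line from a file:
# ROTATING_PROXY_LIST_PATH = '/path/to/proxies.txt'
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}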

Related

How to resolve 502 response code in Scrapy request?

I created a spider that scrapes data from Yelp using Scrapy. All requests go through the Crawlera proxy. The spider gets the URL to scrape, sends a request, and scrapes the data. This worked fine up until the other day, when I started getting a 502 None response. The 502 None response appears after execution of this line:
r = self.req_session.get(url, proxies=self.proxies, verify='../secret/crawlera-ca.crt').text
The log output:
2020-11-04 14:27:55 [urllib3.connectionpool] DEBUG: https://www.yelp.com:443 "GET /biz/a-dog-in-motion-arcadia HTTP/1.1" 502 None
So, it seems that the spider cannot reach the URL because the connection is closed.
I have checked the meaning of 502 in the Scrapy and Crawlera documentation; it refers to the connection being refused or closed, the domain being unavailable, and similar things.
I have debugged the code around where the issue is happening, and everything is up to date.
If anyone has ideas or knowledge about this, I would love to hear it, as I am stuck. What could actually be the issue here?
NOTE: Yelp URLs work normally when I open them in browser.
The website can tell from the headers of your request that you are a "scraper" and not a human user.
You should send different headers with the request, so that the scraped website thinks you are browsing with a regular browser.
For more info, refer to the Scrapy documentation.
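For the code in the question, which uses a requests session rather than a Scrapy Request, a rough sketch of sending browser-like headers; the User-Agent value is only an example, and self.req_session, self.proxies, and the certificate path are assumed to exist as in the original code:

# Sketch: pass browser-like headers through the existing requests session
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/86.0.4240.111 Safari/537.36'),  # example value only
    'Accept-Language': 'en-US,en;q=0.9',
}
r = self.req_session.get(
    url,
    headers=headers,
    proxies=self.proxies,
    verify='../secret/crawlera-ca.crt',
).text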
Some pages are not available in some countries, which is why using proxies is recommended. I tried to fetch the URL and the connection was successful:
2020-11-05 02:50:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2020-11-05 02:50:40 [scrapy.core.engine] INFO: Spider opened
2020-11-05 02:50:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/biz/a-dog-in-motion-arcadia> (referer: None)

Why does Scrapy not use the random proxy downloader middleware?

I am using the scrapy_proxies library with Scrapy in order to make requests from different IPs through rotating proxies.
This presumably stopped working and my own IP is being used instead, so I am wondering whether there is a fallback or whether I accidentally changed the config.
My settings look like this:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_LIST = '/Users/user/test_crawl/proxy_list.txt'
PROXY_MODE = 0
The proxy list:
http://147.30.82.195:8080
http://168.183.187.238:8080
The log output:
[scrapy.proxies] DEBUG: Proxy user pass not found
2018-12-27 14:23:20 [scrapy.proxies] DEBUG: Using proxy
<http://168.183.187.238:8080>, 2 proxies left
2018-12-27 14:23:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/file.htm> (referer: https://www.example.com)
The DEBUG output "Proxy user pass not found" should be OK, as it indicates that I am not using a user/password to authenticate.
The logfile on the example.com server shows my IP directly instead of the proxy IP.
This used to work, so I am wondering how to get it back working.
scrapy-proxies expects proxies to have a password. If the password is empty, it ignores the proxy.
It should probably fail, as it does when there are no proxies at all, but instead it silently does nothing, which results in no proxy being configured and your own IP being used instead.
I would say you should report the issue upstream, but the project seems dead. So, unless you are willing to fork the project and fix the issue yourself, you are out of luck.
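If forking is not appealing, a tiny custom downloader middleware can do the rotation itself by setting request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honours. A minimal sketch; the class name is made up and the proxy list is taken from the question:

# middlewares.py -- illustrative sketch, not the scrapy_proxies implementation
import random

class SimpleRotatingProxyMiddleware:
    PROXIES = [
        'http://147.30.82.195:8080',
        'http://168.183.187.238:8080',
    ]

    def process_request(self, request, spider):
        # HttpProxyMiddleware picks this up and routes the request through the proxy
        request.meta['proxy'] = random.choice(self.PROXIES)

Enable it in DOWNLOADER_MIDDLEWARES in place of scrapy_proxies.RandomProxy, keeping a priority below HttpProxyMiddleware (for example 100 versus 110).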

Scrapy shell does not recognize the 'sel' object

I'm a Python newbie and I'm trying to use Scrapy for a project.
Scrapy 0.19 is installed on my CentOS box (Linux 2.6.32) and I followed the instructions on the Scrapy documentation page, but found that the Scrapy shell could not find the 'sel' object. Here are my steps:
[root@localhost rpm]# scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
2014-03-02 06:33:23+0800 [scrapy] INFO: Scrapy 0.19.0 started (bot: scrapybot)
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Optional features available: ssl, http11, libxml2
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled item pipelines:
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-02 06:33:23+0800 [default] INFO: Spider opened
2014-03-02 06:33:24+0800 [default] DEBUG: Crawled (200) <GET
http://doc.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html><head><base href="http://example.c'>
[s] item {}
[s] request <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] response <200 http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0x3668ed0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>> sel.xpath('//title/text()')
Traceback (most recent call last):
File "<console>", line 1, in <module>
NameError: name 'sel' is not defined
>>>
Can anyone tell me how to solve this? Thanks in advance.
The sel object was added in version 0.20. When you run the shell command, it tells you which objects you can use; in your case that is hxs, which has similar behaviour:
>>> hxs.select('//title/text()')
You should try to read the documentation first. The selectors section explains pretty clearly how you can use them based on your current version.
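For reference, a rough mapping of the selector shortcuts across versions (assuming you later upgrade past 0.19):

# Scrapy 0.19: the shell exposes hxs (HtmlXPathSelector)
hxs.select('//title/text()').extract()
# Scrapy 0.20+: the shell also exposes sel
# sel.xpath('//title/text()').extract()
# Modern Scrapy: call xpath() on the response directly
# response.xpath('//title/text()').get()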

Why does my Scrapy crawler stop?

I have written a crawler using the Scrapy framework to parse a products site. The crawler suddenly stops midway without completing the full parsing process. I have researched this a lot, and most of the answers indicate that my crawler is being blocked by the website. Is there any mechanism by which I can detect whether my spider is being stopped by the website or whether it stops on its own?
Below is the INFO-level log of the spider:
2013-09-23 09:59:07+0000 [scrapy] INFO: Scrapy 0.18.0 started (bot: crawler)
2013-09-23 09:59:08+0000 [spider] INFO: Spider opened
2013-09-23 09:59:08+0000 [spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-09-23 10:00:08+0000 [spider] INFO: Crawled 10 pages (at 10 pages/min), scraped 7 items (at 7 items/min)
2013-09-23 10:01:08+0000 [spider] INFO: Crawled 22 pages (at 12 pages/min), scraped 19 items (at 12 items/min)
2013-09-23 10:02:08+0000 [spider] INFO: Crawled 31 pages (at 9 pages/min), scraped 28 items (at 9 items/min)
2013-09-23 10:03:08+0000 [spider] INFO: Crawled 40 pages (at 9 pages/min), scraped 37 items (at 9 items/min)
2013-09-23 10:04:08+0000 [spider] INFO: Crawled 49 pages (at 9 pages/min), scraped 46 items (at 9 items/min)
2013-09-23 10:05:08+0000 [spider] INFO: Crawled 59 pages (at 10 pages/min), scraped 56 items (at 10 items/min)
Below is the last part of the DEBUG-level log before the spider is closed:
2013-09-25 11:33:24+0000 [spider] DEBUG: Crawled (200) <GET http://url.html> (referer: http://site_name)
2013-09-25 11:33:24+0000 [spider] DEBUG: Scraped from <200 http://url.html>
// scraped data in JSON form
2013-09-25 11:33:25+0000 [spider] INFO: Closing spider (finished)
2013-09-25 11:33:25+0000 [spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 36754,
'downloader/request_count': 103,
'downloader/request_method_count/GET': 103,
'downloader/response_bytes': 390792,
'downloader/response_count': 103,
'downloader/response_status_count/200': 102,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 9, 25, 11, 33, 25, 1359),
'item_scraped_count': 99,
'log_count/DEBUG': 310,
'log_count/INFO': 14,
'request_depth_max': 1,
'response_received_count': 102,
'scheduler/dequeued': 100,
'scheduler/dequeued/disk': 100,
'scheduler/enqueued': 100,
'scheduler/enqueued/disk': 100,
'start_time': datetime.datetime(2013, 9, 25, 11, 23, 3, 869392)}
2013-09-25 11:33:25+0000 [spider] INFO: Spider closed (finished)
There are still pages remaining to be parsed, but the spider stops.
As far as I know, for a spider:
- There is a queue or pool of URLs to be scraped/parsed by the parsing methods. You can bind a URL to a specific method or let the default parse do the job.
- From the parsing methods you must return/yield further request(s) to feed that pool, and/or item(s).
- When the pool runs out of URLs, or a stop signal is sent, the spider stops crawling.
It would be nice if you shared your spider code so we can check whether those bindings are correct. It's easy to miss some bindings by mistake when using SgmlLinkExtractor, for example; see the sketch below.
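To make the point about feeding the pool concrete, here is a bare-bones sketch of a spider whose parse method yields both items and follow-up requests. The names and selectors are placeholders, written against present-day Scrapy rather than the 0.18 used in the question:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['http://example.com/products']

    def parse(self, response):
        # Yield items scraped from the current page
        for title in response.xpath('//h2/text()').getall():
            yield {'title': title}
        # Feed the pool: without new requests the spider finishes
        # as soon as the scheduler queue is empty
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)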

Scrapy forum crawler starting but not returning any scraped data

Here is my code; can someone please help? For some reason the spider runs but does not actually crawl the forum threads. I am trying to extract all the text in the forum threads for the specific forum in my start URL.
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from xbox.items import xboxItem
from scrapy.item import Item
from scrapy.conf import settings

class xboxSpider(CrawlSpider):
    name = "xbox"
    allowed_domains = ["forums.xbox.com"]
    start_urls = [
        "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/default.aspx",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['/t/\d+']), callback='parse_thread'),
        Rule(SgmlLinkExtractor(allow=('/t/new\?new_start=\d+',))),
    ]

    def parse_thread(self, response):
        hxs = HtmlXPathSelector(response)
        item = xboxItem()
        item['content'] = hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
        item['date'] = hxs.select("//span[@class='value']/text()").extract()
        return item
Log output:
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Enabled item pipelines:
2013-03-13 11:22:18-0400 [xbox] INFO: Spider opened
2013-03-13 11:22:18-0400 [xbox] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-13 11:22:20-0400 [xbox] DEBUG: Crawled (200) <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…; (referer: None)
2013-03-13 11:22:20-0400 [xbox] DEBUG: Filtered offsite request to 'forums.xbox.com': <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…;
2013-03-13 11:22:20-0400 [xbox] INFO: Closing spider (finished)
2013-03-13 11:22:20-0400 [xbox] INFO: Dumping spider stats
As a first tweak, you need to modify your first rule by putting a "." at the start of the regex, as follows. I also changed the start URL to the actual first page of the forum.
start_urls = [
    "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx",
]
rules = (
    Rule(SgmlLinkExtractor(allow=('./t/\d+')), callback="parse_thread", follow=True),
    Rule(SgmlLinkExtractor(allow=('./310.aspx?PageIndex=\d+')),),
)
I've updated the rules so that the spider now crawls all of the pages in the thread.
EDIT: I've found a typo that may be causing an issue, and I've fixed the date xpath.
item['content']=hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
item['date']=hxs.select("(//div[@class='post-author'])[1]//a[@class='internal-link view-post']/text()").extract()
The content line above says "hxs.selec" when it should be "hxs.select". I changed that and could then see content being scraped. Through trial and error (I'm a bit rubbish with XPaths), I've managed to get the date of the first post (i.e. the date the thread was created), so this should all work now.
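Putting those fixes together, the parse_thread method would look roughly like this, still using the old HtmlXPathSelector API the question is written against:

def parse_thread(self, response):
    hxs = HtmlXPathSelector(response)
    item = xboxItem()
    # "select", not "selec"
    item['content'] = hxs.select("//div[@class='post-content user-defined-markup']/p/text()").extract()
    # date of the first post, i.e. when the thread was created
    item['date'] = hxs.select("(//div[@class='post-author'])[1]//a[@class='internal-link view-post']/text()").extract()
    return item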