I created a spider that scrapes data from Yelp using Scrapy. All requests go through the Crawlera proxy. The spider takes the URL to scrape, sends a request, and scrapes the data. This worked fine up until the other day, when I started getting a 502 None response. The 502 None response appears after the execution of this line:
r = self.req_session.get(url, proxies=self.proxies, verify='../secret/crawlera-ca.crt').text
The traceback:
2020-11-04 14:27:55 [urllib3.connectionpool] DEBUG: https://www.yelp.com:443 "GET /biz/a-dog-in-motion-arcadia HTTP/1.1" 502 None
So it seems that the spider cannot reach the URL because the connection is closed.
I have checked the meaning of 502 in the Scrapy and Crawlera documentation; it refers to the connection being refused, closed, the domain being unavailable, and similar things.
I have debugged the code around where the issue is happening, and everything is up to date. If anyone has ideas or knowledge about this, I would love to hear it, as I am stuck. What could actually be the issue here?
NOTE: Yelp URLs work normally when I open them in a browser.
The website can tell from the headers of your request that you are a "scraper" and not a human user. You should send different headers with the request so that the scraped website thinks you are browsing with a regular browser. For more info, refer to the Scrapy documentation.
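For instance, with the requests session from the question, you could pass browser-like headers. A minimal sketch; the header values are illustrative and no guarantee against Yelp's bot detection:

# Sketch: browser-like headers for the requests session from the question.
# The exact User-Agent string is an illustrative assumption.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/86.0.4240.111 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}

r = self.req_session.get(
    url,
    headers=headers,
    proxies=self.proxies,
    verify='../secret/crawlera-ca.crt',
).text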
Some pages are not available in some countries, which is why using proxies is recommended. I tried to access the URL and the connection was successful.
2020-11-05 02:50:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2020-11-05 02:50:40 [scrapy.core.engine] INFO: Spider opened
2020-11-05 02:50:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/biz/a-dog-in-motion-arcadia> (referer: None)
Related
I am using the following library with Scrapy in order to make requests from different IPs through rotating proxies.
This presumably stopped working, and my IP is used instead. So I am wondering if there is a fallback or if I accidentally changed the config.
My settings look like this:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_LIST = '/Users/user/test_crawl/proxy_list.txt'
PROXY_MODE = 0
The proxy list:
http://147.30.82.195:8080
http://168.183.187.238:8080
The traceback:
[scrapy.proxies] DEBUG: Proxy user pass not found
2018-12-27 14:23:20 [scrapy.proxies] DEBUG: Using proxy <http://168.183.187.238:8080>, 2 proxies left
2018-12-27 14:23:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/file.htm> (referer: https://www.example.com)
The "Proxy user pass not found" DEBUG output should be fine, as it indicates that I am not using a user/password to authenticate.
The logfile on the example.com server shows my IP directly instead of the proxy IP.
This used to work, so I am wondering how to get it back working.
scrapy-proxies expects proxies to have a password. If the password is empty, it ignores the proxy.
It should probably fail the way it does when there are no proxies, but instead it does nothing, which results in no proxy being configured and your IP being used instead.
I would say you should report the issue upstream, but the project seems dead. So, unless you are willing to fork the project and fix the issue yourself, you are out of luck.
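If forking is not appealing, another option is to replace scrapy_proxies with a tiny custom middleware that sets the proxy yourself and does not care about credentials. A minimal sketch, reusing your PROXY_LIST setting; the module path myproject.middlewares is an assumption:

import random

class SimpleRandomProxy(object):
    """Pick a random proxy per request; also accepts proxies without
    credentials (hypothetical replacement for scrapy_proxies.RandomProxy)."""

    def __init__(self, proxy_list_path):
        with open(proxy_list_path) as f:
            self.proxies = [line.strip() for line in f if line.strip()]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('PROXY_LIST'))

    def process_request(self, request, spider):
        # HttpProxyMiddleware (priority 110) honours request.meta['proxy'],
        # so the request goes out through the chosen proxy.
        request.meta['proxy'] = random.choice(self.proxies)

You would then swap 'scrapy_proxies.RandomProxy': 100 for 'myproject.middlewares.SimpleRandomProxy': 100 in DOWNLOADER_MIDDLEWARES.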
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying http://img14.360buyimg.com/n1/s800x800_jfs/t21448/27/2565333063/465767/c06c0af6/5b5c83e6Nb83e3a19.pn > (failed 1 times): User timeout caused connection failure: Getting http://img14.360buyimg.com/n1/s800x800_jfs/t21448/27/2565333063/465767/c06c0af6/5b5c83e6Nb83e3a19.png took longer than 180.0 seconds..
Like this. I have several pipelines, and I want to request a different image URL when a request fails with a timeout.
I saw Scrapy's RetryMiddleware, but it seems to apply to all requests. I want to retry only the requests from my image pipelines.
Look at the documentation for the RetryMiddleware.
The RETRY_ENABLED setting is True by default; you can change it to False to disable the middleware.
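One way to get retries only for pipeline downloads with the stock middleware is to leave it enabled and opt your ordinary page requests out via the dont_retry request meta key, which RetryMiddleware honours. A minimal sketch; the spider name and URL are placeholders:

import scrapy

class MySpider(scrapy.Spider):
    name = 'images_example'  # placeholder
    start_urls = ['https://example.com']  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            # RetryMiddleware skips requests carrying dont_retry=True,
            # so only the media requests issued by the images pipeline
            # keep the default retry behaviour.
            yield scrapy.Request(url, meta={'dont_retry': True})

    def parse(self, response):
        pass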
I am trying to scrape a website using Scrapy, but the network in the office is unstable. If we lose the network connection for even a few seconds, Scrapy gets stuck and stops downloading. We can see that the last log is:
2018-08-27 11:50:05 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): *.*.org
2018-08-27 11:50:07 [urllib3.connectionpool] DEBUG: https://**.**.org:443 "GET /01313_**0.jpg HTTP/1.1" 200 135790
I have tried changing the timeout settings, but nothing happened.
Thank you!
You can try setting the RETRY_TIMES setting (in settings.py):
RETRY_TIMES = 5
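In context, a minimal settings.py sketch; the values are illustrative, and all three are standard Scrapy settings:

# settings.py
RETRY_ENABLED = True     # the retry middleware is on by default
RETRY_TIMES = 5          # retry each failed download up to 5 times
DOWNLOAD_TIMEOUT = 60    # fail a hung request after 60 s instead of the 180 s default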
Here is my code; can someone please help? For some reason the spider runs but does not actually crawl the forum threads. I am trying to extract all the text in the forum threads for the specific forum in my start URL.
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from xbox.items import xboxItem
from scrapy.item import Item
from scrapy.conf import settings

class xboxSpider(CrawlSpider):
    name = "xbox"
    allowed_domains = ["forums.xbox.com"]
    start_urls = [
        "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/default.aspx",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['/t/\d+']), callback='parse_thread'),
        Rule(SgmlLinkExtractor(allow=('/t/new\?new_start=\d+',))),
    ]

    def parse_thread(self, response):
        hxs = HtmlXPathSelector(response)
        item = xboxItem()
        item['content'] = hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
        item['date'] = hxs.select("//span[@class='value']/text()").extract()
        return item
Log output:
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Enabled item pipelines:
2013-03-13 11:22:18-0400 [xbox] INFO: Spider opened
2013-03-13 11:22:18-0400 [xbox] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-13 11:22:20-0400 [xbox] DEBUG: Crawled (200) <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…; (referer: None)
2013-03-13 11:22:20-0400 [xbox] DEBUG: Filtered offsite request to 'forums.xbox.com': <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…;
2013-03-13 11:22:20-0400 [xbox] INFO: Closing spider (finished)
2013-03-13 11:22:20-0400 [xbox] INFO: Dumping spider stats
As a first tweak, you need to modify your first rule by putting a "." at the start of the regex, as follows. I also changed the start URL to the actual first page of the forum.
start_urls = [
    "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx",
]
rules = (
    Rule(SgmlLinkExtractor(allow=('./t/\d+')), callback="parse_thread", follow=True),
    Rule(SgmlLinkExtractor(allow=('./310.aspx?PageIndex=\d+'))),
)
I've updated the rules so that the spider now crawls all of the pages in the thread.
EDIT: I've found a typo that may be causing an issue, and I've fixed the date XPath.
item['content'] = hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
item['date'] = hxs.select("(//div[@class='post-author'])[1]//a[@class='internal-link view-post']/text()").extract()
The first line above says "hxs.selec" when it should be "hxs.select". I changed that and could then see content being scraped. Through trial and error (I'm a bit rubbish with XPaths), I've managed to get the date of the first post (i.e. the date the thread was created), so this should all work now.
I am using Red5 version 1.0.0 (final release) with Java 6 on Windows XP SP3, installed from the installer version downloaded from https://code.google.com/p/red5/. I have a project in which I am running live webcam chats between users. I am using the RTMPT protocol (RTMP tunneled over HTTP) for that, so I have set up my Red5 server behind the Apache web server.

The problem is that everything goes well for 45-50 seconds, and then the RTMPT connection suddenly gets closed. I am not using a dedicated RTMPT server, i.e. I have not uncommented the rtmpt bean in the conf files. Instead, I have added servlet mappings (for idle, fcs, open, etc.) in the web.xml of my application. RTMPT is listening on port 5080.

I have tested this with previous versions of Red5 as well, but the problem is the same: the RTMPT connection closes after some time (within a minute). I went through the logs but found nothing about this, and there was no connection closure due to the inactivity period. Does it have something to do with Apache? I am not sure whether the server is closing the connection (though I can't find any logs about a closed connection) or the client closes it. I tried 0.9.0 and 0.9.1 too, but to no avail. I have heard that there were issues using RTMPT with Red5 on Mac, but I am on Windows. Any pointers on this problem? Any help is appreciated.

Here are the error logs that I get on my Apache web server:
[error] (OS 10048)Only one usage of each socket address (protocol/network address/port) is normally permitted. : proxy: HTTP: attempt to connect to red5serverip:5080 (*) failed.
The same log is repeated four times.
Here are some access logs from Apache too -
"POST /send/IDTK7NOG2PXGB/803 HTTP/1.1" 200 1
"POST /send/IDTK7NOG2PXGB/804 HTTP/1.1" 503 323
"POST /send/YXF4WTFMN8TCM/1391 HTTP/1.1" 200 8285
"POST /send/YXF4WTFMN8TCM/1392 HTTP/1.1" 200 1
"POST /send/YXF4WTFMN8TCM/1393 HTTP/1.1" 200 54
"POST /send/YXF4WTFMN8TCM/1394 HTTP/1.1" 200 1
"POST /send/YXF4WTFMN8TCM/1395 HTTP/1.1" 503 323
"POST /close/IDTK7NOG2PXGB/805 HTTP/1.1" 503 323
"POST /close/YXF4WTFMN8TCM/1396 HTTP/1.1" 503 323
Thanks!
You are probably running out of TCP ports. A TCP connection remains in the TIME_WAIT state for 4 minutes by default, even after it is closed. When your RTMPT stream uses 5 connections each second, your system will need at least 5 * 60 * 4 = 1200 ports for each connected user.
Often the firewall limits the number of ports available. You can also decrease the keep-alive time of a TCP socket. If you search for your Apache error message, you will find enough information to sort this out.
Your Red5 server may have crashed. This happens when it runs out of RAM, in which case you need to start Red5 again manually. If that solves your problem, you need to upgrade your RAM; I use about 8 GB of RAM after running into this problem several times. Because Red5 is written in Java, it is memory-hungry. FFmpeg works well with little memory, but I don't know exactly how to provide chat using FFmpeg.
The 503 means that the service did not respond. If you are forwarding to Red5 via Apache, then this means there is a problem there. I would suggest not using the standalone RTMPT bean; instead, use only the servlet, and remove Apache from the mix to debug the issue.