I am facing issues with some URLs while running Scrapy:
ValueError: Missing scheme in request url: mailto:?body=https%3A%2F%2Fiview.abc.net.au%2Fshow%2Finsiders
[scrapy.core.scraper:168|ERROR] Spider error processing <GET https://iview.abc.net.au/show/four-corners/series/2020/video/NC2003H028S00> (referer: None)
Here are my settings:
"base_urls" : [
{
# Start crawling from
"url": "https://www.abc.net.au/",
# Overwrite the default crawler and use th RecursiveCrawler instead
"crawler": "RecursiveCrawler",
This works OK with the following setting:
"base_urls" : [
{
# Start crawling from
"url": "https://www.afr.com/",
# Overwrite the default crawler and use th RecursiveCrawler instead
"crawler": "RecursiveCrawler",
Not sure what I am missing here.
You have different behaviors because of the content being scraped. The problem is that at some point your spider is trying to yield a Request for this URL:
mailto:?body=https%3A%2F%2Fiview.abc.net.au%2Fshow%2Finsiders
The correct URL is probably this:
https://iview.abc.net.au/show/insiders
It's possible that you are scraping the wrong field, or that there was a mistake in the site where this "url" is retrieved.
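If the crawler follows every link it finds, one way to avoid this is to skip non-HTTP schemes before yielding the Request. A minimal sketch, assuming a generic link-following spider (the spider name, selector and callback below are placeholders, not taken from the actual RecursiveCrawler):

import scrapy

class RecursiveSpider(scrapy.Spider):
    name = "recursive"
    start_urls = ["https://www.abc.net.au/"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            # Skip mailto:, tel:, javascript: and similar non-HTTP links,
            # which otherwise raise "Missing scheme in request url".
            if not url.startswith(("http://", "https://")):
                continue
            yield scrapy.Request(url, callback=self.parse)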
Related
I created a spider that scrapes data from Yelp using Scrapy. All requests go through the Crawlera proxy. The spider gets the URL to scrape, sends a request, and scrapes the data. This worked fine up until the other day, when I started getting a 502 None response. The 502 None response appears
after execution of this line:
r = self.req_session.get(url, proxies=self.proxies, verify='../secret/crawlera-ca.crt').text
The traceback:
2020-11-04 14:27:55 [urllib3.connectionpool] DEBUG: https://www.yelp.com:443 "GET /biz/a-dog-in-motion-arcadia HTTP/1.1" 502 None
So, it seems that the spider cannot reach the URL because the connection is closed.
I have checked the meaning of 502 in the Scrapy and Crawlera documentation, and it refers to the connection being refused or closed, the domain being unavailable, and similar things.
I have debugged the code related to where the issue is happening, and everything is up to date.
If anyone has ideas or knowledge about this, would love to hear, as I am stuck. What could actually be the issue here?
NOTE: Yelp URLs work normally when I open them in browser.
The website can tell from the headers of your request that you are a "scraper" and not a human user.
You should send a different header with the request, so that the scraped website thinks you are browsing with a regular browser.
For more info, refer to the Scrapy documentation.
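For example, with the requests session the question's code already uses, a browser-like User-Agent can be set once on the session. A sketch, where the header values are just an example of an ordinary desktop browser, not something Yelp specifically requires:

import requests

session = requests.Session()
# The default "python-requests/x.y" User-Agent is easy to block, so send
# headers that look like a regular desktop browser instead.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/86.0.4240.111 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})

r = session.get("https://www.yelp.com/biz/a-dog-in-motion-arcadia")
print(r.status_code)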
Some pages are not available in some countries, which is why using proxies is recommended. I tried to access the URL and the connection was successful.
2020-11-05 02:50:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2020-11-05 02:50:40 [scrapy.core.engine] INFO: Spider opened
2020-11-05 02:50:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/biz/a-dog-in-motion-arcadia> (referer: None)
I can't figure out what I'm doing wrong here, I'm getting the following error:
[scrapy.mail] ERROR: Unable to send mail: To=['reg2@mydomain.com']
Cc=['reg3@mydomain.com'] Subject="test" Attachs=0 - Connection was refused
by other side: 10061: No connection could be made because the target
machine actively refused it..
Here is my very basic spider:
import scrapy
from scrapy.mail import MailSender

mailer = MailSender()

class FanaticsSpider(scrapy.Spider):
    name = 'fanatics'
    start_urls = ['https://www.fanaticsoutlet.com/nfl/new-england-patriots/new-england-patriots-majestic-showtime-logo-cool-base-t-shirt-navy/o-9172+t-70152507+p-1483408147+z-8-1114341320',
    ]

    def parse(self, response):
        yield {
            'sale-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="sale-price"]/text()').re(r'[$]\d+\.\d+'),
        }
        mailer.send(to=["reg2@mydomain.com"], subject="test", body="test", cc=["reg3@mydomain.com"])
In my settings.py I have the following:
MAIL_HOST = 'mail.mydomain.com'
MAIL_FROM = 'pricealerts#mydomain.com'
MAIL_PORT = 465
MAIL_USER = 'pricealerts#mydomain.com'
MAIL_PASS = 'passwordxx'
MAIL_SSL = True
It seems like these server details aren't getting pulled properly? I've tried modifying all the options I could, including trying to populate the settings in the spider, but that gave me another problem.
mailer = MailSender(smtpuser="pricealerts@mydomain.com", mailfrom="pricealerts@mydomain.com", smtphost="mail.mydomain.com", smtppass="password", smtpport=465)
This didn't give me any errors but the spider seems to hang after [scrapy.core.engine] INFO: Spider closed (finished) and I have to close the anaconda command prompt. Also, no email gets sent.
I also tried the alternate method found here, and didn't get an error, but no email was sent: What did I forget in order to correctly send an email using Scrapy
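One detail worth noting: MailSender() constructed with no arguments does not read settings.py; it falls back to its defaults (localhost, port 25), which is consistent with the "connection refused" error above. A minimal sketch, reusing the addresses from the question, of building the mailer from the crawler settings when the spider finishes:

import scrapy
from scrapy.mail import MailSender

class FanaticsSpider(scrapy.Spider):
    name = 'fanatics'

    def closed(self, reason):
        # from_settings() picks up MAIL_HOST, MAIL_PORT, MAIL_USER,
        # MAIL_PASS, MAIL_SSL and MAIL_FROM from the project settings.
        mailer = MailSender.from_settings(self.crawler.settings)
        # Returning the Deferred gives Scrapy a chance to finish sending
        # the mail before it shuts down.
        return mailer.send(to=["reg2@mydomain.com"], subject="test",
                           body="test", cc=["reg3@mydomain.com"])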
I had that same error; try using some proxies, it worked for me.
I am using the scrapy_proxies library with Scrapy in order to make requests from different IPs through rotating proxies.
This presumably stopped working and my own IP is used instead. So I am wondering if there is a fallback, or if I accidentally changed the config.
My settings look like this:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_LIST = '/Users/user/test_crawl/proxy_list.txt'
PROXY_MODE = 0
The proxy list:
http://147.30.82.195:8080
http://168.183.187.238:8080
The traceback:
[scrapy.proxies] DEBUG: Proxy user pass not found
2018-12-27 14:23:20 [scrapy.proxies] DEBUG: Using proxy
<http://168.183.187.238:8080>, 2 proxies left
2018-12-27 14:23:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/file.htm> (referer: https://www.example.com)
The DEBUG output "Proxy user pass not found" should be OK, as it just indicates that I am not using user/pass to authenticate.
The logfile on the example.com server shows my IP directly instead of the proxy IP.
This used to work, so I am wondering how to get it back working.
scrapy-proxies expects proxies to have a password. If the password is empty, it ignores the proxy.
It should probably fail, as it does when there are no proxies at all, but instead it does nothing, which results in no proxy being configured and your own IP being used instead.
I would say you should report the issue upstream, but the project seems dead. So, unless you are willing to fork the project and fix the issue yourself, you are out of luck.
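If forking is not appealing, a small custom downloader middleware can do the same job without the password requirement. A rough sketch (the class name is made up; it reads the same PROXY_LIST file and relies on the built-in HttpProxyMiddleware honouring request.meta['proxy']):

import random

class RandomProxyWithoutAuth:
    """Pick a random proxy per request from a plain-text proxy list."""

    def __init__(self, proxy_list_path):
        with open(proxy_list_path) as f:
            self.proxies = [line.strip() for line in f if line.strip()]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('PROXY_LIST'))

    def process_request(self, request, spider):
        # HttpProxyMiddleware (priority 110 in the settings above) reads
        # request.meta['proxy'] and routes the request through that proxy.
        request.meta['proxy'] = random.choice(self.proxies)

It would be registered in DOWNLOADER_MIDDLEWARES in place of 'scrapy_proxies.RandomProxy', at the same priority.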
I am attempting to do acceptance testing on a website using Codeception and BrowserStack. The website I am testing requires a query string appended to the url in order to sign-in.
For example: https://examplesite.com/?realm=ab-cd
I have attempted to use this URL in the acceptance.suite.yml file:
class_name: AcceptanceTester
modules:
    enabled:
        - WebDriver:
            url: http://examplesite.com/?realm=ab-cd
            host: 'hostmaster@examplesite.com:mykey@hub.browserstack.com'
            port: 80
            browser: firefox
            capabilities:
                javascriptEnabled: true
I have also attempted to place a sendGET in the actual test:
$I->sendGET('/?realm=ab-cd');
Both attempts result in not being able to sign in. What would the correct way to do this be?
So I found that in the acceptance.suite.yml file, the URL that you provide cannot have a query string appended to it. Following Naktibalda's suggestion, I tried a few variations of:
$I->amOnPage()
I found that when appending a query string I had to start it with the ? (leaving off the preceding /). For example:
$I->amOnPage('?realm=bu-pd'); //Works
$I->amOnPage('/?realm=bu-pd'); //Doesn't work
I reached this question and found a solution myself, so I am giving it here:
$I->amOnPage(['/path','query_param1' => 'bu-pd']);
sendGET belongs to the REST module; use amOnPage in a WebDriver test.
I'm trying to geocode a lot of data. I have a lot of machines across which to spread the load (so that I won't go over the 2,500 requests per IP address per day). I am using a script to make the requests with either wget or cURL. However, both wget and cURL yield the same "request denied" message. That being said, when I make the request from my browser, it works perfectly. An example request is:
wget http://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=true
And the resulting output is:
[1] 93930
05:00 PM ~: --2011-12-19 17:00:25-- http://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA
Resolving maps.googleapis.com... 72.14.204.95
Connecting to maps.googleapis.com|72.14.204.95|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: `json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA'
[ <=> ] 54 --.-K/s in 0s
2011-12-19 17:00:25 (1.32 MB/s) - `json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA' saved [54]
The file it wrote to only contains:
{
"results" : [],
"status" : "REQUEST_DENIED"
}
Any help is much appreciated.
The '&' character that separates the address and sensor parameters isn't getting passed along to the wget command; instead it is telling your shell to run wget in the background. The resulting query is missing the required 'sensor' parameter, which should be set to true or false based on your input. Quoting the URL fixes it:
wget "http://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false"