I'm using a CrawlSpider with LinkExtractor objects to crawl next pages and other links from a homepage. I've got two LinkExtractors: one to crawl next pages and another to crawl event links (cf. spider code below).
My second LinkExtractor works (event links), but the first one doesn't.
I get this warning in my stack trace when I launch my spider:
[scrapy] WARNING: Remote certificate is not valid for hostname "marathons.ahotu.fr"; u'ssl390453.cloudflaressl.com'!=u'marathons.ahotu.fr'
I'm a novice in Python and Scrapy, so my questions are:
What does it mean?
How can I fix it?
Here is my spider code:
import scrapy
import os
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

if os.path.isfile('ListeCAP_Marathons_ahotu.csv'):
    reecritureFichier = open('ListeCAP_Marathons_ahotu.csv', 'w')
    reecritureFichier.truncate()
    reecritureFichier.close()

class MySpider(CrawlSpider):
    name = 'ListeCAP_Marathons_ahotu'
    start_urls = ['https://marathons.ahotu.fr/calendrier']
    rules = (
        # LINKEXTRACTOR N°1 = NEXT PAGES
        Rule(LinkExtractor(allow=('https://marathons.ahotu.fr/calendrier?page=[0-9]{1,100}#list-top',),),),
        # LINKEXTRACTOR N°2 = EVENTS LINKS
        Rule(LinkExtractor(allow=('https://marathons.ahotu.fr/evenement/.+',),), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        selector = Selector(response)
        yield {
            'nom_even': selector.xpath('/html/body/div[2]/div[2]/h1/span[@itemprop="name"]/text()').extract(),
        }
        print('--------------------> NOM DE L\'EVENEMENT :', selector.xpath('//*[@id="jog"]/div[2]/section/article/header/h1/text()').extract())
(I'm using Scrapy 1.4.0 with Twisted-17.9.0)
You can't fix this type of error yourself. The best you can do is send a message to the administrator of the domain and let them know that the certificate has a problem (in this case the certificate was issued for another domain, not marathons.ahotu.fr).
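If you want to confirm the mismatch yourself, here is a minimal standalone Python 3 sketch (not part of the spider; the hostname is taken from the warning above) that retrieves the server's certificate so you can inspect which names it actually covers:

import socket
import ssl

# Verification is disabled only so we can retrieve the mismatched
# certificate for inspection; never do this for real traffic.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

with socket.create_connection(('marathons.ahotu.fr', 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname='marathons.ahotu.fr') as tls:
        der_cert = tls.getpeercert(binary_form=True)

# Print the PEM so you can read the subject/SAN entries,
# e.g. by feeding it to `openssl x509 -noout -text`.
print(ssl.DER_cert_to_PEM_cert(der_cert))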
I've spent more than a day looking at how to deploy a Telegram bot with a webhook instead of polling.
Even the official docs don't work for me: https://github.com/python-telegram-bot/python-telegram-bot/wiki/Webhooks
Can someone explain, or point me to a working tutorial on, how to deploy the most basic bot, like this one:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

from telegram.ext import Updater, CommandHandler, MessageHandler, Filters

def start(update, context):
    """Send a message when the command /start is issued."""
    update.message.reply_text('Hi!')

def help_command(update, context):
    """Send a message when the command /help is issued."""
    update.message.reply_text('Help!')

def echo(update, context):
    """Echo the user message."""
    update.message.reply_text(update.message.text)

def main():
    TOKEN = "telegramToken"
    updater = Updater(TOKEN, use_context=True)

    # Get the dispatcher to register handlers
    dp = updater.dispatcher

    # on different commands - answer in Telegram
    dp.add_handler(CommandHandler("start", start))
    dp.add_handler(CommandHandler("help", help_command))
    dp.add_handler(MessageHandler(Filters.text & ~Filters.command, echo))

    PORT = int(os.environ.get('PORT', '5000'))
    SERVER = 'myipserver/'
    CERT = 'cert.pem'
    updater.bot.setWebhook(SERVER + TOKEN, certificate=open(CERT, 'rb'))
    # updater.bot.setWebhook(SERVER + TOKEN) doesn't work either
    updater.start_webhook(listen="0.0.0.0", port=PORT, url_path=TOKEN)
    # updater.start_polling()

    updater.idle()

if __name__ == '__main__':
    main()
The problem is that the default port on App Engine is 8080, and Telegram webhooks do not support that port (only ports 443, 80, 88, or 8443 are allowed):
https://core.telegram.org/bots/webhooks
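One way this is commonly handled (a sketch, untested; it continues the question's main(), and myapp.appspot.com is a placeholder) is to keep listening on the port App Engine assigns while registering the webhook under the public HTTPS URL, which the App Engine front end serves on 443:

    # Listen on the port App Engine assigns to the process (8080 by default)...
    PORT = int(os.environ.get('PORT', 8080))
    updater.start_webhook(listen="0.0.0.0", port=PORT, url_path=TOKEN)
    # ...but register the public HTTPS URL, which is served on port 443
    # by the App Engine front end, so no port appears in the URL at all.
    updater.bot.setWebhook("https://myapp.appspot.com/" + TOKEN)
    updater.idle()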
I run my Telegram bots on Heroku; normally I start the webhook and then set it:
updater.start_webhook(listen="0.0.0.0",
                      port=int(os.environ.get("PORT", 5000)),
                      url_path='token')
updater.bot.setWebhook("https://myapp.com/token")
updater.idle()
When deploying on Heroku you get an HTTPS URL for your application (https://myapp.herokuapp.com), which you then configure via the BotFather.
Providing the BotFather with an HTTPS URL is a must; I wonder if you skipped that, or maybe there is an issue with the SSL certificate you are using (self-signed?).
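When debugging this, it also helps to ask Telegram what it currently has registered; python-telegram-bot exposes getWebhookInfo, whose last_error_message often pinpoints SSL problems. A small sketch, reusing the updater from the snippet above:

# Ask Telegram which webhook URL is registered and why deliveries fail.
info = updater.bot.get_webhook_info()
print(info.url)
print(info.last_error_message)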
I am facing issues with some URLs while running Scrapy:
ValueError: Missing scheme in request url: mailto:?body=https%3A%2F%2Fiview.abc.net.au%2Fshow%2Finsiders
[scrapy.core.scraper:168|ERROR] Spider error processing <GET https://iview.abc.net.au/show/four-corners/series/2020/video/NC2003H028S00> (referer: None)
Here are my settings:
"base_urls" : [
{
# Start crawling from
"url": "https://www.abc.net.au/",
# Overwrite the default crawler and use th RecursiveCrawler instead
"crawler": "RecursiveCrawler",
This works OK with the following setting:
"base_urls" : [
{
# Start crawling from
"url": "https://www.afr.com/",
# Overwrite the default crawler and use th RecursiveCrawler instead
"crawler": "RecursiveCrawler",
Not sure what I am missing here.
You have different behaviors because of the content being scraped. The problem is that at some point your spider is trying to yield a Request for this URL:
mailto:?body=https%3A%2F%2Fiview.abc.net.au%2Fshow%2Finsiders
The correct URL is probably this:
https://iview.abc.net.au/show/insiders
It's possible that you are scraping the wrong field, or that there was a mistake in the site where this "url" is retrieved.
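If you can't easily fix the extraction itself, one defensive option is to filter out non-HTTP schemes before yielding requests. A minimal sketch (the spider name and link selector are illustrative, not taken from the question's framework):

import scrapy
from urllib.parse import urlparse

class SafeLinksSpider(scrapy.Spider):
    name = 'safe_links'  # hypothetical spider, for illustration only
    start_urls = ['https://www.abc.net.au/']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            url = response.urljoin(href)
            # Skip mailto:, javascript:, tel: and other non-HTTP schemes
            # that Scrapy's Request constructor rejects.
            if urlparse(url).scheme in ('http', 'https'):
                yield scrapy.Request(url, callback=self.parse)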
I can't figure out what I'm doing wrong here; I'm getting the following error:
[scrapy.mail] ERROR: Unable to send mail: To=['reg2@mydomain.com'] Cc=['reg3@mydomain.com'] Subject="test" Attachs=0 - Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
Here is my very basic spider:
import scrapy
from scrapy.mail import MailSender

mailer = MailSender()

class FanaticsSpider(scrapy.Spider):
    name = 'fanatics'
    start_urls = ['https://www.fanaticsoutlet.com/nfl/new-england-patriots/new-england-patriots-majestic-showtime-logo-cool-base-t-shirt-navy/o-9172+t-70152507+p-1483408147+z-8-1114341320',
    ]

    def parse(self, response):
        yield {
            'sale-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="sale-price"]/text()').re('[$]\d+\.\d+'),
        }
        mailer.send(to=["reg2@mydomain.com"], subject="test", body="test", cc=["reg3@mydomain.com"])
In my settings.py I have the following:
MAIL_HOST = 'mail.mydomain.com'
MAIL_FROM = 'pricealerts#mydomain.com'
MAIL_PORT = 465
MAIL_USER = 'pricealerts#mydomain.com'
MAIL_PASS = 'passwordxx'
MAIL_SSL = True
It seems like these server details aren't getting picked up properly. I've tried modifying all the options I could, including trying to populate the settings in the spider, but that gave me another problem:
mailer = MailSender(smtpuser="pricealerts@mydomain.com", mailfrom="pricealerts@mydomain.com", smtphost="mail.mydomain.com", smtppass="password", smtpport=465)
This didn't give me any errors, but the spider seems to hang after [scrapy.core.engine] INFO: Spider closed (finished) and I have to close the Anaconda command prompt. Also, no email gets sent.
I also tried the alternate method found here and didn't get an error, but no email was sent: What did I forget in order to correctly send an email using Scrapy
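For reference, one thing worth checking: a bare MailSender() built at module level does not read settings.py at all; Scrapy's documented route for honouring the MAIL_* settings is the from_settings classmethod. A minimal sketch of that (untested; the mydomain.com addresses are placeholders as above):

import scrapy
from scrapy.mail import MailSender

class FanaticsSpider(scrapy.Spider):
    name = 'fanatics'
    start_urls = ['https://www.fanaticsoutlet.com/']

    def parse(self, response):
        # Build the mailer from the running crawler's settings so the
        # MAIL_* values in settings.py are actually used.
        mailer = MailSender.from_settings(self.crawler.settings)
        mailer.send(to=["reg2@mydomain.com"], subject="test", body="test")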
I had that same error; try using some proxies, it worked for me.
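If you want to try that suggestion, Scrapy's stock HttpProxyMiddleware picks a proxy up from request meta. A minimal sketch (the proxy address is a placeholder):

import scrapy

class ProxiedSpider(scrapy.Spider):
    name = 'proxied'  # hypothetical spider, for illustration only

    def start_requests(self):
        # HttpProxyMiddleware routes the request through the proxy
        # given in the 'proxy' meta key.
        yield scrapy.Request('https://www.fanaticsoutlet.com/',
                             meta={'proxy': 'http://proxy.example.com:8080'})

    def parse(self, response):
        self.logger.info('Fetched %s via proxy', response.url)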
I'm trying to embed BrowserMob Proxy in my Selenium (Chrome) framework for automated UI testing, in order to intercept responses and other networking.
Description:
Selenium WebDriver with BrowserMob Proxy works just fine - HTTP and secured HTTPS URLs are OK. When I try to navigate to an unsecured HTTPS URL I get this Chrome error:
ERR_TUNNEL_CONNECTION_FAILED
Here's my Python code:
class Browser(object):
    display = None
    browser = None

    def __init__(self, implicitly_wait_seconds=10, is_visible=True, display_size=None, browser_name='chrome'):
        if not is_visible:
            self.display = Display(display_size)
        self.server = Server('/home/erez/Downloads/browsermob-proxy-2.1.4/bin/browsermob-proxy')
        self.server.start()
        self.proxy = self.server.create_proxy()
        self.capabilities = DesiredCapabilities.CHROME
        self.proxy.add_to_capabilities(self.capabilities)
        self.proxy.new_har("test", options={'captureHeaders': True, 'captureContent': True})
        self.start_browser(display_size, implicitly_wait_seconds, browser_name)

    def __enter__(self):
        return self

    def __exit__(self, _type, value, trace):
        self.close()

    def start_browser(self, display_size, implicitly_wait_seconds=10, browser_name='chrome'):
        if browser_name == 'chrome':
            chrome_options = Options()
            # chrome_options.add_argument("--disable-extensions")
            chrome_options.add_experimental_option("excludeSwitches", ["ignore-certificate-errors"])
            chrome_options.add_argument("--ssl-version-max")
            chrome_options.add_argument("--start-maximized")
            chrome_options.add_argument('--proxy-server=%s' % self.proxy.proxy)
            chrome_options.add_argument('--ignore-certificate-errors')
            chrome_options.add_argument('--allow-insecure-localhost')
            chrome_options.add_argument('--ignore-urlfetcher-cert-requests')
            self.browser = webdriver.Chrome(os.path.dirname(os.path.realpath(__file__)) + "/chromedriver",
                                            chrome_options=chrome_options, desired_capabilities=self.capabilities)
            self.browser.implicitly_wait(implicitly_wait_seconds)
You can also call create_proxy with trustAllServers as an argument:
self.proxy = self.server.create_proxy(params={'trustAllServers':'true'})
I faced the same problem with SSL proxying using BrowserMob Proxy. For this you have to install in your browser the certificate described at this link.
Go to the bottom of the document and you will see a section called "SSL support". Install ca-certificate-rsa.cer in your browser and there will be no more issues with SSL proxying.
If installing BrowserMob's/the test servers' certificates doesn't do the job, as in my case, here is a way that is not the most elegant, but gets it done:
You can bypass the ERR_TUNNEL_CONNECTION_FAILED error by passing a trustAllServers parameter to the proxy instance, which disables certificate verification for upstream servers. Unfortunately, as far as I know, this functionality has not been exposed in BrowserMob's Python wrapper.
However, you can start the proxy with the parameter via BrowserMob's REST API (see the REST API section at https://github.com/lightbody/browsermob-proxy/blob/master/README.md). In the case of Python, Requests might be the way to go.
Here's a snippet:
import json
import requests
from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Start the proxy via python
server = Server('/path_to/bin/browsermob-proxy')
server.start()
# Start proxy instance with trustAllServers set to true, store returned port number
r = requests.post('http://localhost:8080/proxy', data = {'trustAllServers':'true'})
port = json.loads(r.text)['port']
# Start Chromedriver, pass on the proxy
chrome_options = Options()
chrome_options.add_argument('--proxy-server=localhost:%s' % port)
driver = webdriver.Chrome('/path_to/selenium/chromedriver', chrome_options=chrome_options)
# Get a site with untrusted cert
driver.get('https://site_with_untrusted_cert')
Also, if you need to access the HAR data, you'll need to do that through the REST API as well:
requests.put('http://localhost:8080/proxy/%s/har' % port, data = {'captureContent':'true'})
r = requests.get('http://localhost:8080/proxy/%s/har' % port)
Of course, since you're disabling safety features, this parameter should be used for limited testing purposes only.
Try using this:
self.capabilities['acceptSslCerts'] = True
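For context, a short sketch of where that flag fits (note that newer, W3C-compliant drivers expect acceptInsecureCerts instead of the legacy acceptSslCerts name):

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

capabilities = DesiredCapabilities.CHROME.copy()
capabilities['acceptSslCerts'] = True       # legacy JSON Wire Protocol name
capabilities['acceptInsecureCerts'] = True  # W3C equivalent
driver = webdriver.Chrome(desired_capabilities=capabilities)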
mod_wsgi Exception occurred processing WSGI script '/usr/share/graphite-web/graphite.wsgi'
I copied only apache-graphite.conf to /etc/apache/sites-available; why does it complain about graphite.wsgi?
Content of apache-graphite.conf:
import os, sys
os.environ['DJANGO_SETTINGS_MODULE'] = 'graphite.settings'
import django.core.handlers.wsgi
application = django.core.handlers.wsgi.WSGIHandler()
from graphite.logger import log
log.info("graphite.wsgi - pid %d - reloading search index" % os.getpid())
import graphite.metrics.search
graphite.wsgi is the WSGI application called by your Apache web server to answer incoming requests.
The apache-graphite.conf site defines a WSGI application running Django which will process requests using Graphite code; such a vhost typically contains a WSGIScriptAlias directive pointing at /usr/share/graphite-web/graphite.wsgi, which is why Apache references that file even though you only copied the conf. I guess it looks more like this: https://github.com/graphite-project/graphite-web/blob/0.9.x/examples/example-graphite-vhost.conf
graphite.wsgi usually looks like this: https://github.com/graphite-project/graphite-web/blob/0.9.x/conf/graphite.wsgi.example