HTTPS with Jython 2.7 + trusting all certificates does not work. Result: httplib.BadStatusLine

UPDATE: The problem is related to a bug in Jython 2.7b1. See the bug report: http://bugs.jython.org/issue2021. The Jython developers are working on a fix!
After changing from Jython 2.5.3 to Jython 2.7beta1, I am no longer able to read the content of web pages over SSL while "trusting all certificates". The response from the HTTPS page is always an empty string, resulting in a httplib.BadStatusLine exception from httplib.py in Jython.
I need to be able to read from a web page which requires authentication, and I do not want to set up any certificate store since I must have portability. Therefore my solution is to use the excellent implementation provided at http://tech.pedersen-live.com/2010/10/trusting-all-certificates-in-jython/
Example code is detailed below. Twitter might not be the best example, since it does not require certificate trusting; but the result is the same with or without the decorator.
#! /usr/bin/python
import sys
from javax.net.ssl import TrustManager, X509TrustManager
from jarray import array
from javax.net.ssl import SSLContext

class TrustAllX509TrustManager(X509TrustManager):
    # Define a custom TrustManager which will blindly
    # accept all certificates
    def checkClientTrusted(self, chain, auth):
        pass

    def checkServerTrusted(self, chain, auth):
        pass

    def getAcceptedIssuers(self):
        return None

# Create a static reference to an SSLContext which will use
# our custom TrustManager
trust_managers = array([TrustAllX509TrustManager()], TrustManager)
TRUST_ALL_CONTEXT = SSLContext.getInstance("SSL")
TRUST_ALL_CONTEXT.init(None, trust_managers, None)
# Keep a static reference to the JVM's default SSLContext for restoring
# at a later time
DEFAULT_CONTEXT = SSLContext.getDefault()

def trust_all_certificates(f):
    # Decorator function that will make it so the context of the decorated
    # method will run with our TrustManager that accepts all certificates
    def wrapped(*args, **kwargs):
        # Only do this if running under Jython
        if 'java' in sys.platform:
            from javax.net.ssl import SSLContext
            SSLContext.setDefault(TRUST_ALL_CONTEXT)
            print "SSLContext set to TRUST_ALL"
            try:
                res = f(*args, **kwargs)
                return res
            finally:
                SSLContext.setDefault(DEFAULT_CONTEXT)
        else:
            return f(*args, **kwargs)
    return wrapped

@trust_all_certificates
def read_page(host):
    import httplib
    print "Host: " + host
    conn = httplib.HTTPSConnection(host)
    conn.set_debuglevel(1)
    conn.request('GET', '/example')
    response = conn.getresponse()
    print response.read()

read_page("twitter.com")
This results in:
Host: twitter.com
send: 'GET /example HTTP/1.1\r\nHost: twitter.com\r\nAccept-Encoding: identity\r\n\r\n'
reply: ''
Traceback (most recent call last):
File "jytest.py", line 62, in <module>
read_page("twitter.com")
File "jytest.py", line 59, in read_page
response = conn.getresponse()
File "/Users/erikiveroth/Workspace/Procera/sandbox/jython/jython2.7.jar/Lib/httplib.py", line 1030, in getresponse
File "/Users/erikiveroth/Workspace/Procera/sandbox/jython/jython2.7.jar/Lib/httplib.py", line 407, in begin
File "/Users/erikiveroth/Workspace/Procera/sandbox/jython/jython2.7.jar/Lib/httplib.py", line 371, in _read_status
httplib.BadStatusLine: ''
Changing back to Jython 2.5.3 gives me parseable output from Twitter.
Have any of you seen this before? I cannot find any bug tickets on the Jython project page about this, nor can I understand what change could result in this behaviour (other than maybe #1309, but I do not understand whether it is related to my problem).
Cheers

Related

Python configparser KeyError raised when used globally

When I try to pass the API endpoint values in the post API file, a KeyError is unfortunately raised. In the baseapi.ini file, I wrote:
[API]
endpoint = value
Post API file:
import requests
from APIs.payLoad import addBookPayload
from Utilities.configration import config
from Utilities.resources import *
url = config()['API']['endpoint']+ApiResources.addBook
header = {"Content-Type": "application/json"}
response = requests.post(url, json=addBookPayload("pl74"), headers=header,)
print(response.json())
response_json = response.json()
book_ID = response_json['ID']
Error:
Traceback (most recent call last):
File "C:\Users\Muhammad Azmul Haq\PycharmProjects\BackEndProject\APIs\PostAPI.py", line 8, in <module>
url = config()['API']['endpoint']+ApiResources.addBook
File "C:\Users\Muhammad Azmul Haq\AppData\Local\Programs\Python\Python39\lib\configparser.py", line 960, in __getitem__
raise KeyError(key)
KeyError: 'API'
Does anyone have an idea what I did wrong? Kind regards.
You are not initializing your global variable in config() before accessing it. Either assign the value in the current file,
or
put all the configuration in a separate configuration file and import that configuration file.
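A likely cause (this is an assumption, since the config() helper in Utilities.configration is not shown) is that ConfigParser.read() is given a path relative to the working directory and silently returns an empty list when the file is missing, so the 'API' section never exists. A minimal sketch of a config() that resolves baseapi.ini relative to its own file and fails loudly:
import os
from configparser import ConfigParser

def config():
    parser = ConfigParser()
    # Resolve baseapi.ini relative to this file, not the working directory
    # (the ../baseapi.ini location is a hypothetical project layout)
    ini_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                            '..', 'baseapi.ini')
    if not parser.read(ini_path):
        # read() returns the list of files it parsed; empty means not found
        raise FileNotFoundError('baseapi.ini not found at ' + ini_path)
    return parser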

How to read an image sent in body of req in falcon

Sending a JPEG image in the body of a POST request, using Postman to do so:
Reading it with
image_text_similarity.py:
import json

class ImageTextSimilarity():
    def on_post(self, req, resp):
        image_raw = json.loads(req.stream.read())
which errors out with
Traceback (most recent call last):
File "/home/dario/.local/lib/python3.6/site-packages/gunicorn/workers/sync.py", line 134, in handle
self.handle_request(listener, req, client, addr)
File "/home/dario/.local/lib/python3.6/site-packages/gunicorn/workers/sync.py", line 175, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "falcon/api.py", line 274, in falcon.api.API.__call__
File "falcon/api.py", line 269, in falcon.api.API.__call__
File "/home/dario/ImageTextSimilarityApp/image_text_similarity.py", line 95, in on_post
image_raw = json.loads(req.stream.read())
File "/usr/lib/python3.6/json/__init__.py", line 349, in loads
s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
How do we read the image from the body of the POST request?
Rest of the code is
image_similarity_app.py:
import falcon
from image_text_similarity import ImageTextSimilarity
api = application = falcon.API()
api.req_options.auto_parse_form_urlencoded = True
image_text_similarity_object = ImageTextSimilarity()
api.add_route('/image_text_similarity', image_text_similarity_object)
And starting the service with gunicorn image_similarity_app
I'm not an expert at Postman, but it appears that by choosing binary, you are sending your JPEG image data as the request body: Postman Chrome: What is the difference between form-data, x-www-form-urlencoded and raw
In Falcon, you can simply read the request payload as
jpeg_data = req.stream.read()
(Note that on some app servers such as the stdlib's wsgiref.simple_server, you may need to use the safe Request.bounded_stream wrapper.)
See also Falcon's WSGI and ASGI tutorials for inspiration; they use a closely related topic (building an image service) to illustrate the basic concepts of the framework. You'll find examples of how to handle RESTful image resources: upload, convert, store, list, serve, cache, etc.
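Putting it together, a minimal sketch of a responder that stores the raw JPEG bytes (the /tmp output directory and filename scheme are assumptions for illustration, not part of the original app):
import uuid

class ImageTextSimilarity(object):
    def on_post(self, req, resp):
        # Read the raw request body; bounded_stream is the safe wrapper
        # mentioned above
        jpeg_data = req.bounded_stream.read()
        filename = '/tmp/{}.jpg'.format(uuid.uuid4())
        with open(filename, 'wb') as f:
            f.write(jpeg_data)
        resp.media = {'saved_as': filename}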

"Missing 1 required positional argument: 'resp'" when invoking Falcon resource responder that has a 'self' argument

I am developing a WSGI application on Windows. I use peewee (which is supposedly unrelated) and:
falcon==2.0.0
waitress==1.4.3
I have the following code in my resources.py:
from models import Board

class BoardResource:
    def on_get_collection(self, req, resp):
        resp.media = Board.select()

    def on_get(self, req, resp):
        code = req.get_param('code')
        resp.media = Board.get_by_id(code)
I have the following code in my app.py:
import falcon
import models
from resources import BoardResource

def init():
    models.init()
    api = falcon.API()
    api.add_route('/boards', BoardResource, suffix='collection')
    api.add_route('/board', BoardResource)
    return api

api = init()
I start the app with this command: waitress-serve app:api. When I request /boards from the API, I get this error:
ERROR:waitress:Exception while serving /boards
Traceback (most recent call last):
File "c:\users\pepsiman\.virtualenvs\hsech-api\lib\site-packages\waitress\channel.py", line 349, in service
task.service()
File "c:\users\pepsiman\.virtualenvs\hsech-api\lib\site-packages\waitress\task.py", line 169, in service
self.execute()
File "c:\users\pepsiman\.virtualenvs\hsech-api\lib\site-packages\waitress\task.py", line 439, in execute
app_iter = self.channel.server.application(environ, start_response)
File "c:\users\pepsiman\.virtualenvs\hsech-api\lib\site-packages\falcon\api.py", line 269, in __call__
responder(req, resp, **params)
TypeError: on_get_collection() missing 1 required positional argument: 'resp'
I decided to remove the self argument from the definition of on_get_collection and the error was gone. I know that self must be there, but I have no idea why it doesn't work with it. Any ideas how to fix this?
I have found the problem myself: when calling api.add_route, the responder must be an instance of the class, not the class itself. Thus the following lines:
api.add_route('/boards', BoardResource, suffix='collection')
api.add_route('/board', BoardResource)
need to be modified like this:
api.add_route('/boards', BoardResource(), suffix='collection')
api.add_route('/board', BoardResource())
Of course it works without removing the self argument from the definitions.
I hope this silly mistake of mine will help someone fix theirs.
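For the curious, a tiny standalone snippet (class and argument names are illustrative) that reproduces the mechanics: when the class itself is registered, Falcon ends up calling the plain function on_get, so the first positional argument fills self and the last parameter is left missing.
class Foo:
    def on_get(self, req, resp):
        pass

try:
    # Class attribute is a plain function: 'req' fills self,
    # 'resp' fills req, and resp is left unfilled
    Foo.on_get('req', 'resp')
except TypeError as e:
    print(e)  # on_get() missing 1 required positional argument: 'resp'

# Bound method on an instance: self is supplied automatically, so it works
Foo().on_get('req', 'resp')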

Why does calling a scrapy spider from pywikibot give a ReactorNotRestartable error?

I am able to call a scrapy spider from another Python script using either CrawlerRunner or CrawlerProcess. But when I try to call the same spider-calling class from a pywikibot robot, I get a ReactorNotRestartable error. Why is this, and how can I fix it?
Here is the error:
File ".\scripts\userscripts\ReplicationWiki\RWLoad.py", line 161, in format_new_page
aea = AEAMetadata(url=DOI_url)
File ".\scripts\userscripts\ReplicationWiki\GetAEAMetadata.py", line 39, in __init__
reactor.run() # the script will block here until all crawling jobs are finished
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1282, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
ReactorBase.startRunning(self)
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
CRITICAL: Exiting due to uncaught exception <class 'twisted.internet.error.ReactorNotRestartable'>
Here is the script which calls my scrapy spider. It runs fine if I just call the class from main.
from twisted.internet import reactor, defer
from scrapy import signals
from scrapy.crawler import Crawler, CrawlerProcess, CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider

class AEAMetadata:
    """
    Helper to run ScrapeAEA spider and return JEL codes and data links
    for a given AEA article link.
    """
    def __init__(self, *args, **kwargs):
        """Initializer"""
        url = kwargs.get('url')
        if not url:
            raise ValueError('No article url given')
        self.items = []

        def collect_items(item, response, spider):
            self.items.append(item)

        settings = get_project_settings()
        crawler = Crawler(ScrapeaeaSpider, settings)
        crawler.signals.connect(collect_items, signals.item_scraped)
        runner = CrawlerRunner(settings)
        d = runner.crawl(crawler, url=url)
        d.addBoth(lambda _: reactor.stop())
        reactor.run()  # the script will block here until all crawling jobs are finished
        #process = CrawlerProcess(settings)
        #process.crawl(crawler, url=url)
        #process.start()  # the script will block here until the crawling is finished

    def get_jelcodes(self):
        jelcodes = self.items[0]['jelcodes']
        return jelcodes

def main():
    aea = AEAMetadata(url='https://doi.org/10.1257/app.20180286')
    jelcodes = aea.get_jelcodes()
    print(jelcodes)

if __name__ == '__main__':
    main()
Update: a simple test that instantiates the AEAMetadata class twice.
Here is the calling code in my pywikibot bot which fails:
from GetAEAMetadata import AEAMetadata

def main(*args):
    for _ in [1, 2]:
        print('Top')
        url = 'https://doi.org/10.1257/app.20170442'
        aea = AEAMetadata(url=url)
        print('After AEAMetadata')
        jelcodes = aea.get_jelcodes()
        print(jelcodes)

if __name__ == '__main__':
    main()
My call to AEAMetadata was embedded in a larger script, which fooled me into thinking the AEAMetadata class was only instantiated once before the failure.
In fact, AEAMetadata was called twice.
I also thought that the script would block after reactor.run(), because the comment in all the scrapy examples stated that was the case.
However, the second deferred callback is reactor.stop(), which unblocks reactor.run().
A more basic incorrect assumption was that the reactor is deleted and recreated on each iteration. In fact, the reactor is instantiated and initialized when it is first imported, and it is a global object which lives as long as the underlying process and was not designed to be restarted. The extremes actually needed to delete and restart a reactor are described here:
http://www.blog.pythonlibrary.org/2016/09/14/restarting-a-twisted-reactor/
So, I guess I've answered my own question.
And, I'm rewriting my script so it doesn't try to use the reactor in a way it was never intended to be used.
And, thanks Gallaecio for getting me thinking in the right direction.
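As an illustration of that rewrite, here is a hedged sketch (not the original code; only the spider import path is taken from above) of one way to avoid restarting the reactor: run each crawl in a fresh child process, so every call gets its own process-lifetime reactor.
import multiprocessing

def _crawl(url, queue):
    # Runs in a child process, where the Twisted reactor has never started
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider

    items = []

    def collect_items(item, response, spider):
        items.append(item)

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(ScrapeaeaSpider)
    crawler.signals.connect(collect_items, signals.item_scraped)
    process.crawl(crawler, url=url)
    process.start()  # blocks until the crawl finishes
    queue.put(items)

def fetch_items(url):
    # Each call pays for a new process but never trips ReactorNotRestartable
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=_crawl, args=(url, queue))
    worker.start()
    items = queue.get()
    worker.join()
    return items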

How to disable or change the path of ghostdriver.log?

The question is straightforward, but some context may help.
I'm trying to deploy scrapy while using selenium and phantomjs as a downloader. The problem is that it keeps saying "permission denied" when trying to deploy, so I want to change the path of ghostdriver.log or just disable it. Looking at phantomjs -h and the GhostDriver GitHub page, I couldn't find the answer, and my friend Google let me down also.
$ scrapy deploy
Building egg of crawler-1370960743
'build/scripts-2.7' does not exist -- can't clean it
zip_safe flag not set; analyzing archive contents...
tests.fake_responses.__init__: module references __file__
Deploying crawler-1370960743 to http://localhost:6800/addversion.json
Server response (200):
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/scrapyd/webservice.py", line 18, in render
return JsonResource.render(self, txrequest)
File "/usr/lib/pymodules/python2.7/scrapy/utils/txweb.py", line 10, in render
r = resource.Resource.render(self, txrequest)
File "/usr/lib/python2.7/dist-packages/twisted/web/resource.py", line 216, in render
return m(request)
File "/usr/lib/pymodules/python2.7/scrapyd/webservice.py", line 66, in render_POST
spiders = get_spider_list(project)
File "/usr/lib/pymodules/python2.7/scrapyd/utils.py", line 65, in get_spider_list
raise RuntimeError(msg.splitlines()[-1])
RuntimeError: IOError: [Errno 13] Permission denied: 'ghostdriver.log
When using the PhantomJS driver, add the following parameter:
driver = webdriver.PhantomJS(service_log_path='/var/log/phantomjs/ghostdriver.log')
The related code is below; it would be nice to have an option to turn off logging entirely, though it seems that's not supported:
selenium/webdriver/phantomjs/service.py
class Service(object):
    """
    Object that manages the starting and stopping of PhantomJS / Ghostdriver
    """
    def __init__(self, executable_path, port=0, service_args=None, log_path=None):
        """
        Creates a new instance of the Service

        :Args:
         - executable_path : Path to PhantomJS binary
         - port : Port the service is running on
         - service_args : A List of other command line options to pass to PhantomJS
         - log_path : Path for PhantomJS service to log to
        """
        self.port = port
        self.path = executable_path
        self.service_args = service_args
        if self.port == 0:
            self.port = utils.free_port()
        if self.service_args is None:
            self.service_args = []
        self.service_args.insert(0, self.path)
        self.service_args.append("--webdriver=%d" % self.port)
        if not log_path:
            log_path = "ghostdriver.log"
        self._log = open(log_path, 'w')
# Reduce logging level
driver = webdriver.PhantomJS(service_args=["--webdriver-loglevel=SEVERE"])

# Remove logging
import os
driver = webdriver.PhantomJS(service_log_path=os.path.devnull)
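In the asker's Scrapy context, a hedged sketch of wiring this into a downloader middleware (the middleware class and the /tmp log path are illustrative, not part of the original answer):
from scrapy.http import HtmlResponse
from selenium import webdriver

class PhantomJSMiddleware(object):
    def __init__(self):
        # Point the ghostdriver log at a location the deploy user can write to
        self.driver = webdriver.PhantomJS(
            service_log_path='/tmp/ghostdriver.log')

    def process_request(self, request, spider):
        # Render the page with PhantomJS and hand Scrapy the result
        self.driver.get(request.url)
        body = self.driver.page_source.encode('utf-8')
        return HtmlResponse(request.url, body=body, encoding='utf-8',
                            request=request)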