Scrapy SplashRequest and broken PNGs - scrapy

I'm trying to use Scrapy-Splash to take a screenshot of a website using the 'render.png' endpoint (in practice I do this in my spider after certain exceptions occur, and I want to view how the site looks for them).
The problem I'm having is that the response does not appear to be a valid PNG. A minimal example in the scrapy shell is:
from scrapy_splash import SplashRequest
url='http://www.waitrose.com'
args={'wait': 2, 'width': 320, 'timeout': 60, 'render_all': 1}
endpoint='render.png'
# I also tried with dont_send_headers=True, dont_process_response=True
sr=SplashRequest(url=url, args=args, endpoint=endpoint)
fetch(sr)
You will need a local splash server running to execute this of course (see here)
The response headers are
{'Content-Type': 'image/png',
'Date': 'Mon, 10 Apr 2017 21:23:48 GMT',
'Server': 'TwistedWeb/16.1.1'}
but the body starts like
In [16]: response.body[:100]
Out[16]: '<html><head></head><body>\xe2\x80\xb0PNG\n\x1a\n\nIHDR\x01#\x04\xc2\xad\x08\x065r\xe2\x80\x9aQ\tpHYs\x0fa\x0fa\x01\xc2\xa8?\xc2\xa7i IDATx\x01\xc3\xac\xc2\xbd\x07\xc5\x93\\\xc3\x97u\xc3\xa6y\xc2\xaa\xc2\xbab\xc3\xa7\xc5\x93\xc3\x91'
and even after trimming the html tags and saving to file, my system says non-valid PNG.
On the other hand if I use the python-requests module like
import requests
base_url = "http://localhost:8050/render.png"
params = {'url': 'http://www.waitrosecellar.com',
          'wait': 2,
          'width': 320,
          'timeout': 60,
          'render_all': 1}
response2 = requests.get(base_url, params)
I have no issues. The response content starts like
In [19]: response2.content[:100]
Out[19]: '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01#\x00\x00\x03)\x08\x06\x00\x00\x00u\xf4\xea\x11\x00\x00\x00\tpHYs\x00\x00\x0fa\x00\x00\x0fa\x01\xa8?\xa7i\x00\x00 \x00IDATx\x01\xec\xbd\x07\x9c]\xc7u\xdf\x7f\xb6\x17\xec\xa2\xf7\xba(\x04A\x80`\x17\x8bH\x90\x14\x9bHY\xdd\x92l\xc9\x92\xab\\\x92'
the headers are
In [20]: response2.headers
Out[20]: {'Transfer-Encoding': 'chunked', 'Date': 'Mon, 10 Apr 2017 21:39:17 GMT', 'Content-Type': 'image/png', 'Server': 'TwistedWeb/16.1.1'}
and saving the file produces a valid PNG image, which I can view on my system.
What is going on with SplashRequest that is messing up the PNG?
I found exactly the same issue using the screenshot pipeline from the Scrapy docs too.
EDIT: Interestingly, if I set breakpoints in the middleware process_response, the response.body is at that stage a valid PNG.

It turned out that a BeautifulSoup HTML-parser middleware in my chain was the culprit: its 'process_response' method was corrupting the PNG bytes.
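The corrupted prefix in the Scrapy output is consistent with the binary body having been decoded as text and re-encoded: the first PNG byte, \x89, becomes \xe2\x80\xb0 when read as Windows-1252 and written back as UTF-8. The sketch below reproduces that effect with the standard library only and shows a guard that an HTML-parsing middleware could apply before touching a response. The cp1252 codec and the helper names are assumptions for illustration; the actual middleware code is not shown in the question.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(body: bytes) -> bool:
    """Return True if the body starts with the 8-byte PNG signature."""
    return body.startswith(PNG_SIGNATURE)


def is_html_content_type(content_type: str) -> bool:
    """Hypothetical guard: only HTML responses should reach an HTML parser."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


# Simulate what a text round trip does to binary data.
# cp1252 is an assumed codec, chosen because it reproduces the observed bytes.
mangled = PNG_SIGNATURE.decode("cp1252").encode("utf-8")

print(looks_like_png(PNG_SIGNATURE))
print(looks_like_png(mangled))
print(is_html_content_type("image/png"))
```

A middleware that checks the Content-Type header in this way and returns the response unchanged for anything that is not HTML would leave image bodies intact.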

Related

Unable to access pdf document via requests or selenium

I have a huge list of URLs and each one loads a different PDF document. This is one of them:
https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0
It will most likely open the website home page in the first try, but if you paste the link again it will open a pdf document.
I'm trying to write a Python script to download those documents locally and extract their content using tika, but this behavior, where the home page opens the first time, is throwing a wrench in anything I try.
1. I tried requests, but as expected it just returns the HTML content of the home page
import requests
from tika import parser
link = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"
resp = requests.get(link)
with open('metadata.pdf', 'wb') as f:
    f.write(resp.content)
raw = parser.from_file('metadata.pdf', xmlContent=False)
print(raw['content'])
output:
\n\n\n\n\n\n\n\n\n\n \n \t\t\n\n\t\tSkip to Main Content\xa0\xa0\xa0\xa0Logout\xa0\xa0\xa0\xa0My
Account\xa0\xa0\xa0\xa0\t\t\tHelp\n\n\n\n\n\n\n\t\t\t\nSelect a location\nPinellas County\n\n\xa0\nAll Case
Records Search\nCivil, Family Case Records\nCriminal & Traffic Case Records\nProbate Case Records\nCourt
Calendar\n\nAttorney Login\nRegistered User Login\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\n\n\t\t\
t\xa0\t\n\t\n\t\tClerk of the Circuit Court|Mortgage Foreclosure Sales|Pinellas County Government|Pinellas
County Sheriff's Office|Public Defender|Sixth Judicial Circuit|State of Florida|State Attorney|Self Help
Center|Court Forms|How-To Videos|Florida Courts eFiling Portal Video|Attorney Account Setup|Reports and
Statistics|Terms of Use|Contact UsCopyright 2003 Tyler Technologies. All rights Reserved.\n\t\n\n\n\n
\n
2. I tried to open the home page using Selenium and transfer the cookies from the webdriver to requests, following this answer.
url = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"
driver.get(url)
cookies = driver.get_cookies()
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])
resp = s.get(url)
It did not work, and when I checked the CookieJar of the response object it came out empty.
I have to admit I have so little understanding of how cookies work, but it was just a desperate attempt. What am I misunderstanding here? I appreciate any input.
3. My last resort (for obvious reasons) was to open each document via webdriver and download the content, but even this did not work.
# opens a new window and makes it the working window
def open_window(driver, link):
    driver.execute_script(f"window.open('{link}')")
    new_window = driver.window_handles[-1]
    driver.switch_to.window(new_window)
url = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"
driver.get(url)
open_window(driver, url)
#print source of new window
print(driver.page_source)
The output is just this:
<html><head></head><body></body></html>
After a little more tinkering, approach #2 worked. Instead of taking the cookies from the driver right after loading the main page, I had the browser run another query first (with a few extra steps specific to this website) and then used the cookies. They look like this:
[{'domain': 'ccmspa.pinellascounty.org',
'expiry': 1670679832, #this is the time the cookie expires in epoch time
'httpOnly': True,
'name': '.ASPXFORMSPUBLICACCESS',
'path': '/',
'secure': True,
'value': '1DBB1EADBA199D246E84CCE7243202DCA6BBD7E383FE360ECBFC2E6150102C79F3EC2F6B232B85589C51976AF20EF7EBDF52CF74122A7A6E78B4C6F31434C58AB57E10005C41DE019814B704F12B150A0818585E85F0237EFCF1A11B205414325CA1850605FF932BC43CC5B36395488F40D58DA594899C4D62FF3ECCBE729C6BC001194225B6653CB89C1305C7FBCB26E1BCFCFF75476784D24ADFCA0AFF679A3BAA3131'},
{'domain': 'ccmspa.pinellascounty.org',
'httpOnly': True,
'name': 'ASP.NET_SessionId',
'path': '/',
'secure': True,
'value': '24552pqtb1tomjbw2gkzko55'},
{'domain': 'ccmspa.pinellascounty.org',
'httpOnly': False,
'name': 'EDLFDCVM',
'path': '/',
'sameSite': 'None',
'secure': True,
'value': '02282de498-9595-48s0hGpl59SkUKRZpRrS_b1TKJfXlz_3dGN9xGZ2tcTXrHuDsR5rN90I_Rp192pX48C1k'}]
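A rough sketch of how cookies in the shape shown above could be turned into a Cookie header, and how a downloaded body can be checked before it is saved as a PDF. It uses only the standard library; the helper names and the sample cookie values are made up for illustration, and the HTTP request itself is left out.

```python
def cookie_header(cookies):
    """Build a Cookie header value from Selenium-style cookie dicts."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def looks_like_pdf(content: bytes) -> bool:
    """PDF files start with the marker %PDF-; an HTML page does not."""
    return content.startswith(b"%PDF-")


# Placeholder values, not real session data.
sample_cookies = [
    {"name": "ASP.NET_SessionId", "value": "placeholder-session"},
    {"name": ".ASPXFORMSPUBLICACCESS", "value": "placeholder-token"},
]

headers = {"Cookie": cookie_header(sample_cookies)}
print(headers["Cookie"])
print(looks_like_pdf(b"%PDF-1.3\n..."))
print(looks_like_pdf(b"<html><head></head><body></body></html>"))
```

The headers dict can be passed to whichever HTTP client is in use. Checking the first bytes before writing avoids saving the home page's HTML under a .pdf name.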

How to create quick links with branch io api in python?

I tried following "Creating Quick Links using the Branch HTTP API?", but I think there has been an update since then and type 2 no longer exists.
How do you create quick links through branch io api? Here's my current code
import requests
import json
def branch(medium, source, campaign, test, link):
    url = "https://api2.branch.io/v1/url"
    headers = {'Content-Type': 'application/json'}
    data = json.dumps({
        "branch_key": '<branch key>',
        "channel": f"{source}",
        "feature": f"{medium}",
        "campaign": f"{campaign}",
        "data": {
            "$og_title": f"{name}",  # note: `name` is not defined in this snippet
            "$marketing_title": "test",
            "~creation_source": 1,
            "$og_description": f"{test}",
            "~feature": f"{medium}",
            "+url": f"https://lilnk.link/{test}",
            "$ios_deeplink_path": f"{link}",
            "$android_deeplink_path": f"{link}",
            "~marketing": 'true',
            "$one_time_use": 'false',
            "~campaign": "testing",
            "~channel": f"{source}"
        }
    })
    resp = requests.post(url, headers=headers, data=data)
    print(resp.status_code)
This actually gets a 200 code, however I did not find it in the quick links.
I checked the network traffic, where the URL is supposed to be https://dashboard.branch.io/v1/link/marketing,
and tried sending the payload there. However, that returns a 403 error.
How can I create a quicklink?

How to read an image sent in body of req in falcon

I'm sending a JPG image in the body of a POST request, using Postman to do so.
Reading it with
image_text_similarity.py:
import json
class ImageTextSimilarity():
    def on_post(self, req, resp):
        image_raw = json.loads(req.stream.read())
which errors out with
Traceback (most recent call last):
File "/home/dario/.local/lib/python3.6/site-packages/gunicorn/workers/sync.py", line 134, in handle
self.handle_request(listener, req, client, addr)
File "/home/dario/.local/lib/python3.6/site-packages/gunicorn/workers/sync.py", line 175, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "falcon/api.py", line 274, in falcon.api.API.__call__
File "falcon/api.py", line 269, in falcon.api.API.__call__
File "/home/dario/ImageTextSimilarityApp/image_text_similarity.py", line 95, in on_post
image_raw = json.loads(req.stream.read())
File "/usr/lib/python3.6/json/__init__.py", line 349, in loads
s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
How do we read the image from the body of the POST request?
Rest of the code is
image_similarity_app.py:
import falcon
from image_text_similarity import ImageTextSimilarity
api = application = falcon.API()
api.req_options.auto_parse_form_urlencoded = True
image_text_similarity_object = ImageTextSimilarity()
api.add_route('/image_text_similarity', image_text_similarity_object)
And starting the service with gunicorn image_similarity_app
I'm not an expert at Postman, but it appears that by choosing binary, you are sending your JPEG image data as the request body: Postman Chrome: What is the difference between form-data, x-www-form-urlencoded and raw
In Falcon, you can simply read the request payload as
jpeg_data = req.stream.read()
(Note that on some app servers such as the stdlib's wsgiref.simple_server, you may need to use the safe Request.bounded_stream wrapper.)
See also Falcon's WSGI and ASGI tutorials for inspiration; they use a closely related topic (building an image service) to illustrate the basic concepts of the framework. You'll find examples of how to handle RESTful image resources: upload, convert, store, list, serve, cache, etc.
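To illustrate the point without a running Falcon app, the sketch below uses io.BytesIO as a stand-in for req.stream. It reads the body as raw bytes instead of passing it to json.loads, and adds a hypothetical helper (read_jpeg_body is not part of Falcon) that checks for the JPEG start-of-image marker 0xFF 0xD8, which is the same 0xff byte the traceback complains about.

```python
import io
import json

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker


def read_jpeg_body(stream) -> bytes:
    """Read the whole request body and check that it looks like a JPEG."""
    data = stream.read()
    if not data.startswith(JPEG_SOI):
        raise ValueError("request body does not look like a JPEG")
    return data


# Stand-in for req.stream: placeholder bytes that begin like a JPEG file.
fake_stream = io.BytesIO(JPEG_SOI + b"\xff\xe0" + b"\x00" * 16)
jpeg_data = read_jpeg_body(fake_stream)
print(len(jpeg_data))

# Parsing the same bytes as JSON fails, as in the question.
try:
    json.loads(jpeg_data)
except ValueError as exc:
    print("json.loads cannot parse image bytes:", exc)
```

Inside on_post, the same helper could be called with req.stream (or req.bounded_stream, as noted above).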

Error to render Inline PDF with CGI in Python 3.7

When the script is executed directly as a Python CGI script, the PDF is fetched fine. But if I import the script into another module, the PDF cannot be fetched.
import cgi
form = cgi.FieldStorage()
import os,io,html,sys
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
data = 'Create a new PDF with Reportlab Swamy® RedteK 4104 DE* ≤ 0.4 ≤ 1.5 *'
packet = io.BytesIO()
can = canvas.Canvas(packet, pagesize=letter)
strdata = data.encode('utf-8','xmlcharrefreplace')
cat = str(html.unescape(strdata.decode()))
can.drawString(10, 500, cat)
can.showPage()
can.save()
packet.seek(0)
print('Content-type: application/pdf')
print('Content-Disposition: inline; filename="out.pdf"')
print('\n\n')
sys.stdout.flush()
sys.__stdout__.buffer.write(packet.getvalue())
The PDF is fetched when the script is run directly.
An error occurs when it is imported into another module.
Error in the Apache error logs: malformed header from script. Bad header=%PDF-1.3:
Thank you.
A guess from a non-python guy here:
print('Content-type: application/pdf')
print('Content-Disposition: inline; filename="out.pdf"')
print('\n\n')
looks suspicious: A linebreak is required after each HTTP-header:
print('Content-type: application/pdf')
print('\n')
print('Content-Disposition: inline; filename="out.pdf"')
print('\n\n')
As you say you see "malformed header from script" in the logs: Monitor them. I assume that the version without an extra linebreak just has the Content-Type header with a really weird content type:
Content-type: application/pdfContent-Disposition: inline; filename="out.pdf"
while the browser (and server) could (and should) make use of
Content-type: application/pdf
Content-Disposition: inline; filename="out.pdf"
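One caveat on the guess above: Python's print already ends each line with a newline, so the two header lines in the question are not run together. The log entry "Bad header=%PDF-1.3" instead suggests that the PDF bytes reached Apache before the headers, which may happen when the headers go through a buffered text stream and the body through a different binary stream. A minimal sketch, assuming that is the cause, is to assemble headers and body as one bytes object and write it to a single binary stream. build_cgi_response is a hypothetical helper and the PDF bytes are a placeholder.

```python
import io


def build_cgi_response(pdf_bytes: bytes, filename: str = "out.pdf") -> bytes:
    """Return CGI headers, one blank line, then the PDF body, as bytes."""
    headers = [
        "Content-Type: application/pdf",
        f'Content-Disposition: inline; filename="{filename}"',
    ]
    head = "\r\n".join(headers) + "\r\n\r\n"
    return head.encode("ascii") + pdf_bytes


# Placeholder body; in the question this would be packet.getvalue().
fake_pdf = b"%PDF-1.3\n%%EOF\n"
response = build_cgi_response(fake_pdf)

# In a real CGI script the target would be sys.stdout.buffer;
# a BytesIO is used here so the sketch runs anywhere.
out = io.BytesIO()
out.write(response)
print(out.getvalue().split(b"\r\n\r\n", 1)[0].decode("ascii"))
```

Writing everything through one stream keeps the order of headers and body fixed regardless of how the importing module has set up sys.stdout.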

Dynamically created JPG image using AWS Lambda service

I am trying to create a dynamically created graph as a JPG file that I could use in Alexa Skill standard cards as part of response. The following code creates a JPG image when I run it locally on my computer, when using browser with URL "http://localhost:5000/image.jpg".
from flask import send_file
from flask import Flask
from PIL import Image, ImageDraw
from io import BytesIO
app = Flask(__name__)
app.config['DEBUG'] = True
def serve_pil_image(pil_img):
    img_io = BytesIO()
    pil_img.save(img_io, 'JPEG', quality=70)
    img_io.seek(0)
    return send_file(img_io, mimetype='image/jpeg')
@app.route('/image.jpg')
def serve_img():
    size = (128, 128)
    background = (128, 128, 55)
    xy = [(0, 0), (10, 10), (20, 20), (30, 12), (50, 50), (70, 9), (90, 70)]
    img = Image.new('RGB', size, background)
    draw = ImageDraw.Draw(img)
    draw.line(xy, fill=128, width=5)
    return serve_pil_image(img)
if __name__ == '__main__':
    app.run(debug=True)
However, when I deploy the same code to AWS Lambda service using Zappa I am getting the following error message (from CloudWatch logs):
An error occurred during JSON serialization of response: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Traceback (most recent call last):
File "/usr/lib64/python2.7/json/__init__.py", line 250, in dumps
sort_keys=sort_keys, **kw).encode(obj)
File "/usr/lib64/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib64/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Is there some configuration option to fix this problem? I haven't found any so far.
Binary Support is finally here! You should look at it and try again.
If you want to serve binary data (in this case Base64 images) through API Gateway, you need to set the following:
In the Method Response of your method:
    Set Content-Type as image/jpeg in the HTTP 200 Status Response Header.
In the Integration Response of your method:
    Set Content-Type as 'image/jpeg' in Header Mappings. Mind the quotes!
With the AWS CLI, set the contentHandling attribute to CONVERT_TO_BINARY on your Integration Response.
Check the entire process in this great step-by-step guide: https://stackoverflow.com/a/41434295/720665
(The example is for a Base64-encoded PNG image, but the gist of it is the same.)
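The JSON serialization error in the question arises because raw JPEG bytes are not valid text. A small sketch of the underlying idea, using only the standard library, is to Base64-encode the bytes so they can travel inside a JSON response. The response shape below (statusCode, headers, body, isBase64Encoded) follows the Lambda proxy integration format and is an assumption here; whether Zappa or the API Gateway settings described above produce it in a given deployment is not confirmed by the source.

```python
import base64
import json


def binary_response(body: bytes, content_type: str = "image/jpeg") -> dict:
    """Wrap binary data in a JSON-serializable, Base64-encoded response."""
    return {
        "statusCode": 200,
        "headers": {"Content-Type": content_type},
        "body": base64.b64encode(body).decode("ascii"),
        "isBase64Encoded": True,
    }


# Placeholder bytes starting with the JPEG marker 0xFF 0xD8.
jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 8
response = binary_response(jpeg_bytes)

# Serialization succeeds because the body is now plain ASCII text.
serialized = json.dumps(response)
restored = base64.b64decode(json.loads(serialized)["body"])
print(restored == jpeg_bytes)
```

The gateway side then has to convert the Base64 text back to binary, which is what the CONVERT_TO_BINARY setting mentioned above is for.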