Parse HTML5 data-* attributes in Python? - scrapy

Is it possible to access the data-* portion of an HTML element from Python? I'm using Scrapy, and the data-* attributes are not available in a selector object, though the raw data is available in a Request object.
If I dump the HTML using wget -O page http://page.com then I can see the data in the file. It's something like <a href="blah" target="_blank" data-mine="a;slfkjasd;fklajsdfl;ahsdf">blahlink</a>.
I can edit the data-mine portion in an editor, so I know it's there ... it just seems like well-behaved parsers are dropping it.
As you can see, I'm confused.

Yeah, lxml does not expose the attribute names for some reason, and Talvalin is right, html5lib does:
stav#maia:~$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3] on linux2
>>> import html5lib
>>> html = '''<a href="blah" target="_blank" data-mine="a;slfkjasd;fklajsdfl;ahsdf"
... data-yours="truly">blahlink</a>'''
>>> for x in html5lib.parse(html, treebuilder='lxml').xpath('descendant::*/@*'):
...     print '%s = "%s"' % (x.attrname, x)
...
href = "blah"
target = "_blank"
data-mine = "a;slfkjasd;fklajsdfl;ahsdf"
data-yours = "truly"

I did it like this without using a third-party library:
import re

# pull the attribute value straight out of the raw response body
data_email_pattern = re.compile(r'data-email="([^"]+)"')
match = data_email_pattern.search(response.body)
if match:
    print(match.group(1))
...

I've not tried it, but there is html5lib (http://code.google.com/p/html5lib/), which can be used in conjunction with Beautiful Soup instead of Scrapy's built-in selectors.
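For instance, a minimal sketch using the bs4 package with html5lib as its tree builder (both need to be installed; the markup is the example from the question):

from bs4 import BeautifulSoup

html = ('<a href="blah" target="_blank" '
        'data-mine="a;slfkjasd;fklajsdfl;ahsdf" data-yours="truly">blahlink</a>')
soup = BeautifulSoup(html, 'html5lib')  # parse with the html5lib tree builder
link = soup.find('a')
print(link['data-mine'])  # data-* attributes are ordinary dict entries
print(link.attrs)         # all attributes, data-* included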

Related

FormRequest that renders JS content in scrapy shell

I'm trying to scrape content from this page with the following form data:
I need County: set to Prince George's and DateOfFilingFrom set to 01-01-2000, so I do the following:
% scrapy shell
In [1]: from scrapy.http import FormRequest
In [2]: request = FormRequest(url='https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx', formdata={'DateOfFilingFrom': '01-01-2000', 'County:': "Prince George's"})
In [3]: response
In [4]:
But it's not working (response is None). Also, the next page is loaded dynamically, so I need to know how to access each of the links it shows. As far as I know this might be done using Splash; however, I'm not sure how to combine a SplashRequest with a FormRequest and do it all from within scrapy shell for testing purposes. I need to know what I'm doing wrong and how to render the next page (the one that results from the FormRequest shown above).
The request you're sending is missing a couple of fields, which is probably why you don't get a response back. The fields you fill in also don't correspond to the fields the site expects in the request. A good way to deal with this is Scrapy's FormRequest.from_response (see the docs), which can already populate some fields for you based on the information in the form.
For this website the following worked for me (using scrapy shell):
>>> url = "https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx"
>>> fetch(url)
>>> from scrapy import FormRequest
>>> req = FormRequest.from_response(
...     response,
...     formxpath="//form[@id='form1']",  # specify the form on the current page
...     formdata={
...         'cboCountyId': '16',  # the county you select is converted to a number
...         'DateOfFilingFrom': '01-01-2001',
...         'cboPartyType': 'Decedent',
...         'cmdSearch': 'Search'
...     },
...     clickdata={'type': 'submit'},
... )
>>> fetch(req)
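After the second fetch, scrapy shell swaps its response object for the result of the submitted request, so you can inspect the outcome directly; the table selector below is a guess for illustration, not something verified against the site:
>>> response.status   # expect 200 if the post was accepted
>>> response.css('table')   # hypothetical peek at the results markup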

Unable to select element using Scrapy shell

I'm trying to print out all the titles of the products on this website using scrapy shell: 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
Once the shell is open, I fetch the page:
fetch('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas')
Then I try to print out the title of each product, but nothing is selected:
>>> response.css('.shelfProductTile-descriptionLink::text')
output: []
Also tried:
>>> response.css('a')
output: []
How can I do this? Thanks
Your code is correct. What happens is that there are no a elements in the HTML retrieved by Scrapy. When you visit the page with your browser, the product list is populated with JavaScript, on the browser side; the products are not in the HTML itself.
In the Scrapy docs you'll find techniques to pre-render JavaScript. Maybe you should try that; see the sketch below.
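For example, here is a minimal sketch using the scrapy-splash package, assuming a Splash instance is running and the project is configured per the scrapy-splash README (SPLASH_URL, the downloader middlewares, and so on); the spider name is made up for illustration:

import scrapy
from scrapy_splash import SplashRequest

class IcedTeaSpider(scrapy.Spider):
    name = 'iced_teas'

    def start_requests(self):
        url = ('https://www.woolworths.com.au/shop/browse/drinks/'
               'cordials-juices-iced-teas/iced-teas')
        # give the page's JavaScript a couple of seconds to build the product list
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # same selector as in the question, now run against the rendered HTML
        for title in response.css('.shelfProductTile-descriptionLink::text').getall():
            yield {'title': title.strip()}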

Why does this Python 2.7 code produce no output?

This is an example from a Python book. When I run it, I don't get any output. Can someone help me? Thanks!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
text = urlopen('https://python.org/community/jobs').read()
soup = BeautifulSoup(text)
jobs = set()
for header in soup('h3'):
    links = header('a', 'reference')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
    print jobs.add('%s (%s)' % (link.string, link['href']))
print '\n'.join(sorted(jobs, key=lambda s: s.lower()))
Edit:
At first I only considered that the URL might be wrong, but I ignored the possibility that the HTML information I want to get doesn't exist. Maybe that is why I get empty output.
If you open the page and inspect the HTML, you'll notice there are no <h3> tags containing links. This is why you have no output:
the if not links: continue check always continues to the next header.
This is probably because the page has moved to https://www.python.org/jobs/, so the <h3> tags containing links are no longer present on the old page.
If you point this code's URL at the new page, I'd suggest taking some time to familiarize yourself with its source first. For instance, it uses <h2> instead of <h3> tags for its links; see the reworked sketch below.
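Here is a minimal reworking of the question's code under that assumption, keeping Python 2 and BeautifulSoup 3 as in the original; the <h2> structure of the new page is taken from the observation above:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

text = urlopen('https://www.python.org/jobs/').read()
soup = BeautifulSoup(text)
jobs = set()
for header in soup('h2'):       # the new page uses <h2> for its job links
    links = header('a')
    if not links:
        continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
print '\n'.join(sorted(jobs, key=lambda s: s.lower()))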

lxml removes double slash iframe

I'm using lxml to sanitize HTML data, but in some cases lxml also removes valid tags: it removes iframe tags that have a valid host but whose src starts with a double slash (//).
code example:
>>> cleaner = Cleaner(host_whitelist=['www.youtube.com'])
>>> iframe = '<iframe src="//www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
>>> cleaner.clean_html(iframe)
'<div></div>'
but for normal urls (without double slashes) it works fine
>>> cleaner = Cleaner(host_whitelist=['www.youtube.com'])
>>> iframe = '<iframe src="https://www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
>>> cleaner.clean_html(iframe)
'<iframe src="https://www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
What do I have to do to make lxml understand that it's a valid URL?
Thanks.
If you look at the docs for Cleaner (http://lxml.de/3.4/api/lxml.html.clean.Cleaner-class.html), it appears that by default these parameters are set to True:
embedded:
Removes any embedded objects (flash, iframes)
frames:
Removes any frame-related tags
So my first instinct would be to try cleaner = Cleaner(host_whitelist=['www.youtube.com'], embedded=False), as in the sketch below.
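An untested sketch of that instinct. frames=False is added on the same reasoning, since the docs quoted above say the frames option removes frame-related tags, which may cover iframes as well; note also that a scheme-relative // URL has no scheme at all, which is likely why the whitelist check rejects it:

from lxml.html.clean import Cleaner

# keep iframes out of both the embedded and the frame-related removal lists
cleaner = Cleaner(host_whitelist=['www.youtube.com'],
                  embedded=False, frames=False)
iframe = '<iframe src="//www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
print(cleaner.clean_html(iframe))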

Generate PDF from Plone content types

I need to create PDFs from content types (made with Dexterity, if that matters) so that the user creates a new document and, after filling in the form, a PDF is generated and ready to be downloaded. So basically, after creating/modifying the document, a PDF should be created and stored in the ZODB (I'm actually using blobs) so that I can link the view to a "Download as PDF".
I've seen PDFNode but it doesn't seem to be what I'm looking for. There's also Produce & Publish, but it's a web service(?) and the company I'm going to develop this for doesn't want (for privacy reasons) to send data outside their datacenters.
Any ideas?
It seems that you are searching for one of these:
ReportLab (official site) for a custom solution; a minimal sketch follows
collective.sendaspdf for an out-of-the-box solution
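If you go the custom ReportLab route, a minimal sketch (the title, body text, and output path are illustrative, not Plone-specific):

from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

def make_pdf(path, title, body):
    # draw a title and one line of body text onto a single A4 page
    c = canvas.Canvas(path, pagesize=A4)
    c.setFont('Helvetica-Bold', 16)
    c.drawString(72, 800, title)
    c.setFont('Helvetica', 11)
    c.drawString(72, 780, body)
    c.save()

make_pdf('example.pdf', 'My Document', 'Generated from a Plone content type.')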
I actually do this sort of thing a lot on a project of mine. I used Products.SmartPrintNG and fop for it, though, and didn't do it the standard way that the product uses (I think it uses JavaScript to initiate the conversion, which is odd).
A couple of things:
I had to sanitize the output, since fop is pretty touchy
I used lxml
mine uses Archetypes
Anyways, my event handler for creating the PDF ends up looking something like this:
from Products.SmartPrintNG.browser import SmartPrintView
from lxml.cssselect import CSSSelector
from lxml.html import fromstring, tostring
import re

san_re = re.compile('(?P<width>width\=("|\')\d{1,5}(px|%|in|cm|mm|em|ex|pt|pc)?("|\'))')

class Options(object):
    # wraps a plain dict so option lookups work as attribute access
    def __init__(self, __dict):
        self.__dict = __dict

    def __getattr__(self, attr):
        if self.__dict.has_key(attr):
            return self.__dict[attr]
        raise AttributeError(attr)

def sanitize_xml(xml):
    # fop is touchy about width attributes on table markup,
    # so fold them into the style attribute instead
    selector = CSSSelector('table,td,tr')
    elems = selector(xml)
    for el in elems:
        if el.attrib.has_key('width'):
            width = el.attrib['width']
            style = el.attrib.get('style', '').strip()
            if style and not style.endswith(';'):
                style += ';'
            style += 'width:%s;' % width
            del el.attrib['width']
            el.attrib['style'] = style
    return xml

def save_pdf(obj, event):
    smartprint = SmartPrintView(obj, obj.REQUEST)
    # render the object's default view and keep only the main content div
    html = obj.restrictedTraverse('view')()
    xml = fromstring(html)
    selector = CSSSelector('div#content')
    xml = selector(xml)
    html = tostring(sanitize_xml(xml[0]))
    res = smartprint.convert(
        html=html,
        format='pdf2',
        options=Options({'stylesheet': 'pdf_output_stylesheet',
                         'template': 'StandardTemplate'}))
    # store the generated PDF on the object's file field
    field = obj.getField('generatedPDF')
    field.set(obj, res, mimetype='application/pdf', _initializing_=True)
    field.setFilename(obj, obj.getId() + '.pdf')
Produce and Publish Lite is self-contained, open-source code and the successor to SmartPrintNG. http://pypi.python.org/pypi/zopyx.smartprintng.lite/
Use z3c.rml; it works very well for producing a PDF from an RML template, instead of converting from HTML, which can be tricky.
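A minimal sketch of that route; the RML template is illustrative, and rml2pdf.parseString returns a file-like object holding the generated PDF:

from z3c.rml import rml2pdf

rml = '''<?xml version="1.0" encoding="utf-8"?>
<document filename="example.pdf">
  <template>
    <pageTemplate id="main">
      <frame id="body" x1="2cm" y1="2cm" width="17cm" height="25cm"/>
    </pageTemplate>
  </template>
  <story>
    <para>Generated from a Plone content type.</para>
  </story>
</document>'''

pdf_file = rml2pdf.parseString(rml)  # file-like object with the PDF bytes
with open('example.pdf', 'wb') as out:
    out.write(pdf_file.read())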