lxml Cleaner removes iframe whose src starts with a double slash (//)

I'm using lxml to sanitize HTML data, but in some cases lxml also removes valid tags. It removes iframe tags whose host is whitelisted but whose src starts with a double slash (//).
Code example:
>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(host_whitelist=['www.youtube.com'])
>>> iframe = '<iframe src="//www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
>>> cleaner.clean_html(iframe)
'<div></div>'
but for normal URLs (without the leading double slash) it works fine:
>>> cleaner = Cleaner(host_whitelist=['www.youtube.com'])
>>> iframe = '<iframe src="https://www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
>>> cleaner.clean_html(iframe)
'<iframe src="https://www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
What do I have to do to make lxml understand that this is a valid URL?
Thanks.

If you look at the docs for Cleaner (http://lxml.de/3.4/api/lxml.html.clean.Cleaner-class.html), it appears that by default these parameters are set to True:
embedded: Removes any embedded objects (flash, iframes)
frames: Removes any frame-related tags
So my first instinct would be to try cleaner = Cleaner(host_whitelist=['www.youtube.com'], embedded=False), as sketched below.
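A minimal, untested sketch of that suggestion (frames=False is an extra assumption on my part, since the frames option also targets frame-related tags and may be what strips the iframe in your lxml version):

>>> from lxml.html.clean import Cleaner
>>> # embedded=False keeps embedded objects; frames=False is a guess, in case
>>> # the frames option is the one removing the iframe here
>>> cleaner = Cleaner(host_whitelist=['www.youtube.com'], embedded=False, frames=False)
>>> cleaner.clean_html('<iframe src="//www.youtube.com/embed/S2S5I5GHkDQ"></iframe>')

Note that relaxing embedded/frames may also let iframes from non-whitelisted hosts through, so it's worth re-checking the output for a host that is not on your whitelist.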

Related

FormRequest that renders JS content in scrapy shell

I'm trying to scrape content from this page with the following form data:
I need the County: field set to Prince George's and DateOfFilingFrom set to 01-01-2000, so I do the following:
% scrapy shell
In [1]: from scrapy.http import FormRequest
In [2]: request = FormRequest(url='https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx', formdata={'DateOfFilingFrom': '01-01-2000', 'County:': "Prince George's"})
In [3]: response
In [4]:
But it's not working (response is None). On top of that, the next page (the search results) is loaded dynamically, and I need to be able to access each of the result links it shows. As far as I know this might be done using Splash, but I'm not sure how to combine a SplashRequest with a FormRequest and do it all from within scrapy shell for testing purposes. I need to know what I'm doing wrong and how to render the page that results from the FormRequest shown above.
The request you're sending is missing a couple of fields, which is probably why you don't get a response back. The fields you fill in also don't correspond to the fields the site expects in the request. A good way to deal with this is scrapy's FormRequest.from_response (doc), which can already populate some fields for you based on the information in the form.
For this website the following worked for me (using scrapy shell):
>>> url = "https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx"
>>> fetch(url)
>>> from scrapy import FormRequest
>>> req = FormRequest.from_response(
...     response,
...     formxpath="//form[@id='form1']",  # specify the form on the current page
...     formdata={
...         'cboCountyId': '16',  # the county you select is converted to a number
...         'DateOfFilingFrom': '01-01-2001',
...         'cboPartyType': 'Decedent',
...         'cmdSearch': 'Search'
...     },
...     clickdata={'type': 'submit'},
... )
>>> fetch(req)

Unable to select element using Scrapy shell

I'm trying to print out all the titles of the products of this website using scrapy shell: 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
Once the shell is open, I fetch the page:
fetch('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas')
Then I try to print out the title of each product, but nothing is selected:
>>> response.css('.shelfProductTile-descriptionLink::text')
output: []
Also tried:
>>> response.css('a')
output: []
What can I do? Thanks.
Your code is correct. What happens is that there are no a elements in the HTML retrieved by scrapy. When you visit the page with your browser, the product list is populated by JavaScript, on the browser side; the products are not in the HTML source.
In the docs you'll find techniques to pre-render JavaScript. Maybe you should try that; a rough sketch follows.
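If you want to try the pre-rendering route, here is a minimal scrapy-splash sketch. It is an assumption on my part, not part of the answer above: it requires the scrapy-splash package, a Splash instance running on localhost:8050, and the middleware settings from the scrapy-splash README.

# Hypothetical spider that lets Splash render the JavaScript before parsing.
import scrapy
from scrapy_splash import SplashRequest

class IcedTeaSpider(scrapy.Spider):
    name = 'iced_tea'

    def start_requests(self):
        url = ('https://www.woolworths.com.au/shop/browse/drinks/'
               'cordials-juices-iced-teas/iced-teas')
        # wait a couple of seconds so the product tiles have time to render
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # same selector as in the question, now run against the rendered HTML
        for title in response.css('.shelfProductTile-descriptionLink::text').getall():
            yield {'title': title.strip()}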

BeautifulSoup findAll() not finding all, regardless of which parser I use

So I've read through all the questions I can find about findAll() not working, and the answer always seems to be an issue with the particular HTML parser. I have run the following code using the default 'html.parser' along with 'lxml' and 'html5lib', yet I only find one instance when I should be finding 14.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://robertsspaceindustries.com/pledge/ships'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features = "lxml")
containers = page_soup.findAll("ul", {"class":"ships-listing"})
len(containers)
I tried posting a picture of the HTML code, but I don't have enough reputation. Here is a link to the image (https://imgur.com/a/cOLO8GO).
When you download a page through urllib (or the requests HTTP library), you get the original HTML source file.
Initially there is only a single tag with the class name 'ships-listing', because that tag comes with the source page. Once you scroll down, the page generates additional <ul class='ships-listing'> elements, and those are generated by JavaScript.
So when you download the page using urllib, the downloaded content only contains the original source page (you can see it with the view-source option in your browser). One way around that is sketched below.
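That sketch (not part of the answer above, just a common workaround) lets a real browser render the page first with Selenium; it assumes you have selenium and a matching ChromeDriver installed:

# Sketch: render the page in a real browser so the JS-generated
# <ul class="ships-listing"> elements exist before BeautifulSoup parses them.
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://robertsspaceindustries.com/pledge/ships')

# scroll a few times so the lazily generated listings are added to the DOM
for _ in range(10):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)

page_soup = BeautifulSoup(driver.page_source, 'lxml')
containers = page_soup.findAll('ul', {'class': 'ships-listing'})
print(len(containers))  # should be closer to the 14 you see in the browser

driver.quit()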

Why does this Python 2.7 code produce no output?

This is an example from a Python book. When I run it I don't get any output. Can someone help me? Thanks!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
text = urlopen('https://python.org/community/jobs').read()
soup = BeautifulSoup(text)
jobs = set()
for header in soup('h3'):
    links = header('a', 'reference')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
    print jobs.add('%s (%s)' % (link.string, link['href']))
print '\n'.join(sorted(jobs, key=lambda s: s.lower()))
Edit:
At first I only considered that the URL might be wrong, and ignored the possibility that the HTML information I want to get does not exist. Maybe this is why I get empty output.
If you open the page and inspect the HTML, you'll notice there are no <h3> tags containing links. This is why you have no output:
if not links: continue always continues.
This is probably because the page has moved to https://www.python.org/jobs/, so the <h3> tags containing links are no longer present on the old page.
If you point this code's URL at the new page, I'd suggest taking some time to familiarize yourself with the page source. For instance, it uses <h2> instead of <h3> tags for its links; a rough adaptation is sketched below.
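For illustration only: this keeps the book example's structure, swaps <h3> for <h2> as suggested above, and drops the 'reference' class filter because I don't know the exact markup of the new page, so expect to adjust the selectors.

# Sketch of the book example pointed at the new page (still Python 2.7 /
# BeautifulSoup 3 to stay close to the original). The <h2> detail comes from
# the answer above; the rest of the markup is assumed.
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

text = urlopen('https://www.python.org/jobs/').read()
soup = BeautifulSoup(text)

jobs = set()
for header in soup('h2'):
    links = header('a')  # no 'reference' class filter here (assumed)
    if not links:
        continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))

print '\n'.join(sorted(jobs, key=lambda s: s.lower()))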

Parse html5 data-* attributes in python?

Is it possible to access the data-* portion of an HTML element from Python? I'm using scrapy, and the data-* attributes are not available in a selector object, though the raw data is available in a Request object.
If I dump the HTML using wget -O page http://page.com then I can see the data in the file. It's something like <a href="blah" target="_blank" data-mine="a;slfkjasd;fklajsdfl;ahsdf" data-yours="truly">blahlink</a>
I can edit the data-mine portion in an editor, so I know it's there ... it just seems like well-behaved parsers are dropping it.
As you can see, I'm confused.
Yeah, lxml does not expose the attribute names for some reason, and Talvalin is right, html5lib does:
stav@maia:~$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3] on linux2
>>> import html5lib
>>> html = '''<a href="blah" target="_blank" data-mine="a;slfkjasd;fklajsdfl;ahsdf"
... data-yours="truly">blahlink</a>'''
>>> for x in html5lib.parse(html, treebuilder='lxml').xpath('descendant::*/@*'):
...     print '%s = "%s"' % (x.attrname, x)
...
href = "blah"
target = "_blank"
data-mine = "a;slfkjasd;fklajsdfl;ahsdf"
data-yours = "truly"
I did it like this without using a third-party library:
import re
data_email_pattern = re.compile(r'data-email="([^"]+)"')
match = data_email_pattern.search(response.body)
if match:
    print(match.group(1))
...
I've not tried it, but there is html5lib (http://code.google.com/p/html5lib/), which can be used in conjunction with Beautiful Soup instead of scrapy's built-in selectors; a small sketch of that combination follows.
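The sketch uses BeautifulSoup 4 with the html5lib parser; the markup is the example from the answer above, and I haven't wired it into scrapy, so treat it as an illustration only:

# Sketch: pull data-* attributes out with BeautifulSoup 4 + the html5lib parser.
from bs4 import BeautifulSoup

html = ('<a href="blah" target="_blank" '
        'data-mine="a;slfkjasd;fklajsdfl;ahsdf" data-yours="truly">blahlink</a>')

soup = BeautifulSoup(html, 'html5lib')
tag = soup.find('a')

print(tag['data-mine'])    # attribute access works like a dict
print(tag['data-yours'])

# or iterate over everything, data-* included
for name, value in tag.attrs.items():
    print('%s = "%s"' % (name, value))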