How can I use a function in find_all() in BeautifulSoup?

I'm using bs4 for my project. Now I have markup like:
<tr flag='t'><td flag='f'></td></tr>
I already know I can use a function in find_all(), so I wrote:
def myrule(tag):
    return tag['flag'] == 'f' and tag.parent['flag'] == 't'

soup.find_all(myrule)
Then I get an error like:
KeyError: 'flag'
Can anyone help me understand why it doesn't work?
Thanks.

You are searching every possible tag in your soup object for an attribute named flag. If the tag currently being tested doesn't have that attribute, the lookup raises a KeyError and the program stops.
You should first verify that the tag has the attribute before checking anything else. Like this:
from bs4 import BeautifulSoup
example = """<tr flag='t'><td flag='f'></td></tr>"""
soup = BeautifulSoup(example, "lxml")
def myrule(tag):
    return "flag" in tag.attrs and tag['flag'] == 'f' and tag.parent['flag'] == 't'
print(soup.find_all(myrule))
Outputs:
[<td flag="f"></td>]
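An alternative worth noting: tag.get() returns None instead of raising a KeyError when an attribute is missing, which guards both the tag and its parent in one go. A minimal sketch of that variant, using the same example markup as above:
from bs4 import BeautifulSoup

example = """<tr flag='t'><td flag='f'></td></tr>"""
soup = BeautifulSoup(example, "lxml")

def myrule(tag):
    # .get() returns None for a missing attribute instead of raising KeyError
    return tag.get('flag') == 'f' and tag.parent is not None and tag.parent.get('flag') == 't'

print(soup.find_all(myrule))
# [<td flag="f"></td>]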

Related

How do I avoid the 'NavigableString' error with BeautifulSoup and get to the text of href?

This is what I have:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
for i in soup.find("div", {"class":"navigation"}):
    print(i)
Currently the print output of "i" is:
<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>
I want to print out the href link "index.php?page=2".
When I try to use BeautifulSoup's "find", "select" or "attrs" methods on "i" I get an error. For instance, with
print(i.attrs["href"])
I get:
AttributeError: 'NavigableString' object has no attribute 'attrs'
How do I avoid the 'NavigableString' error with BeautifulSoup and get the text of href?
The issue seems to be for i in soup.find(...). If you're looking for only one element, there's no need to iterate over it; and if you're looking for multiple elements, find_all instead of find probably matches the intent.
More concretely, here are the two approaches. Beyond what's been mentioned above, note that i is a div that contains the desired a as a child, so we need an extra step to reach it (this could be more direct with an xpath).
import requests
from bs4 import BeautifulSoup

url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")

for i in soup.find_all("div", {"class": "navigation"}):
    print(i.find("a", href=True)["href"])

print(soup.find("div", {"class": "navigation"})
      .find("a", href=True)["href"])
Output:
index.php?page=2
index.php?page=2
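For completeness: iterating over the result of find walks the children of that single div, and the whitespace between tags shows up as NavigableString nodes, which is where the AttributeError comes from. If you do want to iterate the children, you can skip those nodes explicitly; a minimal sketch, assuming the same page and soup as above:
from bs4 import NavigableString

nav = soup.find("div", {"class": "navigation"})
for child in nav.children:
    # whitespace between tags appears as NavigableString nodes; skip them
    if isinstance(child, NavigableString):
        continue
    if child.name == "a" and child.has_attr("href"):
        print(child["href"])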

Using Scrapy to scrape data after form submit

I'm trying to scrape content from a listing detail page that can only be viewed by clicking the 'view' button, which triggers a form submit. I am new to both Python and Scrapy.
Example markup
<li><h3>Abc Widgets</h3>
<form action="/viewlisting?id=123" method="post">
<input type="image" src="/images/view.png" value="submit" >
</form>
</li>
My solution in Scrapy is to extract the form actions, then use Request to fetch each page with a callback to parse it for the desired content. However, I have hit a few issues:
I'm getting the following error: "request url must be str or unicode"
Secondly, when I hardcode a URL to overcome the above issue, it seems my parsing function is returning what looks like a list.
Here is my code, with the real URLs redacted:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from wfi2.items import Wfi2Item

class ProfileSpider(Spider):
    name = "profiles"
    allowed_domains = ["wfi.com.au"]
    start_urls = [
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=WA",
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=VIC",
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=QLD",
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NSW",
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=TAS",
        "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NT",
    ]

    def parse(self, response):
        hxs = Selector(response)
        forms = hxs.xpath('//*[@id="area-managers"]//*/form')
        for form in forms:
            action = form.xpath('@action').extract()
            print "ACTION: ", action
            request = Request(url=action, callback=self.parse_profile)
            yield request

    def parse_profile(self, response):
        hxs = Selector(response)
        profile = hxs.xpath('//*[@class="contentContainer"]/*/text()')
        print "PROFILE", profile
I'm getting the following error "request url must be str or unicode"
Please have a look at the Scrapy documentation for extract(). It says: "Serialize and return the matched nodes as a list of unicode strings" (emphasis on "list" mine).
The first element of the list is probably what you want. So you could do something like:
request = Request(url=response.urljoin(action[0]), callback=self.parse_profile)
Secondly, when I hardcode a URL to overcome the above issue, it seems my parsing function is returning what looks like a list.
According to the documentation of xpath(), it returns a SelectorList. Add extract() to the xpath call and you'll get a list of the text tokens. Eventually you'll want to clean up and join the elements of that list before further processing.
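Putting both fixes together, here is a minimal sketch of the corrected callbacks. It keeps the question's XPath expressions (which are assumptions about the page structure) and assumes a Scrapy version where response.xpath, response.urljoin and self.logger are available:
def parse(self, response):
    for form in response.xpath('//*[@id="area-managers"]//*/form'):
        action = form.xpath('@action').extract()  # a list of unicode strings
        if action:
            # take the first match and resolve it against the page URL
            yield Request(url=response.urljoin(action[0]),
                          callback=self.parse_profile)

def parse_profile(self, response):
    # extract() turns the SelectorList into a list of text tokens;
    # strip and join them into one readable string
    tokens = response.xpath('//*[@class="contentContainer"]/*/text()').extract()
    profile = ' '.join(t.strip() for t in tokens if t.strip())
    self.logger.info("PROFILE: %s", profile)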

Why is this BeautifulSoup result []?

I want to get the text in the span. I have checked it, but I don't see the problem
from bs4 import BeautifulSoup
import urllib.request
import socket
searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
f = urllib.request.urlopen(searchurl)
html = f.read()
soup = BeautifulSoup(html)
print(soup.findAll('span',attrs={'class': 'b'}))
The result was [], why?
Looking at the site in question, your search turns up an empty list because there are no spans with a class value of b. BeautifulSoup does not apply the CSS the way a browser would. In addition, your urllib request looks incorrect. Looking at the site, I think you want to grab all the spans with a class of label, though it's hard to tell when the site isn't in my native language. Here's how you would go about it:
from bs4 import BeautifulSoup
import urllib2 # Note urllib2
searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
f = urllib2.urlopen(searchurl) # Note no need for request
html = f.read()
soup = BeautifulSoup(html)
for s in soup.findAll('span', attrs={"class":"label"}):
    print s.text
This gives for the url listed:
Farbe:
Kraftstoffverbr. komb.:
Kraftstoffverbr. innerorts:
Kraftstoffverbr. außerorts:
CO²-Emissionen komb.:
Zugr.-lgd. Treibstoffart:
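For reference, urllib2 is Python 2 only. On Python 3, an equivalent sketch (same URL and class assumptions as above) would be:
from bs4 import BeautifulSoup
from urllib.request import urlopen

searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
html = urlopen(searchurl).read()
soup = BeautifulSoup(html, "html.parser")  # naming the parser explicitly avoids a warning

for s in soup.find_all('span', attrs={"class": "label"}):
    print(s.text)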

Passing results from mechanize to BeautifulSoup

I get an error when I try to mix mechanize and BeautifulSoup in the following code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
import mechanize
br=mechanize.Browser()
br.set_handle_robots(True)
br.open('http://tel.search.ch/')
br.select_form(nr=0)
br.form["was"] = "siemens"
br.submit()
content = br.response
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True):
    if re.findall('title', a['href']):
        print "URL:", a['href']
br.close()
The code from the beginning up to br.submit() works fine with mechanize, and the for loop works with BeautifulSoup too. But I don't know how to pass the result of br.submit() into BeautifulSoup. These two lines:
content = br.response
soup = BeautifulSoup(content)
are apparently wrong. I get an error for soup = BeautifulSoup(content):
TypeError: expected string or buffer
Can anyone help?
Try changing
content = br.response
to
content = br.response().read()
This way, content now holds the HTML, which can be passed to BeautifulSoup.
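In context, the relevant part of the script becomes:
br.submit()
content = br.response().read()  # response() returns a file-like object; read() yields the HTML
soup = BeautifulSoup(content)
Note that br.submit() itself returns the response, so content = br.submit().read() is an equivalent one-liner.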

Getting inner tag text whilst using a filter in BeautifulSoup

I have:
... html
<div id="price">$199.00</div>
... html
How do I get the $199.00 text? Using
soup.findAll("div", id="price", text=True)
does not work, as I get all the inner text from the whole document.
Find the div tag, then use the text attribute to get the text inside it.
>>> from bs4 import BeautifulSoup
>>>
>>> html = '''
... <html>
... <body>
... <div id="price">$199.00</div>
... </body>
... </html>
... '''
>>> soup = BeautifulSoup(html)
>>> soup.find('div', id='price').text
u'$199.00'
You are SO close to making it work.
(1) How to search for and locate the tag you are interested in:
Let's take a look at how to use the find_all function:
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
name="div": the name argument holds the tag name
attrs={"id":"price"}: attrs is a dictionary containing the attributes to match
recursive: a flag controlling whether to search all descendants or only direct children
text: can be used, along with regular expressions, to search for tags containing certain text
limit: caps how many results are returned; limit=1 makes find_all the same as find
In your case, here is a list of commands to locate the tags, playing with different flags:
>> # in case you have multiple interesting DIVs I am using find_all here
>> html = '''<html><body><div id="price">$199.00</div><div id="price">$205.00</div></body></html>'''
>> soup = BeautifulSoup(html)
>> print soup.find_all(attrs={"id":"price"})
[<div id="price">$199.00</div>, <div id="price">$205.00</div>]
>> # This is a bit funky but sometime using text is extremely helpful
>> # because text is actually what human will see so it is more reliable
>> import re
>> tags = [text.parent for text in soup.find_all(text=re.compile(r'\$'))]
>> print tags
[<div id="price">$199.00</div>, <div id="price">$205.00</div>]
There are many different ways to locate your elements; you just need to ask yourself which will be the most reliable way to locate an element.
More information about find and find_all can be found in the BS4 documentation.
(2) How to get the text of a tag:
tag.text returns unicode; you can convert it to a byte string with tag.text.encode('utf-8').
tag.string will also work.
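A quick sketch of the difference, reusing the soup from the first snippet: tag.string is the single NavigableString child (and is None when the tag has several children), while tag.text and tag.get_text() concatenate all text descendants.
>>> tag = soup.find('div', id='price')
>>> tag.text
u'$199.00'
>>> tag.string
u'$199.00'
>>> tag.get_text(strip=True)
u'$199.00'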