How do I avoid the 'NavigableString' error with BeautifulSoup and get to the text of href? - beautifulsoup

This is what I have:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
for i in soup.find("div", {"class":"navigation"}):
    print(i)
Currently the print output of "i" is:
<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>
I want to print out the href link "index.php?page=2".
When I try to use BeautifulSoup's "find", "select" or "attrs" methods on "i" I get an error. For instance, with
print(i.attrs["href"])
I get:
AttributeError: 'NavigableString' object has no attribute 'attrs'
How do I avoid the 'NavigableString' error with BeautifulSoup and get the text of href?

The issue is the for i in soup.find(...) loop: find returns a single Tag, and iterating over a Tag yields its children, including the whitespace NavigableString nodes between them. If you're looking for only one element, there's no need to iterate over it at all, and if you're looking for multiple elements, find_all instead of find would probably match the intent.
More concretely, here are the two approaches. Beyond what's been mentioned above, note that each i is a div that contains the desired a as a child, so we need an extra step to reach it (this could be more direct with an XPath, which would require lxml, since BeautifulSoup itself doesn't support XPath).
import requests
from bs4 import BeautifulSoup
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
for i in soup.find_all("div", {"class": "navigation"}):
    print(i.find("a", href=True)["href"])

print(soup.find("div", {"class": "navigation"})
      .find("a", href=True)["href"])
Output:
index.php?page=2
index.php?page=2
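If you do want to iterate over the children of the single div, you can also skip the NavigableString nodes explicitly. A minimal sketch using the markup from the question (the wrapper div's exact contents are assumed):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Markup as shown in the question, wrapped in the navigation div
html = '''<div class="navigation">
<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>
</div>'''
soup = BeautifulSoup(html, "html.parser")

# Iterating a single Tag yields its children, which include NavigableString
# nodes (the whitespace between tags) as well as Tag objects; keep only Tags.
hrefs = []
for child in soup.find("div", {"class": "navigation"}):
    if isinstance(child, Tag) and child.name == "a":
        hrefs.append(child["href"])
print(hrefs)
```

The isinstance check is what prevents the AttributeError, since only Tag objects have attrs.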

Related

Parsing text from certain "html elements" using selenium

What I've seen so far is that if the page source of a webpage is obtained through selenium, it is possible to parse text or whatever else is needed from that source with bs4 or lxml, no matter whether the page is JavaScript-driven or not. However, my question is: how can I parse a document from a certain HTML element by going through selenium first and then using the bs4 or lxml library? If the element pasted below is considered, then the way I would proceed with bs4 or lxml is:
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
<td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
#rest of the code here
from lxml.html import fromstring
tree = fromstring(html)
#rest of the code here
Now, how can I feed the above pasted HTML portion through selenium and then apply the bs4 library to it? I could not think of driver.page_source, as it is only applicable when the HTML comes from a loaded webpage.
To be a little more specific, if I want to use something like the below, then how can it be done?
from selenium import webdriver
driver = webdriver.Chrome()
element_html = driver-------(html) #this "html" is the above pasted one
print(element_html)
driver.page_source would give you the complete HTML source code of the page at one particular moment. You, though, having an element instance, can get to its outerHTML using the .get_attribute() method:
element = driver.find_element_by_id("some_id")  # Selenium 4+: driver.find_element(By.ID, "some_id")
element_html = element.get_attribute("outerHTML")
soup = BeautifulSoup(element_html, "lxml")
As for extracting the span element source from the onmouseover attribute: I would first parse the tr element with BeautifulSoup, get the onmouseover attribute, and then use a regular expression to extract the HTML value from inside the Tip() function call. Then, re-parse the span HTML with BeautifulSoup:
import re
from bs4 import BeautifulSoup
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
<td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
soup = BeautifulSoup(html, "lxml")
mouse_over = soup.tr['onmouseover']
span = re.search(r"Tip\('(.*?)'\)", mouse_over).group(1)
span_soup = BeautifulSoup(span, "lxml")
print(span_soup.get_text())
Prints:
License: 20-214767 (Validity: 21/05/2022)20C-214769 (Validity: 21/05/2022)21-214768 (Validity: 21/05/2022)
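If the run-together output above is a problem, get_text() accepts a separator (and a strip flag), which puts a delimiter where each BR tag used to be. A small sketch on just the recovered span markup:

```python
from bs4 import BeautifulSoup

# The span markup recovered from the Tip(...) call above
span = ("<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)"
        "<BR />20C-214769 (Validity: 21/05/2022)"
        "<BR />21-214768 (Validity: 21/05/2022)</span>")
span_soup = BeautifulSoup(span, "html.parser")

# separator inserts a space between the text fragments that the <BR /> tags
# split apart; strip=True trims whitespace from each fragment first
text = span_soup.get_text(separator=" ", strip=True)
print(text)
```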

How can i use a function in find_all() in BeautifulSoup

I'm using bs4 for my project. Now I get something like:
<tr flag='t'><td flag='f'></td></tr>
I already know I could use a function in find_all(). So I use
def myrule(tag):
    return tag['flag'] == 'f' and tag.parent['flag'] == 't'

soup.find_all(myrule)
then I get an error like
KeyError: 'myrule'
Can anyone help me with this and explain why it doesn't work?
Thanks.
You are searching every possible tag in your soup object for an attribute named flag. If the current tag being tested doesn't have that attribute, a KeyError is raised and the program stops.
You should first verify that the tag has that attribute before checking the rest. Like this:
from bs4 import BeautifulSoup
example = """<tr flag='t'><td flag='f'></td></tr>"""
soup = BeautifulSoup(example, "lxml")
def myrule(tag):
    return "flag" in tag.attrs and tag['flag'] == 'f' and tag.parent['flag'] == 't'

print(soup.find_all(myrule))
Outputs:
[<td flag="f"></td>]
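As an aside (an equivalent approach, not part of the original answer), the same parent/child condition can be written declaratively as a CSS attribute selector, which sidesteps the KeyError entirely because the selector only matches tags that actually carry the attribute:

```python
from bs4 import BeautifulSoup

example = """<tr flag='t'><td flag='f'></td></tr>"""
soup = BeautifulSoup(example, "html.parser")

# a <td flag='f'> whose direct parent is a <tr flag='t'>;
# tags without a flag attribute simply don't match
matches = soup.select("tr[flag='t'] > td[flag='f']")
print(matches)
```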

scraping an iframe with beautifulsoup and python

I am trying to scrape the following page:
https://www.dukascopy.com/swiss/english/marketwatch/sentiment/
More exactly, the numbers in the chart. For example, the number 74,19 % in the green bar next to the AUD/USD text. I have inspected the elements and found out that the tag for this number is span, but the following code does not return this or any other number in the chart:
import requests
from bs4 import BeautifulSoup
r=requests.get('https://www.dukascopy.com/swiss/english/marketwatch/sentiment/')
soup = BeautifulSoup(r.content, "html.parser")
data = soup('span')
print(data)
If you incorporate selenium with BeautifulSoup, you will get all the abilities of selenium to scrape iframes.
Try this:
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get(bond_iframe)  # bond_iframe is the URL of the iframe you want to scrape
bond_source = browser.page_source
browser.quit()
soup = BeautifulSoup(bond_source, "html.parser")
for div in soup.findAll('div', attrs={'class': 'qs-note-panel'}):
    print(div)
The for loop filters for whichever div tag you are searching for.
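An alternative without selenium: pull the iframe's src out of the outer page and request that URL directly. This is a sketch with hypothetical markup (the real src has to be read from the actual page's iframe element, e.g. in the browser's inspector):

```python
from bs4 import BeautifulSoup

# Hypothetical outer page; the real iframe URL will differ
outer_html = '''<html><body>
<iframe src="https://example.com/sentiment-widget"></iframe>
</body></html>'''
outer = BeautifulSoup(outer_html, "html.parser")

# src=True restricts the match to iframes that actually have a src attribute
iframe = outer.find("iframe", src=True)
iframe_url = iframe["src"]
print(iframe_url)
```

The resulting iframe_url can then be fetched with requests.get() and parsed with BeautifulSoup as usual, since the iframe's document is just another page.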

Why is this BeautifulSoup result []?

I want to get the text in the span. I have checked it, but I don't see the problem:
from bs4 import BeautifulSoup
import urllib.request
import socket
searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
f = urllib.request.urlopen(searchurl)
html = f.read()
soup = BeautifulSoup(html)
print(soup.findAll('span',attrs={'class': 'b'}))
The result was [], why?
Looking at the site in question, your search turns up an empty list because there are no span elements with a class value of b. BeautifulSoup matches the literal class attribute in the HTML; it does not apply the CSS cascade the way a browser would. Looking at the site, I think you want to grab all the spans with a class of label, though it's hard to tell when the site isn't in my native language. Here is how you would go about it:
from bs4 import BeautifulSoup
import urllib.request

searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
f = urllib.request.urlopen(searchurl)
html = f.read()
soup = BeautifulSoup(html, "html.parser")
for s in soup.findAll('span', attrs={"class": "label"}):
    print(s.text)
This gives for the url listed:
Farbe:
Kraftstoffverbr. komb.:
Kraftstoffverbr. innerorts:
Kraftstoffverbr. außerorts:
CO²-Emissionen komb.:
Zugr.-lgd. Treibstoffart:
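The class filter can also be written with the class_ keyword argument, which is equivalent to attrs={"class": ...}. A quick offline check with a hypothetical snippet mimicking the result-page markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one result row
html = '<span class="label">Farbe:</span><span class="value">rot</span>'
soup = BeautifulSoup(html, "html.parser")

# class_ avoids the attrs dict; only the "label" spans are returned
labels = [s.text for s in soup.find_all("span", class_="label")]
print(labels)
```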

Passing results from mechanize to BeautifulSoup

I get an error when I try to mix mechanize and BeautifulSoup in the following code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
import mechanize
br=mechanize.Browser()
br.set_handle_robots(True)
br.open('http://tel.search.ch/')
br.select_form(nr=0)
br.form["was"] = "siemens"
br.submit()
content = br.response
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True):
    if re.findall('title', a['href']):
        print "URL:", a['href']
br.close()
The code from the beginning till br.submit() works fine with mechanize and the for loop with BeautifulSoup too. But I don't know how to pass the results from br.submit() into BeautifulSoup. The 2 lines:
content = br.response
soup = BeautifulSoup(content)
are apparently wrong. I get an error for soup = BeautifulSoup(content):
TypeError: expected string or buffer
Can anyone help?
Try changing
content = br.response
to
content = br.response().read()
This way, content holds the HTML string, which can be passed to BeautifulSoup.
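The underlying issue is that br.response is a bound method, so BeautifulSoup was handed a method object rather than markup; it needs a str or bytes. A quick offline check mirroring the question's href filter, with hypothetical markup and the modern bs4 package:

```python
from bs4 import BeautifulSoup
import re

# bytes straight from .read() are accepted; a response object is not
content = b'<a href="/title/123">hit</a><a href="/other">miss</a>'
soup = BeautifulSoup(content, "html.parser")

# keep only anchors whose href contains "title", as in the question
urls = [a["href"] for a in soup.find_all("a", href=True)
        if re.findall("title", a["href"])]
print(urls)
```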