Parsing text from certain HTML elements using Selenium

From what I've seen so far, once Selenium has loaded a page, its page source can be parsed with bs4 or lxml to extract text or anything else that is needed, whether or not the page relies on JavaScript. My question is how to do the same for one particular HTML element: get it through Selenium and then parse it with bs4 or lxml. Taking the element pasted below as an example, this is how I would start with bs4 or lxml:
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
<td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
#rest of the code here
from lxml.html import fromstring
tree = fromstring(html)
#rest of the code here
Now, how can I get the HTML fragment pasted above through Selenium and then apply bs4 to it? I don't think driver.page_source fits, since it returns the source of a whole page rather than of one element.
To be a little more specific, I'd like to do something like the following. How can it be done?
from selenium import webdriver
driver = webdriver.Chrome()
element_html = driver-------(html) #this "html" is the above pasted one
print(element_html)

driver.page_source gives you the complete HTML source of the page at one particular moment. If you have an element instance, though, you can get its outerHTML using the .get_attribute() method:
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
# Selenium 4 style; older releases used driver.find_element_by_id("some_id")
element = driver.find_element(By.ID, "some_id")
element_html = element.get_attribute("outerHTML")
soup = BeautifulSoup(element_html, "lxml")
To extract the span source from the onmouseover attribute, I would first parse the tr element with BeautifulSoup, read its onmouseover attribute, and then use a regular expression to pull the HTML out of the Tip() function call. Then re-parse that span HTML with BeautifulSoup:
import re
from bs4 import BeautifulSoup
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
<td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
soup = BeautifulSoup(html, "lxml")
mouse_over = soup.tr['onmouseover']
span = re.search(r"Tip\('(.*?)'\)", mouse_over).group(1)
span_soup = BeautifulSoup(span, "lxml")
print(span_soup.get_text())
Prints:
License: 20-214767 (Validity: 21/05/2022)20C-214769 (Validity: 21/05/2022)21-214768 (Validity: 21/05/2022)
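Because the `<BR />` tags produce no text, get_text() runs the licenses together. A minimal sketch of one way to keep them apart, assuming each license sits between two `<BR />` tags as in the sample; the shortened `mouse_over` string and the `licenses` name are made up for illustration, and `html.parser` is used only to avoid the lxml dependency:

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the onmouseover value from the question (illustrative).
mouse_over = (
    "this.style.color='White';"
    "Tip('<span Style=Color:Red>License: <BR />"
    "20-214767 (Validity: 21/05/2022)<BR />"
    "20C-214769 (Validity: 21/05/2022)</span>');"
)

# Same regular expression as above: capture the HTML passed to Tip().
match = re.search(r"Tip\('(.*?)'\)", mouse_over)
span_html = match.group(1) if match else ""

span_soup = BeautifulSoup(span_html, "html.parser")

# stripped_strings yields each text node separately with whitespace trimmed;
# the first one is the "License:" label, the rest are the licenses.
licenses = list(span_soup.stripped_strings)[1:]
print(licenses)
```

If the label is not always the first text node, filtering on the text itself would be safer than slicing.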

Related

BS4 - Replacing text content, preserving tags

I have an HTML document that uses the text-transform style property to change case. When I see that style, I'd like to uppercase all the text it applies to while retaining the HTML tags.
I have a partial solution that replaces the tag entirely. The approach that seems like it ought to be correct gives me AttributeError: 'NoneType' object has no attribute 'next_element'
Example:
from bs4 import BeautifulSoup, NavigableString, Tag
import re
html = '''
<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1
<div>Foo2</div>
</div>
</div>
'''
upper_patt = re.compile(r'(?i)text-transform:\s*uppercase')
soup = BeautifulSoup(html, "html.parser")
# works, but replaces all text, removing the HTML tags
for node in soup.find_all(attrs={'style': upper_patt}):
    node.replace_with(node.text.upper())
# does not work, raises AttributeError
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
    for txt in node.strings:
        txt.replace_with(txt.upper())
Seems like you want to change the inner text to uppercase for all the children of an element with text-transform: uppercase.
Instead of replacing the nodes returned by find_all, loop over the text nodes inside each of them with node.findChildren(text=True), and use replace_with() to change each text node:
from bs4 import BeautifulSoup, NavigableString, Tag
import re
html = '''
<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1
<div>Foo2</div>
</div>
</div>
'''
upper_patt = re.compile(r'(?i)text-transform:\s*uppercase')
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
    # findChildren returns a list, so replacing nodes while looping is safe
    for child in node.findChildren(recursive=True, text=True):
        child.replace_with(child.text.upper())
print(soup)
Prints:
<div style="text-transform: uppercase;">
FOO0
<font>FOO0</font>
<div>FOO1
<div>FOO2</div>
</div>
</div>
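The AttributeError in the question probably comes from replacing text nodes while node.strings is still lazily walking the tree, whereas a list of text nodes collected up front is safe to modify. A small sketch of the same idea wrapped in a function, using the newer `string=True` spelling of the `text=True` filter (available since Beautiful Soup 4.4.0); the function name and the sample input are illustrative only:

```python
import re
from bs4 import BeautifulSoup

def uppercase_styled_text(html):
    """Uppercase text inside elements styled with text-transform: uppercase."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"text-transform:\s*uppercase", re.IGNORECASE)
    for node in soup.find_all(attrs={"style": pattern}):
        # find_all returns a list, so the tree can be modified while looping.
        for text_node in node.find_all(string=True):
            text_node.replace_with(text_node.upper())
    return str(soup)

result = uppercase_styled_text(
    '<div style="text-transform: uppercase;">a<b>b</b></div><p>c</p>'
)
print(result)
```

Text outside the styled element (the `<p>` here) is left untouched, and the tags inside the styled element are preserved.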

Pandas web scraping(Beautiful soup) find in tag with class, another tag with a link. Then following the link inside href

I tried to find a 'td' tag with a specific attribute, and then find the 'a' tag inside that 'td' tag:
for row in bs4.find_all('<td class="series-column"'):
    for link in bs4.find_all('a'):
        if link.has_attr('href') and (link.has_attr('class') == 'formatted-title external-link result-url'):
            print(link.attrs['href'])
On the screenshot you see html for this page
Your bs4.find_all('<td class="series-column"') call is wrong: find_all takes the tag name and the attributes you want to match, not a piece of HTML, for example bs4.find_all('td', class_='series-column'). Or use a CSS selector:
from bs4 import BeautifulSoup
txt = '''
<td class="series-column">
<a class="formatted-title external-link result-url" href="//knoema.com/...">link text</a>
</td>'''
soup = BeautifulSoup(txt, 'html.parser')
for link in soup.select('td.series-column a.formatted-title.external-link.result-url'):
print(link['href'])
Prints:
//knoema.com/...
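For comparison, here is a sketch of the find_all variant mentioned above. Two further things differ from the code in the question: the inner search runs on each matched cell rather than on the whole document, and the class check is done by find_all itself, since `link.has_attr('class') == '...'` compares a boolean with a string and is never true. The example.com URLs and the second cell are invented for illustration:

```python
from bs4 import BeautifulSoup

# Sample markup; the example.com URLs and the "other-column" cell are made up.
txt = '''<table><tr>
<td class="series-column">
<a class="formatted-title external-link result-url" href="//example.com/series">link text</a>
</td>
<td class="other-column"><a href="//example.com/other">other</a></td>
</tr></table>'''

soup = BeautifulSoup(txt, "html.parser")

hrefs = []
for cell in soup.find_all("td", class_="series-column"):
    # Search inside this cell only; class_ matches any one of the tag's classes,
    # and href=True skips anchors without an href attribute.
    for link in cell.find_all("a", class_="result-url", href=True):
        hrefs.append(link["href"])

print(hrefs)
```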

How to get an element nested inside an h2 using BeautifulSoup

I would like to get the text "Orange" using BeautifulSoup.
I only need the element that is nested inside the h2.
How can I get it?
<h2 class="heading">
<a><span class="name">Orange</span></a>
</h2>
from bs4 import BeautifulSoup
html = """<h2 class="heading">
<a><span class="name">Orange</span></a>
</h2>"""
soup = BeautifulSoup(html, 'html.parser')
for item in soup.findAll("span", {'class': 'name'}):
    print(item.text)
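The loop above returns every span with class "name", wherever it appears. If only the spans inside an h2 are wanted, as the question asks, a CSS selector can express that restriction. This is a sketch; the extra h3 block with "Apple" is invented to show what gets excluded:

```python
from bs4 import BeautifulSoup

# The h3 block is a hypothetical addition to show the filtering.
html = """<h2 class="heading">
<a><span class="name">Orange</span></a>
</h2>
<h3 class="heading">
<a><span class="name">Apple</span></a>
</h3>"""

soup = BeautifulSoup(html, "html.parser")

# "h2.heading span.name" matches span.name only when it is a descendant of h2.heading.
names = [span.get_text(strip=True) for span in soup.select("h2.heading span.name")]
print(names)
```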

How do I avoid the 'NavigableString' error with BeautifulSoup and get to the text of href?

This is what I have:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
for i in soup.find("div", {"class":"navigation"}):
    print(i)
Currently the print output of "i" is:
<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>
I want to print out the href link "index.php?page=2".
When I try to use BeautifulSoup's "find", "select" or "attrs" on "i", I get an error. For instance with
print(i.attrs["href"])
I get:
AttributeError: 'NavigableString' object has no attribute 'attrs'
How do I avoid the 'NavigableString' error with BeautifulSoup and get the text of href?
The issue seems to be for i in soup.find. If you're looking for only one element, there's no need to iterate that element, and if you're looking for multiple elements, find_all instead of find would probably match the intent.
More concretely, here are the two approaches. Beyond what's been mentioned above, note that i is a div that contains the desired a as a child, so we need an extra step to reach it (this could be more direct with an xpath).
import requests
from bs4 import BeautifulSoup
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
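# Approach 1: several matching divs, so iterate over find_all
for i in soup.find_all("div", {"class": "navigation"}):
    print(i.find("a", href=True)["href"])
# Approach 2: a single div, so no loop is needed
print(soup.find("div", {"class": "navigation"})
      .find("a", href=True)["href"])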
Output:
index.php?page=2
index.php?page=2
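The href printed above is relative. Since the question already imports urljoin, here is a sketch of turning it into an absolute URL. To keep the example runnable offline it parses an inline copy of the navigation markup instead of fetching the page, and the surrounding div is assumed from the question's description:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://python.beispiel.programmierenlernen.io/index.php"

# Inline stand-in for the downloaded page (assumed structure).
doc = ('<div class="navigation">'
       '<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>'
       '</div>')

soup = BeautifulSoup(doc, "html.parser")

# find() returns None when nothing matches, so guard each step.
nav = soup.find("div", {"class": "navigation"})
link = nav.find("a", href=True) if nav else None

# Resolve the relative href against the URL of the page it came from.
next_url = urljoin(base_url, link["href"]) if link else None
print(next_url)
```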

Beautiful Soup - how to get href

I can't seem to be able to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of html:
<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>
This is what I thought should work
href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]
.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag, so it has no href. Also, the text= argument is for matching NavigableStrings, rather than elements.
The following does what you want, I think:
import re
from bs4 import BeautifulSoup
html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''
soup = BeautifulSoup(html, "html.parser")
for t in soup.find_all(text=re.compile(r'Website:')):
    # Find the parent of the NavigableString, and see
    # whether that's a <strong>:
    s = t.parent
    if s.name == 'strong':
        # First sibling is the whitespace text node, second is the <a>
        print(s.next_sibling.next_sibling['href'])
... but that isn't very robust. If the enclosing div has a predictable ID, then it would be better to find that, and then find the first <a> element within it.
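A minimal sketch of that more robust variant, assuming the id shown in the question (id_Website) is stable across pages; the variable names are illustrative:

```python
from bs4 import BeautifulSoup

html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Locate the enclosing div by its id, then take the first <a> that has an href.
container = soup.find("div", id="id_Website")
link = container.find("a", href=True) if container else None
href = link["href"] if link else None
print(href)
```

This does not depend on how much whitespace or how many siblings sit between the <strong> and the <a>.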