BS4 - Replacing text content, preserving tags

I have an HTML document that uses the text-transform style property to change case. When I see that style, I'd like to change all text to which it applies, retaining the HTML tags.
I have a partial solution that replaces the matched elements entirely. The approach that seems like it ought to be correct gives me AttributeError: 'NoneType' object has no attribute 'next_element'
Example:
from bs4 import BeautifulSoup, NavigableString, Tag
import re
html = '''
<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1
<div>Foo2</div>
</div>
</div>
'''
upper_patt = re.compile(r'(?i)text-transform:\s*uppercase')
soup = BeautifulSoup(html, "html.parser")
# works, but replaces each matched element wholesale, removing the HTML tags
for node in soup.find_all(attrs={'style': upper_patt}):
    node.replace_with(node.text.upper())
# does not work, throws AttributeError: replacing a string while iterating
# the .strings generator leaves it with no next_element to follow
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
    for txt in node.strings:
        txt.replace_with(txt.upper())

It seems like you want to change the inner text to uppercase for all the children of an element with text-transform: uppercase.
Instead of replacing the result of find_all wholesale, loop over the descendant text nodes of each result with node.find_all(string=True) (the older spelling is findChildren(text=True)); that returns a list, so it is safe to mutate while looping, and you can use replace_with() to change each text node:
from bs4 import BeautifulSoup, NavigableString, Tag
import re
html = '''
<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1
<div>Foo2</div>
</div>
</div>
'''
upper_patt = re.compile(r'(?i)text-transform:\s*uppercase')
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
    # find_all(string=True) returns a list of all descendant text nodes
    for child in node.find_all(string=True):
        child.replace_with(child.upper())
print(soup)
Prints:
<div style="text-transform: uppercase;">
FOO0
<font>FOO0</font>
<div>FOO1
<div>FOO2</div>
</div>
</div>
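The same replace-the-text-nodes idea generalizes to the other text-transform values. A minimal sketch, assuming a hand-written mapping of CSS values to string methods (the TRANSFORMS dict and the apply_text_transforms name are illustrative, not part of the original answer):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical mapping of CSS text-transform values to string methods
TRANSFORMS = {
    'uppercase': str.upper,
    'lowercase': str.lower,
    'capitalize': str.title,
}

patt = re.compile(r'(?i)text-transform:\s*(uppercase|lowercase|capitalize)')

def apply_text_transforms(soup):
    for node in soup.find_all(attrs={'style': patt}):
        func = TRANSFORMS[patt.search(node['style']).group(1).lower()]
        # find_all(string=True) materializes a list, so replacing is safe
        for child in node.find_all(string=True):
            child.replace_with(func(child))
    return soup

soup = BeautifulSoup(
    '<p style="text-transform: lowercase;">Hello <b>World</b></p>',
    'html.parser')
print(apply_text_transforms(soup))
# -> <p style="text-transform: lowercase;">hello <b>world</b></p>
```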


Extracting text within div tag itself with BeautifulSoup

I am trying to extract the number, e.g. "3762", from the div below with BeautifulSoup:
<div class="contentBox">
<div class="pid-box-1" data-pid-imprintid="3762">
</div>
<div class="pid-box-2" data-pid-imprintid="5096">
</div>
<div class="pid-box-1" data-pid-imprintid="10944">
</div>
</div>
The div comes from this website (a pharma medical database): Drugs.com.
I can't use "class", since it changes from div to div; there are more variants than just pid-box-1 and pid-box-2. I haven't had success using "data-pid-imprintid" either.
This is what I have tried, and I know that I can't write "data-pid-imprintid" the way I have done:
soup = BeautifulSoup(html_text, 'lxml')
divs = soup.find_all('div', 'data-pid-imprintid')
for div in divs:
    item = div.find('div')
    id = item.get('data-pid-imprintid')
    print(id)
This gets the value of data-pid-imprintid from every div that has the attribute:
soup = BeautifulSoup(html_text, 'lxml')
divs = soup.find_all("div", attrs={"data-pid-imprintid": True})
for div in divs:
    print(div.get('data-pid-imprintid'))
First of all, be aware that there is a little typo in your question's HTML (class="pid-box-1'); without fixing it, you will only get two ids back.
How to select?
As an alternative approach to find_all() that works well, you can also go with a CSS selector:
soup.select('div[data-pid-imprintid]')
This will select every <div> that has an attribute called data-pid-imprintid. To get the value of data-pid-imprintid, iterate the result set, for example with a list comprehension:
[e['data-pid-imprintid'] for e in soup.select('div[data-pid-imprintid]')]
Example
from bs4 import BeautifulSoup
html='''<div class="contentBox">
<div class="pid-box-1" data-pid-imprintid="3762">
</div>
<div class="pid-box-2" data-pid-imprintid="5096">
</div>
<div class="pid-box-1" data-pid-imprintid="10944">
</div>
</div>'''
soup = BeautifulSoup(html, 'lxml')
ids = [e['data-pid-imprintid'] for e in soup.select('div[data-pid-imprintid]')]
print(ids)
Output
['3762', '5096', '10944']
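Since each match from select() is a full Tag, you can pull out several attributes at once if, say, you also want to know which pid-box class an id came from. A small sketch using the question's snippet:

```python
from bs4 import BeautifulSoup

html = '''<div class="contentBox">
<div class="pid-box-1" data-pid-imprintid="3762"></div>
<div class="pid-box-2" data-pid-imprintid="5096"></div>
<div class="pid-box-1" data-pid-imprintid="10944"></div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# class is multi-valued, so e['class'] is a list; [0] takes its first token
pairs = [(e['class'][0], e['data-pid-imprintid'])
         for e in soup.select('div[data-pid-imprintid]')]
print(pairs)
# -> [('pid-box-1', '3762'), ('pid-box-2', '5096'), ('pid-box-1', '10944')]
```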

extract text from html string with Scrapy

Here is the html string in question.
<div class="def ddef_d db">a <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/book" title="book">book</a> of grammar <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/rule" title="rules">rules</a>: </div>
With BeautifulSoup, this code
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text
gets me
a book of grammar rules:
which is exactly what I want.
With scrapy, how do I get the same result?
from scrapy import Selector
sel = Selector(text=htmltxt)
sel.css('.ddef_d::text').getall()
this code gets me
['a ', ' of grammar ', ': ']
How should I fix it?
You can use this code to get all the text inside the div and its children:
text = ''.join(sel.css('.ddef_d ::text').getall())
print(text)
Your selector returns only the text nodes sitting directly inside the div, but part of the text is located inside child elements (the <a> tags). Adding a space before ::text turns it into a descendant selector, which includes the children's text in the result.

How to use BeautifulSoup to get content inside over-line tags

I would like to extract the content("_The_important_content_") from an HTML snippet as follows:
<div
class="
a:2
c:gray
m:da
"
>
_The_important_content_
</div>
My code is just:
for i in soup.findAll('div', class_="a:2 c:gray m:da"):
    print(i.text)
But because the "class" attribute contains newlines and is spread over multiple lines, BeautifulSoup cannot match it, and the code returns nothing. How can I specify the correct class value and get the content?
There are many tags with the same "class" value and others with different "class" values, but I want to extract the content only from the tags with that specific "class" value.
Try this:
html='''
<div
class="
a:2
c:gray
m:da
"
>
_The_important_content_
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
# class is multi-valued in bs4; match the exact set of class tokens
wanted = {"a:2", "c:gray", "m:da"}
item = soup.find(lambda t: t.name == "div" and set(t.get("class", [])) == wanted)
print(item.text.strip())
Result:
_The_important_content_
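It may help to know that bs4 treats class as a multi-valued attribute and splits it on any whitespace, including the newlines here; that is why matching on class tokens works even for this markup. A sketch contrasting single-token matching with exact-set matching (the variable names are mine):

```python
from bs4 import BeautifulSoup

html = '''
<div class="
a:2
c:gray
m:da
">first</div>
<div class="a:2 other">second</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# class_ with a single token matches any tag whose class list contains it
any_match = [d.text for d in soup.find_all('div', class_='a:2')]
print(any_match)   # -> ['first', 'second']

# exact-set matching when the full class value must agree
wanted = {'a:2', 'c:gray', 'm:da'}
exact = [d.text for d in soup.find_all(
    lambda t: t.name == 'div' and set(t.get('class', [])) == wanted)]
print(exact)       # -> ['first']
```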

Parsing text from certain "html elements" using selenium

What I've seen so far is that if the page source of a webpage is fetched through selenium, it is possible to parse text or whatever else is necessary from that page source with bs4 or lxml, no matter whether the page was javascript-enabled or not. However, my question is how I can parse a document from a certain html element by grabbing it with selenium and then using the bs4 or lxml library. If the element pasted below is considered, then with bs4 or lxml I would proceed like this:
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
<td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
#rest of the code here
from lxml.html import fromstring
tree = fromstring(html)
#rest of the code here
Now, how can I grab the above pasted html portion using selenium and then apply the bs4 library to it? I could not think of driver.page_source, as it only applies to a whole webpage.
To be a little more specific: if I want to use something like below, how can it be done?
from selenium import webdriver
driver = webdriver.Chrome()
element_html = driver-------(html) #this "html" is the above pasted one
print(element_html)
driver.page_source would give you the complete HTML source code of the page at one particular moment. You, though, having an element instance, can get to its outerHTML using the .get_attribute() method:
element = driver.find_element_by_id("some_id")
element_html = element.get_attribute("outerHTML")
soup = BeautifulSoup(element_html, "lxml")
As far as extracting the span element source from out of the mouseover attribute - I would first parse the tr element with BeautifulSoup, get the onmouseover attribute and then use a regular expression to extract the html value from inside the Tip() function call. And then, re-parse the span html with BeautifulSoup:
import re
from bs4 import BeautifulSoup
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
<td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
soup = BeautifulSoup(html, "lxml")
mouse_over = soup.tr['onmouseover']
span = re.search(r"Tip\('(.*?)'\)", mouse_over).group(1)
span_soup = BeautifulSoup(span, "lxml")
print(span_soup.get_text())
Prints:
License: 20-214767 (Validity: 21/05/2022)20C-214769 (Validity: 21/05/2022)21-214768 (Validity: 21/05/2022)
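If the individual license numbers and validity dates are needed separately, a second regex over the extracted text can split them apart. The pattern below is a guess based on the format shown above, not a general parser:

```python
import re

# Text as produced by span_soup.get_text() above
text = ('License: 20-214767 (Validity: 21/05/2022)'
        '20C-214769 (Validity: 21/05/2022)'
        '21-214768 (Validity: 21/05/2022)')

# Capture each license token and its validity date as a pair
licenses = re.findall(r'([\w-]+)\s*\(Validity:\s*([\d/]+)\)', text)
print(licenses)
# -> [('20-214767', '21/05/2022'), ('20C-214769', '21/05/2022'),
#     ('21-214768', '21/05/2022')]
```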

Beautiful Soup - how to get href

I can't seem to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of HTML:
<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>
This is what I thought should work
href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]
.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag. Also, the text= attribute is for matching NavigableStrings, rather than elements.
The following does what you want, I think:
import re
from BeautifulSoup import BeautifulSoup
html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''
soup = BeautifulSoup(html)
for t in soup.findAll(text=re.compile(r'Website:')):
    # Find the parent of the NavigableString, and see
    # whether that's a <strong>:
    s = t.parent
    if s.name == 'strong':
        print s.nextSibling.nextSibling['href']
... but that isn't very robust. If the enclosing div has a predictable ID, then it would be better to find that, and then find the first <a> element within it.
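The answer above targets the old BeautifulSoup 3 API (note the Python 2 print). With current bs4, the more robust id-based route it recommends might look like this sketch; the id comes from the question's snippet:

```python
import re
from bs4 import BeautifulSoup

html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Find the container by its predictable id, then the first <a> inside it
link = soup.find('div', id='id_Website').find('a')
print(link['href'])   # -> http://google.com

# Or: anchor on the <strong> label and take the next <a> in document order
link2 = soup.find('strong', string=re.compile('Website')).find_next('a')
print(link2['href'])  # -> http://google.com
```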