Extract text from p only if a preceding header exists using BeautifulSoup

I want to extract the text of a paragraph element using BeautifulSoup.
The HTML looks something like this:
<span class="span_class">
<h1>heading1</h1>
<p>para1</p>
<h1>heading 2</h1>
<p>para2</p>
</span>
I want to extract the text from the first p only if an h1 precedes it, and likewise for each following heading and paragraph pair.
So far I have tried:
x=soup.findAll('span',{'class':'span_class'})
y=x.findAll('p')[0].text
But this does not work.

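You can use the CSS adjacent sibling combinator here. Note that x must be a single element (for example the result of soup.find('span', {'class': 'span_class'})), not the list returned by findAll:
paragraphs = x.select('h1 + p')
# `paragraphs` now contains two elements: <p>para1</p> and <p>para2</p>
This selects only those P elements whose immediately preceding sibling element is an H1.
If you want to do some more logic based on the H1 content, you can do this:
for p in x.select('h1:first-child + p'):
    # `p` is the paragraph directly after the `H1` that is the first child of its parent.
    # find_previous_sibling('h1') returns that `H1`; plain `p.previous_sibling`
    # may return the whitespace text node between the two tags instead.
    h1 = p.find_previous_sibling('h1')
    if h1.text == 'heading1':
        # We got the `P` that has an `H1` with content "heading1" before it.
        print(p, h1)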
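As a minimal, self-contained sketch of the sibling-selector approach, the snippet below runs against the HTML from the question (with the missing quote restored) and pairs each heading with the paragraph that directly follows it. The variable names `span` and `pairs` are illustrative only, and `html.parser` is chosen here just to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<span class="span_class">
<h1>heading1</h1>
<p>para1</p>
<h1>heading 2</h1>
<p>para2</p>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
# find() returns a single element, so select() can be called on it.
span = soup.find("span", class_="span_class")

# Map each heading's text to the text of the paragraph right after it.
pairs = {
    p.find_previous_sibling("h1").get_text(): p.get_text()
    for p in span.select("h1 + p")
}
print(pairs)
```

A paragraph that is not directly preceded by an h1 never matches `h1 + p`, so it is simply left out of `pairs`.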

from bs4 import BeautifulSoup as bs

html = '''<html>
<body>
<span class='span_class'>
<h1>heading1</h1>
<p>content1</p>
<p>content2</p>
<h1>heading2</h1>
<p>content3</p>
</span>
</body>
</html>'''
soup = bs(html, 'lxml')
x = soup.find_all('span', {'class': 'span_class'})  # find the spans
for y in x:
    for heading in y.find_all('h1'):  # loop over every h1 inside the span
        if heading.text == 'heading1':
            print(heading.text)  # print the h1 text
            # find_next() returns None (it does not raise) if no <p> follows
            p = heading.find_next('p')
            if p is not None:
                print(p)
Is this what you are looking for? It looks for your <span> elements and then for every <h1> inside them. For each <h1> whose text is 'heading1', it prints the next <p>.
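One detail worth checking when using find_next(): it searches everything that follows in the document, not just siblings, so it can return a paragraph outside the span. The sketch below contrasts it with find_next_sibling() on a small made-up document (the markup and variable names are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the only <p> sits outside the span.
html = """
<span class="span_class"><h1>heading1</h1></span>
<p>outside</p>
"""
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

anywhere = h1.find_next("p")         # searches the rest of the document
sibling = h1.find_next_sibling("p")  # searches only siblings of the h1

print(anywhere)  # <p>outside</p>
print(sibling)   # None
```

If the paragraph must belong to the same span as the heading, find_next_sibling() is probably the safer choice.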

Related

Find link in text and replace with "a" tag

I have partially valid HTML in which I need to turn plain-text URLs into hyperlinks. For example, this text:
Superotto: risorse audiovisive per superare i pregiudizi e celebrare
l’otto marzo, in “Indire Informa”, 5 marzo 2021,
https://www.indire.it/2021/03/05/superotto-risorse-audiovisive-per-superare-i-pregiudizi-e-celebrare-lotto-marzo/;
Sezione Superotto in
https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto.
Has to become:
Superotto: risorse audiovisive per superare i pregiudizi e celebrare
l’otto marzo, in “Indire Informa”, 5 marzo 2021, <a
href="https://www.indire.it/2021/03/05/superotto-risorse-audiovisive-per-superare-i-pregiudizi-e-celebrare-lotto-marzo/">https://www.indire.it/2021/03/05/superotto-risorse-audiovisive-per-superare-i-pregiudizi-e-celebrare-lotto-marzo/</a>;
Sezione Superotto in <a
href="https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto">https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto</a>.
BeautifulSoup does not seem to find the URLs reliably, so I used this regex with plain Python's re.findall, but I cannot substitute or recompose the text. So far I have:
links = re.findall(r"(http|ftp|https:\/\/)([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])", str(soup))
link_to_replace = []
for l in links:
    link = ''.join(l)
    if link in soup.find("body").text:
        good_link = '<a href="' + link + '">' + link + '</a>'
        fixed_text = soup.replace(link, good_link)
        soup.replace_with(fixed_text)
I tried multiple solutions for the last two lines (this is just one of them), but none worked.
Perhaps as follows: I first identify the relevant anchor elements and strip every attribute except href, then replace each URL in the paragraph text with the HTML of its anchor.
import re
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://rivista.clionet.it/vol5/giorgi-zoppi-la-ricerca-indire-tra-uso-didattico-del-patrimonio-storico-culturale-e-promozione-delle-buone-pratiche/')
soup = bs(r.text, 'lxml')
item = soup.select_one('p:has(a[id="ft-note-16"])')
text = item.text
for tag in item.select('a:not([id])'):
    href = tag['href']
    tag.attrs = {'href': href}  # drop every attribute except href
    # Plain string replacement: a URL contains regex metacharacters such as "." and "?"
    text = text.replace(href, str(tag))
# Remove the text of the first anchor in the paragraph
text = re.sub(re.escape(item.a.text), '', text).strip()
print(text)
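An alternative that works without a network request is to edit the parse tree itself: find the text nodes that contain a URL and replace each one with a mix of text and new <a> tags. The sketch below is one possible way to do that, not the answer's method; the pattern `URL_RE` and the helper `linkify` are hypothetical names, and the pattern is deliberately simple (http/https only, trailing punctuation excluded).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern: an http(s) URL whose last character is not punctuation.
URL_RE = re.compile(r'https?://[^\s<>"]+[^\s<>".,;:!?)]')

html = (
    "<p>Sezione Superotto in "
    "https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto.</p>"
)
soup = BeautifulSoup(html, "html.parser")


def linkify(node, soup):
    """Replace one text node with plain-text pieces and <a> tags."""
    text = str(node)
    pos = 0
    for m in URL_RE.finditer(text):
        if m.start() > pos:
            node.insert_before(text[pos:m.start()])  # text before the URL
        a = soup.new_tag("a", href=m.group(0))
        a.string = m.group(0)
        node.insert_before(a)
        pos = m.end()
    if pos < len(text):
        node.insert_before(text[pos:])  # text after the last URL
    node.extract()  # remove the original text node


# find_all() returns a list, so modifying the tree inside the loop is safe.
for node in soup.find_all(string=URL_RE):
    if node.find_parent("a") is None:  # skip text that is already a link
        linkify(node, soup)

print(soup)
```

Because the anchors are created with new_tag() rather than by string concatenation, special characters in the URL are escaped by BeautifulSoup when the tree is serialized.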

How can I print only numbers inside a tag

I have a soup object like:
<div class="list-card__select">
<div class="list-card__item-size">
Size:
75 м² </div>
I did
soup = BeautifulSoup(text, 'lxml')
number = item.find(class_='list-card__item-size').text
print(number)
Result: 'Size: 75 м²'
How can I get just '75'?
you can do this:
soup = BeautifulSoup(html,"html.parser")
data = soup.findAll("span", { "class":"comments" })
numbers = [d.text for d in data]
Provided that the pattern is always identical, a simple split() can be used. Called without arguments, it splits on any whitespace, including the line break after "Size:":
item.find(class_='list-card__item-size').text.split()[1]
Alternatives are a regex, or inspecting other elements, JavaScript, or an API that hold this information directly.
If the number is always a non-negative integer, the re package can also be used:
import re
string = "Size: 75 м²"
print( re.findall(r'\d+', string)[0] )
Output : 75
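Putting the pieces together, here is a small sketch that applies a regex directly to the element's text from the question. The closing </div> tags are added to make the fragment complete, the pattern also tolerates a decimal part (an assumption, since the question only shows an integer), and `size` is an illustrative name.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="list-card__select">
<div class="list-card__item-size">
Size:
75 м² </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
text = soup.find(class_="list-card__item-size").get_text(strip=True)

# First run of digits, optionally followed by a decimal part.
match = re.search(r"\d+(?:[.,]\d+)?", text)
size = match.group(0) if match else None
print(size)  # 75
```

The superscript in "м²" is not matched by `\d`, so it does not interfere with the result.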

modify html tags replace "src" with "data-src"

Is there a way to use BeautifulSoup to modify the attributes of img elements? I have some HTML files in which I want to swap each image's "src" value with its "data-src" value:
<img src="//Pictures/q-90-90.png" data-src="//Pictures/p720_test.jpg">
Code so far:
soup = BeautifulSoup(open("template/home.html"), 'lxml')
images = soup.findAll('img')
for i in images:
    # replace src with data-src
I am open to any solution, including regex. Ideally the output would be:
<img src="//Pictures/p720_test.jpg" data-src="//Pictures/q-90-90.png">
Don't use regex on HTML. Attributes can be read and assigned like dictionary keys; for example, this sets the value of src on the first img to the literal string "data-src":
soup.find('img')['src'] = "data-src"
Edit:
To swap the attribute values inside an <img> element, try this:
target = soup.select_one('img')
old_src = target['src']
old_data = target['data-src']
target['src'] = old_data
target['data-src'] = old_src
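The snippet above handles only the first image. A sketch that extends the same idea to every image in a document could look like this; the second <img> (without data-src) is a made-up example added to show that such images are skipped.

```python
from bs4 import BeautifulSoup

html = """
<img src="//Pictures/q-90-90.png" data-src="//Pictures/p720_test.jpg">
<img src="//Pictures/no-lazy.png">
"""
soup = BeautifulSoup(html, "html.parser")

# Only images that carry both attributes are selected, so nothing raises KeyError.
for img in soup.select("img[src][data-src]"):
    img["src"], img["data-src"] = img["data-src"], img["src"]

print(soup)
```

To save the result, str(soup) can be written back to the file.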

beautifulsoup: get text (including html tags) between two different tags (</h3> and <h2>)

I am trying to scrape an HTML file structured as follows using BeautifulSoup. Basically, each unit consists of:
one <h2></h2>
one <h3></h3>
one or more <p></p>
Something like the following:
<h2>January, 2020</h2>
<h3>facility</h3>
<p>text1-1</p>
<p>text1-2</p>
<h2>April, 2020</h2>
<h3>scientists</h3>
<p>text2-1</p>
<p>text2-2</p>
<h2>June, 2020</h2>
<h3>lawyers</h3>
<p>text3-1</p>
<h2>.....
I want to get text including the <p> tags between </h3> and the next <h2>. The result should be:
for row #1:
<p>text1-1</p>
<p>text1-2</p>
for row #2:
<p>text2-1</p>
<p>text2-2</p>
for row #3:
<p>text3-1</p>
Here is what I tried so far:
num_h2 = len(soup.find_all('h2'))
for i in range(0,num_h2):
    print('---------')
    print(i)
    p_string = ''
    sibling = soup.find_all('h3')[i].find_next_sibling('p').getText()
    if sibling:
        p_string += sibling
    else:
        break
    print(p_string)
The problem with this solution is that it only shows the content of the first <p> in each unit. I do not know how to find out how many <p> elements there are so that I can loop over them. Also, is there a better way to do this than using find_next_sibling()?
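Maybe CSS selectors, combined with a loop over the siblings that follow each <h3>, can help:
for s in soup.select('h3'):
    # find_next_siblings() is the current name of the deprecated fetchNextSiblings()
    for ns in s.find_next_siblings():
        if ns.name == "h2":
            break  # reached the next unit
        if ns.name == "p":
            print(ns)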
Output:
<p>text1-1</p>
<p>text1-2</p>
<p>text2-1</p>
<p>text2-2</p>
<p>text3-1</p>
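If the paragraphs should end up grouped per unit (one row per <h3>, as in the question) rather than printed one by one, the same sibling loop can collect them into a list. This is a sketch under that assumption; `rows` and `paragraphs` are illustrative names.

```python
from bs4 import BeautifulSoup

html = """<h2>January, 2020</h2>
<h3>facility</h3>
<p>text1-1</p>
<p>text1-2</p>
<h2>April, 2020</h2>
<h3>scientists</h3>
<p>text2-1</p>
<p>text2-2</p>
<h2>June, 2020</h2>
<h3>lawyers</h3>
<p>text3-1</p>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for h3 in soup.find_all("h3"):
    paragraphs = []
    for sib in h3.find_next_siblings():
        if sib.name == "h2":
            break  # the next unit starts here
        if sib.name == "p":
            paragraphs.append(str(sib))  # keep the <p> tags
    rows.append("\n".join(paragraphs))

for i, row in enumerate(rows, start=1):
    print(f"row #{i}:")
    print(row)
```

Each entry of `rows` holds the HTML of all paragraphs between one <h3> and the following <h2>.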

beautifulsoup: find elements after certain element, not necessarily siblings or children

Example html:
<div>
<p>p1</p>
<p>p2</p>
<p>p3<span id="target">starting from here</span></p>
<p>p4</p>
</div>
<div>
<p>p5</p>
<p>p6</p>
</div>
<p>p7</p>
I want to search for <p> elements, but only those positioned after span#target.
It should return p4, p5, p6 and p7 in the above example.
I tried getting all <p> elements first and then filtering, but I don't know how to tell whether an element comes after span#target or not.
You can do this by using the find_all_next function in beautifulsoup.
from bs4 import BeautifulSoup
doc = "..."  # Read the HTML here
# Parse the HTML
soup = BeautifulSoup(doc, 'html.parser')
# Select the first element you want to use as the reference
span = soup.select("span#target")[0]
# Find all elements after the `span` element that have the tag - p
print(span.find_all_next("p"))
The above snippet will result in
[<p>p4</p>, <p>p5</p>, <p>p6</p>, <p>p7</p>]
Edit: as requested by the OP below, here is how to compare positions.
If you want to compare the positions of two elements, you'll have to rely on sourceline and sourcepos, which are provided by the html.parser and html5lib parsers.
First, store the sourceline and/or sourcepos of your reference element in a variable:
span_srcline = span.sourceline
span_srcpos = span.sourcepos
(You don't actually have to store them; you can read span.sourcepos directly as long as you keep a reference to the span.)
Now iterate through the result of find_all_next and compare the values:
for tag in span.find_all_next("p"):
    print(f'line diff: {tag.sourceline - span_srcline}, pos diff: {tag.sourcepos - span_srcpos}, tag: {tag}')
You're most likely interested in line numbers, as sourcepos denotes the position within a line.
However, sourceline and sourcepos mean slightly different things for each parser; check the docs for details.
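To make the "get all <p> elements first, then filter" idea from the question concrete, the sketch below compares (sourceline, sourcepos) tuples. It assumes html.parser, which records the position of each opening tag, and a Beautiful Soup version recent enough to expose these attributes; `is_after` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
<p>p1</p>
<p>p2</p>
<p>p3<span id="target">starting from here</span></p>
<p>p4</p>
</div>
<div>
<p>p5</p>
<p>p6</p>
</div>
<p>p7</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
target = soup.find(id="target")


def is_after(tag, reference):
    """True if tag's opening tag starts later in the source than reference's."""
    return (tag.sourceline, tag.sourcepos) > (reference.sourceline, reference.sourcepos)


after = [p.get_text() for p in soup.find_all("p") if is_after(p, target)]
print(after)
```

The paragraph that contains the target (p3) opens before the span on the same line, so it is excluded. For most cases find_all_next() remains the simpler option, since it does not depend on the parser.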
Try this
html_doc = """
<div>
<p>p1</p>
<p>p2</p>
<p>p3<span id="target">starting from here</span></p>
<p>p4</p>
</div>
<div>
<p>p5</p>
<p>p6</p>
</div>
<p>p7</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find(id="target").find_next('p').contents[0])
Result
p4
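Alternatively, try a CSS selector. Note that CSS can only reach the following siblings of the paragraph that contains the target, so this returns p4 but not p5, p6 or p7:
following = soup.select("p:has(#target) ~ p")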