modify html tags replace "src" with "data-src" - beautifulsoup

Is there a way to use BeautifulSoup to modify image tags? I have some HTML files in which I want to swap the "src" value with the "data-src" value of each img:
<img src="//Pictures/q-90-90.png" data-src="//Pictures/p720_test.jpg">
My code so far:
soup = BeautifulSoup(open("template/home.html"), 'lxml')
images = soup.findAll('img')
for i in images:
    # replace src with data-src
I am open to a solution using regex as well. Ideally, the output would be:
<img src="//Pictures/p720_test.jpg" data-src="//Pictures/q-90-90.png">

Don't use regex on HTML; try this instead:
soup.find('img')['src'] = "data-src"
Edit:
To swap the attribute values inside the <img> elements, try this:
target = soup.select_one('img')
old_src = target['src']
old_data = target['data-src']
target['src'] = old_data
target['data-src'] = old_src
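For completeness, a minimal sketch that applies the same swap to every image on the page rather than just the first one (the inline HTML string here is made up for the demo; in practice you would parse your file as in the question):

```python
from bs4 import BeautifulSoup

html = '<img src="//Pictures/q-90-90.png" data-src="//Pictures/p720_test.jpg">'
soup = BeautifulSoup(html, 'html.parser')

# Swap the src and data-src values on every img that carries both attributes
for img in soup.find_all('img'):
    if img.has_attr('src') and img.has_attr('data-src'):
        img['src'], img['data-src'] = img['data-src'], img['src']

print(soup)
```

Guarding with `has_attr` avoids a `KeyError` on images that only have one of the two attributes.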

Related

Beautifulsoup output with indentation

I'm new to Python web scraping and BeautifulSoup.
I'd like to format the following so that when it outputs the tags, it does so with indentation:
H1 text
	H2 text
		H3 text
	H2 text
...
etc.
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(website.content, 'html.parser')
tags = soup.find_all(['h1', 'h2'])
for soups in tags:
    print(soups.string)
Your help is much appreciated.
You can define a dictionary of indents/prefixes:
preString = {'h1': '', 'h2': '\t', 'h3':'\t\t', 'h4':'\t\t\t'}
then you can just loop and print like:
tags = soup.find_all([t for t in preString])
for soups in [t for t in tags if t.string]:
    print(preString[soups.name] + soups.string)
I filtered with if t.string in case the headers have tags inside rather than just text. Using .text gets you the full text regardless of child tags; if you want that, and you want your find_all to be independent of preString, you can instead:
tags = soup.find_all(['h1', 'h2'])
for soups in tags:
    preStr = preString[soups.name] if soups.name in preString else ''
    print(preStr + soups.text)
(You can add a default indent/prefix after the else when defining preStr)
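A self-contained version of the approach, with made-up heading text, to show the expected output:

```python
from bs4 import BeautifulSoup

# Inline document standing in for the fetched page
html = """
<h1>Title</h1>
<h2>Section</h2>
<h3>Subsection</h3>
<h2>Another section</h2>
"""
preString = {'h1': '', 'h2': '\t', 'h3': '\t\t', 'h4': '\t\t\t'}

soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all(list(preString))

# One line per heading, indented by its level
lines = [preString[t.name] + t.string for t in tags if t.string]
print('\n'.join(lines))
```

Printing this gives `Title` flush left, `Section` and `Another section` indented one tab, and `Subsection` indented two tabs.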

Find link in text and replace with "a" tag

I have partially good HTML and I need to create hyperlinks, like this:
Superotto: risorse audiovisive per superare i pregiudizi e celebrare
l’otto marzo, in “Indire Informa”, 5 marzo 2021,
https://www.indire.it/2021/03/05/superotto-risorse-audiovisive-per-superare-i-pregiudizi-e-celebrare-lotto-marzo/;
Sezione Superotto in
https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto.
Has to become:
Superotto: risorse audiovisive per superare i pregiudizi e celebrare
l’otto marzo, in “Indire Informa”, 5 marzo 2021, <a href="https://www.indire.it/2021/03/05/superotto-risorse-audiovisive-per-superare-i-pregiudizi-e-celebrare-lotto-marzo/">https://www.indire.it/2021/03/05/superotto-risorse-audiovisive-per-superare-i-pregiudizi-e-celebrare-lotto-marzo/</a>;
Sezione Superotto in <a href="https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto">https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto</a>.
BeautifulSoup doesn't seem to find the http links well, so I used this regex with plain Python findall, but I cannot substitute or compose the text. Right now I have:
links = re.findall(r"(http|ftp|https:\/\/)([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])", str(soup))
link_to_replace = []
for l in links:
    link = ''.join(l)
    if link in soup.find("body").text:
        good_link = '<a href="' + link + '">' + link + '</a>'
        fixed_text = soup.replace(link, good_link)
        soup.replace_with(fixed_text)
I tried multiple solutions in the last two lines (this is just one), none worked.
Perhaps as follows, where I first identify the relevant anchor elements and strip out any attributes other than the href, then substitute each href link with the anchor's HTML:
import re
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://rivista.clionet.it/vol5/giorgi-zoppi-la-ricerca-indire-tra-uso-didattico-del-patrimonio-storico-culturale-e-promozione-delle-buone-pratiche/')
soup = bs(r.text, 'lxml')
item = soup.select_one('p:has(a[id="ft-note-16"])')
text = item.text
for tag in item.select('a:not([id])'):
    href = tag['href']
    tag.attrs = {'href': href}
    text = re.sub(href, str(tag), text)
text = re.sub(item.a.text, '', text).strip()
print(text)
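An alternative that avoids running regex over the whole document: walk only the plain-text nodes and wrap each bare URL in a real <a> tag. This is a sketch under simplifying assumptions (the URL pattern is crude, the sample string is shortened from the question, and multi-argument replace_with needs bs4 >= 4.10):

```python
import re
from bs4 import BeautifulSoup

# Crude URL pattern; the final character class stops it before trailing punctuation
URL_RE = re.compile(r'https?://\S+[^\s.,;]')

html = '<p>Sezione Superotto in https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto.</p>'
soup = BeautifulSoup(html, 'html.parser')

# Only touch text nodes, so existing tags are never corrupted
for text_node in soup.find_all(string=URL_RE):
    new_html = URL_RE.sub(lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), str(text_node))
    fragment = BeautifulSoup(new_html, 'html.parser')
    # Replace the text node with the parsed mix of text and <a> tags
    text_node.replace_with(*fragment.contents)

print(soup)
```

Note that this simple version would also match URLs in text that is already inside an <a> tag, so real input may need an extra check on the node's parent.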

beautifulsoup: find elements after certain element, not necessarily siblings or children

Example html:
<div>
<p>p1</p>
<p>p2</p>
<p>p3<span id="target">starting from here</span></p>
<p>p4</p>
</div>
<div>
<p>p5</p>
<p>p6</p>
</div>
<p>p7</p>
I want to search for <p> elements, but only those positioned after span#target.
It should return p4, p5, p6 and p7 in the example above.
I tried getting all the <p> elements first and then filtering, but I don't know how to judge whether an element is after span#target or not.
You can do this by using the find_all_next method in BeautifulSoup.
from bs4 import BeautifulSoup
doc = # Read the HTML here
# Parse the HTML
soup = BeautifulSoup(doc, 'html.parser')
# Select the first element you want to use as the reference
span = soup.select("span#target")[0]
# Find all elements after the `span` element that have the tag - p
print(span.find_all_next("p"))
The above snippet will result in
[<p>p4</p>, <p>p5</p>, <p>p6</p>, <p>p7</p>]
Edit: As per the request to compare position below by OP-
If you want to compare position of 2 elements, you'll have to rely on sourceline and sourcepos provided by the html.parser and html5lib parsing options.
First off, store the sourceline and/or sourcepos of your reference element in a variable.
span_srcline = span.sourceline
span_srcpos = span.sourcepos
(you don't actually have to store them though, you can just do span.sourcepos directly as long as you have the span stored)
Now iterate through the result of find_all_next and compare the values-
for tag in span.find_all_next("p"):
    print(f'line diff: {tag.sourceline - span_srcline}, pos diff: {tag.sourcepos - span_srcpos}, tag: {tag}')
You're most likely interested in line numbers, though, since sourcepos denotes the position within a line.
Note, however, that sourceline and sourcepos mean slightly different things for each parser; check the docs for the details.
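A minimal illustration of the sourceline comparison with html.parser (the three-line document is made up for the demo):

```python
from bs4 import BeautifulSoup

doc = """<p>before</p>
<span id="target">starting from here</span>
<p>after</p>"""

soup = BeautifulSoup(doc, 'html.parser')
span = soup.select_one('span#target')

# With html.parser, sourceline is the 1-based line of the element's start tag
following = span.find_all_next('p')
for tag in following:
    print(tag.sourceline - span.sourceline, tag.text)
```

Only the <p> after the span is returned, and its sourceline is strictly greater than the span's, which is what the comparison relies on.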
Try this
html_doc = """
<div>
<p>p1</p>
<p>p2</p>
<p>p3<span id="target">starting from here</span></p>
<p>p4</p>
</div>
<div>
<p>p5</p>
<p>p6</p>
</div>
<p>p7</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find(id="target").find_next('p').contents[0])
Result
p4
try
span = soup.select("span > #target > p")

Extract text from p only if preceding header exists using Beautifulsoup

I want to extract the text of a paragraph element using BeautifulSoup.
The HTML looks something like this:
<span class="span_class">
    <h1>heading1</h1>
    <p>para1</p>
    <h1>heading 2</h1>
    <p>para2</p>
</span>
I want to extract the text from the first p only if an h1 exists before it, and so on.
So far I have tried:
x = soup.findAll('span', {'class': 'span_class'})
y = x.findAll('p')[0].text
But I am not getting it.
You can use CSS sibling selector here:
paragraphs = x.select('h1 + p')
# `paragraphs` now contains two elements: <p>para1</p> and <p>para2</p>
This will select only those P elements that have immediate H1 siblings before them.
If you want to do some more logic based on H1 content, you can do this:
for p in x.select('h1:first-child + p'):
    # `p` is an element that has an `h1` before it;
    # `find_previous_sibling` skips the whitespace text nodes in between.
    h1 = p.find_previous_sibling('h1')
    if h1.text == 'heading1':
        # We got the `p` that has the `h1` with content "heading1" before it.
        print(p, h1)
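To see the sibling selector in action on the question's snippet (with the missing quote fixed):

```python
from bs4 import BeautifulSoup

html = """<span class="span_class">
<h1>heading1</h1>
<p>para1</p>
<h1>heading 2</h1>
<p>para2</p>
</span>"""

soup = BeautifulSoup(html, 'html.parser')
span = soup.find('span', {'class': 'span_class'})

# CSS `+` looks at element siblings only, so the newlines in between don't matter
paragraphs = [p.text for p in span.select('h1 + p')]
print(paragraphs)
```

Each selected p is guaranteed to have an h1 element immediately before it, which is exactly the "only if the preceding header exists" condition.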
from bs4 import BeautifulSoup as bs

html = '''<html>
<body>
<span class='span_class'>
<h1>heading1</h1>
<p>content1</p>
<p>content2</p>
<h1>heading2</h1>
<p>content3</p>
</span>
</body>
</html>'''

soup = bs(html, 'lxml')
x = soup.find_all('span', {'class': 'span_class'})  # find the spans
try:
    for y in x:
        heading = y.find_all('h1')  # find the h1 elements
        for something in heading:  # if an h1 exists
            if something.text == 'heading1':
                print(something.text)  # print the h1
                try:
                    p = something.find_next('p')  # try to find the next p
                    print(p)
                except:  # if there is no next <p>, do nothing
                    pass
            else:
                pass  # if it is not 'heading1', do nothing
except Exception as e:
    print(e)
Is this what you are looking for? It looks for your <span> and tries to find an <h1> inside it. If the <h1> exists in the <span>, it looks for the next <p>.

Replace occurrences on html file

I have to replace certain occurrences in thousands of HTML files, and I intend to use a Linux script for this.
Here are some examples of the replacements I have to do:
From: <a class="wiki_link" href="/WebSphere+Application+Server">
To: <a class="wiki_link" href="/confluence/display/WIKIHAB1/WebSphere%20Application%20Server">
That means adding /confluence/display/WIKIHAB1 as a prefix and replacing "+" with "%20".
I'll do the same for other tags, like img, iframe, and so on.
First, which tool should I use for this? Sed? Awk? Something else?
If anybody has an example, I'd really appreciate it.
After some research I found Beautiful Soup. It's a Python library for parsing HTML files that is really easy to use and very well documented.
I had no prior experience with Python and could write the code without problems.
Here is an example of Python code that performs the replacement mentioned in the question.
#!/usr/bin/python
import os
from bs4 import BeautifulSoup

# Replaces the plus sign (+) with %20 and adds the /confluence... prefix to the
# href attribute of each anchor (a) tag that has wiki_link in its class attribute
def fixAnchorTags(soup):
    tags = soup.find_all('a')
    for tag in tags:
        newhref = tag.get("href")
        if newhref is not None:
            if tag.get("class") is not None and "wiki_link" in tag.get("class"):
                newhref = newhref.replace("+", "%20")
                newhref = "/confluence/display/WIKIHAB1" + newhref
                tag['href'] = newhref

# Creates a folder to save the converted files
def setup():
    if not os.path.exists("converted"):
        os.makedirs("converted")

# Runs all methods for each html file in the current folder
def run():
    for file in os.listdir("."):
        if file.endswith(".html"):
            print "Converting " + file
            htmlfile = open(file, "r")
            converted = open("converted/" + file, "w")
            soup = BeautifulSoup(htmlfile, "html.parser")
            fixAnchorTags(soup)
            converted.write(soup.prettify("UTF-8"))
            converted.close()
            htmlfile.close()

setup()
run()
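As a quick sanity check of the rewrite rule above, you can run the same logic on an in-memory snippet built from the question's example (the "WAS" link text is made up for the demo):

```python
from bs4 import BeautifulSoup

html = '<a class="wiki_link" href="/WebSphere+Application+Server">WAS</a>'
soup = BeautifulSoup(html, 'html.parser')

# Same rule as fixAnchorTags: prefix wiki_link hrefs and escape "+" as "%20"
for tag in soup.find_all('a'):
    href = tag.get('href')
    if href is not None and tag.get('class') is not None and 'wiki_link' in tag.get('class'):
        tag['href'] = '/confluence/display/WIKIHAB1' + href.replace('+', '%20')

print(soup)
```

The anchor comes out as href="/confluence/display/WIKIHAB1/WebSphere%20Application%20Server", matching the target in the question.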