Ignore / skip tags when prettifying with BeautifulSoup

Is it possible to ignore / skip certain tags when parsing and prettifying an HTML-document with BeautifulSoup?
I am using BeautifulSoup to prettify HTML-documents with large embedded SVG-images. There is no need to prettify the SVG-images and all of their child-elements. As performance is critical for this application, I thought I might be able to save some runtime by ignoring / skipping the SVG-elements when prettifying the HTML, and just include the SVG-elements as they originally were in the input.
I am aware of SoupStrainer but it seems to do the exact opposite of what I need. I have also read many of the posts here on StackOverflow and elsewhere, and none of them seem to address this issue.
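For reference, here is a minimal sketch (with made-up markup) of why SoupStrainer is the inverse of what I need — it keeps only the matching tags and discards everything else:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<html><body><p>keep me</p><svg>skip me</svg></body></html>'

# parse_only keeps ONLY the strained tags; the rest of the
# document is discarded -- the opposite of skipping them.
soup = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer('svg'))
print(soup)
```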
Example
# Messy HTML code.
messy = \
"""
<html> <head>
<title>
Some title</title>
</head> <body>
<svg>Don't parse and prettify this!</svg>
</body> </html>
"""
# Prettify the HTML code.
from bs4 import BeautifulSoup
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
Which produces the result:
<html>
<head>
<title>
Some title
</title>
</head>
<body>
<svg>
Don't parse and prettify this!
</svg>
</body>
</html>
Note that the <svg> element has also been parsed and prettified by BeautifulSoup. Is there a way to avoid this?
Thanks!

As far as I can tell, bs4 doesn't allow for skipping particular tags; but you could write your own parser (like here) and include or allow exceptions, or use regex to replace the tags you don't want to parse.
First, list the tags you want to skip parsing
skipTags = ['svg']
# skipTags = ['svg', 'script', 'style'] ## list all the tag names to skip
If you don't care about preserving the tags, you could just get rid of them entirely.
import re
from bs4 import BeautifulSoup

for n in skipTags:
    # raw pattern, non-greedy match, DOTALL so the tag body may span lines
    messy = re.sub(rf'<{n}\b[^>]*>.*?</{n}>', '', messy, flags=re.DOTALL)
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
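Putting the removal approach together on the sample from the question (a self-contained sketch; the pattern is a raw string with re.DOTALL so the match can span newlines):

```python
import re
from bs4 import BeautifulSoup

messy = """
<html> <head>
<title>
Some title</title>
</head> <body>
<svg>Don't parse and prettify this!</svg>
</body> </html>
"""

skipTags = ['svg']
for n in skipTags:
    # non-greedy so each element is matched individually
    messy = re.sub(rf'<{n}\b[^>]*>.*?</{n}>', '', messy, flags=re.DOTALL)

pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
print(pretty)  # the <svg> element is gone entirely
```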
If you want to preserve the tags, replace them with comment placeholders and then restore the originals after prettifying. [This can be significantly slower than just getting rid of them.]
import re
from bs4 import BeautifulSoup

cReps = []
for n in skipTags:
    rcpat = re.compile(rf'<{n}\b[^>]*>.*?</{n}>', re.DOTALL)
    cReps += [m.span() for m in rcpat.finditer(messy)]
cReps.sort()  # keep spans in document order when skipping several tags
# Replace from the end so earlier span positions stay valid.
for cri, (sPos, ePos) in list(enumerate(cReps))[::-1]:
    repCmt, orig = f'<!--do_not_parse__placeholder_{cri}-->', messy[sPos:ePos]
    messy = messy[:sPos] + repCmt + messy[ePos:]
    cReps[cri] = (repCmt, orig)
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
for repCmt, orig in cReps:
    pretty = pretty.replace(repCmt, orig, 1)
print('<!--messy-subbed-->', messy, '\n<!--pretty-->', pretty, sep='\n')
Printed output of the last statement above, with the sample HTML in your question looks like:
<!--messy-subbed-->
<html> <head>
<title>
Some title</title>
</head> <body>
<!--do_not_parse__placeholder_0-->
</body> </html>
<!--pretty-->
<html>
<head>
<title>
Some title
</title>
</head>
<body>
<svg>Don't parse and prettify this!</svg>
</body>
</html>
Note that I don't know whether either method will actually improve performance, especially once you account for the extra passes over the HTML string. You might want to look into https://thehftguy.com/2020/07/28/making-beautifulsoup-parsing-10-times-faster/
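If you do want to check, a rough micro-benchmark sketch (with a synthetic document, so the numbers are only indicative of your real inputs):

```python
import re
import timeit
from bs4 import BeautifulSoup

# Synthetic document with a large embedded SVG (assumed shape).
doc = ('<html><body><p>text</p><svg>'
       + '<circle r="1"></circle>' * 2000
       + '</svg></body></html>')

def prettify_full():
    return BeautifulSoup(doc, 'html.parser').prettify()

def prettify_stripped():
    stripped = re.sub(r'<svg\b[^>]*>.*?</svg>', '', doc, flags=re.DOTALL)
    return BeautifulSoup(stripped, 'html.parser').prettify()

t_full = timeit.timeit(prettify_full, number=3)
t_stripped = timeit.timeit(prettify_stripped, number=3)
print(f'full: {t_full:.3f}s  stripped: {t_stripped:.3f}s')
```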

Related

LinkedIn webscraping selenium issue

The element <div class="block mt2"> is not showing up when searching in the output of print(soup).
# Scrape the data of one LinkedIn profile and write it to a csv file
wd.get("https://www.linkedin.com/company/pacific-retail-capital-partners/")
soup = BeautifulSoup(wd.page_source, "html.parser")
soup
<html class="theme theme--mercado artdeco windows" lang="en"><head>
<script type="application/javascript">!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n.liVisibilityChangeListener))}(document,window);</script>
<title>LinkedIn</title>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta class="mercado-icons-sprite" content="https://static-exp2.licdn.com/sc/h/7438dbnn8galtczp2gk2s4bgb" id="artdeco-icons/static/images/sprite-asset" name="asset-url"/>
<meta content="" name="description"/>
<meta content="notranslate" name="google"/>
<meta content="voyager-web" name="service"/>
HTML from the browser inspector:
<div class="block mt2">
<div>
<h1 id="ember30" class="ember-view t-24 t-black t-bold
full-width" title="Pacific Retail Capital Partners">
<span dir="ltr">Pacific Retail Capital Partners</span>
</h1>
Since the HTML document is loaded in our script, we can scrape the name of the company using the div tag.
info_div = soup.find('div', {'class' : 'block mt2'})
print(info_div)
The output is null; I am not getting any information printed.
Can you explain what's happening and what needs to be rectified?
It seems the find() method doesn't support multiple class names.
Use the following css selector with select_one() to get the company details.
info_div = soup.select_one('div.block.mt2')
print(info_div.text)
or to get company name only use this.
company = soup.select_one('div.block.mt2 h1>span')
print(company.text)
If you still want to use the find() method, then try this.
info_div = soup.find('div', {'class' : 'mt2'})
print(info_div.text)
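To see both approaches side by side on standalone markup (a minimal sketch with names copied from the question): find() treats class as a multi-valued attribute and matches when the given value equals any one of the element's classes, while select_one() takes a full CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up markup mirroring the snippet from the question.
html = ('<div class="block mt2"><h1>'
        '<span dir="ltr">Pacific Retail Capital Partners</span>'
        '</h1></div>')
soup = BeautifulSoup(html, 'html.parser')

# find() matches if the value equals ONE of the element's classes.
by_find = soup.find('div', {'class': 'mt2'}).span.text
# select_one() can require both classes at once.
by_css = soup.select_one('div.block.mt2 h1>span').text
print(by_find, by_css)
```

Note that neither helps if the element simply isn't present in wd.page_source (e.g. it is rendered later by JavaScript), which is worth ruling out first.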

Is there any way for Python to pass information back to JavaScript in the given module?

def calculate(a, b):
    answer = a + b
    your = answer
This is a module I created for a calculator. I want to display the answer on an HTML web page via JavaScript, through eel.
So my question is: how can I pass the information back to JavaScript from Python (eel)?
For this code, I think you can create the same method in JavaScript as well. Use the following code:
function calculate(x, y) {
    return x + y;
}
Then you can call this function as var answer = calculate(5, 6); this will assign the result to the variable.
Looking at the documentation for eel it looks like you can try this:
On the line before your calculate() function in your Python source, add the decorator @eel.expose:
@eel.expose
def calculate(a, b):
    answer = a + b
    return answer
Your JavaScript source (lifted directly from the documentation) will look roughly like this:
<!DOCTYPE html>
<html>
<head>
    <title>Hello, World!</title>
    <script type="text/javascript" src="eel.js"></script>
    <script type="text/javascript">
        // Call the Python function, passing the numbers four and five.
        // eel returns the result asynchronously, so pass alert as the
        // callback that receives the answer.
        eel.calculate(4, 5)(alert);
    </script>
</head>
<body>
    Hello, World!
</body>
</html>
It would help us if you revised your question to include more of the source code from your Python and JavaScript programs. Some examples of the kinds of information you can provide are listed at https://stackoverflow.com/help/how-to-ask

Context Dictionary isn't being passed with render in view

Alright, I've been poking around the internet for a solution; I assume there's something obvious that I'm missing, but so far no good.
I'm currently having trouble with passing a context dictionary to a template in Django via my view. So far everything else seems to return, except for the dictionary that I'm passing to the template.
def search_subjects(request):
    """
    This is our search view. At present it collects queries relating to:
    - Subject ID
    - Study Name
    - Date Range Start
    - Date Range End
    Then validates these entries, after which it redirects to the search
    results view.
    :param request:
    :return: Redirect to search results if the search button is pressed and
    the form fields are valid, or renders this view again if the request is
    not POST
    """
    if request.method == 'POST':
        form = SearchForm(request.POST)
        if form.is_valid():
            search_dict = {}
            search = form.save(commit=False)
            search.subject_search = request.POST['subject_search']
            search.study_search = request.POST['study_search']
            if request.POST['date_range_alpha'] and \
                    dateparse.parse_datetime(request.POST['date_range_alpha']):
                search.date_range_alpha = request.POST['date_range_alpha']
            else:
                search.date_range_alpha = EPOCH_TIME
            if request.POST['date_range_omega'] and \
                    dateparse.parse_datetime(request.POST['date_range_omega']):
                with_tz = dateparse.parse_datetime(request.POST['date_range_omega'])
                search.date_range_omega = with_tz
            else:
                search.date_range_omega = timezone.now()
            search.save()
            for k, v in form.data.items():
                search_dict[k] = v
            print(search_dict)
            return render(request, 'dicoms/search_results.html', search_dict)
    else:
        form = SearchForm()
    return render(request, 'dicoms/search.html', {'form': form})
And my template here:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Search Results</title>
</head>
<body>
Here's what you searched for:
<div>{{ search_dict }}</div>
</body>
</html>
The page that I'm getting back:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Search Results</title>
</head>
<body>
Here's what you searched for:
<div></div>
</body>
</html>
What on earth am I missing here?
Ok, so I walked away from this for a bit and managed to solve it. I wasn't passing the context dictionary correctly: render() exposes the dictionary's keys as template variables, so {{ search_dict }} was never defined in the template; only keys like subject_search were. The fix can be seen below:
search.save()
context = {'search': search}
return render(request, 'dicoms/search_results.html', context)
Adjusting the template accordingly:
Here's what you searched for:
<div>{{ search.subject_search }}</div>
<div>{{ search.study_search }}</div>
<div>{{ search.date_range_alpha }}</div>
<div>{{ search.date_range_omega }}</div>
Results in:
Here's what you searched for:
<div>herp </div>
<div>herp </div>
<div>Jan. 1, 1970, midnight</div>
<div>Feb. 26, 2019, 11:05 p.m.</div>
Had I trusted in django and simply passed the whole search object in the beginning I wouldn't have ended up here. But you live and learn.

python splinter comparing unicode elementlist with string

I want to get all the anchor tag text from an iframe named "ListFirst". I'm iterating over the anchors and comparing each text with the string 'AGENT-WIN3E64 ' that I want to click. But the comparison e['text'] == u'AGENT-WIN3E64 ' evaluates to false even though the strings are the same. Please help.
Here is my code:
with iframe12.get_iframe('ListFirst') as iframe1231:
    anchorList = iframe1231.find_by_tag('a')
    for e in anchorList:
        if e['text'] == u'AGENT-WIN3E64 ':  # unicode string comparison
            e.click()
            break
With the setup below I tried to recreate the situation you describe. The .py script below seems to find the anchor just fine though.
index.html,
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<iframe name="mainframe" src="iframe1.html"></iframe>
</body>
</html>
iframe1.html,
<html>
<head></head>
<body>
<iframe name="childframe" src="iframe2.html"></iframe>
</body>
</html>
iframe2.html,
<html>
<head></head>
<body>
<a href="#">AGENT-WIN3E64 </a>
<a href="#">b</a>
<a href="#">c</a>
<a href="#">d</a>
<a href="#">e</a>
</body>
</html>
test.py
from splinter import Browser

browser = Browser('firefox', wait_time=10)
browser.visit("http://localhost:8000/index.html")
# get mainframe
with browser.get_iframe('mainframe') as mainframe:
    # get childframe
    with mainframe.get_iframe('childframe') as childframe:
        anchorList = childframe.find_by_tag('a')
        for e in anchorList:
            if e['text'] == u'AGENT-WIN3E64 ':  # unicode string comparison
                print "found anchor"
                e.click()
                break
This outputs,
found anchor
But note that you could also find the anchor directly using xpath,
anchor = childframe.find_by_xpath("//a[text() = 'AGENT-WIN3E64 ']")

How to add id using dojo.query to search element

I'm trying to add an id to an element using dojo.query. I'm not sure if it's possible though. I'm trying to use the code below to add the id, but it's not working.
dojo.query('div[style=""]').attr("id","main-body");
<div style="">
content
</div>
If this is not possible, is there another way to do it? Using javascript or jquery? Thanks.
Your way of adding an id to an element is correct.
The code runs fine for me in Firefox 17 and Chrome 23 but I have an issue in IE9. I suspect you may have the same issue.
In IE9 the query div[style=""] returns no results. The funny thing is, it works fine in compatibility mode!
It seems that in IE9 in normal mode, if an HTML element has an inline empty style attribute, that attribute is not preserved when the element is added to the DOM.
So a solution would be to use a different query to find the divs you want.
You could try to find the divs with an empty style attribute OR with no style attribute at all.
A query like this should work:
div[style=""], div:not([style])
Take a look at the following example:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Test Page</title>
    <script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/dojo/1.8.2/dojo/dojo.js"></script>
    <script type="text/javascript">
        dojo.require("dojo.NodeList-manipulate"); // just for the innerHTML() function
        dojo.addOnLoad(function () {
            var nodeListByAttr = dojo.query('div[style=""], div:not([style])');
            alert('Search by attribute nodeList length: ' + nodeListByAttr.length);
            nodeListByAttr.attr("id", "main-body");
            var nodeListByID = dojo.query('#main-body');
            alert('Search by id nodeList length: ' + nodeListByID.length);
            nodeListByID.innerHTML('Content set after finding the element by ID');
        });
    </script>
</head>
<body>
    <div style="">
    </div>
</body>
</html>
Hope this helps
@Nikanos' answer covers the query issue. I would like to add that any query returns an array of elements; in the case of Dojo it is a dojo/NodeList.
The problem is that you are about to assign the same id to multiple DOM nodes, especially with a query containing div:not([style]). I recommend using a more specific query, like the first div child of body:
var nodes = dojo.query('body > div:first-child');
nodes.attr("id", "main-body");
To make it more robust, do not manipulate all the nodes, just the first one (even though there should be just one):
dojo.query('body > div:first-child')[0].id = "main-body";
This works in IE9 too; see it in action: http://jsfiddle.net/phusick/JN4cz/
The same example written in Modern Dojo: http://jsfiddle.net/phusick/BReda/