python splinter comparing unicode elementlist with string - python-unicode

I want to get the text of every anchor tag in an iframe named "ListFirst". I iterate over the anchors and compare each one's text with the string 'AGENT-WIN3E64 ', which is the one I want to click. But the comparison e['text'] == u'AGENT-WIN3E64 ' evaluates to false even though the strings look the same. Please help.
Here is my code:
with iframe12.get_iframe('ListFirst') as iframe1231:
    anchorList = iframe1231.find_by_tag('a')
    for e in anchorList:
        if e['text'] == u'AGENT-WIN3E64 ':  # unicode string comparison
            e.click()
            break

With the setup below I tried to recreate the situation you describe. The .py script below seems to find the anchor just fine though.
index.html,
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<iframe name="mainframe" src="iframe1.html"></iframe>
</body>
</html>
iframe1.html,
<html>
<head></head>
<body>
<iframe name="childframe" src="iframe2.html"></iframe>
</body>
</html>
iframe2.html,
<html>
<head></head>
<body>
<a href="#">AGENT-WIN3E64 </a>
<a href="#">b</a>
<a href="#">c</a>
<a href="#">d</a>
<a href="#">e</a>
</body>
</html>
test.py
from splinter import Browser
browser = Browser('firefox', wait_time=10)
browser.visit("http://localhost:8000/index.html")
# get mainframe
with browser.get_iframe('mainframe') as mainframe:
    # get childframe
    with mainframe.get_iframe('childframe') as childframe:
        anchorList = childframe.find_by_tag('a')
        for e in anchorList:
            if e['text'] == u'AGENT-WIN3E64 ':  # unicode string comparison
                print("found anchor")
                e.click()
                break
This outputs,
found anchor
But note that you could also find the anchor directly using xpath,
anchor = childframe.find_by_xpath("//a[text() = 'AGENT-WIN3E64 ']")
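If the comparison still fails on the real page, one possible cause (an assumption, since the original post never identifies it) is that the anchor text differs from the literal only in whitespace, for example a no-break space or a missing trailing space. A minimal sketch of a whitespace-tolerant comparison follows; normalize_text and same_text are hypothetical helper names, not part of splinter.

```python
import unicodedata


def normalize_text(value):
    """Fold compatibility characters and collapse whitespace for comparison."""
    # NFKC turns compatibility characters such as the no-break space (U+00A0)
    # into their plain equivalents.
    folded = unicodedata.normalize("NFKC", value)
    # split() without arguments drops leading/trailing whitespace and
    # collapses internal runs, so join() yields single spaces only.
    return " ".join(folded.split())


def same_text(left, right):
    """Compare two strings while ignoring whitespace differences."""
    return normalize_text(left) == normalize_text(right)
```

Inside the loop, the check would then read same_text(e['text'], u'AGENT-WIN3E64') instead of the exact equality test.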

Related

Ignore / skip tags when prettifying with BeautifulSoup

Is it possible to ignore / skip certain tags when parsing and prettifying an HTML-document with BeautifulSoup?
I am using BeautifulSoup to prettify HTML-documents with large embedded SVG-images. There is no need to prettify the SVG-images and all of their child-elements. As performance is critical for this application, I thought I might be able to save some runtime by ignoring / skipping the SVG-elements when prettifying the HTML, and just include the SVG-elements as they originally were in the input.
I am aware of SoupStrainer but it seems to do the exact opposite of what I need. I have also read many of the posts here on StackOverflow and elsewhere, and none of them seem to address this issue.
Example
# Messy HTML code.
messy = \
"""
<html> <head>
<title>
Some title</title>
</head> <body>
<svg>Don't parse and prettify this!</svg>
</body> </html>
"""
# Prettify the HTML code.
from bs4 import BeautifulSoup
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
Which produces the result:
<html>
 <head>
  <title>
   Some title
  </title>
 </head>
 <body>
  <svg>
   Don't parse and prettify this!
  </svg>
 </body>
</html>
Note that the <svg> element has also been parsed and prettified by BeautifulSoup. Is there a way to avoid this?
Thanks!
As far as I can tell, bs4 doesn't allow for skipping particular tags; but you could write your own parser (like here) and include or allow exceptions, or use regex to replace the tags you don't want to parse.
First, list the tags you want to skip parsing
skipTags = ['svg']
# skipTags = ['svg', 'script', 'style'] ## list all the tag names to skip
If you don't care about preserving the tags, you could just get rid of them entirely.
import re
from bs4 import BeautifulSoup
for n in skipTags:
    # raw f-string avoids invalid-escape warnings; the non-greedy DOTALL
    # match also covers elements that span several lines
    messy = re.sub(rf'<{n}\b[^>]*>.*?</{n}>', '', messy, flags=re.DOTALL)
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
If you want to preserve the tags, replace them with comments, then swap the originals back in after prettifying. [This can be significantly slower than just getting rid of them.]
import re
from bs4 import BeautifulSoup
cReps = []
for n in skipTags:
    rcpat = re.compile(rf'<{n}\b[^>]*>.*?</{n}>', re.DOTALL)
    cReps += [m.span() for m in rcpat.finditer(messy)]
# sort by position so that replacing from the end never shifts earlier spans
cReps.sort()
for cri, (sPos, ePos) in list(enumerate(cReps))[::-1]:
    repCmt, orig = f'<!--do_not_parse__placeholder_{cri}-->', messy[sPos:ePos]
    messy = messy[:sPos] + repCmt + messy[ePos:]
    cReps[cri] = (repCmt, orig)
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
for repCmt, orig in cReps:
    pretty = pretty.replace(repCmt, orig, 1)
print('<!--messy-subbed-->', messy, '\n<!--pretty-->', pretty, sep='\n')
Printed output of the last statement above, with the sample HTML in your question looks like:
<!--messy-subbed-->
<html> <head>
<title>
Some title</title>
</head> <body>
<!--do_not_parse__placeholder_0-->
</body> </html>
<!--pretty-->
<html>
 <head>
  <title>
   Some title
  </title>
 </head>
 <body>
  <svg>Don't parse and prettify this!</svg>
 </body>
</html>
Note that I don't know if either method will actually improve performance, especially when you consider how many extra passes over the HTML string(s) they add. You might want to look into https://thehftguy.com/2020/07/28/making-beautifulsoup-parsing-10-times-faster/
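The placeholder idea can also be packaged as two small functions built on a re.sub callback, which avoids tracking string positions by hand. This is only a sketch: protect_tags and restore_tags are hypothetical names, the prettifying step in between is left to the caller, and nested elements with the same tag name are not handled.

```python
import re


def protect_tags(html, tag_names):
    """Swap each <tag>...</tag> element for a comment placeholder.

    Returns the substituted markup and the list of original elements.
    """
    saved = []
    names = "|".join(re.escape(name) for name in tag_names)
    # \1 forces the closing tag to match the opening tag name;
    # DOTALL lets an element span several lines.
    pattern = re.compile(rf"<({names})\b[^>]*>.*?</\1\s*>",
                         re.DOTALL | re.IGNORECASE)

    def stash(match):
        saved.append(match.group(0))
        return f"<!--do_not_parse__placeholder_{len(saved) - 1}-->"

    return pattern.sub(stash, html), saved


def restore_tags(html, saved):
    """Put the original elements back in place of their placeholders."""
    for index, original in enumerate(saved):
        placeholder = f"<!--do_not_parse__placeholder_{index}-->"
        html = html.replace(placeholder, original, 1)
    return html
```

A caller would run protect_tags(messy, skipTags), prettify the returned markup, and pass the prettified string to restore_tags. Whether this is faster than prettifying everything still has to be measured.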

Context Dictionary isn't being passed with render in view

Alright, I've been poking around the internet for a solution, and I suspect there's something obvious that I'm missing, but so far no good.
I'm currently having trouble with passing a context dictionary to a template in Django via my view. So far everything else seems to return, except for the dictionary that I'm passing to the template.
def search_subjects(request):
    """
    This is our search view, at present it collects queries relating to:
    - Subject ID
    - Study Name
    - Date Range Start
    - Date Range End
    Then validates these entries, after which it redirects to the search
    results view.
    :param request:
    :return: Redirect to search results if search button is pressed and form fields
    are valid or renders this view again if this request is not POST
    """
    if request.method == 'POST':
        form = SearchForm(request.POST)
        if form.is_valid():
            search_dict = {}
            search = form.save(commit=False)
            search.subject_search = request.POST['subject_search']
            search.study_search = request.POST['subject_search']
            if request.POST['date_range_alpha'] and \
                    dateparse.parse_datetime(request.POST['date_range_alpha']):
                search.date_range_alpha = request.POST['date_range_alpha']
            else:
                search.date_range_alpha = EPOCH_TIME
            if request.POST['date_range_omega'] and \
                    dateparse.parse_datetime(request.POST['date_range_omega']):
                with_tz = dateparse.parse_datetime(request.POST['date_range_omega'])
                search.date_range_omega = with_tz
            else:
                search.date_range_omega = timezone.now()
            search.save()
            for k, v in form.data.items():
                search_dict[k] = v
            print(search_dict)
            return render(request, 'dicoms/search_results.html', search_dict)
    else:
        form = SearchForm()
    return render(request, 'dicoms/search.html', {'form': form})
And my template here:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Search Results</title>
</head>
<body>
Here's what you searched for:
<div>{{ search_dict }}</div>
</body>
</html>
The page that I'm getting back:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Search Results</title>
</head>
<body>
Here's what you searched for:
<div></div>
</body>
</html>
What on earth am I missing here?
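Ok, so I walked away from this for a bit and managed to solve it. I wasn't passing the context dictionary correctly: render() exposes the dictionary's keys as template variables, so {{ search_dict }} only resolves if the context has a key named search_dict, and mine only had the form field names. The fix can be seen below: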
search.save()
context = {'search': search}
return render(request, 'dicoms/search_results.html', context)
Adjusting the template accordingly:
Here's what you searched for:
<div>{{ search.subject_search }}</div>
<div>{{ search.study_search }}</div>
<div>{{ search.date_range_alpha }}</div>
<div>{{ search.date_range_omega }}</div>
Results in:
Here's what you searched for:
<div>herp </div>
<div>herp </div>
<div>Jan. 1, 1970, midnight</div>
<div>Feb. 26, 2019, 11:05 p.m.</div>
Had I trusted in Django and simply passed the whole search object in the beginning, I wouldn't have ended up here. But you live and learn.
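To make the lookup rule concrete, here is a plain-Python sketch of how a template variable name is resolved against a context. It is a simplified stand-in written for illustration, not Django's implementation; resolve is a hypothetical helper, and the only Django behavior it imitates is that a name missing from the context renders as an empty string by default.

```python
def resolve(path, context):
    """Resolve a dotted template-style name against a context dict.

    Returns '' when any part of the path cannot be found.
    """
    first, *rest = path.split(".")
    # The first segment must be a key of the context itself.
    if first not in context:
        return ""
    current = context[first]
    # Later segments are tried as dict keys first, then as attributes.
    for part in rest:
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif hasattr(current, part):
            current = getattr(current, part)
        else:
            return ""
    return current


form_data = {"subject_search": "herp", "study_search": "herp"}

# Passing the form data itself as the context: there is no key "search_dict".
missing = resolve("search_dict", form_data)

# Wrapping it under a named key makes the dotted lookup work.
found = resolve("search_dict.subject_search", {"search_dict": form_data})
```

Here missing is the empty string and found is "herp", which matches the empty div in the question and the working output after the fix.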

How can an nodelist created by getElementsByClassname or getElementsByTagName display its values as string

What I know about getElementsByClassName / getElementsByTagName is that both create a nodelist of the elements in question and that the nodelist elements are treated as objects. My problem is that I want to display the innerHTML of the elements inside the nodelist, but because they are objects this seems to be impossible.
Example:
<!DOCTYPE html>
<head>
<link rel="stylesheet" type="text/css" href="stylesheet.css">
<script src="javascript.js"></script>
</head>
<body>
<p id="pp"></p>
<button onclick="test()">push to test</button>
<p>dog</p>
<p>cat</p>
<p>snake</p>
</body>
//javascript.js file
function test() {
  var paragraph = document.getElementsByTagName("p"),
      para1 = paragraph[0].innerHTML,
      ansBox = document.getElementById("pp");
  ansBox.innerHTML = para1;
}
This is a condensed version of longer code. I think the para1 variable should be a string, and the assignment statement should then assign that string to ansBox.innerHTML, but instead I get nothing. I have reworked several versions of this code and none work. How can you get the text of the elements inside a nodelist to display in the ansBox?
Your script is loaded before the DOM has loaded if you load it inside head like that. Load it at the end of body instead:
<!DOCTYPE html>
<head>
<link rel="stylesheet" type="text/css" href="stylesheet.css">
</head>
<body>
<p id="pp"></p>
<button onclick="test()">push to test</button>
<p>dog</p>
<p>cat</p>
<p>snake</p>
<script src="javascript.js"></script> <!-- load it here -->
</body>
Also, just so you know, paragraph[0] and ansBox refer to the same HTMLParagraphElement, which is empty to begin with (it is the first p in the document).
In the JavaScript code above, you took the HTML inside an empty element and assigned it back to that same element, so of course you get an empty value.
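The index problem is easy to check outside the browser. The sketch below uses Python's html.parser as a stand-in for the DOM (an assumption made purely for illustration; ParagraphCollector and paragraph_texts are hypothetical names) and lists the text of each p element in document order, the same order getElementsByTagName("p") uses.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Only text that sits inside a <p> is recorded.
        if self._in_p:
            self.paragraphs[-1] += data


PAGE = """
<body>
<p id="pp"></p>
<button onclick="test()">push to test</button>
<p>dog</p>
<p>cat</p>
<p>snake</p>
</body>
"""


def paragraph_texts(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs
```

paragraph_texts(PAGE) puts the empty "pp" paragraph at index 0 and "dog" at index 1, so in the JavaScript the first animal is paragraph[1], not paragraph[0].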

Difference between innerhtml and outerhtml in cocoa WebView

I am using a Cocoa WebView for rich text editing in my application. I am confused by the innerHTML and outerHTML properties available in WebKit.
Can anyone explain what is the difference between
[(DOMHTMLElement *)[[[webView mainFrame] DOMDocument] documentElement] outerHTML];
AND
[(DOMHTMLElement *)[[[webView mainFrame] DOMDocument] documentElement] outerText];
innerHTML is a property of a DOM element that represents the HTML
inside the element, i.e. between the opening and closing tags. It has
been widely copied, however implementations vary (probably because it
has no published standard[1]) particularly in how they treat element
attributes.
outerHTML is similar to innerHTML, it is an element property that
includes the opening and closing tags as well as the content. It
hasn't been as widely copied as innerHTML so it remains more-or-less
IE only.
<p id="pid">welcome</p>
innerHTML of element "pid" == welcome
outerHTML of element "pid" == <p id="pid">welcome</p>
whereas:
innerText The textual content of the container.
outerText Same as innerText when accessed for read; replaces the whole element when assigned a new value.
<p id="pid">welcome</p>
innerText of element "pid" == welcome
outerText of element "pid" == welcome
Suppose we have a page loaded to webview with html
<html>
<head><title>Your Title</title></head>
<body>
<h1>Heading</h1>
<p id="para" >hi <b>Your_Name</b></p>
</body>
</html>
Now,
(DOMHTMLElement *)[[[webView mainFrame] DOMDocument] documentElement]
will return the DOMHTMLElement "html", and
outerHTML will return the complete HTML:
<html>
<head><title>Your Title</title></head>
<body>
<h1>Heading</h1>
<p id="para">hi <b>Your_Name</b></p>
</body>
</html>
outerText will return the text:
Heading
hi Your_Name
For example, if we take the p tag in this case:
outerHTML will return - <p id="para">hi <b>Your_Name</b></p>
outerText will return - hi Your_Name
innerHTML will return - hi <b>Your_Name</b>
innerText will return - hi Your_Name
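The same three views of an element can be reproduced outside WebKit. The sketch below uses Python's xml.etree.ElementTree as a stand-in (an assumption for illustration; it is not the WebKit DOM API, and outer_html, inner_html and inner_text are hypothetical helper names) on the p element from the example above.

```python
import xml.etree.ElementTree as ET


def outer_html(element):
    """Markup of the element itself, tags included."""
    return ET.tostring(element, encoding="unicode")


def inner_html(element):
    """Markup between the element's opening and closing tags."""
    parts = [element.text or ""]
    for child in element:
        # tostring() on a child also emits the text that follows it (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def inner_text(element):
    """Text content only, with all tags stripped."""
    return "".join(element.itertext())


para = ET.fromstring('<p id="para">hi <b>Your_Name</b></p>')
```

For para, outer_html gives <p id="para">hi <b>Your_Name</b></p>, inner_html gives hi <b>Your_Name</b>, and inner_text gives hi Your_Name, matching the values listed above.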
I have explained this with an example; the definitions of these four terms are already given in the other answer.
<!DOCTYPE html>
<html>
<head>
<title>innerHTML and outerHTML | Javascript Usages</title>
</head>
<body>
<div id="replace">REPLACE By inner or outer HTML</div>
<script>
var userwant = "inner"; // set to "outer" to replace the whole element instead
var replacement = "replacement text"; // placeholder value for illustration
if (userwant === "inner") {
  // replaces only the message between <div> and </div>; the <div> itself stays
  document.querySelector("#replace").innerHTML = replacement;
} else if (userwant === "outer") {
  // replaces the entire <div> ~ </div> element, tags included
  document.querySelector("#replace").outerHTML = replacement;
}
</script>
</body>
</html>

HTML Parsing Objective C

<html>
<head>
<title>this value i want to grab</title>
</head>
<body> </body>
</html>
I have this HTML in an NSString, and I want to grab the value inside the title tag.
Can anybody help me?
Use NSXMLParser, together with NSXMLParserDelegate.
In your case it is really simple. You watch for @"title" in parser:didStartElement:namespaceURI:qualifiedName:attributes:, append the content to your own string variable in parser:foundCharacters:, and finalize it in parser:didEndElement:namespaceURI:qualifiedName:.
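The three callbacks follow the usual event-driven (SAX-style) pattern. As a language-neutral illustration, here is the same start/characters/end flow sketched with Python's html.parser; this is an analogy to the NSXMLParserDelegate approach, not Objective-C code, and TitleGrabber and extract_title are hypothetical names.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Capture the text between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._buffer = None  # a list while inside <title>, otherwise None

    def handle_starttag(self, tag, attrs):
        # Counterpart of the didStartElement callback: start collecting.
        if tag == "title":
            self._buffer = []

    def handle_data(self, data):
        # Counterpart of foundCharacters: may fire several times per element.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Counterpart of didEndElement: finalize the collected string.
        if tag == "title" and self._buffer is not None:
            self.title = "".join(self._buffer)
            self._buffer = None


def extract_title(html):
    grabber = TitleGrabber()
    grabber.feed(html)
    grabber.close()
    return grabber.title
```

Collecting into a buffer matters because the character callback is not guaranteed to deliver the whole text in one call, which is also why the Objective-C version appends to a string rather than assigning it.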