LinkedIn web scraping issue with Selenium

Element <div class="block mt2"> is not showing up when searching in output of print(soup).
# Scrape the data of 1 LinkedIn profile, write the data to a csv file
wd.get("https://www.linkedin.com/company/pacific-retail-capital-partners/")
soup = BeautifulSoup(wd.page_source, "html.parser")
soup
Output exceeds the size limit. Open the full output data in a text editor
<html class="theme theme--mercado artdeco windows" lang="en"><head>
<script type="application/javascript">!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n.liVisibilityChangeListener))}(document,window);</script>
<title>LinkedIn</title>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta class="mercado-icons-sprite" content="https://static-exp2.licdn.com/sc/h/7438dbnn8galtczp2gk2s4bgb" id="artdeco-icons/static/images/sprite-asset" name="asset-url"/>
<meta content="" name="description"/>
<meta content="notranslate" name="google"/>
<meta content="voyager-web" name="service"/>
HTML inspect
<div class="block mt2">
<div>
<h1 id="ember30" class="ember-view t-24 t-black t-bold
full-width" title="Pacific Retail Capital Partners">
<span dir="ltr">Pacific Retail Capital Partners</span>
</h1>
Since the HTML document is loaded in our script, we should be able to scrape the name of the company using the div tag.
info_div = soup.find('div', {'class' : 'block mt2'})
print(info_div)
The output is None; no information is printed.
Can you explain what is happening and what needs to be fixed?

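find() with {'class': 'block mt2'} only matches when the class attribute is exactly the string "block mt2", so it is fragile for elements that carry several classes.
Use a CSS selector with select_one() instead to get the company details; it matches both classes regardless of their order.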
info_div = soup.select_one('div.block.mt2')
print(info_div.text)
Or, to get only the company name, use this:
company = soup.select_one('div.block.mt2 h1>span')
print(company.text)
If you still want to use find(), match on a single class instead:
info_div = soup.find('div', {'class' : 'mt2'})
print(info_div.text)
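The difference between the two matching rules can be sketched with plain string handling. This is a simplified model using only the standard library, not BeautifulSoup's implementation; the function names exact_class_match and token_class_match are hypothetical.

```python
def exact_class_match(class_attr, wanted):
    """Model of matching on the full class string, as with {'class': 'block mt2'}."""
    return class_attr == wanted


def token_class_match(class_attr, wanted_tokens):
    """Model of a CSS selector such as div.block.mt2: every class token must be present."""
    return set(wanted_tokens) <= set(class_attr.split())


# Compare both rules on a few class attribute values.
for class_attr in ["block mt2", "mt2 block", "block mt2 extra"]:
    print(
        repr(class_attr),
        exact_class_match(class_attr, "block mt2"),
        token_class_match(class_attr, ["block", "mt2"]),
    )
```

Since the inspected element's class attribute is exactly "block mt2", either rule would match it. If both lookups still return None, the element is likely missing from wd.page_source itself (an assumption: for example, content rendered later by JavaScript or hidden behind a login).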

Related

Ignore / skip tags when prettifying with BeautifulSoup

Is it possible to ignore / skip certain tags when parsing and prettifying an HTML-document with BeautifulSoup?
I am using BeautifulSoup to prettify HTML-documents with large embedded SVG-images. There is no need to prettify the SVG-images and all of their child-elements. As performance is critical for this application, I thought I might be able to save some runtime by ignoring / skipping the SVG-elements when prettifying the HTML, and just include the SVG-elements as they originally were in the input.
I am aware of SoupStrainer but it seems to do the exact opposite of what I need. I have also read many of the posts here on StackOverflow and elsewhere, and none of them seem to address this issue.
Example
# Messy HTML code.
messy = \
"""
<html> <head>
<title>
Some title</title>
</head> <body>
<svg>Don't parse and prettify this!</svg>
</body> </html>
"""
# Prettify the HTML code.
from bs4 import BeautifulSoup
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
Which produces the result:
<html>
 <head>
  <title>
   Some title
  </title>
 </head>
 <body>
  <svg>
   Don't parse and prettify this!
  </svg>
 </body>
</html>
Note that the <svg> element has also been parsed and prettified by BeautifulSoup. Is there a way to avoid this?
Thanks!
As far as I can tell, bs4 doesn't allow skipping particular tags, but you could write your own parser that allows exceptions, or use regex to replace the tags you don't want to parse.
First, list the tags you want to skip parsing:
skipTags = ['svg']
# skipTags = ['svg', 'script', 'style'] ## list all the tag names to skip
If you don't care about preserving the tags, you could just get rid of them entirely.
import re
from bs4 import BeautifulSoup
# Raw f-string so that \s is passed to the regex engine unchanged.
for n in skipTags:
    messy = re.sub(rf'<{n}\s*.*\s*>\s*.*\s*</{n}>', '', messy)
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
If you want to preserve the tags, replace them with comments and then swap the comments back after prettifying. [This can be significantly slower than just getting rid of them.]
import re
from bs4 import BeautifulSoup
cReps = []
for n in skipTags:
    rcpat = re.compile(rf'<{n}\s*.*\s*>\s*.*\s*</{n}>')
    cReps += [m.span() for m in rcpat.finditer(messy)]
# Sort by position so that replacing from the end keeps earlier spans valid.
cReps.sort()
for cri, (sPos, ePos) in list(enumerate(cReps))[::-1]:
    repCmt, orig = f'<!--do_not_parse__placeholder_{cri}-->', messy[sPos:ePos]
    messy = messy[:sPos] + repCmt + messy[ePos:]
    cReps[cri] = (repCmt, orig)
pretty = BeautifulSoup(markup=messy, features='html.parser').prettify()
for repCmt, orig in cReps:
    pretty = pretty.replace(repCmt, orig, 1)
print('<!--messy-subbed-->', messy, '\n<!--pretty-->', pretty, sep='\n')
The printed output of the last statement above, with the sample HTML from your question, looks like:
<!--messy-subbed-->
<html> <head>
<title>
Some title</title>
</head> <body>
<!--do_not_parse__placeholder_0-->
</body> </html>
<!--pretty-->
<html>
 <head>
  <title>
   Some title
  </title>
 </head>
 <body>
  <svg>Don't parse and prettify this!</svg>
 </body>
</html>
Note that I don't know whether either method will actually improve performance, especially considering how many extra passes over the HTML string(s) they make. You might want to look into https://thehftguy.com/2020/07/28/making-beautifulsoup-parsing-10-times-faster/
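For SVG elements that span several lines, the placeholder idea can be wrapped in two small helpers. This is only a sketch under stated assumptions: protect_tags and restore_tags are hypothetical names, the protected elements are assumed not to be nested inside elements of the same name, and str.strip() stands in for whatever formatter (such as prettify()) runs in between.

```python
import re


def protect_tags(html, tag_names):
    """Swap each <tag>...</tag> element for a comment placeholder.

    Returns the modified HTML and a list of (placeholder, original) pairs.
    """
    saved = []

    def stash(match):
        placeholder = f"<!--do_not_parse__placeholder_{len(saved)}-->"
        saved.append((placeholder, match.group(0)))
        return placeholder

    for name in tag_names:
        # Non-greedy body with DOTALL so an element may span several lines.
        pattern = re.compile(
            rf"<{name}\b[^>]*>.*?</{name}\s*>", re.DOTALL | re.IGNORECASE
        )
        html = pattern.sub(stash, html)
    return html, saved


def restore_tags(html, saved):
    """Put every stashed element back in place of its placeholder."""
    for placeholder, original in saved:
        html = html.replace(placeholder, original, 1)
    return html


messy = "<body>\n<svg width='1'>\n  <g>keep   me</g>\n</svg> <p>x</p></body>"
protected, saved = protect_tags(messy, ["svg"])
# A real formatter would run on `protected` here; strip() is a stand-in.
restored = restore_tags(protected.strip(), saved)
print(protected)
print(restored == messy.strip())
```

Whether this saves any time still has to be measured on real documents, since the regex passes add work of their own.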

Context Dictionary isn't being passed with render in view

Alright, I've been poking around the internet for a solution, sure that there's something obvious I'm missing, but so far no luck.
I'm currently having trouble with passing a context dictionary to a template in Django via my view. So far everything else seems to return, except for the dictionary that I'm passing to the template.
def search_subjects(request):
    """
    This is our search view, at present it collects queries relating to:
        - Subject ID
        - Study Name
        - Date Range Start
        - Date Range End
    Then validates these entries, after which it redirects to the search
    results view.
    :param request:
    :return: Redirect to search results if search button is pressed and form fields
    are valid or renders this view again if this request is not POST
    """
    if request.method == 'POST':
        form = SearchForm(request.POST)
        if form.is_valid():
            search_dict = {}
            search = form.save(commit=False)
            search.subject_search = request.POST['subject_search']
            search.study_search = request.POST['subject_search']
            if request.POST['date_range_alpha'] and \
                    dateparse.parse_datetime(request.POST['date_range_alpha']):
                search.date_range_alpha = request.POST['date_range_alpha']
            else:
                search.date_range_alpha = EPOCH_TIME
            if request.POST['date_range_omega'] and \
                    dateparse.parse_datetime(request.POST['date_range_omega']):
                with_tz = dateparse.parse_datetime(request.POST['date_range_omega'])
                search.date_range_omega = with_tz
            else:
                search.date_range_omega = timezone.now()
            search.save()
            for k, v in form.data.items():
                search_dict[k] = v
            print(search_dict)
            return render(request, 'dicoms/search_results.html', search_dict)
    else:
        form = SearchForm()
    return render(request, 'dicoms/search.html', {'form': form})
And my template here:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Search Results</title>
</head>
<body>
Here's what you searched for:
<div>{{ search_dict }}</div>
</body>
</html>
The page that I'm getting back:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Search Results</title>
</head>
<body>
Here's what you searched for:
<div></div>
</body>
</html>
What on earth am I missing here?
OK, so I walked away from this for a bit and managed to solve it. I wasn't passing the context dictionary correctly. The fix is below:
search.save()
context = {'search': search}
return render(request, 'dicoms/search_results.html', context)
Adjusting the template accordingly:
Here's what you searched for:
<div>{{ search.subject_search }}</div>
<div>{{ search.study_search }}</div>
<div>{{ search.date_range_alpha }}</div>
<div>{{ search.date_range_omega }}</div>
Results in:
Here's what you searched for:
<div>herp </div>
<div>herp </div>
<div>Jan. 1, 1970, midnight</div>
<div>Feb. 26, 2019, 11:05 p.m.</div>
Had I trusted Django and simply passed the whole search object in the beginning, I wouldn't have ended up here. But you live and learn.
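Why the original template rendered an empty div can be sketched in plain Python. The keys of the context dictionary become the template variable names, and Django's default is to render an unknown variable as an empty string. The resolve function below is a hypothetical, simplified stand-in for that lookup, not Django's actual code.

```python
def resolve(context, name):
    """Simplified model of {{ name }}: an unknown variable renders as ''."""
    return context.get(name, "")


search_dict = {"subject_search": "herp", "study_search": "herp"}

# What the view did: the dict itself is the context, so its keys are the variables.
flat_context = search_dict
# What the template expected: a variable literally named "search_dict".
wrapped_context = {"search_dict": search_dict}

print(repr(resolve(flat_context, "search_dict")))  # '' because no such key exists
print(resolve(flat_context, "subject_search"))     # herp
print(resolve(wrapped_context, "search_dict"))     # the whole dict
```

So either the template should reference the keys directly ({{ subject_search }}), or the view should wrap the data under the name the template uses, which is what passing {'search': search} does.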

How can a NodeList created by getElementsByClassName or getElementsByTagName display its values as strings?

What I know about getElementsByClassName / getElementsByTagName is that both create a NodeList of the elements in question, and that the NodeList elements are treated as objects. My problem is that I want to display the innerHTML of the elements inside the NodeList, but because they are objects this seems to be impossible.
Example:
<!DOCTYPE html>
<head>
<link rel="stylesheet" type="text/css" href="stylesheet.css">
<script src="javascript.js"></script>
</head>
<body>
<p id="pp"></p>
<button onclick="test()">push to test</button>
<p>dog</p>
<p>cat</p>
<p>snake</p>
</body>
//javascript.js file
function test() {
    var paragraph = document.getElementsByTagName("p"),
        para1 = paragraph[0].innerHTML,
        ansBox = document.getElementById("pp");
    ansBox.innerHTML = para1;
}
This is a condensed version of longer code. I think the para1 variable should be a string, and the assignment statement should then assign that string to ansBox.innerHTML, but instead I get nothing. I have reworked several versions of this code and none work. How can I get the text of the elements inside a NodeList to display in the ansBox?
If you load your script inside <head> like that, it is loaded before the DOM has loaded. Load it at the end of <body> instead:
<!DOCTYPE html>
<head>
<link rel="stylesheet" type="text/css" href="stylesheet.css">
</head>
<body>
<p id="pp"></p>
<button onclick="test()">push to test</button>
<p>dog</p>
<p>cat</p>
<p>snake</p>
<script src="javascript.js"></script> <!-- load it here -->
</body>
Also, note that paragraph[0] and ansBox refer to the same HTMLParagraphElement, <p id="pp">, which is empty to begin with.
In the JavaScript code above, you read the HTML of an empty element and then assign it back to itself, so of course you get an empty value. Use paragraph[1] to read "dog".

python splinter comparing unicode elementlist with string

I want to get the text of all the anchor tags in an iframe named "ListFirst". I'm iterating over the anchors and comparing each one's text with the string 'AGENT-WIN3E64 ', which is the one I want to click. But the comparison e['text'] == u'AGENT-WIN3E64 ' evaluates to false even though the strings look the same. Please help.
Here is my code:
with iframe12.get_iframe('ListFirst') as iframe1231:
    anchorList=iframe1231.find_by_tag('a')
    for e in anchorList:
        if e['text'] == u'AGENT-WIN3E64 ': #unicode string comparison
            e.click()
            break
With the setup below I tried to recreate the situation you describe. The .py script below seems to find the anchor just fine though.
index.html,
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<iframe name="mainframe" src="iframe1.html"></iframe>
</body>
</html>
iframe1.html,
<html>
<head></head>
<body>
<iframe name="childframe" src="iframe2.html"></iframe>
</body>
</html>
iframe2.html,
<html>
<head></head>
<body>
AGENT-WIN3E64
b
c
d
e
</body>
</html>
test.py
from splinter import Browser
browser = Browser('firefox', wait_time=10)
browser.visit("http://localhost:8000/index.html")
# get mainframe
with browser.get_iframe('mainframe') as mainframe:
    # get childframe
    with mainframe.get_iframe('childframe') as childframe:
        anchorList = childframe.find_by_tag('a')
        for e in anchorList:
            if e['text'] == u'AGENT-WIN3E64 ': #unicode string comparison
                print("found anchor")
                e.click()
                break
This outputs,
found anchor
But note that you could also find the anchor directly using xpath,
anchor = childframe.find_by_xpath("//a[text() = 'AGENT-WIN3E64 ']")
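Since the failure could not be reproduced, its cause in the original page is unknown. One possible cause, offered here only as an assumption, is an invisible difference such as a non-breaking space or extra whitespace in the scraped text. The sketch below shows how repr() exposes such a difference and how normalizing both sides makes the comparison robust; normalize_text is a hypothetical helper and the scraped value is made up for illustration.

```python
import unicodedata


def normalize_text(value):
    """NFKC-normalize (maps a non-breaking space to a plain space) and strip the ends."""
    return unicodedata.normalize("NFKC", value).strip()


expected = u"AGENT-WIN3E64 "
scraped = u"AGENT-WIN3E64\u00a0"  # hypothetical value ending in a non-breaking space

# repr() makes invisible characters visible when debugging.
print(repr(scraped), repr(expected))
print(scraped == expected)                                   # False
print(normalize_text(scraped) == normalize_text(expected))   # True
```

In the loop above, that would mean comparing normalize_text(e['text']) with normalize_text(u'AGENT-WIN3E64 ') instead of the raw strings.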

How to do standard layouts with StringTemplate?

With StringTemplate, what is the proper way to have a standard layout template such as:
<head>
..
</head>
<html>
$body()$
</html>
Where I can set the body template from my application, so that every template I use uses this fundamental layout?
Thanks.
I found it hiding in the documentation:
http://www.antlr.org/wiki/display/ST/StringTemplate+2.2+Documentation
"Include template whose name is computed via expr. The argument-list is a list of attribute assignments where each assignment is of the form attribute=expr. Example $(whichFormat)()$ looks up whichFormat's value and uses that as template name. Can also apply an indirect template to an attribute."
So my main layout template now looks like this:
<head>
<title>Sportello</title>
</head>
<html lang="en-US">
<body>
$partials/header()$
<section>$(body_template)()$</section>
$partials/footer()$
</body>
</html>
...to which I pass the subtemplate's name as an attribute.