Pandas web scraping(Beautiful soup) find in tag with class, another tag with a link. Then following the link inside href - pandas

I tried fins 'td' tag with specific attribute, and then find 'a' tag inside of the 'td' tag
for row in bs4.find_all('<td class="series-column"'):
for link in bs4.find_all('a'):
if link.has_attr('href') and (link.has_attr('class') == 'formatted-title external-link result-url'):
print(link.attrs['href'])
On the screenshot you see html for this page

Your bs4.find_all('<td class="series-column"') is wrong. You have to supply tag name and attributes you want to find, for example bs4.find_all('td', class_='series-column'). Or use CSS selector:
from bs4 import BeautifulSoup
txt = '''
<td class="series-column">
<a class="formatted-title external-link result-url" href="//knoema.com/...">link text</a>
</td>'''
soup = BeautifulSoup(txt, 'html.parser')
for link in soup.select('td.series-column a.formatted-title.external-link.result-url'):
print(link['href'])
Prints:
//knoema.com/...

Related

extract text from html string with Scrapy

Here is the html string in question.
<div class="def ddef_d db">a <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/book" title="book">book</a> of grammar <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/rule" title="rules">rules</a>: </div>
With BeautifulSoup, this code
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text
gets me
a book of grammar rules:
which is exactly what I want.
With scrapy, how do I get the same result?
from scrapy import Selector
sel = Selector(text=htmltxt)
sel.css('.ddef_d::text').getall()
this code gets me
['a ', ' of grammar ', ': ']
How should I fix it?
aYou can use this code to get all text inside div and its child:
text = ''.join(sel.css('.ddef_d ::text').getall())
print(text)
your selector returns text only from the div, but part of text located inside child elements (a), that's why you have to add space before ::text to include child text into result.

How to use BeautifulSoup to get content inside over-line tags

I would like to extract the content("_The_important_content_") from an HTML snippet as follows:
<div
class="
a:2
c:gray
m:da
"
>
_The_important_content_
</div>
My code is just:
for i in soup.findAll('div', class_="a:2 c:gray m:da"):
print(i.text)
But because the "class" field contains new line symbols and is expanded to multiple line so that BeautifulSoup cannot match, the code returns nothing. How can I specify the correct class field and get the content?
There are many tags with the same "class" value and other "class" value but I want to extract the contents from the tags with that specific "class" value.
Try this:
html='''
<div
class="
a:2
c:gray
m:da
"
>
_The_important_content_
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
item = soup.select("[class^=]")[0].text
print(item.strip())
Result:
_The_important_content_

Parsing text from certain "html elements" using selenium

What I've seen so far is that the page source of a webpage if filtered by selenium then it is possible to parse text or something necessary from that page source applying bs4 or lxml no matter the page source was javascript enabled or not. However, my question is how can I parse documents from a certain html elements by filtering selenium and then using bs4 or lxml library. if the below pasted element is considered then applying bs4 or lxml the way i move is:
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
<td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
#rest of the code here
from lxml.html import fromstring
tree = fromstring(html)
#rest of the code here
Now, how can I filter the above paste html portion using selenium and then apply bs4 library on it? Could not think of driver.page_source as it is only applicable when filtered from a webpage.
To be a little more specific, if I want to use something like below, then how can it be?
from selenium import webdriver
driver = webdriver.Chrome()
element_html = driver-------(html) #this "html" is the above pasted one
print(element_html)
driver.page_source would give you the complete HTML source code of the page at one particular moment. You, though, having an element instance, can get to it's outerHTML using .get_attribute() method:
element = driver.find_element_by_id("some_id")
element_html = element.get_attribute("outerHTML")
soup = BeautifulSoup(element_html, "lxml")
As far as extracting the span element source from out of the mouseover attribute - I would first parse the tr element with BeautifulSoup, get the onmouseover attribute and then use a regular expression to extract the html value from inside the Tip() function call. And then, re-parse the span html with BeautifulSoup:
import re
from bs4 import BeautifulSoup
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
<td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
soup = BeautifulSoup(html, "lxml")
mouse_over = soup.tr['onmouseover']
span = re.search(r"Tip\('(.*?)'\)", mouse_over).group(1)
span_soup = BeautifulSoup(span, "lxml")
print(span_soup.get_text())
Prints:
License: 20-214767 (Validity: 21/05/2022)20C-214769 (Validity: 21/05/2022)21-214768 (Validity: 21/05/2022)

Getting a specific part of a website with Beautiful Soup 4

I got the basics down of finding stuff with Beautiful Soup 4. However right now I am stuck with a specific problem.I want to scrape the "2DKT94P" from the data-oid of the below code:
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
<div class="listitem relative js-listitem ">
Any pointers on how I might do this? I would also appreciate a pointer for an advanced tutorial that covers this, and/or a link on where I would have been able to find this in the official documentation because I failed to recognize the correct part...
Thanks in advance!
you should locate the div tag using class attribute then get it's data-oid attribute
div = soup.find("div", class_="js-object")
oid = div['data-oid']
If your data is well formated you can do this via this way:
from bs4 import BeautifulSoup
example = """
<div class="js-object listitem_wrap " data-estateid="45784882" data-
oid="2DKT94P">
<div class="listitem relative js-listitem ">2DKT94P DIV</div>
</div>
<div>other div</div>"""
soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find(attrs= {"data-oid":"2DKT94P"})
print (RandomDIV.get_text().strip())
Outputs:
2DKT94P DIV
Find more info about find or find_all with attributes here.
Or via select:
RandomDIV = soup.select("div[data-oid='2DKT94P']")
print (RandomDIV[0].get_text().strip())
Find more about select.
EDIT:
Totally misunderstood the question. If you want to search only for data-oid you can do like this:
soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find_all(lambda tag: [t for t in tag.attrs if
t == 'data-oid'])
for div in RandomDIV:
#data-oid
print(div["data-oid"])
#text
print (div.text.strip())
Learn more here.

Beautiful Soup - how to get href

I can't seem to be able to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of html:
<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>
This is what I thought should work
href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]
.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag. Also, the text= attribute is for matching NavigableStrings, rather than elements.
The following does what you want, I think:
import re
from BeautifulSoup import BeautifulSoup
html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''
soup = BeautifulSoup(html)
for t in soup.findAll(text=re.compile(r'Website:')):
# Find the parent of the NavigableString, and see
# whether that's a <strong>:
s = t.parent
if s.name == 'strong':
print s.nextSibling.nextSibling['href']
... but that isn't very robust. If the enclosing div has a predictable ID, then it would better to find that, and then find the first <a> element within it.