I would like to get the text "Orange" using BeautifulSoup.
I only need the element that is nested inside the h2.
How can I get it?
<h2 class="heading">
<a><span class="name">Orange</span></a>
</h2>
from bs4 import BeautifulSoup
html = """<h2 class="heading">
<a><span class="name">Orange</span></a>
</h2>"""
soup = BeautifulSoup(html, 'html.parser')
# find_all returns every <span> whose class is "name"
for item in soup.find_all("span", {'class': 'name'}):
    print(item.text)
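If only the span nested inside the h2 is wanted, a CSS selector can express that nesting directly. The following is a minimal sketch built on the question's HTML; the extra span outside the h2 is an assumption, added only to show that it is skipped.

```python
from bs4 import BeautifulSoup

# The h2 block is from the question; the trailing <span> is a made-up
# extra element, added only to show that it is not matched.
html = """<h2 class="heading">
<a><span class="name">Orange</span></a>
</h2>
<span class="name">Apple</span>"""

soup = BeautifulSoup(html, "html.parser")

# "h2.heading span.name" matches span.name elements nested anywhere inside h2.heading
names = [span.get_text(strip=True) for span in soup.select("h2.heading span.name")]
print(names)  # ['Orange']
```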
I am trying to extract a number, e.g. "3762", from the div below with BeautifulSoup:
<div class="contentBox">
<div class="pid-box-1" data-pid-imprintid="3762">
</div>
<div class="pid-box-2" data-pid-imprintid="5096">
</div>
<div class="pid-box-1" data-pid-imprintid="10944">
</div>
</div>
The div comes from this website (a pharma medical database): Drugs.com.
I cannot use "class", since it changes from div to div and takes more values than just pid-box-1 and pid-box-2. I haven't had success using "data-pid-imprintid" either.
This is what I have tried; I know that I can't pass "data-pid-imprintid" the way I have done here:
soup = BeautifulSoup(html_text, 'lxml')
divs = soup.find_all('div', 'data-pid-imprintid')
for div in divs:
    item = div.find('div')
    id = item.get('data-pid-imprintid')
    print (id)
This gets the value of data-pid-imprintid from every div that has a data-pid-imprintid attribute:
soup = BeautifulSoup(html_text, 'lxml')
divs = soup.find_all("div", attrs={"data-pid-imprintid": True})
for div in divs:
    print(div.get('data-pid-imprintid'))
First of all, be aware that there is a small typo in your HTML (class="pid-box-1'); without fixing it, you will only get two ids back.
How to select?
As an alternative to find_all(), which works well, you can also use a CSS selector:
soup.select('div[data-pid-imprintid]')
This selects every <div> that has an attribute named data-pid-imprintid. To get the attribute values, iterate over the result set, for example with a list comprehension:
[e['data-pid-imprintid'] for e in soup.select('div[data-pid-imprintid]')]
Example
from bs4 import BeautifulSoup
html='''<div class="contentBox">
<div class="pid-box-1" data-pid-imprintid="3762">
</div>
<div class="pid-box-2" data-pid-imprintid="5096">
</div>
<div class="pid-box-1" data-pid-imprintid="10944">
</div>
</div>'''
soup = BeautifulSoup(html, 'lxml')
ids = [e['data-pid-imprintid'] for e in soup.select('div[data-pid-imprintid]')]
print(ids)
Output
['3762', '5096', '10944']
I am trying to pull the link for the latest droplist from https://www.supremecommunity.com/season/spring-summer2020/droplists/
If you right-click on "Latest" and click Inspect, you see this:
That link will change every week, so I am trying to pull it from that page.
When I run
import requests
from bs4 import BeautifulSoup
url = "https://www.supremecommunity.com/season/spring-summer2020/droplists/"
r = requests.get(url)
soup = BeautifulSoup(r.text,"html.parser")
my_data = soup.find('div', attrs = {'id': 'box-latest'})
I get:
<div class="col-sm-4 col-xs-12 app-lr-pad-2" id="box-latest">
<a class="block" href="/season/spring-summer2020/droplist/2020-03-26/">
<div class="feature feature-7 boxed text-center imagebg boxedred sc-app-boxlistitem" data-overlay="7">
<div class="empty-background-image-holder">
<img alt="background" src=""/>
</div>
<h2 class="pos-vertical-center">Latest</h2>
</div>
</a>
</div>
How can I just pull the "/season/spring-summer2020/droplist/2020-03-26/" part out?
import requests
from bs4 import BeautifulSoup
r = requests.get(
    "https://www.supremecommunity.com/season/spring-summer2020/droplists/")
soup = BeautifulSoup(r.content, "html.parser")
print(soup.find("div", id="box-latest").contents[1].get("href"))
Output:
/season/spring-summer2020/droplist/2020-03-26/
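Note that .contents[1] relies on the first child of the div being a whitespace text node, so it can break if the page's formatting changes. As a hedged alternative, the sketch below looks up the link by tag and attribute instead. It runs against a static, shortened copy of the markup from the question (an assumption made so that no network request is needed).

```python
from bs4 import BeautifulSoup

# Static, shortened copy of the markup shown in the question (no network request).
html = """<div class="col-sm-4 col-xs-12 app-lr-pad-2" id="box-latest">
<a class="block" href="/season/spring-summer2020/droplist/2020-03-26/">
<h2 class="pos-vertical-center">Latest</h2>
</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first matching element, or None if nothing matches
link = soup.select_one("div#box-latest a[href]")
href = link["href"] if link is not None else None
print(href)
```

Checking for None before indexing avoids a TypeError when the box is missing from the page.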
This is what I have:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
for i in soup.find("div", {"class":"navigation"}):
    print(i)
Currently the print output of "i" is:
<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>
I want to print out the href link "index.php?page=2".
When I try to use BeautifulSoup's "find", "select" or "attrs" on "i", I get an error. For instance, with
print(i.attrs["href"])
I get:
AttributeError: 'NavigableString' object has no attribute 'attrs'
How do I avoid the 'NavigableString' error with BeautifulSoup and get the text of href?
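The issue seems to be for i in soup.find. find returns a single element, and iterating over that element walks through its children, which include plain text nodes (NavigableString objects) such as the whitespace around the a tag; those have no attrs. If you're looking for only one element, there's no need to iterate over it, and if you're looking for multiple elements, find_all instead of find would probably match the intent.
More concretely, here are the two approaches. Note also that i is a div that contains the desired a as a child, so we need an extra step to reach it (this could be more direct with a CSS selector, or with an XPath in a library such as lxml).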
import requests
from bs4 import BeautifulSoup
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
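# Approach 1: iterate over every matching div with find_all
for i in soup.find_all("div", {"class": "navigation"}):
    print(i.find("a", href=True)["href"])
# Approach 2: take the single div returned by find
print(soup.find("div", {"class": "navigation"})
      .find("a", href=True)["href"])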
Output:
index.php?page=2
index.php?page=2
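To see where the NavigableString comes from, the sketch below lists the child types of the navigation div and then filters for Tag children. The HTML is an assumed stand-in for the real page, reduced to the one link shown in the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed stand-in for the page's navigation block (the real page is not fetched here).
html = """<div class="navigation">
<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
nav = soup.find("div", {"class": "navigation"})

# Iterating over a Tag yields its children: text nodes as well as tags.
child_types = [type(child).__name__ for child in nav]
print(child_types)

# Keep only the Tag children that carry an href attribute.
hrefs = [child["href"] for child in nav
         if isinstance(child, Tag) and child.has_attr("href")]
print(hrefs)
```

The newlines before and after the a tag show up as NavigableString children, which is why calling attrs on every child fails.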
From what I have seen so far, once Selenium has retrieved the page source of a webpage, it is possible to parse text or anything else needed from that source with bs4 or lxml, whether or not the page relies on JavaScript. My question is: how can I use Selenium to narrow the source down to certain HTML elements and then parse them with bs4 or lxml? Taking the element pasted below, this is how I would start with bs4 or lxml:
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
<td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
#rest of the code here
from lxml.html import fromstring
tree = fromstring(html)
#rest of the code here
Now, how can I use Selenium to isolate the HTML portion pasted above and then apply the bs4 library to it? I did not consider driver.page_source, as it only applies to a whole page loaded from a website.
To be a little more specific, if I want to use something like the following, how should it be written?
from selenium import webdriver
driver = webdriver.Chrome()
element_html = driver-------(html) #this "html" is the above pasted one
print(element_html)
driver.page_source gives you the complete HTML source of the page at one particular moment. If you have an element instance, though, you can get its outerHTML with the .get_attribute() method:
from selenium.webdriver.common.by import By

# find_element_by_id was removed in Selenium 4; use find_element with By.ID
element = driver.find_element(By.ID, "some_id")
element_html = element.get_attribute("outerHTML")
soup = BeautifulSoup(element_html, "lxml")
As for extracting the span element's source from the onmouseover attribute: I would first parse the tr element with BeautifulSoup, get the onmouseover attribute, and then use a regular expression to extract the HTML value from inside the Tip() function call. Then re-parse the span HTML with BeautifulSoup:
import re
from bs4 import BeautifulSoup
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
<td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
soup = BeautifulSoup(html, "lxml")
mouse_over = soup.tr['onmouseover']
span = re.search(r"Tip\('(.*?)'\)", mouse_over).group(1)
span_soup = BeautifulSoup(span, "lxml")
print(span_soup.get_text())
Prints:
License: 20-214767 (Validity: 21/05/2022)20C-214769 (Validity: 21/05/2022)21-214768 (Validity: 21/05/2022)
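get_text() runs the license entries together because the BR tags contribute no text. If the entries are wanted separately, one option is to iterate over stripped_strings instead. This is a sketch that starts from the span markup already extracted above and uses html.parser (an assumption, chosen to avoid the lxml dependency).

```python
from bs4 import BeautifulSoup

# The span markup extracted from the Tip() call in the example above.
span = ("<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)"
        "<BR />20C-214769 (Validity: 21/05/2022)"
        "<BR />21-214768 (Validity: 21/05/2022)</span>")

span_soup = BeautifulSoup(span, "html.parser")

# stripped_strings yields each text fragment with surrounding whitespace removed
parts = list(span_soup.stripped_strings)
licenses = parts[1:]  # drop the leading "License:" label
for entry in licenses:
    print(entry)
```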
I can't seem to be able to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of html:
<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>
This is what I thought should work
href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]
.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag. Also, the text= argument is for matching NavigableStrings, rather than elements.
The following does what you want, I think:
import re
from bs4 import BeautifulSoup
html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''
soup = BeautifulSoup(html, 'html.parser')
for t in soup.find_all(string=re.compile(r'Website:')):
    # Find the parent of the NavigableString, and see
    # whether that's a <strong>:
    s = t.parent
    if s.name == 'strong':
        # Skip the whitespace text node, then read the <a> tag's href
        print(s.next_sibling.next_sibling['href'])
... but that isn't very robust. If the enclosing div has a predictable ID, then it would be better to find that div, and then find the first <a> element within it.
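That ID-based lookup might be sketched as follows, assuming the id_Website id from the question is stable and using only the well-formed part of the snippet.

```python
from bs4 import BeautifulSoup

# Well-formed part of the question's snippet.
html = """<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the container by its id, then take the first <a> that has an href.
container = soup.find("div", id="id_Website")
link = container.find("a", href=True) if container is not None else None
href = link["href"] if link is not None else None
print(href)
```

This does not depend on how much whitespace sits between the strong and a tags.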