Extracting text within div tag itself with BeautifulSoup - beautifulsoup

I am trying to extract the number, e.g. "3762", from the div below with BeautifulSoup:
<div class="contentBox">
<div class="pid-box-1" data-pid-imprintid="3762">
</div>
<div class="pid-box-2" data-pid-imprintid="5096">
</div>
<div class="pid-box-1" data-pid-imprintid="10944">
</div>
</div>
The div comes from this website (a pharma medical database): Drugs.com.
I cannot use "class" because it changes from div to div (there are more values than just pid-box-1 and pid-box-2). I haven't had any success using "data-pid-imprintid" either.
This is what I have tried; I know I can't pass "data-pid-imprintid" the way I have done here:
soup = BeautifulSoup(html_text, 'lxml')
divs = soup.find_all('div', 'data-pid-imprintid')
for div in divs:
    item = div.find('div')
    id = item.get('data-pid-imprintid')
    print(id)

This prints the value of data-pid-imprintid for every div that has a data-pid-imprintid attribute:
soup = BeautifulSoup(html_text, 'lxml')
divs = soup.find_all("div", attrs={"data-pid-imprintid": True})
for div in divs:
    print(div.get('data-pid-imprintid'))
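To see why the attempt in the question finds nothing: a bare string passed as the second positional argument of find_all() is interpreted as a CSS class name, not as an attribute name. A minimal sketch of the difference, using the built-in html.parser (an assumption made here so the example has no lxml dependency) and a shortened copy of the HTML:

```python
from bs4 import BeautifulSoup

html_text = '''<div class="contentBox">
<div class="pid-box-1" data-pid-imprintid="3762"></div>
<div class="pid-box-2" data-pid-imprintid="5096"></div>
</div>'''

soup = BeautifulSoup(html_text, 'html.parser')

# A string in second position is treated as a class filter, so this
# looks for <div class="data-pid-imprintid"> and finds nothing.
by_class = soup.find_all('div', 'data-pid-imprintid')

# Passing True as the attribute value matches any div that has the attribute.
by_attr = soup.find_all('div', attrs={'data-pid-imprintid': True})
imprint_ids = [div['data-pid-imprintid'] for div in by_attr]

print(len(by_class))
print(imprint_ids)
```

The values come back as strings; wrap them in int() if numbers are needed.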

First of all, be aware that there is a small typo in your HTML (class="pid-box-1'); without fixing it, you will only get two ids back.
How to select?
As an alternative to find_all(), which works well, you can also use a CSS selector:
soup.select('div[data-pid-imprintid]')
This selects every <div> that has an attribute called data-pid-imprintid. To get the value of data-pid-imprintid, iterate over the result set, for example with a list comprehension:
[e['data-pid-imprintid'] for e in soup.select('div[data-pid-imprintid]')]
Example
from bs4 import BeautifulSoup
html='''<div class="contentBox">
<div class="pid-box-1" data-pid-imprintid="3762">
</div>
<div class="pid-box-2" data-pid-imprintid="5096">
</div>
<div class="pid-box-1" data-pid-imprintid="10944">
</div>
</div>'''
soup = BeautifulSoup(html, 'lxml')
ids = [e['data-pid-imprintid'] for e in soup.select('div[data-pid-imprintid]')]
print(ids)
Output
['3762', '5096', '10944']

Related

BS4 - Replacing text content, preserving tags

I have an HTML document that uses the text-transform style property to change case. When I see that style, I'd like to change all the text to which it applies, while retaining the HTML tags.
I have a partial solution that replaces the tag entirely. The approach that seems like it ought to be correct gives me AttributeError: 'NoneType' object has no attribute 'next_element'.
Example:
from bs4 import BeautifulSoup, NavigableString, Tag
import re
html = '''
<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1
<div>Foo2</div>
</div>
</div>
'''
upper_patt = re.compile(r'(?i)text-transform:\s*uppercase')
# works, but replaces all text, removing the HTML tags
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
    node.replace_with(node.text.upper())
# does not work, throws AttributeError
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
    for txt in node.strings:
        txt.replace_with(txt.upper())
It seems you want to change the inner text to uppercase for all the descendants of an element with text-transform: uppercase.
The AttributeError most likely comes from replacing strings while node.strings, which is a generator, is still walking the tree. Instead, loop over the text nodes returned by node.findChildren(text=True), which are collected into a list up front, and use replace_with() to change each one:
from bs4 import BeautifulSoup, NavigableString, Tag
import re
html = '''
<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1
<div>Foo2</div>
</div>
</div>
'''
upper_patt = re.compile(r'(?i)text-transform:\s*uppercase')
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
    for child in node.findChildren(recursive=True, text=True):
        child.replace_with(child.text.upper())
print(soup)
Prints:
<div style="text-transform: uppercase;">
FOO0
<font>FOO0</font>
<div>FOO1
<div>FOO2</div>
</div>
</div>
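
An alternative sketch that stays closer to the loop in the question: copying node.strings into a list before replacing anything means the tree is no longer being walked while it is modified. This is an illustration on a shortened copy of the HTML, not necessarily the only fix.

```python
import re
from bs4 import BeautifulSoup

html = '''<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1<div>Foo2</div></div>
</div>'''

upper_patt = re.compile(r'(?i)text-transform:\s*uppercase')
soup = BeautifulSoup(html, 'html.parser')

for node in soup.find_all(attrs={'style': upper_patt}):
    # list() collects every text node first, so replace_with() no longer
    # interferes with the generator behind node.strings.
    for txt in list(node.strings):
        txt.replace_with(txt.upper())

result = str(soup)
print(result)
```

The tags and the style attribute are left untouched; only the text nodes change.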

How to write xpath for a field and validate the fields

I have a requirement to verify field names and values. My HTML looks like this:
<div class="line info">
<div class="unit labelInfo TextMdB">
Reference #:
</div>
<div class="unit lastUnit">
701
</div>
</div>
</div>
<div class="line info">
<div class="unit labelInfo TextMdB">
Registered Date:
</div>
<div class="unit lastUnit">
05/05/2020
</div>
</div>
I wrote my XPath as
"//div[@class='unit lastUnit']//preceding-sibling::div[@class='unit labelInfo TextMdB' and contains(text(),'Reference #:')]".
With this XPath I am able to reach the "Reference #" field. But how do I verify that the Reference # field is displaying its value (in this case 701)?
Appreciate your response.
Thanks
You can first locate the Reference # label by its text in the XPath, then use following-sibling to fetch the value div, and then use the getText() method (Java) or the text attribute (Python) to get 701.
(Edited answer after OP's comment)
If you want to check if the element is displayed on the page or not then you can fetch its list and check if the size of that list is greater than 0 or not.
You can do it like:
In Java:
List<WebElement> elementList = driver.findElements(By.xpath("//div[@class='line info']//div[contains(text(),'Reference #')]//following-sibling::div"));
if (elementList.size() > 0) {
    // Element is present on the UI
    // Finding its text
    String text = elementList.get(0).getText();
}
In Python:
from selenium.webdriver.common.by import By
elementList = driver.find_elements(By.XPATH, "//div[@class='line info']//div[contains(text(),'Reference #')]//following-sibling::div")
if len(elementList) > 0:
    # Element is present
    # Printing its text
    print(elementList[0].text)
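
The label-to-value lookup can also be tried offline before wiring it into Selenium. The sketch below is an assumption-laden illustration: it uses lxml instead of a browser driver, runs against a tidied copy of the HTML from the question, and field_value is a hypothetical helper name. It anchors on the label div and takes the first following sibling div as the value.

```python
from lxml import html as lxml_html

snippet = """
<div class="line info">
  <div class="unit labelInfo TextMdB">Reference #:</div>
  <div class="unit lastUnit">701</div>
</div>
<div class="line info">
  <div class="unit labelInfo TextMdB">Registered Date:</div>
  <div class="unit lastUnit">05/05/2020</div>
</div>
"""

# Wrap the fragment in a parent <div> so it has a single root element.
root = lxml_html.fragment_fromstring(snippet, create_parent="div")


def field_value(root, label):
    """Return the text of the value div that follows the given label, or None."""
    xpath = (
        ".//div[contains(@class, 'labelInfo') "
        "and contains(normalize-space(.), $label)]"
        "/following-sibling::div[1]"
    )
    matches = root.xpath(xpath, label=label)
    return matches[0].text_content().strip() if matches else None


reference = field_value(root, "Reference #")
registered = field_value(root, "Registered Date")
missing = field_value(root, "No Such Field")
print(reference, registered, missing)
```

A None result plays the same role as the empty list in the Selenium version: the field is not present.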

Using requests and bs4 to get a link from a webpage

I am trying to pull the link for the latest droplist from https://www.supremecommunity.com/season/spring-summer2020/droplists/
If you right-click on "Latest" and click Inspect, you can see the link in the markup.
That link changes every week, so I am trying to pull it from that page.
When I do
import requests
from bs4 import BeautifulSoup
url = "https://www.supremecommunity.com/season/spring-summer2020/droplists/"
r = requests.get(url)
soup = BeautifulSoup(r.text,"html.parser")
my_data = soup.find('div', attrs = {'id': 'box-latest'})
I get:
<div class="col-sm-4 col-xs-12 app-lr-pad-2" id="box-latest">
<a class="block" href="/season/spring-summer2020/droplist/2020-03-26/">
<div class="feature feature-7 boxed text-center imagebg boxedred sc-app-boxlistitem" data-overlay="7">
<div class="empty-background-image-holder">
<img alt="background" src=""/>
</div>
<h2 class="pos-vertical-center">Latest</h2>
</div>
</a>
</div>
How can I just pull the "/season/spring-summer2020/droplist/2020-03-26/" part out?
import requests
from bs4 import BeautifulSoup
r = requests.get(
    "https://www.supremecommunity.com/season/spring-summer2020/droplists/")
soup = BeautifulSoup(r.content, "html.parser")
print(soup.find("div", id="box-latest").contents[1].get("href"))
Output:
/season/spring-summer2020/droplist/2020-03-26/
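
Indexing with .contents[1] depends on the whitespace node that precedes the <a> tag. A slightly more defensive sketch looks up the <a> inside the box directly. It runs against a trimmed copy of the markup from the question instead of the live page, and BASE_URL is just an illustrative name used to turn the relative href into an absolute link.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.supremecommunity.com/season/spring-summer2020/droplists/"

page = '''<div class="col-sm-4 col-xs-12 app-lr-pad-2" id="box-latest">
<a class="block" href="/season/spring-summer2020/droplist/2020-03-26/">
<h2 class="pos-vertical-center">Latest</h2>
</a>
</div>'''

soup = BeautifulSoup(page, "html.parser")

# First <a> with an href anywhere inside the element with id="box-latest".
link = soup.select_one("#box-latest a[href]")
href = link["href"] if link is not None else None

# Resolve the site-relative path against the page URL.
full_url = urljoin(BASE_URL, href) if href else None

print(href)
print(full_url)
```

With a live page, the page string would be replaced by r.text from the requests call.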

How to use BeautifulSoup to get content inside over-line tags

I would like to extract the content("_The_important_content_") from an HTML snippet as follows:
<div
class="
a:2
c:gray
m:da
"
>
_The_important_content_
</div>
My code is just:
for i in soup.findAll('div', class_="a:2 c:gray m:da"):
    print(i.text)
But because the "class" value contains newline characters and spans multiple lines, BeautifulSoup does not match it and the code returns nothing. How can I specify the class value correctly and get the content?
There are many tags with the same "class" value and many with other "class" values, but I want to extract the content only from the tags with that specific "class" value.
Try this:
html='''
<div
class="
a:2
c:gray
m:da
"
>
_The_important_content_
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
item = soup.select("[class^=]")[0].text
print(item.strip())
Result:
_The_important_content_
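
Another way to target the tag is by its individual class tokens. This sketch assumes Beautiful Soup treats class as a multi-valued attribute and splits it on any whitespace, newlines included, so the three tokens can be matched one by one. The colons have to be escaped in a CSS selector; html.parser and the extra second div are used here only for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<div
  class="
    a:2
    c:gray
    m:da
  "
>
_The_important_content_
</div>
<div class="a:2 other">not this one</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Option 1: CSS class selectors with escaped colons
# (the raw string keeps the backslashes intact).
by_selector = soup.select(r"div.a\:2.c\:gray.m\:da")

# Option 2: a predicate that checks the parsed list of class tokens.
wanted = {"a:2", "c:gray", "m:da"}
by_function = soup.find_all(
    lambda tag: tag.name == "div" and wanted.issubset(tag.get("class", []))
)

texts = [tag.get_text(strip=True) for tag in by_selector]
print(texts)
```

Both options ignore the order and spacing of the tokens inside the attribute.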

Beautiful Soup - how to get href

I can't seem to be able to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of html:
<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>
This is what I thought should work
href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]
.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag. Also, the text= argument is for matching NavigableStrings, rather than elements.
The following does what you want, I think:
import re
from bs4 import BeautifulSoup
html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''
soup = BeautifulSoup(html, 'html.parser')
for t in soup.find_all(string=re.compile(r'Website:')):
    # Find the parent of the NavigableString, and see
    # whether that's a <strong>:
    s = t.parent
    if s.name == 'strong':
        print(s.next_sibling.next_sibling['href'])
... but that isn't very robust. If the enclosing div has a predictable ID, then it would be better to find that, and then find the first <a> element within it.
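A minimal sketch of that ID-based approach, assuming the div really does keep the id id_Website from the question and using the bs4 package with the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Locate the container by its ID, then take the first <a> that has an href.
container = soup.find('div', id='id_Website')
link = container.find('a', href=True) if container is not None else None
href = link['href'] if link is not None else None

print(href)
```

The None checks keep the lookup from raising if the div or the link is missing from the page.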