Using requests and bs4 to get a link from a webpage [duplicate] - beautifulsoup

This question already has answers here:
extracting href from <a> beautiful soup
(2 answers)
Closed 2 years ago.
I am trying to pull the link for the latest droplist from https://www.supremecommunity.com/season/spring-summer2020/droplists/
If you right click on latest and click inspect, you see this:
That link will change every week, so I am trying to pull it from that page.
When I do
import requests
from bs4 import BeautifulSoup
url = "https://www.supremecommunity.com/season/spring-summer2020/droplists/"
r = requests.get(url)
soup = BeautifulSoup(r.text,"html.parser")
my_data = soup.find('div', attrs = {'id': 'box-latest'})
I get:
div class="col-sm-4 col-xs-12 app-lr-pad-2" id="box-latest">
<a class="block" href="/season/spring-summer2020/droplist/2020-03-26/">
<div class="feature feature-7 boxed text-center imagebg boxedred sc-app-boxlistitem" data-overlay="7">
<div class="empty-background-image-holder">
<img alt="background" src=""/>
</div>
<h2 class="pos-vertical-center">Latest</h2>
</div>
</a>
</div>
How can I just pull the "/season/spring-summer2020/droplist/2020-03-26/" part out?

import requests
from bs4 import BeautifulSoup
r = requests.get(
"https://www.supremecommunity.com/season/spring-summer2020/droplists/")
soup = BeautifulSoup(r.content, "html.parser")
print(soup.find("div", id="box-latest").contents[1].get("href"))
Output:
/season/spring-summer2020/droplist/2020-03-26/

Related

BS4 - Replacing text content, preserving tags

I have an HTML document that uses the text-styling style attribute to change case. When I see that style, I'd like to change all text for which that tag applies, retaining the HTML tags.
I have a partial solution that replaces the tag entirely. The approach that seems like it ought to be correct gives me AttributeError: 'NoneType' object has no attribute 'next_element'
Example:
from bs4 import BeautifulSoup, NavigableString, Tag
import re
html = '''
<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1
<div>Foo2</div>
</div>
</div>
'''
upper_patt = re.compile('(?i)text-transform:\s*uppercase')
# works, but replaces all text, removing the HTML tags
for node in soup.find_all(attrs={'style': upper_patt}):
node.replace_with(node.text.upper())
# does not work, throws AttributeError error
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
for txt in node.strings:
txt.replace_with(txt.upper())
Seems like you want to change the inner text to uppercase for all the children of an element with text-transform: uppercase.
Instead of altering the result of find_all, loop over the children text with node.findChildren(text=True) of the result, and use replace_with() to change the text:
from bs4 import BeautifulSoup, NavigableString, Tag
import re
html = '''
<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1
<div>Foo2</div>
</div>
</div>
'''
upper_patt = re.compile('(?i)text-transform:\s*uppercase')
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
for child in node.findChildren(recursive=True, text=True):
child.replace_with(child.text.upper())
print(soup)
Prints:
<div style="text-transform: uppercase;">
FOO0
<font>FOO0</font>
<div>FOO1
<div>FOO2</div>
</div>
</div>

Extracting text within div tag itself with BeautifulSoup

I am trying to extract the number eg. "3762" from the div below with Beautifulsoup:
<div class="contentBox">
<div class="pid-box-1" data-pid-imprintid="3762">
</div>
<div class="pid-box-2" data-pid-imprintid="5096">
</div>
<div class="pid-box-1" data-pid-imprintid="10944">
</div>
</div>
The div comes from this website (a pharma medical database): Drugs.com.
I can not use "class" since that changes from div to div, more than just pid-box-1 and pid-box-2. I haven't had success using the "data-pid-imprintid" either.
This is what i have tried and i know that i cant write "data-pid-imprintid" the way i have done:
soup = BeautifulSoup(html_text, 'lxml')
divs = soup.find_all('div', 'data-pid-imprintid')
for div in divs:
item = div.find('div')
id = item.get('data-pid-imprintid')
print (id)
This gets the value of data-pid-imprintid in every div with data-pid-imprintid
soup = BeautifulSoup(html_text, 'lxml')
divs = soup.find_all("div", attrs={"data-pid-imprintid": True})
for div in divs:
print(div.get('data-pid-imprintid'))
First at all be aware there is a little typo in your html (class="pid-box-1'), without fixing it, you will only get two ids back.
How to select?
As alternativ approache to find_all() that works well, you can also go with the css selector:
soup.select('div [data-pid-imprintid]')
These will select every <div> with an attribute called data-pid-imprintid. To get the value of data-pid-imprintid you have to iterate the result set for example by list comprehension:
[e['data-pid-imprintid'] for e in soup.select('div [data-pid-imprintid]')]
Example
import requests
from bs4 import BeautifulSoup
html='''<div class="contentBox">
<div class="pid-box-1" data-pid-imprintid="3762">
</div>
<div class="pid-box-2" data-pid-imprintid="5096">
</div>
<div class="pid-box-1" data-pid-imprintid="10944">
</div>
</div>'''
soup = BeautifulSoup(html, 'lxml')
ids = [e['data-pid-imprintid'] for e in soup.select('div [data-pid-imprintid]')]
print(ids)
Output
['3762', '5096', '10944']

How to get element in class covered h3 by using BeautifulSoup

I would like to get element "Orange" by using BeautifulSoup.
I only need an element which covered h2.
please help how to get it?
<h2 class="heading">
<a><span class="name">Orange</span></a>
</h2>
from bs4 import BeautifulSoup
html = """<h2 class="heading">
<a><span class="name">Orange</span></a>
</h2>"""
soup = BeautifulSoup(html, 'html.parser')
for item in soup.findAll("span", {'class': 'name'}):
print(item.text)

Getting error AttributeError: ResultSet object has no attribute 'find_all'

Hi I am trying to find out all the links under pagination thing and the pagination part code already extracted. but when i was trying to capture all the list items I am getting the following error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
import requests
from bs4 import BeautifulSoup
url = "https://scrapingclub.com/exercise/list_basic/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
pages = soup.find_all('ul', class_='pagination')
links = pages.find_all('a', class_='page-link')
print(links)
I did not understand by the term AttributeError: ResultSet object has no attribute 'find_all'. can anybody check this what I am missing.
The problem is you cannot call .find_all() or .find() on ResultSet returned by first .find_all() call.
This example will print all links from pagination:
import requests
from bs4 import BeautifulSoup
url = "https://scrapingclub.com/exercise/list_basic/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
pages = soup.find('ul', class_='pagination') # <-- .find() to return only one element
for link in pages.find_all('a', class_='page-link'): # <-- find_all() to return list of elements
print(link)
Prints:
<a class="page-link" href="?page=2">2</a>
<a class="page-link" href="?page=3">3</a>
<a class="page-link" href="?page=4">4</a>
<a class="page-link" href="?page=5">5</a>
<a class="page-link" href="?page=6">6</a>
<a class="page-link" href="?page=7">7</a>
<a class="page-link" href="?page=2">Next</a>

Beautiful Soup - how to get href

I can't seem to be able to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of html:
<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>
This is what I thought should work
href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]
.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag. Also, the text= attribute is for matching NavigableStrings, rather than elements.
The following does what you want, I think:
import re
from BeautifulSoup import BeautifulSoup
html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''
soup = BeautifulSoup(html)
for t in soup.findAll(text=re.compile(r'Website:')):
# Find the parent of the NavigableString, and see
# whether that's a <strong>:
s = t.parent
if s.name == 'strong':
print s.nextSibling.nextSibling['href']
... but that isn't very robust. If the enclosing div has a predictable ID, then it would better to find that, and then find the first <a> element within it.