Select HTML tag with multiple CSS classes in BeautifulSoup

I am using BeautifulSoup to extract tags from HTML. Some HTML tags have multiple CSS classes, for example:
html = '''
<a class ='a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'>This is a anchor text</a>
<div class ='s-access-detail-page s-color-twister-title-link a-text-normal'>Div text</div>
'''
soup = BeautifulSoup(html, "lxml")
all_prod_links = soup.find_all('a', {'class': ['a-link-normal','s-access-detail-page','s-color-twister-title-link','a-text-normal']})
When I use the code above, it gives me both tags. Is there any way to get only the element that has all of the CSS classes?

This will find all tags (a, div, or other) that have 'class' attribute and have all the specified classes:
from bs4 import BeautifulSoup
html = '''
<a class='a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'>This is a anchor text</a>
<div class='s-access-detail-page s-color-twister-title-link a-text-normal'>Div text</div>
'''
soup = BeautifulSoup(html, "lxml")
all_prod_links = soup.find_all(
    lambda t: 'class' in t.attrs
    and 'a-link-normal' in t['class']
    and 's-access-detail-page' in t['class']
    and 's-color-twister-title-link' in t['class']
    and 'a-text-normal' in t['class'])
print(all_prod_links)
Prints:
[<a class="a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal">This is a anchor text</a>]
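The same check can be written once for any list of classes. A minimal sketch, assuming a hypothetical helper named has_all_classes (it is not part of BeautifulSoup) and using html.parser so that no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup

html = '''
<a class='a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'>This is a anchor text</a>
<div class='s-access-detail-page s-color-twister-title-link a-text-normal'>Div text</div>
'''


def has_all_classes(required):
    """Build a find_all() filter that keeps tags carrying every class in `required`.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    required = set(required)
    # tag.get('class', []) is the list of class names, or [] if the attribute is absent
    return lambda tag: required.issubset(tag.get('class', []))


soup = BeautifulSoup(html, 'html.parser')
wanted = ['a-link-normal', 's-access-detail-page',
          's-color-twister-title-link', 'a-text-normal']
matches = soup.find_all(has_all_classes(wanted))
print([tag.name for tag in matches])
```

Because the required classes are compared as a set, the order in which they appear in the markup should not matter.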

html = '''
<a class ='a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'>This is a anchor text</a>
<div class ='s-access-detail-page s-color-twister-title-link a-text-normal'>Div text</div>
'''
soup = BeautifulSoup(html, "lxml")
all_prod_links = soup.find_all(attrs={'class':'a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'})
Result is
[<a class="a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal">This is a anchor text</a>]
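Note that passing the whole class string matches the attribute value literally, so it depends on the classes being listed in the same order as in the markup. A hedged sketch comparing it with a chained CSS class selector, which requires every class but ignores order (the variable names here are only for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<a class='a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'>This is a anchor text</a>
<div class='s-access-detail-page s-color-twister-title-link a-text-normal'>Div text</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Chained class selectors require every listed class, in any order
by_selector = soup.select(
    'a.a-text-normal.a-link-normal.s-access-detail-page.s-color-twister-title-link')

# The exact-string form matches only when the classes are written in this order
same_order = soup.find_all(
    class_='a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal')
other_order = soup.find_all(
    class_='a-text-normal a-link-normal s-access-detail-page s-color-twister-title-link')

print(len(by_selector), len(same_order), len(other_order))
```

If the class order on the real page can vary, the selector form is probably the safer choice.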

Related

How to extract value of all classes in beautiful Soup

I have a HTML file with a structure like this:
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
I need to have a Python dict like this:
{'institution': ['Q8901', 'Q806176'], 'person': ['Q563217']}
So I need to get the value of the class attribute of all span tags, along with their text. How can I do this with bs4?
Select your elements and iterate over the ResultSet, appending the values to your dict. To extract the value of an attribute, use .get(). Because class returns a list, pick the entry you need by index.
Example
from bs4 import BeautifulSoup
html = '''
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
'''
soup = BeautifulSoup(html, 'html.parser')
d = {
    'institution': [],
    'person': []
}
# the parser lowercases attribute names, so Wikidata becomes wikidata
for e in soup.select('span[wikidata]'):
    d[e.get('class')[0]].append(e.get('wikidata'))
print(d)
Output
{'institution': ['Q8901', 'Q806176'], 'person': ['Q563217']}
This is the way I solved my problem, thanks to @HedgeHog.
from bs4 import BeautifulSoup
from collections import defaultdict
def capture_info(soup: 'BeautifulSoup') -> defaultdict:
    info = defaultdict(list)
    for i in soup.select('span[Wikidata]'):
        info[i.get('class')[0]].append(i.get('wikidata'))
    return info
html = '''
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
'''
soup = BeautifulSoup(html, 'html.parser')
info = capture_info(soup)
print(info)
The output is:
defaultdict(<class 'list'>, {'institution': ['Q8901', 'Q806176'], 'person': ['Q563217']})
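The question also mentions the text of each span. A minimal sketch of one way to keep it next to the Wikidata ID, assuming pairs of (id, text) are an acceptable shape for the values (that shape is an assumption, not something the question specifies):

```python
from collections import defaultdict

from bs4 import BeautifulSoup

html = '''
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
'''
soup = BeautifulSoup(html, 'html.parser')

info = defaultdict(list)
for span in soup.select('span[wikidata]'):
    # a span may carry several classes; record the pair under each of them
    for cls in span.get('class', []):
        info[cls].append((span['wikidata'], span.get_text(strip=True)))

result = dict(info)
print(result)
```

Converting to a plain dict at the end only changes how the result prints; the defaultdict can be used directly as well.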

How can I get a value from an attribute inside a tag

I have a soup object like:
<a class="love-action js-add-to-favorites" data-id="415953" data-price="715.00" href="#">
</a>
I did
soup = BeautifulSoup(src, 'lxml')  # pass the variable to the soup
price = soup.find(class_='col-5 col-sm-4 col-lg-7 mob-position detail-top-actions').find('a', class_='love-action js-add-to-favorites')
print(price)
I'd like to get only: 715.00
How to fix?
You can access a tag's attributes by treating it like a dictionary, so simply read the value of the data-price attribute with:
price['data-price']
Example based on your question
soup = BeautifulSoup(src, 'lxml')  # pass the variable to the soup
price = soup.find(class_='col-5 col-sm-4 col-lg-7 mob-position detail-top-actions').find('a', class_='love-action js-add-to-favorites')
print(price['data-price'])
Output
715.00
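Attribute values come back as strings, and indexing with square brackets raises KeyError when the attribute is missing. A small self-contained sketch, assuming the anchor from the question is the whole document and that a float is what is wanted (data-currency is a made-up attribute name used only to show the missing case):

```python
from bs4 import BeautifulSoup

src = '''
<a class="love-action js-add-to-favorites" data-id="415953" data-price="715.00" href="#">
</a>
'''
soup = BeautifulSoup(src, 'html.parser')
link = soup.select_one('a.love-action.js-add-to-favorites')

# tag.get(...) returns None for a missing attribute instead of raising KeyError
raw_price = link.get('data-price')
price = float(raw_price) if raw_price is not None else None
missing = link.get('data-currency')

print(price, missing)
```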

Can I select by class in a tag hierarchy in BeautifulSoup?

<div class="menu-drop-main">
<ul class="menu-drop-list">
<li>男士面部护肤</li>
<li>美妆工具</li>
<li>面部护肤</li>
<li>香水彩妆</li>
</ul>
</div>
If I want to use 'select' instead of 'find', can I get a list of the 4 'li' tags?
tags = soup.select('div ul .menu-drop-main')
You can use soup.select('.menu-drop-main li'). That selects all <li> tags under the tag with class="menu-drop-main":
from bs4 import BeautifulSoup
html_doc = """<div class="menu-drop-main">
<ul class="menu-drop-list">
<li>男士面部护肤</li>
<li>美妆工具</li>
<li>面部护肤</li>
<li>香水彩妆</li>
</ul>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
print(*soup.select(".menu-drop-main li"), sep="\n")
Prints:
<li>男士面部护肤</li>
<li>美妆工具</li>
<li>面部护肤</li>
<li>香水彩妆</li>
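The selector in the question returns nothing because it reads left to right: it asks for an element with class menu-drop-main located inside a ul inside a div. A short sketch showing that, alongside a stricter variant that uses the child combinator (>) and extracts only the text:

```python
from bs4 import BeautifulSoup

html_doc = """<div class="menu-drop-main">
<ul class="menu-drop-list">
<li>男士面部护肤</li>
<li>美妆工具</li>
<li>面部护肤</li>
<li>香水彩妆</li>
</ul>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")

# No element with class menu-drop-main sits inside a ul, so this list is empty
original = soup.select('div ul .menu-drop-main')

# '>' restricts each step to direct children
items = soup.select('div.menu-drop-main > ul.menu-drop-list > li')
texts = [li.get_text(strip=True) for li in items]

print(len(original), texts)
```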

BS4: issues finding href of 2 tags

I'm having problems getting soup to return all links that are both bold and have a URL. Right now it's only returning the 1st one on the page.
Here is part of the source:
<div class="section_wrapper" id="all_players_">
<div class="section_heading">
<span class="section_anchor" id="players__link" data-label="925 Players"></span>
<h2>925 Players</h2> <div class="section_heading_text">
<ul> <li><strong>Bold</strong> indicates active player and + indicates a Hall of Famer.</li>
</ul>
</div>
</div> <div class="section_content" id="div_players_">
<p>John D'Acquisto (1973-1982)</p>
<p>Jeff D'Amico (1996-2004)</p>
<p>Jeff D'Amico (2000-2000)</p>
<p>Jamie D'Antona (2008-2008)</p>
<p>Jerry D'Arcy (1911-1911)</p>
<p><b>Chase d'Arnaud (2011-2016)</b></p>
<p><b>Travis d'Arnaud (2013-2016)</b></p>
<p>Omar Daal (1993-2003)</p>
<p>Paul Dade (1975-1980)</p>
<p>John Dagenhard (1943-1943)</p>
<p>Pete Daglia (1932-1932)</p>
<p>Angelo Dagres (1955-1955)</p>
<p><b>David Dahl (2016-2016)</b></p>
<p>Jay Dahl (1963-1963)</p>
<p>Bill Dahlen (1891-1911)</p>
<p>Babe Dahlgren (1935-1946)</p>
and here is my script:
import urllib.request
from bs4 import BeautifulSoup as bs
import re
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
for player_url in soup.b.find_all(limit=None):
    for player_link in re.findall('/players/', player_url['href']):
        print ('http://www.baseball-reference.com' + player_url['href'])
The other part is that there are other div id's that have similar lists that I don't care about. I want to grab the URLs from only this div class, that have a <b> tag. The <b> tag symbolizes that they are active players and that is what I am trying to capture.
Use BeautifulSoup to do the "selection" work and drill down to your data:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
bolds = soup.find_all('b')
for bold in bolds:
    player_link = bold.find('a')
    if player_link:
        relative_path = player_link['href']
        print('http://www.baseball-reference.com' + relative_path)
Now, if you only want the one div with id=div_players_, you could add an additional filter:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
div_players = soup.find('div', {'id': 'div_players_'})
bolds = div_players.find_all('b')
for bold in bolds:
    player_link = bold.find('a')
    if player_link:
        relative_path = player_link['href']
        print('http://www.baseball-reference.com' + relative_path)
This is what I ended up doing
url = 'http://www.baseball-reference.com/players/d/'
content = urllib.request.urlopen(url)
soup = bs(content, 'html.parser')
for player_div in soup.find_all('div', {'id':'all_players_'}):
    for player_bold in player_div('b'):
        for player_href in player_bold('a'):
            print ('http://www.baseball-reference.com' + player_href['href'])
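The same filtering can also be expressed as a single CSS selector. The sketch below runs offline on a made-up fragment: the player names and href values are hypothetical placeholders (the page excerpt above does not show its links), and only the div id and the base URL are taken from the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = 'http://www.baseball-reference.com'

# Hypothetical markup: names and href values are invented for illustration
sample = '''
<div class="section_content" id="div_players_">
<p><a href="/players/d/example01.shtml">Inactive Player (1973-1982)</a></p>
<p><b><a href="/players/d/example02.shtml">Active Player (2011-2016)</a></b></p>
</div>
<div class="section_content" id="div_other_">
<p><b><a href="/other/example03.shtml">Unrelated Bold Link</a></b></p>
</div>
'''
soup = BeautifulSoup(sample, 'html.parser')

# Only <a href> tags inside <b> inside the div with id div_players_
links = [urljoin(BASE, a['href'])
         for a in soup.select('#div_players_ b a[href]')]
print(links)
```

Using urljoin rather than string concatenation is a design choice here; it should also cope with hrefs that are already absolute.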

BeautifulSoup Nested class selector

I am using BeautifulSoup for a project. Here is my HTML structure
<div class="container">
<div class="fruits">
<div class="apple">
<p>John</p>
<p>Sam</p>
<p>Bailey</p>
<p>Jack</p>
<ul>
<li>Sour</li>
<li>Sweet</li>
<li>Salty</li>
</ul>
<span>Fruits are good</span>
</div>
<div class="mango">
<p>Randy</p>
<p>James</p>
</div>
</div>
<div class="apple">
<p>Bill</p>
<p>Sean</p>
</div>
</div>
Now I want to grab text in div class 'apple' which falls under class 'fruits'
This is what I have tried so far ....
for node in soup.find_all("div", class_="apple"):
It's returning ...
Bill
Sean
But I want it to return only ...
John
Sam
Bailey
Jack
Sour
Sweet
Salty
Fruits are good
Please note that I DO NOT know the exact structure of elements inside div class="apple" There can be any type of different HTML elements inside that class. So the selector has to be flexible enough.
Here is the full code, where I need to add this BeautifulSoup code ...
class MySpider(CrawlSpider):
    name = 'dknnews'
    start_urls = ['http://www.example.com/uat-area/scrapy/all-news-listing/_recache']
    allowed_domains = ['example.com']
    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        #soup = BeautifulSoup(content.decode('utf-8','ignore'))
        nf = NewsFields()
        ptype = soup.find_all(attrs={"name":"dknpagetype"})
        ptitle = soup.find_all(attrs={"name":"dknpagetitle"})
        pturl = soup.find_all(attrs={"name":"dknpageurl"})
        ptdate = soup.find_all(attrs={"name":"dknpagedate"})
        ptdesc = soup.find_all(attrs={"name":"dknpagedescription"})
        for node in soup.find_all("div", class_="apple"):  # THIS IS WHERE I NEED TO ADD THE BS CODE
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)
I am not sure how to use nested selectors with BeautifulSoup find_all ?
Any help is very appreciated.
Thanks
Use a CSS selector; it makes nested classes easy to express:
soup.select('.fruits .apple p')
Or, you can use find() to reach the p tags step by step:
soup.find(class_='fruits').find(class_="apple").find_all('p')
EDIT:
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]
Use the stripped_strings generator to get every string under the div tag; it strips surrounding whitespace, so the \n characters are dropped from the results.
out:
['John', 'Sam', 'Bailey', 'Jack', 'Sour', 'Sweet', 'Salty', 'Fruits are good']
Full code:
from bs4 import BeautifulSoup
source_code = """<div class="container">
<div class="fruits">
<div class="apple">
<p>John</p>
<p>Sam</p>
<p>Bailey</p>
<p>Jack</p>
<ul>
<li>Sour</li>
<li>Sweet</li>
<li>Salty</li>
</ul>
<span>Fruits are good</span>
</div>
<div class="mango">
<p>Randy</p>
<p>James</p>
</div>
</div>
<div class="apple">
<p>Bill</p>
<p>Sean</p>
</div>
</div>
"""
soup = BeautifulSoup(source_code, 'lxml')
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]
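The spider in the question builds one body string per node. A minimal sketch of how the same selector could feed that step, using a shortened version of the markup and html.parser (both are assumptions made to keep the example small):

```python
from bs4 import BeautifulSoup

# Shortened version of the structure from the question
source_code = """<div class="container">
<div class="fruits">
<div class="apple"><p>John</p><ul><li>Sour</li></ul><span>Fruits are good</span></div>
<div class="mango"><p>Randy</p></div>
</div>
<div class="apple"><p>Bill</p></div>
</div>"""
soup = BeautifulSoup(source_code, 'html.parser')

# One space-joined string per .apple div that sits somewhere under .fruits
bodies = [' '.join(div.stripped_strings)
          for div in soup.select('.fruits .apple')]
print(bodies)
```

The top-level apple div is skipped because it has no ancestor with class fruits.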