Find multiple tags with condition - beautifulsoup

Is it possible to find multiple tags with a condition?
<a href = "/img/something.jpg">
<img src= "/img/somethingelse.png">
Could I say
Find all "a" and "img" tags containing "/img/"

Yes, just pass a function (it can be a lambda) to the find_all() method:
data = """<a href = "/img/something.jpg">
<img src= "/img/somethingelse.png">"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for tag in soup.body.find_all(
        lambda t: t.name in ('a', 'img') and (
            ('href' in t.attrs and '/img/' in t['href']) or
            ('src' in t.attrs and '/img/' in t['src']))):
    print(tag.name, tag.attrs)
    print('*' * 80)
Prints (without the separator lines):
a {'href': '/img/something.jpg'}
img {'src': '/img/somethingelse.png'}
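The same filter can probably also be written as a CSS selector, since select() supports attribute substring matching. A minimal sketch, assuming the soupsieve-backed select() that ships with current bs4; the link text is made up for the example:

```python
from bs4 import BeautifulSoup

data = '<a href="/img/something.jpg">link</a><img src="/img/somethingelse.png">'
soup = BeautifulSoup(data, "html.parser")

# [attr*="value"] matches when the attribute contains the substring;
# the comma combines the two selectors.
matches = soup.select('a[href*="/img/"], img[src*="/img/"]')
names = [tag.name for tag in matches]
print(names)
```

The function form stays more flexible if the condition cannot be expressed as a selector.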


Can BeautifulSoup locate an element based on its contained text? [duplicate]

Observe the following problem:
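import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    Edit
</a>
""")
# This returns the <a> element
soup.find('a', href="/customer-menu/1/accounts/1/update", text=re.compile(".*Edit.*"))
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")
# This returns None
soup.find('a', href="/customer-menu/1/accounts/1/update", text=re.compile(".*Edit.*"))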
For some reason, BeautifulSoup will not match the text, when the <i> tag is there as well. Finding the tag and showing its text produces
>>> a2 = soup.find('a', href="/customer-menu/1/accounts/1/update")
>>> print(repr(a2.text))
'\n Edit\n'
Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:
pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n') # Returns None
pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n') # Returns MatchObject
Alright. Looks good. Let's try it with soup
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
)  # Still returns None... Why?!
My solution, based on geckon's answer: I implemented these helpers:
import re

MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.

    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # Tags must be converted to str before joining
        raise ValueError("Too many matches:\n" + "\n".join(map(str, matches)))
    elif len(matches) == 0:
        return None
    return matches[0]
Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')
The problem is that your <a> tag, with the <i> tag inside it, doesn't have the string attribute you expect it to have. First, let's look at what the text="" argument of find() does.
NOTE: text is the old name of this argument; since BeautifulSoup 4.4.0 it is called string.
From the docs:
Although string is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
.string matches your value for string. This code finds the tags
whose .string is “Elsie”:
soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
Now let's look at what a Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
title_tag.string
# u'The Dormouse's story'
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None:
print(soup.html.string)
# None
This is exactly your case. Your <a> tag contains both text and an <i> tag, so its .string is None. find() therefore has nothing to compare your pattern against, and it can't match.
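The difference is easy to check directly. A small sketch, assuming the built-in html.parser and a shortened, made-up href:

```python
from bs4 import BeautifulSoup

# One text child only: .string is that text.
simple = BeautifulSoup('<a href="/x">Edit</a>', "html.parser").a
# An <i> child plus a text child: .string is None.
mixed = BeautifulSoup(
    '<a href="/x"><i class="fa fa-edit"></i> Edit</a>', "html.parser"
).a

simple_string = simple.string
mixed_string = mixed.string
mixed_text = mixed.get_text()  # concatenates every descendant string

print(repr(simple_string), repr(mixed_string), repr(mixed_text))
```

So any approach that looks at get_text() (or .text) instead of .string still sees "Edit" in the second case.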
How to solve this?
Maybe there is a better solution but I would probably go with something like this:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")
links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")
for link in links:
    if link.find(text=re.compile("Edit")):
        thelink = link
I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.
In one line, using a lambda:
soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)
You can pass .find a function that returns True if the tag's text contains "Edit":
In [51]: def Edit_in_text(tag):
   ....:     return tag.name == 'a' and 'Edit' in tag.text
In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
You can use the .get_text() method instead of .text in your function; it gives the same result:
def Edit_in_text(tag):
    return tag.name == 'a' and 'Edit' in tag.get_text()
With soupsieve 2.1.0 or later you can use the :-soup-contains CSS pseudo-class selector to target a node's text. This replaces the deprecated :contains() form.
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")
single = soup.select_one('a:-soup-contains("Edit")').text.strip()
multiple = [i.text.strip() for i in soup.select('a:-soup-contains("Edit")')]
print(single, '\n', multiple)
Method 1: Checking the string property (this only matches when the tag's .string is exactly 'Edit')
pattern = 'Edit'
a2 = soup.find_all('a', string = pattern)[0]
Method 2: Using a lambda to iterate through all elements
a2 = soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)
How to get the content inside a tag in Beautiful Soup 4?

How do I get all the content inside an HTML tag?
from bs4 import BeautifulSoup
content = "<a><b>scgvggvd</b></a>"
soup = BeautifulSoup(content, 'html.parser')
matched_list = soup.find('a')
The code above returns:
<a><b>scgvggvd</b></a>
What I want is:
<b>scgvggvd</b>
That is, the <a> tag is removed after it's found.
I hope the solution will work with find_all() too.
If the <b> tag is a sibling of the <a> tag use the following line:
matched_list = soup.select_one('b')
If the <b> tag is a child of the <a> tag use the following line:
matched_list = soup.select_one('a b')
Use select instead of select_one if you need multiple hits.
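If the goal is the inner markup of whatever tag was found, regardless of which children it has, Tag.decode_contents() may be a closer fit. A sketch, assuming a second <a> tag added purely for illustration:

```python
from bs4 import BeautifulSoup

content = "<a><b>scgvggvd</b></a><a><b>second</b></a>"
soup = BeautifulSoup(content, "html.parser")

# decode_contents() renders the children of a tag without the tag itself.
inner_first = soup.find("a").decode_contents()
inner_all = [a.decode_contents() for a in soup.find_all("a")]

print(inner_first)
print(inner_all)
```

This returns strings; use .contents instead if the child Tag objects themselves are needed.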
from bs4 import BeautifulSoup
content = "<a><b>scgvggvd</b></a>"
soup = BeautifulSoup(content, 'html.parser')
matched_list = soup.find('a')
for b in matched_list:
    print(b)  # iterating over a tag yields its direct children

BeautifulSoup find by attribute value regardless of attribute

Say I have something like this:
<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>
I want to search for the keyword 'cake' and get all of them.
Use find_all with a lambda that checks whether any attribute value equals the value you want, or whether the class list contains it.
from bs4 import BeautifulSoup
example = """<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>"""
soup = BeautifulSoup(example, "html.parser")
print(soup.find_all(lambda tag: [a for a in tag.attrs.values() if a == "cake" or "cake" in tag.get("class", [])]))
[<div class="cake">1</div>, <h2 id="cake">1</h2>, <sometag someattribute="cake">1</sometag>]
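One detail behind that lambda: bs4 stores multi-valued attributes such as class as a list, while most other attributes are plain strings. A sketch that handles both cases explicitly; has_attr_value is a hypothetical helper name and the HTML is a made-up, well-formed variant of the example:

```python
from bs4 import BeautifulSoup


def has_attr_value(tag, wanted):
    """Return True if any attribute of tag equals (or lists) wanted."""
    for value in tag.attrs.values():
        # class, rel, etc. come back as lists; normalise to a list
        values = value if isinstance(value, list) else [value]
        if wanted in values:
            return True
    return False


html = '<div class="cake big">1</div><h2 id="cake">2</h2><p title="pie">3</p>'
soup = BeautifulSoup(html, "html.parser")
hits = [t.name for t in soup.find_all(lambda t: has_attr_value(t, "cake"))]
print(hits)
```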
You could use regex and BeautifulSoup together. This is my terrible script:
r = '''<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>'''
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(r, 'lxml')
pattern = r'(\w+)="cake"'
for i in range(len(re.findall(pattern, str(soup))) - 1):
    print(soup.find_all(re.compile(r'(\w+)'), {re.findall(pattern, str(soup))[i]: 'cake'}))
The output:
[<div class="cake">1</div>]
[<h2 id="cake">1 </div>
<sometag someattribute="cake">1</sometag></h2>]

BS4: issues finding href of 2 tags

I'm having problems getting soup to return all links that are both bold and have a URL. Right now it's only returning the 1st one on the page.
Here is part of the source:
<div class="section_wrapper" id="all_players_">
<div class="section_heading">
<span class="section_anchor" id="players__link" data-label="925 Players"></span>
<h2>925 Players</h2> <div class="section_heading_text">
<ul> <li><strong>Bold</strong> indicates active player and + indicates a Hall of Famer.</li>
</div> <div class="section_content" id="div_players_">
<p>John D'Acquisto (1973-1982)</p>
<p>Jeff D'Amico (1996-2004)</p>
<p>Jeff D'Amico (2000-2000)</p>
<p>Jamie D'Antona (2008-2008)</p>
<p>Jerry D'Arcy (1911-1911)</p>
<p><b>Chase d'Arnaud (2011-2016)</b></p>
<p><b>Travis d'Arnaud (2013-2016)</b></p>
<p>Omar Daal (1993-2003)</p>
<p>Paul Dade (1975-1980)</p>
<p>John Dagenhard (1943-1943)</p>
<p>Pete Daglia (1932-1932)</p>
<p>Angelo Dagres (1955-1955)</p>
<p><b>David Dahl (2016-2016)</b></p>
<p>Jay Dahl (1963-1963)</p>
<p>Bill Dahlen (1891-1911)</p>
<p>Babe Dahlgren (1935-1946)</p>
and here is my script:
import urllib.request
from bs4 import BeautifulSoup as bs
import re
url = ""
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
for player_url in soup.b.find_all(limit=None):
    for player_link in re.findall('/players/', player_url['href']):
        print('' + player_url['href'])
The other part is that there are other div id's that have similar lists that I don't care about. I want to grab the URLs from only this div class, that have a <b> tag. The <b> tag symbolizes that they are active players and that is what I am trying to capture.
Use BeautifulSoup to do the "selection" work and drill down to your data:
url = ""
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
bolds = soup.find_all('b')
for bold in bolds:
    player_link = bold.find('a')
    if player_link:
        relative_path = player_link['href']
        print('' + relative_path)
Now, if you only want the one div with id=div_players_, you could add an additional filter:
url = ""
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
div_players = soup.find('div', {'id': 'div_players_'})
bolds = div_players.find_all('b')
for bold in bolds:
    player_link = bold.find('a')
    if player_link:
        relative_path = player_link['href']
        print('' + relative_path)
This is what I ended up doing
url = ''
content = urllib.request.urlopen(url)
soup = bs(content, 'html.parser')
for player_div in soup.find_all('div', {'id':'all_players_'}):
    for player_bold in player_div('b'):
        for player_href in player_bold('a'):
            print('' + player_href['href'])
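The nested loops can likely be collapsed into one CSS selector. A sketch on a small stand-in page; the hrefs, player names and the second div are invented for the example, only the div_players_ id comes from the question:

```python
from bs4 import BeautifulSoup

html = """
<div id="div_players_">
  <p><a href="/players/d/example01.shtml">Inactive Player</a> (1973-1982)</p>
  <p><b><a href="/players/d/example02.shtml">Active Player</a> (2011-2016)</b></p>
</div>
<div id="other_list">
  <p><b><a href="/players/x/example03.shtml">Other Player</a></b></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Only <a> tags with an href, inside a <b>, inside the div with this id.
active_links = [a["href"] for a in soup.select("#div_players_ b a[href]")]
print(active_links)
```

The site prefix would still have to be prepended to each relative path, as in the answers above.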

BeautifulSoup Nested class selector

I am using BeautifulSoup for a project. Here is my HTML structure
<div class="container">
<div class="fruits">
<div class="apple">
<span>Fruits are good</span>
<div class="mango">
<div class="apple">
Now I want to grab text in div class 'apple' which falls under class 'fruits'
This is what I have tried so far ....
for node in soup.find_all("div", class_="apple")
It's returning ...
But I want it to return only ...
Fruits are good
Please note that I DO NOT know the exact structure of elements inside div class="apple" There can be any type of different HTML elements inside that class. So the selector has to be flexible enough.
Here is the full code, where I need to add this BeautifulSoup code ...
class MySpider(CrawlSpider):
    name = 'dknnews'
    start_urls = ['']
    allowed_domains = ['']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        #soup = BeautifulSoup(content.decode('utf-8','ignore'))
        nf = NewsFields()
        ptype = soup.find_all(attrs={"name":"dknpagetype"})
        ptitle = soup.find_all(attrs={"name":"dknpagetitle"})
        pturl = soup.find_all(attrs={"name":"dknpageurl"})
        ptdate = soup.find_all(attrs={"name":"dknpagedate"})
        ptdesc = soup.find_all(attrs={"name":"dknpagedescription"})
        for node in soup.find_all("div", class_="apple"):  # THIS IS WHERE I NEED TO ADD THE BS CODE
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)
I am not sure how to use nested selectors with BeautifulSoup's find_all.
Any help is much appreciated.
Thanks
soup.select('.fruits .apple p')
Use a CSS selector; it makes nested classes very easy to express.
Or, you can use find() to get the p tag step by step.
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]
Use the strings generator to get all the strings under the div tag; stripped_strings strips surrounding whitespace and skips strings that are only whitespace, such as \n.
['John', 'Sam', 'Bailey', 'Jack', 'Sour', 'Sweet', 'Salty', 'Fruits are good']
Full code:
from bs4 import BeautifulSoup
source_code = """<div class="container">
<div class="fruits">
<div class="apple">
<span>Fruits are good</span>
<div class="mango">
<div class="apple">
"""
soup = BeautifulSoup(source_code, 'lxml')
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]
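To see that the descendant selector really skips an apple div outside fruits, here is a self-contained sketch. The HTML is a reduced stand-in with placeholder names, not the page from the question, and the find()-based variant is one possible reading of "step by step":

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
  <div class="fruits">
    <div class="apple"><p>John</p><span>Fruits are good</span></div>
    <div class="mango"><p>Placeholder A</p></div>
  </div>
  <div class="apple"><p>Placeholder B</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS descendant selector: .apple anywhere below .fruits
texts = [s for div in soup.select(".fruits .apple") for s in div.stripped_strings]

# Equivalent drill-down with find() / find_all()
fruits = soup.find("div", class_="fruits")
texts_via_find = [
    s
    for div in fruits.find_all("div", class_="apple")
    for s in div.stripped_strings
]
print(texts, texts_via_find)
```

Because stripped_strings walks every descendant, this does not depend on which elements appear inside the apple div.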