Given a page fetched with Python/requests, I need to find every element (div, h3, p, etc.) whose class name is "Specific".
This works only partially:
data = soup.findAll("div", { "class" : "Specific" })
because it finds only div elements.
I am looking for something like:
data = soup.findAll("*", { "class" : "Specific" })
data = soup.find_all(class_='Specific')

Pass the class_ parameter to find_all. The name parameter may be omitted, so tags of any name are matched:
In [12]: html = '''<div class='Specific'><span class='Specific c1'></span><p class='NonSpecific'></p></div>'''
In [13]: soup = bs4.BeautifulSoup(html, 'html.parser')
In [14]: soup.find_all(class_='Specific')
[<div class="Specific"><span class="Specific c1"></span><p class="NonSpecific"></p></div>,
<span class="Specific c1"></span>]

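A small sketch of the same idea, assuming a made-up HTML snippet (the markup and variable names are illustrative, not from the question). It shows that class_ matches any tag carrying the class, that a CSS selector gives the same result, and that a list of tag names can narrow the search:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html = '<div class="Specific"><h3 class="Specific big">t</h3><p class="Other">x</p></div>'
soup = BeautifulSoup(html, "html.parser")

# class_ matches every tag that has "Specific" among its classes
by_keyword = [tag.name for tag in soup.find_all(class_="Specific")]

# The equivalent CSS class selector
by_css = [tag.name for tag in soup.select(".Specific")]

# A list of tag names restricts the match to those tags
only_some = [tag.name for tag in soup.find_all(["h3", "p"], class_="Specific")]

print(by_keyword, by_css, only_some)
# ['div', 'h3'] ['div', 'h3'] ['h3']
```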

Can BeautifulSoup locate an element based on its contained text?

Observe the following problem:
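import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""", "html.parser")
# This returns the <a> element
soup.find('a', href="/customer-menu/1/accounts/1/update", text=re.compile(".*Edit.*"))
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""", "html.parser")
# This returns None
soup.find('a', href="/customer-menu/1/accounts/1/update", text=re.compile(".*Edit.*"))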
For some reason, BeautifulSoup does not match the text when the <i> tag is there as well. Finding the tag and showing its text produces:
>>> a2 = soup.find('a', href="/customer-menu/1/accounts/1/update")
>>> print(repr(a2.text))
'\n Edit\n'
Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:
pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n') # Returns None
pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n') # Returns MatchObject
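To see the difference in isolation, here is a sketch that uses only the re module on the same string (the variable names are illustrative). match() anchors at the start of the string and, without DOTALL, the dot does not cross a newline, while search() looks for the pattern anywhere:

```python
import re

text = "\n Edit\n"

plain = re.compile(".*Edit.*")
dotall = re.compile(".*Edit.*", flags=re.DOTALL)

# match() starts at position 0; "." cannot consume the leading "\n"
match_plain = plain.match(text)

# With DOTALL, "." also matches newlines, so the whole string matches
match_dotall = dotall.match(text)

# search() scans the string, so no ".*" padding or DOTALL is needed
search_plain = re.search("Edit", text)

print(match_plain)                  # None
print(repr(match_dotall.group()))   # '\n Edit\n'
print(search_plain.group())         # Edit
```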
Alright, that looks good. Let's try it with soup:
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""", "html.parser")
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
) # Still returns None... Why?!
My solution, based on geckon's answer: I implemented these helpers:
import re
MATCH_ALL = r'.*'
def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)
def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.
    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # Tags must be converted to str before they can be joined
        raise ValueError("Too many matches:\n" + "\n".join(str(m) for m in matches))
    elif len(matches) == 0:
        return None
    return matches[0]
Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')
The problem is that your <a> tag, with the <i> tag inside it, doesn't have the string attribute you expect it to have. First, let's take a look at what the text="" argument of find() does.
NOTE: text is the old name of this argument; since BeautifulSoup 4.4.0 it is called string.
From the docs:
Although string is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
.string matches your value for string. This code finds the tags
whose .string is “Elsie”:
soup.find_all("a", string="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Now let's take a look at what a Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
title_tag.string
# 'The Dormouse's story'
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None:
print(soup.html.string)
# None
This is exactly your case. Your <a> tag contains both a text node and an <i> tag, so its .string is None. find() therefore has nothing to compare your pattern against, and the tag can't match.
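A minimal sketch of that behaviour, assuming two shortened versions of the link (the href "/x" and the variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# One child (a text node) versus two children (an <i> tag plus a text node)
single = BeautifulSoup('<a href="/x">Edit</a>', "html.parser").a
mixed = BeautifulSoup('<a href="/x"><i class="fa fa-edit"></i> Edit</a>', "html.parser").a

single_string = single.string   # the only child is a NavigableString
mixed_string = mixed.string     # more than one child, so .string is None
mixed_text = mixed.get_text()   # get_text() still concatenates all text nodes

print(repr(single_string), repr(mixed_string), repr(mixed_text))
# 'Edit' None ' Edit'
```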
How to solve this?
Maybe there is a better solution but I would probably go with something like this:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""", "html.parser")
links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")
for link in links:
    if link.find(text=re.compile("Edit")):
        thelink = link
I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.
In one line, using a lambda:
soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)
You can pass .find a function that returns True if the tag's text contains "Edit":
In [51]: def Edit_in_text(tag):
   ....:     return tag.name == 'a' and 'Edit' in tag.text
In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
You can use the .get_text() method instead of .text in your function, which gives the same result:
def Edit_in_text(tag):
    return tag.name == 'a' and 'Edit' in tag.get_text()
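The same idea can be wrapped in a reusable helper. This is only a sketch: find_containing is a hypothetical name, and the two-link markup is made up for illustration.

```python
from bs4 import BeautifulSoup

def find_containing(soup, name, text, **attrs):
    """Return the first <name> tag whose text contains `text`, or None.

    Hypothetical helper: extra keyword arguments are passed to find()
    as attribute filters, e.g. href="...".
    """
    def matches(tag):
        return tag.name == name and text in tag.get_text()
    return soup.find(matches, **attrs)

# Illustrative markup with two links
html = '<a href="/a">View</a><a href="/b"><i class="fa fa-edit"></i> Edit</a>'
soup = BeautifulSoup(html, "html.parser")

found = find_containing(soup, "a", "Edit", href="/b")
not_found = find_containing(soup, "a", "Delete")

print(found["href"], not_found)
# /b None
```

Because the function checks get_text() rather than .string, it does not matter how many child nodes the tag has.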
With soupsieve 2.1.0 or later, you can use the :-soup-contains CSS pseudo-class selector to target a node's text. This replaces the deprecated :contains() form.
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""", "html.parser")
single = soup.select_one('a:-soup-contains("Edit")').text.strip()
multiple = [i.text.strip() for i in soup.select('a:-soup-contains("Edit")')]
print(single, '\n', multiple)
Method 1: checking the string property (this matches only when the tag's .string is exactly 'Edit')
pattern = 'Edit'
a2 = soup.find_all('a', string = pattern)[0]
Method 2: using a lambda to iterate through all elements
a2 = soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)

Extracting data from div tag

I'm scraping data from a website that has some data in a div tag, like this:
<div class="search-result__title">\nDonald Duck <span>\xa0|\xa0</span>\n<span class="city state" data-city="city, TX;city, TX;city, TX;city, TX" data-state="TX">STATENAME, CITYNAME\n</span>\n</div>,
I want to scrape the "Donald Duck" part and the state and city name after rel="nofollow".
The site contains a lot of data, so the name and state differ from entry to entry.
The code that I have written is:
div = soup.find_all('div', {'class':'search-result__title'})
print (div.string)
This gives me an error:
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
First, use .text. Second, find_all() returns a list of elements, so you need to index into it, e.g. print (div[0].text), or, since you will probably have more than one element, just iterate through them:
from bs4 import BeautifulSoup
html = '''<div class="search-result__title">\nDonald Duck <span>\xa0|\xa0</span>\n<span class="city state" data-city="city, TX;city, TX;city, TX;city, TX" data-state="TX">STATENAME, CITYNAME\n</span>\n</div>'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find_all('div', {'class':'search-result__title'})
print (div[0].text)
for each in div:
    print (each.text)
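If the name and the location are needed as separate values, something like the following sketch might work. It assumes every result div has the same layout as the sample (the name as the first text node, then a span with class "city"); the shortened data-city value and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with a shortened data-city attribute
html = ('<div class="search-result__title">\nDonald Duck <span>\xa0|\xa0</span>\n'
        '<span class="city state" data-city="city, TX" data-state="TX">'
        'STATENAME, CITYNAME\n</span>\n</div>')
soup = BeautifulSoup(html, "html.parser")

rows = []
for div in soup.find_all("div", class_="search-result__title"):
    # The first non-blank string inside the div is the name
    name = next(div.stripped_strings)
    # class_ matches a span that has "city" among its classes
    span = div.find("span", class_="city")
    rows.append((name, span.get_text(strip=True), span["data-state"]))

print(rows)
# [('Donald Duck', 'STATENAME, CITYNAME', 'TX')]
```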

How to get the content inside a tag in Beautiful Soup 4?

How do I get all the content inside an HTML tag?
from bs4 import BeautifulSoup
content = "<a><b>scgvggvd</b></a>"
soup = BeautifulSoup(content, 'html.parser')
matched_list = soup.find('a')
The code above returns:
<a><b>scgvggvd</b></a>
What I want is:
<b>scgvggvd</b>
That is, the <a> tag is removed after it is found.
I hope the solution works with find_all() too.
If the <b> tag is a sibling of the <a> tag use the following line:
matched_list = soup.select_one('b')
If the <b> tag is a child of the <a> tag use the following line:
matched_list = soup.select_one('a b')
Use select instead of select_one if you need multiple hits.
from bs4 import BeautifulSoup
content = "<a><b>scgvggvd</b></a>"
soup = BeautifulSoup(content, 'html.parser')
matched_list = soup.find('a')
# Iterating over a Tag yields its direct children
for b in matched_list:
    print(b)
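Another option, sketched here under the assumption that the inner markup is wanted as a string, is Tag.decode_contents(), which renders a tag's children without the tag itself. The second <a> in the sample is made up to show that it works with find_all() too.

```python
from bs4 import BeautifulSoup

# The question's markup plus one extra, illustrative <a> tag
content = "<a><b>scgvggvd</b></a><a><i>x</i>y</a>"
soup = BeautifulSoup(content, "html.parser")

# Inner markup of the first <a>, without the <a> tag itself
inner = soup.find("a").decode_contents()

# The same for every <a> returned by find_all()
all_inner = [a.decode_contents() for a in soup.find_all("a")]

print(inner)      # <b>scgvggvd</b>
print(all_inner)  # ['<b>scgvggvd</b>', '<i>x</i>y']
```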

Scrapy find all links with different(similar) class

I'm trying to scrape links with a certain class, "post-item post-item-xxxxx". But since the class is different for each item, how can I capture all of them?
<li class="post-item post-item-18887"><a href="" title="Post1"></a></li>
<li class="post-item post-item-18883"><a href="" title="Post2"></a></li>
my code:
scrap all the cafe links from
import scrapy
class DengaSpider(scrapy.Spider):
    name = 'cafes'
    allowed_domains = ['']
    start_urls = [...]
    rules = [...]
    def parse(self, response):
        cafelink = response.css('post.item').xpath('//a/@href').extract()
        if cafelink is not None:
            ...
The .css part is not working. How can I fix it?
Here's a sample run for the above HTML in the Scrapy shell:
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="Test HTML String", body='<li class="post-item post-item-18887"><a href="" title="Post2"</li>', encoding='utf-8')
>>> cafelink = response.css('li.post-item a::attr(href)').extract_first()
>>> cafelink
>>> cafelink = response.css('li.post-item a::attr(href)').extract()
>>> cafelink
['', '']
XPath has the contains() function for this, so you might try this:
cafelink = response.xpath("//*[contains(@class, 'post-item-')]//a/@href").extract()
Also, be careful when using // at the start of an XPath expression: it makes the search start at the document root, no matter where the current node is.
If all the items you want also have the "post-item" class then why do you need to capture them by their other class? In case you still need to do that, try the "starts with" CSS selector:
response.css('li[class^="post-item post-item-"]')

data-lazy beautifulsoup html find

I am having problems accessing specific attributes in BeautifulSoup:
<div class="route_list "
I am trying to extract only the company and departure date, and the following code raises a KeyError.
bsObj = BeautifulSoup(, "html.parser")
departure = div.attrs['data-ubt-departuredate']
You might not be targeting the desired div. Narrow down your search:
div = bsObj.find("div", class_="route_list")
Or check for the presence of the data-ubt-departuredate attribute:
div = bsObj.find("div", {"data-ubt-departuredate": True})
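To avoid the KeyError entirely, the attribute can also be read with .get(), which returns None when it is missing. The sketch below uses made-up markup: the attribute values and the data-ubt-company name are assumptions for illustration, since the real page is not shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only data-ubt-departuredate is named in the question
html = ('<div class="other"></div>'
        '<div class="route_list " data-ubt-departuredate="2024-01-01" '
        'data-ubt-company="ExampleLine"></div>')
soup = BeautifulSoup(html, "html.parser")

# Select the first div that actually carries the attribute
div = soup.find("div", attrs={"data-ubt-departuredate": True})
departure = div.get("data-ubt-departuredate")
company = div.get("data-ubt-company")  # assumed attribute name

# On a div without the attribute, .get() returns None instead of raising
missing = soup.find("div", class_="other").get("data-ubt-departuredate")

print(departure, company, missing)
# 2024-01-01 ExampleLine None
```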