Beautiful Soup (Python) : find all kind of div, p, span, etc, with a given class - beautifulsoup

Given a specific URL rendered with Python/requests, I need to findAll kind of div, h3, p, etc with class name "Specific".
This works partially :
data = soup.findAll("div", { "class" : "Specific" })
because it only finds div.
I am looking for something like :
data = soup.findAll("*", { "class" : "Specific" })
Good answer from soon :
data = soup.find_all(class_='Specific')

You should specify class_ parameter in the find_all method. name parameter may be omitted as well:
In [12]: html = '''<div class='Specific'><span class='Specific c1'></span><p class='NonSpecific'></p></div>'''
In [13]: soup = bs4.BeautifulSoup(html, 'html.parser')
In [14]: soup.find_all(class_='Specific')
Out[14]:
[<div class="Specific"><span class="Specific c1"></span><p class="NonSpecific"></p></div>,
<span class="Specific c1"></span>]

Related

Does BeautifulSoup can locate the element basing on contained text? [duplicate]

Observe the following problem:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
# This returns the <a> element
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*")
)
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
# This returns None
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*")
)
For some reason, BeautifulSoup will not match the text, when the <i> tag is there as well. Finding the tag and showing its text produces
>>> a2 = soup.find(
'a',
href="/customer-menu/1/accounts/1/update"
)
>>> print(repr(a2.text))
'\n Edit\n'
Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:
pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n') # Returns None
pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n') # Returns MatchObject
Alright. Looks good. Let's try it with soup
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*", flags=re.DOTALL)
) # Still return None... Why?!
Edit
My solution based on geckons answer: I implemented these helpers:
import re
MATCH_ALL = r'.*'
def like(string):
"""
Return a compiled regular expression that matches the given
string with any prefix and postfix, e.g. if string = "hello",
the returned regex matches r".*hello.*"
"""
string_ = string
if not isinstance(string_, str):
string_ = str(string_)
regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
return re.compile(regex, flags=re.DOTALL)
def find_by_text(soup, text, tag, **kwargs):
"""
Find the tag in soup that matches all provided kwargs, and contains the
text.
If no match is found, return None.
If more than one match is found, raise ValueError.
"""
elements = soup.find_all(tag, **kwargs)
matches = []
for element in elements:
if element.find(text=like(text)):
matches.append(element)
if len(matches) > 1:
raise ValueError("Too many matches:\n" + "\n".join(matches))
elif len(matches) == 0:
return None
else:
return matches[0]
Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')
The problem is that your <a> tag with the <i> tag inside, doesn't have the string attribute you expect it to have. First let's take a look at what text="" argument for find() does.
NOTE: The text argument is an old name, since BeautifulSoup 4.4.0 it's called string.
From the docs:
Although string is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
.string matches your value for string. This code finds the tags
whose .string is “Elsie”:
soup.find_all("a", string="Elsie")
# [Elsie]
Now let's take a look what Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
title_tag.string
# u'The Dormouse's story'
(...)
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None:
print(soup.html.string)
# None
This is exactly your case. Your <a> tag contains a text and <i> tag. Therefore, the find gets None when trying to search for a string and thus it can't match.
How to solve this?
Maybe there is a better solution but I would probably go with something like this:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")
for link in links:
if link.find(text=re.compile("Edit")):
thelink = link
break
print(thelink)
I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.
in one line using lambda
soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)
You can pass a function that return True if a text contains "Edit" to .find
In [51]: def Edit_in_text(tag):
....: return tag.name == 'a' and 'Edit' in tag.text
....:
In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]:
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
EDIT:
You can use the .get_text() method instead of the text in your function which gives the same result:
def Edit_in_text(tag):
return tag.name == 'a' and 'Edit' in tag.get_text()
With soupsieve 2.1.0 you can use :-soup-contains css pseudo class selector to target a node's text. This replaces the deprecated form of :contains().
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
single = soup.select_one('a:-soup-contains("Edit")').text.strip()
multiple = [i.text.strip() for i in soup.select('a:-soup-contains("Edit")')]
print(single, '\n', multiple)
Method - 1: Checking text property
pattern = 'Edit'
a2 = soup.find_all('a', string = pattern)[0]
Method - 2: Using lambda iterate through all elements
a2 = soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)
Good Luck

Extracting data from div tag

so im scraping data from a website and it has some data in its div tag
like this :
<div class="search-result__title">\nDonald Duck <span>\xa0|\xa0</span>\n<span class="city state" data-city="city, TX;city, TX;city, TX;city, TX" data-state="TX">STATENAME, CITYNAME\n</span>\n</div>,
I want to scrape "Donald Duck" part and state and city name after rel="nofollow"
the site contains a lot of data so name and state are different
the code that i have written is
div = soup.find_all('div', {'class':'search-result__title'})
print (div.string)
this gives me a error
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
first, use .text. Second, find_all() will return a list of elements. You need to specify the index value with either: print (div[0].text), or since you will probably have more than 1 element, just iterate through them
from bs4 import BeautifulSoup
html = '''<div class="search-result__title">\nDonald Duck <span>\xa0|\xa0</span>\n<span class="city state" data-city="city, TX;city, TX;city, TX;city, TX" data-state="TX">STATENAME, CITYNAME\n</span>\n</div>'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find_all('div', {'class':'search-result__title'})
print (div[0].text)
#OR
for each in div:
print (each.text)

How to get content inside tag in beautiful Shop 4?

how to get all content inside a html tags ?
from bs4 import BeautifulSoup
content = "<a><b>scgvggvd</b></a>"
soup = BeautifulSoup(content, 'html.parser')
matched_list = soup.find('a')
print(matched_list)
code above will return :
<a><b>scgvggvd</b></a>
what i want is :
<b>scgvggvd</b>
the tag <a> is removed after it's found
i hope the solution will works with find_all() too
If the <b> tag is a sibling of the <a> tag use the following line:
matched_list = soup.select_one('b')
If the <b> tag is a child of the <a> tag use the following line:
matched_list = soup.select_one('a b')
Use select instead of select_one if you need multiple hits.
from bs4 import BeautifulSoup
content = "<a><b>scgvggvd</b></a>"
soup = BeautifulSoup(content, 'html.parser')
matched_list = soup.find('a')
for b in matched_list:
print(b)

Scrapy find all links with different(similar) class

I'm trying to scrap links with certain class "post-item post-item-xxxxx". But since the class is different in each, how can I capture all of them?
<li class="post-item post-item-18887"><a
href="http://example.com/archives/18887.html" title="Post1"</a></li>
<li class="post-item post-item-18883"><a href="http://example.com/archives/18883.html" title="Post2"</a></li>
my code:
scrap all the cafe links from example.com
class DengaSpider(scrapy.Spider):
name = 'cafes'
allowed_domains = ['example.com']
start_urls = [
'http://example.com/archives/8136.html',
]
rules = [
Rule(
LinkExtractor(
allow=('^http://example\.com/archives/\d+.html$'),
unique=True
),
follow=True,
callback="parse_items"
)
]
def parse(self, response):
cafelink = response.css('post.item').xpath('//a/#href').extract()
if cafelink is not None:
print(cafelink)
the .css part is not working, how can I fix it?
Here's a sample run for the above html in scrapy shell:
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="Test HTML String", body='<li class="post-item post-item-18887"><a href="http://example.com/archives/18883.html" title="Post2"</li>', encoding='utf-8')
>>>
>>> cafelink = response.css('li.post-item a::attr(href)').extract_first()
>>> cafelink
'http://example.com/archives/18887.html'
>>>
>>> cafelink = response.css('li.post-item a::attr(href)').extract()
>>> cafelink
['http://example.com/archives/18887.html', 'http://example.com/archives/18883.html']
Xpath has the contains() method for this, so you might try this:
cafelink = response.xpath("//*[contains(#class, 'post-item-')]//a/#href").extract()
Also be careful when using // in xpath. It makes xpath starts the search in the document root, no matter where it currently is.
If all the items you want also have the "post-item" class then why do you need to capture them by their other class? In case you still need to do that, try the "starts with" CSS selector:
response.css('li[class^="post-item post-item-"]')
Documentation here.

data-lazy beautifulsoup html find

I am having problems calling specific attributes in beautifulsoup
<div class="route_list "
data-id="11234"
data-lazy="ubt"
data-ubt-company="ABC"
data-ubt-departuredate="2016-11-10"
data-ubt-destcountry="China,"
data-ubt-from="Shanghai"
data-ubt-mark="Bus"
data-ubt-price="2399"
data-ubt-sailingid="11185"
data-ubt-score="4.4"
data-ubt-sourcefrom="Cruise"
data-ubt-voyaid="1184">
I am trying to extract only the company and departure date and the following code returns a key error.
bsObj = BeautifulSoup(html.read(), "html.parser")
div=bsObj.div
departure = div.attrs['data-ubt-departuredate']
You might not be targeting the desired div, narrow down your search:
div = bsObj.find("div", class_="route_list")
Or, checking the presence of the data-ubt-departuredate attribute:
div = bsObj.find("div", {"data-ubt-departuredate": True})