Beautiful Soup - How to find tags after a specific item in HTML? - beautifulsoup

I need to find tags after a specific item on a website. So, is there a way to skip the tag objects until this specific one, then find the matching ones to given criteria? I need all p with class XYZ after the div with class ABC.
response = requests.get(url).text
soup = BeautifulSoup(response)
items = soup.find_all('p', {'class': 'MessageTextSize js-message-text message-text'}) # only return the ones after the div with class of "Text 2"
Edit: You can see a sample code block below which is part response. The aim is finding the last two paragraphs (Text 3 & Text 4) despite the first one (Text 1) also has the same p class with them. So, I need to look for the parameter of find_all function after the Text 2 (class MessageTextSize js-message-text message-text).
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 1</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize MessageTextSize--jumbo js-message-text message-text" data-aria-label-part="0">Text 2</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 3</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 4</p>
</div>
p.s. bs4 version is 4.8.1, which is the latest release.

You can always use a custom function (or a lambda expression) inside find_all. The following is self-explanatory (IMO).
result = soup.find_all(
lambda x: x.name == 'p' and
'XYZ' in x.get('class', '') and
x.find_previous('div', class_='ABC')
)
Example
from bs4 import BeautifulSoup
html = """
<p class="XYZ">Text 1</p>
<p class="XYZ">Text 2</p>
<div class="ABC"></div>
<p class="XYZ">Text 3</p>
<p class="XYZ">Text 4</p>
"""
soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all(
lambda x: x.name == 'p' and
'XYZ' in x.get('class', '') and
x.find_previous('div', class_='ABC')
)
print(result)
Output
[<p class="XYZ">Text 3</p>, <p class="XYZ">Text 4</p>]
EDIT
MessageTextSize js-message-text message-text represents three classes, not one.
x.get('class', '') returns a list of classes -
['MessageTextSize', 'js-message-text', 'message-text']
In your particular case, you have to target a p tag not a div, if I understood correctly.
So, you have to use
result = soup.find_all(
lambda x: x.name == 'p' and
'MessageTextSize js-message-text message-text' in ' '.join(x.get('class', ''))
and x.find_previous('p', class_='MessageTextSize MessageTextSize--jumbo js-message-text message-text')
)
Ref:
find_previous()
Function as filter

If I understand you correctly, this should work:
item = soup.select_one('p[class*="MessageTextSize--jumbo"]')
sibs = item.parent.find_next_siblings()
for sib in sibs:
print(sib.text.strip())
Output:
Text 3
Text 4

Related

How to extract value of all classes in beautiful Soup

I have a HTML file with a structure like this:
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
I need to have a Python dict like this:
{'institution': ['Q8901', 'Q806176'], 'person': ['Q563217']}
So I need to get the value of the class attribute of all span tags, along with their text. How can I do this with bs4?
Select your elements and iterate the ResultSet while appending the values to your dict. To extract the values of an attribute use .get(). Because class will give you a list pick yours by index or key.
Example
from bs4 import BeautifulSoup
html = '''
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
'''
soup = BeautifulSoup(html)
d = {
'institution':[],
'person':[]
}
for e in soup.select('span[wikidata]'):
d[e.get('class')[0]].append(e.get('wikidata'))
d
Output
{'institution': ['Q8901', 'Q806176'], 'person': ['Q563217']}
This is the way I solved my problem thanks to #HedgeHog.
from bs4 import BeautifulSoup
from collections import defaultdict
def capture_info(soup: 'BeautifulSoup') -> defaultdict:
info = defaultdict(list)
for i in soup.select('span[Wikidata]'):
info[i.get('class')[0]].append(i.get('wikidata'))
return info
html = '''
<p id="01">... EU legislation and the <em>monetary power</em> of the
<span class="institution" Wikidata="Q8901" name="European Central Bank">ECB</span>.</p>
<p id="02"><span class="person" Wikidata="Q563217">Guido Carli</span>, Governor of the
<span class="institution" Wikidata="Q806176">Bank of Italy</span> ...</p>
'''
soup = BeautifulSoup(html, 'html.parser')
info = capture_info(soup)
The output is:
{'institution': ['Q8901', 'Q806176'], 'person': ['Q563217']})

How can I get a value from an attribute inside a tag

I have a soup object like:
<a class="love-action js-add-to-favorites" data-id="415953" data-price="715.00" href="#">
</a>
I did
soup = BeautifulSoup(src, 'lxml') #передаем переменную в суп
price = soup.find(class_='col-5 col-sm-4 col-lg-7 mob-position detail-top-actions').find('a', class_='love-action js-add-to-favorites')
print(price)
I'd like to get only: 715.00
How to fix?
You can access attributes of a tag by treating it like a dictionary - So simply get the value from the attribute data-price by:
price['data-price']
Example based on your question
soup = BeautifulSoup(src, 'lxml') #передаем переменную в суп
price = soup.find(class_='col-5 col-sm-4 col-lg-7 mob-position detail-top-actions').find('a', class_='love-action js-add-to-favorites')
print(price['data-price'])
Output
715.00

compare the 'class' of container tag

Let's say I extract some classes from some HTML:
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
print(p_standard)
And the output looks like this:
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
And let's say I only wanted to print the text inside the P3 classes so that the output looks like:
a
c
I thought this code below would work, but it didn't. How can I compare the class name of the container tag to some value?
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
if p_standard.get("class") == "P3":
print(p_standard.get_text())
I'm aware that in my first line, I could have simply done r"P3" instead of r"Standard|P3", but this is only a small fraction of the actual code (not the full story), and I need to leave that first line as it is.
Note: doing something like .find("p", class_ = "P3") only works for descendants, not for the container tag.
OK, so after playing around with the code, it turns out that
p_standard.get("class")[0] == "P3"
works. (I was missing the [0])
So this code works:
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
if p_standard.get("class")[0] == "P3":
print(p_standard.get_text())
I think the following is more efficient. Use select and CSS Or syntax to gather list based on either class.
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head></head>
<body>
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
p_standards = soup.select('.Standard,.P3')
for p_standard in p_standards:
if 'P3' in p_standard['class']:
print(item.text)

Scrapy CSV export shows the same data in all rows

I'm trying to scrape the following html code:
<ul class="results-list" id="search-results">
<li>
<h3 class="name">First John</h3>
<div class="details">
email
<span class="phone">999999999</span>
</div>
</li>
<li>
<h3 class="name">Second John</h3>
<div class="details">
email
<span class="phone">999999999</span>
</div>
</li>
</ul>
When I run my spider, I get 2 rows, containing the same information. I have name,email,phone columns and for example in the name column for both I would get:
First John,Second John.
My Scrapy code is the following:
people= response.xpath('//ul[#class="results-list"]/li')
for person in people:
item = SpiderItem()
item['Name'] = person.xpath(
'//h3/text()').extract()
item['Email'] = person.xpath(
'//div[#class="details"]/a/#href').extract()
item['Phone'] = person.xpath(
'//div[#class="details"]/span[#class="phone"]/text()').extract()
yield item
However when I run scrapy crawl MySpider -o output.csv I get the same information in all rows.
you are using absolute path on your xpath expressions, change them to:
for person in people:
item = SpiderItem()
item['Name'] = person.xpath(
'.//h3/text()').extract_first()
item['Email'] = person.xpath(
'.//div[#class="details"]/a/#href').extract_first()
item['Phone'] = person.xpath(
'.//div[#class="details"]/span[#class="phone"]/text()').extract_first()
yield item

BeautifulSoup Nested class selector

I am using BeautifulSoup for a project. Here is my HTML structure
<div class="container">
<div class="fruits">
<div class="apple">
<p>John</p>
<p>Sam</p>
<p>Bailey</p>
<p>Jack</p>
<ul>
<li>Sour</li>
<li>Sweet</li>
<li>Salty</li>
</ul>
<span>Fruits are good</span>
</div>
<div class="mango">
<p>Randy</p>
<p>James</p>
</div>
</div>
<div class="apple">
<p>Bill</p>
<p>Sean</p>
</div>
</div>
Now I want to grab text in div class 'apple' which falls under class 'fruits'
This is what I have tried so far ....
for node in soup.find_all("div", class_="apple")
Its returning ...
Bill
Sean
But I want it to return only ...
John
Sam
Bailey
Jack
Sour
Sweet
Salty
Fruits are good
Please note that I DO NOT know the exact structure of elements inside div class="apple" There can be any type of different HTML elements inside that class. So the selector has to be flexible enough.
Here is the full code, where I need to add this BeautifulSoup code ...
class MySpider(CrawlSpider):
name = 'dknnews'
start_urls = ['http://www.example.com/uat-area/scrapy/all-news-listing/_recache']
allowed_domains = ['example.com']
def parse(self, response):
hxs = Selector(response)
soup = BeautifulSoup(response.body, 'lxml')
#soup = BeautifulSoup(content.decode('utf-8','ignore'))
nf = NewsFields()
ptype = soup.find_all(attrs={"name":"dknpagetype"})
ptitle = soup.find_all(attrs={"name":"dknpagetitle"})
pturl = soup.find_all(attrs={"name":"dknpageurl"})
ptdate = soup.find_all(attrs={"name":"dknpagedate"})
ptdesc = soup.find_all(attrs={"name":"dknpagedescription"})
for node in soup.find_all("div", class_="apple"): <!-- THIS IS WHERE I NEED TO ADD THE BS CODE -->
ptbody = ''.join(node.find_all(text=True))
ptbody = ' '.join(ptbody.split())
nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
nf['bodytext'] = ptbody.encode('ascii', 'ignore')
yield nf
for url in hxs.xpath('//ul[#class="scrapy"]/li/a/#href').extract():
yield Request(url, callback=self.parse)
I am not sure how to use nested selectors with BeautifulSoup find_all ?
Any help is very appreciated.
Thanks
soup.select('.fruits .apple p')
use CSSselector, it's very easy to express class.
soup.find(class_='fruits').find(class_="apple").find_all('p')
Or, you can use find() to get the p tag step by step
EDIT:
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]
use strings generator to get all the string under the div tag, stripped_strings will get rid of \n in the results.
out:
['John', 'Sam', 'Bailey', 'Jack', 'Sour', 'Sweet', 'Salty', 'Fruits are good']
Full code:
from bs4 import BeautifulSoup
source_code = """<div class="container">
<div class="fruits">
<div class="apple">
<p>John</p>
<p>Sam</p>
<p>Bailey</p>
<p>Jack</p>
<ul>
<li>Sour</li>
<li>Sweet</li>
<li>Salty</li>
</ul>
<span>Fruits are good</span>
</div>
<div class="mango">
<p>Randy</p>
<p>James</p>
</div>
</div>
<div class="apple">
<p>Bill</p>
<p>Sean</p>
</div>
</div>
"""
soup = BeautifulSoup(source_code, 'lxml')
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]