Beautiful Soup: how to extract <p> content of a parent tag - beautifulsoup

In a text file, each item has the same structure, so I would like to parse it with Beautiful Soup.
An extract:
data = """
<article id="1" title="Titre 1" sourcename="Le monde" about="Fillon|Macron">
<p type="title">Sub title1</p>
<p>xxxxxxxxxxxxxxxxxxxxxxxxx</p>
</article>
<article id="2" title="Titre 2" sourcename="La Croix" about="Le Pen|Mélanchon">
<p type="title">Sub title2</p>
<p>yyyyyyyyyyyyyyyyyyyyyyyyy</p>
</article>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for text in soup.find_all('article'):
    print(text['id'])
    print(list(text.findChildren()))
    print(list(text.children))
I want to extract the content of the "p" tags.
For each article, I would like to get a list of lists (to convert to a pandas DataFrame).
For example:
[
[1, "Sub title1", "xxxxxxxxxxxxx"],
[2, "Sub title2", "yyyyyyyyyyyyy"],
]
Thanks a lot.
Théo

You're almost there.
result = [] # create a variable to store your results
for article in soup.find_all("article"):
    article_id = article["id"]
    title = article.select("p[type=title]")[0] # select the title tag
    title_text = title.text
    p = title.find_next("p").text # get the adjacent p tag
    result.append([article_id, title_text, p])
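As a self-contained sketch of the same approach, the snippet below rebuilds the sample markup from the question and returns one row per article. The function name extract_rows is an illustrative choice, not part of Beautiful Soup, and converting the id to int is an assumption made to match the desired output (article["id"] itself is a string).

```python
from bs4 import BeautifulSoup

data = """
<article id="1" title="Titre 1" sourcename="Le monde" about="Fillon|Macron">
<p type="title">Sub title1</p>
<p>xxxxxxxxxxxxxxxxxxxxxxxxx</p>
</article>
<article id="2" title="Titre 2" sourcename="La Croix" about="Le Pen|Mélanchon">
<p type="title">Sub title2</p>
<p>yyyyyyyyyyyyyyyyyyyyyyyyy</p>
</article>
"""


def extract_rows(markup):
    """Return [id, subtitle, body] for every <article> in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for article in soup.find_all("article"):
        # The subtitle is the <p> carrying type="title"
        title = article.find("p", attrs={"type": "title"})
        # Every other <p> in the article is treated as body text
        body = [
            p.get_text(strip=True)
            for p in article.find_all("p")
            if p.get("type") != "title"
        ]
        rows.append([
            int(article["id"]),  # assumption: ids are always numeric
            title.get_text(strip=True) if title is not None else None,
            " ".join(body),
        ])
    return rows


rows = extract_rows(data)
print(rows)
```

The resulting list of lists can then be passed to pandas.DataFrame(rows, columns=[...]) with whatever column names suit the data.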

Related

How to extract text of specific tags with multiple occurrences

HTML:
<span class="font-weight-bold color-primary small text-right text-nowrap">29,95 €</span>
url = https://www.cardmarket.com/en/Magic/Cards/Bloodstained-Mire?sellerCountry=13&sellerReputation=2&language=1&minCondition=4#articleFilterSellerLocation
I wish to extract the text of 29,95 €.
Currently using BeautifulSoup. However, the page has a table with many other texts like this which I also wish to extract. How do I find all of these tags and extract only the text at the end to a list?
The current code I have tried is:
for price in new_page:
    new_page.find("div", class_="table-body")
    price = new_page.find_all("span", attrs="font-weight-bold color-primary small text-right text-nowrap")
    output_price = [x["font-weight-bold color-primary small text-right text-nowrap"] for x in price]
import requests
from bs4 import BeautifulSoup
def main(url):
    params = {
        "sellerCountry": "13",
        "sellerReputation": "2",
        "language": "1",
        "minCondition": "4"
    }
    r = requests.get(url, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    print(soup.select_one('dl.labeled dd:nth-child(6)').text)
main('https://www.cardmarket.com/en/Magic/Cards/Bloodstained-Mire')
Output:
29,95 €
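To collect every price into a list rather than a single value, a sketch along these lines may help. Indexing a tag with square brackets, as in the question's attempt, looks up an attribute by name; the visible text comes from get_text() or .text instead. The HTML below is a hypothetical stand-in for the downloaded page (the second price and the surrounding rows are made up), so the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two price cells in the table,
# plus one unrelated span outside it.
html = """
<div class="table-body">
  <div class="row"><span class="font-weight-bold color-primary small text-right text-nowrap">29,95 €</span></div>
  <div class="row"><span class="font-weight-bold color-primary small text-right text-nowrap">31,00 €</span></div>
</div>
<span class="small">not a price</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the table first, then match spans that carry
# both of the distinguishing classes.
table = soup.find("div", class_="table-body")
price_tags = table.select("span.font-weight-bold.color-primary")

# get_text(strip=True) returns the text inside each tag without padding
prices = [tag.get_text(strip=True) for tag in price_tags]
print(prices)
```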

Find multiple tags with condition

Is it possible to find multiple tags with a condition?
<a href = "/img/something.jpg">
<img src= "/img/somethingelse.png">
Could I say
Find all "a" and "img" tags containing "/img/"
Yes, just pass a function (it can be a lambda) to the find_all() method:
data = """<a href = "/img/something.jpg">
<img src= "/img/somethingelse.png">"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
# The outer parentheses matter: without them "and" binds tighter than "or",
# so any tag with a matching src would pass regardless of its name.
for tag in soup.body.find_all(lambda t: t.name in ('a', 'img') and (
        ('href' in t.attrs and '/img/' in t['href']) or
        ('src' in t.attrs and '/img/' in t['src']))):
    print(tag.name, tag.attrs)
    print('*' * 80)
print('*' * 80)
Outputs:
a {'href': '/img/something.jpg'}
********************************************************************************
img {'src': '/img/somethingelse.png'}
********************************************************************************
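The same condition can also be written as a CSS selector, which some readers may find easier to scan than a lambda. This is an alternative sketch, not the answer's method: it uses the attribute substring operator *= and a comma to combine the two cases. The two non-matching tags in the sample are added here for illustration only.

```python
from bs4 import BeautifulSoup

data = """
<a href="/img/something.jpg">match</a>
<img src="/img/somethingelse.png">
<a href="/docs/page.html">no match</a>
<img src="/static/logo.png">
"""

soup = BeautifulSoup(data, "html.parser")

# a[href*="/img/"]  -> <a> tags whose href contains "/img/"
# img[src*="/img/"] -> <img> tags whose src contains "/img/"
matches = soup.select('a[href*="/img/"], img[src*="/img/"]')

found = [(tag.name, tag.get("href") or tag.get("src")) for tag in matches]
print(found)
```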

Issue parsing variable from HTML with bs4

I'm trying to parse the "value" of a variable (__VIEWSTATEGENERATOR). Here's the HTML code:
<div>
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>
</div>
Here's the code I am attempting to do that with:
viewstategenerator = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATEGENERATOR"})
I then execute print(viewstategenerator), and I get the following for my variable:
>>> print(viewstategenerator)
[<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>]
I was expecting to grab just the value "1434571F", and I'm not sure why I get the whole element instead. Any help would be highly appreciated!
It looks like you're close but just a tad confused about the BeautifulSoup API.
soup.findAll returns a list of all of the DOM elements that match the query you gave it. Seeing as only one element on the page can match your query, you should use soup.find instead. To get the value of the value attribute of your input element, use ['value'].
from bs4 import BeautifulSoup as Soup
html = """
<div>
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>
</div>
"""
soup = Soup(html, 'lxml') # Use whatever parser you're already using.
viewstategenerator = soup.find("input", {"type": "hidden", "name": "__VIEWSTATEGENERATOR"})
print(viewstategenerator['value'])
# Prints 1434571F
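To make the difference concrete, here is a small sketch comparing the two calls on the same markup. It also shows one defensive pattern for pages where the field might be absent: find returns None when nothing matches, so indexing the result directly would raise an error. The field name __EVENTVALIDATION is used only as an example of a name that does not occur in this snippet.

```python
from bs4 import BeautifulSoup

html = """
<div>
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet, so an element must be picked first
matches = soup.find_all("input", {"name": "__VIEWSTATEGENERATOR"})
first_value = matches[0]["value"]

# find returns the first matching tag directly, or None if there is no match
tag = soup.find("input", {"name": "__VIEWSTATEGENERATOR"})
value = tag.get("value") if tag is not None else None

# A name that is not present in this snippet: find gives back None
missing = soup.find("input", {"name": "__EVENTVALIDATION"})

print(first_value, value, missing)
```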

BS4: issues finding href of 2 tags

I'm having problems getting soup to return all links that are both bold and have a URL. Right now it's only returning the 1st one on the page.
Here is part of the source:
<div class="section_wrapper" id="all_players_">
<div class="section_heading">
<span class="section_anchor" id="players__link" data-label="925 Players"></span>
<h2>925 Players</h2> <div class="section_heading_text">
<ul> <li><strong>Bold</strong> indicates active player and + indicates a Hall of Famer.</li>
</ul>
</div>
</div> <div class="section_content" id="div_players_">
<p>John D'Acquisto (1973-1982)</p>
<p>Jeff D'Amico (1996-2004)</p>
<p>Jeff D'Amico (2000-2000)</p>
<p>Jamie D'Antona (2008-2008)</p>
<p>Jerry D'Arcy (1911-1911)</p>
<p><b>Chase d'Arnaud (2011-2016)</b></p>
<p><b>Travis d'Arnaud (2013-2016)</b></p>
<p>Omar Daal (1993-2003)</p>
<p>Paul Dade (1975-1980)</p>
<p>John Dagenhard (1943-1943)</p>
<p>Pete Daglia (1932-1932)</p>
<p>Angelo Dagres (1955-1955)</p>
<p><b>David Dahl (2016-2016)</b></p>
<p>Jay Dahl (1963-1963)</p>
<p>Bill Dahlen (1891-1911)</p>
<p>Babe Dahlgren (1935-1946)</p>
and here is my script:
import urllib.request
from bs4 import BeautifulSoup as bs
import re
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
for player_url in soup.b.find_all(limit=None):
    for player_link in re.findall('/players/', player_url['href']):
        print('http://www.baseball-reference.com' + player_url['href'])
The other part is that there are other div id's that have similar lists that I don't care about. I want to grab the URLs from only this div class, that have a <b> tag. The <b> tag symbolizes that they are active players and that is what I am trying to capture.
Use BeautifulSoup to do the "selection" work and drill down to your data:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
bolds = soup.find_all('b')
for bold in bolds:
    player_link = bold.find('a')
    if player_link:
        relative_path = player_link['href']
        print('http://www.baseball-reference.com' + relative_path)
Now, if you only want the one div with id=div_players_, you could add an additional filter:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
div_players = soup.find('div', {'id': 'div_players_'})
bolds = div_players.find_all('b')
for bold in bolds:
    player_link = bold.find('a')
    if player_link:
        relative_path = player_link['href']
        print('http://www.baseball-reference.com' + relative_path)
This is what I ended up doing:
url = 'http://www.baseball-reference.com/players/d/'
content = urllib.request.urlopen(url)
soup = bs(content, 'html.parser')
for player_div in soup.find_all('div', {'id':'all_players_'}):
    for player_bold in player_div('b'):
        for player_href in player_bold('a'):
            print('http://www.baseball-reference.com' + player_href['href'])
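For reference, the question's script returned only one result because soup.b refers to the first <b> tag in the document, so the search never left that tag. The nested loops can also be collapsed into a single CSS selector, as in the offline sketch below. The markup and every href in it are hypothetical placeholders (the excerpt in the question does not show the link targets), and the second div id is made up to stand for the lists that should be ignored.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.baseball-reference.com"

# Hypothetical markup: the hrefs and the second div are placeholders.
html = """
<div class="section_content" id="div_players_">
<p><a href="/players/d/example01.shtml">Inactive Player (1973-1982)</a></p>
<p><b><a href="/players/d/example02.shtml">Active Player (2011-2016)</a></b></p>
</div>
<div class="section_content" id="div_other_">
<p><b><a href="/players/d/example03.shtml">Other List (2016-2016)</a></b></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "#div_players_ b a[href]" reads: inside the element with that id,
# any <a> with an href that sits somewhere inside a <b>.
links = [
    urljoin(BASE_URL, a["href"])
    for a in soup.select("#div_players_ b a[href]")
]
print(links)
```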

BeautifulSoup Nested class selector

I am using BeautifulSoup for a project. Here is my HTML structure
<div class="container">
<div class="fruits">
<div class="apple">
<p>John</p>
<p>Sam</p>
<p>Bailey</p>
<p>Jack</p>
<ul>
<li>Sour</li>
<li>Sweet</li>
<li>Salty</li>
</ul>
<span>Fruits are good</span>
</div>
<div class="mango">
<p>Randy</p>
<p>James</p>
</div>
</div>
<div class="apple">
<p>Bill</p>
<p>Sean</p>
</div>
</div>
Now I want to grab the text in the div with class 'apple' that sits under class 'fruits'.
This is what I have tried so far:
for node in soup.find_all("div", class_="apple"):
It's returning ...
Bill
Sean
But I want it to return only ...
John
Sam
Bailey
Jack
Sour
Sweet
Salty
Fruits are good
Please note that I DO NOT know the exact structure of the elements inside div class="apple". There can be any type of HTML element inside that class, so the selector has to be flexible enough.
Here is the full code, where I need to add this BeautifulSoup code:
class MySpider(CrawlSpider):
    name = 'dknnews'
    start_urls = ['http://www.example.com/uat-area/scrapy/all-news-listing/_recache']
    allowed_domains = ['example.com']
    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        #soup = BeautifulSoup(content.decode('utf-8','ignore'))
        nf = NewsFields()
        ptype = soup.find_all(attrs={"name":"dknpagetype"})
        ptitle = soup.find_all(attrs={"name":"dknpagetitle"})
        pturl = soup.find_all(attrs={"name":"dknpageurl"})
        ptdate = soup.find_all(attrs={"name":"dknpagedate"})
        ptdesc = soup.find_all(attrs={"name":"dknpagedescription"})
        for node in soup.find_all("div", class_="apple"):  # THIS IS WHERE I NEED TO ADD THE BS CODE
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)
I am not sure how to use nested selectors with BeautifulSoup's find_all.
Any help is much appreciated.
Thanks
soup.select('.fruits .apple p')
Use a CSS selector; it makes nested classes easy to express.
soup.find(class_='fruits').find(class_="apple").find_all('p')
Or you can use find() to reach the p tags step by step.
EDIT:
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]
Use the stripped_strings generator to get every string under the div tag; it strips surrounding whitespace and skips whitespace-only strings such as "\n".
Output:
['John', 'Sam', 'Bailey', 'Jack', 'Sour', 'Sweet', 'Salty', 'Fruits are good']
Full code:
from bs4 import BeautifulSoup
source_code = """<div class="container">
<div class="fruits">
<div class="apple">
<p>John</p>
<p>Sam</p>
<p>Bailey</p>
<p>Jack</p>
<ul>
<li>Sour</li>
<li>Sweet</li>
<li>Salty</li>
</ul>
<span>Fruits are good</span>
</div>
<div class="mango">
<p>Randy</p>
<p>James</p>
</div>
</div>
<div class="apple">
<p>Bill</p>
<p>Sean</p>
</div>
</div>
"""
soup = BeautifulSoup(source_code, 'lxml')
[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]
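Since the spider in the question joins the text into one body string, a variation on the same selector may be closer to what it needs. The sketch below uses get_text with a separator and strip=True, which yields one whitespace-normalised string per matching div. The markup is a shortened version of the question's sample, and html.parser is used here only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample markup from the question
source_code = """
<div class="container">
  <div class="fruits">
    <div class="apple">
      <p>John</p>
      <p>Sam</p>
      <ul><li>Sour</li></ul>
      <span>Fruits are good</span>
    </div>
    <div class="mango"><p>Randy</p></div>
  </div>
  <div class="apple"><p>Bill</p></div>
</div>
"""

soup = BeautifulSoup(source_code, "html.parser")

# ".fruits .apple" matches only .apple elements nested inside .fruits,
# whatever tags they contain; the outer .apple div is skipped.
bodies = [
    div.get_text(" ", strip=True)
    for div in soup.select(".fruits .apple")
]
print(bodies)
```

Inside the spider, the same get_text call on each selected div could take the place of the join/split pair that builds ptbody.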