How to get content inside tag in beautiful Shop 4? - beautifulsoup

how to get all content inside a html tags ?
from bs4 import BeautifulSoup
content = "<a><b>scgvggvd</b></a>"
soup = BeautifulSoup(content, 'html.parser')
matched_list = soup.find('a')
print(matched_list)
code above will return :
<a><b>scgvggvd</b></a>
what i want is :
<b>scgvggvd</b>
the tag <a> is removed after it's found
i hope the solution will works with find_all() too

If the <b> tag is a sibling of the <a> tag use the following line:
matched_list = soup.select_one('b')
If the <b> tag is a child of the <a> tag use the following line:
matched_list = soup.select_one('a b')
Use select instead of select_one if you need multiple hits.

from bs4 import BeautifulSoup
content = "<a><b>scgvggvd</b></a>"
soup = BeautifulSoup(content, 'html.parser')
matched_list = soup.find('a')
for b in matched_list:
print(b)

Related

All li tag under first ul tag

I want to scrape all a tags with href attrs in all li tags under first ul tag. The below code scrape all all a tags under all li tags in all ul tags. (I want only under the first ul tag). You can see the website.
https://www.mindtools.com/pages/main/newMN_CDV.htm
my code is:
for ultag in soup.find_all('ul', {'class': 'collection test_further_resource'}):
for litag in ultag.find_all('li'):
print(litag.find("a")["href"])
Please see the website. Please go to "Browse tools by Category" and "Thinking about Career direction". I want to scrape the href for this category which are 13.
Thank you in advance.
.find() will return the first tag it finds that matches, as opposed to .find_all() which will return a list of all the matches.
import requests
from bs4 import BeautifulSoup
url = 'https://www.mindtools.com/pages/main/newMN_CDV.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
ultag = soup.find('ul', {'class': 'collection test_further_resource'})
for litag in ultag.find_all('li'):
print(litag.find("a")["href"])
Output:
/pages/article/managing-career.htm
/pages/article/newCDV_97.htm
/pages/article/career-strategy.htm
/pages/article/personal-ansoff-matrix.htm
/pages/article/career-opportunities.htm
/pages/article/managing-yourself.htm
/pages/article/newCDV_99.htm
/pages/article/newCDV_89.htm
/pages/article/rebooting-your-career.htm
/pages/article/newCDV_98.htm
/pages/article/locus-of-control.htm
/pages/article/newCDV_90.htm
/pages/article/seat-on-the-board.htm

Extracting data from div tag

so im scraping data from a website and it has some data in its div tag
like this :
<div class="search-result__title">\nDonald Duck <span>\xa0|\xa0</span>\n<span class="city state" data-city="city, TX;city, TX;city, TX;city, TX" data-state="TX">STATENAME, CITYNAME\n</span>\n</div>,
I want to scrape "Donald Duck" part and state and city name after rel="nofollow"
the site contains a lot of data so name and state are different
the code that i have written is
div = soup.find_all('div', {'class':'search-result__title'})
print (div.string)
this gives me a error
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
first, use .text. Second, find_all() will return a list of elements. You need to specify the index value with either: print (div[0].text), or since you will probably have more than 1 element, just iterate through them
from bs4 import BeautifulSoup
html = '''<div class="search-result__title">\nDonald Duck <span>\xa0|\xa0</span>\n<span class="city state" data-city="city, TX;city, TX;city, TX;city, TX" data-state="TX">STATENAME, CITYNAME\n</span>\n</div>'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find_all('div', {'class':'search-result__title'})
print (div[0].text)
#OR
for each in div:
print (each.text)

Find multiple tags with condition

Is it possible to find multiple tags with a condition?
<a href = "/img/something.jpg">
<img src= "/img/somethingelse.png">
Could I say
Find all "a" and "img" tags containing "/img/"
Yes, just supply function (can be lambda function) to find_all() method:
data = """<a href = "/img/something.jpg">
<img src= "/img/somethingelse.png">"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for tag in soup.body.find_all(lambda t: t.name in ('a', 'img') and \
('href' in t.attrs and '/img/' in t['href']) or
('src' in t.attrs and '/img/' in t['src'])):
print(tag.name, tag.attrs)
print('*' * 80)
Outputs:
a {'href': '/img/something.jpg'}
********************************************************************************
img {'src': '/img/somethingelse.png'}
********************************************************************************

BeautifulSoup find by attribute value regardless of attribute

Say I have something like this:
<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>
I want to search for the keyword 'cake' and get all of them.
Find all by using lambda and search for a given attribute value or if a class contains the value that you want.
from bs4 import BeautifulSoup
example = """<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>"""
soup = BeautifulSoup(example, "html.parser")
print (soup.find_all(lambda tag: [a for a in tag.attrs.values() if a == "cake" or "cake" in tag.get("class")]))
Outputs:
[<div class="cake">1</div>, <h2 id="cake">1</h2>, <sometag someattribute="cake">1</sometag>]
You could use regex and BeautifulSoup together. This is my terrible script:
r = '''<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>'''
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(r, 'lxml')
for i in range(len(re.findall(r'(\w+)="cake"',str(soup)))-1):
print(soup.find_all(re.compile(r'(\w+)'), {(re.findall(pattern,str(soup)))[i]:'cake'}))
The output:
[<div class="cake">1</div>]
[<h2 id="cake">1 </div>
<sometag someattribute="cake">1</sometag></h2>]

BS4: issues finding href of 2 tags

I'm having problems getting soup to return all links that are both bold and have a URL. Right now it's only returning the 1st one on the page.
Here is part of the source:
<div class="section_wrapper" id="all_players_">
<div class="section_heading">
<span class="section_anchor" id="players__link" data-label="925 Players"></span>
<h2>925 Players</h2> <div class="section_heading_text">
<ul> <li><strong>Bold</strong> indicates active player and + indicates a Hall of Famer.</li>
</ul>
</div>
</div> <div class="section_content" id="div_players_">
<p>John D'Acquisto (1973-1982)</p>
<p>Jeff D'Amico (1996-2004)</p>
<p>Jeff D'Amico (2000-2000)</p>
<p>Jamie D'Antona (2008-2008)</p>
<p>Jerry D'Arcy (1911-1911)</p>
<p><b>Chase d'Arnaud (2011-2016)</b></p>
<p><b>Travis d'Arnaud (2013-2016)</b></p>
<p>Omar Daal (1993-2003)</p>
<p>Paul Dade (1975-1980)</p>
<p>John Dagenhard (1943-1943)</p>
<p>Pete Daglia (1932-1932)</p>
<p>Angelo Dagres (1955-1955)</p>
<p><b>David Dahl (2016-2016)</b></p>
<p>Jay Dahl (1963-1963)</p>
<p>Bill Dahlen (1891-1911)</p>
<p>Babe Dahlgren (1935-1946)</p>**strong text**
and here is my script:
import urllib.request
from bs4 import BeautifulSoup as bs
import re
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
for player_url in soup.b.find_all(limit=None):
for player_link in re.findall('/players/', player_url['href']):
print ('http://www.baseball-reference.com' + player_url['href'])
The other part is that there are other div id's that have similar lists that I don't care about. I want to grab the URLs from only this div class, that have a <b> tag. The <b> tag symbolizes that they are active players and that is what I am trying to capture.
Use BeautifulSoup to do the "selection" work and drill down to your data:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
bolds = soup.find_all('b')
for bold in bolds:
player_link = bold.find('a')
if player_link:
relative_path = player_link['href']
print('http://www.baseball-reference.com' + relative_path)
Now, if only want the one div with id=div_players_ you could add an additional filter:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
div_players = soup.find('div', {'id': 'div_players_'})
bolds = div_players.find_all('b')
for bold in bolds:
player_link = bold.find('a')
if player_link:
relative_path = player_link['href']
print('http://www.baseball-reference.com' + relative_path)
This is what I ended up doing
url = 'http://www.baseball-reference.com/players/d/'
content = urllib.request.urlopen(url)
soup = bs(content, 'html.parser')
for player_div in soup.find_all('div', {'id':'all_players_'}):
for player_bold in player_div('b'):
for player_href in player_bold('a'):
print ('http://www.baseball-reference.com' + player_href['href'])