Get text of elements but separate with spaces - beautifulsoup

I'm grabbing text data from a webpage and when I use .text it has all the elements combined. However, I want to separate some of them with a space.
For example, I have this text:
data=['<span class="sub-title title-block"><span class="nowrap">1.2</span><span class="nowrap">TEKNA</span></span>',
'<span class="sub-title title-block"><span class="nowrap">Amr</span><span class="nowrap">V12 5.2</span></span>',
'<span class="sub-title title-block"></span>']
When I do the following:
from bs4 import BeautifulSoup
for i in data:
soup = BeautifulSoup(i, 'lxml')
for d in soup:
print(d.text)
I get:
1.2TEKNA
AmrV12 5.2
But I want the expected output:
1.2 TEKNA
Amr V12 5.2
where I get each text separated between each other.

You can use get_text(<sep>) method and define your custom separator as below:
from bs4 import BeautifulSoup
data=['<span class="sub-title title-block"><span class="nowrap">1.2</span><span class="nowrap">TEKNA</span></span>',
'<span class="sub-title title-block"><span class="nowrap">Amr</span><span class="nowrap">V12 5.2</span></span>',
'<span class="sub-title title-block"></span>']
for i in data:
soup = BeautifulSoup(i, 'lxml')
for d in soup:
print(d.get_text(" "))
Output:
1.2 TEKNA
Amr V12 5.2

Related

How to extract text of specific tags with multiple occurrences

HTML:
"<span class="font-weight-bold color-primary small text-right text-nowrap">29,95 €</span>
url = https://www.cardmarket.com/en/Magic/Cards/Bloodstained-Mire?sellerCountry=13&sellerReputation=2&language=1&minCondition=4#articleFilterSellerLocation
I wish to extract the text of 29,95 €.
Currently using BeautifulSoup. However, the page has a table with many other texts like this which I also wish to extract. How do I find all of these tags and extract only the text at the end to a list?
The current code I have tried is:
for price in new_page:
new_page.find("div", class_="table-body")
price = new_page.find_all("span", attrs="font-weight-bold color-primary small text-right text-nowrap")
output_price = [x["font-weight-bold color-primary small text-right text-nowrap"] for x in price]
import requests
from bs4 import BeautifulSoup
def main(url):
params = {
"sellerCountry": "13",
"sellerReputation": "2",
"language": "1",
"minCondition": "4"
}
r = requests.get(url, params=params)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.select_one('dl.labeled dd:nth-child(6)').text)
main('https://www.cardmarket.com/en/Magic/Cards/Bloodstained-Mire')
Output:
29,95 €

Does BeautifulSoup can locate the element basing on contained text? [duplicate]

Observe the following problem:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
# This returns the <a> element
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*")
)
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
# This returns None
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*")
)
For some reason, BeautifulSoup will not match the text, when the <i> tag is there as well. Finding the tag and showing its text produces
>>> a2 = soup.find(
'a',
href="/customer-menu/1/accounts/1/update"
)
>>> print(repr(a2.text))
'\n Edit\n'
Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:
pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n') # Returns None
pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n') # Returns MatchObject
Alright. Looks good. Let's try it with soup
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*", flags=re.DOTALL)
) # Still return None... Why?!
Edit
My solution based on geckons answer: I implemented these helpers:
import re
MATCH_ALL = r'.*'
def like(string):
"""
Return a compiled regular expression that matches the given
string with any prefix and postfix, e.g. if string = "hello",
the returned regex matches r".*hello.*"
"""
string_ = string
if not isinstance(string_, str):
string_ = str(string_)
regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
return re.compile(regex, flags=re.DOTALL)
def find_by_text(soup, text, tag, **kwargs):
"""
Find the tag in soup that matches all provided kwargs, and contains the
text.
If no match is found, return None.
If more than one match is found, raise ValueError.
"""
elements = soup.find_all(tag, **kwargs)
matches = []
for element in elements:
if element.find(text=like(text)):
matches.append(element)
if len(matches) > 1:
raise ValueError("Too many matches:\n" + "\n".join(matches))
elif len(matches) == 0:
return None
else:
return matches[0]
Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')
The problem is that your <a> tag with the <i> tag inside, doesn't have the string attribute you expect it to have. First let's take a look at what text="" argument for find() does.
NOTE: The text argument is an old name, since BeautifulSoup 4.4.0 it's called string.
From the docs:
Although string is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
.string matches your value for string. This code finds the tags
whose .string is “Elsie”:
soup.find_all("a", string="Elsie")
# [Elsie]
Now let's take a look what Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
title_tag.string
# u'The Dormouse's story'
(...)
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None:
print(soup.html.string)
# None
This is exactly your case. Your <a> tag contains a text and <i> tag. Therefore, the find gets None when trying to search for a string and thus it can't match.
How to solve this?
Maybe there is a better solution but I would probably go with something like this:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")
for link in links:
if link.find(text=re.compile("Edit")):
thelink = link
break
print(thelink)
I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.
in one line using lambda
soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)
You can pass a function that return True if a text contains "Edit" to .find
In [51]: def Edit_in_text(tag):
....: return tag.name == 'a' and 'Edit' in tag.text
....:
In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]:
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
EDIT:
You can use the .get_text() method instead of the text in your function which gives the same result:
def Edit_in_text(tag):
return tag.name == 'a' and 'Edit' in tag.get_text()
With soupsieve 2.1.0 you can use :-soup-contains css pseudo class selector to target a node's text. This replaces the deprecated form of :contains().
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
single = soup.select_one('a:-soup-contains("Edit")').text.strip()
multiple = [i.text.strip() for i in soup.select('a:-soup-contains("Edit")')]
print(single, '\n', multiple)
Method - 1: Checking text property
pattern = 'Edit'
a2 = soup.find_all('a', string = pattern)[0]
Method - 2: Using lambda iterate through all elements
a2 = soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)
Good Luck

How to get content inside tag in beautiful Shop 4?

how to get all content inside a html tags ?
from bs4 import BeautifulSoup
content = "<a><b>scgvggvd</b></a>"
soup = BeautifulSoup(content, 'html.parser')
matched_list = soup.find('a')
print(matched_list)
code above will return :
<a><b>scgvggvd</b></a>
what i want is :
<b>scgvggvd</b>
the tag <a> is removed after it's found
i hope the solution will works with find_all() too
If the <b> tag is a sibling of the <a> tag use the following line:
matched_list = soup.select_one('b')
If the <b> tag is a child of the <a> tag use the following line:
matched_list = soup.select_one('a b')
Use select instead of select_one if you need multiple hits.
from bs4 import BeautifulSoup
content = "<a><b>scgvggvd</b></a>"
soup = BeautifulSoup(content, 'html.parser')
matched_list = soup.find('a')
for b in matched_list:
print(b)

BeautifulSoup find by attribute value regardless of attribute

Say I have something like this:
<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>
I want to search for the keyword 'cake' and get all of them.
Find all by using lambda and search for a given attribute value or if a class contains the value that you want.
from bs4 import BeautifulSoup
example = """<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>"""
soup = BeautifulSoup(example, "html.parser")
print (soup.find_all(lambda tag: [a for a in tag.attrs.values() if a == "cake" or "cake" in tag.get("class")]))
Outputs:
[<div class="cake">1</div>, <h2 id="cake">1</h2>, <sometag someattribute="cake">1</sometag>]
You could use regex and BeautifulSoup together. This is my terrible script:
r = '''<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>'''
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(r, 'lxml')
for i in range(len(re.findall(r'(\w+)="cake"',str(soup)))-1):
print(soup.find_all(re.compile(r'(\w+)'), {(re.findall(pattern,str(soup)))[i]:'cake'}))
The output:
[<div class="cake">1</div>]
[<h2 id="cake">1 </div>
<sometag someattribute="cake">1</sometag></h2>]

BS4: issues finding href of 2 tags

I'm having problems getting soup to return all links that are both bold and have a URL. Right now it's only returning the 1st one on the page.
Here is part of the source:
<div class="section_wrapper" id="all_players_">
<div class="section_heading">
<span class="section_anchor" id="players__link" data-label="925 Players"></span>
<h2>925 Players</h2> <div class="section_heading_text">
<ul> <li><strong>Bold</strong> indicates active player and + indicates a Hall of Famer.</li>
</ul>
</div>
</div> <div class="section_content" id="div_players_">
<p>John D'Acquisto (1973-1982)</p>
<p>Jeff D'Amico (1996-2004)</p>
<p>Jeff D'Amico (2000-2000)</p>
<p>Jamie D'Antona (2008-2008)</p>
<p>Jerry D'Arcy (1911-1911)</p>
<p><b>Chase d'Arnaud (2011-2016)</b></p>
<p><b>Travis d'Arnaud (2013-2016)</b></p>
<p>Omar Daal (1993-2003)</p>
<p>Paul Dade (1975-1980)</p>
<p>John Dagenhard (1943-1943)</p>
<p>Pete Daglia (1932-1932)</p>
<p>Angelo Dagres (1955-1955)</p>
<p><b>David Dahl (2016-2016)</b></p>
<p>Jay Dahl (1963-1963)</p>
<p>Bill Dahlen (1891-1911)</p>
<p>Babe Dahlgren (1935-1946)</p>**strong text**
and here is my script:
import urllib.request
from bs4 import BeautifulSoup as bs
import re
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
for player_url in soup.b.find_all(limit=None):
for player_link in re.findall('/players/', player_url['href']):
print ('http://www.baseball-reference.com' + player_url['href'])
The other part is that there are other div id's that have similar lists that I don't care about. I want to grab the URLs from only this div class, that have a <b> tag. The <b> tag symbolizes that they are active players and that is what I am trying to capture.
Use BeautifulSoup to do the "selection" work and drill down to your data:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
bolds = soup.find_all('b')
for bold in bolds:
player_link = bold.find('a')
if player_link:
relative_path = player_link['href']
print('http://www.baseball-reference.com' + relative_path)
Now, if only want the one div with id=div_players_ you could add an additional filter:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
div_players = soup.find('div', {'id': 'div_players_'})
bolds = div_players.find_all('b')
for bold in bolds:
player_link = bold.find('a')
if player_link:
relative_path = player_link['href']
print('http://www.baseball-reference.com' + relative_path)
This is what I ended up doing
url = 'http://www.baseball-reference.com/players/d/'
content = urllib.request.urlopen(url)
soup = bs(content, 'html.parser')
for player_div in soup.find_all('div', {'id':'all_players_'}):
for player_bold in player_div('b'):
for player_href in player_bold('a'):
print ('http://www.baseball-reference.com' + player_href['href'])