How to extract the text of specific tags with multiple occurrences - BeautifulSoup

HTML:
"<span class="font-weight-bold color-primary small text-right text-nowrap">29,95 €</span>
url = https://www.cardmarket.com/en/Magic/Cards/Bloodstained-Mire?sellerCountry=13&sellerReputation=2&language=1&minCondition=4#articleFilterSellerLocation
I wish to extract the text of 29,95 €.
I am currently using BeautifulSoup. However, the page has a table with many other prices like this, which I also wish to extract. How do I find all of these tags and extract only their text into a list?
The current code I have tried is:
for price in new_page:
    new_page.find("div", class_="table-body")
    price = new_page.find_all("span", attrs="font-weight-bold color-primary small text-right text-nowrap")
    output_price = [x["font-weight-bold color-primary small text-right text-nowrap"] for x in price]

import requests
from bs4 import BeautifulSoup

def main(url):
    params = {
        "sellerCountry": "13",
        "sellerReputation": "2",
        "language": "1",
        "minCondition": "4"
    }
    r = requests.get(url, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    print(soup.select_one('dl.labeled dd:nth-child(6)').text)

main('https://www.cardmarket.com/en/Magic/Cards/Bloodstained-Mire')
Output:
29,95 €
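The question also asks for every price in the table as a list. A minimal sketch of that, placed inside main() after building soup (this assumes each price span in the table body carries the same utility classes shown above):

# Hedged sketch: collect every matching price span into a list.
# Assumes all price cells in the table use these same classes.
prices = [
    span.get_text(strip=True)
    for span in soup.select("div.table-body span.font-weight-bold.color-primary.text-nowrap")
]
print(prices)  # e.g. ['29,95 €', ...]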

Related

Get text of elements but separate with spaces

I'm grabbing text data from a webpage, and when I use .text, all the elements are combined. However, I want to separate some of them with a space.
For example, I have this text:
data = ['<span class="sub-title title-block"><span class="nowrap">1.2</span><span class="nowrap">TEKNA</span></span>',
        '<span class="sub-title title-block"><span class="nowrap">Amr</span><span class="nowrap">V12 5.2</span></span>',
        '<span class="sub-title title-block"></span>']
When I do the following:
from bs4 import BeautifulSoup

for i in data:
    soup = BeautifulSoup(i, 'lxml')
    for d in soup:
        print(d.text)
I get:
1.2TEKNA
AmrV12 5.2
But I want the expected output:
1.2 TEKNA
Amr V12 5.2
where each text is separated from the others.
You can use the get_text(<sep>) method and define your custom separator, as below:
from bs4 import BeautifulSoup

data = ['<span class="sub-title title-block"><span class="nowrap">1.2</span><span class="nowrap">TEKNA</span></span>',
        '<span class="sub-title title-block"><span class="nowrap">Amr</span><span class="nowrap">V12 5.2</span></span>',
        '<span class="sub-title title-block"></span>']

for i in data:
    soup = BeautifulSoup(i, 'lxml')
    for d in soup:
        print(d.get_text(" "))
Output:
1.2 TEKNA
Amr V12 5.2
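get_text() also accepts strip=True, which trims whitespace around each fragment; a hedged variant of the same loop:

for i in data:
    soup = BeautifulSoup(i, 'lxml')
    for d in soup:
        # strip=True trims whitespace around each text fragment,
        # useful when the source HTML contains newlines or padding.
        print(d.get_text(" ", strip=True))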

Beautiful Soup: how to extract <p> content of a parent tag

In a text file, each item has the same structure, so I would like to parse it with Beautiful Soup.
An extract:
data = """
<article id="1" title="Titre 1" sourcename="Le monde" about="Fillon|Macron">
<p type="title">Sub title1</p>
<p>xxxxxxxxxxxxxxxxxxxxxxxxx</p>
</article>
<article id="2" title="Titre 2" sourcename="La Croix" about="Le Pen|Mélanchon">
<p type="title">Sub title2</p>
<p>yyyyyyyyyyyyyyyyyyyyyyyyy</p>
</article>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for text in soup.find_all('article'):
    print(text['id'])
    print(list(text.findChildren()))
    print(list(text.children))
I want to extract the <p> tag content.
For each article, I would like to get a list of lists (to convert to a pandas DataFrame).
For example:
[
    [1, "Sub title1", "xxxxxxxxxxxxx"],
    [2, "Sub title2", "yyyyyyyyyyyyy"],
]
Thanks a lot.
Théo
You're almost there.
result = []  # create a variable to store your results
for article in soup.find_all("article"):
    article_id = article["id"]
    title = article.select("p[type=title]")[0]  # select the title tag
    title_text = title.text
    p = title.find_next("p").text  # get the adjacent p tag
    result.append([article_id, title_text, p])
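Since the asker wanted a pandas DataFrame, a hedged follow-up (assuming pandas is installed; the column names are illustrative):

import pandas as pd

# Turn the collected rows into a DataFrame for further processing.
df = pd.DataFrame(result, columns=["id", "title", "text"])
print(df)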

I don't get values when I request HTML using BeautifulSoup (Python)

I am currently building my own "stock" tracker.
I have a hard time extracting the right values from websites when scraping.
In the live HTML, the h2 has a value, but when I request the page, the h2 doesn't include that value.
Here is my code:
import requests
from bs4 import BeautifulSoup
html_text = requests.get("https://npinvestor.dk/kursinfo/vis-aktie/172.1.MAERSK-B:2").text
soup = BeautifulSoup(html_text, "lxml")
stock = soup.find('h2', class_="change-pct text-right change-flash change-color")
print(stock)
stock_2 = soup.find('h2', class_="change-pct text-right change-flash change-color").text
print(stock_2)
my output:
<h2 class="change-pct text-right change-flash change-color" style="width: 120px; float: left;"> </h2>
 
The data you're trying to get is dynamically added by JavaScript, which means bs4 won't see it.
However, you can try the API endpoint.
For example:
import requests
API_endpoint = "https://npinvestor.dk/javascript/ajax/stock_details.php?symbol=172.1.MAERSK-B&provider=2&frequency=60"
data = requests.get(API_endpoint).json()
print(data["symbol"], data["close_price"])
Output:
172.1.MAERSK/B 13660
The entire response looks like this:
{
    "company_name": "A.P. M\u00f8ller - M\u00e6rsk B",
    "isin": "DK0010244508",
    "symbol": "172.1.MAERSK/B",
    "tstamp": "2021-01-26 12:02:36",
    "last": "13215",
    "lowest": "13170",
    "highest": "13600",
    "close_price": "13660",
    "volume": "23053",
    "bid": "13205",
    "ask": "13210",
    "change_pct": "-3.257686676427529",
    "change_pt": "-445",
    "data_provider": "2",
    "provider_name": "OMX",
    "delay_type": "1"
}
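The h2 the asker originally targeted shows the percent change, which corresponds to the "change_pct" field above; a hedged one-liner to format it for display:

# Parse the percent-change string from the JSON response above.
change = float(data["change_pct"])
print(f"{change:+.2f} %")  # e.g. -3.26 %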

BS4: issues finding href of 2 tags

I'm having problems getting soup to return all links that are both bold and have a URL. Right now it's only returning the first one on the page.
Here is part of the source:
<div class="section_wrapper" id="all_players_">
<div class="section_heading">
<span class="section_anchor" id="players__link" data-label="925 Players"></span>
<h2>925 Players</h2> <div class="section_heading_text">
<ul> <li><strong>Bold</strong> indicates active player and + indicates a Hall of Famer.</li>
</ul>
</div>
</div> <div class="section_content" id="div_players_">
<p>John D'Acquisto (1973-1982)</p>
<p>Jeff D'Amico (1996-2004)</p>
<p>Jeff D'Amico (2000-2000)</p>
<p>Jamie D'Antona (2008-2008)</p>
<p>Jerry D'Arcy (1911-1911)</p>
<p><b>Chase d'Arnaud (2011-2016)</b></p>
<p><b>Travis d'Arnaud (2013-2016)</b></p>
<p>Omar Daal (1993-2003)</p>
<p>Paul Dade (1975-1980)</p>
<p>John Dagenhard (1943-1943)</p>
<p>Pete Daglia (1932-1932)</p>
<p>Angelo Dagres (1955-1955)</p>
<p><b>David Dahl (2016-2016)</b></p>
<p>Jay Dahl (1963-1963)</p>
<p>Bill Dahlen (1891-1911)</p>
<p>Babe Dahlgren (1935-1946)</p>
and here is my script:
import urllib.request
from bs4 import BeautifulSoup as bs
import re
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
for player_url in soup.b.find_all(limit=None):
    for player_link in re.findall('/players/', player_url['href']):
        print('http://www.baseball-reference.com' + player_url['href'])
The other part is that there are other div ids with similar lists that I don't care about. I want to grab the URLs from only this div, for entries that have a <b> tag. The <b> tag signifies an active player, and that is what I am trying to capture.
Use BeautifulSoup to do the "selection" work and drill down to your data:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
bolds = soup.find_all('b')
for bold in bolds:
player_link = bold.find('a')
if player_link:
relative_path = player_link['href']
print('http://www.baseball-reference.com' + relative_path)
Now, if you only want the one div with id="div_players_", you can add an additional filter:
url = "http://www.baseball-reference.com/players/d/"
content = urllib.request.urlopen(url)
soup = bs(content, "html.parser")
div_players = soup.find('div', {'id': 'div_players_'})
bolds = div_players.find_all('b')
for bold in bolds:
player_link = bold.find('a')
if player_link:
relative_path = player_link['href']
print('http://www.baseball-reference.com' + relative_path)
This is what I ended up doing:
url = 'http://www.baseball-reference.com/players/d/'
content = urllib.request.urlopen(url)
soup = bs(content, 'html.parser')

for player_div in soup.find_all('div', {'id': 'all_players_'}):
    for player_bold in player_div('b'):
        for player_href in player_bold('a'):
            print('http://www.baseball-reference.com' + player_href['href'])
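For reference, the same drill-down can also be written as a single CSS selector (a hedged equivalent of the loops above, matching <a> tags inside <b> within the div with id="div_players_"):

for a in soup.select('div#div_players_ b a[href]'):
    # Each match is an active player's link nested in a <b> tag.
    print('http://www.baseball-reference.com' + a['href'])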

Extracting an href attribute with BeautifulSoup

I use this method
allcity = dom.body.findAll(attrs={'id': re.compile(r"\d{1,2}")})
to return a list like this:
[<a onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" href="http://www.ylyd.com/showurl.asp?id=6182" target="_blank"><font size="3">掳虏驴碌路驴碌脴虏煤脨脜脧垄脥酶 隆煤 脢脦脝路脦露脕卢陆脫</font></a>,
掳脵露脠驴矛脮脮]
How do I extract this href?
http://www.ylyd.com/showurl.asp?id=6182
Thanks. :)
You can use:
for a in dom.body.findAll(attrs={'id': re.compile(r"\d{1,2}")}, href=True):
    print(a['href'])
In this example there's no real need for a regex; you can simply access the <a> tag and then its ['href'] attribute, like so:
get_me_url = soup.a['href'] # http://www.ylyd.com/showurl.asp?id=6182
# cached URL
get_me_cached_url = soup.find('a', class_='m')['href']
You can always use the prettify() method to better see the HTML code.
from bs4 import BeautifulSoup
string = '''
[
<a href="http://www.ylyd.com/showurl.asp?id=6182" onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" target="_blank">
<font size="3">
掳虏驴碌路驴碌脴虏煤脨脜脧垄脥酶 隆煤 脢脦脝路脦露脕卢陆脫
</font>
</a>
,
<a class="m" href="http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu" target="_blank">
掳脵露脠驴矛脮脮
</a>
]
'''
soup = BeautifulSoup(string, 'html.parser')
href = soup.a['href']
cache_href = soup.find('a', class_='m')['href']
print(f'{href}\n{cache_href}')
# output:
'''
http://www.ylyd.com/showurl.asp?id=6182
http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu
'''
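A hedged generalization of the same idea, collecting every href in the snippet into a list instead of picking them out one by one:

# Gather the hrefs of all anchors that actually carry one.
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)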
Alternatively, you can do the same thing using Baidu Organic Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Essentially, the main difference in this example is that you don't have to figure out how to grab certain elements since it's already done for the end-user with a JSON output.
Code to grab href/cached href from first page results:
from serpapi import BaiduSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "baidu",
    "q": "ylyd"
}

search = BaiduSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    # try/except used since sometimes there's no link/cached link
    try:
        link = result['link']
    except KeyError:
        link = None
    try:
        cached_link = result['cached_page_link']
    except KeyError:
        cached_link = None
    print(f'{link}\n{cached_link}\n')
# Part of the output:
'''
http://www.baidu.com/link?url=7VlSB5iaA1_llQKA3-0eiE8O9sXe4IoZzn0RogiBMCnJHcgoDDYxz2KimQcSDoxK
http://cache.baiducontent.com/c?m=LU3QMzVa1VhvBXthaoh17aUpq4KUpU8MCL3t1k8LqlKPUU9qqZgQInMNxAPNWQDY6pkr-tWwNiQ2O8xfItH5gtqxpmjXRj0m2vEHkxLmsCu&p=882a9646d5891ffc57efc63e57519d&newp=926a8416d9c10ef208e2977d0e4dcd231610db2151d6d5106b82c825d7331b001c3bbfb423291505d3c77e6305a54d5ceaf13673330923a3dda5c91d9fb4c57479c77a&s=c81e728d9d4c2f63&user=baidu&fm=sc&query=ylyd&qid=e42a54720006d857&p1=1
'''
Disclaimer, I work for SerpApi.