I am currently building my own "stock" tracker.
I have a hard time extracting the right values from websites when scraping.
On the online html-code h2 has a value, but when i request it, h2 doesn't bring along this value.
Here is my code:
import requests
from bs4 import BeautifulSoup
html_text = requests.get("https://npinvestor.dk/kursinfo/vis-aktie/172.1.MAERSK-B:2").text
soup = BeautifulSoup(html_text, "lxml")
stock = soup.find('h2', class_="change-pct text-right change-flash change-color")
print(stock)
stock_2 = soup.find('h2', class_="change-pct text-right change-flash change-color").text
print(stock_2)
my output:
<h2 class="change-pct text-right change-flash change-color" style="width: 120px; float: left;"> </h2>
The data you're trying to get is dynamically added by JS, which means bs4 won't see those.
However, you can try the API endpoint.
For example:
import requests
API_endpoint = "https://npinvestor.dk/javascript/ajax/stock_details.php?symbol=172.1.MAERSK-B&provider=2&frequency=60"
data = requests.get(API_endpoint).json()
print(data["symbol"], data["close_price"])
Output:
172.1.MAERSK/B 13660
The entire response looks like this:
{
"company_name": "A.P. M\u00f8ller - M\u00e6rsk B",
"isin": "DK0010244508",
"symbol": "172.1.MAERSK/B",
"tstamp": "2021-01-26 12:02:36",
"last": "13215",
"lowest": "13170",
"highest": "13600",
"close_price": "13660",
"volume": "23053",
"bid": "13205",
"ask": "13210",
"change_pct": "-3.257686676427529",
"change_pt": "-445",
"data_provider": "2",
"provider_name": "OMX",
"delay_type": "1"
}
Related
HTML:
"<span class="font-weight-bold color-primary small text-right text-nowrap">29,95 €</span>
url = https://www.cardmarket.com/en/Magic/Cards/Bloodstained-Mire?sellerCountry=13&sellerReputation=2&language=1&minCondition=4#articleFilterSellerLocation
I wish to extract the text of 29,95 €.
Currently using BeautifulSoup. However, the page has a table with many other texts like this which I also wish to extract. How do I find all of these tags and extract only the text at the end to a list?
The current code I have tried is:
for price in new_page:
new_page.find("div", class_="table-body")
price = new_page.find_all("span", attrs="font-weight-bold color-primary small text-right text-nowrap")
output_price = [x["font-weight-bold color-primary small text-right text-nowrap"] for x in price]
import requests
from bs4 import BeautifulSoup
def main(url):
params = {
"sellerCountry": "13",
"sellerReputation": "2",
"language": "1",
"minCondition": "4"
}
r = requests.get(url, params=params)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.select_one('dl.labeled dd:nth-child(6)').text)
main('https://www.cardmarket.com/en/Magic/Cards/Bloodstained-Mire')
Output:
29,95 €
Im trying to parse the "value" of variable ( __VIEWSTATEGENERATOR ), here's the HTML code ::
<div>
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>
</div>
Here's the code I am attempting to do that with ::
viewstategenerator = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATEGENERATOR"})
I then execute:: print(viewstategenerator), and I get the following string for my variable:
>>> print(viewstategenerator)
[<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>]
I was expecting to grab just the value of "1434571F", not sure why that is... Any help would be highly appreciated!!
It looks like you're close but just a tad confused about the BeautifulSoup API.
soup.findAll returns a list of all of the DOM elements that match the query you gave it. Seeing as only one element on the page can match your query, you should use soup.find instead. To get the value of the value attribute of your input element, use ['value'].
from bs4 import BeautifulSoup as Soup
html = """
<div>
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>
</div>
"""
soup = Soup(html, 'lxml') # Use whatever parser you're already using.
viewstategenerator = soup.find("input", {"type": "hidden", "name": "__VIEWSTATEGENERATOR"})
print(viewstategenerator['value'])
# Prints 1434571F
Say I have something like this:
<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>
I want to search for the keyword 'cake' and get all of them.
Find all by using lambda and search for a given attribute value or if a class contains the value that you want.
from bs4 import BeautifulSoup
example = """<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>"""
soup = BeautifulSoup(example, "html.parser")
print (soup.find_all(lambda tag: [a for a in tag.attrs.values() if a == "cake" or "cake" in tag.get("class")]))
Outputs:
[<div class="cake">1</div>, <h2 id="cake">1</h2>, <sometag someattribute="cake">1</sometag>]
You could use regex and BeautifulSoup together. This is my terrible script:
r = '''<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>'''
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(r, 'lxml')
for i in range(len(re.findall(r'(\w+)="cake"',str(soup)))-1):
print(soup.find_all(re.compile(r'(\w+)'), {(re.findall(pattern,str(soup)))[i]:'cake'}))
The output:
[<div class="cake">1</div>]
[<h2 id="cake">1 </div>
<sometag someattribute="cake">1</sometag></h2>]
I am having problems calling specific attributes in beautifulsoup
<div class="route_list "
data-id="11234"
data-lazy="ubt"
data-ubt-company="ABC"
data-ubt-departuredate="2016-11-10"
data-ubt-destcountry="China,"
data-ubt-from="Shanghai"
data-ubt-mark="Bus"
data-ubt-price="2399"
data-ubt-sailingid="11185"
data-ubt-score="4.4"
data-ubt-sourcefrom="Cruise"
data-ubt-voyaid="1184">
I am trying to extract only the company and departure date and the following code returns a key error.
bsObj = BeautifulSoup(html.read(), "html.parser")
div=bsObj.div
departure = div.attrs['data-ubt-departuredate']
You might not be targeting the desired div, narrow down your search:
div = bsObj.find("div", class_="route_list")
Or, checking the presence of the data-ubt-departuredate attribute:
div = bsObj.find("div", {"data-ubt-departuredate": True})
I use this method
allcity = dom.body.findAll(attrs={'id' : re.compile("\d{1,2}")})
to return a list like this:
[<a onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" href="http://www.ylyd.com/showurl.asp?id=6182" target="_blank"><font size="3">掳虏驴碌路驴碌脴虏煤脨脜脧垄脥酶 隆煤 脢脦脝路脦露脕卢陆脫</font></a>,
掳脵露脠驴矛脮脮]
How do I extract this href?
http://www.ylyd.com/showurl.asp?id=6182
Thanks. :)
you can use
for a in dom.body.findAll(attrs={'id' : re.compile("\d{1,2}")}, href=True):
a['href']
In this example, there's no real need to use regex, it can be simply as calling <a> tag and then ['href'] attribute like so:
get_me_url = soup.a['href'] # http://www.ylyd.com/showurl.asp?id=6182
# cached URL
get_me_cached_url = soup.find('a', class_='m')['href']
You can always use prettify() method to better see the HTML code.
from bs4 import BeautifulSoup
string = '''
[
<a href="http://www.ylyd.com/showurl.asp?id=6182" onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" target="_blank">
<font size="3">
掳虏驴碌路驴碌脴虏煤脨脜脧垄脥酶 隆煤 脢脦脝路脦露脕卢陆脫
</font>
</a>
,
<a class="m" href="http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu" target="_blank">
掳脵露脠驴矛脮脮
</a>
]
'''
soup = BeautifulSoup(string, 'html.parser')
href = soup.a['href']
cache_href = soup.find('a', class_='m')['href']
print(f'{href}\n{cache_href}')
# output:
'''
http://www.ylyd.com/showurl.asp?id=6182
http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu
'''
Alternatively, you can do the same thing using Baidu Organic Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Essentially, the main difference in this example is that you don't have to figure out how to grab certain elements since it's already done for the end-user with a JSON output.
Code to grab href/cached href from first page results:
from serpapi import BaiduSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "baidu",
"q": "ylyd"
}
search = BaiduSearch(params)
results = search.get_dict()
for result in results['organic_results']:
# try/expect used since sometimes there's no link/cached link
try:
link = result['link']
except:
link = None
try:
cached_link = result['cached_page_link']
except:
cached_link = None
print(f'{link}\n{cached_link}\n')
# Part of the output:
'''
http://www.baidu.com/link?url=7VlSB5iaA1_llQKA3-0eiE8O9sXe4IoZzn0RogiBMCnJHcgoDDYxz2KimQcSDoxK
http://cache.baiducontent.com/c?m=LU3QMzVa1VhvBXthaoh17aUpq4KUpU8MCL3t1k8LqlKPUU9qqZgQInMNxAPNWQDY6pkr-tWwNiQ2O8xfItH5gtqxpmjXRj0m2vEHkxLmsCu&p=882a9646d5891ffc57efc63e57519d&newp=926a8416d9c10ef208e2977d0e4dcd231610db2151d6d5106b82c825d7331b001c3bbfb423291505d3c77e6305a54d5ceaf13673330923a3dda5c91d9fb4c57479c77a&s=c81e728d9d4c2f63&user=baidu&fm=sc&query=ylyd&qid=e42a54720006d857&p1=1
'''
Disclaimer, I work for SerpApi.