Issue parsing variable from HTML with bs4

I'm trying to parse the "value" attribute of the hidden input __VIEWSTATEGENERATOR. Here's the HTML:
<div>
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>
</div>
Here's the code I'm using to do that:
viewstategenerator = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATEGENERATOR"})
I then run print(viewstategenerator) and get the following for my variable:
>>> print(viewstategenerator)
[<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>]
I was expecting to get just the value "1434571F", and I'm not sure why I get the whole tag instead. Any help would be highly appreciated!

It looks like you're close but just a tad confused about the BeautifulSoup API.
soup.findAll returns a list of all of the DOM elements that match the query you gave it. Seeing as only one element on the page can match your query, you should use soup.find instead. To get the value of the value attribute of your input element, use ['value'].
from bs4 import BeautifulSoup as Soup
html = """
<div>
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>
</div>
"""
soup = Soup(html, 'lxml') # Use whatever parser you're already using.
viewstategenerator = soup.find("input", {"type": "hidden", "name": "__VIEWSTATEGENERATOR"})
print(viewstategenerator['value'])
# Prints 1434571F
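If the input might be missing from some pages, soup.find returns None, and indexing None with ['value'] raises a TypeError. A minimal sketch of a defensive lookup follows. The helper name hidden_value is hypothetical, and __EVENTVALIDATION is only an assumed example of a field that is absent from this snippet.

```python
from bs4 import BeautifulSoup

html = """
<div>
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>
</div>
"""


def hidden_value(soup, name):
    """Return the value attribute of the hidden input called `name`, or None.

    Hypothetical helper, not part of the BeautifulSoup API.
    """
    tag = soup.find("input", {"type": "hidden", "name": name})
    if tag is None:
        return None
    # .get() returns None instead of raising if the attribute is absent.
    return tag.get("value")


soup = BeautifulSoup(html, "html.parser")
found = hidden_value(soup, "__VIEWSTATEGENERATOR")
missing = hidden_value(soup, "__EVENTVALIDATION")  # assumed name, not in the HTML
print(found, missing)
```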

Related

Beautiful Soup: how to extract the <p> content of a parent tag

In a text file, every item has the same structure, so I would like to parse it with Beautiful Soup.
An extract:
data = """
<article id="1" title="Titre 1" sourcename="Le monde" about="Fillon|Macron">
<p type="title">Sub title1</p>
<p>xxxxxxxxxxxxxxxxxxxxxxxxx</p>
</article>
<article id="2" title="Titre 2" sourcename="La Croix" about="Le Pen|Mélanchon">
<p type="title">Sub title2</p>
<p>yyyyyyyyyyyyyyyyyyyyyyyyy</p>
</article>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for text in soup.find_all('article'):
    print(text['id'])
    print(list(text.findChildren()))
    print(list(text.children))
I want to extract the content of the <p> tags.
For each article, I would like to get a list, giving a list of lists overall (to convert to a pandas DataFrame).
For example:
[
    [1, "Sub title1", "xxxxxxxxxxxxx"],
    [2, "Sub title2", "yyyyyyyyyyyyy"],
]
Thanks a lot.
Théo

You're almost there.
result = []  # create a variable to store your results
for article in soup.find_all("article"):
    article_id = article["id"]
    title = article.select("p[type=title]")[0]  # select the title tag
    title_text = title.text
    p = title.find_next("p").text  # get the adjacent p tag
    result.append([article_id, title_text, p])
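As a variation, the rows can be built from every <p> that is a direct child of each article, which also copes with articles that have more than one body paragraph. This is only a sketch on a shortened copy of the sample data, and the column names in the closing note are assumptions.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample data from the question.
data = """
<article id="1" title="Titre 1" sourcename="Le monde">
<p type="title">Sub title1</p>
<p>xxx</p>
</article>
<article id="2" title="Titre 2" sourcename="La Croix">
<p type="title">Sub title2</p>
<p>yyy</p>
</article>
"""

soup = BeautifulSoup(data, "html.parser")

rows = []
for article in soup.find_all("article"):
    # recursive=False limits the search to direct children of this article.
    paragraphs = article.find_all("p", recursive=False)
    texts = [p.get_text(strip=True) for p in paragraphs]
    # Attribute values are strings, so convert the id explicitly.
    rows.append([int(article["id"]), *texts])

print(rows)
```

If pandas is installed, a list of lists like this can be passed to pandas.DataFrame(rows, columns=["id", "title", "text"]).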

I don't get values when I request HTML using BeautifulSoup (Python)

I am currently building my own "stock" tracker.
I have a hard time extracting the right values from websites when scraping.
In the page's HTML as shown in the browser, the h2 has a value, but when I request the page, the h2 comes back without it.
Here is my code:
import requests
from bs4 import BeautifulSoup
html_text = requests.get("https://npinvestor.dk/kursinfo/vis-aktie/172.1.MAERSK-B:2").text
soup = BeautifulSoup(html_text, "lxml")
stock = soup.find('h2', class_="change-pct text-right change-flash change-color")
print(stock)
stock_2 = soup.find('h2', class_="change-pct text-right change-flash change-color").text
print(stock_2)
My output:
<h2 class="change-pct text-right change-flash change-color" style="width: 120px; float: left;"> </h2>

The data you're trying to get is added dynamically by JavaScript, so it is not in the HTML that requests downloads, and bs4 never sees it.
However, you can try the API endpoint instead.
For example:
import requests
API_endpoint = "https://npinvestor.dk/javascript/ajax/stock_details.php?symbol=172.1.MAERSK-B&provider=2&frequency=60"
data = requests.get(API_endpoint).json()
print(data["symbol"], data["close_price"])
Output:
172.1.MAERSK/B 13660
The entire response looks like this:
{
"company_name": "A.P. M\u00f8ller - M\u00e6rsk B",
"isin": "DK0010244508",
"symbol": "172.1.MAERSK/B",
"tstamp": "2021-01-26 12:02:36",
"last": "13215",
"lowest": "13170",
"highest": "13600",
"close_price": "13660",
"volume": "23053",
"bid": "13205",
"ask": "13210",
"change_pct": "-3.257686676427529",
"change_pt": "-445",
"data_provider": "2",
"provider_name": "OMX",
"delay_type": "1"
}
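Note that every value in this response is a JSON string, so numbers need converting before any arithmetic. The sketch below works offline on a few fields copied from the response above. It checks that change_pct looks like the move from close_price to last, an interpretation inferred from the numbers rather than documented by the site.

```python
import json

# A few fields copied from the response shown above (all values are strings).
sample = """
{
  "symbol": "172.1.MAERSK/B",
  "last": "13215",
  "close_price": "13660",
  "change_pct": "-3.257686676427529",
  "change_pt": "-445"
}
"""

data = json.loads(sample)

# Convert the string fields to numbers before doing arithmetic.
last = float(data["last"])
close_price = float(data["close_price"])

# Assumed relationship: change is measured from close_price to last.
change_pt = last - close_price
change_pct = change_pt / close_price * 100

print(data["symbol"], change_pt, round(change_pct, 4))
```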

Can BeautifulSoup locate an element based on its contained text? [duplicate]

Observe the following problem:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
# This returns the <a> element
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*")
)
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
# This returns None
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*")
)
For some reason, BeautifulSoup will not match the text, when the <i> tag is there as well. Finding the tag and showing its text produces
>>> a2 = soup.find(
'a',
href="/customer-menu/1/accounts/1/update"
)
>>> print(repr(a2.text))
'\n Edit\n'
Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:
pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n') # Returns None
pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n') # Returns MatchObject
Alright. Looks good. Let's try it with soup
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*", flags=re.DOTALL)
) # Still return None... Why?!
Edit
My solution, based on geckon's answer: I implemented these helpers:
import re
MATCH_ALL = r'.*'
def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)
def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.
    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # Tags must be converted to str before they can be joined.
        raise ValueError("Too many matches:\n" + "\n".join(str(match) for match in matches))
    elif len(matches) == 0:
        return None
    else:
        return matches[0]
Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')

The problem is that your <a> tag with the <i> tag inside doesn't have the string attribute you expect it to have. First, let's take a look at what the text argument of find() does.
NOTE: text is the old name of this argument; since BeautifulSoup 4.4.0 it is called string.
From the docs:
Although string is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
.string matches your value for string. This code finds the tags
whose .string is “Elsie”:
soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
Now let's take a look what Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
title_tag.string
# u'The Dormouse's story'
(...)
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None:
print(soup.html.string)
# None
This is exactly your case. Your <a> tag contains both text and an <i> tag, so its .string is None, and find() has nothing to match your pattern against.
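The difference is easy to check directly. The following sketch compares .string and get_text() on the two links from the question, with the href shortened to "/u" for brevity.

```python
from bs4 import BeautifulSoup

# The link with only text inside it: exactly one child, a NavigableString.
plain = BeautifulSoup('<a href="/u">\nEdit\n</a>', "html.parser").a

# The link with an <i> tag plus text: more than one child.
nested = BeautifulSoup(
    '<a href="/u">\n<i class="fa fa-edit"></i> Edit\n</a>', "html.parser"
).a

print(repr(plain.string))       # the single text child
print(repr(nested.string))      # None, because there are several children
print(repr(nested.get_text()))  # all descendant text joined together
```

This is why approaches that test get_text() (or .text) still find the link even though string matching does not.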
How to solve this?
Maybe there is a better solution but I would probably go with something like this:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")
thelink = None  # stays None if no link contains the text
for link in links:
    if link.find(text=re.compile("Edit")):
        thelink = link
        break
print(thelink)
I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.

In one line, using a lambda:
soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)

You can pass .find a function that returns True if a tag's text contains "Edit":
In [51]: def Edit_in_text(tag):
   ....:     return tag.name == 'a' and 'Edit' in tag.text
   ....:
In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]:
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
EDIT:
You can use the .get_text() method instead of .text in your function, which gives the same result:
def Edit_in_text(tag):
    return tag.name == 'a' and 'Edit' in tag.get_text()
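The same idea can be generalised so the tag name and the text are not hard-coded. This is a sketch only: tag_with_text is a hypothetical helper name, and the second link (the "delete" URL) is invented for the example.

```python
from bs4 import BeautifulSoup


def tag_with_text(name, needle):
    """Build a filter function for find()/find_all().

    Hypothetical helper: matches tags called `name` whose combined
    descendant text contains `needle`.
    """
    def predicate(tag):
        return tag.name == name and needle in tag.get_text()
    return predicate


html = """
<a href="/customer-menu/1/accounts/1/update"><i class="fa fa-edit"></i> Edit</a>
<a href="/customer-menu/1/accounts/1/delete"><i class="fa fa-trash"></i> Delete</a>
"""
soup = BeautifulSoup(html, "html.parser")

edit_link = soup.find(tag_with_text("a", "Edit"))
delete_link = soup.find(tag_with_text("a", "Delete"))
missing = soup.find(tag_with_text("a", "Archive"))  # no such link, so None

print(edit_link["href"], delete_link["href"], missing)
```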

With soupsieve 2.1.0 or later, you can use the :-soup-contains CSS pseudo-class selector to target a node's text. It replaces the deprecated :contains() form.
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
single = soup.select_one('a:-soup-contains("Edit")').text.strip()
multiple = [i.text.strip() for i in soup.select('a:-soup-contains("Edit")')]
print(single, '\n', multiple)
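The selector also matches on text held in descendants, so it should handle the link from the question that has an <i> tag inside it. A minimal sketch, assuming soupsieve 2.1.0 or later is installed alongside bs4:

```python
from bs4 import BeautifulSoup

# The problematic link from the question: text plus a nested <i> tag.
html = """
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
"""
soup = BeautifulSoup(html, "html.parser")

# :-soup-contains looks at the element's text, including nested elements.
link = soup.select_one('a:-soup-contains("Edit")')
no_link = soup.select_one('a:-soup-contains("Delete")')  # nothing matches

print(link["href"], no_link)
```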

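Method 1: match on the string argument (this only matches when the tag's .string is exactly the pattern, so it fails if the tag has nested tags or surrounding whitespace)
pattern = 'Edit'
a2 = soup.find_all('a', string=pattern)[0]
Method 2: use a lambda to test the text of each element
a2 = soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)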
Good Luck

Compare the 'class' of the container tag

Let's say I extract some classes from some HTML:
p_standards = soup.find_all("p", attrs={'class': re.compile(r"Standard|P3")})
for p_standard in p_standards:
    print(p_standard)
And the output looks like this:
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
And let's say I only wanted to print the text inside the P3 classes so that the output looks like:
a
c
I thought this code below would work, but it didn't. How can I compare the class name of the container tag to some value?
p_standards = soup.find_all("p", attrs={'class': re.compile(r"Standard|P3")})
for p_standard in p_standards:
    if p_standard.get("class") == "P3":
        print(p_standard.get_text())
I'm aware that in my first line, I could have simply done r"P3" instead of r"Standard|P3", but this is only a small fraction of the actual code (not the full story), and I need to leave that first line as it is.
Note: doing something like .find("p", class_ = "P3") only works for descendants, not for the container tag.

OK, so after playing around with the code, it turns out that
p_standard.get("class")[0] == "P3"
works. I was missing the [0]: class is a multi-valued attribute, so .get("class") returns a list such as ["P3"], not a string.
So this code works:
p_standards = soup.find_all("p", attrs={'class': re.compile(r"Standard|P3")})
for p_standard in p_standards:
    if p_standard.get("class")[0] == "P3":
        print(p_standard.get_text())
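Comparing only the first entry works for the HTML above but would miss a tag whose class attribute lists P3 second. A small sketch of a membership test instead; the extra class name "intro" is an assumption added to show the difference.

```python
from bs4 import BeautifulSoup

# "intro" is an invented extra class; it puts P3 in second position.
html = '<p class="intro P3">a</p><p class="Standard">b</p><p class="P3">c</p>'
soup = BeautifulSoup(html, "html.parser")

# class is multi-valued, so bs4 returns a list of class names.
first_classes = soup.find("p").get("class")
print(first_classes)

# Membership test: independent of the order of the class names.
# The [] default guards against tags that have no class attribute.
texts = [p.get_text() for p in soup.find_all("p") if "P3" in p.get("class", [])]
print(texts)
```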

I think the following is more efficient: use select with the CSS comma (or) syntax to gather the elements that match either class.
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head></head>
<body>
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
p_standards = soup.select('.Standard,.P3')
for p_standard in p_standards:
    if 'P3' in p_standard['class']:
        print(p_standard.text)

BeautifulSoup find by attribute value regardless of attribute

Say I have something like this:
<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>
I want to search for the keyword 'cake' and get all of them.

Use find_all with a lambda that checks whether any attribute equals the value you want, or whether the class list contains it.
from bs4 import BeautifulSoup
example = """<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>"""
soup = BeautifulSoup(example, "html.parser")
print(soup.find_all(lambda tag: [a for a in tag.attrs.values() if a == "cake" or "cake" in tag.get("class", [])]))
Outputs:
[<div class="cake">1</div>, <h2 id="cake">1</h2>, <sometag someattribute="cake">1</sometag>]
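A more explicit version of the same filter, written as a named function, may be easier to read and extend. This is a sketch under assumptions: has_attr_value is a hypothetical helper, and the sample HTML is a tidied variant of the question's snippet with properly closed tags.

```python
from bs4 import BeautifulSoup


def has_attr_value(value):
    """Build a filter matching tags where any attribute holds `value`.

    Hypothetical helper. Multi-valued attributes such as class are
    stored as lists by bs4, so they are checked by membership.
    """
    def predicate(tag):
        for attr_value in tag.attrs.values():
            if isinstance(attr_value, list):
                if value in attr_value:
                    return True
            elif attr_value == value:
                return True
        return False
    return predicate


html = (
    '<div class="cake big">1</div>'
    '<h2 id="cake">2</h2>'
    '<span title="pie">3</span>'
    '<b data-kind="cake">4</b>'
)
soup = BeautifulSoup(html, "html.parser")

names = [tag.name for tag in soup.find_all(has_attr_value("cake"))]
print(names)
```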

You could use regex and BeautifulSoup together. This is my terrible script:
r = '''<div class="cake">1</div>
<h2 id="cake">1</div>
<sometag someattribute="cake">1</div>'''
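import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(r, 'lxml')
pattern = r'(\w+)="cake"'  # captures the name of every attribute whose value is "cake"
attribute_names = re.findall(pattern, str(soup))
for i in range(len(attribute_names) - 1):  # the - 1 skips the last attribute name
    print(soup.find_all(re.compile(r'(\w+)'), {attribute_names[i]: 'cake'}))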
The output:
[<div class="cake">1</div>]
[<h2 id="cake">1 </div>
<sometag someattribute="cake">1</sometag></h2>]
