Getting width/attributes out of a tag in Beautifulsoup instead of text

Getting width/attributes out of a tag in Beautifulsoup instead of text - beautifulsoup

So the beautifulsoup documentation I can find talks about finding a specific tag using id, class etc... But it doesn't talk about how to extract data from within the tag rather than what it surrounds.
My issue:
<img src=yellowbar.png width=63.94 height=10><img src=redbar.png width=36.0632181423 height=10><br />
Power:</b> 1480 / 1480<br />
<img src=yellowbar.png width=100 height=10><img src=redbar.png width=0 height=10><br />
I have this HTML. There are around a total of 20 tags on the page, of which 3 have src=yellowbar.png
my goal is, to select the second one, and get the width back. So I am guessing it would go:
Find tags -> find src=yellowbar.png -> select second one -> print width back.
How would I go about this?
So far I've managed to print a list of all tags.
soup = BeautifulSoup(element, "lxml")
tag = soup.find_all('img')
print(tag)
which returns
[<img height="10" src="yellowbar.png" width="77"/>, <img height="10" src="redbar.png" width="0"/>]

If I could understand your question then this should solve your issue.
from bs4 import BeautifulSoup
content = """
<img src=yellowbar.png width=63.94 height=10><img src=redbar.png width=36.0632181423 height=10><br />
Power:</b> 1480 / 1480<br />
<img src=yellowbar.png width=100 height=10><img src=redbar.png width=0 height=10><br />
"""
soup = BeautifulSoup(content,"lxml")
for tags in soup.find_all("img",{"src":"yellowbar.png"}): #use the attributes as well to specify the item you look for
print(tags['width']) #access the value using attribute
Output:
63.94
100

Related

Capture links in scrapy using regex as selector

<svg version="1.1" id="Calque_1" xmlns="&ns_svg;" xmlns:xlink="&ns_xlink;" width="700" height="700" viewBox="0 0 300 300" overflow="visible" enable-background="new 0 0 300 300" xml:space="preserve">
<a xlink:href="https://www.pros-locations-de-voitures.fr/location-de-voiture-ain-01/" onmouseover="TipFunction('Ain')" onmouseout="TipFunction('')"><path id="Z1" title="Ain" d="M237.125,152.725l-1.7-1l-2.4,3.3l-2.7,1.6l-2,0.1l-0.2-1.4l-1.6-0.8l-2,2.2l-1.5,0.1v-1.5h-1.5l-2.1-3.9 l-2.5-1.6l-2.7,0.6l-2.9-0.8l-2.9,10.5l-0.8,4l1.5,4.6l1.5-0.3l1.8,2.9l3.2-0.3l3,1l1.5-2.5l1.4-0.4l5.6,7.6l2.9-3.3l1.1-6.8 l-0.4-4.7h1.5l1.3-1.4h-0.1l0.3-2.6l2.8-1.7L237.125,152.725z" fill="red" stroke="#EEEEEE" stroke-width="0.9"></path> </a>
<a xlink:href="https://www.pros-locations-de-voitures.fr/location-de-voiture-aisne-02/" onmouseover="TipFunction('Aisne')" onmouseout="TipFunction('')"><path id="Z2" title="Aisne" d="M179.025,42.325l-6.3,0.4l-0.2,1.8l-1.9,4.1l1.1,3.5l0.2,5.1l-0.3,2.2l1.1,0.9l-1.3,0.6l-1.2,2.8l-1.3,0.8 l1.4,2.3l-1.5-0.1l0.4,1.5l1.2-0.8l1.4,0.6l0.3,1.4l-1.1,0.8l1.3,0.4l0.9,1.2l-0.3,1.4l1.9,2.1l4.7,3l3.8-5.1l-1.3-0.6l0.5-1.4 l-0.8-1.2l2.7-1.1l-1.6-4l0.6-1.4l4-2l2.7,1l0.4-1.5l-0.1-7.1l1.4-0.1l2.5-3.6l-0.7-1.6l0.7-1.7l-0.4-2.9h-0.2l-1.8-0.6v-0.1 l-7.8-2.1l-2.6,0.9l-1.2-0.9L179.025,42.325z " fill="#094353" stroke="#EEEEEE" stroke-width="0.9"></path> </a>
While testing the regex pattern its working fine and matches the links but while applying in code it returning empty list.
import scrapy
class scraper(scrapy.Spider):
name = "scraper"
start_urls = ["https://www.pros-locations-de-voitures.fr/"]
def parse(self, response):
yield {
'Links' : response.selector.re('(?<=xlink:href=").*?(?=")')
}

The data you are looking for is loaded via javascript so to gain access to the data you will have to pre-render the page using either scrapy-splash, selenium or scrapy-playwright. You can then use below xpath selector to obtain the urls. No need to use regex in this case
response.xpath("//*/#*[name()='xlink:href']").getall()

Does BeautifulSoup can locate the element basing on contained text? [duplicate]

Observe the following problem:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
# This returns the <a> element
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*")
)
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
# This returns None
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*")
)
For some reason, BeautifulSoup will not match the text, when the <i> tag is there as well. Finding the tag and showing its text produces
>>> a2 = soup.find(
'a',
href="/customer-menu/1/accounts/1/update"
)
>>> print(repr(a2.text))
'\n Edit\n'
Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:
pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n') # Returns None
pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n') # Returns MatchObject
Alright. Looks good. Let's try it with soup
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
soup.find(
'a',
href="/customer-menu/1/accounts/1/update",
text=re.compile(".*Edit.*", flags=re.DOTALL)
) # Still return None... Why?!
Edit
My solution based on geckons answer: I implemented these helpers:
import re
MATCH_ALL = r'.*'
def like(string):
"""
Return a compiled regular expression that matches the given
string with any prefix and postfix, e.g. if string = "hello",
the returned regex matches r".*hello.*"
"""
string_ = string
if not isinstance(string_, str):
string_ = str(string_)
regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
return re.compile(regex, flags=re.DOTALL)
def find_by_text(soup, text, tag, **kwargs):
"""
Find the tag in soup that matches all provided kwargs, and contains the
text.
If no match is found, return None.
If more than one match is found, raise ValueError.
"""
elements = soup.find_all(tag, **kwargs)
matches = []
for element in elements:
if element.find(text=like(text)):
matches.append(element)
if len(matches) > 1:
raise ValueError("Too many matches:\n" + "\n".join(matches))
elif len(matches) == 0:
return None
else:
return matches[0]
Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')

The problem is that your <a> tag with the <i> tag inside, doesn't have the string attribute you expect it to have. First let's take a look at what text="" argument for find() does.
NOTE: The text argument is an old name, since BeautifulSoup 4.4.0 it's called string.
From the docs:
Although string is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
.string matches your value for string. This code finds the tags
whose .string is “Elsie”:
soup.find_all("a", string="Elsie")
# [Elsie]
Now let's take a look what Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
title_tag.string
# u'The Dormouse's story'
(...)
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None:
print(soup.html.string)
# None
This is exactly your case. Your <a> tag contains a text and <i> tag. Therefore, the find gets None when trying to search for a string and thus it can't match.
How to solve this?
Maybe there is a better solution but I would probably go with something like this:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")
for link in links:
if link.find(text=re.compile("Edit")):
thelink = link
break
print(thelink)
I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.

in one line using lambda
soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)

You can pass a function that return True if a text contains "Edit" to .find
In [51]: def Edit_in_text(tag):
....: return tag.name == 'a' and 'Edit' in tag.text
....:
In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]:
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
EDIT:
You can use the .get_text() method instead of the text in your function which gives the same result:
def Edit_in_text(tag):
return tag.name == 'a' and 'Edit' in tag.get_text()

With soupsieve 2.1.0 you can use :-soup-contains css pseudo class selector to target a node's text. This replaces the deprecated form of :contains().
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
single = soup.select_one('a:-soup-contains("Edit")').text.strip()
multiple = [i.text.strip() for i in soup.select('a:-soup-contains("Edit")')]
print(single, '\n', multiple)

Method - 1: Checking text property
pattern = 'Edit'
a2 = soup.find_all('a', string = pattern)[0]
Method - 2: Using lambda iterate through all elements
a2 = soup.find(lambda tag:tag.name=="a" and "Edit" in tag.text)
Good Luck

compare the 'class' of container tag

Let's say I extract some classes from some HTML:
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
print(p_standard)
And the output looks like this:
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
And let's say I only wanted to print the text inside the P3 classes so that the output looks like:
a
c
I thought this code below would work, but it didn't. How can I compare the class name of the container tag to some value?
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
if p_standard.get("class") == "P3":
print(p_standard.get_text())
I'm aware that in my first line, I could have simply done r"P3" instead of r"Standard|P3", but this is only a small fraction of the actual code (not the full story), and I need to leave that first line as it is.
Note: doing something like .find("p", class_ = "P3") only works for descendants, not for the container tag.

OK, so after playing around with the code, it turns out that
p_standard.get("class")[0] == "P3"
works. (I was missing the [0])
So this code works:
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
if p_standard.get("class")[0] == "P3":
print(p_standard.get_text())

I think the following is more efficient. Use select and CSS Or syntax to gather list based on either class.
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head></head>
<body>
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
p_standards = soup.select('.Standard,.P3')
for p_standard in p_standards:
if 'P3' in p_standard['class']:
print(item.text)

Selenium find all <em> and the following <a> tags

I want to get all em and the following a tags, but these are seperated:
<em style="color: #FF2500;">ITEM:</em> LINK <br />
<em style="color: #FF2500;">ITEM2:</em> LINK2 <br />
<em style="color: #FF2500;">ITEM3:</em> LINK3 <br />
I need to save the ITEM and the correspinding LINK, because they must be together, but I only managed to find the text of the links:
elems = driver.find_elements_by_xpath("//em/following-sibling::a[#href]")
printing this gives me only the contents of the link:
elems = driver.find_elements_by_xpath("//em/following-sibling::a[#href]")
for link in elems:
print (link.text) # LINK, LINK2, LINK3
I could of course find all em and links by themself, but I wouldnt know if they fit together. So I need to find:
All <em> where an <a> with a certain text follows. This way I should know for sure that they are together.

Searches em preceding to a with some text
//a[text()=' LINK2 '] | //a[text()=' LINK2 ']/preceding-sibling::em[1]
to get just text concatenated from both elements
concat(//a[text()=' LINK2 ']/preceding-sibling::em[1], //a[text()=' LINK2 '])
result:
"ITEM2: LINK2 "

Issue parsing variable from HTML with bs4

Im trying to parse the "value" of variable ( __VIEWSTATEGENERATOR ), here's the HTML code ::
<div>
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>
</div>
Here's the code I am attempting to do that with ::
viewstategenerator = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATEGENERATOR"})
I then execute:: print(viewstategenerator), and I get the following string for my variable:
>>> print(viewstategenerator)
[<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>]
I was expecting to grab just the value of "1434571F", not sure why that is... Any help would be highly appreciated!!

It looks like you're close but just a tad confused about the BeautifulSoup API.
soup.findAll returns a list of all of the DOM elements that match the query you gave it. Seeing as only one element on the page can match your query, you should use soup.find instead. To get the value of the value attribute of your input element, use ['value'].
from bs4 import BeautifulSoup as Soup
html = """
<div>
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="1434571F"/>
</div>
"""
soup = Soup(html, 'lxml') # Use whatever parser you're already using.
viewstategenerator = soup.find("input", {"type": "hidden", "name": "__VIEWSTATEGENERATOR"})
print(viewstategenerator['value'])
# Prints 1434571F

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Getting width/attributes out of a tag in Beautifulsoup instead of text - beautifulsoup

Related

Capture links in scrapy using regex as selector

Does BeautifulSoup can locate the element basing on contained text? [duplicate]

compare the 'class' of container tag

Selenium find all <em> and the following <a> tags

Issue parsing variable from HTML with bs4

Categories

Resources