Comment an element with BeautifulSoup - beautifulsoup

I am trying to learn how to use BeautifulSoup. I know how to delete a single element (using extract or decompose). I was wondering if there's a way to put the element inside a comment, so that the element is printed as
<!-- <p>HI there</p> -->

You could create a Comment object from the element and use the replace_with method to replace the original tag with the comment.
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup('<p>HI there</p>', 'html.parser')
soup.p.replace_with(Comment(str(soup.p)))
print(soup)
<!--<p>HI there</p>-->
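
If you need to do this for more than one element, a small helper along these lines (a sketch, not part of the original answer) wraps any tag in a comment in place:
from bs4 import BeautifulSoup, Comment
def comment_out(tag):
    # Replace the tag in the tree with an HTML comment containing its markup.
    tag.replace_with(Comment(str(tag)))
soup = BeautifulSoup('<div><p>HI there</p><p>Keep me</p></div>', 'html.parser')
comment_out(soup.p)  # comments out the first <p> only
print(soup)
<div><!--<p>HI there</p>--><p>Keep me</p></div>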

Nested Span element URL text value - Selenium

I am trying to get the value #2011, which is a URL text, from the HTML below. I tried the below code but it didn't work. It says it is unable to locate the class:
driver.find_element(By.XPATH, '//span[@class = "data-issue-and-pr-hovercards-enabled"]').get_attribute('a')
Can anyone help to correct the mistake? I am new to selenium.
<span data-issue-and-pr-hovercards-enabled>
<span><span> · Fixed by #2011</span><span></span></span>
</span>
Here is the link to the website - github.com/mlpack/mlpack/issues/2008 I want to get the #2011 which is next to the Fixed by Text (below the title of the issue). Is it possible to do this?
Try the below XPath:
This relative XPath will search for all tags (*) that contain the text "#2011":
//*[contains(text(),'#2011')]
Or try the below one. The explanation is very similar, but this one searches only within <a> tags:
//a[contains(text(),'#2011')]
Update:
Try the below XPath:
//span[contains(text(),'Fixed by')]//a
Use the .text attribute on the matched element to fetch the required value.
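A quick usage sketch of that last XPath (assuming a driver that already has the issue page loaded; this snippet is not from the original answer):
from selenium.webdriver.common.by import By
link = driver.find_element(By.XPATH, "//span[contains(text(),'Fixed by')]//a")
print(link.text)  # expected to print "#2011"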
In the given HTML, data-issue-and-pr-hovercards-enabled is an attribute, not the value of class.
To extract the text #2011, ideally you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR and text attribute:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span[data-issue-and-pr-hovercards-enabled] a[data-hovercard-type='pull_request'][data-hovercard-url='/mlpack/mlpack/pull/2011/hovercard']"))).text)
Using XPATH and get_attribute("innerHTML"):
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[@data-issue-and-pr-hovercards-enabled]//a[@data-hovercard-type='pull_request' and @data-hovercard-url='/mlpack/mlpack/pull/2011/hovercard']"))).get_attribute("innerHTML"))
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
References
Links to useful documentation:
get_attribute() method: gets the given attribute or property of the element.
text attribute: returns the text of the element.
Difference between text and innerHTML using Selenium
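To illustrate that difference, something along these lines could be used, reusing the driver from the snippets above (the exact selector here is an assumption, not taken from the page):
from selenium.webdriver.common.by import By
link = driver.find_element(By.CSS_SELECTOR, "span[data-issue-and-pr-hovercards-enabled] a")
print(link.text)                        # rendered, visible text only, e.g. "#2011"
print(link.get_attribute("innerHTML"))  # raw inner markup, including any child tags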

How can I search a Beautiful Soup tree to get the tag path to a text match?

I would like to search a Beautiful Soup element for a text match and return the sequence of tags that lead to the element containing that text.
For example, if at soup.html.head.meta there is text “Hello everybody”, I would like to search on “soup.head” for “Hello everybody” and return the result “soup.html.head.meta”.
Is there a good way to do this and if there is not a simple way, is there a good workaround for quickly finding out where certain known text is located?
Example:
I retrieved the HTML source code from this URL with wget: https://www.gitpod.io/docs/context-urls
I created a Beautiful Soup object from this document like so:
soup = bs4.BeautifulSoup(doc, 'html.parser')
The method soup.html.head.get_text() returns
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGitpod Contexts\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
I know that somewhere in the head element there is some text, "Gitpod Contexts". I would like to know the nearest enclosing element tag, so that I can delete everything except that element: I am trying to prune the Beautiful Soup object down to just the elements that contain text, myself, rather than calling get_text() over the entire object and just automatically pulling the text out.
Example 2
A simpler demonstration would be this:
<html>
<body>
<p>
Hello!
</p>
<p>
Goodbye!
</p>
</body>
</html>
The function:
html.returnLocationOf("Hello!")
returns:
html.body.p
I don't know enough about Beautiful Soup to know how it would specify "the second p" for "Goodbye!" but I imagine it could be incorporated as a method somehow.
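Beautiful Soup has no built-in method for this, but a small helper along these lines (a rough sketch; the function and variable names are invented here) finds the matching text node and walks up through its parents to build the dotted path:
from bs4 import BeautifulSoup
def return_location_of(soup, text):
    # Find the first text node that contains the target string.
    node = soup.find(string=lambda s: s and text in s)
    if node is None:
        return None
    # Collect ancestor tag names (skipping the document root) and reverse them.
    names = [p.name for p in node.parents if p.name != '[document]']
    return '.'.join(reversed(names))
doc = "<html><body><p>Hello!</p><p>Goodbye!</p></body></html>"
soup = BeautifulSoup(doc, 'html.parser')
print(return_location_of(soup, "Hello!"))  # -> html.body.p
Distinguishing "the second p" for "Goodbye!" would need positional indices on top of this, which the sketch does not attempt.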

Beautiful soup how to remove links *and* the link text from soup

I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here:
This text is the problem
The above text links to the Beautiful Soup documentation. At present I cut out the actual link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
You can remove the <a> tags that have an href, using either .extract() or .decompose():
Here it is in full:
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
Output:
I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here:
This text is the problem
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
And then by removing it:
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    a.extract()
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
Output:
I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here:
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
You could also use .decompose():
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
soup.a.decompose()
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
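Note that soup.a.decompose() only removes the first <a> it finds; if the page contains several links, a loop over find_all removes them all before extracting the text (a minimal sketch with made-up HTML):
from bs4 import BeautifulSoup
html = '<p>Keep this <a href="https://example.com/">drop this link text</a> and this.</p>'
soup = BeautifulSoup(html, 'html.parser')
# Remove every link together with its visible text.
for a in soup.find_all('a'):
    a.decompose()
print(soup.get_text())
Output:
Keep this  and this.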

Is there a way to find a string in a HTML file and return its XPath?

I'm trying to reverse-engineer pages in a scraper so I can generate a model to extract data.
So, I know the title of a page and I want to look for it inside an HTML code and then return the XPath or CSS Selector to this location.
I'm using Scrapy in my project, but, for this reverse engineering, I thought maybe Beautiful Soup 4 combined with lxml parser could help me too. I just haven't found any docs about it.
Does anyone know if there is a way to do this?
If you're actually using lxml, you could use getpath()...
from lxml import etree
xml = """
<doc>
  <one>
    <two>
      <test>foo</test>
    </two>
    <two>
      <test>bar</test>
    </two>
  </one>
</doc>
"""
tree = etree.fromstring(xml)
for match in tree.xpath("//*[contains(text(),'bar')]"):
    print(etree.ElementTree(tree).getpath(match))
This prints:
/doc/one/two[2]/test
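For a real HTML page (rather than well-formed XML), the same idea works with lxml's HTML parser; here is a minimal sketch, with the page content inlined as a stand-in for the downloaded source:
from lxml import html
page_source = "<html><body><div><p>Gitpod Contexts</p></div></body></html>"
root = html.fromstring(page_source)
# getroottree().getpath() returns an absolute XPath for any matched element.
for el in root.xpath("//*[contains(text(), 'Gitpod Contexts')]"):
    print(el.getroottree().getpath(el))
This prints:
/html/body/div/p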

Scrapy, javascript form, not crawling next page

I am having an issue. I am using Scrapy to extract data from HTML tables that are displayed after a form search. The problem is that it will not continue to crawl to the next page. I have tried multiple combinations of rules. I understand that it is not recommended to override the default parse logic in CrawlSpider. I have found many answers that fix other people's issues, but I have not been able to find a solution for the case where a form POST must occur first. When I look at my code, I see that it requests the allowed URLs, then POSTs to search.do, and the results are returned as an HTML-formatted results page, and thus the parsing begins. Here is my code (I have replaced the real URL with nourl.com):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request
from EMD.items import EmdItem
class EmdSpider(CrawlSpider):
    name = "emd"
    start_urls = ["https://nourl.com/methor"]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div//div//div//span[@class="pagelinks"]/a[@href]'))),
        Rule(SgmlLinkExtractor(allow=('')), callback='parse_item')
    )

    def parse_item(self, response):
        url = "https://nourl.com/methor-app/search.do"
        payload = {"county": "ANDERSON"}
        return FormRequest(url, formdata=payload, callback=self.parse_data)

    def parse_data(self, response):
        print response
        sel = Selector(response)
        items = sel.xpath('//td').extract()
        print items
I have left allow=('') blank because I have tried so many combinations of it. Also, my XPath leads to this:
<div align="center">
<div id="bg">
<!--
Main Container
-->
<div id="header2"></div>
<!--
Content
-->
<div id="content">
<!--
Hidden/Accessible Headers
-->
<h1 class="hide"></h1>
<!--
InstanceBeginEditable name="Content"
-->
<h2></h2>
<p align="left"></p>
<p id="printnow" align="center"></p>
<p align="left"></p>
<span class="pagebanner"></span>
<span class="pagelinks">
[First/Prev]
<strong></strong>
,
<a title="Go to page 2" href="/methor-app/results.jsp?d-49653-p=2"></a>
,
<a title="Go to page 3" href="/methor-app/results.jsp?d-49653-p=3"></a>
[
/
]
</span>
I have checked with multiple tools and my XPath is correctly pointing to the URLs for the next page, but my output in the command prompt is only grabbing data from the first page. I have seen a couple of tutorials where the code contains a yield statement, but I am not sure what that does other than "tell the function that it will be used again later without losing its data". Any ideas would be helpful. Thank you!!!
It may be because you need to select the actual URL in your rule, not just the <a> node. [...] in XPath is used to express a condition, not to select something. Try:
//span[@class="pagelinks"]/a/@href
Also a few comments:
How did you find this HTML? Beware of tools that find XPath for you, as HTML retrieved with a browser and with Scrapy may differ, because Scrapy doesn't handle JavaScript (which can be used to generate the page you're looking at) and some browsers also try to sanitize the HTML.
It may not be the case here, but the "javascript form" in a Scrapy question spooked me. You should always check that the content of response.body is what you expect.
//div//div//div is exactly the same as //div. The double slash means we no longer care about the structure and just select all nodes named div among the descendants of the current node. That is also why //span[...] alone might do the trick here.
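To actually follow those page links after the form POST, the callback has to yield follow-up requests. A minimal sketch of that idea, assuming a recent Scrapy and swapping the CrawlSpider rules for explicit requests (the URLs and selectors mirror the question; this is not tested against the real site):
import scrapy
class EmdSpider(scrapy.Spider):
    name = "emd"
    start_urls = ["https://nourl.com/methor"]

    def parse(self, response):
        # Submit the search form first; the results come back as an HTML page.
        return scrapy.FormRequest(
            "https://nourl.com/methor-app/search.do",
            formdata={"county": "ANDERSON"},
            callback=self.parse_data,
        )

    def parse_data(self, response):
        # Extract the table cells on the current results page.
        for cell in response.xpath("//td/text()").getall():
            yield {"cell": cell}
        # Queue every page link found in the pagelinks span; duplicates are filtered by Scrapy.
        for href in response.xpath('//span[@class="pagelinks"]/a/@href').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_data)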