Beautiful soup how to remove links *and* the link text from soup - beautifulsoup

I'm using Beautiful Soup to get some cleaned-up text from a webpage - no HTML, just the text that's shown to the user. However, I don't want the code to treat text that has a link attached as visible text. To make clear what I mean, here:
This text is the problem
The above text links to the Beautiful Soup documentation. At present I cut out the actual link, but the text 'This text is the problem' remains. Ideally I would like to remove that text as well.

You can find the <a> tags that have an href attribute and remove them with either .extract() (which removes the tag from the tree and returns it) or .decompose() (which removes the tag and destroys it). Because the link text lives inside the <a> tag, removing the tag removes its text with it.
Here it is in full:
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p>This text is the problem</p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
Output:
I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here:
This text is the problem
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
And then, after removing the <a> tags with .extract():
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p>This text is the problem</p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    a.extract()

p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
Output:
I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here:
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
You could also use .decompose(); note that soup.a only targets the first <a> tag in the soup, which is enough for this single-link example (loop over find_all('a', href=True) to remove them all):
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p>This text is the problem</p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
soup.a.decompose()
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
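If the goal is the cleaned-up text of the whole page rather than paragraph by paragraph, a minimal sketch (the sample HTML and the example.com URL here are made up for illustration) is to decompose every link and then call get_text() on the soup itself:
from bs4 import BeautifulSoup

html = '''<div><p>Keep this. <a href="https://example.com">Drop this link text.</a></p>
<p>Keep this too.</p></div>'''

soup = BeautifulSoup(html, 'html.parser')

# Remove every link (and the text inside it) before extracting the visible text
for a in soup.find_all('a', href=True):
    a.decompose()

print(soup.get_text(separator=' ', strip=True))
# Keep this. Keep this too.
get_text(separator=' ', strip=True) collapses the remaining text nodes into a single cleaned string.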

Related

How to scrape some text inside a randomly generated href - selenium

I was scraping a dynamic page with Selenium and I got stuck getting text 1 and text 2 in the following example:
<span class="class number 1"> text 1 <a href="link 1"> text 2 </a> </span>
The same happens if the span is instead a div.
I managed to get text 1 with this Python line:
var = driver.find_element(By.CLASS_NAME, "class number 1").text
However, to get text 2: since link 1 is generated randomly, I can't refer to the href in any way!
Any help is really appreciated
Try this, it retrieves 'text 2':
driver.find_element(By.XPATH, ".//span[@class='class number 1']/a").text
Try using CSS Selectors
txt2 = driver.find_element(By.CSS_SELECTOR, "span[class='class number 1'] > a").text
# To extract both text node values at the same time, you can use innerHTML as follows
driver.find_element(By.CSS_SELECTOR, "span[class='class number 1']").get_attribute('innerHTML')
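If you need the two pieces separately, one sketch (assuming a driver that is already on the page, the span/anchor structure above, and that text 2 does not also occur inside text 1) is to read the anchor for text 2 and subtract it from the span's full text to recover text 1:
from selenium.webdriver.common.by import By

span = driver.find_element(By.CSS_SELECTOR, "span[class='class number 1']")
link = span.find_element(By.CSS_SELECTOR, "a")

text2 = link.text                             # the anchor's visible text
text1 = span.text.replace(text2, "").strip()  # span text minus the anchor text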

How can I search a Beautiful Soup tree to get the tag path to a text match?

I would like to search a Beautiful Soup element for a text match and return the sequence of tags that lead to the element containing that text.
For example, if at soup.html.head.meta there is text “Hello everybody”, I would like to search on “soup.head” for “Hello everybody” and return the result “soup.html.head.meta”.
Is there a good way to do this and if there is not a simple way, is there a good workaround for quickly finding out where certain known text is located?
Example:
I retrieved the HTML source code from this URL with wget: https://www.gitpod.io/docs/context-urls
I created a Beautiful Soup object from this document like so:
soup = bs4.BeautifulSoup(doc, 'html.parser')
The method soup.html.head.get_text() returns
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGitpod
Contexts\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
I know that somewhere in the head element is the text "Gitpod Contexts". I would like to know the element tag nearest to it so I can delete everything except that element: I am trying to prune the Beautiful Soup object down to just the elements that contain text, rather than calling "get_text()" over the entire object and pulling the text out automatically.
Example 2
A simpler demonstration would be this:
<html>
<body>
<p>
Hello!
</p>
<p>
Goodbye!
</p>
</body>
</html>
The function:
html.returnLocationOf("Hello!")
returns:
html.body.p
I don't know enough about Beautiful Soup to know how it would specify "the second p" for "Goodbye!" but I imagine it could be incorporated as a method somehow.
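Beautiful Soup has no built-in returnLocationOf, but a sketch of the idea (path_to_text is a made-up helper name; it assumes the target text sits in a single string inside one tag) is to locate the NavigableString with find(string=...) and walk its .parents to build the dotted path:
from bs4 import BeautifulSoup

html = """<html><body><p>Hello!</p><p>Goodbye!</p></body></html>"""
soup = BeautifulSoup(html, 'html.parser')

def path_to_text(soup, text):
    # Find the text node whose stripped value matches, then walk up its parents
    node = soup.find(string=lambda s: s and s.strip() == text)
    if node is None:
        return None
    names = [parent.name for parent in node.parents
             if parent.name not in (None, '[document]')]
    return '.'.join(reversed(names))

print(path_to_text(soup, "Hello!"))    # html.body.p
print(path_to_text(soup, "Goodbye!"))  # html.body.p (same path; keep the node itself to tell the two <p> tags apart)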

Comment an element with BeautifulSoup

I am trying to learn how to use BeautifulSoup. I know how to delete a single element (using extract or decompose). I was wondering if there's a way to put the element within a comment, so that the element is printed as
<!-- <p>HI there</p> -->
You could create a Comment object from the element and use the replace_with method to replace the original tag with the comment.
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup('<p>HI there</p>', 'html.parser')
soup.p.replace_with(Comment(str(soup.p)))
print(soup)
Output:
<!--<p>HI there</p>-->
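The same pattern extends to commenting out every element of a kind; for example, to tuck every link into a comment (the sample HTML and URL are invented for this sketch):
from bs4 import BeautifulSoup, Comment

html = '<p>Keep this. <a href="https://example.com">Comment this out.</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# Replace each <a> tag with a comment containing its original markup
for a in soup.find_all('a', href=True):
    a.replace_with(Comment(str(a)))

print(soup)
# <p>Keep this. <!--<a href="https://example.com">Comment this out.</a>--></p>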

How to verify text across HTML elements in Selenium

Given the following code, how would I verify the text within using Selenium?
<div class='my-text-block'>
<p>My first paragraph of text</p>
<p>My second paragraph of text</p>
</div>
I want to capture all the text in one verifyText statement:
My first paragraph of text
My second paragraph of text
Is it possible?
Since you've tagged this with selenium-webdriver, I'm assuming you want a code example, but because you haven't stated which language you're using, I'll give you a Python example. It should be easy to translate to a different language if needed.
assert driver.find_element(By.CLASS_NAME, "my-text-block").text == "What I expect it to be"
The text attribute on a WebElement object simply contains all visible text within that element and all of its child elements.
And some lovely docs, of course.
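For the HTML in the question, both paragraphs come back from that single .text call; a sketch (assuming the driver is already on the page) that checks the two lines without depending on exactly how the browser separates them:
from selenium.webdriver.common.by import By

# .text returns the rendered text of the div and everything inside it
block_text = driver.find_element(By.CLASS_NAME, "my-text-block").text

# Compare line by line so the test doesn't depend on single vs. double newlines
lines = [line for line in block_text.splitlines() if line.strip()]
assert lines == ["My first paragraph of text", "My second paragraph of text"]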

Scrapy, javascript form, not crawling next page

I am having an issue: I am using Scrapy to extract data from HTML tables that are displayed after a form search, and the problem is that it will not continue crawling to the next page. I have tried multiple combinations of rules. I understand that it is not recommended to override the default parse logic in CrawlSpider, and I have found many answers that fix other people's issues, but I have not been able to find a solution in which a form POST must occur first. Looking at my code, it requests the start URL, then POSTs to search.do, the results come back as an HTML results page, and the parsing begins from there. Here is my code (I have replaced the real URL with nourl.com):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request
from EMD.items import EmdItem

class EmdSpider(CrawlSpider):
    name = "emd"
    start_urls = ["https://nourl.com/methor"]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div//div//div//span[@class="pagelinks"]/a[@href]'))),
        Rule(SgmlLinkExtractor(allow=('')), callback='parse_item'),
    )

    def parse_item(self, response):
        url = "https://nourl.com/methor-app/search.do"
        payload = {"county": "ANDERSON"}
        return FormRequest(url, formdata=payload, callback=self.parse_data)

    def parse_data(self, response):
        print response
        sel = Selector(response)
        items = sel.xpath('//td').extract()
        print items
I have left allow=('') blank because I have tried so many combinations of it. Also, my XPath leads to this:
<div align="center">
<div id="bg">
<!--
Main Container
-->
<div id="header2"></div>
<!--
Content
-->
<div id="content">
<!--
Hidden/Accessible Headers
-->
<h1 class="hide"></h1>
<!--
InstanceBeginEditable name="Content"
-->
<h2></h2>
<p align="left"></p>
<p id="printnow" align="center"></p>
<p align="left"></p>
<span class="pagebanner"></span>
<span class="pagelinks">
[First/Prev]
<strong></strong>
,
<a title="Go to page 2" href="/methor-app/results.jsp?d-49653-p=2"></a>
,
<a title="Go to page 3" href="/methor-app/results.jsp?d-49653-p=3"></a>
[
/
]
</span>
I have checked with multiple tools and my XPath correctly points to the URLs for the next page, yet my output in the command prompt only grabs data from the first page. I have seen a couple of tutorials where the code contains a yield statement, but I am not sure what that does other than "tell the function that it will be used again later without losing its data". Any ideas would be helpful. Thank you!!!
It may be because you need to select the actual URL in your rule, not just the <a> node. [...] in XPath is used to express a condition, not to select something. Try:
//span[@class="pagelinks"]/a/@href
Also a few comments:
How did you find this HTML? Beware of tools that generate XPath for you: HTML retrieved with a browser and with Scrapy can differ, because Scrapy doesn't handle JavaScript (which may be used to generate the page you're looking at) and some browsers also try to sanitize the HTML.
It may not be the case here, but the mention of a "javascript form" in a Scrapy question worries me. You should always check that the content of response.body is what you expect.
//div//div//div gains you nothing over //div here. The double slash already means we don't care about the structure in between and selects matching nodes anywhere below; the extra //div steps only add a nesting requirement that isn't useful. That is also why //span[@class="pagelinks"] on its own might do the trick.
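As a concrete sketch of following those pagination links once the href values are selected, here is one way to structure the spider with the current Scrapy API (scrapy.Spider with response.follow instead of the old CrawlSpider/SgmlLinkExtractor rules; the nourl.com URLs and the county payload are carried over from the question):
import scrapy
from scrapy.http import FormRequest

class EmdSpider(scrapy.Spider):
    name = "emd"
    start_urls = ["https://nourl.com/methor"]

    def parse(self, response):
        # POST the search form first; the results page goes to parse_data
        yield FormRequest(
            "https://nourl.com/methor-app/search.do",
            formdata={"county": "ANDERSON"},
            callback=self.parse_data,
        )

    def parse_data(self, response):
        # Scrape the table cells on the current results page
        yield {"cells": response.xpath("//td//text()").getall()}

        # Follow every pagination link; response.follow resolves relative URLs
        for href in response.xpath('//span[@class="pagelinks"]/a/@href').getall():
            yield response.follow(href, callback=self.parse_data)
The yield statements are what let a single callback emit both scraped items and new requests, which is why they appear in the tutorials mentioned in the question.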