How can I search a Beautiful Soup tree to get the tag path to a text match? - beautifulsoup

I would like to search a Beautiful Soup element for a text match and return the sequence of tags that lead to the element containing that text.
For example, if at soup.html.head.meta there is text “Hello everybody”, I would like to search on “soup.head” for “Hello everybody” and return the result “soup.html.head.meta”.
Is there a good way to do this and if there is not a simple way, is there a good workaround for quickly finding out where certain known text is located?
Example:
I retrieved the HTML source code from this URL with wget: https://www.gitpod.io/docs/context-urls
I created a Beautiful Soup object from this document like so:
soup = bs4.BeautifulSoup(doc, 'html.parser')
The method soup.html.head.get_text() returns
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGitpod
Contexts\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
I know that the text "Gitpod Contexts" is somewhere in the head element. I would like to find the nearest enclosing tag so that I can delete everything except that element. My goal is to prune the Beautiful Soup object myself so that it contains only elements with text in them, rather than calling get_text() on the entire object and letting it pull the text out automatically.
Example 2
A simpler demonstration would be this:
<html>
<body>
<p>
Hello!
</p>
<p>
Goodbye!
</p>
</body>
</html>
The function:
html.returnLocationOf("Hello!")
returns:
html.body.p
I don't know enough about Beautiful Soup to know how it would specify "the second p" for "Goodbye!" but I imagine it could be incorporated as a method somehow.
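As far as I know, Beautiful Soup has no built-in method that returns such a path, so the following is only a minimal sketch of one workaround. It assumes bs4 is installed. The names tag_path and find_text_paths are hypothetical helpers, and the p[2] index notation is made up here to tell repeated sibling tags apart (it is not Beautiful Soup syntax).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def tag_path(node):
    """Return a dotted path such as 'html.body.p[2]' for a tag or text node."""
    # A text node has no tag name of its own, so start from its parent tag.
    tag = node if isinstance(node, Tag) else node.parent
    parts = []
    # Stop at the BeautifulSoup object itself, whose parent is None.
    while tag is not None and tag.parent is not None:
        siblings = tag.parent.find_all(tag.name, recursive=False)
        if len(siblings) > 1:
            # Compare by identity: Tag equality compares content, not position.
            index = next(i for i, s in enumerate(siblings, 1) if s is tag)
            parts.append(f"{tag.name}[{index}]")
        else:
            parts.append(tag.name)
        tag = tag.parent
    return ".".join(reversed(parts))


def find_text_paths(root, text):
    """Yield the path of every text node under root that contains `text`."""
    for string in root.find_all(string=lambda s: s is not None and text in s):
        yield tag_path(string)


doc = """<html>
<body>
<p>
Hello!
</p>
<p>
Goodbye!
</p>
</body>
</html>"""
soup = BeautifulSoup(doc, "html.parser")
print(list(find_text_paths(soup, "Hello!")))         # ['html.body.p[1]']
print(list(find_text_paths(soup.body, "Goodbye!")))  # ['html.body.p[2]']
```

The path is always built up to the document root, even when the search starts lower down (soup.body above). To get back to the second paragraph from such a path, something like soup.html.body.find_all('p', recursive=False)[1] should work.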

Related

Beautiful soup how to remove links *and* the link text from soup

I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here:
This text is the problem
The above text links to the Beautiful soup documentation. At present I cut out the actual link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
You can remove the <a> tags that have an href with either .extract() or .decompose().
Here is the text in full, before anything is removed:
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
Output:
I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here:
This text is the problem
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
And then by removing it:
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# Remove every <a> tag that has an href, together with its text
for a in soup.find_all('a', href=True):
    a.extract()
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
Output:
I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here:
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
You could also use .decompose():
from bs4 import BeautifulSoup
html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
soup.a.decompose()  # soup.a is only the first <a> tag in the document
p_tags = soup.find_all('p')
for each in p_tags:
    print(each.text)
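Note that soup.a refers only to the first <a> in the document, so a page with several links needs a loop. Here is a hedged sketch of that, assuming bs4 is installed. The sample HTML and the example.com URLs are made up for illustration, and dropping the paragraphs that end up empty is an extra step that may or may not be wanted.

```python
from bs4 import BeautifulSoup

# Made-up sample markup with one link-only paragraph and one inline link.
html = """<div>
<p>Intro text.</p>
<p><a href="https://example.com/">Linked text</a></p>
<p>Closing text with an <a href="https://example.com/x">inline link</a> inside.</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# decompose() destroys each matched tag together with the text inside it.
for a in soup.find_all("a", href=True):
    a.decompose()

# Drop paragraphs that are now empty so they do not print as blank lines.
for p in soup.find_all("p"):
    if not p.get_text().strip():
        p.decompose()

# Collapse runs of whitespace left behind where an inline link was removed.
remaining = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
print(remaining)
```

find_all() returns a list, so removing tags while looping over its result does not disturb the iteration.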

Scrapy - Cleaning up text[/p] from nested links[/a] etc

I am new to Python and to Scrapy as well. Nevertheless, after spending a few days on it, I managed to scrape news articles from an archive successfully.
The problem is that when I scrape the content of an article's <p> elements, that content contains additional tags such as strong and a. Scrapy does not pull out the text inside those tags, so I am left with about two thirds of the article text. Here is an example of the HTML:
<p> According to <a> Japan's newspapers </a> it happened ... </p>
I tried googling around and searching this forum. There were some suggestions, but whatever I tried either did not work or broke my spider.
I have read about normalize-space and removing tags, but it didn't work. Thank you in advance for any insights.
Please provide your selector for more detailed help.
Given what you're describing, I'd guess you're selecting p/text() (XPath) or p::text (CSS), which will not get the text in the children of <p> elements.
You should try selecting response.xpath('//p/descendant-or-self::*/text()') to get the text in the <p> and all its children.
You could also just select the <p>, not its text, and you'll get its children as well. From there you can start cleaning up the tags. There are answered questions regarding how to do that.
You could use str.replace(old, new):
new_string = old_string.replace("<a>", "")
You could integrate this into a loop which iterates over a list that contains all of the substrings that you want to discard.
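The loop described above could look roughly like this. It is only a sketch: strip_substrings is a hypothetical helper name, and plain str.replace removes exact literals only, so a tag with attributes (for example <a href="...">) would not be matched.

```python
def strip_substrings(text, unwanted):
    """Remove every exact occurrence of each unwanted substring from text."""
    for piece in unwanted:
        text = text.replace(piece, "")
    return text


raw = "<p> According to <a> Japan's newspapers </a> it happened ... </p>"
cleaned = strip_substrings(raw, ["<p>", "</p>", "<a>", "</a>"])

# Collapse the double spaces left behind where the tags used to be.
cleaned = " ".join(cleaned.split())
print(cleaned)  # According to Japan's newspapers it happened ...
```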

Selenium find all the elements which have two divs

I am trying to collect texts and images from a website to help collect missing people related tweets. Here is the problem:
Some tweets don't have images so the corresponding <div class='c' ....> has only one <div>...</div>.
Some tweets have images, so the corresponding <div class='c' ....> has two <div>...</div>, as shown in the following code:
<div class='c' id="M_D*****">
<div>...</div>
</div>
and
<div class='c' id="M_D*****">
<div>...</div>
<div>...</div>
</div>
I intend to check whether a tweet has an image, i.e. find out whether the corresponding <div class='c' ....> has two <div>...</div>.
PS: The following code collects all the texts and image URLs, but not all tweets have images, so I want to match them up by solving the problem above.
tweets = browser.find_elements_by_xpath("//span[@class='ctt']")
graph_links = browser.find_elements_by_xpath("//img[@alt='img' and @class='ib']")
This is a public welfare program, which aims to help the missing people go back home.
By collecting the text and the images separately, I think that it's going to be impossible to match the text with the related image after the fact. I would suggest a different approach. I would search for the <div class='c'...> that contains both the text and the optional image. Once you have the "container" DIV, you can then get the text and see if an image exists and put them all together. Without all the relevant HTML, you may have to tweak the code below but it should give you an idea on how to approach this.
from selenium.webdriver.common.by import By

# browser is the WebDriver instance from the question
containers = browser.find_elements(By.CSS_SELECTOR, "div.c")
for container in containers:
    # the tweet text
    print(container.find_element(By.CSS_SELECTOR, "span.ctt").text)
    images = container.find_elements(By.CSS_SELECTOR, "img.ib")
    if images:  # the list is empty when the tweet has no image
        print(images[0].get_attribute("src"))  # the URL of the image
    print("-------------")  # separator between tweets
The HTML you provided is probably not enough, but based on it I suggest this XPath: //div[@id='M_D*****' and ./div//img], which finds the div with the specified id that contains a div with an image.
But answering directly to your question:
//div[./div[2] and not(./div[3])] will find all divs with exactly 2 div children
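The "exactly two div children" check can also be tried offline. The sketch below uses Python's standard xml.etree.ElementTree instead of a live browser. ElementTree's limited XPath has no not(), so the children are counted in Python instead. The ids t1 and t2 and the sample markup are made up, since the real ids are elided in the question.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the page: t1 has text only, t2 has text and an image.
markup = """<root>
  <div class="c" id="t1"><div>text</div></div>
  <div class="c" id="t2"><div>text</div><div>image</div></div>
</root>"""
root = ET.fromstring(markup)


def has_exactly_two_div_children(element):
    # findall("div") returns direct children only, like ./div in XPath.
    return len(element.findall("div")) == 2


with_images = [
    d.get("id")
    for d in root.findall("div[@class='c']")
    if has_exactly_two_div_children(d)
]
print(with_images)  # ['t2']
```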

How to verify text across HTML elements in Selenium

Given the following code, how would I verify the text within using Selenium?
<div class='my-text-block'>
<p>My first paragraph of text</p>
<p>My second paragraph of text</p>
</div>
I want to capture all of the text in one verifyText statement:
My first paragraph of text
My second paragraph of text
Is it possible?
Since you've tagged this with selenium-webdriver, I'm assuming you want a code example but because you've not stated what language you're using, I'll give you a python example. It should be easy to translate that to a different language if needed.
assert driver.find_element("class name", "my-text-block").text == "What I expect it to be"
The text attribute on a WebElement object simply contains all visible text within that element and all children elements.
And some lovely docs, of course.

CSS locator for corresponding xpath for selenium

Part of the HTML of the webpage which I'm testing looks like this
<div id="twoWideCallouts">
<div class="callout">
<a target="_blank" href="http://facebook.com">Facebook</a>
</div>
<div class="callout last">
<a target="_blank" href="http://youtube.com">Youtube</a>
</div>
I have to check using Selenium that when I click on the link text, the URL opened is the same one given in href and not an error page.
Using XPath I've written the following command:
// i is the iterator
selenium.getAttribute("//div[contains(@class, 'callout')]["+i+"]/a/@href")
However, this is very slow and doesn't work for some of the links. By reading many answers and comments on this site I've come to know that CSS locators are faster and cleaner to maintain, so I wrote it again as
css = div:contains(callout)
Firstly, I'm not able to reach to the anchor tag.
Secondly, this page can have any number of divs where class = callout. Using xpathcount I can get the count of these, and I'll be iterating over that count and performing the href check. How can something similar be done using a CSS locator?
Any help would be appreciated.
EDIT
I can click on the link using the locator css=div.callout a, but when I try to read the href value using
String str = "css=div.callout a[href]";
selenium.getAttribute(str);
I get the error "element not found". The console output is given below.
19:12:33.968 INFO - Command request: getAttribute[css=div.callout a[href], ] on session
19:12:33.993 INFO - Got result: ERROR: Element css=div.callout a[href not found on session
I tried to get the href attribute using XPath like this
"xpath=(//div[contains(@class, 'callout')])["+1+"]/a/@href" and it worked fine.
Please tell me what should be the corresponding CSS locator for this.
It should be -
css = div:contains(callout)
Did you notice ":" instead of "." you used?
For CSSCount this might help -
http://www.eviltester.com/index.php/2010/03/13/a-simple-getcsscount-helper-method-for-use-with-selenium-rc/
On a different note, did you see proposal of new selenium site on area 51 - http://area51.stackexchange.com/proposals/4693/selenium.
To read the attribute I used css=div.callout a@href and it worked. The problem was the use of square brackets around the attribute name.
For the first part of your question, anchor your locator on the hyperlink itself:
css=a[href='http://youtube.com']
For achieving a count of elements in the DOM, based on CSS selectors, here's an excellent article.
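For comparison, here is a rough sketch of the same count-and-check idea in the newer WebDriver API, written in Python rather than the Selenium RC Java used in the question. callout_hrefs is a hypothetical helper name. The string "css selector" is the value of By.CSS_SELECTOR, written out so the snippet does not need a Selenium import. The driver argument is assumed to be an already started WebDriver.

```python
def callout_hrefs(driver):
    """Return the href of every link inside a div with class 'callout'."""
    # find_elements returns a (possibly empty) list, so len() gives the count.
    links = driver.find_elements("css selector", "div.callout a")
    return [link.get_attribute("href") for link in links]
```

The length of the returned list plays the role of the XPath or CSS count, and each entry can then be compared with the URL that actually opens. A browser may normalize the href it reports (for example by adding a trailing slash), so an exact string comparison may need some care.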