Unable to loop through navigable string with BeautifulSoup CSS Selector - selenium

I would like to extract the content inside the p tag below.
<section id="abstractSection" class="row">
<h3 class="h4">Abstract<span id="viewRefPH" class="pull-right hidden"></span>
</h3>
<p> Variation of the (<span class="ScopusTermHighlight">EEG</span>), has functional and. behavioural effects in sensory <span class="ScopusTermHighlight">EEG</span>. We can interpret our. Individual <span class="ScopusTermHighlight">EEG</span> text to extract <span class="ScopusTermHighlight">EEG</span> power level.</p>
</section>
A one-line Selenium call, as below,
document_abstract = WebDriverWait(self.browser, 20).until(
    EC.visibility_of_element_located((By.XPATH, '//*[@id="abstractSection"]/p'))).text
easily extracts the p tag content and provides the following output:
Variation of the EEG, has functional and. behavioural effects in sensoryEEG. We can interpret our. Individual EEG text to extract EEG power level.
Nevertheless, I would like to use BeautifulSoup for speed reasons.
The following BeautifulSoup code, which refers to the CSS selector #abstractSection, was tested:
from bs4 import BeautifulSoup as soup

url = r'scopus_offilne_specific_page.html'
with open(url, 'r', encoding='utf-8') as f:
    page_soup = soup(f, 'html.parser')

home = page_soup.select_one('#abstractSection').next_sibling
for item in home:
    for a in item.find_all("p"):
        print(a.get_text())
However, the interpreter returns the following error:
AttributeError: 'str' object has no attribute 'find_all'
Also, since Scopus requires a login ID, the problem above can be reproduced using the offline HTML, which is accessible via this link.
May I know where I went wrong? Any insight is appreciated.

Thanks to this OP, the problem described above can apparently be solved simply as below:
document_abstract=page_soup.select('#abstractSection > p')[0].text
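For context, the AttributeError happens because select_one('#abstractSection') returns the section Tag and .next_sibling then points at the text node right after it; iterating over that yields plain strings, which have no find_all. A minimal working sketch, assuming the same local HTML copy used above:
from bs4 import BeautifulSoup

with open('scopus_offilne_specific_page.html', 'r', encoding='utf-8') as f:
    page_soup = BeautifulSoup(f, 'html.parser')

# select_one returns the <p> Tag itself, so its text can be read directly.
abstract_p = page_soup.select_one('#abstractSection > p')
if abstract_p is not None:
    print(abstract_p.get_text(strip=True))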

Related

Selenium Python, extract text from node and ALL child nodes

I have the opposite problem described here. I can't get the text more than one layer deep.
HTML is structured in the following manner:
<span class="data">
<p>This text is extracted just fine.</p>
<p>And so is this.</p>
<p>
And this.
<div>
<p>But this text is not extracted.</p>
</div>
</p>
<div>
<p>And neither is this.</p>
</div>
</span>
My Python code looks something like this:
el.find_element_by_xpath(".//span[contains(@class, 'data')]").text
Try the same with child elements:
print(el.find_element_by_xpath(".//span[contains(@class, 'data')]").text)
print(el.find_element_by_xpath(".//span[contains(@class, 'data')]/div").text)
print(el.find_element_by_xpath(".//span[contains(@class, 'data')]/p").text)
Not sure what el refers to in your original post, but I was able to get all the text using the below.
driver.find_element_by_xpath("//span[#class='data']").text
Output:
'This text is extracted just fine.\nAnd so is this.\nAnd this.\nBut this text is not extracted.\nAnd neither is this.'
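If each paragraph's text is needed separately rather than as one concatenated string, one option is to query every descendant <p> of the span. A minimal sketch, assuming a driver that has already loaded the markup from the question (Selenium 4 style locators):
from selenium.webdriver.common.by import By

# Every <p> under the span, however deeply nested.
paragraphs = driver.find_elements(By.XPATH, "//span[@class='data']//p")
for p in paragraphs:
    # textContent also returns text that is not currently rendered; .text returns rendered text only.
    print(p.get_attribute("textContent").strip())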
Instead of relying on WebElement.text property consider querying innerText property
Consider using Explicit Wait as it will make your test more robust and reliable in case if the element you're looking for is loaded by i.e. AJAX call
Assuming all above:
print(WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//span[@class='data']"))).get_attribute("innerText"))

Splinter Is it possible to use browser.fill without name as a query

I would like to use an absolute XPath to fill in a search bar. The ids and classes are dynamically generated and there is no name attribute to use, so it feels like I'm stuck without a tool to fill in boxes that have no name.
Is there a way around this? Can I somehow change the absolute XPath to look like it's a name assignment and then query and fill based on the new 'type' I assigned the absolute XPath?
Is there a method for this in Selenium if not available in Splinter?
I've tried selecting by CSS and I'm getting this error: 'selenium.common.exceptions.InvalidElementStateException: Message: Element is not currently interactable and may not be manipulated'
Edit:
<div class="type-ahead-input-container" role="search">
<div class="type-ahead-input-wrapper">
<div class="type-ahead-input">
<label class="visually-hidden" for="a11y-ember10393">
Search
</label>
<!---->
<input id="a11y-ember10393" class="ember-text-field ember-view" aria-
autocomplete="list" autocomplete="off" spellcheck="false"
placeholder="Search" autocorrect="off" autocapitalize="off" role="combobox"
aria-expanded="true" aria-owns="ember11064" data-artdeco-is-focused="true"/>
<div class="type-ahead-input-icons">
<!---->
</div>
</div>
<!---->
</div>
</div>
As you have asked whether there is a method for this in Selenium, the answer is yes.
Selenium supports Sikuli. Sikuli automates anything you see on the screen. It uses image recognition to identify and control GUI components. It is useful when there is no easy access to a GUI's internal or source code.
You can find more about Sikuli here.
Let me know if this answers your question.
When you get an error message like that, it could be that your search result is not what you expected. You may be getting more than one result, i.e. a list, and you cannot type into a list.
You can find the input boxes with an XPath, select the preferred one from the list (by trying) and put text in it with the WebDriverElement _set_value method. It is not ideal because of the leading underscore, but it is useful.
input_list = browser.find_by_xpath('//div[@class="type-ahead-input"]/input[@class="ember-text-field ember-view"]')
input_list[1]._set_value('just a text to see whether it is working')
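If the 'not currently interactable' error persists, the locator may be matching a hidden or not-yet-ready element. A sketch of the same idea with plain Selenium and an explicit wait, assuming the markup above (with Splinter, the underlying WebDriver is usually reachable as browser.driver; the driver setup and selector here are illustrative):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
# ... navigate to the page under test first (omitted here) ...
wait = WebDriverWait(driver, 10)

# Target the stable container class instead of the dynamically generated id.
search_box = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, "div.type-ahead-input > input.ember-text-field")))
search_box.send_keys("just a text to see whether it is working")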

Scrapy, javascript form, not crawling next page

I am having an issue. I am using Scrapy to extract data from HTML tables that are displayed after a form search. The problem is that it will not continue to crawl to the next page. I have tried multiple combinations of rules. I understand that it is not recommended to override the default parse logic in CrawlSpider. I have found many answers that fix other people's issues, but I have not been able to find a solution in which a form POST must occur first. Looking at my code, I see that it requests the allowed_urls, then POSTs to search.do, and the results are returned as an HTML-formatted results page, and thus the parsing begins. Here is my code, with the real URL replaced by nourl.com:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request
from EMD.items import EmdItem

class EmdSpider(CrawlSpider):
    name = "emd"
    start_urls = ["https://nourl.com/methor"]

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div//div//div//span[@class="pagelinks"]/a[@href]'))),
        Rule(SgmlLinkExtractor(allow=('')), callback='parse_item')
    )

    def parse_item(self, response):
        url = "https://nourl.com/methor-app/search.do"
        payload = {"county": "ANDERSON"}
        return FormRequest(url, formdata=payload, callback=self.parse_data)

    def parse_data(self, response):
        print response
        sel = Selector(response)
        items = sel.xpath('//td').extract()
        print items
I have left allow=('') blank because I have tried so many combinations of it. Also, my xpath leads to this:
<div align="center">
<div id="bg">
<!--
Main Container
-->
<div id="header2"></div>
<!--
Content
-->
<div id="content">
<!--
Hidden/Accessible Headers
-->
<h1 class="hide"></h1>
<!--
InstanceBeginEditable name="Content"
-->
<h2></h2>
<p align="left"></p>
<p id="printnow" align="center"></p>
<p align="left"></p>
<span class="pagebanner"></span>
<span class="pagelinks">
[First/Prev]
<strong></strong>
,
<a title="Go to page 2" href="/methor-app/results.jsp?d-49653-p=2"></a>
,
<a title="Go to page 3" href="/methor-app/results.jsp?d-49653-p=3"></a>
[
/
]
</span>
I have checked with multiple tools and my xpath correctly points to the URLs for the next page. My output in the command prompt only grabs data from the first page. I have seen a couple of tutorials where the code contains a yield statement, but I am not sure what that does other than "tell the function that it will be used again later without losing its data". Any ideas would be helpful. Thank you!
It may be because you need to select the actual URL in your rule, not just the <a> node. [...] in XPath is used to make a condition, not to select something. Try:
//span[#class="pagelinks"]/a/#href
Also a few comments:
How did you find this HTML? Beware of tools to find XPath, as HTML retrieved with browsers and with scrapy may be different, because scrapy doesn't handle Javascript (which can be used to generate the page you're looking at, and also some browsers try to sanitize HTML).
It may not be the case here, but the "javascript form" in a scrapy question spooked me. You should always check that the content of response.body is what you expect.
//div//div//div is effectively the same here as //div. The two slashes mean we don't care anymore about the structure, just select all the nodes named div among the descendants of the current node. That is also why //span[...] might do the trick here.
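On the yield question from the original post: a Scrapy callback can yield both items and new Request objects, and Scrapy schedules every request you yield, which is how pagination gets followed. A minimal sketch of that pattern with a plain Spider (written against the current Scrapy API rather than the deprecated CrawlSpider/SgmlLinkExtractor above; the URLs and form field come from the question, everything else is illustrative):
import scrapy
from scrapy.http import FormRequest

class EmdSpider(scrapy.Spider):
    name = "emd"
    start_urls = ["https://nourl.com/methor"]

    def parse(self, response):
        # Submit the search form first; the results come back as an HTML page.
        yield FormRequest("https://nourl.com/methor-app/search.do",
                          formdata={"county": "ANDERSON"},
                          callback=self.parse_data)

    def parse_data(self, response):
        # Yield the table data from the current results page...
        for cell in response.xpath('//td/text()').getall():
            yield {"cell": cell}
        # ...then yield a request per pagination link so the crawl continues.
        for href in response.xpath('//span[@class="pagelinks"]/a/@href').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_data)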

XPath related questions to locate elements in Webdriver 2 Java

Dear Selenium Webdriver 2 Experts,
I am new to this framework and need your advice on some XPath related questions on the following webpage XHTML snippet:
<dl class="cN-featDetails">
<dt class="propertytype">Property type</dt>
<!-- line 3 --> <dd id="ctl00_ctl00_Content_Content_SrchResLst_rptResult_ctl04_lstTemplate_ddPropertyType" class="propertytype type-house" title="Property type: House">House</dd>
<!-- line 3a --> <!-- or class="propertytype type-townhouse" -->
.......
<div class="main-wrap">
<div class="s-prodDetails">
<a id="ctl00_ctl00_Content_Content_SrchResLst_rptResult_ctl04_lstTemplate_hypMainThumb" class="photo contain" href="/Property/For-Sale/House/LA/St Gabriel/?adid=2009938763">img id="ctl00_ctl00_Content_Content_SrchResLst_rptResult_ctl04_lstTemplate_imgMainThumb" title="44 Crown Street, St Gabriel" src="http://images.abc.com/img/2012814/2778/2009938763_1_PM.JPG?mod=121010-210000" alt="Main photo of 44 Crown Street, St Gabriel - More Details" style="border-width:0px;" /></a>
<div class="description">
<h4><span id="ctl00_ctl00_Content_Content_SrchResLst_rptResult_ctl04_lstTemplate_lblPrice">Offers Over $900,000 </span></h4>
<h5>SOLD BY WAISE YUSOFZAI</h5><p>CHARACTER FAMILY HOME... Filled with warmth and charm, is a very well maintained family home in a quiet level street. Be...</p>
</div>
<a id="ctl00_ctl00_Content_Content_SrchResLst_rptResult_ctl04_lstTemplate_hypMoreDetails" class="button" href="/Property/For-Sale/House/LA/St Gabriel/?adid=2009938763">More Details</a>
<dl id="ctl00_ctl00_Content_Content_SrchResLst_rptResult_ctl04_lstTemplate_dlAgent" class="agent">
<!-- line 19 --> <dt>Advertiser</dt>
<!-- line 20 --> <dd class="contain">
<!-- line 20a --> <!-- or class="" -->
<img id="ctl00_ctl00_Content_Content_SrchResLst_rptResult_ctl04_lstTemplate_imgAgencyLogo" title="Carmen Jones Realty" src="http://images.abc.com/img/Agencys/2778/searchlogo_2778.GIF" style="border-width:0px;" />
</dd>
</dl>
</div>
( a ) How do I test whether an element exists or not? e.g. either line 3 or 3a exists, but not both. The findElement() method will cause an exception, which is what I am trying to avoid. Another option is to use findElements() and then check whether its result list is empty or not. This approach seems a long-winded way of doing it. Is there any other, simpler way to validate the existence of an element without the risk of causing an exception? The following statements did not work, or caused an exception:
WebElement resultsDiv = driver.findElement(By.xpath("/html/body/form/div[3]/div[2]/div[1]/h1/em"));
// If results have been returned, the results are displayed in a drop down.
if (resultsDiv.isDisplayed()) {
    break;
}
( b ) Is there a simple way to validate the existence of either of the elements by incorporating a boolean operator, and regex, as part of findElement() or findElements()? This would significantly reduce the number of finds as well as simplify the search.
( c ) Is it possible to use XPath syntax when searching for an element by tag name? e.g. driver.findElement(By.tagName("/div[@class='report']/result"));
( d ) Is it possible to use regex in an XPath search, such as driver.findElement(By.xpath("//div[@class='main-wrap']/dl[@class='agent']/dd[@class='' OR @class='contain']")) for line 20 - 20a?
( e ) How do I reference the immediately following node? e.g. Assuming the current node is the <dt>Advertiser</dt> on line 19, how do I look up the title of the <img> which is under the following <dd>, where its class name can have a value of "contain" or nothing ""? There could potentially be multiple <dd> tags within the <dl> on line 18.
I have used XPath on XML documents in the past, but would like to expand my ability to locate elements within Webdriver 2.
Any assistance would be greatly appreciated.
Thanks a lot,
Jack
Considering the answer given by Zarkonnen, I'd add to point (a):
public boolean isElementPresent(By selector)
{
    return driver.findElements(selector).size() > 0;
}
I use this method to verify that an element is present on the page.
You can also use the isDisplayed() method, which can work in tandem with isElementPresent:
driver.findElement(By locator).isDisplayed()
Returning to your source:
Suppose we want to verify whether the 'More Details' link is present on the page or not.
String cssMoreDetails = "a[id='ctl00_ctl00_Content_Content_SrchResLst_rptResult_ctl04_lstTemplate_hypMoreDetails']";
// and you simply call the method described above
isElementPresent(By.cssSelector(cssMoreDetails));
// if you found an XPath instead, it would be
isElementPresent(By.xpath("//*..."));
Always try to verify found web elements' locators in FirePath. That way you can get the XPath of the element automatically.
(a) & (d): You can use the | XPath operator to specify two alternate paths in the same expression.
(c): You can't use XPath in By.tagName, but you can use // to look for any descendant of a given node.
(e): You can use XPath axes like parent and following-sibling to select all nodes relative to a given node.
I hope these pointers are helpful. I also recommend the excellent Selenium Locator Rosetta Stone (PDF) as a reference on how to construct XPaths.
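For reference, the expressions described above might look like the following. They are written as Python/Selenium calls for brevity, but the XPath strings themselves work unchanged from the Java bindings; the driver variable and the exact class values are assumptions based on the snippet in the question.
from selenium.webdriver.common.by import By

# (a)/(d): the | operator matches either alternative in a single query.
property_type = driver.find_elements(
    By.XPATH, "//dd[@class='propertytype type-house'] | //dd[@class='propertytype type-townhouse']")

# (d): a lowercase 'or' inside a predicate allows an empty class or 'contain' on the same node.
agent_dd = driver.find_elements(
    By.XPATH, "//dl[@class='agent']/dd[@class='' or @class='contain']")

# (e): following-sibling selects nodes that come after the current one at the same level,
# e.g. the <dd> immediately following the <dt>Advertiser</dt> label.
advertiser_img = driver.find_elements(
    By.XPATH, "//dt[text()='Advertiser']/following-sibling::dd[1]/img")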

CSS locator for corresponding xpath for selenium

Some part of the HTML of the webpage which I'm testing looks like this:
<div id="twoWideCallouts">
<div class="callout">
<a target="_blank" href="http://facebook.com">Facebook</a>
</div>
<div class="callout last">
<a target="_blank" href="http://youtube.com">Youtube</a>
</div>
I have to check using Selenium that when I click on the text, the URL opened is the same as the one given in href and not an error page.
Using XPath I've written the following command:
// i is the iterator
selenium.getAttribute("//div[contains(@class, 'callout')]["+i+"]/a/@href")
However, this is very slow, and for some of the links it doesn't work. By reading many answers and comments on this site I've come to know that CSS locators are faster and cleaner to maintain, so I wrote it again as
css = div:contains(callout)
Firstly, I'm not able to reach the anchor tag.
Secondly, this page can have any number of divs with class callout. Using getXpathCount I can get the count of these, and I'll be iterating over that count and performing the href check. How can something similar be done using a CSS locator?
Any help would be appreciated.
EDIT
I can click on the link using the locator css=div.callout a, but when I try to read the href value using String str = "css=div.callout a[href]"; selenium.getAttribute(str); I get the error - element not found. The console description is given below.
19:12:33.968 INFO - Command request: getAttribute[css=div.callout a[href], ] on session
19:12:33.993 INFO - Got result: ERROR: Element css=div.callout a[href not found on session
I tried to get the href attribute using xpath like this
"xpath=(//div[contains(#class, 'callout')])["+1+"]/a/#href" and it worked fine.
Please tell me what should be the corresponding CSS locator for this.
It should be -
css = div:contains(callout)
Did you notice the ":" instead of the "." that you used?
For CSSCount this might help -
http://www.eviltester.com/index.php/2010/03/13/a-simple-getcsscount-helper-method-for-use-with-selenium-rc/
On a different note, did you see the proposal for a new Selenium site on Area 51 - http://area51.stackexchange.com/proposals/4693/selenium.
To read the attribute I used css=div.callout a#href and it worked. The problem was with the use of square brackets around the attribute name.
For the first part of your question, anchor your identifier on the hyperlink:
css=a[href=http://youtube.com]
For achieving a count of elements in the DOM, based on CSS selectors, here's an excellent article.
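For comparison, with the WebDriver API (rather than Selenium RC) the count and the per-link href check collapse into a single CSS query. A sketch, assuming a driver that has already loaded the page from the question:
from selenium.webdriver.common.by import By

# All callout links in document order; len() gives the count asked about above.
links = driver.find_elements(By.CSS_SELECTOR, "div.callout a")
print(len(links))

for link in links:
    # The href to compare against the page that opens when the link is clicked.
    print(link.get_attribute("href"))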