When does a relative XPath fail? Is a Selenium XPath locator reliable? - selenium

I'm writing automated tests, and I use SelectorHub to find elements on a website. In some cases I get a very long relative XPath, as you see below:
//body/div[@id='app']/div[@class='...']/div[@role='...']/div[@class='...']/div[@class='...']/div[@class='...']/div/div/div[@class='']/textarea
If I understand correctly, it will fail if the website changes in the future, because it depends on too many div elements. Why, then, is it said that a relative XPath is reliable? I could not manually create a shorter, more reliable path.

Any XPath that works today on a particular HTML page H1 may or may not produce the same result when applied (in the future) to a different HTML page H2. If you want it to have the best chance of returning the same result, then you want to minimise its dependencies, and in particular, you want to avoid having dependencies on the properties of H1 that are most likely to change. That, of course, is entirely subjective. It can be said that the longer your path expression is, the more dependencies it has (that is, the greater the number of changes that might cause it to break). But that's not universally true: the expression (//*)[842] is probably the shortest XPath expression to locate a particular element, but it's also highly fragile: it's likely to break if the HTML changes. Expressions using id attributes (such as //p[@id='Introduction']) are often considered reasonably stable, but they break too if the id values change.
The bottom line is that this is entirely subjective. Writing XPath expressions that are resilient to change in the HTML content is an art, not a science. It can only be done by reading the mind of the person who designed the HTML page.
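To make that concrete, here is a minimal Selenium (Java) sketch contrasting the two styles, assuming the usual org.openqa.selenium imports and an initialized WebDriver named driver; every element name and attribute value below is hypothetical:
// Brittle: depends on the exact nesting of every intermediate div.
WebElement brittle = driver.findElement(By.xpath(
    "//body/div[@id='app']/div/div/div/div/div/div/textarea"));
// More resilient: anchor on one stable attribute near the target and
// keep the rest of the path short ('data-testid' is an assumption).
WebElement resilient = driver.findElement(By.xpath(
    "//div[@id='app']//textarea[@data-testid='comment-box']"));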

Related

Web-scraping dynamic pages in Java

I know this question has been asked before, but none of the proposed solutions work in my case. I am trying to web-scrape a page of results, but the problem is that 95% of the div tags contain only class names that change dynamically. My code works for several hours, or even a whole day, using XPath, but at some point the class names change slightly and my program breaks. XPath obviously does not work, since it changes with the content. I want to make sure I fully understand the limitations of web scraping with Selenium when I have limited options for selecting tags. Is there any other solution that could work in a situation like mine, where I only have a page of results full of div tags with dynamically named classes?
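For what it's worth, a minimal sketch of the usual workarounds, assuming the usual org.openqa.selenium imports and an initialized driver; the class-name prefix and heading text below are invented for illustration:
// Match only a stable fragment of an otherwise dynamic class name.
List<WebElement> cards = driver.findElements(By.xpath(
    "//div[starts-with(@class, 'result-')]"));
// Or anchor on visible text, which usually survives class-name churn.
WebElement heading = driver.findElement(By.xpath(
    "//h2[normalize-space()='Results']"));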

Using * vs Element Tag

I'm writing a script to scrape some data off the web.
I've copied the XPaths for a few of the same elements on different pages directly from the browser, which produces //*[@id="priceblock_dealprice"].
However, they're all span elements. I don't know enough about how XPath works under the hood, but I'm assuming //span[@id="priceblock_dealprice"] would be quicker, since it only has to check the span elements. Is this true?
Is there any benefit to using * over, say, span in this specific context?
You are not likely to see a huge performance difference by changing * to span.
The bigger performance impact would be eliminating, or at least constraining, the descendant axis //.
With a descendant axis that starts at the root node, you are forcing the XPath engine to walk over the entire node tree and inspect each and every element, which can be expensive with large documents.
If you provide clues about the structure, the engine can avoid a lot of unnecessary work and should perform better.
For instance:
/html/body/section[2]/div//*[@id="priceblock_dealprice"]
Besides performance, the other considerations are maintenance and flexibility.
You might get better performance with a more specific XPath, but then changes to the page structure and element names might result in things not matching anymore. You will need to decide what is more important.
Yes, it's better to use span instead of *, but since the element has an ID, it's better to use By.id instead of XPath.
Locating by ID will generally be somewhat faster than locating by XPath.
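For comparison, a sketch of the three locators discussed above (assuming the usual org.openqa.selenium imports and an initialized driver; the id comes from the question):
// By.id is the simplest, and typically the fastest, when an id exists.
WebElement byId   = driver.findElement(By.id("priceblock_dealprice"));
WebElement bySpan = driver.findElement(By.xpath("//span[@id='priceblock_dealprice']"));
WebElement byStar = driver.findElement(By.xpath("//*[@id='priceblock_dealprice']"));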

XPath selector difference: pros and cons

I have two XPath selectors that find exactly the same element, but I wonder which one is better in terms of code quality, speed of execution, readability, etc.
First XPath:
//*[@id="some_id"]/table/tbody/tr[td[contains(., "Stuff_01")]]//ancestor-or-self::td/input[@value="Stuff_02"]
Second XPath:
//tr[td[@title="Stuff_01"]]//ancestor-or-self::td/input[@value="Stuff_02"]
The argument, for example, is that if the page's markup changes and some tbody is moved, the first one will stop working. Is that true?
Either way, which variant is better, and why?
I would appreciate a detailed answer, because this is crucial to my workflow.
It is possible that neither XPath is ideal. Seeing the targeted HTML and a description of the selection goal would be needed to decide or to offer another alternative.
Also, as with all performance matters, measure first.
That said, performance is unlikely to matter, especially if you use an @id or other anchor point to home in on a reduced subtree before further restraining the selection space.
For example, if there's only one elem with an id of 1234 in the document, then by using //elem[@id="1234"]/rest-of-xpath, you've eliminated the rest of the document as a performance/readability/robustness concern. As long as the subtree below elem is relatively tame (and it usually will be), you'll be fine regarding those concerns.
Also, yes, table//td is a fine way to abstract over whether tbody is present or not.
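Putting those points together, the first expression from the question might be reduced to something like this sketch (it assumes the target input sits directly inside one of the matching row's cells, which may not hold on the real page):
//*[@id="some_id"]/table//tr[td[contains(., "Stuff_01")]]/td/input[@value="Stuff_02"]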

Can I identify DOM changes with WebDriver?

I'm trying to get all the elements that changed after I click on an element.
I tried the following:
List<WebElement> lstWeb = driver.findElements(By.xpath("//*"));      // snapshot before
driver.findElement(By.id("ImprBtn")).click();                        // trigger the change
List<WebElement> lstWebAfter = driver.findElements(By.xpath("//*")); // snapshot after
lstWebAfter.removeAll(lstWeb);                                       // keep only new elements
The problem is that it takes a long time, because each list contains more than 800 WebElements.
Is there an efficient way to identify changes in the DOM after I click on an element?
In general, I think comparing the whole DOM before and after an operation is not a good approach to designing test cases; needing to do so usually indicates that the cases themselves are not designed quite right.
Design your test cases more carefully: prepare your expected result and the expected DOM changes (i.e. the change in the web app), and compare them to the actual result.
To learn more about recommended automation design, read about the Page Objects pattern.
You can find a nice implementation of it here (don't mind the language, just read the code :)).
If you still need a way to identify DOM changes, check these options:
Mutation observers (see the sketch after this list).
These DOM monitoring suggestions.
More about mutation events (the older W3C mechanism for this kind of monitoring).
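For the MutationObserver route, a rough Selenium (Java) sketch: inject an observer before the click and read back what it recorded afterwards. It assumes the usual org.openqa.selenium imports, an initialized driver, and the ImprBtn id from the question; window.__mutations is a made-up name for this example:
JavascriptExecutor js = (JavascriptExecutor) driver;
// Record a short description of every mutation into a global array.
js.executeScript(
    "window.__mutations = [];" +
    "new MutationObserver(function (muts) {" +
    "  muts.forEach(function (m) { window.__mutations.push(m.type + ':' + m.target.nodeName); });" +
    "}).observe(document.body, { childList: true, subtree: true, attributes: true });");
driver.findElement(By.id("ImprBtn")).click();
// In real code, wait for the page to settle before reading the log.
List<Object> changes = (List<Object>) js.executeScript("return window.__mutations;");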
You should narrow down the XPath query in order to search the DOM more efficiently.
If you're not very comfortable with raw XPath, I have a helper library that lets you create XPath from a LINQ-esque syntax.
Sharing a link in case you find it helpful:
http://www.unit-testing.net/CurrentArticle/How-to-Create-Xpath-From-Lambda-Expressions.html

Is the <wbr> element semantic HTML? What about in a microdata context?

In short, this is bad web development and UX: [screenshot omitted]
But solving it with CSS3 word breaking (code & demo) can lead to an 'awkward whitespace' situation and strange cut-offs; here's an example of both: [screenshots omitted]
Maybe it's not such a big deal, and the UX perspective on it is here, but let's look at the semantics of one of the solutions:
You could ... use the <wbr> element to indicate an optional word
break opportunity. This will tell the browser to insert a line break
as necessary to flow onto a new line inside the container.
The first question: is using <wbr> semantic HTML? (Does it at least degrade gracefully?)
In either case, it seems that being un-semantic in the general sense is a small price to pay for good UX functionality.
However, the second question is about the big picture:
Are there any schema.org (microdata/RDFa) ramifications to consider when using <wbr> to split up an email address? Will it still be valid there?
The wbr element is defined in the HTML5 spec, so it's fine to use it. If it's used right (i.e. according to the definition in the spec), you may also call that "semantic" use.
I don't think there would be any problems in combination with microdata/RDFa. Usually you'd provide the address in an attribute anyway, and an attribute value can't contain wbr elements, of course; wbr would only appear in the visible text, e.g. foo<wbr>@example<wbr>.com.
For element content, I'd guess (though I didn't check) that microdata/RDFa parsers use the text content without markup, i.e. they understand what is markup and what is text; otherwise a FOAF name, for example, would come out as <abbr>Dr.</abbr> Foo instead of Dr. Foo.
So you can bet that microdata/RDFa parsers know HTML ;), and therefore it shouldn't be a problem to use its elements.
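To illustrate that last point, a small microdata sketch (the schema.org type is an example, not taken from the question): the machine-readable value lives in the href attribute, so the wbr elements only affect the visible text.
<span itemscope itemtype="https://schema.org/Person">
  <a itemprop="email" href="mailto:foo@example.com">foo<wbr>@example<wbr>.com</a>
</span>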