I'm writing a script to scrape some data off the web.
I've copied the XPaths for a few of the same elements on different pages directly from the browser, which produces //*[@id="priceblock_dealprice"].
However, they're all span elements. I don't know enough about how XPath works under the hood, but I'm assuming //span[@id="priceblock_dealprice"] would obviously be quicker since it only has to check the span elements? Is this true?
Is there any benefit to using * over, say, span in this specific context?
You are not likely to see a huge performance difference by changing * to span.
The bigger performance impact would be eliminating, or at least constraining, the descendant axis //.
With a descendant axis that starts at the root node, you are forcing the XPath engine to walk over the entire node tree and inspect each and every element, which can be expensive with large documents.
If you provide any clues about the structure, the engine can avoid a lot of unnecessary work and should perform better.
For instance:
/html/body/section[2]/div//*[@id="priceblock_dealprice"]
Besides performance, the other considerations are maintenance and flexibility.
You might get better performance with a more specific XPath, but then changes to the page structure and element names might result in things not matching anymore. You will need to decide what is more important.
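For concreteness, here is a minimal sketch of the difference using the JDK's built-in XPath engine, assuming the page has been saved locally as well-formed XML/XHTML; the file name page.xml is a placeholder, and the anchored path is just the example above, so adjust both to the real document.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class AnchoredXPathSketch {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("page.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Unanchored: every element in the document is a candidate.
        Node unanchored = (Node) xpath.evaluate(
                "//*[@id='priceblock_dealprice']", doc, XPathConstants.NODE);

        // Anchored: only the subtree under the named section is searched.
        Node anchored = (Node) xpath.evaluate(
                "/html/body/section[2]/div//*[@id='priceblock_dealprice']",
                doc, XPathConstants.NODE);

        // Both should locate the same element; the second just does less work.
        System.out.println(unanchored != null && unanchored.isSameNode(anchored));
    }
}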
Yes, it's better to use 'span' instead of *, but since the element has an ID, it's better to use By.ID rather than XPath.
Locating by ID will be somewhat faster than locating by XPath.
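To illustrate that suggestion (assuming the Selenium Java bindings, since By.ID points that way), here is a small sketch; the driver setup is omitted, and both locators target the element from the question:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class PriceLocatorSketch {
    public static String dealPrice(WebDriver driver) {
        // Preferred: locate directly by id, typically the cheapest lookup for the driver.
        WebElement byId = driver.findElement(By.id("priceblock_dealprice"));

        // Equivalent result, but routed through the XPath engine.
        WebElement byXPath = driver.findElement(By.xpath("//span[@id='priceblock_dealprice']"));

        return byId.getText();
    }
}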
I'm doing automated tests, and I use SelectorHub to find elements on a website. In some cases I get a very long relative XPath, as you can see below:
//body/div[@id='app']/div[@class='...']/div[@role='...']/div[@class='...']/div[@class='...']/div[@class='...']/div/div/div[@class='']/textarea
As I understand it, this will fail if the website changes in the future because it depends on so many "div" elements. Why, then, is it said that relative XPath is reliable? I could not manually create a shorter path that would be reliable.
Any XPath that works today on a particular HTML page H1 may or may not produce the same result when applied (in the future) to a different HTML page H2. If you want it to have the best chance of returning the same result, then you want to minimise its dependencies, and in particular, you want to avoid having dependencies on the properties of H1 that are most likely to change. That, of course, is entirely subjective.
It can be said that the longer your path expression is, the more dependencies it has (that is, the greater the number of changes that might cause it to break). But that's not universally true: the expression (//*)[842] is probably the shortest XPath expression to locate a particular element, but it's also highly fragile: it's likely to break if the HTML changes. Expressions using id attributes (such as //p[@id='Introduction']) are often considered reasonably stable, but they break too if the id values change.
The bottom line is that this is entirely subjective. Writing XPath expressions that are resilient to change in the HTML content is an art, not a science. It can only be done by reading the mind of the person who designed the HTML page.
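To make the trade-off concrete, here is an illustrative sketch (Selenium Java) of three locators one might write for the same hypothetical <textarea>, ordered roughly by how many properties of today's page they depend on. The attribute value data-testid='note-box' is an invented placeholder, not something taken from the question's page.

import org.openqa.selenium.By;

public class LocatorFragilitySketch {
    // Shortest, but depends on the absolute position of every node before it.
    static final By POSITIONAL = By.xpath("(//*)[842]");

    // Depends on the whole chain of divs that the recording tool happened to see.
    static final By STRUCTURAL = By.xpath("//body/div[@id='app']//div/div/div/textarea");

    // Depends on a single attribute that the page authors control deliberately.
    static final By ANCHORED = By.xpath("//textarea[@data-testid='note-box']");
}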
I have two XPath selectors that find exactly the same element, but I wonder which one is better in terms of code quality, execution speed, readability, etc.
First XPath:
//*[@id="some_id"]/table/tbody/tr[td[contains(., "Stuff_01")]]//ancestor-or-self::td/input[@value="Stuff_02"]
Second XPath:
//tr[td[@title="Stuff_01"]]//ancestor-or-self::td/input[@value="Stuff_02"]
The argument, for example, is that if the page's code changes and, say, some "tbody" is moved, the first one will stop working. Is that true?
Either way, which variant is better, and why?
I would appreciate an elaborate answer, because this is crucial to the workflow.
It is possible that neither XPath is ideal. Seeing the targeted HTML and a description of the selection goal would be needed to decide or to offer another alternative.
Also, as with all performance matters, measure first.
That said, performance is unlikely to matter, especially if you use an @id or other anchor point to home in on a reduced subtree before further constraining the selection space.
For example, if there's only one elem with an id of 1234 in the document, then by using //elem[@id="1234"]/rest-of-xpath you've eliminated the rest of the document as a performance/readability/robustness concern. As long as the subtree below elem is relatively tame (and it usually will be), you'll be fine regarding those concerns.
Also, yes, table//td is a fine way to abstract over whether tbody is present or not.
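As a rough sketch of that approach (assuming a Selenium Java context): anchor on the unique id first, then restrict the search to its subtree. The id, title, and value strings come from the question's expressions, but the simplified inner path drops the ancestor-or-self step, so verify it against the real HTML before relying on it.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class AnchoredSelectionSketch {
    public static WebElement stuffInput(WebDriver driver) {
        // Jump straight to the unique anchor; the rest of the document is now irrelevant.
        WebElement anchor = driver.findElement(By.id("some_id"));

        // Search only within the anchor's subtree; "table//tr" also abstracts over
        // whether a tbody element is present.
        return anchor.findElement(By.xpath(
                ".//table//tr[td[@title='Stuff_01']]/td/input[@value='Stuff_02']"));
    }
}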
I'm trying to get all elements that were changed after I click on an element.
I tried the following:
List<WebElement> lstWeb = driver.findElements(By.xpath("//*"));
driver.findElement(By.id("ImprBtn")).click();
List<WebElement> lstWebAfter = driver.findElements(By.xpath("//*"));
lstWebAfter.removeAll(lstWeb);
The problem is that it's taking a long time, because in each list I have more than 800 WebElements.
Is there an efficient way to identify changes in the DOM after I click on an element?
In general, I don't think comparing the whole DOM before and after some operation is a good approach to designing your test cases; needing to do so usually points to a design problem in the cases themselves.
Design your test cases more carefully: prepare your expected result and the expected DOM changes (i.e., the change in the web app) and compare them to the actual result.
To learn more about recommended automation design, read about the Page Object pattern.
You can find a nice implementation of it here (don't mind the language, just read the code :)).
If you still need a solution for identifying DOM changes, check out these options:
Mutation observers (a sketch of driving one from Selenium follows this list).
These DOM monitoring suggestions.
More about mutation events (recommended by the W3C for what you're looking for).
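If you do go the MutationObserver route from Selenium, a minimal sketch might look like the following. The JavaScript is injected via JavascriptExecutor, the button id ImprBtn comes from the question, the global names (__mutations, __record, __observer) are arbitrary placeholders, and it assumes the click does not navigate to a new page.

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;

public class DomChangeSketch {
    public static List<Object> changedAfterClick(WebDriver driver) {
        JavascriptExecutor js = (JavascriptExecutor) driver;

        // Record a short description of every mutated node into a global array.
        js.executeScript(
            "window.__mutations = [];" +
            "window.__record = function(records) {" +
            "  records.forEach(function(r) {" +
            "    window.__mutations.push(r.type + ':' + (r.target.id || r.target.nodeName));" +
            "  });" +
            "};" +
            "window.__observer = new MutationObserver(window.__record);" +
            "window.__observer.observe(document.body," +
            "  { childList: true, subtree: true, attributes: true, characterData: true });");

        driver.findElement(By.id("ImprBtn")).click();
        // If the page updates asynchronously, add an explicit wait here before reading.

        // Flush undelivered records, stop observing, and hand the list back to the test.
        @SuppressWarnings("unchecked")
        List<Object> mutations = (List<Object>) js.executeScript(
            "window.__record(window.__observer.takeRecords());" +
            "window.__observer.disconnect();" +
            "return window.__mutations;");
        return mutations;
    }
}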
You should narrow down the XPath query in order to search the DOM more efficiently.
If you're not very comfortable with raw XPath, I have a helper library that lets you create XPath based on a LINQ-esque syntax.
Sharing a link in case you find it helpful:
http://www.unit-testing.net/CurrentArticle/How-to-Create-Xpath-From-Lambda-Expressions.html
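As a rough illustration of narrowing the query: instead of diffing every element in the document, take both snapshots inside the container the click is expected to change. The id resultPanel below is a placeholder, not something from the question.

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class NarrowedDiffSketch {
    public static List<WebElement> changedElements(WebDriver driver) {
        WebElement panel = driver.findElement(By.id("resultPanel"));
        List<WebElement> before = panel.findElements(By.xpath(".//*"));

        driver.findElement(By.id("ImprBtn")).click();

        List<WebElement> after = panel.findElements(By.xpath(".//*"));
        after.removeAll(before);   // far smaller lists than //* over the whole page
        return after;
    }
}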
How do I traverse a tree of MP_Node (django-treebeard) categories and display it efficiently with the least number of queries? I tried looking at the docs, but I see the number of queries increasing as categories are added.
Is there a method to limit the number of queries needed to display a menu like amazon.com's and get all the categories in an optimized manner?
I see that the dump_bulk() API in treebeard gets all the categories in a single query. Is it advisable to use it? If not, why? Where is it practically used?
Sample code using a twitter-bootstrap nav menu would be appreciated.
I'm looking to reduce the number of queries. An answer with an explanation and the least number of queries will be accepted.
I chose django-mptt in lieu of Treebeard. It's much simpler, and it uses only one tree-management methodology (MPTT, hence the name). See this function (disclaimer: I submitted a patch that rewrote this function to be more optimized) - it caches an entire tree below a given node, allowing you to go up and down the tree as much as you want (e.g., node.get_children(), node.parent, etc.) without running any more queries. In other words, it's ideal for doing exactly what you're wanting to do.
I know it's possible to get the top terms within a Lucene index, but is there a way to get the top terms based on a subset of a Lucene index?
I.e., what are the top terms in the index for documents within a certain date range?
Ideally there'd be a utility somewhere to do this, but I'm not aware of one. However, it's not too hard to do this "by hand" in a reasonably efficient way. I'll assume that you already have a Query and/or Filter object that you can use to define the subset of interest.
First, build a list in memory of all of the document IDs in your index subset. You can use IndexSearcher.search(Query, Filter, HitCollector) to do this very quickly; the HitCollector documentation includes an example that seems like it ought to work, or you can use some other container to store your doc IDs.
Next, initialize an empty HashMap (or whatever) to map terms to total frequency, and populate the map by invoking one of the IndexReader.getTermFreqVector methods for every document and field of interest. The three-argument form seems simpler, but either should be just fine. For the three-argument form, you'd make a TermVectorMapper whose map method checks if term is in the map, associates it with frequency if not, or adds frequency to the existing value if so. Be sure to use the same TermVectorMapper object across all of the calls to getTermFreqVector in this pass, rather than instantiating a new one for each document in the loop. You can also speed things up quite a bit by overriding isIgnoringPositions() and isIgnoringOffsets(); your object should return true for both of those. It looks like your TermVectorMapper might also be forced to define a setExpectations method, but that one doesn't need to do anything.
Once you've built your map, just sort the map items by descending frequency and read off however many top terms you like. If you know in advance how many terms you want, you might prefer to do some kind of fancy heap-based algorithm to find the top k items in linear time instead of using an O(n log n) sort. I imagine the plain old sort will be plenty fast in practice. But it's up to you.
If you prefer, you can combine the first two stages by having your HitCollector invoke getTermFreqVector directly. This should certainly produce equally correct results, and intuitively seems like it would be simpler and better, but the docs seem to warn that doing so is likely to be quite a bit slower than the two-pass approach (on the same page as the HitCollector example above). Or I could be misinterpreting their warning. If you're feeling ambitious you could try it both ways, compare, and let us know.
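Here is a compressed sketch of the two-pass approach, written against the older Lucene API that this answer references (HitCollector and getTermFreqVector). It uses the two-argument getTermFreqVector form for brevity; the field name "contents", and the assumption that term vectors were stored for that field at index time, are placeholders to adjust for your own schema.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TopTermsInSubset {
    public static Map<String, Integer> termTotals(IndexSearcher searcher, IndexReader reader,
                                                  Query query, Filter filter) throws Exception {
        // Pass 1: collect the doc ids of the subset.
        final List<Integer> docIds = new ArrayList<Integer>();
        searcher.search(query, filter, new HitCollector() {
            public void collect(int doc, float score) {
                docIds.add(doc);
            }
        });

        // Pass 2: accumulate term frequencies from each document's stored term vector.
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (int doc : docIds) {
            TermFreqVector vector = reader.getTermFreqVector(doc, "contents");
            if (vector == null) continue;           // no term vector stored for this doc/field
            String[] terms = vector.getTerms();
            int[] counts = vector.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                Integer old = totals.get(terms[i]);
                totals.put(terms[i], old == null ? counts[i] : old + counts[i]);
            }
        }
        // Sort the entries by descending value and keep however many top terms you need.
        return totals;
    }
}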
Counting up the TermVectors will work, but it will be slow if there are a lot of documents to iterate over. Also note that if by top terms you mean docFreq, then don't use the counts in the TermFreqVector; just count each term as binary (present or not) per document.
Alternatively, you could iterate the terms like facet counts. Use a cached filter for every term; their BitSets can be used for a fast intersection count.
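A hedged sketch of that facet-style idea against the same older Lucene API: it walks each term's postings directly with TermDocs rather than building a cached filter per term, but the intersection count against the subset's BitSet is the same idea. The field name "contents" is again a placeholder, and Filter.bits is the pre-DocIdSet API.

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.Filter;

public class FacetStyleTermCounts {
    public static Map<String, Integer> docFreqInSubset(IndexReader reader, Filter subsetFilter)
            throws Exception {
        BitSet subset = subsetFilter.bits(reader);   // docs in the date range, as a BitSet
        Map<String, Integer> counts = new HashMap<String, Integer>();

        TermEnum terms = reader.terms(new Term("contents", ""));
        try {
            do {
                Term term = terms.term();
                if (term == null || !"contents".equals(term.field())) break;

                // Binary (docFreq-style) count: each matching document adds one.
                int count = 0;
                TermDocs docs = reader.termDocs(term);
                while (docs.next()) {
                    if (subset.get(docs.doc())) count++;
                }
                docs.close();

                if (count > 0) counts.put(term.text(), count);
            } while (terms.next());
        } finally {
            terms.close();
        }
        return counts;
    }
}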