Web-scraping dynamic pages in Java - selenium

I know this question has been asked before, but none of the proposed solutions work in my case. I am trying to scrape a page of results, but the problem is that 95% of the div tags contain only class names that change dynamically. My code, which uses XPath, works for several hours or even a whole day, but at some point the class names change slightly and my program breaks. XPath obviously does not work, since the path changes with the content. I want to make sure I fully understand the limitations of web scraping with Selenium when I have limited options for selecting tags. Is there any other solution that could work in a situation like mine, where I only have a page of results full of div tags and dynamically named classes?

Related

When does relative XPath fail? Is an XPath locator in Selenium reliable?

I'm doing automated tests, and I use SelectorHub to find elements in a website. In some cases I get a very long relative XPath, as you can see below:
//body/div[@id='app']/div[@class='...']/div[@role='...']/div[@class='...']/div[@class='...']/div[@class='...']/div/div/div[@class='']/textarea
As I understand it, this will fail if the website changes in the future, because it has too many div steps. Why then is it said that relative XPath is reliable? I could not manually create a shorter path that is reliable.
Any XPath that works today on a particular HTML page H1 may or may not produce the same result when applied (in the future) to a different HTML page H2. If you want it to have the best chance of returning the same result, then you want to minimise its dependencies, and in particular, you want to avoid having dependencies on the properties of H1 that are most likely to change. That, of course, is entirely subjective. It can be said that the longer your path expression is, the more dependencies it has (that is, the greater the number of changes that might cause it to break). But that's not universally true: the expression (//*)[842] is probably the shortest XPath expression to locate a particular element, but it's also highly fragile: it's likely to break if the HTML changes. Expressions using id attributes (such as //p[@id='Introduction']) are often considered reasonably stable, but they break too if the id values change.
The bottom line is that this is entirely subjective. Writing XPath expressions that are resilient to change in the HTML content is an art, not a science. It can only be done by reading the mind of the person who designed the HTML page.
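To make the fragility argument concrete, here is a minimal Java sketch that evaluates a positional expression and an id-based expression against two versions of a page. The markup strings, the class name XPathStabilityDemo, and the use of the JDK's built-in javax.xml.xpath engine (instead of a browser driven by Selenium) are assumptions made purely for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XPathStabilityDemo {
    // Hypothetical page H1, written as well-formed XML so the JDK parser accepts it.
    static final String H1 =
        "<html><body><div><p id='Introduction'>Hello</p></div></body></html>";

    // Hypothetical page H2: the same content after a banner was inserted above it.
    static final String H2 =
        "<html><body><div class='banner'><span>Ad</span></div>"
        + "<div><p id='Introduction'>Hello</p></div></body></html>";

    /** Returns the string value of the first node matched by an XPath 1.0 expression. */
    static String select(String markup, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(markup)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String positional = "(//*)[4]";          // fourth element in document order
        String byId = "//p[@id='Introduction']"; // depends only on the id value

        System.out.println("H1 positional: " + select(H1, positional));
        System.out.println("H2 positional: " + select(H2, positional));
        System.out.println("H1 by id:      " + select(H1, byId));
        System.out.println("H2 by id:      " + select(H2, byId));
    }
}
```

In H2 the positional expression lands on the banner's span instead of the paragraph, while the id-based expression still finds the paragraph. It would break in turn if the id value itself were renamed, which is the point made above.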

Losing Aria/accessibility when converting from HTML to PDF

I am using ABCpdf to generate a collection of PDFs from HTML markup, and am struggling with making it fully accessible.
The HTML pages include several graphs which are created by CSS, and which are completely ignored by the screenreader.
I have tried using aria-label to give a written explanation of the graphs, but it is lost in the conversion. I have tried configuring the Gecko engine within ABCpdf in numerous ways, including scaling back security options, altering markup options, and adding special tags to explicitly include an element. The PDF is tagged and is rated as fully accessible by our evaluation program.
I haven't been able to find a way to include "hidden" text in the PDF for the purpose of screenreaders. Any help is appreciated!
EDIT: Due to security concerns, I am unable to display the actual data behind the graphs. Manual steps are also not an option due to the sheer number of generated PDFs, and a short timeline.
HTML-to-PDF conversion utilities are usually pretty basic and typically don't handle complex CSS very well at all. You may be better off taking a screen capture and then using alt-text to describe the intent of the graph. Sometimes the simplest approach is the most reliable.
Another way of approaching the issue would be to present the complete data set to users via a data table. That way, they can "see" everything contained in the graph, and it won't matter if the graph itself is inaccessible. If placing a giant data table in the middle of your document doesn't fit with your formatting, you can also include the data set in an appendix with a note or hyperlink in the text directing readers where they can go to access the entirety of information.

How does Google power the box at the top of my search results?

What powers the little box that sometimes shows up at the top of search results with things like definitions for words, weather, movie times, and sometimes even the precise steps in a cooking recipe?
I ask because I recently searched for a recipe, and Google showed me the steps for making it right at the top of my results.
Curious to know how they did this, I checked the source of the content and, to my surprise, there was no structured data / rich snippets. There were no special meta tags either, and the page didn't even use HTML5 elements.
There was nothing in the markup that would signify the relationship between a step in making the recipe and the details within the steps - we're talking plain old divs, p's, and h tags. There were also no class or div names that Google could have used to piece it together (eg. , etc)
So, how do they do this?
Google does this using the Knowledge Graph. You can help get your data in there by using structured data markup (see http://schema.org).

PhantomJS - Finding words at a given coordinate on rendered pages

Using PhantomJS, I am trying to find out which single word is rendered at a given x/y coordinate on a web page. I have seen several examples of doing this at the object level, but I am looking for a way to break it down to word-level precision.
Any suggestion would be greatly appreciated.
This basically involves finding the bounds of all the words on the page, and then figuring out which one contains said point. This is nearly identical to finding the word under the cursor, which has been thoroughly answered in this question: How to get a word under cursor using JavaScript?
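The second half of that approach, deciding which word's bounds contain the point, can be sketched in Java as below. The WordBox type, the sample coordinates, and the half-open box convention are hypothetical. In a real page the bounds would have to be collected inside the browser first (for example with the DOM's Range.getBoundingClientRect()), which this sketch does not attempt.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class WordHitTest {
    /** A word and its bounding box in page pixels; right and bottom edges are exclusive. */
    static final class WordBox {
        final String word;
        final double left, top, right, bottom;

        WordBox(String word, double left, double top, double right, double bottom) {
            this.word = word;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        boolean contains(double x, double y) {
            return x >= left && x < right && y >= top && y < bottom;
        }
    }

    /** Hypothetical bounds for the rendered line "Hello big world". */
    static List<WordBox> sampleBoxes() {
        return Arrays.asList(
            new WordBox("Hello", 10, 20, 50, 36),
            new WordBox("big", 55, 20, 78, 36),
            new WordBox("world", 83, 20, 125, 36));
    }

    /** Returns the first word whose box contains (x, y), or empty if the point hits no word. */
    static Optional<String> wordAt(List<WordBox> boxes, double x, double y) {
        return boxes.stream()
            .filter(box -> box.contains(x, y))
            .map(box -> box.word)
            .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(wordAt(sampleBoxes(), 60, 25)); // point inside the box for "big"
        System.out.println(wordAt(sampleBoxes(), 52, 25)); // point in the gap between words
    }
}
```

A linear scan is enough for a single lookup. For many lookups on a large page, a spatial index over the boxes would be the natural next step.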

Knockoutjs and Selenium testing

Looking at Knockout examples, there is no real need for adding IDs to HTML elements. Creating a large form without the IDs seems to make it easy to maintain.
Though, this creates a problem with Selenium HQ. There is no way to uniquely identify elements on the form.
What are the choices? Is there another method for Selenium to select elements created by Knockout?
Or will I have to assign IDs to elements?
I have reviewed other knockout and selenium questions. All of them had IDs defined for the HTML elements, when they started.
Thanks
Abhi
Short answer: Add IDs to your HTML elements.
Although you do not need these attributes in order for your website to function, you will make the life of your testers so much easier.
I've encountered the exact same problem in a project where a large ASP.NET MVC 4 application was created, that uses Knockout.js and Selenium extensively. For form elements, I relied on ASP.NET MVC utility methods to generate the output HTML in combination with data-bind expressions. ASP.NET MVC automatically generates unique NAME and ID attributes based on the backing model.
However, in all other cases where I had to render tables, display forms or dialogs, I ended up adding ID attributes to these HTML elements. If you think about it, this is a logical consequence of your requirements. Knockout is awesome because you no longer need IDs and NAMEs to wire your layout (HTML) and behavior (JS) together. However, other frameworks, such as Selenium, require these IDs to be present.
Yes, you could work your way around it with complicated and bloated XPath expressions. But this will dramatically decrease the maintainability of your tests. In my experience, adding IDs to hundreds of HTML elements took less than a day and increased the productivity of our testers manifold.
Remember, it may be nice to develop functional websites with as little HTML as possible. But if this makes your website untestable, you will lose more than you gain. Testability is a non-functional requirement, but this does not mean it is not important!
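As a rough illustration of that maintainability trade-off, the Java sketch below locates the same Knockout-bound field once by ID and once by a structural path, before and after the form fields are reordered. The form markup, the field names, and the class name KnockoutLocatorDemo are hypothetical, and the JDK's javax.xml.xpath engine stands in for a Selenium-driven browser.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class KnockoutLocatorDemo {
    // Hypothetical Knockout form: each input carries both an id and a data-bind expression.
    static final String ORIGINAL =
        "<form>"
        + "<div><input id='firstName' data-bind='value: firstName'/></div>"
        + "<div><input id='lastName' data-bind='value: lastName'/></div>"
        + "</form>";

    // The same form after a designer swapped the order of the two fields.
    static final String REORDERED =
        "<form>"
        + "<div><input id='lastName' data-bind='value: lastName'/></div>"
        + "<div><input id='firstName' data-bind='value: firstName'/></div>"
        + "</form>";

    static final String BY_ID = "//input[@id='lastName']/@data-bind";
    static final String BY_STRUCTURE = "/form/div[2]/input/@data-bind";

    /** Returns the string value of the first node matched by an XPath 1.0 expression. */
    static String select(String markup, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(markup)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("by id, original:         " + select(ORIGINAL, BY_ID));
        System.out.println("by id, reordered:        " + select(REORDERED, BY_ID));
        System.out.println("by structure, original:  " + select(ORIGINAL, BY_STRUCTURE));
        System.out.println("by structure, reordered: " + select(REORDERED, BY_STRUCTURE));
    }
}
```

After the reorder, the structural path silently resolves to the first-name field, whereas the ID-based expression keeps pointing at the last-name field. A test built on the structural path would then type into the wrong input without any locator error.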
You should add IDs to your HTML elements. As your application becomes more complicated, you will probably need to bind multiple view models to different sections of the same page, and for that you'll need IDs. For example, in ASP.NET MVC you might build a partial view that displays all the products you ordered and share that partial view throughout the ordering process; you'll then want to scope your binding to that specific partial view section.