Command-line web browser that outputs the DOM - testing

I'm looking for a way to process a web page and its associated JavaScript from the command line, so that the resulting DOM can be output.
The purpose of this is to identify forms within the page without doing any nasty parsing of the HTML (and JavaScript) with regular expressions.
Are there any command-line tools that will do this? Hypothetically speaking, a command-line web browser that downloads the content and outputs the DOM as text rather than rendering a pretty page.

I don't know of any, but I wanted to highlight one difficulty with what you've suggested:
process a web page and associated Javascript
When should the output be taken? Many web pages have time-sensitive JavaScript, or onclick/onhover scripts that would affect the DOM. Would you want these to be executed? All of them, or only some? It's not trivial to decide when the page is "finished" and ready for the DOM to be output after JavaScript manipulation. (Before JavaScript manipulation it's an easier problem: just wait for the DOM-ready event, DOMContentLoaded...)
Edit: I'm not saying that you can skip JavaScript execution entirely: you might want to handle any document.write sections during loading, as they might write out a form... I'm saying it's hard to know when you've done "enough" JavaScript...
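As a contrived sketch of the problem (all names made up), consider a page whose inline script creates one form while the document is still being parsed and another one long after load; any single snapshot time is a judgment call:

// Written during parsing: present in a DOMContentLoaded snapshot.
document.write('<form id="early"><input name="q"></form>')
window.addEventListener('load', function () {
  setTimeout(function () {
    // Appears five seconds after load: when is "enough" JavaScript?
    var late = document.createElement('form')
    late.id = 'late'
    document.body.appendChild(late)
  }, 5000)
})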

For Java, I've had fairly good experiences with HtmlUnit.
I've also used the BeautifulSoup Python library to parse forms and form data. There's no need to write regexes, as it lets you traverse the DOM tree without much effort.

React testing library vs cypress query philosophies

I'm currently in the process of setting up Cypress for my project. At the moment we're only using Testing Library for frontend tests, and reading the Cypress documentation has gotten me a bit confused, as the two libraries seem to have opposite philosophies regarding how you're supposed to query for elements.
Testing Library basically says to test what the user can see/touch, and to use data-testid only if all else fails. Cypress, on the other hand, states that best practice is to query elements by data-testid / data-cy attributes.
I feel conflicted between the two approaches. I get the point that we should test what the user actually sees (Testing Library). But I also get that those things often change (Cypress), and we'd need to spend time updating tests whenever we make small changes (e.g. "Ok" -> "Done"). And when testing with data-cy attributes, are we not also ignoring accessibility / screen readers?
What are your thoughts on this?
React Testing Library (RTL) is specifically made for writing tests from a user's perspective. From their Guiding Principles:
The more your tests resemble the way your software is used, the more confidence they can give you.
Meaning, RTL wants you to use accessibility queries like getByRole, and to fall back to getByTestId only in cases where you can't match by role or text, or where it doesn't make sense (e.g. the text is dynamic).
However, because RTL's render method lets us specify props (unlike Cypress), we have much more flexibility and can often avoid dynamic text entirely.
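A minimal sketch of that guidance, assuming a hypothetical LoginForm component and the usual @testing-library/jest-dom matchers:

import { render, screen } from '@testing-library/react';
import LoginForm from './LoginForm'; // hypothetical component

test('renders a usable submit button', () => {
  // Passing the label as a prop pins down text that might otherwise be dynamic.
  render(<LoginForm submitLabel="Log in" />);
  // Preferred: query by role and accessible name, the way a user
  // (or screen reader) would find the element.
  expect(screen.getByRole('button', { name: /log in/i })).toBeEnabled();
});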
Cypress, on the other hand, runs the app with all its dependencies. With dynamic content from a CMS, or in a multi-language application, things are not that easy using getByRole("heading", { name: /welcome/i }). In those cases the recommendation of test ids makes sense.
My personal recommendation is to use accessibility query selectors in both Cypress and RTL unless the text is dynamic. In that case, test ids in Cypress, and a combination of test ids and accessibility query selectors in RTL, provide the best solution.
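A sketch of that split in Cypress (cy.findByRole comes from the @testing-library/cypress plugin; the selectors here are made up):

// Static text we control: accessibility query.
cy.findByRole('button', { name: /save/i }).click()
// CMS-driven heading whose text we don't control: test id.
cy.get('[data-cy="hero-heading"]').should('be.visible')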
It should also be noted that Playwright and Cypress test-generator tools select by accessibility query selectors.
I thought about this question for a few days before answering, and even ran some tests; in the end I concluded that the Cypress approach is the best.
The main reason is that what Testing Library asks for, testing what the user actually sees, already applies in Cypress even when you use a data-testid. Suppose you have a button that exists in the DOM but is not visible: if you select it by data-testid and try to click it, Cypress will return an error saying the button is not visible, and if you really want to perform the action you must pass force: true. The same happens if the button is not clickable, or if another element is in front of it.
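For example (with a made-up data-testid), the same click either fails or must be forced:

cy.get('[data-testid="hidden-button"]').click() // fails: Cypress reports the element is not visible
cy.get('[data-testid="hidden-button"]').click({ force: true }) // explicitly bypasses the checks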
Cypress already checks by default, on click and type actions, that the element:
is scrolled into view
is visible
is not disabled
is not detached
is not readonly
is not animating
is not covered (if it is covered by one of its own descendents, Cypress fires the event at that descendent)
Also, fetching an element by text, placeholder, or class does not guarantee that it is actually visible to the user, since an element can be in the DOM and still be hidden for various reasons.
So the best way to make tests easier to maintain and easier to read, and to avoid flaky tests, is to use data-testid and, whenever possible or necessary, to combine locating the element with an assertion that it is visible. Example:
cy.get('[data-testid="button"]').should('be.visible')
I hope this contributes to the discussion, and I would love to hear other people's points of view.

Access closure property names in the content block at runtime

I want to evaluate my content blocks before running my test suite, but the closures' property names are already in bytecode. I'm looking for the cleanest solution (compared with parsing the source manually).
I already tried the solution outlined in this post (and I'd still wind up doing some regex/parsing), but I could only get it to work via the script execution engine; it failed in the IDE and in GroovyConsole. Rather than embedding a Groovy script in the project's code, I thought I'd try using Geb's native classes.
Is building on the suggestion here about extending Geb Navigators viable for Geb's PageContentSupport class, whose contentTemplates contain a LinkedHashMap of exactly what I need? If yes, would someone provide guidance? If no, any suggestions?
It is currently not possible to get hold of all content elements for a given page/module. Feel free to create an issue for this in Geb's bug tracker, but remember that all that Geb can provide is either a list of content element names or a map from these names to closures that create these elements.
Having that information isn't a generic solution to your problem, because content elements can take parameters, and there are situations where a content element is available only after some other action is performed (for example, you have to click a button to reveal a section of the page that uses Ajax to retrieve its content). So I'm afraid that simply going over all elements and checking that they don't throw any errors will not cut it.
I'm still struggling to see what "evaluating" all content elements prior to running the suite would buy you. Are you after verifying that your content elements still work, to get faster feedback than running the whole suite? I'm pretty sure you won't be able to fully automate detection of content definitions that no longer work. In my view it will be more effort than it's worth.

equivalent of nevow.tags.raw for twisted.web.template

I'm trying to port pydoctor to twisted.web.template and have hit a pretty basic problem: pydoctor uses epydoc to render docstrings into HTML but I can't see a way to include this HTML in the generated page without escaping. What can I do?
There is, somewhat intentionally, no way to insert HTML into the page without parsing; twisted.web.template is a bit more of a stickler about producing correct output than nevow was.
There are a couple of ways around this.
Ultimately, your HTML is going to some kind of output stream. You could simply insert a renderer that returns a pair of Deferred objects, and does a .write to the underlying stream after the first one fires but before the second. Kind of gross, but it effectively expresses your intent :).
You can simply re-parse the HTML output of epydoc using XMLString or similar, so that twisted.web.template can write it out correctly. This will "waste" a little CPU, but in my opinion it's worth it for (a) the stress test it gives t.w.t and (b) the guarantee (presuming t.w.t is correct) that you're emitting valid HTML.
As I was writing this answer, however, I realized that point 2 isn't generally possible for arbitrary HTML with the current public API of twisted.web.template. Ideally, you could use html5lib to parse this stuff, and then just dump the parsed input into your document tree.
If you don't mind mucking around with private API, you could probably hook up html5lib's SAX support to the internal SAX parser that we use to load templates.
Of course, the real solution is to fix the ticket you already filed, so you don't have to use private API outside of Twisted itself...

Injection Scripts Executing Multiple Times

I've read the documentation and understand that this is to be expected:
Scripts are injected into the top-level page and any children with HTML sources, such as iframes. Do not assume that there is only one instance of your script per browser tab.
I'm wondering, though:
Other than iframes, what other elements have "HTML sources" (images? objects?)? The term "HTML sources" is uncomfortably vague to my ears.
Is there any way to detect which element is executing the script?
I've filtered out iframes by determining that window === window.top, as recommended, but other elements are still executing the script and it's executing a lot more than I'd like.
Thanks.
It's really my own fault that I'm answering my own question, because I didn't provide enough information in the first place. In my defense, I didn't know at the time that I was providing too little information.
Anyway, while traveling down the road to a solution for this, I asked a question on Apple's dev forum and included the following bit of key info:
Everything in the script occurs on the beforeload event of the document (or was supposed to).
What I learned was that the beforeload event only fires for subresources within the document, not for the document (or window) itself. I removed the event handler and made sure the script was applied as a start script (it was). I'd already applied the top-window test, so I was covered. Now my injection script only fires once.
Other "HTML sources" may be frames or object tags (with HTML contents); I don't think it can be anything else. To the best of my knowledge, though, they should also be filtered out by window === window.top. Try console.log-ing document.location to see which URL runs your injected script, and maybe you can find what loads them.
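A sketch of that filter plus the suggested logging, for an injected script:

if (window === window.top) {
  // Only the top-level document reaches this point; iframes, frames,
  // and <object> documents are filtered out.
  console.log('injected into', document.location.href)
  // ...do the one-time injection work here...
}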

How to locally test cross-domain builds?

Using the dojo toolkit, what is the proper way of locally testing code that will be executed as cross-domain, without making the actual build?
As it appears, there are three possible options (each, with their own drawbacks):
Using local (non xd) XMLHttpRequest dojo.require
This option does not really test the xd behavior, since it dojo.require[s] the js synchronously via XHR.
djConfig.debugAtAllCosts = true;
Although this option does load the required code asynchronously (via the 'script' tag), it also pulls the code in via XHR, parses the dojo.require[s] inside it, and pulls those in too. This (using the loader_debug) is, again, not what the loader_xd does. More on this topic in a different question.
Creating a cross-domain build
This approach requires a build, which is not possible in the environment in which I'm running the code. (We're using our own on-the-fly build process, which includes only the js necessary for a particular page; this process is not suitable for development.)
Thus, my question: is there a way to use the loader_xd, which does not require an xd build (which adds the xd prefix / suffix to every file)?
The second option (using debugAtAllCosts) also makes me question the motivation for pre-parsing the dojo.require[s]. If the loader_xd will not (or rather cannot) pre-parse, why does the method that was created for testing/debugging do so?
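(For reference, option 2 amounts to something like the following; in a Dojo 1.x page, djConfig must be declared before dojo.js is loaded:)

// Declared before the <script> tag that loads dojo.js.
var djConfig = {
  isDebug: true,
  debugAtAllCosts: true // load modules via script tags instead of eval'd XHR
}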
peller has described the situation. If you want to just generate the .xd.js files for your modules, you could look at util/buildscripts/jslib/buildUtilXd.js and its buildUtilXd.xdgen() function.
It would take a bit of work to make your own script, but you could look at util/buildscripts/build.js for pointers.
I am hoping that in the future (maybe the Dojo 2.x timeframe) we can switch to a loader that just uses script tags, with a module format that has a function wrapper around the module, something coded by the developer. That would allow the same module format to work in both the local and xd cases.
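The wrapper format being described is essentially what later shipped as AMD (e.g. Dojo 1.7's define()); a rough sketch, with a made-up module body:

define(['dojo/dom'], function (dom) {
  // The factory function is the wrapper: the file loads via a plain
  // script tag, so the same module works locally and cross-domain,
  // with no XHR or eval involved.
  return {
    greet: function (id) {
      dom.byId(id).innerHTML = 'hello'
    }
  }
})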
I don't think there's any way to do XD loading without building and deploying it. Your analysis of the various options seems about right.
debugAtAllCosts is there specifically to solve a debugging problem: most browsers, until recently, could not do anything intelligent with code brought in through eval. Even today, Firefox reports exceptions in the console as occurring at the eval site (bootstrap.js), with a line number relative to the eval call rather than to the eval'd buffer, and normally that eval buffer is anonymous. Firebug was the first debugger to jump through the hoops needed to improve this experience, permitting special metadata that Dojo's loader injects between the XHR and the eval to establish a file path for the source. WebKit/Safari have recently implemented this as well. I believe debugAtAllCosts pre-dates the XD loader.
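The injected metadata referred to here is the sourceURL comment convention (originally //@ sourceURL in Firebug, now //# sourceURL); a sketch of the idea, with made-up variable names:

// Append a sourceURL hint to the fetched module text so the debugger
// can map the eval'd buffer back to a file path instead of the eval site.
var code = moduleText + '\n//# sourceURL=' + modulePath
eval(code)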