How to compare content between two web pages in different environments? - testing

We are in the process of rebuilding an existing website from scratch. The new site is meant to be an identical copy, and since it contains many pages we need a way to compare content between the two sites. This can of course be done manually, but that takes a lot of time and carries a risk of human error.
I have seen services that offer this: you enter two URLs, the pages are analyzed, and any discrepancies are reported. However, these cannot be used because our test environment is local (built in Sitecore).
Is there a way to solve this without making our test environment available online (which is not an option)? For example, is there software for this, or some service that can compare a page that is online with one that is local?
Note that we're only looking for content comparison (not visual).

(Un)fortunately there are many ways to do this, but fortunately there are some simple ones.
What I would do is:
Get a list of URLs for each site. If the sitemap is exhaustive you can use that; if it isn't, you might want to run some Sitecore PowerShell to generate the lists.
Given the lists (from files, the Sitecore API, or similar), write a program that visits each URL, grabs the text of the page after it has finished rendering, and saves it to disk; something like Selenium is good for this, and you can use any language (see the sketch below). You'll want a folder structure like host/urlpart/urlpart/pagename.txt, essentially mirroring your content tree.
Use a filesystem diff program like WinMerge to compare the two folders.
This is quick and dirty, but a good place to start.
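A minimal sketch of the "visit each URL and save the text" step in Python with Selenium; the file names, output folders, and use of Chrome are placeholders, and JS-heavy pages may also need an explicit wait before reading the text:

```python
# Minimal sketch: visit each URL from a text file, grab the rendered page text,
# and mirror it into a folder structure based on the URL path.
from pathlib import Path
from urllib.parse import urlparse

from selenium import webdriver
from selenium.webdriver.common.by import By


def dump_site(url_file, out_root):
    driver = webdriver.Chrome()  # any browser/driver works
    try:
        for url in Path(url_file).read_text().splitlines():
            url = url.strip()
            if not url:
                continue
            driver.get(url)
            text = driver.find_element(By.TAG_NAME, "body").text  # rendered text only
            parsed = urlparse(url)
            rel = parsed.path.strip("/") or "index"
            out_path = Path(out_root) / parsed.netloc / (rel + ".txt")
            out_path.parent.mkdir(parents=True, exist_ok=True)
            out_path.write_text(text, encoding="utf-8")
    finally:
        driver.quit()


dump_site("urls-live.txt", "dump/live")
dump_site("urls-test.txt", "dump/test")
```

The two output folders can then be handed straight to WinMerge (or any folder diff) for the comparison step.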

Related

Creating a script that will automate filling out a form on an ad website

I would like to create a script that would help me automate filling out a form on a website. Here is a basic idea that I came up with. The website consists of 5 stages.
Selecting a category and a group for the item
Adding the item, which consists of a title, price, etc.
Selecting the visibility
Finishing the ad by clicking "I accept"
Deleting the previous ad for the same product
So basically what I had in mind was to sort all of my items into subfolders; each subfolder would contain an image of the item along with an info.txt file. The info.txt file would contain all the information needed to fill out the form (for example the title, price, text of the ad, etc.). Using these subfolders and .txt files I'd like to create a script that could add my items and fill out a form on an ad-based website.
So my question is: How to do this, what language/script should I use?
I would recommend taking a look at Selenium or PhantomJS. These tools are normally used for running automated browser tests of web apps, but I have used them to automate tasks on some websites as well. Both Selenium and PhantomJS are driven from a programming language.
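As a rough illustration of the Selenium route, here is a minimal Python sketch; the URL, the form field names, and the items/info.txt layout are all hypothetical and would need to be adapted to the real site:

```python
# Minimal sketch: loop over item subfolders and fill out the ad form with Selenium.
# The URL, field names ("title", "price", ...) and the info.txt format
# (key=value lines) are placeholders, not the real site's details.
from pathlib import Path

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    for item_dir in Path("items").iterdir():
        info = dict(
            line.split("=", 1)
            for line in (item_dir / "info.txt").read_text().splitlines()
            if "=" in line
        )
        driver.get("https://ads.example.com/new")  # placeholder URL
        driver.find_element(By.NAME, "title").send_keys(info["title"])
        driver.find_element(By.NAME, "price").send_keys(info["price"])
        driver.find_element(By.NAME, "description").send_keys(info["text"])
        # file inputs accept a local path via send_keys
        driver.find_element(By.NAME, "image").send_keys(str(item_dir / "image.jpg"))
        driver.find_element(By.NAME, "accept").click()
finally:
    driver.quit()
```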
If you prefer a simpler approach and don't mind using proprietary apps, you could take a look at iMacros, although I am not sure how easy it is to read files from within iMacros.

Determining all required DNS Queries to show a website

I need to create a list of all DNS Queries required to display a large number of sites (ideally up to 1 000 000). The list needs to assign the queries to the page that required them.
Example: Visiting google.com required a DNS query for google.com, ssl.gstatic.com, apis.google.com and other sites. My List would read something along the lines of
google.com:google.com,ssl.gstatic.com,apis.google.com,...
(exact format not relevant here)
I currently have two ideas on how to do this:
Set up a DNS server with query logging, and build a script that visits a given list of domains using that DNS server as its resolver
Build a script that loads the source code of each site (think Python's urllib2, for example), parses all embedded content, and constructs the list of queries that would be needed
Both ideas have problems, though. Visiting 1 000 000 domains with a gap of 2 seconds between visits (to make it possible to assign queries to the visited site afterwards), and taking about 1 second per page load (which is pretty optimistic), would take over 34 days, probably longer. On the other hand, to build a parser I would need a complete list of all possible forms of embedded content that can trigger a DNS query, I would need to fetch some of the referenced URLs as well (think iframes), and some content would be impossible to check for further queries (think Flash content that connects to other servers).
I'm kind of stuck here, and would appreciate some input on how to deal with this. It would be possible to shorten the List of URLs to maybe 100 000, but any less would dramatically reduce the use of the result.
For context: I need this list for my bachelor thesis, which deals with an attack strategy against a proposed DNS privacy extension.
You can use PhantomJS to do this, as it provides an interface that lets you capture network requests and log them, something along the lines of this example.
You'd need to write some simple JavaScript, but since it is all scriptable it should be fairly easy to run the visits concurrently and gather the data you need within a reasonable time.
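If you would rather stay in Python, a roughly equivalent sketch (an assumption, not the PhantomJS approach above) uses the third-party selenium-wire package, which records the requests a page makes so they can be reduced to the set of hostnames that need DNS lookups; "domains.txt" and the use of Chrome are placeholders:

```python
# Rough sketch: load each domain, collect the hostnames of all requests the page
# triggered, and print them in "domain:host1,host2,..." form.
from urllib.parse import urlparse

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
try:
    with open("domains.txt") as f:
        for domain in (line.strip() for line in f):
            if not domain:
                continue
            del driver.requests                  # drop requests from the previous page
            driver.get("http://" + domain)
            hosts = {urlparse(r.url).hostname for r in driver.requests}
            print(domain + ":" + ",".join(sorted(h for h in hosts if h)))
finally:
    driver.quit()
```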
There is a tool that can do this and produce a graphical representation. It is part of dnssec-tools and is called DNSpktflow (DNS Packet Flow).
It may not do exactly what you want, but it is open source, so you can see how they do it.

RAILS3: Full-text Search Word Docs?

My company has a collection of about 3500 highly-structured Word docs (and growing) that contain multiple choice questions from one of our products. I've been tasked with writing a front-end that will let people find and use these in other products. There is some metadata on them that would go in a database, but we'd also like full-text search.
I've been given the option of using for the front-end either MS Access (because I know it well) or Rails (because I'm supposed to be learning it). I've done one Rails app and prefer to continue with it.
Rather than load the documents into the database, I thought it made more sense to just have them on the file system and store paths to them in the database.
I know I can use Ferret to search database fields but what's the best way to add full-text searching to a Rails app for a pile of files on the filesystem?
I'm not sure there are any gems that will search the Word files for you. Although you have mentioned that you do not want to load the entire documents into the database, you might look into copying just the text contents of each file into your db. You can use the win32ole library for this (http://ruby-doc.org/stdlib/libdoc/win32ole/rdoc/classes/WIN32OLE.html). If I had to implement this, I would run a cron job every night (or at whatever frequency seems fit) that refreshes the database content with the changes in the Word files.
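The suggestion above drives Word through Ruby's WIN32OLE; purely as an illustration of the same COM-automation idea, a minimal Python sketch with pywin32 might look like this (the C:\docs source folder and the "extracted" output folder are placeholders, and in the Rails app you would store the text in the database column used for full-text search instead):

```python
# Illustration only: extract the text of each Word document via COM automation.
# Requires Windows with Word installed.
from pathlib import Path

import win32com.client  # pip install pywin32

out_dir = Path("extracted")
out_dir.mkdir(exist_ok=True)

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    for doc_path in Path(r"C:\docs").glob("*.doc*"):
        doc = word.Documents.Open(str(doc_path))
        text = doc.Content.Text              # full text of the document
        doc.Close(False)                     # close without saving
        (out_dir / (doc_path.stem + ".txt")).write_text(text, encoding="utf-8")
finally:
    word.Quit()
```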

How to compare test website and live website

We have our production server running our website. We also have a test server with the exact same data but with code changes for some new functionality. This web app has over 500 pages.
Is there any program that can
Log in to the test site
Crawl through each page and save it as HTML
Compare it with the same page saved from the live site?
This way we can make sure that new features that we add to our test site will not break the live site when code updates are applied to production.
I am currently trying to use the WinHTTrack website copier and then comparing the test and live folders with a code comparison tool like Beyond Compare. This works OK, but a lot of files show up as changed simply because of the domain name differences.
Looking forward to ideas / solutions for this problem.
Regards
Have you looked at using Watir for this? It's not exactly what you are looking for, but it might give you more granularity in your tests and let you ensure the site is functionally identical, rather than getting caught up on changing GUIDs, timestamps, and all the other things that tend to change across any significantly sized website from day to day as part of its standard functionality.
Apparently you can't make consistent, reproducible builds in your project, can you? I would recommend moving towards that in the long run; it will save you a lot of headaches. That way you would know exactly what was deployed to which server and when, so there would be no more need to bend over backwards to reconstruct the deployed sources like this...
I know this is not a direct solution to your problem, but it may be worth weighing whether you would save more in the long run by investing the effort in your build process now, instead of implementing this workaround (and then improving your build process anyway, because one day you will almost surely need to).
wget has a --convert-links option; there are also some options to preserve cookies that might let you do this while logged in: http://drupal.org/node/118759#comment-664498
Use an offline downloader, download all files to your computer from both sources, then compare the folder contents using a free tool like Total Commander.
EDIT
Alternatively, load both of your sources into a version control system (e.g. CVS) and compare them there.
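To cut down the noise from domain-name differences mentioned in the question, one option is to normalize the hostnames in the downloaded files before diffing. A minimal Python sketch, where the folder and host names are placeholders:

```python
# Minimal sketch: rewrite the test hostname to the live hostname in the downloaded
# files so a folder diff only shows real content differences.
from pathlib import Path

TEST_HOST = "test.example.local"   # placeholder
LIVE_HOST = "www.example.com"      # placeholder

for path in Path("dump/test").rglob("*.htm*"):
    html = path.read_text(encoding="utf-8", errors="ignore")
    path.write_text(html.replace(TEST_HOST, LIVE_HOST), encoding="utf-8")
```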

browser plugin to test a site's look when migrating

I'm thinking I need a browser plugin that does the following, and if it doesn't exist, it should. I may as well say FF for now, but it could be any browser.
The problem: when moving a website from one server to another, you need migration testing. It is a pain to click on every link by hand and compare it to the old host. You really need 2 machines or have to constantly thrash your hosts file.
The plugin:
It would allow you to specify an alternate hosts entry for a website; two entries would make it clear, one for live, one for test.
The plugin would crawl every link on the site, and render the page in the browser, and save an image of the entire page.
It would switch hosts and repeat, and save images in a second folder. Since the rendering engines match, the images should match. We need to switch hosts (like /etc/hosts) so all absolute links are the same for the site.
This could be part of the plugin or external: now that we have two folders of identically named images, we run an image-diff program on the whole batch. A quick test would be a binary diff or a hash, or we could get more sophisticated and determine how different each image is.
This would save so much time. So can it be done with existing tools, or do I need to go write it?
Have a look at Selenium; it allows you to script interactions with the browser and verify content.
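As a starting point, here is a minimal Python/Selenium sketch that screenshots the same list of paths on two hosts and flags pages whose screenshots differ byte-for-byte; the host names, the paths.txt file, and the use of Firefox are all assumptions, and Selenium screenshots are viewport-sized, so keeping the window size identical between runs matters:

```python
# Minimal sketch: capture screenshots of the same paths on two hosts and report
# pages whose screenshots differ (compared via a hash of the PNG bytes).
import hashlib
from pathlib import Path

from selenium import webdriver


def snap(base_url, paths, out_dir):
    driver = webdriver.Firefox()
    driver.set_window_size(1280, 2000)          # same size for both runs
    hashes = {}
    try:
        for p in paths:
            driver.get(base_url + p)
            name = p.strip("/").replace("/", "_") or "index"
            out = Path(out_dir) / (name + ".png")
            out.parent.mkdir(parents=True, exist_ok=True)
            driver.save_screenshot(str(out))
            hashes[p] = hashlib.sha256(out.read_bytes()).hexdigest()
    finally:
        driver.quit()
    return hashes


paths = [line.strip() for line in Path("paths.txt").read_text().splitlines() if line.strip()]
live = snap("https://www.example.com", paths, "shots/live")
test = snap("https://test.example.local", paths, "shots/test")
for p in paths:
    if live[p] != test[p]:
        print("differs:", p)
```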
That is overengineered. What kind of website is it? How big? Which framework (PHP, JSP, Rails, etc.)? Why not copy the website onto the new server and grep the code for specific ties to the old server?
I'd concentrate on why you think the site would differ between two servers, and focus on testing those specific cases rather than the whole site. When a site is moved to a new machine the issues are generally very obvious from looking at a couple of pages.
Presumably they are both looking at the same data source (assuming there is a data source); otherwise a folder diff on the two installations would suffice. That being the case, it should be a simple task to identify which areas of the site are likely to be affected by a server migration.
Also, I wouldn't personally trust a machine matching two images to sign off a system as ready to go live. There just isn't a substitute for real human testing. Yes, it's time-consuming, but how important is your site?
Try http://www.browsercam.com/; the free trial should allow you to specify the main page and follow links to automatically make screenshots of the sub-pages as well.