Determining all required DNS Queries to show a website - scripting

I need to create a list of all DNS queries required to display a large number of sites (ideally up to 1 000 000). The list needs to map each query to the page that required it.
Example: Visiting google.com requires DNS queries for google.com, ssl.gstatic.com, apis.google.com, and other hosts. My list would read something along the lines of
google.com:google.com,ssl.gstatic.com,apis.google.com,...
(exact format not relevant here)
I currently have two ideas on how to do this:
Set up a DNS server with logging, and build a script that visits a given list of domains using that server as its resolver
Build a script that loads the source code of each site (think Python's urllib2, for example), parses all embedded content, and constructs the list of queries that would be needed
Both ideas have problems, though. Visiting 1 000 000 domains with a gap of 2 seconds between visits (to make it possible to assign queries to the visited site afterwards), at about 1 second per page load (which is pretty optimistic), would take over 34 days, and probably longer. To build a parser, on the other hand, I would need a complete list of all possible forms of embedded content that can trigger a DNS query; I would also need to fetch some of the target URLs themselves (think iframes); and some content would be impossible to check for further queries (think Flash content that connects to other servers).
I'm kind of stuck here and would appreciate some input on how to deal with this. It would be possible to shorten the list of URLs to maybe 100 000, but anything less would dramatically reduce the usefulness of the result.
For context: I need this list for my bachelor thesis, which deals with an attack strategy against a proposed DNS privacy extension.

You can use PhantomJS to do this, as it provides an interface that lets you capture network requests and log them.
You'd need to write some simple JavaScript, but since PhantomJS is headless and scriptable, it should be fairly easy to run many instances in parallel to gather the data you need within a reasonable time.
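A minimal sketch of what such a script might look like (the file name and the output format are just illustrative):

    // save as log-requests.js, run as: phantomjs log-requests.js google.com
    var page = require('webpage').create();
    var system = require('system');
    var site = system.args[1];
    var hosts = {};

    page.onResourceRequested = function (requestData) {
        // record the hostname of every resource the page requests
        var match = requestData.url.match(/^https?:\/\/([^\/]+)/);
        if (match) { hosts[match[1]] = true; }
    };

    page.open('http://' + site, function () {
        // emit one line in the "site:host1,host2,..." format from the question
        console.log(site + ':' + Object.keys(hosts).join(','));
        phantom.exit();
    });

One caveat: resources requested after the load event fires (late AJAX calls, for example) may be missed unless you delay the exit slightly.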

There is a tool that can do this and produce a graphical representation. It is part of the dnssec-tools suite and is called dnspktflow (DNS Packet Flow).
It may not do exactly what you want, but it is open source, so you can see how they do it.

Related

How to compare content between two web pages in different environments?

We are in the process of rebuilding a website from scratch based on an existing website. The new site is meant to be an identical copy, and as it contains many pages, we need a way to compare content between the two sites. It is of course possible to do this manually, but that takes a lot of time and carries a risk of human error.
I have seen services that offer this: you input two URLs, which are then analyzed, and discrepancies are presented. However, these cannot be used, as our test environment is local (built in Sitecore).
Is there a way to solve this without making our test environment available online (which is not possible)? For example, does software exist for this, or alternatively some service where you can compare a web page that is online with one that is local?
Note that we're only looking for content comparison (not visual).
(Un)fortunately there are many ways to do this, but fortunately there are some simple ones.
What I would do is:
Get a list of URLs for each site. If the sitemap is exhaustive, you could use that; if it isn't, you might want to run some Sitecore PowerShell to get the lists.
Given the lists (from files, the Sitecore API, or similar), write a program that visits each URL, gets the text of the page after it has finished rendering, and saves it to disk; something like Selenium is good for this, and you can use any language (see the sketch after this list). You'll want a folder structure like host/urlpart/urlpart/pagename.txt, basically mirroring your content tree.
Use some filesystem diff program like WinMerge to compare the two folders
This is quick and dirty, but a good place to start.
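A rough sketch of step 2, assuming Node with the selenium-webdriver package and ChromeDriver installed (savePages and the folder layout are illustrative):

    const fs = require('fs');
    const path = require('path');
    const {Builder, By} = require('selenium-webdriver');

    async function savePages(urls, outDir) {
        const driver = await new Builder().forBrowser('chrome').build();
        try {
            for (const url of urls) {
                await driver.get(url);
                // rendered text only, so markup differences don't create noise
                const text = await driver.findElement(By.css('body')).getText();
                const u = new URL(url);
                const file = path.join(outDir, u.hostname,
                    u.pathname.replace(/\/+$/, '') || 'index') + '.txt';
                fs.mkdirSync(path.dirname(file), {recursive: true});
                fs.writeFileSync(file, text);
            }
        } finally {
            await driver.quit();
        }
    }

Run it once per environment with the same URL list (swapping the hostname), then point WinMerge at the two output folders.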

Best approach for creating Random keystrokes for load testing a webapp for database-backed quicksearch using JMeter

Context:
I am load testing a prototype enterprise web app that performs quick searches on a large dataset. It's backed by a database and uses jQuery DataTables backed by a servlet to narrow the results on each keystroke.
I want to find out how it behaves under load, measure response time, stability, and usability at various load levels, and come up with an SLA. The load in this case would be a number of users logging in and typing various search strings simultaneously.
Tools:
I am using Apache Jmeter to do this.
Question:
To make my load tests truly random and eliminate the effect of caching at the database level (or anywhere else), I want the HTTP requests for each search to be random. I want to do something like this: send a character, wait, send another character, send a backspace, send one more character, send two backspaces, and so on.
What is the most elegant/efficient way of doing something like that using JMeter?
Right now I am looking into using a CSV dataset and reading random characters from a large file, but I'm wondering if there is a better way.
You can achieve the random search strings by using functions.
Specifically, look at __RANDOM and __CHAR.
Basically, you'd have something like ${__CHAR(${__RANDOM(97,122)})} to generate a single random lowercase letter (character codes 97 to 122 cover a-z; a range starting at 0 would pull in non-printable control characters).
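If nesting functions gets unwieldy, a JSR223 PreProcessor can build the whole random search string in one go. A rough sketch, assuming your JVM provides a JavaScript script engine for JSR223 elements (the SEARCH variable name is just an example):

    // build a random search prefix of 1 to 8 lowercase letters
    var len = 1 + Math.floor(Math.random() * 8);
    var s = '';
    for (var i = 0; i < len; i++) {
        s += String.fromCharCode(97 + Math.floor(Math.random() * 26));
    }
    // expose it to the sampler; reference it as ${SEARCH} in the HTTP Request
    vars.put('SEARCH', s);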
I'd also recommend having a CSV file with the top 100 most popular search terms to test against.
I don't know your use case, but it seems unlikely that people are going to type random characters. If I am correct, simulating random keystrokes could be just as misleading as using a very small set of search keywords.
Instead, you should locate or develop a set of keywords that people are likely to use, perhaps by scanning the content they will be searching, and use that to populate what users enter when they search.

Nice remote apache log viewer

I have a server with 10+ virtual domains (most running MediaWiki). I'd like to be able to watch their traffic remotely with something nicer than tail -f. I could cobble something together, but I was wondering if something super-deluxe already exists that involves a minimum of hacking and support. This is mostly to understand what's going on, not so much for security (though it could serve that role too). It must:
be able to deal with vhost log files
be able to handle updates every 10 seconds or so
Be free/open source
The nice-to-haves are:
Browser based display (supported by a web app/daemon on the server)
Support filters (bots, etc)
Features like counters for pages, with click to view history
Show a nice graphical display of a geographic map, timeline, etc
Identify individual browsers
Show link relationships (coming from remote site, to page, to another page)
Be able to identify logfile patterns (editing or creating a page)
I run Debian on the server.
Thanks!
Take a look at Splunk.
I'm not sure if it supports real time (~10 second) updates but there are a ton of features and it's pretty easy to get set up.
The free version has some limitations but there is also an enterprise version.
Logstash is the current answer. (=
Depending on the volume, Papertrail could be free for you. It is the closest thing to a tail -f and is searchable, archivable and also sends alerts based on custom criteria.

How should I format user uploaded pictures' filenames?

My website deals with pictures that users upload. I'm kind of conflicted about what my picture filenames should consist of. I'm worried about scalability, mainly, and possibly security. Maybe someone out there deals with the same thing and can tell me what they use on their site?
Currently, my filename convention is
{pictureId}_{userId}_{salt}_{variant}.{fileExt}
where salt is a token generated server-side (I'm not sure why I decided to put this here; maybe for security purposes, I don't know) and variant is something like t, which signifies it's a thumbnail. So it would look something like
12332_22_hb8324jk_t.jpg
Please advise, thanks.
In addition to the previous comments, you may want to consider creating a directory hierarchy for your files. Depending on volume and the particular OS hosting the files, you can easily reach a point where you have an unreasonably large number of files in a single directory. There may be limits on the number of files allowed per folder. If you ever need to do any manual QA or maintenance on your files, this may be problematic (especially if such maintenance is not scripted).
I once worked on a project with a high volume of images. We decided to record a subpath in our database in addition to the filename of each file. Our folder names looked like this:
a/e/2/f/9
3/3/2/b/7
Essentially, we created folders 5 deep with a single hex value as the folder name. The depth was probably excessive, but effective. I suppose this could have led to us reaching a limit on the number of folders on a volume (not sure if such a limit exists).
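A sketch of one common way to derive such a subpath, hashing the filename and using the first few hex digits as nested folder names (the hash and depth are arbitrary choices, and shardPath is a hypothetical name):

    var crypto = require('crypto');

    // spread files evenly across a folder tree, e.g. "a/e/2/f/9"
    function shardPath(filename, depth) {
        var hex = crypto.createHash('md5').update(filename).digest('hex');
        return hex.slice(0, depth || 5).split('').join('/');
    }

    // shardPath('12332_22_hb8324jk_t.jpg') yields something like '3/3/2/b/7'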
I would also consider storing a drive in addition to a path (assuming you have a bunch of disks for storage). This way you can move images around and then update your database (assuming you have one) as part of the move.
My 2 pence worth: there is a bit of a conflict between scalability and security in this problem, I would say.
If you have real security concerns, then you should not rely at all on the filename of the target image: that is just security by obscurity, and somebody could eventually guess the name (even with your salt idea, which makes it harder).
Instead, you should at least have a login mechanism that creates a session between client and server, so content can only be fetched after authentication. Even then the traffic is sniffable; if security really is a concern, I would say you have to use SSL.
Regarding scalability: I would suggest you actually do give your images sequential numbers and store them in 'bins' of (say) 500 images each. As you fill up a bin, create a new one. Store the bin (min-image-id, max-image-id) information in one DB table and the image numbers in another: you can then comparatively cheaply find which bin a particular image lives in from its id. This is a fairly common solution for storing lots of docs/images.
You could then map your URLs to the bin + image id, but to avoid the problem noted by Jason Williams (sequential numbering makes it easy to probe), you really should address security separately, as in point 1.
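A sketch of the id-to-bin mapping, assuming fixed-size bins (with variable-size bins you would look the id up in the (min-image-id, max-image-id) table instead; the names here are illustrative):

    var BIN_SIZE = 500; // images per bin, as suggested above

    // ids 1-500 -> bin 0, 501-1000 -> bin 1, and so on
    function binFor(imageId) {
        return Math.floor((imageId - 1) / BIN_SIZE);
    }

    function imagePath(imageId, ext) {
        return 'bins/' + binFor(imageId) + '/' + imageId + '.' + ext;
    }

    // imagePath(12332, 'jpg') -> 'bins/24/12332.jpg'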
You might like to consider replacing the underscores with, say, hyphens. (Underscores are used as wildcards in SQL, so you could potentially run into trouble one day in a LIKE comparison.) (And of course, underscores are just plain evil :-)
It looks from your example like you're avoiding spaces and upper-case characters - good move. I'd keep everything lowercase and use case-insensitive comparisons to eliminate any potential case-sensitivity issues with different file systems.
Scalability should be fine as long as you can cope with any number of digits in your user, picture and type IDs. You're very unlikely to hit any filename length limits with this scheme.
Security could be an issue if you use sequential IDs, as someone could potentially tweak the numbers and request a picture they shouldn't be able to access - but the salt should make it virtually impossible for someone to guess the correct filename for another picture. If users can't see/access the internal filename in any way, that may be an unnecessary measure though.
The first thing to do is to set up a directory structure that models your use case. In your case, you have a user who uploads a picture. You would probably have a directory structure like this (probably on a network share somewhere):
-Pictures
    -UserID1
        -PictureID1~^~Variant.jpg
        -PictureID2~^~Variant.jpg
    -UserID2
        -PictureID1~^~Variant.jpg
        -PictureID2~^~Variant.jpg
Pictures - simply the root directory for everything below.
UserID - the database user ID.
PictureID - simply the picture ID from the database (assuming you record the filename of each uploaded picture in a database).
~^~ - simply a delimiter. You can use a one-character or multi-character sequence. I like three characters, as it is easily handled with the split function and is readily distinguishable in the file name.
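A sketch of reading such a name back apart with the split logic described above (parsePictureName is just an illustrative name):

    // turn "12332~^~t.jpg" back into its parts
    function parsePictureName(filename) {
        var dot = filename.lastIndexOf('.');
        var parts = filename.slice(0, dot).split('~^~');
        return { pictureId: parts[0], variant: parts[1], ext: filename.slice(dot + 1) };
    }

    // parsePictureName('12332~^~t.jpg') -> { pictureId: '12332', variant: 't', ext: 'jpg' }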
Sometimes I like to add the size of the picture to the file name as well: .256.jpg or .1024.jpg.
At any rate, all of this depends on your use case. The most important thing is setting up the directory structure properly. That will make it easier to access/serve and manage the pictures.
You can add any other information you need into the filename as long it doesn't exceed the maximum filename length on your system.

browser plugin to test a site's look when migrating

I'm thinking I need a browser plugin that does the following, and if it doesn't exist, it should. I may as well say Firefox for now, but it could be any browser.
The problem: when moving a website from one server to another, you need migration testing. It is a pain to click on every link by hand and compare it to the old host. You really need two machines, or you have to constantly thrash your hosts file.
The plugin:
It would allow you to specify an alternate hosts entry for a website. Two entries would make it clear: one for live, one for test.
The plugin would crawl every link on the site, render each page in the browser, and save an image of the entire page.
It would then switch hosts and repeat, saving the images in a second folder. Since the rendering engines match, the images should match. We need to switch hosts (as in /etc/hosts) so that all absolute links stay the same for the site.
This last part could be part of the plugin or external: now that we have two folders of identically named images, we run an image-diff program on the whole batch. A quick test would be a binary diff or a hash, or we could get more sophisticated and determine how different each image is.
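For the quick hash check, a sketch in Node (folder names and output are illustrative; byte-identical screenshots hash equal, anything else warrants a pixel-level look):

    var crypto = require('crypto');
    var fs = require('fs');
    var path = require('path');

    // flag every screenshot whose bytes differ between the two capture runs
    function diffFolders(liveDir, testDir) {
        function hashOf(file) {
            return crypto.createHash('sha1').update(fs.readFileSync(file)).digest('hex');
        }
        fs.readdirSync(liveDir).forEach(function (name) {
            var a = path.join(liveDir, name);
            var b = path.join(testDir, name);
            if (!fs.existsSync(b)) {
                console.log('missing on test: ' + name);
            } else if (hashOf(a) !== hashOf(b)) {
                console.log('differs: ' + name);
            }
        });
    }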
This would save so much time. So can it be done with existing tools, or do I need to go write it?
Have a look at Selenium; it allows you to script interactions with the browser and verify content.
That is overengineered. What kind of website is it? How big? Which framework (PHP, JSP, Rails, etc.)? Why not copy the website onto the new server and grep the code for specific ties to the old server?
I'd concentrate on why you think the site would differ between two servers, and focus on testing those specific cases rather than the whole site. When a site is moved to a new machine the issues are generally very obvious from looking at a couple of pages.
Presumably they are both looking at the same data source (assuming there is a data source); otherwise a folder diff on the two installations would suffice. That being the case, it should be a simple task to identify which areas of the site are likely to be affected by a server migration.
Also, I personally wouldn't trust a machine matching two images to sign off a system as ready to go live. There just isn't a substitute for real human testing. Yes, it's time-consuming, but how important is your site?
Try http://www.browsercam.com/ - the free trial should allow you to specify the main page and follow its links to make screenshots of the sub-pages automatically as well.