How do you Bulk Test URL Redirections? - selenium

Like many others on SO, I am not from a hardcore dev background - far more ops. Therefore I find myself struggling with something like this which I guess belongs very much here.
Requirement - I want to easily test large (1000-50000) batches of URL Redirections. Closer to the former.
Inputs I want to give
Source URL & Target URL
Outputs I want to... umm get out
Pass/Fail
HTTP Response Code
Bonus points for:
Using real browsers (Selenium et al), as a very small proportion of redirects are done in JS. Very small.
Being able to choose whether the Target URL equates to the first redirect or the penultimate one.
Being able to easily change HTTP headers (though I'm happy to inject those with Fiddler etc.)
Ideas I have currently
Bash script calling curl. I can do this, but my problem is making it scale, i.e. parsing a CSV input rather than manually editing a script. It also doesn't cover the JS redirects (not a dealbreaker). Seems like the easiest option.
Selenium IDE script. I could probably write the script, but again I'm struggling to scale it to even 10 URLs. I'd probably have to parse a CSV to create each script, then feed those into the command-line runner and capture the output.
Screaming Frog. I actually really love this tool and it can test redirections in bulk. However it has no concept of pass/fail. So close to being a one-stop shop. Also the free version doesn't follow redirection chains (i.e. -L in curl)
Just seems like one of those problems others must have had and tackled in a more mainstream/easier way than I have thought of. Thanks in advance to anyone who can help.

One solution:
csv:
http://google.com;http://www.google.fr
http://domain.null;http://www.domain.null
code:
#!/bin/bash
# Read source;target pairs from the semicolon-separated csv file
while IFS=";" read -r url1 url2; do
    # Status code of the source URL (redirects not followed)
    ret=$(curl -s -o /dev/null -w "%{http_code}" "$url1")
    ((ret >= 200 && ret <= 400)) && echo "$url1 PASS" || echo "$url1 FAIL"
    # Status code of the target URL, following redirects (-L)
    echo "$url2 $(curl -s -L -o /dev/null -w "%{http_code}" "$url2")"
done < csv
If you need to know the effective (final) URL after the redirect (or the lack of one), use
curl -L -s -o /dev/null http://google.fr -w "%{url_effective}\n"
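Building on that, here is a minimal sketch (an addition, assuming the same semicolon-separated csv of source;target pairs as above) that only marks a row as PASS when the source actually lands on the expected target:
#!/bin/bash
# For each "source;target" pair, follow redirects from the source and
# compare the URL we finally land on with the expected target.
while IFS=";" read -r source target; do
    read -r code final < <(curl -s -L -o /dev/null -w "%{http_code} %{url_effective}" "$source")
    if [ "$final" = "$target" ]; then
        echo "PASS $code $source -> $final"
    else
        echo "FAIL $code $source -> $final (expected $target)"
    fi
done < csv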
Feel free to improve to fit your needs.

Big thanks to StardustOne, but I felt that I MUST be reinventing the wheel a little.
Ignoring the requirement to test in-browser and cover JS scenarios, I looked again and the nicest solution I have found to date is something sent out in the DevOps Weekly newsletter:
Smolder - https://github.com/sky-shiny/smolder
I know a member of staff working for one of our vendors was also working on a similar app that wrapped the Python requests library and I plan to back-to-back the two soon to find out which is actually best. I'll post back if his effort beats Smolder!

Related

Jira -- How to get issue changelog via REST API - but ALL, not single issue

I've seen this question many times, but no sufficient answer.
We're trying to dump all JIRA data into our data warehouse/ BI system. Or at least, the interesting parts.
One thing you can do is track status times, cycle time and lead time directly from field durations. This is very easy via JIRA's SQL database directly; the relevant tables are changeItem and changeGroup.
Of course the REST JSON API has less performance impact on the database.
However ... there appears to be no equivalent in the rest API of fetching ALL issue change history. Yes, you can fetch the changelog of one issue directly via an API call. If you have 100k issues, are you expected to make 100k API calls, iterating through issue IDs? Sounds like madness.
Is it somehow possible to expand changelogs through the search API, which amasses all issue data? I haven't seen it. Is what I'm seeking here possible? Or will we have to stick to the SQL route?
I think you are asking pretty much the same question as before: How can I fetch (via GET) all JIRA issues? Do I go to the Search node?, but you are additionally interested in getting the changelog data.
Yes, again you have to do it in batches, requesting the JIRA API several times.
Here is a little bash script which could help you do that:
#!/usr/bin/env bash
LDAP_USERNAME='<username>'
LDAP_PASSWORD='<password>'
JIRA_URL='https://jira.example.com/rest/api/2/search?'
JQL_QUERY='project=FOOBAR'
START_AT=0
MAX_RESULTS=50
# First request only fetches the total number of matching issues
TOTAL=$(curl --silent -u "${LDAP_USERNAME}:${LDAP_PASSWORD}" -X GET -H "Content-Type: application/json" "${JIRA_URL}maxResults=0&jql=${JQL_QUERY}" | jq '.total')
echo "Query would export ${TOTAL} issues."
while [ ${START_AT} -lt ${TOTAL} ]; do
    echo "Exporting from ${START_AT} to $((START_AT + MAX_RESULTS))"
    # expand=changelog includes the full change history of each issue
    curl --silent -u "${LDAP_USERNAME}:${LDAP_PASSWORD}" -X GET -H "Content-Type: application/json" "${JIRA_URL}maxResults=${MAX_RESULTS}&startAt=${START_AT}&jql=${JQL_QUERY}&expand=changelog" | jq -c '.issues[]' >> issues.json
    START_AT=$((START_AT + MAX_RESULTS))
done
Please note the expand parameter, which additionally puts the full change log into the JSON dump as well. Alternatively you can use the issue dumper Python solution: implement the callback to store the data in your database and you're done.
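As a follow-up sketch (assuming the standard JIRA REST API v2 changelog layout of .changelog.histories[].items[]), you could flatten the resulting issues.json into one line per status transition with jq:
# One tab-separated line per status change: key, timestamp, from, to
jq -r '.key as $key
       | .changelog.histories[]
       | .created as $ts
       | .items[]
       | select(.field == "status")
       | [$key, $ts, .fromString, .toString] | @tsv' issues.json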
Another service worth considering, especially if you need a feed-like list of changes:
/plugins/servlet/streams?maxResults=99&issues=activity+IS+issue%3Aupdate&providers=issues
This returns a feed of the latest changes to issues in XML format for some criteria, like users etc. Actually, you may play around with the "Activity Stream" gadget on a Dashboard to see how it works.
The service has a limit of 99 changes at once, but there is paging (see the "Show More..." button).
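For example, a quick way to pull that feed from the command line (a sketch, assuming the same jira.example.com base URL and basic-auth credentials as in the script above):
# Fetch the activity stream (Atom/XML) of issue updates
curl --silent -u "${LDAP_USERNAME}:${LDAP_PASSWORD}" "https://jira.example.com/plugins/servlet/streams?maxResults=99&issues=activity+IS+issue%3Aupdate&providers=issues"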

How to query an API? specifically 'Adzuna'

I'm trying to query the Adzuna API, the documentation instructs me to write the following:
https://api.adzuna.com:443/v1/api/jobs/gb/jobsworth?app_id=1d3bc9c4&app_key=de61a42bf523e06f5b7ebe32d630e8fd
... i.e. the command generated by entering my Application ID (1) and Application Key (2) into this example query generator
it issues the following response:
[1] 10123
bash: https://api.adzuna.com:443/v1/api/jobs/gb/jobsworth?app_id=1d3bc9c4: No such file or directory
[1]+ Exit 127
Which I guess means that I've done it completely wrong.
I also tried plugging my details into the example query they put in the documentation, with stubs for credentials:
http://api.adzuna.com/v1/api/property/gb/search/1?app_id={YOUR_APP_ID}&app_key={YOUR_APP_KEY}
I have no experience querying APIs. I also tried passing that URL to wget and GET, and the response was still garbage.
How can I query this API in the right way?
(1) Application ID: 1d3bc9c4
(2) Application Key: de61a42bf523e06f5b7ebe32d630e8fd
It seems this query A:
http://api.adzuna.com/v1/api/jobs/gb/search/1?app_id=1d3bc9c4&app_key=de61a42bf523e06f5b7ebe32d630e8fd&results_per_page=20&what=javascript%20developer&content-type=application/json
will work when put into a browser; it is a simple copy of my credentials into the example:
http://api.adzuna.com/v1/api/jobs/gb/search/1?app_id={YOUR API ID}&app_key={YOUR API KEY}&results_per_page=20&what=javascript%20developer&content-type=application/json
But is there a way to execute that without a browser, directly in the terminal, to expedite the process?
but when I tried query A with wget I got:
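A minimal sketch of running query A from the terminal, assuming the errors above come from the shell treating the unquoted & characters as background operators (which would explain the "No such file or directory" message and Exit 127): quote the whole URL.
# curl: quote the URL so the shell does not interpret & and ?
curl -s "http://api.adzuna.com/v1/api/jobs/gb/search/1?app_id=1d3bc9c4&app_key=de61a42bf523e06f5b7ebe32d630e8fd&results_per_page=20&what=javascript%20developer&content-type=application/json"
# or wget, writing the JSON response to stdout
wget -qO- "http://api.adzuna.com/v1/api/jobs/gb/search/1?app_id=1d3bc9c4&app_key=de61a42bf523e06f5b7ebe32d630e8fd&results_per_page=20&what=javascript%20developer&content-type=application/json"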

How to get application uptime report with New Relic API?

I need to get a weekly report of my applications' uptime.
The metric is available on the "SLA Report" screen (http://awesomescreenshot.com/0f02fsy34e) but I can't find a way to get it programmatically.
Other SLA metrics are available using the API: http://docs.newrelic.com/docs/features/sla-report-examples#metrics
The uptime information is not considered a metric so is not available via the REST API. If you like you may contact support.newrelic.com to request this as a new feature.
Even though there is no direct API request for getting uptime from New Relic, we can send NRQL queries inside curl.
curl -X POST -H "Accept: application/xml" -H "X-Query-Key: your_query_api_key" -d "nrql=SELECT+percentage(count(*)%2c+WHERE+result+%3d+'SUCCESS')+FROM+SyntheticCheck+SINCE+1+week+ago+WHERE+monitorName+%3d+'your+monitor+name'+FACET+dateOf(timestamp)" "https://insights-api.newrelic.com/v1/accounts/your_account_id/query"
The above curl will give uptime percentages broken down by day of the week. If you don't know how to URL-encode your NRQL query, please try this: http://string-functions.com/urlencode.aspx. Hope this helps.
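For readability, this is the NRQL decoded from the request above, together with one (assumed, not from the original answer) way to URL-encode a query from the shell using jq:
# Decoded NRQL sent by the curl command above:
#   SELECT percentage(count(*), WHERE result = 'SUCCESS') FROM SyntheticCheck
#   SINCE 1 week ago WHERE monitorName = 'your monitor name' FACET dateOf(timestamp)
# URL-encode an arbitrary NRQL string with jq
NRQL="SELECT percentage(count(*), WHERE result = 'SUCCESS') FROM SyntheticCheck SINCE 1 week ago WHERE monitorName = 'your monitor name' FACET dateOf(timestamp)"
printf '%s' "$NRQL" | jq -sRr @uri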
I figured out that the following gives CSV-formatted text content and could be useful for extracting the data (monthly), but it's not going to ask you for credentials on the tool / command line you use (beware). Since I export the monthly metrics from my machine, it's simple for me to make it work, and it really gives me much more flexibility.
https://rpm.newrelic.com/optimize/sla_report/run?account_id=<account_id>&application_id=<app_id>&format=csv&interval=months

Figure out if a website has restricted/password protected area

I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Forbidden". But the problem is that these websites are all different, some static and some dynamic (HTML, CGI, PHP, Java applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will get you so far, but won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return 401. Looking for links like "log in" or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation and write a little program yourself that reads the list of target sites from a file, checks each one, and writes one file of "these are definitely passworded" and one of "these are not". Then you might want to manually check the ones that are not, and modify your program to accommodate them. Using httrack is great for grabbing data, but it's not going to help with detection; if you write your own "check for password protected area" program with a general-purpose HLL, you can do more checks, and you can avoid generating more requests per site than would be necessary to determine that a password-protected area exists.
You may need to ignore robots.txt
I recommend using the Python port of Perl's Mechanize (the mechanize library), or whatever nice web automation library your preferred language has. Almost all modern languages will have a nice library for opening and searching through web pages, and for looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form could be found within two links of the home page. Almost all ecommerce sites, web apps, etc. have login forms that are accessed just by clicking on one link on the home page, but another layer or even two of depth would almost guarantee that you didn't miss any.
I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
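A minimal sketch of the post-mirror search step (assuming httrack has written each site into its own directory under a hypothetical mirrors/ folder): grep the downloaded HTML for password inputs and obvious login links.
#!/bin/bash
# Flag each mirrored site that contains a password field or a login link
for site in mirrors/*/; do
    if grep -qriE --include='*.htm*' 'type=["'\'']?password|>[[:space:]]*(log ?in|sign ?in)[[:space:]]*<' "$site"; then
        echo "PASSWORDED $site"
    else
        echo "OPEN?      $site"
    fi
done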
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt
This will cause wget to download the entire site recursively, but only files with the endings listed in the -A option; in this case we try to avoid heavy files.
The headers will be directed to the file output_yoursite.txt (note the 2>, since wget writes the server responses to stderr), which you can then parse for the status value 401, which means that part of the site requires authentication. You can also parse the downloaded files according to Konrad's recommendation.
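As a small follow-up (the log format varies a little between wget versions, so treat this as a sketch), you could list the requested URLs together with any 401 responses like this:
# Show request lines and 401 status lines from the wget log
grep -E '^--|HTTP/1\.[01] 401' output_yoursite.txt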
Looking for 401 codes won't reliably catch them as sites might not produce links to anything you don't have privileges for. That is, until you are logged in, it won't show you anything you need to log in for. OTOH some sites (ones with all static content for example) manage to pop a login dialog box for some pages so looking for password input tags would also miss stuff.
My advice: find a spider program that you can get the source for, add in whatever tests (plural) you plan on using, and make it stop at the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type), and can work with more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
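A command-line variant of that idea (sites.txt is a hypothetical file with one URL per line): record each site's status code and flag the ones answering 401.
#!/bin/bash
# Try each site once and note which ones demand authentication
while read -r site; do
    code=$(curl -s -o /dev/null --max-time 15 "$site" -w "%{http_code}")
    if [ "$code" = "401" ]; then
        echo "AUTH REQUIRED $site"
    else
        echo "$code $site"
    fi
done < sites.txt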

How do you spell check a website?

I know that spellcheckers are not perfect, but they become more useful as the amount of text you have increases in size. How can I spell check a site which has thousands of pages?
Edit: Because of complicated server-side processing, the only way I can get the pages is over HTTP. Also it cannot be outsourced to a third party.
Edit: I have a list of all of the URLs on the site that I need to check.
Lynx seems to be good at getting just the text I need (body content and alt text) and ignoring what I don't need (embedded Javascript and CSS).
lynx -dump http://www.example.com
It also lists all URLs (converted to their absolute form) in the page, which can be filtered out using grep:
lynx -dump http://www.example.com | grep -v "http"
The URLs could also be local (file://) if I have used wget to mirror the site.
I will write a script that will process a set of URLs using this method and output each page to a separate text file. I can then use an existing spellchecking solution to check the files (or a single large file combining all of the small ones).
This will ignore text in title and meta elements. These can be spell checked separately.
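Here is a minimal sketch of that script (urls.txt is a hypothetical one-URL-per-line list; the aspell language and word lists would need tuning):
#!/bin/bash
# Dump each page as plain text with lynx, drop the appended URL list,
# and let aspell print the words it considers misspelled.
while read -r url; do
    echo "== $url"
    lynx -dump "$url" | grep -v "http" | aspell list --lang=en | sort -u
done < urls.txt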
Just a few days ago I discovered Spello, a web site spell checker. It uses my NHunspell (OpenOffice spell checker for .NET) library. You can give it a try.
If you can access the site's content as files, you can write a small Unix shell script that does the job. The following script will print the name of a file, line number, and misspelled words. The output's quality depends on that of your system's dictionary.
#!/bin/sh
# Find HTML files
find "$1" -name \*.html -type f |
while read f
do
        # Split file into words
        sed '
# Remove CSS
/<style/,/<\/style/d
# Remove Javascript
/<script/,/<\/script/d
# Remove HTML tags
s/<[^>]*>//g
# Remove non-word characters
s/[^a-zA-Z]/ /g
# Split words into lines
s/[ ][ ]*/\
/g ' "$f" |
        # Remove blank lines
        sed '/^$/d' |
        # Sort the words
        sort -u |
        # Print words not in the dictionary
        comm -23 - /usr/share/dict/words >/tmp/spell.$$.out
        # See if errors were found
        if [ -s /tmp/spell.$$.out ]
        then
                # Print file, number, and matching words
                fgrep -Hno -f /tmp/spell.$$.out "$f"
        fi
done
# Remove temporary file
rm -f /tmp/spell.$$.out
I highly recommend Inspyder InSite. It is commercial software, but they have a trial available and it is well worth the money. I have used it for years to check the spelling of client websites. It supports automation/scheduling and can integrate with CMS custom word lists. It is also a good way to check links and can generate reports.
You could do this with a shell script combining wget with aspell. Did you have a programming environment in mind?
I'd personally use python with Beautiful Soup to extract the text from the tags, and pipe the text through aspell.
If it's a one-off, then given the number of pages to check it might be worth considering something like spellr.us, which would be a quick solution. You can enter your website URL on the homepage to get a feel for how it would report spelling mistakes.
http://spellr.us/
but I'm sure there are some free alternatives.
Use templates (well) with your webapp (if you're programming the site instead of just writing HTML), plus an HTML editor that includes spell checking. Eclipse does, for one.
If that's not possible for some reason... yeah, wget to download the finished pages, and something like this:
http://netsw.org/dict/tools/ispell-html-mode.patch
We use the Telerik RAD Spell control in our ASP.NET applications.
Telerik RAD Spell
You may want to check out a library like jspell.
I made an English-only spell checker with Ruby here: https://github.com/Vinietskyzilla/fuzzy-wookie
Try it out.
Its main deficiency is the absence of a thorough dictionary that includes all forms of each word (plural, not just singular; 'has', not just 'have'). Substituting your own dictionary, if you can find or make a better one, would make it really awesome.
That aside, I think the simplest way to spell check a single web page is to press ctrl+a (or cmd+a) to select all text, then copy and paste it into a multiline text box on a web page. (For example <html><head></head><body><textarea></textarea></body></html>.) Your browser should underline any misspelled words.
@Anthony Roy I've done exactly what you've done: piped the page through Aspell via PyEnchant. I have English dictionaries (GB, CA, US) in use at my site https://www.validator.pro/. Contact me and I will set up a one-time job for you to check 1000 pages or more.