I know that spellcheckers are not perfect, but they become more useful as the amount of text increases. How can I spell check a site that has thousands of pages?
Edit: Because of complicated server-side processing, the only way I can get the pages is over HTTP. Also it cannot be outsourced to a third party.
Edit: I have a list of all of the URLs on the site that I need to check.
Lynx seems to be good at getting just the text I need (body content and alt text) and ignoring what I don't need (embedded Javascript and CSS).
lynx -dump http://www.example.com
It also lists all URLs (converted to their absolute form) in the page, which can be filtered out using grep:
lynx -dump http://www.example.com | grep -v "http"
The URLs could also be local (file://) if I have used wget to mirror the site.
I will write a script that will process a set of URLs using this method and output each page to a separate text file. I can then use an existing spellchecking solution to check the files (or a single large file combining all of the small ones).
This will ignore text in title and meta elements. These can be spellchecked separately.
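For example, a rough sketch of that script (urls.txt and the pages/ directory are just assumed names):
#!/bin/sh
# Dump each URL as plain text with lynx, dropping the link list
mkdir -p pages
n=0
while read -r url
do
    n=$((n + 1))
    lynx -dump "$url" | grep -v "http" > "pages/page-$n.txt"
done < urls.txt
# Then run an existing spellchecker over the results, e.g.:
# cat pages/*.txt | aspell list | sort -u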
Just a few days ago I discovered Spello, a web site spell checker. It uses my NHunspell (OpenOffice spell checker for .NET) library. You can give it a try.
If you can access the site's content as files, you can write a small Unix shell script that does the job. The following script will print the name of a file, line number, and misspelled words. The output's quality depends on that of your system's dictionary.
#!/bin/sh
# Find HTML files
find "$1" -name \*.html -type f |
while read f
do
# Split file into words
sed '
# Remove CSS
/<style/,/<\/style/d
# Remove Javascript
/<script/,/<\/script/d
# Remove HTML tags
s/<[^>]*>//g
# Remove non-word characters
s/[^a-zA-Z]/ /g
# Split words into lines
s/[ ][ ]*/\
/g ' "$f" |
# Remove blank lines
sed '/^$/d' |
# Sort the words
sort -u |
# Print words not in the dictionary
comm -23 - /usr/share/dict/words >/tmp/spell.$$.out
# See if errors were found
if [ -s /tmp/spell.$$.out ]
then
# Print file, number, and matching words
fgrep -Hno -f /tmp/spell.$$.out "$f"
fi
done
# Remove temporary file
rm -f /tmp/spell.$$.out
I highly recommend Inspyder InSite. It is commercial software, but they have a trial available and it is well worth the money. I have used it for years to check the spelling of client websites. It supports automation/scheduling and can integrate with CMS custom word lists. It is also a good way to link-check and can generate reports.
You could do this with a shell script combining wget with aspell. Did you have a programming environment in mind?
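For example, a minimal sketch using aspell's HTML filter mode (the URL is a placeholder):
# dump a page and list the words aspell doesn't recognise
wget -q -O - http://www.example.com/page.html | aspell --mode=html list | sort -u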
I'd personally use python with Beautiful Soup to extract the text from the tags, and pipe the text through aspell.
If it's a one-off, then given the number of pages to check it might be worth considering something like spellr.us, which would be a quick solution. You can enter your website URL on the homepage to get a feel for how it would report spelling mistakes.
http://spellr.us/
But I'm sure there are some free alternatives.
Use templates (well) with your webapp (if you're programming the site instead of just writing html), and an html editor that includes spell-checking. Eclipse does, for one.
If that's not possible for some reason... yeah, wget to download the finished pages, and something like this:
http://netsw.org/dict/tools/ispell-html-mode.patch
We use the Telerik RAD Spell control in our ASP.NET applications.
Telerik RAD Spell
You may want to check out a library like jspell.
I made an English-only spell checker with Ruby here: https://github.com/Vinietskyzilla/fuzzy-wookie
Try it out.
Its main deficiency is the absence of a thorough dictionary that includes all forms of each word (plural, not just singular; 'has', not just 'have'). Substituting your own dictionary, if you can find or make a better one, would make it really awesome.
That aside, I think the simplest way to spell check a single web page is to press ctrl+a (or cmd+a) to select all text, then copy and paste it into a multiline text box on a web page. (For example <html><head></head><body><textarea></textarea></body></html>.) Your browser should underline any misspelled words.
@Anthony Roy I've done exactly what you've done: piped the page through Aspell via pyenchant. I have English dictionaries (GB, CA, US) for use at my site https://www.validator.pro/. Contact me and I will set up a one-time job for you to check 1000 pages or more.
Related
Like many others on SO, I am not from a hardcore dev background - far more ops. Therefore I find myself struggling with something like this which I guess belongs very much here.
Requirement - I want to easily test large (1000-50000) batches of URL Redirections. Closer to the former.
Inputs I want to give
Source URL & Target URL
Outputs I want to... umm get out
Pass/Fail
HTTP Response Code
Bonus points for:
Using real browsers (Selenium et al), as a very small proportion of redirects are done in JS. Very small.
Being able to choose if the Target URL equates to the first redirect or the penultimate one.
Being able to easily change HTTP headers (though happy to inject those with Fiddler etc.)
Ideas I have currently
Bash script calling curl. I can do this but my problems are being able to make it scale i.e. parsing a csv input rather than manually editing a script. Also it doesn't cover the JS redirects (no dealbreaker). Seems like the easiest option.
Selenium IDE script. I probably could write the script but again struggling to scale it to even 10 URLs. Probably have to parse a CSV to create each script and then feed those into the command line runner and then capture the output.
Screaming Frog. I actually really love this tool and it can test redirections in bulk. However it has no concept of pass/fail. So close to being a one-stop shop. Also the free version doesn't follow redirection chains (i.e. -L in curl)
Just seems like one of those problems others must have had and tackled in a mainstream/easier way that I have thought of. Thanks in advance to anyone that can help.
One solution:
csv:
http://google.com;http://www.google.fr
http://domain.null;http://www.domain.null
code:
#!/bin/bash
while IFS=";" read -r url1 url2; do
ret=$(curl -s -o /dev/null -w "%{http_code}\n" "$url1")
((ret >= 200 && ret <= 400)) && echo 'url1 PASS' || echo 'url1 FAIL'
echo "url2 $(curl -s -L -o /dev/null -w "%{http_code}\n" "$url2")"
done < csv
If you need to know the real URL you are redirected to (or not), use:
curl -L -s -o /dev/null http://google.fr -w "%{url_effective}\n"
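And a sketch that builds on this to print an explicit pass/fail per row, comparing the effective URL against the expected target (same semicolon-separated CSV as above; note you may need to normalise trailing slashes):
#!/bin/bash
# For each "source;target" row, follow redirects and compare the final URL
while IFS=";" read -r src target; do
    out=$(curl -s -L -o /dev/null -w "%{http_code} %{url_effective}" "$src")
    code=${out%% *}
    final=${out#* }
    if [ "$final" = "$target" ]; then
        echo "PASS $code $src -> $final"
    else
        echo "FAIL $code $src -> $final (expected $target)"
    fi
done < csv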
Feel free to improve to fit your needs.
Big thanks to StardustOne, but I felt that I MUST be reinventing the wheel a little.
Ignoring the requirement to test in-browser and cover JS scenarios, I looked again, and the nicest solution I have found to date was something sent out in the Devops Weekly newsletter:
Smolder - https://github.com/sky-shiny/smolder
I know a member of staff working for one of our vendors was also working on a similar app that wrapped the Python requests library and I plan to back-to-back the two soon to find out which is actually best. I'll post back if his effort beats Smolder!
We do a lot of email marketing, and sometimes developers will put the HTML file out on the image server (I know the easy answer is to not do this), but those HTML files end up getting indexed by Google and eventually rank high in search results, which in turn makes the SEO companies want us to remove these pages. Is it possible to have Google not index anything from our subdomain? We have image.{ourUrl}.com where we put all these files.
Would putting a robots.txt file in the main directory do it? Or would we need to add that robots.txt file to every directory?
Is there an easy way to blanket this?
A robots.txt file would just stop crawling; files might still be indexed. A noindex directive would work, and you could use an X-Robots-Tag header. See here: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
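For instance, assuming Apache with mod_headers enabled on the image subdomain, a directive like Header set X-Robots-Tag "noindex" in that vhost would cover every file served from it, and you can verify it from the command line (placeholder URL):
curl -sI http://image.example.com/newsletter.html | grep -i "x-robots-tag"
# expected: X-Robots-Tag: noindex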
I started studying Pelican today because I want to move my blog from WordPress to Pelican.
However, after reading the docs, I still don't know the difference between pelican ./content and make html. They both seem to generate a static website. Besides, pelican ./content always returns a UnicodeDecodeError for me, while make html does not.
What's the difference between them and why?
In the folder where you use $ pelican-quickstart, you will find a file named Makefile.
You will find a target like this:
html: clean $(OUTPUTDIR)/index.html
and a rule like this:
$(OUTPUTDIR)/%.html:
    $(PELICAN) $(INPUTDIR) -o $(OUTPUTDIR) -s $(CONFFILE) $(PELICANOPTS)
This file shows you what Pelican does when you type make <target>, and you can configure many other things in it.
pelican ./content runs the generation of the website using defaults and trying to guess the location of your content, output and configuration files.
make html calls pelican, but explicitly gives it the input directory, the output directory, the configuration file and, optionally, some extra options.
Basically, make html (along with make regenerate) is a convenience target that makes the job a bit easier for you. In any case, you should run make publish to generate the content that is to be uploaded to your web server, as it loads the publishconf.py file, which defines a few extra options (e.g. the RSS feeds) and allows you to change settings for the "proper" website.
The website of my university unfortunately does not provide feeds, but they keep publishing information there that is important for me (deadlines, dates of exams, etc.) as links to PDFs in a certain section of the site.
How can I regularly scrape that section of the site and be notified (Growl, mail, or something similar)?
Normally I would use wget to mirror it, but how can I extract only parts of the website?
Is there a CLI tool that can extract the XHTML via XPath or similar?
Try this:
wget --spider --server-response http://example.com
This will print the response headers, which might contain a "Content-Length" value. If it changes, you can notify yourself.
Edit: if it changes, you can download the whole HTML file and grep for a PDF link or whatever you want to look for (maybe for "<div id='news'>(.*?)</div>").
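A rough sketch of that idea (the URL and state file are placeholders; swap the final echo for your mail or Growl command):
#!/bin/sh
url="http://example.com/important-section.html"
state="/tmp/last_length"
# wget prints the response headers on stderr, so redirect them and pull out Content-Length
new=$(wget --spider --server-response "$url" 2>&1 | awk 'tolower($1) == "content-length:" { len = $2 } END { print len }')
old=$(cat "$state" 2>/dev/null)
if [ "$new" != "$old" ]; then
    echo "$new" > "$state"
    echo "Page changed (length $old -> $new), go and check it"
fi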
Mmm... You should take a look at QueryPath. QueryPath makes it easy to parse HTML. What if the HTML structure changes? What if you want specific elements of the page? QueryPath does the hard work for you. Do you like jQuery? QueryPath is like the jQuery of PHP.
See: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP
See: http://querypath.org/
You might be interested in looking at Pjscrape (disclaimer: this is my project). It's a web-scraping tool built on PhantomJS, giving you full jQuery access to the page in a headless Webkit browser context. It makes it very easy to pull semi-structured data from webpages via the command line, particularly if the page you're scraping has a consistent structure for new elements.
For example, you can pull all the course titles from this course catalog with the following code:
pjs.addScraper(
// the page you're scraping
'http://www.ischool.berkeley.edu/courses/catalog',
// selector for elements you want to pull text from
'.views-row .views-field-title'
);
// suppress STDOUT logging
pjs.config('log', 'none');
Running this from the command line gives you JSON to STDOUT by default:
~> phantomjs /path/to/pjscrape.js my_script.js
["W10. Introduction to Information","24. Freshman Seminar", ...]
So it would be pretty simple to run this script on a regular basis, capture the output in a file, and then alert you when the new output doesn't match the previous scrape. You can also write your own scraper functions, so there's a lot of flexibility for more complex scraping if a simple selector won't do the trick.
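A sketch of that "compare with the previous scrape" wrapper, suitable for cron (paths and the notification are placeholders):
#!/bin/sh
new=/tmp/scrape_new.json
old=/tmp/scrape_old.json
phantomjs /path/to/pjscrape.js my_script.js > "$new"
if ! diff -q "$old" "$new" >/dev/null 2>&1; then
    echo "Scraped content changed"   # replace with mail, Growl, etc.
    cp "$new" "$old"
fi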
I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Forbidden". But the problem is these websites are different: some are static and some dynamic (HTML, CGI, PHP, Java applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will only get you so far, but won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return 401. Looking for links like "log in" or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation and write a little program yourself that reads the list of target sites from a file, checks each one, and writes to one file of "these are definitely passworded" and another of "these are not". Then you might want to go manually check the ones that are not, and make modifications to your program to accommodate them. Using httrack is great for grabbing data, but it's not going to help with detection -- if you write your own "check for password protected area" program with a general-purpose HLL, you can do more checks, and you can avoid generating more requests per site than would be necessary to determine that a password-protected area exists.
You may need to ignore robots.txt
I recommend using the Python port of Perl's Mechanize, or whatever nice web automation library your preferred language has. Almost all modern languages will have a nice library for opening and searching through web pages, and looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
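For what it's worth, a very rough sketch of that classification pass using curl (it only looks at the landing page, so it will miss login pages buried deeper; sites.txt and the output file names are assumptions):
#!/bin/sh
# Classify each site based on its landing page: HTTP 401 or a password field
while read -r site
do
    body=/tmp/body.$$
    code=$(curl -s -L -o "$body" -w "%{http_code}" "$site")
    if [ "$code" = "401" ] || grep -qi 'type="password"' "$body"; then
        echo "$site" >> passworded.txt
    else
        echo "$site" >> not_passworded.txt
    fi
done < sites.txt
rm -f /tmp/body.$$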
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form could be found within two links of the home page. Almost all ecommerce sites, web apps, etc. have login forms that are accessed just by clicking on one link on the home page, but another layer or even two of depth would almost guarantee that you didn't miss any.
I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
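A sketch of that kind of invocation (assuming httrack's -O output, -r depth, -c connection and -A rate-limit options; double-check them against httrack --help for your version), followed by a search of the mirror for password fields:
# Mirror up to 3 links deep, 1 connection, throttled to roughly 25 KB/s
httrack "http://www.example.com/" -O ./mirror -r3 -c1 -A25000
# Then look for password inputs in what was downloaded
grep -ril 'type="password"' ./mirror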
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt
This will cause wget to download the entire site recursively, but only download files with the endings listed in the -A option; in this case, it tries to avoid heavy files. Note that wget writes the server responses requested by -S to stderr, hence the 2> redirection.
The headers will be directed to the file output_yoursite.txt, which you can then parse for the status code 401, which means that part of the site requires authentication. You can also parse the downloaded files according to Konrad's recommendation.
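For example, something along these lines would show whether any 401 status lines ended up in the log (the exact status-line format depends on the server):
grep -n "HTTP/.* 401" output_yoursite.txt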
Looking for 401 codes won't reliably catch them as sites might not produce links to anything you don't have privileges for. That is, until you are logged in, it won't show you anything you need to log in for. OTOH some sites (ones with all static content for example) manage to pop a login dialog box for some pages so looking for password input tags would also miss stuff.
My advice: find a spider program that you can get the source for, add in whatever tests (plural) you plan on using, and make it stop at the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type), and can work with more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.