Okay, so here is the problem.
Q. How do you spell check an entire web site?
Now, in true reductionist programmer's spirit, you can ask a simpler question: how do I spell check one web page? If you are on a Linux machine, the answer is pretty simple:
$ links -dump URL | spell
links is a text-mode browser. From the links man page, on the -dump option:
Write formatted document to stdout
So links -dump spares us the big headache of parsing all that HTML to produce text output. If you are working with Java, you can use the Swing toolkit to do a similar trick, and there is also a Google GData class that can convert HTML to text. However, links is neat, and I have no intention of doing programmatically what I can get ready-made from a tool. I am here just to write glue scripts and be done.
Now we have to solve the problem of getting a list of links from a starting page (or host base address). Here again YMMV, but I decided to use Perl's WWW::Mechanize module. You can also use HTML::LinkExtor, or fetch the page and pull out the links with sed or a regex, or whatever. The point is that you can get a list of web pages from a starting address, feed each page to links to get a text dump, and feed that dump to spell.
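Whichever tool extracts the links, the raw list usually contains duplicates and links that point off the site. Here is a minimal sketch of a filter for that, assuming one absolute URL per line on stdin. The function name same_site_links is made up for illustration, and the base address should end in a slash so that a host such as example.com does not also match example.com.other.org.

```shell
# same_site_links BASE
# Read one URL per line on stdin, keep only the URLs that start with BASE,
# and drop duplicates. (Hypothetical helper, not part of any tool.)
same_site_links() {
    base=$1
    while IFS= read -r url; do
        case $url in
            # the quoted part is matched literally, the * matches the rest
            "$base"*) printf '%s\n' "$url" ;;
        esac
    done | sort -u
}
```

It can sit between the link extractor and whatever consumes the list, for example: perl web-links.pl "$1" | same_site_links "$1"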
This solution is held together with duct tape and existing tools, in true Unix fashion, and it totally works. You can be done spell checking an entire website without spending much effort of your own.
rjha @mbp ~/code/misc $ cat web-links.pl
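use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );
# starting page (or host base address) comes from the command line
my $base = $ARGV[0] or die "usage: perl web-links.pl URL\n";
# fetch the starting page; find_all_links works on the page fetched last
$mech->get($base);
# collect every anchor tag on the page
# @todo only consider urls with converse in them (e.g. url_regex => qr/converse/)
my @links = $mech->find_all_links( tag => 'a' );
for my $link (@links) {
    # url_abs resolves relative links so that links -dump can fetch them
    printf "%s\n", $link->url_abs;
}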
The Perl script above prints all the links, one per line. The shell script below redirects them into a file, then reads them back and feeds each one to links -dump and spell.
rjha @mbp ~/code/misc $ cat spell.sh
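# scratch file that holds the list of links; the name is arbitrary
FILE=/tmp/links.txt
perl web-links.pl "$1" > "$FILE"
while read -r line
do
    # print the page address, then the misspelled words found on it
    echo "$line"
    links -dump "$line" | spell
done < "$FILE"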
That's it! First we learnt to spell check a web page, and then we extended the scheme to spell check every page linked from a starting page of a web site. Spelling errors can look unprofessional on a website, and with this trick you can at least catch the obvious ones!
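One follow-up in the same glue-script spirit: spell will flag product names and jargon on every page, so the combined output gets noisy. Here is a rough sketch, assuming the misspelled words arrive one per line on stdin and that you keep a plain text file of words to ignore. The function name tally_misspellings and the allow-list file are my own inventions, not part of spell.

```shell
# tally_misspellings ALLOWFILE
# Read spell output (one word per line) on stdin, drop every word that is
# listed in ALLOWFILE, and print "count word" lines, most frequent first.
# (Hypothetical helper; ALLOWFILE holds one ignored word per line.)
tally_misspellings() {
    # -v invert, -x whole-line match, -F fixed strings, -f patterns from file
    grep -v -x -F -f "$1" |
        awk '{ count[$0]++ } END { for (w in count) print count[w], w }' |
        sort -k1,1nr -k2,2
}
```

To use it, collect the spell output of all pages into one stream and pipe it through, for example: links -dump "$line" | spell >> words.txt inside the loop, and then tally_misspellings ignore.txt < words.txt once the loop is done.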