Saturday, December 24, 2011

Spell check an entire web site

Okay, so here is the problem.
Q. How do you spell check an entire web site?

Now, in true reductionist programmer's spirit, you can ask a simpler question: how do I spell check one web page? If you are on a Linux machine, the answer is pretty simple:




 $ links -dump URL | spell


links is a text browser. From the links man page:

-dump
              Write formatted document to stdout


So links -dump takes care of a big headache: parsing all that HTML to produce text output. If you are working with Java, you can use the Swing toolkit to do a similar trick. There is also a Google GData class that can convert HTML to text. However, links is neat, and I have no intention of doing programmatically what I can get ready-made from a tool. I am here just to write glue scripts and be done.
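If you want to reuse the one-liner, you can wrap it in a small shell function. This is only a sketch: the name page_typos is made up for illustration, and the sort -u step is my addition, there to collapse repeated words in case your local spell prints the same word more than once.

```shell
#!/bin/bash
# page_typos is a hypothetical helper name, not part of links or spell.
# It prints the words that spell flags on one page, one word per line.
page_typos() {
    local url=$1
    # links -dump renders the page as plain text on stdout,
    # spell reads that text and prints the words it does not know,
    # sort -u (an addition to the original one-liner) drops repeats.
    links -dump "$url" | spell | sort -u
}
```

Calling page_typos with a page address should then list each suspect word once, assuming both links and spell are installed.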


Now we have to solve the problem of getting a list of links from a starting page (or host base address). Here again YMMV, but I decided to use the Perl WWW::Mechanize module. You can also use HTML::LinkExtor, or fetch the page and parse it with sed, or do regex matching, or whatever. The point is that you can get a list of web pages from a starting address, feed each page to links to get a text dump, and feed that dump to spell.


This solution is held together with duct tape and built from existing tools, in true Unix fashion. The upside is that you can spell check an entire website without spending much effort yourself.



rjha @mbp ~/code/misc $ cat web-links.pl 
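use strict;
use warnings;
use WWW::Mechanize;

# The starting page (or host base address) is the first argument.
my $base = $ARGV[0];
die "usage: perl web-links.pl URL\n" unless defined $base && length $base;

# autocheck => 1 makes Mechanize die if a request fails.
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get($base);

# Collect every <a> link on the page.
my @links = $mech->find_all_links( tag => 'a' );

for my $link (@links) {
    # url_abs resolves relative hrefs against the page address,
    # so each printed line can be handed to links -dump as-is.
    printf "%s\n", $link->url_abs;
}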



The Perl script above prints every link it finds to stdout. A shell script redirects that output into a file, then reads the file line by line and feeds each link to links -dump and spell.


rjha @mbp ~/code/misc $ cat spell.sh 
#!/bin/bash
# Usage: ./spell.sh URL
FILE=tmplinks
perl web-links.pl "$1" > "$FILE"

# Print each link, then the words spell flags on that page.
while read -r line
do
    echo "$line"
    links -dump "$line" | spell
    echo
done < "$FILE"
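The loop above checks every line in the file as-is, so a page linked twice gets checked twice, and links to other sites get checked too. If that bothers you, here is a minimal sketch of a filter. The name same_site_links is hypothetical, and it assumes the list holds absolute URLs, one per line.

```shell
#!/bin/bash
# same_site_links is a hypothetical helper: it reads URLs on stdin and
# prints, once each, those that start with the given base address.
same_site_links() {
    local base=$1
    # $1 inside awk is the first whitespace-separated field of the line,
    # so stray trailing spaces are ignored.
    # index(...) == 1 keeps URLs that begin with the base string;
    # !seen[$1]++ is true only the first time a URL shows up.
    awk -v base="$base" 'index($1, base) == 1 && !seen[$1]++ { print $1 }'
}
```

One way to use it would be to pipe the output of web-links.pl through same_site_links "$1" before redirecting to the temp file. Anything that does not start with the base address, such as mailto: links, gets dropped along the way.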

That's it! First we learnt to spell check a single web page, and then we extended the scheme to spell check all the pages linked from a site's starting page. Spelling errors look unprofessional on a website, and with this trick you can at least catch the obvious ones!