Thursday, December 29, 2011

Reddit pagination scheme and addressing concerns about MySQL performance and SEO

Pagination of search results or of blog pages is a simple technique, yet you need to be aware of a few performance concerns, especially if your backend is an RDBMS like MySQL. The simplest scheme goes like this:


  • Get a page size
  • Get a page offset (or page number)
  • Sort on some column
  • Use LIMIT with an offset


The queries you form look like this:

SELECT post.title, post.summary FROM t_post post ORDER BY post.created_on DESC LIMIT 100, 20

where 100 is the offset and 20 is the result size. The problem with such queries is the offset. Performance is fine for small offsets, but for bigger offsets (the first number after LIMIT) MySQL still has to examine and discard all the rows before the offset, even with an index. Naturally that has a performance impact. Check the LIMIT performance optimization article on the MySQL Performance Blog.

So what to do? The remedy is to drop the offset from the queries. You needed the offset to locate the starting row (a pointer into your results), so you need something else to locate the range of rows for you. The solution is to use a WHERE clause to locate the rows. You should check this nice presentation by Surat Bhati and Rick James from the Percona 2009 conference.

www.percona.com/ppc2009/PPC2009_mysql_pagination.pdf 

To locate the right range using a WHERE clause you need a column that can sort your data. For date-based sorting, an auto-increment key is a nice proxy.
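As a minimal sketch, assuming t_post has an auto-increment primary key id and that the last row on the previous page had id 1020 (a made-up value), the next page becomes:

SELECT post.title, post.summary FROM t_post post WHERE post.id < 1020 ORDER BY post.id DESC LIMIT 20

MySQL can now seek straight to that id using the primary key index, instead of scanning and throwing away the first 100 rows.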

SEO concerns

Suppose you had 3 pages and you served rows 1-20 from /page/3, rows 21-40 from /page/2 and rows 41-60 from /page/1. After inserting one more "page of data", /page/1 will fetch records 61-80, /page/2 will fetch 41-60, /page/3 will fetch 21-40 and so on. The problem here is that every page has changed as a result of inserting "one more page". You can run into crawler indexing issues if that happens (every day, every URL on your site is changing).

A better scheme is to use a script with query parameters to generate such summary pages (check Slashdot), or to use static pagination URLs (check Digg).

Also, it may not be advisable to paginate ad infinitum; you can refuse to show more results after the user has browsed a certain number of pages (deep into your site).

Page link rendering concerns 


You will typically see two schemes. The first is a simple Next and Previous based scheme, and the second renders links to all the individual pages (1, 2, 3 ... 10) along with a total count. I believe the modern way is to use just the Previous and Next links. Displaying the total number of records is an absolute dud; a user would any day prefer to locate results directly via search rather than by browsing.

Reddit site pagination

Now, armed with all this knowledge, let's see why Reddit uses a pagination scheme like this (URLs of the type http://www.reddit.com/?count=25&after=t3_nu9we).

I know the Reddit source code is out there in the open, but I am not going to read it. Let's take an educated guess here. Most likely the after parameter is in base36, and you can just use the PHP base_convert function to see the value in decimal. This number is used in a WHERE clause to locate the start row, with a standard page size.

count keeps track of how deep you are, and at some point the site will refuse to paginate further. Again, I am not going to click link after link, but you can do the verification :D
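As a rough sketch of that guess (the table and column names below are hypothetical, not Reddit's actual schema or code):

<?php
// "after" arrives as something like t3_nu9we: a type prefix plus a base36 id
$after = "t3_nu9we";
list($type, $id36) = explode("_", $after);

// convert the base36 id to decimal, e.g. base_convert("nu9we", 36, 10)
$afterId = base_convert($id36, 36, 10);
$pageSize = 25;

// locate the next page with a WHERE clause instead of an OFFSET
$sql = sprintf("SELECT id, title FROM t_link WHERE id < %d ORDER BY id DESC LIMIT %d",
               $afterId, $pageSize);
echo $sql, "\n";

The same trick works in reverse for a "previous" link: flip the comparison to > and the sort order to ASC.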


Saturday, December 24, 2011

Spell check an entire web site

Okay, so here is the problem:
Q. How do you spell check an entire web site?


Now, in true reductionist programmer's spirit, you can ask a simpler question: how do I spell check one web page? If you are on a Linux machine, the answer is pretty simple:




 $ links -dump <url> | spell


links is a text browser, and from the links man page:


-dump


              Write formatted document to stdout


So links -dump solves the big headache of parsing all that HTML to produce text output. If you are working with Java, you can use the Swing toolkit to do a similar trick. There is also a Google GData class that can do HTML to text. However, links is neat, and I have no intention of programmatically doing what I can get ready-made from a tool. I am here just to write glue scripts and be done.


Now we have to solve the problem of getting a list of links from a starting page (or host base address). Here again YMMV, but I decided to use the Perl WWW::Mechanize module. You could also use LinkExtor, or fetch the page and parse it with sed, or do regex matching, or whatever. The point is, you can get a list of web pages from a starting address, feed those pages to links, get a text dump and feed that to spell.


This solution is pure duct tape, utilizing existing tools in true Unix fashion. However, you can be done spell checking an entire website without spending much effort on your side.



rjha @mbp ~/code/misc $ cat web-links.pl 
use strict;
use warnings;
use WWW::Mechanize;

# fetch the starting page and print every <a> link found on it
my $mech = WWW::Mechanize->new( autocheck => 1 );
my $base = $ARGV[0];
die "usage: perl web-links.pl <base-url>\n" unless defined $base;

$mech->get($base);
my @links = $mech->find_all_links( tag => 'a' );

for my $link (@links) {
    printf "%s\n", $link->url;
}



The above Perl script will dump all the links into a file, from where a shell script will pick them up and feed them to links -dump and spell.


rjha @mbp ~/code/misc $ cat spell.sh 
#!/bin/bash
# read each link dumped by web-links.pl and spell check its text dump
FILE=tmplinks
perl web-links.pl "$1" > "$FILE"

while read line
do
    echo "$line"
    links -dump "$line" | spell
    echo
done < "$FILE"

That's it! First we learnt to spell check a single web page, and then we extended the scheme to spell check all the pages of a web site. Spelling errors can look unprofessional, and with this trick you can make sure you catch the obvious ones!

Friday, December 23, 2011

Compiling PHP 5.3.7 with FPM on Mac OS X Lion

My work machine is a new MBP 13 with Mac OS X Lion. This machine had PHP 5.x installed from the factory, but I wanted to compile the latest PHP (5.3.x). Here are the required steps. One thing you have to be aware of is that the prerequisites vary depending on which modules you want to bake in. I wanted mcrypt and GD2; GD2 depends on libpng and libjpeg. Also, I wanted to use the MySQL native driver.
Before compiling, set the compiler flags to 64-bit x86. Download the required libraries (libmcrypt, libpng, libjpeg) from their respective sites and do a configure/make sequence for each.

Install dependencies
  • php-5.3.7.tar.gz
  • libmcrypt-2.5.8.tar.gz
  • libpng-1.5.4.tar.gz
  • jpegsrc.v8c.tar.gz
Use the following commands to compile and install each dependency (export CFLAGS so that the configure script picks it up):

 export CFLAGS="-arch x86_64"
 echo $CFLAGS
 ./configure --prefix=/usr/local
 make
 make install
MySQL was already installed on Mac OS X Lion and I went with the default install. After compiling and installing the dependencies, we compile the PHP tarball. Note that we supply the required library paths to the configure script, and use the --enable-fpm switch to bake in the FPM module.

 ./configure --prefix=/usr/local --enable-fpm --with-mysql=mysqlnd \
     --with-mysqli=mysqlnd --with-openssl --with-pcre-regex --with-mhash \
     --with-curl --with-mcrypt=/usr/local/lib --with-gd \
     --with-png-dir=/usr/local --with-jpeg-dir=/usr/local

After a successful install, check the modules using $ php -m. CLI, CGI, JSON etc. are enabled by default in the 5.3.7 source bundle. Next, install nginx from source; nginx normally has only one dependency, PCRE. Now both nginx and PHP with FPM are installed.

 PHP layout

After a successful install, the PHP executable is in /usr/local/bin and php.ini is in /usr/local/lib. php-fpm is in /usr/local/sbin and the php-fpm config files are in /usr/local/etc.

Start/stop NGINX, php-fpm and mysql
sudo /usr/local/sbin/nginx -- to start nginx
sudo /usr/local/sbin/nginx -s stop -- to stop nginx
sudo /usr/local/sbin/php-fpm -- to start php-fpm
sudo kill `cat /usr/local/var/run/php-fpm.pid` -- to stop php-fpm
sudo /usr/local/mysql/bin/mysqld_safe -- to start mysql
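To actually serve PHP through nginx, you still have to point nginx at php-fpm. Here is a minimal sketch of the relevant server block, assuming php-fpm is listening on its default 127.0.0.1:9000 and a hypothetical document root of /usr/local/www (adjust to your setup):

server {
    listen 80;
    server_name localhost;
    root /usr/local/www;            # hypothetical document root
    index index.php index.html;

    # hand every .php request to the php-fpm process started above
    location ~ \.php$ {
        fastcgi_pass 127.0.0.1:9000;
        fastcgi_index index.php;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        include fastcgi_params;
    }
}

Drop a phpinfo() file into the document root and hit it in the browser to verify the whole chain.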