
The Perl Journal

Volumes 1–6 (1996–2002)

Code tarballs available for issues 1–21.

I reformatted the CD-ROM contents. Some things may still be a little wonky — oh, why hello there <FONT> tag. Syntax highlighting is iffy. Please report any glaring issues.

Jim Woodgate (2001) Using a Cleanup Handler in mod_perl. The Perl Journal, vol 5(5), issue #21, Fall 2001.

Using a Cleanup Handler in mod_perl

Jim Woodgate


In true TMTOWTDI form, when writing Web applications you often need to come up with alternate ways to get things done. For example, if you're writing a "traditional" Perl program and you want to do some work on the machine without making the client wait, you have a variety of choices. You could cause your program to fork, and do some work in the forked process. You could write some data to a well-known directory and have a crontab entry pick up this data and run something in the background. In a Web application, however, there can be barriers to these types of approaches. For example, you might not know which machine your script will run on; thus, you won't know whether a crontab is running there or not.

This article describes another way to do some work for a client, without forcing the client to wait and by using the same httpd process (no forking!) that serviced the client connection.

I have a module on CPAN called Apache::Album. Like most modules, this one started out with the phrase, "I could probably automate that in about 5 minutes." The only thing more certain than having it take more than 5 minutes to write was knowing that once written, the words, "that's really cool, but what I really want is..." were sure to follow.

I had been helping a friend (who recently bought a digital camera) write a Perl script that would create thumbnails for all his pictures. He would upload a bunch of pictures, then run a Perl script that would generate the thumbnails along with an HTML page for viewing. I was reading the Eagle book at the time (Writing Apache Modules with Perl and C by Lincoln Stein and Doug MacEachern, O'Reilly & Associates, ISBN 1-56592-567-X) and convinced him that a mod_perl module would be much nicer.

Generating Thumbnails

The module is pretty straightforward. When someone visits a virtual Web page, the module maps that to a physical directory, checks whether thumbnails exist, checks whether the thumbnail is older than the current image, then creates a Web page for viewing it all. I added a variety of "features" -- multiple directories, captions, file uploads, etc. One of these features was to generate not only thumbnails, but also "small" (maximum width 640 pixels), "medium" (maximum width 800 pixels) and "large" (maximum width 1024 pixels) images. Like the thumbnails, these would be real files. I decided to use "real" files rather than shrink the file at the browser or shrink the file dynamically for each viewing. This was done on the assumption that disk space was cheaper than bandwidth or the user's time. Naturally, the first person to view a new page where a lot of pictures were uploaded would have to wait, but everyone after the initial visit would get a really fast download. Most of the uploaders understood this and, after uploading a bunch of files, would prime the system by viewing the Web page right away.
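
For illustration, the freshness test boils down to comparing file modification times. Here is a minimal sketch of the idea -- my own, not the module's actual code -- where make_thumbnail is a hypothetical helper:

# Rebuild the thumbnail only if it is missing or older than the source
# image. -M returns a file's age in days, so a larger value means older.
my ($img, $thumb) = ('photo.jpg', 'tn_photo.jpg');
if (!-e $thumb || -M $thumb > -M $img) {
  make_thumbnail($img, $thumb);  # hypothetical helper
}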

The initial thumbnail generation wasn't too bad; on a decent machine, it took a few minutes -- 5 at the most. The small, medium, and large pictures, however, slowed things down significantly. To ensure good quality, a fresh copy was read and sampled for each picture. Five minutes turned into 20, and that was way too long. The user simply needed to start looking at the thumbnails; the other pictures needed to be generated offline.

Readers of the mod_perl mailing list will know that forking (the traditional approach to running things in the background) with Apache can be tricky. Apache likes to do the forking and signal handling itself, and adding another 6- or 7-MB httpd process seemed a bit much. I thought a lazy evaluation scheme might be fairly easy: the thumbnails would be generated, and 0-length placeholders would be created for the other sizes. When one of the 0-length files was requested, the correct sampling would be done and sent. (The file would also be saved, so subsequent users accessing the same picture wouldn't have to wait.) Most people I talked to didn't like that approach, so I searched for something better. I found an even cleaner solution -- cleanup handlers!
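
For the record, the rejected lazy scheme would have looked roughly like this (my sketch, not code from the module; resize_and_save is a hypothetical helper):

my ($original_file, $resized_file) = ('photo.jpg', 'small_photo.jpg');

# At thumbnail time, create a 0-length placeholder for each extra size.
open(PLACEHOLDER, ">$resized_file")
  or die "Could not create $resized_file: $!\n";
close PLACEHOLDER;

# At request time, a 0-length file (-z) means the resize hasn't happened
# yet; do it now and save the result, so subsequent visitors get the
# real file immediately.
if (-z $resized_file) {
  resize_and_save($original_file, $resized_file);
}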

mod_perl lets you register a handler for the cleanup phase of Apache. This phase occurs after the client has the Web page and even after the access has been logged. Its main purpose is to clean up things like old filehandles and database connections, but there is no artificial limit on what can be done. With cleanup handlers, the solution to my problem is easy: for each picture that needs resizing, add it to a list; after the page is generated, register a cleanup handler to create all the additional pictures. The basic algorithm is like this:

my @cleanup_subs = ();

while ($more_pictures) {
  ...
  if ($need_resizing) {
    push (@cleanup_subs, sub{&create_final_resize(\@args);});
  }
}

$r->register_cleanup(sub {foreach (@cleanup_subs) {&$_;}})
  if @cleanup_subs;

That's it! I only add a subroutine when the thumbnail doesn't exist. The very first visitor will cause all the thumbnails to be created, then the same httpd process will create all the other sizes of the files. If someone hits the same page a short time later, the module will see the thumbnails and will not attempt to create the other pictures. There's no fork, because the same httpd process that handled the initial client request does the rest of the work as well. As far as Apache is concerned, it's just another working process.

A word of caution to others who use this with Image::Magick -- don't reuse a single Image::Magick object across calls. When you read with Image::Magick, it keeps a copy of each file it Reads in memory. (This is actually a requirement, since the module lets you create animated GIFs.) I received complaints that directories with large numbers of pictures weren't getting all their resized files. After kicking off a new directory and watching the httpd process grow to 128 MB, I switched to creating a fresh Image::Magick object for every call. Now everything works great!
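
In other words, rather than one long-lived object, do something like this (a simplified sketch; the geometry and filenames are illustrative):

use Image::Magick;

my @pictures = glob('*.jpg');
foreach my $file (@pictures) {
  # A fresh object per picture, so images read earlier aren't kept in
  # memory for the life of the httpd process.
  my $image = Image::Magick->new;
  $image->Read($file);
  $image->Scale(geometry => '640x480');
  $image->Write("small_$file");
  # $image goes out of scope here and its memory is freed.
}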

Example

The module is too large for an article, so here is a completely contrived example. If you have purchased goods online, you may have dealt with sites that check your credit card number "while you wait". This may be a good policy, but it's not much fun online. Sometimes you get hangs, and you may never know whether the hang was your connection or your credit card, and you may not know whether you've actually placed an order. Wouldn't it be great if, after hitting "Accept", you got an instant response with an order number, and you received email if something went wrong?

This example will do something similar: you type in your email address, and it will send you some random bytes. Linux boxen (also FreeBSD, OpenBSD, and maybe others) have a really cool device called /dev/random that gives you random bytes. The OS reseeds itself on bootup and uses random events, such as disk timings and mouse/keyboard movements, to generate random values. However, there isn't always enough entropy in the system to give you the amount of random data you need. We can work around that by watching /dev/random for 2 full minutes; as data becomes available, we'll save it off. Users, however, won't have to wait the full 2 minutes. They can simply enter an email address and instantly receive a response that their data is being generated and will be mailed to them.
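
As an aside, on Linux you can watch the kernel's entropy pool directly and see the shortage for yourself (a quick check, separate from the example script):

# /proc/sys/kernel/random/entropy_avail reports how many bits of entropy
# are currently pooled; /dev/random blocks when this runs low.
open(ENTROPY, '/proc/sys/kernel/random/entropy_avail')
  or die "Could not open entropy_avail: $!\n";
chomp(my $bits = <ENTROPY>);
close ENTROPY;
print "Entropy available: $bits bits\n";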

This isn't a mod_perl handler. It's just a basic CGI-style script using Apache::Registry. If called without parameters, it generates some HTML to prompt for an email address. When called with a mail_to parameter, it notifies the client that email will be sent. Note that after that message is displayed, a cleanup handler is registered; then the script exits normally.

$r->register_cleanup(sub {&mail_results($mail_to)});

The mail_results subroutine uses a Perl command seen even less frequently than cleanup handlers in mod_perl -- the four-argument form of select! The algorithm is pretty straightforward: get a filehandle to /dev/random, create a select mask for reading from that filehandle, and set an initial timeout of 2 minutes.

Four-Argument Select

In case you're not familiar with this form of select, it takes four arguments: a read mask, a write mask, an error mask, and a timeout. It's not uncommon to see three undefs for the masks with just the timeout set; that gives you a sleep with finer-than-one-second granularity. However, that's just a side effect of the select command, not its main purpose. Its real purpose is to listen to multiple filehandles at once and block until either there is some activity on one of the filehandles or there is a timeout.
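
Both uses look like this (a standalone sketch, separate from the example script):

# As a sub-second sleep: all three masks undef, only the timeout set.
select(undef, undef, undef, 0.25);  # sleep for a quarter second

# As a multiplexer: block until STDIN is readable or 10 seconds pass.
my ($rin, $rout) = ('', '');
vec($rin, fileno(STDIN), 1) = 1;
my ($nfound, $timeleft) = select($rout = $rin, undef, undef, 10);
print $nfound ? "data ready\n" : "timed out\n";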

The first step is to generate a mask (I like to think of it as a set) of filehandles. In this case, we only have one member of the set -- the RANDOM filehandle. vec sets the bit for that filehandle and creates the mask for us. Then select is called. Note that we must assign $rout=$rin, because select will change this variable to say which filehandle actually caused the select to wake up. In this example, when $found is true we know which filehandle it is, because we only have one; thus, the assignment really isn't necessary. It's always good practice to code things correctly, however, so that when your program changes later you won't have to worry about it breaking in strange places. $rout=$rin doesn't hurt anything, and will protect us later if we decide to block on multiple filehandles. If there were more than one filehandle, we'd have to check $rout to see which filehandle was ready. If $rin were used instead of $rout=$rin, each iteration of the loop would have to use vec to set bits for every filehandle that didn't participate in the previous iteration.
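
With more than one filehandle in the set, the post-select check would look something like this (FH_A and FH_B are hypothetical filehandles, read_a and read_b hypothetical readers):

# Build a read mask containing two filehandles.
my ($rin, $rout) = ('', '');
vec($rin, fileno(FH_A), 1) = 1;
vec($rin, fileno(FH_B), 1) = 1;

my $timeout = 60;  # seconds
my ($nfound) = select($rout = $rin, undef, undef, $timeout);
if ($nfound) {
  # select set the bits in $rout for whichever handles are ready.
  read_a() if vec($rout, fileno(FH_A), 1);
  read_b() if vec($rout, fileno(FH_B), 1);
}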

Typically, you'd want to do a bit more error checking, but as an example this should be fine. The select call will block until either data is ready to be read or the timeout has expired. If there is data waiting, read it, save it off, then reset the timeout to the amount of time left. (Technically, the script doesn't run for exactly 2 minutes; it's 2 minutes of waiting plus the time to read and store the random data.) Once the time limit expires, you can make the data look a little more presentable by adding a space after every 4th character, then replacing every 15th of those spaces with a newline. This gives 74 characters per line (15 groups of 4 hex digits plus 14 spaces) and should fit nicely in most email clients. You can use the Net::SMTP module to send the final results to the user.

Cleanup handlers are a great place to let the server do a little extra work after the client has everything they need. They are also fairly easy to use. Here's the code:

#!/usr/bin/perl -w
#

use strict;
use Net::SMTP;
use Apache::Constants qw(OK);
use Apache::Request;

my $r = Apache::Request->new(Apache->request);
$r->content_type('text/html');
$r->send_http_header;

if ($r->param('mail_to')) {
  my $mail_to = $r->param('mail_to');
  $r->print(<<HTML);
<HTML>
<HEAD><TITLE>Random numbers being generated</TITLE></HEAD>
<BODY><P>Random Numbers being generated and sent to $mail_to</BODY>
</HTML>
HTML

  $r->register_cleanup(sub {&mail_results($mail_to)});
}
else {
  $r->print(<<HTML);
<HTML>
<HEAD><TITLE>Enter Mail Address</TITLE></HEAD>
<BODY><P>Please Enter E-Mail Address to send Random Bytes to:
<FORM><INPUT TYPE="text" NAME="mail_to" SIZE=20 MAXLENGTH=50>
<INPUT TYPE="submit"></FORM></BODY>
</HTML>
HTML
}

return OK;
    
sub mail_results {
  # Don't forget to set $SMTP_HOST, $SMTP_FROM, and $MAIL_FROM
  my $mail_to = shift;
  my $DEVICE = q(/dev/random);
  my $SMTP_HOST = q(your.smtp.host);
  my $SMTP_FROM = q(httpuserid@yourwebserver);
  my $MAIL_FROM = q(webmaster@yourwebserver);

  open(RANDOM, $DEVICE) or die "Could not open $DEVICE: $!\n";

  my $timeout = 2 * 60;  # two minutes
  my $total_bytes = 0;
  my ($all_bits, $rin, $rout) = ("") x 3;

  vec($rin, fileno(RANDOM), 1) = 1;

  while ($timeout) {
    my ($found, $timeleft) = select($rout=$rin, undef, undef, $timeout);
    if ($found) {
      my ($read, $bits);
      unless (defined($read = sysread(RANDOM, $bits, 1024))) {
die "There was an error reading from RANDOM\n";
      }

      $total_bytes += $read;
      $all_bits .= unpack("H*", $bits);
    }
    $timeout = $timeleft;
  }

  close RANDOM;

  my $pattern = qr(\w{4});
  $all_bits =~ s/($pattern)/$1 /g;
  $all_bits =~ s/(($pattern ){14})($pattern) /$1$3\n/g;

  my $smtp = new Net::SMTP($SMTP_HOST);
  $smtp->mail($SMTP_FROM);
  $smtp->to($mail_to);
  $smtp->data();
  $smtp->datasend(qq(From: "Random Numbers" <$MAIL_FROM>\n));
  $smtp->datasend(qq(Subject: Here is $total_bytes bytes of data\n\n));
  $smtp->datasend(qq(\n$all_bits\n\n));
  $smtp->dataend();
  $smtp->quit;
}

Jim Woodgate graduated from the University of Texas at Austin in 1992. He's been writing Perl scripts since 1994 and currently is a developer at a startup called 2ndWave, Inc.