Dave Cross (2000) Symbol::Approx::Sub: A Module for Bad Typists. The Perl Journal, vol 5(4), issue #20, Winter 2000.

Symbol::Approx::Sub: A Module for Bad Typists

Dave Cross

Symbol::Approx::Sub is a Perl module which allows you to call subroutines even if you spell their names wrong. Using it can be as simple as adding this to your programs:

 use Symbol::Approx::Sub;

Once you've done this, you never have to worry about spelling your subroutine names correctly again. For example, this program prints "This is the foo subroutine!" even though &foo was misspelled as &few.

 
 use Symbol::Approx::Sub;

 sub foo {
     print "This is the foo subroutine!\n";
 }

 &few;

Why was it written?

This is obviously a very dangerous thing to want, so what made me decide to write Symbol::Approx::Sub?

Last July I went to the O'Reilly Perl Conference in California and attended Mark-Jason Dominus' "Tricks of the Wizards" tutorial. He explained a number of concepts that can take your Perl programs to a new level of complexity and elegance. The most important of these concepts are typeglobs and the AUTOLOAD function. It was the first time that I'd really tried to understand either of them and, thanks to Dominus' clear explanations, I began to understand their power.

One example that Dominus used in this class demonstrated how you could use AUTOLOAD to catch misspelled subroutine names and perhaps do something about them. He showed a slide containing code something like this:

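 sub AUTOLOAD {
     # $AUTOLOAD holds the fully qualified name of the missing sub;
     # strip the package prefix to leave the bare name
     my ($sub) = $AUTOLOAD =~ /.*::(.*)/;

     # Work out what sub the user really meant
     $sub = get_real_name_of_sub($sub);

     goto &$sub;
 }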

On the following slide, he went into some detail about what a really bad idea this would be and how it would make your code completely unmaintainable. But it was too late. I was already thinking about how I could write a "get the real name of the subroutine" function and put it into a module that could be used in any Perl program.

How does it work?

During the twelve-hour flight home from California to England I thrashed out the implementation details. Here are the four stages that the module requires:

When the module is loaded, it needs to install an AUTOLOAD function in the package that called it.

When AUTOLOAD is called (as the result of invoking a non-existent subroutine) it needs to get a list of all the subroutines in our calling package.

The AUTOLOAD function needs to compare each of those subroutine names with what the user actually called, and choose the most likely candidate.

It then just invokes the chosen subroutine.

The key to the first two stages was the other main topic of Dominus' talk -- typeglobs.

In every Perl package, there is something called a stash ("symbol table hash") that contains the package's variables and subroutines. This stash is like a normal hash, with the keys being the names of the variables and subroutines, and the values being the typeglobs themselves. A typeglob is a data structure containing references to all of the objects with the same name. You know that in a Perl program you can have $a, @a, %a, and &a, and they are all completely separate -- but they all live in the same typeglob.

The first stage is achieved with a useful typeglob trick. You can assign values (which should be references) to the various slots of a typeglob. This has the effect of aliasing the typeglob's name to the referenced value. For example, if you execute the following line of code, @a will become an alias to @array_with_a_really_long_name and any changes you make to @a will actually happen to the other array.

 *a = \@array_with_a_really_long_name;

Furthermore, you can do this with any typeglob object, not just arrays. In particular, you can do it with subroutines, which is what I needed for Symbol::Approx::Sub. The two objects don't even have to be in the same package, as you can see from the code below:

 package other;

 sub foo {
     print "This is &other::foo\n";
 }

 *main::bar = \&foo;

 package main;

 &bar;

In this example we create a subroutine called foo in the package other. We then alias that subroutine to &main::bar. This means that within the main package, if we call bar we actually call &other::foo. (This is how the Exporter module works.)
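As a rough illustration of how a module can apply this aliasing to whichever package loads it, here is a minimal sketch. The package name My::Installer and the subroutine names fallback and greet are invented for the example; this is not the code from Symbol::Approx::Sub itself. The import routine asks caller which package invoked it and fills the CODE slot of a typeglob in that package.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical package, used only for this illustration.
package My::Installer;

sub fallback {
    return "fallback called";
}

# import() is what "use My::Installer;" would call at compile time.
sub import {
    my $caller = caller;                   # name of the package that called us
    no strict 'refs';                      # we build a glob name from a string
    *{"${caller}::greet"} = \&fallback;    # alias caller's &greet to our sub
}

package main;

# Call import by hand, since this sketch lives in a single file.
My::Installer->import;

print greet(), "\n";
```

Calling import by hand stands in for what use would do automatically if the module lived in its own file.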

When Symbol::Approx::Sub is loaded, we alias our caller's AUTOLOAD function to the one in our module. We know what our AUTOLOAD needs to do, but how do we get a list of subroutines in the calling package?

Let's look at a simple typeglob example. The next piece of code declares three package variables and a subroutine. We then write a simple foreach loop to print out the contents of the %main:: stash. If you run this program you'll see the names of our package objects a, b, c, and d. (You'll also see the standard filehandles STDIN and STDOUT and other builtin Perl variables like @INC and %ENV.)

 use vars qw($a @b %c);

 sub d { print "Hello, world!\n" };

 foreach (keys %main::) {
     print "$_\n";
 }

Having listed the typeglobs, our next task is to work out which of them contain subroutines. For this, we can use a mechanism called the *FOO{THING} syntax. In the same way that scalar names always start with a $ and array names always start with a @, typeglob names always start with a *. *FOO therefore refers to the typeglob called FOO (which will contain $FOO, @FOO, %FOO, and &FOO). With the *FOO{THING} syntax, you can find out whether the typeglob FOO contains an object of type THING, where THING can be SCALAR, ARRAY, HASH, IO, FORMAT, CODE, or GLOB. The next piece of code uses this syntax to show which of the typeglobs in our current package contain a subroutine.

 #!/usr/bin/perl -w
 use strict;

 use vars qw($a @b %c);

 sub d { print "sub d" };

 while (my ($name, $glob) = each %main::) {
     print "$name contains a sub\n" if defined *$glob{CODE};
 }

We now know enough to create an AUTOLOAD function that generates a list of the subroutines that exist in the package.
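Putting these pieces together, a subroutine lister might be sketched as follows. The helper name list_subs and the Demo package are assumptions made for illustration. The sketch tests each name with defined &{...} rather than *FOO{CODE}, a choice made here because stash entries are not guaranteed to be full typeglobs on every version of perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the names of all subroutines defined in a package.
# list_subs is a hypothetical helper, not part of any module.
sub list_subs {
    my ($package) = @_;
    no strict 'refs';                      # symbolic access to the stash
    my @names;
    for my $name (keys %{"${package}::"}) {
        next if $name =~ /::$/;            # skip nested package stashes
        push @names, $name if defined &{"${package}::$name"};
    }
    my @sorted = sort @names;
    return @sorted;
}

package Demo;

sub foo { 1 }
sub bar { 2 }
our $not_a_sub = 3;                        # a scalar, so it must not be listed

package main;

print join(", ", list_subs('Demo')), "\n";
```

Run as is, this lists bar and foo but not the package scalar.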

Inside the AUTOLOAD function, the name of the subroutine that the program attempted to invoke will be available in the $AUTOLOAD variable. All we need to do is carry out some sort of fuzzy matching on the set of subroutine names and the misspelled subroutine name to find the best match.

Unfortunately, this isn't as simple as it sounds. I didn't want to write my own fuzzy matching algorithm, so I decided to borrow someone else's. Perl comes with a Text::Soundex module that converts any word to a single letter and three digits that collectively correspond to the pronunciation of the string. This is what I initially used to do my fuzzy matching.

The module computes the Soundex value for the misspelled subroutine, and then computes the Soundex values for each of the subroutines in the caller's package. If none match, it mimics Perl's standard "undefined subroutine called" error message. If one matches, it's assumed to be the right subroutine. But what if there are multiple matches? This is quite possible because of the lossy compression that Soundex provides. I thought about this for a while before deciding that the only option would be to pick one at random. I really couldn't see any other reasonable approach.
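The approach described above can be sketched in a few lines. To keep the example self-contained it uses a hand-rolled, simplified Soundex: simple_soundex is a stand-in written for this sketch and may differ from Text::Soundex in edge cases. The AUTOLOAD below approximates the idea and is not the module's actual code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Letter-to-digit table for a simplified American Soundex.
my %CODE;
@CODE{qw(B F P V)}         = (1) x 4;
@CODE{qw(C G J K Q S X Z)} = (2) x 8;
@CODE{qw(D T)}             = (3) x 2;
$CODE{L}                   = 4;
@CODE{qw(M N)}             = (5) x 2;
$CODE{R}                   = 6;

# Hypothetical stand-in for Text::Soundex: one letter plus three digits.
sub simple_soundex {
    my ($word) = @_;
    my @chars = grep { /[A-Z]/ } split //, uc $word;
    return '' unless @chars;
    my $first  = shift @chars;
    my $result = $first;
    my $prev   = $CODE{$first} // '';
    for my $c (@chars) {
        next if $c eq 'H' || $c eq 'W';    # H and W do not separate equal codes
        my $digit = $CODE{$c} // '';       # vowels and Y have no code
        $result .= $digit if $digit ne '' && $digit ne $prev;
        $prev = $digit;
    }
    return substr($result . '000', 0, 4);  # pad or truncate to four characters
}

our $AUTOLOAD;

sub AUTOLOAD {
    my ($wanted) = $AUTOLOAD =~ /.*::(.*)/;
    return if $wanted eq 'DESTROY';
    my $target = simple_soundex($wanted);

    no strict 'refs';
    my @matches = grep { simple_soundex($_) eq $target }
                  grep { $_ ne 'AUTOLOAD' && !/::$/ && defined &{"main::$_"} }
                  keys %main::;

    # No match: imitate Perl's usual complaint.
    die "Undefined subroutine &$AUTOLOAD called\n" unless @matches;

    # Several matches: pick one at random, as the text describes.
    my $chosen = $matches[ rand @matches ];
    my $code   = \&{"main::$chosen"};
    goto &$code;
}

sub foo { return "This is the foo subroutine!" }

print few(), "\n";
```

Both "foo" and "few" reduce to the same code here, so the call to few ends up in foo.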

Sub::Approx

That's pretty much how the original version of the module worked. I called it Sub::Approx and released it to CPAN in the summer. People started to talk to me about the module, and one of the most common things they said was, "Really interesting idea, but you should do the fuzzy matching using Some::Other::Module."

So version 0.05 of Sub::Approx included what I called "fuzzy configurability" (or "configurable fuzziness"). With the help of Leon Brocard, I made the process of matching a subroutine more modular. We introduced the concept of a matcher: a subroutine called with the name of the subroutine that we're trying to match, followed by the list of subroutines in the package. The matcher returns a list of the subroutine names that match the required name. We supplied a matcher for each of Text::Soundex, Text::Metaphone, and String::Approx. You can therefore now use Sub::Approx like this:

 use Sub::Approx (matcher => 'text_metaphone');

This makes the module match names with Text::Metaphone instead of Text::Soundex.

To make it even more flexible, we let you define your own matching subroutine and use it by passing a reference to it to Sub::Approx. For example:

 use Sub::Approx (matcher => \&reverse);

 sub reverse {
     my $sub = reverse shift;
     return grep { $_ eq $sub } @_;
 }

 sub abc {
     print "In sub abc!\n";
 }

 &cba;

If the subroutine you call doesn't exist, this matcher searches for a subroutine whose name is the reverse of the one you tried to call.

One last feature was the ability to define your own chooser function. This is the function that decides what to do if more than one subroutine matches the name of the called subroutine. It is passed a list of matching subroutine names and should return the name of the one it chooses. The default chooser still picks one at random, but you can define your own like this:

 use Sub::Approx (chooser => \&first);

 sub first {
     return shift;
 }

This example will always choose the first item in the list of matching subroutines.
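To show how a matcher and a chooser fit together, here is a small sketch of the glue between them. The resolve function and the initial_matcher are hypothetical and written only for this illustration; Sub::Approx's internals may well be organized differently. The matcher narrows the candidates, and the chooser is consulted only when more than one survives.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Matcher from the text: accept only the reverse of the wanted name.
sub reverse_matcher {
    my $wanted = reverse shift;
    return grep { $_ eq $wanted } @_;
}

# Hypothetical matcher that accepts any name with the same first letter.
sub initial_matcher {
    my $wanted  = shift;
    my $initial = substr $wanted, 0, 1;
    return grep { substr($_, 0, 1) eq $initial } @_;
}

# Chooser from the text: always take the first match.
sub first_chooser {
    return shift;
}

# Hypothetical glue: run the matcher, then let the chooser break ties.
sub resolve {
    my ($wanted, $matcher, $chooser, @candidates) = @_;
    my @matches = $matcher->($wanted, @candidates);
    return unless @matches;                # nothing matched
    return $matches[0] if @matches == 1;   # unambiguous
    return $chooser->(@matches);           # ambiguous, ask the chooser
}

my @subs = qw(abc abd xyz);

# One match, so the chooser is never called.
print resolve('cba', \&reverse_matcher, \&first_chooser, @subs), "\n";

# Two matches (abc and abd), so the chooser picks the first.
print resolve('axx', \&initial_matcher, \&first_chooser, @subs), "\n";
```

Keeping the two roles separate means either can be swapped without touching the other, which is the flexibility the matcher and chooser options provide.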

Symbol::Approx::Sub

This was how things remained until the end of September when I gave a lightning talk on Sub::Approx at YAPC::Europe. Afterward a number of discussions took place which changed the shape of Sub::Approx, resulting in four changes:

Perl RFC 324 was drafted, which suggested that in Perl 6, the AUTOLOAD function should be renamed to AUTOGLOB and invoked when any typeglob object that doesn't exist is called. This would allow us to create Scalar::Approx, Array::Approx, and so on.

A mailing list was set up to discuss Sub::Approx and associated matters. You can subscribe to the list at https://www.astray.com/mailman/listinfo/subapprox/.

The typeglob walking code from Sub::Approx was abstracted out into a new module called GlobWalker so that it could be reused in Scalar::Approx and friends. Later, I discovered that the Devel::Symdump module on CPAN did much the same thing and switched to that.

We realized that to produce Scalar::Approx and friends, we would be polluting a number of module namespaces. After some discussion on the modules and subapprox mailing lists, we decided on the name Symbol::Approx::Sub (see footnote 1).

Symbol::Approx::Sub version 1.60 is currently on CPAN.

Robin Houston has started work on a Symbol::Approx::Scalar module. Variables are trickier than subroutines for two reasons. First, there is currently no AUTOLOAD facility for variables the way there is for subroutines; Robin gets around this by tying the scalar variables. Second, most variables (at least in good programs) are lexical variables, rather than package variables, and therefore don't live in typeglobs. Robin (who knows more about Perl internals than I do) is therefore writing a PadWalker module which does the same for lexical variables as GlobWalker (or Devel::Symdump) does for typeglobs.

You can find early versions of these modules together with a talk that Robin gave at a London.pm meeting: https://www.kitsite.com/~robin/.

Future Plans

On the mailing list, we are already planning Symbol::Approx::Sub version 2.0. Planned features include:

Separating the matcher component out into two separate stages: canonization and matching. Canonization takes a subroutine name and returns some kind of canonical version, which might include removing underscores or converting all characters to lower case. This suggests having chained canonizers, each of which carries out one transformation in sequence.

Developing a plugin architecture for canonizers, matchers, and choosers. This would make it easy for other people to produce their own modules which work with Symbol::Approx::Sub.

Trying to accommodate calling packages that already define an AUTOLOAD function.
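The idea of chained canonizers can be sketched as follows. Since version 2.0 is only planned, everything here is an assumption: the names canonize and match_canonical, the list of canonizers, and the choice to match on exact equality of the canonical forms are all invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each canonizer takes a name and returns a transformed name.
# They are applied in order, so the output of one feeds the next.
my @canonizers = (
    sub { my $n = shift; $n =~ tr/_//d; return $n },   # remove underscores
    sub { return lc shift },                           # convert to lower case
);

# Hypothetical helper: run a name through the whole chain.
sub canonize {
    my ($name) = @_;
    $name = $_->($name) for @canonizers;
    return $name;
}

# Hypothetical matcher: names match if their canonical forms are equal.
sub match_canonical {
    my ($wanted, @candidates) = @_;
    my $target = canonize($wanted);
    return grep { canonize($_) eq $target } @candidates;
}

my @subs = qw(get_user_name get_username set_user_name);

print join(", ", match_canonical('GetUserName', @subs)), "\n";
```

Here 'GetUserName' matches both get_user_name and get_username, which is exactly the kind of tie a chooser would then have to break.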

Even with all of this development, I have yet to find a real use for the module. As far as I can see, it's simply a very good demonstration of just how easy it is to do things in Perl that would be impossible in other languages. If you think you have an interesting use for Symbol::Approx::Sub, please let the mailing list know.

Dave Cross worries that Symbol::Approx::Sub will be seen as his major contribution to the Perl community, so he's written a book too. It's called Data Munging with Perl and should be out by the time you read this.

1 I like the fact that the new name includes the word "Symbol", since it means that we can also call it The::Module::Formerly::Known::As::Sub::Approx.

modules used

	Symbol::Approx::Sub	CPAN