data + munging

The Perl Journal

Volumes 1–6 (1996–2002)

Code tarballs available for issues 1–21.

I reformatted the CD-ROM contents. Some things may still be a little wonky — oh, why hello there <FONT> tag. Syntax highlighting is iffy. Please report any glaring issues.

all issues

The Perl Journal

#20

Winter 2000

vol 5

num 4

file_download download code 40,960 bytes

Letters

Perl News

Chris Nandor

All About Arrays

Nathan Torkington

Secure Internet Voting with Perl

Lincoln D. Stein

Developing Wireless Applications

Dan Brian

Lazy Text Formatting

Damian Conway

Simulating Typos with Perl

Sean M. Burke

Building Directory Services with Net::LDAP

Joe Johnston

Glade

Ace Thompson

A Simple Gnome Panel Applet

Joe Nasal

Beyond Hardcoded Database Applications with DBIx::Recordset

Terrence Brannon

Symbol::Approx::Sub: A Module for Bad Typists

Dave Cross

The 2nd Annual Perl Poetry Contest

Kevin Meltzer

Nathan Torkington (2000) All About Arrays. The Perl Journal, vol 5(4), issue #20, Winter 2000.

All About Arrays

Nathan Torkington

Basics

All array variables begin with an @ sign. They hold a list of scalar values (such as a string or number) whose positions are numbered beginning from 0. So in this code, "blue" is in position number 2 of the @colors array and 42 is in position 3 of the @data array:

  @colors = ("red", "green", "blue");
  @data   = ("Perl", 2_000_000, "Wall", 42);

At this early point it's good to start distinguishing lists from arrays. Perl gurus try to be precise about this distinction when they talk about their code: both are sequences of scalars, but while arrays are true stored variables, lists are mere temporary sequences of values. Subroutines accept lists, and can return them; as you pass an array into a subroutine, it becomes a list of values. Likewise, when a subroutine returns a list, you can store it in an array.

You store a list inside an array variable if you want to access the list's values later. Subroutines and functions don't, strictly speaking, accept arrays, except for a few special functions that we'll see later. Where Perl expects a bunch of values to work on, those values can come from a list, whether it's hard-coded in the program, returned by a function, or extracted from an array.

Inside double-quoted strings, arrays interpolate (expand) into their values, separated by spaces:

  print "Primary colors are: @colors\n";
  red green blue

Spaces are the default separator, but you can change this with the $" variable:

  $" = ' and ';
  print "Primary colors are: @colors\n";
  red and green and blue

Positions

To access a single value from an array, use square brackets:

  $colors[2]

The name of the array is "colors", the $ in front indicates a scalar value, and the position of that value, called a subscript, is in the square brackets. This notation works for both storing and fetching values.

  $colors[0] = "pink";
  print $colors[0];

Array subscripts also interpolate inside double-quoted strings:

  print "The 0th color is $colors[0]\n";

To make life easy for programmers, who often need to refer to both ends of the array conveniently, a negative subscript counts back from the end of the array:

  print $colors[-1];
  blue
  
  print $colors[-3];
  pink

An attempt to fetch a non-existent negative position returns undef, but an attempt to store in such a position is a fatal error:

  print $colors[-4];
  Use of uninitialized value ...

  $colors[-4] = "ultraviolent";
  Modification of non-creatable array value attempted,
    subscript -4 at ...

Perl has dynamic data structures, which grow as needed. They only grow when assigned to, however, and never simply by reading. So if you ask for an element that doesn't exist by specifying a position beyond the end of the array, you'll get undef -- and the array's size won't change as a result.

To determine the size of an array, you can evaluate it in scalar context by assigning it to a scalar:

  $size_before = @colors;
  print $colors[5];
  $size_after  = @colors;
  print "$size_before $size_after\n";
  Use of uninitialized value at ...
  3 3

  $size_before = @colors;
  $colors[5]  = "burgundy";
  $size_after  = @colors;
  print "$size_before $size_after\n";
  Use of uninitialized value at ...
  3 6

When you assigned to position 5, Perl created values in positions 3 and 4 as well. Now you have six elements in the array, in positions 0 through 5.

Position vs Count

Welcome to the torture of counting array positions. Because positions start at 0, the size and last position always differ by one. If the only value in the array is at position 0, then there is one element. If there are two elements, they must be in positions 0 and 1.

Each array has an accompanying scalar variable containing the last position of the array. That variable is $# followed by the array name (no @ sign, since it's a scalar we're after):

  print $#colors;		        # last position
  5

  print scalar(@colors);		# number of elements
  6

This often confuses beginners when they use loops to count over the positions of an array. There are two right ways to do it:

  for ($i=0; $i <   @colors; $i++) { ... }  # A
  for ($i=0; $i <= $#colors; $i++) { ... }  # B

And two wrong ways:

  for ($i=0; $i <= @colors; $i++) { ... }   # C
  for ($i=0; $i <  $#colors; $i++) { ... }  # D

Option C executes the loop body for one too many positions (if there are six things in @colors, the loop executes when $i is 6, even though that's not a valid position). Likewise, option D executes the body one too few times (if the last position is 5, the loop stops after executing the loop with $i set to 4). I prefer option A because it takes fewer keystrokes than option B.

The $#array variable has another use: you can set it, which pre-extends the array. If you know your array will eventually have 1000 elements in it, you can tell Perl to allocate all the elements at once rather than making Perl allocate 1000 items one-by-one (and slowly) as you grow the array.

  $#numbers = 999;
  for ($i = 0; $i < 1000; $i++) {
      $numbers[$i] = 5 * $i + 1;
  }

foreach loops

Many times you won't need the position of the current element, you only need its value. Rather than use a C-style for loop as above, use a Perl-style foreach loop:

  @colors = ("red", "green", "blue");
  foreach $c (@colors) {
      print "$c\n";
  }
  red
  green
  blue

You may choose any loop variable (the $c above) that you wish. If you follow tight programming discipline and used the strict module to prevent accidental use of global variables, you can mix my or local with the foreach:

  #!/usr/bin/perl -w

  use strict;

  my @colors = ("red", "green", "blue");
  foreach my $c (@colors) {
      print "$c\n";
  }
  red
  green
  blue

Inside foreach loops, the loop variable is actually an alias for the value in the list. So if you change the loop variable, you change the element in the list:

  @colors = ("red", "brown");
  foreach $c (@colors) {
      $c = "hot $c";
  }
  print "@colors\n";

  hot red hot brown

If you omit the variable, Perl will use $_ as the default variable:

  foreach (@colors) {

      print "Current item is $_\n";
  }

This is useful when you combine it with the string functions that use $_ as their default values:

  foreach (@colors) {
      tr/A-Z/a-z/;
      s/pink|burgundy/red/i;
      print length, "\n";
  }

reverse and sort

What else can you do with arrays? You can reverse the order of the elements:

  @inverted = reverse @colors;
  print "@inverted\n";
  blue green red

You can sort the elements in ASCIIbetical order:

  @colors = ("pink", "purple", "mauve");
  @ordered = sort @colors;
  print "@ordered\n";
  mauve pink purple

What if you prefer reverse alphabetical order? You might write this:

  @ordered  = sort @colors;
  @inverted = reverse @ordered;
  print "@inverted\n";
  purple pink mauve

This works, but you can be even more concise. Like many functions, reverse and sort take any list of values as arguments:

  @inverted = reverse sort @colors;

Can you see why the following won't work?

  @inverted = sort reverse @colors;   # WRONG

The answer is at the end.

Even when you combine sort and reverse in the right order, it is rather inefficient. sort returns a temporary list of values, which is then reversed. It'd be more efficient to tell sort to sort in the order we want. We can do that!

sort accepts a code block before the list of values to sort. The code block tells sort how to order any two values. Those values are put into the global variables $a and $b before the code block is executed. (Most code blocks use Perl's <=> or cmp operators to compare things numerically or ASCIIbetically.)

The default comparison routine is

  $a cmp $b

cmp compares values as strings, and by putting $a before $b we get an ascending sort. If we wanted to sort from highest to lowest, it's as simple as flipping the order of $a and $b in the comparison: instead of telling sort that "green" should come after "blue", it'll now say that "green" should come before "blue":

  @colors = ("pink", "purple", "mauve");
  @inverted_ordered = sort { $b cmp $a } @colors;
  print "@inverted_ordered\n";
  purple pink mauve

There are many more complicated sorts you can do, up to and including the Schwartzian Transform. But I digress. If you want more information on sorting, consult a good Perl book like The Perl Cookbook (Tom Christiansen and yours truly, O'Reilly & Associates) or Effective Perl Programming (Joseph Hall, Addison-Wesley).

Slices

You now know how to talk about the array as a whole, and how to talk about single values from the array, but what about subsets of the array? For that you need to know about array slices:

  @subset = @colors[0,2];
  print "@subset\n";
  pink mauve

The @ sign at the beginning says we want multiple values back. Inside the square brackets we have a list of values. In this case it's just positions 0 and 2 we want, but we can have any list we like:

  ($x, $y, $z) = @big_array[5, 2, 100];

That's like saying:

  $x = $big_array[5];
  $y = $big_array[2];
  $z = $big_array[100];

...only your fingers don't get worn out. When you want a range of values (e.g., from position 2 through 8) you can use the range (..) operator:

  @subset = @big_array[2..8];

Which, again, is like typing this, but without fingerprint damage:

  @subset = @big_array[2, 3, 4, 5, 6, 7, 8];

Adding and deleting values

Perl has five functions for inserting and removing values from an array. Four of those functions are quite specialized, working with only the start or end of the array. The last, splice, is far more general. Let's cover the specialized functions first.

push and pop act on the end of the array. push adds values to the end of the array, while pop removes the last value and returns it:

 
 @characters = ("Buffy", "Willow", "Xander");
 push(@characters, "Giles", "Anya");
 print "@characters\n";
 $ex_demon = pop @characters;
 print "popped $ex_demon\n";
 print "@characters\n";

 Buffy Willow Xander Giles Anya
 popped Anya
 Buffy Willow Xander Giles

The corresponding functions which work on the start of the array are shift and unshift:

 
 @baddies = ("Spike", "Mayor", "Adam");
 $in_wuv = shift @baddies;
 print "removed $in_wuv\n";
 print "left: @baddies\n";
 unshift @baddies, "Dracula";
 print "@baddies\n";

 removed: Spike
 left: Mayor Adam
 Dracula Mayor Adam

If you shift or pop but don't give an array name, Perl assumes you mean the current arguements. If you're in a subroutine definition, the array that's operated on is @_, containing the subroutine arguments. If you're not in a subroutine definition, @ARGV is shifted or popped.

The uber-function for arrays is splice, which lets you perform any combination of inserting, deleting, or replacing. You give it an array to work on, the position at which to begin deleting elements, the number of elements to delete, and then any elements to insert in place of those deleted. splice returns any deleted elements:

 
 @gals = ("Buffy", "Willow", "Anya", "Faith");
 @cut = splice @gals, 1, 2, "Tara";
 print "@gals\n";
 print "@cut\n";

 Buffy Tara Faith
 Willow Anya

The two things starting at position 1 were "Willow" and "Anya". In their place was put "Tara".

You can delete zero elements, and only use splice for its ability to insert:

 
 @gals = ("Buffy", "Willow", "Anya");
 splice @gals, 2, 0, "Tara";
 print "@gals\n";

 Buffy Willow Tara Anya

You can insert no elements, and only use splice for its ability to delete:

 
 @gals = ("Buffy", "Cordelia", "Faith", "Willow", "Anya");
 @cut = splice @gals, 1, 2;
 print "@gals\n";
 print "@cut\n";

 Buffy Willow Anya
 Cordelia Faith

And, of course, by giving positions at the start or end of the array, you can insert or delete there:

 
 splice @gals, @gals, 0, "Tara"; # push @gals, "Tara";
 splice @gals, $#gals, 1; # pop @gals;
 
 splice @gals, 0, 0, "Tara"; # unshift @gals, "Tara";
 splice @gals, 0, 1; # shift @gals;

These five functions are the only functions in Perl where you need to provide an array and not just a list. You cannot push onto a list, because a list is merely a fleeting gathering of values, and you need a persistent collection if you want to change it. Your program won't compile if you try to use one of these functions and its first argument does not begin with an @ sign.

Lists to strings and back again

How do you create a list? You can hard-code it in your program or accumulate it element by element using push or unshift. Often you just read the list from a file.

Imagine a list of words on one line:

 Buffy The Vampire Slayer

You would like an array with each element being a single word. You could do this with repeated matches:

 while ($string =~ m/(\S+)/g) {
 push @words, $1;
 }

But the easiest way is to use the split function. split takes up to three arguments. The first is a regular expression matching the stuff between the values you want. Here, we'll need a regular expression matching spaces. The second argument to split is the string to be split up. The third and final argument is the number of fields you want back, but if you omit it you'll get all the fields.

 @words = split /\s+/, $string;

If we omit the second argument, split looks in $_ for the string. This makes it perfect for these kinds of loops:

 while (<SOMEFILE>) {
 @words = split /\s+/;
 #...
 }

In fact, if you have your string in $_ and you want it split on whitespace, you don't even need the regular expression -- the default regular expression is whitespace!

 while (<SOMEFILE>) {
 @words = split;
 # ...
 }

Of course, your strings don't always have fields separated by spaces. The Unix password file, for instance, separates fields with colons:

 while (<PASSWDFILE>) {
 @fields = split /:/;
 # ...
 }

split has some quirks: it ignores any trailing empty fields, so if your colon-separated record was big:deal::: you'd get two fields back: big and deal.

The opposite of split is join. split extracts fields that have been separated. join produces a string of separated fields. The first argument is the separator (an exact string, not a regular expression), and the rest of the arguments are values to join together with the separator in between. For instance:

 
 @adjectives = ("hot", "damp", "sticky");
 $line = join(" and ", @adjectives);
 print $line;

 hot and damp and sticky

Putting it all together

So here's how you reverse the order of words for each line in a file:

 
 while (<INFILE>) {
 @fields = split;
 @new = reverse @fields;
 $line = join " ", @new;
 print OUTFILE "$line\n";
 }

or more concisely:

 
 while (<INFILE>) {
 print OUTLINE join(" ", reverse split), "\n";
 }

Answer to the earlier question. The code said to reverse the list, then sort it. The call to reverse is useless, because sort sorts the list into ascending order.

Nathan Torkington used to be a trainer for the Tom Christiansen Perl Consultancy, whose excellent Beginning Perl class may accidentally have inspired parts of this article.