Perl Practicum: Out of Sorts

by Hal Pomeranz

Welcome to the first in a series of articles intended to demystify some of the more occult aspects of Perl programming. The emphasis will be on practical solutions for common problems, rather than a serialized step-by-step tutorial for the language.

There have been a fair number of articles in comp.lang.perl recently which back down to a basic unfamiliarity with the workings of Perl's sort() function. Very little space is specifically devoted to sort() in the Camel Book [1] despite the fact that it is a basic and powerful function. On the face of it, sort() is rather straightforward: you feed it a list of strings and it spits the list back in alphabetic order:

     @list = (`how', `now', `brown', `cow');
     @sorted = sort @list;
If you want a descending sort, just feed the output of sort() through reverse():
     @ztoa = reverse sort @list;
Everybody with us so far? The powerful aspect of sort() is the ability to define customized sort criteria. To invoke this customization, add another parameter which is the name of a function which defines your sort criteria:
     @sorted = sort mysort @list;
Every time sort() needs to compare two values in @list, it will invoke your function (mysort() in this case). For ascending sorts, your function should return a negative integer (any negative integer) if the first value is less than the second, zero if equal, and positive if the first value is greater than the second. Reverse these conditions for a descending sort. Wily UNIX hackers will recognize the noncoincidental similarity to the qsort(3) library call.

In the interests of speed, the normal Perl function call mechanism is short-circuited; this has several effects. First, the values to be compared are automatically passed in as $a and $b rather than via the argument list. Second, these values are passed by reference, so expect civilization as we know it to end if you fiddle with the values of these variables in your subroutine. Third, your sort function must not be recursive.

With all of this in mind, here's how to do an ascending numeric sort:

     @numbers = (3, 5, 4, 11, 1, 7);
     @sorted = sort mysort @numbers;
     sub mysort { $a <=> $b; }
Reverse the positions of $a and $b for a descending sort (or use the reverse() function on the resulting list - this wouldn't be a Perl article if the motto "There's more than one way to do it!" didn't appear somewhere). Note that the function avoids an explicit return() to save time. In the absence of an explicit return(), the return value of a Perl function is the value of the last expression in that function. In fact, it's not necessary to create a separate subroutine - we may write an in-line block in place of the function name:
     @sorted = sort { $a <=> $b; } @numbers;
Let aesthetics be your guide here. The truly lazy will point out that the semicolon inside the block is optional if we are using a recent version of Perl (pl35 or greater). There are too many versions of Perl out there which do not support this (ahem) feature for the author to feel comfortable in recommending this practice, however. Suppose we are producing a monthly disk report. We have an associative array, %usage, which is keyed by user name and stores total disk usage information in MBytes. We would like to have a list of user names in descending order of disk usage so we can send nasty-grams to the top ten or so. We need to sort the list of keys in the array by the values referenced by those keys:
     @sorted = sort { $usage{$b} <=> $usage{$a}; } keys %usage;
Since the granularity of our data is MBytes, we may have several users with the same amount of disk usage. We can refine our sort to alphabetize user names with equivalent disk usage
     @sorted = sort { $usage{$b} <=> $usage{$a} || $a <=> $b; } keys%usage;
This is an example of a multifield sort. It relies on the short-circuit evaluation of the logical OR to only do the second comparison if the disk usage values are equal (i.e., if the "spaceship operator," <=>, returns zero).

Suppose instead we had a list into which we had slurped the contents of a file and we wanted to sort the lines of the file on many fields within each line. For example, suppose we wished to sort the password file first by parent directory of the user's home and then by user name:

     print sort psort @lines;  sub psort {
     ($anm, $adr) = (split(/:/, $a))[0,5];
     ($bnm, $bdr) = (split(/:/, $b))[0,5];
     $adr =~ s,/[^/]*,,; $bdr =~ s,/[^/]*,,;
     $adr cmp $bdr || $anm cmp $bnm; 	}
The function first uses the array slice operator to extract the user name and home directory fields from each password entry. It then uses the substitution operator to extract the parent directory of each home directory by removing the last "/" and everything after it.

Unfortunately, this approach is somewhat inefficient. We may compare against a given line multiple times (quicksort is an O(n log n) sort). Each time we do, we have to redo the split() and strip off the last portion of the home directory path. We can improve performance on the sort by caching this information ahead of time, at the cost of more than tripling the amount of memory consumed:

     for (@lines) {
          ($nm{$_}, $dir{$_}) = (split(/:  ))[0,5];
          $dir{$_} =~ s,/[^/]*,,;
     }
     print sort { $dir{$a} cmp $dir{$b} || $nm{$a} cmp $nm{$b}; } @lines;
This sort of approach is often twice as fast as the noncaching approach. Further optimization can be achieved if you are sorting on multiple numeric fields, e.g., the octets of an IP address. By combining the octets into a single integer value, we are able to sort by doing one, rather than many, numeric comparisons:
     for (@ipaddrs) {
          $num{$_} = sprintf("%d%03d%03d%03d", split(/\./));
     }
     @sorted = sort { $num{$a} <=> $num{$b}; } @ipaddrs;
The single comparison is generally more efficient than stringing together multiple comparisons with logical ORs. What have we learned? First, simple things can have unexpectedly complex applications. In the case of sort(), we have moved from simple alphabetic sorts to sorts on multiple fields and sorts on associated values. Second, heavy optimization can be applied to sort if you are willing to do some work and expend some extra memory up front. Third, some data is amenable to being jammed into a single compound key, thus avoiding multiple comparisons.

[1] Larry Wall and Randal Schwartz, Programming Perl, O'Reilly and Associates, 1991.ISBN 0-937175-64-1.


Reproduced from ;login: Vol. 18 No. 4, August 1993.

Back to Table of Contents

11/22/96ah