Perl Practicum: Thanks for the Reference

by Hal Pomeranz

In the last Perl Practicum, I covered the new Perl5 construct, references. Because the syntax and usage for references are confusing in places, I promised an extended example that would cover, in a real-world example, all the material I had introduced.

The problem is to "marshall" data into format such that if we do

     $string = marshall($some_ref);
     eval("\$other_ref = $string");

then the data structure pointed to by $other_ref will have the same contents as the data structure pointed to by $some_ref. If this is true, then we can save $string to a file and read it back in later in some other application. We use this kind of functionality here at NetMarket to allow Web CGI applications to share data because the format is extremely portable.

Thinking Recursively

The key to unlocking this problem is really a mode of thought, rather than a programming trick. Sometimes assuming you already know how to solve the problem enables you to develop a solution - this is the heart of "recursive thinking."

Remember from last time that the way to create a reference to an anonymous list was

     $a_ref = [1, 2, 3];

where the elements of the list could be lists (or associative arrays, etc.)

     $other_ref = [1, ["a", "b"], 3];

In general, then, an anonymous list reference is assigned with

     $some_ref = [EXPR1, ..., EXPRn];

where EXPR1, ..., EXPRn are either a scalar or a reference to some other complex data structure (list, associative array, list of lists, etc.).

This is where recursive thinking comes in. We assume that we already have a marshall() function which can encode EXPR1, ..., EXPRn properly. With this simplifying assumption, we can quickly write a function that creates a string containing the righthand side of the expression above. All we have to do is put commas after successive calls to the marshall() function and put square brackets around the whole affair:

     sub encode_list {
          my($list_ref) = @_;
          my($string);

          $string = "[ ";
          for (@$list_ref) {
               $string .= marshall($_);
               $string .= ", ";
          }
          $string .= "]";
          return($string);
     }

Notice the @$list in the for loop: remember that you can use a scalar reference any place you would use the name of an identifier. Yes, there is a dangling comma after the last element of the list - Perl ignores it. If you are thinking this is all handwaving and I have not even begun solving the problem, bear with me.

Dealing with Associative Arrays

Remember that the syntax for declaring associative array references was

     $mail_info = {
          "hal" => "hal@netmarket.com",
          "tina" => "tmd@iwi.com",
          "rob" => "kolstad@bsdi.com",
     };

Here the values in the hash can actually be references to complex data structures, but the keys have to be scalars. To state things generally, we assign hash references with expressions like:

     $some_ref = {
          KEY1 => EXPR1,

          ...,

          KEYn => EXPRn
     };

We are already assuming we have marshall() lying around to encode EXPR1, ..., EXPRn. I promise that I will write an encode_scalar() function next to handle encoding KEY1, ..., KEYn. With those two assumptions, encoding an associative array reference is now just a matter of putting commas in the right place:

     sub encode_hash {
          my($hash_ref) = @_;
          my($string, $key);

          $string = "{ ";
          foreach $key (keys(%$hash_ref)) {
               $string .= encode_scalar($key);
               $string .= "=> ";
               $string .= marshall ($$hash_ref{$key});
               $string .= ", ";
          }

          $string .= "}";
          return($string);
     }

Encoding Scalars

You may think that encoding scalar values is trivial. After all, one of our simple examples above just used:

     $mail_info = {
          "hal" => "hal@netmarket.com",
          "tina" => "tmd@iwi.com",
          "rob" => "kolstad@bsdi.com",
     };

You would guess from this example that you just throw quote marks around the value and be done. Well, suppose your scalar value is one of these strings:

     got"ya
     $variable

You will end up with encodings like:

     "got"ya"     # dangling quote
     "$variable"  # evaluates $variable

The safest thing to do is to backslash every nonalphanumeric character in the scalar:

     sub encode_scalar {
          my($scalar) = @_;
          $scalar =~ s/(\W)/\$1/g;
          return("
     }

Now our strings will get encoded as:

     "got\"ya"
     "$variable"

Putting It All Together

All along I have been defining the encoding routines in terms of this mythical marshall() routine. It turns out that we can define marshall() in terms of the encoding routines:

     sub marshall {
          my($thing) = @_;

          $type = ref($thing);
          if ($type eq "ARRAY") {
               return(encode_list($thing));
          }
          elsif ($type eq "HASH") {
               return(encode_hash($thing));
          }
          elsif (!$type) {
               return(encode_scalar($thing));
          }
          else { die("Can't handle $type\n"); }
     }

Remember that ref() is a function that returns what type of data type its argument points to, returning undef if its argument is not a reference.

How can this possibly work? Things will become clearer when we walk through a simple example. Suppose we do:

     $simple = [1, ["a", "b"], 3];
     $output = marshall($simple);

In the first call to marshall(), $simple is identified as a reference to a list and marshall() calls encode_list().

The encode_list() function starts building up $string. First the function sets $string to be the initial opening bracket. Then encode_list() begins walking through the list and calling marshall() on each element.

The first argument of the list is a scalar, 1. When encode_list() calls marshall() on this element, marshall() immediately calls encode_scalar(), and encode_scalar() returns the string:

"1"

marshall() simply returns this value back up to the encode_list() function, which appends it to the value of $string. At the end of one iteration of the loop, $string looks like this:

     [ "1",

Now encode_list() calls marshall() on the second argument of the list. This argument is a list reference, so marshall() calls encode_list() recursively.

This second call to encode_list() starts building a new $string. First it initializes this new $string with the opening square bracket. Next it calls marshall() on each of the list elements in turn, both of which are scalars. After the first iteration of the loop, the new $string looks like this:

     [ "a",

After the second iteration, we have:

     [ "a", "b",

The loop terminates and the encode_list() function appends a closing bracket and returns

     [ "a", "b", ]

to the marshall() function that called it originally. In turn, marshall() returns this value to the original encode_list() call. This function appends the string above to its own $string, which now looks like this:

     [ "1", [ "a", "b", ],

This encode_list() function now moves onto the third and final element of the list in this example. This is just another scalar, so after the third iteration we have this:

     [ "1", [ "a", "b", ], "3",

We have exhausted the elements of the example list, so we fall out of the for loop. encode_list() appends the closing bracket and returns the string:

     [ "1", [ "a", "b", ], "3", ]

Try working out a simple example for yourself, possibly with an associative array this time. Or just type in the code and try running some examples.

Brevity is the Soul of Wit

We can use function references to simplify the marshall() function. First, we build a "jump table": an associative array that links certain keywords to various function references. In this case, we link the strings returned by ref() to the function used to encode each data type:

     %encode = ("SCALAR" => \&encode_scalar,
                "ARRAY" => \&encode_list,
                "HASH" => \&encode_hash,);

marshall() now becomes extremely terse:

     sub marshall {
          my($thing) = @_;
          my($type, $func_ref);

          $type = ref($thing) || "SCALAR";
          $func_ref = $encode{$type};
          return(&$func_ref($thing)) if ($func_ref);
          die("Can't handle $type\n");
     }

First we assign $type to be whatever gets returned by ref() or SCALAR if ref() returns undef. Next we extract the appropriate function reference from %encode, and call that function on the argument to marshall(). If we cannot find an appropriate function to call, we die().

To make this even more terse, remember that you can use a block inside of curly braces in place of a scalar reference. In other words, we can use

     {$encode{$type}}

in place of $func_ref. Our new version of the function is:

     sub marshall {
          my($thing) = @_;
          my($type);

          $type = ref($thing) || "SCALAR";
          return(&{$encode{$type}}($thing)) if (defined($encode{$type}));
          die("Can't handle $type\n");
     }

I generally hate using extra variables, but the function above is hard for the human eye to comprehend.

Thanks for the Reference

Please type in this code and try it out on some complex examples. If nothing else, the exercise will get you using references so that you will remember how they work. Also, never forget this little demonstration of the power of recursive thinking - a useful tool for any programmer.

Reproduced from ;login: Vol. 21 No. 2, April 1996.

Back to Table of Contents

12/3/96ah