
A recent Lobsters post lauding the virtues of AWK reminded me that although the language is powerful and lightning-fast, I usually find myself exceeding its capabilities and reaching for Perl instead. One such application is analyzing voluminous log files such as the ones generated by this blog. Yes, WordPress has stats, but I’ve never let reinvention of the wheel get in the way of a good programming exercise.

So I whipped this script up on Sunday night while watching RuPaul’s Drag Race reruns. It parses my Apache web server log files and reports on hits from week to week.

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct 'operator-double-diamond';
use Regexp::Log::Common;
use DateTime::Format::HTTP;
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots.txt
  [-\w]*sitemap[-\w]*.xml
  ^/+wp-
  /feed/?$
  ^/+?rest_route=
>;

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    my $dt  = DateTime::Format::HTTP->parse_datetime( $log{ts} );
    my $key = sprintf '%u-%02u', $dt->week;

    # get first date of each week
    $week_of{$key} ||= $dt->date;
    $count{$key}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

Here’s some sample output:

Week of 2021-07-31:      2,672
Week of 2021-08-02:     16,222
Week of 2021-08-09:     12,609
Week of 2021-08-16:     17,714
Week of 2021-08-23:     14,462
Week of 2021-08-30:     11,758
Week of 2021-09-06:     14,811
Week of 2021-09-13:        407

I first started prototyping this on the command line as if it were an awk one-liner by using the perl -n and -a flags. The former wraps code in a while loop over the <> “diamond operator”, processing each line from standard input or files passed as arguments. The latter splits the fields of the line into an array named @F. It looked something like this while I was listing URIs (locations on the website):

gunzip -c ~/logs/phoenixtrap.com-ssl_log-*.gz | \
perl -anE 'say $F[6]'
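
For those who haven’t memorized perlrun, here’s roughly what that one-liner expands to (a sketch, not the exact code perl generates internally):

#!/usr/bin/env perl

use v5.10;            # -E enables features like say
while (<>) {          # -n wraps the code in this implicit loop
    our @F = split;   # -a autosplits $_ into @F on whitespace
    say $F[6];        # the seventh field: the request's URI
}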

But once I realized I’d need to filter out a bunch of URI patterns and do some aggregation by date, I turned it into a script and turned to CPAN.

There I found Regexp::Log::Common and DateTime::Format::HTTP, which let me pull apart the Apache log format and its timestamp strings without having to write even more complicated regular expressions myself. (As noted above, this was already a wheel-reinvention exercise; no need to compound that further.)

Regexp::Log::Common builds a compiled regular expression based on the log format and fields you’re interested in, so that’s the constructor on lines 11 through 14. Matching a log line against that expression returns those fields as a list, which I’m assigning to a hash slice with those field names as keys in line 29. I then skip over requests that aren’t successful or browser cache hits, skip over requests that don’t GET web pages or other assets (e.g., POSTs to forms or updates of other resources), and skip over the URI patterns mentioned earlier.
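
If you haven’t run into hash slices before, here’s that assignment in miniature, with made-up values standing in for a real log line:

my @fields = qw<req ts status>;
my %log;

# Keys and values pair up by position, as if assigning three
# scalars at once:
@log{@fields} = ( 'GET / HTTP/1.1', '12/Sep/2021:01:02:03 -0400', '200' );

# %log now contains:
#   req    => 'GET / HTTP/1.1',
#   ts     => '12/Sep/2021:01:02:03 -0400',
#   status => '200'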

(Those patterns are worth a mention: they include the robots.txt and sitemap XML files used by search engine indexers, WordPress administration pages, files used by RSS newsreaders subscribed to my blog, and routes used by the Jetpack WordPress add-on. If you’re adapting this for your site you might need to customize this list based on what software you use to run it.)

Lines 38 and 39 parse the timestamp from the log into a DateTime object using DateTime::Format::HTTP and then build the key used to store the per-week hit count. The last lines of the loop then grab the first date of each new week (assuming the log is in chronological order) and increment the count. Once finished, lines 46 and 47 provide a report sorted by week, displaying it as a friendly “Week of date” and the hit counts aligned to the right with sprintf. Number::Format’s format_number function displays the totals with thousands separators.
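
One detail that’s easy to miss: DateTime’s week method returns a two-element list, the ISO week-numbering year and the week number, which is why it can feed sprintf’s two placeholders directly:

my ( $week_year, $week_num ) = $dt->week;             # e.g., ( 2021, 35 )
my $key = sprintf '%u-%02u', $week_year, $week_num;   # '2021-35'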

Update: After this was initially published, astute reader Chris McGowan noted that I had a bug where $log{status} was assigned the value 304 with the = operator rather than compared with ==. He also suggested I use the double-diamond <<>> operator introduced in Perl v5.22.0 to avoid maliciously-named files. Thanks, Chris!
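
For the curious: plain <> runs each element of @ARGV through Perl’s two-argument open, which treats characters such as a trailing | as instructions rather than data, while <<>> uses three-argument open and takes every argument literally. A minimal sketch of the difference:

# A file named 'rm -rf * |' passed as an argument is executed as
# a shell command by the two-argument open underlying plain <>:
while (<>) { print }     # vulnerable to maliciously-named files

# <<>> (introduced in Perl v5.22.0) opens each argument as a
# literal filename, so the same argument is just an odd file:
while (<<>>) { print }   # safe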

Room for improvement

DateTime is a very powerful module, but that power comes at a price in speed and memory. Something simpler like Date::WeekNumber should yield performance improvements, especially as my logs grow (here’s hoping). It requires a bit more manual massaging of the log dates to convert them into something the module can use, though:

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct qw<
  operator-double-diamond
  regex-named-capture-group
>;
use Regexp::Log::Common;
use Date::WeekNumber 'iso_week_number';
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots.txt
  [-\w]*sitemap[-\w]*.xml
  ^/+wp-
  /feed/?$
  ^/+?rest_route=
>;

my %month = (
    Jan => '01',
    Feb => '02',
    Mar => '03',
    Apr => '04',
    May => '05',
    Jun => '06',
    Jul => '07',
    Aug => '08',
    Sep => '09',
    Oct => '10',
    Nov => '11',
    Dec => '12',
);

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    # convert log timestamp to YYYY-MM-DD
    # for Date::WeekNumber
    $log{ts} =~ m!^
      (?<day>\d\d) /
      (?<month>...) /
      (?<year>\d{4}) : !x;
    my $date = "$+{year}-$month{ $+{month} }-$+{day}";

    my $week = iso_week_number($date);
    $week_of{$week} ||= $date;
    $count{$week}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

It looks almost the same as the first version, with the addition of a hash to convert month names to numbers and the actual conversion (using named regular expression capture groups for readability, with Syntax::Construct checking that the feature is available). On my server, this results in a ten- to eleven-second savings when processing two months of compressed logs.
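
Your numbers will differ, of course. If you want to measure the two approaches on your own machine, the core Benchmark module can compare just the date handling in isolation. Here’s a rough sketch using a single hard-coded timestamp (and a one-entry %month map for brevity):

#!/usr/bin/env perl

use v5.12;
use warnings;
use Benchmark 'cmpthese';
use DateTime::Format::HTTP;
use Date::WeekNumber 'iso_week_number';

my $ts    = '12/Sep/2021:01:02:03 -0400';
my %month = ( Sep => '09' );    # abbreviated from the full map above

cmpthese( -5, {    # run each variant for at least 5 CPU seconds
    datetime => sub {
        my $dt = DateTime::Format::HTTP->parse_datetime($ts);
        sprintf '%u-%02u', $dt->week;
    },
    week_number => sub {
        $ts =~ m!^(?<day>\d\d)/(?<month>...)/(?<year>\d{4}):!;
        iso_week_number("$+{year}-$month{ $+{month} }-$+{day}");
    },
} );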

What’s next? Pretty graphs? Drilling down to specific blog posts? Database storage for further queries and analysis? Perl and CPAN make it possible to go far beyond what you can do with AWK. What would you add or change? Let me know in the comments.

Last week I explored using the Inline::Perl5 module to port a short Perl script to Raku while still keeping its Perl dependencies. Over at the Dev.to community, Dave Cross pointed out that I could get a bit more bang for my buck by letting his Feed::Find do the heavy lifting instead of WWW::Mechanize’s more general-purpose parsing.

A little more MetaCPAN investigation yielded XML::Feed, also maintained by Dave, and it had the added benefit of obviating my need for XML::RSS by not only discovering feeds but also retrieving and parsing them. It also handles the Atom syndication format as well as RSS (hi daxim!). Putting it all together produces the following much shorter and clearer Perl:

#!/usr/bin/env perl

use v5.12; # for strict and say
use warnings;
use XML::Feed;
use URI;

my $url = shift @ARGV || 'https://phoenixtrap.com';

my @feeds = XML::Feed->find_feeds($url);
my $feed  = XML::Feed->parse( URI->new( $feeds[0] ) )
    or die "Couldn't find a feed at $url\n";

binmode STDOUT, ':encoding(UTF-8)';
say $_->title, "\t", $_->link for $feed->entries;

And here’s the Raku version:

#!/usr/bin/env raku

use XML::Feed:from<Perl5>;
use URI:from<Perl5>;

sub MAIN($url = 'https://phoenixtrap.com') {
    my @feeds = XML::Feed.find_feeds($url);
    my $feed  = XML::Feed.parse( URI.new( @feeds.first ) )
        or exit note "Couldn't find a feed at $url";

    put .title, "\t", .link for $feed.entries;
}

It’s even closer to Perl now, though it’s using the first routine rather than subscripting the @feeds array and leaving off the $_ variable name when calling methods on it; less punctuation noise often aids readability. I also took a suggested exit idiom from Raku developer Elizabeth Mattijsen on Reddit to simplify the contortions I was going through to exit with a simple message and error code.

There are a couple of lessons here:

  • A little more effort in module shopping pays dividends in simpler code.
  • Get feedback from far and wide to help improve your code. If it’s for work and you can’t release it as open source, make sure your code review process covers readability and maintainability.

The Perl and Raku programming languages have a complicated history together. The latter was envisioned in the year 2000 as Perl 6, a complete redesign and rewrite of Perl to solve its problems of difficult maintenance and the burden of then-13 years of backward compatibility. Unfortunately, the development effort towards a first major release dragged on for ten years, and some developers began to believe the delay contributed to the decline of Perl’s market- and mindshare among programming languages.

In the intervening years work continued on Perl 5, and eventually, Perl 6 was positioned as “a sister language, part of the Perl family, not intended as a replacement for Perl.” Two years ago it was renamed Raku to better signal that it is a different project.

Although the two languages aren’t source-compatible, the Inline::Perl5 module does enable Raku developers to run Perl code and use Perl modules within Raku. You can even subclass Perl classes in Raku and call Raku methods from Perl code. I hadn’t realized until recently that the Perl support was so strong in Raku despite them being so different, and so I thought I’d take the opportunity to write some sample code in both languages to better understand the Raku way of doing things.

Rather than a simple “Hello World” program, I decided to write a small syndicated news reader. The Raku modules directory didn’t appear to have anything comparable to Perl’s WWW::Mechanize and XML::RSS modules, so this seemed like a great way to test Perl-Raku interoperability.

Perl Feed Finder

First, the Perl script. I wanted it smart enough to either directly fetch a news feed or find it on a site’s HTML page.

#!/usr/bin/env perl
use v5.24;    # for strict, say, and postfix dereferencing
use warnings;
use WWW::Mechanize;
use XML::RSS;
use List::Util 1.33 qw(first none);
my @rss_types = qw<
    application/rss+xml
    application/rdf+xml
    application/xml
    text/xml
>;
my $mech = WWW::Mechanize->new;
my $rss  = XML::RSS->new;
my $url = shift @ARGV || 'https://phoenixtrap.com';
my $response = $mech->get($url);
# If we got an HTML page, find the linked RSS feed
if ( $mech->is_html
    and my @alt_links = $mech->find_all_links( rel => 'alternate' ) )
{
    for my $rss_type (@rss_types) {
        # guard against types with no matching link, which would
        # otherwise die calling ->url on undef
        my $link = first { ( $_->attrs->{type} // q{} ) eq $rss_type }
            @alt_links;
        next unless $link;
        $url = $link->url;
        last;
    }
    $response = $mech->get($url);
}
die "$url does not have an RSS feed\n"
    if none { $_ eq $response->content_type } @rss_types;
binmode STDOUT, ':encoding(UTF-8)';    # avoid wide character warnings
my @items = $rss->parse( $mech->content )->{items}->@*;
say join "\t", $_->@{qw<title link>} for @items;

In the beginning, you’ll notice there’s a bit of boilerplate: use v5.24 (released in 2016) to enable restricting unsafe code, the say function, and postfix dereferencing to reduce the noise from nested curly braces. I’m also bringing in the first and none list processing functions from List::Util as well as the WWW::Mechanize web page retriever and parser, and the XML::RSS feed parser.

Next is an array of possible media (formerly MIME) types used to serve the RSS news feed format on the web. Like Perl and Raku, RSS formats have a long and sometimes contentious history, so a newsreader needs to support several different ways of identifying them on a page.

The program then creates new WWW::Mechanize (called a mech for short) and XML::RSS objects for use later and gets a URL to browse from its command-line argument, defaulting to my blog if it has none. (My site, my rules, right?) It then retrieves that URL from the web. If mech believes that the URL contains an HTML page and can find link tags with rel="alternate" attributes possibly identifying any news feeds, it then goes on to check the media types of those links against the earlier list of RSS types and retrieves the first one it finds.

Next comes the only error checking done by this script: checking if the retrieved feed’s media type actually matches the list defined earlier. This prevents the RSS parser from attempting to process plain web pages. This isn’t a large and complicated program, so the die function is called with a trailing newline character (\n) to suppress reporting the line on which the error occurred.
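
The trailing-newline behavior is easy to demonstrate on its own (the script name and line number below are illustrative):

use v5.12;

# With a trailing newline, die prints the message as-is:
eval { die "no feed found\n" };
print $@;    # no feed found

# Without one, Perl appends the file and line number:
eval { die "no feed found" };
print $@;    # no feed found at feed_finder.pl line 9.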

Finally, it’s time to output the headlines and links, but before that happens Perl has to be told that they may contain so-called “wide characters” found in the Unicode standard but not in the plain ASCII that it normally uses. This includes things like the typographical ‘curly quotes’ that I sometimes use in my titles. The last two lines of the script loop through the parsed items in the feed, extracting their titles and links and printing them out with a tab (\t) separator between them:

Output from feed_finder.pl

Raku Feed Finder

Programming is often just stitching libraries and APIs together, so it shouldn’t have been surprising that the Raku version of the above would be so similar. There are some significant (and sometimes welcome) differences, though, which I’ll go over now:

#!/usr/bin/env raku
use WWW::Mechanize:from<Perl5>;
use XML::RSS:from<Perl5>;
my @rss_types = qw<
  application/rss+xml
  application/rdf+xml
  application/xml
  text/xml
>;
my $mech = WWW::Mechanize.new;
my $rss  = XML::RSS.new;
sub MAIN($url = 'https://phoenixtrap.com') {
    my $response = $mech.get($url);
    # If we got an HTML page, find the linked RSS feed        
    if $mech.is_html {
        my @alt_links = $mech.find_all_links( Scalar, rel => 'alternate' );
        $response = $mech.get(
            @alt_links.first( *.attrs<type> (elem) @rss_types ).url
        );
    }
    if $response.content_type(Scalar) !(elem) @rss_types {
        # Overriding Raku's `die` stack trace is more verbose than we need
        note $mech.uri ~ ' does not have an RSS feed';
        exit 1;
    }
    my @items = $rss.parse( $mech.content ).<items>;
    put join "\t", $_<title link> for @items;
}

The first thing to notice is there’s a bit less boilerplate code at the beginning. Raku is a younger language and doesn’t have to add instructions to enable less backward-compatible features. It’s also a larger language with functions and methods built in that Perl needs to load from modules, though this feed finder program still needs to bring in WWW::Mechanize and XML::RSS with annotations to indicate they’re coming from the Perl5 side of the fence.

I decided to wrap the majority of the program in a MAIN function, which handily gives me command-line arguments as variables as well as a usage message if someone calls it with a --help option. This is a neat quality-of-life feature for script authors that cleverly reuses function signatures, and I’d love to see this available in Perl as an extension to its signatures feature.

Raku and Perl also differ in that the former has a different concept of context, where an expression may be evaluated differently depending upon whether its result is expected to be a single value (scalar) or a list of values. Inline::Perl5 calls Perl functions in list context by default, but you can add the Scalar type object as a first argument to force scalar context, as I’ve done with calls to find_all_links (to return an array reference) and content_type (to return the first parameter of the HTTP Content-Type header).

Another interesting difference is the use of the (elem) operator to determine membership in a set. This is Raku’s ASCII way of spelling the ∈ symbol, which it can also use; !(elem) can also be spelled ∉. Both are hard to type on my keyboard so I chose the more verbose alternative, but if you want your code to more closely resemble mathematical notation it’s nice to know the option is there.

I also didn’t use Raku’s die routine to exit the program with an error, mainly because of its method of suppressing the line on which the error occurred. It requires using a CATCH block and then keying off of the type of exception thrown in order to customize its behavior, which seemed like overkill for such a small script. It would have looked something like this:

{
    die $mech.uri ~ ' does not have an RSS feed'
        if $response.content_type(Scalar) !(elem) @rss_types;
    CATCH {
        default {
            note .message;
            exit 1;
        }
    }
}

Doubtless, this could be golfed down to reduce its verbosity at the expense of readability, but I didn’t want to resort to clever tricks when trying to do a one-to-one comparison with Perl. More experienced Raku developers are welcome to set me straight in the comments below.

The last difference I’ll point out is Raku’s welcome lack of dereferencing operators compared to Perl. This is due to the former’s concept of containers, which I’m still learning about. It seems to be fairly DWIMmy so I’m not that worried, but it’s nice to know there’s an understandable mechanism behind it.

Overall I’m pleased with this first venture into Raku and I enjoyed what I’ve learned of the language so far. It’s not as different from Perl as I anticipated, and I can foresee coding more projects as I learn more. The community on the #raku IRC channel was also very friendly and helpful, so I’ll be hanging out there as time permits.

What do you think? Can Perl and Raku better learn to coexist, or are they destined to be rivals? Leave a comment below.