depth of field photography of brown tree logs

A recent Lobsters post laud­ing the virtues of AWK remind­ed me that although the lan­guage is pow­er­ful and lightning-​fast, I usu­al­ly find myself exceed­ing its capa­bil­i­ties and reach­ing for Perl instead. One such appli­ca­tion is ana­lyz­ing volu­mi­nous log files such as the ones gen­er­at­ed by this blog. Yes, WordPress has stats, but I’ve nev­er let rein­ven­tion of the wheel get in the way of a good pro­gram­ming exercise.

So I whipped this script up on Sunday night while watch­ing RuPaul’s Drag Race reruns. It pars­es my Apache web serv­er log files and reports on hits from week to week.

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct 'operator-double-diamond';
use Regexp::Log::Common;
use DateTime::Format::HTTP;
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots.txt
  [-\w]*sitemap[-\w]*.xml
  ^/+wp-
  /feed/?$
  ^/+?rest_route=
>;

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    my $dt  = DateTime::Format::HTTP->parse_datetime( $log{ts} );
    my $key = sprintf '%u-%02u', $dt->week;

    # get first date of each week
    $week_of{$key} ||= $dt->date;
    $count{$key}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

Here’s some sam­ple output:

Week of 2021-07-31:      2,672
Week of 2021-08-02:     16,222
Week of 2021-08-09:     12,609
Week of 2021-08-16:     17,714
Week of 2021-08-23:     14,462
Week of 2021-08-30:     11,758
Week of 2021-09-06:     14,811
Week of 2021-09-13:        407

I first start­ed pro­to­typ­ing this on the com­mand line as if it were an awk one-​liner by using the perl -n and -a flags. The for­mer wraps code in a while loop over the <> dia­mond oper­a­tor”, pro­cess­ing each line from stan­dard input or files passed as argu­ments. The lat­ter splits the fields of the line into an array named @F. It looked some­thing like this while I was list­ing URIs (loca­tions on the website):

gunzip -c ~/logs/phoenixtrap.com-ssl_log-*.gz | \
perl -anE 'say $F[6]'

But once I real­ized I’d need to fil­ter out a bunch of URI pat­terns and do some aggre­ga­tion by date, I turned it into a script and turned to CPAN.

There I found Regexp::Log::Common and DateTime::Format::HTTP, which let me pull apart the Apache log for­mat and its time­stamp strings with­out hav­ing to write even more com­pli­cat­ed reg­u­lar expres­sions myself. (As not­ed above, this was already a wheel-​reinvention exer­cise; no need to com­pound that further.)

Regexp::Log::Common builds a com­piled reg­u­lar expres­sion based on the log for­mat and fields you’re inter­est­ed in, so that’s the con­struc­tor on lines 11 through 14. The expres­sion then returns those fields as a list, which I’m assign­ing to a hash slice with those field names as keys in line 29. I then skip over requests that aren’t suc­cess­ful or brows­er cache hits, skip over requests that don’t GET web pages or oth­er assets (e.g., POSTs to forms or updat­ing oth­er resources), and skip over the URI pat­terns men­tioned earlier.

(Those pat­terns are worth a men­tion: they include the robots.txt and sitemap XML files used by search engine index­ers, WordPress admin­is­tra­tion pages, files used by RSS news­read­ers sub­scribed to my blog, and routes used by the Jetpack WordPress add-​on. If you’re adapt­ing this for your site you might need to cus­tomize this list based on what soft­ware you use to run it.)

Lines 38 and 39 parse the time­stamp from the log into a DateTime object using DateTime::Format::HTTP and then build the key used to store the per-​week hit count. The last lines of the loop then grab the first date of each new week (assum­ing the log is in chrono­log­i­cal order) and incre­ment the count. Once fin­ished, lines 46 and 47 pro­vide a report sort­ed by week, dis­play­ing it as a friend­ly Week of date” and the hit counts aligned to the right with sprintf. Number::Format’s format_number func­tion dis­plays the totals with thou­sands separators.

Update: After this was ini­tial­ly pub­lished. astute read­er Chris McGowan not­ed that I had a bug where $log{status} was assigned the val­ue 304 with the = oper­a­tor rather than com­pared with ==. He also sug­gest­ed I use the double-​diamond <<>> oper­a­tor intro­duced in Perl v5.22.0 to avoid maliciously-​named files. Thanks, Chris!

Room for improvement

DateTime is a very pow­er­ful mod­ule but this comes at a price of speed and mem­o­ry. Something sim­pler like Date::WeekNumber should yield per­for­mance improve­ments, espe­cial­ly as my logs grow (here’s hop­ing). It requires a bit more man­u­al mas­sag­ing of the log dates to con­vert them into some­thing the mod­ule can use, though:

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct qw<
  operator-double-diamond
  regex-named-capture-group
>;
use Regexp::Log::Common;
use Date::WeekNumber 'iso_week_number';
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots.txt
  [-\w]*sitemap[-\w]*.xml
  ^/+wp-
  /feed/?$
  ^/+?rest_route=
>;

my %month = (
    Jan => '01',
    Feb => '02',
    Mar => '03',
    Apr => '04',
    May => '05',
    Jun => '06',
    Jul => '07',
    Aug => '08',
    Sep => '09',
    Oct => '10',
    Nov => '11',
    Dec => '12',
);

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    # convert log timestamp to YYYY-MM-DD
    # for Date::WeekNumber
    $log{ts} =~ m!^
      (?<day>\d\d) /
      (?<month>...) /
      (?<year>\d{4}) : !x;
    my $date = "$+{year}-$month{ $+{month} }-$+{day}";

    my $week = iso_week_number($date);
    $week_of{$week} ||= $date;
    $count{$week}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

It looks almost the same as the first ver­sion, with the addi­tion of a hash to con­vert month names to num­bers and the actu­al con­ver­sion (using named reg­u­lar expres­sion cap­ture groups for read­abil­i­ty, using Syntax::Construct to check for that fea­ture). On my serv­er, this results in a ten- to eleven-​second sav­ings when pro­cess­ing two months of com­pressed logs.

What’s next? Pretty graphs? Drilling down to spe­cif­ic blog posts? Database stor­age for fur­ther queries and analy­sis? Perl and CPAN make it pos­si­ble to go far beyond what you can do with AWK. What would you add or change? Let me know in the comments.

Last week I explored using the Inline::Perl5 mod­ule to port a short Perl script to Raku while still keep­ing its Perl depen­den­cies. Over at the Dev.to com­mu­ni­ty, Dave Cross point­ed out that I could get a bit more bang for my buck by let­ting his Feed::Find do the heavy lift­ing instead of WWW::Mechanizes more general-​purpose parsing.

A lit­tle more MetaCPAN inves­ti­ga­tion yield­ed XML::Feed, also main­tained by Dave, and it had the added ben­e­fit of obvi­at­ing my need for XML::RSS by not only dis­cov­er­ing feeds but also retriev­ing and pars­ing them. It also han­dles the Atom syn­di­ca­tion for­mat as well as RSS (hi dax­im!). Putting it all togeth­er pro­duces the fol­low­ing much short­er and clear­er Perl:

#!/usr/bin/env perl

use v5.12; # for strict and say
use warnings;
use XML::Feed;
use URI;

my $url = shift @ARGV || 'https://phoenixtrap-com.us.stackstaging.com';

my @feeds = XML::Feed->find_feeds($url);
my $feed  = XML::Feed->parse( URI->new( $feeds[0] ) )
    or die "Couldn't find a feed at $url\n";

binmode STDOUT, ':encoding(UTF-8)';
say $_->title, "\t", $_->link for $feed->entries;

And here’s the Raku version:

#!/usr/bin/env raku

use XML::Feed:from<Perl5>;
use URI:from<Perl5>;

sub MAIN($url = 'https://phoenixtrap-com.us.stackstaging.com') {
    my @feeds = XML::Feed.find_feeds($url);
    my $feed  = XML::Feed.parse( URI.new( @feeds.first ) )
        or exit note "Couldn't find a feed at $url";

    put .title, "\t", .link for $feed.entries;
}

It’s even clos­er to Perl now, though it’s using the first rou­tine rather than sub­script­ing the @feeds array and leav­ing off the the $_ vari­able name when call­ing meth­ods on it—less punc­tu­a­tion noise often aids read­abil­i­ty. I also took a sug­gest­ed exit idiom from Raku devel­op­er Elizabeth Mattijsen on Reddit to sim­pli­fy the con­tor­tions I was going through to exit with a sim­ple mes­sage and error code.

There are a cou­ple of lessons here:

  • A lit­tle more effort in mod­ule shop­ping pays div­i­dends in sim­pler code.
  • Get feed­back from far and wide to help improve your code. If it’s for work and you can’t release as open-​source, make sure your code review process cov­ers read­abil­i­ty and maintainability.

The Perl and Raku pro­gram­ming lan­guages have a com­pli­cat­ed his­to­ry togeth­er. The lat­ter was envi­sioned in the year 2000 as Perl 6, a com­plete redesign and rewrite of Perl to solve its prob­lems of dif­fi­cult main­te­nance and the bur­den of then-​13 years of back­ward com­pat­i­bil­i­ty. Unfortunately, the devel­op­ment effort towards a first major release dragged on for ten years, and some devel­op­ers began to believe the delay con­tributed to the decline of Perl’s market- and mind­share among pro­gram­ming languages.

In the inter­ven­ing years work con­tin­ued on Perl 5, and even­tu­al­ly, Perl 6 was posi­tioned as a sis­ter lan­guage, part of the Perl fam­i­ly, not intend­ed as a replace­ment for Perl.” Two years ago it was renamed Raku to bet­ter indi­cate it as a dif­fer­ent project.

Although the two lan­guages aren’t source-​compatible, the Inline::Perl5 mod­ule does enable Raku devel­op­ers to run Perl code and use Perl mod­ules with­in Raku, You can even sub­class Perl class­es in Raku and call Raku meth­ods from Perl code. I hadn’t real­ized until recent­ly that the Perl sup­port was so strong in Raku despite them being so dif­fer­ent, and so I thought I’d take the oppor­tu­ni­ty to write some sam­ple code in both lan­guages to bet­ter under­stand the Raku way of doing things.

Rather than a sim­ple Hello World” pro­gram, I decid­ed to write a sim­ple syn­di­cat­ed news read­er. The Raku mod­ules direc­to­ry didn’t appear to have any­thing com­pa­ra­ble to Perl’s WWW::Mechanize and XML::RSS mod­ules, so this seemed like a great way to test Perl-​Raku interoperability.

Perl Feed Finder

First, the Perl script. I want­ed it smart enough to either direct­ly fetch a news feed or find it on a site’s HTML page.

#!/usr/bin/env perl
use v5.24;    # for strict, say, and postfix dereferencing
use warnings;
use WWW::Mechanize;
use XML::RSS;
use List::Util 1.33 qw(first none);
my @rss_types = qw<
    application/rss+xml
    application/rdf+xml
    application/xml
    text/xml
>;
my $mech = WWW::Mechanize->new;
my $rss  = XML::RSS->new;
my $url = shift @ARGV || 'https://phoenixtrap-com.us.stackstaging.com';
my $response = $mech->get($url);
# If we got an HTML page, find the linked RSS feed
if ( $mech->is_html
    and my @alt_links = $mech->find_all_links( rel => 'alternate' ) )
{
    for my $rss_type (@rss_types) {
        $url = ( first { $_->attrs->{type} eq $rss_type } @alt_links )->url
            and last;
    }
    $response = $mech->get($url);
}
die "$url does not have an RSS feed\n"
    if none { $_ eq $response->content_type } @rss_types;
binmode STDOUT, ':encoding(UTF-8)';    # avoid wide character warnings
my @items = $rss->parse( $mech->content )->{items}->@*;
say join "\t", $_->@{qw<title link>} for @items;

In the begin­ning, you’ll notice there’s a bit of boil­er­plate: use v5.24 (released in 2016) to enable restrict­ing unsafe code, the say func­tion, and post­fix deref­er­enc­ing to reduce the noise from nest­ed curly braces. I’m also bring­ing in the first and none list pro­cess­ing func­tions from List::Util as well as the WWW::Mechanize web page retriev­er and pars­er, and the XML::RSS feed parser.

Next is an array of pos­si­ble media (for­mer­ly MIME) types used to serve the RSS news feed for­mat on the web. Like Perl and Raku, RSS for­mats have a long and some­times con­tentious his­to­ry, so a news­read­er needs to sup­port sev­er­al dif­fer­ent ways of iden­ti­fy­ing them on a page.

The pro­gram then cre­ates new WWW::Mechanize (called a mech for short) and XML::RSS objects for use lat­er and gets a URL to browse from its command-​line argu­ment, default­ing to my blog if it has none. (My site, my rules, right?) It then retrieves that URL from the web. If mech believes that the URL con­tains an HTML page and can find link tags with rel="alternate" attrib­ut­es pos­si­bly iden­ti­fy­ing any news feeds, it then goes on to check the media types of those links against the ear­li­er list of RSS types and retrieves the first one it finds.

Next comes the only error check­ing done by this script: check­ing if the retrieved feed’s media type actu­al­ly match­es the list defined ear­li­er. This pre­vents the RSS pars­er from attempt­ing to process plain web pages. This isn’t a large and com­pli­cat­ed pro­gram, so the die func­tion is called with a trail­ing new­line char­ac­ter (\n) to sup­press report­ing the line on which the error occurred.

Finally, it’s time to out­put the head­lines and links, but before that hap­pens Perl has to be told that they may con­tain so-​called wide char­ac­ters” found in the Unicode stan­dard but not in the plain ASCII that it nor­mal­ly uses. This includes things like the typo­graph­i­cal curly quotes’ that I some­times use in my titles. The last two lines of the script loop through the parsed items in the feed, extract­ing their titles and links and print­ing them out with a tab (\t) sep­a­ra­tor between them:

Output from feed_finder.pl

Raku Feed Finder

Programming is often just stitch­ing libraries and APIs togeth­er, so it shouldn’t have been sur­pris­ing that the Raku ver­sion of the above would be so sim­i­lar. There are some sig­nif­i­cant (and some­times wel­come) dif­fer­ences, though, which I’ll go over now:

#!/usr/bin/env raku
use WWW::Mechanize:from<Perl5>;
use XML::RSS:from<Perl5>;
my @rss_types = qw<
  application/rss+xml
  application/rdf+xml
  application/xml
  text/xml
>;
my $mech = WWW::Mechanize.new;
my $rss  = XML::RSS.new;
sub MAIN($url = 'https://phoenixtrap-com.us.stackstaging.com') {
    my $response = $mech.get($url);
    # If we got an HTML page, find the linked RSS feed        
    if $mech.is_html {
        my @alt_links = $mech.find_all_links( Scalar, rel => 'alternate' );
        $response = $mech.get(
            @alt_links.first( *.attrs<type> (elem) @rss_types ).url
        );
    }
    if $response.content_type(Scalar) !(elem) @rss_types {
        # Overriding Raku's `die` stack trace is more verbose than we need
        note $mech.uri ~ ' does not have an RSS feed';
        exit 1;
    }
    my @items = $rss.parse( $mech.content ).<items>;
    put join "\t", $_<title link> for @items;
}

The first thing to notice is there’s a bit less boil­er­plate code at the begin­ning. Raku is a younger lan­guage and doesn’t have to add instruc­tions to enable less backward-​compatible fea­tures. It’s also a larg­er lan­guage with func­tions and meth­ods built-​in that Perl needs to load from mod­ules, though this feed find­er pro­gram still needs to bring in WWW::Mechanize and XML::RSS with anno­ta­tions to indi­cate they’re com­ing from the Perl5 side of the fence.

I decid­ed to wrap the major­i­ty of the pro­gram in a MAIN func­tion, which hand­i­ly gives me command-​line argu­ments as vari­ables as well as a usage mes­sage if some­one calls it with a --help option. This is a neat quality-​of-​life fea­ture for script authors that clev­er­ly reuses func­tion sig­na­tures, and I’d love to see this avail­able in Perl as an exten­sion to its sig­na­tures feature.

Raku and Perl also dif­fer in that the for­mer has a dif­fer­ent con­cept of con­text, where an expres­sion may be eval­u­at­ed dif­fer­ent­ly depend­ing upon whether its result is expect­ed to be a sin­gle val­ue (scalar) or a list of val­ues. Inline::Perl5 calls Perl func­tions in list con­text by default, but you can add the Scalar type object as a first argu­ment to force scalar con­text as I’ve done with calls to find_​all_​links (to return an array ref­er­ence) and content_​type (to return the first para­me­ter of the HTTP Content-​Type header).

Another inter­est­ing dif­fer­ence is the use of the (elem) oper­a­tor to deter­mine mem­ber­ship in a set. This is Raku’s ASCII way of spelling the ∈ sym­bol, which it can also use; !(elem) can also be spelled . Both are hard to type on my key­board so I chose the more ver­bose alter­na­tive, but if you want your code to more close­ly resem­ble math­e­mat­i­cal nota­tion it’s nice to know the option is there.

I also didn’t use Raku’s die rou­tine to exit the pro­gram with an error, main­ly because of its method of sup­press­ing the line on which the error occurred. It requires using a CATCH block and then key­ing off of the type of excep­tion thrown in order to cus­tomize its behav­ior, which seemed like overkill for such a small script. It would have looked some­thing like this:

{
    die $mech.uri ~ ' does not have an RSS feed'
        if $response.content_type(Scalar) !(elem) @rss_types;
    CATCH {
        default {
            note .message;
            exit 1;
        }
    }
}

Doubtless, this could be golfed down to reduce its ver­bosi­ty at the expense of read­abil­i­ty, but I didn’t want to resort to clever tricks when try­ing to do a one-​to-​one com­par­i­son with Perl. More expe­ri­enced Raku devel­op­ers are wel­come to set me straight in the com­ments below.

The last dif­fer­ence I’ll point out is Raku’s wel­come lack of deref­er­enc­ing oper­a­tors com­pared to Perl. This is due to the former’s con­cept of con­tain­ers, which I’m still learn­ing about. It seems to be fair­ly DWIMmy so I’m not that wor­ried, but it’s nice to know there’s an under­stand­able mech­a­nism behind it.

Overall I’m pleased with this first ven­ture into Raku and I enjoyed what I’ve learned of the lan­guage so far. It’s not as dif­fer­ent with Perl as I antic­i­pat­ed, and I can fore­see cod­ing more projects as I learn more. The com­mu­ni­ty on the #raku IRC chan­nel was also very friend­ly and help­ful, so I’ll be hang­ing out there as time permits.

What do you think? Can Perl and Raku bet­ter learn to coex­ist, or are they des­tined to be rivals? Leave a com­ment below.