
A recent Lobsters post laud­ing the virtues of AWK remind­ed me that although the lan­guage is pow­er­ful and lightning-​fast, I usu­al­ly find myself exceed­ing its capa­bil­i­ties and reach­ing for Perl instead. One such appli­ca­tion is ana­lyz­ing volu­mi­nous log files such as the ones gen­er­at­ed by this blog. Yes, WordPress has stats, but I’ve nev­er let rein­ven­tion of the wheel get in the way of a good pro­gram­ming exercise.

So I whipped this script up on Sunday night while watch­ing RuPaul’s Drag Race reruns. It pars­es my Apache web serv­er log files and reports on hits from week to week.

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct 'operator-double-diamond';
use Regexp::Log::Common;
use DateTime::Format::HTTP;
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots\.txt
  [-\w]*sitemap[-\w]*\.xml
  ^/+wp-
  /feed/?$
  ^/+\?rest_route=
>;

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/ or next;    # skip lines that don't parse

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    my $dt  = DateTime::Format::HTTP->parse_datetime( $log{ts} );
    my $key = sprintf '%u-%02u', $dt->week;

    # get first date of each week
    $week_of{$key} ||= $dt->date;
    $count{$key}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

Here’s some sam­ple output:

Week of 2021-07-31:      2,672
Week of 2021-08-02:     16,222
Week of 2021-08-09:     12,609
Week of 2021-08-16:     17,714
Week of 2021-08-23:     14,462
Week of 2021-08-30:     11,758
Week of 2021-09-06:     14,811
Week of 2021-09-13:        407

I first started prototyping this on the command line as if it were an awk one-liner by using the perl -n and -a flags. The former wraps code in a while loop over the <> “diamond operator”, processing each line from standard input or files passed as arguments. The latter splits the fields of the line into an array named @F. It looked something like this while I was listing URIs (locations on the website):

gunzip -c ~/logs/phoenixtrap.com-ssl_log-*.gz | \
perl -anE 'say $F[6]'
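
For readers less familiar with those flags, here is a rough, hand-expanded sketch of what -n and -a do for the one-liner above. The sample log line is invented for illustration, and the loop reads from an array rather than from files or standard input, so treat it as an approximation of what perl really generates.

```perl
use strict;
use warnings;

# A made-up log line in Apache's combined format (not from the real logs).
my @lines = (
    '203.0.113.9 - - [02/Aug/2021:10:15:00 +0000] "GET /about/ HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"',
);

my @uris;
for (@lines) {            # -n: visit every input line, one at a time, in $_
    my @F = split ' ';    # -a: split $_ on whitespace into the array @F
    push @uris, $F[6];    # the one-liner's body, collecting instead of printing
}

print "$_\n" for @uris;
```

Counting from zero, the seventh whitespace-separated field is the request URI, which is why the one-liner prints $F[6].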

But once I real­ized I’d need to fil­ter out a bunch of URI pat­terns and do some aggre­ga­tion by date, I turned it into a script and turned to CPAN.

There I found Regexp::Log::Common and DateTime::Format::HTTP, which let me pull apart the Apache log for­mat and its time­stamp strings with­out hav­ing to write even more com­pli­cat­ed reg­u­lar expres­sions myself. (As not­ed above, this was already a wheel-​reinvention exer­cise; no need to com­pound that further.)

Regexp::Log::Common builds a com­piled reg­u­lar expres­sion based on the log for­mat and fields you’re inter­est­ed in, so that’s the con­struc­tor on lines 11 through 14. The expres­sion then returns those fields as a list, which I’m assign­ing to a hash slice with those field names as keys in line 29. I then skip over requests that aren’t suc­cess­ful or brows­er cache hits, skip over requests that don’t GET web pages or oth­er assets (e.g., POSTs to forms or updat­ing oth­er resources), and skip over the URI pat­terns men­tioned earlier.
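
The hash-slice assignment may be easier to see in isolation. This is a minimal sketch that swaps in a short hand-written regular expression for the one Regexp::Log::Common builds; the field order and the sample line are assumptions made for the example.

```perl
use strict;
use warnings;

# Hand-written stand-in for the regex Regexp::Log::Common would build.
# It captures the timestamp, request line, and status, in that order.
my @fields = qw(ts req status);
my $re     = qr/\[([^\]]+)\] "([^"]*)" (\d{3})/;

# Made-up log line for illustration.
my $line = '203.0.113.9 - - [02/Aug/2021:10:15:00 +0000] "GET /about/ HTTP/1.1" 200 512';

my %log;
@log{@fields} = $line =~ $re;    # hash slice: each capture lands under its field name

print "$_ => $log{$_}\n" for @fields;
```

Because the match runs in list context, it returns the captures in order, and the slice pairs them with the names in @fields.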

(Those pat­terns are worth a men­tion: they include the robots.txt and sitemap XML files used by search engine index­ers, WordPress admin­is­tra­tion pages, files used by RSS news­read­ers sub­scribed to my blog, and routes used by the Jetpack WordPress add-​on. If you’re adapt­ing this for your site you might need to cus­tomize this list based on what soft­ware you use to run it.)
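
If you do customize the list, it can help to try it against a handful of URIs first. The sketch below does that with invented URIs; the patterns are written with the literal dots and question mark backslash-escaped so that they match only those characters.

```perl
use strict;
use warnings;
use List::Util 1.33 'any';

my @skip_uri_patterns = qw<
  ^/+robots\.txt
  [-\w]*sitemap[-\w]*\.xml
  ^/+wp-
  /feed/?$
  ^/+\?rest_route=
>;

# Made-up URIs for illustration.
my @uris = qw<
  /robots.txt
  /post-sitemap.xml
  /wp-login.php
  /2021/08/some-post/feed/
  /?rest_route=/wp/v2/posts
  /2021/08/some-post/
>;

# Keep only the URIs that match none of the skip patterns.
my @kept = grep {
    my $uri = $_;
    not any { $uri =~ $_ } @skip_uri_patterns;
} @uris;

print "$_\n" for @kept;
```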

Lines 38 and 39 parse the timestamp from the log into a DateTime object using DateTime::Format::HTTP and then build the key used to store the per-week hit count. The last lines of the loop then grab the first date of each new week (assuming the log is in chronological order) and increment the count. Once finished, lines 46 and 47 provide a report sorted by week, displaying it as a friendly “Week of date” and the hit counts aligned to the right with printf. Number::Format’s format_number function displays the totals with thousands separators.
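
To watch the week-keyed bookkeeping on its own, here is a small sketch that uses only core modules. It swaps in Time::Piece for DateTime and invented timestamps for the real log, and it pairs the calendar year with the ISO week number, which can mislabel days around New Year (DateTime’s week method returns the proper week-year instead).

```perl
use strict;
use warnings;
use Time::Piece;

# Made-up timestamps in Apache's log format, already in chronological order.
my @timestamps = (
    '31/Jul/2021:23:59:00 +0000',
    '02/Aug/2021:10:15:00 +0000',
    '03/Aug/2021:08:00:00 +0000',
);

my ( %count, %week_of );
for my $ts (@timestamps) {
    my ($local) = split ' ', $ts;    # drop the UTC offset for simplicity
    my $t   = Time::Piece->strptime( $local, '%d/%b/%Y:%H:%M:%S' );
    my $key = sprintf '%04d-%02d', $t->year, $t->week;

    $week_of{$key} ||= $t->ymd;      # first date seen in this week
    $count{$key}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, $count{$_} for sort keys %count;
```

The zero-padded key is what lets a plain string sort put the weeks in order.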

Update: After this was initially published, astute reader Chris McGowan noted that I had a bug where $log{status} was assigned the value 304 with the = operator rather than compared with ==. He also suggested I use the double-diamond <<>> operator introduced in Perl v5.22.0, which opens each argument strictly as a file name and so avoids trouble with maliciously named files. Thanks, Chris!

Room for improvement

DateTime is a very pow­er­ful mod­ule but this comes at a price of speed and mem­o­ry. Something sim­pler like Date::WeekNumber should yield per­for­mance improve­ments, espe­cial­ly as my logs grow (here’s hop­ing). It requires a bit more man­u­al mas­sag­ing of the log dates to con­vert them into some­thing the mod­ule can use, though:

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct qw<
  operator-double-diamond
  regex-named-capture-group
>;
use Regexp::Log::Common;
use Date::WeekNumber 'iso_week_number';
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots\.txt
  [-\w]*sitemap[-\w]*\.xml
  ^/+wp-
  /feed/?$
  ^/+\?rest_route=
>;

my %month = (
    Jan => '01',
    Feb => '02',
    Mar => '03',
    Apr => '04',
    May => '05',
    Jun => '06',
    Jul => '07',
    Aug => '08',
    Sep => '09',
    Oct => '10',
    Nov => '11',
    Dec => '12',
);

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/ or next;    # skip lines that don't parse

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    # convert log timestamp to YYYY-MM-DD
    # for Date::WeekNumber
    $log{ts} =~ m!^
      (?<day>\d\d) /
      (?<month>...) /
      (?<year>\d{4}) : !x;
    my $date = "$+{year}-$month{ $+{month} }-$+{day}";

    my $week = iso_week_number($date);
    $week_of{$week} ||= $date;
    $count{$week}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

It looks almost the same as the first ver­sion, with the addi­tion of a hash to con­vert month names to num­bers and the actu­al con­ver­sion (using named reg­u­lar expres­sion cap­ture groups for read­abil­i­ty, using Syntax::Construct to check for that fea­ture). On my serv­er, this results in a ten- to eleven-​second sav­ings when pro­cess­ing two months of com­pressed logs.
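
The date conversion is the only genuinely new logic, so here it is pulled out into a standalone sketch. The log_ts_to_ymd helper is a hypothetical name introduced for this example, and unlike the script it returns undef when the timestamp does not match rather than carrying on.

```perl
use strict;
use warnings;

# Map English month abbreviations to two-digit month numbers.
my @names = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
my %month;
@month{@names} = map { sprintf '%02d', $_ } 1 .. 12;

# Hypothetical helper: returns YYYY-MM-DD, or undef if the timestamp
# does not look like "DD/Mon/YYYY:...".
sub log_ts_to_ymd {
    my ($ts) = @_;
    $ts =~ m!^ (?<day>\d\d) / (?<month>[A-Z][a-z]{2}) / (?<year>\d{4}) : !x
        or return undef;
    my $mm = $month{ $+{month} } or return undef;
    return "$+{year}-$mm-$+{day}";
}

print log_ts_to_ymd('02/Aug/2021:10:15:00 +0000'), "\n";
```

The named captures are read back through the %+ hash, which keeps the string interpolation readable compared with $1, $2, and $3.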

What’s next? Pretty graphs? Drilling down to spe­cif­ic blog posts? Database stor­age for fur­ther queries and analy­sis? Perl and CPAN make it pos­si­ble to go far beyond what you can do with AWK. What would you add or change? Let me know in the comments.


Last week’s article received a comment on a private Facebook group that amounted to “just use JavaScript’s built-in formatting.” So what would that look like?

#!/usr/bin/env perl

use Mojolicious::Lite -signatures;
use DateTime;

get '/' =>
    sub ($c) { $c->render( template => 'index', date => DateTime->today ) };

helper localize_date => sub ( $c, $date = DateTime->today, $style = 'full' ) {
    my $date_params = join ',' => $date->year, $date->month_0, $date->day;
    return
        qq<new Date($date_params).toLocaleString( [], {dateStyle: "$style"})>;
};

app->start;
__DATA__
@@ index.html.ep
% layout 'default';
% title 'Today';
<ul>
    <li><script>
        document.write(<%== localize_date $date %>)
    </script></li>
    % for my $style ( qw(long medium short) ) {
    <li><script>
        document.write(<%== localize_date $date, $style %>)
    </script></li>
    % }
</ul>
@@ layouts/default.html.ep
<!DOCTYPE html>
<html>
    <head><title><%= title %></title></head>
    <body><%= content %></body>
</html>

It’s struc­tured much like the Perl-​only solu­tion, with a default "/" route and a localize_date Mojolicious helper to do the for­mat­ting. I opt­ed to out­put a piece of JavaScript from the helper on lines 11 through 14 since it could be repeat­ed sev­er­al times in a doc­u­ment. You could instead declare a func­tion in the default lay­out’s HTML <head> on line 38 that would receive a date and a for­mat­ting style, out­putting the result­ing for­mat­ted date.

In the tem­plate’s list from lines 22 through 31 I decid­ed to use JavaScript document.write method calls to add our gen­er­at­ed code. This has a slew of caveats but works for our exam­ple here.

Worth noting is the double equals sign (<%== %>) when embedding a Perl expression. This prevents Mojolicious from XML-escaping special characters, e.g., replacing "quotes" with &quot;, <angle brackets> with &lt; and &gt;, etc. This is important when returning HTML and JavaScript code.
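
To see why that matters here, consider what escaping would do to the generated JavaScript. The escape_markup function below is a simplified stand-in written for this example, not Mojolicious’ own code, and the exact entities Mojolicious emits may differ.

```perl
use strict;
use warnings;

# Simplified stand-in for the escaping a template engine applies to
# ordinary expression output; the real implementation may differ in detail.
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

sub escape_markup {
    my ($text) = @_;
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

my $js = 'new Date(2021,3,20).toLocaleString( [], {dateStyle: "full"})';

print "raw:     $js\n";
print 'escaped: ', escape_markup($js), "\n";
```

The escaped form replaces the quotes around the style name with entities, and since a browser does not decode entities inside a script element, the result would no longer be valid JavaScript.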

I also chose to use the JavaScript Date objec­t’s toLocaleString() method for my for­mat­ting on line 12. There are oth­er ways to do this:

Note that line 10 builds the para­me­ters for JavaScript’s Date con­struc­tor using the year, month_0, and day meth­ods of our Perl DateTime object; month_0 because the Date con­struc­tor takes its month as an inte­ger from 0 to 11 rather than 1 to 12. JavaScript Dates can be con­struct­ed in many ways; this seemed the sim­plest with­out hav­ing to explain things like epochs and incon­sis­tent parsing.
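
The off-by-one in the month is easy to get wrong, so here is the string-building step as a standalone sketch without Mojolicious or DateTime. The js_localized_date function is a hypothetical name, and it takes a 1-based month and subtracts one itself, where the helper above relies on month_0.

```perl
use strict;
use warnings;

# Hypothetical helper taking a 1-based month, as people usually write it.
sub js_localized_date {
    my ( $year, $month, $day, $style ) = @_;
    $style //= 'full';

    # JavaScript's Date constructor counts months from 0 to 11.
    my $params = join ',', $year, $month - 1, $day;
    return qq<new Date($params).toLocaleString( [], {dateStyle: "$style"})>;
}

print js_localized_date( 2021, 4, 20 ), "\n";
```

For April 20, 2021 this produces a Date constructor call with 3 as the month argument.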

Why are we using Perl DateTimes and a helper any­way? I’m assum­ing that our dates are com­ing from the back­end of our appli­ca­tion, pos­si­bly inflat­ed from a data­base col­umn. If your dates are strict­ly on the fron­tend, you might decide to put your for­mat­ting code there in a JavaScript func­tion, per­haps using a JavaScript-​based tem­plat­ing library.

The bot­tom line is to do what­ev­er makes sense for your sit­u­a­tion. I pre­fer the Perl solu­tion because I like the lan­guage and its ecosys­tem and per­haps have accli­mat­ed to its quirks. The com­pli­ca­tions of JavaScript brows­er sup­port, com­pet­ing frame­works, and lay­ers of tool­ing make my head hurt. Despite this, I’m still learn­ing; if you have any com­ments or sug­ges­tions, please leave them below.


When we’re writ­ing soft­ware for a glob­al audi­ence, it’s nice if we can pro­vide it accord­ing to their native lan­guages and con­ven­tions. Translating all of the text can be a huge under­tak­ing, but we can start small by mak­ing sure that when we show the day and date it appears as the user expects. For exam­ple, to me it’s Tuesday, April 20, 2021; to my friend Paul in the UK it’s Tuesday, 20 April 2021 (note the dif­fer­ence in order), and to my oth­er friend Gabór in Israel it’s יום שלישי, 20 באפריל 2021 (note the dif­fer­ent direc­tion of the text, dif­fer­ent lan­guage, and char­ac­ter set).

Thankfully, we have a num­ber of tools to assist us:

  • The DateTime::Locale library, which enables our Perl soft­ware to rep­re­sent dates and times glob­al­ly and con­tains a cat­a­log of locales. It works with the DateTime library for stor­ing our dates as objects that can be eas­i­ly manip­u­lat­ed and formatted.
  • The HTTP Accept-​Language head­er, which lets a web brows­er com­mu­ni­cate to the serv­er what nat­ur­al lan­guages and locale vari­ants the user understands.
  • The HTTP::AcceptLanguage mod­ule, which helps us parse the Accept-​Language head­er and select a com­pat­i­ble locale.

Our sam­ple code uses the Mojolicious frame­work and is very sim­ple; almost half of it is just HTML web page tem­plates. You could eas­i­ly adapt it to oth­er frame­works or tem­plat­ing systems.

#!/usr/bin/env perl

use Mojolicious::Lite -signatures;
use DateTime;
use DateTime::Locale;
use HTTP::AcceptLanguage;

my %locales
    = map { $_ => DateTime::Locale->load($_) } DateTime::Locale->codes;

get '/' =>
    sub ($c) { $c->render( template => 'index', date => DateTime->today ) };

helper localize_date => sub ( $c, $date = DateTime->today, $format = 'full' )
{
    my $locale = $locales{ HTTP::AcceptLanguage->new(
            $c->req->headers->accept_language )->match( keys %locales ) };

    my $method_name = "date_format_$format";
    return $date->clone->set_locale($locale)
        ->format_cldr( $locale->$method_name );
};

app->start;
__DATA__
@@ index.html.ep
% layout 'default';
% title 'Today';
<ul>
    <li><%= localize_date $date %></li>
    % for my $format ( qw(long medium short) ) {
    <li><%= localize_date $date, $format %></li>
    % }
</ul>
@@ layouts/default.html.ep
<!DOCTYPE html>
<html>
    <head><title><%= title %></title></head>
    <body><%= content %></body>
</html>

Lines 1 through 5 tell our code to use the Perl inter­preter in our exe­cu­tion PATH and load our pre­req­ui­site mod­ules. Note we’re using the micro ver­sion of Mojolicious, Mojolicious::Lite; lat­er you can grow your appli­ca­tion into a well-​structured Mojolicious app. We’re also using Perl sub­rou­tine sig­na­tures, which requires Perl 5.20 or lat­er (released in 2014).

Lines 7 and 8 pre­load all of the avail­able DateTime::Locale objects so that we can serve requests faster with­out hav­ing to load a new locale every time. We cre­ate a hash where the keys are the locale iden­ti­fiers (for exam­ple, en-US for United States English), and the val­ues are the locale objects.

Line 10 begins our route han­dler for HTTP GET requests on the default / route in our web appli­ca­tion. When a brows­er hits the home page of our app, it will exe­cute the code in the anony­mous sub in line 11, which is passed the con­troller object as $c. It’s a very sim­ple han­dler that ren­ders a tem­plate called index (described below), pass­ing it a date object with today’s date.

Lines 13 through 23 are where the smarts of our application lie. It’s a helper that we’ll call from our template to localize a date object, and it’s another anonymous sub. This time it’s passed a Mojolicious controller as $c, a $date parameter that defaults to today, and a $format parameter that defaults to ‘full’.

Lines 14 through 18 in the helper get our locale. Working from the inside out, we get the HTTP Accept-​Language head­er from the request on line 16, cre­ate a new HTTP::AcceptLanguage object in line 15 for pars­ing that head­er, and then match it against the keys in our glob­al %locales hash in line 17. That matched key then looks up the appro­pri­ate DateTime::Locale object from the hash.
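
If you are curious what that header negotiation involves, here is a deliberately simplified sketch in plain Perl. It is a stand-in for HTTP::AcceptLanguage rather than a description of how the module works: the function names are invented, and the fallback from a regional tag to its primary language is an assumption made for this example.

```perl
use strict;
use warnings;

# Parse the header into language tags ordered by descending quality (q)
# value, keeping header order among tags with equal q.
sub parse_accept_language {
    my ($header) = @_;
    my ( @prefs, $index );
    for my $item ( split /\s*,\s*/, $header // '' ) {
        my ( $tag, @params ) = split /\s*;\s*/, $item;
        next unless defined $tag
            and $tag =~ /^[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*$/;
        my $q = 1;    # a missing q parameter means full preference
        for (@params) { $q = $1 if /^q=([01](?:\.\d{1,3})?)$/ }
        push @prefs, [ $tag, $q, $index++ ] if $q > 0;
    }
    return map { $_->[0] }
        sort { $b->[1] <=> $a->[1] or $a->[2] <=> $b->[2] } @prefs;
}

# Return the first available language the user accepts, if any.
sub match_language {
    my ( $header, @available ) = @_;
    my %by_lc = map { ( lc($_) => $_ ) } @available;
    for my $tag ( parse_accept_language($header) ) {
        return $by_lc{ lc $tag } if exists $by_lc{ lc $tag };
        my ($primary) = split /-/, lc $tag;    # fall back from "de-CH" to "de"
        return $by_lc{$primary} if exists $by_lc{$primary};
    }
    return;
}

print match_language( 'de-CH,de;q=0.9,en;q=0.8', qw(en en-US de ja) ), "\n";
```

In the application, the list of available languages is simply the keys of %locales.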

DateTime lets you change an object’s locale with set_locale, but doing that to the object we were handed would alter it for the rest of the application, so in line 19 we clone the date and set the clone’s locale to our newly-discovered $locale object. Then, in lines 21 and 22, we determine what method to call on the locale to retrieve the CLDR (Common Locale Data Repository) formatting pattern for the requested format and return the formatted date.
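
Calling a method whose name is held in a string may be unfamiliar, so here is that technique on its own. My::FakeLocale is a hypothetical stand-in class, and the patterns it returns are illustrative values rather than data read from DateTime::Locale.

```perl
use strict;
use warnings;

# Hypothetical stand-in class with one method per format name.
package My::FakeLocale {
    sub new                { return bless {}, shift }
    sub date_format_full   { 'EEEE, MMMM d, y' }
    sub date_format_long   { 'MMMM d, y' }
    sub date_format_medium { 'MMM d, y' }
    sub date_format_short  { 'M/d/yy' }
}

my $locale = My::FakeLocale->new;

for my $format (qw(full long medium short)) {
    my $method_name = "date_format_$format";

    # A scalar holding a method name can appear on the right of the arrow.
    printf "%-6s => %s\n", $format, $locale->$method_name();
}
```

Since the format name ultimately comes from the template, checking $locale->can($method_name) before the call is one way to guard against a typo.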

Finally, line 25 starts the appli­ca­tion. To run it using the devel­op­ment serv­er includ­ed with Mojolicious, do this at the com­mand line:

$ morbo perl_date_locale.pl

There are oth­er options for deploy­ing your appli­ca­tion, includ­ing Mojolicious’ built-​in web serv­er, inside a con­tain­er, using oth­er web servers, etc.

The rest of the above script is in the __DATA__ por­tion and con­tains two pseudo-​files that Mojolicious knows how to read in the absence of actu­al tem­plates and lay­outs. First on line 28 is the actu­al index.html.ep HTML page, which uses Mojolicious’ Embedded Perl (ep) tem­plat­ing sys­tem to select a lay­out of shared HTML to use (the layouts/default.html.ep file start­ing on line 39).

Lines 32 through 37 render an HTML unordered list that runs through the various formatting options available to our localize_date helper, first with the default ‘full’ formatting, and then a loop through ‘long’, ‘medium’, and ‘short’. Note that we call our helper as an expression, with an equals (=) sign after the percent (%) sign.

If you want to test dif­fer­ent locales with­out chang­ing your brows­er or oper­at­ing sys­tem set­tings, you can invoke the script from the com­mand line along with the HTTP request and head­ers to pass along. Here’s an exam­ple using German:

$ perl perl_date_locale.pl get -H 'Accept-Language: de' /
[2021-04-17 16:39:57.81379] [5425] [debug] [LcCSBKMVd90t] GET "/"
[2021-04-17 16:39:57.81408] [5425] [debug] [LcCSBKMVd90t] Routing to a callback
[2021-04-17 16:39:57.81610] [5425] [debug] [LcCSBKMVd90t] Rendering template "index.html.ep" from DATA section
[2021-04-17 16:39:57.81714] [5425] [debug] [LcCSBKMVd90t] Rendering template "layouts/default.html.ep" from DATA section
[2021-04-17 16:39:57.81792] [5425] [debug] [LcCSBKMVd90t] 200 OK (0.004118s, 242.836/s)
<!DOCTYPE html>
<html>
    <head><title>Today</title></head>
    <body>
<ul>
    <li>Sonntag, 18. April 2021</li>
    <li>18. April 2021</li>
    <li>18.04.2021</li>
    <li>18.04.21</li>
</ul>
</body>
</html>

And here’s Japanese:

$ perl perl_date_locale.pl get -H 'Accept-Language: ja' /
[2021-04-17 16:40:56.10840] [5478] [debug] [Wmr6cN5KUJlP] GET "/"
[2021-04-17 16:40:56.10874] [5478] [debug] [Wmr6cN5KUJlP] Routing to a callback
[2021-04-17 16:40:56.11101] [5478] [debug] [Wmr6cN5KUJlP] Rendering template "index.html.ep" from DATA section
[2021-04-17 16:40:56.11255] [5478] [debug] [Wmr6cN5KUJlP] Rendering template "layouts/default.html.ep" from DATA section
[2021-04-17 16:40:56.11360] [5478] [debug] [Wmr6cN5KUJlP] 200 OK (0.005164s, 193.648/s)
<!DOCTYPE html>
<html>
    <head><title>Today</title></head>
    <body>
<ul>
    <li>2021年4月18日日曜日</li>
    <li>2021年4月18日</li>
    <li>2021/04/18</li>
    <li>2021/04/18</li>
</ul>
</body>
</html>

A full list of sup­port­ed locales is pro­vid­ed in the DateTime::Locale::Catalog documentation.

I hope this arti­cle has helped demon­strate that it’s not too hard to make your Perl web appli­ca­tions respect glob­al audi­ences, if only with dates. For more on local­iza­tion and Perl, start with the Locale::Maketext framework.