A recent Lobsters post lauding the virtues of AWK reminded me that although the language is powerful and lightning-​fast, I usually find myself exceeding its capabilities and reaching for Perl instead. One such application is analyzing voluminous log files such as the ones generated by this blog. Yes, WordPress has stats, but I’ve never let reinvention of the wheel get in the way of a good programming exercise.

So I whipped this script up on Sunday night while watching RuPaul’s Drag Race reruns. It parses my Apache web server log files and reports on hits from week to week.

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct 'operator-double-diamond';
use Regexp::Log::Common;
use DateTime::Format::HTTP;
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots\.txt
  [-\w]*sitemap[-\w]*\.xml
  ^/+wp-
  /feed/?$
  ^/+\?rest_route=
>;

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    my $dt  = DateTime::Format::HTTP->parse_datetime( $log{ts} );
    my $key = sprintf '%u-%02u', $dt->week;

    # get first date of each week
    $week_of{$key} ||= $dt->date;
    $count{$key}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

Here’s some sample output:

Week of 2021-07-31:      2,672
Week of 2021-08-02:     16,222
Week of 2021-08-09:     12,609
Week of 2021-08-16:     17,714
Week of 2021-08-23:     14,462
Week of 2021-08-30:     11,758
Week of 2021-09-06:     14,811
Week of 2021-09-13:        407

I first started prototyping this on the command line as if it were an AWK one-liner by using the perl -n and -a flags. The former wraps code in a while loop over the <> “diamond operator”, processing each line from standard input or files passed as arguments. The latter splits the fields of the line into an array named @F. It looked something like this while I was listing URIs (locations on the website):

gunzip -c ~/logs/phoenixtrap.com-ssl_log-*.gz | \
perl -anE 'say $F[6]'
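
To make those flags a little less magical, here is a rough sketch of what -n and -a do, written out by hand. The sample log lines and the uri_field name are made up for illustration, and it reads from an in-memory string rather than standard input so it can run on its own.

```perl
use strict;
use warnings;
use feature 'say';

# Two made-up log lines in Apache's combined format (hypothetical data)
my $sample_log = <<'END_LOG';
203.0.113.7 - - [02/Aug/2021:10:15:32 -0500] "GET /blog/ HTTP/1.1" 200 5123 "-" "ExampleBrowser/1.0"
203.0.113.9 - - [02/Aug/2021:10:16:01 -0500] "GET /robots.txt HTTP/1.1" 200 67 "-" "ExampleBot/2.0"
END_LOG

# -a: split the line on whitespace into @F; the URI is the seventh field
sub uri_field {
    my ($line) = @_;
    my @F = split ' ', $line;
    return $F[6];
}

# -n: wrap the code in a loop over every input line
open my $fh, '<', \$sample_log or die "can't read sample: $!";
my @uris;
while ( my $line = <$fh> ) {
    push @uris, uri_field($line);
}
close $fh;

say for @uris;
```

Splitting on whitespace only works because the URI happens to sit in a predictable position; anything fancier is where a proper log parser starts to pay off.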

But once I realized I’d need to filter out a bunch of URI patterns and do some aggregation by date, I turned it into a script and turned to CPAN.

There I found Regexp::Log::Common and DateTime::Format::HTTP, which let me pull apart the Apache log format and its timestamp strings without having to write even more complicated regular expressions myself. (As noted above, this was already a wheel-​reinvention exercise; no need to compound that further.)

Regexp::Log::Common builds a compiled regular expression based on the log format and fields you’re interested in, so that’s the constructor on lines 11 through 14. The expression then returns those fields as a list, which I’m assigning to a hash slice with those field names as keys in line 29. I then skip over requests that aren’t successful or browser cache hits, skip over requests that don’t GET web pages or other assets (e.g., POSTs to forms or updating other resources), and skip over the URI patterns mentioned earlier.
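
If hash slices are new to you, here is a minimal sketch of that parse-then-filter step using only core Perl. The $simple_re pattern is hand-written and much cruder than what Regexp::Log::Common generates, and countable_hit is a hypothetical name; treat it as an illustration of the idea rather than a replacement for the module.

```perl
use strict;
use warnings;

my @fields = qw(ts req status);

# Hand-written and deliberately simplified; NOT the regexp that
# Regexp::Log::Common generates
my $simple_re = qr/ \[ ([^\]]+) \] \s+ " ([^"]*) " \s+ (\d{3}) /x;

# Returns a hash reference of fields for a countable hit, or nothing
sub countable_hit {
    my ($line) = @_;

    # hash slice: captured values land under the matching field names
    my %log;
    @log{@fields} = $line =~ $simple_re or return;

    # successful (2xx) or browser cache hit (304)
    return unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri ) = split ' ', $log{req};
    return unless defined $method and $method eq 'GET';

    return { %log, uri => $uri };
}

my $hit = countable_hit(
    '203.0.113.7 - - [02/Aug/2021:10:15:32 -0500] "GET /blog/ HTTP/1.1" 200 5123'
);
print "$hit->{uri} at $hit->{ts}\n" if $hit;
```

Bailing out with "or return" when the line doesn't match is an extra guard I've assumed here; it keeps malformed lines from triggering warnings about undefined values.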

(Those patterns are worth a mention: they include the robots.txt and sitemap XML files used by search engine indexers, WordPress administration pages, files used by RSS newsreaders subscribed to my blog, and routes used by the Jetpack WordPress add-​on. If you’re adapting this for your site you might need to customize this list based on what software you use to run it.)
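
When customizing the list, it helps to try it against a few URIs before running it over real logs. Here is a small sketch of such a check; the should_skip name and the sample URIs are made up, and the patterns are written with the dots and the question mark backslash-escaped so that they only match those literal characters.

```perl
use strict;
use warnings;
use List::Util 1.33 'any';

# Same idea as the script's list, with the dots and the question mark
# escaped so they match literally
my @skip_uri_patterns = qw<
    ^/+robots\.txt
    [-\w]*sitemap[-\w]*\.xml
    ^/+wp-
    /feed/?$
    ^/+\?rest_route=
>;

sub should_skip {
    my ($uri) = @_;
    return any { $uri =~ $_ } @skip_uri_patterns;
}

# Made-up URIs for illustration
for my $uri (
    '/robots.txt', '/?rest_route=/example',
    '/wp-login.php', '/feed/', '/2021/08/some-post/',
    )
{
    printf "%-25s %s\n", $uri, should_skip($uri) ? 'skip' : 'count';
}
```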

Lines 38 and 39 parse the timestamp from the log into a DateTime object using DateTime::Format::HTTP and then build the key used to store the per-week hit count. The last lines of the loop then grab the first date of each new week (assuming the log is in chronological order) and increment the count. Once finished, lines 46 and 47 provide a report sorted by week, displaying it as a friendly “Week of date” and the hit counts aligned to the right with printf. Number::Format’s format_number function displays the totals with thousands separators.
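
For anyone who wants to play with the week-keying logic without installing anything, here is a sketch that swaps in the core Time::Piece module for DateTime. That substitution is my own for illustration: week_key is a hypothetical helper, it only looks at the date part of the timestamp (ignoring the time and UTC offset), and the three timestamps are invented.

```perl
use strict;
use warnings;
use Time::Piece;

# Build a "year-week" key from the date part of an Apache timestamp.
# The time of day and UTC offset are ignored here for simplicity.
sub week_key {
    my ($ts) = @_;
    my ($date_part) = $ts =~ m{^(\d\d/\w{3}/\d{4})} or return;
    my $t = Time::Piece->strptime( $date_part, '%d/%b/%Y' );

    # ISO weeks can straddle New Year, so adjust the year to match
    my ( $year, $week ) = ( $t->year, $t->week );
    $year-- if $t->mon == 1  and $week >= 52;
    $year++ if $t->mon == 12 and $week == 1;

    return ( sprintf( '%u-%02u', $year, $week ), $t->ymd );
}

my ( %count, %week_of );
for my $ts (
    '31/Jul/2021:23:59:59 -0500',
    '02/Aug/2021:10:15:32 -0500',
    '03/Aug/2021:08:00:00 -0500',
    )
{
    my ( $key, $date ) = week_key($ts) or next;
    $week_of{$key} ||= $date;    # first date seen in that week
    $count{$key}++;
}

printf "Week of %s: %5d\n", $week_of{$_}, $count{$_} for sort keys %count;
```

Zero-padding the week number in the key is what lets a plain string sort put the weeks in calendar order.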

Update: After this was initially published, astute reader Chris McGowan noted that I had a bug where $log{status} was assigned the value 304 with the = operator rather than compared with ==. He also suggested I use the double-diamond <<>> operator introduced in Perl v5.22.0 to avoid maliciously-named files. Thanks, Chris!

Room for improvement

DateTime is a very powerful module but this comes at a price of speed and memory. Something simpler like Date::WeekNumber should yield performance improvements, especially as my logs grow (here’s hoping). It requires a bit more manual massaging of the log dates to convert them into something the module can use, though:

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct qw<
  operator-double-diamond
  regex-named-capture-group
>;
use Regexp::Log::Common;
use Date::WeekNumber 'iso_week_number';
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots\.txt
  [-\w]*sitemap[-\w]*\.xml
  ^/+wp-
  /feed/?$
  ^/+\?rest_route=
>;

my %month = (
    Jan => '01',
    Feb => '02',
    Mar => '03',
    Apr => '04',
    May => '05',
    Jun => '06',
    Jul => '07',
    Aug => '08',
    Sep => '09',
    Oct => '10',
    Nov => '11',
    Dec => '12',
);

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    # convert log timestamp to YYYY-MM-DD
    # for Date::WeekNumber
    $log{ts} =~ m!^
      (?<day>\d\d) /
      (?<month>...) /
      (?<year>\d{4}) : !x;
    my $date = "$+{year}-$month{ $+{month} }-$+{day}";

    my $week = iso_week_number($date);
    $week_of{$week} ||= $date;
    $count{$week}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

It looks almost the same as the first version, with the addition of a hash to convert month names to numbers and the actual conversion (using named regular expression capture groups for readability, using Syntax::Construct to check for that feature). On my server, this results in a ten- to eleven-​second savings when processing two months of compressed logs.
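
One thing that conversion doesn't guard against is a timestamp that fails to match, in which case the named captures in %+ would not be the ones you expect. Here is a sketch of how the conversion might be wrapped in a function that gives up early instead; log_ts_to_iso is a hypothetical name, and the stricter month pattern is my own assumption.

```perl
use strict;
use warnings;

# Month abbreviations mapped to zero-padded month numbers
my %month;
@month{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)}
    = map { sprintf '%02d', $_ } 1 .. 12;

# Returns YYYY-MM-DD, or nothing when the timestamp looks wrong
sub log_ts_to_iso {
    my ($ts) = @_;
    $ts =~ m!^
      (?<day>\d\d) /
      (?<month>[A-Z][a-z]{2}) /
      (?<year>\d{4}) : !x
        or return;
    my $mm = $month{ $+{month} } or return;
    return "$+{year}-$mm-$+{day}";
}

print log_ts_to_iso('02/Aug/2021:10:15:32 -0500'), "\n";    # 2021-08-02
```

In the main loop you could then write something like "my $date = log_ts_to_iso( $log{ts} ) or next;" to skip lines with unusable timestamps.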

What’s next? Pretty graphs? Drilling down to specific blog posts? Database storage for further queries and analysis? Perl and CPAN make it possible to go far beyond what you can do with AWK. What would you add or change? Let me know in the comments.

Last week’s article received a comment on a private Facebook group that amounted to “just use JavaScript’s built-in formatting.” So what would that look like?

#!/usr/bin/env perl

use Mojolicious::Lite -signatures;
use DateTime;

get '/' =>
    sub ($c) { $c->render( template => 'index', date => DateTime->today ) };

helper localize_date => sub ( $c, $date = DateTime->today, $style = 'full' ) {
    my $date_params = join ',' => $date->year, $date->month_0, $date->day;
    return
        qq<new Date($date_params).toLocaleString( [], {dateStyle: "$style"})>;
};

app->start;
__DATA__
@@ index.html.ep
% layout 'default';
% title 'Today';
<ul>
    <li><script>
        document.write(<%== localize_date $date %>)
    </script></li>
    % for my $style ( qw(long medium short) ) {
    <li><script>
        document.write(<%== localize_date $date, $style %>)
    </script></li>
    % }
</ul>
@@ layouts/default.html.ep
<!DOCTYPE html>
<html>
    <head><title><%= title %></title></head>
    <body><%= content %></body>
</html>

It’s structured much like the Perl-only solution, with a default "/" route and a localize_date Mojolicious helper to do the formatting. I opted to output a piece of JavaScript from the helper on lines 9 through 13 since it could be repeated several times in a document. You could instead declare a function in the default layout’s HTML <head> on line 33 that would receive a date and a formatting style, outputting the resulting formatted date.

In the template’s list from lines 20 through 29 I decided to use JavaScript document.write method calls to add our generated code. This has a slew of caveats but works for our example here.

Worth noting is the double equals sign (<%== %>) when embedding a Perl expression. This prevents Mojolicious from XML-escaping special characters, e.g., replacing "quotes" with &quot; and <angle brackets> with &lt; and &gt;. This is important when returning HTML and JavaScript code.
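
To see why that matters, here is a hand-rolled sketch of what escaping would do to our generated JavaScript. The escape_markup function is a stand-in I wrote for illustration only; it is not how Mojolicious implements its escaping.

```perl
use strict;
use warnings;

# Hand-rolled stand-in for illustration only; Mojolicious does its own
# escaping internally
my %entity_for = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

sub escape_markup {
    my ($text) = @_;
    $text =~ s/([&<>"'])/$entity_for{$1}/g;
    return $text;
}

my $js = 'new Date(2021,3,20).toLocaleString( [], {dateStyle: "full"})';
print escape_markup($js), "\n";
```

The escaped version has &quot; where the double quotes used to be, which is no longer valid JavaScript inside a script element.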

I also chose to use the JavaScript Date object’s toLocaleString() method for my formatting on line 12. There are other ways to do this:

  • Date objects also have a toLocaleDateString method. However, Mozilla has a performance note that states it’s better to use the Intl.DateTimeFormat object’s format method.
  • But Intl.DateTimeFormat’s browser support stands at about 70%, leaving out Safari (that’s Mac, iPhone, and iPad) and Internet Explorer users.
  • There are JavaScript libraries and polyfills to address these issues, but I’m trying to keep this example simple.

Note that line 10 builds the parameters for JavaScript’s Date constructor using the year, month_0, and day methods of our Perl DateTime object; month_0 because the Date constructor takes its month as an integer from 0 to 11 rather than 1 to 12. JavaScript Dates can be constructed in many ways; this seemed the simplest without having to explain things like epochs and inconsistent parsing.
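
Here is the same string-building trick pulled out into a plain Perl function so you can see the off-by-one month handling without running a web server. The js_localized_date name is hypothetical, and it takes a calendar-style month from 1 to 12 instead of a DateTime object.

```perl
use strict;
use warnings;

# Takes a 1-based month, as you would read it off a calendar
sub js_localized_date {
    my ( $year, $month, $day, $style ) = @_;
    $style //= 'full';

    # JavaScript's Date constructor counts months from 0 to 11
    my $date_params = join ',' => $year, $month - 1, $day;
    return
        qq<new Date($date_params).toLocaleString( [], {dateStyle: "$style"})>;
}

print js_localized_date( 2021, 4, 20 ), "\n";
```

For April 20, 2021 this produces new Date(2021,3,20) followed by the toLocaleString call, with April showing up as month 3.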

Why are we using Perl DateTimes and a helper anyway? I’m assuming that our dates are coming from the backend of our application, possibly inflated from a database column. If your dates are strictly on the frontend, you might decide to put your formatting code there in a JavaScript function, perhaps using a JavaScript-​based templating library.

The bottom line is to do whatever makes sense for your situation. I prefer the Perl solution because I like the language and its ecosystem and perhaps have acclimated to its quirks. The complications of JavaScript browser support, competing frameworks, and layers of tooling make my head hurt. Despite this, I’m still learning; if you have any comments or suggestions, please leave them below.

When we’re writing software for a global audience, it’s nice if we can provide it according to their native languages and conventions. Translating all of the text can be a huge undertaking, but we can start small by making sure that when we show the day and date it appears as the user expects. For example, to me it’s Tuesday, April 20, 2021; to my friend Paul in the UK it’s Tuesday, 20 April 2021 (note the difference in order), and to my other friend Gabór in Israel it’s יום שלישי, 20 באפריל 2021 (note the different direction of the text, different language, and character set).

Thankfully, we have a number of tools to assist us:

  • The DateTime::Locale library, which enables our Perl software to represent dates and times globally and contains a catalog of locales. It works with the DateTime library for storing our dates as objects that can be easily manipulated and formatted.
  • The HTTP Accept-​Language header, which lets a web browser communicate to the server what natural languages and locale variants the user understands.
  • The HTTP::AcceptLanguage module, which helps us parse the Accept-​Language header and select a compatible locale.

Our sample code uses the Mojolicious framework and is very simple; almost half of it is just HTML web page templates. You could easily adapt it to other frameworks or templating systems.

#!/usr/bin/env perl

use Mojolicious::Lite -signatures;
use DateTime;
use DateTime::Locale;
use HTTP::AcceptLanguage;

my %locales
    = map { $_ => DateTime::Locale->load($_) } DateTime::Locale->codes;

get '/' =>
    sub ($c) { $c->render( template => 'index', date => DateTime->today ) };

helper localize_date => sub ( $c, $date = DateTime->today, $format = 'full' )
{
    my $locale = $locales{ HTTP::AcceptLanguage->new(
            $c->req->headers->accept_language )->match( keys %locales ) };

    my $method_name = "date_format_$format";
    return $date->clone->set_locale($locale)
        ->format_cldr( $locale->$method_name );
};

app->start;
__DATA__
@@ index.html.ep
% layout 'default';
% title 'Today';
<ul>
    <li><%= localize_date $date %></li>
    % for my $format ( qw(long medium short) ) {
    <li><%= localize_date $date, $format %></li>
    % }
</ul>
@@ layouts/default.html.ep
<!DOCTYPE html>
<html>
    <head><title><%= title %></title></head>
    <body><%= content %></body>
</html>

Lines 1 through 6 tell our code to use the Perl interpreter in our execution PATH and load our prerequisite modules. Note we’re using the micro version of Mojolicious, Mojolicious::Lite; later you can grow your application into a well-structured Mojolicious app. We’re also using Perl subroutine signatures, which requires Perl 5.20 or later (released in 2014).

Lines 8 and 9 preload all of the available DateTime::Locale objects so that we can serve requests faster without having to load a new locale every time. We create a hash where the keys are the locale identifiers (for example, en-US for United States English), and the values are the locale objects.

Line 11 begins our route handler for HTTP GET requests on the default / route in our web application. When a browser hits the home page of our app, it will execute the code in the anonymous sub in line 12, which is passed the controller object as $c. It’s a very simple handler that renders a template called index (described below), passing it a date object with today’s date.

Lines 14 through 22 are where the smarts of our application lie. It’s a helper that we’ll call from our template to localize a date object, and it’s another anonymous sub. This time it’s passed a Mojolicious controller as $c, a $date parameter that defaults to today, and a $format parameter that defaults to ‘full’.

Lines 16 and 17 in the helper get our locale. Working from the inside out, we get the HTTP Accept-Language header from the request, create a new HTTP::AcceptLanguage object for parsing that header, and then match it against the keys in our global %locales hash. That matched key then looks up the appropriate DateTime::Locale object from the hash.
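
If you’re curious what that matching involves, here is a much-simplified sketch in plain Perl. It is not how HTTP::AcceptLanguage is implemented; match_language is a hypothetical function that only handles exact, case-insensitive tag matches and the q= quality values, so stick with the module for real applications.

```perl
use strict;
use warnings;

# Simplified stand-in for illustration; NOT how HTTP::AcceptLanguage
# is implemented, and it only does exact, case-insensitive tag matches
sub match_language {
    my ( $header, @available ) = @_;
    my %canonical = map { ( lc($_) => $_ ) } @available;

    my @wanted;
    my $position = 0;
    for my $item ( split /\s*,\s*/, $header // '' ) {
        my ( $tag, @params ) = split /\s*;\s*/, $item;
        next unless defined $tag;
        $tag =~ s/^\s+|\s+$//g;
        next unless length $tag;

        # quality defaults to 1 when no q= parameter is given
        my $quality = 1;
        for my $param (@params) {
            $quality = $1 if $param =~ /^q=(\d(?:\.\d+)?)$/;
        }
        push @wanted, [ lc $tag, $quality, $position++ ];
    }

    # highest quality first; header order breaks ties
    for my $want (
        sort { $b->[1] <=> $a->[1] or $a->[2] <=> $b->[2] } @wanted )
    {
        next unless $want->[1] > 0;
        return $canonical{ $want->[0] } if exists $canonical{ $want->[0] };
    }
    return;
}

my $best = match_language( 'de-CH, de;q=0.9, en;q=0.8', qw(en de ja) );
print defined $best ? "$best\n" : "no match\n";
```

In this sketch a browser asking for Swiss German first falls through to plain German, because de-CH isn’t in the made-up list of available locales and de has the next-highest quality value.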

We don’t want to change the date object that was passed to the helper, so in line 20 we clone it and set the clone’s locale to our newly-discovered $locale object. Finally, in lines 19 and 21 we determine what method to call on the locale object to retrieve the CLDR (Common Locale Data Repository) formatting pattern for the requested format and then return the formatted date.
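
Calling a method whose name is stored in a variable may look odd if you haven’t seen it before. Here is a sketch of the technique using a toy class with hard-coded patterns standing in for a real locale object; My::ToyLocale and pattern_for are made-up names, and the can() check is an extra guard I’ve assumed rather than something the helper does.

```perl
use strict;
use warnings;

# Toy class with hard-coded patterns, standing in for a locale object
package My::ToyLocale {
    sub new               { return bless {}, shift }
    sub date_format_full  { return 'EEEE, MMMM d, y' }
    sub date_format_short { return 'M/d/yy' }
}

sub pattern_for {
    my ( $locale, $format ) = @_;
    my $method_name = "date_format_$format";

    # can() returns a code reference, or false if there's no such method
    my $method = $locale->can($method_name) or return;
    return $locale->$method;
}

my $locale = My::ToyLocale->new;
print pattern_for( $locale, 'full' ), "\n";
print defined( pattern_for( $locale, 'bogus' ) )
    ? "found\n"
    : "no such format\n";
```

Without a check like this, an unexpected $format value would die with a "Can't locate object method" error, which may or may not be what you want in a template helper.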

Finally, line 24 starts the application. To run it using the development server included with Mojolicious, do this at the command line:

$ morbo perl_date_locale.pl

There are other options for deploying your application, including Mojolicious’ built-​in web server, inside a container, using other web servers, etc.

The rest of the above script is in the __DATA__ portion and contains two pseudo-files that Mojolicious knows how to read in the absence of actual templates and layouts. First on line 26 is the actual index.html.ep HTML page, which uses Mojolicious’ Embedded Perl (ep) templating system to select a layout of shared HTML to use (the layouts/default.html.ep file starting on line 35).

Lines 29 through 34 render an HTML unordered list that runs through the various formatting options available to our localize_date helper, first with the default ‘full’ formatting, and then a loop through ‘long’, ‘medium’, and ‘short’. Note that we call our helper as an expression, with an equals (=) sign after the percent (%) sign.

If you want to test different locales without changing your browser or operating system settings, you can invoke the script from the command line along with the HTTP request and headers to pass along. Here’s an example using German:

$ perl perl_date_locale.pl get -H 'Accept-Language: de' /
[2021-04-17 16:39:57.81379] [5425] [debug] [LcCSBKMVd90t] GET "/"
[2021-04-17 16:39:57.81408] [5425] [debug] [LcCSBKMVd90t] Routing to a callback
[2021-04-17 16:39:57.81610] [5425] [debug] [LcCSBKMVd90t] Rendering template "index.html.ep" from DATA section
[2021-04-17 16:39:57.81714] [5425] [debug] [LcCSBKMVd90t] Rendering template "layouts/default.html.ep" from DATA section
[2021-04-17 16:39:57.81792] [5425] [debug] [LcCSBKMVd90t] 200 OK (0.004118s, 242.836/s)
<!DOCTYPE html>
<html>
    <head><title>Today</title></head>
    <body>
<ul>
    <li>Sonntag, 18. April 2021</li>
    <li>18. April 2021</li>
    <li>18.04.2021</li>
    <li>18.04.21</li>
</ul>
</body>
</html>

And here’s Japanese:

$ perl perl_date_locale.pl get -H 'Accept-Language: ja' /
[2021-04-17 16:40:56.10840] [5478] [debug] [Wmr6cN5KUJlP] GET "/"
[2021-04-17 16:40:56.10874] [5478] [debug] [Wmr6cN5KUJlP] Routing to a callback
[2021-04-17 16:40:56.11101] [5478] [debug] [Wmr6cN5KUJlP] Rendering template "index.html.ep" from DATA section
[2021-04-17 16:40:56.11255] [5478] [debug] [Wmr6cN5KUJlP] Rendering template "layouts/default.html.ep" from DATA section
[2021-04-17 16:40:56.11360] [5478] [debug] [Wmr6cN5KUJlP] 200 OK (0.005164s, 193.648/s)
<!DOCTYPE html>
<html>
    <head><title>Today</title></head>
    <body>
<ul>
    <li>2021年4月18日日曜日</li>
    <li>2021年4月18日</li>
    <li>2021/04/18</li>
    <li>2021/04/18</li>
</ul>
</body>
</html>

A full list of supported locales is provided in the DateTime::Locale::Catalog documentation.

I hope this article has helped demonstrate that it’s not too hard to make your Perl web applications respect global audiences, if only with dates. For more on localization and Perl, start with the Locale::Maketext framework.