Replied to Yes to ActivityPub, but no to Friends by Shelley Powers (Burningbird)
I decided to disable the Friends plug-in when I realized it was inserting every new feed item as a new post in my database. This could easily become unmanageable. Considering you can use a feed reader to read weblogs AND Mastodon accounts, it just didn’t seem worth the database burden.

I’ve also been messing with the Friends and ActivityPub plugins for WordPress on my blog, and I share Shelley’s concerns about the former bloating the database with feed items. You can control this somewhat by setting retention values in days or a number of posts, but you have to go into each friend’s Feeds tab and do it manually; there’s no default setting.

After reading that post, I’m also considering disabling Friends in favor of a feed reader, especially because (as Shelley also noted) there are gaps with favorites and comment conversations bridging between WordPress and Mastodon servers. Like her, I’m not keen on installing a single-user Mastodon instance or another fediverse server that requires managing an unfamiliar programming language.

I’m also trying to do this in tandem with a suite of IndieWeb plugins, and I’m running into an issue with my friends feed page not showing any posts when the Post Kinds plugin is activated. I really want to keep this plugin because it lets me interact better with other IndieWeb sites as well as the Bridgy POSSE/backfeed service connecting me to other social networks.

My ideal is a personal website where I write everything, including long-form articles, short statuses, and replies like these. Folks could then find me at a single identifiable address and either follow the entire firehose of content or choose subsets by post type, topic, or tag. They’d be able to reply or react on my site or on their favored platform, and my site would collect those responses regardless of origin, pushing subsequent replies and reactions back out to them. Oh, and it should work with both ActivityPub clients and servers, interoperate with IndieWeb sites, and syndicate to and backfeed from other social networks, either through the Bridgy service I mentioned above or something akin to it.

So far I haven’t seen anything that ticks all these boxes, and I’m getting itchy to write my own. Perl is my favorite programming language, so I’m looking at the Yancy CMS as a base. But I know it would still be a hell of a project, and one of the reasons I chose WordPress for blogging was that it was well-established and well-supported yet still easily extensible, so I could concentrate on writing instead of endlessly tweaking the engine. Unfortunately, I’m starting to fall into that trap anyway.

Replied to Star GitHub repos on favorite as well as like · Issue #1345 · snarfed/bridgy by Ryan Barrett (GitHub)
Sounds like this change would need to happen in that WordPress plugin though, not in Bridgy, right?

I’m not so sure. The plug-in uses a different property for each (like-of vs. favorite-of), and there’s some argument that they mean different things to users and are treated differently on different systems, though I don’t think any system currently uses both. So a POSSE gateway like Bridgy should consolidate both into a single property for systems that need that, but preserve the ability to keep them separate should a different (or newer) receiving system distinguish between the two.

In any case I think the plug-in should preserve the existing properties authors have already added to posts rather than lose information by consolidating their meaning at the source.

A recent Lobsters post lauding the virtues of AWK reminded me that although the language is powerful and lightning-fast, I usually find myself exceeding its capabilities and reaching for Perl instead. One such application is analyzing voluminous log files such as the ones generated by this blog. Yes, WordPress has stats, but I’ve never let reinvention of the wheel get in the way of a good programming exercise.

So I whipped this script up on Sunday night while watching RuPaul’s Drag Race reruns. It parses my Apache web server log files and reports on hits from week to week.

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct 'operator-double-diamond';
use Regexp::Log::Common;
use DateTime::Format::HTTP;
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

# request URIs to ignore: crawler files, WordPress internals,
# feeds, and REST API routes
my @skip_uri_patterns = qw<
  ^/+robots\.txt
  [-\w]*sitemap[-\w]*\.xml
  ^/+wp-
  /feed/?$
  ^/+\?rest_route=
>;

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    my $dt  = DateTime::Format::HTTP->parse_datetime( $log{ts} );
    my $key = sprintf '%u-%02u', $dt->week;

    # get first date of each week
    $week_of{$key} ||= $dt->date;
    $count{$key}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

Here’s some sample output:

Week of 2021-07-31:      2,672
Week of 2021-08-02:     16,222
Week of 2021-08-09:     12,609
Week of 2021-08-16:     17,714
Week of 2021-08-23:     14,462
Week of 2021-08-30:     11,758
Week of 2021-09-06:     14,811
Week of 2021-09-13:        407

I first started prototyping this on the command line as if it were an awk one-liner by using the perl -n and -a flags. The former wraps code in a while loop over the <> “diamond” operator, processing each line from standard input or from files passed as arguments. The latter splits the fields of the line into an array named @F. It looked something like this while I was listing URIs (locations on the website):

gunzip -c ~/logs/phoenixtrap.com-ssl_log-*.gz | \
perl -anE 'say $F[6]'
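
For the curious, here’s roughly what those two flags expand to; this is a sketch of the equivalent loop, not perl’s literal internals:

# approximately what perl -anE 'say $F[6]' runs for each input line
use strict;
use warnings;
use feature 'say';

while (<>) {
    our @F = split ' ', $_;
    say $F[6];    # the seventh whitespace-separated field: the request URI
}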

But once I realized I’d need to filter out a bunch of URI patterns and do some aggregation by date, I turned it into a script and turned to CPAN.

There I found Regexp::Log::Common and DateTime::Format::HTTP, which let me pull apart the Apache log format and its timestamp strings without having to write even more complicated regular expressions myself. (As noted above, this was already a wheel-reinvention exercise; no need to compound that further.)

Regexp::Log::Common builds a compiled regular expression based on the log format and the fields you’re interested in; that’s what the constructor at the top of the script sets up. Matching a log line against that expression returns the fields as a list, which I assign to a hash slice with the field names as keys. I then skip over requests that aren’t successful or browser cache hits, requests that don’t GET web pages or other assets (e.g., POSTs to forms or updates of other resources), and requests matching the URI patterns mentioned earlier.
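
If the hash-slice assignment is unfamiliar, here’s a minimal standalone sketch of the idiom with a made-up line and pattern (the real script gets both from Regexp::Log::Common):

use strict;
use warnings;

my @fields = qw<method uri status>;    # hypothetical field names
my %log;

# a regex match in list context returns its captures, and the
# hash slice pairs them up with the field names as keys
@log{@fields} = 'GET /about/ 200' =~ /^(\S+) (\S+) (\d+)$/;

print "$_: $log{$_}\n" for @fields;    # method: GET, uri: /about/, status: 200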

(The skip patterns are worth a mention: they include the robots.txt and sitemap XML files used by search engine indexers, WordPress administration pages, files used by RSS newsreaders subscribed to my blog, and routes used by the Jetpack WordPress add-on. If you’re adapting this for your site, you might need to customize this list based on what software you use to run it.)
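
If you do customize the list, a throwaway sketch like this one (with an abbreviated copy of the patterns) makes it easy to check which URIs would be skipped:

use strict;
use warnings;
use List::Util 1.33 'any';

# abbreviated copy of the script's skip list
my @skip_uri_patterns = qw< ^/+robots\.txt ^/+wp- /feed/?$ >;

for my $uri (qw< /robots.txt /wp-login.php /2021/09/some-post/ >) {
    my $verdict = ( any { $uri =~ $_ } @skip_uri_patterns ) ? 'skip' : 'count';
    printf "%-20s %s\n", $uri, $verdict;
}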

The loop then parses the timestamp from the log into a DateTime object using DateTime::Format::HTTP and builds the key used to store the per-week hit count. The last lines of the loop grab the first date of each new week (assuming the log is in chronological order) and increment the count. Once finished, the printf at the end produces a report sorted by week, displaying each as a friendly “Week of” date with the hit count right-aligned by the format string. Number::Format’s format_number function displays the totals with thousands separators.
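
To illustrate that week key: DateTime’s week method returns the ISO week-based year and week number as a two-element list, which sprintf then zero-pads into a string that sorts correctly. A quick sketch using a date from the sample output above:

use strict;
use warnings;
use DateTime;

my $dt = DateTime->new( year => 2021, month => 9, day => 13 );
my ( $week_year, $week_number ) = $dt->week;
printf "%u-%02u\n", $week_year, $week_number;    # prints 2021-37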

Update: After this was initially published, astute reader Chris McGowan noted that I had a bug where $log{status} was assigned the value 304 with the = operator rather than compared with ==. He also suggested I use the double-diamond <<>> operator introduced in Perl v5.22.0 to avoid maliciously-named files. Thanks, Chris!
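
(In case the danger isn’t obvious: plain <> uses Perl’s two-argument open, which interprets special characters in its “filenames”. A quick demonstration at the shell, using a deliberately hostile, hypothetical filename:)

# plain <>: the trailing pipe makes the two-argument open run the
# command and read its output
perl -E 'while (<>) { print }' 'echo gotcha |'

# double-diamond <<>>: a three-argument open treats the same argument
# as a literal filename, fails to find it, and runs nothing
perl -E 'while (<<>>) { print }' 'echo gotcha |'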

Room for improvement

DateTime is a very powerful module, but that power comes at a price in speed and memory. Something simpler like Date::WeekNumber should yield performance improvements, especially as my logs grow (here’s hoping). It requires a bit more manual massaging of the log dates to convert them into something the module can use, though:

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct qw<
  operator-double-diamond
  regex-named-capture-group
>;
use Regexp::Log::Common;
use Date::WeekNumber 'iso_week_number';
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

# request URIs to ignore: crawler files, WordPress internals,
# feeds, and REST API routes
my @skip_uri_patterns = qw<
  ^/+robots\.txt
  [-\w]*sitemap[-\w]*\.xml
  ^/+wp-
  /feed/?$
  ^/+\?rest_route=
>;

my %month = (
    Jan => '01',
    Feb => '02',
    Mar => '03',
    Apr => '04',
    May => '05',
    Jun => '06',
    Jul => '07',
    Aug => '08',
    Sep => '09',
    Oct => '10',
    Nov => '11',
    Dec => '12',
);

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    # convert log timestamp to YYYY-MM-DD
    # for Date::WeekNumber
    $log{ts} =~ m!^
      (?<day>\d\d) /
      (?<month>...) /
      (?<year>\d{4}) : !x;
    my $date = "$+{year}-$month{ $+{month} }-$+{day}";

    my $week = iso_week_number($date);
    $week_of{$week} ||= $date;
    $count{$week}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

It looks almost the same as the first version, with the addition of a hash to convert month names to numbers and the conversion itself (using named regular expression capture groups for readability, with Syntax::Construct declaring that feature as well). On my server, this results in a ten- to eleven-second savings when processing two months of compressed logs.
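
If you’d like to measure the two approaches on your own machine, a Benchmark sketch along these lines should work; the timestamp is a hypothetical log entry, and the %month hash is the same one from the script above:

use strict;
use warnings;
use Benchmark 'cmpthese';
use DateTime::Format::HTTP;
use Date::WeekNumber 'iso_week_number';

my $ts = '13/Sep/2021:06:25:18 +0000';    # hypothetical Apache timestamp
my %month;
@month{qw<Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec>}
  = map { sprintf '%02u', $_ } 1 .. 12;

# run each version for at least five CPU seconds and compare rates
cmpthese( -5, {
    datetime => sub {
        my $dt = DateTime::Format::HTTP->parse_datetime($ts);
        sprintf '%u-%02u', $dt->week;
    },
    week_number => sub {
        $ts =~ m!^(?<day>\d\d)/(?<month>...)/(?<year>\d{4}):!;
        iso_week_number("$+{year}-$month{ $+{month} }-$+{day}");
    },
} );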

What’s next? Pretty graphs? Drilling down to specific blog posts? Database storage for further queries and analysis? Perl and CPAN make it possible to go far beyond what you can do with AWK. What would you add or change? Let me know in the comments.