Replied to Yes to ActivityPub, but no to Friends by Shelley Powers (Burningbird)
I decided to disable the Friends plug-in when I realized it was inserting every new feed item as a new post in my database. This could easily become unmanageable. Considering you can use a feed reader to read weblogs AND Mastodon accounts, it just didn’t seem worth the database burden.

I’ve also been mess­ing with the Friends and ActivityPub plu­g­ins for WordPress on my blog, and I share Shelley’s con­cerns about the for­mer bloat­ing the data­base with feed items. You can con­trol this some­what by set­ting reten­tion val­ues in days or a num­ber of posts, but you have to go into each friend’s Feeds tab and do it manually–there’s no default setting.

After read­ing that post, I’m also con­sid­er­ing dis­abling Friends in favor of a feed read­er, espe­cial­ly because (as Shelley also not­ed) there are gaps when with favorites and com­ment con­ver­sa­tions bridg­ing between WordPress and Mastodon servers. Like her, I’m not keen on installing a single-​user Mastodon instance or oth­er fedi­verse serv­er that requires man­ag­ing an unfa­mil­iar pro­gram­ming language.

I’m also try­ing to do this in tan­dem with a suite of IndieWeb plu­g­ins, and I’m run­ning into an issue with my friends feed page not show­ing any posts when the Post Kinds plu­g­in is acti­vat­ed. I real­ly want to keep this plu­g­in because it lets me inter­act bet­ter with oth­er IndieWeb sites as well as the Bridgy POSSE/​back­feed ser­vice con­nect­ing me to oth­er social networks.

My ide­al is a per­son­al web­site where I write every­thing, includ­ing long-​form arti­cles, short sta­tus­es, and replies like these. Folks can then find me via a sin­gle iden­ti­fi­able address and then subscribe/​follow the entire fire­hose of con­tent or choose sub­sets accord­ing to post types, top­ics, or tags. They’d then be able to reply or react on my site or their favored plat­form, which my site would col­lect regard­less of ori­gin, with sub­se­quent replies and reac­tions get­ting pushed out to them. Oh, and it should work with both ActivityPub clients and servers, IndieWeb sites, and syndicate/​backfeed to oth­er social net­works either with or akin to the Bridgy ser­vice I men­tioned above.

So far I haven’t seen any­thing that ticks all these box­es, and I’m get­ting itchy to write my own. Perl is my favorite pro­gram­ming lan­guage, so I’m look­ing at the Yancy CMS as a base. But I know that it would still be a hell of a project, and one of the rea­sons I chose WordPress for blog­ging was that it was well-​established and ‑sup­port­ed but still eas­i­ly exten­si­ble so that I could con­cen­trate on writ­ing instead of end­less­ly tweak­ing the engine. Unfortunately, I’m start­ing to fall into that trap anyway.

Replied to Star GitHub repos on favorite as well as like · Issue #1345 · snarfed/bridgy by Ryan BarrettRyan Barrett (GitHub)
Sounds like this change would need to happen in that WordPress plugin though, not in Bridgy, right?

I’m not so sure. The plug-​in uses a dif­fer­ent prop­er­ty for each (like-​of vs. favorite-​of) and there’s some argu­ment that they mean dif­fer­ent things to users and are treat­ed dif­fer­ent­ly on dif­fer­ent sys­tems, though I don’t think any sys­tem cur­rent­ly uses both. So a POSSE gate­way like Bridgy should con­sol­i­date both to a sin­gle prop­er­ty for sys­tems that do that, but pre­serve the abil­i­ty to keep them sep­a­rate should a dif­fer­ent (or new­er ver­sion) receiv­er dis­tin­guish between the two.

In any case I think the plug-​in should pre­serve the exist­ing prop­er­ties authors have already added to posts rather than lose infor­ma­tion by con­sol­i­dat­ing their mean­ing at the source.

depth of field photography of brown tree logs

A recent Lobsters post laud­ing the virtues of AWK remind­ed me that although the lan­guage is pow­er­ful and lightning-​fast, I usu­al­ly find myself exceed­ing its capa­bil­i­ties and reach­ing for Perl instead. One such appli­ca­tion is ana­lyz­ing volu­mi­nous log files such as the ones gen­er­at­ed by this blog. Yes, WordPress has stats, but I’ve nev­er let rein­ven­tion of the wheel get in the way of a good pro­gram­ming exercise.

So I whipped this script up on Sunday night while watch­ing RuPaul’s Drag Race reruns. It pars­es my Apache web serv­er log files and reports on hits from week to week.

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct 'operator-double-diamond';
use Regexp::Log::Common;
use DateTime::Format::HTTP;
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots.txt
  [-\w]*sitemap[-\w]*.xml
  ^/+wp-
  /feed/?$
  ^/+?rest_route=
>;

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    my $dt  = DateTime::Format::HTTP->parse_datetime( $log{ts} );
    my $key = sprintf '%u-%02u', $dt->week;

    # get first date of each week
    $week_of{$key} ||= $dt->date;
    $count{$key}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

Here’s some sam­ple output:

Week of 2021-07-31:      2,672
Week of 2021-08-02:     16,222
Week of 2021-08-09:     12,609
Week of 2021-08-16:     17,714
Week of 2021-08-23:     14,462
Week of 2021-08-30:     11,758
Week of 2021-09-06:     14,811
Week of 2021-09-13:        407

I first start­ed pro­to­typ­ing this on the com­mand line as if it were an awk one-​liner by using the perl -n and -a flags. The for­mer wraps code in a while loop over the <> dia­mond oper­a­tor”, pro­cess­ing each line from stan­dard input or files passed as argu­ments. The lat­ter splits the fields of the line into an array named @F. It looked some­thing like this while I was list­ing URIs (loca­tions on the website):

gunzip -c ~/logs/phoenixtrap.com-ssl_log-*.gz | \
perl -anE 'say $F[6]'

But once I real­ized I’d need to fil­ter out a bunch of URI pat­terns and do some aggre­ga­tion by date, I turned it into a script and turned to CPAN.

There I found Regexp::Log::Common and DateTime::Format::HTTP, which let me pull apart the Apache log for­mat and its time­stamp strings with­out hav­ing to write even more com­pli­cat­ed reg­u­lar expres­sions myself. (As not­ed above, this was already a wheel-​reinvention exer­cise; no need to com­pound that further.)

Regexp::Log::Common builds a com­piled reg­u­lar expres­sion based on the log for­mat and fields you’re inter­est­ed in, so that’s the con­struc­tor on lines 11 through 14. The expres­sion then returns those fields as a list, which I’m assign­ing to a hash slice with those field names as keys in line 29. I then skip over requests that aren’t suc­cess­ful or brows­er cache hits, skip over requests that don’t GET web pages or oth­er assets (e.g., POSTs to forms or updat­ing oth­er resources), and skip over the URI pat­terns men­tioned earlier.

(Those pat­terns are worth a men­tion: they include the robots.txt and sitemap XML files used by search engine index­ers, WordPress admin­is­tra­tion pages, files used by RSS news­read­ers sub­scribed to my blog, and routes used by the Jetpack WordPress add-​on. If you’re adapt­ing this for your site you might need to cus­tomize this list based on what soft­ware you use to run it.)

Lines 38 and 39 parse the time­stamp from the log into a DateTime object using DateTime::Format::HTTP and then build the key used to store the per-​week hit count. The last lines of the loop then grab the first date of each new week (assum­ing the log is in chrono­log­i­cal order) and incre­ment the count. Once fin­ished, lines 46 and 47 pro­vide a report sort­ed by week, dis­play­ing it as a friend­ly Week of date” and the hit counts aligned to the right with sprintf. Number::Format’s format_number func­tion dis­plays the totals with thou­sands separators.

Update: After this was ini­tial­ly pub­lished. astute read­er Chris McGowan not­ed that I had a bug where $log{status} was assigned the val­ue 304 with the = oper­a­tor rather than com­pared with ==. He also sug­gest­ed I use the double-​diamond <<>> oper­a­tor intro­duced in Perl v5.22.0 to avoid maliciously-​named files. Thanks, Chris!

Room for improvement

DateTime is a very pow­er­ful mod­ule but this comes at a price of speed and mem­o­ry. Something sim­pler like Date::WeekNumber should yield per­for­mance improve­ments, espe­cial­ly as my logs grow (here’s hop­ing). It requires a bit more man­u­al mas­sag­ing of the log dates to con­vert them into some­thing the mod­ule can use, though:

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct qw<
  operator-double-diamond
  regex-named-capture-group
>;
use Regexp::Log::Common;
use Date::WeekNumber 'iso_week_number';
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots.txt
  [-\w]*sitemap[-\w]*.xml
  ^/+wp-
  /feed/?$
  ^/+?rest_route=
>;

my %month = (
    Jan => '01',
    Feb => '02',
    Mar => '03',
    Apr => '04',
    May => '05',
    Jun => '06',
    Jul => '07',
    Aug => '08',
    Sep => '09',
    Oct => '10',
    Nov => '11',
    Dec => '12',
);

my ( %count, %week_of );
while ( <<>> ) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} == 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    # convert log timestamp to YYYY-MM-DD
    # for Date::WeekNumber
    $log{ts} =~ m!^
      (?<day>\d\d) /
      (?<month>...) /
      (?<year>\d{4}) : !x;
    my $date = "$+{year}-$month{ $+{month} }-$+{day}";

    my $week = iso_week_number($date);
    $week_of{$week} ||= $date;
    $count{$week}++;
}

printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

It looks almost the same as the first ver­sion, with the addi­tion of a hash to con­vert month names to num­bers and the actu­al con­ver­sion (using named reg­u­lar expres­sion cap­ture groups for read­abil­i­ty, using Syntax::Construct to check for that fea­ture). On my serv­er, this results in a ten- to eleven-​second sav­ings when pro­cess­ing two months of com­pressed logs.

What’s next? Pretty graphs? Drilling down to spe­cif­ic blog posts? Database stor­age for fur­ther queries and analy­sis? Perl and CPAN make it pos­si­ble to go far beyond what you can do with AWK. What would you add or change? Let me know in the comments.