This month I started a new job at Alert Logic, a cybersecurity provider with Perl (among many other things) at its beating heart. I’ve been learning a lot, and part of the process has been understanding the APIs in the code base. To that end, I’ve been writing small test scripts to tease apart data structures, using Perl array-processing, list-processing, and hash- (i.e., associative array)-processing functions.
I’ve covered map, grep, and friends a coupletimes before. Most recently, I described using List::Util’s any function to check if a condition is true for any item in a list. In the simplest case, you can use it to check to see if a given value is in the list at all:
use feature 'say';
use List::Util 'any';
my @colors =
qw(red orange yellow green blue indigo violet);
say 'matched' if any { /^red$/ } @colors;
However, if you’re going to be doing this a lot with arbitrary strings, Perl FAQ section 4 advises turning the array into the keys of a hash and then checking for membership there. For example, here’s a simple script to check if the colors input (either from the keyboard or from files passed as arguments) are in the rainbow:
#!/usr/bin/env perl
use v5.22; # introduced <<>> for safe opening of arguments
use warnings;
my %in_colors = map {$_ => 1}
qw(red orange yellow green blue indigo violet);
while (<<>>) {
chomp;
say "$_ is in the rainbow" if $in_colors{$_};
}
List::Util has a bunch of functions for processing lists of pairs that I’ve found useful when pawing through hashes. pairgrep, for example, acts just like grep but instead assigns $a and $b to each key and value passed in and returns the resulting pairs that match. I’ve used it as a quick way to search for hash entries matching certain value conditions:
use List::Util 'pairgrep';
my %numbers = (zero => 0, one => 1, two => 2, three => 3);
my %odds = pairgrep {$b % 2} %numbers;
Sure, you could do this by invoking a mix of plain grep, keys, and a hash slice, but it’s noisier and more repetitive:
use v5.20; # for key/value hash slice
my %odds = %numbers{grep {$numbers{$_} % 2} keys %numbers};
pairgrep’s compiled C‑based XS code can also be faster, as evidenced by this Benchmark script that works through a hash made of the Unix words file (479,828 entries on my machine):
#!/usr/bin/env perl
use v5.20;
use warnings;
use List::Util 'pairgrep';
use Benchmark 'cmpthese';
my (%words, $count);
open my $fh, '<', '/usr/share/dict/words'
or die "can't open words: $!";
while (<$fh>) {
chomp;
$words{$_} = $count++;
}
close $fh;
cmpthese(100, {
grep => sub {
my %odds = %words{grep {$words{$_} % 2} keys %words};
},
pairgrep => sub {
my %odds = pairgrep {$b % 2} %words;
},
} );
Six months ago I gave an overview of Perl’s list processing fundamentals, briefly describing what lists are and then introducing the built-in map and grep functions for transforming and filtering them. Later on, I compiled a list (how appropriate) of list processing modules available via CPAN, noting there’s some confusing duplication of effort. But you’re a busy developer, and you just want to know the Right Thing To Do™ when faced with a list processing challenge.
First, some credit is due: these are all restatements of several Perl::Critic policies which in turn codify standards described in Damian Conway’s Perl Best Practices (2005). I’ve repeatedly recommended the latter as a starting point for higher-quality Perl development. Over the years these practices continue to be re-evaluated (including by the author himself) and various authors release new policy modules, but perlcritic remains a great tool for ensuring you (and your team or other contributors) maintain a consistent high standard in your code.
With that said, on to the recommendations!
Don’t use grep to check if any list elements match
It might sound weird to lead off by recommending not to use grep, but sometimes it’s not the right tool for the job. If you’ve got a list and want to determine if a condition matches any item in it, you might try:
if (grep { some_condition($_) } @my_list) {
... # don't do this!
}
Yes, this works because (in scalar context) grep returns the number of matches found, but it’s wasteful, checking every element of @my_list (which could be lengthy) before finally providing a result. Use the standard List::Util module’s any function, which immediately returns (“short-circuits”) on the first match:
use List::Util 1.33 qw(any);
if (any { some_condition($_) } @my_list) { ... # do something }
As a side note for web developers, the Perl Dancer framework also includes an any keyword for declaring multiple HTTP routes, so if you’re mixing List::Util in there don’t import it. Instead, call it explicitly like this or you’ll get an error about a redefined function:
use List::Util 1.33;
if (List::Util::any { some_condition($_) } @my_list) { ... # do something }
I mentioned this back in March, but it bears repeating: map and grep are intended as pure functions, not mutators with side effects. This means that the original list should remain unchanged. Yes, each element aliases in turn to the $_ special variable, but that’s for speed and can have surprising results if changed even if it’s technically allowed. If you need to modify an array in-place use something like:
for (@my_array) { $_ = ...; # make your changes here }
If you want something that looks like map but won’t change the original list (and don’t mind a few CPAN dependencies), consider List::SomeUtils’ apply function:
use List::SomeUtils qw(apply);
my @doubled_array = apply {$_ *= 2} @old_array;
Lastly, side effects also include things like manipulating other variables or doing input and output. Don’t use map or grep in a void context (i.e., without a resulting array or list); do something with the results or use a for or foreach loop:
map { print foo($_) } @my_array; # don't do this print map { foo($_) } @my_array; # do this instead
map { push @new_array, foo($_) } @my_array; # don't do this @new_array = map { foo($_) } @my_array; # do this instead
my @new_array = map foo($_), @old_array; # don't do this my @new_array2 = grep !/^#/, @old_array; # don't do this
Or like this:
my @new_array = map { foo($_) } @old_array; my @new_array2 = grep {!/^#/} @old_array;
Do it the second way. It’s easier to read, especially if you’re passing in a literal list or multiple arrays, and the expression forms can conceal bugs. This recommendation is codified by the BuiltinFunctions::RequireBlockGrep and BuiltinFunctions::RequireBlockMap Perl::Critic policies and comes from Perl Best Practices.
Refactor multi-statement maps, greps, and other list functions
map, grep, and friends should follow the Unix philosophy of “Do One Thing and Do It Well.” Your readability and maintainability drop with every statement you place inside one of their blocks. Consider junior developers and future maintainers (this includes you!) and refactor anything with more than one statement into a separate subroutine or at least a for loop. This goes for list processing functions (like the aforementioned any) imported from other modules, too.
A recent Lobsters post lauding the virtues of AWK reminded me that although the language is powerful and lightning-fast, I usually find myself exceeding its capabilities and reaching for Perl instead. One such application is analyzing voluminous log files such as the ones generated by this blog. Yes, WordPress has stats, but I’ve never let reinvention of the wheel get in the way of a good programming exercise.
#!/usr/bin/env perl
use strict;
use warnings;
use Syntax::Construct 'operator-double-diamond';
use Regexp::Log::Common;
use DateTime::Format::HTTP;
use List::Util 1.33 'any';
use Number::Format 'format_number';
my $parser = Regexp::Log::Common->new(
format => ':extended',
capture => [qw<req ts status>],
);
my @fields = $parser->capture;
my $compiled_re = $parser->regexp;
my @skip_uri_patterns = qw<
^/+robots.txt
[-\w]*sitemap[-\w]*.xml
^/+wp-
/feed/?$
^/+?rest_route=
>;
my ( %count, %week_of );
while ( <<>> ) {
my %log;
@log{@fields} = /$compiled_re/;
# only interested in successful or cached requests
next unless $log{status} =~ /^2/ or $log{status} == 304;
my ( $method, $uri, $protocol ) = split ' ', $log{req};
next unless $method eq 'GET';
next if any { $uri =~ $_ } @skip_uri_patterns;
my $dt = DateTime::Format::HTTP->parse_datetime( $log{ts} );
my $key = sprintf '%u-%02u', $dt->week;
# get first date of each week
$week_of{$key} ||= $dt->date;
$count{$key}++;
}
printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
for sort keys %count;
Here’s some sample output:
Week of 2021-07-31: 2,672
Week of 2021-08-02: 16,222
Week of 2021-08-09: 12,609
Week of 2021-08-16: 17,714
Week of 2021-08-23: 14,462
Week of 2021-08-30: 11,758
Week of 2021-09-06: 14,811
Week of 2021-09-13: 407
I first started prototyping this on the command line as if it were an awk one-liner by using the perl -n and -a flags. The former wraps code in a while loop over the <> “diamond operator”, processing each line from standard input or files passed as arguments. The latter splits the fields of the line into an array named @F. It looked something like this while I was listing URIs (locations on the website):
But once I realized I’d need to filter out a bunch of URI patterns and do some aggregation by date, I turned it into a script and turned to CPAN.
There I found Regexp::Log::Common and DateTime::Format::HTTP, which let me pull apart the Apache log format and its timestamp strings without having to write even more complicated regular expressions myself. (As noted above, this was already a wheel-reinvention exercise; no need to compound that further.)
Regexp::Log::Common builds a compiled regular expression based on the log format and fields you’re interested in, so that’s the constructor on lines 11 through 14. The expression then returns those fields as a list, which I’m assigning to a hash slice with those field names as keys in line 29. I then skip over requests that aren’t successful or browser cache hits, skip over requests that don’t GET web pages or other assets (e.g., POSTs to forms or updating other resources), and skip over the URI patterns mentioned earlier.
(Those patterns are worth a mention: they include the robots.txt and sitemap XML files used by search engine indexers, WordPress administration pages, files used by RSS newsreaders subscribed to my blog, and routes used by the Jetpack WordPress add-on. If you’re adapting this for your site you might need to customize this list based on what software you use to run it.)
Lines 38 and 39 parse the timestamp from the log into a DateTime object using DateTime::Format::HTTP and then build the key used to store the per-week hit count. The last lines of the loop then grab the first date of each new week (assuming the log is in chronological order) and increment the count. Once finished, lines 46 and 47 provide a report sorted by week, displaying it as a friendly “Week of date” and the hit counts aligned to the right with sprintf. Number::Format’s format_number function displays the totals with thousands separators.
Update: After this was initially published. astute reader Chris McGowan noted that I had a bug where $log{status} was assigned the value 304 with the = operator rather than compared with ==. He also suggested I use the double-diamond <<>> operator introduced in Perl v5.22.0 to avoid maliciously-named files. Thanks, Chris!
Room for improvement
DateTime is a very powerful module but this comes at a price of speed and memory. Something simpler like Date::WeekNumber should yield performance improvements, especially as my logs grow (here’s hoping). It requires a bit more manual massaging of the log dates to convert them into something the module can use, though:
#!/usr/bin/env perl
use strict;
use warnings;
use Syntax::Construct qw<
operator-double-diamond
regex-named-capture-group
>;
use Regexp::Log::Common;
use Date::WeekNumber 'iso_week_number';
use List::Util 1.33 'any';
use Number::Format 'format_number';
my $parser = Regexp::Log::Common->new(
format => ':extended',
capture => [qw<req ts status>],
);
my @fields = $parser->capture;
my $compiled_re = $parser->regexp;
my @skip_uri_patterns = qw<
^/+robots.txt
[-\w]*sitemap[-\w]*.xml
^/+wp-
/feed/?$
^/+?rest_route=
>;
my %month = (
Jan => '01',
Feb => '02',
Mar => '03',
Apr => '04',
May => '05',
Jun => '06',
Jul => '07',
Aug => '08',
Sep => '09',
Oct => '10',
Nov => '11',
Dec => '12',
);
my ( %count, %week_of );
while ( <<>> ) {
my %log;
@log{@fields} = /$compiled_re/;
# only interested in successful or cached requests
next unless $log{status} =~ /^2/ or $log{status} == 304;
my ( $method, $uri, $protocol ) = split ' ', $log{req};
next unless $method eq 'GET';
next if any { $uri =~ $_ } @skip_uri_patterns;
# convert log timestamp to YYYY-MM-DD
# for Date::WeekNumber
$log{ts} =~ m!^
(?<day>\d\d) /
(?<month>...) /
(?<year>\d{4}) : !x;
my $date = "$+{year}-$month{ $+{month} }-$+{day}";
my $week = iso_week_number($date);
$week_of{$week} ||= $date;
$count{$week}++;
}
printf "Week of %s: % 10s\n", $week_of{$_}, format_number( $count{$_} )
for sort keys %count;
It looks almost the same as the first version, with the addition of a hash to convert month names to numbers and the actual conversion (using named regular expression capture groups for readability, using Syntax::Construct to check for that feature). On my server, this results in a ten- to eleven-second savings when processing two months of compressed logs.
What’s next? Pretty graphs? Drilling down to specific blog posts? Database storage for further queries and analysis? Perl and CPAN make it possible to go far beyond what you can do with AWK. What would you add or change? Let me know in the comments.
{"id":"11","mode":"button","open_style":"in_modal","currency_code":"USD","currency_symbol":"$","currency_type":"decimal","blank_flag_url":"https:\/\/phoenixtrap.com\/wp-content\/plugins\/tip-jar-wp\/\/assets\/images\/flags\/blank.gif","flag_sprite_url":"https:\/\/phoenixtrap.com\/wp-content\/plugins\/tip-jar-wp\/\/assets\/images\/flags\/flags.png","default_amount":500,"top_media_type":"featured_image","featured_image_url":"https:\/\/phoenixtrap.com\/wp-content\/uploads\/2021\/02\/image-200x200.jpg","featured_embed":"","header_media":null,"file_download_attachment_data":null,"recurring_options_enabled":true,"recurring_options":{"never":{"selected":true,"after_output":"One time only"},"weekly":{"selected":false,"after_output":"Every week"},"monthly":{"selected":false,"after_output":"Every month"},"yearly":{"selected":false,"after_output":"Every year"}},"strings":{"current_user_email":"","current_user_name":"","link_text":"Leave a tip!","complete_payment_button_error_text":"Check info and try again","payment_verb":"Pay","payment_request_label":"The Phoenix Trap","form_has_an_error":"Please check and fix the errors above","general_server_error":"Something isn't working right at the moment. Please try again.","form_title":"The Phoenix Trap","form_subtitle":"Do you like what you see? Leave a one-time or recurring tip!","currency_search_text":"Country or Currency here","other_payment_option":"Other payment option","manage_payments_button_text":"Manage your payments","thank_you_message":"Thank you for being a supporter!","payment_confirmation_title":"The Phoenix Trap","receipt_title":"Your Receipt","print_receipt":"Print Receipt","email_receipt":"Email Receipt","email_receipt_sending":"Sending receipt...","email_receipt_success":"Email receipt successfully sent","email_receipt_failed":"Email receipt failed to send. Please try again.","receipt_payee":"Paid to","receipt_statement_descriptor":"This will show up on your statement as","receipt_date":"Date","receipt_transaction_id":"Transaction ID","receipt_transaction_amount":"Amount","refund_payer":"Refund from","login":"Log in to manage your payments","manage_payments":"Manage Payments","transactions_title":"Your Transactions","transaction_title":"Transaction Receipt","transaction_period":"Plan Period","arrangements_title":"Your Plans","arrangement_title":"Manage Plan","arrangement_details":"Plan Details","arrangement_id_title":"Plan ID","arrangement_payment_method_title":"Payment Method","arrangement_amount_title":"Plan Amount","arrangement_renewal_title":"Next renewal date","arrangement_action_cancel":"Cancel Plan","arrangement_action_cant_cancel":"Cancelling is currently not available.","arrangement_action_cancel_double":"Are you sure you'd like to cancel?","arrangement_cancelling":"Cancelling Plan...","arrangement_cancelled":"Plan Cancelled","arrangement_failed_to_cancel":"Failed to cancel plan","back_to_plans":"\u2190 Back to Plans","update_payment_method_verb":"Update","sca_auth_description":"Your have a pending renewal payment which requires authorization.","sca_auth_verb":"Authorize renewal payment","sca_authing_verb":"Authorizing payment","sca_authed_verb":"Payment successfully authorized!","sca_auth_failed":"Unable to authorize! Please try again.","login_button_text":"Log in","login_form_has_an_error":"Please check and fix the errors above","uppercase_search":"Search","lowercase_search":"search","uppercase_page":"Page","lowercase_page":"page","uppercase_items":"Items","lowercase_items":"items","uppercase_per":"Per","lowercase_per":"per","uppercase_of":"Of","lowercase_of":"of","back":"Back to plans","zip_code_placeholder":"Zip\/Postal Code","download_file_button_text":"Download File","input_field_instructions":{"tip_amount":{"placeholder_text":"How much would you like to tip?","initial":{"instruction_type":"normal","instruction_message":"How much would you like to tip? Choose any currency."},"empty":{"instruction_type":"error","instruction_message":"How much would you like to tip? Choose any currency."},"invalid_curency":{"instruction_type":"error","instruction_message":"Please choose a valid currency."}},"recurring":{"placeholder_text":"Recurring","initial":{"instruction_type":"normal","instruction_message":"How often would you like to give this?"},"success":{"instruction_type":"success","instruction_message":"How often would you like to give this?"},"empty":{"instruction_type":"error","instruction_message":"How often would you like to give this?"}},"name":{"placeholder_text":"Name on Credit Card","initial":{"instruction_type":"normal","instruction_message":"What is the name on your credit card?"},"success":{"instruction_type":"success","instruction_message":"Enter the name on your card."},"empty":{"instruction_type":"error","instruction_message":"Please enter the name on your card."}},"privacy_policy":{"terms_title":"Terms and conditions","terms_body":null,"terms_show_text":"View Terms","terms_hide_text":"Hide Terms","initial":{"instruction_type":"normal","instruction_message":"I agree to the terms."},"unchecked":{"instruction_type":"error","instruction_message":"Please agree to the terms."},"checked":{"instruction_type":"success","instruction_message":"I agree to the terms."}},"email":{"placeholder_text":"Your email address","initial":{"instruction_type":"normal","instruction_message":"What is your email address?"},"success":{"instruction_type":"success","instruction_message":"Enter your email address"},"blank":{"instruction_type":"error","instruction_message":"Enter your email address"},"not_an_email_address":{"instruction_type":"error","instruction_message":"Make sure you have entered a valid email address"}},"note_with_tip":{"placeholder_text":"Your note here...","initial":{"instruction_type":"normal","instruction_message":"Attach a note to your tip (optional)"},"empty":{"instruction_type":"normal","instruction_message":"Attach a note to your tip (optional)"},"not_empty_initial":{"instruction_type":"normal","instruction_message":"Attach a note to your tip (optional)"},"saving":{"instruction_type":"normal","instruction_message":"Saving note..."},"success":{"instruction_type":"success","instruction_message":"Note successfully saved!"},"error":{"instruction_type":"error","instruction_message":"Unable to save note note at this time. Please try again."}},"email_for_login_code":{"placeholder_text":"Your email address","initial":{"instruction_type":"normal","instruction_message":"Enter your email to log in."},"success":{"instruction_type":"success","instruction_message":"Enter your email to log in."},"blank":{"instruction_type":"error","instruction_message":"Enter your email to log in."},"empty":{"instruction_type":"error","instruction_message":"Enter your email to log in."}},"login_code":{"initial":{"instruction_type":"normal","instruction_message":"Check your email and enter the login code."},"success":{"instruction_type":"success","instruction_message":"Check your email and enter the login code."},"blank":{"instruction_type":"error","instruction_message":"Check your email and enter the login code."},"empty":{"instruction_type":"error","instruction_message":"Check your email and enter the login code."}},"stripe_all_in_one":{"initial":{"instruction_type":"normal","instruction_message":"Enter your credit card details here."},"empty":{"instruction_type":"error","instruction_message":"Enter your credit card details here."},"success":{"instruction_type":"normal","instruction_message":"Enter your credit card details here."},"invalid_number":{"instruction_type":"error","instruction_message":"The card number is not a valid credit card number."},"invalid_expiry_month":{"instruction_type":"error","instruction_message":"The card's expiration month is invalid."},"invalid_expiry_year":{"instruction_type":"error","instruction_message":"The card's expiration year is invalid."},"invalid_cvc":{"instruction_type":"error","instruction_message":"The card's security code is invalid."},"incorrect_number":{"instruction_type":"error","instruction_message":"The card number is incorrect."},"incomplete_number":{"instruction_type":"error","instruction_message":"The card number is incomplete."},"incomplete_cvc":{"instruction_type":"error","instruction_message":"The card's security code is incomplete."},"incomplete_expiry":{"instruction_type":"error","instruction_message":"The card's expiration date is incomplete."},"incomplete_zip":{"instruction_type":"error","instruction_message":"The card's zip code is incomplete."},"expired_card":{"instruction_type":"error","instruction_message":"The card has expired."},"incorrect_cvc":{"instruction_type":"error","instruction_message":"The card's security code is incorrect."},"incorrect_zip":{"instruction_type":"error","instruction_message":"The card's zip code failed validation."},"invalid_expiry_year_past":{"instruction_type":"error","instruction_message":"The card's expiration year is in the past"},"card_declined":{"instruction_type":"error","instruction_message":"The card was declined."},"missing":{"instruction_type":"error","instruction_message":"There is no card on a customer that is being charged."},"processing_error":{"instruction_type":"error","instruction_message":"An error occurred while processing the card."},"invalid_request_error":{"instruction_type":"error","instruction_message":"Unable to process this payment, please try again or use alternative method."},"invalid_sofort_country":{"instruction_type":"error","instruction_message":"The billing country is not accepted by SOFORT. Please try another country."}}}},"fetched_oembed_html":false}