The Phoenix Trap

Code, music, philosophy, etc.

Tag: lists

Scraping the Dragon with Perl and Mojolicious
Every extended Labor Day weekend, 80,000 fans of pop culture descend on Atlanta for Dragon Con. It’s a sprawling choose-your-own-adventure of a convention with 38 programming tracks and over 5,000 hours of events. It spans five downtown host hotels, and there is no way to see it all.

Sadly, this year’s con is almost over. Still, I thought I’d share a little script I wrote to help me make sense of it all.

The official mobile app is fine for searching and bookmarking events, speakers, and exhibitors. Nonetheless, it’s not suitable for scanning the whole landscape at once. I wanted a single, scrollable view of every event, before I even packed my cosplay.

Even in the app’s tablet version, the Dragon Con Events area is a scroll-fest.

The web version of the app gave me exactly what I needed: predictable per-day URLs and semantically marked-up HTML. That meant I could skip the API hunt, skip the manual scrolling, and go straight to scraping.

Inspecting the HTML reveals per-day event URLs and per-event <div> blocks.

From Chaos to Clarity in 40 lines

We’re about to turn a messy, multi-day, multi-hotel schedule into one clean, scroll-once list. This is the forty-five-line Perl map that gets us there, aided by the Mojolicious web toolkit.

Laying the Groundwork: Tools for the Job
```
#!/usr/bin/env perl

use v5.40;

use Carp;
use English;
use Mojo::UserAgent;
use Mojo::URL;
use Mojo::DOM;
use Mojo::Collection q(c);
use Time::Piece;
use HTML::HTML5::Entities;
use Memoize;

binmode STDOUT, ':encoding(UTF-8)'
  or croak "Couldn't encode STDOUT: $OS_ERROR";

my $ua   = Mojo::UserAgent->new();
my $site = Mojo::URL->new('https://app.core-apps.com');
my $path = '/dragoncon25/events/view_by_day';
```
What’s happening: Load the modules that will do the heavy lifting–HTTP fetches, DOM parsing, date handling, Unicode cleanup. Lock STDOUT to UTF‑8 so characters like curly quotes and em-dashes don’t break the output. Point the script at the base schedule URL.

Remembering the Days Without Re-Parsing
```
my $date_from_dom = memoize( sub ($dom) {
  return content_at( $dom, 'div.section_header[class~="alt"]' );
} );
```
What’s happening: Create a memoized helper that plucks the date from a day’s HTML and caches it. That way, if we need it again, we skip the DOM re-parse and keep the pipeline fast.

content_at is a helper function I define later.
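
If you haven’t used the core Memoize module with a code reference before, here is a minimal sketch of the idea. The slow_square function and its call counter are hypothetical stand-ins for an expensive lookup and are not part of the scraper.

```perl
use v5.36;
use Memoize;

my $calls = 0;

# Hypothetical stand-in for an expensive lookup such as a DOM query
sub slow_square ($n) {
  $calls++;
  return $n * $n;
}

# Given a code ref, memoize() returns the caching wrapper
# rather than installing it under the original name
my $fast_square = memoize( \&slow_square );

# Three calls with the same argument
my @results = map { $fast_square->($_) } 4, 4, 4;

say "results: @results, underlying calls: $calls";
```

Only the first call should reach slow_square; the other two are answered from the cache, which is keyed on the stringified arguments by default.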

Starting Where the App Starts
```
my $today_dom = Mojo::DOM->new( $ua->get("$site$path")->result->text );
```
What’s happening: Fetch the “today” view–the same default the app shows. This is so we have a known starting point for building the full timeline.

Collecting the Whole Timeline
```
my $day_doms = c(
  $today_dom,
  $today_dom->find(qq(div.filter_box-days > a[href^="$path?day="]))
    ->map( \&dom_from_anchor )
    ->to_array->@*,
)->sort( sub { day_epoch($a) <=> day_epoch($b) } );
```
What’s happening: Grab every day link from the filter bar, fetch each day’s HTML, and sort them chronologically. Now we’ve got the entire con’s schedule in memory, ready to process.

dom_from_anchor and day_epoch are two more helper functions explained further down.

Turning HTML into a Human-Readable Schedule
```
$day_doms->each( sub {    # process each day's events
  my $date = $date_from_dom->($_);

  $_->find('a.bookmark[data-type="events"] + a.object_link')
    ->each( sub {         # output start time + title

      my $time    = content_at( $_, 'div.line[class~="two"]' );
      my $title   = content_at( $_, 'div.line[class~="one"]' );
      my ($start) = split /\s*\p{Dash_Punctuation}/, $time;

      say "$date $start: ", decode_entities($title);
    } );
} );
```
What’s happening: For each day, find every event link and pull out the start time and title. Split the time cleanly on any dash and decode HTML entities so the output reads like a real schedule.

The Little Routines That Make It All Work
```
sub dom_from_anchor ($dom) {    # fetch DOM for a day link
  return Mojo::DOM->new(
    $ua->get( Mojo::URL->new( $dom->attr('href') )->to_abs($site) )
      ->result->text );
}

sub day_epoch ($dom) {    # parse date into epoch
  return Time::Piece->strptime( $date_from_dom->($dom), '%A, %b %e' )
    ->epoch;
}

# extract and trim text from selector
sub content_at ( $dom, @args ) { return trim $dom->at(@args)->content }
```
What’s happening:
1. dom_from_anchor: fetches and parses a linked day’s HTML.
2. day_epoch: turn a date string into a sort-able epoch.
3. content_at: extract and trim text from a DOM fragment, given a CSS selector.
These helpers keep the main flow readable and re-usable.
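
To see the day_epoch idea in isolation, here is a small sketch that sorts a few date strings using the same strptime format. The sample strings and the epoch_for name are made up for illustration. Because the strings carry no year, every one gets the same default year, so the ordering only holds within a single calendar year.

```perl
use v5.36;
use Time::Piece;

# Sample day headers in the "Weekday, Mon D" shape seen in the output
my @days = ( 'Sunday, Aug 31', 'Monday, Sep 1', 'Thursday, Aug 28' );

# Hypothetical helper: turn a day header into a sortable epoch
sub epoch_for ($day) {
  return Time::Piece->strptime( $day, '%A, %b %e' )->epoch;
}

my @sorted = sort { epoch_for($a) <=> epoch_for($b) } @days;

say for @sorted;
```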

The Schedule, Unlocked

Run the script and you get a clean, UTF-8-safe list of every event, in chronological order, across all days. No swiping around, no tapping, no “what did I miss?” anxiety. (Ha, who am I kidding? There’s too much going on at Dragon Con to not end up missing something.)

An example run of the script in my terminal. Each line is “Day, Date Time: Event Title”, sorted chronologically across the whole con.

And here’s just a small slice of the 2,500+ lines it produces:

Sunday, Aug 31 11:30 AM: Unmasking Sherlock: Beyond the Many Faces
Sunday, Aug 31 11:30 AM: Weaponization of the FCC and Other Agencies to Chill Speech
Sunday, Aug 31 11:30 AM: Where Physics Gets Weird
. . .
Sunday, Aug 31 11:50 AM: Photo Session: Amelia Tyler
Sunday, Aug 31 11:50 AM: Photo Session: Cissy Jones
Sunday, Aug 31 11:50 AM: Photo Session: Emma Gregory
. . .
Sunday, Aug 31 12:00 PM: Dragon Con Mashups
Sunday, Aug 31 12:00 PM: James J. Butcher and R.R. Virdi signing at The Missing Volume booth# 1300
Sunday, Aug 31 12:00 PM: JoeDan Worley and Eric Dontigney signing at the Shadow Alley Press Booth# 2
. . .
Sunday, Aug 31 12:00 PM: Photo Session: Robert Duncan McNeill
Sunday, Aug 31 12:00 PM: Photo Session: Robert Picardo
Sunday, Aug 31 12:00 PM: Photo Session: Tamara Taylor

Key Techniques

Here’s the fun part–the techniques that make this tidy, scroll-once list possible.

CSS selectors for precision

I used a.bookmark[data-type="events"] + a.object_link to grab only the event title links, and div.line[class~="two"] / div.line[class~="one"] for time and title, respectively. This avoids scraping unrelated elements.
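
Here is a rough sketch of those selectors at work on a small HTML fragment. The markup is my guess at the general shape of the page, written for illustration, and not a copy of the real thing.

```perl
use v5.36;
use Mojo::DOM;
use Mojo::Util qw(trim);

# Hypothetical markup: an approximation of the page's shape
my $dom = Mojo::DOM->new(<<'HTML');
<a class="bookmark" data-type="events" href="#"></a>
<a class="object_link" href="/event/1">
  <div class="line one"> Where Physics Gets Weird </div>
  <div class="line two">11:30 AM - 12:30 PM</div>
</a>
<a class="bookmark" data-type="exhibitors" href="#"></a>
<a class="object_link" href="/exhibitor/9">
  <div class="line one">Some Exhibitor</div>
</a>
HTML

# "+" keeps only links that directly follow an events bookmark
my @titles = $dom->find('a.bookmark[data-type="events"] + a.object_link')
  ->map( sub { trim( $_->at('div.line[class~="one"]')->content ) } )
  ->each;

say for @titles;
```

The exhibitor link is skipped because the bookmark in front of it has a different data-type.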

Memoization for efficiency

memoize caches the date string for each day’s DOM so I didn’t end up re-parsing the HTML fragment multiple times.

Unicode-safe splitting

\p{Dash_Punctuation} matches any dash type (em, en, hyphen-minus, etc.), so I could split times reliably without worrying about which dash the site uses.
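
A quick sketch of that split, using made-up time ranges. The en dash and em dash are written as \x{...} escapes so the example doesn’t depend on the source file’s encoding.

```perl
use v5.36;

# Made-up ranges using a hyphen-minus, an en dash, and an em dash
my @ranges = (
  '11:30 AM - 12:30 PM',
  "1:00 PM \x{2013} 2:00 PM",
  "4:00 PM\x{2014}5:00 PM",
);

# Keep only the text before the first dash of any kind
my @starts = map { ( split /\s*\p{Dash_Punctuation}/ )[0] } @ranges;

say for @starts;
```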

Functional chaining

Mojo::Collection’s map, sort, and each methods let me express the scrape→transform→output pipeline in a linear, readable way.
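
As a stripped-down illustration of that chaining style, here is a sketch that runs invented time/title pairs through the same sort, map, and each steps. The data and the 24-hour integer times are assumptions made for brevity.

```perl
use v5.36;
use Mojo::Collection qw(c);

# Invented [time, title] pairs standing in for scraped events
my @lines;
c( [ 1200, 'Dragon Con Mashups' ], [ 1130, 'Where Physics Gets Weird' ] )
  ->sort( sub { $a->[0] <=> $b->[0] } )               # order by time
  ->map( sub { sprintf '%04d: %s', $_->@* } )          # format each pair
  ->each( sub { push @lines, $_ } );                   # collect for output

say for @lines;
```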

Entity decoding at output

HTML::HTML5::Entities’ decode_entities is applied right before printing, so HTML entities like &amp;amp; or &amp;quot; come out as readable characters in the final output.

A Pattern You Can Take Anywhere

The same approach that tamed Dragon Con’s chaos works anywhere you’ve got:
- Predictable URLs–so you can iterate without guesswork
- Consistent HTML structure–so your selectors stay stable
- A need to see everything at once–so you can make decisions without paging or filtering
From fan conventions to conference schedules, from local sports fixtures to film festival line‑ups–the same pattern applies. Sometimes the right tool isn’t a sprawling framework or heavyweight API client. It’s a forty‑odd‑line Perl script that does one thing with ruthless clarity.

Because once you’ve tamed a schedule like this, the only lines you’ll stand in are the ones that feel like part of the show.
August 31, 2025
Perl lightning talk: “Don’t Fear map and grep”
This week’s Perl and Raku Conference 2022 in Houston was packed with great presentations, and I humbly added to them with a five-ish minute lightning talk on two of Perl’s more misunderstood functions: map and grep.

Sorry about the “um”s and “ah”s…

PDF slides of the presentation

I’ve written much about list processing in Perl, and this talk was based on the following blog posts:
Overall I loved attending the conference, and it really invigorated my participation in the Perl community. Stay tuned as I resume regular posting!

Update for Raku

On Twitter I nudged prominent Raku hacker (and recovered Perl hacker) Elizabeth Mattijsen to write about the Raku programming language’s map and grep functionality. Check out her five-part series on DEV.to.
June 24, 2022
Perl list processing is for hashes, too
This month I started a new job at Alert Logic, a cybersecurity provider with Perl (among many other things) at its beating heart. I’ve been learning a lot, and part of the process has been understanding the APIs in the code base. To that end, I’ve been writing small test scripts to tease apart data structures, using Perl’s array-, list-, and hash-processing functions (hashes being Perl’s associative arrays).

I’ve covered map, grep, and friends a couple times before. Most recently, I described using List::Util’s any function to check if a condition is true for any item in a list. In the simplest case, you can use it to check to see if a given value is in the list at all:
```
use feature 'say';
use List::Util 'any';
my @colors =
  qw(red orange yellow green blue indigo violet);
say 'matched' if any { /^red$/ } @colors;
```
However, if you’re going to be doing this a lot with arbitrary strings, Perl FAQ section 4 advises turning the array into the keys of a hash and then checking for membership there. For example, here’s a simple script to check if the colors input (either from the keyboard or from files passed as arguments) are in the rainbow:
```
#!/usr/bin/env perl

use v5.22; # introduced <<>> for safe opening of arguments
use warnings;
 
my %in_colors = map {$_ => 1}
  qw(red orange yellow green blue indigo violet);

while (<<>>) {
  chomp;
  say "$_ is in the rainbow" if $in_colors{$_};
}
```
List::Util has a bunch of functions for processing lists of pairs that I’ve found useful when pawing through hashes. pairgrep, for example, acts just like grep but sets $a and $b to each key and value passed in, returning the pairs that match. I’ve used it as a quick way to search for hash entries matching certain value conditions:
```
use List::Util 'pairgrep';
my %numbers = (zero => 0, one => 1, two => 2, three => 3);
my %odds = pairgrep {$b % 2} %numbers;
```
Sure, you could do this by invoking a mix of plain grep, keys, and a hash slice, but it’s noisier and more repetitive:
```
use v5.20; # for key/value hash slice 
my %odds = %numbers{grep {$numbers{$_} % 2} keys %numbers};
```
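
If you want to convince yourself the two forms agree, a short sketch like this runs both against the same hash and compares the sorted keys. The variable names are only for illustration.

```perl
use v5.20;    # for the key/value hash slice
use warnings;
use List::Util qw(pairgrep);

my %numbers = ( zero => 0, one => 1, two => 2, three => 3 );

my %odds_pairgrep = pairgrep { $b % 2 } %numbers;
my %odds_slice    = %numbers{ grep { $numbers{$_} % 2 } keys %numbers };

# Hash order is unpredictable, so sort the keys before comparing
my @from_pairgrep = sort keys %odds_pairgrep;
my @from_slice    = sort keys %odds_slice;

say "pairgrep: @from_pairgrep";
say "slice:    @from_slice";
```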
pairgrep’s compiled C‑based XS code can also be faster, as evidenced by this Benchmark script that works through a hash made of the Unix words file (479,828 entries on my machine):
```
#!/usr/bin/env perl

use v5.20;
use warnings;
use List::Util 'pairgrep';
use Benchmark 'cmpthese';

my (%words, $count);
open my $fh, '<', '/usr/share/dict/words'
  or die "can't open words: $!";
while (<$fh>) {
  chomp;
  $words{$_} = $count++;
}
close $fh;

cmpthese(100, {
  grep => sub {
    my %odds = %words{grep {$words{$_} % 2} keys %words};
  },
  pairgrep => sub {
    my %odds = pairgrep {$b % 2} %words;
  },
} );
```
Benchmark output:
```
           Rate     grep pairgrep
grep     1.47/s       --     -20%
pairgrep 1.84/s      25%       --
```
In general, I urge you to work through the Perl documentation’s tutorials on references, lists of lists, the data structures cookbook, and the FAQs on array and hash manipulation. Then dip into the various list-processing modules (especially the included List::Util and CPAN’s List::SomeUtils) for ready-made functions for common operations. You’ll find a wealth of techniques for creating, managing, and processing the data structures that your programs need.
February 10, 2022
Better Perl: Four list processing best practices with map, grep, and more
Six months ago I gave an overview of Perl’s list processing fundamentals, briefly describing what lists are and then introducing the built-in map and grep functions for transforming and filtering them. Later on, I compiled a list (how appropriate) of list processing modules available via CPAN, noting there’s some confusing duplication of effort. But you’re a busy developer, and you just want to know the Right Thing To Do™ when faced with a list processing challenge.

First, some credit is due: these are all restatements of several Perl::Critic policies which in turn codify standards described in Damian Conway’s Perl Best Practices (2005). I’ve repeatedly recommended the latter as a starting point for higher-quality Perl development. Over the years these practices continue to be re-evaluated (including by the author himself) and various authors release new policy modules, but perlcritic remains a great tool for ensuring you (and your team or other contributors) maintain a consistent high standard in your code.

With that said, on to the recommendations!

Don’t use grep to check if any list elements match

It might sound weird to lead off by recommending not to use grep, but sometimes it’s not the right tool for the job. If you’ve got a list and want to determine if a condition matches any item in it, you might try:
```
if (grep { some_condition($_) } @my_list) {
    ... # don't do this!
}
```
Yes, this works because (in scalar context) grep returns the number of matches found, but it’s wasteful, checking every element of @my_list (which could be lengthy) before finally providing a result. Use the standard List::Util module’s any function, which immediately returns (“short-circuits”) on the first match:
```
use List::Util 1.33 qw(any);

if (any { some_condition($_) } @my_list) {
    ... # do something
}
```
Perl has included the requisite version of this module since version 5.20 in 2014; for earlier releases, you’ll need to update from CPAN. List::Util has many other great list-reduction, key/value pair, and other related functions you can import into your code, so check it out before you attempt to re-invent any wheels.
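
To make the short-circuiting visible, here is a small sketch that counts how many times the condition runs under each function. The some_condition body and the $checks counter are invented for the demonstration.

```perl
use v5.36;
use List::Util 1.33 qw(any);

my @my_list = ( 1 .. 1000 );
my $checks  = 0;

# Invented condition that also counts how often it is called
sub some_condition ($item) {
  $checks++;
  return $item == 3;
}

# grep tests every element before it can report a count
my $grep_found  = grep { some_condition($_) } @my_list;
my $grep_checks = $checks;

# any returns as soon as one element matches
$checks = 0;
my $any_found  = any { some_condition($_) } @my_list;
my $any_checks = $checks;

say "grep checked $grep_checks elements, any checked $any_checks";
```

With the match sitting at the third element, grep still walks all 1,000 items while any stops after three.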

As a side note for web developers, the Perl Dancer framework also includes an any keyword for declaring multiple HTTP routes, so if you’re mixing List::Util in there don’t import it. Instead, call it explicitly like this or you’ll get an error about a redefined function:
```
use List::Util 1.33;

if (List::Util::any { some_condition($_) } @my_list) {
    ... # do something
}
```
This recommendation is codified in the BuiltinFunctions::ProhibitBooleanGrep Perl::Critic policy, comes directly from Perl Best Practices, and is recommended by the Software Engineering Institute Computer Emergency Response Team (SEI CERT)’s Perl Coding Standard.

Don’t change $_ in map or grep

I mentioned this back in March, but it bears repeating: map and grep are intended as pure functions, not mutators with side effects. This means that the original list should remain unchanged. Yes, each element aliases in turn to the $_ special variable, but that’s for speed and can have surprising results if changed even if it’s technically allowed. If you need to modify an array in-place use something like:
```
for (@my_array) {
    $_ = ...; # make your changes here
}
```
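
For anyone who hasn’t been bitten by this yet, here is a short sketch of the surprise, using made-up numbers. The first map writes through the $_ alias and changes its source array, while the second returns new values and leaves its source alone.

```perl
use v5.36;

my @prices = ( 10, 20, 30 );

# Assigning to $_ writes through the alias into @prices
my @scaled = map { $_ *= 1.5 } @prices;    # don't do this

say "after mutating map: @prices";

my @original = ( 10, 20, 30 );

# Returning a new value leaves the source array untouched
my @doubled = map { $_ * 2 } @original;

say "after pure map:     @original";
```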
If you want something that looks like map but won’t change the original list (and don’t mind a few CPAN dependencies), consider List::SomeUtils’ apply function:
```
use List::SomeUtils qw(apply);

my @doubled_array = apply {$_ *= 2} @old_array;
```
Lastly, side effects also include things like manipulating other variables or doing input and output. Don’t use map or grep in a void context (i.e., without a resulting array or list); do something with the results or use a for or foreach loop:
```
map { print foo($_) } @my_array; # don't do this
print map { foo($_) } @my_array; # do this instead

map { push @new_array, foo($_) } @my_array; # don't do this
@new_array = map { foo($_) } @my_array;     # do this instead
```
This recommendation is codified by the BuiltinFunctions::ProhibitVoidGrep, BuiltinFunctions::ProhibitVoidMap, and ControlStructures::ProhibitMutatingListFunctions Perl::Critic policies. The latter comes from Perl Best Practices and is an SEI CERT Perl Coding Standard rule.

Use blocks with map and grep, not expressions

You can call map or grep like this (parentheses are optional around built-in functions):
```
my @new_array  = map foo($_), @old_array; # don't do this
my @new_array2 = grep !/^#/, @old_array;  # don't do this
```
Or like this:
```
my @new_array  = map { foo($_) } @old_array;
my @new_array2 = grep {!/^#/} @old_array;
```
Do it the second way. It’s easier to read, especially if you’re passing in a literal list or multiple arrays, and the expression forms can conceal bugs. This recommendation is codified by the BuiltinFunctions::RequireBlockGrep and BuiltinFunctions::RequireBlockMap Perl::Critic policies and comes from Perl Best Practices.

Refactor multi-statement maps, greps, and other list functions

map, grep, and friends should follow the Unix philosophy of “Do One Thing and Do It Well.” Your readability and maintainability drop with every statement you place inside one of their blocks. Consider junior developers and future maintainers (this includes you!) and refactor anything with more than one statement into a separate subroutine or at least a for loop. This goes for list processing functions (like the aforementioned any) imported from other modules, too.
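
As one possible shape for such a refactoring, here is a sketch that moves a three-step cleanup out of a map block and into a named subroutine. The tidy_title name and the sample strings are hypothetical.

```perl
use v5.36;

my @raw_titles = ( '  photo session: robert picardo ', 'dragon con MASHUPS' );

# The multi-step cleanup lives in one named, testable subroutine
sub tidy_title ($title) {
  $title =~ s/^\s+|\s+$//g;     # strip surrounding whitespace
  $title = lc $title;           # normalize case
  $title =~ s/\b(\w)/\u$1/g;    # capitalize each word
  return $title;
}

# The map block is back to a single expression
my @titles = map { tidy_title($_) } @raw_titles;

say for @titles;
```

Because the signature copies its argument, the subroutine also avoids changing the original list through the $_ alias.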

This recommendation is codified by the Perl Best Practices-inspired BuiltinFunctions::ProhibitComplexMappings and BuiltinFunctions::RequireSimpleSortBlock Perl::Critic policies, although those only cover map and sort functions, respectively.

Do you have any other suggestions for list processing best practices? Feel free to leave them in the comments or better yet, consider creating new Perl::Critic policies for them or contacting the Perl::Critic team to develop them for your organization.
October 26, 2021
A list of Perl list processing modules
As previously written, I like list processing. Many computing problems can be broken down into transforming and filtering lists, and Perl has got the fundamentals covered with functions like map, grep, and sort. There is so much more you might want to do, though, and CPAN has a plethora of list and array processing modules.

However, due to the vicissitudes of Perl module maintenance, we have a situation where it’s not clear at a glance where to turn when you’ve got a list that needs processing. So here’s another list: the list modules of CPAN. Click through to discover what functions they provide.
- We’ve got List::Util which has been released as part of Perl since version 5.7.3.
- We’ve got List::MoreUtils which has some functions which are named the same as Util but behave differently.
- We’ve got List::SomeUtils which duplicates MoreUtils but with fewer dependencies.
- We’ve got List::UtilsBy which MoreUtils has also cribbed some functions from.
- We’ve got List::AllUtils which attempts to consolidate Util, SomeUtils, and UtilsBy but has some exceptions to called modules because of the aforementioned duplication between Util and SomeUtils.
- We’ve got List::Util::MaybeXS which helps with pure Perl fallbacks in case your version of Util is too old to have a certain function.
- We’ve got List::MoreUtils::XS which provides (some?) faster versions of MoreUtils’ functions (but you still have to use MoreUtils).
- And lastly, we have Util::Any which lets you import functions from Util, MoreUtils, and just for good measure Scalar::Util, Hash::Util, String::Util, String::CamelCase, List::Pairwise, and Data::Dumper. But it hasn’t been updated since 2016, so it doesn’t necessarily export the functions added to those modules since then.
Am I missing anything? Probably! But these are the ones most associated with being upstream on the CPAN River, so they (or the modules they consolidate) have more projects depending on them.
May 18, 2021