Tag: lists

  • Scraping the Dragon with Perl and Mojolicious

    Every extend­ed Labor Day week­end, 80,000 fans of pop cul­ture descend on Atlanta for Dragon Con. It’s a sprawl­ing choose-​your-​own adven­ture of a con­ven­tion with 38 pro­gram­ming tracks and over 5,000 hours of events. It spans five down­town host hotels, and there is no way to see it all.

    Sadly, this year’s con is almost over. Still, I thought I’d share a lit­tle script I wrote to help me make sense of it all.

    The offi­cial mobile app is fine for search­ing and book­mark­ing events, speak­ers, and exhibitors. Nonetheless, it’s not suit­able for scan­ning the whole land­scape at once. I want­ed a sin­gle, scrol­lable view of every event, before I even packed my cosplay.

    Even in the app’s tablet ver­sion, the Dragon Con Events area is a scroll-fest.

    The web version of the app gave me exactly what I needed: predictable per-day URLs and semantically marked-up HTML. That meant I could skip the API hunt, skip the manual scrolling, and go straight to scraping.

    Inspecting the HTML reveals per-​day event URLs and per-​event <div> blocks.

    From Chaos to Clarity in 40 lines

    We’re about to turn a messy, multi-day, multi-hotel schedule into one clean, scroll-once list. This is the forty-odd-line Perl map that gets us there, aided by the Mojolicious web toolkit.

    Laying the Groundwork: Tools for the Job

    #!/usr/bin/env perl
    
    use v5.40;
    
    use Carp;
    use English;
    use Mojo::UserAgent;
    use Mojo::URL;
    use Mojo::DOM;
    use Mojo::Collection q(c);
    use Time::Piece;
    use HTML::HTML5::Entities;
    use Memoize;
    
    binmode STDOUT, ':encoding(UTF-8)'
      or croak "Couldn't encode STDOUT: $OS_ERROR";
    
    my $ua   = Mojo::UserAgent->new();
    my $site = Mojo::URL->new('https://app.core-apps.com');
    my $path = '/dragoncon25/events/view_by_day';

    What’s happening: Load the modules that will do the heavy lifting (HTTP fetches, DOM parsing, date handling, Unicode cleanup). Set STDOUT to UTF-8 so characters like curly quotes and em dashes don’t break the output. Point the script at the base schedule URL.

    Remembering the Days Without Re-Parsing

    my $date_from_dom = memoize( sub ($dom) {
      return content_at( $dom, 'div.section_header[class~="alt"]' );
    } );

    What’s happening: Create a memoized helper that plucks the date from a day’s HTML and caches it. That way, if we need it again, we skip the repeated selector lookup and keep the pipeline fast.

    content_at is a helper func­tion I define later.
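
    To see the caching in isolation, here is a minimal sketch that uses only the core Memoize module. The $calls counter, the $shout helper, and the sample date string are made up for illustration and are not part of the scraper.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;    # hypothetical counter: how often the real work runs

# memoize() accepts a code reference and returns a caching wrapper around it
my $shout = memoize( sub {
  my ($text) = @_;
  $calls++;
  return uc $text;
} );

# Same argument twice, both times in scalar context
my $first  = $shout->('friday, aug 29');
my $second = $shout->('friday, aug 29');

print "$first / $second (real work done $calls time)\n";
```

    By default Memoize keys its cache on the stringified arguments, which is worth keeping in mind when the argument is an object rather than a plain string.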

    Starting Where the App Starts

    my $today_dom = Mojo::DOM->new( $ua->get("$site$path")->result->text );

    What’s happening: Fetch the “today” view, the same default the app shows. This gives us a known starting point for building the full timeline.

    Collecting the Whole Timeline

    my $day_doms = c(
      $today_dom,
      $today_dom->find(qq(div.filter_box-days > a[href^="$path?day="]))
        ->map( \&dom_from_anchor )
        ->to_array->@*,
    )->sort( sub { day_epoch($a) <=> day_epoch($b) } );

    What’s hap­pen­ing: Grab every day link from the fil­ter bar, fetch each day’s HTML, and sort them chrono­log­i­cal­ly. Now we’ve got the entire con’s sched­ule in mem­o­ry, ready to process.

    dom_from_anchor and day_epoch are two more helper func­tions explained fur­ther down.

    Turning HTML into a Human-​Readable Schedule

    $day_doms->each( sub {    # process each day's events
      my $date = $date_from_dom->($_);
    
      $_->find('a.bookmark[data-type="events"] + a.object_link')
        ->each( sub {         # output start time + title
    
          my $time    = content_at( $_, 'div.line[class~="two"]' );
          my $title   = content_at( $_, 'div.line[class~="one"]' );
          my ($start) = split /\s*\p{Dash_Punctuation}/, $time;
    
          say "$date $start: ", decode_entities($title);
        } );
    } );

    What’s hap­pen­ing: For each day, find every event link and pull out the start time and title. Split the time clean­ly on any dash and decode HTML enti­ties so the out­put reads like a real schedule.
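
    The dash-splitting step can be tried on its own. This is a small sketch with made-up time ranges (the site's real strings may look different); it only shows that one pattern handles several dash characters.

```perl
use strict;
use warnings;

# Hypothetical time ranges, each using a different dash character
my @ranges = (
  "11:30 AM - 12:30 PM",         # hyphen-minus
  "1:00 PM \x{2013} 2:00 PM",    # en dash
  "8:30 PM\x{2014}9:30 PM",      # em dash, no surrounding spaces
);

my @starts;
for my $range (@ranges) {
  # Split on optional whitespace followed by any dash; keep only the first field
  my ($start) = split /\s*\p{Dash_Punctuation}/, $range;
  push @starts, $start;
}

print "$_\n" for @starts;
```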

    The Little Routines That Make It All Work

    sub dom_from_anchor ($dom) {    # fetch DOM for a day link
      return Mojo::DOM->new(
        $ua->get( Mojo::URL->new( $dom->attr('href') )->to_abs($site) )
          ->result->text );
    }
    
    sub day_epoch ($dom) {    # parse date into epoch
      return Time::Piece->strptime( $date_from_dom->($dom), '%A, %b %e' )
        ->epoch;
    }
    
    # extract and trim text from selector
    sub content_at ( $dom, @args ) { return trim $dom->at(@args)->content }

    What’s hap­pen­ing:

    1. dom_from_anchor: fetches and parses a linked day’s HTML.
    2. day_epoch: turns a date string into a sortable epoch.
    3. content_at: extracts and trims text from a DOM fragment, given a CSS selector.

    These helpers keep the main flow readable and are easy to reuse.
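
    The date handling in day_epoch can be checked without touching the network. The sketch below assumes headers in the same “Weekday, Mon D” shape as the script's output; the sample strings are invented.

```perl
use strict;
use warnings;
use Time::Piece;

# Made-up day headers, deliberately out of order
my @headers = ( 'Sunday, Aug 31', 'Friday, Aug 29', 'Monday, Sep 1' );

# The strings carry no year, so every date is parsed against the same
# default year; that is enough to order the days of a single convention
my @sorted = sort {
  Time::Piece->strptime( $a, '%A, %b %e' )->epoch
    <=> Time::Piece->strptime( $b, '%A, %b %e' )->epoch
} @headers;

print "$_\n" for @sorted;
```

    Because the year is missing, this ordering would break for an event that spans New Year's Day, which is not a concern for a Labor Day weekend.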

    The Schedule, Unlocked

    Run the script and you get a clean, UTF-8-safe list of every event, in chronological order, across all days. No swiping around, no tapping, no “what did I miss?” anxiety. (Ha, who am I kidding? There’s too much going on at Dragon Con to not end up missing something.)

    An example run of the script in my terminal. Each line is “Day, Date Time: Event Title”, sorted chronologically across the whole con.

    And here’s just a small slice of the 2,500+ lines it produces:

    Sunday, Aug 31 11:30 AM: Unmasking Sherlock: Beyond the Many Faces
    Sunday, Aug 31 11:30 AM: Weaponization of the FCC and Other Agencies to Chill Speech
    Sunday, Aug 31 11:30 AM: Where Physics Gets Weird
    . . .
    Sunday, Aug 31 11:50 AM: Photo Session: Amelia Tyler
    Sunday, Aug 31 11:50 AM: Photo Session: Cissy Jones
    Sunday, Aug 31 11:50 AM: Photo Session: Emma Gregory
    . . .
    Sunday, Aug 31 12:00 PM: Dragon Con Mashups
    Sunday, Aug 31 12:00 PM: James J. Butcher and R.R. Virdi signing at The Missing Volume booth# 1300
    Sunday, Aug 31 12:00 PM: JoeDan Worley and Eric Dontigney signing at the Shadow Alley Press Booth# 2
    . . .
    Sunday, Aug 31 12:00 PM: Photo Session: Robert Duncan McNeill
    Sunday, Aug 31 12:00 PM: Photo Session: Robert Picardo
    Sunday, Aug 31 12:00 PM: Photo Session: Tamara Taylor

    Key Techniques

    Here’s the fun part–the tech­niques that make this tidy, scroll-​once list possible.

    CSS selectors for precision

    I used a.bookmark[data-type="events"] + a.object_link to grab only the event title links, and div.line[class~="two"] / div.line[class~="one"] for time and title, respectively. This avoids scraping unrelated elements.
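
    Here is a rough sketch of those selectors at work on a tiny inline document. The markup is invented to imitate the structure described above (the real page differs in detail), and the example needs Mojolicious installed.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Invented markup: one event entry and one exhibitor entry
my $dom = Mojo::DOM->new(<<'HTML');
<div class="list">
  <a class="bookmark" data-type="events" href="#"></a>
  <a class="object_link" href="/event/1">
    <div class="line one">Where Physics Gets Weird</div>
    <div class="line two">11:30 AM - 12:30 PM</div>
  </a>
  <a class="bookmark" data-type="exhibitors" href="#"></a>
  <a class="object_link" href="/exhibitor/9">
    <div class="line one">Some Exhibitor</div>
    <div class="line two">Booth 1300</div>
  </a>
</div>
HTML

# The adjacent-sibling combinator (+) keeps only links that directly
# follow an events bookmark, so the exhibitor link is skipped
my @events = $dom->find('a.bookmark[data-type="events"] + a.object_link')
  ->map( sub {
    my $title = $_->at('div.line[class~="one"]')->text;
    my $time  = $_->at('div.line[class~="two"]')->text;
    return "$time: $title";
  } )->each;

print "$_\n" for @events;
```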

    Memoization for efficiency

    memoize caches the date string for each day’s DOM so I did­n’t end up re-​parsing the HTML frag­ment mul­ti­ple times.

    Unicode-​safe splitting

    \p{Dash_Punctuation} match­es any dash type (em, en, hyphen-​minus, etc.), so I could split times reli­ably with­out wor­ry­ing about which dash the site uses.

    Functional chaining

    Mojo::Collection’s map, sort, and each methods let me express the scrape→transform→output pipeline in a linear, readable way.
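
    A stripped-down sketch of the same chaining style, using stand-in data instead of per-day DOM objects (the pairs below are hypothetical). It needs Mojolicious installed.

```perl
use strict;
use warnings;
use Mojo::Collection 'c';

# Stand-in data: [sort key, label] pairs, deliberately out of order
my $days = c( [ 3, 'Sunday' ], [ 1, 'Friday' ], [ 2, 'Saturday' ] );

my $ordered = $days
  ->sort( sub { $a->[0] <=> $b->[0] } )    # order by the numeric key
  ->map( sub { $_->[1] } );                # keep only the label

$ordered->each( sub { print "$_\n" } );    # output step
```

    Each method returns a new collection, so the steps read top to bottom in the order they run.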

    Entity decoding at output

    HTML::HTML5::Entities’ decode_entities is applied right before printing, so HTML entities like &amp; or &quot; are human-readable in the final output.

    A Pattern You Can Take Anywhere

    The same approach that tamed Dragon Con’s chaos works any­where you’ve got:

    • Predictable URLs–so you can iter­ate with­out guesswork
    • Consistent HTML structure–so your selec­tors stay stable
    • A need to see every­thing at once–so you can make deci­sions with­out pag­ing or filtering

    From fan con­ven­tions to con­fer­ence sched­ules, from local sports fix­tures to film fes­ti­val line‑ups–the same pat­tern applies. Sometimes the right tool isn’t a sprawl­ing frame­work or heavy­weight API client. It’s a forty‑odd‑line Perl script that does one thing with ruth­less clarity.

    Because once you’ve tamed a sched­ule like this, the only lines you’ll stand in are the ones that feel like part of the show.

  • Perl lightning talk: “Don’t Fear map and grep”

    This week’s Perl and Raku Conference 2022 in Houston was packed with great pre­sen­ta­tions, and I humbly added to them with a five-​ish minute light­ning talk on two of Perl’s more mis­un­der­stood func­tions: map and grep.

    Sorry about the “um”s and “ah”s…

    I’ve writ­ten much about list pro­cess­ing in Perl, and this talk was based on the fol­low­ing blog posts:

    Overall I loved attend­ing the con­fer­ence, and it real­ly invig­o­rat­ed my par­tic­i­pa­tion in the Perl com­mu­ni­ty. Stay tuned as I resume reg­u­lar posting!

    Update for Raku

    On Twitter I nudged promi­nent Raku hack­er (and recov­ered Perl hack­er) Elizabeth Mattijsen to write about the Raku pro­gram­ming language’s map and grep func­tion­al­i­ty. Check out her five-​part series on DEV.to.

  • Perl list processing is for hashes, too

    This month I start­ed a new job at Alert Logic, a cyber­se­cu­ri­ty provider with Perl (among many oth­er things) at its beat­ing heart. I’ve been learn­ing a lot, and part of the process has been under­stand­ing the APIs in the code base. To that end, I’ve been writ­ing small test scripts to tease apart data struc­tures, using Perl array-​processing, list-​processing, and hash- (i.e., asso­cia­tive array)-processing func­tions.

    I’ve cov­ered map, grep, and friends a cou­ple times before. Most recent­ly, I described using List::Util’s any func­tion to check if a con­di­tion is true for any item in a list. In the sim­plest case, you can use it to check to see if a giv­en val­ue is in the list at all:

    use feature 'say';
    use List::Util 'any';
    my @colors =
      qw(red orange yellow green blue indigo violet);
    say 'matched' if any { /^red$/ } @colors;

    However, if you’re going to be doing this a lot with arbi­trary strings, Perl FAQ sec­tion 4 advis­es turn­ing the array into the keys of a hash and then check­ing for mem­ber­ship there. For exam­ple, here’s a sim­ple script to check if the col­ors input (either from the key­board or from files passed as argu­ments) are in the rainbow:

    #!/usr/bin/env perl
    
    use v5.22; # introduced <<>> for safe opening of arguments
    use warnings;
     
    my %in_colors = map {$_ => 1}
      qw(red orange yellow green blue indigo violet);
    
    while (<<>>) {
      chomp;
      say "$_ is in the rainbow" if $in_colors{$_};
    }

    List::Util has a bunch of func­tions for pro­cess­ing lists of pairs that I’ve found use­ful when paw­ing through hash­es. pairgrep, for exam­ple, acts just like grep but instead assigns $a and $b to each key and val­ue passed in and returns the result­ing pairs that match. I’ve used it as a quick way to search for hash entries match­ing cer­tain val­ue conditions:

    use List::Util 'pairgrep';
    my %numbers = (zero => 0, one => 1, two => 2, three => 3);
    my %odds = pairgrep {$b % 2} %numbers;

    Sure, you could do this by invok­ing a mix of plain grep, keys, and a hash slice, but it’s nois­i­er and more repetitive:

    use v5.20; # for key/value hash slice 
    my %odds = %numbers{grep {$numbers{$_} % 2} keys %numbers};

    pairgrep’s compiled C-based XS code can also be faster, as evidenced by this Benchmark script that works through a hash made of the Unix words file (479,828 entries on my machine):

    #!/usr/bin/env perl
    
    use v5.20;
    use warnings;
    use List::Util 'pairgrep';
    use Benchmark 'cmpthese';
    
    my (%words, $count);
    open my $fh, '<', '/usr/share/dict/words'
      or die "can't open words: $!";
    while (<$fh>) {
      chomp;
      $words{$_} = $count++;
    }
    close $fh;
    
    cmpthese(100, {
      grep => sub {
        my %odds = %words{grep {$words{$_} % 2} keys %words};
      },
      pairgrep => sub {
        my %odds = pairgrep {$b % 2} %words;
      },
    } );

    Benchmark out­put:

               Rate     grep pairgrep
    grep     1.47/s       --     -20%
    pairgrep 1.84/s      25%       --
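
    pairgrep has siblings in List::Util that follow the same $a/$b convention. The following is a small sketch on the earlier %numbers hash; the variable names are only for illustration.

```perl
use strict;
use warnings;
use List::Util qw(pairgrep pairkeys pairmap);

my %numbers = ( zero => 0, one => 1, two => 2, three => 3 );

# Keep only the pairs whose value is odd
my %odds = pairgrep { $b % 2 } %numbers;

# pairkeys returns just the keys of a pair list
my @odd_names = sort { $a cmp $b } pairkeys %odds;

# pairmap transforms each pair; here it squares every value
my %squares = pairmap { ( $a => $b ** 2 ) } %numbers;

print "odd names: @odd_names\n";
print "three squared: $squares{three}\n";
```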

    In general, I urge you to work through the Perl documentation’s tutorials on references, lists of lists, the data structures cookbook, and the FAQs on array and hash manipulation. Then dip into the various list-processing modules (especially the included List::Util and CPAN’s List::SomeUtils) for ready-made functions for common operations. You’ll find a wealth of techniques for creating, managing, and processing the data structures that your programs need.

  • Better Perl: Four list processing best practices with map, grep, and more

    Six months ago I gave an overview of Perl’s list pro­cess­ing fun­da­men­tals, briefly describ­ing what lists are and then intro­duc­ing the built-​in map and grep func­tions for trans­form­ing and fil­ter­ing them. Later on, I com­piled a list (how appro­pri­ate) of list pro­cess­ing mod­ules avail­able via CPAN, not­ing there’s some con­fus­ing dupli­ca­tion of effort. But you’re a busy devel­op­er, and you just want to know the Right Thing To Do™ when faced with a list pro­cess­ing challenge.

    First, some cred­it is due: these are all restate­ments of sev­er­al Perl::Critic poli­cies which in turn cod­i­fy stan­dards described in Damian Conway’s Perl Best Practices (2005). I’ve repeat­ed­ly rec­om­mend­ed the lat­ter as a start­ing point for higher-​quality Perl devel­op­ment. Over the years these prac­tices con­tin­ue to be re-​evaluated (includ­ing by the author him­self) and var­i­ous authors release new pol­i­cy mod­ules, but perlcritic remains a great tool for ensur­ing you (and your team or oth­er con­trib­u­tors) main­tain a con­sis­tent high stan­dard in your code.

    With that said, on to the recommendations!

    Don’t use grep to check if any list elements match

    It might sound weird to lead off by rec­om­mend­ing not to use grep, but some­times it’s not the right tool for the job. If you’ve got a list and want to deter­mine if a con­di­tion match­es any item in it, you might try:

    if (grep { some_condition($_) } @my_list) {
        ... # don't do this!
    }

    Yes, this works because (in scalar con­text) grep returns the num­ber of match­es found, but it’s waste­ful, check­ing every ele­ment of @my_list (which could be lengthy) before final­ly pro­vid­ing a result. Use the stan­dard List::Util module’s any func­tion, which imme­di­ate­ly returns (“short-​circuits”) on the first match:

    use List::Util 1.33 qw(any);

    if (any { some_condition($_) } @my_list) {
        ... # do something
    }

    Perl has includ­ed the req­ui­site ver­sion of this mod­ule since ver­sion 5.20 in 2014; for ear­li­er releas­es, you’ll need to update from CPAN. List::Util has many oth­er great list-​reduction, key/​value pair, and oth­er relat­ed func­tions you can import into your code, so check it out before you attempt to re-​invent any wheels.
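
    The difference is easy to measure. In this sketch, is_five is a hypothetical condition that counts how many elements it is asked about.

```perl
use strict;
use warnings;
use List::Util 1.33 qw(any);

my @list   = ( 1 .. 1000 );
my $checks = 0;

# Hypothetical condition that records every element it inspects
sub is_five { my ($n) = @_; $checks++; return $n == 5 }

$checks = 0;
my $found_with_grep = grep { is_five($_) } @list;    # scalar context: match count
my $grep_checks     = $checks;

$checks = 0;
my $found_with_any = any { is_five($_) } @list;      # stops at the first match
my $any_checks     = $checks;

print "grep inspected $grep_checks elements; any inspected $any_checks\n";
```

    With the match at position five, grep still walks all 1,000 elements while any stops after five.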

    As a side note for web devel­op­ers, the Perl Dancer frame­work also includes an any key­word for declar­ing mul­ti­ple HTTP routes, so if you’re mix­ing List::Util in there don’t import it. Instead, call it explic­it­ly like this or you’ll get an error about a rede­fined function:

    use List::Util 1.33;

    if (List::Util::any { some_condition($_) } @my_list) {
        ... # do something
    }

    This rec­om­men­da­tion is cod­i­fied in the BuiltinFunctions::ProhibitBooleanGrep Perl::Critic pol­i­cy, comes direct­ly from Perl Best Practices, and is rec­om­mend­ed by the Software Engineering Institute Computer Emergency Response Team (SEI CERT)’s Perl Coding Standard.

    Don’t change $_ in map or grep

    I men­tioned this back in March, but it bears repeat­ing: map and grep are intend­ed as pure func­tions, not muta­tors with side effects. This means that the orig­i­nal list should remain unchanged. Yes, each ele­ment alias­es in turn to the $_ spe­cial vari­able, but that’s for speed and can have sur­pris­ing results if changed even if it’s tech­ni­cal­ly allowed. If you need to mod­i­fy an array in-​place use some­thing like:

    for (@my_array) {
        $_ = ...; # make your changes here
    }

    If you want something that looks like map but won’t change the original list (and don’t mind a few CPAN dependencies), consider List::SomeUtils’ apply function:

    use List::SomeUtils qw(apply);
    
    my @doubled_array = apply {$_ *= 2} @old_array;

    Lastly, side effects also include things like manip­u­lat­ing oth­er vari­ables or doing input and out­put. Don’t use map or grep in a void con­text (i.e., with­out a result­ing array or list); do some­thing with the results or use a for or foreach loop:

    map { print foo($_) } @my_array; # don't do this
    print map { foo($_) } @my_array; # do this instead

    map { push @new_array, foo($_) } @my_array; # don't do this
    @new_array = map { foo($_) } @my_array; # do this instead

    This rec­om­men­da­tion is cod­i­fied by the BuiltinFunctions::ProhibitVoidGrep, BuiltinFunctions::ProhibitVoidMap, and ControlStructures::ProhibitMutatingListFunctions Perl::Critic poli­cies. The lat­ter comes from Perl Best Practices and is an SEI CERT Perl Coding Standard rule.
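
    To make the aliasing surprise concrete, here is a short sketch with made-up data that contrasts a mutating map with a side-effect-free one.

```perl
use strict;
use warnings;

my @prices = ( 10, 20, 30 );

# Anti-pattern: $_ aliases each element, so assigning to it rewrites @prices too
my @doubled = map { $_ *= 2 } @prices;
print "source after mutating map: @prices\n";

# Side-effect-free version: compute a new value and leave the source alone
my @original = ( 10, 20, 30 );
my @safe     = map { $_ * 2 } @original;
print "source after pure map: @original\n";
```

    Both calls return the same doubled list; only the first one silently changes its input.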

    Use blocks with map and grep, not expressions

    You can call map or grep like this (paren­the­ses are option­al around built-​in functions):

    my @new_array  = map foo($_), @old_array; # don't do this
    my @new_array2 = grep !/^#/, @old_array; # don't do this

    Or like this:

    my @new_array  = map { foo($_) } @old_array;
    my @new_array2 = grep {!/^#/} @old_array;

    Do it the sec­ond way. It’s eas­i­er to read, espe­cial­ly if you’re pass­ing in a lit­er­al list or mul­ti­ple arrays, and the expres­sion forms can con­ceal bugs. This rec­om­men­da­tion is cod­i­fied by the BuiltinFunctions::RequireBlockGrep and BuiltinFunctions::RequireBlockMap Perl::Critic poli­cies and comes from Perl Best Practices.

    Refactor multi-​statement maps, greps, and other list functions

    map, grep, and friends should follow the Unix philosophy of “Do One Thing and Do It Well.” Your readability and maintainability drop with every statement you place inside one of their blocks. Consider junior developers and future maintainers (this includes you!) and refactor anything with more than one statement into a separate subroutine or at least a for loop. This goes for list processing functions (like the aforementioned any) imported from other modules, too.

    This rec­om­men­da­tion is cod­i­fied by the Perl Best Practices-inspired BuiltinFunctions::ProhibitComplexMappings and BuiltinFunctions::RequireSimpleSortBlock Perl::Critic poli­cies, although those only cov­er map and sort func­tions, respectively.
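
    As one possible shape for such a refactoring, the sketch below moves a crowded map block into two small subroutines. The names normalize and is_content and the sample lines are hypothetical.

```perl
use strict;
use warnings;

my @lines = ( '  Alpha  ', '# comment', '', 'beta' );

# Before: trimming, filtering, and lowercasing crammed into one block
my @crowded = map {
  my $line = $_;
  $line =~ s/^\s+|\s+$//g;
  length($line) && $line !~ /^#/ ? lc($line) : ();
} @lines;

# After: each step gets a name and each block holds a single statement
sub normalize {
  my ($line) = @_;
  $line =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
  return lc $line;
}

sub is_content {
  my ($line) = @_;
  return length($line) && $line !~ /^#/;    # skip blanks and comments
}

my @clean = grep { is_content($_) } map { normalize($_) } @lines;

print "@clean\n";
```

    The second version reads right to left as “normalize, then keep the content lines,” and each helper can be tested on its own.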


    Do you have any oth­er sug­ges­tions for list pro­cess­ing best prac­tices? Feel free to leave them in the com­ments or bet­ter yet, con­sid­er cre­at­ing new Perl::Critic poli­cies for them or con­tact­ing the Perl::Critic team to devel­op them for your organization.

  • A list of Perl list processing modules

    As pre­vi­ous­ly writ­ten, I like list pro­cess­ing. Many com­put­ing prob­lems can be bro­ken down into trans­form­ing and fil­ter­ing lists, and Perl has got the fun­da­men­tals cov­ered with func­tions like map, grep, and sort. There is so much more you might want to do, though, and CPAN has a pletho­ra of list and array pro­cess­ing modules.

    However, due to the vicis­si­tudes of Perl mod­ule main­te­nance, we have a sit­u­a­tion where it’s not clear at a glance where to turn when you’ve got a list that needs pro­cess­ing. So here’s anoth­er list: the list mod­ules of CPAN. Click through to dis­cov­er what func­tions they provide.

    • We’ve got List::Util which has been released as part of Perl since ver­sion 5.7.3.
    • We’ve got List::MoreUtils which has some func­tions which are named the same as Util but behave differently.
    • We’ve got List::SomeUtils which dupli­cates MoreUtils but with few­er dependencies.
    • We’ve got List::UtilsBy which MoreUtils has also cribbed some func­tions from.
    • We’ve got List::AllUtils which attempts to consolidate Util, SomeUtils, and UtilsBy but has some exceptions to called modules because of the aforementioned duplication between Util and SomeUtils.
    • We’ve got List::Util::MaybeXS which helps with pure Perl fall­backs in case your ver­sion of Util is too old to have a cer­tain function.
    • We’ve got List::MoreUtils::XS which pro­vides (some?) faster ver­sions of MoreUtils’ func­tions (but you still have to use MoreUtils).
    • And last­ly, we have Util::Any which lets you import func­tions from Util, MoreUtils, and just for good mea­sure Scalar::Util, Hash::Util, String::Util, String::CamelCase, List::Pairwise, and Data::Dumper. But it has­n’t been updat­ed since 2016, so it does­n’t nec­es­sar­i­ly export the func­tions added to those mod­ules since then.

    Am I miss­ing any­thing? Probably! But these are the ones most asso­ci­at­ed with being upstream on the CPAN River, so they (or the mod­ules they con­sol­i­date) have more projects depend­ing on them.