It’s been years since I’ve had to hack on anything XML-related, but a recent project at work has me once again jumping into the waters of generating, parsing, and modifying this 90s-era document format. Most developers these days likely know it only as part of the curiously named XMLHttpRequest object that web browsers use to retrieve data (usually JSON) from servers, and as the “X” in AJAX. But here we are in 2021, and there are still plenty of APIs and documents using XML to get their work done.

In my particular case, the task is to update the API calls for a new version of Virtuozzo Automator. Its API is a bit unusual in that it doesn’t use HTTP, but rather relies on opening a TLS-encrypted socket to the server and exchanging documents delimited with a null character. The previous version of our code is in 1990s-sysadmin-style Perl, with manual blessing of objects and parsing the XML using regular expressions. I’ve decided to update it to use the Moo object system and a proper XML parser. But which parser and module to use?

Selecting a parser

There are several generic modules on CPAN for parsing and generating XML, each with its own advantages and disadvantages. I’d like to say that I did a comprehensive survey of each of them, but this project is pressed for time (aren’t they all?) and I didn’t want to create too many extra dependencies in our Perl stack. Luckily, XML::LibXML is already available, I’ve had some previous experience with it, and it’s a good choice for performant standards-based XML parsing (using either DOM or SAX) and generation.

Given more time and leeway in adding dependencies, I might use something else. If the Virtuozzo API had an XML Schema or used SOAP, I would consider XML::Compile, as I’ve had some success with that in other projects. But even that uses XML::LibXML under the hood, so I’d still be depending on it. Your mileage may vary.

Generating XML

Depending on the size and complexity of the XML documents to generate, you might choose to build them up node by node using XML::LibXML::Node and XML::LibXML::Element objects. Most of the messages I’m sending to Virtuozzo Automator are short and have easily interpolated values, so I’m using here-document islands of XML inside my Perl code. This also has the advantage of being easily validated against the examples in the documentation.
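
For comparison, here is a minimal sketch of the node-by-node approach. The element names, attribute values, and text are made up for illustration and are not taken from the Virtuozzo documentation.

```perl
use strict;
use warnings;
use XML::LibXML;

# Create an empty document and a root element.
# "packet", "name", and the values below are placeholders only.
my $doc  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
my $root = $doc->createElement('packet');
$root->setAttribute( version => '1.0' );
$root->setAttribute( id      => 1 );
$doc->setDocumentElement($root);

# Add a child element with text content. Special characters in the
# text are escaped by the library when the node is serialized.
my $name = $doc->createElement('name');
$name->appendTextNode('R&D <test>');
$root->appendChild($name);

# Serialize just the root element, without the XML declaration.
my $xml = $root->toString();
print "$xml\n";
```

This is wordier than a here-document, but the library takes care of escaping, which matters once the values come from outside your own code.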

Where the interpolated values in the messages are a little complicated, I’m using this idiom inside the here-docs:

@{[ ... ]}

This allows me to put an arbitrary expression in the … part. The expression is evaluated in list context, its result is wrapped in an anonymous array reference, and that reference is immediately dereferenced and interpolated into the string. It’s a cheap and cheerful way to do minimal templating inside Perl strings without loading a full templating library; I’ve also had success using this technique when generating SQL for database queries.
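
A small sketch of the idiom in a here-document follows. The variables and the XML element names are invented for the example; only the `@{[ ... ]}` construct itself comes from the text above.

```perl
use strict;
use warnings;

# Hypothetical data to interpolate
my @ids = ( 3, 1, 2 );
my %opt = ( limit => 10 );

# Each @{[ ... ]} evaluates its expression in list context, so use
# scalar() when a count is wanted rather than the list itself.
my $xml = <<"END_XML";
<list count="@{[ scalar @ids ]}">
  <ids>@{[ join ',', sort { $a <=> $b } @ids ]}</ids>
  <limit>@{[ $opt{limit} * 2 ]}</limit>
</list>
END_XML

print $xml;
```

Note that nothing here escapes the interpolated values, so this only suits values you control or have already sanitized.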

Parser as an object attribute

Rather than instantiate a new XML::LibXML in every method that needs to parse a document, I created a private attribute:

package Local::API::Virtozzo::Agent {
    use Moo;
    use XML::LibXML;
    use Types::Standard qw(InstanceOf);
    ...
    has _parser => (
        is      => 'ro',
        isa     => InstanceOf['XML::LibXML'],
        default => sub { XML::LibXML->new() },
    );
    sub foo {
        my $self = shift;
        my $send_doc = $self->_parser
          ->parse_string(<<"END_XML");
            <foo/>
END_XML
        ...
    }
...
}

Boilerplate

XML documents can be verbose, with elements that rarely change from one document to the next. In the Virtuozzo API’s case, every document has a <packet> element containing a version attribute and an id attribute to match requests to responses. I wrote a simple function to wrap my documents in this element; it pulls the version from a constant and increases the id by one every time it’s called:

sub _wrap_packet {
    # `state` needs `use feature 'state'` (or `use v5.10` or later) in scope
    state $send_id = 1;
    return qq(<packet version="$PACKET_VERSION" id=")
      . $send_id++ . '">' . shift . '</packet>';
}

If I need to add more attributes to the <packet> element (for instance, namespaces for attributes in enclosed elements), I can always use XML::LibXML::Element::setAttribute after parsing the document string.
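
As a rough sketch of that follow-up step, the snippet below parses a wrapped packet and then modifies its root element. The `priority` attribute is hypothetical, and the XML Schema instance namespace is used only as a familiar example; I’m also showing `setNamespace` with a false third argument, which XML::LibXML provides for declaring a namespace on an element without moving the element into it.

```perl
use strict;
use warnings;
use XML::LibXML;

# A packet like the one _wrap_packet might produce (values are placeholders)
my $doc = XML::LibXML->new->parse_string(
    '<packet version="1.0" id="1"><foo/></packet>'
);
my $packet = $doc->documentElement;

# Add a plain attribute (the name and value are hypothetical)
$packet->setAttribute( priority => 'high' );

# Declare a namespace prefix for use by enclosed elements; the 0 means
# "declare only, do not make it the namespace of <packet> itself"
$packet->setNamespace(
    'http://www.w3.org/2001/XMLSchema-instance', 'xsi', 0,
);

print $packet->toString, "\n";
```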

Parsing responses with XPath

Rather than using brittle regular expressions to extract data from the response, I use the shared parser object from above and then the full power of XPath:

use English;
...
sub get_sampleID {
    my ($self, $sample_name) = @_;
    ...
    # used to separate documents
    local $INPUT_RECORD_SEPARATOR = "\0";
    # $self->_sock is the IO::Socket::SSL connection
    my $get_doc = $self->_parser->parse_string(
      $self->_sock->getline(),
    );
    my $sample_id = $get_doc->findvalue(
        qq(//ns3:id[following-sibling::ns3:name="$sample_name"]),
    );
    return $sample_id;
}

This way, even if the order of elements changes or more elements are introduced, the XPath patterns will continue to find the right data.
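
To try the XPath expression outside of the socket code, here is a self-contained sketch. The response document and the namespace URI `http://example.com/ns3` are invented; a real response will have its own structure and URIs. It also registers the `ns3` prefix explicitly through XML::LibXML::XPathContext, so the query does not depend on which prefix the server happens to use.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# A made-up response; the second sample has an extra element between
# <ns3:id> and <ns3:name> to show the query tolerates additions.
my $doc = XML::LibXML->new->parse_string(<<'END_XML');
<packet xmlns:ns3="http://example.com/ns3" version="1.0" id="1">
  <ns3:sample><ns3:id>42</ns3:id><ns3:name>alpha</ns3:name></ns3:sample>
  <ns3:sample><ns3:id>43</ns3:id><ns3:extra/><ns3:name>beta</ns3:name></ns3:sample>
</packet>
END_XML

# Bind the prefix used in the XPath expression to the namespace URI
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( ns3 => 'http://example.com/ns3' );

my $sample_name = 'beta';
my $sample_id   = $xpc->findvalue(
    qq(//ns3:id[following-sibling::ns3:name="$sample_name"]),
);

print "$sample_id\n";
```

Because the name is interpolated straight into the expression, a value containing a double quote would break the query, so it’s worth validating names that come from user input.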

Conclusion… so far

I’m only about halfway through updating these API calls, and I’ve left out some non-XML-related details such as setting up the TLS socket connection. Hopefully this article has given you a taste of what’s involved in XML processing these days. Please leave me a comment if you have any suggestions or questions.