young lady learning sign language during online lesson with female tutor

It’s been years since I’ve had to hack on any­thing XML-relat­ed, but a recent project at work has me once again jump­ing into the waters of gen­er­at­ing, pars­ing, and mod­i­fy­ing this 90s-​era doc­u­ment for­mat. Most devel­op­ers these days like­ly only know of it as part of the curiously-​named XMLHTTPRequest object in web browsers used to retrieve data in JSON for­mat from servers, and as the X” in AJAX. But here we are in 2021, and there are still plen­ty of APIs and doc­u­ments using XML to get their work done.

In my par­tic­u­lar case, the task is to update the API calls for a new ver­sion of Virtuozzo Automator. Its API is a bit unusu­al in that it does­n’t use HTTP, but rather relies on open­ing a TLS-encrypt­ed sock­et to the serv­er and exchang­ing doc­u­ments delim­it­ed with a null char­ac­ter. The pre­vi­ous ver­sion of our code is in 1990s-​sysadmin-​style Perl, with man­u­al blessing of objects and pars­ing the XML using reg­u­lar expres­sions. I’ve decid­ed to update it to use the Moo object sys­tem and a prop­er XML pars­er. But which pars­er and mod­ule to use?

Selecting a parser

There are sev­er­al gener­ic XML mod­ules for pars­ing and gen­er­at­ing XML on CPAN, each with its own advan­tages and dis­ad­van­tages. I’d like to say that I did a com­pre­hen­sive sur­vey of each of them, but this project is pressed for time (aren’t they all?) and I did­n’t want to cre­ate too many extra depen­den­cies in our Perl stack. Luckily, XML::LibXML is already avail­able, I’ve had some pre­vi­ous expe­ri­ence with it, and it’s a good choice for per­for­mant standards-​based XML pars­ing (using either DOM or SAX) and generation.

Given more time and lee­way in adding depen­den­cies, I might use some­thing else. If the Virtuozzo API had an XML Schema or used SOAP, I would con­sid­er XML::Compile as I’ve had some suc­cess with that in oth­er projects. But even that uses XML::LibXML under the hood, so I’d still be using that. Your mileage may vary.

Generating XML

Depending on the size and com­plex­i­ty of the XML doc­u­ments to gen­er­ate, you might choose to build them up node by node using XML::LibXML::Node and XML::LibXML::Element objects. Most of the mes­sages I’m send­ing to Virtuozzo Automator are short and have easily-​interpolated val­ues, so I’m using here-​document islands of XML inside my Perl code. This also has the advan­tage of being eas­i­ly val­i­dat­ed against the exam­ples in the documentation.

Where the inter­po­lat­ed val­ues in the mes­sages are a lit­tle com­pli­cat­ed, I’m using this idiom inside the here-docs:

@{[ ... ]}

This allows me to put an arbi­trary expres­sion in the … part, which is then put into an anony­mous array ref­er­ence, which is then imme­di­ate­ly deref­er­enced into its string result. It’s a cheap and cheer­ful way to do min­i­mal tem­plat­ing inside Perl strings with­out load­ing a full tem­plat­ing library; I’ve also had suc­cess using this tech­nique when gen­er­at­ing SQL for data­base queries.

Parser as an object attribute

Rather than instan­ti­ate a new XML::LibXML in every method that needs to parse a doc­u­ment, I cre­at­ed a pri­vate attribute:

package Local::API::Virtozzo::Agent {
    use Moo;
    use XML::LibXML;
    use Types::Standard qw(InstanceOf);
    ...
    has _parser => (
        is      => 'ro',
        isa     => InstanceOf['XML::LibXML'],
        default => sub { XML::LibXML->new() },
    );
    sub foo {
        my $self = shift;
        my $send_doc = $self->_parser
          ->parse_string(<<"END_XML");
            <foo/>
END_XML
        ...
    }
...
}

Boilerplate

XML doc­u­ments can be ver­bose, with ele­ments that rarely change in every doc­u­ment. In the Virtuozzo API’s case, every doc­u­ment has a <packet> ele­ment con­tain­ing a version attribute and an id attribute to match requests to respons­es. I wrote a sim­ple func­tion to wrap my doc­u­ments in this ele­ment that pulled the ver­sion from a con­stant and always increased the id by one every time it’s called:

sub _wrap_packet {
    state $send_id = 1;
    return qq(<packet version="$PACKET_VERSION" id=")
      . $send_id++ . '">' . shift . '</packet>';
}

If I need to add more attrib­ut­es to the <packet> ele­ment (for instance, name­spaces for attrib­ut­es in enclosed ele­ments, I can always use XML::LibXML::Element::setAttribute after pars­ing the doc­u­ment string.

Parsing responses with XPath

Rather than using brit­tle reg­u­lar expres­sions to extract data from the response, I use the shared pars­er object from above and then the full pow­er of XPath:

use English;
...
sub get_sampleID {
    my ($self, $sample_name) = @_;
    ...
    # used to separate documents
    local $INPUT_RECORD_SEPARATOR = "\0";
    # $self->_sock is the IO::Socket::SSL connection
    my $get_doc = $self->_parser( parse_string(
      $self->_sock->getline(),
    ) );
    my $sample_id = $get_doc->findvalue(
        qq(//ns3:id[following-sibling::ns3:name="$sample_name"]),
    );
    return $sample_id;
}

This way, even if the order of ele­ments change or more ele­ments are intro­duced, the XPath pat­terns will con­tin­ue to find the right data.

Conclusion… so far

I’m only about halfway through updat­ing these API calls, and I’ve left out some non-​XML-​related details such as set­ting up the TLS sock­et con­nec­tion. Hopefully this arti­cle has giv­en you a taste of what’s involved in XML pro­cess­ing these days. Please leave me a com­ment if you have any sug­ges­tions or questions.