young lady learning sign language during online lesson with female tutor

It’s been years since I’ve had to hack on anything XML-related, but a recent project at work has me once again jumping into the waters of generating, parsing, and modifying this 90s-​era document format. Most developers these days likely only know of it as part of the curiously-​named XMLHTTPRequest object in web browsers used to retrieve data in JSON format from servers, and as the X” in AJAX. But here we are in 2021, and there are still plenty of APIs and documents using XML to get their work done.

In my particular case, the task is to update the API calls for a new version of Virtuozzo Automator. Its API is a bit unusual in that it doesn’t use HTTP, but rather relies on opening a TLS-encrypted socket to the server and exchanging documents delimited with a null character. The previous version of our code is in 1990s-​sysadmin-​style Perl, with manual blessing of objects and parsing the XML using regular expressions. I’ve decided to update it to use the Moo object system and a proper XML parser. But which parser and module to use?

Selecting a parser

There are several generic XML modules for parsing and generating XML on CPAN, each with its own advantages and disadvantages. I’d like to say that I did a comprehensive survey of each of them, but this project is pressed for time (aren’t they all?) and I didn’t want to create too many extra dependencies in our Perl stack. Luckily, XML::LibXML is already available, I’ve had some previous experience with it, and it’s a good choice for performant standards-​based XML parsing (using either DOM or SAX) and generation.

Given more time and leeway in adding dependencies, I might use something else. If the Virtuozzo API had an XML Schema or used SOAP, I would consider XML::Compile as I’ve had some success with that in other projects. But even that uses XML::LibXML under the hood, so I’d still be using that. Your mileage may vary.

Generating XML

Depending on the size and complexity of the XML documents to generate, you might choose to build them up node by node using XML::LibXML::Node and XML::LibXML::Element objects. Most of the messages I’m sending to Virtuozzo Automator are short and have easily-​interpolated values, so I’m using here-​document islands of XML inside my Perl code. This also has the advantage of being easily validated against the examples in the documentation.

Where the interpolated values in the messages are a little complicated, I’m using this idiom inside the here-docs:

@{[ ... ]}

This allows me to put an arbitrary expression in the … part, which is then put into an anonymous array reference, which is then immediately dereferenced into its string result. It’s a cheap and cheerful way to do minimal templating inside Perl strings without loading a full templating library; I’ve also had success using this technique when generating SQL for database queries.

Parser as an object attribute

Rather than instantiate a new XML::LibXML in every method that needs to parse a document, I created a private attribute:

package Local::API::Virtozzo::Agent {
    use Moo;
    use XML::LibXML;
    use Types::Standard qw(InstanceOf);
    ...
    has _parser => (
        is      => 'ro',
        isa     => InstanceOf['XML::LibXML'],
        default => sub { XML::LibXML->new() },
    );
    sub foo {
        my $self = shift;
        my $send_doc = $self->_parser
          ->parse_string(<<"END_XML");
            <foo/>
END_XML
        ...
    }
...
}

Boilerplate

XML documents can be verbose, with elements that rarely change in every document. In the Virtuozzo API’s case, every document has a <packet> element containing a version attribute and an id attribute to match requests to responses. I wrote a simple function to wrap my documents in this element that pulled the version from a constant and always increased the id by one every time it’s called:

sub _wrap_packet {
    state $send_id = 1;
    return qq(<packet version="$PACKET_VERSION" id=")
      . $send_id++ . '">' . shift . '</packet>';
}

If I need to add more attributes to the <packet> element (for instance, namespaces for attributes in enclosed elements, I can always use XML::LibXML::Element::setAttribute after parsing the document string.

Parsing responses with XPath

Rather than using brittle regular expressions to extract data from the response, I use the shared parser object from above and then the full power of XPath:

use English;
...
sub get_sampleID {
    my ($self, $sample_name) = @_;
    ...
    # used to separate documents
    local $INPUT_RECORD_SEPARATOR = "\0";
    # $self->_sock is the IO::Socket::SSL connection
    my $get_doc = $self->_parser( parse_string(
      $self->_sock->getline(),
    ) );
    my $sample_id = $get_doc->findvalue(
        qq(//ns3:id[following-sibling::ns3:name="$sample_name"]),
    );
    return $sample_id;
}

This way, even if the order of elements change or more elements are introduced, the XPath patterns will continue to find the right data.

Conclusion… so far

I’m only about halfway through updating these API calls, and I’ve left out some non-​XML-​related details such as setting up the TLS socket connection. Hopefully this article has given you a taste of what’s involved in XML processing these days. Please leave me a comment if you have any suggestions or questions.