dxml.util

This module contains helper functions which aren't specific to the parser, the DOM, or the writer.

Symbol	Description
decodeXML	Takes a range of characters, strips carriage returns from it, and converts both character references and the predefined entity references in the range into the characters that they refer to.
asDecodedXML	The version of decodeXML that returns a lazy range.
parseCharRef	Parses a character reference from the front of a range of characters.
parseStdEntityRef	Parses one of the predefined entity references from the start of a range of characters.
stripIndent	Removes the indent from the front of each line of a range of characters that was XML text which was formatted for human-readability.
withoutIndent	The version of stripIndent that returns a lazy range.
StdEntityRef	Enum containing the string representations of the five predefined entity references.
encodeText	Encodes characters which cannot appear in EntityType.text in their literal form.
encodeAttr	Encodes characters which cannot appear in the attribute value of an element start tag in their literal form.
encodeCharRef	Encodes a character as a character reference.

License: Boost License 1.0.

Authors: Jonathan M Davis

Source: https://github.com/jmdavis/dxml/blob/v0.3.0/source/dxml/util.d

string decodeXML(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));

auto asDecodedXML(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));

Decodes any XML character references and standard XML entity references in the text as well as removing any carriage returns. It's intended to be used on the text fields of element tags and on the values of start tag attributes.

There are a number of characters that either can't be directly represented in the text fields or attribute values in XML or which can sometimes be directly represented but not always (e.g. an attribute value can contain either a single quote or a double quote, but it can't contain both at the same time, because one of them would match the opening quote). So, those characters have alternate representations in order to be allowed (e.g. "&lt;" for '<', because '<' would normally be the beginning of an entity). Technically, they're entity references, but the ones handled by decodeXML are the ones explicitly defined in the XML standard and which don't require a DTD section.

Ideally, the parser would transform all such alternate representations to what they represent when providing the text to the application, but that would make it impossible to return slices of the original text from the properties of an Entity. So, instead of having those properties do the transformation themselves, decodeXML and asDecodedXML do that so that the application can choose to do it or not (in many cases, there is nothing to decode, making the calls unnecessary).

Similarly, an application can choose to encode a character as a character reference (e.g. "&#65;" or "&#x41;" for 'A'). decodeXML will decode such character references to their corresponding characters.

However, decodeXML does not handle any entity references beyond the five predefined ones listed below. All others are left unprocessed. Processing them properly would require handling the DTD section, which dxml does not support. The parser considers any entity references other than the predefined ones to be invalid XML, so text that comes from dxml's parser cannot contain any entity references other than the predefined ones; only text from some other source can. Similarly, invalid character references are left unprocessed, as is any character that is not valid in an XML document. decodeXML never throws on invalid XML.

Also, '\r' is not supposed to appear in an XML document except as a character reference unless it's in a CDATA section. So, it really should be stripped out before being handed off to the application, but again, that doesn't work with slices. So, decodeXML also handles that.

Specifically, what decodeXML and asDecodedXML do is

convert &amp; to &

convert &gt; to >

convert &lt; to <

convert &apos; to '

convert &quot; to "

remove all instances of \r

convert all character references (e.g. &#xA;) to the characters that they represent

All other entity references are left untouched, and any '&' which is not used in one of the constructs listed in the table as well as any malformed constructs (e.g. "&Amp;" or "&#xGGA2;") are left untouched.
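As a rough illustration of the rules above, the following sketch restates them as assertions. It assumes dxml is available as a dependency; the function name decodeRulesDemo is hypothetical and not part of dxml.

```d
import dxml.util : decodeXML;

// Hypothetical demo function; the name is not part of dxml.
void decodeRulesDemo()
{
    // Decimal and hexadecimal character references both decode.
    assert(decodeXML("&#65;&#x41;") == "AA");

    // The five predefined entity references decode.
    assert(decodeXML("&lt;tag attr=&quot;1&quot;&gt;") == `<tag attr="1">`);

    // Unknown entity references, malformed constructs, and a stray '&'
    // are left exactly as they were.
    assert(decodeXML("&nbsp; &Amp; &#xGGA2; &") == "&nbsp; &Amp; &#xGGA2; &");

    // Carriage returns are removed.
    assert(decodeXML("a\r\nb") == "a\nb");
}
```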

The difference between decodeXML and asDecodedXML is that decodeXML returns a string, whereas asDecodedXML returns a lazy range of code units. In the case where a string is passed to decodeXML, it will simply return the original string if there is no text to decode (whereas in other cases, decodeXML and asDecodedXML are forced to return new ranges even if there is no text to decode).
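The distinction can be sketched as follows. This is a minimal illustration that relies only on the behavior described in the paragraph above; the function name eagerVersusLazyDemo is hypothetical.

```d
import dxml.util : asDecodedXML, decodeXML;
import std.algorithm.comparison : equal;

// Hypothetical demo function; the name is not part of dxml.
void eagerVersusLazyDemo()
{
    // Nothing to decode: per the description above, a string argument is
    // returned as-is rather than being copied.
    string plain = "nothing to decode here";
    assert(decodeXML(plain) is plain);

    // Something to decode: the result is a different string.
    string encoded = "a &lt; b";
    assert(decodeXML(encoded) == "a < b");

    // The lazy version also accepts other character types, such as a
    // wstring, and does its work as the range is iterated.
    auto lazyRange = asDecodedXML("a &lt; b"w);
    assert(equal(lazyRange, "a < b"));

    // decodeXML always yields a string, even for wstring input.
    static assert(is(typeof(decodeXML("a &lt; b"w)) == string));
}
```

Which of the two is preferable presumably depends on whether the decoded text needs to be stored (decodeXML) or only iterated once (asDecodedXML).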

Parameters:

R range The range of characters to decode.

Returns: The decoded text. decodeXML returns a string, whereas asDecodedXML returns a lazy range of code units (so it could be a range of char or wchar and not just dchar; which it is depends on the code units of the range being passed in).

See Also: http://www.w3.org/TR/REC-xml/#dt-chardata
parseStdEntityRef
parseCharRef
dxml.parser.EntityRange.Entity.attributes
dxml.parser.EntityRange.Entity.text
encodeAttr
encodeText

Examples:

assert(decodeXML("hello world &amp;&gt;&lt;&apos;&quot; \r\r\r\r\r foo") ==
       `hello world &><'"  foo`);

assert(decodeXML("if(foo &amp;&amp; bar)\r\n" ~
                 "    left = right;") ==
       "if(foo && bar)\n" ~
       "    left = right;");

assert(decodeXML("&#12487;&#12451;&#12521;&#12531;") == "ディラン");
assert(decodeXML("foo") == "foo");
assert(decodeXML("&#   ;") == "&#   ;");

{
    import std.algorithm.comparison : equal;
    auto range = asDecodedXML("hello world &amp;&gt;&lt;&apos;&quot; " ~
                              "\r\r\r\r\r foo");
    assert(equal(range, `hello world &><'"  foo`));
}

{
    import dxml.parser;
    auto xml = "<root>\n" ~
               "    <function return='vector&lt;int&gt;' name='foo'>\r\n" ~
               "        <doc_comment>This function does something really\r\n" ~
               "                 fancy, and you will love it.</doc_comment>\r\n" ~
               "        <param type='int' name='i'>\r\n" ~
               "        <param type='const std::string&amp;' name='s'>\r\n" ~
               "    </function>\n" ~
               "</root>";
    auto range = parseXML!simpleXML(xml);
    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "function");
    {
        auto attrs = range.front.attributes;
        assert(attrs.front.name == "return");
        assert(attrs.front.value == "vector&lt;int&gt;");
        assert(decodeXML(attrs.front.value) == "vector<int>");
        attrs.popFront();
        assert(attrs.front.name == "name");
        assert(attrs.front.value == "foo");
        assert(decodeXML(attrs.front.value) == "foo");
    }
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "doc_comment");
    range.popFront();

    assert(range.front.text ==
           "This function does something really\r\n" ~
           "                 fancy, and you will love it.");
    assert(decodeXML(range.front.text) ==
           "This function does something really\n" ~
           "                 fancy, and you will love it.");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "doc_comment");
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "param");
    {
        auto attrs = range.front.attributes;
        assert(attrs.front.name == "type");
        assert(attrs.front.value == "int");
        assert(decodeXML(attrs.front.value) == "int");
        attrs.popFront();
        assert(attrs.front.name == "name");
        assert(attrs.front.value == "i");
        assert(decodeXML(attrs.front.value) == "i");
    }
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "param");
    {
        auto attrs = range.front.attributes;
        assert(attrs.front.name == "type");
        assert(attrs.front.value == "const std::string&amp;");
        assert(decodeXML(attrs.front.value) == "const std::string&");
        attrs.popFront();
        assert(attrs.front.name == "name");
        assert(attrs.front.value == "s");
        assert(decodeXML(attrs.front.value) == "s");
    }
}

deprecated alias normalize = decodeXML(R)(R range) if (isForwardRange!R && isSomeChar!(ElementType!R));

deprecated alias asNormalized = asDecodedXML(R)(R range) if (isForwardRange!R && isSomeChar!(ElementType!R));

Deprecated

normalize has been renamed to decodeXML, and asNormalized has been renamed to asDecodedXML. It was pointed out that there's a fairly high chance that std.uni.normalize would be used in conjunction with dxml, making conflicts annoyingly likely. Also, there was no good opposite for normalize for the functions that became encodeAttr and encodeText. denormalizeAttr and denormalizeText would arguably have been a bit ugly.

These aliases have been added to avoid code breakage when upgrading from dxml 0.2.*. They will be removed in dxml 0.4.0.
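To illustrate why the rename helps, here is a small sketch that uses decodeXML together with std.uni.normalize in the same module without any name conflict. The helper name decodeAndCompose is an assumption made for this example and is not part of dxml.

```d
import dxml.util : decodeXML;
import std.uni : NFC, normalize;

// Hypothetical helper; the name is an assumption made for this sketch.
string decodeAndCompose(string xmlText)
{
    // decodeXML handles the XML-level references...
    auto decoded = decodeXML(xmlText);
    // ...and std.uni.normalize handles Unicode normalization (NFC here).
    return normalize!NFC(decoded);
}
```

For example, decodeAndCompose("e&#x301;") first yields 'e' followed by the combining acute accent U+0301, which NFC normalization then composes into the single code point U+00E9.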

Nullable!dchar parseStdEntityRef(R)(ref R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));

This parses one of the five predefined entity references mentioned in the XML spec from the front of a range of characters.

If the given range starts with one of the five predefined entity references, then it is removed from the range, and the corresponding character is returned.

If the range does not start with one of those references, then the return value is null, and the range is unchanged.

Std Entity Ref	Converts To
&amp;	&
&gt;	>
&lt;	<
&apos;	'
&quot;	"

Any other entity references would require processing a DTD section in order to be handled and are untouched by parseStdEntityRef as are any other types of references.

Parameters:

R range

A range of characters.

Returns: The character represented by the predefined entity reference that was parsed from the front of the given range or null if the range did not start with one of the five predefined entity references.

See Also: http://www.w3.org/TR/REC-xml/#dt-chardata
parseCharRef
decodeXML
asDecodedXML

Examples:

{
    auto range = "&amp;foo";
    assert(range.parseStdEntityRef() == '&');
    assert(range == "foo");
}
{
    auto range = "&gt;bar";
    assert(range.parseStdEntityRef() == '>');
    assert(range == "bar");
}
{
    auto range = "&lt;baz";
    assert(range.parseStdEntityRef() == '<');
    assert(range == "baz");
}
{
    auto range = "&apos;dlang";
    assert(range.parseStdEntityRef() == '\'');
    assert(range == "dlang");
}
{
    auto range = "&quot;rocks";
    assert(range.parseStdEntityRef() == '"');
    assert(range == "rocks");
}
{
    auto range = " &amp;foo";
    assert(range.parseStdEntityRef().isNull);
    assert(range == " &amp;foo");
}
{
    auto range = "&Amp;hello";
    assert(range.parseStdEntityRef().isNull);
    assert(range == "&Amp;hello");
}
{
    auto range = "&nbsp;foo";
    assert(range.parseStdEntityRef().isNull);
    assert(range == "&nbsp;foo");
}
{
    auto range = "hello world";
    assert(range.parseStdEntityRef().isNull);
    assert(range == "hello world");
}

Nullable!dchar parseCharRef(R)(ref R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));

If the given range starts with a valid XML character reference, it is removed from the range, and the corresponding character is returned.

If the range does not start with a valid XML character reference, then the return value is null, and the range is unchanged.

Parameters:

R range

A range of characters.

Returns: The character represented by the character reference that was parsed from the front of the given range or null if the range did not start with a valid, XML, character reference.

See Also: http://www.w3.org/TR/REC-xml/#NT-CharRef
parseStdEntityRef
decodeXML
asDecodedXML
encodeCharRef

Examples:

{
    auto range = "&#48; hello world";
    assert(parseCharRef(range) == '0');
    assert(range == " hello world");
}
{
    auto range = "&#x30; hello world";
    assert(parseCharRef(range) == '0');
    assert(range == " hello world");
}
{
    auto range = "&#12487;&#12451;&#12521;&#12531;";
    assert(parseCharRef(range) == 'デ');
    assert(range == "&#12451;&#12521;&#12531;");
    assert(parseCharRef(range) == 'ィ');
    assert(range == "&#12521;&#12531;");
    assert(parseCharRef(range) == 'ラ');
    assert(range == "&#12531;");
    assert(parseCharRef(range) == 'ン');
    assert(range.empty);
}
{
    auto range = "&#x;foo";
    assert(parseCharRef(range).isNull);
    assert(range == "&#x;foo");
}
{
    auto range = "foobar";
    assert(parseCharRef(range).isNull);
    assert(range == "foobar");
}
{
    auto range = " &x48;";
    assert(parseCharRef(range).isNull);
    assert(range == " &x48;");
}
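Since both parseStdEntityRef and parseCharRef leave the range untouched on failure, they can be tried one after the other. The following sketch shows a hand-written decoding loop built on that property. decodeByHand is a hypothetical helper for illustration only; decodeXML should be preferred in real code, and unlike decodeXML this sketch does not strip '\r'.

```d
import dxml.util : parseCharRef, parseStdEntityRef;
import std.array : appender;
import std.range.primitives : empty, front, popFront;

// Hypothetical helper for illustration; not part of dxml.
string decodeByHand(string text)
{
    auto result = appender!string();
    while(!text.empty)
    {
        if(text.front == '&')
        {
            // Try a predefined entity reference first, then a character
            // reference. A failed attempt leaves text unchanged.
            auto c = parseStdEntityRef(text);
            if(c.isNull)
                c = parseCharRef(text);
            if(!c.isNull)
            {
                result.put(c.get);
                continue;
            }
        }
        // Not a reference that could be parsed: copy one code point through.
        result.put(text.front);
        text.popFront();
    }
    return result.data;
}
```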

string stripIndent(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));

auto withoutIndent(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));

Strips the indent from a character range (most likely from Entity.text). The idea is that if the XML is formatted to be human-readable, and it's multiple lines long, the lines are likely to be indented, but the application probably doesn't want that extra whitespace. So, stripIndent and withoutIndent attempt to intelligently strip off the leading whitespace.

For these functions, whitespace is considered to be some combination of ' ', '\t', and '\r' ('\n' is used to delineate lines, so it's not considered whitespace).

Whitespace characters are stripped from the start of the first line, and then those same number of whitespace characters are stripped from the beginning of each subsequent line (or up to the first non-whitespace character if the line starts with fewer whitespace characters).

If the first line has no leading whitespace, then the leading whitespace on the second line is treated as the indent. This is done to handle the case where there is text immediately after a start tag and then subsequent lines are indented rather than the text starting on the line after the start tag.

If neither of the first two lines has any leading whitespace, then no whitespace is stripped.

So, if the text is well-formatted, then the indent should be cleanly removed, and if it's unformatted or badly formatted, then no characters other than leading whitespace will be removed, and in principle, no real data will have been lost - though of course, it's up to the programmer to decide whether it's better for the application to try to cleanly strip the indent or to leave the text as-is.

The difference between stripIndent and withoutIndent is that stripIndent returns a string, whereas withoutIndent returns a lazy range of code units. In the case where a string is passed to stripIndent, it will simply return the original string if there is no indent (whereas in other cases, stripIndent and withoutIndent are forced to return new ranges).

Parameters:

R range

A range of characters.

Returns: The text with the indent stripped from each line. stripIndent returns a string, whereas withoutIndent returns a lazy range of code units (so it could be a range of char or wchar and not just dchar; which it is depends on the code units of the range being passed in).
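In practice, text taken from an Entity often needs both decoding and indent removal. A minimal sketch of combining the two is shown here; the helper name cleanText is an assumption made for this example. Decoding first means that carriage returns are already gone by the time the indent is measured.

```d
import dxml.util : decodeXML, stripIndent;

// Hypothetical helper; the name is an assumption made for this sketch.
string cleanText(string rawText)
{
    // Decode references and drop '\r' first, then remove the indent.
    return rawText.decodeXML().stripIndent();
}
```

For instance, cleanText("\n    if(a &lt; b)\n        foo();\n    ") is expected to give "if(a < b)\n    foo();", following the same rules as the examples in this section.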

Examples:

import std.algorithm.comparison : equal;

// The prime use case for these two functions is for an Entity.text section
// that is formatted to be human-readable, and the rules of what whitespace
// is stripped from the beginning or end of the range are geared towards
// the text coming from a well-formatted Entity.text section.
{
    import dxml.parser;
    auto xml = "<root>\n" ~
               "    <code>\n" ~
               "    bool isASCII(string str)\n" ~
               "    {\n" ~
               "        import std.algorithm : all;\n" ~
               "        import std.ascii : isASCII;\n" ~
               "        return str.all!isASCII();\n" ~
               "    }\n" ~
               "    </code>\n" ~
               "</root>";
    auto range = parseXML(xml);
    range.popFront();
    range.popFront();
    assert(range.front.type == EntityType.text);
    assert(range.front.text ==
           "\n" ~
           "    bool isASCII(string str)\n" ~
           "    {\n" ~
           "        import std.algorithm : all;\n" ~
           "        import std.ascii : isASCII;\n" ~
           "        return str.all!isASCII();\n" ~
           "    }\n" ~
           "    ");
    assert(range.front.text.stripIndent() ==
           "bool isASCII(string str)\n" ~
           "{\n" ~
           "    import std.algorithm : all;\n" ~
           "    import std.ascii : isASCII;\n" ~
           "    return str.all!isASCII();\n" ~
           "}");
}

// The indent that is stripped matches the amount of whitespace at the front
// of the first line.
assert(("    start\n" ~
        "    foo\n" ~
        "    bar\n" ~
        "        baz\n" ~
        "        xyzzy\n" ~
        "           ").stripIndent() ==
       "start\n" ~
       "foo\n" ~
       "bar\n" ~
       "    baz\n" ~
       "    xyzzy\n" ~
       "       ");

// If the first line has no leading whitespace but the second line does,
// then the second line's leading whitespace is treated as the indent.
assert(("foo\n" ~
        "    bar\n" ~
        "        baz\n" ~
        "        xyzzy").stripIndent() ==
       "foo\n" ~
       "bar\n" ~
       "    baz\n" ~
       "    xyzzy");

assert(("\n" ~
        "    foo\n" ~
        "    bar\n" ~
        "        baz\n" ~
        "        xyzzy").stripIndent() ==
       "foo\n" ~
       "bar\n" ~
       "    baz\n" ~
       "    xyzzy");

// If neither of the first two lines has leading whitespace, then nothing
// is stripped.
assert(("foo\n" ~
        "bar\n" ~
        "    baz\n" ~
        "    xyzzy\n" ~
        "    ").stripIndent() ==
       "foo\n" ~
       "bar\n" ~
       "    baz\n" ~
       "    xyzzy\n" ~
       "    ");

// If a subsequent line starts with less whitespace than the indent, then
// all of its leading whitespace is stripped but no other characters are
// stripped.
assert(("      foo\n" ~
        "         bar\n" ~
        "   baz\n" ~
        "         xyzzy").stripIndent() ==
       "foo\n" ~
       "   bar\n" ~
       "baz\n" ~
       "   xyzzy");

// If the last line is just the indent, then it and the newline before it
// are stripped.
assert(("    foo\n" ~
        "       bar\n" ~
        "    ").stripIndent() ==
       "foo\n" ~
       "   bar");

// If the last line is just whitespace, but it's more than the indent, then
// the whitespace after the indent is kept.
assert(("    foo\n" ~
        "       bar\n" ~
        "       ").stripIndent() ==
       "foo\n" ~
       "   bar\n" ~
       "   ");

// withoutIndent does the same as stripIndent but with a lazy range.
assert(equal(("  foo\n" ~
              "    bar\n" ~
              "    baz\n").withoutIndent(),
             "foo\n" ~
             "  bar\n" ~
             "  baz"));

enum StdEntityRef: string;

The string representations of the five entity references predefined by the XML spec.

See Also: http://www.w3.org/TR/REC-xml/#dt-chardata
parseStdEntityRef

amp: Entity reference for &
gt: Entity reference for >
lt: Entity reference for <
apos: Entity reference for '
quot: Entity reference for "
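As a small sketch of how the enum relates to parseStdEntityRef, each member can be fed back through the parser. The function name decodeAllStdEntityRefs is hypothetical, and the result order assumes the members are declared in the order listed above.

```d
import dxml.util : parseStdEntityRef, StdEntityRef;
import std.range.primitives : empty;
import std.traits : EnumMembers;

// Hypothetical demo function; the name is not part of dxml.
dchar[] decodeAllStdEntityRefs()
{
    dchar[] result;
    foreach(member; EnumMembers!StdEntityRef)
    {
        // The enum's base type is string, so a member converts implicitly.
        string text = member;
        auto c = parseStdEntityRef(text);
        // Every member is a complete reference, so parsing consumes it all.
        assert(!c.isNull && text.empty);
        result ~= c.get;
    }
    return result;
}
```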

auto encodeText(R)(R text)
if(isForwardRange!R && isSomeChar!(ElementType!R));

Returns a lazy range of code units which encodes any characters which cannot be put in a dxml.parser.EntityType.text in their literal form.

encodeText is intended primarily to be used with dxml.writer.XMLWriter.writeText to ensure that characters which cannot appear in their literal form do not appear in their literal form.

Specifically, what encodeText does is

convert & to &amp;

convert < to &lt;

See Also: dxml.writer.XMLWriter.writeText
encodeAttr
decodeXML
asDecodedXML

Examples:

import std.algorithm.comparison : equal;

assert(equal(encodeText(`foo & bar`), `foo &amp; bar`));
assert(equal(encodeText(`foo < bar`), `foo &lt; bar`));
assert(equal(encodeText(`foo > bar`), `foo > bar`));
assert(equal(encodeText(`foo ' bar`), `foo ' bar`));
assert(equal(encodeText(`foo " bar`), `foo " bar`));

assert(equal(encodeText("hello world"), "hello world"));
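Because encodeText is lazy, code that needs an actual string has to materialize the result. The sketch below does that with std.utf.toUTF8 and checks that decodeXML reverses the encoding; the helper name toTextContent is an assumption made for this example.

```d
import dxml.util : decodeXML, encodeText;
import std.utf : toUTF8;

// Hypothetical helper; the name is an assumption made for this sketch.
string toTextContent(string raw)
{
    // encodeText returns a lazy range, so convert it to a string eagerly.
    return encodeText(raw).toUTF8();
}
```

With this helper, decodeXML(toTextContent(s)) should give back s for text that contains no carriage returns, since decodeXML would remove those.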

auto encodeAttr(char quote = '"', R)(R text)
if((quote == '"' || quote == '\'') && isForwardRange!R && isSomeChar!(ElementType!R));

Returns a lazy range of code units which encodes any characters which cannot be put in an attribute value of an element tag in their literal form.

encodeAttr is intended primarily to be used with dxml.writer.XMLWriter.writeAttr to ensure that characters which cannot appear in their literal form do not appear in their literal form.

Specifically, what encodeAttr does is

convert & to &amp;

convert < to &lt;

convert ' to &apos; if quote == '\''

convert " to &quot; if quote == '"'

See Also: dxml.writer.XMLWriter.writeAttr
encodeText
decodeXML
asDecodedXML

Examples:

import std.algorithm.comparison : equal;

assert(equal(encodeAttr(`foo & bar`), `foo &amp; bar`));
assert(equal(encodeAttr(`foo < bar`), `foo &lt; bar`));
assert(equal(encodeAttr(`foo > bar`), `foo > bar`));
assert(equal(encodeAttr(`foo ' bar`), `foo ' bar`));
assert(equal(encodeAttr(`foo " bar`), `foo &quot; bar`));

assert(equal(encodeAttr!'\''(`foo ' bar`), `foo &apos; bar`));
assert(equal(encodeAttr!'\''(`foo " bar`), `foo " bar`));

assert(equal(encodeAttr("hello world"), "hello world"));
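The quote template argument is meant to match the quote character that surrounds the attribute value. The following sketch formats a single attribute by hand to show the pairing. formatAttr is a hypothetical helper; XMLWriter.writeAttr would normally do the writing.

```d
import dxml.util : decodeXML, encodeAttr;
import std.utf : toUTF8;

// Hypothetical helper that formats one name=value pair by hand.
// The same quote character is used for encoding and for delimiting.
string formatAttr(char quote = '"')(string name, string value)
{
    return name ~ "=" ~ quote ~ encodeAttr!quote(value).toUTF8() ~ quote;
}
```

Only the delimiting quote character is escaped, so the other kind of quote stays readable in the output.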

auto encodeCharRef(dchar c);

Returns a range of char containing the character reference corresponding to the given character.

Parameters:

dchar c

The character to encode.
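A minimal round-trip sketch, assuming the returned range can be passed to parseCharRef (which requires a forward range of characters). The function name roundTrips is hypothetical, and the sketch deliberately does not assert the exact textual form of the reference.

```d
import dxml.util : encodeCharRef, parseCharRef;
import std.range.primitives : empty;

// Hypothetical demo function; the name is not part of dxml.
bool roundTrips(dchar c)
{
    auto encoded = encodeCharRef(c);
    // parseCharRef consumes the reference from the front of the range.
    auto decoded = parseCharRef(encoded);
    return !decoded.isNull && decoded.get == c && encoded.empty;
}
```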