Jonathan M Davis: The Long-Winded D Guy

dxml.util

This module contains helper functions which aren't specific to the parser, the DOM, or the writer.
Symbol Description
normalize Takes a range of characters, strips carriage returns from it, and converts both character references and the predefined entity references in the range into the characters that they refer to.
asNormalized The version of normalize that returns a lazy range.
parseCharRef Parses a character reference from the front of a range of characters.
parseStdEntityRef Parses one of the predefined entity references from the start of a range of characters.
stripIndent Takes a range of characters and strips the indent from the start of each line.
withoutIndent The version of stripIndent that returns a lazy range.
string normalize(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
auto asNormalized(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
"Normalizes" the given text and transforms character references into the characters that they represent. normalize combines parseStdEntityRef and parseCharRef along with processing for '\r' to normalize an entire character range. It's intended to be used on the text fields of entities and on the values of start tag attributes.
There are a number of characters that either can't be directly represented in the text fields or attribute values in XML or which can sometimes be directly represented but not always (e.g. an attribute value can contain either a single quote or a double quote, but it can't contain both at the same time, because one of them would match the opening quote). So, those characters have alternate representations in order to be allowed (e.g. "&lt;" for '<', because '<' would normally be the start of an entity). Technically, they're entity references, but the ones handled by normalize are the ones explicitly defined in the XML standard and which don't require a DTD section.
Ideally, the parser would transform all such alternate representations to what they represent when providing the text to the application, but that would make it impossible to return slices of the original text from the properties of an Entity. So, instead of having those properties do the transformation themselves, normalize and asNormalized do that so that the application can choose to do it or not (in many cases, there is nothing to normalize, making the calls unnecessary).
Similarly, an application can choose to encode a character as a character reference (e.g. "&#65;" or "&#x41;" for 'A'). normalize will decode such character references to their corresponding characters.
However, normalize does not handle any entity references beyond the five predefined ones listed below. All others are left unprocessed, because processing them properly would require handling the DTD section, which dxml does not do. The parser considers any entity references other than the predefined ones to be invalid XML, so text that comes from dxml's parser cannot contain any other entity references; only text from another source can. Similarly, invalid character references are left unprocessed, as is any character that is not valid in an XML document. normalize never throws on invalid XML.
Also, '\r' is not supposed to appear in an XML document except as a character reference unless it's in a CDATA section. So, it really should be stripped out before being handed off to the application, but again, that doesn't work with slices. So, normalize also handles that.
Specifically, what normalize and asNormalized do is:
convert &amp; to &
convert &gt; to >
convert &lt; to <
convert &apos; to '
convert &quot; to "
remove all instances of '\r'
convert all character references (e.g. "&#xA;") to the characters that they represent
All other entity references are left untouched, as are any '&' that is not part of one of the constructs listed above and any malformed constructs (e.g. "&Amp;" or "&#xGGA2;").
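The way these pieces fit together can be sketched as a simple loop over a string. This is only an illustration of the rules above, not dxml's actual implementation; normalizeSketch is a hypothetical name, and the sketch handles only string input rather than arbitrary forward ranges.

```d
import dxml.util : parseCharRef, parseStdEntityRef;
import std.array : appender;

// Hypothetical helper for illustration only; it is not part of dxml.
string normalizeSketch(string text)
{
    auto result = appender!string();
    while(text.length != 0)
    {
        // Carriage returns are dropped.
        if(text[0] == '\r')
        {
            text = text[1 .. $];
            continue;
        }
        if(text[0] == '&')
        {
            // Try the five predefined entity references first and then
            // character references. Both functions pop the reference off
            // of the front of text only when they succeed.
            auto decoded = parseStdEntityRef(text);
            if(decoded.isNull)
                decoded = parseCharRef(text);
            if(!decoded.isNull)
            {
                result.put(decoded.get);
                continue;
            }
        }
        // Anything else, including a malformed reference, is copied as-is.
        result.put(text[0]);
        text = text[1 .. $];
    }
    return result.data;
}

void main()
{
    assert(normalizeSketch("a &lt; b &#x26;&#38; c\r\n") == "a < b && c\n");
    assert(normalizeSketch("&Amp; &nbsp;") == "&Amp; &nbsp;");
}
```

In real code, normalize and asNormalized should be preferred, since they accept any forward range of characters and can avoid allocating when there is nothing to normalize.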
The difference between normalize and asNormalized is that normalize returns a string, whereas asNormalized returns a lazy range of code units. In the case where a string is passed to normalize, it will simply return the original string if there is no text to normalize (whereas in other cases, normalize and asNormalized are forced to return new ranges even if there is no un-normalized text).
Parameters:
R range The range of characters to normalize.
Returns: The normalized text. normalize returns a string, whereas asNormalized returns a lazy range of code units (so it could be a range of char or wchar and not just dchar; which it is depends on the code units of the range being passed in).
Examples:
assert(normalize("hello world &amp;&gt;&lt;&apos;&quot; \r\r\r\r\r foo") ==
       `hello world &><'"  foo`);

assert(normalize("if(foo &amp;&amp; bar)\r\n" ~
                 "    left = right;") ==
       "if(foo && bar)\n" ~
       "    left = right;");

assert(normalize("&#12487;&#12451;&#12521;&#12531;") == "ディラン");
assert(normalize("foo") == "foo");
assert(normalize("&#   ;") == "&#   ;");

{
    import std.algorithm.comparison : equal;
    auto range = asNormalized("hello world &amp;&gt;&lt;&apos;&quot; " ~
                              "\r\r\r\r\r foo");
    assert(equal(range, `hello world &><'"  foo`));
}

{
    import dxml.parser;
    auto xml = "<root>\n" ~
               "    <function return='vector&lt;int&gt;' name='foo'>\r\n" ~
               "        <doc_comment>This function does something really\r\n" ~
               "                 fancy, and you will love it.</doc_comment>\r\n" ~
               "        <param type='int' name='i'>\r\n" ~
               "        <param type='const std::string&amp;' name='s'>\r\n" ~
               "    </function>\n" ~
               "</root>";
    auto range = parseXML!simpleXML(xml);
    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "function");
    {
        auto attrs = range.front.attributes;
        assert(attrs.front.name == "return");
        assert(attrs.front.value == "vector&lt;int&gt;");
        assert(normalize(attrs.front.value) == "vector<int>");
        attrs.popFront();
        assert(attrs.front.name == "name");
        assert(attrs.front.value == "foo");
        assert(normalize(attrs.front.value) == "foo");
    }
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "doc_comment");
    range.popFront();

    assert(range.front.text ==
           "This function does something really\r\n" ~
           "                 fancy, and you will love it.");
    assert(normalize(range.front.text) ==
           "This function does something really\n" ~
           "                 fancy, and you will love it.");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "doc_comment");
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "param");
    {
        auto attrs = range.front.attributes;
        assert(attrs.front.name == "type");
        assert(attrs.front.value == "int");
        assert(normalize(attrs.front.value) == "int");
        attrs.popFront();
        assert(attrs.front.name == "name");
        assert(attrs.front.value == "i");
        assert(normalize(attrs.front.value) == "i");
    }
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "param");
    {
        auto attrs = range.front.attributes;
        assert(attrs.front.name == "type");
        assert(attrs.front.value == "const std::string&amp;");
        assert(normalize(attrs.front.value) == "const std::string&");
        attrs.popFront();
        assert(attrs.front.name == "name");
        assert(attrs.front.value == "s");
        assert(normalize(attrs.front.value) == "s");
    }
}
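The difference between the two return types described above can also be checked directly. The following is a small sketch that relies on the documented behavior (a string with nothing to normalize is returned as-is, and asNormalized yields the code unit type of its input); the variable names are only illustrative.

```d
import dxml.util : asNormalized, normalize;
import std.algorithm.comparison : equal;
import std.range.primitives : ElementType;
import std.traits : Unqual;

void main()
{
    // With no '&' or '\r' in the input, normalize is documented to hand
    // back the original string rather than a copy.
    string plain = "nothing to normalize";
    assert(normalize(plain) is plain);

    // asNormalized is lazy and works in code units, so a wstring input
    // is expected to give a range of wchar rather than of dchar.
    auto lazyText = asNormalized("a &lt; b"w);
    static assert(is(Unqual!(ElementType!(typeof(lazyText))) == wchar));
    assert(equal(lazyText, "a < b"));
}
```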
Nullable!dchar parseStdEntityRef(R)(ref R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
This parses one of the five predefined entity references mentioned in the XML spec from the front of a range of characters.
If the given range starts with one of the five, predefined entity references, then it is removed from the range, and the corresponding character is returned.
If the range does not start with one of those references, then the return value is null, and the range is unchanged.
Std Entity Ref  Converts To
&amp;           &
&gt;            >
&lt;            <
&apos;          '
&quot;          "
Any other entity references would require processing a DTD section in order to be handled and are untouched by parseStdEntityRef as are any other types of references.
Parameters:
R range A range of characters.
Returns: The character represented by the predefined entity reference that was parsed from the front of the given range or null if the range did not start with one of the five predefined entity references.
Examples:
{
    auto range = "&amp;foo";
    assert(range.parseStdEntityRef() == '&');
    assert(range == "foo");
}
{
    auto range = "&gt;bar";
    assert(range.parseStdEntityRef() == '>');
    assert(range == "bar");
}
{
    auto range = "&lt;baz";
    assert(range.parseStdEntityRef() == '<');
    assert(range == "baz");
}
{
    auto range = "&apos;dlang";
    assert(range.parseStdEntityRef() == '\'');
    assert(range == "dlang");
}
{
    auto range = "&quot;rocks";
    assert(range.parseStdEntityRef() == '"');
    assert(range == "rocks");
}
{
    auto range = " &amp;foo";
    assert(range.parseStdEntityRef().isNull);
    assert(range == " &amp;foo");
}
{
    auto range = "&Amp;hello";
    assert(range.parseStdEntityRef().isNull);
    assert(range == "&Amp;hello");
}
{
    auto range = "&nbsp;foo";
    assert(range.parseStdEntityRef().isNull);
    assert(range == "&nbsp;foo");
}
{
    auto range = "hello world";
    assert(range.parseStdEntityRef().isNull);
    assert(range == "hello world");
}
Nullable!dchar parseCharRef(R)(ref R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
If the given range starts with a valid XML character reference, it is removed from the range, and the corresponding character is returned.
If the range does not start with a valid XML character reference, then the return value is null, and the range is unchanged.
Parameters:
R range A range of characters.
Returns: The character represented by the character reference that was parsed from the front of the given range or null if the range did not start with a valid XML character reference.
Examples:
{
    auto range = "&#48; hello world";
    assert(parseCharRef(range) == '0');
    assert(range == " hello world");
}
{
    auto range = "&#x30; hello world";
    assert(parseCharRef(range) == '0');
    assert(range == " hello world");
}
{
    auto range = "&#12487;&#12451;&#12521;&#12531;";
    assert(parseCharRef(range) == 'デ');
    assert(range == "&#12451;&#12521;&#12531;");
    assert(parseCharRef(range) == 'ィ');
    assert(range == "&#12521;&#12531;");
    assert(parseCharRef(range) == 'ラ');
    assert(range == "&#12531;");
    assert(parseCharRef(range) == 'ン');
    assert(range.empty);
}
{
    auto range = "&#x;foo";
    assert(parseCharRef(range).isNull);
    assert(range == "&#x;foo");
}
{
    auto range = "foobar";
    assert(parseCharRef(range).isNull);
    assert(range == "foobar");
}
{
    auto range = " &x48;";
    assert(parseCharRef(range).isNull);
    assert(range == " &x48;");
}
string stripIndent(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
auto withoutIndent(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
Strips the indent from a character range (most likely from Entity.text). The idea is that if the XML is formatted to be human-readable, and it's multiple lines long, the lines are likely to be indented, but the application probably doesn't want that extra whitespace. So, stripIndent and withoutIndent attempt to intelligently strip off the leading whitespace.
For these functions, whitespace is considered to be some combination of ' ', '\t', and '\r' (as '\n' is used to delineate lines, so it's not considered whitespace).
Whitespace characters are stripped from the start of the first line, and then that same number of whitespace characters is stripped from the beginning of each subsequent line (or up to the first non-whitespace character if the line starts with fewer whitespace characters).
If the first line has no leading whitespace, then the leading whitespace on the second line is treated as the indent. This is done to handle the case where there is text immediately after a start tag and then subsequent lines are indented rather than the text starting on the line after the start tag.
If neither of the first two lines has any leading whitespace, then no whitespace is stripped.
So, if the text is well-formatted, then the indent should be cleanly removed, and if it's unformatted or badly formatted, then no characters other than leading whitespace will be removed, and in principle, no real data will have been lost - though of course, it's up to the programmer to decide whether it's better for the application to try to cleanly strip the indent or to leave the text as-is.
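As a brief sketch of how these rules play out, the first assertion below uses tabs instead of spaces, and the second chains stripIndent with normalize on text shaped like an Entity.text section. The input strings are made up for illustration, and the expected results follow from the rules described above.

```d
import dxml.util : normalize, stripIndent;

void main()
{
    // '\t' counts as indent whitespace too: the first line has one
    // leading whitespace character, so one is stripped from each line.
    assert("\tfoo\n\t\tbar".stripIndent() == "foo\n\tbar");

    // stripIndent returns a string, so it can be passed straight on to
    // normalize in order to decode the references as well.
    auto text = "\n" ~
                "    if(a &lt; b)\n" ~
                "        swap(a, b);\n" ~
                "    ";
    assert(text.stripIndent().normalize() ==
           "if(a < b)\n" ~
           "    swap(a, b);");
}
```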
The difference between stripIndent and withoutIndent is that stripIndent returns a string, whereas withoutIndent returns a lazy range of code units. In the case where a string is passed to stripIndent, it will simply return the original string if the indent is determined to be zero (whereas in other cases, stripIndent and withoutIndent are forced to return new ranges).
Parameters:
R range A range of characters.
Returns: The text with the indent stripped from each line. stripIndent returns a string, whereas withoutIndent returns a lazy range of code units (so it could be a range of char or wchar and not just dchar; which it is depends on the code units of the range being passed in).
Examples:
import std.algorithm.comparison : equal;

// The prime use case for these two functions is for an Entity.text section
// that is formatted to be human-readable, and the rules of what whitespace
// is stripped from the beginning or end of the range are geared towards
// the text coming from a well-formatted Entity.text section.
{
    import dxml.parser;
    auto xml = "<root>\n" ~
               "    <code>\n" ~
               "    bool isASCII(string str)\n" ~
               "    {\n" ~
               "        import std.algorithm : all;\n" ~
               "        import std.ascii : isASCII;\n" ~
               "        return str.all!isASCII();\n" ~
               "    }\n" ~
               "    </code>\n" ~
               "</root>";
    auto range = parseXML(xml);
    range.popFront();
    range.popFront();
    assert(range.front.text ==
           "\n" ~
           "    bool isASCII(string str)\n" ~
           "    {\n" ~
           "        import std.algorithm : all;\n" ~
           "        import std.ascii : isASCII;\n" ~
           "        return str.all!isASCII();\n" ~
           "    }\n" ~
           "    ");
    assert(range.front.text.stripIndent() ==
           "bool isASCII(string str)\n" ~
           "{\n" ~
           "    import std.algorithm : all;\n" ~
           "    import std.ascii : isASCII;\n" ~
           "    return str.all!isASCII();\n" ~
           "}");
}

// The indent that is stripped matches the amount of whitespace at the front
// of the first line.
assert(("    start\n" ~
        "    foo\n" ~
        "    bar\n" ~
        "        baz\n" ~
        "        xyzzy\n" ~
        "           ").stripIndent() ==
       "start\n" ~
       "foo\n" ~
       "bar\n" ~
       "    baz\n" ~
       "    xyzzy\n" ~
       "       ");

// If the first line has no leading whitespace but the second line does, then
// the second line's leading whitespace is treated as the indent.
assert(("foo\n" ~
        "    bar\n" ~
        "        baz\n" ~
        "        xyzzy").stripIndent() ==
       "foo\n" ~
       "bar\n" ~
       "    baz\n" ~
       "    xyzzy");

assert(("\n" ~
        "    foo\n" ~
        "    bar\n" ~
        "        baz\n" ~
        "        xyzzy").stripIndent() ==
       "foo\n" ~
       "bar\n" ~
       "    baz\n" ~
       "    xyzzy");

// If neither of the first two lines has leading whitespace, then nothing
// is stripped.
assert(("foo\n" ~
        "bar\n" ~
        "    baz\n" ~
        "    xyzzy\n" ~
        "    ").stripIndent() ==
       "foo\n" ~
       "bar\n" ~
       "    baz\n" ~
       "    xyzzy\n" ~
       "    ");

// If a subsequent line starts with less whitespace than the indent, then
// all of its leading whitespace is stripped but no other characters are
// stripped.
assert(("      foo\n" ~
        "         bar\n" ~
        "   baz\n" ~
        "         xyzzy").stripIndent() ==
       "foo\n" ~
       "   bar\n" ~
       "baz\n" ~
       "   xyzzy");

// If the last line is just the indent, then it and the newline before it
// are stripped.
assert(("    foo\n" ~
        "       bar\n" ~
        "    ").stripIndent() ==
       "foo\n" ~
       "   bar");

// If the last line is just whitespace, but it's more than the indent, then
// the whitespace after the indent is kept.
assert(("    foo\n" ~
        "       bar\n" ~
        "       ").stripIndent() ==
       "foo\n" ~
       "   bar\n" ~
       "   ");

// withoutIndent does the same as stripIndent but with a lazy range.
assert(equal(("  foo\n" ~
              "    bar\n" ~
              "    baz\n").withoutIndent(),
             "foo\n" ~
             "  bar\n" ~
             "  baz"));