dxml.util
This module contains helper functions which aren't specific to the parser,
the DOM, or the writer.
Symbol | Description |
---|---|
normalize | Takes a range of characters, strips carriage returns from it, and converts both character references and the predefined entity references in the range into the characters that they refer to. |
asNormalized | The version of normalize that returns a lazy range. |
parseCharRef | Parses a character reference from the front of a range of characters. |
parseStdEntityRef | Parses one of the predefined entity references from the start of a range of characters. |
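stripIndent | Strips the indent from the front of each line of a range of characters (most likely from Entity.text) that was formatted to be human-readable. |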
withoutIndent | The version of stripIndent that returns a lazy range. |
License:
Boost License 1.0.
See Also:
Official Specification for XML 1.0
- string normalize(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
auto asNormalized(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
"Normalizes" the given text and transforms character references into the characters that they represent.
normalize combines parseStdEntityRef and parseCharRef along with processing for '\r' to normalize an entire character range. It's intended to be used on the text fields of entities and on the values of start tag attributes.
There are a number of characters that either can't be directly represented in the text fields or attribute values in XML or which can sometimes be directly represented but not always (e.g. an attribute value can contain either a single quote or a double quote, but it can't contain both at the same time, because one of them would match the opening quote). So, those characters have alternate representations in order to be allowed (e.g. "&lt;" for '<', because '<' would normally be the start of an entity). Technically, they're entity references, but the ones handled by normalize are the ones explicitly defined in the XML standard and which don't require a DTD section.
Ideally, the parser would transform all such alternate representations to what they represent when providing the text to the application, but that would make it impossible to return slices of the original text from the properties of an Entity. So, instead of having those properties do the transformation themselves, normalize and asNormalized do that so that the application can choose to do it or not (in many cases, there is nothing to normalize, making the calls unnecessary).
Similarly, an application can choose to encode a character as a character reference (e.g. "&#65;" or "&#x41;" for 'A'). normalize will decode such character references to their corresponding characters.
However, normalize does not handle any entity references beyond the five predefined ones listed below. All others are left unprocessed. Processing them properly would require handling the DTD section, which dxml does not do. The parser considers any entity references other than the predefined ones to be invalid XML, so text that comes from dxml's parser cannot contain any entity references other than the predefined ones. Similarly, invalid character references are left unprocessed, as is any character that is not valid in an XML document. normalize never throws on invalid XML.
Also, '\r' is not supposed to appear in an XML document except as a character reference unless it's in a CDATA section. So, it really should be stripped out before being handed off to the application, but again, that doesn't work with slices. So, normalize also handles that.
Specifically, what normalize and asNormalized do is
- convert &amp; to &
- convert &gt; to >
- convert &lt; to <
- convert &apos; to '
- convert &quot; to "
- remove all instances of '\r'
- convert all character references (e.g. "&#xA;") to the characters that they represent
The difference between normalize and asNormalized is that normalize returns a string, whereas asNormalized returns a lazy range of code units. In the case where a string is passed to normalize, it will simply return the original string if there is no text to normalize (whereas in other cases, normalize and asNormalized are forced to return new ranges even if there is no un-normalized text).
Parameters:
R range | The range of characters to normalize. |
Returns: The normalized text. normalize returns a string, whereas asNormalized returns a lazy range of code units (so it could be a range of char or wchar and not just dchar; which it is depends on the code units of the range being passed in).
See Also: http://www.w3.org/TR/REC-xml/#dt-chardata
parseStdEntityRef
parseCharRef
dxml.parser.EntityRange.Entity.attributes
dxml.parser.EntityRange.Entity.text
Examples:
    assert(normalize("hello world &amp;&gt;&lt;&apos;&quot; \r\r\r\r\r foo") ==
           `hello world &><'"  foo`);

    assert(normalize("if(foo &amp;&amp; bar)\r\n" ~
                     "    left = right;") ==
           "if(foo && bar)\n" ~
           "    left = right;");

    assert(normalize("&#12487;&#12451;&#12521;&#12531;") == "ディラン");
    assert(normalize("foo") == "foo");
    assert(normalize("&# ;") == "&# ;");

    {
        import std.algorithm.comparison : equal;
        auto range = asNormalized("hello world &amp;&gt;&lt;&apos;&quot; " ~
                                  "\r\r\r\r\r foo");
        assert(equal(range, `hello world &><'"  foo`));
    }

    {
        import dxml.parser;

        // Parsing is lazy, so only the entities that are actually popped
        // off of the range below get validated.
        auto xml = "<root>\n" ~
                   "    <function return='vector&lt;int&gt;' name='foo'>\r\n" ~
                   "        <doc_comment>This function does something really\r\n" ~
                   "                 fancy, and you will love it.</doc_comment>\r\n" ~
                   "        <param type='int' name='i'>\r\n" ~
                   "        <param type='const std::string&amp;' name='s'>\r\n" ~
                   "    </function>\n" ~
                   "</root>";
        auto range = parseXML!simpleXML(xml);

        range.popFront();
        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "function");
        {
            auto attrs = range.front.attributes;
            assert(attrs.front.name == "return");
            assert(attrs.front.value == "vector&lt;int&gt;");
            assert(normalize(attrs.front.value) == "vector<int>");
            attrs.popFront();
            assert(attrs.front.name == "name");
            assert(attrs.front.value == "foo");
            assert(normalize(attrs.front.value) == "foo");
        }

        range.popFront();
        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "doc_comment");

        range.popFront();
        assert(range.front.text ==
               "This function does something really\r\n" ~
               "                 fancy, and you will love it.");
        assert(normalize(range.front.text) ==
               "This function does something really\n" ~
               "                 fancy, and you will love it.");

        range.popFront();
        assert(range.front.type == EntityType.elementEnd);
        assert(range.front.name == "doc_comment");

        range.popFront();
        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "param");
        {
            auto attrs = range.front.attributes;
            assert(attrs.front.name == "type");
            assert(attrs.front.value == "int");
            assert(normalize(attrs.front.value) == "int");
            attrs.popFront();
            assert(attrs.front.name == "name");
            assert(attrs.front.value == "i");
            assert(normalize(attrs.front.value) == "i");
        }

        range.popFront();
        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "param");
        {
            auto attrs = range.front.attributes;
            assert(attrs.front.name == "type");
            assert(attrs.front.value == "const std::string&amp;");
            assert(normalize(attrs.front.value) == "const std::string&");
            attrs.popFront();
            assert(attrs.front.name == "name");
            assert(attrs.front.value == "s");
            assert(normalize(attrs.front.value) == "s");
        }
    }
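The description of normalize says that it combines parseStdEntityRef and parseCharRef with the handling of '\r'. As a rough illustration of that idea, and not dxml's actual implementation, the sketch below does the same for a string. The name decodeText is hypothetical and exists only for this example.

```d
import dxml.util : parseCharRef, parseStdEntityRef;
import std.array : appender;

// Hypothetical helper (not part of dxml): an eager, string-only
// approximation of what normalize is documented to do.
string decodeText(string text)
{
    auto result = appender!string();
    while(text.length != 0)
    {
        // Carriage returns are dropped.
        if(text[0] == '\r')
        {
            text = text[1 .. $];
            continue;
        }

        if(text[0] == '&')
        {
            // Try the five predefined entity references first and then
            // character references. On success, the reference has already
            // been removed from the front of text.
            auto c = parseStdEntityRef(text);
            if(c.isNull)
                c = parseCharRef(text);
            if(!c.isNull)
            {
                result.put(c.get);
                continue;
            }
        }

        // Anything else, including unrecognized or invalid references,
        // is copied through one code unit at a time.
        result.put(text[0]);
        text = text[1 .. $];
    }
    return result.data;
}
```

Under these assumptions, decodeText("a &lt; b\r\n") yields "a < b\n", while something like "&foo;" passes through untouched. In real code, normalize or asNormalized is the better choice, since they also accept non-string ranges and avoid allocating when there is nothing to normalize.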
- Nullable!dchar parseStdEntityRef(R)(ref R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
This parses one of the five predefined entity references mentioned in the XML spec from the front of a range of characters.
If the given range starts with one of the five predefined entity references, then it is removed from the range, and the corresponding character is returned. If the range does not start with one of those references, then the return value is null, and the range is unchanged.
Std Entity Ref | Converts To |
---|---|
&amp; | & |
&gt; | > |
&lt; | < |
&apos; | ' |
&quot; | " |
Any other entity references are left unchanged by parseStdEntityRef, as are any other types of references.
Parameters:
R range | A range of characters. |
Returns: The character represented by the predefined entity reference that was parsed from the front of the given range or null if the range did not start with one of the five predefined entity references.
Examples:
    {
        auto range = "&amp;foo";
        assert(range.parseStdEntityRef() == '&');
        assert(range == "foo");
    }
    {
        auto range = "&gt;bar";
        assert(range.parseStdEntityRef() == '>');
        assert(range == "bar");
    }
    {
        auto range = "&lt;baz";
        assert(range.parseStdEntityRef() == '<');
        assert(range == "baz");
    }
    {
        auto range = "&apos;dlang";
        assert(range.parseStdEntityRef() == '\'');
        assert(range == "dlang");
    }
    {
        auto range = "&quot;rocks";
        assert(range.parseStdEntityRef() == '"');
        assert(range == "rocks");
    }
    {
        // The reference must be at the very front of the range.
        auto range = " &amp;foo";
        assert(range.parseStdEntityRef().isNull);
        assert(range == " &amp;foo");
    }
    {
        // Entity references are case-sensitive.
        auto range = "&Amp;hello";
        assert(range.parseStdEntityRef().isNull);
        assert(range == "&Amp;hello");
    }
    {
        // Only the five predefined entity references are recognized.
        auto range = "&nbsp;foo";
        assert(range.parseStdEntityRef().isNull);
        assert(range == "&nbsp;foo");
    }
    {
        auto range = "hello world";
        assert(range.parseStdEntityRef().isNull);
        assert(range == "hello world");
    }
- Nullable!dchar parseCharRef(R)(ref R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
If the given range starts with a valid XML character reference, it is removed from the range, and the corresponding character is returned.
If the range does not start with a valid XML character reference, then the return value is null, and the range is unchanged.
Parameters:
R range | A range of characters. |
Returns: The character represented by the character reference that was parsed from the front of the given range or null if the range did not start with a valid XML character reference.
Examples:
    {
        // Decimal character reference.
        auto range = "&#48; hello world";
        assert(parseCharRef(range) == '0');
        assert(range == " hello world");
    }
    {
        // Hexadecimal character reference.
        auto range = "&#x30; hello world";
        assert(parseCharRef(range) == '0');
        assert(range == " hello world");
    }
    {
        auto range = "&#12487;&#12451;&#12521;&#12531;";
        assert(parseCharRef(range) == 'デ');
        assert(range == "&#12451;&#12521;&#12531;");
        assert(parseCharRef(range) == 'ィ');
        assert(range == "&#12521;&#12531;");
        assert(parseCharRef(range) == 'ラ');
        assert(range == "&#12531;");
        assert(parseCharRef(range) == 'ン');
        assert(range.empty);
    }
    {
        auto range = "&#x;foo";
        assert(parseCharRef(range).isNull);
        assert(range == "&#x;foo");
    }
    {
        auto range = "foobar";
        assert(parseCharRef(range).isNull);
        assert(range == "foobar");
    }
    {
        auto range = " &x48;";
        assert(parseCharRef(range).isNull);
        assert(range == " &x48;");
    }
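Because parseCharRef takes its argument by ref and returns a Nullable!dchar, it can be called in a loop until it reports null. The sketch below shows one way that might look. The helper name takeCharRefs is hypothetical and is not part of dxml.

```d
import dxml.util : parseCharRef;

// Hypothetical helper (not part of dxml): decodes every character reference
// found back-to-back at the front of text and leaves the rest in place.
dstring takeCharRefs(ref string text)
{
    dstring result;
    // parseCharRef only advances text when it successfully parses a
    // reference, so the loop stops at the first thing that is not one.
    for(auto c = parseCharRef(text); !c.isNull; c = parseCharRef(text))
        result ~= c.get;
    return result;
}
```

Given "&#65;&#x42;&#x;rest", this returns "AB" and leaves "&#x;rest" in the string, since "&#x;" is not a valid character reference.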
- string stripIndent(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
auto withoutIndent(R)(R range)
if(isForwardRange!R && isSomeChar!(ElementType!R));
Strips the indent from a character range (most likely from Entity.text). The idea is that if the XML is formatted to be human-readable, and it's multiple lines long, the lines are likely to be indented, but the application probably doesn't want that extra whitespace. So, stripIndent and withoutIndent attempt to intelligently strip off the leading whitespace.
For these functions, whitespace is considered to be some combination of ' ', '\t', and '\r' (as '\n' is used to delineate lines, so it's not considered whitespace).
Whitespace characters are stripped from the start of the first line, and then that same number of whitespace characters is stripped from the beginning of each subsequent line (or up to the first non-whitespace character if the line starts with fewer whitespace characters).
If the first line has no leading whitespace, then the leading whitespace on the second line is treated as the indent. This is done to handle the case where there is text immediately after a start tag and then subsequent lines are indented rather than the text starting on the line after the start tag.
If neither of the first two lines has any leading whitespace, then no whitespace is stripped.
So, if the text is well-formatted, then the indent should be cleanly removed, and if it's unformatted or badly formatted, then no characters other than leading whitespace will be removed, and in principle, no real data will have been lost, though of course, it's up to the programmer to decide whether it's better for the application to try to cleanly strip the indent or to leave the text as-is.
The difference between stripIndent and withoutIndent is that stripIndent returns a string, whereas withoutIndent returns a lazy range of code units. In the case where a string is passed to stripIndent, it will simply return the original string if the indent is determined to be zero (whereas in other cases, stripIndent and withoutIndent are forced to return new ranges).
Parameters:
R range | A range of characters. |
Returns: The text with the indent stripped from each line. stripIndent returns a string, whereas withoutIndent returns a lazy range of code units (so it could be a range of char or wchar and not just dchar; which it is depends on the code units of the range being passed in).
See Also: dxml.parser.EntityRange.Entity.text
Examples:
    import std.algorithm.comparison : equal;

    // The prime use case for these two functions is for an Entity.text section
    // that is formatted to be human-readable, and the rules of what whitespace
    // is stripped from the beginning or end of the range are geared towards
    // the text coming from a well-formatted Entity.text section.
    {
        import dxml.parser;
        auto xml = "<root>\n" ~
                   "    <code>\n" ~
                   "    bool isASCII(string str)\n" ~
                   "    {\n" ~
                   "        import std.algorithm : all;\n" ~
                   "        import std.ascii : isASCII;\n" ~
                   "        return str.all!isASCII();\n" ~
                   "    }\n" ~
                   "    </code>\n" ~
                   "</root>";
        auto range = parseXML(xml);
        range.popFront();
        range.popFront();
        assert(range.front.text ==
               "\n" ~
               "    bool isASCII(string str)\n" ~
               "    {\n" ~
               "        import std.algorithm : all;\n" ~
               "        import std.ascii : isASCII;\n" ~
               "        return str.all!isASCII();\n" ~
               "    }\n" ~
               "    ");
        assert(range.front.text.stripIndent() ==
               "bool isASCII(string str)\n" ~
               "{\n" ~
               "    import std.algorithm : all;\n" ~
               "    import std.ascii : isASCII;\n" ~
               "    return str.all!isASCII();\n" ~
               "}");
    }

    // The indent that is stripped matches the amount of whitespace at the front
    // of the first line.
    assert(("    start\n" ~
            "    foo\n" ~
            "    bar\n" ~
            "        baz\n" ~
            "        xyzzy\n" ~
            "           ").stripIndent() ==
           "start\n" ~
           "foo\n" ~
           "bar\n" ~
           "    baz\n" ~
           "    xyzzy\n" ~
           "       ");

    // If the first line has no leading whitespace but the second line does,
    // then the second line's leading whitespace is treated as the indent.
    assert(("foo\n" ~
            "    bar\n" ~
            "        baz\n" ~
            "        xyzzy").stripIndent() ==
           "foo\n" ~
           "bar\n" ~
           "    baz\n" ~
           "    xyzzy");

    assert(("\n" ~
            "    foo\n" ~
            "    bar\n" ~
            "        baz\n" ~
            "        xyzzy").stripIndent() ==
           "foo\n" ~
           "bar\n" ~
           "    baz\n" ~
           "    xyzzy");

    // If neither of the first two lines has leading whitespace, then nothing
    // is stripped.
    assert(("foo\n" ~
            "bar\n" ~
            "    baz\n" ~
            "    xyzzy\n" ~
            "    ").stripIndent() ==
           "foo\n" ~
           "bar\n" ~
           "    baz\n" ~
           "    xyzzy\n" ~
           "    ");

    // If a subsequent line starts with less whitespace than the indent, then
    // all of its leading whitespace is stripped but no other characters are
    // stripped.
    assert(("      foo\n" ~
            "         bar\n" ~
            "   baz\n" ~
            "         xyzzy").stripIndent() ==
           "foo\n" ~
           "   bar\n" ~
           "baz\n" ~
           "   xyzzy");

    // If the last line is just the indent, then it and the newline before it
    // are stripped.
    assert(("    foo\n" ~
            "       bar\n" ~
            "    ").stripIndent() ==
           "foo\n" ~
           "   bar");

    // If the last line is just whitespace, but it's more than the indent, then
    // the whitespace after the indent is kept.
    assert(("    foo\n" ~
            "       bar\n" ~
            "       ").stripIndent() ==
           "foo\n" ~
           "   bar\n" ~
           "   ");

    // withoutIndent does the same as stripIndent but with a lazy range.
    assert(equal(("  foo\n" ~
                  "    bar\n" ~
                  "    baz\n").withoutIndent(),
                 "foo\n" ~
                 "  bar\n" ~
                 "  baz"));
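Since Entity.text is returned as a slice of the original XML, an application that wants clean text will often want both stripIndent and normalize. A minimal sketch of combining them follows. The helper name cleanText is hypothetical, and the order of the two calls is a design choice made for this example rather than something dxml prescribes.

```d
import dxml.util : normalize, stripIndent;

// Hypothetical helper (not part of dxml): removes the indent and then
// decodes the predefined entity references and character references.
string cleanText(string text)
{
    // The indent is stripped first so that whitespace which was written
    // as a character reference is not mistaken for indent once decoded.
    return text.stripIndent().normalize();
}
```

For text such as "\n    if(a &lt; b)\n        foo();\n    ", which has the same shape as the isASCII example above, this gives "if(a < b)\n    foo();".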