dxml.util
This module contains helper functions which aren't specific to the parser,
the DOM, or the writer.
Symbol | Description |
---|---|
decodeXML | Takes a range of characters, strips carriage returns from it, and converts both character references and the predefined entity references in the range into the characters that they refer to. |
asDecodedXML | The version of decodeXML that returns a lazy range. |
parseCharRef | Parses a character reference from the front of a range of characters. |
parseStdEntityRef | Parses one of the predefined entity references from the start of a range of characters. |
stripIndent | Removes the indent from the front of each line of a range of characters that was XML text which was formatted for human-readability. |
withoutIndent | The version of stripIndent that returns a lazy range. |
StdEntityRef | Enum containing the string representations of the five, predefined entity references. |
encodeText | Encodes characters which cannot appear in EntityType.text in their literal form. |
encodeAttr | Encodes characters which cannot appear in the attribute value of an element start tag in their literal form. |
encodeCharRef | Encodes a character as a character reference. |
License:
Boost License 1.0.
See Also:
Official Specification for XML 1.0
- string decodeXML(R)(R range) if(isForwardRange!R && isSomeChar!(ElementType!R));
  auto asDecodedXML(R)(R range) if(isForwardRange!R && isSomeChar!(ElementType!R));

Decodes any XML character references and standard XML entity references in the text and removes any carriage returns. It's intended to be used on the text fields of element tags and on the values of start tag attributes.

There are a number of characters that either can't be directly represented in the text fields or attribute values in XML or which can sometimes be directly represented but not always (e.g. an attribute value can contain either a single quote or a double quote, but it can't contain both at the same time, because one of them would match the opening quote). So, those characters have alternate representations in order to be allowed (e.g. "&lt;" for '<', because '<' would normally be the beginning of an entity). Technically, they're entity references, but the ones handled by decodeXML are the ones explicitly defined in the XML standard and which don't require a DTD section.

Ideally, the parser would transform all such alternate representations to what they represent when providing the text to the application, but that would make it impossible to return slices of the original text from the properties of an Entity. So, instead of having those properties do the transformation themselves, decodeXML and asDecodedXML do that so that the application can choose to do it or not (in many cases, there is nothing to decode, making the calls unnecessary).

Similarly, an application can choose to encode a character as a character reference (e.g. "&#65;" or "&#x41;" for 'A'). decodeXML will decode such character references to their corresponding characters.

However, decodeXML does not handle any entity references beyond the five predefined ones listed below. All others are left unprocessed. Processing them properly would require handling the DTD section, which dxml does not support. The parser considers any entity references other than the predefined ones to be invalid XML, so text that comes from dxml's parser can't contain any entity references other than the predefined ones. Similarly, invalid character references are left unprocessed, as is any character that is not valid in an XML document. decodeXML never throws on invalid XML.

Also, '\r' is not supposed to appear in an XML document except as a character reference unless it's in a CDATA section. So, it really should be stripped out before being handed off to the application, but again, that doesn't work with slices. So, decodeXML also handles that.

Specifically, what decodeXML and asDecodedXML do is

- convert &amp; to &
- convert &gt; to >
- convert &lt; to <
- convert &apos; to '
- convert &quot; to "
- remove all instances of \r
- convert all character references (e.g. &#xA;) to the characters that they represent

The difference between decodeXML and asDecodedXML is that decodeXML returns a string, whereas asDecodedXML returns a lazy range of code units. In the case where a string is passed to decodeXML, it will simply return the original string if there is no text to decode (whereas in other cases, decodeXML and asDecodedXML are forced to return new ranges even if there is no text to decode).

Parameters:
R range  The range of characters to decode.

Returns: The decoded text. decodeXML returns a string, whereas asDecodedXML returns a lazy range of code units (so it could be a range of char or wchar and not just dchar; which it is depends on the code units of the range being passed in).

See Also: http://www.w3.org/TR/REC-xml/#dt-chardata
parseStdEntityRef
parseCharRef
dxml.parser.EntityRange.Entity.attributes
dxml.parser.EntityRange.Entity.text
encodeAttr
encodeText

Examples:

    assert(decodeXML("hello world &amp;&gt;&lt;&apos;&quot; \r\r\r\r\r foo") ==
           `hello world &><'"  foo`);

    assert(decodeXML("if(foo &amp;&amp; bar)\r\n" ~
                     "    left = right;") ==
           "if(foo && bar)\n" ~
           "    left = right;");

    assert(decodeXML("&#12487;&#12451;&#12521;&#12531;") == "ディラン");
    assert(decodeXML("foo") == "foo");
    assert(decodeXML("&# ;") == "&# ;");

    {
        import std.algorithm.comparison : equal;
        auto range = asDecodedXML("hello world &amp;&gt;&lt;&apos;&quot; " ~
                                  "\r\r\r\r\r foo");
        assert(equal(range, `hello world &><'"  foo`));
    }

    {
        import dxml.parser;
        auto xml = "<root>\n" ~
                   "    <function return='vector&lt;int&gt;' name='foo'>\r\n" ~
                   "    <doc_comment>This function does something really\r\n" ~
                   "                 fancy, and you will love it.</doc_comment>\r\n" ~
                   "    <param type='int' name='i'>\r\n" ~
                   "    <param type='const std::string&amp;' name='s'>\r\n" ~
                   "    </function>\n" ~
                   "</root>";
        auto range = parseXML!simpleXML(xml);
        range.popFront();
        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "function");
        {
            auto attrs = range.front.attributes;
            assert(attrs.front.name == "return");
            assert(attrs.front.value == "vector&lt;int&gt;");
            assert(decodeXML(attrs.front.value) == "vector<int>");
            attrs.popFront();
            assert(attrs.front.name == "name");
            assert(attrs.front.value == "foo");
            assert(decodeXML(attrs.front.value) == "foo");
        }
        range.popFront();

        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "doc_comment");
        range.popFront();

        assert(range.front.text ==
               "This function does something really\r\n" ~
               "                 fancy, and you will love it.");
        assert(decodeXML(range.front.text) ==
               "This function does something really\n" ~
               "                 fancy, and you will love it.");
        range.popFront();

        assert(range.front.type == EntityType.elementEnd);
        assert(range.front.name == "doc_comment");
        range.popFront();

        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "param");
        {
            auto attrs = range.front.attributes;
            assert(attrs.front.name == "type");
            assert(attrs.front.value == "int");
            assert(decodeXML(attrs.front.value) == "int");
            attrs.popFront();
            assert(attrs.front.name == "name");
            assert(attrs.front.value == "i");
            assert(decodeXML(attrs.front.value) == "i");
        }
        range.popFront();

        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "param");
        {
            auto attrs = range.front.attributes;
            assert(attrs.front.name == "type");
            assert(attrs.front.value == "const std::string&amp;");
            assert(decodeXML(attrs.front.value) == "const std::string&");
            attrs.popFront();
            assert(attrs.front.name == "name");
            assert(attrs.front.value == "s");
            assert(decodeXML(attrs.front.value) == "s");
        }
    }
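The remark that decodeXML hands back the original string when there is nothing to decode can be checked directly. The following is a minimal sketch that assumes dxml is available as a dependency; the variable names and sample strings are made up for illustration.

```d
import dxml.util : asDecodedXML, decodeXML;
import std.algorithm.comparison : equal;

void main()
{
    // Nothing to decode: per the description above, a string input is
    // returned as-is, so the result is the very same slice.
    string plain = "nothing to decode";
    assert(decodeXML(plain) is plain);

    // Entity references and a carriage return: a decoded string comes back.
    string encoded = "a &lt; b &amp;&amp; c &gt; d\r\n";
    assert(decodeXML(encoded) == "a < b && c > d\n");

    // asDecodedXML yields the same characters as a lazy range.
    assert(equal(asDecodedXML(encoded), "a < b && c > d\n"));
}
```

The identity check with `is` only makes sense for string input; for other range types both functions have to build a new range.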
- deprecated alias normalize = decodeXML(R)(R range) if (isForwardRange!R && isSomeChar!(ElementType!R));
  deprecated alias asNormalized = asDecodedXML(R)(R range) if (isForwardRange!R && isSomeChar!(ElementType!R));

Deprecated: normalize has been renamed to decodeXML, and asNormalized has been renamed to asDecodedXML.

It was pointed out that there's a fairly high chance that std.uni.normalize would be used in conjunction with dxml, making conflicts annoyingly likely. Also, there was no good opposite for normalize for the functions that became encodeAttr and encodeText. denormalizeAttr and denormalizeText would arguably have been a bit ugly.

These aliases have been added to avoid code breakage when upgrading from dxml 0.2.*. They will be removed in dxml 0.4.0.

- Nullable!dchar parseStdEntityRef(R)(ref R range) if(isForwardRange!R && isSomeChar!(ElementType!R));

This parses one of the five predefined entity references mentioned in the XML spec from the front of a range of characters.

If the given range starts with one of the five predefined entity references, then it is removed from the range, and the corresponding character is returned. If the range does not start with one of those references, then the return value is null, and the range is unchanged.

Std Entity Ref | Converts To |
---|---|
&amp; | & |
&gt; | > |
&lt; | < |
&apos; | ' |
&quot; | " |

Any other entity references are left unchanged by parseStdEntityRef, as are any other types of references.

Parameters:
R range  A range of characters.

Returns: The character represented by the predefined entity reference that was parsed from the front of the given range or null if the range did not start with one of the five predefined entity references.

Examples:

    {
        auto range = "&amp;foo";
        assert(range.parseStdEntityRef() == '&');
        assert(range == "foo");
    }
    {
        auto range = "&gt;bar";
        assert(range.parseStdEntityRef() == '>');
        assert(range == "bar");
    }
    {
        auto range = "&lt;baz";
        assert(range.parseStdEntityRef() == '<');
        assert(range == "baz");
    }
    {
        auto range = "&apos;dlang";
        assert(range.parseStdEntityRef() == '\'');
        assert(range == "dlang");
    }
    {
        auto range = "&quot;rocks";
        assert(range.parseStdEntityRef() == '"');
        assert(range == "rocks");
    }
    {
        auto range = " &amp;foo";
        assert(range.parseStdEntityRef().isNull);
        assert(range == " &amp;foo");
    }
    {
        auto range = "&Amp;hello";
        assert(range.parseStdEntityRef().isNull);
        assert(range == "&Amp;hello");
    }
    {
        auto range = "&nbsp;foo";
        assert(range.parseStdEntityRef().isNull);
        assert(range == "&nbsp;foo");
    }
    {
        auto range = "hello world";
        assert(range.parseStdEntityRef().isNull);
        assert(range == "hello world");
    }
- Nullable!dchar parseCharRef(R)(ref R range) if(isForwardRange!R && isSomeChar!(ElementType!R));

If the given range starts with a valid XML character reference, it is removed from the range, and the corresponding character is returned.

If the range does not start with a valid XML character reference, then the return value is null, and the range is unchanged.

Parameters:
R range  A range of characters.

Returns: The character represented by the character reference that was parsed from the front of the given range or null if the range did not start with a valid XML character reference.

See Also: http://www.w3.org/TR/REC-xml/#NT-CharRef
parseStdEntityRef
decodeXML
asDecodedXML
encodeCharRef

Examples:

    {
        auto range = "&#48; hello world";
        assert(parseCharRef(range) == '0');
        assert(range == " hello world");
    }
    {
        auto range = "&#x30; hello world";
        assert(parseCharRef(range) == '0');
        assert(range == " hello world");
    }
    {
        auto range = "&#12487;&#12451;&#12521;&#12531;";
        assert(parseCharRef(range) == 'デ');
        assert(range == "&#12451;&#12521;&#12531;");
        assert(parseCharRef(range) == 'ィ');
        assert(range == "&#12521;&#12531;");
        assert(parseCharRef(range) == 'ラ');
        assert(range == "&#12531;");
        assert(parseCharRef(range) == 'ン');
        assert(range.empty);
    }
    {
        auto range = "&#x;foo";
        assert(parseCharRef(range).isNull);
        assert(range == "&#x;foo");
    }
    {
        auto range = "foobar";
        assert(parseCharRef(range).isNull);
        assert(range == "foobar");
    }
    {
        auto range = " &x48;";
        assert(parseCharRef(range).isNull);
        assert(range == " &x48;");
    }
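Because both parseStdEntityRef and parseCharRef leave the range untouched when they return null, they can be tried one after the other in a simple loop. The sketch below illustrates that composition; decodeRefs is a hypothetical helper, not part of dxml, and unlike decodeXML it does not strip carriage returns.

```d
import dxml.util : parseCharRef, parseStdEntityRef;
import std.range.primitives : empty, front, popFront;

/// Hypothetical helper (not part of dxml): decodes predefined entity
/// references and character references, copying everything else through.
string decodeRefs(string text)
{
    string result;
    while (!text.empty)
    {
        // Try a predefined entity reference first, then a character
        // reference. A null result means `text` was not advanced.
        auto c = parseStdEntityRef(text);
        if (c.isNull)
            c = parseCharRef(text);

        if (!c.isNull)
        {
            result ~= c.get;
        }
        else
        {
            // Neither kind of reference starts here, so copy a single
            // character unchanged and move on.
            result ~= text.front;
            text.popFront();
        }
    }
    return result;
}

void main()
{
    // "&foo;" is not a predefined entity reference, so it is left as-is.
    assert(decodeRefs("&lt;b&gt; &#65;&amp;B &foo;") == "<b> A&B &foo;");
}
```

In practice decodeXML or asDecodedXML is the more convenient choice; the loop is only meant to show how the two lower-level functions behave on success and failure.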
- string stripIndent(R)(R range) if(isForwardRange!R && isSomeChar!(ElementType!R));
  auto withoutIndent(R)(R range) if(isForwardRange!R && isSomeChar!(ElementType!R));

Strips the indent from a character range (most likely from Entity.text). The idea is that if the XML is formatted to be human-readable, and it's multiple lines long, the lines are likely to be indented, but the application probably doesn't want that extra whitespace. So, stripIndent and withoutIndent attempt to intelligently strip off the leading whitespace.

For these functions, whitespace is considered to be some combination of ' ', '\t', and '\r' ('\n' is used to delineate lines, so it's not considered whitespace).

Whitespace characters are stripped from the start of the first line, and then that same number of whitespace characters is stripped from the beginning of each subsequent line (or up to the first non-whitespace character if the line starts with fewer whitespace characters).

If the first line has no leading whitespace, then the leading whitespace on the second line is treated as the indent. This is done to handle the case where there is text immediately after a start tag and the subsequent lines are indented, rather than the text starting on the line after the start tag.

If neither of the first two lines has any leading whitespace, then no whitespace is stripped.

So, if the text is well-formatted, then the indent should be cleanly removed, and if it's unformatted or badly formatted, then no characters other than leading whitespace will be removed, and in principle, no real data will have been lost - though of course, it's up to the programmer to decide whether it's better for the application to try to cleanly strip the indent or to leave the text as-is.

The difference between stripIndent and withoutIndent is that stripIndent returns a string, whereas withoutIndent returns a lazy range of code units. In the case where a string is passed to stripIndent, it will simply return the original string if there is no indent (whereas in other cases, stripIndent and withoutIndent are forced to return new ranges).

Parameters:
R range  A range of characters.

Returns: The text with the indent stripped from each line. stripIndent returns a string, whereas withoutIndent returns a lazy range of code units (so it could be a range of char or wchar and not just dchar; which it is depends on the code units of the range being passed in).

See Also: dxml.parser.EntityRange.Entity.text

Examples:

    import std.algorithm.comparison : equal;

    // The prime use case for these two functions is for an Entity.text section
    // that is formatted to be human-readable, and the rules of what whitespace
    // is stripped from the beginning or end of the range are geared towards
    // the text coming from a well-formatted Entity.text section.
    {
        import dxml.parser;
        auto xml = "<root>\n" ~
                   "    <code>\n" ~
                   "    bool isASCII(string str)\n" ~
                   "    {\n" ~
                   "        import std.algorithm : all;\n" ~
                   "        import std.ascii : isASCII;\n" ~
                   "        return str.all!isASCII();\n" ~
                   "    }\n" ~
                   "    </code>\n" ~
                   "</root>";
        auto range = parseXML(xml);
        range.popFront();
        range.popFront();
        assert(range.front.type == EntityType.text);
        assert(range.front.text ==
               "\n" ~
               "    bool isASCII(string str)\n" ~
               "    {\n" ~
               "        import std.algorithm : all;\n" ~
               "        import std.ascii : isASCII;\n" ~
               "        return str.all!isASCII();\n" ~
               "    }\n" ~
               "    ");
        assert(range.front.text.stripIndent() ==
               "bool isASCII(string str)\n" ~
               "{\n" ~
               "    import std.algorithm : all;\n" ~
               "    import std.ascii : isASCII;\n" ~
               "    return str.all!isASCII();\n" ~
               "}");
    }

    // The indent that is stripped matches the amount of whitespace at the front
    // of the first line.
    assert(("    start\n" ~
            "    foo\n" ~
            "    bar\n" ~
            "        baz\n" ~
            "        xyzzy\n" ~
            "        ").stripIndent() ==
           "start\n" ~
           "foo\n" ~
           "bar\n" ~
           "    baz\n" ~
           "    xyzzy\n" ~
           "    ");

    // If the first line has no leading whitespace but the second line does,
    // then the second line's leading whitespace is treated as the indent.
    assert(("foo\n" ~
            "    bar\n" ~
            "        baz\n" ~
            "        xyzzy").stripIndent() ==
           "foo\n" ~
           "bar\n" ~
           "    baz\n" ~
           "    xyzzy");

    assert(("\n" ~
            "    foo\n" ~
            "    bar\n" ~
            "        baz\n" ~
            "        xyzzy").stripIndent() ==
           "foo\n" ~
           "bar\n" ~
           "    baz\n" ~
           "    xyzzy");

    // If neither of the first two lines has leading whitespace, then nothing
    // is stripped.
    assert(("foo\n" ~
            "bar\n" ~
            "    baz\n" ~
            "    xyzzy\n" ~
            "    ").stripIndent() ==
           "foo\n" ~
           "bar\n" ~
           "    baz\n" ~
           "    xyzzy\n" ~
           "    ");

    // If a subsequent line starts with less whitespace than the indent, then
    // all of its leading whitespace is stripped but no other characters are
    // stripped.
    assert(("    foo\n" ~
            "      bar\n" ~
            "  baz\n" ~
            "      xyzzy").stripIndent() ==
           "foo\n" ~
           "  bar\n" ~
           "baz\n" ~
           "  xyzzy");

    // If the last line is just the indent, then it and the newline before it
    // are stripped.
    assert(("    foo\n" ~
            "      bar\n" ~
            "    ").stripIndent() ==
           "foo\n" ~
           "  bar");

    // If the last line is just whitespace, but it's more than the indent, then
    // the whitespace after the indent is kept.
    assert(("    foo\n" ~
            "      bar\n" ~
            "      ").stripIndent() ==
           "foo\n" ~
           "  bar\n" ~
           "  ");

    // withoutIndent does the same as stripIndent but with a lazy range.
    assert(equal(("  foo\n" ~
                  "    bar\n" ~
                  "    baz\n").withoutIndent(),
                 "foo\n" ~
                 "  bar\n" ~
                 "  baz"));
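Text taken from Entity.text often needs both its indent removed and its references decoded. The sketch below combines stripIndent with decodeXML; the sample text is made up, and the choice to strip the indent before decoding is an assumption on my part (it keeps whitespace produced by a decoded character reference from being treated as indent), not something the documentation prescribes.

```d
import dxml.util : decodeXML, stripIndent;

void main()
{
    // Text as it might appear between an indented start tag and end tag.
    auto text = "\n" ~
                "    if(a &lt; b)\n" ~
                "        swap(a, b);\n" ~
                "    ";

    // The first line is empty, so the second line's four spaces are the
    // indent; the trailing line consisting of just the indent is dropped.
    auto stripped = stripIndent(text);
    assert(stripped == "if(a &lt; b)\n" ~
                       "    swap(a, b);");

    // Decoding afterwards turns the entity reference into '<'.
    assert(decodeXML(stripped) == "if(a < b)\n" ~
                                  "    swap(a, b);");
}
```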
- enum StdEntityRef : string;

The string representations of the five entity references predefined by the XML spec.
amp
- Entity reference for &
gt
- Entity reference for >
lt
- Entity reference for <
apos
- Entity reference for '
quot
- Entity reference for "
- auto encodeText(R)(R text) if(isForwardRange!R && isSomeChar!(ElementType!R));

Returns a lazy range of code units which encodes any characters which cannot be put in a dxml.parser.EntityType.text in their literal form.

encodeText is intended primarily to be used with dxml.writer.XMLWriter.writeText to ensure that characters which cannot appear in their literal form do not appear in their literal form.

Specifically, what encodeText does is

- convert & to &amp;
- convert < to &lt;

Examples:

    import std.algorithm.comparison : equal;

    assert(equal(encodeText(`foo & bar`), `foo &amp; bar`));
    assert(equal(encodeText(`foo < bar`), `foo &lt; bar`));
    assert(equal(encodeText(`foo > bar`), `foo > bar`));
    assert(equal(encodeText(`foo ' bar`), `foo ' bar`));
    assert(equal(encodeText(`foo " bar`), `foo " bar`));

    assert(equal(encodeText("hello world"), "hello world"));
- auto encodeAttr(char quote = '"', R)(R text) if((quote == '"' || quote == '\'') && isForwardRange!R && isSomeChar!(ElementType!R));

Returns a lazy range of code units which encodes any characters which cannot be put in an attribute value of an element tag in their literal form.

encodeAttr is intended primarily to be used with dxml.writer.XMLWriter.writeAttr to ensure that characters which cannot appear in their literal form do not appear in their literal form.

Specifically, what encodeAttr does is

- convert & to &amp;
- convert < to &lt;
- convert ' to &apos; if quote == '\''
- convert " to &quot; if quote == '"'

Examples:

    import std.algorithm.comparison : equal;

    assert(equal(encodeAttr(`foo & bar`), `foo &amp; bar`));
    assert(equal(encodeAttr(`foo < bar`), `foo &lt; bar`));
    assert(equal(encodeAttr(`foo > bar`), `foo > bar`));
    assert(equal(encodeAttr(`foo ' bar`), `foo ' bar`));
    assert(equal(encodeAttr(`foo " bar`), `foo &quot; bar`));

    assert(equal(encodeAttr!'\''(`foo ' bar`), `foo &apos; bar`));
    assert(equal(encodeAttr!'\''(`foo " bar`), `foo " bar`));

    assert(equal(encodeAttr("hello world"), "hello world"));
- auto encodeCharRef(dchar c);

Returns a range of char containing the character reference corresponding to the given character.

Parameters:
dchar c  The character to encode.

See Also: parseCharRef

Examples:

    import std.algorithm.comparison : equal;

    assert(equal(encodeCharRef(' '), "&#x20;"));
    assert(equal(encodeCharRef('A'), "&#x41;"));
    assert(equal(encodeCharRef('\u2424'), "&#x2424;"));

    auto range = encodeCharRef('*');
    assert(parseCharRef(range) == '*');
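Since parseCharRef is listed as the counterpart of encodeCharRef, a round trip over a handful of characters is a quick sanity check. This is a minimal sketch assuming dxml is available; the set of sample characters is arbitrary.

```d
import dxml.util : encodeCharRef, parseCharRef;
import std.range.primitives : empty;

void main()
{
    // A mix of characters that need escaping in XML and non-ASCII ones.
    foreach (dchar c; "<&\"'\u00E9\u30C7")
    {
        // encodeCharRef returns a range of char; parseCharRef takes it by
        // ref and consumes the reference from its front.
        auto range = encodeCharRef(c);
        auto parsed = parseCharRef(range);

        assert(!parsed.isNull && parsed.get == c);
        assert(range.empty); // the whole reference was consumed
    }
}
```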