Whitespace
Definition
In USFM/USX, a 'common whitespace' (WS) character is defined by the pattern /(?:${ws}|${nl}|$)/, where:
- ws
  - /[\u0009\u000D\u000A\u0020]/
  - A single whitespace character.
- nl
  - /(?:\u000D?\u000A|\u000D)/
  - A single newline (as supported by all operating systems).
Common whitespace is found throughout the document content for most languages.
Some common whitespace in USFM/USX documents is 'structural whitespace'. Structural (non-content) whitespace exists either to aid in the readability of a source file, or to delimit markers. Structural whitespace is identified in the documentation’s syntax diagrams (WS/Ws/HS/Hs) and is not part of the document content.
All other characters are always treated as content (including other Unicode whitespace characters).
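As a rough illustration of the definition above, the two character classes can be transcribed into Python's `re` module. This is only a sketch: the names `WS`, `NL`, and `is_common_whitespace` are hypothetical helpers chosen for this example and are not part of USFM/USX.

```python
import re

# Patterns transcribed from the definition above.
WS = r"[\u0009\u000D\u000A\u0020]"   # tab, CR, LF, space
NL = r"(?:\u000D?\u000A|\u000D)"     # CRLF, LF, or a lone CR

ws_re = re.compile(WS)
nl_re = re.compile(NL)


def is_common_whitespace(ch: str) -> bool:
    """Return True if the single character ch is common whitespace.

    Hypothetical helper for illustration only.
    """
    return len(ch) == 1 and ws_re.fullmatch(ch) is not None
```

Under this sketch, a tab or a space is common whitespace, while other Unicode whitespace such as U+00A0 (no-break space) is not, and would therefore be treated as content.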
Reducing or Eliminating Common Whitespace
- Any common whitespace is reduced according to these rules:
  - Any string of WS chars which includes a newline is reduced to a single newline.
  - Any string of WS chars which does not include a newline is reduced to a single space.
- Common whitespace can be eliminated when it occurs:
Common whitespace canonicalization is done as follows:
USX documents contain common whitespace that delimits aspects of the XML syntax itself. XML allows unlimited whitespace wherever the XML syntax permits it; such whitespace is typically added to format the file for easier reading (i.e. 'pretty-printing').
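The two reduction rules listed earlier in this section (a run of WS characters containing a newline becomes a single newline; any other run becomes a single space) can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name `reduce_common_whitespace` is hypothetical, and emitting the reduced newline as U+000A is an assumption, since the text does not say which newline form to write.

```python
import re

# A run of one or more common whitespace characters (tab, CR, LF, space).
WS_RUN = re.compile(r"[\u0009\u000D\u000A\u0020]+")


def reduce_common_whitespace(text: str) -> str:
    """Collapse each run of common whitespace per the two reduction rules.

    Hypothetical helper; other Unicode whitespace is left untouched
    because it counts as content.
    """
    def collapse(match: re.Match) -> str:
        run = match.group(0)
        # Assumption: a reduced newline is written as LF (U+000A).
        if "\n" in run or "\r" in run:
            return "\n"
        return " "

    return WS_RUN.sub(collapse, text)
```

For example, `reduce_common_whitespace("a \r\n  b")` returns `"a\nb"`, and `reduce_common_whitespace("a \t b")` returns `"a b"`. The sketch applies the reduction rules only; it does not attempt the elimination or canonicalization steps.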