Whitespace

Definition

In USFM/USX, a 'common whitespace' (WS) character is defined by the pattern /(?:${ws}|${nl}|$)/, where:

ws

/[\u0009\u000D\u000A\u0020]/

  • A single whitespace character: tab (U+0009), carriage return (U+000D), line feed (U+000A), or space (U+0020).

nl

/(?:\u000D?\u000A|\u000D)/

  • A single newline: LF, CRLF, or a lone CR, which covers the line-ending conventions of all operating systems.

Common whitespace is found throughout the document content for most languages.

Some common whitespace in USFM/USX documents is 'structural whitespace'. Structural (non-content) whitespace exists either to aid in the readability of a source file, or to delimit markers. Structural whitespace is identified in the documentation’s syntax diagrams (WS/Ws/HS/Hs) and is not part of the document content.

All other characters are always treated as content (including other Unicode whitespace characters).
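The definitions above can be tried out with a short Python sketch. The patterns are transcribed from this section; the helper names (is_common_ws, is_newline) are illustrative only and are not part of USFM/USX or of any existing library.

```python
import re

# Patterns transcribed from the definitions in this section.
WS = r"[\u0009\u000D\u000A\u0020]"
NL = r"(?:\u000D?\u000A|\u000D)"

# Combined pattern. The trailing `$` lets the end of input act as a boundary.
COMMON_WS = re.compile(f"(?:{WS}|{NL}|$)")


def is_common_ws(ch: str) -> bool:
    """Return True if `ch` is one of the four common whitespace characters."""
    return re.fullmatch(WS, ch) is not None


def is_newline(s: str) -> bool:
    """Return True if `s` is exactly one newline: LF, CRLF, or a lone CR."""
    return re.fullmatch(NL, s) is not None
```

Under this sketch, is_common_ws("\t") is true, while is_common_ws("\u00a0") is false: a no-break space is Unicode whitespace, but it falls outside the common set and is therefore treated as content.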

Reducing or Eliminating Common Whitespace

  • Any common whitespace is reduced according to these rules:

    • Any string of WS chars which includes a newline is reduced to a single newline.

    • Any string of WS chars which does not include a newline is reduced to a single space.

  • Common whitespace can be eliminated when it occurs:

    • At the start of any element’s content.

    • At the end of any element’s content. The exception is whitespace at the end of a character element that is followed by text or another character element; that whitespace is kept.

  • Common whitespace canonicalization is done as follows:

    • A newline before a character marker is converted to a space. That space is content.

    • Any whitespace before a paragraph marker, including none, is replaced by a newline. This newline is not content.
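The reduction and elimination rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names and the is_char_element / followed_by_inline flags are hypothetical, and emitting LF as the "single newline" is an assumption. Marker-aware canonicalization is left out because it depends on knowing which markers are paragraph markers and which are character markers.

```python
import re

COMMON_WS_CHARS = "\t\r\n "
# One or more consecutive common whitespace characters.
WS_RUN = re.compile(r"[\u0009\u000D\u000A\u0020]+")


def reduce_common_ws(text: str) -> str:
    """Collapse each run of common whitespace to one newline or one space."""
    def collapse(match: re.Match) -> str:
        run = match.group(0)
        # A run containing CR or LF includes a newline (assumption: emit LF).
        return "\n" if ("\n" in run or "\r" in run) else " "
    return WS_RUN.sub(collapse, text)


def trim_element_content(content: str, *, is_char_element: bool = False,
                         followed_by_inline: bool = False) -> str:
    """Drop common whitespace at the start and, where allowed, at the end.

    `followed_by_inline` (hypothetical flag) means the element is followed
    by text or another character element.
    """
    content = content.lstrip(COMMON_WS_CHARS)
    if not (is_char_element and followed_by_inline):
        content = content.rstrip(COMMON_WS_CHARS)
    return content
```

For example, reduce_common_ws("In  the\t beginning") yields "In the beginning", and reduce_common_ws("one \r\n  two") yields "one\ntwo". Other Unicode whitespace, such as U+00A0, is left untouched because it is content.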

USX documents contain common whitespace that delimits aspects of the XML syntax itself. XML allows unlimited whitespace wherever its syntax permits whitespace; such whitespace is typically added to format the file for easier reading (i.e., 'pretty-printing').
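As a rough illustration of pretty-printing whitespace, the Python sketch below compares a compact and an indented version of a simplified USX-like fragment. It assumes that whitespace-only text directly between the children of the root element is formatting rather than content; the fragment and the helper name drop_formatting_ws are illustrative only.

```python
import xml.etree.ElementTree as ET

COMMON_WS_CHARS = "\t\r\n "


def drop_formatting_ws(root: ET.Element) -> None:
    """Remove whitespace-only text found directly between root's children."""
    if root.text is not None and root.text.strip(COMMON_WS_CHARS) == "":
        root.text = None
    for child in root:
        if child.tail is not None and child.tail.strip(COMMON_WS_CHARS) == "":
            child.tail = None


# Simplified fragment: same content, with and without pretty-printing.
compact = ('<usx version="3.0"><para style="p">In the beginning</para>'
           '<para style="p">And the earth</para></usx>')
pretty = """<usx version="3.0">
  <para style="p">In the beginning</para>
  <para style="p">And the earth</para>
</usx>"""

root = ET.fromstring(pretty)
drop_formatting_ws(root)
normalized = ET.tostring(root, encoding="unicode")
reference = ET.tostring(ET.fromstring(compact), encoding="unicode")
```

After the formatting whitespace is dropped, normalized and reference are the same string, while the text inside each para element is unchanged.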