Implementation Details#

Last Updated on 2023-04-20

Behind the scenes, the TextWrapper implementation uses a state-machine-based tokenizer to transform the input text into a stream of tokens (word chunks, white spaces, and paragraph markers) that the text wrapping algorithm then consumes.

Tokenizer#

class Tokenizer#

Transform text composed of lines and paragraphs into a stream of typed tokens for further processing by a token consumer.

To make text processing and formatting simpler, the algorithms work on indivisible chunks of text separated by white spaces and, where present, paragraph markers.

Chunks are not the same as words; for example, when word breaking on hyphens is enabled, a hyphenated word is broken into multiple chunks exactly at the hyphens.

Chunks never contain white space. Contiguous white space characters are concatenated and presented as a single token. A special case is two consecutive \n characters: this sequence is treated as a paragraph marker and presented as a dedicated token, TokenType::ParagraphMark.
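For instance, here is a minimal sketch that counts paragraphs by watching for TokenType::ParagraphMark tokens, reusing the Tokenize() API documented below (the exact namespace qualification of TokenType may differ in the headers):

  // Count paragraph breaks in the input. The constructor arguments
  // are explained in the configuration list below.
  const Tokenizer tokenizer{"\t", true, true};
  std::size_t paragraphs = 1;
  const auto status = tokenizer.Tokenize(
      "First paragraph.\n\nSecond paragraph.",
      [&paragraphs](TokenType token_type, std::string /*token*/) {
        if (token_type == TokenType::ParagraphMark) {
          ++paragraphs;
        }
      });
  // status == true on success; paragraphs == 2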

As an example, the text:

“Just plain finger-licking good!”

breaks into the following chunks:

‘Just’, ‘ ’, ‘plain’, ‘ ’, ‘finger-’, ‘licking’, ‘ ’, ‘good!’

if break_on_hyphens is true; or into:

‘Just’, ‘ ’, ‘plain’, ‘ ’, ‘finger-licking’, ‘ ’, ‘good!’

otherwise.

In addition to breaking text into chunks, the Tokenizer is also responsible for implementing several specific behaviors prior to the text wrapping/formatting, most of which can be controlled by configuration parameters passed to the Tokenizer constructor:

  1. Tab expansion:

    controlled with the tab configuration parameter. All tab characters in the text will be replaced with the content of tab. For example, to expand tabs to spaces, specify a tab value of as many spaces as a tab character should expand to (see the sketch after this list). To keep tabs as they are, simply specify a tab value of "\t".

  2. Special characters:

    The special characters ‘\r’ and ‘\f’ are always ignored, as they add no value to the proper formatting and wrapping of the text.

    Both ‘\n’ and ‘\v’ are considered line breaks.

  3. Collapse white space:

    controlled with the collapse_ws configuration parameter. If true, any contiguous run of white space characters will be replaced with a single <SPACE>.

  4. Break on hyphens:

    controlled with the break_on_hyphens configuration parameter. If true, compound words will be broken into separate chunks right after the hyphens, as is customary in English. If false, only white spaces will be considered chunk boundaries.
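As referenced above, a minimal sketch showing tab expansion combined with white space collapsing (the token-collection pattern is the same as in the example below):

  // Expand each tab to four spaces; the expanded spaces then
  // collapse, together with the neighboring white space, into a
  // single " " token.
  const Tokenizer tokenizer{"    ", /*collapse_ws=*/true,
                            /*break_on_hyphens=*/false};
  std::vector<Token> tokens;
  const auto status = tokenizer.Tokenize(
      "one \t two", [&tokens](TokenType token_type, std::string token) {
        if (token_type != detail::TokenType::EndOfInput) {
          tokens.emplace_back(token_type, std::move(token));
        }
      });
  // Produces the tokens: "one", " ", "two"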

Example

  // Expand tabs to a single space, collapse white space runs, and
  // break compound words at hyphens.
  constexpr const char *tab = " ";
  constexpr bool collapse_ws = true;
  constexpr bool break_on_hyphens = true;

  const Tokenizer tokenizer{tab, collapse_ws, break_on_hyphens};

  constexpr const char *text = "Why? \nJust plain \tfinger-licking good!";
  std::vector<Token> tokens;
  // Collect every token except the final EndOfInput marker.
  const auto status = tokenizer.Tokenize(
      text, [&tokens](TokenType token_type, std::string token) {
        if (token_type != detail::TokenType::EndOfInput) {
          tokens.emplace_back(token_type, std::move(token));
        }
      });
  // All white spaces replaced and collapsed, hyphenated words
  // broken, to produce the following tokens:
  //     "Why?", " ", "Just", " ", "plain", " ",
  //     "finger-", "licking", " ", "good!"

Public Functions

inline explicit Tokenizer(std::string tab, bool collapse_ws, bool break_on_hyphens)#

Create a new instance of the Tokenizer class configured with the given parameters.

See also

Tokenizer class documentation for a detailed description of all configuration parameters and the associated behaviors.

Parameters:
  • tab – the string to which each tab character will be expanded.

  • collapse_ws – controls collapsing of multiple white spaces into a single space.

  • break_on_hyphens – controls whether hyphens can be used to break words into multiple chunks.

ASAP_TEXTWRAP_API auto Tokenize(const std::string &text, const TokenConsumer &consume_token) const -> bool#

Transform the given text into a stream of tokens.

Tokens produced by the Tokenizer are consumed via the TokenConsumer passed as an argument to this method.

Parameters:
  • text – the input text to be split into tokens.

  • consume_token – the token consumer which will be called each time a token is produced.

Returns:

true if the tokenization completed successfully; false otherwise.
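
For instance (a sketch; the error-handling strategy is left to the caller), the returned status can be used to discard a partially produced token stream:

  // consume_token is any callable matching the TokenConsumer
  // signature, as in the example above.
  if (!tokenizer.Tokenize(text, consume_token)) {
    // Tokenization did not complete; the tokens consumed so far do
    // not represent the full input and should be discarded.
  }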