textToWords method

@override
List<String> textToWords(String text)

Given unsegmented text, perform text segmentation particular to the language and return a list of parsed words.

For example, in the case of Japanese, '日本語は難しいです。', this should ideally return a list containing '日本語', 'は', '難しい', 'です', '。'.

In the case of English, 'This is a pen.' should ideally return a list containing 'This', ' ', 'is', ' ', 'a', ' ', 'pen', '.'. For languages that use delimiters such as spaces, the delimiters should be kept as their own entries in the list so that joining the list reproduces the original text.
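The English case can be illustrated with a minimal sketch. The helper name `splitKeepingDelimiters` and its regular expression are assumptions made for illustration only; they are not part of this class, and a real implementation may segment differently.

```dart
// Hypothetical helper, not part of the documented API.
// Splits text into runs of letters, digits and apostrophes, and emits every
// other character (spaces, punctuation, newlines) as its own entry.
List<String> splitKeepingDelimiters(String text) {
  final pattern = RegExp(r"[A-Za-z0-9']+|[^A-Za-z0-9']");
  return pattern.allMatches(text).map((match) => match.group(0)!).toList();
}

void main() {
  final words = splitKeepingDelimiters('This is a pen.');
  print(words);

  // Because delimiters are kept, joining the list restores the input.
  print(words.join() == 'This is a pen.');
}
```

Keeping the delimiters as list entries means a caller can rebuild the original text with `join()` without tracking offsets.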

Implementation

@override
List<String> textToWords(String text) {
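  // Swap whitespace for placeholder characters before tokenising.
  // NOTE: the first two patterns appear to have lost their original
  // characters in extraction (an empty pattern would match between every
  // character). U+FEFF and U+3000 are assumed here, not confirmed.
  String delimiterSanitisedText = text
      .replaceAll('\u{FEFF}', '␝')
      .replaceAll('\u{3000}', '␝')
      .replaceAll('\n', '␜')
      .replaceAll(' ', '␝');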

  List<Word> tokens = parseVe(mecab, delimiterSanitisedText);

  List<String> terms = [];

  for (Word word in tokens) {
    // Join the surfaces of the word's tokens into a single term.
    final buffer = StringBuffer();
    for (TokenNode node in word.tokens) {
      buffer.write(node.surface);
    }

    // Restore the original newline and space delimiters.
    String term = buffer.toString();
    term = term.replaceAll('␜', '\n').replaceAll('␝', ' ');
    terms.add(term);
  }

  return terms;
}
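The placeholder round trip used above can be tried in isolation with the sketch below. It leaves out MeCab and `parseVe` entirely. The helper names `sanitise` and `restore` are hypothetical, and the motivation (that the tokenizer might otherwise drop or alter whitespace) is an assumption, not something this page states.

```dart
// Placeholder characters copied from the implementation above.
const spacePlaceholder = '␝';
const newlinePlaceholder = '␜';

// Hypothetical helper: replaces newlines and spaces with placeholders.
String sanitise(String text) => text
    .replaceAll('\n', newlinePlaceholder)
    .replaceAll(' ', spacePlaceholder);

// Hypothetical helper: reverses sanitise on a single term.
String restore(String term) => term
    .replaceAll(newlinePlaceholder, '\n')
    .replaceAll(spacePlaceholder, ' ');

void main() {
  const input = 'a pen\nand ink';
  final sanitised = sanitise(input);
  print(sanitised);

  // Restoring the placeholders gives back the original text.
  print(restore(sanitised) == input);
}
```

The round trip is only exact when the input does not already contain the placeholder characters, and when every character mapped to a placeholder is restored to the same character it came from.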