TOKENIZE

Description

This function tokenizes text by using the specified tokenizer and optional JSON parameters.

Syntax

TOKENIZE('text', ['parser'], ['behavior_ctrl'])

Parameters

• text: The text to be tokenized. It can be of the TEXT, CHAR, or VARCHAR data type.
• parser: The name of the tokenizer. Valid values: BENG (basic English), NGRAM (Chinese), SPACE (space-delimited), and IK (Chinese).
• behavior_ctrl: JSON parameters for optional configurations (see the sketch after this list). Valid keys:
  • stopwords: the stopword list. If this key is not specified, the global default stopword list is used. If this key is specified as empty, no stopwords are used.
  • case: the case handling. If this key is not specified, the system default behavior is used. If this key is specified, the text is converted to uppercase or lowercase before tokenization.
  • output: the output format. Valid values:
    • default: the output is a JSON array that contains only the tokens, for example, ["hello", "world", "english"].
    • all: the output is a JSON object that contains doc_len and the token frequencies, for example, {"doc_len": 3, "tokens": [{"i": 1}, {"love": 1}, {"china": 1}]}.
  • additional-args: tokenizer-specific parameters, for example, token_size: 2 for the ngram tokenizer.
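As a hedged sketch of how these options might be combined, assuming each option is passed as its own object in the JSON array (the format used in the Examples section below), and assuming that stopwords takes a JSON list of words and case takes a value such as "lower" (value formats not confirmed by this page):

-- Sketch only: replace the default stopword list with ["i"] and
-- convert the text to lowercase before tokenization.
-- The exact key and value formats may differ in your database version.
SELECT TOKENIZE('I Love China', 'beng', '[{"stopwords": ["i"]}, {"case": "lower"}]');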

Examples

Use the TOKENIZE function to split the string I Love China into words with the beng tokenizer, and use JSON parameters to set the output format.

SELECT TOKENIZE('I Love China','beng', '[{"output": "all"}]');

The following result is returned:

+--------------------------------------------------------+
| TOKENIZE('I Love China','beng', '[{"output": "all"}]') |
+--------------------------------------------------------+
| {"tokens": [{"love": 1}, {"china": 1}], "doc_len": 2} |
+--------------------------------------------------------+
1 row in set (0.001 sec)
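
For comparison, omitting behavior_ctrl falls back to the default output format described above: a JSON array that contains only the tokens. Given the stopword handling shown in the example, the result should resemble ["love", "china"] (illustrative, not captured from a live run):

-- Default output: a JSON array of tokens only.
-- The beng stopword list is expected to drop "i".
SELECT TOKENIZE('I Love China', 'beng');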