TOKENIZE

Description

This function tokenizes text by using the specified tokenizer and optional JSON parameters.

Syntax

TOKENIZE('text', ['parser'], ['behavior_ctrl'])

Parameters

• text: The text to be tokenized. It can be of the TEXT, CHAR, or VARCHAR data type.
• parser: The name of the tokenizer. Valid values: BENG (basic English), NGRAM (Chinese), SPACE (space-delimited), and IK (Chinese).
• behavior_ctrl: JSON parameters for optional configurations (see the sketch after this list). Valid keys:
  • stopwords: the stopword list. If this key is not specified, the global default stopword list is used. If this key is specified as empty, no stopwords are used.
  • case: the case handling. If this key is not specified, the system default behavior is used. If this key is specified, the text is converted to uppercase or lowercase before tokenization.
  • output: the output format. Valid values:
    • default: the output is a JSON array that contains only the tokens, for example, ["hello", "world", "english"].
    • all: the output is a JSON object that contains doc_len and the token frequencies, for example, {"doc_len": 3, "tokens": [{"i": 1}, {"love": 1}, {"china": 1}]}.
  • additional-args: tokenizer-specific parameters, for example, token_size: 2 for the ngram tokenizer.
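As a hedged sketch of how these options might be combined, assuming each option is passed as its own object in the JSON array (the format used in the Examples section below), and assuming that stopwords takes a JSON list of words and case takes a value such as "lower" (value formats not confirmed by this page):

-- Sketch only: replace the default stopword list with ["i"] and
-- convert the text to lowercase before tokenization.
-- The exact key and value formats may differ in your database version.
SELECT TOKENIZE('I Love China', 'beng', '[{"stopwords": ["i"]}, {"case": "lower"}]');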

Examples

Use the TOKENIZE function to split the string I Love China into words with the beng tokenizer, and use JSON parameters to set the output format.

SELECT TOKENIZE('I Love China','beng', '[{"output": "all"}]');

The following result is returned:

+--------------------------------------------------------+
| TOKENIZE('I Love China','beng', '[{"output": "all"}]') |
+--------------------------------------------------------+
| {"tokens": [{"love": 1}, {"china": 1}], "doc_len": 2} |
+--------------------------------------------------------+
1 row in set (0.001 sec)
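
For comparison, omitting behavior_ctrl falls back to the default output format described above: a JSON array that contains only the tokens. Given the stopword handling shown in the example, the result should resemble ["love", "china"] (illustrative, not captured from a live run):

-- Default output: a JSON array of tokens only.
-- The beng stopword list is expected to drop "i".
SELECT TOKENIZE('I Love China', 'beng');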