Full-text indexes
In seekdb, full-text indexes can be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types. Additionally, a table can have multiple full-text indexes, and more than one full-text index can be created on the same column.
Full-text indexes can be created on both partitioned and non-partitioned tables, regardless of whether they have a primary key. The limitations for creating full-text indexes are as follows:
- Full-text indexes can only be applied to columns of `CHAR`, `VARCHAR`, and `TEXT` types.
- The current version only supports creating local (`LOCAL`) full-text indexes.
- The `UNIQUE` keyword cannot be specified when creating a full-text index.
- If you want to create a full-text index involving multiple columns, you must ensure that these columns have the same character set.
Within these rules, seekdb's full-text indexing provides efficient search and retrieval for text data.
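Putting these rules together, the sketch below creates a table with a full-text index over two text-type columns that share one character set. The `articles` schema is an assumption chosen to line up with the DML examples in the next section; adjust names and types as needed.

```sql
-- Minimal sketch (assumed schema): a full-text index on VARCHAR and TEXT columns
-- that share the table's character set.
CREATE TABLE articles (
  title   VARCHAR(200) PRIMARY KEY,
  context TEXT,
  FULLTEXT INDEX full_idx_articles(title, context)
) DEFAULT CHARSET = utf8mb4;
```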
DML operations
For tables with full-text indexes, complex DML operations are supported, including INSERT INTO ON DUPLICATE KEY, REPLACE INTO, multi-table updates/deletes, and updatable views.
Examples:
- `INSERT INTO ... ON DUPLICATE KEY UPDATE`:

  ```sql
  INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key')
  ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';
  ```

- `REPLACE INTO`:

  ```sql
  REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');
  ```

- Multi-table updates and deletes:

  - Create table `tbl1`.

    ```sql
    CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));
    ```

  - Create table `tbl2`.

    ```sql
    CREATE TABLE tbl2 (a int PRIMARY KEY, b text);
    ```

  - Perform an update (`UPDATE`) operation on multiple tables.

    ```sql
    UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a
    SET tbl1.b = 'dddd', tbl2.b = 'eeee';
    UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
    UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;
    ```

  - Perform a delete (`DELETE`) operation on multiple tables.

    ```sql
    DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
    DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
    ```
- DML operations on updatable views:

  - Create view `fts_view`.

    ```sql
    CREATE VIEW fts_view AS SELECT * FROM tbl1;
    ```

  - Perform an `INSERT` operation on the updatable view.

    ```sql
    INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');
    ```

  - Perform an `UPDATE` operation on the updatable view.

    ```sql
    UPDATE fts_view SET b = 'dddd';
    UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a
    SET fts_view.b = 'dddd', tbl2.b = 'eeee';
    ```

  - Perform a `DELETE` operation on the updatable view.

    ```sql
    DELETE FROM fts_view WHERE b = 'dddd';
    DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;
    ```
Full-text index tokenizer
seekdb's full-text index functionality supports multiple built-in tokenizers, so you can choose the text tokenization strategy that best fits your business scenario. The default tokenizer is Space; any other tokenizer must be specified explicitly with the `WITH PARSER` clause.
List of tokenizers:
- Space tokenizer
- Basic English tokenizer
- IK tokenizer
- Ngram tokenizer
- Ngram2 tokenizer
- Jieba tokenizer
Configuration example:
When creating or modifying a table, specify the tokenizer type for the full-text index by setting the WITH PARSER tokenizer_option parameter in the CREATE TABLE/ALTER TABLE statement.
```sql
-- Specify the tokenizer when creating a table.
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
    FULLTEXT INDEX full_idx1_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3));

-- Specify the tokenizer when adding a full-text index to an existing table (ALTER TABLE).
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx2_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3);
```
Space tokenizer (default)
Concepts:
- This tokenizer splits text using spaces, punctuation marks (such as commas and periods), or non-alphanumeric characters (except the underscore `_`) as delimiters.
- The tokenization results include only valid tokens with lengths between `min_token_size` (default 3) and `max_token_size` (default 84).
- Consecutive Chinese characters are treated as a single token.
Applicable scenarios:
- Languages separated by spaces such as English (for example "apple watch series 9").
- Chinese text with manually added delimiters (for example, "南京 长江大桥").
Tokenization result:
```
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM,一平方公里也很小 hello-word h_name", 'space');
+-------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见www.XXX.COM, 邮箱xx@OB.COM,一平方公里也很小 hello-word h_name", 'space') |
+-------------------------------------------------------------------------------------------------------------+
| ["详见www", "一平方公里也很小", "xxx", "南京市长江大桥有1千米长", "邮箱xx", "word", "hello", "h_name"] |
+-------------------------------------------------------------------------------------------------------------+
```
Example explanation:
- Spaces, commas, periods, and other symbols serve as delimiters, and consecutive Chinese characters are treated as words.
Basic English (Beng) tokenizer
Concepts:
- Similar to the Space tokenizer, but treats underscores (`_`) as delimiters instead of preserving them.
- Suitable for separating English phrases, but has limited effectiveness in splitting terms without spaces (such as "iPhone15").
Applicable scenarios:
- Basic retrieval of English documents (such as logs, comments).
Tokenization result:
```
OceanBase [(root@oceanbase)]> select tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng');
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("System log entry: server_status is active, visit www.EXAMPLE.COM, contact admin@DB.COM, response_time 150ms user_name", 'beng') |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| ["user", "log", "system", "admin", "contact", "server", "active", "visit", "status", "entry", "example", "name", "time", "response", "150ms"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
```
Example explanation:
- Underscores (`_`) act as delimiters, so words that contain them are split (for example, `server_status` -> `server`, `status`; `user_name` -> `user`, `name`). The core difference from the Space tokenizer lies in how underscores are handled.
Ngram tokenizer
Concepts:
- Fixed n-value tokenization: by default, n = 2. This tokenizer splits consecutive non-delimiter characters into subsequences of length n.
- Delimiter rules follow the Space tokenizer (preserving `_`, digits, and letters).
- Does not support length limit parameters; outputs all possible tokens of length n.
Applicable scenarios:
- Fuzzy matching for short text (such as user IDs, order numbers).
- Scenarios requiring fixed-length feature extraction (such as password policy analysis).
Tokenization result:
```
OceanBase [(root@oceanbase)]> select tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram');
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Order ID: ORD12345, user_account: john_doe, email support@example.com, tracking code ABC-XYZ-789", 'ngram') |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["ab", "hn", "am", "r_", "em", "le", "po", "ma", "ou", "xy", "jo", "pl", "_d", "89", "yz", "xa", "ck", "in", "se", "tr", "oh", "12", "d1", "il", "oe", "45", "un", "ac", "co", "ex", "us", "23", "34", "or", "er", "mp", "up", "de", "su", "rt", "pp", "n_", "nt", "ki", "rd", "_a", "bc", "ng", "cc", "od", "om", "78", "ra", "ai", "do", "id"] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
Example explanation:
- With the default setting n = 2, this tokenizer outputs all consecutive 2-character tokens, including overlapping ones (for example, `ORD12345` -> `or`, `rd`, `d1`, `12`, `23`, `34`, `45`; `user_account` -> `us`, `se`, `er`, `r_`, `_a`, `ac`, `cc`, `co`, `ou`, `un`, `nt`).
Ngram2 tokenizer
Concepts:
- Supports a dynamic n-value range: the token length range is set through the `min_ngram_size` and `max_ngram_size` parameters.
- Suitable for scenarios requiring multi-length token coverage.
Applicable scenarios: Scenarios that require multiple fixed-length tokens simultaneously.
When using the ngram2 tokenizer, be aware of its high memory consumption. For example, setting a large range for min_ngram_size and max_ngram_size parameters will generate a large number of token combinations, which may lead to excessive resource consumption.
Tokenization result:
```
OceanBase [(root@oceanbase)]> select tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("user_login_session_2024", 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 4}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["io", "lo", "r_lo", "_ses", "_l", "r_", "ss", "user", "ses", "_s", "ogin", "sion", "on", "ess", "20", "logi", "er_", "on_", "use", "essi", "in", "se", "sio", "log", "202", "gin_", "_2", "ssi", "ogi", "us", "n_se", "r_l", "er", "024", "es", "n_2", "og", "_lo", "n_", "_log", "2024", "n_20", "gi", "er_l", "ser", "24", "ssio", "n_s", "gin", "in_", "_se", "02", "_20", "si", "sess", "on_2", "ion_", "ser_", "ion", "_202", "in_s"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
Example explanation:
- This tokenizer outputs all consecutive subsequences with lengths between 2 and 4 characters, with overlapping tokens allowed (for example, `user_login_session_2024` generates tokens such as `us`, `use`, `user`, `se`, `ser`, `ser_`, `er_`, `er_l`, `r_lo`, `log`, `logi`, `ogin`, and so on).
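To build an index with ngram2 rather than calling `tokenize()` ad hoc, a sketch is shown below. It assumes that `min_ngram_size` and `max_ngram_size` can be passed through `PARSER_PROPERTIES` the same way `ngram_token_size` is in the configuration example above, and that the parser name in DDL is `NGRAM2`; verify the exact syntax against the Create an index reference.

```sql
-- Hedged sketch: the parser name and PARSER_PROPERTIES keys for ngram2 are assumptions.
CREATE TABLE ids (
  id  INT PRIMARY KEY,
  tag VARCHAR(64),
  FULLTEXT INDEX full_idx_tag(tag)
    WITH PARSER NGRAM2
    PARSER_PROPERTIES=(min_ngram_size=2, max_ngram_size=4)
);
```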
IK tokenizer
Concepts:
- A Chinese tokenizer based on the open-source IK Analyzer, supporting two modes:
  - Smart mode: prioritizes outputting longer words, reducing the number of splits (for example, "南京市" is not split into "南京" and "市").
  - Max Word mode: outputs all possible shorter words as well (for example, "南京市" is split into "南京" and "市").
- Automatically recognizes English words, email addresses, URLs (without `://`), IP addresses, and other formats.
Applicable scenarios: Chinese word segmentation
Business scenarios:
- E-commerce product description search (for example, precise matching for "华为Mate60").
- Social media content analysis (for example, keyword extraction from user comments).

Tokenization result:

- Smart mode: ensures that each character belongs to only one word with no overlap, keeps individual words as long as possible, and minimizes the total number of words. It also attempts to combine numerals and quantifiers into a single token.
```
OceanBase [(root@oceanbase)]> select tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("南京市长江大桥有1千米长,详见WWW.XXX.COM, 邮箱xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["邮箱", "hello_word", "192.168.1.1", "hello-word", "长江大桥", "www.baidu.com", "www.xxx.com", "xx@ob.com", "长", "http", "1千米", "详见", "南京市", "有"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
- max_word mode: allows the same character to appear in multiple tokens, producing as many candidate words as possible.
```
OceanBase [(root@oceanbase)]> select tokenize("The Nanjing Yangtze River Bridge is 1 kilometer long. For more information, see www.xxx.com. E-mail: xx@ob.com.", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("The Nanjing Yangtze River Bridge is 1 kilometer long. For more information, see www.xxx.com. E-mail: xx@ob.com.", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["kilometer", "Yangtze River Bridge", "city", "dry", "Nanjing City", "Nanjing", "kilometers", "xx", "www.xxx.com", "long", "www", "xx@ob.com", "Yangtze River", "ob", "XXX", "com", "see", "l", "is", "Bridge", "E-mail"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
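To use IK for a persistent index rather than an ad-hoc `tokenize()` call, the `WITH PARSER` clause from the configuration example applies. The sketch below assumes that `IK` is accepted as the parser name in DDL; the table, column, and index names are illustrative.

```sql
-- Hedged sketch: a full-text index that tokenizes Chinese text with IK.
CREATE TABLE cn_articles (
  id    INT PRIMARY KEY,
  title VARCHAR(200),
  body  TEXT,
  FULLTEXT INDEX full_idx_cn(title, body) WITH PARSER IK
);
```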
jieba tokenizer
Concept: A tokenizer based on the open-source jieba tool from the Python ecosystem, supporting precise mode, full mode, and search engine mode.
Features:
- Precise mode: Strictly segments words according to the dictionary (for example, "不能" is not segmented into "不" and "能").
- Full mode: Lists all possible segmentation combinations.
- Search engine mode: Balances precision and recall rate (for example, "南京市长江大桥" is segmented into "南京", "市长", and "长江大桥").
- Supports custom dictionaries and new word discovery, and is compatible with multiple languages (Chinese, English, Japanese, etc.).
Applicable scenarios:
- Medical/technology domain terminology analysis (e.g., precise segmentation of "人工智能").
- Multi-language mixed text processing (e.g., social media content with mixed Chinese and English).
To use the jieba tokenizer plugin, you need to install it yourself. For instructions on how to compile and install it, see Tokenizer plugin.
The current tokenizer plugin is an experimental feature and is not recommended for use in production environments.
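Once the plugin is installed, it should be callable through the same `tokenize()` interface as the built-in tokenizers. The sketch below assumes the plugin registers under the parser name 'jieba'.

```sql
-- Hedged sketch: assumes the jieba plugin is installed and registered as 'jieba'.
select tokenize("南京市长江大桥", 'jieba');
```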
Tokenizer selection strategy
| Business scenario | Recommended tokenizer | Reason |
|---|---|---|
| Search for English product titles | Space or Basic English | Simple and efficient, aligns with English tokenization conventions. |
| Retrieval of Chinese product descriptions | IK tokenizer | Accurately recognizes Chinese terminology, supports custom dictionaries. |
| Fuzzy matching of logs (such as error codes) | Ngram tokenizer | No dictionary required, covers fuzzy query needs for text without spaces. |
| Keyword extraction from technology papers | jieba tokenizer | Supports new word discovery and complex mode switching. |
References
For more information about creating full-text indexes, see the Create full-text indexes section in Create an index.