Full-text search

seekdb provides efficient search and retrieval capabilities for text data through its full-text indexing feature.

Scenarios

When you need to perform fuzzy searches on a large amount of text data, scanning the entire table and running a fuzzy match on every row often cannot meet performance requirements, especially when individual text values are large and the data volume is high. In addition, complex query scenarios such as approximate matching and relevance sorting are difficult to support through SQL rewriting alone.

Full-text indexes can better support these scenarios. By preprocessing the text content and establishing keyword indexes, full-text indexes can significantly improve the efficiency of full-text searches. Full-text indexes are applicable to various scenarios. The following are some specific examples:

  • Enterprise internal knowledge base: Many large enterprises build their own internal knowledge base systems to store project documents, meeting records, research reports, and other materials. Using full-text indexes helps employees quickly and accurately find the required information, thereby improving work efficiency.

  • Online libraries and e-book platforms: For services that provide a large number of books for users to read, full-text indexes are extremely important. Users can search by entering the book title, author name, or even a specific passage from the book as keywords. The system can quickly locate the results that match the criteria based on the full-text index.

  • News portals and social media websites: These platforms generate a massive amount of fresh content daily, including articles, posts, and comments. By using full-text indexes, users can filter the information stream based on topics, events, or person names they are interested in, and obtain the most relevant content.

  • Legal document retrieval systems: The legal industry involves a large amount of document review work, such as contracts, judgments, and legal provisions. An efficient full-text search engine can greatly simplify the workflow for lawyers, allowing them to quickly find precedents, cited clauses, and relevant legal bases.

  • Medical health information systems: In the medical field, doctors often need to review patients' medical history, the latest medical research papers, and other reference materials. With the help of full-text indexes, medical staff can conveniently access relevant information, enabling more accurate diagnostic decisions.

Any application involving the management and querying of a large amount of unstructured text data can consider adopting full-text indexes to enhance search efficiency.

Introduction to full-text indexes

In seekdb, full-text indexes can be created on CHAR, VARCHAR, and TEXT columns. Additionally, seekdb allows the creation of multiple full-text indexes on the same table and even on the same column.

Full-text indexes can be created on both non-partitioned and partitioned tables, regardless of whether they have a primary key. The following limitations apply when creating full-text indexes:

  • Full-text indexes can only be created on CHAR, VARCHAR, and TEXT columns.
  • Only local (LOCAL) full-text indexes are supported in the current version.
  • The UNIQUE keyword cannot be specified when creating a full-text index.
  • If you want to create a full-text index that involves multiple columns, ensure that these columns have the same character set.

With this syntax and these rules, seekdb's full-text index feature provides efficient search and retrieval capabilities for text data.

DML operations

For a table that contains a full-text index, you can perform complex DML operations such as INSERT INTO ... ON DUPLICATE KEY UPDATE, REPLACE INTO, multi-table updates and deletes, and updates on updatable views.
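
The examples below reference an articles table that is not defined in this section. A minimal definition consistent with these statements might look as follows (the column types and the index name are assumptions):

CREATE TABLE articles (
    title VARCHAR(255) PRIMARY KEY,  -- hypothetical: ON DUPLICATE KEY requires a primary key or unique key
    context TEXT,
    FULLTEXT INDEX ft_idx_articles(context)
);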

Here are some examples:

  • INSERT INTO ... ON DUPLICATE KEY UPDATE:

    INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key')
    ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';
  • REPLACE INTO:

    REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');
  • Updates and deletes across multiple tables.

    1. Create a table named tbl1.

      CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));
    2. Create a table named tbl2.

      CREATE TABLE tbl2 (a int PRIMARY KEY, b text);
    3. Update multiple tables by using the UPDATE statement.

      UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a
      SET tbl1.b = 'dddd', tbl2.b = 'eeee';
      UPDATE  tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
      UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;
    4. Delete data from multiple tables by using the DELETE statement.

      DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
      DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
  • DML operations on updatable views.

    1. Create a view named fts_view.

      CREATE VIEW fts_view AS SELECT * FROM tbl1;
    2. Insert data into the updatable view by using the INSERT statement.

      INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');
    3. Update data in the updatable view by using the UPDATE statement.

      UPDATE fts_view SET b = 'dddd';
      UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a
      SET fts_view.b = 'dddd', tbl2.b = 'eeee';
    4. Delete data from the updatable view by using the DELETE statement.

      DELETE FROM fts_view WHERE b = 'dddd';
      DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;

Full-text index tokenizer

The full-text index feature of seekdb supports multiple built-in tokenizers, allowing you to select the optimal text tokenization strategy for your business scenario. The default tokenizer is the Space tokenizer; other tokenizers must be explicitly specified by using the WITH PARSER clause.

Tokenizer list:

  • Space tokenizer
  • Basic English tokenizer
  • IK tokenizer
  • Ngram tokenizer
  • Ngram2 tokenizer
  • jieba tokenizer

Configuration method:

When you create or modify a table, specify the tokenizer type for a full-text index by setting the WITH PARSER tokenizer_option parameter in the CREATE TABLE or ALTER TABLE statement. For more information about the attribute parameters of the tokenizers, see Create index.

CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
FULLTEXT INDEX full_idx1_tbl2(name, doc)
WITH PARSER NGRAM
PARSER_PROPERTIES=(ngram_token_size=3));

-- Add a full-text index with the specified tokenizer to an existing table
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx2_tbl2(name, doc)
WITH PARSER NGRAM
PARSER_PROPERTIES=(ngram_token_size=3); -- Ngram example

Space tokenizer (default)

Concept:

  • Splits text using spaces, punctuation (such as commas and periods), or non-alphanumeric characters (excluding underscores _).
  • The tokenization result contains only valid tokens with lengths between min_token_size (default 3) and max_token_size (default 84).
  • Chinese characters are treated as single-character tokens.

Scenarios:

  • For languages that use spaces as separators, such as English ("apple watch series 9").
  • For Chinese text with manually added separators, such as "Nanjing Yangtze River Bridge".

Tokenization effect:

select tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at www.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'space');
+-------------------------------------------------------------------------------------------------------------+
| tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at www.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'space') |
+-------------------------------------------------------------------------------------------------------------+
|["see details at www", "one square kilometer is also small", "xxx", "Nanjing Yangtze River Bridge has 1 kilometer in length", "email xx", "word", "hello", "h_name"] |
+-------------------------------------------------------------------------------------------------------------+

Example explanation:

  • Symbols such as spaces, commas, and periods are used as separators. Continuous Chinese characters are treated as words.
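
The length limits in the result above come from min_token_size (default 3) and max_token_size (default 84). These limits can be tuned through PARSER_PROPERTIES when the index is created; a minimal sketch, in which the table name, index name, and limit values are illustrative:

CREATE TABLE tbl_space(id INT PRIMARY KEY, doc TEXT,
FULLTEXT INDEX ft_idx_space(doc)
WITH PARSER SPACE
PARSER_PROPERTIES=(min_token_size=2, max_token_size=32));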

Basic English (Beng) tokenizer

Concept:

  • Similar to the Space tokenizer, but underscores _ are treated as separators and not preserved.
  • Suitable for splitting English phrases but may not effectively split terms without spaces, such as "iPhone15".

Scenarios:

  • For basic search in English documents, such as logs and comments.

Tokenization effect:

select tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at WWW.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'beng');
+-----------------------------------------------------------------------------------------------------------------------+
| tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at WWW.XXX.COM,email xx@OB.COM, one square kilometer is also small hello-word h_name", 'beng') |
+-----------------------------------------------------------------------------------------------------------------------+
|["see details at www", "one square kilometer is also small", "xxx", "Nanjing Yangtze River Bridge has 1 kilometer in length", "email xx", "word", "hello", "name"] |
+-----------------------------------------------------------------------------------------------------------------------+

Example explanation:

  • The underscore _ is treated as a separator, so "h_name" is split; this handling of underscores is the core difference from the Space tokenizer.

Ngram tokenizer

Concept:

  • Fixed-length n-gram tokenization: By default, n=2, and consecutive non-separator characters are split into subsequences of length n.
  • Separator determination rules are the same as the Space tokenizer (preserving underscores _ and alphanumeric characters).
  • Does not support length limit parameters, outputting all possible n-length tokens.

Scenarios:

  • For fuzzy matching of short texts, such as user IDs and order numbers.
  • For scenarios requiring fixed-length feature extraction, such as password policy analysis.

Tokenization effect:

select tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at WWW.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'ngram');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at WWW.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'ngram') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["email", "ww", "bridge", “ob", "me", "also", "or", "_n", "kilometer", "small", "meter", "ll", "boxx", "kilometer", "see w", "co", "also", "1000", "city", "lo", "Yangtze", "el", "rd", "one square", "square kilometer", "he", "am", "Nanjing", "h_", "city", "wo", "xx", "Yangtze", "has 1", "na", "see details", "square kilometer", "om", "bridge has" |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Example explanation:

  • When n=2 by default, all consecutive 2-character tokens, including overlapping ones, are output.

Ngram2 tokenizer

Concepts:

  • Dynamic n-value range: The length of the token is specified by the min_ngram_size and max_ngram_size parameters.
  • It is suitable for scenarios that require tokens of multiple lengths.

Scenarios: queries that require tokens of multiple fixed lengths.

Note

When you use the ngram2 tokenizer, note that it consumes a large amount of memory. For example, if you set the min_ngram_size and max_ngram_size parameters to a large range, a large number of token combinations are generated, which may cause excessive resource consumption.

Tokenization result:

select tokenize ("Nanjing Yangtze River Bridge 1 km", 'ngram2', '[{"additional_args":[{"min_ngram_size": 4},{"max_ngram_size": 6}]}]');
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Yangtze River Bridge 1 km", 'ngram2', '[{"additional_args":[{"min_ngram_size": 4},{"max_ngram_size": 6}]}]') |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["Yangtze River Bridge", "Bridge 1 km", "River Bridge 1 km", "Nanjing Yangtze", "Nanjing Yangtze", "River Bridge 1", "Nanjing Yangtze", "Nanjing Yangtze", "Bridge 1 km", "River Bridge 1 km", "Nanjing Yangtze Bridge 1", "Yangtze River Bridge 1", "Nanjing Yangtze River", "Bridge 1 km", "Nanjing Yangtze River Bridge", "Yangtze River Bridge 1 km", "Nanjing Yangtze River Bridge", "Nanjing Yangtze River" |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Example explanation:

  • All continuous subsequences of lengths 4 to 6 are output, and the tokens can overlap.
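
To use this tokenizer in an index rather than in an ad hoc TOKENIZE call, the same parameters can be set through PARSER_PROPERTIES; a minimal sketch, in which the table name and index name are illustrative:

CREATE TABLE tbl_ngram2(id INT PRIMARY KEY, doc TEXT,
FULLTEXT INDEX ft_idx_ngram2(doc)
WITH PARSER NGRAM2
PARSER_PROPERTIES=(min_ngram_size=4, max_ngram_size=6));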

IK tokenizer

Concepts:

  • A Chinese tokenizer based on the open-source IK Analyzer, supporting two modes:

    • Smart mode: Prioritizes longer terms to minimize segmentation (e.g., "Nanjing City" is not split into "Nanjing" and "City").
    • Max Word mode: Outputs all possible shorter terms (e.g., "Nanjing City" is split into "Nanjing" and "City").
  • Automatically identifies formats such as English words, email addresses, URLs (excluding ://), and IP addresses.

Use Cases: Chinese text segmentation

Business Scenarios:

  • E-commerce product description search (e.g., precise matching of "Huawei Mate 60").

  • Social media content analysis (e.g., keyword extraction from user comments).

  • Smart mode: Ensures each character belongs to only one term, with no overlaps, and maximizes the length of each term while minimizing the number of terms. It also attempts to combine numbers and units into a single term.

    select tokenize("Nanjing City Yangtze River Bridge is 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | tokenize("Nanjing City Yangtze River Bridge is 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |["Email", "hello_word", "192.168.1.1", "hello-word", "Yangtze River Bridge", "www.baidu.com", "www.xxx.com", "xx@ob.com", "Long", "http", "1 kilometer", "See", "Nanjing City", "Have"] |
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  • Max Word mode: Allows the same character to appear in different terms, producing as many candidate terms as possible.

    select tokenize("Nanjing City Yangtze River Bridge is 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | tokenize("Nanjing City Yangtze River Bridge is 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |["Meter", "Yangtze River Bridge", "Mayor", "Dry", "Nanjing City", "Nanjing", "Kilometer", "xx", "www.xxx.com", "Long", "www", "xx@ob.com", "Yangtze", "ob", "XXX", "com", "See", "l", "Have", "Bridge", "Email"] |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+

jieba tokenizer

Concept: A tokenizer based on the Python open-source library jieba. Supports precise mode, full mode, and search engine mode.

Features:

  • Precise mode: strictly divides terms according to the dictionary (e.g., "cannot" is not split into "not" and "can").
  • Full mode: Displays all possible partitioning combinations.
  • Search engine mode: Balances precision and recall (for example, a search for "Nanjing City Yangtze River Bridge" returns results containing "Nanjing", "City", and "Yangtze River Bridge").
  • Supports custom dictionaries and new word discovery, and is compatible with multiple languages, including Chinese, English, and Japanese.

Scenarios:

  • Accurate segmentation of medical or technical terms, such as "artificial intelligence".
  • Multilingual mixed text processing (such as content on social media in a mix of Chinese and English).

The jieba tokenization plug-in needs to be installed and compiled. For more information, see Tokenization plug-in.

Note

The current tokenizer plugin is an experimental feature and is not recommended for use in a production environment.
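
If the plug-in is installed, you can preview its output with the TOKENIZE function in the same way as the built-in tokenizers; a minimal sketch, assuming the plug-in registers a parser named jieba (this name is an assumption and is not confirmed in this topic):

select tokenize("Nanjing City Yangtze River Bridge", 'jieba');  -- 'jieba' parser name is an assumption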

Tokenizer selection strategy

Scenario | Suggested tokenizer | Reason
English title search | Space or Basic English | Simple and efficient, and matches English word segmentation habits.
Chinese product description retrieval | IK tokenizer | Accurately identifies Chinese terms and supports custom dictionaries.
Fuzzy matching of logs (for example, error codes) | Ngram tokenizer | No dictionary is required; suitable for fuzzy queries on text without spaces.
Keyword extraction from technical papers | jieba tokenizer | Supports new word discovery and complex mode switching.

Create a full-text index

You can use the CREATE TABLE, CREATE FULLTEXT INDEX, or ALTER TABLE statement to create a full-text index.

Limitations and considerations for full-text indexes

You can create a full-text index on a non-partitioned or partitioned table, regardless of whether it has a primary key. The following limitations apply:

  • A full-text index can be created only on columns of the CHAR, VARCHAR, or TEXT type.
  • Only local (LOCAL) full-text indexes are supported in the current version.
  • You cannot specify the UNIQUE keyword when you create a full-text index.
  • To create a full-text index that involves multiple columns, ensure that these columns have the same character set.
  • For information about how to use full-text indexes with offline DDL operations, see Offline DDL.
  • Columnstore full-text indexes are not supported.
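
For the multi-column case, a compliant definition declares the same character set for all indexed columns; a minimal sketch, in which the table name, column names, and the utf8mb4 character set are illustrative:

CREATE TABLE tbl_mc(id INT PRIMARY KEY,
title VARCHAR(255) CHARACTER SET utf8mb4,
body TEXT CHARACTER SET utf8mb4,
FULLTEXT INDEX ft_idx_mc(title, body));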

Create a full-text index by using the CREATE TABLE statement

You can create a full-text index for a table when you create the table by using the CREATE TABLE statement. The simplified syntax is as follows:

CREATE TABLE table_name(column_name column_definition,[column_name column_definition,...]
FULLTEXT [INDEX | KEY] [index_name](column_name)
[WITH PARSER tokenizer_option]
[PARSER_PROPERTIES[=](parser_properties_list)]
[LOCAL]);

tokenizer_option:
SPACE
| NGRAM
| BENG
| IK
| NGRAM2

parser_properties_list:
parser_properties, [parser_properties]

parser_properties:
min_token_size = int_value
| max_token_size = int_value
| ngram_token_size = int_value
| ik_mode = 'char_value'
| min_ngram_size = int_value
| max_ngram_size = int_value

For more information about the CREATE TABLE statement, see CREATE TABLE.

The following table describes the related parameters:

  • table_name: specifies the name of the table to be created.

  • column_name: specifies the columns of the table.

  • column_definition: specifies the data types of the columns in the table.

  • FULLTEXT: specifies to create a full-text index.

    Note

    Only local full-text indexes are supported in the current version.

  • INDEX | KEY: specifies to use the INDEX or KEY keyword in this statement.

  • index_name: specifies the name of the index to be created. This parameter is optional. If you do not specify this parameter, the index name is the same as the column name.

  • WITH PARSER tokenizer_option: specifies the tokenizer for the full-text index. Valid values:

    • SPACE: the default value. This option specifies to split the text by spaces. You can specify the following parameters:

      Parameter      | Value range
      min_token_size | [1, 16]
      max_token_size | [10, 84]
    • NGRAM: specifies to split the text by using the N-Gram tokenizer for Chinese. You can specify the following parameters:

      Parameter        | Value range
      ngram_token_size | [1, 10]
    • NGRAM2: specifies to split the text into consecutive characters of lengths ranging from min_ngram_size to max_ngram_size.

      Parameter      | Value range
      min_ngram_size | [1, 16]
      max_ngram_size | [1, 16]
    • BENG: specifies to split the text by using the Beng tokenizer for basic English. You can specify the following parameters:

      Parameter      | Value range
      min_token_size | [1, 16]
      max_token_size | [10, 84]
    • IK: specifies to split the text by using the IK tokenizer for Chinese. Only the utf-8 character set is supported. You can specify the following parameters:

      Parameter | Value range
      ik_mode   | smart, max_word

    You can use the TOKENIZE function to view the tokenization result of the text by using the specified tokenizer and JSON parameters.

  • PARSER_PROPERTIES[=](parser_properties_list): specifies the properties of the tokenizer. Valid values:

    • min_token_size: specifies the minimum token length. The default value is 3, and the value ranges from 1 to 16.

    • max_token_size: specifies the maximum token length. The default value is 84, and the value ranges from 10 to 84.

    • ngram_token_size: specifies the token length for the NGRAM tokenizer. This parameter is valid only for the NGRAM tokenizer. The default value is 2, and the value ranges from 1 to 10.

    • ik_mode: specifies the tokenization mode for the IK tokenizer. Valid values:

      • smart: the default value. This mode uses the words in the dictionary to improve the accuracy of tokenization. The boundaries of the words in the dictionary are prioritized, which may reduce unnecessary expansion.
      • max_word: this mode recognizes the words defined in the dictionary but does not affect the maximum expansion of tokenization. Even if the dictionary defines words, the max_word mode still attempts to split the text into more words.
  • LOCAL: specifies to create a local index. This parameter is optional.

Here are some examples:

  • Create a table named tbl1 and a full-text index named full_idx1_tbl1.

    CREATE TABLE tbl1(id INT, name VARCHAR(18), date DATE, PRIMARY KEY (id), FULLTEXT INDEX full_idx1_tbl1(name));
  • Create a table named tbl2 and a full-text index named full_idx1_tbl2. Specify the NGRAM tokenizer for the full-text index and set the tokenizer properties by using the PARSER_PROPERTIES parameter.

    CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT, 
    FULLTEXT INDEX full_idx1_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3));
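
Once the index exists, full-text queries use MATCH ... AGAINST over the indexed column list, as the complete example at the end of this topic shows. A minimal sketch against tbl2, in which the search keyword is illustrative:

SELECT id, name, doc FROM tbl2
WHERE MATCH(name, doc) AGAINST('seekdb');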

Use the CREATE FULLTEXT INDEX statement to create a full-text index

You can run the CREATE FULLTEXT INDEX statement to create a full-text index for an existing table. The syntax is as follows:

CREATE FULLTEXT INDEX index_name ON table_name (column_name, [column_name ...])
[WITH PARSER tokenizer_option]
[PARSER_PROPERTIES[=](parser_properties_list)]
[LOCAL];

tokenizer_option:
SPACE
| NGRAM
| BENG
| IK
| NGRAM2

parser_properties_list:
parser_properties, [parser_properties]

parser_properties:
min_token_size = int_value
| max_token_size = int_value
| ngram_token_size = int_value
| ik_mode = 'char_value'
| min_ngram_size = int_value
| max_ngram_size = int_value

For more information about the CREATE INDEX statement, see CREATE INDEX.

The following table describes the related parameters:

  • index_name: specifies the name of the index to be added.

  • table_name: specifies the name of the table for which to create an index.

  • column_name: specifies the columns to be indexed. Separate multiple columns with commas.

  • WITH PARSER tokenizer_option: specifies the tokenizer for the full-text index. Valid values:

    • SPACE: the default value. This option specifies to split the text by spaces. You can specify the following parameters:

      Parameter      | Value range
      min_token_size | [1, 16]
      max_token_size | [10, 84]
    • NGRAM: specifies to split the text by N-Gram (Chinese) method. You can specify the following parameter:

      Parameter        | Value range
      ngram_token_size | [1, 10]
    • NGRAM2: specifies to split the text into continuous characters in the range of min_ngram_size to max_ngram_size. You can specify the following parameters:

      Parameter      | Value range
      min_ngram_size | [1, 16]
      max_ngram_size | [1, 16]
    • BENG: specifies to split the text by Beng (basic English) method. You can specify the following parameters:

      Parameter      | Value range
      min_token_size | [1, 16]
      max_token_size | [10, 84]
    • IK: specifies to split the text by IK (Chinese) method. Only the utf-8 character set is supported. You can specify the following parameter:

      Parameter | Value range
      ik_mode   | smart, max_word

    You can use the TOKENIZE function to view the tokenization result of the text with the specified tokenizer and JSON parameters.

  • PARSER_PROPERTIES[=](parser_properties_list): specifies the parameters of the tokenizer. Valid values:

    • min_token_size: specifies the minimum token length. The default value is 3, and the value ranges from 1 to 16.

    • max_token_size: specifies the maximum token length. The default value is 84, and the value ranges from 10 to 84.

    • ngram_token_size: specifies the token length of the NGRAM tokenizer. This parameter is valid only for the NGRAM tokenizer. The default value is 2, and the value ranges from 1 to 10.

    • ik_mode: specifies the tokenization mode of the IK tokenizer. Valid values:

      • smart: the default value. This option specifies to use the words in the dictionary to improve the tokenization accuracy. The boundaries of the words in the dictionary are prioritized, which may reduce unnecessary extensions.
      • max_word: specifies to identify the words defined in the dictionary but not to affect the maximum extension of tokenization. Even if the dictionary defines the words, the max_word mode still tries to split the text into more words.
  • LOCAL: specifies to create a local index.

Here is an example:

After you create the tbl3 table, create a full-text index named ft_idx1_tbl3.

  1. Create the tbl3 table.

    CREATE TABLE tbl3(col1 INT, col2 VARCHAR(4096));
  2. Create a full-text index named ft_idx1_tbl3 for the tbl3 table. Specify the IK tokenizer and set the tokenizer parameters by using the PARSER_PROPERTIES clause.

    CREATE FULLTEXT INDEX ft_idx1_tbl3 ON tbl3(col2)
    WITH PARSER IK
    PARSER_PROPERTIES=(ik_mode='max_word');

Use the ALTER TABLE statement to create a full-text index

You can use the ALTER TABLE statement to add a full-text index to an existing table. The syntax is as follows:

ALTER TABLE table_name ADD FULLTEXT [INDEX | KEY] [index_name](column_name, [column_name ...])
[WITH PARSER tokenizer_option]
[PARSER_PROPERTIES[=](parser_properties_list)]
[LOCAL];

tokenizer_option:
SPACE
| NGRAM
| BENG
| IK
| NGRAM2

parser_properties_list:
parser_properties, [parser_properties]

parser_properties:
min_token_size = int_value
| max_token_size = int_value
| ngram_token_size = int_value
| ik_mode = 'char_value'
| min_ngram_size = int_value
| max_ngram_size = int_value

For more information about the ALTER TABLE statement, see ALTER TABLE.

The following table describes the related parameters:

  • table_name : specifies the name of the table to which the index is added.

  • INDEX | KEY : specifies whether to use INDEX or KEY as the index keyword. The default value is KEY.

  • index_name : specifies the name of the index to be created. If you do not specify this parameter, the index name is the same as the column name.

  • column_name : specifies the column to be indexed. Separate multiple columns with commas (,).

  • WITH PARSER tokenizer_option : specifies the tokenizer for the full-text index. Valid values:

    • SPACE : specifies the tokenizer based on spaces. You can specify the following parameters:

      Parameter      | Value range
      min_token_size | [1, 16]
      max_token_size | [10, 84]
    • NGRAM : specifies the tokenizer based on N-Grams (for Chinese). You can specify the following parameters:

      Parameter        | Value range
      ngram_token_size | [1, 10]
    • NGRAM2 : specifies the tokenizer that splits the text into continuous characters of length between min_ngram_size and max_ngram_size. You can specify the following parameters:

      Parameter      | Value range
      min_ngram_size | [1, 16]
      max_ngram_size | [1, 16]
    • BENG : specifies the tokenizer based on Beng (basic English). You can specify the following parameters:

      Parameter      | Value range
      min_token_size | [1, 16]
      max_token_size | [10, 84]
    • IK : specifies the tokenizer based on IK (for Chinese). Only the utf-8 character set is supported. You can specify the following parameter:

      Parameter | Value range
      ik_mode   | smart, max_word

    You can use the TOKENIZE function to view the tokenization result of the text based on the specified tokenizer and JSON parameters.

  • PARSER_PROPERTIES[=](parser_properties_list) : specifies the tokenizer parameters. Valid values:

    • min_token_size : specifies the minimum token length. Default value: 3. Value range: 1 to 16.

    • max_token_size : specifies the maximum token length. Default value: 84. Value range: 10 to 84.

    • ngram_token_size : specifies the token length for the NGRAM tokenizer. This parameter is valid only for the NGRAM tokenizer. Default value: 2. Value range: 1 to 10.

    • ik_mode: specifies the tokenization mode for the IK tokenizer. Valid values:

      • smart : default value. This mode uses the dictionary to improve the accuracy of tokenization. It prioritizes the boundaries of dictionary words, thereby reducing unnecessary extensions.
      • max_word : this mode recognizes the words defined in the dictionary but does not affect the maximum extension of tokenization. Even if the dictionary defines words, the max_word mode still tries to split the text into more words.
  • LOCAL : specifies to create a local index. This parameter is optional.

Here is an example:

After you create a table named tbl4, you can add a full-text index named ft_idx1_tbl4 to the table.

  1. Create the tbl4 table.

    CREATE TABLE tbl4(col1 INT, col2 TEXT);
  2. Add a full-text index named ft_idx1_tbl4 to the tbl4 table. Specify BENG as the tokenizer for the full-text index and set the tokenizer parameters by using the PARSER_PROPERTIES parameter.

    ALTER TABLE tbl4 ADD FULLTEXT INDEX ft_idx1_tbl4(col2)
    WITH PARSER BENG
    PARSER_PROPERTIES=(min_token_size=2, max_token_size=64);
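
As a quick check, you can insert a row and run a MATCH ... AGAINST query that uses the new index; a minimal sketch, in which the sample text and search keyword are illustrative:

INSERT INTO tbl4(col1, col2) VALUES (1, 'hello seekdb full text search');
SELECT col1, col2 FROM tbl4 WHERE MATCH(col2) AGAINST('seekdb');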

Example

In this example, a table is defined to store document information, and a full-text index is set for the documents. With the full-text index, you can quickly match documents containing the desired keywords and sort them by similarity in descending order.

  1. Create a table.

    CREATE TABLE Articles (
    id INT AUTO_INCREMENT,
    title VARCHAR(255) ,
    content TEXT ,
    PRIMARY KEY (id),
    FULLTEXT ft1 (content) WITH PARSER SPACE
    );
    Query OK, 0 rows affected (0.67 sec)
  2. Insert data.

    INSERT INTO Articles (title, content) VALUES
    ('seekdb overview', 'seekdb is an AI-native search database. It unifies relational, vector, text, JSON and GIS in a single engine, enabling hybrid search and in-database AI workflows.'),
    ('Full-Text Search in Databases', 'Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly.'),
    ('Advantages of Using seekdb', 'seekdb offers several advantages such as high performance, reliability, and ease of use. ');
    Query OK, 3 rows affected (0.10 sec)
    Records: 3 Duplicates: 0 Warnings: 0
  3. Query the table.

    select * from Articles;
    +----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | id | title | content |
    +----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | 1 | seekdb overview | seekdb is an AI-native search database. It unifies relational, vector, text, JSON and GIS in a single engine, enabling hybrid search and in-database AI workflows. |
    | 2 | Full-Text Search in Databases | Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly. |
    | 3 | Advantages of Using seekdb | seekdb offers several advantages such as high performance, reliability, and ease of use. |
    +----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    3 rows in set (0.002 sec)
  4. Query the matched documents.

    select id,title, content,match(content) against('seekdb database') score from Articles where match(content) against('seekdb database');
    +----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
    | id | title | content | score |
    +----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
    | 1 | seekdb overview | seekdb is an AI-native search database. It unifies relational, vector, text, JSON and GIS in a single engine, enabling hybrid search and in-database AI workflows. | 0.4570384669555348 |
    | 3 | Advantages of Using seekdb | seekdb offers several advantages such as high performance, reliability, and ease of use. | 0.240174672489083 |
    | 2 | Full-Text Search in Databases | Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly. | 0.20072992700729927 |
    +----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
    3 rows in set (0.003 sec)
  5. Use the EXPLAIN command to view the query plan and analyze its performance.

    explain select id,title, content,match(content) against('seekdb database') score from Articles where match(content) against('seekdb database');
    +-------------------------------------------------------------------------------------------------------------------------------------------------+
    | Query Plan |
    +-------------------------------------------------------------------------------------------------------------------------------------------------+
    | ============================================================== |
    | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
    | -------------------------------------------------------------- |
    | |0 |SORT | |4 |139 | |
    | |1 |└─TEXT RETRIEVAL SCAN|articles(ft1)|4 |138 | |
    | ============================================================== |
    | Outputs & filters: |
    | ------------------------------------- |
    | 0 - output([articles.id], [articles.title], [articles.content], [MATCH(articles.content) AGAINST('seekdb database')]), filter(nil), rowset=16 |
    | sort_keys([MATCH(articles.content) AGAINST('seekdb database'), DESC]) |
    | 1 - output([articles.id], [articles.content], [articles.title], [MATCH(articles.content) AGAINST('seekdb database')]), filter(nil), rowset=16 |
    | access([articles.id], [articles.content], [articles.title]), partitions(p0) |
    | is_index_back=true, is_global_index=false, |
    | calc_relevance=true, match_expr(MATCH(articles.content) AGAINST('seekdb database')), |
    | pushdown_match_filter(MATCH(articles.content) AGAINST('seekdb database')) |
    +-------------------------------------------------------------------------------------------------------------------------------------------------+
    15 rows in set (0.002 sec)