Full-text search
seekdb provides efficient search and retrieval capabilities for text data through its full-text indexing feature.
Scenarios
When you need to perform fuzzy searches on a large amount of text data, scanning the entire table and applying a fuzzy match to every row often cannot meet performance requirements, especially when individual texts are long and the data volume is large. In addition, complex query scenarios such as approximate matching and relevance sorting are difficult to support through SQL rewriting.
Full-text indexes can better support these scenarios. By preprocessing the text content and establishing keyword indexes, full-text indexes can significantly improve the efficiency of full-text searches. Full-text indexes are applicable to various scenarios. The following are some specific examples:
- Enterprise internal knowledge base: Many large enterprises build their own internal knowledge base systems to store project documents, meeting records, research reports, and other materials. Using full-text indexes helps employees quickly and accurately find the required information, thereby improving work efficiency.
- Online libraries and e-book platforms: For services that provide a large number of books for users to read, full-text indexes are extremely important. Users can search by entering the book title, author name, or even a specific passage from the book as keywords. The system can quickly locate the results that match the criteria based on the full-text index.
- News portals and social media websites: These platforms generate a massive amount of fresh content daily, including articles, posts, and comments. By using full-text indexes, users can filter the information stream based on topics, events, or person names they are interested in, and obtain the most relevant content.
- Legal document retrieval systems: The legal industry involves a large amount of document review work, such as contracts, judgments, and legal provisions. An efficient full-text search engine can greatly simplify the workflow for lawyers, allowing them to quickly find precedents, cited clauses, and relevant legal bases.
- Medical health information systems: In the medical field, doctors often need to review patients' medical history, the latest medical research papers, and other reference materials. With the help of full-text indexes, medical staff can conveniently access relevant information, enabling more accurate diagnostic decisions.
Any application involving the management and querying of a large amount of unstructured text data can consider adopting full-text indexes to enhance search efficiency.
Introduction to full-text indexes
In seekdb, full-text indexes can be created on CHAR, VARCHAR, and TEXT columns. Additionally, seekdb allows the creation of multiple full-text indexes on the same table and even on the same column.
Full-text indexes can be created on both non-partitioned and partitioned tables, regardless of whether they have a primary key. The following limitations apply when creating full-text indexes:
- Full-text indexes can only be created on CHAR, VARCHAR, and TEXT columns.
- Only local (LOCAL) full-text indexes are supported in the current version.
- The UNIQUE keyword cannot be specified when creating a full-text index.
- If you want to create a full-text index that involves multiple columns, ensure that these columns have the same character set.
Within these rules, seekdb's full-text index feature provides efficient search and retrieval capabilities for text data.
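For example, because multiple full-text indexes can coexist on the same column, a single text column can be tokenized in more than one way. The following is a minimal sketch with illustrative table and index names, using only the syntax described later in this topic:

CREATE TABLE docs (id INT PRIMARY KEY, body TEXT);
CREATE FULLTEXT INDEX ft_body_space ON docs(body) WITH PARSER SPACE;
CREATE FULLTEXT INDEX ft_body_ngram ON docs(body) WITH PARSER NGRAM;

This can be useful when the same column must serve both keyword-style searches and fuzzy, n-gram-style matching.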
DML operations
For a table that contains a full-text index, you can perform complex DML operations such as INSERT INTO ON DUPLICATE KEY, REPLACE INTO, updates and deletes across multiple tables, and updates on updatable views.
Here are some examples:
- INSERT INTO ... ON DUPLICATE KEY:

  INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key')
  ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';

- REPLACE INTO:

  REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');

- Updates and deletes across multiple tables:

  - Create a table named tbl1.

    CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));

  - Create a table named tbl2.

    CREATE TABLE tbl2 (a int PRIMARY KEY, b text);

  - Update multiple tables by using the UPDATE statement.

    UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd', tbl2.b = 'eeee';
    UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
    UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;

  - Delete data from multiple tables by using the DELETE statement.

    DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
    DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;

- DML operations on updatable views:

  - Create a view named fts_view.

    CREATE VIEW fts_view AS SELECT * FROM tbl1;

  - Insert data into the updatable view by using the INSERT statement.

    INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');

  - Update data in the updatable view by using the UPDATE statement.

    UPDATE fts_view SET b = 'dddd';
    UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a SET fts_view.b = 'dddd', tbl2.b = 'eeee';

  - Delete data from the updatable view by using the DELETE statement.

    DELETE FROM fts_view WHERE b = 'dddd';
    DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;
Full-text index tokenizer
The full-text index feature of seekdb supports multiple built-in tokenizers, allowing users to select the optimal text tokenization strategy based on their business scenarios. The default tokenizer is the Space tokenizer; other tokenizers must be explicitly specified by using the WITH PARSER parameter.
Tokenizer list:
- Space tokenizer
- Basic English tokenizer
- IK tokenizer
- Ngram tokenizer
- jieba tokenizer
Configuration method:
When you create or modify a table, specify the tokenizer type for the full-text index by setting WITH PARSER tokenizer_option in the CREATE TABLE or ALTER TABLE statement. For more information about the attribute parameters of other tokenizers, see Create index.
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
FULLTEXT INDEX full_idx1_tbl2(name, doc)
WITH PARSER NGRAM
PARSER_PROPERTIES=(ngram_token_size=3));
-- Add a full-text index with the specified tokenizer to an existing table
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx2_tbl2(name, doc)
WITH PARSER NGRAM
PARSER_PROPERTIES=(ngram_token_size=3); -- Ngram example
Space tokenizer (default)
Concept:
- Splits text using spaces, punctuation (such as commas and periods), or non-alphanumeric characters (excluding underscores _).
- The tokenization result contains only valid tokens with lengths between min_token_size (default 3) and max_token_size (default 84).
- Consecutive Chinese characters are grouped into a single token.
Scenarios:
- For languages that use spaces as separators, such as English ("apple watch series 9").
- For Chinese text with manually added separators, such as "Nanjing Yangtze River Bridge".
Tokenization effect:
select tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at www.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'space');
+-------------------------------------------------------------------------------------------------------------+
| tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at www.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'space') |
+-------------------------------------------------------------------------------------------------------------+
|["see details at www", "one square kilometer is also small", "xxx", "Nanjing Yangtze River Bridge has 1 kilometer in length", "email xx", "word", "hello", "h_name"] |
+-------------------------------------------------------------------------------------------------------------+
Example explanation:
- Symbols such as spaces, commas, and periods are used as separators. Continuous Chinese characters are treated as words.
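If the defaults do not fit your data, the token length limits can be adjusted when the index is created. The statement below is an illustrative sketch (table and index names are hypothetical) that lowers min_token_size so that 2-character tokens are also indexed:

CREATE TABLE notes (id INT PRIMARY KEY, body TEXT,
    FULLTEXT INDEX ft_body(body)
    WITH PARSER SPACE
    PARSER_PROPERTIES=(min_token_size=2, max_token_size=64));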
Basic English (Beng) tokenizer
Concept:
- Similar to the Space tokenizer, but underscores _ are treated as separators and are not preserved.
- Suitable for splitting English phrases, but may not effectively split terms without spaces, such as "iPhone15".
Scenarios:
- For basic search in English documents, such as logs and comments.
Tokenization effect:
select tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at WWW.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'beng');
+-----------------------------------------------------------------------------------------------------------------------+
| tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at WWW.XXX.COM,email xx@OB.COM, one square kilometer is also small hello-word h_name", 'beng') |
+-----------------------------------------------------------------------------------------------------------------------+
|["see details at www", "one square kilometer is also small", "xxx", "Nanjing Yangtze River Bridge has 1 kilometer in length", "email xx", "word", "hello", "name"] |
+-----------------------------------------------------------------------------------------------------------------------+
Example explanation:
- The underscore _ is treated as a separator and split out. The core difference from the Space tokenizer is how underscores are handled.
Ngram tokenizer
Concept:
- Fixed-length n-gram tokenization: by default, n=2, and consecutive non-separator characters are split into subsequences of length n.
- Separator determination rules are the same as for the Space tokenizer (underscores _ and alphanumeric characters are preserved).
- Length limit parameters are not supported; all possible tokens of length n are output.
Scenarios:
- For fuzzy matching of short texts, such as user IDs and order numbers.
- For scenarios requiring fixed-length feature extraction, such as password policy analysis.
Tokenization effect:
select tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at WWW.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'ngram');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize ("Nanjing Yangtze River Bridge has 1 kilometer in length, see details at WWW.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'ngram') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["email", "ww", "bridge", “ob", "me", "also", "or", "_n", "kilometer", "small", "meter", "ll", "boxx", "kilometer", "see w", "co", "also", "1000", "city", "lo", "Yangtze", "el", "rd", "one square", "square kilometer", "he", "am", "Nanjing", "h_", "city", "wo", "xx", "Yangtze", "has 1", "na", "see details", "square kilometer", "om", "bridge has" |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Example explanation:
- With the default n=2, all consecutive 2-character tokens, including overlapping ones, are output.
Ngram2 tokenizer
Concepts:
- Dynamic n-value range: the token length is specified by the min_ngram_size and max_ngram_size parameters.
- Suitable for scenarios that require tokens of multiple lengths.
Scenarios: Scenarios that require tokens of multiple fixed lengths.
When you use the ngram2 tokenizer, note that it consumes a large amount of memory. For example, if you set the min_ngram_size and max_ngram_size parameters to a large range, a large number of token combinations are generated, which may cause excessive resource consumption.
Tokenization result:
select tokenize ("Nanjing Yangtze River Bridge 1 km", 'ngram2', '[{"additional_args":[{"min_ngram_size": 4},{"max_ngram_size": 6}]}]');
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Yangtze River Bridge 1 km", 'ngram2', '[{"additional_args":[{"min_ngram_size": 4},{"max_ngram_size": 6}]}]') |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["Yangtze River Bridge", "Bridge 1 km", "River Bridge 1 km", "Nanjing Yangtze", "Nanjing Yangtze", "River Bridge 1", "Nanjing Yangtze", "Nanjing Yangtze", "Bridge 1 km", "River Bridge 1 km", "Nanjing Yangtze Bridge 1", "Yangtze River Bridge 1", "Nanjing Yangtze River", "Bridge 1 km", "Nanjing Yangtze River Bridge", "Yangtze River Bridge 1 km", "Nanjing Yangtze River Bridge", "Nanjing Yangtze River" |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Example explanation:
- All continuous subsequences of lengths 4 to 6 are output, and the tokens can overlap.
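A full-text index that uses the ngram2 tokenizer is created with the same WITH PARSER and PARSER_PROPERTIES syntax as the other tokenizers. The following is a minimal sketch with illustrative table and index names; keeping the min_ngram_size/max_ngram_size range narrow limits the number of generated tokens:

CREATE TABLE codes (id INT PRIMARY KEY, code VARCHAR(64),
    FULLTEXT INDEX ft_code(code)
    WITH PARSER NGRAM2
    PARSER_PROPERTIES=(min_ngram_size=2, max_ngram_size=4));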
IK tokenizer
Concepts:
- A Chinese tokenizer based on the open-source IK Analyzer, supporting two modes:
  - Smart mode: prioritizes longer terms to minimize segmentation (e.g., "Nanjing City" is not split into "Nanjing" and "City").
  - Max Word mode: outputs all possible shorter terms (e.g., "Nanjing City" is split into "Nanjing" and "City").
- Automatically identifies formats such as English words, email addresses, URLs (excluding ://), and IP addresses.
Scenarios: Chinese text segmentation.
Business scenarios:
- E-commerce product description search (e.g., precise matching of "Huawei Mate 60").
- Social media content analysis (e.g., keyword extraction from user comments).
Tokenization effect:
- Smart mode: ensures each character belongs to only one term, with no overlaps, and maximizes the length of each term while minimizing the number of terms. It also attempts to combine numbers and units into a single term.

  select tokenize("Nanjing City Yangtze River Bridge is 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
  +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  | tokenize("Nanjing City Yangtze River Bridge is 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
  +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  |["Email", "hello_word", "192.168.1.1", "hello-word", "Yangtze River Bridge", "www.baidu.com", "www.xxx.com", "xx@ob.com", "Long", "http", "1 kilometer", "See", "Nanjing City", "Have"] |
  +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

- Max Word mode: includes the same character in different terms to provide as many possible terms as possible.

  select tokenize("Nanjing City Yangtze River Bridge is 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
  +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  | tokenize("Nanjing City Yangtze River Bridge is 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
  +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  |["Meter", "Yangtze River Bridge", "Mayor", "Dry", "Nanjing City", "Nanjing", "Kilometer", "xx", "www.xxx.com", "Long", "www", "xx@ob.com", "Yangtze", "ob", "XXX", "com", "See", "l", "Have", "Bridge", "Email"] |
  +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
jieba tokenizer
Concept: A tokenizer based on the Python open-source library jieba. Supports precise mode, full mode, and search engine mode.
Features:
- Precise mode: strictly divides terms according to the dictionary (e.g., "cannot" is not split into "not" and "can").
- Full mode: Displays all possible partitioning combinations.
- Search mode: Balances precision and recall (for example, search for “Nanjing City Yangtze River Bridge” to return results containing all of “Nanjing”, “City”, and “Yangtze River Bridge”).
- Supports custom dictionaries and new word discovery, and is compatible with multiple languages, including Chinese, English, and Japanese.
Scenarios:
- Accurate segmentation of medical or technological terms. For example, the accurate segmentation of "artificial intelligence."
- Multilingual mixed text processing (such as content on social media in a mix of Chinese and English).
The jieba tokenization plug-in needs to be installed and compiled. For more information, see Tokenization plug-in.
The current tokenizer plugin is an experimental feature and is not recommended for use in a production environment.
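After the plug-in is installed, you can preview its output with the TOKENIZE function in the same way as the built-in tokenizers. The following call is a sketch that assumes the plug-in is registered under the name 'jieba'; the exact name and any optional mode arguments depend on your installation:

select tokenize("Nanjing City Yangtze River Bridge", 'jieba');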
Tokenizer selection strategy
| Scenario | Suggested tokenizer | Reason |
|---|---|---|
| English title search | Space or Basic English | Simple and efficient, aligning with English word segmentation habits. |
| Retrieving Chinese product description | IK tokenizer | Accurately identifies Chinese terms and supports custom dictionaries. |
| Fuzzy match logs (for example, error codes) | Ngram tokenizer | No dictionary is required. It is used to perform fuzzy queries on texts without spaces. |
| Keywords extraction in technical papers | jieba tokenizer | Supports new word discovery and complex mode switching. |
Create a full-text index
You can use the CREATE TABLE, CREATE FULLTEXT INDEX, or ALTER TABLE statement to create a full-text index.
Limitations and considerations for full-text indexes
You can create a full-text index on a non-partitioned table or a partitioned table, regardless of whether the table has a primary key. The following limitations apply:
- A full-text index can be created only on columns of the CHAR, VARCHAR, or TEXT type.
- Only local (LOCAL) full-text indexes are supported in the current version.
- You cannot specify the UNIQUE keyword when you create a full-text index.
- To create a full-text index that involves multiple columns, ensure that these columns have the same character set.
- For information about how to use full-text indexes with offline DDL operations, see Offline DDL.
- Columnstore full-text indexes are not supported.
Create a full-text index by using the CREATE TABLE statement
You can create a full-text index for a table when you create the table by using the CREATE TABLE statement. The simplified syntax is as follows:
CREATE TABLE table_name(column_name column_definition,[column_name column_definition,...]
FULLTEXT [INDEX | KEY] [index_name](column_name)
[WITH PARSER tokenizer_option]
[PARSER_PROPERTIES[=](parser_properties_list)]
[LOCAL]);
tokenizer_option:
SPACE
| NGRAM
| BENG
| IK
| NGRAM2
parser_properties_list:
parser_properties, [parser_properties]
parser_properties:
min_token_size = int_value
| max_token_size = int_value
| ngram_token_size = int_value
| ik_mode = 'char_value'
| min_ngram_size = int_value
| max_ngram_size = int_value
For more information about the CREATE TABLE statement, see CREATE TABLE.
The following table describes the related parameters:
- table_name: specifies the name of the table to be created.
- column_name: specifies the columns of the table.
- column_definition: specifies the data types of the columns in the table.
- FULLTEXT: specifies to create a full-text index. Note that only local full-text indexes are supported in the current version.
- INDEX | KEY: specifies to use the INDEX or KEY keyword in this statement.
- index_name: specifies the name of the index to be created. This parameter is optional. If you do not specify this parameter, the index name is the same as the column name.
- WITH PARSER tokenizer_option: specifies the tokenizer for the full-text index. Valid values:

  - SPACE: the default value. This option specifies to split the text by spaces. You can specify the following parameters:

    | Parameter | Value range |
    |---|---|
    | min_token_size | [1, 16] |
    | max_token_size | [10, 84] |

  - NGRAM: specifies to split the text by using the N-Gram tokenizer for Chinese. You can specify the following parameter:

    | Parameter | Value range |
    |---|---|
    | ngram_token_size | [1, 10] |

  - NGRAM2: specifies to split the text into consecutive characters of lengths ranging from min_ngram_size to max_ngram_size. You can specify the following parameters:

    | Parameter | Value range |
    |---|---|
    | min_ngram_size | [1, 16] |
    | max_ngram_size | [1, 16] |

  - BENG: specifies to split the text by using the Beng tokenizer for basic English. You can specify the following parameters:

    | Parameter | Value range |
    |---|---|
    | min_token_size | [1, 16] |
    | max_token_size | [10, 84] |

  - IK: specifies to split the text by using the IK tokenizer for Chinese. Only the utf-8 character set is supported. You can specify the following parameter:

    | Parameter | Value range |
    |---|---|
    | ik_mode | smart, max_word |

  You can use the TOKENIZE function to view the tokenization result of the text by using the specified tokenizer and JSON parameters.

- PARSER_PROPERTIES[=](parser_properties_list): specifies the properties of the tokenizer. Valid values:

  - min_token_size: specifies the minimum token length. The default value is 3, and the value ranges from 1 to 16.
  - max_token_size: specifies the maximum token length. The default value is 84, and the value ranges from 10 to 84.
  - ngram_token_size: specifies the token length for the NGRAM tokenizer. This parameter is valid only for the NGRAM tokenizer. The default value is 2, and the value ranges from 1 to 10.
  - ik_mode: specifies the tokenization mode for the IK tokenizer. Valid values:
    - smart: the default value. This mode uses the words in the dictionary to improve the accuracy of tokenization. The boundaries of the words in the dictionary are prioritized, which may reduce unnecessary expansion.
    - max_word: this mode recognizes the words defined in the dictionary but does not affect the maximum expansion of tokenization. Even if the dictionary defines words, the max_word mode still attempts to split the text into more words.

- LOCAL: specifies to create a local index. This parameter is optional.
Here are some examples:
- Create a table named tbl1 and a full-text index named full_idx1_tbl1.

  CREATE TABLE tbl1(id INT, name VARCHAR(18), date DATE, PRIMARY KEY (id), FULLTEXT INDEX full_idx1_tbl1(name));

- Create a table named tbl2 and a full-text index named full_idx1_tbl2. Specify the NGRAM tokenizer for the full-text index and set the tokenizer properties by using the PARSER_PROPERTIES parameter.

  CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
      FULLTEXT INDEX full_idx1_tbl2(name, doc)
      WITH PARSER NGRAM
      PARSER_PROPERTIES=(ngram_token_size=3));
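Before settling on a token size, you can preview how a tokenizer will split your data by calling the TOKENIZE function directly, for example with the default ngram setting (n=2); the sample string below is illustrative:

select tokenize("ORD-2024-001", 'ngram');

Tokenizer parameters can also be passed to TOKENIZE through its JSON argument, as shown in the Ngram2 and IK tokenizer sections above.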
Use the CREATE FULLTEXT INDEX statement to create a full-text index
You can run the CREATE FULLTEXT INDEX statement to create a full-text index for an existing table. The syntax is as follows:
CREATE FULLTEXT INDEX index_name ON table_name (column_name, [column_name ...])
[WITH PARSER tokenizer_option]
[PARSER_PROPERTIES[=](parser_properties_list)]
[LOCAL];
tokenizer_option:
SPACE
| NGRAM
| BENG
| IK
| NGRAM2
parser_properties_list:
parser_properties, [parser_properties]
parser_properties:
min_token_size = int_value
| max_token_size = int_value
| ngram_token_size = int_value
| ik_mode = 'char_value'
| min_ngram_size = int_value
| max_ngram_size = int_value
For more information about the CREATE INDEX statement, see CREATE INDEX.
The following table describes the related parameters:
- index_name: specifies the name of the index to be added.
- table_name: specifies the name of the table for which to create an index.
- column_name: specifies the columns to be indexed. Separate multiple columns with commas.
- WITH PARSER tokenizer_option: specifies the tokenizer for the full-text index. Valid values:

  - SPACE: the default value. This option specifies to split the text by spaces. You can specify the following parameters:

    | Parameter | Value range |
    |---|---|
    | min_token_size | [1, 16] |
    | max_token_size | [10, 84] |

  - NGRAM: specifies to split the text by using the N-Gram (Chinese) method. You can specify the following parameter:

    | Parameter | Value range |
    |---|---|
    | ngram_token_size | [1, 10] |

  - NGRAM2: specifies to split the text into consecutive characters of lengths ranging from min_ngram_size to max_ngram_size. You can specify the following parameters:

    | Parameter | Value range |
    |---|---|
    | min_ngram_size | [1, 16] |
    | max_ngram_size | [1, 16] |

  - BENG: specifies to split the text by using the Beng (basic English) method. You can specify the following parameters:

    | Parameter | Value range |
    |---|---|
    | min_token_size | [1, 16] |
    | max_token_size | [10, 84] |

  - IK: specifies to split the text by using the IK (Chinese) method. Only the utf-8 character set is supported. You can specify the following parameter:

    | Parameter | Value range |
    |---|---|
    | ik_mode | smart, max_word |

  You can use the TOKENIZE function to view the tokenization result of the text by using the specified tokenizer and JSON parameters.

- PARSER_PROPERTIES[=](parser_properties_list): specifies the parameters of the tokenizer. Valid values:

  - min_token_size: specifies the minimum token length. The default value is 3, and the value ranges from 1 to 16.
  - max_token_size: specifies the maximum token length. The default value is 84, and the value ranges from 10 to 84.
  - ngram_token_size: specifies the token length of the NGRAM tokenizer. This parameter is valid only for the NGRAM tokenizer. The default value is 2, and the value ranges from 1 to 10.
  - ik_mode: specifies the tokenization mode of the IK tokenizer. Valid values:
    - smart: the default value. This option specifies to use the words in the dictionary to improve the tokenization accuracy. The boundaries of the words in the dictionary are prioritized, which may reduce unnecessary extensions.
    - max_word: specifies to identify the words defined in the dictionary without affecting the maximum extension of tokenization. Even if the dictionary defines the words, the max_word mode still tries to split the text into more words.

- LOCAL: specifies to create a local index.
Here is an example:
After you create the tbl3 table, create a full-text index named ft_idx1_tbl3.
- Create the tbl3 table.

  CREATE TABLE tbl3(col1 INT, col2 VARCHAR(4096));

- Create a full-text index named ft_idx1_tbl3 for the tbl3 table. Specify the IK tokenizer and set the tokenizer parameters by using the PARSER_PROPERTIES clause.

  CREATE FULLTEXT INDEX ft_idx1_tbl3 ON tbl3(col2)
  WITH PARSER IK
  PARSER_PROPERTIES=(ik_mode='max_word');
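Once the index exists, full-text queries can use it through the MATCH ... AGAINST syntax shown in the complete example at the end of this topic. A minimal sketch against tbl3 (the inserted row is illustrative):

INSERT INTO tbl3 VALUES (1, 'Nanjing City Yangtze River Bridge');
SELECT col1, col2 FROM tbl3 WHERE MATCH(col2) AGAINST('Yangtze River Bridge');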
Use the ALTER TABLE statement to create a full-text index
You can use the ALTER TABLE statement to add a full-text index to an existing table. The syntax is as follows:
ALTER TABLE table_name ADD FULLTEXT [INDEX | KEY] [index_name](column_name, [column_name ...])
[WITH PARSER tokenizer_option]
[PARSER_PROPERTIES[=](parser_properties_list)]
[LOCAL];
tokenizer_option:
SPACE
| NGRAM
| BENG
| IK
| NGRAM2
parser_properties_list:
parser_properties, [parser_properties]
parser_properties:
min_token_size = int_value
| max_token_size = int_value
| ngram_token_size = int_value
| ik_mode = 'char_value'
| min_ngram_size = int_value
| max_ngram_size = int_value
For more information about the ALTER TABLE statement, see ALTER TABLE.
The following table describes the related parameters:
- table_name: specifies the name of the table to which the index is added.
- INDEX | KEY: specifies whether to use INDEX or KEY as the index keyword. The default value is KEY.
- index_name: specifies the name of the index to be created. If you do not specify this parameter, the index name is the same as the column name.
- column_name: specifies the column to be indexed. Separate multiple columns with commas (,).
- WITH PARSER tokenizer_option: specifies the tokenizer for the full-text index. Valid values:

  - SPACE: specifies the tokenizer based on spaces. You can specify the following parameters:

    | Parameter | Value range |
    |---|---|
    | min_token_size | [1, 16] |
    | max_token_size | [10, 84] |

  - NGRAM: specifies the tokenizer based on N-Grams (for Chinese). You can specify the following parameter:

    | Parameter | Value range |
    |---|---|
    | ngram_token_size | [1, 10] |

  - NGRAM2: specifies the tokenizer that splits the text into consecutive characters of lengths between min_ngram_size and max_ngram_size. You can specify the following parameters:

    | Parameter | Value range |
    |---|---|
    | min_ngram_size | [1, 16] |
    | max_ngram_size | [1, 16] |

  - BENG: specifies the tokenizer based on Beng (basic English). You can specify the following parameters:

    | Parameter | Value range |
    |---|---|
    | min_token_size | [1, 16] |
    | max_token_size | [10, 84] |

  - IK: specifies the tokenizer based on IK (for Chinese). Only the utf-8 character set is supported. You can specify the following parameter:

    | Parameter | Value range |
    |---|---|
    | ik_mode | smart, max_word |

  You can use the TOKENIZE function to view the tokenization result of the text based on the specified tokenizer and JSON parameters.

- PARSER_PROPERTIES[=](parser_properties_list): specifies the tokenizer parameters. Valid values:

  - min_token_size: specifies the minimum token length. Default value: 3. Value range: 1 to 16.
  - max_token_size: specifies the maximum token length. Default value: 84. Value range: 10 to 84.
  - ngram_token_size: specifies the token length for the NGRAM tokenizer. This parameter is valid only for the NGRAM tokenizer. Default value: 2. Value range: 1 to 10.
  - ik_mode: specifies the tokenization mode for the IK tokenizer. Valid values:
    - smart: the default value. This mode uses the dictionary to improve the accuracy of tokenization. It prioritizes the boundaries of dictionary words, thereby reducing unnecessary extensions.
    - max_word: this mode recognizes the words defined in the dictionary but does not affect the maximum extension of tokenization. Even if the dictionary defines words, the max_word mode still tries to split the text into more words.

- LOCAL: specifies whether to create a local index.
Here is an example:
After you create a table named tbl4, you can add a full-text index named ft_idx1_tbl4 to the table.
- Create the tbl4 table.

  CREATE TABLE tbl4(col1 INT, col2 TEXT);

- Add a full-text index named ft_idx1_tbl4 to the tbl4 table. Specify BENG as the tokenizer for the full-text index and set the tokenizer parameters by using the PARSER_PROPERTIES parameter.

  ALTER TABLE tbl4 ADD FULLTEXT INDEX ft_idx1_tbl4(col2)
  WITH PARSER BENG
  PARSER_PROPERTIES=(min_token_size=2, max_token_size=64);
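To confirm that the newly added index is picked up, you can run EXPLAIN on a MATCH ... AGAINST query against tbl4, in the same way as in the example below; a TEXT RETRIEVAL SCAN operator in the plan indicates that the full-text index is used. The query text here is illustrative:

EXPLAIN SELECT col1, col2 FROM tbl4 WHERE MATCH(col2) AGAINST('hello world');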
Example
In this example, a table is defined to store document information, and a full-text index is set for the documents. With the full-text index, you can quickly match documents containing the desired keywords and sort them by similarity in descending order.
- Create a table.

CREATE TABLE Articles (
    id INT AUTO_INCREMENT,
    title VARCHAR(255),
    content TEXT,
    PRIMARY KEY (id),
    FULLTEXT ft1 (content) WITH PARSER SPACE
);

Query OK, 0 rows affected (0.67 sec)
- Insert data.

INSERT INTO Articles (title, content) VALUES
('seekdb overview', 'seekdb is an AI-native search database. It unifies relational, vector, text, JSON and GIS in a single engine, enabling hybrid search and in-database AI workflows.'),
('Full-Text Search in Databases', 'Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly.'),
('Advantages of Using seekdb', 'seekdb offers several advantages such as high performance, reliability, and ease of use.');

Query OK, 3 rows affected (0.10 sec)
Records: 3 Duplicates: 0 Warnings: 0

- Query the table.

select * from Articles;
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | title | content |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1 | seekdb overview | seekdb is an AI-native search database. It unifies relational, vector, text, JSON and GIS in a single engine, enabling hybrid search and in-database AI workflows. |
| 2 | Full-Text Search in Databases | Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly. |
| 3 | Advantages of Using seekdb | seekdb offers several advantages such as high performance, reliability, and ease of use. |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.002 sec)

- Query the matched documents.

select id,title, content,match(content) against('seekdb database') score from Articles where match(content) against('seekdb database');
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| id | title | content | score |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| 1 | seekdb overview | seekdb is an AI-native search database. It unifies relational, vector, text, JSON and GIS in a single engine, enabling hybrid search and in-database AI workflows. | 0.4570384669555348 |
| 3 | Advantages of Using seekdb | seekdb offers several advantages such as high performance, reliability, and ease of use. | 0.240174672489083 |
| 2 | Full-Text Search in Databases | Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly. | 0.20072992700729927 |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
3 rows in set (0.003 sec)

- Use the EXPLAIN command to view the query plan and analyze its performance.

explain select id,title, content,match(content) against('seekdb database') score from Articles where match(content) against('seekdb database');
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| ============================================================== |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| -------------------------------------------------------------- |
| |0 |SORT | |4 |139 | |
| |1 |└─TEXT RETRIEVAL SCAN|articles(ft1)|4 |138 | |
| ============================================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([articles.id], [articles.title], [articles.content], [MATCH(articles.content) AGAINST('seekdb database')]), filter(nil), rowset=16 |
| sort_keys([MATCH(articles.content) AGAINST('seekdb database'), DESC]) |
| 1 - output([articles.id], [articles.content], [articles.title], [MATCH(articles.content) AGAINST('seekdb database')]), filter(nil), rowset=16 |
| access([articles.id], [articles.content], [articles.title]), partitions(p0) |
| is_index_back=true, is_global_index=false, |
| calc_relevance=true, match_expr(MATCH(articles.content) AGAINST('seekdb database')), |
| pushdown_match_filter(MATCH(articles.content) AGAINST('seekdb database')) |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
15 rows in set (0.002 sec)