Experience full-text indexing
Background information
seekdb's full-text indexing feature can effectively solve various problems encountered in actual production, especially in scenarios such as system log analysis and user behavior and profile analysis. This feature can quickly filter and screen data efficiently, as well as perform high-quality relevance evaluation. In addition, combined with the multi-path recall architecture of sparse and dense vectors, more efficient recall can be achieved in RAG systems in specific knowledge domains.
This tutorial uses document retrieval scenarios as an example. In such scenarios, three core challenges place higher demands on retrieval systems:
- Real-time requirements: Quickly locate target information from TB-level data.
- Semantic complexity: Solve natural language processing challenges such as word segmentation and synonym processing.
- Hybrid query requirements: Improve the joint optimization capability of text retrieval and structured queries.
This tutorial demonstrates how to quickly find target documents from massive information by using the full-text indexing feature. We will use keywords in queries to demonstrate the improvements of seekdb's full-text indexing in terms of functionality, performance, and ease of use.
How it works
In seekdb's storage engine, user documents and queries are split into multiple keywords (word/token) by a tokenizer. These keywords and the statistical information features of documents are stored in internal auxiliary tables (tablets) for relevance evaluation (ranking) during the information retrieval phase. seekdb uses the advanced BM25 algorithm, which can more effectively calculate the relevance score between keywords in user query statements and stored documents, and finally output documents that meet the conditions and their scores.
In the full-text indexing query process, combined with seekdb's high-performance query engine, seekdb has optimized the TAAT/DAAT process and supports union merge between multiple indexes. These improvements enable full-text indexing to handle more complex query features and meet users' data retrieval needs.

Prerequisites
To successfully operate and experience seekdb's full-text indexing feature, ensure that the following prerequisites are met:
-
Environment requirements: seekdb is deployed.
-
Database creation: Ensure that a database is created. For detailed steps, see Create a database.
Procedure
The following steps guide you through experiencing seekdb's full-text indexing and common views and query techniques.
Step 1: Import a dataset
seekdb has a built-in Beng tokenizer that is suitable for English, as well as a Boolean mode that is more efficient than traditional natural language processing. The Beng tokenizer is suitable for English text and provides efficient word segmentation for English documents. seekdb's built-in tokenizers also include IK (for Chinese), space (for space-separated languages), and ngram (which splits by character length).
We will use the wikIR1k dataset to import data into seekdb, create a table named wikir1k with a document column, and create a full-text index on the document field using the Beng tokenizer.
All query results and performance metrics shown in the examples are for reference only. Your actual results may vary depending on your data volume, machine specifications, and query patterns.
-- Create a table and use the Beng tokenizer for full-text indexing
CREATE TABLE wikir1k (
id INT AUTO_INCREMENT PRIMARY KEY,
document TEXT,
FULLTEXT INDEX ft_idx1_document(document)
WITH PARSER beng
);
Import the dataset into the table through the client's local file method.
-- Import data
LOAD DATA /*+ PARALLEL(8) */ LOCAL INFILE '/home/admin/documents10k.csv' INTO TABLE wikir1k
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
After importing the data, the table contains approximately 10,000 documents (the exact count may vary slightly).
-- Verify the number of imported records
SELECT AVG(LENGTH(document)), COUNT(*) FROM wikir1k;
The following result is returned:
+-----------------------+----------+
| AVG(LENGTH(document)) | COUNT(*) |
+-----------------------+----------+
| 1144.6949 | 369721 |
+-----------------------+----------+
1 row in set (1.07 sec)
-- Query the view to verify the result
SELECT * FROM oceanbase.DBA_OB_TABLE_SPACE_USAGE WHERE DATABASE_NAME = 'test' AND TABLE_NAME LIKE '%wikir1k%';
The following result is returned:
+----------+---------------+------------+-------------+---------------+
| TABLE_ID | DATABASE_NAME | TABLE_NAME | OCCUPY_SIZE | REQUIRED_SIZE |
+----------+---------------+------------+-------------+---------------+
| 500252 | test | wikir1k | 185571540 | 190853120 |
+----------+---------------+------------+-------------+---------------+
1 row in set (0.05 sec)
Step 2: Query using full-text indexing
Using the stored document dataset and index, we can perform multi-condition combination or highly filtered retrieval. For example, if I want to search for documents containing both "london" and "mayfair", I can use Boolean mode.
Compared to string LIKE matching without an index, Boolean mode has simpler syntax and faster query speed.
-- Use Boolean mode to query and find documents that contains both "london" and "mayfair"
SELECT COUNT(*) FROM wikir1k
WHERE MATCH (document) AGAINST ('+london +mayfair' IN BOOLEAN MODE);
The following result is returned:
+----------+
| COUNT(*) |
+----------+
| 58 |
+----------+
1 row in set (0.01 sec)
In contrast, using the LIKE query method:
-- Use LIKE syntax to query
SELECT COUNT(*) FROM wikir1k
WHERE document LIKE '%london%' AND document LIKE '%mayfair%';
The following result is also returned:
+----------+
| COUNT(*) |
+----------+
| 58 |
+----------+
1 row in set (3.48 sec)
For the documents returned, we can further perform ranking by using the score in the output result to determine which documents are more relevant to the query.
-- Return the id and score of the documents to help determine relevance
SELECT id, MATCH (document) AGAINST ('london mayfair') AS score
FROM wikir1k
WHERE MATCH (document) AGAINST ('+london +mayfair' IN BOOLEAN MODE)
LIMIT 10;
The following result is returned:
+---------+--------------------+
| id | score |
+---------+--------------------+
| 425035 | 17.661768297948015 |
| 1122217 | 16.349131415195043 |
| 34959 | 14.813025094926918 |
| 1576669 | 14.620715555483576 |
| 2100682 | 13.40354137543347 |
| 1179964 | 13.40354137543347 |
| 1642217 | 13.391619146335605 |
| 123391 | 13.36985391637557 |
| 852529 | 13.336357369363272 |
| 380931 | 13.249691534256172 |
+---------+--------------------+
10 rows in set (0.03 sec)
At the same time, Boolean mode also allows us to reverse exclude some keywords. For example, if I want to find documents about "london" but exclude those mentioning "westminster", I can use the - operator in Boolean mode.
-- Query documents about london but excluding westminster
SELECT COUNT(*) FROM wikir1k
WHERE MATCH (document) AGAINST ('+london -westminster' IN BOOLEAN MODE);
The following result is returned:
+----------+
| COUNT(*) |
+----------+
| 18771 |
+----------+
1 row in set (0.01 sec)
Step 3: Tuning
Tune using the TOKENIZE function
When the query results of full-text indexing do not meet expectations, it is usually because the tokenization results are not ideal. seekdb provides a fast TOKENIZE function to assist in testing tokenization effects. This function supports all tokenizers and their corresponding properties. You can use the TOKENIZE function to verify tokenizer processing effects.
For example, the tokenization results in the following example show how the Beng tokenizer splits English text into words, which helps verify that the tokenization is working correctly.
-
Use the
TOKENIZEfunction to verify tokenizer processing effects:-- Verify English document tokenization effects using Beng tokenizer
SELECT TOKENIZE('The computer system provides efficient processing and information management capabilities', 'beng', '[]');The following result is returned:
+---------------------------------------------------------------------------------------------------------------------+
| TOKENIZE('The computer system provides efficient processing and information management capabilities', 'beng', '[]') |
+---------------------------------------------------------------------------------------------------------------------+
| ["efficient", "processing", "capabilities", "system", "computer", "provides", "and", "information", "management"] |
+---------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)The above result shows that the text has been correctly split into individual words.
-
Next, execute the following statement to check whether the query statement hits the target document:
-- Use Boolean mode to retrieve documents about computer systems
SELECT COUNT(*)
FROM wikir1k
WHERE MATCH (document) AGAINST ('+computer +system' IN BOOLEAN MODE);The following result is returned:
+----------+
| COUNT(*) |
+----------+
| 1010 |
+----------+
1 row in set (0.01 sec)The above result shows that target records were matched.
Performance comparison with MySQL
To compare the full-text indexing performance differences between seekdb and MySQL, we use MySQL's full-text indexing feature as a reference. The complete dataset wikir1k (containing 369,721 rows, with an average of 200 words per row) is used for performance comparison.
The test results are provided for reference only and may vary depending on your specific environment, data volume, and query patterns.
The following are the comparison results of various scenarios in natural language mode and Boolean mode. It can be seen that in scenarios that require a large amount of tokenization or return large result sets, seekdb's performance is significantly better than MySQL. For small result sets, since the calculation proportion is small, the query engine's advantage is not obvious, and the performance of both engines is similar.
Test environment: seekdb's test specification is 8c 16g, and MySQL version uses 8.0.36 for Linux on x86_64 (MySQL Community Server - GPL).
Natural language mode
-- q1: Query documents containing "and"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('and');
-- q2: Query documents containing "and", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('and') LIMIT 10;
-- q3: Query documents containing "librettists"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('librettists');
-- q4: Query documents containing "librettists", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('librettists') LIMIT 10;
-- q5: Query documents containing "alleviating librettists"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists');
-- q6: Query documents containing "black spotted white yellow"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('black spotted white yellow');
-- q7: Query documents containing "black spotted white yellow", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('black spotted white yellow') LIMIT 10;
-- q8: Query documents containing "between up and down"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('between up and down');
-- q9: Query documents containing "between up and down", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('between up and down') LIMIT 10;
-- q10: Query long documents
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire');
-- q11: Query long documents, with "and" appended
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire and');
-- q12: Query long documents, limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire and') LIMIT 10;
| Scenario | seekdb | MySQL |
|---|---|---|
| q1 Single token high-frequency word | 3820458us | 5718430us |
| q2 Single token high-frequency word limit | 231861us | 503772us |
| q3 Single token low-frequency word | 879us | 672us |
| q4 Single token low-frequency word limit | 720us | 700us |
| q5 Multiple tokens small result set | 1591us | 1100us |
| q6 Multiple tokens medium result set | 259700us | 602221us |
| q7 Multiple tokens medium result set limit | 25502us | 42620us |
| q8 Multiple tokens large result set | 3842391us | 6846847us |
| q9 Multiple tokens large result set limit | 301362us | 784024us |
| q10 Many tokens small result set | 22143us | 10161us |
| q11 Many tokens large result set | 3905829us | 5929343us |
| q12 Many tokens large result set limit | 345968us | 769970us |
Boolean mode
-- q1: +high-frequency word -medium-frequency word
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and -which -his' IN BOOLEAN MODE);
-- q2: +high-frequency word -low-frequency word
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+which (+and -his)' IN BOOLEAN MODE);
-- q3: +medium-frequency word (+high-frequency word -medium-frequency word)
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and -carabantes -bufera' IN BOOLEAN MODE);
-- q4: +high-frequency word +low-frequency word
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and +librettists' IN BOOLEAN MODE);
| Scenario | seekdb | MySQL |
|---|---|---|
| q1: +high-frequency word -medium-frequency word | 1586657us | 2440798us |
| q2: +high-frequency word -low-frequency word | 3726508us | 7974832us |
| q3: +medium-frequency word (+high-frequency word -medium-frequency word) | 3080644us | 5612041us |
| q4: +high-frequency word +low-frequency word | 230284us | 357580us |
Performance comparison summary
From the above data comparison, it can be seen that when performing complex full-text retrieval, seekdb demonstrates significantly better performance than MySQL in both natural language mode and Boolean mode. Especially when processing queries that require a large amount of tokenization or return large result sets, seekdb's advantages are more obvious. This provides strong reference for developers and data analysts when choosing a database, especially in application scenarios that require efficient retrieval of massive data, where seekdb clearly demonstrates its powerful performance and flexible query capabilities.
seekdb's full-text indexing can always provide fast response times when processing complex queries, making it more suitable for actual application scenarios that require high concurrency and high-performance retrieval.
What's next
For more guides on experiencing seekdb's AI Native features and building AI applications based on seekdb, see:
- Experience vector search
- Experience hybrid search
- Experience AI function service
- Experience semantic indexing
- Experience the Vibe Coding paradigm with Cursor Agent + OceanBase MCP
- Build a knowledge base desktop application based on seekdb
- Build a cultural tourism assistant with multi-model integration based on seekdb
- Build an image search application based on seekdb
In addition to using SQL for operations, you can also use the Python SDK (pyseekdb) provided by seekdb. For usage instructions, see Experience embedded seekdb using Python SDK and pyseekdb overview.