Experience full-text indexing

Background information

seekdb's full-text indexing feature can effectively solve various problems encountered in actual production, especially in scenarios such as system log analysis and user behavior and profile analysis. This feature can quickly filter and screen data efficiently, as well as perform high-quality relevance evaluation. In addition, combined with the multi-path recall architecture of sparse and dense vectors, more efficient recall can be achieved in RAG systems in specific knowledge domains.

This tutorial uses document retrieval scenarios as an example. In such scenarios, three core challenges place higher demands on retrieval systems:

Real-time requirements: Quickly locate target information from TB-level data.
Semantic complexity: Solve natural language processing challenges such as word segmentation and synonym processing.
Hybrid query requirements: Improve the joint optimization capability of text retrieval and structured queries.

This tutorial demonstrates how to quickly find target documents from massive information by using the full-text indexing feature. We will use keywords in queries to demonstrate the improvements of seekdb's full-text indexing in terms of functionality, performance, and ease of use.

How it works

In seekdb's storage engine, user documents and queries are split into multiple keywords (word/token) by a tokenizer. These keywords and the statistical information features of documents are stored in internal auxiliary tables (tablets) for relevance evaluation (ranking) during the information retrieval phase. seekdb uses the advanced BM25 algorithm, which can more effectively calculate the relevance score between keywords in user query statements and stored documents, and finally output documents that meet the conditions and their scores.

In the full-text indexing query process, combined with seekdb's high-performance query engine, seekdb has optimized the TAAT/DAAT process and supports union merge between multiple indexes. These improvements enable full-text indexing to handle more complex query features and meet users' data retrieval needs.

Prerequisites

To successfully operate and experience seekdb's full-text indexing feature, ensure that the following prerequisites are met:

Environment requirements: seekdb is deployed.
Database creation: Ensure that a database is created. For detailed steps, see Create a database.

Procedure

The following steps guide you through experiencing seekdb's full-text indexing and common views and query techniques.

Step 1: Import a dataset

seekdb has a built-in Beng tokenizer that is suitable for English, as well as a Boolean mode that is more efficient than traditional natural language processing. The Beng tokenizer is suitable for English text and provides efficient word segmentation for English documents. seekdb's built-in tokenizers also include IK (for Chinese), space (for space-separated languages), and ngram (which splits by character length).

We will use the wikIR1k dataset to import data into seekdb, create a table named wikir1k with a document column, and create a full-text index on the document field using the Beng tokenizer.

tip

All query results and performance metrics shown in the examples are for reference only. Your actual results may vary depending on your data volume, machine specifications, and query patterns.

-- Create a table and use the Beng tokenizer for full-text indexing
CREATE TABLE wikir1k (
    id INT AUTO_INCREMENT PRIMARY KEY,
    document TEXT,
    FULLTEXT INDEX ft_idx1_document(document)
        WITH PARSER beng
);

Import the dataset into the table through the client's local file method.

-- Import data
LOAD DATA /*+ PARALLEL(8) */ LOCAL INFILE '/home/admin/documents10k.csv' INTO TABLE wikir1k
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

After importing the data, the table contains approximately 10,000 documents (the exact count may vary slightly).

-- Verify the number of imported records
SELECT AVG(LENGTH(document)), COUNT(*) FROM wikir1k;

The following result is returned:

+-----------------------+----------+
| AVG(LENGTH(document)) | COUNT(*) |
+-----------------------+----------+
|             1144.6949 |   369721 |
+-----------------------+----------+
1 row in set (1.07 sec)

-- Query the view to verify the result
SELECT * FROM oceanbase.DBA_OB_TABLE_SPACE_USAGE WHERE DATABASE_NAME = 'test' AND TABLE_NAME LIKE '%wikir1k%';

The following result is returned:

+----------+---------------+------------+-------------+---------------+
| TABLE_ID | DATABASE_NAME | TABLE_NAME | OCCUPY_SIZE | REQUIRED_SIZE |
+----------+---------------+------------+-------------+---------------+
|   500252 | test          | wikir1k    |   185571540 |     190853120 |
+----------+---------------+------------+-------------+---------------+
1 row in set (0.05 sec)

Step 2: Query using full-text indexing

Using the stored document dataset and index, we can perform multi-condition combination or highly filtered retrieval. For example, if I want to search for documents containing both "london" and "mayfair", I can use Boolean mode.

Compared to string LIKE matching without an index, Boolean mode has simpler syntax and faster query speed.

-- Use Boolean mode to query and find documents that contains both "london" and "mayfair"
SELECT COUNT(*) FROM wikir1k
WHERE MATCH (document) AGAINST ('+london +mayfair' IN BOOLEAN MODE);

The following result is returned:

+----------+
| COUNT(*) |
+----------+
|       58 |
+----------+
1 row in set (0.01 sec)

In contrast, using the LIKE query method:

-- Use LIKE syntax to query
SELECT COUNT(*) FROM wikir1k
WHERE document LIKE '%london%' AND document LIKE '%mayfair%';

The following result is also returned:

+----------+
| COUNT(*) |
+----------+
|       58 |
+----------+
1 row in set (3.48 sec)

For the documents returned, we can further perform ranking by using the score in the output result to determine which documents are more relevant to the query.

-- Return the id and score of the documents to help determine relevance
SELECT id, MATCH (document) AGAINST ('london mayfair') AS score
FROM wikir1k
WHERE MATCH (document) AGAINST ('+london +mayfair' IN BOOLEAN MODE) 
LIMIT 10;

The following result is returned:

+---------+--------------------+
| id      | score              |
+---------+--------------------+
|  425035 | 17.661768297948015 |
| 1122217 | 16.349131415195043 |
|   34959 | 14.813025094926918 |
| 1576669 | 14.620715555483576 |
| 2100682 |  13.40354137543347 |
| 1179964 |  13.40354137543347 |
| 1642217 | 13.391619146335605 |
|  123391 |  13.36985391637557 |
|  852529 | 13.336357369363272 |
|  380931 | 13.249691534256172 |
+---------+--------------------+
10 rows in set (0.03 sec)

At the same time, Boolean mode also allows us to reverse exclude some keywords. For example, if I want to find documents about "london" but exclude those mentioning "westminster", I can use the - operator in Boolean mode.

-- Query documents about london but excluding westminster
SELECT COUNT(*) FROM wikir1k
WHERE MATCH (document) AGAINST ('+london -westminster' IN BOOLEAN MODE);

The following result is returned:

+----------+
| COUNT(*) |
+----------+
|    18771 |
+----------+
1 row in set (0.01 sec)

Step 3: Tuning

Tune using the `TOKENIZE` function

When the query results of full-text indexing do not meet expectations, it is usually because the tokenization results are not ideal. seekdb provides a fast TOKENIZE function to assist in testing tokenization effects. This function supports all tokenizers and their corresponding properties. You can use the TOKENIZE function to verify tokenizer processing effects.

For example, the tokenization results in the following example show how the Beng tokenizer splits English text into words, which helps verify that the tokenization is working correctly.

Use the TOKENIZE function to verify tokenizer processing effects:

-- Verify English document tokenization effects using Beng tokenizer
SELECT TOKENIZE('The computer system provides efficient processing and information management capabilities', 'beng', '[]');

The following result is returned:

+---------------------------------------------------------------------------------------------------------------------+
| TOKENIZE('The computer system provides efficient processing and information management capabilities', 'beng', '[]') |
+---------------------------------------------------------------------------------------------------------------------+
| ["efficient", "processing", "capabilities", "system", "computer", "provides", "and", "information", "management"]   |
+---------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

The above result shows that the text has been correctly split into individual words.

Next, execute the following statement to check whether the query statement hits the target document:

-- Use Boolean mode to retrieve documents about computer systems
SELECT COUNT(*)
FROM wikir1k
WHERE MATCH (document) AGAINST ('+computer +system' IN BOOLEAN MODE);

The following result is returned:

+----------+
| COUNT(*) |
+----------+
|     1010 |
+----------+
1 row in set (0.01 sec)

The above result shows that target records were matched.

Performance comparison with MySQL

To compare the full-text indexing performance differences between seekdb and MySQL, we use MySQL's full-text indexing feature as a reference. The complete dataset wikir1k (containing 369,721 rows, with an average of 200 words per row) is used for performance comparison.

tip

The test results are provided for reference only and may vary depending on your specific environment, data volume, and query patterns.

The following are the comparison results of various scenarios in natural language mode and Boolean mode. It can be seen that in scenarios that require a large amount of tokenization or return large result sets, seekdb's performance is significantly better than MySQL. For small result sets, since the calculation proportion is small, the query engine's advantage is not obvious, and the performance of both engines is similar.

Test environment: seekdb's test specification is 8c 16g, and MySQL version uses 8.0.36 for Linux on x86_64 (MySQL Community Server - GPL).

Natural language mode

-- q1: Query documents containing "and"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('and');

-- q2: Query documents containing "and", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('and') LIMIT 10;

-- q3: Query documents containing "librettists"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('librettists');

-- q4: Query documents containing "librettists", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('librettists') LIMIT 10;

-- q5: Query documents containing "alleviating librettists"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists');

-- q6: Query documents containing "black spotted white yellow"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('black spotted white yellow');

-- q7: Query documents containing "black spotted white yellow", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('black spotted white yellow') LIMIT 10;

-- q8: Query documents containing "between up and down"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('between up and down');

-- q9: Query documents containing "between up and down", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('between up and down') LIMIT 10;

-- q10: Query long documents
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire');

-- q11: Query long documents, with "and" appended
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire and');

-- q12: Query long documents, limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire and') LIMIT 10;

Scenario	seekdb	MySQL
q1 Single token high-frequency word	3820458us	5718430us
q2 Single token high-frequency word limit	231861us	503772us
q3 Single token low-frequency word	879us	672us
q4 Single token low-frequency word limit	720us	700us
q5 Multiple tokens small result set	1591us	1100us
q6 Multiple tokens medium result set	259700us	602221us
q7 Multiple tokens medium result set limit	25502us	42620us
q8 Multiple tokens large result set	3842391us	6846847us
q9 Multiple tokens large result set limit	301362us	784024us
q10 Many tokens small result set	22143us	10161us
q11 Many tokens large result set	3905829us	5929343us
q12 Many tokens large result set limit	345968us	769970us

Boolean mode

-- q1: +high-frequency word -medium-frequency word
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and -which -his' IN BOOLEAN MODE);

-- q2: +high-frequency word -low-frequency word
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+which  (+and -his)' IN BOOLEAN MODE);

-- q3: +medium-frequency word (+high-frequency word -medium-frequency word)
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and -carabantes -bufera' IN BOOLEAN MODE);

-- q4: +high-frequency word +low-frequency word
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and +librettists'  IN BOOLEAN MODE);

Scenario	seekdb	MySQL
q1: +high-frequency word -medium-frequency word	1586657us	2440798us
q2: +high-frequency word -low-frequency word	3726508us	7974832us
q3: +medium-frequency word (+high-frequency word -medium-frequency word)	3080644us	5612041us
q4: +high-frequency word +low-frequency word	230284us	357580us

Performance comparison summary

From the above data comparison, it can be seen that when performing complex full-text retrieval, seekdb demonstrates significantly better performance than MySQL in both natural language mode and Boolean mode. Especially when processing queries that require a large amount of tokenization or return large result sets, seekdb's advantages are more obvious. This provides strong reference for developers and data analysts when choosing a database, especially in application scenarios that require efficient retrieval of massive data, where seekdb clearly demonstrates its powerful performance and flexible query capabilities.

seekdb's full-text indexing can always provide fast response times when processing complex queries, making it more suitable for actual application scenarios that require high concurrency and high-performance retrieval.

What's next

For more guides on experiencing seekdb's AI Native features and building AI applications based on seekdb, see:

In addition to using SQL for operations, you can also use the Python SDK (pyseekdb) provided by seekdb. For usage instructions, see Experience embedded seekdb using Python SDK and pyseekdb overview.

Background information​

How it works​

Prerequisites​

Procedure​

Step 1: Import a dataset​

Step 2: Query using full-text indexing​

Step 3: Tuning​

Tune using the TOKENIZE function​

Performance comparison with MySQL​

Natural language mode​

Boolean mode​

Performance comparison summary​

What's next​

Contents

Background information

How it works

Prerequisites

Procedure

Step 1: Import a dataset

Step 2: Query using full-text indexing

Step 3: Tuning

Tune using the `TOKENIZE` function

Performance comparison with MySQL

Natural language mode

Boolean mode

Performance comparison summary

What's next