Create a hybrid vector index

This topic describes how to create a hybrid vector index in seekdb.

Overview

Hybrid vector indexes leverage seekdb's built-in embedding capabilities to greatly simplify the vector index usage process. They make the vector concept transparent to users: you can directly write raw data (such as text) that needs to be stored, and seekdb will automatically convert it to vectors and build indexes internally. During retrieval, you only need to provide the raw query content, and seekdb will also automatically perform embedding and retrieve the vector index, significantly improving ease of use.

Considering the performance overhead of embedding models, hybrid vector indexes provide two embedding modes for users to choose from:

Synchronous mode: Embedding and indexing are performed immediately after data is written, ensuring real-time data visibility.
Asynchronous mode: Background tasks perform data embedding and indexing in batches, which can significantly improve write performance. You can flexibly set the trigger cycle of background tasks based on your requirements for real-time data visibility.

In addition, this feature also provides the capability to perform brute-force search on hybrid vector indexes to help verify the correctness of search results. Brute-force search refers to performing a search using a full table scan to obtain the exact results of the n nearest rows.

Feature support

tip

This feature currently supports only HNSW/HNSW_BQ indexes.

This feature supports the full lifecycle of hybrid vector indexes, including creation, update, deletion, and retrieval, and is compatible with REFRESH_INDEX and REBUILD_INDEX in the DBMS_VECTOR system package. The syntax for update, deletion, and retrieval is exactly the same as that for regular vector indexes. In asynchronous mode, REFRESH_INDEX will additionally trigger data embedding. For details about creation and retrieval, see the sections below.

The supported features are as follows:

Module	Feature	Description
DDL	Create a hybrid vector index during table creation	You can create a hybrid vector index on a `VARCHAR` column when creating a table
DDL	Create a hybrid vector index after table creation	Supports creating a hybrid vector index on a `VARCHAR` column of an existing table
Retrieval	`semantic_distance` function	Pass raw data through this function for vector retrieval
Retrieval	`semantic_vector_distance` function	Pass vectors through this function for retrieval. There are two usage modes: When the SQL statement includes the `APPROXIMATE`/`APPROX` clause, vector index retrieval is used. When the `APPROXIMATE`/`APPROX` clause is not included, brute-force search using a full table scan is performed.
DBMS_VECTOR	`REFRESH_INDEX`	The usage is the same as that for regular vector indexes. Performs incremental index refresh and embedding in asynchronous mode
DBMS_VECTOR	`REBUILD_INDEX`	The usage is the same as that for regular vector indexes. Performs full index rebuild

Some usage notes are as follows:

In synchronous mode, write performance may be affected by embedding performance. In asynchronous mode, data visibility will be delayed.
For repeated retrieval scenarios, it is recommended to use AI Function Service to pre-obtain query vectors to avoid embedding for each retrieval.

Prerequisites

Before using hybrid vector indexes, you must register an embedding model and endpoint. The following is a registration example:

CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_embed');
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_embed_endpoint');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
'ob_embed', '{
    "type": "dense_embedding",
    "model_name": "BAAI/bge-m3"
}');

CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
'ob_embed_endpoint', '{
    "ai_model_name": "ob_embed",
    "url": "https://api.siliconflow.cn/v1/embeddings",
    "access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "provider": "siliconflow"
}');

info

Replace access_key with your actual API Key. The BAAI/bge-m3 model has a vector dimension of 1024, so you need to use dim=1024 when creating a hybrid vector index.

Creation syntax and description

Hybrid vector indexes support two creation methods: creation during table creation and creation after table creation. When creating an index, note the following:

The index must be created on a column of the VARCHAR type.
The model and sync_mode parameters are not supported for regular vector indexes.
The parameters and descriptions for an index created after table creation are the same as those for an index created during table creation.

Create during table creation

You can use the CREATE TABLE statement to create a hybrid vector index. Through index parameters, background tasks can be initiated synchronously or asynchronously. In synchronous mode, VARCHAR data is automatically converted to vector data when data is inserted. In asynchronous mode, data conversion is performed periodically or manually.

Syntax

CREATE TABLE table_name (
    column_name1 data_type1,
    column_name2 VARCHAR, -- Text column
    ...,
    VECTOR INDEX index_name (column_name2) WITH (param1=value1, param2=value2, ...)
);

Parameter description

Parameter	Default value	Value range	Required	Description	Remarks
`distance`		`l2`/`inner_product`/`cosine`	Yes	Specifies the vector distance algorithm type.	`l2` indicates Euclidean distance, `inner_product` indicates inner product distance, and `cosine` indicates cosine distance.
`type`		Currently supports `hnsw` / `hnsw_bq`	Yes	Specifies the index algorithm type.
`lib`	`vsag`	`vsag`	No	Specifies the vector index library type.	Currently, only the VSAG vector library is supported.
`model`		Registered model name	Yes	Specifies the large language model name used for embedding.	The model must be registered using AI Function Service before creating the index. Note Regular vector indexes do not support setting this parameter.
`dim`		Positive integer, maximum 4096	Yes	Specifies the vector dimension after embedding.	Must match the dimension provided by the model.
`sync_mode`	`async`	`immediate`/`manual`/`async`	No	Specifies the data and index synchronization mode.	`immediate` indicates synchronous mode, `manual` indicates manual mode, and `async` indicates asynchronous mode. Note Regular vector indexes do not support setting this parameter.
`sync_interval`	`10s`	Time interval, such as `10s`, `1h`, `1d`, etc.	No	Sets the trigger cycle of background tasks in asynchronous mode.	The numeric part must be positive. Units supported include seconds (s), hours (h), days (d), etc.

The usage of other vector index parameters (such as m, ef_construction, ef_search, etc.) is the same as that for regular vector indexes. For details, see the related documentation.

Create after table creation

Supports creating a hybrid vector index on a VARCHAR column of an existing table. When creating an index after table creation, synchronous or asynchronous background tasks are initiated through the provided index parameters. In synchronous mode, all existing VARCHAR data is converted to vector data. In asynchronous mode, data conversion is performed periodically or manually.

Syntax

CREATE VECTOR INDEX index_name 
ON table_name(varchar_column_name) 
WITH (param1=value1, param2=value2, ...);

Parameter description

The parameter description is the same as that for creating an index during table creation. For details, see the section above.

Create, update, and delete examples

DML operations (INSERT, UPDATE, DELETE) for hybrid vector indexes are exactly the same as those for regular vector indexes. When inserting or updating data of the VARCHAR type, the system automatically or asynchronously performs embedding based on the sync_mode parameter setting.

Create during table creation

Create the vector_idx index when creating the test table items:

-- Assume that the ob_embed model has been created previously (please refer to the "Prerequisites" section to register the model)
CREATE TABLE items (
    id BIGINT PRIMARY KEY, 
    doc VARCHAR(100),
    VECTOR INDEX vector_idx(doc)
    WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=async, sync_interval=10s)
);

Insert a row of data into the test table items. The system will automatically perform embedding:

INSERT INTO items(id, doc) VALUES(1, 'Rose');

Create after table creation

After creating the test table items, use the CREATE VECTOR INDEX statement to create the vector_idx index:

CREATE TABLE items (
    id BIGINT PRIMARY KEY, 
    doc VARCHAR(100)
);

-- Assume that the ob_embed model has been created previously (please refer to the "Prerequisites" section to register the model)
CREATE VECTOR INDEX vector_idx 
ON items (doc) 
WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=async, sync_interval=10s);

Insert a row of data into the test table items. The system will automatically perform embedding:

INSERT INTO items(id, doc) VALUES(1, 'Rose');

Update

When updating data of the VARCHAR type, the system will re-perform embedding:

Synchronous mode: Re-embedding is performed immediately after update.
Asynchronous mode: Re-embedding is performed by background tasks at the next trigger cycle after update.

Usage example:

UPDATE items SET doc = 'Lily' WHERE id = 1;

Delete

The delete operation is the same as that for regular vector indexes. You can directly delete the data.

Usage example:

DELETE FROM items WHERE id = 1;

Retrieval

Hybrid vector indexes support two retrieval methods:

Retrieve using raw text
Retrieve using vectors

For detailed usage of the APPROXIMATE/APPROX clause, see the related documentation on creating vector indexes at the end of this topic.

Retrieve using raw text

Use the semantic_distance expression to pass raw text for vector retrieval.

Syntax

SELECT ... FROM table_name 
ORDER BY semantic_distance(column_name, 'query_text') [APPROXIMATE|APPROX] 
LIMIT n;

Where:

column_name: The text column specified when creating the hybrid vector index.
query_text: The raw text for retrieval.
n: The number of result rows to return.

Usage example

-- Assume that the ob_embed model has been created previously
CREATE TABLE items (
    id INT PRIMARY KEY, 
    doc varchar(100),
    VECTOR INDEX vector_idx(doc)
    WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=immediate)
);

INSERT INTO items(id, doc) VALUES(1, 'Rose');
INSERT INTO items(id, doc) VALUES(2, 'Sunflower');
INSERT INTO items(id, doc) VALUES(3, 'Rose');
INSERT INTO items(id, doc) VALUES(4, 'Sunflower');
INSERT INTO items(id, doc) VALUES(5, 'Rose');

-- Retrieve using raw text
SELECT id, doc FROM items
ORDER BY semantic_distance(doc, 'Sunflower') 
APPROXIMATE LIMIT 3;

The return result is as follows:

+----+-----------+
| id | doc       |
+----+-----------+
|  2 | Sunflower |
|  4 | Sunflower |
|  5 | Rose      |
+----+-----------+
3 rows in set

Retrieve using vectors (with APPROXIMATE clause)

Use the semantic_vector_distance expression to pass vectors for retrieval. When the retrieval statement includes the APPROXIMATE/APPROX clause, vector index retrieval is used.

Syntax

SELECT ... FROM table_name 
ORDER BY semantic_vector_distance(column_name, 'query_vector') [APPROXIMATE|APPROX] 
LIMIT n;

Where:

column_name: The text column specified when creating the hybrid vector index.
query_vector: The query vector.
n: The number of result rows to return.

Usage example

-- Assume that the ob_embed model has been created previously (please refer to the "Prerequisites" section to register the model)
CREATE TABLE items (
    id INT PRIMARY KEY, 
    doc varchar(100),
    VECTOR INDEX vector_idx(doc)
    WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=immediate)
);

INSERT INTO items(id, doc) VALUES(1, 'Rose');
INSERT INTO items(id, doc) VALUES(2, 'Lily');
INSERT INTO items(id, doc) VALUES(3, 'Sunflower');
INSERT INTO items(id, doc) VALUES(4, 'Rose');

-- First, obtain the query vector
SET @query_vector = AI_EMBED('ob_embed', 'Sunflower');

-- Retrieve using vectors with index
SELECT id, doc FROM items
ORDER BY semantic_vector_distance(doc, @query_vector) 
APPROXIMATE LIMIT 3;

The return result is as follows:

+----+-----------+
| id | doc       |
+----+-----------+
|  3 | Sunflower |
|  1 | Rose      |
|  4 | Rose      |
+----+-----------+
3 rows in set

Retrieve using vectors (without APPROXIMATE clause)

Use the semantic_vector_distance expression to pass vectors for retrieval. When the APPROXIMATE/APPROX clause is not included, brute-force search using a full table scan is performed to obtain the exact results of the n nearest rows. During retrieval execution, the distance type is obtained from the table schema, and then a full table scan is performed. Vector distance is calculated for each row to ensure accurate results.

Syntax

SELECT ... FROM table_name 
ORDER BY semantic_vector_distance(column_name, 'query_vector') 
LIMIT n;

Where:

column_name: The text column specified when creating the hybrid vector index.
query_vector: The query vector.
n: The number of result rows to return.

Usage example

-- Retrieve using vectors with brute-force search (exact results)
SELECT id, doc FROM items
ORDER BY semantic_vector_distance(doc, @query_vector) 
LIMIT 3;

The return result is as follows:

+----+-----------+
| id | doc       |
+----+-----------+
|  3 | Sunflower |
|  4 | Rose      |
|  1 | Rose      |
+----+-----------+
3 rows in set

Index maintenance

Hybrid vector indexes support using the DBMS_VECTOR system package for index maintenance, including incremental refresh and full rebuild.

Incremental refresh

If a large amount of data is written after the index is created, it is recommended to use the REFRESH_INDEX procedure for incremental refresh. For descriptions and examples, see the related documentation.

Special notes for hybrid vector indexes:

The usage is the same as that for regular vector indexes. For details, see the related documentation.
In asynchronous mode, REFRESH_INDEX will additionally trigger data embedding to ensure that incremental data is correctly converted to vectors and added to the index.

Full refresh (rebuild)

If a large amount of data is updated or deleted after the index is created, it is recommended to use the REBUILD_INDEX procedure for full refresh. For descriptions and examples, see the related documentation.

Special notes for hybrid vector indexes:

The usage is the same as that for regular vector indexes. For details, see the related documentation.
The task merges incremental data and snapshots.

Overview​

Feature support​

Prerequisites​

Creation syntax and description​

Create during table creation​

Syntax​

Parameter description​

Note

Note

Create after table creation​

Syntax​

Parameter description​

Create, update, and delete examples​

Create during table creation​

Create after table creation​

Update​

Delete​

Retrieval​

Retrieve using raw text​

Syntax​

Usage example​

Retrieve using vectors (with APPROXIMATE clause)​

Syntax​

Usage example​

Retrieve using vectors (without APPROXIMATE clause)​

Syntax​

Usage example​

Index maintenance​

Incremental refresh​

Full refresh (rebuild)​

Related documentation​

Contents

Overview

Feature support

Prerequisites

Creation syntax and description

Create during table creation

Syntax

Parameter description

Create after table creation

Syntax

Parameter description

Create, update, and delete examples

Create during table creation

Create after table creation

Update

Delete

Retrieval

Retrieve using raw text

Syntax

Usage example

Retrieve using vectors (with APPROXIMATE clause)

Syntax

Usage example

Retrieve using vectors (without APPROXIMATE clause)

Syntax

Usage example

Index maintenance

Incremental refresh

Full refresh (rebuild)

Related documentation