Vector similarity search

Vector similarity search, also known as nearest neighbor search, is a search method based on distance metrics in vector space. Its core objective is to find the set of vectors most similar to a given query vector. Although specific distance metrics are used during computation, the final output is the Top K nearest vectors, sorted in ascending order of distance.

This topic describes two vector search methods in seekdb: exact nearest neighbor search based on full-scan and approximate nearest neighbor search based on vector index. It also provides examples to illustrate how to use these methods.

tip

For readability, this document refers to vector nearest neighbor search as "vector search," exact nearest neighbor search as "exact search," and approximate nearest neighbor search as "approximate search."

Perform exact search

Exact search uses a full scan strategy, calculating the distance between the query vector and all vectors in the dataset to perform an exact search. This method ensures complete accuracy of the search results, but because it requires calculating the distance for all data, the search performance decreases significantly as the dataset grows.

When executing an exact search, the system calculates and compares the distances between the query vector vₑ and all other vectors in the vector space. After completing the full distance calculations, the system selects the k vectors closest to the query as the search results.

Example: Euclidean search

Euclidean similarity search is used to retrieve the top-k vectors in the vector space that are closest to the query vector, using Euclidean distance as the metric. The following example demonstrates how to use exact search to retrieve the top 5 vectors from a table that are closest to the query vector:

-- Create a test table
CREATE TABLE t1 (
    id INT PRIMARY KEY,
    c1 VECTOR(3)
);

-- Insert data
INSERT INTO t1 VALUES
(1, '[0.1, 0.2, 0.3]'),
(2, '[0.2, 0.3, 0.4]'),
(3, '[0.3, 0.4, 0.5]'),
(4, '[0.4, 0.5, 0.6]'),
(5, '[0.5, 0.6, 0.7]'),
(6, '[0.6, 0.7, 0.8]'),
(7, '[0.7, 0.8, 0.9]'),
(8, '[0.8, 0.9, 1.0]'),
(9, '[0.9, 1.0, 0.1]'),
(10, '[1.0, 0.1, 0.2]');

-- Perform an exact search
SELECT c1
FROM t1
ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;

The return result is as follows:

+---------------+
| c1            |
+---------------+
| [0.1,0.2,0.3] |
| [0.2,0.3,0.4] |
| [0.3,0.4,0.5] |
| [0.4,0.5,0.6] |
| [0.5,0.6,0.7] |
+---------------+
5 rows in set

Analyze the execution plan

Obtain the execution plan of the preceding example:

EXPLAIN SELECT c1
        FROM t1
        ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;

The return result is as follows:

+---------------------------------------------------------------------------------------------+
| Query Plan                                                                                  |
+---------------------------------------------------------------------------------------------+
| =================================================                                           |
| |ID|OPERATOR         |NAME|EST.ROWS|EST.TIME(us)|                                           |
| -------------------------------------------------                                           |
| |0 |TOP-N SORT       |    |5       |3           |                                           |
| |1 |└─TABLE FULL SCAN|t1  |10      |3           |                                           |
| =================================================                                           |
| Outputs & filters:                                                                          |
| -------------------------------------                                                       |
|   0 - output([t1.c1]), filter(nil), rowset=16                                               |
|       sort_keys([l2_distance(t1.c1, cast('[0.1, 0.2, 0.3]', ARRAY(18, -1))), ASC]), topn(5) |
|   1 - output([t1.c1]), filter(nil), rowset=16                                               |
|       access([t1.c1]), partitions(p0)                                                       |
|       is_index_back=false, is_global_index=false,                                           |
|       range_key([t1.id]), range(MIN ; MAX)always true                                       |
+---------------------------------------------------------------------------------------------+
14 rows in set

The analysis is as follows:

Execution method:
- The full-table scan method is used, which requires traversing all data in the table. The TABLE FULL SCAN operation in the execution plan scans all data in the t1 table.
- The system calculates the vector distance for each record and then sorts the records by distance. The TOP-N SORT operation in the execution plan calculates the vector distance using the l2_distance function and sorts the records by distance in ascending order.
- The system returns the five records with the smallest distances. The topn(5) setting in the execution plan indicates that only the first five records of the sorted list are returned.
Performance characteristics:
- Advantages: The search results are completely accurate and ensure that the true nearest neighbors are returned.
- Disadvantages: The system must scan all data in the table and calculate the distance between all vectors, leading to a significant drop in performance as the data volume increases.
Applicable scenarios:
- Scenarios with a small amount of data.
- Scenarios where high result accuracy is required.
- Scenarios where real-time queries are not suitable for large datasets.

Perform approximate search by using vector indexes

Vector index search uses an approximate nearest neighbor (ANN) strategy, accelerating the search process through pre-built index structures. While it cannot guarantee 100% result accuracy, it can significantly improve search performance, allowing for a good balance between accuracy and efficiency in practical applications.

Example: Approximate search by using the HNSW index

-- Create a HNSW vector index with the table.
CREATE TABLE t2 (
    id INT PRIMARY KEY,
    vec VECTOR(3),
    VECTOR INDEX idx(vec) WITH (distance=l2, type=hnsw, lib=vsag)
);

-- Insert test data
INSERT INTO t2 VALUES 
(1, '[0.1, 0.2, 0.3]'),
(2, '[0.2, 0.3, 0.4]'),
(3, '[0.3, 0.4, 0.5]'),
(4, '[0.4, 0.5, 0.6]'),
(5, '[0.5, 0.6, 0.7]'),
(6, '[0.6, 0.7, 0.8]'),
(7, '[0.7, 0.8, 0.9]'),
(8, '[0.8, 0.9, 1.0]'),
(9, '[0.9, 1.0, 0.1]'),
(10, '[1.0, 0.1, 0.2]');

-- Perform approximate search and return the 5 most similar data records
SELECT id, vec
FROM t2
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
APPROXIMATE
LIMIT 5;

The return result is as follows. The result is the same as that of the exact search because the data volume is small:

+------+---------------+
| id   | vec           |
+------+---------------+
|    1 | [0.1,0.2,0.3] |
|    2 | [0.2,0.3,0.4] |
|    3 | [0.3,0.4,0.5] |
|    4 | [0.4,0.5,0.6] |
|    5 | [0.5,0.6,0.7] |
+------+---------------+
5 rows in set

Execution plan analysis

Obtain the execution plan of the preceding example:

EXPLAIN SELECT id, vec
        FROM t2
        ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
        APPROXIMATE
        LIMIT 5;

The return result is as follows:

+--------------------------------------------------------------------------------------------------------------------+
| Query Plan                                                                                                         |
+--------------------------------------------------------------------------------------------------------------------+
| ====================================================                                                               |
| |ID|OPERATOR         |NAME   |EST.ROWS|EST.TIME(us)|                                                               |
| ----------------------------------------------------                                                               |
| |0 |VECTOR INDEX SCAN|t2(idx)|10      |29          |                                                               |
| ====================================================                                                               |
| Outputs & filters:                                                                                                 |
| -------------------------------------                                                                              |
|   0 - output([t2.id], [t2.vec]), filter(nil), rowset=16                                                            |
|       access([t2.id], [t2.vec]), partitions(p0)                                                                    |
|       is_index_back=true, is_global_index=false,                                                                   |
|       range_key([t2.__vid_1750162978114053], [t2.__type_17_1750162978114364]), range(MIN,MIN ; MAX,MAX)always true |
+--------------------------------------------------------------------------------------------------------------------+
11 rows in set

The analysis is as follows:

Execution method:
- The vector index scan method is used, directly locating similar vectors through the pre-built HNSW index. The VECTOR INDEX SCAN operation in the execution plan uses the index t2(idx) for retrieval.
- The graph structure of the index is used to quickly locate nearest neighbors without calculating the distance between all vectors. The is_index_back=true setting in the execution plan indicates that complete data is retrieved through index back-lookup.
- The five records that the index considers to be the most similar are returned. The output([t2.id], [t2.vec]) in the execution plan indicates that the id and vector data are returned.
Performance characteristics:
- Advantage: The search performance is high and remains stable as the data volume increases.
- Disadvantage: A small amount of error may exist in the results, and 100% accuracy is not guaranteed.
Applicable scenarios:
- Real-time search for large-scale datasets.
- Scenarios with high requirements for search performance.
- Scenarios that can tolerate a small amount of result error.

Summary

A comparison of the two search methods is as follows:

Item	Exact search	Approximate search
Execution method	Full-table scan (`TABLE FULL SCAN`) followed by sorting	Direct search through the vector index (`VECTOR INDEX SCAN`)
Execution plan	Contains two operators: `TABLE FULL SCAN` and `TOP-N SORT`	Contains only one operator: `VECTOR INDEX SCAN`
Performance characteristics	Requires full-table scan and sorting, and performance decreases significantly as the data volume increases	Directly locates target data through the index, and performance is stable
Result accuracy	100% accurate, ensuring real nearest neighbors are returned	Approximately accurate, with a small amount of error possible
Applicable scenarios	Scenarios with small data volumes and high accuracy requirements	Scenarios with large-scale datasets and high performance requirements

References

For more information about SQL functions, see Use SQL functions.
For more information about vector indexes and examples, see Create vector indexes.
To perform large-scale performance tests, we recommend that you use the VectorDBBenchmark tool to generate a test dataset to better compare the performance differences between exact search and approximate search.

Perform exact search​

Example: Euclidean search​

Analyze the execution plan​

Perform approximate search by using vector indexes​

Example: Approximate search by using the HNSW index​

Execution plan analysis​

Summary​

References​

Contents

Perform exact search

Example: Euclidean search

Analyze the execution plan

Perform approximate search by using vector indexes

Example: Approximate search by using the HNSW index

Execution plan analysis

Summary

References