Vector similarity search
Vector similarity search, also known as nearest neighbor search, is a search method based on distance metrics in vector space. Its core objective is to find the set of vectors most similar to a given query vector. Although specific distance metrics are used during computation, the final output is the Top K nearest vectors, sorted in ascending order of distance.
This topic describes two vector search methods in seekdb: exact nearest neighbor search based on full-scan and approximate nearest neighbor search based on vector index. It also provides examples to illustrate how to use these methods.
For readability, this document refers to vector nearest neighbor search as "vector search," exact nearest neighbor search as "exact search," and approximate nearest neighbor search as "approximate search."
Perform exact search
Exact search uses a full scan strategy, calculating the distance between the query vector and all vectors in the dataset to perform an exact search. This method ensures complete accuracy of the search results, but because it requires calculating the distance for all data, the search performance decreases significantly as the dataset grows.
When executing an exact search, the system calculates and compares the distances between the query vector vₑ and all other vectors in the vector space. After completing the full distance calculations, the system selects the k vectors closest to the query as the search results.
Example: Euclidean search
Euclidean similarity search is used to retrieve the top-k vectors in the vector space that are closest to the query vector, using Euclidean distance as the metric. The following example demonstrates how to use exact search to retrieve the top 5 vectors from a table that are closest to the query vector:
-- Create a test table
CREATE TABLE t1 (
id INT PRIMARY KEY,
c1 VECTOR(3)
);
-- Insert data
INSERT INTO t1 VALUES
(1, '[0.1, 0.2, 0.3]'),
(2, '[0.2, 0.3, 0.4]'),
(3, '[0.3, 0.4, 0.5]'),
(4, '[0.4, 0.5, 0.6]'),
(5, '[0.5, 0.6, 0.7]'),
(6, '[0.6, 0.7, 0.8]'),
(7, '[0.7, 0.8, 0.9]'),
(8, '[0.8, 0.9, 1.0]'),
(9, '[0.9, 1.0, 0.1]'),
(10, '[1.0, 0.1, 0.2]');
-- Perform an exact search
SELECT c1
FROM t1
ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;
The return result is as follows:
+---------------+
| c1 |
+---------------+
| [0.1,0.2,0.3] |
| [0.2,0.3,0.4] |
| [0.3,0.4,0.5] |
| [0.4,0.5,0.6] |
| [0.5,0.6,0.7] |
+---------------+
5 rows in set
Analyze the execution plan
Obtain the execution plan of the preceding example:
EXPLAIN SELECT c1
FROM t1
ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;
The return result is as follows:
+---------------------------------------------------------------------------------------------+
| Query Plan |
+---------------------------------------------------------------------------------------------+
| ================================================= |
| |ID|OPERATOR |NAME|EST.ROWS|EST.TIME(us)| |
| ------------------------------------------------- |
| |0 |TOP-N SORT | |5 |3 | |
| |1 |└─TABLE FULL SCAN|t1 |10 |3 | |
| ================================================= |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([t1.c1]), filter(nil), rowset=16 |
| sort_keys([l2_distance(t1.c1, cast('[0.1, 0.2, 0.3]', ARRAY(18, -1))), ASC]), topn(5) |
| 1 - output([t1.c1]), filter(nil), rowset=16 |
| access([t1.c1]), partitions(p0) |
| is_index_back=false, is_global_index=false, |
| range_key([t1.id]), range(MIN ; MAX)always true |
+---------------------------------------------------------------------------------------------+
14 rows in set
The analysis is as follows:
-
Execution method:
- The full-table scan method is used, which requires traversing all data in the table. The
TABLE FULL SCANoperation in the execution plan scans all data in thet1table. - The system calculates the vector distance for each record and then sorts the records by distance. The
TOP-N SORToperation in the execution plan calculates the vector distance using thel2_distancefunction and sorts the records by distance in ascending order. - The system returns the five records with the smallest distances. The
topn(5)setting in the execution plan indicates that only the first five records of the sorted list are returned.
- The full-table scan method is used, which requires traversing all data in the table. The
-
Performance characteristics:
- Advantages: The search results are completely accurate and ensure that the true nearest neighbors are returned.
- Disadvantages: The system must scan all data in the table and calculate the distance between all vectors, leading to a significant drop in performance as the data volume increases.
-
Applicable scenarios:
- Scenarios with a small amount of data.
- Scenarios where high result accuracy is required.
- Scenarios where real-time queries are not suitable for large datasets.
Perform approximate search by using vector indexes
Vector index search uses an approximate nearest neighbor (ANN) strategy, accelerating the search process through pre-built index structures. While it cannot guarantee 100% result accuracy, it can significantly improve search performance, allowing for a good balance between accuracy and efficiency in practical applications.
Example: Approximate search by using the HNSW index
-- Create a HNSW vector index with the table.
CREATE TABLE t2 (
id INT PRIMARY KEY,
vec VECTOR(3),
VECTOR INDEX idx(vec) WITH (distance=l2, type=hnsw, lib=vsag)
);
-- Insert test data
INSERT INTO t2 VALUES
(1, '[0.1, 0.2, 0.3]'),
(2, '[0.2, 0.3, 0.4]'),
(3, '[0.3, 0.4, 0.5]'),
(4, '[0.4, 0.5, 0.6]'),
(5, '[0.5, 0.6, 0.7]'),
(6, '[0.6, 0.7, 0.8]'),
(7, '[0.7, 0.8, 0.9]'),
(8, '[0.8, 0.9, 1.0]'),
(9, '[0.9, 1.0, 0.1]'),
(10, '[1.0, 0.1, 0.2]');
-- Perform approximate search and return the 5 most similar data records
SELECT id, vec
FROM t2
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
APPROXIMATE
LIMIT 5;
The return result is as follows. The result is the same as that of the exact search because the data volume is small:
+------+---------------+
| id | vec |
+------+---------------+
| 1 | [0.1,0.2,0.3] |
| 2 | [0.2,0.3,0.4] |
| 3 | [0.3,0.4,0.5] |
| 4 | [0.4,0.5,0.6] |
| 5 | [0.5,0.6,0.7] |
+------+---------------+
5 rows in set
Execution plan analysis
Obtain the execution plan of the preceding example:
EXPLAIN SELECT id, vec
FROM t2
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
APPROXIMATE
LIMIT 5;
The return result is as follows:
+--------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+--------------------------------------------------------------------------------------------------------------------+
| ==================================================== |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ---------------------------------------------------- |
| |0 |VECTOR INDEX SCAN|t2(idx)|10 |29 | |
| ==================================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([t2.id], [t2.vec]), filter(nil), rowset=16 |
| access([t2.id], [t2.vec]), partitions(p0) |
| is_index_back=true, is_global_index=false, |
| range_key([t2.__vid_1750162978114053], [t2.__type_17_1750162978114364]), range(MIN,MIN ; MAX,MAX)always true |
+--------------------------------------------------------------------------------------------------------------------+
11 rows in set
The analysis is as follows:
-
Execution method:
- The vector index scan method is used, directly locating similar vectors through the pre-built HNSW index. The
VECTOR INDEX SCANoperation in the execution plan uses the indext2(idx)for retrieval. - The graph structure of the index is used to quickly locate nearest neighbors without calculating the distance between all vectors. The
is_index_back=truesetting in the execution plan indicates that complete data is retrieved through index back-lookup. - The five records that the index considers to be the most similar are returned. The
output([t2.id], [t2.vec])in the execution plan indicates that the id and vector data are returned.
- The vector index scan method is used, directly locating similar vectors through the pre-built HNSW index. The
-
Performance characteristics:
- Advantage: The search performance is high and remains stable as the data volume increases.
- Disadvantage: A small amount of error may exist in the results, and 100% accuracy is not guaranteed.
-
Applicable scenarios:
- Real-time search for large-scale datasets.
- Scenarios with high requirements for search performance.
- Scenarios that can tolerate a small amount of result error.
Summary
A comparison of the two search methods is as follows:
| Item | Exact search | Approximate search |
|---|---|---|
| Execution method | Full-table scan (TABLE FULL SCAN) followed by sorting | Direct search through the vector index (VECTOR INDEX SCAN) |
| Execution plan | Contains two operators: TABLE FULL SCAN and TOP-N SORT | Contains only one operator: VECTOR INDEX SCAN |
| Performance characteristics | Requires full-table scan and sorting, and performance decreases significantly as the data volume increases | Directly locates target data through the index, and performance is stable |
| Result accuracy | 100% accurate, ensuring real nearest neighbors are returned | Approximately accurate, with a small amount of error possible |
| Applicable scenarios | Scenarios with small data volumes and high accuracy requirements | Scenarios with large-scale datasets and high performance requirements |
References
- For more information about SQL functions, see Use SQL functions.
- For more information about vector indexes and examples, see Create vector indexes.
- To perform large-scale performance tests, we recommend that you use the VectorDBBenchmark tool to generate a test dataset to better compare the performance differences between exact search and approximate search.