Skip to main content

Experience vector search

Vector search overview

In today's era of information explosion, users often need to quickly retrieve the information they need from massive amounts of data. For example, online literature databases, e-commerce platform product catalogs, and growing multimedia content libraries all require efficient retrieval systems to quickly locate content of interest to users. As data volumes continue to grow, traditional keyword-based retrieval methods can no longer meet users' needs for retrieval accuracy and speed. Vector search technology can effectively solve these problems. Vector search encodes different types of data such as text, images, and audio into mathematical vectors and performs retrieval in vector space. This method allows systems to capture deep semantic information of data, thereby providing more accurate and efficient retrieval results.

seekdb provides the capability to store, index, and search embedding vector data, and supports storing vector data together with other data.

seekdb supports up to 16,000 dimensions of float-type dense vectors, sparse vectors, and various types of vector distance calculations such as Manhattan distance, Euclidean distance, inner product, and cosine distance. It supports creating vector indexes based on HNSW/IVF, and supports incremental updates and deletions without affecting recall.

seekdb vector search has hybrid search capabilities with scalar filtering. It also provides flexible access interfaces, supporting SQL access through MySQL protocol clients in various languages, as well as Python SDK access. It has also completed adaptation to AI application development frameworks LlamaIndex and DB-GPT, and AI application development platform Dify, better serving AI application development.

This topic demonstrates how to quickly perform vector search using SQL.

Prerequisites

  • Ensure that seekdb is installed.
  • You are connected to seekdb.

Quick start

  1. Create vector columns and indexes.

    When creating a table, you can use the VECTOR(dim) data type to declare a column as a vector column and specify its dimension. Vector indexes must be created on vector columns, and at least two parameters, type and distance, must be provided.

    The example creates a vector column embedding with a dimension of 3, and creates an HNSW index on the embedding column, specifying the distance algorithm as L2.

    CREATE TABLE t1(
    id INT PRIMARY KEY,
    doc VARCHAR(200),
    embedding VECTOR(3),
    VECTOR INDEX idx1(embedding) WITH (distance=L2, type=hnsw)
    );
  2. Insert vector data.

    To simulate a vector search scenario, you need to construct some vector data first. Each row of data includes a description of the data and the corresponding vector. In the example, it is assumed that 'apple' corresponds to the vector '[1.2,0.7,1.1]', and 'carrot' corresponds to the vector '[5.3,4.8,5.4]', and so on.

    INSERT INTO t1
    VALUES (1, 'apple', '[1.2,0.7,1.1]'),
    (2, 'banana', '[0.6,1.2,0.8]'),
    (3, 'orange','[1.1,1.1,0.9]'),
    (4, 'carrot', '[5.3,4.8,5.4]'),
    (5, 'spinach', '[4.9,5.3,4.8]'),
    (6, 'tomato','[5.2,4.9,5.1]');

    For convenience of demonstration, this example simplifies the vector dimension to only 3 dimensions, and the vectors are manually generated. In actual applications, you need to use embedding models to generate vectors from real text, and the dimensions can reach hundreds or thousands.

    You can check whether the data is inserted successfully by querying the table.

    SELECT * FROM t1;

    The expected result is as follows:

    +----+---------+---------------+
    | id | doc | embedding |
    +----+---------+---------------+
    | 1 | apple | [1.2,0.7,1.1] |
    | 2 | banana | [0.6,1.2,0.8] |
    | 3 | orange | [1.1,1.1,0.9] |
    | 4 | carrot | [5.3,4.8,5.4] |
    | 5 | spinach | [4.9,5.3,4.8] |
    | 6 | tomato | [5.2,4.9,5.1] |
    +----+---------+---------------+
    6 rows in set
  3. Perform vector search.

    To perform vector search, you need to provide a vector as the search condition. Suppose we need to find all 'fruits', and the corresponding vector is [0.9, 1.0, 0.9], then the corresponding SQL is:

    SELECT id, doc FROM t1
    ORDER BY l2_distance(embedding, '[0.9, 1.0, 0.9]')
    APPROXIMATE LIMIT 3;

    The expected result is as follows:

    +----+--------+
    | id | doc |
    +----+--------+
    | 3 | orange |
    | 2 | banana |
    | 1 | apple |
    +----+--------+
    3 rows in set

Exact search uses a full scan strategy, performing exact search by calculating the distance between the query vector and all vectors in the dataset. This method can guarantee complete accuracy of search results, but since full distance calculation is required, search performance will significantly decrease as the data scale grows.

When performing exact search, the system calculates and compares the distance between the query vector vₑ and all vectors in the vector space. After completing the full distance calculation, the system selects the k vectors with the closest distance as the search results.

Euclidean similarity search is used to retrieve the top-k vectors closest to the query vector in vector space, using Euclidean distance as the metric. The following example demonstrates how to use exact search to retrieve the top 5 vectors closest to the query vector from a table:

-- Create a test table
CREATE TABLE t1 (
id INT PRIMARY KEY,
c1 VECTOR(3)
);

-- Insert data
INSERT INTO t1 VALUES
(1, '[0.1, 0.2, 0.3]'),
(2, '[0.2, 0.3, 0.4]'),
(3, '[0.3, 0.4, 0.5]'),
(4, '[0.4, 0.5, 0.6]'),
(5, '[0.5, 0.6, 0.7]'),
(6, '[0.6, 0.7, 0.8]'),
(7, '[0.7, 0.8, 0.9]'),
(8, '[0.8, 0.9, 1.0]'),
(9, '[0.9, 1.0, 0.1]'),
(10, '[1.0, 0.1, 0.2]');

-- Perform exact search
SELECT c1
FROM t1
ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;

The result is as follows:

+---------------+
|| c1 |
+---------------+
|| [0.1,0.2,0.3] |
|| [0.2,0.3,0.4] |
|| [0.3,0.4,0.5] |
|| [0.4,0.5,0.6] |
|| [0.5,0.6,0.7] |
+---------------+
5 rows in set

Perform approximate search using vector indexes

Vector index search uses an approximate nearest neighbor (ANN) strategy, accelerating the search process through pre-built index structures. Although it cannot guarantee 100% accuracy of results, it can significantly improve search performance, achieving a good balance between accuracy and performance in practical applications.

-- Create an HNSW vector index with the table
CREATE TABLE t2 (
id INT PRIMARY KEY,
vec VECTOR(3),
VECTOR INDEX idx(vec) WITH (distance=l2, type=hnsw, lib=vsag)
);

-- Insert test data
INSERT INTO t2 VALUES
(1, '[0.1, 0.2, 0.3]'),
(2, '[0.2, 0.3, 0.4]'),
(3, '[0.3, 0.4, 0.5]'),
(4, '[0.4, 0.5, 0.6]'),
(5, '[0.5, 0.6, 0.7]'),
(6, '[0.6, 0.7, 0.8]'),
(7, '[0.7, 0.8, 0.9]'),
(8, '[0.8, 0.9, 1.0]'),
(9, '[0.9, 1.0, 0.1]'),
(10, '[1.0, 0.1, 0.2]');

-- Perform approximate search, returning the 5 most similar records
SELECT id, vec
FROM t2
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
APPROXIMATE
LIMIT 5;

The result is as follows. Due to the small data volume, it is consistent with the exact search result above:

+------+---------------+
|| id | vec |
+------+---------------+
|| 1 | [0.1,0.2,0.3] |
|| 2 | [0.2,0.3,0.4] |
|| 3 | [0.3,0.4,0.5] |
|| 4 | [0.4,0.5,0.6] |
|| 5 | [0.5,0.6,0.7] |
+------+---------------+
5 rows in set

Summary

A comparison of the two search methods is as follows:

Comparison itemExact searchApproximate search
Execution methodFull table scan (TABLE FULL SCAN) followed by sortingDirect search through vector index (VECTOR INDEX SCAN)
Performance characteristicsRequires scanning all table data and sorting, performance significantly decreases as data volume growsDirectly locates target data through index, stable performance
Result accuracy100% accurate, guarantees returning true nearest neighborsApproximately accurate, may have minor errors
Applicable scenariosSmall data volumes, scenarios with high accuracy requirementsLarge-scale datasets, scenarios with high performance requirements

What's next

For more guides on experiencing seekdb's AI Native features and building AI applications based on seekdb, see:

In addition to using SQL for operations, you can also use the Python SDK (pyseekdb) provided by seekdb. For usage instructions, see Experience embedded seekdb and pyseekdb overview.