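Experience hybrid search

This tutorial walks you through the basics of seekdb hybrid search. It shows how hybrid search combines keyword matching on full-text indexes with semantic search on vector indexes, so that you can see how the feature applies in practice.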

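Overview

Hybrid search combines vector-based semantic search with keyword search based on full-text indexes, and ranks the combined results to return more accurate and more complete matches. Vector search is good at approximate semantic matching but weak at matching exact keywords, numbers, and proper nouns; full-text search compensates for this weakness. seekdb provides hybrid search through the DBMS_HYBRID_SEARCH system package, which supports the following scenarios:

  • Pure vector search: finds related content by semantic similarity. It suits scenarios such as semantic search and recommendation systems.
  • Pure full-text search: finds content by keyword matching. It suits scenarios such as document search and product search.
  • Hybrid search: combines keyword matching with semantic understanding to return more precise and more complete results.

This feature is widely used in intelligent search, document search, and product recommendation.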

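Prerequisites

  • Obtain the database connection string from your administrator, and then run the following command to connect to the database:
    obclient -h$host -P$port -u$user_name -p$password -D$database_name
    The parameters are as follows:
    - host: the IP address for connecting to seekdb.
    - port: the port for connecting to seekdb.
    - database_name: the name of the database to access.
    - user_name: the database username.
    - password: the database password.
  • A test table has been created, with a vector index and full-text indexes on it: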
    CREATE TABLE doc_table(
    c1 INT,
    vector VECTOR(3),
    query VARCHAR(255),
    content VARCHAR(255),
    VECTOR INDEX idx1(vector) WITH (distance=l2, type=hnsw, lib=vsag),
    FULLTEXT INDEX idx2(query),
    FULLTEXT INDEX idx3(content)
    );

    INSERT INTO doc_table VALUES
    (1, '[1,2,3]', "hello world", "oceanbase Elasticsearch database"),
    (2, '[1,2,1]', "hello world, what is your name", "oceanbase mysql database"),
    (3, '[1,1,1]', "hello world, how are you", "oceanbase oracle database"),
    (4, '[1,3,1]', "real world, where are you from", "postgres oracle database"),
    (5, '[1,3,2]', "real world, how old are you", "redis oracle database"),
    (6, '[2,1,1]', "hello world, where are you from", "starrocks oceanbase database");

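Step 1: Pure vector search

Vector search finds semantically related content by computing vector similarity. It suits scenarios such as semantic search and recommendation systems.

Set the search parameters and use vector search to find the three records (k = 3) most similar to the query vector [1,2,3]: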

SET @parm = '{
"knn" : {
"field": "vector",
"k": 3,
"query_vector": [1,2,3]
}
}';

SELECT JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));

The result is as follows:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm)) |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"c1": 1,
"query": "hello world",
"_score": 1.0,
"vector": "[1,2,3]",
"content": "oceanbase Elasticsearch database"
},
{
"c1": 5,
"query": "real world, how old are you",
"_score": 0.41421356,
"vector": "[1,3,2]",
"content": "redis oracle database"
},
{
"c1": 2,
"query": "hello world, what is your name",
"_score": 0.33333333,
"vector": "[1,2,1]",
"content": "oceanbase mysql database"
}
] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set

The results are sorted by vector similarity. _score is the similarity score; a higher score means a closer match.
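
The scores in the sample output above are consistent with mapping the L2 distance d between two vectors to 1 / (1 + d). The source does not state this formula, so the following Python sketch is an inference from the sample data, not seekdb's documented scoring rule. It recomputes the ranking for the query vector [1,2,3] from the vectors inserted in the prerequisites. The names ROWS, similarity, and top_k are illustrative only.

```python
import math

# Vectors copied from the INSERT statement in the prerequisites (c1 -> vector).
ROWS = {
    1: [1, 2, 3],
    2: [1, 2, 1],
    3: [1, 1, 1],
    4: [1, 3, 1],
    5: [1, 3, 2],
    6: [2, 1, 1],
}


def l2_distance(a, b):
    """Euclidean (L2) distance, the metric declared on the vector index."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    # Assumption: score = 1 / (1 + distance). This matches the sample output
    # but is not confirmed by the source.
    return 1.0 / (1.0 + l2_distance(a, b))


def top_k(query_vector, k):
    """Return the k rows with the highest similarity as (c1, score) pairs."""
    scored = [(c1, similarity(vec, query_vector)) for c1, vec in ROWS.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for c1, score in top_k([1, 2, 3], 3):
        print(c1, round(score, 8))
```

Under this assumption, top_k([1, 2, 3], 3) ranks rows 1, 5, and 2, in the same order as the result above.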

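Step 2: Pure full-text search

Full-text search finds content by keyword matching. It suits scenarios such as document search and product search.

Set the search parameters and use full-text search to find records whose query or content field contains the keywords: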

SET @parm = '{
"query": {
"query_string": {
"fields": ["query", "content"],
"query": "hello oceanbase"
}
}
}';

SELECT JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));

The result is as follows:

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm)) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"c1": 1,
"query": "hello world",
"_score": 0.37162162162162166,
"vector": "[1,2,3]",
"content": "oceanbase Elasticsearch database"
},
{
"c1": 2,
"query": "hello world, what is your name",
"_score": 0.3503184713375797,
"vector": "[1,2,1]",
"content": "oceanbase mysql database"
},
{
"c1": 3,
"query": "hello world, how are you",
"_score": 0.3503184713375797,
"vector": "[1,1,1]",
"content": "oceanbase oracle database"
},
{
"c1": 6,
"query": "hello world, where are you from",
"_score": 0.3503184713375797,
"vector": "[2,1,1]",
"content": "starrocks oceanbase database"
}
] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set

The results are sorted by keyword relevance. _score is the relevance score; a higher score means a better match.
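
The search parameters in the steps above are JSON documents with a knn key, a query key, or both. When these are generated from application code, building them with a JSON library avoids hand-written quoting mistakes. The following is a minimal sketch using only the Python standard library. The helper names build_search_param and to_sql are hypothetical and are not part of seekdb or pyseekdb; the key names are the ones shown in this tutorial.

```python
import json


def build_search_param(text=None, fields=None, query_vector=None, k=None,
                       vector_field="vector"):
    """Build the JSON string passed as the second argument of
    DBMS_HYBRID_SEARCH.SEARCH, using the keys shown in this tutorial."""
    param = {}
    if text is not None:
        if not fields:
            raise ValueError("fields is required for a full-text query")
        param["query"] = {
            "query_string": {"fields": list(fields), "query": text}
        }
    if query_vector is not None:
        if k is None:
            raise ValueError("k is required for a vector query")
        param["knn"] = {
            "field": vector_field,
            "k": k,
            "query_vector": list(query_vector),
        }
    if not param:
        raise ValueError("provide a full-text query, a query vector, or both")
    return json.dumps(param)


def to_sql(table, param_json):
    """Render the two statements used in this tutorial as text.

    Illustration only: `table` is assumed to be a trusted identifier, and an
    application should prefer its driver's parameter binding over string
    formatting.
    """
    # Escape backslashes and single quotes for a SQL string literal.
    escaped = param_json.replace("\\", "\\\\").replace("'", "''")
    return (
        f"SET @parm = '{escaped}';\n"
        f"SELECT JSON_PRETTY(DBMS_HYBRID_SEARCH.SEARCH('{table}', @parm));"
    )


if __name__ == "__main__":
    param = build_search_param(text="hello oceanbase",
                               fields=["query", "content"])
    print(to_sql("doc_table", param))
```

Passing both text and query_vector produces a parameter that contains the query and knn keys together.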

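Step 3: Hybrid search

Hybrid search combines keyword matching with semantic understanding, drawing on both the full-text indexes and the vector index to return more precise and more complete results.

Set the search parameters to run a full-text search and a vector search at the same time: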

SET @parm = '{
"query": {
"query_string": {
"fields": ["query", "content"],
"query": "hello oceanbase"
}
},
"knn" : {
"field": "vector",
"k": 5,
"query_vector": [1,2,3]
}
}';

SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));

The result is as follows:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm)) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [
{
"c1": 1,
"query": "hello world",
"_score": 1.3716216216216217,
"vector": "[1,2,3]",
"content": "oceanbase Elasticsearch database"
},
{
"c1": 2,
"query": "hello world, what is your name",
"_score": 0.6836518013375796,
"vector": "[1,2,1]",
"content": "oceanbase mysql database"
},
{
"c1": 3,
"query": "hello world, how are you",
"_score": 0.6593354613375797,
"vector": "[1,1,1]",
"content": "oceanbase oracle database"
},
{
"c1": 5,
"query": "real world, how old are you",
"_score": 0.41421356,
"vector": "[1,3,2]",
"content": "redis oracle database"
},
{
"c1": 6,
"query": "hello world, where are you from",
"_score": 0.3503184713375797,
"vector": "[2,1,1]",
"content": "starrocks oceanbase database"
},
{
"c1": 4,
"query": "real world, where are you from",
"_score": 0.30901699,
"vector": "[1,3,1]",
"content": "postgres oracle database"
}
] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set

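The hybrid search result takes into account both the keyword matching score (_keyword_score) and the semantic similarity score (_semantic_score). The final _score is the sum of the two and determines the overall ranking.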
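
The summation can be checked against the sample outputs in this tutorial. The following Python sketch adds the keyword scores from the full-text result to the vector scores and sorts by the total. The vector scores for rows 1, 5, and 2 are copied from the pure vector search result, the score for row 4 is its hybrid score, and the score for row 3 is inferred by subtracting its keyword score from its hybrid score. A row missing from one result set is assumed to contribute 0 for that part, which is consistent with the sample output but not stated by the source.

```python
# Keyword scores copied from the pure full-text search result (c1 -> score).
KEYWORD_SCORES = {
    1: 0.37162162162162166,
    2: 0.3503184713375797,
    3: 0.3503184713375797,
    6: 0.3503184713375797,
}

# Vector scores for the five nearest rows (k = 5); see the lead-in for
# where each value comes from.
SEMANTIC_SCORES = {
    1: 1.0,
    5: 0.41421356,
    2: 0.33333333,
    3: 0.30901699,
    4: 0.30901699,
}


def hybrid_rank(keyword_scores, semantic_scores):
    """Sum both scores per row and sort by the total, highest first.

    Assumption: a row missing from one result set contributes 0 there.
    """
    ids = set(keyword_scores) | set(semantic_scores)
    combined = {
        c1: keyword_scores.get(c1, 0.0) + semantic_scores.get(c1, 0.0)
        for c1 in ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for c1, score in hybrid_rank(KEYWORD_SCORES, SEMANTIC_SCORES):
        print(c1, score)
```

The resulting order is rows 1, 2, 3, 5, 6, 4, which matches the hybrid search result above. Row 6 keeps only its keyword score because it is not among the five nearest vectors.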

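Parameter tuning

In hybrid search, you can use the boost parameter to adjust the relative weights of full-text search and vector search and thereby tune the results. For example, to increase the weight of full-text search: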

SET @parm = '{
"query": {
"query_string": {
"fields": ["query", "content"],
"query": "hello oceanbase",
"boost": 2.0
}
},
"knn" : {
"field": "vector",
"k": 5,
"query_vector": [1,2,3],
"boost": 1.0
}
}';

SELECT json_pretty(DBMS_HYBRID_SEARCH.SEARCH('doc_table', @parm));

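By adjusting the boost parameters, you control how much keyword search and semantic search each contribute to the final ranking. If keyword matching matters more, increase the boost value of query_string; if semantic similarity matters more, increase the boost value of knn.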
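
The source does not specify how boost enters the score calculation. As a rough illustration only, the following Python sketch assumes that each boost is a linear multiplier applied to its sub-score before the two are summed. The sub-scores are taken from the sample outputs in this tutorial, and the function name weighted_rank is hypothetical. The output is what this assumption predicts, not a result returned by seekdb.

```python
# Sub-scores taken from the sample outputs in this tutorial (c1 -> score).
KEYWORD_SCORES = {
    1: 0.37162162162162166,
    2: 0.3503184713375797,
    3: 0.3503184713375797,
    6: 0.3503184713375797,
}
SEMANTIC_SCORES = {
    1: 1.0,
    5: 0.41421356,
    2: 0.33333333,
    3: 0.30901699,
    4: 0.30901699,
}


def weighted_rank(keyword_scores, semantic_scores,
                  keyword_boost=1.0, knn_boost=1.0):
    """Rank rows by keyword_boost * keyword + knn_boost * semantic.

    Assumption: boost is a linear multiplier on each sub-score. The source
    does not confirm this.
    """
    ids = set(keyword_scores) | set(semantic_scores)
    combined = {
        c1: keyword_boost * keyword_scores.get(c1, 0.0)
        + knn_boost * semantic_scores.get(c1, 0.0)
        for c1 in ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for label, kw, knn in [("equal", 1.0, 1.0), ("keyword x2", 2.0, 1.0)]:
        order = [c1 for c1, _ in weighted_rank(
            KEYWORD_SCORES, SEMANTIC_SCORES, kw, knn)]
        print(label, order)
```

Under this assumption, doubling the keyword boost moves row 6, which matches only on keywords, ahead of row 5, which matches only on vector similarity.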

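Summary

In this tutorial, you have learned the core capabilities of seekdb hybrid search:

  • Pure vector search: finds related content by semantic similarity. It suits semantic search scenarios.
  • Pure full-text search: finds content by keyword matching. It suits exact search scenarios.
  • Hybrid search: combines keywords with semantic understanding to return more complete and more accurate results.

Hybrid search is well suited to processing large volumes of unstructured data and to building intelligent search and recommendation systems, where it improves both the accuracy and the completeness of search results.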

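Next steps

More operations

For more about seekdb's AI Native features and for guidance on building AI applications on seekdb, see the seekdb documentation.

Besides SQL, you can also work with seekdb through its Python SDK (pyseekdb). For usage, see "Experience embedded mode with the Python SDK" and "pyseekdb overview".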