
Experience embedded seekdb with the Python SDK

This example uses a Linux environment to show how to quickly try out embedded seekdb through pyseekdb, the Python client provided by OceanBase.

Tip

pyseekdb can also be used on macOS and Windows, but on those systems it currently supports only server-mode seekdb. For usage on macOS and Windows, see the pyseekdb quick start.

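In this example, we will:

  1. Deploy pyseekdb and embedded seekdb.
  2. Connect to seekdb and create a database.
  3. Connect to the database and create a collection with Embedding Functions.
  4. Add data as documents (vectors are generated automatically).
  5. Run a hybrid search (vectors are generated automatically) and print the results.
  6. Clean up the environment.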

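Background

pyseekdb

pyseekdb is the Python client provided by OceanBase. Through a single set of APIs, it offers three connection modes: embedded-mode seekdb, server-mode seekdb, and OceanBase Database.

Installing this client also installs embedded-mode seekdb, so you can connect directly to embedded seekdb and perform operations such as creating a database. Alternatively, you can connect remotely to an already deployed server-mode seekdb or OceanBase Database.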

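seekdb deployment modes

seekdb provides flexible deployment modes that cover needs ranging from rapid prototyping to serving a very large user base.

  • Embedded mode

    seekdb is embedded in your application as a lightweight library and can be installed with a single pip command. It is suited to personal learning and rapid prototyping, and it can run efficiently on a variety of edge devices.

  • Server mode

    The recommended deployment mode for testing and production environments. It is lightweight and easy to use, and suited to providing stable, efficient service.

    For how to use server mode, see Experience server-mode seekdb with SQL.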

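Step 1: Install pyseekdb and deploy embedded seekdb

Prerequisites

Make sure your environment meets the following requirements:

  • Operating system: Linux (glibc >= 2.28)
  • Python version: Python 3.11 or later
  • System architecture: x86_64 or aarch64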
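
The requirements above can be checked programmatically before installing. The following is a minimal sketch that uses only the Python standard library; the helper names `parse_version` and `check_environment` are hypothetical and are not part of pyseekdb.

```python
import platform
import sys

# Thresholds taken from the prerequisites listed above.
MIN_PYTHON = (3, 11)
MIN_GLIBC = (2, 28)
SUPPORTED_ARCHES = {"x86_64", "aarch64"}


def parse_version(text):
    """Turn a string such as '2.35' into (2, 35); return None if unparsable."""
    try:
        return tuple(int(part) for part in text.split(".")[:2])
    except ValueError:
        return None


def check_environment(python_version, system, machine, libc):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.11 or later is required")
    if system != "Linux":
        problems.append("embedded mode requires Linux")
    if machine not in SUPPORTED_ARCHES:
        problems.append(f"unsupported architecture: {machine}")
    lib_name, lib_version = libc
    glibc = parse_version(lib_version) if lib_name == "glibc" else None
    if glibc is None or glibc < MIN_GLIBC:
        problems.append("glibc 2.28 or later is required")
    return problems


# Check the machine this script is running on.
issues = check_environment(
    sys.version_info, platform.system(), platform.machine(), platform.libc_ver()
)
print("Environment looks OK" if not issues else "\n".join(issues))
```

Passing the detected values in as arguments, rather than reading them inside the function, keeps the check easy to try against other configurations.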

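Installation

Install with pip, which automatically detects the default Python version and platform.

pip install pyseekdb

If your pip version is old, upgrade pip first and then install.

pip install --upgrade pip

Installing pyseekdb also installs embedded-mode seekdb, so you can connect directly to embedded seekdb and perform operations such as creating a database.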
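
To confirm that the installation succeeded, one option is to ask Python's packaging metadata for the installed version. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Look up the distribution name used in the pip command above.
version = installed_version("pyseekdb")
print(f"pyseekdb {version}" if version else "pyseekdb is not installed")
```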

Step 2: Connect to seekdb and create a database

Use the Admin Client to connect to seekdb and create a database named hybrid_search_test.

import pyseekdb

# Create embedded admin client
admin = pyseekdb.AdminClient(path="./seekdb.db")
# Create database
admin.create_database("hybrid_search_test")

Step 3: Connect to the database and create a collection with Embedding Functions

Use the Client to connect to the hybrid_search_test database and create a collection.

import pyseekdb

# Create embedded client
client = pyseekdb.Client(path="./seekdb.db", database="hybrid_search_test")
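# Create collection (no embedding function is passed explicitly,
# so this example relies on the collection's default one)
collection = client.create_collection(
    name="hybrid_search_demo"
)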

Step 4: Insert data

Use the add method to insert data into the collection.

Tip

For details about inserting data, see add - Insert data.

import pyseekdb

# Create embedded client
client = pyseekdb.Client(path="./seekdb.db", database="hybrid_search_test")
# Get collection
collection = client.get_collection("hybrid_search_demo")

# Define documents
documents = [
    "Machine learning is revolutionizing artificial intelligence and data science",
    "Python programming language is essential for machine learning developers",
    "Deep learning neural networks enable advanced AI applications",
    "Data science combines statistics, programming, and domain expertise",
    "Natural language processing uses machine learning to understand text",
    "Computer vision algorithms process images using deep learning techniques",
    "Reinforcement learning trains agents through reward-based feedback",
    "Python libraries like TensorFlow and PyTorch simplify machine learning",
    "Artificial intelligence systems can learn from large datasets",
    "Neural networks mimic the structure of biological brain connections",
]
# Define metadatas (one entry per document, in the same order)
metadatas = [
    {"category": "AI", "topic": "machine learning", "year": 2023, "popularity": 95},
    {"category": "Programming", "topic": "python", "year": 2023, "popularity": 88},
    {"category": "AI", "topic": "deep learning", "year": 2024, "popularity": 92},
    {"category": "Data Science", "topic": "data analysis", "year": 2023, "popularity": 85},
    {"category": "AI", "topic": "nlp", "year": 2024, "popularity": 90},
    {"category": "AI", "topic": "computer vision", "year": 2023, "popularity": 87},
    {"category": "AI", "topic": "reinforcement learning", "year": 2024, "popularity": 89},
    {"category": "Programming", "topic": "python", "year": 2023, "popularity": 91},
    {"category": "AI", "topic": "general ai", "year": 2023, "popularity": 93},
    {"category": "AI", "topic": "neural networks", "year": 2024, "popularity": 94},
]

# Generate ids doc_1 ... doc_10
ids = [f"doc_{i + 1}" for i in range(len(documents))]
# Insert data
collection.add(ids=ids, documents=documents, metadatas=metadatas)
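
The add call above receives three parallel lists, so position i in ids, documents, and metadatas describes the same record. A small pre-flight check can catch a length mismatch before any data is sent. The sketch below is independent of pyseekdb; `pair_records` is a hypothetical helper, and the sample data is a two-item subset of the lists above.

```python
def pair_records(ids, documents, metadatas):
    """Pair three parallel lists by position, failing on a length mismatch."""
    if not (len(ids) == len(documents) == len(metadatas)):
        raise ValueError(
            f"length mismatch: {len(ids)} ids, {len(documents)} documents, "
            f"{len(metadatas)} metadatas"
        )
    return [
        {"id": i, "document": d, "metadata": m}
        for i, d, m in zip(ids, documents, metadatas)
    ]


sample_documents = [
    "Machine learning is revolutionizing artificial intelligence and data science",
    "Python programming language is essential for machine learning developers",
]
sample_metadatas = [
    {"category": "AI", "topic": "machine learning", "year": 2023, "popularity": 95},
    {"category": "Programming", "topic": "python", "year": 2023, "popularity": 88},
]
sample_ids = [f"doc_{i + 1}" for i in range(len(sample_documents))]

# Print each id next to the topic it will be stored with.
for record in pair_records(sample_ids, sample_documents, sample_metadatas):
    print(record["id"], record["metadata"]["topic"])
```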

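Step 5: Run a hybrid search and print the results

Use the hybrid_search method to run a hybrid search query, then print the results.

Tip

For details about hybrid search, see hybrid_search - Hybrid search.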

import pyseekdb

# Create embedded client
client = pyseekdb.Client(path="./seekdb.db", database="hybrid_search_test")
# Get collection
collection = client.get_collection("hybrid_search_demo")

# Perform hybrid search: a full-text filter plus a vector (knn) query,
# with the two result lists fused by RRF
hybrid_result = collection.hybrid_search(
    query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
    knn={"query_texts": ["AI research"], "n_results": 10},
    rank={"rrf": {}},
    n_results=5
)

# Print results
print("\nhybrid_search() Results:")
print(f" ids: {hybrid_result['ids'][0]}")
print(f" Document: {hybrid_result['documents'][0]}")
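
The rank={"rrf": {}} argument selects Reciprocal Rank Fusion to merge the full-text and vector result lists. The sketch below illustrates the general RRF formula, score(d) = sum of 1 / (k + rank of d in each list), in plain Python. It is not seekdb's implementation: the constant k=60 is the value commonly used for RRF and is an assumption here, and both ranked lists are hypothetical rather than real query output.

```python
def reciprocal_rank_fusion(rankings, k=60, n_results=5):
    """Fuse several ranked id lists with RRF and return the top n_results ids."""
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based: the first hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by id for a deterministic order.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:n_results]


# Hypothetical ranked lists, standing in for the two sub-queries.
keyword_hits = ["doc_1", "doc_2", "doc_5", "doc_8"]
vector_hits = ["doc_9", "doc_1", "doc_3", "doc_5"]

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Documents that appear in both lists (doc_1 and doc_5 in this made-up input) accumulate two contributions, which is why they rise above documents that rank first in only one list.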

Step 6: Clean up the environment

If you no longer need the example database and collection, use the delete_collection method to delete the collection and the delete_database method to delete the database.

import pyseekdb

# Create embedded admin client and embedded client
admin = pyseekdb.AdminClient(path="./seekdb.db")
client = pyseekdb.Client(path="./seekdb.db", database="hybrid_search_test")

# Delete collection
client.delete_collection("hybrid_search_demo")
print("\nDeleted collection")

# Delete database
admin.delete_database("hybrid_search_test")
print("\nDeleted database")

Complete example

import pyseekdb

# ==================== Create Database ====================
# Create embedded admin client
admin = pyseekdb.AdminClient(path="./seekdb.db")
# Create database
admin.create_database("hybrid_search_test")

# ==================== Create Collection ====================
# Create embedded client
client = pyseekdb.Client(path="./seekdb.db", database="hybrid_search_test")
# Create collection
collection = client.create_collection(
    name="hybrid_search_demo"
)

# ==================== Add Data to Collection ====================
# Define documents
documents = [
    "Machine learning is revolutionizing artificial intelligence and data science",
    "Python programming language is essential for machine learning developers",
    "Deep learning neural networks enable advanced AI applications",
    "Data science combines statistics, programming, and domain expertise",
    "Natural language processing uses machine learning to understand text",
    "Computer vision algorithms process images using deep learning techniques",
    "Reinforcement learning trains agents through reward-based feedback",
    "Python libraries like TensorFlow and PyTorch simplify machine learning",
    "Artificial intelligence systems can learn from large datasets",
    "Neural networks mimic the structure of biological brain connections",
]
# Define metadatas
metadatas = [
    {"category": "AI", "topic": "machine learning", "year": 2023, "popularity": 95},
    {"category": "Programming", "topic": "python", "year": 2023, "popularity": 88},
    {"category": "AI", "topic": "deep learning", "year": 2024, "popularity": 92},
    {"category": "Data Science", "topic": "data analysis", "year": 2023, "popularity": 85},
    {"category": "AI", "topic": "nlp", "year": 2024, "popularity": 90},
    {"category": "AI", "topic": "computer vision", "year": 2023, "popularity": 87},
    {"category": "AI", "topic": "reinforcement learning", "year": 2024, "popularity": 89},
    {"category": "Programming", "topic": "python", "year": 2023, "popularity": 91},
    {"category": "AI", "topic": "general ai", "year": 2023, "popularity": 93},
    {"category": "AI", "topic": "neural networks", "year": 2024, "popularity": 94},
]

ids = [f"doc_{i+1}" for i in range(len(documents))]
# Insert data
collection.add(ids=ids, documents=documents, metadatas=metadatas)

# ==================== Perform Hybrid Search ====================
# Perform hybrid search
hybrid_result = collection.hybrid_search(
    query={"where_document": {"$contains": "machine learning"}, "n_results": 10},
    knn={"query_texts": ["AI research"], "n_results": 10},
    rank={"rrf": {}},
    n_results=5
)

# ==================== Print Query Results ====================
# Print results
print("\nhybrid_search() Results:")
print(f" ids: {hybrid_result['ids'][0]}")
print(f" Document: {hybrid_result['documents'][0]}")

# ==================== Cleanup ====================
# Delete collection
client.delete_collection("hybrid_search_demo")
print("\nDeleted collection")

# Delete database
admin.delete_database("hybrid_search_test")
print("\nDeleted database")

More information