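Integrating seekdb Vector with Cloudflare Workers AI
seekdb provides vector type storage, vector indexes, and embedding-based vector search. You can store vectorized data in seekdb and use it in later searches.
Cloudflare Workers AI is a Cloudflare service that lets developers run machine learning models on Cloudflare's global network. Through its REST API, developers can integrate AI capabilities into their applications.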
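Prerequisites
- You have deployed seekdb.
- A usable database and account exist in your environment, and the database account has been granted read and write privileges.
- Python 3.11 or later is installed.
- The dependencies are installed: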
python3 -m pip install cffi pyseekdb requests httpx tqdm
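Step 1: Obtain the database connection information
Contact the seekdb deployment personnel or administrator to obtain the database connection string, for example: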
mysql -h$host -P$port -u$user_name -p$password -D$database_name
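Parameters:
- $host: the IP address for connecting to seekdb.
- $port: the port for connecting to seekdb. The default is 2881.
- $database_name: the name of the database to access. Note: the connecting user must have the CREATE, INSERT, DROP, and SELECT privileges on this database.
- $user_name: the database account.
- $password: the account password.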
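If you only have the connection string, the following minimal sketch shows one way to split it into the individual parameters described above. The helper `parse_mysql_command` and the `FLAG_NAMES` mapping are hypothetical names introduced for illustration, and the sketch assumes every flag is written with its value attached (for example `-P2881`), as in the example command.

```python
import shlex

# Hypothetical mapping from the short mysql CLI flags to readable names.
FLAG_NAMES = {"-h": "host", "-P": "port", "-u": "user", "-p": "password", "-D": "database"}


def parse_mysql_command(command: str) -> dict:
    """Extract connection parameters from a `mysql -h... -P...` command string."""
    params = {}
    for token in shlex.split(command)[1:]:  # skip the leading "mysql"
        flag, value = token[:2], token[2:]
        if flag in FLAG_NAMES:
            params[FLAG_NAMES[flag]] = value
    # Fall back to the default port 2881 when the command omits -P.
    params["port"] = int(params.get("port", 2881))
    return params


# Placeholder values, not real credentials.
example = parse_mysql_command("mysql -h127.0.0.1 -P2881 -uroot -psecret -Dtest")
print(example)
```

The resulting dictionary keys line up with the host, port, user, password, and database values that the later steps read from environment variables.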
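Step 2: Build your AI assistant
Set the Cloudflare API key environment variables
Obtain your Cloudflare API key and account ID, then set them as environment variables together with the seekdb connection information.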
export SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
export SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
export SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
export SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
export SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
export CLOUDFLARE_API_KEY=YOUR_CLOUDFLARE_API_KEY
export account_id=your_account_id
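Before running the sample code, it can help to confirm that the variables are actually visible to Python. The sketch below is one possible check; `REQUIRED_VARS` and `missing_variables` are hypothetical names, and the variable list is taken from the `os.getenv` calls in the sample code on this page (the port is left out because the sample code falls back to 2881).

```python
import os

# Variable names read by the sample code via os.getenv.
REQUIRED_VARS = [
    "SEEKDB_DATABASE_HOST",
    "SEEKDB_DATABASE_USER",
    "SEEKDB_DATABASE_DB_NAME",
    "SEEKDB_DATABASE_PASSWORD",
    "CLOUDFLARE_API_KEY",
    "account_id",
]


def missing_variables(env=None) -> list:
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_variables()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```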
Sample code
The following example uses the bge-base-en-v1.5 model and calls the Cloudflare Workers AI embedding API to generate vector data:
import os
import httpx
import pyseekdb
from tqdm import tqdm
from pyseekdb import HNSWConfiguration
documents = [
    "Machine learning is the core technology of artificial intelligence",
    "Python is the preferred programming language for data science",
    "Cloud computing provides elastic and scalable computing resources",
    "Blockchain technology ensures data security and transparency",
    "Natural language processing helps computers understand human language"
]
BASE_URL = "https://api.cloudflare.com/client/v4/accounts"
model_name = "@cf/baai/bge-base-en-v1.5"
account_id=os.getenv('account_id')
CLOUDFLARE_API_KEY = os.getenv('CLOUDFLARE_API_KEY')
api_url = f"{BASE_URL}/{account_id}/ai/run/{model_name}"
# Create the HTTP client
httpclient = httpx.Client()
httpclient.headers.update({
    "Authorization": f"Bearer {CLOUDFLARE_API_KEY}",
    "Accept-Encoding": "identity"
})
# Request embeddings for all documents in a single call
payload = {"text": documents}
response = httpclient.post(api_url, json=payload)
response.raise_for_status()  # stop early on an HTTP error status
embedding_response = response.json()["result"]["data"]
ids = []
embeddings = []
documents_list = []
for i, text in enumerate(tqdm(documents, desc="Creating embeddings")):
    # Use the pre-computed embedding from the response
    embedding = embedding_response[i]
    ids.append(f"{i+1}")
    embeddings.append(embedding)
    documents_list.append(text)
print(f"Successfully processed {len(documents_list)} texts")
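The sample code above reads `result.data` directly from the response. As a rough sketch of a more defensive variant, the helper below checks the decoded response before the vectors are used. `extract_embeddings` is a hypothetical helper, and the `success` and `errors` fields are an assumption about the response envelope, so adjust them to what your account actually returns. The default dimension of 768 matches the `dimension=768` setting used for bge-base-en-v1.5 on this page.

```python
def extract_embeddings(body: dict, expected_count: int, dimension: int = 768) -> list:
    """Validate a decoded embedding response and return its vectors (hypothetical helper)."""
    # Assumption: a failed call carries "success": false and an "errors" list.
    if not body.get("success", True):
        raise RuntimeError(f"Embedding request failed: {body.get('errors')}")
    vectors = body["result"]["data"]
    if len(vectors) != expected_count:
        raise ValueError(f"Expected {expected_count} vectors, got {len(vectors)}")
    for index, vector in enumerate(vectors):
        if len(vector) != dimension:
            raise ValueError(
                f"Vector {index} has {len(vector)} dimensions, expected {dimension}"
            )
    return vectors


# Stand-in response with two tiny vectors, so the sketch runs without network access.
fake_body = {"success": True, "result": {"data": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}}
vectors = extract_embeddings(fake_body, expected_count=2, dimension=3)
print(f"Validated {len(vectors)} vectors")
```

In the sample code, this would correspond to calling `extract_embeddings(response.json(), expected_count=len(documents))` in place of the direct dictionary lookup.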
Define the table structure and store the data in seekdb
Create a table named cloudflare_seekdb_demo_documents and store the data in seekdb:
SEEKDB_DATABASE_HOST = os.getenv('SEEKDB_DATABASE_HOST')
SEEKDB_DATABASE_PORT = int(os.getenv('SEEKDB_DATABASE_PORT', 2881))
SEEKDB_DATABASE_USER = os.getenv('SEEKDB_DATABASE_USER')
SEEKDB_DATABASE_DB_NAME = os.getenv('SEEKDB_DATABASE_DB_NAME')
SEEKDB_DATABASE_PASSWORD = os.getenv('SEEKDB_DATABASE_PASSWORD')
client = pyseekdb.Client(host=SEEKDB_DATABASE_HOST, port=SEEKDB_DATABASE_PORT, database=SEEKDB_DATABASE_DB_NAME, user=SEEKDB_DATABASE_USER, password=SEEKDB_DATABASE_PASSWORD)
table_name = "cloudflare_seekdb_demo_documents"
config = HNSWConfiguration(dimension=768, distance='cosine')
collection = client.create_collection(
    name=table_name,
    configuration=config,
    embedding_function=None
)
print('- Inserting Data to seekdb...')
collection.add(
    ids=ids,
    embeddings=embeddings,
    documents=documents
)
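Semantic search
Generate the embedding of the query text through the Cloudflare Workers AI API, then search for the most relevant document based on the cosine distance between the query embedding and each vector in the vector table: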
query = 'Programming languages for data analysis'
payload = {"text": query}
response = httpclient.post(api_url, json=payload)
query_embedding = response.json()["result"]["data"]
res = collection.query(
    query_embeddings=query_embedding,
    n_results=1
)
print('- The Most Relevant Document and Its Distance to the Query:')
for doc_id, document, distance in zip(
    res['ids'][0],
    res['documents'][0],
    res['distances'][0]
):
    print(f'- ID: {doc_id}')
    print(f' content: {document}')
    print(f' distance: {distance:.6f}')
Expected result
- The Most Relevant Document and Its Distance to the Query:
- ID: 2
 content: Python is the preferred programming language for data science
 distance: 0.139745
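To make the ranking criterion concrete, here is a small plain-Python sketch of cosine distance and a brute-force nearest-neighbor search. It assumes the usual definition, 1 minus cosine similarity. The functions `cosine_distance` and `nearest` are illustrative names only and are not how seekdb executes the query, since the collection above is backed by an HNSW index.

```python
import math


def cosine_distance(a, b) -> float:
    """Cosine distance: 1 minus the cosine similarity of the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, vectors, n_results=1):
    """Brute-force search: return (index, distance) pairs sorted by ascending distance."""
    scored = [(i, cosine_distance(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


# Toy 2-dimensional vectors standing in for the 768-dimensional embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(nearest([0.9, 0.1], toy_vectors))
```

A smaller distance means the two vectors point in a more similar direction, which is why the document with the smallest distance is reported as the most relevant one.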