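Tongyi Qianwen
Tongyi Qianwen is a large language model developed by Alibaba Cloud for understanding and analyzing user input. You can use the Tongyi Qianwen model API service through the Alibaba Cloud Model Experience Center.
seekdb provides vector type storage, vector indexes, and embedding-based vector search. You can call the Tongyi Qianwen API to vectorize your data, store the resulting vectors in seekdb, and then query related data with seekdb's vector search.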
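Prerequisites
- You have deployed seekdb.
- Your environment has a usable MySQL database and account, and the account has been granted read and write privileges.
- Python 3.9 or later and the corresponding pip are installed.
- pyseekdb, the DashScope SDK, and the supporting packages pandas and cffi are installed:
  python3 -m pip install pyseekdb dashscope pandas cffi
- You have a Tongyi Qianwen API key.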
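Before continuing, it can help to confirm that the Python-side prerequisites are in place. The following is a minimal sketch, not part of the official setup: the helper names `python_version_ok` and `missing_packages` are hypothetical, and the package list simply mirrors the pip command above.

```python
import importlib.util
import sys

# Import names of the packages installed by the pip command above.
REQUIRED_PACKAGES = ["pyseekdb", "dashscope", "pandas", "cffi"]


def python_version_ok(version_info=sys.version_info, minimum=(3, 9)):
    """Return True if the interpreter is at least the minimum version."""
    return tuple(version_info[:2]) >= minimum


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_version_ok():
        print("Python 3.9 or later is required.")
    missing = missing_packages(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Running the file prints the names of any packages that still need to be installed.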
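Step 1: Obtain the seekdb connection string
Contact the seekdb deployment personnel or administrator to obtain the database connection string. For example:
mysql -h$host -P$port -u$user_name -p$password -D$database_name
Parameters:
- $host: the IP address for connecting to seekdb.
- $port: the port for connecting to seekdb. The default is 2881.
- $database_name: the name of the database to access. Note: the connecting user must have the CREATE, INSERT, DROP, and SELECT privileges on this database.
- $user_name: the database account.
- $password: the account password.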
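To illustrate how the parameters above map onto the connection string, here is a small sketch that assembles the command from its parts. The helper `build_mysql_command` is hypothetical and exists only for this example. It uses `shlex.quote` so that values containing spaces or shell metacharacters stay intact when pasted into a shell.

```python
import shlex


def build_mysql_command(host, user_name, password, database_name, port=2881):
    """Assemble the mysql client command from the connection parameters.

    The default port 2881 is the seekdb default stated above.
    """
    args = [
        "mysql",
        f"-h{host}",
        f"-P{port}",
        f"-u{user_name}",
        f"-p{password}",
        f"-D{database_name}",
    ]
    # Quote each argument so special characters survive shell parsing.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_mysql_command("127.0.0.1", "root", "secret", "test"))
```

With the placeholder values shown, the script prints `mysql -h127.0.0.1 -P2881 -uroot -psecret -Dtest`.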
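Step 2: Configure the DashScope API key and seekdb connection information as environment variables
On Unix-based systems (such as Ubuntu or macOS), run the following commands in a terminal: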
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
export SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
export SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
export SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
export SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
export SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
On Windows, run the following commands in Command Prompt:
set DASHSCOPE_API_KEY=YOUR_DASHSCOPE_API_KEY
set SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
set SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
set SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
set SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
set SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
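The scripts in the following steps read these variables with `os.getenv`. As a sketch of how the values could be validated up front, the hypothetical helper `load_seekdb_config` below collects them into a dictionary and fails early if a required variable is missing. Treating the password as optional and defaulting the port to 2881 are assumptions made for this example.

```python
import os

# Variables that must be set before the scripts can run.
REQUIRED_VARS = [
    "DASHSCOPE_API_KEY",
    "SEEKDB_DATABASE_HOST",
    "SEEKDB_DATABASE_USER",
    "SEEKDB_DATABASE_DB_NAME",
]


def load_seekdb_config(env=None):
    """Read the seekdb connection settings from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["SEEKDB_DATABASE_HOST"],
        # Fall back to the seekdb default port when the variable is unset.
        "port": int(env.get("SEEKDB_DATABASE_PORT") or 2881),
        "database": env["SEEKDB_DATABASE_DB_NAME"],
        "user": env["SEEKDB_DATABASE_USER"],
        # Assumption: an unset password means an empty password.
        "password": env.get("SEEKDB_DATABASE_PASSWORD", ""),
    }
```

The returned keys match the keyword arguments that the scripts in this guide pass to `pyseekdb.Client`, so the result could be used as `pyseekdb.Client(**load_seekdb_config())`.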
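Step 3: Store vector data in seekdb
- Prepare the test data. Download the CSV file with precomputed embeddings. The file contains a dataset of 1,000 food reviews, and its last column holds the embedding values, so you do not need to compute the vectors again. Alternatively, use the following code to recompute the embedding column (the vector column) and generate a new CSV file.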
import dashscope
import pandas as pd
from dashscope import TextEmbedding

input_datapath = "./fine_food_reviews.csv"

# This example uses the text_embedding_v1 embedding model; change it as needed.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(model=TextEmbedding.Models.text_embedding_v1, input=text)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    # Return a list of vectors for list input, or a single vector for a single string.
    return embeddings if isinstance(text, list) else embeddings[0]

df = pd.read_csv(input_datapath, index_col=0)
# Generating the embeddings takes a few minutes because the
# Tongyi Qianwen Embedding API is called once per row.
df["embedding"] = df.combined.apply(generate_embeddings)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
- Run the following script to insert the test data into seekdb. The script must be in the same directory as the test data.
import csv
import json
import os

import pyseekdb
from pyseekdb import HNSWConfiguration

ids = []
embeddings = []
documents = []
metadatas = []

file_name = "fine_food_reviews_self_embeddings.csv"
file_path = os.path.join("./", file_name)

# Open and read the CSV file.
with open(file_path, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    headers = next(csvreader)
    print("Headers:", headers)
    for i, row in enumerate(csvreader):
        # Each row needs all 9 columns; the embedding is the last one.
        if not row or len(row) < 9:
            print(f"Skipping line {i+2}: incomplete data")
            continue
        ids.append(row[0])
        embeddings.append(json.loads(row[8]))
        documents.append(row[6])
        metadata = {
            "product_id": str(row[1]),
            "user_id": str(row[2]),
            "score": str(row[3]),
            "summary": str(row[4]),
            "n_tokens": str(row[7])
        }
        metadatas.append(metadata)

# Connect to seekdb by using pyseekdb.
client = pyseekdb.Client(
    host=os.getenv('SEEKDB_DATABASE_HOST'),
    port=int(os.getenv('SEEKDB_DATABASE_PORT', 2881)),
    database=os.getenv('SEEKDB_DATABASE_DB_NAME'),
    user=os.getenv('SEEKDB_DATABASE_USER'),
    password=os.getenv('SEEKDB_DATABASE_PASSWORD')
)

table_name = 'fine_food_reviews'
# The dimension must match the embedding model output (1536 for text_embedding_v1).
config = HNSWConfiguration(dimension=1536, distance='cosine')
collection = client.create_collection(
    name=table_name,
    configuration=config,
    embedding_function=None
)

# Insert 100 rows each time.
batch_size = 100
total_records = len(ids)
for i in range(0, total_records, batch_size):
    end_idx = min(i + batch_size, total_records)
    batch_ids = ids[i:end_idx]
    batch_embeddings = embeddings[i:end_idx]
    batch_documents = documents[i:end_idx]
    batch_metadatas = metadatas[i:end_idx]
    try:
        collection.add(
            ids=batch_ids,
            embeddings=batch_embeddings,
            documents=batch_documents,
            metadatas=batch_metadatas
        )
        print(f"Batch {i//batch_size + 1} inserted successfully!")
    except Exception as e:
        print(f"Batch {i//batch_size + 1} insertion failed: {e}")
        break
else:
    # Runs only if no batch failed (the loop was not ended by break).
    print("All data insertion completed!")
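The insertion script creates the vector index with a dimension of 1536, so a row whose embedding has a different length cannot fit that index. The sketch below shows one way to check each embedding before it is added to a batch. The helper `parse_embedding` is hypothetical and is not part of pyseekdb or the DashScope SDK.

```python
import json

# Matches HNSWConfiguration(dimension=1536) in the insertion script.
EXPECTED_DIMENSION = 1536


def parse_embedding(raw, dimension=EXPECTED_DIMENSION):
    """Parse a JSON-encoded vector from a CSV cell and verify its length."""
    vector = json.loads(raw)
    if not isinstance(vector, list):
        raise ValueError("embedding is not a JSON array")
    if len(vector) != dimension:
        raise ValueError(
            f"expected {dimension} dimensions, got {len(vector)}"
        )
    # Normalize all components to float.
    return [float(value) for value in vector]
```

In the CSV loop, `embeddings.append(parse_embedding(row[8]))` could replace the plain `json.loads` call, with a `try`/`except ValueError` around it to skip malformed rows.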
Step 4: Query seekdb data
- Save the following Python script as query.py.
import os
import sys
import pyseekdb
import dashscope
from dashscope import TextEmbedding

# Obtain command-line options.
if len(sys.argv) != 2:
    print("Enter a query statement.")
    sys.exit(1)
queryStatement = sys.argv[1]

# Connect to seekdb by using pyseekdb.
client = pyseekdb.Client(
    host=os.getenv('SEEKDB_DATABASE_HOST'),
    port=int(os.getenv('SEEKDB_DATABASE_PORT', 2881)),
    database=os.getenv('SEEKDB_DATABASE_DB_NAME'),
    user=os.getenv('SEEKDB_DATABASE_USER'),
    password=os.getenv('SEEKDB_DATABASE_PASSWORD')
)

# Define the function for generating text vectors.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(model=TextEmbedding.Models.text_embedding_v1, input=text)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]

# Embed the query text, then search the collection for the nearest vectors.
def query_ob(query, tableName, top_k=1):
    query_embedding = generate_embeddings(query)
    collection = client.get_collection(name=tableName)
    res = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k
    )
    print('- The Most Relevant Document and Its Distance to the Query:')
    for doc_id, document, distance in zip(
        res['ids'][0],
        res['documents'][0],
        res['distances'][0]
    ):
        print(f' - ID: {doc_id}')
        print(f' content: {document}')
        print(f' distance: {distance:.6f}')

# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement, table_name, 1)
- Enter a question to get the most relevant answer.
python3 query.py 'pet food'
The expected result is as follows:
- The Most Relevant Document and Its Distance to the Query:
- ID: 444
content: Title: Healthy Dog Food; Content: This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.
distance: 0.509108
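To help interpret the distance value, the sketch below computes a cosine distance in plain Python. It assumes the common convention that cosine distance equals 1 minus cosine similarity. This guide does not state how seekdb computes the reported value, so treat the mapping as an assumption. The function `cosine_distance` is a hypothetical helper for illustration only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Identical directions give 0, orthogonal vectors give 1.
    print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
    print(cosine_distance([1.0, 0.0], [0.0, 3.0]))
```

Under this convention, smaller values mean closer matches, and the distance of 0.509108 in the sample output would correspond to a cosine similarity of about 0.49.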