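Integrate seekdb with CamelAI
seekdb provides vector type storage, vector indexes, and embedding vector search. You can store vectorized data in seekdb and use it in later searches.
CamelAI changes how teams interact with data: ask a question in natural language and get accurate SQL queries, analysis, and visualizations right away.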
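Prerequisites
- You have deployed a seekdb database.
- Your environment has a usable MySQL database and account, and the account has been granted read and write privileges.
- Python 3.11 or later is installed.
- The dependencies are installed: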
python3 -m pip install "unstructured[pdf]" camel-ai pyobvector
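Before moving on, it can help to confirm that the interpreter and packages meet the prerequisites above. The following is a minimal sketch that uses only the standard library. The helper names meets_minimum and missing_modules are hypothetical, and camel, pyobvector, and unstructured are assumed to be the top-level import names of the packages installed by the pip command.

```python
import importlib.util
import sys

# Assumed top-level import names for the packages installed above.
REQUIRED_MODULES = ("camel", "pyobvector", "unstructured")


def meets_minimum(version, minimum=(3, 11)):
    """Return True if a (major, minor, ...) version tuple is at least `minimum`."""
    return tuple(version[:2]) >= tuple(minimum)


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3.11+:", meets_minimum(sys.version_info))
    print("Missing modules:", missing_modules() or "none")
```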
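Step 1: Obtain the database connection information
Contact the seekdb deployment personnel or administrator to obtain the database connection string, for example: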
mysql -h$host -P$port -u$user_name -p$password -D$database_name
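Parameters:
- $host: the IP address for connecting to seekdb.
- $port: the port for connecting to seekdb. The default is 2881.
- $database_name: the name of the database to access. Note that the connecting user needs the CREATE, INSERT, DROP, and SELECT privileges on this database.
- $user_name: the database account.
- $password: the password of the account.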
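The connection string maps one-to-one onto the parameters listed above. As an illustration, the sketch below assembles the same mysql command from Python values. build_mysql_command is a hypothetical helper, and the sample values are placeholders rather than real credentials.

```python
import shlex


def build_mysql_command(host, port, user_name, password, database_name):
    """Return the mysql client arguments for the connection string shown above."""
    return [
        "mysql",
        f"-h{host}",
        f"-P{port}",
        f"-u{user_name}",
        f"-p{password}",
        f"-D{database_name}",
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with the details from your administrator.
    args = build_mysql_command("127.0.0.1", 2881, "root", "secret", "test")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(args))
```

Keeping the arguments as a list means they can also be passed directly to subprocess.run without any shell quoting.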
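Step 2: Build your AI assistant
Set environment variables
Obtain a Jina AI API key, then set it as an environment variable together with the seekdb connection information.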
export SEEKDB_DATABASE_URL=YOUR_SEEKDB_DATABASE_URL
export SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
export SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
export SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
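A missing variable only surfaces later as a failed connection or a rejected API call, so checking up front can save time. The following is a minimal sketch of such a check. missing_env_vars is a hypothetical helper, and the variable names are the ones exported above.

```python
import os

# The variable names exported above.
REQUIRED_VARS = (
    "SEEKDB_DATABASE_URL",
    "SEEKDB_DATABASE_USER",
    "SEEKDB_DATABASE_DB_NAME",
    "SEEKDB_DATABASE_PASSWORD",
    "JINAAI_API_KEY",
)


def missing_env_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(os.environ)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```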
Load data
CamelAI supports multiple embedding models, such as OpenAIEmbedding, VisionLanguageEmbedding, and JinaEmbedding. This example uses JinaEmbedding.
import os

from camel.embeddings import JinaEmbedding
from camel.retrievers import VectorRetriever
from camel.storages.vectordb_storages import OceanBaseStorage
from camel.types import EmbeddingModelType

# Sample documents to embed and store.
documents = [
"""Artificial Intelligence (AI) is a branch of computer science that aims to create systems capable of performing tasks that typically require human intelligence. AI encompasses multiple subfields including machine learning, deep learning, natural language processing, and computer vision.""",
"""Machine Learning is a subset of artificial intelligence that enables computers to learn and improve without being explicitly programmed. The main types of machine learning include supervised learning, unsupervised learning, and reinforcement learning.""",
"""Deep Learning is a branch of machine learning that uses multi-layered neural networks to simulate how the human brain works. Deep learning has achieved breakthrough progress in areas such as image recognition, speech recognition, and natural language processing.""",
"""Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. NLP applications include machine translation, sentiment analysis, text summarization, and chatbots.""",
"""Computer Vision is a field of artificial intelligence that aims to enable computers to identify and understand content in digital images and videos. Applications include facial recognition, object detection, medical image analysis, and autonomous vehicles.""",
"""Reinforcement Learning is a machine learning method where an agent learns how to make decisions through interaction with an environment. The agent optimizes its behavioral strategy through trial and error and reward mechanisms.""",
"""Neural Networks are computational models inspired by biological neural systems, composed of interconnected nodes (neurons). Neural networks can learn complex patterns and relationships and serve as the foundation for deep learning.""",
"""Large Language Models (LLMs) are natural language processing models based on deep learning. These models are trained on vast amounts of text data and can generate human-like text and answer questions.""",
"""Transformer architecture is a neural network architecture that has revolutionized natural language processing. It uses attention mechanisms to process sequential data and forms the basis for models like GPT and BERT.""",
"""Generative AI refers to artificial intelligence systems that can create new content, including text, images, audio, and video. Examples include ChatGPT for text generation, DALL-E for image creation, and various AI tools for creative applications."""
]
# Read the Jina AI API key from the environment and create the embedding model.
JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')
embedding = JinaEmbedding(
    api_key=JINAAI_API_KEY,
    model_type=EmbeddingModelType.JINA_EMBEDDINGS_V3,
)
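Connect to seekdb, define the vector table schema, and store the data
Create a table named my_seekdb_vector_table. Its schema is fixed: id, embedding, and metadata. Use the Jina AI Embeddings API to generate a vector for each text passage and store it in seekdb: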
# Read the seekdb connection settings from the environment.
OB_URI = os.getenv('SEEKDB_DATABASE_URL')
OB_USER = os.getenv('SEEKDB_DATABASE_USER')
OB_DB_NAME = os.getenv('SEEKDB_DATABASE_DB_NAME')
OB_PASSWORD = os.getenv('SEEKDB_DATABASE_PASSWORD')

# Create the table; vectors are compared by cosine distance.
ob_storage = OceanBaseStorage(
    vector_dim=embedding.get_output_dim(),
    table_name="my_seekdb_vector_table",
    uri=OB_URI,
    user=OB_USER,
    password=OB_PASSWORD,
    db_name=OB_DB_NAME,
    distance="cosine",
)

# The retriever embeds text with the Jina model and writes to the storage.
vector_retriever = VectorRetriever(
    embedding_model=embedding, storage=ob_storage
)

# Embed each document and store it in seekdb.
for i, doc in enumerate(documents):
    print(f"Processing document {i+1}/{len(documents)}")
    vector_retriever.process(content=doc)
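The distance="cosine" setting compares vectors by the angle between them rather than by their magnitude; cosine distance is commonly defined as 1 minus cosine similarity. The pure-Python sketch below illustrates the metric on toy two-dimensional vectors. It does not show how seekdb computes distances internally, and cosine_similarity and rank_by_cosine are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, vectors, top_k=1):
    """Return the indices of the `top_k` vectors most similar to `query`."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Indices of the two toy vectors closest in direction to the query.
    print(rank_by_cosine([2.0, 0.1], toy_vectors, top_k=2))
```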
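Semantic search
Generate a vector for the query text through the Jina AI Embeddings API, then find the most relevant document by the cosine distance between the query vector and each vector in the vector table: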
retrieved_info = vector_retriever.query(query="What is generative AI?", top_k=1)
print(retrieved_info)
Expected result
[{'similarity score': '0.8538218656447916', 'content path': 'Generative AI refers to artificial intelligence systems that can create new content, including text,', 'metadata': {'piece_num': 1}, 'extra_info': {}, 'text': 'Generative AI refers to artificial intelligence systems that can create new content, including text, images, audio, and video. Examples include ChatGPT for text generation, DALL-E for image creation, and various AI tools for creative applications.'}]
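In the output above, the score arrives as a string under the key 'similarity score' and the matched passage under 'text'. The sketch below assumes that shape and uses a hypothetical helper, best_match, to convert the score to a float and pick the highest-scoring entry. Entries without a score are skipped.

```python
def best_match(results):
    """Return (score, text) for the highest-scoring entry, or None.

    Assumes each entry is a dict shaped like the expected result above,
    with a string 'similarity score' and a 'text' field.
    """
    scored = [item for item in results if "similarity score" in item]
    if not scored:
        return None
    top = max(scored, key=lambda item: float(item["similarity score"]))
    return float(top["similarity score"]), top["text"]


if __name__ == "__main__":
    # Shortened sample entries in the same shape as the expected result.
    sample = [
        {"similarity score": "0.41", "text": "Machine Learning is a subset of AI."},
        {"similarity score": "0.85", "text": "Generative AI creates new content."},
    ]
    print(best_match(sample))
```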