Qwen
Qwen is a large language model developed by Alibaba Cloud that can understand and analyze user input. You can call the Qwen API service through the Model Experience Center of Alibaba Cloud.
seekdb provides vector storage, vector indexes, and embedding-based vector search. You can use the Qwen API to generate embeddings, store the vectorized data in seekdb, and then use seekdb's vector search capabilities to query relevant data.
Prerequisites
- You have deployed seekdb.
- Your environment contains a MySQL database and account, and the database account has read and write permissions.
- You have installed Python 3.9 or later and the corresponding pip.
- You have installed poetry, seekdb, and the DashScope SDK.
  python3 -m pip install pyseekdb dashscope pandas cffi
- You have prepared the Qwen API key.
Step 1: Obtain the seekdb connection string
Contact the seekdb deployment personnel or administrator to obtain the corresponding database connection string, for example:
mysql -h$host -P$port -u$user_name -p$password -D$database_name
Parameters:
- $host: the IP address for connecting to seekdb.
- $port: the port for connecting to seekdb, which is 2881 by default.
- $database_name: the name of the database to be accessed.
  Tip: The user for the connection must have the CREATE, INSERT, DROP, and SELECT permissions on the database.
- $user_name: the database connection account.
- $password: the account password.
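The mapping between the connection string and the individual parameters can be sketched in a few lines of Python. This is a minimal illustration, not part of seekdb or pyseekdb: the helper name `parse_connection_string` and the `FLAG_TO_ENV` table are hypothetical, the sample host, user, password, and database values are placeholders, and the sketch only handles the attached form shown above (`-h$host`, not `-h $host`).

```python
import shlex

# Hypothetical mapping from mysql client flags to the environment variable
# names that the scripts in this guide read. Adjust it to your own setup.
FLAG_TO_ENV = {
    "-h": "SEEKDB_DATABASE_HOST",
    "-P": "SEEKDB_DATABASE_PORT",
    "-u": "SEEKDB_DATABASE_USER",
    "-p": "SEEKDB_DATABASE_PASSWORD",
    "-D": "SEEKDB_DATABASE_DB_NAME",
}

def parse_connection_string(command):
    """Split a `mysql -h... -P...` command into named settings.

    Only flags with an attached value (for example `-P2881`) are recognized.
    """
    settings = {}
    for token in shlex.split(command)[1:]:  # skip the leading "mysql"
        flag, value = token[:2], token[2:]
        if flag in FLAG_TO_ENV and value:
            settings[FLAG_TO_ENV[flag]] = value
    return settings

# Placeholder values for illustration only.
example = "mysql -h127.0.0.1 -P2881 -uroot -psecret -Dtest"
settings = parse_connection_string(example)
for name, value in settings.items():
    print(f"{name}={value}")
```

The printed `NAME=value` pairs can then be copied into the environment variable commands of the next step.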
Step 2: Configure the Qwen API key and seekdb connection information in the environment variables
For Unix-based systems (such as Ubuntu or macOS), run the following commands in the terminal:
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
export SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
export SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
export SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
export SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
export SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
For Windows, run the following commands in Command Prompt:
set DASHSCOPE_API_KEY=YOUR_DASHSCOPE_API_KEY
set SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
set SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
set SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
set SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
set SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
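Before running the scripts in the following steps, it can help to confirm that the variables are actually visible to Python. The sketch below is an optional check and not part of the original scripts: the helper names `missing_variables` and `seekdb_port` are hypothetical, and the password is deliberately left out of the required list on the assumption that some test deployments use an empty password.

```python
import os

# Variables the scripts in this guide read with os.getenv().
# SEEKDB_DATABASE_PASSWORD is omitted on the assumption that it may be empty.
REQUIRED_VARS = [
    "DASHSCOPE_API_KEY",
    "SEEKDB_DATABASE_HOST",
    "SEEKDB_DATABASE_USER",
    "SEEKDB_DATABASE_DB_NAME",
]

def missing_variables(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

def seekdb_port(env=None, default=2881):
    """Return the configured port, falling back to the default 2881."""
    env = os.environ if env is None else env
    return int(env.get("SEEKDB_DATABASE_PORT") or default)

if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required variables are set; port =", seekdb_port())
```

Both helpers accept a plain dictionary in place of `os.environ`, which makes the check easy to try without changing your shell environment.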
Step 3: Store the vector data in seekdb
- Prepare the test data.
  Download the CSV file containing 1,000 rows of fine food reviews. The last column in the CSV file contains the vector values, so you do not need to compute the vectors. You can also use the following code to recompute the embedding column (vector column) and generate a new CSV file.
import dashscope
import pandas as pd
from dashscope import TextEmbedding

input_datapath = "./fine_food_reviews.csv"

# Here, the text_embedding_v1 embedding model is used. You can adjust it as needed.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(model=TextEmbedding.Models.text_embedding_v1, input=text)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]

df = pd.read_csv(input_datapath, index_col=0)
# The actual generation process will take a few minutes. The embeddings are generated by calling the Qwen Embedding API row by row.
df["embedding"] = df.combined.apply(generate_embeddings)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
- Run the following script to insert the test data into seekdb. Make sure the script is in the same directory as the test data.
import csv
import json
import os

import pyseekdb
from pyseekdb import HNSWConfiguration

ids = []
embeddings = []
documents = []
metadatas = []

file_name = "fine_food_reviews_self_embeddings.csv"
file_path = os.path.join("./", file_name)

# Open and read the CSV file.
with open(file_path, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    headers = next(csvreader)
    print("Headers:", headers)
    for i, row in enumerate(csvreader):
        # Skip rows that do not contain all nine columns.
        if not row or len(row) < 9:
            print(f"Skipping row {i+2}: incomplete data")
            continue
        ids.append(row[0])
        # The last column stores the vector as a JSON-style list.
        embeddings.append(json.loads(row[8]))
        documents.append(row[6])
        metadata = {
            "product_id": str(row[1]),
            "user_id": str(row[2]),
            "score": str(row[3]),
            "summary": str(row[4]),
            "n_tokens": str(row[7])
        }
        metadatas.append(metadata)

# Connect to seekdb by using pyseekdb.
client = pyseekdb.Client(
    host=os.getenv('SEEKDB_DATABASE_HOST'),
    port=int(os.getenv('SEEKDB_DATABASE_PORT', 2881)),
    database=os.getenv('SEEKDB_DATABASE_DB_NAME'),
    user=os.getenv('SEEKDB_DATABASE_USER'),
    password=os.getenv('SEEKDB_DATABASE_PASSWORD')
)

table_name = 'fine_food_reviews'
config = HNSWConfiguration(dimension=1536, distance='cosine')
collection = client.create_collection(
    name=table_name,
    configuration=config,
    embedding_function=None
)

# Insert 100 rows each time.
batch_size = 100
total_records = len(ids)
for i in range(0, total_records, batch_size):
    end_idx = min(i + batch_size, total_records)
    batch_ids = ids[i:end_idx]
    batch_embeddings = embeddings[i:end_idx]
    batch_documents = documents[i:end_idx]
    batch_metadatas = metadatas[i:end_idx]
    try:
        collection.add(
            ids=batch_ids,
            embeddings=batch_embeddings,
            documents=batch_documents,
            metadatas=batch_metadatas
        )
        print(f"Batch {i//batch_size + 1} inserted successfully!")
    except Exception as e:
        print(f"Batch {i//batch_size + 1} insertion failed: {e}")
        break
else:
    # Runs only if no batch failed.
    print("All data insertion completed!")
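The insertion script passes the last CSV column through json.loads and declares the collection with dimension=1536, so a malformed or wrongly sized vector would only surface as a failed batch. As an optional safeguard, each cell could be checked before it is appended. The sketch below is a hedged illustration: the helper name `parse_embedding` is hypothetical and is not part of pyseekdb or the original script.

```python
import json

# Matches HNSWConfiguration(dimension=1536) in the insertion script.
EXPECTED_DIMENSION = 1536

def parse_embedding(raw, expected_dimension=EXPECTED_DIMENSION):
    """Parse a JSON-encoded vector and check its length.

    Returns a list of floats, or None if the cell is malformed
    or has the wrong number of dimensions.
    """
    try:
        vector = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(vector, list) or len(vector) != expected_dimension:
        return None
    for value in vector:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
    return [float(value) for value in vector]

# Demonstration with 3-dimensional toy vectors.
print(parse_embedding("[0.1, 0.2, 0.3]", expected_dimension=3))
print(parse_embedding("[0.1, 0.2]", expected_dimension=3))
print(parse_embedding("not a vector", expected_dimension=3))
```

In the reading loop, a row whose embedding parses to None could be skipped in the same way as rows with incomplete data.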
Step 4: Query the seekdb database
- Save the following Python script as query.py.
import os
import sys

import dashscope
import pyseekdb
from dashscope import TextEmbedding

# Obtain the query statement from the command line.
if len(sys.argv) != 2:
    print("Enter a query statement.")
    sys.exit(1)
queryStatement = sys.argv[1]

# Connect to seekdb by using pyseekdb.
client = pyseekdb.Client(
    host=os.getenv('SEEKDB_DATABASE_HOST'),
    port=int(os.getenv('SEEKDB_DATABASE_PORT', 2881)),
    database=os.getenv('SEEKDB_DATABASE_DB_NAME'),
    user=os.getenv('SEEKDB_DATABASE_USER'),
    password=os.getenv('SEEKDB_DATABASE_PASSWORD')
)

# Define the function for generating text vectors.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(model=TextEmbedding.Models.text_embedding_v1, input=text)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]

# Embed the query text and print the top_k closest documents.
def query_ob(query, tableName, top_k=1):
    query_embedding = generate_embeddings(query)
    collection = client.get_collection(name=tableName)
    res = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k
    )
    print('- The Most Relevant Document and Its Distance to the Query:')
    for doc_id, document, distance in zip(
        res['ids'][0],
        res['documents'][0],
        res['distances'][0]
    ):
        print(f' - ID: {doc_id}')
        print(f' content: {document}')
        print(f' distance: {distance:.6f}')

# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement, table_name, 1)
- Enter a question and get the relevant answer.
python3 query.py 'pet food'
The expected result is as follows:
- The Most Relevant Document and Its Distance to the Query:
- ID: 444
content: Title: Healthy Dog Food; Content: This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.
distance: 0.509108
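To help interpret the distance value, the following sketch computes a cosine distance in plain Python. It assumes that the 'cosine' option used when the collection was created follows the common definition, 1 minus the cosine similarity, which this guide does not confirm for seekdb. Under that assumption, 0 means the two vectors point in the same direction and larger values mean less similar text. The function `cosine_distance` is illustrative only and is not a seekdb or pyseekdb API.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Toy 2-dimensional vectors; real embeddings in this guide have 1536 dimensions.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the measure depends only on direction, scaling either vector leaves the distance unchanged.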