Qwen
Tongyi Qianwen (Qwen) is a large language model (LLM) developed by Alibaba Cloud for interpreting and analyzing user inputs. You can call the Qwen API through Alibaba Cloud Model Studio.
seekdb offers features such as vector storage, vector indexing, and embedding-based vector search. By using Qwen's API, you can convert data into vectors, store these vectors in seekdb, and then take advantage of seekdb's vector search capabilities to find relevant data.
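To make embedding-based search concrete, here is a minimal sketch in plain Python. It assumes toy three-dimensional vectors in place of real Qwen embeddings, and the names `l2_distance` and `nearest` are illustrative helpers, not part of seekdb or DashScope. seekdb answers the same kind of question with a vector index instead of the full scan shown here.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, rows, top_k=1):
    """Return the top_k (text, vector) rows closest to query_vec using an exact scan."""
    return sorted(rows, key=lambda row: l2_distance(query_vec, row[1]))[:top_k]

# Toy data: hypothetical 3-dimensional "embeddings" used only for illustration.
rows = [
    ("dog treats", [0.9, 0.1, 0.0]),
    ("green tea", [0.0, 0.8, 0.2]),
    ("cat food", [0.8, 0.2, 0.1]),
]

# The query vector stands in for the embedding of a user question.
for text, _ in nearest([1.0, 0.0, 0.0], rows, top_k=2):
    print(text)
```

The smaller the distance, the more similar the stored text is assumed to be to the query.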
Prerequisites
-
You have deployed seekdb.
-
You have an existing MySQL database and account available in your environment, and the database account has been granted read and write privileges.
-
You have installed Python 3.9 or later and pip.
-
You have installed Poetry, Pyobvector, and DashScope SDK. The installation commands are as follows:
pip install poetry
pip install pyobvector
pip install dashscope
-
You have obtained the Qwen API key.
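If you want to confirm the Python-side prerequisites before continuing, a small check like the following may help. It is a sketch that uses only the standard library; `missing_modules` is a hypothetical helper name, and the check assumes the packages are importable as `pyobvector` and `dashscope`, which are the import names used by the scripts in this topic.

```python
import importlib.util
import sys

def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if sys.version_info < (3, 9):
    print("Python 3.9 or later is required.")

# Import names used by the scripts in this topic.
missing = missing_modules(["pyobvector", "dashscope"])
print("Missing packages:", ", ".join(missing) if missing else "none")
```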
Step 1: Obtain the connection string of seekdb
Contact the seekdb deployment engineer or administrator to obtain the connection string of seekdb, for example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
Parameters:
-
$host: The IP address for connecting to seekdb.
-
$port: The port number for connecting to seekdb. The default is 2881.
-
$database_name: The name of the database to be accessed.
Tip: The user for connection must have the CREATE, INSERT, DROP, and SELECT privileges on the database.
-
$user_name: The database account.
-
$password: The password of the account.
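The scripts later in this topic pass these same values to `ObVecClient` as `uri`, `user`, `password`, and `db_name`, and replace `@` with `%40` in the username and password. The mapping can be sketched as follows. `to_client_kwargs` is a hypothetical helper, and the host, account, and password shown are placeholders.

```python
def to_client_kwargs(host, port, user_name, password, database_name):
    """Map obclient-style connection values to the keyword arguments used with ObVecClient."""
    def escape(value):
        # Replace @ with %40, as the connection comments in this topic require.
        return value.replace("@", "%40")

    return {
        "uri": f"{host}:{port}",
        "user": escape(user_name),
        "password": escape(password),
        "db_name": database_name,
    }

# Placeholder values for illustration only.
kwargs = to_client_kwargs("127.0.0.1", 2881, "root@test", "p@ss", "test")
print(kwargs)
# With pyobvector installed, the client would then be created as:
# client = ObVecClient(**kwargs)
```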
Step 2: Configure the environment variable for the Qwen API key
For a Unix-based system (such as Ubuntu or macOS), run the following command in the terminal:
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
For Windows, run the following command in the command prompt:
set DASHSCOPE_API_KEY=YOUR_DASHSCOPE_API_KEY
You must replace YOUR_DASHSCOPE_API_KEY with the actual Qwen API key.
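The scripts in this topic never pass the key explicitly, so they depend on this variable being visible to the Python process. A minimal sketch of an early check is shown below. `require_api_key` is a hypothetical helper, not part of the DashScope SDK.

```python
import os

def require_api_key(env=os.environ):
    """Return the DashScope API key from the environment, or raise if it is missing."""
    key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is not set; export it before running the scripts."
        )
    return key

# Usage: call require_api_key() at the top of a script to fail fast
# with a clear message instead of a failed API call.
```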
Step 3: Store the vector data in seekdb
-
Prepare the test data. Download the CSV file that already contains the vectorized data. This CSV file includes 1,000 food review entries, and the last column contains the vector values. Therefore, you do not need to calculate the vectors yourself. If you want to recalculate the embeddings for the "embedding" column (the vector column), you can use the following code to generate a new CSV file:
import dashscope
import pandas as pd

input_datapath = "./fine_food_reviews.csv"

# Here the text_embedding_v1 model is used. You can change the model as needed.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(model=dashscope.TextEmbedding.Models.text_embedding_v1, input=text)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]

df = pd.read_csv(input_datapath, index_col=0)
# It takes a few minutes to generate the CSV file by calling the Tongyi Qianwen Embedding API row by row.
df["embedding"] = df.combined.apply(generate_embeddings)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
-
Execute the following script to insert the test data into seekdb. The directory where the script is located must be the same as the directory where the test data is stored.
import os
import csv
import json
from pyobvector import *
from sqlalchemy import Column, Integer, String

# Use pyobvector to connect to seekdb. If @ is in the username or password, replace it with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

# The test dataset is prepared in advance and has been vectorized. By default, it is placed in the same directory as the Python script. If you have vectorized it yourself, replace it with the corresponding file.
file_name = "fine_food_reviews.csv"
file_path = os.path.join("./", file_name)

# Define the columns. The vectorized column is placed in the last field.
cols = [
    Column('id', Integer, primary_key=True, autoincrement=False),
    Column('product_id', String(256), nullable=True),
    Column('user_id', String(256), nullable=True),
    Column('score', Integer, nullable=True),
    Column('summary', String(2048), nullable=True),
    Column('text', String(8192), nullable=True),
    Column('combined', String(8192), nullable=True),
    Column('n_tokens', Integer, nullable=True),
    Column('embedding', VECTOR(1536))
]

# Table name
table_name = 'fine_food_reviews'

# If the table does not exist, create it.
if not client.check_table_exists(table_name):
    client.create_table(table_name, columns=cols)
    # Create an index for the vector column.
    client.create_index(
        table_name=table_name,
        is_vec_index=True,
        index_name='vidx',
        column_names=['embedding'],
        vidx_params='distance=l2, type=hnsw, lib=vsag',
    )

# Open and read the CSV file.
with open(file_path, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the header row.
    headers = next(csvreader)
    print("Headers:", headers)
    batch = []  # Store data and insert it into the database every 10 rows.
    for row in csvreader:
        # The CSV file has 9 fields: id, product_id, user_id, score, summary, text, combined, n_tokens, embedding.
        if not row:
            continue  # Skip empty lines instead of stopping the import.
        food_review_line = {
            'id': row[0], 'product_id': row[1], 'user_id': row[2], 'score': row[3],
            'summary': row[4], 'text': row[5], 'combined': row[6], 'n_tokens': row[7],
            'embedding': json.loads(row[8]),
        }
        batch.append(food_review_line)
        # Insert data every 10 rows.
        if len(batch) == 10:
            client.insert(table_name, batch)
            batch = []  # Clear the cache.
    # Insert the remaining rows (if any).
    if batch:
        client.insert(table_name, batch)

# Check the data in the table to ensure that all data has been inserted.
count_sql = f"select count(*) from {table_name};"
cursor = client.perform_raw_text_sql(count_sql)
result = cursor.fetchone()
print(f"Total number of imported data: {result[0]}")
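The row handling in the import script can be tried without a database connection. The sketch below parses a few made-up CSV rows into records and groups them into batches, which is the shape of data passed to `client.insert`. `row_to_record` and `batches` are hypothetical helpers, the sample rows are invented, and converting `id`, `score`, and `n_tokens` to integers is an optional extra step that the script above does not perform.

```python
import csv
import io
import json

# Column order described in the import script.
FIELDS = ["id", "product_id", "user_id", "score", "summary",
          "text", "combined", "n_tokens", "embedding"]
INT_FIELDS = {"id", "score", "n_tokens"}

def row_to_record(row):
    """Convert one CSV row (list of strings) into a dict keyed by column name."""
    record = dict(zip(FIELDS, row))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # The embedding is stored as a JSON array string in the last column.
    record["embedding"] = json.loads(record["embedding"])
    return record

def batches(records, size=10):
    """Yield lists of at most `size` records, including a final partial batch."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Invented sample data with 2-dimensional vectors, for illustration only.
sample = io.StringIO(
    'id,product_id,user_id,score,summary,text,combined,n_tokens,embedding\n'
    '0,B001,U1,5,Great,Tasty,"Title: Great; Content: Tasty",4,"[0.1, 0.2]"\n'
    '1,B002,U2,3,Okay,Fine,"Title: Okay; Content: Fine",4,"[0.3, 0.4]"\n'
    '2,B003,U3,1,Bad,Stale,"Title: Bad; Content: Stale",4,"[0.5, 0.6]"\n'
)
reader = csv.reader(sample)
next(reader)  # Skip the header row.
records = [row_to_record(row) for row in reader if row]

for batch in batches(records, size=2):
    print([record["id"] for record in batch])
```

Each yielded batch could be handed to `client.insert(table_name, batch)` in place of the manual counter.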
Step 4: Query seekdb data
-
Save the following Python script as query.py.
import sys
from pyobvector import *
from sqlalchemy import func
import dashscope

# Get command-line arguments
if len(sys.argv) != 2:
    print("Please enter a query statement.")
    sys.exit(1)
queryStatement = sys.argv[1]

# Use pyobvector to connect to seekdb. If the username or password contains @, replace it with %40.
client = ObVecClient(uri="host:port", user="username", password="****", db_name="test")

# Define a function to generate text vectors.
def generate_embeddings(text):
    rsp = dashscope.TextEmbedding.call(model=dashscope.TextEmbedding.Models.text_embedding_v1, input=text)
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    return embeddings if isinstance(text, list) else embeddings[0]

def query_ob(query, tableName, vector_name="embedding", top_k=1):
    embedding = generate_embeddings(query)
    # Execute approximate nearest neighbor search.
    res = client.ann_search(
        table_name=tableName,
        vec_data=embedding,
        vec_column_name=vector_name,
        distance_func=func.l2_distance,
        topk=top_k,
        output_column_names=['combined']
    )
    for row in res:
        print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": "))

# Table name
table_name = 'fine_food_reviews'
query_ob(queryStatement, table_name, 'embedding', 1)
-
Enter a question and obtain the related answer.
python3 query.py 'pet food'
The expected result is as follows:
This is so good!: I purchased this after my sister sent a small bag to me in a gift box. I loved it so much I wanted to find it to buy for myself and keep it around. I always look on Amazon because you can find everything here and true enough, I found this wonderful candy. It is nice to keep in your purse for when you are out and about and get a dry throat or a tickle in the back of your throat. It is also nice to have in a candy dish at home for guests to try.
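The shape of this output comes from the two `replace` calls in `query_ob`. Judging from those calls, the `combined` column appears to hold text of the form `Title: ...; Content: ...`; that layout is an inference, not something the dataset description states. A standalone sketch of the formatting step, with a hypothetical helper `format_review` and a shortened sample string:

```python
def format_review(combined):
    """Strip the "Title: " prefix and turn "; Content: " into ": "."""
    return combined.replace("Title: ", "").replace("; Content: ", ": ")

# Shortened sample in the assumed "Title: ...; Content: ..." layout.
sample = "Title: This is so good!; Content: I purchased this after my sister sent a small bag."
print(format_review(sample))
# prints: This is so good!: I purchased this after my sister sent a small bag.
```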