OpenAI
OpenAI is an artificial intelligence company that has developed several large language models. These models excel at understanding and generating natural language, making them highly effective for tasks such as text generation, answering questions, and engaging in conversations. Access to these models is available through an API.
seekdb offers features such as vector storage, vector indexing, and embedding-based vector search. By using OpenAI's API, you can convert data into vectors, store these vectors in seekdb, and then take advantage of seekdb's vector search capabilities to find relevant data.
Prerequisites
-
You have deployed seekdb.
-
You have an existing MySQL database and account available in your environment, and the database account has been granted read and write privileges.
-
You have installed Python 3.9 or later and pip.
-
You have installed Poetry, Pyobvector, and OpenAI SDK. The installation commands are as follows:
python3 pip install poetry
python3 pip install pyobvector
python3 pip install openai -
You have obtained an OpenAI API key.
Step 1: Obtain the connection string of seekdb
Contact the seekdb deployment engineer or administrator to obtain the connection string of seekdb, for example:
obclient -h$host -P$port -u$user_name -p$password -D$database_name
Parameters:
-
$host: The IP address for connecting to seekdb. -
$port: The port number for connecting to seekdb. Default is2881. -
$database_name: The name of the database to be accessed.tipThe user for connection must have the
CREATE,INSERT,DROP, andSELECTprivileges on the database. -
$user_name: The database account. -
$password: The password of the account.
Here is an example:
obclient -hxxx.xxx.xxx.xxx -P2881 -utest_user001 -p****** -Dtest
Step 2: Register an LLM account
Obtain an OpenAI API key.
-
Log in to the OpenAI platform.
-
Click API Keys in the upper-right corner.
-
Click Create API Key.
-
Specify the required information and click Create API Key.
Specify the API key for the relevant environment variable.
-
For a Unix-based system such as Ubuntu or macOS, you can run the following command in a terminal:
export OPENAI_API_KEY='your-api-key' -
For a Windows system, you can run the following command in Command Prompt:
set OPENAI_API_KEY=your-api-key
You must replace your-api-key with the actual OpenAI API key.
Step 3: Store vector data in seekdb
Store vector data in seekdb
-
Prepare test data.
Download the CSV file that already contains the vectorized data. This CSV file includes 1,000 food review entries, and the last column contains the vector values. Therefore, you do not need to calculate the vectors yourself. If you want to recalculate the embeddings for the "embedding" column (the vector column), you can use the following code to generate a new CSV file:
from openai import OpenAI
import pandas as pd
input_datapath = "./fine_food_reviews.csv"
client = OpenAI()
# Here the text-embedding-ada-002 model is used. You can change the model as needed.
def embedding_text(text, model="text-embedding-ada-002"):
# For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
res = client.embeddings.create(input=text, model=model)
return res.data[0].embedding
df = pd.read_csv(input_datapath, index_col=0)
# It takes a few minutes to generate the CSV file by calling the OpenAI Embedding API row by row.
df["embedding"] = df.combined.apply(embedding_text)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath) -
Run the following script to insert the test data into seekdb. The script must be located in the same directory as the test data.
import os
import sys
import csv
import json
from pyobvector import *
from sqlalchemy import Column, Integer, String
# Connect to seekdb by using pyobvector and replace the at (@) sign in the username and password with %40, if any.
client = ObVecClient(uri="host:port", user="username",password="****",db_name="test")
# The test dataset has been vectorized and is stored in the same directory as the Python script by default. If you vectorize the dataset again, specify the new file.
file_name = "fine_food_reviews.csv"
file_path = os.path.join("./", file_name)
# Define columns. The last column is a vector column.
cols = [
Column('id', Integer, primary_key=True, autoincrement=False),
Column('product_id', String(256), nullable=True),
Column('user_id', String(256), nullable=True),
Column('score', Integer, nullable=True),
Column('summary', String(2048), nullable=True),
Column('text', String(8192), nullable=True),
Column('combined', String(8192), nullable=True),
Column('n_tokens', Integer, nullable=True),
Column('embedding', VECTOR(1536))
]
# Define the table name.
table_name = 'fine_food_reviews'
# If the table does not exist, create it.
if not client.check_table_exists(table_name):
client.create_table(table_name,columns=cols)
# Create an index on the vector column.
client.create_index(
table_name=table_name,
is_vec_index=True,
index_name='vidx',
column_names=['embedding'],
vidx_params='distance=l2, type=hnsw, lib=vsag',
)
# Open and read the CSV file.
with open(file_name, mode='r', newline='', encoding='utf-8') as csvfile:
csvreader = csv.reader(csvfile)
# Read the header line.
headers = next(csvreader)
print("Headers:", headers)
batch = [] # Store data by inserting 10 rows into the database each time.
for i, row in enumerate(csvreader):
# The CSV file contains nine columns: `id`, `product_id`, `user_id`, `score`, `summary`, `text`, `combined`, `n_tokens`, and `embedding`.
if not row:
break
food_review_line= {'id':row[0],'product_id':row[1],'user_id':row[2],'score':row[3],'summary':row[4],'text':row[5],\
'combined':row[6],'n_tokens':row[7],'embedding':json.loads(row[8])}
batch.append(food_review_line)
# Insert 10 rows each time.
if (i + 1) % 10 == 0:
client.insert(table_name,batch)
batch = [] # Clear the cache.
# Insert the rest rows, if any.
if batch:
client.insert(table_name,batch)
# Check the data in the table and make sure that all data has been inserted.
count_sql = f"select count(*) from {table_name};"
cursor = client.perform_raw_text_sql(count_sql)
result = cursor.fetchone()
print(f"Total number of inserted rows:{result[0]}")
Query seekdb data
-
Save the following Python script and name it as
openAIQuery.py.import os
import sys
import csv
import json
from pyobvector import *
from sqlalchemy import func
from openai import OpenAI
# Obtain command-line options.
if len(sys.argv) != 2:
print("Enter a query statement." )
sys.exit()
queryStatement = sys.argv[1]
# Connect to seekdb by using pyobvector and replace the at (@) sign in the username and password with %40, if any.
client = ObVecClient(uri="host:port", user="usename",password="****",db_name="test")
openAIclient = OpenAI()
# Define the function for generating text vectors.
def generate_embeddings(text, model="text-embedding-ada-002"):
# For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
res = openAIclient.embeddings.create(input=text, model=model)
return res.data[0].embedding
def query_ob(query, tableName, vector_name="embedding", top_k=1):
embedding = generate_embeddings(query)
# Perform an approximate nearest neighbor search (ANNS).
res = client.ann_search(
table_name=tableName,
vec_data=embedding,
vec_column_name=vector_name,
distance_func=func.l2_distance,
topk=top_k,
output_column_names=['combined']
)
for row in res:
print(str(row[0]).replace("Title: ", "").replace("; Content: ", ": "))
# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement,table_name,'embedding',1) -
Enter a question for an answer.
python3 openAIQuery.py 'pet food'The expected result is as follows:
Crack for dogs.: These thing are like crack for dogs. I am not sure of the make-up but the doggies sure love them.