OpenAI
OpenAI is an artificial intelligence company that has developed multiple large language models, which excel in natural language understanding and generation. These models can generate text, answer questions, and engage in conversations. You can access these models through their API.
seekdb provides capabilities for storing vector data, creating vector indexes, and performing embedding vector searches. You can use the OpenAI API to generate embedding vectors, store them in seekdb, and then use seekdb's vector search capabilities to query relevant data.
Prerequisites
-
You have deployed seekdb.
-
Your environment has a MySQL database and account with read and write permissions.
-
You have installed Python 3.9 or later and the corresponding pip.
-
You have installed Poetry, pyseekdb, and the OpenAI SDK.
python3 -m pip install pyseekdb openai pandas cffi
-
You have prepared an OpenAI API key.
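One way to confirm the Python-side prerequisites is sketched below. It checks the interpreter version and whether the packages from the pip command can be found on the import path. The helper names are illustrative only, and the sketch assumes the import names match the package names used in this guide (pyseekdb, openai, pandas).

```python
import importlib.util
import sys

# Import names assumed to match the packages installed with pip above.
REQUIRED_MODULES = ["pyseekdb", "openai", "pandas"]


def python_is_supported(version_info=sys.version_info, minimum=(3, 9)):
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.9 or later is required.")
    for name in missing_modules(REQUIRED_MODULES):
        print(f"Missing package: {name}")
```

If the script prints nothing, the interpreter and packages appear to be in place. It does not verify the seekdb deployment or the API key.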
Step 1: Obtain the seekdb connection string
Contact the seekdb deployment team or administrator to obtain the database connection string. For example:
mysql -h$host -P$port -u$user_name -p$password -D$database_name
Parameter description:
-
$host: the IP address for connecting to seekdb.
-
$port: the port for connecting to seekdb. The default is 2881.
-
$database_name: the name of the database to be accessed.
Tip: The user needs to have the CREATE, INSERT, DROP, and SELECT permissions for the database.
-
$user_name: the database connection account.
-
$password: the account password.
Example:
mysql -hxxx.xxx.xxx.xxx -P2881 -utest_user001 -p****** -Dtest
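The parts of the connection string correspond one-to-one to the parameters described above. As a minimal sketch, the following standard-library code splits a mysql command of this shape into those parts. parse_mysql_command is a hypothetical helper written for this guide, not part of seekdb or pyseekdb, and the values in the usage lines are placeholders.

```python
import argparse
import shlex


def parse_mysql_command(command):
    """Split a `mysql -h... -P... -u... -p... -D...` command into a dict."""
    # add_help=False frees -h, which means "host" here rather than "help".
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-h", dest="host")
    parser.add_argument("-P", dest="port", type=int, default=2881)
    parser.add_argument("-u", dest="user")
    parser.add_argument("-p", dest="password")
    parser.add_argument("-D", dest="database")

    tokens = shlex.split(command)
    # Drop the program name if it is present.
    if tokens and tokens[0] == "mysql":
        tokens = tokens[1:]
    return vars(parser.parse_args(tokens))


if __name__ == "__main__":
    # Placeholder values; substitute the string you received.
    parts = parse_mysql_command("mysql -h127.0.0.1 -P2881 -utest_user001 -psecret -Dtest")
    for key, value in parts.items():
        # Avoid echoing the password to the terminal.
        print(key, "=", "******" if key == "password" else value)
```

The port falls back to 2881 when the string omits -P, which matches the default stated in the parameter description.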
Step 2: Register an LLM platform account
Obtain the OpenAI API key:
-
Log in to the OpenAI platform.
-
Click API Keys in the upper-right corner.
-
Click Create API Key.
-
Fill in the relevant information and click Create API Key.
Configure the OpenAI API key and seekdb connection information in the environment variables:
-
For Unix-based systems (such as Ubuntu or macOS), run the following commands in the terminal:
export OPENAI_API_KEY='your-api-key'
export SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
export SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
export SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
export SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
export SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
-
For Windows systems, run the following commands in the command prompt:
set OPENAI_API_KEY=your-api-key
set SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
set SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
set SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
set SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
set SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
Make sure to replace your-api-key with your actual OpenAI API key, and replace the seekdb placeholder values with the connection information obtained in Step 1.
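Before running the scripts in the next step, it can help to confirm that the variables are set and no longer hold placeholder text. The following is a minimal sketch of such a check. find_config_problems is a hypothetical helper, and its placeholder rules (the literal your-api-key, a YOUR_ prefix, or a value equal to the variable name) are assumptions based on the sample commands above.

```python
import os

# Variables the scripts in this guide read. The port is treated as optional
# because the scripts fall back to 2881 when it is unset.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "SEEKDB_DATABASE_HOST",
    "SEEKDB_DATABASE_USER",
    "SEEKDB_DATABASE_DB_NAME",
    "SEEKDB_DATABASE_PASSWORD",
]
PLACEHOLDERS = {"your-api-key"}


def find_config_problems(env):
    """Return a list of problems found in a mapping of environment variables."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif value in PLACEHOLDERS or value.startswith("YOUR_") or value == name:
            problems.append(f"{name} still holds a placeholder value")
    port = env.get("SEEKDB_DATABASE_PORT")
    if port is not None and not port.isdigit():
        problems.append("SEEKDB_DATABASE_PORT must be an integer")
    return problems


if __name__ == "__main__":
    for problem in find_config_problems(os.environ):
        print(problem)
```

An empty result only means the variables look plausible. It does not prove that the key or the credentials are valid.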
Step 3: Store vector data in seekdb
Store vector data in seekdb
-
Prepare test data
Download the CSV file containing 1,000 rows of food review data. The last column already contains the embedding vectors, so you do not need to recompute them. If you prefer to recompute the embedding column (that is, the vector column) and generate a new CSV file, use the following code.
from openai import OpenAI
import pandas as pd
input_datapath = "./fine_food_reviews.csv"
client = OpenAI()
# The text-embedding-ada-002 embedding model is used here, which can be adjusted as needed
def embedding_text(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = client.embeddings.create(input=text, model=model)
    return res.data[0].embedding
df = pd.read_csv(input_datapath, index_col=0)
# The actual generation will take a few minutes, with OpenAI Embedding API called row by row
df["embedding"] = df.combined.apply(embedding_text)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
-
Run the following script to insert the test data into seekdb. The script must be in the same directory as the test data.
import os,csv,json
import pyseekdb
from pyseekdb import HNSWConfiguration
ids = []
embeddings = []
documents = []
metadatas = []
file_name = "fine_food_reviews_self_embeddings.csv"
file_path = os.path.join("./", file_name)
# Open and read the CSV file.
with open(file_path, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    headers = next(csvreader)
    print("Headers:", headers)
    for i, row in enumerate(csvreader):
        # Skip empty or incomplete rows.
        if not row or len(row) < 9:
            continue
        ids.append(row[0])
        # The embedding column holds a JSON-style list of floats.
        embeddings.append(json.loads(row[8]))
        documents.append(row[6])
        metadata = {
            "product_id": str(row[1]),
            "user_id": str(row[2]),
            "score": str(row[3]),
            "summary": str(row[4]),
            "n_tokens": str(row[7])
        }
        metadatas.append(metadata)
# Connect to seekdb by using pyseekdb
client = pyseekdb.Client(
    host=os.getenv('SEEKDB_DATABASE_HOST'),
    port=int(os.getenv('SEEKDB_DATABASE_PORT', 2881)),
    database=os.getenv('SEEKDB_DATABASE_DB_NAME'),
    user=os.getenv('SEEKDB_DATABASE_USER'),
    password=os.getenv('SEEKDB_DATABASE_PASSWORD')
)
table_name = 'fine_food_reviews'
config = HNSWConfiguration(dimension=1536, distance='cosine')
collection = client.create_collection(
    name=table_name,
    configuration=config,
    embedding_function=None
)
# Insert 100 rows each time.
batch_size = 100
total_records = len(ids)
for i in range(0, total_records, batch_size):
    end_idx = min(i + batch_size, total_records)
    batch_ids = ids[i:end_idx]
    batch_embeddings = embeddings[i:end_idx]
    batch_documents = documents[i:end_idx]
    batch_metadatas = metadatas[i:end_idx]
    try:
        collection.add(
            ids=batch_ids,
            embeddings=batch_embeddings,
            documents=batch_documents,
            metadatas=batch_metadatas
        )
        print(f"Batch {i//batch_size + 1} inserted successfully!")
    except Exception as e:
        print(f"Batch {i//batch_size + 1} insertion failed: {e}")
        break
else:
    # Runs only when no batch failed.
    print("All data insertion completed!")
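The collection above is created with dimension=1536, so every embedding read from the CSV file is expected to have 1,536 values. A minimal client-side sanity check that could run before the insertion loop is sketched below. check_records is a hypothetical helper, and the document does not state how seekdb itself reacts to mismatched dimensions or duplicate IDs, so treat this as a precaution rather than a description of the database's behavior.

```python
def check_records(ids, embeddings, dimension=1536):
    """Return a list of problems; an empty list means the data looks consistent."""
    problems = []
    if len(ids) != len(embeddings):
        problems.append(f"{len(ids)} ids but {len(embeddings)} embeddings")
    seen = set()
    for record_id, vector in zip(ids, embeddings):
        # Report IDs that appear more than once.
        if record_id in seen:
            problems.append(f"duplicate id {record_id}")
        seen.add(record_id)
        # Report vectors whose length differs from the index dimension.
        if len(vector) != dimension:
            problems.append(
                f"id {record_id}: expected {dimension} values, got {len(vector)}"
            )
    return problems


if __name__ == "__main__":
    # Tiny illustrative data set with a 3-dimensional "index".
    sample_ids = ["0", "1", "1"]
    sample_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8]]
    for problem in check_records(sample_ids, sample_vectors, dimension=3):
        print(problem)
```

In the insertion script, the same function could be called with the ids and embeddings lists after the CSV file has been read, stopping early if it returns anything.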
Query seekdb data
-
Save the following Python script as openAIQuery.py.
import os,sys
import pyseekdb
from openai import OpenAI
# Obtain command-line options.
if len(sys.argv) != 2:
    print("Enter a query statement.")
    sys.exit(1)
queryStatement = sys.argv[1]
# Connect to seekdb by using pyseekdb
client = pyseekdb.Client(
    host=os.getenv('SEEKDB_DATABASE_HOST'),
    port=int(os.getenv('SEEKDB_DATABASE_PORT', 2881)),
    database=os.getenv('SEEKDB_DATABASE_DB_NAME'),
    user=os.getenv('SEEKDB_DATABASE_USER'),
    password=os.getenv('SEEKDB_DATABASE_PASSWORD')
)
openAIclient = OpenAI()
# Define the function for generating text vectors.
def generate_embeddings(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = openAIclient.embeddings.create(input=text, model=model)
    return res.data[0].embedding
def query_ob(query, tableName, top_k=1):
    query_embedding = generate_embeddings(query)
    collection = client.get_collection(name=tableName)
    res = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k
    )
    print('- The Most Relevant Document and Its Distance to the Query:')
    for i, (doc_id, document, distance) in enumerate(zip(
        res['ids'][0],
        res['documents'][0],
        res['distances'][0]
    )):
        print(f' - ID: {doc_id}')
        print(f' content: {document}')
        print(f' distance: {distance:.6f}')
# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement,table_name,1)
-
Enter a question and get the relevant answer.
python3 openAIQuery.py 'pet food'
The expected result is as follows:
- The Most Relevant Document and Its Distance to the Query:
- ID: 818
content: Title: Good food; Content: The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.
distance: 0.159281
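The distance value can be read as how far the query embedding is from the stored document embedding: smaller means more similar. As a rough illustration, the sketch below computes cosine distance in plain Python, assuming the common definition of one minus cosine similarity. The document does not spell out the exact formula seekdb uses for distance='cosine', so the definition here is an assumption.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings in this guide have 1,536 values.
    print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # same direction
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Under this definition, vectors pointing the same way have a distance near 0, orthogonal vectors near 1, and opposite vectors near 2.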