OpenAI

OpenAI is an artificial intelligence company that has developed multiple large language models, which excel in natural language understanding and generation. These models can generate text, answer questions, and engage in conversations. You can access these models through the OpenAI API.

seekdb provides capabilities for storing vector data, creating vector indexes, and performing embedding vector searches. You can use the OpenAI API to convert text into embedding vectors, store those vectors in seekdb, and then use seekdb's vector search capabilities to query relevant data.

Prerequisites

You have deployed seekdb.
Your environment has a MySQL database and account with read and write permissions.
You have installed Python 3.9 or later and the corresponding pip.
You have installed Poetry, pyseekdb, and the OpenAI SDK. For example, to install the Python packages with pip:
```
python3 -m pip install pyseekdb openai pandas cffi
```
You have prepared an OpenAI API key.

Step 1: Obtain the seekdb connection string

Contact the seekdb deployment team or administrator to obtain the database connection string. For example:

mysql -h$host -P$port -u$user_name -p$password -D$database_name

Parameter description:

$host: the IP address for connecting to seekdb.
$port: the port for connecting to seekdb. The default is 2881.
$user_name: the database connection account.
$password: the account password.
$database_name: the name of the database to be accessed.

Tip: The user needs to have the CREATE, INSERT, DROP, and SELECT permissions for the database.

Example:

mysql -hxxx.xxx.xxx.xxx -P2881 -utest_user001 -p****** -Dtest

Step 2: Register an LLM platform account

Obtain the OpenAI API key:

Log in to the OpenAI platform.
Click API Keys in the upper-right corner.
Click Create API Key.
Fill in the relevant information and click Create API Key.

Set the OpenAI API key and the seekdb connection information as environment variables:

For Unix-based systems (such as Ubuntu or macOS), run the following commands in the terminal:

export OPENAI_API_KEY='your-api-key'
export SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
export SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
export SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
export SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
export SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD

For Windows systems, run the following commands in the command prompt:

set OPENAI_API_KEY=your-api-key
set SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
set SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
set SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
set SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
set SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD

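Replace your-api-key with your actual OpenAI API key, and replace the value on the right-hand side of each seekdb variable with the connection details obtained in Step 1.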
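Before running the scripts in Step 3, it can help to confirm that these variables are visible to Python. The following is a minimal sketch, not part of the seekdb tooling: the helper names `missing_env_vars` and `seekdb_port` are hypothetical, and treating an empty value as missing is an assumption that may not suit accounts with an empty password.

```python
import os

# Variable names match the export/set commands above.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "SEEKDB_DATABASE_HOST",
    "SEEKDB_DATABASE_USER",
    "SEEKDB_DATABASE_DB_NAME",
    "SEEKDB_DATABASE_PASSWORD",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def seekdb_port(env=None, default=2881):
    """Return the configured port as an int, falling back to the default 2881."""
    env = os.environ if env is None else env
    return int(env.get("SEEKDB_DATABASE_PORT") or default)


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Environment looks complete; seekdb port:", seekdb_port())
```

If the check reports missing names, set them in the same shell session that will run the Python scripts, because `export` and `set` only affect the current session.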

Step 3: Store vector data in seekdb

Store vector data in seekdb

Prepare test data

Download the CSV file containing 1,000 rows of food review data. The last column already contains the embedding vector for each row, so you do not need to recalculate the vectors. If you prefer to recompute the embedding column (that is, the vector column) yourself, use the following code to generate a new CSV file.

from openai import OpenAI
import pandas as pd
input_datapath = "./fine_food_reviews.csv"
client = OpenAI()
# The text-embedding-ada-002 embedding model is used here, which can be adjusted as needed
def embedding_text(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = client.embeddings.create(input=text, model=model)
    return res.data[0].embedding

df = pd.read_csv(input_datapath, index_col=0)
# The actual generation will take a few minutes, with OpenAI Embedding API called row by row
df["embedding"] = df.combined.apply(embedding_text)
output_datapath = './fine_food_reviews_self_embeddings.csv'
df.to_csv(output_datapath)
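Calling the Embeddings API once per row is the slowest part of the script above. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so one possible speed-up is to send the rows in batches. The sketch below assumes a batch size of 100, which is an arbitrary choice, and the helper names `chunked` and `embed_in_batches` are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(client, texts, model="text-embedding-ada-002", batch_size=100):
    """Embed `texts` with one API call per batch instead of one per row.

    `client` is expected to be an openai.OpenAI instance.
    """
    vectors = []
    for batch in chunked(texts, batch_size):
        res = client.embeddings.create(input=batch, model=model)
        # Each returned item carries the index of its input within the batch;
        # sort on it so the vectors line up with the input order.
        ordered = sorted(res.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors


# Possible usage with the DataFrame from the script above:
#   df["embedding"] = embed_in_batches(client, df.combined.tolist())
```

Very long reviews may still hit per-request token limits, so a smaller batch size may be needed for other datasets.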

Run the following script to insert the test data into seekdb. The script must be in the same directory as the test data.

import os,csv,json
import pyseekdb
from pyseekdb import HNSWConfiguration

ids = []
embeddings = []
documents = []
metadatas = []
file_name = "fine_food_reviews_self_embeddings.csv"
file_path = os.path.join("./", file_name)
# Open and read the CSV file.
with open(file_path, mode='r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    headers = next(csvreader)
    print("Headers:", headers)
    
    for i, row in enumerate(csvreader):
        if not row or len(row) < 9:
            continue
            
        ids.append(row[0])
        embeddings.append(json.loads(row[8]))
        documents.append(row[6])
        metadata = {
            "product_id": str(row[1]),
            "user_id": str(row[2]), 
            "score": str(row[3]),
            "summary": str(row[4]),
            "n_tokens": str(row[7])
        }
        metadatas.append(metadata)


# Connect to seekdb by using pyseekdb
client = pyseekdb.Client(
    host=os.getenv('SEEKDB_DATABASE_HOST'), 
    port=int(os.getenv('SEEKDB_DATABASE_PORT', 2881)), 
    database=os.getenv('SEEKDB_DATABASE_DB_NAME'), 
    user=os.getenv('SEEKDB_DATABASE_USER'), 
    password=os.getenv('SEEKDB_DATABASE_PASSWORD')
)

table_name = 'fine_food_reviews'
config = HNSWConfiguration(dimension=1536, distance='cosine')  
collection = client.create_collection(
    name=table_name,
    configuration=config,
    embedding_function=None
)

# Insert 100 rows each time.
batch_size = 100
total_records = len(ids)

for i in range(0, total_records, batch_size):
    end_idx = min(i + batch_size, total_records)
    batch_ids = ids[i:end_idx]
    batch_embeddings = embeddings[i:end_idx]
    batch_documents = documents[i:end_idx]
    batch_metadatas = metadatas[i:end_idx]
    
    try:
        collection.add(
            ids=batch_ids,
            embeddings=batch_embeddings,
            documents=batch_documents,
            metadatas=batch_metadatas
        )
        print(f"Batch {i//batch_size + 1} inserted successfully!")
    except Exception as e:
        print(f"Batch {i//batch_size + 1} insertion failed: {e}")
        break
else:
    # The else branch of a for loop runs only when no batch failed (no break).
    print("All data insertion completed!")
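The insertion script assumes that every embedding cell parses into a vector whose length matches the `dimension=1536` configured for the collection. A malformed row would otherwise surface only as a failed batch. The sketch below shows one way to validate each cell while reading the CSV; `parse_embedding` is a hypothetical helper, not a pyseekdb function.

```python
import json

# Matches HNSWConfiguration(dimension=1536) and the output size of text-embedding-ada-002.
EXPECTED_DIMENSION = 1536


def parse_embedding(cell, expected_dimension=EXPECTED_DIMENSION):
    """Parse a JSON-encoded vector and check that it has the expected length."""
    vector = json.loads(cell)
    if not isinstance(vector, list):
        raise ValueError("embedding cell is not a JSON array")
    if len(vector) != expected_dimension:
        raise ValueError(
            f"expected {expected_dimension} dimensions, got {len(vector)}"
        )
    return [float(x) for x in vector]


# Possible usage inside the CSV loop, in place of json.loads(row[8]):
#   embeddings.append(parse_embedding(row[8]))
```

Raising early identifies the offending row, whereas a dimension mismatch discovered during insertion only reports the batch number.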

Query seekdb data

Save the following Python script as openAIQuery.py.

import os
import sys
import pyseekdb
from openai import OpenAI

# Read the query text from the first command-line argument.
if len(sys.argv) != 2:
    print("Usage: python3 openAIQuery.py '<query text>'")
    sys.exit(1)
queryStatement = sys.argv[1]

# Connect to seekdb by using pyseekdb
client = pyseekdb.Client(
    host=os.getenv('SEEKDB_DATABASE_HOST'), 
    port=int(os.getenv('SEEKDB_DATABASE_PORT', 2881)), 
    database=os.getenv('SEEKDB_DATABASE_DB_NAME'), 
    user=os.getenv('SEEKDB_DATABASE_USER'), 
    password=os.getenv('SEEKDB_DATABASE_PASSWORD')
)

openAIclient = OpenAI()
# Define the function for generating text vectors.
def generate_embeddings(text, model="text-embedding-ada-002"):
    # For more information about how to create embedding vectors, see https://community.openai.com/t/embeddings-api-documentation-needs-to-updated/475663.
    res = openAIclient.embeddings.create(input=text, model=model)
    return res.data[0].embedding

def query_ob(query, tableName, top_k=1):
    query_embedding = generate_embeddings(query)
    collection = client.get_collection(name=tableName)
    res = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k
    )
    print('- The Most Relevant Document and Its Distance to the Query:')
    for i, (doc_id, document, distance) in enumerate(zip(
        res['ids'][0], 
        res['documents'][0], 
        res['distances'][0]
    )):
        print(f'  - ID: {doc_id}')
        print(f'    content: {document}')
        print(f'    distance: {distance:.6f}')

# Specify the table name.
table_name = 'fine_food_reviews'
query_ob(queryStatement,table_name,1)

Run the script with a query string to retrieve the most relevant review:

python3 openAIQuery.py 'pet food'

The expected result is as follows:

- The Most Relevant Document and Its Distance to the Query:
  - ID: 818
    content: Title: Good food; Content: The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.
    distance: 0.159281
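The collection was created with distance='cosine', so a smaller distance means a closer match. Assuming seekdb follows the common convention of reporting cosine distance as 1 minus cosine similarity (this page does not confirm the exact definition), a distance of 0.159281 would correspond to a similarity of about 0.84. The following sketch computes that quantity in plain Python for illustration only; `cosine_distance` is a hypothetical helper and is not how seekdb evaluates the index.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Identical directions give 0, orthogonal vectors give 1, opposite directions give 2.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))
```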
