
seekdb Vector Integration with Hugging Face

seekdb provides vector type storage, vector indexing, and embedding vector search capabilities. You can store vectorized data in seekdb for subsequent searches.

Hugging Face is an open-source machine learning platform that provides pre-trained models, datasets, and tools for developers to easily use and deploy AI models.

Prerequisites

  • You have deployed seekdb.

  • Your environment has a usable database and account, and the account has read and write permissions on the database.

  • Python 3.11 or later is installed.

  • Dependencies are installed.

    python3 -m pip install cffi pyseekdb requests datasets sentence-transformers

Step 1: Obtain Database Connection Information

Contact the seekdb deployment personnel or administrator to obtain the database connection string, for example:

mysql -h$host -P$port -u$user_name -p$password -D$database_name

Parameter Description:

  • $host: The IP address for connecting to seekdb.

  • $port: The port for connecting to seekdb, which is 2881 by default.

  • $database_name: The name of the database to access.

    tip

    The user needs to have the CREATE, INSERT, DROP, and SELECT permissions on the database.

  • $user_name: The database connection account.

  • $password: The account password.

Step 2: Build Your AI Assistant

Set Environment Variables

Obtain a Hugging Face API key and configure it, together with the seekdb connection information, as environment variables.

export SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
export SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
export SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
export SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
export SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
export HUGGING_FACE_API_KEY=YOUR_HUGGING_FACE_API_KEY
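Before running the sample, it can help to fail fast when a variable is missing. The following helper is a sketch, not part of the official sample, and the `require_env` name is hypothetical:

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Required environment variable {name} is not set")
    return value

# Usage (assumes the variables above have been exported):
# host = require_env('SEEKDB_DATABASE_HOST')
```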

Sample Code Snippet

Prepare Data

Hugging Face provides various embedding models; choose one that fits your needs. Here, we use sentence-transformers/all-MiniLM-L6-v2 as an example. The model is downloaded from Hugging Face and run locally through the sentence-transformers library:

import os
import shutil

import pyseekdb
from pyseekdb import HNSWConfiguration
from sentence_transformers import SentenceTransformer

# Set the mirror endpoint before importing datasets so downloads use it
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
from datasets import load_dataset

# Delete the cache directory so the dataset is downloaded fresh
if os.path.exists("./cache"):
    shutil.rmtree("./cache")

HUGGING_FACE_API_KEY = os.getenv('HUGGING_FACE_API_KEY')
DATASET = "squad" # Name of dataset from HuggingFace Datasets
INSERT_RATIO = 0.001 # Ratio of example dataset to be inserted
data = load_dataset(DATASET, split="validation", cache_dir="./cache")

# Generates a fixed subset. To generate a random subset, remove the seed.
data = data.train_test_split(test_size=INSERT_RATIO, seed=42)["test"]
# Clean up the data structure in the dataset.
data = data.map(
    lambda val: {"answer": val["answers"]["text"][0]},
    remove_columns=["id", "answers", "context"],
)

# Load the embedding model locally
print("Downloading model...")
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print("Model download completed!")

def encode_text(batch):
    questions = batch["question"]

    # Use the local model for inference
    embeddings = model.encode(questions)

    # Round each vector component to 6 decimal places
    formatted_embeddings = []
    for embedding in embeddings:
        formatted_embedding = [round(float(val), 6) for val in embedding]
        formatted_embeddings.append(formatted_embedding)

    batch["embedding"] = formatted_embeddings
    return batch

INFERENCE_BATCH_SIZE = 64 # Batch size of model inference
data = data.map(encode_text, batched=True, batch_size=INFERENCE_BATCH_SIZE)
data_list = data.to_list()
ids = []
embeddings = []
documents = []
metadatas = []

for i, item in enumerate(data_list):
    ids.append(f"item{i+1}")
    embeddings.append(item["embedding"])
    documents.append(item["question"])
    metadatas.append({"answer": item["answer"]})
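The rounding inside `encode_text` trims each vector component to six decimal places before insertion. In isolation, that formatting step looks like this (the input values here are purely illustrative):

```python
def format_embedding(embedding):
    """Round each vector component to 6 decimal places, as encode_text does."""
    return [round(float(val), 6) for val in embedding]

vec = [0.12345678, -0.98765432, 0.5]
print(format_embedding(vec))  # [0.123457, -0.987654, 0.5]
```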

Define the Table and Store Data in seekdb

Create a table named huggingface_seekdb_demo_documents and store the data in seekdb:

SEEKDB_DATABASE_HOST = os.getenv('SEEKDB_DATABASE_HOST')
SEEKDB_DATABASE_PORT = int(os.getenv('SEEKDB_DATABASE_PORT', 2881))
SEEKDB_DATABASE_USER = os.getenv('SEEKDB_DATABASE_USER')
SEEKDB_DATABASE_DB_NAME = os.getenv('SEEKDB_DATABASE_DB_NAME')
SEEKDB_DATABASE_PASSWORD = os.getenv('SEEKDB_DATABASE_PASSWORD')

client = pyseekdb.Client(
    host=SEEKDB_DATABASE_HOST,
    port=SEEKDB_DATABASE_PORT,
    database=SEEKDB_DATABASE_DB_NAME,
    user=SEEKDB_DATABASE_USER,
    password=SEEKDB_DATABASE_PASSWORD
)
table_name = "huggingface_seekdb_demo_documents"
# all-MiniLM-L6-v2 produces 384-dimensional embeddings
config = HNSWConfiguration(dimension=384, distance='l2')

collection = client.create_collection(
    name=table_name,
    configuration=config,
    embedding_function=None
)

print('- Inserting Data to seekdb...')
collection.add(
    ids=ids,
    embeddings=embeddings,
    documents=documents,
    metadatas=metadatas
)
print('- Inserting Data to seekdb completed!')

Generate the query text vectors with the same local embedding model, then search for the most relevant documents based on the L2 distance between the query vector and each vector in the vector table:

questions = {
    "question": [
        "What is LGM?",
        "When did Massachusetts first mandate that children be educated in schools?",
    ]
}

query_embeddings = encode_text(questions)["embedding"]

res = collection.query(
    query_embeddings=query_embeddings,
    n_results=1
)

for i in range(len(questions["question"])):
    print(f"Question: {questions['question'][i]}")
    if i < len(res['ids']) and res['ids'][i]:
        for j in range(len(res['ids'][i])):
            result = {
                "id": res['ids'][i][j],
                "original question": res['documents'][i][j],
                "distance": res['distances'][i][j]
            }
            print(result)
    else:
        print("No results found")

Expected Result

Question: What is LGM?
{'id': 'item10', 'original question': 'What does LGM stands for?', 'distance': 0.29572633579122415}

Question: When did Massachusetts first mandate that children be educated in schools?
{'id': 'item1', 'original question': 'In what year did Massachusetts first require children to be educated in schools?', 'distance': 0.24083293996160604}
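The `distance` field above is the L2 (Euclidean) distance between the query vector and the stored vector; smaller means a closer match. A minimal plain-Python sketch of the metric, purely illustrative (seekdb computes this internally on the indexed vectors):

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Smaller distance means a closer match
print(l2_distance([3.0, 0.0], [0.0, 4.0]))  # 5.0
```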