Integrate seekdb Vector with Firecrawl
seekdb provides capabilities for storing vectors, creating vector indexes, and performing vector searches. You can store the vectorized data in seekdb for use in subsequent searches.
Firecrawl enables developers to extract high-quality data from any website to build AI applications. It offers advanced web scraping, crawling, and data extraction capabilities, and converts website content into clean Markdown or structured data for downstream AI workflows.
In this tutorial, we show how to build the retrieval part of a Retrieval-Augmented Generation (RAG) pipeline using seekdb and Firecrawl. The pipeline uses Firecrawl to scrape web data, Jina AI to generate embeddings, and seekdb to store and search the vectors.
Prerequisites
- You have deployed seekdb.
- Your environment has a usable MySQL tenant, MySQL database, and account, and the database account has read and write permissions.
- Python 3.11 or later is installed.
- Dependencies are installed.
python3 -m pip install cffi pyseekdb requests firecrawl-py tqdm
Step 1: Obtain database connection information
Contact the seekdb deployment personnel or administrator to obtain the database connection string, for example:
mysql -h$host -P$port -u$user_name -p$password -D$database_name
Parameter description:
- $host: the IP address for connecting to seekdb.
- $port: the port for connecting to seekdb, which is 2881 by default.
- $database_name: the name of the database to be accessed.
  Tip: The user needs to have the CREATE, INSERT, DROP, and SELECT permissions on the database.
- $user_name: the database connection account.
- $password: the account password.
Step 2: Build your AI assistant
Use Firecrawl to scrape web pages and save the data in seekdb Vector for search.
Set environment variables
Obtain your Firecrawl and Jina AI API keys, then set them as environment variables together with the seekdb connection information.
export SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
export SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT
export SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
export SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
export SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
export FIRECRAWL_API_KEY=YOUR_FIRECRAWL_API_KEY
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
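Before running the sample, it can help to check that the variables are actually set. The following is a minimal sketch, not part of seekdb or Firecrawl: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and the variable list is taken from the `os.getenv` and `os.environ` calls in the sample code later in this tutorial. `SEEKDB_DATABASE_PORT` is left out because the sample falls back to 2881 when it is unset.

```python
import os

# Names read by the sample code in this tutorial (assumed to be the full set).
REQUIRED_VARS = (
    "SEEKDB_DATABASE_HOST",
    "SEEKDB_DATABASE_USER",
    "SEEKDB_DATABASE_DB_NAME",
    "SEEKDB_DATABASE_PASSWORD",
    "FIRECRAWL_API_KEY",
    "JINAAI_API_KEY",
)


def missing_env_vars(environ=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Typical use at the top of a script:
#     missing = missing_env_vars()
#     if missing:
#         raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

Failing early with the list of missing names tends to be easier to debug than a connection error or an HTTP 401 raised further down the pipeline.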
Sample code snippet
import os
import requests
from firecrawl import FirecrawlApp
import pyseekdb
from pyseekdb import HNSWConfiguration
from tqdm import tqdm

def split_markdown_content(content):
    # Split the Markdown on "# " heading markers and drop empty sections.
    return [section.strip() for section in content.split("# ") if section.strip()]

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Scrape a website:
scrape_status = app.scrape(
    url="https://www.oceanbase.ai/docs/seekdb-overview/",
    formats=["markdown"]
)
markdown_content = scrape_status.markdown

# Process the scraped markdown content
sections = split_markdown_content(markdown_content)
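Splitting on the literal string "# " also cuts at any "# " that appears mid-line and removes the heading markers from each section. As a possible alternative, the sketch below splits only at headings that start a line and keeps each heading with its section. `split_by_headings` is a hypothetical helper, not part of Firecrawl or pyseekdb, and it assumes ATX-style headings (`#` through `######`).

```python
import re

# Matches ATX headings (#, ##, ... ######) at the start of a line.
HEADING_RE = re.compile(r"^#{1,6}\s+", flags=re.MULTILINE)


def split_by_headings(content: str) -> list[str]:
    """Split Markdown into sections, each starting at a heading line."""
    starts = [m.start() for m in HEADING_RE.finditer(content)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(content)]
    sections = [content[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]
```

Note that this sketch does not look inside fenced code blocks, so a line such as a shell comment that starts with "# " would still be treated as a heading.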
Get vectors from Jina AI
Jina AI provides various models, and users can choose the appropriate model based on their needs.
Here, we use jina-embeddings-v3 as an example and define a generate_embeddings helper function to call the Jina AI API:
JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')

def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v3'  # with dimension 1024.
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    return response.json()['data'][0]['embedding']
ids = []
embeddings = []
documents = []
for i, section in enumerate(tqdm(sections, desc="Processing sections")):
    try:
        # Generate the embedding for the section via API
        embedding = generate_embeddings(section)
        # Truncate content if too long
        truncated_content = section[:4900]
        # Append to lists
        ids.append(f"{i+1}")
        embeddings.append(embedding)
        documents.append(truncated_content)
    except Exception as e:
        print(f"Error processing section {i}: {e}")
        continue
print(f"Successfully processed {len(documents)} sections")
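When the Jina AI API returns an error body, indexing into `['data'][0]['embedding']` fails with a bare `KeyError`, which is hard to diagnose from the loop's error message. The sketch below shows one way to validate a decoded response before using it. `extract_embedding` and `EXPECTED_DIM` are hypothetical names; the response shape is assumed from the indexing used in `generate_embeddings`, and the 1024 dimension is the value noted for jina-embeddings-v3 in this tutorial.

```python
EXPECTED_DIM = 1024  # dimension noted for jina-embeddings-v3 in this tutorial


def extract_embedding(payload: dict, expected_dim: int = EXPECTED_DIM) -> list:
    """Pull the first embedding out of a decoded response and check its length."""
    try:
        embedding = payload["data"][0]["embedding"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"Unexpected embedding response: {payload!r}") from exc
    if len(embedding) != expected_dim:
        raise ValueError(
            f"Expected {expected_dim} dimensions, got {len(embedding)}"
        )
    return embedding
```

Inside `generate_embeddings`, the last line could then become `return extract_embedding(response.json())`. Checking the length here also catches a mismatch with the dimension configured for the vector index before any data is inserted.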
Define a table and store data in seekdb
Create a table named firecrawl_seekdb_demo_documents:
SEEKDB_DATABASE_HOST = os.getenv('SEEKDB_DATABASE_HOST')
SEEKDB_DATABASE_PORT = int(os.getenv('SEEKDB_DATABASE_PORT', 2881))
SEEKDB_DATABASE_USER = os.getenv('SEEKDB_DATABASE_USER')
SEEKDB_DATABASE_DB_NAME = os.getenv('SEEKDB_DATABASE_DB_NAME')
SEEKDB_DATABASE_PASSWORD = os.getenv('SEEKDB_DATABASE_PASSWORD')
client = pyseekdb.Client(
    host=SEEKDB_DATABASE_HOST,
    port=SEEKDB_DATABASE_PORT,
    database=SEEKDB_DATABASE_DB_NAME,
    user=SEEKDB_DATABASE_USER,
    password=SEEKDB_DATABASE_PASSWORD
)
table_name = "firecrawl_seekdb_demo_documents"
config = HNSWConfiguration(dimension=1024, distance='cosine')
collection = client.create_collection(
    name=table_name,
    configuration=config,
    embedding_function=None
)
print('- Inserting Data to seekdb...')
collection.add(
    ids=ids,
    embeddings=embeddings,
    documents=documents
)
Perform semantic search
Generate a vector for the query text using the Jina AI API, then search for the most relevant documents based on the cosine distance between the query vector and each vector in the vector table:
query = 'what is seekdb'
# Generate the embedding for the query via Jina AI API.
query_embedding = generate_embeddings(query)
res = collection.query(
    query_embeddings=query_embedding,
    n_results=1
)
print('- The Most Relevant Document and Its Distance to the Query:')
for doc_id, document, distance in zip(
    res['ids'][0],
    res['documents'][0],
    res['distances'][0]
):
    print(f' - ID: {doc_id}')
    print(f' content: {document}')
    print(f' distance: {distance:.6f}')
Expected result
- ID: 1
content: Skip to main content
On this page
OceanBase seekdb (referred to as seekdb) is an AI-native search database. It unifies relational, vector, text, JSON and GIS in a single engine, enabling hybrid search and in-database AI workflows.
distance: 0.235985
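To help interpret the distance value, here is a minimal pure-Python sketch of cosine distance. It assumes the common definition, 1 minus the cosine similarity of the two vectors; whether seekdb computes its 'cosine' distance in exactly this way is an assumption, not something this tutorial confirms. `cosine_distance` is a hypothetical helper for illustration only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Under this definition the value is 0 for vectors pointing in the same direction, 1 for orthogonal vectors, and 2 for opposite vectors, so a smaller distance indicates a closer semantic match.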