Version: V1.1.0

Integrate seekdb Vector with Firecrawl

seekdb provides capabilities for storing vectors, creating vector indexes, and performing vector searches. You can store the vectorized data in seekdb for use in subsequent searches.

Firecrawl enables developers to extract high-quality data from any website to build AI applications. It offers web scraping, crawling, and data extraction capabilities, and converts website content into clean Markdown or structured data for downstream AI workflows.

In this tutorial, we will show you how to build a Retrieval-Augmented Generation (RAG) pipeline using seekdb and Firecrawl. The pipeline uses Firecrawl for web data scraping, Jina AI for generating text embeddings, and seekdb for vector storage and semantic search.

Prerequisites

You have deployed seekdb.
Your environment has a usable database and account, and the database account has read and write permissions.
Python 3.11 or later is installed.

Dependencies are installed.

python3 -m pip install cffi pyseekdb requests firecrawl-py tqdm

Step 1: Obtain database connection information

Contact the seekdb deployment personnel or administrator to obtain the database connection string, for example:

mysql -h$host -P$port -u$user_name -p$password -D$database_name

Parameter description:

$host: the IP address for connecting to seekdb.
$port: the port for connecting to seekdb, which is 2881 by default.
$database_name: the name of the database to be accessed.
$user_name: the database connection account.
$password: the account password.

tip
The user needs to have the CREATE, INSERT, DROP, and SELECT permissions on the database.

Step 2: Build your AI assistant

Use Firecrawl to scrape web pages and save the data in seekdb Vector for search.

Set environment variables

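Obtain the Firecrawl and Jina AI API keys, then set them as environment variables together with the seekdb connection information. The sample code reads the host and port from SEEKDB_DATABASE_HOST and SEEKDB_DATABASE_PORT.

export SEEKDB_DATABASE_HOST=YOUR_SEEKDB_DATABASE_HOST
export SEEKDB_DATABASE_PORT=YOUR_SEEKDB_DATABASE_PORT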
export SEEKDB_DATABASE_USER=YOUR_SEEKDB_DATABASE_USER
export SEEKDB_DATABASE_DB_NAME=YOUR_SEEKDB_DATABASE_DB_NAME
export SEEKDB_DATABASE_PASSWORD=YOUR_SEEKDB_DATABASE_PASSWORD
export FIRECRAWL_API_KEY=YOUR_FIRECRAWL_API_KEY
export JINAAI_API_KEY=YOUR_JINAAI_API_KEY
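Before running the sample code, it can help to fail fast when a variable is missing. The following is a minimal sketch, not part of the seekdb or Firecrawl APIs: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and the list contains the variables that the sample code in this tutorial reads without a default value.

```python
import os

# Variables the sample code reads without a default (SEEKDB_DATABASE_PORT falls back to 2881).
REQUIRED_VARS = [
    "FIRECRAWL_API_KEY",
    "JINAAI_API_KEY",
    "SEEKDB_DATABASE_HOST",
    "SEEKDB_DATABASE_USER",
    "SEEKDB_DATABASE_DB_NAME",
    "SEEKDB_DATABASE_PASSWORD",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Example use at the top of a script:
#   missing = missing_env_vars(REQUIRED_VARS)
#   if missing:
#       raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.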

Sample code snippet

import os
import requests
from firecrawl import FirecrawlApp
import pyseekdb
from pyseekdb import HNSWConfiguration
from tqdm import tqdm

def split_markdown_content(content):
    return [section.strip() for section in content.split("# ") if section.strip()]

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Scrape a website:
scrape_status = app.scrape(
    url="https://www.oceanbase.ai/docs/seekdb-overview/",
    formats=["markdown"]
)

markdown_content = scrape_status.markdown

# Process the scraped markdown content
sections = split_markdown_content(markdown_content)
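Note that splitting on the literal string "# " also splits at deeper headings such as "## ", and leaves a stray "#" at the end of the preceding section. The sketch below reproduces that behavior on a small sample and shows one possible alternative that splits only at line-initial headings and keeps the heading markers. `split_markdown_by_headings` is a hypothetical helper for illustration, not part of Firecrawl or seekdb.

```python
import re


def split_markdown_content(content):
    # Same logic as the tutorial's helper: split on the literal "# ".
    return [section.strip() for section in content.split("# ") if section.strip()]


def split_markdown_by_headings(content):
    # Hypothetical alternative: split before any line that starts with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", content)
    return [part.strip() for part in parts if part.strip()]


sample = "# Title\nIntro\n## Usage\nRun it."

print(split_markdown_content(sample))
# ['Title\nIntro\n#', 'Usage\nRun it.']

print(split_markdown_by_headings(sample))
# ['# Title\nIntro', '## Usage\nRun it.']
```

Either splitter can feed the embedding loop; the regex version treats any line starting with "# " as a heading, including comment lines inside code samples on the scraped page.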

Get vectors from Jina AI

Jina AI provides various models, and users can choose the appropriate model based on their needs. Here, we use jina-embeddings-v3 as an example and define a generate_embeddings helper function to call the Jina AI API:

JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')
def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v3'  # with dimension 1024.
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    return response.json()['data'][0]['embedding']


ids = []
embeddings = []
documents = []

for i, section in enumerate(tqdm(sections, desc="Processing sections")):
    try:
        # Generate the embedding for the section via API
        embedding = generate_embeddings(section)
        # Truncate content if too long
        truncated_content = section[:4900] if len(section) > 4900 else section
        # Append to lists
        ids.append(f"{i+1}")
        embeddings.append(embedding)
        documents.append(truncated_content)
        
    except Exception as e:
        print(f"Error processing section {i}: {e}")
        continue

print(f"Successfully processed {len(documents)} sections")
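The loop above embeds the full section but stores only its first 4900 characters, so the stored text and its vector can describe different content for long sections. One possible alternative, sketched here under that assumption, is to split long sections into chunks of at most 4900 characters and embed each chunk separately. `chunk_text` is a hypothetical helper, and the 4900 limit is taken from the sample code.

```python
def chunk_text(text: str, max_chars: int = 4900) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Hard-split a paragraph that is longer than max_chars on its own.
        pieces = [paragraph[i:i + max_chars]
                  for i in range(0, len(paragraph), max_chars)] or [""]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                # The piece still fits in the chunk being built.
                current = candidate
            else:
                # Close the current chunk and start a new one with this piece.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk returned by `chunk_text(section)` could then be passed to `generate_embeddings` and stored with its own ID.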

Define a table and store data in seekdb

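Create a collection (table) named firecrawl_seekdb_demo_documents with an HNSW index configured for 1024-dimensional vectors and cosine distance, then insert the IDs, embeddings, and documents: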

SEEKDB_DATABASE_HOST = os.getenv('SEEKDB_DATABASE_HOST')
SEEKDB_DATABASE_PORT = int(os.getenv('SEEKDB_DATABASE_PORT', 2881)) 
SEEKDB_DATABASE_USER = os.getenv('SEEKDB_DATABASE_USER')
SEEKDB_DATABASE_DB_NAME = os.getenv('SEEKDB_DATABASE_DB_NAME')
SEEKDB_DATABASE_PASSWORD = os.getenv('SEEKDB_DATABASE_PASSWORD')

client = pyseekdb.Client(
    host=SEEKDB_DATABASE_HOST,
    port=SEEKDB_DATABASE_PORT,
    database=SEEKDB_DATABASE_DB_NAME,
    user=SEEKDB_DATABASE_USER,
    password=SEEKDB_DATABASE_PASSWORD,
)
table_name = "firecrawl_seekdb_demo_documents"
config = HNSWConfiguration(dimension=1024, distance='cosine')  
collection = client.create_collection(
    name=table_name,
    configuration=config,
    embedding_function=None
)

print('- Inserting Data to seekdb...')
collection.add(
    ids=ids,
    embeddings=embeddings,
    documents=documents
)
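The collection is configured for 1024-dimensional vectors, so every embedding passed to it needs that length, and the three lists need equal lengths. The following is a minimal sketch of a check that could run before the insert. `validate_embeddings` is a hypothetical helper and not part of pyseekdb.

```python
def validate_embeddings(ids, embeddings, documents, expected_dim=1024):
    """Raise ValueError if the lists are misaligned or a vector has the wrong length."""
    if not (len(ids) == len(embeddings) == len(documents)):
        raise ValueError(
            f"Length mismatch: {len(ids)} ids, {len(embeddings)} embeddings, "
            f"{len(documents)} documents"
        )
    for doc_id, vector in zip(ids, embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"Embedding for id {doc_id} has dimension {len(vector)}, "
                f"expected {expected_dim}"
            )
    # Return the number of validated rows for logging.
    return len(ids)
```

Calling `validate_embeddings(ids, embeddings, documents)` before inserting surfaces a model or configuration mismatch as a clear local error.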

Perform semantic search

Generate a vector for the query text using the Jina AI API, then search for the most relevant documents based on the cosine distance between the query vector and each vector in the vector table:

query = 'what is seekdb'
# Generate the embedding for the query via Jina AI API.
query_embedding = generate_embeddings(query)

res = collection.query(
    query_embeddings=query_embedding,
    n_results=1
)

print('- The Most Relevant Document and Its Distance to the Query:')
for i, (doc_id, document, distance) in enumerate(zip(
    res['ids'][0], 
    res['documents'][0], 
    res['distances'][0]
)):
    print(f'  - ID: {doc_id}')
    print(f'    content: {document}')
    print(f'    distance: {distance:.6f}')
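To help interpret the reported distance, here is a small pure-Python sketch of cosine distance. It assumes the common convention of 1 minus cosine similarity; the source does not state the exact formula seekdb uses, so treat this as illustrative. `cosine_distance` is a hypothetical helper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Under this convention, values near 0 mean the query and the document point in almost the same direction, and larger values mean less similar content.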

Expected result

  - ID: 1
    content: Skip to main content
On this page
OceanBase seekdb (referred to as seekdb) is an AI-native search database. It unifies relational, vector, text, JSON and GIS in a single engine, enabling hybrid search and in-database AI workflows.
    distance: 0.235985
