Skip to main content
Version: V1.0.0

Build a RAG application with seekdb

This tutorial shows you how to import Markdown documents into seekdb, build a hybrid-retrieval knowledge base with the SDK, and launch a Streamlit-based RAG interface.

Prerequisites

  • Embedded seekdb is installed. For details, see Embedded mode.

  • Python 3.11 or later is installed.

  • uv is installed. If you do not have it yet, install it with one of the following commands:

    curl -LsSf https://astral.sh/uv/install.sh | sh

    # Or install via pip
    pip install uv

Step 1: Obtain the LLM API key

In this tutorial, Alibaba Cloud Bailian is used as an example. You can use any provider that fits your requirements (for example, any OpenAI-compatible API service).

tip

Enabling the Qwen service on Alibaba Cloud is done through a third-party platform and is subject to that platform's billing rules. It may incur charges. Before proceeding, review the official documentation and pricing, and continue only if you accept the terms.

Create an Alibaba Cloud Bailian account, enable the model service, and copy your API key.

Step 2: Clone the repo

git clone https://github.com/oceanbase/pyseekdb.git

Step 3: Install dependencies

  1. Go to the pyseekdb/demo/rag directory:

    cd pyseekdb/demo/rag
  2. Install dependencies:

    • For default or api embedding types:

      uv sync
    • For local embedding type:

      uv sync --extra local
    info
    • The local extra installs sentence-transformers and related dependencies (roughly 2–3 GB).

    • If you are in mainland China, you can use a domestic mirror to speed up downloads:

      • Basic install (TUNA): uv sync --index-url https://pypi.tuna.tsinghua.edu.cn/simple
      • Basic install (Aliyun): uv sync --index-url https://mirrors.aliyun.com/pypi/simple
      • Local model (TUNA): uv sync --extra local --index-url https://pypi.tuna.tsinghua.edu.cn/simple
      • Local model (Aliyun): uv sync --extra local --index-url https://mirrors.aliyun.com/pypi/simple

Step 4: Set environment variables

  1. Copy the environment file template:

    cp .env.example .env
  2. Edit .env and choose an embedding function type:

    seekdb supports three embedding function types:

    • default (recommended to get started)

      • Uses pyseekdb's built-in DefaultEmbeddingFunction (ONNX-based).
      • Downloads the model automatically on first use; no embedding API key required.
      • Best for local development and testing.
    • local (local model)

      • Uses a custom sentence-transformers model.
      • Requires installing the sentence-transformers extra.
      • You can configure the model name and device (CPU/GPU).
    • api (API service)

      • Uses an OpenAI-compatible embedding API (for example, DashScope or OpenAI).
      • Requires an embedding API key and model name.
      • Typically used in production.

Example configuration using Qwen (with EMBEDDING_FUNCTION_TYPE=api):

# Embedding function type: api, local, default
EMBEDDING_FUNCTION_TYPE=api

# LLM configuration (used to generate answers)
OPENAI_API_KEY=sk-your-dashscope-key
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_MODEL_NAME=qwen-plus

# Embedding API configuration (required only when EMBEDDING_FUNCTION_TYPE=api)
EMBEDDING_API_KEY=sk-your-dashscope-key
EMBEDDING_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
EMBEDDING_MODEL_NAME=text-embedding-v4

# Local model configuration (required only when EMBEDDING_FUNCTION_TYPE=local)
SENTENCE_TRANSFORMERS_MODEL_NAME=all-mpnet-base-v2
SENTENCE_TRANSFORMERS_DEVICE=cpu

# seekdb configuration
SEEKDB_DIR=./data/seekdb_rag
SEEKDB_NAME=test
COLLECTION_NAME=embeddings

Parameter reference:

ParameterDescriptionDefault / ExampleRequired
EMBEDDING_FUNCTION_TYPEEmbedding function typedefault (optional: api, local, default)Required
OPENAI_API_KEYLLM API key (supports OpenAI, Qwen, and other compatible services)-Required (for generating answers)
OPENAI_BASE_URLLLM API base URLhttps://dashscope.aliyuncs.com/compatible-mode/v1Optional
OPENAI_MODEL_NAMELanguage model nameqwen-plusOptional
EMBEDDING_API_KEYEmbedding API key-Required when EMBEDDING_FUNCTION_TYPE=api
EMBEDDING_BASE_URLEmbedding API base URLhttps://dashscope.aliyuncs.com/compatible-mode/v1Required when EMBEDDING_FUNCTION_TYPE=api
EMBEDDING_MODEL_NAMEEmbedding model nametext-embedding-v4Required when EMBEDDING_FUNCTION_TYPE=api
SENTENCE_TRANSFORMERS_MODEL_NAMELocal model nameall-mpnet-base-v2Required when EMBEDDING_FUNCTION_TYPE=local
SENTENCE_TRANSFORMERS_DEVICEDevice for running the modelcpuRequired when EMBEDDING_FUNCTION_TYPE=local
SEEKDB_DIRseekdb database directory./data/seekdb_ragOptional
SEEKDB_NAMEDatabase nametestOptional
COLLECTION_NAMEEmbedding table nameembeddingsOptional
tip
  • If you use default, configure EMBEDDING_FUNCTION_TYPE=default and the LLM-related parameters only.
  • If you use api, you must also configure the embedding API variables.
  • If you use local, install the sentence-transformers extra and configure the local model variables.

Step 5: Import documents

This tutorial uses the pyseekdb SDK documentation as sample data. You can also import your own Markdown files or an entire directory.

Run the import script:

# Import a single file
uv run python seekdb_insert.py ../../README.md

# Or import all Markdown files in a directory
uv run python seekdb_insert.py path/to/your_dir

What happens during import:

  • The script reads the specified Markdown file, or all Markdown files in the target directory.
  • It splits documents into chunks by heading (using # as the delimiter).
  • It selects an embedding function based on EMBEDDING_FUNCTION_TYPE in .env:
    • default: Uses pyseekdb's built-in DefaultEmbeddingFunction (downloads the model on first use).
    • local: Uses a custom sentence-transformers model.
    • api: Uses the configured embedding API service.
  • It generates embeddings for each chunk and stores them in seekdb.
  • It automatically skips failed chunks to keep batch processing resilient.

Step 6: Run the RAG app

Start the Streamlit app:

uv run streamlit run seekdb_app.py

You should see output similar to the following:

You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://xxx.xxx.xxx.19:8501
External URL: http://xxx.xxx.xxx.143:8501

Once the app is running, open the URL in your browser to access the RAG interface and query your imported documents.

tip

If you manage dependencies with uv, prefix commands with uv run to ensure you are using the correct environment and installed packages.

RAG interface