
seekdb MLDR benchmark

This guide explains how to benchmark seekdb with MLDR and evaluate retrieval quality and latency.

What is MLDR?

MLDR (Multi-Lingual Document Retrieval) is a dataset-based benchmarking framework for evaluating multilingual document retrieval systems. It supports three query types: BM25 full-text search, dense vector retrieval, and hybrid retrieval.

Supported query types

Query type | Description | Supported backends
bm25 | BM25 full-text search | OceanBase Database, seekdb
dense | Dense vector retrieval | OceanBase Database, seekdb
hybrid_dense_bm25 | Hybrid retrieval (dense + BM25) | OceanBase Database, seekdb

Metrics

  • Recall@10: Fraction of the relevant documents that appear in the top 10 results.
  • NDCG@10: Normalized Discounted Cumulative Gain at 10.
  • Average query time: Average latency per query.
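
For reference, a standard formulation of the two quality metrics is given below. The runner takes its relevance judgments from the MLDR qrels, and note that some NDCG implementations use the exponential gain 2^{rel_i} - 1 in place of rel_i:

\mathrm{Recall@10} = \frac{|\mathrm{Rel} \cap \mathrm{Top}_{10}|}{|\mathrm{Rel}|}, \qquad
\mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}, \qquad
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{rel_i}{\log_2(i+1)}

Here Rel is the set of documents judged relevant for the query, Top_{10} is the retrieved top 10, rel_i is the relevance grade of the result at rank i, and IDCG@10 is the DCG@10 of the ideal ordering.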

Environment setup

Before running the benchmark, make sure your environment meets the following requirements:

  • Python 3.11 or later
  • JDK 11 or later (required by pyserini)
  • seekdb in client/server mode. For deployment instructions, see Deploy seekdb using yum install.

For the deployed seekdb instance, place the log, clog, and data directories on three separate disks, and use performance level PL1. When starting seekdb, set the following parameters in /etc/oceanbase/seekdb.cnf:

port=2881
base-dir=/data/1/seekdb
data-dir=/data/2/seekdb
redo-dir=/data/3/seekdb
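
To confirm that the three directories actually live on separate disks, you can check which filesystem each one resolves to (paths as configured above):

df -h /data/1 /data/2 /data/3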

Test plan

  • The benchmark uses two machines:
    • One machine runs MLDR.
    • The other machine runs seekdb (4C8G, PL1, with log/clog/data on three separate disks).
  • Dataset size: 200,000 records.
  • MLDR is used to evaluate seekdb retrieval quality (Recall@10, NDCG@10) and performance (average query latency).

Run an MLDR benchmark (manual)

Step 1: Get the seekdb connection string

Confirm that you can connect to the deployed seekdb instance (replace the host and password placeholders with your own values):

mysql -hxx.xx.xx.xx -P2881 -uroot -p**** -A
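
To check connectivity without opening an interactive session, a one-off query also works (host and password are placeholders, as above):

mysql -hxx.xx.xx.xx -P2881 -uroot -p**** -e "SELECT VERSION();"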

Step 2: Install Java

sudo dnf install java-11-openjdk-devel -y
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
export JVM_PATH=$JAVA_HOME/lib/server/libjvm.so
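
To confirm that the JDK is installed and that libjvm.so exists at the exported path, you can run:

java -version
ls "$JVM_PATH"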

Step 3: Create a Python virtual environment

  1. Install Miniconda:

    mkdir -p ~/miniconda3
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
    bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
    rm ~/miniconda3/miniconda.sh
  2. Reopen your terminal and initialize Conda:

    source ~/miniconda3/bin/activate
    conda init --all
  3. Create and activate a dedicated environment for MLDR:

    conda create -n test python=3.11
    conda activate test
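
A quick check that the new environment is active and uses the expected interpreter:

python --version   # should report Python 3.11.x
which python       # should resolve under ~/miniconda3/envs/test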

Step 4: Download MLDR

git clone https://github.com/oceanbase/ob-mldr-test
cd ob-mldr-test

Step 5: Install Python dependencies

  1. Upgrade pip:

    pip install --upgrade pip
  2. Install dependencies:

    pip install -r requirements.txt
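
If the JDK setup from Step 2 is correct, the JVM-backed dependency should import cleanly. As a smoke test (this assumes pyserini is pinned in requirements.txt, as implied by the JDK prerequisite above):

python -c "import pyserini; print('pyserini OK')"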

Step 6: Configure database connection

  1. Copy the example config:

    cp config.yaml.example config.yaml
  2. Edit config.yaml and replace the database settings with your own:

    vim config.yaml

    Key fields to update:

    • oceanbase.host: database host
    • oceanbase.port: database port
    • oceanbase.user: database username
    • oceanbase.password: database password
    • oceanbase.database: database name
    • embedding.vector_download_url: URL to download vector files (required for vector retrieval)
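
A minimal sketch of an edited config.yaml, assuming the dotted field names above map to nested YAML keys; treat config.yaml.example as the authoritative layout. The host, password, database name, and URL below are placeholders:

oceanbase:
  host: xx.xx.xx.xx        # database host (placeholder)
  port: 2881               # database port
  user: root               # database username
  password: "****"         # database password (placeholder)
  database: test           # database name (placeholder)
embedding:
  # URL to download vector files (required for vector retrieval); placeholder URL
  vector_download_url: "https://example.com/mldr-vectors.tar.gz"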

Step 7: Run the full benchmark

# Hybrid retrieval benchmark (English)
python mldr_test_runner.py --lang en --query-type hybrid_dense_bm25

# Skip data insertion (when data already exists)
python mldr_test_runner.py --lang en --query-type hybrid_dense_bm25 --skip-insert

Parameters:

Parameter | Type | Default | Description
--lang | str | en | Test language (for example, en, zh)
--backend | str | oceanbase | Database backend (currently only oceanbase is supported)
--query-type | str | bm25 | Query type: hybrid_dense_bm25 (hybrid), dense (vector), bm25 (full-text)
--skip-insert | flag | False | Skip data insertion (use when data already exists)
--result-dir | str | temporary directory | Output directory for results
--config | str | None | Config file path (YAML). Defaults to config.yaml in the current directory
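
For example, combining the flags above, a dense-only run that reuses previously inserted data and writes results to a custom directory might look like this (the paths are illustrative):

python mldr_test_runner.py --lang en --query-type dense --skip-insert --result-dir ./results --config ./config.yaml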

What the runner does

The test runner performs these steps automatically:

  1. Insert data: Insert 200,000 records and create indexes.
  2. Warm-up (optional): Run a warm-up query.
  3. Benchmark: Execute the configured number of queries and compute metrics.
  4. Report: Output average Recall@10, NDCG@10, and average latency.

Results

For detailed results, see seekdb MLDR benchmark report.