# seekdb MLDR benchmark
This guide explains how to benchmark seekdb with MLDR and evaluate retrieval quality and latency.
## What is MLDR?
MLDR (Multi-Lingual Document Retrieval) is a dataset-based benchmarking framework for evaluating multilingual document retrieval systems. It supports multiple query types, including BM25, dense vector retrieval, and hybrid retrieval.
### Supported query types

| Query type | Description | Supported backends |
|---|---|---|
| `bm25` | BM25 full-text search | OceanBase Database, seekdb |
| `dense` | Dense vector retrieval | OceanBase Database, seekdb |
| `hybrid_dense_bm25` | Hybrid retrieval (dense + BM25) | OceanBase Database, seekdb |
### Metrics

- Recall@10: the proportion of ground-truth relevant documents that appear in the top 10 results.
- NDCG@10: Normalized Discounted Cumulative Gain over the top 10 results, which rewards ranking relevant documents higher.
- Average query time: the average end-to-end latency per query.
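For reference, both quality metrics can be computed per query as in the following sketch. This is illustrative rather than the runner's actual code; `retrieved` stands for the ranked list of document IDs returned by a query and `relevant` for the set of ground-truth relevant IDs from the dataset.

```python
import math

def recall_at_k(retrieved: list, relevant: set, k: int = 10) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved: list, relevant: set, k: int = 10) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(retrieved[:k])
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```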
## Environment setup
Before running the benchmark, make sure your environment meets the following requirements:
- Python 3.11 or later
- JDK 11 or later (required by pyserini)
- seekdb deployed in client/server mode. For deployment instructions, see Deploy seekdb using yum install.
For the deployed seekdb instance, place the log, clog, and data directories on three separate disks, and use performance level PL1. When starting seekdb, set the following parameters in /etc/oceanbase/seekdb.cnf:

```ini
port=2881
base-dir=/data/1/seekdb
data-dir=/data/2/seekdb
redo-dir=/data/3/seekdb
```
## Test plan

- The benchmark uses two machines:
  - One machine runs MLDR.
  - The other machine runs seekdb (4C8G, PL1, with log/clog/data on three separate disks).
- Dataset size: 200,000 records.
- MLDR is used to evaluate seekdb retrieval quality (Recall/NDCG) and performance (latency).
## Run an MLDR benchmark (manual)

### Step 1: Get the seekdb connection string

Connect to the seekdb instance with the MySQL client to confirm the connection details (replace the host and password placeholders with your own):

```bash
mysql -hxx.xx.xx.xx -P2881 -uroot -p**** -A
```
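Because seekdb speaks the MySQL protocol, you can also verify connectivity programmatically. Here is a minimal sketch using the pymysql driver (an assumption: it is not part of the MLDR requirements; install it separately if you want this check):

```python
import pymysql  # pip install pymysql

# Replace host, user, and password with your own connection details.
conn = pymysql.connect(host="xx.xx.xx.xx", port=2881,
                       user="root", password="****")
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())  # prints the server version string
conn.close()
```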
### Step 2: Install Java

```bash
sudo dnf install java-11-openjdk-devel -y
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
export JVM_PATH=$JAVA_HOME/lib/server/libjvm.so
```
### Step 3: Create a Python virtual environment

1. Install Miniconda:

   ```bash
   mkdir -p ~/miniconda3
   wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
   bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
   rm ~/miniconda3/miniconda.sh
   ```

2. Reopen your terminal and initialize Conda:

   ```bash
   source ~/miniconda3/bin/activate
   conda init --all
   ```

3. Create and activate a dedicated environment for MLDR:

   ```bash
   conda create -n test python=3.11
   conda activate test
   ```
### Step 4: Download MLDR

```bash
git clone https://github.com/oceanbase/ob-mldr-test
cd ob-mldr-test
```
### Step 5: Install Python dependencies

1. Upgrade pip:

   ```bash
   pip install --upgrade pip
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
### Step 6: Configure database connection

1. Copy the example config:

   ```bash
   cp config.yaml.example config.yaml
   ```

2. Edit config.yaml and replace the database settings with your own (a quick way to validate the edited file is shown after this list):

   ```bash
   vim config.yaml
   ```

   Key fields to update:

   - oceanbase.host: database host
   - oceanbase.port: database port
   - oceanbase.user: database username
   - oceanbase.password: database password
   - oceanbase.database: database name
   - embedding.vector_download_url: URL to download vector files (required for vector retrieval)
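To sanity-check the file before a long run, a small sketch like the following can load it and confirm the required keys are present. It assumes PyYAML is available in the environment; the key names mirror the field list above:

```python
import yaml  # PyYAML

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Required database settings (see the field list above).
for key in ("host", "port", "user", "password", "database"):
    assert key in cfg["oceanbase"], f"missing oceanbase.{key}"

# Needed only for dense / hybrid retrieval.
if "vector_download_url" not in cfg.get("embedding", {}):
    print("warning: embedding.vector_download_url is not set; "
          "vector retrieval will not work")
```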
### Step 7: Run the full benchmark

```bash
# Hybrid retrieval benchmark (English)
python mldr_test_runner.py --lang en --query-type hybrid_dense_bm25

# Skip data insertion (when data already exists)
python mldr_test_runner.py --lang en --query-type hybrid_dense_bm25 --skip-insert
```
Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--lang` | str | `en` | Test language (for example, `en`, `zh`) |
| `--backend` | str | `oceanbase` | Database backend (currently only `oceanbase` is supported) |
| `--query-type` | str | `bm25` | Query type: `hybrid_dense_bm25` (hybrid), `dense` (vector), `bm25` (full-text) |
| `--skip-insert` | flag | False | Skip data insertion (use when data already exists) |
| `--result-dir` | str | temporary directory | Output directory for results |
| `--config` | str | None | Config file path (YAML). Defaults to `config.yaml` in the current directory |
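For example, to rerun the Chinese dense-retrieval benchmark against already-inserted data and keep the results in a fixed directory (the paths here are placeholders):

```bash
python mldr_test_runner.py --lang zh --query-type dense --skip-insert --result-dir ./results --config ./config.yaml
```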
### What the runner does
The test runner performs these steps automatically:
- Insert data: Insert 200,000 records and create indexes.
- Warm-up (optional): Run a warm-up query.
- Benchmark: Execute the configured number of queries and compute metrics.
- Report: Output average Recall@10, NDCG@10, and average latency.
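Conceptually, the benchmark stage boils down to a loop like the following sketch, reusing the metric helpers from the Metrics section. It is illustrative only; `queries`, `ground_truth`, and `run_query` stand in for the runner's internals:

```python
import time
import statistics

def benchmark(queries, ground_truth, run_query, k=10):
    """Run all queries, collecting quality metrics and per-query latency."""
    recalls, ndcgs, latencies = [], [], []
    for query in queries:
        start = time.perf_counter()
        retrieved = run_query(query, top_k=k)   # executes against the database
        latencies.append(time.perf_counter() - start)
        relevant = ground_truth[query.id]       # ground-truth relevant doc IDs
        recalls.append(recall_at_k(retrieved, relevant, k))
        ndcgs.append(ndcg_at_k(retrieved, relevant, k))
    return {
        "recall@10": statistics.mean(recalls),
        "ndcg@10": statistics.mean(ndcgs),
        "avg_query_time_s": statistics.mean(latencies),
    }
```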
## Results
For detailed results, see seekdb MLDR benchmark report.