跳到主要内容

MATCH AGAINST

描述

该表达式用于在全文索引上进行文本检索,AGAINST 接受一个查找字符串,并且按照字符集的比较方式在索引中进行查找,对于表中的每一行数据,MATCH 的返回值代表了查找字符串和行中数据的相关度,也就是查找字符串中的文本和数据表中的文本相似度。

语法

MATCH (column_set) AGAINST (query_expr [mode_flag])

column_set:
column_name [, column_name ...]

mode_flag:
IN NATURAL LANGUAGE MODE
| IN BOOLEAN MODE

参数解释

字段说明
column_set指定要进行全文搜索的列。如果要列出多个列,列之间需要使用英文逗号分隔,查询与其中的列顺序无关,必须存在一个被索引列与 column_set 完全匹配的全文索引才可以执行对应的 MATCH AGAINST 表达式。
query_expr指定要搜索的关键字或短语,也就是待匹配的参数。seekdb 会按匹配的全文索引使用的分词器对 query_expr 进行分词后,在全文索引上进行检索。
mode_flag可选项,表示全文搜索的模式。缺省值为 IN NATURAL LANGUAGE MODE。详细介绍可参见下文 mode_flag

mode_flag

  • IN NATURAL LANGUAGE MODE:默认值,用于指定使用自然语言搜索模式进行搜索。该模式的全文搜索是通过对查询表达式(query_expr)进行分词得到词条(token)集合,与索引中的词条(token)进行匹配来进行检索,只要有一个词条与查询表达式中的词条匹配成功就视为匹配成功。同时会通过基于词频的方法(Okapi BM25)对被匹配成功的行相关性进行 ranking(排序)。

    默认情况下或者指定 IN NATURAL LANGUAGE MODE 标示符,MATCH AGAINST 将使用 NATURAL LANGUAGE 模式来进行全文查找。在 NATURAL LANGUAGE 模式下,AGAINST 接受一个查找字符串,并且按照字符集的比较方式在索引中进行查找,对于表中的每一行数据,MATCH 的返回值代表了查找字符串和行中数据的相关度,也就是查找字符串中的文本和数据表中的文本相似度。

  • IN BOOLEAN MODE:用于指定使用布尔模式进行搜索。当前版本支持三种最常用的布尔运算符,以及运算嵌套,具体如下:

    • +:表示 AND,交集。

    • -:表示非,差集。

    • 无操作符号:单独使用时表示 OR,并集,如 A B 表示 A OR B。和符号混用会让存在的句子相关性变高,但会失去 OR 的语意,如 +A B 时,表示必须有 A,并计算句子中 A 和 B 的相关性。

    • ():表示运算嵌套。外层无符号时,有 OR 的语意,如 +A (嵌套子句),表示必须有 A 或者有嵌套子句。

      示例如下:

      1. 输出句子必须包含 computer。

        SELECT * FROM my_table WHERE MATCH (doc) AGAINST ("+computer" IN BOOLEAN MODE);
      2. 输出句子必须包含 computer,且一定不包含 weather。

        SELECT * FROM my_table WHERE MATCH (doc) AGAINST ("+computer -weather" IN BOOLEAN MODE);
      3. 输出句子必须包含 computer,有 oceanbase 更匹配。

        SELECT * FROM my_table WHERE MATCH (doc) AGAINST ("+computer oceanbase" IN BOOLEAN MODE);

使用说明

MATCH AGAINST 表达式除了表示文本匹配外,在 SQL 的不同子句中出现时还包含一些其他的语义:

  • 投影相关性:

    • 全文搜索支持基于词频的 ranking,可以通过 MATCH AGAINST 表达式表示相关性的投影。
    • 相关性表示匹配的数据行与 MATCH AGAINST 的查询(Query)之间的相关程度。
    • 相关性是大于等于 0 的 DOUBLE 类型数据,等于 0 表示数据行与查询(Query)无关,数值越大相关性越强。
  • 过滤语义:

    • 对于在 WHERE 子句中出现的 MATCH AGAINST 表达式,且与其他所有过滤条件通过 AND 连接时包含过滤语义,表示会过滤掉不匹配的数据行。

      示例如下:

      SELECT id, digest, detail
      FROM t1
      WHERE MATCH (detail) AGAINST ('oceanbase');

      等价于

      SELECT id, digest, detail
      FROM t1
      WHERE MATCH (detail) AGAINST ('oceanbase') > 0;
  • 排序语义:

    • 对于 WHERE 子句中和其他过滤条件通过 AND 连接的 MATCH AGAINST 表达式,包含排序语义。
    • 表示会对全文搜索的结果按 MATCH AGAINST 的相关序 ranking 降序排序。
    • 当存在多个 MATCH AGAINST 表达式时,会按第一个 MATCH AGAINST 的相关性排序。

    示例如下:

    SELECT id, digest, MATCH (detail) AGAINST ('oceanbase') AS relevance
    FROM t1
    WHERE MATCH (detail) against ('oceanbase');

    等价于

    SELECT id, digest, MATCH (detail) AGAINST ('oceanbase') AS relevance
    FROM t1
    WHERE MATCH (detail) AGAINST ('oceanbase')
    ORDER BY relevance DESC;

文本检索的执行方式

  • 当 SQL 中的 MATCH AGAINST 包含过滤语义时,seekdb 可以通过扫描全文索引计算 MATCH AGAINST 表达式后回表的方式执行查询。

  • 当 SQL 中的 MATCH AGAINST 不包含过滤语义时,seekdb 支持通过其他二级索引进行扫描和过滤,并根据主键随机访问全文索引的方式计算 MATCH AGAINST 表达式来执行查询。

  • 当 SQL 的 WHERE 子句中包含 MATCH AGAINST 和通过 AND 连接的多个过滤条件,且可以命中其他二级索引时,seekdb 会尝试根据代价选择合适的索引进行扫描。

  • 当通过扫描全文索引计算 MATCH AGAINST 时,如果 Query 中包含 LIMIT,seekdb 会尝试将 top-k 计算下压到全文索引扫描中执行来获得更好的性能。

  • 在查询包含 OR 连接的 MATCH AGAINST 谓词时,会自动尝试生成 Index Merge 计划,最终是否选择 Index Merge 计划取决于代价竞争的结果。

示例

  1. 创建示例表 test_tbl1,同时创建全文索引 ft_idx1_test_tbl1ft_idx1_test_tbl1,及索引 idx_test_tbl1

    CREATE TABLE test_tbl1(id INT, ref_no INT, digest VARCHAR(512), detail VARCHAR(4096),
    FULLTEXT INDEX ft_idx1_test_tbl1(detail),
    FULLTEXT INDEX ft_idx2_test_tbl1(digest, detail),
    INDEX idx_test_tbl1 (id));
  2. 向表 test_tbl1 中插入测试数据。

    INSERT INTO test_tbl1 VALUES
    (1, 1234, 'fulltext', 'Try text retrieval with OceanBase fulltext index'),
    (2, 2345, 'log', 'OceanBase can halp with log analysis'),
    (3, 3456, 'order', 'Simple text retrieval scan will return result set in order of descending ranking in OceanBase'),
    (4, 4567, 'ranking', 'OceanBase will ranking relevance to query for matched result set'),
    (5, 5678, 'filter', 'Using text retrieval as a filter condition would be probably more efficient');
  3. 通过 MATCH AGAINST 表达式投影相关性。

    SELECT id, digest, detail, MATCH (detail) AGAINST ('oceanbase') AS relevance
    FROM test_tbl1;

    返回结果如下:

    +------+----------+-----------------------------------------------------------------------------------------------+--------------------+
    | id | digest | detail | relevance |
    +------+----------+-----------------------------------------------------------------------------------------------+--------------------+
    | 1 | fulltext | Try text retrieval with OceanBase fulltext index | 0.2989130434782609 |
    | 2 | log | OceanBase can halp with log analysis | 0.3142857142857143 |
    | 3 | order | Simple text retrieval scan will return result set in order of descending ranking in OceanBase | 0.240174672489083 |
    | 4 | ranking | OceanBase will ranking relevance to query for matched result set | 0.2849740932642488 |
    | 5 | filter | Using text retrieval as a filter condition would be probably more efficient | 0 |
    +------+----------+-----------------------------------------------------------------------------------------------+--------------------+
    5 rows in set (0.003 sec)
  4. 通过 MATCH AGAINST 表达式进行过滤。

    SELECT id, digest, detail
    FROM test_tbl1
    WHERE MATCH (detail) AGAINST ('oceanbase');

    返回结果如下:

    +------+----------+-----------------------------------------------------------------------------------------------+
    | id | digest | detail |
    +------+----------+-----------------------------------------------------------------------------------------------+
    | 2 | log | OceanBase can halp with log analysis |
    | 1 | fulltext | Try text retrieval with OceanBase fulltext index |
    | 4 | ranking | OceanBase will ranking relevance to query for matched result set |
    | 3 | order | Simple text retrieval scan will return result set in order of descending ranking in OceanBase |
    +------+----------+-----------------------------------------------------------------------------------------------+
    4 rows in set (0.002 sec)
  5. 通过 MATCH AGAINST 表达式进行检索和排序。

    SELECT id, digest, MATCH (detail) AGAINST ('oceanbase') AS relevance
    FROM test_tbl1
    WHERE MATCH (detail) AGAINST ('oceanbase');

    或者

    SELECT id, digest, MATCH (detail) AGAINST ('oceanbase') AS relevance
    FROM test_tbl1
    WHERE MATCH (detail) AGAINST ('oceanbase')
    ORDER BY relevance DESC;

    返回结果如下:

    +------+----------+--------------------+
    | id | digest | relevance |
    +------+----------+--------------------+
    | 2 | log | 0.3142857142857143 |
    | 1 | fulltext | 0.2989130434782609 |
    | 4 | ranking | 0.2849740932642488 |
    | 3 | order | 0.240174672489083 |
    +------+----------+--------------------+
    4 rows in set (0.002 sec)

文本检索执行计划示例

  1. MATCH AGAINST 包含过滤语义,通过全文索引 ft_idx1_test_tbl1 扫描。

    EXPLAIN SELECT id, digest, detail
    FROM test_tbl1
    WHERE MATCH (detail) AGAINST ('oceanbase')
    AND id = 3;

    返回结果如下:

    +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Query Plan |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | ============================================================================= |
    | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
    | ----------------------------------------------------------------------------- |
    | |0 |SORT | |1 |30 | |
    | |1 |└─TEXT RETRIEVAL SCAN|test_tbl1(ft_idx1_test_tbl1)|1 |30 | |
    | ============================================================================= |
    | Outputs & filters: |
    | ------------------------------------- |
    | 0 - output([test_tbl1.id], [test_tbl1.digest], [test_tbl1.detail]), filter(nil), rowset=16 |
    | sort_keys([MATCH(test_tbl1.detail) AGAINST('oceanbase'), DESC]) |
    | 1 - output([test_tbl1.detail], [test_tbl1.id], [test_tbl1.digest], [MATCH(test_tbl1.detail) AGAINST('oceanbase')]), filter([test_tbl1.id = 3]), rowset=16 |
    | access([test_tbl1.__pk_increment], [test_tbl1.detail], [test_tbl1.id], [test_tbl1.digest]), partitions(p0) |
    | is_index_back=true, is_global_index=false, filter_before_indexback[false], |
    | calc_relevance=true, match_expr(MATCH(test_tbl1.detail) AGAINST('oceanbase')), |
    | pushdown_match_filter(MATCH(test_tbl1.detail) AGAINST('oceanbase')) |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    15 rows in set (0.008 sec)
  2. MATCH AGAINST 包含过滤语义,但命中普通二级索引 idx_test_tbl1,且 idx_test_tbl1 索引扫描代价更低,通过索引 idx_test_tbl1 扫描。

    EXPLAIN SELECT id
    FROM test_tbl1
    WHERE MATCH (detail) AGAINST ('oceanbase simple text retreival scan on fulltext index')
    AND id = 3;

    返回结果如下:

    +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Query Plan |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | ====================================================================== |
    | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
    | ---------------------------------------------------------------------- |
    | |0 |SORT | |1 |103 | |
    | |1 |└─TABLE RANGE SCAN|test_tbl1(idx_test_tbl1)|1 |103 | |
    | ====================================================================== |
    | Outputs & filters: |
    | ------------------------------------- |
    | 0 - output([test_tbl1.id]), filter(nil), rowset=16 |
    | sort_keys([MATCH(test_tbl1.detail) AGAINST('oceanbase simple text retreival scan on fulltext index'), DESC]) |
    | 1 - output([test_tbl1.id], [MATCH(test_tbl1.detail) AGAINST('oceanbase simple text retreival scan on fulltext index')]), filter([MATCH(test_tbl1.detail) |
    | AGAINST('oceanbase simple text retreival scan on fulltext index')]), rowset=16 |
    | access([test_tbl1.__pk_increment], [test_tbl1.id]), partitions(p0) |
    | is_index_back=false, is_global_index=false, filter_before_indexback[false], |
    | range_key([test_tbl1.id], [test_tbl1.__pk_increment]), range(3,MIN ; 3,MAX), |
    | range_cond([test_tbl1.id = 3]), has_functional_lookup=true |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    16 rows in set (0.006 sec)
  3. MATCH AGAINST 不包含过滤语义,无法过滤结果集,无法通过全文索引扫描,通过主表扫描。

    EXPLAIN SELECT id, digest, detail
    FROM test_tbl1
    WHERE MATCH (detail) AGAINST ('oceanbase')
    OR id = 3;

    返回结果如下:

    +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Query Plan |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | ==================================================== |
    | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
    | ---------------------------------------------------- |
    | |0 |TABLE FULL SCAN|test_tbl1|5 |503 | |
    | ==================================================== |
    | Outputs & filters: |
    | ------------------------------------- |
    | 0 - output([test_tbl1.id], [test_tbl1.digest], [test_tbl1.detail]), filter([MATCH(test_tbl1.detail) AGAINST('oceanbase') OR test_tbl1.id = 3]), rowset=16 |
    | access([test_tbl1.__pk_increment], [test_tbl1.detail], [test_tbl1.id], [test_tbl1.digest]), partitions(p0) |
    | is_index_back=false, is_global_index=false, filter_before_indexback[false], |
    | range_key([test_tbl1.__pk_increment]), range(MIN ; MAX)always true, has_functional_lookup=true |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    11 rows in set (0.005 sec)
  4. MATCH AGAINST 不包含过滤语义,无法过滤结果集,无法通过全文索引扫描,命中索引 idx_test_tbl1,通过索引 idx_test_tbl1 扫描。

    EXPLAIN SELECT id
    FROM test_tbl1
    WHERE MATCH (detail) AGAINST ('oceanbase')
    OR id = 3;

    返回结果如下:

    +------------------------------------------------------------------------------------------------------------------------------+
    | Query Plan |
    +------------------------------------------------------------------------------------------------------------------------------+
    | =================================================================== |
    | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
    | ------------------------------------------------------------------- |
    | |0 |TABLE FULL SCAN|test_tbl1(idx_test_tbl1)|5 |503 | |
    | =================================================================== |
    | Outputs & filters: |
    | ------------------------------------- |
    | 0 - output([test_tbl1.id]), filter([MATCH(test_tbl1.detail) AGAINST('oceanbase') OR test_tbl1.id = 3]), rowset=16 |
    | access([test_tbl1.__pk_increment], [test_tbl1.id]), partitions(p0) |
    | is_index_back=false, is_global_index=false, filter_before_indexback[false], |
    | range_key([test_tbl1.id], [test_tbl1.__pk_increment]), range(MIN,MIN ; MAX,MAX)always true, has_functional_lookup=true |
    +------------------------------------------------------------------------------------------------------------------------------+
    11 rows in set (0.005 sec)
  5. MATCH AGAINST 包含过滤语义,命中对应全文索引 ft_idx1_test_tbl1 扫描,同时 LIMIT 作为 top-k 下推索引扫描中。

    EXPLAIN SELECT id, digest, detail
    FROM test_tbl1
    WHERE MATCH (detail) AGAINST ('oceanbase')
    LIMIT 3;

    返回结果如下:

    +------------------------------------------------------------------------------------------------------------------+
    | Query Plan |
    +------------------------------------------------------------------------------------------------------------------+
    | =========================================================================== |
    | |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
    | --------------------------------------------------------------------------- |
    | |0 |TEXT RETRIEVAL SCAN|test_tbl1(ft_idx1_test_tbl1)|4 |31 | |
    | =========================================================================== |
    | Outputs & filters: |
    | ------------------------------------- |
    | 0 - output([test_tbl1.id], [test_tbl1.digest], [test_tbl1.detail]), filter(nil), rowset=16 |
    | access([test_tbl1.__pk_increment], [test_tbl1.detail], [test_tbl1.id], [test_tbl1.digest]), partitions(p0) |
    | is_index_back=true, is_global_index=false, |
    | calc_relevance=true, match_expr(MATCH(test_tbl1.detail) AGAINST('oceanbase')), |
    | pushdown_match_filter(MATCH(test_tbl1.detail) AGAINST('oceanbase')), |
    | sort_keys([MATCH(test_tbl1.detail) AGAINST('oceanbase'), DESC]), limit(3), offset(nil), with_ties(false) |
    +------------------------------------------------------------------------------------------------------------------+
    13 rows in set (0.004 sec)

相关文档

全文查询