Skip to main content

Semi-structured encoding

This topic describes the semi-structured encoding feature supported by seekdb.

seekdb supports enabling semi-structured encoding when creating tables, primarily controlled by the table-level parameter SEMISTRUCT_PROPERTIES. You must also set ROW_FORMAT=COMPRESSED for the table, otherwise an error will occur.

Considerations

  • When SEMISTRUCT_PROPERTIES=(encoding_type=encoding), the table is considered a semi-structured table, meaning all JSON columns in the table will have semi-structured encoding enabled.
  • When SEMISTRUCT_PROPERTIES=(encoding_type=none), the table is considered a structured table.
  • You can also set the frequency threshold using the freq_threshold parameter. When semi-structured encoding is enabled, the system analyzes the frequency of each path in the JSON data and stores paths with frequencies exceeding the specified threshold as independent subcolumns, known as frequent columns. For example, if you have a user table where the JSON field stores user information and 90% of users have the name and age fields, the system will automatically extract name and age as independent frequent columns. During queries, these columns are accessed directly without parsing the entire JSON, thereby improving query performance.
  • Currently, encoding_type and freq_threshold can only be modified using online DDL statements, not offline DDL statements.

Data format

JSON data is split and stored as structured columns in a specific format. The columns split from JSON columns are called subcolumns. Subcolumns can be categorized into different types, including sparse columns and frequent columns.

  • Sparse columns: Subcolumns that exist in some JSON documents but not in others, with an occurrence frequency lower than the threshold specified by the table-level parameter freq_threshold.
  • Frequent columns: Subcolumns that appear in JSON data with a frequency higher than the threshold specified by the table-level parameter freq_threshold. These subcolumns are stored as independent columns to improve filtering query performance.

For example:

{"id": 1001, "name": "n1", "nickname": "nn1"}
{"id": 1002, "name": "n2", "nickname": "nn2"}
{"id": 1003, "name": "n3", "nickname": "nn3"}
{"id": 1004, "name": "n4", "nickname": "nn4"}
{"id": 1005, "name": "n5"}

In this example, id and name are fields that exist in every JSON document with an occurrence frequency of 100%, while nickname exists in only four JSON documents with an occurrence frequency of 80%.

If freq_threshold is set to 100%, then nickname will be inferred as a sparse column, while id and name will be inferred as frequent columns. If set to 80%, then nickname, id, and name will all be inferred as frequent columns.

Examples

  1. Enable semi-structured encoding.

    tip

    If you enable semi-structured encoding, make sure that the parameter micro_block_merge_verify_level is set to the default value 2. Do not disable micro-block major compaction verification.

    tab Example: Enable semi-structured encoding during table creation

    CREATE TABLE t1(  j json)
    ROW_FORMAT=COMPRESSED
    SEMISTRUCT_PROPERTIES=(encoding_type=encoding, freq_threshold=50);

    For more information about the syntax, see CREATE TABLE.

    tab Example: Enable semi-structured encoding for existing table

    CREATE TABLE t1(j json);
    ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=encoding, freq_threshold=50);

    For more information about the syntax, see ALTER TABLE.

    Some modification limitations:

    • If semi-structured encoding is not enabled, modifying the frequent column threshold will not report an error but will have no effect.
    • The freq_threshold parameter cannot be modified during direct load operations or when the table is locked.
    • Modifying one sub-parameter does not affect the others.
  2. Disable semi-structured encoding.

    When SEMISTRUCT_PROPERTIES is set to (encoding_type=none), semi-structured encoding is disabled. This operation does not affect existing data and only applies to data written afterward. Here is an example of disabling semi-structured encoding:

    ALTER TABLE t1 SET ROW_FORMAT=COMPRESSED SEMISTRUCT_PROPERTIES = (encoding_type=none);
  3. Query semi-structured encoding configuration.

    Use the SHOW CREATE TABLE statement to query the semi-structured encoding configuration. Here is an example statement:

    SHOW CREATE TABLE t1;

    The result is as follows:

    +-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Table | Create Table |
    +-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | t1 | CREATE TABLE `t1` (
    `j` json DEFAULT NULL
    ) ORGANIZATION INDEX DEFAULT CHARSET = utf8mb4 ROW_FORMAT = COMPRESSED COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE ENABLE_MACRO_BLOCK_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 SEMISTRUCT_PROPERTIES=(ENCODING_TYPE=ENCODING, FREQ_THRESHOLD=50) |
    +-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set

    When SEMISTRUCT_PROPERTIES=(encoding_type=encoding) is specified, the query displays this parameter, indicating that semi-structured encoding is enabled.

Using semi-structured encoding can improve the performance of conditional filtering queries with the JSON_VALUE() function. Based on JSON semi-structured encoding technology, seekdb optimizes the performance of JSON_VALUE expression conditional filtering query scenarios. Since JSON data is split into sub-columns, the system can filter directly based on the encoded sub-column data without reconstructing the complete JSON structure, significantly improving query efficiency.

Here is an example query:

-- Query rows where the value of the name field is 'Devin'
SELECT * FROM t WHERE JSON_VALUE(j_doc, '$.name' RETURNING CHAR) = 'Devin';

Character set considerations:

  • seekdb uses utf8_bin encoding for JSON.

  • To ensure string whitebox filtering works properly, we recommend the following settings:

    SET @@collation_server = 'utf8mb4_bin';
    SET @@collation_connection='utf8mb4_bin';