Skip to main content

CREATE EXTERNAL TABLE

Description

This statement is used to create a new external table in the database.

An external table is a key feature of a database management system. In a database, tables are stored in the database's storage space, while the data of an external table is stored in an external storage service.

When creating an external table, you need to define the file path and file format of the data. After that, users can read the data from the external storage service through the external table. External tables are read-only and can be used in query statements, but DML operations cannot be performed on them. External tables do not support defining constraints or creating indexes.

Syntax

CREATE EXTERNAL TABLE table_name (column_definition_list)
LOCATION = 'file_name'
{FORMAT = (format_type_options)
| PROPERTIES = (properties_type_options)}
[PARTITION BY (column_name [, column_name ...])]
[PARTITION_TYPE = USER_SPECIFIED]
[PATTERN = 'regex_pattern']
[AUTO_REFRESH = 'xxx'];

column_definition_list:
column_definition [, column_definition ...]

column_definition:
column_name column_type [AS expr]

format_type_options:
type_csv_option
| type_parquet_option
| type_orc_option

type_csv_option:
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = {TRUE | FALSE}
TRIM_SPACE = {TRUE | FALSE}
EMPTY_FIELD_AS_NULL = {TRUE | FALSE}
IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}

type_parquet_option:
TYPE = 'PARQUET'

type_orc_option:
TYPE = 'ORC'

properties_type_options:
type_odps_option

type_odps_option:
TYPE = 'ODPS'
ACCESSID = '<string>'
ACCESSKEY = '<string>'
ENDPOINT = '<string>',
PROJECT_NAME = '<string>',
SCHEMA_NAME = '<string>',
TABLE_NAME = '<string>',
QUOTA_NAME = '<string>',
COMPRESSION_CODE = '<string>',
API_MODE = {"tunnel_api" | "storage_api"},
SPLIT = {"byte" | "row"}

Parameters

ParameterDescription
table_nameThe name of the external table to be created.
column_nameThe name of the column of the external table. By default, the data columns in the file are automatically mapped to the columns defined in the external table in sequence.
column_typeThe type of the column of the external table. Constraints such as DEFAULT, NOT NULL, UNIQUE, CHECK, PRIMARY KEY, and FOREIGN KEY cannot be defined.
AS exprManually specifies the column mapping. If the order of the columns in the file does not match the order of the columns defined in the external table, you can use the pseudo column metadata$filecol{N} to specify the mapping between the columns of the external table and the Nth column in the file. For example, c2 INT AS (metadata$filecol4) indicates that the c2 column of the external table corresponds to the 4th column in the file. Note that if you manually specify column mappings, the automatic mapping relationships will be disabled, and all columns must be manually defined with mapping relationships.
LOCATION = 'file_name'The path where the external table files are stored. Usually, the data files of the external table are stored in a separate directory, which can contain subdirectories. When you create an external table, it automatically collects all files in this directory. For more information, see file_name below.
FORMAT = (format_type_options)The attributes of the external file format. Use TYPE to specify the file format for export, such as CSV, PARQUET, or ORC. The TYPE parameter cannot be empty. For more information, see format_type_options below.
PROPERTIES = (properties_type_options)The attributes of the external file format. Use TYPE to specify the file format for export, such as ODPS. The TYPE parameter cannot be empty. For more information, see properties_type_options below.
PATTERNThe regular expression pattern for filtering files in the LOCATION directory. For each file path in the LOCATION directory, if the path matches the pattern, the external table accesses the file; otherwise, it skips the file. If you do not specify this parameter, the external table accesses all files in the LOCATION directory by default. The external table saves the list of files in the LOCATION directory that match the PATTERN in the database system table. When the external table scans files, it accesses the external files based on this list.
PARTITION_TYPE = USER_SPECIFIEDIf you want to manually add and delete partitions instead of letting the external table automatically manage partitions, specify the PARTITION_TYPE = USER_SPECIFIED parameter.
AUTO_REFRESH = 'xxx'Specifies whether to enable automatic refresh for the external table. Valid values:
  • OFF: The default value. Automatic refresh is disabled.
  • INTERVAL: The external table can be managed by dbms_external_table to set automatic refresh rules.
  • IMMEDIATE: Each time an SQL statement accesses the external table, the META data of the external table is refreshed.

file_name

The path of the external file is in the following format:

  • If the file is stored locally, the LOCATION parameter is in the LOCATION = '[file://] local_file_path' format. In this case, local_file_path can be a relative path or an absolute path. If you specify a relative path, the current directory must be the installation directory of seekdb. The secure_file_priv parameter specifies the file path that seekdb can access. local_file_path must be a subpath of secure_file_priv.

  • If the file is stored remotely, the LOCATION parameter is in the following format:

    tip

    When you use an object storage path, the parameters of the object storage path are separated by &. Make sure that the parameter values contain only uppercase and lowercase letters, digits, and the characters /-_$+=, and wildcards. Otherwise, the setting may fail.

    • If the file is stored in OSS or S3, the LOCATION parameter is in the LOCATION = '{oss\|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path' format. In this case, $ACCESS_ID, $ACCESS_KEY, and $HOST are the access information required for accessing Alibaba Cloud OSS, AWS S3, and object storage services that are compatible with the S3 protocol. s3_region is the region information selected when you use S3. These sensitive access information are stored in the system table of the database in an encrypted manner.

format_type_options

  • TYPE = 'CSV': specifies the format of the external file as CSV. It also contains the following fields:

    • LINE_DELIMITER: specifies the line delimiter of the CSV file. The default value is LINE_DELIMITER='\n'.

    • FIELD_DELIMITER: specifies the column delimiter of the CSV file. The default value is FIELD_DELIMITER='\t'.

    • ESCAPE: specifies the escape character of the CSV file. It can be only one byte. The default value is ESCAPE ='\'.

    • FIELD_OPTIONALLY_ENCLOSED_BY: specifies the character that wraps the field values in the CSV file. The default value is empty.

      tip

      If the external file contains NULL values (non-string NULL values, that is, values other than "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter and the value of the parameter cannot be empty.

    • ENCODING: specifies the character set encoding format of the file. For information about all character sets supported in MySQL mode, see Character sets. If you do not specify this parameter, the default value is UTF8MB4.

    • NULL_IF: specifies the string that is treated as NULL. The default value is empty.

    • SKIP_HEADER: specifies the number of header rows to skip.

    • SKIP_BLANK_LINES: specifies whether to skip blank lines. The default value is FALSE, which indicates that blank lines are not skipped.

    • TRIM_SPACE: specifies whether to remove leading and trailing spaces in the fields of the file. The default value is FALSE, which indicates that leading and trailing spaces in the fields of the file are not removed.

    • EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings as NULL. The default value is FALSE, which indicates that empty strings are not treated as NULL.

    • IGNORE_LAST_EMPTY_COLUMN: specifies whether to ignore the last empty field at the end of a line. The default value is TRUE, which indicates that the last empty field at the end of a line is ignored.

  • TYPE = 'PARQUET': specifies the format of the external file as PARQUET.

  • TYPE = 'ORC': specifies the format of the external file as ORC.

properties_type_options

  • TYPE = 'ODPS': specifies the format of the external file as ODPS. It also contains the following fields:

    • ACCESSID: specifies the AccessKey ID of the Alibaba Cloud account for identity authentication.

    • ACCESSKEY: specifies the AccessKey secret corresponding to the AccessKey ID for identity verification.

    • ENDPOINT: specifies the connection address of the ODPS service.

    • PROJECT_NAME: specifies the name of the ODPS project to be accessed.

    • SCHEMA_NAME: specifies the schema name in ODPS. This parameter is optional.

    • TABLE_NAME: specifies the name of the target table in ODPS.

    • QUOTA_NAME: specifies the quota to be used. This parameter is optional.

    • COMPRESSION_CODE: specifies the compression format of the data source. Valid values: ZLIB, ZSTD, LZ4, and ODPS_LZ4. If you do not specify this parameter, compression is not enabled.

    • API_MODE: specifies the API mode for accessing ODPS. Valid values:

      • tunnel_api (default):

        • No special network configuration is required. This mode is applicable to all deployment scenarios and does not require seekdb and MaxCompute to be deployed in the same VPC (Virtual Private Cloud).

        • No additional MaxCompute permissions are required. You only need to provide the AccessID and AccessKey for authentication. You do not need to enable the MaxCompute Storage API permission.

        • Applicable scenarios:

          • seekdb and MaxCompute are not deployed in the same VPC.
          • The MaxCompute Storage API is not enabled.
          • The data transmission has low latency requirements.
      • storage_api:

        • Network dependency: seekdb and MaxCompute must be deployed in the same VPC to achieve low-latency, high-throughput data transmission.

        • Permission dependency: You must enable the Storage API permission in MaxCompute and ensure that the AccessKey has the corresponding permissions.

        • Applicable scenarios:

          • seekdb and MaxCompute are deployed in the same VPC.
          • The MaxCompute Storage API is enabled.
          • The data volume is extremely large or the real-time requirement is high.
    • SPLIT: specifies whether to split the task by byte or row when you use the storage_api mode. If the data size of each row in a table varies significantly, set SPLIT to byte. Otherwise, set SPLIT to row.

Usage notes

  • When an external file is deleted, the file no longer exists in the file list of the external table. In this case, the external table ignores the file.
  • When an external file is modified, the external table accesses the latest data of the external file. If the modification of the external file and the query of the external table are performed concurrently, unexpected results may occur. Therefore, avoid modifying the external file while querying the external table.
  • When a new file is added to the external directory, the external table accesses only the files in the file list. To add the new file to the file list of the external table, you must update the external table files.

Example

  1. Prepare data. First, set the secure_file_priv path to /home/admin/ and place the CSV file extdata.csv corresponding to the external table data in the /home/admin/test directory of the local connection to seekdb.

    Here is an example of setting the global secure file path.

    SET GLOBAL secure_file_priv = ""
    Query OK, 0 rows affected
    obclinet> \q
    Bye
    info

    Since secure_file_priv is a GLOBAL variable, you need to execute \q to exit and make the setting take effect.

    The content of the CSV file is as follows:

    1,'Dave','Smith','dsmith@outlook.com','friend',32
    2,'Xena','Johnson','xjonson@outlook.com','contact',45
    3,'Fred','Jackon','fjackson@outlook.com','co-worker',19
    4,'Alma','Tyler','atyler@outlook.com','friend',53
  2. Log in to the database and create the external table contacts.

    CREATE EXTERNAL TABLE contacts (
    id INT,
    firstname VARCHAR(100),
    lastname VARCHAR(100),
    email VARCHAR(255),
    category CHAR(30),
    age NUMBER
    )
    LOCATION = '/home/admin/test'
    FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY ='\''
    )
    PATTERN = 'extdata.csv';
  3. Query the data in the external table contacts.

    SELECT * FROM contacts;
    +----+-----------+----------+----------------------+-----------+------+
    | id | firstname | lastname | email | category | age |
    +----+-----------+----------+----------------------+-----------+------+
    | 1 | Dave | Smith | dsmith@outlook.com | friend | 32 |
    | 2 | Xena | Johnson | xjonson@outlook.com | contact | 45 |
    | 3 | Fred | Jackon | fjackson@outlook.com | co-worker | 19 |
    | 4 | Alma | Tyler | atyler@outlook.com | friend | 53 |
    +----+-----------+----------+----------------------+-----------+------+
    4 rows in set (0.003 sec)

References