BigQuery datasets: Tables

Feb 13, 2024

BigQuery tables are key to Google's serverless data warehouse, enabling efficient data storage, access, and analysis. This guide covers the essentials of BigQuery table management, including creation, access control, and leveraging advanced features like external tables. It provides a concise overview for effectively utilizing BigQuery for data analysis and management.

Sources:

Google BiqQuery’s official documentation on tables

Contents:

BigQuery tables

BigQuery standard tables store structured data in a columnar format, including:

Tables: These have a defined schema with typed columns.
Table Clones: Writable, lightweight copies storing only changes from the base table.
Table Snapshots: Read-only, point-in-time copies, storing differences from the base table, allowing restoration.

Limitations:

Table names must be unique per dataset.
When you export BigQuery table data, the only supported destination is Cloud Storage.
When you use an API call, enumeration performance slows as you approach 50,000 tables in a dataset.
The Google Cloud console can display up to 50,000 tables for each dataset.

Table naming:

Unique per dataset, up to 1,024 UTF-8 bytes.
Can include Unicode characters in certain categories.
Examples: table 01, ग्राहक, 00_お客様, étudiant-01.
Caveats:
- Case-sensitive by default.
- Some names are reserved; choose a different name if an error occurs.
- Duplicate dot operators are implicitly removed.
Example:
- Before: project_name....dataset_name..table_name
- After: project_name.dataset_name.table_name

Create and use tables:

Create an empty table
```
CREATE TABLE `myproject.mydataset.newtable` (
  column1_name datatype [OPTIONS],
  column2_name datatype [OPTIONS],
  ...
) 
OPTIONS (
  expiration_timestamp = TIMESTAMP 'YYYY-MM-DD HH:MM:SS UTC',
  description = 'Description of the table',
  labels = [('label1_key', 'label1_value'), ('label2_key', 'label2_value')]
);
```
- In BigQuery, options provide additional settings or metadata for tables. Here's a concise explanation of the options used in the provided template:
  - expiration_timestamp: Sets a timestamp for table expiration.
  - description: Provides a brief description of the table.
  - labels: Assigns metadata labels to the table for organization.

Create a table from a query result

CREATE TABLE `myproject.mydataset.new_table` AS (
  SELECT
    column1_name,
    column2_name,
    ...
  FROM
    dataset_name.table_name
);

For more information see the official documentation

Access to the tables

Access control in BigQuery involves granting IAM roles to entities at different levels of the Google Cloud resource hierarchy, such as project, folder, or organization, as well as at the dataset or table/view level.
Levels of access control:
1. High level in the resource hierarchy (e.g., project): Provides broad access to multiple datasets.
2. Dataset level: Specifies operations allowed on tables/views within a specific dataset.
3. Table/view level: Specifies operations allowed on individual tables/views.
Access is additive: Granting access at a lower level overrides lack of access at higher levels.
Methods for restricting data access within tables:
- Column-level security
- Column data masking
- Row-level security
Custom IAM roles can be created to tailor permissions for specific operations.
It's not possible to set "deny" permissions on resources protected by IAM.
For more information see the official documentation

Get information about tables

INFORMATION_SCHEMA is a series of views that provide access to metadata about datasets, routines, tables, views, jobs, reservations, and streaming data.
You can query the following views to get table information:
- Use the INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.TABLE_OPTIONS views to retrieve metadata about tables and views in a project.
- Use the INFORMATION_SCHEMA.COLUMNS and INFORMATION_SCHEMA.COLUMN_FIELD_PATHS views to retrieve metadata about the columns (fields) in a table.
- Use the INFORMATION_SCHEMA.TABLE_STORAGE views to retrieve metadata about current and historical storage usage by a table.
For more information see the official documentation

Specify table schemas

Schema components
- Column names (required)
  - BigQuery column names must start with a letter or underscore, can include letters, numbers, and underscores, are capped at 300 characters, and cannot use reserved prefixes. Duplicate names with different cases are not permitted. Flexible naming extends character support for diverse languages and symbols.
- Column descriptions (optional)
  - Each column can include an optional description. The description is a string with a maximum length of 1,024 characters.
- Default values (optional)
  - The default value for a column must be a literal or one of the following functions:
    - CURRENT_DATE
    - CURRENT_DATETIME
    - CURRENT_TIME
    - CURRENT_TIMESTAMP
    - GENERATE_UUID
    - RAND
    - SESSION_USER
    - ST_GEOGPOINT
- GoogleSQL Data Types (required)
- Modes (optional)
  - Mode is optional. If the mode is unspecified, the column defaults to NULLABLE.
- Rounding mode (optional)
  - For NUMERIC or BIGNUMERIC columns in BigQuery, you can specify rounding_mode to control how values are rounded:
    1. ROUND_HALF_AWAY_FROM_ZERO (default) rounds away from zero,
    2. ROUND_HALF_EVEN rounds towards the nearest even number.
  - This option isn't available for other data types.
Specify the schema
```
CREATE TABLE IF NOT EXISTS mydataset.newtable (x INT64, y STRING, z BOOL)
  OPTIONS(
    description = 'My example table');
```
- Specifying a schema in the API
  - To define a table schema via API in BigQuery, use jobs.insert for loading data with a schema or tables.insert to create a table with a schema. Configure the schema in the respective resource properties.
- For more information see the official documentation

Schema of table with nested and repeated columns

Nested columns in BigQuery are defined using the RECORD data type, acting like a STRUCT to contain ordered fields.
Repeated columns are set with the mode REPEATED, functioning as an ARRAY.
A RECORD can be repeated, representing an array of STRUCTs, and a RECORD's field can also be repeated, shown as a STRUCT containing an ARRAY. However, directly nesting arrays within each other is not allowed.
Note the limitations: no more than 15 levels of nested RECORD types and RECORD is incompatible with certain SQL operations.

Create table (repeated and nested)

CREATE TABLE IF NOT EXISTS dataset_name.table_name (
  simple_field_name SIMPLE_DATA_TYPE,
  repeated_struct_field_name 
    ARRAY<
      STRUCT<
        nested_field_name DATA_TYPE,
        ...
      >
    >,
  ...
) OPTIONS (
  description = 'Your table description'
  ...
);

Insert table (repeated and nested)

INSERT INTO dataset_name.table_name (column1, column2, ..., complex_column) 
VALUES 
(value1, 
value2, 
..., 
ARRAY<
  STRUCT<
    nested_field1, 
    nested_field2, 
    ...
  >
>
([("nested_value1", "nested_value2", ...)]));

INSERT INTO dataset_name.table_name (column1, column2, ..., complex_column) 
VALUES (
  "value1", 
  "value2", 
  ...
  [("nested_value1", "nested_value2", ...)]
);

Query table (repeated and nested)

SELECT
  column1,
  column2,
  complex_column[offset(0)].nested_field
FROM
  dataset_name.table_name;

Deduplicate duplicate records in a table

CREATE OR REPLACE TABLE dataset_name.table_name 
AS (
  SELECT * 
  EXCEPT(row_num) 
  FROM ( 
    SELECT *, ROW_NUMBER() OVER (PARTITION BY deduplicate_columns ORDER BY order_column) AS row_num 
    FROM dataset_name.table_name 
  ) AS temp_table 
  WHERE row_num = 1 
);

Schema auto-detection
- Schema auto-detection in BigQuery allows for automatic schema inference for CSV, JSON, and Google Sheets during data loading or querying external sources. It scans a sample of data to determine data types for each column.
- For self-describing formats like Avro, Parquet, ORC, Firestore, and Datastore exports, BigQuery automatically infers schemas without needing to enable auto-detection.
- However, you can manually provide a schema for these formats if desired.
- Detected schemas can be viewed in the Google Cloud console or using the `bq show` command.
- For more information see the official documentation
Modify table schemas
- Most schema modifications can be done using SQL data definition language (DDL) statements, which are free of charge.
- To modify a table schema, you can export the table data to Cloud Storage, then load it into a new table with the modified schema. While BigQuery load and export jobs are free, there are costs for storing exported data in Cloud Storage.
- Note: Immediate schema changes might not be reflected in INFORMATION_SCHEMA views. Use the tables.get method for immediate updates.
- For more information see the official documentation
BigQuery’s tables also support table partition and table clustering, discussed separately

External Tables

External data sources allow direct querying in BigQuery, even if the data resides outside BigQuery storage. For instance, data might be stored in another Google Cloud database, Cloud Storage files, or a different cloud service. This setup enables analysis in BigQuery without migration.

Use cases for external data sources include:

Streamlining extract-load-transform (ELT) workflows by loading and cleansing data in one step, then saving the cleaned result to BigQuery storage using a CREATE TABLE ... AS SELECT query.
Joining BigQuery tables with frequently updated data from external sources. Querying directly avoids reloading data into BigQuery storage for each change.

BigQuery offers two mechanisms for querying external data: external tables and federated queries.

External tables in BigQuery point to data stored outside BigQuery, including:
- BigLake tables: Target structured data in services like Cloud Storage, Amazon S3, and Azure Blob Storage, with a focus on enforcing detailed security at the table level.
- Object tables: Link to unstructured data in Cloud Storage, with security considerations for accessing such data.
- Non-BigLake external tables: Connect to structured data in Cloud Storage, Google Drive, and Bigtable, lacking the advanced security features of BigLake tables.
Federated queries enable sending query statements to Spanner or Cloud SQL databases and receiving the results as a temporary table. Leveraging the BigQuery Connection API, federated queries establish connections with Spanner or Cloud SQL. By employing the EXTERNAL_QUERY function in your query, you can transmit a query statement to the external database using its SQL dialect. The results are then converted to GoogleSQL data types.

External data source feature comparison

For more information see the official documentation

Ramazan’s Substack

Discussion about this post