BigQuery tables: Partitioned vs Clustered
Google BigQuery documentation covers two powerful features: partitioned tables and clustered tables. Partitioned tables organize data into partitions for efficient storage and querying, while clustered tables enhance query performance by organizing data within storage blocks based on specified column values.
The content explores various aspects of partitioned tables, including partitioning types, browsing partition data, exporting data, and best practices. It similarly delves into cluster column types, block pruning, and best practices for clustered tables, offering insights to optimize data storage and query performance in Google BigQuery.
Sources:
Google BigQuery official documentation on partitioned tables and clustered tables.
Partitioned tables
Partitioned tables in BigQuery are segmented to enhance data management and query efficiency, reducing costs by limiting data scanned during queries.
Partitioning occurs through a specified column, enabling query optimization through pruning (only scanning relevant partitions).
Each partition stores data in separate physical blocks, with metadata facilitating accurate query cost estimation.
Types of partitioning
When partitioning tables in BigQuery, there are three key types of partitioning to consider, each suited to different data management and query optimization needs:
Integer Range Partitioning: This approach divides a table based on ranges of values in a specific INTEGER column. You specify the column, the start (inclusive) and end (exclusive) values for the range, and the interval between ranges. The special partitions __UNPARTITIONED__ and __NULL__ hold rows with values outside the defined range and rows with NULL values in the partitioning column, respectively.

CREATE TABLE mydataset.newtable (
  column_name INT64,
  additional_column DATA_TYPE
)
PARTITION BY RANGE_BUCKET(column_name, GENERATE_ARRAY(start_value, end_value, interval))
OPTIONS (
  require_partition_filter = TRUE_OR_FALSE
);
-- RANGE_BUCKET(column_name, GENERATE_ARRAY(start_value, end_value, interval)):
--   defines the partitioning logic; replace start_value, end_value, and
--   interval with your specific range criteria.
-- require_partition_filter = TRUE_OR_FALSE: opt for TRUE to mandate partition
--   filters in queries, enhancing efficiency and cost-effectiveness.
Time-Unit Column Partitioning: Tables can also be partitioned based on a DATE, TIMESTAMP, or DATETIME column. Data is automatically routed to the correct partition based on the column values, with hourly, daily, monthly, or yearly granularity. The __NULL__ and __UNPARTITIONED__ partitions are used similarly, holding rows with NULL values or values outside the permissible date range.

CREATE TABLE mydataset.newtable (
  transaction_id INT64,
  column_name DATA_TYPE
)
PARTITION BY FUNCTION(column_name, INTERVAL)
OPTIONS (
  partition_expiration_days = N,
  require_partition_filter = TRUE_OR_FALSE
);
-- FUNCTION(column_name, INTERVAL): use DATE_TRUNC, TIMESTAMP_TRUNC, or
-- DATETIME_TRUNC as appropriate, with column_name being your date or timestamp
-- column and INTERVAL being DAY, MONTH, etc.
Ingestion Time Partitioning: Rows are assigned to partitions automatically based on when the data is ingested into BigQuery, with hourly, daily, monthly, or yearly granularity. A pseudocolumn (_PARTITIONTIME or, for daily granularity, _PARTITIONDATE) records the ingestion time truncated to the nearest partition boundary.

CREATE TABLE mydataset.newtable (
  transaction_id INT64
)
PARTITION BY _PARTITIONDATE
OPTIONS (
  partition_expiration_days = N,
  require_partition_filter = TRUE_OR_FALSE
);
-- For daily partitions, use _PARTITIONDATE directly. For monthly (or other
-- intervals), replace _PARTITIONDATE with DATE_TRUNC(_PARTITIONTIME, MONTH)
-- or your desired granularity.
Choosing the right partitioning strategy depends on various factors:
Daily partitioning is the default choice, ideal for datasets that grow over time across a broad date range.
Hourly partitioning suits high-volume, short-range data, ensuring the partition count remains manageable.
Monthly or yearly partitioning is recommended for datasets with small daily volumes spanning a long period or requiring frequent updates.
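For example, ten years of daily partitions is roughly 3,650 partitions, comfortably under the 4,000-partition limit noted below, while ten years of hourly partitions would be about 87,600, far over it; hourly granularity therefore suits only short retention windows.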
Partitioning vs sharding
Partitioning and sharding are two techniques for managing large datasets in databases, including BigQuery, with distinct differences and use cases:
Partitioning involves dividing a single table's data into segments based on certain criteria (e.g., date, integer range). It allows for efficient queries by enabling operations on relevant segments of data rather than the entire dataset. Partitioned tables maintain a single schema and set of metadata, simplifying management and improving query performance through optimizations like pruning.
Sharding, on the other hand, involves splitting data across multiple tables, often using a naming convention (e.g., [PREFIX]_YYYYMMDD for date-based sharding). Each shard acts as a separate table with its own schema and metadata. While sharding can distribute load and potentially improve write performance, it complicates management and querying. Queries across shards require combining data from multiple tables, each of which may have its own access permissions and metadata, leading to higher overhead and slower query performance compared to partitioned tables.
BigQuery favors partitioning over sharding for several reasons:
Performance: Partitioned tables allow BigQuery to optimize queries by scanning only relevant partitions, reducing the amount of data processed and improving query speed.
Simplicity: Managing one partitioned table is simpler than managing multiple sharded tables. With partitioning, there's only one schema and set of metadata to maintain.
Cost Efficiency: By reducing the amount of data scanned, partitioned tables can also help lower query costs.
For those with existing date-sharded tables, BigQuery provides a pathway to convert these tables into ingestion-time partitioned tables, enabling users to benefit from partitioning's advantages without losing historical data.
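As a sketch of that conversion (table names are placeholders), the bq CLI's partition command can merge tables sharded as sourcetable_YYYYMMDD into a single daily-partitioned table:

bq partition --time_partitioning_type=DAY mydataset.sourcetable_ mydataset.mytable_partitioned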
Partition decorators
Partition decorators in BigQuery address a specific partition within a table, using the form table_name$partition_id. The partition_id format varies by partition type; for daily partitioning it is yyyymmdd, giving a decorator like my_table$20210712 to reference the partition for July 12, 2021.
Browse the data in a partition
To view data in a specific partition, use the bq head command with a partition decorator. For example, to see the first 10 rows of my_dataset.my_table for the partition dated February 24, 2018:

bq head --max_rows=10 'my_dataset.my_table$20180224'
Export table data
Exporting data from both partitioned and non-partitioned tables in BigQuery follows the same procedure. To export data from an individual partition, include a partition decorator with the bq extract command. For instance, to export data from the partition dated February 1, 2016:

bq extract 'my_table$20160201' <destination>
For the special partitions __NULL__ (rows with NULL values in the partitioning column) and __UNPARTITIONED__ (rows that don't fit into any partition), append the partition name directly to the table name in your extract command:

For the __NULL__ partition:

bq extract 'my_table$__NULL__' <destination>

For the __UNPARTITIONED__ partition:

bq extract 'my_table$__UNPARTITIONED__' <destination>
Limitations
SQL Compatibility: BigQuery does not allow the use of legacy SQL for querying partitioned tables or for writing query results to partitioned tables. You must use standard SQL for these operations.
Single Column Partitioning: Only one column can be used to partition a table. BigQuery does not support partitioning by multiple columns.
Maximum Partitions: A table can have at most 4,000 partitions.
Time-Unit Column-Partitioned Tables Limitations:
The partitioning column must be a DATE, TIMESTAMP, or DATETIME column, and it can be either REQUIRED or NULLABLE but not REPEATED (no arrays).
This column must be a top-level field, meaning you cannot use a field within a STRUCT (RECORD) as the partitioning column.
Integer-Range Partitioned Tables Limitations:
The partitioning column for these tables must be an INTEGER type, with the same constraints on mode (REQUIRED or NULLABLE, but not REPEATED) as time-unit column-partitioned tables.
Similarly, the partitioning column must be a top-level field, excluding the use of STRUCT fields as the partitioning column.
Managing partitioned tables
Get partition metadata
Using INFORMATION_SCHEMA view
To obtain metadata for the partitions of a partitioned table, query the INFORMATION_SCHEMA.PARTITIONS view. It returns one row per partition with details such as the table name, partition ID, and total rows. For example, to list partitions in a dataset named mydataset:

SELECT table_name, partition_id, total_rows
FROM `mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL;
This query returns one row per partition. The WHERE partition_id IS NOT NULL clause keeps only rows that correspond to actual partitions; to also exclude the __NULL__ and __UNPARTITIONED__ pseudo-partitions, additionally filter out those partition_id values.

Using meta-tables (legacy SQL only)
To access metadata for table partitions using legacy SQL, query the __PARTITIONS_SUMMARY__ meta-table. Note that this approach is specific to legacy SQL and isn't compatible with standard SQL due to syntax differences, particularly the partition decorator separator ($).

#legacySQL
SELECT * FROM [dataset.table$__PARTITIONS_SUMMARY__];
Replace dataset and table with your specific dataset and table names to retrieve metadata such as the project ID, dataset ID, table ID, partition ID, creation time, and last-modified time of each partition.
Partition expiration
In BigQuery, specifying a partition expiration for tables partitioned by ingestion time or by a time-unit column determines how long the data within each partition is retained. This expiration setting applies to all partitions but calculates the expiration time for each partition based on its specific time boundary (e.g., midnight UTC for daily partitions). When a partition reaches its expiration time, BigQuery automatically deletes its data.
Key points include:
Partition Expiration Calculation: For daily partitioned tables, if the expiration is set to 6 hours, data in a partition expires at 06:00:00 UTC the day after its creation.
Dataset-Level Default Expiration: A default partition expiration can be set for the entire dataset, which is overridden by any table-specific expiration setting (see the bq sketch after this list).
Exclusion for Integer-Range Partitions: Integer-range partitioned tables do not support partition expiration settings.
Table Expiration Precedence: If a table's expiration setting is shorter than its partition expiration, the table and its partitions are deleted according to the table's expiration.
Updating Expiration Settings: The partition expiration setting can be updated at any time, affecting all partitions, including existing ones. Partitions older than the updated expiration time are immediately expired.
Effect of Expiration: Expired partitions are removed, eliminating storage charges for those partitions. They remain part of table quotas until physically deleted by BigQuery.
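A minimal sketch of both levels with the bq CLI (values are in seconds; the dataset and table names are placeholders):

# Dataset-level default: partitions in new tables expire after 60 days.
bq update --default_partition_expiration 5184000 mydataset

# Table-level setting, here 5 days, overriding the dataset default.
bq update --time_partitioning_expiration 432000 mydataset.mytable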
Update the partition expiration
ALTER TABLE mydataset.mytable
SET OPTIONS (
  -- Sets partition expiration to 5 days
  partition_expiration_days = 5
);
Partition filter requirements
When creating a partitioned table in BigQuery, you have the option to enforce that all queries against the table must include a filter on the partitioning column. This requirement, known as "Require partition filter", helps in optimizing query performance and reducing query costs by allowing BigQuery to only scan relevant partitions. Here's how it works:
Enforce Filter Requirement: This setting ensures queries explicitly filter on the partitioning column, enabling BigQuery to exclude irrelevant partitions from the scan.
Error on Missing Filter: Queries lacking a suitable partition column predicate will fail with an error indicating the need for a partition elimination filter.
This approach is beneficial for managing large datasets efficiently, as it prevents full table scans and focuses on the necessary data, directly impacting performance and cost.
Update the partition filter requirement
ALTER TABLE mydataset.mypartitionedtable
SET OPTIONS (
  require_partition_filter = true
);
Copy a partitioned table
Copying a partitioned table in BigQuery follows the same procedure as copying a standard table, with specific considerations:
Partitioned to New Table: Copying retains all partitioning information, making the new table identical in partition structure.
Non-partitioned to Partitioned: Supported only for ingestion-time partitioned tables, where data goes into the current date's partition. It's not available for time-unit or integer-range partitioned tables.
Partitioned to Partitioned: Requires matching partition specifications between source and destination tables.
Partitioned to Non-partitioned: The destination table will not have partitions.
Multiple Partitioned Tables: When copying multiple source tables to a partitioned table in one job, the source tables must be all partitioned or all non-partitioned; if partitioned, their partition specifications must match the destination's.
Copying options include appending or overwriting data in the destination table, depending on your needs.
You can copy the data from one or more partitions to another table.
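As a sketch with bq cp (names are placeholders; -a appends rather than overwrites), copying the February 1, 2016 partition into another table:

bq cp -a 'mydataset.source_table$20160201' mydataset.destination_table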
Delete a partition
To delete a partition from a BigQuery partitioned table, specify the partition's decorator in your delete command. The __NULL__ and __UNPARTITIONED__ partitions cannot be deleted. Only one partition can be deleted at a time, and the permissions required for table deletion apply here as well.
For more information, see the official documentation.
Querying partitioned tables
Partition pruning in BigQuery reduces query cost by scanning only partitions that match the filter condition, skipping the rest. This eliminates unnecessary partitions from the scan, reducing the bytes processed by the query. Pruning behaviors vary based on partitioning types, so performing a dry run can help estimate the bytes processed accurately.
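To estimate before running, a dry run with the bq CLI is a quick check (the query is illustrative):

bq query --use_legacy_sql=false --dry_run 'SELECT * FROM mydataset.table WHERE transaction_date >= "2016-01-01"'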
Query a time-unit column-partitioned table
SELECT * FROM dataset.table WHERE transaction_date >= '2016-01-01'
Query an ingestion-time partitioned table
Ingestion-time partitioned tables have a pseudo-column called _PARTITIONTIME, representing the UTC ingestion time truncated to the partition boundary (e.g., hourly, daily). For example, appending data on April 15, 2021, 08:15:00 UTC, gives:
Hourly partitioned: TIMESTAMP("2021-04-15 08:00:00")
Daily partitioned: TIMESTAMP("2021-04-15")
Monthly partitioned: TIMESTAMP("2021-04-01")
Yearly partitioned: TIMESTAMP("2021-01-01")
For daily granularity, there is also the _PARTITIONDATE pseudo-column.
To prune partitions, filter on these columns. For instance:
-- Example 1: Pruning partitions using _PARTITIONTIME
SELECT column
FROM dataset.table
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02');

-- Example 2: Selecting _PARTITIONTIME with an alias
SELECT _PARTITIONTIME AS pt, column
FROM dataset.table;

-- Example 3: Selecting _PARTITIONDATE for daily partitioned tables
SELECT _PARTITIONDATE AS pd, column
FROM dataset.table;

-- Example 4: Selecting _PARTITIONTIME using SELECT * with an alias
SELECT _PARTITIONTIME AS pt, *
FROM dataset.table;
Handle time zones in ingestion-time partitioned tables:
When working with ingestion-time partitioned tables in BigQuery, note that _PARTITIONTIME values are based on UTC. To query data in a different time zone:
Adjust for time zone differences in your SQL queries.
Use partition decorators to load data into specific ingestion-time partitions based on a different time zone than UTC.
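Combining both ideas in a sketch (event_timestamp is a hypothetical TIMESTAMP column recording the event's actual time): prune with a UTC _PARTITIONTIME range wide enough to cover the zone offset, then filter precisely in local time:

SELECT column
FROM dataset.table
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-04-30') AND TIMESTAMP('2016-05-02')  -- prunes partitions (UTC)
  AND DATETIME(event_timestamp, 'America/Los_Angeles')
      BETWEEN '2016-05-01 00:00:00' AND '2016-05-01 23:59:59';  -- exact local-time window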
Better performance with pseudo-columns:
To enhance query performance, leverage the _PARTITIONTIME pseudo-column by placing it alone on the left side of a comparison:
-- Might be slower.
SELECT field1
FROM dataset.table1
WHERE TIMESTAMP_ADD(_PARTITIONTIME, INTERVAL 5 DAY) > TIMESTAMP("2016-04-15");

-- Often performs better.
SELECT field1
FROM dataset.table1
WHERE _PARTITIONTIME > TIMESTAMP_SUB(TIMESTAMP('2016-04-15'), INTERVAL 5 DAY);
To limit scanned partitions effectively, use a constant expression in your filter:
SELECT column
FROM dataset.table2
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-03-01')
  AND _PARTITIONTIME = (SELECT MAX(timestamp) FROM dataset.table1);
-- The second condition does not prune, because the subquery is evaluated
-- dynamically; only the constant BETWEEN range limits the scanned partitions.
Avoid including other columns in a _PARTITIONTIME filter to ensure effective partition pruning:
-- Scans all partitions of table2 because field1 is a table column; no pruning.
SELECT field1
FROM dataset.table2
WHERE _PARTITIONTIME + field1 = TIMESTAMP('2016-03-28');
Consider creating views filtered on _PARTITIONTIME for frequently queried time ranges:
-- This view provides pruning.
CREATE VIEW dataset.past_week AS
SELECT *
FROM dataset.partitioned_table
WHERE _PARTITIONTIME BETWEEN
    TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP, INTERVAL 7 * 24 HOUR), DAY)
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP, DAY);
Query an integer-range partitioned table
To prune partitions in an integer-range partitioned table, filter on the integer partitioning column:
SELECT * FROM dataset.table WHERE customer_id BETWEEN 30 AND 50;
However, partition pruning isn't supported for functions or computations over the integer partitioning column. For instance, this query scans the entire table:

SELECT * FROM dataset.table WHERE customer_id + 1 BETWEEN 30 AND 50;
Use legacy SQL to query integer-range partitioned tables:
In legacy SQL, querying across an entire integer-range partitioned table isn't supported. Instead, an error like the following is returned:
Querying tables partitioned on a field is not supported in Legacy SQL
However, you can use table decorators to target a specific partition in an integer-range partitioned table. A range partition is addressed by the start of its range. For instance:

#legacySQL
SELECT * FROM [dataset.table$30];
Query data in the write-optimized storage
The __UNPARTITIONED__ partition temporarily holds data streamed to a partitioned table while it resides in write-optimized storage. Such rows have NULL _PARTITIONTIME and _PARTITIONDATE values, so query them by filtering on NULL:
SELECT column FROM dataset.table WHERE _PARTITIONTIME IS NULL;
Best practices for partition pruning
Use a constant filter expression
To limit scanned partitions efficiently, use constant expressions in your filter. Dynamic expressions require scanning all partitions.
For example, this query prunes partitions due to the constant expression in the filter:
SELECT t1.name, t2.category
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.id_field = t2.field2
WHERE t1.ts = CURRENT_TIMESTAMP()  -- constant expression
Isolate the partition column in your filter
Isolate the partition column in your filter expressions to ensure partition pruning. Filters involving computations with multiple fields won't prune partitions.
For instance, this filter won't prune partitions as it involves a computation with the partitioning column ts and another field ts2:
WHERE TIMESTAMP_ADD(ts, INTERVAL 6 HOUR) > ts2
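Note that isolating the partition column only helps when the other side is a constant expression; comparing it to another column still scans all partitions. A sketch of a pruning-friendly rewrite (the date is illustrative):

WHERE ts > TIMESTAMP_SUB(TIMESTAMP('2016-04-15'), INTERVAL 6 HOUR)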
Require a partition filter in queries
When creating a partitioned table, enable the "Require partition filter" option to enforce predicate filters in queries. Without such a filter, querying the table fails with an error asking for a filter that can be used for partition elimination.
The filter must reference only partition columns to be eligible for partition elimination. For example, both of these WHERE clauses satisfy the requirement:
WHERE partition_id = "20221231"
WHERE partition_id = "20221231" AND f = "20221130"
However, a combined filter like WHERE (partition_id = "20221231" OR f = "20221130") is not sufficient.
For ingestion-time partitioned tables, use _PARTITIONTIME or _PARTITIONDATE pseudo-columns.
Clustered tables
Clustered tables in BigQuery improve query performance and reduce costs by sorting storage blocks based on user-defined clustered columns. Queries filtering or aggregating by these columns only scan relevant blocks, rather than the entire table.
When clustering a table with multiple columns (up to 4 columns), the column order determines sorting precedence. While this enhances performance, it may affect cost estimation as BigQuery dynamically adjusts block sizes.
As a result, query cost estimates for clustered tables aren't accurate before execution, as the number of scanned blocks isn't known in advance. Costs are determined post-execution based on scanned blocks.
Note: tables smaller than 1 GB typically don't show significant improvement from partitioning or clustering.
Cluster column types
Cluster columns in BigQuery must be top-level, non-repeated, and fall into these types:
STRING
INT64
NUMERIC
BIGNUMERIC
DATE
DATETIME
TIMESTAMP
BOOL
GEOGRAPHY
Cluster column ordering
For optimal query performance with clustering, ensure the filter order matches the clustered column order and includes at least the first clustered column.
For example, if the table is clustered by Order_Date, Country, and Status, filtering on Order_Date and Country benefits from clustering, whereas filtering on just Country and Status does not. To optimize clustering, filter from clustered columns starting with the first one.
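For illustration, a sketch against a hypothetical table clustered by Order_Date, Country, Status (in that order):

-- Benefits from clustering: the filter starts with the first clustered column.
SELECT * FROM mydataset.ClusteredSalesData
WHERE Order_Date > '2024-01-01' AND Country = 'FR';

-- Does not benefit: the filter skips the first clustered column (Order_Date).
SELECT * FROM mydataset.ClusteredSalesData
WHERE Country = 'FR' AND Status = 'shipped';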
Block pruning
Block pruning in clustered tables reduces query costs by scanning only relevant data. BigQuery sorts clustered table data based on clustering columns, organizing them into blocks.
When querying a clustered table with filters on clustered columns, BigQuery uses filter expressions and block metadata to prune scanned blocks, avoiding unnecessary processing.
Pruned blocks are not scanned, and only scanned blocks contribute to query data processing bytes. Query costs reflect bytes read in each referenced column within scanned blocks.
When a clustered table is referenced multiple times in a query with multiple filters, BigQuery charges for scanning appropriate blocks per filter.
Automatic re-clustering
When new data is inserted into a clustered table, it may be stored in blocks that share key ranges with blocks written earlier, diluting the table's sorted nature. To preserve the efficiency of clustered tables, BigQuery automatically undertakes background re-clustering to reinstate the table's sorting order.
In the case of partitioned tables, clustering integrity is upheld within each individual partition's boundaries.
Note: Automatic re-clustering does not affect query capacity.
Limitations:
Only GoogleSQL supports querying and writing query results to clustered tables.
Up to four clustering columns are allowed; for more granularity, combine clustering with partitioning.
When clustering with STRING columns, only the first 1,024 characters are considered.
Altering a non-clustered table to clustered doesn't cluster existing data; only new data is subject to automatic re-clustering.
Create clustered tables
To create a clustered table, you can:
Create from a query result:
Run a DDL CREATE TABLE ... AS SELECT statement, or run a query that writes to a clustered destination table.

CREATE TABLE mydataset.clustered_table (
  customer_id STRING,
  transaction_amount NUMERIC
)
CLUSTER BY customer_id
AS (
  SELECT * FROM mydataset.unclustered_table
);

Use a DDL CREATE TABLE statement with a CLUSTER BY clause:
Specify a clustering_column_list.

CREATE TABLE mydataset.myclusteredtable (
  customer_id STRING,
  transaction_amount NUMERIC
)
CLUSTER BY customer_id
OPTIONS (
  description = 'a table clustered by customer_id'
);
Utilize the bq command-line tool:
Run bq mk command.
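A sketch of the bq mk form (the schema and names are placeholders):

bq mk --table \
  --schema 'customer_id:STRING,transaction_amount:NUMERIC' \
  --clustering_fields customer_id \
  mydataset.myclusteredtable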
Make API calls:
Use tables.insert method.
Load data into BigQuery.
LOAD DATA INTO mydataset.mytable
PARTITION BY transaction_date
CLUSTER BY customer_id
OPTIONS (
  partition_expiration_days = 3
)
FROM FILES (
  format = 'AVRO',
  uris = ['gs://bucket/path/file.avro']
);
Employ client libraries.
Use clustered tables
To get information about clustered tables:
Use the Google Cloud console.
Employ the bq command-line tool's bq show command.
Call the tables.get API method.
Query INFORMATION_SCHEMA views.
CREATE TABLE mydataset.data (column1 INT64, column2 INT64)
CLUSTER BY column1, column2;

SELECT column_name, clustering_ordinal_position
FROM mydataset.INFORMATION_SCHEMA.COLUMNS;
To list clustered tables in datasets:
Use the Google Cloud console.
Utilize the bq command-line tool's bq ls command.
Call the tables.list API method.
Employ client libraries.
Query the CLUSTERING_ORDINAL_POSITION column in the INFORMATION_SCHEMA.COLUMNS view.
To modify a table's clustering specification:
Use the bq tool:
bq update --clustering_fields=CLUSTER_COLUMN DATASET.ORIGINAL_TABLE
Replace:
CLUSTER_COLUMN with the column you're clustering on (e.g., mycolumn)
DATASET with the dataset name (e.g., mydataset)
ORIGINAL_TABLE with the original table name (e.g., mytable)
Alternatively, use the tables.update or tables.patch API method.
To cluster all rows based on the new clustering specification, run:
UPDATE DATASET.ORIGINAL_TABLE SET CLUSTER_COLUMN=CLUSTER_COLUMN WHERE true
Note: Applying a new clustering specification to a table in long-term storage reverts it to active storage pricing. Refer to Storage pricing for details.
Query clustered tables
When you create a clustered table in BigQuery, data is automatically organized based on specified columns to colocate related data. The order of specified columns determines data sort order.
To optimize query performance:
Filter on the clustered columns, or a subset of them, in the order they were specified.
Queries filtering on clustered columns generally perform better than those filtering only on non-clustered columns.
BigQuery sorts clustered table data based on clustering columns and organizes them into blocks. When querying a clustered table:
BigQuery efficiently uses clustering information to determine relevant data blocks, enabling block pruning.
Only relevant blocks are scanned.
Query clustered tables via:
Google Cloud console
bq command-line tool's bq query command
Calling jobs.insert API method and configuring a query job
Using client libraries
Currently, only GoogleSQL is supported for clustered tables.
Best practices for clustered tables:
Filter columns in sort order: Filter on clustered columns in the order specified in the CLUSTER BY clause. Include all clustered columns or a left-to-right subset starting with the first column.
SELECT *
FROM `mydataset.ClusteredTable`
WHERE column1 = value1 AND column2 = value2
-- column1 and column2 are the clustered columns from the CLUSTER BY clause,
-- filtered in the order they were specified, which optimizes query performance.
Avoid complex filter expressions: Complex expressions hinder block pruning, affecting query performance. Use simple filters directly on clustered columns.
SELECT *
FROM `mydataset.ClusteredTable`
WHERE EXTRACT(MONTH FROM date_column) = 12
  AND EXTRACT(YEAR FROM date_column) = 2022
-- The filter applies functions to date_column, so block pruning cannot use
-- the clustering metadata, which hurts query performance.
Avoid comparing clustered columns with others: Comparing clustered columns with other columns impairs block pruning. Stick to filtering directly on clustered columns.
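For instance, a sketch with hypothetical columns, where clustered_col is the clustered column:

-- Anti-pattern: comparing a clustered column to another column defeats pruning.
SELECT *
FROM `mydataset.ClusteredTable`
WHERE clustered_col = other_col;

-- Better: compare the clustered column to a constant.
SELECT *
FROM `mydataset.ClusteredTable`
WHERE clustered_col = 'some_value';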
Partitioned vs clustered tables
Difference
Partitioning splits a table into physical segments on a single column (one partitioning column, up to 4,000 partitions) and supports precise pre-execution cost estimates plus partition-level management such as expiration and targeted deletes. Clustering sorts data into blocks on up to four columns and offers finer-grained pruning, but the bytes scanned, and therefore the cost, are only known after the query runs.
Use cases
Partitioning a table is beneficial in scenarios where enhancing query performance, managing larger table operations within quotas, forecasting query costs, and leveraging partition-level management are priorities. Specifically, partitioning is advisable when:
Improving Query Performance: If the goal is to boost query efficiency by limiting scans to specific table segments.
Handling Large Operations: When table operations exceed standard quotas, partitioning by specific column values can allow for higher quotas on partitioned tables.
Estimating Query Costs: BigQuery offers the ability to estimate query costs on partitioned tables before running the query. This is achieved by pruning the table and conducting a dry run of the query.
Partition-Level Management Features: If there's a need for features such as setting expiration times for partitions, writing to or deleting specific partitions without affecting or scanning the entire table.
In contrast, clustering a table might be more appropriate in cases where:
Granularity Requirements: More detailed sorting and filtering than what partitioning offers is needed.
Multiple Column Filters and Aggregations: Queries often involve filters or aggregations on multiple columns.
High Cardinality: There's a large number of distinct values in a column or a group of columns.
No Need for Strict Cost Estimates Pre-Execution: If precise cost estimations before running queries aren't a requirement.
Small Data Per Partition: When partitioning would result in many small partitions (less than 10 GB each), which could negatively impact metadata access times.
Excessive Number of Partitions: The partitioning would exceed the limits set for partitioned tables.
Frequent DML Operations: If Data Manipulation Language (DML) operations are regularly modifying most partitions in the table.
Combination
Combining partitioning and clustering in BigQuery optimizes queries by first dividing data into partitions based on keys like dates or integers, and then further organizing it within each partition by clustering columns. This two-tiered structure allows for targeted and efficient data scanning, significantly reducing query times by minimizing the data read. This approach is ideal for large datasets, ensuring precise and cost-effective query performance.
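A minimal sketch of the combination (table and column names are hypothetical): daily partitions on transaction_date, with rows inside each partition clustered by customer_id:

CREATE TABLE mydataset.transactions (
  customer_id STRING,
  transaction_date DATE,
  transaction_amount NUMERIC
)
PARTITION BY transaction_date
CLUSTER BY customer_id;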