BigQuery query internals
In this comprehensive guide, we delve into various aspects of Google BigQuery, drawing insights from its official documentation. From understanding different query modes to exploring advanced topics like recursive common table expressions (CTEs) and transactions, this post aims to provide a detailed overview of BigQuery's functionalities. Whether you're a beginner or an experienced user, this guide covers essential topics such as writing query results, working with JSON data, utilizing sketches for efficient data analysis, and much more. Let's dive in!
Query modes
Interactive Queries: Run immediately on-demand, prioritized for quick execution. BigQuery dynamically adjusts limits for concurrent queries based on resource availability, with a preference for interactive over batch queries. Exceeding limits places extra queries in a queue.
Batch Queries: Executed when idle resources are available, providing a lower-priority, cost-effective option. Useful for non-urgent jobs.
To run a query at batch priority in the Google Cloud console, after entering the query select:
More → Query Settings → Resource Management → Batch
By default, queries are interactive. Results are stored in temporary or specified permanent tables, with options to append, overwrite, or create new tables.
Writing query results
BigQuery stores query results in either temporary or permanent tables.
Temporary Tables: Automatically used to cache results not written to a permanent table, existing for up to 24 hours. They are created in a special dataset with random names, and their data is accessible only to the creator. Temporary tables support multi-statement queries and sessions but cannot be shared or listed through standard methods. They are created in the same region as the tables being queried.
Permanent Tables: Can be new or existing within any accessible dataset. Storing data in new permanent tables incurs storage charges. Destination datasets for query results must be in the same location as the queried tables.
Billing: The project executing the query pays for processing, while the data-hosting project covers storage costs in BigQuery.
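Although destination tables are normally configured through query settings or the bq tool, the same effect is available in plain SQL with CREATE TABLE ... AS SELECT (or INSERT INTO ... SELECT to append). A minimal sketch; the table name is illustrative:
-- Store query results in a new permanent table (CREATE OR REPLACE overwrites).
CREATE OR REPLACE TABLE mydataset.word_totals AS
SELECT word, SUM(word_count) AS total
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word;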
Save query results
After running a SQL query in the Google Cloud console, you can download the results to a local file, Google Sheets, or Google Drive, preserving column sort order. The bq command-line tool and the API do not support this feature.
Limitations:
Download formats are limited to CSV or newline-delimited JSON.
Google Sheets cannot store query results with nested and repeated data.
Results over 1 GB can't be saved to Google Drive via the console; use a table for larger datasets.
Local CSV downloads are capped at 10 MB, with actual size depending on the query's schema.
Only CSV or newline-delimited JSON formats are supported for saving to Google Drive.
Query data with SQL
Arrays
An array is an ordered list of zero or more values of the same data type.
Access array elements
In BigQuery, you can access array elements within queries using a zero-based index (OFFSET()) or a one-based index (ORDINAL()). Use SAFE_OFFSET() or SAFE_ORDINAL() to return NULL instead of raising an error for out-of-range indexes.
WITH Sequences AS (
  SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
)
SELECT
  some_numbers,
  some_numbers[0] AS index_0,           -- first element (zero-based)
  some_numbers[OFFSET(1)] AS offset_1,  -- second element (zero-based)
  some_numbers[ORDINAL(1)] AS ordinal_1 -- first element (one-based)
FROM Sequences;
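Out-of-range indexes raise an error with OFFSET() and ORDINAL(); the SAFE_ variants return NULL instead. A quick sketch:
SELECT
  [10, 20, 30][SAFE_OFFSET(5)] AS out_of_range,   -- NULL instead of an error
  [10, 20, 30][SAFE_ORDINAL(3)] AS third_element; -- 30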
Finding lengths
Use the ARRAY_LENGTH() function to get the number of elements in an array.
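For example:
SELECT ARRAY_LENGTH([0, 1, 1, 2, 3, 5]) AS len; -- returns 6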
Converting elements in an array to rows in a table
To break down an ARRAY into rows, use UNNEST. To preserve element order, use WITH OFFSET. You can then cross join to flatten an entire column of ARRAYs while preserving other column values (see the Annotations section for examples). To query nested arrays in a table containing an ARRAY of STRUCTs, or to flatten ARRAY-type fields of STRUCT values, use the UNNEST operator, which lets you access the fields of the STRUCTs within the ARRAY.
Querying STRUCT elements in an array
To query STRUCT elements in an ARRAY, use UNNEST with CROSS JOIN to flatten the ARRAY of STRUCTs. You can then access specific information from repeated fields within each STRUCT.
Querying ARRAY-type fields in a struct
To retrieve information from nested repeated fields, such as the fastest lap in an 800M race, use UNNEST with the comma operator (,) or CROSS JOIN to flatten the nested arrays. Handle NULL or empty arrays with a LEFT JOIN if needed.
Creating arrays from subqueries
In GoogleSQL, you can create arrays from subquery results using the ARRAY() function. For instance, you can multiply each value in an array by two using a subquery with UNNEST and then recombine the results into an array.
Filtering arrays
In GoogleSQL, you can filter arrays using the WHERE clause in the ARRAY() function's subquery to return specific rows. Additionally, you can use SELECT DISTINCT to retrieve unique elements within an array. Furthermore, you can filter rows of arrays using the IN keyword to check if a specific value matches an element in the array.
Scanning arrays
Using IN with UNNEST: To check whether an array contains a specific value, use the IN operator with UNNEST.
Using EXISTS with UNNEST: To scan an array for values that satisfy a condition, use UNNEST to return a table of the array's elements, then filter the resulting table with EXISTS and the condition to check if any rows match the criteria.
Arrays and aggregation
ARRAY_AGG() in GoogleSQL aggregates values into an array, which is useful for consolidating data. You can order array elements with ORDER BY within ARRAY_AGG() and apply aggregate functions like SUM() to array elements. ARRAY_CONCAT_AGG() concatenates array elements across rows, creating a consolidated array. Note that the order of elements in arrays returned by these functions is non-deterministic.
Converting arrays to strings
The ARRAY_TO_STRING() function converts an array to a single string value, with optional handling for NULL elements and customizable separators.
Combining arrays
In some cases, you might want to combine multiple arrays into a single array. You can accomplish this using the ARRAY_CONCAT() function.
Updating arrays
Updating arrays in GoogleSQL involves using the UPDATE statement along with array functions like ARRAY_CONCAT() to modify array elements. You can append new elements to an existing array or update array fields within nested structures. The UPDATE statement is typically followed by a SET clause specifying the modifications to be made, and a WHERE clause to filter the rows to be updated.
Zipping arrays
Zipping arrays involves merging two arrays of equal size into a single array containing pairs of elements from corresponding positions. This operation can be achieved using UNNEST and WITH OFFSET to create pairs stored as STRUCTs in an array.
Building arrays of arrays
GoogleSQL does NOT support creating arrays of arrays directly. Instead, you can create an array of structs, where each struct contains a field of type ARRAY.
JSON data
JSON is a widely used format for semi-structured data, allowing a "schema-on-read" approach. Unlike BigQuery's fixed schema for STRUCT type, JSON data type enables loading semi-structured data without upfront schema. It permits storing and querying data without fixed schemas, encoding and processing JSON fields individually. Querying JSON in BigQuery is intuitive and cost-efficient, using the field access operator.
Limitations
Batch load jobs can ingest JSON data only from CSV, Avro, or JSON source files.
JSON data type has a nesting limit of 500 levels.
Legacy SQL cannot be used to query tables containing JSON types.
Row-level access policies cannot be applied to JSON columns.
Create a table with a JSON column
CREATE TABLE mydataset.table1(
  id INT64,
  cart JSON
);
You can't partition or cluster a table on JSON columns, because the equality and comparison operators are not defined on the JSON type.
Create JSON values
Create a JSON value: You can insert JSON values directly into a table using either JSON literals or functions like JSON_ARRAY() and JSON_OBJECT().
Convert a STRING type to JSON type: Use the PARSE_JSON() function to convert JSON-formatted STRING values to JSON type. This is particularly useful when converting columns from existing tables to JSON type.
Convert schematized data to JSON: Utilize functions like JSON_OBJECT() to convert schematized data, such as key-value pairs, into JSON format. This allows you to aggregate data into JSON objects based on specific criteria.
Convert SQL type to JSON type: The TO_JSON function in BigQuery converts SQL structured data types, such as STRUCTs, to JSON format.
To ingest JSON data into BigQuery:
Batch Load: Load from CSV, Avro, or JSON files.
Storage Write API: Programmatically ingest JSON data.
Legacy Streaming API: Stream JSON data into BigQuery.
Examples:
Load from CSV: Use bq load with CSV files.
Load from NDJSON: Use bq load with newline-delimited JSON files.
Storage Write API: Format data using protocol buffers.
Legacy Streaming API: Stream JSON data using the Python client.
Query JSON data
Extract values as JSON
In BigQuery, you can access JSON fields using field access and subscript operators. These allow for extracting specific elements or fields from JSON data. Operators return JSON types and are equivalent to the JSON_QUERY function.
They return NULL if a member or element is not found. However, direct comparison operations aren't supported; use functions like JSON_VALUE for such operations.
Extract values as strings
To extract values from JSON data as strings in BigQuery, use the JSON_VALUE function. It retrieves scalar values from JSON and returns them as SQL strings, which is useful in contexts requiring equality or comparison, such as WHERE clauses. The STRING function can be used for similar purposes. JSON data can also be extracted and converted to other SQL data types using value extraction functions like BOOL, INT64, and FLOAT64. To determine the type of a JSON value, use the JSON_TYPE function.
Flexibly convert JSON
LAX conversion functions in BigQuery enable flexible, error-tolerant conversion of JSON values to scalar SQL types. Functions like LAX_INT64 infer the input type and convert it where possible, returning NULL rather than raising an error otherwise. Other options include LAX_STRING, LAX_BOOL, and LAX_FLOAT64.
Extract arrays from JSON
To extract arrays from JSON in BigQuery, use the JSON_QUERY_ARRAY and JSON_VALUE_ARRAY functions. JSON_QUERY_ARRAY extracts JSON arrays, while JSON_VALUE_ARRAY extracts arrays of scalar values.
UNNEST is then used to split the array into individual elements, which can be further processed or aggregated using other functions like ARRAY_AGG.
JSON nulls
JSON null values in BigQuery are distinct from SQL NULL values. When working with JSON data, it's essential to understand how null values are handled by different functions. While the JSON_QUERY function returns a JSON null when extracting a field with a null value, the JSON_VALUE function returns SQL NULL, as JSON null is not considered a scalar value.
Sketches
GoogleSQL for BigQuery offers support for data sketches, which are compact summaries of data aggregations. Using sketches can significantly reduce computation costs compared to exact calculations, making them ideal for scenarios where query time and resource usage need to be optimized. Sketches allow for estimating metrics like cardinalities and quantiles without the need to process raw data, enabling faster and more efficient analyses.
Sketched data introduces a statistical error represented by an error bound or confidence interval, but this trade-off in accuracy is generally acceptable given the benefits of faster computations and reduced storage requirements. Sketches represent approximate aggregates for specific metrics, are compact, and can be merged with other sketches to summarize larger data sets.
Sketch merging enables the construction of summaries for multiple data streams, facilitating tasks such as estimating the number of distinct users over time or creating roll-ups of OLAP cubes. These operations help optimize query performance and resource usage by avoiding the need to process the entire input data set.
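As a concrete illustration, BigQuery's HyperLogLog++ functions build mergeable cardinality sketches. A minimal sketch with made-up visit data:
WITH DailyVisits AS (
  SELECT 'mon' AS day, user_id
  FROM UNNEST(['alice', 'bob', 'carol', 'alice']) AS user_id
  UNION ALL
  SELECT 'tue', user_id
  FROM UNNEST(['bob', 'carol', 'dan']) AS user_id
),
Sketches AS (
  -- One compact HLL++ sketch per day.
  SELECT day, HLL_COUNT.INIT(user_id) AS sketch
  FROM DailyVisits
  GROUP BY day
)
-- Merge the per-day sketches to estimate distinct users across both days (≈ 4).
SELECT HLL_COUNT.MERGE(sketch) AS approx_distinct_users
FROM Sketches;
Because the per-day sketches are stored rather than recomputed, the merge step never touches the raw visit rows.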
Multi-statement queries
Multi-statement queries enable the execution of multiple SQL statements in a single request, supporting procedural language statements for tasks like variable definition and control flow implementation. They are commonly employed in stored procedures and may have side effects on table data.
Write, run and save
Multi-statement queries in BigQuery allow the execution of multiple SQL statements in one request.
These queries can include procedural language statements, enabling variable usage and control flow implementation.
To write a multi-statement query, separate SQL statements with semicolons.
BigQuery interprets any request with multiple statements as a multi-statement query, unless the statements consist entirely of CREATE TEMP FUNCTION statements followed by a single SELECT statement.
Running a multi-statement query is done similarly to a single statement query, through the Google Cloud console or bq command-line tool.
Dry runs estimate the bytes read by a multi-statement query, but have specific handling for different types of statements.
Dry runs operate on a best-effort basis and might not always accurately predict query execution outcomes.
Variables
Multi-statement queries in BigQuery support the use of both user-created and system variables.
User-created variables are declared using the DECLARE statement and can be assigned values with the SET statement.
-- Declare a user-created variable and set its value
DECLARE x INT64 DEFAULT 0;
SET x = 10;
System variables are built into BigQuery and can be overridden with the SET statement.
SET @@dataset_project_id = 'MyProject';
SET @@dataset_id = 'MyDataset';

BEGIN
  CREATE TABLE MyTempTableA (id STRING);
  CREATE TABLE MyTempTableB (id STRING);
END;
User-created variables must be declared at the start of a multi-statement query or within a BEGIN block, while system variables are implicitly available.
User-created variables have a maximum size limit of 1 MB per variable and 10 MB for all variables in a multi-statement query.
After declaring and setting a user-created variable, you can reference it in a multi-statement query. If a variable and column share the same name, the column takes precedence.
DECLARE x INT64 DEFAULT 0;
SET x = 10;

WITH Numbers AS (SELECT 50 AS x)
SELECT (x + x) AS result FROM Numbers;

+--------+
| result |
+--------+
| 100 |
+--------+
System variables can be used to simplify queries by setting default values for projects, datasets, or other parameters.
BEGIN
  CREATE OR REPLACE TABLE MyDataset.MyTable (default_time_zone STRING)
  OPTIONS (description = @@time_zone);
END;
Both user-created and system variables can be referenced throughout a multi-statement query.
However, system variables have limitations in certain contexts, such as using them in table paths or as project names in DDL and DML queries, which can result in errors.
Temporary tables
Temporary tables in BigQuery facilitate storing intermediate results within multi-statement queries without the need for dataset maintenance. They're created, referenced, and deleted within a query's scope, residing in automatically managed datasets.
To create a temporary table, use the CREATE TEMP TABLE statement followed by the table definition and query. References to temporary tables are local to the multi-statement query, and the tables are automatically deleted after 24 hours. You can delete a temporary table explicitly with DROP TABLE or let it be deleted automatically. Temporary table data is accessible through the BigQuery Explorer page in the Google Cloud console. For clarity, you can qualify temporary table references with _SESSION, but you can't use _SESSION to create non-temporary tables.
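A minimal multi-statement sketch, assuming a mydataset.Inventory table like the one built in the Annotations section:
-- Stage an intermediate result in a temporary table.
CREATE TEMP TABLE TopProducts AS
SELECT product, SUM(quantity) AS total_quantity
FROM mydataset.Inventory
GROUP BY product;

-- Reference it later in the same multi-statement query.
SELECT * FROM TopProducts ORDER BY total_quantity DESC LIMIT 3;

-- Optional: drop it explicitly instead of waiting for automatic cleanup.
DROP TABLE TopProducts;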
Job information
A multi-statement query job in BigQuery contains information about a query consisting of multiple SQL statements. You can retrieve details about the job, including the last executed statement and all executed statements.
To return the last executed statement, use the jobs.getQueryResults method. To retrieve the results of all statements executed in the multi-statement query, enumerate the child jobs created for each statement: call the jobs.list method with the parentJobId parameter set to the multi-statement query job ID, then call jobs.getQueryResults on each child job. The same child jobs can also be enumerated in SQL, as sketched below.
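A hedged sketch using the INFORMATION_SCHEMA.JOBS view, which exposes a parent_job_id column; the job ID here is a placeholder:
SELECT job_id, statement_type, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE parent_job_id = 'my_script_job_id'; -- hypothetical multi-statement job ID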
Debug
ASSERT statement: Use the ASSERT statement to verify that a Boolean condition is true. If the condition is false, an error is raised.
BEGIN...EXCEPTION...END: Employ BEGIN...EXCEPTION...END blocks to handle errors gracefully. This structure catches errors and lets you display error messages and stack traces.
SELECT FORMAT("..."): Use SELECT FORMAT("...") to display intermediate results during the execution of a multi-statement query.
Viewing output: In the Google Cloud console and with the bq command-line tool, you can view the results of each statement within a multi-statement query. You can also select and run individual statements within the query editor in the Google Cloud console.
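A small sketch combining the first three techniques (the failing assertion is deliberate):
BEGIN
  DECLARE x INT64 DEFAULT -1;
  -- Display intermediate state while the query runs.
  SELECT FORMAT('x is currently %d', x) AS debug_output;
  -- Raise an error if an invariant is violated.
  ASSERT x >= 0 AS 'x must be non-negative';
EXCEPTION WHEN ERROR THEN
  -- Catch the error and surface its message.
  SELECT @@error.message AS error_message;
END;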
Recursive CTEs
In GoogleSQL for BigQuery, the WITH clause lets you define temporary tables known as common table expressions (CTEs). CTEs can be recursive or non-recursive, with recursion enabled by the RECURSIVE keyword (WITH RECURSIVE). Recursive CTEs are useful for querying hierarchical and graph data but can be computationally expensive.
Create
To create a recursive common table expression (CTE) in GoogleSQL, use the WITH RECURSIVE clause.
WITH RECURSIVE
CTE_1 AS (
(SELECT 1 AS iteration UNION ALL SELECT 1 AS iteration)
UNION ALL
SELECT iteration + 1 AS iteration FROM CTE_1 WHERE iteration < 3
)
SELECT iteration FROM CTE_1
ORDER BY 1 ASC;
/*-----------*
| iteration |
+-----------+
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
*-----------*/
Recursive CTEs consist of a base term, a union operator, and a recursive term. The base term initializes the first iteration, while the recursive term generates subsequent iterations. Only the recursive term can reference the CTE itself. Refer to the GoogleSQL reference documentation for detailed syntax and examples.
Explore reachability in a directed acyclic graph (DAG)
Recursive queries are useful for exploring reachability in directed acyclic graphs (DAGs). By leveraging recursive common table expressions (CTEs), you can iteratively traverse a graph to find all nodes reachable from a specific starting node. This approach simplifies complex graph traversal tasks and enables efficient analysis of graph structures.
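A minimal sketch over a hand-built edge list, finding every node reachable from node 'a':
WITH RECURSIVE
  Edges AS (
    SELECT 'a' AS from_node, 'b' AS to_node UNION ALL
    SELECT 'a', 'c' UNION ALL
    SELECT 'b', 'd' UNION ALL
    SELECT 'c', 'd' UNION ALL
    SELECT 'd', 'e'
  ),
  Reachable AS (
    SELECT 'a' AS node            -- base term: the starting node
    UNION ALL
    SELECT e.to_node              -- recursive term: follow outgoing edges
    FROM Reachable AS r
    JOIN Edges AS e ON r.node = e.from_node
  )
SELECT DISTINCT node FROM Reachable;
Because the graph is acyclic, the recursive term eventually stops producing rows and the query terminates on its own.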
Troubleshoot iteration limit errors
Recursive common table expressions (CTEs) in BigQuery are subject to a maximum iteration limit of 500 to prevent infinite recursion. This limit helps control resource consumption and query execution time. If you encounter a recursion iteration limit error, it typically indicates a missing termination condition or inappropriate use of recursive CTEs. Reviewing and adjusting your query logic can help address this issue.
Check for infinite recursion
To prevent infinite recursion, ensure the recursive term stops producing results after a set number of iterations. Another approach is to convert the recursive CTE to a temporary table and use a REPEAT loop to limit the iterations, typically up to 100.
Verify the appropriate usage of the recursive CTE
Verify that your recursive CTE is suitable for querying hierarchical or graph data, as they are optimized for these scenarios. Otherwise, consider alternatives like using the LOOP statement with a non-recursive CTE to avoid unnecessary computational overhead.
If needed, contact Customer Care to raise the recursive limit beyond 500 iterations, understanding the potential risks such as longer execution times and resource constraints.
Transactions
BigQuery enables multi-statement transactions for atomic operations across single or multiple queries within sessions. They facilitate complex operations like mutating rows, ensuring either all changes are committed or rolled back together.
Common uses include:
Batch mutations across tables spanning datasets or projects.
Sequential mutations on a single table based on interim computations.
Transactions ensure ACID properties and provide snapshot isolation, ensuring consistent reads within the transaction. However, reads from external data sources may lack consistency if the source data changes during the transaction.
Scope
In BigQuery, transactions are contained within a single SQL query, except in Session mode. They can't be nested but can span multiple queries within a session.
To start a transaction, use BEGIN TRANSACTION. It ends with either a COMMIT TRANSACTION to commit changes or a ROLLBACK TRANSACTION to discard them. If an error occurs and there is no exception handler, the query fails and BigQuery automatically rolls back the transaction.
Supported statements
Only specific statement types are supported within transactions in BigQuery:
Query statements: SELECT
DML statements: INSERT, UPDATE, DELETE, MERGE, and TRUNCATE TABLE
DDL statements on temporary entities: CREATE TEMP TABLE, CREATE TEMP FUNCTION, DROP TABLE on a temporary table, DROP FUNCTION on a temporary function
DDL statements creating or dropping permanent entities like datasets, tables, and functions aren't supported within transactions.
Date/time functions
In transactions, specific behaviors apply to date/time functions:
CURRENT_TIMESTAMP, CURRENT_DATE, and CURRENT_TIME return the transaction start timestamp.
Using FOR SYSTEM_TIME AS OF to read a table beyond the transaction start timestamp results in an error.
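A minimal sketch of the first behavior; both SELECTs observe the same transaction start timestamp:
BEGIN TRANSACTION;
SELECT CURRENT_TIMESTAMP() AS t1;
SELECT CURRENT_TIMESTAMP() AS t2; -- same value as t1 within the transaction
COMMIT TRANSACTION;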
Example transaction
A complete worked example appears in the Annotations section at the end of this post.
Transaction concurrency
Transactions mutating rows in a table prevent concurrent transactions or DML statements on the same table, resulting in cancellation of conflicting operations.
Conflicting DML statements outside transactions are queued for later execution.
If the transaction only reads or appends rows, concurrent operations can include:
SELECT statements
BigQuery Storage Read API read operations
queries from BigQuery BI Engine
INSERT statements
load jobs using WRITE_APPEND
streaming writes
Each multi-statement transaction in BigQuery is assigned a transaction ID.
Query the INFORMATION_SCHEMA.JOBS* views for the transaction_id column to see the transaction IDs for your jobs. BigQuery generates a child job for each statement within a transaction, and all child jobs associated with a transaction share the same transaction_id value.
Limitations
DDL Statements: Transactions cannot use DDL statements on permanent entities.
Materialized Views: Materialized views act as logical views in transactions, offering no performance benefits.
Rollback: Transaction failures trigger a rollback, undoing all pending changes.
Table Mutation Limits: Transactions can mutate data in up to 100 tables and perform a maximum of 100,000 partition modifications.
BI Engine: BI Engine does not accelerate queries within transactions.
Annotations
Converting elements in an array to rows in a table
-- Flattening an ARRAY with UNNEST and maintaining order with WITH OFFSET
SELECT *
FROM UNNEST(['foo', 'bar', 'baz']) AS element
WITH OFFSET AS offset
ORDER BY offset;
-- Flattening an entire column of ARRAYs while preserving other column values
WITH Sequences AS (
SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT 2 AS id, [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT 3 AS id, [5, 10] AS some_numbers
)
SELECT id, flattened_numbers
FROM Sequences
CROSS JOIN UNNEST(Sequences.some_numbers) AS flattened_numbers;
Querying STRUCT elements in an array
-- Flattening an ARRAY of STRUCTs
WITH Races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" AS name, [23.4, 26.3, 26.4, 26.1] AS laps),
STRUCT("Makhloufi" AS name, [24.5, 25.4, 26.6, 26.1] AS laps),
STRUCT("Murphy" AS name, [23.9, 26.0, 27.0, 26.0] AS laps),
STRUCT("Bosse" AS name, [23.6, 26.2, 26.5, 27.1] AS laps),
STRUCT("Rotich" AS name, [24.7, 25.6, 26.9, 26.4] AS laps),
STRUCT("Lewandowski" AS name, [25.0, 25.7, 26.3, 27.2] AS laps),
STRUCT("Kipketer" AS name, [23.2, 26.1, 27.3, 29.4] AS laps),
STRUCT("Berian" AS name, [23.7, 26.1, 27.0, 29.3] AS laps)]
AS participants)
SELECT
race,
participant
FROM Races AS r
CROSS JOIN UNNEST(r.participants) AS participant;
-- Querying specific information from repeated fields
WITH Races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" AS name, [23.4, 26.3, 26.4, 26.1] AS laps),
STRUCT("Makhloufi" AS name, [24.5, 25.4, 26.6, 26.1] AS laps),
STRUCT("Murphy" AS name, [23.9, 26.0, 27.0, 26.0] AS laps),
STRUCT("Bosse" AS name, [23.6, 26.2, 26.5, 27.1] AS laps),
STRUCT("Rotich" AS name, [24.7, 25.6, 26.9, 26.4] AS laps),
STRUCT("Lewandowski" AS name, [25.0, 25.7, 26.3, 27.2] AS laps),
STRUCT("Kipketer" AS name, [23.2, 26.1, 27.3, 29.4] AS laps),
STRUCT("Berian" AS name, [23.7, 26.1, 27.0, 29.3] AS laps)]
AS participants)
SELECT
race,
(SELECT name
FROM UNNEST(participants)
ORDER BY (
SELECT SUM(duration)
FROM UNNEST(laps) AS duration) ASC
LIMIT 1) AS fastest_racer
FROM Races;
Querying ARRAY-type fields in a struct
-- Querying the runner with the fastest lap
WITH Races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" AS name, [23.4, 26.3, 26.4, 26.1] AS laps),
STRUCT("Makhloufi" AS name, [24.5, 25.4, 26.6, 26.1] AS laps),
STRUCT("Murphy" AS name, [23.9, 26.0, 27.0, 26.0] AS laps),
STRUCT("Bosse" AS name, [23.6, 26.2, 26.5, 27.1] AS laps),
STRUCT("Rotich" AS name, [24.7, 25.6, 26.9, 26.4] AS laps),
STRUCT("Lewandowski" AS name, [25.0, 25.7, 26.3, 27.2] AS laps),
STRUCT("Kipketer" AS name, [23.2, 26.1, 27.3, 29.4] AS laps),
STRUCT("Berian" AS name, [23.7, 26.1, 27.0, 29.3] AS laps)]
AS participants)
SELECT
race,
(SELECT name
FROM UNNEST(participants),
UNNEST(laps) AS duration
ORDER BY duration ASC LIMIT 1) AS runner_with_fastest_lap
FROM Races;
-- Flattening arrays with a CROSS JOIN and handling NULL or empty arrays
WITH Races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" AS name, [23.4, 26.3, 26.4, 26.1] AS laps),
STRUCT("Makhloufi" AS name, [24.5, 25.4, 26.6, 26.1] AS laps),
STRUCT("Murphy" AS name, [23.9, 26.0, 27.0, 26.0] AS laps),
STRUCT("Bosse" AS name, [23.6, 26.2, 26.5, 27.1] AS laps),
STRUCT("Rotich" AS name, [24.7, 25.6, 26.9, 26.4] AS laps),
STRUCT("Lewandowski" AS name, [25.0, 25.7, 26.3, 27.2] AS laps),
STRUCT("Kipketer" AS name, [23.2, 26.1, 27.3, 29.4] AS laps),
STRUCT("Berian" AS name, [23.7, 26.1, 27.0, 29.3] AS laps),
STRUCT("Nathan" AS name, ARRAY<FLOAT64>[] AS laps),
STRUCT("David" AS name, NULL AS laps)]
AS participants)
SELECT
name, SUM(duration) AS finish_time
FROM Races CROSS JOIN Races.participants LEFT JOIN participants.laps AS duration
GROUP BY name;
Creating arrays from subqueries
-- Creating arrays from subqueries
WITH Sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT some_numbers,
ARRAY(SELECT x * 2
FROM UNNEST(some_numbers) AS x) AS doubled
FROM Sequences;
Filtering arrays
-- Filtering arrays using WHERE clause
WITH Sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT
ARRAY(SELECT x * 2
FROM UNNEST(some_numbers) AS x
WHERE x < 5) AS doubled_less_than_five
FROM Sequences;
-- Filtering arrays using SELECT DISTINCT
WITH Sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers)
SELECT ARRAY(SELECT DISTINCT x
FROM UNNEST(some_numbers) AS x) AS unique_numbers
FROM Sequences;
-- Filtering rows of arrays using the IN keyword
WITH Sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT
ARRAY(SELECT x
FROM UNNEST(some_numbers) AS x
WHERE 2 IN UNNEST(some_numbers)) AS contains_two
FROM Sequences;
Scanning arrays
-- Scanning arrays for specific values
SELECT 2 IN UNNEST([0, 1, 1, 2, 3, 5]) AS contains_value;
/*----------------*
| contains_value |
+----------------+
| true |
*----------------*/
-- Scanning rows of a table for specific values in array columns
WITH Sequences AS
(SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT 2 AS id, [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT 3 AS id, [5, 10] AS some_numbers)
SELECT id AS matching_rows
FROM Sequences
WHERE 2 IN UNNEST(Sequences.some_numbers)
ORDER BY matching_rows;
/*---------------*
| matching_rows |
+---------------+
| 1 |
| 2 |
*---------------*/
-- Scanning arrays for values that satisfy a condition
WITH Sequences AS (
SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT 2 AS id, [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT 3 AS id, [5, 10] AS some_numbers
)
SELECT id AS matching_rows
FROM Sequences
WHERE EXISTS(SELECT * FROM UNNEST(some_numbers) AS x WHERE x > 5);
/*---------------*
| matching_rows |
+---------------+
| 2 |
| 3 |
*---------------*/
-- Scanning arrays of STRUCTs for field values that satisfy a condition
WITH Sequences AS (
SELECT 1 AS id, [STRUCT(0 AS a, 1 AS b)] AS some_numbers
UNION ALL SELECT 2 AS id, [STRUCT(2 AS a, 4 AS b)] AS some_numbers
UNION ALL SELECT 3 AS id, [STRUCT(5 AS a, 3 AS b), STRUCT(7 AS a, 4 AS b)] AS some_numbers
)
SELECT id AS matching_rows
FROM Sequences
WHERE EXISTS(SELECT 1 FROM UNNEST(some_numbers) WHERE b > 3);
/*---------------*
| matching_rows |
+---------------+
| 2 |
| 3 |
*---------------*/
Arrays and aggregation
-- Aggregating values into an array with ARRAY_AGG()
WITH Fruits AS
(SELECT "apple" AS fruit
UNION ALL SELECT "pear" AS fruit
UNION ALL SELECT "banana" AS fruit)
SELECT ARRAY_AGG(fruit) AS fruit_basket
FROM Fruits;
-- Result:
/*-----------------------*
| fruit_basket |
+-----------------------+
| [apple, pear, banana] |
*-----------------------*/
-- Ordering array elements with ARRAY_AGG()
WITH Fruits AS
(SELECT "apple" AS fruit
UNION ALL SELECT "pear" AS fruit
UNION ALL SELECT "banana" AS fruit)
SELECT ARRAY_AGG(fruit ORDER BY fruit) AS fruit_basket
FROM Fruits;
-- Result:
/*-----------------------*
| fruit_basket |
+-----------------------+
| [apple, banana, pear] |
*-----------------------*/
-- Applying aggregate functions to array elements
WITH Sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT some_numbers,
(SELECT SUM(x)
FROM UNNEST(s.some_numbers) AS x) AS sums
FROM Sequences AS s;
-- Result:
/*--------------------+------*
| some_numbers | sums |
+--------------------+------+
| [0, 1, 1, 2, 3, 5] | 12 |
| [2, 4, 8, 16, 32] | 62 |
| [5, 10] | 15 |
*--------------------+------*/
-- Concatenating array elements across rows with ARRAY_CONCAT_AGG()
WITH Aggregates AS
(SELECT [1,2] AS numbers
UNION ALL SELECT [3,4] AS numbers
UNION ALL SELECT [5, 6] AS numbers)
SELECT ARRAY_CONCAT_AGG(numbers) AS count_to_six_agg
FROM Aggregates;
-- Result:
/*--------------------------------------------------*
| count_to_six_agg |
+--------------------------------------------------+
| [1, 2, 3, 4, 5, 6] |
*--------------------------------------------------*/
Converting arrays to strings
-- Converting arrays to strings with ARRAY_TO_STRING()
WITH Words AS
(SELECT ["Hello", "World"] AS greeting)
SELECT ARRAY_TO_STRING(greeting, " ") AS greetings
FROM Words;
-- Result:
/*-------------*
| greetings |
+-------------+
| Hello World |
*-------------*/
-- Handling NULL values in ARRAY_TO_STRING()
SELECT
ARRAY_TO_STRING(arr, ".", "N") AS non_empty_string,
ARRAY_TO_STRING(arr, ".", "") AS empty_string,
ARRAY_TO_STRING(arr, ".") AS omitted
FROM (SELECT ["a", NULL, "b", NULL, "c", NULL] AS arr);
-- Result:
/*------------------+--------------+---------*
| non_empty_string | empty_string | omitted |
+------------------+--------------+---------+
| a.N.b.N.c.N | a..b..c. | a.b.c |
*------------------+--------------+---------*/
Combining arrays
SELECT ARRAY_CONCAT([1, 2], [3, 4], [5, 6]) AS count_to_six;
/*--------------------------------------------------*
| count_to_six |
+--------------------------------------------------+
| [1, 2, 3, 4, 5, 6] |
*--------------------------------------------------*/
Updating arrays
-- Table before update
WITH arrays_table AS (
SELECT
[1, 2] AS regular_array,
STRUCT([10, 20] AS first_array, [100, 200] AS second_array) AS nested_arrays
UNION ALL SELECT
[3, 4] AS regular_array,
STRUCT([30, 40] AS first_array, [130, 400] AS second_array) AS nested_arrays
)
SELECT * FROM arrays_table;
-- Result:
/*---------------*-------------------*------------------------*
| regular_array | nested_arrays. | nested_arrays. |
| | first_array | second_array |
+---------------+-------------------+------------------------+
| [1, 2] | [10, 20] | [100, 200] |
| [3, 4] | [30, 40] | [130, 400] |
*---------------*-------------------+------------------------*/
-- Updating arrays
UPDATE
arrays_table
SET
regular_array = ARRAY_CONCAT(regular_array, [5]),
nested_arrays.second_array =
ARRAY_CONCAT(nested_arrays.second_array,
nested_arrays.first_array)
WHERE TRUE;
-- Table after update
SELECT * FROM arrays_table;
-- Result:
/*---------------*---------------*--------------------*
| regular_array | nested_arrays.| nested_arrays. |
| | first_array | second_array |
+---------------+---------------+--------------------+
| [1, 2, 5] | [10, 20] | [100, 200, 10, 20] |
| [3, 4, 5] | [30, 40] | [130, 400, 30, 40] |
*---------------*---------------+--------------------*/
Zipping arrays
-- Define a Common Table Expression (CTE) named 'Combinations' containing two arrays: 'letters' and 'numbers'.
WITH
Combinations AS (
SELECT
['a', 'b'] AS letters,
[1, 2, 3] AS numbers
)
-- Selecting the zipped pairs of elements from the 'letters' and 'numbers' arrays.
SELECT
ARRAY(
SELECT AS STRUCT
-- Accessing elements from 'letters' and 'numbers' arrays using SAFE_OFFSET(index).
letters[SAFE_OFFSET(index)] AS letter,
numbers[SAFE_OFFSET(index)] AS number
FROM Combinations
CROSS JOIN
-- Generating an array of indices based on the minimum length of the two input arrays.
UNNEST(
GENERATE_ARRAY(
0,
LEAST(ARRAY_LENGTH(letters), ARRAY_LENGTH(numbers)) - 1)) AS index
ORDER BY index
) AS pairs;
/*------------------------------*
| pairs |
+------------------------------+
| [{ letter: "a", number: 1 }, |
| { letter: "b", number: 2 }] |
*------------------------------*/
Building arrays of arrays
-- Define the Points table
WITH Points AS
(SELECT [1, 5] AS point
UNION ALL SELECT [2, 8] AS point
UNION ALL SELECT [3, 7] AS point
UNION ALL SELECT [4, 1] AS point
UNION ALL SELECT [5, 7] AS point)
-- Create an array of structs, each containing a point
SELECT ARRAY(
SELECT STRUCT(point)
FROM Points)
AS coordinates;
+---------------------+
| coordinates |
+---------------------+
| [{point: [1,5]}, |
| {point: [2,8]}, |
| {point: [3,7]}, |
| {point: [4,1]}, |
| {point: [5,7]}] |
+---------------------+
Create JSON values
-- Create a JSON value
INSERT INTO mydataset.table1 VALUES
(1, JSON '{"name": "Alice", "age": 30}'),
(2, JSON_ARRAY(10, ['foo', 'bar'], [20, 30])),
(3, JSON_OBJECT('foo', 10, 'bar', ['a', 'b']));
-- Convert a STRING type to JSON type
CREATE OR REPLACE TABLE mydataset.table_new
AS (
SELECT
id, SAFE.PARSE_JSON(cart) AS cart_json
FROM
mydataset.old_table
);
-- Convert schematized data to JSON
WITH Fruits AS (
SELECT 0 AS id, 'color' AS k, 'Red' AS v UNION ALL
SELECT 0, 'fruit', 'apple' UNION ALL
SELECT 1, 'fruit','banana' UNION ALL
SELECT 1, 'ripe', 'true'
)
SELECT JSON_OBJECT(ARRAY_AGG(k), ARRAY_AGG(v)) AS json_data
FROM Fruits
GROUP BY id;
+----------------------------------+
| json_data |
+----------------------------------+
| {"color":"Red","fruit":"apple"} |
| {"fruit":"banana","ripe":"true"} |
+----------------------------------+
-- Convert SQL type to JSON type
SELECT TO_JSON(STRUCT(1 AS id, [10,20] AS coordinates)) AS pt;
+--------------------------------+
| pt |
+--------------------------------+
| {"coordinates":[10,20],"id":1} |
+--------------------------------+
Query JSON data
-- Defining a table and inserting JSON data
CREATE OR REPLACE TABLE mydataset.table1(id INT64, cart JSON);
INSERT INTO mydataset.table1 VALUES
(1, JSON '{
"name": "Alice",
"items": [
{"product": "book", "price": 10},
{"product": "food", "price": 5}
]
}'
),
(2, JSON '{
"name": "Bob",
"items": [{
"product": "pen",
"price": 20}
]
}'
);
-- VALUES AS JSON
-- Accessing fields using the field access operator
SELECT cart.name AS name
FROM mydataset.table1;
+---------+
| name |
+---------+
| "Alice" |
| "Bob" |
+---------+
-- Accessing array elements using the JSON subscript operator
SELECT cart.items[0] AS first_item
FROM mydataset.table1;
+-------------------------------+
| first_item |
+-------------------------------+
| {"price":10,"product":"book"} |
| {"price":20,"product":"pen"} |
+-------------------------------+
-- Referencing members of a JSON object by name using the JSON subscript operator
SELECT cart['name'] AS name
FROM mydataset.table1;
+---------+
| name |
+---------+
| "Alice" |
| "Bob" |
+---------+
-- Using non-constant expressions inside the JSON subscript operator
DECLARE int_val INT64 DEFAULT 0;
SELECT cart[CONCAT('it','ems')][int_val + 1].product AS item
FROM mydataset.table1;
+--------+
| item |
+--------+
| "food" |
| NULL |
+--------+
-- Chaining expressions using field access and subscript operators
SELECT cart.address AS address,
cart.items[1].price AS item1_price
FROM mydataset.table1;
+---------+-------------+
| address | item1_price |
+---------+-------------+
| NULL | NULL |
| NULL | 5 |
+---------+-------------+
-- VALUES AS STRINGS
-- Extracting a scalar value as a string using JSON_VALUE
SELECT JSON_VALUE(cart.name) AS name
FROM mydataset.table1;
-- Result:
+-------+
| name |
+-------+
| Alice |
+-------+
-- Filtering rows based on a JSON value using JSON_VALUE in a WHERE clause
SELECT cart.items[0] AS first_item
FROM mydataset.table1
WHERE JSON_VALUE(cart.name) = 'Alice';
-- Result:
+-------------------------------+
| first_item |
+-------------------------------+
| {"price":10,"product":"book"} |
+-------------------------------+
-- Extracting a JSON string using the STRING function
SELECT STRING(JSON '"purple"') AS color;
-- Result:
+--------+
| color |
+--------+
| purple |
+--------+
-- Extracting JSON values as other SQL data types
SELECT
BOOL(JSON '"true"') AS is_true,
INT64(JSON '42') AS integer_value,
FLOAT64(JSON '3.14') AS float_value;
-- Result:
+---------+---------------+-------------+
| is_true | integer_value | float_value |
+---------+---------------+-------------+
| true | 42 | 3.14 |
+---------+---------------+-------------+
-- Determining the type of a JSON value using JSON_TYPE
SELECT
JSON_TYPE(JSON '42') AS type_of_integer,
JSON_TYPE(JSON '3.14') AS type_of_float,
JSON_TYPE(JSON '"string"') AS type_of_string;
-- Result:
+-----------------+---------------+----------------+
| type_of_integer | type_of_float | type_of_string |
+-----------------+---------------+----------------+
| INTEGER | FLOAT | STRING |
+-----------------+---------------+----------------+
-- LAX CONVERSION
-- Convert JSON to INT64
SELECT LAX_INT64(JSON '"10"') AS id;
+----+
| id |
+----+
| 10 |
+----+
-- ARRAYS FROM JSON
-- Extract JSON arrays using JSON_QUERY_ARRAY
SELECT JSON_QUERY_ARRAY(cart.items) AS items
FROM mydataset.table1;
+----------------------------------------------------------------+
| items |
+----------------------------------------------------------------+
| [{"price":10,"product":"book"}","{"price":5,"product":"food"}] |
| [{"price":20,"product":"pen"}] |
+----------------------------------------------------------------+
-- Split JSON arrays into individual elements
SELECT
id,
JSON_VALUE(item.product) AS product
FROM
mydataset.table1, UNNEST(JSON_QUERY_ARRAY(cart.items)) AS item
ORDER BY id;
+----+---------+
| id | product |
+----+---------+
| 1 | book |
| 1 | food |
| 2 | pen |
+----+---------+
-- Aggregate values back into a SQL array
SELECT
id,
ARRAY_AGG(JSON_VALUE(item.product)) AS products
FROM
mydataset.table1, UNNEST(JSON_QUERY_ARRAY(cart.items)) AS item
GROUP BY id
ORDER BY id;
+----+-----------------+
| id | products |
+----+-----------------+
| 1 | ["book","food"] |
| 2 | ["pen"] |
+----+-----------------+
-- JSON NULLS
SELECT
json.a AS json_query, -- Equivalent to JSON_QUERY(json, '$.a')
JSON_VALUE(json, '$.a') AS json_value
FROM (SELECT JSON '{"a": null}' AS json);
+------------+------------+
| json_query | json_value |
+------------+------------+
| null | NULL |
+------------+------------+
Check for infinite recursion
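-- Template sketch: base_expression, recursive_expression, and
-- termination_condition_expression are placeholders for your own logic.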
DECLARE current_iteration INT64 DEFAULT 0;
CREATE TEMP TABLE recursive_cte_name AS
SELECT base_expression, current_iteration AS iteration;
REPEAT
SET current_iteration = current_iteration + 1;
INSERT INTO recursive_cte_name
SELECT recursive_expression, current_iteration
FROM recursive_cte_name
WHERE termination_condition_expression
AND iteration = current_iteration - 1
AND current_iteration < 100;
UNTIL NOT EXISTS(SELECT * FROM recursive_cte_name WHERE iteration = current_iteration)
END REPEAT;
Transaction scope
BEGIN
BEGIN TRANSACTION;
INSERT INTO mydataset.NewArrivals
VALUES ('top load washer', 100, 'warehouse #1');
-- Trigger an error.
SELECT 1/0;
COMMIT TRANSACTION;
EXCEPTION WHEN ERROR THEN
-- Roll back the transaction inside the exception handler.
SELECT @@error.message;
ROLLBACK TRANSACTION;
END;
Example transaction
-- Tables
CREATE OR REPLACE TABLE mydataset.Inventory
(
product string,
quantity int64,
supply_constrained bool
);
CREATE OR REPLACE TABLE mydataset.NewArrivals
(
product string,
quantity int64,
warehouse string
);
INSERT mydataset.Inventory (product, quantity)
VALUES('top load washer', 10),
('front load washer', 20),
('dryer', 30),
('refrigerator', 10),
('microwave', 20),
('dishwasher', 30);
INSERT mydataset.NewArrivals (product, quantity, warehouse)
VALUES('top load washer', 100, 'warehouse #1'),
('dryer', 200, 'warehouse #2'),
('oven', 300, 'warehouse #1');
-- Transaction
BEGIN TRANSACTION;
-- Create a temporary table that holds new arrivals from 'warehouse #1'.
CREATE TEMP TABLE tmp
AS SELECT * FROM mydataset.NewArrivals WHERE warehouse = 'warehouse #1';
-- Delete the matching records from the NewArrivals table.
DELETE mydataset.NewArrivals WHERE warehouse = 'warehouse #1';
-- Merge the records from the temporary table into the Inventory table.
MERGE mydataset.Inventory AS I
USING tmp AS T
ON I.product = T.product
WHEN NOT MATCHED THEN
INSERT(product, quantity, supply_constrained)
VALUES(product, quantity, false)
WHEN MATCHED THEN
UPDATE SET quantity = I.quantity + T.quantity;
-- Drop the temporary table and commit the transaction.
DROP TABLE tmp;
COMMIT TRANSACTION;