BigQuery query internals
In this comprehensive guide, we delve into various aspects of Google BigQuery, drawing insights from its official documentation. From understanding different query modes to exploring advanced topics like recursive common table expressions (CTEs) and transactions, this post aims to provide a detailed overview of BigQuery's functionalities. Whether you're a beginner or an experienced user, this guide covers essential topics such as writing query results, working with JSON data, utilizing sketches for efficient data analysis, and much more. Let's dive in!
Query modes
Interactive Queries: Run immediately on-demand, prioritized for quick execution. BigQuery dynamically adjusts limits for concurrent queries based on resource availability, with a preference for interactive over batch queries. Exceeding limits places extra queries in a queue.
Batch Queries: Executed when idle resources are available, providing a lower-priority, cost-effective option. Useful for non-urgent jobs.
To run a query at batch priority in the Google Cloud console, after entering the query select:
More → Query Settings → Resource Management → Batch
By default, queries are interactive. Results are stored in temporary or specified permanent tables, with options to append, overwrite, or create new tables.
Writing query results
BigQuery stores query results in either temporary or permanent tables.
Temporary Tables: Automatically used to cache results not written to a permanent table, existing for up to 24 hours. They are created in a special dataset with random names, and their data is accessible only to the creator. Temporary tables support multi-statement queries and sessions but cannot be shared or listed through standard methods. They are created in the same region as the tables being queried.
Permanent Tables: Can be new or existing within any accessible dataset. Storing data in new permanent tables incurs storage charges. Destination datasets for query results must be in the same location as the queried tables.
Billing: The project executing the query pays for processing, while the data-hosting project covers storage costs in BigQuery.
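Although destination tables are normally configured through query settings or the bq tool, the same effect is available in plain SQL with CREATE TABLE ... AS SELECT (or INSERT INTO ... SELECT to append). A minimal sketch; the table name is illustrative:
-- Store query results in a new permanent table (CREATE OR REPLACE overwrites).
CREATE OR REPLACE TABLE mydataset.word_totals AS
SELECT word, SUM(word_count) AS total
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word;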
Save query results
After running a SQL query in the Google Cloud console, you can download the results to a local file, Google Sheets, or Google Drive, preserving column sort order. The bq command-line tool and the API do not support this feature.
Limitations:
Download formats are limited to CSV or newline-delimited JSON.
Google Sheets cannot store query results with nested and repeated data.
Results over 1 GB can't be saved to Google Drive via the console; use a table for larger datasets.
Local CSV downloads are capped at 10 MB, with actual size depending on the query's schema.
Only CSV or newline-delimited JSON formats are supported for saving to Google Drive.
Query data with SQL
Arrays
An array is an ordered list of zero or more values of the same data type.
Access array elements
In BigQuery, you can access array elements within queries using a zero-based index (OFFSET()) or a one-based index (ORDINAL()). Use SAFE_OFFSET() or SAFE_ORDINAL() to return NULL instead of raising an error for out-of-range indexes.
WITH Sequences AS (
  SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
)
SELECT
  some_numbers,
  some_numbers[0] AS index_0,           -- first element (zero-based)
  some_numbers[OFFSET(1)] AS offset_1,  -- second element (zero-based)
  some_numbers[ORDINAL(1)] AS ordinal_1 -- first element (one-based)
FROM Sequences;
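Out-of-range indexes raise an error with OFFSET() and ORDINAL(); the SAFE_ variants return NULL instead. A quick sketch:
SELECT
  [10, 20, 30][SAFE_OFFSET(5)] AS out_of_range,   -- NULL instead of an error
  [10, 20, 30][SAFE_ORDINAL(3)] AS third_element; -- 30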
Finding lengths
Use the ARRAY_LENGTH() function to get the number of elements in an array.
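For example:
SELECT ARRAY_LENGTH([0, 1, 1, 2, 3, 5]) AS len; -- returns 6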
Converting elements in an array to rows in a table
To break down an ARRAY into rows, use UNNEST. To preserve element order, use WITH OFFSET. You can then cross join to flatten an entire column of ARRAYs while preserving other column values (see the Annotations section for examples). To query nested arrays in a table containing an ARRAY of STRUCTs, or to flatten ARRAY-type fields of STRUCT values, use the UNNEST operator, which lets you access the fields of the STRUCTs within the ARRAY.
Querying STRUCT elements in an array
To query STRUCT elements in an ARRAY, use UNNEST with CROSS JOIN to flatten the ARRAY of STRUCTs. You can then access specific information from repeated fields within each STRUCT.
Querying ARRAY-type fields in a struct
To retrieve information from nested repeated fields, such as the fastest lap in an 800M race, use UNNEST with the comma operator (,) or CROSS JOIN to flatten the nested arrays. Handle NULL or empty arrays with a LEFT JOIN if needed.
Creating arrays from subqueries
In GoogleSQL, you can create arrays from subquery results using the ARRAY() function. For instance, you can multiply each value in an array by two using a subquery with UNNEST and then recombine the results into an array.
Filtering arrays
In GoogleSQL, you can filter arrays using the WHERE clause in the ARRAY() function's subquery to return specific rows. Additionally, you can use SELECT DISTINCT to retrieve unique elements within an array. Furthermore, you can filter rows of arrays using the IN keyword to check if a specific value matches an element in the array.
Scanning arrays
Using IN with UNNEST: To check whether an array contains a specific value, use the IN operator with UNNEST.
Using EXISTS with UNNEST: To scan an array for values that satisfy a condition, use UNNEST to return a table of the array's elements, then filter the resulting table with EXISTS and the condition to check if any rows match the criteria.
Arrays and aggregation
ARRAY_AGG() in GoogleSQL aggregates values into an array, which is useful for consolidating data. You can order array elements with ORDER BY within ARRAY_AGG() and apply aggregate functions like SUM() to array elements. ARRAY_CONCAT_AGG() concatenates array elements across rows, creating a consolidated array. Note that the order of elements in arrays returned by these functions is non-deterministic.
Converting arrays to strings
The ARRAY_TO_STRING() function converts an array to a single string value, with optional handling for NULL elements and customizable separators.
Combining arrays
In some cases, you might want to combine multiple arrays into a single array. You can accomplish this using the ARRAY_CONCAT() function.
Updating arrays
Updating arrays in GoogleSQL involves using the UPDATE statement along with array functions like ARRAY_CONCAT() to modify array elements. You can append new elements to an existing array or update array fields within nested structures. The UPDATE statement is typically followed by a SET clause specifying the modifications to be made, and a WHERE clause to filter the rows to be updated.
Zipping arrays
Zipping arrays involves merging two arrays of equal size into a single array containing pairs of elements from corresponding positions. This operation can be achieved using UNNEST and WITH OFFSET to create pairs stored as STRUCTs in an array.
Building arrays of arrays
GoogleSQL does NOT support creating arrays of arrays directly. Instead, you can create an array of structs, where each struct contains a field of type ARRAY.
JSON data
JSON is a widely used format for semi-structured data, allowing a "schema-on-read" approach. Unlike BigQuery's fixed schema for STRUCT type, JSON data type enables loading semi-structured data without upfront schema. It permits storing and querying data without fixed schemas, encoding and processing JSON fields individually. Querying JSON in BigQuery is intuitive and cost-efficient, using the field access operator.
Limitations
Batch load jobs can ingest JSON data only from CSV, Avro, or JSON source files.
JSON data type has a nesting limit of 500 levels.
Legacy SQL cannot be used to query tables containing JSON types.
Row-level access policies cannot be applied to JSON columns.
Create a table with a JSON column
CREATE TABLE mydataset.table1(
  id INT64,
  cart JSON
);
You can't partition or cluster a table on JSON columns, because the equality and comparison operators are not defined on the JSON type.
Create JSON values
Create a JSON value: You can insert JSON values directly into a table using either JSON literals or functions like JSON_ARRAY() and JSON_OBJECT().
Convert a STRING type to JSON type: Use the PARSE_JSON() function to convert JSON-formatted STRING values to JSON type. This is particularly useful when converting columns from existing tables to JSON type.
Convert schematized data to JSON: Utilize functions like JSON_OBJECT() to convert schematized data, such as key-value pairs, into JSON format. This allows you to aggregate data into JSON objects based on specific criteria.
Convert SQL type to JSON type: The TO_JSON function in BigQuery converts SQL structured data types, such as STRUCTs, to JSON format.
To ingest JSON data into BigQuery:
Batch Load: Load from CSV, Avro, or JSON files.
Storage Write API: Programmatically ingest JSON data.
Legacy Streaming API: Stream JSON data into BigQuery.
Examples:
Load from CSV: Use bq load with CSV files.
Load from NDJSON: Use bq load with newline-delimited JSON files.
Storage Write API: Format data using protocol buffers.
Legacy Streaming API: Stream JSON data using the Python client.
Query JSON data
Extract values as JSON
In BigQuery, you can access JSON fields using field access and subscript operators. These allow for extracting specific elements or fields from JSON data. Operators return JSON types and are equivalent to the JSON_QUERY function.
They return NULL if a member or element is not found. However, direct comparison operations aren't supported; use functions like JSON_VALUE for such operations.
Extract values as strings
To extract values from JSON data as strings in BigQuery, use the JSON_VALUE function. It retrieves scalar values from JSON and returns them as SQL strings, which is useful in contexts requiring equality or comparison, such as WHERE clauses. The STRING function can be used for similar purposes. JSON data can also be extracted and converted to other SQL data types using value extraction functions like BOOL, INT64, and FLOAT64. To determine the type of a JSON value, use the JSON_TYPE function.
Flexibly convert JSON
LAX conversion functions in BigQuery enable flexible, error-tolerant conversion of JSON values to scalar SQL types. Functions like LAX_INT64 infer the input type and convert it where possible, returning NULL rather than raising an error otherwise. Other options include LAX_STRING, LAX_BOOL, and LAX_FLOAT64.
Extract arrays from JSON
To extract arrays from JSON in BigQuery, use the JSON_QUERY_ARRAY and JSON_VALUE_ARRAY functions. JSON_QUERY_ARRAY extracts JSON arrays, while JSON_VALUE_ARRAY extracts arrays of scalar values.
UNNEST is then used to split the array into individual elements, which can be further processed or aggregated using other functions like ARRAY_AGG.
JSON nulls
JSON null values in BigQuery are distinct from SQL NULL values. When working with JSON data, it's essential to understand how null values are handled by different functions. While the JSON_QUERY function returns a JSON null when extracting a field with a null value, the JSON_VALUE function returns SQL NULL, as JSON null is not considered a scalar value.
Sketches
GoogleSQL for BigQuery offers support for data sketches, which are compact summaries of data aggregations. Using sketches can significantly reduce computation costs compared to exact calculations, making them ideal for scenarios where query time and resource usage need to be optimized. Sketches allow for estimating metrics like cardinalities and quantiles without the need to process raw data, enabling faster and more efficient analyses.
Sketched data introduces a statistical error represented by an error bound or confidence interval, but this trade-off in accuracy is generally acceptable given the benefits of faster computations and reduced storage requirements. Sketches represent approximate aggregates for specific metrics, are compact, and can be merged with other sketches to summarize larger data sets.
Sketch merging enables the construction of summaries for multiple data streams, facilitating tasks such as estimating the number of distinct users over time or creating roll-ups of OLAP cubes. These operations help optimize query performance and resource usage by avoiding the need to process the entire input data set.
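As a concrete illustration, BigQuery's HyperLogLog++ functions build mergeable cardinality sketches. A minimal sketch with made-up visit data:
WITH DailyVisits AS (
  SELECT 'mon' AS day, user_id
  FROM UNNEST(['alice', 'bob', 'carol', 'alice']) AS user_id
  UNION ALL
  SELECT 'tue', user_id
  FROM UNNEST(['bob', 'carol', 'dan']) AS user_id
),
Sketches AS (
  -- One compact HLL++ sketch per day.
  SELECT day, HLL_COUNT.INIT(user_id) AS sketch
  FROM DailyVisits
  GROUP BY day
)
-- Merge the per-day sketches to estimate distinct users across both days (≈ 4).
SELECT HLL_COUNT.MERGE(sketch) AS approx_distinct_users
FROM Sketches;
Because the per-day sketches are stored rather than recomputed, the merge step never touches the raw visit rows.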
Multi-statement queries
Multi-statement queries enable the execution of multiple SQL statements in a single request, supporting procedural language statements for tasks like variable definition and control flow implementation. They are commonly employed in stored procedures and may have side effects on table data.
Write, run and save
Multi-statement queries in BigQuery allow the execution of multiple SQL statements in one request.
These queries can include procedural language statements, enabling variable usage and control flow implementation.
To write a multi-statement query, separate SQL statements with semicolons.
BigQuery interprets any request with multiple statements as a multi-statement query, unless the statements consist entirely of CREATE TEMP FUNCTION statements followed by a single SELECT statement.
Running a multi-statement query is done similarly to a single statement query, through the Google Cloud console or bq command-line tool.
Dry runs estimate the bytes read by a multi-statement query, but have specific handling for different types of statements.
Dry runs operate on a best-effort basis and might not always accurately predict query execution outcomes.
Variables
Multi-statement queries in BigQuery support the use of both user-created and system variables.
User-created variables are declared using the DECLARE statement and can be assigned values with the SET statement.
-- Declare a user-created variable and set its value
DECLARE x INT64 DEFAULT 0;
SET x = 10;
System variables are built into BigQuery and can be overridden with the SET statement.
SET @@dataset_project_id = 'MyProject';
SET @@dataset_id = 'MyDataset';

BEGIN
  CREATE TABLE MyTempTableA (id STRING);
  CREATE TABLE MyTempTableB (id STRING);
END;
User-created variables must be declared at the start of a multi-statement query or within a BEGIN block, while system variables are implicitly available.
User-created variables have a maximum size limit of 1 MB per variable and 10 MB for all variables in a multi-statement query.
After declaring and setting a user-created variable, you can reference it in a multi-statement query. If a variable and column share the same name, the column takes precedence.
DECLARE x INT64 DEFAULT 0;
SET x = 10;

WITH Numbers AS (SELECT 50 AS x)
SELECT (x + x) AS result FROM Numbers;

+--------+
| result |
+--------+
| 100 |
+--------+
System variables can be used to simplify queries by setting default values for projects, datasets, or other parameters.
BEGIN
  CREATE OR REPLACE TABLE MyDataset.MyTable (default_time_zone STRING)
  OPTIONS (description = @@time_zone);
END;
Both user-created and system variables can be referenced throughout a multi-statement query.
However, system variables have limitations in certain contexts, such as using them in table paths or as project names in DDL and DML queries, which can result in errors.
Temporary tables
Temporary tables in BigQuery facilitate storing intermediate results within multi-statement queries without the need for dataset maintenance. They're created, referenced, and deleted within a query's scope, residing in automatically managed datasets.
To create a temporary table, use the CREATE TEMP TABLE statement followed by the table definition and query. References to temporary tables are local to the multi-statement query, and the tables are automatically deleted after 24 hours. You can delete a temporary table explicitly with DROP TABLE or let it be deleted automatically. Temporary table data is accessible through the BigQuery Explorer page in the Google Cloud console. For clarity, you can qualify temporary table references with _SESSION, but you can't use _SESSION to create non-temporary tables.
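A minimal multi-statement sketch, assuming a mydataset.Inventory table like the one built in the Annotations section:
-- Stage an intermediate result in a temporary table.
CREATE TEMP TABLE TopProducts AS
SELECT product, SUM(quantity) AS total_quantity
FROM mydataset.Inventory
GROUP BY product;

-- Reference it later in the same multi-statement query.
SELECT * FROM TopProducts ORDER BY total_quantity DESC LIMIT 3;

-- Optional: drop it explicitly instead of waiting for automatic cleanup.
DROP TABLE TopProducts;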
Job information
A multi-statement query job in BigQuery contains information about a query consisting of multiple SQL statements. You can retrieve details about the job, including the last executed statement and all executed statements.
To return the last executed statement, use the jobs.getQueryResults method. To retrieve the results of all statements executed in the multi-statement query, enumerate the child jobs created for each statement: call the jobs.list method with the parentJobId parameter set to the multi-statement query job ID, then call jobs.getQueryResults on each child job. The same child jobs can also be enumerated in SQL, as sketched below.
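A hedged sketch using the INFORMATION_SCHEMA.JOBS view, which exposes a parent_job_id column; the job ID here is a placeholder:
SELECT job_id, statement_type, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE parent_job_id = 'my_script_job_id'; -- hypothetical multi-statement job ID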
Debug
ASSERT statement: Use the ASSERT statement to verify that a Boolean condition is true. If the condition is false, an error is raised.
BEGIN...EXCEPTION...END: Employ BEGIN...EXCEPTION...END blocks to handle errors gracefully. This structure catches errors and lets you display error messages and stack traces.
SELECT FORMAT("..."): Use SELECT FORMAT("...") to display intermediate results during the execution of a multi-statement query.
Viewing output: In the Google Cloud console and with the bq command-line tool, you can view the results of each statement within a multi-statement query. You can also select and run individual statements within the query editor in the Google Cloud console.
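A small sketch combining the first three techniques (the failing assertion is deliberate):
BEGIN
  DECLARE x INT64 DEFAULT -1;
  -- Display intermediate state while the query runs.
  SELECT FORMAT('x is currently %d', x) AS debug_output;
  -- Raise an error if an invariant is violated.
  ASSERT x >= 0 AS 'x must be non-negative';
EXCEPTION WHEN ERROR THEN
  -- Catch the error and surface its message.
  SELECT @@error.message AS error_message;
END;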
Recursive CTEs
In GoogleSQL for BigQuery, the WITH clause lets you define temporary tables known as common table expressions (CTEs). CTEs can be recursive or non-recursive, with recursion enabled by the RECURSIVE keyword (WITH RECURSIVE). Recursive CTEs are useful for querying hierarchical and graph data but can be computationally expensive.
Create
To create a recursive common table expression (CTE) in GoogleSQL, use the WITH RECURSIVE clause.
WITH RECURSIVE
CTE_1 AS (
(SELECT 1 AS iteration UNION ALL SELECT 1 AS iteration)
UNION ALL
SELECT iteration + 1 AS iteration FROM CTE_1 WHERE iteration < 3
)
SELECT iteration FROM CTE_1
ORDER BY 1 ASC;
/*-----------*
| iteration |
+-----------+
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
*-----------*/
Recursive CTEs consist of a base term, a union operator, and a recursive term. The base term initializes the first iteration, while the recursive term generates subsequent iterations. Only the recursive term can reference the CTE itself. Refer to the GoogleSQL reference documentation for detailed syntax and examples.
Explore reachability in a directed acyclic graph (DAG)
Recursive queries are useful for exploring reachability in directed acyclic graphs (DAGs). By leveraging recursive common table expressions (CTEs), you can iteratively traverse a graph to find all nodes reachable from a specific starting node. This approach simplifies complex graph traversal tasks and enables efficient analysis of graph structures.
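A minimal sketch over a hand-built edge list, finding every node reachable from node 'a':
WITH RECURSIVE
  Edges AS (
    SELECT 'a' AS from_node, 'b' AS to_node UNION ALL
    SELECT 'a', 'c' UNION ALL
    SELECT 'b', 'd' UNION ALL
    SELECT 'c', 'd' UNION ALL
    SELECT 'd', 'e'
  ),
  Reachable AS (
    SELECT 'a' AS node            -- base term: the starting node
    UNION ALL
    SELECT e.to_node              -- recursive term: follow outgoing edges
    FROM Reachable AS r
    JOIN Edges AS e ON r.node = e.from_node
  )
SELECT DISTINCT node FROM Reachable;
Because the graph is acyclic, the recursive term eventually stops producing rows and the query terminates on its own.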
Troubleshoot iteration limit errors
Recursive common table expressions (CTEs) in BigQuery are subject to a maximum iteration limit of 500 to prevent infinite recursion. This limit helps control resource consumption and query execution time. If you encounter a recursion iteration limit error, it typically indicates a missing termination condition or inappropriate use of recursive CTEs. Reviewing and adjusting your query logic can help address this issue.
Check for infinite recursion
To prevent infinite recursion, ensure the recursive term stops producing results after a set number of iterations. Another approach is to convert the recursive CTE to a temporary table and use a REPEAT loop to limit the iterations, typically up to 100.
Verify the appropriate usage of the recursive CTE
Verify that your recursive CTE is suitable for querying hierarchical or graph data, as they are optimized for these scenarios. Otherwise, consider alternatives like using the LOOP statement with a non-recursive CTE to avoid unnecessary computational overhead.
If needed, contact Customer Care to raise the recursive limit beyond 500 iterations, understanding the potential risks such as longer execution times and resource constraints.
Transactions
BigQuery enables multi-statement transactions for atomic operations across single or multiple queries within sessions. They facilitate complex operations like mutating rows, ensuring either all changes are committed or rolled back together.
Common uses include:
Batch mutations across tables spanning datasets or projects.
Sequential mutations on a single table based on interim computations.
Transactions ensure ACID properties and provide snapshot isolation, ensuring consistent reads within the transaction. However, reads from external data sources may lack consistency if the source data changes during the transaction.
Scope
In BigQuery, transactions are contained within a single SQL query, except in Session mode. They can't be nested but can span multiple queries within a session.
To start a transaction, use BEGIN TRANSACTION. It ends with either a COMMIT TRANSACTION to commit changes or a ROLLBACK TRANSACTION to discard them. If an error occurs and there is no exception handler, the query fails and BigQuery automatically rolls back the transaction.
Supported statements
Only specific statement types are supported within transactions in BigQuery:
Query statements: SELECT
DML statements: INSERT, UPDATE, DELETE, MERGE, and TRUNCATE TABLE
DDL statements on temporary entities: CREATE TEMP TABLE, CREATE TEMP FUNCTION, DROP TABLE on a temporary table, DROP FUNCTION on a temporary function
DDL statements creating or dropping permanent entities like datasets, tables, and functions aren't supported within transactions.
Date/time functions
In transactions, specific behaviors apply to date/time functions:
CURRENT_TIMESTAMP, CURRENT_DATE, and CURRENT_TIME return the transaction start timestamp.
Using FOR SYSTEM_TIME AS OF to read a table beyond the transaction start timestamp results in an error.
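A minimal sketch of the first behavior; both SELECTs observe the same transaction start timestamp:
BEGIN TRANSACTION;
SELECT CURRENT_TIMESTAMP() AS t1;
SELECT CURRENT_TIMESTAMP() AS t2; -- same value as t1 within the transaction
COMMIT TRANSACTION;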
Example transaction
A complete worked example appears in the Annotations section at the end of this post.
Transaction concurrency
Transactions mutating rows in a table prevent concurrent transactions or DML statements on the same table, resulting in cancellation of conflicting operations.
Conflicting DML statements outside transactions are queued for later execution.
If the transaction only reads or appends rows, concurrent operations can include:
SELECT statements
BigQuery Storage Read API read operations
queries from BigQuery BI Engine
INSERT statements
load jobs using WRITE_APPEND
streaming writes
Each multi-statement transaction in BigQuery is assigned a transaction ID.
Query the INFORMATION_SCHEMA.JOBS* views for the transaction_id column to see the transaction IDs for your jobs. BigQuery generates a child job for each statement within a transaction, and all child jobs associated with a transaction share the same transaction_id value.
Limitations
DDL Statements: Transactions cannot use DDL statements on permanent entities.
Materialized Views: Materialized views act as logical views in transactions, offering no performance benefits.
Rollback: Transaction failures trigger a rollback, undoing all pending changes.
Table Mutation Limits: Transactions can mutate data in up to 100 tables and perform a maximum of 100,000 partition modifications.
BI Engine: BI Engine does not accelerate queries within transactions.
Annotations
Converting elements in an array to rows in a table
-- Flattening an ARRAY with UNNEST and maintaining order with WITH OFFSET
SELECT *
FROM UNNEST(['foo', 'bar', 'baz']) AS element
WITH OFFSET AS offset
ORDER BY offset;
-- Flattening an entire column of ARRAYs while preserving other column values
WITH Sequences AS (
SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT 2 AS id, [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT 3 AS id, [5, 10] AS some_numbers
)
SELECT id, flattened_numbers
FROM Sequences
CROSS JOIN UNNEST(Sequences.some_numbers) AS flattened_numbers;
Querying STRUCT elements in an array
-- Flattening an ARRAY of STRUCTs
WITH Races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" AS name, [23.4, 26.3, 26.4, 26.1] AS laps),
STRUCT("Makhloufi" AS name, [24.5, 25.4, 26.6, 26.1] AS laps),
STRUCT("Murphy" AS name, [23.9, 26.0, 27.0, 26.0] AS laps),
STRUCT("Bosse" AS name, [23.6, 26.2, 26.5, 27.1] AS laps),
STRUCT("Rotich" AS name, [24.7, 25.6, 26.9, 26.4] AS laps),
STRUCT("Lewandowski" AS name, [25.0, 25.7, 26.3, 27.2] AS laps),
STRUCT("Kipketer" AS name, [23.2, 26.1, 27.3, 29.4] AS laps),
STRUCT("Berian" AS name, [23.7, 26.1, 27.0, 29.3] AS laps)]
AS participants)
SELECT
race,
participant
FROM Races AS r
CROSS JOIN UNNEST(r.participants) AS participant;
-- Querying specific information from repeated fields
WITH Races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" AS name, [23.4, 26.3, 26.4, 26.1] AS laps),
STRUCT("Makhloufi" AS name, [24.5, 25.4, 26.6, 26.1] AS laps),
STRUCT("Murphy" AS name, [23.9, 26.0, 27.0, 26.0] AS laps),
STRUCT("Bosse" AS name, [23.6, 26.2, 26.5, 27.1] AS laps),
STRUCT("Rotich" AS name, [24.7, 25.6, 26.9, 26.4] AS laps),
STRUCT("Lewandowski" AS name, [25.0, 25.7, 26.3, 27.2] AS laps),
STRUCT("Kipketer" AS name, [23.2, 26.1, 27.3, 29.4] AS laps),
STRUCT("Berian" AS name, [23.7, 26.1, 27.0, 29.3] AS laps)]
AS participants)
SELECT
race,
(SELECT name
FROM UNNEST(participants)
ORDER BY (
SELECT SUM(duration)
FROM UNNEST(laps) AS duration) ASC
LIMIT 1) AS fastest_racer
FROM Races;
Querying ARRAY-type fields in a struct
-- Querying the runner with the fastest lap
WITH Races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" AS name, [23.4, 26.3, 26.4, 26.1] AS laps),
STRUCT("Makhloufi" AS name, [24.5, 25.4, 26.6, 26.1] AS laps),
STRUCT("Murphy" AS name, [23.9, 26.0, 27.0, 26.0] AS laps),
STRUCT("Bosse" AS name, [23.6, 26.2, 26.5, 27.1] AS laps),
STRUCT("Rotich" AS name, [24.7, 25.6, 26.9, 26.4] AS laps),
STRUCT("Lewandowski" AS name, [25.0, 25.7, 26.3, 27.2] AS laps),
STRUCT("Kipketer" AS name, [23.2, 26.1, 27.3, 29.4] AS laps),
STRUCT("Berian" AS name, [23.7, 26.1, 27.0, 29.3] AS laps)]
AS participants)
SELECT
race,
(SELECT name
FROM UNNEST(participants),
UNNEST(laps) AS duration
ORDER BY duration ASC LIMIT 1) AS runner_with_fastest_lap
FROM Races;
-- Flattening arrays with a CROSS JOIN and handling NULL or empty arrays
WITH Races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" AS name, [23.4, 26.3, 26.4, 26.1] AS laps),
STRUCT("Makhloufi" AS name, [24.5, 25.4, 26.6, 26.1] AS laps),
STRUCT("Murphy" AS name, [23.9, 26.0, 27.0, 26.0] AS laps),
STRUCT("Bosse" AS name, [23.6, 26.2, 26.5, 27.1] AS laps),
STRUCT("Rotich" AS name, [24.7, 25.6, 26.9, 26.4] AS laps),
STRUCT("Lewandowski" AS name, [25.0, 25.7, 26.3, 27.2] AS laps),
STRUCT("Kipketer" AS name, [23.2, 26.1, 27.3, 29.4] AS laps),
STRUCT("Berian" AS name, [23.7, 26.1, 27.0, 29.3] AS laps),
STRUCT("Nathan" AS name, ARRAY<FLOAT64>[] AS laps),
STRUCT("David" AS name, NULL AS laps)]
AS participants)
SELECT
name, SUM(duration) AS finish_time
FROM Races CROSS JOIN Races.participants LEFT JOIN participants.laps AS duration
GROUP BY name;
Creating arrays from subqueries
-- Creating arrays from subqueries
WITH Sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT some_numbers,
ARRAY(SELECT x * 2
FROM UNNEST(some_numbers) AS x) AS doubled
FROM Sequences;
Filtering arrays
-- Filtering arrays using WHERE clause
WITH Sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT
ARRAY(SELECT x * 2
FROM UNNEST(some_numbers) AS x
WHERE x < 5) AS doubled_less_than_five
FROM Sequences;
-- Filtering arrays using SELECT DISTINCT
WITH Sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers)
SELECT ARRAY(SELECT DISTINCT x
FROM UNNEST(some_numbers) AS x) AS unique_numbers
FROM Sequences;
-- Filtering rows of arrays using the IN keyword
WITH Sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT
ARRAY(SELECT x
FROM UNNEST(some_numbers) AS x
WHERE 2 IN UNNEST(some_numbers)) AS contains_two
FROM Sequences;
Scanning arrays
-- Scanning arrays for specific values
SELECT 2 IN UNNEST([0, 1, 1, 2, 3, 5]) AS contains_value;
/*----------------*
| contains_value |
+----------------+
| true |
*----------------*/
-- Scanning rows of a table for specific values in array columns
WITH Sequences AS
(SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT 2 AS id, [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT 3 AS id, [5, 10] AS some_numbers)
SELECT id AS matching_rows
FROM Sequences
WHERE 2 IN UNNEST(Sequences.some_numbers)
ORDER BY matching_rows;
/*---------------*
| matching_rows |
+---------------+
| 1 |
| 2 |
*---------------*/
-- Scanning arrays for values that satisfy a condition
WITH Sequences AS (
SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT 2 AS id, [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT 3 AS id, [5, 10] AS some_numbers
)
SELECT id AS matching_rows
FROM Sequences
WHERE EXISTS(SELECT * FROM UNNEST(some_numbers) AS x WHERE x > 5);
/*---------------*
| matching_rows |
+---------------+
| 2 |
| 3 |
*---------------*/
-- Scanning arrays of STRUCTs for field values that satisfy a condition
WITH Sequences AS (
SELECT 1 AS id, [STRUCT(0 AS a, 1 AS b)] AS some_numbers
UNION ALL SELECT 2 AS id, [STRUCT(2 AS a, 4 AS b)] AS some_numbers
UNION ALL SELECT 3 AS id, [STRUCT(5 AS a, 3 AS b), STRUCT(7 AS a, 4 AS b)] AS some_numbers
)
SELECT id AS matching_rows
FROM Sequences
WHERE EXISTS(SELECT 1 FROM UNNEST(some_numbers) WHERE b > 3);
/*---------------*
| matching_rows |
+---------------+
| 2 |
| 3 |
*---------------*/
Arrays and aggregation
-- Aggregating values into an array with ARRAY_AGG()
WITH Fruits AS
(SELECT "apple" AS fruit
UNION ALL SELECT "pear" AS fruit
UNION ALL SELECT "banana" AS fruit)
SELECT ARRAY_AGG(fruit) AS fruit_basket
FROM Fruits;
-- Result:
/*-----------------------*
| fruit_basket |
+-----------------------+
| [apple, pear, banana] |
*-----------------------*/
-- Ordering array elements with ARRAY_AGG()
WITH Fruits AS
(SELECT "apple" AS fruit
UNION ALL SELECT "pear" AS fruit
UNION ALL SELECT "banana" AS fruit)
SELECT ARRAY_AGG(fruit ORDER BY fruit) AS fruit_basket
FROM Fruits;
-- Result:
/*-----------------------*
| fruit_basket |
+-----------------------+
| [apple, banana, pear] |
*-----------------------*/
-- Applying aggregate functions to array elements
WITH Sequences AS
(SELECT [0, 1, 1, 2, 3, 5] AS some_numbers
UNION ALL SELECT [2, 4, 8, 16, 32] AS some_numbers
UNION ALL SELECT [5, 10] AS some_numbers)
SELECT some_numbers,
(SELECT SUM(x)
FROM UNNEST(s.some_numbers) AS x) AS sums
FROM Sequences AS s;
-- Result:
/*--------------------+------*
| some_numbers | sums |
+--------------------+------+
| [0, 1, 1, 2, 3, 5] | 12 |
| [2, 4, 8, 16, 32] | 62 |
| [5, 10] | 15 |
*--------------------+------*/
-- Concatenating array elements across rows with ARRAY_CONCAT_AGG()
WITH Aggregates AS
(SELECT [1,2] AS numbers
UNION ALL SELECT [3,4] AS numbers
UNION ALL SELECT [5, 6] AS numbers)
SELECT ARRAY_CONCAT_AGG(numbers) AS count_to_six_agg
FROM Aggregates;
-- Result:
/*--------------------------------------------------*
| count_to_six_agg |
+--------------------------------------------------+
| [1, 2, 3, 4, 5, 6] |
*--------------------------------------------------*/
Converting arrays to strings
-- Converting arrays to strings with ARRAY_TO_STRING()
WITH Words AS
(SELECT ["Hello", "World"] AS greeting)
SELECT ARRAY_TO_STRING(greeting, " ") AS greetings
FROM Words;
-- Result:
/*-------------*
| greetings |
+-------------+
| Hello World |
*-------------*/
-- Handling NULL values in ARRAY_TO_STRING()
SELECT
ARRAY_TO_STRING(arr, ".", "N") AS non_empty_string,
ARRAY_TO_STRING(arr, ".", "") AS empty_string,
ARRAY_TO_STRING(arr, ".") AS omitted
FROM (SELECT ["a", NULL, "b", NULL, "c", NULL] AS arr);
-- Result:
/*------------------+--------------+---------*
| non_empty_string | empty_string | omitted |
+------------------+--------------+---------+
| a.N.b.N.c.N | a..b..c. | a.b.c |
*------------------+--------------+---------*/
Combining arrays
SELECT ARRAY_CONCAT([1, 2], [3, 4], [5, 6]) AS count_to_six;
/*--------------------------------------------------*
| count_to_six |
+--------------------------------------------------+
| [1, 2, 3, 4, 5, 6] |
*--------------------------------------------------*/
Updating arrays
-- Table before update
WITH arrays_table AS (
SELECT
[1, 2] AS regular_array,
STRUCT([10, 20] AS first_array, [100, 200] AS second_array) AS nested_arrays
UNION ALL SELECT
[3, 4] AS regular_array,
STRUCT([30, 40] AS first_array, [130, 400] AS second_array) AS nested_arrays
)
SELECT * FROM arrays_table;
-- Result:
/*---------------*-------------------*------------------------*
| regular_array | nested_arrays. | nested_arrays. |
| | first_array | second_array |
+---------------+-------------------+------------------------+
| [1, 2] | [10, 20] | [100, 200] |
| [3, 4] | [30, 40] | [130, 400] |
*---------------*-------------------+------------------------*/
-- Updating arrays
UPDATE
arrays_table
SET
regular_array = ARRAY_CONCAT(regular_array, [5]),
nested_arrays.second_array =
ARRAY_CONCAT(nested_arrays.second_array,
nested_arrays.first_array)
WHERE TRUE;
-- Table after update
SELECT * FROM arrays_table;
-- Result:
/*---------------*---------------*--------------------*
| regular_array | nested_arrays.| nested_arrays. |
| | first_array | second_array |
+---------------+---------------+--------------------+
| [1, 2, 5] | [10, 20] | [100, 200, 10, 20] |
| [3, 4, 5] | [30, 40] | [130, 400, 30, 40] |
*---------------*---------------+--------------------*/
Zipping arrays
-- Define a Common Table Expression (CTE) named 'Combinations' containing two arrays: 'letters' and 'numbers'.
WITH
Combinations AS (
SELECT
['a', 'b'] AS letters,
[1, 2, 3] AS numbers
)
-- Selecting the zipped pairs of elements from the 'letters' and 'numbers' arrays.
SELECT
ARRAY(
SELECT AS STRUCT
-- Accessing elements from 'letters' and 'numbers' arrays using SAFE_OFFSET(index).
letters[SAFE_OFFSET(index)] AS letter,
numbers[SAFE_OFFSET(index)] AS number
FROM Combinations
CROSS JOIN
-- Generating an array of indices based on the minimum length of the two input arrays.
UNNEST(
GENERATE_ARRAY(
0,
LEAST(ARRAY_LENGTH(letters), ARRAY_LENGTH(numbers)) - 1)) AS index
ORDER BY index
) AS pairs;
/*------------------------------*
| pairs |
+------------------------------+
| [{ letter: "a", number: 1 }, |
| { letter: "b", number: 2 }] |
*------------------------------*/
Building arrays of arrays
-- Define the Points table
WITH Points AS
(SELECT [1, 5] AS point
UNION ALL SELECT [2, 8] AS point
UNION ALL SELECT [3, 7] AS point
UNION ALL SELECT [4, 1] AS point
UNION ALL SELECT [5, 7] AS point)
-- Create an array of structs, each containing a point
SELECT ARRAY(
SELECT STRUCT(point)
FROM Points)
AS coordinates;
+---------------------+
| coordinates |
+---------------------+
| [{point: [1,5]}, |
| {point: [2,8]}, |
| {point: [3,7]}, |
| {point: [4,1]}, |
| {point: [5,7]}] |
+---------------------+
Create JSON values
-- Create a JSON value
INSERT INTO mydataset.table1 VALUES
(1, JSON '{"name": "Alice", "age": 30}'),
(2, JSON_ARRAY(10, ['foo', 'bar'], [20, 30])),
(3, JSON_OBJECT('foo', 10, 'bar', ['a', 'b']));
-- Convert a STRING type to JSON type
CREATE OR REPLACE TABLE mydataset.table_new
AS (
SELECT
id, SAFE.PARSE_JSON(cart) AS cart_json
FROM
mydataset.old_table
);
-- Convert schematized data to JSON
WITH Fruits AS (
SELECT 0 AS id, 'color' AS k, 'Red' AS v UNION ALL
SELECT 0, 'fruit', 'apple' UNION ALL
SELECT 1, 'fruit','banana' UNION ALL
SELECT 1, 'ripe', 'true'
)
SELECT JSON_OBJECT(ARRAY_AGG(k), ARRAY_AGG(v)) AS json_data
FROM Fruits
GROUP BY id;
+----------------------------------+
| json_data |
+----------------------------------+
| {"color":"Red","fruit":"apple"} |
| {"fruit":"banana","ripe":"true"} |
+----------------------------------+
-- Convert SQL type to JSON type
SELECT TO_JSON(STRUCT(1 AS id, [10,20] AS coordinates)) AS pt;
+--------------------------------+
| pt |
+--------------------------------+
| {"coordinates":[10,20],"id":1} |
+--------------------------------+
Query JSON data
-- Defining a table and inserting JSON data
CREATE OR REPLACE TABLE mydataset.table1(id INT64, cart JSON);
INSERT INTO mydataset.table1 VALUES
(1, JSON '{
"name": "Alice",
"items": [
{"product": "book", "price": 10},
{"product": "food", "price": 5}
]
}'
),
(2, JSON '{
"name": "Bob",
"items": [{
"product": "pen",
"price": 20}
]
}'
);
-- VALUES AS JSON
-- Accessing fields using the field access operator
SELECT cart.name AS name
FROM mydataset.table1;
+---------+
| name |
+---------+
| "Alice" |
| "Bob" |
+---------+
-- Accessing array elements using the JSON subscript operator
SELECT cart.items[0] AS first_item
FROM mydataset.table1;
+-------------------------------+
| first_item |
+-------------------------------+
| {"price":10,"product":"book"} |
| {"price":20,"product":"pen"} |
+-------------------------------+
-- Referencing members of a JSON object by name using the JSON subscript operator
SELECT cart['name'] AS name
FROM mydataset.table1;
+---------+
| name |
+---------+
| "Alice" |
| "Bob" |
+---------+
-- Using non-constant expressions inside the JSON subscript operator
DECLARE int_val INT64 DEFAULT 0;
SELECT cart[CONCAT('it','ems')][int_val + 1].product AS item
FROM mydataset.table1;
+--------+
| item |
+--------+
| "food" |
| NULL |
+--------+
-- Chaining expressions using field access and subscript operators
SELECT cart.address AS address,
cart.items[1].price AS item1_price
FROM mydataset.table1;
+---------+-------------+
| address | item1_price |
+---------+-------------+
| NULL | NULL |
| NULL | 5 |
+---------+-------------+
-- VALUES AS STRINGS
-- Extracting a scalar value as a string using JSON_VALUE
SELECT JSON_VALUE(cart.name) AS name
FROM mydataset.table1;
-- Result:
+-------+
| name |
+-------+
| Alice |
+-------+
-- Filtering rows based on a JSON value using JSON_VALUE in a WHERE clause
SELECT cart.items[0] AS first_item
FROM mydataset.table1
WHERE JSON_VALUE(cart.name) = 'Alice';
-- Result:
+-------------------------------+
| first_item |
+-------------------------------+
| {"price":10,"product":"book"} |
+-------------------------------+
-- Extracting a JSON string using the STRING function
SELECT STRING(JSON '"purple"') AS color;
-- Result:
+--------+
| color |
+--------+
| purple |
+--------+
-- Extracting JSON values as other SQL data types
SELECT
BOOL(JSON '"true"') AS is_true,
INT64(JSON '42') AS integer_value,
FLOAT64(JSON '3.14') AS float_value;
-- Result:
+---------+---------------+-------------+
| is_true | integer_value | float_value |
+---------+---------------+-------------+
| true | 42 | 3.14 |
+---------+---------------+-------------+
-- Determining the type of a JSON value using JSON_TYPE
SELECT
JSON_TYPE(JSON '42') AS type_of_integer,
JSON_TYPE(JSON '3.14') AS type_of_float,
JSON_TYPE(JSON '"string"') AS type_of_string;
-- Result:
+-----------------+---------------+----------------+
| type_of_integer | type_of_float | type_of_string |
+-----------------+---------------+----------------+
| INTEGER | FLOAT | STRING |
+-----------------+---------------+----------------+
-- LAX CONVERSION
-- Convert JSON to INT64
SELECT LAX_INT64(JSON '"10"') AS id;
+----+
| id |
+----+
| 10 |
+----+
-- ARRAYS FROM JSON
-- Extract JSON arrays using JSON_QUERY_ARRAY
SELECT JSON_QUERY_ARRAY(cart.items) AS items
FROM mydataset.table1;
+----------------------------------------------------------------+
| items |
+----------------------------------------------------------------+
| [{"price":10,"product":"book"}","{"price":5,"product":"food"}] |
| [{"price":20,"product":"pen"}] |
+----------------------------------------------------------------+
-- Split JSON arrays into individual elements
SELECT
id,
JSON_VALUE(item.product) AS product
FROM
mydataset.table1, UNNEST(JSON_QUERY_ARRAY(cart.items)) AS item
ORDER BY id;
+----+---------+
| id | product |
+----+---------+
| 1 | book |
| 1 | food |
| 2 | pen |
+----+---------+
-- Aggregate values back into a SQL array
SELECT
id,
ARRAY_AGG(JSON_VALUE(item.product)) AS products
FROM
mydataset.table1, UNNEST(JSON_QUERY_ARRAY(cart.items)) AS item
GROUP BY id
ORDER BY id;
+----+-----------------+
| id | products |
+----+-----------------+
| 1 | ["book","food"] |
| 2 | ["pen"] |
+----+-----------------+
-- JSON NULLS
SELECT
json.a AS json_query, -- Equivalent to JSON_QUERY(json, '$.a')
JSON_VALUE(json, '$.a') AS json_value
FROM (SELECT JSON '{"a": null}' AS json);
+------------+------------+
| json_query | json_value |
+------------+------------+
| null | NULL |
+------------+------------+
Check for infinite recursion
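-- Template sketch: base_expression, recursive_expression, and
-- termination_condition_expression are placeholders for your own logic.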
DECLARE current_iteration INT64 DEFAULT 0;
CREATE TEMP TABLE recursive_cte_name AS
SELECT base_expression, current_iteration AS iteration;
REPEAT
SET current_iteration = current_iteration + 1;
INSERT INTO recursive_cte_name
SELECT recursive_expression, current_iteration
FROM recursive_cte_name
WHERE termination_condition_expression
AND iteration = current_iteration - 1
AND current_iteration < 100;
UNTIL NOT EXISTS(SELECT * FROM recursive_cte_name WHERE iteration = current_iteration)
END REPEAT;
Transaction scope
BEGIN
BEGIN TRANSACTION;
INSERT INTO mydataset.NewArrivals
VALUES ('top load washer', 100, 'warehouse #1');
-- Trigger an error.
SELECT 1/0;
COMMIT TRANSACTION;
EXCEPTION WHEN ERROR THEN
-- Roll back the transaction inside the exception handler.
SELECT @@error.message;
ROLLBACK TRANSACTION;
END;
Example transaction
-- Tables
CREATE OR REPLACE TABLE mydataset.Inventory
(
product string,
quantity int64,
supply_constrained bool
);
CREATE OR REPLACE TABLE mydataset.NewArrivals
(
product string,
quantity int64,
warehouse string
);
INSERT mydataset.Inventory (product, quantity)
VALUES('top load washer', 10),
('front load washer', 20),
('dryer', 30),
('refrigerator', 10),
('microwave', 20),
('dishwasher', 30);
INSERT mydataset.NewArrivals (product, quantity, warehouse)
VALUES('top load washer', 100, 'warehouse #1'),
('dryer', 200, 'warehouse #2'),
('oven', 300, 'warehouse #1');
-- Transaction
BEGIN TRANSACTION;
-- Create a temporary table that holds new arrivals from 'warehouse #1'.
CREATE TEMP TABLE tmp
AS SELECT * FROM mydataset.NewArrivals WHERE warehouse = 'warehouse #1';
-- Delete the matching records from the NewArrivals table.
DELETE mydataset.NewArrivals WHERE warehouse = 'warehouse #1';
-- Merge the records from the temporary table into the Inventory table.
MERGE mydataset.Inventory AS I
USING tmp AS T
ON I.product = T.product
WHEN NOT MATCHED THEN
INSERT(product, quantity, supply_constrained)
VALUES(product, quantity, false)
WHEN MATCHED THEN
UPDATE SET quantity = I.quantity + T.quantity;
-- Drop the temporary table and commit the transaction.
DROP TABLE tmp;
COMMIT TRANSACTION;