Navigating Analytics Engineering: From dbt Setup to Data Visualization
DE ZoomCamp course: Module 4
Explore the fundamentals of analytics engineering with guidance from DE ZoomCamp 2024 and dbt documentation. Learn to set up dbt with BigQuery, deploy models, and use visualization tools like Google Data Studio and Metabase. Discover key concepts such as ETL vs. ELT and Dimensional Modeling, designed to boost your data analytics skills. This concise guide provides a clear path for anyone aiming to enhance their analytics engineering knowledge for practical, impactful insights.
Source:
Contents:
dbt setup for using BigQuery
create a new GCP service account
use the BigQuery API with access to application data
give the account any name (e.g. dbt_service_account)
grant the account the BigQuery Admin role
download a new service account key in JSON format (the same steps can be scripted with gcloud, as sketched below)
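A minimal gcloud equivalent of the console steps above, assuming placeholder names for the account, project, and key file:
# create the service account (name is a placeholder)
gcloud iam service-accounts create dbt-service-account --display-name="dbt service account"

# grant it the BigQuery Admin role on your project
gcloud projects add-iam-policy-binding <your-gcp-project-id> \
  --member="serviceAccount:dbt-service-account@<your-gcp-project-id>.iam.gserviceaccount.com" \
  --role="roles/bigquery.admin"

# download a JSON key for the account
gcloud iam service-accounts keys create dbt-key.json \
  --iam-account="dbt-service-account@<your-gcp-project-id>.iam.gserviceaccount.com"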
Cloud
set up a new dbt Cloud project
name your project
set up the BigQuery connection (upload the service account JSON key)
connect your GitHub repo using SSH
paste your repo's SSH URL into dbt Cloud
add the deploy key generated by dbt Cloud to your GitHub repo's deploy key settings
Local
set up a new dbt Core project
install the required dbt packages on your machine
install both dbt-core and the dbt-bigquery adapter (installing the adapter pulls in dbt-core as a dependency)
pip install dbt-bigquery
set up connection details
in your home directory, create a .dbt directory and a
profiles.yml
file inside it:
<profile-name>:
  target: <target-name> # this is the default target
  outputs:
    <target-name>:
      type: <bigquery | postgres | redshift | snowflake | other>
      schema: <schema_identifier>
      threads: <natural_number>

      ### database-specific connection details
      ...

    <target-name>: # additional targets
      ...

<profile-name>: # additional profiles
  ...
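For example, a minimal BigQuery profile using the service account key downloaded earlier could look like this (the project, dataset, and key path are placeholders):
my_bigquery_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project-id       # GCP project to run and bill against
      dataset: your_bigquery_dataset     # BigQuery's alias for schema
      keyfile: /path/to/dbt-key.json     # service account key in JSON format
      threads: 4
      location: EU                       # dataset location (optional)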
create your first project
in the directory for your dbt project run
dbt init
This will create a directory for your first project
Enter your project folder and check the connection:
dbt debug
set up
dbt_project.yml
choose a name for your project; make sure to use the same name under the models section as well.
check the profile section - it should match the profile name in your
profiles.yml
file.
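A minimal dbt_project.yml illustrating both points might look like the sketch below (the project name taxi_rides_ny is only an example, and the profile must match the one defined in profiles.yml):
name: 'taxi_rides_ny'
version: '1.0.0'
config-version: 2

profile: 'my_bigquery_project'   # must match a profile in profiles.yml

model-paths: ["models"]
seed-paths: ["seeds"]
macro-paths: ["macros"]

models:
  taxi_rides_ny:                 # must match the project name above
    +materialized: view          # default materialization, overridable per model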
Analytics Engineering concepts
Data Team Roles
Data Engineer: Responsible for preparing and maintaining the infrastructure required by the data team.
Analytics Engineer: Brings good software engineering practices into the efforts of data analysts and data scientists.
Data Analyst: Utilizes data to respond to inquiries and address challenges.
Analytics Engineer Tools
Data Loading: Tools and processes focused on importing, cleaning, and preprocessing data.
Data Storing: Utilizes cloud-based data warehouses such as Snowflake, BigQuery, and Redshift for efficient data storage and management.
Data Modeling: Employs tools like dbt (data build tool) or Dataform for structuring and organizing data in a useful format.
Data Presentation: Leverages Business Intelligence (BI) tools such as Google Data Studio, Looker, Mode, and Tableau for visualizing data and sharing insights.
ETL vs ELT
ETL (Extract, Transform, Load)
Provides a more stable and compliant environment for data analysis.
Incurs higher storage and compute costs due to the processing required before data is loaded into the warehouse.
ELT (Extract, Load, Transform)
Offers faster and more flexible data analysis by loading data into the warehouse before transforming it.
Results in lower costs and reduced maintenance compared to ETL, as data transformation is handled within the data warehouse.
Kimball’s Dimensional Modeling (DM)
Objective
To make data understandable for business users.
To ensure fast query performance.
Approach
Focuses on user understandability and query performance, prioritizing these over non-redundant data (3NF - Third Normal Form).
Other Approaches
Bill Inmon's methodology.
Data Vault modeling.
Elements of DM
Fact Tables
Contain measurements, metrics, or facts.
Correspond to a business process.
Represented by "verbs."
Dimension Tables
Correspond to a business entity.
Provide context to the business.
Represented by "nouns."
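As a sketch of how facts and dimensions fit together, the query below joins a hypothetical fact table of trips ("verbs") to a dimension table of zones ("nouns"); the table and column names are illustrative:
-- metrics come from the fact table, context from the dimension table
select
    dz.zone_name,
    count(*)            as total_trips,
    sum(ft.fare_amount) as total_revenue
from fact_trips ft
join dim_zones dz
    on ft.pickup_zone_id = dz.zone_id
group by dz.zone_name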
DM architecture
Staging Area (Bronze Layer)
Houses raw data.
Not designed for broad accessibility.
Processing Area (Silver Layer)
Transforms raw data into structured data models.
Concentrates on efficiency and standardization.
Presentation Area (Gold Layer)
Delivers the final data presentation.
Tailored for business stakeholders' consumption.
dbt (data build tool)
dbt is a SQL-based transformation tool that streamlines analytics code deployment in data warehouses, emphasizing modularity, CI/CD, and documentation.
It enables complex data transformations directly within the data warehouse (ELT), supporting version control, testing, and streamlined workflow sharing.
How does dbt work
Define transformations: Users write SQL queries to define transformations, which are stored as dbt models.
Run models: dbt compiles these SQL models and runs them in the data warehouse, transforming raw data into structured formats.
Version control: Models are managed in version control, facilitating collaboration and tracking changes.
Testing and documentation: dbt supports automated testing and documentation of data models, ensuring data quality and ease of use.
dbt deployment
dbt Core (local):
Open-source, command-line tool.
Manual setup and execution.
User-managed infrastructure.
Compatible with databases like Postgres and others.
dbt Cloud:
Managed SaaS platform.
Web interface with scheduling and permissions.
Includes IDE, CI/CD pipelines, and support.
Free and paid tiers available.
Supports BigQuery, Postgres, and other warehouses.
dbt model
A dbt model is a SQL file with configurations and transformations that structure raw data for analysis. Here’s a concise overview of its components:
Configuration Header
Defines model settings like materialization (e.g., table, view) using Jinja templating.
{{ config(materialized='table') }}
Documentation (Optional)
Comments describing the model's purpose, author, and creation date for better maintainability.
Jinja Templating
Injects logic, variables, and references to other models for dynamic SQL generation.
SELECT * FROM {{ ref('raw_data') }}
SQL Transformation Logic
The SQL code applying the data transformation, potentially using CTEs for complex logic.
Dependencies
Defined by ref()
functions, indicating relationships between models for execution order.
Materialization Strategies
Determines how dbt materializes the output (as a view, table, incremental, or ephemeral).
This streamlined structure facilitates efficient and organized data transformation, making dbt models crucial for scalable analytics engineering.
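Putting these components together, a small dbt model could look like the sketch below; the model and column names (stg_trips, raw_trips, etc.) are illustrative, not part of the course material:
-- models/staging/stg_trips.sql
{{ config(materialized='view') }}

-- staging model: light cleanup of the raw trips data
with source_data as (
    select * from {{ ref('raw_trips') }}
)

select
    trip_id,
    pickup_datetime,
    dropoff_datetime,
    cast(fare_amount as numeric) as fare_amount
from source_data
where trip_id is not null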
dbt concepts
Macros
Utilize control structures (e.g., if statements, for loops) in SQL.
Leverage environment variables for production deployments.
Generate queries based on the results of other queries.
Abstract SQL snippets into reusable macros, akin to functions in programming languages.
To create a macro, add an sql file in the macros folder and use the following template:
{# Description of what the macro does #}
{% macro name_of_the_macro(argument1, argument2) -%}
    -- macro logic (SQL) goes here
{%- endmacro %}
To call the macro, use the following template in the model SQL:
{{ name_of_the_macro(argument) }}
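For instance, a small macro that maps a numeric payment type code to a label might look like this (the codes and column names are illustrative):
{# Maps a numeric payment type code to a readable label #}
{% macro get_payment_type_description(payment_type) -%}
    case {{ payment_type }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        else 'Unknown'
    end
{%- endmacro %}

Called from a model:
select
    {{ get_payment_type_description('payment_type') }} as payment_type_description
from {{ ref('raw_trips') }}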
Ref Macro
Used to reference underlying tables and views in the data warehouse.
Enables running the same code across different environments, resolving schema automatically.
Dependencies are automatically built based on references.
Example:
select
column1,
column2
from {{ ref('table_or_view_name') }}
FROM clause
Sources
The data loaded into our DWH that we use as a source for our models
Configuration defined in the yml files in the models folder
version: 2

sources:
  - name: source_name
    database: your_database
    schema: your_schema
    tables:
      - name: table_name
        freshness:
          error_after: {count: 6, period: hour}
Used with the source macro, which resolves the name to the right schema and builds the dependencies automatically
Source freshness can be defined and tested
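In a model's FROM clause, the source defined above is then referenced with the source macro:
select * from {{ source('source_name', 'table_name') }}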
Seeds
CSV files stored in the project repository under the seed folder.
version: 2

seeds:
  - name: seed_name        # must match the CSV file name (seed_name.csv) in the seeds folder
    config:
      schema: your_schema
Benefits from version control for tracking changes.
Equivalent to a COPY command in database terms.
Recommended for static or infrequently changing data.
Loaded using the dbt seed -s seed_name command, where seed_name is the name of the seed file without the .csv extension.
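Once loaded, a seed is queried like any other model, via the ref macro:
select * from {{ ref('seed_name') }}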
Packages
Similar to libraries in other programming languages, containing models and macros for specific areas.
Standalone dbt projects that integrate into your own project via the
packages.yml
file:
packages:
  - package: <publisher>/<package_name>
    version: <version>
Imported by running
dbt deps
Available on the dbt Package Hub.
To call a function from a package in the dbt model use the following template:
{{ package_name.function_name( arg1 = 'value1', arg2 = 'value2', ... ) }}
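As a concrete sketch, the widely used dbt_utils package could be added and called like this (the version number and column names are illustrative):
packages.yml:
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1

In a model:
select
    {{ dbt_utils.generate_surrogate_key(['vendor_id', 'pickup_datetime']) }} as trip_id,
    *
from {{ ref('stg_trips') }}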
Variables
Enable defining project-wide values.
Used in models with
{{ var('...') }}
to read the value at run time. Defined in:
dbt_project.yml
file:
vars:
  my_variable: value
  another_variable: another_value
Command line
dbt run --vars '{my_variable: value, another_variable: another_value}'
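A common pattern is reading a variable with a default inside a model, e.g. to limit rows during development (the variable and model names are illustrative):
select * from {{ ref('stg_trips') }}
-- dev runs default to a small sample; override with: dbt run --vars '{is_test_run: false}'
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}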
Tests
Assumptions about data:
Tests in dbt are SQL queries.
These assumptions compile to SQL, revealing failing records.
Tests are defined in the
schema.yml
file.
Basic tests provided by dbt:
Uniqueness
Null values
Accepted values
Foreign key relations
Custom tests can be created using queries.
version: 2

models:
  - name: my_model_name
    description: >
      This model represents [description of the model].
    columns:
      - name: column_name
        tests:
          - unique:
              severity: error
          - not_null:
              severity: error
          - accepted_values:
              values: [value_1, value_2, value_3]
              severity: warn
          - relationships:
              to: ref('referenced_table_name')
              field: referenced_column_name
              severity: error
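Beyond the built-in generic tests, a custom (singular) test is simply a SQL file under the tests folder that selects failing rows; the sketch below assumes a hypothetical total_amount column:
-- tests/assert_total_amount_is_non_negative.sql
-- any rows returned by this query are reported as test failures
select *
from {{ ref('my_model_name') }}
where total_amount < 0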
Documentation
dbt can generate documentation for a project and render it as a website; model and column descriptions are defined in the
schema.yml
files. The docs include project details:
Model code (from .sql files and compiled)
Model dependencies
Sources
Auto-generated Directed Acyclic Graph (DAG) from ref and source macros
Descriptions (from .yml files) and tests
Provides information about the data warehouse (information_schema):
Column names and data types
Table statistics such as size and rows
dbt docs can be hosted in dbt Cloud.
models:
  - name: my_model_name
    description: >
      Description of my_model_name.
  - name: another_model_name
    description: >
      Description of another_model_name.
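With dbt Core, the documentation site is generated and browsed locally from the command line:
dbt docs generate   # compiles the project and builds the documentation artifacts
dbt docs serve      # serves the generated site on a local web server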
Model deployment
Model Deployment: The process of moving models from development to production environments.
Separation: Development allows building and testing models without impacting production.
Environment Differences: Deployment typically uses a distinct schema in the data warehouse and different user accounts.
Workflow:
Develop models in a user-specific branch.
Open a Pull Request (PR) to merge changes into the main branch.
Merge the branch into the main branch.
Execute new models in production using the main branch.
Schedule models for regular operation.
Running a dbt model into production
dbt Cloud Scheduler: Used to create jobs for production runs.
Job Commands: A single job can execute multiple commands.
Triggering Jobs: Can be initiated manually or scheduled.
Logging: Each job maintains a log of its runs, with individual command logs.
Documentation: Jobs can generate documentation accessible under run details.
Source Freshness: If run, results are viewable post-job completion.
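As an illustration, a production job might be configured with a command list like the one below, with documentation generation and source freshness checks also enabled in the job settings:
dbt build          # runs and tests all models and seeds
dbt docs generate  # refreshes the hosted documentation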
Continuous integration
Continuous Integration (CI): Regular merging of development branches into a central repo, followed by automated builds and tests.
Goal: Minimize bugs in production code, maintain project stability.
dbt CI Integration: Enables CI on pull requests through webhooks from GitHub or GitLab.
Webhooks: Trigger a new dbt Cloud job run upon PR readiness.
Temporary Schema: CI job runs use a temporary schema.
Merge Conditions: PRs can only be merged after successful job completion.
Steps to deploy the model (via dbt cloud IDE)
Push Changes: Upload all modifications to the GitHub repository.
Deployment Environment: Establish a new environment targeting the BigQuery dataset.
Job Creation: Initiate a job within the configured environment.
Execution Settings: Specify commands for the job's execution.
Triggers: Configure job triggers for automated execution.
dbt commands
dbt run: Executes all models within the dbt project.
dbt build: Executes all models, tests, and seeds in the dbt project.
dbt build --select <model_name>: Executes only the specified model.
dbt build --select +<model_name>: Executes the specified model and all of its upstream dependencies (ancestors).
Data visualization
Google Data Studio, now Looker Studio (cloud)
Web-based Reporting Tool: Allows for the creation of customizable dashboards and reports.
Integration: Offers seamless integration with other Google products and various external data sources.
Interactive Visualizations: Enables users to create interactive charts and graphs.
Collaboration: Supports sharing and collaborative editing of reports and dashboards.
Free to Use: Available at no cost, making it accessible for individuals and businesses.
Metabase (local)
Open-Source: A free tool for creating visualizations and dashboards from your data.
User-Friendly: Designed with a focus on simplicity, making data analysis accessible to non-technical users.
Flexible Data Sources: Connects to a variety of databases and data warehouses without requiring in-depth SQL knowledge.
Customization: Offers customizable dashboards and the ability to write SQL queries for complex data exploration.
Collaboration Features: Includes sharing and publishing capabilities to enhance teamwork.