Skip to content

Blog rovingdev

Go back

Dataform: The ELT Game Changer for GCP

Edit page

If you’re just starting out with data transformation and using Google Cloud Platform (GCP), Dataform might be the perfect tool for you, especially if you’re working with BigQuery. Dataform is a service designed to help you manage SQL-based data operations within BigQuery, making it easy to turn raw data into clean, usable tables or views for analysis.
It’s part of GCP, so it works seamlessly within this ecosystem, offering features like version control and collaboration through Git. For beginners, think of it as a simple way to organize and clean up your data in BigQuery without needing deep programming knowledge.


Default alt text

What is Dataform?

Dataform is a service designed to manage SQL-based data operations within BigQuery, a part of Google Cloud Platform. It serves as a framework for data teams to develop, test, version control, and schedule complex workflows for data transformation. For beginners, think of Dataform as a tool that helps organize and clean up data in BigQuery, making it ready for analysis or reporting. It is particularly useful for creating scalable data pipelines, leveraging software engineering best practices like version control with Git and automated testing.

Key Features:

Components of Dataform:

  1. Repositories
    1. A workspace to store transformation logic with Git integration for version control.
  2. SQLX Files (Scripts)
    1. Extended SQL format for defining transformations:
    2. Tables (.sqlx) - Creates transformed tables.
    3. Views (.sqlx) - Defines virtual tables.
    4. Incremental Tables - Processes only new data.
  3. Dependencies & DAG
    1. Auto-detects dependencies and builds a Directed Acyclic Graph (DAG) to run queries in order.
  4. Assertions (Data Quality Checks)
    1. Ensures data correctness (e.g., checks for negative revenue).
  5. Scheduling & Orchestration
    1. Automates transformation runs using Cloud Workflows/Scheduler.
  6. Testing & Documentation
    1. Supports unit tests and auto-generates documentation with data lineage.

SQLX file format:

Terminal window
// Declaration: Define table/view/incremental settings
config {
type: "table",
description: "This table joins t1 and t2",
columns: {
column_1: "column1 description",
column_2: "column1 description",
},
assertions: {
uniqueKey: ["id"]
}
}
---
// Transformation: The core SQL logic for data processing
SELECT
<column_1>,
<column_2>,
<aggregations/calculations>
FROM <source_table>
JOIN <another_table>
ON <join_condition>
WHERE <filter_conditions>
GROUP BY <grouping_columns>;
---
// Assertion: Data quality check conditions
config {
type: "table",
assertions: {
nonNull: ["column_1"]
}
}

Why Dataform Excels in ELT Workflows?

Before diving into why Dataform is better for ELT, let’s clarify the workflow types

Here’s why ELT with Dataform is better:

Dataform vs. dbt: A Beginner’s Take

You might have heard of dbt (Data Build Tool), another big name in data transformation. Both Dataform and dbt help you write SQL to transform data, but they’re a little different. Here’s a quick comparison:

PS: Dataform is part of BigQuery.


Edit page
Share this post on:

Next Post
Poetry : A Better Alternative to requirements.txt. The Maven for Python ?