Dataform SQL Workflow Execution

Streamlining Data Transformation: A Practical Guide to Dataform SQL Workflow Execution

Overview: Dataform empowers cloud engineers to streamline data transformation through SQL workflows. This Proof of Concept (PoC) demonstrates the creation and execution of a SQL workflow in Dataform, highlighting its features for efficient data transformation.

Prerequisites:

  • Access to Google Cloud Console with a project enabled for billing.

  • Google Cloud project with BigQuery and Dataform APIs enabled.

  • Dataform Admin and Dataform Editor roles granted to your account.

  • Familiarity with SQL concepts and cloud engineering.
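If you prefer the command line, the API and role prerequisites can be set up with gcloud. This is a sketch: PROJECT_ID and USER_EMAIL are placeholders for your own project and account, and the role IDs assume the standard Dataform roles.

```shell
# Enable the BigQuery and Dataform APIs on the project.
gcloud services enable bigquery.googleapis.com dataform.googleapis.com \
    --project=PROJECT_ID

# Grant the Dataform Admin and Dataform Editor roles to your account.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" --role="roles/dataform.admin"
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" --role="roles/dataform.editor"
```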

Implementation Steps:

1. Create Dataform Repository:

  • Access the Dataform page in the Google Cloud console.

  • Click Create repository, and create a repository named "quickstart-repository" in the "us-central1" region.

2. Initialize Development Workspace:

  • Inside "quickstart-repository," create a development workspace named "quickstart-workspace".

Note: Dataform initializes your workspace with a set of directories and configuration files that contain default settings for deploying workflows. You can override the default settings in the dataform.json file to best suit your needs, but it is not a requirement to deploy SQL workflows to BigQuery. This quickstart uses the default Dataform settings.

The default structure that Dataform provides when you initialize a workspace looks like this:
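A sketch of that default layout, reconstructed from the files explained below:

```
quickstart-workspace/
├── definitions/
│   ├── first_view.sqlx
│   └── second_view.sqlx
├── includes/
├── .gitignore
├── dataform.json
├── package-lock.json
└── package.json
```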

Here's an explanation of each item in this structure:

definitions/: This directory is where you'll create your SQLX files, which define various components of your data transformation workflows, such as tables, views, and transformations.

  • first_view.sqlx: This SQLX file represents a data transformation component, likely a view, defined using SQL code.

  • second_view.sqlx: Another SQLX file representing another data transformation component, such as a view or a table.

includes/: This directory can hold common SQL or JavaScript code snippets that you reuse across multiple SQLX files.

.gitignore: This file specifies which files or directories Git version control should ignore.

dataform.json: This configuration file defines settings for your Dataform project, including the default BigQuery dataset, project, and location.

package.json and package-lock.json: These files manage the project's JavaScript dependencies, such as the @dataform/core package that Dataform uses to compile your workflow.

3. Define a View:

  • Create a SQLX file named "definitions/quickstart-source.sqlx" to define a view that serves as a data source, and click Create.

  • Define the view's data using SQL UNION ALL statements. In the file, enter the following code:

config {
  type: "view"
}

SELECT
  "apples" AS fruit,
  2 AS count
UNION ALL
SELECT
  "oranges" AS fruit,
  5 AS count
UNION ALL
SELECT
  "pears" AS fruit,
  1 AS count
UNION ALL
SELECT
  "bananas" AS fruit,
  0 AS count

4. Define a Table:

  • Create a SQLX file, "definitions/quickstart-table.sqlx," to define a table.

  • Specify the table type and structure, referencing the "definitions/quickstart-source.sqlx" view.

  • Dataform will validate the query and resolve dependencies during execution.

  • Select quickstart-table.sqlx, and then enter the following table type and SELECT statement:

config {
  type: "table"
}

SELECT
  fruit,
  SUM(count) AS count
FROM ${ref("quickstart-source")}
GROUP BY 1
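When this file is compiled, Dataform resolves ${ref("quickstart-source")} into the fully qualified name of the view and records the dependency. The compiled query looks roughly like this, assuming a hypothetical project ID my-project and the default dataform dataset:

```sql
SELECT
  fruit,
  SUM(count) AS count
FROM `my-project.dataform.quickstart-source`
GROUP BY 1
```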

5. Grant Dataform access to BigQuery:

  • In the Google Cloud console, go to the IAM page.

  • Click Add.

  • In the New principals field, enter your Dataform service account ID.

  • In the Select a role drop-down list, select the BigQuery Job User role.

  • Click Add another role, and in the Select a role drop-down list, select the BigQuery Data Editor role. Repeat this step to add the BigQuery Data Viewer role.

  • Click Save.
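The same grants can be scripted with gcloud. This sketch assumes the Dataform service agent has the form service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com; verify the exact account ID on your project's IAM page before running it.

```shell
SA="service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com"

# Grant the three BigQuery roles from the console steps above.
for ROLE in roles/bigquery.jobUser roles/bigquery.dataEditor roles/bigquery.dataViewer; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:${SA}" --role="${ROLE}"
done
```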

6. Execute the Workflow:

  • On the quickstart-workspace page, click Start execution.

  • Click All actions.

  • In the Execute pane, click Start execution.

  • Dataform uses the default repository settings to create the contents of your workflow in a BigQuery dataset called dataform.

7. View Execution Logs:

  • Check the Workflow Execution Logs on the "quickstart-repository" page for details of the execution.
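To confirm the output, you can also query the new table directly with the bq CLI (a sketch; the dataform dataset name comes from the default repository settings):

```shell
bq query --use_legacy_sql=false \
  'SELECT fruit, count FROM `dataform.quickstart-table` ORDER BY count DESC'
```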

Benefits: Dataform's SQL workflows enhance data transformation efficiency by offering Git integration, dependency management, and automated data quality checks through assertions. JavaScript integration facilitates code reuse, while built-in documentation and version control help maintain data integrity. This PoC showcases how Dataform provides a robust toolset for effective data transformation.

Conclusion: Dataform's seamless integration, dependency handling, and data validation capabilities make it an invaluable asset for cloud engineers. By incorporating Dataform into your workflow, you can optimize data transformation processes while maintaining high data quality and efficient collaboration.