dagster-bigquery integration reference#

This reference page provides information for working with dagster-bigquery features that are not covered as part of the Using Dagster with BigQuery tutorial.


Providing credentials as configuration#

In most cases, you will authenticate with Google Cloud Platform (GCP) using one of the methods outlined in the GCP documentation. However, in some cases you may need to provide authentication credentials directly to the BigQuery I/O manager. For example, if you are using Dagster Cloud Serverless, you cannot upload a credential file, so you must provide your credentials as an environment variable.

You can provide credentials directly to the BigQuery I/O manager by using the gcp_credentials configuration value. The BigQuery I/O manager will create a temporary file to store the credential and will set GOOGLE_APPLICATION_CREDENTIALS to point to this file. When the Dagster run is completed, the temporary file is deleted and GOOGLE_APPLICATION_CREDENTIALS is unset.

To avoid issues with newline characters in the GCP credential key, you must base64 encode the key. For example, if your GCP key is stored at ~/.gcp/key.json you can base64 encode the key by using the following shell command:

cat ~/.gcp/key.json | base64

Then you can set an environment variable in your Dagster deployment (for example GCP_CREDS) to the encoded key and provide it to the BigQuery I/O manager:

from dagster_gcp_pandas import bigquery_pandas_io_manager

from dagster import Definitions

defs = Definitions(
    assets=[iris_data],
    resources={
        "io_manager": bigquery_pandas_io_manager.configured(
            {
                "project": "my-gcp-project",
                "location": "us-east5",
                "dataset": "IRIS",
                "gcp_credential": {"env": "GCP_CREDS"},
            }
        )
    },
)
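
At run time, the I/O manager's handling of the encoded credential is roughly equivalent to the following sketch. This is illustrative only; the exact file handling is an implementation detail of the I/O manager:

import base64
import os
import tempfile

# Decode the base64-encoded key from the environment variable
encoded_key = os.environ["GCP_CREDS"]

# Write the decoded key to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as key_file:
    key_file.write(base64.b64decode(encoded_key).decode("utf-8"))
    temp_path = key_file.name

# Point the Google client libraries at the decoded key for the duration of the run
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = temp_path

# ... the run executes ...

# When the run completes, the temporary file is deleted and the variable is unset
os.remove(temp_path)
del os.environ["GOOGLE_APPLICATION_CREDENTIALS"]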

Selecting specific columns in a downstream asset#

Sometimes you may not want to fetch an entire table as the input to a downstream asset. With the BigQuery I/O manager, you can select specific columns to load by supplying metadata on the downstream asset.

import pandas as pd

from dagster import AssetIn, asset


# this example uses the iris_data asset from Step 2 of the Using Dagster with BigQuery tutorial


@asset(
    ins={
        "iris_sepal": AssetIn(
            key="iris_data",
            metadata={"columns": ["Sepal length (cm)", "Sepal width (cm)"]},
        )
    }
)
def sepal_data(iris_sepal: pd.DataFrame) -> pd.DataFrame:
    iris_sepal["Sepal area (cm2)"] = (
        iris_sepal["Sepal length (cm)"] * iris_sepal["Sepal width (cm)"]
    )
    return iris_sepal

In this example, we only use the columns containing sepal data from the IRIS_DATA table created in Step 2: Create tables in BigQuery of the Using Dagster with BigQuery tutorial. Fetching the entire table would be unnecessarily costly, so to select specific columns, we can add metadata to the input asset. We do this in the metadata parameter of the AssetIn that loads the iris_data asset, within the ins parameter. We supply the columns key with a list of the names of the columns we want to fetch.

When Dagster materializes sepal_data and loads the iris_data asset using the BigQuery I/O manager, it will only fetch the Sepal length (cm) and Sepal width (cm) columns of the IRIS.IRIS_DATA table and pass them to sepal_data as a Pandas DataFrame.


Storing partitioned assets#

The BigQuery I/O manager supports storing and loading partitioned data. In order to correctly store and load data from the BigQuery table, the BigQuery I/O manager needs to know which column contains the data defining the partition bounds. The BigQuery I/O manager uses this information to construct the correct queries to select or replace the data. In the following sections, we describe how the I/O manager constructs these queries for different types of partitions.

Storing static partitioned assets#

In order to store static partitioned assets in BigQuery, you must specify partition_expr metadata on the asset to tell the BigQuery I/O manager which column contains the partition data:

import pandas as pd

from dagster import StaticPartitionsDefinition, asset


@asset(
    partitions_def=StaticPartitionsDefinition(
        ["Iris-setosa", "Iris-virginica", "Iris-versicolor"]
    ),
    metadata={"partition_expr": "SPECIES"},
)
def iris_data_partitioned(context) -> pd.DataFrame:
    species = context.asset_partition_key_for_output()

    full_df = pd.read_csv(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        names=[
            "Sepal length (cm)",
            "Sepal width (cm)",
            "Petal length (cm)",
            "Petal width (cm)",
            "Species",
        ],
    )

    return full_df[full_df["Species"] == species]


@asset
def iris_cleaned(iris_data_partitioned: pd.DataFrame):
    return iris_data_partitioned.dropna().drop_duplicates()

Dagster uses the partition_expr metadata to craft the SELECT statement when loading the partition in the downstream asset. When loading a static partition, the following statement is used:

SELECT *
 WHERE [partition_expr] in ([selected partitions])

When the partition_expr value is injected into this statement, the resulting SQL query must follow BigQuery's SQL syntax. Refer to the BigQuery documentation for more information.

When materializing the above assets, a partition must be selected, as described in Materializing partitioned assets. In this example, the query used when materializing the Iris-setosa partition of the above assets would be:

SELECT *
 WHERE SPECIES in ('Iris-setosa')

Storing tables in multiple datasets#

You may want to have different assets stored in different BigQuery datasets. The BigQuery I/O manager allows you to specify the dataset in several ways.

If you want all of your assets to be stored in the same dataset, you can specify the dataset as configuration to the I/O manager, like we did in Step 1: Configure the BigQuery I/O manager of the Using Dagster with BigQuery tutorial.

If you want to store assets in different datasets, you can specify the dataset as part of the asset's asset key:

import pandas as pd

from dagster import SourceAsset, asset

daffodil_data = SourceAsset(key=["gcp", "bigquery", "daffodil", "daffodil_data"])


@asset(key_prefix=["gcp", "bigquery", "iris"])
def iris_data() -> pd.DataFrame:
    return pd.read_csv(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        names=[
            "Sepal length (cm)",
            "Sepal width (cm)",
            "Petal length (cm)",
            "Petal width (cm)",
            "Species",
        ],
    )

The dataset will be the last prefix before the asset's name. In this example, the iris_data asset will be stored in the IRIS dataset, and the daffodil_data asset will be found in the DAFFODIL dataset.

The two options for specifying the dataset are mutually exclusive. If you provide dataset configuration to the I/O manager, you cannot also provide it via the asset key and vice versa. If no dataset is provided, either from configuration or asset keys, the default dataset PUBLIC will be used.
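
As a minimal sketch of the asset-key approach, the Definitions object for the assets above might configure the I/O manager without a dataset, so the dataset is taken from each asset's key prefix (my-gcp-project is a placeholder):

from dagster_gcp_pandas import bigquery_pandas_io_manager

from dagster import Definitions

defs = Definitions(
    assets=[daffodil_data, iris_data],
    resources={
        "io_manager": bigquery_pandas_io_manager.configured(
            {
                # no "dataset" key - each asset's dataset comes from its key prefix
                "project": "my-gcp-project",
            }
        )
    },
)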


Using the BigQuery I/O manager with other I/O managers#

You may have assets that you don't want to store in BigQuery. You can provide an I/O manager to each asset using the io_manager_key parameter in the asset decorator:

import pandas as pd
from dagster_aws.s3.io_manager import s3_pickle_io_manager
from dagster_gcp_pandas import bigquery_pandas_io_manager

from dagster import Definitions, asset


@asset(io_manager_key="warehouse_io_manager")
def iris_data() -> pd.DataFrame:
    return pd.read_csv(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        names=[
            "Sepal length (cm)",
            "Sepal width (cm)",
            "Petal length (cm)",
            "Petal width (cm)",
            "Species",
        ],
    )


@asset(io_manager_key="blob_io_manager")
def iris_plots(iris_data):
    # plot_data is a function we've defined somewhere else
    # that plots the data in a DataFrame
    return plot_data(iris_data)


defs = Definitions(
    assets=[iris_data, iris_plots],
    resources={
        "warehouse_io_manager": bigquery_pandas_io_manager.configured(
            {
                "project": "my-gcp-project",
                "dataset": "IRIS",
            }
        ),
        "blob_io_manager": s3_pickle_io_manager,
    },
)

In this example, the iris_data asset uses the I/O manager bound to the key warehouse_io_manager and iris_plots uses the I/O manager bound to the key blob_io_manager. In the Definitions object, we supply the I/O managers for those keys. When the assets are materialized, iris_data will be stored in BigQuery and iris_plots will be saved in Amazon S3.


Storing and loading PySpark DataFrames in BigQuery#

The BigQuery I/O manager also supports storing and loading PySpark DataFrames. To use the bigquery_pyspark_io_manager, first install the package:

pip install dagster-gcp-pyspark

Then you can use the bigquery_pyspark_io_manager in your Definitions, as in Step 1: Configure the BigQuery I/O manager of the Using Dagster with BigQuery tutorial.

from dagster_gcp_pyspark import bigquery_pyspark_io_manager

from dagster import Definitions

defs = Definitions(
    assets=[iris_data],
    resources={
        "io_manager": bigquery_pyspark_io_manager.configured(
            {
                "project": "my-gcp-project",  # required
                "location": "us-east5",  # optional, defaults to the default location for the project - see https://cloud.google.com/bigquery/docs/locations for a list of locations
                "data": "IRIS",  # optional, defaults to PUBLIC
                "temporary_gcs_bucket": "my-gcs-bucket",  # optional, defaults to None, which will result in a direct write to BigQuery
            }
        )
    },
)

When using the bigquery_pyspark_io_manager, you may provide the temporary_gcs_bucket configuration. This will store the data in a temporary GCS bucket and then load all of the data into BigQuery in one operation. If not provided, data will be written directly to BigQuery.

The bigquery_pyspark_io_manager requires that a SparkSession be active and configured with the BigQuery connector for Spark. You can either create your own SparkSession or use the pyspark_resource.

from dagster_gcp_pyspark import bigquery_pyspark_io_manager
from dagster_pyspark import pyspark_resource
from pyspark import SparkFiles
from pyspark.sql import (
    DataFrame,
    SparkSession,
)
from pyspark.sql.types import (
    DoubleType,
    StringType,
    StructField,
    StructType,
)

from dagster import Definitions, asset

BIGQUERY_JARS = "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.28.0"


@asset(required_resource_keys={"pyspark"})
def iris_data(context) -> DataFrame:
    spark = context.resources.pyspark.spark_session

    schema = StructType(
        [
            StructField("Sepal length (cm)", DoubleType()),
            StructField("Sepal width (cm)", DoubleType()),
            StructField("Petal length (cm)", DoubleType()),
            StructField("Petal width (cm)", DoubleType()),
            StructField("Species", StringType()),
        ]
    )

    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    spark.sparkContext.addFile(url)

    return spark.read.schema(schema).csv("file://" + SparkFiles.get("iris.data"))


defs = Definitions(
    assets=[iris_data],
    resources={
        "io_manager": bigquery_pyspark_io_manager.configured(
            {
                "project": "my-gcp-project",
                "dataset": "IRIS",
            }
        ),
        "pyspark": pyspark_resource.configured(
            {"spark_conf": {"spark.jars.packages": BIGQUERY_JARS}}
        ),
    },
)
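
If you prefer to create your own SparkSession rather than use the pyspark_resource, a minimal sketch might look like the following. The connector version shown is an example; choose one that is compatible with your Spark and Scala versions:

from pyspark.sql import SparkSession

# Build a SparkSession configured with the BigQuery connector for Spark
spark = (
    SparkSession.builder.appName("dagster-bigquery")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.28.0",
    )
    .getOrCreate()
)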

Note: In order to load data from BigQuery as a PySpark DataFrame, the BigQuery PySpark connector will create a view containing the data. This will result in the creation of a temporary table in your BigQuery dataset. See the BigQuery PySpark connector documentation for more details.


Using Pandas and PySpark DataFrames with BigQuery#

If you work with both Pandas and PySpark DataFrames and want a single I/O manager to handle storing and loading these DataFrames in BigQuery, you can construct a BigQuery I/O manager using build_bigquery_io_manager:

from dagster_gcp import build_bigquery_io_manager
from dagster_gcp_pandas import BigQueryPandasTypeHandler
from dagster_gcp_pyspark import BigQueryPySparkTypeHandler

from dagster import Definitions

bigquery_io_manager = build_bigquery_io_manager(
    [BigQueryPandasTypeHandler(), BigQueryPySparkTypeHandler()]
)

defs = Definitions(
    assets=[iris_data, rose_data],
    resources={
        "io_manager": bigquery_io_manager.configured(
            {
                "project": "my-gcp-project",
                "data": "IRIS",
            }
        )
    },
)
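
The iris_data and rose_data assets referenced above are not shown here. As a sketch, they might be a Pandas asset and a PySpark asset, both handled by the same I/O manager (rose_data and its columns are hypothetical):

import pandas as pd
from pyspark.sql import DataFrame, SparkSession

from dagster import asset


@asset
def iris_data() -> pd.DataFrame:
    # stored and loaded by the BigQueryPandasTypeHandler
    return pd.read_csv(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        names=[
            "Sepal length (cm)",
            "Sepal width (cm)",
            "Petal length (cm)",
            "Petal width (cm)",
            "Species",
        ],
    )


@asset
def rose_data() -> DataFrame:
    # stored and loaded by the BigQueryPySparkTypeHandler
    # (hypothetical data; any PySpark DataFrame works here)
    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame(
        [("Damask", 30.5), ("Tea", 25.0)], ["Variety", "Height (cm)"]
    )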

Executing custom SQL commands with the BigQuery resource#

In addition to the BigQuery I/O manager, Dagster also provides a BigQuery resource for executing custom SQL queries.

from dagster_gcp import bigquery_resource

from dagster import Definitions, asset

# this example executes a query against the IRIS.IRIS_DATA table created in Step 2 of the
# Using Dagster with BigQuery tutorial


@asset(required_resource_keys={"bigquery"})
def small_petals(context):
    return context.resources.bigquery.query(
        # select flowers where both petal dimensions are less than 1 cm;
        # BigQuery quotes column names with backticks rather than double quotes
        "SELECT * FROM IRIS.IRIS_DATA"
        " WHERE `Petal length (cm)` < 1 AND `Petal width (cm)` < 1"
    ).result()


defs = Definitions(
    assets=[small_petals],
    resources={
        "bigquery": bigquery_resource.configured(
            {
                "project": "my-gcp-project",
                "location": "us-east5",
            }
        )
    },
)

In this example, we attach the BigQuery resource to the small_petals asset. In the body of the asset function, we use the query method of the resource to execute a custom SQL query against the IRIS_DATA table created in Step 2: Create tables in BigQuery of the Using Dagster with BigQuery tutorial.
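
Since the resource exposes the google-cloud-bigquery query API (as the query call above shows), you can also convert the result to a Pandas DataFrame. The following is a sketch, assuming the optional Pandas dependencies of google-cloud-bigquery (such as db-dtypes) are installed and using a hypothetical small_petals_df asset:

import pandas as pd

from dagster import asset


@asset(required_resource_keys={"bigquery"})
def small_petals_df(context) -> pd.DataFrame:
    # to_dataframe() is provided by google-cloud-bigquery and requires its
    # optional Pandas dependencies (for example, db-dtypes) to be installed
    result = context.resources.bigquery.query(
        "SELECT * FROM IRIS.IRIS_DATA"
        " WHERE `Petal length (cm)` < 1 AND `Petal width (cm)` < 1"
    ).result()
    return result.to_dataframe()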

For more information on the BigQuery resource, see the BigQuery resource API docs.