Dagster can orchestrate your Databricks jobs and other Databricks API calls, making it easy to chain together multiple Databricks jobs, as well as orchestrate Databricks alongside your other technologies.
To get started, you will need to install the dagster and dagster-databricks Python packages:
pip install dagster dagster-databricks
You'll also need a Databricks workspace with an existing project deployed as a Databricks job. If you don't have one already, you can follow the Databricks quickstart to set one up.
Step 1: Connect to Databricks

The first step in using Databricks with Dagster is to tell Dagster how to connect to your Databricks workspace using a Databricks resource. This resource specifies where your Databricks workspace is located and the credentials, sourced from environment variables, needed to access it. Once the resource is configured, you can access the underlying Databricks API client to communicate with your workspace.
For more information about the Databricks resource, see the API reference.
from dagster_databricks import databricks_client
databricks_client_instance = databricks_client.configured(
    {
        "host": {"env": "DATABRICKS_HOST"},
        "token": {"env": "DATABRICKS_TOKEN"},
    }
)
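To see the resource in action, you can use it from within an op. The following is a minimal sketch, assuming a recent version of dagster-databricks in which the resource yields a client wrapper whose workspace_client property exposes the Databricks SDK's WorkspaceClient; the op and job names here are illustrative, not part of the library:

from dagster import job, op

@op(required_resource_keys={"databricks"})
def list_databricks_jobs(context):
    # Access the underlying Databricks API client through the resource.
    # `workspace_client` is assumed to expose the Databricks SDK's
    # WorkspaceClient; older versions of dagster-databricks may expose a
    # different attribute, so check the API reference for your version.
    client = context.resources.databricks.workspace_client
    # List the jobs defined in the workspace and log their IDs.
    for existing_job in client.jobs.list():
        context.log.info(f"Found Databricks job with ID: {existing_job.job_id}")

@job(resource_defs={"databricks": databricks_client_instance})
def inspect_databricks_workspace():
    list_databricks_jobs()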
Step 2: Create an op/asset that connects to Databricks
In this step, we show several ways to model a Databricks API call as either a Dagster op or as the computation backing a software-defined asset. You can either: