Dagster is mainly used to build data pipelines, and most data pipelines can be expressed in Dagster as sets of software-defined assets. If you’re a new Dagster user and your goal is to build a data pipeline, we recommend starting with software-defined assets and not worrying about ops or graphs. This is because most of the code you’ll be writing will directly relate to producing data assets.
However, there are some situations where you want to run code without thinking about data assets that the code is producing. In these cases, it’s appropriate to use ops and graphs. For example:
You want to schedule a workflow where the goal is not to keep a set of data assets up-to-date. It might do something like:
Send emails to a set of users
Scan a data warehouse for tables that haven't been used in months and delete them
Record metadata about a set of data assets
In these cases, you should define your workflow in terms ops and graphs, not software-defined assets. The Intro to ops and jobs guide is a good place to start learning how to do this.
Additionally, note that a single Dagster deployment can contain software-defined assets and op/graph-based jobs side-by-side, which means that you’re not bound to one particular choice. If your workflow reads from software-defined assets, you can model that explicitly in Dagster, which is discussed in a a later section.
Situation 2: You want to break an asset into multiple steps#
If you're in a situation like the following:
An asset requires multiple steps
Some of the steps don't produce assets of their own
You need to be able to re-execute individual steps
Situation 3: You’re anchored in task-based workflows#
Task-based workflows have been a popular way of defining data pipelines for a long time. While we believe that software-defined assets provide a superior way of writing and operating data pipelines, we acknowledge that teams often have existing codebases or mindsets that are heavily anchored in task-based workflows.
Op-based graphs resemble task-based workflows very closely, so they’re a natural choice for data pipelines that want to stick to that paradigm, either permanently or temporarily, while migrating to software-defined assets.
Next, we'll discuss how assets relate to ops and graphs. By the end of this section, you should understand how each type of software-defined asset relates to ops and graphs.
A software-defined asset is a description of how to compute the contents of a particular data asset.
Under the hood, every software-defined asset contains an op (or graph of ops), which is the function that’s invoked to compute its contents. In most cases, the underlying op is invisible to the user.
Dagster supports composing a set of ops into an op graph, usually by using the @graph decorator. A software-defined asset can be backed by an op graph, instead of an op.
Graph-backed assets are useful when you want to execute multiple separate steps to compute an asset and some of those steps don’t produce assets of their own.
For example, to compute the contents of a table, you need to fetch data from an API and then perform a heavy data transformation on it. You don’t care about writing the fetched, pre-transformed data to any known location, but you want the fetching and transforming to happen in two separate steps that can run in different processes. If there’s a failure, you’d like to be able to re-execute the transformation step without re-executing the fetching step.
In some cases, you might want to build a job that doesn't produce any assets, but does read from at least one asset. Dagster facilitates this by allowing you to designate assets as inputs to ops within a graph or graph-based job:
For example, you have a table that represents a list of emails that you want to send. A job reads data from the table and uses it to send the emails:
from dagster import asset, job, op
@assetdefemails_to_send():...@opdefsend_emails(emails)->None:...@jobdefsend_emails_job():
send_emails(emails_to_send.to_source_asset())
In this case, the asset - specifically, the table the job reads from - is only used as a data source for the job. It’s not materialized when the graph is run.