Assets without I/O#

In the tutorial, all the assets we've looked at are straightforward to work with as Python objects in memory - e.g. topstories is created by constructing a DataFrame of the top stories from the Hacker News API.

However, often we don't want to build or load our assets as Python objects in memory. For example, we might compute an asset by copying a file, by issuing a create table statement to a database, or by executing a command-line utility.

In those situations, we can define assets whose functions don't have return values or arguments. And we can mix those assets with assets that do have return values or arguments.


Prerequisites#

This guide builds off of the project written in the tutorial. If you haven't already, you should complete the tutorial before continuing.


Writing an asset with no output#

In the tutorial, we defined an asset that uses a hardcoded list of stopwords to filter common words, such as prepositions, out of our word cloud. The English language has a lot of stopwords, so rather than hardcoding them into your asset, you'll bring them in from a separate data source.

You'll augment your asset graph with an asset for a zip file of stopwords downloaded from the internet, and a second asset for the stopwords extracted from that zip file. These stopwords will replace the hardcoded stopwords in your topstories_word_cloud asset. By the end of this guide, your asset graph will have two new assets:

  • stopwords_zip - A zip file on our local filesystem, which contains the stopwords.
  • stopwords_csv - A CSV file of stopwords, extracted from stopwords_zip.

A zip file of stopwords from the internet#

Let's start with stopwords_zip. We'll add this to the same assets.py you wrote in the tutorial:

import urllib.request # Addition: new import at the top of `assets.py`

@asset
def stopwords_zip() -> None:
    urllib.request.urlretrieve(
        "https://docs.dagster.io/assets/stopwords.zip",
        "stopwords.zip",
    )

The implementation downloads a zip file from the internet to our local filesystem. It looks a lot like the assets we defined in the tutorial, with two differences:

  • It doesn't have a return statement. Instead of returning an object and letting the I/O manager write it to storage, the function handles writing to storage itself.
  • It has a None annotation for its return type. This helps Dagster understand what we already know by looking at it - that the function never returns any values.

Using an asset without loading it#

Let's define an asset that contains the unzipped contents of our zip file:

import zipfile

@asset(non_argument_deps={"stopwords_zip"})
def stopwords_csv() -> None:
    with zipfile.ZipFile("stopwords.zip", "r") as zip_ref:
        zip_ref.extractall(".")

The implementation extracts the contents of our zip file, a CSV file, into the current working directory. It looks a lot like the assets we defined in the tutorial, with one main difference.

In the assets you wrote in the tutorial, you indicated the asset dependencies by including function arguments with the name of the upstream assets. Dagster then invoked an I/O manager to load the asset as a Python object and supplied it to your function.

However, in this case, we don't want a Python object for the upstream asset; we just want to work with the file directly. So we instead supply "stopwords_zip" to the non_argument_deps parameter of the @asset decorator. This tells Dagster that the stopwords_csv asset depends on the stopwords_zip asset, without telling it to load stopwords_zip for us.
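The file-level behavior of stopwords_csv can be tried out without Dagster or a network connection. The following is a minimal sketch under stated assumptions: extract_archive is a hypothetical helper (not part of the tutorial), and the archive contents are made up to stand in for the downloaded file.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path: Path, target_dir: Path) -> list[str]:
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)
        return zip_ref.namelist()


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Stand-in for the downloaded archive; these contents are made up.
    archive = tmp_dir / "stopwords.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("stopwords.csv", "a\nan\nthe\n")

    # Same call the asset makes, but pointed at a temporary directory.
    extracted_names = extract_archive(archive, tmp_dir)
    extracted_text = (tmp_dir / "stopwords.csv").read_text()
```

Passing the paths in explicitly, rather than relying on the current working directory, is one way to make this kind of asset easier to exercise in isolation.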

Combining assets with and without I/O#

Finally, let's modify the topstories_word_cloud asset to read from the stopwords_csv asset instead of the hardcoded stopwords:

import csv

@asset(non_argument_deps={"stopwords_csv"})
def topstories_word_cloud(topstories):
    with open("stopwords.csv", "r") as f:
        stopwords = {row[0] for row in csv.reader(f)}

    titles_text = " ".join([str(item) for item in topstories["title"]])
    titles_cloud = WordCloud(stopwords=stopwords, background_color="white").generate(
        titles_text
    )

    # Generate the word cloud image
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(titles_cloud, interpolation="bilinear")
    plt.axis("off")
    plt.tight_layout(pad=0)

    # Convert the image to a saveable format
    buffer = BytesIO()
    plt.savefig(buffer, format="png")
    image_data = base64.b64encode(buffer.getvalue())

    # Convert the image to Markdown to preview it within Dagster
    md_content = f"![img](data:image/png;base64,{image_data.decode()})"

    # Attach the Markdown content as metadata to the asset
    return Output(
        value=image_data,
        metadata={
            "plot": MetadataValue.md(md_content)
        }
    )

This asset mixes and matches argument-based dependencies with non-argument-based dependencies. It depends on the topstories asset, which we defined in the tutorial, and it relies on Dagster to load that asset's value into memory. It also depends on the stopwords_csv asset, and the decorated function takes responsibility for loading that asset's value into memory.
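Because the function loads the stopwords itself, the parsing step is worth a quick look in isolation. The sketch below assumes the file holds one stopword per row in the first column; load_stopwords is a hypothetical helper, and the sample data is made up. Unlike the set comprehension in the asset, it skips blank rows rather than failing on them.

```python
import csv
import io


def load_stopwords(lines) -> set:
    """Collect the first column of each non-empty CSV row into a set."""
    return {row[0].strip() for row in csv.reader(lines) if row and row[0].strip()}


# Made-up sample standing in for stopwords.csv; note the blank third line.
sample = io.StringIO("a\nan\n\nthe\n")
stopwords = load_stopwords(sample)
```

With a real file, the same helper could be called on an open file handle, for example one returned by open("stopwords.csv", newline="").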

Now let's visualize all these assets in Dagit:

dagster dev

Navigate to http://127.0.0.1:3000:

[Image: asset graph with assets using non-argument deps]
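The ordering implied by the dependencies in this guide can be illustrated with Python's standard graphlib module. This is only a sketch of dependency ordering, not a description of how Dagster schedules assets, and it includes only the edges named in this guide (upstream dependencies of topstories from the tutorial are left out).

```python
from graphlib import TopologicalSorter

# Each key maps an asset to the set of assets it depends on.
# Argument-based and non-argument dependencies are treated alike here,
# since both constrain the order in which assets can be materialized.
deps = {
    "stopwords_csv": {"stopwords_zip"},
    "topstories_word_cloud": {"topstories", "stopwords_csv"},
}

# One valid order in which every asset comes after its dependencies.
order = list(TopologicalSorter(deps).static_order())
```

Any order produced this way places stopwords_zip before stopwords_csv, and both stopwords_csv and topstories before topstories_word_cloud.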

When to use assets without arguments and return values#

In the tutorial, we saw assets with return values and arguments, where business logic and I/O are kept separate. In this guide, we saw assets with no return values and no arguments, where there's no separation between business logic and I/O. When should you use one instead of the other?

In general, the former is preferred: use return values and arguments in the @asset-decorated function and keep any custom I/O code inside an I/O manager. A couple of the reasons are:

  • Separating business logic from I/O makes it easier to write lightweight tests for the business logic.
  • There are fewer opportunities for bugs. In the implementation of stopwords_csv, it would be easy to accidentally read from the wrong file.
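The testing benefit can be sketched with a small pure function. remove_stopwords is a hypothetical helper, not part of the tutorial, and the titles are made up; the point is only that logic taking and returning plain Python values can be checked with in-memory inputs.

```python
def remove_stopwords(titles, stopwords):
    """Return the lowercased words from titles that are not stopwords, in order."""
    kept = []
    for title in titles:
        for word in title.lower().split():
            if word not in stopwords:
                kept.append(word)
    return kept


# A lightweight check: no files, network access, or I/O manager involved.
words = remove_stopwords(
    ["The Rise of Rust", "A Guide to Dagster"],
    {"the", "of", "a", "to"},
)
```

By contrast, checking a function that reads stopwords.csv itself would require placing a file at the path it expects before calling it.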

However, sometimes this is impossible or inconvenient. When you're working with tools that aren't designed for a clean separation between business logic and I/O, it's often not worth going through contortions to fit them into the argument/return model. For those cases, use the patterns that were shown here.