Got questions about our recommendations or something to add? Join our GitHub discussion to share how you organize your Dagster code.
Dagster aims to enable teams to ship data pipelines with extraordinary velocity. In this guide, we'll walk through how we recommend structuring larger Dagster projects to help achieve that goal.
At a high level, here are the aspects we'd like to optimize when structuring a complex project:
- You can quickly get stuff done (e.g., write a new job, fix a breakage, or retire existing data pipelines) without thinking much about where you need to make the change or how it may break something.
- You can quickly find the relevant code regardless of your familiarity with the related business logic.
- You can organize at your own pace when you feel things have grown too big, but not over-optimize too early.
As your experience with Dagster grows, certain aspects of this guide might no longer apply to your use cases, and you may want to change the structure to adapt to your business needs.
This guide uses the fully featured project example to walk through our recommendations. It is a large project that simulates real-world use cases and showcases a wide range of Dagster features. You can read more about this project and the application of Dagster concept best practices in the example project walkthrough guide.
Below is the file tree of the example project, abbreviated here to the pieces this guide discusses.
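This is a trimmed sketch rather than the complete tree; the domain folder names under assets/ are illustrative:

```
project_fully_featured/
├── project_fully_featured/
│   ├── __init__.py          # builds the Definitions object per deployment
│   ├── assets/              # all assets, grouped by business domain
│   │   ├── __init__.py
│   │   ├── core/
│   │   └── recommender/
│   ├── resources/           # shared, pre-configured resources
│   ├── sensors/             # sensors and schedules
│   └── jobs.py              # job definitions
├── project_fully_featured_tests/   # mirrors the main package
├── setup.py
└── workspace.yaml
```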
This project was scaffolded by the dagster project CLI. This tool generates files and folder structures that enable you to quickly get started with everything set up, especially the Python setup. Refer to the Create a new project guide to learn more about the default project skeleton.
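For reference, scaffolding a new project from the CLI looks like this (the project name is a placeholder):

```shell
dagster project scaffold --name my-dagster-project
```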
Keep all assets together in an assets/ directory. As your business logic and complexity grows, grouping assets by business domains in multiple directories inside assets/ helps to organize assets further.
In this example, we keep all assets together in the project_fully_featured/assets/ directory. This is useful because you can use load_assets_from_package_module or load_assets_from_modules to load assets into your definitions, rather than adding each asset to the definitions every time you define one. It also helps collaboration: your teammates can quickly navigate to the right place to find the core business logic (i.e., assets) regardless of their familiarity with the codebase.
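As a minimal sketch, loading the whole assets/ package at once might look like this (assuming the package layout above):

```python
from dagster import Definitions, load_assets_from_package_module

from . import assets  # the project's assets/ package

defs = Definitions(
    # Picks up every asset defined anywhere in the package, so newly
    # added assets are included without touching this file.
    assets=load_assets_from_package_module(assets),
)
```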
In this example, we put sensors and schedules together in the sensors folder. When we build sensors, they are considered policies for when to trigger a particular job. Keeping all the policies together helps us understand what's available when creating jobs.
Note: Certain sensors, like run status sensors, can listen to multiple jobs and do not trigger a job. We recommend keeping these sensors in the definition as they are often for alerting and monitoring at the code location level.
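As a sketch of the trigger-policy pattern, here is a sensor bound to a job; the job, directory path, and sensor name are hypothetical:

```python
import os

from dagster import RunRequest, SkipReason, sensor

from ..jobs import hourly_refresh_job  # hypothetical job defined in jobs/


@sensor(job=hourly_refresh_job)
def new_file_sensor(context):
    # Poll a landing directory and request one run per new file;
    # run_key de-duplicates files Dagster has already seen.
    path = "/tmp/incoming"  # hypothetical landing directory
    filenames = os.listdir(path) if os.path.isdir(path) else []
    if not filenames:
        yield SkipReason("No new files found.")
        return
    for filename in filenames:
        yield RunRequest(run_key=filename)
```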
Make resources reusable and share them across jobs or asset groups.
In this example, we grouped resources (e.g., database connections, Spark sessions, API clients, and I/O managers) in the resources folder, where they are bound to configuration sets that vary based on the environment.
In complex projects, we find it helpful to make resources reusable and configured with pre-defined values via configured. This approach allows your teammates to use a pre-defined resource set or make changes to shared resources, thus enabling more efficient project development.
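A minimal sketch of this pattern, assuming a hypothetical database resource:

```python
from dagster import resource


@resource(config_schema={"connection_url": str})
def database_resource(init_context):
    # Hypothetical: return a lightweight client keyed by the configured URL.
    return {"connection_url": init_context.resource_config["connection_url"]}


# Bind environment-specific values once via .configured() so jobs and
# asset groups can share the same underlying resource definition.
local_database = database_resource.configured(
    {"connection_url": "postgresql://localhost:5432/dev"}
)
prod_database = database_resource.configured(
    {"connection_url": "postgresql://prod-host:5432/warehouse"}
)
```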
This pattern also helps you easily execute jobs in different environments without code changes. In this example, we dynamically defined a code location based on the deployment in __init__.py and can keep all code the same across testing, local development, staging, and production. Read more about our recommendations in the Transitioning data pipelines from Development to Production guide.
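A sketch of what that __init__.py might look like; the environment variable and resource dict names are assumptions:

```python
import os

from dagster import Definitions, load_assets_from_package_module

from . import assets
from .resources import (  # hypothetical per-environment resource dicts
    RESOURCES_LOCAL,
    RESOURCES_PROD,
    RESOURCES_STAGING,
)

resources_by_deployment_name = {
    "prod": RESOURCES_PROD,
    "staging": RESOURCES_STAGING,
    "local": RESOURCES_LOCAL,
}

# The deployment sets this environment variable; all code stays the same.
deployment_name = os.environ.get("DAGSTER_DEPLOYMENT", "local")

defs = Definitions(
    assets=load_assets_from_package_module(assets),
    resources=resources_by_deployment_name[deployment_name],
)
```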
This project does not include ops or graphs; if it did, here is how we would recommend structuring them.
We recommend having a jobs folder rather than a jobs.py file in this situation. Depending on the types of jobs you have, you can create a separate file for each type of job.
We recommend defining the ops and graphs that make up a job in the same file as the job definition.
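For example, a single file such as jobs/etl.py might hold everything together (op and job names hypothetical):

```python
from dagster import job, op


@op
def extract_rows():
    # Hypothetical extraction step.
    return [1, 2, 3]


@op
def load_rows(rows):
    # Hypothetical load step.
    print(f"loaded {len(rows)} rows")


@job
def etl_job():
    # The job definition lives next to the ops that compose it.
    load_rows(extract_rows())
```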
So far, we've discussed our recommendations for structuring a large project which contains only one code location. Dagster also allows you to structure a project with multiple code locations. We don't recommend over-abstracting too early; in most cases, one code location should be sufficient. A helpful pattern uses multiple code locations to separate conflicting dependencies, where each code location has its own package requirements (e.g., setup.py) and deployment specs (e.g., Dockerfile).
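As a sketch, a workspace.yaml can point Dagster at multiple code locations, each loaded from its own Python environment (package and path names hypothetical):

```yaml
load_from:
  - python_package:
      package_name: team_a_pipelines
      executable_path: team_a_venv/bin/python
  - python_package:
      package_name: team_b_pipelines
      executable_path: team_b_venv/bin/python
```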
We recommend setting up a separate test folder structure that mirrors the main project (e.g., a folder for test assets with any applicable subfolders) and contains the unit tests for each component of the data pipeline.
Each of the components in Dagster, such as assets, sensors, and resources, can be tested separately. Refer to the Testing in Dagster documentation for more info.
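For instance, an asset defined with @asset can be invoked like a plain Python function in a unit test; this toy asset is hypothetical:

```python
from dagster import asset


@asset
def doubled_numbers():
    # Hypothetical asset used to illustrate direct invocation in tests.
    return [n * 2 for n in [1, 2, 3]]


def test_doubled_numbers():
    # An @asset-decorated function without resources or upstream
    # dependencies can be called directly.
    assert doubled_numbers() == [2, 4, 6]
```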