Dagster assets and ops can perform computations using Spark.
Running computations on Spark presents unique challenges because, unlike most other computations, Spark jobs typically execute on infrastructure that's specialized for Spark: clusters of networked workers that Spark can distribute computations across. Spark applications are typically not containerized or executed on Kubernetes. Instead, running Spark code often requires submitting it to a Databricks or AWS EMR cluster.
Dagster Pipes is our toolkit for orchestrating remote compute from Dagster. It allows you to run code outside of the Dagster process and stream logs and events back to Dagster. This is the recommended approach for running Spark jobs.
With Pipes, the code inside the asset or op definition submits a Spark job to an external system like Databricks or AWS EMR, usually pointing to a JAR or a zip of Python files that contains the actual Spark data transformations and actions.
You can either use one of the available Pipes clients or make your own. Pipes clients are available for popular Spark providers such as Databricks and AWS EMR.
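For example, an asset might use the PipesDatabricksClient from dagster-databricks to submit a PySpark script as a Databricks job. The sketch below is illustrative, not a definitive implementation: the cluster settings, the DBFS path to the script, and the environment variables holding workspace credentials are all placeholders.

```python
import os

from dagster import AssetExecutionContext, Definitions, asset
from dagster_databricks import PipesDatabricksClient
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs


@asset
def databricks_spark_asset(
    context: AssetExecutionContext, pipes_databricks: PipesDatabricksClient
):
    # Describe the Databricks job to submit. The cluster settings and the
    # path to the PySpark script are placeholders.
    task = jobs.SubmitTask.from_dict(
        {
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 1,
            },
            "libraries": [
                # Make dagster-pipes importable inside the job so it can
                # stream events back to Dagster (optional for existing jobs).
                {"pypi": {"package": "dagster-pipes"}},
            ],
            "task_key": "spark-transformations",
            "spark_python_task": {
                "python_file": "dbfs:/scripts/my_spark_script.py",
            },
        }
    )
    # Submit the job, wait for it to finish, and stream logs and any
    # Pipes events back to Dagster.
    return pipes_databricks.run(task=task, context=context).get_materialize_result()


defs = Definitions(
    assets=[databricks_spark_asset],
    resources={
        "pipes_databricks": PipesDatabricksClient(
            client=WorkspaceClient(
                host=os.environ["DATABRICKS_HOST"],
                token=os.environ["DATABRICKS_TOKEN"],
            )
        )
    },
)
```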
Existing Spark jobs can be used with Pipes without any modifications. In this case, Dagster will receive logs from the job, but not events such as asset checks or attached metadata.
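For example, an unmodified job that is normally launched with spark-submit could be wrapped in an asset using Dagster's generic PipesSubprocessClient. This is a sketch that assumes spark-submit and the job script are available on the machine where Dagster runs; the command and paths are placeholders.

```python
from dagster import AssetExecutionContext, Definitions, PipesSubprocessClient, asset


@asset
def existing_spark_job(
    context: AssetExecutionContext, pipes_subprocess_client: PipesSubprocessClient
):
    # Launch the existing, unmodified job. Dagster forwards the process's
    # stdout/stderr as logs; because the job doesn't use dagster_pipes, it
    # won't emit events like asset checks or metadata.
    return pipes_subprocess_client.run(
        command=["spark-submit", "--master", "yarn", "/path/to/existing_job.py"],
        context=context,
    ).get_materialize_result()


defs = Definitions(
    assets=[existing_spark_job],
    resources={"pipes_subprocess_client": PipesSubprocessClient()},
)
```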
Additionally, it's possible to send events to Dagster from the job by using the dagster_pipes module. This requires minimal code changes on the job side.
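As a sketch of what those changes look like, a PySpark script can open a Pipes connection and report logs and metadata back to Dagster. The example below assumes the default environment-variable-based Pipes setup (as used on EMR, for instance); Databricks jobs also need the DBFS-based context loader and message writer from dagster_pipes. The paths and metadata keys are illustrative.

```python
from dagster_pipes import open_dagster_pipes
from pyspark.sql import SparkSession

# Open a Pipes connection: Dagster passes context (asset key, partition, etc.)
# to the job, and the job streams logs and events back to Dagster.
with open_dagster_pipes() as pipes:
    spark = SparkSession.builder.getOrCreate()

    # Placeholder transformation: read, deduplicate, and write a dataset.
    df = spark.read.parquet("s3://my-bucket/raw/events")
    cleaned = df.dropDuplicates()
    cleaned.write.mode("overwrite").parquet("s3://my-bucket/clean/events")

    # Stream a log line and structured metadata back to Dagster.
    pipes.log.info("Finished writing cleaned events")
    pipes.report_asset_materialization(metadata={"row_count": cleaned.count()})
```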
This approach also works for Spark jobs written in Java or Scala, although we don't have Pipes implementations for emitting events from those languages yet.
Previously, Step Launchers were the recommended way to run Spark jobs from Dagster. They have been superseded by Dagster Pipes. A migration guide can be found here, and you can learn more in this GitHub discussion.