Table metadata
Table metadata provides additional context about a tabular asset, such as its schema, row count, and more. This metadata can be used to improve collaboration, debugging, and data quality in your data platform.
Dagster supports attaching different types of table metadata to assets, including:
- Column schema, which describes the structure of the table, including column names and types
- Row count, which describes the number of rows in a materialized table
- Column-level lineage, which describes how a column is created and used by other assets
Attaching column schema
For assets defined in Dagster
Column schema metadata can be attached to Dagster assets either as definition metadata or runtime metadata. Either way, the schema is then visible in the Dagster UI.
If the schema of your asset is pre-defined, you can attach it as definition metadata. If the schema is only known when an asset is materialized, you can attach it as metadata to the materialization.
To attach schema metadata to an asset, you will need to:
- Construct a TableSchema object with TableColumn entries describing each column in the table
- Attach the TableSchema object to the asset as part of the metadata parameter under the dagster/column_schema key. This can be attached to your asset definition, or to the MaterializeResult object returned by the asset function.
Below are two examples of how to attach column schema metadata to an asset, one as definition metadata and one as runtime metadata:
from dagster import AssetKey, MaterializeResult, TableColumn, TableSchema, asset


# Definition metadata
# Here, we know the schema of the asset, so we can attach it to the asset decorator
@asset(
    deps=[AssetKey("source_bar"), AssetKey("source_baz")],
    metadata={
        "dagster/column_schema": TableSchema(
            columns=[
                TableColumn(
                    "name",
                    "string",
                    description="The name of the person",
                ),
                TableColumn(
                    "age",
                    "int",
                    description="The age of the person",
                ),
            ]
        )
    },
)
def my_asset(): ...


# Materialization metadata
# Here, the schema isn't known until runtime
@asset(deps=[AssetKey("source_bar"), AssetKey("source_baz")])
def my_other_asset():
    column_names = ...
    column_types = ...

    columns = [
        TableColumn(name, column_type)
        for name, column_type in zip(column_names, column_types)
    ]

    yield MaterializeResult(
        metadata={"dagster/column_schema": TableSchema(columns=columns)}
    )
The schema for my_asset will be visible in the Dagster UI.
For assets loaded from integrations
Dagster's dbt integration enables automatically attaching column schema metadata to assets loaded from dbt models. For more information, see the dbt documentation.
Attaching row count
Row count metadata can be attached to Dagster assets as runtime metadata to provide additional context about the number of rows in a materialized table. The Dagster UI highlights this value.
In addition to showing the latest row count, Dagster will let you track changes in the row count over time, and you can use this information to monitor data quality.
To attach row count metadata to an asset, you will need to attach a numerical value to the dagster/row_count key in the metadata parameter of the MaterializeResult object returned by the asset function. For example:
import pandas as pd
from dagster import AssetKey, MaterializeResult, asset


@asset(deps=[AssetKey("source_bar"), AssetKey("source_baz")])
def my_asset():
    my_df: pd.DataFrame = ...

    # 374 is a placeholder; in practice, compute the value from the data,
    # for example with len(my_df)
    yield MaterializeResult(metadata={"dagster/row_count": 374})
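To illustrate how a tracked row count might feed a data quality check, here is a minimal sketch in plain Python. It flags a materialization whose row count differs from the previous one by more than a chosen fraction. The function name, the 50% default threshold, and the sample counts are all assumptions made for illustration; this is not a Dagster API.

```python
def row_count_changed_too_much(
    previous: int, current: int, max_relative_change: float = 0.5
) -> bool:
    """Return True if the row count moved by more than max_relative_change.

    Hypothetical helper: the threshold is an arbitrary illustrative default.
    """
    if previous == 0:
        # Relative change is undefined; treat any rows appearing as a change.
        return current != 0
    return abs(current - previous) / previous > max_relative_change


# Sample counts (made up) for two consecutive materializations.
small_drift = row_count_changed_too_much(previous=374, current=380)
large_drop = row_count_changed_too_much(previous=374, current=100)
```

A relative threshold is used here rather than an absolute one so that the same setting behaves sensibly for both small and large tables.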
Attaching column-level lineage
Column lineage enables data and analytics engineers alike to understand how a column is created and used in your data platform. For more information, see the column-level lineage documentation.
Ensuring table schema consistency
When column schemas are defined at runtime through runtime metadata, it can be helpful to detect and alert on schema changes between materializations. Dagster provides the build_column_schema_change_checks API to help detect these changes.
This function creates asset checks which compare the current materialization's schema against the schema from the previous materialization. These checks can detect:
- Added columns
- Removed columns
- Changed column types
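To make the three kinds of change concrete, the sketch below compares two schemas represented as plain name-to-type mappings. It is a conceptual illustration only, written without Dagster, and makes no claim about how the checks are implemented internally. The function name diff_schemas and the sample schemas are hypothetical.

```python
def diff_schemas(
    previous: dict[str, str], current: dict[str, str]
) -> dict[str, list[str]]:
    """Compare two {column name: column type} mappings.

    Hypothetical helper for illustration; not Dagster's implementation.
    """
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    type_changed = sorted(
        name
        for name in previous.keys() & current.keys()
        if previous[name] != current[name]
    )
    return {"added": added, "removed": removed, "type_changed": type_changed}


# Sample schemas (made up) from two consecutive materializations.
previous_schema = {"name": "string", "age": "int"}
current_schema = {"name": "string", "age": "float", "email": "string"}
changes = diff_schemas(previous_schema, current_schema)
# changes == {"added": ["email"], "removed": [], "type_changed": ["age"]}
```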
Let's define a column schema change check for my_other_asset, the asset from the example above that defines its table schema at runtime.
from dagster import build_column_schema_change_checks
schema_checks = build_column_schema_change_checks(assets=[my_other_asset])