Databricks configurations

Configuring tables

When materializing a model as table, you may include several optional configs that are specific to the dbt-databricks plugin, in addition to the standard model configs.

The `append` strategy

Following the append strategy, dbt will perform an insert into statement with all new data. The appeal of this strategy is that it is straightforward and functional across all platforms, file types, connection methods, and Apache Spark versions. However, this strategy cannot update, overwrite, or delete existing data, so it is likely to insert duplicate records for many data sources.

Source code
Run code

databricks_incremental.sql

{{ config(
    materialized='incremental',
    incremental_strategy='append',
) }}

--  All rows returned by this query will be appended to the existing table

select * from {{ ref('events') }}
{% if is_incremental() %}
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}

databricks_incremental.sql

create temporary view databricks_incremental__dbt_tmp as

    select * from analytics.events

    where event_ts >= (select max(event_ts) from {{ this }})

;

insert into table analytics.databricks_incremental
    select `date_day`, `users` from databricks_incremental__dbt_tmp

The `insert_overwrite` strategy

caution

This strategy is currently only compatible with All Purpose Clusters, not SQL Warehouses.

This strategy is most effective when specified alongside a partition_by clause in your model config. dbt will run an atomic insert overwrite statement that dynamically replaces all partitions included in your query. Be sure to re-select all of the relevant data for a partition when using this incremental strategy.

If no partition_by is specified, then the insert_overwrite strategy will atomically replace all contents of the table, overriding all existing data with only the new records. The column schema of the table remains the same, however. This can be desirable in some limited circumstances, since it minimizes downtime while the table contents are overwritten. The operation is comparable to running truncate + insert on other databases. For atomic replacement of Delta-formatted tables, use the table materialization (which runs create or replace) instead.

Source code
Run code

databricks_incremental.sql

{{ config(
    materialized='incremental',
    partition_by=['date_day'],
    file_format='parquet'
) }}

/*
  Every partition returned by this query will be overwritten
  when this model runs
*/

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    date_day,
    count(*) as users

from new_events
group by 1

databricks_incremental.sql

create temporary view databricks_incremental__dbt_tmp as

    with new_events as (

        select * from analytics.events

        where date_day >= date_add(current_date, -1)

    )

    select
        date_day,
        count(*) as users

    from events
    group by 1

;

insert overwrite table analytics.databricks_incremental
    partition (date_day)
    select `date_day`, `users` from databricks_incremental__dbt_tmp

The `merge` strategy

The merge incremental strategy requires:

file_format: delta or hudi
Databricks Runtime 5.1 and above for delta file format
Apache Spark for hudi file format

The Databricks adapter will run an atomic merge statement similar to the default merge behavior on Snowflake and BigQuery. If a unique_key is specified (recommended), dbt will update old records with values from new records that match on the key column. If a unique_key is not specified, dbt will forgo match criteria and simply insert all new records (similar to append strategy).

Specifying merge as the incremental strategy is optional since it's the default strategy used when none is specified.

Source code
Run code

merge_incremental.sql

{{ config(
    materialized='incremental',
    file_format='delta', # or 'hudi'
    unique_key='user_id',
    incremental_strategy='merge'
) }}

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    user_id,
    max(date_day) as last_seen

from events
group by 1

target/run/merge_incremental.sql

create temporary view merge_incremental__dbt_tmp as

    with new_events as (

        select * from analytics.events


        where date_day >= date_add(current_date, -1)


    )

    select
        user_id,
        max(date_day) as last_seen

    from events
    group by 1

;

merge into analytics.merge_incremental as DBT_INTERNAL_DEST
    using merge_incremental__dbt_tmp as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_SOURCE.user_id = DBT_INTERNAL_DEST.user_id
    when matched then update set *
    when not matched then insert *

The `replace_where` strategy

The replace_where incremental strategy requires:

file_format: delta
Databricks Runtime 12.0 and above

dbt will run an atomic replace where statement which selectively overwrites data matching one or more incremental_predicates specified as a string or array. Only rows matching the predicates will be inserted. If no incremental_predicates are specified, dbt will perform an atomic insert, as with append.

caution

replace_where inserts data into columns in the order provided, rather than by column name. If you reorder columns and the data is compatible with the existing schema, you may silently insert values into an unexpected column. If the incoming data is incompatible with the existing schema, you will instead receive an error.

Source code
Run code

replace_where_incremental.sql

{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy = 'replace_where'
    incremental_predicates = 'user_id >= 10000' # Never replace users with ids < 10000
) }}

with new_events as (

    select * from {{ ref('events') }}

    {% if is_incremental() %}
    where date_day >= date_add(current_date, -1)
    {% endif %}

)

select
    user_id,
    max(date_day) as last_seen

from events
group by 1

target/run/replace_where_incremental.sql

create temporary view replace_where__dbt_tmp as

    with new_events as (

        select * from analytics.events

        where date_day >= date_add(current_date, -1)

    )

    select
        user_id,
        max(date_day) as last_seen

    from events
    group by 1

;

insert into analytics.replace_where_incremental
    replace where user_id >= 10000
    table `replace_where__dbt_tmp`

Selecting compute per model

Beginning in version 1.7.2, you can assign which compute resource to use on a per-model basis. For SQL models, you can select a SQL Warehouse (serverless or provisioned) or an all purpose cluster. For details on how this feature interacts with python models, see Specifying compute for Python models.

note

This is an optional setting. If you do not configure this as shown below, we will default to the compute specified by http_path in the top level of the output section in your profile. This is also the compute that will be used for tasks not associated with a particular model, such as gathering metadata for all tables in a schema.

To take advantage of this capability, you will need to add compute blocks to your profile:

profile.yml

profile-name:
  target: target-name # this is the default target
  outputs:
    target-name:
      type: databricks
      catalog: optional catalog name if you are using Unity Catalog
      schema: schema name # Required        
      host: yourorg.databrickshost.com # Required

      ### This path is used as the default compute
      http_path: /sql/your/http/path # Required        
      
      ### New compute section
      compute:

        ### Name that you will use to refer to an alternate compute
       Compute1:
          http_path: '/sql/your/http/path' # Required of each alternate compute

        ### A third named compute, use whatever name you like
        Compute2:
          http_path: '/some/other/path' # Required of each alternate compute
      ...

    target-name: # additional targets
      ...
      ### For each target, you need to define the same compute,
      ### but you can specify different paths
      compute:

        ### Name that you will use to refer to an alternate compute
        Compute1:
          http_path: '/sql/your/http/path' # Required of each alternate compute

        ### A third named compute, use whatever name you like
        Compute2:
          http_path: '/some/other/path' # Required of each alternate compute
      ...

The new compute section is a map of user chosen names to objects with an http_path property. Each compute is keyed by a name which is used in the model definition/configuration to indicate which compute you wish to use for that model/selection of models. We recommend choosing a name that is easily recognized as the compute resources you're using, such as the name of the compute resource inside the Databricks UI.

note

You need to use the same set of names for compute across your outputs, though you may supply different http_paths, allowing you to use different computes in different deployment scenarios.

To configure this inside of dbt Cloud, use the extended attributes feature on the desired environments:

compute:
  Compute1:
    http_path: /SOME/OTHER/PATH
  Compute2:
    http_path: /SOME/OTHER/PATH

Specifying the compute for models

As with many other configuaration options, you can specify the compute for a model in multiple ways, using databricks_compute. In your dbt_project.yml, the selected compute can be specified for all the models in a given directory:

dbt_project.yml

...

models:
  +databricks_compute: "Compute1"     # use the `Compute1` warehouse/cluster for all models in the project...
  my_project:
    clickstream:
      +databricks_compute: "Compute2" # ...except for the models in the `clickstream` folder, which will use `Compute2`.

snapshots:
  +databricks_compute: "Compute1"     # all Snapshot models are configured to use `Compute1`.

For an individual model the compute can be specified in the model config in your schema file.

schema.yml

models:
  - name: table_model
    config:
      databricks_compute: Compute1
    columns:
      - name: id
        data_type: int

Alternatively the warehouse can be specified in the config block of a model's SQL file.

model.sql

{{
  config(
    materialized='table',
    databricks_compute='Compute1'
  )
}}
select * from {{ ref('seed') }}

To validate that the specified compute is being used, look for lines in your dbt.log like:

Databricks adapter ... using default compute resource.

Databricks adapter ... using compute resource <name of compute>.

Specifying compute for Python models

Materializing a python model requires execution of SQL as well as python. Specifically, if your python model is incremental, the current execution pattern involves executing python to create a staging table that is then merged into your target table using SQL.

When you specify your databricks_compute for a python model, you are currently only specifying which compute to use when running the model-specific SQL. If you wish to use a different compute for executing the python itself, you must specify an alternate compute in the config for the model. For example:

model.py

def model(dbt, session):
   dbt.config(
     http_path="sql/protocolv1/..."
   )

If your default compute is a SQL Warehouse, you will need to specify an all purpose cluster http_path in this way.

Persisting model descriptions

Relation-level docs persistence is supported in dbt v0.17.0. For more information on configuring docs persistence, see the docs.

When the persist_docs option is configured appropriately, you'll be able to see model descriptions in the Comment field of describe [table] extended or show table extended in [database] like '*'.

Default file format configurations

To access advanced incremental strategies features, such as snapshots and the merge incremental strategy, you will want to use the Delta or Hudi file format as the default file format when materializing models as tables.

It's quite convenient to do this by setting a top-level configuration in your project file:

dbt_project.yml

models:
  +file_format: delta # or hudi
  
seeds:
  +file_format: delta # or hudi
  
snapshots:
  +file_format: delta # or hudi

Setting table properties

Table properties can be set with your configuration for tables or views using tblproperties:

with_table_properties.sql

{{ config(
    tblproperties={
      'delta.autoOptimize.optimizeWrite' : 'true',
      'delta.autoOptimize.autoCompact' : 'true'
    }
 ) }}

caution

These properties are sent directly to Databricks without validation in dbt, so be thoughtful with how you use this feature. You will need to do a full refresh of incremental materializations if you change their tblproperties.

One application of this feature is making delta tables compatible with iceberg readers using the Universal Format.

{{ config(
    tblproperties={
      'delta.enableIcebergCompatV2' = 'true'
      'delta.universalFormat.enabledFormats' = 'iceberg'
    }
 ) }}

tblproperties can be specified for python models, but they will be applied via an ALTER statement after table creation. This is due to a limitation in PySpark.

Configuring tables​

The append strategy​

The insert_overwrite strategy​

The merge strategy​

The replace_where strategy​

Selecting compute per model​

Specifying the compute for models​

Specifying compute for Python models​

Persisting model descriptions​

Default file format configurations​

Setting table properties​

Configuring tables

The `append` strategy

The `insert_overwrite` strategy

The `merge` strategy

The `replace_where` strategy

Selecting compute per model

Specifying the compute for models

Specifying compute for Python models

Persisting model descriptions

Default file format configurations

Setting table properties