DBT - Using SELECT * in the staging layer

30 views Asked by At

I have encountered this pattern of selecting all columns using * in DBT, when bringing the source data into the staging layer.

You can find it in the dbt docs about the staging layer but in this dbt course.

The pattern goes like this:

with

source as (

    select * from {{ source('stripe','payment') }}

),

transformed as (

    select
        id as payment_id,
        orderid as order_id,
    from source
)
select * from transformed

As far as I am concerned selecting * from any source is a bad idea. We are most likely selecting data that we do not need which can incur unnecessary costs but also slow down our queries. We most likely do not have control over the source data so we have absolutely no transparency of what we are selecting.

Does anyone see something wrong with my logic? I see how code looks very clean but how can this be a good idea from a cost/performance point of view?

1

There are 1 answers

1
Matt On

With no control over source table, you have sort of hit on one of a few use cases to actually use the SELECT * pattern. However, the answer of what to do is really determined by your answer to how will you resolve your broken ETL if the source is changed.

If it is critical, can't go back in time to get past data, or something to that nature then SELECT * is appropriate but you wouldn't apply the transformation there instead you would dump to a staging location and then transform off the staging location. That way if columns change or data integrity is broken etc. you still have the data, can fix your transformation, and not lose data.

But if you don't care can just fix your DBT and repull the data then simply go with somethig like:

select id as payment_id, orderid as order_id from {{ source('stripe','payment') }}

Which as you mention is likely more performant under the circumstances.