Why self-reference a column from the same table in a SQL subquery?

72 views Asked by At

I'm working on subqueries in the Google Data Analytics Course. There are many places you can rent bikes from. These places are bike stations. The query is to calculate the difference between average trip length for a particular station and the total trip duration average of all the stations.

There are no JOINs or any data being pulled from a 2nd table.

What does WHERE column = alias.column achieve?

Why self-reference a column from the same table in a SQL subquery?

SELECT
    starttime,
    start_station_id,
    tripduration,
    (SELECT
        ROUND(AVG(tripduration),2)
    FROM bigquery-public-data.new_york_citibike.citibike_trips
    WHERE start_station_id = outer_trips.start_station_id
    ) AS avg_duration_for_station,
    ROUND(tripduration -
        (SELECT
            ROUND(AVG(tripduration),2)
        FROM bigquery-public-data.new_york_citibike.citibike_trips
        WHERE start_station_id = outer_trips.start_station_id
        ),2) AS difference_from_avg
FROM bigquery-public-data.new_york_citibike.citibike_trips AS outer_trips
LIMIT 25;

I expected to get the same result without having to self reference.

1

There are 1 answers

3
Zegarek On

In this demo, the query returns this:

starttime start_station_id tripduration avg_duration_for_station difference_from_avg
2024-01-15 08:23:47.883618+00 1 03:44:47.110507 03:01:27.152532 00:43:19.957975
2024-01-20 02:24:20.187858+00 0 00:51:19.64743 00:51:19.64743 00:00:00
2024-01-17 02:44:39.358079+00 1 02:18:07.194556 03:01:27.152532 -00:43:19.957976
2024-01-15 06:32:48.073332+00 2 03:52:40.097185 03:52:40.097185 00:00:00

Note that avg_duration_for_station depends on station identifier. Removing the "self-reference" removes correlation of the scalar subquery to a given row through start_station_id, leaving you with the same, global avg everywhere:

starttime start_station_id tripduration avg_duration_for_station difference_from_avg
2024-01-15 08:23:47.883618+00 1 03:44:47.110507 02:41:43.51242 01:03:03.598087
2024-01-20 02:24:20.187858+00 0 00:51:19.64743 02:41:43.51242 -01:50:23.86499
2024-01-17 02:44:39.358079+00 1 02:18:07.194556 02:41:43.51242 -00:23:36.317864
2024-01-15 06:32:48.073332+00 2 03:52:40.097185 02:41:43.51242 01:10:56.584765

A window function can have the same effect as those correlated subqueries:

SELECT
    starttime,
    start_station_id,
    tripduration,
    avg(tripduration) over (w1) AS avg_duration_for_station,
    tripduration - avg(tripduration) over (w1) AS difference_from_avg
FROM bigquery-public-data.new_york_citibike.citibike_trips AS outer_trips
WINDOW w1 AS (partition by start_station_id)
LIMIT 25;

You can also join an uncorrelated subquery to your from list:

SELECT starttime,
       start_station_id,
       tripduration,
       avg_duration_for_station,
       tripduration - avg_duration_for_station AS difference_from_avg
FROM bigquery-public-data.new_york_citibike.citibike_trips AS outer_trips
LEFT JOIN (SELECT start_station_id,
                  AVG(tripduration) AS avg_duration_for_station
           FROM bigquery-public-data.new_york_citibike.citibike_trips
           GROUP BY start_station_id
           ) AS station USING (start_station_id)
LIMIT 25;