Why self-reference a column from the same table in a SQL subquery?

Question

Why self-reference a column from the same table in a SQL subquery?

72 views Asked by Razi Syed At 19 January 2024 at 16:59

I'm working on subqueries in the Google Data Analytics Course. There are many places you can rent bikes from. These places are bike stations. The query is to calculate the difference between average trip length for a particular station and the total trip duration average of all the stations.

There are no JOINs or any data being pulled from a 2nd table.

What does WHERE column = alias.column achieve?

Why self-reference a column from the same table in a SQL subquery?

SELECT
    starttime,
    start_station_id,
    tripduration,
    (SELECT
        ROUND(AVG(tripduration),2)
    FROM bigquery-public-data.new_york_citibike.citibike_trips
    WHERE start_station_id = outer_trips.start_station_id
    ) AS avg_duration_for_station,
    ROUND(tripduration -
        (SELECT
            ROUND(AVG(tripduration),2)
        FROM bigquery-public-data.new_york_citibike.citibike_trips
        WHERE start_station_id = outer_trips.start_station_id
        ),2) AS difference_from_avg
FROM bigquery-public-data.new_york_citibike.citibike_trips AS outer_trips
LIMIT 25;

I expected to get the same result without having to self reference.

Original Q&A

There are 1 answers

**Zegarek** · Answer 1 · 2024-01-20T13:34:16+00:00

In this demo, the query returns this:

starttime	start_station_id	tripduration	avg_duration_for_station	difference_from_avg
2024-01-15 08:23:47.883618+00	1	03:44:47.110507	03:01:27.152532	00:43:19.957975
2024-01-20 02:24:20.187858+00	0	00:51:19.64743	00:51:19.64743	00:00:00
2024-01-17 02:44:39.358079+00	1	02:18:07.194556	03:01:27.152532	-00:43:19.957976
2024-01-15 06:32:48.073332+00	2	03:52:40.097185	03:52:40.097185	00:00:00

Note that avg_duration_for_station depends on station identifier. Removing the "self-reference" removes correlation of the scalar subquery to a given row through start_station_id, leaving you with the same, global avg everywhere:

starttime	start_station_id	tripduration	avg_duration_for_station	difference_from_avg
2024-01-15 08:23:47.883618+00	1	03:44:47.110507	02:41:43.51242	01:03:03.598087
2024-01-20 02:24:20.187858+00	0	00:51:19.64743	02:41:43.51242	-01:50:23.86499
2024-01-17 02:44:39.358079+00	1	02:18:07.194556	02:41:43.51242	-00:23:36.317864
2024-01-15 06:32:48.073332+00	2	03:52:40.097185	02:41:43.51242	01:10:56.584765

A window function can have the same effect as those correlated subqueries:

SELECT
    starttime,
    start_station_id,
    tripduration,
    avg(tripduration) over (w1) AS avg_duration_for_station,
    tripduration - avg(tripduration) over (w1) AS difference_from_avg
FROM bigquery-public-data.new_york_citibike.citibike_trips AS outer_trips
WINDOW w1 AS (partition by start_station_id)
LIMIT 25;

You can also join an uncorrelated subquery to your from list:

SELECT starttime,
       start_station_id,
       tripduration,
       avg_duration_for_station,
       tripduration - avg_duration_for_station AS difference_from_avg
FROM bigquery-public-data.new_york_citibike.citibike_trips AS outer_trips
LEFT JOIN (SELECT start_station_id,
                  AVG(tripduration) AS avg_duration_for_station
           FROM bigquery-public-data.new_york_citibike.citibike_trips
           GROUP BY start_station_id
           ) AS station USING (start_station_id)
LIMIT 25;

TechQA.

Why self-reference a column from the same table in a SQL subquery?

There are 1 answers

Related Questions in SQL

Related Questions in GOOGLE-BIGQUERY

Related Questions in SUBQUERY

Related Questions in SELF-JOIN

Related Questions in SELF-REFERENCE

Popular Questions

Trending Questions