I'm working on subqueries in the Google Data Analytics Course. There are many places you can rent bikes from. These places are bike stations. The query is to calculate the difference between average trip length for a particular station and the total trip duration average of all the stations.
There are no JOINs or any data being pulled from a 2nd table.
What does WHERE column = alias.column achieve?
Why self-reference a column from the same table in a SQL subquery?
SELECT
starttime,
start_station_id,
tripduration,
(SELECT
ROUND(AVG(tripduration),2)
FROM bigquery-public-data.new_york_citibike.citibike_trips
WHERE start_station_id = outer_trips.start_station_id
) AS avg_duration_for_station,
ROUND(tripduration -
(SELECT
ROUND(AVG(tripduration),2)
FROM bigquery-public-data.new_york_citibike.citibike_trips
WHERE start_station_id = outer_trips.start_station_id
),2) AS difference_from_avg
FROM bigquery-public-data.new_york_citibike.citibike_trips AS outer_trips
LIMIT 25;
I expected to get the same result without having to self reference.
In this demo, the query returns this:
Note that
avg_duration_for_stationdepends on station identifier. Removing the "self-reference" removes correlation of the scalar subquery to a given row throughstart_station_id, leaving you with the same, globalavgeverywhere:A window function can have the same effect as those correlated subqueries:
You can also
joinan uncorrelated subquery to yourfromlist: