Self-join query optimizations

101 views Asked by At

I have a table, dev_base_low which stores that have several fields but for this example we will focus on the key fields, from, to, carrier, type, v_type, goods_category. The table contains 4,000 records and are indexed on the key fields.

The objective is simple, under the column type I want to join the spokes - hub - spokes. My first attempt is to join all spokes to hubs with the following:

WITH 
    temp_hub as 
    (SELECT * FROM dev_base_low WHERE carrier_type = 'hub')

SELECT
    1
FROM 
  (select * from temp_hub where type = 'spoke-hub') leg1 
    INNER JOIN (select * from temp_hub where type = 'hub-hub') leg2 
        on  
                        leg1.to = leg2.from and
                        leg1.carrier = leg2.carrier and
                        leg1.from <> leg2.to and
                        leg1.v_type = leg2.v_type and
                        leg1.goods_category = leg2.goods_category 

The query as we can imagine runs optimally using hash joins under a 1 second, which outputs 8,000 records. EXPLAIN ANALYZE output, to my knowledge from below, indexes should not have a big difference given that it is a hash join and also because the total records are in no ways large:

QUERY PLAN
Hash Join  (cost=224.21..308.71 rows=1 width=4) (actual time=6.369..15.949 rows=8982 loops=1)
  Hash Cond: ((temp_hub.to = temp_hub_1.from) AND (temp_hub.carrier = temp_hub_1.carrier) AND (temp_hub.v_type = temp_hub_1.v_type) AND (temp_hub.goods_category = temp_hub_1.goods_category))
  Join Filter: (temp_hub.from <> temp_hub_1.to)
  Rows Removed by Join Filter: 4
  CTE temp_hub
    ->  Seq Scan on dev_base_low  (cost=0.00..139.72 rows=3738 width=155) (actual time=0.020..2.094 rows=3738 loops=1)
          Filter: (carrier_type = 'hub'::text)
  ->  CTE Scan on temp_hub  (cost=0.00..84.11 rows=19 width=160) (actual time=0.027..1.514 rows=1240 loops=1)
        Filter: (type = 'spoke-hub'::text)
        Rows Removed by Filter: 2498
  ->  Hash  (cost=84.11..84.11 rows=19 width=160) (actual time=6.313..6.314 rows=1088 loops=1)
        Buckets: 2048 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 107kB
        ->  CTE Scan on temp_hub temp_hub_1  (cost=0.00..84.11 rows=19 width=160) (actual time=0.011..5.106 rows=1088 loops=1)
              Filter: (type = 'hub-hub'::text)
              Rows Removed by Filter: 2650
Planning Time: 1.871 ms
Execution Time: 16.928 ms

The problem occurs when a second self-inner-join is added and the query times out, unfortunately, I have a hard timeout on my client-side at 20 seconds. Shown below:

SELECT
    1
FROM 
  (select * from temp_hub where type = 'spoke-hub') leg1 
    INNER JOIN (select * from temp_hub where type = 'hub-hub') leg2 
        on  
                        leg1.to = leg2.from and
                        leg1.carrier = leg2.carrier and
                        leg1.from <> leg2.to and
                        leg1.v_type = leg2.v_type and
                        leg1.goods_category = leg2.goods_category 
  INNER JOIN (select * from temp_hub where type = 'hub-spoke' ) leg3
        on  leg2.to = leg3.from and
                        leg2.carrier = leg3.carrier and
                        leg1.from <> leg3.to and
                        leg2.from <> leg3.to and
                        leg2.v_type = leg3.v_type and
                        leg2.goods_category = leg3.goods_category

I have tried several optimizations with indexing, using sub-queries and CTEs, using different join methods (hash,nested,merge) and checking DB configs on memory allocation but with no real benefits. I have estimated the total output with the 2nd inner join to be under 400,000 records.

My questions:

  1. Is there anything wrong with the query joins or methods?
  2. Are there any optimizations on troubleshooting query performance that I can run?
  3. Even given 2 tables, with 8,000 and 4,000 records respectively, is there anything I can do to ensure that the runtime remains below 20 seconds?

EDIT

So even after increasing timeout, it timesout at 2 minutes, I know there is an option to increase timeout but I guess that defeats the purpose. Added the EXPLAIN for the query below:

Nested Loop  (cost=224.16..393.19 rows=1 width=4)
"  Join Filter: ((temp_hub.""from"" <> temp_hub_1.""to"") AND (temp_hub_1.""from"" <> temp_hub_2.""to"") AND (temp_hub.""to"" = temp_hub_1.""from"") AND (temp_hub.carrier = temp_hub_1.carrier) AND (temp_hub.v_type = temp_hub_1.v_type) AND (temp_hub.goods_category = temp_hub_1.goods_category) AND (temp_hub_2.""from"" = temp_hub_1.""to""))"
  CTE temp_hub
    ->  Seq Scan on dev_base_low  (cost=0.00..139.72 rows=3738 width=155)
          Filter: (carrier_type = 'hub'::text)
  ->  Hash Join  (cost=84.44..168.84 rows=1 width=320)
        Hash Cond: ((temp_hub.carrier = temp_hub_2.carrier) AND (temp_hub.v_type = temp_hub_2.v_type) AND (temp_hub.goods_category = temp_hub_2.goods_category))
"        Join Filter: (temp_hub.""from"" <> temp_hub_2.""to"")"
        ->  CTE Scan on temp_hub  (cost=0.00..84.11 rows=19 width=160)
              Filter: (type = 'spoke-hub'::text)
        ->  Hash  (cost=84.11..84.11 rows=19 width=160)
              ->  CTE Scan on temp_hub temp_hub_2  (cost=0.00..84.11 rows=19 width=160)
                    Filter: (type = 'hub-spoke'::text)
  ->  CTE Scan on temp_hub temp_hub_1  (cost=0.00..84.11 rows=19 width=160)
        Filter: (type = 'hub-hub'::text)
1

There are 1 answers

1
BernardL On

After a small change, the query now runs under a second.

Nested Loop  (cost=0.58..1804.05 rows=43262 width=4) (actual time=0.235..926.969 rows=310362 loops=1)
"  Join Filter: ((dev_base.""from"" <> dev_base_2.""to"") AND (dev_base.carrier = dev_base_2.carrier) AND (dev_base.v_type = dev_base_2.v_type) AND (dev_base.goods_category = dev_base_2.goods_category))"
  Rows Removed by Join Filter: 2564
  ->  Nested Loop  (cost=0.29..941.32 rows=7986 width=93) (actual time=0.194..493.947 rows=23818 loops=1)
        ->  Seq Scan on dev_base  (cost=0.00..297.30 rows=2031 width=53) (actual time=0.128..4.099 rows=1778 loops=1)
              Filter: ((carrier_type = 'hub'::text) AND (type = 'spoke-hub'::text))
              Rows Removed by Filter: 5842
        ->  Memoize  (cost=0.29..0.82 rows=1 width=53) (actual time=0.039..0.269 rows=13 loops=1778)
"              Cache Key: dev_base.""from"", dev_base.""to"", dev_base.carrier, dev_base.v_type, dev_base.goods_category"
              Hits: 0  Misses: 1778  Evictions: 0  Overflows: 0  Memory Usage: 2164kB
              ->  Index Scan using idx_base_routes on dev_base dev_base_1  (cost=0.28..0.81 rows=1 width=53) (actual time=0.035..0.253 rows=13 loops=1778)
"                    Index Cond: ((""from"" = dev_base.""to"") AND (carrier = dev_base.carrier) AND (v_type = dev_base.v_type) AND (goods_category = dev_base.goods_category))"
"                    Filter: ((carrier_type = 'hub'::text) AND (type = 'hub-hub'::text) AND (dev_base.""from"" <> ""to""))"
                    Rows Removed by Filter: 36
  ->  Memoize  (cost=0.29..0.88 rows=1 width=53) (actual time=0.004..0.009 rows=13 loops=23818)
"        Cache Key: dev_base_1.""from"", dev_base_1.carrier, dev_base_1.v_type, dev_base_1.goods_category, dev_base_1.""to"""
        Hits: 22644  Misses: 1174  Evictions: 0  Overflows: 0  Memory Usage: 1070kB
        ->  Index Scan using idx_base_routes on dev_base dev_base_2  (cost=0.28..0.87 rows=1 width=53) (actual time=0.049..0.127 rows=10 loops=1174)
"              Index Cond: ((""from"" = dev_base_1.""to"") AND (carrier = dev_base_1.carrier) AND (v_type = dev_base_1.v_type) AND (goods_category = dev_base_1.goods_category))"
"              Filter: ((carrier_type = 'hub'::text) AND (type = 'hub-spoke'::text) AND (dev_base_1.""from"" <> ""to""))"
              Rows Removed by Filter: 11
Planning Time: 3.428 ms
Execution Time: 943.901 ms

In summary, indexes are VERY important during joins and in Postgres, using a CTE with the WITH statement might materialize and not have the benefits of its index. For join cases like mine, it is much more effecient if it is written as;

WITH
    temp_hub AS NOT MATERIALIZED
    (SELECT * FROM dev_base WHERE carrier_type = 'hub')

This is also recommended under docs: https://www.postgresql.org/docs/current/queries-with.html with the caveat that your WITH statement does not contain any computationally expensive operations.

The comment from @Charlieface pointed this to the right direction.