Can this query be optimized? (Choosing a random row to insert, excluding previously inserted rows)

I am using this query to randomly select a row from the users table that has not already been inserted into the task_pool table, evaluated laterally once per id in the task_dispersal table.


PostgreSQL 16.2

INSERT into task_pool(user_id, order_id)
select u.user_id, task_dispersal.id
from task_dispersal
cross join lateral (
    -- sample roughly 0.1% of users, reseeded via random() on each invocation
    select user_id
    from users tablesample bernoulli(.1) repeatable(random())
    -- exclude users already in task_pool for this order_id
    where user_id not in (
        select user_id
        from task_pool
        where order_id = task_dispersal.id
    )
    -- pick one user at random from the sample
    order by random()
    limit 1
) u

Is there anything that can be done to optimize this query?

I've added an "explain analyze verbose" query plan at the end of this post.

create table users(
    user_id serial primary key,
    irrelevant_columns varchar
);
create table task_dispersal(
    id serial primary key,
    irrelevant_columns varchar
);
create table task_pool(
    task_id serial primary key,
    order_id integer,
    user_id integer
);
users

user_id(serial key)   irrelevant_columns
1                     ~~~
...                   ...
49156                 ~~~

task_dispersal

id(serial key)   irrelevant_columns
1                ~~~
2                ~~~
...              ...
100              ~~~

task_pool

task_id(serial key)   order_id(not unique)   user_id(not unique)
1                     5                      267
2                     55                     9999
...                   ...                    ...
1                     23                     602
985187                5                      15

For example, let's say there are 100 rows in task_dispersal (therefore 100 order_ids). For each of those 100 order_ids, the query will choose a random user_id from the users table, filtering out user_ids that have already been inserted into task_pool for that order_id.
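
To make that concrete, for a single order_id (5 here is just an arbitrary example value), the "already inserted" filter I'm describing is equivalent to this check:

-- count users still eligible for one order_id (5 is an arbitrary example)
select count(*)
from users u
where not exists (
    select 1
    from task_pool tp
    where tp.order_id = 5
      and tp.user_id = u.user_id
);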

The query above works great, but it slows down as the number of rows in users and task_pool increases (expected, of course; I'd just like to minimize the delay).

Let's say I have 985187 rows in task_pool (985187 task_ids), 49156 rows in the users table (49156 user_ids), and 100 rows in task_dispersal (100 order_ids).

The query now takes ~9.5 seconds to complete.
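
One idea I've had, but haven't benchmarked: the NOT IN subquery has to filter all of task_pool down to one order_id for every row of task_dispersal, so a composite index might help (the index name here is just a placeholder):

-- lets the per-order_id lookup hit an index instead of scanning the whole table
create index task_pool_order_user_idx on task_pool (order_id, user_id);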


Here is the info with explain analyze verbose:

"Insert on public.task_pool  (cost=21150.09..2115016.50 rows=0 width=0) (actual time=9565.389..9565.390 rows=0 loops=1)"
"  ->  Nested Loop  (cost=21150.09..2115016.50 rows=100 width=28) (actual time=81.912..9562.038 rows=100 loops=1)"
"        Output: nextval('pool_task_id_seq'::regclass), task_dispersal.id, users.user_id, CURRENT_TIMESTAMP, NULL::integer, 0"
"        ->  Seq Scan on public.task_dispersal  (cost=0.00..4.00 rows=100 width=4) (actual time=0.019..0.147 rows=100 loops=1)"
"              Output: task_dispersal.id"
"        ->  Limit  (cost=21150.09..21150.10 rows=1 width=12) (actual time=95.599..95.599 rows=1 loops=100)"
"              Output: users.user_id, (random())"
"              ->  Sort  (cost=21150.09..21150.15 rows=24 width=12) (actual time=95.593..95.593 rows=1 loops=100)"
"                    Output: users.user_id, (random())"
"                    Sort Key: (random())"
"                    Sort Method: top-N heapsort  Memory: 25kB"
"                    ->  Sample Scan on public.users  (cost=19563.30..21149.97 rows=24 width=12) (actual time=93.586..95.572 rows=47 loops=100)"
"                          Output: users.user_id, random()"
"                          Sampling: bernoulli ('0.1'::real) REPEATABLE (random())"
"                          Filter: (NOT (hashed SubPlan 1))"
"                          Rows Removed by Filter: 0"
"                          SubPlan 1"
"                            ->  Seq Scan on public.task_pool task_pool_1  (cost=0.00..19560.84 rows=985 width=4) (actual time=9.673..93.218 rows=995 loops=100)"
"                                  Output: task_pool_1.user_id"
"                                  Filter: (task_pool_1.order_id = task_dispersal.id)"
"                                  Rows Removed by Filter: 984192"
"Planning Time: 0.487 ms"
"Execution Time: 9565.444 ms"

Beyond those guesses I don't know whether this can really be optimized, but I'm still pretty new to SQL, so I thought I'd ask here.
