I am using this query to randomly select a row from the users table that has not already been inserted into the task_pool table, selected once per id in the task_dispersal table via a lateral join.
PostgreSQL 16.2
INSERT into task_pool (user_id, order_id)
select u.user_id, task_dispersal.id
from task_dispersal
cross join lateral (
    select user_id
    from users tablesample bernoulli(.1) repeatable (random())
    where user_id not in (
        select user_id
        from task_pool
        where order_id = task_dispersal.id
    )
    order by random()
    limit 1
) u
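To make the mechanics clearer, here is the inner lateral subquery on its own for a single order_id (the literal 5 just stands in for one task_dispersal.id):

-- one random not-yet-assigned user for a single order_id ("5" is a placeholder)
select user_id
from users tablesample bernoulli(.1) repeatable (random())
where user_id not in (
    select user_id
    from task_pool
    where order_id = 5
)
order by random()
limit 1;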
Is there anything that can be done to optimize this query?
I've included the "explain analyze verbose" query plan at the end of this post.
create table users(
user_id serial primary key,
irellevant_columns varchar
);
create table task_dispersal(
id serial primary key,
irellevant_columns varchar
);
create table task_pool(
task_id serial primary key,
order_id integer,
user_id integer
);
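For clarity, the intended relationships between the tables (nothing is enforced with foreign keys in the DDL above; this just spells out the intent):

-- implied relationships, not declared as constraints:
-- task_pool.order_id -> task_dispersal.id
-- task_pool.user_id  -> users.user_id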
users
| user_id(serial key) | irellevant_columns |
|---|---|
| 1 | ~~~ |
| ... | ... |
| 49156 | ~~~ |
task_dispersal
| id(serial key) | irellevant_columns |
|---|---|
| 1 | ~~~ |
| 2 | ~~~ |
| ... | ... |
| 100 | ~~~ |
task_pool
| task_id(serial key) | order_id(not unique) | user_id (not unique) |
|---|---|---|
| 1 | 5 | 267 |
| 2 | 55 | 9999 |
| ... | ... | ... |
| 985186 | 23 | 602 |
| 985187 | 5 | 15 |
For example, let's say there are 100 rows in task_dispersal (therefore 100 order_ids). For each of those 100 order_ids, the query chooses a random user_id from the users table, filtering out user_ids that have already been inserted into task_pool for that order_id.
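In other words, after the insert no (order_id, user_id) pair should ever appear twice in task_pool. A quick sanity check for that invariant (just for illustration):

-- should return zero rows if the de-duplication filter works as intended
select order_id, user_id, count(*)
from task_pool
group by order_id, user_id
having count(*) > 1;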
The query above works great, but it slows down as the number of rows in users and task_pool grows (expected, of course; I'd just like to minimize the delay).
Let's say I have 985187 rows in task_pool (985187 task_ids), 49156 rows in the users table (49156 user_ids), and 100 rows in task_dispersal (100 order_ids).
The query now takes ~9.5 seconds to complete.
Here is the info with explain analyze verbose:
"Insert on public.task_pool (cost=21150.09..2115016.50 rows=0 width=0) (actual time=9565.389..9565.390 rows=0 loops=1)"
" -> Nested Loop (cost=21150.09..2115016.50 rows=100 width=28) (actual time=81.912..9562.038 rows=100 loops=1)"
" Output: nextval('pool_task_id_seq'::regclass), task_dispersal.id, users.user_id, CURRENT_TIMESTAMP, NULL::integer, 0"
" -> Seq Scan on public.task_dispersal (cost=0.00..4.00 rows=100 width=4) (actual time=0.019..0.147 rows=100 loops=1)"
" Output: task_dispersal.id"
" -> Limit (cost=21150.09..21150.10 rows=1 width=12) (actual time=95.599..95.599 rows=1 loops=100)"
" Output: users.user_id, (random())"
" -> Sort (cost=21150.09..21150.15 rows=24 width=12) (actual time=95.593..95.593 rows=1 loops=100)"
" Output: users.user_id, (random())"
" Sort Key: (random())"
" Sort Method: top-N heapsort Memory: 25kB"
" -> Sample Scan on public.users (cost=19563.30..21149.97 rows=24 width=12) (actual time=93.586..95.572 rows=47 loops=100)"
" Output: users.user_id, random()"
" Sampling: bernoulli ('0.1'::real) REPEATABLE (random())"
" Filter: (NOT (hashed SubPlan 1))"
" Rows Removed by Filter: 0"
" SubPlan 1"
" -> Seq Scan on public.task_pool task_pool_1 (cost=0.00..19560.84 rows=985 width=4) (actual time=9.673..93.218 rows=995 loops=100)"
" Output: task_pool_1.user_id"
" Filter: (task_pool_1.order_id = task_dispersal.id)"
" Rows Removed by Filter: 984192"
"Planning Time: 0.487 ms"
"Execution Time: 9565.444 ms"
I don't know whether this can really be optimized much further, but I'm still pretty new to SQL, so I thought I'd ask here.