Join columns to a table in parallel

271 views Asked by At

Currently I use a function to run an aj join for a single column from a large table on to a smaller table which has its time column shifted t milliseconds ahead, joining on a sym column as well as time. I then compute and programatically name a new column based on this joined column, before deleting the original joined column from the small table. This returns the small table with a new column based on values joined from the larger table t milliseconds ahead.

I then use an Over loop / to repeat this over a list of different delays t, recursively adding one new column for each delay in the list, passing as an argument the table so columns are added recursively.

My issue is the query, join and processing are slow on a large table. I have many cores so I would like to parallelise this operation to take advantage of all available cores, as well as optimising the steps taken to add the new columns. The large table is partitioned on disk by date and sym.

[Edit:] Here is an example of what I have at the moment.

smallT: ([] sym: (20#`AAPL),(20#`MSFT); time: (asc 00:00:00+20?til 100), (asc 00:00:00+20?til 100));

bigT: ([] sym: (100#`AAPL),(100#`MSFT); time: (asc 00:00:00+til 100), (asc 00:00:00+til 100); price: (til 100),(til 100));

delays: 00:00:00 + (7 * til 5);

foo: ([bigTab; smallTab2; delays]

    smallTab2: aj[ `sym`time; `sym`time xasc select from (update time:time+delays from smallTab2); `sym`time xasc select sym, time, future pricesprice from bigTabl;

    smallTab2: ![smallTab2; (); 0b; enlist[$"colnametime_", string(`int$delays)] ! enlist(%;`future_price;100)];

    delete future_price from smallTab2

}[bigT];

smallT:foo/[select from smallT; delays];

smallT

I am relatively new to q and kdb so verbose explanations of how and why a solution works with working code on a toy example would be very greatly appreciated.

1

There are 1 answers

2
rianoc On BEST ANSWER

Your function is repeatedly sorting the tables inside the loop which will slow you down.

Also as noted in documentation attributes should be applied on the tables which will greatly improve performance. https://code.kx.com/q/ref/aj/#performance

Rather than looping through the delay offsets you can instead create the full list and only aj once. https://community.kx.com/t5/New-kdb-q-users-question-forum/How-do-you-start-thinking-in-vectors/td-p/12722

smallT: ([] sym: (20#`AAPL),(20#`MSFT); time: (asc 00:00:00+20?til 100), (asc 00:00:00+20?til 100));
bigT: ([] sym: (100#`AAPL),(100#`MSFT); time: (asc 00:00:00+til 100), (asc 00:00:00+til 100); price: (til 100),(til 100));
delays: 00:00:00 + (7 * til 5);

bigT:update `p#sym from `sym`time xasc bigT

res:raze {[x;y] update row:i, delay:(`$"delay_",string`int$y),time+y from x}[smallT] each delays
res:update `g#sym from `sym`time xasc res

res:aj[ `sym`time;res; select sym, time, price from bigT]

delete row from `sym`time xcols 0!(exec ({`$"delay_",string`int$x} each delays)#(delay!price) by row:row from res) lj 1!select row,sym,time from res
sym  time     delay_0 delay_7 delay_14 delay_21 delay_28
--------------------------------------------------------
AAPL 00:00:17 17      24      31       38       45      
AAPL 00:00:18 18      25      32       39       46      
AAPL 00:00:18 18      25      32       39       46      
AAPL 00:00:28 28      35      42       49       56      
AAPL 00:00:33 33      40      47       54       61       
...