Is there a way to speed up beam_sql magic execution?


I'd like to provide a friendly exploratory environment for Data Scientists to do some early iteration on Beam data processing pipelines.

Jupyter notebooks seem like a great environment to do this.

However, I'm running into the problem that SQL commands are very slow to execute from Jupyter.

E.g. when using beam_sql, even the simplest command, returning only a few constants, takes more than 60 seconds to execute:

%%beam_sql -o pcoll
SELECT CAST(1 AS INT) AS `id`, CAST('foo' AS VARCHAR) AS `str`, CAST(3.14 AS DOUBLE) AS `flt`

My understanding is that this is slow because a separate Java process is launched in the background for each SQL command?

Is there maybe some way to speed things up?

Or should I consider exploring Pandas Dataframes rather than SQL when I aim to provide a friendlier user experience to Python users in the context of Jupyter notebooks?
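For context, the DataFrame route I have in mind would look roughly like this (a plain-pandas sketch mirroring the constant-returning SQL cell above; the variable name is illustrative):

```python
import pandas as pd

# Pure-pandas equivalent of the beam_sql cell above: build the same
# single-row result (id, str, flt) without launching any Java service.
pcoll_df = pd.DataFrame({"id": [1], "str": ["foo"], "flt": [3.14]})
print(pcoll_df)
```

In a real pipeline this would presumably go through Beam's DataFrame API (`apache_beam.dataframe`) rather than plain pandas, but as I understand it the user-facing syntax is largely the same.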

P.S.: the timing comes from following the "Develop Apache Beam notebooks with the interactive runner" guide, so Jupyter runs on a GCP Dataflow Workbench instance.
