Is there a way to speed up beam_sql magic execution?


I'd like to provide a friendly exploratory environment for Data Scientists to do some early iteration on Beam data processing pipelines.

Jupyter notebooks seem like a great environment to do this.

However, I'm running into the problem that SQL commands are very slow to execute from Jupyter.

E.g. when using beam_sql, even the simplest command, returning only a few constants, takes more than 60 seconds to execute:

%%beam_sql -o pcoll
SELECT CAST(1 AS INT) AS `id`, CAST('foo' AS VARCHAR) AS `str`, CAST(3.14 AS DOUBLE) AS `flt`

My understanding is that this is slow because a separate Java process is launched in the background for each SQL command?

Is there maybe some way to speed things up?

Or should I consider exploring Pandas Dataframes rather than SQL when I aim to provide a friendlier user experience to Python users in the context of Jupyter notebooks?
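For context, the DataFrame route I have in mind would look roughly like this (a plain-pandas sketch mirroring the constant-returning SQL cell above; the variable name is illustrative):

```python
import pandas as pd

# Pure-pandas equivalent of the beam_sql cell above: build the same
# single-row result (id, str, flt) without launching any Java service.
pcoll_df = pd.DataFrame({"id": [1], "str": ["foo"], "flt": [3.14]})
print(pcoll_df)
```

In a real pipeline this would presumably go through Beam's DataFrame API (`apache_beam.dataframe`) rather than plain pandas, but as I understand it the user-facing syntax is largely the same.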

P.S.: the timing comes from following the "Develop Apache Beam notebooks with the interactive runner" guide, so Jupyter runs on a GCP Dataflow Workbench instance.
