I want to run df.count() on my DataFrame, but I know my total dataset size is pretty large. Does this run the risk of materializing the data back to the driver / increasing my risk of driver OOM?
Does a count() over a DataFrame materialize the data to the driver / increase a risk of OOM?
Asked by vanhooser · 711 views · 1 answer
This will not materialize your entire dataset to the driver, nor will it necessarily increase your risk of OOM. It does force evaluation of the incoming DataFrame, so if that evaluation would cause an OOM, the failure will surface at the point you call `.count()`; the `.count()` itself didn't cause it, it only made you notice it.

What `.count()` will do, however, is halt the execution of your job at the point you make the call. The count must be returned to the driver before any of the rest of your work can proceed, so it's not a particularly efficient use of Spark / distributed compute. Use `.count()` only when necessary, i.e. when making choices about partition counts or other such dynamic sizing operations.