Does scikit-learn's train_test_split function duplicate the data? In other words, if I work with a large dataset X, y, does that mean that after running something like
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2023)
my data will use twice as much memory as the original dataset? Or is there some scikit-learn (or basic Python/NumPy) magic that prevents it? (E.g., calling .to_numpy() on a DataFrame does not necessarily duplicate the data.)
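
For context, this is the kind of no-copy behaviour I have in mind (a minimal sketch; whether a view or a copy comes back depends on the dtype layout and pandas version):

import numpy as np
import pandas as pd

# A single-dtype DataFrame, where pandas can hand back a view of its underlying block.
df = pd.DataFrame(np.arange(6, dtype=float).reshape(3, 2), columns=["a", "b"])
arr = df.to_numpy(copy=False)

# Prints True here if to_numpy() returned a view (no duplication);
# it may be False for mixed dtypes or other configurations.
print(np.shares_memory(arr, df.to_numpy(copy=False)))
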
If the memory use does double, what is the best practical way around this problem? Perhaps something like
X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=2023)
so that the original X and y are no longer referenced and can be garbage-collected?
Remark
np.shares_memory(X_train, X) returns False, which suggests that the data is indeed duplicated.
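
Concretely, the check looks roughly like this (a minimal sketch with a small random array standing in for the real dataset):

import numpy as np
from sklearn.model_selection import train_test_split

# Small stand-in for the real dataset.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2023)

# Prints False: the split arrays do not share memory with the original X,
# since the row selection is done with NumPy fancy indexing, which returns copies.
print(np.shares_memory(X_train, X))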